mirror of https://gitee.com/namelin2022/ollama
Computing an attention mask for a large context and max batch is expensive (over 100ms). Models like Gemma3 that have multiple types of caches and custom attention masks need to do this four times, so it adds approximately 500ms to startup time when using a 128k context. When we are reserving the worst-case graph, we don't need the mask, only its shape, so we can skip computing it.
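A minimal sketch of the idea in Go. The names here (Context, Tensor, Empty, FromFloatSlice, buildMask, the reserve flag) are illustrative stand-ins, not the actual ollama API; the commit only establishes the technique of returning a shape-only tensor during graph reservation instead of filling mask values.

```go
// Package kvcache sketches skipping mask computation during graph reserve.
package kvcache

import "math"

// Tensor and Context are hypothetical stand-ins for the backend's ML types.
type Tensor interface{}

type Context interface {
	// Empty allocates a tensor of the given shape without filling its values.
	Empty(shape ...int) Tensor
	// FromFloatSlice builds a tensor from precomputed values.
	FromFloatSlice(values []float32, shape ...int) Tensor
}

// buildMask returns the causal attention mask for the current batch.
// When reserve is true, the caller is only reserving the worst-case graph:
// the mask's contents are never read, so a tensor of the right shape is
// enough and the O(batchSize*cacheLen) fill (>100ms at 128k context, done
// once per cache/mask type) is skipped entirely.
func buildMask(ctx Context, batchSize, cacheLen int, reserve bool) Tensor {
	if reserve {
		return ctx.Empty(cacheLen, batchSize)
	}

	mask := make([]float32, batchSize*cacheLen)
	for i := 0; i < batchSize; i++ {
		for j := 0; j < cacheLen; j++ {
			// Causal attention: a token may not attend past its own position.
			if j > cacheLen-batchSize+i {
				mask[i*cacheLen+j] = float32(math.Inf(-1))
			}
		}
	}
	return ctx.FromFloatSlice(mask, cacheLen, batchSize)
}
```

The design point: reservation only sizes buffers from tensor shapes and never reads tensor contents, so an uninitialized tensor of the same shape produces an identical worst-case graph while avoiding the expensive fill.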
Committed by Jesse Gross
1 changed file with 12 additions and 1 deletion