ollama

Commit Graph

Author	SHA1	Message	Date
Michael Yang	333e360422	model: handle multiple eos tokens (#10577) * get eos_token_id from generation_config.json * refactor * include both ids and strings in trace * comments * remove special case for gemma3 special vocab (#10743)	11 months ago
Michael Yang	54055a6dae	fix test	11 months ago
Parth Sareen	a53d744b01	llama: remove model loading for grammar (#10096)	12 months ago
Parth Sareen	42a14f7f63	sample: add error handling for empty logits (#9740)	1 year ago
Parth Sareen	108fe02165	sample: make mutations in transforms explicit (#9743) * updated minP to use early exit making use of sorted tokens	1 year ago
Parth Sareen	5c0b663969	sample: separate softmax and temperature transforms (#9732)	1 year ago
ParthSareen	4aeb67ef4c	sample: do all sorting in topK	1 year ago
ParthSareen	3ba91634c1	sample: simplify top_k=0 sorting	1 year ago
ParthSareen	1b7433b71e	sample: use container/heap for top_k	1 year ago
Parth Sareen	7e34f4fbfa	sample: add numerical stability to temperature/softmax transform (#9631)	1 year ago
Jeffrey Morgan	e093db92c4	sample: temporarily use grammars for constrained generation in new engine (#9586)	1 year ago
Parth Sareen	0682dae027	sample: improve ollama engine sampler performance (#9374) This change brings in various interface cleanups along with greatly improving the performance of the sampler. Tested with llama3.2 on local machine. Improves performance from ~70 tokens/s -> 135 tokens/s with topK(40) enabled. Without topK, performance is ~110 tokens/s.	1 year ago
Parth Sareen	c245b0406f	sample: remove transforms from greedy sampling (#9377)	1 year ago
Parth Sareen	0b7e1676eb	sample: add sampling package for new engine (#8410)	1 year ago
Michael Yang	58245413f4	next ollama runner (#7913) feat: add new Ollama engine using ggml through cgo. This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this. - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go` - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go` - `ml.Tensor` defines the interface for a tensor and tensor operations. This is the first implementation of the new engine. Follow up PRs will implement more features: - non-greedy sampling (#8410) - integration with Ollama and KV caching (#8301) - more model support (#9080) with more coming soon. Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	1 year ago

15 Commits (6dcc5dfb9c0a033e4e8dde627d55580600418fb6)