# Import a model

This guide walks through importing a GGUF, PyTorch or Safetensors model. GGUF models and select Safetensors models can be imported directly into Ollama; other models must first be converted to GGUF.

## Import GGUF

A binary GGUF file can be imported directly into Ollama through a `Modelfile`.

### Step 1: Write a `Modelfile`

Start by creating a `Modelfile`. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.

```dockerfile
FROM /path/to/file.gguf
```

(Optional) Many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:

```dockerfile
FROM ./mistral-7b-v0.1.Q4_0.gguf
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```
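
If the model also takes a system message, the template can reference `{{ .System }}` as well. A minimal sketch, assuming Mistral-style `[INST]` tags (adjust the tags to whatever format your model was trained on):

```shell
# hypothetical Modelfile: wire an optional system message and the user prompt
# into Mistral-style [INST] tags, and stop generation at the tag boundaries
cat > Modelfile <<'EOF'
FROM ./mistral-7b-v0.1.Q4_0.gguf
TEMPLATE """[INST] {{ if .System }}{{ .System }} {{ end }}{{ .Prompt }} [/INST]"""
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
EOF
```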

### Step 2: Create the Ollama model

Next, create a model from your `Modelfile`:

```shell
ollama create example -f Modelfile
```
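
To confirm the model was registered, you can list your local models and inspect the `Modelfile` Ollama stored for it (output formatting may vary by version):

```shell
# list local models, then show the Modelfile for the one just created
ollama list
ollama show --modelfile example
```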

### Step 3: Run your model

Finally, test the model with `ollama run`:

```shell
ollama run example "What is your favourite condiment?"
```

## Import Safetensors

If the model being imported is one of these architectures, it can be imported directly into Ollama through a `Modelfile`:

- LlamaForCausalLM
- MistralForCausalLM
- GemmaForCausalLM

```dockerfile
FROM /path/to/safetensors/directory
```

For architectures not directly convertible by Ollama, see llama.cpp's [guide](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize) on conversion, or follow the manual conversion steps later in this guide. After conversion, see [Import GGUF](#import-gguf).
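
As a concrete sketch, importing a local Safetensors checkout of a supported architecture might look like the following (the directory name is hypothetical; it should contain `config.json`, the tokenizer files and the `*.safetensors` weights):

```shell
# hypothetical local checkout of a Mistral-architecture model
echo "FROM ./Mistral-7B-Instruct-v0.1" > Modelfile
ollama create my-mistral -f Modelfile
ollama run my-mistral "Hello!"
```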

## Automatic Quantization

> [!NOTE]
> Automatic quantization requires v0.1.35 or higher.

Ollama is capable of quantizing FP16 or FP32 models to any of the supported quantizations with the `-q/--quantize` flag in `ollama create`. For example, given a `Modelfile` that points at FP16 weights:

```dockerfile
FROM /path/to/my/gemma/f16/model
```

```shell
$ ollama create -q Q4_K_M mymodel
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
creating new layer sha256:0853f0ad24e5865173bbf9ffcc7b0f5d56b66fd690ab1009867e45e7d2c4db0f
writing manifest
success
```

### Supported Quantizations

<details>
<summary>Legacy Quantization</summary>

- `Q4_0`
- `Q4_1`
- `Q5_0`
- `Q5_1`
- `Q8_0`

</details>

<details>
<summary>K-means Quantization</summary>

- `Q3_K_S`
- `Q3_K_M`
- `Q3_K_L`
- `Q4_K_S`
- `Q4_K_M`
- `Q5_K_S`
- `Q5_K_M`
- `Q6_K`

</details>

> [!NOTE]
> Activation-aware Weight Quantization (i.e. IQ quants) is not currently supported for automatic quantization; however, you can still import an already-quantized model into Ollama. See [Import GGUF](#import-gguf).

## Manual conversion (PyTorch & Safetensors)

> Importing from PyTorch and Safetensors by converting the weights yourself is a longer process than importing from GGUF. Improvements that make it easier are a work in progress.

### Setup

First, clone the `ollama/ollama` repo:

```shell
git clone git@github.com:ollama/ollama.git ollama
cd ollama
```

and then fetch its `llama.cpp` submodule:

```shell
git submodule init
git submodule update llm/llama.cpp
```

Next, install the Python dependencies:

```shell
python3 -m venv llm/llama.cpp/.venv
source llm/llama.cpp/.venv/bin/activate
pip install -r llm/llama.cpp/requirements.txt
```

Then build the `quantize` tool:

```shell
make -C llm/llama.cpp quantize
```

### Clone the HuggingFace repository (optional)

If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.

Install [Git LFS](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage), verify it's installed, and then clone the model's repository:

```shell
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 model
```

### Convert the model

> Note: some model architectures require using specific convert scripts. For example, Qwen models require running `convert-hf-to-gguf.py` instead of `convert.py`.

```shell
python llm/llama.cpp/convert.py ./model --outtype f16 --outfile converted.bin
```
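
For architectures that need the HF-specific script (such as the Qwen example above), the invocation is similar; the flags below are assumed to mirror `convert.py`, so check the script's `--help` first:

```shell
# hypothetical: convert a checkout with the architecture-specific script
python llm/llama.cpp/convert-hf-to-gguf.py ./model --outtype f16 --outfile converted.bin
```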

### Quantize the model

```shell
llm/llama.cpp/quantize converted.bin quantized.bin q4_0
```

### Write a `Modelfile`

Next, create a `Modelfile` for your model:

```dockerfile
FROM quantized.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```

### Create and run the Ollama model

Finally, create a model from your `Modelfile` and test it with `ollama run`:

```shell
ollama create example -f Modelfile
ollama run example "What is your favourite condiment?"
```

## Template Detection

> [!NOTE]
> Template detection requires v0.1.42 or higher.

Ollama uses model metadata, specifically `tokenizer.chat_template`, to automatically create a template appropriate for the model you're importing. For example, importing a Gemma model whose metadata includes a chat template:

```dockerfile
FROM /path/to/my/gemma/model
```

```shell
$ ollama create mymodel
transferring model data
using autodetected template gemma-instruct
creating new layer sha256:baa2a0edc27d19cc6b7537578a9a7ba1a4e3214dc185ed5ae43692b319af7b84
creating new layer sha256:ba66c3309914dbef07e5149a648fd1877f030d337a4f240d444ea335008943cb
writing manifest
success
```

Defining a template in the `Modelfile` will disable this feature, which may be useful if you want to use a different template than the autodetected one.

## Publishing your model (optional – early alpha)

Publishing models is in early alpha. If you'd like to publish your model to share with others, follow these steps:

1. Create [an account](https://ollama.com/signup)
2. Copy your Ollama public key:
   - macOS: `cat ~/.ollama/id_ed25519.pub | pbcopy`
   - Windows: `type %USERPROFILE%\.ollama\id_ed25519.pub`
   - Linux: `cat /usr/share/ollama/.ollama/id_ed25519.pub`
3. Add your public key to your [Ollama account](https://ollama.com/settings/keys)

Next, copy your model to your username's namespace:

```shell
ollama cp example <your username>/example
```

> Note: model names may only contain lowercase letters, digits, and the characters `.`, `-`, and `_`.

Then push the model:

```shell
ollama push <your username>/example
```

After publishing, your model will be available at `https://ollama.com/<your username>/example`.
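
For instance, with a hypothetical username `jdoe`, the publish flow and the consumer side look like:

```shell
# publisher
ollama cp example jdoe/example
ollama push jdoe/example

# any other machine
ollama pull jdoe/example
ollama run jdoe/example "Hello!"
```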

## Quantization reference

The quantization options are as follows (from highest to lowest level of quantization). Note: some architectures, such as Falcon, do not support K quants.

- `q2_K`
- `q3_K`
- `q3_K_S`
- `q3_K_M`
- `q3_K_L`
- `q4_0` (recommended)
- `q4_1`
- `q4_K`
- `q4_K_S`
- `q4_K_M`
- `q5_0`
- `q5_1`
- `q5_K`
- `q5_K_S`
- `q5_K_M`
- `q6_K`
- `q8_0`
- `f16`
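
For example, to pick a higher-quality (and larger) K-quant than the recommended `q4_0` during the manual quantization step above, the same `quantize` invocation can target a different level (file names carried over from that section):

```shell
# hypothetical: same converted weights, quantized to a higher-quality K-quant
llm/llama.cpp/quantize converted.bin quantized-q5_K_M.bin q5_K_M
```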