* llama: wire up builtin runner
This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build. After we fully transition
to the new Go runners more tech-debt can be removed and we can stop building
the "default" runner via make and rely on the builtin always.
* build: Make target improvements
Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.
* Support customized CPU flags for runners
This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash. If the user builds a customized set, we omit the naming
scheme and don't check for compatibility. This avoids checking
requirements at runtime, so that logic has been removed as well. This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.
* Use relative paths
If the user checks out the repo in a path that contains spaces, make gets
really confused so use relative paths for everything in-repo to avoid breakage.
* Remove payloads from main binary
* install: clean up prior libraries
This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.
* Better support for AMD multi-GPU
This resolves a number of problems related to AMD multi-GPU setups on linux.
The numeric IDs used by rocm are not the same as the numeric IDs exposed in
sysfs although the ordering is consistent. We have to count up from the first
valid gfx (major/minor/patch with non-zero values) we find starting at zero.
There are 3 different env vars for selecting GPUs, and only ROCR_VISIBLE_DEVICES
supports UUID based identification, so we should favor that one, and try
to use UUIDs if detected to avoid potential ordering bugs with numeric IDs
* ROCR_VISIBLE_DEVICES only works on linux
Use the numeric ID only HIP_VISIBLE_DEVICES on windows
This adds back a check which was lost many releases back to verify /dev/kfd permissions
which when lacking, can lead to confusing failure modes of:
"rocBLAS error: Could not initialize Tensile host: No devices found"
This implementation does not hard fail the serve command but instead will fall back to CPU
with an error log. In the future we can include this in the GPU discovery UX to show
detected but unsupported devices we discovered.
Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems. This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
This fixes a few bugs in the new sysfs discovery logic. iGPUs are now
correctly identified by their <1G VRAM reported. the sysfs IDs are off
by one compared to what HIP wants due to the CPU being reported
in amdgpu, but HIP only cares about GPUs.
This allows people who package up ollama on their own to place
the rocm dependencies in a peer directory to the ollama executable
much like our windows install flow.
The recent ROCm change partially removed idempotent
payloads, but the ggml-metal.metal file for mac was still
idempotent. This finishes switching to always extract
the payloads, and now that idempotentcy is gone, the
version directory is no longer useful.
This refines where we extract the LLM libraries to by adding a new
OLLAMA_HOME env var, that defaults to `~/.ollama` The logic was already
idempotenent, so this should speed up startups after the first time a
new release is deployed. It also cleans up after itself.
We now build only a single ROCm version (latest major) on both windows
and linux. Given the large size of ROCms tensor files, we split the
dependency out. It's bundled into the installer on windows, and a
separate download on windows. The linux install script is now smart and
detects the presence of AMD GPUs and looks to see if rocm v6 is already
present, and if not, then downloads our dependency tar file.
For Linux discovery, we now use sysfs and check each GPU against what
ROCm supports so we can degrade to CPU gracefully instead of having
llama.cpp+rocm assert/crash on us. For Windows, we now use go's windows
dynamic library loading logic to access the amdhip64.dll APIs to query
the GPU information.