Running Ollama on K8S
A popular way to run open-source Large Language Models (LLMs) is using Ollama, a sort of wrapper service around various models including Llama, Mistral, and others. For example, if you have a reasonably high-performance personal machine, you can have a Llama 3.1:8B server running locally in minutes, with no compiler or other overhead required. A benefit of using Ollama over downloading and running a particular model directly is that it standardizes a lot of the process and exposes a consistent API. (A caveat is that Ollama may apply quantization to the models for efficiency in an opaque way.)
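For instance, on a machine with Ollama already installed, the local workflow is roughly two commands (a sketch using the standard Ollama CLI; the exact model tag you want may differ):

ollama pull llama3.1:8b   # download the (quantized) 8B model weights
ollama run llama3.1:8b    # chat with it interactively in the terminal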
We can set Ollama up on K8S with a service to interact with its API. This will allow any containers elsewhere on the cluster to call the LLM's API, all in their own containerized environments.
Important note:
This post assumes you have `pod`, `deployment`, `namespace`, and `service` privileges on the cluster. However, we will make note where you can modify the procedure to work in more limited settings (e.g. just `pod` privileges, or no namespace).
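If you're not sure which of these you have, `kubectl auth can-i` is a quick way to check (assuming the cluster's RBAC is set up to answer such queries; `my-space` is the example namespace used throughout this post):

kubectl auth can-i create deployments -n my-space
kubectl auth can-i create services -n my-space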
Deploying Ollama
To deploy Ollama on Kubernetes, we will largely cop the manifest written for this purpose in the official GitHub repository (`ollama/ollama/examples/kubernetes`).
This YAML is actually two manifests, separated by `---`. (Notice we've removed the portion of the example manifest at the link above that creates a namespace: the assumption here is that you've already been given a namespace by the HPC, so trying to create one will return an error. For this post we'll pretend your namespace is humorously called `my-space`.)
- First, we set up a `deployment`, which will create a `pod`. Think of a deployment as a desired state you want some set of pods to be in, whereas the pods themselves are the units that actually run your containers. In our case, we'll create a deployment which ensures there is always a single pod running with an image ready to host Ollama models and services.
- Lastly, we set up a `service`. This is a layer attached to the pod created by the deployment (all in the same namespace), which allows other containers in the namespace to talk to the Ollama pod itself via a pre-defined, dedicated port.
ollama-gpu.yml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: my-space
spec:
strategy:
    type: Recreate # old pod is torn down before a new one starts on updates, so only one pod holds the GPU at a time
selector:
matchLabels:
name: ollama
template:
metadata:
labels:
name: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
env:
- name: PATH
value: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
- name: NVIDIA_DRIVER_CAPABILITIES
value: compute,utility
ports:
- name: http
containerPort: 11434 # this is the default ollama API port
protocol: TCP
resources:
limits:
nvidia.com/gpu: 1 # this will ensure we're running on a GPU
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: my-space
spec:
type: ClusterIP # this exposes the service at cluster level
selector:
name: ollama # this hooks the service to the deployment
ports:
- port: 80 # this is the outward facing port the service is exposed on
name: http
targetPort: http
protocol: TCP
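Before creating anything, you can optionally sanity-check the manifest with a client-side dry run, which validates the YAML without touching the cluster:

kubectl apply -f ollama-gpu.yml --dry-run=client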
We can then apply this manifest with
kubectl apply -f ollama-gpu.yml
And then let's go check what we have wrought. First, check on the deployment; you will see:
kubectl get deployments -n my-space
NAME READY UP-TO-DATE AVAILABLE AGE
ollama 1/1 1 1 32s
Notice that we have to remember to do this within our namespace, using the `-n` flag. Next, the service will show:
kubectl get services -n my-space
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ollama ClusterIP 10.108.110.54 <none> 80/TCP 3m18s
Note: If you don't have a namespace, you can just strip the `namespace` lines out of the manifest above and run it in your default namespace just fine. However, you won't get some of the nice DNS resolution for your service that we'll use later, so note the Cluster IP listed here.
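If you ever need to look that Cluster IP up again, you can pull it straight out of the service object (drop or adjust the `-n` flag to match wherever your service lives):

kubectl get service ollama -n my-space -o jsonpath='{.spec.clusterIP}'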
Lastly, try checking on pods:
kubectl get pods -n my-space
NAME READY STATUS RESTARTS AGE
ollama-7c87b867c4-bap8r 1/1 Running 0 7m20s
You will see a pod with the deployment name followed by a weird hash-looking string. This is the first pod spawned by the deployment. If you delete it with `kubectl delete pod` (typing out that exact name with the hash), the deployment will immediately spawn another one, because a deployment always maintains its desired number of replicas; the `Recreate` strategy we chose only controls how old pods are replaced during updates. There are other strategies for deployments; here's some more on this topic.
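To see this self-healing in action, delete the pod (using the hash-suffixed name from your own `kubectl get pods` output rather than the example one below) and watch a replacement appear:

kubectl delete pod ollama-7c87b867c4-bap8r -n my-space
kubectl get pods -n my-space -w   # -w watches for changes; Ctrl-C to stop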
Interacting with the pod
We can now do a proof of concept that we can interact with this `ollama` pod from anywhere on the cluster, via the `ollama` service. To do this, we'll set up a very simple pod that can only do one thing: run `curl` commands. `curl` will let us send requests to the Ollama API and play around.
So, let's just create a "curl" pod that can do just that:
curl-pod.yml
apiVersion: v1
kind: Pod
metadata:
name: curl-pod
namespace: my-space
spec:
containers:
- name: curl-container
image: curlimages/curl:latest # Lightweight image with curl pre-installed
command: [ "sh", "-c", "sleep 3600" ] # Keeps the pod alive for testing
stdin: true # To allow interactive mode
tty: true
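(As an aside: if you'd rather not write a manifest for something this disposable, a roughly equivalent pod can be spun up imperatively with `kubectl run`; a one-liner sketch, minus the interactive stdin/tty settings:)

kubectl run curl-pod -n my-space --image=curlimages/curl:latest --command -- sh -c "sleep 3600"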
Now create it with `kubectl apply -f curl-pod.yml`, wait for it to be fully created, and then hop into it in interactive mode:
kubectl exec -it curl-pod -n my-space -- /bin/sh
~ $
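(If the `exec` complains that the pod isn't ready yet, you can explicitly wait for it from your own terminal before retrying:)

kubectl wait --for=condition=Ready pod/curl-pod -n my-space --timeout=60s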
Now let's check that Ollama actually exists over there on the pod our deployment created. Our service is listening on port 80, so from within our curl pod (which I'll continue denoting with the `~ $` prompt) we can call:
~ $ curl http://ollama:80/
and we should get the pleasant response `Ollama is running`. Yay! (Note: if you don't have a dedicated namespace, then instead of `ollama` in this URL, you'll need to put the explicit cluster IP the service is running on.)
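(Under the hood, the short hostname `ollama` resolves because our curl pod lives in the same namespace as the service. From a different namespace you would use the service's fully qualified DNS name instead; assuming the default `cluster.local` cluster domain, that looks like:)

~ $ curl http://ollama.my-space.svc.cluster.local:80/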
Let's try checking on the API:
~ $ curl http://ollama:80/api/version
{"version":"0.3.10"}
Nice. So let's do something more interesting and pull an actual model.
Pull a model and make calls to the API
Recall that Ollama is just a wrapper service: we don't currently have any actual LLM on the pod, just a server and a REST API interface. We could `exec` into the pod spawned by the ollama deployment and pull a model directly from there, but it is just as convenient to do it from a distance in our little curl pod.
Still in our curl pod, let's download the smallest Llama 3.1 model (8B parameters). This takes only a minute or so, so we'll set `"stream": false` and just wait for the complete response rather than a stream of progress updates. You can check out the Ollama API documentation for all the parameters.
~ $ curl http://ollama:80/api/pull -d '{"name": "llama3.1", "stream": false}'
And after a minute or so you should see `{"status":"success"}`.
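If you want to double-check which models the Ollama pod now holds, the tags endpoint lists everything it has pulled locally:

~ $ curl http://ollama:80/api/tags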
Now we can test the model is working with a prompt/response call:
~ $ curl http://ollama:80/api/generate -d '{"model": "llama3.1", "prompt": "Hi!", "stream": false}'
which will give something like
{"model":"llama3.1","created_at":"2024-09-17T01:47:29.158282962Z","response":"It's nice to meet you. Is there something I can help you with, or would you like to chat?","done":true,"done_reason":"stop","context":[128006,882,128007,271,13347,0,128009,128006,78191,128007,271,2181,596,6555,311,3449,499,13,2209,1070,2555,358,649,1520,499,449,11,477,1053,499,1093,311,6369,30],"total_duration":547907910,"load_duration":19225765,"prompt_eval_count":12,"prompt_eval_duration":23206000,"eval_count":24,"eval_duration":463147000}
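The `generate` endpoint is good for one-off prompts; the API also exposes a `chat` endpoint that takes a list of messages for multi-turn conversations (a sketch with the same model; see the API documentation for the full set of options):

~ $ curl http://ollama:80/api/chat -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hi!"}], "stream": false}'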