LLM as a service
SHALB helm chart
One of the easiest ways to deploy an LLM is to serve a model from HuggingFace with the help of the SHALB helm chart.
The helm chart installs a text-generation-inference container, optionally accompanied by a chat-ui interface for talking to the service.
To deploy the LLM, choose a text generation model with no download restrictions and a modest footprint (for example, Mistral is a good one). Alternatively, you can request access to gated models (such as Llama) and/or choose larger ones, and adjust the GPU type so that it can handle the model.
Create the helm values file (huggingface-values.yaml) similar to the one below (at a minimum, replace the values in the first "model" block):
model:
  organization: "mistralai"
  name: "Mistral-7B-Instruct-v0.2"
  hf_user: "your_user"
  hf_token: "your_token"
persistence:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  storage: 500Gi
updateStrategy:
  type: Recreate
ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: haproxy
  hosts:
    - host: <subdomain>.nrp-nautilus.io
      paths:
        - path: /
          pathType: Prefix
  tls:
    - hosts:
        - <subdomain>.nrp-nautilus.io
resources:
  requests:
    cpu: "3"
    memory: "10Gi"
    nvidia.com/gpu: 2
  limits:
    cpu: "8"
    memory: "25Gi"
    nvidia.com/gpu: 2
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - <desired_gpu_type>
chat:
  enabled: true
  resources:
    limits:
      cpu: "2"
      memory: "5G"
    requests:
      cpu: "500m"
      memory: "512M"
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: haproxy
    hosts:
      - host: <subdomain>-chat.nrp-nautilus.io
        paths:
          - path: /
            pathType: Prefix
    tls:
      - hosts:
          - <subdomain>-chat.nrp-nautilus.io
  modelConfig: {}
  # !!! This doesn't seem to work now. You can manually add this later as described below.
  ## e.g.
  # parameters:
  #   temperature: 0.1
  #   top_p: 0.95
  #   repetition_penalty: 1.2
  #   top_k: 50
  #   truncate: 1000
  #   max_new_tokens: 1024
  # datasetName: OpenAssistant/oasst1
  # description: A good alternative to ChatGPT
  # websiteUrl: https://open-assistant.io
  # userMessageToken: ""
  # assistantMessageToken: ""
  # messageEndToken: "</s>"
  # preprompt: |
  #   Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn't let caution get too much in the way of being useful.
  #   -----
  # promptExamples:
  #   - title: Write an email from bullet list
  #     prompt: "As a restaurant owner, write a professional email to the supplier to
  #       get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
  #   - title: Code a snake game
  #     prompt: Code a basic snake game in python, give explanations for each step.
  #   - title: Assist in a task
  #     prompt: How do I make a delicious lemon cheesecake?
  # parameters:
  #   temperature: 0.9
  #   top_p: 0.95
  #   repetition_penalty: 1.2
  #   top_k: 50
mongodb:
  updateStrategy:
    type: Recreate
  resources:
    limits:
      cpu: "10"
      memory: "10G"
    requests:
      cpu: "1"
      memory: "1G"
Replace <subdomain> with your own subdomain. Optionally, keep the affinity block and set desired_gpu_type to the GPU model you want, or remove the whole affinity block.
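To see which GPU models are present in the cluster, you can list the nodes with the nvidia.com/gpu.product label used in the affinity block above:
kubectl get nodes -L nvidia.com/gpu.product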
Install Helm and deploy the LLM into your namespace:
Many HuggingFace repositories and models require a token to deploy or run the system, so you must set these values. Since the token is sensitive information, you can pass it to helm directly instead of including it in the yaml file, by substituting (or using environment variables for) $hf_user and $hf_token in addition to setting the $namespace variable. The token can be generated at https://huggingface.co/settings/tokens.
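For example, assuming a bash-like shell, you could set these variables before running the install (the values below are placeholders):
export namespace=<your_namespace>
export hf_user=<your_huggingface_username>
export hf_token=<your_huggingface_token>   # generated at https://huggingface.co/settings/tokens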
helm install hug -n $namespace oci://registry-1.docker.io/shalb/huggingface-model -f huggingface-values.yaml \
--set "model.hf_user=$hf_user" --set "model.hf_token=$hf_token" \
--set "extraEnvVars[0].name=HF_TOKEN" --set "extraEnvVars[0].value=$hf_token"
If you see 3 pods started in your namespace, you're almost done! The model will be downloaded and cached by the init container. Go stretch, make some tea, and give it some time to download into our persistent storage. Once the init container is done and the main one starts, give it some more time to start up, and then you can start chatting with the AI.
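To watch the progress, you can check the pods and follow the container logs (use the actual pod name shown in your namespace in place of the placeholder):
kubectl get pods -n $namespace
kubectl logs -n $namespace <model_pod_name> --all-containers=true -f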
Some models require more memory while being downloaded via git-lfs; to accommodate this, you must update the init container resources in the helm chart (https://github.com/shalb/charts):
diff --git a/huggingface-model/templates/application.yaml b/huggingface-model/templates/application.yaml
index 73b1c17..192c78a 100644
--- a/huggingface-model/templates/application.yaml
+++ b/huggingface-model/templates/application.yaml
@@ -71,7 +71,11 @@ spec:
mountPath: "/usr/src/{{ .Values.model.name }}"
resources:
requests:
- cpu: "0.5"
+ memory: 8Gi
+ cpu: "2"
+ limits:
+ memory: 8Gi
+ cpu: "2"
containers:
- name: model
image: {{ .Values.image.repo }}:{{ .Values.image.tag }}
Your chat-ui will be available at <subdomain>-chat.nrp-nautilus.io, and the API at <subdomain>.nrp-nautilus.io.
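For example, you can query the text-generation-inference API directly (a minimal sketch using the standard TGI /generate endpoint; substitute your own subdomain and prompt):
curl https://<subdomain>.nrp-nautilus.io/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Kubernetes?", "parameters": {"max_new_tokens": 128}}'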
The chat parameters for calling the model can be tuned by adding the MODELS config to the chat-ui deployment (the corresponding section in the helm chart doesn't seem to work currently):
spec:
  containers:
    - name: model
      env:
        - name: MODELS
          value: |
            [
              {
                "parameters":
                  {
                    "temperature": 0.1,
                    "top_p": 0.9,
                    "repetition_penalty": 1.2,
                    "top_k": 50,
                    "truncate": 3000,
                    "max_new_tokens": 1024,
                    "stop":
                      [
                        "<|end_of_text|>",
                        "<|eot_id|>"
                      ]
                  },
                "endpoints":
                  [
                    {
                      "type": "tgi",
                      "url": "http://hug-llama-3-8b:8080"
                    }
                  ],
                "name": "meta-llama/Llama-3-8B"
              }
            ]
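To apply this manually, edit the chat-ui deployment and add the MODELS environment variable to its container spec (the deployment name below matches the Mistral example above; adjust it to your model):
kubectl edit deployment -n <your_namespace> hug-mistral-7b-instruct-v0-2-chat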
Please scale down or purge unused deployments to free up resources for other users of the cluster. Your model will remain cached in our persistent storage, and next time the startup will be much quicker.
kubectl scale deployment -n <your_namespace> hug-mistral-7b-instruct-v0-2 hug-mistral-7b-instruct-v0-2-chat hug-mongodb --replicas=0
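To bring everything back later, scale the same deployments back up:
kubectl scale deployment -n <your_namespace> hug-mistral-7b-instruct-v0-2 hug-mistral-7b-instruct-v0-2-chat hug-mongodb --replicas=1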
H2O charts
The H2O project provides a complete toolset for running LLMs.
To run their helm chart, clone the repo and cd into the cloned folder:
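For example (assuming the chart lives in the h2oai/h2ogpt repository):
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt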
Create the values file with your desired settings. To see the defaults, run helm show values helm/h2ogpt-chart.
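You can also dump the defaults into a file and use them as a starting point:
helm show values helm/h2ogpt-chart > h2o-values.yaml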
Example of a values file to use:
h2ogpt:
  enabled: true
  stack:
    enabled: true
  storage:
    size: 128Gi
    class: rook-ceph-block
    useEphemeral: false
  # -- Example configs to use when not using Model Lock and External LLM
  overrideConfig:
    base_model: h2oai/h2ogpt-4096-llama2-7b-chat
    use_safetensors: True
    prompt_type: llama2
    save_dir: /workspace/save/
    use_gpu_id: False
    score_model: None
    max_max_new_tokens: 2048
    max_new_tokens: 1024
  extraAffinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A10
  updateStrategy:
    type: Recreate
  service:
    type: ClusterIP
  resources:
    requests:
      cpu: 1
      memory: 100Mi
    limits:
      cpu: 2
      memory: 10Gi
vllm:
  enabled: true
  storage:
    size: 512Gi
    class: rook-ceph-block
    useEphemeral: false
  containerArgs:
    - "--model"
    - h2oai/h2ogpt-4096-llama2-7b-chat
    - "--tokenizer"
    - hf-internal-testing/llama-tokenizer
    - "--tensor-parallel-size"
    - 2
    - "--seed"
    - 1234
    - "--trust-remote-code"
  updateStrategy:
    type: Recreate
  resources:
    requests:
      cpu: 1
      memory: 20Gi
      nvidia.com/gpu: 2
    limits:
      cpu: 10
      memory: 20Gi
      nvidia.com/gpu: 2
Install the helm chart:
helm install h2ogpt helm/h2ogpt-chart -f h2o-values.yaml
(substitute the name of the values file you created).
After the model starts (this takes a long time for the llama2 model used in the example), you'll first have to "load" the model in the Model tab of the chat window.
The list of models available for download can be found at https://huggingface.co/h2oai.
Once the model is started, refer to the Ingress docs to expose the service to the world.
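Before setting up an Ingress, you can test the UI locally with a port-forward (a sketch; look up the actual service name and port with the first command):
kubectl get svc -n <your_namespace>
kubectl port-forward -n <your_namespace> svc/<h2ogpt_service_name> 8080:<service_port>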