1) Install prerequisites
A) Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
B) llama.cpp via Homebrew
Get llama.cpp via brew:
brew install llama.cpp
Verify:
llama-cli --version
C) ollama via Homebrew
Ollama also ships a standalone app installer these days, but Homebrew keeps everything in one place:
brew install ollama
brew services start ollama
Verify:
ollama --version
D) Hugging Face CLI (for downloads)
python3 -m pip install -U huggingface_hub
Verify:
huggingface-cli --help | head -n 5
(Optional) login if you want gated models later:
huggingface-cli login
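Once the four tools above are installed, a quick loop confirms everything is on your PATH (a small sketch; the command names match the installs above):

```shell
# Report which of the required CLIs are reachable on PATH.
for cmd in brew llama-cli ollama huggingface-cli; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "OK       $cmd"
  else
    echo "MISSING  $cmd"
  fi
done
```

Anything reported MISSING points you back at the matching install step.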
2) Create a local workspace
mkdir -p ~/Documents/HuggingFace
mkdir -p ~/ollama
3) Download a tiny GGUF from HF
Repo: QuantFactory/SmolLM-135M-GGUF
File: SmolLM-135M.Q4_K_M.gguf
cd ~/Documents/HuggingFace
huggingface-cli download QuantFactory/SmolLM-135M-GGUF \
--local-dir . \
--max-workers 1 \
--include "SmolLM-135M.Q4_K_M.gguf"
Confirm it exists:
ls -lh ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf
4) Run it with llama-cli
llama-cli -m ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf \
-p "Write one short sentence about diagrams." \
-n 40 --temp 0.7 \
--no-display-prompt \
--repeat-penalty 1.15
If you ever see prompt echoing or repetition, the two knobs that matter most are:
- --no-display-prompt
- --repeat-penalty 1.10–1.25
5) Import the GGUF into ollama and run it
A) Create an ollama model folder and copy the GGUF
mkdir -p ~/ollama/smollm-135m
cp ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf ~/ollama/smollm-135m/
B) Create a Modelfile (plain completion)
cat > ~/ollama/smollm-135m/Modelfile <<'EOF'
FROM ./SmolLM-135M.Q4_K_M.gguf
TEMPLATE "{{ .Prompt }}"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 80
EOF
C) Create the Ollama model and run it
cd ~/ollama/smollm-135m
ollama create smollm-135m -f Modelfile
ollama run smollm-135m "Write one short sentence about diagrams."
Inspect config anytime:
ollama show smollm-135m
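The brew service from step 1 also keeps Ollama's local REST API running (default port 11434), so you can hit the model without the CLI. A minimal request, assuming the service is up:

```shell
# Minimal request to Ollama's local generate endpoint.
payload='{"model": "smollm-135m", "prompt": "Write one short sentence about diagrams.", "stream": false}'
curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "Could not reach Ollama; is the brew service running?"
```

Setting "stream": false returns one JSON object instead of a token-by-token stream.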
6) Let’s get a better model
List the GGUF files available in the Qwen2.5-1.5B-Instruct repo:
python3 - <<'EOF'
from huggingface_hub import list_repo_files
rid="Qwen/Qwen2.5-1.5B-Instruct-GGUF"
for f in list_repo_files(rid):
    if f.endswith(".gguf"):
        print(f)
EOF
On my end it looks like:
qwen2.5-1.5b-instruct-fp16.gguf
qwen2.5-1.5b-instruct-q2_k.gguf
qwen2.5-1.5b-instruct-q3_k_m.gguf
qwen2.5-1.5b-instruct-q4_0.gguf
qwen2.5-1.5b-instruct-q4_k_m.gguf
qwen2.5-1.5b-instruct-q5_0.gguf
qwen2.5-1.5b-instruct-q5_k_m.gguf
qwen2.5-1.5b-instruct-q6_k.gguf
qwen2.5-1.5b-instruct-q8_0.gguf
How to read a GGUF filename (example)
qwen2.5-1.5b-instruct-q4_k_m.gguf
| Part | Meaning |
|---|---|
| qwen2.5 | Model family / architecture |
| 1.5b | Parameter count (1.5 billion parameters) |
| instruct | Trained to follow instructions (chat / assistant style) |
| q4_k_m | Quantization method (quality vs size tradeoff) |
| .gguf | File format used by llama.cpp / Ollama |
Quantization variants explained (the part that matters most)
When looking at “q4_k_m”, think of the “4” as setting the quality:
- Higher number = better quality, bigger file, slower
- Lower number = smaller, faster, less accurate
More precisely, the number after the “q” is how many bits each weight is quantized to (q4 = 4 bits per weight). The available levels are q2, q3, q4, q5, q6, and q8.
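That bit count translates almost directly into file size: bytes ≈ parameters × bits-per-weight ÷ 8. A quick sketch of the arithmetic (the effective bits-per-weight figures below are rough approximations, since k-quants mix precisions, and real files run a bit larger because some tensors stay at higher precision):

```python
# Rough GGUF size: parameters * effective bits-per-weight / 8.
# Bits-per-weight values here are approximate (assumption), ballpark only.
BITS_PER_WEIGHT = {
    "q2_k": 2.6, "q3_k_m": 3.9, "q4_k_m": 4.8,
    "q5_k_m": 5.7, "q6_k": 6.6, "q8_0": 8.5, "fp16": 16.0,
}

def estimate_gb(params: float, quant: str) -> float:
    """Ballpark on-disk size in GB for `params` weights at `quant`."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("q4_k_m", "q8_0", "fp16"):
    print(f"1.5B at {quant}: ~{estimate_gb(1.5e9, quant):.1f} GB")
```

This is why the 1.5B Qwen model lands around 1 GB at q4_k_m but 3 GB at fp16.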
GGUF quantization cheat sheet
| Filename suffix | What it means | Quality | Size | Speed | When to choose it |
|---|---|---|---|---|---|
| fp16 | Full precision (16-bit floats) | ⭐⭐⭐⭐⭐ | 💾💾💾💾💾 | 🐢 | Debugging, benchmarking, or research only |
| q8_0 | Very high-quality quant | ⭐⭐⭐⭐½ | 💾💾💾💾 | 🐢/🐇 | Near-fp16 quality, still smaller |
| q6_k | High-quality modern quant | ⭐⭐⭐⭐ | 💾💾💾 | 🐇 | If you want max quality without fp16 |
| q5_k_m | Excellent quality | ⭐⭐⭐⭐ | 💾💾½ | 🐇 | Writing, reasoning, coding |
| q5_0 | Older style q5 | ⭐⭐⭐½ | 💾💾½ | 🐇 | Usually skip (prefer _k_m) |
| q4_k_m | Best default | ⭐⭐⭐½ | 💾💾 | 🐇🐇 | Most people, most uses |
| q4_0 | Older q4 | ⭐⭐⭐ | 💾💾 | 🐇🐇 | Works, but _k_m is better |
| q3_k_m | Aggressive compression | ⭐⭐½ | 💾 | 🐇🐇🐇 | Low-RAM machines |
| q2_k | Extreme compression | ⭐⭐ | 💾½ | 🐇🐇🐇 | Only if desperate |
Legend:
- ⭐ = output quality
- 💾 = disk / memory usage
- 🐇 = speed
The suffix after the number tells you how those bits are used:
| Variant | How the bits are used | Plain-English meaning |
|---|---|---|
| q4_0 | Same bits, used uniformly everywhere | “Compress everything equally” |
| q4_k | Same bits, grouped and scaled | “Use bits more carefully” |
| q4_k_m | Same bits, but critical parts get more precision | “Spend bits where they matter” |
What stays constant vs what changes
| Part | What it controls |
|---|---|
| q4 | How many bits are available |
| _0 / _k / _k_m | How intelligently those bits are allocated |
The number sets the budget. The suffix decides how wisely the budget is spent.
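Here is a toy illustration of why spending the budget wisely matters. It is not the real k-quant math, just the core idea: one scale factor shared by every weight (q4_0-style) versus one scale per small block (_k-style), rounded to 4-bit levels either way:

```python
import random

def quantize(weights, n_groups):
    """Toy 4-bit quantization: split weights into n_groups blocks, give
    each block its own scale, round to ~15 levels, and reconstruct."""
    out = []
    size = len(weights) // n_groups
    for g in range(n_groups):
        block = weights[g * size:(g + 1) * size]
        scale = max(abs(w) for w in block) / 7  # 4 signed bits ~ levels -7..7
        out += [round(w / scale) * scale for w in block]
    return out

random.seed(0)
# Weights whose typical magnitude drifts across the tensor.
weights = [random.gauss(0, 1 + (i // 256)) for i in range(1024)]
for groups in (1, 32):
    err = sum((w - q) ** 2 for w, q in zip(weights, quantize(weights, groups)))
    print(f"{groups:>2} scale group(s): total squared error {err:.1f}")
```

With one global scale, the biggest weights force a coarse step size on everything; per-block scales cut the error sharply with the same 4-bit budget.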
One-glance decision guide
| Your goal | Pick this |
|---|---|
| Just works, fast, good | q4_k_m |
| Better writing / reasoning | q5_k_m |
| Maximum quality without fp16 | q6_k or q8_0 |
| Debugging / evaluation | fp16 |
| Very limited memory | q3_k_m |
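If you script your downloads, the decision guide collapses to a small lookup. This hypothetical helper just restates the table above:

```python
# Map a goal to the quant suffix from the decision table above.
QUANT_FOR_GOAL = {
    "default": "q4_k_m",
    "writing": "q5_k_m",
    "max_quality": "q6_k",
    "debugging": "fp16",
    "low_memory": "q3_k_m",
}

def pick_quant(goal: str) -> str:
    """Return the recommended quantization, defaulting to q4_k_m."""
    return QUANT_FOR_GOAL.get(goal, "q4_k_m")

print(pick_quant("writing"))  # q5_k_m
```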
Let’s use the q4_k_m model
Installing Xet will help speed up some model downloads:
python3 -m pip install -U "huggingface_hub[hf_xet]"
Download the model
cd ~/Documents/HuggingFace
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--local-dir . \
--max-workers 1 \
--include "qwen2.5-1.5b-instruct-q4_k_m.gguf"
Confirm it’s there
ls -lh ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf
Try it out
llama-cli -m ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf \
-p "Explain OAuth in 3 bullet points." \
-n 200 --temp 0.7 \
--no-display-prompt \
--repeat-penalty 1.15
Import into ollama and run it
mkdir -p ~/ollama/qwen2.5-1.5b
cp ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf ~/ollama/qwen2.5-1.5b/
cat > ~/ollama/qwen2.5-1.5b/Modelfile <<'EOF'
FROM ./qwen2.5-1.5b-instruct-q4_k_m.gguf
TEMPLATE "{{ .Prompt }}"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
EOF
cd ~/ollama/qwen2.5-1.5b
ollama create qwen2.5-1.5b -f Modelfile
ollama run qwen2.5-1.5b "Explain OAuth in 5 short lines."
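The Modelfile above treats Qwen as a plain completion model. Qwen2.5-Instruct was actually trained on the ChatML format, so if its answers ramble or ignore the question, a ChatML template usually behaves better. A sketch using Ollama's TEMPLATE syntax with Qwen's ChatML markers (the system message wording is my own choice):

```shell
# Rewrite the Modelfile with a ChatML-style template for Qwen2.5-Instruct.
mkdir -p ~/ollama/qwen2.5-1.5b
cat > ~/ollama/qwen2.5-1.5b/Modelfile <<'EOF'
FROM ./qwen2.5-1.5b-instruct-q4_k_m.gguf
TEMPLATE """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER num_predict 256
EOF
```

Recreate the model afterwards with `ollama create qwen2.5-1.5b -f Modelfile` so the new template takes effect.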
Did that work for you? Congratulations! —JM
Advanced: Convert a HF model to GGUF and run in ollama
Below is a clean, from-scratch, reproducible guide with the correct chat template baked in.
End-to-End Guide: HF → GGUF → llama.cpp → Ollama (macOS)
1. Create directories (if you don’t have them already)
mkdir -p ~/Documents/HuggingFace
mkdir -p ~/Documents/HuggingFace/gguf
2. Download the HF model (original format)
cd ~/Documents/HuggingFace
mkdir -p tinyllama-1.1b-chat-v1.0
cd tinyllama-1.1b-chat-v1.0
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--local-dir . \
--max-workers 1
Sanity check:
ls | egrep -i "safetensors|tokenizer|sentencepiece|\.model"
3. Convert HF → GGUF (FP16)
The converter script lives in the llama.cpp source tree; the Homebrew package installs only the binaries. If you don’t have a clone yet:
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
python3 -m pip install -r ~/llama.cpp/requirements.txt
Then convert:
python3 ~/llama.cpp/convert_hf_to_gguf.py \
~/Documents/HuggingFace/tinyllama-1.1b-chat-v1.0 \
--outtype f16 \
--outfile ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.f16.gguf
4. Quantize GGUF → Q4_K_M
(If you installed llama.cpp via Homebrew, the llama-quantize already on your PATH works in place of the built binary below.)
~/llama.cpp/build/bin/llama-quantize \
~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.f16.gguf \
~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
Q4_K_M
5. Test with llama.cpp
llama-cli \
-m ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
-p "<s>[INST] Write one sentence about diagrams. [/INST]" \
-n 80 \
--temp 0.7 \
--no-display-prompt
You must see output here before continuing.
6. Import into Ollama (correct chat template)
mkdir -p ~/ollama/tinyllama-1.1b
cp ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
~/ollama/tinyllama-1.1b/
Correct Modelfile
cat > ~/ollama/tinyllama-1.1b/Modelfile <<'EOF'
FROM ./tinyllama-1.1b-chat.Q4_K_M.gguf
TEMPLATE """<s>[INST] {{ .Prompt }} [/INST]"""
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
EOF
Create and run:
cd ~/ollama/tinyllama-1.1b
ollama create tinyllama-1.1b -f Modelfile
ollama run tinyllama-1.1b "Write one sentence about diagrams."
✅ You should now see output.
Note on the ollama template
- TinyLlama is a chat-tuned LLaMA model
- It requires [INST] … [/INST]
- TEMPLATE "{{ .Prompt }}" works only for completion models
- Ollama does not auto-infer chat format for custom GGUFs
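To see exactly what the template produces, here is the string Ollama ends up sending for a given prompt — a sketch that simply mirrors the TEMPLATE line in the Modelfile above:

```python
def tinyllama_prompt(user_prompt: str) -> str:
    """Mirror the Modelfile TEMPLATE: wrap the prompt in LLaMA-chat [INST] tags."""
    return f"<s>[INST] {user_prompt} [/INST]"

print(tinyllama_prompt("Write one sentence about diagrams."))
# -> <s>[INST] Write one sentence about diagrams. [/INST]
```

If a custom GGUF produces garbage in Ollama, reconstructing this string by hand and testing it directly with llama-cli (as in step 5) is the quickest way to check whether the template or the model is at fault.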