John Maeda’s Blog

MacOS Run HF (HuggingFace) Model In Ollama, Part 1 (2025)

1) Install prerequisites

A) Homebrew

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

B) llama.cpp via Homebrew

Get llama.cpp via brew:

brew install llama.cpp

Verify:

llama-cli --version

C) ollama via Homebrew

Ollama also ships a standalone macOS installer, but Homebrew keeps everything in one place:

brew install ollama
brew services start ollama

Verify:

ollama --version

D) Hugging Face CLI (for downloads)

python3 -m pip install -U huggingface_hub

Verify:

huggingface-cli --help | head -n 5

(Optional) login if you want gated models later:

huggingface-cli login

2) Create a local workspace

mkdir -p ~/Documents/HuggingFace
mkdir -p ~/ollama

3) Download a tiny GGUF from HF

Repo: QuantFactory/SmolLM-135M-GGUF

File: SmolLM-135M.Q4_K_M.gguf

cd ~/Documents/HuggingFace

huggingface-cli download QuantFactory/SmolLM-135M-GGUF \
  --local-dir . \
  --max-workers 1 \
  --include "SmolLM-135M.Q4_K_M.gguf"

Confirm it exists:

ls -lh ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf
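Beyond `ls`, you can sanity-check that the download really is a GGUF file: the format starts with the 4-byte ASCII magic `GGUF`. A minimal sketch (the function name is my own):

```python
def looks_like_gguf(path: str) -> bool:
    """GGUF files begin with the 4-byte ASCII magic b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

For example, `looks_like_gguf("SmolLM-135M.Q4_K_M.gguf")` should return True for the file downloaded above.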

4) Run it with llama-cli

llama-cli -m ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf \
  -p "Write one short sentence about diagrams." \
  -n 40 --temp 0.7 \
  --no-display-prompt \
  --repeat-penalty 1.15

If you ever see prompt echoing or repetition, the two knobs that matter most are `--no-display-prompt` (keeps the prompt out of the output) and `--repeat-penalty` (discourages the model from looping on itself).

5) Import the GGUF into ollama and run it

A) Create an ollama model folder and copy the GGUF

mkdir -p ~/ollama/smollm-135m
cp ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf ~/ollama/smollm-135m/

B) Create a Modelfile (plain completion)

cat > ~/ollama/smollm-135m/Modelfile <<'EOF'
FROM ./SmolLM-135M.Q4_K_M.gguf
TEMPLATE "{{ .Prompt }}"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 80
EOF

C) Create the Ollama model and run it

cd ~/ollama/smollm-135m
ollama create smollm-135m -f Modelfile
ollama run smollm-135m "Write one short sentence about diagrams."

Inspect config anytime:

ollama show smollm-135m

6) Let’s get a better model

See which GGUF files are available in the Qwen family:

python3 - <<'EOF'
from huggingface_hub import list_repo_files
rid="Qwen/Qwen2.5-1.5B-Instruct-GGUF"
for f in list_repo_files(rid):
    if f.endswith(".gguf"):
        print(f)
EOF

On my end it looks like:

qwen2.5-1.5b-instruct-fp16.gguf
qwen2.5-1.5b-instruct-q2_k.gguf
qwen2.5-1.5b-instruct-q3_k_m.gguf
qwen2.5-1.5b-instruct-q4_0.gguf
qwen2.5-1.5b-instruct-q4_k_m.gguf
qwen2.5-1.5b-instruct-q5_0.gguf
qwen2.5-1.5b-instruct-q5_k_m.gguf
qwen2.5-1.5b-instruct-q6_k.gguf
qwen2.5-1.5b-instruct-q8_0.gguf

How to read a GGUF filename (example)

qwen2.5-1.5b-instruct-q4_k_m.gguf

| Part | Meaning |
|---|---|
| qwen2.5 | Model family / architecture |
| 1.5b | Parameter count (1.5 billion parameters) |
| instruct | Trained to follow instructions (chat / assistant style) |
| q4_k_m | Quantization method (quality vs size tradeoff) |
| .gguf | File format used by llama.cpp / Ollama |
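Since the naming is a consistent convention, it can be parsed mechanically. A small sketch (the field names are my own shorthand, not an official schema):

```python
def parse_gguf_name(filename: str) -> dict:
    """Split a '<family>-<size>-<variant>-<quant>.gguf' filename into parts.
    This relies on the common naming convention, not any formal spec."""
    stem = filename.removesuffix(".gguf")
    family, size, variant, quant = stem.split("-", 3)
    return {"family": family, "size": size, "variant": variant, "quant": quant}

print(parse_gguf_name("qwen2.5-1.5b-instruct-q4_k_m.gguf"))
# {'family': 'qwen2.5', 'size': '1.5b', 'variant': 'instruct', 'quant': 'q4_k_m'}
```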

Quantization variants explained (the part that matters most)

In a name like q4_k_m, the number after the q is the bit width: the weights have been quantized to 4 bits each. The variants run q2, q3, q4, q5, q6, q8, in increasing number of bits used to store each weight, and broadly in increasing quality.
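A rough back-of-envelope for what the bit width means for file size: weight storage is roughly parameters × bits / 8 bytes. This toy estimate ignores the per-block scale overhead that k-quants add, so real GGUF files come out somewhat larger:

```python
def approx_weight_size_gb(params: float, bits: int) -> float:
    """Rough size of the weights alone: params * bits / 8 bytes.
    Ignores k-quant per-block scales, so real files are a bit larger."""
    return params * bits / 8 / 1e9

# For a 1.5B-parameter model:
for bits in (2, 4, 8, 16):
    print(f"q{bits}: ~{approx_weight_size_gb(1.5e9, bits):.2f} GB")
```

This is why the q4_k_m file of a 1.5B model weighs in at roughly 1 GB while the fp16 file is around 3 GB.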


GGUF quantization cheat sheet

| Filename suffix | What it means | Quality | Size | Speed | When to choose it |
|---|---|---|---|---|---|
| fp16 | Full precision (16-bit floats) | ⭐⭐⭐⭐⭐ | 💾💾💾💾💾 | 🐢 | Debugging, benchmarking, or research only |
| q8_0 | Very high-quality quant | ⭐⭐⭐⭐½ | 💾💾💾💾 | 🐢/🐇 | Near-fp16 quality, still smaller |
| q6_k | High-quality modern quant | ⭐⭐⭐⭐ | 💾💾💾 | 🐇 | If you want max quality without fp16 |
| q5_k_m | Excellent quality | ⭐⭐⭐⭐ | 💾💾½ | 🐇 | Writing, reasoning, coding |
| q5_0 | Older-style q5 | ⭐⭐⭐½ | 💾💾½ | 🐇 | Usually skip (prefer _k_m) |
| q4_k_m | Best default | ⭐⭐⭐½ | 💾💾 | 🐇🐇 | Most people, most uses |
| q4_0 | Older q4 | ⭐⭐⭐ | 💾💾 | 🐇🐇 | Works, but _k_m is better |
| q3_k_m | Aggressive compression | ⭐⭐½ | 💾 | 🐇🐇🐇 | Low-RAM machines |
| q2_k | Extreme compression | ⭐⭐ | 💾½ | 🐇🐇🐇 | Only if desperate |

Legend: ⭐ quality, 💾 relative file size on disk, 🐢 slower / 🐇 faster inference.

The suffix tells you how those bits are used:

| Variant | How the bits are used | Plain-English meaning |
|---|---|---|
| q4_0 | Same bits, used uniformly everywhere | “Compress everything equally” |
| q4_k | Same bits, grouped and scaled | “Use bits more carefully” |
| q4_k_m | Same bits, but critical parts get more precision | “Spend bits where they matter” |

What stays constant vs what changes

| Part | What it controls |
|---|---|
| q4 | How many bits are available |
| _0 / _k / _k_m | How intelligently those bits are allocated |

The number sets the budget. The suffix decides how wisely the budget is spent.
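To make “spending the budget” concrete, here is a toy version of uniform 4-bit quantization, roughly in the spirit of q4_0: one shared scale per block of weights, each weight stored as a small signed integer. The real GGUF format differs in details (32-weight blocks, packed storage), so treat this as an illustration only:

```python
def quantize_block_4bit(weights):
    """Toy q4_0-style quantization: one shared scale per block,
    each weight rounded to a 4-bit signed integer in [-8, 7].
    Real GGUF uses 32-weight blocks and packed byte storage."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid div-by-zero on all-zero blocks
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    dequant = [x * scale for x in q]  # what the model actually "sees" at inference
    return q, scale, dequant

q, scale, approx = quantize_block_4bit([0.12, -0.40, 0.33, 0.05])
print(q)  # → [2, -7, 6, 1]
```

Every weight in the block shares one scale, so outliers force everything else to be stored coarsely; the _k and _k_m schemes refine exactly this allocation.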


One-glance decision guide

| Your goal | Pick this |
|---|---|
| Just works, fast, good | q4_k_m |
| Better writing / reasoning | q5_k_m |
| Maximum quality without fp16 | q6_k or q8_0 |
| Debugging / evaluation | fp16 |
| Very limited memory | q3_k_m |

Let’s use the q4_k_m model

Installing the Xet extension can speed up some model downloads:

python3 -m pip install -U "huggingface_hub[hf_xet]"

Download the model

cd ~/Documents/HuggingFace

huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --local-dir . \
  --max-workers 1 \
  --include "qwen2.5-1.5b-instruct-q4_k_m.gguf"

Confirm it’s there

ls -lh ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf

Try it out

llama-cli -m ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  -p "Explain OAuth in 3 bullet points." \
  -n 200 --temp 0.7 \
  --no-display-prompt \
  --repeat-penalty 1.15

Import into ollama and run it

mkdir -p ~/ollama/qwen2.5-1.5b
cp ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf ~/ollama/qwen2.5-1.5b/

cat > ~/ollama/qwen2.5-1.5b/Modelfile <<'EOF'
FROM ./qwen2.5-1.5b-instruct-q4_k_m.gguf
TEMPLATE "{{ .Prompt }}"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
EOF

cd ~/ollama/qwen2.5-1.5b
ollama create qwen2.5-1.5b -f Modelfile
ollama run qwen2.5-1.5b "Explain OAuth in 5 short lines."
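Once created, the model is also reachable over Ollama’s local REST API (POST /api/generate on the default port 11434). A minimal sketch of building the request body; the helper name is my own:

```python
import json

def build_generate_request(model: str, prompt: str, stream: bool = False) -> str:
    """JSON body for Ollama's POST /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

body = build_generate_request("qwen2.5-1.5b", "Explain OAuth in 5 short lines.")
# Send it with, e.g.:
#   curl http://localhost:11434/api/generate -d "$body"
```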

Did that work for you? Congratulations! —JM


Advanced: Convert a HF model to GGUF and run in ollama

Below is a clean, from-scratch, reproducible guide with the chat template baked in. This is the version you can keep and reuse.


End-to-End Guide: HF → GGUF → llama.cpp → Ollama (macOS)


1. Create directories (if you don’t have them already)

mkdir -p ~/Documents/HuggingFace
mkdir -p ~/Documents/HuggingFace/gguf

2. Download the HF model (original format)

cd ~/Documents/HuggingFace
mkdir -p tinyllama-1.1b-chat-v1.0
cd tinyllama-1.1b-chat-v1.0

huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --local-dir . \
  --max-workers 1

Sanity check:

ls | grep -Ei "safetensors|tokenizer|sentencepiece|\.model"

3. Convert HF → GGUF (FP16)

This step assumes a llama.cpp source checkout at ~/llama.cpp; the conversion script lives in the repo and may not be on your PATH if you only installed the Homebrew package.

python3 ~/llama.cpp/convert_hf_to_gguf.py \
  ~/Documents/HuggingFace/tinyllama-1.1b-chat-v1.0 \
  --outtype f16 \
  --outfile ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.f16.gguf

4. Quantize GGUF → Q4_K_M

The path below assumes a local build of llama.cpp; if you installed via Homebrew, the llama-quantize binary on your PATH works the same way.

~/llama.cpp/build/bin/llama-quantize \
  ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.f16.gguf \
  ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
  Q4_K_M

5. Test with llama.cpp

llama-cli \
  -m ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
  -p "<s>[INST] Write one sentence about diagrams. [/INST]" \
  -n 80 \
  --temp 0.7 \
  --no-display-prompt

You must see output here before continuing.


6. Import into Ollama (correct chat template)

mkdir -p ~/ollama/tinyllama-1.1b
cp ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
   ~/ollama/tinyllama-1.1b/

Correct Modelfile

cat > ~/ollama/tinyllama-1.1b/Modelfile <<'EOF'
FROM ./tinyllama-1.1b-chat.Q4_K_M.gguf

TEMPLATE """<s>[INST] {{ .Prompt }} [/INST]"""

PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
EOF

Create and run:

cd ~/ollama/tinyllama-1.1b
ollama create tinyllama-1.1b -f Modelfile
ollama run tinyllama-1.1b "Write one sentence about diagrams."

✅ You should now see output.


Note on the ollama template

The TEMPLATE line is what Ollama wraps around each prompt before handing it to the model, and it has to match the format the model was fine-tuned on. If it doesn’t, you’ll typically see echoed prompts, empty replies, or rambling output.