John Maeda’s Blog

MacOS Run HF (HuggingFace) Model In Ollama, Part 1 (2025)

1) Install prerequisites

A) Homebrew

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

B) llama.cpp via Homebrew

Get llama.cpp via brew:

brew install llama.cpp

Verify:

llama-cli --version

C) ollama via Homebrew

Ollama also ships a standalone macOS installer, but Homebrew keeps everything in one place:

brew install ollama
brew services start ollama

Verify:

ollama --version

D) Hugging Face CLI (for downloads)

python3 -m pip install -U huggingface_hub

Verify:

huggingface-cli --help | head -n 5

(Optional) login if you want gated models later:

huggingface-cli login

2) Create a local workspace

mkdir -p ~/Documents/HuggingFace
mkdir -p ~/ollama

3) Download a tiny GGUF from HF

Repo: QuantFactory/SmolLM-135M-GGUF

File: SmolLM-135M.Q4_K_M.gguf

cd ~/Documents/HuggingFace

huggingface-cli download QuantFactory/SmolLM-135M-GGUF \
  --local-dir . \
  --max-workers 1 \
  --include "SmolLM-135M.Q4_K_M.gguf"

Confirm it exists:

ls -lh ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf
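Beyond `ls`, you can sanity-check that the download really is a GGUF file: the format starts with the 4-byte ASCII magic `GGUF`. A minimal sketch (the function name is my own):

```python
def looks_like_gguf(path: str) -> bool:
    """GGUF files begin with the 4-byte ASCII magic b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

For example, `looks_like_gguf("SmolLM-135M.Q4_K_M.gguf")` should return True for the file downloaded above.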

4) Run it with llama-cli

llama-cli -m ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf \
  -p "Write one short sentence about diagrams." \
  -n 40 --temp 0.7 \
  --no-display-prompt \
  --repeat-penalty 1.15

If you ever see prompt echoing or repetition, the two knobs that matter most are `--no-display-prompt` (keeps the prompt out of the output) and `--repeat-penalty` (discourages the model from looping on itself).

5) Import the GGUF into ollama and run it

A) Create an ollama model folder and copy the GGUF

mkdir -p ~/ollama/smollm-135m
cp ~/Documents/HuggingFace/SmolLM-135M.Q4_K_M.gguf ~/ollama/smollm-135m/

B) Create a Modelfile (plain completion)

cat > ~/ollama/smollm-135m/Modelfile <<'EOF'
FROM ./SmolLM-135M.Q4_K_M.gguf
TEMPLATE "{{ .Prompt }}"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 80
EOF

C) Create the Ollama model and run it

cd ~/ollama/smollm-135m
ollama create smollm-135m -f Modelfile
ollama run smollm-135m "Write one short sentence about diagrams."

Inspect config anytime:

ollama show smollm-135m

6) Let’s get a better model

See which GGUF files are available in the Qwen family:

python3 - <<'EOF'
from huggingface_hub import list_repo_files
rid="Qwen/Qwen2.5-1.5B-Instruct-GGUF"
for f in list_repo_files(rid):
    if f.endswith(".gguf"):
        print(f)
EOF

On my end it looks like:

qwen2.5-1.5b-instruct-fp16.gguf
qwen2.5-1.5b-instruct-q2_k.gguf
qwen2.5-1.5b-instruct-q3_k_m.gguf
qwen2.5-1.5b-instruct-q4_0.gguf
qwen2.5-1.5b-instruct-q4_k_m.gguf
qwen2.5-1.5b-instruct-q5_0.gguf
qwen2.5-1.5b-instruct-q5_k_m.gguf
qwen2.5-1.5b-instruct-q6_k.gguf
qwen2.5-1.5b-instruct-q8_0.gguf

How to read a GGUF filename (example)

qwen2.5-1.5b-instruct-q4_k_m.gguf

| Part | Meaning |
|---|---|
| qwen2.5 | Model family / architecture |
| 1.5b | Parameter count (1.5 billion parameters) |
| instruct | Trained to follow instructions (chat / assistant style) |
| q4_k_m | Quantization method (quality vs size tradeoff) |
| .gguf | File format used by llama.cpp / Ollama |
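Since the naming is a consistent convention, it can be parsed mechanically. A small sketch (the field names are my own shorthand, not an official schema):

```python
def parse_gguf_name(filename: str) -> dict:
    """Split a '<family>-<size>-<variant>-<quant>.gguf' filename into parts.
    This relies on the common naming convention, not any formal spec."""
    stem = filename.removesuffix(".gguf")
    family, size, variant, quant = stem.split("-", 3)
    return {"family": family, "size": size, "variant": variant, "quant": quant}

print(parse_gguf_name("qwen2.5-1.5b-instruct-q4_k_m.gguf"))
# {'family': 'qwen2.5', 'size': '1.5b', 'variant': 'instruct', 'quant': 'q4_k_m'}
```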

Quantization variants explained (the part that matters most)

In a name like q4_k_m, the number after the q is the bit width: the weights have been quantized to 4 bits each. The variants run q2, q3, q4, q5, q6, q8, in increasing number of bits used to store each weight, and broadly in increasing quality.
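A rough back-of-envelope for what the bit width means for file size: weight storage is roughly parameters × bits / 8 bytes. This toy estimate ignores the per-block scale overhead that k-quants add, so real GGUF files come out somewhat larger:

```python
def approx_weight_size_gb(params: float, bits: int) -> float:
    """Rough size of the weights alone: params * bits / 8 bytes.
    Ignores k-quant per-block scales, so real files are a bit larger."""
    return params * bits / 8 / 1e9

# For a 1.5B-parameter model:
for bits in (2, 4, 8, 16):
    print(f"q{bits}: ~{approx_weight_size_gb(1.5e9, bits):.2f} GB")
```

This is why the q4_k_m file of a 1.5B model weighs in at roughly 1 GB while the fp16 file is around 3 GB.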


GGUF quantization cheat sheet

| Filename suffix | What it means | Quality | Size | Speed | When to choose it |
|---|---|---|---|---|---|
| fp16 | Full precision (16-bit floats) | ⭐⭐⭐⭐⭐ | 💾💾💾💾💾 | 🐢 | Debugging, benchmarking, or research only |
| q8_0 | Very high-quality quant | ⭐⭐⭐⭐½ | 💾💾💾💾 | 🐢/🐇 | Near-fp16 quality, still smaller |
| q6_k | High-quality modern quant | ⭐⭐⭐⭐ | 💾💾💾 | 🐇 | If you want max quality without fp16 |
| q5_k_m | Excellent quality | ⭐⭐⭐⭐ | 💾💾½ | 🐇 | Writing, reasoning, coding |
| q5_0 | Older-style q5 | ⭐⭐⭐½ | 💾💾½ | 🐇 | Usually skip (prefer _k_m) |
| q4_k_m | Best default | ⭐⭐⭐½ | 💾💾 | 🐇🐇 | Most people, most uses |
| q4_0 | Older q4 | ⭐⭐⭐ | 💾💾 | 🐇🐇 | Works, but _k_m is better |
| q3_k_m | Aggressive compression | ⭐⭐½ | 💾 | 🐇🐇🐇 | Low-RAM machines |
| q2_k | Extreme compression | ⭐⭐ | 💾½ | 🐇🐇🐇 | Only if desperate |

Legend: ⭐ quality, 💾 relative file size on disk, 🐢 slower / 🐇 faster inference.

The suffix tells you how those bits are used:

| Variant | How the bits are used | Plain-English meaning |
|---|---|---|
| q4_0 | Same bits, used uniformly everywhere | “Compress everything equally” |
| q4_k | Same bits, grouped and scaled | “Use bits more carefully” |
| q4_k_m | Same bits, but critical parts get more precision | “Spend bits where they matter” |

What stays constant vs what changes

| Part | What it controls |
|---|---|
| q4 | How many bits are available |
| _0 / _k / _k_m | How intelligently those bits are allocated |

The number sets the budget. The suffix decides how wisely the budget is spent.
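To make “spending the budget” concrete, here is a toy version of uniform 4-bit quantization, roughly in the spirit of q4_0: one shared scale per block of weights, each weight stored as a small signed integer. The real GGUF format differs in details (32-weight blocks, packed storage), so treat this as an illustration only:

```python
def quantize_block_4bit(weights):
    """Toy q4_0-style quantization: one shared scale per block,
    each weight rounded to a 4-bit signed integer in [-8, 7].
    Real GGUF uses 32-weight blocks and packed byte storage."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid div-by-zero on all-zero blocks
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    dequant = [x * scale for x in q]  # what the model actually "sees" at inference
    return q, scale, dequant

q, scale, approx = quantize_block_4bit([0.12, -0.40, 0.33, 0.05])
print(q)  # → [2, -7, 6, 1]
```

Every weight in the block shares one scale, so outliers force everything else to be stored coarsely; the _k and _k_m schemes refine exactly this allocation.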


One-glance decision guide

| Your goal | Pick this |
|---|---|
| Just works, fast, good | q4_k_m |
| Better writing / reasoning | q5_k_m |
| Maximum quality without fp16 | q6_k or q8_0 |
| Debugging / evaluation | fp16 |
| Very limited memory | q3_k_m |

Let’s use the q4_k_m model

Installing the Xet extension can speed up some model downloads:

python3 -m pip install -U "huggingface_hub[hf_xet]"

Download the model

cd ~/Documents/HuggingFace

huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --local-dir . \
  --max-workers 1 \
  --include "qwen2.5-1.5b-instruct-q4_k_m.gguf"

Confirm it’s there

ls -lh ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf

Try it out

llama-cli -m ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  -p "Explain OAuth in 3 bullet points." \
  -n 200 --temp 0.7 \
  --no-display-prompt \
  --repeat-penalty 1.15

Import into ollama and run it

mkdir -p ~/ollama/qwen2.5-1.5b
cp ~/Documents/HuggingFace/qwen2.5-1.5b-instruct-q4_k_m.gguf ~/ollama/qwen2.5-1.5b/

cat > ~/ollama/qwen2.5-1.5b/Modelfile <<'EOF'
FROM ./qwen2.5-1.5b-instruct-q4_k_m.gguf
TEMPLATE "{{ .Prompt }}"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
EOF

cd ~/ollama/qwen2.5-1.5b
ollama create qwen2.5-1.5b -f Modelfile
ollama run qwen2.5-1.5b "Explain OAuth in 5 short lines."
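Once created, the model is also reachable over Ollama’s local REST API (POST /api/generate on the default port 11434). A minimal sketch of building the request body; the helper name is my own:

```python
import json

def build_generate_request(model: str, prompt: str, stream: bool = False) -> str:
    """JSON body for Ollama's POST /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

body = build_generate_request("qwen2.5-1.5b", "Explain OAuth in 5 short lines.")
# Send it with, e.g.:
#   curl http://localhost:11434/api/generate -d "$body"
```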

Did that work for you? Congratulations! —JM


Advanced: Convert a HF model to GGUF and run in ollama

Below is a clean, from-scratch, reproducible guide with the chat template baked in. This is the version you can keep and reuse.


End-to-End Guide: HF → GGUF → llama.cpp → Ollama (macOS)


1. Create directories (if you don’t have them already)

mkdir -p ~/Documents/HuggingFace
mkdir -p ~/Documents/HuggingFace/gguf

2. Download the HF model (original format)

cd ~/Documents/HuggingFace
mkdir -p tinyllama-1.1b-chat-v1.0
cd tinyllama-1.1b-chat-v1.0

huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --local-dir . \
  --max-workers 1

Sanity check:

ls | grep -Ei "safetensors|tokenizer|sentencepiece|\.model"

3. Convert HF → GGUF (FP16)

This step assumes a llama.cpp source checkout at ~/llama.cpp; the conversion script lives in the repo and may not be on your PATH if you only installed the Homebrew package.

python3 ~/llama.cpp/convert_hf_to_gguf.py \
  ~/Documents/HuggingFace/tinyllama-1.1b-chat-v1.0 \
  --outtype f16 \
  --outfile ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.f16.gguf

4. Quantize GGUF → Q4_K_M

The path below assumes a local build of llama.cpp; if you installed via Homebrew, the llama-quantize binary on your PATH works the same way.

~/llama.cpp/build/bin/llama-quantize \
  ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.f16.gguf \
  ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
  Q4_K_M

5. Test with llama.cpp

llama-cli \
  -m ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
  -p "<s>[INST] Write one sentence about diagrams. [/INST]" \
  -n 80 \
  --temp 0.7 \
  --no-display-prompt

You must see output here before continuing.


6. Import into Ollama (correct chat template)

mkdir -p ~/ollama/tinyllama-1.1b
cp ~/Documents/HuggingFace/gguf/tinyllama-1.1b-chat.Q4_K_M.gguf \
   ~/ollama/tinyllama-1.1b/

Correct Modelfile

cat > ~/ollama/tinyllama-1.1b/Modelfile <<'EOF'
FROM ./tinyllama-1.1b-chat.Q4_K_M.gguf

TEMPLATE """<s>[INST] {{ .Prompt }} [/INST]"""

PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
EOF

Create and run:

cd ~/ollama/tinyllama-1.1b
ollama create tinyllama-1.1b -f Modelfile
ollama run tinyllama-1.1b "Write one sentence about diagrams."

✅ You should now see output.


Note on the ollama template

The TEMPLATE line is what Ollama wraps around each prompt before handing it to the model, and it has to match the format the model was fine-tuned on. If it doesn’t, you’ll typically see echoed prompts, empty replies, or rambling output.