Getting LLaMA 2 Running On macOS

Now this was really a journey unto itself. The easiest way to get things going was with Simon Willison's llm tool, installed via Homebrew. That felt really good.

% brew install llm
% llm install llm-llama-cpp
% llm install llama-cpp-python
% llm install https://static.simonwillison.net/static/2023/llama_cpp_python-0.1.77-cp311-c
% llm llama-cpp download-model \
  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin \
  --alias llama2-chat --alias l2c --llama2-chat
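
One optional sanity check at this point (these are standard llm commands, not anything specific to the llama plugin) is to confirm that the model and its aliases actually registered:

% llm models
% llm aliases list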

Sure enough, it worked with:

% llm -m l2c 'Tell me a joke about a llama'

And that response can go looooong. So you want to give it a system prompt to steer it:

% llm -m l2c 'Tell me a joke about a llama' --system 'You are funny'
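
One low-tech way to rein in the length is just to be blunter in the system prompt (nothing plugin-specific here, only prompt wording), and since llm logs prompts and responses to a SQLite database by default, you can always go back and read a long answer later:

% llm -m l2c 'Tell me a joke about a llama' --system 'You are funny. One short joke, one sentence.'
% llm logs list -n 1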

You can download a bigger model, but that’s when stuff stopped working for me.

% llm llama-cpp download-model \
  'https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q8_0.bin'\
  -a llama2-chat-13b --llama2-chat

And then you try this example, except mine just froze up.

% llm -m llama2-chat-13b 'Tell me a joke about a llama' --system 'You are Jerry Seinfeld'
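
My best guess, and it is only a guess, is a memory squeeze: the 13B q8_0 file weighs in around 13–14 GB, so on a lot of Macs there is not much RAM headroom left once it loads. A quick check of what you are working with:

% sysctl -n hw.memsize   # total RAM in bytes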

The next example, with its promise of using a non-Llama model, sparked my interest.

% llm llama-cpp download-model \
  https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/resolve/main/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin \
  --alias wizard-vicuna-7b --alias wizard

Sure enough, I downloaded it and ran it and …

% llm -m wizard 'A cocktail recipe involving a mango:'

Running From llama.cpp

I got a little ambitious and tried to figure out how to get llama.cpp working just like the big kids. And … it worked. OMG. What did I do? First off, I just cloned the repo and did an old-fashioned make:

% git clone https://github.com/ggerganov/llama.cpp
% cd llama.cpp
% make
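
A quick way to confirm the build actually produced something usable (at the time, the main binary landed right in the repo root):

% ./main --help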

By some miracle, it went all the way through. I know. Miracle. Right? And then I wanted to figure out how to get the right kind of model. That wasn’t immediately obvious to me. What I did in the end was get it off Hugging Face with the following steps.

  1. Your target is a file named llama-2-7b.Q4_0.gguf (at least that one worked for me, and I figure others would work too …)
  2. Visit “TheBloke” on Hugging Face
  3. Grab the file by clicking on the Files tab and downloading it (or pull it from the command line, as sketched below)
  4. Copy it into the models directory of your cloned repo for neatness’ sake
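
If you’d rather skip the clicking, the same file can be pulled from the command line. The URL below assumes the file lives in TheBloke’s Llama-2-7B-GGUF repo (worth double-checking the exact repo and filename on the Files tab first); curl -L follows the redirect to the actual download:

% curl -L -o models/llama-2-7b.Q4_0.gguf \
  https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf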

Then we just need to run from the command line and see something magical happen:

% ./main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

Model File Formats?

In the process I learned the difference between BIN, GGML, and GGUF (GGUF being the newer one). In case you were wondering:

  • BIN is just a generic binary file extension; it doesn’t tell you what’s inside
  • GGML is the older format from Georgi Gerganov, the fellow who made llama.cpp
  • GGUF is its successor, introduced by the llama.cpp project with input from the community

As written on Reddit regarding GGUF:

It is a successor file format to GGML, GGMF and GGJT, and is designed to be unambiguous by containing all the information needed to load a model. It is also designed to be extensible, so that new features can be added to GGML without breaking compatibility with older models.

Happy Redditor

So for now the old files end in .bin (with GGML inside) and the new ones end in .gguf, until some even newer file format comes out.
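
If you ever need to check which format a file really is, the new one is easy to spot: a GGUF file literally begins with the four ASCII characters GGUF, while the older GGML-era files start with different magic bytes:

% head -c 4 models/llama-2-7b.Q4_0.gguf   # prints GGUF for the new format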