How To Run Whisper Transcription Locally On An M-series macOS Machine Using .NET/C#

I don’t know why I started doing this today, but since it worked … I’m a bit thrilled.

The repo to reference is for Whisper.NET — the .NET/C# implementation of the OpenAI Whisper voice transcription technology.

Recipe

First off, you want to use this example program as the foundation for where we're headed. Let's get going:

1/ Make a .NET console project in your directory of choice

% dotnet new console

2/ Then you want to install a few packages

% dotnet add package Whisper.net
% dotnet add package Whisper.net.Runtime.CoreML

3/ Next you want to clone the repo for Whisper.cpp to create the special .mlmodelc file you need to run with hardware acceleration on your Arm Mac. Those instructions are all here.

  • Make a Python environment
  • Install cool ML packages
  • Generate a model appropriate for M-series Macs that is GPU-enhanced
% cd <to that repo locally>

% conda create -n py310-whisper python=3.10 -y
% conda activate py310-whisper

% pip install ane_transformers
% pip install openai-whisper
% pip install coremltools

% models/generate-coreml-model.sh base.en

Awesome. You should now have a generated folder at
./models/ggml-base.en-encoder.mlmodelc

4/ You’ll need to download a model from HuggingFace:

% bash ./models/download-ggml-model.sh base.en

5/ Keep in mind that Whisper expects audio to be recorded as 16 kHz, 16-bit WAV files — you might bump into that issue later. And on macOS there's no obvious NuGet package for recording audio, so I used ffmpeg, which is a bit clumsy but worked fine for my needs.

% brew install ffmpeg
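If you already have a recording in some other format, you can shell out to ffmpeg from C# to convert it into what Whisper expects before wiring up live capture. This is just a sketch — it assumes ffmpeg is on your PATH, and the file names are placeholders:

```csharp
using System;
using System.Diagnostics;

// Convert any audio file into the 16 kHz, mono, 16-bit PCM WAV that Whisper expects.
// Assumes ffmpeg is installed and on the PATH; paths here are just examples.
static void ConvertTo16kWav(string inputPath, string outputPath)
{
    var psi = new ProcessStartInfo
    {
        FileName = "ffmpeg",
        // -y: overwrite, -ar 16000: 16 kHz sample rate, -ac 1: mono, pcm_s16le: 16-bit samples
        Arguments = $"-y -i \"{inputPath}\" -ar 16000 -ac 1 -c:a pcm_s16le \"{outputPath}\"",
        UseShellExecute = false,
        RedirectStandardError = true
    };
    using var process = Process.Start(psi)!;
    process.WaitForExit();
    if (process.ExitCode != 0)
        throw new InvalidOperationException("ffmpeg failed: " + process.StandardError.ReadToEnd());
}

// Example usage (file names are placeholders):
// ConvertTo16kWav("some-recording.m4a", "recorded_audio.wav");
Console.WriteLine("converter ready");
```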

Note there are a few other things you can do directly from the Whisper.cpp repo, documented in the macOS Arm Core ML support section, that worked when I ran them, but I didn't need them for my little example app in .NET/C#.

6/ Jump back to the original .NET/C# console project you were working on and be sure to copy the folder entitled ./models/ggml-base.en-encoder.mlmodelc to the root of your project, along with the ggml-base.en.bin model file.
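Since forgetting one of the two artifacts is an easy mistake, a quick sanity check at startup can save some head-scratching. A minimal sketch, using the file names from this walkthrough (note the Core ML model is a directory, not a single file):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Returns which of the two model artifacts are missing from the given directory.
// ggml-base.en.bin is a regular file; ggml-base.en-encoder.mlmodelc is a folder.
static List<string> MissingModelArtifacts(string directory)
{
    var missing = new List<string>();
    if (!File.Exists(Path.Combine(directory, "ggml-base.en.bin")))
        missing.Add("ggml-base.en.bin");
    if (!Directory.Exists(Path.Combine(directory, "ggml-base.en-encoder.mlmodelc")))
        missing.Add("ggml-base.en-encoder.mlmodelc");
    return missing;
}

foreach (var name in MissingModelArtifacts("."))
    Console.WriteLine($"Missing: {name} -- copy it to the project root before running.");
```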

For reference, my directory looks like this:

The Finish Line

Okay! You should be able to record a file and transcribe it with the following code:

using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using Whisper.net;
using Whisper.net.Ggml;

class Program
{
    public static async Task Main(string[] args)
    {
        var ggmlType = GgmlType.Base;
        var modelFileName = "ggml-base.en.bin";
        var wavFileName = "recorded_audio.wav";

        Console.WriteLine("Press Enter to start recording...");
        Console.ReadLine();

        Console.WriteLine("Recording... Press Enter to stop.");
        await RecordAudio(wavFileName);

        Console.WriteLine("Recording stopped. Transcribing...");

        if (!File.Exists(modelFileName))
        {
            await DownloadModel(modelFileName, ggmlType);
        }

        using var whisperFactory = WhisperFactory.FromPath(modelFileName);
        using var processor = whisperFactory.CreateBuilder()
            .WithLanguage("auto")
            .Build();

        using var fileStream = File.OpenRead(wavFileName);

        await foreach (var result in processor.ProcessAsync(fileStream))
        {
            Console.WriteLine($"{result.Start}->{result.End}: {result.Text}");
        }
    }

    private static async Task RecordAudio(string outputPath)
    {
        using var ffmpegProcess = new Process
        {
            StartInfo = new ProcessStartInfo
            {
                FileName = "ffmpeg",
                // you want -y in here to overwrite the file if it exists -- otherwise ffmpeg stalls waiting for confirmation
                // you also want the input device index correct for your machine;
                // list devices with `ffmpeg -f avfoundation -list_devices true -i ""`
                Arguments = $"-f avfoundation -y -i \":3\" -ar 16000 {outputPath}",
                RedirectStandardInput = true,
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                UseShellExecute = false
            }
        };

        ffmpegProcess.Start();

        // Wait for the user to press Enter to stop recording
        await Task.Run(() => Console.ReadLine());

        // Ask ffmpeg to stop cleanly by sending 'q' on stdin (flush so it isn't stuck in the buffer)
        ffmpegProcess.StandardInput.Write('q');
        ffmpegProcess.StandardInput.Flush();
        await ffmpegProcess.WaitForExitAsync();
    }

    private static async Task DownloadModel(string fileName, GgmlType ggmlType)
    {
        Console.WriteLine($"Downloading Model {fileName}");
        using var modelStream = await WhisperGgmlDownloader.GetGgmlModelAsync(ggmlType);
        using var fileWriter = File.OpenWrite(fileName);
        await modelStream.CopyToAsync(fileWriter);
    }
}

And then just a simple dotnet run and you’re off to the races. Good luck!

Why am I doing this?

I wanted to start polling the microphone for input to Semantic Kernel, and as I go deeper into the .NET/C# world I needed something like this. Of course I could have just called the OpenAI endpoint, but since I've been watching the cool kids use local models, I couldn't help but see if this would really work.