I build a voice typing app. Full disclosure up front: it's called Meander, so read everything here with that bias in mind. When you build one of these, the first thing everyone asks is "why not just run it locally? No cloud, no privacy worries, no bills." It's a fair question. So I spent a while putting the local speech-to-text models I could run through real dictation, with one requirement that ruled a lot of them out: it had to handle Chinese.
Here's the honest map of what I found, including the one thing that quietly changed how I think about the whole local-versus-cloud question.
Start with Whisper, because everyone does
OpenAI's Whisper is the default. I tested the lineup on real dictation:
- Base. Tiny and fast, but the accuracy is rough. Fine for a throwaway note, frustrating the moment you want clean text.
- Large-v3-turbo. A pruned and fine-tuned version of large-v3 (the decoder is cut from 32 layers to 4), roughly 4x faster than the full large model. Quick, with two catches: the accuracy sits a notch below full large on many languages, and turbo drops Whisper's translation task entirely.
- Large-v3. The most accurate of the family, and the one I'd actually trust. Here's the part that surprises people: turbo is not the quality winner. Large-v3 is. You trade speed to get there.
The benchmarks line up with this. Large-v3 lands around 2.7% word error rate on clean audio and wants about 3 GB of VRAM (model comparison, Hugging Face).
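For anyone who hasn't met the metric: word error rate is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. Here's a minimal Python sketch, assuming plain whitespace tokenization and no text normalization (published benchmarks normalize casing and punctuation first, and Chinese is usually scored per character instead). The function name and the sample sentences are illustrative, not taken from any benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "please send the report to the whole team today"
hypothesis = "please send a report to whole team today"
print(f"{word_error_rate(reference, hypothesis):.1%}")
```

The sample pair has one substitution and one deletion against nine reference words, so it scores about 22%. A figure like 2.7% means roughly one wrong word in every 37.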
If you need Chinese, the Whisper-only story falls apart
Most "best local model" roundups stop at Whisper. But the moment you need anything other than English, the picture changes, and this is where my testing got interesting.
- NVIDIA Parakeet. Blazing fast, accuracy on par with Whisper large for English. The newer V3 added a batch of European languages, but here's the catch no benchmark headline mentions: still no Chinese, and the fastest one (V2) is English-only. For a product that needs CJK, fast doesn't help if the language simply isn't there (benchmarks).
- SenseVoice (Alibaba's FunASR). This one impressed me. It's non-autoregressive, so it's fast (the small model is about 15x faster than Whisper-large), it's small, and it handles Chinese, Cantonese, Japanese, and Korean well. For local Chinese dictation, it's the one I kept coming back to (GitHub).
- Breeze ASR (MediaTek Research). A Whisper-large-v2 fine-tune built for Taiwanese Mandarin and Mandarin-English code-switching, the everyday habit of mixing Chinese and English mid-sentence. It improves code-switching by over 50% versus stock Whisper. Niche, but if that's how you actually talk, nothing else comes close (GitHub).
- Cohere Transcribe (Cohere Labs). The accuracy champion of the bunch. It's a 2B-parameter open model (Apache 2.0) that currently tops the Open ASR Leaderboard and beats Whisper Large-v3 on English (about 5.4% word error rate to Whisper's 7.4%; these are averages over the leaderboard's mixed test sets, which is why Whisper scores worse here than the clean-audio figure above). It covers 14 languages, including Mandarin, Japanese, and Korean. The trade-off in practice: it's heavier and slower than the tiny models above, so you pay in speed for that accuracy (release notes).
There's a whole tail of smaller, single-region models too: Moonshine for English, Canary for European languages plus translation, GigaAM for Russian. All worth a look if Chinese isn't on your list, and most of them are tiny (some under 100 MB).
The part that changed my mind: the model isn't the bottleneck anymore
Speech recognition got small and good. SenseVoice is tiny and fast. So I expected local to be easy from here. It wasn't, and the reason caught me off guard.
A voice typing app isn't just transcription. It's transcribe, then run an AI cleanup pass to fix grammar and filler words. That cleanup is a second model: an LLM. And the LLM is the hog.
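Here's the shape of that pipeline as a minimal Python sketch. Everything named in it is hypothetical: `Transcriber` and `Cleaner` are stand-in type aliases, `fake_transcribe` returns a canned string instead of running a speech model, and `strip_fillers` is a crude regex stand-in for the LLM cleanup pass. All it shows is that two separate models sit behind a single dictation call.

```python
import re
from typing import Callable

# Hypothetical stage signatures: audio bytes -> raw text, raw text -> clean text.
Transcriber = Callable[[bytes], str]
Cleaner = Callable[[str], str]


def dictate(audio: bytes, transcribe: Transcriber, clean_up: Cleaner) -> str:
    """Run the two stages in order: speech model first, cleanup model second."""
    raw_text = transcribe(audio)
    return clean_up(raw_text)


def fake_transcribe(audio: bytes) -> str:
    # Stand-in for a speech model such as SenseVoice or Whisper.
    return "um so I think uh we should ship it on friday"


FILLERS = re.compile(r"\b(?:um|uh)\b\s*", flags=re.IGNORECASE)


def strip_fillers(text: str) -> str:
    # Stand-in for the LLM pass: drop fillers, capitalize, add a period.
    cleaned = FILLERS.sub("", text).strip()
    if not cleaned:
        return cleaned
    return cleaned[0].upper() + cleaned[1:] + "."


print(dictate(b"", fake_transcribe, strip_fillers))
```

In a real app each stand-in would be a loaded model, and both would stay resident in memory so the app can respond the moment you start talking.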
Take SenseVoice, the model I kept reaching for. On its own it barely touches 1 GB of VRAM. The recognition side is basically a solved problem now, and the cost-performance is hard to argue with. Then you add the cleanup LLM, and even a modest one runs 3 to 6 GB (a small Gemma or Qwen, say). Stack the two, add the LLM's context and the usual overhead, and on my 16 GB card the whole thing sat at 8 to 10 GB. The operating system and whatever else I had open took a couple more. Suddenly there was no real headroom left. I couldn't run ComfyUI. I couldn't load another model. A voice input method had eaten most of my graphics card just sitting there, waiting for me to talk.
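The arithmetic, as a rough Python sketch. The individual figures are illustrative picks from inside the ranges above, not measurements, and the split between context and runtime overhead is an assumption.

```python
CARD_GB = 16.0

# Illustrative values chosen from the ranges in the text (assumptions, not measurements).
budget_gb = {
    "speech model (SenseVoice)": 1.0,
    "cleanup LLM weights": 6.0,
    "LLM context and runtime overhead": 2.5,
    "OS and other open apps": 2.0,
}

used_gb = sum(budget_gb.values())
free_gb = CARD_GB - used_gb

for name, gb in budget_gb.items():
    print(f"{name:<34}{gb:>5.1f} GB")
print(f"{'total in use':<34}{used_gb:>5.1f} GB")
print(f"{'left for everything else':<34}{free_gb:>5.1f} GB")
```

With these picks about 4.5 GB stays free on paper, which is already too little to load a second model the size of the cleanup LLM.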
That's the real local tax, and it stopped being about the speech model a while ago. It's the whole stack.
And it isn't only the memory. The small LLMs that fit on a consumer card are quick, but they make unreliable editors. Ask one to fix grammar and strip filler and it will sometimes ignore the instruction, rewrite more than you wanted, or behave differently every time. The models that do clean, predictable cleanup are big. Think of a 120B-class model like OpenAI's gpt-oss: you are simply not running that on a desktop. So local cleanup turns into a no-win: the model that fits isn't good enough, and the one that's good enough doesn't fit. That is the part that pushed me to the cloud, where you can reach the big model without having to own the big machine.
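A back-of-envelope way to see the gap, sketched in Python: weight memory is roughly parameter count times bits per weight, divided by eight. The parameter counts and the 4-bit quantization below are illustrative assumptions, and the estimate ignores context and runtime overhead, so real usage is higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


CARD_GB = 16.0

# Hypothetical model sizes, all assumed to be quantized to 4 bits per weight.
models = [("4B small model", 4), ("8B small model", 8), ("120B-class model", 120)]

for label, params_billion in models:
    needed = weight_memory_gb(params_billion, bits_per_weight=4)
    verdict = "fits" if needed < CARD_GB else "does not fit"
    print(f"{label:<18}{needed:>6.1f} GB  {verdict} on a {CARD_GB:.0f} GB card")
```

Even at 4 bits per weight, the 120B-class weights alone come to about 60 GB, several times what a 16 GB card holds.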
(Which local LLM to pick, and how to fit it, is a rabbit hole of its own, probably its own post someday. The short version is the one above: small enough to run usually means not reliable enough to trust.)
So, local or cloud?
Local is good, and free, if two things are true: you have stable, sufficient hardware, and you're willing to give that GPU to it. If your machine has the headroom, or you offload the models to a separate GPU box, run local, keep your audio on your own hardware, and pay nothing. For hard privacy requirements, this is the only real answer.
For everyone else the math is harder. A dedicated GPU is expensive, and handing your whole card to an input method so it can listen is a steep price to pay. That's where the cloud wins: someone else runs the big, accurate models, your machine stays free, and long recordings come back fast.
That, in the end, is why Meander is cloud-first. I went down the local road far enough to know it well, and for most people on most machines, the cloud is the better trade. There's a free tier if you want to try it.
At a glance
| Model | Speed | Chinese / CJK | Best for |
|---|---|---|---|
| SenseVoice | Very fast | Strong (zh, yue, ja, ko) | Local Chinese dictation |
| Breeze ASR | Moderate | Taiwan Mandarin + code-switch | Mixed Chinese and English |
| Cohere Transcribe | Slower | Strong (incl. zh, ja, ko) | Top accuracy with CJK |
| Whisper Large-v3 | Slow | Good | Accuracy you can trust |
| Whisper Turbo | Fast | Good (drops translation) | Speed and accuracy balance |
| Whisper Small / Medium | Fast / medium | Decent | A lighter Whisper |
| Parakeet V3 | Very fast | Multilingual, but no Chinese | Fast, non-CJK |
| Moonshine / Canary / GigaAM | Very fast | English / European / Russian | Tiny, single-region |
The speech models themselves are mostly small (most are well under 1 GB to download). The memory that bites comes later, when the cleanup LLM stacks on top, and that is what filled my 16 GB card.
The honest takeaway
Don't choose a local voice setup based on the speech model alone. The model is the easy part now. The real question is whether you want to spend your GPU, your time, and your machine's headroom to keep everything offline. If you do, the open models above are good, and you should use them. If you'd rather your computer stay yours, the cloud is the simpler trade.