Peter Kukol asked a great question a few months back: how do you run big models like DeepSeek or Meta's Llama 4 on consumer hardware? Should you go big on CPU DRAM or not?
Long answer but…
Yes, that’s what most people are doing: buying a big CPU, lots of RAM, and a few big Nvidia GPUs (depending on how insane you are).
The main issue is that the full DeepSeek model has roughly 670B parameters, and all of those weights have to be resident to run, so it needs around 600GB of memory (plus space for context).
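To make the memory math concrete, here’s a quick back-of-envelope sketch (the numbers are illustrative; real footprints also include the KV cache, activations, and framework overhead):

```python
# Back-of-envelope memory math (illustrative, not official model specs).
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate GB needed just to hold the weights."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for label, params, bits in [
    ("~670B model @ 8-bit", 670, 8),
    ("~670B model @ 4-bit", 670, 4),
    (" 70B  model @ 4-bit", 70, 4),
    (" 30B  model @ 4-bit", 30, 4),
]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.0f} GB")
```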
The latest approach is mixture of experts (MoE). At around 30B parameters or so, the configuration you’re talking about works way better.
For instance, one version of Qwen3 is a 30B-parameter model, but only 3B parameters are active at once. That makes it way more practical to run on a common GPU/CPU system.
And it looks like the right option above ~30B is MoE. (Basically, there’s a router network in front that decides which of N dense expert networks to run for each token.)
Llama 4 Scout, for instance, does this.
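If you want the gist of how MoE routing works, here’s a toy sketch (not any particular model’s actual architecture; the router, expert count, and dimensions are made up for illustration):

```python
import numpy as np

# Toy mixture-of-experts routing: a small "router" scores each expert per token,
# and only the top-k experts actually run, so the active parameter count per
# token is a fraction of the total.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))                          # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # toy "experts"

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token: np.ndarray) -> np.ndarray:
    scores = softmax(token @ router_w)               # probability per expert
    chosen = np.argsort(scores)[-top_k:]             # indices of the top-k experts
    weights = scores[chosen] / scores[chosen].sum()  # renormalized gate weights
    # Only the chosen experts are evaluated; the rest stay idle (and could even
    # live in slower memory, e.g. CPU RAM, in an offloading setup).
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

print(moe_forward(rng.normal(size=d_model)).shape)  # (16,)
```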
The other approach is quantization. Studies show you lose about 1-3% accuracy at 4-bit quantization (it’s a huge topic, but in practice it’s a hybrid: some layers are 4-bit, others are 8-bit).
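Here’s a toy sketch of the basic idea, assuming a simple symmetric per-tensor scheme (real formats like GPTQ, AWQ, or GGUF k-quants use per-group scales and the mixed 4/8-bit layers mentioned above):

```python
import numpy as np

# Toy symmetric quantization: round weights to a small integer grid and keep
# one scale factor per tensor.
def quantize(w: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1                              # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale                                          # int8 storage here for simplicity;
                                                             # real 4-bit packs two values per byte

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
q4, s4 = quantize(w, bits=4)
err = np.abs(w - dequantize(q4, s4)).mean()
print(f"mean abs error at 4-bit: {err:.4f}")
```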
Finally, there is distillation, which can take a ~670B model down to a dense 70B model that does fit on a single big GPU (especially when quantized). Performance is also relatively good.
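And a toy sketch of the core idea behind distillation: train the small student to match the big teacher’s output distribution rather than just hard labels (real pipelines, like the DeepSeek-R1 distilled models, train on teacher-generated data at scale; the shapes and numbers here are made up):

```python
import numpy as np

# Toy knowledge-distillation loss: KL divergence between the teacher's and the
# student's softened token distributions.
def softmax(x, temperature=1.0):
    z = x / temperature
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=32)                           # big model's logits
student_logits = teacher_logits + rng.normal(size=32)          # imperfect small model
print(f"KD loss: {distillation_loss(student_logits, teacher_logits):.3f}")
```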
Net net, you don’t need as much RAM if you’re mainly doing inference, and 24-80GB of VRAM is plenty for a lot of ordinary mortals.