How DeepSeek Builds World-Class AI on a Shoestring Budget
The real story behind the $5.6 million model that panicked Silicon Valley, explained so you actually understand it.
I have read maybe thirty articles trying to explain DeepSeek. Most of them fall into two camps: breathless hype pieces that call it “China’s Sputnik moment” without explaining a single technical detail, or academic breakdowns so dense they might as well be written in LaTeX. Neither helps you actually understand what happened.
So here is my attempt. I am going to walk through the specific engineering decisions that let a Chinese hedge fund subsidiary build an AI that goes toe-to-toe with GPT-4 — for roughly the price of a nice house in Lekki. I’ll explain every technical term as we go. No jargon left behind.
Fair warning: this is a long read. But by the end, you’ll understand Mixture-of-Experts, attention mechanisms, and reinforcement learning well enough to hold your own at any AI conversation. That’s the deal.
Some basics before we dive in
If you already know what parameters and training are, skip ahead. No shame in it. But if “671 billion parameters” sounds like gibberish, stay with me for two minutes.
A model is a mathematical function. It takes words as input and predicts what word comes next. That’s the entire foundation of ChatGPT, DeepSeek, Claude, Gemini — all of them. Very fancy next-word-prediction engines.
Parameters are the adjustable numbers inside that function. Picture a massive mixing board at a recording studio, thousands of knobs and sliders, each one controlling something about the output. A model with 671 billion parameters has 671 billion of these knobs. During training, the system tweaks them over and over until the outputs start sounding intelligent.
Training means feeding the model enormous amounts of text (we are talking trillions of words, books, websites, code, forum posts, everything) and letting it adjust its knobs based on how well it predicts what comes next. Better predictions, better model, better conversation.
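To make "adjustable knobs" concrete, here is a toy next-word predictor. It is nothing like a real neural network (its "parameters" are just word-pair counts), but the core loop is the same: observe text, adjust numbers, predict what comes next.

```python
# Toy illustration, not a real LLM: the "parameters" here are bigram
# counts, and "training" nudges them once per observed word pair.
counts = {}

def train(text):
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts.setdefault(prev, {}).setdefault(nxt, 0)
        counts[prev][nxt] += 1          # tweak a knob

def predict(prev):
    options = counts.get(prev, {})
    # Return the most frequently seen follower, or None if unseen.
    return max(options, key=options.get) if options else None

train("the cat sat on the mat the cat ran")
print(predict("the"))  # → cat ("the" was followed by "cat" twice, "mat" once)
```

Swap those counts for 671 billion learned weights and the lookup for a probability distribution, and you have the skeleton of a language model.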
Tweaking 671 billion knobs requires specialized chips called GPUs. These things cost $25,000–40,000 each, and you need thousands running in parallel for weeks. That’s where the insane costs come from. GPT-4’s training reportedly cost over $100 million in compute. Google’s Gemini Ultra? Around $191 million.
DeepSeek V3 cost $5.6 million and performs in the same league.
Now let’s talk about how.
Where DeepSeek came from
This part matters because the backstory explains the constraints, and the constraints explain the innovations.
DeepSeek is owned by High-Flyer, a quant hedge fund in Hangzhou managing about $8 billion. The founder, Liang Wenfeng, is an electronic engineering grad from Zhejiang University who got rich using AI to trade stocks. In 2019, he started buying Nvidia GPUs in bulk — $28 million worth. Then in 2021, he grabbed 10,000 A100 chips, Nvidia’s top-tier AI processor at the time.
Good timing. In October 2022, the U.S. government banned the sale of advanced AI chips to China. The door slammed shut, but Liang was already inside with his 10,000 chips stacked up.
He launched DeepSeek in July 2023 with around 140 researchers. No VC money. No board meetings. No pressure to monetize by Q3. Just a well-funded hedge fund saying: go build the best model you can with what we’ve got.
What they had was limited. OpenAI trains on 25,000+ of the latest H100 GPUs. DeepSeek trained V3 on 2,048 H800s — a deliberately nerfed export-compliant chip with about 44% of the H100’s communication bandwidth. Fewer chips, weaker chips, smaller budget.
Every single one of DeepSeek’s key innovations traces back to working around these limitations. And this is the part that I find genuinely fascinating: several of those workarounds turned out to produce better results than the brute-force approach they were designed to replace.
Innovation #1: Only wake up the experts you need
This is DeepSeek’s biggest cost saver, and it’s surprisingly intuitive once you get it.
Think about a hospital. When you walk in with a broken arm, you don’t need every doctor on staff. You need orthopedics, radiology, maybe a nurse. The cardiologist, the dermatologist, the neurologist — they keep doing their own thing. The hospital has hundreds of specialists, but your case only activates a few.
DeepSeek V3 does exactly this. The model contains 671 billion parameters total, but any given input only activates about 37 billion of them — roughly 5.5%. This is called Mixture-of-Experts (MoE).
Under the hood, the model has 256 small specialist networks plus one generalist that’s always on (sort of like the triage nurse who sees every patient). A routing mechanism looks at each input and picks the 8 most relevant specialists. So 9 experts handle each token — 8 selected ones plus the always-on generalist.
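A minimal numpy sketch of that routing step, with illustrative sizes rather than V3's real ones (8 experts and top-2 instead of 256 and top-8), and not DeepSeek's actual code:

```python
import numpy as np

# Sketch of MoE top-k routing: the router scores every expert for a
# token, keeps only the top k, and mixes just those experts' outputs.
# The unselected experts never run, which is where the savings come from.
rng = np.random.default_rng(0)
n_experts, k, d = 8, 2, 4                        # small illustrative sizes

router_w = rng.standard_normal((d, n_experts))   # router parameters
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
shared = rng.standard_normal((d, d))             # always-on generalist

def moe_forward(x):
    scores = x @ router_w                        # one score per expert
    top = np.argsort(scores)[-k:]                # indices of the top-k experts
    gates = np.exp(scores[top]); gates /= gates.sum()  # softmax over winners
    out = x @ shared                             # the "triage nurse" always fires
    for g, i in zip(gates, top):
        out = out + g * (x @ experts[i])         # only k specialists compute
    return out, top

y, chosen = moe_forward(rng.standard_normal(d))
print(sorted(chosen))  # only k of the 8 experts were activated
```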
Why does this slash costs? Because you only pay for the computation you actually use. DeepSeek V3 does about 250 billion operations per token. Meta’s Llama 3.1 405B — a dense model where everything activates for every token — does about 2,448 billion. That’s roughly a 10x gap.
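A quick sanity check on those per-token figures:

```python
# Checking the per-token compute gap quoted above.
dense_ops = 2448e9   # Llama 3.1 405B, dense: everything fires every token
moe_ops = 250e9      # DeepSeek V3, sparse: ~37B of 671B params active
print(f"{dense_ops / moe_ops:.1f}x fewer operations per token")  # → 9.8x
```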
The classic headache with MoE is load balancing. Some experts get slammed with requests while others sit idle, like a hospital where everyone ends up in the cardiology waiting room while the dermatologist plays solitaire. Previous solutions essentially penalized popular experts — which worked, but degraded quality. You’re forcing patients away from the specialist they actually need.
DeepSeek’s fix is smarter. They keep a running popularity score for each expert. If one gets too crowded, its score drops, making the router slightly less likely to send the next input there. The critical detail: this adjustment doesn’t touch the learning process. It’s traffic management, not medical interference. The doctors still practice medicine the same way — you’re just managing the queue better.
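A simplified simulation of that bias-based balancing idea (the step size `gamma` and the exponential popularity score are illustrative choices, not values from the paper). The bias affects which experts get selected, but the learning signal would still flow through the raw scores:

```python
import numpy as np

# Sketch of auxiliary-loss-free load balancing: each expert carries a
# bias added to its routing score for SELECTION ONLY. A running
# popularity score tracks traffic; crowded experts get nudged down.
rng = np.random.default_rng(1)
n_experts, k, gamma = 8, 2, 0.01            # gamma: bias step size (assumed)
preference = np.zeros(n_experts)
preference[0] = 2.0                         # expert 0 is naturally popular
bias = np.zeros(n_experts)
load = np.full(n_experts, 1.0 / n_experts)  # running popularity score

for step in range(2000):
    scores = rng.standard_normal(n_experts) + preference
    top = np.argsort(scores + bias)[-k:]         # biased selection only
    hit = np.zeros(n_experts); hit[top] = 1.0
    load = 0.99 * load + 0.01 * hit              # update popularity
    bias -= gamma * np.sign(load - load.mean())  # pure traffic management

print(bias[0])  # drifts negative: the crowded expert is made less attractive
```

The key property this preserves: no penalty term ever enters the loss function, so the experts themselves keep learning exactly as before.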
In their tests, this approach produced strictly better results than the penalty method. Not equal. Better.
Innovation #2: Compressing the model’s memory
Here is a problem that doesn’t get enough attention (pun sort of intended).
When you are having a long conversation with an AI, the model needs to remember everything that came before. Every word you said, every word it said. The mechanism for this is called attention — the model looks back at the full conversation history when generating each new word.
This history gets stored in what’s called a KV-cache (Key-Value cache). For each word in the conversation, the model stores a “key” (what the word represents) and a “value” (what information it carries). As conversations get longer, this cache gets massive.
How massive? For Llama 3.1 405B, each token in the conversation takes up 516 KB of cache space. At a 128,000-token conversation length, that’s about 63 GB of memory just for remembering the conversation. That’s more RAM than most laptops have, consumed entirely by context memory.
DeepSeek developed Multi-Head Latent Attention (MLA), and the concept is surprisingly elegant. Instead of storing the full key-value pairs, MLA compresses them into a small, dense summary vector — 512 dimensions instead of the standard 32,768 values per token per layer. When the model needs to reference a past token, it reconstructs the full information from the summary on the fly.
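Here is the compression idea in miniature (heavily simplified: real MLA also handles rotary position embeddings and per-head structure separately, and the dimensions below are illustrative, not V3's):

```python
import numpy as np

# Sketch of latent compression: instead of caching full keys/values,
# cache one small latent vector per token and reconstruct K and V from
# it on the fly with learned up-projection matrices.
rng = np.random.default_rng(0)
d_model, d_latent = 1024, 64      # illustrative sizes only

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

cache = []                         # stores only d_latent floats per token

def write(hidden_state):
    cache.append(hidden_state @ W_down)      # compress before caching

def read(i):
    latent = cache[i]                        # reconstruct full K and V
    return latent @ W_up_k, latent @ W_up_v

write(rng.standard_normal(d_model))
key, value = read(0)
print(cache[0].shape, key.shape)  # 64 floats cached, 1024-dim key rebuilt
```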
DeepSeek V3’s cache needs only 70 KB per token. That’s versus Llama’s 516 KB. About 7x smaller.
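Running the arithmetic behind those two figures, using the per-token numbers quoted above:

```python
# Per-token KV-cache sizes quoted above, at a 128K-token context.
kv_per_token_kb = {"Llama 3.1 405B": 516, "DeepSeek V3 (MLA)": 70}
context = 128_000  # tokens

for name, kb in kv_per_token_kb.items():
    gib = kb * context / 1024**2   # KB -> GiB
    print(f"{name}: {gib:.1f} GiB of cache")
# Llama 3.1 405B: 63.0 GiB; DeepSeek V3 (MLA): 8.5 GiB
```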
And here is what caught me off guard when I first read the paper: MLA actually produces better quality output than standard attention. The compression forces the model to be more selective about what information it stores, and that selectivity makes it better at identifying what actually matters. I keep coming back to this pattern with DeepSeek — the constrained solution outperforming the expensive one.
Innovation #3: Training with less precise numbers
This one requires a tiny detour into how computers store numbers, but I promise it’s quick.
Numbers in a computer have a precision setting. BF16 (Brain Floating Point, 16-bit) uses 16 bits per number — plenty of precision, but it costs memory and compute. FP8 uses just 8 bits — half the space, and roughly double the speed on compatible hardware.

Most large models train in BF16. DeepSeek V3 was one of the first to successfully train at scale in FP8.
The problem with lower precision is accumulating errors. Each individual rounding is tiny, but across billions of calculations, those tiny errors accumulate. Imagine navigating a 10,000 km road trip where every GPS instruction is rounded to the nearest kilometer. Any single rounding barely matters. But after thousands of instructions, you could end up in a completely different country.
DeepSeek’s solution breaks the numbers into small blocks and gives each block its own scaling factor. They also built custom code that periodically promotes calculations back to higher precision — basically checking your actual position on a real map every so often to correct the accumulated drift.
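The blockwise idea can be simulated in a few lines. This sketch uses int8 as a stand-in for hardware FP8 formats like E4M3, and the 128-value block size is an illustrative choice; the point is that each block's own scaling factor contains the damage from outliers:

```python
import numpy as np

# Sketch of fine-grained blockwise quantization (simulated with int8;
# real FP8 training uses hardware FP8 number formats). Each block gets
# its own scale, so one huge value can't wreck the precision of every
# number around it -- only its own block.
rng = np.random.default_rng(0)

def quantize_blockwise(x, block=128):
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # per-block scale
    q = np.round(x / scale).astype(np.int8)               # 8-bit storage
    return q, scale

def dequantize(q, scale):
    # Promote back to full precision, like checking the real map
    # to correct accumulated drift.
    return q.astype(np.float32) * scale

x = rng.standard_normal(1024).astype(np.float32)
x[0] = 100.0                                  # one outlier value
q, s = quantize_blockwise(x)
err = np.abs(dequantize(q, s).ravel() - x)
print(err.max())  # the outlier only degrades its own 128-value block
```

With a single scale for all 1,024 values, every number would be quantized in steps of roughly 100/127; with per-block scales, only the outlier's block pays that price.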
This led to less than 0.25% quality loss compared to full-precision training. They ran the entire V3 training run in FP8 without a single catastrophic failure. Considering FP8 training was widely believed to be impractical at this scale, that’s a meaningful engineering achievement.
Innovation #4: Hiding the slow connection
The H800 has about 44% of the H100’s inter-chip communication bandwidth. When 2,048 GPUs need to constantly exchange information during training, that’s a serious bottleneck. Fast workers, narrow hallway.
DeepSeek built DualPipe, a system that overlaps computation with communication. While one batch of calculations is running, the results from the previous batch are being transmitted. They also designed their routing so each piece of data only needs to travel to at most 4 of the GPU groups, minimizing traffic.
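A toy demonstration of why overlap helps. The real DualPipe schedules forward and backward micro-batches across pipeline stages of GPUs; this sketch just uses `sleep` calls as stand-ins to show that transmitting batch i while computing batch i+1 hides the communication time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute(i):
    time.sleep(0.05)   # stand-in for a GPU compute step

def transmit(i):
    time.sleep(0.05)   # stand-in for slow inter-chip communication

def serial(n):
    t0 = time.perf_counter()
    for i in range(n):
        compute(i); transmit(i)        # hallway blocks the workers
    return time.perf_counter() - t0

def overlapped(n):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = None
        for i in range(n):
            compute(i)                  # compute batch i...
            if pending: pending.result()
            pending = io.submit(transmit, i)  # ...while batch i transmits
        pending.result()
    return time.perf_counter() - t0

print(f"serial: {serial(5):.2f}s  overlapped: {overlapped(5):.2f}s")
```

When compute and communication take similar time, overlapping approaches a 2x speedup: the slow hallway is still slow, but nobody stands around waiting for it.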
This is less conceptually interesting than the other innovations; it’s pure systems engineering. But without DualPipe, the bandwidth limitation would have torpedoed the entire project.
R1: When the model taught itself to think
If V3 was impressive engineering, DeepSeek R1 was something closer to a scientific discovery.
Most AI models answer instantly: they see your question, generate a response, done. OpenAI’s o1 introduced “reasoning” models that think step-by-step before answering, like a student showing their work. But OpenAI never published how they did it.
DeepSeek showed a path that was both simpler and more unsettling.
They took the V3 base model and applied reinforcement learning (RL) with only two feedback signals: is the answer correct? And is the formatting right? No human-written examples of reasoning. No curated demonstrations of “this is how good thinking looks.” Just right or wrong.
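The R1 paper's template wraps reasoning in `<think>` tags and the final answer in `<answer>` tags. A simplified stand-in for those two reward checks (not DeepSeek's actual code) looks like this:

```python
import re

# Sketch of the two R1-Zero-style reward signals: no learned reward
# model, no human demonstrations -- just "is the format right?" and
# "is the final answer correct?"
def format_reward(text):
    pattern = r"<think>.*</think>\s*<answer>.*</answer>"
    return 1.0 if re.search(pattern, text, re.S) else 0.0

def accuracy_reward(text, gold):
    m = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

out = "<think>2 + 2 makes 4</think><answer>4</answer>"
print(accuracy_reward(out, "4") + format_reward(out))  # → 2.0
```

That is the entire feedback signal. Everything described below emerged from optimizing against it.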
Over about 10,400 training steps, the model’s accuracy on the AIME math competition climbed from 15.6% to 71.0%. But the accuracy isn’t the wild part. The wild part is what the model started doing on its own.
Without anyone teaching it, the model began pausing mid-answer to reconsider. It started generating phrases like “Wait, let me rethink this...” It developed the habit of trying alternative approaches when one path wasn’t working. It learned to spend more time on hard problems and less on easy ones.
The researchers flagged a specific moment in training where the frequency of words like “wait,” “mistake,” “verify,” and “retry” suddenly spiked. The model had independently invented self-correction. Nobody programmed this behavior. The selection pressure of getting correct answers was enough.
Let that sit for a moment. A mathematical function, optimized purely on whether its answers are right or wrong, spontaneously developed the ability to doubt itself and try again.
The RL algorithm — GRPO (Group Relative Policy Optimization) — is itself a cost innovation. The standard approach (PPO) needs a separate “critic” model roughly the same size as the main model to evaluate outputs. That doubles memory requirements. GRPO skips the critic entirely by generating 16 responses to each question and comparing them against each other. The group itself becomes the grading curve. Saves about 50% of the memory overhead.
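The core of that trick fits in a few lines. This is a sketch of GRPO's advantage computation only (the full algorithm also includes a clipped policy-gradient update and a KL penalty, as in PPO):

```python
import numpy as np

# GRPO's critic-free advantage: sample a group of answers to the same
# question, score them, and use each answer's normalized deviation from
# the group mean as its advantage. The group IS the grading curve.
def grpo_advantages(rewards):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# 16 sampled answers to one question, scored 1 if correct else 0:
rewards = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
adv = grpo_advantages(rewards)
print(adv.round(2))
# Correct answers get positive advantage, wrong ones negative; those
# signals scale the policy update with no separate value network at all.
```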
R1’s final numbers: 79.8% on AIME 2024 (matching OpenAI o1), 97.3% on MATH-500 (beating o1), and a Codeforces rating in the 96th percentile of competitive programmers. API pricing: about 27x cheaper than o1.
Then DeepSeek distilled R1’s reasoning into smaller models. Distillation is when a smaller model learns to mimic a larger one’s behavior. The 32B version outperforms OpenAI’s o1-mini, and the 7B version — runnable on a decent laptop — would have been state-of-the-art two years ago. All open-source. MIT License. Free.
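The R1 distilled models were in fact produced by fine-tuning small models on R1-generated reasoning traces; the textbook form of distillation, sketched below, has the student match the teacher's output distribution directly. Either way, the principle is the same: learn from the big model's behavior, not just from raw labels.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

# Classic logit-matching distillation over one next-token prediction.
# The temperature T softens the teacher's distribution so the student
# also learns which wrong answers were "almost right."
teacher_logits = np.array([4.0, 2.5, 0.5, -1.0])   # large model's view
student_logits = np.array([1.0, 1.0, 1.0, 1.0])    # untrained student

T = 2.0
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# KL divergence: the distillation loss the student would minimize
# across the whole training set.
kl = np.sum(p_teacher * np.log(p_teacher / p_student))
print(kl)  # positive; shrinks toward 0 as the student mimics the teacher
```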
The fallout and what came after
January 27, 2025: DeepSeek’s chatbot hits #1 on the U.S. App Store. Nvidia loses ~$600 billion in market cap in a single day, the largest single-day loss for any company in U.S. stock market history.
The development pace after that was relentless. V3-0324 in March integrated R1’s RL techniques into the base model. R1-0528 in May nearly doubled reasoning depth. V3.1 in August merged thinking and non-thinking modes into one system. V3.2 in December introduced faster attention for long texts and massive agent-training expansions. The high-end V3.2-Speciale won gold at the International Math Olympiad and placed 2nd at the ICPC World Finals.
R1’s paper was published in Nature in September 2025 — a first for any major language model. The peer-reviewed training cost for R1’s reinforcement learning phase: $294,000.
V4: expected, delayed, and revealing
DeepSeek V4 was supposed to drop in mid-February 2026. It hasn’t, and the reason is telling.
After R1’s global success, Chinese authorities pushed DeepSeek to train V4 on Huawei’s Ascend chips — domestic hardware, national pride, technological independence. DeepSeek tried. The training kept failing. Stability problems, slower interconnects, immature software tools.
They reportedly switched back to Nvidia chips for training and relegated Huawei hardware to inference only (serving the model to users, which is less demanding). This tells you something important about where China’s domestic chip capabilities actually stand versus the aspirations. Progress is real, but the gap is still wide.
What we know about V4: heavy focus on coding, support for extremely long contexts (they quietly pushed their context window to 1 million tokens in early February), and internal tests showing it beats Claude 3.5 Sonnet and GPT-4o on code benchmarks.
So what does this mean for the rest of us?
The old story was: frontier AI requires billions of dollars and access to hardware that maybe five organizations on Earth can afford. If you’re not in that club, you’re a consumer, not a builder.
DeepSeek’s models are released under the MIT License. You can use them, modify them, build businesses on them, and pay nothing. The distilled R1 models run on hardware that a funded startup in Lagos, Nairobi, or Accra can realistically access today. The API pricing — $0.27 per million input tokens — is 94% cheaper than GPT-4.
For a fintech building AI-powered credit scoring, or a healthtech doing preliminary screening in underserved areas, or an agritech optimizing crop advice for smallholder farmers — the gap between “AI is too expensive for us” and “AI is our core product” just collapsed.
And the efficiency techniques are documented in public papers. MoE doesn’t require a special license. MLA isn’t patented. FP8 training is a technique, not a trade secret. The blueprints are on the table.
A note from inside the ecosystem
I will end with something personal. I am part of a local Chinese tech community here in Shenzhen where engineers, founders, and researchers meet regularly to tear apart new models and figure out what’s actually usable versus what’s just hype. The highlight of one of our DeepSeek technology breakdowns was this pattern: denied the best chips, the team invented training techniques more efficient than what those chips were designed for. Limited to a small cluster, they built communication systems that hide the bottleneck. Unable to afford mountains of human-labeled reasoning data, they discovered that pure reinforcement learning produces reasoning without it.
For those of us building technology in places where resources are limited and constraints are plenty, there might be a lesson in there worth more than any benchmark score.
🧠 Concepts Covered
- Parameters
- Model training
- Pre-training
- Post-training (SFT + RL)
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning (RL)
- Mixture-of-Experts (MoE)
- Attention mechanism
- KV-cache (Key-Value cache)
- Floating-point precision (BF16 / FP8)
- Pipeline parallelism
- Model distillation
- Context window
- AI inference vs training
⚙️ Algorithms Explained
- Group Relative Policy Optimization (GRPO)
- Proximal Policy Optimization (PPO) — referenced as comparison
- Multi-Head Latent Attention (MLA)
- Auxiliary-loss-free load balancing
- DualPipe (custom pipeline parallelism)
- Fine-grained blockwise quantization (FP8)
- Expert routing / top-k gating
🏗️ System Design Patterns
- Mixture-of-Experts (sparse activation)
- Load balancing via bias-based routing
- Latent compression for memory efficiency
- Mixed-precision training
- Node-limited routing (locality-aware sharding)
- Knowledge distillation (large-to-small model transfer)
- Pure RL without supervised warmup
Frequently Asked Questions
How does DeepSeek compare to ChatGPT and other models in actual performance?
DeepSeek V3 is competitive with GPT-4 across most major benchmarks. DeepSeek R1 matches OpenAI’s o1 on AIME 2024 math (79.8%) and surpasses it on MATH-500 (97.3%). On coding, R1 scores in the 96th percentile on Codeforces. The later V3.2-Speciale variant won gold at the 2025 International Math Olympiad. Where DeepSeek typically falls behind is in certain creative-writing tasks and some multilingual capabilities, but on reasoning and code — the areas it prioritizes — it’s genuinely frontier-class.
Can I actually use DeepSeek’s models for my own projects or business?
Yes. All major DeepSeek models (V3, R1, and the distilled variants) are released under the MIT License, which is the most permissive open-source license available. You can use them commercially, modify them, build products on top of them, and never pay DeepSeek a cent. You can access them via DeepSeek’s own API (at roughly $0.27 per million input tokens), through third-party hosting providers, or download the weights from Hugging Face and run them on your own hardware. The smaller distilled models (7B, 14B, 32B) can run on consumer-grade GPUs or even high-end laptops.
What’s the difference between DeepSeek V3 and DeepSeek R1?
V3 is the base model — it generates quick, direct responses the way ChatGPT normally does. R1 is the reasoning model — it thinks step-by-step before answering, showing its work like a student on a math exam. R1 is built on top of V3 using reinforcement learning, so it’s slower (it generates “thinking tokens” before the actual answer) but significantly better on complex tasks like math, coding, and logic puzzles. Think of V3 as the fast, general-purpose model and R1 as the deliberate, deep-thinking specialist. The later V3.1 and V3.2 updates merged both capabilities into a single model that can switch between modes depending on the task.
Did DeepSeek really train their model for only $5.6 million?
Yes and no. The $5.6 million figure is the compute cost of the final, successful training run for DeepSeek V3 — specifically 2.788 million H800 GPU hours at about $2 per hour. It’s a real number from their technical paper. But it doesn’t include earlier R&D, failed experiments, data preparation, researcher salaries, or hardware purchases. Analyst estimates put DeepSeek’s total infrastructure spending at roughly $1.6 billion. Think of $5.6M as the cost of the final exam, not the cost of the entire education. Still dramatically cheaper than competitors, but context matters.
Have Questions?
Interested in discussing this technology or exploring collaboration opportunities in China-Africa tech transfer?
Let's Connect