A Bit Late but Ultimate Analysis: DeepSeek

tldr: if you want to find out everything about DeepSeek, this blogpost has three separate sections with increasing technical detail and difficulty, so you can stop reading at any point once you’ve got enough information or the material becomes too difficult to comprehend.

DeepSeek recently released their latest model, R1, which shows performance comparable to the latest available OpenAI models at a much lower computational cost. Unfortunately, due to a lot of optimistic claims by their team and a lot of difficult-to-comprehend innovations introduced in their work, plenty of rumours and misunderstandings are circling around this model.

In this blogpost we will briefly break down the most common rumours and speculations about the R1 model, give detailed but easily comprehensible explanations of all DeepSeek innovations in this model, explain why it was so cheap to train and so easy to operate, and in the end provide a deeper explanation of the most difficult parts of their research, so you can understand how it works down to the last bit.

Rumours and speculations breakdown

  1. “DeepSeek R1 is on the same level as OpenAI models, but much cheaper!” While DeepSeek’s inference is definitely much cheaper, its performance superiority is not so clear. Yes, it shows comparable or better performance than some of OpenAI’s models on several open benchmarks, but this holds true only for math and coding; it shows much worse results on other common tasks. There is also independent research suggesting it is worse on more general math and coding tasks outside of popular benchmarks, which was partially confirmed by the latest AIME competition (see the Data Labelling Pipeline NB for details). I definitely recommend thinking about this model more as a competitor to Google Gemini Flash Thinking than to OpenAI’s full-fledged models.
  2. “DeepSeek is dirt-cheap to use!” Well, yes and no. Yes, you can use the DeepSeek model through their official API for a fraction of the cost of other popular models like LLama. But unfortunately their team was not ready for such hype, so their API is down very often and very unstable. And if you try to run it internally or buy other APIs that host it, you will quickly find out that it is several times more expensive to do so. The main problem is that while the model weights and the white paper were openly published, the hardware-specific source code was not, and it contains tons of optimizations that make this model cheaper to run.
  3. “DeepSeek stole OpenAI’s data!” From what we are seeing in our internal and other independent tests, this statement seems quite unlikely to be true and was probably made to calm down OpenAI’s investors. Later, in the second section, you will see some details on their innovative technique for gathering data, provided in the DeepSeekMath paper. And in the third section we will discuss how this technique was further improved and changed to produce DeepSeek-R1-Zero and then DeepSeek-R1. These innovations also contradict that initial OpenAI statement.
  4. “DeepSeek spent $5.58 million to train — over 89 times cheaper than OpenAI’s rumored $500 million budget for its o1 model!” Well, that’s complete nonsense. While $5.58 million is probably a true number and it is much cheaper than competitors, we are talking about a 4–8x difference at most. The main issue is that $5.58 million was spent only on the single final training run of the model; for other comparably sized models with known costs, that figure lies between $7 and $20 million. This price tag does not incorporate all the intermediate runs, which are usually much cheaper individually, but there can be up to several hundred of them. It also does not include data gathering, research, development, and staffing costs. So in the end a completely developed DeepSeek model probably cost at least $200 million. Nevertheless, they introduced a lot of innovations to reduce both the training and inference costs, which we discuss later in this blogpost.

Innovations breakdown

Now let’s take a look at all the optimisations and innovations made by DeepSeek. I will mostly focus either on general scientific achievements or on technical cost-reduction innovations. This section is still oriented towards the general public, so I hope it will be easy to digest.

1. Low-Level Optimization for Faster Computation

Most AI models are trained using PyTorch, a popular deep-learning framework that provides ease of use but adds extra computational overhead. For faster training, many advanced AI teams work directly with NVIDIA’s NCCL (a high-performance library for communication between GPUs). However, DeepSeek went even deeper — they customized this communication layer itself, optimizing GPU Streaming Multiprocessors (SMs) using the very low-level PTX (Parallel Thread Execution) assembly language. This low-level tuning allowed them to better match their specific hardware architecture, reducing latency and improving data transfer between GPUs. This approach was introduced in their DeepSeek V2 paper.

2. 8-Bit Hybrid Training Instead of 32-Bit for Cost Efficiency

Most AI models are trained in 32-bit (FP32) or 16-bit (FP16) floating-point precision. This is a standard approach that ensures stability but requires significant computational power. DeepSeek was able to stabilize 8-bit (FP8) training, drastically cutting memory usage and increasing speed. But they didn’t just naively apply 8-bit across the board, which is well known to be unstable. They used a hybrid approach where most layers operated in FP8, but carefully picked operations were accumulated in 32-bit precision when needed for stability. This “Floating Point Adaptive” (FPA) training balances efficiency and accuracy while reducing training costs and memory requirements.
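To give an intuition for the hybrid idea, here is a minimal, purely illustrative PyTorch sketch (not DeepSeek’s actual kernels): values are stored in FP8 to save memory, while the multiply-accumulate itself happens in FP32. The function name and shapes are made up for the example, and it requires a recent PyTorch version that ships the `float8_e4m3fn` dtype.

```python
import torch

def fp8_matmul_sim(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Simulate FP8 storage: quantize both operands to float8_e4m3fn,
    # then upcast and accumulate in FP32 for numerical stability.
    # Real FP8 training uses fused GPU kernels instead of this round-trip.
    x8 = x.to(torch.float8_e4m3fn).to(torch.float32)
    w8 = w.to(torch.float8_e4m3fn).to(torch.float32)
    return x8 @ w8

x = torch.randn(4, 16)
w = torch.randn(16, 8)
print((fp8_matmul_sim(x, w) - x @ w).abs().max())  # small quantization error
```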

3. Mixture of Experts (MoE) for Massive Parameter Efficiency

DeepSeek R1 uses a Mixture of Experts (MoE) architecture, meaning that instead of activating all 671 billion parameters during inference, it selectively activates only 37 billion. This drastically reduces computational load while still leveraging a large model’s capability. While the MoE approach itself is well known and has already been used by OpenAI and Mistral models, DeepSeek gave it an extra spin. MoE introduces a new challenge — balancing the GPU workload. Since only a subset of experts is active at any given time, not all GPUs are used equally, and some of them are basically idling, waiting for data. Instead of relying on NVIDIA’s default load management, DeepSeek developed a custom load balancer to optimally distribute work across the concrete GPU infrastructure they had, according to their specific architecture.
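As a rough illustration of how MoE routing activates only a few experts per token, here is a minimal PyTorch sketch. This is not DeepSeek’s actual router; the expert count, dimensions, and top-k value are arbitrary, and real systems run the experts on different GPUs, which is exactly where the load-balancing problem comes from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, gate, experts, k=2):
    # x: (tokens, d_model). The gate scores every expert, but only the
    # top-k experts per token are actually run -- the rest stay idle.
    probs = F.softmax(gate(x), dim=-1)
    topk_p, topk_idx = probs.topk(k, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e        # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += topk_p[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

d_model, n_experts = 32, 4
gate = nn.Linear(d_model, n_experts)
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
    for _ in range(n_experts)
)
tokens = torch.randn(10, d_model)
print(moe_forward(tokens, gate, experts).shape)  # torch.Size([10, 32])
```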

4. Optimized Hardware Choices for US Export-Limited GPUs

Training and running large models depend on three key factors:

  • Compute power (FLOPs) – Main speed multiplier for training base LLMs.
  • Memory bandwidth – How fast GPUs can access and process data.
  • Interconnect speed – How efficiently GPUs communicate with each other.

Due to US export restrictions, DeepSeek was unable to access the highest-end NVIDIA GPUs, which limited them in FLOPs. However, they made up for this with NVIDIA providing specialized export-compliant cards with high memory bandwidth and fast interconnect speeds, higher than on NVIDIA’s top-performing server GPUs. This turned out to be more important for reasoning models (models optimized for tasks like problem-solving and step-by-step reasoning rather than raw number crunching), which DeepSeek-R1 is. So, unintentionally, NVIDIA helped them overcome the US export limitations, at least for their reasoning model. I assume this might result in additional restrictions later.

5. Efficient Attention Mechanism: MLA Attention

Traditional Transformer models, like those introduced in the famous “Attention Is All You Need” paper, have quadratic complexity in their attention mechanism, meaning computational cost grows rapidly with longer input sequences. DeepSeek R1 uses Multi-head Latent Attention (MLA), which compresses keys and values into smaller latent representations, reducing memory use and computation while maintaining accuracy. This helps improve speed and scalability when processing large inputs. Moreover, they once again paired it with a low-level hardware-specific implementation; this approach showed up to a 50% performance boost in attention calculations when applied by other AI labs, so it is probably comparable here.
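To make the latent-compression idea concrete, here is a minimal sketch of an attention layer whose keys and values are reconstructed from a small cached latent vector. This illustrates the general idea only, not DeepSeek’s exact MLA layer: all dimensions and names are made up, and causal masking and positional encodings are omitted.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Illustrative latent-KV attention sketch (not DeepSeek's exact MLA layer)."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # only this small output is cached
        self.k_up = nn.Linear(d_latent, d_model)      # keys rebuilt from the latent
        self.v_up = nn.Linear(d_latent, d_model)      # values rebuilt from the latent
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q = split(self.q_proj(x))
        latent = self.kv_down(x)                      # (b, t, d_latent) -- the KV "cache"
        k, v = split(self.k_up(latent)), split(self.v_up(latent))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
print(LatentKVAttention()(x).shape)  # torch.Size([2, 16, 512])
```

The practical win is that during generation only the small latent per token needs to be kept in memory instead of full per-head keys and values.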

6. From TRPO to PPO and GRPO: Evolution of Reinforcement Learning

DeepSeek R1 improves training stability by leveraging policy optimization techniques from reinforcement learning. Originally, Trust Region Policy Optimization (TRPO) was used in many RL-based training approaches, but it had limitations — it imposed strict constraints that could slow down learning. The transition to Proximal Policy Optimization (PPO) relaxed these constraints while maintaining stability, making it more efficient for fine-tuning AI models. The main issue with PPO is that it requires storing an additional model that approximates a special value function used to optimize the LLM’s parameters. DeepSeek introduced a novel approach called Group Relative Policy Optimization (GRPO), based on PPO, which completely removes this costly requirement. For more details on this approach, see the last section of this blogpost.

7. Self-Learning with Automated Rule-Based Rewards

While it is not really related to the cost of the final training run or to inference costs, one of DeepSeek’s most cost-effective strategies was minimizing human intervention in fine-tuning. Instead of relying heavily on Reinforcement Learning from Human Feedback (RLHF), which requires expensive human labelers, they introduced a rule-based self-learning system with two types of rewards:

  • Accuracy Rewards – For tasks with clear right/wrong answers (e.g., math problems, programming challenges), the system automatically evaluates correctness using predefined test cases or expected formats.
  • Format Rewards – The model was trained to structure its reasoning process clearly by placing intermediate thoughts between <think> and </think> tags, making its responses more interpretable.

This automation reduced costs while, surprisingly, maintaining high-quality learning outcomes. While the idea of this approach is not novel, the model was able to effectively train itself to reason from the ground up, which had not been properly achieved before. I will focus more on the whole pipeline in the next section. A minimal sketch of what such rule-based rewards might look like is shown below.
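The following is a hypothetical, simplified version of such rewards, assuming answers end in a LaTeX `\boxed{}` expression and reasoning is wrapped in `<think>` tags as described above. It is not DeepSeek’s actual reward code, just an illustration of how cheap and automatic these checks can be.

```python
import re

def accuracy_reward(model_answer: str, reference: str) -> float:
    # Hypothetical checker: compare the final \boxed{...} answer to a known reference.
    match = re.search(r"\\boxed\{(.+?)\}", model_answer)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def format_reward(model_output: str) -> float:
    # Reward outputs that wrap their reasoning in <think>...</think>
    # and then give a non-empty answer after the closing tag.
    ok = re.fullmatch(r"(?s)\s*<think>.*?</think>\s*\S.*", model_output) is not None
    return 1.0 if ok else 0.0

print(format_reward("<think>2 + 2 = 4</think> The answer is 4."))        # 1.0
print(accuracy_reward(r"... so the result is \boxed{4}", "4"))            # 1.0
```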

8. Complicated but Efficient Dataset Generation and R1 Training Pipelines

In their work they used the original DeepSeekMath paper as a starting point. In that paper they utilised the open Common Crawl repository and expanded it through multiple iterations of a semi-automated approach, using an old-fashioned FastText model to filter and annotate webpages. As a result they obtained a good reasoning dataset of math and programming problems. These kinds of problems not only involve internal reasoning — that reasoning can also be validated automatically.

DeepSeekMath showed outstanding performance in math and programming tasks within its weight class. From there they trained the DeepSeek-R1-Zero model using a simple prompt template and the automated rewards you’ve seen in the previous point. Unfortunately DeepSeek-R1-Zero was mixing languages in its thinking process, so they had to perform extra steps to obtain DeepSeek-R1. You can get more technical details in the next section.

This approach excluded Supervised Fine-Tuning (SFT) — the process of using a big, specially labelled dataset (in this case with handcrafted reasoning chains) to train the initial model. It also excluded Reinforcement Learning from Human Feedback (RLHF) — a long process of running the model again and again and using humans to evaluate its outputs. As you can imagine, both of these processes are quite costly.

9. Potentially Lower Safety Standards?

Some experts speculate that DeepSeek R1 was able to ship faster and more affordably by cutting back on certain safety features. One indicator is that the model sometimes incorrectly identifies itself as “ChatGPT” instead of “DeepSeek,” suggesting that less effort was spent on refining safety guardrails and brand-specific fine-tuning. This makes sense for an open-source model, where users are expected to modify and adapt the AI themselves. The model also has almost no safeguards and produces harmful and discriminatory outputs with ease, so clearly much less resources were spent there. But maybe it is even better for some applications: try to automatically translate dubs for a TV show where the main characters swear a lot using OpenAI’s models, and you will get rejected pretty fast. Just to be clear: DeepSeek’s official API still has some extra guardrails incorporated, but most of them are not in the model weights themselves.

Technical breakdown

In this section we will focus on some deeper technical details that will give you a better perspective on some of the innovations and the math behind the scenes, and also provide some extra evidence that their corpus and research are both novel, contradicting some of OpenAI’s claims.

Data Labelling Pipeline and DeepSeek-R1-Zero

As a foundation for their data labelling, DeepSeek-R1 used the DeepSeekMath corpus, which was constructed from the open Common Crawl dataset. In their paper they provide this picture of the iterative pipeline:

[Figure: the iterative data-collection pipeline from the DeepSeekMath paper]

  • It starts with an initial seed corpus, the OpenWebMath dataset — a small, high-quality math dataset.
  • Then they trained a simple and lightweight fastText model (from 2016!) using 500k data points from it as positive examples, and the same number of web pages from Common Crawl as negative ones (see the sketch after this list).
  • In the next step they applied this model to deduplicated URLs (i.e. pages with the same URL prefix were merged into one data point) to find math-related pages, preserving only the top-ranking ones.
  • As the initial dataset lacked diversity, their next step was to find “disjoint domains”, i.e. internet resources where some percentage of web pages were math-related.
  • After finding these domains, pages from them were labeled manually, adding more positive examples to the corpus, and the cycle starts over again with the new math seed.
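For intuition, here is a hypothetical, minimal version of the fastText filtering step described above. The file name, labels, and hyperparameters are made up; this is not DeepSeek’s actual pipeline code, just the standard fastText supervised-classification API.

```python
import fasttext  # pip install fasttext

# Hypothetical training file: one page per line, prefixed with "__label__math"
# (seed-corpus pages) or "__label__other" (random Common Crawl pages).
model = fasttext.train_supervised(input="seed_pages.txt", epoch=5, wordNgrams=2)

# Score a candidate page; keep only the top-ranking math-like pages.
label, prob = model.predict("Prove that the sum of two even numbers is even.")
print(label, prob)  # e.g. ('__label__math',), [0.97]
```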

NB. Some of these websites contain tasks from well-known benchmarks. DeepSeek’s team applied extra filtering to avoid benchmark contamination in their training data, but as the latest American Invitational Mathematics Examination (AIME) competition showed, although all models saw a notable decline in performance, R1 suffered a far greater drop. This might be a signal that some degree of benchmark contamination remained.

Obtaining DeepSeek-R1

The first model they created was DeepSeek-R1-Zero. Basically, they took DeepSeek-V3, took their math and code dataset, and trained it with the following prompt using simple rule-based RL training:

[Figure: the prompt template used for DeepSeek-R1-Zero training]

They used the same reward model I showed in point 7 of the previous section.

From that point they had to transition to R1. Why do we need such a complicated pipeline instead of simply using DeepSeek-R1-Zero once we’ve got it? Unfortunately, this model suffers both from poor readability and from mixing English and Chinese in its reasoning. While tests showed that a single-language restriction reduced benchmark metrics, it was still the preferable way to go, as the main point of this model is to show a proper and understandable reasoning process behind the answer.

Before moving forward just a small reminder: Reinforcement Learning (RL) is a machine learning approach where an agent learns to make decisions by performing actions and receiving feedback in the form of rewards or penalties, aiming to maximize cumulative rewards over time.

  1. It starts with a pre-trained DeepSeek-V3, which is an LLM trained in the standard way like all other LLMs, but using the optimizations we’ve discussed in the previous section.
  2. Perform Supervised Fine-Tuning on this V3 model on a carefully selected small set (several thousand samples) of R1-Zero outputs that were manually validated as high-quality and readable.
  3. Apply the same reasoning self-learning procedure as for R1-Zero, using the math and coding dataset where auto-validation is possible for calculating the Reinforcement Learning rewards.
  4. Apply rejection sampling. Among all the samples generated in step 3, DeepSeek-V3 is used as an external expert that decides which samples should be kept (a minimal sketch of this idea follows the list). This helps to generate more reasoning chains across more general-purpose domains.
  5. Once again, perform reinforcement-learning-based training. At this stage rule-based rewards are applied for areas where it is possible (like math); for others, LLM validation is used.
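Here is a hypothetical sketch of the rejection-sampling idea from step 4, assuming a `generate` function for the model being trained and a `judge` function backed by an external expert model such as DeepSeek-V3. The names, candidate count, and threshold are made up for illustration.

```python
def rejection_sample(prompts, generate, judge, n_candidates=8, threshold=0.7):
    # For each prompt, sample several candidate answers and keep only
    # those the external judge model scores above a quality threshold.
    kept = []
    for prompt in prompts:
        for _ in range(n_candidates):
            answer = generate(prompt)
            if judge(prompt, answer) >= threshold:
                kept.append((prompt, answer))
    return kept
```

The kept (prompt, answer) pairs then serve as additional fine-tuning data covering domains where no automatic rule-based check exists.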

Deep dive into the TRPO → PPO → GRPO transition

While TRPO and PPO were already known in the RL domain, GRPO is new and was proposed by the DeepSeek team in the DeepSeekMath paper. Let’s start from the beginning to understand how it works.

In Reinforcement Learning you usually have some Actor A and some Environment E. E gives you an observation (in this case a question q) and A gives an output (in this case a direct answer or a chain-of-thought answer, depending on the model). The last element of the schema is the reward that E gives to A depending on the answer quality.

In our case this actor internally contains a neural network (an LLM); in mathematical terms we can call it a policy $\pi_{\Theta}(obs)$, where $\Theta$ represents the tunable parameters of the LLM. The output can then be denoted as $o = LLM(q, \Theta)$. The task is to fine-tune the LLM’s parameters to get the most reward.

The main issue is that in order to tune the LLM you need some loss function $L(o, \bar o)$, where $\bar o$ is the correct answer. Using the loss you can calculate gradients and update the model parameters. But in our problem statement we do not have correct answers, as most of the data is unlabelled. So instead we perform the following trick.

We perform an action and assume that this action was correct. In this case the loss is $L = -\log(\pi(obs)) \cdot reward$. As usual, we calculate the gradient and perform gradient descent; the reward determines how big a step should be taken. Since we do not have a way to calculate it directly against a known correct answer, we introduce a new function, the advantage: $A = (r - b)$, where $r$ is the post-action reward and $b$ is a baseline.

The reward $r(obs, act)$ is calculated via (1) some external reward estimation, like a compiler with tests in the case of code, (2) some direct internal validation via unsupervised or rule-based metrics, or (3) an LLM-as-a-judge setting, where you use an external LLM or even train one in parallel with this one. DeepSeek went with the direct approach, which is described in point 7 of the previous section.

The baseline $b$ is calculated via a value function: a regression model pre-trained on the labelled data you have, answering the question “what will be the average reward for an action from a given state?”.

Then the loss can be written as

\begin{align} L=-\log(\pi(obs))\cdot A =-\log(\pi(obs))\cdot (r-b), \hspace{10mm} \tag{1} \end{align}

where the advantage $A>0$ when the action we performed is better than the average expectation, and $A<0$ when it is worse.
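As a tiny illustration of equation (1), here is what this loss looks like in PyTorch for a single sampled answer — a sketch under the notation above, not production training code.

```python
import torch

def policy_gradient_loss(log_prob: torch.Tensor, reward: float, baseline: float) -> torch.Tensor:
    # log_prob: log pi(o | q), summed over the generated tokens of one answer.
    advantage = reward - baseline        # A = r - b
    return -log_prob * advantage         # L = -log pi(obs) * A

log_prob = torch.tensor(-12.3, requires_grad=True)  # toy value for one sampled answer
loss = policy_gradient_loss(log_prob, reward=1.0, baseline=0.4)
loss.backward()   # the gradient pushes up the probability of better-than-baseline answers
print(log_prob.grad)
```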

TRPO, or Trust Region Policy Optimization, works the following way. You have a gradient, but you assume that it is dangerous to trust it too much, as it was produced by a random stochastic process (by working with concrete data samples). To account for that, you modify the original loss by adding the KL-divergence, which measures how different two distributions are:

\begin{align} L_{new}=L_{old}+D_{KL}(\pi, \hat \pi). \hspace{10mm} \tag{2} \end{align}

Basically, you measure how different your new policy is from the previous one and apply an extra penalty on that difference, forcing gradient descent not to move too far away from the policy you had, which adds extra stability to the optimization process. Unfortunately, TRPO is computationally intensive: to perform this estimation you need to calculate extra derivatives, make second-order approximations, evaluate the loss landscape, and perform an extra line search. So the PPO approximation was developed instead.

PPO, or Proximal Policy Optimization, has a more complicated formula to work with; let’s break it down:

\begin{align} J_{PPO}(\Theta)=\mathbb{E}\left[\min\left(\frac{\pi_{\Theta}(o_t|q,o_{<t})}{\pi_{old}(o_t|q,o_{<t})}A_t,\ \text{clip}\left(\frac{\pi_{\Theta}(o_t|q,o_{<t})}{\pi_{old}(o_t|q,o_{<t})},\,1-\varepsilon,\,1+\varepsilon\right)A_t\right)\right] \hspace{10mm} \tag{3} \end{align}

At its core it once again has the policy multiplied by the advantage, $\pi_{\Theta}(o_t|q,o_{<t})\cdot A_t$. The $\pi_{old}$ in formula (3) is the already-instantiated model that produced the outputs $o$ for question $q$, so its probabilities are just numbers that can be computed directly from the current snapshot. The $\pi_{\Theta}$ is the model whose parameters you are optimizing, and it will be used as the old one on the next step. The whole equation ends up similar to (1), but with different pre-calculated constants and a slightly different form. The main idea is that the ratio corrects for the sampling bias of the old policy: completions that are rare under the old policy are undersampled, so they must be weighted more strongly.

But as we find more radical new policies, this ratio drastically increases the first term and moves the new policy too far away, which is where the second term under the $\min$ comes into play. The $\text{clip}$ function imitates a simple cutting rule that works the same way as TRPO’s constraint, but without the complicated calculations.

The next thing they did was apply the same mechanism shown in (2), but instead of using heavy calculations to obtain the KL-divergence, they constructed a similar lightweight term:

\begin{align} D_{KL}(\pi_{\Theta}\,\|\,\pi_{ref})=\frac{\pi_{ref}(o_t|q,o_{<t})}{\pi_{\Theta}(o_t|q,o_{<t})}-\log\frac{\pi_{ref}(o_t|q,o_{<t})}{\pi_{\Theta}(o_t|q,o_{<t})}-1 \hspace{10mm} \tag{4} \end{align}

The $\pi_{ref}$ here is the reference model, which is usually the initial SFT model they had at the start of the whole optimization process. The main idea is that while we want to perform RL optimization, we still assume the initial model already had a reasonably good representation of the world, and we do not want to move too far away from it. Although it is not exactly the same KL-divergence penalty as used in TRPO, it gives similar results.

Next, they decided to move from the per-sample advantage to a more computationally efficient approach. They form groups of outputs by directly sampling from the old policy $\pi_{\Theta_{old}}$ and optimize the policy model $\pi_{\Theta}$ by maximizing this objective:

\begin{align} J_{GRPO}(\Theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\left(\frac{\pi_{\Theta}(o_{i,t}|q,o_{i,<t})}{\pi_{old}(o_{i,t}|q,o_{i,<t})}\hat A_{i,t},\ \text{clip}\left(\frac{\pi_{\Theta}(o_{i,t}|q,o_{i,<t})}{\pi_{old}(o_{i,t}|q,o_{i,<t})},\,1-\varepsilon,\,1+\varepsilon\right)\hat A_{i,t}\right)-\beta\, D_{KL}(\pi_{\Theta}\,\|\,\pi_{ref})\right)\right] \hspace{10mm} \tag{5} \end{align}

It is the same as (3), but with (4) added at the end, with group sampling instead of per-sample sampling, and with a new advantage function $\hat A_{i,t}$ calculated based only on the relative rewards of the outputs inside each group. This group-relative advantage calculation makes it possible to drop the value model, which usually requires running a secondary LLM instance, and thus cuts a lot of computational resources.
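For intuition, here is a minimal sketch of the group-relative advantage, assuming (as in the outcome-reward setting) that each output’s reward is standardized within its group. This is an illustration, not DeepSeek’s training code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (G,) rule-based rewards for G sampled answers to the same prompt.
    # Each answer's advantage is its reward standardized within the group,
    # replacing the learned value-function baseline that PPO needs.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one math question, scored 1 if correct, 0 otherwise.
print(group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
```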

If you want more commentary on how and why this works, you can read the original R1 paper. But the main benefit of the transition from PPO to GRPO is summarized in the following image.

[Figure: comparison of the PPO and GRPO training pipelines]

To sum up, as a result of this transition:

  • The computationally heavy value model is excluded.
  • Instead of estimating the average possible reward with a value model, we simply sample several outputs, evaluate them with the reward model, and use the average reward as our “value” in the older terminology.
  • For the reward model we use the simple rule-based system described in point 7 of the previous section, which also reduces computational costs.
  • The heavy KL-divergence calculation is replaced with a lightweight approximation.

As a result, we only need to do some extra sampling and apply a lightweight reward model to get the average reward, and then apply essentially the same procedure as in PPO with a few extra tweaks, while nearly halving all the other, much heavier, calculations.

Outro

Hopefully this blogpost gave you a better understanding of the foundations of DeepSeek’s innovations. I tried to collect everything you need to know about it and to tackle every rumour we have had so far. You can share it with anyone who makes provocative and unsupported claims about various aspects of the R1 model; hopefully it helps.

