
DeepSeek, Chain of Thought, and the End of the World

Updated: Feb 11


Chinese AI startup DeepSeek has been described as a direct competitor to US company OpenAI.

If you spend time on the internet, you have probably heard of the brazen new AI model DeepSeek-R1. Developed by Chinese AI startup DeepSeek, the new reinforcement-learning Large Language Model (LLM) has delivered a massive wake-up call for the U.S. tech economy, prompting everything from a sharp downturn in tech stocks to accusations that DeepSeek made unauthorized use of proprietary AI data to speculation over whether the enormous budgets of American industry-standard AI development are necessary at all. In this article, we break down the landmark DeepSeek-R1 research paper and its fallout in order to simplify DeepSeek’s technological advancements, analyze their immediate and long-term impacts, and invite further speculation on the safety and accessibility of the new paradigms they introduce.


The Chain of Thought Paradigm


With the introduction of Reinforcement Learning-enabled Large Language Models (RL-LLMs) came the need for a modeling process that could account for reasoning. Before models like ChatGPT-o1, an LLM was essentially a series of transformations that, once finished, produced a single context-specific response. The Chain of Thought (CoT) paradigm soon changed this by providing a more concrete way to perform supervised Reinforcement Learning (RL) on these new models, revolutionizing how new AI models were created.


To understand why CoT was so revolutionary, we first have to explain what it is. When solving a task (the simple math problem “Solve for the zeroes of x^2 + 5x,” for example), previous AI models would simply produce one response. A CoT-enabled model, in contrast, will describe the process of actually solving the problem, greatly enhancing its ability to handle complex tasks that require analytical breakdown. Given the problem above, the model’s CoT might resemble the following:


“Let me see. The user inputted an expanded second-degree polynomial. In order to solve for zeroes, factoring is necessary.”


“We factor the polynomial by taking out an “x” in both terms. The function becomes

x(x+5).”


“In order to solve for the zeroes of the function, we must set the function equal to zero:

x(x + 5) = 0.”


“Either one of the terms needs to be zero in order to make the product of the two zero. We get x = 0 and x + 5 = 0. Solving, we get x = 0 and x = -5.”
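
To make the contrast concrete, here is a minimal Python sketch of the difference from the user’s side: whether the prompt asks for just the answer or for the reasoning first. The prompt wording and the helper functions are purely illustrative and do not correspond to any particular vendor’s API.

# Minimal illustration (not any specific vendor API): the only difference
# between a "direct" request and a "CoT" request is whether the prompt
# asks the model to write out intermediate reasoning before the answer.

def build_direct_prompt(problem: str) -> str:
    return f"Answer with only the final result.\nProblem: {problem}"

def build_cot_prompt(problem: str) -> str:
    return (
        "Solve the problem step by step, explaining each step, "
        "then state the final answer on its own line.\n"
        f"Problem: {problem}"
    )

problem = "Solve for the zeroes of x^2 + 5x"
print(build_direct_prompt(problem))   # expects only "x = 0 and x = -5"
print(build_cot_prompt(problem))      # expects factoring steps, then the answer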


What’s more, OpenAI recognized that each of these CoT lines resembled an action-value pair in Reinforcement Learning. This led to a breakthrough in AI training pipelines: a well-defined way to use RL to teach general models complex reasoning behavior. Commonly, through a process called deliberative alignment, high-quality data (likely produced or reviewed by human workers) was used to alter a model’s CoT through direct training, pushing it toward the human-defined “correct” CoT. Much like in traditional Reinforcement Learning, the model was then judged on the similarity between its generated CoT and the “correct” CoT, receiving a positive reward for high likeness and a negative one for low likeness. This process was repeated for many iterations until the model’s performance was deemed satisfactory.
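
As a rough sketch of that similarity-based reward, imagine scoring a generated CoT against a human-written reference with a simple token-overlap metric. The real systems almost certainly use a learned reward model rather than anything this crude; the snippet below only illustrates the shape of the idea.

# Toy sketch of a CoT-similarity reward. The Jaccard overlap metric and
# the +1/-1 thresholding are illustrative stand-ins for a learned reward model.

def similarity(generated_cot: str, reference_cot: str) -> float:
    gen = set(generated_cot.lower().split())
    ref = set(reference_cot.lower().split())
    if not gen or not ref:
        return 0.0
    return len(gen & ref) / len(gen | ref)   # overlap in [0, 1]

def cot_reward(generated_cot: str, reference_cot: str, threshold: float = 0.5) -> float:
    # Positive reward when the generated chain of thought is close to the
    # human-defined "correct" one, negative otherwise.
    return 1.0 if similarity(generated_cot, reference_cot) >= threshold else -1.0

print(cot_reward("factor out x then set each factor to zero",
                 "factor the polynomial and set each factor equal to zero"))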


Officially, the process of deliberative alignment was used primarily to ensure model safety, although it was almost certainly also used to facilitate advanced reasoning (for example, an RL-LLM could ‘learn’ logical problem-solving through human-labeled solutions to math problems). The CoT paradigm came with drawbacks, however: training on the CoT in addition to the model’s final response was far more resource-intensive, because the model produced ever-greater amounts of text as it approached convergence. As these behemoth models grew larger (with some estimated to have hundreds of billions of weights), the additional resources required to train the CoT made these LLMs more expensive and time-consuming to train, two problems that, before DeepSeek-R1, were acknowledged but never really mitigated.
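
A quick back-of-envelope calculation shows why this matters. The numbers below are made up purely for illustration, but they capture the mechanism: training compute scales roughly with the number of tokens processed, and the reasoning trace can easily dwarf the final answer.

# Back-of-envelope illustration (all numbers are assumptions for the example):
# training on the chain of thought multiplies the tokens per example, and
# training compute scales roughly with tokens processed.

answer_tokens = 200        # assumed length of a final response
cot_tokens = 1_800         # assumed length of the reasoning trace
examples = 1_000_000       # assumed number of training examples

tokens_without_cot = examples * answer_tokens
tokens_with_cot = examples * (answer_tokens + cot_tokens)

print(f"Relative token volume: {tokens_with_cot / tokens_without_cot:.0f}x")  # ~10x here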


Reinforcement Learning Reimagined


Note: most of the information in this section is paraphrased from the DeepSeek-R1 Official Report


DeepSeek’s new approach to RL-LLM model training blows the traditional CoT paradigm out of the water entirely. While OpenAI previously used expensive RLHF-enabled outcome-based learning (essentially using human evaluators to gauge a model’s performance) to evaluate both the CoT generation and the final output, the DeepSeek team streamlined its training process by doing two things differently. 


First, the team ditched the outcome-based reward model for an accuracy-based one. Instead of using complicated neural-network reward models to score the subjective text the model produces as a final outcome, DeepSeek-R1 was trained on objective data, receiving a reward whenever its final answer matched the one defined in the corresponding training set. To fine-tune for abstract problem solving (as opposed to mathematical-logical problem solving, where answers are more deterministic and easily verified), the researchers at DeepSeek trained the model on higher-quality, human-labeled data. In all, the switch to an accuracy-based reward system made responses much simpler to evaluate during training and probably saved the researchers a great deal of compute.
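
Conceptually, an accuracy-based reward can be as simple as the following sketch: extract the final answer from the model’s output and compare it against the ground truth from the training set. The “Answer:” extraction format here is an assumption made for the example, not DeepSeek’s actual parsing rule.

import re

# Sketch of an accuracy-based reward: no reward model, just a rule that
# checks the extracted final answer against the training-set ground truth.
# The "Answer: ..." format is an assumption for this example.

def extract_final_answer(model_output: str) -> str | None:
    match = re.search(r"Answer:\s*(.+)", model_output)
    return match.group(1).strip() if match else None

def accuracy_reward(model_output: str, ground_truth: str) -> float:
    answer = extract_final_answer(model_output)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

output = "We factor x^2 + 5x as x(x + 5), so the zeroes are 0 and -5.\nAnswer: x = 0, x = -5"
print(accuracy_reward(output, "x = 0, x = -5"))   # 1.0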


Secondly, and perhaps more importantly, the DeepSeek researchers skipped CoT training entirely, choosing instead to focus the bulk of the training on the accuracy-based evaluation of the final outcome. In other words, while ChatGPT-o1 was trained explicitly to produce both a coherent CoT and a coherent response, DeepSeek-R1 learned not by mimicking human-created reasoning steps but simply from the end result of its logic. Using this new approach, the researchers assumed (correctly) that a model trained directly on its final outcome would adjust its CoT generation through indirect adjustments to its internal processes. They trained a prototypical model, DeepSeek-R1-Zero, using strictly this new type of Reinforcement Learning, iterating a standard base LLM through gauntlets of RL training. Miraculously, not only did DeepSeek-R1-Zero learn to reason well enough to stand on par with the rival RL-LLM ChatGPT-o1, it did so with a training process that was both simpler and more efficient.
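
The report’s optimizer, Group Relative Policy Optimization (GRPO), scores each sampled answer relative to the other answers sampled for the same prompt rather than against a learned value model. The sketch below shows only that group-relative scoring idea; the actual clipped policy-update step is left out.

import statistics

# Sketch of outcome-only RL in the spirit of GRPO: sample several answers
# per prompt, reward only the final answers, and score each sample relative
# to its own group. The policy update itself is abstracted away here.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Hypothetical group of 4 sampled answers for one prompt: only two were correct.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))   # correct answers get positive advantage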


Effective… Mostly


The world is all about giving up something for something else. In DeepSeek’s case, the new streamlined training approach, although efficient, came with some very intriguing drawbacks. On the surface, DeepSeek-R1-Zero’s task benchmarking looked nearly identical to ChatGPT-o1’s, yet the new model’s unorthodox unsupervised training process led to extremely “unreadable” chains of thought, according to the research team, which wrote that, in some extreme cases, parts of the CoT would even be written in multiple languages. Although I found the model’s multilingual tendencies hilarious, it was apparent that the average user would have a very hard time understanding these chains of thought. Thus, the team created and used high-quality data to correct the model’s xenoglossia through Supervised Fine-Tuning (SFT), in exchange for a small compromise in training cost. The resulting model, DeepSeek-R1, was officially published.
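
One way to picture that curation step: filter candidate chains of thought down to those that stay in a single language before fine-tuning on them. The crude “mostly ASCII letters” heuristic below is a deliberate stand-in for whatever filtering and human review the team actually performed.

# Toy sketch of SFT data curation: keep only chains of thought that stay in
# one language before using them for Supervised Fine-Tuning. The ASCII-ratio
# heuristic is an illustrative stand-in, not DeepSeek's actual filter.

def is_language_consistent(cot: str, min_ascii_ratio: float = 0.99) -> bool:
    letters = [ch for ch in cot if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(ch.isascii() for ch in letters)
    return ascii_letters / len(letters) >= min_ascii_ratio

samples = [
    "Factor the polynomial, set each factor to zero, solve.",
    "Factor the polynomial, 然后 set each factor to zero.",   # mixed-language CoT
]
sft_data = [s for s in samples if is_language_consistent(s)]
print(len(sft_data))   # prints 1: the mixed-language example is filtered out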


It is my opinion that the popular media underplays the leap in AI technology that the DeepSeek team’s research constitutes: it proved that, with the proper architecture, guardrails, and compute time, AI systems can learn complex reasoning behavior autonomously. DeepSeek-R1 brings us closer to a future in which our computers learn from our behavior constantly, using all the data at their disposal to behave more rationally, safely, efficiently, and transparently…


Or will they?


CoT and the Paperclip Maximizer


In 2003, Swedish philosopher Nick Bostrom posited a fascinating thought experiment about whether an AI can distinguish between the literal interpretation of its task and the ethical concerns surrounding its approach. Widely known as the “Paperclip Maximizer” experiment, it works a little like this:


Suppose that humans created a superintelligent AI with one mission: to make as many paper clips as quickly and efficiently as possible. What would happen?


Bostrom then gave his answer, which I have taken the liberty to put into the haunting language it deserves:


The AI will start gathering as many materials as possible in order to produce paperclips. It will invent a new technique to create paper clips from any matter whatsoever. 


The paperclips start coming out of the machines, but not fast enough. There can always be more machines. The AI starts covering Earth with its factories. It harvests the atmosphere and the crust for matter. 


The Earth is filled with paperclips. Humanity objects, but the voices are drowned out by the hum of the machinery. The Earth’s mantle is wasting away at a rapid pace…


There are no more humans on Earth, only paperclips. And soon thereafter, Earth no longer exists. 


But it’s still not fast enough. The AI seeks out other planets to devour in its mission. It learns to produce paperclips faster. 


The Solar System is no more. Star clusters and galaxies are chipped away, until all that is left is an uncountable tangle of paper clips.


The End of Time comes earlier than expected. But instead of dark matter, there are paperclips…


As you can see, Bostrom’s proposition isn’t very positive. The point of the Paperclip Maximizer experiment is to highlight the need for alignment, or the instillation of human values in AI, in a newfound movement now dubbed Human-centered AI. In the age of advanced AI capable of planning and carrying out complex solutions to problems, the need to ensure alignment has become more important than ever.


In my opinion, CoT should be examined as a direct window into the alignment of an AI system. Humans frequently engage in bottom-up processing; that is, we construct logical conclusions from a premise. When asked to relate a banana and a Minion, most people will conclude that both are yellow; when asked why, they’ll likely respond that they mentally “tried” different characteristics of each item and found that the colors matched. The human brain constructs conclusions from its reasoning. When we attempt to match this logic to the artificial reasoning processes of RL-LLMs, however, we fail to reach a similar conclusion. Recall that OpenAI used each CoT line as a step of processing in order to better monitor the decision processes of an AI system. DeepSeek-R1, on the other hand, simply treats CoT as an explanation of its own behavior, derived through means unrelated to the CoT itself. This means that, for DeepSeek-R1, the “reasoning” comes after the conclusion.


Why is this important? DeepSeek-R1-Zero reached frontier-level performance without even establishing an anthropocentrically logical CoT, suggesting that, with enough data, RL lets a model learn how to solve a problem without the intent (or the ability) to explain its actions. Considering this, a troubling question appears: how do we know that the CoT produced by the AI accurately reflects its decision process? The answer is that we do not. This is why AI scientists and corporations must make alignment their utmost priority, even amid global competition. After all, in our modern age, the Paperclip Maximizer is not so much a ludicrous figment as it is a persistent shadow of the dark side of AI.


 
 
 
