This marks a pivotal moment for the fundamental restructuring of AI's underlying logic. For a long time, the Transformer architecture has been trapped in a costly paradox: we use the most advanced GPU computing power to make AI models "rote memorize" static knowledge that could simply be looked up in a dictionary. The groundbreaking paper "Conditional Memory via Scalable Lookup," released in the early hours by Liang Wenfeng's DeepSeek team and their collaborators from Peking University, has completely shattered this deadlock. They propose a novel Engram module, opening a second front for sparsity, "conditional memory," alongside the traditional "conditional computation" (MoE). This is not merely a technical patch but a supply-side reform of the model's "brain capacity." It demonstrates that when "memory" is decoupled from "computation," delegating rote memorization to the "dictionary" and leaving calculation to the brain, the AI's reasoning capabilities experience a counter-intuitive, explosive surge.
DeepSeek plans to officially release V4 around the Chinese New Year in February, and this moment might just be the eve of DeepSeek V4's arrival.

Prologue: The "Futile Effort" of a Six-Layer Neural Network

The story begins with an "MRI scan" of the Transformer's internal mechanisms conducted by the DeepSeek team. Inside the black box of artificial intelligence, when a large language model encounters the phrase "Diana, Princess of Wales," an inexplicable and extremely costly "internal conflict" occurs. Researchers discovered that to recognize this fixed entity, the model surprisingly mobilizes a full 6 layers of the network:
Layers 1-2: The model is still figuring out that "Wales" is probably a country.
Layer 3: It realizes this is a geographical concept in Europe.
Layer 4: It begins to piece together that "Princess of Wales" seems to be a title.
Layer 5: It associates this with "the wife of the Prince of Wales."
Layer 6: Only here does it finally confirm that this refers to the famous "Princess Diana."
To an architect pursuing ultimate efficiency, this is a sheer waste of computational power. "Princess Diana" is an objectively existing, static entity; its essence does not change with context. To extract this fact, which could be known by simply looking it up, the Transformer employs expensive matrix operations across 6 deep layers to "reconstruct" this concept. This is akin to a genius who, before solving a calculus problem, has to spend half an hour reciting the multiplication table every single time. This mechanism of "implicit memory" forces the model to waste precious parameter capacity and network depth on simple pattern matching. In this 33-page paper, DeepSeek poses a soul-searching question: Why not simply equip the large model with a "super dictionary" that can be consulted on demand?

Chapter One: Architectural Reshaping - The Brutalist Aesthetics of the Engram Module

To solve this problem, DeepSeek proposes a novel module named "Engram (Conditional Memory)." If MoE (Mixture of Experts) divides the "brain" into different regions, having different experts responsible for different types of thinking (conditional computation), then Engram is like attaching a massive "hippocampus" to the brain, specifically responsible for storing static knowledge (conditional memory).
1. Reviving "N-gram": Finding Answers in Ancient Wisdom

The core inspiration for Engram surprisingly comes from an "ancient artifact" in the NLP (Natural Language Processing) field: the N-gram. Before deep learning dominated the world, we relied on statistics like "the probability of N words appearing together" to understand language. DeepSeek has given this classic concept a modern twist:
Traditional Transformer: Knowledge is distributed within the weights of neurons. Extracting knowledge requires complex linear layer calculations, resulting in high computational complexity.
Engram Module: It is a massive, scalable embedding table. When the model encounters fixed phrases (N-grams) like "Zhang Zhongjing" or "Four Great Inventions," it doesn't need to engage the cerebral cortex for reasoning. Instead, it directly "looks up" the corresponding vector in the memory table via a hash index.
This process has a time complexity of O(1)—meaning that no matter how large the knowledge base grows (even to 100 billion parameters), the lookup speed remains almost constant and extremely fast.
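To make the lookup path concrete, here is a minimal sketch, in PyTorch, of how a hashed N-gram embedding table could behave; the class name, hash function, and sizes are illustrative assumptions, not the paper's actual implementation. The key property is that retrieving a memory vector is a single hash plus one embedding read, independent of how many slots the table holds.

```python
import torch
import torch.nn as nn

class NgramMemory(nn.Module):
    """Minimal sketch of a hashed N-gram lookup table (illustrative, not DeepSeek's code)."""

    def __init__(self, num_slots: int = 100_000, dim: int = 256, seed: int = 17):
        super().__init__()
        self.table = nn.Embedding(num_slots, dim)  # the "dictionary": one vector per memory slot
        self.num_slots = num_slots
        self.seed = seed

    def slot_id(self, ngram_token_ids: tuple) -> int:
        # Deterministic hash of the token-id tuple into a slot index: O(1) in the table size.
        h = self.seed
        for t in ngram_token_ids:
            h = (h * 1_000_003 + t) & 0x7FFFFFFF
        return h % self.num_slots

    def forward(self, ngram_token_ids: tuple) -> torch.Tensor:
        idx = torch.tensor([self.slot_id(ngram_token_ids)])
        return self.table(idx)  # (1, dim) static memory vector for this fixed phrase

memory = NgramMemory()
vec = memory((4021, 8876))  # e.g. a 2-gram's token ids; lookup cost does not grow with table size
```

Growing the table to billions of slots changes only the storage footprint, not the per-token lookup cost, which is exactly the point of the O(1) claim.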
2. Three Technical Moats

If lookup tables are so great, why hasn't anyone done this before? Because of three major obstacles: storage explosion, polysemy conflicts, and parameter allocation. DeepSeek provides textbook-level solutions:

A. Vocabulary Compression: Extreme Deduplication

The number of possible word combinations in the world is astronomical. DeepSeek first performs a step of "lossless compression." At the tokenizer level, it normalizes words that are semantically identical but written differently. For example, "Apple" (capitalized) and "apple" (lowercase) typically refer to the same thing. Through mapping and merging, the effective vocabulary size is directly reduced by 23%. This not only saves space but also significantly increases knowledge density.

B. Multi-Head Hashing: Resolving "Hash Collisions"

It's impossible to store every possible N-gram. Engram uses "Multi-Head Hashing" technology. It maps the infinite set of N-grams to a finite number of memory slots using multiple hash functions. Although hash collisions can occur (where two different phrases map to the same slot), the "multi-head" design allows the model to piece together the correct information from multiple candidate results, greatly enhancing robustness.

C. Contextual Gating: Providing a "Referee" for Memory

This is the most ingenious part. Lookup tables are static, but language is dynamic. Take the word "apple": in the context of "eating an apple," it refers to the fruit; in the context of an "Apple product launch," it refers to the tech company. Direct lookup might introduce noise. DeepSeek designs a "Context-aware Gating" mechanism:
Query: The hidden state of the current context.
Key/Value: The static vector obtained from the lookup table.
This gate acts like a referee. If the retrieved "static knowledge" doesn't align with the current "context," the referee lowers the weight (gate value tends towards 0), instructing the model to ignore this noise. If it's a perfect match (e.g., "Treatise on Cold Pathogenic and Miscellaneous Diseases" followed by "Zhang Zhongjing"), the referee opens the gate wide (gate value tends towards 1), directly injecting the knowledge into the model.
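A rough sketch of how the multi-head lookup and the gate could be wired together is shown below; the module names, head count, and sizes are my own illustrative assumptions rather than the paper's implementation. The hidden state acts as the query, the retrieved static vector as the key/value, and a sigmoid gate decides how much of the memory to inject.

```python
import torch
import torch.nn as nn

class GatedEngramRead(nn.Module):
    """Illustrative sketch: multi-head hashed lookup fused via a context-aware gate."""

    def __init__(self, dim: int = 512, num_heads: int = 4, slots_per_head: int = 50_000):
        super().__init__()
        # One embedding table per hash head; a collision in one head can be corrected by the others.
        self.tables = nn.ModuleList([nn.Embedding(slots_per_head, dim) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * dim, dim)  # merge the candidate vectors from all heads
        self.gate = nn.Linear(2 * dim, 1)            # the "referee": compares context with memory

    def forward(self, hidden: torch.Tensor, slot_ids: torch.Tensor) -> torch.Tensor:
        # hidden:   (batch, dim)        current context representation (the query)
        # slot_ids: (batch, num_heads)  one hash index per head for the current N-gram
        reads = [table(slot_ids[:, h]) for h, table in enumerate(self.tables)]
        memory = self.proj(torch.cat(reads, dim=-1))                        # (batch, dim) static knowledge
        g = torch.sigmoid(self.gate(torch.cat([hidden, memory], dim=-1)))   # gate value in (0, 1)
        return hidden + g * memory   # g -> 0: ignore noisy lookups; g -> 1: inject the knowledge
```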
Chapter Two: The Golden Ratio - Discovering the AI Model's "U-Shaped Curve"

With the architecture designed, the next question is: how to divide the "assets"? Assuming the VRAM in our graphics cards is limited and the total parameter budget is fixed, how many parameters should we allocate to the MoE "experts" (responsible for computation) and how many to the Engram "dictionary" (responsible for memory)? This is a classic resource allocation game. The DeepSeek team conducted a large-scale ablation study, scanning allocation ratios from 0% to 100%, and the results plotted a perfect "U-shaped Scaling Law curve."
This U-shaped curve reveals a fundamental law of AI model design:
Left Extreme (Pure Engram): If all parameters are allocated to the dictionary, the Loss is high. The model becomes a "bookworm," good only at rote memorization without logical reasoning ability.
Right Extreme (Pure MoE): If all parameters are allocated to the experts, the Loss is also high, because the experts are forced to spend their energy on memorizing static knowledge, leaving no time for their real job.
Golden Ratio Point (ρ ≈ 75%-80% to MoE): When we allocate approximately 20%-25% of the sparse parameter budget to Engram and the rest to MoE, the model's validation Loss reaches its lowest point (a back-of-the-envelope split is sketched below).
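As a simple illustration of what that ratio means in practice (my own arithmetic, not a formula from the paper), splitting a fixed sparse budget at roughly that point lands close to the hybrid configuration described in the next chapter.

```python
def split_sparse_budget(total_sparse_params: float, rho_moe: float = 0.79):
    """Split a fixed sparse-parameter budget between MoE experts and the Engram table.

    rho_moe is the share kept for the MoE experts; the paper's reported sweet spot
    leaves roughly 20-25% of the budget for Engram.
    """
    moe_params = total_sparse_params * rho_moe
    engram_params = total_sparse_params * (1.0 - rho_moe)
    return moe_params, engram_params

moe, engram = split_sparse_budget(27e9, rho_moe=0.79)
print(f"MoE experts: {moe / 1e9:.1f}B, Engram table: {engram / 1e9:.1f}B")
# -> 21.3B for experts, 5.7B for Engram, close to the Engram-27B hybrid described below
```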
This is an extremely instructive finding: for large models with hundreds of billions of parameters, simply stacking computational units (MoE experts) is already subject to diminishing marginal returns. It is necessary to introduce a dedicated static memory module to achieve a "memory-computation balance."

Chapter Three: Counter-intuitive Explosion - Why Does "Looking Up a Dictionary" Improve "Math Scores"?

If Engram merely made models "better at memorization," the weight of this paper wouldn't be enough to shake the community. After all, RAG (Retrieval-Augmented Generation) can also address knowledge issues. What truly stunned the industry were the unexpected gains revealed in the experimental results. DeepSeek constructed three comparative models, strictly controlling for identical activated parameter count (3.8B) and training data volume (262B tokens):
Dense-4B: Traditional dense model.
MoE-27B: Pure MoE model (72 experts).
Engram-27B: Hybrid model (55 experts + 5.7B Engram parameters).
The results were astonishing:

1. Expected: Dominance in Knowledge Tasks

On MMLU (comprehensive knowledge), the Engram model improved by 3.4 points; on CMMLU (Chinese knowledge), it improved by 4.0 points. This is easily understood: with an external dictionary, common sense naturally improves, and hallucinations decrease.

2. Unexpected: Comprehensive Surge in Logic, Code, and Math

Logically, "looking up a dictionary" shouldn't relate to "solving math problems." Yet, on BBH (broad reasoning), Engram-27B surprisingly outperformed the pure MoE baseline with the same parameter count by a full 5.0 points!
MATH: Improved by 2.4 points.
HumanEval (code generation): Improved by 3.0 points.
ARC-Challenge (complex reasoning): Improved by 3.7 points.
3. In-depth Analysis: The Effective Depth Theory

Why? How can a "rote memorization" module enhance intelligence? Using LogitLens and CKA (Centered Kernel Alignment) techniques, the DeepSeek team "dissected" the model's internals. They discovered a startling phenomenon. Remember "Princess Diana" from the beginning? In the pure MoE model, the initial layers were busy "piecing together the concept." In the Engram model, because the Engram module is inserted as early as the 2nd layer, the retrieval of static knowledge is completed at a very early stage. This means that the initial network layers originally used for "rote memorization" are liberated! This is equivalent to "virtually increasing" the model's depth. The freed-up network layers and attention heads no longer need to handle trivial local dependencies (like identifying who "Zhang Zhongjing" is), allowing them to focus entirely on more complex global reasoning, long-range logic construction, and code logic generation. The essence of Engram is not to "replace" reasoning, but to "offload" mundane tasks, enabling the brain to concentrate on higher-dimensional thinking.
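For readers who want to reproduce this kind of analysis on their own models, a minimal linear CKA implementation (a common formulation; the paper may use a different variant) looks like this:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two activation matrices.

    X: (n_samples, d1) and Y: (n_samples, d2), e.g. hidden states of the same tokens
    taken from two different layers (or from two different models).
    """
    X = X - X.mean(axis=0, keepdims=True)  # center the features
    Y = Y - Y.mean(axis=0, keepdims=True)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(numerator / denominator)

# A high CKA score between an early layer of the Engram model and a deeper layer of the
# MoE baseline would support the "effective depth" reading: early layers are freed from
# entity reconstruction and behave like later layers of the baseline.
```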
Chapter Four: Engineering Marvel - Breaking Nvidia's "VRAM Hegemony"

For Wall Street investors and data center operators, the sexiest part of this paper lies not in the scores, but in the cost. In the AI era, the most expensive resource is not compute (FLOPs), but memory (HBM). A significant reason for the high cost of Nvidia's H100 is its scarce HBM memory. Engram introduces a disruptive characteristic: complete separation of storage and computation.

1. The Pain Point of MoE: The VRAM Devourer

In traditional MoE models, the routing mechanism is dynamic. The model must first compute the features of the current token; only after finishing a layer does it know which expert to consult next. This means all expert models must reside in expensive GPU VRAM at all times, ready for dispatch.

2. Engram's Breakthrough: Deterministic Foresight

Engram's lookup logic is deterministic. Once the input text is determined (e.g., "A New Axis of Sparsity"), the corresponding N-gram index is determined. We don't need to wait for the model to finish the previous layer; the moment a token enters the model, we know exactly which row of which table it needs to look up.

3. CPU's Counterattack: Stuffing Large Models into RAM

This characteristic brings enormous engineering benefits:
Offload: We can place Engram vocabularies with hundreds of billions, or even trillions, of parameters directly into cheap, abundant, and easily expandable CPU memory (DRAM), or even on NVMe SSDs.
Prefetching: While the GPU is busy computing the previous Transformer layer, the CPU uses the PCIe bus to asynchronously "prefetch" the memory data required for the next layer and push it to the GPU.
This hides latency and enables parallel processing. DeepSeek's actual test data shows: Even when mounting a 100B (hundred-billion) parameter Engram table onto CPU memory, the throughput degradation compared to pure GPU inference is less than 3%. This is a conclusion that delights everyone anxious about HBM shortages. It means that for future large models, "memory capacity" can be expanded infinitely at low cost, without being constrained by Nvidia's VRAM limitations.
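A rough sketch of the prefetching idea, under my own assumptions about the setup (PyTorch, a CUDA device, and a DRAM-resident table), is below: because the slot indices depend only on the input tokens, the needed rows can be gathered on the CPU and copied to the GPU on a side stream while earlier layers are still computing.

```python
import torch

# Stand-in for a large Engram table kept in host DRAM instead of GPU VRAM.
engram_table_cpu = torch.randn(1_000_000, 256)
copy_stream = torch.cuda.Stream()  # requires a CUDA device

def prefetch_engram_rows(slot_ids: torch.Tensor) -> torch.Tensor:
    """Gather only the rows this batch needs and start an async host-to-device copy."""
    rows = engram_table_cpu[slot_ids].pin_memory()     # gather on CPU, pin for async transfer
    with torch.cuda.stream(copy_stream):
        rows_gpu = rows.to("cuda", non_blocking=True)  # overlaps with compute on the default stream
    return rows_gpu

# Usage: kick off the copy as soon as the token ids (and thus slot ids) are known,
# then synchronize right before the Engram layer consumes the rows:
#   torch.cuda.current_stream().wait_stream(copy_stream)
```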
Chapter Five: Victory for Long Context - The Leap in NIAH Testing

Beyond general reasoning, Engram's performance in the long context domain also proves the value of "division of labor." In long-text processing, the attention mechanism's window is limited. If attention is occupied by a large amount of local information (like fixed phrases), its ability to process global information diminishes. After Engram takes over local dependencies, the attention mechanism can finally lift its head and look at the road ahead. In the rigorous RULER benchmark tests, the performance of Engram-27B is astounding:
Multi-Query NIAH (Needle in a Haystack): Soared directly from the MoE baseline's 84.2 points to 97.0 points.
Variable Tracking: Improved from 77.0 points to 89.0 points.
This indicates that after outsourcing "local memory" to Engram, the Transformer's original attention mechanism can more efficiently capture subtle clues and connections within documents tens of thousands of words long.
Epilogue: The Puzzle Pieces of DeepSeek V4 Are Emerging

Stringing all this information together, we can faintly discern the outline of DeepSeek's next-generation model: DeepSeek V4. It has been reported that DeepSeek plans to officially release V4 in February (around the Chinese New Year). Reviewing DeepSeek's pace: from R1 in January 2025, to V3.2 surpassing GPT-5 benchmarks by year-end, to the upcoming V4, each step accurately follows the pulse of technological iteration. If R1 demonstrated the depth of "reasoning," and V3 showcased the efficiency of "MoE," then the upcoming V4, potentially by introducing Engram technology, would complete a clear architectural progression:
DeepSeek V2: Introduced MLA (Multi-head Latent Attention), compressing the KV Cache and solving the inference VRAM bottleneck.
DeepSeek V3: Optimized MoE (Mixture of Experts) with auxiliary-loss-free load balancing, addressing training stability and computational cost.
DeepSeek V4 (Speculated): Introduces Engram (Conditional Memory), resolving the coupling of memory and computation and achieving perfect symbiosis between the "electronic brain (computation)" and "external memory (Engram)."
This is not a simple version iteration; it is a systematic surgical procedure addressing the underlying flaws of the Transformer architecture. After DeepSeek V3 has already swept the globe with its extremely low API prices and powerful performance, if V4 integrates Engram technology, it will bring even more formidable competitiveness: it will possess a larger knowledge base (low-cost memory expansion), stronger logical reasoning (liberated network depth), and lower inference costs (storage-computation separation). More importantly, reports mention improvements in V4's data pattern understanding, "avoiding the performance degradation seen in previous models under prolonged training." This aligns perfectly with Engram's characteristic of solidifying static knowledge and reducing the burden on dynamic networks—it makes the model more stable, less prone to "forgetting" or "confusion." At the end of the paper, the DeepSeek team confidently writes:
"We envision conditional memory as an indispensable modeling primitive for next-generation sparse models."
This paper, released on the eve of the Chinese New Year, is not just a technology showcase for DeepSeek; it is a signal to the entire industry: the brute-force era of simply "competing on compute" and "piling on parameters" is over. The dividend period of architectural innovation has just begun. And in this race to define the next generation of AI standards, Chinese large models are not just keeping pace; they are potentially redefining the rules of the game. 2026 may or may not bring a "Normandy moment" for China's commercial aerospace sector, but the moment for "separating storage and computation" in the AI field might be right now.

Paper Address: https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf
Open Source Address: https://github.com/deepseek-ai/Engram