Uncovering the Hidden Lineage of Large Models: Fine-Tuning and Distillation Secrets Revealed

Deep News · 04-25 19:21

The large language model ecosystem has evolved far beyond just a few dominant models. The number of models on Hugging Face continues to grow rapidly. Different families, various architectures, and unique tokenizers, combined with a multitude of fine-tuned, distilled, and adapted versions, have transformed the ecosystem into a fast-expanding "model jungle."

A key challenge is determining the genealogical relationships between models. It is often impossible to discern from model cards or release notes which capabilities are inherited from upstream models and which are merely superficial similarities. This lack of clarity impacts our understanding of the model landscape and has implications for model governance, safety audits, and the design of multi-agent systems.

Existing methods for analyzing model relationships have significant limitations. Some rely on specific tasks, making it difficult to capture a model's overall characteristics. Others only work on fixed sets of models, lacking scalability for new additions. Certain approaches depend heavily on tokenizers or internal architectures, which prevents their application to heterogeneous models. Fundamentally, the field still lacks a universal, stable, and scalable way to represent a model's "identity."

To address this challenge, a joint research team from the National University of Singapore and Shanghai Jiao Tong University has proposed "LLM DNA." This approach attempts to characterize the "kinship" between models based on their functional behavior, similar to studying biological evolution. The team not only provided a mathematical definition for LLM DNA but also introduced RepTrace, a training-free extraction method, validated on 305 large models. Results indicate that this "DNA" can identify relationships between models and even construct phylogenetic trees for large language models.

The core idea of LLM DNA is to create a unified representation based on a model's functional behavior rather than its surface-level parameters. The researchers term this low-dimensional representation, distilled from functional behavior, "LLM DNA." If two models exhibit similar response patterns across many prompts, their DNA representations should be close; if they are functionally different, their DNA should reflect that distance.

The paper further demonstrates that this representation possesses two properties analogous to biological DNA: "heritability," meaning the DNA does not change drastically after fine-tuning or evolution, and "genetic determinism," meaning models with similar DNA typically exhibit more similar behavior.

To operationalize this concept, the authors proposed RepTrace, a training-free DNA extraction pipeline. This method first constructs a unified set of probe inputs and collects the textual responses from different models. A frozen sentence embedding model then encodes these responses into semantic embeddings, which are concatenated into a high-dimensional functional representation. Finally, leveraging the Johnson-Lindenstrauss lemma, a random Gaussian projection compresses this high-dimensional representation into a low-dimensional DNA space. The key is not just dimensionality reduction but preserving the relative geometric structure of models' functional behaviors, ensuring semantically similar models remain close in the DNA space.
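The pipeline above can be sketched with toy data. This is a minimal illustration, not the paper's implementation: the per-probe vectors here are random stand-ins for what would, in RepTrace, come from a frozen sentence embedder applied to each model's responses, and all dimensions are invented. The point it demonstrates is the Johnson-Lindenstrauss step: one shared random Gaussian projection compresses the concatenated functional representation while approximately preserving pairwise distances, so a fine-tuned descendant stays close to its parent in DNA space while an unrelated model does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_dna(response_embeddings, proj):
    """Concatenate per-probe embeddings and project into the DNA space."""
    flat = np.concatenate(response_embeddings)  # high-dimensional functional representation
    return proj @ flat                          # random Gaussian (JL) projection

n_probes, embed_dim, dna_dim = 50, 32, 16
high_dim = n_probes * embed_dim

# One projection shared by all models; by the JL lemma, pairwise distances
# are approximately preserved with high probability.
proj = rng.normal(size=(dna_dim, high_dim)) / np.sqrt(dna_dim)

# Toy stand-ins for frozen sentence embeddings of three models' responses.
base = [rng.normal(size=embed_dim) for _ in range(n_probes)]
fine_tuned = [e + 0.05 * rng.normal(size=embed_dim) for e in base]  # small behavioral drift
unrelated = [rng.normal(size=embed_dim) for _ in range(n_probes)]

dna_base = extract_dna(base, proj)
dna_ft = extract_dna(fine_tuned, proj)
dna_other = extract_dna(unrelated, proj)

# The descendant stays close in DNA space; the unrelated model does not.
print(np.linalg.norm(dna_base - dna_ft) < np.linalg.norm(dna_base - dna_other))  # True
```

Because the projection is fixed once, any new model's responses can be embedded and projected through the same matrix, which is what makes the representation comparable across models without retraining.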

Notably, the probe inputs do not require carefully designed task-specific data. The research shows that even using randomly generated text, created without any LLM involvement, the extracted DNA maintains strong discriminatory power. In relationship prediction tasks, this random input setup achieved an AUC of 0.987. This suggests LLM DNA captures stable functional features from general inputs, independent of specific benchmark formats, and helps reduce bias from particular evaluation sets, training corpora, or prompt distributions. For a new model, extracting its DNA using the same inputs and pipeline allows for immediate comparison within the existing framework without retraining or adjusting other models' representations.
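A probe set of this kind is easy to construct. The sketch below shows one hypothetical way to generate a fixed pool of random-text probes with no LLM involved; the seed keeps the set identical across runs, so every model, present or future, is queried with the same inputs and its DNA remains directly comparable. The alphabet and lengths are illustrative choices, not the paper's.

```python
import random
import string

random.seed(0)  # fixed seed: the same probe set is reused for every model

def random_probes(n, length=40):
    """Random character strings used as probe inputs (no LLM needed)."""
    alphabet = string.ascii_lowercase + " "
    return ["".join(random.choices(alphabet, k=length)) for _ in range(n)]

probes = random_probes(5)
print(len(probes), len(probes[0]))  # 5 40
```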

A significant aspect of this work is its extensive experimental validation, covering 305 models from 153 institutions, encompassing different architectures, parameter scales, and including both base and instruction-tuned models. Results show that relationship detection based on LLM DNA achieved an AUC close to 0.99, significantly outperforming several baseline methods. This indicates LLM DNA can reliably distinguish between related models and those with weak or no apparent relationship.

Furthermore, DNA can help uncover potential relationships not explicitly documented. A t-SNE visualization of the 305 models showed that models from the same institution or family naturally clustered together. Some models without clear recorded origins were positioned near their likely upstream families, suggesting LLM DNA can reveal hidden evolutionary clues beyond just confirming known relationships.

Beyond identification, DNA can be used for model routing. Tested in the same routing setup as EmbedLLM, the frozen DNA representation achieved a routing accuracy of 0.672 on the test set, slightly higher than EmbedLLM's 0.665. Crucially, EmbedLLM's representation was specifically trained for the routing task, whereas LLM DNA received no task-specific training, indicating it is closer to a task-agnostic "foundational representation" for models.
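The paper's routing setup follows EmbedLLM, which is more involved than what fits here. As a much simpler illustration of why a frozen, task-agnostic representation helps with routing, the sketch below assigns a never-evaluated model the routing profile of its DNA-nearest neighbor; all model names, dimensions, and accuracy numbers are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
dna_dim, n_models = 8, 4

# Frozen DNA vectors for models whose per-query-type accuracy is already
# measured, plus one new, never-evaluated model (names are hypothetical).
known_dna = {f"model_{i}": rng.normal(size=dna_dim) for i in range(n_models)}
known_acc = {f"model_{i}": rng.uniform(0.4, 0.9, size=10) for i in range(n_models)}

new_dna = known_dna["model_2"] + 0.01 * rng.normal(size=dna_dim)  # behaves like model_2

# Nearest neighbor in DNA space lends its routing profile to the new model.
nearest = min(known_dna, key=lambda m: np.linalg.norm(known_dna[m] - new_dna))
borrowed_profile = known_acc[nearest]  # reuse the neighbor's measured accuracies
print(nearest)  # model_2
```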

Beyond the large-scale experiment, LLM DNA's value was demonstrated in a real-world analysis of a new model. Before GLM 5.1 information was fully public, the research team used the LLM DNA workbench to analyze `openrouter/pony-alpha`. The results showed its highest DNA similarity was with `z-ai/glm-4.7`, significantly higher than with models like Gemini. From a functional behavior perspective, this provided strong clues about its potential membership in the GLM lineage. Unlike relying on public documentation, naming conventions, or scattered rumors, this judgment is based directly on functional representations derived from model responses, constituting a "behavior-based genealogical analysis."

Quantifying distances between models naturally leads to constructing a "family tree" for the entire LLM world. The team built a phylogenetic tree based on DNA distances, which reflected real-world evolutionary patterns: the general shift from encoder-decoder to decoder-only architectures, the evolution of different families over time, and the branching structures of families like Llama, Qwen, and Gemma. The study also observed varying "evolutionary speeds" among different family branches.
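One standard way to turn pairwise DNA distances into such a tree is agglomerative clustering. The numpy-only single-linkage sketch below (the study's actual tree-building method may differ) uses invented "family" vectors: each parent and its fine-tune merge first, before the two families join, reproducing in miniature the branching structure described above.

```python
import numpy as np

rng = np.random.default_rng(2)
dna_dim = 8

# Toy DNA vectors: two hypothetical families, each a parent plus a fine-tune.
parent_a, parent_b = rng.normal(size=dna_dim), rng.normal(size=dna_dim)
dnas = [
    ("family_a_base", parent_a),
    ("family_a_ft", parent_a + 0.05 * rng.normal(size=dna_dim)),
    ("family_b_base", parent_b),
    ("family_b_ft", parent_b + 0.05 * rng.normal(size=dna_dim)),
]

# Greedy single-linkage agglomeration: repeatedly merge the two closest
# clusters; the merge order sketches a phylogenetic tree.
clusters = [([name], vec[None, :]) for name, vec in dnas]
merges = []
while len(clusters) > 1:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(np.linalg.norm(a - b)
                    for a in clusters[i][1] for b in clusters[j][1])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    names_i, vecs_i = clusters[i]
    names_j, vecs_j = clusters[j]
    merges.append(tuple(sorted(names_i + names_j)))
    clusters[i] = (names_i + names_j, np.vstack([vecs_i, vecs_j]))
    clusters.pop(j)

# The first two merges pair each parent with its own fine-tune.
first_two = set(merges[:2])
print(first_two == {("family_a_base", "family_a_ft"),
                    ("family_b_base", "family_b_ft")})  # True
```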

This is one of the most compelling aspects of the work. Previously, discussions about model evolution often relied on release dates, model names, official statements, or community experience. LLM DNA offers a different perspective: reconstructing relationship maps directly from models' actual performance. For an increasingly complex LLM ecosystem, this ability to "infer lineage from behavior" is inherently valuable.

From a practical standpoint, LLM DNA could offer several direct benefits. First is model provenance. If a model raises security, copyright, or licensing issues in the future, DNA could serve as evidence to help determine its origin and evolutionary relationships. Second is model governance. For companies or platforms managing numerous models, DNA could provide a new tool for quickly identifying similar models, deciding which to retain, and spotting approximate variants of existing models. Third is multi-model system design. Quantifying "kinship distance" between models could lead to more rational routing, ensemble methods, and even task allocation in multi-agent collaboration—key motivations mentioned in the paper's introduction.

Naturally, LLM DNA is not a single low-dimensional vector that explains everything about a model. More accurately, it provides a more unified and scalable way to "observe models." Previously, relationships between models were often guessed from public information or analyzed through scattered case studies. Now, there is a method to systematically identify these potential genealogical connections.

The appeal of the LLM DNA work lies not just in coining a new term but in advancing an area many recognized as important but lacked unified tools for. In an era of proliferating models, increasingly complex versions, and less transparent public lineages, it asks: Can we, like "testing DNA," discern a model's similarities, inheritance, and hidden connections from the way it answers questions?

From this perspective, the most valuable contribution of this ICLR 2026 Oral paper is not just a near-0.99 score, but that it makes "discovering models' hidden lineages" more systematic, actionable, and practically applicable.

