On February 26, UNISOUND (09678) announced the launch of its document intelligence foundation model, "Unisound U1-OCR." The company describes it as the first industrial-grade document intelligence base model and says it opens the OCR 3.0 era. Building on layout comprehension, the model works with the deeper semantics of documents, enabling automatic classification and business-level information extraction. UNISOUND frames this as a qualitative leap from "character perception" to "document cognition," that is, from merely "recognizing characters" to "understanding business logic."

According to the company, Unisound U1-OCR delivers state-of-the-art (SOTA) performance on multiple authoritative benchmarks. Its core advantage lies in overcoming the limitation of traditional models that "only read text without understanding layout," allowing it to "interpret" complex documents like a human expert.

To meet the OCR 3.0 requirement for business-level structured extraction, Unisound U1-OCR adopts a ViT + LLM architecture. The visual encoder uses the NaViT architecture to process document images at dynamic resolution. The model has 3 billion parameters, balancing computational efficiency with the ability to understand deep semantic information in documents.

The model introduces several innovations. It pioneers a "semantic-driven + dynamic focusing" strategy that automatically constructs a "semantic map" of a document to identify the hierarchical relationships between headings, charts, and body text, so that it can "understand structure first, then read content." It also has strong "spatial perception": it models the spatial layout between elements and uses dynamic resolution to reconstruct document structure precisely.
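For readers unfamiliar with dynamic resolution, the following is a minimal sketch of the general NaViT-style idea: each page is cut into patches at its own aspect ratio instead of being resized to a fixed square, and the resulting variable-length patch sequences are packed together. It is not UNISOUND's implementation. The function names, the patch size of 16, and the 1,024-token budget are assumptions chosen for illustration.

```python
import math


def patch_grid(width, height, patch=16, max_tokens=1024):
    """Return (cols, rows) of patches for one page image.

    The page keeps its native aspect ratio. If the native grid would exceed
    max_tokens, both sides are scaled down by the same factor. The patch size
    and token budget are illustrative assumptions, not U1-OCR settings.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows > max_tokens:
        scale = math.sqrt(max_tokens / (cols * rows))
        # The clamps keep extreme aspect ratios within budget, at the cost of
        # some distortion in those rare cases.
        cols = max(1, min(max_tokens, math.floor(cols * scale)))
        rows = max(1, min(max_tokens // cols, math.floor(rows * scale)))
    return cols, rows


def pack_sequences(token_counts, seq_len):
    """Greedy first-fit packing of per-page token counts into sequences.

    Returns a list of bins, each a list of page indices whose patches share
    one sequence of at most seq_len tokens.
    """
    bins, free = [], []
    for i, n in enumerate(token_counts):
        if n > seq_len:
            raise ValueError(f"page {i} needs {n} tokens, budget is {seq_len}")
        for b, space in enumerate(free):
            if n <= space:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing sequence has room, so start a new one.
            bins.append([i])
            free.append(seq_len - n)
    return bins


if __name__ == "__main__":
    pages = [(800, 200), (1024, 1024), (1240, 1754)]  # hypothetical page sizes in pixels
    grids = [patch_grid(w, h) for w, h in pages]
    counts = [c * r for c, r in grids]
    print(grids, pack_sequences(counts, 1024))
```

In this sketch a wide, short page such as a receipt strip keeps its elongated grid, while a large scan is shrunk only as much as the token budget requires.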
Additionally, the model employs Multi-Token Prediction (MTP): while predicting the current token, it also considers the probability distributions of multiple future tokens, which the company says significantly improves logical coherence on long documents. Combined with a full-task reinforcement learning strategy, this strengthens the model's global foresight of layout structures and, according to UNISOUND, improves generation efficiency by over 80% during inference.

On the business side, the model is built around industrial-grade scenario demands and offers four core capabilities: precise traceability, business integration, secure and efficient deployment, and superior adaptability. UNISOUND says this meets the full-scenario requirements of real enterprise operations, carrying business implementation from "comprehension" to "execution."

UNISOUND presents the launch of Unisound U1-OCR not only as an innovation in document intelligence but also as a key step for the company toward Artificial General Intelligence (AGI). The company aims to use multimodal documents as a knowledge entry point, endowing machines with autonomous reasoning and evidence traceability capabilities and moving AI from perception to cognition. In the future, UNISOUND aspires to build general intelligent agents that can read, think, and solve complex problems like humans, making every document a stepping stone toward AGI.
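As a rough illustration of the Multi-Token Prediction idea mentioned above, the sketch below builds, for each position in a token sequence, the next k tokens as training targets and averages the cross-entropy over all of them. It shows the generic technique only. The announcement does not describe U1-OCR's actual MTP objective, and the function names, the value of k, and the toy probabilities are assumptions.

```python
import math


def mtp_targets(tokens, k):
    """For each position t, return the next k tokens (fewer near the end)."""
    return [tokens[t + 1 : t + 1 + k] for t in range(len(tokens) - 1)]


def mtp_loss(head_probs, targets):
    """Average cross-entropy over all future-token predictions.

    head_probs[t][d] is a dict mapping token -> probability, predicted at
    position t for the token d + 1 steps ahead (hypothetical model output).
    """
    total, count = 0.0, 0
    for probs_t, targets_t in zip(head_probs, targets):
        for d, token in enumerate(targets_t):
            total += -math.log(probs_t[d][token])
            count += 1
    return total / count


if __name__ == "__main__":
    tokens = ["Invoice", "No", ":", "42"]
    targets = mtp_targets(tokens, k=2)
    # Toy predictions: every head assigns probability 0.5 to the true token.
    head_probs = [[{tok: 0.5} for tok in tgt] for tgt in targets]
    print(targets, mtp_loss(head_probs, targets))
```

With k = 1 this reduces to ordinary next-token training. Larger k asks the model to commit to what comes further ahead, which is the "foresight" the announcement refers to.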