Datalab Releases lift: A 9B Open-Weights Vision Model That Extracts Structured JSON From PDFs Using Schemas

Datalab introduces lift, a 9B open-weight vision model that accurately extracts structured JSON from PDFs using schema-constrained decoding and abstention training, achieving 90.2% field accuracy.

Researchers at Datalab have released lift, a 9B open-weight vision model designed to turn unstructured PDF documents and images into structured JSON that conforms to a given schema. The release targets a persistent problem in document processing: extracting accurate, reliable data from complex layouts while keeping the output consistent with a predefined data structure.

Schema-Constrained Decoding and Abstention Training

lift uses schema-constrained decoding so that the extracted data adheres strictly to a given schema, which rules out malformed or off-schema output. It is also trained to abstain: when it cannot confidently extract a field, it returns null instead of hallucinating a value. The aim is more reliable output than extraction approaches that fill in every field whether or not the document supports it.
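To make the two guarantees concrete, here is a minimal Python sketch of what a schema-conforming output with abstention looks like. It is not lift's API or Datalab's implementation: the `INVOICE_SCHEMA` fields and the `conforms` helper are hypothetical, and the check runs after generation, whereas schema-constrained decoding enforces the same property while the model is generating.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
# Every field is treated as nullable so the model can abstain.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "total": float,
    "due_date": str,
}


def conforms(output_json: str, schema: dict) -> bool:
    """Return True if the output is a JSON object with exactly the schema's
    keys, where each value is either null or of the declared type."""
    try:
        data = json.loads(output_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(schema):
        return False
    for key, expected in schema.items():
        value = data[key]
        if value is None:
            # Abstention: null is always an acceptable value.
            continue
        if expected is float:
            # Accept JSON integers and floats, but not booleans.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
        elif not isinstance(value, expected):
            return False
    return True


# The due date was not found in the document, so the field is null
# rather than a guessed value.
example = '{"invoice_number": "INV-1", "total": 12.5, "due_date": null}'
print(conforms(example, INVOICE_SCHEMA))
```

With constrained decoding, an output that fails a check like this would not be produced in the first place; abstention training addresses the separate question of when null is the right value for a field.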

Performance and Benchmark Results

On a benchmark of 225 documents, lift achieved 90.2% field accuracy. That result points to potential real-world use in industries such as finance, healthcare, and legal services, where extraction errors are costly. Because the weights are openly available, teams can run the model on their own infrastructure and build on it, in line with the growing use of open models in enterprise environments.
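The article does not spell out how field accuracy is scored. As a rough illustration, the sketch below assumes one plausible definition: the share of reference fields whose predicted value exactly matches the ground truth, with a correct null counted as a match. The `field_accuracy` function and the sample data are hypothetical and are not Datalab's evaluation code.

```python
def field_accuracy(predictions: list[dict], references: list[dict]) -> float:
    """Fraction of reference fields whose predicted value exactly matches.

    Assumed definition for illustration only: a field counts as correct
    when it is present in the prediction and equal to the reference value,
    including the case where both are None (a correct abstention).
    """
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        for field, expected in ref.items():
            total += 1
            if field in pred and pred[field] == expected:
                correct += 1
    return correct / total if total else 0.0


# Two toy documents with two fields each (made-up data).
references = [
    {"vendor": "Acme", "po_number": None},
    {"vendor": "Globex", "po_number": "PO-7"},
]
predictions = [
    {"vendor": "Acme", "po_number": None},     # both fields correct
    {"vendor": "Globex", "po_number": "PO-9"},  # one field wrong
]
print(field_accuracy(predictions, references))  # 0.75
```

Under this definition a wrongly abstained field (null where a value exists) and a hallucinated field (a value where the reference is null) both count as errors, so the metric rewards abstaining only when the document truly lacks the field.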

Conclusion

By combining schema constraints with abstention training, lift targets two common failure modes in automated document processing: output that does not match the expected structure, and values invented for fields the model cannot confidently extract. For organizations handling large volumes of unstructured documents, that combination offers a promising way to improve both efficiency and data integrity.
