Extract Structured Data from Text with JSON, Pydantic, and LLM

Harnessing the Power of LLMs for Structured Data Extraction: A Dive into JSON and Pydantic Integration

In the age of big data, the ability to efficiently extract structured information from unstructured text is a game-changer for businesses and developers alike. With the advent of Large Language Models (LLMs) like OpenAI's GPT-3.5 Turbo, this task has become more accessible and powerful than ever before. In this blog post, we'll explore how Python libraries, Pydantic validation, and LLMs are revolutionizing the way we handle data extraction, and how you can leverage these tools to transform raw text into valuable structured data.

The Rise of Structured Data Extraction Tools

Structured data extraction is the process of converting unstructured text into a structured format, such as JSON, which can be easily processed and analyzed by computers. This is crucial for a variety of applications, from AI assistants to natural language access to APIs. The integration of LLMs into this process has provided an unprecedented level of accuracy and efficiency in data extraction tasks.

One tool that stands out is Kor, a prototype that lets users define extraction schemas, generate prompts for LLMs such as OpenAI's GPT-3.5 Turbo, and parse structured data from the model's output [1]. It integrates with LangChain, supports Python 3.8 through 3.11, and works with Pydantic (both v1 and v2) for validation, ensuring that the extracted data meets the predefined schema requirements.
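
To make that concrete, here is a minimal sketch of a Kor extraction schema, modeled on the project's documentation. Exact import paths and chain invocation have shifted across Kor and LangChain releases, so treat the details as illustrative:

```python
from langchain.chat_models import ChatOpenAI  # newer installs: langchain_openai
from kor import create_extraction_chain, Object, Text

# An LLM to drive the extraction (model name is illustrative).
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# A schema describing what to pull out of the text.
schema = Object(
    id="person",
    description="Personal information mentioned in the text",
    attributes=[
        Text(id="first_name", description="The first name of a person"),
    ],
    examples=[
        ("Alice and Bob are friends",
         [{"first_name": "Alice"}, {"first_name": "Bob"}]),
    ],
)

chain = create_extraction_chain(llm, schema)
result = chain.invoke("My name is Bobby. My sister's name is Rachel.")
print(result)
```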

JSON and Pydantic: A Perfect Match for LLMs

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write and easy for machines to parse and generate. Combined with Pydantic, a data validation and settings management library for Python, the two form a powerful pairing for managing structured data extraction.

Pydantic models enforce type hints at runtime and provide user-friendly errors when data is invalid. By integrating Pydantic with LLMs, developers can ensure that the data extracted from text not only follows the correct structure but also adheres to the specified data types and constraints [1].
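
A small example shows the idea. This sketch assumes Pydantic v2 (v1 uses parse_raw in place of model_validate_json):

```python
from pydantic import BaseModel, Field, ValidationError

class Person(BaseModel):
    name: str
    age: int = Field(ge=0, description="Age in years")

# Well-formed LLM output parses into a typed object.
person = Person.model_validate_json('{"name": "Ada", "age": 36}')

# Malformed output raises a descriptive error instead of slipping through.
try:
    Person.model_validate_json('{"name": "Ada", "age": "unknown"}')
except ValidationError as err:
    print(err)
```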

Extracting Structured Data Across Various Domains

The versatility of LLMs for structured data extraction is evident in the wide range of applications. For instance, OntoGPT is designed for ontology-based information extraction, which is particularly useful for assembling structured biological knowledge from unstructured biomedical texts [3]. Meanwhile, Instructor, another Python library, simplifies Python-based structured data extraction by integrating OpenAI's function calling API with Pydantic models [4].
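
Here is a minimal sketch of the Instructor pattern, in which a Pydantic model doubles as the extraction contract. Recent releases expose instructor.from_openai, while older ones used instructor.patch, so adjust to your installed version:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

# Wrap the OpenAI client so responses are parsed into the model.
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,  # the structured output contract
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
)
print(user.name, user.age)
```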

In legal and HR settings, tools such as llmparser classify contracts and extract entities from resumes [2]. These tools enforce consistent JSON structures for LLM operations, making the extracted data reliable and ready for further processing.
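
The underlying pattern is simple: validate every LLM response against a fixed schema before it reaches downstream systems. The following generic Python sketch uses the jsonschema package; the resume schema is invented for illustration and is not llmparser's actual API:

```python
import json
from jsonschema import validate

# A fixed output contract for resume parsing (illustrative fields).
RESUME_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "skills": {"type": "array", "items": {"type": "string"}},
        "years_experience": {"type": "number"},
    },
    "required": ["name", "skills"],
}

def parse_llm_output(raw: str) -> dict:
    data = json.loads(raw)          # fails fast on malformed JSON
    validate(data, RESUME_SCHEMA)   # fails fast on schema drift
    return data
```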

The Future of Data Extraction: Efficiency and Accuracy

The future of structured data extraction looks promising with the continuous improvement of LLMs and the development of specialized tools. Jsonformer, for example, focuses on generating syntactically correct JSON by efficiently filling in fixed tokens and delegating content-token creation to the language model [5]. This ensures bulletproof JSON generation, which is crucial for applications that cannot afford errors in data formatting.
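
Modeled on the usage in Jsonformer's README, the generator wraps a Hugging Face model and tokenizer together with a JSON Schema (the model name below is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

# Jsonformer emits the braces, quotes, and keys itself and asks the
# model only for the content values, so the output is always valid JSON.
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
    },
}

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()
print(generated_data)
```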

Moreover, projects like DB-GPT-Hub enhance Text-to-SQL capabilities by leveraging LLMs to parse natural language queries into SQL, reducing model training costs and enabling automated database querying [6]. This opens up new possibilities for businesses to interact with their databases using natural language, making data accessible to non-technical users.
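
As a general illustration of the text-to-SQL pattern, the sketch below simply prompts a chat model with a table schema; it is not DB-GPT-Hub's fine-tuning pipeline, and the table definition is invented:

```python
from openai import OpenAI

client = OpenAI()

SCHEMA_DDL = "CREATE TABLE orders (id INT, customer TEXT, total DECIMAL, created_at DATE);"

def text_to_sql(question: str) -> str:
    # Give the model the schema and constrain it to reply with SQL only.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": f"Translate the user's question into SQL for this schema:\n{SCHEMA_DDL}\nReply with a single SQL statement only."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(text_to_sql("What was the total revenue in March 2024?"))
```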

Conclusion: Embracing the Structured Data Revolution

The integration of LLMs with Python libraries and Pydantic validation is transforming the landscape of structured data extraction. By leveraging these tools, developers can create applications that understand and process natural language with remarkable accuracy and efficiency. Whether it's for extracting information from legal documents, resumes, or biomedical texts, the potential applications are vast and varied.

As we continue to witness advancements in LLMs and related technologies, it's clear that the ability to extract structured data from text will become an even more integral part of our digital world. By embracing these innovations, businesses and developers can unlock new insights, streamline processes, and create more intelligent and responsive applications.

In summary, the combination of JSON, Pydantic, and LLMs is not just a technical achievement; it's a catalyst for a new era of data management. With these tools at your disposal, the power to transform unstructured text into structured data is just a few lines of code away.


📚 Resources

[1] kor

⚡A prototype tool for structured data extraction from text using LLMs.
🎯To generate prompts for LLMs, send them, and parse structured data from the output.
💡Allows defining schemas for data extraction, integrates with LangChain, supports pydantic for validation, compatible with multiple Python versions, and provides natural language processing capabilities for APIs.
📝Kor is a tool that extracts structured data from text.
It uses Large Language Models (LLMs) to process and interpret text.
Users can define their own schemas to specify the data to be extracted.
Kor can generate prompts, send them to LLMs, and parse the results.
The tool integrates with LangChain for enhanced functionality.
It supports pydantic v1 and v2 for data validation.
Kor is compatible with Python versions 3.8 to 3.11.
It is designed to power AI assistants and provide natural language access to APIs.
The tool is still a prototype and the API may change.
Kor is acknowledged to make mistakes, be slow, and crash with long texts.
The expectation is that improvements in LLMs will mitigate some of these issues.
Kor is named for being fast to type and unique.
Users are invited to contribute ideas and feature requests.
There are alternative packages suggested for different use cases.
🔑Python, Pydantic, LangChain, OpenAI GPT-3.5-turbo

[2] llmparser

⚡A tool for classifying and extracting structured data from text using large language models.
🎯To provide a consistent JSON input/output format for classifying and extracting information from text using large language models.
💡Enforces consistent JSON structures for LLM operations, extracts entities from resumes, classifies contracts, and pulls information from various text sources.
🔑JavaScript, TypeScript, npm, OpenAI API

[3] ontogpt

⚡A Python package for ontology-based information extraction from text using large language models.
🎯To extract structured information from text with LLMs, instruction prompts, and ontology-based grounding.
💡OntoGPT offers command line tools for information extraction, a web interface for results display, and integration with OpenAI's API for processing text. It is useful for assembling structured biological knowledge and transforming unstructured biomedical texts.
🔑Python, OpenAI API, Web Application Development, Command Line Interface

[4] instructor

⚡A Python library for structured data extraction using OpenAI's function calling API with Pydantic integration.
🎯To simplify and enhance Python-based structured data extraction by integrating OpenAI's function calling API with Pydantic models.
💡Key features of Instructor include response mode with Pydantic model integration, max retries for API requests, validation context for enhanced validator access, CLI tools for job creation and file management, and contribution-friendly with evals and issues.
🔑Python, OpenAI API, Pydantic, pytest, CLI

[5] jsonformer

⚡A tool for generating structured JSON using language models with a focus on content token generation.
🎯To generate syntactically correct JSON that conforms to a specified schema by efficiently filling in fixed tokens and delegating content token creation to language models.
💡Jsonformer ensures bulletproof JSON generation by creating only content tokens and filling in fixed tokens; it's built on Hugging Face transformers for compatibility with different models, and is designed to be efficient, flexible, and extendable.
🔑Python, Hugging Face Transformers, JSON Schema

[6] DB-GPT-Hub

⚡A project for Text-to-SQL parsing using Large Language Models (LLMs).
🎯To enhance Text-to-SQL capabilities by leveraging LLMs to parse natural language queries into SQL, while reducing model training costs and enabling automated database querying.
💡DB-GPT-Hub includes data collection and preprocessing, model selection, integration of multiple LLMs, support for fine-tuning with SFT and QLoRA, reusability of developed code, and an end-to-end workflow for training, prediction, and evaluation.
🔑Python, PyTorch, Huggingface Transformers, Poetry, Conda, Git, Quantized Learning over Redundant Architecture (QLoRA), Supervised Fine-Tuning (SFT)

[7] json-data-ai-template

⚡A web application that generates structured JSON data based on user prompts.
🎯To provide a service that allows users to define custom JSON structures and receive data tailored to their prompts, utilizing AI technology.
💡Features include dynamic form generation for user prompts, rate limiting with Vercel VK Storage, and AI-powered data generation using OpenAI's GPT-4.
📝This project is a web application designed to generate JSON data.
It utilizes a stack that includes Vercel AI SDK, OpenAI GPT-4, and Supabase.
The application is capable of rate limiting using Vercel VK Storage.
Users can create dynamic forms with React Hook Form.
The application is useful for obtaining structured JSON data based on user-defined prompts.
Installation and setup involve creating an environment file and running the application with 'bun'.
The project is open-source with an MIT license.
🔑Vercel AI SDK, Vercel VK Storage, OpenAI GPT-4, Shadcn UI, Supabase, React Hook Form

[8] evaporate

⚡A system for generating structured views of heterogeneous data lakes using language models.
🎯To provide tools and methodologies for leveraging language models for information extraction and generating structured data from unstructured data lakes.
💡Closed and open information extraction capabilities, use of weak supervision code, support for modifying supported models, and extended write-up with in-depth explanations.
🔑Python, Hugging Face's datasets, OpenAI API, Git, Conda

[9] zod-gpt

⚡A TypeScript library for structured and fully typed JSON outputs from OpenAI and Anthropic LLMs.
🎯To receive structured outputs from language models with complete type safety, including validated responses, schema definition, serialization, and automatic correction of outputs.
💡The project features include structured and typed JSON responses from LLMs, schema-based output validation, self-healing for error correction, rate limit handling, and text slicing for token limits.
🔑TypeScript, zod, OpenAI API, Anthropic API, llm-api

[10] autolabel

⚡A Python library for labeling, cleaning, and enriching text datasets using Large Language Models.
🎯Automate the process of labeling text datasets for various NLP tasks using state-of-the-art large language models to save time and reduce costs.
💡Autolabel enables automatic data labeling for NLP tasks, supports multiple LLM providers, incorporates few-shot learning and chain-of-thought prompting, provides confidence estimation and explanations, offers caching and state management, and gives access to Refuel hosted LLMs for improved labeling accuracy.
📝Autolabel is designed to improve the labeling process of text datasets by using Large Language Models.
It provides a simple 3-step process for labeling data that includes specifying guidelines, dry-running, and executing a labeling run.
The library supports multiple LLM providers and applies research-proven techniques like few-shot learning.
Autolabel comes with features such as confidence estimation, caching, and state management for cost efficiency.
Users can access RefuelLLM, a purpose-built LLM for data labeling, through the Autolabel platform.
The library offers a quick install via pip and comprehensive documentation for ease of use.
Refuel AI maintains a public roadmap and encourages community contributions through Discord discussions and GitHub pull requests.
🔑Python, GPT-4, LLM, RefuelLLM, Autolabel Library

[11] semantic-search-nextjs-pinecone-langchain-chatgpt

⚡A full stack starter project for semantic search using Next.js, LangchainJS, Pinecone, and GPT-3.
🎯To build an application that embeds text into vectors, stores them in Pinecone's vector database, and allows users to perform semantic searches on the data.
💡The project features semantic search capabilities, integration with Pinecone's vector database for storing and searching text embeddings, and usage of GPT-3 for natural language processing. It is useful for developers looking to create applications with advanced search functionalities and leverages AI to understand the context and intent behind user queries.
🔑Next.js, LangchainJS, Pinecone, GPT-3

[12] PdfGptIndexer

⚡A tool for indexing and searching PDF text data using OpenAI APIs and FAISS.
🎯To provide rapid information retrieval and superior search accuracy for PDF documents.
💡Extracts text from PDFs, embeds text chunks for searchability, stores embeddings in FAISS index, offers a query interface for searching, handles large datasets, and provides offline access to indexed data.
🔑Textract, Transformers, Langchain, FAISS, Python, OpenAI APIs

[13] vectordb

⚡A lightweight, local solution for embeddings-based text retrieval.
🎯VectorDB is designed to provide fast and efficient embeddings-based text retrieval for various applications, including AI features in search engines like Kagi Search.
💡Local data handling, maximum performance, embedding models benchmarking, optimized for both small and large datasets, support for multiple chunking strategies and embedding quality levels, and extensibility through custom HuggingFace models.
🔑Python, pip, Faiss, mrpt, Universal Sentence Encoder, HuggingFace models

[14] doctran

⚡A framework for transforming documents using LLMs to process complex strings with natural language instructions.
🎯To parse and transform documents using large language models to extract structured data, redact sensitive information, summarize content, refine by topics, translate languages, and convert text into a Q&A format optimized for vector search.
💡Doctran allows for the chaining of document transformations such as redaction, extraction, summarization, refinement, translation, and interrogation into Q&A formats. It facilitates handling complex text parsing tasks that benefit from human-level judgement, making it useful for applications that require semantic understanding and confidentiality.
🔑Python, OpenAI, spaCy, JSON
