Extracting data from unstructured texts, such as reports, emails, and receipts, can be a challenging yet valuable task. The ultimate goal is to transform this data into a structured form that is more convenient for further analysis. Having access to a large volume of data not only enhances the accuracy and reliability of business decisions, but can also provide valuable insights into market trends and customer behavior.
Before the advent of large language models (LLMs), this process usually involved either manual labor or unreliable heuristics, such as searching for keywords. However, LLMs, which are typically used for text generation and conversation, have proven to be highly adept at text extraction as well.
Often, the first and most discussed step in this process is finding the documents that contain relevant information. This is achieved by calculating an embedding (a vector representation) for each document or document fragment. When a query is made, an embedding is calculated for it as well. The documents closest to the query in vector space are then identified and placed in the LLM's context.
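The retrieval step can be sketched in a few lines. This is a minimal illustration with toy, hand-written vectors; in practice each embedding would come from an embedding model, and at scale you would use a vector database instead of a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=1):
    """Return indices of the k documents closest to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings", purely for illustration.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.1, 0.0]
print(top_k(query, docs, k=2))
```

The top-ranked documents would then be pasted into the prompt as context for the extraction query.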
In this blog post, we will assume that a relevant document has already been found or that we need to extract information from every document of a given type. Extracting the information is still not a trivial problem, and we will discuss some high-level tips to approach it.
We will attempt to extract information from real estate rental ads to get a better analysis of the market landscape. By scraping a relevant website, we can acquire thousands of ads. Some information, such as the price, apartment size, location, and phone number, is structured and easy to extract.
However, most of the information resides in a free text format that the landlord fills in, including details about the types of rooms available, whether pets are allowed, and whether the apartment is represented by a real estate agent.
The first and most important step is to be explicit about which fields we want to extract. Parsing unstructured data is pointless if the output from the LLM is equally unstructured. It is also essential to define the desired output format of the query, such as CSV, JSON, or a dictionary. Providing an explicit example often helps as well.
To improve consistency, it is recommended to provide possible values for a field. For example, if we ask a yes/no question, it is important to standardize the answer as either “Yes” or “No,” rather than variations like “True,” “YES,” or “According to the ad, the answer is probably yes.” This simplifies further processing.
The example below uses a simple prompt, but experimenting with various prompts and iterating toward more effective ones can yield surprisingly valuable results.
From the text above extract the following information. Use the key: value format specified below and do not add any additional text around it. As answers use only values supplied in the brackets.
Apartment type: [studio apartment, 2-room apartment, 3-room apartment]
Parking space: [Yes, No]
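Once the model returns its answer, the constrained key: value format makes post-processing straightforward. Below is a small sketch of a parser that also validates the values against the allowed sets, so off-format answers get dropped rather than silently polluting the dataset; the field names and allowed values mirror the prompt above.

```python
# Allowed values per field, matching the bracketed options in the prompt.
ALLOWED = {
    "Apartment type": {"studio apartment", "2-room apartment", "3-room apartment"},
    "Parking space": {"Yes", "No"},
}

def parse_response(text):
    """Parse 'key: value' lines, keeping only fields with allowed values."""
    result = {}
    for line in text.strip().splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in ALLOWED and value in ALLOWED[key]:
            result[key] = value
    return result

raw = "Apartment type: 2-room apartment\nParking space: No"
print(parse_response(raw))
```

Rejected fields can be logged and re-queried, which is often cheaper than trying to normalize free-form answers after the fact.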
To make the results more consistent and reduce so-called “hallucinations,” it is worth increasing the determinism of the answer. In the case of GPT, this can be achieved by reducing the “temperature” variable. Most LLMs should offer similar functionality to decrease the model’s creativity, which is desirable in text extraction scenarios.
Another GPT-specific tip is to choose and fix a specific version of the model, as this allows for consistent performance over time, despite ongoing updates.
Another valuable technique is to perform information retrieval in multiple steps. There are several use-cases for this approach, as outlined below:
1) Error correction: After the LLM provides an answer, you can prompt it again to verify if the returned information is correct and if all the instructions were followed. This leverages the fact that verifying a solution is often easier than generating a correct one. It also gives the LLM additional processing time to think and process the extracted information.
2) Multiple logical steps: Information extraction sometimes requires multiple logical steps. For example, if the goal is to list all real estate ads on a given page sorted by price, asking for extraction and sorting in a single query often produces incorrect ordering. If the instructions instead specify extracting the information first and sorting it in a second query, the result is far more likely to be correct.
3) Follow-up questions: The information needed often depends on previously extracted information. It can be beneficial to structure the query system to automatically provide follow-up questions based on the previous answers.
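The error-correction pattern from point 1 can be sketched as a small pipeline. Here `call_llm` is a hypothetical wrapper around your LLM API; a stub is used below so the control flow can be demonstrated offline.

```python
def extract_with_verification(ad_text, call_llm):
    """Two-step extraction: extract, then ask the model to check its own answer."""
    answer = call_llm(f"Extract Parking space [Yes, No] from:\n{ad_text}")
    verdict = call_llm(
        f"Ad:\n{ad_text}\nExtracted answer:\n{answer}\n"
        "Does the answer follow the instructions and match the ad? Reply OK or FIX."
    )
    if verdict.strip() != "OK":
        # Re-query with the previous answer as context for correction.
        answer = call_llm(f"Correct the answer for:\n{ad_text}\nPrevious: {answer}")
    return answer

# Stub standing in for a real API call, purely to illustrate the flow.
def fake_llm(prompt):
    if "Reply OK or FIX" in prompt:
        return "OK"
    return "Parking space: Yes"

print(extract_with_verification("Apartment with a garage.", fake_llm))
```

The same skeleton extends naturally to points 2 and 3: the second call can perform post-processing (such as sorting) or ask follow-up questions conditioned on the first answer.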
Fine-tuning the model can significantly improve its consistency. While significant improvement in reasoning may not be expected, fine-tuning lets you consistently receive output structured in the correct way, using the given values and finding them in the correct places. Most open-source LLMs can be fine-tuned. As of September 2023, GPT-3.5 can also be fine-tuned, with fine-tuning for GPT-4 promised by the end of the year.
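For chat-style fine-tuning (the format used for GPT-3.5), training data is a JSONL file with one example per line, each containing the system instruction, a user input, and the desired assistant output. The snippet below builds one such example; the ad text and field values are invented for illustration.

```python
import json

# One training example in the chat-style JSONL fine-tuning format:
# the assistant message shows exactly the output structure we want.
example = {
    "messages": [
        {"role": "system",
         "content": "Extract the requested fields as key: value lines only."},
        {"role": "user",
         "content": "Cozy 2-room flat in the city center, parking included."},
        {"role": "assistant",
         "content": "Apartment type: 2-room apartment\nParking space: Yes"},
    ]
}
print(json.dumps(example))
```

A few hundred such examples, each pairing a raw ad with its correctly structured extraction, is typically enough to lock in the output format.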
LLMs are powerful tools that enable us to structure complex unstructured data at scale. However, the process of achieving this requires careful attention and a lot of trial and error. We hope this blog post speeds up your own process and helps you achieve the desired results.
Data scientist at Pareto AI