🚧 Parsing Tips and How to Fix Common Issues

Why are only some pages of my document being parsed?

The LLM-powered parser operates within a defined context window, which limits the maximum document size it can process for data extraction.

Currently, LLM engines share a key limitation: they cannot handle large documents (typically those exceeding 10 pages).

You have two options. First, if feasible, split your PDF documents into multiple smaller documents, ideally fewer than 10 pages each. You can use Airparser's auto-splitting feature for this (navigate to the Inbox Settings page).

Alternatively, we offer another parsing tool called Parsio. It uses pre-trained AI models for PDF parsing and may perform better than an LLM-powered parser in certain scenarios.
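
If you split documents yourself rather than relying on auto-splitting, the page ranges for each smaller document can be worked out as in the sketch below. This is only an illustration: the function name `split_page_ranges` and the default of 9 pages per chunk (to stay under 10) are assumptions, and the actual PDF splitting is left to whichever tool you use.

```python
from __future__ import annotations


def split_page_ranges(total_pages: int, max_pages: int = 9) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges.

    Each range covers at most `max_pages` pages. The default of 9 is an
    assumption chosen to keep every chunk under 10 pages.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (first, min(first + max_pages - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages)
    ]


if __name__ == "__main__":
    # A 23-page document becomes three chunks.
    for first, last in split_page_ranges(23):
        print(f"pages {first}-{last}")
```

Each resulting range can then be exported as its own PDF and sent to the inbox as a separate document.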

See also: Choosing the Right Parser Type (Parsio documentation).

How do I instruct Airparser to parse only certain document types?

By default, Airparser attempts to parse all incoming documents. To restrict parsing to certain document types, use one of these approaches:

  1. Automatically set the status of some documents to 'Skipped' based on their file type. This prevents them from being exported and from triggering your integrations. Learn how to write simple post-processing code for this.

  2. Use automation platforms such as Zapier or Make to build more complex import integrations that filter documents and import into Airparser only those you need to parse.
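
The file-type check behind either approach can be sketched as follows. This is a hedged illustration, not Airparser's post-processing API: `should_skip` and the `ALLOWED_EXTENSIONS` allow-list are hypothetical names, and how the 'Skipped' status is actually set on a document is not covered here.

```python
from pathlib import PurePath

# Hypothetical allow-list: adjust to the document types you want parsed.
ALLOWED_EXTENSIONS = frozenset({".pdf", ".docx"})


def should_skip(filename: str, allowed=ALLOWED_EXTENSIONS) -> bool:
    """Return True when the file extension is not in the allow-list.

    The comparison is case-insensitive, and files without an extension
    are treated as not allowed.
    """
    extension = PurePath(filename).suffix.lower()
    return extension not in allowed


if __name__ == "__main__":
    for name in ("invoice.PDF", "logo.png", "README"):
        print(name, "skip" if should_skip(name) else "parse")
```

The same predicate works as a filter condition in an import integration, so unwanted files never reach the inbox.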

I’m parsing emails with attachments. Can I handle both the email and the attachment in the same extraction?

Yes! You can combine the parsed data from both the email and its attachment into a single document.

To do this:

  1. Go to your Inbox settings.

  2. Enable the "Parent Data" meta field.
    This will inject the parsed email data into the parsed data of the attachment, giving you one unified JSON object.

Additionally, you can use a post-processing step to set the status of the parsed email to "Skipped" so that it does not trigger any integrations.
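
To picture the unified object that the "Parent Data" meta field produces, the sketch below nests the email's parsed data inside the attachment's parsed data. The key name `parent_data` and all field names are assumptions made for illustration; check a parsed document in your inbox for the actual structure.

```python
import json


def attach_parent_data(attachment_data: dict, email_data: dict,
                       key: str = "parent_data") -> dict:
    """Return a copy of the attachment data with the email data nested under `key`.

    `key` is a hypothetical name; the real meta field name may differ.
    """
    if key in attachment_data:
        raise ValueError(f"attachment data already contains the key {key!r}")
    merged = dict(attachment_data)
    merged[key] = dict(email_data)
    return merged


if __name__ == "__main__":
    email = {"sender": "billing@example.com", "subject": "Invoice for March"}
    attachment = {"invoice_number": "INV-001", "total": "120.00"}
    print(json.dumps(attach_parent_data(attachment, email), indent=2))
```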

I notice that some of the extracted data isn't correct. How can I improve the parsing quality?

For most documents, Airparser extracts data with high accuracy. However, in complex cases where the document structure is too intricate for LLM engines, the extracted data may be incomplete or inaccurate. To improve results, we offer 4 LLM models (2 text-based and 2 vision-based). You can switch between them on the Inbox settings page to find the best fit for your use case.
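
One way to compare models is to parse the same sample document with each one and score the output against values you have verified by hand. The sketch below assumes you have copied each model's parsed JSON into a dictionary; the model labels and field names are hypothetical, and exact-match scoring is a deliberately simple choice.

```python
def field_accuracy(extracted: dict, expected: dict) -> float:
    """Return the fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    matches = sum(
        1
        for field, value in expected.items()
        if field in extracted and extracted[field] == value
    )
    return matches / len(expected)


def best_model(results: dict, expected: dict) -> str:
    """Return the label of the model whose output scores highest."""
    return max(results, key=lambda label: field_accuracy(results[label], expected))


if __name__ == "__main__":
    expected = {"total": "120.00", "currency": "EUR"}
    results = {
        "text-model-a": {"total": "120.00", "currency": "USD"},
        "vision-model-a": {"total": "120.00", "currency": "EUR"},
    }
    print(best_model(results, expected))
```

Scoring several representative documents rather than one gives a more reliable picture before you settle on a model.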

