📝 Creating an Extraction Schema
An extraction schema defines the data you want to extract from your documents. It ensures the extracted information is structured and formatted correctly.
Creating an extraction schema takes only a few steps.
How to Create an Extraction Schema
There are two primary ways to create extraction schemas in Airparser. You can either create them manually by defining fields or let Airparser generate a schema automatically based on your document's content. Both methods allow you to edit the extraction schema later if needed.
Airparser provides two locations for creating and editing extraction schemas:
Extraction Schema Page
The "Extraction Schema" page is a dedicated space for managing extraction schemas within each inbox.
Steps:
- Navigate to the "Extraction Schema" page within your inbox.
- Upload a document to let Airparser automatically generate an extraction schema based on the document content, or manually add fields to extract specific data points from your documents (e.g., invoice amounts, names, addresses, document types).
- Save the schema.
Additional Features:
- Export schema to JSON (for backup or sharing).
- Restore schema from JSON.
- Clone schemas between different inboxes.
Document Details Page
You can also create extraction schemas directly from the document details page. This method supports both manual schema creation and automatic generation.
Steps:
- Open any previously uploaded document in your inbox and go to the "Extraction Schema" tab.
- Generate an extraction schema automatically based on the document content, or add fields manually to extract specific data points from your documents.
- Save the schema. To apply the updated schema to previously uploaded documents, click "Reprocess."
Notes:
- Newly received documents will be processed using this schema automatically.
- Click Reprocess to parse old documents again using the updated schema.
You can reprocess the same document as many times as needed. If the extraction schema is unchanged, reprocessing does not consume additional parsing credits.
Reprocessing without changing the schema is useful for triggering integrations: Airparser resends the parsed data, so you can debug or set up an integration without using extra parsing credits.
If the schema has changed, however, reprocessing parses the document again in full and consumes credits.
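The credit rule above can be pictured with a short Python sketch. It is only a mental model, not Airparser's billing logic: the schema dictionaries and the function names `schema_fingerprint` and `reprocess_uses_credits` are hypothetical and exist only for this illustration.

```python
import json


def schema_fingerprint(schema: dict) -> str:
    """Serialize a schema with sorted keys so that equal schemas compare equal."""
    return json.dumps(schema, sort_keys=True)


def reprocess_uses_credits(schema_at_last_parse: dict, current_schema: dict) -> bool:
    """Return True when the schema changed since the document was last parsed."""
    return schema_fingerprint(schema_at_last_parse) != schema_fingerprint(current_schema)


# Hypothetical schemas, invented for this sketch.
old = {"fields": [{"name": "invoice_number", "type": "string"}]}
same = {"fields": [{"name": "invoice_number", "type": "string"}]}
changed = {"fields": [{"name": "invoice_number", "type": "string"},
                      {"name": "total", "type": "decimal"}]}

print(reprocess_uses_credits(old, same))     # False: parsed data is only resent
print(reprocess_uses_credits(old, changed))  # True: the document is parsed again
```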
Here's an example of an extraction schema for an invoice:
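As a rough illustration, an invoice schema could be written down as the following Python structure. The key names (`name`, `type`, `description`, `values`, `attributes`) and every field name are assumptions made for this sketch; they are not Airparser's actual JSON export format.

```python
import json

# Hypothetical invoice schema; all key and field names are illustrative assumptions.
invoice_schema = [
    {"name": "document_type", "type": "enum",
     "values": ["invoice", "receipt", "contract"]},
    {"name": "invoice_number", "type": "string",
     "description": "Usually printed near the top of the first page."},
    {"name": "total_amount", "type": "decimal",
     "description": "Grand total including tax."},
    {"name": "is_paid", "type": "boolean"},
    {"name": "vendor", "type": "object", "attributes": [
        {"name": "name", "type": "string"},
        {"name": "address", "type": "string"},
    ]},
    {"name": "line_items", "type": "list", "attributes": [
        {"name": "description", "type": "string"},
        {"name": "quantity", "type": "integer"},
        {"name": "unit_price", "type": "decimal"},
    ]},
]

print(json.dumps(invoice_schema, indent=2))
```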
Field Types
Assigning field types makes the output format more consistent and can, in some cases, improve extraction precision. For simple fields such as a name, amount, age, ID, or color, however, it is fine to leave every type set to 'string'.
Airparser supports the following field types:
- String: Any text-based data.
- Integer: Whole numbers.
- Decimal: Numbers with decimal places.
- Boolean: True/false values.
- Enum: A predefined list of possible values (e.g., status: pending, approved, rejected).
  - Tip: Use Enum for document classification tasks, such as identifying if a document is an invoice, receipt, or contract.
- Object: A structured group of related fields (e.g., a "Person" object with first name, last name, and age).
- List (Table): A repeating set of structured data, such as a list of ordered items or tabular data.
  - Tip: This data type requires defining a list of attributes. When setting up a table, think of attributes as column headers: each attribute represents a table column.
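To make the field types concrete, here is a minimal Python sketch that checks whether a value has the shape each type implies. The field dictionaries (`type`, `values`, `attributes`) and the function `matches_type` are hypothetical names chosen for this example; they do not describe how Airparser validates data internally.

```python
from decimal import Decimal


def matches_type(value, field: dict) -> bool:
    """Check a value against a field definition (illustrative only)."""
    kind = field["type"]
    if kind == "string":
        return isinstance(value, str)
    if kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "decimal":
        return isinstance(value, (int, float, Decimal)) and not isinstance(value, bool)
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "enum":
        return value in field["values"]
    if kind == "object":
        return isinstance(value, dict) and all(
            attr["name"] in value and matches_type(value[attr["name"]], attr)
            for attr in field["attributes"]
        )
    if kind == "list":
        # Each row of a list (table) is an object whose keys are the columns.
        row_shape = {"type": "object", "attributes": field["attributes"]}
        return isinstance(value, list) and all(matches_type(row, row_shape) for row in value)
    raise ValueError(f"unknown field type: {kind}")


line_items = {"type": "list", "attributes": [
    {"name": "description", "type": "string"},
    {"name": "quantity", "type": "integer"},
]}
print(matches_type([{"description": "Widget", "quantity": 2}], line_items))    # True
print(matches_type([{"description": "Widget", "quantity": "2"}], line_items))  # False
```

Treating a list row as an object with one key per attribute mirrors the "attributes as column headers" tip above.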
Field Properties
- Name: The field's identifier in the parsed data.
- Description (optional but recommended): Helps the parser extract the correct data. You can add:
  - Instructions on where to find the data in the document.
  - Examples of expected values.
  - Formatting hints (though for complex formatting, post-processing is recommended).
- Default Value: A fallback value if the expected data is missing.
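The effect of a default value can be sketched in a few lines of Python. Airparser applies defaults itself; the `apply_defaults` helper, the `default` key, and the sample fields below are hypothetical and only illustrate the fallback behavior described above.

```python
def apply_defaults(parsed: dict, schema: list[dict]) -> dict:
    """Fill in the schema's default for every field that is missing or empty."""
    result = dict(parsed)
    for field in schema:
        if result.get(field["name"]) in (None, "") and "default" in field:
            result[field["name"]] = field["default"]
    return result


# Hypothetical field definitions using the three properties listed above.
schema = [
    {"name": "currency",
     "description": "Three-letter currency code next to the total, e.g. USD or EUR.",
     "default": "USD"},
    {"name": "po_number",
     "description": "Purchase order number, usually in the header."},
]

print(apply_defaults({"po_number": "PO-77"}, schema))
# {'po_number': 'PO-77', 'currency': 'USD'}
```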
Example of a parsed invoice:
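As a sketch of the general shape, parsed data for an invoice might look like the structure below. Every field name and value here is invented for illustration and is not real Airparser output; the exact structure depends on the schema you define.

```python
import json

# Hypothetical parsed result; names and values are made up for this sketch.
parsed_invoice = {
    "document_type": "invoice",           # enum
    "invoice_number": "INV-1042",         # string
    "is_paid": False,                     # boolean
    "total_amount": 44.0,                 # decimal
    "vendor": {                           # object
        "name": "Example Supplies Ltd.",
        "address": "1 Sample Street",
    },
    "line_items": [                       # list (table): one dict per row
        {"description": "Widget", "quantity": 2, "unit_price": 19.5},
        {"description": "Shipping", "quantity": 1, "unit_price": 5.0},
    ],
}

# A typical post-processing step: check that the rows add up to the total.
rows_total = sum(row["quantity"] * row["unit_price"] for row in parsed_invoice["line_items"])
print(rows_total == parsed_invoice["total_amount"])  # True
print(json.dumps(parsed_invoice, indent=2))
```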