🔗 Extracting URLs From an Email or HTML Document

By default, Airparser may not extract URLs that are hidden behind clickable text, buttons, or images. These links exist only in the underlying HTML code of the email, so you may need to enable URL parsing explicitly.

βœ”οΈ Step 1 β€” Enable URL Parsing (Text Engine Only)

URL extraction only works with the Text engine. The Vision engine analyzes documents visually and does not read the underlying HTML, so it cannot detect link URLs embedded inside HTML anchors.

To enable URL detection:

  1. Go to Inbox Settings

  2. Scroll to Advanced Settings

  3. Enable:

    • Parse URLs

    • Parse img src attributes (if you need image URLs)

After enabling these settings, reparse your documents.

Important: If your inbox uses the Vision engine, switch to the Text engine first. Otherwise, URL parsing will not work.

βœ”οΈ Step 2 β€” Extract the URL in Your Schema

Once URL parsing is enabled, Airparser will extract:

  • <a href="..."> link URLs

  • Image URLs (<img src="...">)

  • Visible text + metadata behind clickable buttons

You can reference these values in your schema fields as needed.
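To make the distinction between visible text and hidden URLs concrete, here is a minimal Python sketch that collects the same kinds of values from an HTML fragment using only the standard library. It illustrates where these URLs live in the markup; it is not Airparser's implementation, and the class name `LinkCollector` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect <a href> links (with their visible text) and <img src> URLs."""

    def __init__(self):
        super().__init__()
        self.links = []       # list of {"href": ..., "text": ...}
        self.images = []      # list of image URLs
        self._current = None  # the <a> element currently being read, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self._current = {"href": attributes["href"], "text": ""}
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])

    def handle_data(self, data):
        # Text inside an open <a> element is the clickable label.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


# Hypothetical email fragment: the reader sees "View invoice", not the URL.
sample_html = (
    '<p>Your invoice is ready. '
    '<a href="https://example.com/invoices/42">View invoice</a></p>'
    '<img src="https://example.com/logo.png" alt="Logo">'
)

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
print(collector.images)
```

In the rendered email only "View invoice" and the logo are visible. The `href` and `src` values exist only in the HTML, which is why an engine that works from the rendered appearance cannot see them.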

🧠 Advanced: When You Need More Than Just the URL

Sometimes you need not just the URL but the document located at that URL (for example, a PDF download link, an invoice link, or an attachment link).

Airparser does not download external files from URLs on its own. To achieve this, use an automation platform.

Export the URL field from Airparser to Zapier or Make, then fetch the document from that URL and import it back into Airparser.

The full workflow looks like this:

  1. Send the email to Airparser.

  2. Airparser extracts the data and the URL.

  3. Export the URL to Zapier or Make.

  4. In Zapier/Make, use the “import document” action to download the file from the URL and send it back to Airparser for parsing.
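If you script the download yourself rather than relying on a built-in action, the fetch half of step 4 can be sketched in Python as below. This is a minimal illustration under assumptions: the function name `download_document` and the example URL are hypothetical, and sending the file back to Airparser is deliberately not shown because it depends on how your inbox receives documents.

```python
import urllib.request
from pathlib import Path


def download_document(url: str, destination: Path, timeout: float = 30.0) -> int:
    """Download the file at `url` to `destination`; return the bytes written."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    destination.write_bytes(data)
    return len(data)


# Example call (hypothetical URL taken from the exported field):
# download_document("https://example.com/invoices/42.pdf", Path("invoice.pdf"))
```

The sketch reads the whole response into memory, which is fine for typical invoices and attachments. For very large files, reading the response in chunks would be the safer choice.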

βœ”οΈ Alternative: Use Post-Processing With Python Regex

You can also use the post-processing step to scan the raw email body with regex and extract custom patterns.

This is useful for:

  • URLs inside unusual HTML attributes

  • Encoded URLs

  • Complex query-string links
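As an illustration of the three cases above, here is a minimal Python sketch using only the standard library. The helper names (`extract_urls`, `unwrap_redirect`), the `url` query parameter, and the sample body are assumptions for this example; how the raw body is exposed inside your post-processing step depends on your inbox, so adapt the input accordingly.

```python
import html
import re
from urllib.parse import parse_qs, urlsplit

# Matches http(s) URLs up to the first whitespace, quote, or angle bracket,
# regardless of which HTML attribute they appear in.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(raw_body: str) -> list[str]:
    """Return the unique URLs found in the raw body, in order of appearance."""
    # Decode HTML entities so "&amp;" in a query string becomes "&".
    text = html.unescape(raw_body)
    matches = (m.rstrip(".,;)") for m in URL_PATTERN.findall(text))
    return list(dict.fromkeys(matches))


def unwrap_redirect(url: str, param: str = "url") -> str:
    """Return the decoded target stored in a query parameter, or the URL itself."""
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else url


# Hypothetical body: one URL in a custom attribute, one wrapped in a tracking link.
sample_body = (
    '<td data-href="https://example.com/invoice?id=42&amp;token=abc">View</td>'
    '<a href="https://track.example.com/click'
    '?url=https%3A%2F%2Ffiles.example.com%2Finvoice.pdf">Download</a>'
)

for found in extract_urls(sample_body):
    print(unwrap_redirect(found))
```

The pattern is intentionally broad: it matches on the URL scheme rather than on a specific attribute, so it also catches links in attributes such as `data-href`. Tighten it if it picks up URLs you do not want.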

