AI_PARSE_DOCUMENT(): Get PDF Invoices Into The Database

The new ai_parse_document() function lets us extract text from various document types, including PDFs and images. I decided to test it by converting PDF invoices into a database-friendly format. To read PDF documents, we first load the files in binary format and then pass them to the ai_parse_document function, which turns them into a string value that we can store in a table.

Here is the basic syntax to achieve it 👇

WITH all_files AS (
  SELECT
    path,
    content
  FROM
    READ_FILES("/Volumes/hub/default/invoices", format => 'binaryFile')
)
-- Parse the files using ai_parse_document
SELECT
  path,
  ai_parse_document(content) as parsed
FROM
  all_files;

Output of the above (screenshot: SunnyData / Hubert Dudek).

The easiest way to create the above syntax is to go to "Agents" and point the PDF reader at our volume. I uploaded 3 example PDFs to the volume. "Use PDFs in Agent Bricks" will create a ready-to-run SQL script, and that script will be much longer than the example above, as it also generates "cleaned" source data, which looks like a markdown invoice.

(Screenshot: SunnyData / Hubert Dudek)

We can see that our invoices were converted into well-formatted strings and are now stored in a UC table.

(Screenshot: SunnyData / Hubert Dudek)
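If you prefer to persist the parsed output yourself rather than rely on the generated script, a minimal version can look like the sketch below. Note that the actual Agent Bricks script goes further and also derives the cleaned, markdown-like text column that the extraction step later reads.

-- Minimal sketch: store the parsed invoices in a Unity Catalog table.
-- The Agent Bricks-generated script additionally adds a cleaned `text` column.
CREATE OR REPLACE TABLE hub.default.invoices_parsed AS
SELECT
  path,
  ai_parse_document(content) AS parsed
FROM
  READ_FILES("/Volumes/hub/default/invoices", format => 'binaryFile');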

Great, but we don't need a single text field. We need to separate the invoice number and supplier, and even keep all line items in a separate table. Here, another agent tool helps: Information Extraction. Once we have a few example invoices, it should automatically build a proper JSON structure and deploy a small AI model (such as Managed LangChain) to process it. In "Agents," we navigate to "Information Extraction" and select the field with the invoice text. Based on my experience, the default settings work fine, since invoices are a popular use case, and it has no problem automatically building a JSON structure that matches them.

(Screenshot: SunnyData / Hubert Dudek)

Once the model and endpoint are created, we can use SQL to convert the extracted text into a structured variant type, with separate fields such as seller and invoice_number, and even an array of items (in the screenshot above, we can see that Agent Bricks detected that the items form an array on every invoice).

WITH query_results AS (
  SELECT
    `text` AS input,
    ai_query(
      'kie-f69b5222-endpoint',
      input,
      failOnError => false
    ) AS response
  FROM
    `hub`.`default`.`invoices_parsed`
)
SELECT
  try_parse_json(response.result) AS doc,
  input,
  response.errorMessage AS error
FROM
  query_results;

Output of the above (screenshot: SunnyData / Hubert Dudek).

Our data is now in a nicely structured variant type, making it relatively easy to extract the necessary information.
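The queries below read from a table named invoices_json. One simple way to produce it is to wrap the extraction query above in a CREATE TABLE AS statement, reusing the same illustrative endpoint name:

-- Sketch: persist the extraction results as the invoices_json table traversed below.
CREATE OR REPLACE TABLE invoices_json AS
WITH query_results AS (
  SELECT
    `text` AS input,
    ai_query('kie-f69b5222-endpoint', `text`, failOnError => false) AS response
  FROM
    `hub`.`default`.`invoices_parsed`
)
SELECT
  try_parse_json(response.result) AS doc,
  input,
  response.errorMessage AS error
FROM
  query_results;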

We can easily extract company data from invoices using basic methods to traverse the variant type.

SELECT
    doc:invoice_number       AS invoice_number,
    TO_DATE(doc:issue_date)  AS issue_date,
    TO_DATE(doc:due_date)    AS due_date,
    doc:buyer.name           AS buyer_name,
    doc:buyer.address        AS buyer_address,
    doc:buyer.vat_id         AS buyer_vat_id,
    doc:seller.name          AS seller_name,
    doc:seller.address       AS seller_address,
    doc:seller.vat_id        AS seller_vat_id
FROM invoices_json;

Output of the above (screenshot: SunnyData / Hubert Dudek).

We can also explode the items to display them in a separate table.

WITH items AS (
  SELECT
    CAST(doc:invoice_number AS STRING) AS invoice_number,
    explode(CAST(doc:items AS ARRAY<STRUCT<description: STRING, quantity: INT, unit_price: STRING>>)) AS product
  FROM
    invoices_json
)
SELECT
  invoice_number, product.description, product.quantity, product.unit_price
FROM
  items;

Output of the above (screenshot: SunnyData / Hubert Dudek).

It is easy to extract data from all invoices uploaded to volumes. This is just one example of how Databricks can help automate your business processes. Agent Bricks also suggests creating a DLT pipeline, making it simple to convert this process into an incremental one. Combined with a file arrival trigger (when a new invoice arrives), you can build an ideal architecture for your accounting team.
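As a rough sketch of what such an incremental version could look like, a streaming table in a DLT pipeline can read only newly arrived files from the volume (the table name invoices_parsed_incremental is made up for illustration; the pipeline suggested by Agent Bricks may look different):

-- Hypothetical DLT sketch: incrementally parse only newly arrived invoice PDFs.
CREATE OR REFRESH STREAMING TABLE invoices_parsed_incremental AS
SELECT
  path,
  ai_parse_document(content) AS parsed
FROM
  STREAM read_files("/Volumes/hub/default/invoices", format => 'binaryFile');

Scheduling the pipeline with a file arrival trigger on the same volume means it runs only when a new invoice lands.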

How much does it cost?

During two days of experimenting with agents, I spent almost 30 DBUs on Serverless Real-Time Inference. In my case, one DBU of that service costs 0.082 USD before the discount, so roughly 30 × 0.082 ≈ 2.5 USD in total. I haven't noticed any charges for the hours when the AI serving endpoint was idle.

Hubert Dudek

Databricks MVP | Advisor to the Databricks Product Board and technical advisor to SunnyData

https://www.linkedin.com/in/hubertdudek/