ai_parse_document(): Get PDF Invoices Into the Database
The new ai_parse_document() function extracts text from various document types, including PDFs and images. I decided to test it by converting PDF invoices into a database-friendly format. To read PDF documents, we load the files in binary format and pass them to the ai_parse_document function, which transforms each document into a string value that we can store in a table.
Here is the basic syntax to achieve it 👇
WITH all_files AS (
  SELECT path, content
  FROM READ_FILES("/Volumes/hub/default/invoices", format => 'binaryFile')
)
-- Parse the files using ai_parse_document
SELECT
  path,
  ai_parse_document(content) AS parsed
FROM all_files;
Output of the above:
Source: SunnyData / Hubert Dudek
The easiest way to generate the above syntax is to go to "Agents" and point the PDF reader at our volume. I uploaded three example PDFs to the volume. "Use PDFs in Agent Bricks" will create a ready-to-run SQL script; that script is much longer than the example above, as it also generates "cleaned" source data, which looks like a markdown-formatted invoice.
We can see that our invoices were converted into well-formatted strings and are now stored in a UC table.
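As a rough sketch of what the generated script does at this stage, the parsed output can be persisted into a Unity Catalog table. The table name `hub.default.invoices_parsed` matches the later queries in this post, but the exact shape of the Agent Bricks script (including its extra "cleaned" text column) will differ:

```sql
-- Hypothetical sketch: persist parsed PDFs into a UC table.
-- The real Agent Bricks script is longer and also produces a
-- "cleaned" markdown-like text column.
CREATE OR REPLACE TABLE hub.default.invoices_parsed AS
WITH all_files AS (
  SELECT path, content
  FROM READ_FILES("/Volumes/hub/default/invoices", format => 'binaryFile')
)
SELECT
  path,
  ai_parse_document(content) AS parsed
FROM all_files;
```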
Great, but we don't need a single text field. We need to separate the invoice number and supplier, and even put all line items in a separate table. Here, another agent tool helps: Information Extraction. Given a few example invoices, it automatically builds a suitable JSON structure and deploys a small AI model to process it. In "Agents," we navigate to "Information Extraction" and select the field containing the invoice text. Based on my experience, the default settings work fine, as invoices are a popular use case, so it has no problem automatically building a JSON structure that matches them.
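For illustration, the extracted structure for a single invoice might look roughly like the JSON below. The field names follow the queries later in this post; the values are placeholders, and the exact schema Agent Bricks infers from your sample invoices may differ:

```json
{
  "invoice_number": "INV-001",
  "issue_date": "2025-01-01",
  "due_date": "2025-01-14",
  "buyer":  { "name": "...", "address": "...", "vat_id": "..." },
  "seller": { "name": "...", "address": "...", "vat_id": "..." },
  "items": [
    { "description": "...", "quantity": 1, "unit_price": "100.00" }
  ]
}
```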
Once the model and endpoint are created, we can use SQL to convert the extracted text into a structured variant type, with separate fields such as seller and invoice_number, and even an array of items (in the screenshot above, Agent Bricks detected that the items form an array on every invoice).
SELECT *
FROM (
  WITH query_results AS (
    SELECT
      `text` AS input,
      ai_query(
        'kie-f69b5222-endpoint',
        input,
        failOnError => false
      ) AS response
    FROM (
      SELECT `text` FROM `hub`.`default`.`invoices_parsed`
    )
  )
  SELECT
    try_parse_json(response.result) AS doc,
    input,
    response.errorMessage AS error
  FROM query_results
)
Output of the above:
Our data is now in a nicely structured variant type, making it relatively easy to extract the necessary information.
We can easily extract company data from invoices using basic methods to traverse the variant type.
SELECT
  doc:invoice_number,
  TO_DATE(doc:issue_date) AS issue_date,
  TO_DATE(doc:due_date) AS due_date,
  doc:buyer.name AS buyer_name,
  doc:buyer.address AS buyer_address,
  doc:buyer.vat_id AS buyer_vat_id,
  doc:seller.name AS seller_name,
  doc:seller.address AS seller_address,
  doc:seller.vat_id AS seller_vat_id
FROM invoices_json;
Output of the above:
We can also explode the items to display them in a separate table.
WITH items AS (
  SELECT
    CAST(doc:invoice_number AS STRING) AS invoice_number,
    explode(
      CAST(doc:items AS ARRAY<STRUCT<description: STRING, quantity: INT, unit_price: STRING>>)
    ) AS product
  FROM invoices_json
)
SELECT
  invoice_number,
  product.description,
  product.quantity,
  product.unit_price
FROM items;
Output of the above:
It is easy to extract data from all invoices uploaded to volumes. This is just one example of how Databricks can help automate your business processes. Agent Bricks also suggests creating a DLT pipeline, making it simple to convert this process into an incremental one. Combined with a file arrival trigger (when a new invoice arrives), you can build an ideal architecture for your accounting team.
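As a rough sketch of the incremental variant, the parsing step could become a streaming table in a DLT (Lakeflow) SQL pipeline, so that only newly arrived PDFs are processed on each run. The table name below is a placeholder, and the pipeline Agent Bricks proposes may be structured differently:

```sql
-- Hypothetical DLT sketch: incrementally parse only new files
-- landing in the volume (name invoices_parsed_stream is assumed).
CREATE OR REFRESH STREAMING TABLE invoices_parsed_stream AS
SELECT
  path,
  ai_parse_document(content) AS parsed
FROM STREAM READ_FILES(
  "/Volumes/hub/default/invoices",
  format => 'binaryFile'
);
```

Paired with a file arrival trigger on the volume, this gives you the event-driven, incremental setup described above.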
How much does it cost?
During the two days of experimenting with agents, I spent almost 30 DBUs on Serverless Real-Time Inference. In my case, one DBU cost 0.082 USD before the discount, so the total came to around 30 × 0.082 ≈ 2.5 USD. I haven't noticed any charges for the hours when the AI serving endpoint was idle.