From https://cloud.google.com/document-ai/docs/process-forms, I can see some examples of processing single files. But in most cases, companies have buckets full of documents. In that case, how do you scale Document AI processing? Do you use Document AI in conjunction with Spark, or is there another way?
2 Answers
I could only find the following: batch_process_documents, which processes many documents asynchronously and writes the results to Cloud Storage.
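For reference, here is a minimal sketch of such a batch request with the Python client library (google-cloud-documentai); the project, location, processor ID, and bucket paths below are placeholders:

```python
from google.cloud import documentai_v1 as documentai


def batch_process(input_prefix: str, output_uri: str) -> None:
    """Submit one async batch job and block until it finishes."""
    client = documentai.DocumentProcessorServiceClient()

    request = documentai.BatchProcessRequest(
        # Placeholder project/location/processor values.
        name=client.processor_path("my-project", "us", "my-processor-id"),
        # Process every file under the given gs:// prefix.
        input_documents=documentai.BatchDocumentsInputConfig(
            gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=input_prefix)
        ),
        # Results are written to Cloud Storage as sharded Document JSON.
        document_output_config=documentai.DocumentOutputConfig(
            gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri=output_uri
            )
        ),
    )

    # batch_process_documents returns a long-running operation.
    operation = client.batch_process_documents(request=request)
    operation.result(timeout=1800)


batch_process("gs://my-bucket/input/", "gs://my-bucket/output/")
```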
From there, I think we can parameterise the job with an input bucket prefix and distribute the work across several machines.
All of that could be orchestrated via Airflow, for example (see the sketch below).
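If you go the Airflow route, here is a minimal sketch of fanning the work out per bucket prefix (assuming Airflow 2.4+; the DAG id, the prefixes, and the my_pipeline module holding the batch_process helper above are all hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module containing the batch_process helper sketched above.
from my_pipeline import batch_process

with DAG(
    dag_id="document_ai_batch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # One task per bucket prefix, so the shards can run in parallel
    # on whatever workers Airflow has available.
    for prefix in ["invoices", "contracts", "receipts"]:
        PythonOperator(
            task_id=f"process_{prefix}",
            python_callable=batch_process,
            op_kwargs={
                "input_prefix": f"gs://my-bucket/{prefix}/",
                "output_uri": f"gs://my-bucket/output/{prefix}/",
            },
        )
```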
You will need to use Batch Processing to handle multiple documents at once with Document AI.
This page in the Cloud Documentation shows how to make Batch Processing requests with REST and the client libraries: https://cloud.google.com/document-ai/docs/send-request#batch-process
This codelab also illustrates how to do this in Python with the OCR processor: https://codelabs.developers.google.com/codelabs/docai-ocr-python
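Once the batch operation completes, the sharded Document JSON can be read back from the output bucket. A minimal sketch, with placeholder bucket name and prefix:

```python
from google.cloud import documentai_v1 as documentai
from google.cloud import storage

storage_client = storage.Client()

# Placeholder bucket/prefix: wherever the batch request wrote its output.
for blob in storage_client.list_blobs("my-bucket", prefix="output/"):
    # Batch output is sharded Document JSON; skip any other objects.
    if not blob.name.endswith(".json"):
        continue
    document = documentai.Document.from_json(
        blob.download_as_bytes(), ignore_unknown_fields=True
    )
    print(f"{blob.name}: {len(document.pages)} page(s)")
```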