How to use LLMs to parse document data at scale


An interesting question came up during last week’s tech talk about how to run AI/ML on AWS.

Let’s say you’re parsing medical billing documents and you want to extract fields only from bills where a radiologist treated someone’s right foot.

Could Amazon Textract Queries extract that kind of domain-specific information from those documents?
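For context, Textract Queries lets you attach natural-language questions to an `AnalyzeDocument` call and get back answer blocks. Here is a minimal offline sketch of the request payload and response parsing; the question texts and the trimmed-down sample response are illustrative, not real Textract output.

```python
# Sketch of the Textract Queries request shape and response parsing.
# The live call would look roughly like:
#   boto3.client("textract").analyze_document(
#       Document={"S3Object": {"Bucket": bucket, "Name": key}},
#       FeatureTypes=["QUERIES"],
#       QueriesConfig=build_queries_config(questions),
#   )

def build_queries_config(questions):
    """Build the QueriesConfig payload for AnalyzeDocument."""
    return {"Queries": [{"Text": q} for q in questions]}

def extract_query_answers(response):
    """Pair each QUERY block with the text of its QUERY_RESULT answer."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for rid in rel["Ids"]:
                    answers[block["Query"]["Text"]] = by_id[rid].get("Text", "")
    return answers

# Trimmed-down example in the shape Textract returns.
sample = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What body part was treated?"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "Right foot"},
    ]
}
print(extract_query_answers(sample))
# → {'What body part was treated?': 'Right foot'}
```

The open question is whether those answers are any good when the question requires domain knowledge rather than simple key-value lookup.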

I am making some assumptions here, but my guess is that the model Textract uses to extract fields from a document is fine-tuned solely for that purpose and lacks the broader domain knowledge that most of the big general-purpose LLMs have. That means it would likely struggle to pull out that kind of information compared to a more robust LLM tool like Amazon Q.

So how would I go about giving Q access to the Textract data? You could pipe your Textract results into RDS or S3, depending on the density of data in each document and the use case, then use Amazon Q Business to query that data via Amazon Kendra. That would give you a pretty intelligent way to extract data from the documents.
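The S3 leg of that pipeline can be sketched as below. This assumes Kendra's S3 data source and its side-car `<object-key>.metadata.json` convention for searchable attributes; the bucket layout and the attribute names ("specialty", "body_part") are hypothetical choices for this example, not anything the services mandate.

```python
import json

def textract_to_text(textract_response):
    """Flatten Textract LINE blocks into plain text that Kendra can index."""
    return "\n".join(
        b["Text"] for b in textract_response["Blocks"]
        if b["BlockType"] == "LINE"
    )

def s3_payload(doc_id, text, metadata):
    """Bundle document text with attributes for the Kendra S3 connector.

    Kendra's S3 data source can pick up a side-car metadata file named
    <object-key>.metadata.json; the attribute names passed in here
    (e.g. "specialty", "body_part") are hypothetical.
    """
    return {
        f"docs/{doc_id}.txt": text,
        f"docs/{doc_id}.txt.metadata.json": json.dumps(
            {"Title": doc_id, "Attributes": metadata}
        ),
    }

# Uploading would then be one put_object per key/body pair, e.g.:
#   s3 = boto3.client("s3")
#   for key, body in s3_payload(doc_id, text, attrs).items():
#       s3.put_object(Bucket="my-billing-corpus", Key=key, Body=body)
```

With the corpus in S3, Kendra indexes it and Q Business sits on top for natural-language questions like the radiologist/right-foot filter above.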

Though I am thoroughly surprised to see that, at the time of this writing, Amazon Kendra does not support DynamoDB or Neptune, which I would have liked to see as options for this.

Since I publicly stated it's not supported, those data sources will likely be announced by the end of the week (JK).

If you are interested in diving deeper into how to use AI/ML with AWS, shoot me a message. I am thinking about starting up a mastermind group on the topic.