Automation for pdf extraction

Hello everyone,

I’m looking to develop an automation process to extract specific information from 15 to 30 pages within PDF documents. The challenge is that the required values are not consistently located on the same page or in the same place. Using a PDF parser tool like pdf.co to extract this information might become expensive as I’d need to parse the entire file.

I’ve considered uploading the file to OpenAI in order to train an agent to extract this information; however, it seems that OpenAI doesn’t have direct access to the PDF and provides outputs based on the examples I provide. I also attempted to convert the full PDF to text, but it becomes too large to create a prompt for OpenAI or other similar platforms. When I divided it into chunks, I encountered difficulties in using all the chunks to prompt the agent.

Does anyone have any suggestions on how to create a solution for this, or how to improve the interaction with OpenAI’s assistance?

Any help would be greatly appreciated.

Thanks

You’ll need to set up a custom assistant and use the Message an Assistant module to use previously uploaded files.

For more information, see

samliewrequest private consultation

Join the unofficial Make Discord server to chat with us!

I did this but the assistant doesn’t reply based on the content I submit but with the examples I provided in the instructions. Do you know what can be wrong?

Hi @Francisco_Reis

Ensure that you have provided system, assistant and user correctly. We have done a video on PDF Extraction here.

If you need any setup guidance, don’t hesitate to reach out to us.

Regards,
Msquare Automation - Gold Partner of Make

Book a Free Consultation | Connect Live

Explore our YouTube Channel for valuable insights and updates!

Besides the solutions I have made here: How to PDF into openAI (Solution!)

I’ve found the pdf.co module a lot better to extract pdf data. As it’s setup specifically to extract pdf data.

Also makes openAI has a module to set up a structured JSON module, which is also worth noting too.

Hello
thanks for the reply!

I used your method but in the last part when I message the assistant it doesn’t give me the information form the pdf I just uploaded

Check the back end of the playground and see what happened.

Which step did it fail it? Need more specific info. Seems it can’t find the pdf file.

Is the pdf file being moved to the vector store?
Is the correct vector store being accessed?
You have enough credits in your account?
Using GPT-4?

I dont know. The setup is exactly like yours.
The pdf is in the vector store, and I do have enough credits in my account. I selected the right vector store and the assistant was created with gpt4o. But then the output is not right

If you give the same input and pdf in playground do you get the output you want?

Essential if the playground works, then all you are doing via make is automated it, and the hardest step was for it to find the pdf. Which you say is there.

Hello Francisco_Reis,

To accurately parse specific pages from your PDF file using the PDF.co Document Parser, please use the pages parameter. By specifying the pages you want to extract, the Document Parser will focus only on those pages instead of processing the entire file. For more details on how to use the pages parameter, please visit our documentation at the following link: API Docs

If you have any questions or need further assistance, please let us know.

Hi @Francisco_Reis

You can check the detailed demo of PDF extraction here.

Regards,
Msquare Automation - Gold Partner of Make

Free Consultation | Live Implementation

Visit us here | Youtube Channel