PDF.co Parse PDF to Excel

Hi there, I’m trying to use PDF.co to parse a PDF and eventually push some fields into an Excel file.

:cowboy_hat_face:I apologize in advance for the long post :face_with_thermometer: :face_with_head_bandage:

The PDF file I’m parsing contains purchase orders, one order per page, and each order may have several SKUs.

The PDF has multiple pages, and each page looks like this:

I’m using the PDF.co PDF to JSON module. Its output is a mapping of the PDF by rows and columns, and I saw two ways to get the data:

  1. Body/Document/Page/Row/Column
  2. Body/Object Values, which is already a matrix of the row/column cells that have a value in the specific document. I used this to map the data.

If I set the PDF.co module to run on one page, everything works fine and is easily mapped into the Excel file.

The challenges start when I want to run the automation on the entire PDF file, which contains many pages (one order per page):

  1. As mentioned above, I’m mapping according to the matrix in “Object Values”, for example Row_2_Column_11. However, if the output contains more than one page, the key also includes a page indication, for example Page_3_Row_2_Column_11. So if I map a specific page into the Excel file, it will only output that specific page… I think some sort of iteration might be needed, but I’m not sure how to approach it…

  2. When there is more than one row on the PO, as in the example above, the same complexity applies: since I’m mapping a specific row and column, I’m not mapping the second row onwards.

  3. Even when parsing just a single page with a single line in the order, I realized that the row is sometimes not in the same position. If on one page I map Row=2 and Column=3, on another page it might be Row=3 and Column=3, and I would get an empty value. I think that somehow, although it is not visually evident from looking at the PDF, the content jumps one row from time to time. Any creative ideas on how to tackle it?

  4. A bonus question: is there a way to sort the Excel file by SKU, for example, using Make.com?
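
The logic behind points 1 to 4 can be sketched outside Make, in Python, just to show the idea. This is not a Make scenario and not PDF.co's API: the key shape (Page_3_Row_2_Column_11) comes from the post, while the function names, the "SKU" and "Qty" header labels, and the sample values are assumptions made up for illustration. The sketch groups the flat keys by page and row, finds the table by its header text instead of a fixed row number, and sorts the collected lines by SKU.

```python
import re
from collections import defaultdict

# Key shape taken from the post: Page_3_Row_2_Column_11.
# Single-page output has no page prefix (Row_2_Column_11), so it is optional.
KEY_RE = re.compile(r"^(?:Page_(\d+)_)?Row_(\d+)_Column_(\d+)$")


def group_cells(object_values):
    """Turn the flat key/value mapping into {page: {row: {column: value}}}."""
    pages = defaultdict(lambda: defaultdict(dict))
    for key, value in object_values.items():
        match = KEY_RE.match(key)
        if match is None:
            continue  # skip keys that are not cell coordinates
        page = int(match.group(1) or 1)
        row, column = int(match.group(2)), int(match.group(3))
        pages[page][row][column] = value
    return pages


def extract_lines(pages, header_label="SKU"):
    """Yield one dict per order line, locating the table by its header row."""
    for page in sorted(pages):
        rows = pages[page]
        # The header row is wherever the label appears, not a fixed row number.
        header_row = next(
            (r for r in sorted(rows) if header_label in rows[r].values()), None
        )
        if header_row is None:
            continue
        headers = rows[header_row]  # {column: header text}
        for r in sorted(rows):
            if r <= header_row:
                continue
            line = {headers[c]: v for c, v in rows[r].items() if c in headers}
            if line.get(header_label):
                yield {"page": page, **line}


# Hypothetical sample data, invented for this sketch.
sample = {
    "Page_1_Row_1_Column_1": "SKU", "Page_1_Row_1_Column_2": "Qty",
    "Page_1_Row_2_Column_1": "B-200", "Page_1_Row_2_Column_2": "3",
    "Page_1_Row_3_Column_1": "A-100", "Page_1_Row_3_Column_2": "1",
    # On page 2 the table starts one row lower, as described in point 3.
    "Page_2_Row_2_Column_1": "SKU", "Page_2_Row_2_Column_2": "Qty",
    "Page_2_Row_3_Column_1": "C-300", "Page_2_Row_3_Column_2": "7",
}

# Point 4: sort all collected lines by SKU before writing them anywhere.
lines = sorted(extract_lines(group_cells(sample)), key=lambda l: l["SKU"])
for line in lines:
    print(line)
```

Anchoring on the header text rather than on Row=2 is what makes the one-row jump harmless. The sketch assumes the header label appears exactly once per page and that every row below it is an order line, which may not hold for a real PO layout.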

That’s it, I think it’s already too much for one post… :expressionless: :zipper_mouth_face:

@samliew, any ideas regarding 1 or 2, maybe?

Welcome to the Make community!

I’ve only used CloudConvert + Text Parser to extract data from PDFs, so I’m not sure if I can help with PDF.co

1. Scenario blueprint

Please export the scenario blueprint file to allow others to view the mappings and settings. At the bottom of the scenario editor, you can click on the three dots to find the Export Blueprint menu item.

(Note: Exporting your scenario will not include private information or keys to your connections)

Uploading it here will look like this:

blueprint.json (12.3 KB)

2. And most importantly, Output bundles

Please provide the output bundles of the modules by running the scenario, then click the white speech bubble on the top-right of each module, save the bundle contents in your text editor as a bundle.json file, and upload it here into this discussion thread.

Providing the output bundles will allow others to replicate what is going on in the scenario even if they do not use the external service.

Following these steps will allow others to assist you here. Thanks!

Hi @samliew, you are absolutely right… In this case I probably can’t share it, since it contains confidential info and it’s too much work to remove all of that (-:

It was more of a theoretical question about mapping using the result matrix.

I might try CloudConvert and see what they offer. Is there a thread on extracting certain field values from PDF using CloudConvert?

Thank you very much nevertheless! :trophy: :hugs:

If you can convert the PDF to plain text (removing private info of course), I can create patterns for the Text Parser “Match Pattern” module.
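
To give a rough idea of what such a pattern does, here is a minimal Python sketch of matching order lines in plain text with a regular expression. The line layout (SKU, description, quantity, unit price separated by spaces), the group names, and the sample text are all hypothetical, since the real PDF text was not shared. The named-group syntax shown is Python's and may need adjusting for the Text Parser module.

```python
import re

# Hypothetical plain-text order line: "<SKU>  <description>  <qty>  <unit price>"
LINE_RE = re.compile(
    r"^(?P<sku>[A-Z0-9-]+)[ \t]+"      # SKU: capitals, digits, dashes
    r"(?P<description>.+?)[ \t]+"      # description: shortest text that fits
    r"(?P<qty>\d+)[ \t]+"              # quantity: whole number
    r"(?P<price>\d+\.\d{2})$",         # unit price with two decimals
    re.MULTILINE,
)

# Invented sample text standing in for the converted PDF.
text = """Purchase Order 1001
B-200  Blue widget  3  9.50
A-100  Red widget, large  1  12.00
"""

# Each match becomes one row; lines that do not fit the pattern are skipped.
rows = [m.groupdict() for m in LINE_RE.finditer(text)]
for row in rows:
    print(row)
```

Because every matching line becomes its own row, an order with several SKUs needs no fixed row or column positions.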
