Extract item lines from PDF into Excel

Hi guys,

I have several different Purchase Orders in PDF (Text). Each purchase order contains multiple items lines. The files are formatted quite well, but there are some issues, for example the item name will appear on the following line, so each item is actually broken down into 2 lines:
image

The approaches I have tried so far:

  • Use PDF.co to convert to JSON. This works well and producing a workable JSON, but there are some challenges here:
  1. I cant use ChatGPT to manipulate the data and merge the items into 1 row because the JSON file is more than 150K characters.
  2. If each product-line would come in 1 single row I would be able to process it on Make with iterators etc, but if I iterate on row 1 the product name only appears on row 2 which makes this kind of manipulation using Make tools quite complicated
  3. Would trying to break the JSON into chunks and then send them to chatgpt to manipulate and merge would be a good idea?
  • If I upload the PDF directly on the OpenAI console and ask him to parse it correctly it works well but its a manual process. Is there a way to send a PDF using the API?

Any suggestions will be welcomed ? I dont mind taking any other approach :grinning:

Hi @OmriPe,

What about converting it in json as you did, but after iterating you just filter out bundle position if it’s Odd or Even with mod(bundleposition;2)=0 being even and mod(bundleposition;2)=1 being odd. Assuming you’re only working with 2 lines. If not, we can just iterate till next “Pos” column value is not empty. If you provide what you’re trying to do with the data, I might be able to help to determine the output data structure and optimize the scenario.

Hmmm, not sure i follow…

At the moment i set filters to start at bundle position X to avoid all the stuff coming before the line items.

So let’s assume i start with Bundle 10, then i take the data and ADD a certain row in Excel.
Maybe what i should do is then UPDATE the same row with info from the following bundle. The trick here is to know which row to update in the Excel and how to know when its an add and when its an update (probably that can be done with MOD).

Still a bit confused :laughing:

Can you send a screen of your current scenario ? The trick of knowing if it’s add or update is filtering through a router with 2 choices : either bundleposition is odd or even. That’s it. You probably will need a ‘set variable’ just after the ‘add row’ module to set ‘current row’, and then in the next router just before ‘update a row’ add ‘get variable’ module to reference that row. cuz data in different router branch don’t transfer to other branches. There might be another solution if you’re looking to optimize operations and don’t use setvariable-getvariable. You can set up this solutions and if it works, we can talk about optimization afterwards.

1 Like

@kudracha thank you, you have given me some good ideas to work with. Let me play with it a bit and see what i come up with :blush::pray::pray::pray::+1: