Slicing Arrays and Collections

Hey guys,

I have a project where I'm trying to take a PDF Purchase Order and extract all the items into an Excel sheet. I have tried this with ChatGPT and Kimi, but they make too many mistakes and invent stuff, so I decided to do it the old-fashioned way, which means:

  1. Use PDF.co to convert the PDF to JSON
  2. Slice the JSON file. For example, I know that the relevant information starts only from Line 7 as below, so I want to remove lines 1-6.
  3. Merge the pages so I don't have to worry about some items spilling between the pages (it's happening)
  4. Send the cleaned JSON to ChatGPT to receive a slightly manipulated JSON
    P.S. I cannot send the raw JSON to ChatGPT since it's above the character limit (278,097
    characters, which is why I want to slice it first)
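
Steps 2 and 3 can be sketched outside the scenario like this, just to make the intended result concrete. This is a minimal Python sketch, not the actual PDF.co schema: the keys "pages" and "rows", the helper name slice_and_merge, and the sample data are all made up for illustration, and it assumes the six unwanted lines repeat at the top of every page.

```python
import json

HEADER_ROWS = 6  # lines 1-6 carry no item data; useful rows start at line 7


def slice_and_merge(document, skip=HEADER_ROWS):
    """Drop the first `skip` rows of every page and merge the rest into one list."""
    merged = []
    for page in document["pages"]:          # "pages" is an assumed key name
        merged.extend(page["rows"][skip:])  # "rows" is an assumed key name
    return merged


# Tiny stand-in for the real PDF-to-JSON output (hypothetical data).
sample = {
    "pages": [
        {"rows": [f"p1-header-{i}" for i in range(1, 7)] + ["item A", "item B (part 1)"]},
        {"rows": [f"p2-header-{i}" for i in range(1, 7)] + ["item B (part 2)", "item C"]},
    ]
}

items = slice_and_merge(sample)
payload = json.dumps(items)

print(items)  # ['item A', 'item B (part 1)', 'item B (part 2)', 'item C']
print(len(payload), "characters to send on")  # compare against the character limit
```

Because the rows from all pages end up in one list, an item that spills from page 1 to page 2 sits on adjacent rows, and the serialized length can be checked before anything is sent on.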

Here is the PDF2JSON Output:

Here is the info expanded:

Here is what the PDF looks like (the relevant item lines).

If anyone can point me to how to:

  • Merge the pages
  • Slice the unnecessary rows…

Many thanks!

@OmriPe After the PDF.co output, add an Iterator to run through the pages, and then use an Array aggregator.

@Ronak_Bhagdev Thanks, it's a good start, and I'm doing it already, but I still need to see how to do the same on the inner collections…
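
For the inner collections, one possible approach is to repeat the same idea one level down: loop over the rows, and inside each row slice the inner array and keep only the fields that matter. A minimal Python sketch, assuming each row holds a "columns" array of collections with a "text" field plus layout data; those key names, the helper trim_row, and the sample rows are hypothetical and not taken from the real output.

```python
def trim_row(row, first_col=0, last_col=None):
    """Slice the inner "columns" array of one row and keep only each cell's text."""
    columns = row["columns"][first_col:last_col]  # "columns" is an assumed key name
    return [col["text"] for col in columns]       # "text" is an assumed key name


# Hypothetical rows: three useful cells followed by an empty trailing cell.
rows = [
    {"columns": [{"text": "1", "x": 10}, {"text": "Widget", "x": 50},
                 {"text": "4", "x": 200}, {"text": "", "x": 300}]},
    {"columns": [{"text": "2", "x": 10}, {"text": "Gadget", "x": 50},
                 {"text": "7", "x": 200}, {"text": "", "x": 300}]},
]

# Outer loop over rows, inner slice over columns.
trimmed = [trim_row(row, 0, 3) for row in rows]
print(trimmed)  # [['1', 'Widget', '4'], ['2', 'Gadget', '7']]
```

Dropping the layout fields and keeping only the text also shrinks the JSON a lot, which helps with the character limit.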