Hey guys,
I have a project where I'm trying to take a PDF purchase order and extract all the items into an Excel file. I tried this with ChatGPT and Kimi, but they make too many mistakes and invent things, so I decided to do it the old-fashioned way, which means:
- Use PDF.co to convert the PDF to JSON
- Slice the JSON file. For example, I know that the relevant information only starts at line 7 (see below), so I want to remove lines 1-6.
- Merge the pages so I don't have to worry about items spilling over from one page to the next (this is happening)
- Send the cleaned JSON to ChatGPT to get back a slightly manipulated JSON
P.S. I cannot send the raw JSON to ChatGPT because it is above the character limit (278,097 characters), which is why I want to slice it first.
Here is the PDF2JSON Output:
Here is the info expanded:
Here is what the PDF looks like (the relevant item lines).
Can anyone point me to how to:
- Merge the pages
- Slice the unnecessary rows…
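
For reference, the two steps could be sketched in Python roughly like this. The layout is an assumption, not the actual PDF.co schema: the sketch supposes the JSON is an object with a `pages` list, where each page holds a `rows` list, and that the first 6 rows of every page are header lines to drop. The names `merge_pages`, `pages_key`, `rows_key`, and `skip_rows` are hypothetical and would need to be adapted to the real output.

```python
import json


def merge_pages(doc, skip_rows=6, pages_key="pages", rows_key="rows"):
    """Drop the first `skip_rows` rows of each page and join the rest.

    Assumes (hypothetically) that `doc[pages_key]` is a list of pages and
    that each page stores its lines under `rows_key`.
    """
    merged = []
    for page in doc[pages_key]:
        # Slicing removes rows 1..skip_rows; everything after is kept.
        merged.extend(page[rows_key][skip_rows:])
    return merged


# Made-up sample data with 6 header rows per page and an item split
# across the page break.
header = [f"header {i}" for i in range(1, 7)]
sample = {
    "pages": [
        {"rows": header + ["item A", "item B (part 1)"]},
        {"rows": header + ["item B (part 2)", "item C"]},
    ]
}

rows = merge_pages(sample)

# Compact separators keep the character count down before sending it on.
cleaned = json.dumps(rows, separators=(",", ":"))
print(len(cleaned), "characters after cleaning")
```

If only the first page has the 6 header lines, the slice would have to be applied to that page alone instead of to every page.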
Many thanks!