PDF to TEXT

Hello,

I want to extract the raw text from a PDF file. I can download the file with the Google Drive module “download file”.

It gives me raw data. How can I convert it to text?

Thanks
David

Hey David!

Yeah, that’s not easy. PDF is a page-layout format rather than a plain-text one, unlike other simpler formats, so there’s no straightforward way to convert the files.

The way to go is to use OCR (Optical Character Recognition) software. You can search for “OCR API” and try to find some options. I’ve never used any so I can’t recommend based on experience, but I know Make has built-in apps for Google Cloud Vision and PDF4me.

OCR is not perfect, so some typos can occur, but to the best of my knowledge that’s your best (or only) bet.
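
For anyone calling Google Cloud Vision directly instead of through the built-in Make app, here is a minimal sketch of what the OCR request could look like. It assumes the synchronous `files:annotate` REST endpoint as I understand it, which accepts a base64-encoded PDF and only handles a few pages per request. Authentication and the HTTP call itself are left out, and the helper names `build_pdf_ocr_request` and `extract_text` are made up for illustration.

```python
import base64

# Assumed REST endpoint for synchronous PDF annotation (small files only).
VISION_FILES_ANNOTATE_URL = "https://vision.googleapis.com/v1/files:annotate"


def build_pdf_ocr_request(pdf_bytes, pages=(1,)):
    """Build the JSON body for a synchronous OCR request on a PDF."""
    return {
        "requests": [
            {
                "inputConfig": {
                    # Binary PDF data has to travel as base64 text inside JSON.
                    "content": base64.b64encode(pdf_bytes).decode("ascii"),
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # 1-based page numbers to run OCR on.
                "pages": list(pages),
            }
        ]
    }


def extract_text(response_json):
    """Join the recognized text of every page found in the response."""
    page_texts = []
    for file_response in response_json.get("responses", []):
        for page_response in file_response.get("responses", []):
            annotation = page_response.get("fullTextAnnotation", {})
            page_texts.append(annotation.get("text", ""))
    return "\n".join(page_texts)
```

The "raw data" from the Google Drive module is the binary PDF itself, so it would go in as `pdf_bytes`, and the text comes back per page under `fullTextAnnotation`.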

I hope this at least points you in the right direction.

Thanks Bruno, will try this!

Hey David,

I used Google Cloud Vision to solve this issue. It worked like a charm and was super easy to set up.

We use it to convert invoices to text. It works at >95% accuracy. Feel free to reach out if you have questions.

Hey Trond, I’m trying to do something similar with invoices and Google Cloud Vision.

I’ve got GCV working and to some extent Text Parsing working using RegEx. But I’m intrigued by what you’re doing to get the invoice data out of the files and into the format that you’re using.

How are you getting things like invoice date, name, total etc?

Would you mind sharing more of your scenario and how it works? :grin:
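
To make the RegEx approach concrete, here is a rough sketch of pulling a few fields out of OCR text. The patterns and the sample invoice are hypothetical, not the ones from my scenario. Real invoices vary a lot, so expect to tune them per supplier.

```python
import re

# Hypothetical patterns for illustration only.
DATE_RE = re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b")
# \btotal\b skips "Subtotal"; the amount may carry thousands separators.
TOTAL_RE = re.compile(
    r"\btotal\b[^\d\n]*(\d+(?:[.,]\d{3})*(?:[.,]\d{2})?)", re.IGNORECASE
)
NUMBER_RE = re.compile(
    r"\binvoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)

# Made-up OCR output used as a sample input.
SAMPLE_TEXT = (
    "ACME GmbH\n"
    "Invoice No: INV-2023-017\n"
    "Date: 05.03.2023\n"
    "Subtotal 100,00\n"
    "Total EUR 119,00\n"
)


def parse_invoice(text):
    """Return the first match for each field, or None when it is missing."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(1) if match else None

    return {
        "date": first(DATE_RE),
        "total": first(TOTAL_RE),
        "number": first(NUMBER_RE),
    }


if __name__ == "__main__":
    print(parse_invoice(SAMPLE_TEXT))
```

This works for rigid layouts but breaks as soon as a supplier words things differently, which is why I’m curious about other approaches.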


Thanks for reaching out.

Here are some screenshots. I hope they help :slight_smile:

Set-Up of Text parser (replace):

Set-up of ChatGPT:

Prompt I am using to extract the data:
"I want you to act as an accountant. The company you are working for is called YOUR COMPANY NAME. You are tasked with handling the inbound invoices, your company receives. Your task is to specifically extract certain data from any inbound invoice for further processing of the invoices.

You will output your results into a result-array.

Here is the Text:{{26.text}}

The data you are looking for is:
Invoicing Company name (i.e. the company which has issued the invoice),
Invoice Date (i.e. the date on which the invoice has been issued),
Invoice Total Amount (i.e. the total amount which needs to be paid including VAT, excluding any currency signs; use the euro amount if two options are available),
Invoice Number (i.e. the unique identifier of this invoice),
Invoice Email (i.e. the email of the invoicing company),
Invoice Currency (i.e. the currency of the total amount, EUR or USD)

The invoice date is required to be formatted like dd.mm.yyyy
The invoice company can never be YOUR COMPANY NAME. If you only find this result, reconsider your answer and search again.
All values inside the result-array must be separated by §

The result-array [Value of invoice company§value of invoice date§value of Invoice Total amount§value of invoice number§value of invoice email§value of invoice currency]

Print result-array"
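
Once the model replies, the §-separated string still has to be split back into fields (in Make this is typically a split on §). As a sketch of the same step in code, assuming the reply follows the prompt exactly: the function name `parse_result_array` and the field keys are made up for illustration, and the decimal-comma handling is an assumption about how totals are written.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Field order must match the result-array defined in the prompt.
FIELDS = ("company", "date", "total", "number", "email", "currency")


def parse_result_array(raw):
    """Split a '§'-separated model reply into a dict of named invoice fields."""
    cleaned = raw.strip().strip("[]")
    values = [value.strip() for value in cleaned.split("§")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # The prompt asks for dd.mm.yyyy; strptime raises ValueError otherwise.
    record["date"] = datetime.strptime(record["date"], "%d.%m.%Y").date()
    # Assumption: a comma is the decimal separator and there are no
    # thousands separators (e.g. "119,00").
    try:
        record["total"] = Decimal(record["total"].replace(",", "."))
    except InvalidOperation as exc:
        raise ValueError(f"unparseable total: {values[2]!r}") from exc
    return record
```

Failing loudly on a wrong field count or a malformed date is deliberate: models occasionally ignore formatting instructions, and a rejected invoice is easier to catch than a silently shifted column.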

Hi David,

To convert the raw data from the PDF file that you’ve downloaded with the Google Drive module into text, you can use PDF.co’s API. PDF.co offers tools for working with PDF documents, including text extraction. We have a simple guide that walks through this task: How to Extract Text from Scanned PDF using Make - PDF.co

If you have any questions or encounter any issues during the process, please don’t hesitate to reach out to us via email at support@bytescout.com. Our dedicated support team is available to provide prompt and helpful assistance.

We hope you have a fantastic day!