Using Chat GPT to extract data from email

I am exploring using OpenAI’s API to extract data from notification email I receive from vendors. In the past, I’ve successfully created a scenario that extracts 7 data items from the notification email using Regex expressions. The items are common to purchasing workflows: PO #, Order #, Shipping Carrier, Tracking Numbers, Ship to Address, License Key, and Customer Name.

The issue I’ve now run into is that the format of the notification email has changed and I’m getting at least 3 different structures of email. They differ only slightly, but enough that my previous regex statements are not working correctly. I am testing a scenario using OpenAI to do the extractions and hoping it has the capability to cope with the varying email formats and can still extract the data. The regex extractions were so fast!

I am hitting the 429 error codes and have added PAUSE steps to the scenario to make it go slower. I don’t like having to slow down, but don’t know an alternative. This seems to work, but the processing of multiple email is very slow. I am using a variable to set the sleep time so I only have to set it once as I test, but currently I have it set to 30 seconds. So I force the scenario to stop execution between each call to OpenAI to extract a data element. In total, I spend 2.5 minutes just waiting. It is very slow for testing. It doesn’t seem like the optimal way to leverage the module and I suspect there is a better approach.

One of the other key questions I have is that I currently have 1 OpenAI module for each data item to extract from the email body. Is this the only way to do this? I tried to specify more than one “Data Definition” in the module, but I did not have success extracting any data when i specified more than one.

The other thing I am running in to is the lenth of the prompts and the count of tokens used in my requests. It isn’t clear to me how to provide enough information in a prompt to ensure an accurate extraction but remain within the token limit.

Is there a more efficient way to extract multiple data elements from an email with varying formats?

Why not just use one OpenAI “Structured Data” module to extract all of the variables/items at the same time, if they are all running on the same input text?

Or, if you’d still want to use regex, you can provide examples of the different structures and perhaps we can find a solution?

samliewrequest private consultation

Join the unofficial Make Discord server to chat with other makers!

3 Likes

Thanks for the direction Sam.

This approach worked much better and the result seems able to return accurate results regardless of the format of the email which seems to be a big advantage.

It may prove to be adaptable and more durable.

2 Likes