I am exploring using OpenAI’s API to extract data from the notification emails I receive from vendors. In the past, I successfully built a scenario that extracts 7 data items from each notification email using regular expressions. The items are common to purchasing workflows: PO #, Order #, Shipping Carrier, Tracking Numbers, Ship to Address, License Key, and Customer Name.
The issue I’ve now run into is that the format of the notification emails has changed, and I’m receiving at least 3 different email structures. They differ only slightly, but enough that my previous regex statements no longer work correctly. I am testing a scenario that uses OpenAI to do the extractions, hoping it can cope with the varying email formats and still extract the data. The regex extractions were so fast!
I am hitting 429 (rate limit) error codes and have added PAUSE steps to the scenario to slow it down. I don’t like having to slow down, but I don’t know of an alternative. This seems to work, but processing multiple emails is very slow. I am using a variable for the sleep time so I only have to set it once as I test; currently it is set to 30 seconds. In other words, I force the scenario to stop execution between each call to OpenAI that extracts a data element. In total, I spend 2.5 minutes just waiting, which is very slow for testing. It doesn’t seem like the optimal way to leverage the module, and I suspect there is a better approach.
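To make the question concrete, here is a rough sketch of the kind of alternative I imagine: instead of a fixed 30-second pause before every call, wait only after a 429 actually comes back, and double the wait on each retry. This is plain Python, not a Make module. `RateLimitError`, `call_with_backoff`, and the `request` callable are names I made up for illustration; they stand in for whatever actually performs the OpenAI call and signals a 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the API."""


def call_with_backoff(request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Run request(); on a rate-limit error, wait and retry with growing delays.

    request: zero-argument callable that performs one API call (assumed).
    sleep:   injectable so the waiting can be skipped while testing.
    """
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Exponential backoff (1s, 2s, 4s, ...) plus a little random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

With something like this, a call that succeeds on the first try costs no waiting at all, and the delay only grows while the API keeps returning 429.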
Another key question: I currently have 1 OpenAI module for each data item I extract from the email body. Is this the only way to do this? I tried to specify more than one “Data Definition” in the module, but when I did, no data was extracted at all.
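What I am hoping for is something like the sketch below: one request that asks for all 7 items at once and gets back a single JSON object, so there is 1 call per email instead of 7. This is only my assumption about how it could work, written in plain Python rather than as a Make module. The field keys, `build_prompt`, and `parse_reply` are names I invented for illustration, and the sketch deliberately leaves out the API call itself.

```python
import json

# Hypothetical keys for the 7 items I extract today.
FIELDS = [
    "po_number",
    "order_number",
    "shipping_carrier",
    "tracking_numbers",
    "ship_to_address",
    "license_key",
    "customer_name",
]


def build_prompt(email_body):
    """Build one prompt that requests every field in a single reply."""
    return (
        "Extract the following fields from the email below and reply with "
        "a single JSON object using exactly these keys: "
        + ", ".join(FIELDS)
        + ". Use null for any field that is not present.\n\n"
        "EMAIL:\n" + email_body
    )


def parse_reply(reply_text):
    """Parse the model's JSON reply, keeping only the expected keys.

    Missing keys become None; unexpected keys are dropped.
    Raises json.JSONDecodeError if the reply is not valid JSON.
    """
    data = json.loads(reply_text)
    return {field: data.get(field) for field in FIELDS}
```

The text returned by `build_prompt` would be sent as the prompt, and the model's reply would be passed to `parse_reply`, giving one dictionary with all 7 values that later steps can map from.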
The other thing I am running into is the length of the prompts and the number of tokens used in my requests. It isn’t clear to me how to provide enough information in a prompt to ensure an accurate extraction while remaining within the token limit.
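For a sense of what I mean by staying within the limit, here is a minimal sketch that estimates the token count and trims the email body to fit. It assumes the common rule of thumb of roughly 4 characters per token for English text, which is only an approximation; a real tokenizer will give different counts. `estimate_tokens`, `trim_to_budget`, and `CHARS_PER_TOKEN` are hypothetical names of my own.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb, not an exact figure


def estimate_tokens(text):
    """Approximate the token count from the character count (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def trim_to_budget(email_body, instructions, max_tokens):
    """Cut the email body so instructions plus body fit the estimated budget."""
    budget = max_tokens - estimate_tokens(instructions)
    if budget <= 0:
        raise ValueError("instructions alone exceed the token budget")
    return email_body[: budget * CHARS_PER_TOKEN]
```

The idea is to keep the instructions fixed and short, and let only the email body shrink when a message is unusually long.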
Is there a more efficient way to extract multiple data elements from an email with varying formats?