Faster OpenAI webhooks

Hi!

I have made a scenario with a webhook from Softr.com, to OpenAI and then to Airtable. Right now OpenAI seems very slow over the API. Are there any settings in the scenario I can change to make the process faster? I know it's OpenAI that's slow, because I have tried with just a webhook and Airtable and that runs basically instantly.

I have tried different settings and AI models in the OpenAI module but haven't gotten it any faster.

Here are some pictures:





Thanks in advance :smiling_face:

gpt-4 is incredibly slow, to the point of being almost unusable for anything other than background processing, and typically leads to process timeouts (plus it's expensive). gpt-3.5-turbo (which you're using, according to your screenshot) is about as fast as you're going to get while still getting decent results (and it's way cheaper), but it's still pretty slow compared to things like databases and general APIs.

But there are some things you can do to make gpt-3.5-turbo faster. The fewer tokens you use in your prompt (both the number of messages and the length of each), the faster it will go, so keep it as brief as you can. Keep the returned result as constrained as you can as well. And if you're okay with sometimes getting a truncated result back because you really need it returned ASAP, set your max tokens value as low as possible while still giving you enough returned data; if the response hits that limit, it will simply stop early, mid-thought.
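If it helps to see those two levers outside of Make, here's a minimal sketch using the OpenAI Python SDK (the model, prompt, and max_tokens value below are placeholders, not your actual setup): a shorter prompt means the response starts sooner, and a lower max tokens cap means it ends sooner.

```python
# Sketch only: the two levers are a brief prompt and a small max_tokens cap.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # Fewer input tokens = the model starts answering sooner.
        {"role": "user", "content": "Dinner recipe for 2, max 30 min. Name, ingredients, steps."},
    ],
    max_tokens=300,   # hard cap on the reply; it stops mid-thought if it hits this
    temperature=0.7,
)

print(response.choices[0].message.content)
```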

Basically give it less work to do. But I expect it’ll still be slower than you’ll like.

What you may not know is that the results actually start coming back almost immediately, but they stream in slowly from OpenAI to Make in the background, and your scenario doesn't see anything until it has received the whole response from OpenAI. If you were using the API in a different system, such as the OpenAI Playground or ChatGPT, you'd be seeing the results right away; that slow typing effect isn't just an effect, it's literally what the server is sending to you. So if you set a low max token limit, the stream is forced to end earlier (because it hits your specified limit), which puts a ceiling on the time spent producing that stream. If you do that, also keep the prompt you send (message count and the token length of each message) small, because some time is spent before it even starts streaming a response back, and a smaller prompt means it starts sooner.
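To make that concrete, here's a hedged sketch of what that stream looks like when you consume the API directly (Python SDK, placeholder model and prompt). Make has to wait for the whole loop below to finish; a streaming consumer sees each chunk as it arrives.

```python
# Sketch only: the token-by-token stream that Make waits to finish receiving.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name three quick dinner ideas."}],
    max_tokens=100,   # a low cap forces the stream to end early
    stream=True,      # chunks arrive as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # this is the "slow typing" effect
print()
```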

Hopefully that helps.

BTW, Google's PaLM API doesn't stream the response back like OpenAI's GPT system does; it just sends it back all at once, and it returns the result much faster. In my testing so far, though, there are a lot of differences between the two systems in terms of where I'd use each one. I highly recommend applying with Google for access to their PaLM API (to everyone interested) and then checking out their MakerSuite as soon as they give you access (unknown how long that takes) to try it out. Personality-wise, GPT-3.5 and GPT-4 are hands down far better for anything human-interactive, but for no-personality processing of data, PaLM does a great job and is super fast. If that suits your use case, I recommend seeking it out. I use them both and love them both.

Good luck! :smile:


This is your user prompt in English, with the help of Google Translate:

Give me a recipe with a detailed procedure
and a list of ingredients based on the factors
mentioned below. Avoid spaces in the lists and
start the answer with the "Recipe name" and
then after "Number of people" before you get
to the ingredients and follow the procedure:

Number of people: {{1. Number of people}}
Meal type: {{1. Meal type}}
Cost limit: {{1. Cost limit}}
Kitchen preference: {{1. Style}}
Preparation time: {{1. Preparation time}}

This tells me that your prompt is not likely to be fast as-is. It's a complex request, asking for a creative response based on a number of variables, and to be honest it's a big ask for GPT-3.5 to perform within a tight time constraint. I have a feeling you're not likely to find much improvement. But if you do, I would really love to hear what changes you made to get it, as I'm always looking for ways to get faster response times from the GPT APIs!

Below is my general guidance for improving response time, plus some attempts at efficiency. Maybe it'll help, but I think you're more or less already doing it. If it doesn't help you directly, perhaps it'll help someone else looking for similar help.

First, in my previous reply I meant to mention that you should use the OpenAI Playground (if you aren't already) to test your prompts before you use them in the API. That way you can get a feel for how long the responses take and how they read. You might find a way to get the results back faster there, and you can then reproduce that in your Make scenario. What matters is the entire time from pressing the submit button until the streaming response completely ends.

Very literally, the goal is these three things:

  1. Minimize the time spent from the moment you submit in the Playground to the time it starts to stream a response back to you
  2. Minimize the length of time it spends streaming a response back to you
  3. Minimize the length of time it takes from when you push the submit button until the submit button becomes re-enabled for pressing again, as this indicates that the full cycle has run from start to finish
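If you want rough numbers for those three phases against the API itself (rather than eyeballing the Playground), here's a small timing sketch. Assumptions: the OpenAI Python SDK, and a placeholder model and prompt; the timings it prints map onto goals 1-3 above.

```python
# Sketch only: measure time-to-first-token, streaming duration, and the full cycle.
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Recipe name and ingredient list for a 30-minute pasta dinner."}],
    max_tokens=200,
    stream=True,
)

first_token_at = None
for chunk in stream:
    if first_token_at is None and chunk.choices[0].delta.content:
        first_token_at = time.monotonic()   # goal 1: time until streaming starts
end = time.monotonic()                      # goals 2 and 3: stream duration / full cycle

if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f}s")
    print(f"streaming duration:  {end - first_token_at:.2f}s")
print(f"total round trip:    {end - start:.2f}s")
```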

General tactics I would try:

  • Define for yourself what an acceptable API response time is, according to your needs, dependencies, timeouts, etc., and then use that as your benchmark; don’t compare it to other APIs in general
  • If reasonable, use caching (like the Data Store module, perhaps) so that identical inputs get their result returned immediately instead of being processed again unnecessarily, if it makes sense for your use case (plus you might be able to pre-process some results if you can predict your inputs in advance); there's a small sketch of this idea right after this list
  • If time constraints are the critical factor, do not consider using GPT-4 (you’re already not using GPT-4; keep to that)
  • For GPT-3.5, do not set a system prompt (you’re already not setting one; keep to that)
  • Set as tiny of a user prompt as you can while reliably getting the result you require
  • Be very clear what the expected output format is to be and do not be open-ended or allow it freedom to choose the output format
  • Offer example results in your prompt that it can use as a template, if it makes sense to do so (in your case, it might or might not)
  • Offer an enumerated list of options for it to select from if it is to choose from a list of options as part of its response (doesn’t feel applicable here), and possibly provide a definition of those options if necessary and if it obviously provides a benefit
  • Be clear about what data to be returned is required and what data is optional, and possibly what format that data should be in (especially if you were using a format like JSON as the output, but even when not doing so)
  • Evaluate whether using English is faster (I would hope it makes no difference), if nothing else to establish a benchmark
  • Compare the speeds of ChatGPT (if possible) to using the Playground to using the API via Make (all using the same prompt as much as possible) and see if there is a difference in response speed from start to finish (from submit until the end of the streamed response).
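On the caching bullet above, the idea in code form is just "key the result by the exact inputs and skip the API call on a repeat." This sketch uses an in-memory Python dict and a made-up generate callback; in a Make scenario, a Data Store (or any key-value store) would play the same role.

```python
# Sketch of the caching idea: identical inputs return instantly, with no OpenAI call.
# The dict stands in for a persistent store such as a Make Data Store.
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(inputs: dict) -> str:
    # Stable key built from the exact prompt variables (people, meal type, cost, ...).
    return hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()

def recipe_for(inputs: dict, generate) -> str:
    key = cache_key(inputs)
    if key in _cache:
        return _cache[key]      # cache hit: no model latency at all
    result = generate(inputs)   # the slow OpenAI call happens only on a miss
    _cache[key] = result
    return result
```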

If you do any kind of data filtering for purposes of reject/reprocessing/caching/etc, here is a tactic to consider after you’ve settled on a prompt:

  1. Use your large-result prompt and set a very low temperature, but set the maximum token length to something very small so that the response is truncated very early. The purpose is that the model should consider the prompt in the same way but stop streaming very early. Then you can grab just the header section (as defined by you) containing something actionable and useful, and use it to reject the output, send it back to the user, modify the data and resend it to the API to try again, or accept it by moving on to the next step here.
  2. Use the exact same settings, but change only the max tokens value.

The result of this tactic is that you can fail/ignore/return faster whenever the top of the response gives you enough data to filter on, vet, or check a cache to see if you have already generated that result before. But it also takes longer in total if you always run both steps every single time, so it is only worthwhile when having some data early saves you from regenerating the whole result multiple times, i.e. when your filtering depends only on data in the top portion of the returned result. It also adds the cost of the extra tokens spent on that first step, however many times you run it, but it might save you in the long run if you were already filtering this way but were not using the max tokens attribute to truncate the response.
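Sketched in code (assumptions: the OpenAI Python SDK, a placeholder prompt, and a header check you would define yourself), the two-pass tactic looks roughly like this:

```python
# Sketch of the two-pass tactic: a cheap truncated call first, then the full call
# only if the header passes your filter. Prompt, limits, and filter are examples.
from openai import OpenAI

client = OpenAI()
PROMPT = [{"role": "user", "content": "Recipe. Start with 'Recipe name:' and 'Number of people:', then ingredients and steps."}]

def ask(max_tokens: int) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=PROMPT,
        temperature=0.2,        # low temperature so both passes behave alike
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content

header = ask(max_tokens=40)         # pass 1: just enough tokens for the header section
if "Recipe name:" in header:        # your filter / cache lookup goes here
    full = ask(max_tokens=700)      # pass 2: identical settings, only max_tokens changes
    print(full)
else:
    print("Rejected early based on header:", header)
```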


Thank you so much for a comprehensive answer! I will look into this! :grin: Thank you!


Hi there @Petter_Bjerke :wave:

Have you had a chance to check out @Emmaly’s solution?

If it did the trick, could you mark it as a solution? :white_check_mark: This way we can keep the community tidy and neat for others while they search for solutions.

Thank you!