Faster OpenAI webhooks

This is your user prompt in English, with the help of Google Translate:

Give me a recipe with a detailed procedure
and a list of ingredients based on the factors
mentioned below. Avoid spaces in the lists and
start the answer with the "Recipe name" and
then after "Number of people" before you get
to the ingredients and follow the procedure:

Number of people: {{1. Number of people}}
Meal type: {{1. Meal type}}
Cost limit: {{1. Cost limit}}
Kitchen preference: {{1. Style}}
Preparation time: {{1. Preparation time}}

This tells me that your prompt is not likely to be fast as-is. It is a complex request, asking for a creative response driven by a number of variables. To be honest, it’s kind of a big ask of GPT-3.5 to perform within a tight time constraint, and I have a feeling you’re not likely to find much improvement. But if you do, I would really love to hear what changes you made to get there, as I’m always looking for ways to get faster response times from the GPT APIs!

Below is my general guidance on ways to improve response time and squeeze out some efficiency. Maybe it’ll help, but I think you’re more or less already doing most of it. If it doesn’t help you directly, perhaps it’ll help someone else looking for similar advice.

First, in my previous reply I meant to mention that you should use the OpenAI Playground (if you aren’t already) to test out your prompts before you use them in the API. That way you can get a feel for how long these responses take and how they read. You might find a way to get the results back faster there, and then reproduce that in your Make scenario afterward. What matters is the entire time from pressing the submit button until the streaming response completely ends.

Very literally, the goal is these three things (a rough timing sketch follows the list):

  1. Minimize the time spent from the moment you submit in the Playground to the time it starts to stream a response back to you
  2. Minimize the length of time it spends streaming a response back to you
  3. Minimize the time from when you press the submit button until it becomes re-enabled for pressing again, as this indicates that the full cycle has run from start to finish
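
If you want to put numbers on those three measurements outside the Playground, here is a minimal sketch of how I might time a streamed call, assuming the v1-style openai Python package; the `timed_completion` helper is just a name I’m making up for illustration, and in Make you would look at the scenario’s execution log instead:

```python
import time

from openai import OpenAI  # assumes the v1-style openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def timed_completion(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Stream a completion and report time-to-first-token and total time."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # goal 1: time until streaming starts
            pieces.append(delta)

    total = time.perf_counter() - start  # goal 3: full cycle, submit to done
    if first_token_at is not None:
        print(f"Time to first token: {first_token_at - start:.2f}s")
    print(f"Total time: {total:.2f}s")
    return "".join(pieces)
```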

General tactics I would try:

  • Define for yourself what an acceptable API response time is, according to your needs, dependencies, timeouts, etc., and then use that as your benchmark; don’t compare it to other APIs in general
  • If it makes sense for your use case, use caching (the Data Store module, perhaps) so that identical inputs are returned immediately instead of being processed again unnecessarily; you might even be able to pre-generate some results if you can predict your inputs in advance (see the caching sketch after this list)
  • If time constraints are the critical factor, do not consider using GPT-4 (you’re already not using GPT-4; keep to that)
  • For GPT-3.5, do not set a system prompt (you’re already not setting one; keep to that)
  • Keep the user prompt as small as you can while still reliably getting the result you require
  • Be very clear about the expected output format; do not be open-ended or give the model freedom to choose its own format (a prompt example follows this list)
  • Offer example results in your prompt that it can use as a template, if it makes sense to do so (in your case, it might or might not)
  • If the response should pick from a fixed set of options, enumerate those options in the prompt (doesn’t feel applicable here), and define them only if it is necessary and clearly provides a benefit
  • Be clear about which returned data is required and which is optional, and what format that data should be in (especially if you are using a format like JSON for the output, but even when you are not)
  • Test whether prompting in English is faster, though I would hope it makes no difference, if nothing else to get a benchmark
  • Compare the speeds of ChatGPT (if possible) to using the Playground to using the API via Make (all using the same prompt as much as possible) and see if there is a difference in response speed from start to finish (from submit until the end of the streamed response).
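
On the caching point: a Data Store module in Make can play this role directly, but just to illustrate the idea, here is a rough Python sketch; `cache_key` and `get_or_generate` are made-up names, and an in-memory dict stands in for the data store. The key is derived from the same variables you are already passing into the prompt:

```python
import hashlib
import json

# Toy in-memory cache; in Make this role would be played by a Data Store module.
_cache: dict[str, str] = {}


def cache_key(inputs: dict) -> str:
    """Derive a stable key from the prompt variables (people, meal type, etc.)."""
    canonical = json.dumps(inputs, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def get_or_generate(inputs: dict, generate) -> str:
    """Return a cached recipe for identical inputs, otherwise call the API once."""
    key = cache_key(inputs)
    if key not in _cache:
        _cache[key] = generate(inputs)  # the slow OpenAI call only happens on a miss
    return _cache[key]
```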
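
And on keeping the prompt small with an explicit output format: this is only a hypothetical rewrite of your recipe prompt, not something I have benchmarked, but it shows the kind of rigid skeleton I mean, with example values filled in via normal string formatting (in Make you would keep using your `{{1. ...}}` mappings):

```python
# Hypothetical tightened prompt: a rigid output skeleton, nothing open-ended.
PROMPT_TEMPLATE = """\
Write a recipe. Reply in exactly this format, with no extra commentary:
Recipe name: <name>
Number of people: {people}
Ingredients:
- <one ingredient per line>
Procedure:
1. <one numbered step per line>

Constraints: meal type {meal_type}, cost limit {cost_limit}, \
kitchen style {style}, preparation time {prep_time}."""

# Example values only, to show the substitution.
prompt = PROMPT_TEMPLATE.format(
    people=4,
    meal_type="dinner",
    cost_limit="10 EUR",
    style="Italian",
    prep_time="30 minutes",
)
```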

If you do any kind of data filtering for rejecting, reprocessing, caching, etc., here is a tactic to consider after you’ve settled on a prompt:

  1. Use your large-result prompt with a very low temperature, but set the maximum token length so small that the response is truncated very early. The purpose is that the model should consider the prompt in the same way, but stop streaming almost immediately. You can then grab just the header section (as defined by you) containing something actionable and useful, and use it to reject the output, send it back to the user, modify the data and resend it to the API to try again, or accept it and move on to the next step here.
  2. Use the exact same settings, changing only the max tokens value, to generate the full response.

The result of this tactic is that you can fail, ignore, or return faster when the top of the response gives you enough data to filter, vet, or check a cache to see whether you have already generated that result before. But it takes longer in total if you always run both steps every single time, so it is only truly helpful if having some data early saves you from regenerating the whole result multiple times while filtering on data in the top portion of the returned result. It also adds the cost of the extra tokens spent on the first step, however many times you run it, but it may save you in the long run if you were already filtering this way without using the max token attribute to truncate the response.
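
For what that two-step tactic might look like in code, here is a rough sketch, again assuming the v1-style openai Python package; the model name, token limits, and `startswith` check are just placeholders, and note that even at a low temperature the truncated first pass is not guaranteed to start identically to the full second pass, so treat the header check as a heuristic:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = "..."  # your full recipe prompt with the variables filled in
COMMON = dict(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0.1,  # very low, so both passes head in roughly the same direction
)

# Step 1: same prompt, tiny max_tokens, so we only wait for (and pay for) the header.
header = client.chat.completions.create(max_tokens=40, **COMMON)
header_text = header.choices[0].message.content or ""

if not header_text.startswith("Recipe name"):
    # Reject, retry, or send something back to the user without generating the rest.
    pass
else:
    # Step 2: identical settings, changing only max_tokens, to get the full result.
    full = client.chat.completions.create(max_tokens=1024, **COMMON)
    recipe = full.choices[0].message.content
```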
