OpenAI has created some amazing technology with their recent large language models (LLMs), such as ChatGPT and GPT-4. These models can do incredible things, such as passing the bar exam, diagnosing rare medical ailments, or even inventing a new programming language. That said, there are other great models out there, and it's worth exploring the larger ecosystem to see if one suits your needs better.
We use OpenAI heavily, but we rely on other models as well. Here's why you may want to check out some alternatives:
Reliability and Speed
If you've been building on OpenAI recently, you may have noticed something troubling: the failure rate of the API can be pretty high, especially if you are generating text at high-traffic times. Depending on the time of day and the number of requests you send, failure rates can reach a few percent. This isn't a deal-breaker by itself, since retrying the API call generally succeeds. However, having to retry API calls runs into a second issue: speed.
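If you do end up retrying, a simple exponential backoff loop keeps transient failures from bubbling up to your users. Here's a minimal sketch using the pre-1.0 openai Python package; the retry count and delays are illustrative, not recommendations:

```python
import time
import openai

def chat_with_retry(messages, max_retries=3, base_delay=1.0):
    """Call the chat API, backing off exponentially on transient failures."""
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=messages,
            )
        except openai.error.OpenAIError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

Note that every retry adds the full round-trip latency of another call, which is exactly the speed problem described below.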
GPT-3.5 (ChatGPT) and GPT-4 inferences are slow. Each API call can take anywhere from a couple of seconds up to over 10 seconds. This may work for certain asynchronous scenarios, but many products building on top of large language models need to rely on real-time feedback for their users. This isn't possible if API calls take too long.
To make the problem worse, time costs compound if you want better results. Recent research has shown that large language models perform complex tasks better when those tasks are broken into smaller sub-tasks. This way, the models can "think" in steps, yielding better quality results. The breakdown of a larger goal into tasks is also what makes agents possible. All of these techniques rely on many sequential inferences from an LLM, making them even slower than before. A product architect is left choosing between better results and long waits.
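To see why the latency compounds, consider a rough sketch of a three-step chain with the pre-1.0 openai package (the goal and prompts are hypothetical). Each call blocks on the previous one, so the total wait is the sum of all three:

```python
import openai

def ask(prompt: str) -> str:
    """A single chat completion; each of these adds seconds of latency."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Each step depends on the previous one, so the calls cannot be
# parallelized: total latency is the sum of all three.
goal = "Announce our new search feature to customers"
plan = ask(f"Break this goal into three short steps:\n{goal}")
draft = ask(f"Follow this plan to write a first draft:\n{plan}")
final = ask(f"Critique this draft, then rewrite it:\n{draft}")
print(final)
```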
Other models can produce results faster and more reliably than OpenAI's. Claude from Anthropic, for example, often finishes generation tasks in half the time of GPT-3.5. It also has lower failure rates (at least in our internal testing), which means less retrying and faster output overall.
If you are having trouble with reliability and speed, it's worth looking at alternatives to supplement GPT-3.5 and GPT-4.
Cost
Another reason you should look at the larger ecosystem is cost. While models like GPT-4 give great results, they are not cheap. If you use the larger 32k context-window version of GPT-4, you could be paying $0.12 per 1k sampled tokens (roughly 750 words of output). This cost adds up quickly if you are using these models in production.
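Some back-of-the-envelope arithmetic makes the point; the traffic numbers below are assumptions, not benchmarks:

```python
# Back-of-the-envelope math at GPT-4-32k's $0.12 per 1k sampled tokens
requests_per_day = 10_000      # assumed traffic
tokens_per_response = 500      # assumed average output length
daily_cost = requests_per_day * tokens_per_response / 1000 * 0.12
print(f"${daily_cost:,.2f}/day")  # $600.00/day, roughly $18k/month
```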
If you use the OpenAI embeddings API to index your information, you will incur even more cost. Embeddings are used to index content into a vector database so that it can be retrieved later. This means every single piece of content you want to index incurs a charge, and those charges accumulate fast.
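For concreteness, here's roughly what that indexing step looks like with the pre-1.0 openai package; the documents and the vector-database step are placeholders:

```python
import openai

documents = [
    "How to reset your password",
    "Billing and invoicing FAQ",
    "Getting started with the API",
]

# Every document in the batch is billed by token
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=documents,
)
vectors = [item["embedding"] for item in response["data"]]
# `vectors` would then be upserted into your vector database
```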
There are many alternative models out there that can help you if cost is a concern. If you need to create embeddings for less cost, you can find many open source embedding models available on HuggingFace. These can be quite effective depending on your specific use case. If you are looking to generate text, there are cheaper alternatives out there as well. First of all, GPT-3.5 (ChatGPT) is already an order of magnitude cheaper than GPT-4. If you need to cut costs even further, you could self-host an open source model or run a hosted model using Replicate.
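As a sketch of the open source embedding route, here's the same indexing step using the sentence-transformers library with a small model from HuggingFace; the model choice is illustrative:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is a small, widely used open source embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to reset your password",
    "Billing and invoicing FAQ",
]
vectors = model.encode(documents)  # runs locally: no per-token fees
print(vectors.shape)  # (2, 384), i.e. one 384-dimensional vector each
```

The trade-off is quality: smaller open models may retrieve less accurately than ada-002 for some domains, so it's worth benchmarking on your own data.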
There are many options to explore to cut your costs. The key lies in understanding what your use case is and what you need a model for.
Capabilities
OpenAI models are fantastic, but they are not the best in every way. Different large language models have different capabilities that may give them an edge depending on your specific use case. If you have strict privacy requirements or need a large context window, you may want to look elsewhere.
If you have strict privacy requirements and don't want to send your data up to the cloud, you'll need to look beyond OpenAI. There are models like LLaMA and Vicuna that can be self-hosted, or even run on the edge. These models can be fine-tuned on your own data without sending any proprietary information to a third-party vendor. Contrast that with fine-tuning GPT-3, which requires sending all of your data to OpenAI.
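A minimal self-hosting sketch with the HuggingFace transformers library might look like this; the model name is a placeholder (it assumes you have the weights, a GPU, and the accelerate package installed), and nothing in it leaves your infrastructure:

```python
from transformers import pipeline

# The model name is a placeholder: use whichever open model you have weights for
generator = pipeline(
    "text-generation",
    model="lmsys/vicuna-7b-v1.5",
    device_map="auto",  # spread the model across available GPUs
)

# The prompt and the output never leave your own infrastructure
result = generator("Summarize our internal incident report:", max_new_tokens=200)
print(result[0]["generated_text"])
```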
Another area where other models have OpenAI beat is the context window. The most expensive variant of GPT-4 has a context window of 32k tokens (roughly 24k words). This means that you can have a total of 32k tokens across both your input and the output for every API call. While this is a large number of words to process at once, there are scenarios where a larger context window is necessary. For example, if you want to summarize a very long transcript, you may run into the context limit and need to break the text apart in order to process it. This will incur additional cost in time and money.
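A naive chunking pass, sketched below with word counts as a rough stand-in for a real tokenizer, shows the extra machinery a small context window forces on you:

```python
def chunk_words(text: str, max_words: int = 2000, overlap: int = 100) -> list[str]:
    """Split a long text into overlapping word windows that fit the context."""
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), step)
    ]

# Each chunk needs its own summarization call, plus a final call to
# merge the partial summaries: more latency and more cost.
```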
Did you know that Claude has a context window of 100k tokens? That means you can process roughly three times as many tokens as GPT-4 in a single call. You don't need to split up your long text, and you'll use fewer API calls, which gets you to results faster. You can also maintain a longer memory as you chain together multiple calls, making longer chains smoother.
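With the official anthropic Python SDK, summarizing a long document can be a single call; the model name and file path below are illustrative, so check Anthropic's docs for the current long-context model:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("transcript.txt") as f:
    transcript = f.read()  # far longer than GPT-4's 32k-token limit

message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder: pick a current model
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": f"Summarize this transcript:\n\n{transcript}",
    }],
)
print(message.content[0].text)
```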
Conclusion
OpenAI has captured the world's imagination with their incredible large language models, but there are many reasons to use alternative models. If you are building something in production, you'll need to carefully balance speed, cost, and capabilities to find what works for you. The best approach may involve mixing and matching multiple models so that you get the best of each.
William Cheng is an engineering and product leader, and the co-founder of Maestro AI, a tool that solves info overload for dev teams. Connect with him on LinkedIn.
Maestro AI has just released their Slack agent, which summarizes conversations, extracts follow-ups, and creates documentation from conversations. Sign up here.