How can you run an LLM?
As mentioned at the beginning of this episode, one way of running an LLM is by using open-source models and executing them yourself. The alternative is managed LLM services, which are provided by companies like OpenAI, Google, and Anthropic.
Since open-source models are free to download, while proprietary models charge per token, you might assume that open source is naturally cheaper.
Not so fast! Remember all of those powerful GPU machines you need to run inference with LLMs? They don’t come cheap. Say you decide to run a state-of-the-art open-source model, LLaMA 3 70B, to power your data application. AWS recommends its ml.p4d.24xlarge instance to run this model, which costs around $33 per hour on demand. Running your model on one of these instances will cost more than $20,000 a month, and that assumes a single instance is sufficient to meet your usage load. You will also need in-house DevOps or MLOps experts to manage this deployment, especially if you want to make sure you are not consuming more of these eye-wateringly expensive cloud resources than absolutely necessary.
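As a quick sanity check on that figure, here is the back-of-the-envelope arithmetic. The hourly rate is the approximate on-demand price quoted above; your actual rate will vary by region and pricing plan:

```python
# Back-of-the-envelope monthly cost for one always-on ml.p4d.24xlarge instance.
# The hourly rate is the approximate on-demand price cited above; actual
# rates vary by region and pricing model (reserved and spot are cheaper).
HOURLY_RATE_USD = 33.0
HOURS_PER_MONTH = 24 * 30  # assuming the instance runs around the clock

monthly_cost = HOURLY_RATE_USD * HOURS_PER_MONTH
print(f"Monthly cost: ${monthly_cost:,.0f}")  # -> Monthly cost: $23,760
```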
So what about using managed models instead?
Their per-token prices can seem attractively low, but you need a way of estimating how costs will scale once your employees adopt your product wholesale. Here are some things to consider when calculating token costs:
- Tokens are not equivalent to words, and how words are divided into tokens varies across languages (see the tokenization sketch after this list).
- You’ll be charged for both input and output tokens, with output tokens often priced higher. You also don’t have much control over the size of the output, which is often significantly larger than the input.
- You’ll need to account for any additional instructions you pass to the LLM alongside your user’s query, as these are also processed as tokens. For example, you may want to give the LLM specific instructions for working with data.
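To get a feel for the first point, you can count tokens yourself with OpenAI’s open-source tiktoken library. This is a minimal sketch: cl100k_base is the encoding used by OpenAI’s GPT-4-era models, and other providers use different tokenizers, so counts will differ.

```python
# Count tokens with OpenAI's open-source `tiktoken` tokenizer.
# The cl100k_base encoding is used by GPT-4-era OpenAI models; other
# providers and models use different tokenizers, so counts will differ.
import tiktoken  # pip install tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "Guten Morgen, wie geht es dir?"]:
    tokens = encoding.encode(text)
    print(f"{len(text.split()):>2} words -> {len(tokens):>2} tokens: {text!r}")
```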
To account for these “hidden costs”, GPT for Work has developed a pricing calculator for some of the most popular proprietary models. While not perfect, it can help you project your likely costs and avoid “sticker shock” when that first OpenAI bill arrives.
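If you’d rather script a rough estimate yourself, a sketch like the following can help. Every price and volume below is an illustrative placeholder, not a current rate; substitute your provider’s published prices and your own expected traffic.

```python
# Rough monthly token-cost estimate for a managed LLM service.
# All prices and volumes below are illustrative placeholders --
# substitute your provider's actual per-token rates and your own traffic.
PRICE_PER_1K_INPUT_TOKENS = 0.005   # USD, hypothetical rate
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # USD, hypothetical rate (output often costs more)

SYSTEM_PROMPT_TOKENS = 400    # fixed instructions sent with every request
AVG_QUERY_TOKENS = 150        # average user query
AVG_OUTPUT_TOKENS = 600       # outputs are often much larger than inputs
REQUESTS_PER_MONTH = 100_000  # expected usage once employees adopt the product

input_tokens = (SYSTEM_PROMPT_TOKENS + AVG_QUERY_TOKENS) * REQUESTS_PER_MONTH
output_tokens = AVG_OUTPUT_TOKENS * REQUESTS_PER_MONTH

monthly_cost = (
    input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
)
print(f"Estimated monthly cost: ${monthly_cost:,.2f}")
```

Note how the fixed instructions are billed on every single request: a 400-token system prompt attached to 100,000 requests is 40 million input tokens a month before your users have typed a word.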
Additional costs associated with running LLMs
As LLMs are increasingly used in more complex applications, customers have started to hit the limitations of these models, including the following:
- They are language models, which means that they are strong at language tasks but weak at rule-based tasks like mathematics.
- While they have some reasoning ability, they are not good at complex or multi-step reasoning.
- As they are so expensive to train, their internal knowledge is frozen at the point of training.
This means that, to maximize your LLM application’s performance, you’ll need additional components that – surprise, surprise! – add costs. These can include:
- Using LLMs as part of agentic workflows, which run multiple LLM inference rounds to perform multi-step reasoning.
- Including retrieval-augmented generation (RAG), which pulls relevant documents into the prompt to improve the accuracy of the LLM’s answer. RAG can also involve expensive vector databases to store your documents (a minimal sketch follows this list).
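To make those extra inference rounds concrete, here is a minimal sketch of a RAG-style request. The `vector_db` and `llm` objects are hypothetical stand-ins for your actual vector database client and LLM provider SDK; nothing here is tied to a specific product.

```python
# Minimal sketch of a retrieval-augmented generation (RAG) request.
# `vector_db.search` and `llm.complete` are hypothetical placeholders for
# your actual vector database client and LLM provider SDK -- each call to
# the LLM (and to the database) adds to your per-request cost.

def answer_with_rag(question: str, vector_db, llm, top_k: int = 3) -> str:
    # 1. Retrieve the documents most relevant to the question.
    #    Vector databases typically bill for storage and queries.
    documents = vector_db.search(query=question, limit=top_k)

    # 2. Stuff the retrieved documents into the prompt. Every document
    #    added here is billed again as input tokens on the LLM call.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. One LLM inference round. An agentic workflow would loop here,
    #    issuing several such calls per user request.
    return llm.complete(prompt)
```

Both the retrieval query and the enlarged prompt are billed on every request, so RAG multiplies the input-token costs discussed earlier, and agentic workflows multiply them again with each additional round.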