How can you run an LLM?
As mentioned at the beginning of this episode, one way of running an LLM is by using open-source models and executing them yourself. The alternative is managed LLM services, which are provided by companies like OpenAI, Google, and Anthropic.
Since open-source models are free to download, while proprietary models charge per token, you might assume that open source is naturally cheaper.
Not so fast! Remember all of those powerful GPU machines you need to run inference with LLMs? They don’t come cheap. Say you decide to run a state-of-the-art open-source model, LLaMA 3 70B, to power your data application. AWS recommends its ml.p4d.24xlarge instance to run this model, which costs around $33 per hour on demand. Running your model on one of these instances will cost more than $20,000 a month, and that assumes a single instance is sufficient to meet your usage load. You will also need in-house DevOps or MLOps experts to manage this deployment, especially if you want to make sure you are not consuming more of these eye-wateringly expensive cloud resources than absolutely necessary.
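As a quick sanity check on that figure, here is the back-of-the-envelope arithmetic. The hourly rate is the approximate on-demand price quoted above; your actual rate will vary by region and pricing plan:

```python
# Back-of-the-envelope monthly cost for one always-on ml.p4d.24xlarge instance.
# The hourly rate is the approximate on-demand price cited above; actual
# rates vary by region and pricing model (reserved and spot are cheaper).
HOURLY_RATE_USD = 33.0
HOURS_PER_MONTH = 24 * 30  # assuming the instance runs around the clock

monthly_cost = HOURLY_RATE_USD * HOURS_PER_MONTH
print(f"Monthly cost: ${monthly_cost:,.0f}")  # -> Monthly cost: $23,760
```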
So what about using managed models instead?
Their per-token prices can seem attractively low, but you need a way of estimating how costs will scale once your employees adopt your product wholesale. Here are some things to consider when calculating token costs:
- Tokens are not equivalent to words, and how words are divided into tokens varies across languages (see the tokenization sketch after this list).
- You’ll be charged for both input and output tokens, with output tokens often priced higher. You also don’t have much control over the size of the output, which is often significantly larger than the input.
- You’ll need to account for any additional instructions you pass to the LLM alongside your user’s query, as these are also processed as tokens. For example, you may want to give the LLM specific instructions for working with data.
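To get a feel for the first point, you can count tokens yourself with OpenAI’s open-source tiktoken library. This is a minimal sketch: cl100k_base is the encoding used by OpenAI’s GPT-4-era models, and other providers use different tokenizers, so counts will differ.

```python
# Count tokens with OpenAI's open-source `tiktoken` tokenizer.
# The cl100k_base encoding is used by GPT-4-era OpenAI models; other
# providers and models use different tokenizers, so counts will differ.
import tiktoken  # pip install tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "Guten Morgen, wie geht es dir?"]:
    tokens = encoding.encode(text)
    print(f"{len(text.split()):>2} words -> {len(tokens):>2} tokens: {text!r}")
```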
To account for these “hidden costs”, GPT for Work has developed a pricing calculator for some of the most popular proprietary models. While not perfect, it can help you project your likely costs and avoid “sticker shock” when that first OpenAI bill arrives.
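If you’d rather script a rough estimate yourself, a sketch like the following can help. Every price and volume below is an illustrative placeholder, not a current rate; substitute your provider’s published prices and your own expected traffic.

```python
# Rough monthly token-cost estimate for a managed LLM service.
# All prices and volumes below are illustrative placeholders --
# substitute your provider's actual per-token rates and your own traffic.
PRICE_PER_1K_INPUT_TOKENS = 0.005   # USD, hypothetical rate
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # USD, hypothetical rate (output often costs more)

SYSTEM_PROMPT_TOKENS = 400    # fixed instructions sent with every request
AVG_QUERY_TOKENS = 150        # average user query
AVG_OUTPUT_TOKENS = 600       # outputs are often much larger than inputs
REQUESTS_PER_MONTH = 100_000  # expected usage once employees adopt the product

input_tokens = (SYSTEM_PROMPT_TOKENS + AVG_QUERY_TOKENS) * REQUESTS_PER_MONTH
output_tokens = AVG_OUTPUT_TOKENS * REQUESTS_PER_MONTH

monthly_cost = (
    input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
)
print(f"Estimated monthly cost: ${monthly_cost:,.2f}")
```

Note how the fixed instructions are billed on every single request: a 400-token system prompt attached to 100,000 requests is 40 million input tokens a month before your users have typed a word.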
Additional costs associated with running LLMs
As LLMs are increasingly used in more complex applications, customers have started to hit the limitations of these models, including the following:
- They are language models, which means that they are strong at language tasks but weak at rule-based tasks like mathematics.
- While they have some reasoning ability, they are not good at complex or multi-step reasoning.
- As they are so expensive to train, their internal knowledge is frozen at the point of training.
This means that, to maximize your LLM application’s performance, you’ll need additional components that – surprise, surprise! – add costs. These can include:
- Using LLMs as part of agentic workflows, which run multiple LLM inference rounds to perform multi-step reasoning.
- Including retrieval-augmented generation (RAG), which pulls relevant documents into the prompt to improve the accuracy of the LLM’s answer. RAG can also involve expensive vector databases to store your documents (a minimal sketch follows this list).
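To make those extra inference rounds concrete, here is a minimal sketch of a RAG-style request. The `vector_db` and `llm` objects are hypothetical stand-ins for your actual vector database client and LLM provider SDK; nothing here is tied to a specific product.

```python
# Minimal sketch of a retrieval-augmented generation (RAG) request.
# `vector_db.search` and `llm.complete` are hypothetical placeholders for
# your actual vector database client and LLM provider SDK -- each call to
# the LLM (and to the database) adds to your per-request cost.

def answer_with_rag(question: str, vector_db, llm, top_k: int = 3) -> str:
    # 1. Retrieve the documents most relevant to the question.
    #    Vector databases typically bill for storage and queries.
    documents = vector_db.search(query=question, limit=top_k)

    # 2. Stuff the retrieved documents into the prompt. Every document
    #    added here is billed again as input tokens on the LLM call.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. One LLM inference round. An agentic workflow would loop here,
    #    issuing several such calls per user request.
    return llm.complete(prompt)
```

Both the retrieval query and the enlarged prompt are billed on every request, so RAG multiplies the input-token costs discussed earlier, and agentic workflows multiply them again with each additional round.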