How can you keep your data safe?
So we now come to the most important point: How can you get the most out of LLMs while keeping your data safe?
First and most importantly, you should understand the terms of use of the LLMs you rely on. Proprietary LLMs can be used either through personal accounts, such as the one you create when signing up for ChatGPT directly, or through third-party providers, which access the models via the vendor’s APIs and build them into their own applications.
Personal accounts for popular LLMs generally allow the company to use your data for training (3). However, many applications built on these models have negotiated much stricter terms of use that forbid the use of customer data. So while using a personal ChatGPT account for sensitive work is never a good idea, a third-party application running on ChatGPT may be perfectly secure. Datalore AI, for example, only uses third-party LLMs whose terms of use guarantee the protection of our customers’ data, making it a secure way to pass your information to these models.
Secondly, you should be careful about which parts of your system you grant LLMs access to and what safeguards you put in place. LLMs are vulnerable to prompt injection attacks, in which malicious actors hijack the model for their own purposes. For example, if you give an LLM access to a database containing sensitive data, an attacker may be able to trick the model into extracting that data and sending it to them.
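To make this concrete, here is a minimal sketch of one such safeguard in Python: only letting LLM-generated SQL run as read-only queries against an explicit allowlist of tables. The database file, table names, and helper function are hypothetical, and a real system would parse the SQL properly rather than relying on simple pattern matching.

```python
import re
import sqlite3

# Hypothetical allowlist: the only tables LLM-generated SQL may reference.
ALLOWED_TABLES = {"products", "public_reviews"}

def run_llm_query(sql: str, db_path: str = "analytics.db"):
    """Run LLM-generated SQL only if it is a single, read-only SELECT
    that references nothing outside the allowlisted tables."""
    statement = sql.strip().rstrip(";")

    # Reject anything that is not a single SELECT statement.
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("Only single SELECT statements are allowed.")

    # Rough extraction of referenced tables; production code would use a SQL parser.
    referenced = {
        name.lower()
        for name in re.findall(r"\b(?:from|join)\s+([A-Za-z_]\w*)", statement, re.IGNORECASE)
    }
    blocked = referenced - ALLOWED_TABLES
    if blocked:
        raise ValueError(f"Query references non-allowlisted tables: {blocked}")

    # Open the database read-only, so even a missed check cannot modify data.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(statement).fetchall()
    finally:
        conn.close()
```

The design point here is defence in depth: even if a cleverly injected prompt slips past the allowlist check, the read-only connection prevents the model from altering data, and truly sensitive tables are simply never exposed to it in the first place.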
Finally, if you decide to fine-tune your own model, you should be very careful about the data you use. Given the tendency of models to memorize their training data, there is a chance that sensitive information from your training set could end up in the model outputs.
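As a rough illustration, the sketch below scrubs obvious personal data from hypothetical prompt/completion records before writing a fine-tuning file. The regex rules and field names are assumptions made for the example; in practice you would pair rules like these with dedicated PII-detection tooling and manual review.

```python
import json
import re

# Hypothetical patterns for common kinds of sensitive data.
REDACTION_PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace obvious PII with placeholder tokens."""
    for placeholder, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

def build_training_file(records, path="finetune_data.jsonl"):
    """Write prompt/completion pairs with PII scrubbed from both fields."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps({
                "prompt": redact(rec["prompt"]),
                "completion": redact(rec["completion"]),
            }) + "\n")

# Example usage with a single hypothetical record.
build_training_file([{
    "prompt": "Summarize the ticket from jane.doe@example.com",
    "completion": "Customer called from +1 555 123 4567 about a refund.",
}])
```

Even with redaction in place, it is safer to leave genuinely sensitive records out of the training set entirely, since the model only needs to memorize a detail once for it to leak.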