Summary

  • ChatGPT and other similar language models like Google Bard may have been trained on personally identifiable information.
  • Users are providing confidential information to ChatGPT themselves, which also raises significant privacy issues.
  • If you're concerned about your data being used by these language models, the options are limited. For example, you can either refrain from using them or reach out to companies to request the removal of your data.

ChatGPT has taken the world by storm in 2023, and with good reason. It can generate text and images for all kinds of things at a level that no other service was previously capable of. It's since been in active competition with Google and other LLMs, but by and large, ChatGPT has reigned supreme. However, ChatGPT, and by extension others like Google Bard, is a privacy nightmare for a myriad of reasons.

GPT-3.5 and GPT-4 were trained on personally identifiable information

It's hard to tell exactly what, though

Source: Google DeepMind

When it comes to the training of both of these models, OpenAI has already said that they may have been trained on private information that was found publicly. Just recently, researchers at Google's DeepMind discovered that asking ChatGPT to repeat the word "poem" infinitely resulted in it spitting out random data about individuals. In some cases, it gave full names, addresses, and phone numbers of people that were in its training data, and it's not clear why that was. Regardless, training on private data has caused a number of problems for OpenAI. It was banned in Italy initially due to privacy concerns, though it has since been reinstated with a number of warnings.

Even then, privacy and OpenAI are dicey. While one may argue that the data it was trained on was publicly shared online, contextual integrity is important when examining how the data was collected and analyzed.

Sadly, the scraping of this data online has also led to websites like Reddit and X (formerly Twitter) significantly scaling back how much of their data is available and through what methods. GPT-3.5 and GPT-4 were both trained on publicly available information online, including Reddit threads and tweets. One Twitter user even discovered their own tweet was very likely in GPT-4's training data.

While we'll likely never know for sure if the person's tweet was actually in the training data, the point is that it absolutely could have been. OpenAI will never confirm (or potentially will never even be able to confirm) what specific data is in the training set.

All of this poses questions for EU regulators, who may also have questions regarding GDPR regulations. If the training data contains personal information, how can one request a copy of their data under GDPR? How can I, an EU citizen, request that OpenAI delete all the data that pertains to me? I can't, and the company very likely can't either.

People tell ChatGPT anything

Including doctors giving private information about patients

The other problem is that people are telling it literally anything. Content being put into ChatGPT is used as training data by default (though this can be turned off), and that includes confidential data that users have entered about themselves or about the companies that they work for. JP Morgan and Verizon have both entirely blocked employees from using the service, and Amazon had to warn employees not to do so, too.

Even if you opt out of data collection and training, your chats are still held for 30 days, including if you enter confidential information with the setting turned off. In that time period, your conversations could be accessed by third parties in the event of a data breach or even rogue employees. We know this because if they're to be used for training, conversations would need to be stored in plaintext with access given to OpenAI employees in the first place.

Security firm Cyberhaven has seen a lot of sensitive information put into ChatGPT, including doctors typing in patient names and diagnoses, which is a major problem. ChatGPT is inherently a privacy nightmare both from the data it has collected externally and the data that it has collected internally from its own users. It's essentially taking user data and then selling it back to you as ChatGPT Plus.

Cyberhaven's actual breakdown of that data is even more astonishing. It found that "11% of data flowing into ChatGPT" could be marked as sensitive. Over the course of a week, it estimated that the average 100,000-person company experienced:

  • 43 leaks of sensitive project files (Example: a land acquisition planning document for a new theme park)
  • 75 leaks of regulated personal data PII (Example: a list of customers with their associated home addresses that needs to be reformatted)
  • 70 leaks of regulated health data PHI (Example: a letter being drafted by a doctor to a patient’s insurance company with details of their diagnosis)
  • 130 leaks of client data (Example: content from a document sent to a law firm by their client)
  • 119 leaks of source code (Example: code used in a social media app that needs to be edited to change its functionality)
  • 150 leaks of confidential documents (Example: a memo discussing how to handle an upcoming regulatory action by the government)"

All of these concerns are applicable to Google Bard as well, and the company is very upfront about that fact. Google says in its privacy hub not to "enter confidential information in your Bard conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies."

What can you do about it?

There are a few options

If you're bothered by your data being used for training large language models, there isn't a lot you can do aside from not using them directly. They've already scraped other sites that you may have been active on, and that behavior will likely continue. Your best bet is to make your voice heard and reach out to companies, asking them to remove your data from their models if they can.

If you aren't bothered by your data being scraped on other websites, and you still want to use ChatGPT, Google Bard, or any other LLM, then you have two options. The first is to opt out of any data collection and hope that the data is effectively removed after 30 days like OpenAI says it is, or you can run an LLM locally on your computer. If you have a powerful PC, LM Studio will let you run powerful LLMs locally on your computer so you can interact with them without any data leakage.