Many are calling for the regulation of AI applications. Ideally, the mass data used to train AI models should no longer contain personal data, even when it comes from public sources. Germany's Federal Data Protection Commissioner, for example, demands exactly this. What does this mean in practice?
Introduction
An AI model is an electronic brain, representing a neural network. The connections between its neurons represent knowledge, much like in the human brain. This knowledge is fed in by reading millions or billions of freely available online documents, above all web pages.
Many of these texts contain personal data, which therefore ends up in the training data of an artificial intelligence. Moreover, outputs generated by a chatbot from this training data can also contain personal data.
Some people, such as Germany's Federal Data Protection Commissioner, find it problematic that personal data ends up in AI models. This raises several fundamental questions:
- Does the data owner (the data subject) consent to their personal data ending up in a particular AI model? More precisely (as long as there is no requirement for consent):
- How can a data owner block their data from being used in AI models (opt-out)?
- How can data from an existing AI model be deleted retrospectively?
These questions give rise to a number of problems in practice, which are discussed below.
When does personal data exist?
Whether a data value is personal can very often not be determined at all, or at least not reliably. A human may often recognize people's proper names as such, but certainly not always. A machine (AI) does this even less reliably.
Directly personal data, such as names or postal addresses, cannot be reliably identified by machines.
Whether a vehicle registration plate, a phone number or a company name is personal, nobody knows (except someone intimately familiar with the vehicle, the number or the company). A machine therefore cannot know whether "Maier GmbH" is a personal data value. The name of a GmbH is personal if it can be traced directly or indirectly to a natural person (Art. 4 No. 1 GDPR). The name of a one-person GmbH is evidently personal; the name of a GmbH with 50 employees evidently is not. But if the name of that 50-employee GmbH is mentioned together with an employee who is 1.98 meters tall ("our company's tallest employee"), then this combined statement of company name and body height must be considered personal.
No automated process can reliably classify all data as personal or non-personal.
Algorithms therefore always involve considerable uncertainty when recognizing personal data.
The previous example in particular makes it clear that nobody and nothing can reliably tell whether data is personal or not. Even a telephone number does not reveal whether it belongs to a person or a company, or whether that company consists of one person or several.
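This ambiguity can be demonstrated in a few lines. The sketch below is illustrative only: the regular expression and both phone numbers are invented for the example. A pattern-based detector fires on both sentences, yet nothing in the matched digits reveals whether the number belongs to a private person or to a company.

```python
import re

# Crude pattern for German phone numbers -- invented for illustration only.
PHONE = re.compile(r"\+49[\d\s/-]{6,}")

person_line  = "Reach me at +49 170 1234567."           # a private mobile: personal data
company_line = "Hotline: +49 69 1234567 (Maier GmbH)."  # a 50-employee firm: arguably not

# The format is identical in both cases, so a detector that sees only
# the data value cannot tell personal from non-personal.
print(bool(PHONE.search(person_line)))   # True
print(bool(PHONE.search(company_line)))  # True
```

The same holds for any format-based rule: the distinction lies in context the data value itself does not carry.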
How can data be blocked from use in AI models?
The short answer is: not at all, at least as things stand today. There is simply no standard for protecting data on websites from unwanted harvesting. Reading a public website is, by design, always possible; that is exactly what a website is for: to be accessible to as broad a public as possible. Robot programs (crawlers, scanners) can hardly be distinguished from a human reader, and many websites have no technical means of even attempting to do so. That is where the technology stands today.
The only currently practical approach is the robots.txt file. It lets website operators declare which crawlers may access their content and which may not. By now, this file is also respected by some AI applications that scrape content.
It is technically not possible to block your own data from being used in AI models.
As of today and until further notice.
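To make the robots.txt mechanism concrete, here is a minimal sketch using Python's standard urllib.robotparser. "GPTBot" is the user-agent name OpenAI publishes for its crawler, used here as an example of an AI scraper a site operator might want to exclude; the URL is invented.

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt that asks OpenAI's crawler ("GPTBot") to stay
# away while allowing everyone else. Note: this is a request, not an
# enforcement mechanism -- compliance is entirely voluntary.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article.html"))       # False
print(parser.can_fetch("SomeBrowser", "https://example.com/article.html"))  # True
```

The key point follows directly from the code: `can_fetch` only reports what the file asks for. A crawler that never calls such a check reads the page anyway.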
Many AI applications take no interest in this robots.txt file, or in any other exclusion request from website owners. Moreover, these are wishes rather than technically enforced rules. Even if ChatGPT says it respects a website's wish to block its content from AI use by ChatGPT, this is purely a matter of trust. Whoever still trusts OpenAI and ChatGPT should recall the facts:
- Italy's data protection authority has banned ChatGPT because OpenAI appears to have stored data illegally, such as user input.
- OpenAI did not request consent from the user, but merely offered an opt-out option.
- OpenAI now advertises ChatGPT Enterprise with the benefit "Get enterprise-grade security & privacy". In other words: "We only adhere to data protection rules if you buy the Enterprise version".
Those who trust companies like OpenAI, Google or Microsoft as soon as a reassuring report appears, even though these companies have repeatedly shown questionable behavior, are acting not rationally but out of wishful thinking.
Crawling datasets such as The Pile, Common Crawl or C4 are compiled independently of ChatGPT, but are then used to train ChatGPT and other large language models. In this way, one problem becomes many: one for each consumer of the data.
How is data deleted from an existing AI model?
The short answer is: not at all. To date, there is no mathematical procedure that can delete data from an AI model with surgical precision (or at all).
The only current way to delete data from an existing AI model is to discard the model and train it completely anew, this time leaving the data to be deleted out of the training set.
Data cannot be deleted from an existing AI model.
As of today and until further notice.
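The retraining approach described above can be sketched in a few lines. This is a toy illustration, not a real AI model: an ordinary least-squares fit with NumPy stands in for training, and all data is randomly generated. The point it demonstrates is that every training record leaves a trace in the learned weights, so honoring a deletion request means dropping the record and refitting from scratch.

```python
import numpy as np

def train(X, y):
    """Fit least-squares weights -- a toy stand-in for training an AI model."""
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w_full = train(X, y)

# "Delete" record 42: there is no operation on w_full alone that removes
# its influence, so we discard the model and retrain without the record.
keep = np.ones(len(X), dtype=bool)
keep[42] = False
w_retrained = train(X[keep], y[keep])

# The weights are not identical: the deleted record had left its trace.
print(np.array_equal(w_full, w_retrained))  # False
```

For models with billions of parameters trained on billions of documents, this full retrain is exactly what makes per-record deletion practically infeasible.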





My name is Klaus Meffert. I hold a doctorate in computer science and have worked professionally and hands-on with information technology for over 30 years. I also work as an expert in IT and data protection. I reach my conclusions by considering technology and law together, which seems essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized, secure AI solutions.
