Many are calling for the regulation of AI applications. Ideally, the mass data used to train AI models should no longer contain personal data, even when it comes from public sources. Germany's Federal Data Protection Commissioner, for example, demands exactly this. What does this mean in practice?
Introduction
An AI model is an electronic brain, representing a neural network. The connections between its neurons represent knowledge, much like in the human brain. This knowledge is fed in by reading millions or billions of freely available online documents, above all web pages.
Many of these texts contain personal data, which therefore ends up in the training data of an artificial intelligence. Moreover, outputs generated by a chatbot from this training data can also contain personal data.
Some people, such as Germany's Federal Data Protection Commissioner, find it problematic that personal data ends up in AI models. This raises several fundamental questions:
- Does the data owner (the data subject) consent to their personal data ending up in a particular AI model? More precisely (as long as there is no requirement for consent):
- How can a data owner block their data from being used in AI models (opt-out)?
- How can data from an existing AI model be deleted retrospectively?
These questions give rise to a number of problems in practice, which are discussed below.
When does personal data exist?
Whether a data value is personal can very often not be determined at all, or at least not reliably. A human may often recognize people's proper names as such, but certainly not always. A machine (AI) does this even less reliably.
Directly personal data, such as names or postal addresses, cannot be reliably identified by machines.
Whether a vehicle registration plate, a phone number or a company name is personal, nobody knows (except someone intimately familiar with the vehicle, the number or the company). A machine therefore cannot know whether "Maier GmbH" is a personal data value. The name of a GmbH is personal if it can be traced directly or indirectly to a natural person (Art. 4 No. 1 GDPR). The name of a one-person GmbH is evidently personal; the name of a GmbH with 50 employees evidently is not. But if the name of that 50-employee GmbH is mentioned together with an employee who is 1.98 meters tall ("our company's tallest employee"), then this combined statement of company name and body height must be considered personal.
No automated process can reliably classify all data as personal or non-personal.
Algorithms therefore always involve considerable uncertainty when recognizing personal data.
The previous example in particular makes it clear that nobody and nothing can reliably tell whether data is personal or not. Even a telephone number does not reveal whether it belongs to a person or a company, or whether that company consists of one person or several.
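This ambiguity can be demonstrated in a few lines. The sketch below is illustrative only: the regular expression and both phone numbers are invented for the example. A pattern-based detector fires on both sentences, yet nothing in the matched digits reveals whether the number belongs to a private person or to a company.

```python
import re

# Crude pattern for German phone numbers -- invented for illustration only.
PHONE = re.compile(r"\+49[\d\s/-]{6,}")

person_line  = "Reach me at +49 170 1234567."           # a private mobile: personal data
company_line = "Hotline: +49 69 1234567 (Maier GmbH)."  # a 50-employee firm: arguably not

# The format is identical in both cases, so a detector that sees only
# the data value cannot tell personal from non-personal.
print(bool(PHONE.search(person_line)))   # True
print(bool(PHONE.search(company_line)))  # True
```

The same holds for any format-based rule: the distinction lies in context the data value itself does not carry.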
How can data be blocked from use in AI models?
The short answer is: not at all, at least as things stand today. There is simply no standard for protecting data on websites from unwanted harvesting. Reading a public website is, by design, always possible; that is exactly what a website is for: to be accessible to as broad a public as possible. Robot programs (crawlers, scanners) can hardly be distinguished from a human reader, and many websites have no technical means of even attempting to do so. That is where the technology stands today.
The only currently practical approach is the robots.txt file. It lets website operators declare which crawlers may access their content and which may not. By now, this file is also respected by some AI applications that scrape content.
It is technically not possible to block your own data from being used in AI models.
As of today and until further notice.
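To make the robots.txt mechanism concrete, here is a minimal sketch using Python's standard urllib.robotparser. "GPTBot" is the user-agent name OpenAI publishes for its crawler, used here as an example of an AI scraper a site operator might want to exclude; the URL is invented.

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt that asks OpenAI's crawler ("GPTBot") to stay
# away while allowing everyone else. Note: this is a request, not an
# enforcement mechanism -- compliance is entirely voluntary.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article.html"))       # False
print(parser.can_fetch("SomeBrowser", "https://example.com/article.html"))  # True
```

The key point follows directly from the code: `can_fetch` only reports what the file asks for. A crawler that never calls such a check reads the page anyway.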
Many AI applications take no interest in this robots.txt file, or in any other exclusion request from website owners. Moreover, these are wishes rather than technically enforced rules. Even if ChatGPT says it respects a website's wish to block its content from AI use by ChatGPT, this is purely a matter of trust. Whoever still trusts OpenAI and ChatGPT should recall the facts:
- Italy's data protection authority has banned ChatGPT because OpenAI appears to have stored data illegally, such as user input.
- OpenAI did not request consent from the user, but merely offered an opt-out option.
- OpenAI now advertises ChatGPT Enterprise with the benefit "Get enterprise-grade security & privacy". In other words: "We only adhere to data protection rules if you buy the Enterprise version".
Those who trust companies like OpenAI, Google or Microsoft as soon as a reassuring report appears, even though these companies have repeatedly shown questionable behavior, are acting not rationally but out of wishful thinking.
Crawling datasets such as The Pile, Common Crawl or C4 are compiled independently of ChatGPT, but are then used to train ChatGPT and other large language models. In this way, one problem becomes many: one for each consumer of the data.
How is data deleted from an existing AI model?
The short answer is: not at all. To date, there is no mathematical procedure that can delete data from an AI model with surgical precision (or at all).
The only current way to delete data from an existing AI model is to discard the model and train it completely anew, this time leaving the data to be deleted out of the training set.
Data cannot be deleted from an existing AI model.
As of today and until further notice.
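The retraining approach described above can be sketched in a few lines. This is a toy illustration, not a real AI model: an ordinary least-squares fit with NumPy stands in for training, and all data is randomly generated. The point it demonstrates is that every training record leaves a trace in the learned weights, so honoring a deletion request means dropping the record and refitting from scratch.

```python
import numpy as np

def train(X, y):
    """Fit least-squares weights -- a toy stand-in for training an AI model."""
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w_full = train(X, y)

# "Delete" record 42: there is no operation on w_full alone that removes
# its influence, so we discard the model and retrain without the record.
keep = np.ones(len(X), dtype=bool)
keep[42] = False
w_retrained = train(X[keep], y[keep])

# The weights are not identical: the deleted record had left its trace.
print(np.array_equal(w_full, w_retrained))  # False
```

For models with billions of parameters trained on billions of documents, this full retrain is exactly what makes per-record deletion practically infeasible.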





My name is Klaus Meffert. I hold a doctorate in computer science and have worked professionally and hands-on with information technology for over 30 years. I also work as an expert in IT and data protection. I reach my conclusions by considering technology and law together, which seems essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized, secure AI solutions.
