Artificial intelligence: Personal data in AI models


Many are calling for the regulation of AI applications. Ideally, mass data for training AI models should no longer contain personal data, even if it comes from public sources. The Federal Data Protection Commissioner, for example, is calling for this. What does this mean in practice?

Introduction

An AI model is an electronic brain in the form of a neural network. The connections between its neurons represent knowledge, much as in the human brain. This knowledge is fed in by reading millions or billions of freely available online documents, above all web pages.

Many of the texts that flow into AI models contain personal data, which thus ends up in the training data of an artificial intelligence. Moreover, the outputs a chatbot generates from this training data can also contain personal data.

Some, such as Germany's Federal Data Protection Commissioner, find it problematic that this personal data ends up in AI models. This raises several fundamental questions:

  1. Does the data owner (the data subject) consent to their personal data ending up in a particular AI model? Or, more precisely, as long as no consent is required:
  2. How can a data owner block their data from being used in AI models (opt-out)?
  3. How can data from an existing AI model be deleted retrospectively?

These questions give rise to a number of problems in practice, which are discussed below.

When is data personal?

Whether a data value is personal often cannot be determined at all, or at least not reliably. A human may often recognize people's proper names as such, but certainly not always. A machine (AI) manages this even less reliably.

Directly personal data, such as names or postal addresses, cannot be reliably identified by machines.

Whether a vehicle registration plate, a phone number or a company name is personal, nobody knows (except someone intimately familiar with the vehicle, the phone number or the company). A machine therefore cannot know whether "Maier GmbH" is a personal data value. The name of a GmbH is personal if it can be linked directly or indirectly to a natural person (Art. 4 No. 1 GDPR). The name of a one-person GmbH is evidently personal; the name of a GmbH with 50 employees evidently is not. But if the name of a GmbH with 50 employees is mentioned together with an employee who is 1.98 meters tall ("our company's tallest employee"), then this combined statement of company name and body height must be considered personal.

Data can never be reliably classified in its entirety as personal or non-personal by automated means.

Algorithms therefore always involve considerable uncertainties when recognizing personal data.

The last example in particular makes it clear that nobody and nothing can reliably tell whether data is personal or not. Even a telephone number does not reveal by itself whether it belongs to a person or a company, nor whether that company consists of one person or several.
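To make this concrete, here is a minimal Python sketch of a naive detector; the patterns and sample sentences are invented for illustration. It flags candidate strings, but whether a hit actually relates to an identifiable natural person is undecidable from the text alone, and the combined height statement matches no pattern at all:

    import re

    # Naive, illustrative patterns for phone numbers and company names.
    PHONE = re.compile(r"\+?\d[\d /-]{6,}\d")
    COMPANY = re.compile(r"\b[A-ZÄÖÜ][\wäöüß]+ GmbH\b")

    def flag_candidates(text):
        """Return substrings that *might* be personal data.

        Whether a hit really relates to an identifiable natural person
        (Art. 4 No. 1 GDPR) cannot be decided from the text alone.
        """
        return PHONE.findall(text) + COMPANY.findall(text)

    samples = [
        "Call us at +49 6151 123456.",    # private line or company switchboard?
        "Maier GmbH won the tender.",     # one-person GmbH or 50 employees?
        "Our company's tallest employee is 1.98 m tall.",  # personal, yet no match
    ]
    for s in samples:
        print(s, "->", flag_candidates(s))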

How can data be blocked from use in AI models?

The short answer is: not at all, at least as things stand today. There is simply no standard for protecting data on websites from unwanted access. Reading a public website is obviously always possible; that is exactly what a website is for: to be accessible to as broad a public as possible. Robot programs (crawlers, scanners) can hardly be distinguished from a human reader, and many website operators lack even the technical means to attempt such a distinction. That is where technology stands today.

The only currently practical approach is the robots.txt file. It allows website operators to declare which crawlers may access their content and which may not. By now this file is also respected by some AI applications that scrape content.
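For illustration, a robots.txt that asks known AI crawlers to stay away could look like the following; GPTBot and CCBot are the user-agent tokens published by OpenAI and Common Crawl respectively, and whether a crawler honors them remains, as noted, voluntary:

    # Block OpenAI's crawler
    User-agent: GPTBot
    Disallow: /

    # Block Common Crawl's crawler
    User-agent: CCBot
    Disallow: /

    # Everyone else remains unrestricted
    User-agent: *
    Allow: /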

It is technically not possible to block your own data from being used in AI models.

As of today and until further notice.

Many AI applications take no interest in this robots.txt file or in any other exclusion requests from website owners anyway. Moreover, it expresses wishes, not technically enforceable rules. Even if OpenAI says that ChatGPT respects a website's wish to block its content from AI use, this is purely a matter of trust. Anyone who still trusts OpenAI and ChatGPT should recall the facts:

  1. Italy's data protection authority temporarily banned ChatGPT because OpenAI appeared to have stored data unlawfully, such as user input.
  2. OpenAI did not request consent from the user, but merely offered an opt-out option.
  3. OpenAI now advertises ChatGPT Enterprise with the benefit "Get enterprise-grade security & privacy". In other words: "We only adhere to data protection rules if you buy the Enterprise version".

Anyone who trusts companies like OpenAI, Google or Microsoft as soon as a reassuring report appears, even though these companies have repeatedly behaved questionably in the past, is acting not rationally but out of wishful thinking.

Crawling corpora such as The Pile, Common Crawl or C4 are created independently of ChatGPT at first, but are then used to train ChatGPT and other large language models. In this way, one problem becomes many: one for each consumer of the data.

How is data deleted from an existing AI model?

The short answer is: not at all. In any case, to date there is no mathematical procedure that can be used to delete data from an AI model with surgical precision (or at all).

Currently, the only way to delete data from an existing AI model is to throw the model away and train it again from scratch. During retraining, the data to be deleted is simply excluded from the training set.
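As a minimal Python sketch of this idea (the corpus and the deletion requests are hypothetical stand-ins), the offending documents are dropped before the next training run; the already trained model remains unaffected until it is fully retrained:

    # Hypothetical deletion requests: strings whose documents must go.
    deletion_requests = ["Maier GmbH", "+49 6151 123456"]

    def filtered_corpus(corpus):
        """Yield only documents free of all strings to be deleted.

        The deployed model is untouched by this filter; the data only
        disappears after a complete retraining run on the filtered corpus.
        """
        for doc in corpus:
            if not any(term in doc for term in deletion_requests):
                yield doc

    corpus = [
        "Maier GmbH won the tender.",
        "The weather stays sunny tomorrow.",
    ]
    print(list(filtered_corpus(corpus)))  # -> ['The weather stays sunny tomorrow.']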

Data cannot be deleted from an existing AI model.

As of today and until further notice.

This sounds extremely complicated and expensive, and that is exactly what it is. Training an AI model from scratch, especially a large language model, is extremely time-consuming, very costly and takes what feels like an eternity, even on huge server farms. An AI server consumes a lot of power and is expensive because it runs one or more high-end graphics cards in parallel in order to finish the otherwise interminable calculations in an acceptable time.

A pragmatic but unattractive way of at least nominally removing data from an AI system is to run the model's responses through a filter that removes all occurrences of a specific person's name or telephone number. However, this cannot be done reliably. Besides, the data is still present in the model even if it is never emitted in a response. The situation resembles an email from a former contact who asked for their data to be deleted but whose data was kept: at the next inspection by a supervisory or law-enforcement authority, which admittedly should be rare, or at the next data leak caused by a hacker attack, the dilemma becomes visible to everyone.
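A minimal Python sketch of such an output filter follows; the names and numbers are invented for illustration. It masks only exactly the known strings, so misspellings, paraphrases and indirect references slip through, and the data itself remains inside the model:

    import re

    # Hypothetical deletion requests: exact strings to suppress in output.
    BLOCKLIST = ["Erika Mustermann", "+49 6151 123456"]

    def filter_response(text):
        """Mask all known occurrences in a model response.

        Misspellings ("Erika Musterman"), paraphrases ("the founder of
        Maier GmbH") and other indirect references pass through, which
        is exactly why such a filter cannot work reliably.
        """
        for term in BLOCKLIST:
            text = re.sub(re.escape(term), "[removed]", text)
        return text

    print(filter_response("Contact Erika Mustermann at +49 6151 123456."))
    # -> Contact [removed] at [removed].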

What is AI actually changing?

Search engines have long been producing answers from the content they have read. These answers are not always consistent with the facts either. As far as is known, no data protection authority has raised an eyebrow about this so far.

AI-driven chatbots can give answers in a new form, referred to as abstractive: instead of a quotation, the user receives a text in new words. This can easily produce incorrect or false information.

On social media, the number of false statements made about individuals is not exactly low either, so the current excitement about AI is hard to fully understand. Some current statements lean a little toward activism.

In fairness to many, it should be noted that the unknown ("the AI") apparently drives them into such genuine concern that they feel they must do something about it. That this produces wishes which cannot be fulfilled is reminiscent of § 26 TDDDG, which, however, was dictated into law by lobbyists.

Control over your own data

In fact, no one has technical control over their own data as soon as it falls into the hands of third parties, for example by publishing it on a website or using/providing it on a social media platform.

By control over one's own data for use in AI models, the Federal Data Protection Commissioner probably means specific platforms on which a person has an account as the data owner. Although this case is relevant and important, it has nothing to do with AI specifically. Of course, all personal data should only be processed in accordance with the GDPR, whether by an AI or otherwise.

Summary

Personal data cannot be reliably identified as such; neither a human nor a machine succeeds in doing so. This will remain so forever unless the definition in Art. 4 No. 1 GDPR, which determines what personal data is, changes.

Data cannot be blocked from being used in AI models. This problem could be solved in purely legal terms; technically, it can never be solved with certainty. Instead, one would have to rely on crawlers respecting a website's specifications (wishes!). One might almost as well rely on Microsoft, despite the massive security vulnerabilities that the company has created, ignored and downplayed.

Artificial intelligence cannot be satisfactorily regulated, however understandable the desire may be.

Wishes do not change the objective boundaries of reality.

Data cannot be deleted from existing AI models. This problem could theoretically be solved. It seems more likely, though, that AI models will soon "simply" be recomputed as soon as hardware, i.e. graphics chips (GPUs), has become much faster and much cheaper.

Conclusion

The desire to regulate AI is understandable, but it leads to demands that are unfulfillable and impractical. Whether this is accepted in order to create the impression of fulfilling political duties, or whether it is ignorance, remains to be seen.

It is generally not possible to decide whether data is personal. Perhaps an intergalactic analysis would help?

An artificial intelligence behaves like a human. Humans are often unreliable; you notice this at the latest when trying to arrange your next appointment. Even so-called experts frequently produce wrong or poor results. Why should it be any different for a computer program that imitates human intelligence?

Instead of making general, unfulfillable demands, very large companies could first be scrutinized extensively and sanctioned consistently, quickly and painfully. Further measures can then be derived from the knowledge gained.

Regardless of what future market-behavior rules look like, it should be noted that the enormous potential of AI applications, whether positive or negative, can no longer be stopped. Anyone can set up an AI model under their desk at any time, or download and use an existing one. It would be extremely counterproductive if these AI models could be used everywhere in the world except in Germany or the EU.

Key messages

It's difficult to reliably tell if data is personal or not, especially when using AI. This makes it challenging to ensure that AI models don't use personal data without consent.

It's impossible to reliably prevent personal data from being used in AI models because technology can't perfectly identify and block it.

Regulating AI is understandable but ultimately impractical due to its inherent nature and the ease of access to AI technology.

About

Computer-generated image
All images in this post were generated by a computer program: the self-developed AI system of Dr. DSGVO, similar to Midjourney. The images may be used freely on websites, with the request to set a link to this blog.
About the author on dr-dsgvo.de
My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at both technology and law, which seems to me absolutely essential when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
