Artificial intelligence: How do AI language models store data? Do the models also contain personal data?

AI language models store data in the form of columns of numbers. How exactly does this work, and why does it mirror how human intelligence works? Does the AI model contain personal or even copyrighted data after training?

Introduction

The triumphal march of today's AI began in 2017 with the invention of the Transformer approach. It works with an encoder and a decoder and uses so-called embeddings as carriers of meaning (semantics). An embedding is a series of numbers, also called a vector.

With language models, the idea is to derive the meaning of a word from its context and to store it as a vector. The context of a word consists above all of the other words in the same sentence. Meaning therefore arises from analyzing which terms occur together (co-occurrence).
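
To make co-occurrence concrete, here is a minimal Python sketch (an illustration, not the author's code; the toy sentences are made up) that counts how often two words appear in the same sentence:

```python
from collections import Counter
from itertools import combinations

# Toy corpus; in practice this would be millions of sentences.
sentences = [
    "the cat chases the mouse",
    "the dog chases the cat",
    "the mouse eats cheese",
]

# Count how often two words occur in the same sentence (co-occurrence).
cooccurrence = Counter()
for sentence in sentences:
    words = sorted(set(sentence.split()))
    for a, b in combinations(words, 2):
        cooccurrence[(a, b)] += 1

# Word pairs that often occur together tend to be semantically related.
print(cooccurrence.most_common(5))
```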

A modern AI system works by converting any kind of data into sequences of numbers. Examples of data types are texts (in language models), spoken language, images, videos, music, temperature sensor readings, weather data, stock prices, seismological values, odor sensor data, UV sensor values and everything else that can be expressed in numbers, i.e. digitized.

Whole words are sometimes stored in AI language models.

Also applies to newer ChatGPT models, see article.

In my opinion, this corresponds exactly to how the human brain works. The brain initially works in an analog way, the computer digitally. But since biological neurons in humans operate via an action potential, analog quickly becomes digital.

For AI language models, texts are therefore split into units such as sentences and then converted into semantically charged number sequences. This is done, for example, by means of the Word2Vec algorithm, which calculates a vector for each word in its context. There are now better methods than Word2Vec that behave the same way from the outside (see, for example, so-called Sentence Transformers).
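
As a minimal sketch of this idea, assuming the gensim library (version 4.x) is installed and using made-up toy sentences, a Word2Vec model can be trained like this:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens. Real training uses
# millions of sentences, so the resulting vectors are only illustrative.
sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
    ["a", "cat", "chases", "a", "mouse"],
]

# Each word is mapped to a 50-dimensional vector, learned from its context
# (the surrounding words within the given window).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

vector = model.wv["cat"]   # the embedding: a plain sequence of numbers
print(vector.shape)        # (50,)
print(model.wv.most_similar("cat", topn=3))
```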

Calculating with vectors

Using classical mathematics, two vectors can, among other things, be subtracted from one another, and their distance can be calculated. This distance expresses the semantic similarity or dissimilarity of the two terms represented by the vectors.
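
As a sketch with made-up vectors (using NumPy, which is an assumption, not something named in the article), the difference and the similarity of two word vectors can be computed like this:

```python
import numpy as np

# Made-up 4-dimensional word vectors; real embeddings have hundreds of dimensions.
cat = np.array([0.8, 0.1, 0.3, 0.5])
dog = np.array([0.7, 0.2, 0.4, 0.5])

difference = cat - dog                           # element-wise vector subtraction
euclidean_distance = np.linalg.norm(difference)  # how far apart the vectors lie

# Cosine similarity: close to 1.0 means very similar meaning, close to 0 unrelated.
cosine_similarity = np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog))

print(difference, euclidean_distance, cosine_similarity)
```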

For a large document collection (corpus), one can use Word2Vec to calculate vectors for all terms occurring in it. Up to that point, the system has no understanding of German (or English) grammar. Nevertheless, by comparing vectors, "the system knows" how terms relate to each other semantically.

Some popular conclusions that Word2Vec makes possible are listed here (a short code sketch follows the list):

  • *Poland* relates to *Warsaw* as *Spain* relates to **Madrid** (the bold term is the one Word2Vec determines itself when the first three italicized terms are entered).
  • The German word *Katze* corresponds to the English word *cat* (Word2Vec can thus be used for translation, and in a context-dependent way: the German word *Schnecke* can be an animal, a snail, but also part of a conveyor system, a screw conveyor).
  • *Chancellor* plus *woman* minus *man* = **female Chancellor** (in German: *Kanzler* + *Frau* − *Mann* = *Kanzlerin*).
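
Such analogies are queried via vector arithmetic on a trained model. The sketch below assumes a gensim Word2Vec model trained on a large German corpus (the toy model above would be far too small); the terms are those from the Chancellor and Poland examples:

```python
# Assumption: "model" is a gensim Word2Vec model trained on a large German corpus.

# Kanzler + Frau - Mann: with a good model, the top hit is "Kanzlerin".
print(model.wv.most_similar(positive=["Kanzler", "Frau"], negative=["Mann"], topn=1))

# Warschau + Spanien - Polen: with a good model, the top hit is "Madrid".
print(model.wv.most_similar(positive=["Warschau", "Spanien"], negative=["Polen"], topn=1))
```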

The basis for all of this is nothing more than words occurring in context, i.e. in sentences. This is exactly how humans understand texts, too, with the difference, for now, that machines have far less experience of their environment than humans. This will certainly change soon and lead to robots being by far the most intelligent beings on this planet (and on other planets). Unless, of course, humanity has destroyed itself by other means in the meantime and can no longer build these robots.

Back to the question of how a language model stores data, i.e. concepts, and whether these concepts can have a personal reference. A personal reference would be confirmed if proper names or identifiers such as phone numbers, vehicle registration plates or tax identification numbers were stored in the AI model in a reconstructable form.

Example of data management in the AI model

The following screenshot shows an excerpt from the vocabulary of a German AI model based on the GPT-2 architecture from OpenAI. In contrast to its successors, GPT-2 has been made publicly available.

Excerpt from the 52,000 vocabulary words of a German GPT-2 model

In total, the vocabulary consists of exactly 52,000 words. The reason for this relatively small number (compared to the much larger number of existing German words) is explained below.

Data pairs are recognizable. The first part, highlighted in yellow in the image, represents a term. The second part, shown here in blue, is the index or identifier of the term.

When looking at the terms, it is noticeable that many are preceded by what looks like a stray character. This is due to the encoding of the vocabulary and is explained further below.
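
The (term, index) pairs and the leading marker can be inspected directly with the Hugging Face transformers library. This is an illustration, not the author's setup: the model id dbmdz/german-gpt2 is an assumption naming one publicly available German GPT-2 model, not necessarily the one in the screenshot. In GPT-2-style byte-level BPE vocabularies, the marker is typically "Ġ", which encodes the space before a word:

```python
from transformers import AutoTokenizer

# Assumption: a publicly available German GPT-2 model; any GPT-2-style
# tokenizer shows the same (term -> index) structure.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")

vocab = tokenizer.get_vocab()   # dict: term -> index, as in the screenshot
print(len(vocab))               # total number of vocabulary entries

# Print a few entries sorted by index; many terms start with the byte-level
# BPE marker "Ġ", which represents a preceding space.
for term, index in sorted(vocab.items(), key=lambda item: item[1])[1000:1010]:
    print(index, repr(term))
```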

The terms were determined by using numerous texts to train the language model. In the example model at hand, the text corpus was formed from an excerpt of Wikipedia, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl.

The texts were then broken down into words, which poses a certain challenge. This problem belongs to the field of NLP. NLP stands for Natural Language Processing and denotes the processing of natural-language texts (or other modalities). Even widespread and well-developed frameworks such as SciPy and spaCy regularly make mistakes here, which the experienced AI developer only gets under control by using their own post-processing routines.
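
A sketch of such word and sentence splitting with spaCy (the German pipeline de_core_news_sm is an assumption and must be installed separately; the sample sentence is made up):

```python
import spacy

# Assumption: the small German pipeline was installed beforehand via
#   python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

doc = nlp("Dr. Meffert prüft z. B. Webseiten auf Datenschutzprobleme. Das dauert ca. 10 Sekunden.")

# Sentence and word segmentation; abbreviations such as "z. B." and "ca."
# are typical sources of errors that may require custom post-processing.
for sentence in doc.sents:
    print([token.text for token in sentence])
```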

AI language models can reproduce entire sentences verbatim, which are thus stored in the language model.

Also applies to ChatGPT-3.5 and ChatGPT-4, see post.

When the terms are determined, many dirty results arise, as will be shown in a moment. The terms are determined in a conventional manner, not with new AI methods; they represent a preliminary stage. Only after the terms have been determined is the new AI methodology applied, by using them to generate an AI language model, a process referred to as training.
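
How such a vocabulary of terms can be determined before the actual training is sketched below with the Hugging Face tokenizers library. This is an illustration under assumptions, not the original training code: the corpus file name is made up, and only the target size of 52,000 is taken from the example above:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Determine a GPT-2-style vocabulary from raw text files; this conventional
# step happens before the (neural) training of the language model itself.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_de.txt"],   # assumption: plain-text corpus file(s)
    vocab_size=52_000,         # matches the vocabulary size discussed above
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# The result is the term -> index mapping shown in the screenshot.
os.makedirs("german-gpt2-vocab", exist_ok=True)
tokenizer.save_model("german-gpt2-vocab")
```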

About the author on dr-dsgvo.de
My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by considering both technology and law, which seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
