AI language models store data in the form of columns of numbers. How exactly does this work, and why does it reproduce the intelligent behavior of humans? Does the AI model contain personal or even copyrighted data after training?
Introduction
The triumphal march of today's AI began in 2017 with the invention of the Transformer architecture. It works with an encoder and a decoder and uses so-called embeddings as carriers of meaning (semantics). An embedding is a sequence of numbers, also called a vector.
With language models, the idea is to determine the meaning of a word from its context and store it as a vector. The context of a word consists above all of the other words in the same sentence. Meaning therefore arises from analyzing the joint occurrence of several terms (co-occurrence).
A modern AI works by converting any kind of data into number sequences. Data types include, for example, texts (in language models), spoken language, images, videos, music, temperature sensor readings, weather data, stock prices, seismological values, odor sensor data, UV sensor values and anything else that can be expressed in numbers, i.e. digitized.
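As a minimal illustration of this point (a Python sketch using NumPy; the example values are made up), here is how two very different data types end up as plain number sequences:

```python
import numpy as np

# Text becomes a number sequence, here simply via its UTF-8 byte values.
# (Real language models use learned token IDs instead; see below.)
text = "Katze"
text_as_numbers = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
print(text_as_numbers)  # [ 75  97 116 122 101]

# Sensor readings are number sequences to begin with (hypothetical values).
temperatures = np.array([21.5, 21.7, 22.0, 21.9])
print(temperatures)
```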
Whole words are sometimes stored in AI language models. This also applies to newer ChatGPT models (see article).
In my opinion, this corresponds closely to how the human brain works. The brain initially works in an analog way, the computer digitally. Since biological neurons in humans fire via an action potential, the analog quickly becomes digital.
For AI language models, texts are therefore divided into units such as sentences and then converted into semantically loaded number sequences. This is done, for example, by means of the Word2Vec algorithm, which calculates a vector for each word in its context. There are now better methods than Word2Vec that behave the same way from the outside (see, for example, so-called Sentence Transformers).
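A minimal sketch of this step, assuming the gensim library and a toy corpus (the example sentences are made up; a real model would be trained on millions of sentences):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of words (normally produced by an
# NLP pipeline; see below).
sentences = [
    ["die", "katze", "jagt", "die", "maus"],
    ["der", "hund", "jagt", "die", "katze"],
    ["die", "maus", "frisst", "käse"],
]

# Train Word2Vec: every word receives a 100-dimensional vector derived
# from its co-occurrence with neighboring words (window=5).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv["katze"][:5])  # first 5 of the 100 vector components
```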
Calculating with vectors
Two vectors can be subtracted using classical mathematics, among other operations. Their distance can also be calculated. This distance expresses the semantic similarity or difference between two terms, via their vectors.
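What calculating with vectors means concretely, as a small NumPy sketch with made-up three-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

# Made-up toy embeddings; real ones come from a trained model.
katze = np.array([0.9, 0.1, 0.3])
hund  = np.array([0.8, 0.2, 0.4])
börse = np.array([0.1, 0.9, 0.7])

def cosine_similarity(a, b):
    # Close to 1.0 = same direction (semantically close), near 0.0 = unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(katze, hund))   # high: both are animals
print(cosine_similarity(katze, börse))  # low: semantically far apart
print(katze - hund)                     # the difference vector itself
```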
For a large document collection (corpus), one can use Word2Vec to calculate vectors for all terms occurring in it. Up to that point, the system has no understanding of German (or English) grammar. Nevertheless, by comparing vectors, "the system knows" how terms relate to each other semantically.
Some popular conclusions that Word2Vec makes possible are the following (a code sketch follows the list):
- *Poland* is to *Warsaw* as *Spain* is to **Madrid** (the bolded term is what Word2Vec itself determines when you input the three italicized terms).
- The German word *Katze* corresponds to the English word *cat* (Word2Vec can thus be used for translation, and in a context-sensitive way: the German word *Schnecke* can denote an animal, a snail, but also a machine part, a screw conveyor).
- *Kanzler* (chancellor) plus *Frau* (woman) minus *Mann* (man) = **Kanzlerin** (female chancellor).
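With gensim, such analogies are a one-liner. The sketch below reuses the variable `model` from the earlier sketch but assumes it was trained on a sufficiently large German corpus; with the toy corpus above, these words would not even be in the vocabulary:

```python
# "Poland is to Warsaw as Spain is to ?":
# add the vectors for Warschau and Spanien, subtract the one for Polen.
result = model.wv.most_similar(
    positive=["Warschau", "Spanien"],
    negative=["Polen"],
    topn=1,
)
print(result)  # on a large corpus: [("Madrid", ...)]

# "Kanzler plus Frau minus Mann":
print(model.wv.most_similar(positive=["Kanzler", "Frau"],
                            negative=["Mann"], topn=1))
```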
The basis for all of this is nothing more than words occurring in context, i.e. in sentences. Humans understand texts in exactly the same way, with the difference, for now, that machines have far less experience of the world than humans. This will certainly change soon and lead to robots being by far the most intelligent beings on this planet (and on other planets). Unless, of course, humans have self-destructed in the meantime and can no longer build these robots.
Back to the question of how a language model stores data, i.e. concepts, and whether these concepts can be person-related. A personal reference would be confirmed if proper names or identifiers such as phone numbers, vehicle registration plates or tax identification numbers were stored in the AI model in reconstructable form.
Example of data management in the AI model
The following screenshot shows an excerpt from the vocabulary of a German AI model based on the GPT-2 architecture from OpenAI. In contrast to its successors, GPT-2 has been made publicly available.

In total, the vocabulary consists of exactly 52,000 entries. The reason for this relatively small number (compared to the much larger number of existing German words) is explained below.
Data pairs are recognizable. The first part, highlighted in yellow in the image, represents a term. The second part, shown in blue, is the index or identifier of the term.
When looking at the terms, it is noticeable that many are preceded by a strange-looking character. This is due to the encoding of the vocabulary and is resolved below.
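The strange-looking character can be reproduced with the Hugging Face transformers library. A sketch, here using OpenAI's English GPT-2 tokenizer as a stand-in (the German model from the screenshot is not named here, but its vocabulary is structured the same way): the character Ġ is how byte-level BPE encodes a leading space.

```python
from transformers import GPT2Tokenizer

# English GPT-2 tokenizer as a stand-in for the German model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# The "Ġ" prefix marks tokens that begin with a space:
print(tokenizer.tokenize("Hello world"))  # ['Hello', 'Ġworld']

# The vocabulary is exactly the term -> index pairing from the screenshot:
vocab = tokenizer.get_vocab()
print(vocab["Ġworld"])  # the blue index for the yellow term
print(len(vocab))       # 50257 for English GPT-2 (52,000 in the German model)
```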
The terms were determined by using numerous texts to train the language model. In the present example model, the text corpus was formed from an excerpt of Wikipedia, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl.
The texts were then broken down into words, which presents a certain challenge. This problem belongs to the field of NLP. NLP stands for Natural Language Processing and denotes the processing of natural-language texts (or other modalities). Even widespread and well-developed frameworks such as SciPy and spaCy often make mistakes, which the experienced AI developer only gets a grip on with their own post-processing routines.
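A sketch of this decomposition step with spaCy and its small German model (assuming `de_core_news_sm` has been installed via `python -m spacy download de_core_news_sm`):

```python
import spacy

# Small German pipeline; larger models segment more reliably.
nlp = spacy.load("de_core_news_sm")

doc = nlp("Die Katze jagt die Maus. Der Hund schläft.")

# Sentence segmentation, then word segmentation per sentence.
for sent in doc.sents:
    print([token.text for token in sent])
# Expected roughly:
# ['Die', 'Katze', 'jagt', 'die', 'Maus', '.']
# ['Der', 'Hund', 'schläft', '.']
```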
AI language models can reproduce entire sentences verbatim, which are thus stored in the language model.
This also applies to ChatGPT-3.5 and ChatGPT-4 (see post).
When determining the terms, many dirty results arise, as shown above. The terms are determined in a conventional manner, not with new AI methods; they represent a preliminary stage. Only after the terms have been determined is the new AI methodology applied, by using them to generate an AI language model, a process referred to as training.
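The vocabulary size of exactly 52,000 mentioned above is a parameter of this conventional pre-stage. A sketch with the Hugging Face tokenizers library, assuming the corpus is available as plain-text files (the file name is hypothetical); byte-level BPE also produces the Ġ markers seen in the screenshot:

```python
from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE, the same vocabulary-building scheme GPT-2 uses.
tokenizer = ByteLevelBPETokenizer()

# Learn exactly 52,000 vocabulary entries from the corpus files.
tokenizer.train(
    files=["corpus.txt"],          # hypothetical corpus file
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# Writes vocab.json (the term -> index pairs from the screenshot) and merges.txt.
tokenizer.save_model(".")
```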

My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at both technology and law, which seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
