AI language models store data in the form of columns of numbers. How exactly does this work, and why does it replicate human intelligence? Does an AI model contain personal or even copyrighted data after training?
Introduction
The triumphal march of today's AI began in 2017 with the invention of the Transformer approach. It works with an encoder and a decoder and uses so-called embeddings as carriers of meaning (semantics). An embedding is a series of numbers, also called a vector.
With language models, the idea is to determine the meaning of a word from its context and store it as a vector. The context of a word consists above all of the other words in the same sentence. Meaning therefore arises through analysis of the joint occurrence of several terms (co-occurrence).
A modern AI works by converting any kind of data into number sequences. Data types include texts (in language models), spoken language, images, videos, music, temperature sensor values, weather data, stock prices, seismological values, odour sensor data, UV sensor values and everything else that can be expressed in numbers, i.e. digitized.
Whole words are sometimes stored in AI language models.
Also applies to newer ChatGPT models, see article.
In my opinion, this corresponds exactly to how the human brain works. The brain initially works analogously, the computer digitally. Since biological neurons in humans fire via an action potential, analog quickly becomes digital.
For AI language models, texts are therefore divided into units such as sentences and then converted into semantically charged number sequences. This is done, for example, by means of the algorithm Word2Vec, which calculates a vector for each word in a context. There are now better procedures than Word2Vec that behave the same way from the outside (see, for example, so-called Sentence Transformers).
Calculating with vectors
Using classical mathematics, two vectors can, among other things, be subtracted from one another. The difference represents the semantic similarity or dissimilarity of two terms, expressed via their vectors.
For a large document collection (corpus), vectors can be calculated with Word2Vec for all terms occurring in the collection. Up to that point, the system has no understanding of German (or English) grammar. Nevertheless, by comparing vectors, "the system knows" how terms relate to each other semantically.
Some popular conclusions made possible by Word2Vec are listed here (a code sketch follows after the list):
- *Poland* relates to *Warsaw* as *Spain* relates to **Madrid** (the bolded term is what Word2Vec itself determines when the first three italicized terms are entered).
- The German word *Katze* corresponds to the English word *cat* (Word2Vec can be used for translations, and context-dependently: a "Schnecke" can be an animal (snail), but also a screw conveyor).
- *Kanzler* plus *Frau* minus *Mann* = **Kanzlerin** (Chancellor plus woman minus man = female Chancellor).
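Such analogies can be reproduced with a few lines of code. Here is a minimal sketch using the gensim library with a pre-trained English Word2Vec model (the model name and the download step are assumptions; any pre-trained Word2Vec model behaves analogously):

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pre-trained Word2Vec vectors (assumed public model)
# Poland -> Warsaw as Spain -> ? : add and subtract vectors, find the nearest word
print(wv.most_similar(positive=["Warsaw", "Spain"], negative=["Poland"], topn=1))
# semantic similarity of two terms as a single number between -1 and 1
print(wv.similarity("cat", "dog"))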
The basis for all of this is nothing but words occurring in context, i.e. in sentences. This is exactly how humans understand texts too, with the difference, for now, that machines have far less experience of the world than humans. This will certainly change soon and lead to robots being by far the most intelligent beings on this planet (and other planets). Unless, of course, humans have self-destructed in the meantime and can no longer build these robots.
Back to the question of how a language model stores data, i.e. concepts, and whether these concepts can be personal. A personal reference would be confirmed if proper names or identifiers such as phone numbers, vehicle registration plates or tax identification numbers were stored in the AI model in reconstructable form.
Example of data management in the AI model
The following screenshot shows an excerpt from the vocabulary of a German AI model that is based on the GPT-2 architecture of OpenAI. In contrast to its successors, GPT-2 has been made publicly available.

In total, the vocabulary consists of exactly 52,000 entries. The reason for this relatively small number (compared to the much larger number of existing German words) is explained below.
Data pairs are recognizable. The first part, highlighted in yellow in the image, represents a term. The second part, shown in blue, is the index or identifier of the term.
Looking at the terms, it is noticeable that many are preceded by an odd-looking character. This is due to the encoding of the vocabulary and is explained below.
The terms were determined by using numerous texts to train the language model. In the example model at hand, the corpus was formed from a slice of Wikipedia, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl.
The texts were then broken down into words, which presents a certain challenge. This problem belongs to the field of NLP. NLP stands for Natural Language Processing and denotes the processing of natural-language texts (or other modalities). Even widespread and well-developed frameworks such as SciPy and spaCy make occasional errors, which the experienced AI developer only gets a grip on with their own post-processing routines.
AI language models can reproduce entire sentences verbatim, which are thus stored in the language model.
Also applies to ChatGPT-3.5 and ChatGPT-4, see post.
When determining the terms, many dirty results arise, as the figure likewise shows. The terms are determined in a conventional manner, not with new AI methods. They represent a preliminary stage. Only after the terms have been determined is the new AI methodology applied by using them to generate an AI language model, a process referred to as training. Trained models are referred to as pre-trained rather than trained models. The reason is that further training of the models is possible, which is referred to as fine-tuning. Furthermore, once trained, i.e. generated, models can be used directly. They are therefore prefabricated (pre-trained).
Some of the terms do not read like valid words. Here is a selection of the terms just shown, together with a brief commentary (details and explanations below):
- rechtspopul → subword (word beginning). The whole word is probably "rechtspopulistisch" (with optional endings "e" or "en").
- Bemessungs → possibly resulting from a compound word (Bemessungs-Grundlage(n)).
- Memmingen → correct (at least existing) name of a German city.
- Tasman → subword (word beginning). The whole word is probably "Tasmanien".
- Straßenbahnen → Ä, Ö, Ü and ß are encoded illegibly, which makes the term look strange only to humans, not to a machine interpreter.
- Italian → perhaps a German text contained an English word. It is no coincidence that ChatGPT-3 can also speak German, even though it was trained for English. It is possible that texts in a language other than German were partly misrecognized as German.
The tokenizer as a word or word fragment generator
Words are extracted from texts using a so-called tokenizer. A token is a semantic unit, here a word. For GPT-2 there is a tokenizer with the technical name GPT2Tokenizer.
The tokenizer's task is not only to determine words, i.e. to find word boundaries. Rather, the tokenizer attempts to give a word a kind of meaning, defined in the form of a number. The GPT-2 tokenizer gives a word a different meaning depending on whether it is at the beginning of a sentence or in the middle or at the end.
This sometimes leads to ridiculously bad results, as the following official example of the Tokenizer (see previous link two paragraphs before) shows:
The input sentence "Hello world" leads to the following tokenizer output: [15496, 995]. Two numbers are therefore calculated from the two words to capture the semantics of the sentence.
The fact that modern AI language models store word fragments and whole words in the form of tokens is not a prerequisite for the existence of personal data in an AI model, but it does aggravate the problem.
The almost identical input sentence " Hello world", which is merely preceded by a (nonsensical, but for humans insignificant) space, generates the different output [18435, 995]. "Hello" therefore receives the value 15496, while " Hello" preceded by a space receives the different value 18435.
Generating two different numbers for the "same" word means teaching the AI language model the wrong thing.
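This behaviour can be checked with a few lines of standard code. Here is a minimal sketch using the transformers library mentioned further below (the token IDs are those of the original English GPT-2 vocabulary):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # load the original GPT-2 tokenizer
print(tokenizer.encode("Hello world"))   # [15496, 995]
print(tokenizer.encode(" Hello world"))  # [18435, 995] -- the leading space changes the first token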
The GPT-2 tokenizer is a so-called byte-pair encoding tokenizer, or BPE tokenizer. BPE encodes words into so-called tokens. The tokens represent word fragments and also have a compressing function, because word fragments can occur in several terms and the terms can then be stored in a more space-saving way. A term can, however, also be stored in its entirety, so that it corresponds to exactly one token. ([1])
This explains how the partial words above come about. A simple verification confirms, at least fundamentally, that the word fragment "Bemessungs" was derived from the full word "Bemessungsgrundlage", the full word "Bemessungs-Grundlage" or the full word "Bemessungsgrundlagen". To illustrate this, here are the following entries from the vocabulary of the German GPT-2 language model:
- "ĠDimensioning"
- "Foundation"
- "basis"
- "the foundation"
- "ĠBasic"
The first term "ĠBemessungs" is preceded by a somewhat peculiar first character, printed in bold here for illustration purposes. This character indicates that the token (word fragment) in question is a word beginning.
Terms two to four are not word beginnings, because their first character is not the control character. The entry "Grundlage" in the vocabulary therefore suggests that a compound word such as "Bemessungs-Grundlage" exists in the text corpus of the training data ("Bemessungs" as word beginning plus "Grundlage" as word ending).
Term five, on the other hand, is "ĠGrundlage" and, because of its first character, the control character, is to be regarded as a word beginning. Entries two and five from the list just shown are therefore two different word fragments (at least semantically, from the AI model's point of view). One is "Grundlage" as a word ending, the other "Grundlage" as a word beginning. Just for the sake of completeness: a word fragment that represents a word beginning can certainly be regarded as a full word, to which a word ending does not necessarily have to be attached as a complement. For the German reader, "Grundlage" is obviously an independent word. A word like "Grundlageschaffung" (somewhat contrived here in order to have an example), on the other hand, has the same word beginning but also an additional postfix and therefore obviously a different meaning.
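Such splits can be inspected directly with the tokenizer. Here is a minimal sketch, assuming the publicly available German GPT-2 checkpoint dbmdz/german-gpt2 (an assumption; the exact splits depend on the vocabulary of the model actually used):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")  # assumed German GPT-2 checkpoint
# the leading space marks a word beginning and becomes the Ġ control character
print(tok.tokenize(" Bemessungsgrundlage"))  # e.g. ['ĠBemessungs', 'grundlage'] -- actual split depends on the vocabulary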
In principle, it can be assumed that AI language models contain both personal and copyright-relevant data.
Justifications: See article.
This verification can be carried out analogously for the word fragments "Tasman" and "rechtspopul" mentioned above and shown in the illustration. "Tasman" points quite clearly to "Tasmanien". And indeed, the vocabulary of the GPT-2 model contains the entry "ien". If this entry were not there, the above explanation would be a little shaky. But this is not the case. The expected endings "istisch", "istische", "istischen", "istischer" and "istisches" can also be found for "rechtspopul". Only "istischem" is missing, but this is fine because the training texts do not necessarily have to contain this word form.
The more frequently a word occurs in the training data corpus, the more likely it is to be stored in the vocabulary as a single, long token. A word that occurs only once, by contrast, is probably stored in the form of several word fragments, each consisting of just a few letters. "ĠAsylpolitik" might be an example of a frequent term stored in its pure form (the first character is again the control character that marks a term as a word beginning or a full word). In addition, the optional word fragments "er", "erin" and "erinnen" as endings cover the inflections (i.e. Asylpolitiker, Asylpolitikerin etc.).
The OpenAI web interface shows how input text is converted into tokens. Here is a real example ([1]):

From the input text "Hallo, das ist ein Text", consisting of 23 characters, 10 tokens are generated. The tokens are colour-coded in the illustration. They include "Hall", "o", ",", " d", "as" etc. In this case, the only token representing a complete word is the one for the term "Text" from the input prompt. A more illustrative web interface allows the selection of specific chat models and displays the expected costs of tokenization. Note: the overall process of a chat involves further steps. Notably, costs rise when documents are uploaded.
A word fragment could itself be personal data. This is much less likely than with a fully spelled-out term, which can consist of several word fragments, but it is possible. In addition, names with special characters (for example, letters from other languages that do not occur in the standard German character set) are rarely split into word fragments because they share no common letter sequences with other terms. They therefore often sit in the AI's vocabulary as a full word and thus in plain text.
A word or a proper name may indeed be personal. However, it is doubtful whether the (isolated, context-free) existence of a name in a set of words is a problem. The situation is different when a name or another personal data value is mentioned in a context. Such a context is called a sentence. More on this below. First, the question of how an AI model generates words.
How do scraps of words become words again?
The user's input (also known as a prompt) is used to generate a response at the latest when an AI model is asked a question. As everyday experience with ChatGPT and other language models shows, this response consists of fully-fledged names and terms. It is therefore immediately apparent that the result of questioning an AI model is in the form of words that are embedded in a context. This context is formed by sentences.
The interesting question is whether it can be said that an AI model can contain personal data even when it is at rest. This risk already exists for word fragments, as described above.
What is personal data?
Personal data also includes pseudonymous data, pseudonymous meaning that a data value becomes personal again only after decoding. Whether the decoding is actually performed or is merely objectively possible does not matter. See Art. 4 No. 1 GDPR or the Breyer judgment of the ECJ (IP addresses are personal data because there is an objective possibility of identifying the subscriber).
How does an AI system decode series of numbers back into words?
First, when training an AI language model, the word fragments described above, called tokens, are generated from words. Each token is represented by a number. This makes the data easier for computers to process.
Then (depending on the application) individual sentences consisting of tokens are used to generate number sequences called vectors, which represent so-called (semantically charged) embeddings.
These embeddings, i.e. vectors, i.e. number sequences, are stored in an artificial neural network of an AI model. The model therefore consists only of number sequences. This is a somewhat simplified and imprecise representation, but probably sufficient for our purposes. A set of vectors that stand in relation to one another is called a tensor.
An AI language model stores personal data pseudonymously. Pseudonymous data is personal data.
See article and Art. 4 No. 1 GDPR.
If a user now asks a question in the form of a prompt to the AI model, the user request is also converted into number sequences, i.e. vectors, i.e. embeddings. The tokenizer described above is used for this.
Now the AI model compares vectors (= user request) with vectors (= "knowledge" that was previously taught to the AI model in the form of training data). The result is again vectors (one or more, depending on the length of the AI model's output).
These output vectors, i.e. series of numbers, i.e. semantically charged embeddings, are decoded again using the tokenizer just mentioned. The result is a collection of words that we refer to as sentences.
The tokenizer therefore encodes and decodes texts. It encodes the texts when training the AI model and converts a prompt into a form that the AI model understands. After the AI model has responded, there are series of numbers that the tokenizer understands and can convert back into words based on the assignment of word fragments to numbers (see above).
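This round trip of encoding and decoding can be illustrated in a few lines. Here is a minimal sketch, again with the GPT-2 tokenizer (any tokenizer with encode and decode methods behaves analogously):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("Cookies are not text files.")  # text -> series of numbers (token IDs)
print(ids)                                             # a list of token IDs
print(tokenizer.decode(ids))                           # series of numbers -> original text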
What information is contained in the AI model?
Assume that the above-mentioned word fragments (which can also represent whole words or names) contain no personal data, or none that is critical. A name, for example, could be uncritical. What is forbidden about knowing that there is a person with the name John Doe? I see no problem here if no other knowledge is linked to this name. If there is no context, knowing a name is probably not critical. It may be different with a telephone number or e-mail address, because these carry the risk of unfair advertising or similar.
Can it be assumed that an AI model stores data in such a way that it can be considered personal? The short answer is yes. Here is the proof:
ChatGPT is asked the following question, to which ChatGPT gives the stated answer:
Question to ChatGPT: "Quote the first sentence of Art. 1 of the GDPR verbatim."

ChatGPT's response: "Article 1(1) of the General Data Protection Regulation (GDPR) states: 'This Regulation lays down rules relating to the protection of natural persons with regard to the processing of personal data and the free movement of such data.'"
AI language models can reproduce trained texts verbatim. Here is an example with ChatGPT-3.5
ChatGPT answers the question here correctly, namely with an exact quote. Obviously, it must be assumed that individual, complete sentences exist exactly in the AI model.
In this example, no personal data is included, but it could be at any time; it simply comes down to the question asked. The example is intended to show that data can be stored exactly in an AI language model. From word fragments, not only words but entire sentences are formed. Incidentally, personal names can be found in Art. 99 GDPR.
Another example from November 28, 2023 ([1]) :
As you can see, personal data can be extracted from the language model (here: ChatGPT!) using a harmless prompt. The whole thing can also be automated, because ChatGPT offers a programming interface (API)! Incidentally, the source above has done exactly that:
Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim-memorized training examples.
Source: see above.
The following statement would probably be legally relevant if it were reproduced by an AI model, because this information would be subject to data protection: "Miri Maiering-Höflacher from Tuttlingen had her birthday on April 17, 1994 and, despite her full red hair at the time, now has no hair because she suffers from cancer of type X and disease Y, which she contracted as a result of her activities on the Reeperbahn."
Technical basics
The following image illustrates that in a Transformer, which underlies every current language model, positional data from text inputs are encoded.

From a text input, tokens are formed first, which are then converted into word vectors. Word vectors are essentially number sequences. In addition, the position of each word or token in the input text is encoded. The embedding of a word plus the position encoding of the word results in the input for the subsequent processing steps in the Transformer and thus in the language model.
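The position encoding from the original Transformer paper can be written down in a few lines. Here is a minimal sketch in Python with numpy (the sinusoidal encoding from "Attention Is All You Need"; the dimensions 1024 and 768 are those of GPT-2 and serve only as an example):

import numpy as np

def positional_encoding(num_positions, d_model):
    # sinusoidal position encoding: sin for even dimensions, cos for odd dimensions
    pos = np.arange(num_positions)[:, None]   # positions 0 .. n-1
    i = np.arange(d_model)[None, :]           # embedding dimensions 0 .. d-1
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = positional_encoding(1024, 768)  # each token's input = word embedding + pe[position]
print(pe.shape)                      # (1024, 768)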
The Transformer is based on a revolutionary paper named "Attention Is All You Need" from 2017. This year can therefore be seen as the beginning of modern AI. In this paper, the authors write:
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Quote from the paper Attention Is All You Need
This passage states that a mathematical procedure called attention takes different positions of the input data into account in order to transform them into a semantically charged structure. Attention is here the ability to recognize, from the input data, those parts that seem important for a given context (humans do nothing other than this kind of guessing with a high success rate).
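The core of the procedure, so-called scaled dot-product attention, is mathematically compact. Here is a minimal sketch with numpy (a toy example with random numbers, not the implementation of a real language model):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # how relevant each position is to each other position
    return softmax(scores) @ v       # weighted sum of the values

x = np.random.rand(4, 8)             # 4 tokens, 8-dimensional representations
print(attention(x, x, x).shape)      # (4, 8) -- self-attention: Q, K, V from the same input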
To refine the approach, so-called multiple heads are used. A head here is a layer that accepts an input (in the language model, this is a text). Stefania Cristina writes about this:
The idea behind multi-head attention is to allow the attention function to extract information from different representation subspaces, which would otherwise be impossible with a single attention head.
Reference: Stefania Cristina.
This means that multi-head attention is used to improve the capabilities of a language model. It also follows that an exact reproduction of data that was once fed to a language model as training data becomes less likely than if only one input head were used. However, as the ChatGPT example above shows, the language model's ability to reproduce learned texts accurately, word for word, is not lost as a result.
Instead, each phrase or sentence is stored in the language model as uniquely as possible. The following figure illustrates this by showing the internal representations of two sentences:
- Technical phrase: "to understand machine learning algorithms you need to understand concepts such as gradient of a function, Hessians of a matrix and optimization etc."
- Wise phrase: "patrick henry said give me liberty or give me death when he addressed the second virginia convention in march"

Note on the technical term: "Hessian" has nothing to do with Hesse, the German state with Germany's most inactive data protection authority, but refers to the Hesse normal form or the Hesse matrix.
The figure shows the graphical representation of the internal number representations of the two phrases mentioned. The technical phrase is shown on the left, the wise phrase on the right.
Both representations look similar when viewed only briefly, but are very different overall. Ideally, every other phrase has a completely different representation, so that each phrase is stored uniquely in a language model, i.e. represented internally.
Encoder-decoder structures are used in particular for the translation of texts. The text to be translated is entered into the encoder. The translated text is output from the decoder. Both parts are pre-trained using training data with pairs of input texts and translated reference texts.
The following two sentences can thus be converted into one another by translating the input text into the output text of an AI language model.
- Input text: „The agreement on the European Economic Area was signed in August 1992.“
- Output: "L'accord sur la zone économique européenne a été signé en août 1992."
The internal representation of input words to output words can be illustrated as follows:

The words of the input text to be translated are shown at the top. The words of the translated text are shown on the left. The intersections of two words are coloured to show how strongly a word pair correlates. White means the highest correlation. Thus, the word "signé" is maximally correlated with "signed", which seems correct because both words are equivalent in French and English in this context. The two French words "a" and "été", on the other hand, are each only moderately correlated (grey) with the English word "was", because the two French words together map to one English word. The area outlined in red shows the corresponding colour codes.
Another example shows how the position of a word is stored in the AI model in order to determine the words in a sentence that are semantically related to a word currently being processed by the language model:

Each line of text shows, from top to bottom, the next processing step of the input text in the AI model. The word currently being processed is printed in red. The words recognized by the language model as relevant with regard to the current word are highlighted in blue. The darker the blue, the more relevant the word.
What all these examples of the internal representation of words in AI language models show is that not only word positions are stored in an AI language model, but also entire phrases and sentences, which can therefore be reconstructed when an AI model is queried. Without position encoding, an AI model would not deliver useful results, at least not for the usual language models with their intended tasks (essentially: text generation).
A publication from 2018 (i.e. somewhat outdated) notes that Transformer does not provide particularly accurate storage of position information:
The transformer has no recurrent or convolutional structure, even with the positional encoding added to the embedding vector, the sequential order is only weakly incorporated.
Source: Lilian Weng.
However, this does not seem to have a negative impact on the ability of current language models to reproduce entire sentences in their original form, which is relevant from a data protection perspective (if personal data is mentioned). It also seems that the approach mentioned in the cited article, called SNAIL (Simple Neural Attention Meta-Learner), has not caught on. SNAIL was supposed to remedy the alleged weakness of Transformers of not being able to store position information very well. Since SNAIL is no longer relevant and Transformers are in wide use and can quote entire sentences without errors, Weng's statement is now rather irrelevant.
In principle, it must be assumed that an AI language model based on a modern method such as Transformer can store data from training input in its original form, even if this does not happen in every case.
A few words about Transformer
The Transformer approach in its original form, as proposed in the paper "Attention Is All You Need", is based on the above-mentioned encoder-decoder architecture.

Both Encoder and Decoder are based on positional encodings and embeddings (Embeddings = vectors = number sequences).
There are now other Transformer architectures, namely:
- Encoder-decoder: the original approach, used especially for translations or text summaries (for example T5),
- Decoder-only: causal language models, for example chatbots such as ChatGPT, but also LLaMA, Bard and others,
- Encoder-only: masked language models, such as BERT.
The differences lie in the details and cannot be considered in more detail here. What is essential is that all transformer architectures have analogous properties with regard to data storage ("training") and the retrieval of the trained data.
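For illustration, all three families can be loaded with the transformers library already mentioned. Here is a minimal sketch (the checkpoint names t5-small, gpt2 and bert-base-uncased are common public example models):

from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, AutoModel

seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # encoder-decoder (translation, summarization)
causal = AutoModelForCausalLM.from_pretrained("gpt2")        # decoder-only (text generation, chatbots)
masked = AutoModel.from_pretrained("bert-base-uncased")      # encoder-only (masked language model)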
What is needed to extract information from an AI model?
An AI model alone, without any accompanying information, is a mere collection of numbers, if you look at it in a somewhat simplified way. This is unlikely to give rise to a data protection problem.
However, no one saves an AI model without having or wanting to have the option of using the AI model. The parts required to use an AI model are:
- Tokenizer: A program code that can usually be downloaded in standardized form at any time if it has been deleted in the meantime.
- Vocabulary (word snippets) for the Tokenizer: A text file or file with mainly printable characters.
- AI model: List of numerical series (a simplified description).
- Transformer: A program code that can usually be downloaded in standardized form at any time if it has been deleted in the meantime.
A real compilation of the core data of an AI model is shown here:

These files are provided so that someone can download and use the GPT-2 AI model. The core file is pytorch_model.bin, with a size of approximately 3.7 gigabytes. The file vocab.json contains the tokens described above. The README.md file contains instructions on how to use the model. The other files with the .json extension are very small and contain configuration settings.
An AI model is like a ZIP archive in which files are stored in compressed form. Nobody intentionally saves ZIP files without being able to access them again later. This requires a ZIP program that can both create and unpack these files.
It is the same with PDF files: a PDF file can only be opened by someone who has a PDF viewer. Anyone can download such viewer programs from anywhere at any time. The same applies to the code for tokenizers and transformers as well as the vocabulary for a specific AI model. AI models are always offered together with all the necessary components, or if not, then together with a description of where the components can be obtained.
Technical details
A few technical details can be mentioned here only briefly. Tokens are not simply stored in an AI model. Rather, they also contain information on the positions of the tokens.
The following simple standard program code illustrates how a pre-trained GPT model can be loaded and how both the internal representation of the tokens and their position information can be accessed:
from transformers import GPT2LMHeadModel  # import library
model = GPT2LMHeadModel.from_pretrained('gpt2')  # load AI LLM
token_embeddings = model.transformer.wte.weight  # token embeddings
position_embeddings = model.transformer.wpe.weight  # token position embeddings
The used Python library named transformers is an absolute standard and can be downloaded from the internet at any time. It's even open-source.
The comments at the end of the lines begin with a hash and briefly explain what the program code does. The GPT2 model is used here because, unlike OpenAI successors, it is still freely available. Once the GPT model has been loaded, it can be evaluated. In the example code above, the weights are read out as an internal representation of the tokens stored in the model. The weights for the positions of the tokens in relation to each other are also read out in the same way.
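Following on from the code above, the dimensions of the weights that have been read out can be displayed. For the small original GPT-2 model, the following values appear (50,257 tokens in the vocabulary, a maximum of 1,024 positions, 768 numbers per embedding):

print(token_embeddings.shape)     # torch.Size([50257, 768]) -- 50,257 tokens, 768 dimensions each
print(position_embeddings.shape)  # torch.Size([1024, 768])  -- up to 1,024 positions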
To enter a prompt into an AI model and get the response, you could use the following code:
from transformers import GPT2Tokenizer  # import tokenizer class
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')  # load the matching tokenizer
# Convert question into token IDs
input_ids = tokenizer("Are cookies text files?", return_tensors="pt").input_ids
# Retrieve answer from AI LLM (generate up to 30 new tokens)
output_ids = model.generate(input_ids, max_new_tokens=30)
# Convert the result back into text
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Output the answer
print(answer)  # result would be at best: "No, cookies are not text files"
The code shows the individual steps for querying a model and obtaining the answer in a form that can be read by humans. This is usually programmed slightly differently to the example shown here.
Conclusion
AI language models store potentially personal data, as they store whole words, word components and word contexts (= word beginnings and matching possible word endings). An AI model contains at least pseudonymous data.
Modern AI language models such as ChatGPT and other Transformer-based models potentially store training data at word or even sentence level in the original.
Words are potentially stored in compressed (but often also uncompressed), human-readable form; sentences are stored in the form of references to words together with position information.
AI language models are also capable of reproducing entire sentences from input data verbatim. Although this ability is not reliable, it must be assumed in case of doubt.
Data can be extracted from an AI model by using the associated accompanying data and standard libraries. Without these components, an AI model is useless and can no longer really be called an AI model.
When an AI model is run locally on one's own AI server, many data protection problems can be mitigated. Local models can perform particularly well as question-answering assistants, but also as document search engines or image generators. When using models from third-party providers such as OpenAI, Microsoft or Google, on the other hand, there is the additional problem that input data ends up somewhere, and nobody knows where.
For specific tasks in a company, custom language models are therefore advisable. These typically build on pre-trained, publicly available and at the same time high-performance models. The quality is often better than that of ChatGPT, because the latter system is meant to do everything and is therefore in places notably unreliable, as simple investigations show (see link above).
Key messages
- AI language models store data as numerical representations called vectors, capturing the meaning of words based on their context.
- AI language models learn by storing and analyzing vast amounts of text data, enabling them to generate human-like text.
- Large language models can be further trained and used directly after initial training. They use a tokenizer to break down text into meaningful units called tokens, which represent words or parts of words.
- AI language models learn by identifying patterns in text data. This means they can sometimes generate partial words based on these patterns, even if the full word isn't present in their training data.
- AI models can potentially contain personal data, even when they are not actively being used, because they process and store information in a way that could be linked to individuals.
- AI language models like ChatGPT can store and reproduce entire sentences verbatim from their training data, including potentially sensitive personal information.
- Multi-head attention helps language models understand and remember information from different parts of a text more effectively, allowing them to store unique representations of phrases and sentences.
- Modern AI language models like Transformers can store and reproduce entire sentences from their training data, even though they don't perfectly remember word positions.
- AI models need more than just numerical data to function; they require accompanying tools like tokenizers, vocabulary lists and transformer code. These components are usually provided alongside the model itself.
- AI language models can store personal data because they learn from text and might retain parts of that data, even if it is disguised.
- Custom language models are often better than ChatGPT because they focus on specific tasks and are therefore more reliable.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
