
Training of AI models: What exactly does that mean?


AI language models and AI image generators are the most widely used types of AI model. People often speak of training, pre-training or fine-tuning. What do these terms mean, and how do they differ? Which data, and above all how much of it, does each process typically require?

Introduction

An AI model is an electronic brain consisting of a neural network. It can be asked a question and gives an answer, in a way that is strongly reminiscent of the human brain. Others hold a different opinion. In any case, the human brain is also based on statistics. On the question of what intelligence is, see the linked article.

Examples of types of AI models are:

  • AI language model, often referred to as an LLM (Large Language Model). However, highly capable SLMs (Small Language Models) are now available as well.
  • AI image generator: An image is generated from a text input. Often, an image can also be created from a text and an input image. Or several images can be stylistically linked together.
  • Text-to-Speech: From an input text, the AI model generates a spoken output
  • Speech-To-Text: From a speech input, the AI model generates a text (transcription)
  • Object recognition in image or video (segmentation)
  • Medical prognosis models

For simplicity, only language models and image models are considered below. They are very common representatives of the AI field.

There are essentially two training processes for AI models:

  1. Pre-Training
  2. Fine-Tuning

In practice, there are hardly any other training processes. Fine-tuning an already fine-tuned model is conceivable, but technically it amounts to the same thing as the first fine-tuning. The term post-training is used occasionally, but it too is based on fine-tuning.

What does pre-training mean, and how does it differ from fine-tuning? The following sections consider several constellations:

  1. Pre-training ("Creation") of a massive large language model, such as ChatGPT-4
  2. Pre-training of a small large language model (yes, you read that correctly), such as GPT-2
  3. Fine-tuning of the model from 1.
  4. Fine-tuning of the model from 2.

Cases 1 and 3 are usually handled by AI companies. Case 2 is less common; where it occurs, it tends to involve somewhat larger models than GPT-2, such as Llama3-8B. But even an 8B model is usually created and provided by AI companies.

Case 4 is feasible for practically any company. This article focuses on companies that want to introduce AI themselves and on organizations that support such companies.

Pre-Training

Pre-training means creating an AI model by training it from scratch. Before pre-training, the model does not exist; afterwards, it does.

People often speak simply of "training". Strictly speaking, this is not a precise term in this context. When someone says "training", they mean either pre-training or fine-tuning, depending on the context.

When someone talks about training a Custom-GPT, they mean fine-tuning. When someone generally talks about training a powerful language model, they mean pre-training (e.g. "Training ChatGPT-4 took many millions of hours of compute time, I read").

Pre-training is the training of a machine learning model.

It corresponds to a child's upbringing/education from birth by its parents, up to school education.

In case of doubt, one should assume that "training" means "pre-training", because the two terms are linguistically closer than "training" and "fine-tuning".

For language models, billions of text documents are needed for the model to reach very good quality. A document is usually an excerpt from a web page.

Known data sources are:

  • Common Crawl (CC) or C4 (Colossal Clean Crawled Corpus): approximately 700 GB of data, excluding many websites from the internet
  • The Pile: 825 GB of data, reportedly open source
  • Wikipedia (in several languages)
  • RefinedWeb: deduplicated and refined version of Common Crawl
  • StarCoder Data: approx. 780 GB of data for generating program code. The main sources are GitHub and Jupyter notebooks (interactive programming worksheets, loosely comparable to Excel sheets, that make it easy to create shareable program code).
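
To give an idea of what cleaning such a web corpus involves, here is a minimal sketch of exact-duplicate removal. It is only an illustration under simple assumptions: the function names are hypothetical, documents count as duplicates when they match after lowercasing and whitespace normalization, and near-duplicate detection, which real pipelines typically add, is left out.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        # Hashing keeps memory use small compared with storing full texts.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    "Pre-training creates a base model.",
    "pre-training   creates a base model.",  # same text, different spacing/case
    "Fine-tuning adapts a base model.",
]
print(deduplicate(docs))
```

Storing only hashes rather than the texts themselves is what makes this approach workable for corpora of several hundred gigabytes.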

The training time for a language model varies greatly with its size (from just a few hours to many months). For very large AI models, millions of GPU hours were consumed during pre-training. GPU stands for graphics processing unit, in other words a graphics card. A high-end AI server contains 8 graphics cards at around €25,000 per card.

Very small language models such as GPT-2 were not considered small just a few years ago; they were the gold standard. A GPT-2 language model can be pre-trained on your own AI server or laptop in a few hours, days or weeks. How long pre-training takes exactly depends on the volume of training data.
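
The relationship between GPU hours and elapsed time can be made concrete with a small calculation. The sketch below assumes perfect scaling across graphics cards, which real training runs do not reach, and uses a hypothetical workload of one million GPU hours.

```python
def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Days of elapsed time if the work is spread evenly over num_gpus."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus / 24


# Hypothetical workload of one million GPU hours.
for gpus in (8, 1_000, 10_000):
    print(f"{gpus:>6} GPUs: {wall_clock_days(1_000_000, gpus):,.1f} days")
```

On a single server with 8 graphics cards, such a workload would take about 14 years, which is why pre-training of very large models is run on clusters with thousands of cards.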

To make a highly capable AI language model, several terabytes (thousands of gigabytes) of raw texts as training data are required.

For a good start, even one hundred gigabytes are sufficient, and that amount can be processed quickly. Pre-training the AI model then takes only a manageable number of hours.

How long it takes also depends on the number of iterations. An iteration corresponds roughly to a school year. The more school years someone completes, the higher the chance that their intelligence increases. As with humans, however, at some point another year of school brings no further benefit. And, as with humans, pre-training for too long can ruin the learning success and make results worse again.
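
A common way to avoid training for too long is early stopping: training ends once a quality measure on held-out data stops improving. The following sketch assumes that one iteration is one pass over the training data and that a validation loss (lower is better) is recorded after each pass. The function name, the `patience` parameter and the loss values are made up for illustration.

```python
def best_epoch(val_losses: list[float], patience: int = 2) -> int:
    """Return the index of the iteration to keep.

    Training is considered finished once the validation loss has failed
    to improve for `patience` consecutive iterations.
    """
    best_index = 0
    best_loss = float("inf")
    without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_index, best_loss = epoch, loss
            without_improvement = 0
        else:
            without_improvement += 1
            if without_improvement >= patience:
                break  # further iterations only made things worse
    return best_index


# Made-up validation losses: improvement first, then deterioration.
losses = [2.1, 1.6, 1.3, 1.2, 1.25, 1.4, 1.7]
print(best_epoch(losses))
```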

An AI model created through pre-training is also called a foundation model (FM) or base model. A base model can be used for general tasks. The larger the base model, the better it can solve special tasks. The size of a model is expressed as the number of its neuron connections (parameters). Owing to its sheer size, ChatGPT can therefore calculate very well (at least better than most people on earth, taking into account the errors that both ChatGPT and humans make).
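
How the number of connections comes about can be shown for the simplest kind of neural network, a fully connected one. This is a sketch with hypothetical layer sizes. Language models use a different architecture, but their size is counted in the same way, as the sum of all trainable weights.

```python
def count_parameters(layer_sizes: list[int]) -> int:
    """Weights plus biases of a fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    """
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs  # one weight per connection
        total += outputs           # one bias per receiving neuron
    return total


# Hypothetical toy network: 784 inputs, two hidden layers, 10 outputs.
print(count_parameters([784, 128, 64, 10]))
```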

Fine-Tuning

Fine-tuning can also be referred to as fine training.

A prerequisite for fine-tuning is an existing AI language model, and the model exists only once it has been pre-trained. Only a pre-trained AI model can be fine-tuned.

Fine-tuning is comparable to a university education that follows high school.

Without a school education, university studies are not possible, or at least not meaningful.

Fine-tuning makes sense when a model is to be trained for a specific task. Fine-tuning thus trains the language model further.

A language model may not be able to summarize certain texts well out of the box. This may apply only in particular cases, for example in a doctor's practice whose medical records use vocabulary entirely different from that in the AI model's training data.

Fine-tuning therefore improves the abilities of a pre-trained AI model with regard to a specific task. This task is also referred to as a downstream task.

Depending on the task, the suitability of the model and the mathematical training method used, different amounts of data are needed to achieve good results.

For classifying texts, a hundred examples may suffice to successfully fine-tune. To have an AI image generator learn the style of an artist, ten examples are sufficient. After fine-tuning, the AI model then generates images that could have been painted by the creator of the 10 example images.
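
Why so few examples can be enough is easiest to see in one common fine-tuning variant: the pre-trained part stays frozen and only a small added component is trained. The sketch below is a toy illustration, not a real language model. `BASE_WEIGHTS` stands in for pre-trained knowledge, the task and data are invented, and only the three numbers of the "head" are learned from 100 examples.

```python
import math
import random

# Hypothetical "pre-trained" weights; they stay frozen during fine-tuning.
BASE_WEIGHTS = [[0.8, -0.3], [0.2, 0.9]]


def base_features(x):
    """Frozen base model: maps a raw input to two features."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in BASE_WEIGHTS]


def head_probability(head, x):
    """Probability of class 1 from the trainable head on top of the base."""
    weights, bias = head
    z = sum(w * f for w, f in zip(weights, base_features(x))) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(examples, epochs=200, lr=0.5):
    """Train only the head (logistic regression) with plain SGD."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in examples:
            error = head_probability((weights, bias), x) - label
            features = base_features(x)
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def make_examples(n, rng):
    """Toy task: the label depends on the sign of the first base feature."""
    examples = []
    while len(examples) < n:
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        score = 0.8 * x[0] - 0.3 * x[1]
        if abs(score) >= 0.2:  # keep a margin so the toy task is clean
            examples.append((x, 1 if score > 0 else 0))
    return examples


rng = random.Random(0)
train, held_out = make_examples(100, rng), make_examples(100, rng)
head = fine_tune(train)
correct = sum((head_probability(head, x) >= 0.5) == bool(y) for x, y in held_out)
print(f"held-out accuracy: {correct / len(held_out):.2f}")
```

The base already supplies useful features, so the head only has to learn how to combine them. Full fine-tuning, which also updates the base weights, follows the same idea but changes more parameters.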

Overall, fine-tuning requires, and is best served by, significantly less training data than pre-training. One can assume that a fine-tuning dataset often does not exceed 10,000 examples. Very often, far fewer than 10,000 examples are sensible and necessary. It all depends on the case. For completeness, one special case should be mentioned: a base model is fine-tuned with the goal of producing a genuinely improved version of the base model. This happened, for example, with Llama3, whose fine-tuned offshoot was given 64,000 training examples. This process is usually carried out by others. One can then use these improved models as if they had come out of pre-training that way.

In companies, fine-tuning is practiced on small language models. Small does not mean that the model is not a large language model (LLM); it describes the relation between a huge model (ChatGPT) and a very good LLM (such as Llama3-8B). ChatGPT has well over 1000 billion neuron connections, whereas an 8B model has "only" 8 billion.
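
The practical meaning of this size difference shows in the memory needed just to hold a model's weights. The following back-of-the-envelope sketch assumes 2 bytes per parameter for 16-bit precision, which is a common choice. Training needs considerably more memory than this, since gradients and optimizer state come on top.

```python
BYTES_PER_PARAMETER = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_parameters: float, precision: str = "fp16") -> float:
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[precision] / 1e9


print(weight_memory_gb(8e9))     # 8B model at 16-bit precision
print(weight_memory_gb(1000e9))  # hypothetical 1000B model at 16-bit precision
```

Roughly 16 GB for an 8B model is within reach of a single high-end graphics card, whereas around 2000 GB for a 1000B model requires many cards working together.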

Pre-Training versus Fine-Tuning

The following overview concisely presents the differences between pre-training and fine-tuning. It also covers data protection and synthetic data. Synthetic data are artificially generated data used to expand the volume of training data. These data are generated by AI models!

Attribute | Pre-Training | Fine-Tuning
Purpose | Creation of a general AI model | Improving an existing AI model for a specific task
Analogy | Upbringing of a child by its parents plus school education | University studies or further education after high school
Amount of training data | As much as possible, often billions of data sets | Often 10 examples are enough, often 100. Very rarely will there be 10,000 or more examples.
Compute time | For modern models, many millions of hours | Very few hours to weeks
Data protection | Cannot be complied with in practice | Can generally be complied with (for the fine-tuning data only)
Anonymization possible? | Not in practice | Yes, very well in principle
Copyright compliant? | Not in practice | Yes, very well in principle
Is synthetic data meaningful? | Only in an emergency or for improvements within a model series | Yes, for multiplying training data and increasing its variance
Differences between pre-training and fine-tuning of AI models.

Conclusion

From a data perspective, fine-tuning is orders of magnitude easier to control than pre-training. However, this applies only to the data that flows into fine-tuning. The original training data from pre-training are already stored in the AI model and can be retrieved.

Pre-training is a technical challenge. From a software perspective, it is almost the same as fine-tuning. However, it requires enormous computing capacities and an extremely large amount of training data.

Fine-tuning is completely different. It gets by with affordable consumer-grade hardware and very often requires little or very little training data.

Fine-tuning therefore inherits the "brain" with its stored original training data and adds only a small amount of new data. This new data can be controlled well from a GDPR point of view. Nevertheless, an unlawful base model that has been fine-tuned remains an unlawful fine-tuned model. The unlawful data in the base model thus taint all subsequent versions of the model. Something unlawful cannot become lawful by adding something lawful.

Synthetic data does not really improve quality or privacy in a base model:

  • Synthetic data can also contain references to a person or to a copyrighted work. No wonder, since they are modeled on genuine data.
  • When synthetic data is obtained by modifying real data, false statements about people may result. This would worsen the legal situation in the AI language model.

In general, one can say that AI models are competitive only when they have been given as much good training data as possible. Thus, all available competitive AI language models, closed and open source alike, are in fact formally unlawful. Incidentally, Mistral was also trained on data from the "open web", as Mistral itself says.

By the logic of jurisprudence, the continued, accepted use of something formally illegal will likely lead, in AI too, to it being considered permitted, or at least to its "illegal use" being tolerated.

Another problem is the use of cloud services such as ChatGPT or Azure. In this case, third-party data or business secrets are often sent to American companies and their national intelligence agencies.

If the data security argument alone is not convincing, it is advisable to define one's use cases specifically and to use an AI optimized for them. This type of AI is referred to here as Offline-AI. It runs completely autonomously, on either a rented or a company-owned server, and often delivers better results than general-purpose systems like ChatGPT.

Key messages

There are two main ways to train AI models: pre-training and fine-tuning. Pre-training creates the basic model, while fine-tuning adapts it for a specific task.

Training a powerful language model involves feeding it massive amounts of text data to learn patterns and relationships within language, similar to how a child learns through education. The more data and training time, the better the model performs on various tasks.

Fine-tuning is like specialized training for AI models, making them better at specific tasks by further teaching them with smaller, focused datasets.

Because AI models are trained on massive amounts of data, which may include copyrighted material or personal information, most existing AI models are technically illegal.

Use specialized AI ("Offline-AI") that operates independently on your own servers for better data security and control.

About

About the author on dr-dsgvo.de
My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
