Training of AI models: What does that exactly mean?


AI language models and AI image generators are the most widely used types of AI models. People often speak of training, pre-training or fine-tuning. What do these terms mean and how do they differ? Which data, and above all how much of it, is typically required for each process?

Introduction

An AI model is an electronic brain consisting of a neural network. It can be asked questions and gives answers, in a way that is strongly reminiscent of the human brain. Others hold a different opinion. In any case, the human brain is also based on statistics. On the question of what intelligence is, see the linked article.

Examples of types of AI models are:

  • AI language model, often referred to as an LLM (LLM = Large Language Model). However, there are now also highly capable SLMs (SLM = Small Language Model) available.
  • AI image generator: An image is generated from a text input. Often, an image can also be created from a text and an input image. Or several images can be stylistically linked together.
  • Text-to-Speech: From an input text, the AI model generates a spoken output
  • Speech-To-Text: From a speech input, the AI model generates a text (transcription)
  • Object recognition in image or video (segmentation)
  • Medical prognosis models

For simplicity, only language models and image models are referred to in what follows. These are very common representatives of the AI field.

There are essentially two training processes for AI models:

  1. Pre-Training
  2. Fine-Tuning

Other training processes hardly exist in practice. Fine-tuning an already fine-tuned model is conceivable, but technically this amounts to the same thing as the first fine-tuning. The term post-training is used occasionally, but it is based on fine-tuning.
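To make the distinction concrete, here is a deliberately tiny toy sketch (not a real language model: a single-weight model trained by gradient descent). Pre-training creates the weights from scratch on general data; fine-tuning continues from those weights on a small, task-specific dataset. All numbers and datasets are invented for illustration.

```python
import random

# Toy illustration (not a real LLM): a "model" with a single weight w,
# trained by gradient descent to fit y = w * x.

def train(w, data, lr=0.004, steps=500):
    """Run gradient-descent steps on (x, y) pairs; return the updated weight."""
    for _ in range(steps):
        x, y = random.choice(data)
        grad = 2 * (w * x - y) * x  # derivative of the squared error (w*x - y)**2
        w -= lr * grad
    return w

random.seed(0)

# Pre-training: the model does not exist yet; it starts from a random
# weight and is trained on a large, general dataset (true relation: y = 3x).
general_data = [(x, 3 * x) for x in range(1, 11)]
base_model = train(random.uniform(-1, 1), general_data)

# Fine-tuning: continue from the pre-trained weight on a small,
# task-specific dataset with a slightly different relation (y = 3.5x).
task_data = [(x, 3.5 * x) for x in range(1, 4)]
tuned_model = train(base_model, task_data, steps=300)

print(round(base_model, 2))   # close to 3.0
print(round(tuned_model, 2))  # pulled toward 3.5
```

Fine-tuning the fine-tuned model again would simply call `train` a third time on yet another dataset, which is why it technically amounts to the same operation as the first fine-tuning.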

What does pre-training mean and what is the difference from fine-tuning? The following representations consider several constellations:

  1. Pre-training ("Creation") of a massive large language model, such as ChatGPT-4
  2. Pre-training of a small large language model (yes, read correctly), such as GPT-2
  3. Fine-tuning of the model from 1.
  4. Fine-tuning of the model from 2.

Cases 1 and 3 are usually handled by AI companies. Case 2 is less common; when it does occur, it is usually for somewhat larger models than GPT-2, such as Llama3-8B. But even the 8B model is usually created and provided by AI companies.

The 4th case is practically feasible for every company. This article focuses generally on companies that want to introduce AI themselves, or on organizations that support such companies.

Pre-Training

Pre-training means training a machine learning model from scratch. Before pre-training, the model does not exist; pre-training is what creates it.

People often speak simply of "training". Strictly speaking, there is no separate process of that name in this context: whoever says "training" means either pre-training or fine-tuning, depending on the intended context.

When someone talks about training a Custom-GPT, they mean fine-tuning. When someone generally talks about training a powerful language model, they mean pre-training (e.g. "Training ChatGPT-4 took many millions of hours of compute time, I read").

Pre-training is the training of a machine learning model.

It corresponds to a child's upbringing/education from birth by its parents, up to school education.

In case of doubt, one must assume that "training" means "pre-training", because the term is linguistically closer to it than to "fine-tuning".

For language models, billions of documents containing text are needed for the model to reach very good quality. A document is usually an excerpt from a webpage on the internet.

Known data sources are:

  • Common Crawl (CC) or C4 (Colossal Clean Crawled Corpus): approximately 700 GB of data, with many websites from the internet filtered out
  • The Pile: 825 GB of data, purportedly open source
  • Wikipedia (in several languages)
  • RefinedWeb: deduplicated and refined version of Common Crawl
  • StarCoder Data: approx. 780 GB of data for generating program code. Sources are in particular GitHub and Jupyter notebooks (programming sheets, similar to Excel, but for the easy creation of shareable program code).
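The raw crawls behind such corpora are heavily filtered before training. As a rough illustration, the following sketch applies simplified cleaning heuristics of the kind used to build corpora like C4 (the real pipelines have many more rules, e.g. language detection, bad-word filters and document-level deduplication); the example page is invented.

```python
# Simplified web-page cleaning, loosely inspired by C4-style heuristics:
# keep only lines that look like real sentences, drop navigation
# fragments and duplicated boilerplate.

def clean_page(text, min_words=5):
    """Return only the lines of a page that look like real sentences."""
    kept = []
    seen = set()
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) < min_words:       # drop menus, buttons, ...
            continue
        if not line.endswith((".", "!", "?")):  # drop headlines, nav items
            continue
        if line in seen:                        # drop duplicated boilerplate
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)

raw = """Home | About | Contact
Large language models are trained on text from the web.
Large language models are trained on text from the web.
Click here
The quality of the training data strongly affects the model."""

cleaned = clean_page(raw)
print(cleaned)  # only the two real sentences survive, each once
```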

The training time for a language model varies greatly (from many months down to just a few hours). For very large AI models, millions of GPU hours were consumed during pre-training. GPU stands for Graphics Processing Unit, i.e. graphics card. A high-end AI server contains 8 graphics cards, at a price of around €25,000 per card.
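The order of magnitude of such numbers can be estimated with the widely used approximation that pre-training costs about 6 floating-point operations per parameter per training token (C ≈ 6·N·D). The sustained GPU throughput assumed below is a round illustrative figure, not a measured value.

```python
# Back-of-envelope pre-training cost: C ≈ 6 * N * D floating-point
# operations, with N = parameter count and D = number of training tokens.

def gpu_hours(n_params, n_tokens, flops_per_gpu_per_s=2e14):
    """Estimated GPU hours, assuming a sustained per-GPU throughput."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / flops_per_gpu_per_s / 3600

# Example: an 8-billion-parameter model trained on 1 trillion tokens
hours = gpu_hours(8e9, 1e12)
print(f"{hours:,.0f} GPU hours")  # on the order of tens of thousands
```

Divide the result by the number of GPUs in the server (e.g. 8) to get a rough wall-clock estimate.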

Very small language models (such as GPT-2) were not considered very small just a few years ago and were the gold standard. A GPT-2-class language model can be pre-trained on your own AI server or laptop in a few hours, days or weeks. How long pre-training takes exactly depends on the amount of training data.

To build a highly capable AI language model, several terabytes (thousands of gigabytes) of raw text are required as training data.

For a good first attempt, even a hundred gigabytes are sufficient, which the training process can read in quickly. In that case, training the AI model (pre-training) only takes a manageable number of hours.

How long it takes exactly also depends on the number of iterations. An iteration corresponds roughly to a school lesson. The more lessons someone attends in school, the higher the chance that intelligence increases. As with humans, however, at some point an additional year of school brings no further benefit. And just as with humans, learning success can be ruined by too much pre-training and deteriorate again.
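The "too much schooling" point corresponds to what practitioners call early stopping: a validation loss is monitored during training, and training stops once it no longer improves. A minimal sketch with a synthetic, U-shaped validation curve (the helper names and the curve are invented for illustration):

```python
# Early stopping: stop training when the validation loss has not
# improved for `patience` consecutive steps.

def train_with_early_stopping(step_fn, val_loss_fn, patience=3, max_steps=1000):
    """Run training steps until the validation loss stops improving."""
    best, bad = float("inf"), 0
    for step in range(max_steps):
        step_fn()                   # one training iteration
        loss = val_loss_fn()        # evaluate on held-out data
        if loss < best:
            best, bad = loss, 0     # improvement: reset the counter
        else:
            bad += 1                # no improvement
            if bad >= patience:
                break               # stop before overfitting gets worse
    return step + 1, best

# Synthetic validation curve: improves up to step 50, then worsens
# (mimicking overfitting from too many iterations).
state = {"s": 0}
def step_fn():
    state["s"] += 1
def val_loss_fn():
    return (state["s"] - 50) ** 2

steps, best = train_with_early_stopping(step_fn, val_loss_fn)
print(steps, best)  # stops shortly after the minimum at step 50
```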

An AI model that was created through pre-training, i.e. trained, is also called a Foundation Model (FM) or base model. A base model can be used for general tasks. The larger the base model, the better it can solve special tasks. The size of a model is expressed in the number of its neuron connections. ChatGPT can therefore calculate very well due to its sheer size (at least better than most people on this earth, taking into account the errors that both ChatGPT and humans make).
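Model size also determines hardware requirements, which can be estimated from the parameter count alone: each neuron connection (parameter) occupies a fixed number of bytes depending on the numeric precision used. A quick sketch (the parameter counts are illustrative):

```python
# Rough memory footprint of a model from its parameter count.
# Bytes per parameter: 4 (fp32), 2 (fp16/bf16), 1 (int8), 0.5 (4-bit).

def model_size_gb(n_params, bytes_per_param=2):
    """Approximate model size in gigabytes at the given precision."""
    return n_params * bytes_per_param / 1e9

print(model_size_gb(8e9))       # 8B parameters in fp16 -> 16.0 GB
print(model_size_gb(8e9, 0.5))  # the same model 4-bit quantized -> 4.0 GB
```

This is why small models run on a laptop while the largest base models require multi-GPU servers just to be loaded.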

Fine-Tuning

Fine-tuning can also be referred to as fine training.

About the author on dr-dsgvo.de
My name is Klaus Meffert. I hold a doctorate in computer science and have been working professionally and hands-on with information technology for over 30 years. I also work as an expert in IT & data protection. I arrive at my results by considering both technology and law, which seems to me absolutely essential when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and the development of optimized and secure AI solutions.
