AI language models and AI image generators are the most widely used types of AI models. People often speak of training, pre-training or fine-tuning. What do these terms mean, how do they differ, and which data, and above all how much of it, is typically required for each process?
Introduction
An AI model is, loosely speaking, an electronic brain consisting of a neural network. It can be asked questions and gives answers, in a way that is strongly reminiscent of the human brain; others take a different view. In any case, the human brain also works statistically. On the question of what intelligence is, see the linked article.
Examples of types of AI models are:
- AI language model, often referred to as an LLM (Large Language Model). However, highly capable SLMs (SLM = Small Language Model) are now available as well.
- AI image generator: An image is generated from a text input. Often, an image can also be created from a text and an input image. Or several images can be stylistically linked together.
- Text-to-Speech: from an input text, the AI model generates spoken output
- Speech-to-Text: from a speech input, the AI model generates text (a transcription)
- Object recognition in image or video (segmentation)
- Medical prognosis models
For simplicity, only language models and image models are referred to in what follows. These are very common representatives of the AI field.
There are essentially two training processes for AI models:
- Pre-Training
- Fine-Tuning
Other training processes mostly do not exist in practice. Fine-tuning an already fine-tuned model is conceivable, but technically this amounts to the same procedure as the first fine-tuning. The term post-training is used occasionally, but it is based on fine-tuning.
What does pre-training mean, and what is the difference from fine-tuning? The following sections consider several constellations:
- Pre-training ("Creation") of a massive large language model, such as ChatGPT-4
- Pre-training of a small large language model (yes, read correctly), such as GPT-2
- Fine-tuning of the model from 1.
- Fine-tuning of the model from 2.
Cases 1 and 3 are usually handled by AI companies. Case 2 is less common; when it does occur, it is for somewhat larger models than GPT-2, such as Llama3-8B. But even the 8B model is usually created and provided by AI companies.
Case 4 is practically achievable for every company. This article focuses on companies that want to introduce AI themselves, or on organizations that advise such companies.
Pre-Training
Pre-training means training a machine learning model from scratch: before pre-training, the model does not exist; after pre-training, it does.
People often speak simply of "training". Strictly speaking, there is no separate process called training in this context: whoever says "training" means either pre-training or fine-tuning, depending on the intended context.
When someone talks about training a Custom GPT, they mean fine-tuning. When someone talks about training a powerful language model in general, they mean pre-training (e.g. "Training ChatGPT-4 took many millions of hours of compute time, I read").
Pre-training corresponds to a child's upbringing and education from birth by its parents, up to and including school education.
In case of doubt, one should assume that "training" means "pre-training", since the term is linguistically closer to "training" than "fine-tuning" is.
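The difference between the two processes can be illustrated with a deliberately tiny toy model (plain gradient descent on one weight; all numbers are made up for illustration and have nothing to do with real LLM training): pre-training fits the model from scratch on broad data, fine-tuning continues from the already trained weight on a small, specialized dataset.

```python
# Toy illustration of pre-training vs. fine-tuning (NOT a real LLM):
# "pre-training" fits a model from scratch on broad data; "fine-tuning"
# continues training the already fitted weight on task-specific data.

def train(w, data, lr=0.05, steps=500):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Pre-training: the model starts from nothing (w = 0.0) and learns the
# general rule y = 2x from a broad (here: slightly larger) dataset.
general_data = [(x, 2.0 * x) for x in range(1, 6)]
base_w = train(0.0, general_data)   # the "base model" / "foundation model"

# Fine-tuning: we do NOT start from zero again, but continue from the
# pre-trained weight on a small, specialised dataset (here: y = 2.5x).
special_data = [(x, 2.5 * x) for x in range(1, 4)]
tuned_w = train(base_w, special_data)

print(round(base_w, 2))   # close to 2.0 (the general rule)
print(round(tuned_w, 2))  # shifted toward 2.5 (the specialised rule)
```

The key point the sketch shows: fine-tuning reuses the pre-trained weights as its starting point, which is why it needs far less data and compute than pre-training.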
For language models, billions of text documents are needed so that the language model can reach very good quality. A document is usually an excerpt from a web page on the internet.
Known data sources are:
- Common Crawl (CC) or C4 (Colossal Cleaned Common Crawl): approximately 700 GB of data, crawled from the internet with many websites filtered out
- The Pile: 825 GB of data, purportedly open source
- Wikipedia (in several languages)
- RefinedWeb: a deduplicated and refined version of Common Crawl
- StarCoder Data: approx. 780 GB of data for generating program code. Sources are in particular GitHub and Jupyter notebooks (notebooks are programming worksheets, loosely comparable to Excel sheets, used for easily creating shareable program code).
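Datasets such as C4 and RefinedWeb are not raw crawls: the text is cleaned, filtered and deduplicated before pre-training. A much simplified sketch of that idea (real pipelines additionally do language detection, quality scoring, fuzzy deduplication and much more; the filter thresholds here are made-up examples):

```python
# Highly simplified sketch of web-text cleaning and exact deduplication,
# the kind of step pipelines like C4 or RefinedWeb perform at huge scale.
import hashlib

def clean_corpus(docs, min_words=3):
    """Normalise whitespace, drop very short fragments and exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = " ".join(doc.split())            # normalise whitespace
        if len(text.split()) < min_words:       # drop very short fragments
            continue
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:                      # drop exact duplicates
            continue
        seen.add(digest)
        kept.append(text)
    return kept

raw = [
    "The  quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # duplicate after cleanup
    "Click here!",                                   # too short, dropped
    "Pre-training needs large amounts of text.",
]
print(clean_corpus(raw))  # only 2 of the 4 documents survive
```

Deduplication matters because repeated documents would otherwise be "seen" many times during pre-training and distort what the model learns.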
The training time for a language model varies greatly (from just a few hours to many months). For very large AI models, millions of GPU hours were consumed during pre-training. GPU stands for Graphics Processing Unit, i.e. a graphics card. A high-end AI server has 8 graphics cards installed, at a price of around €25,000.
Very small language models such as GPT-2 were not considered small just a few years ago; on the contrary, GPT-2 was the gold standard. A GPT-2-class language model can be pre-trained on your own AI server, or even a laptop, in a few hours, days or weeks. Exactly how long the pre-training takes depends on the amount of training data.
To make a highly capable AI language model, several terabytes (thousands of gigabytes) of raw texts as training data are required.
For a good first start, even a hundred gigabytes are sufficient, and that amount is processed quickly. In that case, the training of the AI model (pre-training) takes only a manageable number of hours.
Exactly how long also depends on the number of iterations. An iteration corresponds roughly to a school year: the more school years someone completes, the higher the chance that their intelligence increases. As with humans, however, at some point another year of school brings no further benefit. And just as with humans, learning success can be ruined by too much pre-training and deteriorate again (in machine learning this is known as overfitting).
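How the corpus size and the number of iterations (passes over the data, also called epochs) translate into training effort can be estimated with simple arithmetic. All numbers below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope estimate: how many optimisation steps does a
# pre-training run take for a given corpus and number of epochs?
# (All values are assumed example numbers, not real training configs.)

corpus_tokens = 10_000_000_000   # assumed corpus: 10 billion tokens
batch_size    = 512              # sequences processed per optimisation step
seq_length    = 1024             # tokens per sequence
epochs        = 3                # passes over the corpus ("school years")

tokens_per_step = batch_size * seq_length          # tokens per optimisation step
steps_per_epoch = corpus_tokens // tokens_per_step # steps for one full pass
total_steps     = steps_per_epoch * epochs

print(tokens_per_step)   # 524288
print(steps_per_epoch)   # 19073
print(total_steps)       # 57219
```

Doubling the number of epochs doubles the compute bill, which is why, beyond a certain point, more passes over the same data stop paying off.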
An AI model that was created through pre-training is also called a foundation model (FM) or base model. A base model can be used for general tasks. The larger the base model, the better it can solve specialized tasks. The size of a model is expressed as the number of its neuron connections (parameters). ChatGPT can therefore calculate quite well simply due to its sheer size, at least better than most people on this earth, taking into account the errors that both ChatGPT and humans make.
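For transformer language models, the number of parameters can be estimated from the architecture. A common rule of thumb is roughly 12 × layers × d_model² for the transformer blocks, plus the embedding tables; the formula ignores biases and layer norms, so it is an approximation. Using the published GPT-2 small configuration as an example:

```python
# Rough parameter count for a GPT-2-small-sized transformer, using the
# common approximation ~12 * n_layer * d_model^2 for the transformer
# blocks plus token and position embeddings. The formula is approximate
# (biases and layer norms are ignored), so the result lands close to,
# but not exactly at, the official ~124 M parameters of GPT-2 small.

n_layer, d_model = 12, 768       # GPT-2 small: 12 layers, hidden size 768
vocab_size, n_ctx = 50257, 1024  # vocabulary size and context length

block_params     = 12 * n_layer * d_model ** 2              # attention + MLP weights
embedding_params = vocab_size * d_model + n_ctx * d_model   # token + position embeddings
total = block_params + embedding_params

print(f"{total / 1e6:.1f} M parameters")  # roughly 124 M
```

The quadratic d_model term explains why model size explodes so quickly: widening the hidden dimension by a factor of 10 multiplies the block parameters by roughly 100.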
Fine-Tuning
Fine-tuning can also be referred to as fine training.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
