What is the difference between pre-training and fine-tuning in AI?

Pre-training is the training of an AI model on massive datasets, similar to the upbringing of a child. Fine-tuning, on the other hand, is the adaptation of a pre-trained model to specific tasks or datasets to improve its performance.

What types of data are needed for the pre-training of AI models?

For pre-training, massive datasets are required, typically from sources like Common Crawl, The Pile, or Wikipedia, to impart a broad knowledge base to the model. These datasets can encompass several hundred gigabytes or even terabytes.

What is pre-training in AI models?

Pre-training is the process of training an AI language model with massive amounts of text data to develop a general understanding of language and knowledge. This process often requires terabytes of data and, depending on the size of the model, can take anywhere from a few hours to many months.

What is Fine-Tuning of LLMs and when is it used?

Fine-tuning is the process of adapting a pre-trained model to a specific task or domain. It requires significantly less data than pre-training and is used to improve the model's performance on a particular downstream task, such as text summarization.

What is the difference between Pre-training and Fine-tuning in language models?

Pre-training aims to create a comprehensive, general language model, while fine-tuning optimizes an existing model for a specific task. Pre-training is more resource-intensive and utilizes very large datasets, while fine-tuning is more efficient and applied to smaller, task-specific datasets.

Can fine-tuning of AI models resolve the legal issues from the base model?

No. A base model that was trained unlawfully remains unlawful even after fine-tuning. New data cannot improve the legal situation, because the underlying problem in the base model persists.

Training of AI models: What does that exactly mean?

AI language models and AI image generators are the most widely used types of AI model. People often speak of training, pre-training, or fine-tuning. What do these terms mean, and how do they differ? Which data, and especially how much of it, does each process typically require?

Introduction

An AI model is an electronic brain consisting of a neural network. It can be asked a question and gives an answer, in a way that is strongly reminiscent of the human brain. Others hold a different opinion. In any case, the human brain is also based on statistics. On the question of what intelligence is, see the linked article.

Examples of types of AI models are:

AI language model, often referred to as an LLM (Large Language Model). However, highly capable SLMs (Small Language Models) are now also available.
AI image generator: An image is generated from a text input. Often, an image can also be created from a text and an input image. Or several images can be stylistically linked together.
Text-to-Speech: From an input text, the AI model generates spoken output.
Speech-to-Text: From spoken input, the AI model generates text (transcription).
Object recognition in image or video (segmentation)
Medical prognosis models

For simplicity, only language models and image models are considered in what follows. They are very common representatives of the AI field.

There are essentially two training processes for AI models:

Pre-Training
Fine-Tuning

Other training processes hardly exist in practice. Fine-tuning an already fine-tuned model is conceivable, but technically it amounts to the same thing as a first fine-tuning. The term post-training is also used occasionally, but it is based on fine-tuning.

What does pre-training mean, and how does it differ from fine-tuning? The following discussion considers four scenarios:

1. Pre-training ("creation") of a very large language model, such as ChatGPT-4
2. Pre-training of a small large language model (yes, you read that correctly), such as GPT-2
3. Fine-tuning of the model from 1.
4. Fine-tuning of the model from 2.

Cases 1 and 3 are usually handled by AI companies. Case 2 is less common; when it does occur, it is for somewhat larger models than GPT-2, such as Llama3-8B. But even the 8B model is usually created and provided by AI companies.

Case 4 is feasible for practically every company. This article focuses on companies that want to introduce AI themselves and on organizations that advise such companies.

Pre-Training

Pre-training means training a machine learning model from scratch. Before pre-training, the model does not exist; after pre-training, it does.

People often speak simply of "training". Strictly speaking, that term does not exist in this context. When someone says "training", they mean either pre-training or fine-tuning, depending on the context.

When someone talks about training a Custom-GPT, they mean fine-tuning. When someone generally talks about training a powerful language model, they mean pre-training (e.g. "Training ChatGPT-4 took many millions of hours of compute time, I read").

Pre-training is the training of a machine learning model.

It corresponds to a child's upbringing and education by its parents from birth, up to and including school education.

In case of doubt, one should assume that "training" means "pre-training", because the two terms are linguistically closer than "training" and "fine-tuning".

For language models, billions of text documents are needed for the model to reach very good quality. A document is usually an excerpt from a web page.

Known data sources are:

Common Crawl (CC) or C4 (Colossal Clean Crawled Corpus): approximately 700 GB of data, with many websites excluded
The Pile: 825 GB of data, allegedly open source
Wikipedia (in several languages)
RefinedWeb: deduplicated and filtered version of Common Crawl
StarCoder Data: approx. 780 GB of data for generating program code. The main sources are GitHub and Jupyter Notebooks (interactive programming documents, similar to Excel sheets, but for easily creating shareable program code).

The training time for a language model varies greatly with its size (from just a few hours to many months). For very large AI models, millions of GPU hours were consumed during pre-training. GPU stands for graphics processing unit, that is, the processor on a graphics card. A high-end AI server holds 8 graphics cards at a price of around €25,000.
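
To get a feeling for these numbers, the following minimal Python sketch converts a GPU-hour budget into wall-clock time on a single server. The function name `wall_clock_days` and the assumption that all GPUs run in parallel at full utilization are illustrative only; real training runs lose time to communication, restarts, and idle phases.

```python
def wall_clock_days(gpu_hours: float, num_gpus: int = 8) -> float:
    """Convert a GPU-hour budget into wall-clock days on one server.

    Assumption (optimistic): all GPUs work in parallel at full utilization.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    hours = gpu_hours / num_gpus  # elapsed hours when the work is spread over all GPUs
    return hours / 24


if __name__ == "__main__":
    # One million GPU hours on a single server with 8 GPUs
    days = wall_clock_days(1_000_000, num_gpus=8)
    print(f"{days:.0f} days, or about {days / 365:.1f} years")
```

Under these assumptions, one million GPU hours would keep a single 8-GPU server busy for roughly 14 years, which is why very large models are pre-trained on clusters with thousands of GPUs.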

Very small language models such as GPT-2 were not considered small just a few years ago; they were the gold standard. A GPT-2 language model can be pre-trained on your own AI server or laptop in a few hours, days, or weeks. Exactly how long pre-training takes depends on the amount of training data.

To build a highly capable AI language model, several terabytes (thousands of gigabytes) of raw text are required as training data.

For a good first start, even one hundred gigabytes are sufficient, and that amount can be read through quickly. In this case, pre-training the AI model takes only a manageable number of hours.

Exactly how long it takes also depends on the number of iterations. An iteration corresponds roughly to a school grade. The more grades someone completes in school, the higher the chance that their intelligence increases. Just as with humans, however, at some point another year of school brings no further benefit. And just as with humans, pre-training for too long can ruin the learning success, so that the model gets worse again.
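
The point at which further iterations stop helping can be found by checking the model against held-out data after each iteration. The following is a minimal sketch of this idea, commonly called early stopping. It is not taken from any particular training framework; the function name, the `patience` parameter, and the loss values are hypothetical.

```python
def best_stopping_point(val_losses, patience=2):
    """Return the index of the best iteration seen before early stopping triggers.

    Training is considered finished once the validation loss has not improved
    for `patience` consecutive iterations.
    """
    best_idx = 0
    best_loss = float("inf")
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = idx, loss
        elif idx - best_idx >= patience:
            break  # no improvement for `patience` iterations: stop training
    return best_idx


# Hypothetical validation losses: improving at first, then getting worse
losses = [2.1, 1.6, 1.3, 1.2, 1.25, 1.4, 1.1]
print(best_stopping_point(losses))
```

With these example values, training stops after the loss has risen twice in a row, and the model from iteration index 3 is kept. The later value of 1.1 is never reached, which is the price of stopping early.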

An AI model that has been created through pre-training is also called a foundation model (FM) or base model. A base model can be used for general tasks. The larger the base model, the better it can solve specialized tasks. The size of a model is expressed as the number of its neuron connections (parameters). ChatGPT can therefore calculate very well thanks to its sheer size (at least better than most people on this earth, taking into account the errors that both ChatGPT and humans make).

Fine-Tuning

Fine-tuning can also be referred to as fine training.

A prerequisite for fine-tuning is an existing AI language model. The AI model exists once it has been pre-trained. Only a pre-trained AI model can be fine-tuned.

Fine-tuning is comparable to a university education that follows high school.

Without a school education, university studies are not possible, or at least not meaningful.

Fine-tuning makes sense when a model is to be trained for a specific task. Fine-tuning thus continues the training of the language model.

Perhaps a language model cannot summarize texts well out of the box. This may be the case only in certain situations, for example in a doctor's practice whose medical records use entirely different vocabulary from what is anchored in the AI model's training data.

Fine-tuning therefore improves the abilities of a pre-trained AI model with regard to a specific task. This task is also referred to as the downstream task.
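
The relationship between the two processes can be illustrated with a deliberately tiny toy example. The sketch below uses a one-dimensional linear model instead of a language model, and all names and numbers are hypothetical. What it shows is only the principle: the same training routine is run twice, first from empty weights on a lot of general data (pre-training), then from the pre-trained weights on a few task-specific examples (fine-tuning).

```python
def train(weights, data, learning_rate, steps):
    """Run plain gradient descent on a 1-D linear model y = w * x + b."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(weights, data):
    """Average squared difference between prediction and target."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# "General" data: many examples of the broad relationship y = 2x + 1
general_data = [(i / 20, 2 * (i / 20) + 1) for i in range(200)]
# "Task" data: only ten examples of a slightly different relationship
task_data = [(i / 10, 2 * (i / 10) + 1.5) for i in range(10)]

# Pre-training: start from nothing (zero weights) and use the large dataset
base_model = train((0.0, 0.0), general_data, learning_rate=0.01, steps=2000)

# Fine-tuning: start from the pre-trained weights and use the small dataset
tuned_model = train(base_model, task_data, learning_rate=0.01, steps=500)

print("task error before fine-tuning:", mean_squared_error(base_model, task_data))
print("task error after fine-tuning: ", mean_squared_error(tuned_model, task_data))
```

The only differences between the two calls are the starting weights, the dataset, and the number of steps. The fine-tuned model makes a smaller error on the task data than the base model, although it has seen only ten additional examples.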

Depending on the task, the suitability of the machine learning model, and the mathematical training method used, different amounts of data are needed to achieve good results.

For classifying texts, a hundred examples may suffice for successful fine-tuning. To have an AI image generator learn the style of an artist, ten examples are sufficient. After fine-tuning, the AI model then generates images that could have been painted by the creator of the ten example images.
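
To give an impression of how small such a dataset can be, here is a sketch of a text classification dataset stored as JSON Lines (one JSON object per line), a format that many fine-tuning tools accept in some variant. The field names `text` and `label`, the example sentences, and the helper functions are assumptions made for illustration; the format actually required depends on the tool used.

```python
import json

# Hypothetical labeled examples; the field names "text" and "label" are assumptions
examples = [
    {"text": "Please send me the invoice for March.", "label": "billing"},
    {"text": "I cannot log in to my account.", "label": "support"},
    {"text": "I would like to cancel my subscription.", "label": "cancellation"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines back into a list of dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl = to_jsonl(examples)
print(jsonl)
```

A complete dataset for the classification case described above would contain about a hundred such lines, which can still be reviewed by hand for personal data before training.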

Overall, fine-tuning requires significantly less training data than pre-training, and using more is often not sensible either. One can assume that a fine-tuning dataset often does not exceed 10,000 examples. Very often, significantly fewer than 10,000 examples are both sensible and sufficient. It all depends on the case. For completeness, a special case should be mentioned: a base model is fine-tuned with the goal of producing a genuinely improved version of the base model. This happened, for example, with Llama3, where the fine-tuned offshoot was given 64,000 training examples. This process is usually carried out by others. One can then use these improved models as if they had existed from the beginning, that is, as if they had come straight out of pre-training.

In companies, fine-tuning is practiced on small language models. Small does not mean that the model is not a large language model (LLM); it describes the relation between huge models (ChatGPT) and very good LLMs (like Llama3-8B). ChatGPT has well over 1000 billion neuron connections, whereas an 8B model has "only" 8 billion.
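
What this difference in size means for hardware can be estimated with simple arithmetic. The sketch below computes only the memory needed to hold the neuron connections (parameters) themselves. The function name and the default of 2 bytes per parameter (16-bit floating point) are assumptions; training additionally needs memory for gradients, optimizer state, and activations, which is not included here.

```python
def model_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Estimate the memory needed just to hold the model weights, in gigabytes.

    bytes_per_parameter: 4 for 32-bit floats, 2 for 16-bit floats, 1 for 8-bit.
    Memory for gradients, optimizer state, and activations is not included.
    """
    return num_parameters * bytes_per_parameter / 1e9


print(model_memory_gb(8e9))     # 8B model in 16-bit precision
print(model_memory_gb(1000e9))  # the 1000-billion figure mentioned in the text
```

Under these assumptions, the weights of an 8B model occupy about 16 GB, whereas a model with 1000 billion parameters would need about 2000 GB for its weights alone.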

Pre-Training versus Fine-Tuning

The following overview concisely presents the differences between pre-training and fine-tuning. It also covers data protection and synthetic data. Synthetic data are artificially generated data used to expand the amount of training data. These data are themselves generated by AI models!

Attribute	Pre-Training	Fine-Tuning
Purpose	Creation of a general AI model	Improvement of an existing AI model for a specific task
Analogy	Upbringing by parents plus school education	University studies or further education after high school
Amount of training data	As much as possible, often billions of examples	Often 10 examples are enough, often 100. Very rarely 10,000 or more.
Compute time	For modern models, many millions of GPU hours	From very few hours to weeks
Data protection	Cannot be complied with in practice	Can generally be complied with (but only for the fine-tuning data)
Anonymization possible?	Not in practice	Yes, in principle very well
Copyright compliant?	Not in practice	Yes, in principle very well
Is synthetic data useful?	Only in an emergency or for improvements within a model series	Yes, to multiply the training data and increase its variance

Differences between pre-training and fine-tuning of AI models.

Conclusion

From a data perspective, fine-tuning is orders of magnitude easier to control than pre-training. However, this applies only to the data that flows into fine-tuning. The original pre-training data are already stored in the AI model and can be retrieved from it.

Pre-training is a technical challenge. From a software perspective, it is almost the same as fine-tuning. However, it requires enormous computing capacities and an extremely large amount of training data.

In terms of resources, fine-tuning is completely different. It gets by with affordable consumer-grade hardware and very often requires little or very little training data.

Fine-tuning therefore inherits the "brain" with its stored original training data and adds only a small amount of new data. This new data can be controlled well from a GDPR perspective. Nevertheless, an unlawful base model that has been fine-tuned remains an unlawful fine-tuned model. The unlawful data from the base model thus taints all subsequent versions of the model. Something unlawful cannot become lawful by adding something lawful.

Synthetic data does not really improve quality or privacy in a base model:

Synthetic data can also contain a reference to a person or to a work protected by copyright. No wonder, since it is modeled on real data.
When synthetic data is obtained by modifying real data, false statements about people may result. This would worsen the legal situation of the AI language model.

In general, one can say that AI models are competitive only when they have been trained on as much good training data as possible. Thus, all available competitive AI language models, closed and open source alike, are formally unlawful. Incidentally, Mistral was also trained on data from the "open web", as Mistral itself says.

Following the logic of jurisprudence, the continued, accepted use of something formally illegal will likely lead to it being considered permitted in the AI field, or at least to its "illegal use" being tolerated.

Another problem is the use of cloud services like ChatGPT or Azure. In this case, third-party data or business secrets are often sent to American companies and their national intelligence agencies.

If the data security argument is not sufficient, it is advisable to name one's use cases specifically and to use an AI optimized for them. This type of AI is referred to here as Offline-AI. It runs completely autonomously, either on a rented server or on a company-owned server, and often delivers better results than general-purpose systems like ChatGPT.

Key messages

There are two main ways to train AI models: pre-training and fine-tuning. Pre-training creates the basic model, while fine-tuning adapts it for a specific task.

Training a powerful language model involves feeding it massive amounts of text data to learn patterns and relationships within language, similar to how a child learns through education. The more data and training time, the better the model performs on various tasks.

Fine-tuning is like specialized training for AI models, making them better at specific tasks by further teaching them with smaller, focused datasets.

Because AI models are trained on massive amounts of data, which may include copyrighted material or personal information, most existing AI models are technically illegal.

Use specialized AI ("Offline-AI") that operates independently on your own servers for better data security and control.
