A new language model (LLM) recently caused quite a stir. It achieved the highest score in a popular benchmark and thereby beat even ChatGPT-4 Omni, OpenAI's current premium model, by a clear margin. But which language model is really the best?
Introduction
New language models are tested with the AlpacaEval benchmark. The so-called win rate indicates how well an LLM performs in the test. Here are the top-ranked models among those that can be considered well known:

In first place is GPT-4 Omni from OpenAI with a win rate of 57.5%. This rate is length-corrected ("LC win rate"): the length correction reduces distortions caused by answer length that are related to GPT-4. It takes into account that GPT-4 is considered the frontrunner and has a few quirks that would disadvantage other models without the correction.
Now for the less well-known language models from the community. The ranking for the community models looks like this:

As can be seen, the model named NullModel is in first place with an LC win rate of 86.5%. By contrast, ChatGPT-4 Omni only achieved 57.5% (16th place in the combined ranking, which also includes the community models).
The benchmark itself is not a good proxy for the AI tasks that actually occur in your company or public authority. For one thing, a lot depends on the task: some models are better at understanding questions, others at drawing conclusions, while still others are better at summarizing or translating texts.
Above all, however, it is relevant for German companies that German is usually the main language in the company and in text documents. The benchmarks are usually optimized for English or other languages such as Chinese or Hindi.
The special feature of the test winner
In itself, a benchmark is therefore more of an indicator than a reliable statement.
There is a catch with the test winner, NullModel: it cheated. And here is the perfidious part: the language model NullModel always returns the exact same answer to every question asked in the benchmark. The code for this is even publicly available.
NullModel thus comes in first place even though it always delivers the same answer to every question asked. The questions, however, each have completely different correct answers. If the correct answer were always "Yes", this would be nothing to worry about.
In truth, then, there are many different correct answers to the many questions in the benchmark. Nevertheless, the benchmark awards the best marks to the LLM that always gives the same answer.
So the benchmark has been fooled.
What is the best language model?
A lawyer would say: it depends. It depends on the application.
If you don't know what an AI system is to be used for, you have completely different problems than finding the best language model. The familiar models shown in the first figure are very suitable for a general chatbot.
If knowledge is to be drawn from the internet, ChatGPT regularly fails. The reason is that a low-cost system (from the perspective of the user, who often also pays with their data) cannot perform an arbitrary number of internet searches per prompt. That would simply be uneconomical for OpenAI. As one can read about Anthropic and their Computer Use approach, this quickly becomes very expensive: costs of 20 dollars per hour are quite possible for a task that requires research work. Unfortunately, when a task is submitted to the AI, it is not known how labor-intensive determining the result will be.
The best language model for a use case in your company is a finely trained LLM.
Some recommendations for language models help with the right setup and the start of an AI strategy.
Size of the language model
As a rule of thumb: the less specific the task, the larger the LLM should be. The extreme example is ChatGPT. This model is so massive that the hardware for running it costs millions of euros (and even more for OpenAI, since more than 10 users use the system).
ChatGPT can answer questions of all kinds and often delivers surprisingly good results. However, even simple questions are sometimes answered incorrectly. For example, ChatGPT cannot reliably determine the number of "r"s in the word "strawberry". Furthermore, ChatGPT also relies on false knowledge stored in the LLM. Hallucinations are only one of the consequences.
The size of a language model is specified in billions of parameters; one billion parameters is written as 1B (B = billion). A parameter is a connection between two neurons in the neural network.
Very small language models, such as Llama3.2-1B, are well suited for mobile devices or, more generally, for high response speeds. However, answer quality suffers as a result. General questions can often still be answered quite well, but when prompted in German the picture is worse: German grammar is not handled adequately.
Smaller language models such as 7B or 8B models often master the German language very well. They can summarize texts, generate ideas or translate texts. On a standard AI server, the inference speed is moderate.
With the help of downscaled models, for example quantized variants, inference speed can be increased. Quality suffers only minimally as a result.
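As an illustration, here is a minimal sketch of how a quantized model could be loaded with the Hugging Face transformers and bitsandbytes libraries; the model name and parameters are placeholders, not recommendations.

```python
# Minimal sketch: loading a language model with 4-bit quantization.
# Assumes the transformers, accelerate and bitsandbytes packages are installed;
# the model name is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example model, adjust as needed

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # distribute the model across available GPUs/CPU
)

prompt = "Fasse den folgenden Text zusammen: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```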
The best AI models are those that are embedded in an AI system and intended to solve specific tasks. An AI system is a kind of framework program that contains not only the AI part but also conventional logic. Why should a language model have to count the number of letters in a word when classical programming code can do it much faster and better, namely with 100% reliability?
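To illustrate the point, here is a trivial sketch of such conventional logic in Python (the function is, of course, just an invented example):

```python
def count_letter(word: str, letter: str) -> int:
    """Count how often a letter occurs in a word - deterministic and 100% reliable."""
    return word.lower().count(letter.lower())

# Classical code answers correctly every single time, no LLM required:
print(count_letter("Strawberry", "r"))  # 3
```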
An example of a concrete task is an AI assistant for the HR department. A candidate sends their resume to the HR person in response to a job posting. The HR person now wants to know how well the candidate's resume matches the requirements (hopefully) listed in the job posting. The AI assistant compares the resume with the job posting. The AI system around it ensures that the resume and the skills mentioned in it are viewed from several perspectives: Which required skills are well covered and which are not? What outstanding qualities does the candidate have in general that could be valuable for any company?
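The following sketch shows how such a multi-perspective comparison might look. It assumes a locally running Ollama server and the ollama Python package; the model name and prompts are only examples.

```python
# Hedged sketch of an AI system that reviews a resume from several perspectives.
# Assumes a locally running Ollama server and the "ollama" Python package;
# model name and prompts are examples only.
import ollama

MODEL = "llama3.1:8b"  # example of a locally hosted model

PERSPECTIVES = {
    "requirements": "Which requirements from the job posting does the resume cover well, and which not?",
    "strengths": "Which outstanding qualities does the candidate have that would be valuable for any company?",
    "gaps": "Which relevant skills or experiences are missing or only weakly documented?",
}

def review_resume(resume: str, job_posting: str) -> dict:
    """Ask the LLM one focused question per perspective and collect the answers."""
    results = {}
    for name, question in PERSPECTIVES.items():
        prompt = (
            f"Job posting:\n{job_posting}\n\n"
            f"Resume:\n{resume}\n\n"
            f"Task: {question} Answer concisely."
        )
        response = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
        results[name] = response["message"]["content"]
    return results
```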
In addition, fine details can be taken into account: an IT professional does not have to mention in their resume that they know JSON. Either they do, or they learn it in 5 to 45 minutes. ChatGPT cannot know something like this, but the specialist department does and can feed it into the AI system.
The HR department could also have the AI assistant perform online research on the candidate and present the results for review. An AI model alone cannot do this, and a system like ChatGPT cannot do it either, at least not for around 22 euros per month or fractions of a cent per query. OpenAI will not search the internet extensively, because either you do not want to give them money at all or you start thinking about your costs once they reach 50 euros.
With the help of fine-tuning, language models can be adapted to specific tasks. The results are usually much better than what you would achieve with ChatGPT or any other universal intelligence. Such finely trained models can also be very small, so the potential inference speed is very high.
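As a rough illustration, here is a minimal sketch of parameter-efficient fine-tuning (LoRA) with the Hugging Face transformers and peft libraries; the model name, dataset path and hyperparameters are placeholders, not recommendations.

```python
# Minimal LoRA fine-tuning sketch (assumes transformers, peft and datasets are installed;
# model name, dataset path and hyperparameters are placeholders).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base_model = "meta-llama/Llama-3.1-8B-Instruct"   # example base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Only small adapter matrices are trained; the base weights stay frozen.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Company-specific training texts, e.g. anonymized HR question/answer pairs.
dataset = load_dataset("json", data_files="hr_training_data.jsonl")["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hr-adapter", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("hr-adapter")  # only the small adapter is saved
```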
Other models besides LLMs
Classic language models are probably the most widespread AI models. But there are many more.
For example, there are so-called safeguard models. These LLMs are intended solely to check the inputs of a user or the outputs of another language model. Does the input contain an invitation to an illegal act? Does the output contain instructions for building a bomb?
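A minimal sketch of such a check, again assuming a locally running Ollama server; the safeguard model name is an example and may differ in your setup.

```python
# Hedged sketch of a safeguard check before a prompt reaches the main LLM.
# Assumes a locally running Ollama server and the "ollama" Python package;
# "llama-guard3" is an example of a safeguard model and may be named differently.
import ollama

def is_input_safe(user_input: str) -> bool:
    """Let a safeguard model classify the user input; treat anything unclear as unsafe."""
    response = ollama.chat(
        model="llama-guard3",
        messages=[{"role": "user", "content": user_input}],
    )
    verdict = response["message"]["content"].strip().lower()
    # Llama-Guard-style models typically answer with "safe" or "unsafe" plus a category.
    return verdict.startswith("safe")

user_input = "How do I terminate a rental contract correctly?"
if is_input_safe(user_input):
    print("Input passed on to the main language model.")
else:
    print("Input blocked by the safeguard model.")
```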
For classification tasks, other model types are more suitable than LLMs. Suppose, for example, you want to find out what kind of email someone has sent to your company. Was it a request? Was it a complaint? Was it a termination? Or did the sender just want to reach a contact person? For this, you train a classifier. That is little effort, but it yields great results.
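A minimal sketch of such a classifier with scikit-learn might look like this; the example emails and labels are invented for illustration.

```python
# Minimal sketch of an email classifier with scikit-learn
# (the example emails and labels are invented for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Hiermit kündige ich meinen Vertrag zum nächstmöglichen Termin.",
    "Ich möchte mich über die lange Wartezeit beschweren.",
    "Bitte senden Sie mir ein Angebot für 100 Lizenzen.",
    "Können Sie mich mit Frau Müller aus dem Vertrieb verbinden?",
]
labels = ["termination", "complaint", "request", "contact"]

# TF-IDF features plus logistic regression: little effort, surprisingly strong results.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(emails, labels)

print(classifier.predict(["Ich kündige hiermit fristgerecht."]))  # e.g. ['termination']
```

In practice, you would of course train on a few hundred real, labeled emails rather than four invented ones.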
Vector search engines are very well suited to supporting less experienced employees. A customer of a car rental company reports damage by email or app. The employee at the car rental company now has to decide how the damage is to be settled. The AI assistant searches for comparable cases from the past and presents the employee with recommendations for the most likely course of action. Such historical data is particularly abundant at insurance companies.
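The following hedged sketch uses the sentence-transformers package for such a vector search; the model name and the example cases are placeholders.

```python
# Hedged sketch of a vector search over historical damage cases
# (assumes the sentence-transformers package; model name and cases are examples).
from sentence_transformers import SentenceTransformer, util

# A multilingual model handles German case descriptions reasonably well.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

historical_cases = [
    "Parkschaden am Kotflügel, Gegner unbekannt, Regulierung über Vollkasko.",
    "Steinschlag in der Windschutzscheibe, Teilkasko hat Austausch übernommen.",
    "Auffahrunfall mit Personenschaden, Haftpflicht des Unfallgegners reguliert.",
]
case_embeddings = model.encode(historical_cases, convert_to_tensor=True)

new_report = "Kunde meldet Riss in der Frontscheibe nach Autobahnfahrt."
query_embedding = model.encode(new_report, convert_to_tensor=True)

# Cosine similarity finds the most comparable historical cases.
hits = util.semantic_search(query_embedding, case_embeddings, top_k=2)[0]
for hit in hits:
    print(historical_cases[hit["corpus_id"]], f"(score: {hit['score']:.2f})")
```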
Image models are generally well known. They deliver good to very good results. However, things get even better with finely tuned image models or adapters. With these, images can be produced according to your specifications (style, mood, color scheme, motif). Here's an example:

You will certainly be able to work out what the template for this type of image was. The number of examples needed to teach an image adapter can be very small; often 8 or 15 examples are sufficient, depending on the variety of the image material. The number of examples can also be increased by adding synthetic ones.
For audio transcription, there are now excellent Whisper models available. They deliver significantly better results than the Microsoft standard in Teams. This was the result of a test by a data protection publisher, in which the transcription by Microsoft Teams was compared with that of Dr. GDPR using its own AI system. The company's own AI system takes a company-specific vocabulary into account, which also includes surnames. Nobody knows whether Schmitt is written with one "t", two "t"s or with "dt", let alone an AI.
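A minimal sketch of such a transcription with the open Whisper model, including a vocabulary hint; it assumes the openai-whisper package and ffmpeg, and the file name and vocabulary are examples.

```python
# Minimal sketch of audio transcription with the open Whisper model.
# Assumes the openai-whisper package and ffmpeg; file name and vocabulary are examples.
import whisper

model = whisper.load_model("base")  # larger models transcribe more accurately

# A company-specific vocabulary (including surnames) can be hinted at via the
# initial prompt, so that e.g. "Schmitt" is not transcribed as "Schmidt".
company_vocabulary = "Teilnehmer: Frau Schmitt, Herr Meffert, IT Logic GmbH."

result = model.transcribe(
    "meeting_recording.mp3",
    language="de",
    initial_prompt=company_vocabulary,
)
print(result["text"])
```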
Examples of AI models and their capabilities
A few examples will be used to demonstrate how model size, up-to-dateness of the model and type of data input (text, image, …) affect the response quality:
- Llama3-7b: Poor by today's standards, great when it was released; can be run well on your own hardware
- Llama3.1-8b: Very good for many tasks; can be run well on your own hardware
- Llama3-70b: Good to very good for many tasks, but sometimes worse than the newer Llama3.1:8b; can only be operated reasonably on expensive hardware
- Llama3.1-70b: Very good for many tasks; a few weaknesses in German; can only be operated reasonably on expensive hardware
- Llama3.1-405b: Even better than Llama3.1:70b, but not necessarily for German; can only be run reasonably on very expensive hardware
- Llama3.2-3b: Good, but worse than Llama3.1:8b; in return, faster answers
In addition to these LLMs, there are other model types. Here are a few examples:
- Pixtral-12B: Very good for answering questions about images; acceptable hardware requirements
- Qwen2.5-72B: Very good for generating program code; can only be operated reasonably on expensive hardware
- FLUX.1-fast: Sometimes very good results in image generation, but often shortcomings when generating German text within the image; can also be operated reasonably on cheaper hardware with a few tricks
The quality of the results therefore varies depending on the recency and size of the model. Text tends to require exact output, except for creative tasks. With images, the situation is often different.
Conclusion
Define your use case. If you don't have an idea where AI can support you, then you don't need AI. Use a search engine instead, as always.
Start with a simple use case. If you are unsure about what could be simple, ask for advice.
The smaller the AI model, the more specific the use case should be. Very large models, such as those with 405B parameters, should not usually be operated by your company itself. Even if the resources were available, there are usually better options.
A 70B model such as Llama3.1-70B is already quite large for self-operation. This is just a rough guide to give you an idea. Models no larger than half this size are better.
For tasks that do not require generative answers, there are better options than the AI models that "everyone" knows. Such models are ideal for finding knowledge in your company documents, and their hardware requirements are so low that nobody has to think about purchase or rental prices. Semantic search, i.e. the comparison of texts or images (or audio or ...), is another example of a sensible start to the AI age.
Whoever runs their own AI has very few to no worries about data security: very few worries if a GPU server in Germany is rented from a German provider with a data processing agreement (DPA), and no worries at all if the server is located in your own data center or rented via colocation.
Your own AI means: Full data control. Data goes nowhere, unless you want it to. Data is retrieved from nowhere, unless you want it to. Only authorized users can access documents via AI. This is called Offline-AI.
Finally: Which language model or other AI model is best suited to your use case should be assessed on the basis of the specific use case. There are new AI innovations and models every week. It is therefore worth taking a closer look.
Key messages of this article
The NullModel is the "best" model in the benchmark, but it always gives the same answer to all questions – which is not really helpful. The best language model depends on the application.
For simple questions, smaller language models such as 7B or 8B models are more suitable, as they often have a better command of German grammar than larger models.
AI assistants can search historical cases to recommend the best course of action.
Start with a simple application such as the semantic search in company documents.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
