A new language model (LLM) recently caused quite a stir. It achieved the highest score in a popular benchmark and was therefore even significantly better than ChatGPT-4 Omni, the current premium model from OpenAI. But which language model is really the best?
Introduction
With the Benchmark AlpacaEval, new language models are tested. The so-called Win-Rate indicates how well an LLM performs in the test. Here are the top places of the models that can be considered known:

On first place is GPT-4 Omni from OpenAI with a win rate of 57.5%. This rate is length-corrected ("LC Win Rate"). That means, the length-corrected win rates reduce the length distortions of GPT-4. With that account is taken that GPT-4 is considered a frontrunner and has a few quirks which would disadvantage other models without correction.
Now to the language models from the community that are less well known. The ranking list for the community models looks like this:

As can be seen, the model with the name NullModel is in first place. It has an LC Win Rate of 86.5 %. In contrast, ChatGPT-4 Omni only had 57.5% (16th place in the ranking, which also includes the community models).
The benchmark itself is not a good representative for AI tasks that occur in your company or authority. For one thing, it very much depends on the task. Some models can better understand questions, others can better draw conclusions, while still others can better summarize or translate texts.
Above all, however, it is relevant for German companies that German is usually the main language in the company and in text documents. The benchmarks are usually optimized for English or other languages such as Chinese or Hindi.
The special feature of the test winner
In itself, a benchmark is therefore more of an indicator than a reliable statement.
There is a peculiarity with the test winner, NullModel: It cheated. But the perfidy comes first: The language model NullModel always delivers the same answer to all questions that are asked in the benchmark. The code for this is even publicly accessible.
The NullModel comes in first place in the test result, although it always delivers the same answer to all questions asked. The questions have completely different correct answers each time. If the correct answers were always "Yes", one would not need to worry about this at all.
In truth, therefore, there are many different correct answers for the many questions in the benchmark. Nevertheless, the benchmark provides the Best grades for the LLM, which always gives the same answer.
So the benchmark has been fooled.
What is the best language model?
A lawyer would say: it depends. It depends on the application.
If you don't know what an AI system is to be used for, you have completely different problems than finding the best language model. The familiar models shown in the first figure are very suitable for a general chatbot.
If knowledge is to be drawn from the internet, ChatGPT regularly fails. The reason is that a low-cost system (from the user's perspective, who often also pays with their data) cannot perform an arbitrary number of searches on the internet per prompt. That would simply be uneconomical for OpenAI. As one can read about Anthropic and their Computer Use-Ansatz, it quickly becomes very expensive. There are indeed 20 dollars per hour for a task that requires research work. Unfortunately, when submitting the task to the AI, it is not known how labor-intensive it is to determine the result.
The best language model for a use case in your company is a finely trained LLM.
Some recommendations for language models help with the right setup and the start of an AI strategy.
Size of the language model
As a rule of thumb: The more unspecific the task assignment, the larger the LLM should be. The maximum example is ChatGPT. This model is so massive that the hardware for running it costs millions of euros (and even more for OpenAI, since more than 10 users use the system).
ChatGPT can answer questions of all kinds and often delivers surprisingly good results. However, even simple questions are sometimes not answered correctly. So ChatGPT cannot accurately determine the number of "r"s in the word Strawberry. Furthermore, ChatGPT also relies on false knowledge stored in the LLM. Not only do Hallucinations result from this.
The size of a language model is specified in billions of parameters. One billion is 1 B (B = billion). A parameter is a connection between two neurons in the neural network.
Very small language models, such as for example Llama3.2-1B, are well-suited for mobile devices or generally for high response speeds. However, the answer quality suffers under this. General questions can often be answered quite well. But when asked in German, it looks different again, namely worse. The German grammar is not adequately appreciated here.
Smaller language models like 7B- or 8B-models often master the German language very well. They can summarize texts, generate ideas or translate texts. On a standard AI-server, the execution speed is moderate.
With the help of downscaled models, inference speed can be increased. Quality suffers only minimally from it.
The best AI models are those that are embedded in a AI System and intended to solve specific tasks. A AI-System is a kind of framework program that contains not only the AI part but also conventional logic. Why should a language model have to count the number of letters in a word when classical programming code can do it much faster and better, namely with 100% reliability?
An example of a concrete task assignment is a AI assistant for the personnel department.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
