A new language model (LLM) recently caused quite a stir. It achieved the highest score in a popular benchmark and thereby beat even ChatGPT-4 Omni, OpenAI's current premium model, by a clear margin. But which language model is really the best?
Introduction
New language models are tested with the AlpacaEval benchmark. The so-called win rate indicates how well an LLM performs in the test. Here are the top-ranked models among those that can be considered well known:

In first place is GPT-4 Omni from OpenAI with a win rate of 57.5%. This rate is length-corrected ("LC win rate"): the length correction reduces distortions caused by answer length that are related to GPT-4. It takes into account that GPT-4 is considered the frontrunner and has a few quirks that would disadvantage other models without the correction.
Now for the less well-known language models from the community. The ranking for the community models looks like this:

As can be seen, the model named NullModel is in first place with an LC win rate of 86.5%. By contrast, ChatGPT-4 Omni only achieved 57.5% (16th place in the combined ranking, which also includes the community models).
The benchmark itself is not a good proxy for the AI tasks that actually occur in your company or public authority. For one thing, a lot depends on the task: some models are better at understanding questions, others at drawing conclusions, while still others are better at summarizing or translating texts.
Above all, however, it is relevant for German companies that German is usually the main language in the company and in text documents. The benchmarks are usually optimized for English or other languages such as Chinese or Hindi.
The special feature of the test winner
In itself, a benchmark is therefore more of an indicator than a reliable statement.
There is a catch with the test winner, NullModel: it cheated. And here is the perfidious part: the language model NullModel always returns the exact same answer to every question asked in the benchmark. The code for this is even publicly available.
NullModel thus comes in first place even though it always delivers the same answer to every question asked. The questions, however, each have completely different correct answers. If the correct answer were always "Yes", this would be nothing to worry about.
In truth, then, there are many different correct answers to the many questions in the benchmark. Nevertheless, the benchmark awards the best marks to the LLM that always gives the same answer.
So the benchmark has been fooled.
What is the best language model?
A lawyer would say: it depends. It depends on the application.
If you don't know what an AI system is to be used for, you have completely different problems than finding the best language model. The familiar models shown in the first figure are very suitable for a general chatbot.
If knowledge is to be drawn from the internet, ChatGPT regularly fails. The reason is that a low-cost system (from the perspective of the user, who often also pays with their data) cannot perform an arbitrary number of internet searches per prompt. That would simply be uneconomical for OpenAI. As one can read about Anthropic and their Computer Use approach, this quickly becomes very expensive: costs of 20 dollars per hour are quite possible for a task that requires research work. Unfortunately, when a task is submitted to the AI, it is not known how labor-intensive determining the result will be.
The best language model for a use case in your company is a finely trained LLM.
Some recommendations for language models help with the right setup and the start of an AI strategy.
Size of the language model
As a rule of thumb: the less specific the task, the larger the LLM should be. The extreme example is ChatGPT. This model is so massive that the hardware for running it costs millions of euros (and even more for OpenAI, since more than 10 users use the system).
ChatGPT can answer questions of all kinds and often delivers surprisingly good results. However, even simple questions are sometimes answered incorrectly. For example, ChatGPT cannot reliably determine the number of "r"s in the word "strawberry". Furthermore, ChatGPT also relies on false knowledge stored in the LLM. Hallucinations are only one of the consequences.
The size of a language model is specified in billions of parameters; one billion parameters is written as 1B (B = billion). A parameter is a connection between two neurons in the neural network.
Very small language models, such as Llama3.2-1B, are well suited for mobile devices or, more generally, for high response speeds. However, answer quality suffers as a result. General questions can often still be answered quite well, but when prompted in German the picture is worse: German grammar is not handled adequately.
Smaller language models such as 7B or 8B models often master the German language very well. They can summarize texts, generate ideas or translate texts. On a standard AI server, the inference speed is moderate.
With the help of downscaled models, for example quantized variants, inference speed can be increased. Quality suffers only minimally as a result.
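As an illustration, here is a minimal sketch of how a quantized model could be loaded with the Hugging Face transformers and bitsandbytes libraries; the model name and parameters are placeholders, not recommendations.

```python
# Minimal sketch: loading a language model with 4-bit quantization.
# Assumes the transformers, accelerate and bitsandbytes packages are installed;
# the model name is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example model, adjust as needed

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # distribute the model across available GPUs/CPU
)

prompt = "Fasse den folgenden Text zusammen: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```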
The best AI models are those that are embedded in an AI system and intended to solve specific tasks. An AI system is a kind of framework program that contains not only the AI part but also conventional logic. Why should a language model have to count the number of letters in a word when classical programming code can do it much faster and better, namely with 100% reliability?
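To illustrate the point, here is a trivial sketch of such conventional logic in Python (the function is, of course, just an invented example):

```python
def count_letter(word: str, letter: str) -> int:
    """Count how often a letter occurs in a word - deterministic and 100% reliable."""
    return word.lower().count(letter.lower())

# Classical code answers correctly every single time, no LLM required:
print(count_letter("Strawberry", "r"))  # 3
```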
An example of a concrete task is an AI assistant for the HR department. A candidate sends their resume to the HR person in response to a job posting. The HR person now wants to know how well the candidate's resume matches the requirements (hopefully) listed in the job posting. The AI assistant compares the resume with the job posting. The AI system around it ensures that the resume and the skills mentioned in it are viewed from several perspectives: Which required skills are well covered and which are not? What outstanding qualities does the candidate have in general that could be valuable for any company?
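The following sketch shows how such a multi-perspective comparison might look. It assumes a locally running Ollama server and the ollama Python package; the model name and prompts are only examples.

```python
# Hedged sketch of an AI system that reviews a resume from several perspectives.
# Assumes a locally running Ollama server and the "ollama" Python package;
# model name and prompts are examples only.
import ollama

MODEL = "llama3.1:8b"  # example of a locally hosted model

PERSPECTIVES = {
    "requirements": "Which requirements from the job posting does the resume cover well, and which not?",
    "strengths": "Which outstanding qualities does the candidate have that would be valuable for any company?",
    "gaps": "Which relevant skills or experiences are missing or only weakly documented?",
}

def review_resume(resume: str, job_posting: str) -> dict:
    """Ask the LLM one focused question per perspective and collect the answers."""
    results = {}
    for name, question in PERSPECTIVES.items():
        prompt = (
            f"Job posting:\n{job_posting}\n\n"
            f"Resume:\n{resume}\n\n"
            f"Task: {question} Answer concisely."
        )
        response = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
        results[name] = response["message"]["content"]
    return results
```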
In addition, fine details can be taken into account: an IT professional does not have to mention in their resume that they know JSON. Either they do, or they learn it in 5 to 45 minutes. ChatGPT cannot know something like this, but the specialist department does and can feed it into the AI system.
The HR department could also have the AI assistant perform online research on the candidate and present the results for review. An AI model alone cannot do this, and a system like ChatGPT cannot do it either, at least not for around 22 euros per month or fractions of a cent per query. OpenAI will not search the internet extensively, because either you do not want to give them money at all or you start thinking about your costs once they reach 50 euros.
With the help of fine-tuning, language models can be adapted to specific tasks. The results are usually much better than what you would achieve with ChatGPT or any other universal intelligence. Such finely trained models can also be very small, so the potential inference speed is very high.
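As a rough illustration, here is a minimal sketch of parameter-efficient fine-tuning (LoRA) with the Hugging Face transformers and peft libraries; the model name, dataset path and hyperparameters are placeholders, not recommendations.

```python
# Minimal LoRA fine-tuning sketch (assumes transformers, peft and datasets are installed;
# model name, dataset path and hyperparameters are placeholders).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base_model = "meta-llama/Llama-3.1-8B-Instruct"   # example base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Only small adapter matrices are trained; the base weights stay frozen.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Company-specific training texts, e.g. anonymized HR question/answer pairs.
dataset = load_dataset("json", data_files="hr_training_data.jsonl")["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hr-adapter", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("hr-adapter")  # only the small adapter is saved
```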
Other models besides LLMs
Classic language models are probably the most widespread AI models. But there are many more.
For example, there are so-called safeguard models. These LLMs are intended solely to check the inputs of a user or the outputs of another language model. Does the input contain an invitation to an illegal act? Does the output contain instructions for building a bomb?
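A minimal sketch of such a check, again assuming a locally running Ollama server; the safeguard model name is an example and may differ in your setup.

```python
# Hedged sketch of a safeguard check before a prompt reaches the main LLM.
# Assumes a locally running Ollama server and the "ollama" Python package;
# "llama-guard3" is an example of a safeguard model and may be named differently.
import ollama

def is_input_safe(user_input: str) -> bool:
    """Let a safeguard model classify the user input; treat anything unclear as unsafe."""
    response = ollama.chat(
        model="llama-guard3",
        messages=[{"role": "user", "content": user_input}],
    )
    verdict = response["message"]["content"].strip().lower()
    # Llama-Guard-style models typically answer with "safe" or "unsafe" plus a category.
    return verdict.startswith("safe")

user_input = "How do I terminate a rental contract correctly?"
if is_input_safe(user_input):
    print("Input passed on to the main language model.")
else:
    print("Input blocked by the safeguard model.")
```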
For classification tasks, other model types are more suitable than LLMs. Suppose, for example, you want to find out what kind of email someone has sent to your company. Was it a request? Was it a complaint? Was it a termination? Or did the sender just want to reach a contact person? For this, you train a classifier. That is little effort, but it yields great results.
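A minimal sketch of such a classifier with scikit-learn might look like this; the example emails and labels are invented for illustration.

```python
# Minimal sketch of an email classifier with scikit-learn
# (the example emails and labels are invented for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Hiermit kündige ich meinen Vertrag zum nächstmöglichen Termin.",
    "Ich möchte mich über die lange Wartezeit beschweren.",
    "Bitte senden Sie mir ein Angebot für 100 Lizenzen.",
    "Können Sie mich mit Frau Müller aus dem Vertrieb verbinden?",
]
labels = ["termination", "complaint", "request", "contact"]

# TF-IDF features plus logistic regression: little effort, surprisingly strong results.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(emails, labels)

print(classifier.predict(["Ich kündige hiermit fristgerecht."]))  # e.g. ['termination']
```

In practice, you would of course train on a few hundred real, labeled emails rather than four invented ones.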
Vector search engines are very well suited to supporting less experienced employees. A customer of a car rental company reports damage by email or app. The employee at the car rental company now has to decide how the damage is to be settled. The AI assistant searches for comparable cases from the past and presents the employee with recommendations for the most likely course of action. Such historical data is particularly abundant at insurance companies.
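The following hedged sketch uses the sentence-transformers package for such a vector search; the model name and the example cases are placeholders.

```python
# Hedged sketch of a vector search over historical damage cases
# (assumes the sentence-transformers package; model name and cases are examples).
from sentence_transformers import SentenceTransformer, util

# A multilingual model handles German case descriptions reasonably well.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

historical_cases = [
    "Parkschaden am Kotflügel, Gegner unbekannt, Regulierung über Vollkasko.",
    "Steinschlag in der Windschutzscheibe, Teilkasko hat Austausch übernommen.",
    "Auffahrunfall mit Personenschaden, Haftpflicht des Unfallgegners reguliert.",
]
case_embeddings = model.encode(historical_cases, convert_to_tensor=True)

new_report = "Kunde meldet Riss in der Frontscheibe nach Autobahnfahrt."
query_embedding = model.encode(new_report, convert_to_tensor=True)

# Cosine similarity finds the most comparable historical cases.
hits = util.semantic_search(query_embedding, case_embeddings, top_k=2)[0]
for hit in hits:
    print(historical_cases[hit["corpus_id"]], f"(score: {hit['score']:.2f})")
```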
Image models are generally well known. They deliver good to very good results. However, things get even better with finely tuned image models or adapters. With these, images can be produced according to your specifications (style, mood, color scheme, motif). Here's an example:

You will certainly be able to work out what the template for this type of image was. The number of examples needed to teach an image adapter can be very small; often 8 or 15 examples are sufficient, depending on the variety of the image material. The number of examples can also be increased by adding synthetic ones.
For audio transcription, there are now excellent Whisper models available. They deliver significantly better results than the Microsoft standard in Teams. This was the result of a test by a data protection publisher, in which the transcription by Microsoft Teams was compared with that of Dr. GDPR using its own AI system. The company's own AI system takes a company-specific vocabulary into account, which also includes surnames. Nobody knows whether Schmitt is written with one "t", two "t"s or with "dt", let alone an AI.
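A minimal sketch of such a transcription with the open Whisper model, including a vocabulary hint; it assumes the openai-whisper package and ffmpeg, and the file name and vocabulary are examples.

```python
# Minimal sketch of audio transcription with the open Whisper model.
# Assumes the openai-whisper package and ffmpeg; file name and vocabulary are examples.
import whisper

model = whisper.load_model("base")  # larger models transcribe more accurately

# A company-specific vocabulary (including surnames) can be hinted at via the
# initial prompt, so that e.g. "Schmitt" is not transcribed as "Schmidt".
company_vocabulary = "Teilnehmer: Frau Schmitt, Herr Meffert, IT Logic GmbH."

result = model.transcribe(
    "meeting_recording.mp3",
    language="de",
    initial_prompt=company_vocabulary,
)
print(result["text"])
```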
Examples of AI models and their capabilities
A few examples will be used to demonstrate how model size, up-to-dateness of the model and type of data input (text, image, …) affect the response quality:
- Llama3-7b: Poor by today's standards, great when it was released; can be run well on your own hardware
- Llama3.1-8b: Very good for many tasks; can be run well on your own hardware
- Llama3-70b: Good to very good for many tasks, but sometimes worse than the newer Llama3.1:8b; can only be operated reasonably on expensive hardware
- Llama3.1-70b: Very good for many tasks; a few weaknesses in German; can only be operated reasonably on expensive hardware
- Llama3.1-405b: Even better than Llama3.1:70b, but not necessarily for German; can only be run reasonably on very expensive hardware
- Llama3.2-3b: Good, but worse than Llama3.1:8b; in return, faster answers
In addition to these LLMs, there are other model types. Here are a few examples:
- Pixtral-12B: Very good for answering questions about images; acceptable hardware requirements
- Qwen2.5-72B: Very good for generating program code; can only be operated reasonably on expensive hardware
- FLUX.1-fast: Sometimes very good results in image generation, but often shortcomings when generating German text within the image; can also be operated reasonably on cheaper hardware with a few tricks
The quality of the results therefore varies depending on the recency and size of the model. Text tends to require exact output, except for creative tasks. With images, the situation is often different.
Conclusion
Define your use case. If you don't have an idea where AI can support you, then you don't need AI. Use a search engine instead, as always.
Start with a simple use case. If you are unsure about what could be simple, ask for advice.
The smaller the AI model, the more specific the use case should be. Very large models, such as those with 405B parameters, should not usually be operated by your company itself. Even if the resources were available, there are usually better options.
A 70B model such as Llama3.1-70B is already quite large for self-operation. This is just a rough guide to give you an idea. Models no larger than half this size are better.
For tasks that do not require generative answers, there are better options than the AI models that "everyone" knows. Such models are ideal for finding knowledge in your company documents, and their hardware requirements are so low that nobody has to think about purchase or rental prices. Semantic search, i.e. the comparison of texts or images (or audio or ...), is another example of a sensible start to the AI age.
Whoever runs their own AI has very few to no worries about data security: very few worries if a GPU server in Germany is rented from a German provider with a data processing agreement (DPA), and no worries at all if the server is located in your own data center or rented via colocation.
Your own AI means: Full data control. Data goes nowhere, unless you want it to. Data is retrieved from nowhere, unless you want it to. Only authorized users can access documents via AI. This is called Offline-AI.
Finally: Which language model or other AI model is best suited to your use case should be assessed on the basis of the specific use case. There are new AI innovations and models every week. It is therefore worth taking a closer look.
Key messages of this article
The NullModel is the "best" model in the benchmark, but it always gives the same answer to all questions – which is not really helpful. The best language model depends on the application.
For simple questions, smaller language models such as 7B or 8B models are more suitable, as they often have a better command of German grammar than larger models.
AI assistants can search historical cases to recommend the best course of action.
Start with a simple application such as the semantic search in company documents.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
