Why are German language models more challenging than models for other languages such as English?

German is a neglected language globally; Google's FLAN-T5 language model, for example, understands it only as an emergent property. Gendered language and the use of the gender colon make correct processing more difficult.

Where do the training data for German AI language models come from?

The training data is mainly extracted from publicly accessible sources such as judgments from the Federal Court of Justice and the Federal Gazette, which are often only available in PDF format. These are converted into text format by platforms such as openjur, either manually or semi-automatically.

What advantages does a proprietary German language model offer to businesses?

Having your own German language model allows for a focus on the German language, reduces overhead from other languages, and optimizes user guidance. Furthermore, the costs of using such a system are often lower than with cloud solutions.

Why does the article present gender-specific language in AI models as a problem?

The article criticizes gender-specific language because it complicates text processing for AI models. The double word forms and the lack of clear grammar cause uncertainties that are not reliably solvable for machine analysis.

What is the impact of using gender-specific language on the quality of data for AI models?

The use of gender-specific language leads to increased complexity and uncertainty in the training data. This results in a lower quality of data, as AI models struggle to correctly interpret and learn grammatical structures.

Why is the disagreement over the use of gender-neutral language in AI models relevant to the article?

The article considers the different opinions regarding the use of gender-specific language, as this affects the quality of the data and the performance of AI models. The majority of Germans are critical of gender-specific language, which underscores the need for cleaner and more objective data.

Why is the use of gender formulations in AI models problematic?

Gendered formulations complicate the training of AI language models, as they require more data and confuse the models. This leads to poorer information processing.

What are the advantages of using proprietary German language models?

Native German language models are better adapted to the German language and require fewer resources than unreliable models like ChatGPT. This enables a more precise processing of information.

Artificial Intelligence: German texts in AI language models

Document search engines, chatbots, voice assistants, question-answering systems: all of them can be made suitable for German, a language that plays a subordinate role globally. ChatGPT does not provide exact answers. Reliable AI language models for German are possible despite minor obstacles such as gender language.

Introduction

The use of AI in a company differs fundamentally from the private use of ChatGPT, Microsoft Bing, Google Bard or other data-crunching systems.

Companies are reluctant to share their data, including trade secrets, patent applications, employee data, customer data, contracts or other confidential information, with ChatGPT. On the other hand, more data will have to be made available to others in the future. This is laid down in the EU's Data Governance Act (DGA), which, as a regulation, has applied directly since September 2023.

In addition, the requirements for correct answers from a chatbot or other AI language system are significantly higher in a company than in private matters. This applies at least outside of the creative field. The most demanding discipline is legal questions, which modern but general systems like ChatGPT and Microsoft's Bing AI cannot answer well (for the reasoning, see the link above). Public administrations that serve citizens should not rely on unreliable chatbots either, and that includes ChatGPT.

The gender colon is well suited to contaminating training data for language models,
especially because a colon is normally treated as a sentence-ending punctuation mark.

Even the supposedly recently released auto-correct function of Google Bard does not work properly, as a closer inspection of a practitioner's text showed.

Gender language causes unnecessary complications for AI language models, because it sometimes waters down the grammar in the training data. Furthermore, the gender colon can prevent entire sentences in a text from being recognized at all.

Compared globally, German is a neglected language (see image below). Powerful language models that focus on English understand German only because the language was picked up almost unintentionally, as an unwanted byproduct in the form of an emergent property.
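The effect of the gender colon on sentence recognition can be illustrated with a minimal Python sketch. It assumes a deliberately naive sentence splitter that treats the colon as a sentence-ending mark; real tokenizers differ in detail. The function names and the placeholder approach are hypothetical and only serve the illustration.

```python
import re

def naive_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): '.', '!', '?' and ':' all end a sentence."""
    parts = re.split(r"[.!?:]", text)
    return [p.strip() for p in parts if p.strip()]

# A colon directly between a word and the suffix "in" / "innen",
# as in "Mitarbeiter:innen" or "Kolleg:in".
GENDER_COLON = re.compile(r"(?<=\w):(?=in(?:nen)?\b)")
PLACEHOLDER = "\u0000"  # a character that does not occur in normal text

def split_protecting_gender_colon(text: str) -> list[str]:
    """Hide gender colons before splitting and restore them afterwards."""
    protected = GENDER_COLON.sub(PLACEHOLDER, text)
    return [s.replace(PLACEHOLDER, ":") for s in naive_sentences(protected)]

text = "Die Mitarbeiter:innen lesen den Vertrag. Danach folgt die Prüfung."
print(naive_sentences(text))                # the first sentence is torn apart
print(split_protecting_gender_colon(text))  # both sentences stay intact
```

The second function only works around the problem for this one spelling; other gender notations would each need their own handling, which is part of the extra effort described above.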

Advantages of your own language models

A language model can be obtained in the following ways:

Building from scratch: typically requires hundreds of thousands of GPU hours of compute time (GPU = graphics processing unit), which is not feasible for many companies.
Reusing open language models and adapting them by fine-tuning: the more demanding but manageable standard path.
Reusing open language models that only receive your own documents as context in the prompt.

The first two options can each, in different ways, absorb gender language. Fine-tuning, however, brings problems that cannot be completely avoided.

German is not a world language. The list shows the languages according to their relevance for Google's FLAN-T5 language model. Even languages such as Gujarati, which are probably completely unknown to many, are listed before German.

The third option, reusing open language models with your own documents as context, is technically the simplest and often works. Fundamentally, it does not have to deal with gender language at all. This is a technical statement, not a political one.
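How documents end up as context in the prompt can be sketched in a few lines of Python. This is a minimal illustration, not a finished system: the word-overlap ranking is a simple stand-in for the embedding-based search that real systems usually use, and the function names, the prompt wording and the sample documents are hypothetical.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set of a text."""
    return set(re.findall(r"\w+", text.lower()))

def top_documents(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by the number of words they share with the question."""
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Put the best-matching documents into the prompt as context."""
    context = "\n\n".join(top_documents(question, documents, k))
    return (
        "Beantworte die Frage nur anhand des folgenden Kontexts.\n\n"
        f"Kontext:\n{context}\n\n"
        f"Frage: {question}\nAntwort:"
    )

docs = [
    "Der Urlaubsanspruch beträgt 30 Tage pro Jahr.",
    "Die Kantine öffnet um 11 Uhr.",
    "Reisekosten werden monatlich erstattet.",
]
print(build_prompt("Wie hoch ist der Urlaubsanspruch pro Jahr?", docs, k=1))
```

The resulting string would then be passed to whichever open language model is in use. The model itself is not changed, which is why this path is the simplest of the three.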

A German language model of one's own is not only possible but also has many advantages. Among other things, the benefits are:

The German language is at the forefront. We live in Germany, not Spain. Anglicisms can also be understood by a German language model.
The ballast of dozens of other languages does not have to be dragged along. That is good for hardware requirements (graphics card!) and for operating speed.
High-quality content can be used instead of trash (= generally available material that has not been pre-selected).
Focus on a field of study (or also several).
Optimal user guidance that is sensitive to the quality of results, rather than acting as if every answer were correct (see ChatGPT or Bing).
Lower or fixed costs: A company-owned AI system essentially incurs only the purchase or rental costs for an AI server. Frequent use does not change that; the costs remain equally low. Cloud solutions like ChatGPT are quite different: querying documents quickly becomes expensive with frequent use. Anyone who uses the OpenAI chatbot API had better not program an unbounded recursion or an infinite loop, as otherwise the budget is spent within minutes and without benefit. That cannot happen with one's own system.
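The risk of a runaway loop with a paid API mentioned in the last point can be limited with a simple budget guard. The following Python sketch assumes that usage is billed per token and expresses the budget directly in tokens. The class, the function `fake_api_call` and all numbers are hypothetical stand-ins; no real API is called.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would exceed the allowed token budget."""

class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used_tokens = 0

    def reserve(self, tokens: int) -> None:
        """Book tokens before a call; refuse if the budget would be exceeded."""
        if self.used_tokens + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{self.used_tokens + tokens} tokens would exceed the limit of {self.max_tokens}"
            )
        self.used_tokens += tokens

def fake_api_call(prompt: str) -> str:
    """Stand-in for a paid API call (assumption: about 500 tokens per call)."""
    return "ok"

def run_buggy_loop(budget: TokenBudget, tokens_per_call: int = 500) -> int:
    """A loop without an exit condition; only the budget guard stops it."""
    calls = 0
    try:
        while True:
            budget.reserve(tokens_per_call)  # check before spending
            fake_api_call("Frage an das Dokument")
            calls += 1
    except BudgetExceeded:
        return calls

print(run_buggy_loop(TokenBudget(max_tokens=2000)))
```

With a limit of 2000 tokens and 500 tokens per call, the loop stops after four calls instead of running until the account is empty.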

The next section deals with training data for German AI language models, as these form the foundation of artificial intelligence. Several proposals for authorities and other state agencies follow from this, which could help artificial intelligence in Germany pick up speed.

Training data for German AI speech assistants

Training data is to a language model what upbringing by its parents is to a child. Language models need German texts. Where do these texts come from, if they are not to be stolen?

The internet offers a wealth of German texts. Companies also have numerous documents in their internal network that are suitable as a source of knowledge.

PDF instead of HTML

The Federal Court of Justice (BGH) publishes its judgments apparently only in PDF form. The non-profit platform openjur takes these PDFs and extracts (manually?) the text from them. Then openjur makes the judgments freely available online. Also, the Federal Gazette publishes many documents only in PDF form.

The same applies to some other important public sources that can be interesting for AI models. For example, many regulatory bodies publish their activity reports or guidelines only in PDF form.
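Text taken from PDFs usually needs cleaning before it is usable as training data, because line breaks, hyphenation at line ends and page numbers are carried over from the page layout. The following Python sketch shows one possible cleanup step using only the standard library. It assumes the raw text has already been extracted from the PDF by some other tool; the function name and the rules are hypothetical heuristics and do not describe how openjur or any other platform actually works.

```python
import re

def clean_pdf_text(raw: str) -> str:
    """Heuristic cleanup of text that was extracted from a PDF."""
    lines = [line.rstrip() for line in raw.splitlines()]
    # Drop lines that contain only a page number such as "3" or "- 3 -".
    lines = [ln for ln in lines if not re.fullmatch(r"\s*-?\s*\d+\s*-?\s*", ln)]
    text = "\n".join(lines)
    # Rejoin words split at a line end: "Bundes-\ngerichtshof" -> "Bundesgerichtshof".
    text = re.sub(r"(?<=[a-zäöüß])-\n(?=[a-zäöüß])", "", text)
    # Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Line breaks inside a paragraph become spaces; paragraph breaks stay.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text.strip()

raw = (
    "Der Bundes-\ngerichtshof hat\nentschieden.\n\n"
    "- 2 -\n\n"
    "Die Revision wird\nzurückgewiesen."
)
print(clean_pdf_text(raw))
```

The hyphenation rule is only a heuristic: a construction such as "Bundes- und Landesrecht" that happens to break after the hyphen would be joined incorrectly, which is one reason why manual or semi-automatic checking remains necessary.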