Document search engines, chatbots, voice assistants, question-and-answer systems: all of them can be adapted to German, a language that is underrepresented globally. ChatGPT does not provide exact answers. Reliable AI language models for German are possible despite minor obstacles such as gender-inclusive language.
Introduction
Using AI in a company differs fundamentally from private use of ChatGPT, Microsoft Bing, Google Bard or other data-crunching systems.
Companies are reluctant to share their data, including trade secrets, patent applications, employee data, customer data, contracts or other confidential information, with ChatGPT. On the other hand, more data will have to be made available to others in the future. This is required by the EU's Data Governance Act (DGA), which has applied since September 2023 and, as a regulation, is directly applicable in the member states.
In addition, the requirements for correct answers from a chatbot or another AI language system are significantly higher than in private use. This applies at least outside the creative field. The most demanding discipline is legal questions, which modern but general-purpose systems like ChatGPT and Microsoft's Bing AI cannot answer well (for the reasoning, see the link above). Public administrations, too, should not rely on unreliable chatbots, including ChatGPT.
The gender colon (as in "Mitarbeiter:innen") can contaminate training data for language models, especially because a colon normally acts as a punctuation mark that closes a sentence or clause.
Even the supposedly recently released auto-correct function of Google Bard does not work properly, as a practical test showed on closer inspection.
Gender-inclusive language causes unnecessary complications for AI language models when it waters down the grammar in training data. Furthermore, the gender colon means that entire sentences in texts are not recognized at all.
Compared globally, German is a neglected language (see image below). Powerful language models that focus on English understand German only because the language was picked up almost unintentionally as an unwanted byproduct, an emergent property.
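The effect of the colon can be illustrated with a minimal sketch. It assumes a deliberately naive preprocessor that treats ":" as a sentence boundary; real tokenizers and sentence splitters behave differently. The function names and the pattern for gender forms are illustrative assumptions and do not come from any library.

```python
import re

# Assumption: gender-colon forms look like "Mitarbeiter:innen" or "Kund:in",
# i.e. a word character, a colon, then "in" or "innen" without any space.
GENDER_COLON = re.compile(r"(?<=\w):(?=in(?:nen)?\b)")


def naive_sentence_split(text: str) -> list[str]:
    """Split on '.', '!', '?' and ':' as a very simple preprocessor might."""
    parts = re.split(r"[.!?:]", text)
    return [p.strip() for p in parts if p.strip()]


def split_protecting_gender_colon(text: str) -> list[str]:
    """Hide gender colons behind a placeholder, split, then restore them."""
    placeholder = "\x00"  # assumed not to occur in normal text
    protected = GENDER_COLON.sub(placeholder, text)
    return [s.replace(placeholder, ":") for s in naive_sentence_split(protected)]


text = "Alle Mitarbeiter:innen erhalten eine Schulung. Sie beginnt am Montag."
print(naive_sentence_split(text))           # the first sentence is torn apart
print(split_protecting_gender_colon(text))  # the first sentence stays intact
```

The naive splitter cuts the first sentence in the middle of the word, so neither fragment is a usable training sentence. The protected variant is only a workaround for this one pattern; it does nothing about the grammatical variation mentioned above.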
Advantages of in-house language models
A language model can be obtained in the following ways:
- Building from scratch typically requires hundreds of thousands of GPU hours of compute time (GPU = graphics processing unit), which is not feasible for many companies.
- Reusing open language models and adapting them by fine-tuning: the more demanding but manageable standard path.
- Reusing open language models that only receive your own documents as context in the prompt.
The first two options can, in different ways, absorb gender-inclusive language from the training data. With fine-tuning, however, some problems cannot be completely avoided.

The third option, reusing an open language model and supplying documents as context, is technically the simplest and often works. It does not fundamentally engage with gender-inclusive language at all. This is a technical statement, not a political one.
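The idea of feeding your own documents into the prompt can be sketched as follows. This is a minimal illustration under stated assumptions: passages are ranked by simple word overlap with the question (production systems typically use embeddings instead), and the function names and the German prompt wording are hypothetical. The resulting string would then be passed to whatever open language model is in use; that call is left out here.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))


def rank_passages(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k passages sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, passages: list[str], top_k: int = 2) -> str:
    """Assemble a prompt that contains only the selected passages as context."""
    context = "\n\n".join(rank_passages(question, passages, top_k))
    return (
        # Illustrative instruction text; the wording is an assumption.
        "Beantworte die Frage ausschließlich anhand des folgenden Kontexts.\n\n"
        f"Kontext:\n{context}\n\n"
        f"Frage: {question}\nAntwort:"
    )


documents = [
    "Die Kündigungsfrist beträgt drei Monate zum Quartalsende.",
    "Der Serverraum befindet sich im zweiten Stock.",
    "Urlaubsanträge sind vier Wochen vorher einzureichen.",
]
print(build_prompt("Wie lang ist die Kündigungsfrist?", documents, top_k=1))
```

The model itself is not retrained in this approach. It only sees the selected company passages at query time, which is why the model's own weights are not changed by the company documents.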
A German language model of one's own is not only possible but also has many advantages. Among them:
- The German language takes center stage. We live in Germany, not Spain. A German language model can also understand Anglicisms.
- The ballast of dozens of other languages does not have to be carried along. That is good for hardware requirements (graphics card!) and operating speed.
- High-quality content can be used instead of trash (= generally available material that has not been pre-selected).
- Focus on one subject area (or several).
- Optimal user guidance that is sensitive to the quality of results, rather than acting as if every answer were correct (see ChatGPT or Bing).
- Lower or fixed costs: a company-owned AI system essentially costs the purchase or rental price of an AI server. Frequent use does not change that; the costs stay equally low. Cloud solutions like ChatGPT are quite different. Querying documents quickly becomes expensive with frequent use. Anyone who uses the OpenAI chatbot API had better not program a recursion or an infinite loop, otherwise the budget is spent within minutes and without any benefit. That cannot happen with one's own system.
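The risk of a runaway loop against a pay-per-call API can be limited with a simple guard. The following is a minimal sketch, not part of any vendor SDK: the class and function names are hypothetical, the prices are invented for illustration, and the remote call is replaced by a placeholder so that the example runs without network access.

```python
class BudgetExceeded(RuntimeError):
    """Raised when another call would exceed the configured budget."""


class CallBudget:
    """Tracks spending in integer cents to avoid floating-point drift."""

    def __init__(self, max_cents: int, cents_per_call: int):
        self.max_cents = max_cents
        self.cents_per_call = cents_per_call
        self.spent_cents = 0

    def charge(self) -> None:
        """Book one call, or refuse it if the budget would be exceeded."""
        if self.spent_cents + self.cents_per_call > self.max_cents:
            raise BudgetExceeded(f"budget of {self.max_cents} cents reached")
        self.spent_cents += self.cents_per_call


def ask_remote_model(question: str, budget: CallBudget) -> str:
    """Placeholder for a paid API call; charges the budget before 'calling'."""
    budget.charge()
    return f"(answer to: {question})"


# A faulty loop that would otherwise never stop calling the paid service.
budget = CallBudget(max_cents=10, cents_per_call=3)  # hypothetical prices
answered = 0
try:
    while True:
        ask_remote_model("Was steht im Vertrag?", budget)
        answered += 1
except BudgetExceeded:
    print(f"stopped after {answered} calls, {budget.spent_cents} cents spent")
```

With a fixed-cost in-house server the same programming error would waste compute time, but it would not generate a bill.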
The next section deals with training data for German AI language models, as these form the foundation of artificial intelligence. Several proposals for authorities and other state agencies also follow from this, which could help bring artificial intelligence in Germany up to speed.
Training data for German AI speech assistants
Training data is to a language model what upbringing by parents is to a child. Language models need German texts. Where do these texts come from, if they are not to be stolen?
The internet offers a wealth of German texts. Companies also have numerous documents in their internal network that are suitable as a source of knowledge.
PDF instead of HTML
The Federal Court of Justice (BGH) apparently publishes its judgments only as PDFs. The non-profit platform openjur takes these PDFs and extracts the text from them (manually?). openjur then makes the judgments freely available online. The Federal Gazette also publishes many documents only as PDFs.
The same applies to some other important public sources that can be of interest for AI models. For example, many regulatory bodies publish their activity reports or guidelines only as PDFs.
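Text extracted from PDFs usually needs cleaning before it can serve as training data or as a knowledge source. The following minimal sketch assumes that the raw text has already been extracted with a PDF library of your choice (not shown here). The function name and the heuristics are illustrative assumptions: lines consisting only of a page number are dropped, words hyphenated across a line break are rejoined when the continuation starts in lowercase, and single line breaks are merged into running text.

```python
import re


def clean_pdf_text(raw: str) -> str:
    """Heuristic clean-up of text extracted from a PDF (assumed input)."""
    # Drop lines that contain only a page number such as "3" or "- 3 -".
    # Caveat: this would also remove stand-alone margin numbers.
    lines = [
        line for line in raw.splitlines()
        if not re.fullmatch(r"\s*-?\s*\d+\s*-?\s*", line)
    ]
    text = "\n".join(lines)
    # "Daten-\nverarbeitung" -> "Datenverarbeitung" (lowercase continuation).
    text = re.sub(r"(\w)-\n([a-zäöüß])", r"\1\2", text)
    # "IT-\nSicherheit" -> "IT-Sicherheit" (uppercase: keep the hyphen).
    text = re.sub(r"(\w)-\n(?=[A-ZÄÖÜ])", r"\1-", text)
    # Merge single line breaks; blank lines remain as paragraph boundaries.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return re.sub(r"[ \t]+", " ", text).strip()


raw_page = (
    "Der Beklagte hat die Daten-\n"
    "verarbeitung ohne Einwilligung\n"
    "vorgenommen.\n"
    "- 2 -\n"
    "\n"
    "Die IT-\n"
    "Sicherheit war unzureichend."
)
print(clean_pdf_text(raw_page))
```

Heuristics like these are error-prone, which is one reason why sources published directly as HTML or plain text are easier to use than PDFs.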

My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
