Artificial Intelligence: German texts in AI language models

Document search engines, chatbots, voice assistants, question-and-answer systems: all of them can be made to work for German, a language of secondary importance globally. ChatGPT does not provide exact answers. Reliable AI language models for German are possible despite minor obstacles such as gender language.

Introduction

The use of AI in a company differs fundamentally from the private use of ChatGPT, Microsoft Bing, Google Bard or other data-crunching systems.

Companies are reluctant to share their data with ChatGPT, especially trade secrets, patent applications, employee data, customer data, contracts or other confidential information. On the other hand, more data will have to be made available to others in the future. This is laid down in the EU's Data Governance Act (DGA), which, being a regulation, has applied directly since September 2023.

In addition, a company needs correct answers from a chatbot or other AI language system to a far greater degree than a private user does. This applies at least outside the creative field. The most demanding category is legal questions, which modern but general-purpose systems such as ChatGPT and Microsoft's Bing AI cannot answer well (for the reasoning, see the link above). Public administrations, too, should not rely on unreliable chatbots, including ChatGPT.

The gender colon (as in "Mitarbeiter:innen") is liable to contaminate training data for language models.

Especially because a colon is normally a sentence-ending punctuation mark.

Even the supposedly recently released auto-correct function of Google Bard does not work properly, as a practical test showed on closer inspection.

Gender language causes unnecessary complications for AI language models when it waters down the grammar in parts of the training data. Furthermore, the gender colon means that entire sentences in a text are not recognized as sentences at all.
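The effect on sentence recognition can be illustrated with a toy example. This is a minimal sketch, assuming a deliberately naive splitter that treats every colon as a boundary; the function names are made up for illustration, and real tokenizers and sentence splitters work differently and handle abbreviations, times and similar cases that this sketch ignores.

```python
import re


def naive_split(text):
    """Split at '.', '!', '?' and ':' (assumed naive rule, for illustration only)."""
    parts = re.split(r"[.!?:]", text)
    return [p.strip() for p in parts if p.strip()]


def colon_aware_split(text):
    """Split at '.', '!' and '?'; split at ':' only when whitespace follows it."""
    parts = re.split(r"[.!?]|:(?=\s)", text)
    return [p.strip() for p in parts if p.strip()]


sentence = "Die Mitarbeiter:innen sind da."
print(naive_split(sentence))        # the gendered word is torn into two fragments
print(colon_aware_split(sentence))  # the sentence stays in one piece
```

With the naive rule, neither fragment is a valid sentence any more. The colon-aware variant only works because the gender colon is written without a following space, which is itself an assumption about how the text was typed.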

Globally, German is a neglected language (see image below). Powerful language models that focus on English understand German only because the language was picked up almost unintentionally, as an unplanned byproduct in the form of an emergent property.

Advantages of your own language models

A language model can be obtained in the following ways:

  • Building a model from scratch: typically requires hundreds of thousands of GPU hours of compute time (GPU = graphics processing unit), which is not feasible for many companies.
  • Reusing an open language model and adapting it by fine-tuning: more demanding, but a manageable standard path.
  • Reusing an open language model unchanged and only feeding your own documents into the prompt as context.

With the first two options, the model can pick up gender language, in a different way in each case. Fine-tuning, however, brings problems that cannot be avoided completely.

German is not a world language. The list ranks languages by their relevance for Google's FLAN-T5 language model. Even languages such as Gujarati, which many people have probably never heard of, are listed ahead of German.

The third option, reusing an open language model and supplying documents as context, is the technically simplest one and often works well. By design, it does not have to deal with gender language at all. This is a technical statement, not a political one.
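The third option can be sketched in a few lines. This is a minimal illustration, assuming a toy word-overlap score for picking documents and a made-up prompt template; the function names and sample documents are hypothetical. Real systems usually select documents with embeddings or a search index, and the finished prompt would then be passed to whichever open language model is in use.

```python
import re


def tokenize(text):
    # Lowercased set of words; in Python 3, \w also matches German umlauts.
    return set(re.findall(r"\w+", text.lower()))


def rank_documents(question, documents):
    """Order documents by the number of words they share with the question (toy scoring)."""
    question_words = tokenize(question)
    return sorted(documents, key=lambda d: len(question_words & tokenize(d)), reverse=True)


def build_prompt(question, documents, top_k=2):
    """Put the best-matching documents into the prompt as context (assumed template)."""
    context = "\n\n".join(rank_documents(question, documents)[:top_k])
    return (
        "Beantworte die Frage nur anhand des folgenden Kontexts.\n\n"
        f"Kontext:\n{context}\n\n"
        f"Frage: {question}\nAntwort:"
    )


documents = [
    "Der Urlaubsantrag ist spätestens vier Wochen vorher einzureichen.",
    "Die Kantine öffnet um 11:30 Uhr.",
    "Reisekosten erstattet die Buchhaltung nach Vorlage von Belegen.",
]
print(build_prompt("Wann muss der Urlaubsantrag eingereicht werden?", documents, top_k=1))
```

The model itself stays untouched in this setup: the company's documents are only read at question time, which is why no retraining is needed.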

Running your own German language model is not only possible but also has many advantages. Among others:

  • The German language takes centre stage. We live in Germany, not Spain. A German language model can also understand anglicisms.
  • The ballast of dozens of other languages does not have to be dragged along. That is good for hardware requirements (graphics card!) and for operating speed.
  • High-quality content can be used instead of trash (= generally available material that has not been pre-selected).
  • Focus on one subject area (or several).
  • Optimal user guidance that is sensitive to the quality of results, rather than acting as if every answer were correct (see ChatGPT or Bing).
  • Lower or fixed costs: A company-owned AI system essentially costs whatever it takes to buy or rent an AI server. Frequent use does not change that; the costs remain equally low. Cloud solutions like ChatGPT are quite different: querying documents quickly becomes expensive with frequent use. Anyone who uses the OpenAI chatbot API had better not program a runaway recursion or an infinite loop, otherwise the budget is spent within minutes and without any benefit. That cannot happen with your own system.
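The risk of a runaway loop against a paid API can be limited with a simple spending cap. This is a minimal sketch with made-up prices; `BudgetGuard` and `ask_paid_api` are hypothetical names, and the API call is only a placeholder rather than a real client library call.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next paid call would exceed the spending cap."""


class BudgetGuard:
    def __init__(self, max_cost_cents, cost_per_call_cents):
        # Integer cents avoid floating-point rounding; both prices are assumptions.
        self.max_cost_cents = max_cost_cents
        self.cost_per_call_cents = cost_per_call_cents
        self.spent_cents = 0

    def charge(self):
        """Book one call, or refuse it if the cap would be exceeded."""
        if self.spent_cents + self.cost_per_call_cents > self.max_cost_cents:
            raise BudgetExceeded(f"cap of {self.max_cost_cents} cents reached")
        self.spent_cents += self.cost_per_call_cents


def ask_paid_api(guard, question):
    guard.charge()  # check the budget before any request is sent
    return f"(Antwort auf: {question})"  # placeholder for the real API call


guard = BudgetGuard(max_cost_cents=10, cost_per_call_cents=3)
calls = 0
try:
    while True:  # stands in for an accidental infinite loop
        ask_paid_api(guard, "Was steht im Vertrag?")
        calls += 1
except BudgetExceeded:
    print(f"Stopped after {calls} calls, {guard.spent_cents} cents spent")
```

The guard checks the budget before each request, so the loop stops at the cap instead of running until the account is empty.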

The next section deals with training data for German AI language models, as these form the foundation of artificial intelligence. Several proposals for authorities and other state agencies follow from this, which could help artificial intelligence in Germany pick up speed.

Training data for German AI speech assistants

Training data is to a language model what upbringing by its parents is to a child. For language models, German texts are needed. Where can these texts come from, if they are not to be stolen?

The internet offers a wealth of German texts. Companies also have numerous documents in their internal networks that are suitable as a source of knowledge.

PDF instead of HTML

The Federal Court of Justice (BGH) apparently publishes its judgments only in PDF form. The non-profit platform openjur takes these PDFs, extracts the text from them (manually?) and then makes the judgments freely available online. The Federal Gazette also publishes many documents only in PDF form.

The same applies to some other important public sources that could be interesting for AI models. For example, many regulatory bodies publish their activity reports or guidelines only in PDF form.

Complicated two-column PDF from a data protection authority.
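One reason why such layouts are awkward is the reading order. The following is a minimal sketch, assuming that a PDF extraction tool has already delivered text blocks with the coordinates of their top-left corner, that y grows downward, and that the page has exactly two columns split at the middle. The function name and the sample blocks are hypothetical, and real documents with headers, footnotes or full-width tables need more than this.

```python
def reading_order(blocks, page_width):
    """Return block texts column by column: the left column top to bottom, then the right.

    blocks: list of (x0, y0, text) tuples, where (x0, y0) is the top-left corner.
    """
    middle = page_width / 2
    left = sorted((b for b in blocks if b[0] < middle), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= middle), key=lambda b: b[1])
    return [text for _, _, text in left + right]


blocks = [
    (310, 50, "rechts oben"),
    (40, 50, "links oben"),
    (40, 400, "links unten"),
    (310, 400, "rechts unten"),
]
print(reading_order(blocks, page_width=595))
```

Sorting purely from top to bottom would interleave the two columns line by line, which produces text that is useless as training data.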

About the author on dr-dsgvo.de
My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.