With § 44b UrhG, the German legislator has defined a way for authors to protect their content against AI crawling. In practice, however, this option does not exist, and the result is a further impoverishment of the German language in AI language models. Our domestic economy will suffer for it.
Introduction
Content from websites, publicly accessible PDF documents, and similar sources may be read in and processed by Artificial Intelligence, especially by chatbots, and even stored temporarily for AI training. This is permitted by § 44b UrhG.
The same provision states that this reading of content for AI language models is not allowed if the author has declared a machine-readable usage reservation. Incidentally, I regard generative AI models as "data mining" within the meaning of § 44b UrhG. More on that in a future post, since there seem to be other opinions on this. Regardless of what exactly counts as data mining, the problem this article is about remains.
Such a usage reservation does not exist in practice, as I will show. Besides chatbots there are other very interesting and relevant AI applications, including data analysis, automated reasoning, and automated discovery. Because German will lose even more significance in the future, everyone else worldwide will be able to gain knowledge and make discoveries automatically, while we in Germany will only manage that if we stop speaking German with AI systems.
What does machine-readable mean?
According to Recital 35 of EU Directive 2019/1024, a document is machine-readable "when it is available in a file format that is structured so that software applications can easily identify, recognize and extract the specific data. …"
Anyone who feeds website content into AI applications via a crawler must, according to the will of the German legislator, prove that NO usage reservation was present in the imprint or terms of use of the website.
This proof can only be obtained manually, so an automated mechanism, as AI applications require, fails.
The robots.txt file, at least, is machine-readable. It regulates which crawlers are allowed to read content, originally for the purpose of feeding search engines.
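A classic robots.txt consists of just a few lines. A generic example (the path is a placeholder):

User-agent: *
Disallow: /internal/

This tells every crawler to stay away from the /internal/ section and allows everything else.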
The German legislator sees it differently. What it understands by machine-readable makes me suspect that it either had extremely naive and incompetent advisors or consulted none at all.
The German legislator apparently considers entries in the imprint or the terms and conditions machine-readable. See Bundestag printed paper 19/27426 on the draft law for § 44b UrhG (there: p. 89, paragraph 2), bold emphasis mine:
"A usage reservation must be declared explicitly and in a manner appropriate to automated processes when it comes to text and data mining. In the case of works accessible online, the reservation is only effective under paragraph 3 sentence 2 if it is made in machine-readable form (compare Recital 18 subparagraph 2 sentence 2 DSM Directive). **It can also be contained in the imprint or in the general terms and conditions (AGB), as long as it is machine-readable there.**"
I would call this contrary to European law, but I don't want to pre-empt a legal discussion. Note that, as far as I know, it is legal in Germany to conclude contracts that are impossible to fulfil.
How bad such advisors can be is on display on the website of a well-known German legal service provider. There, the usage reservation under § 44b UrhG is declared in the imprint. The same statement also appears as an informal comment in the robots.txt file of that website.
Unfortunately, in that robots.txt file, the operators forgot to exclude the second best-known AI system (from Google) alongside the best-known one (ChatGPT) by means of a simple and unambiguous technical directive.
Yet it would be so simple.
The legal service provider in question certainly has the resources to pay competent consultants.
I don't see a problem with any specific party here, but with the legislative process itself. Anyone who has ever watched an expert hearing of the German Bundestag or of a political committee at federal level on TV will know what I mean. Here's the essence:
- Experts are afraid to speak the truth.
- Experts are not experts.
- Experts have little time for their answers.
- Experts may only answer the questions they are asked, not think beyond them.
- The whole event only lasts a short time.
- Experts' answers are often understandable only to semi-experts, not to politicians, who want to understand everything and must believe they do.
- It's not polite, it's uncomfortable to speak truths, and who wants to disturb the good vibes anyway?
Problems upon problems
The German legislator's mandate is bullshit for several reasons. Here is why it fails.
Imprint and AGB page cannot be located automatically
The imprint and AGB pages cannot be located automatically with any speed, let alone reliably. Yet that would have to be possible, because otherwise no AI company will dare to read German web pages for AI applications again. The source cited above also says on p. 89: "The burden of proof for the absence of a usage reservation lies with the user [= crawler operator]."
I speak from experience. The imprint is a subpage like any other subpage of a website. The AGB page is one too, but often comes as a PDF. Anyone who has ever dealt with reading in PDFs and automatically extracting their raw text knows: it's not easy.
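To illustrate, here is a minimal sketch of PDF text extraction, assuming the third-party pypdf library and a hypothetical file agb.pdf. Even when it runs cleanly, multi-column layouts, footnotes and headers come out jumbled, which makes reliable clause detection a guessing game.

```python
# Raw-text extraction from a PDF with pypdf (pip install pypdf).
# "agb.pdf" is a hypothetical terms-and-conditions file.
from pypdf import PdfReader

reader = PdfReader("agb.pdf")
# extract_text() may return None for pages without a text layer,
# and it flattens columns and footnotes into one undifferentiated stream.
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(raw_text[:500])
```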
Imprint and AGB pages cannot be reliably recognized.
So says the expert who has already crawled many websites.
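What would "recognizing" the imprint even look like in code? Roughly like the following heuristic, a sketch using only the Python standard library. The keyword list is my own guess, and real sites defeat it with JavaScript menus, icons instead of link texts, or PDFs.

```python
# Heuristic search for imprint / terms pages via link keywords.
# A sketch only; it cannot be reliable, which is exactly the point.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

KEYWORDS = ("impressum", "imprint", "agb", "terms", "nutzungsbedingungen")

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def find_legal_pages(base_url):
    html = urlopen(base_url, timeout=10).read().decode("utf-8", "replace")
    collector = LinkCollector()
    collector.feed(html)
    # Pure keyword matching: misses image links, JS navigation, PDFs, ...
    return [urljoin(base_url, h) for h in collector.links
            if any(k in h.lower() for k in KEYWORDS)]

print(find_legal_pages("https://example.org/"))
```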
Terms of use and AGB would often not be read at all
If a crawler retrieves a document via a direct link (e.g. a PDF), it often does not want to read any further pages of the site. It would have to, though, in order to find the imprint and AGB.
But it gets even worse.
An AI crawler is stupid
A crawler is a crawler is a crawler. There is usually no AI in it. That AI is only supposed to come into existence once enough training data is available, and it is the crawler that is supposed to deliver this data in the first place.
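What a crawler actually does can be written down in a few lines. A minimal sketch (Python standard library only; link following and scheduling omitted):

```python
# A crawler, reduced to its essence: fetch content and store it.
# There is no AI anywhere in this loop.
import hashlib
import os
from urllib.request import urlopen

def fetch_and_store(url, out_dir="corpus"):
    os.makedirs(out_dir, exist_ok=True)
    raw = urlopen(url, timeout=10).read()            # read the content ...
    name = hashlib.sha256(url.encode()).hexdigest()
    with open(os.path.join(out_dir, name + ".html"), "wb") as f:
        f.write(raw)                                 # ... and store it. Done.
```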
The naive and stupid argument of some, that software today can understand everything, is really just that: stupid or naive. In the end it would mean renting ChatGPT, sending all conceivable data there, and asking ChatGPT, for money: "Where's the imprint?" or "Is there a usage reservation in the imprint?" or "Now we have to search the AGB, dear ChatGPT, but please don't store any data, because we first have to find out whether there is a usage reservation."
An analogy (unfortunately I can't think of a better one right now): you have an appointment in two hours at a location 500 km away, while you are tied up in another appointment where you currently are. You arrive late and get scolded, because you could have taken a helicopter. The helicopter here corresponds to ChatGPT, except that the helicopter has fewer privacy loopholes.
An AI crawler is just as dumb as some who think every German sentence could be interpreted and understood by software.
In a social network, a lady phrased her usage reservation against AI crawling as follows: "Any data use is exclusively intended for the purpose of gaining information in human neural networks."
I highly doubt that a crawler will understand this. I also highly doubt that a language model will understand this. And besides, I highly doubt that most people will understand this.
The Dilemma
Once again: a crawler is a crawler. A crawler reads in content and stores it. Done. Everything that comes afterwards is handled by other software components.
A crawler that reads content for a search engine therefore only has to, and only should, respect the robots.txt file and the usage restrictions listed there.
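That part really is machine-readable: the Python standard library answers the robots.txt question in a few lines. A sketch (the inline rules stand in for a real download via set_url/read):

```python
# Evaluating robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In practice: rp.set_url("https://their-website.de/robots.txt"); rp.read()
rp.parse("User-agent: GPTBot\nDisallow: /".splitlines())

# May OpenAI's documented crawler fetch this page? -> False
print(rp.can_fetch("GPTBot", "https://their-website.de/article/"))
```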
According to the wishes of the German legislator, the same crawler is supposed to be able to do far more if the content is also, or only, used for training AI models. It would then not only have to understand the genuinely simple robots.txt file, which sits in the same place on every website. No, this same crawler would then also have to:
1. Read further into the website than perhaps intended, in order to find out where the imprint and terms of use are stored.
2. Read in the imprint.
3. Extract information from the imprint.
4. Analyze the raw text and try to understand it.
5. If no usage reservation is found, continue to the lottery (step 6).
6. Read in the general terms and conditions (AGB).
7. Fire up a PDF reader. Hopefully the terms of use come without footnotes and, ideally, in a single column.
8. Extract terms from the terms of use.
9. Analyze the raw text and try to understand it.
10. If no usage reservation is found, continue to the lottery (step 11).
11. Store, in the most legally secure and auditable way possible:
   - the imprint page,
   - the terms-of-use page,
   - the page from which the locations of the imprint and terms of use were determined.
Have fun and above all: Good luck!
The solution
A solution requires three conventions:
- Naming convention (URL): where the file containing the usage reservation can be found.
- Structural convention (content): how the file is structured.
- Naming convention (content): which parameters express usage reservations. There can be a general usage reservation, but also a specific one (against individual AI systems).
The long-established and proven robots.txt file already meets all these requirements. Only a specification for the general usage reservation is missing. This specification has to be made only once so that it becomes a convention. Done. It costs me ten seconds (see below), which is no great intellectual feat.
The places named by the German legislator, however, meet NONE of the three conventions:
- It is unclear where on a website the imprint and terms of use can be found; often there are no terms of use at all.
- The imprint is structurally chaotic. As for the terms of use as a legal text, let's not even start.
- See point 2: the imprint is also chaotic in terms of content; the AGB likewise.
The German way is therefore a dead end. The German rule on usage reservations against AI crawling is doomed to fail. It also ensures that the German language will become impoverished in the AI landscape, or that only large AI companies will be able to afford not to comply with German regulations. Thank you very much, Germany.
What is the German language good for in language models?
Chatbots in the form that private users use them are not the problem, as long as no sensitive data is processed. ChatGPT and similar tools exist for that purpose.
For intelligent AI-based document search, there are also good language models that even run locally. Good for those who have already saved these LLMs locally, because as soon as the world takes notice of the German detour, newer versions of the language models will contain fewer German texts.
Above all, however, language models are very interesting, relevant, and economically highly significant for machine reasoning. Research, too, benefits from new insights that would not be possible without AI language models. Here's an example of what's already possible now.
The example is given in German. In the future, it will only work with freely available language models if the German detour does not set a precedent. Otherwise you will unfortunately have to express yourself in English, Spanish, Bengali or some other truly relevant language. Sorry for the extra effort this would cause you. Thank the German legislator.
Identify companies traded on the stock market that manufacture products relevant to Artificial Intelligence applications. Identify competitors for these companies. Find suppliers for all of these companies that provide particularly valuable parts. Valuable are parts for which there are only a few manufacturers worldwide. Find the most profitable companies among them and name them, along with the products they manufacture.
A fictitious example, which would be formulated differently in reality.
In principle, machine-based reasoning ("reasoning") on a question like the one in the example works as follows: with the help of currently available open-source techniques, language models can break a question down into sub-tasks, execute them individually, combine their results, and thus generate the final answer. In this way, new findings have been gained in materials science, for example. That solution is called MechGPT. It was achieved above all by reading research results (in English!) and finding connections between them. The results were new findings that had been scattered across individual English-language articles. Too bad that the German language is becoming less important.
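The decompose-execute-combine loop described above can be sketched in a few lines. This is a sketch of the general idea only, not MechGPT itself; ask() is a hypothetical stand-in for whatever language-model API is used:

```python
# Sketch of question decomposition with a language model.
def ask(prompt):
    # Hypothetical helper: plug in any LLM API here.
    raise NotImplementedError

def answer_complex_question(question):
    # 1. Let the model break the question into numbered sub-tasks.
    plan = ask("Break this question into numbered sub-tasks:\n" + question)
    # 2. Execute each sub-task individually.
    results = [ask("Solve this sub-task:\n" + step)
               for step in plan.splitlines() if step.strip()]
    # 3. Combine the intermediate results into the final answer.
    return ask("Combine these findings into one answer:\n" + "\n".join(results))
```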
Conclusion
The German legislator is stupid. Anyone who thinks § 44b UrhG is implementable as it stands is naive, or stupid, or wants to weigh in on things they had better not talk about.
Since § 44b UrhG is not realizable and the crawler operator must prove that everything was done correctly, German texts will be found even less often in AI language models in the future. A chatbot is only as good as the data it is trained on. German will remain stuck in the Stone Age. If you ever plan to analyze texts from the Internet with the help of AI (e.g., to predict stock prices), better write everything in English, Chinese or Bengali right away.
The Truth About AI: No AI language model can be effective without copyrighted data. No great AI language model is lawful.
Author's opinion, as of July 9, 2024
The solution would be: A usage reservation against AI crawling must be embedded in the robots.txt file.
This approach essentially exists already, because companies like OpenAI and Google already state how a usage reservation can be embedded in robots.txt. Here are concrete examples from practice:
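Based on the crawler documentation published by OpenAI and Google, such entries look like this (GPTBot is OpenAI's crawler; Google-Extended is Google's token for controlling use in AI training):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /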

This file can be found at dr-dsgvo.de/robots.txt, or generally at their-website.de/robots.txt. That's it.
Because it is simply simple, and everything in Germany has to be complicated, the German legislator has turned something simple into something complicated.
The problem is unknown or not-yet-existing AI crawlers, whose robots.txt entries cannot be known in advance. Anyone who wants to create an AI model one day will hardly make sure that the whole world (or even just Germany) knows what their AI crawler is technically called and how a usage reservation can be formulated specifically against it.
A possible solution would be a universal entry, for example:
AI-agent: *
Disallow: /
This would declare a usage reservation against all AI crawlers, but not against search engines. The imagination for a concrete design is unlimited.
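How little effort honoring such a record would take can be sketched quickly. Note that the AI-agent record is this article's suggestion, not an existing standard, so the parser below is hypothetical through and through:

```python
# Hypothetical check for the proposed "AI-agent" record in robots.txt.
def ai_crawling_reserved(robots_txt):
    lines = [line.strip().lower() for line in robots_txt.splitlines()]
    for i, line in enumerate(lines):
        if line.startswith("ai-agent:") and "*" in line:
            # A following Disallow line expresses the general reservation.
            return any(ln.startswith("disallow") for ln in lines[i + 1:i + 3])
    return False

print(ai_crawling_reserved("AI-agent: *\nDisallow: /"))  # True
```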
As soon as search engines become like AI language models, or at least AI vector searches, it won't matter anyway.
My tip: best ignore the usage reservation and build your own AI language models. Nobody can see it from the outside. Moreover, you can build them in such a way that copyrighted texts do not appear in the answers, so no problem can arise.
Key messages
- German law aims to protect authors' content from AI use, but the way it defines "machine-readable" reservations is impractical and ineffective, hindering AI development and potentially harming Germany's economy.
- It is difficult for AI to reliably find and understand legal information such as terms of use on websites, because the way websites are structured makes it hard to locate these pages automatically.
- AI crawlers would need to understand website content, including legal documents like terms of use, to ensure they comply with usage reservations.
- Germany's approach to regulating AI crawling is ineffective and will harm the German language in AI development.
- Self-built AI models can be designed to avoid reproducing copyrighted material in their responses, preventing any legal issues.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and hands-on with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law together. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
