Under the law, creators of works accessible online can declare a reservation of use. This is meant to protect works from flowing into electronic brains. Does this approach work? This article outlines the possibilities and the limitations.
Introduction
Artificial intelligence has enormous capabilities that often far surpass those of the averagely intelligent human. The Turing Test, which checks whether a computer is as intelligent as a human, is now considered passed. As ChatGPT shows, an AI can even outperform humans in certain areas, at least when averaged over all people. An AI knows no fatigue and, unlike a human with a very limited brain, can always fall back on better hardware. From my point of view, the only remaining advantages of humans are the senses and the ability to explore and perceive the environment. This will soon change greatly in favor of artificial systems.
AI models can hoover up texts and images from authors online almost at will, and do so with legal legitimation. The law gives authors the right to a reservation of use, but that right is effectively toothless. The reasons are purely organizational and technical.
These astonishing abilities of AI are frightening at the same time. Creators worry that their works will now be sucked up and dissected by an electronic brain. Google has been doing this for a long time; it just never caused the same outrage: someone enters a search term into the search engine. Instead of your website appearing for that term, so that you catch the user and use them for your legitimate purposes, the answer is shown as an extract of your content directly in the search engine. The user never even lands on your website; they are siphoned off beforehand. You are the content provider and the fool. Google is pleased. The user doesn't care.
Many authors of works available online have therefore demanded a consent requirement: an author would have to permit a machine-learning model to use their work. Others demand only what the law already provides, namely an opt-out option. This is anchored in § 44b Abs. 3 UrhG and reads as follows:
Uses pursuant to paragraph 2 sentence 1 [reproduction of lawfully accessible works for text and data mining] are only permitted if the rights holder has not reserved them. A reservation of use for works accessible online is only effective if it is made in machine-readable form.
Section 44b(3) of the Copyright Act (UrhG)
Furthermore, copies of copyrighted works made for purposes of artificial intelligence must be deleted as soon as they are no longer needed. That is no real obstacle: if you have read a text thoroughly, you know what it said even without the original. The same applies to an AI.
Technical reservations of use
Works freely accessible online include websites, linked PDF files, images, audio files, raw text files and free e-books. Authors of such works have no consent right (opt-in) under § 44b UrhG, only an opt-out option. If the author does not signal an opt-out, their text may be read and used for text and data mining under the cited provision. I take such mining processes to include applications of artificial intelligence, and I am probably not alone in this view.
Incidentally, the term opt-out is not actually a synonym for a reservation of use, because an opt-out also affects the past, whereas a reservation of use affects only the future. If a reservation of use is declared after a crawler's read operation has already taken place, it has no effect on that particular read operation.
What does such a reservation of use look like technically?
For search engines and other crawlers, this option already exists. It is provided by the robots.txt file. This file follows a generally established, widely adopted, and well-known convention. Every search engine that wants to be considered law-abiding respects this file.
The robots.txt file of a website is available under the main path, for example at dr-dsgvo.de/robots.txt. It looks like this on my blog:
# robots.txt
User-agent: ia_archiver
Disallow: /
User-agent: archive.org_bot
Disallow: /
User-agent: slurp
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
Additional note: I also use dynamic bot protection that blocks some search engines as well.
My robots.txt file declares that the Internet Archive should not crawl my website. This is expressed by the User-Agent named ia_archiver together with the Disallow directive. I also prohibit ChatGPT from crawling, as the self-explanatory User-Agent name ChatGPT-User indicates.
Which User-Agent name belongs to which search engine, crawler, or AI platform is not obvious ad hoc. Large platforms publish the names of their crawlers (User-Agents). A crawler is a program that harvests content accessible online.
The entire principle of the robots.txt file rests on convention. Technically, the procedure is extremely simple. Without the convention, there would be no procedure at all.
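How simple the procedure is can be sketched with Python's standard library, which is exactly what a rule-abiding crawler does before each request. The URLs and the fictitious bot name are placeholders:

import urllib.robotparser

# Fetch and parse the robots.txt of the target site
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://dr-dsgvo.de/robots.txt")
rp.read()

# "CCBot" is blocked by the file shown above, so this prints False.
print(rp.can_fetch("CCBot", "https://dr-dsgvo.de/some-article"))

# An unknown AI crawler matches no entry in that file and may crawl by default.
print(rp.can_fetch("Brand-New-AI-Bot", "https://dr-dsgvo.de/some-article"))

Note the second call: a crawler whose name you do not know slips through, which is the core of the problem discussed next.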
For authors, reserving the use of works accessible online against an AI is practically impossible. The reason is the lack of a technical convention. And AI models that have already been trained take no account of reservations declared only after training.
Refers to Section 44b(3) of the German Copyright Act.
Suppose you want to block a new AI platform that was announced in the press yesterday. How do you do it? Until yesterday you did not even know the platform existed, so you could not have searched for the user agent you now want to block from today onward. After all, Roland or Susi could build their own AI model and use a crawler to suck up content from the internet.
You would have to find the technical names of all conceivable AI platforms: mine, all of Roland's platforms from one to 5000, all of Susi's AI platforms from one to 13847, Elon's experiments, your neighbor's, all US-based AI companies, and so on.
AI platforms can only be kept away from content available online individually, and only once their existence is known.
Technical fact.
Obviously this undertaking is doomed to fail. Firstly, you don't know all AI platforms. Secondly, you don't even want to know them all, because then you would have to research day and night, or join a service that researches day and night, which might charge fees or hurt your findability. And you don't want to block all search engines anyway, only the evil AI platforms and perhaps a few evil search engines.
At some point you would end up with a blocklist file that could look something like this. At the end of each line I have added fictional dates as comments, indicating when you would have added the entry blocking a specific AI crawler.
# Your robots.txt file
User-agent: ChatGPT-User #added on 17.04.2023
Disallow: /
User-agent: Susi-1-AI-Crawler #added on 21.05.2023
Disallow: /
User-agent: Roland-17-KI-Bot #added on 23.06.2023
Disallow: /
User-agent: Neighbor-AI-0815 #added on 15.07.2023
Disallow: /
It is also possible to define generic entries using wildcard characters, as the example below shows. However, such entries may block far too many crawlers. It may also be that some of the crawlers you block have not even launched yet.
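The catch-all entry is the only User-agent value that every rule-abiding crawler reliably understands as generic, and it illustrates the over-blocking problem: it also locks out the search engines you presumably want to keep.

User-agent: *
Disallow: /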
But the problem gets even bigger, in at least two respects.
Market power of Google and Meta
On 31.07.2023 I tried to find out the technical names of Google's and Meta's AI crawlers so that I could block them. Google Bard, like Meta's LLaMA 2, is a well-known language model. I don't want my content to end up there without being paid for it. After all, Google and Meta earn a fortune from our data. So there will be no free content from me for their AI.
Google's privacy notice, applicable from July 1, 2023, explains it as follows:
For example, we collect data that is available online or in other public sources to train our AI models as well as develop products and features like Google Translate, Bard, and Cloud AI. If your company information appears on a website, we can index it and display it in Google services.
From p. 32 of the aforementioned Google privacy notice.
It is almost certain that Google also uses its search engine crawler, and the content it reads, to train Google's AI. Google has no interest in giving you and me the opportunity to object. As evidence, I quote a question from the Google Support Forum dated March 29, 2023:

Four months after it was asked, this important question still has no answer. Moreover, Google has locked the question so that no answer can be posted anymore. Even if someone found out how to lock out the Google AI bot, that information would not appear as an answer in the Google Support Forum.
At Meta (Facebook, Instagram, WhatsApp) things appear to be the same. I could not determine the technical name of any Meta crawler used for AI training.
So with Google you have exactly one choice: either you block the entire Googlebot and no longer (or hardly) appear in Google's search results, or you let Google use your online content and works for every purpose Google reserves for itself.
For anyone who wants to block Google from their website, here is the entry for the robots.txt file:
User-agent: Googlebot
Disallow: /
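If only part of the site is to be sacrificed, the Disallow value can be narrowed to a path; the directory name here is purely hypothetical:

User-agent: Googlebot
Disallow: /internal/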
If a deeper path is specified as the value of the Disallow directive, only that part of your website is blocked. So there are few ways to counter Google's data frenzy. By the way, I find it commendable that you also pass additional data about your website's users to Google through your own site and thereby make Google even more powerful. You work hard so that Google grows even stronger, without compensation and often without a legal basis. At least you go to the trouble of installing plugins like Google Fonts, Google Maps or Google Analytics instead of local fonts, a privacy-friendly map, or Matomo.
Google argues, in my opinion, as follows:
- Data Protection: "We, Google, process no personal data at all." It seems Google doesn't know what data processing is, having declared even the Google Tag Manager unfit for work, so to speak.
- Artificial Intelligence:
- Case a: Your personal data appears in an AI answer from Google Bard. Google will say: "But you made this information publicly available yourself. We are only showing what your website shows anyone who visits your page."
- Case b: Your contributions are rendered in different words, not as a recognizable quote, when Google Bard answers users' questions to the Google AI. Google will probably say: "Our outputs are no copyright infringement, because we do not reproduce your content word for word in a recognizable form, but in entirely different words."
Authors of online texts often don't even notice case b). Case a) carries a certain explosiveness, which I will illustrate below.
Let's move on to the next problem for creators who do not want their works used by AI.
Effect only for the future
ChatGPT-4 is based on a dataset from September 2021. I myself knew nothing about ChatGPT in 2022, or had at most briefly heard of it. So most people could not possibly have declared a ban prohibiting ChatGPT from using their works.
All content read before a block on ChatGPT or other AI models was set is already in the electronic brain. Blocks declared later by an author change nothing about that; their works have already been sucked dry. Only new works or updates will, one hopes, no longer be ravaged by a third-party AI.
Data from AI models is hardly deletable
Usage reservations by authors cannot be honored quickly and easily even in traditional search engines. Possibly it does not work retroactively at all.
Even with large search engines, it can take days or weeks for a removal request to be carried out. I speak from experience here: a German city had a data breach and asked me to help remove personal data from the large search engines. The last unwanted hits only disappeared after several weeks.
As far as I know, no one is obliged to retrain an AI model after its initial training. Without further training, however, all data read into the model stays in the model. The data is not stored in raw form; rather, its structure or essence is stored. It cannot be put more precisely than that; I point to the human brain and its fuzzy way of storing information.
AI-models as electronic brains cannot forget.
My current knowledge status. Please let me know if I am wrong.
An AI model that stays as it is deletes no data concerning authors' works read online. Nor is data deleted from AI models in any other way. Even AI models that are retrained often pose this problem: with ChatGPT, version 3.5 is currently usable in Germany, and a content block helps an author little if it only affects ChatGPT-4 but not version 3.5.
Even if every larger, and thus potentially powerful, AI model were repeatedly retrained from scratch, the delay would be immense. BloombergGPT, an AI model for financial data, reportedly consumed millions of hours of extremely expensive computing time on a great many high-performance graphics cards. It simply cannot be assumed that BloombergGPT will appear in a new version every month; yearly release cycles are more realistic here.
To make unwanted information disappear from an AI model, one would probably have to resort to grounding. That procedure is uncertain, however, and better suited to eliminating false information by overlaying it with correct information. I am not aware of any ability of AI models to forget. Humans cannot really forget well either; often a single anchor point or trigger word is enough to bring a forgotten memory back up. That we humans do not remember everything may simply be because the hardware in our heads is not built for persistence. Electronic brains are different: as long as there is enough power, or backups, the information anchored in the brain is indelible.
AI vs. search engine
Viewed functionally, an artificial intelligence is not a search engine. Sure, facts can be extracted with a language model too, but those facts are often outdated due to long training times and widely spaced training intervals. Truly current facts are hardly ever found in AI models.
For exact search, which classical search engines do exceptionally well, an AI system is not suited out of the box. An AI system rather resembles a semantic, structural, or fuzzy search.
Technically, such an AI system is spoken of as a vector search engine.
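The principle behind a vector search can be sketched in a few lines. The embed function below is a toy stand-in: a real vector search engine uses a trained embedding model, so the similarity scores here are mechanically correct but not semantically meaningful.

import numpy as np

# Toy embedding: maps a text to a unit vector. A real system would use a
# trained embedding model here; with this stand-in the ranking is arbitrary.
def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

documents = [
    "Cookies are small data records",
    "robots.txt keeps rule-abiding crawlers out",
    "The GDPR regulates personal data",
]
doc_vectors = np.stack([embed(d) for d in documents])

# Fuzzy search: rank all documents by cosine similarity to the query.
query_vec = embed("What does robots.txt do?")
scores = doc_vectors @ query_vec  # unit vectors: dot product = cosine
print(documents[int(np.argmax(scores))])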
From a data protection perspective, it does not matter how a system is constructed. Individuals, as the subjects of their data, have the right to be delisted from search results (ECJ ruling of 24.09.2019, Case C-507/17). Google must therefore ensure that personal data disappears from search results at the data subject's request. The answers an AI gives to a search query are personal data as well.
In the Bing search engine, for example, complex questions have recently become possible alongside normal search terms. Bing answers them with the help of its AI. This alone makes clear that, for a person who wants to know something, it cannot make a difference whether the system involved is a classical search engine like DuckDuckGo, an AI-supported search engine like Bing, or a chatbot like ChatGPT.
In passing, it should be noted that Bing often gives false answers. This has less to do with hallucinations than with alternative truths that are unfortunately often taken for truth. According to Bing, cookies are text files.

As evidence for its answer, Bing cites, among other things, one of my own articles, in which I prove exactly the opposite. With a privacy-friendly AI system, which companies can operate themselves and without Microsoft, Google or ChatGPT, that would not have happened. The Bing AI is therefore dangerous and does not even indicate it. Instead, it suggests another search term: "Are cookies dangerous?"
Erasable information in AI search engines
An AI is not a search engine, but it is sometimes used as one, as Bing shows. The approach arose from resource scarcity (hardware, computing time) and works as follows (a minimal sketch follows after the list):
- A search is run across the entire document collection, the so-called search index. This is analogous to a search engine, which, however, searches exactly, and more precisely than an AI.
- The documents best matching the question are selected.
- The AI is queried against the selected documents only.
- The AI answers with knowledge from the selected documents, employing its language abilities in the process.
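A minimal sketch of this retrieve-then-answer pattern, reusing the toy embed function and documents list from the sketch above. The ask_model function is a placeholder for whatever language model the operator runs:

import numpy as np

# Index built from per-document embeddings; entries are individually erasable.
index = {doc: embed(doc) for doc in documents}

def delete_document(doc: str) -> None:
    # Unlike trained model weights, index entries are trivially deletable.
    index.pop(doc, None)

def ask_model(prompt: str) -> str:
    # Placeholder: a real system would call its language model here.
    return "[language model would answer here]\n" + prompt

def answer(question: str, top_k: int = 2) -> str:
    q = embed(question)
    ranked = sorted(index, key=lambda d: float(index[d] @ q), reverse=True)
    context = "\n".join(ranked[:top_k])
    return ask_model(f"Answer only from this context:\n{context}\n\nQuestion: {question}")

print(answer("Are cookies text files?"))

The delete_document step is the point of the next paragraph: the index is erasable even though the model itself is not.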
Documents can thus be deleted from the search index of an AI search just as in a conventional search engine. However, such AI search engines, as I would like to call them here, are quite unreliable, as Bing shows. Bing is ultimately not really usable, and certainly not for documents from your own company.
The hallucinations of an AI, as they are observable in the AI-driven Bing search engine, can be avoided in company-owned AI systems.
Please feel free to contact me if you're interested.
What Bing lacks is effective grounding. Bing cannot deliver it because the resources for it are still too scarce at Microsoft, at least in my opinion, which is based on knowledge of the technical details of AI models and their hardware requirements.
The situation is more favorable with company-owned AI systems, which will be covered shortly in a separate article on Dr. DSGVO. These systems can apply grounding and thus combine two advantages:
- Current knowledge is available.
- Answers to questions posed to this knowledge are quite precise.
Hallucinations can be avoided in local AI systems that have nothing to do with Microsoft, Google, Meta or ChatGPT, and only in such local systems. Have you ever considered such an AI system for your company? It doesn't cost a fortune.
Texts, images, and other media: copyright?
What goes for texts accessible online also goes for images accessible online. Here the dilemma is perhaps even greater, because one often can no longer tell from an AI-generated image which sources it came from. Image generators like Midjourney or DALL-E combine at least several, often many, images. The LAION-5B dataset, which is very often used with Stable Diffusion image pipelines, allows an image similarity search.
I performed the following steps with the LAION dataset to see whether generated AI images were similar to the original material available online (a sketch of such a similarity check follows below):
- Generate an image with an AI image generator.
- Search the LAION dataset, which comprises nearly six billion images, for images similar to the generated one.
- Each time, the similarity of the generated image to images from the dataset was so low that, as a human, I cannot recognize any copyright infringement, even under very strict examination.
My tests were not exhaustive, just spot checks. I have, however, already generated thousands of AI images with a local AI system.
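Such a similarity check can be sketched with CLIP embeddings, the same kind of representation LAION's similarity search is built on. The model name, pretrained weights, and file paths here are assumptions for illustration:

import torch
import open_clip
from PIL import Image

# Load an open CLIP model (these example weights were trained on LAION data).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)

def image_embedding(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(img)
    return emb / emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity

# Cosine similarity between a generated image and a dataset candidate:
# values near 1.0 suggest near-duplicates, low values suggest no visible relation.
sim = (image_embedding("generated.png") @ image_embedding("laion_candidate.jpg").T).item()
print(f"similarity: {sim:.3f}")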
Image generators often produce images that are completely different from the original source images (training data). Therefore, copyright no longer applies here.
For the training itself, however, the very favorable conditions that the UrhG grants AI models apply.
With texts, too, I regularly see that my chosen AI model renders content in a form quite different from the original. I therefore do not think the original work can still be identified in it. This need not always be so clear-cut, as court rulings on poems show. If a company uses an AI model, however, it can counter this problem in several ways.
Firstly, autonomous AI systems can be equipped with freely selectable training data. Secondly, the output can be kept non-public, for example within the company network. Lawyers know better than I do how this plays out under copyright law, but one thing is certain: "What I [as the author] don't know won't hurt me." The risk of using data non-publicly is significantly lower than that of publishing the results. Thirdly, company-owned AI systems can be equipped with protective mechanisms of any kind. Best of all is the economics: what used to cost a fortune is now affordable. Your company does not need ChatGPT (and if it did, I would like to know why; as a search engine, at least, certainly not).
Conclusion
Information that has once landed in an artificial intelligence model cannot easily be erased from this electronic brain. It is even harder to prevent one's own online works from landing in AI models in the first place.
Thus one's own content is doomed to be sucked up by the large AI platforms. Objecting to this is at best possible in the form of a list, and may not cover all types of works. Personal data enjoys more protection than texts, whose essence is assimilated by third-party AI and thereby removed from the control of the original author.
Google acts particularly perfidiously and uses all content it reads for every permitted purpose. That covers the search engine as well as the AI named Google Bard, plus everything else Google will come up with. The same appears to apply to Meta.
Texts that are not primarily written as encyclopedic articles may elude AI models, because what's important is often between the lines.
In the medium term, creators of works available online will have no way to prohibit an AI from using their works.
See article.
The author's reservation of use for works accessible online is practically unregulated in technical terms and thus hardly implementable in practice. Only against globally known systems like ChatGPT can such a reservation be implemented halfway.
Information cannot be deleted from AI models in the short term, however. Rather, an AI model would have to be retrained from scratch, which is very time-consuming and therefore rarely happens. Until then, one's own works remain available in a third-party AI, often without the author knowing anything about it.
It is not ruled out that mathematical approaches will emerge for deliberately deleting individual data from an AI model. I have heard nothing of the sort, though, and could find nothing reliable on it either. I also consider it difficult and rather doubt that such a mechanism will exist in practical form within the next 12 months.
As long as the technically simple task of the usage reservation is not solved analogously to search engine crawlers, all content creators are worse off than they would like to be.
Legal regulations will likely be issued at EU level to better protect authors' data from being scraped by AI crawlers. But it is already too late for that, especially by the time those regulations take effect. Small businesses are once again the fools. Google and the other large corporations will simply keep using the internet's treasure trove (unless you no longer want to appear in Google's search results). And whoever operates large crawlers can also search specifically for content whose use has not been prohibited.
Technology beats law, because technology happens at light speed and law moves at a snail's pace.
A lawsuit is currently pending against LAION: a photographer wants his photos removed from the LAION dataset retroactively. Normally these images are not stored at LAION at all (there are indications that in this case they may be, which is, however, not necessary for building AI models). Regardless, the LAION dataset is used worldwide by numerous image generator models. Control over its individual components (here: images) appears impossible.
ChatGPT used the Common Crawl dataset for AI training. This dataset is a snapshot of parts of the internet, some of it selected randomly. As soon as a technical convention for a usage reservation exists (analogous to robots.txt), things will get uncomfortable for all AI models that use an up-to-date Common Crawl dataset. Until then, many months or even years are likely to pass. Legally, there are also ways out. OpenAI could, for example, claim to have used ChatGPT-4 as the basis for a future ChatGPT-5 (fine-tuning) instead of training version 5 from scratch. The dataset behind ChatGPT-4 seems legitimate as far as authors' usage reservations are concerned, because in September 2021 there were practically no such reservations.
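Whether one's own pages already sit in a Common Crawl snapshot can at least be checked against the public Common Crawl index API. The crawl ID and domain below are examples; newer snapshots carry different IDs:

import json
import urllib.request

# Query the public Common Crawl index for captures of a domain.
# "CC-MAIN-2023-23" is one example crawl ID among many.
url = ("https://index.commoncrawl.org/CC-MAIN-2023-23-index"
       "?url=dr-dsgvo.de/*&output=json&limit=5")

with urllib.request.urlopen(url) as resp:
    for line in resp:
        record = json.loads(line)
        print(record["url"], record["timestamp"])

Finding your pages there tells you they were captured; it does not tell you which AI models were trained on that capture.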
Summary
The essence of the contribution and the consequences in bullet points:
- Technically, a usage reservation by which authors prohibit AI models from sucking up their works accessible online is not possible (at least not at present).
- A usage reservation according to § 44b UrhG only affects the future. Already trained AI models remain as they are.
- There is no consent requirement for authors of freely accessible online works against AI models.
- AI models cannot forget, or if they can, only with great effort and considerable delay.
- Already trained AI models do not consider usage reservations declared only after training.
- Tough times are ahead for creators. Whatever a human can and may do with third-party works, a machine can certainly do too (and probably get away with it).
- Citing sources changes nothing for an AI model, because usage reservations have so far been declared only occasionally in practice. ([1])
- Google, of course, uses all crawler data both for the search engine and for Google Bard and the like. Given Google's market power, control by authors is thus currently practically impossible.
- Numerous legal excuses are conceivable to give AI models the appearance of legitimacy.
Key messages
AI can easily access and use online content, raising concerns for creators about copyright protection. While laws exist to allow authors to restrict AI usage, these are difficult to implement effectively.
Authors currently have limited control over how their online work is used by AI due to a lack of technical standards for restricting access.
It's impossible to effectively block all AI platforms because there are too many and their names are constantly changing.
It's nearly impossible to stop Google and AI models like ChatGPT from using your online content.
AI models can't forget information they were trained on, making it difficult to remove personal data from the internet even after requests.
AI-powered search engines like Bing can be unreliable and prone to giving incorrect information because they lack a strong connection to real-world data.
Using AI for your company can be cost-effective, but it's important to consider copyright issues as AI models learn from existing data and may generate outputs similar to copyrighted material.
AI models can easily access and use online content without permission from the creators, making it difficult for authors to control how their work is used.
Currently, there are no effective legal ways to prevent AI models from using copyrighted material, even if authors object.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
