Under the law, creators of works accessible online can declare a reservation of use. This is meant to protect works from flowing into electronic brains. Does this approach work? This article outlines the possibilities and the limitations.
Introduction
Artificial intelligence has enormous capabilities that often far surpass those of the averagely intelligent human. The Turing Test, which checks whether a computer is as intelligent as a human, is now considered passed. As ChatGPT shows, an AI can even outperform humans in certain areas, at least when averaged over all people. An AI knows no fatigue and, unlike a human with a very limited brain, can always fall back on better hardware. From my point of view, the only remaining advantages of humans are the senses and the ability to explore and perceive the environment. This will soon change greatly in favor of artificial systems.
AI models can hoover up texts and images from authors online almost at will, and do so with legal legitimation. The law gives authors the right to a reservation of use, but that right is effectively toothless. The reasons are purely organizational and technical.
These astonishing abilities of AI are frightening at the same time. Creators worry that their works will now be sucked up and dissected by an electronic brain. Google has been doing this for a long time; it just never caused the same outrage: someone enters a search term into the search engine. Instead of your website appearing for that term, so that you catch the user and use them for your legitimate purposes, the answer is shown as an extract of your content directly in the search engine. The user never even lands on your website; they are siphoned off beforehand. You are the content provider and the fool. Google is pleased. The user doesn't care.
Many authors of works available online have therefore demanded a consent requirement: an author would have to permit a machine-learning model to use their work. Others demand only what the law already provides, namely an opt-out option. This is anchored in § 44b Abs. 3 UrhG and reads as follows:
Uses pursuant to paragraph 2 sentence 1 [reproduction of lawfully accessible works for text and data mining] are only permitted if the rights holder has not reserved them. A reservation of use for works accessible online is only effective if it is made in machine-readable form.
Section 44b(3) of the Copyright Act (UrhG)
Furthermore, copies of copyrighted works made for purposes of artificial intelligence must be deleted as soon as they are no longer needed. That is no real obstacle: if you have read a text thoroughly, you know what it said even without the original. The same applies to an AI.
Technical reservations of use
Works freely accessible online include websites, linked PDF files, images, audio files, raw text files and free e-books. Authors of such works have no consent right (opt-in) under § 44b UrhG, only an opt-out option. If the author does not signal an opt-out, their text may be read and used for text and data mining under the cited provision. I take such mining processes to include applications of artificial intelligence, and I am probably not alone in this view.
Incidentally, the term opt-out is not actually a synonym for a reservation of use, because an opt-out also affects the past, whereas a reservation of use affects only the future. If a reservation of use is declared after a crawler's read operation has already taken place, it has no effect on that particular read operation.
What does such a reservation of use look like technically?
For search engines and other crawlers, this option already exists. It is provided by the robots.txt file. This file follows a generally established, widely adopted, and well-known convention. Every search engine that wants to be considered law-abiding respects this file.
The robots.txt file of a website is available under the main path, for example at dr-dsgvo.de/robots.txt. It looks like this on my blog:
# robots.txt
User-agent: ia_archiver
Disallow: /
User-agent: archive.org_bot
Disallow: /
User-agent: slurp
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
Additional note: I also use dynamic bot protection that blocks some search engines as well.
My robots.txt file declares that the Internet Archive should not crawl my website. This is expressed by the User-Agent named ia_archiver together with the Disallow directive. I also prohibit ChatGPT from crawling, as the self-explanatory User-Agent name ChatGPT-User indicates.
Which User-Agent name belongs to which search engine, crawler, or AI platform is not obvious ad hoc. Large platforms publish the names of their crawlers (User-Agents). A crawler is a program that harvests content accessible online.
The entire principle of the robots.txt file rests on convention. Technically, the procedure is extremely simple. Without the convention, there would be no procedure at all.
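How simple the procedure is can be sketched with Python's standard library, which is exactly what a rule-abiding crawler does before each request. The URLs and the fictitious bot name are placeholders:

import urllib.robotparser

# Fetch and parse the robots.txt of the target site
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://dr-dsgvo.de/robots.txt")
rp.read()

# "CCBot" is blocked by the file shown above, so this prints False.
print(rp.can_fetch("CCBot", "https://dr-dsgvo.de/some-article"))

# An unknown AI crawler matches no entry in that file and may crawl by default.
print(rp.can_fetch("Brand-New-AI-Bot", "https://dr-dsgvo.de/some-article"))

Note the second call: a crawler whose name you do not know slips through, which is the core of the problem discussed next.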
For authors, reserving the use of works accessible online against an AI is practically impossible. The reason is the lack of a technical convention. And AI models that have already been trained take no account of reservations declared only after training.
Refers to Section 44b(3) of the German Copyright Act.
Suppose you want to block a new AI platform that was announced in the press yesterday. How do you do it? Until yesterday you did not even know the platform existed, so you could not have searched for the user agent you now want to block from today onward. After all, Roland or Susi could build their own AI model and use a crawler to suck up content from the internet.
You would have to find the technical names of all conceivable AI platforms: mine, all of Roland's platforms from one to 5000, all of Susi's AI platforms from one to 13847, Elon's experiments, your neighbor's, all US-based AI companies, and so on.
AI platforms can only be kept away from content available online individually, and only once their existence is known.
Technical fact.
Obviously this undertaking is doomed to fail. Firstly, you don't know all AI platforms. Secondly, you don't even want to know them all, because then you would have to research day and night, or join a service that researches day and night, which might charge fees or hurt your findability. And you don't want to block all search engines anyway, only the evil AI platforms and perhaps a few evil search engines.
At some point you would end up with a blocklist file that could look something like this. At the end of each line I have added fictional dates as comments, indicating when you would have added the entry blocking a specific AI crawler.
# Your robots.txt file
User-agent: ChatGPT-User #added on 17.04.2023
Disallow: /
User-agent: Susi-1-AI-Crawler #added on 21.05.2023
Disallow: /
User-agent: Roland-17-KI-Bot #added on 23.06.2023
Disallow: /
User-agent: Neighbor-AI-0815 #added on 15.07.2023
Disallow: /
It is also possible to define generic entries using wildcard characters, as the example below shows. However, such entries may block far too many crawlers. It may also be that some of the crawlers you block have not even launched yet.
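The catch-all entry is the only User-agent value that every rule-abiding crawler reliably understands as generic, and it illustrates the over-blocking problem: it also locks out the search engines you presumably want to keep.

User-agent: *
Disallow: /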
But the problem gets even bigger, in at least two respects.
Market power of Google and Meta
On 31.07.2023 I tried to find out the technical names of Google's and Meta's AI crawlers so that I could block them. Google Bard, like Meta's LLaMA 2, is a well-known language model. I don't want my content to end up there without being paid for it. After all, Google and Meta earn a fortune from our data. So there will be no free content from me for their AI.
Google's privacy notice, applicable from July 1, 2023, explains it as follows:
For example, we collect data that is available online or in other public sources to train our AI models as well as develop products and features like Google Translate, Bard, and Cloud AI. If your company information appears on a website, we can index it and display it in Google services.
From p. 32 of the aforementioned Google privacy notice.
It is almost certain that Google also uses its search engine crawler, and the content it reads, to train Google's AI. Google has no interest in giving you and me the opportunity to object. As evidence, I quote a question from the Google Support Forum dated March 29, 2023:

Four months after it was asked, this important question still has no answer. Moreover, Google has locked the question so that no answer can be posted anymore. Even if someone found out how to lock out the Google AI bot, that information would not appear as an answer in the Google Support Forum.
At Meta (Facebook, Instagram, WhatsApp) things appear to be the same. I could not determine the technical name of any Meta crawler used for AI training.
So with Google you have exactly one choice: either you block the entire Googlebot and no longer (or hardly) appear in Google's search results, or you let Google use your online content and works for every purpose Google reserves for itself.
For anyone who wants to block Google from their website, here is the entry for the robots.txt file:
User-agent: Googlebot
Disallow: /
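If only part of the site is to be sacrificed, the Disallow value can be narrowed to a path; the directory name here is purely hypothetical:

User-agent: Googlebot
Disallow: /internal/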
If a deeper path is specified as the value of the Disallow directive, only that part of your website is blocked. So there are few ways to counter Google's data frenzy. By the way, I find it commendable that you also pass additional data about your website's users to Google through your own site and thereby make Google even more powerful. You work hard so that Google grows even stronger, without compensation and often without a legal basis. At least you go to the trouble of installing plugins like Google Fonts, Google Maps or Google Analytics instead of local fonts, a privacy-friendly map, or Matomo.
Google argues, in my opinion, as follows:
- Data Protection: "We, Google, process no personal data at all." It seems Google doesn't know what data processing is, having declared even the Google Tag Manager unfit for work, so to speak.
- Artificial Intelligence:
- Case a: Your personal data appears in an AI answer from Google Bard. Google will say: "But you made this information publicly available yourself. We are only showing what your website shows anyone who visits your page."
- Case b: Your contributions are rendered in different words, not as a recognizable quote, when Google Bard answers users' questions to the Google AI. Google will probably say: "Our outputs are no copyright infringement, because we do not reproduce your content word for word in a recognizable form, but in entirely different words."
Authors of online texts often don't even notice case b). Case a) carries a certain explosiveness, which I will illustrate below.
Let's move on to the next problem for creators who do not want their works used by AI.
Effect only for the future
ChatGPT-4 is based on a dataset from September 2021. I myself knew nothing about ChatGPT in 2022, or had at most briefly heard of it. So most people could not possibly have declared a ban prohibiting ChatGPT from using their works.
All content read before a block on ChatGPT or other AI models was set is already in the electronic brain. Blocks declared later by an author change nothing about that; their works have already been sucked dry. Only new works or updates will, one hopes, no longer be ravaged by a third-party AI.
Data from AI models is hardly deletable
Usage reservations by authors cannot be honored quickly and easily even in traditional search engines. Possibly it does not work retroactively at all.
Even with large search engines, it can take days or weeks for a removal request to be carried out. I speak from experience here: a German city had a data breach and asked me to help remove personal data from the large search engines. The last unwanted hits only disappeared after several weeks.
As far as I know, no one is obliged to retrain an AI model after its initial training. Without further training, however, all data read into the model stays in the model. The data is not stored in raw form; rather, its structure or essence is stored. It cannot be put more precisely than that; I point to the human brain and its fuzzy way of storing information.
AI-models as electronic brains cannot forget.
My current knowledge status. Please let me know if I am wrong.
An AI model that stays as it is deletes no data concerning authors' works read online. Nor is data deleted from AI models in any other way. Even AI models that are retrained often pose this problem: with ChatGPT, version 3.5 is currently usable in Germany, and a content block helps an author little if it only affects ChatGPT-4 but not version 3.5.
Even if every larger, and thus potentially powerful, AI model were repeatedly retrained from scratch, the delay would be immense. BloombergGPT, an AI model for financial data, reportedly consumed millions of hours of extremely expensive computing time on a great many high-performance graphics cards. It simply cannot be assumed that BloombergGPT will appear in a new version every month; yearly release cycles are more realistic here.
To make unwanted information disappear from an AI model, one would probably have to resort to grounding. That procedure is uncertain, however, and better suited to eliminating false information by overlaying it with correct information. I am not aware of any ability of AI models to forget. Humans cannot really forget well either; often a single anchor point or trigger word is enough to bring a forgotten memory back up. That we humans do not remember everything may simply be because the hardware in our heads is not built for persistence. Electronic brains are different: as long as there is enough power, or backups, the information anchored in the brain is indelible.
AI vs. search engine
Viewed functionally, an artificial intelligence is not a search engine. Sure, facts can be extracted with a language model too, but those facts are often outdated due to long training times and widely spaced training intervals. Truly current facts are hardly ever found in AI models.
For exact search, which classical search engines do exceptionally well, an AI system is not suited out of the box. An AI system rather resembles a semantic, structural, or fuzzy search.
Technically, such an AI system is spoken of as a vector search engine.
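The principle behind a vector search can be sketched in a few lines. The embed function below is a toy stand-in: a real vector search engine uses a trained embedding model, so the similarity scores here are mechanically correct but not semantically meaningful.

import numpy as np

# Toy embedding: maps a text to a unit vector. A real system would use a
# trained embedding model here; with this stand-in the ranking is arbitrary.
def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

documents = [
    "Cookies are small data records",
    "robots.txt keeps rule-abiding crawlers out",
    "The GDPR regulates personal data",
]
doc_vectors = np.stack([embed(d) for d in documents])

# Fuzzy search: rank all documents by cosine similarity to the query.
query_vec = embed("What does robots.txt do?")
scores = doc_vectors @ query_vec  # unit vectors: dot product = cosine
print(documents[int(np.argmax(scores))])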
From a data protection perspective, it does not matter how a system is constructed. Individuals, as the subjects of their data, have the right to be delisted from search results (ECJ ruling of 24.09.2019, Case C-507/17). Google must therefore ensure that personal data disappears from search results at the data subject's request. The answers an AI gives to a search query are personal data as well.
In the Bing search engine, for example, complex questions have recently become possible alongside normal search terms. Bing answers them with the help of its AI. This alone makes clear that, for a person who wants to know something, it cannot make a difference whether the system involved is a classical search engine like DuckDuckGo, an AI-supported search engine like Bing, or a chatbot like ChatGPT.
In passing, it should be noted that Bing often gives false answers. This has less to do with hallucinations than with alternative truths that are unfortunately often taken for truth. According to Bing, cookies are text files.

As evidence for its answer, Bing cites, among other things, one of my own articles, in which I prove exactly the opposite. With a privacy-friendly AI system, which companies can operate themselves and without Microsoft, Google or ChatGPT, that would not have happened. The Bing AI is therefore dangerous and does not even indicate it. Instead, it suggests another search term: "Are cookies dangerous?"
Erasable information in AI search engines
An AI is not a search engine, but it is sometimes used as one, as Bing shows. The approach arose from resource scarcity (hardware, computing time) and works as follows (a minimal sketch follows after the list):
- A search is run across the entire document collection, the so-called search index. This is analogous to a search engine, which, however, searches exactly, and more precisely than an AI.
- The documents best matching the question are selected.
- The AI is queried against the selected documents only.
- The AI answers with knowledge from the selected documents, employing its language abilities in the process.
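A minimal sketch of this retrieve-then-answer pattern, reusing the toy embed function and documents list from the sketch above. The ask_model function is a placeholder for whatever language model the operator runs:

import numpy as np

# Index built from per-document embeddings; entries are individually erasable.
index = {doc: embed(doc) for doc in documents}

def delete_document(doc: str) -> None:
    # Unlike trained model weights, index entries are trivially deletable.
    index.pop(doc, None)

def ask_model(prompt: str) -> str:
    # Placeholder: a real system would call its language model here.
    return "[language model would answer here]\n" + prompt

def answer(question: str, top_k: int = 2) -> str:
    q = embed(question)
    ranked = sorted(index, key=lambda d: float(index[d] @ q), reverse=True)
    context = "\n".join(ranked[:top_k])
    return ask_model(f"Answer only from this context:\n{context}\n\nQuestion: {question}")

print(answer("Are cookies text files?"))

The delete_document step is the point of the next paragraph: the index is erasable even though the model itself is not.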
Documents can thus be deleted from the search index of an AI search just as in a conventional search engine. However, such AI search engines, as I would like to call them here, are quite unreliable, as Bing shows. Bing is ultimately not really usable, and certainly not for documents from your own company.
The hallucinations of an AI, as they are observable in the AI-driven Bing search engine, can be avoided in company-owned AI systems.
Please feel free to contact me if you're interested.
What Bing lacks is effective grounding. Bing cannot deliver it because the resources for it are still too scarce at Microsoft, at least in my opinion, which is based on knowledge of the technical details of AI models and their hardware requirements.
The situation is more favorable with company-owned AI systems, which will be covered shortly in a separate article on Dr. DSGVO. These systems can apply grounding and thus combine two advantages:
- Current knowledge is available.
- Answers to questions posed to this knowledge are quite precise.
Hallucinations can be avoided in local AI systems that have nothing to do with Microsoft, Google, Meta or ChatGPT, and only in such local systems. Have you ever considered such an AI system for your company? It doesn't cost a fortune.
Texts, images, and other media: copyright?
What goes for texts accessible online also goes for images accessible online. Here the dilemma is perhaps even greater, because one often can no longer tell from an AI-generated image which sources it came from. Image generators like Midjourney or DALL-E combine at least several, often many, images. The LAION-5B dataset, which is very often used with Stable Diffusion image pipelines, allows an image similarity search.
I performed the following steps with the LAION dataset to see whether generated AI images were similar to the original material available online (a sketch of such a similarity check follows below):
- Generate an image with an AI image generator.
- Search the LAION dataset, which comprises nearly six billion images, for images similar to the generated one.
- Each time, the similarity of the generated image to images from the dataset was so low that, as a human, I cannot recognize any copyright infringement, even under very strict examination.
My tests were not exhaustive, just spot checks. I have, however, already generated thousands of AI images with a local AI system.
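Such a similarity check can be sketched with CLIP embeddings, the same kind of representation LAION's similarity search is built on. The model name, pretrained weights, and file paths here are assumptions for illustration:

import torch
import open_clip
from PIL import Image

# Load an open CLIP model (these example weights were trained on LAION data).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)

def image_embedding(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(img)
    return emb / emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity

# Cosine similarity between a generated image and a dataset candidate:
# values near 1.0 suggest near-duplicates, low values suggest no visible relation.
sim = (image_embedding("generated.png") @ image_embedding("laion_candidate.jpg").T).item()
print(f"similarity: {sim:.3f}")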
Image generators often produce images that are completely different from the original source images (training data). Therefore, copyright no longer applies here.
For the training itself, however, the very favorable conditions that the UrhG grants AI models apply.
With texts, too, I regularly see that my chosen AI model renders content in a form quite different from the original. I therefore do not think the original work can still be identified in it. This need not always be so clear-cut, as court rulings on poems show. If a company uses an AI model, however, it can counter this problem in several ways.
Firstly, autonomous AI systems can be equipped with freely selectable training data. Secondly, the output can be kept non-public, for example within the company network. Lawyers know better than I do how this plays out under copyright law, but one thing is certain: "What I [as the author] don't know won't hurt me." The risk of using data non-publicly is significantly lower than that of publishing the results. Thirdly, company-owned AI systems can be equipped with protective mechanisms of any kind. Best of all is the economics: what used to cost a fortune is now affordable. Your company does not need ChatGPT (and if it did, I would like to know why; as a search engine, at least, certainly not).
Conclusion
Information that has once landed in an artificial intelligence model cannot easily be erased from this electronic brain. It is even harder to prevent one's own online works from landing in AI models in the first place.
Thus one's own content is doomed to be sucked up by the large AI platforms. Objecting to this is at best possible in the form of a list, and may not cover all types of works. Personal data enjoys more protection than texts, whose essence is assimilated by third-party AI and thereby removed from the control of the original author.
Google acts particularly perfidiously and uses all content it reads for every permitted purpose. That covers the search engine as well as the AI named Google Bard, plus everything else Google will come up with. The same appears to apply to Meta.
Texts that are not primarily written as encyclopedic articles may elude AI models, because what's important is often between the lines.
In the medium term, creators of works available online will have no way to prohibit an AI from using their works.
See article.
The author's reservation of use for works accessible online is practically unregulated in technical terms and thus hardly implementable in practice. Only against globally known systems like ChatGPT can such a reservation be implemented halfway.
Information cannot be deleted from AI models in the short term, however. Rather, an AI model would have to be retrained from scratch, which is very time-consuming and therefore rarely happens. Until then, one's own works remain available in a third-party AI, often without the author knowing anything about it.
It is not ruled out that mathematical approaches will emerge for deliberately deleting individual data from an AI model. I have heard nothing of the sort, though, and could find nothing reliable on it either. I also consider it difficult and rather doubt that such a mechanism will exist in practical form within the next 12 months.
As long as the technically simple task of the usage reservation is not solved analogously to search engine crawlers, all content creators are worse off than they would like to be.
Legal regulations will likely be issued at EU level to better protect authors' data from being scraped by AI crawlers. But it is already too late for that, especially by the time those regulations take effect. Small businesses are once again the fools. Google and the other large corporations will simply keep using the internet's treasure trove (unless you no longer want to appear in Google's search results). And whoever operates large crawlers can also search specifically for content whose use has not been prohibited.
Technology beats law, because technology happens at light speed and law moves at a snail's pace.
A lawsuit is currently pending against LAION: a photographer wants his photos removed from the LAION dataset retroactively. Normally these images are not stored at LAION at all (there are indications that in this case they may be, which is, however, not necessary for building AI models). Regardless, the LAION dataset is used worldwide by numerous image generator models. Control over its individual components (here: images) appears impossible.
ChatGPT used the Common Crawl dataset for AI training. This dataset is a snapshot of parts of the internet, some of it selected randomly. As soon as a technical convention for a usage reservation exists (analogous to robots.txt), things will get uncomfortable for all AI models that use an up-to-date Common Crawl dataset. Until then, many months or even years are likely to pass. Legally, there are also ways out. OpenAI could, for example, claim to have used ChatGPT-4 as the basis for a future ChatGPT-5 (fine-tuning) instead of training version 5 from scratch. The dataset behind ChatGPT-4 seems legitimate as far as authors' usage reservations are concerned, because in September 2021 there were practically no such reservations.
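Whether one's own pages already sit in a Common Crawl snapshot can at least be checked against the public Common Crawl index API. The crawl ID and domain below are examples; newer snapshots carry different IDs:

import json
import urllib.request

# Query the public Common Crawl index for captures of a domain.
# "CC-MAIN-2023-23" is one example crawl ID among many.
url = ("https://index.commoncrawl.org/CC-MAIN-2023-23-index"
       "?url=dr-dsgvo.de/*&output=json&limit=5")

with urllib.request.urlopen(url) as resp:
    for line in resp:
        record = json.loads(line)
        print(record["url"], record["timestamp"])

Finding your pages there tells you they were captured; it does not tell you which AI models were trained on that capture.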
Summary
The essence of the contribution and the consequences in bullet points:
- Technically, a usage reservation by which authors prohibit AI models from sucking up their works accessible online is not possible (at least not at present).
- A usage reservation according to § 44b UrhG only affects the future. Already trained AI models remain as they are.
- There is no consent requirement for authors of freely accessible online works against AI models.
- AI models cannot forget, or if they can, only with great effort and considerable delay.
- Already trained AI models do not consider usage reservations declared only after training.
- Tough times are ahead for creators. Whatever a human can and may do with third-party works, a machine can certainly do too (and probably get away with it).
- Citing sources changes nothing for an AI model, because usage reservations have so far been declared only occasionally in practice. ([1])
- Google, of course, uses all crawler data both for the search engine and for Google Bard and the like. Given Google's market power, control by authors is thus currently practically impossible.
- Numerous legal excuses are conceivable to give AI models the appearance of legitimacy.
Key messages
AI can easily access and use online content, raising concerns for creators about copyright protection. While laws exist to allow authors to restrict AI usage, these are difficult to implement effectively.
Authors currently have limited control over how their online work is used by AI due to a lack of technical standards for restricting access.
It's impossible to effectively block all AI platforms because there are too many and their names are constantly changing.
It's nearly impossible to stop Google and AI models like ChatGPT from using your online content.
AI models can't forget information they were trained on, making it difficult to remove personal data from the internet even after requests.
AI-powered search engines like Bing can be unreliable and prone to giving incorrect information because they lack a strong connection to real-world data.
Using AI for your company can be cost-effective, but it's important to consider copyright issues as AI models learn from existing data and may generate outputs similar to copyrighted material.
AI models can easily access and use online content without permission from the creators, making it difficult for authors to control how their work is used.
Currently, there are no effective legal ways to prevent AI models from using copyrighted material, even if authors object.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
