
Myths and Misconceptions about Artificial Intelligence


Everyone finds AI somehow cool, so everyone voices an opinion about it. Because AI is a technically complex field, numerous half-truths and false claims circulate, fueled by marketing promises from Microsoft and others. This article clarifies what is correct and what belongs in the realm of alternative facts.

Common Misconceptions about AI

Because media coverage is often one-sided and revolves repeatedly around ChatGPT or Microsoft's AI products, numerous misconceptions have taken hold. Some of them:

  • Language models are supposedly based on statistics and therefore not intelligent.
  • An AI system is an algorithm.
  • AI is merely a tool.
  • An AI system could work with perfect precision.
  • ChatGPT is not intelligent.
  • ChatGPT is the best solution.
  • AI can be equated with ChatGPT (OpenAI), Claude (Anthropic), Mixtral (Mistral), or Command R+ (Cohere); there is nothing else.
  • Data is safe at Microsoft.
  • AI can be operated in a legally compliant manner.
  • Tokens are not real data.
  • AI models do not store personal data.

From these wrong assumptions, false conclusions often follow. Some of them are addressed below. As a stand-in for other cloud services, only ChatGPT will be discussed hereafter.

Falsehoods about AI

The following statements were encountered in this or a similar form on social media. Where they were found, they were either formulated in general terms, or specific and nevertheless still wrong.

ChatGPT is not intelligent

According to the definition of AI used on Dr. DSGVO, ChatGPT is intelligent. This definition is:

By Alan Turing's standard, too, ChatGPT is intelligent. Turing, the brilliant mathematician who helped break the Enigma cipher machine during World War II, proposed the Turing Test: it checks whether a machine's responses are indistinguishable from those of a human. ChatGPT passes it. On the contrary: ChatGPT often (one could almost say: always) delivers significantly better answers than the averagely intelligent human.

The new definition in the EU's AI Regulation (hopefully) also classifies ChatGPT as intelligent; see Article 3 of the AI Act of 12 July 2024.

What is intelligence? The definition of artificial intelligence was just given. Simply strike the attribute "artificial" (it appears twice) from that definition, and what remains is a definition of intelligence. Humans have no claim to a monopoly on intelligence, even if many would like it that way.

Intelligence is based on human standards

Many believe that intelligence is something humans define. In a now-revised definition of what constitutes Artificial Intelligence, the EU's AI Regulation had written that Artificial Intelligence should aim to achieve goals "defined by humans…".

There is not a single good reason for this misjudgment. Humans are irrelevant to determining what intelligence is. Until now, they could at most serve as a yardstick; in the future, probably not even that.

By the way, some animal species are also credited with intelligent behavior. And animals, evidently, are not humans.

ChatGPT is supposedly the best solution

It depends on the purpose. For everyday tasks, ChatGPT is often a great answer machine. This is especially true for general knowledge or common tasks that also appear in ChatGPT's training data.

For concrete tasks that need to be handled with any professional rigor, ChatGPT often proves unsuitable. One example: summarizing a text without hallucinations. Another example: finding knowledge.

ChatGPT certainly cannot and will not scrape a larger portion of the internet, or even a whole website, for you. After all, you pay either "only" with your data and the data of others, or $20 per month, or a small amount per API call.

ChatGPT can therefore only access content that it already knows from training, or new content of small scope; "small scope" meaning the number of documents or web pages.

For tasks like the digitization of documents, ChatGPT is not a good solution either, because many special considerations have to be taken into account.

ChatGPT is bad

It depends on the purpose. ChatGPT is not a search engine. Anyone who uses the system contrary to its intended purpose should not be surprised by mediocre answers. Nor is an AI system designed to count the letters of a word.
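As an aside, counting letters is a job for ordinary code, not for a probabilistic text generator. A two-line illustration (the token split shown is purely illustrative; real tokenizers may split differently):

```python
word = "strawberry"

# Ordinary code counts letters exactly; this is what it is built for.
print(word.count("r"))  # 3

# A language model, by contrast, never sees individual letters: it sees
# token IDs for fragments such as "straw" + "berry" (split illustrative).
tokens = ["straw", "berry"]
print(tokens)
```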

An AI is good at creatively solving complex tasks. The same AI is bad at performing precise work. Just like a human!

AI training is expensive

Correct is: Training large language models like ChatGPT is very expensive and time-consuming.

It is also correct that training your own AI language models is very affordable. The reason is that these models can be specialized for specific use cases. Training such models is often possible on a laptop or your own AI server within a few hours.

Because your own AI hardware is usually running anyway, the additional cost of AI training is effectively zero.

AI training is therefore in most cases effectively free of charge.

Inference is expensive

Inference means querying an AI model, for example chatting with a language model like ChatGPT.

Correct is: Large language models like ChatGPT require dozens or even hundreds of servers simultaneously to generate a response to your question. That is expensive.

Also correct: querying a self-operated AI language model costs nothing extra.

Therefore, inference costs are usually zero. What OpenAI pays for its servers is as irrelevant to us as what we pay for our computers is to OpenAI.

Microsoft Azure and ChatGPT are secure

Many sell their "solution" as innovative. One bank even spoke of introducing its own (private) AI, meaning Microsoft Azure by that. Azure is the opposite of secure: Microsoft itself is the target of numerous hacker attacks, and security is evidently not Microsoft's highest priority.

Add to this Microsoft's massive appetite for data: the new Outlook wants to retrieve customer emails for its own purposes, Windows continuously sends user data to Microsoft, and so on.

Microsoft Copilot is supposedly good

First tests show that the opposite is true. Copilot was asked to summarize a text. The instruction (prompt) was very simple and unambiguous, and the text was provided directly. The text was rather short, because the input field in Copilot's web interface did not allow more.

The test report with screenshots reveals that Copilot is completely unusable for some tasks. Even with a benevolent eye, it is not possible to find anything positive about the Copilot results. The summary of an excerpt from a Dr. DSGVO blog article was so wrong that a human would be ashamed of it. Copilot simply invented numerous statements and did not fulfill the given task at all.

Instead, Microsoft acts everywhere as if Copilot were a great solution whose answers could be used directly. Nowhere does Microsoft state that an answer could ever be wrong.

Language models are based on statistics

Yes, that is right. That is exactly how grammar works. That is exactly how intelligence works; look at the human brain. Language models are simply trained differently from humans, who take additional steps to formulate an answer.

Our entire existence is based on probabilities: consider radioactive decay or, more generally, quantum physics. Everything is based on chance. Everything. If in doubt, ask someone knowledgeable about quantum physics.

It does not matter how a system becomes intelligent; what matters are the results. Anyone who still believes the human brain is not "hackable" may find a report on an artificial rat brain interesting: researchers were subsequently able to reproduce movements and the associated brain activity in a simulation.
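What "based on statistics" means at its core can be sketched in a few lines: a minimal bigram model that predicts the next word purely from co-occurrence counts. The corpus is invented for illustration; real language models use neural networks over tokens, not raw counts.

```python
from collections import Counter, defaultdict

# Toy corpus (invented for illustration).
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (bigram statistics).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_likely_next(word):
    """Return the statistically most likely continuation of `word`."""
    return follows[word].most_common(1)[0][0]

print(most_likely_next("the"))  # "cat": it follows "the" more often than "mat" or "fish"
```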

AI can be used in a legally compliant manner

Theoretically, this may be the case. In practice, some questions arise:

  1. Where do the billions or even trillions of data points come from that have been fed into an AI system for its training?
  2. With cloud services like ChatGPT or Azure, the question arises whether the legal terms are sufficient.
  3. Can Section 44b of the German Copyright Act (UrhG) be complied with at all?
  4. How can data be deleted from an existing AI model?

Regarding question 3: The German legislator requires that crawlers may only read website content if the website operator has not objected, and allows the objection to be stated in the imprint or the general terms and conditions. From a technical point of view, this is completely impractical and not feasible: crawlers do not understand objections formulated in natural language. There are no AI crawlers, only dumb crawlers that deliver content to systems that are supposed to become intelligent or already are. The robots.txt file would have been a good solution; unfortunately, the German legislator rejected it. Furthermore, the crawler operator would later have to be able to prove that there was NO objection, which is hardly feasible in practice, if at all. Crawling German websites is thus always a major legal risk and often probably prohibited.
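For comparison, here is how machine-readable the rejected robots.txt route already is today. Python's standard library evaluates such rules automatically; the domain and rules below are invented for illustration:

```python
import urllib.robotparser

# A robots.txt as a site operator might publish it (invented example).
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler can check the opt-out mechanically; no natural-language
# parsing of an imprint or terms and conditions is needed.
print(rp.can_fetch("GPTBot", "https://example.org/article"))   # False
print(rp.can_fetch("SomeBot", "https://example.org/article"))  # True
```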

Regarding question 1: The data comes from the internet. Texts, images, and other works are protected by copyright per se; protection arises automatically upon the creation of a work, provided the work possesses the necessary level of originality. Such content may therefore either not be read at all (see question 3) or only until the rights holder objects. Generative AI produces results that potentially derive from copyrighted works and would therefore be unauthorized, because at most the reading of the data may have been permitted, not the generation of AI responses from it.

Deleting data from AI models is not reliably possible. An AI model must therefore continue to operate unlawfully once someone no longer wants to see their data in the model (or, at the latest, in the AI's responses). Discarding a massive model like ChatGPT and retraining it is not an option: it is far too expensive and time-consuming, and new deletion requests would prolong the process indefinitely. For Offline-AI, this problem does not exist.

Regarding question 2: See above for evidence of why Microsoft and its platforms should be considered insecure. In addition, there are the legal documents that Microsoft and OpenAI impose on users. The question arises who checks these documents properly and what happens if a deficiency is identified. Dismissing problems may be a popular tactic, but it does not solve the underlying issue. Furthermore, as many practical examples show (Windows telemetry data, the new Outlook with its huge appetite for data and access to customer emails via customer login data…), Microsoft wants to collect a lot of data and nothing else. Why should we trust these companies? There is no reason to.

AI is an algorithm

An algorithm is […] an unambiguous set of instructions for solving a problem or a class of problems. (Source: Wikipedia; boldface added here.)

An AI system is based on a neural network. Whether a neural network can be classified as an algorithm in the narrower sense is more than questionable. To a human observer, a neural network is certainly not comprehensible, especially not a deep network (hence the term Deep Learning).

After all, humans are not classified as algorithms either, although their brains also consist of neural networks.

Strictly considered, the statement that an AI system is an algorithm would therefore have to be refuted. Wikipedia, for that matter, does not equate an AI system with an algorithm. Rather, the training process is attributed to an algorithm, which is plausible, because the improvement of the neural network during training proceeds via unambiguous calculation rules.
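This split can be made tangible in a few lines of Python: the trained weight is just an opaque number, while the training step that produced it is an unambiguous calculation rule (gradient descent). The one-neuron model and the data are invented for illustration:

```python
# One "neuron": output = w * x. The learned weight w is just a number;
# nothing about it reveals a human-readable set of instructions.
w = 0.0

def predict(x):
    return w * x

# The training step, by contrast, IS an unambiguous algorithm:
# gradient descent on the squared error (invented data, target w = 2).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pairs (x, desired output)
lr = 0.05
for _ in range(200):
    for x, y in data:
        grad = 2 * (predict(x) - y) * x  # d/dw of (w*x - y)^2
        w -= lr * grad

print(round(w, 3))  # converges towards 2.0
```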

If you believe that AI is an algorithm, please provide a case where an automated problem solution, in your opinion, is NOT an algorithm. We are curious!

AI is supposedly a tool

That is about as accurate as saying "a car is a pile of matter" or "cookies are files". By that standard, everything or nothing would be a tool; the information content of the statement is zero. The statement is therefore not helpful.

Some people mean AI-assisted tools when they talk about AI as a tool. Such linguistic inaccuracies certainly do not lead to better understanding.

Intelligence is certainly not a tool, but a (predictive) property of a system.

Language models cannot reason logically

The fact is: language models can solve highly complex mathematical problems better than almost any human on Earth. Suppose an AI system achieves this by reading in all possible problem formulations and learning from them; then the word "learning" has already been conceded. If one insists on "reading in", then apparently the AI system can solve previously unknown problems as long as they are only somewhat similar to the known ones. Where is the difference from most humans?

One of the AIMO tasks. Shown is the answer of an AI system, which also provides the solution path. Source: see the following link.

Please read the five math problems presented to AI systems at the AI Math Olympiad (AIMO). If you can even understand these problems, you apparently belong to the (percentage-wise) tiny part of the world population with a very deep mathematical understanding.

By the way, the author of this post was able to solve a highly demanding math problem with the help of the best math model [6], knowing only (from a mathematician) that it could be solved with Diophantine equations, and having no idea what Diophantine equations are. The puzzle is about sailors and coconuts and can likely be solved independently by (percentage-wise) almost no one on Earth. The long German (!) text of the problem was dumped into the English-language math model. The AI's answer was wrong, but its attempted solution path was so close that the correct solution could be found by hand with very little effort.

Tokens are not real data

Some believe that because language models store texts as numbers, they do not store the original data.

Language models store text in the form of number sequences (vectors). To do this, words are broken down into word fragments. These fragments are called tokens. Each token corresponds to a number. This mapping of word fragments to numbers is uniquely defined for each language model in a dictionary (vocabulary). This dictionary is attached to each language model as a text file. You can view and evaluate this text file at any time.

The number sequence 4711, 0815, 9933 could correspond to the letter sequences Maxi, mi, lia. Evidently, the numbers can be traced back to words. The number sequences are thus personal data whenever the encoded letters represent personal data; indirectly identifiable data is also personal data (cf. Art. 4 No. 1 GDPR).
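A minimal sketch of this reversibility, reusing the invented IDs from the example above (a leading zero is not a valid integer literal, so 0815 becomes 815; real vocabularies and IDs differ):

```python
# Invented vocabulary mapping token IDs to word fragments, the way a
# language model ships it in its vocabulary file.
vocab = {4711: "Maxi", 815: "mi", 9933: "lia"}

def decode(token_ids):
    """Turn a sequence of token IDs back into readable text."""
    return "".join(vocab[t] for t in token_ids)

# The "anonymous" number sequence decodes straight back to a name:
print(decode([4711, 815, 9933]))  # Maximilia
```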

Furthermore, language models form their output, among other things, through cumulative probability values of tokens. This means that not just two tokens are considered, but a whole chain of tokens. The technical parameter that controls this is called top_p.
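How top_p restricts that chain of tokens can be sketched as follows: sort the candidate tokens by probability and keep the smallest set whose cumulative probability reaches p (nucleus sampling). The next-token distribution below is invented for illustration:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

# Invented next-token distribution after some prefix.
probs = {"Paris": 0.60, "Lyon": 0.25, "Nice": 0.10, "Bonn": 0.05}

print(top_p_filter(probs, 0.9))  # ['Paris', 'Lyon', 'Nice']
```

The model then samples only from this reduced set, which is why top_p trades diversity against predictability.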

The Hamburg theses paper (see below) is therefore fundamentally wrong in its statements. It appears to have been written to legalize all AI systems, most of which would otherwise probably be illegal.

AI language models supposedly do not store personal data

The argumentation of the Hamburg Commissioner for Data Protection (HmbBfDI) goes as follows: It is extremely difficult to extract personal data from language models. The CJEU has stated that the reconstruction of a personal reference can only be assumed to be possible if the means and effort required remain within the usual scope. The HmbBfDI claims that personal data could only be extracted from language models with a sophisticated, illegal privacy attack. Due to the enormous effort required, the stored data is therefore, according to CJEU jurisprudence, not personal data.

Here's the simple counterexample that easily disproves the HmbBfDI:

Inquiry to ChatGPT and Response from ChatGPT. Date: July 15, 2024 (image was automatically translated).

Language models store data about all individuals alike. This includes individuals who are not public figures, as they too end up in the language model during its pre-training phase via the training data, which consists of billions of documents. It is implausible that only Angela Merkel and other entirely public figures are included, who arguably have less claim to privacy than an ordinary citizen.

It is even possible to extract complete quotes from a language model. The linked post also demonstrates that personal data is indeed present in LLMs. Contrary to the assumption of the HmbBfDI, every large AI model is part of an AI system. This means: an AI system can interpret the numbers that make up an AI model and convert them back into text. Simply keeping an AI model on a hard drive without the ability to interpret it would be pointless, and this scenario obviously does not exist with ChatGPT or GPT-4o. It exists at most theoretically with open-source language models, because downloading one very widely used programming library is enough to interpret the content of a model. The HmbBfDI has thus constructed a technical difference between ChatGPT and GPT-4o that does not exist.

In itself, none of this matters:

  1. If an LLM is used, personal data is often involved. Whoever circulates such data is liable for doing so.
  2. If an LLM is not used, it practically does not matter what data the LLM contains. No one sees it.
  3. It is primarily not about storage anyway.

General artificial intelligence cannot exist

This type of intelligence is referred to as AGI, which stands for Artificial General Intelligence. We are only at the beginning: obviously, intelligent robots walking around are not a frequent sight yet.

True is: a few companies alone will invest thousands of billions of dollars in the construction of intelligent robots. This requires:

  1. A robot (already here, getting better all the time).
  2. An electronic brain (already here, getting better all the time).
  3. Someone who puts the brain box (computer with AI) on the robot (this person already exists).

These three components already exist. Self-learning systems apparently already exist. See ChatGPT or NuminaMath (further down). It is only a matter of time until robots learn to master our world better than we humans ever could.

AI will supposedly only become overpowering many years from now

The falsity of this assumption cannot be proven, and neither can the statements about AGI in the previous section. Time will tell.

Correct, however, is: AI development is progressing at superluminal speed. What was not possible two weeks ago is possible now. This applies, for example, to the advancements of open-source language models. The aforementioned AIMO was won by an open-source model called NuminaMath, which correctly answered 29 out of 50 difficult mathematical problems posed in text form.

Google claims, incidentally, that a breakthrough in robotics has been achieved with the help of a language model.

Prediction by Dr. DSGVO: In 10 to 15 years, robots will be roaming around that pose a serious threat to humanity. It could also be 5 years (to know for sure, one would have to be a robotics expert). One thing is certain, however: it will not take 35 years before we have to seriously worry about our existence due to the superiority of AI. If you have children, then according to this prediction they will have to arrange their lives differently than one would wish.

Summary

Here are the most important statements in correctly stated form:

  • ChatGPT is an intelligent system that surpasses humans in many tasks.
  • Intelligence is independent of humans.
  • Artificial intelligence is intelligence in an artificial system. What counts as artificial you may define yourself (it does not matter).
  • Querying language models is free. This applies to Offline-AI, i.e., self-hosted language models.
  • Training AI models costs nothing extra. This applies to training on your own or rented hardware, which is running anyway; whether an AI training job runs on it or not makes no difference to the hardware costs.
  • AI is not an algorithm, but an unfathomable solution to many problems.
  • Microsoft Copilot is a useless system. At least this applies to the simplest standard tasks, which any Offline-AI can perform better.
  • The Azure cloud is not secure. This is evidenced by numerous incidents in which Microsoft did not perform particularly well.
  • AI will become a danger to humanity in a few years. Or, as Sam Altman of OpenAI put it: "AI will kill us all. But until then, it will be incredibly useful."

If you want to introduce your own AI (Offline-AI) into your company, here's what's important to know:

  • An Offline-AI is optimizable. It delivers better results for many use cases than ChatGPT. This is also because your system only works for you and doesn't also have to work for millions of other users.
  • An Offline-AI offers full data control. Every data protection officer (DSB) is happy about an Offline-AI.
  • An Offline-AI is inexpensive to operate, whether by purchasing an AI server or by renting one from a German provider in a German data center.
  • An Offline-AI can retrieve data from the internet or communicate with other IT systems.

What are your questions or insights?

Key messages

ChatGPT and other AI systems are intelligent and should not be dismissed as simple tools.

Copilot, despite Microsoft's claims, is unreliable and produces inaccurate results. It's not a suitable tool for tasks requiring accuracy and should not be used without careful human review.

AI models can unintentionally violate copyright law because they learn from copyrighted data and generate new content that might be considered derivative.

AI systems can learn and solve complex problems, even mathematical ones, by analyzing patterns in data.

Large language models like ChatGPT likely store personal data from their training data, making it important to be cautious about their use and potential misuse.

AI is developing rapidly and will likely pose a serious threat to humanity in a few years.

About

About the author on dr-dsgvo.de
My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.