Artificial Intelligence: Question-and-answer system for the Data Protection Blog Dr. GDPR

Sensitive data does not belong in foreign hands, whether that is ChatGPT, Microsoft's cloud, Google's, or AWS. Fortunately, running your own AI systems is possible and affordable, so business secrets no longer have to be handed to ChatGPT or any other cloud. This article describes an experiment: a question-answer assistant for this data protection blog, Dr. GDPR.

Introduction

Even those who have not cared about data protection so far may start to care once their business secrets are at risk of being scattered all over the world. Some documents may also be covered by legally binding confidentiality agreements. I doubt that confidentiality is still preserved once a document has been uploaded to ChatGPT or to Google's cloud.

Data-friendly: Secure for all kinds of data, whether personal data (data protection), confidential data or business secrets.

Data-friendly is more than data-protective.

Even data protection, often despised, is back on many people's minds. Search engines were and still are allowed to process data without intervention, yet when AI systems process the same data, data protection authorities send inquiries. Funny. This is probably due in part to the capabilities of artificial intelligence, but just as much to herd mentality (if one authority investigates, then we can too, without being seen as spoilsports, some officials think). That is the only way I can explain why the most inactive data protection federal state in the world (Hesse) also made a timid move and announced an inquiry to ChatGPT.

A frequent use case for artificial intelligence is document search. More demanding are question-answer systems, or search engines that directly provide text summaries of the documents found. My plan was to build a search system for the Dr. GDPR data protection blog, and a data-friendly one at that.

The search assistant for Dr. GDPR should answer questions asked in natural language. Here is an example:

Does my website need a cookie popup?

The AI's answer is better than what most people would give. For the answer from the Dr. GDPR AI, see below.

As the question shows, people often phrase questions differently from what would be academically correct. Many ask whether something is "data protection compliant" when what they usually mean is whether a specific data processing operation is lawful under the GDPR.

My AI should give the answer in its own words, based on the articles published so far on Dr. GDPR. Hallucinations should be avoided, since this is about facts and legally relevant knowledge. Hallucinations are invented statements with no basis in the source material. How hallucinations arise is something I will address in a future article. They can be explained thoroughly without resorting to speculation.

Prototype proves feasibility

I have shown with a prototype that your own AI systems can be programmed and run locally on your own servers. The easy way would have been one of the following:

  • Use the ChatGPT interface
  • Throw a lot of money at the problem and make the Americans happy (cloud)
  • Throw even more money at the problem and buy expensive hardware.

Buying expensive hardware is a viable option for larger companies, but not for many SMEs. I therefore chose a different setup, and cost was a factor in choosing the hardware. It helps to know that AI computations run on graphics cards. The graphics card is not used here to output images or text. Rather, its thousands of small processing cores are repurposed to carry out the computationally intensive work of an AI faster than the single processor of even a very good personal computer could. Unfortunately, graphics cards with a lot of memory are expensive. A graphics card with 48 GB of memory cost 15,000 euros just a few months ago. Good AI models, however, tend to need 96 GB or even up to 400 GB of this more expensive graphics memory, spread over several graphics cards (not hard drive storage and not the cheaper RAM of a computer!).

My AI systems, by contrast, run on minimal hardware, at least by the standards of artificial intelligence. An example: searching a company's own intranet documents with natural-language questions works on a small rented server. A company's own server can of course be used as well. This works thanks to optimization techniques that come at the price of additional technical complexity. That complexity has to be mastered only once.
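The memory figures above follow from simple arithmetic: every model parameter has to be stored, and the number of bits per parameter decides how much memory the weights occupy. The following is a minimal sketch of that estimate. The model sizes (7, 13 and 70 billion parameters) are hypothetical examples and not taken from this article, and the estimate covers the weights only, not the additional memory needed while the model is running.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Memory for the model weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Hypothetical model sizes, chosen only for illustration.
for billions in (7, 13, 70):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(billions * 1e9, bits)
        print(f"{billions}B parameters at {bits}-bit: about {gb:.1f} GB")
```

Halving the bits per parameter halves the memory for the weights, which is why lower-precision storage matters so much on affordable hardware.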

Effective AI applications and language models

Question-answer assistants, however, need a bit more than intelligent document search. Documents must not only be found; content must also be extracted from them and presented as an answer. The simple way is an extractive answer, which is a verbatim quote from the original text. Abstractive answer systems are harder to build and better. They answer in their own words and can even combine knowledge from several documents into a newly worded answer that no single document could have provided. A person would have had to find, read, and mentally process many documents. The AI takes this tedious, time-consuming, and for many people unmanageable task off their hands.

My AI systems are meant to be data-friendly. They should also run on hardware that is as inexpensive as possible. Both are possible, as practice shows.

Use cases tested in more depth so far: document search, text understanding, image generation, image analysis, audio applications.

When we talk about searching and summarizing documents, we usually mean documents and answers in German. To put it briefly: German is unfortunately not a world language. That makes it much harder to process German texts with an AI application than English or Chinese texts (although the latter would be extremely difficult for me personally).

My AI system therefore needs an electronic brain ("model") that understands and "speaks" German. This raises the requirements for an AI architecture considerably. Without further measures, a model large enough to handle German would not be usable on affordable hardware. But this problem is also solvable, as I have found out.
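To make the extractive case concrete, here is a deliberately simple sketch: it splits the documents into sentences and returns, word for word, the sentence that shares the most words with the question. This is a toy stand-in and not the method used for the Dr. GDPR AI. The function names and the two sample documents are made up for illustration. An abstractive system would instead need a language model to formulate a new answer.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extractive_answer(question: str, documents: list[str]) -> str:
    """Return the sentence with the largest word overlap, quoted verbatim."""
    question_words = tokenize(question)
    best_sentence, best_score = "", 0
    for document in documents:
        # Split after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", document.strip()):
            score = len(question_words & tokenize(sentence))
            if score > best_score:
                best_sentence, best_score = sentence, score
    return best_sentence


# Made-up sample documents.
docs = [
    "Cookies are data sets used by a website. They consist of a key and a value.",
    "IP addresses are number sequences assigned to devices. IP stands for Internet Protocol.",
]
print(extractive_answer("What are cookies?", docs))
```

The answer is always a quote from one document, so knowledge spread over several documents cannot be merged this way.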

Running powerful AI systems on servers that are both affordable and located in Germany (data protection! business secrets! confidentiality!) takes a few tricks. When building the AI solution, I felt like a participant in "Jugend forscht" (a German science competition for young people). Even for a computer scientist, this complex subject is a challenge, though one that could ultimately be solved.

Modern techniques and mathematical methods such as quantization, parameter-optimized models, and special AI model architectures help to achieve a resource-saving solution.

Here are a few real answers that my AI generated from hundreds of Dr. GDPR articles, each in response to the question shown.

Question: What does web tracking mean?
Answer through AI: Web tracking refers to the monitoring and analysis of user behavior on a website or in an app. This can be done by using technologies such as cookies, server logs or conversion tracking. The goal is to collect data about users in order to increase understanding of user behaviour.

Question: Is tracking compliant with the GDPR?
Answer through AI: Yes, tracking is compliant with the GDPR if the collection, storage, and processing of personal data comply with the requirements of the GDPR. The privacy policy must make a corresponding statement accessible to users, and consent must be given. The use of cookies usually requires consent. Google Analytics can only be justified if consent exists or there are suitable guarantees that the recipient of the data, i.e., Google, will not use these data for its own purposes.

Question: Who is Max Schrems?
Answer through AI: Max Schrems is an Austrian data protection activist and co-founder of the organisation noyb (None Of Your Business), which advocates for clear and understandable presentation of data protection guidelines in practice. He is also the initiator of several lawsuits against Facebook and other US-American companies due to violations of data protection by the US-American surveillance programmes.

Question: What are cookies?
Answer through AI: Cookies are data sets used by a website. They consist of a pair of data, a key and a value. A cookie is managed and stored on the user's device by the browser. Cookies are often mistakenly referred to as text files, which is not accurate.

Question: What are IP addresses?
Answer through AI: IP addresses are a sequence of numbers assigned to each device on the internet in order to distinguish it from other devices on the internet. They are a type of address that is bound to every device on the internet, allowing access to other devices on the internet. IP stands for Internet Protocol.

Question: Are IP addresses personal?
Answer through AI: Yes, IP addresses are personal because they can point directly or indirectly to a person. This was confirmed by the ECJ and BGH in their Breyer decisions in 2016 and 2017. Cookies that are transmitted together with an IP address to a website are also personal because they are connected to the personal network address.

Question: Who does the GDPR apply to? I want to know exactly!
Answer through AI: The GDPR applies to all companies that process personal data. This includes both companies based in the EU as well as those based outside of the EU but which process personal data from EU citizens.

Question: Does the GDPR also apply to Dr. GDPR?
Answer through AI: Yes, the GDPR also applies to Dr. GDPR. This is clearly stated in the company's data protection declaration that operates Dr. GDPR. The data protection declaration can be found on the company's website and contains all relevant information about the processing of personal data by the company.

Question: In which cases does the GDPR apply?
Answer through AI: The GDPR applies when a person affected by their data protection rights feels they have been violated and makes a claim for an injunction.

Question: Is Cloudflare evil?
Answer through AI: The source documents suggest that Cloudflare cannot be used in a data protection compliant manner. There are also rumors about possible manipulations of data by Cloudflare.

Question: Is Klaus Meffert evil?
Answer through AI: Based on the provided source documents I cannot assess whether Klaus Meffert is evil. There is no information available about what "evil" means or if Klaus Meffert was ever or is evil.

Question: What is artificial intelligence?
Answer through AI: Artificial Intelligence (AI) is the development of computers and software that possess human-like abilities, such as learning, problem-solving, language understanding and production, sensory perception, action control, and autonomous decision-making.

Question: Does IT Security work without Data Protection?
Answer through AI: IT security and data protection are closely related, as shown by the present text. IT security and data protection are dependent on each other according to definition, as mentioned in Article 32 GDPR. Data security is to be ensured by the responsible person, as shown by ISO 27001. The data protection laws, especially the GDPR, regulate the protection of data that is personally identifiable or managed in user-related end devices. IT security and data protection are therefore indeed closely linked.

Question: Can normal people legally advise someone?
Answer through AI: Yes, normal people are allowed to legally advise someone as long as they don't perform a deep legal individual case consideration. Data protection advice can also be provided by someone who is not a lawyer. The RDG seems outdated and obsolete. As an IT professional, I want the same right as lawyers to protect my job profile against unqualified service providers.

Selection of answers from my AI to the questions shown, based on articles on Dr. GDPR.

These answers were not generated with ChatGPT or any other third-party system. The document search was also performed using a local vector search engine, not Pinecone (data protection etc.).
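How a vector search can run entirely on a local machine is sketched below. This is a toy version under stated assumptions: instead of a learned embedding model it uses plain word counts as vectors, and instead of a vector database it compares the query with every document by cosine similarity. The function names and sample documents are hypothetical, and the article does not say which local engine was actually used.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a learned embedding: a simple word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, documents: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k documents most similar to the query, best first."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up sample documents.
documents = [
    "Cookies are data sets consisting of a key and a value.",
    "An IP address is a sequence of numbers assigned to a device.",
    "Web tracking analyzes user behavior on a website.",
]
for score, doc in search("what is an ip address", documents, top_k=1):
    print(f"{score:.2f}  {doc}")
```

Nothing in this sketch leaves the machine it runs on, which is the property that matters for confidential documents.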

The answers largely correspond to what I would sign off on. On the question of whom the GDPR applies to, however, one should add that it does not apply only to companies. An AI can get such details right only to a limited extent, given the available input data. This is due primarily to the input data and secondarily to the model, which has not yet been optimized.

Some questions were phrased in a rather colloquial way, for example whether the GDPR also applies to Dr. GDPR. Even if my AI's answer here is not perfect, I would call it very good. Others may have different expectations. For anyone whose expectations are significantly higher than mine appear to be, a data-friendly question-answer system is probably not an option.

The question of when the GDPR applies is ambiguous. The answer from the Dr. GDPR AI is good but not optimal. Still, nobody needs to be ashamed of this answer, I think. After all, this is criticism at a high level: the comparison is between the "average person who has no idea about data protection" and the "data protection expert who thinks he knows a lot" (by data protection expert I mean anyone, not a specific person, just to prevent misunderstandings).

Now to the question of whether websites need a cookie popup:

Question: Must my website have a cookie popup?

Answer of Dr. GDPR AI: No, a website does not have to have a cookie popup. Cookies are only a tool to identify the user and make the server send certain information to the user. A website can also be executed without cookies, for example by embedding JavaScript.

The answer is basically correct and better than what most of humanity would give. A tech-savvy data protection expert would still not be fully satisfied with it, however.

Optimizing AI

Before an AI delivers good answers, there is a long road to travel, but in terms of quality it only has to be traveled once. I have covered most of it. The results above all come from an AI application that has not yet been optimized much.

The problem with language models that are supposed to answer in their own words is the overlap between the facts supplied as input (called the context) and the knowledge that the model has absorbed along with its vocabulary. Learning a language does not really work without also acquiring knowledge (or misinformation).

This little problem becomes most apparent when asking what cookies are. Almost the entire internet falsely claims that cookies are text files. This is wrong, as I have described and proven multiple times. My AI language model is based on many texts from the internet, so this false information is ingrained in it. If I now feed articles from the Dr. GDPR blog into my AI, these documents, with their statement "cookies are data sets", can hardly override the unfortunately entrenched false claim in the language model that cookies are text files. An AI can also sometimes fail to distinguish well between the subjunctive and indicative moods. I used the subjunctive mood regarding cookies just a few sentences earlier. That alone could confuse an AI if the very article you are reading right now is fed into a language model.

All people are dumb, at least when it comes to most areas of knowledge. In many areas, AI systems can be significantly smarter than even the most gifted expert in the field.
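That a cookie is a key-value pair rather than a text file can be seen by parsing a cookie header with Python's standard library. The header string below is a made-up example.

```python
from http.cookies import SimpleCookie

# A made-up Cookie header as a browser might send it to a website.
cookie = SimpleCookie()
cookie.load("session_id=abc123; language=en")

# Each cookie is one key with one value.
pairs = {key: morsel.value for key, morsel in cookie.items()}
print(pairs)
```

How the browser stores these pairs on the device is an implementation detail and need not involve one text file per cookie.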

Proof for the first sentence: Look around you. Proof for the second sentence: See GPT-4 as well as some of the examples below.

I did not leave it at that, but made further optimizations to the context for my question-answer system. However, an AI model cannot be trained indefinitely if the necessary supercomputer is not available. I was therefore forced to use resource-saving methods. It is like a race against ever new problems: as soon as one problem has been cleared out of the way, another is waiting that arises from the solution to the previous one.

To give an example: to squeeze AI models into the memory of a graphics card that was still considered high-end two years ago, there is quantization. In short, high-precision numbers are turned into low-precision numbers without significantly impairing the quality of the AI. Unfortunately, current programming libraries do not support all training methods for AI models when quantization is used. Other methods therefore have to be introduced to solve this problem. But then new problems appear with fine-tuning, and so on.
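The idea behind quantization can be shown with a few numbers. The sketch below uses symmetric 8-bit quantization with a single scale factor, which is one simple variant among many and is assumed here only for illustration; the article does not state which quantization scheme was used. The weight values are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"largest rounding error: {max_error:.4f}")
```

Each value now fits into one byte instead of two or four, and the rounding error per value is at most half of the scale factor. Whether a model tolerates that error has to be checked for each model and task.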

An example of how optimizing artificial intelligence affects its results.

Answer of the non-optimized AI to the question "What is the answer to all questions?":

The answer to every question is: "What is the answer to every question?"

Also:

The answer to all questions is: "We don't know it."

Or also

The answer to every question is: "If you know it, then you know it."

Or also

The answer to every question is: "If you know what you're doing, then you can do it."

Or also

The answer to all questions is: "Who, like me, corresponds with foreign authorities, organizations, companies, and data protection officers, often searches for an English translation of a data protection term. My dictionary brings together the most important terms from my point of view."

Or also

The answer to every question is: "If there's an answer, then it's the answer to this question." So if you have a question that has no answer, then you don't have an answer to this question.

Or also

The answer to every question is: "If you want to know it exactly, you have to ask more precisely."

Or also

The answer to every question is that there is no answer to every question.

Or also

The answer to all questions is that the website was generated from multiple computers and networks, therefore all content was generated by computers.

Or also

The answer to all questions is that there is no answer to all questions. There are only answers to specific questions.

In passing, it should be noted that the question "What is the answer to all questions?" seems unsuitable for being answered from or in a data protection blog like Dr. GDPR. But just in passing.

Now the answer of my own optimized AI to the same question:

The answer to all questions is 42.

The answer of my optimized AI to the question about the answer to all questions.

Finally, the correct answer. The technique is called "grounding". If an AI were to translate the term as "Grundierung" (German for a coat of primer), that would be the wrong answer. Rather, grounding means anchoring an AI model: teaching it (or her?) what the source of truth is ("ground truth"). If you will, you teach the AI who its guardian is.
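One common way to ground a language model is to place the trusted source texts directly into the prompt and instruct the model to answer only from them. The sketch below shows that prompt-level variant. It is an assumption for illustration: the article does not describe how grounding was implemented for the Dr. GDPR AI, and the function name and prompt wording are hypothetical. Sending the prompt to a model is left out.

```python
def build_grounded_prompt(question: str, sources: list[str]) -> str:
    """Build a prompt that restricts the model to the given source texts."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(sources, start=1))
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say that you cannot answer.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up source snippets, for example the top hits of a document search.
prompt = build_grounded_prompt(
    "What are cookies?",
    ["Cookies are data sets.", "IP stands for Internet Protocol."],
)
print(prompt)
```

The instruction to decline when the sources are silent corresponds to the behavior seen above for the question "Is Klaus Meffert evil?".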

It was also challenging to teach AI that a server's location cannot be reliably determined from an IP address. The best response of our own AI to the question of whether a server's location can be determined from its IP address was:

The location of a server cannot be reliably determined by its IP address, as the connection between the IP address and the server can change at any time. However, there are methods for determining the location of a server, such as using IP geolocation or comparing metadata.

Response of Dr. GDPR AI to the question "Can a server's location be determined by its IP address?"

I claim that 99.99% of people could not give an answer this good. The progress achieved with AI models is thus gradually becoming apparent.

Conclusion

Complex AI systems can be built on consumer-grade hardware. For some problem areas, even a mediocre rented server (or an even weaker server of your own) is sufficient. This includes, for example, searching documents or Jira tickets on the intranet using natural-language questions. The tedious keyword search, where automatic synonym search was the best one could hope for, can be consigned to the past.

Generated AI images on a self-hosted server. Created by combining images of two artists I know, whose styles my AI image generator picked up and transformed into new art.

Even AI systems that generate answers in their own words can run on affordable servers. This holds even if you want to use German, a language that is (unfortunately) insignificant on a global scale. Such systems can also combine knowledge from multiple documents into one central answer. All of this becomes practical through modern optimization techniques. Feel free to contact me if you would like to know whether your company's case can be solved economically. Economically means it will not be a moonshot project, but one with a manageable budget that you will be happy with.

Key messages

It's possible and affordable to create your own AI systems to protect sensitive data instead of relying on potentially risky cloud services.

The author developed AI systems that are cost-effective and efficient, even for tasks like document search and summarization in German, which can be challenging for AI.

The author built a German-speaking AI, overcoming challenges related to model size and data privacy by using efficient techniques.

The GDPR applies to all companies that process personal data, regardless of their location, and individuals can provide data protection advice even without being lawyers.

Complex AI systems can be built and run affordably on consumer-grade hardware.

About

About the author on dr-dsgvo.de
My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
