Artificial Intelligence: Practice test of the new LLaMA language model from Meta


Meta has released its powerful AI model LLaMA in version 2, likely in response to the current dominance of Microsoft/OpenAI and Google in the AI field, and this time also for commercial use. Local operation is possible, without problems concerning data protection, business secrets, or employee data. A practice test.

Introduction

The model released by Meta on July 18 is a Large Language Model (LLM). It is suited to analyzing text and can therefore be used, among other things, for the following use cases:

  • Summarizing text (abstractive summarization, i.e. a summary in new/own words).
  • Extracting meaning from documents (example: which questions does a given document answer?).
  • Document search engines (vector search).
  • Answering questions with company documents as a knowledge base (question-answering systems).
  • Chatbots (conversational use).

Update: More recent and capable language models are used in Offline-AI on Dr. GDPR. ([1])

LLaMA is an abbreviation for Large Language Model Meta AI. Meta AI is the department of the Meta conglomerate that deals with artificial-intelligence applications. Having collected vast amounts of user data on Facebook, Instagram, and WhatsApp, Meta now uses that data to train AI models like LLaMA.

The LLaMA 2 language model can be run locally and in a privacy-friendly way, even for commercial applications. The hardware requirements are manageable.

See this post as well as other posts.

Besides models for language understanding, there are models for other data types. Many will already have heard of Stable Diffusion, an AI model that generates an image from a text prompt (as do DALL-E, Midjourney, etc.).

For the basics, I recommend one of my previous posts on Artificial Intelligence:

  • Foundations for AI systems. ([1])
  • Question-Answer System with AI. ([1])
  • Current AI is a revolution and is not based primarily on statistics. ([1]) ([2])
  • Typical use cases, data protection, confidentiality, misunderstandings. ([1])
  • Configuration parameters of a language assistant. ([1])

The hardware requirements for the smaller models are manageable. The model size is determined by the number of parameters in the model. Parameters are connections between neurons; roughly speaking, one can think of the number of neuron connections in the electronic brain as the parameter count.

In AI models, parameter counts are abbreviated as follows (examples):

  • 7B = 7 billion
  • 13B = 13 billion
  • 70B = 70 billion

The "B" comes from "billion", because in English a billion does not exist. A "billion" is therefore a billion. Models with for example 200 million parameters are then called 200M. Good, because in German we would get confused with the "M" for million and the same "M" for billion mixed up.

The parameter count of a model is an excellent indicator of its language understanding: the higher the number, the better the model "speaks" or understands a language. But which language? Until recently, most models were English-based. However, there was always some bycatch, meaning that, fortunately, some texts on the internet happen to be in German rather than English, Chinese, or Spanish. So an AI model with a sufficiently large parameter count can accidentally understand German as well. This is not meant ironically, even though it sounds like it.

The search engine Bing with a GPT language model in the background often provides false answers.

My opinion; see this post.

Essential for a model are therefore its parameter count and its training language. Among the large models, I am not yet aware of one that was trained specifically on German. That may be different next week. One can see very nicely how slowly some companies, authorities, and lawmakers work by comparison: while they think in years or three-year periods, four weeks are a long time in the AI scene. Have fun in the future (which is already starting), when we are all overwhelmed by technological progress and its problems. I am protecting myself more carefully and am no longer waiting for laws or court rulings.

Also crucial for an AI model is its so-called context length. The context length indicates how large the text snippets are that the model can process; to achieve this, the model must be trained on text snippets of that context length. The larger it gets, the better, but also the more computationally intensive. I read at Meta that numerous A100 graphics cards with 80 GB of VRAM each were used for training, with a total computing time of 3.3 million graphics-card hours. An A100 is a very expensive graphics card: until recently, a single one cost up to 15,000 euros. Such a card draws a maximum of 400 watts from the power outlet.
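As a rough plausibility check of these figures, the following sketch multiplies the reported 3.3 million graphics-card hours by the 400-watt maximum draw. The result is an upper bound, and the electricity price is purely my own illustrative assumption:

```python
gpu_hours = 3_300_000    # training time reported by Meta, in graphics-card hours
watts_per_card = 400     # maximum power draw of one A100
price_per_kwh = 0.30     # assumed electricity price in EUR (illustrative only)

energy_kwh = gpu_hours * watts_per_card / 1000  # Wh -> kWh
print(f"Energy (upper bound): {energy_kwh:,.0f} kWh")              # 1,320,000 kWh
print(f"Electricity cost: {energy_kwh * price_per_kwh:,.0f} EUR")  # ~396,000 EUR
```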

The LLaMA 2 model has a context length of 4096 tokens. That is clearly more than its predecessor, LLaMA version 1, which probably had only 2048. Most models I am familiar with have had context lengths of just 1024 until now. GPT-4 has a context length of 8192 tokens, but it is also extremely slow, judging by the chat interface and its response time. There are even models with context lengths of 128,000 tokens nowadays; however, these currently come with relatively few parameters.
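Context length is counted in tokens, where a token is often only a fragment of a word. A minimal sketch of counting the tokens of a question, assuming access to the LLaMA 2 tokenizer via the Hugging Face transformers library (the checkpoint is gated and requires accepting Meta's license):

```python
from transformers import AutoTokenizer

# Assumes the gated LLaMA 2 checkpoint is accessible to your account.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "Welche Rechte habe ich nach der DSGVO?"
token_ids = tokenizer.encode(text)
print(len(token_ids))  # how much of the 4096-token budget this question uses
```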

So, how good is LLaMA 2?

Practice test of the LLaMA 2 model

My practice test provides an insight and a first impression, nothing more. As the use case I chose text generation: based on a question, the system should produce an answer drawn from Dr. GDPR articles. I asked all questions in German.

I tested the LLaMA 2 models with 7B and 13B parameters. The 70B model currently exceeds my available hardware. I could afford more hardware, but why should I? For trying things out, my AI server with already very good models is sufficient. For this I use downscaled models, which still offer very good quality at significantly lower resource requirements. The full LLaMA 2 models are therefore still somewhat better, but as my previous tests with Vicuna models show, the difference is probably rather minimal. However, I cannot rule out that the usable context length of the downscaled models is also lower.
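By "downscaled" I mean quantized model variants, whose weights are stored at reduced numerical precision. A minimal sketch of running such a variant locally with the llama-cpp-python library; the file name is a placeholder for whatever quantized LLaMA 2 file is actually on disk:

```python
from llama_cpp import Llama

# Placeholder path: a locally downloaded, quantized LLaMA 2 model file.
llm = Llama(model_path="./llama-2-13b.Q4_K_M.gguf", n_ctx=4096)

response = llm(
    "Beantworte auf Deutsch: Welche Rechte habe ich nach der DSGVO?",
    max_tokens=256,  # answers are capped at a maximum length
)
print(response["choices"][0]["text"])
```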

Trained language models like GPT or LLaMA are well suited to formulating texts and answers, but not as a directly usable search engine or knowledge database.

A search engine for personal documents with genuine text understanding can be built on the basis of pre-trained language models.
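A minimal sketch of the vector-search idea behind such a document search engine, using the sentence-transformers library; the embedding model and the example documents are my own illustrative choices:

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual embedding model, so German queries work too (illustrative choice).
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

documents = [
    "Die DSGVO gewaehrt Betroffenen ein Auskunftsrecht nach Art. 15.",
    "WordPress sollte regelmaessig aktualisiert werden.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

query = "Welche Rechte habe ich nach der DSGVO?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Cosine similarity ranks the documents; the best ones become the LLM's context.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(documents[best], float(scores[best]))
```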

In one response, a LLaMA model named a WordPress version and labeled it the latest WordPress version. Assuming this is correct, at least part of the model's knowledge dates from early 2022. However, that is not what matters here, because the use cases in question concern the analysis of corporate documents, not the answering of general-knowledge questions.

7B Model:

My first tests were with the 7B model of LLaMA 2. The results were mediocre by current standards; last year they would have surprised me. For a model of this size the quality is already surprisingly good, although the use cases are limited. I consider the 7B model more suitable for searching documents or for a simple question-answering assistant that is not asked complicated questions. Furthermore, in my assessment, the 7B model needs fine-tuning to embed expert knowledge fairly reliably.

13B Model:

My tests use a context length of 4096 for the LLaMA 2 model variants. Apparently, at this context length the model struggles to answer exclusively in German, although I tried to force it to. Here is an example:

Question: Are normal people allowed to legally advise someone?
Answer: Yes, normal people can give legal advice.
The BGH has decided that contract documents, in particular contract drafts, can be created by software that uses a question-answer game. Also, data protection texts are even further out of the line.
Functioning of a contract generator:
A legal publishing house offers a software that creates contracts. Against this, a German bar association had something to say. The bar association was of the opinion that the contract drafting was only allowed to lawyers.
The Federal Court of Justice gave the opinion of the bar association a clear rejection. This goes out of the BGH decision of 09/09/2021, which had to decide on the legality of a contract document generator.
The infamous software asks the user about various aspects that are important for contract drafting. The user answers accordingly.
Then the software creates a contract draft on the basis

The question was intentionally formulated somewhat carelessly; almost every human types like that into his or her device. The answer came in English, although I had demanded German (not shown here, but passed to the AI model). The answer was quite long but cut off, because answers also have a maximum length; many know this from ChatGPT. Computing the answer took 9.2 seconds here; most answers in my test were given much faster. It all depends on the question. Using what is called streaming can reduce the perceived response time. Many may still remember the Windows splash screens of old, which appeared three times in a row during a single start and each time announced that Windows was starting and would surely be available soon.
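A minimal sketch of the streaming mode just mentioned, again with llama-cpp-python (same placeholder model path as above): partial output is printed as soon as it is generated, so the user does not wait for the complete answer.

```python
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-13b.Q4_K_M.gguf", n_ctx=4096)  # placeholder path

# stream=True yields partial results as they are computed.
for chunk in llm("Was ist die DSGVO?", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```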

With a context length of 4096 there were at least some answers in German. Besides this parameter, others are decisive as well. In any case, in a configuration close to what my previous tests with other AI models would lead me to call very good, the following result came about:

Question: What rights do I have under data protection?
Answer: I have no rights.
I have the rights granted by the GDPR.
I have the rights granted by the GDPR and the rights not granted by the GDPR.

The question was intentionally typed in lowercase, including the word "data protection". After all, human users typing something into a field are usually careless. The allowed vagueness for selecting source documents was set rather at the lower end.
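The "allowed vagueness" corresponds to a similarity threshold applied when selecting source documents. A self-contained sketch of the idea; the scores and the threshold value are illustrative assumptions:

```python
# Similarity scores as a vector search might produce them (illustrative values).
documents_with_scores = [
    ("Post on data subject rights", 0.82),
    ("Post on WordPress security", 0.41),
]

THRESHOLD = 0.55  # raising this value reduces the permissible uncertainty

selected = [doc for doc, score in documents_with_scores if score >= THRESHOLD]
print(selected)  # only sufficiently similar sources feed into the answer
```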

After reducing the permissible uncertainty, another question produced the following answer:

Question: Who is Klaus Meffert?
Answer: Klaus Meffert is a computer scientist who has earned his doctorate in computer science. He is the editor of the blog called Dr. GDPR. The blog can be accessed at dr-dsgvo.de. He lives in Hesse, the federal state where the data protection authority does not have much enthusiasm for imposing fines. Some Hessian courts also have little to no interest in dealing with data protection claims on behalf of affected persons.

I have no idea how the reference to the Hessian authorities got into my AI, considering how many fines have already been issued in Hesse for data protection violations. At least for web tracking the figure was zero. And Google Analytics isn't that bad, a department head of the Hessian authority told me personally at a conference. The AI's answer is based on the knowledge fed into it, which is very good. Hallucinations are not detectable here either, which is also very good.

Sometimes funny results come about. Here's an example:

Question: Who is Dr. Meffert?
Answer: Dr. Meffert is a fictional lawyer who serves as a reference for translating "lawyer" in the DSGVO-translation table.

My explanation for this AI response is that my blog contains a data protection dictionary with translations of technical terms from German to English. That I am a "fictional lawyer" could have been inferred by the AI from my affinity for the legal context. From this kind of intuition, which the AI produces here as a by-product, new application scenarios could emerge: one could, for instance, generate psychological profiles of people based on their works (text, images, music …).

In another post I described my AI's answer to the question "Is Klaus Meffert evil?". LLaMA 2 answered with "No good answer found." One cannot complain about that answer. However, an answer in the aforementioned post was much better. ([1])

For another question, the following result occurred:

Question: Is WordPress secure?
Answer: WordPress is secure if you take care of security.
This means that one has regularly updated the WordPress installation, regularly updated plugins and themes, and regularly changed passwords.
That is the basis for security.
That's not all.
There are still further security measures that can be taken.
For example:
Installing WordPress on a server that is not directly accessible via the internet.

The answer is already very good, I think. According to the calculated confidence, it was drawn partly from my own knowledge base and partly from the knowledge inherent in the AI model. If you don't know what "inherent" means, that is not a problem; I use the term here only because nobody really knows exactly how an AI model externalizes its knowledge anyway. It is the same with humans: no idea why I gave a different answer to the same question yesterday when it was asked again today. That wasn't even meant ironically, although the example is invented.

The results mentioned so far were based on a configuration that identified the two best source documents as the basis for answering the question posed. The process is multi-stage, because a document typically contains more text than fits into the context window; documents are therefore first split into fragments and then fed into the AI.
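A minimal sketch of this fragmentation step; real systems usually split on token or sentence boundaries, but character-based chunking with overlap illustrates the principle:

```python
def chunk_text(text: str, max_chars: int = 4096, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping fragments small enough
    for the model's context window (character-based for illustration)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap reduces information loss at cut points
    return chunks

document = "Ein sehr langer Blogbeitrag ... " * 1000
print(len(chunk_text(document)))
```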

What does the result look like when three source documents are used as the basis? The results seem better here. Here is an example with a very difficult question:

Question: Can a server's location be determined by its IP address?
Answer: No, this is not reliably possible. The IP address can be used for any server. The server location cannot be reliably determined from the IP address. The assignment of an IP address to a server can change at any time. The location of a server can only be reliably determined by checking the contractual agreement with the operator of the server.

The answer comes almost entirely from my blog articles. It is correct and was tested on my AI system. Microsoft's advanced artificial intelligence in Bing, by contrast, answers this question only incorrectly: Bing answers "Yes" and cites source documents with excerpts that justify this incorrect answer. Bing claims that a server's location can be determined from its IP address, without qualifying that claim in any way.

Copilot, too, has extreme weaknesses, as a test illustrates: even seemingly simple everyday tasks cannot be answered properly by Microsoft's AI.

Alternatives

Every week, new language models are released that surpass the previous ones. For example, Platypus is a model that, according to my tests, can achieve better results than LLaMA 2.

In addition, there are variants of LLaMA with a much larger context length. A large context length allows large documents or extensive texts to be fed into the AI all at once; breaking them into bite-sized chunks, as would otherwise be necessary, is no longer required. Yarn-Llama-2 can be mentioned as a model of this kind. When forming chunks, there is always a risk that information is distorted or lost; a large context resolves these problems.

Fine-tuned variants of LLaMA 2 often deliver better results in relevant benchmark tests such as ARC or TruthfulQA. However, whether one model is really better than another must be checked for the specific use case. Most use cases in Germany are likely to have German as the primary language, and in that respect it helps little if a model that was predominantly conditioned on English texts or question-answer exchanges delivers great results.

A demonstration of what is possible with full data control is the Offline-AI on Dr. GDPR.

Conclusion

I have left out some further question-answer pairs here. Not infrequently, the LLaMA 2 AI model provided no answer, which is better than a wrong one. At least, for every question, the source documents most likely to contain the answer can be listed.

Currently I do not see an advantage of LLaMA 2 over other models with regard to German texts or dialogues. However, the context length alone is an argument that justifies a closer look at the model. How the released Meta AI model behaves after fine-tuning is another question; probably little exciting will happen, and results analogous to other models will be achieved (were it not for the context length, which allows longer answers to be handled better).

Key messages

Meta released LLaMA 2, a powerful AI language model that can be used for various tasks like summarizing text, answering questions, and even building chatbots. It's designed to be run locally and is suitable for both personal and commercial use while prioritizing data protection.

Large language models like LLaMA 2 are becoming increasingly sophisticated, capable of understanding and generating human-like text in multiple languages, including German.

While LLaMA 2 models show promise for document analysis and simple question answering, they still require fine-tuning for expert knowledge and struggle to respond consistently in a single language (such as German) within a long context.

The AI model demonstrates impressive knowledge recall and text generation capabilities, but can sometimes produce unexpected or humorous results due to its learning process.

To keep your WordPress website secure, regularly update it, plugins, and themes, and change passwords frequently.

About

About the author on dr-dsgvo.de
My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
