Data is a valuable resource, especially when it comes to business secrets. But confidential and personal data should not be handed over to third parties (like ChatGPT) for legal reasons. In-house AI systems offer, besides confidentiality, great flexibility and precise alignment with concrete requirements. A practice report.
Introduction
"Because simple is simply simple" was once the slogan of a mobile phone provider. For data-intensive applications, simple is often the new wrong. Data protection does not really interest many people. When it comes to employee data, contractually protected confidential data, patent groundwork, or other business secrets, companies are more sensitized. After all, no one wants legal trouble. The desire to broadcast internal company knowledge to the world is probably not widespread either.
Artificial Intelligence: The legal approach examines what is permitted and clarifies risks. The technical approach provides data-friendly systems and thereby resolves many legal issues on its own.
Acting constructively rather than arguing is, in my view, the better strategy. Even then, lawyers will still have enough to do.
Using ChatGPT is easy, but some people make it too easy for themselves, to their own detriment. This shows that thinking is harder than doing something wrong or suboptimal. Even greater effort is accepted as long as each individual step is small, however often it is repeated: rather a small effort 100 times with a high overall expenditure than a medium effort once with a significantly lower overall expenditure.
Recently, Zoom, as a provider of video conferencing software, formulated new terms of use. With these, Zoom grants itself the right to use almost arbitrarily all data received in Zoom video conferences. This includes passing on your data, including transcripts, and using it for machine learning ("training an AI"). This would not have happened with a data-friendly solution from Germany. Equally, it would not have been a problem with your own system. Now all Zoom users potentially have a problem.
All Zoom users potentially have a problem because they prefer supposedly free third-party systems over data-friendly solutions.
Thanks to Zoom for the decision-making help.
If you don't want to make things easier than easy for yourself, at least use the ChatGPT interface (API) through your own program. Many applications can be created this way. Besides its remarkable abilities, however, ChatGPT brings several incurable problems with it:
- ChatGPT is very slow.
- Most of ChatGPT's data is irrelevant for business applications (hindering ballast, promoting hallucinations, slowing down the system, increasing error susceptibility).
- All data lands with OpenAI and thus with Microsoft.
- Data is not secure at ChatGPT (see the belatedly added opt-out instead of consent, data leaks, American corporate policy, etc.).
- ChatGPT is based on outdated general knowledge.
- ChatGPT is not familiar with your company's documents and hopefully will never learn them.
- ChatGPT costs money, depending on the number of processed text pieces (tokens). Uploading and analyzing a larger PDF already costs real money. Faulty programming (an infinite loop or runaway recursion) will quickly ruin any budget.
- ChatGPT is not infinitely scalable.
If your inputs are also used for training or fine-tuning a third-party AI model, then privacy and confidentiality can no longer be guaranteed. A language model learns not only grammar and structure; it also absorbs knowledge. The resulting deficits are more annoying and counterproductive than a mere legal problem, which also means they cannot be cured by legal means.
Offline AI as a solution for companies and authorities: further information in [1].
Similar things can be said about image generators like DALL-E or Midjourney. Many of these generators are based on an approach called Stable Diffusion. Almost all relevant methods of this kind use the LAION dataset, which in turn used the Common Crawl data dump to find websites that embed images together with image descriptions. Common Crawl is a massive dump of nearly every website. If one of your images has ended up in the image dataset, it is not stored there in its original form. Rather, your company image (logo, product photo, etc.) has ended up, structurally stored, in the artificial neurons of a third party's AI model. Getting that image out again is hardly possible; the AI model would have to be retrained. Whether the owner of the AI model will do this is questionable, since training is an extremely computationally intensive task with demanding data acquisition.
Proprietary AI systems
None of the problems mentioned above apply when you use your own AI system. I call this type of system local AI or autonomous AI. These systems do not require an internet connection and could, in the best case, sit under your desk.
In-house artificial intelligence systems offer these benefits:
- Full Data Control: You decide which training data or pre-trained AI models are used.
- Query your own data instead of internet data: feed your company documents and media into the system.
- High speed: Your system will be faster than ChatGPT if you want it to be. The number of your users will be far lower than that of popular AI platforms, and you can also reduce the data volume significantly.
- Customizable at will: More on this below.
- A wide range of application scenarios: semantic search, text understanding, question-and-answer assistants, image generators, audio transcription, and many more.
Here is a practical example of what is possible with a local system for your company. The example runs on a low-cost server and works. It is, however, still under development and will offer much more in the end than it does today. The pending completion is no big deal and is merely a matter of my prioritization.
Semantic search for corporate documents
Search your documents, your ticket system (e.g. Jira), your intranet pages and much more with an intelligent system. Make all your documents a knowledge base and unite your company's knowledge in an electronic brain.
For standard document types like PDF, you can easily use import routines that incur no additional costs for you. The Adobe Cloud, at least, is unnecessary in this regard. Everything that can be automated within your company leads to an up-to-date knowledge base and more free time for those who are not machines.
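As an illustration, such an import routine for PDFs can be built with free tools. The following sketch uses the pypdf library and an invented folder name; it is not the actual routine from my system.

```python
# Minimal sketch of a free PDF import routine (assumptions: the pypdf
# library is installed; the folder name "docs" is only a placeholder).
from pathlib import Path
from pypdf import PdfReader

def import_pdfs(folder: str) -> dict[str, str]:
    """Read the plain text of every PDF in a folder."""
    texts = {}
    for pdf_path in Path(folder).glob("*.pdf"):
        reader = PdfReader(pdf_path)
        # Join the text of all pages into one string per document.
        texts[pdf_path.name] = "\n".join(
            page.extract_text() or "" for page in reader.pages
        )
    return texts

if __name__ == "__main__":
    for name, text in import_pdfs("docs").items():
        print(name, len(text), "characters extracted")
```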
An AI search is not a classic search engine but a semantic search. Artificial intelligences are very good at searching structurally, semantically, or even vaguely. They are, however, bad at exact searches, although that is fundamentally possible. This is, by the way, analogous to humans.
Therefore I suggest a multi-step approach that ChatGPT cannot handle:
- Optimization: identify typos or poorly chosen terms in the search input. A query such as "CommonCrawl" then yields a suggestion for the likely intended term "Common Crawl".
- Search with a traditional search engine. This is especially sensible when searching for "Common Crawl". An AI is so underchallenged by this type of search that it delivers poor results.
- Semantic Search: This type of search is particularly well-suited for questions that are asked in natural language. An example: “Can a server's location be determined with the help of its IP address?”
- Output of an answer to a posed question in one's own words. For example, my AI answers point 3 with: "The location of a server cannot be reliably determined by IP address, as the connection between IP address and server can change at any time. However, there are methods for determining the location of a server, such as using IP geolocation or comparing metadata." The Bing AI, on the other hand, incorrectly answers with "Yes" and cites sources that justify the incorrect answer.
- Transparency: Since an AI can certainly give false answers, as Microsoft's Bing search shows, user guidance should be designed accordingly. By this I mean not only hints but also the output of sources that led to the result, and more.
For searching in this blog, I have been using a very cheap server for some time now, one that doesn't even have an AI-capable graphics card. Powerful graphics cards (CUDA-capable GPUs) from Nvidia are used for AI applications because they can perform the calculations many times faster than ordinary processors (CPUs).
As long as my server is up, clicking the links mentioned in points 1 and 2 above will yield real results from my search. I can also do semantic search, but for that I have not rented a server that is reachable on the internet. Rather, the rented AI server (server number two, unlike the aforementioned weak server) serves me for development work.
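Whether such a GPU is actually present can be checked at runtime. A minimal sketch, assuming PyTorch is installed (which is not necessarily what runs on my blog server), falls back to the CPU if no CUDA device is found:

```python
# Sketch: pick a CUDA-capable GPU if present, otherwise fall back to the CPU
# (assumption: PyTorch is installed; the tensor is only a placeholder).
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running AI workloads on: {device}")

# Any model or tensor is then moved to the chosen device, e.g.:
x = torch.randn(3, 3).to(device)
print(x.device)
```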
If you make a typo and it is recognized, my search produces the following results at stage one:

Correcting a small typo is not spectacular. However, even WordPress's own search function, which has years of development work behind it, yields no results at all if the search term as typed does not appear in the blog posts.
My search recognizes some spelling errors. For this, a vocabulary has been built from the terms that occur in (almost) all of my posts. Only these terms are "correct", i.e. suitable for a search over my documents. As an optimization, a wrong search term is corrected and entered into the search field in its likely intended form. If WordPress finds no hits at all, the results for the corrected search term are shown directly. Otherwise, constructive feedback with a "Did you mean?" hint is given.
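Such a "Did you mean?" correction can be built with surprisingly simple means. The following sketch uses Python's standard difflib module and a small invented vocabulary in place of my real term list:

```python
# Sketch of the "Did you mean?" correction (assumption: the vocabulary is
# derived from the blog posts; here a small invented list stands in for it).
import difflib

VOCABULARY = ["common crawl", "cookies", "ip address", "server", "dsgvo"]

def correct_term(term: str) -> str | None:
    """Return the most similar vocabulary term, or None if nothing is close."""
    matches = difflib.get_close_matches(term.lower(), VOCABULARY, n=1, cutoff=0.75)
    return matches[0] if matches else None

print(correct_term("CommonCrawl"))  # -> "common crawl"
print(correct_term("Cokies"))       # -> "cookies"
```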
If the search term contains no space, it is obviously not a question that an AI could competently answer. In that case, no semantic search is started, just a completely normal search.
If the search term is longer, it could be a question. First, the results of the WordPress search are displayed (if available). Then follow the results of the semantic AI search. Here's an example:

Interestingly, the classical search also finds a hit. This is probably only because my question is often used to demonstrate the performance of my AI. The search result transparently shows that one hit comes from the traditional search and 18 hits come from the fuzzy search. The fuzzy search is a vector search engine running on minimal hardware.
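At its core, such a fuzzy (vector) search compares embedding vectors. A minimal sketch, assuming the sentence-transformers library and an example model name (not my actual setup), could look like this:

```python
# Minimal sketch of a semantic (vector) search (assumptions: the
# sentence-transformers library is installed; the model name and the
# documents are examples, not the setup used for this blog).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

documents = [
    "IP geolocation can only roughly estimate where a server is located.",
    "Cookies can contain personal data and are covered by the GDPR.",
    "Zoom changed its terms of use regarding AI training.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query = "Can a server's location be determined with the help of its IP address?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the question and every document snippet.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```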
As a counterexample, here is the result from the Bing search:

As can be seen, Bing provides the answer "Yes" to the question asked. The answer is wrong because IP addresses often do not refer to a specific server, and if they do, this assignment may look different a second later.
WordPress finds no match for a question with a typo, like this one: "Are Cokies personal data?" The word "Cookies" is misspelled here with only one "o". The semantic search over a language model, however, does find a match:

The AI search succeeds with this hit. What does not become clear here, because it has not been fully programmed yet: my AI search not only delivers a document as a hit, it can also pinpoint the location of the find within the text fairly exactly. This is because the search index is built in such a way that each document is broken down into handy morsels. These morsels can be searched better than one long text. I could therefore have output the relevant morsel in the search result instead of showing the entire document.
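Breaking documents into such morsels can be done very simply. The following sketch splits a text into overlapping chunks; the sizes are illustrative assumptions, not the values used in my index:

```python
# Sketch: split a long document into overlapping "morsels" (chunks) so that
# each piece can be embedded and searched individually (chunk sizes are
# illustrative assumptions).
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

document = "A very long blog post about cookies, IP addresses and servers. " * 50
pieces = chunk_text(document)
print(len(pieces), "chunks, first chunk:", pieces[0][:60], "...")
```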
The post that was found answers the question very precisely, as the following excerpt from its text shows:

The next step is to display the answer directly in the search results, preferably abstractively. Abstractive means giving a summary in new words, as humans do. A precursor is so-called extractive summarization, which resembles a quotation.
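A crude form of extractive summarization can be built even without a language model. The sketch below simply picks the sentences of a hit that share the most words with the question; it is a stand-in for illustration, not the abstractive summary planned for my search:

```python
# Sketch of extractive summarization: select the sentences of a document
# that overlap most with the question (a deliberately simple stand-in).
import re

def extract_answer_sentences(question: str, document: str, top_n: int = 2) -> list[str]:
    q_words = set(re.findall(r"\w+", question.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", document)
    scored = sorted(
        sentences,
        key=lambda s: len(q_words & set(re.findall(r"\w+", s.lower()))),
        reverse=True,
    )
    return scored[:top_n]

doc = ("An IP address does not identify a server reliably. "
       "The assignment can change at any time. "
       "IP geolocation only gives a rough estimate of the location.")
print(extract_answer_sentences("Can a server's location be determined by IP address?", doc))
```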
Recently I have described an already implemented showcase for a question-and-answer assistant for company-owned documents. You can find details in the linked article.
Conclusion
With a company-internal AI system, numerous application cases can be solved. Such systems are data-friendly. They allow full control over data streams.
The document search example is just one of many use cases. The search logic is not yet fully programmed, but it already shows what is possible. It runs on a server that can be rented for next to nothing from a German provider, if no server of your own is available. The possibilities for adapting it to individual needs are almost limitless.
If you are willing to invest a few hundred euros per month, you get a fairly powerful AI server. With that, you can run mature language models, even in German. It is also possible to mass-produce images: instead of generating five images with DALL-E until eventually one good result appears, simply let hundreds of images be generated. Your AI can even learn which images you like and will sort out bad results in the future.
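Mass generation of images on your own hardware could, for example, be based on Stable Diffusion via the diffusers library. The following sketch is only an outline; the model name, prompt, and image count are assumptions, and a CUDA-capable GPU is required:

```python
# Sketch: generate many images locally with Stable Diffusion (assumptions:
# the diffusers library and a CUDA GPU are available; model name, prompt and
# image count are only examples).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "product photo of a red coffee mug, studio lighting"
for i in range(100):  # hundreds of variants instead of five DALL-E attempts
    image = pipe(prompt).images[0]
    image.save(f"mug_{i:03d}.png")
```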
As with all cloud services, third-party AI systems are problematic not only in terms of confidentiality but also in terms of costs (pay per use). With local systems that belong to your company, such costs do not arise. You only pay the monthly fee for your server, either as a rental price or as operating costs. These costs are manageable and attractive for anyone who genuinely benefits from such AI systems. Without a significant benefit, however, even using ChatGPT is not really worthwhile.
If data protection and confidentiality are not a problem, you can at least think about using the ChatGPT interface programmatically. Artificial intelligence makes problems of all kinds economically solvable that were previously unsolvable or solvable only with significant effort.
Please feel free to contact me if you would like your own AI system for your company, or want to use the interface of a third-party system to reduce manual work. When using interfaces to third-party AI systems, at least some data problems can be mitigated. For example, personal data can, to a certain extent, be automatically altered before it is transmitted.
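Such automated alteration could, for instance, replace obvious personal data with placeholders before the text ever reaches the third-party interface. The following sketch is a simplified assumption: the regular expressions catch only e-mail addresses and phone numbers, and the OpenAI call requires the openai package and a configured API key.

```python
# Sketch: pseudonymize obvious personal data before sending text to a
# third-party AI interface (the patterns are simplified assumptions and do
# not catch every kind of personal data).
import re
from openai import OpenAI

def pseudonymize(text: str) -> str:
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)   # e-mail addresses
    text = re.sub(r"\+?\d[\d /()-]{6,}\d", "[PHONE]", text)      # phone numbers
    return text

client = OpenAI()
safe_text = pseudonymize("Please summarize: contact max@example.com, +49 170 1234567 ...")
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": safe_text}],
)
print(response.choices[0].message.content)
```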
Key messages
Building your own AI system offers more confidentiality, flexibility, and control over your data compared to using third-party AI like ChatGPT.
Using public AI models like ChatGPT can be risky for businesses due to cost, privacy concerns, and potential biases. Building your own private AI system offers more control over data, speed, and customization.
A combination of traditional search engines and AI semantic search, with careful attention to user input optimization, provides the most effective search results.
The author's AI search engine can find information even if there are spelling errors and can provide more precise answers than traditional search engines by breaking down documents into smaller, searchable units.
Companies can benefit from using their own AI systems because they offer full control over data, are cost-effective in the long run, and can be tailored to specific needs.




My name is Klaus Meffert. I have a doctorate in computer science and have been working professionally and practically with information technology for over 30 years. I also work as an expert in IT & data protection. I achieve my results by looking at technology and law. This seems absolutely essential to me when it comes to digital data protection. My company, IT Logic GmbH, also offers consulting and development of optimized and secure AI solutions.
