Ur-Text Document Systems
Doing stuff on computers since 1995

Some Thoughts On Setting Up Document Databases

The recent popularity of Large Language Models has overshadowed the more boring parts of document processing. The flashiness of being able to simply type a question into ChatGPT and have it output a reasonable-sounding answer seems almost magical, but it hides the proverbial sausage factory of technologies that chatbots and generative AI (as well as every other document database system) are built on. And like sausages, the quality of the final product largely depends on the quality of the ingredients you start with.

The overly simplified version of how to make an LLM goes like this - you start with a big pile of documents (the “corpus”), then you break each document down into “tokens” (in the simplest case, individual words), then these are used to create “embeddings” (which are basically a mathematical representation of how tokens relate to each other in the corpus - for instance the words “car” and “engine” tend to show up near each other in documents more frequently than, say, “car” and “botulism”). At this point you have a database of tokens and their relationships to each other, which you then use to train the fancy deep learning algorithms that power the chatbots and whatnot.
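To make the token-and-relationship idea concrete, here's a toy sketch in Python. The three-document corpus is made up, and real embedding pipelines use windowed co-occurrence counts and a great deal more math, but the core observation is the same: related words show up together.

```python
from collections import Counter
from itertools import combinations

# A made-up three-document corpus, purely for illustration.
corpus = [
    "the car engine needs gasoline",
    "the car engine was repaired",
    "botulism is a rare illness",
]

# Step 1: tokenize - here, just lowercase whitespace splitting.
tokenized = [doc.lower().split() for doc in corpus]

# Step 2: count how often each pair of tokens appears in the same
# document. Real embedding pipelines use sliding windows and turn these
# counts into dense vectors, but the signal starts here.
cooccur = Counter()
for tokens in tokenized:
    for a, b in combinations(sorted(set(tokens)), 2):
        cooccur[(a, b)] += 1

# "car" and "engine" co-occur twice; "car" and "botulism" never do.
```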

This training is where the ridiculous resource requirements come in, the ones that are pushing up the prices of GPUs and RAM. These algorithms are not only the most computationally expensive part of the process, they're the most abstruse and consequently the ones researchers and the press find most sexy. The training step is where the “intelligence” comes into an LLM, along with all the related, high-level problems of alignment, hallucination, and just being straight-up wrong. But all of this sits on top of the very unsexy process of gathering the documents together, processing them, and breaking them up into tokens in the first place.

To reiterate, document collecting and tokenizing isn't just for building LLMs, it's the basis for pretty much all document storage and retrieval systems. It's the go-to way for taking a bunch of unstructured data and breaking it up so that you can search for or interact with the contents. People think that adding a document to a document store is as simple as adding a record to a database, but it's not.

Tokenization breaks a text down into component parts, which seems more straightforward than it is in practice. Imagine we're creating a search engine index and we have a document containing a sentence like “car engines run on gasoline”. We could split the sentence into the individual words - “car”, “engines”, “run”, “on”, “gasoline” - and these words become the tokens to use in the index - if someone searches for “gasoline engines”, this document is returned because it matches both the tokens “gasoline” and “engines”.
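The index described above is essentially an inverted index, mapping each token to the set of documents that contain it. A minimal sketch, with made-up documents:

```python
# Made-up documents, keyed by ID.
documents = {
    1: "car engines run on gasoline",
    2: "diesel engines are louder",
}

# Build the inverted index: token -> set of document IDs.
index = {}
for doc_id, text in documents.items():
    for token in text.lower().split():
        index.setdefault(token, set()).add(doc_id)

def search(query):
    """Return IDs of documents matching every token in the query."""
    results = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*results) if results else set()

# search("gasoline engines") returns {1}: document 1 matches both
# tokens. Note that the singular "gasoline engine" finds nothing,
# because only the exact token "engines" was ever indexed.
```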

So what if they search for “gasoline engine”, singular? Well, here is where tokenization starts to get a bit trickier. There's a process called “stemming” we could choose to use - with stemming, each token gets stripped of its prefixes and suffixes, so “engines” becomes “engine”. You then have a couple of choices here - you can either store both tokens in the index so that the document is associated with both “engine” and “engines”, or just store the stem (“engine”) and apply the same stemming process to the user's query, so if a user searches for “engines”, that also gets changed to “engine”.
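As a sketch, here's a deliberately crude stemmer that only strips a plural “s” - real stemmers like the Porter stemmer have many more rules - applied identically to the index side and the query side:

```python
def stem(word):
    # Strip a plural "s" - a real stemmer handles many more suffixes
    # and is much more careful about when to strip them.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(text):
    """Lowercase, split, and stem - used on documents and queries alike."""
    return [stem(token) for token in text.lower().split()]

# With both sides stemmed, "gasoline engines" and "gasoline engine"
# produce the same tokens, so the searches match the same documents.
```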

Ah, but what if they search for “automobile engine”? Well, most search engines use some sort of “synonym” table - just a list of words that can safely be considered equivalent, so “car” and “automobile” are synonyms, as are “gas” and “gasoline”. It works pretty much like the stemmer above - if the tokenizer finds the word “car” in the document, it checks the synonym table and returns both “car” and “automobile” (there are some variants here, but they're not important for this discussion). I will note that synonym tables mostly make sense when you have a domain-specific corpus, because things can get way out of hand when you start saying “gasoline” and “gas” have the same meaning when “gas” is also a state of matter.
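A synonym table can be sketched as a plain dictionary - the entries below are invented for illustration:

```python
# Toy synonym table; in a real system this would be curated for the
# domain, precisely to avoid the "gas the fuel vs. gas the state of
# matter" problem.
SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
    "gas": {"gasoline"},
    "gasoline": {"gas"},
}

def expand(token):
    """Return the token plus any synonyms, for indexing or query expansion."""
    return {token} | SYNONYMS.get(token, set())
```

Whether you expand tokens at index time, at query time, or both is one of the variants mentioned above.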

There are a bunch of other potential tokenizer tools as well - things for handling contractions, punctuation and other parts of grammar, and things to handle structured data like dates and email addresses. Also there are various tools for breaking down words into chunks or creating tokens longer than one word, as well as tokenization methods that use statistical or ML analysis of the text to define tokens.
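As one example of handling structured data, a tokenizer might pull out email addresses and ISO dates as single tokens before falling back to plain word splitting. The regexes below are simplified sketches, not production-grade patterns:

```python
import re

# Simplified patterns for emails and ISO dates - real tokenizers are
# far more careful about edge cases.
STRUCTURED = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\d{4}-\d{2}-\d{2}")

def tokenize(text):
    # Pull out structured tokens first, so they survive intact...
    tokens = STRUCTURED.findall(text)
    # ...then word-split whatever remains.
    remainder = STRUCTURED.sub(" ", text)
    tokens += re.findall(r"[a-zA-Z']+", remainder.lower())
    return tokens

# tokenize("Email bob@example.com by 2024-01-15 please") keeps the
# address and the date whole instead of shredding them at punctuation.
```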

Tokenization strategies depend on the particular corpus and the end use case - if all your documents are automatically generated log files, then handling and processing is simple. But when your documents are things like emails or hand-typed reports, things get tricky, because if there's one thing people like to do, it's make typos.

It all boils down to the qualities of your data source. Log files are generated by code, and it's pretty near impossible for the word “ERROR” to come out misspelled. But there's no end of ways for people to screw up while typing - dropping and adding characters and spaces, transposing characters, or misspelling things because English orthography sucks. Or simply using the wrong word - maybe the user typed “diesel” when they meant to type “gasoline”.

Scanned documents have their own problems - modern scanning systems have gotten much better, but any document scanned more than a decade ago is going to have some very wonky character recognition errors that make it look like 1990s “leet speak”.

Some of these problems can be mitigated with tokenization techniques - imagine you have a document where gasoline is misspelled as “gasline”. There's a type of tokenizer that breaks text up into “n-grams” which are fixed-length runs of characters - commonly two letters (bigrams) or three letters (trigrams). Broken into trigrams, “gasoline” becomes the tokens “gas”, “aso”, “sol”, “oli”, “lin”, and “ine”. “Gasline” shares the trigrams “gas”, “lin” and “ine”. Not perfect, and in a search context this document would be considered a partial match, but it at least should come up in the search results.
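The trigram breakdown above is easy to sketch:

```python
def trigrams(word):
    """Break a word into its overlapping three-character n-grams."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

# "gasoline" yields {"gas", "aso", "sol", "oli", "lin", "ine"};
# the misspelled "gasline" still shares "gas", "lin", and "ine" with
# it, which is enough for a partial match.
shared = trigrams("gasoline") & trigrams("gasline")
```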

Most of these problems need custom solutions, though. If the corpus is particularly large, like the ones used to train LLMs, then deep learning algorithms may (emphasis on “may”) be able to infer that “gasline” is the same word as “gasoline”, just misspelled, but few organizations have a corpus that size. But whether your documents are going to be used for training data, put into a RAG, or just made searchable, you still need to clean things up before you start breaking them into pieces.

The first thing you need to do when setting up a document pipeline is go through the documents (or some cross-section of them) and look for obvious problems. Spell check them and see if any patterns jump out, and run some statistics on things like document size or word distribution to see if there are statistical methods for identifying edge cases (one problem with document scanners is that sometimes they'll miss entire pages or run the same page twice, which shows up as documents being shorter or longer than normal). Find some “sanity checks” - is there a title that appears on every page? Does every email have a From address from the same organization? Do the date stamps make sense?
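The document-length check, for example, can be as simple as flagging statistical outliers. The z-score cutoff below is an arbitrary placeholder, and a real pipeline would likely look at pages rather than raw word counts:

```python
import statistics

def length_outliers(docs, z_cutoff=3.0):
    """Return indices of documents whose word count is a statistical outlier."""
    counts = [len(doc.split()) for doc in docs]
    mean = statistics.fmean(counts)
    stdev = statistics.stdev(counts)
    if stdev == 0:
        return []
    return [i for i, count in enumerate(counts)
            if abs(count - mean) / stdev > z_cutoff]

# A document ten times longer than its neighbors - say, a scanner that
# ran the same page over and over - gets flagged for inspection.
```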

Once you have this then you need to figure out if there are any automated ways of handling the variations in the data - synonym lists, hard-coded search and replace, splitting concatenated documents or concatenating split documents. Then you need to set up a system for kicking documents out for manual processing.
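A minimal sketch of that split between automated fixes and manual review - the fix list and the review rule here are invented examples:

```python
import re

# Hard-coded search-and-replace fixes discovered while surveying the
# corpus; the "gasline" entry is a made-up example.
FIXES = [(re.compile(r"\bgasline\b"), "gasoline")]

def clean(doc):
    """Apply the automated search-and-replace fixes."""
    for pattern, replacement in FIXES:
        doc = pattern.sub(replacement, doc)
    return doc

def needs_review(doc):
    """Kick suspiciously short documents out for manual processing."""
    return len(doc.split()) < 5
```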

Manual handling requires people, which means security, access control and management. It's going to need databases, UIs and some place to host it. It's going to need its own workflow, validation steps, tracking and reporting. At least one person in the organization is going to have new responsibilities so HR needs to be involved.

Setting up a document database is a complex process; you can't just suck up a pile of documents and call it a day. I haven't even mentioned the types of decisions you have to make based on the intended use case - in-house tools can be more complex than public ones, and RAG requirements are different from those of simple search tools. Even the question of whether an end user can be expected to refine their searches or just accept the first results affects how the index is set up, and consequently how the documents are cleaned and tokenized. It's best to have all this figured out ahead of time.