This is a Q&A for my articles about data persistence.
Part 1: Data storage and retrieval from first principles
Part 2: Indexing/Querying strategy with BoltDB
Q: Why are you reinventing the wheel? Relational databases can do all of that and more
The whole exercise is about reinventing the wheel. Yes, relational databases do a lot more than what my system does; that’s exactly why I want to reinvent the wheel.
When I load an object, I don’t want to also run a query language interpreter. I just want to get the data directly, with the minimum work required. I want to minimize memory copies. I want to minimize disk loads. I want to minimize the amount of code I need to write/maintain for dealing with data persistence.
Yes, relational databases do a lot of things, but most of what they do is either bloat I don’t care about, or waste that I want to remove.
There are very few things that I do want from a persistence system, and those are the things that I sought to implement.
What’s more, I was able to get something a relational database doesn’t provide out of the box: indexing with automatic sorting by relevance. So I not only replaced the need for a relational database, I also replaced the need for a separate “Redis”-style database that would otherwise have to be used in addition to the main data storage engine.
Q: Your reasoning is not sound! What’s the context here? What specific goals do you have in mind?
The point is to come up with a more sensible “default” baseline for developing a web app that performs reasonably well and is also simple to deploy/update.
The simplicity of deployment is a whole other story with many more requirements that I will hopefully outline in a future article, but one of the requirements for such a system is to have a data persistence story based on “database as an embedded library”.
Initially I was going with SQLite, but I found myself procrastinating a lot. Developing the backend persistence and querying story around SQL was such a bummer that a significant portion of my mind was strongly opposed to continuing to work under those conditions. That’s why the first part of the first article in the series outlines some of the problems I was having with SQL-based relational databases.
Q: In that case, why have a “database” at all? Just read from disk, work from RAM, dump everything back to disk when you quit
There are a few reasons why it makes sense to have a persistence engine that writes data to disk on demand rather than just dump everything occasionally or when the program quits.
I don’t want a crash to cause data loss.
I want the system to work even when data does not fit in RAM.
I want the system to work even when data indices do not fit in RAM.
I want to eventually extend the system to allow for “horizontal” scaling where data is sharded and duplicated across multiple machines.
I want to minimize disk writes (frequent huge writes reduce the lifespan)
In the case where data does fit in RAM, we don’t lose anything, because BoltDB memory-maps the file and has the capability to pre-populate the memory map (at least on Linux).
In addition, it’s worth noting that writing too much data to disk reduces the disk’s lifetime. One of the benefits of a B-Tree based storage engine is that writing data touches the minimum possible number of disk pages. Imagine dumping 1GB of data every hour versus just writing 16KB every now and then.
Q: Why not just write files to the file system directly?
For regular object serialization, we probably could just write to the file system, under a path like “<type>/<id>”. We don’t often need to iterate over a list of objects sorted by their ids, so we don’t make use of the B-Tree property directly. The storage system, of course, makes use of it to find the location of that object’s data quickly, whether that storage system is BoltDB or the regular filesystem.
For indexing though, we do utilize the B-Tree property directly. We need to iterate over entries, in order, quite quickly. If we were to store the index files on the file system directly, the file content itself would need to be structured as B-Trees.
Now, BoltDB gives us a bunch of additional benefits, like memory mapping, along with the ability to preload the entire file into RAM, which should help improve performance. In addition, it gives us transactions, which might actually hurt performance but increase stability and reliability. I don’t yet know whether we need that (ACID compliance) or not, but for now I’m opting to have that property on by default.
Now, one concern I do have with writing to the filesystem directly is wasting disk space. Most of our objects are actually rather small. I know that at least some file systems allocate 8KB per file even if the file content is just 1 byte. Most file systems also store metadata per file, which is another source of waste that we don’t need for our system.
Q: Is your system even reliable? It seems rather fragile
BoltDB should be quite reliable in terms of giving you back the exact bytes you put in. It’s also fully ACID compliant, guaranteeing atomicity and consistency.
My indexing system should be rather reliable. I don’t expect to see the explosion in disk usage you typically see at web companies where developers haphazardly cache data in Redis without a proper system in place.
There’s only one function you need to call to update the terms that refer to a target, and that function automatically deletes the terms you had previously used that are not in the new list.
This is the fundamental problem that I wanted to solve with this system. Without this solution, the whole system makes no sense.
The function that persists an object is the same function that updates all the indices for that object. There’s no bookkeeping.
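The diffing logic described above can be sketched as follows. This is a toy in-memory stand-in (plain Go maps instead of BoltDB buckets), and the `Index`/`UpdateTerms` names are mine, not the article’s; the point is the single entry point that removes stale terms automatically.

```go
package main

import "fmt"

// Index is an in-memory stand-in for the index buckets. byTerm maps
// term -> set of targets; byTarget remembers which terms currently
// refer to each target, so stale entries can be found and removed.
type Index struct {
	byTerm   map[string]map[string]bool
	byTarget map[string][]string
}

func NewIndex() *Index {
	return &Index{byTerm: map[string]map[string]bool{}, byTarget: map[string][]string{}}
}

// UpdateTerms is the single function callers use: it inserts the new
// terms and deletes any previously-used term absent from the new list.
func (ix *Index) UpdateTerms(target string, terms []string) {
	keep := map[string]bool{}
	for _, t := range terms {
		keep[t] = true
	}
	// Remove entries for terms that are no longer in the list.
	for _, old := range ix.byTarget[target] {
		if !keep[old] {
			delete(ix.byTerm[old], target)
		}
	}
	// Insert entries for the new term list.
	for _, t := range terms {
		if ix.byTerm[t] == nil {
			ix.byTerm[t] = map[string]bool{}
		}
		ix.byTerm[t][target] = true
	}
	ix.byTarget[target] = terms
}

func main() {
	ix := NewIndex()
	ix.UpdateTerms("post:1", []string{"go", "db"})
	ix.UpdateTerms("post:1", []string{"go"}) // "db" entry removed automatically
	fmt.Println(len(ix.byTerm["db"]), len(ix.byTerm["go"])) // 0 1
}
```

Because the caller only ever supplies the *current* term list, there is no separate cleanup step to forget, which is the “no bookkeeping” property.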
Q: How does your indexing system handle deleting objects? How do you handle the cascading effects?
Cascading deletes are not something that should happen in most systems. It’s a problem you might face in relational databases because of how they need to guarantee the “integrity” of foreign keys, but it’s actually more headache than it’s worth.
Most people in web development actually prefer not to delete objects, but to just mark them as deleted (e.g. by having a boolean column named ‘is_deleted’). It’s not that they necessarily want to preserve the data. It’s because deleting rows from a relational database can cause a lot of headaches, especially when there are other objects (that you do want to preserve) that reference the object you want to delete.
In the system I’m currently working on, I almost never “delete” anything, so it’s not even a problem I’m worrying about right now, but it’s still worth a little bit of discussion.
When you delete an object, you update the indices by setting that target’s terms to nil. This effectively deletes all references to that target from the index.
You can keep the code easy to maintain by having the same function that persists an object take a flag that determines whether you want to update or delete. This way you have a single code path, and you don’t need to remember to keep the list of indices in sync between the “update” and the “delete” functions.
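The single-code-path pattern might look something like the sketch below. The `persistPost`/`updateTerms` names and the map-backed index are hypothetical stand-ins for the real BoltDB-backed routines; what matters is that delete is just persist with nil terms.

```go
package main

import "fmt"

// index is a stand-in for the real term index: target -> its current terms.
var index = map[string][]string{}

// updateTerms treats a nil term list as "drop every reference to target".
func updateTerms(target string, terms []string) {
	if terms == nil {
		delete(index, target)
		return
	}
	index[target] = terms
}

// persistPost is the one code path for both update and delete: the flag
// decides whether the term list is applied or nilled out, so the list of
// indices can never drift between an "update" and a "delete" function.
func persistPost(id string, tags []string, deleted bool) {
	terms := tags
	if deleted {
		terms = nil
	}
	// ...write or remove the object record itself here...
	updateTerms(id, terms)
}

func main() {
	persistPost("post:1", []string{"go", "db"}, false)
	persistPost("post:1", nil, true) // delete: same path, nil terms
	fmt.Println(len(index))          // 0
}
```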
One potential issue an astute reader might notice is the following: if we delete an object whose id is the term in the index (rather than the target), then we will have a bunch of stale entries.
The way that data is laid out makes it easy to delete a term from the index; I just never wrote the code to do it because I never needed it.
Now, I actually would not necessarily delete the entries even in this case. Like I said previously: the index is just an acceleration structure. The “source of truth” lies in the objects, which are stored in data buckets.
The index is not where I store many-to-many relationships. I store those in the target object. The ‘target’ object would have a list of ids that it references.
Let’s talk about a concrete example to make things clearer.
Let’s imagine a system where users can send direct messages to each other or be involved in a group conversation. In a relational database, you would have a ‘users_conversations’ table where you add an entry with ‘user_id, conversation_id’ to link users with the conversations they are participating in.
In my case, I would either store a list of `conversationIds []int` in the user object, or a list of `userIds []int` in the message object. The index would then be used to make the reverse lookup fast & simple.
The rule of thumb I follow is that a target has a few terms, while a term has potentially very many targets. If it’s likely that a conversation has many users while a user only participates in a few conversations, I would store `conversationIds []int` on the user. If, on the other hand, a conversation has only a few users while a user can potentially be in many conversations, then I would store `userIds []int` on the conversation object.
Now, let’s say we store `userIds` in the message object. This means that for the reverse-lookup index, the message id is the target, and the user id is the term.
If a user deleted their account, what should happen to the message? Should it “forget” about the user? I would say it shouldn’t, and as such there’s no reason to remove the user’s id from the `userIds` list on the message object.
If you do want deleting a user to also erase any record of their existence from the system, that is quite a big operation. In a relational database, this would be a cascading delete, and it can be quite dangerous if you don’t know exactly what’s going on. Do you want such potentially disastrous operations to just be carried out automatically? Anyone who’s ever had a chat with said user will lose a portion of their chat history. Is that a desired feature of the product? Maybe it is, or maybe it’s not. Maybe you want to keep that part of the data in the system and just display the user as “Deleted User”.
No matter which choice you pick, you should want to be explicit about the nature of the operation; you should not just let the persistence engine automatically delete things in a cascading manner.
Enjoyed this series; I found your indexing technique, in particular, to be ingenious!
Let me add my two cents:
I've actually used BoltDB at my startup as my app's primary data storage. Eventually, I switched to Postgres.
In my opinion, what makes a database a "proper database" is a querying language. Otherwise, you just have a storage with nice features.
The ability to write ad-hoc, one-off queries is crucial for startups when you need to get answers to questions (from stakeholders/users/devs) while your app is running.
In the case of an embedded database, you'd have to:
- Write new custom code to query data and redeploy;
- Copy your data from the production environment to a local one, then write your code and run it.
Also, while having a single-file database with a process that has a lock on it is convenient, it becomes a huge pain when you want other processes/apps to connect to the same data source.
As always, this is all very context-dependent. I think BoltDB and similar kv-storages can be a great default option for a lot of cases.