The database is often the only stateful portion of a service. As such, software engineers rely on it for everything that asks "what did I say about this last time?" or "can you hold this for the next time I'm here?" Modern relational databases are extremely fast and can handle millions of these questions per second, even when working in less-than-ideal circumstances, and can store millions of rows without breaking a sweat. I've personally seen 1.5 billion row tables backing a service with nearly instantaneous data retrieval. Modern databases are so good at what they do, people sometimes forget that they can help the database do a good job.
This becomes a problem far sooner than people think, and I've had the unfortunate honor of dealing with the fallout of poor database design over the last few weeks at work. It's nearly all I can think about in my spare time, so please bear with me while I explain some principles of relational databases to the void.
Tables are spreadsheets. They are a series of rows, each with the same number of columns defined. These columns have names, like "id" or "name", as well as types, like "text" or "number." Tables are easy to understand, but the extra features around them can make things very interesting.
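To make that concrete, here's a tiny sketch in SQLite (the table and column names are made up for illustration):

```python
import sqlite3

# An in-memory database for illustration
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sightings (
        id      INTEGER PRIMARY KEY,  -- a "number" column
        name    TEXT NOT NULL,        -- a "text" column
        outfit  TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO sightings (name, outfit) VALUES ('Steve Stevenson', 'red hat')"
)
row = conn.execute("SELECT name, outfit FROM sightings").fetchone()
print(row)  # ('Steve Stevenson', 'red hat')
```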
The most powerful tool databases have for speedy data retrieval is the index. I won't bore you with the details, largely because I don't actually know enough to go much further, but there are many excellent resources out there to learn about effective index design. I default to https://use-the-index-luke.com out of habit, but that doesn't make it better than your favorite.
Indexes are exactly that: a short version of something that the database can use to look something up in a more efficient way. If your table is a notebook you keep to track people you meet and their outfit, you'd find it difficult to look up all the times you saw "Steve Stevenson" once you reached a few pages worth of names. An index would be like starting a new, smaller notebook, where you track the names in alphabetical order and the pages they appear on. Now you can look up Steve in your index quickly, and find all the pages with his outfits.
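The analogy maps directly onto SQL. A sketch in SQLite, again with hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sightings (id INTEGER PRIMARY KEY, name TEXT, outfit TEXT)"
)
# The "smaller notebook": an ordered structure keyed on name,
# pointing back at the rows where each name appears.
conn.execute("CREATE INDEX idx_sightings_name ON sightings (name)")
conn.executemany(
    "INSERT INTO sightings (name, outfit) VALUES (?, ?)",
    [("Steve Stevenson", "red hat"), ("Ada Adason", "blue scarf"),
     ("Steve Stevenson", "green coat")],
)
# This lookup can use the index instead of scanning every page.
rows = conn.execute(
    "SELECT outfit FROM sightings WHERE name = ?", ("Steve Stevenson",)
).fetchall()
print(rows)  # [('red hat',), ('green coat',)]
```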
But this illustrates the part of indexes that people don't consider: indexes aren't free. You had to start a whole new notebook to keep track of people, and keeping the list alphabetical will start to get hard. Thankfully, computers are faster than erasers, and can move data around reasonably easily. Still, if you are always writing new data, tracking it in an index that you rarely use for lookups is just a waste of space and effort. Also, an index only helps with one type of lookup. If you want to look things up by hat color as well, you'd need to start another notebook.
A key feature of relational databases is their ability to combine data from several sources into one, or to use one source to filter the responses from another. This is called a join.
There are many types of joins, and additionally a few ways they can be executed, but I won't go over them all because I likely won't do them justice. Again, there are great resources out there on what joins can do for you, and how to avoid doing bad things with joins, but at the end of the day, you're trying to efficiently combine data from 2 different notebooks based on some matching information, like name or hair style.
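A minimal sketch of a join across the two notebooks, assuming hypothetical people and sightings tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two "notebooks": one for people, one for sightings that reference them.
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sightings (id INTEGER PRIMARY KEY, person_id INTEGER, outfit TEXT)"
)
conn.execute("INSERT INTO people VALUES (1, 'Steve Stevenson')")
conn.executemany("INSERT INTO sightings VALUES (?, ?, ?)",
                 [(1, 1, "red hat"), (2, 1, "green coat")])
# Combine the two sources based on matching information (the person id).
rows = conn.execute("""
    SELECT people.name, sightings.outfit
    FROM people
    JOIN sightings ON sightings.person_id = people.id
    ORDER BY sightings.id
""").fetchall()
print(rows)  # [('Steve Stevenson', 'red hat'), ('Steve Stevenson', 'green coat')]
```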
You may have heard that joins are slow, but that's usually because people don't index correctly. If you're going to be joining the data from 2 different tables together, you should basically always index both sides of the join keys. Sometimes the join will ignore indexes for the joining operation itself, but will still use them for the predicates, so you should think about indexing those as well. Joins are weird! But if you're doing normal things with a normal number of rows, joins are also pretty straightforward, and generally fast.
Many engineers never have to think about this step, but it unfortunately comes for us all. Aggregates are functions you can use in your database queries to combine data. Sometimes this just means you combine everything into a list, and sometimes it means you sum all the numbers in a column. But usually, it's because you want to do something on a group of things.
Did that make sense?
An easy example would be something like: how many times did I see Steve in each hat? In this case, my "group" is the hat, and the "thing" is an instance of wearing such a hat. The aggregate function I would use here is COUNT, which just counts how many rows are in the matching group.
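The Steve-in-each-hat question, sketched in SQLite with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (name TEXT, hat TEXT)")
conn.executemany("INSERT INTO sightings VALUES (?, ?)", [
    ("Steve Stevenson", "red hat"),
    ("Steve Stevenson", "red hat"),
    ("Steve Stevenson", "green hat"),
])
# How many times did I see Steve in each hat?
# GROUP BY forms the groups; COUNT runs once per group.
rows = conn.execute("""
    SELECT hat, COUNT(*) AS times_seen
    FROM sightings
    WHERE name = 'Steve Stevenson'
    GROUP BY hat
    ORDER BY times_seen DESC
""").fetchall()
print(rows)  # [('red hat', 2), ('green hat', 1)]
```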
There are some issues with aggregates once queries get complicated, but they're almost all related to figuring out exactly what the query is doing and when.
On a more philosophical level, software engineers use database features because they don't want to sort and filter and transform all the data themselves. Databases have been around for a very long time and are extremely good at what they do, so we often lean on them to do the heavy lifting and filtering for us so we can think about other things. They have indexes and joins for a reason, so we should use them! But again, this is where people put themselves into sticky situations.
The issue I ran into recently happened at a shockingly small number: 3000ish entries in a table caused a specific database query's execution time to exceed 5 seconds, which set off a bunch of cascading badness in the system and eventually brought it down completely. There were 2 reasons for the collapse:
- Missing indexes on 3 of the 4 joins (technically they were subqueries but... work with me)
- Not understanding how a specific aggregate function worked
We can haggle about what went right or wrong in the decision to join so many tables together, but I came into the situation without any ability to influence that.
Let's fix the query.
Step 1: add indexes to all the join keys. As I mentioned before, indexes are basically necessary to join tables quickly. This resulted in basically no performance change because we're dealing with such puny amounts of data (3000 rows rounds to 0 for most databases), but it future-proofs us a little bit.
Step 2: fix the aggregate. The original author and I (I did the initial code review) did not quite understand how our aggregate function would behave, and ended up calling it once per entry instead of once in the whole query. We also happened to call this function 4 times in the query, causing incredible pain to our database as it tried to field all those requests at once. Running these once per query instead of once per entry was huge, but involved a much more complicated and expensive join. Still, we saved about 500ms with this shift.
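I can't share the real query, but the shape of the bug was roughly the difference below: a correlated subquery that runs the aggregate once per row, versus a single join with GROUP BY that runs it once for the whole query. This is a hypothetical reconstruction, not our actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sightings (id INTEGER PRIMARY KEY, person_id INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [(1, "Steve"), (2, "Ada")])
conn.executemany("INSERT INTO sightings (person_id) VALUES (?)", [(1,), (1,), (2,)])

# Before: the correlated subquery re-runs COUNT for every people row.
before = conn.execute("""
    SELECT name,
           (SELECT COUNT(*) FROM sightings WHERE sightings.person_id = people.id)
    FROM people
""").fetchall()

# After: one join plus one GROUP BY computes all the counts in a single pass.
after = conn.execute("""
    SELECT people.name, COUNT(sightings.id)
    FROM people LEFT JOIN sightings ON sightings.person_id = people.id
    GROUP BY people.id, people.name
""").fetchall()
print(sorted(before) == sorted(after))  # same answer, very different work
```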
Now we had a messy join to contend with, and it was causing a nearly Cartesian-product quantity of rows to be loaded before the aggregate functions could start work. This is not ideal, but it is solvable. The first option is to move some of the joining into the service instead of the database. This might slow down the process, since we have to reach out to the database more often, but might also improve performance because each of the operations will be faster. Another option is to use subqueries to push the database features to their limit, but this could slow things down if the server running our database starts to see too much traffic. I'm still testing things out, but the in-memory join is looking quite promising, shaving off about half the query time in my current testing.
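The in-memory option looks roughly like this: pull each side down with a simple query, then do a hash join in the service. A sketch with hypothetical tables, not our real code:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sightings (id INTEGER PRIMARY KEY, person_id INTEGER, outfit TEXT)"
)
conn.execute("INSERT INTO people VALUES (1, 'Steve')")
conn.executemany("INSERT INTO sightings VALUES (?, ?, ?)",
                 [(1, 1, "red hat"), (2, 1, "green coat")])

# Two cheap queries instead of one expensive join...
people = conn.execute("SELECT id, name FROM people").fetchall()
sightings = conn.execute("SELECT person_id, outfit FROM sightings").fetchall()

# ...then a hash join in the service: bucket one side by key,
# and look each row of the other side up in the buckets.
outfits_by_person = defaultdict(list)
for person_id, outfit in sightings:
    outfits_by_person[person_id].append(outfit)

joined = [(name, outfits_by_person[pid]) for pid, name in people]
print(joined)  # [('Steve', ['red hat', 'green coat'])]
```

The trade-off is exactly the one described above: more round trips to the database, but each query is simple enough that the database barely has to think.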
Databases aren't magic. They can't solve every problem you have for free, and they certainly can't do it if you don't put in the work to understand how they do what they do. Learn to not use a database sometimes, and maybe you'll keep your services running for longer than I can.