So, lately, I’ve been thinking about developers and their database workloads. From what I’ve gathered speaking with many developers, the narrative has shifted significantly over the past decade, away from the SQL vs. NoSQL debate. A few years back, I recall reading Michael Stonebraker’s 2010-11 ACM blogs on this topic. My takeaway back then was “just use SQL databases”. A few days ago, I revisited the same blogs, and boy oh boy, have things evolved — or at least my understanding of the nuances has.
At that time, Stonebraker highlighted the necessity of automatic sharding for multi-node OLTP performance. He pointed out that significant single-node performance overhead often results from the database’s interaction with the application layer, and that the burden on core database operations typically comes from components like logging, latching, locking, and buffer management. Improving performance in such systems requires eliminating one or more of these components, and his argument was that NoSQL databases did exactly this by relaxing ACID transactions.
While many of his points still hold true, especially those concerning the application layer overhead, there has indeed been a shift. Since then, the landscape of databases, the art of choosing the right one, and our knowledge of their workload scenarios have greatly evolved.
A couple of the key points debated back then were sharding and ACID transactions. To be honest, native sharding in the SQL space around 2010-11 was somewhat hard to fully wrap one’s head around. However, there were certainly fascinating cases of application-level sharding, like at Instagram. Fast forward to today, and there are exciting SQL-based technologies coming to market with native sharding and scale-out capabilities, such as PlanetScale and Neon. In a similar vein, while SQL databases have long been equipped with ACID transactions, NoSQL databases are rapidly catching up. A case in point is MongoDB, which now supports multi-document ACID transactions.
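To make that last point concrete, here’s a minimal sketch of a multi-document transaction using MongoDB’s Python driver. The database, collection, and field names (`bank`, `accounts`, `balance`) are hypothetical, and transactions require a replica set or sharded cluster:

```python
# Minimal sketch of a MongoDB multi-document ACID transaction with PyMongo.
# The "bank" database and "accounts" collection are made-up examples.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
accounts = client.bank.accounts

with client.start_session() as session:
    with session.start_transaction():
        # Both updates commit atomically, or neither does.
        accounts.update_one({"_id": "alice"}, {"$inc": {"balance": -100}}, session=session)
        accounts.update_one({"_id": "bob"}, {"$inc": {"balance": 100}}, session=session)
```

If the body of the transaction raises, the driver aborts it; otherwise it commits on exit — the same all-or-nothing guarantee SQL folks have long taken for granted.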
Circling back, what I believe the general discussions lacked then was an analysis of how applications actually access the data stored in the database! SQL databases are often the better option for storage optimization and ad-hoc queries in normalized, vertically scaled server/cloud instance models. However, when the access patterns are repeatable and consistent, that’s when NoSQL really shines. By allowing the developer to design the data model around the application’s access patterns, NoSQL can support transactional and operational workloads at scale.
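To illustrate with a completely made-up example: in a document model, everything a “show the order” screen needs can be embedded in one document, so a known access pattern becomes a single predictable lookup rather than the multi-table join a normalized schema would require.

```python
# Hypothetical example: an e-commerce order modeled around its access pattern.
# A normalized SQL schema would join orders, customers, and order_items;
# here the document embeds everything that one known read path needs.
order_doc = {
    "_id": "order-1001",
    "customer": {"id": "cust-42", "name": "Ada Lovelace"},
    "items": [
        {"sku": "SKU-1", "qty": 2, "price": 19.99},
        {"sku": "SKU-7", "qty": 1, "price": 5.49},
    ],
    "status": "shipped",
}

# The application's access pattern is then one predictable query:
# orders.find_one({"_id": "order-1001"})
```

The trade-off, of course, is that ad-hoc questions the model wasn’t designed for become harder — which is exactly why the access patterns need to be repeatable and consistent.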
Interestingly, I’ve read about many cases where both kinds of databases are used together: NoSQL for the operational side of the workload, handling a limited set of known access patterns from the application (typically CRUD operations), with the data then ETLed into a SQL-based data warehouse for more complex analytical processing.
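Here’s a minimal sketch of that pattern, assuming PyMongo on the operational side and psycopg2 against a Postgres-flavored warehouse. The collection, table, and column names are all made up, and a real pipeline would batch, dedupe, and track incremental state:

```python
# Toy sketch of the operational-NoSQL -> SQL-warehouse pattern.
# "shop.orders" and the flat warehouse "orders" table are hypothetical.
from pymongo import MongoClient
import psycopg2

mongo = MongoClient("mongodb://localhost:27017")
pg = psycopg2.connect("dbname=warehouse user=etl")

with pg, pg.cursor() as cur:  # connection context manager commits on success
    for doc in mongo.shop.orders.find({"status": "shipped"}):
        # Flatten the nested document into the warehouse's relational shape.
        cur.execute(
            "INSERT INTO orders (order_id, customer_id, total) VALUES (%s, %s, %s)",
            (
                doc["_id"],
                doc["customer"]["id"],
                sum(i["qty"] * i["price"] for i in doc["items"]),
            ),
        )
```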
Today, we know that both SQL and NoSQL databases have demonstrated vast performance and functional improvements, with each earning its place in the application stack. And I haven’t even delved into the evolution of purpose-built databases such as Elasticsearch or TimescaleDB. As the old adage goes, in the end, it is all about developers and their workloads.
So, here’s the crazy part: let’s say we start analyzing our workloads and eventually settle on our database of choice. What’s next? Well, my experience tells me that our real journey is just getting started.
As we developers flesh out the intricacies of our design choices, especially those surrounding the relationship between our databases and applications, a gazillion more considerations arise. Some of these include infrastructure setup and maintenance, database scaling, and managing backups, but in recent times, fully managed cloud-based database services have taken most of this burden away from developers. In fact, it is incredible to me how, in a few clicks, we can get modern DB clusters up and running. This truly signifies the fascinating evolution of technology [1].
Nevertheless, our work as developers is not yet done. An enormous chunk of the complexity lies in the layers between the application and the database: managing schemas (a task that becomes increasingly complex as we add new application features), choosing the appropriate database programming interface, and writing correct queries. Other critical layers include query controls like rate limiting and caching, health checks and database audits, planning well-documented APIs, setting up observability metrics, and finally, enforcing security, including a flexible access control mechanism.
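To give a flavor of just one of these layers, here’s a toy sketch of query controls: a read-through cache with a TTL, plus a crude token-bucket rate limiter, sitting in front of whatever call the application makes to the database. All the names and limits here are illustrative assumptions, not any particular product’s API:

```python
# Toy sketch of two "query control" layers in front of a database call.
# run_query is a stand-in for the application's actual driver call.
import time

CACHE_TTL_SECONDS = 30
_cache: dict[str, tuple[float, object]] = {}

RATE_PER_SECOND, BURST = 5.0, 10.0
_tokens, _last_refill = BURST, time.monotonic()

def _allow() -> bool:
    """Refill the token bucket based on elapsed time, then try to spend one token."""
    global _tokens, _last_refill
    now = time.monotonic()
    _tokens = min(BURST, _tokens + (now - _last_refill) * RATE_PER_SECOND)
    _last_refill = now
    if _tokens >= 1:
        _tokens -= 1
        return True
    return False

def cached_query(sql: str, run_query) -> object:
    """Serve from cache while fresh; otherwise rate-limit and hit the database."""
    hit = _cache.get(sql)
    if hit and time.monotonic() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    if not _allow():
        raise RuntimeError("rate limit exceeded; back off and retry")
    result = run_query(sql)
    _cache[sql] = (time.monotonic(), result)
    return result
```

Even this toy version hints at how quickly the layer accumulates policy decisions: TTLs, burst sizes, what to do on a limit breach — all things someone has to own.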
Ideally, we want the first principles of software development applied to these layers: version control, collaboration, knowing when performance is degrading, being able to recover when something goes wrong, and mitigating regressions before they occur.
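To pick on schema management for a second: in practice, “version control for the database” often looks like migrations checked in as code. Here’s a hypothetical Alembic migration as a sketch; the revision ids, table, and column are made up, and the downgrade function is the recovery path:

```python
# Hypothetical Alembic migration, checked into version control like any
# other code. upgrade() applies the change; downgrade() is the rollback path.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"        # placeholder revision id
down_revision = "f6e5d4c3b2a1"   # placeholder parent revision

def upgrade() -> None:
    op.add_column("users", sa.Column("email", sa.String(255), nullable=True))

def downgrade() -> None:
    op.drop_column("users", "email")
```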
At Neurelo, we’ve been researching this complexity, and we are building an API-based abstraction layer that improves the developer experience for this last piece of the puzzle. Ultimately, we believe every developer must consider this abstraction in their overall stack.
[1] Recently, I read about how NASA asked IBM to build what is today known as IMS, to track the million-plus parts that went into building a Saturn V rocket for the Apollo program. This was before Codd’s relational model paper, and using it in the ’70s required commissioning an IBM mainframe. The contrast between then and now underscores how intriguing this evolution is!