Editorials

Uber Going Schemaless

The topic today is based on the design, as I understand it, for the new Uber back end, Schemaless. From what I have read of their design, they have moved the schema out of the storage engine and into their API and client services. As a result, they can alter their schema without a heavy conversion of the database hosting the data. In short, they are simply using MySQL (having replaced Postgres with it) as blob storage, or more accurately as a key-value store, where the key is a real scalar column in MySQL and the value is simply a blob data type.

This means that the schema is now embedded in the blob. The server and the client consuming the blob know how to deserialize the contents into real objects the system understands and maintains. MySQL was used to enable co-location and indexing on the key values. I don’t know if they are mining the contents of the blob data without deserializing it. I know that with XML data types in SQL Server you can index the contents and also query across them. However, in order to do that, you typically use an XML Schema for the XML contents. So your database engine is still schema aware, just not in an efficient manner.
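To make the idea concrete, here is a minimal sketch of a key-value table where the engine only sees a key and an opaque blob, and the client owns the schema. Everything in it is my assumption for illustration: SQLite stands in for MySQL, JSON stands in for whatever serialization Uber really uses, and the `Trip` class, the `kv` table, and the helper names are hypothetical.

```python
import json
import sqlite3
from dataclasses import dataclass, asdict


@dataclass
class Trip:
    """Hypothetical client-side schema; the database never sees these fields."""
    rider: str
    fare_cents: int
    currency: str = "USD"  # added "later"; older blobs without it still load


def open_store() -> sqlite3.Connection:
    # SQLite is a stand-in for MySQL here: one scalar key column, one blob.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB NOT NULL)")
    return conn


def put(conn: sqlite3.Connection, key: str, trip: Trip) -> None:
    # Serialize the object into a blob; the engine stores it opaquely.
    blob = json.dumps(asdict(trip)).encode("utf-8")
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, blob))


def get(conn: sqlite3.Connection, key: str):
    # Deserialize the blob back into an object the client understands.
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    if row is None:
        return None
    return Trip(**json.loads(row[0].decode("utf-8")))
```

Adding the `currency` field required no ALTER TABLE; only the client code changed. The flip side is that the engine cannot index or query on `fare_cents` without help, since it only sees bytes.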

Since they rolled this solution themselves, using MySQL sharding techniques to increase performance, I’m curious why they used an SQL engine in the first place. Sure, MySQL sharding is cool, and there are a number of utilities to do that. However, there are a lot of other open systems storage engines that handle key-value data and offer failover, co-location, high performance, and everything else they can get with this homegrown implementation. CouchDB, MongoDB, and Cassandra are three front runners that come to mind. It’s the ACID capabilities in MySQL and Postgres that are killing them.

I’m probably over-simplifying this somewhere. I say that because their big complaint was non-clustered indexes slowing down updates, so there is probably more here than a simple key-value implementation. This brings me back to a number of editorials I have done this year on indexing. Also, they could have put a lot more work into their database design, replacing updates (their key problem) with inserts.
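As a rough illustration of what I mean by replacing updates with inserts, here is a sketch of an append-only table where every change is a new versioned row and a read picks the highest version. This is my own example and not a description of Uber's design: SQLite again stands in for MySQL, and the `cells` table and the function names are hypothetical.

```python
import json
import sqlite3


def open_cell_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Rows are never updated in place; (k, version) identifies each write.
    conn.execute(
        "CREATE TABLE cells ("
        " k TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " v BLOB NOT NULL,"
        " PRIMARY KEY (k, version))"
    )
    return conn


def write(conn: sqlite3.Connection, key: str, value: dict) -> int:
    """Insert a new version for the key and return its version number."""
    with conn:  # one transaction: find the latest version, then insert
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM cells WHERE k = ?", (key,)
        ).fetchone()
        version = latest + 1
        conn.execute(
            "INSERT INTO cells (k, version, v) VALUES (?, ?, ?)",
            (key, version, json.dumps(value).encode("utf-8")),
        )
    return version


def read_latest(conn: sqlite3.Connection, key: str):
    """Return the newest value for the key, or None if it was never written."""
    row = conn.execute(
        "SELECT v FROM cells WHERE k = ? ORDER BY version DESC LIMIT 1", (key,)
    ).fetchone()
    return None if row is None else json.loads(row[0].decode("utf-8"))
```

The trade-off is storage growth and the need to trim old versions at some point, in exchange for writes that never rewrite an existing row.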

Well, it’s easy to sit here on the sidelines and critique. They were at least brave enough to publish the things they were facing to the world at large. It provides us with real-world problems to help us become better designers ourselves.

Here are some links if you want to dig into this topic further…

Uber Schemaless Database

https://eng.uber.com/schemaless-part-one/

Uber Schemaless Architecture

https://eng.uber.com/schemaless-part-two/

Martin Fowler’s Slide Presentation on Schemaless

http://martinfowler.com/articles/schemaless/#conclusion

Is Mongo really Schemaless?

https://blog.jooq.org/2014/10/20/stop-claiming-that-youre-using-a-schemaless-database/

Cheers,

Ben