Sharding Instead of Partitions

Yesterday we talked about partitioned tables. The goal was to spread the load over multiple resources allowing for large amounts of data to be live and responsive at the same time. Spreading data over multiple tables allows the data to be stored in the most optimized state with the least amount of overhead. It also improves disaster recovery.

Using partitioned tables simply places the data resources on different files. Today we’ll talk about taking it to the next level, where, even though we are using a traditional SQL storage engine, we are distributing the load even further. Rather than simply using different tables for data, and making them appear as a single partition, we can actually shard the data, and place it into separate databases altogether.

If you distribute data into multiple databases, you can, in theory, actually host those different databases on more than one server, resulting in additional hardware resources. Why would you even consider doing that? Why not simply get a bigger server with more power instead, and keep everything together on a single database?

The answer to these questions is simply scalability. You can buy the biggest database server you can find and you will ultimately end up at a limit it cannot handle. That would have to be a huge database, so sharding isn’t a good fit for any and every application. When you have that awesome breakthrough where your application is gathering data at an alarming rate, or you are hosting massive stores of data for a data warehouse then sharding can be a valuable asset.

Just like using a partitioned table, sharding uses some sort of broker that pipes the query to the appropriate database. If you have looked into the Microsoft SQL Server appliance scenarios, they will shard in a number of ways. The parallel data warehouse can host a table sharded across multiple instances. It can also have mirrored data for a single table across multiple instances. That allows queries that join things like lookup tables to not have to perform queries across multiple databases. This parallel processing allows for huge leaps in performance.

There are many different implementations of database sharding using smaller SQL Engines such as MySql resulting in massive performance, simply by having multiple instances working as a single unit through sharding.   So, now you have an introduction to two different technologies helping you grow your persistence to new levels.