Problems with Scaling Out

How do we get the performance we have come to expect from Google? Data and workload are distributed across a large fabric of small computers, and the sum of those parts delivers performance greater than that of a supercomputer.

Google has proven the architecture of Scaling Out with dramatic performance at an amazingly low cost.
More and more systems are being architected to follow the same pattern of sharding work and data, which introduces a set of problems we don't usually experience in a relational database. Now we have the need to manipulate sets of data with Unions, Merges, Joins, Intersections, etc. outside of the SQL syntax.
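To make that concrete, here is a minimal sketch of what those set operations look like once they move out of SQL and into application code. The shard results are simulated as plain lists of (id, name) rows; in a real system each list would come back from a separate database server.

```python
# Sketch: once data is sharded, set operations that SQL would normally
# handle must be reimplemented in application code. Each "shard result"
# here is just a list of (id, name) rows.

def union_all(*shard_results):
    """UNION ALL across shards: concatenate every row, duplicates kept."""
    rows = []
    for result in shard_results:
        rows.extend(result)
    return rows

def union_distinct(*shard_results):
    """UNION across shards: deduplicate by primary key (first column)."""
    seen = {}
    for result in shard_results:
        for row in result:
            seen[row[0]] = row  # last write wins on duplicate keys
    return sorted(seen.values())

def intersection(a, b):
    """INTERSECT: rows from a whose keys also appear in b."""
    keys_b = {row[0] for row in b}
    return [row for row in a if row[0] in keys_b]

shard1 = [(1, "ann"), (2, "bob")]
shard2 = [(2, "bob"), (3, "cat")]
combined = union_distinct(shard1, shard2)   # three distinct rows
overlap = intersection(shard1, shard2)      # just (2, "bob")
```

None of this is hard in isolation; the point is that work the database engine used to do for free now has to live, and be maintained, somewhere in your code.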

One vendor has built this capability into a product that allows a sharded set of MySQL databases to operate as a single database by presenting the consumer with a single MySQL connection that handles all of the sharding work behind the scenes.
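The core idea behind such a layer can be sketched in a few lines: present one connection-like object, and route each request to the backend that owns the shard key. The class and backend names below are my own illustration, not the vendor's actual API.

```python
# Hypothetical sketch of shard routing behind a single connection.
# A deterministic hash maps each shard key to one backend, so the
# caller never needs to know which physical server holds the data.
import zlib

class ShardedConnection:
    def __init__(self, backends):
        self.backends = backends  # one entry per physical MySQL server

    def backend_for(self, shard_key):
        # CRC32 is deterministic across runs, so a given key always
        # lands on the same shard.
        digest = zlib.crc32(str(shard_key).encode("utf-8"))
        return self.backends[digest % len(self.backends)]

conn = ShardedConnection(["mysql-shard-0", "mysql-shard-1", "mysql-shard-2"])
# Every request for customer 42 is served by the same backend:
home_shard = conn.backend_for(42)
```

Real products layer much more on top of this (resharding, cross-shard queries, replication), but key-based routing is the heart of it.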

You could even consider Partitioned Tables in SQL Server to be a similar methodology: they bring data from different objects together in a common endpoint familiar to the SQL consumer. The big difference here is that all of the individual tables that make up a Partitioned Table reside in a single instance of SQL Server.
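As an analogy (not SQL Server's actual partitioning syntax, which is transparent to the consumer), the idea of one familiar endpoint over several physical tables in a single instance can be illustrated with a hand-built view, here using SQLite for a self-contained example:

```python
# Analogy sketch: a view that unions several physical tables inside ONE
# database instance, so consumers query a single familiar endpoint.
# SQL Server partitioned tables achieve this transparently; the
# hand-built view below only illustrates the shape of the idea.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_2023 (id INTEGER, amount REAL);
    CREATE TABLE sales_2024 (id INTEGER, amount REAL);
    INSERT INTO sales_2023 VALUES (1, 10.0), (2, 20.0);
    INSERT INTO sales_2024 VALUES (3, 30.0);
    CREATE VIEW sales AS
        SELECT * FROM sales_2023
        UNION ALL
        SELECT * FROM sales_2024;
""")
# The consumer sees one object, "sales", not the per-year tables.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

The contrast with sharding is exactly the point of the paragraph above: here the union happens inside one engine, not across a fleet of servers.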

What I’m curious about is the separation of data for the purposes of scaling out or other scenarios. It seems to me that the database engine is not the place to bring disparate data together using linked servers, etc. Why? Because you are still placing all of your eggs in one basket, the database server, and you are slowing down the server from performing its basic function of providing access to data. A better architecture would pull the data together in a different tier. A data access tier of your application would handle this much better than stacking up all this additional work on a database server.
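A minimal sketch of that data access tier, under my own assumptions about names and schema: fan the query out to several databases in parallel and merge the results in the application, so each database server only does its basic job of serving its own data. Two in-memory SQLite databases stand in for separate servers.

```python
# Sketch of a data access tier: query several databases concurrently
# and combine the results in the application, instead of asking one
# database server to pull remote data via linked servers.
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def make_shard(rows):
    """Stand-in for a connection to one physical database server."""
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    db.commit()
    return db

shards = [make_shard([(1, 5.0), (2, 7.5)]),
          make_shard([(3, 2.5)])]

def fetch(db):
    return db.execute("SELECT id, total FROM orders").fetchall()

# Fan out to every shard in parallel; merge in the access tier.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(fetch, shards))
merged = sorted(row for rows in results for row in rows)
```

Each database answers only its own simple query; the cross-server work of combining rows stays in the application tier, where it can scale independently of the database servers.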

If it isn’t obvious, I’d like to clarify that this kind of design is not intended for the average application. Most applications will have all of their data on a single data storage type, in a single instance of the engine, and many times within a limited number of databases, or often a single database.

This kind of architecture is applied when your application must scale to some very large numbers. Once you reach that kind of scale, you need to solve the problem of managing data in a distributed arrangement.

I’m curious how many of you have worked, or are now working, on systems requiring large-scale performance. Is this mostly a data warehouse problem? Are you finding you need to solve large-volume OLTP as well? Why not drop me a note if the need to scale out resonates with you, or if you are working in a distributed data environment? You can also leave your opinion below regarding the need to massively scale data.