Editorials

What, No Indexes?

Peter Heller posts a question in response to something I was writing about how non-traditional data storage can increase performance over the traditional techniques used today, such as indexing. He writes, in response to my statement that new techniques do not use indexes at all, “Will this be a function of the database or not? This functionality will be independent of SQL versus NoSQL, since currently they both support indexing.

We are accustomed to using clustered and non-clustered indexes for better query performance. So how does a one-size-fits-all approach in a distributed architecture perform faster than the traditional way? I am all for innovation and would like better insight into these techniques.”

Some of the newer techniques don’t necessarily use any sort of index. One software tool I used sent a request to a cloud consisting of any number of servers. The request for data was submitted to the entire cloud through a UDP broadcast. The server hosting the desired data would respond, saying, “I have it,” and be assigned the work. If more than one worker had the same data (for failover), the choice could be handled in any sort of load-leveling fashion. What is interesting here is that there is no assumption regarding how any particular worker in the cloud kept track of its data. That internal knowledge was an implementation issue at the worker level.
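To make the pattern concrete, here is a minimal in-process sketch of that broadcast-discovery idea. A real system would use an actual UDP broadcast over the network; here the “cloud” is just a list of objects, and the names (`Worker`, `broadcast_request`) are hypothetical, not from any particular product.

```python
import random

class Worker:
    def __init__(self, name, data):
        self.name = name
        self.data = data          # this worker's local slice of the data

    def handle_request(self, key):
        # "I have it" -- respond only if the key is stored locally.
        return self if key in self.data else None

def broadcast_request(cloud, key):
    # The request goes to every worker at once; collect all responders.
    responders = [w for w in (worker.handle_request(key) for worker in cloud) if w]
    if not responders:
        return None
    # More than one responder means a failover copy exists; the choice
    # can be made in any load-leveling fashion (here: at random).
    return random.choice(responders)

cloud = [
    Worker("w1", {"alpha": 1}),
    Worker("w2", {"beta": 2}),
    Worker("w3", {"beta": 2}),   # failover copy of "beta"
]

chosen = broadcast_request(cloud, "beta")
print(chosen.name in ("w2", "w3"))  # True -- one of the two copies gets the work
```

Note that nothing here walks a central index: whichever worker holds the data speaks up.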

Perhaps the worker utilized a collection, a hash table, whatever. The indexing of data depended on the software implemented at the worker. If it was using Node.js, the implementation would be different from that of pure Java, .NET, SQL, Cassandra, MongoDB, and many other mainstream products. Some of those products have their own sharding and distribution techniques built in as well.
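To illustrate that point, here is a small sketch of two workers that store their data in completely different ways but answer the same “do you have this key?” question. Both worker classes are hypothetical illustrations, not any product’s API.

```python
import bisect

class HashWorker:
    """Keeps its data in a hash table (a Python dict)."""
    def __init__(self, rows):
        self.rows = dict(rows)

    def has(self, key):
        return key in self.rows

    def get(self, key):
        return self.rows[key]

class SortedWorker:
    """Keeps its data in sorted lists and binary-searches them."""
    def __init__(self, rows):
        rows = sorted(rows)
        self.keys = [k for k, _ in rows]
        self.values = [v for _, v in rows]

    def has(self, key):
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

    def get(self, key):
        return self.values[bisect.bisect_left(self.keys, key)]

# The cloud doesn't care which is which -- the interface is the same.
cloud = [HashWorker([("a", 1)]), SortedWorker([("b", 2), ("c", 3)])]
hits = [w.get("b") for w in cloud if w.has("b")]
print(hits)  # [2]
```

The cloud-level protocol stays the same no matter how each worker tracks its own data internally.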

Implementations of this sort have the following benefits. 1) There is no central index to maintain in order to know where all the data is stored. The index cannot become inaccurate or unbalanced, and no index maintenance is needed for efficiency. 2) Because the request is submitted to all workers simultaneously, you get the power of many machines behind the task of locating the desired data, rather than one machine walking through an index. 3) If you are performing work on the data, you can pass the task to multiple machines and have it performed much more efficiently than bringing lots of data to a central location for processing. This becomes more valuable as the amount of data increases.
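The third benefit can be sketched in a few lines: ship the task out to the workers that hold the data, and move only small results back to the center, instead of pulling every row to one machine. The names here (`shards`, `run_local`) are illustrative, assuming each shard lives on a different worker.

```python
shards = [            # each list is one worker's local slice of the data
    [4, 8, 15],
    [16, 23],
    [42],
]

def run_local(task, shard):
    # In a real system this runs on the worker's own machine,
    # right next to the data it holds.
    return task(shard)

# The central node combines the small partial results, not the raw rows.
partials = [run_local(sum, shard) for shard in shards]
total = sum(partials)
print(total)  # 108
```

Only three small numbers cross the network instead of six raw rows; with millions of rows per shard, that difference dominates.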

This works today only because of the impressive network bandwidth available for server backbones. High-performance speeds are growing in availability, and the cost continues to drop year after year. Today we are getting speeds over Cat5 Ethernet that were only possible using fiber optics a few years ago. Using fiber today, the possibilities are even more amazing. So, instead of using the bus on the motherboard to get to the disk, we are using the network to get to other servers, and scaling out our performance in dramatic ways that were previously possible only at a cost affordable to large installations.

I hope this answers your question more fully, Peter.

Cheers,

Ben