Editorials

Storing Evolving Data

Yesterday I presented the problem of a database with ever-evolving data requirements. There are two approaches to handling this kind of problem. The first approach, which is what I implemented, is to use the database the way it was designed: if you have new data, it is saved in its own normalized tables, just as you would have done if you had known about the data when you first designed and delivered the database. This is the most performant and widely accepted pattern for handling evolving data.

Today I want to share some other approaches that can be implemented in an SQL database, allowing the requirements to evolve without having to change the database structure, and maybe without even having to change the application. All of these techniques usually come with slower performance, and they don't necessarily provide a pleasant user experience.

One implementation that has been around for a long time is the EAV (Entity–Attribute–Value) model. This implementation uses tables you create in SQL Server that emulate what the database engine already has built in. Instead of defining an actual table in your SQL database, you use your own tables to define tables, columns, constraints, and actual instances of records. This implementation requires a lot of space and resources because you are essentially writing your own SQL engine. But it lets you define any new data structure on the fly, and then consume it as if you had actually modified the database structure itself. You can do a Google search if you ever want to consider this implementation. I recommend against it as a whole, but I have used it on rare occasions.
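To make that concrete, here is a minimal sketch of what an EAV schema might look like. All of the table and column names are illustrative, not from any particular implementation:

-- Minimal EAV sketch: entity types, their attributes, and attribute values.
-- Names are illustrative only.
CREATE TABLE EntityType (
    EntityTypeId INT IDENTITY PRIMARY KEY,
    Name         NVARCHAR(100) NOT NULL      -- e.g. 'Customer', 'Sensor'
);

CREATE TABLE Attribute (
    AttributeId  INT IDENTITY PRIMARY KEY,
    EntityTypeId INT NOT NULL REFERENCES EntityType(EntityTypeId),
    Name         NVARCHAR(100) NOT NULL,     -- e.g. 'FavoriteColor'
    DataType     NVARCHAR(20)  NOT NULL      -- e.g. 'int', 'nvarchar'
);

CREATE TABLE Entity (
    EntityId     INT IDENTITY PRIMARY KEY,
    EntityTypeId INT NOT NULL REFERENCES EntityType(EntityTypeId)
);

CREATE TABLE AttributeValue (
    EntityId    INT NOT NULL REFERENCES Entity(EntityId),
    AttributeId INT NOT NULL REFERENCES Attribute(AttributeId),
    Value       NVARCHAR(MAX) NULL,          -- everything stored as text
    PRIMARY KEY (EntityId, AttributeId)
);

Reading a single logical "row" back means pivoting many AttributeValue rows into columns, which is exactly where the extra space and resource cost shows up.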

My specific problem lends itself to other solutions. The reality of the extended data is that I need to store it, but don’t actually care what it contains. I will never write queries against the data. It is just an array of bytes. This property of the data allows me to use other solutions.

I can store it as a blob in the database, or in a file that the database points to. This doesn't mean I store everything in a single blob. I have a blob column in a table, and I am able to associate that blob record with an already existing table. So, I am simply extending an existing table with whatever design is contained in the blob.
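A minimal sketch of that idea, assuming a hypothetical existing Customer table (the names here are made up for illustration):

-- The existing table stays untouched; the blob lives in a side table keyed to it.
CREATE TABLE CustomerExtension (
    CustomerId    INT NOT NULL
        REFERENCES Customer(CustomerId),
    ExtensionData VARBINARY(MAX) NULL,   -- opaque payload, any structure
    PRIMARY KEY (CustomerId)
);

The extension row is optional per customer, so rows that have no extra data cost nothing beyond the base table.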

There are different ways to implement the blob, depending on the database engine. If I'm using SQL Server, there are a number of blob data types to choose from. If I choose the XML data type, or store JSON text in an NVARCHAR column and use SQL Server's built-in JSON functions, I can store data structures that are defined at run time instead of when the table is defined in SQL Server. Different rows in my extension table may have completely different XML schemas or JSON shapes. My data storage, at least, may be easily extended without opening the lid. Using XML or JSON also allows SQL Server to integrate the unstructured data into an SQL query, so data that doesn't have an SQL schema can still be included in a query.
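For example, if the extension column from the earlier sketch were an NVARCHAR(MAX) column holding JSON rather than a binary blob, SQL Server's JSON functions could pull a property (here a made-up one) straight into a normal query:

-- ExtensionData holds JSON text; JSON_VALUE projects a property into the
-- result set even though it has no relational schema.
SELECT  c.CustomerId,
        JSON_VALUE(e.ExtensionData, '$.loyaltyTier') AS LoyaltyTier
FROM    Customer AS c
JOIN    CustomerExtension AS e ON e.CustomerId = c.CustomerId
WHERE   ISJSON(e.ExtensionData) = 1;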

I'm not a fan of mixing SQL queries with unstructured data, for performance reasons. However, even if you really don't care what the contents are, there is some value in using these data types because they are human readable. If you don't care about having something human readable, then you can simply serialize the data into any form convenient to you and store it in a VARBINARY or VARCHAR column; you can serialize just about anything into those data types.

If you're consuming your database through a RESTful service, then storing the data as JSON has some advantages. In this case, you don't have to map through an ORM, as we do with Entity Framework, because you are storing the data the same way you consume it. You are using SQL Server more like a key/value or document database such as MongoDB: a SQL Server table identifies the key, and the JSON column holds the document.
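A minimal sketch of that key/document pattern, with hypothetical names:

-- The key is a normal SQL Server primary key; the document is JSON text.
CREATE TABLE Document (
    DocumentId  UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    Body        NVARCHAR(MAX)    NOT NULL
        CONSTRAINT CK_Document_Body_IsJson CHECK (ISJSON(Body) = 1),
    ModifiedUtc DATETIME2        NOT NULL DEFAULT SYSUTCDATETIME()
);

-- The REST layer can insert and fetch the JSON exactly as it receives it.
DECLARE @id UNIQUEIDENTIFIER = NEWID();

INSERT INTO Document (DocumentId, Body)
VALUES (@id, N'{"name":"Ben","preferences":{"theme":"dark"}}');

SELECT Body FROM Document WHERE DocumentId = @id;

The CHECK constraint is optional; it simply keeps malformed JSON out of the column without imposing any particular schema on the document.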

So, if you don't have to query, join, or filter on the contents of the undefined data, these are some options available for you to consider.

Cheers,

Ben