Editorials

Storing Evolving Data

Yesterday I presented the problem of a database with ever-evolving data requirements. There are two approaches to handling this kind of problem. The first approach, which is what I implemented, is to use the database the way it was designed: if you have new data, it is saved in its own tables, using database normalization, just as you would have done had you known about the data when you first designed and delivered the database. This is the most efficient and widely accepted pattern for handling evolving data.
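As a minimal sketch of that first approach (the table and column names here are invented for illustration): if a new requirement adds, say, phone numbers to an existing Customer table, the normalized answer is a new child table rather than a modified one.

    CREATE TABLE dbo.CustomerPhone
    (
        CustomerPhoneId INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId      INT NOT NULL
            REFERENCES dbo.Customer (CustomerId), -- ties the new data to the existing row
        PhoneType       VARCHAR(20) NOT NULL,     -- e.g. 'Home', 'Mobile'
        PhoneNumber     VARCHAR(30) NOT NULL
    );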

Today I wanted to share some other approaches that can be implemented in an SQL database, allowing the needs to evolve without changing the database structure, and maybe without even changing the application. All of these techniques tend to perform more slowly, and do not necessarily provide a pleasant user experience.

One implementation that has been around for a long time is the EAV (Entity-Attribute-Value) model. This implementation uses tables you create in SQL Server to emulate what the database engine already has built in. Instead of defining an actual table in your SQL database, you use your own tables to define tables, columns, constraints, and actual instances of records. This implementation requires a lot of space and resources, because you are essentially writing your own SQL engine. But it lets you define any new data structure on the fly, and then consume it as if you had actually modified the database structure itself. You can do a Google search if you ever want to consider this implementation. I recommend against it as a whole, but have used it on rare occasions.
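To make the idea concrete, here is a minimal EAV sketch (these table and column names are mine, not a standard): three tables play the roles of table definition, column definition, and cell value, and every value typically degrades to a string.

    CREATE TABLE dbo.EntityType      -- plays the role of a table definition
    (
        EntityTypeId INT IDENTITY(1,1) PRIMARY KEY,
        Name         VARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.Attribute       -- plays the role of a column definition
    (
        AttributeId  INT IDENTITY(1,1) PRIMARY KEY,
        EntityTypeId INT NOT NULL REFERENCES dbo.EntityType (EntityTypeId),
        Name         VARCHAR(100) NOT NULL,
        DataType     VARCHAR(20)  NOT NULL  -- enforced by your code, not the engine
    );

    CREATE TABLE dbo.AttributeValue  -- one row per "cell" of a virtual record
    (
        EntityTypeId INT NOT NULL REFERENCES dbo.EntityType (EntityTypeId),
        EntityId     INT NOT NULL,          -- identifies the virtual record
        AttributeId  INT NOT NULL REFERENCES dbo.Attribute (AttributeId),
        Value        NVARCHAR(MAX) NULL,    -- everything is stored as text
        PRIMARY KEY (EntityTypeId, EntityId, AttributeId)
    );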

My specific problem lends itself to other solutions. The reality of the extended data is that I need to store it, but I don’t actually care what it contains. I will never write queries against the data; it is just an array of bytes. That property of the data is what makes the following approaches possible.

I can store it in a blob in the database, or in a file that the database points to. This doesn’t mean that I store everything in a single blob. I have a blob column in a table, and I am able to associate that blob record with an already existing table. So, I am simply extending an existing table with whatever design is contained in the blob.
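A sketch of that extension pattern might look like this (the names are illustrative): the existing table is untouched, and the opaque payload rides alongside it in a one-to-one child table.

    CREATE TABLE dbo.OrderExtension
    (
        OrderId      INT NOT NULL PRIMARY KEY
            REFERENCES dbo.[Order] (OrderId), -- extends the existing Order table
        ExtendedData VARBINARY(MAX) NULL      -- opaque payload; its schema lives in the application
    );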

There are different blob implementations to choose from, depending on the database engine. If I’m using SQL Server, there are a number of blob-friendly data types available. If I choose the XML data type, or store JSON text in an NVARCHAR column (SQL Server provides built-in JSON functions rather than a dedicated JSON type), I can store data structures that are defined at run time instead of when the table is defined. Different rows in my extension table may have a completely different XML schema or JSON shape. My data storage, at least, may be easily extended without opening the lid. Using XML or JSON also allows SQL Server to integrate the unstructured data into an SQL query, so data that has no SQL schema can still participate in one.
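A sketch of what that integration can look like, assuming a JSON variant of the hypothetical extension table above (SQL Server 2016 and later provide JSON_VALUE and ISJSON; the XML type has an analogous .value() method):

    -- Assumed variant of the extension table, holding JSON text:
    CREATE TABLE dbo.OrderExtensionJson
    (
        OrderId      INT NOT NULL PRIMARY KEY REFERENCES dbo.[Order] (OrderId),
        ExtendedData NVARCHAR(MAX) NULL CHECK (ISJSON(ExtendedData) = 1)
    );

    -- Pull a run-time-defined field out of the payload and join it
    -- back to the structured data:
    SELECT  o.OrderId,
            JSON_VALUE(x.ExtendedData, '$.warrantyCode') AS WarrantyCode
    FROM    dbo.[Order] AS o
    JOIN    dbo.OrderExtensionJson AS x ON x.OrderId = o.OrderId;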

I’m not a fan of mixing SQL queries with unstructured data, for performance reasons. However, if you really don’t care what the contents are, there is still some value in these data types because they are human readable. If you don’t care about having something human readable, then you can simply serialize the data into any form convenient to you, and store it in a VARBINARY or VARCHAR column; you can serialize just about anything into those.

If you’re consuming your database through a RESTful service, then storing the data as JSON has some advantages. In this case, you don’t have to map objects through an ORM, as we do with Entity Framework, because you are storing the data the same way you consume it. You are using SQL Server more like a document database such as MongoDB: a SQL Server table identifies the key, and the JSON column holds the document.
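A sketch of that key-plus-document pattern (table and values invented): the service can write a request body straight into the row and read it back, with no mapping layer in between.

    CREATE TABLE dbo.CustomerDocument
    (
        CustomerId INT NOT NULL PRIMARY KEY,  -- the "key"
        Document   NVARCHAR(MAX) NOT NULL     -- the JSON "document", stored as received
            CHECK (ISJSON(Document) = 1)
    );

    -- The REST layer passes JSON through in both directions:
    INSERT INTO dbo.CustomerDocument (CustomerId, Document)
    VALUES (42, N'{"name":"Acme","tier":"gold","tags":["priority"]}');

    SELECT Document FROM dbo.CustomerDocument WHERE CustomerId = 42;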

So, if you don’t have to query, join, or filter on the contents of the undefined data, these are some options available for you to consider.

Cheers,

Ben

  • Eilenblogger

    I’ve used key value pairs in the past for lookup values.

    In other words, things that might be presented as a drop down box in a UI.

    This seemed cleaner to me because I had one place to maintain lookup values. The code written to maintain these values was simple.
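    A sketch of what I mean (names invented), where LookupType distinguishes which drop-down a row belongs to:

        CREATE TABLE dbo.Lookup
        (
            LookupId    INT IDENTITY(1,1) PRIMARY KEY,
            LookupType  VARCHAR(50)  NOT NULL,  -- e.g. 'OrderStatus', 'CountryCode'
            LookupValue VARCHAR(100) NOT NULL
        );

        -- One query serves any drop-down in the UI:
        SELECT LookupId, LookupValue
        FROM   dbo.Lookup
        WHERE  LookupType = 'OrderStatus';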

    Then at another job I noticed people had used the same approach. The difference was that there were thousands of lookup types and some of those lookup types had gobs of instances.

    Then I noticed that people were joining to those tables in stored procedures all over the place.

    Eventually the lookup table scheme became the culprit for database locks due to joins of these lookup tables multiplied by the complexity of the evolving beast.

    The solution was to break the single lookup table out into individual tables. This, of course, required that every stored procedure that previously joined to the lookup table be modified.

    This was a huge undertaking, and it became prohibitive to refactor every lookup. That left the database as a hodgepodge of separate lookup tables and the single-lookup-table solution.

    So the new individual lookup tables were blessed with referential integrity, while the old lookup scheme couldn’t have RI.

    What a mess.

    The common thread with your post and this comment is “evolving data”.

    My point is that one size does not fit all; don’t get trapped in the groupthink that tries to generalize everything, lest you find yourself in a similar conundrum.

  • viviane

    This may be a stupid comment, but why store data that you are never going to query?

    • Eilenblogger

      Lots of reasons.

      I’ll cite one, which happens to be the latest project I worked on.

      Title Ownership Chain for the Oil & Gas industry.

      There are several columns you can query such as Grantor and Grantee, Document Type, etc.

      Depending on the Object Type, the database record references a scanned image. The image could be a Quit Claim, for example.

      You don’t query the text of a Quit Claim.

      Another example would be a video or audio file. This is data but there is no real meaningful reason to query the raw data of a video.

      A more succinct response to your question would be to pose a question back to you.

      Scenario:

      1. You have a Company table with 30 columns.

      2. Your application only allows you to query on 5 of the columns.

      Question:

      Why are you storing those additional columns if you never intend to query them?

      • viviane

        The image at some point would have to be presented (so, queried). Just because I don’t query the remaining 25 columns doesn’t mean that the system doesn’t need them (query them); otherwise, why are we collecting the information?

    • Ben

      This is a great question. Why put something in a database that isn’t consumed by the query engine? The answer is that sometimes your systems have data that isn’t transactional. A good example is when you use filestreams. In my case, I had users who needed to store data for an external system, related as a document or JSON object to a record in my database. My database didn’t need to know or understand the contents of their data, so I didn’t translate it into a relational structure to store the contents. I could instead simply store the JSON.

      If I had other storage systems available, such as a Mongo data engine, I could have stored their data in Mongo and simply stored the related key in my SQL database. However, storing it all in SQL Server has added advantages for disaster recovery and other concerns of that nature.

  • David Eaton

    Well, I use a sort of hybrid model. Like what was mentioned, I use a key-value pair table as the catalog of lookup values. This allows lookup types to change without affecting the overall database design; hence, it is a very scalable subsystem in the database.

    In the transactional system, the main tables hold only the data that is used constantly. All the optional data columns are stored in a child table. This is a slight increase in complexity, but it saves an enormous amount of empty space on disk. I have seen tables with over 60% wasted space from large, empty varchar columns.
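    A sketch of that split (names invented): the constantly used columns stay in the main table, and the optional columns move to a one-to-one child table, so space is paid for only when the optional data actually exists.

        CREATE TABLE dbo.Shipment             -- constantly used columns only
        (
            ShipmentId  INT IDENTITY(1,1) PRIMARY KEY,
            OrderId     INT NOT NULL,
            ShippedDate DATETIME2 NOT NULL
        );

        CREATE TABLE dbo.ShipmentOptional     -- a row exists only when optional data exists
        (
            ShipmentId   INT NOT NULL PRIMARY KEY
                REFERENCES dbo.Shipment (ShipmentId),
            GiftMessage  VARCHAR(500)  NULL,
            CustomsNotes VARCHAR(1000) NULL
        );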

    The golden rule to live by when designing a database: data never gets smaller, so design for scalability!