Administration, Amazon AWS, Amazon RDS, Azure, Azure SQL Database, Editorials

Love/Hate Relationship with Unstructured Data

No doubt you’ve seen, and perhaps been working with unstructured solutions out there – DynamoDB, Azure Table Storage, and there are many other options as well.  It’s pretty cool that you can define key/value pairs and just start dumping information into the system, really without regard to structure and columns and many of the things that have traditionally made up relational databases.

If the system detects a new key/value pair – something not yet defined by other write requests, in essence, a new column is simply added and the values are stored.

It’s a great tool and many times extremely scalable and able to step up to the load of a database.

All of this is positive.

I get all wishy-washy though thinking about what this means for support down the line.  Now, yes, I admit to waving my cane madly in the air at times when rules are broken or someone saving information to SQL server hasn’t taken the time to understand the best ways to do so.  But that’s not really what I’m saying here.  These are powerful solutions.  We use them for SSWUG – tracking email bounces, campaigns, etc.  In our uses, they’ve become powerful scratch-pads to help keep things flowing well.

I’m nervous about this dynamic “I don’t recognize that key, so I’ll just add it!” at the database level.  I’m just not that good of a typist, I guess, to never make a mistake, never have a typo and so-on.  I goof up.  And with a structure to post against, if I’ve added a key (column in SQL server, for example) and it doesn’t match up with the schema, SQL Server barks at me and tells me I’ve goofed and things get corrected.  Not so with these other approaches.  You just end up with a weird new key value of “Adddress” – and if it’s used correctly in some places, but not in this case, you’ll be scratching your head about where in the world that value came from.

So, there’s that.

But there’s also the case of the less than attentive, um, developer (I said the “D” word) who is CLEARLY well-intentioned (seriously), doesn’t see that there’s already an Address key, and decides to make Address1, Address2 for their pet project.  Now you have three different keys for that chunk of information.

I worry that that wouldn’t be documented well, and that chasing down where this suddenly-appearing value came from, could and would be a challenge.  You’ll be looking for code changes, doing text searches through code deployed and all of that.  I know we’ve probably all cursed a missing schema change when we thought we had things knocked on a deployment, but I think I’d rather deal with that, than a suddenly appearing growth in my database structure that no one really can account for.

So I wonder.  Is there either a) a place for these types of solutions that we should be actively looking to carve out and understand and plan for, or b) a need to work out some solutions that help manage this type of structure-creep?  At some point, it seems like it could become very, very important.

I think the answer is, “Yes.”  Both.  (And I don’t think there is a “ignore it and it will go away” option)

Facebooktwittergoogle_plusredditpinterestlinkedinmail
  • John Shadows

    Seems like the keyvalue pair type is great for those needing data stores while not understanding DBs. Where DBAs may have dropped the ball is in not making their databases adaptable in these scenarios (table-tables anyone?).

    I’ve built SQL objects (schema, tables and procs) that handle all the same key value pair operations so I look at the Azure Tables as unnecessary IF you already have a SQL Database in play. Even if you didnt, SQL Express and LocalDB are free so I’m still not out anything.

    The add any key scenario in Azure tables is nightmarish in SQL (where’s your indexes and stats).

    Even in SQL, folks went sql_variant crazy for a while there.

  • NameThatTune

    Sounds like you need a scrape the scrape process.

    A sort of meta-analysis of your data to make the more unstructured data less unstructured.