Amazon RDS, Azure SQL Database, Editorials, SQL Server

Does the Cloud Force Data Silos?

As we’ve been working with different people (and, embarrassingly, with our own systems), I’ve been noticing a trend. It’s been bugging me for a while now, but I’ve not really been able to put words to it. I mentioned in the prior editorial that being multi-platform is important. Basically, use the right tool for the job at hand.

And then I noticed a post that started to bring it a bit more into focus: “Data integration is one thing the cloud makes worse.”

This is a real issue. There is so much specialization in database platform tools (never mind the rest of the options in the hosting space) that you end up cherry-picking tools to solve specific problems. But in that cherry-picking, in many cases, you’re creating a heck of a system of silos.

*Gets out old-database-guy cane…*

In the old days, you’d figure out how to make your database platform of choice do what you needed, or get as close to that goal as possible, and then deploy. It made absolutely no sense to bring in a new database engine and learn to support and use it. The resource math just didn’t work out. Besides, how many times did we (OK, *I*) preach about standardization as the way to leverage your time and resources? It was a constant fight. Heck, there was plenty of hand-wringing even about the woes of SQL Server Express at the department level.

*Puts cane away*

This can be a significant issue, particularly if you embrace platform selection on a task-by-task or application-by-application basis. It’s so easy to roll out those other resources to answer a specific report or functionality need. But the side effect is clear – you still have to support all of it. It’ll make your hiring more challenging (do you know all of these platforms?), and getting assistance and applying best practices will be more difficult.

In all my talk about data pedigree, this portion really gave me pause:

But without a data integration strategy and technology, a single source of data truth is not possible. Systems become islands of automation unto themselves, and it doesn’t matter if they are in the public cloud or not.

We have to push back on the idea that a data “truth” structure can’t be maintained. It can be done, though it certainly gets more and more challenging as the variables multiply. I think documenting the data flows, making sure sources are known, expected, and audited, and managing how the data is used will all come into play. Without this type of thought, it will be quite difficult to really ascertain the “correctness” of the information in those stores. Developing audits and checks on the information flowing through the pieces will go a long way toward making sure things are OK.

Much easier said than done – I mean, how do you audit a data flow for expected values?  What is an outlier that requires attention in the data infrastructure, and what’s a data element that is crying for analysis in the normal flow of reporting and such?
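As one possible starting point, here is a minimal sketch of what an expected-values audit on a data flow might look like. The `FieldRule` and `audit_flow` names, the z-score threshold, and the record shapes are all my own illustrative assumptions, not anything from the post: the idea is simply to separate hard rule violations (bad data in the infrastructure) from statistical outliers (data that may deserve analysis), the distinction raised above.

```python
# Sketch of a batch audit over a data flow: declarative per-field range
# rules catch invalid values, and a simple z-score test flags outliers.
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class FieldRule:
    """Expected-value bounds for one field in the flow (illustrative)."""
    name: str
    min_val: float = float("-inf")
    max_val: float = float("inf")


def audit_flow(records, rules, z_threshold=3.0):
    """Return (violations, outliers) for a batch of dict records.

    violations: records breaking a hard min/max rule (infrastructure problem)
    outliers:   in-range records far from the batch mean (candidates for analysis)
    """
    violations = []
    for record in records:
        for rule in rules:
            value = record.get(rule.name)
            if value is None or not (rule.min_val <= value <= rule.max_val):
                violations.append((record, rule.name))

    outliers = []
    for rule in rules:
        values = [r[rule.name] for r in records if rule.name in r]
        if len(values) < 2:
            continue
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # all values identical; nothing to flag
        for record in records:
            value = record.get(rule.name)
            if value is not None and abs(value - mu) / sigma > z_threshold:
                outliers.append((record, rule.name))

    return violations, outliers
```

A check like this would sit at the boundary where data moves between silos, so each flow gets audited against its own documented expectations rather than trusting the source blindly.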

Very challenging indeed.

  • OracleGuy

I’m seeing this right now as our organization pursues a microservice-based architecture. One of the espoused principles of microservices is that each one maintains its own data store. Which is great from a modular application architecture point of view, but ignores tiny details like querying this data in a meaningful way, referential integrity, transaction consistency, etc. Never mind that you could easily end up with a dozen different data storage technologies, which need to be supported, migrated, patched, backed up, replicated for dev/test/UAT environments …

• Not to abuse a worn-out old phrase, but data systems are a bit like the wild west at the moment. I hope we can get our collective arms around them and make sure our solutions will continue to thrive in the future. But that’s a big risk, I think. I’m not sure that, without planning and thought right now, our systems won’t decay into uselessness: unable to be maintained, updated, or modified beyond their initial tasks, with the data marooned and unable to be put to further use.