Editorials

Types of Big Data – What Does It Mean for Us All?

I was reading an article on Forbes about the types of Big Data. I think the intent of the article was to introduce the various places “big data” on the whole is, or will be, coming from. The author also does an interesting job of breaking it out into five different types of Big Data – categorized not by toolsets, but by the types of information involved.

Here’s a link to the article

I think there is perhaps another type of big data inferred in all of this, but not specifically called out, that could end up surpassing many of the thresholds mentioned in this article in terms of data size: the whole field of AI, where machines talk to machines not only to combine information, but also to create new information. That might be summaries, conclusions reached when analyzing the information, or the summary structures used to organize the information being analyzed. Things like indexes and summary tables also have to be considered as we move into using the information we receive en masse.

What struck me most in the article, though, was just how different the storage and processing requirements are for the different types of information – from the video and graphical elements included in the Dark Data classification (I love that term) to the raw bits and bytes flowing from calculations and devices. There is a wide variance in the types of information.

As data platform folks, I think it speaks to the fact that the technology we’ll all have to be up to speed on will necessarily include tools to store, secure, and analyze these different types of information. This is especially true when you consider where and how you’ll store blobs of information vs. more standard, well-defined data types. While many platforms certainly support native storage of these elements, the processing may require additional temporary space, and the storage maintenance will be substantial.
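One common alternative to storing blobs inline is keeping only metadata in the database and the bytes on a file store. The sketch below is purely illustrative – the table, column names, and file layout are my assumptions, not anything from the article – using SQLite just to keep it self-contained:

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical pattern: large blobs (e.g. videos) live on a file store;
# the database records only path, size, and a checksum for integrity.

def store_video(conn, store_dir, name, data):
    """Write blob bytes to the file store; record metadata in the DB."""
    digest = hashlib.sha256(data).hexdigest()
    path = Path(store_dir) / f"{digest}.bin"
    path.write_bytes(data)
    conn.execute(
        "INSERT INTO videos (name, path, size_bytes, sha256) VALUES (?, ?, ?, ?)",
        (name, str(path), len(data), digest),
    )
    conn.commit()
    return digest

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE videos (id INTEGER PRIMARY KEY, name TEXT, "
    "path TEXT, size_bytes INTEGER, sha256 TEXT)"
)
store_dir = tempfile.mkdtemp()
digest = store_video(conn, store_dir, "demo.mp4", b"fake video bytes")
row = conn.execute("SELECT name, size_bytes FROM videos").fetchone()
print(row)  # ('demo.mp4', 16)
```

The trade-off is that backup and consistency now span two systems (database plus file store), which is part of the maintenance burden mentioned above.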

In our own systems, we see that videos are regularly hundreds of MB each. That doesn’t include anything special – just the video. If those are to become part of the mission of the database processing, management of storage will be extremely important. We’ll have to think carefully about archiving, backups, types of access (real-time vs. near real-time vs. come-back-tomorrow storage), and security. I keep working security in there because, as we’ve seen with breaches at major companies to date, it’s critical that we keep our eye on that ball.

Ben talked about the real value in storage costs today. It’s true – it’s a huge cost savings. But in working with clients, we’ve seen that once something is committed to storage, it’s extremely difficult to move if you didn’t plan for that capability from the start. In other words, if you don’t have a way to archive those videos to longer-term storage – with procedures created and in place from the start – deciding to pick through a video archive and move things a year or two from now is potentially a massive, time-eating, expensive proposition.
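Having an archive procedure in place from the start can be as simple as a scheduled job that ages files out of hot storage. The following is a minimal sketch, assuming a policy based on file modification time – the directory names, cutoff, and policy are all placeholder assumptions:

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

# Illustrative age-out job, not any client's actual procedure: move files
# older than max_age_days from hot storage into an archive directory.

def archive_old_files(hot_dir, archive_dir, max_age_days):
    """Move files older than max_age_days from hot_dir to archive_dir."""
    cutoff = time.time() - max_age_days * 86400
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in Path(hot_dir).iterdir():
        if f.is_file() and f.stat().st_mtime < cutoff:
            shutil.move(str(f), str(archive_dir / f.name))
            moved.append(f.name)
    return moved

# Demo: one backdated file, one fresh file.
hot = Path(tempfile.mkdtemp())
arch = Path(tempfile.mkdtemp()) / "cold"
old = hot / "old.mp4"
old.write_bytes(b"x")
os.utime(old, (0, 0))  # pretend this file is from 1970
(hot / "new.mp4").write_bytes(b"y")

moved = archive_old_files(hot, arch, 365)
print(moved)  # ['old.mp4']
```

The point is less the code than the habit: if this job exists on day one, "move the old videos" is routine; bolted on two years later, it's a project.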

Given that, the costs can also add up. And that’s just storage costs – processing will increase (more to go through), performance may be impacted (more to process to fulfill requests), and so on.

I think the take-away, from a purely (and selfishly) data-platform-centric view, is that we need to be thinking about scale for storage. How do you scale up and/or scale out storage? When does it become an issue? How do you manage it for archiving, compliance, security, and the like? Those questions are likely best addressed head-on as you start to roll into some of these big data projects.