Volume: Modeling vs. Storage

When the topic of data volume arises, a few ground rules need to be in place so that the apples and oranges of data are not compared directly: the volume that matters depends on whether you are storing raw data, building a model, or scoring in production.

The Really Big Numbers – Storage requirements for raw data collected over time are what put the “Big” in Big Data. It is in this context that Petabytes (1,000 Terabytes each) need to be discussed. While these are very big numbers indeed, this is still raw data, and the density of insight it contains is rather low.

Data Volume for Modeling – This volume will almost always be lower because the raw data is cleaned, aggregated, or summarized before modeling. This can lead to substantial compression without loss of accuracy, since rows with little or no useful information can be combined.
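
As a rough illustration, the sketch below (Python with pandas, using a small made-up transactions table; the column names and aggregation choices are assumptions, not anything prescribed here) shows how event-level raw rows collapse into a much smaller per-customer modeling table.

```python
import pandas as pd

# Raw data: one row per transaction (in practice, billions of rows on disk).
# This tiny table is purely illustrative.
raw = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount":      [12.5, 40.0, 7.25, 99.0, 15.0, 3.5],
    "is_fraud":    [0, 0, 0, 1, 0, 0],
})

# Modeling data: one row per customer, summarizing the raw history.
model_input = (
    raw.groupby("customer_id")
       .agg(txn_count=("amount", "size"),
            total_spend=("amount", "sum"),
            avg_spend=("amount", "mean"),
            ever_fraud=("is_fraud", "max"))
       .reset_index()
)

print(f"raw rows: {len(raw)}, modeling rows: {len(model_input)}")
```

The deeper the raw history behind each customer, the larger the compression ratio between the stored data and the table the modeling algorithm actually sees.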

Real-Time Data Scoring – Measured in predictions per second, this comes into play when the resulting model is put into production to provide real-time insight from operational data as it is created. Examples include risk scoring for credit card transactions or scoring sales leads. This is not as computationally intensive as model building, nor does the data volume involved come anywhere near the amounts required for storage.
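
The sketch below (Python with scikit-learn; the synthetic data and toy logistic-regression model are assumptions for illustration only) separates the heavy offline training step from the lightweight online scoring loop and reports throughput in predictions per second.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

# Offline step: train once on historical data (the computationally heavy part).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Online step: score records one at a time, as an operational system would
# when each transaction or lead arrives.
incoming = rng.normal(size=(5_000, 5))
start = time.perf_counter()
scores = [model.predict_proba(row.reshape(1, -1))[0, 1] for row in incoming]
elapsed = time.perf_counter() - start

print(f"{len(scores) / elapsed:,.0f} predictions per second")
```

Scoring each record individually mirrors the operational pattern: the model is already built, so each prediction is a fast, fixed-cost lookup rather than another pass over the full training data.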