“Big Data”. A phrase destined to live a long life because it’s catchy and amorphous enough to mold itself into the current conversation. There is a great deal of literature, lesson, and lore now floating around on this topic. We know Hadoop customers store petabytes of unstructured content, we know there are approximately 5 billion cell phones in the world producing both data and mobile consumers, and we know Facebook is this important global mind meld of puppy-dog pictures, FarmVille, and patriotic prose. Today, I want to involve you in what has been speed walking through my brain: how do you take something as esoteric as analytics and apply Big Data to it?
If you’re not a data scientist already, go load the analytics package “R” on your laptop. Then create yourself a list, an array, or a matrix of data and run some of the analytic functions contained within the package (e.g. t.test(), lm(), kmeans()). If you do this, you’ll quickly learn that:
a) This is a deep, complex activity that combines statistics and programming
b) It’s not suited to the majority of brains in the IT industry today
c) The data is manipulated, massaged, converted, and carefully transformed by humans from one analysis to the next, with decisions made along the way
d) And… the data for each analysis isn’t the most aggressive volume of data (in number of gigabytes) you’ve seen before
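To make point (a) concrete, here’s what the statistics half of that work looks like when you roll it by hand. This is a sketch in Python rather than R, of the two-sample t statistic that R’s t.test() reports; the two sample groups are invented purely for illustration.

```python
# Hand-rolled Welch's two-sample t statistic (the core of what R's
# t.test() computes), using only the Python standard library.
# The sample numbers below are made up for illustration.
import math
import statistics

group_a = [12.1, 11.8, 12.5, 12.0, 12.3]  # e.g. measurements before a change
group_b = [13.0, 12.7, 13.4, 12.9, 13.2]  # e.g. measurements after a change

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Welch's t: difference of means over the combined standard error
t_stat = (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))
print(round(t_stat, 2))
```

Even this tiny sketch forces you to make statistical choices (which t-test variant, how to estimate variance) and programming choices at the same time, which is exactly why this work isn’t for everyone.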
If this is true, how do we answer the following questions:
1) How does the data get so big?
2) How do we apply Big IT to Analytics?
HOW DOES DATA GET BIG?
OK, let me baseline you on linear regression. Linear Regression (LR) is the gateway drug to predictive analytics. LR is the assessment of existing data that shows a linear pattern as you traverse a set of variables. The fitted line is called the “best fit” or least squares regression line. With luck, your data will fit so well that the line gives you an algorithm (or coefficients) to predict one variable from the other. In the link I reference (here) is an example of tracking two variables, “age” and “height,” across a sample. The result is an algorithm where we could enter an “age” and get a suggested “height” back, thus providing predictive capabilities.
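Here’s the age→height idea worked as a hand-rolled least-squares fit in Python. The data points are invented for illustration (the linked example has its own sample); the formulas are the standard least-squares coefficients.

```python
# Least-squares regression of height on age, by hand.
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
# The (age, height) points below are invented for illustration.
ages    = [2, 4, 6, 8, 10, 12]           # years
heights = [86, 102, 115, 128, 138, 149]  # centimeters

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(heights) / n

slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, heights)) \
        / sum((x - mean_x) ** 2 for x in ages)
intercept = mean_y - slope * mean_x

def predict_height(age):
    """Enter an "age", get a suggested "height" back."""
    return intercept + slope * age

print(round(predict_height(7), 1))
```

The two coefficients (slope and intercept) are the “algorithm” the paragraph above describes: once fitted, prediction is just one multiply and one add.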
There are other algorithms beyond LR. Shopping baskets are processed as key-value pairs (KVP), combining every pair of items and looking for trends in buying patterns. KVP is in demand today, and it’s a culprit in creating large amounts of data. A classic example is predicting what consumers buy. Everyone always talks about groceries because everyone buys groceries, and they buy lots of things each trip. Being able to predict what people will purchase would allow grocery stores to serve customers better while reducing costs to help their razor-thin margins. This is why there are so many grocery store examples…
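“Combining every pair of items” is easy to picture mechanically. A quick Python sketch (the basket contents are invented for illustration) shows the raw material this kind of basket analysis counts up:

```python
# Every 2-item combination from a single grocery basket -- the raw
# pairs a KVP-style basket analysis would count across receipts.
# The basket contents are invented for illustration.
from itertools import combinations

basket = ["bread", "milk", "eggs", "butter"]
pairs = list(combinations(sorted(basket), 2))
print(pairs)  # 4 items yield C(4, 2) = 6 pairs
```

Note the growth: a 4-item basket yields 6 pairs, but a 20-item basket yields 190, before you even look at triples or larger groupings.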
However, I’d prefer a new target for our discussion. It’s beach time; I’m heading to a North Carolina beach destination soon, so let’s do beach shop souvenirs. If you’ve been to a North Carolina beach, you know it’s all about lighthouses, pirate legends, and casual fun.
In our example there are:
– Pretty T-shirts
– Rebel/Pirate T-shirts
– Surfer/Cool T-shirts
– Sea shells
– Bow covered flip-flops
– Pirate gear
– Cheap surf/water gear
– Sun tan lotion
– Lighthouse gifts
– Boat-in-a-bottle gifts (BiaB)
Review my list; I think we can predict a few types of shoppers: the contemporary “southern belle,” the “pirate on the inside,” and the “surfer dude wannabe.” I would speculate that these three shoppers would tend to buy in these patterns:
– Southern Belle – Pretty T-shirts, bow flip-flops, sea shells, sun tan lotion, postcards, lighthouse gifts, and BiaB
– Pirate Pete – Pirate/Rebel T-shirts, pirate gear, and BiaB
– Spicoli Dude – Surfer/Cool T-shirts, Sun tan lotion, surf gear
Note: I speculate from my mental library of personal observations; KVP speculates based on data. To come to a more definitive conclusion, we would use KVP to process a day’s worth of transactions. If we ran these tests, you would be able to appreciate the vast number of combinations to consider. If our goal is to identify correlation or causality in product purchase relationships (say, a person who buys a sea shell is likely to also buy a lighthouse gift at 95% confidence), you have to consider all the combinations of purchase relationships across a large number of receipts. This means comparing one-to-one combinations, two-to-one combinations, up to N-to-one combinations, where N is the number of items purchased (data scientists… yes, this is a simplification, be kind with your technical assessment of it…).

Now apply those combinations across hundreds of shoppers in a given day, across a chain of stores, across the summer season. Now imagine you’re a global retail giant. What a Rubik’s Cube of potential value this could be, and how much data gets generated in the process. It’s the combinations of assessment that make the data growth skyrocket off the charts. Now think about how you manage it, communicate it, leverage it, and do you ever throw it away?
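To see the shells→lighthouse idea in miniature, here’s a toy run over a handful of beach-shop receipts. The receipts are invented to match the shopper profiles above, and the “confidence” here is the simple association-rule version: of the baskets containing sea shells, what fraction also contain a lighthouse gift.

```python
# Toy basket analysis over a day of (invented) beach-shop receipts:
# count item and item-pair occurrences, then estimate
# confidence(sea shells -> lighthouse gift).
from collections import Counter
from itertools import combinations

receipts = [
    {"pretty tee", "bow flip-flops", "sea shells", "lighthouse gift"},  # Southern Belle
    {"pirate tee", "pirate gear", "boat-in-a-bottle"},                  # Pirate Pete
    {"surfer tee", "sun tan lotion", "surf gear"},                      # Spicoli Dude
    {"sea shells", "lighthouse gift", "sun tan lotion"},
    {"sea shells", "pirate gear"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in receipts:
    item_counts.update(basket)
    # sorting makes each pair's tuple ordering canonical across baskets
    pair_counts.update(combinations(sorted(basket), 2))

together = pair_counts[("lighthouse gift", "sea shells")]
confidence = together / item_counts["sea shells"]
print(f"confidence(sea shells -> lighthouse gift) = {confidence:.0%}")
```

Five receipts already produce a pair table bigger than the receipts themselves; scale that to every combination size, every shopper, every store, every season, and you can see where the data explosion comes from.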
HOW DO WE APPLY BIG IT?
I know the term Big IT sounds like something we’re trying to get away from, right? Scale up is dead, scale out is hip. However, to me, Big IT is the lessons we learned about mission criticality, scale, consolidation/virtualization, and service levels that run all our companies today. We have a horde of global IT professionals keeping the lights on, and analytics has to make the jump from academia and departmental solutions to Big IT to get to Big Data. If we really want to know infinitely more about us (“us” defined as myself, myself with others, others without myself, only men, only women, only women in Europe, only teens who play football, I think you get it…), we need the IP and assets we’ve developed in the last era of scale up. We need the connectivity, we need the structure of things like ITIL, and we need hiccup-tolerant approaches that run on systems management tools, not hundreds of IT resources swapping out blown-out components they bought at a school auction. The “science project” has to become big business.
So how do we apply something as complex and focused as analytics to IT? I think it’s organizational changes that provide a vehicle for architectural changes. Companies need to consider a “Chief Strategy Officer” (CSO) role as a binding force. Some companies may make this an engineering position or a marketing position based on their primary culture, but the role should exist, and it needs to set an analytics strategy for the company. First, they should define an analytics mission statement: “What are we going to do with analytics within the company?” Then they need to answer the basic questions about what we know of our employees, our customers, and our processes. Additionally, they need to ask what in the big “datascape” we want to bring into our analytics engines to accomplish our mission. With this, they can set an architectural strategy that leverages old tech and incorporates new tech to meet the mission objectives. Otherwise the company is locked in siloed perspectives and can’t get to the bigger-order items; it’s hard to construct this monster bottom-up. Many of the companies I talk to who are starting enterprise programs still seem to be searching for how to bring it all together. The answer is: just like the CIO organized IT, companies need a C-level resource to define the charter.
With the correct organizational structure, a company can then look at an architecture that indexes the outside bits for later use, adjudicates the data flow to peel off the useful content into higher-functioning data stores, and then applies analytics packages to distill insight. Techniques like machine learning will help automate the processing and allow the super-smart operators to become more productive. And programmers will write applications that bring the global-mobile user into active participation, both creating and consuming the information within the process.