Big Data: Data Science, an emerging field to support new levels of understanding


note: lots of good embedded links this time 🙂

Today’s blog is one of a pair of essays I’m going to write on two different perspectives associated with Big Data. I believe Big Data is notably important and impactful to the immediate future of our culture. First we’ll look at what Big Data is and the dynamics of dataology, or what most people call Data Science.

I am a latent Data Scientist. It was probably my calling. Today I’m in the sales function of our business, but I started my career with a Master’s in Database Management. I was doing C++, B-trees, and Bloom filters, working with SAS data sets, and taking endless classes in statistics and economics. Frankly, I loved it. My passion was the data structures and the insight that could be acquired by slicing the chunks. Organized data, in its prescribed context, is real. It’s not some pundit’s opinion on 247News.com. It is based on facts.

For whatever reason, I ended up going down the “line of business” systems integration path. My career led me into workflow, document management, and supply chain technologies. These were connective and provided an immersive business process environment, but the treatment of the data was still inefficient and disrespectful of its value.

Yes, I have been branded a “purist” in my past life, but ultimately, my focus was more on the data and the data models than on corporate profit. I consistently ran into the walls of “good enough”. And I realized that rich data models weren’t considered profitable endeavors by the majority of American industry. Instead, most wanted little snippets of data, transactional crap, when we knew so much more.

Several industry initiatives showed up: “Business Intelligence” (BI), “Information Lifecycle Management” (ILM) and “Master Data Management” (MDM), all with the promise of a better strategy for data. In practice they were realized as purpose-built tactical systems to address specific real-life problems of sales growth, regulatory submissions and company acquisitions. None of them truly structured and exploited the fundamental Intellectual Property (IP) of the data that each company has about its customers, employees, and processes.

Add to these points that we all see “the disconnect” between data points in our daily lives. Banks don’t recognize the “savings account you” from the “home loan you”. Or take my employee electronic health records system: it shows all my doctor visit records and tests, but when I take an online health risk assessment, it still asks me pages of questions about blood pressure and cholesterol.

Well, I am an optimist, and I believe that the current industry initiative called “Data Science” is structurally different from past trends and will likely get closer to my vision in several ways.

Data Science will be more scientific with data because:

  • Data is growing in such a manner that we are actually in trouble. We have multiple points of failure in people, process, and technology; and most industry leaders recognize it. Look at this page on the digital universe, if you don’t believe me. 
  • There is a convergence of three powerful movements.
    • Data-generating devices, both personal mobile devices and industrial assets, are creating data.
    • Global data mining and analytics are on the rise.
    • Significant improvements in data warehouse and data analytics technologies allow for the next level of processing. Here’s EMC’s Big Data page as an example.
  • Big Data is the aggregation and analysis of heterogeneous data sets/collectors. The concept of big data is an important pillar in the new “Data Analytics” investments, which will be pervasive over the next several years. By definition, Big Data is not just about infinitely large purpose-built databases. It’s like a hive of bees; Big Data to me is broader, more dispersed and hierarchical in nature. It’s funny: we will know less about each individual piece of data, but en masse, we’ll know more about ourselves in many more ways. Today these initiatives will be funded by the standard engines of power and profit, but some of the most impressive data science I have seen so far is in the scientific community. Spend an hour watching TED videos like Hans Rosling leveraging MicroStrategy visuals in his HIV data analysis work. Or read one of my favorite books of all time, Freakonomics, based on the data analysis work of economist Steven Levitt.

So, I am excited about the advancement of “Science” and “Data” in this emerging field. EMC Corporation is showing industry leadership in this growing discipline, including funding studies and an annual conference for data scientists. EMC’s Data Science Summit (EDSS11) on May 23, 2011 brought together an international consortium of data scientists to help define core fundamentals and highlight the growing need for resources in this field. I applaud EMC for stepping into the proactive mentorship of the data industry. It’s a great fit for EMC and a place we need to invest in advancement. Additionally, EMC just published a survey from the summit.

Here are some of the summary findings:

       Informed Decision-making—Only 1/3 of respondents are very confident in their company’s ability to make business decisions based on new data.

       Looming Talent Shortage—65% of data science professionals believe demand for data science talent will outpace the supply over the next 5 years – with most feeling that this supply will be most effectively sourced from new college graduates.

       Customer Insights—Only 38% of business intelligence analysts and data scientists strongly agree that their company uses data to learn more about customers.

       Lack of Data Accessibility—Only 12% of business intelligence professionals and 22% of data scientists strongly believe employees have the access to run experiments on data – undermining a company’s ability to rapidly test and validate ideas and thus its approach to innovation.

       Advanced Degrees—Data scientists are 3 times as likely as business intelligence professionals to have a Master’s or Doctoral degree.

       Higher-Level Skills—Data scientists require significantly greater business and technical skills than today’s business intelligence professional. According to the Data Science Study, they are twice as likely to apply advanced algorithms to data, but also 37% more likely to make business decisions based on that data.

You will note in the survey that data scientists are inherently different from BI professionals. This confirms my belief that we’re going somewhere more all-encompassing than a “sales report” and that we’ll spend more time and money on the submerged part of the iceberg.

If you’re interested in next year’s summit, click this link: EDSS12

“One Throat to Choke” – Who’s Kidding? Your Hands Don’t Fit.


Well, here I am, back from Oracle OpenWorld and banging out another week of work. OOW for me was 3 days, jam-packed with interactions with our customers, partners, and Oracle. I found the show to be informative, and I definitely realized how invested EMC’s customers are in the Oracle-EMC solutions. EMC and Oracle have been in an industry interlock for the last two decades, together supporting some of the most impactful business processes in the world. As I talked to an endless stream of noteworthy customers, I felt that connection completely. While there, I also poked around, attended keynotes, and talked to everyone who dared to look at me (a mistake they will not make twice!).

I’d like to tell you about all the insights I took away, but frankly, I can’t keep your interest for that long. So, let me work to edit and organize my thoughts. Today I will work in vignettes. Why? Because no one has told me I couldn’t, and I am feeling creative.

Vignette One: “Crazy Mixed Up [Open]World”

The scene opens in a super large room with bright lights and thousands of people, otherwise known as the keynote hall. Monday morning, I had the privilege to sit up front with a few of our customers and some serious players from EMC as Joe Tucci, CEO of EMC, kicked off the keynotes. After a strong rally of all the great things EMC is doing around the Cloud (web coverage: http://tiny.cc/wuiu9), he introduced Pat Gelsinger, who in turn introduced Chad Sakac. Pat and Chad had an entertaining presentation on EMC’s Big Data strategy. Data growth, EMC Greenplum, virtualization, analytics engines; many topics were reviewed in the context of EMC innovations. Pat also held up a new piece of EMC technology, the Lightning flash cards. In beta, the Lightning flash cards have a CPU mounted on the blade, and they will provide a reported 320 GB of lightning-fast flash per card. Chad followed Pat’s lead and began to demo VMware’s integrated vFabric Cloud Application Platform (related coverage: http://tiny.cc/0ec74). This really showed how to take customer analytics requirements down through the software and hardware. They also showed the card at work as it vaporized performance problems live on stage. The two ended by comparing the hypothetical auto insurance costs of three people you may have heard of: Gelsinger, Tucci, and Larry Ellison. Larry’s was the most expensive, since his fleet of vehicles included a jet and a racing boat. It was a humorous way to end, and a good time was had by all.

What struck me was how, in just a handful of years, EMC had gone from a storage company to a bundled solution company. Much of what was noteworthy in the presentation was the software and software integration that has been developed. Other than the Lightning blade that Pat held up, there was little mention of the hardware. Following the event, it was my job to take Pat to a meeting with a CIO from one of our customers. In that meeting and throughout the day, you could tell that the keynote message and the energy resonated.

Following the EMC keynote, Oracle took the stage for a couple of hours and presented on Exalytics and SPARC SuperCluster; they also reviewed Exadata and Exalogic updates. By circumstance, they reiterated many of the same functionalities EMC had discussed, but with an Oracle-specific platform to support Oracle apps. In their presentation, Oracle took shots at many of their new infrastructure competitors: “23x faster than”, “more gigabit capacity for”, and “2 more DRAM of that”. The dialog continued…

Whether I was bored or in a new enlightened state, there, listening to the keynotes, it hit me. Like an episode from the “Twilight Zone”, our two companies had switched places. “Freaky Friday”, but it’s only Monday… We were spending our time talking about software and Oracle was spending their time talking about hardware. I wonder how many of the thousands of people listening thought the exact same thing? It goes to show, as Joe Tucci said, “Cloud is the most disruptive tech wave ever”. The vendors our customers have worked with for years are going through notable changes to provide for a new era of IT. The good news: customers have quality options to fulfill their requirements, and they will vote with their wallets.

Vignette Two: “Congestion at the Intersection of Cloud Meets Big Data”

Ever been in a canyon in Arizona when a thundering horde of cattle came pounding in your direction? I’ve done some hiking in New Mexico and Arizona in my life and…well, ok, I saw a cow or two, but no stampede. The closest I ever came was last week at the EMC booth at OOW11. About every 5-10 minutes we would run a theater presentation, and as the crowd left, you’d literally watch the booth staff step aside to avoid being trampled. I didn’t count them personally, but I know that way more than 13,000 people took a few minutes to talk to an expert or watch a show in our theater. Additionally, we had EMC IT speaking about our transformation to virtualize our Oracle databases internally, we had EMC TV taping customer testimonials, and our meeting space was packed for 3 days straight. Unlike a traffic intersection, there’s always room for more. Come join the movement!

Vignette Three:  “One Throat to Choke – if you have hands the size of Manhattan”

So let’s talk a little about clouds. The cloud is a lot like a mainframe (MF), without ownership issues… What, you say?!? Stay with me here… If you look at the systems in support of PaaS, IaaS, SaaS, etc., what are their major features? Virtualization, scale, consolidation, multi-tenancy, systems management, chargeback, etc. They are in many ways very similar to a big MF from the 1970s. A major difference is that the MF was a vertically integrated, mostly proprietary, single-sourced product. The efficiency of the system was high, but the flexibility of the user to choose how she used the system was limited and costly. It’s taken us 30 years to get back to the same concept with a small but massive innovation: choice. The cloud is the cloud because it’s democratic. It’s made up of many providers providing a litany of options on open systems. You get the benefits of MF on a hyper scale. These key concepts are the essence of the “tipping point” (Malcolm Gladwell) for the next wave of IT, and these concepts are what bothered me about Oracle’s strategy as they too join the cloud.

The keynote on Wednesday claimed that Oracle’s new public cloud offering is great because it’s standards based. This claim mainly hung on Java as the development platform. It was said that many existing clouds and enterprise software products are not valid because they are not based on similar standards.

Yet now, for the third year in a row, Oracle announced new appliances and a proprietary version of Linux that continue to drive the Oracle apps and DB owners to a single-sourced, primarily proprietary solution. Luckily for the thundering horde, there are good alternatives that offer better alignment to their entire IT strategy. However, the overwhelming message that this is somehow good for the industry is what I would call unproductive to the cause. A clear eye will see this as a trip “forward to the past”, back to a world Tom J. Watson would recognize.

Vignette Four: “It’s Easier to Ask for Forgiveness than Permission”

A man walks into a doctor’s office and says “It hurts when I run my Oracle apps without virtualization”; the doctor says “then virtualize”. If there was a predominant dialog running through the entire show, it was customers asking if, when and how they can virtualize Oracle. Oracle has traditionally tried to make it difficult to virtualize Oracle using VMware, presumably because a lack of VMware drives demand to their appliances and thus OVM. However, this has been a small puddle in the path of progress that many have already crossed, for both non-production and more recently production DBs. With vSphere 5 the limitations have been removed, and now it’s on a normal technology adoption cycle. I already mentioned that a company as big as EMC is converting to an approximately 99% virtualized environment. We will see many customers virtualize the database in 2012, as described in this recent press release on American Tire Distributors. For the customers I spoke with, the primary concern was that the Oracle support contract states that if there is a problem that can’t be resolved, the customer may have to migrate to a physical environment to resolve it. That’s not a crazy statement to have in a support contract, and it’s also not crazy for customers to be highly concerned about how this statement will be leveraged. I appreciate that this big opportunity to improve really important IT environments is also a risk, precisely because they are so important. This is why a natural technology adoption cycle exists, and it is similar to the debate over virtualizing MS Exchange 5-6 years ago. We’re way past that one; the databases are next to be taken by the Virtual Tsunami.

Two recent surveys came out that I want to bring to your attention.

  • Storage Attach for VMware Environments (Source: Goldman Sachs IT Spending Survey, March 2011)
    • EMC went from 33% (Dec 2010), to 40% (Feb 2011)
    • next closest competitor was 17% (Feb 2011)
  • EMC #1 Choice for Application Storage (Source: IDC’s Worldwide Quarterly Storage Systems Tracker, Mar 2011, SUDS Survey)
    • Across seven categories including Oracle, SAP, SharePoint, Exchange, VDI, Analytics, EMC is #1. 
    • The 2nd and 3rd positions were not swept by any other vendor.

At EMC, we see this happening. There is no doubt the train has left the station; it’s your decision which car to jump on, or whether you’re taking alternate transportation.

Vignette Five: “Tragically Upstaged”

On Wednesday, Larry Ellison held the keynote. If you’ve never seen Larry present, he’s casual, charismatic, and poisonous to his prey. For Wednesday the prey was SAP and Salesforce.com. Like an Xbox shoot-’em-up, there was gore everywhere; if you overlook the inaccuracies, it was a great demonstration of sleight of hand and showmanship. He also announced a few new offerings that I should spend some time on, but I’m going with a different angle here.

What really interested me happened close to the end of the keynote. Let me take you back to my blog “Shake Rattle and Roll”, where I talked about the different technologies I used versus my kids during the East Coast earthquake. I wasn’t on social media and was thus less informed and connected than my daughters. I am here to report I am reformed! I was on Twitter during Larry’s speech, typing and reading. It is there that I got the sad news about Steve Jobs. I then watched people begin to get up and leave the keynote, first a trickle, then a flow, then a flood. Those who were not on social media probably didn’t know what was happening. I, however, was connected to my fellow techies at that moment, and although I was completely bummed by the news, I felt I had closed the gap just a little on the iGeneration.