Emc Data Science Study WP PDF
[Chart: share of respondents choosing each statement]
[...] number of data scientists required by automating much of the analytical work: 17%
Increase the number of data scientists required by opening up new possibilities: 83%
demand for a certain set of skills, while later demand wanes as many of those initial skills are
automated by even newer tools. Consider, for instance, the way many data processing and network
management jobs that used to require legions of computer operators are now handled by automated
monitoring tools. Data science is still in its very early phase, with the amount of data exploding and
the right tools to process it only now becoming available.
The best source of new Data Science talent is:
Today's BI professionals: 12%
Professionals in disciplines other than IT or computer science: 27%
Other: 3%
Students studying computer science: 34%
Students studying fields other than computer science: 24%
university students.
It may be helpful to think of data science and business intelligence as being on two ends of the same
spectrum, with business intelligence focused on managing and reporting existing business data in
order to monitor or manage various concerns within the enterprise. In contrast, data science applies
advanced analytical tools and algorithms to generate predictive insights and new product
innovations that are a direct result of the data.
The need for rigorous scientific training was borne out in our research on data scientists, which draws a
clear distinction between data scientists and BI professionals. The most popular undergraduate
degree for BI professionals was business, at 37%, more than the next three categories combined.
In contrast, the most popular degree for data science professionals was computer science (24%),
followed closely by engineering (17%) and the hard sciences (11%). We also found that data science
professionals were over 2.5 times more likely to have a master's degree, and over 9 times more
likely to have a doctoral degree, than business intelligence professionals.
The data science toolkit is more varied and more technically sophisticated than the BI toolkit. While
most BI professionals do their analysis and data processing in Excel, data science professionals are
using SQL, advanced statistical packages, and NoSQL databases. Further, although big data tools
like Hadoop and advanced visualization tools like Tableau are just starting to emerge in the data
science world, they are almost unseen in the business intelligence world. Open source tools, like the
R statistics package, Python, and Perl, are each used by one in five data science professionals, but
by only around one in twenty BI professionals.
[Chart: tools used, grouped into Data Storage, Data Management, Data Analysis, and Data Visualization. Tools shown: Microsoft SQL Server, Oracle, IBM, Other SQL database, Other NoSQL database, Greenplum, Netezza, Hadoop, SQL, Excel, SAS, SPSS, STATA, R, Python, Perl, AWK, BASH, Tableau, QlikView, Datameer, Karmasphere, Microsoft Business Intelligence Tools, IBM/Cognos Business Intelligence Tools, SAP/Business Objects Business Intelligence, MicroStrategy Business Intelligence Tools, Oracle Business Intelligence Tools.]
An important facet of data science is the ability to run experiments on data, as evidenced by DJ
Patil's description of how the "People You May Know" function was built at LinkedIn:
It would have been easy to turn this into a high-ceremony development project that would take thousands of
hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn's
membership. But the process worked quite differently: it started out with a
relatively small, simple program that looked at members' profiles and made
recommendations accordingly. Asking things like, did you go to Cornell? Then
you might like to join the Cornell Alumni group. It then branched out
incrementally. In addition to looking at profiles, LinkedIn's data scientists
started looking at events that members attended. Then at books members
had in their libraries. The result was a valuable data product that analyzed a
huge database -- but it was never conceived as such. It started small, and
added value iteratively. It was an agile, flexible process that built toward its
goal incrementally, rather than tackling a huge mountain of data all at once.
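The incremental approach described in the quote can be illustrated with a minimal Python sketch. It is not LinkedIn's implementation: the Member record, its field names, and the rule functions are all hypothetical, chosen only to show how a first simple rule on profile data can be extended later with new signals without rewriting what already works.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical member record; the field names are illustrative only."""
    name: str
    schools: list = field(default_factory=list)
    events: list = field(default_factory=list)


def recommend_from_schools(member):
    # First iteration: a single simple rule on profile data
    # ("did you go to Cornell? Then you might like the Cornell Alumni group").
    return [f"{school} Alumni" for school in member.schools]


def recommend_from_events(member):
    # Later iteration: a second signal, added without touching the first rule.
    return [f"{event} Attendees" for event in member.events]


# Rules live in a list so that new signals can be appended one at a time.
RULES = [recommend_from_schools, recommend_from_events]


def recommend_groups(member, rules=RULES):
    """Apply each rule in order and collect suggestions without duplicates."""
    suggestions = []
    for rule in rules:
        for group in rule(member):
            if group not in suggestions:
                suggestions.append(group)
    return suggestions


if __name__ == "__main__":
    pat = Member("Pat", schools=["Cornell"], events=["Data Science Summit"])
    # Start small: profile data only.
    print(recommend_groups(pat, rules=[recommend_from_schools]))
    # Branch out: profile data plus events attended.
    print(recommend_groups(pat))
```

Keeping each signal in its own function mirrors the point of the quote: the product can start small and add value one rule at a time.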
Usama Fayyad, former Chief Data Officer at Yahoo, and currently CEO or Chairman at 3 mid-stage start-up companies
and the role is critical in organizations making the most out of the explosion of data they have
access to.
[Chart: How frequently do you partner with each role (% very frequently). Roles shown: Business Management, Data Scientist, Graphic Designer, HR, IT Administration, Marketing, Programmer, Sales, Statistician, Strategic Planning.]
Our findings showed that the emerging big data scientist is distinctly different from other data
professionals. For instance, nearly half of big data scientists use R, despite the fact that it is used
by only 13% of other practitioners. They are also twice as likely to use a big data storage tool like
Hadoop, Greenplum, or Netezza. Big data scientists are also remarkably well educated: 40% have a
master's degree, and an additional 17% have a doctorate. Over 90% have at least a college
education.

About how much time do you spend on the following activities (% A lot)
Activity                          Big Data    Normal Data
Acquiring new data sets           48%         27%
Parsing data sets                 50%         21%
Filtering and organizing data     58%         34%
Mining data for patterns          52%         23%
also more likely to partner frequently with business management, but, interestingly, are no more
likely to partner with IT administration.
Finally, big data scientists touch data in more ways. They are twice as likely as those working with
normal data to work across the data life cycle, everywhere from acquiring new data to business
decision making, and around half spend a lot of time on each of these activities.
Organizational Implications
In order to remain competitive in the world of data science, companies need to create organizational
cultures that are conducive to data-driven decision making. First, they need to expand their view of
the possibilities when hiring data scientists, looking beyond business degrees, and even beyond
computer science, to academic concentrations in the hard sciences, statistics, and mathematics, in
order to find practitioners with the intellectual curiosity and technical depth to solve big data
problems. Data scientists use a variety of tools, but also recognize skill gaps as a barrier to
adoption. Rather than hiring for experience with a certain toolkit, companies should invest in
on-the-job training with their chosen set of emerging technologies.
Once companies have brought in the right talent, they need to create an environment conducive to
effective data science. That means building high-performing, cross-functional teams that include a
variety of roles, including programmers, statisticians, and graphic designers, and aligning them to
directly support interested business decision makers. They should also loosen restrictions on data in
the enterprise, allowing employees to more freely run data-driven experiments. Finally, data
scientists should be given free access to run experiments on data, without bureaucratic obstacles,
so that they can rapidly translate their own intellectual curiosity into business results.
Our Methodology
The EMC Data Science Community Survey interviewed 497 data scientists and business intelligence
professionals from around the world, including deliberate samples in the United States, India, China,
the United Kingdom, Germany, and France. 465 respondents were collected through a partnership
with Toluna, one of the world's premier online panel providers. All Toluna participants were
prescreened for information technology decision-making authority, and further screened as either data
science professionals or business intelligence professionals. An additional 25 responses came from
participants in the 2011 Data Science Summit, and six through publication by Kaggle, an online
contest community for data scientists. All groups were asked the same questions with the same
screeners.
i http://gerdleonhard.typepad.com/files/wef_ittc_personaldatanewasset_report_2011.pdf
ii http://www.nytimes.com/2009/08/06/technology/06stats.html
iii http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
iv http://radar.oreilly.com/2011/09/building-data-science-teams.html
v http://radar.oreilly.com/2010/06/what-is-data-science.html