THE WHITE BOOK OF
Big Data
The definitive guide to the revolution in business analytics
Contents
Acknowledgements 4
Preface 5
1: What is Big Data? 6
2: What does Big Data Mean for the Business? 16
3: Clearing Big Data Hurdles 24
4: Adoption Approaches 32
5: Changing Role of the Executive Team 42
6: Rise of the Data Scientist 46
7: The Future of Big Data 48
8: The Final Word on Big Data 52
Big Data Speak: Key terms explained 57
Appendix: The White Book Series 60
Acknowledgements
With thanks to our authors:
• Ian Mitchell, Chief Architect, UK & Ireland, Fujitsu
• Mark Locke, Head of Planning & Architecture, International Business, Fujitsu
• Mark Wilson, Strategy Manager, UK & Ireland, Fujitsu
• Andy Fuller, Big Data Offering Manager, UK & Ireland, Fujitsu
With further thanks to colleagues at Fujitsu in Australia, Europe and Japan who kindly
reviewed the book's contents and provided invaluable feedback.
For more information on Fujitsu's Big Data capabilities and to learn how we can assist your
organisation further, please contact us at [email protected] or contact your local
Fujitsu team (see page 62).
ISBN: 978-0-9568216-2-1
Published by Fujitsu Services Ltd.
Copyright Fujitsu Services Ltd 2012. All rights reserved.
No part of this document may be reproduced, stored or transmitted in any form without prior written
permission of Fujitsu Services Ltd. Fujitsu Services Ltd endeavours to ensure that the information in
this document is correct and fairly stated, but does not accept liability for any errors or omissions.
Preface
In economically uncertain times, many businesses and public sector
organisations have come to appreciate that the key to better decisions, more
effective customer/citizen engagement, sharper competitive edge, hyper-efficient
operations and compelling product and service development is data, and lots of it.
Today, the situation they face is not any shortage of
that raw material (the wealth of unstructured online data alone has swollen
the already torrential flow from transaction systems and demographic
sources) but how to turn that amorphous, vast, fast-flowing mass of Big
Data into highly valuable insights, actions and outcomes.
This Fujitsu White Book of Big Data aims to cut through a lot of the market
hype surrounding the subject to clearly define the challenges and
opportunities that organisations face as they seek to exploit Big Data.
Written for both an IT and wider executive audience, it explores the different
approaches to Big Data adoption, the issues that can hamper Big Data
initiatives, and the new skillsets that will be required by both IT specialists
and management to deliver success. At a fundamental level, it also shows
how to map business priorities onto an action plan for turning Big Data into
increased revenues and lower costs.
At Fujitsu, we have an even broader and more comprehensive vision for
Big Data as it intersects with the other megatrends in IT: cloud and
mobility. Our Cloud Fusion innovation provides the foundation for business-optimising Big Data analytics, the seamless interconnecting of multiple
clouds, and extended services for distributed applications that support
mobile devices and sensors.
We hope this book offers some perspective on the opportunities made real
by such innovation, both as a Big Data primer and for ongoing guidance
as your organisation embarks on that extended, and hopefully fruitful,
journey. Please let us know what you think and how your Big Data
adventure progresses.
Cameron McNaught
Senior Vice President and Head of Strategic Solutions
International Business
Fujitsu
What is Big Data?
In short, Big Data is about quickly deriving business value from a range of
new and emerging data sources, including social media data, location data
generated by smartphones and other roaming devices, public information
available online and data from sensors embedded in cars, buildings and
other objects and much more besides.
Many analysts use the 3V model to define Big Data. The three Vs stand for
volume, velocity and variety.
Volume refers to the fact that Big Data involves analysing comparatively
huge amounts of information, typically starting at tens of terabytes.
Velocity reflects the sheer speed at which this data is generated and
changes. For example, the data associated with a particular hashtag on
Twitter often has a high velocity. Tweets fly by in a blur. In some instances
they move so fast that the information they contain can't easily be stored,
yet it still needs to be analysed.
Data speed: In a Big Data world, one of the key factors is speed. Traditional analytics focus on analysing historical data. Big Data extends this concept to include real-time analytics of in-flight, transitory data.
Variety describes the fact that Big Data can come from many different
sources, in various formats and structures. For example, social media sites
and networks of sensors generate a stream of ever-changing data. As well
as text, this might include, for example, geographical information, images,
videos and audio.
Data sources: Big Data not only extends the data types, but also the sources the data is coming from, to include real-time, sensor and public data sources, as well as in-house and subscription sources.
The growth of semi-structured data (see Data types, right) is driving the
adoption of new database models based on the idea of Linked Data. These
reflect the way information is connected and represented on the Internet, with
links cross-referencing various pieces of associated information in a loose web,
rather than requiring data to adhere to a rigid, inflexible format where
everything sits in a particular, predefined box. Such an approach can provide the
flexibility of an unstructured data store along with the rigour of defined data
structures. This can enhance the accuracy and quality of any query and
associated analyses.
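To make the idea concrete, here is a minimal sketch (in Python, with entirely illustrative identifiers, not any particular Linked Data product) of data held as subject-predicate-object links rather than rows in a fixed schema: new kinds of fact can be added without redesigning anything, and a query simply follows the links.

```python
# Illustrative sketch of the Linked Data idea: facts as subject-predicate-object
# triples that cross-reference one another, rather than rows in a rigid schema.
# All identifiers below are invented for the example.

triples = [
    ("customer/42", "hasName", "Jane Smith"),
    ("customer/42", "purchased", "product/7"),
    ("product/7", "hasName", "Widget Pro"),
    ("product/7", "mentionedIn", "tweet/991"),
    ("tweet/991", "hasSentiment", "positive"),
]

def follow_links(start, hops=2):
    """Collect every triple reachable from a starting node within `hops` links."""
    frontier, found = {start}, []
    for _ in range(hops):
        next_frontier = set()
        for subject, predicate, obj in triples:
            if subject in frontier:
                found.append((subject, predicate, obj))
                next_frontier.add(obj)
        frontier = next_frontier
    return found

# Starting from a customer, two hops reach the products they bought and how
# those products are being talked about online.
for fact in follow_links("customer/42"):
    print(fact)
```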
While the 3V model is a useful way of defining Big Data, in this book we will also be
concentrating on a fourth, vital V: value. There is no point in organisations
implementing a Big Data solution unless they can see how it will give them
increased business value. That might not only mean using the data within their
own organisation; value could also come from selling it or providing access to third
parties. This drive to maximise the value of Big Data is a key business imperative.
There are other ways in which Big Data offers businesses new ways to generate
value. For example, whereas traditional business analytical systems had to
operate on historical data that might be weeks or months out of date, a Big
Data solution can also analyse information being generated in real time (or at
least close to real time). This can deliver massive benefits for businesses, as they
are able to respond more quickly to market trends, challenges and changes.
Furthermore, Big Data solutions can add new value by analysing the sentiment
contained in the data rather than just looking at the raw information (for example,
they can understand how customers are feeling about a particular product). This is
known as semantic analysis. There are also growing developments in artificial
intelligence techniques that can be used to perform complex fuzzy searches and
unearth new, previously impenetrable business insights from the data.
In summary, Big Data gives organisations the opportunity to exploit a
combination of existing data, transient data and externally available data
sources in order to extract additional value through:
l Improved business insights that lead to more informed decision-making
l Treating data as an asset that can be traded and sold.
It is therefore important that organisations keep sight of the long-term goal
of Big Data: to integrate many data sources in order to unlock even more value.
The trouble with all new trends and buzz-phrases is that they quickly become the
latest bandwagon for suppliers. As noted at the start of this chapter, all manner
of products and services are now being paraded under the Big Data banner,
which can make the topic seem incredibly confusing (hence this book). This is
compounded when vendors whose products might only pertain to a small part of
the Big Data story grandly market them as Big Data solutions, when in fact
they're just one element of a solution. As a marketing term, then, be aware that
Big Data means about as much as the term cloud, i.e. not a great deal.
History tells us that yesterday's big is today's normal. Some over-40s reading
this book will probably remember wondering how they were ever going to fill the
1 kilobyte of memory on their Sinclair ZX81. Today we walk around with tens of
gigabytes in our pockets.
When the concept of Big Data first emerged, there was a lot of talk about
relative accuracy. It was said that over a large, fluid set of data, a Big Data
solution could give a good approximate answer, but that organisations requiring
greater accuracy would need a traditional data warehouse or BI solution. While
that's still true to a degree, many of today's Big Data solutions use the same
algorithms (computational analysis methods) as traditional BI systems,
meaning they're just as accurate. Rather than fixating on the mathematical
accuracy of the answers given by their systems, organisations should instead
focus on the business relevance of those answers.
The term Big Data has only been in common use since mid-2009, so it might seem
natural to assume that early adopters face the usual slew of teething problems.
However, this is not the case. That's not because the IT industry has become any
better at avoiding such problems. Rather, it's because although the term Big
Data may be relatively new, the concept is certainly not.
Consider an organisation like Reuters (whose business model is based on
extracting relevant news from a mass of data and getting it to the right people
as quickly as possible), it has been dealing with Big Data for over 100 years. In
more recent years, so have Twitter, Facebook, Google, Amazon, eBay and a raft
of other well-known online names. Today, the bigger problem is that so much
data is thrown away, ignored or locked up in silos where it adds minimal value.
Being able to integrate available data from different sources in order to extract
more value is vital to making any Big Data solution successful. Many
organisations already have a data warehouse or BI system. However, these
typically only operate on the structured data within an organisation.
A common misconception is that a Big Data solution is simply a search tool. This
view probably comes from the fact that Google is a pioneer and key player in the
Big Data space. But a Big Data solution contains many more features than simply
search. Going back to our Vs, search can deal with volume and variability, but it
can't handle velocity, which reduces the value it can offer on its own to a business.
CIOs are often concerned with what a Big Data solution should look like, how they
can deliver one and the ways in which the business might use it. The diagram
below gives a simple breakdown of how such a solution can be structured. The
red box represents the solution itself. Outside, on the left-hand side, are the
various data sources that feed into the system for example, open data (e.g.
public or government-provided data, commercial data sites), social media (e.g.
Twitter) or internal data sources (e.g. online transaction or analytical systems).
Structure of a Big Data Solution
[Diagram: data sources on the left (Sensors, Social Media, Open Data, Structured Data, Unstructured Data) feed the solution through Data Integration; inside the solution sit Data Transformation, Complex Event Processing, Streaming, Semantic Analysis, Historical Analysis, Search, Data Storage and Visualisation, all running on the Platform Infrastructure; Reports, Dashboards, etc serve Business Decision-makers, Data Consumers, Data Scientists, Application Developers, Business Partners and Consuming Systems across the top and right of the diagram.]
The first function of the solution is data integration: connecting the system to
these various data sources (using standard application interfaces and protocols).
This data can then be transformed (i.e. changed into a different format for ease of
storage and handling) via the data transformation function, or monitored for key
triggers in the complex event processing function. This function looks at every
piece of data, compares it to a set of rules and raises an alert when a match is
found. Some complex event processing engines also allow time-based rules (e.g.
alert me if my product is mentioned on Twitter more than 10 times a second).
The data can then be processed and analysed in near real time (using massively
parallel analysis) and/or stored within the data storage function for later
analysis. All stored data is available for both semantic analysis and traditional
historical analysis (which simply means the data is not being analysed in real
time, not that the analysis techniques are old-fashioned).
Search is also a key part of the Big Data solution and allows users to access data
in a variety of ways, from simple, Google-like, single-box searches to complex
entry screens that allow users to specify detailed search criteria.
The data (be it streaming data, captured data or new data generated during
analysis) can also be made available to internal or external parties who wish to
use it. This could be on a free or fee basis, depending on who owns the data.
Application developers, business partners or other systems consuming this
information do so via the solutions data access interface, represented on the
right-hand side of the diagram.
Finally, one of the key functions of the solution is data visualisation: presenting
information to business users in a form that is meaningful, relevant and easily
understood. This could be textual (e.g. lists, extracts, etc) or graphical (ranging
from simple charts and graphs to complex animated visualisations).
Furthermore, visualisation should work effectively on any device, from a PC to a
smartphone. This flexibility is especially important since there will be a variety of
different users of the data (e.g. business decision-makers, data consumers and
data scientists represented across the top of the diagram), whose needs and
access preferences will vary.
With the rise of Big Data and the growing ease of access to vast numbers of
data records and repositories, personal data privacy is becoming ever harder to
guarantee even if an organisation attempts to anonymise its data. Big Data
solutions can integrate internal data sets with external data such as social
media and local authority data. In doing so, they can make correlations that
de-anonymise data, resulting in an increased (and, to many, worrying) ability
to build up detailed personal profiles of individuals.
Today organisations can use this information to filter new employees, monitor
social media activity for breaches of corporate policy or intellectual property and
so on. As the technical capability to leverage social media data increases, we
may see an increase in the corporate use of this data to track the activities of
individual employees. While this is less of a concern in countries such as the UK
and Australia, where citizens rights to privacy and fair employment are a major
focus, such issues are not uniformly recognised by governments around the
world. These concerns have led to a drive among privacy campaigners and EU
data protection policy-makers towards a 'right to forget' model, where anyone
can ask for all of their data to be removed from an organisation's systems and
be completely forgotten.
Many of the concerns are borne out of stories such as people being turned down
for a job because an employer found a compromising picture of them on Facebook,
or companies sacking people for something they've posted in a private capacity
on social media. But as today's younger generation becomes the management
of tomorrow, it is likely to be more relaxed about both data privacy issues and
about what employees reveal about what they get up to in their own time. As a
result, we're likely to see a move towards more of a 'right to forgive' model
where individuals feel able to place more trust in organisations not to misuse
their data, and those organisations will be less likely to do so.
The generation that has grown up with social media understands, for example,
that if a photograph of someone inebriated at a party is posted on Facebook,
it doesn't mean that person is an unworthy employee. Once such a more relaxed
attitude to personal privacy becomes pervasive, data will become more
accessible as people trust it won't be misinterpreted or misused by businesses
and employers.
So when is the right time to adopt a Big Data solution? Just as has happened
with mobile phones, our dependency on data will increase over time. This will
come about as consumers' trust in the data grows in line with it becoming both
more resilient and more accessible. Given that Big Data is not actually new (as
discussed earlier), late adopters may surprisingly quickly come to suffer the
negative business consequences of not embracing it sooner.
For the past decade or so, businesses have often categorised data according to a
traditional knowledge management (KM) model known as the DIKW hierarchy
(data, information, knowledge, wisdom). In this model, each level is built from
elements contained in the previous level. But in the context of Big Data, this
needs to be extended to more accurately reflect organisations' need to gain
business value from their (and others) data. A better model might be:
• Integrated data: data that is connected and reconnected to make it more valuable
• Actionable information: information put into the hands of those that can use it
• Insightful knowledge: knowledge that provides real insight (i.e. not just a stored document)
• Real-time wisdom: getting the answer now, not next week.
Of course, some organisations have put significant investment into traditional
knowledge management systems and processes. So in regard to KM and its
relationship with Big Data, it is worth noting the following:
1. KM is an enabler for Big Data, but not the goal
2. KM activities achieve better outcomes for structured data than for unstructured
or semi-structured data
3. The principles of KM are still important but they need to be interpreted in new
ways for the new types of data being processed
4. KM focuses much effort on storing all data, but that is not always the focus
with Big Data, particularly when analysing in-flight (transient) data.
In that sense Big Data has a librarian's focus. The archivist wants to store data
but is less interested in making it accessible. The librarian is less interested in
storing data as long as he or she has access to it and can provide the
information that their clients need.
What does Big Data Mean for the Business?
After the war, organisations began to realise that computing was also the key to
securing business advantage, giving them the opportunity to work more quickly
and efficiently than their competitors, and the IT industry was born.
Today IT has spread beyond the confines of the military, government and business,
playing a part in almost every aspect of people's lives. The consumerisation of IT
has meant that most people in developed societies now own powerful, connected
computing devices such as laptops, tablet PCs and smartphones. Combined with
the growth of the Internet, this means an immense and exponentially growing
amount of data is being generated and is potentially available for analysis. This
encompasses everything from highly structured information, such as government
census data, to unstructured information, such as the stream of comments and
conversations posted on social networks.
The challenge for organisations now is to achieve insightful results like those of the
wartime code-breakers, but in a very much more complicated world with many
additional sources of information. In a nutshell, the Big Data concept is about
bringing a wide variety of data sources to bear on an organisations challenges and
asking the right type of questions to give relevant insights in as near to real time
as possible. This concept implies:
• Data sets are large and complex, consisting of different information types and
multiple sources
The challenge is to find gold in the ever-growing mountain of information and act on it in near real time.
For different businesses and roles, this will mean different things. How someone
assesses and balances factors such as value, cost, risk, reward and time when
making decisions will vary according to their particular organisational and
operational priorities. For example, sales and marketing professionals might focus
on entering new markets, winning new customers, increasing brand awareness,
boosting customer loyalty and predicting demand for a new product. Operations
personnel, meanwhile, are more likely to concentrate on ensuring their
organisation's processes are as optimal and efficient as possible, with a focus on
measuring customer satisfaction.
A monumental impact
Real-time insight will have a huge impact on everyone's lives, as big as any
historical technological breakthrough, including the advent of the PC and
emergence of the Internet. By 2017, it's likely that:
Alternative perspectives on an organisation's data can open new pathways to success.
• Structured data is no longer the only data that can be analysed. This leads to new opportunities and possibilities. Unstructured social media data is a gold mine, for example.
• Creating Big Data programmes focused on the customer is good, but businesses shouldn't forget to track the competition as well.
70% of senior managers believe Big Data has the potential to drive competitive edge.
(Survey of 200 senior managers by Coleman Parkes Research for Fujitsu UK & Ireland, 2012)
Clearing Big Data Hurdles
of trustworthiness can also (but not necessarily) equate to whether the source is
internal or external, paid or unpaid, the age of the data and the size of the sample.
Data source dependency
If a business model relies on a particular external data source, it is important to
consider what would happen if that source were no longer available, or if a
previously free source started to levy access charges. For example, GPS sensor data
may provide critical location data, but in the event of a war it might become
unavailable in a certain region or its accuracy could be reduced. Another example is
the use of (currently free) open data from government sources. A change of policy
might lead to the introduction of charges for commercial use of certain sources.
Avoid analytical paralysis
Access to near real-time analytics can offer incredible advantages. But the sheer
quantity of potential analyses that a business can conduct means theres a danger
of 'analytical paralysis': generating such a wealth of information and insight
(some of it contradictory) that it's impossible to interpret. Organisations need to
ensure they are sufficiently informed to react without becoming overwhelmed.
Senior managers need to ensure Big Data initiatives are not undermined by employee resistance to change.
the fear that advanced predictive analytics undermines the role of skilled teams
in areas such as forecasting, marketing and risk profiling. If their fears aren't
comprehensively addressed at the outset, such employees may attempt to
discredit the Big Data initiative in its early stages and could potentially derail it.
Technical challenges
Many of Big Data's technical challenges also apply to data in general. However, Big
Data makes some of these more complex, as well as creating several fresh issues.
Chapter 1 outlined the technical elements of a Big Data solution (see The IT bit,
page 11). Below, we examine in more detail some of the challenges and
considerations involved in designing, implementing and running these elements.
Data integration
Since data is a key asset, it is increasingly important to have a clear understanding
of how to ingest, understand and share that data in standard formats in order that
business leaders can make better-informed decisions. Even seemingly trivial data
formatting issues can cause confusion. For example, some countries use a comma
to express a decimal place, while others use commas to separate thousands,
millions, etc., which is a potential cause of error when integrating numerical data from
different sources. Similarly, although the format may be the same across different
name and address records, the importance of first name and family name may
be reversed in certain cultures, leading to the data being incorrectly integrated.
Organisations might also need to decide if textual data is to be handled in its
native language or translated. Translation introduces considerable complexity:
for example, the need to handle multiple character sets and alphabets.
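As a minimal illustration of the decimal-separator problem (a hypothetical helper, not part of any particular integration product), the same digits can only be interpreted once the convention of the source is known:

```python
# Illustrative sketch: "1.234" means different amounts depending on the source
# country's convention, so numbers must be normalised before integration.

def parse_number(raw: str, decimal_sep: str) -> float:
    """Normalise a numeric string whose source uses a known decimal separator."""
    group_sep = "." if decimal_sep == "," else ","
    return float(raw.replace(group_sep, "").replace(decimal_sep, "."))

print(parse_number("1.234", decimal_sep=","))  # 1234.0 - continental European style
print(parse_number("1.234", decimal_sep="."))  # 1.234  - UK/US style
```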
Further integration challenges arise when a business attempts to transfer
external data to its system. Whether this is migrated as a batch or streamed, the
infrastructure must be able to keep up with the speed or size of the incoming
data. The selected technology therefore has to be adequately scalable, and the
IT organisation must be able to estimate capacity requirements effectively.
Another important consideration is the stability of the system's connectors
(the points where it interfaces with and talks to the systems supplying external
data). Companies such as Twitter and Facebook regularly make changes to their
application programming interfaces (APIs) which may not necessarily be
published in advance. This can result in the need to make changes quickly to
ensure the data can still be accessed.
Data transformation
Another challenge is data transformation: the need to define rules for handling
data. For example, it may be straightforward to transform data between two
systems where one contains the fields 'given name' and 'family name' and the
other has an additional field for 'middle initial', but transformation rules will be
more complex when, say, one system records the whole name in a single field.
Organisations also need to consider which data source is primary (i.e. the correct,
master source) when records conflict, or whether to maintain multiple records.
Handling duplicate records from disparate systems also requires a focus on data
quality (see also Complex event processing and Data integrity below).
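A minimal sketch of such a transformation rule, assuming hypothetical field names and a deliberately simple split of the single "whole name" field, might look like this:

```python
# Illustrative sketch of a transformation rule: mapping a single "whole name"
# field onto separate given-name, middle-initial and family-name fields.
# Field names are assumptions; real rules would also need to cope with
# suffixes, multi-part surnames, cultural name ordering, etc.

def split_whole_name(whole_name: str) -> dict:
    parts = whole_name.split()
    record = {"given_name": parts[0], "middle_initial": "", "family_name": parts[-1]}
    # Treat a single letter (optionally followed by a full stop) as a middle initial.
    if len(parts) == 3 and len(parts[1].rstrip(".")) == 1:
        record["middle_initial"] = parts[1].rstrip(".")
    return record

print(split_whole_name("Thomas J. Jones"))
# {'given_name': 'Thomas', 'middle_initial': 'J', 'family_name': 'Jones'}
print(split_whole_name("Tom Jones"))
# {'given_name': 'Tom', 'middle_initial': '', 'family_name': 'Jones'}
```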
Complex event processing
Complex event processing (CEP) effectively means (near) real-time analytics.
Matches are triggered from data based on either business or data management
rules. For example, a rule might look for people with similar addresses in different
types of data. But it is important to consider precisely how similar two records are
before accepting a match. For example, is there only a spelling difference in the
name or is there a different house number in the address line? There may well be
two Tom Joneses living in the same street in Pontypridd, but 'Tom Jones' and
'Thomas Jones' at the same address are probably the same person.
IT professionals are used to storing data and running queries against it, but CEP
stores queries that are processed as data passes through the system. This means
rules can contain time-based elements, which are more complicated to define.
For example, a rule that says 'if more than 2% of all shares drop by 20% in less
than 30 seconds, shut down the stock market' may sound reasonable, but the
trigger parameters need to be thought through very carefully. What if it takes 31
seconds for the drop to occur? Or if 1% of shares drop by 40%? The impact is
similar, but the rule will not be triggered.
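A minimal sketch of that idea, using the share-price rule above purely as an illustration (the thresholds, window and market size are assumptions, not a real market-surveillance rule), shows how a stored query is evaluated as each event streams past:

```python
# Illustrative sketch of a stored CEP query evaluated as events stream through.

from collections import deque

WINDOW_SECONDS = 30        # "in less than 30 seconds"
DROP_THRESHOLD = 0.20      # "drop by 20%"
BREADTH_THRESHOLD = 0.02   # "more than 2% of all shares"
TOTAL_SHARES = 5000        # assumed number of listed shares

recent_drops = deque()     # (timestamp, symbol) events still inside the window

def on_price_move(timestamp: float, symbol: str, fall: float) -> bool:
    """Feed one price movement through the stored rule; True means 'halt trading'."""
    if fall >= DROP_THRESHOLD:
        recent_drops.append((timestamp, symbol))
    # Slide the window forward, forgetting anything older than 30 seconds.
    while recent_drops and timestamp - recent_drops[0][0] >= WINDOW_SECONDS:
        recent_drops.popleft()
    affected = {sym for _, sym in recent_drops}
    return len(affected) / TOTAL_SHARES > BREADTH_THRESHOLD

# A sell-off spread over 31 seconds, or 1% of shares falling 40%, never fires
# this rule - the impact may be just as severe, but the parameters say otherwise.
```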
Semantic analysis
Semantic analysis is a way of extracting meaning from unstructured data. Used
effectively, it can uncover people's sentiments towards, for example, organisations
and products, as well as unearthing trends, untapped customer needs, etc.
However, it is important to be aware of its limitations. For example, computers
are not yet very good at understanding sarcasm or irony, and human intervention
might be required to create an initial schema and validate the data analysis.
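A deliberately naive, lexicon-based sketch (illustrative word lists only; real semantic analysis uses far richer language models) makes that limitation concrete: counting positive and negative words scores sarcasm as positive sentiment.

```python
# Deliberately naive lexicon-based sentiment scoring, to show why human
# validation is still needed. Word lists and examples are illustrative only.

POSITIVE = {"love", "great", "brilliant", "fast"}
NEGATIVE = {"hate", "broken", "slow", "awful"}

def sentiment(text: str) -> int:
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("Love the new phone, brilliant battery"))  #  2 -> positive
print(sentiment("The app is slow and keeps crashing"))     # -1 -> negative
print(sentiment("Oh great, my order has been lost again")) #  1 -> sarcasm misread as positive
```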
Historical analysis
Historical analysis could be concerned with data from any point in the past. That
is not necessarily last week or last month; it could equally be data from 10 seconds
ago. While IT professionals may be familiar with such an application, its meaning
can sometimes be misinterpreted by non-technical personnel encountering it.
Search
As Chapter 1 outlined, search is not always as simple as typing a word or phrase
into a single text input box. Searching unstructured data might return a large
number of irrelevant or unrelated results. Sometimes, users need to conduct
more complicated searches containing multiple options and fields. IT
organisations need to ensure their solution provides the right type and variety of
search interfaces to meet the business's differing needs.
Another consideration is how search results are presented. For example, the data
required by a particular search could be contained in a single record (e.g. a
specific customer), in a ranked listing of records (e.g. articles listed according to
their relevance to a particular topic), or in an unranked set of records (e.g.
products discontinued in the past 12 months). This means IT professionals need
to consider the order and format in which results are returned from particular
types of searches. And once the system starts to make inferences from data,
there must also be a way to determine the value and accuracy of its choices.
Data storage
As data volumes increase, storage systems are becoming ever more critical. Big
Data requires reliable, fast-access storage. This will hasten the demise of older
technologies such as magnetic tape, but it also has implications for the
management of storage systems. Internal IT may increasingly need to take a
similar, commodity-based approach to storage as third-party cloud storage
suppliers do today, i.e. removing (rather than replacing) individual failed
components until they need to refresh the entire infrastructure. There are also
challenges around how to store the data: for example, whether in a structured
database or within an unstructured (NoSQL) system, or how to integrate
multiple data sources without over-complicating the solution.
Data integrity
For any analysis to be truly meaningful it is important that the data being analysed
is as accurate, complete and up to date as possible. Erroneous data will produce
misleading results and potentially incorrect insights.
Data replication
Generally, data is stored in multiple locations in case one copy becomes corrupted
or unavailable. This is known as data replication. The volumes involved in a Big
Data solution raise questions about the scalability of such an approach. However,
Big Data technologies may take alternative approaches. For example, Big Data
frameworks such as Hadoop (see Chapter 1, page 15) are inherently resilient,
which may mean it is not necessary to introduce another layer of replication.
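As a minimal illustration of the replication idea (generic block placement, not how Hadoop's HDFS actually allocates replicas), each block of data can be written to several nodes so that losing one node loses no data:

```python
# Illustrative sketch of block-level replication: every block is assigned to
# `replication_factor` distinct nodes. Node names and counts are assumptions.

from itertools import cycle

def place_blocks(blocks, nodes, replication_factor=3):
    """Assign each block to replication_factor distinct nodes, round-robin."""
    ring = cycle(range(len(nodes)))
    placement = {}
    for block in blocks:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication_factor)]
    return placement

print(place_blocks(["blk-1", "blk-2", "blk-3"],
                   ["node-a", "node-b", "node-c", "node-d"]))
# Each block ends up on three different nodes, so any single failure is survivable.
```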
Data migration
When moving data in and out of a Big Data system, or migrating from one
platform to another, organisations should consider the impact that the size of
the data may have. Not only does the extract, transform and load process need
to be able to deal with data in a variety of formats, but the volumes of data will
often mean that it is not possible to operate on the data during a migration, or,
at the very least, there needs to be a system to understand what is currently
available or unavailable.
Visualisation
While it is important to present data in a visually meaningful form, it is equally
important to ensure presentation does not undermine the effectiveness of the
system. Organisations need to consider the most appropriate way to display the
results of Big Data analytics so that the data does not mislead. For example, a
graph might look good rendered in three dimensions, but in some cases a simpler
representation may make the meaning of the data stand out more clearly. In
addition, IT should take into account the impact of visualisations on the various
target devices, on network bandwidth and on data storage systems.
Data access
The final technical challenge relates to controlling who can access the data, what
they can access, and when. Data security and access control is vital in order to
ensure data is protected. Access controls should be fine-grained, allowing
organisations not only to limit access, but also to limit knowledge of its existence.
One issue raised by advanced analytics is the possibility that the aggregation
of different data sources reveals information that would otherwise be deemed a
security or privacy risk. Enterprises therefore need to pay attention to the
classification of data. This should be designed to ensure that data is not locked
away unnecessarily (limiting its potential to provide actionable insights) but
equally that it doesnt present a security or privacy risk to any individual or company.
In addition, open-ended queries (searching with wildcards) may have performance
implications or cause concerns from a data extraction perspective. For example,
organisations may invest significant resources in consolidating data to give them a
single view of customers or other key competitive information, which could
become a target for hacking, theft or sabotage.
If a business provides an external interface to its data by means of an API, this
needs to be maintained, echoing the challenge referred to above (under Data
integration), but this time as a provider of data rather than as a consumer. Finally,
application developers need to be aware that the move from serial to parallel
processing may affect the way that applications are designed and implemented.
Legislative challenges
From a legal standpoint, many of the challenges relate to data ownership, privacy
and intellectual property rights. Over time, we can expect a societal shift in attitudes
towards data handling, but currently organisations have to take into account that:
• Depending on where data originates, there may be issues around ownership,
intellectual property and licensing, all of which will need to be resolved before
data can be used
• As data is aggregated, even anonymised data may contain identifiable
information, which may place a business in breach of data protection regulations
• With data being stored on various systems across the globe, there may be
issues of data sovereignty and residency (i.e. questions over which country or
countries can claim legal jurisdiction over particular data) depending on the type
of data being stored and processed.
Adoption Approaches
There are early adoption case studies in which retail companies, for example, are
using Big Data solutions to bring together weather information and logistics data
to deliver the right products, to the right place, just in time. However, the most
obvious and predominant use of Big Data solutions is in the area of customer
analysis and product cross-selling.
External and open data
Another major trend driving business interest in Big Data is the growing
availability of external and open data sets. As previous chapters have noted,
businesses will increasingly need to consume and aggregate data from outside
their own organisations. That will not only extend to unstructured data from social
networks, sensors, etc, but also to public data sets and private databases.
Free sources: There are many free sources of data to be exploited, not all of them within your organisation.
All these trends help to answer the question 'why now?', but the next question for
many businesses is 'what next?'. How should a business interested in adopting a
Big Data solution approach the task?
[Diagram: transactional systems (OLTP and OLAP operating on structured, transactional data) contrasted with Big Data systems (Linked Data and NoSQL stores handling unstructured data and external, open data sources).]
The last chapter outlined some of the
challenges organisations can face along the road, and questions they need to ask.
This chapter will look in more detail at how organisations can overcome some of
those challenges and find the most appropriate and successful adoption approach
for their particular business.
Before adopting a Big Data solution, an organisation needs to know the problem
it wants to solve. For example, it might want to understand the relationship
between customers' buying patterns and their online influence (on social
networks, etc). Equally, it also needs to understand how best to represent the
results (in a table, graphical format, textually, etc).
Data lifecycles: Many proposed approaches to the management of Big Data could result in organisations creating new information silos.
There are no definitive answers to these questions. They will vary depending
on the contents of the data, an organisation's policies and any legal or
regulatory restrictions. However, these considerations underline the need for
robust data lifecycles when adopting Big Data solutions.
In effect, data is created (arrives), is maintained (exists), and then is deleted
(disappears) at some point in the future. That data needs to be managed in a
way that ensures it is kept for the right length of time and no longer. But Big
Data's potential to find new insights and generate value from old data prompts
another question: shouldn't organisations be keeping it forever, as long as
they can do so securely, cost-effectively and legally?
When selecting tools for Big Data analysis, organisations face a number of
considerations:
• Where will the data be processed? Using locally-hosted software on a
dedicated appliance, or in the cloud? Early Big Data usage is likely to focus on
business analytics. Enterprises will need to turn capacity on and off at particular
times, which will result in them favouring the cloud over private solutions.
• From where does the data originate and how will it be transported? It's
often easier to move the application than it is to move the data (e.g., for large
data sets already hosted with a particular cloud provider). If the data is
updated rapidly then the application needs to be close to that data in order to
maximise the speed of response.
• How clean is the data? The hotch-potch nature of Big Data means that it
needs a lot of tidying up, which costs time and money. It can be more
effective to use a data marketplace. Although the quality of the data provided
via such marketplaces can vary, there will generally be a mechanism for user
feedback/ratings which can give a good indication of the quality of the
different services on offer.
• What is the organisation's culture? Does it have teams with the necessary
skills to analyse the data? Big Data analysis requires people who can think
laterally, understand complicated mathematical formulae (algorithms)
and focus on business value. Data science teams need a combination of
technical expertise, curiosity, creative flair and the ability to communicate their
insights effectively.
• What does the organisation want to do with the data? As well as honing
the choice of tools, having an idea about the desired outcomes of the analysis
can also help a business to identify relevant patterns or find clues in the data.
[Diagram: a target chart with Increased profit at the centre, ringed by Increased revenue (left) and Cost reduction (right), each split into Direct and Indirect actions. The outer segments are: Acquire new customers, Increase existing customer revenue, Increase customer loyalty, Increase brand awareness, Develop new products or services, Decrease time to market, Improve productivity, Reduce fulfilment errors, Reduce customer support dependencies, Reduce capital requirements, Displace costs and Increase customer satisfaction. A key marks segments as High value, Medium value, Low value or No value; readers are invited to add their own shading to the diagram.]
The chart above has been designed to help readers of this book identify the
key Big Data priorities for their businesses. At the heart of business is the drive
for increased profit, represented here in the centre of the target. Working
outwards, businesses can either increase profit by increasing revenue or
reducing costs. Both of these methods can be achieved through either direct or
indirect actions, but by combining the two we move outwards towards the
appropriate segment.
The outer circle shows the various actions a business can take. The left
hemisphere contains revenue-increasing actions, while the right side contains
cost-reducing actions. The diagram also splits horizontally to show direct
actions (in the top half) and indirect actions (bottom half). From this it is easy
to see the possible actions a business can take to increase revenues or reduce
costs, either directly or indirectly. These are also listed below with examples:
Direct actions to increase revenues:
• Develop new products or services (e.g. to address new opportunities)
• Increase existing customer revenue (e.g. by raising prices)
In common with any business change initiative, a Big Data project needs to be
business-led and (ideally) executive-sponsored; it will never work as an
isolated IT project. Even more importantly, Big Data is a complex area that
requires a wide range of skills: it spans the whole organisation, and the entire
executive team needs to work together (not just the CIO).
In addition, there is a dearth of data scientists, and it may be necessary to fill
gaps with cross training. The next two chapters of this book look at the
changing role of the executive team and the rise of the data scientist.
Over 70% of organisations rank better marketing and responding to changing needs of the customer/citizen as the areas where Big Data could have most impact.
(Survey of 200 senior managers by Coleman Parkes Research for Fujitsu UK & Ireland, 2012)
[Table: Problem / Solution / Value / Next step, mapping business problems to Big Data actions. Rows cover: Acquire new customers and Increase existing customer revenue (value: increased sales); Develop new products or services; Increase brand awareness (value: a clearer understanding of where your brand is strong or weak, allowing you to target specific sectors, with near real-time feedback of campaign effectiveness); Increase customer loyalty; Increase customer satisfaction (value: quicker identification of customer issues and comments, analysed to identify trends, etc); Reduce fulfilment errors (value: identify consumer problems quicker; address bad word-of-mouth marketing quickly); Decrease time to market; Reduce customer support dependencies; Reduce capital requirements; Displace costs; Improve productivity.]
Changing Role of the Executive Team
Big Data is not an IT-owned decision; it's a business-owned decision, i.e. what
does an organisation need to know to be more effective in executing its strategy
and business model to maintain competitive advantage in an ever-more complex
and competitive business landscape?
'To out-compute is to out-compete.' This oft-quoted maxim, first coined almost
a decade ago by Bob Bishop, ex-CEO of former market-leading workstation
supplier Silicon Graphics, is no less true today than it ever was. The next
generation of leading organisations will achieve their success by making the
best use of IT to exploit an ever-growing mountain of Big Data. Executive teams
that don't understand what is possible with the technology will not be able to
lead their businesses effectively.
In some areas, the nature of key leadership roles has changed dramatically in
modern times. For example, 30 years ago, few people would have imagined that
statisticians and behavioural scientists would shape political parties' strategies,
or that elections would be won through targeted online campaigns rather than
blanket mass-media coverage.
and his story was recounted in the 2003 book Moneyball by Michael Lewis
(made into a movie starring Brad Pitt in 2011).
These examples illustrate how today's and tomorrow's executives will need to
ensure they are receiving a timely and accurate flow of information, which they
can use to make better decisions and give their organisations an edge over the
competition. Only time will tell if they can make the transition successfully, but
those seeking a head start should certainly be sure they are aware of the value
that Big Data initiatives can potentially bring to their businesses.
And it will take strong leadership, particularly since organisations are likely to
experience various forms of resistance to Big Data. Executive teams should
pinpoint initial projects to sponsor and look for early success stories they can
exploit to ensure their organisations remain excited and positive about the
opportunities ahead.
55% of companies are already changing their business processes to make best use of Big Data.
(Survey of 200 senior managers by Coleman Parkes Research for Fujitsu UK & Ireland, 2012)
Rise of the Data Scientist
Just as the oil industry requires people from diverse disciplines to work
together to maximise returns, so does Big Data. For a Big Data venture
to succeed, a business needs three key things:
1. IT capability to capture, store, process and publish the data
2. Business knowledge to define and articulate the desired business
outcome, sponsor the initiative, provide business insight and ensure
resources are effectively deployed
3. Data scientists.
Data is the new oil: it needs discovery, extraction and refining to realise its value.
The Future of Big Data
Data evolution
It is also certain that the amount of data stored will continue to grow at an
astounding rate. This inevitably means Big Data applications and their
underlying infrastructure will need to keep pace.
Increasingly, tools will emerge that put the power of Big Data analysis into everyone's hands: consumers, business, government.
Looking out two to three years, it is clear that data standards will mature, driving
up accessibility. Work on the semantic web (a collaborative project to define
common data formats) is likely to accelerate alongside the growth in demand
among organisations and individuals to be able to access disparate sources of
data. More governments will initiate open data projects, further boosting the
variety and value of available data sources.
Linked Data databases will become more popular and could potentially push
traditional relational databases to one side due to their increased speed and
flexibility. This means businesses will be able to develop and evolve applications
at a much faster rate.
Data security will always be a concern, and in future data will be protected at a
much more granular level than it is today. For example, whereas users may
currently be denied access to an entire database because it contains a small
number of sensitive records, in future access to particular records (or records
conforming to particular criteria) could be blocked for particular users.
As data increasingly becomes viewed as a valuable commodity, it will be freely
traded, manipulated, added to and re-sold. This will fuel the growth of data
marketplaces: websites where sellers can offer their data wares simply and
effectively, and buyers will be able to review and select from a comprehensive
range of sources.
49% of senior managers believe that, by 2015, Big Data will have fundamentally changed how businesses operate.
(Survey of 200 senior managers by Coleman Parkes Research for Fujitsu UK & Ireland, 2012)
The Final Word on Big Data
The vision is about linking people, products and services with information, allowing individuals and organisations to share and collaborate on a planetary scale.
Fujitsu's current Big Data platform offerings have been built on the back of a
long-standing R&D commitment in the critical area of data management, and
feature a broad array of products and services.
[Diagram: elements of the Fujitsu Big Data platform, including communications, central services, batch processing, navigation, data analysis services and data curation.]
Data curation
To generate new sources of value, Fujitsu can offer techniques it developed for
using data to make its operations more efficient or develop new businesses. This
is done using 'curators': specialised analytical tools that feature mathematical
and statistical data modelling, multivariate analysis and machine learning.
Big Data solutions in action
One example of Big Data in use today is TC Cloud, a Fujitsu service for the
Japanese market, which supports non-linear structural analysis, electromagnetic
wave analysis and computational chemistry. Another is the company's Akisai
Cloud, for the food and agricultural industry, which (for example) analyses
environmental and other data to ensure crop yields are maximised.
Fujitsu offers a broad range of Big Data products, and the diagram below summarises
how these map onto the model of a Big Data solution outlined in Chapter 1 (page 11).
How Fujitsu Products Deliver a Big Data Solution
[Diagram: the Chapter 1 solution model (page 11) with Fujitsu products overlaid. Spatiowl and Interstage sit alongside the Data Integration, Data Transformation, Complex Event Processing, Streaming, Semantic Analysis, Historical Analysis, Search and Visualisation functions, fed by Sensors, Social Media, Open Data, Structured Data and Unstructured Data, and serving Business Decision-makers, Data Consumers, Data Scientists, Application Developers, Business Partners and Consuming Systems, all on the Platform Infrastructure.]
At its foundation is the Fujitsu Global Cloud Data Platform, providing all the
required data integration, distribution, real-time analytics and manipulation
capabilities. This can be coupled with Interstage, Fujitsu's integration and
business process automation engine. Additionally, Fujitsu offers its Key Value
Store for high-volume data storage and retrieval, supplemented by
the BigGraph Linked Data database.
Alongside that, Spatiowl is one of Fujitsu's business-focused Big Data
solutions, which has been applied to resolving location data service problems.
Big Data Speak: Key terms explained
Access control
A way to control who and/or what may access a given resource, either physical (e.g. a
server) or logical (e.g. a record in a database).
Architectural pattern
A design model documenting how a solution to a design problem can be achieved and
repeated.
Availability
Big Data
(1) The application of new analytical techniques to large, diverse and unstructured data
sources to improve business performance (2) Data sets that grow so large that they
become awkward to work with using traditional database management tools (3) Data
typically containing many small records travelling quickly (4) Data characterised by its
high volume, velocity, variety (or variability) and ultimately its value.
Business intelligence
A term used to describe systems that analyse business data for the purpose of making
informed decisions.
Cloud architecture
The architecture of the systems involved in the delivery of cloud computing. This
typically involves multiple cloud components communicating with one another over a
loosely-coupled mechanism (i.e. one where each component has little or no knowledge
of the others).
Cloud provider
A service provider that makes a cloud computing environment such as a public cloud
available to others.
The organisation purchasing cloud services for consumption either by its customers or
its own IT users.
The different levels at which cloud services are provided. Commonly: Infrastructure-as-a-Service (IaaS); Platform-as-a-Service (PaaS); Software-as-a-Service (SaaS); Data-as-a-Service (DaaS); and Business Process-as-a-Service (BPaaS).
Complex event processing (CEP)
High-speed processing of many events across all the layers of an organisation. CEP can
identify the most meaningful events, analyse their impact and take action in real time.
Context-sensitive
Data integration
Data integrity
In the context of data security, integrity means that data cannot be modified without
detection.
Data residency
The location of data in terms of both the legal location (the country in which any related
governance can be enforced) and the physical location (the systems on which it is
stored).
Data scientist
Data storage
The processes and tools relating to safe and accurate maintenance of data in a
computer system.
Data transformation
The processes and tools required to transform data from one format to another.
Esper
A complex event processing engine available for the Java (Esper) and Microsoft .NET
(NEsper) programming frameworks.
Hadoop
Historical analysis
The processes and tools used to analyse data from the past either the immediate past
or over an extended period.
Interoperability
Key value store
A means of storing data without being concerned about its structure. Key value stores
are easy to build and scale well. Examples include MongoDB, Amazon Dynamo and
Windows Azure Table Storage. These can also be thought of as NoSQL online
transaction processing.
Linked Data
Map/Reduce
Non-repudiation
A service that provides proof of the integrity and origin of data together with
authentication that can be asserted (with a high level of assurance) to be genuine.
NoSQL
An alternative approach to data storage, used for unstructured and semi-structured data.
Open data
Data which is made freely available by one organisation for use by others, generally
with a licence attached.
Personal data
Data that, by its nature, is covered under privacy and data protection legislation. This
applies to information about both employees and consumers.
Real-time
Search
The processes and tools that allow data to be located based on given criteria.
Semantic analysis
The processes and tools used to discover meaning (semantics) inside (particularly
unstructured) data.
Semantic web
A collaborative movement led by the World Wide Web Consortium (W3C) that promotes
common formats for data on the web.
Service level agreement (SLA)
Part of a service contract where the level of service is formally defined to provide a
common understanding of services, priorities, responsibilities and guarantees.
Shadow IT
A term often used to describe IT systems and IT solutions built and/or used inside
organisations without formal approval from the IT department.
SOLR
An Apache open-source enterprise search platform which powers the search and
navigation features of many of the world's largest Internet sites.
Structured data
Tokenisation
The process of replacing a piece of sensitive data with a value that is not considered
sensitive in the context of the environment in which it resides (e.g. replacing a name
and address with a reference code representing the actual data, which is held in
another database hosted elsewhere).
Unstructured data
Data with no set format, or with a loose format (e.g. social media updates, log
files, etc).
Uptime
A measure of the time a computer system has been available (working as intended).
Not to be confused with overall system availability, which will depend on a number of
measures, including the uptime of individual components. (See also Availability.)
Visualisation
The processes and tools for presenting the results of data analysis in a manner that
enables better decisions to be made.
Also in this series
The White Book of...
Cloud Adoption
Cloud Security