Success With Big Data Analytics: Competencies and Capabilities For The Journey
Abstract
Big data analytics creates sizable value for companies in all industries. Value is created
through a better customer focus, from increased operational excellence, or from entirely new
businesses. However, many companies underestimate the cost, complexity, and
competencies required to get to that point, and many fail along the journey. Smart companies reduce
the cost, complexity, and competency gap by relying on NetApp and NetApp partners for their
big data infrastructure. Because enterprise storage building blocks and associated service
capabilities provide maturity and cost-effectiveness at the data management level, across all
big data uses and technologies, companies are freed to focus on developing business-facing
capabilities, which requires mastering competencies at the application and data science
levels. Big data analytics technologies discussed in this white paper include Splunk, NoSQL
databases, Hadoop, Solr, and Spark.
TABLE OF CONTENTS
1 Situation
2 Challenges
4 Conclusion
Glossary
LIST OF FIGURES
Figure 1) Competencies required for success with big data analytics.
2 Success with Big Data Analytics 2016 NetApp, Inc. All rights reserved.
1 Situation
1.1 The Promise of Big Data Analytics
Big data analytics is the process of examining datasets that are characterized by a greater volume,
velocity, or variety of data types than those found in traditional business intelligence and data warehouse
environments, with the purpose of uncovering hidden patterns, unknown correlations, market trends,
customer preferences, and other useful business information. These analytical findings can lead to more
effective marketing, new revenue opportunities, better customer service, improved operational efficiency,
competitive advantages over rival organizations, and other business benefits.
Companies across all industries increasingly view data as a critical production factor similar to talent and
capital. They realize that capturing and blending more data sources than ever before across many
different domains create economic value. For instance, financial services companies collect, price, and
disburse capital across their various lines of business, from granting credit to providing insurance to
making capital markets work. The volatility and disruption the industry has experienced over the last few
years have spurred banks and insurers to unlock the value inherent in the data their businesses generate.
They are looking for real-time, actionable insights that help them better understand customers, price risks,
and spot fraud. This means gathering, analyzing, and storing millions of transactions, interactions,
observations, and third-party data points per minute. Existing systems such as relational databases or
enterprise data warehouses are high performing but often not suited for the volume, velocity, and variety
of data. It is no wonder that financial services companies have been among the early adopters of big
data, embracing technologies and solutions such as NoSQL databases, Hadoop, Spark, and Splunk. (For
a definition of these technologies, see the glossary at the end of this white paper.) They are leveraging
the power of various big data technologies to transform the customer experience and improve profitability.
2 Challenges
Anecdotal evidence suggests that 50% of enterprises that embrace big data struggle to create business
value, evidenced by many abandoned, stranded Hadoop efforts. Only 10% become truly successful,
having developed and mastered the many competencies required after a long, arduous journey.
software companies. Although NoSQL and Hadoop deliver compelling capabilities, many enterprises
underestimate the complexity of getting these technologies to work smoothly in business-critical
environments. The consolidation that has happened in more mature areas of technology (for
example, IBM, Oracle, Microsoft, and SAP) has yet to happen in the big data space, which is
characterized by more than a dozen NoSQL databases, several Hadoop distributions, and rapid
advancement in newer Apache projects such as Spark and Storm.
New technologies and architectures also call for new skills. There is a lot to learn across the
entire lifecycle of big data initiatives and technologies. Many enterprises lack a strong digital leader
who is able to align business needs with technology capabilities. The architectures, technologies, and
vendors selected need to align with those evolving business needs. New skills need to be developed,
often through external hires, across roles ranging from architecture to infrastructure
engineering and operations to data science to application development. Moreover, when analyzing
data with high volume, velocity, and variety, great skill is required to assess the veracity (that is,
quality and trustworthiness) of this data.
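Figure 1) Competencies required for success with big data analytics. (Figure not reproduced; surviving labels: Security, Partners, Program Mgt.)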
The following sections cover the best practices that NetApp has developed and that ensure successful
business outcomes for our customers.
3.1 Strategy
To succeed at this stage, NetApp customers address the following areas:
Business strategy and alignment
Use case definition and prioritization
1 http://solutionconnection.netapp.com/solution-listing.aspx.
purpose and discarded too soon. Therefore, management has an incomplete picture of what is going on.
There is no single point of truth.
Smart enterprises adopt more of a layer approach to technology, driven by an enterprise-wide unifying
target architecture wherever possible. The main architectural principle of moving data once but sharing
and processing it many times mandates shared storage building blocks. The physical instantiation of
these building blocks can vary, serving data that is hot, warm, cold, or frozen, on the premises or in the
cloud. What matters is an integrated approach to managing, operating, and securing these building
blocks.
Such an approach brings many benefits:
Modular, modern architecture that supports the broadest range of applications and analyses
Freedom to choose the best tool or processing engine for the job
All data, across all time periods, joined and correlated across domains
Shared data that is consumable for different use cases, often building on each other
Multiple lenses on the same data, with team-specific views
Ability to serve new use cases quickly and affordably
Fast learning curve, making it easier to train, retain, and develop staff
Big idea: NetApp offers an enterprise architecture with validated storage building blocks stretching
across new deployments as well as in-place analytics on existing data, which guarantee lower total cost
of ownership (TCO) and risk than commodity servers with internal drives. The NetApp approach brings:
A mature solution architecture that includes validated designs, technical reports, and complete
runbooks, which shortens time to value, increases deployment stability, and reduces consulting
expenses
Ability to handle both unstructured and structured data with the portfolio of products
Reduced operational complexity, including speed of provisioning capacity and users
Consistent enforcement of data security, privacy, governance, and compliance
A dramatic reduction in the power, space, and skills required
Accelerated testing and development of big data solutions by ensuring seamless data
movement between on-premises and public cloud environments
rebuild the lost data. Performance is only negligibly affected, and for an order of magnitude less time
than with internal storage. For instance, in a recent test, a single drive failure and rebuild process in
one of the internal drives in a Couchbase cluster server had a significant impact on the cluster's
capability to process requests from clients. The operations-per-second rate dropped by over 90%.
However, with the NetApp EF560 and DDP, the impact was limited, and approximately 15 minutes
after the initial disk failure, normal service was restored.2
File system storage and compute are decoupled and can scale independently subject to workload
requirements. This also eliminates the need for rebalancing or migration when new data nodes are
added, thereby making the data lifecycle nondisruptive.
NetApp storage also increases performance for many big data workloads. For instance, in a recent
benchmark, Splunk on NetApp achieved search performance that was 69% better than Splunk on
equivalent commodity servers with internal disks.3 NetApp provided optimized performance and
capacity buckets for Splunk's hot, warm, cold, and frozen data tiers. Moreover, because data is
externally protected, additional performance and efficiency gains can be realized by reducing the
amount of data replication, lightening the load on compute and network resources and reducing the
amount of storage required just for data protection.
The ability to do in-place analytics on existing NAS data using NetApp technologies can save the
infrastructure cost and time of setting up a duplicate storage silo for analytics and shorten the time
to insight. It also eliminates data movement.
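The capacity effect of reducing data replication, mentioned above, can be illustrated with simple arithmetic. The following is a minimal sketch, not a sizing tool: the function names and the 100 TB dataset are hypothetical, the factors 3 and 2 are chosen only for illustration (3 is the common HDFS default), and the calculation ignores any protection overhead on the external storage itself.

```python
def raw_capacity_tb(usable_tb: float, replication_factor: int) -> float:
    """Raw capacity consumed when every block is stored `replication_factor` times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return usable_tb * replication_factor


def capacity_saved_fraction(old_factor: int, new_factor: int) -> float:
    """Fraction of raw capacity freed by lowering the replication factor."""
    if new_factor < 1 or old_factor < new_factor:
        raise ValueError("expected old_factor >= new_factor >= 1")
    return 1 - new_factor / old_factor


if __name__ == "__main__":
    dataset_tb = 100  # hypothetical dataset size, for illustration only
    for factor in (3, 2):
        raw = raw_capacity_tb(dataset_tb, factor)
        print(f"replication factor {factor}: {raw:.0f} TB raw")
    saved = capacity_saved_fraction(3, 2)
    print(f"raw capacity saved: {saved:.1%}")
```

Under these assumptions, moving from three copies to two frees one third of the raw capacity. Real savings depend on the workload and on how the external storage protects the data.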
2 Detailed report available on www.netapp.com; search our webpage for report number TR-4462.
3 Performance data taken from NetApp E-Series for Splunk Enterprise 2015 Function1.
platform: its source, its format, its lineage, and its quality. Data visibility and understanding are unified
across traditional business intelligence and true big data environments such as Hadoop. Defining and
capturing metadata allow ease of searching and browsing. Proper metadata management is the
foundation of data quality. As data lakes grow in depth and importance to the business, the quality of the
metadata becomes essential to make sure that the data poured into these lakes can be found, used, and
exploited for years to come.
Big idea: NetApp partner solutions can assist. Zaloni's Bedrock provides unified data management
across the entire data pipeline from ingestion through to self-service data preparation. It ensures file- and
record-level watermarking so you can see data lineage, movement, and usage. This ensures that
consumers can search, browse, and find the data they need, reducing the time to insight for new analytics
projects.
3.3.3 Collaboration
Collaboration at its core is about coordination between data owners, data professionals (for example,
administrators, developers, and data scientists), and data consumers. The more business critical the use
cases, the more important that collaboration becomes. Smart organizations have found ways to address
misaligned funding and incentives. For instance, department A is only able to tap into the wealth of data
in the shared data lake if department A also makes its own data available to other departments, avoiding
the free-rider problem. Moreover, a code of conduct might state that department A needs to provide
advance notice regarding changes in the availability, quality, or format of its data, because department B
may use A's data source for powering real-time recommendations at the point of sale. Some large
companies have created homegrown internal social networks that break down these communication and
incentive barriers.
Big idea: NetApp partner solutions such as Zaloni's Bedrock embody best practices to foster
collaboration and coordination. Specifically, Bedrock provides workflow and enrichment. Workflow covers
tasks such as masking, lineage, data format conversion, change data capture, and notifications.
Enrichment allows data professionals to orchestrate and manage the data preparation process.
3.4 Operations
To succeed at this stage, NetApp customers address the following areas:
Manageability
Efficiency and performance
3.4.1 Manageability
Administering a large Hadoop cluster can be more complicated than many realize. There is complexity
associated with manually recovering from drive failures in a Hadoop cluster with internal drives.
Big idea: NetApp provides the SANtricity Storage Manager, which is often described as
the easiest-to-use and most intuitive interface in the industry. It features a combination of wizards and
automation for common processes along with detailed dials for storage experts. It provides a centralized
management GUI for monitoring and managing drive failures. The SANtricity operating system is also
performance optimized, yet still offers a complete set of data management features such as Snapshot
copies and mirroring. This makes it easy to meet service-level agreements with predictable performance.
These are complemented by OnCommand Insight for health checks and the widely acclaimed NetApp
AutoSupport.
Big idea: NetApp contributes to security by providing hardware-accelerated encryption. The benefit is a
performance impact of less than 1 percent, compared to several percentage points with competing
solutions.
3.7 Partners
In the rapidly evolving big data space, no single company can provide everything, and NetApp is no
exception.
Big idea: What NetApp does provide is a comprehensive partner ecosystem across Hadoop, NoSQL,
and analytic applications such as Splunk that collectively solves the big data analytics needs of the most
demanding enterprises. The NetApp partner ecosystem is available on Solution Connection:
http://solutionconnection.netapp.com/solution-listing.aspx.
4 Conclusion
In summary, NetApp helps you achieve success with big data analytics initiatives, whether your
role is on the business side as a business owner and consumer of big data insights or on the technical
side as a developer or data professional.
NetApp and partners help you create maximum business value with short time to market because
NetApp's portfolio of solutions provides better and more consistent performance and is tested with
Hadoop distributions, NoSQL databases, and applications such as Splunk and Spark. Additionally, there
is a TCO advantage stemming from better performance and scalability, efficiency (storage, power, and
licenses), and improved recoverability. Overall, NetApp provides a better balance of performance (with
less hardware than competing solutions), capacity, and cost. Customers particularly value the
independent scaling of storage and compute, performance tiering, and space efficiency (single source of
data, no resync, and no copy).
Specifically, our E-Series provides the following benefits:
Realize better performance than internal drives during data rebuilds.
Increase search performance by 69% versus commodity servers with internal disks.4
Save on storage capacity by reducing the replication factor. This reduces storage capacity requirements
and maintains availability with fewer copies.
Scale compute and storage independently to better match application workload.
Enjoy single-interface management across the storage environment.
Maximize uptime of the cluster through superior availability.
Improve reliability with enterprise storage building blocks.
Encrypt your data with minimal performance impact (less than 1 percent).
Optimize performance and capacity for hot, warm, cold, and frozen data.
Rest assured with world-class NetApp AutoSupport.
4 Source: Function1 report.
Glossary
Hadoop: Open-source software that provides the enterprise-wide data lake:
Allows acquiring all data in its original format and storing it in one place, cost effectively and for long
time periods.
Allows different processing engines and schema on read.
Provides mature multitenancy, operations, security, and integration.
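Schema on read, mentioned in the list above, means that data is stored in its original format and a structure is applied only when a consumer reads it. The following is a minimal sketch of the idea in plain Python, not Hadoop code: the sample records, the field names, and the `read_with_schema` helper are all hypothetical and exist only for illustration.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Raw events are kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "a17", "amount": "19.99", "ts": "2016-06-01T10:00:00"}',
    '{"user": "b42", "amount": "5.00", "ts": "2016-06-01T10:05:00", "channel": "web"}',
]

# A schema maps each field a consumer cares about to a conversion function.
Schema = Dict[str, Callable[[str], object]]


def read_with_schema(lines: Iterable[str], schema: Schema) -> Iterator[dict]:
    """Parse each raw line and project it onto `schema` at read time.

    Fields missing from a record are returned as None; fields not named
    in the schema are ignored but remain in the stored raw data.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two consumers apply different schemas to the same stored data.
billing_schema: Schema = {"user": str, "amount": float}
channel_schema: Schema = {"user": str, "channel": str}

if __name__ == "__main__":
    for row in read_with_schema(RAW_EVENTS, billing_schema):
        print(row)
    for row in read_with_schema(RAW_EVENTS, channel_schema):
        print(row)
```

Because nothing is converted or discarded at write time, a new consumer can define a new schema later without reloading the data.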
NoSQL: Nonrelational databases popular for big data and real-time web applications:
Data models (for example, key-value, graph, or document) seen as more flexible than relational
database tables
Popular for high-availability, low-latency use cases
Simplicity of scaling out horizontally using clusters of machines versus scaling up for relational
databases
Popular open-source NoSQL databases include MongoDB, Apache Cassandra, Solr, and HBase
Spark: Open-source software that provides a modern development environment and power user
analytical environment for big data:
In-memory high-speed analytics engine
Advanced machine learning libraries
Unified programming model across all processing engines
Splunk: Software solution for searching, monitoring, and analyzing machine-generated data using a web
interface:
Captures, indexes, and correlates real-time data in a searchable repository from which it can
generate graphs, reports, alerts, dashboards, and visualizations
Horizontal technology, based on a proprietary NoSQL database, traditionally used for IT service
management, security, compliance, and web analytics
Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact
product and feature versions described in this document are supported for your specific environment.
The NetApp IMT defines the product components and versions that can be used to construct
configurations that are supported by NetApp. Specific results depend on each customer's installation in
accordance with published specifications.
Copyright Information
Copyright © 1994-2016 NetApp, Inc. All rights reserved. Printed in the U.S. No part of this document
covered by copyright may be reproduced in any form or by any means (graphic, electronic, or
mechanical, including photocopying, recording, taping, or storage in an electronic retrieval system)
without prior written permission of the copyright owner.
Software derived from copyrighted NetApp material is subject to the following license and disclaimer:
THIS SOFTWARE IS PROVIDED BY NETAPP "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY
DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.
NetApp reserves the right to change any products described herein at any time, and without notice.
NetApp assumes no responsibility or liability arising from the use of products described herein, except as
expressly agreed to in writing by NetApp. The use or purchase of this product does not convey a license
under any patent rights, trademark rights, or any other intellectual property rights of NetApp.
The product described in this manual may be protected by one or more U.S. patents, foreign patents, or
pending applications.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to
restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software
clause at DFARS 252.277-7103 (October 1988) and FAR 52-227-19 (June 1987).
Trademark Information
NetApp, the NetApp logo, Go Further, Faster, AltaVault, ASUP, AutoSupport, Campaign Express, Cloud
ONTAP, Clustered Data ONTAP, Customer Fitness, Data ONTAP, DataMotion, Flash Accel, Flash
Cache, Flash Pool, FlashRay, FlexArray, FlexCache, FlexClone, FlexPod, FlexScale, FlexShare,
FlexVol, FPolicy, GetSuccessful, LockVault, Manage ONTAP, Mars, MetroCluster, MultiStore, NetApp
Fitness, NetApp Insight, OnCommand, ONTAP, ONTAPI, RAID DP, RAID-TEC, SANshare, SANtricity,
SecureShare, Simplicity, Simulate ONTAP, SnapCenter, SnapCopy, Snap Creator, SnapDrive,
SnapIntegrator, SnapLock, SnapManager, SnapMirror, SnapMover, SnapProtect, SnapRestore,
Snapshot, SnapValidator, SnapVault, SolidFire, StorageGRID, Tech OnTap, Unbound Cloud, WAFL,
and other names are trademarks or registered trademarks of NetApp Inc., in the United States and/or
other countries. All other brands or products are trademarks or registered trademarks of their respective
holders and should be treated as such. A current list of NetApp trademarks is available on the web at
http://www.netapp.com/us/legal/netapptmlist.aspx. WP-7233-0616