Big Data Essentials
Many of the big data tools are open-source and Linux-based. Explore the
fundamentals of big data, including its place in a historical IT context, the available
tools for working with big data, and the big data stack.
Table of Contents
1. Introducing Big Data
2. The Biggest Wave Yet
3. Emerging Technologies
4. Defining Big Data
5. Key Terms for Data
6. Sizing Big Data
7. The Original Key Contributors
8. The Distro Companies
9. Apache Software Foundation
10. Apache Projects
11. Other Apache Projects
12. Other Open Source Projects
13. The Big Data Stack
14. Big Data Components
15. NoSQL Databases
Big Data will create new careers and skill sets in IT. It will necessitate numerous
projects and changes in data centers, bring about significant changes to analytics,
prove highly disruptive across many areas of business, and enable greater data-level
connectivity in global culture.
Now let's take a look at what we really mean by Big Data, and here's the real
answer: Big Data is supercomputing. The primary technology of Big Data has
been in use for a few decades. We call it supercomputing, and initially it was
reserved for very select uses, such as scientific forecasting of weather. The use of
clustering, parallel processing, and distributed file systems is well understood, but it was
out of reach for commercial use. Today this supercomputing technology is
available to all of us, and it is what is powering Big Data in the 21st century. Today a
Big Data cluster might have 20 racks, 400 nodes, and 4,800 physical disks, and will easily
manage 3,200 terabytes of data. The future may well bring
clusters with millions of nodes managing zettabytes of data. We are entering a new
era of hypercomputing. The other really important thing about Big Data is that it is
affordable supercomputing. Previous supercomputing technology was expensive,
extremely expensive, and out of reach for most commercial use; now it is not.
Heading: Supercomputing for All of Us.
Big Data became possible when the capability to parallel process computing jobs across
large amounts of distributed data became broadly available.
What makes big data commercially viable is that it is affordable at scale.
A Big Data supercomputing platform can be built for a fraction, and I do mean a
fraction, of the cost of earlier supercomputing. And this is what's driving the explosive
growth of Big Data. In summary, Big Data is supercomputing for all of us. Now the
term Big Data has a number of different meanings, and I think it's very important to
understand the context in which the term is being used. I group them
into one of four categories – Technology, Analytics, Business, and Marketplace. In
regards to Technology, Big Data is a collection of a wide range of technologies,
almost all of it open source, used in a distributed computing environment to create
low-cost but high-yield supercomputing platforms. And when the term Big Data is
used in regards to Analytics, we are referring to a dramatic change in the practice of
data analytics; the hunt for value in the data.
Heading: Big Data in Context.
The different meanings associated with Big Data include reference to open source
supercomputing technology, an advance in data analytics for fusing data into
knowledge, a major disruption to data-dependent industries in the business world, and a
high-value, explosive opportunity in the marketplace.
The core concept that's driving this change is that the sample size is now the entire
data set, and that the answer can be found in minutes or hours, not weeks or months.
This is leading to a significant change in how we map, analyze, and find new
knowledge from data. And then when the term Big Data is used in regards to business,
we are really talking about a very significant business disrupter – certainly, I believe,
larger than the Internet – one that is causing major changes in business, government, and
research. And finally, perhaps the most exciting, is when people use the term Big Data in
regards to the marketplace. It's because Big Data is this generation's high-value,
explosive-growth technology, with estimated multiple billions of dollars of associated
revenue.
A timeline displays the progress of computing over the decades, beginning with 1940
and ending with 2020. The 1950s show the transistor, the 1960s mainframes, the
1970s minicomputers, the 1980s the PC, the 1990s the World Wide Web, the 2000s
smartphones, the 2010s social media, and the 2020s Big Data.
In 1982, Time Magazine named the PC the "Machine of the Year." It had that big of
an impact on our entire culture. Think – within just a few short years of its
introduction, every office worker had a PC on their desk, and many of us purchased them
for home and school use. Entire new industries in entertainment, education, and
business were created as a result of the introduction of the PC in the 1980s. And
then came the really big bang for technology, the introduction of the Internet, which
provided access to information on a global and real-time scale. Today there are over
two billion people on the Internet; over one-third of the global population. And our
economy, our commerce, our culture, our politics, and even our perception of each
other have been changed dramatically and permanently. And then we have the
introduction of smartphones, which put a computer in your pocket. Think
about it. Today there are over five billion mobile phones in use across the planet, and
this connects us faster and wider than we've ever been connected in all of history. And
then came the introduction of Big Data, and the first real introduction of Big Data was
social media. Social media is the online social interaction among people to create and
share information.
One of the most popular being photos. But empowering the social media are the Big
Data recommendation engines. Those engines analyze and suggest connections to
other members, they make suggestions on goods to purchase or places to travel, and they
find topics that pique our interest. Social media is now one of the largest segments in
technology and still growing, and the end result of all this is a larger and more
significant impact on everyone on the planet. Now there are many of us involved in
Big Data who strongly believe this will be the biggest technology wave yet. We see
this technology wave completely and radically changing all the computing
structures of the past 70 years.
Heading: The Biggest Wave Yet.
The scope of big data computing and the possibilities it presents mean that it will be
a significant wave of change in the Information Technology sector.
Tracking data from our mobile phones is being poured into Big Data solutions by the
phone company, the government, market research companies, and other businesses.
Each organization is finding value in analyzing this data, and each of these
organizations will interact with us differently because of this data. Cell phone tracking
data, in and of itself, is becoming a multi-billion dollar industry. These changes will
dwarf, and I do mean dwarf, the impact of the Internet. So as you learn about Big
Data, my goal is to convince you to agree with this statement: the changes are going
to be nearly inconceivable. In the next few lectures I will present facts, comments,
and ideas about why I believe these bold statements are true.
Emerging Technologies
There are several emerging technologies all coming together at the same time, which
are the foundation of Big Data and the changes to global computing. Let's look at them
individually and then look at the result. These technologies are flash memory, mobile
devices, cloud computing, distributed computing, advanced analytics, and in-memory
applications. The use of flash memory in solid-state drives allows computers to
become universal. Additionally, flash memory delivers random-access speeds of less
than 0.1 milliseconds. Now compare this to disk access of 3 to 12 milliseconds. I
believe future Big Data solutions will use a lot of flash memory to improve access
time to data. Mobile devices represent computers everywhere. Mobile devices serve
the dual role of being the creator of much of our Big Data inputs and the receiver of
the outputs from Big Data solutions. And we all know the impact of cloud
computing; it has changed storage, databases, and services, and many of
these have moved onto the cloud, creating an entirely new economy of computing.
Cloud computing offers great access for rapidly deploying Big Data solutions. And
then the heart of Big Data is the technology for distributed computing.
Heading: Technologies Driving Big Data.
The technologies that are driving big data are flash memory, mobile devices, cloud
computing, distributed computing, advanced analytics, and in-memory applications.
A medical field example shows how data is transported through different channels. In
the example, the presenter explains how data moves from an end user's mobile device,
through big data repositories, alert and dispatch systems, incident management, and
medical care management, until it finds its way into medical research.
The device continuously records a number of health data points related to the heart.
This data is then streamed through your mobile device into a Big Data repository.
Here the inputs are continuously monitored and pattern recognition is applied to
determine if there's an oncoming heart event of concern. And when the system
matches an alert, obviously a warning will be immediately sent out. Now think
about this. If this were to be an incident, the information could be shared with the
ambulance service and the hospital, perhaps even before the ambulance arrives on
scene. And at the hospital, RFID tags on medical instruments and
medications can be used to create the patient's medical chart. Back at the Big
Data repository, all of this data can then be mapped to other data, such as
environmental factors or dietary decisions, and then analyzed again to build a
comprehensive understanding of the factors affecting heart health.
From this, new insurance risks could be assessed, new health advisories could be
developed, or new targeted drugs could be developed. Now this is the Global Data Fabric in
action, and this is the way you have to understand Big Data. It is the centerpiece of
the entire ecosystem of modern computing.
There are other organizations that have provided definitions of Big Data, and one of
the most popular is Gartner's vector model. Gartner is an American information
technology research and advisory company. Gartner recognized early on the
emergence and the importance of Big Data. Now according to Gartner, the technical
term "Big Data" is used to describe data overload. They further define three vectors to
describe Big Data – increasing variety, increasing velocity, and increasing volume.
Increasing variety implies a wide variance in the types of data. Increasing velocity
refers to the rate at which data is created and the speed at which it is
delivered. Increasing volume means data in volumes greater than terabytes, all the
way up to zettabytes. Now I strongly encourage you to memorize both the statement
and the Gartner vector model.
Heading: Gartner's Vector Model.
A diagram of Gartner's vector model is displayed. It comprises a circle with Big Data
at its center. The circle is divided into three segments: Variety, Velocity, and Volume.
The Variety segment shows an arrow leading from Structured to Structured &
unstructured. The Velocity segment shows an arrow leading from Batch to Streaming
data. The Volume segment shows an arrow leading from Terabytes to Zettabytes.
And now my favorite term, the supercomputing platform. This is all the hardware and
software components used to build a supercomputing cluster. This is the most
common term I use when talking about Big Data, and by it I mean the actual
cluster used to produce the Big Data solution. The Hadoop cluster is the term I
use when I want to talk specifically about Hadoop. Now let's jump over and talk about
the different categories of data. Data really falls into two broad groupings: there is
structured data and there is unstructured data. Structured data is tightly predefined in
its type and format, while unstructured data is everything else. Structured data, I think
of as being tightly predefined prior to creation, and we primarily associate it with
SQL and RDBMSs. Nearly all transactional data is structured. And by this we
mean a programmer must have created a schema ahead of time. The schema
defines the tables, consisting of column definitions and rows of records. A schema
tightly defines the data that must be entered into a record.
Heading: Categorizing Types of Data.
It's not unusual for a programmer to further define the data as being unique, or not
unique, or allowed to be unknown. It could be defined as a particular data type or even
constrained to a particular string format – think of a phone number or a ZIP code. Now
let's compare this to unstructured data. Well over, easily well over, 95% of the world's
digital data is unstructured. It consists of things such as text files, logs, e-mail, books,
photos, and audio. Now we all know there is tremendous value in dealing with
unstructured data. Obviously there is value in the document content itself, but there is
also value in discovering patterns, even in something as simple as counting the nulls or
determining the percentage of dirty data. There's a lot of data that is structured to a
point. Now this is an important comment. I personally like to refer to it as semi-
structured or loosely structured data. By this I mean any data element that comes with
metadata is an example of semi-structured data. Think about an MPEG file. How much
structured data, how much metadata, how much direction on the content comes in that
bit string? A lot.
Heading: Unstructured Data.
When unstructured data has metadata to describe it, it is better called semi-structured
data. Examples include the stream data that sequences video or audio files and the
header information in e-mail messages.
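To make the contrast concrete, here is a minimal sketch in Python, using the built-in
sqlite3 module as a stand-in for an RDBMS; the table name, columns, and sample
values are purely illustrative and not taken from any particular system.

import sqlite3

# Structured data: the schema is tightly defined before a single record is entered.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        id    INTEGER PRIMARY KEY,   -- must be unique
        name  TEXT NOT NULL,         -- must be present
        phone TEXT,                  -- expected to follow a known format
        zip   TEXT                   -- think of a ZIP code constraint
    )
""")
conn.execute("INSERT INTO contacts (name, phone, zip) VALUES (?, ?, ?)",
             ("Ada Lovelace", "555-0100", "98101"))

# Semi-structured data: the payload itself is free-form, but metadata travels with
# it, much like the header of an e-mail message or the stream data in a media file.
email_message = {
    "metadata": {"content_type": "text/plain", "received": "2014-06-01T09:30:00Z"},
    "body": "Free-form text that no schema constrains.",
}

The point is simply that the structured record cannot exist without its schema, while
the semi-structured record carries only as much description as its metadata provides.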
Now let me introduce key-value pairs. Key-value pairs are a very simple model for
representing data, and most unstructured data can be expressed and managed as key-
value pairs. They have very broad applicability in Big Data. The model really is very
simple: a user provides a key and all of the values related to the key are returned. A
key may or may not be unique, and there are great advantages to programmers in key-
value pairs. The primary one is that it's an open-ended data structure. This means that
the data structure can be extended without modification to existing code, to existing
data, and most importantly to an existing schema. Let me give you a few examples. An
address book relates a name – the key – to contact information – the value. A bank
account uses the account number – the key – to associate with the account details –
the value. A movie database relates the title of a movie – the key – to the video
stream you can watch – the value.
Heading: Examples of Key-Value Pairs.
Some examples of key-value pairs include an address book, a bank account,
and a movie database. In an address book, the key of a last name is entered and the
program returns an address or phone number. With a bank account, an account
number is entered and the program returns a list of statements. Similarly, in a movie
database, the key of a movie title is entered and a movie which can be streamed is
returned.
So let me now answer this question: why are key-value pairs so important to Big
Data? I would say that Big Data was purposely intended to take on unstructured data,
and that key-value pairs are the most critical strategy for managing that
unstructured data. Additionally, key-value pairs are the cornerstone of MapReduce and
NoSQL databases; they were intentionally, purposely built for key-value pairs. And
because both MapReduce and NoSQL databases scale affordably, this is what gives
Big Data its power. In regards to programming, it's important to understand that a
key can be associated with any value. Key-value pairs provide a very fast access
method – simple, fast, and scalable. And most importantly, key-value pairs have a
schema that is loose, flexible, and can be modified on the fly.
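As a minimal sketch of these ideas in Python – the names and values here are invented
purely for illustration – a plain dictionary already behaves like a key-value store: you
supply a key, and all of the values related to that key are returned. And because the
structure is open ended, it can be extended on the fly without touching existing code,
data, or a schema.

# Address book: the key is a name, the value is everything associated with it.
address_book = {
    "Lovelace": {"phone": "555-0100", "email": "ada@example.com"},
    "Hopper":   {"phone": "555-0199"},
}

# Key lookup: all values related to the key come back at once.
print(address_book["Lovelace"])

# Open-ended structure: attach a brand-new attribute to one record on the fly --
# no schema change, no migration, and existing records and code are untouched.
address_book["Hopper"]["twitter"] = "@grace"

# Records that never received the new field simply don't have it.
print(address_book["Lovelace"].get("twitter"))   # -> None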
A table containing three columns and ten rows is displayed. The columns are: Title,
Abbreviation, and Number of Bytes. Each row contains the title of the data size, its
abbreviation, and then the number of bytes.
To calculate how many golf balls make up a petabyte, you would divide 10 to the 15th
by 10 to the 9th. This gives you 10 to the 6th, or one million golf balls – a football
stadium of golf balls.
I can reach this number quickly by subtracting 9 from 15. Now 10 to the 6th is one
million golf balls; that's enough golf balls to fill a football stadium. In Big Data we
commonly deal in these football-stadium-size collections of data. Now let's look at
some real-world data and compare it to our football stadiums. I want to look at some
data that can commonly be found on the Internet. Walmart handles more than 1
million customer transactions every hour, and its customer database is over 2.5
petabytes of data. That's 2.5 football stadiums of data. The Library of Congress
receives over 235 TB of data every single year, and its total collection now houses
over 60 petabytes of electronic data – 60 football stadiums of data. Now let's think
about mobile phones. I said earlier that we have 5 billion phone users in the world
today. Each one of them can generate up to a terabyte of data records every single
year. This is 5 million football stadiums of data every single year.
Heading: Imagining Really Large Data Sets.
Walmart processes more than a million customer transactions hourly and stores 2.5
petabytes of customer data. The Library of Congress collects 235 terabytes of new
data per year and stores 60 petabytes of data. Over 5.5 billion mobile phones were
used in 2014; each phone creates one terabyte of call record data yearly.
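As a quick sanity check of that arithmetic in Python – the golf ball and the football
stadium are just the instructor's visual aids, with a gigabyte standing in for one golf
ball:

PETABYTE = 10 ** 15   # bytes
TERABYTE = 10 ** 12   # bytes
GIGABYTE = 10 ** 9    # bytes -- "one golf ball" in this analogy

# One petabyte is a million golf balls: one football stadium of data.
print(PETABYTE // GIGABYTE)              # 1000000

# Five billion phones, each generating up to a terabyte of records per year.
phones = 5 * 10 ** 9
print(phones * TERABYTE // PETABYTE)     # 5000000 "stadiums" of data per year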
Monetizing World Wide Web usage meant that the web needed to be indexed. This
presented problems such as how to crawl the web and collect content from every
website, how to complete the collection quickly enough, how each website would be
indexed and when it would be reindexed, how to organize and prioritize page
rankings, and how to match search queries to the most relevant pages.
How do I identify what has changed since the last reindexing? How do I even index
all these different websites and capture all that data? How do I organize and prioritize
the page rankings? And another common use case was how to match lookup queries
to the most relevant pages. Now many people, many universities, and many
companies all participated in creating the foundations of Big Data, each building on
the work of the others. But the following are considered to be the original key
contributors to solving the architectural challenge of distributed computing for
commercial use: Google and Yahoo. I would also particularly like to mention Doug
Cutting. Now we're going to look at each of these organizations' contributions. Let's
begin with Google. Google has led this technology and has led the creation of
supercomputing environments. Perhaps one of the most important things in their
leadership is that they made the decision to publish two academic papers describing
Google's Big Data technology. The first paper was the Google File System and the
second was the Google BigTable architecture. Now you can quickly find these papers
on Google itself. They are seminal to Big Data and I encourage you to download and
read both of them. Let's talk about Yahoo's contribution. They are a pioneer in
building low-cost supercomputing platforms.
Heading: The Original Key Contributors.
Google, Yahoo, and Doug Cutting solved the problem of commercial distributed
computing and parallel processing.
Google was committed to creating a new paradigm in computing and invested heavily
in the idea. In 2003/2004 Google published the Google File System (GFS) and
Google BigTable architecture white papers, which described Google's technologies.
Their engineers have made many significant contributions to shared-nothing
architecture. They have designed, architected, and deployed some of the first
massively sized clusters. And perhaps very important to us, in 2006 Yahoo hired Doug
Cutting, who then became the father of Hadoop. Most importantly, Yahoo has always
encouraged the open sourcing of Hadoop, and they supported not only the Hadoop
project but many other Apache projects. Even to this day they continue as a major
contributor and test bed for Hadoop. Doug Cutting is considered the father of Hadoop.
He also developed Nutch and Lucene. As I said earlier, in 2006 he was hired by Yahoo
to develop Hadoop. He started by working from the papers released by Google for the
Google File System and BigTable. He promoted it inside Yahoo as an open-source
project. Doug currently serves as the Chief Architect at Cloudera. Now as a fun aside,
everyone should know how he came up with the name Hadoop. Doug's son had a
yellow stuffed elephant that he dragged around, and he called it Hadoop. And that's
how we got the name Hadoop.
Heading: Doug Cutting's Contribution.
Cloudera is open source but has its own management console, Hortonworks provides
a pure open-source play, IBM has positioned itself as a big corporate partner, and
MapR has customized and commercialized the open source software.
Currently the standard core is Hadoop, Hbase, Hive, Pig, Sqoop, and Flume, but each
distro then tailors its own selection of additional software components. Generally, all
distros release their software as tar and RPM packages and through Yum and APT
repositories. Cloudera offers a Hadoop solution to solve data-intensive business
problems. They are a major leader in Hadoop and are recognized as one of the first
independent distros. Cloudera offers its enterprise customers a family of products and
services that complement the open source Hadoop platform well. Cloudera serves a
wide range of customers, including financial services, healthcare, digital media,
advertising, networking, and even telecommunications. One small comment I'd like to
add: they are strongly backed by Intel. Now let me talk about the CDH Bundle; that is
the Cloudera Distribution of Hadoop. You can find it at cloudera.com/hadoop. It is the
baseline product set that we're familiar with, which includes Hadoop, Hive, Pig, and
Hbase, along with tools like Sqoop, Flume, Mahout, and Whirr.
Heading: Introducing Cloudera.
Cloudera is one of the first distro companies and a major Hadoop leader. They offer
comprehensive training, architectural consulting, and technical support.
A diagram depicts the different components of Big Data. For example, unstructured
data, such as movies, audio, and images, and databases such as Forecast, Financial,
Legacy, and CRM. Also tools such as Decision extracting tools, Data mining tools,
and Predictive modeling tools. These tools are used by business users, for statistical
analysis, and by predictive modeling experts respectively.
The Cloudera Distribution of Hadoop (CDH) includes Hue, Impala, Mahout, Whirr,
and Cloudera Enterprise Manager.
Add-ons include Cloudera's Enterprise Manager. Now this is a very important offering
from Cloudera, as it is a tool that's intentionally built to install and manage a
supercomputing platform. Hortonworks is a major competitor to Cloudera and the
second large player in the expanding Hadoop marketplace. In 2011 the Yahoo
division responsible for much of the development of Hadoop was spun off to form its
own company, Hortonworks. They are a fully open source solution and do not
offer any commercial software. Hortonworks offers its enterprise customers services
for the open source Hadoop platform, which they call the HDP Bundle. The
Hortonworks bundle can be found at hortonworks.com/hdp/downloads. It also
includes the standard baseline tools such as Hadoop, Hive, Pig, and Hbase, along with
a whole range of extra tools like Sqoop, Flume, Mahout, and Whirr. They continue to
add additional tools all the time; the current offering includes Storm, Falcon, and
Knox. HDP is a great choice for Hadoop.
Heading: Introducing Hortonworks.
A diagram depicts how a Big Data system might work with Hortonworks bundles. The
diagram shows Image, Email, HTML, and DB content attached to a box that contains
the Edge node, Name node, and Data nodes sections. The Edge node section contains
Tomcat and Oozie, the Name node section contains Tresata libraries, Oozie, Hive,
MapReduce, and Hbase, and the Data nodes section contains the MapReduce jobs,
Hive jobs, Hadoop HDFS, and Apache Hbase. A two-sided arrow connects the box
with a Total view of customer graphic.
The Hortonworks Data Platform (HDP) also includes these additional tools:
Accumulo, Storm, Mahout, Solr, and Falcon.
Now let me introduce IBM. Obviously we all know that they are a major leader in all
levels of computing, but they've taken a very strong lead in Big Data. They've made
many commercial improvements on Hadoop, many of them tailored from their years
of experience in the enterprise. IBM packages its Big Data solution as a product
called InfoSphere BigInsights. It comes in two editions: a free edition and a
commercial edition. The commercial edition comes with all kinds of additional
application tools, data analyst tools, and data visualization tools. It's a great product.
And finally, let me introduce MapR. They also are one of the early distro companies,
and they built an early reputation for being enterprise ready. They have an intense
focus on performance and availability. Their products are called M3 and M5. M3 is
the free version and M5 is a paid enterprise version. MapR comes with a customized
file system, and they've spent a tremendous amount of effort building a distributed
namenode architecture to remove the single point of failure in the Hadoop file system.
It also is a tremendously high-value product.
Heading: Introducing IBM.
The IBM InfoSphere bundle includes an enterprise edition of IBM InfoSphere
BigInsights with application tools, data analysis tools, and data visualization tools.
A diagram shows how MapR systems work. In a box called Supercomputing platform,
there is a Data access server, which is connected via a LAN to a Job node and a Data
node. The Job node contains the MapReduce and Oozie servers, which are connected.
The Data nodes section contains Hadoop and Hbase. The Job node is connected via
VPN to two Oozie client user access computers.
MapR provides a customized file system with a distributed namenode architecture to
remove the single point of failure (SPOF) in HDFS.
Open source is a software usage philosophy of open collaboration and free access to
software. Some of the popular open source programs are Linux, MySQL/Postgres, the
R statistical programming language, and Hadoop.
Some of the values of open source software are having a global community of
developers, rapid development and rapid time to market, a low cost point, and access
to source code.
The risks which come with open source software include the development community
fading away and having numerous overlapping products.
Now let me discuss some of the risks. The first is that your development community
fades away. By this we mean that your product is no longer considered cutting edge
and leading in its market space, and the development community moves on to other
products. Another really significant problem that you need to understand in the
development of open source projects is that you can have numerous overlapping
products causing fierce competition for a fairly niche market. Now I want to introduce
one of the key players in Big Data, and that is the Apache Software Foundation. I have
their mission statement here, and it really does a great job of summarizing their value
added. They are a group of software developers who are committed to developing free
and open source software, and perhaps one of the most important statements is that
they are collaborative and consensus-based in their development. I would also like to
discuss their software in terms of the license; they have a very liberal distribution
policy for their licensing. For example, the license allows reuse of the source and the
bundles without reservation. Extensions are allowed: any programmer can easily
modify the source for local customization. And as for commercialization, they do
allow you to build on their products. However, if you are going to use an Apache
product for commercial purposes, have your legal team review the license.
Heading: Apache Software Foundation.
The Apache license allows the licensee to reuse, extend, and commercialize their use
of the software.
Apache Projects
Apache organizes software development into top-tier projects and subordinate
projects. Each project has a community of committers – these are software developers
who contribute peer-reviewed code to the primary projects. These projects are all
defined by consensus and by pragmatic software licensing. In other videos, we're
going to learn about Hadoop. It is a software framework that allows for the
distributed processing of large data sets using a fairly simple programming model.
The initial code was originally developed by Doug Cutting, who, by the way, is a
committer on a number of Apache projects. You can go to the apache.org web site and
look up all these various projects. I suggest you start with Hadoop. It is interesting to
look up the list of committers and the companies they work for.
Heading: Apache Software Foundation.
"Hadoop is an Open Source project lead by Apache. It underpins the Big Data
movement and is a rapidly advancing major technology. Original design intent was to
handle massive amounts of data, quickly, efficiently and inexpensively. It handles all
data types, structure and unstructured. If the data consists of bits and bytes Hadoop
will store it on commodity hardware." Source: apache.org.
And finally, let me introduce Oozie as another Apache project commonly used in a
Hadoop cluster. Oozie is a workflow and coordination tool. It runs across the
supercomputing platform and allows jobs to run in parallel while waiting for input
from other jobs. One of the interesting advantages of Oozie is that it comes with a
very sophisticated scheduling tool. This allows for coordination of jobs waiting on
other dependencies within the supercomputing platform. Oozie is a great tool to
master if you are going to run Hadoop supercomputing.
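Real Oozie workflows are written as XML definitions and submitted to the Oozie
server, so the following is not Oozie syntax; it is only a toy Python sketch of the
coordination idea – run every job as soon as the jobs it waits on have finished, and run
independent jobs in parallel. The job names are hypothetical.

# Each "job" lists the jobs it depends on; independent jobs can run side by side.
workflow = {
    "ingest_logs":   [],                            # no dependencies
    "ingest_sales":  [],                            # can run alongside ingest_logs
    "join_datasets": ["ingest_logs", "ingest_sales"],
    "daily_report":  ["join_datasets"],
}

done = set()
while len(done) < len(workflow):
    # A job becomes runnable once everything it waits on has completed.
    runnable = [job for job, deps in workflow.items()
                if job not in done and all(dep in done for dep in deps)]
    print("running in parallel:", runnable)
    done.update(runnable)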
Heading: Introducing Oozie.
"Big Data is made up of many functional elements and these are built up into a
number of different systems for flowing and analyzing data. These systems can be
seen as a layered model. This is not a technical definition per se but a way of
understanding all of the many components and how they fit together to provide a
supercomputing platform." Source: Will Dailey, Instructor.
The first layer is Infrastructure. This includes all the hardware and software used to
support and operate a Hadoop cluster. Obviously this will include the operating
system software, but it should also include all the commonly used tools for
monitoring and reporting on the Hadoop cluster. The second layer we call the
Data Repository layer. It deals with the movement of data within a distributed
computing environment. It is centered on the Hadoop Distributed File System, the
primary repository, but should include data transport tools such as Sqoop and Flume.
Additionally, we commonly assign all the NoSQL databases, such as Accumulo and
HBase, to this layer as a form of data repository. The Data Refinery layer deals with the
manipulation and processing of data using the parallel processing framework. The
primary technologies are YARN and MapReduce. We will learn more about these two
technologies in a later video. The Data Factory layer is an ever-expanding number of
tools designed to interface into Hadoop. All of them make it easier to access the full
power of Hadoop. Many of them allow users to create compute jobs in an
easily understood language, such as SQL, and then translate these inputs into
MapReduce jobs. This is why we refer to this class of software as data workers and
why we call this layer the data factory.
Heading: The Infrastructure Layer.
Infrastructure comprises utilities, network, and hardware which build or support the
supercomputing platform.
The Data Repository layer is in charge of controlling the movement and storage of
data within the supercomputing platform.
The Data Refinery layer is responsible for parallel processing compute jobs in order
to manipulate and process data. This is done primarily using YARN and MapReduce,
both products of the Hadoop project.
The Data Factory layer comprises tools which are designed to make it easier to
access the full power of Hadoop. The tools include Hive, Pig, Spark, and Oozie.
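To make the Data Refinery layer a little more concrete, here is a minimal sketch of
the map and reduce logic behind the classic word-count job, written in Python in the
style of a Hadoop Streaming job (reading lines on standard input and emitting
tab-separated key-value pairs). On a real cluster the framework distributes the map
and reduce steps across the data nodes and handles the shuffle and sort in between;
Data Factory tools such as Hive and Pig generate jobs of this general shape on the
user's behalf. This is an illustration, not a production job.

import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) key-value pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: group the pairs by key and sum the counts for each word."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for the cluster: pipe any text file into this script.
    for word, count in reducer(mapper(sys.stdin)):
        print(f"{word}\t{count}")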
The Data Fusion layer is the application layer and the true business end of the Big
Data solution. These are applications, many purposely built to take advantage of
Hadoop's power, which create new information by combining multiple data sources;
we frequently refer to this action as data fusion. They combine the different data to
gain new knowledge. This is where all the data analytics and data visualization tools
reside in our model. Examples are machine learning technology, such as Mahout, or
data visualization tools, such as Datameer or Pentaho. And no model would be
complete without understanding why we're building the Hadoop cluster. What
software tools do we use to interface into the business? What tools do we use to
generate reports that will clearly explain our value to the business? We use a range of
tools to express the requirements, the service levels, the requests for changes, and the
costs and expenditures – all important parts of our Big Data stack.
Heading: The Data Fusion Layer.
The Data Fusion layer comprises tools used to create applications, implement
algorithms, and visualize data. The tools include Mahout, Datameer, Pentaho, and
Tableau.
The Business Value layer is the product of the supercomputing platform and uses a
number of tools to express the requirements, service levels, and costs and
expenditures for the business.
A diagram shows the various tools used on a Big Data platform including Flume,
MapReduce, Oozie, and Ambari.
The name Sqoop is a combination of SQL with the word Hadoop. It is a great tool for
exporting and importing data between any RDBMS and the Hadoop Distributed File
System. Sqoop uses both JDBC and a command line interface. It supports
parallelization across the cluster and has the ability to deploy as a MapReduce job to
manage the export or import. I use this tool frequently and have always found it to be
reliable and a stalwart ally in the battle for moving data into a Hadoop cluster. Hive
originated at Facebook. This factory worker is a strongback in any Hadoop cluster and
is one of the first tools learned by many new Big Data users. Hive is a SQL-like
interface into Hadoop. It allows those of us who are comfortable with SQL to use
common SQL commands and common relational table structures to create
MapReduce jobs without having to know MapReduce. Hive treats all the data like it
belongs in tables and allows us to create table definitions over the top of the data files.
Heading: Introducing Hive.
Hive converts inputs into MapReduce jobs and organizes the metadata of unstructured
data into tables.
The other strongback in a Hadoop cluster is Pig. Pig was originally created at Yahoo.
It was named for the fact that pigs eat anything. Pig is a script language for data flow,
and it also converts its scripts into MapReduce jobs. Pig provides its own vernacular:
the scripting language is called Pig Latin, the Pig shell is called Grunt, and let us not
forget that there is a common repository for Pig scripts that we call the Piggy Bank.
These are just a few of the most common tools used in a Hadoop cluster. To learn
more about the more popular tools, I suggest you go to one of the distro companies
such as Cloudera, Hortonworks, IBM, or MapR and review their distribution for a
supercomputing platform.
Heading: Introducing Pig.
NoSQL Databases
Let us begin our conversation about NoSQL with a few common facts. First, NoSQL
means Not Only SQL. This implies these databases can use SQL as well as other,
lower-level query languages. NoSQL covers a broad range of databases; the broadest
definition is that they are non-relational, distributed-store databases. Nearly all
NoSQL databases use key-value pairing as their central architecture, and all are
designed to work in a distributed computing environment. NoSQL databases are
powerful because of the simplicity of their design, the fact that they massively scale
out, and the fact that they are significantly faster when used for the right use cases.
Most NoSQL databases are open source. At our last count there were over 75
offerings, many highly customized for specific use cases. Selecting and deploying a
NoSQL database truly does require platform engineering. There's a large amount of
complexity, overlap, and constant change in this technology arena, so we strongly
suggest you seek expert advice when making this decision. NoSQL databases are at
the forefront of new engineering to take full advantage of emerging technologies such
as solid-state drives and in-memory applications.
Heading: NoSQL Databases.
Graph databases use graph structures that define datasets by nodes, edges, and
properties.
Some of the more widely used databases are Accumulo, Cassandra, Druid, and
Hbase. Hbase is an extension of Hadoop.
A diagram shows data being placed in a database by way of the key-value store, with
the output showing the saved data. For example, the first data input is ABC 3356.
Data is put into the key-value store in the format "Name":"ABC Co." and saved as
"Web":"abc.com".
Comparing RDBMSs to NoSQL is a complicated subject, but let me make a few key
points everyone should understand when dealing with NoSQL databases. RDBMSs
are superior for transactional processing; they are uniquely designed to store and
retrieve rows of information. NoSQL databases, on the other hand, are superior for
finding subsets of data within massive datasets. RDBMSs are highly engineered to
guarantee data integrity, while NoSQL databases take a very different approach – but
this does not mean they lose or corrupt data. If you are interested in learning more
about NoSQL databases, there's a lot of information on the Internet, there are some
great Skillsoft online classes, and there is a growing number of subject matter experts.
Because of their power and speed, NoSQL databases play, and will continue to play,
an important role in our supercomputing platform.