What Is Data Mining?: Warehousing
Data never sleeps. Every minute massive amounts of it are being generated from every
phone, website and application across the Internet. Just how much data is being created
and where does it come from?
Volume
Big data implies enormous volumes of data. Data used to be created mainly by employees; now it is
generated by machines, networks and human interaction on systems like social media, so the volume of
data to be analyzed is massive. Yet Inderpal states that volume is not as much of a problem as other Vs
like veracity.
Variety
Variety refers to the many sources and types of data, both structured and unstructured. We used to store
data from sources like spreadsheets and databases. Now data comes in the form of emails, photos,
videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for
storing, mining and analyzing data. Jeff Veis, VP of Solutions at HP Autonomy, presented how HP is
helping organizations deal with big data challenges, including data variety.
Velocity
Big data velocity deals with the pace at which data flows in from sources like business processes,
machines, networks and human interaction with things like social media sites and mobile devices. The
flow of data is massive and continuous. This real-time data can help researchers and businesses make
valuable decisions that provide strategic competitive advantages and ROI, if you are able to handle the
velocity. Inderpal suggests that sampling data can help deal with issues like volume and velocity.
Veracity
Big data veracity refers to the biases, noise and abnormality in data. Is the data that is being stored and
mined meaningful to the problem being analyzed? Inderpal feels that veracity is the biggest challenge in
data analysis when compared to things like volume and velocity. In scoping out your big data strategy,
you need your team and partners to help keep your data clean, and processes in place to keep dirty data
from accumulating in your systems.
Validity
Like big data veracity, validity is the question of whether the data is correct and accurate for the
intended use. Clearly, valid data is key to making the right decisions. Phil Francisco, VP of Product
Management at IBM, spoke about IBM's big data strategy and the tools it offers to help with data
veracity and validity.
Volatility
Big data volatility refers to how long data remains valid and how long it should be stored. In this world of
real-time data you need to determine at what point data is no longer relevant to the current analysis.
Big data clearly deals with issues beyond volume, variety and velocity, extending to other concerns like
veracity, validity and volatility. To hear about other big data trends and presentations, follow the Big Data
Innovation Summit on Twitter at #BIGDBN.
Volume refers to the vast amounts of data generated every second. Just think
of all the emails, Twitter messages, photos, video clips, sensor data etc. we
produce and share every second. We are not talking terabytes but zettabytes
or brontobytes. On Facebook alone we send 10 billion messages per day, click
the "like" button 4.5 billion times and upload 350 million new pictures each
and every day. If we take all the data generated in the world between the
beginning of time and 2008, the same amount of data will soon be generated
every minute! This increasingly makes data sets too large to store and analyse
using traditional database technology. With big data technology we can now
store and use these data sets with the help of distributed systems, where parts
of the data are stored in different locations and brought together by software.
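To make this concrete, here is a minimal sketch using the Hadoop FileSystem Java API: a small file is written to HDFS, which splits it into blocks and replicates them across data nodes, and is then read back as if it were a single local file. The namenode address hdfs://namenode:9000 and the /demo/events.txt path are hypothetical placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical cluster address; in practice this usually comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/demo/events.txt");

    // Write a small file; HDFS splits it into blocks and replicates them across
    // data nodes, so no single machine has to hold the whole data set.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("user=42,action=click\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back; the client reassembles the blocks transparently.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}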
Velocity refers to the speed at which new data is generated and the speed at
which data moves around. Just think of social media messages going viral in
seconds, the speed at which credit card transactions are checked for
fraudulent activities, or the milliseconds it takes trading systems to analyse
social media networks to pick up signals that trigger decisions to buy or sell
shares. Big data technology now allows us to analyse the data while it is being
generated, without ever putting it into databases.
Variety refers to the different types of data we can now use. In the past we
focused on structured data that neatly fits into tables or relational databases,
such as financial data (e.g. sales by product or region). In fact, 80% of the
world's data is now unstructured and therefore can't easily be put into tables
(think of photos, video sequences or social media updates). With big data
technology we can now harness different types of data, both structured and
unstructured.
Architecture
I read the tip on Introduction to Big Data and would like to know more about
how Big Data architecture looks in an enterprise, the scenarios in which Big
Data technologies are useful, and any other relevant information.
Solution
In this tip, let us take a look at the architecture of a modern data processing
and management system involving a Big Data ecosystem, a few use cases of
Big Data, and also some of the common reasons for the increasing adoption
of Big Data technologies.
Architecture
Before we look into the architecture of Big Data, let us take a look at a high-level
architecture of a traditional data processing and management system. It looks
as shown below.
As discussed in the previous tip, there are various sources of Big Data,
including Enterprise Data, Social Media Data, Activity Generated Data,
Public Data, Data Archives, Archived Files, and other structured or
unstructured sources.
Data Archive
The Data Archive is a collection of data that includes data archived from the
transactional systems in compliance with an organization's data retention and
data governance policies, as well as aggregated data from the Big Data engine
that is less likely to be needed in the near future.
Big Data Engine
This is the heart of a modern (Next-Generation / Big Data) data processing and
management system architecture. This engine is capable of processing large
volumes of data, ranging from a few megabytes to hundreds of terabytes or
even petabytes, of different varieties, structured or unstructured, coming in at
different speeds and/or intervals. The engine consists primarily of a Hadoop
framework, which allows distributed processing of large heterogeneous data
sets across clusters of computers. This framework consists of two main
components, namely HDFS and MapReduce. We will take a closer look at this
framework and its components in the next and subsequent tips.
A common question is, "Why should organizations adopt Big Data technologies
today?" The adoption of Big Data technologies is driven by various factors,
including the following:
Cost Factors
o Availability of Commodity Hardware
o Availability of Open Source Operating Systems
o Availability of Cheaper Storage
o Availability of Open Source Tools/Software
Business Factors
o There is a lot of data being generated outside the
enterprise, and organizations are compelled to consume
that data to stay ahead of the competition. Often
organizations are interested in only a subset of this large
volume of data.
o The volume of structured and unstructured data being
generated within the enterprise is very large and cannot
be effectively handled using traditional data
management and processing tools.
HIPI (Hadoop Image Processing Interface)
HIPI is an image processing library designed to be used with the Apache Hadoop
MapReduce parallel programming framework. HIPI facilitates efficient and high-throughput image
processing with MapReduce style parallel programs typically executed on a cluster. It provides a
solution for how to store a large collection of images on the Hadoop Distributed File System (HDFS)
and make them available for efficient distributed processing. HIPI also provides integration
with OpenCV, a popular open-source library that contains many computer vision algorithms
(see covar example program to learn more about this integration).
HIPI is developed and maintained by a growing number of developers from around the world.
The latest release of HIPI has been tested with Hadoop 2.7.1.
System Design
This diagram shows the organization of a typical MapReduce/HIPI program:
The primary input object to a HIPI program is a HipiImageBundle (HIB). A HIB is a collection of
images represented as a single file on the HDFS. The HIPI distribution includes several useful
tools for creating HIBs, including a MapReduce program that builds a HIB from a list of images
downloaded from the Internet.
The first processing stage of a HIPI program is a culling step that allows filtering the images in a HIB
based on a variety of user-defined conditions like spatial resolution or criteria related to the image
metadata. This functionality is achieved through the Culler class. Images that are culled are never
fully decoded, saving processing time.
The images that survive the culling stage are assigned to individual map tasks in a way that attempts
to maximize data locality, a cornerstone of the Hadoop MapReduce programming model. This
functionality is achieved through the HibInputFormat class. Finally, individual images are presented
to the Mapper as objects derived from the HipiImage abstract base class along with an
associated HipiImageHeader object. For example, the ByteImage and FloatImage classes extend the
HipiImage base class and provide access to the underlying raster grid of image pixel values as
arrays of Java bytes and floats, respectively. These classes provide a number of useful functions like
cropping, color space conversion, and scaling.
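As a rough sketch of how these pieces fit together, the following job reads images out of a HIB, has the Mapper emit each image's resolution, and counts the processed images in the Reducer. It is modeled on HIPI's hello-world tutorial; the org.hipi.* package names and the FloatImage and HibInputFormat usage are assumptions based on a recent HIPI 2.x release and should be verified against the version you build with.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Package names assumed from HIPI 2.x (org.hipi.*); verify against your release.
import org.hipi.image.FloatImage;
import org.hipi.image.HipiImageHeader;
import org.hipi.imagebundle.mapreduce.HibInputFormat;

public class HibDimensions {

  // Each map call receives one image from the HIB: its header as the key and
  // its decoded pixel data (here as a FloatImage) as the value.
  public static class DimensionMapper
      extends Mapper<HipiImageHeader, FloatImage, IntWritable, Text> {
    @Override
    public void map(HipiImageHeader header, FloatImage image, Context context)
        throws IOException, InterruptedException {
      // Emit each image's spatial resolution under a single key
      // (getWidth/getHeight accessors assumed from the HIPI raster image API).
      context.write(new IntWritable(1),
          new Text(image.getWidth() + "x" + image.getHeight()));
    }
  }

  // The reducer simply counts how many images made it through the pipeline.
  public static class CountReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (Text ignored : values) {
        count++;
      }
      context.write(key, new Text("images processed: " + count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hib dimensions");
    job.setJarByClass(HibDimensions.class);
    job.setInputFormatClass(HibInputFormat.class);          // reads image records out of a HIB
    job.setMapperClass(DimensionMapper.class);
    job.setReducerClass(CountReducer.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));  // path to the .hib on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}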
HIPI also includes support for OpenCV, a popular open-source computer vision library. Specifically,
image classes that extend from RasterImage (such as ByteImage and FloatImage, discussed above)
may be converted to OpenCV Java Mat objects using routines in the OpenCVUtils class.
The OpenCVMatWritable class provides a wrapper around the OpenCV Java Mat class that can be
used as a key or value object in MapReduce programs. See the covar example program for more
detailed information about how to use HIPI with OpenCV.
The records emitted by the Mapper are collected and transmitted to the Reducer according to the
built-in MapReduce shuffle algorithm, which attempts to minimize network traffic. Finally, the
user-defined reduce tasks are executed in parallel and their output is aggregated and written to HDFS.