What is Data Mining?

Data mining is a process used by companies to turn raw data into useful information.

By using software to look for patterns in large batches of data, businesses can learn more about their customers and develop more effective marketing strategies, as well as increase sales and decrease costs.

Data mining depends on effective data collection and warehousing as well as computer processing.
What is Big Data?
Big data is a term that describes the large volumes of data, both structured and unstructured, that inundate a business on a day-to-day basis.
But it's not the amount of data that's important; it's what organizations do with the data that matters.
Big data can be analyzed for insights that lead to better decisions and strategic business moves.
How Much Data is Created Every Minute?

Data never sleeps. Every minute, massive amounts of it are generated by every phone, website, and application across the Internet. Just how much data is being created, and where does it come from?
Every minute of every day we create:

More than 204 million email messages

Over 2 million Google search queries

48 hours of new YouTube videos

684,000 bits of content shared on Facebook

More than 100,000 tweets

$272,000 spent on e-commerce

3,600 new photos shared on Instagram

Nearly 350 new WordPress blog posts

Volume
Big data implies enormous volumes of data. It used to be that employees created data. Now that data is generated by machines, networks, and human interaction on systems like social media, the volume of data to be analyzed is massive. Yet, Inderpal states that the volume of data is not as much of a problem as other Vs like veracity.
Variety
Variety refers to the many sources and types of data, both structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storing, mining, and analyzing data. Jeff Veis, VP of Solutions at HP Autonomy, presented how HP is helping organizations deal with big challenges including data variety.
Velocity
Big data velocity deals with the pace at which data flows in from sources like business processes, machines, networks, and human interaction with things like social media sites and mobile devices. The flow of data is massive and continuous. This real-time data can help researchers and businesses make valuable decisions that provide strategic competitive advantages and ROI, if you are able to handle the velocity. Inderpal suggests that sampling data can help deal with issues like volume and velocity.
Veracity
Big data veracity refers to the biases, noise, and abnormality in data. Is the data that is being stored and mined meaningful to the problem being analyzed? Inderpal feels veracity in data analysis is the biggest challenge compared to things like volume and velocity. In scoping out your big data strategy, you need to have your team and partners work to keep your data clean, with processes to keep dirty data from accumulating in your systems.
Validity
Like veracity, validity is the issue of whether the data is correct and accurate for the intended use. Clearly, valid data is key to making the right decisions. Phil Francisco, VP of Product Management at IBM, spoke about IBM's big data strategy and the tools they offer to help with data veracity and validity.
Volatility
Big data volatility refers to how long data is valid and how long it should be stored. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.
Big data clearly deals with issues beyond volume, variety, and velocity; there are other concerns like veracity, validity, and volatility. To hear about other big data trends and presentations, follow the Big Data Innovation Summit on Twitter at #BIGDBN.

Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips, sensor data, etc. we produce and share every second. We are not talking terabytes but zettabytes or brontobytes. On Facebook alone we send 10 billion messages per day, click the "like" button 4.5 billion times, and upload 350 million new pictures each and every day. If we take all the data generated in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute! This increasingly makes data sets too large to store and analyse using traditional database technology. With big data technology we can now store and use these data sets with the help of distributed systems, where parts of the data are stored in different locations and brought together by software.
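As a concrete illustration of the distributed-storage idea, the sketch below writes a file to HDFS (the distributed file system discussed later in this document) through Hadoop's Java FileSystem API. The cluster address and file path are hypothetical; the point is that the client works with one logical file while HDFS splits it into blocks and keeps replicated copies on different machines.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");
        FileSystem fs = FileSystem.get(conf);

        // The client sees a single logical file...
        Path path = new Path("/data/minute-by-minute/events.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("one small record among zettabytes\n");
        }

        // ...while HDFS transparently splits it into blocks and stores
        // replicated copies on different machines in the cluster.
        short replicas = fs.getFileStatus(path).getReplication();
        System.out.println("Stored with " + replicas + " replicas");
    }
}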
Velocity refers to the speed at which new data is generated and the speed at
which data moves around. Just think of social media messages going viral in
seconds, the speed at which credit card transactions are checked for
fraudulent activities, or the milliseconds it takes trading systems to analyse
social media networks to pick up signals that trigger decisions to buy or sell
shares. Big data technology allows us now to analyse the data while it is being
generated, without ever putting it into databases.
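The text does not name a particular streaming tool, so as one illustrative sketch (not part of the original material) the loop below consumes records from Apache Kafka, a widely used message broker, and inspects each card transaction as it arrives rather than after it lands in a database. The broker address, topic name, record format, and flagging threshold are all hypothetical.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TransactionStreamCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9092"); // hypothetical broker
        props.put("group.id", "fraud-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("card-transactions")); // hypothetical topic
            while (true) {
                // Pull whatever records have arrived in the last 100 ms and
                // examine them in flight, without a database round trip.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> r : records) {
                    long cents = Long.parseLong(r.value()); // value assumed to be an amount in cents
                    if (cents > 1_000_000) { // hypothetical review threshold
                        System.out.println("Flag for review: card " + r.key() + ", amount " + cents);
                    }
                }
            }
        }
    }
}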
Variety refers to the different types of data we can now use. In the past we focused on structured data that neatly fits into tables or relational databases, such as financial data (e.g. sales by product or region). In fact, 80% of the world's data is now unstructured, and therefore can't easily be put into tables (think of photos, video sequences or social media updates). With big data technology we can now harness different types of data (structured and unstructured), including messages, social media conversations, photos, sensor data, and video or voice recordings, and bring them together with more traditional, structured data.
Veracity refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of content), but big data and analytics technology now allows us to work with these types of data. The volumes often make up for the lack of quality or accuracy.
Value: Then there is another V to take into account when looking at Big Data: value! It is all well and good having access to big data, but unless we can turn it into value it is useless. So you can safely argue that 'value' is the most important V of Big Data. It is important that businesses make a business case for any attempt to collect and leverage big data. It is so easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of costs and benefits.

Problem
I read the tip on Introduction to Big Data and would like to know more about how Big Data architecture looks in an enterprise, the scenarios in which Big Data technologies are useful, and any other relevant information.
Solution

In this tip, let us take a look at the architecture of a modern data processing
and management system involving a Big Data ecosystem, a few use cases of
Big Data, and also some of the common reasons for the increasing adoption
of Big Data technologies.

Architecture
Before we look into the architecture of Big Data, let us take a look at a high
level architecture of a traditional data processing management system. It looks
as shown below.

As we can see in the above architecture, mostly structured data is involved and is used for Reporting and Analytics purposes. Although there are one or more unstructured sources involved, often those contribute to a very small portion of the overall data and hence are not represented in the above diagram for simplicity. However, in the case of Big Data architecture, there are various sources involved, each of which comes in at different intervals, in different formats, and in different volumes. Below is a high level architecture of an enterprise data management system with a Big Data engine.

Let us take a look at various components of this modern architecture.


Source Systems

As discussed in the previous tip, there are various sources of Big Data including Enterprise Data, Social Media Data, Activity Generated Data, Public Data, Data Archives, Archived Files, and other Structured or Unstructured sources.
Transactional Systems

In an enterprise, there are usually one or more Transactional/OLTP systems which act as the backend databases for the enterprise's mission critical applications. These constitute the transactional systems represented above.
Data Archive

The Data Archive is a collection of data which includes data archived from the transactional systems in compliance with an organization's data retention and data governance policies, as well as aggregated data (which is less likely to be needed in the near future) from the Big Data engine.

ODS

An Operational Data Store (ODS) is a consolidated set of data from various transactional systems. It acts as a staging data hub and can be used by the Big Data engine as well as for feeding data into Data Warehouse, Business Intelligence, and Analytical systems.
Big Data Engine

This is the heart of a modern (Next-Generation / Big Data) data processing and management system architecture. This engine is capable of processing large volumes of data ranging from a few Megabytes to hundreds of Terabytes or even Petabytes of data of different varieties, structured or unstructured, coming in at different speeds and/or intervals. This engine consists primarily of a Hadoop framework, which allows distributed processing of large heterogeneous data sets across clusters of computers. This framework consists of two main components, namely HDFS and MapReduce, illustrated in the sketch below. We will take a closer look at this framework and its components in the next and subsequent tips.
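To make the two components concrete, here is the canonical word count job written against Hadoop's Java MapReduce API: the map tasks run in parallel over file blocks stored in HDFS and emit (word, 1) pairs, and the reduce tasks sum the counts per word. The input and output paths are supplied on the command line; this is a minimal sketch rather than a production job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: runs in parallel over HDFS blocks, emitting (word, 1) per token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receives all counts for one word (via the shuffle) and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}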

Big Data Use Cases


Big Data technologies can solve business problems in a wide range of industries. Below are a few use cases.
Banking and Financial Services
o Fraud detection: detecting possible fraud or suspicious transactions in accounts, credit cards, debit cards, insurance, etc.
Retail
o Targeting customers with different discounts, coupons, promotions, etc. based on demographic data like gender, age group, location, occupation, dietary habits, buying patterns, and other information which can be useful to differentiate/categorize customers.
Marketing
o Outbound marketing, specifically, can make use of customer demographic information like gender, age group, location, occupation, and dietary habits, as well as customer interests/preferences usually expressed in the form of comments/feedback and on social media networks.
o Customers' communication preferences can be identified from various sources like polls, reviews, comments/feedback, and social media, and can be used to target customers via different channels like SMS, Email, Online Stores, Mobile Applications, and Retail Stores.
Sentiment Analysis
o Organizations use the data from social media sites like Facebook and Twitter to understand what customers are saying about the company, its products, and services. This type of analysis is also performed to understand which companies, brands, services, or technologies people are talking about.
Customer Service
o IT Services and BPO companies analyze call records/logs to gain insights into customer complaints and feedback, a call center executive's responses and ability to resolve tickets, and ways to improve the overall quality of service.
o Call center data from the telecommunications industry can be used to analyze call records/logs and optimize pricing as well as calling, messaging, and data plans.
Apart from these, Big Data technologies/solutions can solve business problems in other industries like Healthcare, Automotive, Aeronautics, Gaming, and Manufacturing.

Big Data Adoption


Data has always been there and is growing at a rapid pace. One question asked quite often is: "Why are organizations taking an interest in the silos of data which otherwise were not utilized effectively in the past, and embracing Big Data technologies today?" The adoption of Big Data technologies is driven by various factors, including the following:
Cost Factors
o Availability of Commodity Hardware
o Availability of Open Source Operating Systems
o Availability of Cheaper Storage
o Availability of Open Source Tools/Software
Business Factors
o There is a lot of data being generated outside the enterprise, and organizations are compelled to consume that data to stay ahead of the competition. Often organizations are interested in only a subset of this large volume of data.
o The volume of structured and unstructured data being generated in the enterprise is very large and cannot be effectively handled using traditional data management and processing tools.
HIPI (Hadoop Image Processing Interface)
HIPI is an image processing library designed to be used with the Apache Hadoop MapReduce parallel programming framework. HIPI facilitates efficient and high-throughput image processing with MapReduce-style parallel programs typically executed on a cluster. It provides a solution for storing a large collection of images on the Hadoop Distributed File System (HDFS) and making them available for efficient distributed processing. HIPI also provides integration with OpenCV, a popular open-source library that contains many computer vision algorithms (see the covar example program to learn more about this integration).
HIPI is developed and maintained by a growing number of developers from around the world.
The latest release of HIPI has been tested with Hadoop 2.7.1.

System Design
This diagram shows the organization of a typical MapReduce/HIPI program:

The primary input object to a HIPI program is a HipiImageBundle (HIB). A HIB is a collection of
images represented as a single file on the HDFS. The HIPI distribution includes several useful
tools for creating HIBs, including a MapReduce program that builds a HIB from a list of images
downloaded from the Internet.
The first processing stage of a HIPI program is a culling step that allows filtering the images in a HIB
based on a variety of user-defined conditions like spatial resolution or criteria related to the image
metadata. This functionality is achieved through the Culler class. Images that are culled are never
fully decoded, saving processing time.
The images that survive the culling stage are assigned to individual map tasks in a way that attempts
to maximize data locality, a cornerstone of the Hadoop MapReduce programming model. This
functionality is achieved through the HibInputFormat class. Finally, individual images are presented
to the Mapper as objects derived from the HipiImage abstract base class along with an
associated HipiImageHeader object. For example, the ByteImage and FloatImage classes extend the
HipiImage base class and provide access to the underlying raster grid of image pixel values as
arrays of Java bytes and floats, respectively. These classes provide a number of useful functions like
cropping, color space conversion, and scaling.
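As a hedged sketch of what such a Mapper can look like (built from the class names above; method details may differ across HIPI releases), the map task below computes the average pixel value of each image in a HIB and emits it under a shared key so that a single reduce task can combine the results:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.hipi.image.FloatImage;
import org.hipi.image.HipiImageHeader;

// Receives one decoded image per call: the HipiImageHeader as the key and a
// FloatImage (a HipiImage subclass exposing the raster grid as floats) as the value.
public class AveragePixelMapper
        extends Mapper<HipiImageHeader, FloatImage, IntWritable, FloatImage> {

    @Override
    public void map(HipiImageHeader key, FloatImage value, Context context)
            throws IOException, InterruptedException {
        if (value == null || value.getWidth() < 2 || value.getHeight() < 2) {
            return; // skip images that are missing or too small
        }
        int w = value.getWidth();
        int h = value.getHeight();
        int bands = value.getNumBands();
        float[] pixels = value.getData(); // flat layout: (row * w + col) * bands + band

        float[] avg = new float[bands];
        for (int j = 0; j < h; j++) {
            for (int i = 0; i < w; i++) {
                for (int b = 0; b < bands; b++) {
                    avg[b] += pixels[(j * w + i) * bands + b];
                }
            }
        }
        for (int b = 0; b < bands; b++) {
            avg[b] /= (float) (w * h);
        }

        // Emit the per-image average as a 1x1 image under a single shared key,
        // so the shuffle routes every average to the same reduce task.
        context.write(new IntWritable(0), new FloatImage(1, 1, bands, avg));
    }
}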
HIPI also includes support for OpenCV, a popular open-source computer vision library. Specifically, image classes that extend from RasterImage (such as ByteImage and FloatImage, discussed above) may be converted to OpenCV Java Mat objects using routines in the OpenCVUtils class. The OpenCVMatWritable class provides a wrapper around the OpenCV Java Mat class that can be used as a key or value object in MapReduce programs. See the covar example program for more detailed information about how to use HIPI with OpenCV.
The records emitted by the Mapper are collected and transmitted to the Reducer according to the built-in MapReduce shuffle algorithm that attempts to minimize network traffic. Finally, the user-defined reduce tasks are executed in parallel and their output is aggregated and written to the HDFS (Hadoop Distributed File System). A matching reduce task and job driver for the mapper sketched above appear below.
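Continuing the hedged sketch, the reduce task below receives every per-image average emitted under the shared key, averages them, and writes a single summary line to the HDFS:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.hipi.image.FloatImage;

// Receives all 1x1 per-image averages for the shared key and combines them.
public class AveragePixelReducer
        extends Reducer<IntWritable, FloatImage, IntWritable, Text> {

    @Override
    public void reduce(IntWritable key, Iterable<FloatImage> values, Context context)
            throws IOException, InterruptedException {
        float[] sum = null;
        long count = 0;
        for (FloatImage image : values) {
            float[] data = image.getData();
            if (sum == null) {
                sum = new float[data.length];
            }
            for (int b = 0; b < data.length; b++) {
                sum[b] += data[b];
            }
            count++;
        }
        if (sum == null) {
            return; // no images survived the culling stage
        }
        StringBuilder result = new StringBuilder("average pixel value:");
        for (int b = 0; b < sum.length; b++) {
            result.append(' ').append(sum[b] / count);
        }
        // The framework writes this aggregated output to the HDFS.
        context.write(key, new Text(result.toString()));
    }
}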

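Finally, a job driver wires the pieces together: it points the job at a HIB stored on the HDFS and configures HibInputFormat so that each map task receives decoded images. Package and class names follow recent HIPI releases and may differ in other versions; the input and output paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.hipi.image.FloatImage;
import org.hipi.imagebundle.mapreduce.HibInputFormat;

public class AveragePixelJob {
    public static void main(String[] args) throws Exception {
        // args[0]: a HIB on the HDFS (e.g. built with the HIPI import tools);
        // args[1]: an HDFS output directory. Both are supplied by the user.
        Job job = Job.getInstance(new Configuration(), "hib average pixel");
        job.setJarByClass(AveragePixelJob.class);

        // HibInputFormat splits the HIB into per-image records with data locality.
        job.setInputFormatClass(HibInputFormat.class);
        job.setMapperClass(AveragePixelMapper.class);
        job.setReducerClass(AveragePixelReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(FloatImage.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}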
Big Data Mining Frameworks

Big data mining platform.
Big data semantics and application knowledge:
1. Information sharing and data privacy
2. Domain and application knowledge
Big data mining algorithms:
1. Local learning and model fusion for multiple information sources
2. Mining from sparse, uncertain, and incomplete data
3. Mining complex and dynamic data
Challenges:
Location of big data sources: commonly, big data are stored in different locations.
Volume of the big data: the size of the big data grows continuously.
Hardware resources: RAM capacity.
Privacy: medical reports, bank transaction records.
Having domain knowledge.
Getting meaningful information.
Solutions:
Parallel computing programming.
An efficient computing platform will not have centralized data storage; instead, the platform will be distributed across large-scale storage.
Restricting access to the data.
Advantages:
1. Fast response
2. Extraction of useful information
3. Prediction of required data from large amounts of data
4. Presentation of better results in the form of visualizations
Conclusion:
Big data is about more than sheer capacity.
There is potential for making faster advances in many scientific disciplines and for improving the profitability and success of many enterprises by using technologies like Hadoop, Pig, and so on.
Big data spans a large variety of application domains, and is therefore not cost-effective to address in the context of one domain alone.
Furthermore, such systems can provide fully transformative solutions and will naturally be adopted by the next generation of industrial applications.
