Unit 1
Introduction to Big Data Platform - Challenges of Conventional Systems - Intelligent data analysis -
Nature of Data - Analytic Processes and Tools - Analysis vs Reporting.
---------------------------------------------------------------------------------------------------------------
INTRODUCTION TO DATA:
What is Data?
Data is the quantities, characters, or symbols on which operations are performed by a computer,
which may be stored and transmitted in the form of electrical signals and
recorded on magnetic, optical, or mechanical recording media.
Three Actions on Data
Capture
Transform
Store
BigData
Big Data may well be the Next Big Thing in the IT world.
Big data burst upon the scene in the first decade of the 21st century.
Big Data is also data but with a huge size.
Big Data is a term used to describe a collection of data that is huge in size and yet growing
exponentially with time.
In short such data is so large and complex that none of the traditional data management tools are
able to store it or process it efficiently.
Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications.
Examples of Bigdata
The New York Stock Exchange generates about one terabyte of new trade data per day.
Other examples of Big Data generation includes stock exchanges, social media sites, jet engines, etc.
4. Veracity:
Veracity is all about the trustworthiness of the data; if the data is collected from trusted or reliable
sources, this concern is largely addressed. Veracity refers to inconsistencies and uncertainty in
data: the data that is available can sometimes be messy, of low quality and less accurate. Data is
also variable because of the multitude of data dimensions resulting from multiple disparate data
types and sources.
Example: Data in bulk could create confusion, whereas too little data could convey only half or
incomplete information.
5. Value:
Value refers to the purpose, scenario or business outcome that the analytical solution has to address.
Does the data have value? If not, is it worth being stored or collected? The analysis also needs to be
performed with ethical considerations in mind.
6. Variability:
Variability refers to the need to obtain meaningful data under all possible circumstances:
how fast is the structure of your data changing, and
how often does the meaning or shape of your data change?
Big data platforms are also delivered through the cloud, where the provider offers big data solutions
and services.
New analytic applications drive the requirements for a big data platform
1. Integrate and manage the full variety, velocity and volume of data
2. Apply advanced analytics to information in its native form
3. Visualize all available data for ad-hoc analysis
4. Development environment for building new analytic applications
5. Workload optimization and scheduling
6. Security and Governance
Figure: MapReduce
2. Big Data Platform - Stream Computing:
Built to analyze data in motion
Multiple concurrent input streams
Massive scalability
Process and analyze a variety of data:
-- Structured, unstructured content, video, audio
-- Advanced analytic operators
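As a rough illustration of analyzing data in motion, the sketch below (plain Python with hypothetical sensor readings, not tied to any particular stream-computing product) keeps a small sliding window over an incoming stream and flags out-of-range values as they arrive.

```python
from collections import deque

def process_stream(readings, window_size=5, threshold=10.0):
    """Analyze a stream of numeric readings as they arrive (data in motion)."""
    window = deque(maxlen=window_size)          # keep only the last N readings
    for value in readings:
        window.append(value)
        moving_avg = sum(window) / len(window)  # incremental summary statistic
        if abs(value - moving_avg) > threshold: # simple anomaly rule
            print(f"alert: {value} deviates from moving average {moving_avg:.2f}")
        yield moving_avg

# Example: in practice the readings could come from a socket, queue or sensor feed.
for avg in process_stream([20.1, 19.8, 20.5, 45.0, 20.2]):
    pass
```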
3. Big Data Platform - Data Warehousing:
Workload optimized systems
-- Deep analytics appliances
-- Configurable operational analytics appliances
-- Data warehousing software
Capabilities
-- Massive parallel processing engine
-- High performance OLAP
-- Mixed operational and analytic workloads
4. Big Data Platform - Information Integration and Governance
Integrate any type of data to the big data platform
-- Structured
-- Unstructured
-- Streaming
Governance and trust for big data
-- Secure sensitive data
-- Lineage and metadata of new big data sources
-- Lifecycle management to control data growth
-- Master data to establish single version of the truth
Massive volume of structured data movement
-- 2.38 TB / Hour load to data warehouse
-- High-volume load to Hadoop file system
Ingest unstructured data into Hadoop file system
Integrate streaming data sources
9. Big Data Platform - Analytic Applications:
Big Data Platform is designed for analytic application development and integration.
BI/Reporting – Cognos BI, Attivio
Predictive Analytics – SPSS, G2, SAS
Exploration/Visualization – BigSheets, Datameer
Instrumentation Analytics – Brocade, IBM GBS
Content Analytics – IBM Content Analytics
Functional Applications – Algorithmics, Cognos Consumer Insights, Clickfox, i2, IBM GBS
Industry Applications – TerraEchos, Cisco, IBM GBS
The rise and development of social networks, multimedia, electronic commerce (e-commerce) and
cloud computing have considerably increased the volume of data. Additionally, since the needs of enterprise analytics
are constantly growing, conventional architectures cannot satisfy the demands and, therefore, new and
enhanced architectures are necessary.
In this context, new challenges are encountered including storage, capture, processing, filtering,
analysis, search, sharing, visualization, querying and privacy of the very large volumes of data.
Data mining without exposing sensitive personal information is another challenging field to be
investigated.
Conventional Data vs Big Data
1. Conventional data is generated at the enterprise level, whereas big data is generated outside the enterprise level.
2. Conventional database systems deal with structured data, whereas big data systems deal with structured, semi-structured and unstructured data.
3. Conventional data is generated per hour, per day or less frequently, whereas big data is generated far more frequently, often every second.
4. A conventional data source is centralized and managed in centralized form, whereas a big data source is distributed and managed in distributed form.
5. Conventional database tools are sufficient to perform any database schema-based operation, whereas big data requires special kinds of database tools for schema-based operations.
6. Normal functions can manipulate conventional data, whereas special kinds of functions are needed to manipulate big data.
7. The conventional data model is strict-schema based and static, whereas the big data model is flat-schema based and dynamic.
8. Conventional data is easy to manage and manipulate, whereas big data is difficult to manage and manipulate.
INTELLIGENT DATA ANALYSIS
Intelligent data analysis is the scientific process of transforming data into insight for making
better decisions. Here, mathematical techniques are used to analyze complex situations,
giving the power to make effective decisions and build more productive systems based on:
4. Estimates of Risk
Business firms may commonly apply analytics to business data to describe, predict and improve
business performance.
Example areas are: retail analytics, store assessment and stock keeping, marketing, web
analytics, sales force sizing, price modeling, credit risk analysis and fraud analytics.
The goal of Data Analytics is to get actionable insights resulting in smarter decisions and better
business outcomes.
1. Predictive (forecasting)
PREDICTIVE ANALYTICS
Predictive analysis turns data into valuable, actionable information. Predictive analytics
uses data to determine the probable future outcome of an event or the likelihood of a situation
occurring.
Predictive models find patterns in historical and transactional data to identify risks and
opportunities. The models capture relationships among many factors to assess the risk.
1. Predictive Modelling
3. Transaction Profiling.
For an organization that offers multiple products, predictive analysis can help analyze
customers’ spending, usage and other behaviour, leading to efficient cross-selling, i.e. selling additional
products to current customers (a small sketch follows). This leads to higher profitability per customer and stronger
customer relationships.
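A minimal sketch of this cross-sell idea, assuming scikit-learn is available and using made-up customer features (monthly spend and products owned); the model estimates the likelihood that an existing customer buys an additional product.

```python
from sklearn.linear_model import LogisticRegression

# Historical data (hypothetical): [monthly_spend, products_owned] per customer,
# and whether that customer later bought an additional product (1) or not (0).
X = [[120, 1], [300, 2], [80, 1], [450, 3], [60, 1], [500, 2]]
y = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)

# Score a current customer to prioritise a cross-sell offer.
prob = model.predict_proba([[250, 2]])[0][1]
print(f"probability of buying an additional product: {prob:.2f}")
```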
DESCRIPTIVE ANALYTICS
Descriptive modelling tools can be utilized to develop further models that can simulate a large
number of individuals and make predictions. For example, descriptive analysis examines historical
electricity usage data to help plan power needs and allow electric companies to set optimal prices, as sketched below.
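A small descriptive-analytics sketch, assuming pandas and hypothetical electricity usage records; it summarizes historical usage per month, the kind of summary that feeds the planning and pricing decisions mentioned above.

```python
import pandas as pd

# Hypothetical historical electricity usage (kWh) per customer and month.
usage = pd.DataFrame({
    "month":    ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "customer": ["A", "B", "A", "B", "A", "B"],
    "kwh":      [320, 410, 300, 395, 350, 420],
})

# Descriptive summary: total and average usage per month.
summary = usage.groupby("month")["kwh"].agg(["sum", "mean"])
print(summary)
```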
PRESCRIPTIVE ANALYTICS
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions
to benefit from the predictions and showing the decision maker the implications of each
decision option. Prescriptive analytics not only anticipates what will happen and when it will
happen, but also why it will happen.
Prescriptive analysis combines data, business rules and mathematical models.
The data may come from multiple sources, internal and external to the organization. The
data may also be structured data which includes numerical and categorical data, as well as
unstructured data such as text, images, audio and video data.
Business Rules define the business process and include constraints, preferences, policies,
best practices and boundaries.
Mathematical models are techniques derived from mathematical sciences, applied
statistics, machine learning, operations research and natural language processing.
One example is energy and utilities. Natural gas prices fluctuate depending upon supply, demand,
econometrics, geo-politics and weather conditions. Prescriptive analytics can accurately predict
prices by modelling internal and external variables simultaneously, and can also provide decision options
and show the impact of each option.
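A toy prescriptive sketch, assuming SciPy is installed: given hypothetical forecast gas prices for two suppliers, a demand requirement and capacity limits, a linear program recommends the cheapest purchasing mix, i.e. it suggests an action rather than just a prediction.

```python
from scipy.optimize import linprog

# Hypothetical forecast prices ($ per unit) from two suppliers.
prices = [3.2, 2.9]

# Decision: how much to buy from each supplier, minimizing total cost,
# subject to meeting demand of 100 units and a per-supplier capacity of 70.
result = linprog(
    c=prices,                       # objective: minimize total purchase cost
    A_ub=[[-1, -1]], b_ub=[-100],   # buy_1 + buy_2 >= 100  (demand constraint)
    bounds=[(0, 70), (0, 70)],      # capacity limits per supplier
)
print("recommended purchase:", result.x, "total cost:", result.fun)
```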
NATURE OF DATA
In BigData, data could be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
What is Structured Data?
Any data that can be stored, accessed and processed in a fixed format (e.g. a table) is termed
'structured' data.
Over the years, techniques have been developed for working with such data (where the format is well known in
advance) and for deriving value out of it.
One main issue with such data these days is that its size has grown to a huge extent, with typical sizes now in
the range of multiple zettabytes. That is why the name Big Data is given; imagine the challenges involved in its
storage and processing.
Data stored in a relational database management system (RDBMS) is one example of 'structured'
data.
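A small sketch of 'structured' data using Python's built-in sqlite3 module: the table has a fixed schema (known columns and types), which is exactly what makes such data straightforward to store and query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # in-memory relational database
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 52000.0)")
conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 61000.0)")

# The fixed schema makes queries simple and predictable.
for row in conn.execute("SELECT name, salary FROM employee WHERE salary > 55000"):
    print(row)
```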
Unstructured Data
Any data with unknown form or structure is classified as unstructured data.
In addition to the size being huge, un-structured data poses multiple challenges in terms of its
processing for deriving value out of it.
A typical example of unstructured data is a heterogeneous data source containing a combination of
simple text files, images, videos etc.
Nowadays, organizations have a wealth of data available with them but, unfortunately, they don't
know how to derive value out of it since this data is in its raw or unstructured form.
Example of Unstructured data – Results returned by a search engine like 'Google Search'
Semi-structured Data
Semi-structured data can contain both forms of data and shares the characteristics of both.
Semi-structured data refers to data that is not captured or formatted in conventional ways. Semi-
structured data does not follow the format of a tabular data model or relational databases because
it does not have a fixed schema.
Examples of semi-structured data are data represented in XML, CSV or JSON files.
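A brief illustration of semi-structured data using Python's json module: each record carries its own field names, and the records need not share an identical, fixed schema.

```python
import json

# Two records with self-describing fields; note the second has extra fields
# and is missing 'phone' -- there is no rigid, table-like schema.
raw = '''
[
  {"name": "Asha", "phone": "98450-00000"},
  {"name": "Ravi", "email": "ravi@example.com", "city": "Chennai"}
]
'''
for record in json.loads(raw):
    print(record.get("name"), record.get("email", "no email on file"))
```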
BUSINESS UNDERSTANDING
The very first step consists of business understanding. Whenever any requirement occurs,
1. firstly we need to determine the business objective,
2. assess the situation,
3. determine data mining goals and then
4. produce the project plan as per the requirement.
5. Finally, Business objectives are defined in this phase.
DATA EXPLORATION
The second step consists of Data understanding.
1. For the further process, we need to gather initial data, describe and explore the data and
verify data quality to ensure it contains the data we require.
2. Data collected from the various sources is described in terms of its application and the
need for the project in this phase. This is also known as data exploration.
Data exploration is an essential step to verify the quality of the data collected.
DATA PREPARATION
1. From the data collected in the last step, we need to select data as per the need, clean it,
construct it to get useful information and then integrate it all.
2. Finally, we need to format the data to get the appropriate data.
3. Data is selected, cleaned, and integrated into the format finalized for the analysis in this
phase.
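A minimal data-preparation sketch, assuming pandas and hypothetical sales records: select the needed columns, clean missing values and format the types for analysis, mirroring steps 1-3 above.

```python
import pandas as pd

raw = pd.DataFrame({
    "order_id": [101, 102, 103],
    "amount":   [250.0, None, 480.0],       # missing value to clean
    "date":     ["2023-01-05", "2023-01-07", "2023-01-09"],
    "comments": ["ok", "call back", ""],    # column not needed for analysis
})

prepared = (
    raw[["order_id", "amount", "date"]]                    # 1. select relevant data
       .assign(amount=raw["amount"].fillna(0.0))           # 2. clean missing values
       .assign(date=lambda d: pd.to_datetime(d["date"]))   # 3. format types
)
print(prepared.dtypes)
```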
ANALYZE DATA
1. The next step is to Analyze. The cleaned data is used for analyzing and identifying trends. It
also performs calculations and combines data for better results.
2. Here, a data model is built to
– analyze relationships between the various selected objects in the data;
– test cases are built for assessing the model, and the model is tested and
implemented on the data in this phase.
3. Where is processing hosted?
– Distributed Servers / Cloud (e.g. Amazon EC2)
4. Where is data stored?
– Distributed Storage (e.g. Amazon S3)
5. What is the programming model?
– Distributed Processing (e.g. MapReduce)
6. How is data stored & indexed?
– High-performance schema-free databases (e.g. MongoDB)
7. What operations are performed on data?
– Analytic / Semantic Processing
8. Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– Graph model
– Collective model
9. Other BDA tools
– SAS
– R
– Hadoop
DEPLOYMENT
The final step is Act. After a presentation is given based on your data model, the
stakeholders discuss whether to move forward or not. If they agree with your recommendations,
they move ahead with your solutions. If they don’t agree with your findings, you will have to dig
deeper to find more possible solutions. Every step then has to be revisited: we repeat
each step to see whether there are any gaps. The data collected must be reviewed
to see if there is any bias, and to identify options. After the gaps are identified and the data is
analyzed, a presentation is given again.
enterprise reporting, dashboards, ad-hoc analysis, scorecards, and what-if scenario analysis on an
integrated, enterprise scale platform.
• In-Database Analytics include a variety of techniques for finding patterns and relationships in
your data. Because these techniques are applied directly within the database, you eliminate data
movement to and from other analytical servers, which accelerates information cycle times and
reduces total cost of ownership.
• Hadoop is useful for pre-processing data to identify macro trends or find nuggets of information,
such as out-of-range values. It enables businesses to unlock potential value from new data using
inexpensive commodity servers. Organizations primarily use Hadoop as a precursor to advanced
forms of analytics.
• Decision Management includes predictive modeling, business rules, and self-learning to take
informed action based on the current context. This type of analysis enables individual
recommendations across multiple channels, maximizing the value of every customer interaction.
Oracle Advanced Analytics scores can be integrated to operationalize complex predictive analytic
models and create real-time decision processes.
There are hundreds of data analytics tools in the market today, but selecting the right tool
depends upon your business needs, goals and the variety of data, so as to move the business in the
right direction. The top 10 analytics tools in big data are listed below.
APACHE Hadoop
1. It’s a Java-based open-source platform that is being used to store and process big data.
2. It is built on a cluster system that allows data to be processed efficiently and in parallel. It can
process both structured and unstructured data, scaling from one server to multiple computers
(a MapReduce word-count sketch follows this list).
3. Hadoop also offers cross-platform support for its users.
4. Today, it is the best big data analytic tool and is popularly used by many tech giants such as
Amazon, Microsoft, IBM, etc.
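Since Hadoop's processing model is MapReduce, the sketch below is a word-count example in plain Python that simulates the map, shuffle and reduce phases locally; on a real cluster Hadoop distributes these phases across machines (for example via Hadoop Streaming), so treat this purely as an illustration of the model.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit (word, 1) pairs for each word in a line of input."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)

# Simulate the shuffle step locally (on a cluster, Hadoop does this for us).
lines = ["big data needs big storage", "big data needs parallel processing"]
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

print(dict(reducer(w, c) for w, c in grouped.items()))
```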
Cassandra
1. APACHE Cassandra is an open-source NoSQL distributed database that is used to fetch large
amounts of data.
2. It’s one of the most popular tools for data analytics and has been praised by many tech
companies due to its high scalability and availability without compromising speed and
performance.
3. It is capable of delivering thousands of operations every second and can handle petabytes
of resources with almost zero downtime.
4. It was created by Facebook back in 2008 and was later released publicly as open source.
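A minimal sketch of using Cassandra from Python, assuming the DataStax cassandra-driver package, a locally running node, and an existing keyspace and table named demo.users (all assumptions).

```python
from cassandra.cluster import Cluster

# Connect to a locally running Cassandra node (hypothetical setup).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")   # 'demo' keyspace assumed to exist

# Insert and read back a row from the assumed users(id, name) table.
session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Asha"))
for row in session.execute("SELECT id, name FROM users"):
    print(row.id, row.name)

cluster.shutdown()
```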
Qubole
1. It’s an open-source big data tool that helps in fetching data in the value chain using ad-hoc
analysis and machine learning.
2. Qubole is a data lake platform that offers end-to-end service with reduced time and effort
which are required in moving data pipelines.
3. It is capable of configuring multi-cloud services such as AWS, Azure, and Google Cloud.
4. It also helps in lowering the cost of cloud computing by 50%.
Xplenty
1. It is a data analytic tool for building a data pipeline by using minimal codes in it.
2. It offers a wide range of solutions for sales, marketing, and support.
3. With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc.
4. The best part of using Xplenty is its low investment in hardware & software, and it offers
support via email, chat, phone and virtual meetings.
5. Xplenty is a platform to process data for analytics over the cloud and consolidate all the data
together.
APACHE Spark
1. APACHE Spark is a framework that is used to process data and perform numerous tasks on
a large scale.
2. It is also used to process data via multiple computers with the help of distributing tools.
3. It is widely used among data analysts as it offers easy-to-use APIs that provide easy data
pulling methods and it is capable of handling multi-petabytes of data as well.
4. Recently, Spark set a record by processing 100 terabytes of data in just 23 minutes, breaking
Hadoop’s previous world record (71 minutes). This is why big tech giants are now moving
towards Spark, and it is highly suitable for ML and AI today.
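A short PySpark sketch, assuming the pyspark package and a hypothetical events.csv file; it shows the easy-to-use DataFrame API mentioned above, with Spark distributing the work across the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Read a (hypothetical) CSV of events and count events per type in parallel.
events = spark.read.csv("events.csv", header=True)
events.groupBy("event_type").count().show()

spark.stop()
```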
Mongo DB
1. MongoDB, which came into the limelight in 2010, is a free, open-source, document-oriented
(NoSQL) database that is used to store a high volume of data.
2. It uses collections and documents for storage; a document consists of key-value pairs,
which are the basic unit of MongoDB.
3. It is popular among developers due to its support for multiple programming languages
such as Python, JavaScript, and Ruby.
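A small sketch using the pymongo driver against a local MongoDB instance (an assumption); it shows documents as sets of key-value pairs, the basic unit mentioned above.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # local server assumed
orders = client["shop"]["orders"]                   # database and collection

# A document is just a set of key-value pairs; no fixed schema is required.
orders.insert_one({"order_id": 101, "customer": "Asha", "amount": 250.0})

for doc in orders.find({"amount": {"$gt": 100}}):
    print(doc["order_id"], doc["amount"])
```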
Apache Storm
1. Apache Storm is a robust, user-friendly tool used for data analytics, especially in small companies.
2. The best part about Storm is that it has no programming-language barrier and can
support any of them.
3. It was designed to handle large pools of data in a fault-tolerant and horizontally scalable
way.
4. Storm leads the chart because of its distributed real-time big data processing system,
which is why many tech giants use APACHE Storm in their systems today.
5. Some of the most notable names are Twitter, Zendesk, NaviSite, etc.
SAS
1. Today it is one of the best tools for statistical modeling used by data analysts.
2. By using SAS, a data scientist can mine, manage, extract or update data in different variants
from different sources.
3. Statistical Analytical System or SAS allows a user to access the data in any format (SAS
tables or Excel worksheets).
4. It also offers a cloud platform for business analytics called SAS Viya; to get a strong grip
on AI & ML, they have also introduced new tools and products.
Data Pine
1. Datapine is an analytics tool used for BI and was founded back in 2012 in Berlin, Germany.
2. In a short period of time, it has gained much popularity in a number of countries and it’s
mainly used for data extraction (for small-medium companies fetching data for close
monitoring).
3. With the help of its enhanced UI design, anyone can visit and check the data as per their
requirements; it is offered in 4 different price brackets, starting from $249 per month.
4. They do offer dashboards by functions, industry, and platform.
Rapid Miner
1. It’s a fully automated visual workflow design tool used for data analytics.
2. It’s a no-code platform and users aren’t required to code for segregating data.
3. Today, it is being heavily used in many industries such as ed-tech, training, research, etc.
4. Though it is an open-source platform, it has a limitation of 10,000 data rows and a
single logical processor.
5. With the help of Rapid Miner, one can easily deploy their ML models to the web or mobile.
ANALYSIS VS REPORTING
Analytics and reporting can help a business improve operational efficiency and production in
several ways. Analytics is the process of making decisions based on the data presented, while
reporting is used to make complicated information easier to understand.
Analytics is the technique of examining data and reports to obtain actionable insights that can
be used to make better decisions and improve business performance.
The steps involved in data analytics are as follows:
On the other hand, reporting is the process of presenting data from numerous sources clearly
and simply. The procedure is always carefully set out to report correct data and avoid
misunderstandings. Today’s reporting applications offer cutting-edge dashboards with
advanced data visualization features. Companies produce a variety of reports, such as financial
reports, accounting reports, operational reports, market studies, and more.
Analytics and reporting can significantly benefit your business. If you want to use both to their full
potential, knowing the difference between the two is important. Some key differences are:
Analytics vs Reporting
1. Analytics is the method of examining and analyzing summarized data to make business decisions, whereas reporting is an action that includes all the needed information and data, put together in an organized way.
2. The purpose of analytics is to draw conclusions based on data, whereas the purpose of reporting is to organize the data into meaningful information.