Big Data Components
Big Data
P.SRIDEVI
DEPT OF CSE-IT
UNIT1
Introduction to Big Data Platform
Challenges of Conventional Systems
Intelligent data analysis
Nature of Data
Analytic Processes and Tools
Analysis vs Reporting
1. Data ingestion
• The ingestion layer is the very first step: pulling in raw data.
• Data comes from a variety of sources:
• internal sources,
• relational databases,
• non-relational databases,
• social media,
• emails,
• phone calls/mobile apps, etc.
Types of ingestion
• There are two kinds of ingestion:
• Batch, in which large groups of data are gathered and delivered together.
• A batch layer (cold path) stores all of the incoming data in its raw form
and performs batch processing on the data. The result of this processing
is stored as a batch view.
• Streaming, which is a continuous flow of data. This is necessary for real-time data analytics.
• A speed layer (hot path) analyzes data in real time. This layer is designed
for low latency (minimum delay), at the expense of accuracy.
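The batch (cold path) and speed (hot path) layers can be contrasted with a small sketch. This is illustrative only; the function names (`process_batch`, `process_stream`) are hypothetical, not a real framework API.

```python
# Hypothetical sketch: the same numeric readings handled two ways.

def process_batch(records):
    """Batch layer (cold path): gather all data, then compute one batch view."""
    total = sum(records)
    return {"count": len(records), "total": total, "avg": total / len(records)}

def process_stream(record, state):
    """Speed layer (hot path): update a running view per record, low latency."""
    state["count"] += 1
    state["total"] += record
    state["avg"] = state["total"] / state["count"]
    return state

readings = [10, 20, 30, 40]

# Batch: one result, available only after all data has arrived.
batch_view = process_batch(readings)

# Streaming: an up-to-date result is available after every single record.
state = {"count": 0, "total": 0, "avg": 0.0}
for r in readings:
    state = process_stream(r, state)

print(batch_view)   # {'count': 4, 'total': 100, 'avg': 25.0}
print(state)        # same numbers, but computed incrementally
```

Both paths end at the same answer here; the difference is *when* a result is available and how much data must be held before computing it.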
Data sources.
• All big data solutions start with one or more data sources.
• Examples include:
• Relational databases -- application data stores.
• Web server log files -- static files produced by applications.
• IoT devices -- real-time data sources.
Batch processing (open-source software commonly used: Spark)
• Because the data sets are so large, often a big data solution must
process data files using batch jobs to filter, aggregate, and otherwise
prepare the data for analysis.
• Usually these jobs involve reading source files, processing them, and
writing the output to new files.
• Options include running U-SQL (a single language to process data in any
format) jobs in Azure Data Lake Analytics, running jobs in an HDInsight
Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight
Spark cluster.
• Azure HDInsight is a fully managed cloud service that makes it easy,
fast, and cost-effective to process massive amounts of data. It uses the
most popular open-source frameworks such as Hadoop, Spark, Hive,
Kafka, Storm, HBase, Microsoft ML Server, and more.
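The "read source files, process them, write the output to new files" pattern can be sketched in plain Python (not U-SQL or Spark). The CSV layout (`city,temp`) and the function name `run_batch_job` are assumptions for illustration.

```python
# Minimal batch-job sketch: read a source file, filter bad rows,
# aggregate (mean temperature per city), write the result to a new file.
import csv
import os
import tempfile
from collections import defaultdict

def run_batch_job(src_path, out_path):
    totals, counts = defaultdict(float), defaultdict(int)
    with open(src_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 2:              # filter: drop malformed rows
                continue
            city, temp = row[0], float(row[1])
            totals[city] += temp
            counts[city] += 1
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for city in sorted(totals):        # aggregate: mean per city
            writer.writerow([city, totals[city] / counts[city]])

tmp = tempfile.mkdtemp()
src, out = os.path.join(tmp, "in.csv"), os.path.join(tmp, "out.csv")
with open(src, "w") as f:
    f.write("hyd,30\nhyd,32\ndelhi,28\nbadrow\n")

run_batch_job(src, out)
print(open(out).read())   # one line per city: city,mean_temp
```

A real cluster job distributes the same filter/aggregate steps across many machines and files, but the shape of the work is the same.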
Real-time message ingestion.
• If the solution includes real-time sources, the architecture must include a way to
capture and store real-time messages for stream processing.
• This might be a simple data store, where incoming messages are dropped into a
folder for processing.
• However, many solutions need a message ingestion store to act as a buffer for
messages, and to support scale-out processing, reliable delivery, and other
message queuing semantics.
• This streaming architecture is often referred to as stream buffering. Options
include Azure Event Hubs.
• (Azure Event Hubs is a big data streaming platform and event ingestion service.
It can receive and process millions of events per second. Data sent to an event
hub can be transformed and stored by using any real-time analytics provider or
batching/storage).
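The idea of a message ingestion store acting as a buffer can be sketched with an in-memory queue. This is a stand-in for a service like Event Hubs, not its API; the class and method names are invented for illustration.

```python
# Sketch of a message buffer: producers enqueue events at their own rate,
# consumers drain them later in batches, decoupling the two sides.
from collections import deque

class MessageBuffer:
    def __init__(self):
        self._queue = deque()

    def publish(self, event):
        """Producer side: fast append, never blocks on the consumer."""
        self._queue.append(event)

    def consume(self, max_batch=10):
        """Consumer side: drain up to max_batch buffered messages."""
        batch = []
        while self._queue and len(batch) < max_batch:
            batch.append(self._queue.popleft())
        return batch

buf = MessageBuffer()
for i in range(25):                      # burst of incoming messages
    buf.publish({"event_id": i})

first = buf.consume(max_batch=10)        # consumer processes at its own pace
print(len(first), first[0])              # 10 {'event_id': 0}
```

A production ingestion store adds what this sketch lacks: durable storage, reliable delivery guarantees, and scale-out across partitions and consumers.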
2. Data storage (Data warehouse vs data lake)
• Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats.
• This kind of store is often called a data lake. Options for implementing
this storage include Azure Data Lake Store or blob containers in Azure Storage.
• Storage is where the converted data is stored in a data lake or
warehouse and eventually processed.
• The data lake/warehouse is the most essential component of a big data
ecosystem.
• The data stored there should be thorough and relevant, so that the
insights drawn from it are as valuable as possible.
• It must be efficient with as little redundancy as possible to allow for
quicker processing.
Azure Data Lake
• Azure Data Lake is a big data solution based on multiple cloud
services in the Microsoft Azure ecosystem.
• It allows organizations to ingest multiple data sets, including
structured, unstructured, and semi-structured data, into an infinitely
scalable data lake enabling storage, processing, and analytics.
DWH vs Data MART vs Data Lake
• Data warehouses, data lakes, and data marts are different cloud
storage solutions.
• A data warehouse stores data in a structured format. It is a central
repository of pre-processed data for analytics and business
intelligence.
• A data mart is a data warehouse that serves the needs of a specific
business unit, like a company’s finance, marketing, or sales
department.
• A data lake is a central repository for raw data and unstructured data.
You can store data first and process it later on.
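The warehouse/lake contrast above can be made concrete with a tiny sketch: a lake keeps the raw payload untouched ("store first, process later"), while a warehouse stores rows already shaped to a schema. The file-free list representation and the helper name `to_warehouse_row` are assumptions for illustration.

```python
import json

# A raw, semi-structured event as it arrives from a source system.
raw_event = '{"user": "u1", "action": "click", "extra": {"x": 1}}'

# Data lake: store the raw record as-is, untouched.
lake = [raw_event]

# Data warehouse: pre-process into a fixed schema (user, action) before storing.
def to_warehouse_row(raw):
    d = json.loads(raw)
    return {"user": d["user"], "action": d["action"]}

warehouse = [to_warehouse_row(raw_event)]

print(warehouse[0])   # {'user': 'u1', 'action': 'click'}
```

Note what the warehouse row dropped: the `extra` field is gone, which is exactly the trade-off: the warehouse is faster to query but less complete than the lake.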
3. Big data analytics
• In the analysis layer, data gets passed through several tools, shaping it
into actionable insights.
• There are four types of analytics on big data :
• Diagnostic: Explains why a problem is happening.
• Descriptive: Describes the current state of a business through
historical data.
• Predictive: Projects future results based on historical data.
• Prescriptive: Takes predictive analytics a step further by
recommending the best course of action for the future.
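The four analytics types can be contrasted on a toy data set. The numbers and the naive forecasting rule below are invented for illustration; real analytics uses far richer models.

```python
# Toy monthly sales figures, to show how the four analytics types differ in intent.
sales = [100, 120, 140, 160]

# Descriptive: summarize the historical/current state.
descriptive = sum(sales) / len(sales)                     # average sales

# Diagnostic: explain *why* -- here, inspect month-over-month changes.
diagnostic = [b - a for a, b in zip(sales, sales[1:])]    # steady +20 growth

# Predictive: project forward from the historical trend (naive forecast).
predictive = sales[-1] + diagnostic[-1]

# Prescriptive: recommend an action based on the prediction.
prescriptive = "increase stock" if predictive > sales[-1] else "hold stock"

print(descriptive, diagnostic, predictive, prescriptive)
# 130.0 [20, 20, 20] 180 increase stock
```

Each step builds on the previous one: description gives the state, diagnosis gives the cause, prediction gives the outlook, and prescription turns the outlook into an action.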
Big data solutions typically deal with the following
types of workload:
• These days, organizations are realising the value they get out of big
data analytics, and hence they are deploying big data tools and
processes to bring more efficiency to their work
environment (e.g. Secoda, Collibra).
• Collibra is a data catalog platform and tool that helps organizations
better understand and manage their data assets. Collibra helps
create an inventory of data assets, capture information (metadata)
about them, and govern these assets.
• Secoda is a tool for writing queries to search company data.
Challenges of conventional systems
• Big data is the storage and analysis of large data sets.
• These are complex data sets that can be both structured or unstructured.
• They are so large that it is not possible to work on them with traditional
analytical tools.
• One of the major challenges of conventional systems was the uncertainty of
the Data Management.
• Big data is continuously expanding; new companies and
technologies are being developed every day (Google, Amazon, Netflix).
• Trusting the quality of data is difficult, and data security and privacy are challenges.
• Conventional systems are not designed to be user-friendly for data extraction.
• A big challenge for companies is to find out which technology works best for
them without the introduction of new risks and problems.
BIG DATA AS A SERVICE
• Big Data has created a demand for scalable, flexible and affordable data
management platforms to meet modern compute requirements.
• Big Data as a Service (BDaaS) integrates many of the functionalities and
benefits of SaaS, IaaS, PaaS and DaaS, and leverages additional resources
in the market for analyzing Big Data.
• Big Data as a Service encompasses the software, data warehousing,
infrastructure and platform service models in order to deliver advanced
analysis of large data sets, generally through a cloud-based network.
• It is a solution-based system designed to provide organizations with the
wide-ranging capabilities to gain insights from data.
Some data analytics tools
• Data analytics tools not only report the results of the data but also explain
why the results occurred to help identify weaknesses, fix potential problem
areas, alert decision-makers to unforeseen events and even forecast future
results based on decisions the company might make.
• R Programming (Leading Analytics Tool in the industry)
• Python
• Excel
• SAS
• Apache Spark
• Splunk
• RapidMiner
• Tableau Public
Orchestration.
• Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data and
move data between multiple sources and sinks.
• The workflows then load the processed data into an analytical data store, or
push the results straight to a report or dashboard. To automate these
workflows, an orchestration technology such as Azure Data Factory or
Apache Oozie (with Sqoop) is used.
• Workflow:
Source data -> move data between sources and sinks -> load processed
data for analytics -> display the results on a dashboard.
A big data architecture is designed to handle the ingestion, processing,
and analysis of data that is too large or complex for traditional database
systems.
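The workflow above can be sketched as a tiny orchestrator that runs named steps in order, passing each step's output to the next. This is a stand-in for what Data Factory or Oozie do, not their API; all names below are invented for illustration.

```python
# Each pipeline stage is a plain function; the orchestrator chains them.

def extract():
    """Source data (here: an unsorted in-memory list)."""
    return [3, 1, 2]

def move(data):
    """Move data between sources and sinks (here: sort while transferring)."""
    return sorted(data)

def load(data):
    """Load processed data into an analytical store (here: a dict)."""
    return {"rows": data, "count": len(data)}

def report(store):
    """Push results to a report/dashboard (here: a summary string)."""
    return f"dashboard: {store['count']} rows, first={store['rows'][0]}"

def orchestrate(steps):
    data = None
    for step in steps:
        data = step(data) if data is not None else step()
    return data

result = orchestrate([extract, move, load, report])
print(result)   # dashboard: 3 rows, first=1
```

Real orchestrators add scheduling, retries on failure, and dependency graphs between steps, but the core idea is the same chain of transformations.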
Stream processing.
• After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis.
• The processed stream data is then written to an output sink. Azure
Stream Analytics provides a managed stream processing service based
on perpetually running SQL queries that operate on unbounded streams.
• Alternatively, open-source Apache streaming technologies such as Storm and
Spark Streaming can be used in an HDInsight cluster.
• Azure HDInsight is a service offered by Microsoft that enables us to use
open-source frameworks for big data analytics.
• Azure HDInsight allows the use of frameworks like Hadoop, Apache
Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, R, etc., for
processing large volumes of data.
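The filter-and-aggregate step over an unbounded stream can be sketched with a tumbling window. This mimics the shape of a windowed streaming query, not Stream Analytics SQL; the window size and the rule that negative values are "invalid" are assumptions for illustration.

```python
# Stream processing sketch: filter invalid readings, then aggregate
# per tumbling window of 3 events, emitting a result as each window closes.

def stream_process(events, window_size=3):
    window, results = [], []
    for value in events:
        if value < 0:                          # filter: drop invalid readings
            continue
        window.append(value)
        if len(window) == window_size:         # window full: emit an aggregate
            results.append(sum(window) / window_size)
            window = []                        # tumbling window: start fresh
    return results

readings = [10, -5, 20, 30, 40, 50, 60]        # -5 is filtered out
print(stream_process(readings))                # [20.0, 50.0]
```

Because results are emitted per window rather than once at the end, downstream sinks see fresh aggregates continuously, which is the defining property of the hot path.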
Analysis and reporting.
• The goal of most big data solutions is to provide insights into the data
through analysis and reporting. To empower users to analyze the data,
the architecture may include a data modelling layer, such as a
multidimensional OLAP cube or tabular data model in Azure Analysis
Services.
• It might also support self-service BI, using the modelling and
visualization technologies in Microsoft Power BI or Microsoft Excel.
• Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios,
many Azure services support analytical notebooks, such as Jupyter,
enabling these users to leverage their existing skills with Python or R.
Reporting vs Analytics
Reporting | Analysis
Backed by a data warehouse (DWH) | Backed by a data lake
A DWH contains structured data | A data lake contains unstructured data
Collects data from many sources (flat files, spreadsheets, DBs, apps, etc.) in one place | Collects all kinds of data (structured and unstructured)
Reporting presents the actual data to end-users, after collecting, sorting and summarizing it, to make it easy to understand | Analytics does not present the data, but instead draws information from the available data and uses it to generate insights, forecasts and recommended actions