Unit-I (Big Data)
Information: Processed data is called information. When raw facts and figures are
processed and arranged in some proper order, they become information. Information
has a proper meaning and is useful in decision-making. In other words, information is
data that has been processed in such a way as to be meaningful to the person who
receives it.
Examples of information:
• Student address labels – Stored student data can be used to print address labels, which
are then used to send any intimation/information to students at their home addresses.
• Student examination results – In an examination system, the collected data (marks
obtained in each subject) is processed to get a student's total obtained marks. The total
obtained marks are information, and they are also used to prepare the student's result card.
• Survey report – Survey data is summarized into reports/information presented to the
management of the company. The management takes important decisions on the basis of
the data collected through surveys.
Units of data: When dealing with big data, we work with units such as megabytes,
gigabytes, terabytes, and beyond. Here is the system of units used to represent data
(a small conversion sketch follows the list):
The Bit
The Byte (8 bits)
Kilobyte (1,024 Bytes)
Megabyte (1,024 Kilobytes)
Gigabyte (1,024 Megabytes, or 1,048,576 Kilobytes)
Terabyte (1,024 Gigabytes)
Petabyte (1,024 Terabytes, or 1,048,576 Gigabytes)
Exabyte (1,024 Petabytes)
Zettabyte (1,024 Exabytes)
Yottabyte (1,024 Zettabytes)
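The 1,024-factor relationship between these units can be illustrated with a small helper function. This is only an illustrative Python sketch, not part of any big data tool; the example byte counts reuse figures quoted later in this unit.

```python
# Illustrative sketch: convert a raw byte count into the units listed above.
UNITS = ["Bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Return the byte count expressed in the largest suitable unit."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024  # each unit is 1,024 times the previous one

print(human_readable(500 * 1024**4))  # 500.00 TB (Facebook's daily ingest, see below)
print(human_readable(10 * 1024**4))   # 10.00 TB (one jet engine, 30 minutes of flight)
```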
BIG DATA
The term has been in use since the 1990s, with some giving credit to John Mashey for
popularizing it.
Big Data is also data, but of a huge size. Big data is a term that describes a large volume
of data.
"Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making."
Big data is a collection of data sets so large and complex that it becomes difficult to
process them using on-hand database system tools or traditional data-processing
applications.
Social Media: Statistics show that 500+ terabytes of new data get ingested into the
databases of the social media site Facebook every day. This data is mainly generated
from photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes.
Structured Data:
In a structured format you have a proper schema for your data, so you know which
columns will be there; essentially it is a tabular format.
Semi-Structured Data:
Semi-structured data can contain both forms of data. It appears structured in form, but it
is not defined by, for example, a table definition in a relational DBMS.
Examples of semi-structured data are data represented in eXtensible Markup Language
(XML) files, JavaScript Object Notation (JSON), Comma-Separated Values (CSV) files,
and e-mail, where the schema is not defined properly.
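As a small illustration (the records and field names below are invented for the example), the following Python snippet reads JSON records whose fields vary from record to record, which is exactly what makes such data semi-structured rather than tabular:

```python
import json

# Hypothetical semi-structured records: the fields differ from record to
# record, unlike a relational table where every row follows one fixed schema.
records = [
    '{"id": 1, "name": "Asha", "email": "asha@example.com"}',
    '{"id": 2, "name": "Ravi", "phones": ["98765", "91234"]}',
]

for raw in records:
    doc = json.loads(raw)                              # parse the JSON text
    print(doc["id"], doc.get("email", "no email field present"))
```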
Characteristics of Big Data (the five V's):
Volume
Variety
Velocity
Value
Veracity
Volume:
Volume refers to the unimaginable amounts of information generated every second from
social media, cell phones, cars, credit cards, M2M sensors, images, video, and more. We
currently use distributed systems to store data in several locations, brought together by a
software framework like Hadoop.
Facebook alone generates billions of messages, records about 4.5 billion clicks of the
"Like" button, and receives over 350 million new posts each day. Such a huge amount of
data can only be handled by big data technologies.
The volume of data is rising exponentially: in 2016 only about 8 ZB of data was created,
and it was expected to rise to about 40 ZB by 2020, which is extremely large.
Variety:
As discussed before, big data is generated in multiple varieties. Compared to traditional
data like phone numbers and addresses, the latest trend of data is in the form of photos,
videos, audio, and much more, making about 80% of the data completely unstructured.
Velocity:
Velocity is the speed at which data is generated and processed. At first, mainframes were
used and relatively few people used computers. Then came the client/server model and
more and more computers. After this, web applications came into the picture and spread
across the Internet, and everyone began using these applications. These applications were
then accessed from more and more devices such as mobiles, as they were very easy to
access. Hence, a lot of data!
Every 60 seconds, an enormous amount of data is generated across these applications.
Value:
Here our fourth V comes in, which deals with a mechanism to bring out the correct
meaning of data. First of all, you need to mine the data, i.e., turn raw data into useful
data. Then, an analysis is done on the data that you have cleaned or retrieved from the
raw data. Finally, you need to make sure that whatever analysis you have done benefits
your business, such as finding insights and results that were not possible earlier.
You need to make sure that whatever raw data you are given, you have cleaned it so it
can be used for deriving business insights. After you have cleaned the data, a challenge
pops up: during the process of moving such a huge amount of data, some records might
be lost. To resolve this issue, our next V comes into the picture.
Veracity:
Since records can get lost during processing, we may need to start again from the stage
of mining raw data in order to convert it into valuable data, and this process goes on.
There will also be uncertainties and inconsistencies in the data. To overcome this, our
last V comes into place: Veracity. Veracity means the trustworthiness and quality of
data, and it is necessary that the veracity of the data is maintained. For example, think
about Facebook posts with hashtags, abbreviations, images, videos, etc., which can make
them unreliable and hamper the quality of their content. Collecting loads and loads of
data is of no use if the quality and trustworthiness of the data is not up to the mark.
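As a small illustration of the Value and Veracity ideas (all records below are invented), a minimal cleaning and quality-check pass over raw data might look like this:

```python
# Invented raw records; None marks a missing value (low veracity).
raw = [
    {"user": "a1", "likes": 10},
    {"user": "a1", "likes": 10},     # duplicate record
    {"user": "b2", "likes": None},   # incomplete record
    {"user": "c3", "likes": 7},
]

seen, clean = set(), []
for rec in raw:
    key = (rec["user"], rec["likes"])
    if rec["likes"] is None or key in seen:
        continue                     # drop untrustworthy or duplicate data
    seen.add(key)
    clean.append(rec)

# The "value" step: analyze only the cleaned data.
print(f"kept {len(clean)} of {len(raw)} records")
print("average likes:", sum(r["likes"] for r in clean) / len(clean))
```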
Importance of Big Data:
Big data analytics is indeed a revolution in the field of information technology, and the
use of data analytics by companies increases every year. Big data has the properties of
high variety, volume, and velocity, and big data analytics involves the use of techniques
like machine learning, data mining, natural language processing, and statistics. With the
help of big data, multiple operations can be performed on a single platform: you can
store terabytes of data, pre-process it, analyze it, and visualize it with the help of a couple
of big data tools.
Data is extracted, prepared, and blended to provide analysis for businesses. Large
enterprises and multinational organizations use these techniques widely these days in
different ways.
Big data analytics helps organizations work with their data efficiently and use that data
to identify new opportunities. Different techniques and algorithms can be applied to make
predictions from data. Multiple business strategies can be applied for the future success
of the company, leading to smarter business moves, more efficient operations, and higher
profits.
Cost savings: Big data tools like Hadoop and cloud-based analytics can bring cost
advantages to a business when large amounts of data must be stored, and they can
identify more efficient ways of doing business.
Time reductions: The high speed of tools like Hadoop and in-memory analytics can
easily identify new sources of data, which helps businesses analyze data immediately and
make quick decisions based on what they learn.
Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’
purchasing behaviors, a company can find out the products that are sold the most and
produce products according to this trend. By this, it can get ahead of its competitors.
Control online reputation: Big data tools can do sentiment analysis, so you can get
feedback about who is saying what about your company. If you want to monitor and
improve the online presence of your business, big data tools can help with all of this.
Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing
Insights: Big data analytics can help change all business operations. This includes the
ability to match customer expectations, change the company's product line, and of course
ensure that marketing campaigns are powerful.
The marketing and advertising sector is able to perform more sophisticated analysis.
This involves observing online activity, monitoring point-of-sale transactions, and
ensuring on-the-fly detection of dynamic changes in customer trends. Gaining insights
into customer behavior requires collecting and analyzing customer data, using a similar
approach to that of marketers and advertisers. This results in the capability to achieve
focused and targeted campaigns.
A more targeted and personalized campaign means that businesses can save money and
ensure efficiency, because they target high-potential clients with the right products. Big
data analytics is good for advertisers since companies can use this data to understand
customers' purchasing behavior. Through predictive analytics, it is possible for
organizations to define their target clients.
Example:
Netflix is a good example of a big brand that uses big data analytics for targeted
advertising. With over 100 million subscribers, the company collects huge amounts of
data, which is the key to achieving the industry status Netflix boasts. If you are a
subscriber, you are familiar with how they send you suggestions for the next movie you
should watch. Basically, this is done using your past search and watch data, which gives
them insights into what interests the subscriber most.
Example:
You have probably heard of Amazon Fresh and Whole Foods. This is a perfect example
of how big data can help improve innovation and product development. Amazon
leverages big data analytics to move into a large market. Its data-driven logistics gives
Amazon the required expertise to enable the creation and achievement of greater value.
Focusing on big data analytics, Amazon Whole Foods is able to understand how
customers buy groceries and how suppliers interact with the grocer. This data gives
insights whenever there is a need to implement further changes.
Architecture for Handling Big Data:
Big data solutions typically involve one or more of the following types of workload:
Batch processing of big data sources at rest.
Real-time processing of big data in motion.
Interactive exploration of big data.
Predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
Data sources: All big data solutions start with one or more data sources. Examples
include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is typically stored in a distributed file
store that can hold high volumes of large files in various formats. This kind of store is
often called a data lake.
Batch processing: Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise prepare
the data for analysis. Usually these jobs involve reading source files, processing them, and
writing the output to new files.
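A minimal sketch of this read-process-write batch pattern, assuming a hypothetical web-server log file (created here just for the demo):

```python
from collections import Counter
from pathlib import Path

# Hypothetical source file: one web-server log line per request.
src = Path("access.log")
src.write_text("GET /home 200\nGET /cart 500\nGET /home 200\n")

status_counts = Counter()
for line in src.read_text().splitlines():   # read the source file
    method, path, status = line.split()
    status_counts[status] += 1              # aggregate per status code

# Write the aggregated output to a new file, as a batch job would.
Path("status_summary.txt").write_text(
    "\n".join(f"{s}\t{c}" for s, c in sorted(status_counts.items()))
)
print(dict(status_counts))                  # {'200': 2, '500': 1}
```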
Real-time message ingestion: If the solution includes real-time sources, the architecture
must include a way to capture and store real-time messages for stream processing. This
might be a simple data store, where incoming messages are dropped into a folder for
processing. However, many solutions need a message ingestion store to act as a buffer for
messages, and to support scale-out processing, reliable delivery, and other message
queuing semantics.
Stream processing: After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis. The processed stream
data is then written to an output sink. Azure Stream Analytics provides a managed stream
processing service based on perpetually running SQL queries that operate on unbounded
streams.
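The same filter-and-aggregate idea can be sketched in plain Python over a simulated, unbounded message stream; the sensors, values, and window size below are invented for illustration, and a real deployment would use a managed stream processing service such as Azure Stream Analytics instead:

```python
import itertools
import random

random.seed(0)  # make the demo repeatable

def message_stream():
    """Simulate an unbounded stream of incoming sensor messages."""
    for i in itertools.count():
        yield {"sensor": f"s{i % 3}", "value": random.uniform(0, 100)}

window, WINDOW_SIZE = [], 5
for msg in itertools.islice(message_stream(), 20):   # take 20 messages for the demo
    if msg["value"] < 10:
        continue                                     # filter out low readings
    window.append(msg["value"])
    if len(window) == WINDOW_SIZE:                   # aggregate per window
        print(f"window average: {sum(window) / WINDOW_SIZE:.1f}")
        window.clear()
```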
Analytical data store: Many big data solutions prepare data for analysis and then serve
the processed data in a structured format that can be queried using analytical tools. The
analytical data store used to serve these queries can be a Kimball-style relational data
warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively,
the data could be presented through a low-latency NoSQL technology such as HBase, or an
interactive Hive database that provides a metadata abstraction over data files in the
distributed data store.
Analysis and reporting: The goal of most big data solutions is to provide insights into the
data through analysis and reporting. To empower users to analyze the data, the architecture
may include a data modeling layer, such as a multidimensional OLAP cube or tabular data
model in Azure Analysis Services. It might also support self-service BI, using the
modeling and visualization technologies in Microsoft Power BI or Microsoft Excel.
Analysis and reporting can also take the form of interactive data exploration by data
scientists or data analysts.
Orchestration: Most big data solutions consist of repeated data processing operations,
encapsulated in workflows that transform source data, move data between multiple sources
and sinks, load the processed data into an analytical data store, or push the results straight
to a report or dashboard. To automate these workflows, you can use an orchestration
technology such as Azure Data Factory or Apache Oozie and Sqoop (a toy sketch of such
a chained workflow follows).
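In spirit, an orchestrated workflow is an ordered chain of such steps. The toy sketch below (all function names and data invented) only shows the ingest-transform-load-report sequence that an orchestrator such as Azure Data Factory or Oozie would schedule, retry, and monitor:

```python
# Toy workflow: each step is a function; a real orchestrator would also
# handle scheduling, retries, dependencies, and monitoring.
def ingest():
    return ["3", "1", "2", "bad"]                       # raw source data

def transform(raw):
    return sorted(int(x) for x in raw if x.isdigit())   # clean and convert

def load(rows):
    print("loaded into analytical store:", rows)
    return rows

def report(rows):
    print("dashboard total:", sum(rows))

def run_pipeline():
    report(load(transform(ingest())))                   # run the steps in order

run_pipeline()
```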
Big data challenges include storing, analyzing, and visualizing extremely large and
fast-growing data.
The constant stream of information from various sources is becoming more intense,
especially with advances in technology. This is where big data platforms come in, to
store and analyze the ever-increasing mass of information.
Currently, the marketplace is flooded with numerous open-source and commercially
available big data platforms. They offer different features and capabilities for use in a
big data environment.
Hadoop provides a set of tools and software that form the backbone of a big data
analytics system. The Hadoop ecosystem provides the necessary tools and software for
handling and analyzing big data, and on top of the Hadoop system many applications can
be developed and plugged in to provide an ideal solution for big data needs.
Apache Hadoop
Hadoop is an open-source programming framework and server software. It is employed
to store and analyze large data sets very fast with the assistance of thousands of
commodity servers in a clustered computing environment. In case of a server or hardware
failure, it replicates the data so that no data is lost.
This big data platform provides important tools and software for big data management,
and many applications can run on top of the Hadoop platform. While it can run on OS X,
Linux, and Windows, it is commonly deployed on Ubuntu and other variants of Linux.
CLOUDERA
Cloudera is a big data platform based on Apache’s Hadoop system. It can handle huge
volumes of data. Enterprises regularly store over 50 petabytes in this platform’s Data
Warehouse, which handles data such as text, machine logs, and more. Cloudera’s
DataFlow also enables real-time data processing.
AMAZON WEB SERVICES
Popularly known as AWS, this is another Hadoop-based big data platform from Amazon.
AWS is hosted in the cloud environment. Thus, businesses can employ AWS to manage
their big data analytics in the cloud. And through Amazon EMR (Elastic MapReduce),
enterprises can set up and effortlessly scale other big data platforms like Spark, Apache
Hadoop, and Presto.
ORACLE
Oracle is another big data platform with a cloud hosting environment. It can automatically
send data in different formats to cloud servers without downtime. It can also run on-
premise and in hybrid environments. This allows for data transformation and enrichment,
whether it’s live streaming or stored in a data lake. The platform offers a free tier as well.
SNOWFLAKE
This big data platform acts as a data warehouse for storing, processing, and analyzing data.
It is designed similarly to a SaaS product. This is because everything about its framework
is run and managed in the cloud. It runs fully atop public cloud hosting frameworks and
integrates with a new SQL query engine.
MapR
MapR is another big data platform, which uses a Unix-style file system for handling data
instead of HDFS, so the system is easy to learn for anyone familiar with Unix. This
solution integrates Hadoop, Spark, and Apache Drill with real-time data processing
features.
APACHE STORM
Apache Storm is the brainchild of the Apache Software Foundation. This big data
platform is used in real-time data analytics and distributed processing. It supports
virtually all programming languages and offers high scalability and fault tolerance. Big
data giants such as Yelp, Twitter, Yahoo, and Spotify use Apache Storm.
APACHE SPARK
Apache Spark is software that runs on top of Hadoop and provides APIs for real-time,
in-memory processing and analysis of large data sets stored in HDFS. It keeps data in
memory for faster processing, and runs programs up to 100 times faster in memory and
10 times faster on disk compared to MapReduce. Apache Spark exists to speed up the
processing and analysis of large data sets in a big data environment, and it is being
adopted very quickly by businesses to analyze their data sets and get real value out
of their data.
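A minimal PySpark sketch of this in-memory processing, assuming the pyspark package and a local Spark installation are available, and using the same two-line word-count input that appears in the MapReduce discussion below:

```python
# Minimal PySpark word count (assumes the pyspark package is available).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# In a real cluster this data would come from HDFS; here it is held in memory.
lines = spark.sparkContext.parallelize(["Hadoop is good", "Hadoop is bad"])

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit <word, 1> pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

print(counts.collect())   # e.g. [('Hadoop', 2), ('is', 2), ('good', 1), ('bad', 1)]
spark.stop()
```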
RDBMS vs HADOOP
Data warehouse:
A data warehouse essentially combines information from several sources into one
comprehensive database. Let's summarize what a data warehouse is:
Subject oriented
A data warehouse can be used to analyze a particular subject area like sales, finance, and
inventory. Each subject area contains detailed data. For example, to learn more about your
company's sales data, you can build a warehouse that concentrates on sales.
Using this warehouse, you can answer questions like "Who was our best customer for
this item last year?" This ability to define a data warehouse by subject matter, sales in
this case, makes the data warehouse subject oriented.
Integrated
A data warehouse integrates various heterogeneous data sources, such as RDBMSs, flat
files, and online transaction records. Data cleaning and integration must be performed
during data warehousing to ensure consistency in naming conventions, attribute types,
etc., among the different data sources.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed
from the source operational RDBMS. Operational updates of data do not occur in the
data warehouse, i.e., update, insert, and delete operations are not performed. It usually
requires only two operations for data access: the initial loading of data and read access to
data. Therefore, the data warehouse does not require transaction processing, recovery, or
concurrency control, which allows for a substantial speedup of data retrieval. Non-volatile
means that, once entered into the warehouse, data should not change.
Time-variant:
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older from a data warehouse.
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
1. MapReduce
MapReduce is a software framework and programming model used for processing huge
amounts of data. A MapReduce program works in two phases, namely Map and Reduce.
Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and
reduce the data.
Consider the following input data for a MapReduce program:
Hadoop is good
Hadoop is bad
MapReduce Working Process
The data goes through the following phases:
Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An
input split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a MapReduce program. In this phase, the
data in each split is passed to a mapping function to produce output values. In our
example, the job of the mapping phase is to count the number of occurrences of each
word in the input splits and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the
relevant records from the Mapping phase output. In our example, the same words are
clubbed together along with their respective frequencies.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines
the values from the Shuffling phase and returns a single output value per key. In short,
this phase summarizes the complete dataset.
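Using the two input lines above, the three phases can be simulated in plain Python (this is only a simulation of the idea, not the actual Hadoop Java API):

```python
from collections import defaultdict

splits = ["Hadoop is good", "Hadoop is bad"]   # the input splits

# Map phase: each split emits <word, 1> pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle phase: group the pairs by word.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the grouped counts into one value per word.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'Hadoop': 2, 'is': 2, 'good': 1, 'bad': 1}
```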
2. HDFS (Hadoop Distributed File System)
(Figure: HDFS architecture with NameNode, Secondary NameNode, and DataNodes)
Let's understand this with an example. Suppose we need to read 1 TB of data and we
have one machine with 4 I/O channels, each channel providing 100 MB/s; it takes
roughly 45 minutes to read the entire data. Now the same amount of data is read by 10
machines, each with 4 I/O channels of 100 MB/s. The time to read the data drops to
roughly 4.3 minutes. HDFS solves the problem of storing big data. The two main
components of HDFS are the NameNode and the DataNode. The NameNode is the
master; we may also have a Secondary NameNode, so that if the primary NameNode
stops working the secondary can act as a backup. The NameNode maintains and manages
the DataNodes by storing metadata. The DataNode is the slave, which is basically
low-cost commodity hardware, and we can have multiple DataNodes. The DataNode
stores the actual data and supports a replication factor: if one DataNode goes down, the
data can be accessed from another replicated DataNode, so the accessibility of data is
improved and loss of data is prevented.
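The read-time figures quoted in this example follow directly from the throughput numbers; a quick, rounded check:

```python
# Rough check of the read-time example above (figures are approximate).
TB_IN_MB = 1024 * 1024   # 1 TB expressed in MB

def read_minutes(machines, channels_per_machine=4, mb_per_sec=100):
    throughput = machines * channels_per_machine * mb_per_sec   # total MB/s
    return TB_IN_MB / throughput / 60

print(f"1 machine  : {read_minutes(1):.1f} minutes")    # ~43.7 min (quoted as ~45)
print(f"10 machines: {read_minutes(10):.1f} minutes")   # ~4.4 min (quoted as 4.3)
```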
3. YARN (Yet Another Resource Negotiator)
The initial version of Hadoop had just two components: MapReduce and HDFS. Later it
was realized that MapReduce alone couldn't solve a lot of big data problems. The idea
was to take the resource management and job scheduling responsibilities away from the
old MapReduce engine and give them to a new component. This is how YARN came into
the picture. It is the middle layer between HDFS and MapReduce and is responsible for
managing cluster resources.
YARN's main components are:
A ResourceManager master that arbitrates cluster resources among all the applications in
the system
A NodeManager slave that is installed at each node and functions as a monitoring and
reporting agent of the ResourceManager
An ApplicationMaster that is created for each application to negotiate for resources and
work with the NodeManager to execute and monitor tasks
Resource containers that are controlled by NodeManagers and are assigned the system
resources allocated to individual applications
(Figure: YARN architecture)
4. Common:
Also called Hadoop Common. These are the Java libraries, files, scripts, and utilities that
are required by the other Hadoop components in order to run.
Comparisons between Data Warehouse and Hadoop
Storage: A data warehouse is suitable for data of small volume and becomes too
expensive for large volumes, whereas Hadoop works well with large data sets having
huge volume, velocity, and variety.
Intelligent Data Analysis (IDA):
Intelligent data analysis discloses hidden facts that were not known previously and
provides potentially important information or facts from large quantities of data.
Phases of IDA
Nature of Data
Data are known facts or things used as a basis for inference or reckoning. We can find
data in all the situations of the world around us, whether structured or unstructured,
continuous or discrete: in weather records, stock market logs, photo albums, music
playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material
of any kind of human activity.
Categorical data are values or observations that can be sorted into groups or categories.
There are two types of categorical values: nominal and ordinal.
• A nominal variable has no intrinsic ordering to its categories. For example, housing is
a categorical variable with two categories (own and rent).
Ex: 1) own and rent 2) male and female 3) Federal, Democratic, Republican
• An ordinal variable has an established ordering. For example, age as a variable with
three ordered categories.
Ex: 1) young, adult, and elder 2) poor, satisfactory, good, very good, excellent
Numerical data are values or observations that can be measured. There are two kinds of
numerical values: discrete and continuous (a short illustration of all four kinds follows
below).
• Discrete (countable) data are values or observations that can be counted and are
distinct and separate. For example, the number of lines in a piece of code.
• Continuous (measurable) data are values or observations that may take on any value
within a finite or infinite interval. For example, an economic time series such as historic
gold prices.
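A tiny illustration of the four kinds of values described above, using invented observations:

```python
# Invented observations illustrating the kinds of data described above.
observations = {
    "housing":       ["own", "rent", "own"],           # nominal (no order)
    "satisfaction":  ["poor", "good", "excellent"],    # ordinal (ordered)
    "lines_of_code": [120, 87, 430],                   # discrete (countable)
    "gold_price":    [1812.35, 1820.10, 1799.99],      # continuous (measurable)
}

for name, values in observations.items():
    kind = "numerical" if isinstance(values[0], (int, float)) else "categorical"
    print(f"{name}: {kind} -> {values}")
```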
It is very important to examine the data thoroughly before undertaking any formal
analysis. Traditionally, data analysts have been taught to "familiarise themselves with
their data" before beginning to model it or test it against algorithms.
Different issues that need to be considered while handling big data are as follows (a short
profiling sketch follows the list):
• Missing data
• Misrecorded data
• Sampling data
• Distortions due to contamination
• Anomalous data, or data with hidden peculiarities
• Curse of dimensionality (high-dimensional spaces)
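Before formal analysis, a quick profiling pass can surface some of these issues, such as missing values and anomalous points; a minimal sketch with invented readings:

```python
import statistics

# Invented sensor readings; None marks missing data, 250.0 looks anomalous.
readings = [21.5, 22.0, None, 21.8, 250.0, 22.1]

present = [r for r in readings if r is not None]
median = statistics.median(present)

missing = readings.count(None)
# Flag values far from the median as possibly mis-recorded or anomalous.
anomalies = [r for r in present if abs(r - median) > 5 * median]

print("missing values    :", missing)      # 1
print("possible anomalies:", anomalies)    # [250.0]
```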
Analysis:
Analytics is the process of taking the organized data and analysing it.
This helps users gain valuable insights into how businesses can improve their
performance.
Analysis transforms data and information into insights.
The goal of analysis is to answer questions by interpreting the data at a deeper level and
providing actionable recommendations.
Examples: ad hoc responses, insights, recommended actions, or a forecast.
A canned report will show a company's revenue and whether it is lower or higher than
expected; an ad-hoc drill-down can be used by financial and business analysts to
understand why this occurred.
For data analytics, the steps involved include (a small end-to-end sketch follows this list):
Creating a data hypothesis
Gathering and transforming data
Building analytical models to ingest data, process it, and offer insights
Using tools for data visualization, trend analysis, deep dives, etc.
Using data and insights for making decisions
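A toy end-to-end illustration of these steps (all figures invented): state a hypothesis, transform the data, apply a very simple model, and use the insight to decide.

```python
# Hypothesis: weekend revenue is higher than weekday revenue (invented data).
daily_revenue = {"Mon": 90, "Tue": 95, "Wed": 88, "Thu": 92,
                 "Fri": 110, "Sat": 150, "Sun": 140}

# Gather and transform: split the data by the attribute we care about.
weekend = [v for d, v in daily_revenue.items() if d in ("Sat", "Sun")]
weekday = [v for d, v in daily_revenue.items() if d not in ("Sat", "Sun")]

# A very simple "model": compare group averages.
weekend_avg = sum(weekend) / len(weekend)
weekday_avg = sum(weekday) / len(weekday)

# Insight and decision.
print(f"weekend avg {weekend_avg:.0f} vs weekday avg {weekday_avg:.0f}")
if weekend_avg > weekday_avg:
    print("Decision: allocate more stock and staff to weekends.")
```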
Conclusion:
Reporting shows us "what is happening".
Analysis focuses on explaining "why it is happening" and "what we can do about it".