Big Data Unit 1 Notes
TOPICS
UNIT - I Introduction to Big Data: Big Data and its Importance – Four V’s of Big
Data – Drivers for Big Data – Introduction to Big Data Analytics – Big Data
Analytics applications.
Introduction to Big Data:
Examples of Big Data:
The New York Stock Exchange is an example of Big Data; it generates about one
terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases
of the social media site Facebook every day. This data is mainly generated from
photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time.
With many thousands of flights per day, data generation reaches many petabytes.
Types of data:
1. Unstructured data: This is data which does not conform to a data model, or is
not in a form that can be used easily by a computer program.
For example: text documents, images, videos, and audio files.
2. Semi-structured data:
This is data which does not conform to a data model but has some structure.
However, it is not in a form that can be used easily by a computer program.
Examples: e-mails, XML, markup languages, HTML
3. Structured data: This is data in an organized form, i.e. in rows and columns,
which can be easily used by computer programs. Relationships exist between the
entities of the data, such as classes and their objects. Data stored in
databases is an example of structured data.
Structured data:
The data when it conforms to the schema/structure we say it is structured data.
• Structured data is generally tabular data that is represented by columns and rows in a
database.
• Databases that hold tables in this form are called relational databases.
• The mathematical term "relation" refers to a set of data held as a table.
• In structured data, every row in a table has the same set of columns.
• SQL (Structured Query Language) is the programming language used to work with structured data.
Sources of structured data:
Structured data is generated by sources such as RDBMSs (Oracle, DB2, Microsoft
SQL Server, Teradata, MySQL) and OLTP systems.
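As a minimal sketch of how structured data is stored and queried (the table and column names below are invented for illustration), the following Python snippet uses the built-in sqlite3 module:

import sqlite3

# In-memory relational database: structured data is held as rows and columns.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Every row in the table has the same fixed set of columns (the schema).
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO employees (id, name, salary) VALUES (?, ?, ?)",
    [(1, "Asha", 52000.0), (2, "Ravi", 61000.0)],
)

# SQL is the language used to query structured data.
for row in cur.execute("SELECT name, salary FROM employees WHERE salary > 55000"):
    print(row)  # ('Ravi', 61000.0)

conn.close()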
Semi structured data:
Semi-structured data is also referred to as self-describing data:
it uses tags to segregate the semantic elements.
Sources of semi structured data:
The sources of semi-structured data include:
XML - Extensible Markup Language
JSON - JavaScript Object Notation
An example of HTML is as follows:
<html>
<head>
<title>Place your title here</title>
</head>
<body bgcolor="#FFFFFF">
</body>
</html>
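To see why semi-structured data is called self-describing, the short Python sketch below parses a small JSON document (the record fields are invented) using the standard json module. The keys name the values they hold, and records need not share an identical structure:

import json

# Semi-structured data: keys describe the values, but records may differ in shape.
doc = """
{
  "user": "asha",
  "posts": [
    {"text": "hello", "likes": 3},
    {"text": "big data!", "likes": 10, "tags": ["bigdata", "notes"]}
  ]
}
"""

data = json.loads(doc)  # parse the text into Python dicts and lists
for post in data["posts"]:
    # .get() tolerates missing fields: the second post has "tags", the first does not
    print(post["text"], post["likes"], post.get("tags", []))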
Data can be characterized along three dimensions: its composition (the structure
and sources of the data), its condition (the state of the data, i.e. whether it
can be used as is), and its context (where the data was generated, why, and how
sensitive it is).
The 1970s and before was the era of mainframes, when data was essentially
primitive and structured. Relational databases evolved in the 1980s and 1990s,
an era of data-intensive applications. The World Wide Web (WWW) and the Internet
of Things (IoT) have since led to an onslaught of structured, unstructured, and
multimedia data.
Importance of Big Data:
The importance of Big Data does not revolve around how much data a company has,
but around how the company utilizes the collected data. Every company uses its
data in its own way; the more efficiently a company uses its data, the more
potential it has to grow. By analyzing its big data pools effectively, a company
can take data from any source and find answers that enable:
Cost Savings :
o Some tools of Big Data like Hadoop can bring cost advantages to business
when large amounts of data are to be stored.
o These tools help in identifying more efficient ways of doing business.
Time Reductions :
o The high speed of tools like Hadoop and in-memory analytics makes it easy to
identify new sources of data, which helps businesses analyze data immediately.
o This helps in making quick decisions based on the learnings.
Understand the market conditions :
o By analyzing big data we can get a better understanding of current market
conditions.
o For example: By analyzing customers’ purchasing behaviours, a company
can find out the products that are sold the most and produce products according
to this trend. By this, it can get ahead of its competitors.
Control online reputation :
o Big data tools can do sentiment analysis.
o Therefore, you can get feedback about who is saying what about your
company.
o If you want to monitor and improve the online presence of your business, then
big data tools can help in all this.
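Real sentiment analysis tools rely on trained language models; purely as a toy illustration (the word lists and feedback messages below are invented), a simple keyword-based scorer might look like this in Python:

# Toy sentiment scorer: counts positive vs negative words in each message.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(message):
    words = message.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

feedback = [
    "I love this product, great quality",
    "terrible support and a bad experience",
]
for msg in feedback:
    print(sentiment(msg), "->", msg)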
Using Big Data Analytics to Boost Customer Acquisition and Retention:
o The customer is the most important asset any business depends on.
o No single business can claim success without first having to establish a solid
customer base.
o If a business is slow to learn what customers are looking for, then it is very
likely to deliver poor quality products.
o The use of big data allows businesses to observe various customer-related
patterns and trends.
Using Big Data Analytics to Solve Advertisers' Problems and Offer
Marketing Insights:
o Big data analytics can help change all business operations, such as the
ability to match customer expectations or to change the company's product line.
o It also helps ensure that marketing campaigns are powerful.
Four V's of Big Data:
1. Volume:
• The name ‘Big Data’ itself is related to a size which is enormous.
• Volume is a huge amount of data.
• To determine the value of data, size of data plays a very crucial
role. If the volume of data is very large then it is actually considered
as a ‘Big Data’. This means whether a particular data can actually
be considered as a Big Data or not, is dependent upon the volume
of data.
• Hence while dealing with Big Data it is necessary to consider a
characteristic ‘Volume’.
• Example: In the year 2016, the estimated global mobile traffic was
6.2 exabytes (6.2 billion GB) per month. It was also estimated that by the
year 2020 the world would have almost 40,000 exabytes of data.
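A quick sanity check of the figures above, using decimal units (1 exabyte = 10^9 gigabytes):

# 1 EB = 10**18 bytes and 1 GB = 10**9 bytes, so 1 EB = 10**9 GB.
EB_IN_GB = 10**18 // 10**9

monthly_traffic_eb = 6.2
print(monthly_traffic_eb * EB_IN_GB)  # 6200000000.0 GB, i.e. 6.2 billion GB per month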
2. Velocity:
• Velocity refers to the high speed at which data is generated, collected, and
processed. Big data flows in continuously from sources such as machines,
networks, social media, and mobile devices.
3. Variety:
• Variety refers to the many forms data takes: structured, semi-structured, and
unstructured data such as text, images, audio, and video.
4. Veracity:
• Veracity refers to the uncertainty of data, i.e. the quality, accuracy, and
trustworthiness of the data that is captured.
Drivers for Big Data:
Besides the plummeting of storage costs, a second key contributing factor to the
affordability of Big Data has been the development of open-source Big Data
software frameworks.
The most popular software framework (nowadays considered the standard for Big Data) is
Apache Hadoop for distributed storage and processing.
Because these software frameworks are freely available as open source, it has
become increasingly inexpensive to start Big Data projects in organizations.
This means that organizations that want to process massive quantities of data
(and thus have large storage and processing requirements) do not have to invest
in large quantities of IT infrastructure. Instead, they can license the storage
and processing capacity they need and pay only for the amounts they actually
use. As a result, most Big Data solutions leverage the possibilities of cloud
computing to deliver their solutions to enterprises.
Increased knowledge about data science
In the last decade, the terms data science and data scientist have become
tremendously popular. In October 2012, Harvard Business Review called the data
scientist the "sexiest job of the 21st century", and many other publications
have featured this new job role in recent years. The demand for data scientists
(and similar job titles) has increased tremendously, and many people have
actively become engaged in the domain of data science.
As a result, knowledge and education about data science have greatly
professionalized, and more information becomes available every day. While
statistics and data analysis previously remained mostly an academic field, they
are quickly becoming popular subjects among students and the working population.
Social media data provides insights into the behaviors, preferences and opinions of ‘the public’
on a scale that has never been known before. Due to this, it is immensely valuable to anyone
who is able to derive meaning from these large quantities of data. Social media data can be
used to identify customer preferences for product development, target new customers for future
purchases, or even target potential voters in elections. Social media data might even be
considered one of the most important business drivers of Big Data.
The Internet of Things (IoT) is increasingly gaining popularity as consumer goods
providers start including "smart" sensors in household appliances. Whereas the
average household in 2010 had around 10 devices that connected to the internet,
this number was expected to rise to 50 per household by 2020.
Examples of these devices include thermostats, smoke detectors, televisions, audio systems and
even smart refrigerators.
Other sources of Big Data include:
● Medical information, such as diagnostic imaging
● Photos and video footage uploaded to the World Wide Web
● Video surveillance, such as the thousands of video cameras
across a city
● Mobile devices, which provide geospatial location data of the
users
● Metadata about text messages, phone calls, and application
usage on smart phones
● Smart devices, which provide sensor-based collection of
information from smart electric grids, smart buildings, and other
infrastructure
● Nontraditional IT devices, including RFID readers,
GPS navigation systems, and seismic processing.
Data can be generated from all of these sources.
Big Data challenges include:
• Capturing data
• Curation
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Data is growing at an exponential rate; most of today's data has been generated
in the last 2-3 years.
Organizations and data collectors are realizing that the data they can gather from
individuals contains intrinsic value and, as a result, a new economy is emerging.
As this new digital economy continues to evolve, the market sees the introduction
of data vendors and data cleaners that use crowdsourcing to test the outcomes of
machine learning techniques.
Other vendors offer added value by repackaging open source tools in a simpler
way and bringing the tools to market. Vendors such as Cloudera, Hortonworks,
and Pivotal have provided this value-add for the open source framework Hadoop.
Data devices and the “Sensornet” gather data from multiple locations and
continuously generate new data about this data. For each gigabyte of new data
created, an additional petabyte of data is created about that data.
For example, consider someone playing an online video game through a PC,
game console, or smartphone. In this case, the video game provider captures data
about the skill and levels attained by the player. Intelligent systems monitor and
log how and when the user plays the game. As a consequence, the game provider
can fine-tune the difficulty of the game, suggest other related games that would
most likely interest the user, and offer additional equipment and enhancements
for the character based on the user’s age, gender, and interests. This information
may get stored locally or uploaded to the game provider’s cloud to analyze the
gaming habits and opportunities for upsell and cross-sell, and identify
archetypical profiles of specific kinds of users.
Retail shopping loyalty cards record not just the amount an individual spends,
but the locations of stores that person visits, the kinds of products purchased, the
stores where goods are purchased most often, and the combinations of products
purchased together. Collecting this data provides insights into shopping and travel
habits and the likelihood of successful advertisement targeting for certain types
of retail promotions.
Data collectors are entities that collect data from devices and users. Examples:
Data results from a cable TV provider tracking the shows a person watches, which
TV channels someone will and will not pay to watch on demand, and the prices
someone is willing to pay for premium TV content.
Retail stores tracking the path a customer takes through the store while pushing
a shopping cart with an RFID chip, so they can gauge which products get the most
foot traffic using geospatial data collected from the RFID chips.
Data aggregators make sense of the data collected from the various entities from
the “SensorNet” or the “Internet of
Things.” These organizations compile data from the devices and usage patterns
collected by government agencies, retail stores, and websites. In turn, they can
choose to transform and package the data as products to sell to list brokers, who
may want to generate marketing lists of people who may be good targets for
specific ad campaigns.
Data users and buyers: these groups directly benefit from the data collected and
aggregated by others within the data value chain.
Retail banks, acting as a data buyer, may want to know which customers have the
highest likelihood to apply for a second mortgage or a home equity line of credit.
To provide input for this analysis, retail banks may purchase data from a data
aggregator. This kind of data may include demographic information about people
living in specific locations; people who appear to have a specific level of debt,
yet still have solid credit scores (or other characteristics such as paying bills on
time and having savings accounts) that can be used to infer credit worthiness; and
those who are searching the web for information about paying off debts or doing
home remodeling projects.
Obtaining data from these various sources and aggregators will enable a more
targeted marketing campaign, which would have been more challenging before
Big Data due to the lack of information or high-performing technologies.
Introduction to Big Data Analytics:
Big Data Analytics is the field of analyzing big data and extracting information
from it, whether in business or elsewhere in the data world, so that proper
conclusions can be drawn.
These conclusions can be used to predict the future or to forecast the business.
Big data is so large and complex that it cannot be dealt with using traditional
methods of analysis.
We produce a massive amount of data each day, whether we know about it or not.
Every click on the internet, every bank transaction, every video we watch on
YouTube, every email we send, every like on our Instagram post makes up data
for tech companies.
With such a massive amount of data being collected, it only makes sense for
companies to use this data to understand their customers and
their behavior better. This is the reason why the popularity of Data Science has
grown manifold over the last few years.
A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single
solution, which is then used for managing as well as analyzing Big Data.
It focuses on providing its users with efficient analytics tools for massive
datasets.
The users of such platforms can custom-build applications according to their
use case, for example to calculate customer loyalty (an e-commerce use case),
as sketched below.
Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability,
Performance, and Security.
Example: Some of the most commonly used Big Data platforms are Hadoop, Cloudera,
and Hortonworks.
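As an example of the kind of custom application mentioned above, here is a toy Python sketch of a customer-loyalty calculation (the purchase log and the scoring rule are invented for illustration):

from collections import defaultdict

# Hypothetical purchase log: (customer_id, amount_spent) pairs.
purchases = [("c1", 40.0), ("c2", 15.0), ("c1", 25.0), ("c1", 10.0), ("c2", 90.0)]

totals = defaultdict(float)   # total spend per customer
counts = defaultdict(int)     # number of purchases per customer
for customer, amount in purchases:
    totals[customer] += amount
    counts[customer] += 1

# Invented loyalty rule: reward both purchase frequency and total spend.
for customer in totals:
    loyalty = counts[customer] * 10 + totals[customer] * 0.5
    print(customer, round(loyalty, 2))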
Big Data Architecture:
• A big data architecture includes the following components:
• Data sources: All big data solutions start with one or more data sources.
• Example,
• Application data stores, such as relational databases.
• Static files produced by applications, such as web server log files.
• Real-time data sources, such as IoT devices.
• Data storage: Data for batch processing operations is stored in a
distributed file store that can hold high volumes of large files in various
formats (also called data lake).
• Example,
• Azure Data Lake Store or blob containers in Azure Storage.
• Batch processing: Since the data sets are so large, a big data
solution must process data files using long-running batch jobs to filter,
aggregate, and prepare the data for analysis.
• Real-time message ingestion: If a solution includes real-time sources,
the architecture must include a way to capture and store real-time
messages for stream processing.
• Stream processing: After capturing real-time messages, the solution
must process them by filtering, aggregating, and preparing the data for
analysis. The processed stream data is then written to an output sink.
Open-source Apache streaming technologies like Storm and Spark Streaming
can be used for this (a minimal sketch follows this list).
• Analytical data store: Many big data solutions prepare data for analysis
and then serve the processed data in a structured format that can be
queried using analytical tools. Example: Azure Synapse Analytics
provides a managed service for large-scale, cloud-based data
warehousing.
• Analysis and reporting: The goal of most big data solutions is to provide
insights into the data through analysis and reporting. To empower users
to analyze the data, the architecture may include a data modelling layer.
Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts.
• Orchestration: Most big data solutions consist of repeated data
processing operations that transform source data, move data between
multiple sources and sinks, load the processed data into an analytical
data store, or push the results straight to a report. To automate these
workflows, we can use an orchestration technology such as Azure Data
Factory.
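As referenced in the stream-processing component above, here is a minimal pure-Python sketch of the capture-filter-aggregate flow (the sensor events, threshold, and window size are invented; a production system would use Storm or Spark Streaming):

from collections import deque

# Simulated real-time messages: (sensor_id, reading) pairs.
stream = [("s1", 20.5), ("s2", 99.0), ("s1", 21.0), ("s2", 18.0), ("s1", 22.5)]

WINDOW = 3   # sliding window: keep the last 3 valid readings per sensor
windows = {}

for sensor, reading in stream:            # capture each incoming message
    if reading > 90:                      # filter: drop out-of-range readings
        continue
    w = windows.setdefault(sensor, deque(maxlen=WINDOW))
    w.append(reading)                     # aggregate: maintain a sliding window
    avg = sum(w) / len(w)
    print(f"sink <- {sensor}: avg={avg:.2f}")   # write result to the output sink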
Traditional Data vs Big Data:
• Traditional data volume ranges from gigabytes to terabytes; big data volume
ranges from petabytes to zettabytes or exabytes.
• Traditional data is generated per hour, per day, or less frequently; big data
is generated far more frequently, often every second.
• Traditional database tools can perform any operation on traditional data;
special kinds of database tools are required to perform operations on big data.
• The traditional data model is strictly schema-based and static; the big data
model is flat-schema-based and dynamic.
• Traditional data is stable, with known interrelationships; big data is
unstable, with unknown relationships.
Hadoop:
Hadoop is an Apache open-source framework written in Java that allows distributed
processing of large datasets across clusters of computers using simple programming models.
The Hadoop framework works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is
designed to scale up from a single server to thousands of machines, each
offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers, namely:
• the processing/computation layer (MapReduce), and
• the storage layer (the Hadoop Distributed File System, HDFS).
MapReduce
MapReduce is a parallel programming model for writing distributed applications
devised at Google for efficient processing of large amounts of data (multi-terabyte
data-sets), on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which is an
Apache open-source framework.
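The plain-Python sketch below imitates the MapReduce model on the classic word-count task. Hadoop would run the map and reduce steps in parallel across a cluster; this single-process version only illustrates the flow of map, shuffle, and reduce:

from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all emitted values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
for word, counts in sorted(groups.items()):
    print(word, sum(counts))  # big 3, data 2, hadoop 1, ...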
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work
across the machines and, in turn, utilizes the underlying parallelism of
the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to
detect and handle failures at the application layer.
• Servers can be added or removed from the cluster dynamically and
Hadoop continues to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source,
it is compatible with all platforms since it is Java based.