Big Data Chapter-I_new

(CS 415)-JOEL01 Big Data Processing
UNIT I
Introduction to big data: Data, Characteristics of data and Types of
digital data: Unstructured, Semi-structured and Structured, Sources of
data, Evolution and Definition of big data, Characteristics and Need of
big data, Challenges of big data
Introduction to Hadoop:
History of Hadoop, Components of Hadoop, The Hadoop Distributed
File System: Design of HDFS, HDFS Concepts, Java interfaces to HDFS,
Analysing the Data with Hadoop, scaling out, Hadoop Streaming
UNIT-II
MapReduce: Developing a MapReduce Application, How MapReduce
Works: Anatomy of a MapReduce Job Run, Failures, MapReduce
Features: Counters, Sorting, Joins.

UNIT-III
NoSQL: Introduction to NoSQL, aggregate data models, aggregates,
key-value and document data models, relationships, graph databases,
schemaless databases, materialized views, distribution models:
sharding, master-slave replication, peer-to-peer replication, sharding and
replication, consistency, relaxing consistency, version stamps.
UNIT-IV
Introduction to Hadoop ecosystem technologies: Serialization: Avro,
Coordination: ZooKeeper, Databases: HBase, Hive, Scripting language: Pig

Text Books
1. Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India Pvt.
Ltd., 2016.
2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World
of Polyglot Persistence", Addison-Wesley Professional, 2012.
3. Tom White, "Hadoop: The Definitive Guide", Fourth Edition, O'Reilly, 2015.
References:
1. Bill Franks, “Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams
with Advanced Analytics”, John Wiley & Sons, 2012.
2. Arshdeep Bahga and Vijay Madisetti, “Big Data Science & Analytics: A Hands-On Approach”,
VPT, 2016.
3. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science and its Applications”, Wiley, 2014.
Data
• Data is a collection of facts, such as numbers, words, measurements,
observations or just descriptions of things.

Qualitative vs Quantitative
Data can be qualitative or quantitative.
Qualitative data is descriptive information (it describes something)
Quantitative data is numerical information (numbers)
Quantitative data can be Discrete or Continuous:
• Discrete data can only take certain values (like whole numbers)
• Continuous data can take any value (within a range)
• Put simply: Discrete data is counted, Continuous data is measured
Examples:
• Qualitative:
• He is brown and black
• He has long hair
• He has lots of energy
• Quantitative:
• Discrete:
• He has 4 legs
• He has 2 brothers
• Continuous:
• He weighs 25.5 kg
• He is 565 mm tall
• Data can be collected in many ways. The simplest way is direct
observation.
Data or Datum?
• The singular form is "datum", so we say "that datum is very high".
• "Data" is the plural so we say "the data are available", but data is also
a collection of facts, so "the data is available" is fine too.

Information
• Knowledge obtained from investigation, study, or instruction.
• Intelligence; news.
Characteristics of data
• Data has three key characteristics:
• Composition
• Condition
• Context
Composition
Composition deals with the structure of the data, that is:
• The sources of data
• The granularity
• The types
• The nature of the data (static or real-time streaming)

Condition
The condition of data deals with the state of the data, that is:
“Can one use this data as-is for analysis?” or “Does it require cleansing
for further enhancement and enrichment?”
Context:
The context of data deals with questions such as:
“Where has this data been generated?”
“Why was this data generated?”
“How sensitive is this data?”
“What are the events associated with this data?”
Types of digital data
• The data processed can be human-generated or machine-generated,
although it is ultimately the responsibility of machines to generate the
analytic results.
• Human-generated data is generated by humans with the aid of computers.
Examples include survey results, website content, mobile data and social
media data.
• Machine-generated data is generated by software programs and
hardware devices in response to real-world events. Examples include
financial data, weblog data and sensor data.
• Human-generated and machine-generated data can come from a
variety of sources and be represented in various formats or types.
• The primary types of data are:
structured data
unstructured data
semi-structured data
• Apart from these three fundamental data types, another important
type of data in Big Data environments is metadata.
Structured Data
• Data which is in an organized form and can be easily used by a computer
program.
• Data stored in databases is an example of structured data.
• Structured data conforms to a pre-defined schema/structure.
• Most structured data is held in an RDBMS.
• In an RDBMS, data is stored in rows and columns, organized into tables.
• Until the 1980s, most enterprise data was stored in relational
databases.
• Sources of structured data include databases such as Oracle, DB2 and
MySQL, spreadsheets, and OLTP systems.
• Structured data is easy to work with, particularly for the following:
• Insert/update/delete
• Security
• Indexing
• Scalability
• Transaction Processing
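As a minimal sketch of why structured data is easy to work with, the snippet below stores rows in an in-memory SQLite table and queries them; the table, columns and values are illustrative, not from the slides.

```python
import sqlite3

# Structured data: rows and columns conforming to a pre-defined schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Asha", 52000.0))
conn.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Ravi", 48000.0))

# Because the schema is known in advance, insert/update/delete, indexing
# and querying are straightforward.
rows = conn.execute("SELECT name, salary FROM employees ORDER BY salary DESC").fetchall()
print(rows)  # [('Asha', 52000.0), ('Ravi', 48000.0)]
```

The same pre-defined schema that makes these operations easy is exactly what unstructured data lacks.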
Semi-Structured Data
• Data which does not conform to a data model but has some structure.
• Semi-structured data uses tags to segregate semantic elements.
• Tags are also used to enforce hierarchies of records and fields within
the data.
Characteristics of Semi- Structured Data:
• Inconsistent Structure
• Self-describing (Label/value Pairs)
• Often schema information is blended with data values.
• Data objects may have different attributes not known beforehand.
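These characteristics can be seen in a small JSON sample (the records below are illustrative): each record is self-describing, and two records in the same collection can carry different attributes.

```python
import json

# Semi-structured data: self-describing label/value pairs. Schema
# information (the keys) is blended with the data values.
records = json.loads("""
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phones": ["98765", "91234"], "city": "Chennai"}
]
""")

# Attributes are not known beforehand, so each record is inspected for
# its own keys rather than checked against a fixed table definition.
keys_per_record = [sorted(rec) for rec in records]
print(keys_per_record)  # [['email', 'name'], ['city', 'name', 'phones']]
```

Contrast this with the RDBMS case, where every row must have exactly the columns the schema declares.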
Examples of semi-structured data include XML, JSON and email.
Unstructured Data
• Data which does not conform to a data model, or is not in a form
which can be used easily by a computer program.
• About 80-90% of an organization's data is unstructured.
Dealing with unstructured data:
The following techniques are used to find patterns in, or interpret,
unstructured data:
• Data mining
• Natural language processing
• Text analytics
• Noisy text analytics
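As a toy illustration of text analytics (not any particular tool from the techniques above), the snippet below tokenizes a piece of free text and counts term frequencies, one basic way to surface patterns in unstructured text; the sample sentence is made up.

```python
import re
from collections import Counter

# Unstructured text has no schema; a first text-analytics step is often
# to tokenize it and count term frequencies.
text = "Big data needs big storage. Big data also needs fast processing."
tokens = re.findall(r"[a-z]+", text.lower())
freq = Counter(tokens)
print(freq.most_common(3))  # [('big', 3), ('data', 2), ('needs', 2)]
```

Real text-analytics pipelines add steps such as stop-word removal, stemming and entity extraction on top of this idea.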
Examples of unstructured data include emails, memos, chat logs,
presentations, images, audio and video files.
What is Big Data?
As per Wikipedia
“Big data is a term for data sets that
are so large or complex that traditional
data processing applications are
inadequate to deal with them.”
Definition of Big Data
• Anything beyond the human and technical infrastructure needed to support
storage, processing and analysis.
• A collection of data that is huge in volume, yet growing exponentially with
time.
• Data on the scale of terabytes, petabytes, zettabytes or yottabytes is called
big data.
• Big Data is a collection of large datasets that cannot be processed using
traditional computing techniques.
• For example, the volume of data that Facebook or YouTube must collect and
manage on a daily basis falls under the category of Big Data.
• Today's "big" may be tomorrow's normal.
Big Data Characteristics
• For a dataset to be considered Big Data, it must possess one or more
characteristics that require accommodation in the solution design and
architecture of the analytic environment.
• Most of these data characteristics were initially identified by Doug
Laney in early 2001
• Big Data refers to amounts of data too large to be handled by
traditional storage and processing systems. It is used by many
multinational companies to process data and run their business
operations. Global data flow has been estimated to exceed 150 exabytes
per day before replication.
• The five Big Data characteristics are:
• volume
• velocity
• variety
• veracity
• value
Volume
• The anticipated volume of data that is processed by Big Data solutions is substantial
and ever-growing.
• High data volumes impose distinct data storage and processing demands, as well as
additional data preparation, curation and management processes.
• Big Data comprises vast volumes of data generated daily from many sources,
such as business processes, machines, social media platforms, networks, human
interactions, and many more.
• Facebook alone generates approximately a billion messages a day, records around
4.5 billion clicks of the "Like" button, and receives more than 350 million new
posts each day. Big Data technologies can handle such large amounts of data.
• Organizations and users world-wide create over 2.5 EBs of data a day.
Bit: 0 or 1
Byte: 8 bits
Kilobyte: 1024 bytes
Megabyte: 1024 kilobytes
Gigabyte: 1024 megabytes
Terabyte: 1024 gigabytes
Petabyte: 1024 terabytes
Exabyte: 1024 petabytes
Zettabyte: 1024 exabytes
Yottabyte: 1024 zettabytes
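Each unit in the table above is 1024 times the previous one, so a unit n steps above a byte holds 1024**n bytes, which a few lines of code can tabulate:

```python
# Binary storage units: each step multiplies the previous unit by 1024,
# so the unit n steps above a byte holds 1024**n bytes.
units = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
sizes = {u: 1024 ** n for n, u in enumerate(units)}

print(sizes["KB"])  # 1024
print(sizes["PB"])  # 1125899906842624 bytes in a petabyte
```

At the petabyte-to-exabyte scale mentioned above, these counts make clear why single-machine storage and processing break down.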
• Typical data sources that are responsible for generating high data
volumes can include:
• online transactions, such as point-of-sale and banking
• scientific and research
• sensors, such as GPS sensors, RFIDs, smart meters and telematics
• social media, such as Facebook and Twitter
Velocity
• Velocity refers to the high speed at which data accumulates: how fast data
is generated and processed to meet demand determines the real potential in the data.
• In Big Data environments, data can arrive at fast speeds, and enormous datasets can
accumulate within very short periods of time.
• Velocity covers the speed of incoming data streams, their rate of change, and
bursts of activity. A primary requirement of Big Data is to deliver the demanded
data rapidly.
• From an enterprise's point of view, the velocity of data translates into the amount of time
it takes for the data to be processed once it enters the enterprise's perimeter.
• Big Data velocity deals with the speed at which data flows in from sources such as
application logs, business processes, networks, social media sites, sensors, mobile
devices, etc.
Velocity: a measure of how fast the data is coming in.
[Figure: data generated in one minute in the digital world]
Variety
• Data variety refers to the multiple formats and types of data that need to be supported by Big
Data solutions.
• Big Data can be structured, unstructured or semi-structured, collected from
different sources. In the past, data was collected only from databases and
spreadsheets; today it arrives in a wide array of forms: PDFs, emails, audio,
social media posts, photos, videos, etc.
• Data variety brings challenges for enterprises in terms of data integration, transformation,
processing, and storage.
Veracity
• Veracity refers to the quality or fidelity of data. Since a major part of the data is
unstructured and irrelevant, Big Data needs ways to filter it out or translate it, as the
data is crucial for business decisions.
• Data that enters Big Data environments needs to be assessed for quality, which can lead to
data processing activities that resolve invalid data and remove noise (for example,
Facebook posts with hashtags).
• Noise is data that cannot be converted into information and thus has no value, whereas signals
have value and lead to meaningful information.
• Data with a high signal-to-noise ratio has more veracity than data with a low ratio.
• Is this data credible enough to draw insights from?
• Should we base our business decisions on the insights gathered from this data?
• When processing big data sets, it is important that the validity of the data is checked
before processing proceeds.
Value
• Value is defined as the usefulness of data for an enterprise. It is not just any data that we
process or store; it is valuable and reliable data that we store, process and analyze.
• The value characteristic is intuitively related to the veracity characteristic: the higher the
data fidelity, the more value it holds for the business.
• Value also depends on how long data processing takes.
• For example, a stock quote delayed by 20 minutes has little to no value for making a trade
compared to a quote that is 20 milliseconds old.
• As this example demonstrates, value and time are inversely related.
• Apart from veracity and time, value is also impacted by the following
lifecycle-related concerns:
How well has the data been stored?
Were valuable attributes of the data removed during data cleansing?
Are the right types of questions being asked during data analysis?
Are the results of the analysis being accurately communicated to the
appropriate decision-makers?
Evolution of Big Data
• The 1970s and before was the era of mainframes. The data was essentially
primitive and structured.
• Relational databases evolved in the 1980s and 1990s, serving
data-intensive applications.
• In the 2000s and beyond, the World Wide Web and IoT came into existence
and led to structured, unstructured and multimedia data.
• The data generated in the 2000s and beyond is huge, and analysing it
requires Big Data technology.
Sources of big data
• The voluminous nature of big data makes it crucial for businesses to
differentiate, for the sake of effectiveness, among the disparate big data
sources available.
Media as a big data source
• Media is the most popular source of big data, as it provides
valuable insights on consumer preferences and changing trends.
• It is the fastest way for businesses to get an in-depth overview of
their target audience, draw patterns and conclusions, and enhance
their decision-making.
• Media includes social media and interactive platforms, like Google,
Facebook, Twitter, YouTube and Instagram, as well as generic media
like images, videos and audio.
Cloud as a big data source
• Today, companies have moved ahead of traditional data sources
by shifting their data on the cloud.
• Cloud storage accommodates structured and unstructured data
and provides business with real-time information.
• The main attribute of cloud computing is its flexibility and
scalability.
• As big data can be stored and sourced on public or private clouds,
via networks and servers, cloud makes for an efficient and
economical data source.
The web as a big data source
• The public web constitutes big data that is widespread and easily
accessible.
• Data on the web, or 'Internet', is commonly available to
individuals and companies alike.
• Web services such as Wikipedia provide free and quick
informational insights to everyone.
IoT as a big data source
• Machine-generated content, or data created by IoT devices, constitutes a
valuable source of big data.
• This data is usually generated from the sensors that are connected to
electronic devices.
• The sourcing capacity depends on the ability of the sensors to provide real-
time accurate information.
• IoT is now gaining data not only from computers and smartphones, but also
possibly from every device that can emit data.
• With IoT, data can now be sourced from medical devices, vehicular
processes, video games, meters, cameras, household appliances, and the
like.
Databases as a big data source
• Businesses today prefer to use an amalgamation of traditional and
modern databases to acquire relevant big data.
• This integration paves the way for a hybrid data model and
requires low investment and IT infrastructural costs.
• These databases can then provide for the extraction of insights
that are used to drive business profits.
• Popular databases include a variety of data sources, such as MS
Access, DB2, Oracle, SQL, and Amazon Simple, among others.
Challenges with Big Data
By now it should be clear that big data comes with some obvious challenges.
Let's address some of them.
Quick Data Growth
Data growing at such a quick rate makes it a challenge to find insights in it.
More and more data is generated every second, from which the data that is
actually relevant and useful has to be picked out for further analysis.
Storage
Such large amounts of data are difficult for organizations to store and manage
without appropriate tools and technologies. Unstructured data cannot be stored
in traditional databases.
Syncing Across Data Sources
When organisations import data from different sources, the data from one
source might not be as up to date as the data from another source.
Security
Huge amounts of data in organisations can easily become a target for
advanced persistent threats, so organisations face the further challenge of
keeping their data secure through proper authentication, data encryption, etc.
Unreliable Data
Big data is never 100 percent accurate. It might contain redundant or
incomplete data, along with contradictions.
Miscellaneous Challenges
Other challenges that arise while dealing with big data include data
integration, skill and talent availability, solution expenses, visualization,
searching, data capture, and processing large amounts of data.
Challenges with Big Data
• Storing exponentially growing huge data sets
• Integrating disparate data sources
• Generating insights in a timely manner
• Data governance
