Big Data Chapter-I_new
data Processing
UNIT I
Introduction to big data: Data, Characteristics of data and Types of
digital data: Unstructured, Semi-structured and Structured, Sources of
data, Evolution and Definition of big data, Characteristics and Need of
big data, Challenges of big data
Introduction to Hadoop:
History of Hadoop, Components of Hadoop, The Hadoop Distributed
File System: Design of HDFS, HDFS Concepts, Java interfaces to HDFS,
Analysing the Data with Hadoop, Scaling Out, Hadoop Streaming.
UNIT-II
MapReduce: Developing a MapReduce Application, How MapReduce
Works: Anatomy of a MapReduce Job Run, Failures, MapReduce
Features: Counters, Sorting, Joins.
UNIT-III
NoSQL: Introduction to NoSQL, aggregate data models, aggregates,
key-value and document data models, relationships, graph databases,
schemaless databases, materialized views, distribution models:
sharding, master-slave replication, peer-to-peer replication, sharding and
replication, consistency, relaxing consistency, version stamps.
UNIT-IV
Introduction to Hadoop ecosystem technologies: Serialization: Avro,
Coordination: ZooKeeper, Databases: HBase, Hive, Scripting language: Pig
Text Books
1. Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India Pvt.
Ltd., 2016.
2. P. J. Sadalage and M. Fowler, “NoSQL Distilled: A Brief Guide to the Emerging World
of Polyglot Persistence”, Addison-Wesley Professional, 2012.
3. Tom White, “Hadoop: The Definitive Guide”, Fourth Edition, O'Reilly, 2015.
References:
1. Bill Franks, “Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams
with Advanced Analytics”, John Wiley & Sons, 2012.
2. Arshdeep Bahga and Vijay Madisetti, “Big Data Science & Analytics: A Hands-On Approach”,
VPT, 2016.
3. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science and its
Data
• Data is a collection of facts, such as numbers, words, measurements,
observations or just descriptions of things.
Qualitative vs Quantitative
Data can be qualitative or quantitative.
Qualitative data is descriptive information (it describes something).
Quantitative data is numerical information (numbers).
Quantitative data can be Discrete or Continuous:
• Discrete data can only take certain values (like whole numbers)
• Continuous data can take any value (within a range)
• Put simply: Discrete data is counted, Continuous data is measured
Examples (describing a dog):
• Qualitative:
• He is brown and black
• He has long hair
• He has lots of energy
• Quantitative:
• Discrete:
• He has 4 legs
• He has 2 brothers
• Continuous:
• He weighs 25.5 kg
• He is 565 mm tall
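The distinction above can be sketched in a few lines of Python. This is only an illustration: the function `classify` and the field names in `observations` are hypothetical, and using the Python type as a stand-in for "counted" versus "measured" is an assumption (a measured height is stored as a float here even when it happens to be a whole number).

```python
from numbers import Integral, Real

def classify(value):
    """Label one observation as qualitative, discrete, or continuous."""
    # bool is a subclass of int in Python; treat it as a descriptive label here
    if isinstance(value, bool):
        return "qualitative"
    if isinstance(value, Integral):
        return "discrete"      # counted values, e.g. number of legs
    if isinstance(value, Real):
        return "continuous"    # measured values, e.g. weight
    return "qualitative"       # descriptive text

# Hypothetical field names; the values come from the examples above
observations = {
    "colour": "brown and black",
    "legs": 4,
    "brothers": 2,
    "weight_kg": 25.5,
    "height_mm": 565.0,  # measured, so stored as a float
}

kinds = {name: classify(value) for name, value in observations.items()}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```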
• Data can be collected in many ways. The simplest way is direct
observation.
Data or Datum?
• The singular form is "datum", so we say "that datum is very high".
• "Data" is the plural so we say "the data are available", but data is also
a collection of facts, so "the data is available" is fine too.
Information
• Knowledge obtained from investigation, study, or instruction.
• Intelligence or news.
Characteristics of data
• Data has three key characteristics:
• Composition
• Condition
• Context
Composition
Composition deals with the structure of data, that is:
• The sources of data
• The granularity
• The types
• The nature of data (such as static or real-time streaming)
Condition
The condition of data deals with the state of data, that is,
“Can one use this data as is for analysis?” or “Does it require cleansing
for further enhancement and enrichment?”
Context:
The context of data deals with questions such as:
“Where has this data been generated?”
“Why was this data generated?”
“How sensitive is this data?”
“What are the events associated with this data?”
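One way to make the three characteristics concrete is to record them next to a data set. The sketch below is a hypothetical illustration, not a standard schema: the class `DataProfile`, its field names, and the sample values are all assumptions chosen to mirror the questions above.

```python
from dataclasses import dataclass, field

@dataclass
class DataProfile:
    # Composition: structure of the data
    source: str
    granularity: str
    data_type: str
    nature: str                      # "static" or "real-time streaming"
    # Condition: can the data be used as is?
    ready_for_analysis: bool
    # Context: where, why, how sensitive, which events
    generated_where: str
    generated_why: str
    sensitivity: str
    related_events: list = field(default_factory=list)

    def next_step(self):
        """Suggest what to do with the data based on its condition."""
        return "analyse" if self.ready_for_analysis else "cleanse and enrich"

# Hypothetical sample values for illustration only
profile = DataProfile(
    source="web server log",
    granularity="one record per request",
    data_type="semi-structured",
    nature="real-time streaming",
    ready_for_analysis=False,
    generated_where="public website",
    generated_why="request tracking",
    sensitivity="contains IP addresses",
)
print(profile.next_step())
```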
Types of digital data
• The data processed can be human-generated or machine-generated,
although it is ultimately the responsibility of machines to generate the
analytic results.
• Human-generated data: generated by humans with computer aid.
Examples are survey results, website content, mobile data, social
media data, etc.
• Machine-generated (computer-generated) data: generated by software
programs and hardware devices in response to real-world events.
Examples are financial data, weblog data, sensor data, etc.
• Human-generated and machine-generated data can come from a
variety of sources and be represented in various formats or types.
• The primary types of data are:
• Structured data
• Unstructured data
• Semi-structured data
• Apart from these three fundamental data types, another important
type of data in Big Data environments is metadata.
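The difference between these types is easiest to see when similar facts are stored in each form. The following is a minimal sketch using only the Python standard library; the sample record (a dog named "Rex"), the field names, and the metadata keys are hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema
csv_text = "id,name,weight_kg\n1,Rex,25.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys; fields may nest or vary per record
json_text = '{"id": 1, "name": "Rex", "traits": {"hair": "long"}, "tags": ["energetic"]}'
doc = json.loads(json_text)

# Unstructured: free text with no predefined fields
note = "Rex is brown and black, has long hair and lots of energy."
word_count = len(note.split())

# Metadata: data about the data (here, about the unstructured note)
metadata = {
    "format": "text/plain",
    "length_chars": len(note),
    "word_count": word_count,
}

print(rows[0]["weight_kg"])    # CSV values are read as strings
print(doc["traits"]["hair"])   # nested field reached by key
print(metadata)
```

Note that the structured and semi-structured forms can be queried by field name, while the unstructured note must first be parsed or analysed before any field-like information can be extracted from it.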
Structured Data
• The public web constitutes big data that is widespread and easily
accessible.
• Data on the Web or ‘Internet’ is commonly available to
individuals and companies alike.
• Web services such as Wikipedia provide free and quick
informational insights to everyone.
IoT as a big data source
timely manner
Data Governance