Big Data Chapter-I_new

(CS 415)-JOEL01 Big Data Processing
UNIT I
Introduction to big data: Data, Characteristics of data and Types of
digital data: Unstructured, Semi-structured and Structured, Sources of
data, Evolution and Definition of big data, Characteristics and Need of
big data, Challenges of big data
Introduction to Hadoop:
History of Hadoop, Components of Hadoop, The Hadoop Distributed
File System: Design of HDFS, HDFS Concepts, Java interfaces to HDFS,
Analysing the Data with Hadoop, scaling out, Hadoop Streaming
UNIT-II
MapReduce: Developing a MapReduce Application, How MapReduce
Works: Anatomy of a MapReduce Job Run, Failures, MapReduce
Features: Counters, Sorting, Joins.

UNIT-III
NoSQL: Introduction to NoSQL, aggregate data models, aggregates,
key-value and document data models, relationships, graph databases,
schemaless databases, materialized views, distribution models:
sharding, master-slave replication, peer-to-peer replication, sharding and
replication, consistency, relaxing consistency, version stamps.
UNIT-IV
Introduction to Hadoop ecosystem technologies: Serialization: Avro,
Coordination: ZooKeeper, Databases: HBase, Hive, Scripting language: Pig

Text Books
1. Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India Pvt.
Ltd., 2016.
2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World
of Polyglot Persistence", Addison-Wesley Professional, 2012.
3. Tom White, "Hadoop: The Definitive Guide", Fourth Edition, O'Reilly, 2015.
References:
1. Bill Franks, “Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams
with Advanced Analytics”, John Wiley & Sons, 2012.
2. Arshdeep Bahga and Vijay Madisetti, “Big Data Science & Analytics: A Hands-On Approach”,
VPT, 2016.
3. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science and its Applications”, Wiley, 2014.
Data
• Data is a collection of facts, such as numbers, words, measurements,
observations or just descriptions of things.

Qualitative vs Quantitative
Data can be qualitative or quantitative.
Qualitative data is descriptive information (it describes something)
Quantitative data is numerical information (numbers)
Quantitative data can be Discrete or Continuous:
• Discrete data can only take certain values (like whole numbers)
• Continuous data can take any value (within a range)
• Put simply: Discrete data is counted, Continuous data is measured
Examples:
• Qualitative:
• He is brown and black
• He has long hair
• He has lots of energy
• Quantitative:
• Discrete:
• He has 4 legs
• He has 2 brothers
• Continuous:
• He weighs 25.5 kg
• He is 565 mm tall
• Data can be collected in many ways. The simplest way is direct
observation.
Data or Datum?
• The singular form is "datum", so we say "that datum is very high".
• "Data" is the plural so we say "the data are available", but data is also
a collection of facts, so "the data is available" is fine too.

Information
• Knowledge obtained from investigation, study, or instruction.
• Intelligence; news.
Characteristics of data
• Data has three key characteristics:
• Composition
• Condition
• Context
Composition
Composition deals with the structure of the data, that is:
• The sources of data
• The granularity
• The types
• The nature of the data (static or real-time streaming)

Condition
The condition of data deals with the state of the data, that is:
“Can one use this data as-is for analysis?” or “Does it require cleansing
for further enhancement and enrichment?”
Context:
The context of data deals with questions such as:
“Where has this data been generated?”
“Why was this data generated?”
“How sensitive is this data?”
“What are the events associated with this data?”
Types of digital data
• The data processed can be human-generated or machine-generated,
although it is ultimately the responsibility of machines to generate the
analytic results.
• Human-generated data is generated by humans with the aid of computers.
Examples include survey results, website content, mobile data and social
media data.
• Machine-generated data is generated by software programs and
hardware devices in response to real-world events. Examples include
financial data, weblog data and sensor data.
• Human-generated and machine-generated data can come from a
variety of sources and be represented in various formats or types.
• The primary types of data are:
structured data
unstructured data
semi-structured data
• Apart from these three fundamental data types, another important
type of data in Big Data environments is metadata.
Structured Data
• Data which is in an organized form and can be easily used by a computer
program.
• Data stored in databases is an example of structured data.
• Structured data conforms to a pre-defined schema/structure.
• Most structured data is held in an RDBMS.
• In an RDBMS, data is stored in rows and columns, organized into tables.
• Until the 1980s, most enterprise data was stored in relational
databases.
• Sources of structured data include databases such as Oracle, DB2 and
MySQL, spreadsheets, and OLTP systems.
• Structured data is easy to work with, particularly for the following:
• Insert/update/delete
• Security
• Indexing
• Scalability
• Transaction Processing
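As a minimal sketch of why structured data is easy to work with, the snippet below stores rows in an in-memory SQLite table and queries them; the table, columns and values are illustrative, not from the slides.

```python
import sqlite3

# Structured data: rows and columns conforming to a pre-defined schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Asha", 52000.0))
conn.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Ravi", 48000.0))

# Because the schema is known in advance, insert/update/delete, indexing
# and querying are straightforward.
rows = conn.execute("SELECT name, salary FROM employees ORDER BY salary DESC").fetchall()
print(rows)  # [('Asha', 52000.0), ('Ravi', 48000.0)]
```

The same pre-defined schema that makes these operations easy is exactly what unstructured data lacks.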
Semi-Structured Data
• Data which does not conform to a data model but has some structure.
• Semi-structured data uses tags to segregate semantic elements.
• Tags are also used to enforce hierarchies of records and fields within
the data.
Characteristics of Semi- Structured Data:
• Inconsistent Structure
• Self-describing (Label/value Pairs)
• Often schema information is blended with data values.
• Data objects may have different attributes not known beforehand.
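These characteristics can be seen in a small JSON sample (the records below are illustrative): each record is self-describing, and two records in the same collection can carry different attributes.

```python
import json

# Semi-structured data: self-describing label/value pairs. Schema
# information (the keys) is blended with the data values.
records = json.loads("""
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phones": ["98765", "91234"], "city": "Chennai"}
]
""")

# Attributes are not known beforehand, so each record is inspected for
# its own keys rather than checked against a fixed table definition.
keys_per_record = [sorted(rec) for rec in records]
print(keys_per_record)  # [['email', 'name'], ['city', 'name', 'phones']]
```

Contrast this with the RDBMS case, where every row must have exactly the columns the schema declares.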
Examples of semi-structured data include XML, JSON and email.
Unstructured Data
• Data which does not conform to a data model, or is not in a form
which can be used easily by a computer program.
• About 80-90% of an organization's data is unstructured.
Dealing with unstructured data:
The following techniques are used to find patterns in, or interpret,
unstructured data:
• Data mining
• Natural language processing
• Text analytics
• Noisy text analytics
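As a toy illustration of text analytics (not any particular tool from the techniques above), the snippet below tokenizes a piece of free text and counts term frequencies, one basic way to surface patterns in unstructured text; the sample sentence is made up.

```python
import re
from collections import Counter

# Unstructured text has no schema; a first text-analytics step is often
# to tokenize it and count term frequencies.
text = "Big data needs big storage. Big data also needs fast processing."
tokens = re.findall(r"[a-z]+", text.lower())
freq = Counter(tokens)
print(freq.most_common(3))  # [('big', 3), ('data', 2), ('needs', 2)]
```

Real text-analytics pipelines add steps such as stop-word removal, stemming and entity extraction on top of this idea.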
Examples of unstructured data include emails, memos, chat logs,
presentations, images, audio and video files.
What is Big Data?
As per Wikipedia
“Big data is a term for data sets that
are so large or complex that traditional
data processing applications are
inadequate to deal with them.”
Definition of Big Data
• Anything beyond the human and technical infrastructure needed to support
storage, processing and analysis.
• A collection of data that is huge in volume, yet growing exponentially with
time.
• Data on the scale of terabytes, petabytes, zettabytes or yottabytes is called
big data.
• Big Data is a collection of large datasets that cannot be processed using
traditional computing techniques.
• For example, the volume of data that Facebook or YouTube must collect and
manage on a daily basis falls under the category of Big Data.
• Today's "big" may be tomorrow's normal.
Big Data Characteristics
• For a dataset to be considered Big Data, it must possess one or more
characteristics that require accommodation in the solution design and
architecture of the analytic environment.
• Most of these data characteristics were initially identified by Doug
Laney in early 2001
• Big Data refers to amounts of data too large to be handled by
traditional storage and processing systems. It is used by many
multinational companies to process data and run their business
operations. Global data flow has been estimated to exceed 150 exabytes
per day before replication.
• The five Big Data characteristics are:
• volume
• velocity
• variety
• veracity
• value
Volume
• The anticipated volume of data that is processed by Big Data solutions is substantial
and ever-growing.
• High data volumes impose distinct data storage and processing demands, as well as
additional data preparation, curation and management processes.
• Big Data comprises vast volumes of data generated daily from many sources,
such as business processes, machines, social media platforms, networks, human
interactions, and many more.
• Facebook alone generates approximately a billion messages a day, records around
4.5 billion clicks of the "Like" button, and receives more than 350 million new
posts each day. Big Data technologies can handle such large amounts of data.
• Organizations and users world-wide create over 2.5 EBs of data a day.
Bit: 0 or 1
Byte: 8 bits
Kilobyte: 1024 bytes
Megabyte: 1024 kilobytes
Gigabyte: 1024 megabytes
Terabyte: 1024 gigabytes
Petabyte: 1024 terabytes
Exabyte: 1024 petabytes
Zettabyte: 1024 exabytes
Yottabyte: 1024 zettabytes
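Each unit in the table above is 1024 times the previous one, so a unit n steps above a byte holds 1024**n bytes, which a few lines of code can tabulate:

```python
# Binary storage units: each step multiplies the previous unit by 1024,
# so the unit n steps above a byte holds 1024**n bytes.
units = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
sizes = {u: 1024 ** n for n, u in enumerate(units)}

print(sizes["KB"])  # 1024
print(sizes["PB"])  # 1125899906842624 bytes in a petabyte
```

At the petabyte-to-exabyte scale mentioned above, these counts make clear why single-machine storage and processing break down.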
• Typical data sources that are responsible for generating high data
volumes can include:
• online transactions, such as point-of-sale and banking
• scientific and research
• sensors, such as GPS sensors, RFIDs, smart meters and telematics
• social media, such as Facebook and Twitter
Velocity
• Velocity refers to the high speed at which data accumulates: how fast data
is generated and processed to meet demand determines the real potential in the data.
• In Big Data environments, data can arrive at fast speeds, and enormous datasets can
accumulate within very short periods of time.
• Velocity covers the speed of incoming data streams, their rate of change, and
bursts of activity. A primary requirement of Big Data is to deliver the demanded
data rapidly.
• From an enterprise's point of view, the velocity of data translates into the amount of time
it takes for the data to be processed once it enters the enterprise's perimeter.
• Big Data velocity deals with the speed at which data flows in from sources such as
application logs, business processes, networks, social media sites, sensors, mobile
devices, etc.
Velocity: a measure of how fast the data is coming in.
[Figure: data generated in one minute in the digital world]
Variety
• Data variety refers to the multiple formats and types of data that need to be supported by Big
Data solutions.
• Big Data can be structured, unstructured or semi-structured, collected from
different sources. In the past, data was collected only from databases and
spreadsheets; today it arrives in a wide array of forms: PDFs, emails, audio,
social media posts, photos, videos, etc.
• Data variety brings challenges for enterprises in terms of data integration, transformation,
processing, and storage.
Veracity
• Veracity refers to the quality or fidelity of data. Since a major part of the data is
unstructured and irrelevant, Big Data needs ways to filter it out or translate it, as the
data is crucial for business decisions.
• Data that enters Big Data environments needs to be assessed for quality, which can lead to
data processing activities that resolve invalid data and remove noise (for example,
Facebook posts with hashtags).
• Noise is data that cannot be converted into information and thus has no value, whereas signals
have value and lead to meaningful information.
• Data with a high signal-to-noise ratio has more veracity than data with a low ratio.
• Is this data credible enough to draw insights from?
• Should we base our business decisions on the insights gathered from this data?
• When processing big data sets, it is important that the validity of the data is checked
before processing proceeds.
Value
• Value is defined as the usefulness of data for an enterprise. It is not just any data that we
process or store; it is valuable and reliable data that we store, process and analyze.
• The value characteristic is intuitively related to the veracity characteristic: the higher the
data fidelity, the more value it holds for the business.
• Value also depends on how long data processing takes.
• For example, a stock quote delayed by 20 minutes has little to no value for making a trade
compared to a quote that is 20 milliseconds old.
• As this example demonstrates, value and time are inversely related.
• Apart from veracity and time, value is also impacted by the following
lifecycle-related concerns:
How well has the data been stored?
Were valuable attributes of the data removed during data cleansing?
Are the right types of questions being asked during data analysis?
Are the results of the analysis being accurately communicated to the
appropriate decision-makers?
Evolution of Big Data
• The 1970s and before was the era of mainframes. The data was essentially
primitive and structured.
• Relational databases evolved in the 1980s and 1990s, serving
data-intensive applications.
• In the 2000s and beyond, the World Wide Web and IoT came into existence
and led to structured, unstructured and multimedia data.
• The data generated in the 2000s and beyond is huge, and analysing it
requires Big Data technology.
Sources of big data
• The voluminous nature of big data makes it crucial for businesses to
differentiate, for the sake of effectiveness, among the disparate big data
sources available.
Media as a big data source
• Media is the most popular source of big data, as it provides
valuable insights on consumer preferences and changing trends.
• It is the fastest way for businesses to get an in-depth overview of
their target audience, draw patterns and conclusions, and enhance
their decision-making.
• Media includes social media and interactive platforms, like Google,
Facebook, Twitter, YouTube and Instagram, as well as generic media
like images, videos and audio.
Cloud as a big data source
• Today, companies have moved ahead of traditional data sources
by shifting their data on the cloud.
• Cloud storage accommodates structured and unstructured data
and provides business with real-time information.
• The main attribute of cloud computing is its flexibility and
scalability.
• As big data can be stored and sourced on public or private clouds,
via networks and servers, cloud makes for an efficient and
economical data source.
The web as a big data source
• The public web constitutes big data that is widespread and easily
accessible.
• Data on the web, or 'Internet', is commonly available to
individuals and companies alike.
• Web services such as Wikipedia provide free and quick
informational insights to everyone.
IoT as a big data source
• Machine-generated content, or data created by IoT devices, constitutes a
valuable source of big data.
• This data is usually generated from the sensors that are connected to
electronic devices.
• The sourcing capacity depends on the ability of the sensors to provide real-
time accurate information.
• IoT is now gaining data not only from computers and smartphones, but also
possibly from every device that can emit data.
• With IoT, data can now be sourced from medical devices, vehicular
processes, video games, meters, cameras, household appliances, and the
like.
Databases as a big data source
• Businesses today prefer to use an amalgamation of traditional and
modern databases to acquire relevant big data.
• This integration paves the way for a hybrid data model and
requires low investment and IT infrastructural costs.
• These databases can then provide for the extraction of insights
that are used to drive business profits.
• Popular databases include a variety of data sources, such as MS
Access, DB2, Oracle, SQL, and Amazon Simple, among others.
Challenges with Big Data
By now it should be clear that big data comes with some obvious challenges.
Let's address some of them.
Quick Data Growth
Data growing at such a quick rate makes it a challenge to find insights in it.
More and more data is generated every second, from which the data that is
actually relevant and useful has to be picked out for further analysis.
Storage
Such large amounts of data are difficult for organizations to store and manage
without appropriate tools and technologies. Unstructured data cannot be stored
in traditional databases.
Syncing Across Data Sources
When organisations import data from different sources, the data from one
source might not be as up to date as the data from another source.
Security
Huge amounts of data in organisations can easily become a target for
advanced persistent threats, so organisations face the further challenge of
keeping their data secure through proper authentication, data encryption, etc.
Unreliable Data
Big data is never 100 percent accurate. It might contain redundant or
incomplete data, along with contradictions.
Miscellaneous Challenges
Other challenges that arise while dealing with big data include data
integration, skill and talent availability, solution expenses, visualization,
searching, data capture, and processing large amounts of data.
Challenges with Big Data
• Storing exponentially growing huge data sets
• Integrating disparate data sources
• Generating insights in a timely manner
• Data governance
