Unit 1
GTU #3170722
Big Data Characteristics
Volume represents the amount of data, which is growing at a high rate; data volumes are now measured in petabytes.
Value refers to turning data into value. By turning the big data they collect into value, businesses can generate revenue.
Veracity refers to the uncertainty of available data. Veracity arises due to the high volume of
data that brings incompleteness and inconsistency.
Visualization is the process of displaying data in charts, graphs, maps, and other visual
forms.
Variety refers to the different data types, i.e. various data formats like text, audio, video, etc.
Velocity is the rate at which data grows. Social media plays a major role in the velocity of data growth.
Virality describes how quickly information gets spread across people to people (P2P)
networks.
Volume
[ Data at Rest ]
As the name suggests, big data refers to enormous amounts of information.
We are not talking about gigabytes but about terabytes and petabytes of data.
The IoT (Internet of Things) is creating exponential growth in data.
The volume of data is projected to grow significantly in the coming years.
Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
• Terabytes, Petabytes
• Records/Arch
• Table/Files
• Distributed
Variety
[ Data in many Forms ]
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
Data comes in different formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.
This variety of unstructured data poses certain issues for storing, mining and analysing the data.
Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly.
Another challenge of Big Data processing goes beyond the massive volumes and increasing velocities of data to manipulating the enormous variety of these data (a small example follows below).
• Structured
• Unstructured
• Text
• Multimedia
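To make the contrast concrete, the minimal Python sketch below loads one structured and one unstructured source side by side. The file names (sales.csv, reviews.txt) and columns are hypothetical placeholders, not part of these notes.

```python
# Minimal sketch: loading structured and unstructured data side by side.
# File names (sales.csv, reviews.txt) are hypothetical placeholders.
import pandas as pd

# Structured: rows and columns with a fixed schema.
sales = pd.read_csv("sales.csv")             # e.g. columns: date, product, amount
print(sales.dtypes)                          # schema is known up front

# Unstructured: free text with no predefined schema.
with open("reviews.txt", encoding="utf-8") as f:
    reviews = f.read().splitlines()

# The unstructured side needs extra processing before analysis,
# e.g. a crude word count per review as a first feature.
word_counts = [len(r.split()) for r in reviews]
print(word_counts[:5])
```

The structured side arrives ready for querying, while the unstructured side must first be transformed into features before it can be mined.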
Veracity
[ Data in Doubt ]
Veracity describes whether the data can be trusted.
Veracity refers to the uncertainty of available data.
Veracity arises due to the high volume of data that brings
incompleteness and inconsistency.
Hygiene of data in analytics is important because otherwise, you
cannot guarantee the accuracy of your results.
Because data comes from so many different sources, it’s difficult
to link, match, cleanse and transform data across systems.
However, it is useless if the data being analysed is inaccurate or incomplete.
Veracity is all about making sure the data is accurate, which requires processes to keep bad data from accumulating in your systems (a small cleansing sketch follows below).
• Trustworthiness
• Authenticity
• Accuracy
• Availability
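As an illustration of such a hygiene process, here is a minimal sketch using pandas on a made-up orders table; the column names and checks are assumptions for the example only.

```python
# Minimal sketch of basic data-hygiene checks for veracity.
# The DataFrame below is a small made-up example.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount":   [120.0, None, 80.0, -5.0, 80.0],
})

orders = orders.drop_duplicates(subset="order_id")   # remove duplicate records
orders = orders.dropna(subset=["amount"])            # drop incomplete rows
orders = orders[orders["amount"] >= 0]                # reject implausible values

print(orders)   # only rows that pass the basic trust checks remain
```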
Velocity
[ Data in Motion ]
Velocity is the speed at which data grows, is processed, and becomes accessible.
Data flows in from sources like business processes, application logs, networks and social media sites, sensors, mobile devices, etc.
The flow of data is massive and continuous.
While most data are warehoused before analysis, there is an increasing need for real-time processing of these enormous volumes.
Real-time processing reduces storage requirements while providing more responsive, accurate and profitable responses.
Data should be processed fast, by batch or in a stream-like manner, because it just keeps growing every year (a small sketch contrasting the two follows below).
• Streaming
• Batch
• Real / Near Time
• Processes
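The following minimal Python sketch contrasts the two styles on a simulated feed; the generator simply stands in for a real source such as logs or sensors.

```python
# Minimal sketch contrasting batch and stream-style processing.
# The event generator stands in for a real feed (logs, sensors, social media).
import random
import time

def event_stream(n=5):
    """Pretend source that yields one reading at a time (data in motion)."""
    for _ in range(n):
        yield random.randint(0, 100)
        time.sleep(0.1)          # simulated arrival delay

# Batch: collect everything first, then analyse in one pass.
batch = list(event_stream())
print("batch average:", sum(batch) / len(batch))

# Streaming: update the result as each record arrives, no full storage needed.
count, total = 0, 0
for reading in event_stream():
    count += 1
    total += reading
    print(f"running average after {count} readings:", total / count)
```

The streaming loop never keeps the full history, which is why real-time processing can reduce storage requirements while still giving an up-to-date answer.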
Value
[ Data into Money ]
Value refers to turning data into value. By turning the big data they collect into value, businesses can generate revenue.
Value is the end game. After addressing volume, velocity, variety,
variability, veracity, and visualization – which takes a lot of time,
effort and resources – you want to be sure your organization is
getting value from the data.
For example, data that can be used to analyze consumer behavior is valuable for your company because you can use the research results to make individualized offers.
• Statistical
• Events
• Correlations
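A minimal sketch of that idea, assuming a tiny invented behaviour table: a simple correlation between engagement and purchases is one way such data starts turning into value.

```python
# Minimal sketch of turning raw behaviour data into value via a simple correlation.
# The numbers are invented purely for illustration.
import pandas as pd

behaviour = pd.DataFrame({
    "pages_viewed": [3, 10, 1, 8, 15, 2],
    "purchases":    [0,  2, 0, 1,  3, 0],
})

# A positive correlation suggests page views help predict purchases,
# so marketing could target highly engaged visitors with individualized offers.
print(behaviour["pages_viewed"].corr(behaviour["purchases"]))
```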
Visualization
[ Data Readable ]
Big data visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
It is used to help people easily understand and interpret their data
at a glance, and to clearly show trends and patterns that arise from
this data.
Raw data comes in different formats, so creating data visualizations is a process of gathering, managing, and transforming data into a format that is most usable and meaningful.
Big Data visualization makes your data as accessible as possible to everyone within your organization, whether they have technical data skills or not.
• Readable
• Accessible
• Presentation
• Visual Forms
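As a minimal sketch, the snippet below plots pre-aggregated (invented) monthly event counts with matplotlib; a real pipeline would compute these aggregates from the raw data first.

```python
# Minimal sketch of big data visualization: a summary chart instead of raw rows.
# Values are made up; a real pipeline would aggregate them from the raw data.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
events_millions = [12, 18, 25, 31]        # pre-aggregated event counts

plt.bar(months, events_millions)
plt.title("Events per month (millions)")
plt.ylabel("Events (millions)")
plt.show()    # the trend is visible at a glance, no technical skills needed
```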
Virality
[ Data Spread ]
Virality describes how quickly information spreads across people-to-people (P2P) networks.
It measures how quickly data is spread and shared to each unique node.
Time is a determining factor, along with the rate of spread.
• P2P
• Shared
• Rate of Spread
Evolution of Big Data
1940s to 1989 – Data Warehousing and Personal
Desktop Computers
1989 to 1999 – Emergence of the World Wide Web
2000s to 2010s – Controlling Data Volume, Social Media
and Cloud Computing
2010s to now – Optimization Techniques, Mobile Devices
and IoT
Definition of Big Data
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. In short, big data is also data, but of huge size.
Challenges of Conventional System
There are three main challenges of conventional systems, which are as follows:
1. Volume of Data
2. Processing and Analyzing
3. Management of Data
Volume of Data
The volume of data is increasing day by day, especially the data generated from machines, telecommunication services, airline services, sensors, etc.
Rapid growth in data every year comes with new sources of data that keep emerging.
As per surveys, the growth in the volume of data is so rapid that IBM expected around 35 zettabytes of data to be stored in the world by 2020.
Processing & Analyzing
Processing such a large volume of data is a major challenge and is very difficult.
Organizations make use of such large volumes of data by analyzing them in order to achieve their business goals.
Extracting insights from such large amounts of data is time consuming and also takes a lot of effort.
Processing and analyzing the data is also costly, since the data comes in different formats and is complex.
Management of Data
As the data gathered comes in different formats, such as structured, semi-structured and unstructured, it is very challenging to manage such a variety of data.
Intelligent Data Analysis
Intelligent Data Analysis (IDA) is one of the major topics in the fields of artificial intelligence and information science.
Intelligent data analysis reveals implicit, previously unknown and potentially valuable
information or knowledge from large amounts of data.
It also helps in decision making.
IDA covers all areas of data visualization, data pre-processing (integration, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, the use of domain knowledge in data analysis, big data applications, evolutionary algorithms, etc.
It includes three major steps:
1. Data Preparation
2. Rules finding or data mining
3. Result validation and explanation
Intelligent Data Analysis – Cont.
Data Preparation:
It includes extracting or collecting relevant data from the source and then creating a data set.
Rules finding or Data mining:
It means working out the rules contained in the data set by means of certain methods or algorithms.
Result Validation and Explanation:
Result validation means examining these rules.
Result explanation means giving an intuitive, reasonable, and understandable description of them using logical reasoning.
The goal of IDA is to extract useful knowledge; the process demands a combination of extraction, analysis, conversion, classification, organization, reasoning, and so on.
We can apply machine learning and deep learning concepts for IDA.
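To make the three steps concrete, here is a minimal sketch using scikit-learn on a toy dataset; the dataset and the decision-tree model are illustrative choices, not part of the original notes.

```python
# Minimal sketch of the three IDA steps on a toy dataset using scikit-learn.
# The dataset and model choices are illustrative, not part of the original notes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# 1. Data preparation: collect relevant data and create a data set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Rules finding / data mining: work out rules contained in the data set.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# 3. Result validation and explanation: examine the rules and describe them.
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print(export_text(tree))     # human-readable if/else rules
```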
IDA helps in many areas:
Banking & Securities, Communications, Media, & Entertainment
Healthcare Providers
Importance of Big Data
Complex or massive data sets that are quite impractical to manage using traditional database systems and software tools are referred to as big data.
Big data is utilized by organizations in one way or another; it is the technology that makes it possible to realize big data's value.
It is a voluminous amount of both multi-structured and unstructured data.
Traditional vs. Big Data
• Confidentiality & Data Accuracy
• Data Relationship
• Data Storage Size
• Different Types of Data
• Flexibility
• Real-time Analytics
• Distributed Architecture
Major Differences between Traditional Data & Big Data
Traditional Data: Its volume ranges from gigabytes to terabytes.
Big Data: Its volume ranges from petabytes to zettabytes or exabytes.
Traditional Data: A traditional database system deals with structured data.
Big Data: A big data system deals with structured, semi-structured and unstructured data.
Traditional Data: Traditional data is generated per hour, per day or at longer intervals.
Big Data: Big data is generated far more frequently, mainly per second.
Traditional Data: The data source is centralized and it is managed in centralized form.
Big Data: The data source is distributed and it is managed in distributed form.
Traditional Data: The size of the data is very small.
Big Data: The size is much larger than traditional data.
Traditional Data: Traditional database tools are required to perform any database operation.
Big Data: Special kinds of database tools are required to perform any database operation.
Traditional Data: Normal functions can manipulate the data.
Big Data: Special kinds of functions are needed to manipulate the data.
Traditional Data: Traditional data is stable and has known inter-relationships.
Big Data: Big data is not stable and its relationships are unknown.