Big Data Basics Unit 1
What is Data?
• The quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in
the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.
• What is Big Data?
• So before I explain what Big Data is, let me also tell you what it is
not!
• The most common myth associated with it is that it is just about the
size or volume of data.
• But actually, it’s not just about the “big” amounts of data being
collected.
• Big Data refers to the large amounts of data pouring in from
various data sources in different formats.
• Huge volumes of data were stored in databases even before, but
because of the varied nature of this data, traditional relational
database systems are incapable of handling it.
• Big data refers to the large, diverse sets of information that
grow at ever-increasing rates.
• It encompasses the volume of information, the velocity or
speed at which it is created and collected, and the variety or
scope of the data points being covered.
• Big data often comes from multiple sources and arrives in
multiple formats.
• To really understand big data, it’s helpful to have some
historical background. Here is Gartner’s definition, circa
2001 (which is still the go-to definition): Big data is data that
contains greater variety arriving in increasing volumes and
with ever-higher velocity.
• Big data is larger, more complex data sets, especially from
new data sources. These data sets are so voluminous that
traditional data processing software just can’t manage them.
The History of Big Data
•Although the concept of big data itself is relatively new, the
origins of large data sets go back to the 1960s and '70s,
• when the world of data was just getting started with the first
data centers and the development of the relational database.
•Around 2005, people began to realize just how much data
users generated through Facebook, YouTube, and other online
services.
• Hadoop (an open-source framework created specifically to
store and analyze big data sets) was developed that same year.
• NoSQL also began to gain popularity during this time.
•The development of open-source frameworks such as Hadoop
was essential for the growth of big data because they made big
data easier to work with and cheaper to store.
• In the years since then, the volume of big data has skyrocketed.
•Users are still generating huge amounts of data—but it’s
not just humans who are doing it.
•With the advent of the Internet of Things (IoT), more
objects and devices are connected to the internet, gathering
data on customer usage patterns and product performance.
•The emergence of machine learning has produced still more
data.
•While big data has come far, its usefulness is only just
beginning.
What is Big Data?
Big Data is also data, but of enormous size. Big Data
is a term used to describe a collection of data that is
huge in volume and yet growing exponentially with
time.
In short, such data is so large and complex that none
of the traditional data management tools can store or
process it efficiently.
Examples Of Big Data
Following are some examples of Big Data:
1. The New York Stock Exchange generates
about one terabyte of new trade data per day.
2. Social Media
Statistics show that 500+ terabytes of new data
get ingested into the databases of the social media
site Facebook every day. This data is mainly
generated from photo and video uploads,
message exchanges, comments, etc.
3. A single jet engine can generate 10+ terabytes of
data in 30 minutes of flight time. With many
thousand flights per day, data generation
reaches many petabytes.
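The jet-engine figure above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes an average flight length of 2 hours and 25,000 flights per day; both figures are illustrative assumptions, not from the source.

```python
# Back-of-the-envelope check of the jet-engine example.
# ASSUMPTIONS (illustrative only, not from the source):
#   - average flight length of 2 hours
#   - 25,000 flights per day
TB_PER_30_MIN = 10          # terabytes per 30 minutes of flight time
flight_hours = 2            # assumed average flight length
flights_per_day = 25_000    # assumed daily flight count

tb_per_flight = TB_PER_30_MIN * (flight_hours * 60 // 30)
tb_per_day = tb_per_flight * flights_per_day
pb_per_day = tb_per_day / 1_000   # 1 petabyte = 1,000 terabytes

print(f"{tb_per_flight} TB per flight, {pb_per_day:.0f} PB per day")
```

Under these assumptions a single engine produces 40 TB per flight, and the fleet as a whole produces on the order of 1,000 PB per day, which is consistent with the "many petabytes" claim.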
Characteristics of Big Data
1. Volume
•With big data, you’ll have to process high volumes of low-
density, unstructured data.
•This can be data of unknown value, such as Twitter data
feeds, clickstreams on a webpage or a mobile app, or data from
sensor-enabled equipment.
•For some organizations, this might be tens of terabytes of
data. For others, it may be hundreds of petabytes.
2. Velocity
•Velocity is the fast rate at which data is received and
(perhaps) acted on.
•Normally, the highest velocity of data streams directly into
memory versus being written to disk.
•Some internet-enabled smart products operate in real time or
near real time and will require real-time evaluation and action.
3. Variety
•Variety refers to the many types of data that are available.
•Traditional data types were structured and fit neatly in a
relational database.
•With the rise of big data, data comes in new unstructured data
types. Unstructured and semi-structured data types, such as text,
audio, and video, require additional preprocessing to derive
meaning and support metadata.
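As a small illustration of that extra preprocessing step, the sketch below flattens a semi-structured JSON record into the kind of flat row a relational table expects. The record and its field names are invented for illustration.

```python
import json

# A semi-structured record, e.g. from a social-media API.
# Field names are invented for illustration.
raw = '{"user": {"id": 42, "name": "alice"}, "tags": ["big", "data"]}'

record = json.loads(raw)

# Flatten the nested structure into a flat, table-like row --
# the preprocessing step that structured data does not need.
row = {
    "user_id": record["user"]["id"],
    "user_name": record["user"]["name"],
    "tags": ",".join(record["tags"]),
}
print(row)
```

Structured data already arrives in this flat shape; semi-structured and unstructured data must first be parsed and reshaped before a traditional tool can work with it.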
4. Value
•Data has intrinsic value, but it’s of no use until that value is
discovered.
5. Veracity (Truth)
•How truthful is your data, and how much can you rely on it?
•This refers to the inconsistency that data can show at times,
which hampers the ability to handle and manage the data
effectively.