Lecture 1.1 Slides
Introduction
Overview
• Big Data
• Seven V’s of Big Data
• Sources of Big Data
• Types of Big Data
• The Data Science Process
• Big Data and AI
• Summary
BIG Data
“Big data is a field of data science that explores how different tools,
methodologies and techniques can be used to analyse extremely large
and complex data sets, break them down and systematically derive
insights and information from them.”

Big data is a combination of unstructured, semi-structured or structured
data collected by organizations. This data can be mined to gain insights
and used in machine learning projects, predictive modeling and other
advanced analytics applications.

How BIG is the DATA?
o In 2016, the total amount of data was estimated at 6.2 exabytes.
o In 2020, we were closer to 40,000 exabytes of data.

1 exabyte (EB) = 10^18 bytes
https://www.bornfight.com/blog/understanding-the-5-vs-of-big-data-volume-velocity-variety-veracity-value/
Seven V’s of BIG Data
• Volume
• Velocity
• Variety
• Variability
• Veracity
• Visualisation
• Value
https://bigdatapath.wordpress.com/2019/11/13/understanding-the-7-vs-of-big-data/
Seven V’s of BIG Data - Volume
Volume refers to the size of Big Data. Whether data can be considered
Big Data or not is based on the volume. The rapidly increasing volume of
data is due to cloud-computing traffic, IoT, mobile traffic etc.

When discussing Big Data volumes, almost unimaginable sizes and
unfamiliar numerical terms are required:
o Each day, the world produces 2.5 quintillion bytes of data, i.e.,
about 2.5 billion gigabytes.
o In 2020, we created approximately 40 zettabytes of data, which
is 43 trillion gigabytes.
o Most companies already have, on average, 100 terabytes of
data stored each.
o Facebook users upload a comparable amount of data daily.
o Walmart alone processes over a million transactions per hour.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Velocity
Velocity refers to the speed at which data is accumulated, mainly due to
IoT, mobile data, social media etc.

The speed at which data are generated, accumulated and analysed is on a
steep acceleration curve:
o 90% of extant data have been created in just the last two years.
o Soon there will be 19 billion network connections globally feeding
this velocity.
o There is an increasing need for real-time processing of these
enormous volumes, such as the 200 million emails, 300,000 tweets and
100 hours of YouTube videos that are passing by every minute of the day.
o Real-time processing reduces storage requirements while providing
more responsive, accurate and profitable responses.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Variability
Variability describes how fast and to what extent the data under
investigation are changing.

The intrinsic meanings and interpretations of conglomerations of raw
data depend on context. This is especially true in natural language
processing:
o A single word may have multiple meanings.
o New meanings are created and old meanings discarded over time.
o Interpreting connotations is, for instance, essential to gauging
and responding to social media buzz.
o The boundless variability of Big Data therefore presents a unique
decoding challenge if one is to take advantage of its full value.
https://serokell.io/blog/big-data
Seven V’s of BIG Data - Variety
Variety refers to structured, semi-structured and unstructured data
arising from the different sources of data generated either by humans
or by machines.

A further challenge of Big Data processing, beyond the massive volumes
and increasing velocities, lies in manipulating the enormous variety of
these data:
o Taken as a whole, these data appear as an indecipherable mass
without structure.
o Consisting of natural language, hashtags, geo-spatial data,
multimedia, sensor events and much more, the extraction of meaning from
such diversity requires ever-increasing algorithmic and computational
power.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Veracity
Veracity refers to the assurance of quality, integrity, credibility and
accuracy of the data. Since the data are collected from multiple
sources, we need to check them for accuracy before using them for
business insights.

Analysis is useless if the data being analysed are inaccurate or
incomplete:
o This situation arises when data streams originate from diverse
sources presenting a variety of formats with varying signal-to-noise
ratios.
o By the time these data arrive at a Big Data analysis stage, they may
be rife with accumulated errors that are difficult to sort out.
o It almost goes without saying that the veracity of the final analysis
is degraded without first cleaning up the data it works with.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Visualisation
Big data visualisation refers to the use of contemporary visualisation
techniques to illustrate the relationships within data.

A core task for any Big Data processing system is to transform its
immense scale into something easily comprehended and actionable:
o For human consumption, one of the best methods is converting the data
into graphical formats.
o Spreadsheets and even three-dimensional visualizations are often not
up to the task, however, due to the attributes of velocity and variety.
o There may be a multitude of spatial and temporal parameters, and
relationships between them, to condense into visual forms.
https://towardsdatascience.com/visualize-the-pandemic-with-r-covid-19-c3443de3b4e4
https://www.techopedia.com/definition/28988/big-data-visualization
Seven V’s of BIG Data - Value
Value refers to how useful the data is in decision making (Data to
Decision). We need to extract the value of Big Data using proper
analytics.

No one doubts that Big Data offers an enormous source of value to those
who can deal with its scale and unlock the knowledge within:
o Not only does Big Data offer new, more effective methods of selling,
but also vital clues to new products that meet previously undetected
market demands.
o Many industries utilize Big Data in the quest for cost reductions for
their organizations and their customers.
o Those who offer the tools and machines to handle Big Data, its
analysis and visualization also benefit hugely, albeit indirectly.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Social Data comes from the Likes, Tweets & Retweets, Comments,
Video Uploads, and general media that are uploaded and shared via the
world’s favourite social media platforms.
• Social data provides invaluable insights into consumer behaviour and sentiment.
• Social data can be enormously influential in marketing analytics.
https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Machine Data is information generated by industrial equipment,
sensors installed in machinery, and even web logs that track user
behaviour.
• Machine data is expected to grow exponentially as the Internet of
Things grows ever more pervasive and expands around the world.
• Sensors such as medical devices, smart meters, road cameras,
satellites, games and the rapidly growing Internet of Things (IoT) will
deliver high velocity, value, volume and variety of data in the very
near future.
https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Transactional Data is generated from all the daily transactions that
take place both online and offline. Invoices, payment orders, storage
records, delivery receipts – all are characterized as transactional data.
• Data alone are almost meaningless, and most organizations struggle to
make sense of the data they generate and how it can be put to good use.
https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Types of Big Data
Data analysts work with
different types of big data:
o Structured
o Semi-structured
o Quasi-structured
o Unstructured
https://serokell.io/blog/big-data
Types of Big Data (2)
Data analysts work with different types of big data:
o Structured. If your data is structured, it means that it is already
organized and convenient to work with. An example is data in Excel or
SQL databases that is tagged in a standardized format and can be easily
sorted, updated, and extracted.
o Unstructured. Unstructured data does not have any pre-defined order.
Google search results are an example of what unstructured data can
look like: articles, e-books, videos, and images.
https://serokell.io/blog/big-data
Types of Big Data (2) – Structured Vs. Unstructured Data
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Structured Data
Structured data resides in relational databases: a database
structured to recognise relations between stored items of data.
o Databases of this type are typically managed via a relational database
management system (“RDBMS“).
o Relational Database: tables of rows and columns containing related
information. Consider the “Persons” table below:
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Structured Data (Contd.)
A RDBMS uses structured query language (“SQL”) to access and
manipulate items in the RDBMS.
The benefit of structured data is
its labelling to describe
its attributes and relationships
with other data. This data
structure is easily searchable
using a human or algorithmically
generated query.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
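As a minimal sketch of the idea above, the snippet below builds an in-memory relational database with Python's standard sqlite3 module and queries it with SQL. The "Persons" table mirrors the one mentioned on the slide, but its columns and rows are invented for illustration.

```python
import sqlite3

# In-memory relational database; column names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Persons (PersonID INTEGER PRIMARY KEY, "
    "LastName TEXT, FirstName TEXT, City TEXT)"
)
conn.executemany(
    "INSERT INTO Persons VALUES (?, ?, ?, ?)",
    [(1, "Smith", "Anna", "Southport"), (2, "Jones", "Ben", "Parkwood")],
)

# Because the data is labelled, a declarative SQL query can search it.
rows = conn.execute(
    "SELECT FirstName, City FROM Persons WHERE City = 'Southport'"
).fetchall()
print(rows)  # [('Anna', 'Southport')]
```

The labelling (column names, types, keys) is exactly what makes the query above possible without inspecting each record by hand.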
Types of Big Data (2) – Unstructured Data
Unstructured data is everything else. Unstructured data:
o but is not structured via pre-defined data models or schema, i.e., not organised and labelled to identify
meaningful relationships between data
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (2)
Human-generated unstructured data: Typical human-generated unstructured data includes:
o Text files: word processing files, spreadsheets, presentations, emails.
o Email: largely text but has some internal structure thanks to its metadata (e.g., including the visible “to”,
“from”, “date / time”, “subject” entered to send an email) but also mixes in unstructured data via the
message body. For this reason, email is also referred to as semi-structured data.
o Social Media: like email, this is often semi-structured data, containing unstructured data (e.g., a Tweet)
but also structured data (e.g., the number of “Likes”, “retweets”, “date”, “author” etc.).
o Websites: YouTube, Instagram etc. contain lots of unstructured data, but also much structured data, e.g.,
like described above for Twitter.
o Mobile data: text messages, locations.
o Communications: IMs, Dictaphone recordings.
o Media: MP3, digital photos, audio recordings and video files.
o Business applications: MS Office documents, PDFs and similar.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (3)
Machine-generated unstructured data - Common types of machine-generated
unstructured data include:
o Satellite imagery: weather data, geographic forms, military movements.
o Scientific data: oil and gas exploration, space exploration, seismic imagery and atmospheric
data.
o Digital surveillance: CCTV.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (4)
Unstructured legal data - In the legal context, unstructured data is common across
the following areas:
o Document / Email Management: although the organisation of the DMS is structured (e.g. basic metadata (data about data):
file names, doc IDs, version numbers, creation / edit / read dates etc.), the most valuable content is unstructured, i.e. the
contents of the constituent documents and emails. For this reason, it is often a pain to search and analyse this data in a
meaningful manner: finding a specific clause wording requires finding target document types, opening those and scrolling
around inside, because there is no structured data about the content of that document (i.e. down to the clause or
intra-clause level), only its basic metadata. Unfortunately, it is precisely that type of data which is most useful, but least
accessible, to a lawyer.
o eDiscovery: most of the content under review is email, email attachments (i.e. MS office docs, images, PDFs and sometimes
voice) and naturally suffers from the same limitations described for document and email management.
o Legal Due Diligence: the content is almost exclusively MS Word and PDF docs but also sometimes spreadsheets and slide
decks – again, like the above it’s all unstructured beyond the basic metadata.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – The split of structured and unstructured data
On average unstructured data makes up 80%+ of today’s enterprise data, with the
remaining 20% being structured data.
Not only does unstructured data account for the majority of enterprise data, but the amount of
unstructured data is also growing at an average rate of 55% – 65% per year.
Unstructured data has grown, and continues to grow, because of:
o decreasing costs of data storage and processing power;
o ever-widening use of technology to create and manage work product (accelerated by minicomputers, then PCs and now
mobile and IoT devices etc.); and
o the internet and ever-increasing interconnectedness of devices and data.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (3)
Data analysts work with different types of big data:
o Semi-structured. Semi-structured data has been pre-processed but it
doesn’t look like a ‘normal’ SQL database. It can contain some tags,
such as data formats. JSON or XML files are examples of semi-structured
data. Some tools for data analytics can work with them.
o Quasi-structured. This is something in between unstructured and
semi-structured data. An example is textual content with erratic data
formats, such as the information about what web pages a user visited
and in what order.
https://serokell.io/blog/big-data
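A small sketch of semi-structured data: the JSON record below (invented, loosely modelled on a social-media post) mixes tagged, queryable fields with free text, which is what lets some analytics tools work with it directly.

```python
import json

# Hypothetical semi-structured record: tagged fields alongside free text.
raw = '{"author": "user42", "likes": 17, "body": "Loving the new phone!"}'
record = json.loads(raw)

# The tags make parts of it queryable without a fixed relational schema;
# the "body" text remains unstructured.
print(record["author"], record["likes"])  # user42 17
```

XML behaves similarly: tags provide partial structure, but there is no rigid table layout as in an SQL database.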
The Data Science Process
[Process diagram] The cycle runs through the following stages:
o Ask an interesting question
o Get the data (wrangle, clean, explore)
o Model and analyse the data
o Validate the model
o Communicate the results (tell the story)
https://www.datapine.com/blog/data-analysis-questions/
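The cycle above can be sketched as a chain of placeholder functions; all function names and the toy data are illustrative, not a real workflow.

```python
# A minimal sketch of the data science process as a pipeline of stages.
def wrangle(raw):
    # Gather and reshape the raw records (here: strip whitespace).
    return [r.strip() for r in raw]

def clean(records):
    # Drop obviously bad entries (here: empty strings).
    return [r for r in records if r]

def explore_and_model(records):
    # Explore and "model" the data (here: a trivial record count).
    return {"n_records": len(records)}

def communicate(result):
    # Tell the story of the analysis.
    return f"We analysed {result['n_records']} records."

raw = [" 12 ", "", " 7 "]
print(communicate(explore_and_model(clean(wrangle(raw)))))
```

In practice each stage is far richer, and the process loops back: exploration often changes the question being asked.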
The Data Science Process – Ask an interesting question (2)
The Key to Asking Good Data Analysis Questions
Asking the key questions when analysing data can define your next
strategy in developing your company.
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Ask an interesting question (3)
2) What standard KPIs will you use that can help?
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Ask an interesting question (4)
Let’s see this through a straightforward example.
o You are a retail company and want to know what you sell, where, and
when – remember the specific questions for analysing data? In the
example above, it is clear that the amount of sales performed over a
set period of time tells you when demand is higher or lower – you have
your specific KPI answer. You can then dig deeper into the insights,
establish additional sales opportunities, and identify underperforming
areas that affect the overall sales of products.
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data
3) Where will your data come from?
o The abundance of data sources may make things complicated. Your next
step is to “edit” these sources and make sure their data quality is up
to par, which may rule some of them out as useful choices.
o You can use CRM data, data from platforms like Facebook and Google
Analytics, financial data from your company – let your imagination go
wild (as long as the data source is relevant to the questions you
identified in steps 1 and 2).
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (2)
4) Which scales apply to your different datasets?
o Ordinal - An ordinal scale is one where the order matters but not the
difference between values. E.g. “You might ask patients to express the
amount of pain they are feeling on a scale of 1 to 10. A score of 7
means more pain than a score of 5, and that is more than a score of 3.
But the difference between the 7 and the 5 may not be the same as that
between 5 and 3.”
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (3)
4) Which scales apply to your different datasets?
o Ratio - A ratio variable has all the properties of an interval
variable, and also has a clear definition of 0.0: when the variable
equals 0.0, there is none of that variable. Examples of ratio variables
include:
• enzyme activity, dose amount, reaction rate, flow rate, concentration,
pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does
mean “no heat”), survival time.
https://www.datapine.com/blog/data-analysis-questions/
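The distinction above has practical consequences for which summaries are valid. A small sketch (pain scores and reaction times are invented for illustration):

```python
import statistics

pain = [7, 5, 3, 7, 5]       # ordinal: order matters, differences do not
times_s = [0.8, 1.6, 0.4]    # ratio: a true zero, so ratios are meaningful

# For ordinal data the median is a safe summary; a mean would wrongly
# treat the gap between 7 and 5 as equal to the gap between 5 and 3.
print(statistics.median(pain))     # 5

# For ratio data, statements like "four times as long" are valid.
print(times_s[1] / times_s[2])     # 4.0
```

Knowing each dataset's scale up front prevents applying statistics that the scale cannot support.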
The Data Science Process – Get the Data (4)
5) How can you ensure data quality?
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (5)
Data Quality
https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
The Data Science Process – Get the Data (5)
Data Quality
Timeliness depends on user expectation. Online availability of data may
be required for a room allocation system in hospitality, but nightly
data could be perfectly acceptable for a billing system.
https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
The Data Science Process – Get the Data (5)
Data Quality – A concrete example of data quality issues
https://ncube.com/blog/big-data-and-ai
The Data Science Process – Model and Analyse the Data
Data analysis is the process of cleaning, transforming, and modelling
data to discover useful information for business decision-making: we
extract useful information from the data and take decisions based upon
the analysis.
https://www.guru99.com/what-is-data-analysis.html
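A minimal sketch of "clean, transform, model" on toy sales figures; the data, the share calculation, and the decision threshold are all invented for illustration.

```python
# Raw input mixes valid numbers with junk entries.
raw = ["120", "95", "n/a", "140", ""]

# Clean: keep only entries that are numbers.
values = [int(v) for v in raw if v.isdigit()]

# Transform: normalise each figure to a share of total sales.
total = sum(values)
shares = [round(v / total, 2) for v in values]

# "Model": a simple decision rule supporting business decision-making.
decision = "restock" if max(values) > 130 else "hold"
print(values, shares, decision)
```

Even this toy pipeline shows the shape of the definition above: the decision at the end is only as good as the cleaning and transformation that precede it.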
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
o Statistical Analysis answers “What happened?” by using past data,
often in the form of dashboards. It includes the collection, analysis,
interpretation, presentation, and modelling of data, applied to a
complete data set or a sample of it. There are two categories of this
type of analysis - Descriptive Analysis and Inferential Analysis.
• Descriptive Analysis - analyses complete data or a sample of
summarised numerical data. It reports the mean and deviation for
continuous data, and percentages and frequencies for categorical data.
• Inferential Analysis - analyses a sample drawn from the complete data.
In this type of analysis, you can reach different conclusions from the
same data by selecting different samples.
https://www.guru99.com/what-is-data-analysis.html
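The two categories above can be illustrated in a few lines with Python's standard statistics and random modules (the data set is invented):

```python
import random
import statistics

# Descriptive analysis: summarise the complete data set.
data = [4, 8, 6, 5, 3, 7, 9, 5]
print(statistics.mean(data), statistics.stdev(data))

# Inferential analysis works on samples: different samples from the same
# data can lead to different conclusions, as noted above.
random.seed(0)  # fixed seed so the sketch is reproducible
sample_a = random.sample(data, 4)
sample_b = random.sample(data, 4)
print(statistics.mean(sample_a), statistics.mean(sample_b))
```

The two sample means generally differ from each other and from the full-data mean, which is exactly the sampling effect the slide warns about.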
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
o Diagnostic Analysis answers “Why did it happen?” by finding the cause
behind the insight found in Statistical Analysis.
• This analysis is useful for identifying behaviour patterns in the data.
• If a new problem arises in your business process, you can use this
analysis to find similar patterns to that problem, and you may be able
to apply similar prescriptions to the new problem.
https://www.guru99.com/what-is-data-analysis.html
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
Predictive analytics encompasses a variety of statistical techniques
from data mining, predictive modelling, and machine learning that
analyse current and historical facts to make predictions about future
or otherwise unknown events.

Predictive Analysis shows “what is likely to happen” by using previous
data.
• The simplest example: if last year I bought two dresses based on my
savings, and this year my salary doubles, then I can buy four dresses.
• This analysis makes predictions about future outcomes based on
current or past data. Forecasting is just an estimate; its accuracy
depends on how much detailed information you have and how deeply you
dig into it.
https://www.guru99.com/what-is-data-analysis.html
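A minimal predictive sketch: fit a straight-line trend to past yearly sales by least squares and extrapolate one year ahead. The years and sales figures are invented, and a real predictive model would be far richer than a single trend line.

```python
years = [2018, 2019, 2020, 2021]
sales = [10.0, 12.0, 14.0, 16.0]   # illustrative figures, in millions

# Ordinary least-squares fit of sales = intercept + slope * year.
n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales)) \
        / sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

# Extrapolate the trend to the next year.
forecast_2022 = intercept + slope * 2022
print(forecast_2022)  # 18.0
```

As the slide stresses, such a forecast is just an estimate: it assumes the past trend continues, and its accuracy depends on how much detail the historical data carries.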
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
Prescriptive analytics is referred to as the “final frontier of
analytic capabilities”: it not only anticipates what will happen and
when it will happen, but also why it will happen.

Prescriptive Analysis combines the insight from all previous analyses
to determine which action to take on a current problem or decision.
• Most data-driven companies utilize Prescriptive Analysis because
predictive and descriptive analysis are not enough to improve data
performance. Based on current situations and problems, they analyse
the data and make decisions.
https://medium.com/comet-ml/a-data-scientists-guide-to-communicating-results-c79a5ef3e9f1
The Data Science Process – Communicate the results (2)
Is the right visualisation tool (such as a chart, graph or map)
important? Compare the raw table below with the remoteness map at the
link:

Suburb       Status
Southport    Major Cities
Parkwood     Major Cities
Coober Pedy  Remote
https://infographic.tv/data-visualization-remoteness-index-map-of-australia/
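Even a crude visual form can communicate more at a glance than a table of values. A stdlib-only sketch, with counts invented from the toy table above:

```python
# Tally of suburbs per remoteness status (illustrative counts).
status_counts = {"Major Cities": 2, "Remote": 1}

# Render a minimal text bar chart: one '#' per suburb.
lines = [f"{status:12s} {'#' * count}"
         for status, count in status_counts.items()]
print("\n".join(lines))
```

A real remoteness map, as at the link above, encodes the same information spatially, which is usually the better choice for geographic data.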
Big Data and AI - How do big data and AI work together?
Big Data and AI – Use Cases
o Detecting anomalies - AI can analyse data to detect unusual
occurrences. For example, with a network of sensors that have a
predefined appropriate range, anything outside of that range is an
anomaly.
o Probability of future outcomes - Using known conditions that have a
certain probability of influencing the future outcome, AI can determine
the likelihood of that outcome.
o AI can recognise patterns - AI can see patterns that humans don’t.
o Data bars and graphs - AI can look for patterns in bars and graphs
that might otherwise stay undetected under human supervision.
https://ncube.com/blog/big-data-and-ai
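The first use case above can be sketched in a few lines. The sensor names, readings, and "appropriate range" are all invented for illustration; real anomaly detection would typically learn the range from data rather than hard-code it.

```python
# Predefined appropriate range, e.g. expected temperature in degrees C.
APPROPRIATE_RANGE = (15.0, 30.0)

# Hypothetical readings from a network of sensors.
readings = {"sensor_1": 21.4, "sensor_2": 48.9, "sensor_3": 17.0}

# Anything outside the range is flagged as an anomaly.
anomalies = {name: value for name, value in readings.items()
             if not (APPROPRIATE_RANGE[0] <= value <= APPROPRIATE_RANGE[1])}
print(anomalies)  # {'sensor_2': 48.9}
```

This is the simplest possible detector; the point is only that the decision rule scales automatically to any number of sensors, which is where Big Data and AI meet.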
Summary
• Introduction to Big Data, the seven V’s, the sources of big data and
the types of big data, the data science process, and Big Data and AI.