Lecture 1.1 Slides
Introduction
Overview
• Big Data
• Seven V’s of Big Data
• Sources of Big Data
• Types of Big Data
• The Data Science Process
• Big Data and AI
• Summary
BIG Data
“Big data is a field of data science that explores how different tools,
methodologies and techniques can be used to analyse extremely large
and complex data sets, break them down and systematically derive
insights and information from them.”

Big data is a combination of unstructured, semi-structured or structured
data collected by organizations. This data can be mined to gain insights
and used in machine learning projects, predictive modeling and other
advanced analytics applications.

How BIG is the DATA?
o In 2016, the total amount of data was estimated at 6.2 exabytes.
o In 2020, we were closer to 40,000 exabytes of data.

1 exabyte (EB) = 10^18 bytes
https://www.bornfight.com/blog/understanding-the-5-vs-of-big-data-volume-velocity-variety-veracity-value/
Seven V’s of BIG Data
• Volume
• Velocity
• Variety
• Variability
• Veracity
• Visualisation
• Value
https://bigdatapath.wordpress.com/2019/11/13/understanding-the-7-vs-of-big-data/
Seven V’s of BIG Data - Volume
Volume refers to the size of Big Data. Whether data can be considered
Big Data or not is based on the volume. The rapidly increasing volume of
data is due to cloud-computing traffic, IoT, mobile traffic etc.

When discussing Big Data volumes, almost unimaginable sizes and
unfamiliar numerical terms are required:
o Each day, the world produces 2.5 quintillion bytes of data, i.e.,
about 2.5 billion gigabytes.
o In 2020, we created approximately 40 zettabytes of data, which
is 43 trillion gigabytes.
o Most companies already have, on average, 100 terabytes of
data stored each.
o Facebook users upload a comparable amount of data daily.
o Walmart alone processes over a million transactions per hour.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Velocity
Velocity refers to the speed at which data is accumulated, mainly due to
IoT, mobile data, social media etc.

The speed at which data are generated, accumulated and analysed is on a
steep acceleration curve:
o 90% of extant data have been created in just the last two years.
o Soon there will be 19 billion network connections globally feeding
this velocity.
o There is an increasing need for real-time processing of these
enormous volumes, such as the 200 million emails, 300,000 tweets and
100 hours of YouTube videos that are passing by every minute of the day.
o Real-time processing reduces storage requirements while providing
more responsive, accurate and profitable responses.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Variability
Variability describes how fast and to what extent the data under
investigation are changing.

The intrinsic meanings and interpretations of conglomerations of raw
data depend on context. This is especially true in natural language
processing:
o A single word may have multiple meanings.
o New meanings are created and old meanings discarded over time.
o Interpreting connotations is, for instance, essential to gauging
and responding to social media buzz.
o The boundless variability of Big Data therefore presents a unique
decoding challenge if one is to take advantage of its full value.
https://serokell.io/blog/big-data
Seven V’s of BIG Data - Variety
Variety refers to structured, semi-structured and unstructured data
arising from the different sources of data generated either by humans
or by machines.

A further challenge of Big Data processing, beyond the massive volumes
and increasing velocities, lies in manipulating the enormous variety of
these data:
o Taken as a whole, these data appear as an indecipherable mass
without structure.
o Consisting of natural language, hashtags, geo-spatial data,
multimedia, sensor events and much more, the extraction of meaning from
such diversity requires ever-increasing algorithmic and computational
power.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Veracity
Veracity refers to the assurance of quality, integrity, credibility and
accuracy of the data. Since the data are collected from multiple
sources, we need to check them for accuracy before using them for
business insights.

Analysis is useless if the data being analysed are inaccurate or
incomplete:
o This situation arises when data streams originate from diverse
sources presenting a variety of formats with varying signal-to-noise
ratios.
o By the time these data arrive at a Big Data analysis stage, they may
be rife with accumulated errors that are difficult to sort out.
o It almost goes without saying that the veracity of the final analysis
is degraded without first cleaning up the data it works with.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Visualisation
Big data visualisation refers to the use of contemporary visualisation
techniques to illustrate the relationships within data.

A core task for any Big Data processing system is to transform its
immense scale into something easily comprehended and actionable:
o For human consumption, one of the best methods is converting the data
into graphical formats.
o Spreadsheets and even three-dimensional visualizations are often not
up to the task, however, due to the attributes of velocity and variety.
o There may be a multitude of spatial and temporal parameters, and
relationships between them, to condense into visual forms.
https://towardsdatascience.com/visualize-the-pandemic-with-r-covid-19-c3443de3b4e4
https://www.techopedia.com/definition/28988/big-data-visualization
Seven V’s of BIG Data - Value
Value refers to how useful the data is in decision making (Data to
Decision). We need to extract the value of Big Data using proper
analytics.

No one doubts that Big Data offers an enormous source of value to those
who can deal with its scale and unlock the knowledge within:
o Not only does Big Data offer new, more effective methods of selling,
but also vital clues to new products that meet previously undetected
market demands.
o Many industries utilize Big Data in the quest for cost reductions for
their organizations and their customers.
o Those who offer the tools and machines to handle Big Data, its
analysis and visualization also benefit hugely, albeit indirectly.
https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Social Data comes from the Likes, Tweets & Retweets, Comments,
Video Uploads, and general media that are uploaded and shared via the
world’s favourite social media platforms.
• Social data provides invaluable insights into consumer behaviour and sentiment.
• Social data can be enormously influential in marketing analytics.
https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Machine Data is information generated by industrial equipment,
sensors installed in machinery, and even web logs that track user
behaviour.
• Machine data is expected to grow exponentially as the Internet of
Things grows ever more pervasive and expands around the world.
• Sensors such as medical devices, smart meters, road cameras,
satellites, games and the rapidly growing Internet of Things (IoT) will
deliver high velocity, value, volume and variety of data in the very
near future.
https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Transactional Data is generated from all the daily transactions that
take place both online and offline. Invoices, payment orders, storage
records, delivery receipts – all are characterized as transactional data.
• Data alone are almost meaningless, and most organizations struggle to
make sense of the data they generate and how it can be put to good use.
https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Types of Big Data
Data analysts work with
different types of big data:
o Structured
o Semi-structured
o Quasi-structured
o Unstructured
https://serokell.io/blog/big-data
Types of Big Data (2)
Data analysts work with different types of big data:
o Structured. If your data is structured, it means that it is already
organized and convenient to work with. An example is data in Excel or
SQL databases that is tagged in a standardized format and can be easily
sorted, updated, and extracted.
o Unstructured. Unstructured data does not have any pre-defined order.
Google search results are an example of what unstructured data can
look like: articles, e-books, videos, and images.
https://serokell.io/blog/big-data
Types of Big Data (2) – Structured Vs. Unstructured Data
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Structured Data
Structured data resides in relational databases: a database
structured to recognise relations between stored items of data.
o Databases of this type are typically managed via a relational database
management system (“RDBMS“).
o Relational Database: tables of rows and columns containing related
information. Consider the “Persons” table below:
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Structured Data (Contd.)
A RDBMS uses structured query language (“SQL”) to access and
manipulate items in the RDBMS.
The benefit of structured data is
its labelling to describe
its attributes and relationships
with other data. This data
structure is easily searchable
using a human or algorithmically
generated query.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
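As a minimal sketch of the idea above, the snippet below builds an in-memory relational database with Python's standard sqlite3 module and queries it with SQL. The "Persons" table mirrors the one mentioned on the slide, but its columns and rows are invented for illustration.

```python
import sqlite3

# In-memory relational database; column names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Persons (PersonID INTEGER PRIMARY KEY, "
    "LastName TEXT, FirstName TEXT, City TEXT)"
)
conn.executemany(
    "INSERT INTO Persons VALUES (?, ?, ?, ?)",
    [(1, "Smith", "Anna", "Southport"), (2, "Jones", "Ben", "Parkwood")],
)

# Because the data is labelled, a declarative SQL query can search it.
rows = conn.execute(
    "SELECT FirstName, City FROM Persons WHERE City = 'Southport'"
).fetchall()
print(rows)  # [('Anna', 'Southport')]
```

The labelling (column names, types, keys) is exactly what makes the query above possible without inspecting each record by hand.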
Types of Big Data (2) – Unstructured Data
Unstructured data is everything else. Unstructured data:
o but is not structured via pre-defined data models or schema, i.e., not organised and labelled to identify
meaningful relationships between data
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (2)
Human-generated unstructured data: Typical human-generated unstructured data includes:
o Text files: word processing files, spreadsheets, presentations, emails.
o Email: largely text but has some internal structure thanks to its metadata (e.g., including the visible “to”,
“from”, “date / time”, “subject” entered to send an email) but also mixes in unstructured data via the
message body. For this reason, email is also referred to as semi-structured data.
o Social Media: like email, this is often semi-structured data, containing unstructured data (e.g., a Tweet)
but also structured data (e.g., the number of “Likes”, “retweets”, “date”, “author” etc.).
o Websites: YouTube, Instagram etc. contain lots of unstructured data, but also much structured data, e.g.,
like described above for Twitter.
o Mobile data: text messages, locations.
o Communications: IMs, Dictaphone recordings.
o Media: MP3, digital photos, audio recordings and video files.
o Business applications: MS Office documents, PDFs and similar.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (3)
Machine-generated unstructured data - Common types of machine-generated
unstructured data include:
o Satellite imagery: weather data, geographic forms, military movements.
o Scientific data: oil and gas exploration, space exploration, seismic imagery and atmospheric
data.
o Digital surveillance: CCTV.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (4)
Unstructured legal data - In the legal context, unstructured data is common across
the following areas:
o Document / Email Management: although the organisation of the DMS is structured (e.g. basic metadata (data about data):
file names, doc IDs, version numbers, creation / edit / read dates etc.), the most valuable content is unstructured, i.e. the
contents of the constituent documents and emails. For this reason, it is often a pain to search and analyse this data in a
meaningful manner: finding a specific clause wording requires finding target document types, opening those and scrolling
around inside, because there is no structured data about the content of that document (i.e. down to the clause or
intra-clause level), only its basic metadata. Unfortunately, it is precisely that type of data which is most useful, but least
accessible, to a lawyer.
o eDiscovery: most of the content under review is email, email attachments (i.e. MS office docs, images, PDFs and sometimes
voice) and naturally suffers from the same limitations described for document and email management.
o Legal Due Diligence: the content is almost exclusively MS Word and PDF docs but also sometimes spreadsheets and slide
decks – again, like the above it’s all unstructured beyond the basic metadata.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – The split of structured and unstructured data
On average unstructured data makes up 80%+ of today’s enterprise data, with the
remaining 20% being structured data.
Not only does unstructured data account for the majority of enterprise data, but the amount of
unstructured data is also growing at an average rate of 55% – 65% per year.
Unstructured data has grown, and continues to grow, because of:
o decreasing costs of data storage and processing power;
o ever-widening use of technology to create and manage work product (accelerated by minicomputers, then PCs and now
mobile and IoT devices etc.); and
o the internet and ever-increasing interconnectedness of devices and data.
https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (3)
Data analysts work with different types of big data:
o Semi-structured. Semi-structured data has been pre-processed but it
doesn’t look like a ‘normal’ SQL database. It can contain some tags,
such as data formats. JSON or XML files are examples of semi-structured
data. Some tools for data analytics can work with them.
o Quasi-structured. This is something in between unstructured and
semi-structured data. An example is textual content with erratic data
formats, such as the information about what web pages a user visited
and in what order.
https://serokell.io/blog/big-data
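A small sketch of semi-structured data: the JSON record below (invented, loosely modelled on a social-media post) mixes tagged, queryable fields with free text, which is what lets some analytics tools work with it directly.

```python
import json

# Hypothetical semi-structured record: tagged fields alongside free text.
raw = '{"author": "user42", "likes": 17, "body": "Loving the new phone!"}'
record = json.loads(raw)

# The tags make parts of it queryable without a fixed relational schema;
# the "body" text remains unstructured.
print(record["author"], record["likes"])  # user42 17
```

XML behaves similarly: tags provide partial structure, but there is no rigid table layout as in an SQL database.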
The Data Science Process
[Process diagram] The cycle runs through the following stages:
o Ask an interesting question
o Get the data (wrangle, clean, explore)
o Model and analyse the data
o Validate the model
o Communicate the results (tell the story)
https://www.datapine.com/blog/data-analysis-questions/
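The cycle above can be sketched as a chain of placeholder functions; all function names and the toy data are illustrative, not a real workflow.

```python
# A minimal sketch of the data science process as a pipeline of stages.
def wrangle(raw):
    # Gather and reshape the raw records (here: strip whitespace).
    return [r.strip() for r in raw]

def clean(records):
    # Drop obviously bad entries (here: empty strings).
    return [r for r in records if r]

def explore_and_model(records):
    # Explore and "model" the data (here: a trivial record count).
    return {"n_records": len(records)}

def communicate(result):
    # Tell the story of the analysis.
    return f"We analysed {result['n_records']} records."

raw = [" 12 ", "", " 7 "]
print(communicate(explore_and_model(clean(wrangle(raw)))))
```

In practice each stage is far richer, and the process loops back: exploration often changes the question being asked.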
The Data Science Process – Ask an interesting question (2)
The Key to Asking Good Data Analysis Questions
Asking the key questions when analysing data can define your next
strategy in developing your company.
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Ask an interesting question (3)
2) What standard KPIs will you use that can help?
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Ask an interesting question (4)
Let’s see this through a straightforward example.
o You are a retail company and want to know what you sell, where, and
when – remember the specific questions for analysing data? In the
example above, it is clear that the amount of sales performed over a
set period of time tells you when demand is higher or lower – you have
your specific KPI answer. You can then dig deeper into the insights,
establish additional sales opportunities, and identify underperforming
areas that affect the overall sales of products.
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data
3) Where will your data come from?
o The abundance of data sources may make things complicated. Your next
step is to “edit” these sources and make sure their data quality is up
to par, which may rule some of them out as useful choices.
o You can use CRM data, data from platforms like Facebook and Google
Analytics, financial data from your company – let your imagination go
wild (as long as the data source is relevant to the questions you
identified in steps 1 and 2).
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (2)
4) Which scales apply to your different datasets?
o Ordinal - An ordinal scale is one where the order matters but not the
difference between values. E.g. “You might ask patients to express the
amount of pain they are feeling on a scale of 1 to 10. A score of 7
means more pain than a score of 5, and that is more than a score of 3.
But the difference between the 7 and the 5 may not be the same as that
between 5 and 3.”
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (3)
4) Which scales apply to your different datasets?
o Ratio - A ratio variable has all the properties of an interval
variable, and also has a clear definition of 0.0: when the variable
equals 0.0, there is none of that variable. Examples of ratio variables
include:
• enzyme activity, dose amount, reaction rate, flow rate, concentration,
pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does
mean “no heat”), survival time.
https://www.datapine.com/blog/data-analysis-questions/
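The distinction above has practical consequences for which summaries are valid. A small sketch (pain scores and reaction times are invented for illustration):

```python
import statistics

pain = [7, 5, 3, 7, 5]       # ordinal: order matters, differences do not
times_s = [0.8, 1.6, 0.4]    # ratio: a true zero, so ratios are meaningful

# For ordinal data the median is a safe summary; a mean would wrongly
# treat the gap between 7 and 5 as equal to the gap between 5 and 3.
print(statistics.median(pain))     # 5

# For ratio data, statements like "four times as long" are valid.
print(times_s[1] / times_s[2])     # 4.0
```

Knowing each dataset's scale up front prevents applying statistics that the scale cannot support.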
The Data Science Process – Get the Data (4)
5) How can you ensure data quality?
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (5)
Data Quality
https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
The Data Science Process – Get the Data (5)
Data Quality
Timeliness depends on user expectation. Online availability of data may
be required for a room allocation system in hospitality, but nightly
data could be perfectly acceptable for a billing system.
https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
The Data Science Process – Get the Data (5)
Data Quality – A concrete example of data quality issues
https://ncube.com/blog/big-data-and-ai
The Data Science Process – Model and Analyse the Data
Data analysis is the process of cleaning, transforming, and modelling
data to discover useful information for business decision-making: we
extract useful information from the data and take decisions based upon
the analysis.
https://www.guru99.com/what-is-data-analysis.html
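A minimal sketch of "clean, transform, model" on toy sales figures; the data, the share calculation, and the decision threshold are all invented for illustration.

```python
# Raw input mixes valid numbers with junk entries.
raw = ["120", "95", "n/a", "140", ""]

# Clean: keep only entries that are numbers.
values = [int(v) for v in raw if v.isdigit()]

# Transform: normalise each figure to a share of total sales.
total = sum(values)
shares = [round(v / total, 2) for v in values]

# "Model": a simple decision rule supporting business decision-making.
decision = "restock" if max(values) > 130 else "hold"
print(values, shares, decision)
```

Even this toy pipeline shows the shape of the definition above: the decision at the end is only as good as the cleaning and transformation that precede it.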
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
o Statistical Analysis answers “What happened?” by using past data,
often in the form of dashboards. It includes the collection, analysis,
interpretation, presentation, and modelling of data, applied to a
complete data set or a sample of it. There are two categories of this
type of analysis - Descriptive Analysis and Inferential Analysis.
• Descriptive Analysis - analyses complete data or a sample of
summarised numerical data. It reports the mean and deviation for
continuous data, and percentages and frequencies for categorical data.
• Inferential Analysis - analyses a sample drawn from the complete data.
In this type of analysis, you can reach different conclusions from the
same data by selecting different samples.
https://www.guru99.com/what-is-data-analysis.html
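The two categories above can be illustrated in a few lines with Python's standard statistics and random modules (the data set is invented):

```python
import random
import statistics

# Descriptive analysis: summarise the complete data set.
data = [4, 8, 6, 5, 3, 7, 9, 5]
print(statistics.mean(data), statistics.stdev(data))

# Inferential analysis works on samples: different samples from the same
# data can lead to different conclusions, as noted above.
random.seed(0)  # fixed seed so the sketch is reproducible
sample_a = random.sample(data, 4)
sample_b = random.sample(data, 4)
print(statistics.mean(sample_a), statistics.mean(sample_b))
```

The two sample means generally differ from each other and from the full-data mean, which is exactly the sampling effect the slide warns about.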
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
o Diagnostic Analysis answers “Why did it happen?” by finding the cause
behind the insight found in Statistical Analysis.
• This analysis is useful for identifying behaviour patterns in the data.
• If a new problem arises in your business process, you can use this
analysis to find similar patterns to that problem, and you may be able
to apply similar prescriptions to the new problem.
https://www.guru99.com/what-is-data-analysis.html
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
Predictive analytics encompasses a variety of statistical techniques
from data mining, predictive modelling, and machine learning that
analyse current and historical facts to make predictions about future
or otherwise unknown events.

Predictive Analysis shows “what is likely to happen” by using previous
data.
• The simplest example: if last year I bought two dresses based on my
savings, and this year my salary doubles, then I can buy four dresses.
• This analysis makes predictions about future outcomes based on
current or past data. Forecasting is just an estimate; its accuracy
depends on how much detailed information you have and how deeply you
dig into it.
https://www.guru99.com/what-is-data-analysis.html
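A minimal predictive sketch: fit a straight-line trend to past yearly sales by least squares and extrapolate one year ahead. The years and sales figures are invented, and a real predictive model would be far richer than a single trend line.

```python
years = [2018, 2019, 2020, 2021]
sales = [10.0, 12.0, 14.0, 16.0]   # illustrative figures, in millions

# Ordinary least-squares fit of sales = intercept + slope * year.
n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales)) \
        / sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

# Extrapolate the trend to the next year.
forecast_2022 = intercept + slope * 2022
print(forecast_2022)  # 18.0
```

As the slide stresses, such a forecast is just an estimate: it assumes the past trend continues, and its accuracy depends on how much detail the historical data carries.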
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
Prescriptive analytics is referred to as the “final frontier of
analytic capabilities”: it not only anticipates what will happen and
when it will happen, but also why it will happen.

Prescriptive Analysis combines the insight from all previous analyses
to determine which action to take on a current problem or decision.
• Most data-driven companies utilize Prescriptive Analysis because
predictive and descriptive analysis are not enough to improve data
performance. Based on current situations and problems, they analyse
the data and make decisions.
https://medium.com/comet-ml/a-data-scientists-guide-to-communicating-results-c79a5ef3e9f1
The Data Science Process – Communicate the results (2)
Is the right visualisation tool (such as a chart, graph or map)
important? Compare the raw table below with the remoteness map at the
link:

Suburb       Status
Southport    Major Cities
Parkwood     Major Cities
Coober Pedy  Remote
https://infographic.tv/data-visualization-remoteness-index-map-of-australia/
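Even a crude visual form can communicate more at a glance than a table of values. A stdlib-only sketch, with counts invented from the toy table above:

```python
# Tally of suburbs per remoteness status (illustrative counts).
status_counts = {"Major Cities": 2, "Remote": 1}

# Render a minimal text bar chart: one '#' per suburb.
lines = [f"{status:12s} {'#' * count}"
         for status, count in status_counts.items()]
print("\n".join(lines))
```

A real remoteness map, as at the link above, encodes the same information spatially, which is usually the better choice for geographic data.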
Big Data and AI - How do big data and AI work together?
Big Data and AI – Use Cases
o Detecting anomalies - AI can analyse data to detect unusual
occurrences. For example, with a network of sensors that have a
predefined appropriate range, anything outside of that range is an
anomaly.
o Probability of future outcomes - Using known conditions that have a
certain probability of influencing the future outcome, AI can determine
the likelihood of that outcome.
o AI can recognise patterns - AI can see patterns that humans don’t.
o Data bars and graphs - AI can look for patterns in bars and graphs
that might otherwise stay undetected under human supervision.
https://ncube.com/blog/big-data-and-ai
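The first use case above can be sketched in a few lines. The sensor names, readings, and "appropriate range" are all invented for illustration; real anomaly detection would typically learn the range from data rather than hard-code it.

```python
# Predefined appropriate range, e.g. expected temperature in degrees C.
APPROPRIATE_RANGE = (15.0, 30.0)

# Hypothetical readings from a network of sensors.
readings = {"sensor_1": 21.4, "sensor_2": 48.9, "sensor_3": 17.0}

# Anything outside the range is flagged as an anomaly.
anomalies = {name: value for name, value in readings.items()
             if not (APPROPRIATE_RANGE[0] <= value <= APPROPRIATE_RANGE[1])}
print(anomalies)  # {'sensor_2': 48.9}
```

This is the simplest possible detector; the point is only that the decision rule scales automatically to any number of sensors, which is where Big Data and AI meet.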
Summary
• Introduction to Big Data, the seven V’s, the sources of big data and
the types of big data, the data science process, and Big Data and AI.