Lecture 1.1 Slides

Introducing Big Data

Introduction
Overview
• Big Data
• Seven V’s of Big Data
• Sources of Big Data
• Types of Big Data
• The Data Science Process
• Big Data and AI
• Summary
BIG Data
“Big data is a field of data science that explores how different tools,
methodologies and techniques can be used to analyse extremely large
and complex data sets, break them down and systematically derive
insights and information from them.”

Big data is a combination of unstructured, semi-structured or structured
data collected by organizations. This data can be mined to gain insights
and used in machine learning projects, predictive modeling and other
advanced analytics applications.

How BIG is the DATA?
o In 2016, the total amount of data was estimated to be 6.2 exabytes.
o In 2020, we were closer to 40,000 exabytes of data.

1 exabyte (EB) = 10^18 bytes

https://www.bornfight.com/blog/understanding-the-5-vs-of-big-data-volume-velocity-variety-veracity-value/
Seven V’s of BIG Data
• Volume
• Velocity
• Variety
• Variability
• Veracity
• Visualisation
• Value

https://bigdatapath.wordpress.com/2019/11/13/understanding-the-7-vs-of-big-data/
Seven V’s of BIG Data - Volume
Volume refers to the size of Big Data. Whether data can be considered
Big Data or not is based on its volume. The rapidly increasing volume
of data is due to cloud-computing traffic, IoT, mobile traffic, etc.

When discussing Big Data volumes, almost unimaginable sizes and
unfamiliar numerical terms are required:
o Each day, the world produces 2.5 quintillion bytes of data, i.e.,
2.5 billion gigabytes.
o In 2020, we created approximately 40 zettabytes of data, which
is 43 trillion gigabytes.
o Most companies already store, on average, 100 terabytes of data each.
o Facebook users upload that much data daily.
o Walmart alone processes over a million transactions per hour.

https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Velocity
Velocity refers to the speed at which data is accumulated, mainly due
to IoT, mobile data, social media, etc.

The speed at which data are generated, accumulated and analysed is on
a steep acceleration curve.
o 90% of extant data have been created in just the last two years.
o As of next year, there will be 19 billion network connections
globally feeding this velocity.
o There is an increasing need for real-time processing of these
enormous volumes, such as the 200 million emails, 300,000 tweets and
100 hours of YouTube videos that are passing by every minute of the day.
o Real-time processing reduces storage requirements while providing
more responsive, accurate and profitable responses.

https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Variability
Variability describes how fast and to what extent the data under
investigation is changing.

The intrinsic meanings and interpretations of conglomerations of raw
data depend on their context. This is especially true with natural
language processing.
o A single word may have multiple meanings.
o New meanings are created and old meanings discarded over time.
o Interpreting connotations is, for instance, essential to gauging
and responding to social media buzz.
o The boundless variability of Big Data therefore presents a unique
decoding challenge if one is to take advantage of its full value.

https://serokell.io/blog/big-data
Seven V’s of BIG Data - Variety
Variety refers to structured, semi-structured and unstructured data
arising from the different sources of data generated either by humans
or by machines.

A further challenge of Big Data processing goes beyond the massive
volumes and increasing velocities of data: manipulating the enormous
variety of these data.
o Taken as a whole, these data appear as an indecipherable mass
without structure.
o Consisting of natural language, hashtags, geo-spatial data,
multimedia, sensor events and so much more, the extraction of
meaning from such diversity requires ever-increasing algorithmic
and computational power.

https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Veracity
Veracity refers to the assurance of the quality, integrity, credibility
and accuracy of the data. Since data is collected from multiple sources,
we need to check it for accuracy before using it for business insights.

It is useless if the data being analysed are inaccurate or incomplete.
o This situation arises when data streams originate from diverse
sources presenting a variety of formats with varying signal-to-noise
ratios.
o By the time these data arrive at a Big Data analysis stage, they may
be rife with accumulated errors that are difficult to sort out.
o It almost goes without saying that the veracity of the final analysis
is degraded without first cleaning up the data it works with.

https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Seven V’s of BIG Data - Visualisation
Big data visualisation refers to the implementation of contemporary
visualisation techniques to illustrate the relationships within data.

A core task for any Big Data processing system is to transform its
immense scale into something easily comprehended and actionable.
o For human consumption, one of the best methods for this is
converting it into graphical formats.
o Spreadsheets and even three-dimensional visualisations are often
not up to the task, however, due to the attributes of velocity and
variety.
o There may be a multitude of spatial and temporal parameters, and
relationships between them, to condense into visual forms.

https://towardsdatascience.com/visualize-the-pandemic-with-r-covid-19-c3443de3b4e4
https://www.techopedia.com/definition/28988/big-data-visualization
Seven V’s of BIG Data - Value
Value refers to how useful the data is in decision making (Data to
Decision). We need to extract the value of Big Data using proper
analytics.

No one doubts that Big Data offers an enormous source of value to those
who can deal with its scale and unlock the knowledge within.
o Not only does Big Data offer new, more effective methods of selling
but also vital clues to new products to meet previously undetected
market demands.
o Many industries utilize Big Data in the quest for cost reductions for
their organizations and their customers.
o Those who offer the tools and machines to handle Big Data, its
analysis and visualisation also benefit hugely, albeit indirectly.

https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Social Data comes from the Likes, Tweets & Retweets, Comments,
Video Uploads, and general media that are uploaded and shared via the
world’s favourite social media platforms.
• Social data provides invaluable insights into consumer behaviour and sentiment.
• Social data can be enormously influential in marketing analytics.

https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Machine Data is defined as information which is generated by industrial
equipment, sensors that are installed in machinery, and even web logs
which track user behaviour.
• Machine data is expected to grow exponentially as the internet of things grows ever more
pervasive and expands around the world.
• Sensors such as medical devices, smart meters, road cameras, satellites, games and the rapidly
growing Internet of Things (IoT) will deliver high velocity, value, volume and variety of data in
the very near future.

https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Sources of Big Data – where does it come from?
The bulk of big data generated comes from three primary sources: social data,
machine data and transactional data:
o Transactional Data is generated from all the daily transactions that take
place both online and offline. Invoices, payment orders, storage
records, delivery receipts – all are characterized as transactional data.
• Data alone is almost meaningless, and most organizations struggle to make sense of the data
that they are generating and how it can be put to good use.

https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
Types of Big Data
Data analysts work with
different types of big data:
o Structured
o Semi-structured
o Quasi-structured
o Unstructured

https://serokell.io/blog/big-data
Types of Big Data (2)
Data analysts work with different types of big data:
o Structured. If your data is structured, it means that it is already
organized and convenient to work with. An example is data in Excel or
SQL databases that is tagged in a standardized format and can be easily
sorted, updated, and extracted.
o Unstructured. Unstructured data does not have any pre-defined order.
Google search results are an example of what unstructured data can
look like: articles, e-books, videos, and images.

https://serokell.io/blog/big-data
Types of Big Data (2) – Structured Vs. Unstructured Data

https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Structured Data
Structured data resides in relational databases: a database
structured to recognise relations between stored items of data.
o Databases of this type are typically managed via a relational database
management system (“RDBMS“).
o Relational Database: tables of rows and columns containing related
information. Consider a “Persons” table as an example:

https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Structured Data (Contd.)
A RDBMS uses structured query language (“SQL”) to access and
manipulate items in the RDBMS.
The benefit of structured data is
its labelling to describe
its attributes and relationships
with other data. This data
structure is easily searchable
using a human or algorithmically
generated query.

https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
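To make this concrete, here is a minimal runnable sketch using Python's built-in sqlite3 module. The "Persons" table, its columns and its rows are illustrative assumptions (the slide's original table image is not reproduced here):

import sqlite3

# An in-memory database stands in for a real RDBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A small "Persons" relation: rows and columns of related information.
cur.execute("""CREATE TABLE Persons (
    PersonID INTEGER PRIMARY KEY,
    LastName TEXT, FirstName TEXT, City TEXT)""")
cur.executemany("INSERT INTO Persons VALUES (?, ?, ?, ?)",
                [(1, "Smith", "Alice", "Southport"),
                 (2, "Jones", "Bob", "Parkwood")])

# Structured data is easily searchable by attribute via SQL.
for row in cur.execute("SELECT FirstName, City FROM Persons WHERE City = ?",
                       ("Southport",)):
    print(row)  # ('Alice', 'Southport')
conn.close()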
Types of Big Data (2) – Unstructured Data
Unstructured data is everything else. Unstructured data:
o has an internal structure (i.e., bits and bytes)
o but is not structured via pre-defined data models or schema, i.e.,
not organised and labelled to identify meaningful relationships
between data
• It may be textual / non-textual. It may be human / machine-generated.
It might also be stored within a non-relational database like NoSQL.

https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (2)
Human-generated unstructured data – typical examples include:
o Text files: word processing files, spreadsheets, presentations, emails.
o Email: largely text but has some internal structure thanks to its metadata (e.g., including the visible “to”,
“from”, “date / time”, “subject” entered to send an email) but also mixes in unstructured data via the
message body. For this reason, email is also referred to as semi-structured data.
o Social Media: like email, this is often semi-structured data, containing unstructured data (e.g., a Tweet)
but also structured data (e.g., the number of “Likes”, “retweets”, “date”, “author” etc.).
o Websites: YouTube, Instagram etc. contain lots of unstructured data, but also much structured data, e.g.,
like described above for Twitter.
o Mobile data: text messages, locations.
o Communications: IMs, Dictaphone recordings.
o Media: MP3, digital photos, audio recordings and video files.
o Business applications: MS Office documents, PDFs and similar.

https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (3)
Machine-generated unstructured data – common types include:
o Satellite imagery: weather data, geographic forms, military movements.
o Scientific data: oil and gas exploration, space exploration, seismic imagery and atmospheric
data.
o Digital surveillance: CCTV.

https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – Unstructured Data (4)
Unstructured legal data - In the legal context, unstructured data is common across
the following areas:
o Document / Email Management: although the organisation of the DMS is structured (e.g. basic metadata (data about data):
file names, doc IDs, version numbers, creation / edit / read dates etc) the most valuable content is unstructured, i.e. the
contents of the constituent documents and emails. For this reason, it is often a pain to search and analyse this data in a
meaningful manner, e.g. to find a specific clause wording requires finding target document types, opening those and scrolling
around inside because there is no structured data about the content of that document (i.e. down to the clause or intra-clause
level), only its basic metadata. Unfortunately, it is precisely that type of data which is most useful, but least accessible, to a
lawyer.
o eDiscovery: most of the content under review is email, email attachments (i.e. MS office docs, images, PDFs and sometimes
voice) and naturally suffers from the same limitations described for document and email management.
o Legal Due Diligence: the content is almost exclusively MS Word and PDF docs but also sometimes spreadsheets and slide
decks – again, like the above it’s all unstructured beyond the basic metadata.

https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (2) – The split of structured and unstructured data
On average unstructured data makes up 80%+ of today’s enterprise data, with the
remaining 20% being structured data.
Not only does unstructured data account for the majority of enterprise data, but the amount of
unstructured data is also growing at an average rate of 55% – 65% per year.
Unstructured data has grown, and continues to grow, because of:
o decreasing costs of data storage and processing power;
o ever-widening use of technology to create and manage work product (accelerated by minicomputers, then PCs and now
mobile and IoT devices etc.); and
o the internet and ever-increasing interconnectedness of devices and data.

https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
Types of Big Data (3)
Data analysts work with different types of big data:
o Semi-structured. Semi-structured data has been pre-processed but it
doesn’t look like a ‘normal’ SQL database. It can contain some tags,
such as data formats. JSON or XML files are examples of semi-structured
data. Some tools for data analytics can work with them.
o Quasi-structured. This is something in between unstructured and
semi-structured data. An example is textual content with erratic data
formats, such as the information about what web pages a user visited
and in what order.

https://serokell.io/blog/big-data
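As a small illustration of the difference, here is a Python sketch (the JSON record and the clickstream line are invented examples): semi-structured data carries its own tags, while quasi-structured data must be split apart by convention.

import json

# Semi-structured: JSON tags (keys) describe each value, but there is
# no fixed relational schema.
record = json.loads('{"user": "u42", "likes": 17, "tags": ["big", "data"]}')
print(record["user"], record["likes"])

# Quasi-structured: a clickstream line with an erratic, ad-hoc format
# that has to be split by convention rather than by a declared schema.
line = "2020-03-01T10:15:02 u42 /home > /products > /cart"
timestamp, user, path = line.split(" ", 2)
pages = [p.strip() for p in path.split(">")]
print(timestamp, user, pages)  # ['/home', '/products', '/cart']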
The Data Science Process
The process runs as a loop of four stages:
1. Ask an interesting question
2. Get the data (wrangle, clean)
3. Model and analyse the data (explore, model, validate)
4. Communicate results (tell the story)

The Data Science Process – Ask an interesting question
1) What exactly do you want to find out?
o Evaluate the well-being of your business first – which KPIs are most
relevant for your business and how are they already developing?
o Identify where changes can be made. If nothing can be changed, there
is no point in analyzing data; but you may find a development
opportunity and see that your business performance can be
significantly improved.
o The next step is to consider what your goal is and what
decision-making it will facilitate.

"Data is only as good as the questions you ask."
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Ask an interesting question (2)
The key to asking good data analysis questions:
o The more specific a question is, the more valuable (and actionable)
the answer is going to be.
o Instead of asking, “How can I raise revenue?”, you should ask:
“What are the channels we should focus more on in order to raise
revenue while not raising costs very much, leading to bigger profit
margins?”.
o “Which marketing campaign that I did this quarter got the best ROI,
and how can I replicate its success?”

***Asking the key questions when analyzing data can define your next
strategy in developing your company.
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Ask an interesting question (3)
2) What standard KPIs will you use that can help?
o Your goal with business intelligence is to see reality clearly so
that you can make profitable decisions to help your company thrive.
o The questions to ask when analyzing data will be the framework, the
lens, that allows you to focus on specific aspects of your business
reality.
o Once you have your data analytics questions, you need some standard
KPIs that you can use to measure them.
o For example, let’s say you want to see which of your PPC campaigns
last quarter did the best. Did the best according to what? Driving
revenue? Driving profit? Giving the most ROI? Giving the cheapest
email subscribers?

https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Ask an interesting question (4)
Let’s see this through a straightforward example.
o You are a retail company and want to know what you sell, where, and
when – remember the specific questions for analyzing data? In the
example, it is clear that the amount of sales performed over a set
period of time tells you when demand is higher or lower – you got your
specific KPI answer. Then you can dig deeper into the insights,
establish additional sales opportunities, and identify underperforming
areas that affect the overall sales of products.

https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data
3) Where will your data come from?
o The abundance of data sources may make things complicated. Your next
step is to “edit” these sources and make sure their data quality is up
to par, which will rule some of them out as useful choices.
o You can use CRM data, data from things like Facebook and Google
Analytics, financial data from your company – let your imagination go
wild (as long as the data source is relevant to the questions you’ve
identified in steps 1 and 2).

https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (2)
4) Which scales apply to your different datasets?
There are basically 4 types of scales:
o Nominal – you organize your data in non-numeric categories that
cannot be ranked or compared quantitatively. Examples:
– Different colors of shirts
– Different types of fruits
– Different genres of music
o Ordinal – an ordinal scale is one where the order matters but not
the difference between values. E.g., you might ask patients to express
the amount of pain they are feeling on a scale of 1 to 10. A score of
7 means more pain than a score of 5, and that is more than a score of
3. But the difference between the 7 and the 5 may not be the same as
that between 5 and 3.
https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (3)
4) Which scales apply to your different datasets?
o Interval – an interval scale is one where there is order and the
difference between two values is meaningful.
Examples of interval variables include:
• temperature (Fahrenheit), temperature (Celsius), pH, SAT score
(200-800), credit score (300-850).
o Ratio – a ratio variable has all the properties of an interval
variable, and also has a clear definition of 0.0. When the variable
equals 0.0, there is none of that variable.
Examples of ratio variables include:
• enzyme activity, dose amount, reaction rate, flow rate,
concentration, pulse, weight, length, temperature in Kelvin
(0.0 Kelvin really does mean “no heat”), survival time.
https://www.datapine.com/blog/data-analysis-questions/
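The four scales can be mirrored directly in pandas dtypes. The values below are invented examples, a sketch rather than anything from the slides:

import pandas as pd

# Nominal: unordered categories (shirt colours cannot be ranked).
shirts = pd.Categorical(["red", "blue", "red", "green"])

# Ordinal: order matters, but the gaps between levels are not meaningful.
pain = pd.Categorical(["mild", "severe", "moderate"],
                      categories=["mild", "moderate", "severe"], ordered=True)
print(pain.min(), pain.max())  # ranking is allowed: mild ... severe

# Interval: differences are meaningful, but zero is arbitrary (Celsius).
temps_c = pd.Series([10.0, 20.0, 30.0])
print(temps_c.diff().dropna().tolist())  # equal 10-degree steps

# Ratio: interval properties plus a true zero (Kelvin), so ratios make sense.
temps_k = temps_c + 273.15
print(round(temps_k.iloc[2] / temps_k.iloc[0], 3))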
The Data Science Process – Get the Data (4)
5) How can you ensure data quality?
o Insights and analytics based on a shaky “data foundation” will give
you… well, poor insights and analytics.
o How most data scientists spend their time (survey by CrowdFlower):
• 60% of the time organizing and cleaning data (!)
• 19% collecting datasets
• 9% mining the data to draw patterns
• 3% training the datasets
• 4% refining the algorithms
• 5% other tasks

https://www.datapine.com/blog/data-analysis-questions/
The Data Science Process – Get the Data (5)
Data Quality
o Completeness – defined as expected comprehensiveness. Data can be
complete even if optional data is missing.
E.g., a customer’s first name and last name are mandatory but the
middle name is optional; so a record can be considered complete even
if a middle name is not available.
Questions you can ask yourself:
• Is all the requisite information available?
• Do any data values have missing elements? Or are they in an
unusable state?

https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
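A minimal pandas sketch of a completeness check, using the slide's mandatory/optional name example (the records themselves are invented):

import pandas as pd

df = pd.DataFrame({
    "first_name": ["Ana", "Ben", None],
    "middle_name": [None, "J.", None],      # optional field
    "last_name": ["Lopez", "Ng", "Kaur"],
})

# A record is complete when all mandatory fields are present,
# even if the optional middle name is missing.
mandatory = ["first_name", "last_name"]
df["complete"] = df[mandatory].notna().all(axis=1)
print(df)
print(df.isna().sum())  # missing elements per column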
The Data Science Process – Get the Data (5)
Data Quality
o Consistency – means data across all systems reflects the same
information and is in sync across the enterprise. Examples:
• A business unit status is closed but there are sales for that
business unit.
• Employee status is terminated but pay status is active.
Questions you can ask yourself:
• Are data values the same across the data sets?
• Are there any distinct occurrences of the same data instances
that provide conflicting information?

https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
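The terminated-but-paid example above can be expressed as a simple cross-system check. The following pandas sketch assumes two invented extracts, hr and payroll:

import pandas as pd

hr = pd.DataFrame({"emp_id": [1, 2, 3],
                   "status": ["active", "terminated", "active"]})
payroll = pd.DataFrame({"emp_id": [1, 2, 3],
                        "pay_status": ["active", "active", "active"]})

# Consistency check: a terminated employee should not have active pay.
merged = hr.merge(payroll, on="emp_id")
conflicts = merged[(merged["status"] == "terminated") &
                   (merged["pay_status"] == "active")]
print(conflicts)  # emp_id 2 is flagged as inconsistent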
The Data Science Process – Get the Data (5)
Data Quality
o Conformity – means the data follows the set of standard data
definitions like data type, size and format.
• For example, the date of birth of a customer is in the format
“mm/dd/yyyy”.
Questions you can ask yourself:
• Do data values comply with the specified formats?
• If so, do all the data values comply with those formats?

https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
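A conformity check on the slide's "mm/dd/yyyy" example can be sketched with pandas (the date values are invented):

import pandas as pd

dates = pd.Series(["03/14/1990", "1990-14-03", "12/01/1985", "n/a"])

# Values that do not parse under the agreed format are non-conforming.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
print(dates[parsed.isna()])  # "1990-14-03" and "n/a" violate the standard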
The Data Science Process – Get the Data (5)
Data Quality
o Accuracy – the degree to which data correctly reflects the
real-world object or event being described. Examples:
• Sales of the business unit are the real value.
• The address of an employee in the employee database is the real
address.
Questions you can ask yourself:
• Do data objects accurately represent the “real world” values they
are expected to model?
• Are there incorrect spellings of product or person names or
addresses, or even untimely or not-current data?

https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
The Data Science Process – Get the Data (5)
Data Quality
o Integrity – means validity of data across the relationships,
ensuring that all data in a database can be traced and connected to
other data.
• For example, in a customer database there should be valid
customers, addresses and relationships between them. If there is
address relationship data without a customer, then that data is not
valid and is considered an orphaned record.
Questions you can ask yourself:
• Are there any data missing important relationship linkages?

https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
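The orphaned-record example translates into a referential-integrity check, sketched here with invented customer and address tables:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
addresses = pd.DataFrame({"address_id": [10, 11, 12],
                          "customer_id": [1, 2, 99]})  # 99 has no customer

# Every address must trace back to an existing customer.
orphans = addresses[~addresses["customer_id"].isin(customers["customer_id"])]
print(orphans)  # address 12 is an orphaned record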
The Data Science Process – Get the Data (5)
Data Quality
o Timeliness – references whether information is available when it is
expected and needed. Timeliness of data is very important. This is
reflected in:
• Companies that are required to publish their quarterly results
within a given frame of time
• Customer service providing up-to-date information to customers
• Credit systems checking in real time on credit card account activity

***Timeliness depends on user expectation. Online availability of data
could be required for a room allocation system in hospitality, but
nightly data could be perfectly acceptable for a billing system.

https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
The Data Science Process – Get the Data (5)
Data Quality – A concrete example of data quality issues

***Up to 80% of the time is spent “cleaning” dirty data before it is
used for AI/ML projects.

Machine learning algorithms can detect missing values, duplicate
records and outlier values, and normalize data so it is usable by
other algorithms!

https://ncube.com/blog/big-data-and-ai
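A sketch of such a cleaning pass with pandas and scikit-learn. The column and its values are invented, and IsolationForest is just one of several outlier detectors that could be used here:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"amount": [10.0, 12.0, np.nan, 11.0, 11.0, 900.0]})

df = df.drop_duplicates()  # remove duplicate records

# Fill missing values with the median.
df["amount"] = SimpleImputer(strategy="median").fit_transform(
    df[["amount"]]).ravel()

# Drop rows the model labels as likely outliers (-1).
labels = IsolationForest(random_state=0).fit_predict(df[["amount"]])
df = df[labels == 1]

# Normalize so the column is usable by other algorithms.
df["amount_std"] = StandardScaler().fit_transform(df[["amount"]]).ravel()
print(df)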
The Data Science Process – Model and Analyse the Data
Data analysis is defined as a process of cleaning, transforming, and
modeling data to discover useful information for business
decision-making: extracting useful information from data and taking
decisions based upon that analysis.

Data analysis tools make it easier for users to process and manipulate
data and analyze the relationships and correlations between data sets;
they also help to identify patterns and trends for interpretation.
https://www.guru99.com/what-is-data-analysis.html
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
o Statistical Analysis shows "What happened?" by using past data in
the form of dashboards. Statistical analysis includes collection,
analysis, interpretation, presentation, and modeling of data. It
analyses a set of data or a sample of data. There are two categories
of this type of analysis: Descriptive Analysis and Inferential Analysis.
• Descriptive Analysis – analyses complete data or a sample of
summarized numerical data. It shows the mean and deviation for
continuous data, and percentage and frequency for categorical data.
• Inferential Analysis – analyses a sample from the complete data. In
this type of analysis, you can reach different conclusions from the
same data by selecting different samples.

https://www.guru99.com/what-is-data-analysis.html
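A small pandas sketch of the two categories, on invented data: descriptive statistics summarize the data at hand, while different samples can point to different inferential conclusions.

import pandas as pd

sales = pd.Series([120, 135, 128, 310, 140, 125, 133, 129])

# Descriptive: mean and deviation for continuous data.
print(sales.mean(), sales.std())

# Descriptive: percentage/frequency for categorical data.
fruit = pd.Series(["apple", "pear", "apple", "apple"])
print(fruit.value_counts(normalize=True))

# Inferential: different samples from the same data can suggest
# different conclusions.
for seed in (0, 1):
    print(f"sample mean (seed={seed}):",
          sales.sample(4, random_state=seed).mean())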
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
o Diagnostic Analysis shows "Why did it happen?" by finding the cause
from the insights found in statistical analysis.
• This analysis is useful for identifying behavior patterns in data.
• If a new problem arises in your business process, you can use this
analysis to find similar patterns of that problem, and there may be a
chance to apply similar prescriptions to the new problem.

https://www.guru99.com/what-is-data-analysis.html
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
Predictive analytics encompasses a variety of statistical techniques
from data mining, predictive modelling, and machine learning that
analyze current and historical facts to make predictions about future
or otherwise unknown events.
o Predictive Analysis shows "what is likely to happen" by using
previous data.
• A simple example: if last year I bought two dresses based on my
savings, and this year my salary doubles, then I can buy four dresses.
• This analysis makes predictions about future outcomes based on
current or past data. Forecasting is just an estimate; its accuracy
depends on how much detailed information you have and how deeply you
dig into it.

https://www.guru99.com/what-is-data-analysis.html
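A minimal forecasting sketch in the spirit of the slide, using NumPy on an invented monthly sales history; a straight-line fit stands in for the far richer techniques predictive analytics draws on:

import numpy as np

# Hypothetical monthly sales history: the "previous data".
months = np.arange(1, 13)
sales = 100 + 5 * months + np.random.default_rng(0).normal(0, 3, size=12)

# Fit a line to the past and extrapolate one month ahead.
slope, intercept = np.polyfit(months, sales, deg=1)
forecast = slope * 13 + intercept
print(f"forecast for month 13: {forecast:.1f} (an estimate, not a guarantee)")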
The Data Science Process – Model and Analyse the Data (2)
Types of Data Analysis
Prescriptive analytics has been referred to as the "final frontier of
analytic capabilities": it not only anticipates what will happen and
when it will happen, but also why it will happen.
o Prescriptive Analysis combines the insights from all previous
analyses to determine which action to take on a current problem or
decision.
• Most data-driven companies are utilizing prescriptive analysis
because predictive and descriptive analysis are not enough to improve
data performance. Based on current situations and problems, they
analyze the data and make decisions.

By Modaniel - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=36478653


The Data Science Process – Communicate the results
One of the most important skills for data scientists to have is being
able to clearly communicate results so that different stakeholders can
understand them.

Who is your audience? Common audiences include:
o Your team manager
o Line-of-business (LOB) stakeholders
o Data engineers/engineering team

A useful starting framework you can use:
1. Your understanding of the business problem
2. How to measure business impact – what business metrics do your
model results align to?
3. What data is/was available – if appropriate, reference what data
it would be helpful to collect
4. The initial solution hypothesis
5. The solution/model – use examples and visualizations
6. The business impact of the solution and clear action items for
stakeholders
https://medium.com/comet-ml/a-data-scientists-guide-to-communicating-results-c79a5ef3e9f1
The Data Science Process – Communicate the results (2)
Is the right visualisation tool (such as charts or graphs) important?
Consider this extract of a remoteness-index table:

Suburb      | Status
Southport   | Major Cities
Parkwood    | Major Cities
Coober Pedy | Remote
https://infographic.tv/data-visualization-remoteness-index-map-of-australia/
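To illustrate why the tool matters, the small table above can be recast as a chart. A minimal matplotlib sketch (the aggregation into counts is our own, derived from the three table rows):

import matplotlib.pyplot as plt

# Counts per remoteness status, aggregated from the Suburb/Status table.
counts = {"Major Cities": 2, "Remote": 1}

plt.bar(list(counts), list(counts.values()))
plt.ylabel("Number of suburbs")
plt.title("Suburbs by remoteness status")
plt.show()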
Big Data and AI – How do big data and AI work together?

There's a reciprocal relationship between big data and AI: the latter
depends heavily on the former for success, while also helping
organizations unlock the potential in their data stores in ways that
were previously cumbersome or impossible.

https://enterprisersproject.com/article/2019/10/how-big-data-and-ai-work-together
Big Data and AI – Use Cases
o Detecting anomalies – AI can analyze data to detect unusual
occurrences. For example, a network of sensors may have a predefined
appropriate range; anything outside of that range is an anomaly (see
the sketch below).
o Probability of future outcome – using a known condition that has a
certain probability of influencing the future outcome, AI can
determine the likelihood of that outcome.
o AI can recognize patterns – AI can see patterns that humans don't.
o Data bars and graphs – AI can look for patterns in bars and graphs
that might stay undetected under human supervision.
https://ncube.com/blog/big-data-and-ai
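The sensor-range example in the first bullet can be sketched in a few lines of pandas (sensor names, readings and the operating range are invented assumptions):

import pandas as pd

readings = pd.DataFrame({"sensor": ["t1", "t1", "t2", "t2"],
                         "value": [21.5, 98.4, 22.1, 20.9]})
LOW, HIGH = 15.0, 35.0  # predefined appropriate range

# Anything outside the range is an anomaly.
readings["anomaly"] = ~readings["value"].between(LOW, HIGH)
print(readings[readings["anomaly"]])  # the 98.4 reading is flagged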
Summary
• Introduction to Big Data, sources of big data and types of big data.
• Brief overview of the data science process, different types of data
analytics and communication of data analytics results.
• The interrelationships of Big Data and AI.
• Some Big Data analytics use cases.

References
• https://www.bornfight.com/blog/understanding-the-5-vs-of-big-data-volume-velocity-variety-veracity-value/
• https://bigdatapath.wordpress.com/2019/11/13/understanding-the-7-vs-of-big-data/
• https://serokell.io/blog/big-data
• https://suryagutta.medium.com/the-5-vs-of-big-data-2758bfcc51d
• https://www.cloudmoyo.com/blog/data-architecture/what-is-big-data-and-where-it-comes-from/
• https://lawtomated.com/structured-data-vs-unstructured-data-what-are-they-and-why-care/
• https://www.datapine.com/blog/data-analysis-questions/
• https://smartbridge.com/data-done-right-6-dimensions-of-data-quality/
• https://www.guru99.com/what-is-data-analysis.html
• https://ncube.com/blog/big-data-and-ai
• By Modaniel - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=36478653
• https://enterprisersproject.com/article/2019/10/how-big-data-and-ai-work-together
