
Data Science

Unit 1 - Introduction

Department of AI & ML, School of Computing


Mohan Babu University
Sree Sainath Nagar, A. Rangampet, Tirupati – 517 102
2
Definition of Data Science
• According to the Oxford dictionary, science is “systematic study of the structure
and behaviour of the physical and natural world through observation and
experiment”.
• Data Science is a field of study and practice that involves the collection,
storage, and processing of data in order to derive important insights into a
problem or a phenomenon.
• It is a multidisciplinary field that uses tools and techniques to manipulate the
data so that you can find something new and meaningful.
• Such data may be generated by
– humans (surveys, logs, etc.) or
– machines (weather data, road vision, etc.), and
• Data could be in different formats (text, audio, video, augmented or virtual
reality, etc.).

3
Definition of Data Science . . .
• Example: Travelling from station A to station B by car.
We need to make decisions such as which route will get us to the destination
fastest, which route is likely to be free of traffic jams, and which will be
the most cost-effective.
All these decision factors act as input data, and we derive an appropriate
answer from them; this analysis of data is called data analysis, which is a
part of data science.

4
Need for Data Science
• Data Explosion - A lot of data is generated at an unprecedented and ever-
increasing speed. Approximately 2.5 quintillion bytes of data are generated
every day. Researchers estimated that by 2020, 1.7 MB of data would be
created every single second by every single person on earth.

• Handling such a huge amount of data is a challenging task for every
organization.

• To handle, process, and analyse data, complex, powerful, and efficient
algorithms and technologies are required.

• Technology made data science come into existence.
5
Need for Data Science . . .
• Analyzing data wisely necessitates the involvement of competent and well-
trained practitioners, and analyzing such data can provide actionable insights.
• The “3V model” attempts to lay this out in a simple (and catchy) way. These are
the three Vs:
1. Velocity: The speed at which data is accumulated.
2. Volume: The size and scope of the data.
3. Variety: The massive array of data and types (structured and
unstructured).
• Each of these three Vs regarding data has dramatically increased in recent
years.

6
Jobs and Skills
Data Analyst
• A data analyst is an individual who mines huge amounts of data, models
the data, and looks for patterns, relationships, and trends. At the end of
the day, he or she produces visualizations and reports for analyzing the
data to support decision making and problem solving.
• Skills required: For becoming a data analyst, you need a good background
in Mathematics, Statistics, Business Intelligence, and Data Mining.
• Computer languages and tools: Statistical Tools, BI Tools, DA Tools, DBMS.

Machine Learning Expert
• A machine learning expert is one who works with the various machine
learning algorithms used in data science, such as regression, clustering,
classification, decision trees, random forests, etc.
• Skills required: Programming languages, Algorithmic skills, Analytical &
Problem-solving skills, Probability and Statistics.

7
Jobs and Skills . . .
Data Engineer
• A data engineer works with massive amounts of data and is responsible for
building and maintaining the data architecture of a data science project.
Data engineers also create the dataset processes used in modeling, mining,
acquisition, and verification.
• Skills required: DBMS, Programming Languages

Data Scientist
• A data scientist is a professional who works with an enormous amount of data
to come up with compelling business insights through the deployment of various
tools, techniques, methodologies, algorithms, etc.
• Skills required: DBMS, Programming Languages, Mathematics, Statistics,
Visualization, Communication.

8
Components of Data Science

9
Skills for Data Science
• Willingness to Experiment - A data scientist needs
– Drive, intuition, and curiosity to solve problems.
– The ability to identify and articulate problems on their own.
– Intellectual curiosity and the ability to experiment.
– Analytical and creative thinking.
• Proficiency in Mathematical Reasoning -
– Mathematical and statistical knowledge is the second critical skill for a
potential applicant seeking a job in data science.
– Employers are seeking applicants who can demonstrate their ability in
reasoning, logic, interpreting data, and developing strategies to perform
analysis.
– Interpretation and use of numeric data are going to be increasingly critical
in business practices. As a result, an increasing trend in hiring for most
companies is to check if applicants are adept at mathematical reasoning.

10
Skills for Data Science . . .
• Data Literacy
– Data literacy is the ability to extract meaningful information from a dataset
and any modern business has a collection of data that needs to be
interpreted.
– A skilled data scientist plays an intrinsic role for businesses through an
ability to assess a dataset for relevance and suitability for the purpose of
interpretation, to perform analysis, and create meaningful visualizations to
tell valuable data stories.
– Data Literacy Training - Managers are being trained to “understand which
data is suitable, and how to use visualization and simulation to process and
interpret it.”
– Data-Driven Decision-Making is a driving force for innovation in business,
and data scientists are central to it.

11
Types of Data Science Roles

12
Analyzing Data

Statistical Inferences (from a sample of 15 height–weight observations)
• We can conclude that the average height of an American woman is 65 inches,
at least according to these 15 observations.
• The average weight is 136 pounds.
• An increase of an inch in height results in an increase of less than 3
pounds in weight for heights between 58 and 65 inches.
• For heights greater than 65 inches, weight increases more rapidly.
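
A minimal sketch in Python of how such inferences can be computed; the data
below is made up for illustration (the slide's actual 15 observations are not
reproduced here):

    # Hypothetical (height in inches, weight in pounds) observations
    data = [(58, 116), (60, 121), (62, 127), (64, 133), (65, 136),
            (66, 141), (68, 151), (70, 162)]
    heights = [h for h, w in data]
    weights = [w for h, w in data]

    print(sum(heights) / len(heights))   # average height
    print(sum(weights) / len(weights))   # average weight

    # Average weight gain per inch of height, below and above 65 inches
    low  = [(h, w) for h, w in data if h <= 65]
    high = [(h, w) for h, w in data if h >= 65]
    print((low[-1][1]  - low[0][1])  / (low[-1][0]  - low[0][0]))   # < 3 lb/inch
    print((high[-1][1] - high[0][1]) / (high[-1][0] - high[0][0]))  # > 3 lb/inch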

13
Tools for Data Science
• Statistical Techniques
• Computational Thinking
• Programming and Data Processing Tools – Python, Java, PHP, R, SQL, etc.
• Scientific Data Processing Environments – MATLAB, etc.

14
Applications of Data Science
Image recognition and speech recognition
• When an image is uploaded to Facebook, suggestions are intelligently provided
for tagging friends. This automatic tagging suggestion uses image recognition
algorithms, which are part of data science.
Gaming world
• In the gaming world, the use of machine learning algorithms is increasing day
by day.
• EA Sports, Sony, and Nintendo are widely using data science for enhancing user
experience.
Internet Search
• Search engines such as Google, Yahoo, Bing, and Ask use data science for
intelligent searching.
• All these search engines use data science technology to make the search
experience better.

15
Applications of Data Science . . .
Transport
• Transport industries use data science technology to create self-driving cars.
With self-driving cars, it will be easier to reduce the number of road accidents.
Healthcare
• Data science is being used in the healthcare sector for tumor detection, drug
discovery, medical image analysis, virtual medical bots, etc.
Recommendation systems
• Many companies, such as Amazon, Netflix, and Google Play, use data science
technology to provide a better user experience through personalized
recommendations.
Risk detection
• Finance industries have always faced fraud and the risk of losses, but with the
help of data science these risks can be reduced.
• Most finance companies look to data scientists to avoid risk and any type of
loss while increasing customer satisfaction.

16
Datatypes
• Structured data refers to highly organized information that can be seamlessly
included in a database and readily searched via simple search operations.
• Unstructured data is essentially the opposite, devoid of any underlying structure.

17
Structured Data

• Structured data can be
– Numerical
– Categorical
– Text
– Boolean
• Structured data values are labeled, which is not the case when it comes to
unstructured data.

18
Unstructured Data

• Unstructured data can be
– Media
– Imaging
– Audio
– Sensor data, etc.
• Unstructured data values are unlabeled.
• Unstructured simply means datasets that aren't stored in a structured
database format.
• Unstructured data has an internal structure, but it’s not predefined through
data models.

19
Challenges with Unstructured Data
• Structured data is akin to machine language, in that it makes information much
easier to be parsed by computers.
• Unstructured data, on the other hand, is often how humans communicate.
• The lack of structure makes compilation and organizing unstructured data a
time- and energy-consuming task.
• It would be easy to derive insights from unstructured data if it could be instantly
transformed into structured data.

20
Data Collections
Open Data
• Data which is freely available in a public domain that can be used by anyone as
they wish, without restrictions from copyright, patents, or other mechanisms of
control.
• Local and federal governments, NGOs, and academic communities all lead open
data initiatives.
• Principles commonly associated with open data include
– Public Licensing, Nonproprietary
– Adhere to Law and subject to privacy, confidentiality, security, or other valid
restrictions
– Accessible
– Described, Complete – Granularity, Derived & Aggregated data
– Reusable
– Timely, Managed Post-Release

21
Data Collections . . .

Social Media Data
• A gold mine for collecting data to analyze for research or marketing purposes.
• This is facilitated by the Application Programming Interface (API) that social media
companies provide to researchers and developers.
• Facebook Graph API, Twitter API etc.
• These APIs can be used by any individual or organization to collect and use this
data to accomplish a variety of tasks
Ex: socially impactful applications, research on human information behavior,
monitoring the aftermath of natural calamities, etc.
• Furthermore, to encourage research on niche areas, such datasets have often
been released by the social media platform itself.

22
Data Collections . . .
Multimodal Data
• Multimodal (different forms) and multimedia (different media) data such as
images, music and other sounds, gestures, body posture, and the use of space.
• Medical imaging machines such as MRI, CT, and MEG scanners generate such data.
• The Internet of Things plays a major role in generating multimodal data.

23
Data Collections . . .
Data Storage and Presentation - Depending on its nature, data is stored in
various formats. We will start with simple kinds – data in text form.

• CSV (Comma-Separated Values) format

• TSV (Tab-Separated Values) format
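
For illustration, the same hypothetical records in both formats, read with
Python's standard csv module:

    # data.csv contains:           # data.tsv contains (TAB-separated):
    #   name,age,city              #   name    age    city
    #   Asha,24,Tirupati           #   Asha    24     Tirupati
    #   Ravi,31,Chennai            #   Ravi    31     Chennai

    import csv

    with open("data.csv", newline="") as f:
        for row in csv.reader(f):              # default delimiter is ','
            print(row)                         # ['Asha', '24', 'Tirupati'] ...

    with open("data.tsv", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            print(row)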

24
Data Collections . . .
Data Storage and Presentation . . .
• XML (eXtensible Markup Language)
– Popular for sharing data between IT systems.
– Was designed to be both human- and machine-readable, and can thus be
used to store and transport data.
– XML data is stored in plain-text format, so it provides a software- and
hardware-independent way of storing data.
– This makes it much easier to create data that can be shared by
different applications.
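
A minimal, hypothetical XML document illustrating this plain-text,
self-describing structure:

    <?xml version="1.0" encoding="UTF-8"?>
    <students>
      <student>
        <name>Asha</name>
        <program>AI &amp; ML</program>
        <year>2</year>
      </student>
    </students>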

25
Data Collections . . .
Data Storage and Presentation . . .
• RSS (Really Simple Syndication)
– Is a format used to share data between services.
– Ex: RSS aggregator – automatic notifications/alerts from websites/apps.
– It is defined on top of version 1.0 of the XML standard.
– The RSS format follows standard XML usage but, in addition, defines the
names of specific tags and what kind of information should be stored in
them.
– It facilitates delivery of information from various sources on the Web.
Information provided by a website in an XML file in this way is called an
RSS feed.
– Most current web browsers can directly read RSS files, but a special RSS
reader or aggregator may also be used.
– Because RSS data is small and fast-loading, it can easily be used with
services such as mobile phones, personal digital assistants (PDAs), and
smart watches.

26
Data Collections . . .
Data Storage and Presentation . . .
• RSS (Really Simple Syndication) . . .

• Sample RSS document
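
The original slide shows the sample as an image; a minimal, hypothetical RSS
feed (here in the common RSS 2.0 form) looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title>Example News Feed</title>
        <link>https://example.com/</link>
        <description>Latest updates from an example website.</description>
        <item>
          <title>First headline</title>
          <link>https://example.com/first</link>
          <description>Summary of the first story.</description>
        </item>
      </channel>
    </rss>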

27
Data Collections . . .
Data Storage and Presentation . . .
• JSON (JavaScript Object Notation)
– is a lightweight data-interchange format.
– It is not only easy for humans to read and write, but also easy for machines
to parse and generate.
– JSON is built on two structures:
▪ Collection of name–value pairs - In various languages, this is realized as
an object, record, structure, dictionary, hash table, keyed list, or
associative array.
▪ Ordered list of values - In most languages, this is realized as an array,
vector, list, or sequence.
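
A small, hypothetical JSON document showing both structures – an object of
name–value pairs in which one of the values is an ordered list (array):

    {
      "name": "Asha",
      "age": 24,
      "isStudent": true,
      "courses": ["Data Science", "Machine Learning", "Statistics"]
    }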

28
Data Collections . . .
Data Storage and Presentation . . .
• JSON (JavaScript Object Notation) . . .
– When exchanging data between a browser and a server, the data can be
sent only as text.
– JSON is text, and any JavaScript object can be converted into JSON, and
can be sent to the server.

29
Data Collections . . .
Data Storage and Presentation . . .
• JSON (JavaScript Object Notation) . . .
– Any JSON received from the server can be converted into JavaScript
objects.
– This provides a way to work with data as JavaScript objects, with no
complicated parsing and translations.
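
The same round trip can be sketched in Python with its standard json module
(analogous to JSON.parse and JSON.stringify in JavaScript); the data here is
made up:

    import json

    text = '{"name": "Asha", "courses": ["Data Science", "ML"]}'

    obj = json.loads(text)     # parse JSON text into a native dict object
    print(obj["courses"][0])   # -> Data Science

    obj["age"] = 24            # work with the data as an ordinary object
    print(json.dumps(obj))     # serialize back to JSON text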

30
Data Preprocessing
• Data in the real world is often dirty.
• It needs to be cleaned up before it can be used for a desired purpose.
This is often called data pre-processing.
– Incomplete. When some of the attribute values are lacking, certain
attributes of interest are lacking, or attributes contain only aggregate data.
– Noisy. When data contains errors or outliers. For example, some of the
data points in a dataset may contain extreme values that can severely affect
the dataset’s range.
– Inconsistent. Data contains discrepancies in codes or names. For
example, if the “Name” column for registration records of employees
contains values other than alphabetical letters, or if records do not start
with a capital letter, discrepancies are present.

31
Steps in Data Preprocessing . . .

32
Data Cleaning
• Importance
– “Data cleaning is one of the three biggest problems in data warehousing”—
Ralph Kimball
– “Data cleaning is the number one problem in data warehousing”—DCI
survey

• Data Cleaning Tasks
– Data Munging
– Handling Missing Data
– Smooth Noisy Data

33
Data Cleaning . . .
Data Munging
– Also called Data Manipulation or Data Wrangling.
– Often, real-world data is not in a format that is easy to work with.
– It needs to be converted to a format more suitable for a computer to
understand.
– This can be done manually, automatically, or, in many cases, semi-
automatically. Unfortunately, there is often no better or systematic method
for wrangling.
– Ex: Consider the following text recipe.
“Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.”
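
One possible way to wrangle that recipe sentence into a structured form is
sketched below; the target schema (ingredient, quantity, unit, prep) is an
assumption for illustration, not a prescribed format:

    # The free-form text, converted (here by hand, i.e., manual munging)
    # into records a computer can readily process
    recipe = ("Add two diced tomatoes, three cloves of garlic, "
              "and a pinch of salt in the mix.")

    structured = [
        {"ingredient": "tomato", "quantity": 2, "unit": "piece", "prep": "diced"},
        {"ingredient": "garlic", "quantity": 3, "unit": "clove", "prep": None},
        {"ingredient": "salt",   "quantity": 1, "unit": "pinch", "prep": None},
    ]
    for record in structured:
        print(record)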

34
Data Cleaning . . .
Handling Missing Data
• Missing data may be due to
– human error
– equipment malfunction, transmission errors
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– Varying granularity of recording data
• Missing data affects data science results. Strategies to handle missing values:
– Ignore the tuple.
– Fill in the missing value manually: tedious and often infeasible.
– Fill in the missing value automatically using statistical/inferencing techniques.

35
Data Cleaning . . .
Handling Missing Data

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 120K | No
2 | No | Married | 100K | No
3 | No | Single | 85K | Yes
4 | No | Divorced | 95K | Yes
5 | No | Single | 90K | Yes
6 | Yes | Divorced | ? | Yes

• Fill missing values with the attribute mean:
mean = (120 + 100 + 85 + 95 + 90) / 5 = 98, so ? = 98.
• Fill missing values with the class mean:
mean for class Cheat = Yes is (85 + 95 + 90) / 3 = 90, so ? = 90
(for class Cheat = No, the mean is (120 + 100) / 2 = 110).
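
Both strategies can be sketched in Python with pandas (the column names follow
the table above):

    import pandas as pd

    df = pd.DataFrame({
        "Tid": [1, 2, 3, 4, 5, 6],
        "Taxable Income": [120, 100, 85, 95, 90, None],   # in K; row 6 missing
        "Cheat": ["No", "No", "Yes", "Yes", "Yes", "Yes"],
    })

    # Attribute mean: (120 + 100 + 85 + 95 + 90) / 5 = 98
    by_attribute = df["Taxable Income"].fillna(df["Taxable Income"].mean())

    # Class mean for Cheat = Yes: (85 + 95 + 90) / 3 = 90
    by_class = df["Taxable Income"].fillna(
        df.groupby("Cheat")["Taxable Income"].transform("mean"))

    print(by_attribute.iloc[5], by_class.iloc[5])   # 98.0 90.0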

36
Data Cleaning . . .
Smooth Noisy Data
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitations
– inconsistency in naming conventions
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
• Strategies
– Identify and remove outliers
– Resolve inconsistencies in data
37
Data Cleaning . . .
Smooth Noisy Data . . .

38
Data Integration
• Data from various sources commonly needs to be integrated. The following
steps describe how to integrate multiple databases –
– Combine data from multiple sources into a coherent storage place.
– Engage in schema integration, or the combining of metadata from different
sources.
– Detect and resolve data value conflicts.
– Address redundant data in data integration. Redundant data is commonly
generated in the process of integrating multiple databases. For example:
a. The same attribute may have different names in different databases.
b. One attribute may be a “derived” attribute in another table.
c. Correlation analysis may detect instances of redundant data.

39
Data Integration . . .
Schema integration

• Ex: Two customer tables with different schemas:
DB1: CID | Name | Address | Mobile
DB2: ID | First_Name | Last_Name | Address | City | Phone
• Integrate metadata from different sources to solve such problems.
• Sometimes, data may have to be transformed for integration.
• Ex: DB1.Customer.Name = DB2.Customer.First_Name +
DB2.Customer.Last_Name

40
Data Integration . . .
Data Value Conflicts
• Reasons for such conflicts include different representations or different scales.
• Ex: DB1 stores customer height in feet (CID | Name | Height (feet)), while
DB2 stores it in meters (CID | Name | Height (meters)).

Data Integration . . .
Redundant Data
• Redundant records
Ex: DB1 has a record (Tid T3, Name 'Steve') and DB2 has (Tid T3,
Name 'Steve Mcgarret') – same person?
• Redundant attributes
Ex: DB1 has a 'Mobile' column and DB2 has a 'Phone' column – same
attribute?
• Derived attributes
Ex: DB1 stores DOB while DB2 stores Age – Age is an attribute that can
be derived from DOB.
42
Data Integration . . .
• Careful integration of data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve the speed and quality of the
data science process.
• Redundant attributes may be detected by correlation analysis.
• Correlation analysis is a method of statistical evaluation used to study the
strength of a relationship between two variables (e.g., height and weight).
• The goals of a correlation analysis are
– to see whether two variables co-vary, and
– to quantify the strength of the relationship between the variables.

43
Data Transformation
• Data must be transformed so it is consistent and readable by a system.
• The following five processes may be used for data transformation.
1) Smoothing: Remove noise from data.
2) Aggregation: Summarization, data cube construction.
3) Generalization: Concept hierarchy climbing.
4) Normalization: Scaling values to fall within a small, specified range (see
the sketch after this list). Some of the techniques used for accomplishing
normalization are:
a) Min–max normalization.
b) Z-score normalization.
c) Normalization by decimal scaling.
5) Attribute or feature construction.
a) New attributes constructed from the given ones.
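
A minimal sketch of the three normalization techniques on a hypothetical list
of values; the decimal-scaling rule for choosing j is one common convention:

    values = [200, 300, 400, 600, 1000]
    mn, mx = min(values), max(values)
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

    # a) Min–max normalization: scale into [0, 1]
    min_max = [(v - mn) / (mx - mn) for v in values]

    # b) Z-score normalization: mean 0, standard deviation 1
    z_score = [(v - mean) / std for v in values]

    # c) Decimal scaling: divide by 10^j so every value falls in (-1, 1);
    #    here j = number of digits in the largest absolute value
    j = len(str(int(mx)))
    decimal = [v / 10 ** j for v in values]

    print(min_max, z_score, decimal, sep="\n")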

44
Data Reduction
• The process of data reduction is aimed at producing a reduced representation of
the dataset that can be used to obtain the same or similar analytical results.
• Why Data Reduction?
– A database/data warehouse may store terabytes of data.
– Complex data analysis/mining may take a very long time to run on the
complete data set.
• Data Reduction Strategies
– Data cube aggregation – summarize data
– Attribute subset selection - e.g., remove unimportant / redundant /
irrelevant attributes
– Dimensionality reduction — encoding mechanisms to reduce data set size
– Numerosity reduction — e.g., fit data into models
– Discretization and concept hierarchy generation (Generalization)

45
Data Reduction . . .
Data Cube Aggregation - this form of data reduction takes the task into
consideration. Summarize data to higher levels of abstraction using concept
hierarchies.

46
Data Reduction . . .
Dimensionality Reduction
• Dimensionality reduction method works with respect to the nature of the data.
• Identify which features to remove or collapse to a combined feature.
• This requires identifying redundancy in the given data and/or creating
composite dimensions or features that could sufficiently represent a set of raw
features.
• Strategies for reduction include sampling, clustering, principal component
analysis, etc.
Attribute Subset Selection
• Select attributes which are relevant, nonredundant, and important to the context
of the data science task.
• Selection may be done manually by a domain expert – difficult and time-
consuming – and poor attribute selection may lead to poor quality in the data
science task.

47
Data Discretization
• Data collected may contain attributes that are continuous, such as temperature,
ambient light, and a company’s stock price.
• These continuous values may need to be converted into more manageable
parts. This mapping is called discretization.
• Discretization inherently reduces data. Hence, the process of discretization can
also be perceived as a means of data reduction.
• There are three types of attributes involved in discretization:
a) Nominal: Values from an unordered set
b) Ordinal: Values from an ordered set
c) Continuous: Real numbers
• To achieve discretization, divide the range of continuous attributes into intervals.

48
Data Discretization . . .
• Ex: Temperature is a continuous-valued attribute. Suppose the average
temperature of each day is recorded, with values ranging from 20 to 52.
Each recorded average temperature is replaced with the interval it falls
into, such as [20 – 22], [30 – 32], or [48 – 50].
49
Data Analysis and Data Analytics
• Data Analysis
– hands-on data exploration and evaluation.
– looks backwards, providing marketers with a historical view of what has
happened in the past.
• Data Analytics
– includes data analysis as a necessary subcomponent.
– defines the science behind the analysis
– here, science means understanding problems and exploring data in
meaningful ways.
– thinks in terms of both past and future (prediction).
– makes extensive use of mathematics and statistics and the use of
descriptive techniques and predictive models to gain valuable knowledge
from data.
– These insights from data are used to recommend action or to guide
decision-making in a business context.

50
Descriptive Analysis
• Looking for something specific in data.
• What is happening now based on incoming data?
Ex: categorize customers by their likely product preferences and
purchasing patterns
• Typically, it is the first kind of data analysis performed on a dataset.
• Usually it is applied to large volumes of data.
• Data cannot be properly used if it is not correctly interpreted.
• Raw data needs to be organized and summarized before it can be analyzed.
Data can only reveal patterns and allow us to draw conclusions if it is
presented as an organized summary.
• It quantitatively describes the main features of a collection of data.
• Descriptive analysis facilitates analyzing and summarizing the data and is thus
instrumental to processes inherent in data science.

51
Descriptive Analysis . . .
• Humans often point out significant aspects of the world with numbers,
Ex: size, height, score, cost etc.
• Numerical representation can hold a considerable advantage over words.
Numbers allow humans to more precisely differentiate between objects or
concepts.

Variables
• Before we process or analyze any data, we have to be able to capture and
represent it. This is done with the help of variables.
• A variable is a label we give to our data.
• Numeric information can be separated into distinct categories - categorical
variable, nominal variable, ordinal variable, interval variable, ratio variable.
• Independent variable (predictor), dependent variable (target).

52
Descriptive Analysis – Frequency Distribution
• Of course, data needs to be displayed.
• Once some data has been collected, it is useful to plot a graph showing how
many times each score occurs. This is known as a frequency distribution.
• Frequency distributions come in different shapes and sizes. Therefore, it is
important to have some general descriptions for common types of distribution.

53
Descriptive Analysis – Frequency Distribution . . .

• Histogram - Histograms plot values of observations on the horizontal axis, with
a bar showing how many times each value occurred in the dataset. Works for
numerical data.
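
A sketch of plotting such a histogram with matplotlib, using made-up scores:

    import matplotlib.pyplot as plt

    # Hypothetical numerical observations (e.g., test scores)
    scores = [52, 55, 58, 60, 60, 61, 63, 65, 65, 65, 67, 70, 72, 75, 80]

    plt.hist(scores, bins=6, edgecolor="black")   # bar height = frequency
    plt.xlabel("Score")
    plt.ylabel("Frequency")
    plt.title("Frequency distribution of scores")
    plt.show()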

54
Descriptive Analysis – Frequency Distribution . . .

• Pie Chart – For visualizing categorical data when it’s distributed in a few finite
categories.

55
Descriptive Analysis – Frequency Distribution . . .

Normal Distribution
• In an ideal world, data would be distributed symmetrically around the center of
all scores.
• Thus, if we drew a vertical line through the center of a distribution, both sides
should look the same.
• This so-called normal distribution is characterized by a bell-shaped curve.

56
Descriptive Analysis – Frequency Distribution . . .

Normal Distribution . . .
• There are two ways in which a distribution can deviate from normal:
(1) Lack of symmetry (called skew)

57
Descriptive Analysis – Frequency Distribution . . .

Normal Distribution . . .
• There are two ways in which a distribution can deviate from normal . . .
(2) Pointiness (called kurtosis) - the degree to which scores cluster at the
tails of a distribution; a flat distribution is platykurtic, while a “pointy”
one is leptokurtic.

58
Descriptive Analysis – Frequency Distribution . . .

Measures of Centrality
• Often, one number can tell us enough about a distribution. This is typically a
number that points to the “center” of a distribution which is also known as the
Central Tendency.
• There are three measures commonly used: mean, median, and mode.
• Mean
– Used to measure the central tendency of continuous data as well as a
discrete dataset.
– Susceptible to the influence of outliers.
– Only meaningful if the data is normally distributed, or at least close to
looking like a normal distribution.
– Is not a good measure for skewed data.

59
Descriptive Analysis – Frequency Distribution . . .

Measures of Centrality . . .
• Mode
– the most frequently occurring value in a dataset.
– normally used for categorical data.
• Median
– the middle score of a dataset that has been sorted according to the
values of the data.
– with an even number of values, the median is calculated as the average
of the middle two data points.
– Ex: 11 4 9 12 1 7
Sorted data: 1 4 7 9 11 12
median = average of middle values = (7 + 9) / 2 = 8

60
Descriptive Analysis – Frequency Distribution . . .

Dispersion of a Distribution
• Data distributions come in all shapes and sizes.
• Simply looking at a central point may not help in understanding the actual shape
of a distribution.
• Therefore, it is required to look at the spread or dispersion of a distribution.
• Range
– Difference between largest and smallest scores in the data.
– Because it uses only the highest and lowest values, outliers tend to result
in an inaccurate picture of the more likely range.

61
Descriptive Analysis – Frequency Distribution . . .

Dispersion of a Distribution . . .
• Interquartile Range
– calculate range after removing extreme values.
– Remove top and bottom quarters of data and calculate range of the
remaining (middle 50%) of the scores.

62
Descriptive Analysis – Frequency Distribution . . .

63
Descriptive Analysis – Frequency Distribution . . .

Dispersion of a Distribution . . .
• Variance
– indicates how spread out the data points are.
– pick a center of the distribution, typically the mean, then measure how far
each data point is from the center. If the individual observations vary greatly
from the group mean, the variance is big; and vice versa.
– measure of spread in units squared.

64
Descriptive Analysis – Frequency Distribution . . .

Dispersion of a Distribution . . .
• Variance . . .

65
Descriptive Analysis – Frequency Distribution . . .

Dispersion of a Distribution . . .
• Standard Deviation
– Variance measures spread in units squared.
– The standard deviation is the square root of the variance.
– This ensures that the measure of average spread is in the same units as
the original measure.
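
A small worked sketch on hypothetical data, using the population variance
(dividing by N):

    data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical observations

    mean = sum(data) / len(data)              # 5.0
    # Variance: average squared distance from the mean (units squared)
    variance = sum((x - mean) ** 2 for x in data) / len(data)   # 4.0
    # Standard deviation: square root of variance (original units)
    std_dev = variance ** 0.5                 # 2.0

    print(mean, variance, std_dev)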

66
Diagnostic Analytics
• Used for discovery, or to determine why something happened.
• When done hands-on with a small dataset, it is also known as causal analysis,
since it involves at least one cause (usually more than one) and one effect.
• Allows a look at past performance to determine what happened and why. The
result of the analysis is often referred to as an analytic dashboard.
• Correlation Analysis
– Statistical measure that examines how two variables change together over
time.
– measures and describes the strength and direction of the relationship
between two variables.
– Strength indicates how closely two variables are related to each other, and
direction indicates how one variable would change its value as the value of
the other variable changes.

67
Diagnostic Analytics . . .
• Correlation Analysis . . .
– Pearson's r correlation - measures the degree of the relationship between
linearly related variables:

r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B}

where A and B are attributes and N is the number of tuples;
– \bar{A} and \bar{B} are the means of A and B, respectively;
– \sigma_A and \sigma_B are the standard deviations of A and B, respectively.
– If r_{A,B} > 0 : A and B are positively correlated.
– If r_{A,B} = 0 : A and B are independent.
– If r_{A,B} < 0 : A and B are negatively correlated.

Higher absolute values indicate stronger correlation.
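
A direct translation of the formula into Python (population standard
deviations, as above), on hypothetical paired data:

    # Hypothetical paired observations of attributes A and B
    A = [1, 2, 3, 4, 5]
    B = [2, 4, 5, 4, 5]

    N = len(A)
    mean_a, mean_b = sum(A) / N, sum(B) / N
    std_a = (sum((a - mean_a) ** 2 for a in A) / N) ** 0.5
    std_b = (sum((b - mean_b) ** 2 for b in B) / N) ** 0.5

    r = sum((a - mean_a) * (b - mean_b)
            for a, b in zip(A, B)) / (N * std_a * std_b)
    print(r)   # ~0.77 > 0, so A and B are positively correlated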


68
Predictive Analytics
• Understanding the future based on trends in past data as well as emerging
new contexts and processes.
• Ex: What might be the success rate of a new product if launched into the
market?
• Provides companies with actionable insights based on data. Such information
includes estimates about the likelihood of a future outcome.
• Tools - SAS predictive analytics, IBM predictive analytics, RapidMiner, etc.
• Applications – Healthcare, Finance, CRM, etc.

69
Prescriptive Analytics
• Finding the best course of action for a given situation.
• This may start by
– first analyzing the situation (using descriptive analysis),
– then moving toward finding relationships among various variables, and
– finally addressing a specific problem.
• It is a process-intensive task.
• It analyzes potential decisions, the interactions between decisions, the influences
that bear upon these decisions, and the bearing all of this has on an outcome, to
ultimately prescribe an optimal course of action in real time.
• Suggests options for taking advantage of a future opportunity or mitigating a
future risk.
• Continually and automatically processes new data to improve the accuracy of
predictions and provide advantageous decision options.

70
Prescriptive Analytics . . .
• Specific techniques include optimization, simulation, game theory, and decision-
analysis methods.
• Prescriptive analytics gives laser-like focus to answer specific questions.
• Prescriptive analytics can be really valuable in deriving insights from given data,
but it is largely not used.
• According to Gartner, 13% of organizations are using predictive analytics, but
only 3% are using prescriptive analytics.

71
Exploratory Analysis
• Often when working with data, we may not have a clear understanding of the
problem or the situation.
• And yet, we may be called on to provide some insights. In other words, we are
asked to provide an answer without knowing the question.
• Exploratory analysis is an approach to analyzing datasets to find previously
unknown relationships.
• Often such analysis involves using various data visualization approaches.
Plotting data in different forms can provide us with some clues regarding what
we may find or want to find in the data. Such insights can then be useful for
defining future studies/questions, leading to other forms of analysis.
• Exploratory analysis should not be used alone for generalizing and/or making
predictions from the data.

72
Exploratory Analysis . . .
• It postpones the usual assumptions about what kind of model the data follows
with the more direct approach of allowing the data itself to reveal its underlying
structure in the form of a model.
• Thus, exploratory analysis is not a mere collection of techniques; rather, it offers
a philosophy as to how to dissect a dataset; what to look for; how to look; and
how to interpret the outcomes.
• Vast and varied applications - The most common application is looking for
patterns in the data, such as finding groups of similar genes from a collection of
samples.

73
Mechanistic Analysis
• Involves understanding the exact changes in variables that lead to changes in
other variables for individual objects.
• Ex: studying the effects of carbon emissions on bringing about the Earth’s
climate change.
• Regression analysis
– is a process for estimating the relationships among variables.
– is a way of predicting an outcome variable from predictor variable(s).
– Linear Regression – the relationship among variables is assumed to be linear.
• several predictor variables - multiple linear regression.
• one predictor variable - simple linear regression.
– Ex: can be used to generate insights on consumer behavior, advertising
expenditures, stock prices, etc.

74
Mechanistic Analysis . . .
• Simple Linear Regression - Y = mX + c, where c is the intercept and m is
the slope.
– (Figure: a fitted regression line predicting a value Y1 for a new sample X1.)
75
Mechanistic Analysis . . .
• Simple Linear Regression . . .
– Once the values of m and c are calculated, it is possible to estimate the
value of Y from the value of X.
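
A sketch of estimating m and c by ordinary least squares on made-up data,
then predicting Y for a new sample:

    # Hypothetical (X, Y) observations
    X = [1, 2, 3, 4, 5]
    Y = [2.1, 4.0, 6.2, 7.9, 10.1]

    n = len(X)
    mean_x, mean_y = sum(X) / n, sum(Y) / n

    # Least-squares estimates of slope m and intercept c
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
         / sum((x - mean_x) ** 2 for x in X))
    c = mean_y - m * mean_x

    x1 = 6                        # new sample X1
    print(m, c, m * x1 + c)       # predicted value Y1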

76
