Unit I & II_FDS_II AI&DS
Unit I
Need for data science – benefits and uses – facets of data – data science process – setting the research goal – retrieving data – cleansing, integrating and transforming data – exploratory data analysis – build the models – presenting and building applications.
Data Science
Data Science is a multi-disciplinary field whose objective is to perform data analysis and generate knowledge that can be used for decision making. This knowledge can take the form of patterns, predictive models, forecasting models, and so on.
A data science application collects data and information from multiple heterogeneous sources; cleans, integrates, processes and analyses this data using various tools; and presents the resulting information and knowledge in various visual forms.
Voice of the customer: Text analytics on reviews, surveys, and
social media data can provide actionable insights into customer
satisfaction and areas for improvement.
Dynamic pricing: Retailers can adjust prices in real time based on
data insights such as demand fluctuations, competitor pricing, or
supply chain disruptions.
Environmental monitoring: Analyzing data from sensors and
satellites helps in monitoring environmental conditions, such as air
quality, deforestation, or ocean pollution.
Sports and Entertainment
Sports analytics: Data science is widely used in sports for
performance analysis, player tracking, injury prevention, and
strategy optimization.
Fan engagement: Sports teams and entertainment platforms use
data to understand audience preferences, offering personalized
content, experiences, and marketing strategies.
Game development: Video game developers use data science to
improve gameplay, balance game mechanics, and provide a more
personalized user experience.
Government and Public Sector
Crime prediction and prevention: Data science helps law
enforcement agencies predict and prevent crimes by analyzing
patterns of criminal activity and identifying areas at high risk.
Urban planning: Governments use data science to analyze traffic
patterns, environmental factors, and population demographics to
inform urban development and planning.
Public health: During epidemics or pandemics, data science is used
to model disease spread, optimize resource allocation, and support
policy decisions.
Facets of data
Big data and data science involve very large amounts of data. These data come in many types; the main categories are as follows:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Structured
Structured data is arranged in a row and column format, which makes it easy for applications to retrieve and process. A database management system (DBMS) is typically used to store structured data.
The most common form of structured data or records is a database
where specific information is stored based on a methodology of columns
and rows.
Structured data is also searchable by data type within content.
Structured data is understood by computers and is also efficiently
organized for human readers.
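As a small illustration, the following sketch (using Python's built-in sqlite3 module; the table and values are hypothetical) stores structured data in rows and columns and retrieves it by column:

import sqlite3

# Create an in-memory database with one structured table (rows and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, marks REAL)")
conn.execute("INSERT INTO students VALUES (1, 'Asha', 87.5)")
conn.execute("INSERT INTO students VALUES (2, 'Ravi', 91.0)")

# Structured data can be searched by column and by data type
for row in conn.execute("SELECT name, marks FROM students WHERE marks > 90"):
    print(row)  # ('Ravi', 91.0)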
Unstructured Data
Unstructured data is data that does not follow a specified format. Rows and columns are not used, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
Unstructured data can take the form of text (documents, email messages, customer feedback), audio, video and images. Email is an example of unstructured data. The data can be of any type and does not follow any structural rules.
Natural language
Natural language is a special type of unstructured data. Natural
language processing enables machines to recognize characters,
words and sentences, then apply meaning and understanding to that
information.
For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process comprising several layers of text analysis.
Machine - Generated Data
Machine-generated data is information that is created without human interaction, as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated.
Machine data contains a definitive record of all activity and
behavior of our customers, users, transactions, applications, servers,
networks, factory machinery and so on.
Examples of machine data are web server logs, call detail
records, network event logs and telemetry.
Both Machine-to-Machine (M2M) and Human-to-Machine (H2M)
interactions generate machine data. Machine data is generated
continuously by every processor-based system, as well as many
consumer-oriented systems.
Streaming Data
Streaming data is data that is generated continuously by thousands
of data sources, which typically send in the data records simultaneously
and in small sizes (order of Kilobytes).
Streaming data includes a wide variety of data, such as log files generated by customers using mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services, and telemetry from connected devices or instrumentation in data centers.
The data science process
Setting the research goal
This step involves understanding the business context and defining the what, why and how of the project in a clear, focused research goal.
Retrieving data
This step involves acquiring the data required for the project from all the identified internal and external sources, which helps to answer the business question. It includes gaining an understanding of the data you have and of what each piece of data means. Data can also be delivered by third-party companies and takes many forms, ranging from Excel spreadsheets to different types of databases.
Data preparation
Data can have many inconsistencies, such as missing values, blank columns and incorrect data formats, which need to be cleaned. We need to process, explore and condition the data before modeling. Clean data gives better predictions.
Data exploration
Data exploration is about building a deeper understanding of the data: how variables interact with each other, how the data are distributed, and whether there are outliers.
To achieve this, use descriptive statistics, visual techniques and simple modeling. This step is also called Exploratory Data Analysis (EDA).
Data modeling
In this step, the actual model building starts. The data scientist splits the dataset into training and testing sets. Techniques such as association, classification and clustering are applied to the training data set. The model, once prepared, is tested against the testing dataset.
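As a minimal sketch of this step (using scikit-learn and its built-in Iris dataset; the 70/30 split ratio and the choice of classifier are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Distribute the dataset: 70% for training, 30% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply a classification technique to the training set ...
model = DecisionTreeClassifier().fit(X_train, y_train)

# ... then test the prepared model against the testing set
print("Test accuracy:", model.score(X_test, y_test))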
Spend time understanding the goals and context of your
research:
An essential outcome is the research goal that states the purpose
of your assignment in a clear and focused manner.
Understanding the business goals and context is critical for project
success.
Continue asking questions and devising examples until you grasp
the exact business expectations, identify how your project fits in
the bigger picture, appreciate how your research is going to change
the business, and understand how they’ll use your results.
Create a project charter
A project charter requires teamwork, and your input covers at least the
following:
A clear research goal
The project mission and context
How you’re going to perform your analysis
What resources you expect to use
Proof that it’s an achievable project, or proof of concepts
Deliverables and a measure of success
A timeline
Retrieving data
Retrieving the required data is the second phase of a data science project. Sometimes data scientists need to go into the field and design a data collection process themselves.
Many companies will have already collected and stored the data, and what they don't have can often be bought from third parties. A lot of high-quality data is freely available for public and commercial use. Data can be stored in various formats, from plain text files to tables in a database, and may be internal or external.
A data repository, also known as a data library or data archive, is a general term for a data set isolated to be mined for data reporting and analysis. A data repository is a large database infrastructure: several databases that collect, manage and store data sets for data analysis, sharing and reporting.
The term data repository covers several ways of collecting and storing data:
A data warehouse is a large data repository that aggregates data, usually from multiple sources or segments of a business, without the data necessarily being related.
A data lake is a large data repository that stores unstructured data that is classified and tagged with metadata.
Data marts are subsets of a data repository. They are more targeted to what the data user needs and easier to use.
Do not be afraid to shop around
If the required data is not available within the company, it can be obtained from other companies that provide such data; for example, Nielsen and GfK provide data for the retail industry. Data scientists also draw on data from Twitter, LinkedIn and Facebook.
Government organizations share their data for free with the world. This data can be of excellent quality, depending on the institution that creates and manages it. The information they share covers a broad range of topics, such as the number of accidents or the amount of drug abuse in a certain region, and its demographics.
Cleansing, integrating, and transforming data
The model needs the data in a specific format, so data transformation is an essential step. It is a good habit to correct data errors as early in the process as possible.
Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in the data, so that the data become a true and consistent representation of the processes they originate from.
Types of errors:
Interpretation error: taking a value in the data for granted, such as a record stating that a person's age is greater than 300 years.
Inconsistencies: representing the same thing in different ways, such as putting "Female" in one table and "F" in another.
An overview of common errors
# Fix known misspellings in a categorical value
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
Redundant Whitespace:
Whitespace tends to be hard to detect but causes errors just like other redundant characters.
Example: a mismatch of keys such as "FR " – "FR".
Fixing redundant whitespace: in Python, the strip() function removes leading and trailing spaces.
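A small sketch of the fix:

key = "FR "                  # trailing whitespace causes a key mismatch
print(key == "FR")           # False
print(key.strip() == "FR")   # True: strip() removes leading and trailing spaces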
Appending Table
Appending or stacking tables is effectively adding observations
from one table to another table.
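For example, a minimal sketch using pandas (the tables and values are illustrative):

import pandas as pd

# Two tables with the same columns, e.g. observations from two months
jan = pd.DataFrame({"country": ["FR", "DE"], "sales": [120, 150]})
feb = pd.DataFrame({"country": ["FR", "DE"], "sales": [130, 160]})

# Appending (stacking) adds the observations of one table below the other
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)  # four rows: FR/DE for the first month followed by FR/DE for the second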
Transforming data
In data transformation, the data are transformed or consolidated
into forms appropriate for mining. Relationships between an input
variable and an output variable aren't always linear.
Reducing the Number of Variables
Having too many variables in the model makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables.
All techniques based on Euclidean distance perform well only up to about 10 variables. Data scientists use special methods to reduce the number of variables while retaining the maximum amount of information.
Euclidean distance :
Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between the corresponding coordinates of two points:

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
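A quick sketch of the calculation (the points are illustrative):

import numpy as np

a = np.array([5.0, 3.0, 1.0])
b = np.array([1.0, 0.0, 1.0])

# Square root of the sum of squared differences between the coordinates
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # 5.0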
Exploratory data analysis
Bar charts, line plots and distribution plots are some of the graphs used in exploratory analysis.
Pareto diagram
A Pareto diagram is a combination of the values and a cumulative distribution.
Example
It's easy to see from such a diagram that the first 50% of the countries contain slightly less than 80% of the total amount.
If this graph represented customer buying power and we sell
expensive products, we probably don’t need to spend our
marketing budget in every country; we could start with the first
50%.
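A minimal sketch of such a diagram using matplotlib (the values are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Values per country, sorted from largest to smallest (illustrative)
values = np.array([50, 30, 10, 5, 3, 2])
cumulative = np.cumsum(values) / values.sum()

fig, ax1 = plt.subplots()
ax1.bar(range(len(values)), values)             # the individual values
ax2 = ax1.twinx()                               # second y-axis for the cumulative share
ax2.plot(range(len(values)), cumulative, "o-")
ax2.set_ylim(0, 1.05)
plt.show()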
Box plot
A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and its skewness by displaying the data quartiles (or percentiles) and averages.
Build the models
The components of model building are as follows:
Selection of model and variable
Execution of model
Model diagnostic and model comparison
Model and Variable Selection
For this phase, consider the model's performance and whether the project meets all the requirements to use the model, as well as other factors:
Must the model be moved to a production environment and, if so,
would it be easy to implement?
How difficult is the maintenance on the model: how long will it
remain relevant if left untouched?
Does the model need to be easy to explain?
Model Execution
Various programming languages can be used to implement the model. For model execution, Python provides libraries such as StatsModels and Scikit-learn, which implement several of the most popular techniques.
Coding a model from scratch is a nontrivial task in most cases, so having these libraries available can speed up the process. The following are remarks on the model output:
a) Model fit: R-squared or adjusted R-squared is used to measure how well the model fits the data.
b) Predictor variables have a coefficient: for a linear model this is easy to interpret.
c) Predictor significance: coefficients are great, but sometimes not enough evidence exists to show that the influence is really there.
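For instance, a minimal sketch with StatsModels (the data are synthetic) that reports the model fit, the coefficients and their significance:

import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(size=100)

X = sm.add_constant(x)       # add an intercept term
results = sm.OLS(y, X).fit()
print(results.summary())     # R-squared, coefficients and p-values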
Linear regression works if we want to predict a value; to classify something, classification models are used. The k-nearest neighbors method is one of the best known.
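A minimal sketch with Scikit-learn (using its built-in Iris dataset; k = 5 is illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each test observation is classified by a majority vote of its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))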
3. WEKA: It is a free data mining software package with an analytic
workbench. The functions created in WEKA can be executed within Java
code.
4. Python is a programming language that provides toolkits for machine
learning and analysis.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
Presenting findings and building applications
The team delivers final reports, briefings, code and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
The last stage of the data science process is where your soft skills will be most useful: presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.
Unit II
Frequency Distribution
A frequency distribution is a representation, either in a graphical or
tabular format, that displays the number of observations within a
given interval.
There are two main types of frequency distributions for quantitative data:
1. Ungrouped Frequency Distribution
2. Grouped Frequency Distribution
Ungrouped Frequency Distribution
An ungrouped frequency distribution is used when the dataset
is small and consists of individual data points (usually discrete). It lists
each unique data value along with its corresponding frequency (i.e., the
number of times it occurs).
Steps to Create an Ungrouped Frequency Distribution:
Step 1: Arrange the data in ascending or descending order.
Step 2: Count how many times each unique value appears in the
dataset.
Step 3: Organize the data into a table with two columns: one for
the data values and the other for their corresponding frequencies.
Example:
Consider the following dataset of ages: 18, 21, 18, 22, 22, 23, 21, 22, 24, 21
Grouped Frequency Distribution
A grouped frequency distribution is used when the dataset is large or spread over a wide range; the data are organized into class intervals.
Step 1: Find the range of the data (the difference between the maximum and minimum values) and decide on the number of intervals.
Step 2: Calculate the class width. The class width is the range of values within each interval. It is determined by dividing the range of the data by the number of intervals:

Class width = Range / Number of intervals

Step 3: Create intervals (or classes). The intervals should cover the full range of the data and be mutually exclusive (no overlaps).
Step 4: Count the frequency of data points falling in each interval.
Step 5: Construct a table showing the intervals (classes) and the corresponding frequencies.
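As a small sketch in Python, the ungrouped and grouped distributions for the ages above can be computed as follows (the interval width of 2 is illustrative):

from collections import Counter

ages = [18, 21, 18, 22, 22, 23, 21, 22, 24, 21]

# Ungrouped: each unique value with its frequency
for value, freq in sorted(Counter(ages).items()):
    print(value, freq)   # 18:2, 21:3, 22:3, 23:1, 24:1

# Grouped: class intervals of width 2 covering the range 18-24
intervals = [(18, 19), (20, 21), (22, 23), (24, 25)]
for low, high in intervals:
    freq = sum(low <= a <= high for a in ages)
    print(f"{low}-{high}: {freq}")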
Relative Frequency Distributions
A relative frequency distribution is a way of showing how often
each value or class interval occurs relative to the total number of data
points. It expresses the frequency of each data value or class as a
proportion or percentage of the total dataset.
Formula for Relative Frequency:
The relative frequency of a value or class interval is calculated by dividing the frequency of that value or class by the total number of observations in the dataset:

Relative frequency = Frequency of the class / Total number of observations
Example 2: Relative Frequency Distribution for Grouped
Data
Consider the following dataset of test scores:
55, 67, 78, 89, 92, 85, 56, 76, 90, 83, 77, 66, 60, 73, 88, 80,
65, 95, 79, 82
Step 1: Group the data into intervals:
55–61, 62–68, 69–75, 76–82, 83–89, 90–96
Step 2: Count the frequencies for each interval.
Step 3: Divide the frequency of each interval by the total number of observations (here, 20) to obtain the relative frequencies.
Step 4: Optionally, express each relative frequency as a percentage.
Step 5: Construct the relative frequency table.
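A short sketch computing the relative frequencies for the intervals above:

scores = [55, 67, 78, 89, 92, 85, 56, 76, 90, 83,
          77, 66, 60, 73, 88, 80, 65, 95, 79, 82]
intervals = [(55, 61), (62, 68), (69, 75), (76, 82), (83, 89), (90, 96)]
total = len(scores)

for low, high in intervals:
    freq = sum(low <= s <= high for s in scores)
    print(f"{low}-{high}: frequency {freq}, relative frequency {freq / total:.2f}")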
Cumulative Frequency Distributions
A cumulative frequency distribution shows, for each class interval, the total number of observations that fall in that interval or below it. After the data have been sorted and the class intervals defined:
3. Find the Frequency: Count how many data points fall into each class interval.
4. Calculate the Cumulative Frequency: Start from the first class interval and add the frequency of each interval to the sum of the frequencies of the previous intervals.
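A minimal sketch of the running sum (the frequencies are those of the test-score intervals above):

from itertools import accumulate

frequencies = [3, 3, 1, 6, 4, 3]              # frequency per class interval
for freq, cum in zip(frequencies, accumulate(frequencies)):
    print(freq, cum)                          # the cumulative frequency ends at 20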
Frequency distributions for nominal data
Frequency distributions for nominal data are used to summarize
and display the counts or frequencies of categories within a variable
that represents distinct groups or categories with no inherent order.
Nominal Data: This is data that consists of categories or groups that are
mutually exclusive. Examples include:
Gender: Male, Female, Other
Color: Red, Blue, Green
Country: USA, Canada, Mexico
Frequency Distribution for Nominal Data
A frequency distribution for nominal data shows how many times
each category appears in the dataset. It typically includes both:
Frequency: The count of observations in each category.
Relative Frequency: The proportion or percentage of the total
that each category represents.
Components of a Frequency Distribution for Nominal Data:
Categories (Levels): These are the different groups or labels that
make up the nominal variable.
Example: For the variable "favorite color," the categories
might be Red, Blue, and Green.
Frequency: The count of how many times each category occurs in
the dataset.
Example: How many people in the survey like the color Red,
Blue, or Green.
Relative Frequency (optional): The proportion of observations
that fall into each category. It is calculated by dividing the
frequency of a category by the total number of observations.
Formula:
Relative frequency of a category = Frequency of the category / Total number of observations
Example: If there are 20 responses and 5 of those are for the color Red, the relative frequency for Red is 5/20 = 0.25, or 25%.
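A small sketch for a nominal variable (the survey responses are illustrative):

from collections import Counter

colors = ["Red", "Blue", "Red", "Green", "Blue", "Red"]
counts = Counter(colors)
total = len(colors)

for category, freq in counts.items():
    print(f"{category}: frequency {freq}, relative frequency {freq / total:.0%}")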
Graphical Representation:
To visualize the frequency distribution of nominal data, bar charts and pie charts are commonly used.
Interpreting distributions
GRAPHS
Data can be described clearly and concisely with the aid of a well-
constructed frequency distribution.
Graphs for Quantitative Data
Histograms
A histogram is a special kind of bar graph that applies to
quantitative data (discrete or continuous).
The horizontal axis represents the range of data values. The bar
height represents the frequency of data values falling within the interval
formed by the width of the bar.
Some of the more important features of histograms:
Equal units along the horizontal axis (the X axis) reflect the various
class intervals of the frequency distribution.
Equal units along the vertical axis (the Y axis, or ordinate) reflect
increases in frequency. (The units along the vertical axis do not
have to be the same width as those along the horizontal axis.)
The intersection of the two axes defines the origin at which both
numerical scales equal 0.
Numerical scales always increase from left to right along the
horizontal axis and from bottom to top along the vertical axis.
The body of the histogram consists of a series of bars whose
heights reflect the frequencies for the various classes.
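A minimal sketch using matplotlib (reusing the test scores from the earlier example; six bins are illustrative):

import matplotlib.pyplot as plt

scores = [55, 67, 78, 89, 92, 85, 56, 76, 90, 83,
          77, 66, 60, 73, 88, 80, 65, 95, 79, 82]

# Equal class intervals along the X axis; bar height gives the frequency
plt.hist(scores, bins=6, edgecolor="black")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()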
Frequency polygon
Frequency polygons are a graphical device for understanding the
shapes of distributions. They serve the same purpose as histograms, but
are especially helpful for comparing sets of data.
Frequency polygons are also a good choice for displaying
cumulative frequency distributions.
We can say that a frequency polygon depicts the shape and trends of the data. It can be drawn with or without a histogram. The midpoints of the class intervals are used for the positions on the horizontal axis, and the frequencies for the vertical axis.
A line indicates that there is a continuous movement. A frequency
polygon should therefore be used for scale variables that are binned, but
sometimes a frequency polygon is also used for ordinal variables.
Frequency polygons are useful for comparing distributions. This is
achieved by overlaying the frequency polygons drawn for different data
sets.
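A minimal sketch (the midpoints and frequencies are those of the test-score intervals used earlier):

import matplotlib.pyplot as plt

midpoints = [58, 65, 72, 79, 86, 93]   # midpoints of the class intervals
frequencies = [3, 3, 1, 6, 4, 3]       # frequency of each interval

# Plot the frequency against each interval midpoint and connect the points
plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("Class midpoint")
plt.ylabel("Frequency")
plt.show()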
Stem-and-Leaf Displays
In a stem-and-leaf display, each data value is split into a "stem" and a "leaf", and each leaf is placed next to its corresponding stem. In these diagrams, all of the scores are represented without the loss of any information.
A stem-and-leaf plot retains the original data. The leaves are usually the last digit in each data value and the stems are the remaining digits.
MEDIAN
The median is the middle value when the data points are arranged
in ascending or descending order.
How to Calculate the Median?
Sort the Data: Arrange the data in ascending (or descending)
order.
Determine the Position of the Median:
o If the dataset contains an odd number of values, the median
is the middle value.
o If the dataset contains an even number of values, the median
is the average of the two middle values.
Steps for Finding the Median:
1. Sort the data in numerical order.
2. Find the position of the median:
o If the number of data points n is odd, the median is the value
at position (n+1)/2
o If the number of data points n is even, the median is the average of the two middle values, which are located at positions n/2 and (n/2) + 1.
Examples:
Odd Number of Data Points (Simple Median):
Data: {1, 3, 5, 7, 9}
Sorted data: {1, 3, 5, 7, 9}
The median is the middle value, which is 5.
Even Number of Data Points:
Data: {1, 3, 5, 7}
The two middle values are 3 and 5, so the median is their average: (3 + 5) / 2 = 4.
The median is 4.
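Both cases in one short sketch using Python's statistics module:

import statistics

print(statistics.median([1, 3, 5, 7, 9]))  # odd count: middle value -> 5
print(statistics.median([1, 3, 5, 7]))     # even count: average of 3 and 5 -> 4.0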
MEAN
The mean is one of the most fundamental statistical concepts. The basic idea is to sum all the data points and divide by the number of points in the dataset:

Mean = (Sum of all values) / (Number of values)
Example 1:
Data: {4, 6, 8, 10, 12}
Step 1: Sum all the values
4+6+8+10+12=40
Step 2: Divide by the number of values
There are 5 values in the dataset, so divide 40 by 5:
Mean=40/5=8
So, the mean of this dataset is 8.
Example 2: Using Larger Numbers
Consider a larger dataset with more complex numbers, such as test
scores or salaries in thousands:
Data: {1200, 1500, 1800, 2100, 2400, 2700, 3000}
Step 1: Sum all the values
1200+1500+1800+2100+2400+2700+3000 = 14700
Step 2: Divide by the number of values
There are 7 values in this dataset, so:
Mean = 14700 / 7 = 2100
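Both examples in one short sketch using Python's statistics module:

import statistics

print(statistics.mean([4, 6, 8, 10, 12]))                           # 8
print(statistics.mean([1200, 1500, 1800, 2100, 2400, 2700, 3000]))  # 2100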
Example:
Ranking of customer satisfaction on a scale from 1 to 5, where 1 is
"very dissatisfied" and 5 is "very satisfied."
Customer ratings: [3, 5, 4, 2, 5]
To find the mean rank:
o Sum the ranks: 3+5+4+2+5=19
o Divide by the number of observations: 19/5=3.8
o The mean rank is 3.8, which suggests an average level of
satisfaction slightly above "neutral."