DAV Unit 1
Data in data analytics refers to raw figures or facts that are collected, stored, and processed to
derive meaningful insights. It can be numbers, text, images, videos, or other types of
measurable information used to analyze trends, patterns, and behaviors.
In simple terms:
Types of Data
1. Structured Data:
o Organized in rows and columns like in a table or database.
o Example: An Excel sheet with student names, ages, and marks.
2. Unstructured Data:
o Not organized or stored in a predefined format.
o Example: Photos, social media posts, emails, or videos.
3. Semi-structured Data:
o Not fully organized but has some structure (like tags or metadata).
o Example: JSON or XML files.
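The difference between structured and semi-structured data can be sketched in a few lines of Python. This is only an illustration: the student records, names, and email address below are made-up sample values, not data from these notes.

```python
import json

# Structured data: every record has the same fields, like rows in a table.
structured_rows = [
    {"name": "Asha", "age": 20, "marks": 85},
    {"name": "Ravi", "age": 21, "marks": 78},
]

# Semi-structured data: JSON carries its own tags (keys), but records
# do not have to share exactly the same fields.
semi_structured_text = """
[
  {"name": "Asha", "contact": {"email": "asha@example.com"}},
  {"name": "Ravi", "hobbies": ["chess", "cricket"]}
]
"""
records = json.loads(semi_structured_text)

# Every structured row exposes the same columns.
columns = sorted(structured_rows[0].keys())

# Semi-structured records may differ, so collect the keys each one uses.
keys_per_record = [sorted(record.keys()) for record in records]

print(columns)
print(keys_per_record)
```

Unstructured data such as photos or free text has no keys at all, so it usually needs extra processing before it can be put into either form.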
Characteristics of Data
Imagine a retail store wants to understand customer buying habits. Here’s the data they might
collect:
Data Analytics is the process of examining, organizing, and interpreting data to extract useful
insights and patterns. These insights help businesses, organizations, and individuals make
better decisions.
In simple terms:
Data analytics is like solving a puzzle where the pieces are bits of data.
When you put the pieces together, you can understand a bigger picture, like trends,
behaviors, or outcomes.
Steps in Data Analytics
1. Collecting Data:
o Gather data from various sources like surveys, sales reports, sensors, or
websites.
o Example: A company collects data about product sales, customer feedback, and
website visits.
2. Cleaning Data:
o Remove errors, duplicates, or irrelevant data to ensure accuracy.
o Example: Fix missing data or correct typos in a customer database.
3. Analyzing Data:
o Use statistical and computational methods to find patterns or relationships in
the data.
o Example: Calculate average sales or identify which products are most popular.
4. Visualizing Data:
o Present findings using charts, graphs, or dashboards to make them easy to
understand.
o Example: Create a pie chart showing the percentage of sales from different
product categories.
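The four steps above can be sketched as a tiny pipeline. This is a minimal illustration under stated assumptions: the sales records, field names (`order_id`, `product`, `amount`), and function names are hypothetical, and a real project would typically use a library such as pandas instead of plain lists.

```python
from collections import Counter
from statistics import mean

# Step 1 (collect): hypothetical sales records gathered from a sales report.
raw_sales = [
    {"order_id": 1, "product": "Laptop", "amount": 800.0},
    {"order_id": 2, "product": "Mouse", "amount": 20.0},
    {"order_id": 2, "product": "Mouse", "amount": 20.0},   # duplicate row
    {"order_id": 3, "product": "Mouse", "amount": None},   # missing amount
    {"order_id": 4, "product": "Keyboard", "amount": 50.0},
    {"order_id": 5, "product": "Mouse", "amount": 20.0},
]

def clean(records):
    """Step 2 (clean): drop rows with a missing amount and repeated order IDs."""
    seen_ids = set()
    cleaned = []
    for row in records:
        if row["amount"] is None or row["order_id"] in seen_ids:
            continue
        seen_ids.add(row["order_id"])
        cleaned.append(row)
    return cleaned

def analyze(records):
    """Step 3 (analyze): average sale amount and the most frequently sold product."""
    average_amount = mean(row["amount"] for row in records)
    product_counts = Counter(row["product"] for row in records)
    most_popular = product_counts.most_common(1)[0][0]
    return average_amount, most_popular

def revenue_shares(records):
    """Step 4 (visualize): percentage of revenue per product, ready for a pie chart."""
    total = sum(row["amount"] for row in records)
    totals = {}
    for row in records:
        totals[row["product"]] = totals.get(row["product"], 0.0) + row["amount"]
    return {product: 100 * amount / total for product, amount in totals.items()}

cleaned_sales = clean(raw_sales)
average_amount, most_popular = analyze(cleaned_sales)
shares = revenue_shares(cleaned_sales)
print(average_amount, most_popular, shares)
```

The percentages returned by `revenue_shares` are the numbers a pie chart of sales by product would display.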
Types of Data Analytics
1. Descriptive Analytics:
o Tells what happened in the past.
o Example: "Our sales increased by 20% last quarter."
2. Diagnostic Analytics:
o Explains why something happened.
o Example: "Sales increased because of a holiday season promotion."
3. Predictive Analytics:
o Predicts future outcomes based on past data.
o Example: "Sales are likely to increase by 30% next quarter."
4. Prescriptive Analytics:
o Suggests actions to achieve a desired outcome.
o Example: "Offer discounts during the holiday season to boost sales."
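As a rough illustration of the first and third types, the sketch below computes a descriptive figure (percentage change) and a deliberately naive prediction. The quarterly sales numbers and both function names are assumptions made up for this example; real predictive analytics would use a proper statistical or machine learning model.

```python
# Hypothetical sales for four consecutive quarters.
quarterly_sales = [100.0, 110.0, 125.0, 150.0]

def percent_change(previous, current):
    """Descriptive: how much did sales change from one quarter to the next?"""
    return 100 * (current - previous) / previous

def naive_forecast(values):
    """Predictive (very naive): extend the last value by the average past change."""
    changes = [later - earlier for earlier, later in zip(values, values[1:])]
    return values[-1] + sum(changes) / len(changes)

last_quarter_growth = percent_change(quarterly_sales[-2], quarterly_sales[-1])
next_quarter_estimate = naive_forecast(quarterly_sales)

print(f"Sales changed by {last_quarter_growth:.0f}% last quarter")
print(f"Naive estimate for next quarter: {next_quarter_estimate:.1f}")
```

Diagnostic and prescriptive analytics build on the same numbers: the first asks what explains the change, the second asks what action to take because of it.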
Example: Data Analytics in a Restaurant
1. Data Collection:
o The restaurant collects data on daily sales, popular dishes, customer
demographics, and feedback.
2. Analysis:
o They find:
Most customers order burgers and fries.
Sales peak during weekends.
Customers complain about slow service.
3. Insights:
o The restaurant learns:
They should focus on improving their burger menu.
Hire extra staff on weekends to handle the rush.
4. Action:
o Introduce a new burger combo and streamline kitchen processes to improve
service speed.
Data Mining is the process of discovering patterns, trends, and useful information from large
sets of data. It uses techniques from statistics, machine learning, and database systems to
analyze and extract hidden insights that might not be immediately obvious.
Data mining is like digging into a mountain of data to find valuable "gold nuggets" of
information.
It helps organizations make better decisions by uncovering trends and patterns in data.
Steps in Data Mining
1. Data Collection:
o Gather large amounts of data from various sources like databases, websites, or
devices.
o Example: An online retailer collects data on customer purchases, browsing
habits, and product reviews.
2. Data Cleaning:
o Remove errors, duplicates, or irrelevant data to ensure the analysis is accurate.
o Example: Fix typos in customer names or remove incomplete records.
3. Data Integration:
o Combine data from multiple sources into a single dataset.
o Example: Merge customer purchase data with demographic information.
4. Data Analysis:
o Apply algorithms to identify patterns, trends, or relationships in the data.
o Example: Use clustering to group customers with similar buying habits.
5. Interpretation and Action:
o Translate the patterns into actionable insights.
o Example: Use findings to recommend products to customers or improve
marketing strategies.
Data Mining Techniques
1. Classification:
o Assign data into predefined categories.
o Example: Classifying emails as "spam" or "not spam."
2. Clustering:
o Group similar data points together.
o Example: Grouping customers based on shopping habits.
3. Association Rules:
o Find relationships between items in a dataset.
o Example: "Customers who buy bread often buy butter."
4. Regression:
o Predict a numeric value based on existing data.
o Example: Predicting house prices based on size and location.
5. Outlier Detection:
o Identify data points that don't fit the usual pattern.
o Example: Detecting fraudulent transactions in a credit card dataset.
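Outlier detection, the last technique above, can be sketched with a simple z-score rule: flag values that lie many standard deviations from the mean. The transaction amounts, the function name `find_outliers`, and the threshold of 2 are assumptions chosen for this small example; the rule is one common heuristic, not the only way to detect outliers.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score (distance from the mean in
    standard deviations) is larger than the threshold."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]

# Hypothetical credit card transaction amounts; one is unusually large.
transaction_amounts = [20, 25, 22, 19, 24, 21, 500]
suspicious = find_outliers(transaction_amounts)
print(suspicious)
```

A low threshold is used because the sample is tiny: with only seven values, no single point can reach a z-score of 3, so a stricter cutoff would never flag anything.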
Example: Data Mining in a Supermarket
1. Data Collection:
o The supermarket collects data on customer purchases over a year.
2. Analysis Using Association Rules:
o They find that customers who buy diapers often buy beer.
3. Insight:
o There is a strong association between these two products.
4. Action:
o The supermarket places beer and diapers closer to each other to encourage
combined purchases, leading to increased sales.
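The strength of an association such as "diapers and beer" is usually expressed with two numbers: support (how often the items appear together) and confidence (how often the second item appears when the first one does). The sketch below computes both for a handful of made-up shopping baskets; the transactions and function names are hypothetical.

```python
# Each set is one hypothetical shopping basket.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
]

def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data, diapers and beer appear together in 3 of 5 baskets, and 3 of the 4 baskets with diapers also contain beer.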
Knowledge Discovery is the overall process of finding useful and meaningful information or
patterns in large datasets. It involves multiple steps to extract insights that can help in decision-
making. Data Mining is a key step in this process.
In simple terms:
Steps in Knowledge Discovery
1. Data Selection:
o Identify and choose the relevant data needed for analysis.
o Example: A retail store selects sales data for the last two years.
2. Data Preprocessing:
o Clean and organize the data to remove errors, duplicates, or missing values.
o Example: Correct typos in customer names or remove records with incomplete
purchase details.
3. Data Transformation:
o Convert the data into a suitable format for analysis.
o Example: Transform raw sales data into monthly sales summaries.
4. Data Mining:
o Apply algorithms and techniques to uncover patterns or relationships in the data.
o Example: Use clustering to group customers based on their shopping habits.
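5. Pattern Evaluation:
o Evaluate the discovered patterns and keep those that are valid and useful.
6. Knowledge Representation:
o Present the findings in an understandable form, such as reports or charts.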
Example: Knowledge Discovery in a Bank
1. Data Selection:
o The bank collects customer data, including income, credit history, and previous
loan repayments.
2. Data Preprocessing:
o Remove incomplete records and correct errors in the dataset.
3. Data Transformation:
o Organize the data into categories like "low risk" and "high risk."
4. Data Mining:
o Use classification algorithms to identify characteristics of customers who are
likely to default on loans.
5. Pattern Evaluation:
o Identify that customers with a credit score below 600 and income below $50,000
are more likely to default.
6. Knowledge Representation:
o Create a report highlighting these risk factors and suggest strategies to reduce
defaults, like offering smaller loans to high-risk customers.
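The pattern evaluation step in this example can be sketched by comparing default rates inside and outside the suspected high-risk group. The customer records below are invented purely for illustration, and the function names are hypothetical; only the two cutoffs (credit score below 600, income below $50,000) come from the example above.

```python
# Hypothetical customer records; real data would come from the bank's database.
customers = [
    {"credit_score": 550, "income": 30_000, "defaulted": True},
    {"credit_score": 580, "income": 45_000, "defaulted": True},
    {"credit_score": 590, "income": 40_000, "defaulted": False},
    {"credit_score": 700, "income": 80_000, "defaulted": False},
    {"credit_score": 650, "income": 48_000, "defaulted": False},
    {"credit_score": 720, "income": 60_000, "defaulted": True},
    {"credit_score": 610, "income": 90_000, "defaulted": False},
]

def is_high_risk(customer):
    """The pattern under evaluation: low credit score AND low income."""
    return customer["credit_score"] < 600 and customer["income"] < 50_000

def default_rate(group):
    """Fraction of customers in the group who defaulted."""
    if not group:
        return 0.0
    return sum(c["defaulted"] for c in group) / len(group)

high_risk = [c for c in customers if is_high_risk(c)]
others = [c for c in customers if not is_high_risk(c)]

high_risk_rate = default_rate(high_risk)
other_rate = default_rate(others)
print(f"high risk: {high_risk_rate:.2f}, others: {other_rate:.2f}")
```

If the default rate in the flagged group is clearly higher than in the rest, the pattern is worth reporting in the knowledge representation step.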
Data Mining: A step in the knowledge discovery process, focusing on finding patterns
in data.
Knowledge Discovery: The complete process, from selecting data to presenting
actionable insights.
Types of Relations
1. One-to-One Relationship:
o Each item in one dataset is related to only one item in another dataset.
o Example: A person and their unique passport number.
2. One-to-Many Relationship:
o One item in one dataset is related to multiple items in another dataset.
o Example: A customer can make multiple purchases at a store.
3. Many-to-Many Relationship:
o Multiple items in one dataset are related to multiple items in another dataset.
o Example: Students enrolled in multiple courses, and each course having
multiple students.
4. Hierarchical Relationships:
o Data arranged in a tree-like structure.
o Example: A company’s organizational structure where one manager supervises
several employees.
5. Network Relationships:
o Complex, interconnected relationships among data points.
o Example: Social media connections where users are linked to their friends.
Understanding Connections: Relations help identify how different pieces of data are
connected.
Finding Patterns: Relations reveal trends or patterns, such as customer behavior or
product preferences.
Making Predictions: Relations enable predictive analytics, such as forecasting sales
based on customer interactions.
Scenario: An e-commerce company wants to analyze its customer data to improve sales.
1. Data:
o Collect customer data such as name, age, gender, purchase history, and
feedback.
2. Relations:
o One-to-One: Each customer has a unique customer ID.
o One-to-Many: A single customer may have multiple orders in their purchase
history.
o Many-to-Many: Customers can buy multiple products, and each product can
be bought by multiple customers.
3. Analysis:
o Identify that younger customers (18-25 years) frequently buy trendy gadgets.
o Analyze relations between product categories and purchase frequency to
recommend products.
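The three relation types in this scenario can be sketched with ordinary Python dictionaries. The customer IDs, names, and orders are hypothetical sample values; in practice these relations would usually live in database tables linked by keys.

```python
from collections import defaultdict

# One-to-one: each customer ID identifies exactly one customer.
customers = {101: "Asha", 102: "Ravi"}

# Each order belongs to one customer and can contain several products.
orders = [
    {"order_id": 1, "customer_id": 101, "products": ["phone", "case"]},
    {"order_id": 2, "customer_id": 101, "products": ["charger"]},
    {"order_id": 3, "customer_id": 102, "products": ["phone"]},
]

# One-to-many: one customer -> many orders.
orders_by_customer = defaultdict(list)
for order in orders:
    orders_by_customer[order["customer_id"]].append(order["order_id"])

# Many-to-many: a customer buys many products, and a product has many buyers.
products_by_customer = defaultdict(set)
buyers_by_product = defaultdict(set)
for order in orders:
    for product in order["products"]:
        products_by_customer[order["customer_id"]].add(product)
        buyers_by_product[product].add(order["customer_id"])

print(dict(orders_by_customer))
print(dict(buyers_by_product))
```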
The Iris dataset is one of the most famous and widely used datasets in data analytics and
machine learning. It is a small dataset that is simple to work with, making it ideal for beginners
learning data analysis, classification, and clustering techniques.
1. Dataset Size:
o It contains information about 150 samples of iris flowers.
o These samples are equally distributed across three species of iris flowers:
Iris-setosa
Iris-versicolor
Iris-virginica
2. Attributes (Features): The dataset has four numerical features for each flower:
o Sepal Length: Length of the sepal, the outer part of the flower that encloses the petals, in cm.
o Sepal Width: Width of the sepal in cm.
o Petal Length: Length of the petal in cm.
o Petal Width: Width of the petal in cm.
3. Target Variable (Label):
o The species of the flower (setosa, versicolor, or virginica).
4. Data Format: It is often stored in a tabular format, like this:
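As a sketch of that tabular layout, the snippet below prints two rows per species from the Iris dataset. The column names and the helper `format_table` are assumptions for this illustration; different copies of the dataset label the columns slightly differently.

```python
# Column names are an assumption; measurements are in cm.
header = ("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

# Two rows per species from the Iris dataset.
rows = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
]

def format_table(header, rows):
    """Render the header and rows as simple pipe-separated text lines."""
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(value) for value in row))
    return "\n".join(lines)

table_text = format_table(header, rows)
print(table_text)
```

Each row is one flower: four numeric features followed by the species label.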
1. Simplicity:
o The dataset is small and easy to understand.
o Perfect for beginners learning data visualization, classification, and machine
learning.
2. Variety:
o It contains both continuous features (like petal length) and a categorical target
variable (species).
3. Balanced Classes:
o Each of the three species has 50 samples, so no class dominates the dataset.
1. Data Visualization:
o Helps create scatter plots, histograms, and pair plots to observe relationships
between features.
2. Classification:
o Used to train machine learning models like k-Nearest Neighbors (k-NN),
Support Vector Machines (SVM), and Decision Trees to classify the flower
species.
3. Clustering:
o Helps in unsupervised learning to group flowers based on their features (e.g., k-
means clustering).
4. Feature Analysis:
o Allows analyzing which features (like petal length) are most useful for
distinguishing between species.
Scenario: You want to classify an iris flower based on its sepal and petal dimensions.
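One possible way to handle this scenario is a k-Nearest Neighbors classifier: find the k known flowers closest to the new one and take a majority vote on their species. The sketch below is a minimal version written from scratch, using only nine training rows (three per species) from the Iris dataset; the function name `classify` and the choice k = 3 are assumptions, and a real analysis would use all 150 samples, typically through a library such as scikit-learn.

```python
import math
from collections import Counter

# (sepal length, sepal width, petal length, petal width) in cm, with species.
training = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
]

def classify(sample, k=3):
    """Predict the species by majority vote among the k training flowers
    with the smallest Euclidean distance to the sample."""
    by_distance = sorted(training, key=lambda item: math.dist(sample, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

new_flower = (5.0, 3.4, 1.5, 0.2)
print(classify(new_flower))
```

Petal length and petal width differ strongly between species, which is why even this tiny training set separates the classes in simple cases.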
Data scales describe the different ways that data can be measured or classified. In data
analytics, understanding data scales is important because it helps determine which statistical or
analytical methods can be used.
1. Nominal Scale
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale
1. Nominal Scale
Definition:
Data is categorized into distinct groups or categories without any order or ranking.
Characteristics:
o No numerical value or order.
o Categories are mutually exclusive (no overlap).
Example:
o Types of fruits: Apple, Banana, Orange.
o Gender: Male, Female.
Usage:
o Used for classification, grouping, and counting.
o Example in Analytics: Counting how many people prefer each type of fruit.
2. Ordinal Scale
Definition:
Data is categorized into ordered categories, but the intervals between the categories are
not uniform.
Characteristics:
o There is an order or ranking.
o The size of the difference between rankings is not meaningful (the gap between Poor and Average need not equal the gap between Good and Excellent).
Example:
o Customer satisfaction levels: Poor, Average, Good, Excellent.
o Education level: High School, Bachelor's, Master's, Ph.D.
Usage:
o Used for ranking or prioritizing.
o Example in Analytics: Analyzing customer satisfaction trends over time.
3. Interval Scale
Definition:
Data is measured on a scale where intervals between values are meaningful, but there
is no true zero point.
Characteristics:
o Differences between values are meaningful.
o No "absolute zero" (e.g., zero does not mean "nothing").
Example:
o Temperature in Celsius or Fahrenheit: 20°C, 30°C (difference of 10°C is
meaningful, but 0°C does not mean "no temperature").
o Time of day: 2 PM, 3 PM (intervals are consistent).
Usage:
o Used for comparing differences.
o Example in Analytics: Analyzing temperature changes over a period.
4. Ratio Scale
Definition:
Data is measured on a scale with meaningful intervals and a true zero point, where zero
indicates "nothing."
Characteristics:
o Differences and ratios are meaningful.
o Allows for all mathematical operations (addition, subtraction, multiplication,
division).
Example:
o Weight: 0 kg, 50 kg, 100 kg (0 kg means no weight).
o Income: $0, $10,000, $50,000.
Usage:
o Used for quantitative analysis, like calculating averages or percentages.
o Example in Analytics: Analyzing the average income of a group.
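The practical consequence of the four scales is that each one supports a different set of summary statistics. The sketch below pairs each scale with a statistic that is meaningful for it; all sample values and variable names are made up for illustration.

```python
from statistics import mean, mode

# Nominal: categories can only be counted, so the mode is the natural summary.
fruit_votes = ["Apple", "Banana", "Apple", "Orange", "Apple"]
favourite_fruit = mode(fruit_votes)

# Ordinal: categories have an order, so a median makes sense,
# but an arithmetic mean of the labels does not.
levels = ["Poor", "Average", "Good", "Excellent"]
ratings = ["Good", "Poor", "Excellent", "Good", "Average"]
sorted_ranks = sorted(levels.index(rating) for rating in ratings)
median_rating = levels[sorted_ranks[len(sorted_ranks) // 2]]  # odd sample size

# Interval: differences are meaningful, ratios are not
# (30 degrees Celsius is not "1.5 times as hot" as 20 degrees Celsius).
temperature_rise = 30.0 - 20.0

# Ratio: a true zero exists, so ratios and means are both meaningful.
weight_ratio = 100.0 / 50.0          # 100 kg is twice as heavy as 50 kg
average_income = mean([0, 10_000, 50_000])

print(favourite_fruit, median_rating, temperature_rise, weight_ratio, average_income)
```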
Dissimilarity measures in data analytics are methods used to quantify how different two data
points are. These measures help in understanding the "distance" or "difference" between
objects, which is critical for tasks like clustering, classification, and recommendation systems.
1. Purpose:
To identify how "similar" or "different" two data points are based on their attributes.
2. Types of Data:
Dissimilarity measures can be applied to:
o Numerical data (e.g., age, height, income).
o Categorical data (e.g., gender, color, product type).
o Mixed data (both numerical and categorical).
3. Use Case:
Dissimilarity measures are used in algorithms like k-means clustering, nearest
neighbor classification, and hierarchical clustering.
1. Clustering:
Dissimilarity measures are used to group similar data points into clusters (e.g., grouping customers
with similar purchasing behavior).
2. Recommendation Systems:
Services such as Netflix or Amazon can use dissimilarity measures to recommend movies or products
based on how close a user's tastes are to those of other users.
3. Outlier Detection:
Data points that lie far from all other points can be flagged as outliers.
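Three widely used dissimilarity measures can be sketched as follows: Euclidean and Manhattan distance for numerical data, and a simple mismatch ratio for categorical data. The function names and sample values are assumptions for this illustration; these notes do not prescribe specific measures.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mismatch_ratio(a, b):
    """Categorical data: fraction of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Numerical example: two hypothetical points in two dimensions.
point_a, point_b = (0, 0), (3, 4)
print(euclidean(point_a, point_b), manhattan(point_a, point_b))

# Categorical example: (color, size, material) of two hypothetical products.
shirt_a = ("red", "small", "cotton")
shirt_b = ("red", "large", "wool")
print(mismatch_ratio(shirt_a, shirt_b))
```

For all three measures, a value of 0 means the two objects are identical, and larger values mean they are more different. Numerical features are often rescaled first so that one feature with large values (such as income) does not dominate the distance.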
Similarity measures in data analytics quantify how "similar" two data points are. They are
used to compare and find relationships between objects based on their features. Unlike
dissimilarity measures (which measure how different two points are), similarity measures focus
on how close or related the objects are.
1. Purpose:
To evaluate the degree of resemblance between two data points.
2. Applications:
o Clustering (e.g., grouping similar customers).
o Recommendation systems (e.g., Netflix recommending movies).
o Information retrieval (e.g., finding similar documents).
o Classification (e.g., categorizing data based on similarity).
3. Types of Data:
o Numerical data (e.g., age, height, income).
o Categorical data (e.g., gender, preferences).
o Text data (e.g., documents, reviews).
1. Clustering:
Similarity measures are used to group data points into clusters, such as grouping customers with
similar buying patterns.
2. Recommendation Systems:
Suggest products, movies, or books based on similarity to user preferences.
Example: "People who bought this also bought..."
3. Text Analysis:
Compare documents for plagiarism or recommend similar articles.
4. Pattern Recognition:
Identify similar patterns in time-series data (e.g., stock prices or weather patterns).
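Two common similarity measures can be sketched as follows: cosine similarity for numerical vectors (such as user ratings) and Jaccard similarity for sets (such as the words in a document). The function names and sample data are assumptions for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two numeric vectors; for non-negative
    data such as ratings, 1.0 means same direction and 0.0 means nothing shared."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention used here: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical movie ratings by two users for the same three movies.
user_a = (1, 2, 3)
user_b = (2, 4, 6)
print(cosine_similarity(user_a, user_b))

# Hypothetical short documents compared by the words they contain.
doc_a = "the cat sat".split()
doc_b = "the cat ran".split()
print(jaccard_similarity(doc_a, doc_b))
```

Unlike a dissimilarity measure, a higher value here means the two objects are more alike.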