IAR University

Department of Computer Sciences and Engineering


B.Tech (CE-AI) & B.Tech IT SEM VI
Subject: CE412_Data Analytics and Visualization
CS707_Data Analytics

Unit 1

What is Data in Data Analytics?

Data in data analytics refers to raw figures or facts that are collected, stored, and processed to
derive meaningful insights. It can be numbers, text, images, videos, or other types of
measurable information used to analyze trends, patterns, and behaviors.

In simple terms:

• Data is like raw materials.
• When processed, it helps make decisions, solve problems, and predict future outcomes.

Types of Data

1. Structured Data:
o Organized in rows and columns like in a table or database.
o Example: An Excel sheet with student names, ages, and marks.
2. Unstructured Data:
o Not organized or stored in a predefined format.
o Example: Photos, social media posts, emails, or videos.
3. Semi-structured Data:
o Not fully organized but has some structure (like tags or metadata).
o Example: JSON or XML files.
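
As a rough illustration of the difference between semi-structured and structured data, the Python sketch below flattens a small JSON document into table-like rows. The student records, the field names, and the helper `to_rows` are made up for this example.

```python
import json

# Semi-structured input: records share some keys, but "marks" is missing for one student.
raw = '''
[
  {"name": "Asha", "age": 20, "marks": 82},
  {"name": "Ravi", "age": 21},
  {"name": "Meera", "age": 19, "marks": 91}
]
'''

def to_rows(json_text, columns):
    """Flatten a list of JSON objects into fixed-width rows; missing keys become None."""
    records = json.loads(json_text)
    return [tuple(rec.get(col) for col in columns) for rec in records]

columns = ("name", "age", "marks")
rows = to_rows(raw, columns)  # structured form: every row has the same three columns
for row in rows:
    print(row)
```

Once every record has the same columns, the data can be stored in a table or spreadsheet like the Excel example above.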

Characteristics of Data

1. Qualitative Data (Categorical):
o Non-numerical data.
o Example: Colors of cars (red, blue, black), customer feedback.
2. Quantitative Data (Numerical):
o Numbers or measurements.
o Example: Height, weight, sales figures.

Importance of Data in Analytics

• Data is the foundation of data analytics.
• Without data, it's impossible to perform analysis or make predictions.
• Data helps businesses understand customer behavior, improve operations, and increase
efficiency.

Dr. Pankti Bhatt


Example of Data in Data Analytics

Imagine a retail store wants to understand customer buying habits. Here’s the data they might
collect:

1. Customer details: Name, age, gender.
2. Sales transactions: Items bought, price, payment method.
3. Feedback: Customer ratings and reviews.

Using this data, the store can:

• Identify its most popular products.
• Predict seasonal demand for items.
• Improve customer experience by addressing complaints.

What is Data Analytics?

Data Analytics is the process of examining, organizing, and interpreting data to extract useful
insights and patterns. These insights help businesses, organizations, and individuals make
better decisions.

In simple terms:

• Data analytics is like solving a puzzle where the pieces are bits of data.
• When you put the pieces together, you can understand a bigger picture, like trends,
behaviors, or outcomes.

Key Steps in Data Analytics

1. Collecting Data:
o Gather data from various sources like surveys, sales reports, sensors, or
websites.
o Example: A company collects data about product sales, customer feedback, and
website visits.
2. Cleaning Data:
o Remove errors, duplicates, or irrelevant data to ensure accuracy.
o Example: Fix missing data or correct typos in a customer database.
3. Analyzing Data:
o Use statistical and computational methods to find patterns or relationships in
the data.
o Example: Calculate average sales or identify which products are most popular.
4. Visualizing Data:
o Present findings using charts, graphs, or dashboards to make them easy to
understand.
o Example: Create a pie chart showing the percentage of sales from different
product categories.
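
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration: the sales records, the order IDs, and the helper names `clean` and `share_by_category` are assumptions made for the example, and a plain-text bar chart stands in for a real pie chart.

```python
from collections import Counter
from statistics import mean

# Step 1 (collect): hypothetical records as (order_id, category, amount).
raw_sales = [
    (1, "Electronics", 1200.0),
    (2, "Clothing", 300.0),
    (2, "Clothing", 300.0),       # duplicate of order 2
    (3, "Groceries", None),       # missing amount
    (4, "Groceries", 150.0),
    (5, "electronics ", 800.0),   # inconsistent spelling
]

def clean(records):
    """Step 2: drop duplicate orders and rows with missing amounts, normalise category text."""
    seen, cleaned = set(), []
    for order_id, category, amount in records:
        if amount is None or order_id in seen:
            continue
        seen.add(order_id)
        cleaned.append((order_id, category.strip().title(), amount))
    return cleaned

def share_by_category(records):
    """Step 3: percentage of total sales contributed by each category."""
    totals = Counter()
    for _, category, amount in records:
        totals[category] += amount
    grand_total = sum(totals.values())
    return {cat: 100 * amt / grand_total for cat, amt in totals.items()}

cleaned = clean(raw_sales)
shares = share_by_category(cleaned)
average_sale = mean(amount for _, _, amount in cleaned)

# Step 4 (visualise): a plain-text bar chart, one '#' per 5 percentage points.
for category, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category:<12} {'#' * round(pct / 5)} {pct:.1f}%")
```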

Types of Data Analytics

1. Descriptive Analytics:
o Tells what happened in the past.
o Example: "Our sales increased by 20% last quarter."
2. Diagnostic Analytics:
o Explains why something happened.
o Example: "Sales increased because of a holiday season promotion."
3. Predictive Analytics:
o Predicts future outcomes based on past data.
o Example: "Sales are likely to increase by 30% next quarter."
4. Prescriptive Analytics:
o Suggests actions to achieve a desired outcome.
o Example: "Offer discounts during the holiday season to boost sales."
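
To make the first and third types concrete, the sketch below computes a descriptive figure (last quarter's growth) and a deliberately naive predictive one (a forecast that assumes the average past growth rate continues). The quarterly figures are hypothetical, and real predictive analytics would use a proper forecasting model.

```python
# Hypothetical quarterly sales figures (in thousands); not taken from any real business.
quarterly_sales = [100.0, 110.0, 125.0, 150.0]

def pct_change(previous, current):
    """Percentage change from one period to the next."""
    return 100 * (current - previous) / previous

# Descriptive: what happened last quarter?
last_growth = pct_change(quarterly_sales[-2], quarterly_sales[-1])

# Predictive (naive assumption): next quarter grows at the average past growth rate.
growth_rates = [pct_change(a, b) for a, b in zip(quarterly_sales, quarterly_sales[1:])]
avg_growth = sum(growth_rates) / len(growth_rates)
forecast = quarterly_sales[-1] * (1 + avg_growth / 100)

print(f"Last quarter growth: {last_growth:.1f}%")
print(f"Naive forecast for next quarter: {forecast:.1f}")
```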

Importance of Data Analytics

• Helps businesses understand customers better.
• Improves decision-making by providing data-driven insights.
• Detects patterns or trends to predict future events.
• Optimizes processes and performance.

Example of Data Analytics

Scenario: A restaurant wants to improve its business.

1. Data Collection:
o The restaurant collects data on daily sales, popular dishes, customer
demographics, and feedback.
2. Analysis:
o They find:
• Most customers order burgers and fries.
• Sales peak during weekends.
• Customers complain about slow service.
3. Insights:
o The restaurant learns that it should:
• Focus on improving its burger menu.
• Hire extra staff on weekends to handle the rush.
4. Action:
o Introduce a new burger combo and streamline kitchen processes to improve
service speed.

What is Data Mining?

Data Mining is the process of discovering patterns, trends, and useful information from large
sets of data. It uses techniques from statistics, machine learning, and database systems to
analyze and extract hidden insights that might not be immediately obvious.

In simple terms:

• Data mining is like digging into a mountain of data to find valuable "gold nuggets" of
information.
• It helps organizations make better decisions by uncovering trends and patterns in data.

Steps in Data Mining

1. Data Collection:
o Gather large amounts of data from various sources like databases, websites, or
devices.
o Example: An online retailer collects data on customer purchases, browsing
habits, and product reviews.
2. Data Cleaning:
o Remove errors, duplicates, or irrelevant data to ensure the analysis is accurate.
o Example: Fix typos in customer names or remove incomplete records.
3. Data Integration:
o Combine data from multiple sources into a single dataset.
o Example: Merge customer purchase data with demographic information.
4. Data Analysis:
o Apply algorithms to identify patterns, trends, or relationships in the data.
o Example: Use clustering to group customers with similar buying habits.
5. Interpretation and Action:
o Translate the patterns into actionable insights.
o Example: Use findings to recommend products to customers or improve
marketing strategies.

Techniques in Data Mining

1. Classification:
o Assign data into predefined categories.
o Example: Classifying emails as "spam" or "not spam."
2. Clustering:
o Group similar data points together.
o Example: Grouping customers based on shopping habits.
3. Association Rules:
o Find relationships between items in a dataset.
o Example: "Customers who buy bread often buy butter."
4. Regression:
o Predict a numeric value based on existing data.
o Example: Predicting house prices based on size and location.
5. Outlier Detection:
o Identify data points that don't fit the usual pattern.
o Example: Detecting fraudulent transactions in a credit card dataset.
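
As one possible illustration of outlier detection, the sketch below flags unusual transaction amounts using the modified z-score, which is based on the median and the median absolute deviation (MAD). The amounts are invented, and the 3.5 cutoff is a commonly quoted rule of thumb rather than something these notes prescribe.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # at least half the values equal the median; this method cannot rank them
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical card transactions: nine ordinary amounts and one very large one.
amounts = [25, 30, 28, 35, 27, 32, 29, 31, 26, 950]
suspicious = mad_outliers(amounts)
print(suspicious)
```

The median is used instead of the mean because a single extreme value can distort the mean and standard deviation enough to hide itself.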

Importance of Data Mining

• Helps businesses understand customer behavior.
• Identifies trends and patterns to improve decision-making.
• Detects anomalies or risks, such as fraud or equipment failure.
• Increases efficiency by uncovering hidden opportunities.

Example of Data Mining

Scenario: A supermarket wants to increase sales.

1. Data Collection:
o The supermarket collects data on customer purchases over a year.
2. Analysis Using Association Rules:
o They find that customers who buy diapers often buy beer.
3. Insight:
o There is a strong association between these two products.
4. Action:
o The supermarket places beer and diapers closer to each other to encourage
combined purchases, leading to increased sales.
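
How "strong" such an association is can be measured with support, confidence, and lift. The sketch below computes them for the rule diapers -> beer on a handful of made-up shopping baskets; the helper names are illustrative only.

```python
# Hypothetical baskets; each set holds the items bought in one transaction.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "butter"},
    {"diapers", "bread", "beer"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    s_both = support(antecedent | consequent, baskets)
    confidence = s_both / support(antecedent, baskets)
    lift = confidence / support(consequent, baskets)
    return s_both, confidence, lift

sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, transactions)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items appear together more often than they would if they were bought independently.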

Difference Between Data Mining and Data Analytics

• Data Mining focuses on finding hidden patterns in data.
• Data Analytics focuses on interpreting data to solve problems or make decisions.

What is Knowledge Discovery?

Knowledge Discovery is the overall process of finding useful and meaningful information or
patterns in large datasets. It involves multiple steps to extract insights that can help in decision-
making. Data Mining is a key step in this process.

In simple terms:

• Knowledge Discovery is like finding a hidden treasure in a sea of information.
• It transforms raw data into knowledge that can be understood and used effectively.

Steps in Knowledge Discovery

1. Data Selection:
o Identify and choose the relevant data needed for analysis.
o Example: A retail store selects sales data for the last two years.
2. Data Preprocessing:
o Clean and organize the data to remove errors, duplicates, or missing values.
o Example: Correct typos in customer names or remove records with incomplete
purchase details.
3. Data Transformation:
o Convert the data into a suitable format for analysis.
o Example: Transform raw sales data into monthly sales summaries.
4. Data Mining:
o Apply algorithms and techniques to uncover patterns or relationships in the data.
o Example: Use clustering to group customers based on their shopping habits.
5. Pattern Evaluation:
o Assess the patterns discovered to ensure they are meaningful and useful.
o Example: A pattern showing that customers buy more during holiday seasons is
useful for planning promotions.
6. Knowledge Representation:
o Present the findings in an easy-to-understand format like charts, graphs, or
reports.
o Example: Create a bar chart showing peak sales months.
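
The transformation and representation steps can be sketched as follows: raw dated sales are rolled up into monthly summaries, the peak month is picked out, and the result is shown as a text bar chart. The sales records and the function name `monthly_totals` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw sales records: (date of sale, amount).
raw_sales = [
    (date(2023, 11, 3), 120.0),
    (date(2023, 11, 20), 80.0),
    (date(2023, 12, 5), 200.0),
    (date(2023, 12, 24), 350.0),
    (date(2024, 1, 9), 90.0),
]

def monthly_totals(records):
    """Data transformation: collapse individual sales into (year, month) totals."""
    totals = defaultdict(float)
    for day, amount in records:
        totals[(day.year, day.month)] += amount
    return dict(sorted(totals.items()))

summary = monthly_totals(raw_sales)
peak_month = max(summary, key=summary.get)  # a very simple "pattern": the best month

# Knowledge representation: text bar chart, one '#' per 50 units of sales.
for (year, month), total in summary.items():
    print(f"{year}-{month:02d} {'#' * int(total // 50)} {total:.0f}")
```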

Importance of Knowledge Discovery

• Helps in decision-making by providing actionable insights.
• Uncovers hidden patterns or trends that may not be obvious.
• Improves efficiency by identifying opportunities and risks.
• Helps organizations adapt to changing environments by understanding data better.

Example of Knowledge Discovery

Scenario: A bank wants to reduce loan defaults.

1. Data Selection:
o The bank collects customer data, including income, credit history, and previous
loan repayments.
2. Data Preprocessing:
o Remove incomplete records and correct errors in the dataset.
3. Data Transformation:
o Organize the data into categories like "low risk" and "high risk."
4. Data Mining:
o Use classification algorithms to identify characteristics of customers who are
likely to default on loans.
5. Pattern Evaluation:
o Identify that customers with a credit score below 600 and income below $50,000
are more likely to default.
6. Knowledge Representation:
o Create a report highlighting these risk factors and suggest strategies to reduce
defaults, like offering smaller loans to high-risk customers.
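
One simple way to carry out the pattern evaluation step is to check whether the discovered rule actually separates defaulters from non-defaulters. The sketch below applies the thresholds from the example (credit score below 600 and income below $50,000) to a few invented customer records and compares default rates in the two groups.

```python
# Hypothetical customers: (credit_score, annual_income, defaulted).
customers = [
    (550, 30_000, True),
    (580, 45_000, True),
    (590, 40_000, False),
    (620, 35_000, False),
    (700, 80_000, False),
    (560, 90_000, False),
    (720, 48_000, True),
    (680, 60_000, False),
]

def is_high_risk(credit_score, income):
    """The pattern under evaluation: low credit score AND low income."""
    return credit_score < 600 and income < 50_000

def default_rate(group):
    """Share of customers in the group who defaulted (0.0 for an empty group)."""
    return sum(defaulted for _, _, defaulted in group) / len(group) if group else 0.0

high_risk = [c for c in customers if is_high_risk(c[0], c[1])]
low_risk = [c for c in customers if not is_high_risk(c[0], c[1])]

print(f"High-risk default rate: {default_rate(high_risk):.0%}")
print(f"Low-risk default rate:  {default_rate(low_risk):.0%}")
```

If the two rates were close, the pattern would not be useful and would be discarded at this step.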

Difference Between Data Mining and Knowledge Discovery

• Data Mining: A step in the knowledge discovery process, focusing on finding patterns
in data.
• Knowledge Discovery: The complete process, from selecting data to presenting
actionable insights.

What are Relations in Data Analytics?

Relations refer to the connections or associations between different pieces of data.
Understanding these relationships is essential for discovering patterns and insights in data
analytics.

Types of Relations in Data

1. One-to-One Relationship:
o Each item in one dataset is related to only one item in another dataset.
o Example: A person and their unique passport number.
2. One-to-Many Relationship:
o One item in one dataset is related to multiple items in another dataset.
o Example: A customer can make multiple purchases at a store.
3. Many-to-Many Relationship:
o Multiple items in one dataset are related to multiple items in another dataset.
o Example: Students enrolled in multiple courses, and each course having
multiple students.
4. Hierarchical Relationships:
o Data arranged in a tree-like structure.
o Example: A company’s organizational structure where one manager supervises
several employees.
5. Network Relationships:
o Complex, interconnected relationships among data points.
o Example: Social media connections where users are linked to their friends.
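
The five relation types above map naturally onto ordinary Python data structures. The sketch below is one possible representation; all names and values are invented for illustration.

```python
# One-to-one: each person has exactly one passport number, and no number is shared.
passport_of = {"Asha": "P1001", "Ravi": "P1002"}
is_one_to_one = len(set(passport_of.values())) == len(passport_of)

# One-to-many: one customer maps to a list of purchases.
purchases_of = {"Asha": ["laptop", "mouse"], "Ravi": ["phone"]}

# Many-to-many: stored as (student, course) pairs so it can be read in both directions.
enrollments = {("Asha", "Maths"), ("Asha", "Physics"), ("Ravi", "Maths")}

def courses_of(student):
    return sorted(c for s, c in enrollments if s == student)

def students_in(course):
    return sorted(s for s, c in enrollments if c == course)

# Hierarchical: every employee points to a single manager, which forms a tree.
manager_of = {"Asha": None, "Ravi": "Asha", "Meera": "Asha"}

def reports_of(manager):
    return sorted(e for e, m in manager_of.items() if m == manager)

# Network: each user maps to the set of users they are connected to.
friends = {"Asha": {"Ravi", "Meera"}, "Ravi": {"Asha"}, "Meera": {"Asha"}}

def mutual_friends(a, b):
    return friends[a] & friends[b]

print(courses_of("Asha"), students_in("Maths"))
print(reports_of("Asha"), mutual_friends("Ravi", "Meera"))
```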

Why are Relations Important in Data Analytics?

• Understanding Connections: Relations help identify how different pieces of data are
connected.
• Finding Patterns: Relations reveal trends or patterns, such as customer behavior or
product preferences.
• Making Predictions: Relations enable predictive analytics, such as forecasting sales
based on customer interactions.

Example of Data and Relations in Data Analytics

Scenario: An e-commerce company wants to analyze its customer data to improve sales.

1. Data:
o Collect customer data such as name, age, gender, purchase history, and
feedback.
2. Relations:
o One-to-One: Each customer has a unique customer ID.
o One-to-Many: A single customer may have multiple orders in their purchase
history.
o Many-to-Many: Customers can buy multiple products, and each product can
be bought by multiple customers.
3. Analysis:
o Identify that younger customers (18-25 years) frequently buy trendy gadgets.
o Analyze relations between product categories and purchase frequency to
recommend products.
4. Outcome:
o Use these insights to create personalized marketing campaigns or recommend
related products, boosting sales.

What is the Iris Dataset in Data Analytics?

The Iris dataset is one of the most famous and widely used datasets in data analytics and
machine learning. It is a small dataset that is simple to work with, making it ideal for beginners
learning data analysis, classification, and clustering techniques.

Key Features of the Iris Dataset

1. Dataset Size:
o It contains information about 150 samples of iris flowers.
o These samples are equally distributed across three species of iris flowers:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
2. Attributes (Features): The dataset has four numerical features for each flower:
o Sepal Length: Length of the sepal (the outer, leaf-like part that encloses the bud) in cm.
o Sepal Width: Width of the sepal in cm.
o Petal Length: Length of the petal (the inner, colored part of the flower) in cm.
o Petal Width: Width of the petal in cm.
3. Target Variable (Label):
o The species of the flower (setosa, versicolor, or virginica).
4. Data Format: It is often stored in a tabular format, like this:

Sepal Length (cm)   Sepal Width (cm)   Petal Length (cm)   Petal Width (cm)   Species
5.1                 3.5                1.4                 0.2                Iris-setosa
7.0                 3.2                4.7                 1.4                Iris-versicolor
6.3                 3.3                6.0                 2.5                Iris-virginica

Why is the Iris Dataset Popular in Data Analytics?

1. Simplicity:
o The dataset is small and easy to understand.
o Perfect for beginners learning data visualization, classification, and machine
learning.
2. Variety:
o It contains both continuous features (like petal length) and a categorical target
variable (species).
3. Balanced Classes:
o Each species has an equal number of samples (50), so no class dominates
training and plain accuracy is a fair first metric for supervised learning tasks.

Applications of the Iris Dataset

1. Data Visualization:
o Helps create scatter plots, histograms, and pair plots to observe relationships
between features.
2. Classification:
o Used to train machine learning models like k-Nearest Neighbors (k-NN),
Support Vector Machines (SVM), and Decision Trees to classify the flower
species.
3. Clustering:
o Helps in unsupervised learning to group flowers based on their features (e.g., k-
means clustering).
4. Feature Analysis:
o Allows analyzing which features (like petal length) are most useful for
distinguishing between species.

Example Analysis Using the Iris Dataset

Scenario: You want to classify an iris flower based on its sepal and petal dimensions.

1. Step 1: Data Exploration:
o Visualize the dataset to see how features like petal length differ across species.
o Example: A scatter plot showing petal length vs. petal width may show that Iris-
setosa is distinct from the other species.
2. Step 2: Train a Model:
o Use a classification algorithm like Decision Trees.
o Train the model on 80% of the dataset and test it on the remaining 20%.
3. Step 3: Predict:
o Input the dimensions of a new flower (e.g., sepal length: 5.8, petal width: 1.8,
together with its sepal width and petal length).
o The model outputs a predicted species, for example Iris-versicolor.
4. Step 4: Evaluate:
o Check the model's accuracy using metrics like precision, recall, or accuracy
score.
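
The prediction step can be sketched without any libraries. Note that this swaps in k-Nearest Neighbors (k-NN) instead of the Decision Tree used in the scenario, because k-NN fits in a few lines of plain Python. The training data is a nine-row excerpt (three rows per species) of the Iris dataset as it is commonly distributed, and the two query flowers are hypothetical measurements. With so few rows, an 80/20 split and accuracy evaluation would not be meaningful, so they are left out.

```python
import math
from collections import Counter

# Nine-row excerpt of the Iris dataset (three per species):
# (sepal length, sepal width, petal length, petal width) in cm, then the species.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Iris-setosa"),
    ((4.7, 3.2, 1.3, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "Iris-versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
    ((5.8, 2.7, 5.1, 1.9), "Iris-virginica"),
    ((7.1, 3.0, 5.9, 2.1), "Iris-virginica"),
]

def knn_predict(features, train=TRAIN, k=3):
    """Return the majority species among the k training rows nearest to `features`."""
    nearest = sorted(train, key=lambda row: math.dist(features, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical new flowers: one with small petals, one with large petals.
small_petals = knn_predict((5.0, 3.4, 1.5, 0.2))
large_petals = knn_predict((6.5, 3.0, 5.8, 2.2))
print(small_petals, large_petals)
```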

What are Data Scales in Data Analytics?

Data scales describe the different ways that data can be measured or classified. In data
analytics, understanding data scales is important because it helps determine which statistical or
analytical methods can be used.

There are four main types of data scales:

1. Nominal Scale
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale

1. Nominal Scale

• Definition:
Data is categorized into distinct groups or categories without any order or ranking.
• Characteristics:
o No numerical value or order.
o Categories are mutually exclusive (no overlap).
• Example:
o Types of fruits: Apple, Banana, Orange.
o Gender: Male, Female.
• Usage:
o Used for classification, grouping, and counting.
o Example in Analytics: Counting how many people prefer each type of fruit.

2. Ordinal Scale

• Definition:
Data is categorized into ordered categories, but the intervals between the categories are
not uniform.
• Characteristics:
o There is an order or ranking.
o The size of the difference between ranks is not defined (the gap between "Poor" and
"Average" need not equal the gap between "Good" and "Excellent").
• Example:
o Customer satisfaction levels: Poor, Average, Good, Excellent.
o Education level: High School, Bachelor's, Master's, Ph.D.
• Usage:
o Used for ranking or prioritizing.
o Example in Analytics: Analyzing customer satisfaction trends over time.

3. Interval Scale

• Definition:
Data is measured on a scale where intervals between values are meaningful, but there
is no true zero point.
• Characteristics:
o Differences between values are meaningful.
o No "absolute zero" (e.g., zero does not mean "nothing").
• Example:
o Temperature in Celsius or Fahrenheit: 20°C, 30°C (difference of 10°C is
meaningful, but 0°C does not mean "no temperature").
o Time of day: 2 PM, 3 PM (intervals are consistent).
• Usage:
o Used for comparing differences.
o Example in Analytics: Analyzing temperature changes over a period.

4. Ratio Scale

• Definition:
Data is measured on a scale with meaningful intervals and a true zero point, where zero
indicates "nothing."
• Characteristics:
o Differences and ratios are meaningful.
o Allows for all mathematical operations (addition, subtraction, multiplication,
division).
• Example:
o Weight: 0 kg, 50 kg, 100 kg (0 kg means no weight).
o Income: $0, $10,000, $50,000.
• Usage:
o Used for quantitative analysis, like calculating averages or percentages.
o Example in Analytics: Analyzing the average income of a group.

Why Are Data Scales Important in Data Analytics?

1. Choosing the Right Method:
o Different scales require different statistical and analytical techniques.
o Example: You can calculate averages for ratio data but not for nominal data.
2. Data Visualization:
o The choice of graph or chart depends on the data scale.
o Example: A bar chart is suitable for nominal data, while a line chart suits interval or
ratio data plotted against an ordered axis such as time.
3. Accurate Analysis:
o Using the wrong methods for a specific data scale can lead to incorrect results.
o Example: Running a regression analysis on ordinal data may not be appropriate.
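
The link between scale and method can be shown with Python's `statistics` module: the mode for nominal data, the median for ordinal data, differences for interval data, and means and ratios for ratio data. The sample values below reuse the examples from the scale definitions plus a few invented survey responses.

```python
from statistics import mean, median_low, mode

# Nominal: categories can only be counted, so the mode is the usable "average".
fruits = ["Apple", "Banana", "Apple", "Orange", "Apple"]
most_common_fruit = mode(fruits)

# Ordinal: categories can be ranked, so the median is meaningful
# (a mean would assume equal gaps between ranks).
levels = ["Poor", "Average", "Good", "Excellent"]
rank = {level: i for i, level in enumerate(levels)}
responses = ["Good", "Poor", "Excellent", "Good", "Average"]
median_response = levels[median_low(rank[r] for r in responses)]

# Interval: differences are meaningful, ratios are not.
temp_difference = 30.0 - 20.0                    # a 10-degree difference is meaningful
kelvin_ratio = (30.0 + 273.15) / (20.0 + 273.15) # on a true-zero scale the ratio is ~1.03, not 1.5

# Ratio: a true zero makes means and ratios meaningful.
incomes = [0, 10_000, 50_000]
average_income = mean(incomes)
income_ratio = incomes[2] / incomes[1]           # 50,000 is five times 10,000

print(most_common_fruit, median_response, temp_difference, average_income, income_ratio)
```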

Examples in Simple Scenarios

Scenario 1: Analyzing Customer Demographics

• Nominal: Gender (Male, Female).
• Ordinal: Education level (High School, Bachelor's, Master's).
• Ratio: Annual income ($50,000, $60,000).

Scenario 2: Weather Analysis

• Nominal: Type of weather (Sunny, Rainy, Cloudy).
• Interval: Temperature (25°C, 30°C).
• Ratio: Amount of rainfall (0 mm, 10 mm, 20 mm).

What is Set and Matrix Representation in Data Analytics?

What are Dissimilarity Measures in Data Analytics?

Dissimilarity measures in data analytics are methods used to quantify how different two data
points are. These measures help in understanding the "distance" or "difference" between
objects, which is critical for tasks like clustering, classification, and recommendation systems.

Key Points About Dissimilarity Measures:

1. Purpose:
To identify how "similar" or "different" two data points are based on their attributes.
2. Types of Data:
Dissimilarity measures can be applied to:
o Numerical data (e.g., age, height, income).
o Categorical data (e.g., gender, color, product type).
o Mixed data (both numerical and categorical).
3. Use Case:
Dissimilarity measures are used in algorithms like k-means clustering, nearest
neighbor classification, and hierarchical clustering.
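
The notes do not list specific formulas at this point, so the sketch below shows three widely used dissimilarity measures as an assumption about what is meant: Euclidean and Manhattan distance for numerical data, and simple matching dissimilarity for categorical data. The sample points and records are invented.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numerical points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each attribute."""
    return sum(abs(x - y) for x, y in zip(a, b))

def simple_matching_dissimilarity(a, b):
    """Fraction of categorical attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Numerical example: two points in a 2-D feature space.
p, q = (1.0, 2.0), (4.0, 6.0)
d_euclid = euclidean(p, q)
d_manhattan = manhattan(p, q)

# Categorical example: (gender, color, product type); the records differ only in color.
r1 = ("female", "red", "laptop")
r2 = ("female", "blue", "laptop")
d_categorical = simple_matching_dissimilarity(r1, r2)

print(d_euclid, d_manhattan, round(d_categorical, 3))
```

When numerical attributes have very different ranges (for example age and income), they are usually rescaled first so that one attribute does not dominate the distance.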

Why Are Dissimilarity Measures Important?

1. Clustering:
Clustering algorithms use dissimilarity measures to group similar data points into
clusters (e.g., grouping customers with similar purchasing behavior).
2. Recommendation Systems:
Services such as Netflix or Amazon can use dissimilarity measures to recommend
movies or products based on similar users.
3. Outlier Detection:
Identifies data points that are far away from all others.
4. Classification:
Helps assign new data points to predefined categories based on their similarity to
existing data.

What are Similarity Measures in Data Analytics?

Similarity measures in data analytics quantify how "similar" two data points are. They are
used to compare and find relationships between objects based on their features. Unlike
dissimilarity measures (which measure how different two points are), similarity measures focus
on how close or related the objects are.

Key Points About Similarity Measures

1. Purpose:
To evaluate the degree of resemblance between two data points.
2. Applications:
o Clustering (e.g., grouping similar customers).
o Recommendation systems (e.g., Netflix recommending movies).
o Information retrieval (e.g., finding similar documents).
o Classification (e.g., categorizing data based on similarity).
3. Types of Data:
o Numerical data (e.g., age, height, income).
o Categorical data (e.g., gender, preferences).
o Text data (e.g., documents, reviews).
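
Two common similarity measures, shown here as an assumed illustration since the notes do not give formulas at this point, are cosine similarity for numerical vectors and Jaccard similarity for sets such as the words in a document. The vectors and word sets are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two non-empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Numerical data: the second vector is the first one scaled by 2, so the direction is identical.
same_direction = cosine_similarity((1, 2, 3), (2, 4, 6))
unrelated = cosine_similarity((1, 0), (0, 1))

# Text data: two short documents represented as sets of words.
doc1 = {"data", "mining", "patterns"}
doc2 = {"data", "analytics", "patterns"}
word_overlap = jaccard_similarity(doc1, doc2)

print(round(same_direction, 3), unrelated, word_overlap)
```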

Why Are Similarity Measures Important?

1. Clustering:
Clustering algorithms use similarity measures to group data points into clusters, such as
grouping customers with similar buying patterns.
2. Recommendation Systems:
Suggest products, movies, or books based on similarity to user preferences.
Example: "People who bought this also bought..."
3. Text Analysis:
Compare documents for plagiarism or recommend similar articles.
4. Pattern Recognition:
Identify similar patterns in time-series data (e.g., stock prices or weather patterns).
