unit-2
SYLLABUS
Technical Requirements - Line chart - Bar charts - Scatter plot - Pie chart - Table chart - Polar chart - Data transformation techniques - Data cleaning - Loading the CSV file - Converting NaN values - Applying descriptive statistics - Data refactoring - Dropping columns - Data analysis - Number of emails - Time of day - Average emails per day and hour - Most frequently used words.
Line chart
• A line chart is used to illustrate the relationship between two or more
continuous variables.
• A line plot is used to represent quantitative values over a continuous
interval or time period. It is generally used to depict trends in how the
data has changed over time.
Program:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5, 6]
y = [1, 5, 3, 5, 7, 8]
plt.plot(x, y)
plt.show()
• Steps involved
Let's look at the process of creating the line chart:
1. Load and prepare the dataset.
2. Import the matplotlib library. It can be done with this command:
import matplotlib.pyplot as plt
3. Plot the graph:
plt.plot(df)
4. Display it on the screen:
plt.show()
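For example, a minimal end-to-end sketch (the CSV file name sales.csv and its column layout are assumptions, not part of the original notes):
import pandas as pd
import matplotlib.pyplot as plt
# Load and prepare the dataset
df = pd.read_csv('sales.csv', index_col='date', parse_dates=True)
# Plot the graph and display it
plt.plot(df)
plt.show()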
Bar chart
A Bar chart or bar graph is a chart or graph that presents categorical data with
rectangular bars with heights or lengths proportional to the values that they
represent. A bar plot is a way of representing data where the length of the bars
represents the magnitude/size of the feature/variable.
• Program (vertical bar chart)
import random
import calendar
import matplotlib.pyplot as plt
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
figure, axis = plt.subplots()
plt.xticks(months, calendar.month_name[1:13], rotation=20)
plot = axis.bar(months, sold_quantity)
plt.show()
Program (horizontal bar chart)
import random
import calendar
import matplotlib.pyplot as plt
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
figure, axis = plt.subplots()
plt.yticks(months, calendar.month_name[1:13], rotation=20)
plot = axis.barh(months, sold_quantity)
plt.show()
Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and
scatter
diagrams. They use a Cartesian coordinates system to display values of
typically two
variables for a set of data.
Scatter plots can be constructed in the following two situations:
1. When one continuous variable is dependent on another variable, which is
under the control of the observer
2. When both continuous variables are independent
There are two important concepts—independent variable and dependent
variable. In statistical modeling or mathematical modeling, the values of
dependent variables rely on the values of independent variables. The
dependent variable is the outcome variable being studied. The independent
variables are also referred to as regressors. The takeaway message here is
that scatter plots are used when we need to show the relationship between
two variables, and hence are sometimes referred to as correlation plots.
Program:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.dpi'] = 150
sns.set()
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
plt.scatter(x=df['sepal_length'], y=df['sepal_width'], c=df.species)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()
Bubble chart
A bubble plot is a manifestation of the scatter plot where each data point on
the graph is shown as a bubble. Each bubble can be illustrated with a
different color, size, and appearance.
A bubble chart is a type of data visualization that represents three dimensions
of data on a two-dimensional plot. It's essentially an extension of a scatter
plot, with an additional variable represented by the size of the bubbles
(circles) plotted on the chart.
Components of a Bubble Chart:
1. X-axis (Horizontal axis): Represents the first variable, which is typically
a numerical or categorical value.
2. Y-axis (Vertical axis): Represents the second variable, usually another
numerical value.
3. Bubbles (Circles): Each bubble represents a data point. The position of
the bubble is determined by the values of the x and y variables.
4. Bubble Size: The size (or area) of the bubble represents a third variable.
Larger bubbles indicate higher values, while smaller bubbles indicate
lower values.
When to Use a Bubble Chart:
• Comparing Three Variables: When you want to visualize the
relationship between three variables simultaneously.
• Emphasizing Differences: Useful for showing differences in magnitude,
where the size of the bubble adds another layer of information beyond
just the x and y coordinates.
• Clustering: To identify clusters or patterns where multiple bubbles are
closely related in terms of their x, y, and size variables.
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]
sizes = [100, 200, 300, 400, 500]  # Bubble sizes (the third variable)
# Create bubble chart: the s argument maps the third variable to the bubble area
plt.scatter(x, y, s=sizes, alpha=0.5)
plt.title("Simple Bubble Chart")
plt.show()
Area plot
An area plot (or area chart) is essentially a line chart in which the region between the line and the axis is filled in, emphasizing the magnitude of change over time.
When to Use:
• Trends Over Time: When you need to highlight changes in data over
time, such as sales, stock prices, or other time series data.
• Comparing Categories: When comparing different categories that sum to
a total, especially in stacked area plots.
Example Code for an Area Plot:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 6, 10]
# Create area plot
plt.fill_between(x, y, color="skyblue", alpha=0.4)
plt.plot(x, y, color="Slateblue", alpha=0.6)
# Add title and labels
plt.title("Simple Area Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
# Display the plot
plt.show()
Example Code for a Stacked Area Plot:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 6, 8, 10]
y2 = [2, 3, 4, 2, 5]
y3 = [1, 2, 5, 7, 8]
# Create stacked area plot
plt.stackplot(x, y1, y2, y3, labels=['Series 1', 'Series 2', 'Series 3'], alpha=0.6)
plt.legend(loc='upper left')
plt.show()
Key Differences:
• Area Plot: Focuses on a single variable or cumulative total over time.
• Stacked Area Plot: Shows multiple variables stacked on top of each other
to visualize cumulative effects and individual contributions
simultaneously.
Pie chart
A pie chart is a circular graph divided into slices, where each slice represents
a proportion of the whole. It's a popular way to visualize the relative sizes of
different categories in a dataset.
Key Features:
• Proportional Representation: Each slice represents a proportion of the
total. The size of the slice is proportional to the quantity it represents.
• Categories: The chart is used to display categorical data. Each category
is represented by a different slice of the pie.
• Simple and Intuitive: Pie charts are easy to understand at a glance,
making them suitable for conveying simple proportions.
When to Use a Pie Chart:
• Part-to-Whole Relationships: When you want to show how different
categories contribute to the total.
• Limited Categories: Best for datasets with a small number of categories
(typically less than 6-7). Too many slices can make the chart difficult to
read.
• Non-Hierarchical Data: When comparing non-hierarchical data where
categories are independent of each other.
Example Code for a Pie Chart in Python:
import matplotlib.pyplot as plt
# Sample data
labels = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [15, 30, 45, 10]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
# Create pie chart
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%',
startangle=140)
# Add title
plt.title("Simple Pie Chart")
# Display the plot
plt.show()
Table chart
A table chart presents data in rows and columns, showing exact values rather than a graphical approximation.
Key Features:
• Detailed Data Display: When you need to present exact numbers, such
as sales figures, test scores, or inventory counts.
• Comparisons Across Categories: Useful for comparing data across
different categories or dimensions.
• Multi-Dimensional Data: When data has multiple dimensions that need
to be presented simultaneously, such as comparing multiple attributes of
different products.
Key Considerations:
• Readability: Ensure that the table is well-organized and easy to read,
especially if there are many rows and columns.
• Alignment: Proper alignment of text and numbers within cells improves
readability.
• Data Accuracy: Since tables display precise values, ensure that all data is
accurate and formatted correctly.
Use Cases:
• Reports: Tables are often used in reports, research papers, and
presentations where exact values need to be conveyed.
• Financial Statements: Common in accounting and finance for displaying
income statements, balance sheets, and other financial data.
• Inventory Lists: Useful for tracking stock levels, product details, and
other inventory-related data.
Advantages of Table Charts:
• Precise Data Representation: Unlike graphs, tables present exact values,
making them ideal for detailed analysis.
• Multi-Dimensional Comparison: Tables allow you to compare multiple
dimensions or attributes of data simultaneously.
• Flexible Formatting: You can format tables to highlight specific data, add
totals, or create subtotals.
Example Code for a Table Chart:
import matplotlib.pyplot as plt
import numpy as np
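# The notes stop after the imports above; the lines below are a minimal sketch
# using Matplotlib's ax.table. The sample figures and labels are illustrative only.
data = np.array([[15, 30, 45, 10],
                 [20, 25, 35, 20]])
fig, ax = plt.subplots()
ax.axis('off')  # hide the axes so only the table is visible
table = ax.table(cellText=data.astype(str),
                 rowLabels=['2023', '2024'],
                 colLabels=['Q1', 'Q2', 'Q3', 'Q4'],
                 loc='center')
table.scale(1, 2)  # make the rows a little taller for readability
plt.title("Simple Table Chart")
plt.show()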
Polar chart
A polar chart, also known as a radar chart or spider chart, is a graphical
method used to display multivariate data in a circular format. It is
particularly useful for visualizing the relationships between multiple
variables on a single plot, with each axis representing a different variable.
Key Features:
• Circular Layout: The chart is circular, with each variable plotted along a
different axis radiating from the center.
• Axes: Each axis represents a different variable, and data points are plotted
along these axes.
• Connecting Lines: The data points are connected by lines, forming a
closed shape, often resembling a polygon.
• Comparisons: Polar charts are useful for comparing multiple datasets on
the same plot, allowing you to visualize differences between them.
When to Use a Polar Chart:
• Multidimensional Comparisons: When you need to compare multiple
variables across different categories.
• Visualizing Strengths and Weaknesses: Polar charts are great for
showing the strengths and weaknesses of different categories, making
them popular in performance analysis.
• Complex Data: When the data is too complex for a traditional bar or line
chart, and you need a more comprehensive view
Key Considerations:
• Readability: Polar charts can become cluttered and hard to read if there
are too many variables or datasets. Limiting the number of variables is
essential for clarity.
• Comparison: Ensure that comparisons between different datasets on the
same chart are visually distinguishable by using different colors or line
styles.
• Scaling: Consistent scaling across axes is necessary to avoid misleading
visualizations.
Use Cases:
• Performance Analysis: Commonly used to evaluate the performance of
individuals or teams across multiple dimensions (e.g., skills assessment,
product features).
• Market Research: Useful for comparing different products or services
based on various attributes.
• Risk Assessment: In finance, polar charts can help visualize and compare
different types of risks.
Advantages of Polar Charts:
• Multivariate Visualization: Excellent for visualizing relationships
between multiple variables at once.
• Pattern Recognition: The shape formed by the data points can reveal
patterns and outliers.
• Comparison of Multiple Entities: Polar charts make it easy to compare
multiple datasets on the same plot.
When Not to Use:
• Too Many Variables: Polar charts can become overwhelming and
difficult to interpret with too many variables or data points.
• Precise Data Representation: If precise values or detailed comparisons
are needed, other types of charts (like bar or line charts) may be more
appropriate.
Program:
import matplotlib.pyplot as plt
import numpy as np
# Number of variables we're plotting
categories = ['A', 'B', 'C', 'D', 'E']
values = [4, 3, 2, 5, 4]
# Repeat the first value to close the circle
values += values[:1]
# Compute the angle of each axis
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
# Create polar plot
fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
# Draw one line per variable and fill the area
ax.fill(angles, values, color='skyblue', alpha=0.4)
ax.plot(angles, values, color='blue', linewidth=2)
# Add category labels to the axes
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
# Add title
plt.title("Simple Polar Chart", size=15, color='black', y=1.1)
# Show the plot
plt.show()
Histogram
A histogram is a graphical representation of the distribution of numerical
data. It is used to visualize the frequency of data points within specified
ranges, known as "bins." Each bin represents an interval of data, and the
height of the bar corresponds to the number of data points that fall within
that interval.
Key Features of a Histogram:
1. Bins: The data is divided into intervals, or "bins." The choice of bin size
can affect the appearance and interpretation of the histogram.
2. Bars: The height of each bar represents the frequency or count of data
points within each bin.
3. Continuous Data: Histograms are typically used for continuous data,
showing how the data is distributed across the range of values.
import matplotlib.pyplot as plt
# Sample data
data = [12, 15, 21, 22, 25, 25, 26, 28, 30, 32, 35,
36, 38, 40, 42, 45, 48, 50]
# Create a histogram with six equal-width bins
plt.hist(data, bins=6, edgecolor='black')
plt.title("Simple Histogram")
plt.show()
# For comparison: categorical data is shown with a bar chart, not a histogram
categories = ['A', 'B', 'C', 'D', 'E']
values = [20, 34, 30, 35, 27]
plt.bar(categories, values)
plt.show()
Data transformation techniques
Common transformations applied during EDA include the following:
1. Log Transformation:
- Use when you encounter skewed data with a long tail on the right
(positive skew).
- Helps to stabilize variance and make the data more normally
distributed.
- Example:
import numpy as np
df['log_transformed'] = np.log(df['column'] + 1) # Adding 1 to handle
zero values
3. Box-Cox Transformation:
- More flexible than log or square root transformations. It can handle
both positive and negative skewness.
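- Example (a minimal sketch using SciPy; Box-Cox requires strictly positive input values):
from scipy import stats
df['boxcox_transformed'], fitted_lambda = stats.boxcox(df['column'])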
4. Min-Max Scaling:
- Scales data to a range of [0, 1]. Useful when you want to normalize
data without distorting differences in the range.
- Example:
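# A minimal sketch done directly in pandas; scikit-learn's MinMaxScaler gives the
# same result for a single column (df['column'] is the same placeholder used above)
df['scaled'] = (df['column'] - df['column'].min()) / (df['column'].max() - df['column'].min())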
6. Normalization:
- Typically used to scale features so that they have a unit norm (e.g.,
vector length of 1). Useful in models like K-NN.
- Example:
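# A minimal sketch using scikit-learn; the feature column names are placeholders
from sklearn.preprocessing import normalize
df[['feature1', 'feature2']] = normalize(df[['feature1', 'feature2']], norm='l2')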
7. Binning (Discretization):
- Converts continuous data into categorical bins. Useful for reducing the
impact of minor observation errors and simplifying the analysis.
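- Example (a sketch using pandas cut; the bin edges and labels are arbitrary):
import pandas as pd
df['binned'] = pd.cut(df['column'], bins=[0, 18, 40, 100], labels=['young', 'adult', 'senior'])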
8. Feature Engineering:
- Creating new features through transformations like polynomial
features, interaction terms, or domain-specific transformations can reveal
more insights during EDA.
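- Example (a hypothetical derived feature built from two existing columns):
df['price_per_unit'] = df['total_price'] / df['quantity']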
Data cleaning and loading the CSV file
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Note that for this analysis we need the mailbox module, which ships with the Python standard library, so it normally does not require a separate installation.
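A minimal sketch of the loading step, assuming the exported mailbox file is named mbox_export.mbox (the file name and the chosen headers are assumptions):
import mailbox
mbox = mailbox.mbox('mbox_export.mbox')
rows = [{'subject': msg['subject'], 'from': msg['from'], 'date': msg['date']} for msg in mbox]
dfs = pd.DataFrame(rows)
dfs.to_csv('mailbox.csv', index=False)
dfs = pd.read_csv('mailbox.csv')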
Remove Irrelevant Features: Drop columns that are not relevant to the
analysis (e.g., identifiers, redundant features).
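# Hypothetical example: drop identifier columns that add nothing to the analysis
df = df.drop(columns=['message_id', 'thread_id'])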
Convert Data Types: Ensure that columns have the correct data type
(e.g., converting strings to dates, integers to floats).
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
Check Data Ranges: Ensure that numerical data falls within the
expected range (e.g., age should be within human limits, scores should be
within 0-100).
Correct or Remove: Adjust or drop values that fall outside logical
ranges.
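# Hypothetical example: keep only rows whose score lies in the expected 0-100 range
df = df[(df['score'] >= 0) & (df['score'] <= 100)]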
9. Check for Structural Errors:
Identify and Correct: Structural errors include issues like
mislabeling, data entry errors, and incorrect formatting. For example, a
column meant to store numeric values might have non-numeric
characters.
Replace or Remove: Correct these errors by replacing them with
appropriate values or dropping erroneous rows.
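# Hypothetical example: fix a common structural error by stripping stray
# whitespace and unifying the case of a text column
df['category'] = df['category'].str.strip().str.lower()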
We can check the data type of each column as follows:
dfs.dtypes
Note that the date field is an object, so we need to convert it into a datetime value. We can do this by using the pandas to_datetime() method. See the following code:
dfs['date'] = dfs['date'].apply(lambda x: pd.to_datetime(x, errors='coerce', utc=True))
Let's move on to the next step, that is, removing NaN values from the fields.
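A minimal sketch of this step (the column names are assumptions):
dfs['subject'] = dfs['subject'].fillna('')  # replace missing subjects with an empty string
dfs = dfs.dropna(subset=['date'])           # drop rows that still have no usable date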
Descriptive Statistics
Mean
It is the sum of the observations divided by the total number of observations; in other words, the average (sum divided by count):
Mean = (x1 + x2 + ... + xn) / n
where,
• x = observations
• n = number of terms
import numpy as np
# Sample Data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print("Mean = ", mean)
Mode
It is the value that has the highest frequency in the given data set. The data set
may have no mode if the frequency of all data points is the same. Also, we can
have more than one mode if we encounter two or more data points having the
same frequency.
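A matching example using Python's statistics module (the sample data is illustrative):
import statistics
# Sample Data
arr = [2, 3, 3, 4, 5]
# Mode
mode = statistics.mode(arr)
print("Mode = ", mode)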
Median
It is the middle value of the data set. It splits the data into two halves. If the
number of elements in the data set is odd then the center element is the median
and if it is even then the median would be the average of two central elements.
import numpy as np
# sample Data
arr = [1, 2, 3, 4]
# Median
median = np.median(arr)
print("Median = ", median)
Measure of Variability
Measures of variability are also termed measures of dispersion as it helps to
gain insights about the dispersion or the spread of the observations at hand.
Some of the measures which are used to calculate the measures of dispersion in
the observations of the variables are as follows:
• Range
• Variance
• Standard deviation
Range
The range describes the difference between the largest and smallest data point in
our data set. The bigger the range, the more the spread of data and vice versa.
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
# Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)
# Difference Of Max and Min
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format( Maximum,
Minimum, Range))
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting the mean from each observation, squaring each of those differences, adding them all up, dividing by the number of terms, and finally taking the square root:
Standard deviation = sqrt( Σ(x - mu)² / N )
where,
• x = observation under consideration
• N = number of terms
• mu = mean
import statistics
# Sample data
arr = [1, 2, 3, 4, 5]
# Standard deviation (statistics.stdev computes the sample standard deviation,
# which divides by N - 1; use statistics.pstdev for the population version)
print("Std = ", (statistics.stdev(arr)))
Having preprocessed the dataset, let's do some sanity checking using descriptive statistics techniques.
We can implement this as shown here:
dfs.info()
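dfs.info() reports each column's data type and non-null count. For the numeric summary statistics themselves (count, mean, standard deviation, minimum, quartiles, and maximum), pandas also provides the following:
dfs.describe()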
Data refactoring
Data refactoring involves restructuring and optimizing data without changing its
external behavior. It’s a process commonly used to improve data quality,
maintainability, and performance, especially when dealing with large datasets or
evolving systems.
Concepts of Data Refactoring
1. Improving Data Quality: Ensuring the data is accurate, consistent, and
reliable. This can involve correcting errors, removing duplicates, or
ensuring standard formats.
2. Normalization: Organizing data to reduce redundancy and improve
integrity. This often involves structuring a database in such a way that
updates and deletions can be made more efficiently.
3. Data Schema Refactoring: Modifying the structure of the database (e.g.,
tables, fields) to improve performance, accommodate changes, or
enhance data access. Examples include renaming tables or fields, splitting
tables, or adding new indices.
4. Data Cleaning: The process of detecting and correcting (or removing)
corrupt or inaccurate records. It includes activities like deduplication,
correcting misspellings, and addressing inconsistencies.
5. Data Transformation: Converting data from one format or structure to
another. This can be as simple as changing data types or as complex as
migrating data from one database system to another.
6. Version Control for Data: Managing changes in data structures or
transformations using version control systems, enabling rollback to
previous states and collaboration across teams.
Techniques in Data Refactoring
1. Splitting Columns or Tables: If a column contains multiple pieces of
information, splitting it into multiple columns can improve clarity and
access. Similarly, if a table grows too large, it may be split into several
smaller tables.
2. Combining Data: Sometimes data that is stored in separate tables or
columns can be combined to reduce complexity. For example, data that’s
frequently accessed together can be stored together.
3. Changing Data Types: Altering the data type of a column to a more
suitable type for better storage efficiency or improved query performance
(e.g., changing a string to a date type).
4. Refactoring Queries: Optimizing queries to make data retrieval more
efficient. This can involve rewriting SQL queries to reduce complexity,
adding indices, or using caching mechanisms.
5. Data Aggregation: Summarizing detailed data to make it easier to
analyze or report on. For example, transforming transactional data into
summary statistics.
6. Denormalization: Sometimes, refactoring involves denormalization to
improve performance for specific queries, by introducing some
redundancy that reduces the need for complex joins.
Examples of Data Refactoring
• Renaming a Column: Renaming a poorly-named column to make its
purpose clearer, e.g., changing cust to customer_id.
• Splitting a Column: A column containing full names (e.g., "John Doe")
could be split into two columns: first_name and last_name (see the pandas
sketch after this list).
• Changing Data Types: Changing a varchar field storing dates into a date
type for easier comparison and query optimization.
• Adding an Index: Adding an index to a frequently searched column to
speed up query performance.
• Refactoring Queries: Optimizing a complex SQL query with multiple
joins into a more efficient version by restructuring the query logic.
• Schema Normalization: Breaking a table that stores repeated
information (like addresses) into two tables, one for unique addresses and
another that references them, reducing redundancy.
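For instance, the column-splitting example can be sketched in pandas as follows (the DataFrame and column names are assumptions):
import pandas as pd
df = pd.DataFrame({'full_name': ['John Doe', 'Jane Smith']})
# Split the full name on the first space into two new columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)
df = df.drop(columns=['full_name'])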
Tools for Data Refactoring
• Database Management Systems (DBMS): Tools like MySQL,
PostgreSQL, and Oracle provide features for schema changes, data
cleaning, and optimization.
• ETL Tools: Extract, Transform, Load (ETL) tools like Talend, Apache
NiFi, or Alteryx help in transforming and cleaning data.
• Version Control Systems: Tools like Git can be used to version control
data scripts and schema changes.
• SQL Refactoring Tools: Tools like dbForge SQL Complete or Redgate
SQL Prompt assist with refactoring SQL queries.
-----------------------------------------------------------------------------
Starting a data refactoring process requires careful planning, analysis, and
execution to ensure that the data integrity is maintained and that the changes
bring meaningful improvements. Below is a step-by-step guide on how to start
refactoring data:
1. Understand the Current State
• Audit the Data: Analyze the current data structure, including tables,
columns, data types, relationships, and indexes. Understand how data is
being used, stored, and accessed.
• Identify Problems: Look for issues like redundant data, performance
bottlenecks, inconsistent formats, and complex queries. This will help
you identify areas that need refactoring.
• Gather Requirements: Work with stakeholders (e.g., developers,
analysts, and business users) to understand the current pain points and
future needs. This ensures that the refactoring aligns with business
objectives.
2. Define the Scope
• Prioritize Changes: Not all data issues need to be addressed at once.
Prioritize based on factors like impact, risk, and ease of implementation.
Focus on areas that provide the most value.
• Set Objectives: Clearly define the goals of refactoring. Whether it’s
improving query performance, simplifying data models, or reducing
redundancy, having clear objectives will guide your efforts.
3. Plan the Refactoring
• Create a Roadmap: Develop a step-by-step plan that outlines the
changes to be made, the sequence of those changes, and the timeline for
implementation. Include considerations for data backups, testing, and
rollbacks.
• Document Changes: Before making any modifications, document the
existing data structure and the planned changes. This is essential for
maintaining clarity and for future reference.
4. Backup Data
• Backup Databases: Always create a full backup of your data before
starting any refactoring process. This ensures that you can restore the data
to its original state if something goes wrong.
5. Test on a Small Scale
• Use a Sandbox Environment: Before applying changes to the
production environment, test the refactoring in a sandbox or development
environment. This allows you to identify any potential issues without
risking live data.
• Perform Data Integrity Checks: Ensure that data transformations
maintain accuracy. Compare the output of the refactored data with the
original data to verify consistency.
6. Implement the Changes
• Make Incremental Changes: Avoid making all changes at once.
Implement refactoring in small, manageable increments. This reduces the
risk of errors and makes it easier to troubleshoot issues.
• Monitor Performance: After making changes, monitor the system's
performance to ensure that the refactoring achieves the desired
improvements. This can include running queries, checking application
logs, and analyzing metrics.
7. Validate the Changes
• Run Test Cases: Develop and execute test cases to verify that the
refactored data works correctly with your applications and processes.
Include both functional tests (e.g., are queries returning correct data?) and
performance tests (e.g., are queries faster?).
• User Acceptance Testing (UAT): Engage end-users to validate the
changes from a functional and usability perspective. Ensure that all
stakeholders are satisfied with the refactoring outcomes.
8. Document the Changes
• Update Documentation: Document the new data structure, changes
made, and any new processes. This is crucial for maintaining clarity for
future maintenance and for onboarding new team members.
• Create Migration Scripts: If applicable, write migration scripts to
automate the refactoring process. This ensures consistency and helps with
scaling the changes to different environments.
9. Deploy to Production
• Schedule Deployment: Choose a deployment window that minimizes
disruption, such as during low-traffic periods. Ensure that the team is
prepared for any potential issues that may arise.
• Monitor After Deployment: Post-deployment, closely monitor the
system for any issues, and be prepared to roll back changes if necessary.
10. Iterate and Improve
• Gather Feedback: After deploying the changes, gather feedback from
users and stakeholders to identify any remaining issues or areas for
further improvement.
• Refine Further: Data refactoring is often an ongoing process. Use the
insights gained to refine and optimize further. Stay open to continuous
improvement as your data needs evolve.
Example of Starting Refactoring:
Let’s say you have a large customer database that suffers from redundancy and
performance issues. Here’s how you might start:
1. Audit: Analyze the database and identify redundant columns (e.g.,
address information stored in multiple places).
2. Scope: Prioritize eliminating redundancy in the customer address data.
3. Plan: Decide to normalize the address data by creating a separate Address
table that the Customer table references.
4. Backup: Create a backup of your database.
5. Test: Implement this change in a sandbox environment and test queries to
ensure performance improvement.
6. Implement: Gradually roll out the changes to the production
environment.
7. Validate: Ensure that the application retrieves the correct address data
and that performance improves.
8. Deploy and Document: Deploy the changes and update documentation
to reflect the new database structure.
Dropping columns
4. Dropping Columns with Missing Data: If you want to drop columns with
a certain percentage of missing data:
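A minimal sketch, assuming df is an existing pandas DataFrame and using a 50% cut-off (an arbitrary choice):
threshold = 0.5  # drop any column in which more than 50% of the values are missing
df = df.loc[:, df.isnull().mean() <= threshold]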
Best Practices
• Understand the Impact: Always consider the impact of dropping a
column on your model and analysis. If you're unsure, you can experiment
with and without the column to see how it affects results.
• Documentation: Keep track of which columns you drop and why, as this
can help you retrace your steps or explain your methodology later.
• Iterative Process: Dropping columns is often part of an iterative process.
You might remove columns during initial cleaning, then revisit this step
during feature selection or model tuning.
-----------------------------------------------
Data Analysis
To analyze email data, focusing on the number of emails sent, the time
of day they were sent, the average number of emails per day and hour,
and the most frequently used words, here’s a general approach:
1. Data Collection
• Emails Data: Extract data like timestamps, subject lines, and email
bodies from your email system. This might involve exporting emails to a
CSV file or using an email API.
• Attributes Needed:
o Timestamp (Date and Time)
o Email Body/Subject Line (Text content)
o Email Address (optional) (To filter out specific
senders/recipients)
2. Data Preparation
• Clean the Data:
o Convert timestamps to a standardized format.
o Remove unwanted data (e.g., email signatures, automatic replies).
• Time of Day Analysis:
o Extract the hour from the timestamp to analyze email frequency by
time of day.
3. Analysis
• Number of Emails by Time of Day:
o Group emails by hour of the day (0-23) to see the distribution of
emails across the day.
• Average Emails per Day and Hour:
o Calculate the average number of emails sent per day.
o Calculate the average number of emails sent per hour by dividing
the total emails by the number of days and hours.
• Most Frequently Used Words:
o Tokenize the email bodies and subject lines.
o Remove common stopwords (e.g., "the," "and," "is").
o Count the frequency of each word.
4. Visualization
• Time of Day Distribution: Create a line plot or bar chart to show the
distribution of emails across the hours of the day.
• Average Emails per Day and Hour: Use bar charts to illustrate the
averages.
• Word Cloud: Visualize the most frequently used words in a word cloud
for a quick insight into the common themes.
5. Tools & Libraries
• Python Libraries:
o Pandas: For data manipulation and analysis.
o Matplotlib/Seaborn: For plotting graphs.
o NLTK or spaCy: For text processing (word tokenization and
removing stopwords).
o WordCloud: For generating word clouds.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
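# The notes end with the imports above; the lines below are a minimal sketch of the
# analysis. It assumes the cleaned email data is in a CSV file named 'emails.csv'
# with 'date' and 'body' columns (the file name and column names are assumptions).
nltk.download('stopwords', quiet=True)  # fetch the stopword list once

df = pd.read_csv('emails.csv')
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)
df = df.dropna(subset=['date'])

# Number of emails by hour of the day (0-23)
emails_per_hour = df.groupby(df['date'].dt.hour).size()
emails_per_hour.plot(kind='bar')
plt.xlabel('Hour of day')
plt.ylabel('Number of emails')
plt.show()

# Average emails per day and per hour
emails_per_day = df.groupby(df['date'].dt.date).size()
print("Average emails per day:", emails_per_day.mean())
print("Average emails per hour:", emails_per_day.mean() / 24)

# Most frequently used words, excluding common stopwords
stop_words = set(stopwords.words('english'))
words = [w for text in df['body'].dropna()
         for w in str(text).lower().split()
         if w.isalpha() and w not in stop_words]
word_counts = Counter(words)
print(word_counts.most_common(10))

# Word cloud of the most frequent words
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(words))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()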
______