unit-2
SYLLABUS
Technical Requirements - Line chart - Bar charts - Scatter plot - Pie chart - Table chart - Polar chart - Data transformation techniques - Data cleaning - Loading the CSV file - Converting NaN values - Applying descriptive statistics - Data refactoring - Dropping columns - Data analysis - Number of emails - Time of day - Average emails per day and hour - Most frequently used words.
Line chart
• A line chart is used to illustrate the relationship between two or more
continuous variables.
• A line plot is used to represent quantitative values over a continuous
interval or time period. It is generally used to depict trends in how the
data has changed over time.
Program:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5, 6]
y = [1, 5, 3, 5, 7, 8]
plt.plot(x, y)
plt.show()
• Steps involved
Let's look at the process of creating the line chart:
1. Load and prepare the dataset.
2. Import the matplotlib library. It can be done with this command:
import matplotlib.pyplot as plt
3. Plot the graph:
plt.plot(df)
4. Display it on the screen:
plt.show()
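For example, a minimal end-to-end sketch (the CSV file name sales.csv and its column layout are assumptions, not part of the original notes):
import pandas as pd
import matplotlib.pyplot as plt
# Load and prepare the dataset
df = pd.read_csv('sales.csv', index_col='date', parse_dates=True)
# Plot the graph and display it
plt.plot(df)
plt.show()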
Bar chart
A Bar chart or bar graph is a chart or graph that presents categorical data with
rectangular bars with heights or lengths proportional to the values that they
represent. A bar plot is a way of representing data where the length of the bars
represents the magnitude/size of the feature/variable.
• Program (vertical bar chart)
import random
import calendar
import matplotlib.pyplot as plt
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
figure, axis = plt.subplots()
plt.xticks(months, calendar.month_name[1:13], rotation=20)
plot = axis.bar(months, sold_quantity)
plt.show()
Program (horizontal bar chart)
import random
import calendar
import matplotlib.pyplot as plt
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
figure, axis = plt.subplots()
plt.yticks(months, calendar.month_name[1:13], rotation=20)
plot = axis.barh(months, sold_quantity)
plt.show()
Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and
scatter
diagrams. They use a Cartesian coordinates system to display values of
typically two
variables for a set of data.
Scatter plots can be constructed in the following two situations:
1. When one continuous variable is dependent on another variable, which is
under the control of the observer
2. When both continuous variables are independent
There are two important concepts—independent variable and dependent
variable. In statistical modeling or mathematical modeling, the values of
dependent variables rely on the values of independent variables. The
dependent variable is the outcome variable being studied. The independent
variables are also referred to as regressors. The takeaway message here is
that scatter plots are used when we need to show the relationship between
two variables, and hence are sometimes referred to as correlation plots.
Program:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.dpi'] = 150
sns.set()
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
plt.scatter(x=df['sepal_length'], y=df['sepal_width'], c=df.species)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()
Bubble chart
A bubble plot is a manifestation of the scatter plot where each data point on
the graph is shown as a bubble. Each bubble can be illustrated with a
different color, size, and appearance.
A bubble chart is a type of data visualization that represents three dimensions
of data on a two-dimensional plot. It's essentially an extension of a scatter
plot, with an additional variable represented by the size of the bubbles
(circles) plotted on the chart.
Components of a Bubble Chart:
1. X-axis (Horizontal axis): Represents the first variable, which is typically
a numerical or categorical value.
2. Y-axis (Vertical axis): Represents the second variable, usually another
numerical value.
3. Bubbles (Circles): Each bubble represents a data point. The position of
the bubble is determined by the values of the x and y variables.
4. Bubble Size: The size (or area) of the bubble represents a third variable.
Larger bubbles indicate higher values, while smaller bubbles indicate
lower values.
When to Use a Bubble Chart:
• Comparing Three Variables: When you want to visualize the
relationship between three variables simultaneously.
• Emphasizing Differences: Useful for showing differences in magnitude,
where the size of the bubble adds another layer of information beyond
just the x and y coordinates.
• Clustering: To identify clusters or patterns where multiple bubbles are
closely related in terms of their x, y, and size variables.
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]
sizes = [100, 200, 300, 400, 500]  # Bubble sizes (the third variable)
# Create bubble chart: the s argument maps the third variable to the bubble area
plt.scatter(x, y, s=sizes, alpha=0.5)
plt.title("Simple Bubble Chart")
plt.show()
Area plot
An area plot (or area chart) is essentially a line chart in which the region between the line and the axis is filled in, emphasizing the magnitude of change over time.
When to Use:
• Trends Over Time: When you need to highlight changes in data over
time, such as sales, stock prices, or other time series data.
• Comparing Categories: When comparing different categories that sum to
a total, especially in stacked area plots.
Example Code for an Area Plot:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 6, 10]
# Create area plot
plt.fill_between(x, y, color="skyblue", alpha=0.4)
plt.plot(x, y, color="Slateblue", alpha=0.6)
# Add title and labels
plt.title("Simple Area Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
# Display the plot
plt.show()
Example Code for a Stacked Area Plot:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 6, 8, 10]
y2 = [2, 3, 4, 2, 5]
y3 = [1, 2, 5, 7, 8]
# Create stacked area plot
plt.stackplot(x, y1, y2, y3, labels=['Series 1', 'Series 2', 'Series 3'], alpha=0.6)
plt.legend(loc='upper left')
plt.show()
Key Differences:
• Area Plot: Focuses on a single variable or cumulative total over time.
• Stacked Area Plot: Shows multiple variables stacked on top of each other
to visualize cumulative effects and individual contributions
simultaneously.
Pie chart
A pie chart is a circular graph divided into slices, where each slice represents
a proportion of the whole. It's a popular way to visualize the relative sizes of
different categories in a dataset.
Key Features:
• Proportional Representation: Each slice represents a proportion of the
total. The size of the slice is proportional to the quantity it represents.
• Categories: The chart is used to display categorical data. Each category
is represented by a different slice of the pie.
• Simple and Intuitive: Pie charts are easy to understand at a glance,
making them suitable for conveying simple proportions.
When to Use a Pie Chart:
• Part-to-Whole Relationships: When you want to show how different
categories contribute to the total.
• Limited Categories: Best for datasets with a small number of categories
(typically less than 6-7). Too many slices can make the chart difficult to
read.
• Non-Hierarchical Data: When comparing non-hierarchical data where
categories are independent of each other.
Example Code for a Pie Chart in Python:
import matplotlib.pyplot as plt
# Sample data
labels = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [15, 30, 45, 10]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
# Create pie chart
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%',
startangle=140)
# Add title
plt.title("Simple Pie Chart")
# Display the plot
plt.show()
Table chart
A table chart presents data in rows and columns, showing exact values rather than a graphical approximation.
Key Features:
• Detailed Data Display: When you need to present exact numbers, such
as sales figures, test scores, or inventory counts.
• Comparisons Across Categories: Useful for comparing data across
different categories or dimensions.
• Multi-Dimensional Data: When data has multiple dimensions that need
to be presented simultaneously, such as comparing multiple attributes of
different products.
Key Considerations:
• Readability: Ensure that the table is well-organized and easy to read,
especially if there are many rows and columns.
• Alignment: Proper alignment of text and numbers within cells improves
readability.
• Data Accuracy: Since tables display precise values, ensure that all data is
accurate and formatted correctly.
Use Cases:
• Reports: Tables are often used in reports, research papers, and
presentations where exact values need to be conveyed.
• Financial Statements: Common in accounting and finance for displaying
income statements, balance sheets, and other financial data.
• Inventory Lists: Useful for tracking stock levels, product details, and
other inventory-related data.
Advantages of Table Charts:
• Precise Data Representation: Unlike graphs, tables present exact values,
making them ideal for detailed analysis.
• Multi-Dimensional Comparison: Tables allow you to compare multiple
dimensions or attributes of data simultaneously.
• Flexible Formatting: You can format tables to highlight specific data, add
totals, or create subtotals.
Example Code for a Table Chart:
import matplotlib.pyplot as plt
import numpy as np
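# The notes stop after the imports above; the lines below are a minimal sketch
# using Matplotlib's ax.table. The sample figures and labels are illustrative only.
data = np.array([[15, 30, 45, 10],
                 [20, 25, 35, 20]])
fig, ax = plt.subplots()
ax.axis('off')  # hide the axes so only the table is visible
table = ax.table(cellText=data.astype(str),
                 rowLabels=['2023', '2024'],
                 colLabels=['Q1', 'Q2', 'Q3', 'Q4'],
                 loc='center')
table.scale(1, 2)  # make the rows a little taller for readability
plt.title("Simple Table Chart")
plt.show()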
Polar chart
A polar chart, also known as a radar chart or spider chart, is a graphical
method used to display multivariate data in a circular format. It is
particularly useful for visualizing the relationships between multiple
variables on a single plot, with each axis representing a different variable.
Key Features:
• Circular Layout: The chart is circular, with each variable plotted along a
different axis radiating from the center.
• Axes: Each axis represents a different variable, and data points are plotted
along these axes.
• Connecting Lines: The data points are connected by lines, forming a
closed shape, often resembling a polygon.
• Comparisons: Polar charts are useful for comparing multiple datasets on
the same plot, allowing you to visualize differences between them.
When to Use a Polar Chart:
• Multidimensional Comparisons: When you need to compare multiple
variables across different categories.
• Visualizing Strengths and Weaknesses: Polar charts are great for
showing the strengths and weaknesses of different categories, making
them popular in performance analysis.
• Complex Data: When the data is too complex for a traditional bar or line
chart, and you need a more comprehensive view
Key Considerations:
• Readability: Polar charts can become cluttered and hard to read if there
are too many variables or datasets. Limiting the number of variables is
essential for clarity.
• Comparison: Ensure that comparisons between different datasets on the
same chart are visually distinguishable by using different colors or line
styles.
• Scaling: Consistent scaling across axes is necessary to avoid misleading
visualizations.
Use Cases:
• Performance Analysis: Commonly used to evaluate the performance of
individuals or teams across multiple dimensions (e.g., skills assessment,
product features).
• Market Research: Useful for comparing different products or services
based on various attributes.
• Risk Assessment: In finance, polar charts can help visualize and compare
different types of risks.
Advantages of Polar Charts:
• Multivariate Visualization: Excellent for visualizing relationships
between multiple variables at once.
• Pattern Recognition: The shape formed by the data points can reveal
patterns and outliers.
• Comparison of Multiple Entities: Polar charts make it easy to compare
multiple datasets on the same plot.
When Not to Use:
• Too Many Variables: Polar charts can become overwhelming and
difficult to interpret with too many variables or data points.
• Precise Data Representation: If precise values or detailed comparisons
are needed, other types of charts (like bar or line charts) may be more
appropriate.
Program:
import matplotlib.pyplot as plt
import numpy as np
# Number of variables we're plotting
categories = ['A', 'B', 'C', 'D', 'E']
values = [4, 3, 2, 5, 4]
# Repeat the first value to close the circle
values += values[:1]
# Compute the angle of each axis
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
# Create polar plot
fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
# Draw one line per variable and fill the area
ax.fill(angles, values, color='skyblue', alpha=0.4)
ax.plot(angles, values, color='blue', linewidth=2)
# Add category labels to the axes
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
# Add title
plt.title("Simple Polar Chart", size=15, color='black', y=1.1)
# Show the plot
plt.show()
Histogram
A histogram is a graphical representation of the distribution of numerical
data. It is used to visualize the frequency of data points within specified
ranges, known as "bins." Each bin represents an interval of data, and the
height of the bar corresponds to the number of data points that fall within
that interval.
Key Features of a Histogram:
1. Bins: The data is divided into intervals, or "bins." The choice of bin size
can affect the appearance and interpretation of the histogram.
2. Bars: The height of each bar represents the frequency or count of data
points within each bin.
3. Continuous Data: Histograms are typically used for continuous data,
showing how the data is distributed across the range of values.
import matplotlib.pyplot as plt
# Sample data
data = [12, 15, 21, 22, 25, 25, 26, 28, 30, 32, 35,
36, 38, 40, 42, 45, 48, 50]
# Create a histogram with six equal-width bins
plt.hist(data, bins=6, edgecolor='black')
plt.title("Simple Histogram")
plt.show()
# For comparison: categorical data is shown with a bar chart, not a histogram
categories = ['A', 'B', 'C', 'D', 'E']
values = [20, 34, 30, 35, 27]
plt.bar(categories, values)
plt.show()
Data transformation techniques
Common transformations applied during EDA include the following:
1. Log Transformation:
- Use when you encounter skewed data with a long tail on the right
(positive skew).
- Helps to stabilize variance and make the data more normally
distributed.
- Example:
import numpy as np
df['log_transformed'] = np.log(df['column'] + 1) # Adding 1 to handle
zero values
3. Box-Cox Transformation:
- More flexible than log or square root transformations. It can handle
both positive and negative skewness.
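- Example (a minimal sketch using SciPy; Box-Cox requires strictly positive input values):
from scipy import stats
df['boxcox_transformed'], fitted_lambda = stats.boxcox(df['column'])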
4. Min-Max Scaling:
- Scales data to a range of [0, 1]. Useful when you want to normalize
data without distorting differences in the range.
- Example:
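# A minimal sketch done directly in pandas; scikit-learn's MinMaxScaler gives the
# same result for a single column (df['column'] is the same placeholder used above)
df['scaled'] = (df['column'] - df['column'].min()) / (df['column'].max() - df['column'].min())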
6. Normalization:
- Typically used to scale features so that they have a unit norm (e.g.,
vector length of 1). Useful in models like K-NN.
- Example:
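# A minimal sketch using scikit-learn; the feature column names are placeholders
from sklearn.preprocessing import normalize
df[['feature1', 'feature2']] = normalize(df[['feature1', 'feature2']], norm='l2')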
7. Binning (Discretization):
- Converts continuous data into categorical bins. Useful for reducing the
impact of minor observation errors and simplifying the analysis.
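- Example (a sketch using pandas cut; the bin edges and labels are arbitrary):
import pandas as pd
df['binned'] = pd.cut(df['column'], bins=[0, 18, 40, 100], labels=['young', 'adult', 'senior'])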
8. Feature Engineering:
- Creating new features through transformations like polynomial
features, interaction terms, or domain-specific transformations can reveal
more insights during EDA.
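- Example (a hypothetical derived feature built from two existing columns):
df['price_per_unit'] = df['total_price'] / df['quantity']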
Data cleaning and loading the CSV file
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Note that for this analysis we need the mailbox module, which ships with the Python standard library, so it normally does not require a separate installation.
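A minimal sketch of the loading step, assuming the exported mailbox file is named mbox_export.mbox (the file name and the chosen headers are assumptions):
import mailbox
mbox = mailbox.mbox('mbox_export.mbox')
rows = [{'subject': msg['subject'], 'from': msg['from'], 'date': msg['date']} for msg in mbox]
dfs = pd.DataFrame(rows)
dfs.to_csv('mailbox.csv', index=False)
dfs = pd.read_csv('mailbox.csv')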
Remove Irrelevant Features: Drop columns that are not relevant to the
analysis (e.g., identifiers, redundant features).
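# Hypothetical example: drop identifier columns that add nothing to the analysis
df = df.drop(columns=['message_id', 'thread_id'])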
Convert Data Types: Ensure that columns have the correct data type
(e.g., converting strings to dates, integers to floats).
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
Check Data Ranges: Ensure that numerical data falls within the
expected range (e.g., age should be within human limits, scores should be
within 0-100).
Correct or Remove: Adjust or drop values that fall outside logical
ranges.
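# Hypothetical example: keep only rows whose score lies in the expected 0-100 range
df = df[(df['score'] >= 0) & (df['score'] <= 100)]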
9. Check for Structural Errors:
Identify and Correct: Structural errors include issues like
mislabeling, data entry errors, and incorrect formatting. For example, a
column meant to store numeric values might have non-numeric
characters.
Replace or Remove: Correct these errors by replacing them with
appropriate values or dropping erroneous rows.
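# Hypothetical example: fix a common structural error by stripping stray
# whitespace and unifying the case of a text column
df['category'] = df['category'].str.strip().str.lower()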
We can check the data type of each column as follows:
dfs.dtypes
Note that the date field is an object, so we need to convert it into a datetime value. We can do this by using the pandas to_datetime() method. See the following code:
dfs['date'] = dfs['date'].apply(lambda x: pd.to_datetime(x, errors='coerce', utc=True))
Let's move on to the next step, that is, removing NaN values from the fields.
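A minimal sketch of this step (the column names are assumptions):
dfs['subject'] = dfs['subject'].fillna('')  # replace missing subjects with an empty string
dfs = dfs.dropna(subset=['date'])           # drop rows that still have no usable date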
Descriptive Statistics
Mean
It is the sum of the observations divided by the total number of observations; in other words, the average (sum divided by count):
Mean = (x1 + x2 + ... + xn) / n
where,
• x = observations
• n = number of terms
import numpy as np
# Sample Data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print("Mean = ", mean)
Mode
It is the value that has the highest frequency in the given data set. The data set
may have no mode if the frequency of all data points is the same. Also, we can
have more than one mode if we encounter two or more data points having the
same frequency.
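A matching example using Python's statistics module (the sample data is illustrative):
import statistics
# Sample Data
arr = [2, 3, 3, 4, 5]
# Mode
mode = statistics.mode(arr)
print("Mode = ", mode)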
Median
It is the middle value of the data set. It splits the data into two halves. If the
number of elements in the data set is odd then the center element is the median
and if it is even then the median would be the average of two central elements.
import numpy as np
# sample Data
arr = [1, 2, 3, 4]
# Median
median = np.median(arr)
print("Median = ", median)
Measure of Variability
Measures of variability are also termed measures of dispersion as it helps to
gain insights about the dispersion or the spread of the observations at hand.
Some of the measures which are used to calculate the measures of dispersion in
the observations of the variables are as follows:
• Range
• Variance
• Standard deviation
Range
The range describes the difference between the largest and smallest data point in
our data set. The bigger the range, the more the spread of data and vice versa.
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
# Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)
# Difference Of Max and Min
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format( Maximum,
Minimum, Range))
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting the mean from each observation, squaring each of those differences, adding them all up, dividing by the number of terms, and finally taking the square root:
Standard deviation = sqrt( Σ(x - mu)² / N )
where,
• x = observation under consideration
• N = number of terms
• mu = mean
import statistics
# Sample data
arr = [1, 2, 3, 4, 5]
# Standard deviation (statistics.stdev computes the sample standard deviation,
# which divides by N - 1; use statistics.pstdev for the population version)
print("Std = ", (statistics.stdev(arr)))
Having preprocessed the dataset, let's do some sanity checking using descriptive statistics techniques.
We can implement this as shown here:
dfs.info()
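dfs.info() reports each column's data type and non-null count. For the numeric summary statistics themselves (count, mean, standard deviation, minimum, quartiles, and maximum), pandas also provides the following:
dfs.describe()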
Data refactoring
Data refactoring involves restructuring and optimizing data without changing its
external behavior. It’s a process commonly used to improve data quality,
maintainability, and performance, especially when dealing with large datasets or
evolving systems.
Concepts of Data Refactoring
1. Improving Data Quality: Ensuring the data is accurate, consistent, and
reliable. This can involve correcting errors, removing duplicates, or
ensuring standard formats.
2. Normalization: Organizing data to reduce redundancy and improve
integrity. This often involves structuring a database in such a way that
updates and deletions can be made more efficiently.
3. Data Schema Refactoring: Modifying the structure of the database (e.g.,
tables, fields) to improve performance, accommodate changes, or
enhance data access. Examples include renaming tables or fields, splitting
tables, or adding new indices.
4. Data Cleaning: The process of detecting and correcting (or removing)
corrupt or inaccurate records. It includes activities like deduplication,
correcting misspellings, and addressing inconsistencies.
5. Data Transformation: Converting data from one format or structure to
another. This can be as simple as changing data types or as complex as
migrating data from one database system to another.
6. Version Control for Data: Managing changes in data structures or
transformations using version control systems, enabling rollback to
previous states and collaboration across teams.
Techniques in Data Refactoring
1. Splitting Columns or Tables: If a column contains multiple pieces of
information, splitting it into multiple columns can improve clarity and
access. Similarly, if a table grows too large, it may be split into several
smaller tables.
2. Combining Data: Sometimes data that is stored in separate tables or
columns can be combined to reduce complexity. For example, data that’s
frequently accessed together can be stored together.
3. Changing Data Types: Altering the data type of a column to a more
suitable type for better storage efficiency or improved query performance
(e.g., changing a string to a date type).
4. Refactoring Queries: Optimizing queries to make data retrieval more
efficient. This can involve rewriting SQL queries to reduce complexity,
adding indices, or using caching mechanisms.
5. Data Aggregation: Summarizing detailed data to make it easier to
analyze or report on. For example, transforming transactional data into
summary statistics.
6. Denormalization: Sometimes, refactoring involves denormalization to
improve performance for specific queries, by introducing some
redundancy that reduces the need for complex joins.
Examples of Data Refactoring
• Renaming a Column: Renaming a poorly-named column to make its
purpose clearer, e.g., changing cust to customer_id.
• Splitting a Column: A column containing full names (e.g., "John Doe")
could be split into two columns: first_name and last_name (see the pandas
sketch after this list).
• Changing Data Types: Changing a varchar field storing dates into a date
type for easier comparison and query optimization.
• Adding an Index: Adding an index to a frequently searched column to
speed up query performance.
• Refactoring Queries: Optimizing a complex SQL query with multiple
joins into a more efficient version by restructuring the query logic.
• Schema Normalization: Breaking a table that stores repeated
information (like addresses) into two tables, one for unique addresses and
another that references them, reducing redundancy.
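For instance, the column-splitting example can be sketched in pandas as follows (the DataFrame and column names are assumptions):
import pandas as pd
df = pd.DataFrame({'full_name': ['John Doe', 'Jane Smith']})
# Split the full name on the first space into two new columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)
df = df.drop(columns=['full_name'])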
Tools for Data Refactoring
• Database Management Systems (DBMS): Tools like MySQL,
PostgreSQL, and Oracle provide features for schema changes, data
cleaning, and optimization.
• ETL Tools: Extract, Transform, Load (ETL) tools like Talend, Apache
NiFi, or Alteryx help in transforming and cleaning data.
• Version Control Systems: Tools like Git can be used to version control
data scripts and schema changes.
• SQL Refactoring Tools: Tools like dbForge SQL Complete or Redgate
SQL Prompt assist with refactoring SQL queries.
-----------------------------------------------------------------------------
Starting a data refactoring process requires careful planning, analysis, and
execution to ensure that the data integrity is maintained and that the changes
bring meaningful improvements. Below is a step-by-step guide on how to start
refactoring data:
1. Understand the Current State
• Audit the Data: Analyze the current data structure, including tables,
columns, data types, relationships, and indexes. Understand how data is
being used, stored, and accessed.
• Identify Problems: Look for issues like redundant data, performance
bottlenecks, inconsistent formats, and complex queries. This will help
you identify areas that need refactoring.
• Gather Requirements: Work with stakeholders (e.g., developers,
analysts, and business users) to understand the current pain points and
future needs. This ensures that the refactoring aligns with business
objectives.
2. Define the Scope
• Prioritize Changes: Not all data issues need to be addressed at once.
Prioritize based on factors like impact, risk, and ease of implementation.
Focus on areas that provide the most value.
• Set Objectives: Clearly define the goals of refactoring. Whether it’s
improving query performance, simplifying data models, or reducing
redundancy, having clear objectives will guide your efforts.
3. Plan the Refactoring
• Create a Roadmap: Develop a step-by-step plan that outlines the
changes to be made, the sequence of those changes, and the timeline for
implementation. Include considerations for data backups, testing, and
rollbacks.
• Document Changes: Before making any modifications, document the
existing data structure and the planned changes. This is essential for
maintaining clarity and for future reference.
4. Backup Data
• Backup Databases: Always create a full backup of your data before
starting any refactoring process. This ensures that you can restore the data
to its original state if something goes wrong.
5. Test on a Small Scale
• Use a Sandbox Environment: Before applying changes to the
production environment, test the refactoring in a sandbox or development
environment. This allows you to identify any potential issues without
risking live data.
• Perform Data Integrity Checks: Ensure that data transformations
maintain accuracy. Compare the output of the refactored data with the
original data to verify consistency.
6. Implement the Changes
• Make Incremental Changes: Avoid making all changes at once.
Implement refactoring in small, manageable increments. This reduces the
risk of errors and makes it easier to troubleshoot issues.
• Monitor Performance: After making changes, monitor the system's
performance to ensure that the refactoring achieves the desired
improvements. This can include running queries, checking application
logs, and analyzing metrics.
7. Validate the Changes
• Run Test Cases: Develop and execute test cases to verify that the
refactored data works correctly with your applications and processes.
Include both functional tests (e.g., are queries returning correct data?) and
performance tests (e.g., are queries faster?).
• User Acceptance Testing (UAT): Engage end-users to validate the
changes from a functional and usability perspective. Ensure that all
stakeholders are satisfied with the refactoring outcomes.
8. Document the Changes
• Update Documentation: Document the new data structure, changes
made, and any new processes. This is crucial for maintaining clarity for
future maintenance and for onboarding new team members.
• Create Migration Scripts: If applicable, write migration scripts to
automate the refactoring process. This ensures consistency and helps with
scaling the changes to different environments.
9. Deploy to Production
• Schedule Deployment: Choose a deployment window that minimizes
disruption, such as during low-traffic periods. Ensure that the team is
prepared for any potential issues that may arise.
• Monitor After Deployment: Post-deployment, closely monitor the
system for any issues, and be prepared to roll back changes if necessary.
10. Iterate and Improve
• Gather Feedback: After deploying the changes, gather feedback from
users and stakeholders to identify any remaining issues or areas for
further improvement.
• Refine Further: Data refactoring is often an ongoing process. Use the
insights gained to refine and optimize further. Stay open to continuous
improvement as your data needs evolve.
Example of Starting Refactoring:
Let’s say you have a large customer database that suffers from redundancy and
performance issues. Here’s how you might start:
1. Audit: Analyze the database and identify redundant columns (e.g.,
address information stored in multiple places).
2. Scope: Prioritize eliminating redundancy in the customer address data.
3. Plan: Decide to normalize the address data by creating a separate Address
table that the Customer table references.
4. Backup: Create a backup of your database.
5. Test: Implement this change in a sandbox environment and test queries to
ensure performance improvement.
6. Implement: Gradually roll out the changes to the production
environment.
7. Validate: Ensure that the application retrieves the correct address data
and that performance improves.
8. Deploy and Document: Deploy the changes and update documentation
to reflect the new database structure.
Dropping columns
4. Dropping Columns with Missing Data: If you want to drop columns with
a certain percentage of missing data:
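A minimal sketch, assuming df is an existing pandas DataFrame and using a 50% cut-off (an arbitrary choice):
threshold = 0.5  # drop any column in which more than 50% of the values are missing
df = df.loc[:, df.isnull().mean() <= threshold]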
Best Practices
• Understand the Impact: Always consider the impact of dropping a
column on your model and analysis. If you're unsure, you can experiment
with and without the column to see how it affects results.
• Documentation: Keep track of which columns you drop and why, as this
can help you retrace your steps or explain your methodology later.
• Iterative Process: Dropping columns is often part of an iterative process.
You might remove columns during initial cleaning, then revisit this step
during feature selection or model tuning.
-----------------------------------------------
Data Analysis
To analyze email data, focusing on the number of emails sent, the time
of day they were sent, the average number of emails per day and hour,
and the most frequently used words, here’s a general approach:
1. Data Collection
• Emails Data: Extract data like timestamps, subject lines, and email
bodies from your email system. This might involve exporting emails to a
CSV file or using an email API.
• Attributes Needed:
o Timestamp (Date and Time)
o Email Body/Subject Line (Text content)
o Email Address (optional) (To filter out specific
senders/recipients)
2. Data Preparation
• Clean the Data:
o Convert timestamps to a standardized format.
o Remove unwanted data (e.g., email signatures, automatic replies).
• Time of Day Analysis:
o Extract the hour from the timestamp to analyze email frequency by
time of day.
3. Analysis
• Number of Emails by Time of Day:
o Group emails by hour of the day (0-23) to see the distribution of
emails across the day.
• Average Emails per Day and Hour:
o Calculate the average number of emails sent per day.
o Calculate the average number of emails sent per hour by dividing
the total emails by the number of days and hours.
• Most Frequently Used Words:
o Tokenize the email bodies and subject lines.
o Remove common stopwords (e.g., "the," "and," "is").
o Count the frequency of each word.
4. Visualization
• Time of Day Distribution: Create a line plot or bar chart to show the
distribution of emails across the hours of the day.
• Average Emails per Day and Hour: Use bar charts to illustrate the
averages.
• Word Cloud: Visualize the most frequently used words in a word cloud
for a quick insight into the common themes.
5. Tools & Libraries
• Python Libraries:
o Pandas: For data manipulation and analysis.
o Matplotlib/Seaborn: For plotting graphs.
o NLTK or spaCy: For text processing (word tokenization and
removing stopwords).
o WordCloud: For generating word clouds.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
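# The notes end with the imports above; the lines below are a minimal sketch of the
# analysis. It assumes the cleaned email data is in a CSV file named 'emails.csv'
# with 'date' and 'body' columns (the file name and column names are assumptions).
nltk.download('stopwords', quiet=True)  # fetch the stopword list once

df = pd.read_csv('emails.csv')
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)
df = df.dropna(subset=['date'])

# Number of emails by hour of the day (0-23)
emails_per_hour = df.groupby(df['date'].dt.hour).size()
emails_per_hour.plot(kind='bar')
plt.xlabel('Hour of day')
plt.ylabel('Number of emails')
plt.show()

# Average emails per day and per hour
emails_per_day = df.groupby(df['date'].dt.date).size()
print("Average emails per day:", emails_per_day.mean())
print("Average emails per hour:", emails_per_day.mean() / 24)

# Most frequently used words, excluding common stopwords
stop_words = set(stopwords.words('english'))
words = [w for text in df['body'].dropna()
         for w in str(text).lower().split()
         if w.isalpha() and w not in stop_words]
word_counts = Counter(words)
print(word_counts.most_common(10))

# Word cloud of the most frequent words
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(words))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()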
______