
AD3301-Data Exploration and Visualization

List of Experiments

S.No   Date   Experiment Name                                                              Page No   Marks

1.  Installation of the Data Analysis and Visualization tool Power BI
2.  Implementation of exploratory data analysis (EDA) on an email data set
3.  Implementation of NumPy arrays, Pandas data frames, and basic plots using Matplotlib
4.  Implementation of various variable and row filters in R and various plot features in R
5.  Implementation of Time Series Analysis and various visualization techniques
6.  Implementation of Data Analysis and representation on a Map using various Map data sets
7.  Implementation of cartographic visualization for multiple datasets
8.  Implementation of EDA on Wine Quality Data Set
9.  Case study on a data set: apply the various EDA and visualization techniques and present an analysis report
    Total

Staff Incharge

Ex.No: 01    Installation of the Data Analysis and Visualization tool Power BI

AIM

ALGORITHM

PROGRAM
1. Download and Install Power BI Desktop
1. Download Power BI Desktop:
o Go to the Power BI Desktop download page.
o Click on “Download Free” to start the download.

2. Run the Installer:

o Locate the downloaded installer file (usually in your
Downloads folder).
o Double-click the file to start the installation process.
o Follow the on-screen instructions to complete the installation.
You may need to accept the license agreement and choose an
installation location.

3. Launch Power BI Desktop:
o After installation, open Power BI Desktop from the Start menu
or desktop shortcut.

2. Initial Configuration
1. Configure Initial Settings:
o When Power BI Desktop opens for the first time, you may be
prompted to sign in. You can sign in with a Microsoft account,
but this is optional for local use.
o Choose any initial settings based on your preferences or leave
them as defaults.

3. Import Data
1. Load Data:
o Click on the “Home” tab in the ribbon.
o Click “Get Data” to open the data source options.
o Choose the type of data source (e.g., Excel, CSV, SQL Server)
and follow the prompts to connect to your data source.

2. Transform and Clean Data:


o After importing the data, you may need to perform data
cleaning or transformation.
o Use the “Transform Data” button to open Power Query Editor, where
you can clean and transform the data (e.g., remove duplicates, filter
rows); an equivalent pandas sketch is shown after this list for comparison.
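
For comparison only, the same two cleaning steps can be sketched in pandas before a file is imported into Power BI. This is an illustrative sketch, not part of Power BI itself; the file name sales.csv and the Amount column are assumptions.

import pandas as pd

# Hypothetical sample file; replace with your own data source
df = pd.read_csv('sales.csv')

# Remove duplicate rows (the "Remove Duplicates" step in Power Query)
df = df.drop_duplicates()

# Keep only rows with a positive Amount (a row-filter step in Power Query)
df = df[df['Amount'] > 0]

# Save the cleaned file, then load it in Power BI via Get Data > Text/CSV
df.to_csv('sales_clean.csv', index=False)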

4. Create Visualizations
1. Add Visualizations:
o Once your data is loaded and cleaned, you can start creating
visualizations.
o In the “Report” view, drag and drop fields from your data onto
the report canvas.
o Choose from various visualization types (e.g., bar charts, line
charts, pie charts) from the “Visualizations” pane.

2. Customize Visualizations:
o Click on a visualization to configure its properties (e.g., axis
titles, colors).
o Use the “Format” pane to adjust visual settings.

3. Arrange Visualizations:
o Arrange and resize visualizations on the canvas to create a
meaningful report layout.

5. Publish and Share Reports


1. Publish to Power BI Service (Optional):
o Click the “Publish” button on the “Home” tab.
o Sign in with your Microsoft account if required.
o Choose a workspace in Power BI Service where you want to
publish the report.

2. Share Reports:
o After publishing, you can share the report with others by
providing them access through the Power BI Service.
o Use sharing options available in Power BI Service to control
who can view or interact with the report.

RESULT

Ex.No: 02    Implementation of exploratory data analysis (EDA) on an email data set

AIM

ALGORITHM

PROGRAM
import pandas as pd

# Load the email dataset


df = pd.read_csv('emails.csv')

# Display the first few rows of the dataset


print(df.head())

# Display basic information about the dataset


print(df.info())

# Check for missing values


print(df.isnull().sum())

# Convert the 'Date' column to datetime format


df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Fill missing values in the 'Sender' and 'Subject' columns if any


df['Sender'].fillna('Unknown', inplace=True)
df['Subject'].fillna('No Subject', inplace=True)

# Check for any remaining missing values


print(df.isnull().sum())

import matplotlib.pyplot as plt

# Plot the number of emails over time


plt.figure(figsize=(10, 6))
df['Date'].groupby(df['Date'].dt.to_period('M')).count().plot(kind='bar')
plt.title('Number of Emails Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Emails')
plt.show()

# Top 10 senders
top_senders = df['Sender'].value_counts().head(10)

# Plotting the top senders
plt.figure(figsize=(10, 6))
top_senders.plot(kind='bar', color='skyblue')
plt.title('Top 10 Senders')
plt.xlabel('Sender')
plt.ylabel('Number of Emails')
plt.show()

from wordcloud import WordCloud

# Generate a word cloud from the subject lines


wordcloud = WordCloud(width=800, height=400,
background_color='white').generate(' '.join(df['Subject']))

# Display the word cloud


plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Email Subjects')
plt.show()

# Extract the hour from the 'Date' column


df['Hour'] = df['Date'].dt.hour

# Plot the number of emails by hour


plt.figure(figsize=(10, 6))
df['Hour'].value_counts().sort_index().plot(kind='bar', color='orange')
plt.title('Emails Received by Hour of the Day')
plt.xlabel('Hour')
plt.ylabel('Number of Emails')
plt.show()

# Calculate the length of each email body


df['Body_Length'] = df['Body'].apply(lambda x: len(str(x).split()))

# Plot the distribution of email lengths


plt.figure(figsize=(10, 6))
df['Body_Length'].plot(kind='hist', bins=30, color='green')
plt.title('Distribution of Email Lengths')
plt.xlabel('Number of Words')
plt.ylabel('Number of Emails')
plt.show()

OUTPUT

RESULT

Ex.No: 03    Implementation of NumPy arrays, Pandas data frames, and basic plots using Matplotlib

AIM

ALGORITHM

PROGRAM
1. NumPy Arrays
pip install numpy
import numpy as np

# 1D Array
arr_1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr_1d)

# 2D Array
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", arr_2d)

# Array with specific shape and values


zeros = np.zeros((2, 3))
ones = np.ones((3, 2))
identity = np.eye(3)

print("Zeros Array:\n", zeros)


print("Ones Array:\n", ones)

print("Identity Matrix:\n", identity)
# Element-wise operations
arr_add = arr_1d + 10
arr_mult = arr_2d * 2

print("Array after addition:\n", arr_add)


print("Array after multiplication:\n", arr_mult)

# Matrix operations
matrix_mult = np.dot(arr_2d, np.array([[1, 0], [0, 1], [1, 0]]))
print("Matrix Multiplication:\n", matrix_mult)

2. Pandas DataFrames
pip install pandas
import pandas as pd

# Creating DataFrame from a dictionary


data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print("DataFrame:\n", df)

# Creating DataFrame from a CSV file


# df = pd.read_csv('file.csv')  # Uncomment and use if you have a CSV file

# Basic statistics
print("Statistics:\n", df.describe(include='all'))

# Accessing specific columns


print("Names:\n", df['Name'])

# Filtering rows
filtered_df = df[df['Age'] > 28]
print("Filtered DataFrame:\n", filtered_df)

# Adding a new column


df['Country'] = 'USA'
print("DataFrame with New Column:\n", df)

3. Basic Plots Using Matplotlib
pip install matplotlib
import matplotlib.pyplot as plt

# Line Plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Sine Wave', color='blue', linestyle='-')
plt.title('Line Plot')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.legend()
plt.grid(True)
plt.show()

# Scatter Plot
x = np.random.rand(50)
y = np.random.rand(50)

plt.figure(figsize=(10, 6))
plt.scatter(x, y, color='red', alpha=0.5)
plt.title('Scatter Plot')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.grid(True)
plt.show()

# Histogram
data = np.random.randn(1000)

plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Bar Plot
categories = ['A', 'B', 'C']

values = [10, 20, 15]

plt.figure(figsize=(10, 6))
plt.bar(categories, values, color='green')
plt.title('Bar Plot')
plt.xlabel('Category')
plt.ylabel('Value')
plt.grid(True)
plt.show()

OUTPUT

RESULT

Ex.No: 04    Implementation of various variable and row filters in R and various plot features in R

AIM

ALGORITHM

PROGRAM
Step 1: Data Loading and Initial Inspection
# Load a sample dataset (e.g., the 'mtcars' dataset)
data(mtcars)

# View the first few rows of the dataset


head(mtcars)

# Get a summary of the dataset


summary(mtcars)
Step 2: Variable Filtering
# Select specific columns (e.g., 'mpg', 'hp', 'wt')
selected_vars <- mtcars[, c('mpg', 'hp', 'wt')]

# View the filtered dataset


head(selected_vars)
Step 3: Row Filtering
# Filter rows where mpg (miles per gallon) is greater than 20
filtered_rows <- mtcars[mtcars$mpg > 20, ]

# Filter rows where 'cyl' (cylinders) equals 4


filtered_rows_cyl <- mtcars[mtcars$cyl == 4, ]
# View the filtered dataset
head(filtered_rows)
head(filtered_rows_cyl)
Step 4: Data Cleaning
# Remove rows with missing values (if any)
cleaned_data <- na.omit(mtcars)

# Rename columns for clarity


# mtcars has 11 columns, so 11 names are needed (vs = Engine_Shape)
colnames(cleaned_data) <- c('Miles_Per_Gallon', 'Cylinders', 'Displacement',
                            'Horsepower', 'Rear_Axle_Ratio', 'Weight',
                            'Quarter_Mile_Time', 'Engine_Shape', 'Transmission',
                            'Gears', 'Carburetors')

# View the cleaned dataset


head(cleaned_data)
Step 5: Data Visualization
# Load necessary libraries
library(ggplot2)

# Scatter plot of Horsepower vs Miles Per Gallon
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(color = 'blue') +
labs(title = "Horsepower vs Miles Per Gallon", x = "Horsepower", y =
"Miles Per Gallon")

# Histogram of Miles Per Gallon


ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = 'green', color = 'black') +
labs(title = "Distribution of Miles Per Gallon", x = "Miles Per Gallon", y =
"Frequency")

# Boxplot of Miles Per Gallon by Cylinder


ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = 'orange', color = 'black') +
labs(title = "Miles Per Gallon by Cylinder", x = "Number of Cylinders", y
= "Miles Per Gallon")

OUTPUT

RESULT

Ex.No: 05    Implementation of Time Series Analysis and various visualization techniques

AIM

ALGORITHM

PROGRAM
pip install pandas numpy matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a time series dataset


date_rng = pd.date_range(start='2023-01-01', end='2023-12-31',
freq='D')

data = pd.DataFrame(date_rng, columns=['date'])
data['value'] = (np.sin(np.linspace(0, 10, len(date_rng)))
                 + np.random.normal(0, 0.5, len(date_rng)))

# Set the date column as the index


data.set_index('date', inplace=True)

print(data.head())

plt.figure(figsize=(12, 6))
plt.plot(data.index, data['value'], label='Value')
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()

# Compute rolling statistics


data['rolling_mean'] = data['value'].rolling(window=30).mean()
data['rolling_std'] = data['value'].rolling(window=30).std()

plt.figure(figsize=(12, 6))
plt.plot(data.index, data['value'], label='Value', alpha=0.5)
plt.plot(data.index, data['rolling_mean'], label='Rolling Mean', color='red')
plt.plot(data.index, data['rolling_std'], label='Rolling Std Dev',
color='orange')
plt.title('Rolling Statistics')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()

from pandas.plotting import autocorrelation_plot

plt.figure(figsize=(12, 6))
autocorrelation_plot(data['value'])
plt.title('Autocorrelation Plot')
plt.grid(True)
plt.show()

# Extract month from the index

data['month'] = data.index.month

# Plot by month
plt.figure(figsize=(12, 6))
for month in range(1, 13):
    monthly_data = data[data['month'] == month]
    plt.plot(monthly_data.index, monthly_data['value'], label=f'Month {month}')
plt.title('Seasonal Plot by Month')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
OUTPUT

RESULT

Ex.No: 06    Implementation of Data Analysis and representation on a Map using various Map data sets

AIM

ALGORITHM

PROGRAM
pip install folium geopandas plotly pandas

import folium
import geopandas as gpd
import pandas as pd

# Step 2: Load Geographic Data


# Example dataset: Load a sample GeoDataFrame
# Note: the bundled gpd.datasets module was removed in GeoPandas 1.0;
# on newer versions, read a downloaded Natural Earth file instead.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Load some data for analysis (e.g., population or any other metric)
# For this example, we'll use a simple dataset:
data = pd.DataFrame({
'iso_a3': ['USA', 'CAN', 'MEX'],
'population': [331002651, 37742154, 126190788]
})

# Merge the data with the GeoDataFrame


world = world.merge(data, how='left', on='iso_a3')

# Step 3: Create a map with Folium


m = folium.Map(location=[20, 0], zoom_start=2)

# Step 4: Add Mouse Rollover and User Interaction

# Add a Choropleth map
folium.Choropleth(
geo_data=world,
name="choropleth",
data=world,
columns=["iso_a3", "population"],
key_on="feature.properties.iso_a3",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Population by Country",
).add_to(m)

# Add GeoJson for mouse rollover effect


folium.GeoJson(
world,
name="Population Info",
tooltip=folium.GeoJsonTooltip(
fields=["name", "population"],
aliases=["Country", "Population"],
localize=True,
),
).add_to(m)

# Step 5: Save and Display the Map


m.save("interactive_map.html")

# Display map in Jupyter Notebook (if using Jupyter)


m

OUTPUT

RESULT

Ex.No: 07    Implementation of cartographic visualization for multiple datasets

AIM

ALGORITHM

PROGRAM
pip install geopandas folium matplotlib

import geopandas as gpd


import folium
import matplotlib.pyplot as plt

# Load the world map dataset


# Note: gpd.datasets was removed in GeoPandas 1.0; on newer versions,
# read a downloaded Natural Earth file instead.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Display the first few rows


print(world.head())

# Plot the world map


world.plot(figsize=(15, 10), edgecolor='black')
plt.title('World Map')
plt.show()

# Create a base map centered around the world


m = folium.Map(location=[20, 0], zoom_start=2)
# Add countries as GeoJson to the map
folium.GeoJson(world).add_to(m)

# Save the map as an HTML file


m.save("world_map.html")

# Example: Adding population data to the world map


# The Natural Earth dataset already includes a 'pop_est' (population estimate) column
world['pop_est'] = world['pop_est'].fillna(0)  # Fill NA values with 0

# Plot using matplotlib with color based on population


world.plot(column='pop_est', figsize=(15, 10), legend=True,
cmap='OrRd', edgecolor='black')

plt.title('World Population Map')
plt.show()

# Display in Jupyter Notebook (if using Jupyter)


m # m is the folium map object
OUTPUT

RESULT

Ex.No: 08    Implementation of EDA on Wine Quality Data Set

AIM

ALGORITHM

PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the wine quality dataset


url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
wine_data = pd.read_csv(url, sep=';')

# Display the first few rows of the dataset


print(wine_data.head())

# Basic information about the dataset


print(wine_data.info())
# Summary statistics
print(wine_data.describe())
# Check for missing values
print(wine_data.isnull().sum())

# Distribution of wine quality


plt.figure(figsize=(8, 6))
sns.countplot(x='quality', data=wine_data, palette='viridis')
plt.title('Distribution of Wine Quality Ratings')
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
corr = wine_data.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Wine Quality Data')
plt.show()

# Pairplot of selected features


sns.pairplot(wine_data, vars=['fixed acidity', 'volatile acidity', 'citric acid',
'quality'], hue='quality', palette='viridis')
plt.show()

# Alcohol content vs quality


plt.figure(figsize=(10, 6))
sns.boxplot(x='quality', y='alcohol', data=wine_data, palette='viridis')

plt.title('Alcohol Content vs Wine Quality')
plt.show()

# Fixed acidity vs quality


plt.figure(figsize=(10, 6))
sns.boxplot(x='quality', y='fixed acidity', data=wine_data,
palette='viridis')
plt.title('Fixed Acidity vs Wine Quality')
plt.show()

# Example of handling imbalance (if necessary)


from sklearn.utils import resample

# Separate the minority and majority classes


df_majority = wine_data[wine_data.quality >= 6]
df_minority = wine_data[wine_data.quality < 6]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,                # sample with replacement
                                 n_samples=len(df_majority),  # to match majority class
                                 random_state=42)             # reproducible results

# Combine majority class with upsampled minority class


wine_data_balanced = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts


print(wine_data_balanced['quality'].value_counts())

# 3D Scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(wine_data['sulphates'], wine_data['alcohol'],
wine_data['quality'], c=wine_data['quality'], cmap='viridis')
ax.set_xlabel('Sulphates')
ax.set_ylabel('Alcohol')
ax.set_zlabel('Quality')
plt.show()

OUTPUT

RESULT

Ex.No: 09    Case study on a data set: apply the various EDA and visualization techniques and present an analysis report

AIM

ALGORITHM

PROGRAM
The Titanic dataset is a classic dataset often used to demonstrate various
data analysis techniques. This dataset provides information on the
passengers aboard the Titanic, including whether they survived the
disaster, their age, gender, class, fare, and other details. This case study
will apply various EDA and visualization techniques to uncover insights
from this data and present an analysis report.
Dataset Overview
• Dataset: Titanic passenger data.
• Features:
o Survived: Survival (0 = No, 1 = Yes)
o Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
o Sex: Sex
o Age: Age in years
o SibSp: Number of siblings/spouses aboard the Titanic
o Parch: Number of parents/children aboard the Titanic
o Fare: Passenger fare
o Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Step 1: Data Loading and Inspection


import pandas as pd
# Load the Titanic dataset
titanic_data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Display the first few rows of the dataset
print(titanic_data.head())

# Basic information about the dataset


print(titanic_data.info())

# Summary statistics
print(titanic_data.describe())
Observation:
• The dataset has 891 entries and several features, some of which have missing values (e.g., Age, Cabin, and Embarked).
Step 2: Data Cleaning
# Handle missing values by filling or dropping
titanic_data['Age'].fillna(titanic_data['Age'].median(), inplace=True)
titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0],
inplace=True)
titanic_data.drop(columns=['Cabin'], inplace=True)  # Drop Cabin due to too many missing values

# Verify missing values


print(titanic_data.isnull().sum())
Observation:
• Missing values have been handled: Age filled with the median, Embarked filled with the mode, and Cabin dropped.
Step 3: Univariate Analysis
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the distribution of survival


sns.countplot(x='Survived', data=titanic_data, palette='pastel')
plt.title('Survival Distribution')
plt.show()

# Distribution of Age
sns.histplot(titanic_data['Age'], bins=30, kde=True, color='blue')
plt.title('Age Distribution of Passengers')
plt.show()
Observation:
• The majority of passengers did not survive.
• Most passengers were between 20 and 40 years old.
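
These observations can be checked numerically with a short sketch (it assumes the same titanic_data DataFrame loaded in Step 1):

# Share of passengers who survived vs. did not survive
print(titanic_data['Survived'].value_counts(normalize=True))

# Summary statistics for passenger age
print(titanic_data['Age'].describe())
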
Step 4: Bivariate Analysis
# Survival rate by class
sns.barplot(x='Pclass', y='Survived', data=titanic_data, palette='viridis')

plt.title('Survival Rate by Passenger Class')
plt.show()

# Survival rate by gender


sns.barplot(x='Sex', y='Survived', data=titanic_data, palette='viridis')
plt.title('Survival Rate by Gender')
plt.show()

# Age distribution by survival


sns.boxplot(x='Survived', y='Age', data=titanic_data, palette='pastel')
plt.title('Age Distribution by Survival')
plt.show()
Observation:
• Passengers in 1st class had a higher survival rate.
• Females had a significantly higher survival rate than males.
• Younger passengers were more likely to survive.
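
A quick numeric check of these rates, assuming the same titanic_data DataFrame:

# Mean survival rate by ticket class
print(titanic_data.groupby('Pclass')['Survived'].mean())

# Mean survival rate by gender
print(titanic_data.groupby('Sex')['Survived'].mean())

# Mean age of non-survivors (0) and survivors (1)
print(titanic_data.groupby('Survived')['Age'].mean())
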
Step 5: Multivariate Analysis
# Survival rate by class and gender
sns.catplot(x='Pclass', hue='Sex', col='Survived', kind='count',
data=titanic_data, palette='pastel')
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
corr = titanic_data.corr(numeric_only=True)  # correlations over numeric columns only
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
Observation:
• The survival rate is highest for females in 1st class.
• There is a strong negative correlation between Pclass and Survived, indicating that higher-class passengers were more likely to survive.
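
The class and gender interaction can also be tabulated; a small sketch using the same DataFrame:

# Survival rate broken down by class and gender
pivot = titanic_data.pivot_table(values='Survived', index='Pclass',
                                 columns='Sex', aggfunc='mean')
print(pivot)
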
Insights
1. Class and Survival: Passengers in higher classes had better
survival chances, with 1st class being the safest.
2. Gender and Survival: Females had a much higher survival rate,
especially in the 1st and 2nd classes.
3. Age Factor: Younger passengers had a better survival rate, with
children particularly having a higher chance of survival.
4. Embarkation Point: Passengers who embarked at Cherbourg (C) had a higher survival rate compared to other embarkation points (see the check after this list).
5. Multivariate Interaction: The combination of gender, class, and
age played a crucial role in determining the survival of a passenger.
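
Insight 4 (embarkation) is not computed by the program above; a minimal sketch to verify it, assuming the same titanic_data DataFrame:

# Mean survival rate by port of embarkation (C, Q, S)
print(titanic_data.groupby('Embarked')['Survived'].mean())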

OUTPUT

RESULT
