FDS Record

EX. No:1 Download and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
Date:

Aim:
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages using the pip command.

Basic Tools:
a. Python
b. NumPy
c. SciPy
d. Matplotlib
e. Pandas
f. Statsmodels
g. Seaborn
h. Plotly
i. Bokeh

1. Python:
Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python web site, https://www.python.org/, and may be freely distributed.
Installation Commands:
Step 1: Download the Python installer binaries. Open the official Python website in your web browser: https://www.python.org/downloads/
Step 2: Run the executable installer. Once the installer is downloaded, run the Python installer.
Step 3: Add Python to the environment variables.
Step 4: Verify the Python installation.
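Once installed, a quick check from inside Python (a minimal sketch; the exact version string depends on your installation):

import sys
print(sys.version)  # prints the installed Python version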

2. NumPy:
NumPy stands for Numerical Python and it is a core scientific computing library in Python. It provides efficient multi-dimensional array objects and various operations to work with these array objects.
The package installer for Python (pip) is used to install these packages.

Installation Commands:
1. Command Prompt: py -m pip --version
2. Command Prompt: py -m pip install numpy

3. SciPy:
SciPy is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific Python. It provides more utility functions for optimization, statistics and signal processing. Like NumPy, SciPy is open source, so we can use it freely.
Installation Commands:
Command Prompt: py -m pip install scipy
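SciPy is not demonstrated in the exercise programs below, so here is a minimal sanity-check sketch of its stats module on toy data (the numbers are illustrative only):

from scipy import stats
print(stats.describe([1, 2, 3, 4, 5]))  # count, min/max, mean, variance, skewness, kurtosis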

4. Matplotlib:
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible.
Create publication quality plots.
Make interactive figures that can zoom, pan, update.
Customize visual style and layout.
Export to many file formats.
Embed in JupyterLab and Graphical User Interfaces.
Use a rich array of third-party packages built on Matplotlib.
Installation Commands:
Command Prompt: py -m pip install matplotlib

5. Pandas:
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
Installation Commands:

Command Prompt: py -m pip install pandas

6. Jupyter:
The Jupyter Notebook is the original web application for creating and sharing
computational documents. It offers a simple, streamlined, document-centric experience.
Installation Commands:

Command Prompt: py -m pip install jupyter
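After installation, the Notebook can be launched from the command prompt; a usage note (assuming the install above succeeded; running jupyter notebook directly also works if the Scripts directory is on PATH):

Command Prompt: py -m jupyter notebook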

7. Statsmodels:
Statsmodels is a Python package that allows users to explore data, estimate statistical
models, and perform statistical tests.
Installation Commands:

Command Prompt: py -m pip install statsmodels
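A quick sanity check that the installation works — a toy ordinary least squares fit on made-up numbers (illustrative only, not part of the exercises):

import numpy as np
import statsmodels.api as sm
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x
X = sm.add_constant(x)  # add an intercept term
model = sm.OLS(y, X).fit()
print(model.params)  # estimated intercept and slope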

8. Seaborn:
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics.
Installation Commands:

Command Prompt: py -m pip install seaborn
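A minimal check that Seaborn works (the data is illustrative; Matplotlib must also be installed):

import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5)  # histogram of a small sample
plt.show()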

9. Plotly:
Plotly is a technical computing company headquartered in Montreal, Quebec, that
develops online data analytics and visualization tools. Plotly provides online graphing,
analytics, and statistics tools for individuals and collaboration, as well as scientific graphing
libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST.
Installation Commands:
Command Prompt: py -m pip install plotly
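A minimal check using Plotly Express (the figure opens in a browser or notebook; the data is illustrative):

import plotly.express as px
fig = px.scatter(x=[1, 2, 3, 4], y=[10, 11, 12, 13])
fig.show()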

10. Bokeh:
Bokeh is a Python library for creating interactive visualizations for modern web browsers. It
helps you build beautiful graphics, ranging from simple plots to complex dashboards with
streaming datasets.
Installation Commands:

Command Prompt: py -m pip install bokeh
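A minimal check for Bokeh (show() writes and opens an HTML file; the data is illustrative):

from bokeh.plotting import figure, show
p = figure(title="Bokeh line example")
p.line([1, 2, 3, 4], [4, 7, 2, 5], line_width=2)
show(p)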

EXERCISE PROGRAM
1.Basic Array Program:(Numpy Python)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

2.Basic Array Program:(Pandas Python)


import pandas as pd
arr = pd.array([1, 2, 3, 4, 5])
print(arr)
3.Draw a line in a diagram with the x-axis ranging from 0 to 6 and the y-axis ranging from 0 to 250, using Matplotlib
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()

4.A Simple Scatter Plot Program:(Matplotlib Python)

import matplotlib.pyplot as plt


import numpy as np
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()

Result:
Thus the NumPy, SciPy, Matplotlib, Statsmodels and Pandas packages were downloaded, installed and explored using the pip command, and the basic programs were executed.
OUTPUT:

1.Basic Array Program:(Numpy Python)


[1 2 3 4 5]

2.Basic Array Program:(Pandas Python)


<PandasArray>
[1, 2, 3, 4, 5]
Length: 5, dtype: int64

3.Draw a line in a diagram from position (0,0) to position (6,250) using Matplotlib

4.A Simple Scatter Plot Program:(Matplotlib Python)


EX. No:2 Working with Numpy Arrays
Date:

Aim :
To write python programs to create and access numpy arrays.

Algorithm:
1. Start the Program.
2. Import Numpy Library.
3. Perform operation with Numpy Array.
4. Display the output.
5. Stop the Program.
Numpy:
NumPy stands for Numerical Python. It is a Python library used for working with arrays. In Python, lists are used for the purpose of arrays, but they are slow to process. The NumPy array is a powerful N-dimensional array object used in linear algebra, Fourier transforms, and random number generation, and it is much faster than traditional Python lists.

NUMPY CONCEPTS:
Create NumPy ndarray Object
numpy.zeros()
numpy.ones()
numpy.empty()
numpy.linspace()
numpy.arange()
numpy.array()
Check Number of Dimensions
Dimensions in Arrays
Higher Dimensional Arrays
NumPy Array Indexing
NumPy Array Slicing
Two-Dimensional Arrays
NumPy Array Shape
Reshaping of Arrays
Aggregations
• Mean
• Median
• Mode
• Standard deviation

Joining NumPy Arrays


Splitting NumPy Arrays
How to Create Your Own ufunc
Sorting Arrays
Comparisons
Masks
Boolean Logic

NumPy concepts:

1. Create a NumPy ndarray Object

• Definition: A ndarray is a multidimensional container of items of the same type. It is the core object in NumPy.
• Syntax: np.array()
• Example:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

2. numpy.zeros(): Create an array filled with zeros.

• Definition: This function creates a new array of given shape and type, filled with
zeros.
• Syntax: np.zeros(shape, dtype=float)
• Example:

arr = np.zeros((2, 3)) # 2x3 array of zeros


print(arr)

3. numpy.ones(): Create an array filled with ones.

• Definition: This function creates a new array of given shape and type, filled with
ones.
• Syntax: np.ones(shape, dtype=float)
• Example:

arr = np.ones((2, 3)) # 2x3 array of ones


print(arr)

4. numpy.empty(): Create an uninitialized array.

• Definition: This function returns an array without initializing its values.


• Syntax: np.empty(shape, dtype=float)
• Example:

arr = np.empty((2, 3)) # 2x3 empty array (uninitialized)


print(arr)

5. numpy.linspace(): Create an array with evenly spaced values.

• Definition: This function returns an array of evenly spaced numbers over a specified
range.
• Syntax: np.linspace(start, stop, num=50, endpoint=True)
• Example:

arr = np.linspace(0, 10, 5) # 5 values from 0 to 10


print(arr)

6. numpy.arange(): Create an array with a range of values.

• Definition: This function returns an array with values spaced regularly within a given
interval.
• Syntax: np.arange([start, ]stop, [step, ])
• Example:

arr = np.arange(0, 10, 2) # Values from 0 to 10 with step 2

print(arr)

7. numpy.array(): Create an array from a list or other sequences.

• Definition: This function is used to create arrays from lists, tuples, or other
sequences.
• Syntax: np.array(object, dtype=None)
• Example:

arr = np.array([1, 2, 3, 4])


print(arr)

8. Check Number of Dimensions

• Definition: You can use the .ndim attribute to check the number of dimensions of an
array.
• Syntax: array.ndim
• Example:

print(arr.ndim) # Number of dimensions of the array

9. Dimensions in Arrays

• Definition: NumPy arrays can be 1D, 2D, 3D, or more. The number of dimensions is
the rank of the array.
• Example:

arr_2d = np.array([[1, 2], [3, 4]])


print(arr_2d.ndim) # Output will be 2 for 2D array

10. Higher Dimensional Arrays

• Definition: You can create arrays with more than two dimensions, such as 3D or 4D
arrays.
• Example:

arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(arr_3d.ndim) # Output will be 3 for 3D array

11. NumPy Array Indexing

• Definition: NumPy arrays support indexing, similar to Python lists, allowing access
to individual elements.
• Syntax: array[index]
• Example:

arr = np.array([10, 20, 30, 40])

print(arr[2]) # Access the 3rd element


12. NumPy Array Slicing

• Definition: NumPy arrays support slicing to extract a portion of the array.


• Syntax: array[start:end:step]
• Example:

arr = np.array([1, 2, 3, 4, 5])

print(arr[1:4]) # Slice elements from index 1 to 3

13. Two Dimensional Arrays

• Definition: Arrays with two dimensions, like a matrix or a table.


• Example:

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr_2d)

14. NumPy Array Shape

• Definition: The .shape attribute returns a tuple representing the dimensions of the
array.
• Syntax: array.shape
• Example:

print(arr_2d.shape) # Output will be (2, 3) for 2x3 array

15. Reshaping of Arrays

• Definition: You can change the shape of an array using .reshape().


• Syntax: array.reshape(new_shape)
• Example:

arr = np.array([1, 2, 3, 4, 5])

reshaped_arr = arr.reshape(5, 1)

print(reshaped_arr)

16. Aggregations (Sum, Mean, Median)

• Definition: NumPy provides functions for aggregation operations such as sum, mean,
and median.
• Example:

arr = np.array([1, 2, 3, 4, 5])

print(arr.sum()) # Sum
print(arr.mean()) # Mean

print(np.median(arr)) # Median

17. Standard Deviation

• Definition: Standard deviation measures the amount of variation or dispersion of a dataset.
• Syntax: array.std()
• Example:

print(arr.std()) # Standard deviation
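The aggregation list above also mentions the mode. NumPy has no built-in mode function; one common approach (a sketch, not the only way) combines np.unique with np.argmax:

vals, counts = np.unique(arr, return_counts=True)
print(vals[np.argmax(counts)])  # most frequent value (the smallest one on ties)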

18. Joining NumPy Arrays

• Definition: You can join multiple arrays into one using np.concatenate().
• Syntax: np.concatenate((arr1, arr2))
• Example:

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

joined_arr = np.concatenate((arr1, arr2))

print(joined_arr)

19. Splitting NumPy Arrays

• Definition: You can split an array into multiple sub-arrays using np.split().
• Syntax: np.split(array, sections)
• Example:

arr = np.array([1, 2, 3, 4, 5, 6])

split_arr = np.split(arr, 3)

print(split_arr) # Split into 3 sub-arrays

20. How to Create Your Own ufunc

• Definition: A universal function (ufunc) is a function that operates element-wise on arrays.
• Syntax: np.frompyfunc(function, nin, nout)
• Example:

def add_five(x):
    return x + 5

ufunc_add_five = np.frompyfunc(add_five, 1, 1)

print(ufunc_add_five(np.array([1, 2, 3])))

21. Sorting Arrays

• Definition: The np.sort() function returns a sorted copy of an array.


• Syntax: np.sort(array)
• Example:

arr = np.array([5, 2, 8, 1, 9])

print(np.sort(arr)) # Sorted array

22. Comparisons

• Definition: NumPy supports element-wise comparisons between arrays.


• Syntax: array1 == array2 or array1 > array2
• Example:

arr = np.array([1, 2, 3])

print(arr == 2) # Output: [False True False]

23. Masks and Boolean Logic

• Definition: You can create masks for filtering arrays based on conditions.
• Example:

arr = np.array([1, 2, 3, 4, 5])

mask = arr > 3

print(arr[mask]) # Array elements greater than 3


Result :
Thus python programs for creating and accessing arrays have been executed.
OUTPUT:

NumPy ndarray object: [1 2 3 4 5]

numpy.zeros() - 2x3 array of zeros:


[[0. 0. 0.]
[0. 0. 0.]]

numpy.ones() - 2x3 array of ones:


[[1. 1. 1.]
[1. 1. 1.]]

numpy.empty() - 2x3 empty array:


[[0. 0. 0.]
[0. 0. 0.]]

numpy.linspace() - 5 values from 0 to 10:


[ 0. 2.5 5. 7.5 10. ]

numpy.arange() - Values from 0 to 10 with step 2:


[0 2 4 6 8]

numpy.array() - Array from a list:


[10 20 30 40]

Number of Dimensions in arr: 1

Number of Dimensions in two_d_arr: 2

Number of Dimensions in three_d_arr: 3


Indexing element at index 2 in arr: 3

Sliced array from index 1 to 4: [2 3 4]

Two Dimensional Array:


[[1 2 3]
[4 5 6]]

Shape of two_d_arr: (2, 3)

Reshaped Array (5x1):


[[1]
[2]
[3]
[4]
[5]]

Sum of arr: 15
Mean of arr: 3.0
Median of arr: 3.0

Standard Deviation of arr: 1.4142135623730951

Joined Arrays: [1 2 3 4 5 6]

Splitting the joined array into two parts: [array([1, 2, 3]), array([4, 5, 6])]

Applying custom ufunc (add 5) on arr: [ 6 7 8 9 10]


Sorted arr: [1 2 3 4 5]

Comparison: arr > 3: [False False False True True]

Mask for elements > 2 and < 5: [False True True False False]
Elements satisfying the mask: [3 4]
EX.No:3 Working with Pandas data frames
Date:

Aim:
To write Python programs for using pandas data frames and accessing it.

Algorithm:
1. Start the Program.
2. Import NumPy & Pandas packages.
3. Create a DataFrame for the list of elements.
4. Load a dataset from an external source into a pandas DataFrame.
5. Display the Output.
6. Stop the Program.

Pandas Series:

• A one-dimensional array-like object that holds data of any type.


• Syntax: pd.Series(data, index=None)

Pandas DataFrame:

• A two-dimensional table-like data structure with rows and columns.


• Syntax: pd.DataFrame(data, index=None, columns=None)

Hierarchical Indexing (MultiIndex):

• Using multiple levels of indexing for more complex data representation.


• Syntax: df.set_index(['Column1', 'Column2'], inplace=True)

Aggregation and Grouping:

• Grouping data by a specific column and applying aggregation functions (e.g., sum,
mean).
• Syntax: df.groupby('Column').agg(function)
Descriptive Statistics:

• Generating summary statistics such as count, mean, standard deviation, etc.


• Syntax: df.describe()

PROGRAM:
import pandas as pd

# Create a simple Pandas Series from a list


data_series = [10, 20, 30, 40, 50]
series = pd.Series(data_series)

# Print the Series


print("Pandas Series:")
print(series)
print()

# Create a simple Pandas DataFrame from a dictionary


data_dict = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'Salary': [50000, 60000, 70000, 80000, 90000]
}

df = pd.DataFrame(data_dict)

# Print the DataFrame


print("Pandas DataFrame:")
print(df)
print()
# Hierarchical Index Example
df.set_index(['Name', 'Age'], inplace=True)

# Print DataFrame with hierarchical index


print("DataFrame with Hierarchical Index:")
print(df)
print()

# Aggregation and Grouping Example


# Let's assume 'Age' is grouped by 'Salary' range, and we will sum the values for each group
# include_lowest=True keeps the lowest salary (50000) inside the first bin
df['Salary Group'] = pd.cut(df['Salary'], bins=[50000, 60000, 70000, 80000, 90000],
                            labels=['50K-60K', '60K-70K', '70K-80K', '80K-90K'],
                            include_lowest=True)
grouped = df.groupby('Salary Group').sum()

# Print Grouped and Aggregated Data


print("Grouped and Aggregated Data (Sum of Age by Salary Group):")
print(grouped)
print()

# Descriptive statistics (df.describe()) Example


print("Descriptive Statistics of the DataFrame:")
print(df.describe())

# Sample CSV Output


df.to_csv('sample_output.csv')
print("\nCSV File 'sample_output.csv' has been created.")
Result :
Thus python programs for creating and accessing data frames using Pandas have
been executed and verified.
OUTPUT:
Pandas Series:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Pandas DataFrame:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
3 David 40 80000
4 Eva 45 90000
DataFrame with Hierarchical Index:
Age Salary
Name Age
Alice 25 50000
Bob 30 60000
Charlie 35 70000
David 40 80000
Eva 45 90000
Grouped and Aggregated Data (Sum of Age by Salary Group):
Age
Salary Group
50K-60K 55
60K-70K 35
70K-80K 40
80K-90K 45

Descriptive Statistics of the DataFrame:


Age Salary
count 5.000000 5.000000
mean 35.000000 70000.000000
std 7.905694 15811.388301
min 25.000000 50000.000000
25% 30.000000 60000.000000
50% 35.000000 70000.000000
75% 40.000000 80000.000000
max 45.000000 90000.000000

CSV File 'sample_output.csv' has been created.


EX.No:4 Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set
Date:

Aim:
To read the data from text files, Excel and the web and explore various commands for doing descriptive analytics on the Iris data set.

Dataset Down Load link:


https://www.kaggle.com/datasets/arshid/iris-flower-dataset

Descriptive Analysis:
Descriptive analysis, also known as descriptive analytics or descriptive statistics, is the process of using statistical techniques to describe or summarize a set of data.

Iris Data Set:

The Iris dataset is considered the "Hello World" of data science. It contains five columns, namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering plant; researchers have measured various features of the different iris flowers and recorded them digitally.

Data Loading:

• Loads data from an Excel file. Additional comments show how to load data from a
text file or URL.

Basic Information:

• Displays dataset structure, head (first five rows), basic statistics, and any missing
values.
Class Distribution:

• Counts the number of samples in each class of the Iris species.

Distributions:

• Generates histograms for each numerical feature to see data distributions.

Pair Plot:

• A pair plot helps visualize relationships between all feature pairs, with species
differentiated by color.

Correlation Heatmap:

• Shows the correlation between features to help understand feature relationships.

Grouped Statistics:

• Aggregates statistics like mean, standard deviation, minimum, and maximum for each
species group.

PROGRAM:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Load data from Excel file

excel_data = pd.read_excel('/mnt/data/IRIS.xls')

print("Excel data loaded successfully!")

# If the dataset were from a text file:

# text_data = pd.read_csv('path_to_text_file.txt')
# If the dataset were from a URL:

# url = 'http://example.com/iris.csv'

# web_data = pd.read_csv(url)

# Display basic info about the dataset

print("Dataset Information:")

print(excel_data.info())

# Display first few rows

print("\nFirst five rows of the dataset:")

print(excel_data.head())

# Basic statistics for each numerical column

print("\nDescriptive Statistics:")

print(excel_data.describe())

# Check for missing values

print("\nMissing values in each column:")

print(excel_data.isnull().sum())

# Distribution of the classes in the Iris dataset

print("\nClass distribution:")

print(excel_data['species'].value_counts())

# Visualize the distributions of numeric columns


for column in excel_data.select_dtypes(include=[np.number]).columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(excel_data[column], kde=True, bins=20)
    plt.title(f'Distribution of {column}')
    plt.show()

# Pair plot to show relationships between features

sns.pairplot(excel_data, hue="species", markers=["o", "s", "D"])

plt.suptitle("Pair Plot of Iris Dataset Features", y=1.02)

plt.show()

# Correlation heatmap

plt.figure(figsize=(10, 8))

sns.heatmap(excel_data.corr(numeric_only=True), annot=True, cmap='coolwarm')  # numeric_only=True skips the text 'species' column

plt.title("Correlation Heatmap")

plt.show()

# Group statistics by species

grouped_stats = excel_data.groupby('species').agg(['mean', 'std', 'min', 'max'])

print("\nGrouped Statistics by Species:")

print(grouped_stats)

Result:

Thus reading the data from text files, Excel and the web, and exploring various commands for doing descriptive analytics on the Iris data set, were successfully completed and executed.
OUTPUT:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
max 7.900000 4.400000 6.900000 2.500000
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
setosa 50
versicolor 50
virginica 50
Name: species, dtype: int64
sepal_length sepal_width ...
mean std min max mean std min max
species
setosa 5.006000 0.352490 4.300 5.800 3.428000 0.379064 2.300 4.400
versicolor 5.936000 0.516171 4.900 7.000 2.770000 0.313798 2.000 3.400
virginica 6.588000 0.635880 4.900 7.900 2.974000 0.322497 2.200 3.800
EX.No:5 UCI and Pima Indians Diabetes data set
Date:

Aim:
To use the diabetes data set from UCI and the Pima Indians Diabetes data set for performing the following:
performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.

Algorithm:
1. Start the program and import the required libraries.
2. Load the UCI (Wine Quality) and Pima Indians Diabetes datasets.
3. Perform univariate analysis: frequency, mean, median, mode, variance, standard deviation, skewness and kurtosis for both datasets.
4. Perform bivariate analysis: fit linear and logistic regression models.
5. Perform multiple regression analysis and evaluate the models (MSE, accuracy).
6. Compare the results of the above analysis for the two datasets.
7. Stop the program.

Importing Libraries:

• pandas and numpy for data manipulation and analysis.


• sklearn.model_selection for splitting data into training and testing sets.
• sklearn.linear_model for creating and training linear and logistic regression models.
• sklearn.metrics for model evaluation (mean squared error, accuracy score).
• statsmodels.api for statistical modeling and producing a summary report of regression
models.

File Paths and Loading Data:

• Specified file paths for UCI Wine Quality and Pima Indians Diabetes datasets.
• Used try-except for loading datasets to handle potential FileNotFoundError
gracefully.
Univariate Analysis:

• For both datasets, calculated statistical summaries: mean, median, mode, variance,
standard deviation, skewness, and kurtosis.
• Utilized describe() for overall summary statistics.
• Displayed mean values for each column in the Wine Quality dataset for additional
detail.

Multiple Linear Regression (Wine Quality Dataset):

• Used LinearRegression to fit a linear model for predicting wine quality based on
various features.
• Split data into training and testing sets with train_test_split.
• Calculated predictions and computed Mean Squared Error (MSE) for model
evaluation.

Logistic Regression (Pima Diabetes Dataset):

• Used LogisticRegression to predict diabetes outcome based on patient features.


• Split data, trained the logistic model, and computed accuracy as an evaluation metric.

Statsmodels Summaries (Optional):

• For both linear and logistic regression, used statsmodels to generate detailed summary
reports.
• Added a constant term to the feature set (intercept) before fitting the model.
• Used OLS for linear regression and Logit for logistic regression to view statistical
significance of features, coefficients, and other model diagnostics.

PROGRAM:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
import statsmodels.api as sm
# Define file paths
uci_file_path = r'C:\Users\hp\Downloads\Datasets for FDS\wine_quality.csv'  # Adjust the name as necessary
pima_file_path = r'C:\Users\hp\Downloads\Datasets for FDS\diabetes.csv'  # Update with actual file name
# Debugging print statements
print("UCI File Path:", uci_file_path)
print("Pima File Path:", pima_file_path)
# Load the UCI dataset (Wine Quality)
try:
    uci_data = pd.read_csv(uci_file_path, delimiter=';')  # Using the correct delimiter
    print("UCI Data Loaded Successfully.")
except FileNotFoundError as e:
    print("FileNotFoundError:", e)
    exit(1)  # Exit if the file is not found

# Univariate Analysis for UCI Dataset


uci_desc = uci_data.describe()
uci_mean = uci_data.mean()
uci_median = uci_data.median()
uci_mode = uci_data.mode().iloc[0] # Mode can return multiple rows
uci_var = uci_data.var()
uci_std = uci_data.std()
uci_skew = uci_data.skew()
uci_kurt = uci_data.kurt()
print("\nUCI Dataset Univariate Analysis")
print("Mean:", uci_mean)
print("Median:", uci_median)
print("Mode:", uci_mode)
print("Variance:", uci_var)
print("Standard Deviation:", uci_std)
print("Skewness:", uci_skew)
print("Kurtosis:", uci_kurt)
# Calculate the mean for each column in the UCI dataset and print
print("\nMean values for each feature in the Wine Quality Dataset:")
for column, mean_value in uci_mean.items():
    print(f"{column}: {mean_value:.2f}")
# Multiple Linear Regression on Wine Quality Dataset
X = uci_data.drop('quality', axis=1) # Features (excluding target variable)
y = uci_data['quality'] # Target variable
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Make predictions
y_pred = linear_model.predict(X_test)
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"\nMean Squared Error (Wine Quality): {mse:.2f}")
# Load the Pima Indians Diabetes dataset
try:
    pima_data = pd.read_csv(pima_file_path)
    print("Pima Data Loaded Successfully.")
except FileNotFoundError as e:
    print("FileNotFoundError:", e)
    exit(1)  # Exit if the file is not found
# Univariate Analysis for Pima Dataset
pima_desc = pima_data.describe()
pima_mean = pima_data.mean()
pima_median = pima_data.median()
pima_mode = pima_data.mode().iloc[0]
pima_var = pima_data.var()
pima_std = pima_data.std()
pima_skew = pima_data.skew()
pima_kurt = pima_data.kurt()
print("\nPima Indians Diabetes Dataset Univariate Analysis")
print("Mean:", pima_mean)
print("Median:", pima_median)
print("Mode:", pima_mode)
print("Variance:", pima_var)
print("Standard Deviation:", pima_std)
print("Skewness:", pima_skew)
print("Kurtosis:", pima_kurt)

# Logistic Regression on Pima Indians Diabetes Dataset


# Prepare data for Logistic Regression
X_pima = pima_data.drop('Outcome', axis=1) # Features
y_pima = pima_data['Outcome'] # Target variable
# Split the data into training and testing sets
X_train_pima, X_test_pima, y_train_pima, y_test_pima = train_test_split(X_pima, y_pima,
test_size=0.2, random_state=42)
# Fit the logistic regression model
logistic_model = LogisticRegression(max_iter=200)
logistic_model.fit(X_train_pima, y_train_pima)
# Make predictions
y_pred_pima = logistic_model.predict(X_test_pima)
# Calculate accuracy
accuracy = accuracy_score(y_test_pima, y_pred_pima)
print(f"\nAccuracy (Pima Diabetes): {accuracy:.2f}")
# Optional: Statsmodels summary for linear regression on Wine Quality
X_with_const = sm.add_constant(X_train) # Adding a constant for statsmodels
ols_model = sm.OLS(y_train, X_with_const).fit()
print("\nMultiple Linear Regression Summary (Wine Quality):")
print(ols_model.summary())
# Optional: Statsmodels summary for logistic regression
X_pima_with_const = sm.add_constant(X_train_pima) # Adding a constant for statsmodels
logit_model = sm.Logit(y_train_pima, X_pima_with_const).fit()
print("\nLogistic Regression Summary (Pima Diabetes):")
print(logit_model.summary())

Result:
Thus Python programs were written for performing univariate, bivariate and multiple linear regression analysis, and executed successfully.
OUTPUT:

UCI File Path: C:\Users\hp\Downloads\Datasets for FDS\wine_quality.csv


Pima File Path: C:\Users\hp\Downloads\Datasets for FDS\diabetes.csv
UCI Data Loaded Successfully.

UCI Dataset Univariate Analysis


Mean: fixed acidity 6.854788
volatile acidity 0.278241
citric acid 0.334192
residual sugar 6.391415
chlorides 0.045772
free sulfur dioxide 35.308085
total sulfur dioxide 138.360657
density 0.994027
pH 3.188267
sulphates 0.489847
alcohol 10.514267
quality 5.877909
dtype: float64
Median: fixed acidity 6.80000
volatile acidity 0.26000
citric acid 0.32000
residual sugar 5.20000
chlorides 0.04300
free sulfur dioxide 34.00000
total sulfur dioxide 134.00000
density 0.99374
pH 3.18000
sulphates 0.47000
alcohol 10.40000
quality 6.00000
dtype: float64
Mode: fixed acidity 6.800
volatile acidity 0.280
citric acid 0.300
residual sugar 1.200
chlorides 0.044
free sulfur dioxide 29.000
total sulfur dioxide 111.000
density 0.992
pH 3.140
sulphates 0.500
alcohol 9.400
quality 6.000
Name: 0, dtype: float64
Variance: fixed acidity 0.712114
volatile acidity 0.010160
citric acid 0.014646
residual sugar 25.725770
chlorides 0.000477
free sulfur dioxide 289.242720
total sulfur dioxide 1806.085491
density 0.000009
pH 0.022801
sulphates 0.013025
alcohol 1.514427
quality 0.784356
dtype: float64
Standard Deviation: fixed acidity 0.843868
volatile acidity 0.100795
citric acid 0.121020
residual sugar 5.072058
chlorides 0.021848
free sulfur dioxide 17.007137
total sulfur dioxide 42.498065
density 0.002991
pH 0.151001
sulphates 0.114126
alcohol 1.230621
quality 0.885639
dtype: float64
Skewness: fixed acidity 0.647751
volatile acidity 1.576980
citric acid 1.281920
residual sugar 1.077094
chlorides 5.023331
free sulfur dioxide 1.406745
total sulfur dioxide 0.390710
density 0.977773
pH 0.457783
sulphates 0.977194
alcohol 0.487342
quality 0.155796
dtype: float64
Kurtosis: fixed acidity 2.172178
volatile acidity 5.091626
citric acid 6.174901
residual sugar 3.469820
chlorides 37.564600
free sulfur dioxide 11.466342
total sulfur dioxide 0.571853
density 9.793807
pH 0.530775
sulphates 1.590930
alcohol -0.698425
quality 0.216526
dtype: float64

Mean values for each feature in the Wine Quality Dataset:


fixed acidity: 6.85
volatile acidity: 0.28
citric acid: 0.33
residual sugar: 6.39
chlorides: 0.05
free sulfur dioxide: 35.31
total sulfur dioxide: 138.36
density: 0.99
pH: 3.19
sulphates: 0.49
alcohol: 10.51
quality: 5.88

Mean Squared Error (Wine Quality): 0.57


Pima Data Loaded Successfully.

Pima Indians Diabetes Dataset Univariate Analysis


Mean: Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64
Median: Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64
Mode: Pregnancies 1.000
Glucose 99.000
BloodPressure 70.000
SkinThickness 0.000
Insulin 0.000
BMI 32.000
DiabetesPedigreeFunction 0.254
Age 22.000
Outcome 0.000
Name: 0, dtype: float64
Variance: Pregnancies 11.354056
Glucose 1022.248314
BloodPressure 374.647271
SkinThickness 254.473245
Insulin 13281.180078
BMI 62.159984
DiabetesPedigreeFunction 0.109779
Age 138.303046
Outcome 0.227483
dtype: float64
Standard Deviation: Pregnancies 3.369578
Glucose 31.972618
BloodPressure 19.355807
SkinThickness 15.952218
Insulin 115.244002
BMI 7.884160
DiabetesPedigreeFunction 0.331329
Age 11.760232
Outcome 0.476951
dtype: float64
Skewness: Pregnancies 0.901674
Glucose 0.173754
BloodPressure -1.843608
SkinThickness 0.109372
Insulin 2.272251
BMI -0.428982
DiabetesPedigreeFunction 1.919911
Age 1.129597
Outcome 0.635017
dtype: float64
Kurtosis: Pregnancies 0.159220
Glucose 0.640780
BloodPressure 5.180157
SkinThickness -0.520072
Insulin 7.214260
BMI 3.290443
DiabetesPedigreeFunction 5.594954
Age 0.643159
Outcome -1.600930
dtype: float64

Accuracy (Pima Diabetes): 0.75

Multiple Linear Regression Summary (Wine Quality):

Optimization terminated successfully.
         Current function value: 0.467835
         Iterations 6

Logistic Regression Summary (Pima Diabetes):
                           Logit Regression Results
==============================================================================
Dep. Variable:                Outcome   No. Observations:                  614
Model:                          Logit   Df Residuals:                      605
Method:                           MLE   Df Model:                            8
Date:                Wed, 06 Nov 2024   Pseudo R-squ.:                  0.2752
Time:                        23:48:43   Log-Likelihood:                -287.25
converged:                       True   LL-Null:                       -396.34
Covariance Type:            nonrobust   LLR p-value:                 9.311e-43
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                       -9.0359      0.837    -10.802      0.000     -10.675      -7.396
Pregnancies                  0.0645      0.036      1.791      0.073      -0.006       0.135
Glucose                      0.0341      0.004      8.055      0.000       0.026       0.042
BloodPressure               -0.0139      0.006     -2.260      0.024      -0.026      -0.002
SkinThickness                0.0031      0.008      0.397      0.691      -0.012       0.019
Insulin                     -0.0018      0.001     -1.782      0.075      -0.004       0.000
BMI                          0.1026      0.017      5.948      0.000       0.069       0.136
DiabetesPedigreeFunction     0.6945      0.330      2.107      0.035       0.049       1.341
Age                          0.0371      0.011      3.400      0.001       0.016       0.058
============================================================================================
EX.No:6 Apply and explore various plotting functions on UCI data sets.
Date:

Aim:
To apply and explore various plotting functions on UCI data sets for performing the following:

a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting

1. Normal Curve

Definition:
A normal curve, or Gaussian distribution, is a symmetric, bell-shaped curve that shows the
probability distribution of a continuous random variable. The curve is defined by the mean
(center) and the standard deviation (spread).

Syntax:

import numpy as np
import matplotlib.pyplot as plt

# Mean and standard deviation


mean = 0
std_dev = 1

# Generating values for the normal curve


x = np.linspace(mean - 3*std_dev, mean + 3*std_dev, 100)
y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev) ** 2)

# Plotting
plt.plot(x, y)
plt.title("Normal Distribution Curve")
plt.xlabel("Values")
plt.ylabel("Probability Density")
plt.show()

2. Scatter Plot

Definition:
A scatter plot is used to show the relationship between two numerical variables by plotting
data points on an XY axis. Each point represents a pair of values for the two variables.

Syntax:

import matplotlib.pyplot as plt

# Example data
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 18, 16]

# Plotting
plt.scatter(x, y)
plt.title("Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

3. Histogram

Definition:
A histogram is a graphical representation of the distribution of numerical data. It groups data
into bins (ranges) and plots the frequency of each bin as bars, showing the shape of the data
distribution.
Syntax:

import matplotlib.pyplot as plt

# Example data
data = [10, 20, 20, 30, 30, 40, 40, 40, 50, 50, 50, 50]

# Plotting
plt.hist(data, bins=5, color='skyblue', edgecolor='black')
plt.title("Histogram")
plt.xlabel("Value Ranges")
plt.ylabel("Frequency")
plt.show()

4. Contour Plot

Definition:
A contour plot is a two-dimensional representation of a 3D surface, showing lines where a
particular z-value is constant. It is often used to show the density of data points in two
dimensions.

Syntax:

import numpy as np
import matplotlib.pyplot as plt

# Generating meshgrid data


x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Plotting
plt.contour(X, Y, Z, levels=10, cmap="viridis")
plt.title("Contour Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

5. Density Plot

Definition:
A density plot represents the distribution of a numerical variable using kernel density
estimation (KDE). It smooths out data points to give an estimated continuous probability
density function.

Syntax:

import matplotlib.pyplot as plt


import pandas as pd
import numpy as np

# Example data
data = np.random.normal(0, 1, 1000)

# Using pandas density plot


pd.Series(data).plot(kind='density')
plt.title("Density Plot")
plt.xlabel("Value")
plt.show()

6. 3D Plot

Definition:
A 3D plot is a graphical representation that shows data points in three dimensions, typically
using x, y, and z coordinates. It can be a line, scatter plot, or surface plot.

Syntax:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Example data
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)

# Plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z)
ax.set_title("3D Scatter Plot")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_zlabel("Z-axis")
plt.show()
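The aim also lists correlation plots; pairwise correlations are commonly drawn as a heatmap. A minimal sketch on illustrative random data (assumes seaborn is installed):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random example data; DataFrame.corr() computes pairwise Pearson correlations
df = pd.DataFrame(np.random.rand(50, 3), columns=['a', 'b', 'c'])
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()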

Result :
Thus python programs for exploring various plots using matplotlib were executed
successfully.
OUTPUT:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load data from CSV file (use a sample data if hou_all.csv is unavailable)
# Replace 'hou_all.csv' with your file path
# df = pd.read_csv('C:\\Users\\Admin\\Downloads\\hou_all.csv')
# For demonstration, create a sample DataFrame
np.random.seed(0)
df = pd.DataFrame({
'set': np.random.normal(loc=50, scale=15, size=100),
'value': np.random.normal(loc=30, scale=10, size=100)
})
# 1. Normal Curve Plot
mean = df['set'].mean()
std_dev = df['set'].std()
x = np.linspace(mean - 3*std_dev, mean + 3*std_dev, 100)
y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev) ** 2)
plt.plot(x, y)
plt.title("Normal Distribution Curve")
plt.xlabel("Set")
plt.ylabel("Probability Density")
plt.show()
# 2. Scatter Plot
plt.scatter(df['set'], df['value'])
plt.title("Scatter Plot")
plt.xlabel("Set")
plt.ylabel("Value")
plt.show()
# 3. Histogram
plt.hist(df['set'], bins=10, color='skyblue', edgecolor='black')
plt.title("Histogram of Set")
plt.xlabel("Set")
plt.ylabel("Frequency")
plt.show()
# 4. Contour Plot
# Creating a 2D grid of values for contour plot
X, Y = np.meshgrid(np.linspace(df['set'].min(), df['set'].max(), 100),
np.linspace(df['value'].min(), df['value'].max(), 100))
Z = np.exp(-((X - mean)**2 + (Y - mean)**2) / (2 * std_dev**2))
plt.contour(X, Y, Z, levels=10, cmap="viridis")
plt.title("Contour Plot")
plt.xlabel("Set")
plt.ylabel("Value")
plt.show()
# 5. Density Plot
df['set'].plot(kind='density')
plt.title("Density Plot of Set")
plt.xlabel("Set")
plt.show()
# 6. 3D Plot
fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')

ax.scatter(df['set'], df['value'], c='r', marker='o')

ax.set_title("3D Scatter Plot")

ax.set_xlabel("Set")

ax.set_ylabel("Value")

ax.set_zlabel("Frequency")

plt.show()
EX.No:7 Visualizing Geographic Data with Basemap
Date:

Aim:
To Visualize Geographic Data with Basemap
Algorithm:
1. Start the Program.
2. Import Basemap Package.
3. Perform Visualize Function of Geographic data.
4. Display the Output.
5. Stop the Program

Base Map
The common type of visualization in data science is that of geographic data. Matplotlib's main tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that live under the mpl_toolkits namespace. Basemap is a Matplotlib extension used to visualize and create geographical maps in Python.
Installing Basemap:
pip install basemap
Importing Basemap and matplotlib libraries:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
Functions used in Basemap:
Basemap() - To create a base map class.

Physical boundaries and bodies of water
• drawcoastlines(): Draw continental coast lines
• drawlsmask(): Draw a mask between the land and sea, for use with projecting images on one or the other
• drawmapboundary(): Draw the map boundary, including the fill color for oceans
• drawrivers(): Draw rivers on the map
• fillcontinents(): Fill the continents with a given color; optionally fill lakes with another color

Political boundaries
• drawcountries(): Draw country boundaries
• drawstates(): Draw US state boundaries
• drawcounties(): Draw US county boundaries

Map features
• drawgreatcircle(): Draw a great circle between two points (see the sketch below)
• drawparallels(): Draw lines of constant latitude
• drawmeridians(): Draw lines of constant longitude
• drawmapscale(): Draw a linear scale on the map

Whole-globe images
• bluemarble(): Project NASA's blue marble image onto the map
• shadedrelief(): Project a shaded relief image onto the map
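Of the functions listed, drawgreatcircle() is not used in the program below, so here is a small sketch of it (the city coordinates are approximate and purely illustrative):

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

m = Basemap(projection='merc', llcrnrlat=-60, urcrnrlat=75,
            llcrnrlon=-180, urcrnrlon=180)
m.drawcoastlines()
# Great circle from New York (approx. 40.7N, 74.0W) to London (approx. 51.5N, 0.1W)
m.drawgreatcircle(-74.0, 40.7, -0.1, 51.5, linewidth=2, color='b')
plt.title("Great Circle: New York to London")
plt.show()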
PROGRAM :

# Step 1: Import the necessary libraries


from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Step 2: Create a Basemap instance for the world map (Orthographic projection)
m = Basemap(projection='ortho', lat_0=0, lon_0=0)  # Orthographic projection centered at (0, 0)

# Step 3: Draw map features


m.drawcoastlines() # Draw continental coastlines
m.drawcountries() # Draw country boundaries
m.drawstates() # Draw US state boundaries (if you are working with US data)
m.drawrivers() # Draw rivers on the map
m.drawparallels(range(-90, 90, 30), labels=[1,0,0,0]) # Draw parallels (latitude lines)
m.drawmeridians(range(-180, 180, 60), labels=[0,0,0,1]) # Draw meridians (longitude lines)
m.drawmapboundary(fill_color='lightblue') # Draw map boundary with ocean fill color
m.fillcontinents(color='lightgreen',lake_color='lightblue') # Fill continents and lakes
m.drawmapscale(-180, -90, -180, -90, 5000) # Draw map scale (optional)

# Step 4: Add a whole-globe image (NASA's blue marble or shaded relief)


m.bluemarble() # Adds a global image based on satellite data

# Step 5: Title and Show the map


plt.title("World Map Visualization with Basemap")
plt.show()
# Step 6: End of program

Result :

Thus the program was executed successfully.


OUTPUT:
EX.NO:8 Working with pivot table in pandas.
DATE:

AIM:
Write a python code to perform multidimensional summarization using pivot table.

Algorithm:

1. Import the numpy and pandas libraries.
2. Load or create a dataset as a pandas DataFrame.
3. Create pandas DataFrame objects if the data is not already in DataFrame format.
4. Use the pivot_table function in pandas to group the values based on a specified index and columns.
5. Apply different aggregate functions in pivot_table as needed.
6. Calculate multiple types of aggregations for any given value column within the pivot_table function.

PROGRAM:

import numpy as np
import pandas as pd

# Creating a sample DataFrame


df = pd.DataFrame({
"A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
"C": ["small", "large", "large", "small", "small", "large", "small", "small", "large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]
})

# Display the original DataFrame


print("Original DataFrame:")
print(df)
# Creating a pivot table with the sum of 'D' values grouped by 'A' and 'B', with columns
based on 'C'
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)
print("\nPivot Table with sum of 'D' values:")
print(table)

# Creating a pivot table with the sum of 'D' values, and fill NaN values with 0
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum,
fill_value=0)
print("\nPivot Table with sum of 'D' values and fill_value=0:")
print(table)

# Pivot table with mean of 'D' and 'E' values, grouped by 'A' and 'C'
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], aggfunc={'D': np.mean, 'E':
np.mean})
print("\nPivot Table with mean of 'D' and 'E' values:")
print(table)

# Pivot table with mean of 'D' and multiple aggregations for 'E'
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], aggfunc={'D': np.mean, 'E':
[min, max, np.mean]})
print("\nPivot Table with mean of 'D' and multiple aggregations for 'E':")
print(table)

Result:

Thus the program has been executed successfully.


OUTPUT:
EX.NO:9 Comprehensive Data Analysis Algorithm for Frequency Distribution and Descriptive Statistics
DATE:

AIM:

Write a Python program to compute a frequency distribution; to find averages such as the mean, median and mode of a given dataset; and to find the data type, standard deviation, variance and mean absolute deviation.

Algorithm:

1. Import Libraries:
o Import the necessary libraries such as pandas, numpy, and scipy.stats.
2. Load the Dataset:
o Read the dataset from a CSV file using pandas.read_csv().
3. Generate Frequency Distribution:
o Use value_counts() on specific columns to generate the frequency table.
o Sort the frequency table by index using value_counts().sort_index() for an
organized view.
o Calculate relative frequency by dividing each frequency by the total number of
observations.
o Calculate percentage frequency by multiplying each relative frequency by 100.
4. Calculate Percentiles and Percentile Ranks:
o Use scipy.stats.percentileofscore() to find the percentile rank for a specific
score.
o Use pandas.describe() to find percentiles (like 25th, 50th, 75th) and other
statistics.
5. Create Grouped Frequency Distribution:
o Set bins in value_counts() or use pandas.cut() to create a grouped frequency
distribution.
o For custom intervals, use pandas.interval_range() to set specific range
intervals as required.
6. Calculate Descriptive Statistics:
o Data Type: Use type() function on columns to determine data types.
o Mean: Apply np.mean() on specific columns to calculate the mean.
o Median: Use np.median() to calculate the median value.
o Mode: Use scipy.stats.mode() to find the mode of specific columns.
7. Calculate Additional Statistics:
o Standard Deviation: Use pandas.std() to calculate the standard deviation.
o Variance: Use pandas.var() to calculate variance.
o Mean Absolute Deviation: Use pandas.mad() for the mean absolute
deviation.
8. Output Results:
o Display each result (frequency distribution tables, mean, median, mode,
standard deviation, variance, and other calculated statistics) in an organized
format for analysis.
PROGRAM:

# Import necessary libraries


import pandas as pd
import numpy as np
from scipy.stats import percentileofscore
from statistics import mode

# Load the wnba dataset


wnba = pd.read_csv('/content/wnba.csv')

# Frequency distribution for Position and Height


freq_dis_pos = wnba['Pos'].value_counts()
print("Frequency Distribution for Position:\n", freq_dis_pos, "\n")

freq_dis_height = wnba['Height'].value_counts()
print("Frequency Distribution for Height:\n", freq_dis_height, "\n")

freq_dis_height_sorted = wnba['Height'].value_counts().sort_index(ascending=False)
print("Sorted Frequency Distribution for Height (Descending):\n", freq_dis_height_sorted,
"\n")

# Relative and Percentage Frequencies for Age


relative_freq_age = wnba['Age'].value_counts() / len(wnba)
print("Relative Frequency Distribution for Age:\n", relative_freq_age, "\n")

percentages_pos = wnba['Age'].value_counts(normalize=True).sort_index() * 100


print("Percentage Frequency Distribution for Age:\n", percentages_pos, "\n")

# Percentile calculations for Age


percentile_of_25 = percentileofscore(wnba['Age'], 25, kind='weak')
print("Percentile rank of age 25:", percentile_of_25)
# Descriptive statistics with custom percentiles
percentiles = wnba['Age'].describe(percentiles=[.1, .15, .33, .5, .592, .85, .9])
print("Descriptive Statistics with Custom Percentiles for Age:\n", percentiles, "\n")

# Grouped Frequency Distribution for Age


grouped_freq = wnba['Age'].value_counts(bins=5).sort_index()
print("Grouped Frequency Distribution for Age:\n", grouped_freq, "\n")

# Interval range example


intervals = pd.interval_range(start=0, end=600, freq=100)
print("Custom Interval Range from 0 to 600 with a frequency of 100:\n", intervals, "\n")

# Load the cric_data dataset


cric_data = np.loadtxt("cric_data.tsv", skiprows=1)[:, [1, 2, 3]]
sachin = cric_data[:, 0]
rahul = cric_data[:, 1]
india = cric_data[:, 2]

# Mode example
data1 = (2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7)
print("Mode of data1:", mode(data1), "\n")

# Function to calculate mean and median


def calculate_stats(col, name):
    print(f"Statistics for {name}:")
    print("Mean:", np.mean(col))
    print("Median:", np.median(col), "\n")
calculate_stats(sachin, "Sachin's Scores")
calculate_stats(rahul, "Rahul's Scores")
calculate_stats(india, "India's Scores")

# Statistical analysis for sample data


lst = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]
sample = pd.Series(lst)

print("Data Type:", type(sample))


print("Sample Data:\n", sample)
print("Mean:", sample.mean())
print("Median:", sample.median())
print("Standard Deviation (Sample):", sample.std())
print("Standard Deviation (Population):", sample.std(ddof=0))
print("Variance (Sample):", sample.var(ddof=1))
print("Variance (Population):", sample.var(ddof=0))
print("Mean Absolute Deviation:", sample.mad())

Result:
Thus the program for the Comprehensive Data Analysis Algorithm for Frequency Distribution and Descriptive Statistics has been executed successfully.
OUTPUT:
EX.NO:10 Correlation, Scatter Plots, Correlation Coefficient, Regression
DATE:

Aim:

Write a Python program to draw correlation and scatter plots, to find the correlation coefficient, and to perform regression.

Algorithm:

1. Load Libraries and Data:


o Step 1: Import the required libraries: matplotlib, pandas, statsmodels, and any
necessary data handling libraries.
o Step 2: Load the dataset (e.g., from sklearn or a CSV file).
2. Plot Scatter Plot:
o Step 3: Define the x-axis and y-axis values for the graph.
o Step 4: Set labels for the x and y axes.
o Step 5: Set the title of the graph.
o Step 6: Use plt.scatter() to create the scatter plot.
3. Calculate Correlation Coefficient:
o Step 7: Use the pearsonr() function to calculate the correlation coefficient.
o Step 8: Round the result if necessary for clarity.
4. Visualize Correlation with Heatmap:
o Step 9: Use the heatmap() function to create a correlation heatmap for a
comprehensive view of variable relationships.
5. Perform Regression Analysis:
o Step 10: Define Y to hold the response variable and X for explanatory
variables.
o Step 11: Use add_constant() to include a constant in X.
o Step 12: Fit the regression model using statsmodels.

PROGRAM:

# Import necessary libraries


import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.datasets import load_iris
from scipy import stats
# Step 1: Scatter Plot using Iris Dataset
iris = load_iris()
features = iris.data.T

# Scatter plot for the first two features in the iris dataset
plt.scatter(features[0], features[1], alpha=0.2, s=100 * features[3], c=iris.target,
cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("Scatter Plot of Iris Features")
plt.show()

# Step 2: Load the Concrete Dataset


con = pd.read_csv('/content/Concrete_Data_Yeh.csv')

# Step 3: Calculate Pearson Correlation


correlation, p_value = stats.pearsonr(con['csMPa'], con['superplasticizer'])
print(f"Pearson Correlation between csMPa and superplasticizer: {round(correlation, 2)}")

# Step 4: Display Heatmap of Correlation Matrix


cormat = con.corr()
sns.heatmap(round(cormat, 2), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap of Concrete Dataset")
plt.show()

# Step 5: Perform Regression Analysis


Y = con['csMPa']
X = con['water']
X = sm.add_constant(X) # Add constant to predictor variable
model = sm.OLS(Y, X, missing='drop')
model_result = model.fit()

# Step 6: Display Regression Summary and Residual Histogram


print(model_result.summary())
sns.histplot(model_result.resid, kde=True)
plt.title("Residuals of the Regression Model")
plt.show()

RESULT:
Thus the program has been successfully executed.
OUTPUT:
