
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY
Rohini, Delhi

Faculty Name: Akshay Mool
Student's Name: Anshu Kumar Pathak
Roll No.: 00214822722
Semester: 7th
Group: 7CST-(FSD-2)

Maharaja Agrasen Institute of Technology, PSP Area,
Sector – 22, Rohini, New Delhi – 110085

Data Science Lab

PAPER CODE: CIE-405P

Name of the Student: Anshu Kumar Pathak
University Roll No.: 00214822722
Branch: CST
Section/Group: 7CST-(FSD-2)

PRACTICAL DETAILS

S.No  Experiment Name  (Date, R1–R5, Total Marks, and Signature to be recorded against each entry)

1. Describing data, viewing, and manipulating data
2. To plot the probability distribution curve
3. To perform Chi-square test on various datasets
4. To use Python as a programming tool for the analysis of data structures
5. To perform various operations such as data storage, analysis, and visualization
6. To perform descriptive statistics analysis and data visualization
7. To perform Principal Component Analysis on datasets
8. To perform linear regression on datasets
9. To perform Data Aggregation and GroupWise Operations
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY
VISION

To nurture young minds in a learning environment of high academic value and imbibe spiritual and
ethical values with technological and management competence.

MISSION
The Institute shall endeavor to incorporate the following basic missions in the teaching methodology:

Engineering Hardware – Software Symbiosis


Practical exercises in all Engineering and Management disciplines shall be carried out using
hardware equipment as well as the related software, enabling a deeper understanding of basic
concepts and encouraging an inquisitive nature.

Life – Long Learning


The Institute strives to keep pace with technological advancements and encourages students to keep
updating their knowledge, enhancing their skills and inculcating a habit of continuous learning.

Liberalization and Globalization


The Institute endeavors to enhance technical and management skills of students so that they are
intellectually capable and competent professionals with Industrial Aptitude to face the challenges
of globalization.

Diversification
The Engineering, Technology and Management disciplines have diverse fields of studies with
different attributes. The aim is to create a synergy of the above attributes by encouraging
analytical thinking.

Digitization of Learning Processes


The Institute provides seamless opportunities for innovative learning in all Engineering and
Management disciplines through digitization of learning processes using analysis, synthesis,
simulation, graphics, tutorials and related tools to create a platform for multi-disciplinary
approaches.

Entrepreneurship
The Institute strives to develop potential Engineers and Managers by enhancing their skills and research
capabilities so that they become successful entrepreneurs and responsible citizens.
Department of Computer Science and Engineering
Rubrics for Lab Assessment
Rubric levels: 0 = Missing, 1 = Inadequate, 2 = Needs Improvement, 3 = Adequate

R1 – Is able to identify the problem to be solved and define the objectives of the experiment.
0: No mention is made of the problem to be solved.
1: An attempt is made to identify the problem to be solved but it is described in a confusing manner, objectives are not relevant, objectives contain technical/conceptual errors, or objectives are not measurable.
2: The problem to be solved is described but there are minor omissions or vague details. Objectives are conceptually correct and measurable but may be incomplete in scope or have linguistic errors.
3: The problem to be solved is clearly stated. Objectives are complete, specific, concise, and measurable. They are written using correct technical terminology and are free from linguistic errors.

R2 – Is able to design a reliable experiment that solves the problem.
0: The experiment does not solve the problem.
1: The experiment attempts to solve the problem but due to the nature of the design the data will not lead to a reliable solution.
2: The experiment attempts to solve the problem but due to the nature of the design there is a moderate chance the data will not lead to a reliable solution.
3: The experiment solves the problem and has a high likelihood of producing data that will lead to a reliable solution.

R3 – Is able to communicate the details of an experimental procedure clearly and completely.
0: Diagrams are missing and/or experimental procedure is missing or extremely vague.
1: Diagrams are present but unclear and/or experimental procedure is present but important details are missing.
2: Diagrams and/or experimental procedure are present but with minor omissions or vague details.
3: Diagrams and/or experimental procedure are clear and complete.

R4 – Is able to record and represent data in a meaningful way.
0: Data are either absent or incomprehensible.
1: Some important data are absent or incomprehensible.
2: All important data are present, but recorded in a way that requires some effort to comprehend.
3: All important data are present, organized and recorded clearly.

R5 – Is able to make a judgment about the results of the experiment.
0: No discussion is presented about the results of the experiment.
1: A judgment is made about the results, but it is not reasonable or coherent.
2: An acceptable judgment is made about the result, but the reasoning is flawed or incomplete.
3: An acceptable judgment is made about the result, with clear reasoning. The effects of assumptions and experimental uncertainties are considered.
Experiment – 1

Aim: Describing data, viewing, and manipulating data

Theory: Data viewing and manipulation are critical processes in data science, enabling analysts and
data scientists to explore, clean, and transform datasets for further analysis.

Data Viewing

The process of data viewing involves examining raw data to understand its structure, content, and
potential issues. Tools like pandas in Python or data frames in R allow users to load and view
datasets, enabling quick inspections of rows, columns, and data types. Typical methods include:

Head and Tail Views: Displaying the first or last few rows of a dataset to get a sense of the data
without loading it entirely.

Summarization: Functions like describe() or info() provide summary statistics (mean, median,
standard deviation) and metadata (data types, missing values) that give an overview of the dataset’s
characteristics.

These methods help in detecting anomalies such as missing data, outliers, and inconsistencies,
guiding the need for further manipulation.

Data Manipulation

Data manipulation refers to the process of cleaning, reshaping, and transforming data to make it
suitable for analysis. The main steps include:

Handling Missing Data: Missing values can be dealt with by filling them using techniques like mean
or median imputation or removing them if they don’t add significant value.

Filtering and Subsetting: Data filtering involves selecting rows and columns based on conditions,
such as removing irrelevant data or focusing on specific variables.

Data Transformation: Transformations involve converting data types, scaling numerical values, or
encoding categorical variables. Operations such as merging datasets, pivoting tables, or adding new
calculated columns are also common.

Efficient data manipulation not only cleans the data but also reshapes it to align with the
requirements of the analytical model or algorithm being applied. This process ensures that the
dataset is in a structured form suitable for further statistical or machine learning analysis.
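The following is a minimal sketch of these manipulation steps, assuming a small hypothetical DataFrame
with columns such as customer_id, purchase_amount, and segment (not part of the experiment's dataset):

import pandas as pd
import numpy as np

# Hypothetical frame with a missing value and a categorical column
df = pd.DataFrame({'customer_id': [1, 2, 3, 4],
                   'purchase_amount': [250.0, np.nan, 120.5, 80.0],
                   'segment': ['A', 'B', 'A', 'C']})

# Handling missing data: fill the gap with the column median
df['purchase_amount'] = df['purchase_amount'].fillna(df['purchase_amount'].median())

# Filtering and subsetting: rows above a threshold, two columns only
high_value = df.loc[df['purchase_amount'] > 100, ['customer_id', 'purchase_amount']]

# Data transformation: convert a type and add a calculated column
df['segment'] = df['segment'].astype('category')
df['amount_z'] = (df['purchase_amount'] - df['purchase_amount'].mean()) / df['purchase_amount'].std()

print(high_value)
print(df.dtypes)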

Source Code:

import pandas as pd

# Load a dataset
data = pd.read_csv('Customer.csv')

# View the first few rows of the dataset
print(data.head())

# Get the summary statistics
print(data.describe())

# Manipulating data: Adding a new column
data['new_column'] = data['purchase_amount'] * 2

# View modified dataset
print(data.head())
Output:
Viva-Voce:

Q1) What is the purpose of data viewing in data analysis, and what tools are commonly used?

A1) Data viewing involves inspecting the raw data to understand its structure, quality, and content.
Common tools include data frames in libraries like pandas (Python) and tibbles (R), which allow users
to preview rows, columns, and summary statistics to identify potential issues and guide further
analysis.

Q2) How can you view the first and last few rows of a dataset using pandas in Python?

A2) In pandas, you can use the head() method to view the first few rows and the tail() method to view
the last few rows of a DataFrame. For example, df.head() displays the first five rows, and df.tail()
shows the last five rows by default.

Q3) What are some common data manipulation tasks, and how are they performed in Python?

A3) Common data manipulation tasks include filtering, sorting, merging, and aggregating data. In
Python, pandas provides functions such as filter(), sort_values(), merge(), and groupby() to perform
these operations. For example, df.sort_values(by='column_name') sorts data by a specific column.

Q4) How can you handle missing values in a dataset using pandas?

A4) Missing values can be handled using methods like fillna() to replace them with a specific value or
method (e.g., mean, median), or dropna() to remove rows or columns containing missing values. For
example, df.fillna(value=0) replaces all missing values with 0.

Q5) What is data normalization, and why is it important in data manipulation?

A5) Data normalization is the process of scaling data to a standard range or distribution, often to
ensure that features contribute equally to analysis or modeling. It is important because it can improve
the performance and convergence of machine learning algorithms and make comparisons between
features more meaningful. Common normalization techniques include min-max scaling and z-score
standardization.
Experiment – 2

Aim: To plot the probability distribution curve.

Theory: Probability curves in data science represent the distribution of possible outcomes for a
random variable, providing insights into data behaviour and uncertainty. Common curves include the
normal distribution (bell curve), which is symmetric around the mean, and skewed distributions where
one tail is longer than the other. These curves help in modelling real-world phenomena, identifying
patterns, and making predictions. For example, the normal curve is widely used due to the central limit
theorem, which states that the sum of independent variables tends toward a normal distribution.
Understanding probability curves is crucial for tasks like statistical inference and machine learning.
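A small illustrative sketch of the central limit theorem follows; it assumes only NumPy and Matplotlib
and plots the means of many samples drawn from a uniform (non-normal) distribution, which come out
approximately bell-shaped:

import numpy as np
import matplotlib.pyplot as plt

# 1000 samples of size 50 from a uniform distribution (not normal)
rng = np.random.default_rng(0)
sample_means = rng.uniform(0, 1, size=(1000, 50)).mean(axis=1)

# The distribution of the sample means is approximately normal
plt.hist(sample_means, bins=30)
plt.title('Sample Means of a Uniform Distribution')
plt.show()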

Source Code:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate random data
data = np.random.normal(0, 1, 1000)

# Plot the probability distribution curve
sns.histplot(data, kde=True)
plt.title('Probability Distribution Curve')
plt.show()

Output:
Viva-Voce:

Q1) What is a probability curve, and why is it important in data science?

A1) A probability curve represents the likelihood of different outcomes for a random variable. It's
important in data science because it helps in modeling uncertainty, understanding distributions, and
making predictions based on the data.

Q2) Can you explain the normal distribution and its significance?

A2) The normal distribution, also known as the bell curve, is symmetric and centered around the
mean. It's significant because many real-world phenomena follow a normal distribution, and it forms
the basis of statistical inference due to the central limit theorem.

Q3) What are skewed distributions, and how do they differ from normal distributions?

A3) Skewed distributions have one tail longer than the other, indicating that data is not symmetrically
distributed. Unlike normal distributions, where the mean, median, and mode coincide, in skewed
distributions, these measures of central tendency differ.

Q4) How does the shape of a probability curve impact data analysis?

A4) The shape of a probability curve affects the choice of statistical methods and models. For
example, normally distributed data allows for parametric tests, while skewed or non-normal
distributions may require non-parametric tests or data transformation.

Q5) What is the role of the central limit theorem in probability curves?

A5) The central limit theorem states that the sum or average of a large number of independent
random variables tends to follow a normal distribution, regardless of the original distribution. This
theorem underpins many statistical techniques and allows the use of normal distribution-based
models in diverse situations.
Experiment – 3

Aim: To perform Chi-square test on various datasets.

Theory: Chi-square tests in data science are used to determine the relationship between categorical
variables and assess the goodness of fit or independence within a dataset. By comparing observed
data with expected outcomes, the test evaluates whether deviations are due to chance or a significant
association. In a goodness-of-fit test, chi-square determines how well an observed distribution
matches a theoretical one, while in independence tests, it checks if two categorical variables are
related. The chi-square statistic is calculated by summing the squared differences between observed
and expected values, divided by the expected values, providing insight into data
dependencies and patterns.
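For the goodness-of-fit form described above, SciPy also offers scipy.stats.chisquare, which compares
observed and expected counts directly; a minimal sketch (using the same illustrative counts as the
Source Code below) might look like this:

from scipy.stats import chisquare

# Observed and expected counts (same illustrative values as below)
observed = [50, 30, 20]
expected = [40, 40, 20]

# Goodness-of-fit test: the totals of observed and expected counts must match
chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi2 Statistic: {chi2}, p-value: {p}")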

Source Code:

import pandas as pd
from scipy.stats import chi2_contingency

# Create a table of observed and expected counts
data = {'Observed': [50, 30, 20], 'Expected': [40, 40, 20]}
df = pd.DataFrame(data)

# Perform chi-square test, treating the two count rows as a 2x3 contingency table
chi2, p, dof, expected = chi2_contingency([df['Observed'], df['Expected']])
print(f"Chi2 Statistic: {chi2}, p-value: {p}")

Output:

Viva-Voce:

Q1) What is a chi-square test, and when is it used in data science?

A1) A chi-square test assesses the association between categorical variables or the goodness of fit
between observed and expected frequencies. It's used to determine if there is a significant difference
between expected and observed data, helping in hypothesis testing and evaluating model
performance.

Q2) Explain the difference between the chi-square test of independence and the chi-square
goodness-of-fit test.

A2) The chi-square test of independence examines whether two categorical variables are related or
independent, using a contingency table. The chi-square goodness-of-fit test compares observed data
against a theoretical distribution to see if the data follows the expected distribution.

Q3) How is the chi-square statistic calculated?

A3) The chi-square statistic is calculated by summing the squared differences between observed and
expected frequencies, divided by the expected frequencies:

χ²c = Σ (Oᵢ − Eᵢ)² / Eᵢ

Where,

c = Degrees of freedom
O = Observed Value
E = Expected Value
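As an illustration of the formula (taken as a goodness-of-fit calculation, separate from the
contingency-table call in the Source Code above), the counts Observed = [50, 30, 20] and
Expected = [40, 40, 20] give:

χ² = (50 − 40)²/40 + (30 − 40)²/40 + (20 − 20)²/20 = 2.5 + 2.5 + 0 = 5.0

with 3 − 1 = 2 degrees of freedom.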

Q4) What are the assumptions of the chi-square test?

A4) The main assumptions are that the data should be categorical, the observations should be
independent, and the expected frequency in each cell of the contingency table should be at least 5 for
the test to be valid.

Q5) How do you interpret the results of a chi-square test?

A5) The results are interpreted by comparing the chi-square statistic to a critical value from the
chi-square distribution table or by looking at the p-value. A significant p-value (typically <0.05)
indicates that there is a significant difference between observed and expected frequencies,
suggesting a relationship or discrepancy in the data.
Experiment – 4

Aim: To use Python as a programming tool for the analysis of data structures.

Theory: Python is a powerful programming tool for analyzing data structures due to its simplicity,
flexibility, and vast ecosystem of libraries. It supports a variety of built-in data structures like lists,
dictionaries, tuples, and sets, allowing for efficient data manipulation. Libraries such as NumPy and
pandas enhance Python's capabilities by providing specialized data structures like arrays and
DataFrames for handling large datasets. Python’s rich collection of algorithms and functions helps in
sorting, searching, and transforming data, making it a preferred choice for tasks like data analysis,
machine learning, and algorithm design, all while ensuring readable and maintainable code.
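A small sketch of these built-in structures in an analysis context follows (the score values are only
illustrative); the experiment's Source Code then focuses on NumPy arrays:

# Built-in structures for a small analysis task (illustrative values)
scores = [88, 92, 75, 92, 60, 75]          # list: ordered, mutable sequence
unique_scores = set(scores)                # set: duplicates removed
score_range = (min(scores), max(scores))   # tuple: fixed (min, max) pair

# dict: frequency count of each score
counts = {}
for s in scores:
    counts[s] = counts.get(s, 0) + 1

# Sorting, searching, and filtering with built-in functions
print("Unique:", unique_scores)
print("Range:", score_range)
print("Counts:", counts)
print("Sorted:", sorted(scores, reverse=True))
print("Above 80:", [s for s in scores if s > 80])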

Source Code:

import numpy as np

# Example of array manipulation using NumPy
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# View array
print("Array:\n", array)

# Sum of all elements
print("Sum of elements:", np.sum(array))

# Transpose of the array
print("Transpose:\n", np.transpose(array))

Output:
Viva-Voce:

Q1) Why is Python a popular choice for data structure analysis in data science?

A1) Python is popular due to its readability, extensive libraries (such as NumPy, pandas, and SciPy),
and ease of integration with other tools. Its versatile data structures (lists, dictionaries, sets, tuples)
and powerful data manipulation capabilities make it well-suited for analyzing and managing complex
datasets.

Q2) What role do libraries like NumPy and pandas play in data structure analysis?

A2) NumPy provides support for numerical operations with its array object, enabling efficient handling
of large datasets and mathematical computations. Pandas offers data structures like DataFrames and
Series, which simplify data manipulation, analysis, and cleaning, making it easier to work with tabular
data.

Q3) How do Python’s built-in data structures compare to those provided by libraries like pandas?

A3) Python’s built-in data structures (lists, dictionaries, tuples, sets) are versatile but may lack
efficiency for large-scale data operations. Pandas provides specialized data structures like
DataFrames and Series designed for handling large datasets with functionalities for data alignment,
indexing, and complex operations that are more efficient and user-friendly for data analysis.

Q4) What are the advantages of using Python for handling and analyzing large datasets?

A4) Python offers advantages such as its powerful libraries (e.g., NumPy for numerical operations,
pandas for data manipulation), scalability through integration with big data tools, and a wide range of
visualization libraries (e.g., Matplotlib, Seaborn). These tools facilitate efficient data handling, analysis,
and visualization, making Python a robust choice for large datasets.

Q5) Can you describe a common workflow for data analysis using Python?

A5) A common workflow, sketched in code after this list, includes:

o Data Collection: Importing data from various sources (CSV, databases, APIs) using
libraries like pandas.

o Data Cleaning: Handling missing values, outliers, and inconsistencies using pandas.

o Data Transformation: Reshaping data, merging datasets, and feature engineering.

o Analysis: Performing statistical analysis and model building using libraries like
NumPy, pandas, and SciPy.

o Visualization: Creating plots and charts with Matplotlib or Seaborn to interpret and
present the findings.
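A compact sketch of this workflow is given below; the file name sales.csv and the columns revenue and
units are hypothetical placeholders:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('sales.csv')                          # Data Collection (hypothetical file)
df = df.dropna(subset=['revenue'])                     # Data Cleaning
df['revenue_per_unit'] = df['revenue'] / df['units']   # Data Transformation
print(df.describe())                                   # Analysis
df['revenue'].hist()                                   # Visualization
plt.title('Revenue Distribution')
plt.show()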
Experiment – 5

Aim: To perform various operations such as data storage, analysis, and visualization

Theory: Data storage, analysis, and visualization are fundamental aspects of data science, each
playing a crucial role in deriving insights from data.

Data Storage: Efficient data storage solutions, such as databases (SQL, NoSQL), data lakes, and
cloud storage systems, ensure data is well-organized, accessible, and scalable for future analysis.
Tools like MySQL, MongoDB, and AWS S3 are commonly used.

Data Analysis: Python, R, and SQL offer powerful methods to explore, clean, and manipulate data.
Libraries like pandas, NumPy, and SciPy allow for complex analysis, statistical modeling, and
handling large datasets.

Data Visualization: Visualization tools like Matplotlib, Seaborn, and Tableau are essential for
presenting data trends, distributions, and insights in graphical formats. Visualizations simplify complex
data, enabling better decision-making.

Source Code:

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = pd.read_csv('organizations.csv')

# Data storage: Save the dataset to a new file
data.to_csv('new_dataset.csv', index=False)

# Data analysis: Describe the dataset
print(data.describe())

# Data visualization: Plot a histogram
data['Number of employees'].hist()
plt.title('Histogram of Number of employees')
plt.show()

Output:
Viva-Voce:

Q1) What are the common methods for data storage in data science?

A1) Data can be stored in various formats, such as relational databases (SQL), NoSQL databases
(e.g., MongoDB), flat files (CSV, JSON), or cloud storage (AWS S3, Google Cloud). These methods
depend on factors like the type of data, the need for scalability, and query performance requirements.

Q2) How is data analysis performed in data science, and which tools are commonly used?

A2) Data analysis involves exploring, cleaning, and modeling data to extract insights. Common tools
include Python libraries like pandas and NumPy for handling data, R for statistical analysis, and SQL
for querying structured data. These tools allow for operations like filtering, aggregating, and visualizing
data.

Q3) What is the role of cloud storage in data science, and how does it benefit data analysis?

A3) Cloud storage (e.g., AWS S3, Google Cloud Storage) enables scalable, accessible, and
cost-effective data management. It allows data scientists to store vast amounts of data without
worrying about physical infrastructure, facilitating collaboration and integration with cloud-based
analysis tools for large-scale processing.

Q4) What are the key differences between descriptive and inferential data analysis?

A4) Descriptive analysis summarizes data with measures like mean, median, and standard deviation,
providing an overview of the dataset. Inferential analysis goes beyond the data at hand to make
predictions or inferences about a larger population based on a sample, often using hypothesis testing
or predictive modeling.
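As a small illustration of this distinction (with assumed sample values), the sketch below summarizes a
sample descriptively and then uses a one-sample t-test from SciPy for an inferential statement about the
population mean:

import numpy as np
from scipy import stats

# Assumed sample values
sample = np.array([52, 48, 50, 47, 53, 49, 51, 50])

# Descriptive: summarize the sample itself
print("mean:", sample.mean(), "std:", sample.std(ddof=1))

# Inferential: test whether the population mean could plausibly be 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print("t-statistic:", t_stat, "p-value:", p_value)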

Q5) How do tools like Matplotlib and Seaborn assist in data visualization?

A5) Matplotlib and Seaborn are Python libraries used to create static, animated, or interactive plots.
Matplotlib provides detailed control over plot elements, while Seaborn simplifies the creation of complex
statistical visualizations.
Experiment – 6

Aim: To perform descriptive statistics analysis and data visualization.

Theory: Descriptive statistical analysis and data visualization are key techniques in summarizing and
interpreting data in data science.

Descriptive Statistical Analysis: This method involves summarizing data using measures like mean,
median, mode, standard deviation, and variance. It helps in understanding the central tendency,
spread, and overall distribution of data, offering insights without making predictions. It’s the foundation
for understanding the dataset’s basic characteristics.

Data Visualization: Complementing descriptive statistics, data visualization tools like Matplotlib,
Seaborn, and Power BI create graphical representations (histograms, box plots, bar charts) of these
summaries, making patterns, trends, and outliers easily understandable, aiding in effective
communication of data insights.

Source Code:

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = pd.read_csv('organizations.csv')

# Descriptive statistics
print(data.describe())

# Data visualization: Box plot of a column
data.boxplot(column='Number of employees')
plt.title('Box Plot of Number of employees')
plt.show()

Output:
Viva-Voce:

Q1) What is descriptive statistical analysis, and what are its main components?

A1) Descriptive statistical analysis involves summarizing and describing the main features of a
dataset using measures such as mean, median, mode, standard deviation, and range. It provides
insights into the central tendency, dispersion, and overall distribution of data without making
predictions or generalizations.

Q2) How do measures of central tendency differ from measures of dispersion in descriptive statistics?

A2) Measures of central tendency (mean, median, mode) describe the center or typical value of a
dataset. Measures of dispersion (standard deviation, variance, range) describe the spread or
variability around the central value. Together, they provide a comprehensive summary of the dataset's
characteristics.

Q3) Why is data visualization important in the context of descriptive statistics?

A3) Data visualization is crucial as it helps to communicate the insights gained from descriptive
statistics in a clear and intuitive manner. Visual tools like histograms, box plots, and scatter plots
make it easier to identify patterns, trends, and anomalies in the data, facilitating better understanding
and decision-making.

Q4) Can you explain the role of histograms and box plots in visualizing descriptive statistics?

A4) Histograms display the distribution of numerical data by showing the frequency of data points
within specified ranges or bins, revealing the shape and spread of the data. Box plots, on the other
hand, provide a visual summary of data distribution through quartiles, highlighting the median, spread,
and potential outliers.

Q5) How can descriptive statistical measures be used to identify data quality issues?

A5) Descriptive statistical measures can reveal data quality issues by highlighting inconsistencies,
such as unusual values or outliers. For example, an unusually high standard deviation might indicate
data entry errors, while skewed distributions can signal data problems or imbalances that need
addressing before further analysis.
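One common way to turn this idea into code is the 1.5 × IQR rule; the sketch below uses assumed values,
with the unusually large 120 standing in for a possible data-entry error, and flags candidates for review:

import pandas as pd

# Assumed values; 120 stands in for a possible data-entry error
values = pd.Series([12, 15, 14, 13, 16, 15, 14, 120])

# Flag potential outliers with the 1.5 * IQR rule
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)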
Experiment – 7

Aim: To perform Principal Component Analysis on datasets.

Theory: Component analysis in datasets refers to techniques that reduce the dimensionality of data
while retaining essential patterns and variance.

Principal Component Analysis (PCA): PCA is a widely used method in component analysis,
transforming a dataset into a set of linearly uncorrelated components. It reduces the number of
variables by identifying the directions (principal components) that capture the most variance in the
data, making it easier to analyze and visualize complex, high-dimensional datasets.

Visualization: PCA and other component analysis techniques can be visualized through 2D or 3D
plots, helping data scientists interpret data structure, identify patterns, and detect relationships
between variables efficiently.

Source Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# importing or loading the dataset
dataset = pd.read_csv('wine.csv')

# distributing the dataset into two components X and Y
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Splitting the X and Y into the Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# performing preprocessing part
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying PCA function on training and testing set of X component
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the test set result using predict function under LogisticRegression
y_pred = classifier.predict(X_test)

# making confusion matrix between test set of Y and predicted value
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the training set results through scatter plot
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))

plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('yellow', 'white', 'aquamarine')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green', 'blue'))(i), label=j)

plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')  # for Xlabel
plt.ylabel('PC2')  # for Ylabel
plt.legend()  # to show legend

# show scatter plot
plt.show()

# Visualising the Test set results through scatter plot
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))

plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('yellow', 'white', 'aquamarine')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green', 'blue'))(i), label=j)

# title for scatter plot
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')  # for Xlabel
plt.ylabel('PC2')  # for Ylabel
plt.legend()

# show scatter plot
plt.show()
Output:
Viva-Voce:

Q1) What is component analysis, and what are its main objectives?

A1) Component analysis refers to techniques used to reduce the dimensionality of datasets while
retaining important information. Its main objectives are to simplify data, identify underlying patterns,
and make complex datasets easier to visualize and analyze. Common techniques include Principal
Component Analysis (PCA) and Factor Analysis.

Q2) How does Principal Component Analysis (PCA) work, and what is its purpose?

A2) PCA works by transforming the original dataset into a new set of orthogonal (uncorrelated)
components called principal components. These components are ordered by the amount of variance
they explain in the data. The purpose of PCA is to reduce the dimensionality of the data while
preserving as much variance as possible, making it easier to analyze and visualize.

Q3) What is the significance of the explained variance in PCA?

A3) Explained variance indicates the proportion of the total variance in the dataset that is captured by
each principal component. It helps in understanding how much information each component retains
and guides the selection of a subset of components that can effectively represent the original data.

Q4) How can you interpret the results of a PCA analysis?

A4) The results of PCA can be interpreted by examining the principal components' loadings, which
show the contribution of each original variable to the components. The explained variance plot (scree
plot) helps determine the number of components to retain. Visualizing the data projected onto the
principal components can reveal patterns and relationships.
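A minimal sketch of such a cumulative explained-variance (scree-style) plot is shown below; it uses
randomly generated data rather than the wine dataset from the Source Code, so the shapes and values are
illustrative only:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Illustrative data: 150 samples, 6 standardized features
X = np.random.default_rng(0).normal(size=(150, 6))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Cumulative variance explained as components are added
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Cumulative Explained Variance (Scree-style Plot)')
plt.show()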

Q5) What are some potential limitations of PCA?

A5) PCA assumes linear relationships between variables and may not capture complex, non-linear
patterns. It also requires careful interpretation, as principal components are combinations of original
variables, which may not always have a straightforward or meaningful interpretation. Additionally, PCA
is sensitive to scaling, so data normalization is often necessary.
Experiment – 8

Aim: To perform linear regression on datasets.

Theory: Linear regression is a fundamental technique in data science for modeling the relationship
between a dependent variable and one or more independent variables.

Linear Regression: This method fits a straight line (regression line) through the dataset that best
represents the relationship between variables. The equation of the line,

𝑦 = 𝑚𝑥 + 𝑏

shows how changes in the independent variable(s) predict changes in the dependent variable. It's used
for prediction, trend analysis, and forecasting.

Visualization: The regression line is often visualized on a scatter plot, with data points and the fitted
line providing a clear representation of the correlation, making it easier to assess the model’s
accuracy and goodness of fit.

Source Code:

import numpy as np
from sklearn.linear_model import LinearRegression

x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)

model = LinearRegression().fit(x, y)

r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")

y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")

Output:
Viva-Voce:

Q1) What is linear regression, and how is it used in data analysis?

A1) Linear regression is a statistical technique used to model the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to observed data. It is used
to predict the value of the dependent variable based on the values of the independent variables and to
identify trends and relationships in the data.

Q2) How do you interpret the coefficients of a linear regression model?

A2) In linear regression, the coefficients represent the change in the dependent variable for a one-unit
change in the independent variable, holding all other variables constant. A positive coefficient
indicates a direct relationship, while a negative coefficient suggests an inverse relationship.

Q3) What is the purpose of the R-squared value in linear regression?

A3) The R-squared value measures the proportion of the variance in the dependent variable that is
explained by the independent variables in the model. It provides an indication of how well the model
fits the data, with higher values indicating a better fit.

Q4) What are some common assumptions of linear regression that should be checked?

A4) Common assumptions include linearity (the relationship between variables is linear),
independence (residuals are independent), homoscedasticity (constant variance of residuals), and
normality (residuals are normally distributed). Checking these assumptions ensures the validity of the
regression model and its predictions.

Q5) How can you assess the quality of a linear regression model beyond R-squared?

A5) Besides R-squared, model quality can be assessed using metrics like Mean Absolute Error
(MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to evaluate prediction
accuracy. Residual plots and diagnostic tests can also be used to check for violations of model
assumptions and to identify any patterns that might suggest model improvements.
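A short sketch of these error metrics, reusing the same x and y arrays as the Source Code above together
with sklearn.metrics, might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Same data as the Source Code above
x = np.array([[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]])
y = np.array([4, 5, 20, 14, 32, 22, 38, 43])

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

# Error metrics alongside R-squared
print("MAE :", mean_absolute_error(y, y_pred))
print("MSE :", mean_squared_error(y, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y, y_pred)))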
Experiment – 9

Aim: To perform Data Aggregation and GroupWise Operations.

Theory: Data aggregation and groupwise operations are essential techniques in data science for
summarizing and analyzing data based on specific groups or categories.

Data Aggregation: This process involves combining data from multiple records to calculate summary
statistics like sums, averages, counts, or maximum/minimum values. It helps in simplifying large
datasets to gain meaningful insights, often using functions such as groupby() in pandas for Python.

Groupwise Operations: These operations allow for applying functions or transformations to subsets
of data that share common attributes. For example, grouping by a categorical variable (like region or
product) and performing operations on each group helps uncover trends, patterns, and relationships
that may vary across different groups.

Visualization: Groupwise data can be effectively visualized using bar charts, pie charts, or box plots
to compare groups and highlight variations or similarities, making it easier to communicate
insights from the data.

Source Code:

# import module
import pandas as pd

# Creating our dataset
df = pd.DataFrame([[9, 4, 8, 9], [8, 10, 7, 6], [7, 6, 8, 5]],
                  columns=['Maths', 'English', 'Science', 'History'])

# display dataset and aggregate summaries
print(df)
print(df.sum())
print(df.describe())
print(df.agg(['sum', 'min', 'max']))

# group-wise operations: first row of each group
a = df.groupby('Maths')
print(a.first())
b = df.groupby(['Maths', 'Science'])
print(b.first())

Output:
Viva-Voce:

Q1) What is data aggregation, and why is it important in data analysis?

A1) Data aggregation involves summarizing and combining data from multiple records or sources to
compute aggregate values such as sums, averages, or counts. It is important because it helps in
simplifying large datasets, providing meaningful summaries and insights, and facilitating comparative
analysis.

Q2) How do you perform groupwise operations in Python using pandas?

A2) In Python, pandas provides the groupby() function to perform groupwise operations. By grouping
data based on one or more columns, you can apply aggregate functions (like sum, mean, count) or
transformations to each group. For example, df.groupby('column').mean() calculates the mean of each
group in the specified column.

Q3) What are some common aggregation functions used in data analysis?

A3) Common aggregation functions include sum (total value), mean (average value), median (middle
value), count (number of entries), min (minimum value), and max (maximum value). These functions
help in summarizing the data and extracting key metrics.

Q4) How can you handle missing values during aggregation?

A4) During aggregation, missing values can be handled by using functions like fillna() to impute
missing values or by choosing to ignore them with options like dropna(). Aggregation functions often
have parameters to handle missing values, such as skipna=True in pandas, which excludes NaNs
from calculations.

Q5) What is the difference between aggregation and transformation in groupwise operations?

A5) Aggregation involves computing summary statistics for each group, such as totals or averages, and
returning a reduced dataset with one result per group. Transformation, on the other hand, applies a
function to each group but returns a result with the same shape as the original data (for example,
standardizing or ranking values within their group), so no rows are collapsed.
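A minimal sketch of the difference, assuming a small hypothetical sales table, is shown below: agg()
collapses each group to one row, while transform() keeps the original shape:

import pandas as pd

# Hypothetical sales data
df = pd.DataFrame({'region': ['North', 'North', 'South', 'South'],
                   'sales': [100, 150, 80, 120]})

# Aggregation: one summary row per group
print(df.groupby('region')['sales'].agg(['sum', 'mean']))

# Transformation: same number of rows as the input;
# each sale expressed as a share of its region's total
df['share_of_region'] = df['sales'] / df.groupby('region')['sales'].transform('sum')
print(df)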
