
UNIT-4 (1)

The document provides an overview of key concepts in NumPy and Pandas, including reshaping arrays, differences between iloc and loc indexers, and methods for installing Python libraries. It also discusses data manipulation techniques in NumPy, importing data from CSV files into Pandas, and the attributes of Pandas Series. Additionally, it highlights how NumPy and Pandas can be integrated in a data analysis workflow.

Uploaded by

byjuslearn874
Copyright
© © All Rights Reserved

UNIT-4

1. How to reshape arrays in NumPy. What happens when you use the reshape() function with
a -1 parameter?

In NumPy, the reshape() function is used to change the shape (or dimensions) of an array without
changing its data.

Syntax:

numpy.reshape(a, newshape)

Or, when calling it as a method on an existing NumPy array:

a.reshape(newshape)

Example:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

b = a.reshape((2, 3)) # 2 rows, 3 columns

print(b)

Output:

[[1 2 3]

[4 5 6]]

What happens when you use -1 in reshape()?

The -1 is a special placeholder in reshape() that tells NumPy to infer the correct dimension
automatically based on the original array's size.

📘 Example:

a = np.array([1, 2, 3, 4, 5, 6])

b = a.reshape((3, -1)) # Let NumPy determine the number of columns

print(b)

Output:

[[1 2]
 [3 4]
 [5 6]]
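
The behaviour of -1 in a few more situations can be sketched as follows, assuming a 12-element array (the variable names are illustrative only). Only one dimension may be -1, and the total size must be divisible by the product of the other dimensions; otherwise NumPy raises a ValueError.

```python
import numpy as np

a = np.arange(12)            # 12 elements: 0..11

b = a.reshape(3, -1)         # NumPy infers 12 / 3 = 4 columns
c = a.reshape(2, -1, 3)      # inferred middle dimension: 12 / (2 * 3) = 2
flat = b.reshape(-1)         # a single -1 flattens the array to 1-D

print(b.shape, c.shape, flat.shape)

# 12 is not divisible by 5, so no valid shape exists
try:
    a.reshape(5, -1)
except ValueError as exc:
    print("reshape failed:", exc)
```

A common use is a.reshape(-1, 1), which turns a 1-D array into a single column without having to know its length in advance.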

2. Difference between iloc and loc indexers in Pandas

Feature       iloc                           loc

Stands for    Integer location               Label location

Index type    Integer-based (positional)     Label-based (index names)

Syntax        df.iloc[row_idx, col_idx]      df.loc[row_label, col_label]

Inclusive?    Excludes the end for slicing   Includes the end for slicing
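
The slicing difference is easiest to see side by side. The following is a minimal sketch using a small made-up DataFrame (the column name and labels are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

by_position = df.iloc[0:2]    # positions 0 and 1 -> rows 'a' and 'b'
by_label = df.loc["a":"c"]    # labels 'a' through 'c' inclusive -> rows 'a', 'b', 'c'

print(by_position)
print(by_label)

# With integer labels the two indexers can return different rows
df2 = pd.DataFrame({"score": [10, 20, 30]}, index=[2, 1, 0])
print(df2.iloc[0, 0])         # first row by position
print(df2.loc[0, "score"])    # row whose label is 0 (the last row here)
```

When the index consists of integers that are not in the default 0..n-1 order, choosing the wrong indexer silently selects a different row, so it helps to be explicit about which one is intended.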

3. Explain the process of installing Python libraries using different methods. Compare pip,
conda, and manual installation, highlighting their advantages and limitations.

Python supports multiple ways to install libraries. The three main methods are:

Method Description

pip The default Python package manager

conda A package/environment manager provided by Anaconda

Manual installation Directly downloading and installing packages

🔹 1. pip (Python Package Installer)


pip is Python’s official package manager; by default it downloads packages from PyPI (the Python Package Index).
Installation:
pip install package_name
Advantages:
• Comes with Python 3.4+
• Wide support (hundreds of thousands of packages on PyPI)
• Lightweight and fast
• Works in virtual environments (venv)
⚠️ Limitations:
• Dependency conflicts can occur
• Doesn’t handle non-Python dependencies well (e.g., system libraries like OpenCV's C++ components)

🔹 2. conda (Anaconda Package Manager)


conda is a package and environment manager that ships with the Anaconda distribution (and with the smaller Miniconda installer).
Installation:
conda install package_name
Advantages:
• Handles both Python and non-Python packages (e.g., NumPy with C optimizations)
• Manages isolated environments easily
• Good for data science and machine learning setups
Limitations:
• Heavier footprint (the full Anaconda distribution is ~3GB)
• Smaller package repository than PyPI
• Packages may lag behind PyPI in updates

🔹 3. Manual Installation
Downloading the source code, a .whl (wheel) file, or a .tar.gz archive and installing it manually.
Steps:
# Install from a downloaded wheel file
pip install /path/to/package.whl

# OR clone from GitHub and install from source
git clone https://github.com/author/project.git
cd project
pip install .   # preferred over the older "python setup.py install", which is deprecated
Advantages:
• Full control over version and build
• Useful for unreleased or custom packages
⚠️ Limitations:
• Requires more technical knowledge
• Dependency resolution is manual
• Risk of incompatible builds
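
Whichever method is used, the result can be checked from inside Python. The sketch below uses the standard-library importlib.metadata module (Python 3.8+); the helper name installed_version and the package names in the loop are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report what the current environment actually contains
for name in ("numpy", "pandas", "definitely-not-installed-example"):
    print(name, "->", installed_version(name))
```

Running this inside each virtual or conda environment shows which versions that environment sees, which is useful when pip and conda installs are mixed.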

4. Compare and contrast the primary data structures in Pandas: Series and DataFrame

Feature Series DataFrame

Dimensionality 1D 2D

Index type Single axis (index) Two axes (rows and columns)

Data type Homogeneous Heterogeneous (multiple dtypes)

Example Column of data Full table with rows & columns

Usage example Time series, scores Structured datasets, CSVs, DBs

Example:
import pandas as pd

s1 = pd.Series([85, 78, 92], index=['Math', 'English', 'Science'])

s2 = pd.Series([90, 88, 80], index=['Math', 'English', 'Science'])

df = pd.DataFrame({'Student1': s1, 'Student2': s2}).T

df['Average'] = df.mean(axis=1)

print(df)

Output:

          Math  English  Science  Average
Student1    85       78       92     85.0
Student2    90       88       80     86.0

5. Explain the process of manipulating array shapes in NumPy. Discuss transpose operations,
reshaping, stacking, and splitting arrays with appropriate examples.

NumPy provides powerful tools to change the shape or structure of arrays for mathematical and data
operations.

🔹 1. Reshaping Arrays

Purpose: Change the shape (dimensions) of an array without changing the data.

import numpy as np

a = np.arange(6) # [0, 1, 2, 3, 4, 5]

b = a.reshape((2, 3)) # Reshape to 2 rows, 3 columns

print(b)

Output:

[[0 1 2]

[3 4 5]]

🔹 2. Transposing Arrays

Purpose: Flip rows and columns (useful in linear algebra and image processing).
a = np.array([[1, 2], [3, 4]])

print(a.T) # or np.transpose(a)

Output:

[[1 3]

[2 4]]

🔹 3. Stacking Arrays

Purpose: Combine multiple arrays into one.

• Vertical Stack (vstack) – Stack arrays row-wise (like adding more rows):

a = np.array([[1, 2], [3, 4]])

b = np.array([[5, 6]])

print(np.vstack((a, b)))

• Horizontal Stack (hstack) – Stack arrays column-wise:

c = np.array([[7], [8]])

print(np.hstack((a, c)))

🔹 4. Splitting Arrays

Purpose: Divide arrays into multiple sub-arrays.

a = np.array([[1, 2, 3], [4, 5, 6]])

print(np.hsplit(a, 3))

print(np.vsplit(a, 2))
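
The stacking and splitting calls above print nested arrays, so it can help to look at the resulting shapes instead. This is a small sketch reusing the same arrays; np.array_split is added as an extra illustration because np.hsplit and np.vsplit require the axis to divide evenly.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])      # shape (2, 2)
b = np.array([[5, 6]])              # shape (1, 2)
c = np.array([[7], [8]])            # shape (2, 1)

v = np.vstack((a, b))               # rows are appended    -> shape (3, 2)
h = np.hstack((a, c))               # columns are appended -> shape (2, 3)

m = np.array([[1, 2, 3], [4, 5, 6]])
cols = np.hsplit(m, 3)              # list of 3 arrays, each of shape (2, 1)
rows = np.vsplit(m, 2)              # list of 2 arrays, each of shape (1, 3)

# array_split tolerates uneven division: 7 elements into 3 parts
parts = np.array_split(np.arange(7), 3)

print(v.shape, h.shape)
print([x.shape for x in cols], [x.shape for x in rows])
print([len(p) for p in parts])
```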

6. Explain the process of importing data from CSV files into Pandas DataFrames. Discuss
various parameters that can be used to handle different CSV formats, missing values, and
data types.

CSV (Comma-Separated Values) is a common format for datasets. Pandas makes it easy to import
them.
Basic import:

import pandas as pd

df = pd.read_csv("students.csv")

print(df.head())

Key parameters:

Parameter    Purpose

sep          Delimiter (default is comma)

header       Row number(s) to use as column names

names        Provide column names manually

index_col    Column to set as index

usecols      Load only specific columns

dtype        Specify data types

na_values    Define custom missing value representations

skiprows     Skip rows at the start

nrows        Read only N rows

encoding     Handle different file encodings (like utf-8, latin1)

Examples:

1. Load a CSV with a custom delimiter:

df = pd.read_csv("data.csv", sep=";")

2. Handle missing values and data types:

df = pd.read_csv("data.csv", na_values=["NA", "n/a"], dtype={"Age": "Int64"})  # nullable "Int64": a plain int column cannot hold missing values

3. Use a specific column as index:

df = pd.read_csv("data.csv", index_col="StudentID")

4. Read specific columns and skip rows:

df = pd.read_csv("data.csv", usecols=["Name", "Score"], skiprows=[1])  # skiprows=[1] drops the first data row; skiprows=1 would drop the header line instead
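
Several of these parameters can be combined in one call. The sketch below reads from an in-memory string via io.StringIO so that it runs without a file on disk; the column names and values are invented for illustration.

```python
import io

import pandas as pd

# A small semicolon-delimited file with two kinds of missing-value markers
raw = (
    "StudentID;Name;Age;Score\n"
    "1;Asha;20;88.5\n"
    "2;Ben;n/a;NA\n"
    "3;Chen;22;91.0\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                       # custom delimiter
    na_values=["NA", "n/a"],       # strings to treat as missing
    dtype={"Age": "Int64"},        # nullable integer type, can hold missing values
    index_col="StudentID",         # use this column as the row index
)

print(df)
print(df.dtypes)
print(df.isna().sum())             # missing values per column
```

With a real file, io.StringIO(raw) would simply be replaced by the file path.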

7. How array slicing works in NumPy. Provide an example

Array slicing in NumPy works similarly to Python list slicing, but it's more powerful because it
supports multi-dimensional arrays.

🔹 Syntax:

array[start:stop:step]

• start: starting index (inclusive)

• stop: ending index (exclusive)

• step: stride (optional)

✅ 1D Example:

import numpy as np

a = np.array([10, 20, 30, 40, 50])

print(a[1:4])

# Elements at index 1, 2, 3 → [20, 30, 40]

2D Example:

b = np.array([

[1, 2, 3],

[4, 5, 6],

[7, 8, 9]

])

print(b[:2, :2])

Output:

[[1 2]
 [4 5]]
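
One practical difference from list slicing is that a basic NumPy slice is a view on the original data, not a copy. The following sketch (variable names are illustrative) shows steps, column selection, and the view behaviour:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
every_other = a[::2]          # step of 2 -> elements at index 0, 2, 4
reversed_a = a[::-1]          # negative step walks backwards

b = np.arange(1, 10).reshape(3, 3)
middle_column = b[:, 1]       # all rows, column 1
last_two_rows = b[-2:, :]     # negative start counts from the end

# A slice is a view: writing through it changes the original array
window = a[1:4]
window[0] = 99
print(a)                      # a[1] has been changed through the view

# .copy() gives independent data
safe = a[1:4].copy()
safe[0] = -1
print(a)                      # unchanged by the write to the copy
```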

8. Write two methods to import data from a CSV file into a Pandas DataFrame

Method 1: Using pd.read_csv()

This is the most common and recommended way.

import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())

Method 2: Using csv module with Pandas DataFrame

This is useful when you want more control while reading the file manually.

import csv
import pandas as pd

with open("data.csv", newline='') as file:
    reader = csv.reader(file)
    data = list(reader)

# Convert to DataFrame manually (assumes first row is header)
df = pd.DataFrame(data[1:], columns=data[0])
print(df.head())
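
One consequence of Method 2 is that csv.reader returns every field as a string, so numeric columns need an explicit conversion, whereas pd.read_csv() infers types automatically. A minimal sketch, using an in-memory string in place of a file (the data is invented for illustration):

```python
import csv
import io

import pandas as pd

raw = "Name,Score\nAsha,88\nBen,92\n"

rows = list(csv.reader(io.StringIO(raw)))
df = pd.DataFrame(rows[1:], columns=rows[0])

print(type(rows[1][1]))                    # each field arrives as a Python str

# Convert the column explicitly before doing arithmetic on it
df["Score"] = pd.to_numeric(df["Score"])
print(df["Score"].sum())
```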

9. Identify the Python library commonly used for solving differential equations numerically

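SciPy is the library commonly used for this: its scipy.integrate module provides solve_ivp() for initial value problems (as well as the older odeint()).

Example: solve the ODE dy/dt = −2y with y(0) = 1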

import numpy as np

from scipy.integrate import solve_ivp

import matplotlib.pyplot as plt

def dydt(t, y):
    return -2*y

y0 = [1]

t_span = (0, 5)

t_eval = np.linspace(*t_span, 100)


solution = solve_ivp(dydt, t_span, y0, t_eval=t_eval)

plt.plot(solution.t, solution.y[0])

plt.xlabel("Time")

plt.ylabel("y(t)")

plt.title("Solution of dy/dt = -2y")

plt.grid()

plt.show()
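
For reference, this particular equation has a closed-form solution, which gives a quick sanity check on the numerical result. The derivation below is a worked example added for that purpose, using the same initial condition y(0) = 1 as the code:

```latex
\frac{dy}{dt} = -2y,\qquad y(0) = 1
\;\Longrightarrow\;
\int \frac{dy}{y} = -2 \int dt
\;\Longrightarrow\;
\ln y = -2t + C
\;\Longrightarrow\;
y(t) = e^{-2t} \quad (C = 0 \text{ from } y(0) = 1)
```

The plotted curve should therefore decay from 1 at t = 0 towards e^{-10} ≈ 4.5 × 10^{-5} at t = 5.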

10. Describe the function in Matplotlib used to plot a graph

Function: plot()

• Belongs to: matplotlib.pyplot

• Used for 2D line plots.

🔹 Basic Example:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]

y = [2, 4, 1, 8, 7]

plt.plot(x, y) # Plot the line

plt.title("Simple Line Plot")

plt.xlabel("X-axis")

plt.ylabel("Y-axis")

plt.grid(True)

plt.show()

🧠 Other Useful Plot Functions:

Function     Purpose

scatter()    Scatter plots

bar()        Bar charts

hist()       Histograms

imshow()     Display images or 2D arrays

subplot()    Multiple plot layout

11. Describe the various ways to create NumPy arrays from Python lists, ranges, and using
built-in functions. Explain the significance of the dtype parameter when creating arrays.

NumPy arrays can be created in multiple ways, primarily from Python sequences, iterables, and built-
in NumPy functions. Here’s an explanation of each method, along with the role of the dtype
parameter:

A. Creating Arrays from Python Lists

You can convert Python lists (or nested lists) into NumPy arrays using numpy.array().

Example:

import numpy as np

list1 = [1, 2, 3, 4]

arr1 = np.array(list1)

print(arr1)

Output:

[1 2 3 4]

For multi-dimensional arrays:

list2 = [[1, 2], [3, 4]]

arr2 = np.array(list2)

print(arr2)

# Output:

[[1 2]

[3 4]]
B. Creating Arrays from Ranges

You can use Python’s range() function in combination with np.array(), or use NumPy’s own arange()
function which is more flexible.

Example using Python range:

arr3 = np.array(range(0, 10, 2))

print(arr3) # Output: [0 2 4 6 8]

Example using NumPy’s arange():

arr4 = np.arange(0, 10, 2)

print(arr4) # Output: [0 2 4 6 8]

C. Using Built-in NumPy Functions

NumPy provides several built-in functions to create arrays efficiently:

1. np.zeros() – Creates an array filled with zeros.

np.zeros((2, 3)) # Output: array([[0., 0., 0.], [0., 0., 0.]])

2. np.ones() – Creates an array filled with ones.

np.ones((2, 2)) # Output: array([[1., 1.], [1., 1.]])

3. np.full() – Creates an array filled with a specified value.

np.full((2, 2), 7) # Output: array([[7, 7], [7, 7]])

4. np.eye() – Creates an identity matrix.

np.eye(3) # Output: 3x3 identity matrix

5. np.linspace() – Creates a specified number of evenly spaced values between two


endpoints.

np.linspace(0, 1, 5) # Output: array([0. , 0.25, 0.5 , 0.75, 1. ])

6. np.random.rand() / np.random.randn() – Create arrays with random values.

np.random.rand(2, 3) # Random values in [0, 1)

D. Significance of the dtype Parameter

The dtype parameter defines the data type of the elements in the array. This is important for:

• Memory efficiency: Specifying float32 instead of the default float64 can save memory.

• Precision control: Choose int32, float64, or complex128 depending on the required precision.

• Type enforcement: Ensures consistency in calculations and avoids type-casting errors.

Example:

np.array([1, 2, 3], dtype=float)

Output:

array([1., 2., 3.])

Supported dtypes include int32, int64, float32, float64, bool, complex, str, and object, among others.
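
The effect of dtype on memory and on the stored values can be checked directly. This is a small sketch with made-up arrays; note that converting floats to an integer dtype truncates toward zero rather than rounding.

```python
import numpy as np

a64 = np.zeros(1000)                       # default dtype is float64
a32 = np.zeros(1000, dtype=np.float32)     # half the bytes per element

print(a64.dtype, a64.nbytes)               # 8 bytes per element
print(a32.dtype, a32.nbytes)               # 4 bytes per element

# Casting floats to integers drops the fractional part
truncated = np.array([1.9, 2.5, -3.7]).astype(np.int32)
print(truncated)

# Without an explicit dtype, NumPy picks one that can hold every element
promoted = np.array([1, 2.5])
print(promoted.dtype)

# dtype=bool maps zero to False and non-zero to True
flags = np.array([0, 1, 2], dtype=bool)
print(flags)
```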

12. Explain the attributes and properties of Pandas Series with examples.

A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers,
strings, floats, Python objects, etc.).

Creating a Series

import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

Key Attributes and Properties of Series

Attribute / method   Description                               Example

s.values             Returns the underlying NumPy array        array([10, 20, 30, 40])

s.index              Returns the index labels                  Index(['a', 'b', 'c', 'd'], dtype='object')

s.dtype              Returns the data type of the Series       dtype('int64')

s.shape              Returns the shape (number of elements,)   (4,)

s.size               Number of elements in the Series          4

s.ndim               Number of dimensions (always 1)           1

s.name               Name of the Series (optional)             Can be set via s.name = 'my_series'

s.isnull()           Detects missing values (a method)         Returns a Boolean Series

s.notnull()          Opposite of isnull() (a method)           Returns a Boolean Series

s.hasnans            Checks if Series contains NaNs            False in the above case

Examples:

print(s.values) # [10 20 30 40]

print(s.index) # Index(['a', 'b', 'c', 'd'], dtype='object')

print(s.dtype) # int64

print(s.shape) # (4,)

print(s.name) # None (initially)

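Accessing Elements

• By label: s['b'] → 20

• By position: s.iloc[1] → 20 (s[1] also works in older pandas versions, but integer keys on a labeled Series are deprecated in recent releases)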

Vectorized Operations

Pandas Series supports element-wise operations:

s + 5 # Adds 5 to each element

Summary Statistics

s.mean(), s.min(), s.max(), s.describe()

13. Explain how NumPy and Pandas can be used together in a data analysis workflow.

NumPy and Pandas are two core libraries in Python's data analysis stack. They complement each
other in various ways. NumPy provides fast, efficient numerical computations, while Pandas builds on
NumPy by offering powerful, user-friendly data structures like Series and DataFrames.

Typical Data Analysis Workflow Using NumPy and Pandas

Step 1: Data Collection

• Data is often imported using Pandas (pd.read_csv(), pd.read_excel(), etc.).

import pandas as pd

df = pd.read_csv('data.csv')

Step 2: Data Cleaning

• Use Pandas for handling missing values, renaming columns, converting types, etc. The two calls below are alternatives; pick one.

df = df.dropna() # Option A: remove rows with missing values

df = df.fillna(0) # Option B: keep the rows and replace NaNs with 0

Step 3: Data Transformation

• Convert columns to NumPy arrays for efficient numerical processing.

import numpy as np

values = df['column1'].to_numpy()

normalized = (values - np.mean(values)) / np.std(values)

df['normalized'] = normalized

Step 4: Feature Engineering

• Use NumPy functions for mathematical transformations.

df['log_sales'] = np.log(df['sales'] + 1)

Step 5: Statistical Analysis

• Perform descriptive statistics using both Pandas and NumPy.

mean = np.mean(df['sales'])

summary = df.describe()

Step 6: Visualization (using external libraries)

• Libraries like Matplotlib and Seaborn can plot Pandas Series/DataFrames directly.

Why Use Both?

Task                            Library Preferred

Efficient numeric computation   NumPy

Data manipulation               Pandas

Handling missing data           Pandas

Descriptive statistics          Pandas + NumPy

Matrix algebra                  NumPy

Indexing and labeling           Pandas
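
The cleaning, transformation, and feature-engineering steps can be run end to end on a small in-memory DataFrame. This is a minimal sketch with an invented sales column; np.log1p(x) is used as the equivalent of np.log(x + 1). It also shows one detail worth knowing when mixing the two libraries: np.std() defaults to the population formula (ddof=0), while the pandas .std() method defaults to the sample formula (ddof=1).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({"sales": [100.0, np.nan, 300.0, 200.0]})

# Cleaning (Pandas): drop the incomplete row
clean = df.dropna().copy()

# Transformation (NumPy): z-score normalisation on the raw array
values = clean["sales"].to_numpy()
clean["normalized"] = (values - values.mean()) / values.std()

# Feature engineering (NumPy function applied to a Pandas column)
clean["log_sales"] = np.log1p(clean["sales"])

print(clean)

# Same data, different default formulas for the standard deviation
print("NumPy std (ddof=0): ", np.std(values))
print("Pandas std (ddof=1):", clean["sales"].std())
```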

14. Demonstrate mathematical operations and statistical functions that can be performed on
Series objects. How do NaN values affect these operations?

Pandas Series supports vectorized operations and built-in statistical functions, making it ideal for
performing computations on data columns.

A. Mathematical Operations on Series

Operation Example Result

Addition s+2 Adds 2 to all elements

Subtraction s-1 Subtracts 1 from all elements

Multiplication s * 3 Multiplies each element by 3

Division s/2 Divides each element by 2

import pandas as pd

s = pd.Series([10, 20, 30, 40])

print(s * 2) # Output: [20, 40, 60, 80]

B. Statistical Functions on Series

Function       Description

s.sum()        Sum of all elements

s.mean()       Mean (average)

s.median()     Median value

s.std()        Standard deviation

s.var()        Variance

s.min()        Minimum value

s.max()        Maximum value

s.count()      Count of non-NaN values

s.describe()   Summary statistics

s = pd.Series([10, 20, 30, 40])

print(s.mean()) # Output: 25.0

print(s.describe())

C. Handling of NaN (Missing) Values

NaN (Not a Number) values automatically get excluded in most statistical computations unless
explicitly handled.

s = pd.Series([10, 20, None, 40])

print(s.mean()) # Output: 23.33 (ignores None/NaN)

print(s.sum()) # Output: 70.0 (NaN is skipped; the Series is float because of the missing value)

print(s.count()) # Output: 3 (non-null values)

Detecting and Handling NaN:

Function Use

s.isnull() Returns a boolean Series indicating NaNs

s.notnull() Opposite of isnull()

s.dropna() Removes NaNs

s.fillna(x) Replaces NaNs with a specified value

s.fillna(0, inplace=True) # Replace NaNs with 0


Key Notes on NaN in Mathematical Ops

• Operations preserve NaN positions (they don’t get removed automatically).

s = pd.Series([1, np.nan, 3]) # requires: import numpy as np

print(s * 2) # Output: [2, NaN, 6]

• Use skipna=False to include NaNs in a statistic; the result is then NaN (no error is raised):

s.mean(skipna=False) # Output: NaN
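
The difference between the Pandas and NumPy defaults is a common source of confusion, so a side-by-side sketch may help. The Series below is the same one used earlier in this answer; the all_missing Series is an extra illustration of the min_count parameter of sum().

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, None, 40])

pandas_mean = s.mean()                      # skips NaN by default
strict_mean = s.mean(skipna=False)          # NaN propagates
numpy_mean = np.mean(s.to_numpy())          # plain NumPy does not skip NaN
numpy_nanmean = np.nanmean(s.to_numpy())    # NaN-aware NumPy variant

print(pandas_mean, strict_mean, numpy_mean, numpy_nanmean)

# The sum of only-missing data is 0.0 by default; min_count asks for
# at least that many valid values, otherwise the result is NaN
all_missing = pd.Series([np.nan, np.nan])
print(all_missing.sum())
print(all_missing.sum(min_count=1))
```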

15. How to concatenate NumPy arrays both horizontally and vertically. What happens when
the arrays have different shapes?

Concatenation in NumPy refers to joining multiple arrays along an axis. NumPy provides several
functions for concatenation, such as np.concatenate(), np.vstack(), and np.hstack().

✅ A. Horizontal Concatenation (Along Columns / Axis=1)

Method 1: np.concatenate()

import numpy as np

a = np.array([[1, 2], [3, 4]]) # shape: (2, 2)

b = np.array([[5, 6], [7, 8]]) # shape: (2, 2)

# Horizontal concatenation

result = np.concatenate((a, b), axis=1)

print(result)

Output:

[[1 2 5 6]

[3 4 7 8]]

Method 2: np.hstack()

result = np.hstack((a, b))


Same output as above.

✅ B. Vertical Concatenation (Along Rows / Axis=0)

Method 1: np.concatenate()

result = np.concatenate((a, b), axis=0)

print(result)

Output:

[[1 2]

[3 4]

[5 6]

[7 8]]

Method 2: np.vstack()

result = np.vstack((a, b))

Same output as above.

C. What Happens When Arrays Have Different Shapes?

Arrays can be concatenated only if their shapes match in every dimension except the concatenation axis; otherwise NumPy raises a ValueError.

Example of Shape Mismatch (Horizontal):

a = np.array([[1, 2], [3, 4]]) # shape: (2, 2)

b = np.array([[5], [6], [7]]) # shape: (3, 1)

np.concatenate((a, b), axis=1) # Raises ValueError!

Error:

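ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2 and the array at index 1 has size 3 (exact wording varies by NumPy version)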

Example of Shape Mismatch (Vertical):

a = np.array([[1, 2], [3, 4]]) # shape: (2, 2)

b = np.array([[5, 6, 7]]) # shape: (1, 3)


np.concatenate((a, b), axis=0) # Raises ValueError!

Error:

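ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 2 and the array at index 1 has size 3 (exact wording varies by NumPy version)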

✅ D. Handling Shape Mismatches

To resolve mismatches:

• Reshape the arrays using np.reshape() or np.expand_dims() if needed.

• Use padding (e.g., with zeros) or broadcasting if it makes sense for the data.

Example: Reshaping Before Concatenation

a = np.array([1, 2, 3]) # shape: (3,)

b = np.array([[4], [5], [6]]) # shape: (3,1)

# Reshape a to (3,1) so it can be stacked horizontally with b

a = a.reshape((3, 1))

result = np.hstack((a, b)) # Both shapes are now (3,1); result has shape (3,2)

print(result)
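
A frequent special case is joining a 1-D array onto a 2-D one. The sketch below (array names are illustrative) shows that np.concatenate rejects the mismatch in the number of dimensions, and two ways around it: adding an axis with np.newaxis, or using np.vstack / np.column_stack, which promote 1-D inputs themselves.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])     # shape (2, 2)
row = np.array([5, 6])             # shape (2,)
col = np.array([9, 10])            # shape (2,)

# concatenate needs the same number of dimensions on every input
try:
    np.concatenate((a, row), axis=0)
except ValueError as exc:
    print("concatenate failed:", exc)

# Option 1: add the missing axis explicitly
stacked_rows = np.concatenate((a, row[np.newaxis, :]), axis=0)   # shape (3, 2)
stacked_cols = np.concatenate((a, col[:, np.newaxis]), axis=1)   # shape (2, 3)

# Option 2: helpers that treat 1-D input as a row or a column
same_rows = np.vstack((a, row))
same_cols = np.column_stack((a, col))

print(stacked_rows)
print(stacked_cols)
```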

16. Discuss DataFrame creation methods and attributes in detail.

A Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different
types — like an Excel spreadsheet or SQL table.

A. DataFrame Creation Methods

1. From a Dictionary of Lists or Arrays

Each key becomes a column label.

import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}

df = pd.DataFrame(data)

2. From a Dictionary of Series

data = {
    'A': pd.Series([1, 2, 3], index=['x', 'y', 'z']),
    'B': pd.Series([4, 5], index=['x', 'y'])
}

df = pd.DataFrame(data)

Missing values will be filled with NaN.

3. From a List of Dictionaries

data = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Salary': 60000}
]

df = pd.DataFrame(data)

4. From a 2D NumPy Array

import numpy as np

array = np.array([[1, 2], [3, 4]])

df = pd.DataFrame(array, columns=['A', 'B'])

5. From a List of Tuples

data = [('Alice', 25), ('Bob', 30)]

df = pd.DataFrame(data, columns=['Name', 'Age'])


6. From External Sources

• CSV: pd.read_csv('file.csv')

• Excel: pd.read_excel('file.xlsx')

• SQL: pd.read_sql(query, connection)

✅ B. Common DataFrame Attributes

Attribute    Description                                 Example

df.shape     Returns (rows, columns)                     (3, 3)

df.columns   Returns column labels as an Index           Index(['Name', 'Age', 'Salary'])

df.index     Returns row labels                          RangeIndex(start=0, stop=3, step=1)

df.dtypes    Returns the data types of each column       Name: object, Age: int64, ...

df.size      Total number of elements                    9 for 3x3 DataFrame

df.ndim      Number of dimensions (always 2)             2

df.values    NumPy array of all values                   array([...])

df.info()    Summary of index, columns, and data types

df.head(n)   First n rows (default 5)

df.tail(n)   Last n rows
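
These attributes can be inspected on any of the DataFrames built above. As a sketch, the following uses the dictionary-of-Series construction, where the two Series have different indexes, to show how the missing entry appears as NaN and how it affects the column dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series([1, 2, 3], index=["x", "y", "z"]),
    "B": pd.Series([4, 5], index=["x", "y"]),      # no value for 'z'
})

print(df)
print(df.shape)      # (rows, columns)
print(df.size)       # rows * columns
print(df.ndim)
print(df.dtypes)     # column B becomes float64 because it holds a NaN
print(df.columns.tolist(), df.index.tolist())
```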

17. Discuss indexing, reindexing, and aligning Pandas Series objects with examples

A. Indexing in Series

Series objects are like dictionaries; they map labels (indices) to data (values).

1. Positional Indexing

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.iloc[0]) # 10 (plain s[0] on a labeled Series is deprecated in recent pandas)

2. Label-based Indexing

print(s['b']) # 20

3. Slicing

print(s.iloc[1:]) # Uses position

print(s['a':'c']) # Uses label (inclusive of 'c')

4. Boolean Indexing

print(s[s > 15]) # Output: Series with values > 15

B. Reindexing Series

Reindexing means changing the index of a Series, potentially introducing or removing data.

Example:

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Reindexing with a new index

s2 = s.reindex(['a', 'b', 'd'])

print(s2)

Output:

a 10.0

b 20.0

d NaN

You can fill in missing values:

s2 = s.reindex(['a', 'b', 'd'], fill_value=0)

✅ C. Aligning Series

Alignment happens automatically during arithmetic operations between Series with different
indexes.

Example:

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

print(s1 + s2)

Output:

a     NaN
b    12.0
c    23.0
d     NaN

• Only matching indices are summed.

• Non-matching indices result in NaN.

✅ Handling Missing Data After Alignment

You can use .add() with a fill_value:

s1.add(s2, fill_value=0)

Now, missing values are treated as zero:

a 1.0

b 12.0

c 23.0

d 30.0
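
The alignment step that arithmetic performs implicitly can also be done explicitly with Series.align(), which returns both Series reindexed to a shared index. A minimal sketch with the same two Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Outer join: union of the labels, gaps filled with NaN
left, right = s1.align(s2, join="outer")
print(left)
print(right)

# Inner join: only the labels both Series share
common_left, common_right = s1.align(s2, join="inner")
overlap_sum = common_left + common_right
print(overlap_sum)

# Equivalent of the fill_value example above
union_sum = s1.add(s2, fill_value=0)
print(union_sum)
```

Choosing join="inner" avoids NaN entirely when only the overlapping labels matter, while fill_value keeps every label and treats the missing side as zero.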

18. Compare the performance of NumPy's statistical operations with equivalent operations in
pure Python. Include examples of calculating mean, median, standard deviation,
correlation, and other statistical measures.

NumPy offers fast, vectorized operations using compiled C code, while pure Python uses interpreted
loops, which are slower and more verbose. Let’s compare both in terms of performance, simplicity,
and readability.

✅ A. Setup

import numpy as np

import statistics

import time

data = list(range(1, 1_000_001)) # 1 million numbers

np_data = np.array(data)

✅ B. Mean Calculation

Pure Python:
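
start = time.perf_counter() # perf_counter() is the clock intended for timing short intervals

mean_py = sum(data) / len(data)

end = time.perf_counter()

print("Pure Python Mean:", mean_py, "Time:", end - start)

NumPy:

start = time.perf_counter()

mean_np = np.mean(np_data)

end = time.perf_counter()

print("NumPy Mean:", mean_np, "Time:", end - start)

🔹 NumPy is typically around an order of magnitude faster (~10–50x) on large arrays; the exact speed-up depends on the machine, the data size, and the dtype.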

C. Median Calculation

Pure Python:

median_py = statistics.median(data)

NumPy:

median_np = np.median(np_data)

🔸 np.median() uses a selection-based partition (quickselect-style) rather than a full sort, whereas statistics.median() sorts the data first, so NumPy is usually much faster.

D. Standard Deviation

Pure Python:

std_py = statistics.stdev(data) # sample std dev

NumPy:

std_np = np.std(np_data, ddof=1) # ddof=1 for sample std dev

NumPy is significantly faster, especially with millions of numbers.

E. Correlation Coefficient

Pure Python:
# Manually compute the Pearson correlation coefficient

def pearson_corr(x, y):
    mean_x = sum(x)/len(x)
    mean_y = sum(y)/len(y)
    num = sum((a - mean_x)*(b - mean_y) for a, b in zip(x, y))
    denom = (sum((a - mean_x)**2 for a in x) * sum((b - mean_y)**2 for b in y)) ** 0.5
    return num / denom

corr_py = pearson_corr(data, data)

NumPy:

corr_np = np.corrcoef(np_data, np_data)[0, 1]

💡 NumPy is much faster and far more concise here. np.corrcoef() returns the full correlation matrix, so [0, 1] selects the coefficient between the two inputs.

F. Performance Summary

Operation       Pure Python (time)   NumPy (time)

Mean            Slower               Very Fast

Median          Slower (sorts)       Fast (Quickselect)

Std Deviation   Slower               Fast

Correlation     Verbose & slow       Fast & built-in
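
Single time.time() measurements are noisy, so a more repeatable comparison can be sketched with the standard-library timeit module. The helper name best_time and the data size of 100,000 are assumptions chosen to keep the run short; the printed numbers will differ from machine to machine.

```python
import statistics
import timeit

import numpy as np

data = list(range(1, 100_001))                 # 100,000 numbers
np_data = np.array(data, dtype=np.float64)


def best_time(func, repeat=3):
    """Run func a few times and return the fastest wall-clock time in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


timings = {
    "mean (pure Python)": best_time(lambda: sum(data) / len(data)),
    "mean (NumPy)": best_time(lambda: np.mean(np_data)),
    "stdev (pure Python)": best_time(lambda: statistics.stdev(data)),
    "stdev (NumPy)": best_time(lambda: np.std(np_data, ddof=1)),
}

for label, seconds in timings.items():
    print(f"{label:22s} {seconds * 1e3:10.3f} ms")
```

Taking the minimum of several runs reduces the influence of background activity on the measurement; both implementations should still agree on the computed values.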

G. Conclusion

• Use NumPy for any statistical work on large or even moderately sized datasets.

19. Explain how to select, filter, and manipulate data using different indexing methods (loc,
iloc, at, iat) used in DataFrame creation in Pandas.

Pandas provides four main indexing methods to access data in DataFrames:

✅ A. loc[] – Label-based Indexing

Access rows and columns by labels (names, not positions).

df = pd.DataFrame({

'Name': ['Alice', 'Bob', 'Charlie'],

'Age': [25, 30, 35],

}, index=['a', 'b', 'c'])

df.loc['a'] # Row with label 'a'

df.loc['a', 'Age'] # Single value: Age of Alice

df.loc[:, 'Age'] # Entire Age column

🔹 Can also be used with boolean filtering:

df.loc[df['Age'] > 25]

✅ B. iloc[] – Position-based Indexing

Access rows and columns by integer position (like NumPy arrays).

df.iloc[0] # First row

df.iloc[0, 1] # First row, second column

df.iloc[:, 1] # All rows, second column

🔸 Useful when labels are unknown or not sequential.

✅ C. at[] – Fast Scalar Access (Label-based)

Access a single value using row and column labels (faster than loc[]).

df.at['a', 'Age'] # Faster than df.loc['a', 'Age']

✅ Best for scalar access when performance matters.

✅ D. iat[] – Fast Scalar Access (Position-based)


Access a single value using integer positions (like iloc, but faster).

df.iat[0, 1] # Age of first row (Alice)

✅ E. Comparison Summary

Method   Based On    Can Slice?   Fast?    Use Case

loc      Labels      ✅ Yes       Slow     Named rows/columns

iloc     Positions   ✅ Yes       Medium   Index-based access

at       Labels      ❌ No        ✅✅     Fast access to one value (label)

iat      Positions   ❌ No        ✅✅     Fast access to one value (pos)

✅ F. Filtering Example with loc

# Select people older than 25

df.loc[df['Age'] > 25]

✅ G. Updating Data

df.loc['a', 'Age'] = 26 # Using label

df.iloc[0, 1] = 27 # Using position

df.at['a', 'Age'] = 28 # Fast single update

df.iat[0, 1] = 29 # Fastest single update
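
Filtering and updating are often combined: loc accepts a boolean mask for the rows together with a column label, which updates the matching cells in one step. The sketch below rebuilds the same small DataFrame; the new values are invented for illustration. Writing through loc in a single call also avoids chained assignment such as df[df['Age'] > 25]['Age'] = ..., which may modify a temporary copy instead of df.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],
)

over_25 = df["Age"] > 25                     # boolean mask, one entry per row

names_over_25 = df.loc[over_25, "Name"]      # filter rows, select one column
df.loc[over_25, "Age"] += 1                  # update only the matching rows

first_two_names = df.iloc[0:2, 0]            # positions: rows 0-1, column 0
df.at["c", "Name"] = "Charles"               # single-cell update by label

print(names_over_25.tolist())
print(df)
```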
