
NumPy Boolean Indexing

● In NumPy, boolean indexing allows us to filter elements from an array based on a specific condition.
● Boolean indexing is commonly described as filtering with a boolean mask that encodes the condition.
● Boolean indexing uses the result of a Boolean operation over the data, producing a mask with True or False for each element.
● The elements marked True in the mask are selected.
● In NumPy, a Boolean mask is a NumPy array of truth values (True/False) that correspond one-to-one to the elements of the array being filtered.
Example of Boolean Masks

● Suppose we have an array named array1.
array1 = np.array([12, 24, 16, 21, 32, 29, 7, 15])
● Now let's create a mask that selects all elements of array1 that are greater than 20.
boolean_mask = array1 > 20
● Here, array1 > 20 creates a boolean mask that evaluates to True for elements that are greater than 20, and False for elements that are less than or equal to 20.
● The resulting mask is an array stored in the boolean_mask variable as:
[False, True, False, True, True, True, False, False]
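● Applying the mask to array1 keeps only the elements at the True positions. A minimal sketch of the full filter (imports included for completeness):
import numpy as np
array1 = np.array([12, 24, 16, 21, 32, 29, 7, 15])
boolean_mask = array1 > 20
# indexing with the boolean mask returns only the elements where the mask is True
result = array1[boolean_mask]
print(result)
[24 21 32 29]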
array1 = np.array([1, 2, 4, 9, 11, 16, 18, 22, 26, 31, 33, 47, 51, 52])
# create a boolean mask using combined logical operators
boolean_mask = (array1 < 10) | (array1 > 40)
# apply the boolean mask to the array
result = array1[boolean_mask]
print(result)
[ 1 2 4 9 47 51 52]

numbers = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
numbers_copy = numbers.copy()
# change all even numbers to 0 in the copy
numbers_copy[numbers % 2 == 0] = 0
# print the modified copy
print(numbers_copy)
[1 0 3 0 5 0 7 0 9 0]
2D Boolean Indexing in NumPy
# create a 2D array
array1 = np.array([[1, 7, 9], [14, 19, 21], [25, 29, 35]])
# create a boolean mask for elements greater than 9
boolean_mask = array1 > 9
result = array1[boolean_mask]
print(result)
[14 19 21 25 29 35]
● Note that applying a boolean mask to a 2D array returns a flattened 1D array of the selected elements.
Pandas Library for Data Manipulation and Analysis
∙ Pandas provides two types of classes for handling data:
∙ DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
∙ Each row in a DataFrame has an index label used to access rows and columns; the labels can be any name or value.
∙ In Pandas, each column is a Series: a one-dimensional sequence of values in which each value has an index (see the short sketch after this list).
∙ Values can be integers, strings, Python objects, etc.
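A minimal sketch (with made-up data) illustrating that selecting a DataFrame column returns a Series:
import pandas as pd
df = pd.DataFrame({"Name": ["Ana", "Ben"], "Score": [8.5, 9.0]})
col = df["Score"]
# a single column is a Series; each value keeps its row index label
print(type(col))
print(col)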
● python -m pip install --upgrade pip
● python3
● pip install pandas
Series in Pandas

● import pandas as pd
● data = [10, 20, 30, 40, 50]
● my_series = pd.Series(data)
● print(my_series[2])
● a = [1, 3, 5]
● my_series = pd.Series(a, index = ["x", "y", "z"])
● print(my_series)
● print(my_series["y"])
import pandas as pd
# create a dictionary
grades = {"Sem1": 8.25, "Sem2": 9.5, "Sem3": 7.75}
# create a series from the dictionary
my_series = pd.Series(grades)
print(my_series)
first_year = pd.Series(grades, index = ["Sem1", "Sem2"])
Series in Pandas
import pandas as pd
import numpy as np
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
import pandas as pd
data = [['John', 25, 'New York'],
['Alice', 30, 'London'],
['Bob', 35, 'Paris']]
# create a DataFrame from the list
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
Pandas DataFrame Using Python Dictionary
data = {'year': [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012],
        'team': ['FCBarcelona', 'FCBarcelona', 'FCBarcelona', 'RMadrid', 'RMadrid', 'RMadrid', 'ValenciaCF', 'ValenciaCF', 'ValenciaCF'],
        'wins': [30, 28, 32, 29, 32, 26, 21, 17, 19],
        'draws': [6, 7, 4, 5, 4, 7, 8, 10, 8],
        'losses': [2, 3, 2, 4, 2, 5, 9, 11, 11]
        }

football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'draws', 'losses'])
df = pd.DataFrame() # create an empty DataFrame
df = pd.read_csv('data.csv') #from CSV

df = pd.read_csv('./csv_files/data.csv', header = 0)

Employee ID,First Name,Last Name,Department,Position,Salary
101,John,Doe,Marketing,Manager,50000
102,Jane,Smith,Sales,Associate,35000
103,Michael,Johnson,Finance,Analyst,45000
104,Emily,Williams,HR,Coordinator,40000
Contents of a data.csv file without a header row (read by the example below):
23, 'Hello', 45.6
56, 'World', 78.9
89, 'Foo', 12.3
34, 'Bar', 56.7

# read csv file with some arguments
df = pd.read_csv('data.csv', header = None, names = ['col1', 'col2', 'col3'], skiprows = 2)
print(df)
>>> data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
...         'Age': [25, 30, 35, 28],
...         'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
>>> df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
>>> df
Name Age City
A Alice 25 New York
B Bob 30 Los Angeles
C Charlie 35 Chicago
D David 28 Houston
>>> selected_row = df.loc['A']
>>> print(selected_row)
Name Alice
Age 25
City New York
Name: A, dtype: object
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28], 'City': ['New York', 'Los Angeles',
'Chicago', 'Houston']}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
# select specific rows and columns
selected_data = df.loc[['A', 'C'], ['Name', 'Age']]
print(selected_data)
cd1 = df.loc['B':'C', ['Name', 'Age']]
cd2 = df.loc[:, ['Name', 'Age']]
cd3 = df.loc[:]
sr2 = df.loc[['A','C'],:]
sr1 = df.loc[df['Age'] >= 30]
arr = np.array([1, 2, 3, 4, 5])
arr = np.array((1, 2, 3, 4, 5))  # from a tuple

arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print('5th element on 2nd row: ', arr[1, 4])
>>> 3 + 2.5 * np.random.randn(2, 4)
array([[ 3.56443934,  0.21240777,  1.65220694,  6.32284338],
       [-0.90036278,  4.78487666,  3.40952793,  1.71824131]])
>>> np.array([3] * 4, dtype="int32")
array([3, 3, 3, 3], dtype=int32)
>>> z = np.arange(3, dtype=np.uint8) #Array Range
>>> z
array([0, 1, 2], dtype=uint8)
https://www.programiz.com/python-programming/pandas/getting-started

● Categoricals are a pandas data type corresponding to categorical variables in statistics.
● A categorical variable takes a limited, and usually fixed, number of possible values.
● Categorical data might have an order,
● like 'strongly agree' vs. 'agree', or
● 'first observation' vs. 'second observation', or
● 'Test Data' vs. 'Train Data'.
● Order is defined by the order of the categories, not the lexical order of the values (an ordered example follows the code below).
All values are either in the categories or np.nan.
s = pd.Series(["a", "b", "c", "a"], dtype="category")

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype("category")

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 18, 47, 33],
        'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney']}
df = pd.DataFrame(data)
names = df['Name']
name_city = df[['Name', 'City']]
df2 = pd.DataFrame(
    {"A": 1.0,
     "B": pd.Timestamp("20250128"),
     "C": pd.Series(1, index=list(range(4)), dtype="float32"),
     "D": np.array([3] * 4, dtype="int32"),
     "E": pd.Categorical(["test", "train", "test", "train"]),
     "F": "foo", })
>>> df2
     A          B    C  D      E    F
0  1.0 2025-01-28  1.0  3   test  foo
1  1.0 2025-01-28  1.0  3  train  foo
2  1.0 2025-01-28  1.0  3   test  foo
3  1.0 2025-01-28  1.0  3  train  foo
>>> df2.dtypes
A float64
B datetime64[s]
C float32
D int32
E category
F object
dtype: object
>>> dates = pd.date_range("20250101", periods=6)
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
>>> df
A B C D
2025-01-01 0.293879 0.324915 0.434401 -1.391992
2025-01-02 -0.701108 -0.011810 0.835216 -0.586246
2025-01-03 -0.677587 0.348766 -0.457098 1.147319
2025-01-04 -1.671191 0.651669 -0.685242 -1.954809
2025-01-05 0.526734 -1.297472 0.177927 0.612196
2025-01-06 0.778206 0.865262 -0.970947 -0.460400
>>> df.head()
A B C D
2025-01-01 0.293879 0.324915 0.434401 -1.391992
2025-01-02 -0.701108 -0.011810 0.835216 -0.586246
2025-01-03 -0.677587 0.348766 -0.457098 1.147319
2025-01-04 -1.671191 0.651669 -0.685242 -1.954809
2025-01-05 0.526734 -1.297472 0.177927 0.612196
>>> df.tail(2)
A B C D
2025-01-05 0.526734 -1.297472 0.177927 0.612196
2025-01-06 0.778206 0.865262 -0.970947 -0.460400

>>> df.index
DatetimeIndex(['2025-01-01', '2025-01-02',
'2025-01-03', '2025-01-04','2025-01-05',
'2025-01-06'],dtype='datetime64[ns]', freq='D')
>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

>>> df.to_numpy()
array([[ 0.29387942,  0.32491506,  0.43440078, -1.39199244],
       [-0.70110762, -0.01181039,  0.83521647, -0.58624567],
       [-0.67758743,  0.34876597, -0.45709763,  1.14731948],
       [-1.67119052,  0.65166926, -0.68524221, -1.95480876],
       [ 0.52673407, -1.29747191,  0.17792695,  0.6121957 ],
       [ 0.77820621,  0.8652619 , -0.97094701, -0.46040001]])
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.241844 0.146888 -0.110957 -0.438989
std 0.934028 0.768723 0.702184 1.170421
min -1.671191 -1.297472 -0.970947 -1.954809
25% -0.695228 0.072371 -0.628206 -1.190556
50% -0.191854 0.336841 -0.139585 -0.523323
75% 0.468520 0.575943 0.370282 0.344047
max 0.778206 0.865262 0.835216 1.147319
>>> df.T
2025-01-01 2025-01-02 2025-01-03 2025-01-04 2025-01-05 2025-01-06
A 0.293879 -0.701108 -0.677587 -1.671191 0.526734 0.778206
B 0.324915 -0.011810 0.348766 0.651669 -1.297472 0.865262
C 0.434401 0.835216 -0.457098 -0.685242 0.177927 -0.970947
D -1.391992 -0.586246 1.147319 -1.954809 0.612196 -0.460400
>>> df["A"]
2025-01-01 0.293879
2025-01-02 -0.701108
2025-01-03 -0.677587
2025-01-04 -1.671191
2025-01-05 0.526734
2025-01-06 0.778206
Freq: D, Name: A, dtype: float64
>>> df.A
2025-01-01 0.293879
2025-01-02 -0.701108
2025-01-03 -0.677587
2025-01-04 -1.671191
2025-01-05 0.526734
2025-01-06 0.778206
Freq: D, Name: A, dtype: float64
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
# create a dataframe from the dictionary
df = pd.DataFrame(data)
# write dataframe to csv file
df.to_csv('output.csv', index=False)

df = pd.DataFrame(data)
# duplicated() returns a boolean Series marking repeated rows (checked here on the Name and Age columns)
print(df.duplicated(subset=['Name', 'Age']))
df.drop_duplicates(inplace=True)
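The three rows above are all distinct, so drop_duplicates() leaves them unchanged. A small sketch with an assumed duplicate row shows the effect:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Alice', 'John'],
                   'Age': [25, 30, 25],
                   'City': ['New York', 'London', 'New York']})
# the third row repeats the first, so duplicated() marks it True
print(df.duplicated())
df.drop_duplicates(inplace=True)
print(df)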
import pandas as pd

# create dataframe
data = {'Name': ['Tom', 'Nick', 'John', 'Tom'],
'Age': [20, 21, 19, 18],
'City': ['New York', 'London', 'Paris', 'Berlin']}
df = pd.DataFrame(data)

# write to csv file
df.to_csv('output.csv', sep = ';', index = False, header = True)
data = { 'A': [1, 2, 3, None, 5], 'B': [None, 2, 3, 4, 5],
'C': [1, 2, None, None, 5] }

df = pd.DataFrame(data)
print("Original Data:\n",df)
# use dropna() to remove rows with any missing values
df_cleaned = df.dropna()
print("Cleaned Data:\n",df_cleaned)
Cleaned Data:
A B C
1 2.0 2.0 2.0
4 5.0 5.0 5.0
import pandas as pd
data = { 'A': [1, 2, 3, None, 5],
'B': [None, 2, 3, 4, 5], 'C': [1, 2, None, None, 5]}
df = pd.DataFrame(data)
print("Original Data:\n", df)
# filling NaN values with 0
df.fillna(0, inplace=True)
print("\nData after filling NaN with 0:\n", df)
import pandas as pd
data = {
'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
'Age': [8, 9, 7, 80, 100], 'Gender': ['M', 'M', 'M', 'F', 'M'],
'Standard': [3, 4, 12, 3, 5]}
df = pd.DataFrame(data)
# replace the 'F' at row index 3 of the Gender column with 'M'
df.loc[3, 'Gender'] = 'M'
print(df)
import pandas as pd
data = {
'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
'Age': [8, 9, 7, 80, 100], 'Gender': ['M', 'M', 'M', 'M', 'M'],
'Standard': [3, 4, 12, 3, 5] }
df = pd.DataFrame(data)
# replace values based on a condition: ages entered with an extra zero
for i in df.index:
    age_val = df.loc[i, 'Age']
    if (age_val > 14) and (age_val % 10 == 0):
        df.loc[i, 'Age'] = age_val // 10
print(df)
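The same correction can be written without an explicit loop by using a boolean mask, echoing the boolean indexing covered earlier. A minimal sketch on the same assumed data:
import pandas as pd
data = {'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
        'Age': [8, 9, 7, 80, 100], 'Gender': ['M', 'M', 'M', 'M', 'M'],
        'Standard': [3, 4, 12, 3, 5]}
df = pd.DataFrame(data)
# boolean mask selecting the rows whose Age meets the condition
mask = (df['Age'] > 14) & (df['Age'] % 10 == 0)
df.loc[mask, 'Age'] = df.loc[mask, 'Age'] // 10
print(df)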
Resources: Datasets
◻ UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
◻ Statlib: http://lib.stat.cmu.edu/
◻ European Union (Eurostat): https://ec.europa.eu/eurostat/data/database
