Pandas


0.1 Import
import pandas as pd

0.2 Read files


0.2.1 Read csv
We can read any text file (.csv, .txt, ...) whose columns are separated by the same
delimiter.
data = pd.read_csv(r"<data path>", header=None | <row index>, names=<list of column names>, sep=<delimiter>)
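As a minimal sketch of the call above, the example below reads a small semicolon-separated text with no header row. The sample data and the column names are made up for illustration, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# In-memory text standing in for a file on disk (made-up sample data).
raw = "1;Alice;30\n2;Bob;25\n"

# header=None: the text has no header row, so names= supplies the column labels.
data = pd.read_csv(io.StringIO(raw), header=None, names=["id", "name", "age"], sep=";")
print(data)
```

If the file already has a header row, drop header=None and names= and pandas uses the first row as column names.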

0.2.2 Read json


data = pd.read_json(r"<json file path>")

0.2.3 Read excel


data = pd.read_excel(r"<excel file path>", sheet_name=<sheet name>)

If we don't specify a sheet name, pandas reads the first sheet by default.

0.3 Max rows and columns


0.3.1 Max rows
pd.set_option('display.max_rows', <number>)

0.3.2 Max Columns


pd.set_option('display.max_columns', <number>)
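A short sketch of how these display options can be set and checked. The option names are written here with underscores (display.max_rows, display.max_columns), which is how pandas documents them. The numbers are arbitrary, and pd.option_context is shown as one possible way to make a temporary change.

```python
import pandas as pd

# Permanent change for the current session (values chosen arbitrarily).
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 20)

# Temporary change that only applies inside the with-block.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(inside, after)
```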

0.4 Dataframe infos


We use it to see the dataframe's column and row counts, the non-null count of
each column (to check for missing values), and the memory usage.
df.info()

0.5 Dataframe shape


To see the shape of the dataframe as a (rows, columns) tuple:
df.shape
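The following sketch shows both calls on a tiny made-up dataframe with one missing value. Note that the method is named info() and that shape is an attribute, so it takes no parentheses.

```python
import pandas as pd

# Made-up sample data: the "name" column has one missing value.
df = pd.DataFrame({"name": ["Ann", "Bob", None], "age": [31, 25, 40]})

df.info()              # prints dtypes, non-null counts and memory usage
rows, cols = df.shape  # shape is a (rows, columns) tuple
print(rows, cols)
```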

0.6 Head and Tail of dataframe


0.6.1 Head
To see the first rows of the dataframe:
df.head(<number>)

0.6.2 Tail
To see the last rows of the dataframe:
df.tail(<number>)
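A quick illustration on a made-up ten-row dataframe. When no number is passed, both methods return five rows.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})  # made-up data: 0..9

first = df.head(3)   # first three rows
last = df.tail(2)    # last two rows
default = df.head()  # five rows when no number is given
print(first, last, sep="\n")
```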

0.7 set index


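We can set the row index from a list of labels, one per row of df (wrapped in pd.Index so that the list is not read as column names):
df = df.set_index(pd.Index([<list of labels, one per row of df>]))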
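A plain Python list passed to set_index is interpreted as a list of column names, so the sketch below shows two ways that do work for a list of row labels. The dataframe and the labels are made up.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Rome", "Oslo"], "temp": [21, 27, 14]})

# Option 1: assign the labels directly (one label per row).
df.index = ["a", "b", "c"]

# Option 2: wrap the labels in pd.Index so set_index treats them as labels.
df2 = df.set_index(pd.Index(["x", "y", "z"]))
print(df2)
```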

0.8 Add columns to dataframe


df[<new column name> | [<column names>]] = <column(s) with the same number of rows as df>
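For example, with made-up data, a new column can come from a list, from a computation on existing columns, or from a single value repeated on every row:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

df["qty"] = [1, 2, 3]                    # list with one value per row
df["total"] = df["price"] * df["qty"]    # computed from existing columns
df["currency"] = "EUR"                   # scalar, repeated on every row
print(df)
```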

0.9 Column filtering


0.9.1 basic
Supposing that we have three rows in our dataset:
df[[True, False, True]]

This gives us the first and third rows.

0.9.2 Conditions
df[df[<column name>] > | < | == <value>]

df[<column name>] > | < | == <value> returns a boolean Series with as many
entries as df has rows (True if the row's value in that column satisfies the
condition, otherwise False).
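A small sketch with made-up data: the comparison produces a boolean mask, and the mask selects the rows. Several conditions can be combined with & and |, each wrapped in parentheses.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [31, 25, 40]})

mask = df["age"] > 30   # boolean Series, one entry per row
older = df[mask]        # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); parentheses are required.
both = df[(df["age"] > 30) & (df["name"] != "Cid")]
print(older, both, sep="\n")
```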

0.9.3 isin

df[df[<column>].isin([item1, ...])]
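For instance, on a made-up sales table, isin keeps the rows whose value appears in the list, and ~ inverts the selection:

```python
import pandas as pd

df = pd.DataFrame({"country": ["FR", "IT", "NO", "FR"], "sales": [5, 3, 8, 2]})

selected = df[df["country"].isin(["FR", "NO"])]    # rows in the list
excluded = df[~df["country"].isin(["FR", "NO"])]   # rows not in the list
print(selected, excluded, sep="\n")
```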

0.9.4 string operations


It only works with string values. Through the str accessor we can use functions
such as contains, endswith, ...
df[df[<column>].str.contains("<string>")]
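A sketch with made-up file names. One value is missing, so na=False is passed to treat missing values as non-matching. regex=False makes contains look for the literal text instead of a regular expression.

```python
import pandas as pd

df = pd.DataFrame({"file": ["report.csv", "notes.txt", None, "data.csv"]})

# Rows whose value ends with ".csv"; the missing value counts as no match.
csv_files = df[df["file"].str.endswith(".csv", na=False)]

# Rows whose value contains the literal text "o".
with_o = df[df["file"].str.contains("o", na=False, regex=False)]
print(csv_files, with_o, sep="\n")
```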

0.9.5 filter
Subset the dataframe rows or columns according to the specified index labels.
It does not filter a dataframe on its contents. The filter is applied to the labels
of the index.
df.filter(items=[<list of column or row labels>], axis=0 | 1)

df.filter(like="<value>", axis=0 | 1)

• axis 0: rows axis

• axis 1: columns axis
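To make the label-based behaviour concrete, here is a sketch on a made-up weather table. Only row and column labels are matched, never the cell values.

```python
import pandas as pd

df = pd.DataFrame(
    {"temp_min": [1, 2], "temp_max": [5, 6], "rain": [0, 3]},
    index=["mon", "tue"],
)

by_items = df.filter(items=["rain", "temp_min"], axis=1)  # exact column labels
by_like = df.filter(like="temp", axis=1)                  # column labels containing "temp"
rows = df.filter(like="mo", axis=0)                       # row labels containing "mo"
print(by_items, by_like, rows, sep="\n")
```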

0.10 loc vs iloc


0.10.1 loc
It gets rows (and/or columns) with particular labels.
df.loc[0]
df.loc[5:]

0.10.2 iloc
It gets rows (and/or columns) at integer positions.
df.iloc[0]
df.iloc[5:]
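The difference only becomes visible when the index labels are not 0, 1, 2, ... The sketch below uses a made-up dataframe whose labels are 10, 20, 30. It also shows that label slices include the end label while position slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({"v": ["a", "b", "c"]}, index=[10, 20, 30])

by_label_value = df.loc[20, "v"]   # row labelled 20
by_pos_value = df.iloc[2]["v"]     # third row (position 2)

by_label = df.loc[10:20]   # label slice: end label 20 is included
by_pos = df.iloc[0:1]      # position slice: end position 1 is excluded
print(by_label, by_pos, sep="\n")
```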

0.11 Indexing
0.11.1 set index
When reading a file:

df = pd.read_csv(path, index_col="<column name>" | [<list of columns>] | [<list of column positions>])

After reading a file:

df = df.set_index('<column name>' | [<list of columns>] | <array of labels, one per row>)
# or
df.set_index('<column name>' | [<list of columns>], inplace=True)

0.11.2 Reset index


Reset the index to the default values (0, 1, 2, ...):
df = df.reset_index(drop=True)

or
df.reset_index(drop=True, inplace=True)

0.11.3 sort index


df = df.sort_index(ascending=<boolean> | [<list of booleans>])
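The three index operations of this section can be chained as in the following sketch (made-up data). It also shows what drop=True changes: without it, the old index comes back as a regular column.

```python
import pandas as pd

df = pd.DataFrame({"id": [3, 1, 2], "name": ["c", "a", "b"]})

indexed = df.set_index("id")        # "id" becomes the row index
sorted_df = indexed.sort_index()    # rows ordered by index label

restored = sorted_df.reset_index()           # "id" is a column again
dropped = sorted_df.reset_index(drop=True)   # "id" is discarded
print(restored, dropped, sep="\n")
```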

0.12 Sort values


df = df.sort_values(by=[<list of columns>], ascending=[<boolean>, ...])
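As an illustration with made-up data, sorting by two columns with one direction per column looks like this:

```python
import pandas as pd

df = pd.DataFrame({"team": ["b", "a", "b", "a"], "score": [1, 4, 3, 2]})

# team ascending, then score descending inside each team.
ordered = df.sort_values(by=["team", "score"], ascending=[True, False])
print(ordered)
```

The original row labels are kept; chain .reset_index(drop=True) if a fresh 0, 1, 2, ... index is wanted.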

0.13 describe dataframe (or grouped dataframe)


The describe() function in Pandas provides a quick summary of the central
tendencies, dispersion, and shape of a dataset's distribution, excluding NaN values.
This function is particularly useful for getting an overview of the dataset's
numerical columns, but it can also be applied to object (string) columns.
Here's how the describe() function works and what it provides:

0.14 Syntax
DataFrame.describe(percentiles=None, include=None, exclude=None)

0.15 Parameters
• percentiles: A list-like structure specifying which percentiles to include
in the output. Default is [0.25, 0.5, 0.75].

• include: A white-list of data types to include in the result. Can be a
string or a list-like structure.

• exclude: A black-list of data types to exclude from the result. Can be a
string or a list-like structure.

0.16 Output
By default, describe() returns the following statistics for numeric columns:

• count: The number of non-null entries.

• mean: The average (mean) value.

• std: The standard deviation.

• min: The minimum value.

• 25%: The 25th percentile (first quartile).

• 50%: The 50th percentile (median or second quartile).

• 75%: The 75th percentile (third quartile).



• max: The maximum value.


For object (string) columns, it returns:
• count: The number of non-null entries.
• unique: The number of unique values.
• top: The most frequent value.
• freq: The frequency of the most frequent value.
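The statistics listed above can be checked on a small made-up dataframe. Calling describe() on a single string column is used here to get the count, unique, top and freq summary.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "city": ["Oslo", "Rome", "Oslo", "Paris"],
})

numeric = df.describe()                        # numeric columns only
text = df["city"].describe()                   # count, unique, top, freq
custom = df.describe(percentiles=[0.1, 0.9])   # custom percentiles
print(numeric, text, custom, sep="\n")
```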

0.17 Apply
Example: transform a column from numbers to strings
df['column'].apply(lambda x: str(x))
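Note that apply returns a new Series and leaves the dataframe unchanged, so the result has to be assigned back to keep it. A minimal sketch with made-up data, with astype(str) shown as an alternative for this particular conversion:

```python
import pandas as pd

df = pd.DataFrame({"code": [1, 2, 3]})

# apply calls the function once per value and returns a new Series.
df["code_str"] = df["code"].apply(lambda x: str(x))

# For a plain type conversion, astype gives the same values.
df["code_str2"] = df["code"].astype(str)
print(df)
```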

0.18 Group by and Aggregate functions


To group a dataframe:
groupedFrame = df.groupby([<list of columns>])

Then we can apply aggregate functions:


groupedFrame.mean(numeric_only=True)   # the average; there is no avg() method
groupedFrame.count()                   # count() takes no numeric_only argument
groupedFrame.min(numeric_only=True)
groupedFrame.max(numeric_only=True)
# etc.
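A small made-up example of grouping and aggregating. numeric_only=True skips the string column when computing the mean, while count() counts non-null values in every column and size() gives the number of rows per group.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b"],
    "points": [1, 3, 5],
    "player": ["x", "y", "z"],
})

grouped = df.groupby("team")

means = grouped.mean(numeric_only=True)   # mean of numeric columns per team
counts = grouped.count()                  # non-null values per column and team
sizes = grouped.size()                    # number of rows per team
print(means, counts, sizes, sep="\n")
```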

0.18.1 agg function


Using agg() with groupby in Pandas allows you to perform aggregation operations
on groups of data within a DataFrame. This is particularly useful for
summarizing data by categories or groups.
import pandas as pd

# Sample DataFrame
data = {
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Values1': [1, 2, 3, 4, 5, 6],
    'Values2': [10, 20, 30, 40, 50, 60]
}

df = pd.DataFrame(data)

# Group by 'Category' and aggregate with multiple functions
result = df.groupby('Category').agg({
    'Values1': ['mean', 'sum'],
    'Values2': ['min', 'max']
})

print(result)

Result:

         Values1     Values2
            mean sum     min max
Category
A            1.5   3      10  20
B            3.5   7      30  40
C            5.5  11      50  60

0.19 Merge
The merge function combines two DataFrames based on a common key.
import pandas as pd

df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value': [1, 2, 3, 4]
})

df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value': [5, 6, 7, 8]
})

# Keep only the keys present in both frames; the two 'value' columns
# come back as value_x (from df1) and value_y (from df2).
result = pd.merge(df1, df2, on='key', how='inner')
print(result)
Listing 1: Merge DataFrames
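The how parameter controls which keys are kept. The sketch below reuses the same two dataframes to compare an inner, a left and an outer merge. The suffixes and indicator arguments are optional extras used here for illustration: suffixes renames the overlapping value columns, and indicator adds a _merge column telling where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5, 6, 7, 8]})

inner = pd.merge(df1, df2, on="key", how="inner")   # keys in both frames
left = pd.merge(df1, df2, on="key", how="left",
                suffixes=("_left", "_right"))       # all keys of df1
outer = pd.merge(df1, df2, on="key", how="outer",
                 indicator=True)                    # keys of either frame
print(inner, left, outer, sep="\n")
```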

0.20 Join
The join function combines two DataFrames based on their indices.
import pandas as pd

df1 = pd.DataFrame({
    'value1': [1, 2, 3, 4]
}, index=['A', 'B', 'C', 'D'])

df2 = pd.DataFrame({
    'value2': [5, 6, 7, 8]
}, index=['B', 'D', 'E', 'F'])

# Keep only the index labels present in both frames ('B' and 'D')
result = df1.join(df2, how='inner')
print(result)
Listing 2: Join DataFrames

0.21 Concat
The concat function combines two DataFrames along a specified axis.
import pandas as pd

df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})

df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

# Default axis=0: stack df2 below df1; the row labels 0, 1, 2 repeat
result = pd.concat([df1, df2])
print(result)
Listing 3: Concat DataFrames
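Two common variations, sketched on made-up data: ignore_index=True renumbers the rows after stacking, and axis=1 places the dataframes side by side instead of one below the other.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"]})

# Stack rows and renumber the index 0..3.
stacked = pd.concat([df1, df2], ignore_index=True)

# Place side by side; df2's column is renamed to avoid two "A" columns.
side = pd.concat([df1, df2.rename(columns={"A": "B"})], axis=1)
print(stacked, side, sep="\n")
```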

0.22 Plotting
We use df.plot() to plot, or df.plot.<plot name>() (plot.scatter, plot.barh, ...).

The df.plot method in Pandas offers a variety of parameters to customize
plots. Here is a detailed explanation of these parameters:

• x:
– Description: Column name(s) or position(s) for the x-axis.
– Type: str or list of str or int or list of int
– Default: The index of the DataFrame.
• y:
– Description: Column name(s) or position(s) for the y-axis.
– Type: str or list of str or int or list of int
– Default: The columns not specified in x.
• kind:
– Description: Type of plot to be generated.
– Type: str
– Options: ’line’, ’bar’, ’barh’, ’hist’, ’box’, ’kde’, ’density’,
’area’, ’pie’, ’scatter’, etc.
– Default: ’line’
• ax:

– Description: Matplotlib axes object to which the plot is added.


– Type: matplotlib.axes.Axes or None
– Default: None

• figsize:

– Description: Size of the figure (width, height) in inches.


– Type: tuple of (float, float)
– Default: None (Matplotlib's default figure size is used)

• subplots:

– Description: Create a separate subplot for each column.


– Type: bool
– Default: False

• title:

– Description: Title of the plot.


– Type: str
– Default: None

• grid:

– Description: Whether to show grid lines.


– Type: bool
– Default: None (follows the Matplotlib style default; pass True to show grid lines)

• legend:

– Description: Whether to show the legend.


– Type: bool
– Default: True if the plot contains multiple series.

• xlabel:

– Description: Label for the x-axis.


– Type: str
– Default: None

• ylabel:

– Description: Label for the y-axis.


– Type: str
– Default: None

• color:
– Description: Color of the plot elements.
– Type: str or list of str
– Default: Cycle through Matplotlib default colors.

• style:
– Description: Line style or marker for the plot.
– Type: str or list of str
– Default: Matplotlib default styles.

• alpha:
– Description: Transparency level of the plot elements.
– Type: float (0.0 to 1.0)
– Default: None

• rot:
– Description: Rotation angle for the x-axis labels.
– Type: int or float
– Default: None

• logx:
– Description: Use logarithmic scaling for the x-axis.
– Type: bool
– Default: False

• logy:
– Description: Use logarithmic scaling for the y-axis.
– Type: bool
– Default: False

• loglog:
– Description: Use logarithmic scaling for both x and y axes.
– Type: bool
– Default: False

• xerr and yerr:


– Description: Error bars for the x and y data.
– Type: float or DataFrame or Series

– Default: None
• sharex:
– Description: Share the x-axis with other subplots.
– Type: bool
– Default: True if no ax is passed, otherwise False (only relevant when subplots=True)
• sharey:
– Description: Share the y-axis with other subplots.
– Type: bool
– Default: False (only relevant when subplots=True)
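Several of the parameters above can be combined in one call, as in this sketch. It assumes Matplotlib is installed; the data, the title and the labels are made up, and the "Agg" backend is selected only so that the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed

import pandas as pd

df = pd.DataFrame({
    "year": [2020, 2021, 2022],
    "sales": [10, 15, 12],
    "costs": [7, 9, 8],
})

# One line per y column, drawn on a single Matplotlib Axes.
ax = df.plot(
    x="year",
    y=["sales", "costs"],
    kind="line",
    figsize=(6, 4),
    title="Sales vs costs",
    grid=True,
    xlabel="Year",
    ylabel="Amount",
)
print(type(ax))
```

df.plot returns the Matplotlib Axes, so the figure can then be saved with ax.figure.savefig(<file name>) or adjusted further with Matplotlib calls.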
Chapter 1

Data cleaning

We can clean our data with functions such as:

1.1 Delete duplicates


df = df.drop_duplicates()

1.2 Drop rows


df = df.drop(<row label (as used with loc)>)

1.3 Drop columns


df = df.drop(columns=<column name> | [<list of columns>])
# or
df.drop(columns=<column name> | [<list of columns>], inplace=True)

1.4 Drop NaN


df = df.dropna(subset=['<column>'])

1.5 Fillna
It is used to fill NaN values.
# replace NaN with an empty string
df = df.fillna("")
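The cleaning functions of this chapter can be tried together on a small made-up dataframe that contains one duplicated row and two missing values. Passing a dictionary to fillna, as done here, is one way to use a different replacement per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", None],
    "age": [31, 31, np.nan, 40],
})

df = df.drop_duplicates()                      # removes the repeated "Ann" row
with_age = df.dropna(subset=["age"])           # drops rows where age is NaN
filled = df.fillna({"name": "unknown", "age": 0})   # per-column replacements
without_age = df.drop(columns=["age"])         # drops a column
without_first = df.drop(index=0)               # drops the row labelled 0
print(filled)
```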
