Data Science - Sec4

Data Science: Section 4
Pandas (cont.)
• .str.startswith(): Tests whether the start of each string element matches a pattern.
✓ The .str accessor can only be used with string values!
• .str.endswith(): Same as startswith(), but tests the end of each string.
• .str.contains(): Tests whether each string element contains a pattern.
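The three string tests above can be sketched as follows. The Series and its names are made-up illustration data, not part of the course data set. Each call returns a boolean Series that can be used to filter the original.

```python
import pandas as pd

# Hypothetical example data (not from the course data set)
names = pd.Series(["Alice", "Bob", "Alina", "Carol"])

# Each method returns a boolean Series, one True/False per element
starts_with_al = names.str.startswith("Al")
ends_with_l = names.str.endswith("l")
contains_o = names.str.contains("o")

# A boolean Series can be used as a filter
print(names[starts_with_al])
```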
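loc vs. iloc
• loc gets rows (and/or columns) with particular labels.
• iloc gets rows (and/or columns) at integer locations.
• Both look the same when accessing a single row if the index labels are the default integers 0, 1, 2, ..., because labels and positions then coincide.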
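A minimal sketch of the difference, using a small made-up DataFrame (the column names and values are assumptions for illustration). With the default index the two accessors agree for a single row. Once the index holds labels such as names, loc needs the label while iloc still takes a position.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"name": ["Ann", "Ben", "Cid"], "age": [25, 32, 41]})

# Default index 0, 1, 2: label 1 and position 1 are the same row
same_row = df.loc[1].equals(df.iloc[1])

# With a label index, loc uses the label and iloc uses the position
by_name = df.set_index("name")
ben_by_label = by_name.loc["Ben"]
ben_by_position = by_name.iloc[1]

# Slices differ: loc includes the end label, iloc excludes the end position
rows_loc = df.loc[0:1]    # rows labelled 0 and 1
rows_iloc = df.iloc[0:1]  # only the row at position 0

print(same_row, len(rows_loc), len(rows_iloc))
```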
Pandas - Cleaning Data
• Data cleaning means fixing bad data in your
data set.
• Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
Empty cells
• Remove Rows
• One way to deal with empty cells is to remove rows that contain
empty cells.
• This is usually OK, since data sets can be very big, and removing a
few rows will not have a big impact on the result.
• dropna(): Returns a new DataFrame without the rows that contain empty cells.
• Note: By default, dropna() returns a new DataFrame and does not change the original.
• To change the original DataFrame, pass the inplace=True argument.
• dropna(inplace=True) does NOT return a new DataFrame; it removes all rows containing NULL values from the original DataFrame.
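The behaviour described above can be sketched like this. The column names and values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with two empty cells
df = pd.DataFrame({
    "Duration": [60, 45, np.nan, 30],
    "Pulse": [110, np.nan, 100, 95],
})

# Default: a new DataFrame is returned, the original is untouched
cleaned = df.dropna()
rows_before = len(df)

# inplace=True: the original is modified and nothing is returned
result = df.dropna(inplace=True)
rows_after = len(df)

print(len(cleaned), rows_before, rows_after, result)
```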
dropna()
• dropna(axis=value): axis {0 or 'index', 1 or 'columns'}, default 0.
• 0, or 'index': Drop rows which contain missing values.
• 1, or 'columns': Drop columns which contain missing values.
• Only a single axis is allowed.
• dropna(how=value): how {'any', 'all'}, default 'any'.
• 'any': If any NA values are present, drop that row or column.
• 'all': If all values are NA, drop that row or column.
• dropna(thresh=value): thresh (int, optional).
• Keep only the rows or columns that have at least that many non-NA values. Cannot be combined with how.
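A short sketch of how the axis, how and thresh parameters change the result, on a made-up DataFrame (columns a, b, c are placeholders chosen for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely empty, row 3 has one value
df = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [np.nan, np.nan, 6, np.nan],
    "c": [7, np.nan, 9, 10],
})

rows_any = df.dropna()               # drop rows with any NA (default)
rows_all = df.dropna(how="all")      # drop rows only if every value is NA
rows_thresh = df.dropna(thresh=2)    # keep rows with at least 2 non-NA values

# axis="columns" applies the same rules to columns instead of rows
cols_thresh = df.dropna(axis="columns", thresh=3)

print(rows_any.index.tolist(), rows_all.index.tolist(),
      rows_thresh.index.tolist(), cols_thresh.columns.tolist())
```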
Empty cells
• Replace Empty Values
• Another way of dealing with empty cells is to insert a new value instead.
• This way you do not have to delete entire rows just because of some empty cells.
• fillna(value): Replaces empty cells with the given value.
• Replace Using Mean, Median, or Mode:
• Pandas provides the mean(), median() and mode() methods to calculate the respective values for a specified column.
• Mean = the average value (the sum of all values divided by number of values).
• Median = the value in the middle, after you have sorted all values ascending.
• Mode = the value that appears most frequently.
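As a minimal sketch, the replacements above could look like this. The "Calories" column and its values are assumptions made up for the example. Note that mode() returns a Series (there can be several modes), so the first entry is selected with [0].

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one empty cell
df = pd.DataFrame({"Calories": [410.0, np.nan, 300.0, 300.0, 250.0]})

# Replace the empty cell with a fixed value
filled_const = df["Calories"].fillna(130)

# Replace it with the mean, median, or mode of the column
mean_value = df["Calories"].mean()
median_value = df["Calories"].median()
mode_value = df["Calories"].mode()[0]  # mode() returns a Series

filled_mean = df["Calories"].fillna(mean_value)
filled_median = df["Calories"].fillna(median_value)
filled_mode = df["Calories"].fillna(mode_value)

print(mean_value, median_value, mode_value)
```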
Data of Wrong Format
• Cells with data of wrong format can make it difficult,
or even impossible, to analyze data.
• To fix it, you have two options: remove the rows, or
convert all cells in the columns into the same format.
• In our DataFrame, we have two cells with the wrong format.
• Check rows 22 and 26: the 'Date' column should be a string that represents a date.
• Pandas has a to_datetime() function that converts a Series or DataFrame to pandas datetime values.
• Let's try to convert all cells in the 'Date' column into dates.
• As you can see from the result, the date in row 26 was
fixed, but the empty date in row 22 got a NaT (Not a Time)
value, in other words an empty value. One way to deal with
empty values is simply removing the entire row.
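The conversion and the follow-up row removal can be sketched as below. This is not the course data set: the three dates are invented, and they deliberately share one string format, since parsing a column with mixed date formats may need extra arguments depending on the pandas version.

```python
import pandas as pd

# Hypothetical example data: one empty date, all others in the same format
df = pd.DataFrame({
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Pulse": [110, 117, 103],
})

# Convert the string column to datetime values; the empty cell becomes NaT
df["Date"] = pd.to_datetime(df["Date"])
empty_dates = df["Date"].isna().sum()

# Remove the rows whose 'Date' is empty (NaT)
cleaned = df.dropna(subset=["Date"])

print(df["Date"].dtype, empty_dates, len(cleaned))
```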
Replacing Values
• One way to fix wrong values is to replace them with something else.
Removing Rows
• Another way of handling wrong data is to remove the rows that contain wrong data.
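Both approaches can be sketched on a small made-up DataFrame. The "Duration" column, the value 450, and the rule that anything above 120 counts as wrong are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical example data: 450 in row 1 is assumed to be a typo for 45
df = pd.DataFrame({
    "Duration": [60, 450, 45, 60],
    "Pulse": [110, 104, 109, 117],
})

# Replace a single wrong cell by row label and column name
fixed = df.copy()
fixed.loc[1, "Duration"] = 45

# Replace every value above an assumed limit of 120 with the limit itself
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Remove the rows that break the assumed rule instead of fixing them
removed = df[df["Duration"] <= 120]

print(fixed, capped, removed, sep="\n\n")
```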
Practical section
Steps:
• Download the data set from this link: http://tiny.cc/my_data_sec4
• Import pandas
• Load “my_data.csv” file
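The import and load steps might look like the sketch below. It assumes the downloaded file was saved as "my_data.csv" in the working directory. The load_data helper is a hypothetical name introduced here, and the existence check only keeps the snippet from failing before the file has been downloaded.

```python
from pathlib import Path

import pandas as pd


def load_data(path="my_data.csv"):
    """Read a CSV file into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


# Only load when the file has actually been downloaded
if Path("my_data.csv").exists():
    df = load_data()
    print(df.head())  # first five rows
    df.info()         # column types and non-null counts
```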
