Data Science - Sec4
Pandas (cont.)
• .str.startswith(): Tests whether each string element
starts with a given string (regular expressions are not accepted).
✓ The .str accessor can only be used with string values!
• .str.endswith(): Same as startswith(), but
tests the end of each string.
• .str.contains(): Tests whether each string element
contains a pattern (treated as a regular expression by default).
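A minimal sketch of the three string tests. The Series and its values are made up for illustration and are not part of the course data set.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the .str methods.
names = pd.Series(["Alice", "Bob", "Alina", "Carol"])

# Each call returns a boolean Series, one value per element.
starts_with_al = names.str.startswith("Al")
ends_with_ol = names.str.endswith("ol")
contains_li = names.str.contains("li")

# A boolean Series can be used to filter the original data.
print(names[starts_with_al])
```

The same calls work on a DataFrame column, for example df["Name"].str.contains("li"), as long as the column holds strings.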
loc vs. iloc
• loc gets rows (and/or columns) with particular
labels.
• iloc gets rows (and/or columns) at integer
locations.
• Both look the same when you access a single row whose label equals its integer position (as with the default index).
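A small sketch of the difference, assuming a made-up DataFrame with string row labels (the column name and labels are illustrative only).

```python
import pandas as pd

# Hypothetical data with explicit row labels "a", "b", "c".
df = pd.DataFrame({"score": [90, 75, 60]}, index=["a", "b", "c"])

by_label = df.loc["b"]        # row whose label is "b"
by_position = df.iloc[1]      # row at integer position 1 (the second row)

# Slices differ too: loc includes the end label, iloc excludes the end position.
label_slice = df.loc["a":"b"]     # rows "a" and "b"
position_slice = df.iloc[0:2]     # positions 0 and 1

# With the default integer index, labels and positions coincide.
default_df = df.reset_index(drop=True)
same = default_df.loc[1].equals(default_df.iloc[1])
```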
Pandas - Cleaning Data
• Data cleaning means fixing bad data in your
data set.
• Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
Empty cells
• Remove Rows
• One way to deal with empty cells is to remove rows that contain
empty cells.
• This is usually OK, since data sets can be very big, and removing a
few rows will not have a big impact on the result.
• dropna(): Returns a new DataFrame without the rows that contain empty cells.
• Note: By default, the dropna() method returns a new DataFrame
and does not change the original.
• If you want to change the original DataFrame, use the inplace=True
argument.
• dropna(inplace=True) does NOT return a new DataFrame; it
removes all rows containing NULL values from the original
DataFrame.
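A minimal sketch of both behaviours. The column names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two empty cells (NaN).
df = pd.DataFrame({
    "Duration": [60, 45, np.nan],
    "Pulse": [110, np.nan, 100],
})

# Default: a new DataFrame is returned and df itself is left unchanged.
new_df = df.dropna()

# inplace=True: the DataFrame is modified directly and None is returned.
df_copy = df.copy()
result = df_copy.dropna(inplace=True)

print(len(df), len(new_df), len(df_copy), result)
```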
dropna( )
• dropna(axis=value): axis {0 or ‘index’, 1 or ‘columns’}, default 0.
• 0, or ‘index’ : Drop rows which contain missing values.
• 1, or ‘columns’ : Drop columns which contain missing values.
• Only a single axis is allowed.
• dropna(how=value): how {‘any’, ‘all’}, default ‘any’.
• ‘any’ : If any NA values are present, drop that row or column.
• ‘all’ : If all values are NA, drop that row or column.
• dropna(thresh=value): thresh (int, optional).
• Keep only rows or columns with at least that many non-NA values. Cannot be combined with how.
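The parameters above can be tried on a small made-up DataFrame. The column names A, B, and C are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column B is entirely empty, row 2 is entirely empty.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, 4],
    "B": [np.nan, np.nan, np.nan, np.nan],
    "C": [1, 2, np.nan, 4],
})

rows_any = df.dropna(axis=0, how="any")   # drop rows with at least one NA
rows_all = df.dropna(axis=0, how="all")   # drop rows where every value is NA
cols_all = df.dropna(axis=1, how="all")   # drop columns where every value is NA
rows_thresh = df.dropna(thresh=2)         # keep rows with at least 2 non-NA values

print(len(rows_any), len(rows_all), list(cols_all.columns), list(rows_thresh.index))
```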
Empty cells
• Replace Empty Values
• Another way of dealing with empty cells is to insert a new value instead.
• This way you do not have to delete entire rows just because of some empty cells.
• fillna(value): Replaces empty cells with the given value.
• Replace Using Mean, Median, or Mode:
• Pandas provides the mean(), median() and mode() methods to calculate the respective
values for a specified column.
• Mean = the average value (the sum of all values divided by number of values).
• Median = the value in the middle, after you have sorted all values ascending.
• Mode = the value that appears most frequently.
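A short sketch of filling empty cells with a constant, the mean, the median, or the mode. The "Calories" column and its values are assumptions made for illustration. Note that mode() returns a Series (there can be several modes), so the first entry is taken with [0].

```python
import numpy as np
import pandas as pd

# Hypothetical column with two empty cells.
df = pd.DataFrame({"Calories": [300.0, np.nan, 300.0, 420.0, np.nan]})

mean_value = df["Calories"].mean()        # empty cells are skipped
median_value = df["Calories"].median()
mode_value = df["Calories"].mode()[0]     # first (here: only) mode

filled_constant = df["Calories"].fillna(130)
filled_mean = df["Calories"].fillna(mean_value)
filled_median = df["Calories"].fillna(median_value)
filled_mode = df["Calories"].fillna(mode_value)

# To keep the result in the DataFrame, assign it back to the column.
df["Calories"] = filled_mean
```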
Data of Wrong Format
• Cells with data of wrong format can make it difficult,
or even impossible, to analyze data.
• To fix it, you have two options: remove the rows, or
convert all cells in the columns into the same format.
• In our DataFrame, we have two cells with the wrong
format.
• Check rows 22 and 26: the 'Date' column should
contain a string that represents a date.
• Pandas has a to_datetime() function that converts a DataFrame or
Series to pandas datetime values.
• Let's try to convert all cells in the 'Date' column into dates.
• As you can see from the result, the date in row 26 was
fixed, but the empty date in row 22 got a NaT (Not a Time)
value, in other words an empty value. One way to deal with
empty values is simply removing the entire row.
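A minimal sketch of the conversion and the follow-up row removal. It does not use the course data set; the three rows below are made up, with one empty date standing in for the empty cell described above. All dates use a single format here, because how mixed formats are parsed depends on the pandas version.

```python
import pandas as pd

# Hypothetical data: the second row has an empty 'Date' cell.
df = pd.DataFrame({
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Pulse": [110, 117, 103],
})

# Convert the column of strings to datetime values; the empty cell becomes NaT.
df["Date"] = pd.to_datetime(df["Date"])

# Remove the rows whose 'Date' is empty (NaT).
cleaned = df.dropna(subset=["Date"])

print(df)
print(cleaned)
```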
Replacing Values
• One way to fix wrong values is to replace them with something else.
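A small sketch of two ways to replace a wrong value: fixing a single known cell, and applying a rule to a whole column. The "Duration" column, the value 450, and the limit of 120 are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: 450 is assumed to be a typo.
df = pd.DataFrame({"Duration": [60, 450, 45]})

# Option 1: replace one known cell, addressed by row label and column name.
single_fix = df.copy()
single_fix.loc[1, "Duration"] = 45

# Option 2: replace every value above an assumed limit with that limit.
rule_fix = df.copy()
rule_fix.loc[rule_fix["Duration"] > 120, "Duration"] = 120

print(single_fix["Duration"].tolist(), rule_fix["Duration"].tolist())
```

Option 1 suits small data sets where the bad cell is known; option 2 scales to large data sets but needs a sensible boundary.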
Removing Rows
• Another way of handling wrong data is to remove the rows that contain wrong data.
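A minimal sketch of removing rows with wrong data, assuming a made-up "Duration" column where any value above 120 counts as wrong.

```python
import pandas as pd

# Hypothetical data: 450 is assumed to be wrong.
df = pd.DataFrame({"Duration": [60, 450, 45]})

# Find the labels of the rows that break the assumed rule, then drop them.
wrong_rows = df[df["Duration"] > 120].index
cleaned = df.drop(wrong_rows)

print(cleaned)
```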
Practical section
Steps:
• Download the data set from this link:
• http://tiny.cc/my_data_sec4
• Import pandas
• Load “my_data.csv” file
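The import and load steps can be sketched as follows, assuming the downloaded file is saved as "my_data.csv" in the working directory. The helper name load_data is hypothetical, and the file's columns are not assumed.

```python
from pathlib import Path

import pandas as pd


def load_data(path="my_data.csv"):
    """Read the CSV file at `path` into a DataFrame."""
    return pd.read_csv(path)


# Only load and preview the data if the file has actually been downloaded.
if Path("my_data.csv").exists():
    df = load_data()
    print(df.head())    # first five rows
    df.info()           # column types and non-null counts
```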