IP XII U1 Ch3 DataHandling (DataFrame) Final

Revised as per CBSE Curriculum 2021-22
Informatics Practices, Class XII (CBSE Board)
Unit 1: Data Handling using Pandas & Data Visualization
Chapter 3: Data Handling using Pandas-II (Descriptive Statistics & Advanced Operations on DataFrames)
Open Teaching-Learning Material, Kendriya Vidyalaya Khanapara. Visit www.ip4you.blogspot.com for more. e-mail: [email protected]

Expected Learning Outcomes (CBSE Syllabus 2021-22) covered in this presentation:
- Descriptive statistics: max, min, count, sum, mean, median, mode, quartile, standard deviation, variance.
- DataFrame operations: aggregation, group by, sorting, deleting and renaming indexes, pivoting.
- Handling missing values: dropping and filling.
- Importing/exporting data between a MySQL database and Pandas.

In this presentation you will learn about data handling using Pandas and its basic concepts, such as:
- What are descriptive statistics?
- Applying various statistical functions on a DataFrame.
- Advanced operations on DataFrames: aggregation, grouping, sorting, pivoting, etc.
- Handling missing values in a DataFrame: dropping and filling.
- Importing/exporting data between a MySQL database and a Pandas DataFrame.

Introduction to Descriptive Statistics
- Statistics, a branch of mathematics, deals with the collection, organization, analysis and interpretation of data.
- Descriptive statistics summarize and organize bulk data so that interpretation or analysis becomes easy. In other words, they are the methods used to get a statistical picture of the data.
- Pandas provides various statistical functions for data analysis, which can be applied to data sets.
- Apart from statistical functions, Pandas also provides aggregate functions to summarize data.
- These functions/methods can be applied to the rows or columns of a DataFrame to get summarized or aggregated values.
Pandas Aggregate Functions
  count()  Counts the number of non-NaN values.
  sum()    Calculates the sum of a given set of numbers.
  min()    Finds the minimum value of a given data set.
  max()    Finds the maximum value of a given data set.

Statistical Functions
  mean()     Calculates the arithmetic mean (average) of a given set of numbers.
  median()   Calculates the median (middle value) of a given set of numbers.
  mode()     Calculates the mode (most repeated value) of a given set of numbers.
  std()      Calculates the standard deviation of a given set of numbers.
  var()      Calculates the variance of a given set of numbers.
  quantile() Calculates the quantiles (quartiles by default) of a given set of numbers.

How to apply Aggregate & Statistical functions
Consider the following data set of students' scores in the Unit Test (UT), Half Yearly (HY) and Session Ending Exam (SEE) of three students in five subjects.

  import pandas as pd
  marks = {'Name': ['Amar','Amar','Amar','Akbar','Akbar','Akbar','Anthony','Anthony','Anthony'],
           'Exam': ['UT','HY','SEE','UT','HY','SEE','UT','HY','SEE'],
           'English': [22,21,14,20,23,22,23,33,24],
           'Hindi':   [21,22,19,25,35,28,39,24,41],
           'Science': [18,17,15,22,21,19,20,24,23],
           'Maths':   [20,22,24,24,25,23,15,33,32],
           'SST':     [21,24,23,19,15,13,22,34,32]}
  df = pd.DataFrame(marks)
  print(df)

We can apply aggregate and statistical functions on a DataFrame row-wise (axis=1) or column-wise (axis=0). Aggregate functions can be applied in the following ways, as per requirement:

  On the whole DataFrame    DF.function()            e.g. df.sum() -- the default axis is 0, so it is applied along the rows of every column.
  On selected column(s)     DF['Col'].function()     e.g. df['Science'].sum() -- applied on the selected column.
  On selected rows          DF.iloc[:].function()    e.g. df.iloc[2:6].sum() -- applied on rows selected with loc/iloc.
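The marks data and the axis rules above can be exercised with a short runnable sketch. The data values are taken from the slides; the totals in the comments are computed from those values, since the slide screenshots of the output did not survive extraction.

```python
import pandas as pd

# The chapter's marks dataset (values taken from the slides).
marks = {
    'Name': ['Amar', 'Amar', 'Amar', 'Akbar', 'Akbar', 'Akbar',
             'Anthony', 'Anthony', 'Anthony'],
    'Exam': ['UT', 'HY', 'SEE'] * 3,
    'English': [22, 21, 14, 20, 23, 22, 23, 33, 24],
    'Hindi':   [21, 22, 19, 25, 35, 28, 39, 24, 41],
    'Science': [18, 17, 15, 22, 21, 19, 20, 24, 23],
    'Maths':   [20, 22, 24, 24, 25, 23, 15, 33, 32],
    'SST':     [21, 24, 23, 19, 15, 13, 22, 34, 32],
}
df = pd.DataFrame(marks)

# Column-wise (axis=0) is the default; numeric_only skips 'Name'/'Exam'.
col_sums = df.sum(numeric_only=True)           # one total per subject
row_sums = df.sum(axis=1, numeric_only=True)   # one total per exam row
print(col_sums['English'])   # 202
print(row_sums[0])           # 22+21+18+20+21 = 102
```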
  On a subset of the DataFrame        DF.iloc[:, :].function()   e.g. df.iloc[3:5, 2:6].sum() -- applied on rows and columns selected with loc/iloc.
  On a conditional selection of rows  DF[condition].function()   e.g. df[df['Exam'] == 'UT'].sum() -- applied on the set of rows filtered on a condition.

You can pass axis=1 with any of these aggregate functions to apply them row-wise (along rows).

Aggregate Function: count()
Pandas count() counts the number of non-NaN values of a DataFrame or Series. It works with strings also.
  .count(axis=0/1, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise; the default is 0. numeric_only=True ignores non-numeric columns; False is the default.

  # counting values column-wise (axis=0)
  print(df.count())
  # counting values of the numeric columns only
  print(df.count(numeric_only=True))
  # counting values row-wise (axis=1)
  print(df.count(axis=1))
  # counting values of the 'English' column
  print(df['English'].count())
  # counting rows having more than 20 marks in Science
  df1 = df[df['Science'] > 20].count()
  print(df1['Science'])

Aggregate Function: sum()
Pandas sum() returns the total of the non-null values of a DataFrame or Series. On string data it concatenates the strings.
  .sum(axis=0/1, skipna=True/False, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise; the default is 0. skipna=True (the default) ignores NaN values. numeric_only=True ignores non-numeric columns.

  # column-wise sum of the numeric columns
  print(df.sum(numeric_only=True))
  # row-wise sum (axis=1)
  print(df.sum(axis=1, numeric_only=True))
  # sum of the 'English' column
  print(df['English'].sum())
  # sum of the 'Hindi' to 'Maths' columns
  print(df.loc[:, 'Hindi':'Maths'].sum())

Aggregate Function: min()
Pandas min() returns the minimum values of a DataFrame or Series.
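Since count() is defined by skipping NaN, a tiny frame with one missing value (hypothetical data, not from the slides) makes the column-wise and row-wise behaviour visible:

```python
import pandas as pd
import numpy as np

# One missing English mark; Hindi is complete.
df = pd.DataFrame({'English': [22, 21, np.nan], 'Hindi': [21, 22, 19]})
per_col = df.count()        # English 2, Hindi 3 -- NaN is not counted
per_row = df.count(axis=1)  # non-NaN values per row: 2, 2, 1
print(per_col['English'])   # 2
print(list(per_row))        # [2, 2, 1]
```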
  .min(axis=0/1, skipna=True/False, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise; the default is 0. skipna=True (the default) ignores NaN values. numeric_only=True ignores non-numeric columns.

  # column-wise minimum (axis=0)
  print(df.min())
  # column-wise minimum of the numeric columns only
  print(df.min(numeric_only=True))
  # student-wise (row-wise) minimum marks
  print(df.min(axis=1, numeric_only=True))
  # minimum in the 'English' subject
  print(df['English'].min())
  # minimum from the 'Hindi' to 'Maths' subjects
  print(df.loc[:, 'Hindi':'Maths'].min())

Aggregate Function: max()
Pandas max() returns the maximum values of a DataFrame or Series. The parameters are the same as for min().

  # column-wise maximum (axis=0)
  print(df.max(axis=0))
  # column-wise maximum of the numeric columns only
  print(df.max(numeric_only=True))
  # student-wise (row-wise) maximum marks
  print(df.max(axis=1, numeric_only=True))
  # maximum in the 'Science' subject
  print(df['Science'].max())
  # maximum from the 'Hindi' to 'Maths' subjects
  print(df.loc[:, 'Hindi':'Maths'].max())

Statistical Function: mean()
mean() returns the arithmetic mean (average) of a DataFrame or Series.
  .mean(axis=0/1, skipna=True/False, numeric_only=True/False)
axis=0 for column-wise and 1 for row-wise; the default is 0. skipna=True (the default) ignores NaN values. numeric_only=True ignores non-numeric columns.

  # column-wise average (axis=0)
  print(df.mean(numeric_only=True))
  # student-wise average (axis=1)
  print(df.mean(axis=1, numeric_only=True))
  # average of the 'Science' subject
  print(df['Science'].mean())
  # average of Amar in all exams
  print(df[df['Name'] == 'Amar'].mean(numeric_only=True))

Statistical Function: median()
The median is the middle value of a sorted list or data set. median() calculates the median of a given set of numbers; the data set can be a Series or a DataFrame. For an even-sized data set, the median is the average of the two middle values. For example, a sorted set of 5 values (odd size) has its middle value as the median, say 8; a sorted set of 6 values (even size) with 8 and 10 in the middle has median (8+10)/2 = 9.

  # column-wise median (axis=0)
  print(df.median(numeric_only=True))   # same as df.median(axis=0, numeric_only=True)
  # student-wise (row-wise) median (axis=1)
  print(df.median(axis=1, numeric_only=True))
  # median of the 'Science' column
  print(df['Science'].median())

Statistical Function: mode()
The mode is the most frequent value in a data set. mode() calculates the mode of a given Series or DataFrame. If more than one value occurs the same, highest number of times, all of them are reported as modes. For example, if 6 and 10 each occur twice and every other value occurs once, the modes are 6 and 10.
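The odd/even median rule and the two-mode case described above can be checked directly on small hypothetical series (the numbers below are illustrative, not from the slides):

```python
import pandas as pd

s_odd = pd.Series([4, 6, 8, 10, 15])        # odd length: the middle value
s_even = pd.Series([4, 6, 8, 10, 15, 20])   # even length: mean of the two middles
print(s_odd.median())    # 8.0
print(s_even.median())   # (8+10)/2 = 9.0

s = pd.Series([6, 10, 6, 7, 10, 3])
# 6 and 10 each occur twice, so mode() returns both (sorted).
print(list(s.mode()))    # [6, 10]
```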
  .mode(axis=0/1, numeric_only=True/False, dropna=True/False)

  # subject-wise most frequent marks
  print(df.mode(axis=0, numeric_only=True))
  # row-wise most frequent marks
  print(df.mode(axis=1, numeric_only=True))

When a column has two equally frequent values (for example, a Hindi column in which both 24 and 35 appear twice), both values appear in the result, and Pandas pads the shorter columns with NaN.

Statistical Function: std()
std(): computing distance from the mean. Standard deviation is a quantity which tells the variation (difference) of the members of a data set from its mean value. A low standard deviation indicates that the values are close to the mean of the data set, while a high standard deviation indicates that the values are spread out over a wider range. The std() function calculates the standard deviation of a given Series or DataFrame.
  .std(axis=0/1, skipna=True/False, numeric_only=True/False)

  # subject-wise (column) standard deviation
  print(df.std(axis=0, numeric_only=True))
  # row-wise standard deviation
  print(df.std(axis=1, numeric_only=True))

Statistical Function: var()
var(): computing the variability of a group from its mean. Variance is a numerical value that describes the variability of observations from their arithmetic mean.
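One detail worth knowing (a fact about pandas, though the slides do not mention it): std() and var() use the sample formula (ddof=1) by default. The series below is hypothetical, chosen so the population standard deviation is exactly 2:

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])   # mean is 5, squared deviations sum to 32

# Sample standard deviation (pandas default, divides by n-1 = 7).
print(round(s.std(), 4))      # sqrt(32/7) ~ 2.1381
# Population standard deviation (divides by n = 8).
print(s.std(ddof=0))          # sqrt(32/8) = 2.0
print(s.var(ddof=0))          # 4.0
```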
In contrast to standard deviation, it tells how far individuals in a group are spread out from the mean in squared units: variance is simply the average of the squared deviations (and the standard deviation is its square root). The var() function calculates the variance of a given Series or DataFrame.
  .var(axis=0/1, skipna=True/False, numeric_only=True/False)

  # row-wise variance
  print(df.var(axis=1, numeric_only=True))
  # column-wise variance
  print(df.var(axis=0, numeric_only=True))

Statistical Function: quantile()
What is a quantile? Consider a data set and its median (middle value): the median divides the data set into 2 equal parts, so 50% of the values are less than the median and 50% are greater. If we divide the data set into 4 equal parts of 25% each, the cut points are called quartiles (quarters); into 5 parts (20% each), quintiles; and into 100 parts (1% each), percentiles. The quartiles correspond to the 25% (0.25), 50% (0.50), 75% (0.75) and 100% (1.0) points.

quantile(): dividing a data set in equal proportions. The Pandas quantile() function is used to get the quartiles. It gives the quantile of each column or row of the DataFrame. By default, it returns the second quartile (the median) of all numeric values.
  .quantile(q=<values>, axis=0/1, numeric_only=True/False)
q=: quantile value(s), e.g. 0.25/0.5/0.75/1.0; the default is 0.5. axis=0 for column-wise and 1 for row-wise; the default is 0. numeric_only=True ignores non-numeric columns.

  # column-wise quartile (q=0.5) on axis=0
  print(df.quantile(numeric_only=True))
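A small hypothetical series shows how quantile() computes the cut points; pandas interpolates linearly between data points when a quartile falls between two values:

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9, 11, 13])
print(s.quantile())       # default q=0.5 -> the median, 7.0
print(s.quantile(0.25))   # falls between 3 and 5 -> 4.0
print(s.quantile(0.75))   # falls between 9 and 11 -> 10.0
```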
  # row-wise quartile (q=0.5)
  print(df.quantile(q=0.5, axis=1, numeric_only=True))
  # all four quartiles, column-wise
  print(df.quantile(q=[0.25, 0.50, 0.75, 1.0], numeric_only=True))

Reading such a result: if the 0.75 quartile of the SST column is 23, then 75% of the students scored 23 marks or below in the SST subject.

Other Statistical Functions:
describe(): an all-in-one descriptive-statistics function. The Pandas describe() function returns the full descriptive statistical analysis -- count, mean, standard deviation, minimum, quartiles and maximum -- through one function. For columns having string data types, describe() returns the count of data, the number of unique entries and the most frequent string.

  import pandas as pd
  dct = {'Phy': [60,65,70,67],
         'Chem': [34,55,32,46],
         'Maths': [45,56,65,75]}
  df = pd.DataFrame(dct, index=['Amar', 'Akbar', 'Anthony', 'Manpreet'])
  print(df.describe())

Advanced operations on DataFrame
Pandas offers some advanced operations which can be applied on a DataFrame for summarizing, ordering and reshaping data as per need:
- Aggregation: summarizing data
- Sorting: ordering data
- Grouping: group-wise operations
- Handling missing data: removing and managing NaN values
- Pivoting: reshaping data

Advanced operations: Aggregation
aggregate() or agg(): applying several aggregate functions at once. The term 'aggregation' refers to summarizing a data set to provide a single numeric value. The aggregate functions count(), sum(), min(), max(), median(), mode(), etc., as discussed earlier, each produce a single value (a count, a sum, a minimum and so on). Sometimes it is required to apply multiple aggregate functions on the same data set.
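Using the Phy/Chem/Maths frame from the describe() slide, the one-call summary and a multi-function agg() look like this (the indexing into the results is added here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Phy': [60, 65, 70, 67],
                   'Chem': [34, 55, 32, 46],
                   'Maths': [45, 56, 65, 75]},
                  index=['Amar', 'Akbar', 'Anthony', 'Manpreet'])

# describe() bundles count/mean/std/min/quartiles/max into one result frame.
desc = df.describe()
print(desc.loc['mean', 'Phy'])     # (60+65+70+67)/4 = 65.5

# agg() applies several aggregates at once; the result has one row per function.
summary = df.agg(['max', 'min'])
print(summary.loc['max', 'Phy'])   # 70
```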
Pandas offers the aggregate() or agg() function to apply one or more aggregate functions at a time on the same data set.
  .aggregate(<function list>, axis=0/1)

  # refer to the DataFrame created earlier (Phy/Chem/Maths)
  # applying max() on axis=0
  print(df.aggregate('max'))            # same as df.max()
  # applying max() & sum() on axis=1
  print(df.aggregate(['max', 'sum'], axis=1))
  # applying max() & sum() on axis=0
  print(df.agg(['max', 'sum']))
  # applying max() & min() on the 'Phy' column
  print(df['Phy'].aggregate(['max', 'min']))

Advanced operations: Sorting
Sorting of DataFrames: the term 'sorting' refers to arranging a data set in a specified order, i.e. ascending (low to high) or descending (high to low). A DataFrame can be sorted on the basis of the values of columns or of its row/column labels. By default, sorting is done on the row index (label) in ascending order. Pandas provides the following two methods for sorting DataFrames:
  sort_values()  Sorts the DataFrame in ascending or descending order as per the values of the specified column(s).
  sort_index()   Sorts the DataFrame in ascending or descending order as per labels, i.e. row indexes (axis=0) or column indexes (axis=1).
Both functions create another DataFrame after sorting; the original is left unchanged.

sort_values(): sorting on the values of columns. The sort_values() function sorts the DataFrame as per the values of the given column(s).
  .sort_values(by=<columns>, axis=0/1, ascending=True/False)
by=: the column(s) to sort by; multiple columns are given as a list. axis=0 sorts rows (by a column's values) and axis=1 sorts columns (by a row's values). ascending=True for ascending (the default), False for descending order.

  print(df.sort_values('Chem'))
  print(df.sort_values(by='Akbar', axis=1))
  print(df.sort_values('Phy', ascending=False))
  print(df.sort_values(['Phy', 'Chem']))
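A runnable check of sort_values() on the same four-student frame (the sorted index orders in the comments are computed from the data shown):

```python
import pandas as pd

df = pd.DataFrame({'Phy': [60, 65, 70, 67],
                   'Chem': [34, 55, 32, 46],
                   'Maths': [45, 56, 65, 75]},
                  index=['Amar', 'Akbar', 'Anthony', 'Manpreet'])

# Ascending on the 'Chem' column: 32 < 34 < 46 < 55.
by_chem = df.sort_values('Chem')
print(list(by_chem.index))   # ['Anthony', 'Amar', 'Manpreet', 'Akbar']

# Descending on 'Phy': Anthony's 70 comes first.
by_phy = df.sort_values('Phy', ascending=False)
print(by_phy.index[0])       # 'Anthony'
```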
When two rows have the same marks in Phy (the primary sort column), those records are arranged as per their marks in Chem (the secondary sort column).

sort_index(): sorting on indexes/labels. The sort_index() function sorts the DataFrame as per the labels of the given axis.
  .sort_index(axis=0/1, ascending=True/False)
axis=0 sorts by row labels and axis=1 by column labels. ascending=True for ascending (the default), False for descending order.

  print(df.sort_index())
  print(df.sort_index(ascending=False))
  print(df.sort_index(axis=1))
  print(df.sort_index(axis=1, ascending=False))

Advanced operations: Reindexing
reindex(): reordering rows and columns. As you know, row indexes and column indexes can be arranged in ascending or descending order by using the sort_index() function. But if you want to arrange rows or columns in any user-defined order, the reindex() method can be used.
  .reindex(index=[row order], columns=[column order])

  # reindexing rows only
  print(df.reindex(index=[2, 4, 3, 1]))
  # reindexing both rows and columns
  print(df.reindex(index=[2, 4, 3, 1],
                   columns=['Eng', 'Chem', 'Phy', 'Maths']))

Advanced operations: Grouping
What is grouping? Grouping is a process of classifying records into subsets on some criteria, for group-wise operations. Consider the data collection of marks in the Unit Test, Half Yearly and Session Ending Exam of three students, and think about the following requirements:
- Prepare a student-wise report with the total of marks in all exams.
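A minimal sketch of sort_index() versus reindex(), on a two-row frame (hypothetical subset of the marks data): sort_index() can only sort labels, while reindex() accepts any explicit order.

```python
import pandas as pd

df = pd.DataFrame({'Phy': [67, 65], 'Chem': [46, 55]},
                  index=['Manpreet', 'Akbar'])

print(list(df.sort_index().index))          # rows by label: ['Akbar', 'Manpreet']
print(list(df.sort_index(axis=1).columns))  # columns by label: ['Chem', 'Phy']

# reindex() puts rows and columns in a user-defined order.
r = df.reindex(index=['Akbar', 'Manpreet'], columns=['Chem', 'Phy'])
print(r.iloc[0, 0])   # Akbar's Chem marks: 55
```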
- Prepare an exam-wise report with the mean of marks scored by the students.

For the data below, the two reports would be:

  Student-wise (Total)        Exam-wise (Mean)
  Amar      944               UT    330.33
  Akbar     916               HY    308.67
  Anthony   912               SEE   285.00

You may observe that these reports require applying an aggregate function (sum/mean) on student-wise or exam-wise groups (subsets) of records instead of on the whole DataFrame.

Grouping of Records: the Pandas groupby() function is used to create groups (based on some criteria) and then apply aggregate functions on the groups, instead of on the whole data set. groupby() works on a split-apply-combine strategy:
- Split the original DataFrame into groups by creating a GroupBy object.
- Apply the required aggregate function on each group.
- Combine the results to form a new DataFrame.

For example, splitting the marks data on Name gives one group per student (Amar: UT 350, HY 304, SEE 290; Akbar: UT 346, HY 310, SEE 260; Anthony: UT 295, HY 312, SEE 305). Applying sum() on each group and combining the results gives the student-wise totals 944, 916 and 912.
groupby(): grouping of records.
  .groupby(by=<column(s)>, axis=0/1)
Pandas groupby() creates the groups internally, and they can be stored in a GroupBy object. Once the GroupBy object has been created, the following commonly used attributes and methods can be applied on it to learn more about the groups:
  .groups                 Displays the list of groups created.
  .get_group(<value>)     Displays the group for the given value.
  .['col'].<aggfunc>()    Applies the given aggregate function on each group and displays the results.
  .count()                Counts the non-NaN values of each column of each group.
  .size()                 Displays the size of each group.
  .first()                Displays the first row of each group.

groupby() example (on the Name/Exam/Marks data):

  go = df.groupby('Name')
  print(go.groups)    # {'Akbar': [3, 4, 5], 'Amar': [0, 1, 2], 'Anthony': [6, 7, 8]}
  print(go.size())
  print(go.count())
  print(go.get_group('Amar'))
  print(go['Marks'].sum())   # group-wise totals: Akbar 916, Amar 944, Anthony 912

Multiple aggregate functions can also be applied on a column of each group. Since df.groupby('Name') can be used in place of go:

  print(df.groupby('Name')['Marks'].sum())
  print(df.groupby('Name')['Marks'].aggregate(['sum', 'mean', 'std']))
  # e.g. Akbar: sum 916, mean 305.33; Amar: sum 944, mean 314.67; Anthony: sum 912, mean 304.00

Advanced operations: Missing Data
Handling missing data: a DataFrame may contain multiple rows, and each row can have values for the corresponding columns. If a value corresponding to a column is not available, it is considered a missing value and is denoted by the None or NaN (Not a Number) constant. In the real world, during data collection, some columns may not be applicable to some individuals; for example, a salary column is irrelevant for an unemployed person and may be left blank. But this missing data may generate misleading or inaccurate information during data analysis and should be handled properly: the NaN values may be deleted or replaced by some relevant values to get a correct analysis. Pandas provides the following methods for handling NaN values:
  isnull()  Checks which rows or columns have NaN values.
  dropna()  Drops (deletes) the rows/columns having NaN (missing) values.
  fillna()  Fills some estimated value in place of each NaN (missing) value.
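The split-apply-combine flow above, end to end, using the Name/Exam/Marks values from the grouping slides:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amar', 'Amar', 'Amar', 'Akbar', 'Akbar', 'Akbar',
             'Anthony', 'Anthony', 'Anthony'],
    'Exam': ['UT', 'HY', 'SEE'] * 3,
    'Marks': [350, 304, 290, 346, 310, 260, 295, 312, 305],
})

go = df.groupby('Name')                 # split
totals = go['Marks'].sum()              # apply + combine
print(totals['Amar'])                   # 350+304+290 = 944

# Several aggregates at once on each group.
m = go['Marks'].agg(['sum', 'mean'])
print(m.loc['Akbar', 'sum'])            # 916
```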
This function can be applied on a Series, a DataFrame or a column of a DataFrame. We can also count the NaN values by chaining the sum() method after isnull().

  import pandas as pd
  import numpy as np
  dct = {'Name': ['Amar', 'Akbar', 'Anthony', 'Manpreet'],
         'Phy': [60, None, 70, 45],
         'Chem': [68, 65, None, 56],
         'Maths': [45, 56, 65, 65],
         'Eng': [45, np.nan, 64, np.nan]}
  df = pd.DataFrame(dct)
  print(df)
  print(df.isnull())
  print(df['Eng'].isnull())
  # getting the count/sum of NaN values
  print(df.isnull().sum())         # per column
  print(df.isnull().sum().sum())   # whole DataFrame -> 4

dropna(): deleting rows/columns having NaN values. The Pandas dropna() function deletes rows or columns containing NaN (missing) values.
  .dropna(axis=0/1, how='any'/'all')
axis=0 deletes rows and axis=1 deletes columns. how='any' (the default) deletes a row/column if any NaN value is found in it; in the case of 'all', a row/column is deleted only when all of its values are NaN. Since this data has no row or column that is entirely NaN, the how='all' setting would have no effect here.

  print(df.dropna(axis=1))                   # axis=1 for column deletion
  # OR print(df.dropna(axis=1, how='any'))
  print(df.dropna())                         # axis=0 for row deletion
  # OR print(df.dropna(axis=0, how='any'))

With axis=1, all columns containing NaN values ('Phy', 'Chem' and 'Eng') are deleted, leaving 'Name' and 'Maths'; with axis=0, only Amar's row (the only one with no NaN) remains.
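The isnull()/dropna() behaviour on that exact frame, condensed into a verifiable sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Amar', 'Akbar', 'Anthony', 'Manpreet'],
                   'Phy': [60, None, 70, 45],
                   'Chem': [68, 65, None, 56],
                   'Maths': [45, 56, 65, 65],
                   'Eng': [45, np.nan, 64, np.nan]})

n_missing = df.isnull().sum().sum()   # 1 in Phy + 1 in Chem + 2 in Eng
print(n_missing)                      # 4

rows_kept = df.dropna()               # only Amar's row has no NaN
print(len(rows_kept))                 # 1

cols_kept = df.dropna(axis=1)         # only 'Name' and 'Maths' are complete
print(list(cols_kept.columns))        # ['Name', 'Maths']
```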
fillna(): replacing NaN values with other values. The Pandas fillna() function replaces NaN values with any other defined value. You can also define different values for different columns by providing a filler dictionary.
  .fillna(<value>)
Number: fills all NaN values with that number. Filler dictionary: fills values as per the given dictionary of column:value pairs.

  # data: Name/Age/City/Marks with some NaN values
  print(df.fillna(0))
  print(df.fillna({'Name': 'NA', 'Age': 0,
                   'City': 'Guwahati', 'Marks': 20}))

Advanced operations: Pivoting
Pivoting of a DataFrame: pivoting is a process of reshaping/rearranging data by rotating rows and columns and aggregating data, to provide a summarized report from a different point of view. Consider the following sales data of an electronic shop (Item, Brand, Qty): TV/SONY/65, AC/SONY/32, PC/HP/45, TV/LG/25, AC/LG/25, PC/SONY/15, WM/LG/32, WM/SONY/38. Two reports can be generated from this DataFrame: a brand-wise report (Brand as rows, Item as columns) and an item-wise report (Item as rows, Brand as columns). To generate such reshaped and summarized reports, Pandas offers the following two functions:
  pivot()        Reshapes data when a single value is found for each row-column combination.
  pivot_table()  Reshapes data when multiple values may be found for a row-column combination.

pivot(): reshapes data on defined indexes and values.
  .pivot(index=<column>, columns=<column>, values=<column(s)>)
index=: the column which will be placed as the row index of the pivot table. columns=: the column whose values are to be used as the columns of the pivot. values=: the column(s) whose values are filled across the rows and columns of the pivot. If the values parameter is not provided, all remaining columns are used as values. Pandas fills the NaN constant for any missing row-column combination; you can apply fillna() afterwards to replace those NaN values with any number.
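A runnable fillna() sketch on a Name/Age/City frame similar to the one in the slides (the Marks column is omitted here for brevity):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Amar', 'Akbar', 'Anthony', 'Manpreet'],
                   'Age': [21.0, 25.0, np.nan, 45.0],
                   'City': ['Delhi', None, 'Agartala', None]})

# One filler for everything...
print(df.fillna(0))
# ...or a per-column filler dictionary.
filled = df.fillna({'Age': 0, 'City': 'Guwahati'})
print(filled.loc[2, 'Age'])    # 0.0
print(filled.loc[1, 'City'])   # 'Guwahati'
```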
  print(df.pivot(index='Brand', columns='Item', values='Qty'))

pivot_table(): reshaping data when a combination has multiple values. Consider a DataFrame with Item/Brand/Year/Qty columns in which there are multiple values of Qty for the TV-LG combination (e.g. 25 and 35). When you try to create the pivot using the pivot() method, Pandas raises a ValueError: since there are two values for one cell, which value should be taken? In such a case Pandas offers the pivot_table() method, which resolves the multiple-values problem using an aggregate function like sum, min, max or mean. By default the mean is used when no aggregate function is given.

  df.pivot(index='Brand', columns='Item', values='Qty')   # raises ValueError on duplicate Brand-Item pairs
pivot_table(): reshapes data on defined indexes.
  .pivot_table(index=<column>, columns=<column>, values=<column(s)>, aggfunc=<function>)
index=: the column which will be placed as the row index of the pivot table. columns=: the column whose values are to be used as the columns of the pivot. values=: the column(s) whose values are filled across the rows and columns of the pivot. aggfunc=: the aggregate function (sum, min, max, mean, etc.) used to handle multiple values; the default is mean.

  print(df.pivot_table(index='Brand', columns='Item', values='Qty'))
  print(df.pivot_table(index='Brand', columns='Item', values='Qty',
                       aggfunc='sum'))

Importing & Exporting Data Between Python Pandas and MySQL
As you are aware, MySQL is a Relational Database Management System (RDBMS) based on Structured Query Language (SQL). A database in MySQL may contain several tables holding data in a 2D structure of rows and columns. Since a Pandas DataFrame is also a 2D structure, we can import/export data between Pandas and a MySQL database. Importing/exporting data through a CSV file is also common, but a CSV file is neither secure nor directly integrated with MySQL. Python offers various ways to connect to a MySQL database, but the following three methods are very common:
  Method 1: the pymysql package -- a pure-Python connector; less support for executing parameterized queries.
  Method 2: the mysql.connector package -- provided by Oracle Corp. (the MySQL distributor); supports parameterized queries while reading from a MySQL database.
  Method 3: the sqlalchemy package -- SQLAlchemy is an ORM-based additional toolkit that handles the database connection and packages data in a form compatible with Pandas. It can be used together with pymysql, mysql.connector, etc.
Method 3 is recommended for handling Pandas-MySQL connectivity as per the syllabus.

Installing and using the pymysql package library: there are many ways to connect a MySQL database with a Python application using different libraries, but for exchanging data between a Pandas DataFrame and MySQL we need the following two library packages installed in the Python environment:
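The pivot()/pivot_table() contrast in one runnable sketch, using the shop data from the slides plus one duplicate TV-LG row to trigger the multiple-values case:

```python
import pandas as pd

sales = pd.DataFrame({
    'Item':  ['TV', 'AC', 'PC', 'TV', 'AC', 'TV'],
    'Brand': ['SONY', 'SONY', 'HP', 'LG', 'LG', 'LG'],
    'Qty':   [65, 32, 45, 25, 25, 35],
})

# The first five rows have unique Brand/Item pairs, so pivot() works;
# missing combinations (e.g. HP-TV) become NaN.
p = sales.iloc[:5].pivot(index='Brand', columns='Item', values='Qty')
print(p.loc['SONY', 'TV'])   # 65.0

# The duplicate TV-LG rows (25 and 35) would make pivot() raise ValueError;
# pivot_table() aggregates them instead.
pt = sales.pivot_table(index='Brand', columns='Item', values='Qty',
                       aggfunc='sum')
print(pt.loc['LG', 'TV'])    # 25 + 35 = 60
```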
  pymysql     A Python database-connection driver library, which handles the connection between Python and the MySQL database.   How to install: pip install pymysql
  sqlalchemy  An add-on toolkit used to interact with the MySQL database by creating a connection (engine) from the connection details.   How to install: pip install sqlalchemy

Note: you can import/export data using pymysql only, but using the add-on sqlalchemy library along with pymysql makes the job easier, with minimal code/steps.

The Pandas library offers the following two functions to handle a MySQL database/table:
  read_sql()  Reads/imports data from MySQL into a DataFrame.
  to_sql()    Writes/exports data from a DataFrame to a MySQL table.

Steps to import/export data between Pandas & MySQL: after installing the required packages (pymysql & sqlalchemy), the following two steps are required to connect a MySQL database to a Python program.

Step 1: Importing libraries. The following libraries are imported (you can also use mysql.connector in place of pymysql):
  1. pandas: for handling Pandas data structures.
  2. pymysql: the MySQL connection-driver library.
  3. sqlalchemy: for database connections (the engine).

  import pandas as pd
  import pymysql
  import sqlalchemy

Step 2: Establishing the connection (engine). sqlalchemy offers the create_engine() method to establish a secure connection to the MySQL database using the MySQL credentials and the database name.

  <engine> = sqlalchemy.create_engine('mysql+pymysql://<user>:<password>@localhost/<database>')
  <connection> = <engine>.connect()

  # establishing a connection to the school database with MySQL user root (password tiger)
  engine = sqlalchemy.create_engine('mysql+pymysql://root:tiger@localhost/school')
  con = engine.connect()

Importing (reading) records from MySQL into a DataFrame: Pandas offers the read_sql() method to import data from a MySQL table.
  <DataFrame> = pandas.read_sql(<query or table name>, <connection>)
Suppose the student table of the school database is to be imported.
The MySQL user is root and the password is tiger.

# importing packages
import pandas as pd
import pymysql
import sqlalchemy as alc

# Establishing connection to the school database
engine = alc.create_engine('mysql+pymysql://root:tiger@localhost/school?charset=utf8')
con = engine.connect()

df = pd.read_sql('select * from student', con)
# OR: df = pd.read_sql('student', con)

Note: If you are using a MySQL version below 5.6, you must specify the character set (?charset=utf8) with the database name to avoid an "unknown character set" error.

Importing & Exporting Data Between Python Pandas and MySQL

Exporting (Writing) data from a DataFrame to MySQL

Pandas offers the to_sql() method to export data from a DataFrame:

<DataFrame>.to_sql(<table>, <connection>, index=..., if_exists=...)
<table>: the MySQL table into which data is to be written.
<connection>: the connection object created through create_engine().
index=True: by default, the row index of the DataFrame is added as an index column in the MySQL table. If False, the row index is not added.
if_exists=: defines the action to be taken if the table already exists. By default an error is raised ('fail'); you can also 'append' to or overwrite ('replace') the table.

# Import statements to be written here
dct = {'Name': ['Amar', 'Akbar', 'Anthony'], 'Total': [320, 360, 380]}
df = pd.DataFrame(dct)
print(df)
engine = alc.create_engine('mysql+pymysql://root:tiger@localhost/school?charset=utf8')
con = engine.connect()
df.to_sql('student', con, if_exists='replace')

Importing & Exporting Data Between Python and MySQL (Alternative Method)

Alternative Method to import/export data to MySQL

In the previous examples, we used the sqlalchemy library to connect to the MySQL database via create_engine(), which automatically handles the data flow compatibly with Pandas' to_sql() and read_sql() methods. So, if you want to import/export data directly from/to a Pandas DataFrame, sqlalchemy plays a very important role.

However, you can also apply the general methodology of handling database connections and data flow programmatically using the pymysql or mysql.connector packages, without the sqlalchemy toolkit. In fact, this method can be used to import/export data from any Python data object, such as a variable, list, tuple or DataFrame.

The programs in the next two slides do the same reading/writing job through MySQL commands without using the sqlalchemy library. Note that while exporting (writing) data to a MySQL table using this method, a compatible table must exist before executing MySQL's INSERT command: either create it with MySQL's CREATE command from within the Python program, or ensure a compatible table is already present.

This slide may be used as additional reference material.
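Before the pymysql-only version, here is a runnable sketch of the to_sql()/read_sql() round trip described above. To keep it executable without a MySQL server, it uses Python's built-in sqlite3 module (which Pandas also accepts as a connection) in place of the sqlalchemy MySQL engine; with MySQL you would pass the connection from create_engine() instead. The table name and data mirror the slides' student example.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for MySQL here, so the
# example runs without a database server or credentials.
con = sqlite3.connect(':memory:')

df = pd.DataFrame({'Name': ['Amar', 'Akbar', 'Anthony'],
                   'Total': [320, 360, 380]})

# Export: write the DataFrame to a table, replacing it if it exists,
# and skip the row index (index=False) so only Name/Total are stored.
df.to_sql('student', con, if_exists='replace', index=False)

# Import: read the table back into a new DataFrame with an SQL query.
df2 = pd.read_sql('select * from student', con)
print(df2)
```

The same two calls work unchanged against MySQL once con is replaced by the sqlalchemy engine connection shown earlier.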
Importing & Exporting Data Between Python and MySQL (Alternative Method)

Exporting (Writing) records from a DataFrame to MySQL

# Program to export data from a DataFrame to MySQL without using sqlalchemy
import pandas as pd
import pymysql

dct = {'Name': ['Amar', 'Akbar', 'Anthony'], 'Total': [320, 360, 380]}
df = pd.DataFrame(dct)
print(df)

# Create connection to the database
con = pymysql.connect(host='localhost', user='root',
                      password='tiger', db='school')
cur = con.cursor()

# Create table by executing a MySQL query
cur.execute("CREATE TABLE student (name CHAR(30), total INT(3));")
for i, row in df.iterrows():
    sql = "INSERT INTO student VALUES ('%s','%s');" % (row['Name'], row['Total'])
    cur.execute(sql)
con.commit()
con.close()

The inserted rows can be verified in MySQL with: select * from student;

Importing & Exporting Data Between Python and MySQL (Alternative Method)

Importing (Reading) records from MySQL to a DataFrame

# Program to import data from MySQL to a DataFrame without using sqlalchemy
import pandas as pd
import pymysql

# Connection to the database
con = pymysql.connect(host='localhost', user='root',
                      password='tiger', db='school')
cur = con.cursor()
cur.execute("select * from student;")
rows = cur.fetchall()

# Create DataFrame using the fetched records
df = pd.DataFrame(rows, columns=['Name', 'Total'])
print(df)
con.close()

"Education is not the learning of facts, but the training of the mind to think."

Visit www.ip4you.blogspot.com for more....
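The alternative (cursor-based) method above can be sketched in runnable form as well. As before, the built-in sqlite3 module stands in for pymysql so the example runs without a MySQL server; the connect/cursor/execute/fetchall pattern is the same. One deliberate change from the slide: the INSERT uses parameter placeholders rather than % string formatting, which avoids quoting problems (sqlite3 uses ?, pymysql uses %s).

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'Name': ['Amar', 'Akbar', 'Anthony'],
                   'Total': [320, 360, 380]})

con = sqlite3.connect(':memory:')   # stands in for pymysql.connect(...)
cur = con.cursor()
cur.execute("CREATE TABLE student (name TEXT, total INTEGER);")

# Export: insert row by row via the cursor, with parameter placeholders.
for _, row in df.iterrows():
    cur.execute("INSERT INTO student VALUES (?, ?);",
                (row['Name'], int(row['Total'])))
con.commit()

# Import: fetch all rows and rebuild a DataFrame from the tuples.
cur.execute("SELECT * FROM student;")
rows = cur.fetchall()
df2 = pd.DataFrame(rows, columns=['Name', 'Total'])
print(df2)
con.close()
```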
