Sarkar, DR Tirthajyoti - Roychowdhury, Shubhadeep - Data Wrangling With Python - Creating Actionable Data From Raw Sources-Packt Publishing (2019)
1. Preface
2. Chapter 1
3. Introduction to Data Wrangling with Python
1. Introduction
1. Importance of Data Wrangling
1. Lists
2. Exercise 1: Accessing the List Members
3. Exercise 2: Generating a List
4. Exercise 3: Iterating over a List and Checking Membership
5. Exercise 4: Sorting a List
6. Exercise 5: Generating a Random List
7. Activity 1: Handling Lists
8. Sets
9. Introduction to Sets
10. Union and Intersection of Sets
11. Creating Null Sets
12. Dictionary
13. Exercise 6: Accessing and Setting Values in a Dictionary
14. Exercise 7: Iterating Over a Dictionary
15. Exercise 8: Revisiting the Unique Valued List Problem
16. Exercise 9: Deleting Value from Dict
17. Exercise 10: Dictionary Comprehension
18. Tuples
19. Creating a Tuple with Different Cardinalities
20. Unpacking a Tuple
21. Exercise 11: Handling Tuples
22. Strings
23. Exercise 12: Accessing Strings
24. Exercise 13: String Slices
25. String Functions
26. Exercise 14: Split and Join
27. Activity 2: Analyze a Multiline String and Generate the Unique Word Count
4. Summary
4. Chapter 2
5. Advanced Data Structures and File Handling
1. Introduction
2. Advanced Data Structures
1. Iterator
2. Exercise 15: Introduction to the Iterator
3. Stacks
4. Exercise 16: Implementing a Stack in Python
5. Exercise 17: Implementing a Stack Using User-Defined Methods
6. Exercise 18: Lambda Expression
7. Exercise 19: Lambda Expression for Sorting
8. Exercise 20: Multi-Element Membership Checking
9. Queue
10. Exercise 21: Implementing a Queue in Python
11. Activity 3: Permutation, Iterator, Lambda, List
4. Summary
6. Chapter 3
7. Introduction to NumPy, Pandas, and Matplotlib
1. Introduction
2. NumPy Arrays
3. Pandas DataFrames
8. Chapter 4
9. A Deep Dive into Data Wrangling with Python
1. Introduction
2. Subsetting, Filtering, and Grouping
6. Summary
10. Chapter 5
11. Getting Comfortable with Different Kinds of Data Sources
1. Introduction
2. Reading Data from Different Text-Based (and Non-Text-Based) Sources
1. Structure of HTML
2. Exercise 69: Reading an HTML file and Extracting its Contents Using BeautifulSoup
3. Exercise 70: DataFrames and BeautifulSoup
4. Exercise 71: Exporting a DataFrame as an Excel File
5. Exercise 72: Stacking URLs from a Document using bs4
6. Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
4. Summary
12. Chapter 6
13. Learning the Hidden Secrets of Data Wrangling
1. Introduction
1. Introduction to Generator Expressions
2. Exercise 73: Generator Expressions
3. Exercise 74: One-Liner Generator Expression
4. Exercise 75: Extracting a List with Single Words
5. Exercise 76: The zip Function
6. Exercise 77: Handling Messy Data
3. Data Formatting
1. The % operator
2. Using the format Function
3. Exercise 78: Data Representation Using {}
14. Chapter 7
15. Advanced Web Scraping and Data Gathering
1. Introduction
2. The Basics of Web Scraping and the Beautiful Soup Library
1. Libraries in Python
2. Exercise 81: Using the Requests Library to Get a Response from the Wikipedia Home Page
3. Exercise 82: Checking the Status of the Web Request
4. Checking the Encoding of the Web Page
5. Exercise 83: Creating a Function to Decode the Contents of the Response and Check its Length
6. Exercise 84: Extracting Human-Readable Text From a BeautifulSoup Object
7. Extracting Text from a Section
8. Extracting Important Historical Events that Happened on Today's Date
9. Exercise 85: Using Advanced BS4 Techniques to Extract Relevant Text
10. Exercise 86: Creating a Compact Function to Extract the "On this Day" Text from the Wikipedia Home Page
6. Summary
16. Chapter 8
17. RDBMS and SQL
1. Introduction
2. Refresher of RDBMS and SQL
1. How is an RDBMS Structured?
2. SQL
3. Using an RDBMS (MySQL/PostgreSQL/SQLite)
4. Summary
18. Chapter 9
19. Application of Data Wrangling in Real Life
1. Introduction
2. Applying Your Knowledge to a Real-life Data Wrangling Task
3. Activity 12: Data Wrangling Task – Fixing UN Data
4. Activity 13: Data Wrangling Task – Cleaning GDP Data
5. Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
6. Activity 15: Data Wrangling Task – Connecting the New Data to the Database
7. An Extension to Data Wrangling
8. Summary
20. Appendix
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK
Table of Contents
Preface
Introduction to Data Wrangling with Python
INTRODUCTION
IMPORTANCE OF DATA WRANGLING
LISTS
EXERCISE 2: GENERATING A LIST
EXERCISE 5: GENERATING A RANDOM LIST
SETS
INTRODUCTION TO SETS
DICTIONARY
TUPLES
UNPACKING A TUPLE
STRINGS
STRING FUNCTIONS
SUMMARY
Advanced Data Structures and File Handling
INTRODUCTION
ADVANCED DATA STRUCTURES
ITERATOR
STACKS
ACTIVITY 3: PERMUTATION, ITERATOR, LAMBDA, LIST
FILE HANDLING
SUMMARY
Introduction to NumPy, Pandas, and Matplotlib
INTRODUCTION
NUMPY ARRAYS
NUMPY ARRAY AND FEATURES
CONDITIONAL SUBSETTING
STACKING ARRAYS
PANDAS DATAFRAMES
STATISTICS AND VISUALIZATION WITH NUMPY AND PANDAS
REFRESHER OF BASIC DESCRIPTIVE STATISTICS (AND THE MATPLOTLIB LIBRARY FOR VISUALIZATION)
DEFINITION OF STATISTICAL MEASURES – CENTRAL TENDENCY AND SPREAD
WHAT IS A PROBABILITY DISTRIBUTION?
DISCRETE DISTRIBUTIONS
CONTINUOUS DISTRIBUTIONS
DATA WRANGLING IN STATISTICS AND VISUALIZATION
RANDOM NUMBER GENERATION USING NUMPY
EXERCISE 43: GENERATING RANDOM NUMBERS FROM A UNIFORM DISTRIBUTION
ACTIVITY 5: GENERATING STATISTICS FROM A CSV FILE
SUMMARY
CONDITIONAL SELECTION AND BOOLEAN FILTERING
CONCATENATING, MERGING, AND JOINING
EXERCISE 54: CONCATENATION
USEFUL METHODS OF PANDAS
SUMMARY
Getting Comfortable with Different Kinds of Data Sources
INTRODUCTION
READING DATA FROM DIFFERENT TEXT-BASED (AND NON-TEXT-BASED) SOURCES
SETTING THE SKIP_BLANK_LINES OPTION
READ CSV FROM A ZIP FILE
INTRODUCTION TO BEAUTIFUL SOUP 4 AND WEB PAGE PARSING
STRUCTURE OF HTML
ACTIVITY 7: READING TABULAR DATA FROM A WEB PAGE AND CREATING DATAFRAMES
SUMMARY
ADDITIONAL SOFTWARE REQUIRED FOR THIS SECTION
ADVANCED LIST COMPREHENSION AND THE ZIP FUNCTION
INTRODUCTION TO GENERATOR EXPRESSIONS
DATA FORMATTING
THE % OPERATOR
Z-SCORE
ACTIVITY 8: HANDLING OUTLIERS AND MISSING DATA
SUMMARY
LIBRARIES IN PYTHON
EXTRACTING IMPORTANT HISTORICAL EVENTS THAT HAPPENED ON TODAY'S DATE
EXERCISE 85: USING ADVANCED BS4 TECHNIQUES TO EXTRACT RELEVANT TEXT
SUMMARY
HOW IS AN RDBMS STRUCTURED?
SQL
USING AN RDBMS (MYSQL/POSTGRESQL/SQLITE)
RELATION MAPPING IN DATABASES
RETRIEVING SPECIFIC COLUMNS FROM A JOIN QUERY
SUMMARY
Application of Data Wrangling in Real Life
INTRODUCTION
AN EXTENSION TO DATA WRANGLING
ADDITIONAL SKILLS REQUIRED TO BECOME A DATA SCIENTIST
SUMMARY
Appendix
Preface
About
This section briefly introduces the author(s), the coverage of this book, the technical skills you'll need to get started, and the hardware and software requirements required to complete all of the included activities and exercises.
LEARNING OBJECTIVES
Use and manipulate complex and simple data structures
APPROACH
Data Wrangling with Python takes a practical approach to equip beginners with the most essential data analysis tools in the shortest possible time. It contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.
AUDIENCE
Data Wrangling with Python is designed for developers, data analysts, and business analysts who are keen to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners, prior working knowledge of Python is necessary to easily grasp the concepts covered here. It will also help to have rudimentary knowledge of relational databases and SQL.
MINIMUM HARDWARE REQUIREMENTS
For the optimal student experience, we recommend the following hardware configuration:
Memory: 8 GB RAM
SOFTWARE REQUIREMENTS
You'll also need the following software installed in advance:
version of OS X
Memory: 4 GB RAM (8 GB preferred)
CONVENTIONS
Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "This will return the value associated with it – ["list_element1", 34]"
A block of code is set as follows:
list_1 = []
list_1.append(x)
list_1
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click on New and choose Python 3."
Install Docker
2. Once you have set up Docker, open a shell (or Terminal if you are a Mac user) and type the following command to verify that the installation has been successful:
docker version
If the output shows you the server and client version of Docker, then you are all set up.
docker pull rcshubhadeep/packt-data-wrangling-base
2. If you want to know the full list of all the packages and their versions included in this image, you can check out the requirements.txt file in the setup folder of the source code repository of this book. Once the image is there, you are ready to roll. Downloading it may take time, depending on your connection speed.
1. Run the image using the following command:
docker run -p 8888:8888 -v "$(pwd)":/notebooks -it rcshubhadeep/packt-data-wrangling-base
3. Before you run the image, create a new folder and navigate there from the shell using the cd command.
1. Once you are running the Jupyter server, click on New and choose Python 3. A new browser tab will open with a new and empty notebook. Rename the Jupyter file:
Note
ADDITIONAL RESOURCES
The code bundle for this book is also hosted on GitHub at: https://github.com/TrainingByPackt/Data-Wrangling-with-Python.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Chapter 1
Introduction to Data Wrangling with Python
Learning Objectives
By the end of this chapter, you will be able to do the following:
Introduction
Data science and analytics are taking over the whole world and the job of a data scientist is routinely being called the coolest job of the 21st century. But for all the emphasis on data, it is the science that makes you – the practitioner – truly valuable.
IMPORTANCE OF DATA WRANGLING
Oil does not come in its final form from the rig; it has to be refined. Similarly, data must be curated, massaged, and refined to be used in intelligent algorithms and consumer products. This is known as wrangling. Most data scientists spend the majority of their time data wrangling.
Detecting outliers
This is an illustrative representation of the positioning and essential functional role of data wrangling in a typical data science pipeline:
Figure 1.1: Process of data wrangling
Modeling platforms such as RapidMiner
As the volume, velocity, and variety (the three Vs of big data) of data undergo rapid changes, it is always a good idea to develop and nurture a significant amount of in-house expertise in data wrangling using fundamental programming frameworks so that an organization is not beholden to the whims and fancies of any enterprise platform for as basic a task as data wrangling:
Figure 1.2: Google trend worldwide over the last five years
We can issue the following command to start a new Jupyter server by typing it into the Command Prompt window:
This will start a Jupyter server and you can visit it at http://localhost:8888 and use the passcode dw_4_all to access the main interface.
LISTS
Lists are fundamental Python data structures that have continuous memory locations, can host different data types, and can be accessed by the index.
The following is an example of a simple list:
The following is also an example of a list:
If you are coming from a strongly typed language, such as C, C++, or Java, then this will probably be strange as you are not allowed to mix different kinds of data types in a single array in those languages. Lists are somewhat like arrays, in the sense that they are both based on continuous memory locations and can be accessed using indexes. But the power of Python lists comes from the fact that they can host different data types and you are allowed to manipulate the data.
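The example lists referred to above did not survive extraction, so here is a minimal sketch of a simple list and a mixed-type list; the exact element values are assumptions, not the book's originals:

```python
# A simple list: every element is an integer
list_example = [51, 27, 34, 46, 90, 45, -19]

# A mixed list: int, string, bool, float, and even another list in one container
list_example2 = [15, "Yellow car", True, 9.456, [12, "Hello"]]

print(list_example[0])    # lists are accessed by zero-based index
print(list_example2[1])
```

A strongly typed language would reject `list_example2` outright; Python stores references, so any object can sit next to any other.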
Note
Be careful, though, as the very power of lists, and the fact that you can mix different data types in a single list, can actually create subtle bugs that can be very difficult to track.
The indices will be automatically assigned, as follows:
list_1[0] #34
list_1[3] #1
list_1[len(list_1) - 1] #1
The len function in Python returns the length of the specified list.
list_1[-1] #1
6. Access the first three elements from list_1 using forward indices:
Note
list_1[-1::-1] # [1, 89, 12, 34]
Note
EXERCISE 2: GENERATING A LIST
We are going to examine various ways of generating a list:
list_1 = []
for x in range(0, 10):
    list_1.append(x)
list_1
The output is as follows:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Here, we started by declaring an empty list and then we used a for loop to append values to it. The append method is a method that's given to us by the Python list data type.
list_2 = [x for x in range(0, 100)]
list_2
print(list_1[i])
i += 1
list_3 = [x for x in range(0, 100) if x % 5 == 0]
list_3
The output is as follows:
[1, 4, 56, -1, 1, 39, 245, -23, 0, 45]
list_1.extend(list_2)
list_1
list_1 = [x for x in range(0, 100)]
for i in range(0, len(list_1)):
    print(list_1[i])
The output is as follows:
Figure 1.7: Section of list_1
2. However, it is not very Pythonic. Being Pythonic is to follow and conform to a set of best practices and conventions that have been created over the years by thousands of very able developers, which in this case means to use the in keyword, because Python does not have index initialization, bounds checking, or index incrementing, unlike traditional languages. The Pythonic way of iterating over a list is as follows:
for i in list_1:
    print(i)
The output is as follows:
25 in list_1
The output is True.
-45 in list_1
The output is False.
1. As the list was originally a list of numbers from 0 to 99, we will sort it in the reverse direction. To do that, we will use the sort method with reverse=True:
list_1.sort(reverse=True)
list_1
2. We can use the reverse method directly to achieve this result:
list_1.reverse()
list_1
The output is as follows:
Figure 1.10: Section of output after reversing the string
Note
The difference between the sort function and the reverse function is the fact that we can use sort with custom sorting functions to do custom sorting, whereas we can only use reverse to reverse a list. Here also, both the functions work in-place, so be aware of this while using them.
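The custom-sorting point in the note above can be sketched concretely; the sample data here is an assumption chosen only to show the difference between sort (with a key function) and reverse:

```python
words = ["banana", "fig", "apple", "kiwi"]
words.sort(key=len)     # custom sort: order by word length, in place
print(words)

nums = [3, 1, 2]
nums.reverse()          # reverse can only flip the order, also in place
print(nums)
```

Both calls return None and mutate the list they are called on, which is exactly the in-place behavior the note warns about.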
EXERCISE 5: GENERATING A RANDOM LIST
In this exercise, we will be generating a list with random numbers:
import random
list_1 = [random.randint(0, 30) for x in range(0, 100)]
list_1
There are many ways to get a list of unique numbers, and while you may be able to write a few lines of code using a for loop and another list (you should actually try doing it!), let's see how we can do this without a for loop and with a single line of code. This will bring us to the next data structure, sets.
These are the steps for completing this activity:
1. Create a list of 100 random numbers.
Note
SETS
A set, mathematically speaking, is just a collection of well-defined distinct objects. Python gives us a straightforward way to deal with them using its set datatype.
INTRODUCTION TO SETS
With the last list that we generated, we are going to revisit the problem of getting rid of duplicates from it. We can achieve that with the following line of code:
list_12 = list(set(list_1))
list_12
The output will be as follows:
Figure 1.12: Section of output for list_12
This simply means take everything from both sets but take the common elements only once. We can create this using the following code:
To find the union of the two sets, the following instructions should be used:
set1 | set2
The output would be as follows:
Note
You can also calculate the difference between sets (also known as complements). To find out more, refer to this link: https://docs.python.org/3/tutorial/datastructures.html#sets.
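The code that creates set1 and set2 did not survive extraction, so here is a self-contained sketch of union, intersection, and difference; the element values are assumptions, not the book's originals:

```python
set1 = {"Apple", "Orange", "Banana"}
set2 = {"Pear", "Peach", "Mango", "Banana"}

print(set1 | set2)   # union: everything from both sets, common elements once
print(set1 & set2)   # intersection: only the elements present in both
print(set1 - set2)   # difference: elements of set1 not in set2
```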
null_set_1 = set({})
null_set_1
The output is as follows:
set()
However, to create a dictionary, use the following command:
null_set_2 = {}
null_set_2
The output is as follows:
{}
DICTIONARY
A dictionary is like a list, which means it is a collection of several elements. However, with the dictionary, it is a collection of key-value pairs, where the key can be anything that can be hashed. Generally, we use numbers or strings as keys.
dict_1
The output is as follows:
This is also a valid dictionary:
dict_2
The output is as follows:
{'key1': 1,
'key5': 4.5}
The keys must be unique in a dictionary.
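A quick sketch of why keys must be unique: Python does not reject a duplicate key in a literal, it silently keeps only the last value. The key and value names here are assumptions for illustration:

```python
d = {"key1": "first", "key1": "second"}
print(d)        # only one entry survives: the later value wins
print(len(d))   # 1, not 2
```

This silent overwrite is a common source of subtle bugs when building dictionaries from data.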
dict_2["key2"]
['list_element1', 34]
dict_2["key2"] = "My new value"
dict_3 = {}  # Not a null set. It is a dict
dict_3["key1"] = "Value1"
dict_3
The output is as follows:
{'key1': 'Value1'}
1. Create dict_1:
dict_1 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3", "key4": {"subkey1": "v1"}, "key5": 4.5}
for k, v in dict_1.items():
    print("{} - {}".format(k, v))
The output is as follows:
key1 - 1
key2 - ['list_element1', 34]
key3 - value3
key4 - {'subkey1': 'v1'}
key5 - 4.5
Note
list_1 = [random.randint(0, 30) for x in range(0, 100)]
list(dict.fromkeys(list_1).keys())
The sample output is as follows:
"key4": {"subkey1": "v1"}, "key5": 4.5}
dict_1
The output is as follows:
{'key1': 1, 'key2': ['list_element', 34], 'key3': 'value3', 'key4': {'subkey1': 'v1'}, 'key5': 4.5}
del dict_1["key2"]
The output is as follows:
{'key3': 'value3', 'key4': {'subkey1': 'v1'}, 'key5': 4.5}
Note
dict_1
The output is as follows:
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
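The squares dictionary shown above is the typical result of a dictionary comprehension; the comprehension line itself did not survive extraction, but a sketch consistent with the output is:

```python
# Map each integer 0..9 to its square with a dictionary comprehension
dict_1 = {i: i**2 for i in range(10)}
print(dict_1)
```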
dict_2 = dict([('Tom', 100), ('Dick', 200), ('Harry', 300)])
dict_2
The output is as follows:
{'Tom': 100, 'Dick': 200, 'Harry': 300}
dict_3 = dict(Tom=100, Dick=200, Harry=300)
dict_3
The output is as follows:
{'Tom': 100, 'Dick': 200, 'Harry': 300}
TUPLES
A tuple is another data type in Python. It is sequential in nature and similar to lists.
Notice that, unlike lists, we did not open and close square brackets here.
tuple_1 = ()
tuple_1 = "Hello",
Notice the trailing comma here.
One special thing about tuples is the fact that they are an immutable data type. So, once created, we cannot change their values; we can only access them. An assignment such as the following will therefore fail:
tuple_1[1] = "Universe!"
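The failing assignment above can be sketched end-to-end; the two-element tuple used here is an assumption for illustration:

```python
tuple_1 = "Hello", "World"   # parentheses are optional; the commas make the tuple

try:
    tuple_1[1] = "Universe!"  # tuples are immutable: this raises TypeError
except TypeError as e:
    print("Cannot modify a tuple:", e)

print(tuple_1)                # the tuple is unchanged
```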
UNPACKING A TUPLE
The term unpacking a tuple simply means to get the values contained in the tuple in different variables:
print(world)
The output is as follows:
Hello
World
tupleE
The output is as follows:
2. Try to override a variable from the tupleE tuple:
tupleE[1] = "5"
3. Unpack the values of tupleE into separate variables (literals such as 1, 3, and 5 cannot be assignment targets, so names are used):
a, b, c = tupleE
4. Print the output:
print(a)
print(b)
The output is as follows:
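The unpacking code above is fragmentary, so here is a minimal self-contained sketch; the variable names and tuple values are assumptions:

```python
tuple_hw = "Hello", "World"
hello, world = tuple_hw       # one variable per tuple element
print(hello)
print(world)

# Star-unpacking collects the remaining elements into a list
first, *rest = (1, 3, 5)
print(first, rest)
```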
This is a string:
A string can also be declared in this manner:
str_1[0]
The output is as follows:
'H'
str_1[4]
The output is as follows:
'o'
str_1[len(str_1) - 1]
The output is as follows:
'!'
The output is as follows:
'!'
Each of the preceding operations will give you the character at the specific index.
Note
str_1[2:10]
The output is this:
'llo Worl'
str_1[-31:]
The output is as follows:
str_1[-10:-5]
The output is as follows:
' wran'
STRING FUNCTIONS
To find out the length of a string, we simply use the len function:
len(str_1)
The length of the string is 41. To convert a string's case, we can use the lower and upper methods:
str_1.lower()
str_1.upper()
The output is as follows:
To search for a string within a string, we can use the find method:
str_1.find("complicated")
The output is -1. Can you figure out whether the find method is case-sensitive or not? Also, what do you think the find method returns when it actually finds the string?
To replace one string with another, we have the replace method. Since we know that a string is an immutable data structure, replace actually returns a new string instead of replacing and returning the actual one:
str_1.replace("complicated", "simple")
The output is as follows:
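A short sketch of the behavior the questions above point at: find is case-sensitive, returns the index of the first match (and -1 on no match), and replace returns a new string. The sample string here is an assumption, not the book's exact str_1:

```python
s = "Data wrangling is not complicated"
print(s.find("complicated"))    # index of the first match
print(s.find("Complicated"))    # -1: find is case-sensitive
print(s.replace("complicated", "simple"))
print(s)                         # unchanged: strings are immutable
```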
list_1 = str_1.split(",")
list_1
" | ".join(list_1)
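The split/join round trip above can be sketched with a self-contained sample string (the string itself is an assumption):

```python
csv_line = "alpha,beta,gamma"
parts = csv_line.split(",")     # break the string on commas -> list of fields
print(parts)
joined = " | ".join(parts)      # stitch the fields back with a new separator
print(joined)
```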
We have designed an activity for you so that you can practice all the skills you just learned. This small activity should take around 30 to 45 minutes to finish.
ACTIVITY 2: ANALYZE A MULTILINE STRING AND GENERATE THE UNIQUE WORD COUNT
This section will ensure that you have understood the various basic data structures and their manipulation. We will do that by going through an activity that has been designed specifically for this purpose.
In this activity, we will do the following:
Note
These are the steps to guide you through solving this activity:
1. Create a multiline_text variable by copying the text from the first chapter of Pride and Prejudice.
Note
4. Find all of the words in multiline_text using the split function.
7. Find the top 25 words from the unique words that you have found using the slice function.
Note
Summary
In this chapter, we learned what the term data wrangling means. We also got examples from various real-life data science situations where data wrangling is very useful and is used in industry. We moved on to learn about the different built-in data structures that Python has to offer. We got our hands dirty by exploring lists, sets, dictionaries, tuples, and strings. They are the fundamental building blocks in Python data structures, and we need them all the time while working and manipulating data in Python. We did several small hands-on exercises to learn more about them. We finished this chapter with a carefully designed activity, which let us combine a lot of different tricks from all the different data structures into a real-life situation and let us observe the interplay between all of them.
Introduction
We were introduced to the basic concepts of different fundamental data structures in the last chapter. We learned about the list, set, dict, tuple, and string. They are the building blocks of future chapters and are essential for data science.
However, what we have covered so far were only basic operations on them. They have much more to offer once you learn how to utilize them effectively. In this chapter, we will venture further into the land of data structures. We will learn about advanced operations and manipulations and use these fundamental data structures to represent more complex and higher-level data structures; this is often handy while wrangling data in real life.
To start this chapter, you have to open an empty notebook. To do that, you can simply input the following command in a shell. It is advised that you first navigate to an empty directory using cd before you enter the command:
ITERATOR
We will start off this topic with lists. However, before we get into lists, we will introduce the concept of an iterator. An iterator is an object that implements the next method, meaning an iterator is an object that can iterate over a collection (lists, tuples, dicts, and so on). It is stateful, which means that each time we call the next method, it gives us the next element from the collection. And if there is no further element, then it raises a StopIteration exception.
Note
A StopIteration exception occurs with the iterator's next method when there are no further values to iterate.
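The stateful behavior described above can be sketched with the built-in iter and next functions (the sample list is an assumption):

```python
it = iter([10, 20, 30])   # obtain an iterator over a list
print(next(it))            # each call advances the iterator's internal state
print(next(it))
print(next(it))

try:
    next(it)               # the collection is exhausted: StopIteration is raised
except StopIteration:
    print("no further values to iterate")
```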
big_list_of_numbers = [1 for x in range(0, 10000000)]
2. Check the size of this variable:
getsizeof(big_list_of_numbers)
The value it will show you will be something around 81528056 (it is in bytes). This is a lot of memory! And the big_list_of_numbers variable is only available once the list comprehension is over. It can also overflow the available system memory if you try too big a number.
small_list_of_numbers = repeat(1, times=10000000)
getsizeof(small_list_of_numbers)
The last line shows that our small_list_of_numbers is only 56 bytes in size. Also, it is a lazy method, as it did not generate all the elements. It will generate them one by one when asked, thus saving us time. In fact, if you omit the times keyword argument, then you can practically generate an infinite number of 1s.
4. Loop over the newly generated iterator:
for i, x in enumerate(small_list_of_numbers):
    print(x)
    if i > 10:
        break
We use the enumerate function so that we get the loop counter, along with the values. This will help us break once we reach a certain number of the counter (10 for example).
5. To look up the definition of any function, type the function name, followed by a ? and press Shift + Enter in a Jupyter notebook. Run the following code to understand how we can use permutations and combinations with itertools:
from itertools import (permutations, combinations, dropwhile, repeat, zip_longest)
permutations?
combinations?
dropwhile?
repeat?
zip_longest?
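The memory comparison from the steps above can be run end-to-end; a smaller element count is used here so it finishes quickly, and the absolute sizes printed will differ from the book's figures:

```python
from itertools import repeat
from sys import getsizeof

big_list = [1 for _ in range(100000)]     # fully materialized list of 100,000 ones
lazy_ones = repeat(1, times=100000)       # lazy iterator: constant, tiny footprint

print(getsizeof(big_list))    # grows with the number of elements
print(getsizeof(lazy_ones))   # a few dozen bytes, regardless of `times`
```

The iterator stays small because it stores only the value and a counter; elements come into existence one at a time as the iterator is consumed.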
STACKS
A stack is a very useful data structure. If you know a bit about CPU internals and how a program gets executed, then you have an idea that a stack is present in many such cases. It is simply a list with one restriction, Last In First Out (LIFO), meaning an element that comes in last goes out first when a value is read from a stack. The following illustration will make this a bit clearer:
stack = []
stack.append(25)
stack
The output is as follows:
[25]
stack.append(-12)
stack
The output is as follows:
[25, -12]
tos = stack.pop()
tos
The output is as follows:
-12
After we execute the preceding code, we will have -12 in tos and the stack will have only one element in it, 25.
stack.append("Hello")
stack
The output is as follows:
[25, 'Hello']
def stack_push(s, value):
    return s + [value]

def stack_pop(s):
    tos = s[-1]
    del s[-1]
    return tos

url_stack = []
Note
4. Now, we are going to have a string with a few URLs in it. Our job is to analyze the string so that we push the URLs in the stack one by one as we encounter them, and then finally use a for loop to pop them one by one. Let's take the first line from the Wikipedia article about data science:
wikipedia_datascience = "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge [https://en.wikipedia.org/wiki/Knowledge] and insights from data [https://en.wikipedia.org/wiki/Data] in various forms, both structured and unstructured,similar to data mining [https://en.wikipedia.org/wiki/Data_mining]"
len(wikipedia_datascience)
The output is as follows:
347
wd_list =
wikipedia_datascience.
split()
len(wd_list)
Th e ou tpu t is as follow s:
34
if word.startswith("
[https://"):
url_stack =
stack_push(url_stack,
word[1:-1])
9 . Pr int th e v alu e in
url_stack:
url_stack
Th e ou tpu t is as follow s:
['https://en.wikipedia.org/wiki/Knowledge',
 'https://en.wikipedia.org/wiki/Data',
 'https://en.wikipedia.org/wiki/Data_mining']
for i in range(0, len(url_stack)):
    print(stack_pop(url_stack))
The output is as follows:
https://en.wikipedia.org/wiki/Data_mining
https://en.wikipedia.org/wiki/Data
https://en.wikipedia.org/wiki/Knowledge
print(url_stack)
The output is as follows:
[]
We have noticed a strange phenomenon in the stack_pop method. We passed the list variable there, and we used the del operator inside the function, but it changed the original variable by deleting the last index each time we call the function. If you are coming from a language like C, C++, or Java, then this is completely unexpected behavior, as in those languages this can only happen if we pass the variable by reference, and it can lead to subtle bugs in Python code. So be careful. In general, it is not a good idea to change a variable's value in place, meaning inside a function. Any variable that's passed to the function should be considered and treated as immutable. This is close to the principles of functional programming. A lambda expression in Python is a way to construct one-line, nameless functions that are, by convention, side effect-free.
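The in-place mutation described above is easy to demonstrate. Here is a minimal sketch (the function names mutating_pop and pure_pop are ours, not part of the exercise) contrasting a pop that mutates the caller's list with one that leaves it untouched:

```python
def mutating_pop(s):
    # del changes the caller's list, just like stack_pop above
    tos = s[-1]
    del s[-1]
    return tos

def pure_pop(s):
    # work on a copy, leaving the caller's list intact
    s = list(s)
    tos = s[-1]
    del s[-1]
    return tos, s

stack = [25, -12]
mutating_pop(stack)
print(stack)             # the original shrank: [25]

stack = [25, -12]
tos, rest = pure_pop(stack)
print(stack, tos, rest)  # original intact: [25, -12] -12 [25]
```

The copying version is closer to the functional style the text recommends, at the cost of an extra allocation per call.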
import math
2. Define two functions, my_sine and my_cosine. The reason we are declaring these functions is because the original sin and cos functions from the math package take radians as input, but we are more familiar with degrees. So, we will use a lambda expression to define a nameless one-line function and use it. This lambda function will automatically convert our degree input to radians and then apply sin or cos on it and return the value:
def my_sine():
    return lambda x: math.sin(math.radians(x))

def my_cosine():
    return lambda x: math.cos(math.radians(x))

sine = my_sine()
cosine = my_cosine()
math.pow(sine(30), 2) + math.pow(cosine(30), 2)
The output is as follows:
1.0
Notice that we have assigned the return value from both my_sine and my_cosine to two variables, and then used them directly as the functions. It is a much cleaner approach than using them explicitly. Notice that we did not explicitly write a return statement inside the lambda function. It is assumed.
capitals = [("USA", "Washington"), ("India", "Delhi"), ("France", "Paris"), ("UK", "London")]
capitals
The output is as follows:
[('USA', 'Washington'), ('India', 'Delhi'), ('France', 'Paris'), ('UK', 'London')]
capitals.sort(key=lambda item: item[1])
capitals
The output is as follows:
[('India', 'Delhi'), ('UK', 'London'), ('France', 'Paris'), ('USA', 'Washington')]
list_of_words = ["Hello", "there.", "How", "are", "you", "doing?"]
2. Find out whether this list contains all the elements from another list:
check_for = ["How", "are"]
3. Use the in keyword to check membership in the list list_of_words:
all(w in list_of_words for w in check_for)
The output is as follows:
True
QUEUE
Apart from stacks, another high-level data structure that we are interested in is the queue. A queue is like a stack, meaning that you continue adding elements one by one. With a queue, the reading of elements obeys a FIFO (First In First Out) strategy. Check out the following diagram to understand this better:
Figure 2.4: Pictorial representation of a queue
%%time
queue = []
for i in range(0, 100000):
    queue.append(i)
print("Queue created")
The output is as follows:
Queue created
Wall time: 11 ms
for i in range(0, 100000):
    queue.pop(0)
print("Queue emptied")
The output is as follows:
Queue emptied
If we use the %%time magic command while executing the preceding code, we will see that it takes a while to finish. On a modern MacBook, with a quad-core processor and 8 GB RAM, it took around 1.20 seconds to finish. This time is taken because of the pop(0) operation, which means every time we pop a value from the left of the list (which is the current 0 index), Python has to rearrange all the other elements of the list by shifting them one space left. Indeed, it is not a very optimized implementation.
%%time
from collections import deque
queue2 = deque()
for i in range(0, 100000):
    queue2.append(i)
print("Queue created")
for i in range(0, 100000):
    queue2.popleft()
print("Queue emptied")
The output is as follows:
Queue created
Queue emptied
Wall time: 23 ms
A queue is a very important data structure. To give one example from real life, we can think about a producer-consumer system design. While doing data wrangling, you will often come across a problem where you must process very big files. One of the ways to deal with this problem is to chunk the contents of the file into smaller parts and then push them into a queue while creating small, dedicated worker processes, which read off the queue and process one small chunk at a time. This is a very powerful design, and you can even use it efficiently to design huge multi-node data wrangling pipelines.
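The producer-consumer design described above can be sketched with the standard library's thread-safe queue.Queue. This is a minimal illustration under our own assumptions (a list of fake "chunks" stands in for a real chunked file, and a None sentinel signals the worker to stop):

```python
import queue
import threading

work = queue.Queue()
results = []

def worker():
    # the worker reads one chunk at a time off the queue
    while True:
        chunk = work.get()
        if chunk is None:   # sentinel: no more work
            break
        results.append(chunk.upper())  # "process" the chunk
        work.task_done()

t = threading.Thread(target=worker)
t.start()
for chunk in ["chunk-1", "chunk-2", "chunk-3"]:  # the producer
    work.put(chunk)
work.put(None)
t.join()
print(results)  # ['CHUNK-1', 'CHUNK-2', 'CHUNK-3']
```

In a real pipeline you would start several workers and push file chunks instead of strings; queue.Queue handles the locking for you.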
ACTIVITY 3: PERMUTATION, ITERATOR, LAMBDA, LIST
In this activity, we will be using permutations to generate all possible three-digit numbers that can be generated using 0, 1, and 2. Then, loop over this iterator, and also use isinstance and assert to make sure that the return types are tuples. Also, use a single line of code involving dropwhile and lambda expressions to convert all the tuples to lists while dropping any leading zeros (for example, (0, 1, 2) becomes [1, 2]). Finally, write a function that takes a list like before and returns the actual number contained in it.
These steps will guide you to solve this activity:
1. Look up the definition of permutations and dropwhile from itertools.
2. Write an expression to generate all the possible three-digit numbers using 0, 1, and 2.
5. Check the actual type that dropwhile returns.
With this activity, we have finished this topic and we will head over to the next topic, which involves basic file-level operations. But before we leave this topic, we encourage you to think about a solution to the preceding problem without using all the advanced operations and data structures we have used here. You will soon realize how complex the naive solution is, and how much value these data structures and operations bring.
Note
The solution for this activity can be found on page 289.
Note
In fact, one of the factors of the famous 12-factor app design is the very idea of storing configuration in the environment. You can check it out at this URL: https://12factor.net/config.
The purpose of the OS module is to give you ways to interact with operating system-dependent functionalities. In general, it is pretty low-level and most of the functions from there are not useful on a day-to-day basis; however, some are worth learning. os.environ is the collection Python maintains with all the present environment variables in your OS. It gives you the power to create new ones. The os.getenv function gives you the ability to read an environment variable:
import os
2. Set a few environment variables:
os.environ['MY_KEY'] = "MY_VAL"
os.getenv('MY_KEY')
The output is as follows:
'MY_VAL'
print(os.getenv('MY_KEY_NOT_SET'))
The output is as follows:
None
print(os.environ)
FILE HANDLING
In this exercise, we will learn about how to open a file in Python. We will learn about the different modes that we can use and what they stand for. Python has a built-in open function that we will use to open a file. The open function takes a few arguments as input. Among them, the first one, which stands for the name of the file you want to open, is the only one that's mandatory. Everything else has a default value. When you call open, Python uses underlying system-level calls to open a file handler and will return it to the caller.
Usually, a file can be opened either for reading or for writing. If we open a file in one mode, the other operation is not supported. Whereas reading usually means we start to read from the beginning of an existing file, writing can mean either starting a new file and writing from the beginning or opening an existing file and appending to it. Here is a table showing you all the different modes Python supports for opening a file:
r    Open for reading (default)
w    Open for writing, truncating the file first
x    Open for exclusive creation, failing if the file already exists
a    Open for writing, appending to the end of the file if it exists
b    Binary mode
t    Text mode (default)
+    Open for updating (reading and writing)
There also exists a deprecated mode, U, which in a Python 3 environment does nothing. One thing we must remember here is that Python will always differentiate between t and b modes, even if the underlying OS doesn't. This is because in b mode, Python does not try to decode what it is reading and gives us back the bytes object instead, whereas in t mode, it does try to decode the stream and gives us back the string representation.
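The t versus b distinction can be seen directly: reading the same content in the two modes yields a str and a bytes object, respectively. A small sketch using a throwaway file of our own (the filename mode_demo.txt is ours, not from the exercise):

```python
# write a small file, then read it back in text and binary modes
with open("mode_demo.txt", "w", encoding="utf8") as fd:
    fd.write("hello")

with open("mode_demo.txt", "rt", encoding="utf8") as fd:
    text = fd.read()   # decoded for us: a str

with open("mode_demo.txt", "rb") as fd:
    raw = fd.read()    # undecoded: a bytes object

print(type(text), type(raw))  # <class 'str'> <class 'bytes'>
```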
fd = open("Alice's Adventures in Wonderland, by Lewis Carroll")
This is opened in rt mode. You can open the same file in binary mode if you want. To open the file in binary mode, use the rb mode:
fd = open("Alice's Adventures in Wonderland, by Lewis Carroll", "rb")
fd
The output is as follows:
<_io.BufferedReader name="Alice's Adventures in Wonderland, by Lewis Carroll">
This is how we open a file for writing:
fd = open("interesting_data.txt", "w")
fd
The output is as follows:
<_io.TextIOWrapper
name='interesting_data.txt' mode='w'
encoding='cp1252'>
fd = open("Alice's Adventures in Wonderland, by Lewis Carroll", "rb")
fd.close()
3. Python also gives us a closed flag with the file handler. If we print it before closing, then we will see False, whereas if we print it after closing, then we will see True. If our logic checks whether a file is properly closed or not, then this is the flag we want to use.
THE WITH STATEMENT
In this exercise, we will learn about the with statement in Python and how we can effectively use it in the context of opening and closing files.
Note
There is an entire PEP for with at https://www.python.org/dev/peps/pep-0343/. We encourage you to look into it.
with open("Alice's Adventures in Wonderland, by Lewis Carroll") as fd:
    print(fd.closed)
print(fd.closed)
The output is as follows:
False
True
Note
This is by far the cleanest and most Pythonic way to open a file and obtain a file descriptor for it. We encourage you to use this pattern whenever you need to open a file by yourself.
with open("Alice's Adventures in Wonderland, by Lewis Carroll", encoding="utf8") as fd:
    for line in fd:
        print(line)
The output is as follows:
2. Looking at the preceding code, we can really see why it is important. With this small snippet of code, you can even open and read files that are many GBs in size, line by line, and without flooding or overrunning the system memory!
There is another explicit method in the file descriptor object called readline, which reads one line at a time from a file.
with open("Alice's Adventures in Wonderland, by Lewis Carroll", encoding="utf8") as fd:
    for line in fd:
        print(line)
    print("Ended first loop")
    for line in fd:
        print(line)
The output is as follows:
data_dict = {"India": "Delhi", "France": "Paris", "UK": "London", "USA": "Washington"}
with open("data_temporary_files.txt", "w") as fd:
    for country, capital in data_dict.items():
        fd.write("The capital of {} is {}\n".format(country, capital))
with open("data_temporary_files.txt", "r") as fd:
    for line in fd:
        print(line)
The output is as follows:
The capital of India is Delhi
The capital of France is Paris
The capital of UK is London
The capital of USA is Washington
data_dict_2 = {"China": "Beijing", "Japan": "Tokyo"}
with open("data_temporary_files.txt", "a") as fd:
    for country, capital in data_dict_2.items():
        print("The capital of {} is {}".format(country, capital), file=fd)
with open("data_temporary_files.txt", "r") as fd:
    for line in fd:
        print(line)
The output is as follows:
The capital of India is Delhi
The capital of France is Paris
The capital of UK is London
The capital of USA is Washington
The capital of China is Beijing
The capital of Japan is Tokyo
With this, we will end this topic. Just like the previous topics, we have designed an activity for you to practice your newly acquired skills.
In this activity, we will be tasked with building our own CSV reader and parser. Although it is a big task if we try to cover all use cases and edge cases, along with escape characters and all, for the sake of this small activity, we will keep our requirements small. We will assume that there is no escape character, meaning that if you use a comma at any place in your row, it means you are starting a new column. We will also assume that the only function we are interested in is to be able to read a CSV file line by line, where each read will generate a new dict with the column names as keys and row values as values.
Here is an example:
1. Import zip_longest from itertools. Create a function to zip header, line and fillvalue=None.
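As a head start on step 1, here is a minimal sketch of how zip_longest pairs a header with a (possibly short) row; the function name row_to_dict is our own, not part of the activity:

```python
from itertools import zip_longest

def row_to_dict(header, line):
    # split the CSV line on commas (no escaping, per the activity's
    # assumptions) and pair each column name with its value;
    # a short row is padded with None
    values = line.strip().split(",")
    return dict(zip_longest(header, values, fillvalue=None))

print(row_to_dict(["name", "capital"], "India,Delhi\n"))
# {'name': 'India', 'capital': 'Delhi'}
print(row_to_dict(["name", "capital"], "France\n"))
# {'name': 'France', 'capital': None}
```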
Summary
In this chapter, we learned about the workings of advanced data structures such as stacks and queues. We implemented and manipulated both stacks and queues. We then focused on different methods of functional programming, including iterators, and combined lists and functions together. After this, we looked at the OS-level functions and the management of environment variables and files. We also examined a clean way to deal with files, and we created our own CSV parser in the last activity.
In the next chapter, we will be dealing with the three most important libraries, namely NumPy, pandas, and matplotlib.
Chapter 3
Introduction to NumPy, Pandas, and Matplotlib
Learning Objectives
By the end of the chapter, you will be able to:
Apply matplotlib, NumPy, and pandas to calculate descriptive statistics from a DataFrame/matrix
In this chapter, you will learn about the fundamentals of the NumPy, pandas, and matplotlib libraries.
Introduction
In the preceding chapters, we have covered some advanced data structures, such as stack, queue, iterator, and file operations in Python. In this section, we will cover three essential libraries, namely NumPy, pandas, and matplotlib.
NumPy Arrays
In the life of a data scientist, reading and manipulating arrays is of prime importance, and it is also the most frequently encountered task. These arrays could be a one-dimensional list or a multi-dimensional table or a matrix full of numbers.
Note
NumPy arrays are optimized data structures for numerical analysis, and that's why they are so important to data scientists.
1. To work with NumPy, we must import it. By convention, we give it a short name, np, while importing:
import numpy as np
list_1 = [1,2,3]
array_1 = np.array(list_1)
We just created a NumPy array object called array_1 from the regular Python list object, list_1.
4. Create an array of floating-type elements 1.2, 3.4, and 5.6. Note that this step uses Python's built-in array module, which must be imported first:
import array as arr
a = arr.array('d', [1.2, 3.4, 5.6])
print(a)
The output is as follows:
array('d', [1.2, 3.4, 5.6])
5. Let's check the type of the newly created object by using the type function:
type(array_1)
The output is as follows:
numpy.ndarray
type(list_1)
The output is as follows:
list
list_2 = list_1 + list_1
print(list_2)
The output is as follows:
[1, 2, 3, 1, 2, 3]
array_2 = array_1 + array_1
print(array_2)
The output is as follows:
[2 4 6]
Did you notice the difference? The first print shows a list with 6 elements, [1, 2, 3, 1, 2, 3]. But the second print shows another NumPy array (or vector) with the elements [2 4 6], which are just the sums of the individual elements of array_1.
A vector is a collection of numbers that can represent, for example, the coordinates of points in a three-dimensional space or the color values (RGB) of a picture. Naturally, relative order is important for such a collection and, as we discussed previously, a NumPy array can maintain such order relationships. That's why they are perfectly suitable to use in numerical computations.
NumPy arrays even support element-wise exponentiation. For example, suppose there are two arrays – the elements of the first array will be raised to the power of the elements in the second array:
1. Multiply two arrays using the following command:
print("array_1 multiplied by array_1: ", array_1*array_1)
The output is as follows:
array_1 multiplied by array_1: [1 4 9]
2. Divide two arrays using the following command:
print("array_1 divided by array_1: ", array_1/array_1)
The output is as follows:
array_1 divided by array_1: [1. 1. 1.]
print("array_1 raised to the power of array_1: ", array_1**array_1)
The output is as follows:
array_1 raised to the power of array_1: [ 1 4 27]
list_5 = [i for i in range(1,6)]
print(list_5)
The output is as follows:
[1, 2, 3, 4, 5]
array_5 = np.array(list_5)
array_5
The output is as follows:
array([1, 2, 3, 4, 5])
# sine function
print("Sine: ", np.sin(array_5))
The output is as follows:
Sine: [ 0.84147098 0.90929743 0.14112001 -0.7568025 -0.95892427]
# logarithm
print("Natural logarithm: ", np.log(array_5))
print("Base-10 logarithm: ", np.log10(array_5))
print("Base-2 logarithm: ", np.log2(array_5))
The output is as follows:
# Exponential
print("Exponential: ", np.exp(array_5))
The output is as follows:
Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591 ]
print("A series of numbers:", np.arange(5,16))
The output is as follows:
A series of numbers: [ 5 6 7 8 9 10 11 12 13 14 15]
print("Numbers spaced apart by 2: ", np.arange(0,11,2))
print("Numbers spaced apart by a floating point number: ", np.arange(0,11,2.5))
print("Every 5th number from 30 in reverse order\n", np.arange(30,-1,-5))
The output is as follows:
[30 25 20 15 10 5 0]
print("11 linearly spaced numbers between 1 and 5: ", np.linspace(1,5,11))
The output is as follows:
11 linearly spaced numbers between 1 and 5: [1. 1.4 1.8 2.2 2.6 3. 3.4 3.8 4.2 4.6 5. ]
list_2D = [[1,2,3], [4,5,6], [7,8,9]]
mat1 = np.array(list_2D)
print("Type/Class of this object:", type(mat1))
print("Here is the matrix\n----------\n", mat1, "\n----------")
The output is as follows:
Type/Class of this
object: <class
'numpy.ndarray'>
----------
[[1 2 3]
[4 5 6]
[7 8 9]]
----------
tuple_2D = np.array([(1.5,2,3), (4,5,6)])
mat_tuple = np.array(tuple_2D)
print(mat_tuple)
The output is as follows:
[[1.5 2. 3. ]
[4. 5. 6. ]]
print("Dimension of this matrix: ", mat1.ndim, sep='')
The output is as follows:
Dimension of this matrix: 2
print("Size of this matrix: ", mat1.size, sep='')
The output is as follows:
Size of this matrix: 9
print("Shape of this matrix: ", mat1.shape, sep='')
The output is as follows:
Shape of this matrix: (3, 3)
print("Data type of this matrix: ", mat1.dtype, sep='')
The output is as follows:
Data type of this matrix: int32
print("Vector of zeros: ", np.zeros(5))
The output is as follows:
Vector of zeros: [0. 0. 0. 0. 0.]
print("Matrix of zeros: ", np.zeros((3,4)))
The output is as follows:
Matrix of zeros: [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
print("Matrix of 5's: ", 5*np.ones((3,3)))
The output is as follows:
Matrix of 5's: [[5. 5. 5.]
 [5. 5. 5.]
 [5. 5. 5.]]
print("Identity matrix of dimension 2:", np.eye(2))
The output is as follows:
Identity matrix of dimension 2: [[1. 0.]
 [0. 1.]]
print("Identity matrix of dimension 4:", np.eye(4))
The output is as follows:
Identity matrix of dimension 4: [[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
print("Random matrix of shape (4,3):\n", np.random.randint(low=1, high=10, size=(4,3)))
The output is as follows:
[[6 7 6]
[5 6 7]
[5 3 6]
[2 9 4]]
1. Create an array of 30 random integers (sampled from 1 to 99) and reshape it into two different forms using the following code:
a = np.random.randint(1,100,30)
b = a.reshape(2,3,5)
c = a.reshape(6,5)
2. Print the shapes using the following code:
print("Shape of a:", a.shape)
print("Shape of b:", b.shape)
print("Shape of c:", c.shape)
The output is as follows:
Shape of a: (30,)
Shape of b: (2, 3, 5)
Shape of c: (6, 5)
3. Print the arrays a, b, and c using the following code:
print("\na looks like\n", a)
print("\nb looks like\n", b)
print("\nc looks like\n", c)
The output is as follows:
a looks like
[ 7 82 9 29 50 50 71
65 33 84 55 78 40 68
50 15 65 55 98 38 23
75 50 57
32 69 34 59 98 48]
b looks like
[[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]]
[[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]]
c looks like
[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]
[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]
Note
"b" is a three-dimensional
array – a kind of list of a list of
a list.
4. Ravel array b using the following code:
b_flat = b.ravel()
print(b_flat)
The output is as follows:
[ 7 82 9 29 50 50 71
65 33 84 55 78 40 68
50 15 65 55 98 38 23
75 50 57
32 69 34 59 98 48]
Note
In multi-dimensional arrays, you can use two numbers to denote the position of an element. For example, if the element is in the third row and second column, its indices are 2 and 1 (because of Python's zero-based indexing).
1. Create an array of 11 elements (0 through 10) and examine its various elements by slicing and indexing the array with slightly different syntaxes. Do this by using the following command:
array_1 = np.arange(0,11)
print("Array:", array_1)
The output is as follows:
Array: [ 0 1 2 3 4 5 6 7 8 9 10]
2. Print the element at the seventh index by using the following command:
print("Element at 7th index is:", array_1[7])
The output is as follows:
Element at 7th index is: 7
print("Elements from 3rd to 5th index are:", array_1[3:6])
The output is as follows:
Elements from 3rd to 5th index are: [3 4 5]
print("Elements up to 4th index are:", array_1[:4])
The output is as follows:
Elements up to 4th index are: [0 1 2 3]
print("Elements from last backwards are:", array_1[-1::-1])
The output is as follows:
Elements from last backwards are: [10 9 8 7 6 5 4 3 2 1 0]
array_2 = np.arange(0,21,2)
print("New array:", array_2)
The output is as follows:
New array: [ 0 2 4 6 8 10 12 14 16 18 20]
print("Elements at 2nd, 4th, and 9th index are:", array_2[[2,4,9]])
The output is as follows:
Elements at 2nd, 4th, and 9th index are: [ 4 8 18]
matrix_1 = np.random.randint(10,100,15).reshape(3,5)
print("Matrix of random 2-digit numbers\n", matrix_1)
The output is as follows:
[[21 57 60 24 15]
[53 20 44 72 68]
[39 12 99 99 33]]
print("\nDouble bracket indexing\n")
print("Element in row index 1 and column index 2:", matrix_1[1][2])
The output is as follows:
Double bracket indexing
Element in row index 1 and column index 2: 44
print("\nSingle bracket with comma indexing\n")
print("Element in row index 1 and column index 2:", matrix_1[1,2])
print("Entire row at index 2:", matrix_1[2])
print("Entire column at index 3:", matrix_1[:,3])
print("\nSubsetting sub-matrices\n")
The output is as follows:
Subsetting sub-matrices
[[72 68]
 [99 33]]
[[57 24]
 [20 72]]
CONDITIONAL SUBSETTING
Conditional subsetting is a way to select specific elements based on some numeric condition. It is almost like a shortened version of a SQL query to subset elements. See the following example:
matrix_1 = np.array(np.random.randint(10,100,15)).reshape(3,5)
The sample output is as follows (note that the exact output will be different for you as it is random):
[[71 89 66 99 54]
[28 17 66 35 85]
[82 35 38 15 47]]
print(matrix_1[matrix_1 > 50])
The output is as follows:
[71 89 66 99 54 66 85 82]
matrix_1 = np.random.randint(1,10,9).reshape(3,3)
matrix_2 = np.random.randint(1,10,9).reshape(3,3)
print("\n1st Matrix of random single-digit numbers\n", matrix_1)
print("\n2nd Matrix of random single-digit numbers\n", matrix_2)
The output is as follows:
[[6 5 9]
[4 7 1]
[3 2 7]]
[[2 3 1]
[9 9 9]
[9 9 6]]
print("\nAddition\n", matrix_1+matrix_2)
print("\nMultiplication\n", matrix_1*matrix_2)
print("\nDivision\n", matrix_1/matrix_2)
print("\nLinear combination: 3*A - 2*B\n", 3*matrix_1-2*matrix_2)
The output is as follows:
Addition
[[ 8 8 10]
[13 16 10]
[12 11 13]]
Multiplication
[[12 15 9]
[36 63 9]
[27 18 42]]
Division
[[3. 1.66666667 9. ]
[0.44444444 0.77777778
0.11111111]
[0.33333333 0.22222222
1.16666667]]
Linear combination:
3*A - 2*B
[[ 14 9 25]
[ -6 3 -15]
[ -9 -12 9]]
print("\nAddition of a scalar (100)\n", 100+matrix_1)
print("\nExponentiation, matrix cubed here\n", matrix_1**3)
print("\nExponentiation, square root using 'pow' function\n", pow(matrix_1, 0.5))
The output is as follows:
Addition of a scalar (100)
[[106 105 109]
 [104 107 101]
 [103 102 107]]
Exponentiation, matrix cubed here
[[216 125 729]
 [ 64 343 1]
 [ 27 8 343]]
Exponentiation, square root using 'pow' function
[[2.44948974 2.23606798 3. ]
 [2. 2.64575131 1. ]
 [1.73205081 1.41421356 2.64575131]]
STACKING ARRAYS
Stacking arrays on top of each other (or side by side) is a useful operation for data wrangling. Here is the code:
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
print("Matrix a\n",a)
print("Matrix b\n",b)
print("Vertical
stacking\n",np.vstack((a,b)))
print("Horizontal
stacking\n",np.hstack((a,b)))
Th e ou t p u t i s as f ol l ow s:
Matrix a
[[1 2]
[3 4]]
Matrix b
[[5 6]
[7 8]]
Vertical stacking
[[1 2]
[3 4]
[5 6]
[7 8]]
Horizontal stacking
[[1 2 5 6]
[3 4 7 8]]
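vstack and hstack are, in effect, special cases of the more general np.concatenate, which takes an axis argument; a quick sketch using the same a and b as above:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 stacks rows (like vstack); axis=1 stacks columns (like hstack)
print(np.concatenate((a, b), axis=0))
print(np.concatenate((a, b), axis=1))
```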
Pandas DataFrames
The pandas library is a Python package that provides fast, flexible, and expressive data structures that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool that's available in any language.
labels = ['a','b','c']
my_data = [10,20,30]
array_1 = np.array(my_data)
d = {'a':10,'b':20,'c':30}
print("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)
The output is as follows:
Labels: ['a', 'b', 'c']
My data: [10, 20, 30]
Dictionary: {'a': 10, 'b': 20, 'c': 30}
import pandas as pd
series_1 = pd.Series(data=my_data)
print(series_1)
The output is as follows:
0 10
1 20
2 30
dtype: int64
series_2 = pd.Series(data=my_data, index=labels)
print(series_2)
The output is as follows:
a 10
b 20
c 30
dtype: int64
series_3 = pd.Series(array_1, labels)
print(series_3)
The output is as follows:
a 10
b 20
c 30
dtype: int32
series_4 = pd.Series(d)
print(series_4)
The output is as follows:
a 10
b 20
c 30
dtype: int64
print("\nHolding numerical data\n", '-'*25, sep='')
print(pd.Series(array_1))
The output is as follows:
Holding numerical data
----------------------
---
0 10
1 20
2 30
dtype: int32
print(pd.Series(labels))
The output is as follows:
--------------------
0 a
1 b
2 c
dtype: object
print("\nHolding functions\n", '-'*20, sep='')
print(pd.Series(data=[sum, print, len]))
The output is as follows:
Holding functions
--------------------
0 <built-in function
sum>
1 <built-in function
print>
2 <built-in function
len>
dtype: object
print("\nHolding objects from a dictionary\n", '-'*40, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))
The output is as follows:
----------------------
------------------
0 <built-in method
keys of dict object at
0x0000...
1 <built-in method
items of dict object
at 0x000...
2 <built-in method
values of dict object
at 0x00...
dtype: object
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
2. Define the row labels as ('A','B','C','D','E') and the column labels as ('W','X','Y','Z'):
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
3. The function to create a DataFrame is pd.DataFrame, and it is called next:
print("\nThe data frame looks like\n", '-'*45, sep='')
print(df)
The output is as follows:
----------------------
----------------------
-
W X Y Z
A 6 3 3 3
B 1 9 9 4
C 4 3 6 9
D 4 8 6 7
E 6 6 9 1
4. Create a DataFrame from a Python dictionary of some lists of integers by using the following command:
d = {'a':[10,20], 'b':[30,40], 'c':[50,60]}
5. Pass this dictionary as the data argument to the pd.DataFrame function. Pass on a list of rows or indices. Notice how the dictionary keys became the column names and that the values were distributed among multiple rows:
df2 = pd.DataFrame(data=d, index=['X','Y'])
print(df2)
The output is as follows:
a b c
X 10 30 50
Y 20 40 60
matrix_data = np.random.randint(1,100,100).reshape(25,4)
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, columns=column_headings)
df.head()
df.head(8)
df.tail(10)
print(df['X'])
print(df[['X','Z']])
The output is as follows (a screenshot is shown here because the actual column is long):
This is the output showing the type of a single column:
This is the output showing the type of a pair of columns:
Note
For m ore th a n one c olu m n, th e obje c t tu rns into a
Da ta Fra m e . Bu t for a s ing le c olu m n, it is a
p a nd a s s e rie s obje c t.
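The Note above can be checked directly with type(); a minimal sketch with a small DataFrame of our own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['W', 'X', 'Y', 'Z'])

print(type(df['X']))         # a single column is a pandas Series
print(type(df[['X', 'Z']]))  # a list of columns gives a DataFrame
```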
matrix_data =
np.random.randint(1,10,size=20).reshape(5,
4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data,
index=row_labels,
columns=column_headings)
print("\nSingle row\n")
print(df.loc['C'])
print("\nMultiple rows\n")
print(df.loc[['B','C']])
print("\nSingle row\n")
print(df.iloc[2])
print("\nMultiple rows\n")
print(df.iloc[[1,2]])
Th e samp l e ou t p u t i s as f ol l ow s:
print("\nA column is created by assigning it in relation\n", '-'*75, sep='')
df['New'] = df['X']+df['Z']
df['New (Sum of X and Z)'] = df['X']+df['Z']
print(df)
2. Drop a column using the df.drop method:
print("\nA column is dropped by using df.drop() method\n", '-'*55, sep='')
df = df.drop('New', axis=1)  # Notice the axis=1 option; axis=0 is the default, so one has to change it to 1
print(df)
3. Drop a specific row using the df.drop method:
df1 = df.drop('A')
print("\nA row is dropped by using df.drop method and axis=0\n", '-'*65, sep='')
print(df1)
print("\nAn in-place change can be done by making inplace=True in the drop method\n", '-'*75, sep='')
df.drop('New (Sum of X and Z)', axis=1, inplace=True)
print(df)
Note
All the normal operations are not in-place; that is, they do not impact the original DataFrame object but return a copy of the original with the addition (or deletion). The last bit of code shows how to make a change in the existing DataFrame with the inplace=True argument. Please note that this change is irreversible and should be used with caution.
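To make the copy-versus-in-place distinction concrete, here is a minimal sketch on a throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# drop() returns a modified copy; the original is untouched
df2 = df.drop('B', axis=1)
print(df.columns.tolist())   # ['A', 'B'] -- original unchanged

# With inplace=True, the original object itself is modified (irreversibly)
df.drop('B', axis=1, inplace=True)
print(df.columns.tolist())   # ['A']
```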
Statistics and Visualization with NumPy and Pandas
One of the great advantages of using libraries such as NumPy and pandas is that a plethora of built-in statistical and visualization methods are available, for which we don't have to search for and write new code. Furthermore, most of these subroutines are written using C or Fortran code (and pre-compiled), making them extremely fast to execute.
REFRESHER OF BASIC DESCRIPTIVE STATISTICS (AND THE MATPLOTLIB LIBRARY FOR VISUALIZATION)
For any data wrangling task, it is quite useful to extract basic descriptive statistics from the data and create some simple visualizations/plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process from which the data might have been generated.

As inferential statistics is intimately coupled with the machine learning/predictive modeling stage of a data science pipeline, descriptive statistics naturally becomes associated with the data wrangling aspect.

In this topic, we will demonstrate how you can accomplish both of these tasks using Python. Apart from NumPy and pandas, we will need to learn the basics of another great package – matplotlib – which is the most powerful and versatile visualization library in Python.
people = ['Ann','Brandon','Chen','David','Emily','Farook',
'Gagan','Hamish','Imran','Joseph','Katherine','Lily']
age = [21,12,32,45,37,18,28,52,5,40,48,15]
weight = [55,35,77,68,70,60,72,69,18,65,82,48]
height = [160,135,170,165,173,168,175,159,105,171,155,158]
import matplotlib.pyplot as plt
plt.scatter(age,weight)
plt.show()
The output is as follows:
Figure 3.13: A screenshot of a scatter plot
containing age and weight
The plot can be improved by enlarging the figure size, customizing the aspect ratio, adding a title with a proper font size, adding X-axis and Y-axis labels with a customized font size, adding grid lines, changing the Y-axis limit to be between 0 and 100, adding X- and Y-tick marks, customizing the scatter plot's color, and changing the size of the scatter dots.
4. The code for the improved plot is as follows:
plt.figure(figsize=(8,6))
plt.title("Plot of Age vs. Weight (in kgs)",fontsize=20)
plt.xlabel("Age (years)",fontsize=16)
plt.ylabel("Weight (kgs)",fontsize=16)
plt.grid(True)
plt.ylim(0,100)
plt.xticks([i*5 for i in range(12)],fontsize=15)
plt.yticks(fontsize=15)
plt.scatter(x=age,y=weight,c='orange',s=150,edgecolors='k')
plt.text(x=20,y=85,s="Weights after 18-20 years of age",fontsize=15)
plt.vlines(x=20,ymin=0,ymax=80,linestyles='dashed',color='blue',lw=3)
plt.legend(['Weight in kgs'],loc=2,fontsize=12)
plt.show()
The output is as follows:
The plt.show() function is used at the very end. The idea is to keep on adding various graphics properties (font, color, axis limits, text, legend, grid, and so on) until you are satisfied and then show the plot with one function. The plot will not be displayed without this last function call.
DEFINITION OF STATISTICAL MEASURES – CENTRAL TENDENCY AND SPREAD
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. They are also categorized as summary statistics:
Median: The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above it and below it.

Mode: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.

Variance: This is the most common measure of spread. Variance is the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

Standard Deviation: Because variance is produced by squaring the distance from the mean, its unit does not match that of the original data. Standard deviation is a mathematical trick to bring back the parity. It is the positive square root of the variance.
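These measures can be computed directly with NumPy and the standard library; a quick sketch on a small sample list:

```python
import numpy as np
from statistics import mode

data = [12, 15, 15, 18, 21, 24, 30]

print("Median:", np.median(data))        # middle value after sorting -> 18.0
print("Mode:", mode(data))               # most frequent value -> 15
print("Variance:", np.var(data))         # mean of squared deviations from the mean
print("Std. deviation:", np.std(data))   # positive square root of the variance
```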
Typical examples of random variables that are around us are as follows:
The economic output of a nation
The blood pressure of a patient
WHAT IS A PROBABILITY DISTRIBUTION?
A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of a variable vary based on the underlying probability distribution.
DISCRETE DISTRIBUTIONS
Discrete probability functions are also known as probability mass functions and can assume a discrete number of values. For example, coin tosses and counts of events are discrete functions. You can have only heads or tails in a coin toss. Similarly, if you're counting the number of trains that arrive at a station per hour, you can count 11 or 12 trains, but nothing in-between.
CONTINUOUS DISTRIBUTIONS
Continuous probability functions are also known as probability density functions. You have a continuous distribution if the variable can assume an infinite number of values between any two values. Continuous variables are often measurements on a real number scale, such as height, weight, and temperature.
Here, we will discuss three of the most important distributions that may come in handy for data wrangling tasks – uniform, binomial, and Gaussian normal. The goal here is to show examples of simple function calls that can generate one or more random numbers/arrays whenever the user needs them.
Note
The results will be different for each student when they use these functions, as they are supposed to be random.
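If reproducible results are needed (for example, to compare your numbers with someone else's), the random generator can be seeded first; a small sketch:

```python
import numpy as np

np.random.seed(42)   # fix the seed -> the same "random" numbers on every run
a = np.random.randint(1, 10, size=5)

np.random.seed(42)   # re-seeding restarts the identical sequence
b = np.random.randint(1, 10, size=5)

print(a)
print(b)             # identical to a
```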
x = np.random.randint(1,10)
print(x)
x = np.random.randint(1,10,size=1)
print(x)
[8]
x = 50+50*np.random.random(size=15)
x = x.round(decimals=2)
print(x)
print(x)
[[0.99240105 0.9149215  0.04853315]
 [0.8425871  0.11617792 0.77983995]
 [0.82769081 0.57579771 0.11358125]]
Suppose we have a biased coin where the probability of heads is 0.6. We toss this coin ten times and note down the number of heads turning up each time. That is one trial or experiment. Now, we can repeat this experiment (10 coin tosses) any number of times, say 8 times. Each time, we record the number of heads:
x = np.random.binomial(10,0.6,size=8)
print(x)
[6 6 5 6 5 8 4 5]
plt.figure(figsize=(7,4))
plt.title("Number of successes in coin toss",fontsize=16)
plt.bar(np.arange(1,9),height=x) # bar positions go in the first argument (newer matplotlib removed the old 'left' keyword)
plt.xlabel("Experiment number",fontsize=15)
plt.ylabel("Number of successes",fontsize=15)
plt.show()
x = np.random.normal()
print(x)
-1.2423774071573694
We know that the normal distribution is characterized by two parameters – mean (µ) and standard deviation (σ). In fact, the default values for this particular function are µ = 0.0 and σ = 1.0.

Suppose we know that the heights of the teenage (12-16 years) students in a particular school are distributed normally with a mean height of 155 cm and a standard deviation of 10 cm.
heights = np.random.normal(loc=155,scale=10,size=100)
# Plotting code
#-----------------------
plt.figure(figsize=(7,5))
plt.hist(heights,color='orange',edgecolor='k')
plt.title("Histogram of teenage students' height",fontsize=18)
plt.xlabel("Height in cm",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()
The best part of working with a pandas DataFrame is that it has a built-in utility function to show all of these descriptive statistics with a single line of code. It does this by using the describe method:
people_df=pd.DataFrame(data=people_dict)
people_df
The output is as follows:
Figure 3.19: Output of the created
dictionary
print(people_df.shape)
The output is as follows:
(12, 4)
print(people_df['Age'].count())
The output is as follows:
12
print(people_df['Age'].sum())
The output is as follows:
353
The output is as follows:
29.416666666666668
print(people_df['Weight'].median())
The output is as follows:
66.5
7. Calculate the maximum height by using the following command:
print(people_df['Height'].max())
The output is as follows:
175
print(people_df['Weight'].std())
The output is as follows:
18.45120510148239
Note how we are calling the statistical functions directly from a DataFrame object.
pcnt_75 = np.percentile(people_df['Age'],75)
pcnt_25 = np.percentile(people_df['Age'],25)
print("Inter-quartile range: ",pcnt_75-pcnt_25)
The output is as follows:
Inter-quartile range:  24.0
10. Use the describe command to find a detailed description of the DataFrame:
print(people_df.describe())
The output is as follows:
Note
This function works only on the columns where numeric data is present. It has no impact on the non-numeric columns, for example, People in this DataFrame.
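If a summary of the non-numeric columns is also wanted, describe accepts an include argument; a small sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'People': ['Ann', 'Brandon', 'Ann'],
                   'Age': [21, 12, 32]})

# By default, only the numeric 'Age' column is summarized
print(df.describe())

# include='all' adds count/unique/top/freq for the object column too
print(df.describe(include='all'))
```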
1. Find the histogram of the weights by using the hist function:
people_df['Weight'].hist()
plt.show()
The output is as follows:
Figure 3.21: Histogram of the weights
people_df.plot.scatter('Weight','Height',s=150,
c='orange',edgecolor='k')
plt.grid(True)
plt.title("Weight vs. Height scatter plot",fontsize=18)
plt.xlabel("Weight (in kg)",fontsize=15)
plt.ylabel("Height (in cm)",fontsize=15)
plt.show()
The output is as follows:
Figure 3.22: Weight versus Height scatter plot
Note
You can try regular matplotlib methods around this function call to make your plot pretty.
ACTIVITY 5: GENERATING STATISTICS FROM A CSV FILE
Suppose you are working with the famous Boston housing price (from 1960) dataset. This dataset is famous in the machine learning community. Many regression problems can be formulated, and machine learning algorithms can be run on this dataset. You will perform a basic data wrangling activity (including plotting some trends) on this dataset by reading it as a pandas DataFrame.
Note
The pandas function for reading a CSV file is read_csv.
These steps will help you complete this activity:
Summary
In this chapter, we started with the basics of NumPy arrays, including how to create them and their essential properties. We discussed and showed how a NumPy array is optimized for vectorized element-wise operations and differs from a regular Python list. Then, we moved on to practicing various operations on NumPy arrays such as indexing, slicing, filtering, and reshaping. We also covered special one-dimensional and two-dimensional arrays, such as zeros, ones, identity matrices, and random arrays.

Next, we covered the basics of plotting with matplotlib, the most widely used and popular Python library for visualization. Along with plotting exercises, we touched upon refresher concepts of descriptive statistics (such as central tendency and measure of spread) and probability distributions (such as uniform, binomial, and normal).
In the next chapter, we will cover more advanced operations with pandas DataFrames that will come in very handy for day-to-day work in a data wrangling job.
Chapter 4
A Deep Dive into Data Wrangling with Python
Learning Objectives
By the end of this chapter, you will be able to:
In this chapter, we will learn about pandas DataFrames in detail.
Introduction
In this chapter, we will learn about several advanced operations involving pandas DataFrames and NumPy arrays. On completing the detailed activity for this chapter, you will have handled real-life datasets and understood the process of data wrangling.
Subsetting, Filtering, and Grouping
One of the most important aspects of data wrangling is to curate the data carefully from the deluge of streaming data that pours into an organization or business entity from various sources. Lots of data is not always a good thing; rather, data needs to be useful and of high quality to be effectively used in downstream activities of a data science pipeline, such as machine learning and predictive model building. Moreover, one data source can be used for multiple purposes, and this often requires different subsets of data to be processed by a data wrangling module. This is then passed on to separate analytics modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("Sample - Superstore.xls")
df.head()
3. Drop this column altogether from the DataFrame by using the drop method:
df.drop('Row ID',axis=1,inplace=True)
4. Check the number of rows and columns in the newly created dataset. We will use the shape function here:
df.shape
The output is as follows:
(9994, 20)
SUBSETTING THE DATAFRAME
Subsetting involves the extraction of partial data based on specific columns and rows, as per business needs. Suppose we are interested only in the following information from this dataset: Customer ID, Customer Name, City, Postal Code, and Sales. For demonstration purposes, let's assume that we are only interested in 5 records – rows 5-9. We can subset the DataFrame to extract only this much information using a single line of Python code.
df_subset = df.loc[
[i for i in range(5,10)],
['Customer ID','Customer
Name','City','Postal Code',
'Sales']]
df_subset
The output is as follows:
Figure 4.2: DataFrame indexed by name of the
columns
df_subset.describe()
The output is as follows:
Figure 4.3 Output of descriptive statistics of data
We simply extract records 100-199 and run the describe function on them because we don't want to process all the data! For this particular business question, we are only interested in sales and profit numbers, and therefore we should not take the easy route and run a describe function on all the data. For a real-life dataset, the number of rows and columns could often be in the millions, and we don't want to compute anything that is not asked for in the data wrangling task. We always aim to subset the exact data that needs to be processed and run statistical or plotting functions on that partial data:
Figure 4.4: Boxplot of sales and profit
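The subsetting described above can be sketched as follows. A small synthetic frame stands in for the Superstore data here (in the book, df comes from pd.read_excel("Sample - Superstore.xls")); the Sales and Profit column names are taken from the text:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Superstore DataFrame
rng = np.random.default_rng(0)
df = pd.DataFrame({'Sales': rng.uniform(1, 500, 300),
                   'Profit': rng.uniform(-50, 100, 300)})

# Subset only rows 100-199 and only the columns the business question needs
df_subset = df.loc[100:199, ['Sales', 'Profit']]
print(df_subset.describe())
```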
1. Extract the countries/states/cities for which the information is in the database, with one simple line of code, as follows:
df['State'].unique()
The output is as follows:
Figure 4.5: Different states present in the
dataset
df['State'].nunique()
The output is as follows:
49
This returns 49 for this dataset. So, one out of 50 states in the US does not appear in this dataset.
Similarly, if we run this function on the Country column, we get an array with only one element, United States. Immediately, we can see that we don't need to keep the country column at all, because there is no useful information in that column except that all the entries are the same. This is how a simple function helped us decide about dropping a column altogether – that is, removing 9,994 pieces of unnecessary data!
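That decision can be automated with nunique(); a minimal sketch on a toy frame mimicking the situation:

```python
import pandas as pd

# Toy frame: 'Country' holds a single repeated value, like in the Superstore data
df = pd.DataFrame({'Country': ['United States'] * 5,
                   'State': ['CA', 'NY', 'CA', 'TX', 'WA'],
                   'Sales': [10, 20, 30, 40, 50]})

# nunique() == 1 means the column carries no information -> drop it
if df['Country'].nunique() == 1:
    df.drop('Country', axis=1, inplace=True)

print(df.columns.tolist())   # ['State', 'Sales']
```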
CONDITIONAL SELECTION AND BOOLEAN FILTERING
Often, we don't want to process the whole dataset and would like to select only a partial dataset whose contents satisfy a particular condition. This is probably the most common use case of any data wrangling task.
What are the average sales and profit figures in California?
Which states have the highest and lowest total sales?
We will show you how to use conditional subsetting and Boolean filtering to answer such questions.
df_subset
The output is as follows:
Figure 4.6: Sample dataset
Now, if we just want to know the records with sales higher than $100, then we can write the following:
df_subset>100
df_subset[df_subset>100]
The output is as follows:
Figure 4.8: Results after passing the Boolean DataFrame as an index to the original DataFrame
The NaN values came from the fact that the preceding code tried to create a DataFrame with TRUE indices (in the Boolean DataFrame) only. The values that were TRUE in the Boolean DataFrame were retained in the final output DataFrame.

Now, we probably don't want to work with this resulting DataFrame with NaN values. We wanted a smaller DataFrame with only the rows where Sales > $100. We can achieve that by simply passing only the Sales column:
df_subset[df_subset['Sales']>100]
This produces the expected result:
df_subset[(df_subset['State']!='Colorado')
& (df_subset['Sales']>100)]
Note
Although, in theory, there is no limit on how complex a conditional you can build using individual expressions and the & (LOGICAL AND) and | (LOGICAL OR) operators, it is advisable to create intermediate Boolean DataFrames with limited conditional expressions and build your final DataFrame step by step. This keeps the code legible and scalable.
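The step-by-step style the note recommends can be sketched like this, with each condition given its own name before combining:

```python
import pandas as pd

df = pd.DataFrame({'State': ['Colorado', 'Texas', 'Ohio', 'Texas'],
                   'Sales': [150.0, 90.0, 220.0, 310.0]})

# Build each condition as its own named Boolean Series...
not_colorado = df['State'] != 'Colorado'
high_sales = df['Sales'] > 100

# ...then combine them; the intent stays readable as conditions pile up
result = df[not_colorado & high_sales]
print(result)
```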
1. Create the matrix_data, row_labels, and column_headings variables using the following command:
matrix_data = np.matrix('22,66,140;42,70,148;30,62,125;35,68,160;25,62,152')
row_labels = ['A','B','C','D','E']
column_headings = ['Age', 'Height', 'Weight']
df1 = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe DataFrame\n",'-'*25, sep='')
print(df1)
The output is as follows:
print("\nAfter resetting index\n",'-'*35, sep='')
print(df1.reset_index())
Figure 4.12: DataFrame after resetting the index
print("\nAfter resetting index with 'drop' option TRUE\n",'-'*45, sep='')
print(df1.reset_index(drop=True))
print("\nAdding a new column 'Profession'\n",'-'*45, sep='')
df1['Profession'] = "Student Teacher Engineer Doctor Nurse".split()
print(df1)
The output is as follows:
print("\nSetting 'Profession' column as index\n",'-'*45, sep='')
print(df1.set_index('Profession'))
The output is as follows:
Note
The name GroupBy should be quite familiar to those who have used a SQL-based tool before.
df_subset = df.loc[[i for i in range(10)], ['Ship Mode','State','Sales']]
2. Create a pandas DataFrame using the groupby object, as follows:
byState = df_subset.groupby('State')
print("\nGrouping by 'State' column and listing mean sales\n",'-'*50, sep='')
print(byState.mean())
The output is as follows:
print("\nGrouping by 'State' column and listing total sum of sales\n",'-'*50, sep='')
print(byState.sum())
The output is as follows:
Figure 4.17: The output after grouping by state and listing the sum of sales
pd.DataFrame(byState.describe().loc['California'])
The output is as follows:
Note how pandas has grouped the data by State first and then by cities under each state.
byStateCity=df.groupby(['State','City'])
byStateCity.describe()['Sales']
The output is as follows:
Figure 4.20: Checking the summary statistics of sales
Apart from the obvious issue of poor-quality data, missing data can sometimes wreak havoc with the machine learning (ML) model downstream. A few ML models, like Bayesian learning, are inherently robust to outliers and missing data, but common techniques like decision trees and random forests have an issue with missing data because the fundamental splitting strategy employed by these techniques depends on an individual piece of data and not a cluster. Therefore, it is almost always imperative to impute missing data before handing it over to such an ML model.
Outlier detection is a subtle art. Often, there is no universally agreed definition of an outlier. In a statistical sense, a data point that falls outside a certain range may often be classified as an outlier, but to apply that definition, you need to have a fairly high degree of certainty about the assumption of the nature and parameters of the inherent statistical distribution of the data. It takes a lot of data to build that statistical certainty, and even after that, an outlier may not be just unimportant noise but a clue to something deeper. Let's take an example with some fictitious sales data from an American fast food chain restaurant. If we want to model the daily sales data as a time series, we observe an unusual spike in the data somewhere around mid-April:
Figure 4.21: Fictitious sales data of an American fast
food chain restaurant
Therefore, the key to outliers is their systematic and timely detection in an incoming stream of millions of data points or while reading data from cloud-based storage. In this topic, we will quickly go over some basic statistical tests for detecting outliers and some basic imputation techniques for filling in missing data.
df_missing=pd.read_excel("Sample -
Superstore.xls",sheet_name="Missing")
df_missing
The output is as follows:
df_missing.isnull()
Figure 4.24 Output highlighting the missing values
for c in df_missing.columns:
    miss = df_missing[c].isnull().sum()
    if miss>0:
        print("{} has {} missing value(s)".format(c,miss))
    else:
        print("{} has NO missing value!".format(c))
The mean function can be used to fill using the average of the two values:
Th e ou tpu t is as follow s:
Figure 4.26: Missing values replaced with
FILL
df_missing[['Customer','Product']].fillna('FILL')
Th e ou tpu t is as follow s:
Note
df_missing['Sales'].fillna(method='ffill')
4. Use backfill or bfill to fill backward, that is, copy from the next data in the series:
df_missing['Sales'].fillna(method='bfill')
df_missing['Sales'].fillna(df_missing.mean()['Sales'])
Figure 4.29: Using average to fill in missing data
The how argument determines if a row or column is removed from a DataFrame when we have at least one NaN or all NaNs.
The thresh argument requires that many non-NaN values to keep the row/column.
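Both arguments can be sketched quickly on a small frame with missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

print(df.dropna(how='any'))   # drop rows containing at least one NaN
print(df.dropna(how='all'))   # drop rows that are entirely NaN (none here)
print(df.dropna(thresh=2))    # keep rows with at least 2 non-NaN values
```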
df_missing.dropna(axis=0)
df_missing.dropna(axis=1)
Figure 4.30: Dropping rows or columns to
handle missing data
The output is as follows:
Data entry errors
Measurement errors due to noise or instrumental failure
df_sample = df[['Customer
Name','State','Sales','Profit']].sample(n=
50).copy()
df_sample['Sales'].iloc[5]=-1000.0
df_sample['Sales'].iloc[15]=-500.0
To plot the box plot, use the following code:
df_sample.plot.box()
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.grid(True)
The output is as follows:
We can create simple boxplots to check for any unusual/nonsensical values. For example, in the preceding example, we intentionally corrupted two sales values to be negative and they were readily caught in a boxplot.

We can create a distribution of a numerical quantity and check for values that lie at the extreme end to see if they are truly part of the data or outliers. For example, if a distribution is almost normal, then any value more than 4 or 5 standard deviations away may be a suspect:
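The standard-deviation rule above can be sketched as a simple z-score filter on synthetic data with one planted outlier:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=5, size=1000)
data = np.append(data, [120.0])        # plant one obvious outlier

# Flag values more than 4 standard deviations from the mean
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 4]
print(outliers)
```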
Concatenating, Merging, and Joining
Merging and joining tables or datasets are highly common operations in the day-to-day job of a data wrangling professional. These operations are akin to the JOIN query in SQL for relational database tables. Often, the key data is present in multiple tables, and those records need to be brought into one combined table that's matching on that common key. This is an extremely common operation in any type of sales or transactional data, and therefore must be mastered by a data wrangler. The pandas library offers nice and intuitive built-in methods to perform various types of JOIN queries involving multiple DataFrame objects.
EXERCISE 54: CONCATENATION
We will start by learning the concatenation of DataFrames along various axes (rows or columns). This is a very useful operation as it allows you to grow a DataFrame as new data comes in or new feature columns need to be inserted in the table:
df_1 = df[['Customer Name','State','Sales','Profit']].sample(n=4)
df_2 = df[['Customer Name','State','Sales','Profit']].sample(n=4)
df_3 = df[['Customer Name','State','Sales','Profit']].sample(n=4)
df_cat1 = pd.concat([df_1,df_2,df_3], axis=0)
df_cat1
Figure 4.34: Concatenating DataFrames
together
df_cat2 = pd.concat([df_1,df_2,df_3], axis=1)
df_cat2
This is often the first step in building a large database for machine learning tasks where daily incoming data may be put into separate tables. However, at the end of the day, the most recent table needs to be merged with the master data table to be fed into the backend machine learning server, which will then update the model and its prediction capacity.
Here, we will show a simple example of an inner join with Customer Name as the key:
df_1=df[['Ship Date','Ship Mode','Customer Name']][0:4]
df_1
The output is as follows:
The second DataFrame is as follows:
df_2=df[['Customer Name','Product Name','Quantity']][0:4]
df_2
The output is as follows:
pd.merge(df_1,df_2,on='Customer Name',how='inner')
The output is as follows:
3. Drop the duplicates by using the following command:
pd.merge(df_1,df_2,on='Customer Name',how='inner').drop_duplicates()
The output is as follows:
df_3=df[['Customer Name','Product Name','Quantity']][2:6]
df_3
The output is as follows:
Figure 4.40: Creating table df_3
pd.merge(df_1,df_3,on='Customer Name',how='inner').drop_duplicates()
The output is as follows:
pd.merge(df_1,df_3,on='Customer Name',how='outer').drop_duplicates()
The output is as follows:
Figure 4.42: Outer join on table df_1 and table df_2 and dropping the duplicates
df_1=df[['Customer Name','Ship Date','Ship Mode']][0:4]
df_1.set_index(['Customer Name'],inplace=True)
df_1
df_2=df[['Customer Name','Product Name','Quantity']][2:6]
df_2.set_index(['Customer Name'],inplace=True)
df_2
The outputs are as follows:
df_1.join(df_2,how='left').drop_duplicates()
The output is as follows:
df_1.join(df_2,how='right').drop_duplicates()
The output is as follows:
Figure 4.45: Right join on table df_1 and table df_2 after dropping the duplicates
df_1.join(df_2,how='inner').drop_duplicates()
The output is as follows:
df_1.join(df_2,how='outer').drop_duplicates()
The output is as follows:
Figure 4.47: Outer join on table df_1 and table df_2 after dropping the duplicates
Useful Methods of Pandas
In this topic, we will discuss some small utility functions that are offered by pandas so that we can work efficiently with DataFrames. They don't fall under any particular group of functions, so they are mentioned here under the Miscellaneous category.
1. Specify the number of samples that you require from the DataFrame by using the following command:
df.sample(n=5)
The output is as follows:
df.sample(frac=0.1)
The output is as follows:
Figure 4.49: DataFrame with 10% of the data sampled
df.sample(frac=0.1, replace=True)
The output is as follows:
Figure 4.50: DataFrame with 10% of the data sampled and repetition enabled
df['Customer Name'].value_counts()[:10]
The output is as follows:
We can extract this information by using one simple piece of code (we sample 100 records first to keep the computation fast and then apply the code):
df_sample = df.sample(n=100)
df_sample.pivot_table(values=
['Sales','Quantity','Profit'],index=
['Region','State'],aggfunc='mean')
The output is as follows (note that your specific output may be different due to random sampling):
df_sample
The output is as follows:
df_sample.sort_values(by='Sales')
The output is as follows:
Figure 4.54: DataFrame with the Sales value sorted
df_sample.sort_values(
by=['State','Sales'])
Th e ou tpu t is as follow s:
Figure 4.55: DataFrame sorted with respect to Sales
and State
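Single-key versus multi-key sorting can be sketched on hypothetical data; with a list of keys, ties on the first key are broken by the second:

```python
import pandas as pd

df = pd.DataFrame({'State': ['CA', 'NY', 'CA'],
                   'Sales': [300, 100, 200]})

by_sales = df.sort_values(by='Sales')                   # ascending Sales
by_state_sales = df.sort_values(by=['State', 'Sales'])  # State first, then Sales
```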
def categorize_sales(price):
    # cut-off values chosen for illustration
    if price < 50:
        return "Low"
    elif price < 200:
        return "Medium"
    else:
        return "High"
df_sample = df[['Customer Name','State','Sales']].sample(n=100)
df_sample.head(10)
The output is as follows:
Note
We need to create a new column to store the category string values that are returned by the function.
df_sample['Sales Price Category'] = df_sample['Sales'].apply(categorize_sales)
df_sample.head(10)
The output is as follows:
df_sample['Customer Name Length'] = df_sample['Customer Name'].apply(len)
df_sample.head(10)
The output is as follows:
5. Instead of writing out a separate function, we can even insert lambda expressions directly into the apply method for short functions. For example, let's say we are promoting our product and want to show the discounted sales price if the original price is > $200. We can do this using a lambda function and the apply method:
df_sample['Discounted Price'] = df_sample['Sales'].apply(lambda x: 0.85*x if x>200 else x)
df_sample.head(10)
The output is as follows:
Figure 4.59: Lambda function
Note
The lambda function contains a conditional, and a discount is applied to those records where the original sales price is > $200.
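The discount logic can be verified on a few hypothetical Sales values:

```python
import pandas as pd

df = pd.DataFrame({'Sales': [100.0, 250.0, 300.0]})

# 15% discount only where the original price exceeds $200
df['Discounted Price'] = df['Sales'].apply(lambda x: 0.85 * x if x > 200 else x)
```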
The aim of this activity is to practice various advanced pandas DataFrame operations, for example, subsetting, applying user-defined functions, summary statistics, visualizations, boolean indexing, group by, and outlier detection, on a real-life dataset. We have the data downloaded as a CSV file on the disk for your ease. However, it is recommended to practice data downloading on your own so that you are familiar with the process.
Here is the URL for the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/.
These are the steps that will help you solve this activity:
11. Group the records based on age and education to find how the mean age is distributed.
12. Group by occupation and show the summary statistics of age. Find which profession has the oldest workers on average and which profession has its largest share of the workforce above the 75th percentile.
Summary
In this chapter, we dived deep into the pandas library to learn advanced data wrangling techniques. We started with some advanced subsetting and filtering on DataFrames and rounded this up by learning about boolean indexing and conditional selection of subsets of data. We also covered how to set and reset the index of a DataFrame, especially while initializing.
Next, we learned about a particular topic that has a deep connection with traditional relational database systems – the groupby method. Then, we dived deep into an important skill for data wrangling – checking for and handling missing data. We showed you how pandas helps in handling missing data using various imputation techniques. We also discussed methods for dropping missing values. Furthermore, methods and usage examples of concatenation and merging of DataFrame objects were shown. We saw the join method and how it compares to a similar operation in SQL.
In the next chapter, you will be exposed to real-life data wrangling techniques, as applied to web scraping.
Introduction
So far in this book, we have focused on learning about the pandas DataFrame object as the main data structure for the application of wrangling techniques. Now, we will learn about various techniques by which we can read data into a DataFrame from external sources. Some of those sources could be text-based (CSV, HTML, JSON, and so on), whereas others could be binary (Excel, PDF, and so on), that is, not in ASCII format. In this chapter, we will learn how to deal with data that is present in web pages or HTML documents. This holds very high importance in the work of a data practitioner.
Note
Since we have gone through detailed examples of basic operations with NumPy and pandas, in this chapter, we will often skip trivial code snippets such as viewing a table, selecting a column, and plotting. Instead, we will focus on showing code examples for the new topics we aim to learn about here.
In the first topic of this chapter, we will go through various data sources and how they can be imported into pandas DataFrames, thus imbuing wrangling professionals with extremely valuable data ingestion knowledge.
Execute the following code in your Jupyter notebook cells (don't forget the ! before each line of code) to install the necessary libraries:
import numpy as np
import pandas as pd
df1 = pd.read_csv("CSV_EX_1.csv")
df1
The output is as follows:
df2 = pd.read_csv("CSV_EX_2.csv")
df2
The output is as follows:
Figure 5.2: Output of the .csv being read using a DataFrame
df2 = pd.read_csv("CSV_EX_2.csv", header=None)
df2
4. Add the names argument to get the correct headers:
df2 = pd.read_csv("CSV_EX_2.csv", header=None, names=['Bedroom','Sq.ft','Locality','Price($)'])
df2
df3 = pd.read_csv("CSV_EX_3.csv")
df3
2. The output will be as follows:
df3 = pd.read_csv("CSV_EX_3.csv", sep=';')
df3
The output is as follows:
df4 = pd.read_csv("CSV_EX_1.csv", names=['A','B','C','D'])
df4
The output is as follows:
df4 = pd.read_csv("CSV_EX_1.csv", header=0, names=['A','B','C','D'])
df4
The output is as follows:
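The header/names interplay can be reproduced without files by feeding read_csv an in-memory buffer; the CSV contents below are made up:

```python
import io
import pandas as pd

# Made-up contents standing in for CSV_EX_2.csv (no header row)
csv_text = "2,1500,Good,300000\n3,2000,Bad,400000\n"

# header=None: don't treat the first row as a header; names supplies columns
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=['Bedroom', 'Sq.ft', 'Locality', 'Price($)'])
```

With header=0 and names together, the file's own header row would be read and discarded in favor of names.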
Note
The first two lines in the CSV file are irrelevant data.
df5 = pd.read_csv("CSV_EX_skiprows.csv")
df5
The output is as follows:
Figure 5.10: DataFrame with an unexpected error
df5 = pd.read_csv("CSV_EX_skiprows.csv", skiprows=2)
df5
The output is as follows:
We have to use skipfooter and the engine='python' option to enable this. There are two engines for these CSV reader functions – one based on C and one based on Python, of which only the Python engine supports the skipfooter option.
df6 = pd.read_csv("CSV_EX_skipfooter.csv", skiprows=2, skipfooter=1, engine='python')
df6
The output is as follows:
df7 = pd.read_csv("CSV_EX_1.csv", nrows=2)
df7
The output is as follows:
1. Create a list where DataFrames will be stored:
list_of_dataframe = []
2. Store the number of rows to be read in each chunk in a variable:
rows_in_a_chunk = 10
3. Store the number of chunks to read in a variable:
num_chunks = 5
4. Create a dummy DataFrame to get the column names:
df_dummy = pd.read_csv("Boston_housing.csv", nrows=2)
colnames = df_dummy.columns
for i in range(0, num_chunks*rows_in_a_chunk, rows_in_a_chunk):
    df = pd.read_csv("Boston_housing.csv", header=0, skiprows=i, nrows=rows_in_a_chunk, names=colnames)
    list_of_dataframe.append(df)
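The chunked-reading loop above can be exercised end to end on an in-memory CSV; the 20 rows below are made up to stand in for Boston_housing.csv:

```python
import io
import pandas as pd

# 20 made-up rows standing in for Boston_housing.csv
csv_text = "a,b\n" + "\n".join("{},{}".format(i, i * 2) for i in range(20))

rows_in_a_chunk = 5
colnames = pd.read_csv(io.StringIO(csv_text), nrows=1).columns

list_of_dataframe = []
for i in range(0, 20, rows_in_a_chunk):
    # skiprows advances past already-read lines; header=0 discards the
    # line that lands in the header position, and names restores columns
    chunk = pd.read_csv(io.StringIO(csv_text), header=0, skiprows=i,
                        nrows=rows_in_a_chunk, names=colnames)
    list_of_dataframe.append(chunk)
```

Each pass re-opens the source and skips ahead, which is simple but re-scans the file; for very large files, read_csv's chunksize argument avoids the rescan.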
SETTING THE SKIP_BLANK_LINES OPTION
By default, read_csv ignores blank lines. But sometimes, you may want to read them in as NaN so that you can count how many such blank entries were present in the raw data file. In some situations, this is an indicator of the quality and consistency of the default data stream. For this, you have to disable the skip_blank_lines option:
df9 = pd.read_csv("CSV_EX_blankline.csv", skip_blank_lines=False)
df9
The output is as follows:
Figure 5.15: DataFrame that has blank rows of a .csv file
df10 = pd.read_csv('CSV_EX_1.zip')
df10
The output is as follows:
df11_1 = pd.read_excel("Housing_data.xlsx", sheet_name='Data_Tab_1')
df11_2 = pd.read_excel("Housing_data.xlsx", sheet_name='Data_Tab_2')
df11_3 = pd.read_excel("Housing_data.xlsx", sheet_name='Data_Tab_3')
If the Excel file has multiple distinct sheets but the sheet_name argument is set to None, then an ordered dictionary will be returned by the read_excel function. Thereafter, we can simply iterate over that dictionary or its keys to retrieve individual DataFrames.
dict_df = pd.read_excel("Housing_data.xlsx", sheet_name=None)
dict_df.keys()
The output is as follows:
odict_keys(['Data_Tab_1', 'Data_Tab_2', 'Data_Tab_3'])
df13 = pd.read_table("Table_EX_1.txt")
df13
The output is as follows:
2. In this case, we have to set the separator explicitly, as follows:
df13 = pd.read_table("Table_EX_1.txt", sep=',')
df13
The output is as follows:
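read_table defaults to a tab separator, which is why sep must be set for comma-separated text; this can be checked with an in-memory buffer (contents made up):

```python
import io
import pandas as pd

txt = "a,b\n1,2\n3,4\n"

# Without sep=',' each whole line lands in a single column
df_wrong = pd.read_table(io.StringIO(txt))
df_right = pd.read_table(io.StringIO(txt), sep=',')
```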
Note
The read_html method returns a list of DataFrames (even if the page has a single DataFrame) and you have to extract the relevant tables from the list.
url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
list_of_df = pd.read_html(url)
df14 = list_of_df[0]
df14.head()
These results are shown in the following DataFrame:
list_of_df = pd.read_html("https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table", header=0)
2. If we check the length of the list returned, we will see it is 6:
len(list_of_df)
The output is as follows:
for t in list_of_df:
    print(t.shape)
The output is as follows:
df15 = list_of_df[1]
df15.head()
5. The output is as follows:
df16.head()
The output is as follows:
cast_of_avengers = df16[(df16['title']=="The Avengers") & (df16['year']==2012)]['cast']
print(list(cast_of_avengers))
[['Robert Downey, Jr.', 'Chris Evans', 'Mark Ruffalo', 'Chris Hemsworth', 'Scarlett Johansson', 'Jeremy Renner', 'Tom Hiddleston', 'Clark Gregg', 'Cobie Smulders', 'Stellan Skarsgård', 'Samuel L. Jackson']]
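The boolean-mask extraction used on df16 works on any DataFrame; here is the same pattern on a hypothetical stand-in for the scraped films table:

```python
import pandas as pd

# Hypothetical stand-in for the scraped films table df16
df16 = pd.DataFrame({
    'title': ['The Avengers', 'Iron Man', 'The Avengers'],
    'year':  [2012, 2008, 1998],
    'cast':  [['Robert Downey, Jr.', 'Chris Evans'],
              ['Robert Downey, Jr.'],
              ['Uma Thurman']],
})

# Combine two conditions with &, each wrapped in parentheses
cast_of_avengers = df16[(df16['title'] == 'The Avengers') &
                        (df16['year'] == 2012)]['cast']
```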
df17 = pd.read_stata("wu-data.dta")
urllib3
pandas
pytest
flake8
distro
pathlib
df18_1 = read_pdf('Housing_data.pdf', pages=[1], pandas_options={'header':None})
df18_1
The output is as follows:
df18_2 = read_pdf('Housing_data.pdf', pages=[2], pandas_options={'header':None})
df18_2
The output is as follows:
3. To concatenate the tables that were derived from the first two steps, execute the following code:
df18 = pd.concat([df18_1, df18_2], axis=1)
df18
The output is as follows:
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','PRICE']
df18_1 = read_pdf('Housing_data.pdf', pages=[1], pandas_options={'header':None,'names':names[:10]})
df18_2 = read_pdf('Housing_data.pdf', pages=[2], pandas_options={'header':None,'names':names[10:]})
df18 = pd.concat([df18_1, df18_2], axis=1)
df18
The output is as follows:
We will have a full activity on reading tables from a PDF report and processing them at the end of this chapter.
Introduction to Beautiful Soup 4 and Web Page Parsing
The ability to read and understand web pages is of paramount interest for a person collecting and formatting data. For example, consider the task of gathering data about movies and then formatting it for a downstream system. Data for movies is best obtained from websites such as IMDB, and that data does not come pre-packaged in nice forms (CSV, JSON, and so on), so you need to know how to download and read web pages.
STRUCTURE OF HTML
Before we jump into bs4 and start working with it, we need to examine the structure of an HTML document. HyperText Markup Language is a structured way of telling web browsers about the organization of a web page, meaning which kinds of elements (text, image, video, and so on) come from where, in which place inside the page they should appear, what they look like, what they contain, and how they will behave with user input. HTML5 is the latest version of HTML. An HTML document can be viewed as a tree, as we can see from the following diagram:
Figure 5.27: HTML structure
So, when you are at a particular element of the tree, you can visit all the children of that element to get their contents and attributes.
In this topic, we will cover the reading and parsing of web pages, but we will not request them from a live website. Instead, we will read them from disk. A section on reading them from the internet will follow in a future chapter.
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(type(soup))
The output is as follows:
<class 'bs4.BeautifulSoup'>
The output is as follows:
Figure 5.29: Contents of the HTML file
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(soup.p)
The output is as follows:
As we can see, this is the content of a <p> tag.
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    all_ps = soup.find_all('p')
    print("Total number of <p> --- {}".format(len(all_ps)))
The output is as follows:
We have seen how to access all the tags of the same type. We have also seen how to get the content of the entire HTML document.
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    print(table.contents)
The output is as follows:
Figure 5.31: Content under the <table> tag
Here, we are getting the (first) table from the document and then using the same "." notation to get the contents under that tag.
We saw in the previous exercise that we can access the entire content under a particular tag. However, HTML is represented as a tree, and we are able to traverse the children of a particular node. There are a few ways to do this.
7. The first way is by using the children generator from any bs4 instance, as follows:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    for child in table.children:
        print(child)
        print("*****")
When we execute the code, we will see something like the following:
Figure 5.32: Traversing the children of a table node
It seems that the loop has only been executed twice! Well, the problem with the children generator is that it only takes into account the immediate children of the tag. We have <tbody> under the <table>, and our whole table structure is wrapped in it. That's why it was considered a single child of the <table> tag.
We looked into how to browse the immediate children of a tag. Now, we will see how we can browse all the possible children of a tag, and not only the immediate ones.
8. To do that, we use the descendants generator from the bs4 instance, as follows:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    children = table.children
    des = table.descendants
    print(len(list(children)), len(list(des)))
The output is as follows:
9 61
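The children/descendants difference is easy to see on a minimal table; the HTML string below is a stand-in for the book's test.html:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the book's test.html table
html = "<table><tbody><tr><td>1</td><td>2</td></tr></tbody></table>"
soup = BeautifulSoup(html, 'html.parser')
table = soup.table

n_children = len(list(table.children))        # only the immediate <tbody>
n_descendants = len(list(table.descendants))  # every nested tag and string
```

Here children yields just the one <tbody>, while descendants yields tbody, tr, the two td tags, and their two text nodes.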
import pandas as pd
fd = open("test.html", "r")
soup = BeautifulSoup(fd)
data = soup.findAll('tr')
print("Data is a {} and {} items long".format(type(data), len(data)))
The output is as follows:
Data is a <class 'bs4.element.ResultSet'> and 4 items long
data_without_header = data[1:]
headers = data[0]
headers
The output is as follows:
<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>
Note
col_headers = [th.getText() for th in headers.findAll('th')]
col_headers
The output is as follows:
df_data = [[td.getText() for td in tr.findAll('td')] for tr in data_without_header]
df_data
The output is as follows:
df = pd.DataFrame(df_data, columns=col_headers)
df.head()
Figure 5.34: Output in tabular format with column headers
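The full tr/th/td-to-DataFrame pipeline can be run on an inline table (a stand-in for test.html):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Minimal stand-in for the book's test.html
html = """<table>
<tr><th>Entry Header 1</th><th>Entry Header 2</th></tr>
<tr><td>A</td><td>B</td></tr>
<tr><td>C</td><td>D</td></tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.findAll('tr')                     # header row plus data rows
col_headers = [th.getText() for th in rows[0].findAll('th')]
df_data = [[td.getText() for td in tr.findAll('td')] for tr in rows[1:]]
df = pd.DataFrame(df_data, columns=col_headers)
```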
2. To save the DataFrame as an Excel file, use the following commands from inside the Jupyter notebook:
writer = pd.ExcelWriter('test_output.xlsx')
df.to_excel(writer, "Sheet1")
writer.save()
writer
The output is as follows:
<pandas.io.excel._XlsxWriter at 0x24feb2939b0>
soup = BeautifulSoup(fd)
lis = soup.find('ul').findAll('li')
stack = []
for li in lis:
    a = li.find('a', href=True)
    stack.append(a['href'])
3. Print the stack:
ACTIVITY 7: READING TABULAR DATA FROM A WEB PAGE AND CREATING DATAFRAMES
In this activity, you have been given a Wikipedia page where you have the GDP of all countries listed. You have been asked to create three DataFrames from the three sources mentioned in the page (https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)).
You will have to do the following:
Summary
In this topic, we looked at the structure of an HTML document. HTML documents are the cornerstone of the World Wide Web and, given the amount of data that's contained in them, we can easily infer the importance of HTML as a data source.
Introduction
In this chapter, we will learn about the secret sauce behind creating a successful data wrangling pipeline. In the previous chapters, we were introduced to the basic data structures and building blocks of data wrangling, such as pandas and NumPy. In this chapter, we will look at the data handling section of data wrangling.
ADDITIONAL SOFTWARE REQUIRED FOR THIS SECTION
The code for this exercise depends on two additional libraries. We need to install SciPy and python-Levenshtein, and we are going to install them in the running Docker container. Be wary of this if you are not working inside the container.
To install the libraries, type the following command in the running Jupyter notebook:
Advanced List Comprehension and the zip Function
In this topic, we will dive deep into the heart of list comprehension. We have already seen a basic form of it, from something as simple as a = [i for i in range(0, 30)] to something a bit more complex involving one conditional statement. However, as we already mentioned, list comprehension is a very powerful tool and, in this topic, we will explore its power further. We will investigate another close relative of list comprehension called generators, and also work with zip and its related functions and methods. By the end of this topic, you will be confident in handling complicated logical problems.
INTRODUCTION TO GENERATOR EXPRESSIONS
Previously, while discussing advanced data structures, we witnessed functions such as repeat. We said that they represent a special type of function known as iterators. We also showed you how the lazy evaluation of an iterator can lead to an enormous amount of space saving and time efficiency.
Iterators are one brick in the functional programming construct that Python has to offer. Functional programming is indeed a very efficient and safe way to approach a problem. It offers various advantages over other methods, such as modularity, ease of debugging and testing, composability, formal provability (a theoretical computer science concept), and so on.
odd_numbers2 = [x for x in range(100000) if x % 2 != 0]
getsizeof(odd_numbers2)
The output is as follows:
406496
odd_numbers = (x for x in range(100000) if x % 2 != 0)
for i, number in enumerate(odd_numbers):
    print(number)
    if i > 10:
        break
The output is as follows:
1
3
5
7
9
11
13
15
17
19
21
23
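The memory difference between the two forms can be measured directly; the exact list size varies by Python version, but the ordering always holds:

```python
from sys import getsizeof

# The list materializes all 50,000 odd numbers up front
odd_list = [x for x in range(100000) if x % 2 != 0]

# The generator stores only its iteration state
odd_gen = (x for x in range(100000) if x % 2 != 0)

list_size = getsizeof(odd_list)
gen_size = getsizeof(odd_gen)
```

The generator object is a fixed few hundred bytes regardless of how many values it will eventually yield.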
words = ["Hello\n", "My name", "is\n", "Bob", "How are you", "doing\n"]
2. Write the following generator expression to achieve the task, as follows:
modified_words = (word.strip().lower() for word in words)
final_list_of_word = [word for word in modified_words]
final_list_of_word
The output is as follows:
words = ["Hello\n", "My name", "is\n", "Bob", "How are you", "doing\n"]
modified_words2 = (w.strip().lower() for word in words for w in word.split(" "))
final_list_of_word = [word for word in modified_words2]
final_list_of_word
The output is as follows:
modified_words3 = []
for word in words:
    for w in word.split(" "):
        modified_words3.append(w.strip().lower())
modified_words3
The output is as follows:
The following diagram will help you remember the trick about nested for loops in list comprehension or generator expressions:
Create the following two lists:
marble_with_count_as_list_2 = []
for m in marbles:
    for c in counts:
        marble_with_count_as_list_2.append((m, c))
marble_with_count_as_list_2
The output is as follows:
countries = ["India", "USA", "France", "UK"]
capitals = ["Delhi", "Washington", "Paris", "London"]
countries_and_capitals = [t for t in zip(countries, capitals)]
3. This is not very well represented. We can use a dict, where the keys are the names of the countries and the values are the names of the capitals, by using the following command:
countries_and_capitals_as_dict = dict(zip(countries, capitals))
The output is as follows:
countries = ["India", "USA", "France", "UK", "Brasil", "Japan"]
capitals = ["Delhi", "Washington", "Paris", "London"]
countries_and_capitals_as_dict_2 = dict(zip_longest(countries, capitals))
countries_and_capitals_as_dict_2
The output is as follows:
Figure 6.7: Output using zip_longest
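The difference between zip and zip_longest on the unequal lists above can be sketched as:

```python
from itertools import zip_longest

countries = ["India", "USA", "France", "UK", "Brasil", "Japan"]
capitals = ["Delhi", "Washington", "Paris", "London"]

d_zip = dict(zip(countries, capitals))              # stops at the shorter list
d_longest = dict(zip_longest(countries, capitals))  # pads the rest with None
```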
Data Formatting
In this topic, we will format a given dataset. The main motivations behind formatting data properly are as follows:
To produce a human-readable report from lower-level data that is, most of the time, created for machine consumption.
To find errors in data.
There are a few ways to do data formatting in Python. We will begin with the modulus operator.
THE % OPERATOR
Python gives us the % operator to apply basic formatting on data. To demonstrate this, we will load the data first by reading the CSV file, and then we will apply some basic formatting on it.
raw_data = []
data_rows = DictReader(fd)
for data in data_rows:
    raw_data.append(dict(data))
Now, we have a list called raw_data that contains all the rows of the CSV file. Feel free to print it to check out what it looks like.
The output is as follows:
Figure 6.8: Raw data
We will be producing a report on this data. This report will contain one section for each data point and will report the name, age, weight, height, history of family disease, and finally the present heart condition of the person. These points must be clear and easily understandable English sentences.
We do this in the following way:
print(report_str)
The output is as follows:
Figure 6.9: Raw data in a presentable format
The % operator is used in two different ways:
When we use the % operator outside the quotes, it basically tells Python to start the replacement of all the data inside with the values provided for them outside.
""".format(data["Name"], data["Age"], data["Height"], data["Weight"], data["Disease_history"], data["Heart_problem"])
print(report_str)
The output is as follows:
Figure 6.10: Data formatted using the format function of the string
Notice that we have replaced the %s with {} and, instead of the % outside the quotes, we have called the format function.
We will now see how the powerful format function can make the previous code a lot more readable and understandable. Instead of simple blank {}, we mention the key names inside and then use the special Python ** operation on a dict to unpack it and give that to the format function. It is smart enough to figure out how to replace the key names inside the quotes with the values from the actual dict, by using the following command:
""".format(**data)
print(report_str)
The output is as follows:
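The three styles discussed above (%-substitution, positional {}, and named {} with **-unpacking) produce identical strings; a minimal sketch with made-up data:

```python
data = {"Name": "Bob", "Age": 24}

s1 = "Name: %s, Age: %d" % (data["Name"], data["Age"])      # % operator
s2 = "Name: {}, Age: {}".format(data["Name"], data["Age"])  # blank {}
s3 = "Name: {Name}, Age: {Age}".format(**data)              # named fields + dict unpacking
```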
original_number = 42
print("The binary representation of 42 is - {0:b}".format(original_number))
The output is as follows:
print("{:^42}".format("I am at the center"))
The output is as follows:
print("{:=^42}".format("I am at the center"))
The output is as follows:
Figure 6.14: A string that's been center formatted with padding
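The same format-spec machinery, on a shorter width so the result is easy to see:

```python
b = "{0:b}".format(42)             # binary representation
centered = "{:^11}".format("mid")  # center in a width-11 field
padded = "{:=^11}".format("mid")   # center, filling with '='
```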
The output is as follows:
Compare it with the actual output of datetime.utcnow and you will see the power of this expression easily.
Statistically, there is a proper definition and idea of what an outlier means. And often, you need deep domain expertise to understand when to call a particular record an outlier. However, in this present exercise, we will look into some basic techniques that are commonplace for flagging and filtering outliers in real-world data for day-to-day work.
1. To construct a cosine curve, execute the following command:
ys = [cos(i*(pi/4)) for i in range(50)]
import matplotlib.pyplot as plt
plt.plot(ys)
The output is as follows:
As we can see, it is a very smooth curve, and there is no outlier. We are going to introduce some now.
4. Plot the curve:
plt.plot(ys)
Figure 6.17: Wave with outliers
plt.boxplot(ys)
Use SciPy and calculate the z-score by using the following command:
cos_arr_z_score = stats.zscore(ys)
cos_arr_z_score
The output is as follows:
Figure 6.19: The z-score values
import pandas as pd
df_original = pd.DataFrame(ys)
2. Keep the records with a z-score less than 3 to remove the outliers:
cos_arr_without_outliers = df_original[(cos_arr_z_score < 3)]
print(cos_arr_without_outliers.shape)
print(df_original.shape)
From the two prints (48, 1 and 50, 1), it is clear that the derived DataFrame has two rows less. These are our outliers. If we plot the cos_arr_without_outliers DataFrame, then we will see the following output:
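The whole flag-and-filter flow can be reproduced without SciPy by computing the z-score by hand (stats.zscore performs the same mean/standard-deviation normalization); the injected outlier values here are made up, and abs() is used so that outliers on both sides are caught:

```python
from math import cos, pi

ys = [cos(i * (pi / 4)) for i in range(50)]
ys[4], ys[20] = 5.0, -5.0   # inject two artificial outliers

# Population z-score, as computed by scipy.stats.zscore
mean = sum(ys) / len(ys)
std = (sum((y - mean) ** 2 for y in ys) / len(ys)) ** 0.5
z_scores = [(y - mean) / std for y in ys]

# abs() catches outliers on both sides of the mean
cleaned = [y for y, z in zip(ys, z_scores) if abs(z) < 3]
```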
from Levenshtein import distance
name_of_ship = "Sea Princess"
for k, v in ship_data.items():
    print("{} {} {}".format(k, name_of_ship, distance(name_of_ship, k)))
The output is as follows:
Figure 6.22: Distance between the strings
Here, again, we need to be cautious about when and how to use this kind of fuzzy string matching. Sometimes, it is needed, and at other times it will result in a very bad bug.
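python-Levenshtein's distance() computes the classic edit distance; for illustration, here is the same quantity via a small pure-Python dynamic program (the ship names are made up):

```python
def levenshtein(a, b):
    # Wagner-Fischer dynamic program: minimum number of single-character
    # insertions, deletions, and substitutions turning a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

d = levenshtein("Sea Princess", "Sea Pincess")  # one deleted 'r'
```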
Activity 8: Handling Outliers and Missing Data
In this activity, we will identify and get rid of outliers. Here, we have a CSV file. The goal is to clean the data by using the knowledge that we have learned about so far and come up with a nicely formatted DataFrame. Identify the type of outliers and their effect on the data, and clean the messy data.
The steps that will help you solve this activity are as follows:
1. Read the visit_data.csv file.
4. Get rid of the outliers.
Note
Summary
In this chapter, we learned about interesting ways to deal with list data by using generator expressions. They are easy and elegant and, once mastered, they give us a powerful trick that we can use repeatedly to simplify several common data wrangling tasks. We also examined different ways to format data. Formatting data is not only useful for preparing beautiful reports – it is often very important to guarantee data integrity for the downstream system.
We ended the chapter by checking out some methods to identify and remove outliers. This is important for us because we want our data to be properly prepared and ready for all our fancy downstream analysis jobs. We also observed how important it is to take time and use domain expertise to set up rules for identifying outliers, as doing this incorrectly can do more harm than good.
In the next chapter, we will cover how to read web pages, XML files, and APIs.
Chapter 7
Advanced Web Scraping and Data Gathering
Learning Objectives
By the end of this chapter, you will be able to:
In this chapter, you will learn how to gather data from web pages, XML files, and APIs.
Introduction
The previous chapter covered how to create a successful data wrangling pipeline. In this chapter, we will build a real-life web scraper using all of the techniques that we have learned so far. This chapter builds on the foundation of BeautifulSoup and introduces various methods for scraping a web page and using an API to gather data.
LIBRARIES IN PYTHON
Python comes equipped with built-in modules, such as urllib, which can place HTTP requests over the internet and receive data from the cloud. However, these modules operate at a lower level and require deeper knowledge of HTTP protocols, encoding, and requests.
In this exercise, we will peel off the layers of HTML/CSS/JavaScript to pry away the information we are interested in.
1. Import the requests library:
import requests
2. Assign the home page URL to a variable, wiki_home:
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"
response = requests.get(wiki_home)
type(response)
The output is as follows:
requests.models.Response
The web is an extremely dynamic place. It is possible that the home page of Wikipedia will have changed by the time somebody uses your code, or that a particular web server will be down and your request will essentially fail. If you proceed to write more complex and elaborate code without checking the status of your request, then all that subsequent work will be fruitless.
We will start by getting into the habit of writing small functions to accomplish small modular tasks, instead of writing long scripts, which are hard to debug and track:
1. Create a status_check function by using the following command:
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

status_check(response)
The output is as follows:
Success!
1
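The status check can be generalized beyond the single 200 code. As a minimal sketch (the helper name and category labels below are our own, not from the book), HTTP status codes fall into families that a small pure function can classify:

```python
def classify_status(status_code):
    """Map an HTTP status code to a coarse category string."""
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 400:
        return "redirect"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "unknown"

print(classify_status(200), classify_status(404), classify_status(503))
```

Such a helper can replace the boolean check when you want to log why a request failed, not just that it failed.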
When we run the following function on the Wikipedia home page, we get back the particular encoding type that's used for that page. This function, like the previous one, takes the requests response object as an argument and returns a value:
def encoding_check(r):
    return (r.encoding)

Check the response:
encoding_check(response)
The output is as follows:
'UTF-8'
Next, define a function that decodes the response contents using the detected encoding:
def decode_content(r, encoding):
    return (r.content.decode(encoding))

contents = decode_content(response, encoding_check(response))
2. Check the type of the decoded object:
type(contents)
The output is as follows:
str
Note
3. Check the length of the object and try printing some of it:
len(contents)
The output is as follows:
74182
Obviously, this is a mixed blob of various HTML markup tags, text, and element names/properties. We cannot hope to extract meaningful information from this without using sophisticated functions or methods. Fortunately, the BeautifulSoup library provides such methods, and we will see how to use them next.
from bs4 import BeautifulSoup

soup = BeautifulSoup(contents, 'html.parser')
txt_dump=soup.text
3. Find the type of txt_dump:
type(txt_dump)
The output is as follows:
str
4. Find the length of txt_dump:
len(txt_dump)
The output is as follows:
15326
print(txt_dump[10000:11000])
First, we try to identify two indices – the start index and end index of the string – which demarcate the start and end of the text we are interested in. In the following screenshot, the indices are shown:
Figure 7.7: Wikipedia page highlighting the text to be extracted
The following code accomplishes the extraction:
idx1=txt_dump.find("From today's featured article")
idx2=txt_dump.find("Recently featured")
print(txt_dump[idx1+len("From today's featured article"):idx2])
It prints out something like this (this is a sample output):
EXTRACTING IMPORTANT HISTORICAL EVENTS THAT HAPPENED ON TODAY'S DATE
Next, we will try to extract the text corresponding to the important historical events that happened on today's date. This can generally be found at the bottom-right corner, as shown in the following screenshot:
Figure 7.9: Wikipedia page highlighting the "On this day" section
So, can we apply the same technique as we did for "From today's featured article"? Apparently not, because there is text just below where we want our extraction to end, which is not fixed, unlike in the previous case. Note that, in the previous exercise, the fixed string "Recently featured" occurs at the exact place where we want the extraction to stop, so we could use it in our code. However, we cannot do that in this case, and the reason for this is illustrated in the following screenshot:
Figure 7.10: Wikipedia page highlighting the text to be extracted
idx3=txt_dump.find("On this day")
print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])
This looks as follows:
Figure 7.11: Output of the "On this day" section from Wikipedia
As you hover over this with the mouse, you will see different portions of the page being highlighted. By doing this, it is easy to discover the precise block of markup text that is responsible for the textual information we are interested in. Here, we can see that a certain <ul> block contains the text.
Now, it is prudent to find the <div> tag that contains this <ul> block within it. By looking around the same screen as before, we find the <div> and also its ID:
Figure 7.14: The <ul> tag containing the text
Note
text_list=[] # Empty list
for d in soup.find_all('div'):
    if (d.get('id')=='mp-otd'):
        for i in d.find_all('ul'):
            text_list.append(i.text)

for i in text_list:
    print(i)
    print('-'*100)
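Looping over every <div> and testing its id works, but BeautifulSoup's find can also target the id directly, skipping the loop. A minimal sketch on a made-up HTML snippet (the markup below is illustrative, not the real Wikipedia page):

```python
from bs4 import BeautifulSoup

html = ('<div id="mp-otd"><ul><li>Event one</li><li>Event two</li></ul></div>'
        '<div id="other"><ul><li>Ignore me</li></ul></div>')
soup = BeautifulSoup(html, 'html.parser')

# find() with the id keyword goes straight to the block we want,
# instead of checking every <div> on the page
otd = soup.find('div', id='mp-otd')
texts = [ul.text for ul in otd.find_all('ul')]
print(texts)
```

Both approaches return the same text; the keyword-argument form simply pushes the filtering into BeautifulSoup.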
Note
The output is as follows:
Figure 7.15: The text highlighted
def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    """
    Extracts the text of the "On this day" section from the Wikipedia home page.
    """
    import requests
    from bs4 import BeautifulSoup

    wiki_home = str(url)
    response = requests.get(wiki_home)

    def status_check(r):
        if r.status_code==200:
            return 1
        else:
            return -1

    status = status_check(response)
    if status==1:
        contents = decode_content(response, encoding_check(response))
    else:
        print("Sorry, could not reach the web page!")
        return -1

    soup = BeautifulSoup(contents, 'html.parser')
    text_list=[]
    for d in soup.find_all('div'):
        if (d.get('id')=='mp-otd'):
            for i in d.find_all('ul'):
                text_list.append(i.text)
    return (text_list[0])
3. Note how this function utilizes the status check and prints out an error message if the request failed. When we test this function with an intentionally incorrect URL, it behaves as expected:
print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))
data = '''
<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
  </phone>
  <email hide="yes">
  [email protected]</email>
</person>'''

import xml.etree.ElementTree as ET
tree = ET.fromstring(data)
type(tree)
The output is as follows:
xml.etree.ElementTree.Element
print('Name:', tree.find('name').text)
The output is as follows:
Name: Dave
print('Surname:', tree.find('surname').text)
The output is as follows:
Surname: Piccardo
print('Email hidden:', tree.find('email').get('hide'))
The output is as follows:
Email hidden: yes
print('Email:', tree.find('email').text.strip())
The output is as follows:
Email: [email protected]
This is a fairly common situation where a frontend web scraping module has already downloaded a lot of XML files by reading a table of data on the web, and now the data wrangler needs to parse through these XML files to extract meaningful pieces of numerical and textual data.
We have a file associated with this chapter, called "xml1.xml". Please make sure you have the file in the same directory that you are running your Jupyter Notebook from:
tree2=ET.parse('xml1.xml')
type(tree2)
The output is as follows:
xml.etree.ElementTree.ElementTree
There is a root
There are children objects attached to the root
There could be multiple levels, that is, children of children, recursively going down
All of the nodes of the tree (root and children alike) have attributes attached to them that contain data
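These levels can be traversed programmatically. Here is a small sketch (with a made-up XML document, not the chapter's xml1.xml file) that walks an element tree recursively and indents each node by its depth:

```python
import xml.etree.ElementTree as ET

# Illustrative two-level document
xml_doc = "<catalog><book id='b1'><title>A</title></book><book id='b2'><title>B</title></book></catalog>"
root = ET.fromstring(xml_doc)

def walk(node, level=0):
    """Return one indented line per node, depth-first."""
    lines = ["  " * level + node.tag]
    for child in node:
        lines.extend(walk(child, level + 1))
    return lines

print("\n".join(walk(root)))
```

The printed indentation mirrors the root/children/children-of-children structure described in the list above.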
root=tree2.getroot()
for child in root:
    print("Child:", child.tag, "| Child attribute:", child.attrib)
Note
Remember that every XML data file could follow a different naming or structural format, but using an element tree approach puts the data into a somewhat structured flow that can be explored systematically. Still, it is best to examine the raw XML file structure once and understand (even if at a high level) the data format before attempting automatic extractions.
root[0][2]
root[0][2].text
The output is as follows:
'70617'
root[0][2].tag
The output is as follows:
'gdppc'
4. Check root[0]:
root[0]
The output is as follows:
<Element 'country1' at 0x00000000050298B8>
5. Check the tag:
root[0].tag
The output is as follows:
'country1'
We can use the attrib attribute to access it:
root[0].attrib
The output is as follows:
{'name': 'Norway'}
for c in root:
    country_name=c.attrib['name']
    gdppc = int(c[2].text)
    print("{}: {}".format(country_name,gdppc))
The output is as follows:
Norway: 70617
Austria: 44857
Israel: 38788
for c in root:
    ne = c.findall('neighbor') # the <neighbor> sub-elements of this country (line restored; elided in the text)
    print("Neighbors\n"+"-"*25)
    for i in ne: # Iterate over the neighbors and print their 'name' attribute
        print(i.attrib['name'])
    print('\n')
The output looks something like this:
import urllib.request, urllib.parse, urllib.error

serviceurl = 'http://www.recipepuppy.com/api/?'
item = str(input('Enter the name of a food item (enter \'quit\' to quit): '))
url = serviceurl + urllib.parse.urlencode({'q':item})+'&p=1&format=xml'
uh = urllib.request.urlopen(url)
data = uh.read().decode()
print('Retrieved', len(data), 'characters')
tree3 = ET.fromstring(data)
for elem in tree3.iter():
    print(elem.text)
The output is as follows:
Figure 7.21: The output that's generated by using iter
for e in tree3.iter():
    h=e.find('href')
    t=e.find('title')
    if h!=None and t!=None:
        print("Recipe Link for:",t.text)
        print(h.text)
        print("-"*100)
A web API is, as the name suggests, an API over the web. Note that it is not a specific technology or programming framework, but an architectural concept. Think of an API like a fast food restaurant's customer service counter. Internally, there are many food items, raw materials, cooking resources, and recipe management systems, but all you see are fixed menu items on the board, and you can only interact through those items. It is like a port that can be accessed using an HTTP protocol and is able to deliver data and services if used properly.
Therefore, it is very important for a data wrangling professional to understand the basics of data extraction from a web API, as you are extremely likely to find yourself in a situation where large quantities of data must be read through an API interface for processing and wrangling. These days, most APIs stream data out in JSON format. In this chapter, we will use a free API to read some information about various countries around the world in JSON format and process it.
import json
import pandas as pd

serviceurl = 'https://restcountries.eu/rest/v2/name/'
1. Define a function to pull out data when we pass the name of a country as an argument. The crux of the operation is contained in the following two lines of code:
url = serviceurl + country_name
uh = urllib.request.urlopen(url)
import urllib.request
from urllib.error import HTTPError, URLError

def get_country_data(country):
    """
    Retrieves the raw JSON data for the given country from the API.
    """
    country_name=str(country)
    url = serviceurl + country_name
    try:
        uh = urllib.request.urlopen(url)
    except HTTPError as e:
        print("Sorry! Could not retrieve anything on {}".format(country_name))
        return None
    except URLError as e:
        print('Failed to reach a server.')
        print('Reason: ', e.reason)
        return None
    else:
        data = uh.read().decode()
        print("Retrieved data on {}. Total {} characters read.".format(country_name,len(data)))
        return data
Note
This is an example of rudimentary error handling. You have to think about various possibilities and put in such code to catch and gracefully respond to user input when you are building a real-life web or enterprise application.
Figure 7.24: Input arguments
x=json.loads(data)
y=x[0]
type(y)
The output will be as follows:
dict
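To see why x[0] is needed, here is a miniature payload in the same shape the API returns – a JSON array containing one country object (the values below are invented for illustration):

```python
import json

payload = '[{"name": "Switzerland", "capital": "Bern", "languages": [{"name": "German"}, {"name": "French"}]}]'
x = json.loads(payload)   # the JSON array becomes a Python list
y = x[0]                  # the first (and only) country record is a dict
print(type(x).__name__, type(y).__name__)
print(y["capital"])
```

The API wraps even a single result in a list, so indexing with [0] is what yields the dictionary we work with next.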
We can quickly check the keys of the dictionary, that is, the JSON data (note that a full screenshot is not shown here). We can see the relevant country data, such as calling codes, population, area, time zones, borders, and so on:
for k,v in y.items():
    print("{}: {}".format(k,v))
The output is as follows:
Figure 7.26: The output using dict
Note
It is clear, therefore, that there is no universal method or processing function for the JSON data format, and you have to write custom loops and functions to extract data from such a dictionary object based on your particular needs.
Now, we will write a small loop to extract the languages spoken in Switzerland. First, let's examine the dictionary closely and see where the language data is.
We can write simple two-line code to extract this data:
for lang in y['languages']:
    print(lang['name'])
The output is as follows:
Capital
Region
Sub-region
Population
Latitude/longitude
Area
Gini index
Time zones
Currencies
Languages
Note
We will show you the whole function first and then discuss some key points about it. It is a slightly complex and long piece of code. However, based on your Python-based data wrangling knowledge, you should be able to examine this function closely and understand what it is doing:
import pandas as pd
import json

def build_country_database(list_country):
    """
    Takes a list of country names.
    """
    country_dict={'Country':[],'Capital':[],'Region':[],'Sub-region':[],
                  'Population':[],'Lattitude':[],'Longitude':[],'Area':[],
                  'Gini':[],'Timezones':[],'Currencies':[],'Languages':[]}
Note
The code has been truncated here. Please find the entire code at the following GitHub link and code bundle folder: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter07/Exercise93-94/Chapter%207%20Topic%203%20Exercises.ipynb.
Here are some of the key points about this function:
It starts by building an empty dictionary of lists. This is the chosen format for finally passing to the pandas DataFrame method, which can accept such a format and returns a nice DataFrame with column names set to the dictionary keys' names.
We use the previously defined get_country_data function to extract data for each country in the user-defined list. For this, we simply iterate over the list and call this function.
We check the output of the get_country_data function. If, for some reason, it returns a None object, we will know that the API reading was not successful, and we will print out a suitable message. Again, this is an example of an error-handling mechanism and you must have them in your code. Without such small error-checking code, your application won't be robust enough for the occasional incorrect input or API malfunction!
2. Finally, the output is a pandas DataFrame, which is as follows:
Figure 7.30: The data extracted correctly
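The dict-of-lists to DataFrame step at the heart of the function can be seen in isolation. A minimal sketch with invented values (only three of the twelve columns, and made-up numbers):

```python
import pandas as pd

# Same layout that build_country_database accumulates: one key per column,
# one list entry per country
country_dict = {
    'Country': ['CountryA', 'CountryB'],
    'Capital': ['CityA', 'CityB'],
    'Gini': [27.5, 30.5],
}
df = pd.DataFrame(country_dict)   # column names come from the dict keys
print(df.shape)
print(list(df.columns))
```

This is why the function starts from an empty dictionary of lists: appending to each list per country, then handing the whole dictionary to pd.DataFrame, yields the final table in one call.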
Fundamentals of Regular Expressions (RegEx)
Regular expressions, or regex, are used to identify whether a pattern exists in a given sequence of characters (a string) or not. They help in manipulating textual data, which is often a prerequisite for data science projects that involve text mining.
REGEX IN THE CONTEXT OF WEB SCRAPING
Web pages are often full of text, and while there are some methods in BeautifulSoup or the XML parser to extract raw text, there is no method for the intelligent analysis of that text. If, as a data wrangler, you are looking for a particular piece of data (for example, email IDs or phone numbers in a special format), you have to do a lot of string manipulation on a large corpus to extract email IDs or phone numbers. RegEx is very powerful and saves data wrangling professionals a lot of time and effort with string manipulation, because regexes can search for complex textual patterns with wildcards of an arbitrary length.
import re

string1 = 'Python'
pattern = r"Python"
3. Write a conditional expression to check for a match:
if re.match(pattern,string1):
    print("Matches!")
else:
    print("Doesn't match.")
The preceding code should give an affirmative answer, that is, "Matches!".
string2 = 'python'
if re.match(pattern,string2):
    print("Matches!")
else:
    print("Doesn't match.")
The output is as follows:
Doesn't match.
prog = re.compile(pattern)
prog.match(string1)
This produces a match object. We will reuse the following strings and pattern:
#string1 = 'Python'
#string2 = 'python'
#pattern = r"Python"
1. Use the compile function in RegEx:
prog = re.compile(pattern)
if prog.match(string1)!=None:
    print("Matches!")
else:
    print("Doesn't match.")
The output is as follows:
Matches!
if prog.match(string2)!=None:
    print("Matches!")
else:
    print("Doesn't match.")
The output is as follows:
Doesn't match.
EXERCISE 97: USING ADDITIONAL PARAMETERS IN MATCH TO CHECK FOR POSITIONAL MATCHING
By default, match looks for pattern matching at the beginning of the given string. But sometimes, we need to check matching at a specific location in the string:
prog = re.compile(r'y')
prog.match('Python',pos=1)
The output is as follows:
<_sre.SRE_Match object; span=(1, 2), match='y'>
prog = re.compile(r'thon')
prog.match('Python',pos=2)
The output is as follows:
<_sre.SRE_Match object; span=(2, 6), match='thon'>
prog.match('Marathon',pos=4)
The output is as follows:
<_sre.SRE_Match object; span=(4, 8), match='thon'>
FINDING THE NUMBER OF WORDS IN A LIST THAT END WITH "ING"
Suppose we want to find out if a given string has 'ing' as its last three letters. This kind of query may come up in a text analytics/text mining program where somebody is interested in finding instances of present continuous tense words, which are highly likely to end with 'ing'. However, other nouns may also end with 'ing' (as we will see in this example):
prog = re.compile(r'ing')
words = ['Spring','Cycling','Ringtone']
for w in words:
    if prog.match(w,pos=len(w)-3)!=None:
        print("{} ends with 'ing'".format(w)) # message text illustrative
    else:
        print("{} does not end with 'ing'".format(w)) # message text illustrative
The output is as follows:
Note
It looks plain and simple, and you may well wonder what the purpose of using a special regex module for this is. A simple string method should have been sufficient. Yes, it would have been OK for this particular example, but the whole point of using regex is to be able to use very complex string patterns that are not at all obvious when it comes to how they are written using simple string methods. We will see the real power of regex compared to string methods shortly. But before that, let's explore another of the most commonly used methods, called search.
prog = re.compile('ing')
if prog.match('Spring')==None:
    print("None")
2. The output is as follows:
None
prog.search('Spring')
The output is as follows:
<_sre.SRE_Match object; span=(3, 6), match='ing'>
prog.search('Ringtone')
The output is as follows:
<_sre.SRE_Match object; span=(1, 4), match='ing'>
prog = re.compile(r'ing')
words = ['Spring','Cycling','Ringtone']
for w in words:
    mt = prog.search(w)
    start_pos = mt.span()[0] # Starting position of the match
    end_pos = mt.span()[1] # Ending position of the match
    print("'{}' contains 'ing' at positions {} to {}".format(w, start_pos, end_pos)) # message text illustrative
The output is as follows:
prog = re.compile(r'py.')
print(prog.search('pygmy').group())
print(prog.search('Jupyter').group())
The output is as follows:
pyg
pyt
prog = re.compile(r'c\wm')
print(prog.search('comedy').group())
print(prog.search('camera').group())
print(prog.search('pac_man').group())
print(prog.search('pac2man').group())
The output is as follows:
com
cam
c_m
c2m
3. \W (uppercase W) matches anything not covered by \w:
prog = re.compile(r'4\W1')
print(prog.search('4/1 was a wonderful day!').group())
print(prog.search('4-1 was a wonderful day!').group())
print(prog.search('4.1 was a wonderful day!').group())
print(prog.search('Remember the wonderful day 04/1?').group())
The output is as follows:
4/1
4-1
4.1
4/1
prog = re.compile(r'Data\swrangling')
print(prog.search("Data wrangling is cool").group())
print("-"*80)
print("Data\twrangling is the full string")
print(prog.search("Data\twrangling is the full string").group())
print("-"*80)
print("Data\nwrangling is the full string")
print(prog.search("Data\nwrangling").group())
The output is as follows:
Data wrangling
--------------------------------------------------------------------------------
Data	wrangling is the full string
Data	wrangling
--------------------------------------------------------------------------------
Data
wrangling is the full string
Data
wrangling
prog = re.compile(r"score was \d\d")
print(prog.search("My score was 67").group())
print(prog.search("Your score was 73").group())
The output is as follows:
score was 67
score was 73
def print_match(s):
    if prog.search(s)==None:
        print("No match")
    else:
        print(prog.search(s).group())

prog = re.compile(r'^India') # matches 'India' only at the start of a string (pattern restored; elided in the text)
print_match("Russia implemented this law")
print_match("India implemented that law")
print_match("This law was implemented by India")
The output is as follows:
No match
India
No match
prog = re.compile(r'Apple$')
print_match("Patent no 123456 belongs to Apple")
print_match("Patent no 345672 belongs to Samsung")
print_match("Patent no 987654 belongs to Apple")
The output is as follows:
Apple
No match
Apple
Note:
For these examples and exercises, also try to think how you would implement them without regex, that is, by using simple string methods and any other logic that you can think of. Then, compare that solution to the ones implemented with regex for brevity and efficiency.
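Taking up the note's suggestion, here is a sketch comparing a plain string method with the regex version of the 'ing' check. For one fixed literal suffix the two agree; the regex only pays off once the pattern grows (for example, r'(ing|ed)$' for two suffixes needs no change to the loop):

```python
import re

words = ['Spring', 'Cycling', 'Ringtone']

# Plain string method: enough for a single fixed suffix
simple = [w for w in words if w.endswith('ing')]

# Regex: same result, but the pattern can be made arbitrarily complex
prog = re.compile(r'ing$')
with_regex = [w for w in words if prog.search(w)]

print(simple, with_regex)
```

This is the trade-off the note describes: string methods win on readability for trivial checks, regex wins as soon as the pattern has structure.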
1. Use * to match 0 or more repetitions of the preceding RE:
prog = re.compile(r'ab*')
print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")
The output is as follows:
a
ab
abbb
No match
ab
abb
prog = re.compile(r'ab+')
print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")
The output is as follows:
No match
ab
abbb
No match
ab
abb
prog = re.compile(r'ab?')
print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")
The output is as follows:
a
ab
ab
No match
ab
ab
prog = re.compile(r'<.*>')
print_match('<a> b <c>')
The output is as follows:
<a> b <c>
prog = re.compile(r'<.*?>')
print_match('<a> b <c>')
The output is as follows:
<a>
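The same greedy versus non-greedy contrast shows up with findall, which returns every match instead of just the first; a small sketch on an invented tag-like string:

```python
import re

text = '<a> b <c> d <e>'
print(re.findall(r'<.*>', text))   # greedy: one match spanning the whole string
print(re.findall(r'<.*?>', text))  # non-greedy: each bracketed piece separately
```

This is why non-greedy quantifiers matter so much when scraping markup: the greedy form swallows everything between the first '<' and the last '>'.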
prog = re.compile(r'A{3}')
print_match("ccAAAdd")
print_match("ccAAAAdd")
print_match("ccAAdd")
The output is as follows:
AAA
AAA
No match
prog = re.compile(r'A{2,4}B')
print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")
The output is as follows:
AAAB
No match
AAB
AAAAB
prog = re.compile(r'A{,3}B')
print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")
The output is as follows:
AAAB
AB
AAB
AAAB
4. Omitting n specifies an infinite upper bound:
prog = re.compile(r'A{3,}B')
print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")
The output is as follows:
AAAB
No match
No match
AAAAAAAB
prog = re.compile(r'A{2,4}')
print_match("AAAAAAA")
prog = re.compile(r'A{2,4}?')
print_match("AAAAAAA")
The output is as follows:
AAAA
AA
prog = re.compile(r'[A,B]')
print_match("ccAd")
print_match("ccABd")
print_match("ccXdB")
print_match("ccXdZ")
The output is as follows:
A
A
B
No match
prog = re.compile(r'[a-zA-Z]+@+[a-zA-Z]+\.com')
print_match("My email is [email protected]")
print_match("My email is [email protected]")
The output is as follows:
[email protected]
No match
Look at the regex pattern inside the [ … ]. It is 'a-zA-Z'. This covers all letters, both lowercase and uppercase! With this one simple regex, you are able to match any (pure) alphabetical string for that part of the email. Now, the next pattern is '@', which is added to the previous regex by a '+' character. This is the way to build up a complex regex: by adding/stacking up individual regex patterns. We also use the same [a-zA-Z] for the email domain name and add '.com' at the end to complete the pattern as a valid email address. Why \.? Because, by itself, DOT (.) is used as a special modifier in regex, but here we want to use DOT (.) just as DOT (.), not as a modifier. So, we need to precede it with a '\'.
4. What happened with the second email ID?
prog = re.compile(r'[a-zA-Z0-9]+@+[a-zA-Z]+\.com')
print_match("My email is [email protected]")
print_match("My email is [email protected]")
The output is as follows:
[email protected]
No match
prog = re.compile(r'[a-zA-Z0-9]+@+[a-zA-Z]+\.+[a-zA-Z]{2,3}')
print_match("My email is [email protected]")
print_match("My email is coolguy12[AT]xyz[DOT]org")
The output is as follows:
[email protected]
No match
8. In this regex, we used the fact that most domain identifiers have 2 or 3 characters, so we used [a-zA-Z]{2,3} to capture that.
prog = re.compile(r'[0-9]{10}')
print_match("3124567897")
print_match("312-456-7897")
The output is as follows:
3124567897
No match
So, here, we are trying to extract patterns of 10-digit numbers that could be phone numbers. Note the use of {10} to denote exactly 10-digit numbers in the pattern. But the second number could not be matched, for obvious reasons – it had '-' symbols inserted between groups of numbers.
prog = re.compile(r'[0-9]{10}|[0-9]{3}-[0-9]{3}-[0-9]{4}')
print_match("3124567897")
print_match("312-456-7897")
The output is as follows:
3124567897
312-456-7897
p1 = r'[0-9]{10}'
p2 = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
p3 = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
p4 = r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'
pattern = p1+'|'+p2+'|'+p3+'|'+p4
prog = re.compile(pattern)
print_match("3124567897")
print_match("312-456-7897")
print_match("(312)456-7897")
print_match("312.456.7897")
The output is as follows:
3124567897
312-456-7897
(312)456-7897
312.456.7897
Note that, although we are giving short examples of single sentences in this chapter, you will often deal with a large corpus of text when using a RegEx:
312-Not-a-Number,777.345.2317,
312.331.6789"""
print(ph_numbers)
re.findall('312+[-\.][0-9-\.]+',ph_numbers)
The output is as follows:
312-Not-a-Number,777.345.2317,
312.331.6789
['312-423-3456', '312-5478-9999', '312.331.6789']
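Once extracted, such numbers often need normalizing to a single canonical format. A minimal sketch using re.sub to strip every non-digit character (the list of numbers below is illustrative):

```python
import re

numbers = ['3124567897', '312-456-7897', '(312)456-7897', '312.456.7897']
# \D matches any non-digit, so all four formats collapse to one form
normalized = [re.sub(r'\D', '', n) for n in numbers]
print(normalized)
```

This pairs naturally with findall in a wrangling pipeline: findall pulls candidates out of the corpus, and sub cleans them into one shape for storage.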
These are the steps that will help you solve this activity:
1. Import the necessary libraries, including regex and beautifulsoup.
3. Read the HTML from the URL.
Note
The aims of this activity are as follows:
If a poster of the movie can be found, it downloads the file and saves it at a user-specified location
These are the steps that will help you solve this activity:
1. Import urllib.request, urllib.parse, urllib.error, and json.
10. Test the search_movie function by entering Titanic.
11. Test the search_movie function by entering "Random_error" (obviously, this will not be found, and you should be able to check whether your error-catching code is working properly).
Note:
At the end of this chapter, we went through a detailed exercise of using regex techniques in tricky string-matching problems to scrape useful information from a large and messy text corpus, parsed from HTML. This chapter should come in extremely handy for string and text processing tasks in your data wrangling career. In the next chapter, we will learn about databases with Python.
Chapter 8
RDBMS and SQL
Learning Objectives
By the end of this chapter, you will be able to:
Introduction
This chapter of our data journey is focused on RDBMS (Relational Database Management Systems) and SQL (Structured Query Language). In the previous chapter, we stored and read data from a file. In this chapter, we will read structured data, design access to the data, and create query interfaces for databases.
They are backed by a solid mathematical foundation (relational algebra and calculus) and they expose an efficient and intuitive declarative language – SQL – for easy interaction.
As we can see in the following chart, the market for DBMSes is big. This chart was produced based on market research that was done by Gartner, Inc. in 2016:
HOW IS AN RDBMS STRUCTURED?
The RDBMS structure consists of three main elements, namely the storage engine, query engine, and log management. Here is a diagram that shows the structure of an RDBMS:
Figure 8.2 RDBMS structure
Storage engine: This is the part of the RDBMS that is responsible for storing the data in an efficient way, and also for giving it back, when asked for, in an efficient way. As an end user of the RDBMS (an application developer is considered an end user of an RDBMS), we will never need to interact with this layer directly.
Query engine: This is the part of the RDBMS that allows us to create data objects (tables, views, and so on), manipulate them (create and delete columns, create/delete/update rows, and so on), and query them (read rows) using a simple yet powerful language.
Log management: This part of the RDBMS is responsible for creating and maintaining the logs. If you are wondering why the log is such an important thing, then you should look into how replication and partitions are handled in a modern RDBMS (such as PostgreSQL) using something called the Write-Ahead Log (or WAL for short).
We will focus on the query engine in this chapter.
SQL
Structured Query Language, or SQL (pronounced sequel), as it is commonly known, is a domain-specific language that was originally designed based on E.F. Codd's relational model and is widely used in today's databases to define, insert, manipulate, and retrieve data. It can be further sub-divided into four smaller sub-languages, namely DDL (Data Definition Language), DML (Data Manipulation Language), DQL (Data Query Language), and DCL (Data Control Language). There are several advantages of using SQL, with some of them being as follows:
It is based on a solid mathematical framework and thus it is easy to understand.
Th e mai n ar eas of f oc u s f or t h e f ol l ow i ng t op i c
w i l l b e DDL, DML, and DQL. Th e DCL p ar t i s mor e
f or dat ab ase admi ni st r at or s.
DDL: This is how we define our data structure in SQL. As an RDBMS is mainly designed and built with structured data in mind, we have to tell the RDBMS engine beforehand what our data is going to look like. We can update this definition at a later point in time, but an initial one is a must. This is where we write statements such as CREATE TABLE, DROP TABLE, or ALTER TABLE.
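The three DDL statements can be tried end to end against a throwaway in-memory SQLite database; the `:memory:` target and the toy `user` columns here are our own sketch, not those of the later exercises:

```python
import sqlite3

# Throwaway in-memory database for the sketch (the chapter itself
# uses a file called chapter.db).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# CREATE TABLE: the initial definition of the data structure
cursor.execute("CREATE TABLE user (email text, age integer)")

# ALTER TABLE: update the definition at a later point in time
cursor.execute("ALTER TABLE user ADD COLUMN gender text")

# PRAGMA table_info lists one row per column; index 1 is the name
columns = [row[1] for row in cursor.execute("PRAGMA table_info(user)")]

# DROP TABLE: remove the definition (and its data) entirely
cursor.execute("DROP TABLE user")
tables = list(cursor.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
conn.close()
```

After the ALTER, the table has all three columns; after the DROP, no table is left in the schema.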
Using an RDBMS (MySQL/PostgreSQL/SQLite)
In this topic, we will focus on how to write some basic SQL commands, as well as how to connect to a database from Python and use it effectively within Python. The database we will choose here is SQLite. There are other databases, such as Oracle, MySQL, PostgreSQL, and DB2. The main tricks that you are going to learn here will not change based on what database you are using. But for different databases, you will need to install different third-party Python libraries (such as psycopg2 for PostgreSQL, and so on). The reason they all behave the same way (apart from some small details) is the fact that they all adhere to PEP 249 (commonly known as the Python DB API 2).
import sqlite3
conn = sqlite3.connect("chapter.db")
3. Close the connection, as follows:
conn.close()
with sqlite3.connect("chapter.db") as conn:
    pass
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS user (email text, first_name text, last_name text, address text, age integer, PRIMARY KEY (email))")
    cursor.execute("INSERT INTO user VALUES ('[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31)")
    cursor.execute("INSERT INTO user VALUES ('[email protected]', 'Tom', 'Fake', '456 Fantasy lane, Fantasy City', 39)")
4. Commit to the database:
conn.commit()
This will create the table and write two rows to it with data.
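As a safer variant of the inline INSERT strings above, the DB API also accepts `?` placeholders, which quote values for you. A sketch with an in-memory database and hypothetical example emails (the book's real addresses are redacted):

```python
import sqlite3

# Hypothetical rows; the emails are placeholders, not from the book.
users = [
    ("bob@example.com", "Bob", "Codd", "123 Fantasy lane, Fantasy City", 31),
    ("tom@example.com", "Tom", "Fake", "456 Fantasy lane, Fantasy City", 39),
]

with sqlite3.connect(":memory:") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS user (email text, "
                   "first_name text, last_name text, address text, "
                   "age integer, PRIMARY KEY (email))")
    # executemany binds each tuple to the ? placeholders in turn
    cursor.executemany("INSERT INTO user VALUES (?, ?, ?, ?, ?)", users)
    conn.commit()
    count = cursor.execute("SELECT COUNT(*) FROM user").fetchone()[0]
```

Placeholders avoid manual string formatting of SQL, which matters as soon as values come from outside your program.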
READING DATA FROM A DATABASE IN SQLITE
In the preceding exercise, we created a table and stored data in it. Now, we will learn how to read the data that's stored in this database.
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user')
    for row in rows:
        print(row)
The output is as follows:
The syntax to use the SELECT clause with a LIMIT is as follows:
SELECT * FROM <table_name> LIMIT 5;
Note
This syntax is sample code and will not work in a Jupyter notebook.
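A runnable version of the LIMIT clause, using a throwaway in-memory table with invented rows:

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE user (email text, age integer)")
    # Three made-up rows for the sketch
    cursor.executemany("INSERT INTO user VALUES (?, ?)",
                       [("a@example.com", 31),
                        ("b@example.com", 39),
                        ("c@example.com", 12)])
    # LIMIT caps the number of rows the SELECT returns
    rows = cursor.execute("SELECT * FROM user LIMIT 2").fetchall()
```

Even though the table holds three rows, only two come back.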
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user ORDER BY age DESC')
    for row in rows:
        print(row)
The output is as follows:
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user ORDER BY age')
    for row in rows:
        print(row)
3. The output is as follows:
1. Establish the connection with the database by using the following command:
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
2. Add another column in the user table and fill it with null values by using the following command:
    cursor.execute("ALTER TABLE user ADD COLUMN gender text")
    cursor.execute("UPDATE user SET gender='M'")
    conn.commit()
    rows = cursor.execute('SELECT * FROM user')
    for row in rows:
        print(row)
Figure 8.8: Output after altering the table
The following diagram explains how the GROUP BY clause works:
cursor.execute("INSERT INTO user VALUES ('[email protected]', 'Shelly', 'Milar', '123, Ocean View Lane', 39, 'F')")
rows = cursor.execute("SELECT COUNT(*), gender FROM user GROUP BY gender")
for row in rows:
    print(row)
The output is as follows:
RELATION MAPPING IN DATABASES
We have been working with a single table and altering it, as well as reading back the data. However, the real power of an RDBMS comes from the handling of relationships among different objects (tables). In this section, we are going to create a new table called comments and link it with the user table in a 1:N relationship. This means that one user can have multiple comments. The way we are going to do this is by adding the user table's primary key as a foreign key in the comments table. This will create a 1:N relationship.
When we link two tables, we need to specify to the database engine what should be done if the parent row, which has many children in the other table, is deleted. As we can see in the following diagram, we are asking what happens at the place of the question marks when we delete row 1 of the user table:
Figure 8.11: Illustration of relations
In a non-RDBMS situation, this situation can quickly become difficult and messy to manage and maintain. However, with an RDBMS, all we have to tell the database engine, in very precise ways, is what to do when a situation like this occurs. The database engine will do the rest for us. We use ON DELETE to tell the engine what to do with all the rows of a table when the parent row gets deleted. The following code illustrates these concepts:
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    sql = """
    CREATE TABLE comments (
        user_id text,
        comments text,
        FOREIGN KEY (user_id) REFERENCES user (email)
        ON DELETE CASCADE
    )
    """
    cursor.execute(sql)
    conn.commit()
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    sql = "INSERT INTO comments VALUES ('{}', '{}')"
    rows = cursor.execute("SELECT * FROM user")
    for row in rows:
        email = row[0]
        first_name, last_name = row[1], row[2]
        for i in range(10):
            comment = "This is comment {} by {} {}".format(i, first_name, last_name)
            conn.cursor().execute(sql.format(email, comment))
    conn.commit()
JOINS
In this exercise, we will learn how to exploit the relationship we just built. This means that if we have the primary key from one table, we can recover all the data needed from that table and also all the linked rows from the child table. To achieve this, we will use something called a join.
A join is basically a way to retrieve linked rows from two tables using any kind of primary key-foreign key relation that they have. There are many types of join, such as INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS. They are used in different situations. However, most of the time, in simple 1:N relations, we end up using an INNER join. In Chapter 1, Introduction to Data Wrangling with Python, we learned about sets; we can view an INNER JOIN as an intersection of two sets. The following diagram illustrates the concepts:
Figure 8.12: Intersection Join
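The intersection view of an INNER JOIN can be checked with a tiny in-memory example; the tables and rows here are invented for the sketch and are separate from chapter.db:

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE user (email text PRIMARY KEY)")
    cursor.execute("CREATE TABLE comments (user_id text, comments text)")
    cursor.executemany("INSERT INTO user VALUES (?)",
                       [("bob@example.com",), ("tom@example.com",)])
    # One comment matches a user; the other references nobody in user
    cursor.executemany("INSERT INTO comments VALUES (?, ?)",
                       [("bob@example.com", "hi"),
                        ("ghost@example.com", "orphan row")])
    # INNER JOIN keeps only the rows whose key appears in BOTH tables
    rows = cursor.execute("""
        SELECT user.email, comments.comments FROM comments
        JOIN user ON comments.user_id = user.email
    """).fetchall()
```

The orphan comment falls outside the "intersection" and does not appear in the result.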
In our case, our first table, user, has three entries, with the primary key being the email. We can make use of this in our query to get comments just from Bob:
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    sql = """
    SELECT * FROM comments
    JOIN user ON comments.user_id = user.email
    WHERE user.email='[email protected]'
    """
    rows = cursor.execute(sql)
    for row in rows:
        print(row)
The output is as follows:
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    sql = """
    SELECT comments.* FROM comments
    LEFT JOIN user ON comments.user_id = user.email
    WHERE user.email='[email protected]'
    """
    rows = cursor.execute(sql)
    for row in rows:
        print(row)
1. To delete a row from a table, we use the DELETE clause in SQL. To run delete on the user table, we are going to use the following code:
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    cursor.execute("DELETE FROM user WHERE email='[email protected]'")
    conn.commit()
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    rows = cursor.execute("SELECT * FROM user")
    for row in rows:
        print(row)
Now, moving on to the comments table, we have to remember that we had mentioned ON DELETE CASCADE while creating the table. The database engine knows that if a row is deleted from the parent table (user), all the related rows from the child tables (comments) will have to be deleted.
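The cascade behavior can be verified in isolation with an in-memory database; this is a self-contained sketch independent of chapter.db (note that SQLite requires the foreign_keys pragma to be switched on for it to work):

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")  # cascades need this in SQLite
    cursor.execute("CREATE TABLE user (email text PRIMARY KEY)")
    cursor.execute("""CREATE TABLE comments (
        user_id text,
        comments text,
        FOREIGN KEY (user_id) REFERENCES user (email) ON DELETE CASCADE)""")
    cursor.execute("INSERT INTO user VALUES ('bob@example.com')")
    cursor.execute("INSERT INTO comments VALUES ('bob@example.com', 'hello')")
    # Deleting the parent row should silently delete the child row too
    cursor.execute("DELETE FROM user WHERE email='bob@example.com'")
    remaining = cursor.execute("SELECT * FROM comments").fetchall()
```

After the parent row is deleted, the child comment is gone as well.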
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    rows = cursor.execute("SELECT * FROM comments")
    for row in rows:
        print(row)
The output is as follows:
('[email protected]', 'This is comment 0 by Tom Fake')
('[email protected]', 'This is comment 1 by Tom Fake')
('[email protected]', 'This is comment 2 by Tom Fake')
('[email protected]', 'This is comment 3 by Tom Fake')
('[email protected]', 'This is comment 4 by Tom Fake')
('[email protected]', 'This is comment 5 by Tom Fake')
('[email protected]', 'This is comment 6 by Tom Fake')
('[email protected]', 'This is comment 7 by Tom Fake')
('[email protected]', 'This is comment 8 by Tom Fake')
('[email protected]', 'This is comment 9 by Tom Fake')
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("UPDATE user SET first_name='Chris' WHERE email='[email protected]'")
    conn.commit()
    rows = cursor.execute("SELECT * FROM user")
    for row in rows:
        print(row)
The output is as follows:
import pandas as pd
columns = ["Email", "First Name", "Last Name", "Age", "Gender", "Comments"]
data = []
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = """
    SELECT user.email, user.first_name, user.last_name,
           user.age, user.gender, comments.comments FROM comments
    JOIN user ON comments.user_id = user.email
    WHERE user.email = '[email protected]'
    """
    rows = cursor.execute(sql)
6. Append the rows to the data list:
    for row in rows:
        data.append(row)
df = pd.DataFrame(data, columns=columns)
8. We have created the DataFrame using the data list. You can print the values in the DataFrame using df.head().
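As an aside, pandas can also build such a DataFrame straight from a query with pd.read_sql_query, skipping the manual data and columns lists. A minimal sketch against an invented in-memory table rather than chapter.db:

```python
import sqlite3
import pandas as pd

with sqlite3.connect(":memory:") as conn:
    # Invented one-row table for the sketch
    conn.execute("CREATE TABLE user (email text, age integer)")
    conn.execute("INSERT INTO user VALUES ('bob@example.com', 31)")
    # read_sql_query runs the SQL and returns the result as a DataFrame,
    # taking the column names from the SELECT list
    df = pd.read_sql_query("SELECT email, age FROM user", conn)
```

The column names come from the query itself, so there is no separate columns list to keep in sync.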
We have the pets table:
As we can see, the id column in the persons table (which is an integer) serves as the primary key for that table and as a foreign key for the pet table, which is linked via the owner_id column.
The persons table has the following columns:
city: The city the person is from
The pets table has the following columns:
pet_name: The name of the pet.
pet_type: What type of pet it is, for example, cat, dog, and so on. Due to a lack of further information, we do not know which number represents what, but it is an integer and can be null.
treatment_done: It is also an integer column, and 0 here represents "No", whereas 1 represents "Yes".
These steps will help you complete this activity:
3. Find the age group that has the maximum number of people.
Summary
We have come to the end of the database chapter. We have learned how to connect to SQLite using Python. We have brushed up on the basics of relational databases and learned how to open and close a database. We then learned how to export this relational database into Python DataFrames.
In the next chapter, we will be performing data wrangling on real-world datasets.
Chapter 9
Application of Data Wrangling in Real Life
Learning Objectives
By the end of this chapter, you will be able to:
In this chapter, you will apply your gathered knowledge on real-life datasets and investigate various aspects of them.
Introduction
We learned about databases in the previous chapter, so now it is time to combine the knowledge of data wrangling and Python with a real-world scenario. In the real world, data from one source is often inadequate to perform analysis. Generally, a data wrangler has to distinguish between relevant and non-relevant data and combine data from different sources.
In this topic, we will try to mimic such a typical task flow by downloading and using two different datasets from reputed web portals. Each of the datasets contains partial data pertaining to the key question that is being asked. Let's examine it more closely.
Applying Your Knowledge to a Real-life Data Wrangling Task
Suppose you are asked this question: In India, did the enrollment in primary/secondary/tertiary education increase with the improvement of per capita GDP in the past 15 years? The actual modeling and analysis will be done by a senior data scientist, who will use machine learning and data visualization for analysis. As a data wrangling expert, your job will be to acquire and provide a clean dataset that contains educational enrollment and GDP data side by side.
In this activity, we will examine how to handle these two separate sources and clean the data to prepare a simple final dataset with the required data and save it to the local drive as a SQL database file:
Figure 9.1: Pictorial representation of the merging of education and economic data
Note
Coming up with interesting questions about social, economic, technological, and geo-political topics and then answering them using freely available data and a little bit of programming knowledge is one of the most fun ways to learn about any data science topic. You will get a flavor of that process in this chapter.
Data Imputation
But to do that, we first need to create a DataFrame with missing values in it, that is, we need to append another DataFrame with missing values to the current DataFrame.
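A minimal sketch of that append, with invented column names and pd.concat standing in for the append step:

```python
import numpy as np
import pandas as pd

# An existing DataFrame (toy data for illustration)
df = pd.DataFrame({"country": ["India", "USA"],
                   "value": [100.0, 200.0]})

# A second DataFrame carrying a missing value (np.nan)
missing = pd.DataFrame({"country": ["India"],
                        "value": [np.nan]})

# Append the frame with missing values to the current one
combined = pd.concat([df, missing], ignore_index=True)
n_missing = combined["value"].isnull().sum()
```

The combined frame now carries a NaN that the later imputation steps can fill.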
The UN data is available at https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter09/Activity12-15/SYB61_T07_Education.csv.
Note
If you download the CSV file and open it using Excel, then you will see that the Footnotes column sometimes contains useful notes. We may not want to drop it in the beginning. If we are interested in a particular country's data (like we are in this task), then it may well turn out that Footnotes will be NaN, that is, blank. In that case, we can drop it at the end. But for some countries or regions, it may contain information.
The UN data has missing values. Clean the data to prepare a simple final dataset with the required data and save it to the local drive as a SQL database file.
4. Drop the column region/country/area and source.
7. Check the type of the value column.
1. Create three DataFrames from the original DataFrame using filtering. Create the df_primary, df_secondary, and df_tertiary DataFrames for students enrolled in primary education, secondary education, and tertiary education in thousands, respectively.
2. Plot bar charts of the enrollment of primary students in a low-income country like India and a higher-income country like the USA.
6. Create a DataFrame of missing values (from the preceding dictionary) that we can append.
7. Append the DataFrames together.
11. If there are values that are unfilled, use the limit and limit_direction parameters with the interpolate method to fill them in.
14. To avoid errors, try the error_bad_lines = False option.
15. Since there is no delimiter in the file, add the \t delimiter.
20. Rename the columns properly. This is necessary for merging the two datasets.
3. If we look at the current folder, we should see a file called Education_GDP.db, and if we examine that using a database viewer program, we can see the data transferred there.
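The interpolation hint from step 11 above can be sketched on an invented series; limit caps how many consecutive NaNs get filled and limit_direction says from which side filling may proceed:

```python
import numpy as np
import pandas as pd

# A toy series with missing values at the edges and in the middle
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Linear interpolation; limit=1 fills at most one consecutive NaN,
# and limit_direction="both" allows filling from either end
filled = s.interpolate(limit=1, limit_direction="both")
```

The interior NaN sits halfway between 1.0 and 3.0, so linear interpolation replaces it with 2.0.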
An Extension to Data Wrangling
This is the concluding chapter of our book, where we want to give you a broad overview of some of the exciting technologies and frameworks that you may need to learn beyond data wrangling to work as a full-stack data scientist. Data wrangling is an essential part of the whole data science and analytics pipeline, but it is not the whole enterprise. You have learned invaluable skills and techniques in this book, but it is always good to broaden your horizons and look beyond to see what other tools that are out there can give you an edge in this competitive and ever-changing world.
ADDITIONAL SKILLS REQUIRED TO BECOME A DATA SCIENTIST
To practice as a fully qualified data scientist/analyst, you should have some basic skills in your repertoire, irrespective of the particular programming language you choose to focus on. These skills and know-how are language agnostic and can be utilized with any framework that you have to embrace, depending on your organization and business needs. We describe them in brief here:
Docker and containerization: Since its first release in 2013, Docker has changed the way we distribute and deploy software in server-based applications. It gives you a clean and lightweight abstraction over the underlying OS and lets you iterate fast on development without the headache of creating and maintaining a proper environment. It is very useful in both the development and production phases. With virtually no competitor present, it is fast becoming the default in the industry. We strongly advise you to explore it in great detail.
Fundamental characteristics of big data: Big data is simply data that is very big in size. The term size is a bit ambiguous here. It can mean one static chunk of data (like the detailed census data of a big country like India or the US) or data that is dynamically generated as time passes, and each time it is huge. To give an example of the second category, we can think of how much data is generated by Facebook per day: it's about 500+ terabytes per day. You can easily imagine that we will need specialized tools to deal with that amount of data. There are three different categories of big data, that is, structured, unstructured, and semi-structured. The main features that define big data are Volume, Variety, Velocity, and Variability.
Now, we have, in fact, already laid out a solid groundwork in this book for the data platform part, assuming that it is an integral part of the data wrangling workflow. For example, we have covered web scraping, working with RESTful APIs, and database access and manipulation using Python libraries in detail.
We have also touched on basic visualization techniques and plotting functions in Python using matplotlib. However, there are other advanced statistical plotting libraries, such as Seaborn, that you can master for more sophisticated visualization for data science tasks.
The fruit of the hard work of data wrangling is realized fully in the domain of machine learning. It is the science and engineering of making machines learn patterns and insights from data for predictive analytics and intelligent, automated decision-making with a deluge of data, which cannot be analyzed efficiently by humans. Machine learning has become one of the most sought-after skills in the modern technology landscape. It has truly become one of the most exciting and promising intellectual fields, with applications ranging from e-commerce to healthcare and virtually everything in-between. Data wrangling is intrinsically linked with machine learning as it prepares the data so that it's suitable for intelligent algorithms to process. Even if you start your career in data wrangling, it could be a natural progression to move to machine learning.
If you choose Python as your preferred language for machine learning tasks, you have a great ML library in scikit-learn. It is the most widely used general machine learning package in the Python ecosystem. scikit-learn has a wide variety of supervised and unsupervised learning algorithms, which are exposed via a stable, consistent interface. Moreover, it is specifically designed to interface seamlessly with other popular data wrangling and numerical libraries, such as NumPy and pandas.
Summary
Data is everywhere and it is all around us. In these nine chapters, we have learned about how data from different types and sources can be cleaned, corrected, and combined. Using the power of Python and the knowledge of data wrangling, and applying the tricks and tips that you have studied in this book, you are ready to be a data wrangler.
Appendix
About
This section is included to assist the students to perform the activities in the book. It includes detailed steps that are to be performed by the students to achieve the objectives of the activities.
SOLUTION OF ACTIVITY 1: HANDLING LISTS
These are the steps to complete this activity:
import random
LIMIT = 100
random_number_list = [random.randint(0, LIMIT) for x in range(0, LIMIT)]
4. Print random_number_list:
random_number_list
5. Create a list_with_divisible_by_3 list from random_number_list, which will contain only numbers that are divisible by 3:
list_with_divisible_by_3 = [a for a in random_number_list if a % 3 == 0]
list_with_divisible_by_3
length_of_random_list = len(random_number_list)
length_of_3_divisible_list = len(list_with_divisible_by_3)
difference = length_of_random_list - length_of_3_divisible_list
difference
62
NUMBER_OF_EXPERIMENTS = 10
difference_list = []
for i in range(0, NUMBER_OF_EXPERIMENTS):
    random_number_list = [random.randint(0, LIMIT) for x in range(0, LIMIT)]
    list_with_divisible_by_3 = [a for a in random_number_list if a % 3 == 0]
    length_of_random_list = len(random_number_list)
    length_of_3_divisible_list = len(list_with_divisible_by_3)
    difference = length_of_random_list - length_of_3_divisible_list
    difference_list.append(difference)
difference_list
avg_diff = sum(difference_list) / float(len(difference_list))
avg_diff
66.3
SOLUTION OF ACTIVITY 2: ANALYZE A MULTILINE STRING AND GENERATE THE UNIQUE WORD COUNT
These are the steps to complete this activity:
type(multiline_text)
The output is as follows:
str
len(multiline_text)
The output is as follows:
4475
multiline_text = multiline_text.replace('\n', "")
multiline_text
The output is as follows:
# remove special chars, punctuation etc.
cleaned_multiline_text = ""
for char in multiline_text:
    if char == " ":
        cleaned_multiline_text += char
    elif char.isalnum():  # using the isalnum() method of strings.
        cleaned_multiline_text += char
    else:
        cleaned_multiline_text += " "
6. Check the content of cleaned_multiline_text:
cleaned_multiline_text
The output is as follows:
list_of_words = cleaned_multiline_text.split()
list_of_words
The output is as follows:
len(list_of_words)
The output is 852.
unique_words_as_dict = dict.fromkeys(list_of_words)
len(list(unique_words_as_dict.keys()))
The output is 340.
for word in list_of_words:
    if unique_words_as_dict[word] is None:
        unique_words_as_dict[word] = 1
    else:
        unique_words_as_dict[word] += 1
unique_words_as_dict
The output is as follows:
11. Find the top 25 words from unique_words_as_dict:
top_words = sorted(unique_words_as_dict.items(), key=lambda key_val_tuple: key_val_tuple[1], reverse=True)
top_words[:25]
SOLUTION OF ACTIVITY 3: PERMUTATION, ITERATOR, LAMBDA, LIST
These are the steps to solve this activity:
1. Look up the definition of permutations and dropwhile from itertools. There is a way to look up the definition of a function inside Jupyter itself. Just type the function name, followed by ?, and press Shift + Enter:
permutations?
dropwhile?
permutations(range(3))
The output is as follows:
<itertools.permutations at 0x7f6c6c077af0>
for number_tuple in permutations(range(3)):
    print(number_tuple)
    assert isinstance(number_tuple, tuple)
The output is as follows:
(0, 1, 2)
(0, 2, 1)
(1, 0, 2)
(1, 2, 0)
(2, 0, 1)
(2, 1, 0)
for number_tuple in permutations(range(3)):
    print(list(dropwhile(lambda x: x <= 0, number_tuple)))
The output is as follows:
[1, 2]
[2, 1]
[1, 0, 2]
[1, 2, 0]
[2, 0, 1]
[2, 1, 0]
import math
def convert_to_number(number_stack):
    final_number = 0
    for i in range(0, len(number_stack)):
        final_number += (number_stack.pop() * (math.pow(10, i)))
    return final_number
for number_tuple in permutations(range(3)):
    number_stack = list(dropwhile(lambda x: x <= 0, number_tuple))
    print(convert_to_number(number_stack))
The output is as follows:
12.0
21.0
102.0
120.0
201.0
210.0
SOLUTION OF ACTIVITY 4: DESIGN YOUR OWN CSV PARSER
These are the steps to complete this activity:
1. Import zip_longest from itertools:
from itertools import zip_longest
2. Define the return_dict_from_csv_line function so that it contains header, line, and fillvalue as None, and add it to a dict:
def return_dict_from_csv_line(header, line):
    # Zip them
    zipped_line = zip_longest(header, line, fillvalue=None)
    # Use dict comprehension to generate the final dict
    ret_dict = {kv[0]: kv[1] for kv in zipped_line}
    return ret_dict
first_line = fd.readline()
header = first_line.replace("\n", "").split(",")
for i, line in enumerate(fd):
    line = line.replace("\n", "").split(",")
    d = return_dict_from_csv_line(header, line)
    print(d)
    if i > 10:
        break
The output is as follows:
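The zip_longest idea behind the parser can be checked on a made-up header/line pair, where the short line gets padded with None:

```python
from itertools import zip_longest

# Hypothetical header and a line that is one field short
header = ["name", "city", "age"]
line = ["Alice", "Paris"]

# fillvalue=None pads the shorter sequence so no header key is lost
zipped = zip_longest(header, line, fillvalue=None)
row = {kv[0]: kv[1] for kv in zipped}
```

The missing trailing field surfaces as an explicit None rather than silently dropping the "age" key.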
SOLUTION OF ACTIVITY 5: GENERATING STATISTICS FROM A CSV FILE
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("Boston_housing.csv")
df.head(10)
The output is as follows:
Figure 3.23: Output displaying the first 10 records
df.shape
The output is as follows:
(506, 14)
df1=df[['CRIM','ZN','INDUS','RM','AGE','DIS','RAD','TAX','PTRATIO','PRICE']]
df1.tail(7)
The output is as follows:
for c in df1.columns:
    plt.title("Plot of "+c,fontsize=15)
    plt.hist(df1[c],bins=20)
    plt.show()
The output is as follows:
Figure 3.25: Plot of all variables using a for loop
8. Crime rate could be an indicator of house price (people don't want to live in high-crime areas). Create a scatter plot of crime rate versus price:
plt.scatter(df1['CRIM'],df1['PRICE'])
plt.show()
The output is as follows:
9. Create a plot of log10(crime) versus price:
plt.scatter(np.log10(df1['CRIM']),df1['PRICE'],c='red')
plt.title("Crime rate (Log) vs. Price plot", fontsize=18)
plt.xlabel("Log of Crime rate",fontsize=15)
plt.ylabel("Price",fontsize=15)
plt.grid(True)
plt.show()
The output is as follows:
df1['RM'].mean()
The output is 6.284634387351788.
df1['AGE'].median()
The output is 77.5.
df1['DIS'].mean()
The output is 3.795042687747034.
low_price=df1['PRICE']<20
# This creates a Boolean array of True, False
print(low_price)
# True = 1, False = 0, so now if you take an average of this NumPy array, you will know how many 1's are there.
pcnt=low_price.mean()*100
print("\nPercentage of house with <20,000 price is: ",pcnt)
The output is as follows:
0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8       True
9       True
10      True
500     True
501    False
502    False
503    False
504    False
505     True
Percentage of house with <20,000 price is:  41.50197628458498
SOLUTION OF ACTIVITY 6: WORKING WITH THE ADULT INCOME DATASET (UCI)
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("adult_income_data.csv")
df.head()
The output is as follows:
Figure 4.61: DataFrame displaying the first five records from the .csv file
names = []
with open('adult_income_names.txt','r') as f:
    for line in f:
        f.readline()
        var=line.split(":")[0]
        names.append(var)
names
The output is as follows:
names.append('Income')
df = pd.read_csv("adult_income_data.csv",names=names)
df.head()
The output is as follows:
df.describe()
The output is as follows:
Figure 4.64: Statistical summary of the dataset
vars_class = ['workclass','education','marital-status','occupation','relationship','sex','native-country']
for v in vars_class:
    classes=df[v].unique()
    num_classes = df[v].nunique()
    print("There are {} classes in the \"{}\" column. They are: {}".format(num_classes,v,classes))
    print("-"*100)
The output is as follows:
df.isnull().sum()
The output is as follows:
Figure 4.66: Finding the missing values
df_subset = df[['age','education','occupation']]
df_subset.head()
The output is as follows:
df_subset['age'].hist(bins=20)
The output is as follows:
<matplotlib.axes._subplots.AxesSubplot at 0x19dea8d0>
Figure 4.68: Histogram of age with a bin size of 20
df_subset.boxplot(column='age',by='education',figsize=(25,10))
plt.xticks(fontsize=15)
plt.xlabel("Education",fontsize=20)
plt.show()
The output is as follows:
Figure 4.69: Boxplot of age grouped by education
def strip_whitespace(s):
    return s.strip()
# Education column
df_subset['education_stripped']=df['education'].apply(strip_whitespace)
df_subset['education']=df_subset['education_stripped']
df_subset.drop(labels=['education_stripped'],axis=1,inplace=True)
# Occupation column
df_subset['occupation_stripped']=df['occupation'].apply(strip_whitespace)
df_subset['occupation']=df_subset['occupation_stripped']
df_subset.drop(labels=['occupation_stripped'],axis=1,inplace=True)
# Conditional clauses and join them by & (AND)
df_filtered=df_subset[(df_subset['age']>=30) & (df_subset['age']<=50)]
Check the contents of the new dataset:
df_filtered.head()
The output is as follows:
Figure 4.71: Contents of new DataFrame
answer_1=df_filtered.shape[0]
answer_1
The output is as follows:
1630
print("There are {} people of age between 30 and 50 in this dataset.".format(answer_1))
The output is as follows:
18. Group the records based on occupation to find how the mean age is distributed:
df_subset.groupby('occupation').describe()['age']
The output is as follows:
Figure 4.73: DataFrame showing summary statistics of age
occupation_stats= df_subset.groupby('occupation').describe()['age']
plt.figure(figsize=(15,8))
plt.barh(y=occupation_stats.index, width=occupation_stats['count'])
plt.yticks(fontsize=13)
plt.show()
The output is as follows:
Figure 4.74: Bar chart displaying occupation statistics
22. Practice merging by common keys. Suppose you are given two datasets where the common key is occupation. First, create two such disjoint datasets by taking random samples from the full dataset and then try merging. Include at least two other columns, along with the common key column for each dataset. Notice how the resulting dataset, after merging, may have more data points than either of the two starting datasets if your common key is not unique:
df_1 = df[['age','workclass','occupation']].sample(5,random_state=101)
df_1.head()
The output is as follows:
df_2 = df[['education','occupation']].sample(5,random_state=101)
df_2.head()
The output is as follows:
df_merged = pd.merge(df_1,df_2,on='occupation',how='inner').drop_duplicates()
df_merged
The output is as follows:
Figure 4.77: Output of distinct occupation values
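The row-multiplying effect of a non-unique merge key can be reproduced on two tiny made-up frames:

```python
import pandas as pd

# Two small frames sharing a non-unique 'occupation' key (values are illustrative)
left = pd.DataFrame({'occupation': ['tech', 'tech', 'sales'], 'age': [25, 35, 45]})
right = pd.DataFrame({'occupation': ['tech', 'tech'], 'education': ['BS', 'MS']})

# Each 'tech' row on the left pairs with each 'tech' row on the right: 2 x 2 = 4 rows
merged = pd.merge(left, right, on='occupation', how='inner')
print(len(merged))  # 4
```

This is why the solution chains drop_duplicates() after the merge.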
SOLUTION OF ACTIVITY 7: READING TABULAR DATA FROM A WEB PAGE AND CREATING DATAFRAMES
These are the steps to complete this activity:
import pandas as pd
from bs4 import BeautifulSoup

fd = open("List of countries by GDP (nominal) - Wikipedia.htm", "r")
soup = BeautifulSoup(fd)
fd.close()
all_tables = soup.find_all("table")
print("Total number of tables are {} ".format(len(all_tables)))
There are 9 tables in total.
4. Find the right table using the class attribute by using the following command:
data_table = soup.find("table", {"class": "wikitable"})
print(type(data_table))
The output is as follows:
<class 'bs4.element.Tag'>
sources_list = data_table.tbody.findAll('tr', recursive=False)[0].findAll('td', recursive=False)
print(len(sources_list))
The output is as follows:
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
8. Find the length of data_tables by using the following command:
len(data_tables)
The output is as follows:
source_names = [source.findAll('a')[0].getText() for source in sources_list]
print(source_names)
The output is as follows:
['International Monetary Fund', 'World Bank', 'United Nations']
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
11. Find the rows from data_tables using findAll:
rows1 = data_tables[0][0].findAll('tbody')[0].findAll('tr')[1:]
12. Find the data from rows1 using the strip function for each td tag:
data_rows1 = [[td.get_text().strip() for td in tr.findAll('td')] for tr in rows1]
13. Create the DataFrame:
df1 = pd.DataFrame(data_rows1, columns=header1)
df1.head()
The output is as follows:
header2 = [th.getText().strip() for th in data_tables[1][0].findAll('thead')[0].findAll('th')]
header2
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
15. Find the rows from data_tables using findAll by using the following command:
rows2 = data_tables[1][0].findAll('tbody')[0].findAll('tr')[1:]
16. Define find_right_text using the strip function by using the following command:
def find_right_text(i, td):
    if i == 0:
        return td.getText().strip()
    elif i == 1:
        return td.getText().strip()
    else:
        index = td.text.find("♠")
        return td.text[index+1:].strip()
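The else branch handles cells where the page embeds a hidden sort key terminated by "♠" before the visible figure; the slicing logic can be checked on a plain string (the cell text below is illustrative):

```python
# Keep only the text after the hidden sort-key marker
cell_text = "0001♠19,390,604"
index = cell_text.find("♠")
cleaned = cell_text[index + 1:].strip()
print(cleaned)  # 19,390,604
```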
17. Find the rows from data_rows using find_right_text by using the following command:
data_rows2 = [[find_right_text(i, td) for i, td in enumerate(tr.findAll('td'))] for tr in rows2]
df2 = pd.DataFrame(data_rows2, columns=header2)
df2.head()
The output is as follows:
Figure 5.36: Output of the DataFrame
header3 = [th.getText().strip() for th in data_tables[2][0].findAll('thead')[0].findAll('th')]
header3
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
20. Find the rows from data_tables using findAll by using the following command:
rows3 = data_tables[2][0].findAll('tbody')[0].findAll('tr')[1:]
21. Find the rows from data_rows3 by using find_right_text:
data_rows3 = [[find_right_text(i, td) for i, td in enumerate(tr.findAll('td'))] for tr in rows3]
df3 = pd.DataFrame(data_rows3, columns=header3)
df3.head()
The output is as follows:
SOLUTION OF ACTIVITY 8: HANDLING OUTLIERS AND MISSING DATA
These are the steps to complete this activity:
1. Load the data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv("visit_data.csv")
3. Print the data from the DataFrame:
df.head()
The output is as follows:
Figure 6.10: The contents of the CSV file
print("First name is duplicated - {}".format(any(df.first_name.duplicated())))
print("Last name is duplicated - {}".format(any(df.last_name.duplicated())))
print("Email is duplicated - {}".format(any(df.email.duplicated())))
The output is as follows:
First name is duplicated - True
Last name is duplicated - True
Email is duplicated - False
There are duplicates in both the first and last names, which is normal. However, as we can see, there are no duplicates in email. That's good.
print("The column Email contains NaN - %r " % df.email.isnull().values.any())
print("The column IP Address contains NaN - %s " % df.ip_address.isnull().values.any())
print("The column Visit contains NaN - %s " % df.visit.isnull().values.any())
The output is as follows:
6. Get rid of the outliers:
size_prev = df.shape
df = df[np.isfinite(df['visit'])] # This is an inplace operation. After this operation the original DataFrame is lost.
size_after = df.shape

# Notice how parameterized format is used and then the indexing is working inside the quote marks
print("The size of previous data was - {prev[0]} rows and the size of the new one is - {after[0]} rows".format(prev=size_prev, after=size_after))
The output is as follows:
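The np.isfinite filter works because it returns False for NaN (and infinities); a minimal sketch with a made-up visit column:

```python
import numpy as np
import pandas as pd

# Small frame where 'visit' has missing values
demo = pd.DataFrame({'visit': [10.0, np.nan, 25.0, np.nan, 7.0]})

# Boolean indexing with np.isfinite drops the NaN rows
cleaned = demo[np.isfinite(demo['visit'])]
print(demo.shape[0], '->', cleaned.shape[0])  # 5 -> 3
```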
plt.boxplot(df.visit, notch=True)
The output is as follows:
{'whiskers': [<matplotlib.lines.Line2D at 0x7fa04cc08668>, <matplotlib.lines.Line2D at 0x7fa04cc08b00>],
 'caps': [<matplotlib.lines.Line2D at 0x7fa04cc08f28>, <matplotlib.lines.Line2D at 0x7fa04cc11390>],
 'boxes': [<matplotlib.lines.Line2D at 0x7fa04cc08518>],
 'medians': [<matplotlib.lines.Line2D at 0x7fa04cc117b8>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fa04cc11be0>],
 'means': []}
The boxplot is as follows:
print("After getting rid of outliers the new size of the data is - {}".format(*df.shape))
After getting rid of the outliers, the new size of the data is 923.
SOLUTION OF ACTIVITY 9: EXTRACTING THE TOP 100 EBOOKS FROM GUTENBERG
These are the steps to complete this activity:
1. Import the necessary libraries, including regex and beautifulsoup:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
3. Read the HTML from the URL:
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1
5. Check the status of response:
status_check(response)
The output is as follows:
Success!
contents = response.content.decode(response.encoding)
soup = BeautifulSoup(contents, 'html.parser')
lst_links=[]
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))
lst_links[:30]
The output is as follows:
['/wiki/Main_Page',
 '/catalog/',
 '/ebooks/',
 '/browse/recent/last1',
 '/browse/scores/top',
 '/wiki/Gutenberg:Offline_Catalogs',
 '/catalog/world/mybookmarks',
 '/wiki/Main_Page',
 'https://www.paypal.com/xclick/business=donate%40gutenberg.org&item_name=Donation+to+Project+Gutenberg',
 '/wiki/Gutenberg:Project_Gutenberg_Needs_Your_Donation',
 'http://www.ibiblio.org',
 'http://www.pgdp.net/',
 'pretty-pictures',
 '#books-last1',
 '#authors-last1',
 '#books-last7',
 '#authors-last7',
 '#books-last30',
 '#authors-last30',
 '/ebooks/1342',
 '/ebooks/84',
 '/ebooks/1080',
 '/ebooks/46',
 '/ebooks/219',
 '/ebooks/2542',
 '/ebooks/98',
 '/ebooks/345',
 '/ebooks/2701',
 '/ebooks/844',
 '/ebooks/11']
booknum=[]
10. Numbers 19 to 118 in the original list of links have the top 100 eBooks' numbers. Loop over the appropriate range and use a regex to find the numeric digits in the link (href) string. Use the findall() method:
for i in range(19,119):
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast as an integer
        booknum.append(int(n[0]))

print(booknum)
The output is as follows:
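Why the loop is restricted to indices 19 to 118 and guarded by len(n)==1 can be seen on a few sample hrefs (the values below are illustrative):

```python
import re

# Sample hrefs in the shape seen in lst_links
links = ['/ebooks/1342', '/ebooks/84', '#books-last30', '/browse/scores/top']

booknum_demo = []
for link in links:
    n = re.findall('[0-9]+', link.strip())
    # Anchor links such as '#books-last30' also contain one digit run,
    # which is why the solution only scans the /ebooks/ slice of the list
    if len(n) == 1:
        booknum_demo.append(int(n[0]))
print(booknum_demo)  # [1342, 84, 30]
```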
print(soup.text[:2000])
if (top != self) {
    top.location.replace (http://www.gutenberg.org);
    alert ('Project Gutenberg is a FREE service with NO membership required. If you paid somebody else to get here, make them give you your money back!');
The output is as follows:
Book Search
-- Recent Books
-- Top 100
-- Offline Catalogs
-- My Bookmarks
Main Page
Pretty Pictures
A Modest Proposal by Jonathan Swift (1020)
A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens (953)
Heart of Darkness by Joseph Conrad (887)
Et dukkehjem. English by Henrik Ibsen (761)
The Importance of Being Earnest: A Trivial Comedy for Serious People by Oscar Wilde (646)
Alice's Adventures in Wonderland by Lewis Carrol
lst_titles_temp=[]
start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')
for i in range(100):
    lst_titles_temp.append(soup.text.splitlines()[start_idx+2+i])

lst_titles=[]
for i in range(100):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])

for l in lst_titles:
    print(l)
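The regex slice keeps only the leading letters and spaces of each line, which is how a title loses its author suffix punctuation and download count; a one-line check (the line below is illustrative):

```python
import re

# A title line as it appears in soup.text (the count in parentheses is illustrative)
line = "A Modest Proposal by Jonathan Swift (1020)"

# '^[a-zA-Z ]*' matches from the start up to the first non-letter, non-space character
id1, id2 = re.match('^[a-zA-Z ]*', line).span()
print(repr(line[id1:id2]))  # 'A Modest Proposal by Jonathan Swift '
```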
The output is as follows:
Frankenstein
A Modest Proposal by Jonathan Swift
A Christmas Carol in Prose
Heart of Darkness by Joseph Conrad
Et dukkehjem
Moby Dick
The Importance of Being Earnest
Alice
Metamorphosis by Franz Kafka
Beowulf
An Occurrence at Owl Creek Bridge by Ambrose Bierce
Democracy in America
Songs of Innocence
The Confessions of St
Persuasion by Jane Austen
1. Import urllib.request, urllib.parse, urllib.error, and json:
import urllib.request, urllib.parse, urllib.error
import json
Note
3. The students/users/instructors will need to obtain a key and store it in a JSON file. We are calling this file APIkeys.json.
with open('APIkeys.json') as f:
    keys = json.load(f)
    omdbapi = keys['OMDBapi']
serviceurl = 'http://www.omdbapi.com/?'
apikey = '&apikey='+omdbapi
def print_json(json_data):
    list_keys=['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer',
               'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Ratings', 'Metascore',
               'imdbRating', 'imdbVotes', 'imdbID']
    print("-"*50)
    for k in list_keys:
        if k in list(json_data.keys()):
            print(f"{k}: {json_data[k]}")
    print("-"*50)
def save_poster(json_data):
    import os
    title = json_data['Title']
    poster_url = json_data['Poster']
    poster_file_extension=poster_url.split('.')[-1]
    poster_data = urllib.request.urlopen(poster_url).read()
    savelocation=os.getcwd()+'\\'+'Posters'+'\\'
    # Creates a new directory if the directory does not exist. Otherwise, just use the existing path.
    if not os.path.isdir(savelocation):
        os.mkdir(savelocation)
    filename=savelocation+str(title)+'.'+poster_file_extension
    f=open(filename,'wb')
    f.write(poster_data)
    f.close()
def search_movie(title):
    try:
        url = serviceurl + urllib.parse.urlencode({'t': str(title)})+apikey
        print(f'Retrieving the data of "{title}" now... ')
        print(url)
        uh = urllib.request.urlopen(url)
        data = uh.read()
        json_data=json.loads(data)
        if json_data['Response']=='True':
            print_json(json_data)
            save_poster(json_data)
        else:
            print("Error encountered: ",json_data['Error'])
    except urllib.error.URLError as e:
        print(f"ERROR: {e.reason}")
10. Test the search_movie function by entering Titanic:
search_movie("Titanic")
http://www.omdbapi.com/?t=Titanic&apikey=17cdc959
--------------------------------------------------
Title: Titanic
Year: 1997
Rated: PG-13
Director: James Cameron
Actors: Leonardo DiCaprio, Kate Winslet, Billy Zane, Kathy Bates
Plot: A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.
Language: English, Swedish
Country: USA
Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '89%'}, {'Source': 'Metacritic', 'Value': '75/100'}]
Metascore: 75
imdbRating: 7.8
imdbVotes: 913,780
imdbID: tt0120338
--------------------------------------------------
11. Test the search_movie function by entering "Random_error" (obviously, this will not be found, and you should be able to check whether your error-catching code is working properly):
search_movie("Random_error")
http://www.omdbapi.com/?t=Random_error&apikey=17cdc959
Error encountered: Movie not found!
1. Connect to the supplied petsDB database:
import sqlite3
conn = sqlite3.connect("petsdb")

# a tiny function to make sure the connection is successful
def is_opened(conn):
    try:
        conn.execute("SELECT * FROM persons LIMIT 1")
        return True
    except sqlite3.ProgrammingError as e:
        print("Connection closed {}".format(e))
        return False

print(is_opened(conn))
The output is as follows:
True
3. Close the connection:
conn.close()
4. Check whether the connection is open or closed:
print(is_opened(conn))
The output is as follows:
False
conn = sqlite3.connect("petsdb")
c = conn.cursor()
for ppl, age in c.execute("SELECT count(*), age FROM persons GROUP BY age"):
    print("We have {} people aged {}".format(ppl, age))
The output is as follows:
Figure 8.17: Section of output grouped by age
for ppl, age in c.execute("SELECT count(*), age FROM persons GROUP BY age ORDER BY count(*) DESC"):
    print("Highest number of people is {} and came from {} age group".format(ppl, age))
    break
The output is as follows:
Highest number of people is 5 and came from 73 age group
8. To find out how many people do not have a full name (the last name is blank/null), execute the following command:
res = c.execute("SELECT count(*) FROM persons WHERE last_name IS null")
for row in res:
    print(row)
The output is as follows:
(60,)
res = c.execute("SELECT count(*) FROM (SELECT count(owner_id) FROM pets GROUP BY owner_id HAVING count(owner_id)>1)")
for row in res:
    print(row)
The output is as follows:
(36,)
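The nested count query can be exercised against a tiny in-memory table (the owner IDs below are made up):

```python
import sqlite3

# In-memory stand-in for petsdb
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE pets (owner_id INT)")
conn_demo.executemany("INSERT INTO pets VALUES (?)",
                      [(1,), (1,), (2,), (3,), (3,), (3,)])

# Inner query: pets per owner; outer query: owners with more than one pet
res_demo = conn_demo.execute(
    "SELECT count(*) FROM (SELECT count(owner_id) FROM pets "
    "GROUP BY owner_id HAVING count(owner_id)>1)")
more_than_one = res_demo.fetchone()[0]
print(more_than_one)  # 2 (owners 1 and 3)
```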
res = c.execute("SELECT count(*) FROM pets WHERE treatment_done=1 AND pet_type IS NOT null")
for row in res:
    print(row)
The output is as follows:
(16,)
res = c.execute("SELECT count(*) FROM pets JOIN persons ON pets.owner_id = persons.id WHERE persons.city='east port'")
for row in res:
    print(row)
The output is as follows:
(49,)
res = c.execute("SELECT count(*) FROM pets JOIN persons ON pets.owner_id = persons.id WHERE persons.city='east port' AND pets.treatment_done=1")
for row in res:
    print(row)
The output is as follows:
(11,)
1. Import the required libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
education_data_link="http://data.un.org/_Docs/SYB/CSV/SYB61_T07_Education.csv"
df1 = pd.read_csv(education_data_link)
3. Print the data in the DataFrame:
df1.head()
The output is as follows:
Figure 9.3: DataFrame from the UN data
df1 = pd.read_csv(education_data_link,skiprows=1)
5. Print the data in the DataFrame:
df1.head()
The output is as follows:
df2 = df1.drop(['Region/Country/Area','Source'],axis=1)
df2.columns=['Region/Country/Area','Year','Data','Enrollments (Thousands)','Footnotes']
8. Print the data in the DataFrame:
df2.head()
The output is as follows:
df2['Footnotes'].unique()
The output is as follows:
Figure 9.6: Unique values of the Footnotes column
type(df2['Enrollments (Thousands)'][0])
The output is as follows:
str
def to_numeric(val):
    """
    Converts a given string (with one or more commas) to a numeric value
    """
    if ',' not in str(val):
        result = float(val)
    else:
        val=str(val)
        val=''.join(str(val).split(','))
        result=float(val)
    return result

df2['Enrollments (Thousands)']=df2['Enrollments (Thousands)'].apply(to_numeric)
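The comma-handling logic of to_numeric can be verified on a few representative strings (a standalone, minimal copy of the function is used for the check):

```python
# Minimal copy of to_numeric for a standalone check
def to_numeric(val):
    if ',' not in str(val):
        return float(val)
    return float(''.join(str(val).split(',')))

print(to_numeric('1,456'))     # 1456.0
print(to_numeric('12,345.6'))  # 12345.6
print(to_numeric('789'))       # 789.0
```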
13. Print the unique types of data in the Data column:
df2['Data'].unique()
The output is as follows:
14. Create three DataFrames by filtering and selecting them from the original DataFrame:
    1. df_primary: Only students enrolled in primary education (thousands)
    2. df_secondary: Only students enrolled in secondary education (thousands)
    3. df_tertiary: Only students enrolled in tertiary education (thousands)
df_primary = df2[df2['Data']=='Students enrolled in primary education (thousands)']
df_secondary = df2[df2['Data']=='Students enrolled in secondary education (thousands)']
df_tertiary = df2[df2['Data']=='Students enrolled in tertiary education (thousands)']
primary_enrollment_india = df_primary[df_primary['Region/Country/Area']=='India']
primary_enrollment_USA = df_primary[df_primary['Region/Country/Area']=='United States of America']
16. Print the primary_enrollment_india data:
primary_enrollment_india
The output is as follows:
Figure 9.8: Data for the enrollment in primary education in India
17. Print the primary_enrollment_USA data:
primary_enrollment_USA
The output is as follows:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_india['Year'],primary_enrollment_india['Enrollments (Thousands)'])
plt.title("Enrollment in primary education\nin India (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_USA['Year'],primary_enrollment_USA['Enrollments (Thousands)'])
plt.title("Enrollment in primary education\nin the United States of America (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
missing_years = [y for y in range(2004,2010)]+[y for y in range(2011,2014)]
21. Print the value in the missing_years variable:
missing_years
The output is as follows:
dict_missing = {'Region/Country/Area':['India']*9,'Year':missing_years,
                'Data':['Students enrolled in primary education (thousands)']*9,
                'Enrollments (Thousands)':[np.nan]*9,'Footnotes':[np.nan]*9}
23. Create a DataFrame of missing values (from the preceding dictionary) that we can append:
df_missing = pd.DataFrame(data=dict_missing)
primary_enrollment_india=primary_enrollment_india.append(df_missing,ignore_index=True,sort=True)
25. Print the data in primary_enrollment_india:
primary_enrollment_india
The output is as follows:
primary_enrollment_india.sort_values(by='Year',inplace=True)
primary_enrollment_india.reset_index(inplace=True,drop=True)
27. Print the data in primary_enrollment_india:
primary_enrollment_india
The output is as follows:
primary_enrollment_india.interpolate(inplace=True)
29. Print the data in primary_enrollment_india:
primary_enrollment_india
The output is as follows:
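What interpolate() does to the NaN rows just appended can be seen on a short made-up series:

```python
import numpy as np
import pandas as pd

# Enrollment-like series with two missing interior values
s = pd.Series([100.0, np.nan, np.nan, 160.0])

# The default linear interpolation fills NaNs evenly between known neighbours
filled = s.interpolate()
print(filled.tolist())  # [100.0, 120.0, 140.0, 160.0]
```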
30. Plot the data:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_india['Year'],primary_enrollment_india['Enrollments (Thousands)'])
plt.title("Enrollment in primary education\nin India (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
missing_years = [2004]+[y for y in range(2006,2010)]+[y for y in range(2011,2014)]+[2016]
32. Print the value in missing_years:
missing_years
The output is as follows:
33. Create dict_missing, as follows:
dict_missing = {'Region/Country/Area':['United States of America']*9,'Year':missing_years,
                'Data':['Students enrolled in primary education (thousands)']*9,
                'Enrollments (Thousands)':[np.nan]*9,'Footnotes':[np.nan]*9}
df_missing = pd.DataFrame(data=dict_missing)
35. Append this to the primary_enrollment_USA variable, as follows:
primary_enrollment_USA=primary_enrollment_USA.append(df_missing,ignore_index=True,sort=True)
36. Sort the values in the primary_enrollment_USA variable, as follows:
primary_enrollment_USA.sort_values(by='Year',inplace=True)
37. Reset the index of the primary_enrollment_USA variable, as follows:
primary_enrollment_USA.reset_index(inplace=True,drop=True)
38. Interpolate the primary_enrollment_USA variable, as follows:
primary_enrollment_USA.interpolate(inplace=True)
39. Print the primary_enrollment_USA variable:
primary_enrollment_USA
The output is as follows:
primary_enrollment_USA.interpolate(method='linear',limit_direction='backward',limit=1)
The output is as follows:
41. Print the data in primary_enrollment_USA:
primary_enrollment_USA
The output is as follows:
Figure 9.18: Data for the enrollment in primary education in USA
42. Plot the data:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_USA['Year'],primary_enrollment_USA['Enrollments (Thousands)'])
plt.title("Enrollment in primary education\nin the United States of America (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
df3=pd.read_csv("India_World_Bank_Info.csv")
The output is as follows:
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-45-9239cae67df7> in <module>()
…..
ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3
We can try and use the error_bad_lines=False option in this kind of situation.
df3=pd.read_csv("India_World_Bank_Info.csv",error_bad_lines=False)
df3.head(10)
The output is as follows:
Figure 9.20: DataFrame from the India World Bank Information
Note:
df3=pd.read_csv("India_World_Bank_Info.csv",error_bad_lines=False,delimiter='\t')
df3.head(10)
The output is as follows:
Figure 9.21: DataFrame from the India World Bank Information after using a delimiter
df3=pd.read_csv("India_World_Bank_Info.csv",error_bad_lines=False,delimiter='\t',skiprows=4)
df3.head(10)
The output is as follows:
df4=df3[df3['Indicator Name']=='GDP per capita (current US$)'].T
df4.head(10)
The output is as follows:
6. There is no index, so let's use reset_index again:
df4.reset_index(inplace=True)
df4.head(10)
The output is as follows:
df4.drop([0,1,2],inplace=True)
df4.reset_index(inplace=True,drop=True)
df4.head(10)
The output is as follows:
df4.columns=['Year','GDP']
df4.head(10)
The output is as follows:
9. It looks like we have GDP data from 1960 onward. But we are interested in 2003-2016. Let's examine the last 20 rows:
df4.tail(20)
The output is as follows:
Figure 9.27: DataFrame from the India World Bank Information
df_gdp=df4.iloc[[i for i in range(43,57)]]
df_gdp
The output is as follows:
Figure 9.28: DataFrame from the India World Bank Information
df_gdp.reset_index(inplace=True,drop=True)
df_gdp
The output is as follows:
Figure 9.29: DataFrame from the India World Bank Information
12. The year in this DataFrame is not of the int type. So, it will have problems merging with the education DataFrame:
df_gdp['Year']
The output is as follows:
df_gdp['Year']=df_gdp['Year'].apply(int)
1. Now, merge the two DataFrames, that is, primary_enrollment_india and df_gdp, on the Year column:
primary_enrollment_with_gdp=primary_enrollment_india.merge(df_gdp,on='Year')
primary_enrollment_with_gdp
The output is as follows:
primary_enrollment_with_gdp.drop(['Data','Footnotes','Region/Country/Area'],axis=1,inplace=True)
primary_enrollment_with_gdp
The output is as follows:
primary_enrollment_with_gdp = primary_enrollment_with_gdp[['Year','Enrollments (Thousands)','GDP']]
primary_enrollment_with_gdp
The output is as follows:
4. Plot the data:
plt.figure(figsize=(8,5))
plt.title("India's GDP per capita vs primary education enrollment",fontsize=16)
plt.scatter(primary_enrollment_with_gdp['GDP'],
            primary_enrollment_with_gdp['Enrollments (Thousands)'],
            edgecolor='k',color='orange',s=200)
plt.xlabel("GDP per capita (US $)",fontsize=15)
plt.ylabel("Primary enrollment (thousands)",fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.grid(True)
plt.show()
The output is as follows:
import sqlite3

with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS \
        education_gdp(Year INT, Enrollment FLOAT, GDP FLOAT, PRIMARY KEY (Year))")
with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    for i in range(14):
        year = int(primary_enrollment_with_gdp.iloc[i]['Year'])
        enrollment = primary_enrollment_with_gdp.iloc[i]['Enrollments (Thousands)']
        gdp = primary_enrollment_with_gdp.iloc[i]['GDP']
        #print(year,enrollment,gdp)
        cursor.execute("INSERT INTO education_gdp (Year,Enrollment,GDP) VALUES(?,?,?)", (year,enrollment,gdp))
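The same parameterized INSERT pattern can be sketched self-contained against an in-memory database (the two rows below are illustrative, not the activity's real figures):

```python
import sqlite3

with sqlite3.connect(":memory:") as conn_demo:
    cur = conn_demo.cursor()
    cur.execute("CREATE TABLE education_gdp"
                "(Year INT, Enrollment FLOAT, GDP FLOAT, PRIMARY KEY (Year))")
    rows = [(2003, 125569.0, 546.0), (2004, 126504.0, 624.0)]
    for year, enrollment, gdp in rows:
        # ? placeholders let sqlite3 handle quoting and type conversion
        cur.execute("INSERT INTO education_gdp (Year,Enrollment,GDP) VALUES(?,?,?)",
                    (year, enrollment, gdp))
    inserted = cur.execute("SELECT count(*) FROM education_gdp").fetchone()[0]
print(inserted)  # 2
```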
If we look at the current folder, we should see a file called Education_GDP.db, and if we can examine that using a database viewer program, we can see the data transferred there.