Data Science With Python Class Room Notes PDF
Classroom Notes
By Ram Reddy
7. Dictionary ______________________________________________________________ 43
8. Lambda functions ________________________________________________________ 46
9. Syntax Errors and Exceptions _______________________________________________ 49
10. Iterables & Iterators ____________________________________________________ 51
11. List comprehensions ____________________________________________________ 57
12. Generators ____________________________________________________________ 59
### Level 03 of 08: Python Packages for Data Science ### ____________________ 63
RR ITEC #209, Nilagiri Block, Adithya Enclave, Ameerpet @8374899166, 8790998182
13. NumPy package ________________________________________________________ 63
14. Pandas (powerful Python data analysis toolkit) ______________________________ 67
14.1 Introduction _______________________________________________________________ 67
14.2 Slicing Dataframe ___________________________________________________________ 67
14.3 Filtering Dataframe _________________________________________________________ 69
14.4 Transforming Dataframe _____________________________________________________ 70
14.5 Advanced indexing __________________________________________________________ 71
14.6 Stack and unstack ___________________________________________________________ 74
14.7 Groupby and aggregations ___________________________________________________ 76
30.6 Regular expression operations _______________________________________________ 124
30.7 Dropping duplicate data ____________________________________________________ 125
30.8 Filling missing data _________________________________________________________ 126
30.9 Testing with asserts ________________________________________________________ 127
### Level 05 of 08: Deep Learning ### _____________________________________ 189
33. Deep learning ________________________________________________________ 189
33.1 Introduction ______________________________________________________________ 189
33.2 Forward propagation _______________________________________________________ 190
33.3 Activation functions ________________________________________________________ 191
33.4 Deeper networks __________________________________________________________ 193
33.5 Need for optimization ______________________________________________________ 195
33.6 Gradient descent __________________________________________________________ 197
33.7 Backpropagation __________________________________________________________ 202
33.8 Creating keras Regression Model _____________________________________________ 205
33.9 Creating keras Classification Models ___________________________________________ 208
33.10 Using models ___________________________________________________________ 209
33.11 Understanding Model Optimization _________________________________________ 210
33.12 Model Validation ________________________________________________________ 212
33.13 Model Capacity __________________________________________________________ 217
36.11 Introduction to SpaCy ____________________________________________________ 246
36.12 Multilingual NER with polyglot _____________________________________________ 248
36.13 Building a "fake news" classifier ____________________________________________ 249
36.14 Dialog Flow _____________________________________________________________ 255
36.15 RASA NLU ______________________________________________________________ 256
### Level 01 of 08 : Basic Python ###
1. Introduction
2. Python is a programming language that lets you work quickly and integrate systems more effectively.
5. Python has two main versions, Python 2.x and Python 3.x. For more information refer to https://www.python.org/downloads/
a. Start anaconda Jupyter
2. Data Types and Variables
3. Save Jupyter notebook
b. Strings
c. Bool
d. Lists
e. Tuples
f. Dictionary
# 01.Defining integer
savings = 1000
print(savings)
interest = 4.5
print(interest)
a_complex = 1 + 2J
a_complex
# 04. Defining String
desc = "compound interest"   # definition restored; the original string was lost in extraction
print(desc)
am_i_good=True
print(am_i_good)
print(type(savings))
print(type(interest))
print(type(a_complex))
print(type(desc))
print(type(am_i_good))
a. Convert one data type into another using the functions int(), str(), float(), complex() and bool()
a=10
type(a)
a1 = str(a)
type(a1)
b=10.5
type(b)
b1 = str(b)
type(b1)
# 03. Use case: calculating simple interest
savings = 100
interest = 12
total_amount = savings + (savings * interest / 100)   # line restored: simple interest for one period
print("I started with $" + str(savings) + " and now have $" + str(total_amount) + ". Awesome!")
pi_string = "3.1415926"
print(type(pi_string))
pi_float = float(pi_string)
print(type(pi_float))
word = "python"
# 02. print y
word[1]
# 03. print h
word[3]
word[0:4]
word[:3]
for i in [0,3,5]:
    print(word[i])
9. Mutable vs Immutable Objects: a mutable object can be changed after it is created; an immutable object cannot.
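The distinction above can be sketched in a few lines (a list is mutable, a string is not):

```python
# Lists are mutable: the object can be changed in place
a_list = [10, 20, 30]
a_list[0] = 99
print(a_list)            # [99, 20, 30]

# Strings are immutable: item assignment raises TypeError
a_str = "python"
try:
    a_str[0] = "P"
except TypeError as e:
    print("immutable:", e)
```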
3. List
2. The most versatile (flexible) is the list, which can be written as a list of comma-separated
values (items) between square brackets.
3. Lists might contain items of different types, but usually the items all have the same type.
rritec_emp_sal=[1000,2000,3000,4000]
print(rritec_emp_sal)
print(type(rritec_emp_sal))
rritec_emp_name_sal=["ram",1000,"nancy",2000,"robert",3000,"rabson",4000]
print(rritec_emp_name_sal)
rritec_emp_name_sal1=[["ram",1000],["nancy",2000],["robert",3000],["rabson",4000]]
print(rritec_emp_name_sal1)
print(type(rritec_emp_name_sal1))
print(rritec_emp_name_sal[0])
print(rritec_emp_name_sal[-1])
print(rritec_emp_name_sal[0:4])
print(rritec_emp_name_sal[:4])
print(rritec_emp_name_sal[4:])
print(rritec_emp_name_sal1[0][0])
print(rritec_emp_name_sal1[0][-1])
b. List is mutable
rritec_emp_name_sal[0]='Ram Reddy'
rritec_emp_name_sal[2:4]=['willis',2100]
print(rritec_emp_name_sal)
# Adding Elements
rritec_emp_name_sal=rritec_emp_name_sal + ['Helen',5000]
print(rritec_emp_name_sal)
# Deleting Elements
del(rritec_emp_name_sal[2:4])
print(rritec_emp_name_sal)
# Need to observe 1
x = [1000,2000,3000]
y=x
z = list(x)
y[0]=1001
print(y);print(x)
# Need to observe 2
x = [1000,2000,3000]
z = list(x)
z[1] = 2001
print(z);print(x)
8. Refer 1 of 2: https://docs.python.org/3/tutorial/introduction.html#lists
4. Operators
1. Operators 1 of 3: Comparison Operators are (<, >, ==, <=, >=, !=)
3. Operators 3 of 3: NumPy Boolean Operators (logical_and(), logical_or(), logical_not())
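A minimal sketch of these three functions on a sample salary array (data assumed for illustration):

```python
import numpy as np

sal = np.array([1000, 2500, 4000, 5500])

# Elementwise boolean logic on arrays needs NumPy's functions,
# because Python's `and` / `or` do not work elementwise
print(np.logical_and(sal > 2000, sal < 5000))   # [False  True  True False]
print(np.logical_or(sal < 1500, sal > 5000))    # [ True False False  True]
print(np.logical_not(sal > 2000))               # [ True False False False]
```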
5. Control Flows
x = -5   # sample value assumed; the original assignment was lost in extraction
if x < 0:
    print('Negative Number')
elif x == 0:
    print('Zero')
else:
    print('Positive Number')
2. An if … elif … elif … sequence is a substitute for the switch or case statements found in other languages (for example, switch in R).
5.2 While Loop:
1. The while loop is like a repeated if statement: the code is executed over and over again, as long as the condition is true.
a, b = 0, 1
while b < 100:   # condition assumed; the original while line was lost in extraction
    print(b, end=',')
    a, b = b, a+b
a. With the break statement we can stop the loop even if the while condition is true
i = 1
while i < 6:
    print(i)
    if i == 3:
        break
    i += 1
a. With the continue statement we can stop the current iteration and continue with the next iteration
i = 0
while i < 6:
    i += 1
    if i == 3:
        continue
    print(i)
1. Python’s for statement iterates over the items of any sequence (a list, a string, a dictionary), in the order that they appear in the sequence.
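The point above can be sketched with the same for syntax over three different sequence types:

```python
# Iterating a list
for item in [10, 20, 30]:
    print(item)

# Iterating a string, character by character
for ch in "abc":
    print(ch)

# Iterating a dictionary: .items() yields (key, value) pairs
emp = {"ram": 1000, "nancy": 2000}
for name, sal in emp.items():
    print(name, sal)
```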
Exercise 6: For loop on numpy array?
1. 1D Numpy array, looping over all elements can be as simple as list looping
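A short sketch: a 1D array loops like a list, while np.nditer() visits every element of a multi-dimensional array (sample data assumed):

```python
import numpy as np

height = np.array([1.7, 1.8, 1.6])
for h in height:              # 1D array: loop exactly like a list
    print(h)

np_2d = np.array([[1, 2], [3, 4]])
for v in np.nditer(np_2d):    # nditer flattens over all elements of an N-D array
    print(v)
```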
Exercise 7: For loop on pandas dataframe?
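The exercise's code was not captured in these notes; a minimal sketch using iterrows() (sample frame assumed):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ram", "nancy"], "sal": [1000, 2000]})

# iterrows() yields (index_label, row_as_Series) pairs
for label, row in df.iterrows():
    print(label, row["name"], row["sal"])
```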
### Level 02 of 08: Advanced python ###
6.1 Functions
6.1.2 User Defined Functions (UDFs)
1. Though we have a lot of built-in functions, sometimes we need to write user-defined functions.
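A minimal user-defined function, as a sketch:

```python
# A UDF with a docstring and a return value
def square(x):
    """Return x squared."""
    return x ** 2

print(square(4))   # 16
```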
Exercise 5: Function with multiple arguments
6.1.3 Tuple
6.1.4 Scope of objects in functions
5. Built-in scope: names in the pre-defined builtins module (ex: len, max, min, … etc.)
7. Exercise 2: we can use the same name for a local and a global variable and assign them different values
8. Exercise 3: Python searches for a local variable first; if it is not found, it searches globally
12. Exercise 7: Nested Functions
"""def len(in_var):
    l = 0
    for i in in_var:
        l += 1
    return l"""
def a_func(in_var):
    print(len(in_var))   # body assumed; the original function body was lost in extraction
a_func('Hello, World!')
6.1.6 Default and flexible arguments
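The examples for this section were not captured in these notes; a minimal sketch of a default argument, *args and **kwargs (names chosen for illustration):

```python
# Default argument: pow defaults to 2
def raise_to(base, pow=2):
    return base ** pow

# *args collects extra positional arguments into a tuple
def add_all(*args):
    return sum(args)

# **kwargs collects keyword arguments into a dict
def show_info(**kwargs):
    for key, value in kwargs.items():
        print(key, "=", value)

print(raise_to(3))        # 9  (uses the default pow=2)
print(raise_to(3, 3))     # 27
print(add_all(1, 2, 3))   # 6
show_info(name="ram", course="python")
```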
6.2 Methods
# Define list
a_list = [10,20,30,10,30,40,50,10,60,10]
a_list.count(10)
a_list.append(100)
a_list
a_list.insert(2,200)
a_list
a_list.remove(20)
a_list
# sort items
a_list.sort()
a_list
# reverse items
a_list.reverse()
a_list
# Define String
a_str = "Data Science"
a_str.upper()
a_str.count("e")
6.3 Modules
2. The file name is the module name with the suffix .py appended.
# Define a number
a_num = 10
# define a string
a_str = "python"
# define a list
a_list = [10,20,30,40]
# define a float
a_float = 10.45
def square(x):
    return x ** 2
def fibo(n):
    a, b = 0, 1
    while b < n:
        print(b, end=' ')   # output line assumed; the loop body print was lost in extraction
        a, b = b, a+b
# import entire module and access objects
import a_module1          # the import keyword is lowercase
a_module1.a_num
import a_module2          # line assumed; a_module2 is used below
a_module2.a_float
a_module2.fibo(1000)
a_num                     # NameError: module members are not added to the global namespace
6.4 Packages
4. Create one more module with the name of my_fibo.py
6.4.2 System defined Packages
12. ..etc
13. As we know, Anaconda software = Python software + the basic packages required for data science
14. To see the list of packages already installed in Anaconda: Start > All Programs > Anaconda folder > right-click Anaconda Prompt > Run as administrator
15. Type conda list
16. To see a particular package, type conda list <package name> (example: conda list matplotlib)
i. Open spyder
2. Exercise 1: Import Package and calculate circumference and area of the circle
3. Exercise 2: Import one function from a package and calculate the circumference and area of the circle
2. Right-click on the project > New > Package > provide the name spy_package > click OK
7. Dictionary
2. Observations of dictionary:
a. Each key is separated from its value by a colon (:), the items/elements are separated by commas, and the whole thing is enclosed in curly braces.
b. An empty dictionary without any items is written with just two curly braces, like this: {}.
c. Keys are unique within a dictionary, while values may not be.
d. Example (wrapper lines assumed; the surviving key/value lines are from the notes):
a_dict = {
    "key1": "value1",
    "key2": "value2"
}
e. Refer: https://docs.python.org/3/tutorial/datastructures.html
Exercise 3: Insert, Update and delete dictionary items
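The exercise's code was not captured; a minimal sketch of the three operations (sample data assumed):

```python
emp = {"ram": 1000, "nancy": 2000}

emp["robert"] = 3000    # insert a new key/value pair
emp["ram"] = 1500       # update an existing key
del emp["nancy"]        # delete a key
print(emp)              # {'ram': 1500, 'robert': 3000}
print("ram" in emp)     # membership test runs on the keys
```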
Exercise 5: Create dataframe using dictionaries
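The exercise's code was not captured; a minimal sketch (column names and values assumed). Each dictionary key becomes a column name and each list becomes that column's values:

```python
import pandas as pd

data = {"name": ["ram", "nancy", "robert"],
        "sal": [1000, 2000, 3000]}
df = pd.DataFrame(data)
print(df)
print(df.shape)   # (3, 2)
```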
8. Lambda functions
1. Python supports the creation of anonymous functions (i.e. functions that are not bound to a name).
2. While normal functions are defined using the def keyword, anonymous functions are defined using the lambda keyword. Hence, anonymous functions are also called lambda functions.
4. Lambda functions are used with typical functional concepts like map(), filter() and reduce().
add = lambda x, y: x + y   # example assumed; the original lambda was lost in extraction
Here:
1. x, y are arguments
2. x + y is the expression that is evaluated and returned
# define function
def power(i):
    return i ** 2
a_list = [10,20,30,40]
power(a_list)   # TypeError: ** is not defined for a plain list
import numpy as np
a_np_list = np.array(a_list)
power(a_np_list)
# can you avoid defining the function above and converting the list into a
# numpy array, by using map()?
rs = map(lambda i: i ** 2, a_list)
rs
list(rs)
Exercise 4: lambda function with reduce function: find the product of the first 10 prime numbers
a_list_prime_num = [2,3,5,7,11,13,17,19,23,29]
# regular way
product = 1
for i in a_list_prime_num:
    product = product * i
product
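The same product can be computed with reduce() and a lambda, which is the functional version the exercise asks for:

```python
from functools import reduce

a_list_prime_num = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

# reduce() folds the list pairwise: ((2*3)*5)*7 ...
product = reduce(lambda x, y: x * y, a_list_prime_num)
print(product)   # 6469693230
```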
9. Syntax Errors and Exceptions
1. There are (at least) two distinguishable kinds of errors: syntax errors and exceptions.
a. Syntax Errors
print(hello World)   # SyntaxError: invalid syntax
for i as "python":   # SyntaxError: `as` is not valid here; the keyword is `in`
    print(i)
b. Exceptions
1. ZeroDivisionError (try it: 1/0)
2. NameError
3. TypeError
a. Catch exceptions with try-except clause
c. Exercise 2: link user defined exception message with built-in error codes
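The exercise's code was not captured; a sketch of both ideas, with the function names chosen for illustration:

```python
# Catch a built-in exception and attach a user-defined message
def safe_div(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        print("user message: cannot divide by zero")
        return None

print(safe_div(10, 2))   # 5.0
print(safe_div(10, 0))   # prints the message, returns None

# Raise a built-in error type with a custom message
def sqrt_of(x):
    if x < 0:
        raise ValueError("x must be non-negative")
    return x ** 0.5
```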
10. Iterables & Iterators
1. Iterable: an object capable of returning its members one at a time, e.g. lists, strings, dictionaries (anything a for loop can consume).
2. Iterator: an object that produces the next value each time it is passed to next().
persons = ['ram', 'nancy']   # sample list assumed; the original definition was lost
for person in persons:
    print(person)
avoid_for = iter([10, 20, 30, 40, 50])   # sample iterator assumed; original definition lost
print(type(avoid_for))
print(next(avoid_for))
print(next(avoid_for))
print(next(avoid_for))
print(next(avoid_for))
for num in avoid_for:    # the remaining items can still be looped over
    print(num)
range_iterator = iter(range(5))
print(type(range_iterator))
print(next(range_iterator))
print(next(range_iterator))
print(next(range_iterator))
print(next(range_iterator))
print(next(range_iterator))
a_str = "python"
a_str_iter = a_str.__iter__()
next(a_str_iter)
next(a_str_iter)
next(a_str_iter)
next(a_str_iter)
next(a_str_iter)
next(a_str_iter)
a_tuple = (10,20,30)
a_tuple_iter=iter(a_tuple)
next(a_tuple_iter)
next(a_tuple_iter)
next(a_tuple_iter)
next(a_tuple_iter)   # StopIteration: the three items are exhausted
a_dict = {1:"ram",2:"Nancy",3:"rabson"}
a_dict_iter = iter(a_dict)
next(a_dict_iter)
next(a_dict_iter)
next(a_dict_iter)
next(a_dict_iter)   # StopIteration: all three keys are consumed
range_object = range(10)   # definition assumed; the original was lost in extraction
print(type(range_object))
# Print the range object
print(range_object)
range_object_list = list(range_object)
# Print range_object_list
print(range_object_list)
range_object_list_max = max(range_object)
# Print range_object_list_max
print(range_object_list_max)
DataScience = ['PYTHON','R','MATHS','STATS','ML','DL','NLU','NLP','AI']
print(enumerate(DataScience))
DataScience_list = list(enumerate(DataScience))
print(type(DataScience_list))
print(DataScience_list)
print("***** 03. Unpack and print *****")
for index1, value1 in enumerate(DataScience):   # loop restored; original lost in extraction
    print(index1, value1)
print("***** 04. Unpack and print with given start index *****")
for index2, value2 in enumerate(DataScience, start=1):
    print(index2, value2)
c. If you want to print the values of a zip object, you can convert it into a list and then
print it
std_no = (10,20,30)
std_name = ("Ram","Nancy","Rabson")
std_marks = (100,2000,300)
std_data = zip(std_no, std_name, std_marks)   # line restored; original lost in extraction
print("***** 01. Print the list of tuples created by zip function *****")
print(type(std_data))
print(std_data)
print(list(std_data))   # note: list() exhausts the zip object
print("***** 02. Unpack the zip object and print the tuple values *****")
std_data = zip(std_no, std_name, std_marks)   # recreate; the first zip object was exhausted
for no, name, marks in std_data:
    print(no, name, marks)
b. However, we can use * with zip() to unzip
std_name = ("Ram","Nancy","Rabson")
std_course = ("R","PYTHON","DATASCIENCE")
z1 = zip(std_course, std_name)
print(*z1)
z1 = zip(std_course, std_name)
result1, result2 = zip(*z1)   # line restored: unzip back into two tuples
print(result1 == std_course)
print(result2 == std_name)
11. List comprehensions
2. In mathematics, the squares of the natural numbers are for example created by { x² | x ∈ ℕ }, or the set of complex integers by { (x, y) | x ∈ ℤ, y ∈ ℤ }.
3. Guido van Rossum prefers list comprehensions and does not like lambda functions
4. List comprehensions can substitute for lambda functions as well as the functions map(), filter() and reduce()
Exercise 1: List comprehension to avoid a loop or lambda functions. Find the square of each list item.
nums = [10,20,30,40,50]
new_nums = []
for num in nums:   # loop restored; the for line was lost in extraction
    new_nums.append(num ** 2)
new_nums
a = map(lambda x: x **2,nums)
list(a)
new_nums1 = [num ** 2 for num in nums]   # comprehension restored; original lost in extraction
new_nums1
pairs_1 = []
for num1 in range(0, 2):          # loops restored from the comprehension below
    for num2 in range(6, 8):
        pairs_1.append((num1, num2))
print(pairs_1)
pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
print(pairs_2)
pos_neg = [num if num >= 0 else -num for num in range(-3, 4)]   # example assumed; original lost
print(pos_neg)
print(type(pos_neg))
Refer: https://docs.python.org/3/tutorial/datastructures.html
12. Generators
1. Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop.
2. List comprehensions and generator expressions look very similar in their syntax, except for the use of parentheses () in generator expressions and brackets [] in list comprehensions.
import sys
l = [i * 2 for i in range(1000)]   # list comprehension (definition assumed; original lost)
g = (i * 2 for i in range(1000))   # generator expression (definition assumed)
print(sys.getsizeof(l)) # 9032: the list holds all values in memory
print(sys.getsizeof(g)) # 80: the generator holds only its state
print(l[4]) # 8 -- a list is indexable; a generator is not
for v in l:
print(v)
for v in g:
print(v)
# List of strings
students = ['ram', 'nancy', 'robert']   # sample data assumed; the original list was lost
# List comprehension
std1 = [len(s) for s in students]
print(type(std1))
print(std1)
# Generator expression
std2 = (len(s) for s in students)
print(type(std2))
list(std2)
result = (num for num in range(10))   # generator assumed; original definition lost
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print("***** 02 Print the rest of the values using for Loop *****")
for value in result:
    print(value)
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""
    for person in input_list:
        yield len(person)

for value in get_lengths(['ram', 'nancy', 'robert']):   # sample call assumed
    print(value)
### Level 03 of 08: Python Packages for Data Science ###
13. NumPy package
(Comparison: Python lists vs NumPy arrays)
1. Data science needs mathematical operations over collections of values in almost no time. This is not possible with a regular Python list.
3. Exercise 2: Calculate BMI using NumPy Array. Is it possible?
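The exercise's code was not captured; a sketch of the BMI calculation, with sample heights and weights assumed:

```python
import numpy as np

# Sample data assumed for illustration
height_m = np.array([1.75, 1.60, 1.82])
weight_kg = np.array([70, 55, 90])

# Elementwise arithmetic over whole arrays -- not possible with plain lists
bmi = weight_kg / height_m ** 2
print(bmi)
print(bmi > 25)   # boolean mask marking BMI values above 25
```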
6. A NumPy array contains elements of only one data type, whereas a Python list can contain elements of different data types
8. NumPy arrays allow calculations over entire collections of elements; they are easy, fast, and occupy less memory
11. https://www.mathsisfun.com/definitions/mean.html
12. https://www.mathsisfun.com/median.html
13. https://www.mathsisfun.com/data/standard-deviation.html
print(np.random.randint(10,50,5))
print(np.random.randint(10,50,5))
np.random.seed(100)
print(np.random.randint(10,50,5))
np.random.seed(100)
print(np.random.randint(10,50,5))
a = np.array([1, 2, 3])      # sample arrays assumed; the original definitions were lost
b = np.array([10, 20, 30])
print(a)
print(b)
print(a+b)
17. https://docs.scipy.org/doc/numpy/user/quickstart.html
14. Pandas (powerful Python data analysis toolkit)
14.1 Introduction
import pandas as pd
df = pd.read_csv("pandas_sales.csv",index_col = 'month')
df
df['eggs']['May']
df['spam']
df.salt[0]
df.salt[0:3] # slice
df.salt[[0,3,5]] # fancy
df.loc['Jan','eggs']
df.loc['Jan':'May','eggs':'spam'] # slice
df.loc[['Jan','May'],['eggs','spam']] # fancy
df.iloc[1,1]
df.iloc[1:3,1:3] # by slice
df.iloc[[1,3],[0,2]] # by fancy
df_new = df[['eggs','spam']]
df_new
series = df['eggs']
series
14.3 Filtering Dataframe
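The section's examples were not captured in these notes; a minimal filtering sketch on a small frame (column names modeled on the pandas_sales.csv used above, values assumed):

```python
import pandas as pd

df = pd.DataFrame({"eggs": [47, 110, 221, 77],
                   "salt": [12.0, 50.0, 89.0, 87.0]},
                  index=["Jan", "Feb", "Mar", "Apr"])

print(df[df["salt"] > 60])                          # boolean filter on rows
print(df[(df["salt"] > 60) & (df["eggs"] < 200)])   # combine conditions with & / |
print(df.loc[df["eggs"] > 100, "salt"])             # filter rows, select one column
```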
14.4 Transforming Dataframe
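The section's examples were not captured; a minimal transforming sketch (column names assumed, as above):

```python
import pandas as pd

df = pd.DataFrame({"eggs": [47, 110, 221],
                   "salt": [12.0, 50.0, 89.0]},
                  index=["Jan", "Feb", "Mar"])

print(df.floordiv(12))                     # transform every value in the frame
print(df["eggs"].apply(lambda x: x * 2))   # transform a single column
df["eggs_dozens"] = df["eggs"] // 12       # add a derived column
print(df)
```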
14.5 Advanced indexing
2.Indexes
import pandas as pd
prices = [10,12,13,11,9]
type(prices)
shares = pd.Series(prices)
shares
type(shares)
days = ['Mon','Tue','Wed','Thur','Fri']
shares.index = days   # line restored: assign the day labels as the index
shares
shares.index
shares.index[1]
shares.index[:3]
shares.index[-2:]
shares.index.name
shares.index.name = 'weekday'
shares.index.name
shares
shares.index = ['Monday','Tuesday','Wednesday','Thursday','Friday']
shares
import pandas as pd
df = pd.read_csv("pandas_sales.csv",index_col= 'month')
df
5.Exercise 3: Hierarchical indexing
import pandas as pd
df = pd.read_csv("pandas_sales_hierarchical_indexing.csv")
df
# 02. Create index using two columns (similar to composite key in RDBMS)
df = df.set_index(['state','month'])
df
df.index
df.index.name
df.index.names
df = df.sort_index()
# 05. Reading
df.loc['CA',1]
df.loc[('CA',1),'salt']
df.loc['CA']
df.loc['CA':'NY']
# 05-03. Fancy
df.loc[(['CA','TX'],1),:]
df.loc[(['CA','TX'],1),'eggs']
df.loc[('CA',[1,2]),:]
1. Recall pivot
import pandas as pd
df = pd.read_csv("pandas_users.csv")
df
# pivot calls assumed; the original statements were lost in extraction
visitors_pivot = df.pivot(index='weekday', columns='city', values='visitors')
visitors_pivot
signups_pivot = df.pivot(index='weekday', columns='city', values='signups')
signups_pivot
visitors_signups_pivot = df.pivot(index='weekday', columns='city')
visitors_signups_pivot
#01. import required modules and load data
import pandas as pd
df = pd.read_csv("pandas_users.csv")
df
df = df.set_index(['city','weekday'])
df
# 03. unstack by the weekday level (statement assumed; original lost in extraction)
byweekday = df.unstack(level='weekday')
byweekday
# 04. revert it
df1 = byweekday.stack(level='weekday')
df1
14.7 Groupby and aggregations
import pandas as pd
sales = pd.DataFrame({
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],           # sample data assumed;
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],  # the original dict body
    'bread': [139, 237, 326, 456],                     # was lost in extraction
    'butter': [20, 45, 70, 98]
})
sales
sales.loc[sales['weekday'] == 'Sun'].count()
sales.groupby('weekday').count()
sales.groupby('weekday')['bread'].sum()
sales.groupby('weekday')[['bread','butter']].sum()
sales.groupby(['city','weekday']).mean()
sales.groupby('city')[['bread','butter']].max()
sales.groupby('weekday')[['bread','butter']].agg(['max','min'])
def data_range(series):
    return series.max() - series.min()   # body restored: custom aggregation = range
sales.groupby('weekday')[['bread', 'butter']].agg(data_range)
3.Refer
b. Read http://pandas.pydata.org/pandas-docs/stable/
d. http://pandas.pydata.org/pandas-docs/stable/cookbook.html
4.Exercise 1: Py31_1_some_graphs_using_iris_data
5.Exercise 2: Py31_2_iris_statistics
6.Exercise 3: Py31_3_iris_statistics
15. Matplotlib data visualization
Exercise 2: Scatter Chart
import pandas as pd
df = pd.read_csv("percent-bachelors-degrees-women-usa.csv")
year = df["Year"]
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
plt.show()
plt.axes([0.05,0.05,0.425,0.9])
plt.axes([0.525,0.05,0.425,0.9])
plt.show()
# Create a figure with 1x2 subplot and make the left subplot active
plt.subplot(1, 2, 1)
plt.title('Physical Sciences')
# Make the right subplot active in the current 1x2 subplot grid
plt.subplot(1, 2, 2)
plt.title('Computer Science')
plt.tight_layout()
plt.show()
Exercise 7: Different line plots on distinct axes using subplot with a 2x2 grid
education= df["Education"]
# Create a figure with 2x2 subplot layout and make the top left subplot active
plt.subplot(2, 2, 1)
plt.title('Physical Sciences')
# Make the top right subplot active in the current 2x2 subplot grid
plt.subplot(2, 2, 2)
plt.title('Computer Science')
# Make the bottom left subplot active in the current 2x2 subplot grid
plt.subplot(2, 2, 3)
# Plot in green the % of degrees awarded to women in Health Professions
plt.title('Health Professions')
# Make the bottom right subplot active in the current 2x2 subplot grid
plt.subplot(2, 2, 4)
plt.title('Education')
plt.tight_layout()
plt.show()
# Plot the % of degrees awarded to women in Computer Science and the Physical Sciences
plt.plot(year,computer_science, color='red')
plt.xlabel('Year')
plt.xlim(1990,2010)
plt.ylim(0,50)
plt.show()
plt.plot(year,computer_science, color='blue')
plt.plot(year, physical_sciences,color='red')
plt.axis((1990,2010,0,50))
plt.show()
plt.legend(loc='lower center')
plt.xlabel('Year')
plt.ylabel('Enrollment (%)')
plt.show()
plt.plot(year, physical_sciences, color='blue', label='Physical Sciences')
plt.legend(loc='lower right')
cs_max = computer_science.max()
# Calculate the year in which there was maximum enrollment of women in Computer Science: yr_max
yr_max = year[computer_science.argmax()]
plt.xlabel('Year')
plt.ylabel('Enrollment (%)')
plt.show()
# Import matplotlib
plt.style.use('ggplot')
plt.subplot(2, 2, 1)
plt.title('Physical Sciences')
# Plot the enrollment % of women in Computer Science
plt.subplot(2, 2, 2)
plt.title('Computer Science')
# Add annotation
cs_max = computer_science.max()
yr_max = year[computer_science.argmax()]
plt.subplot(2, 2, 3)
plt.title('Health Professions')
plt.subplot(2, 2, 4)
plt.title('Education')
plt.tight_layout()
plt.show()
16. Seaborn data visualization
9. wip
17. Bokeh data visualization
10. wip
18.
19. Import Data from Flat Files
1. Flat files
2. Change the working directory using the Spyder toolbar (temporary). (If the toolbar is not visible, enable it from the View menu.)
a. Tools menu
4. Exercise 1: About current working Directory using script
import os
wd = os.getcwd()
os.listdir(wd)
a. Copy plain_text.txt file from labdata to working directory and execute below script
6. Exercise 3: Importing plain text file line by line and print first three lines
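The exercise can be sketched as below; the file is written inline so the example is self-contained (its contents are assumed, not the course's plain_text.txt):

```python
# Write a small sample file so the sketch is runnable anywhere
with open("plain_text.txt", "w") as f:
    f.write("line 1\nline 2\nline 3\nline 4\n")

# Read and print the first three lines one at a time with readline()
with open("plain_text.txt") as file:
    for _ in range(3):
        print(file.readline(), end="")
```

The `with` block closes the file automatically, so no explicit `file.close()` is needed.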
7. Exercise 4: Using numpy import MNIST file and create image to test not a robot
# Import package
import numpy as np
# Load the MNIST data (filename and load call assumed; the original was lost in extraction)
digits = np.loadtxt('mnist_kaggle_some_rows.csv', delimiter=',')
type(digits)
im = digits[21, 1:]
im.shape
import matplotlib.pyplot as plt
im_sq = np.reshape(im, (28, 28))   # line restored: reshape the 784 pixels to 28x28
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
np.load() loads arrays saved in NumPy's binary .npy/.npz format
# Import package
import numpy as np
# import file (filename and load call assumed; the original was lost in extraction)
emp = np.loadtxt('emp.csv', delimiter=',', skiprows=1)
type(emp)
emp
emp[0:4,1:]
# Import package
import numpy as np
# load as a structured array with named columns (call assumed; original lost in extraction)
emp = np.genfromtxt('emp.csv', delimiter=',', names=True, dtype=None)
# shape
print(np.shape(emp)) # (14,)
import matplotlib.pyplot as plt
print(emp['SAL'])
plt.scatter(emp['COMM'], emp['SAL'])
plt.xlabel('Comm')
plt.ylabel('Sal')
plt.show()
a. You were able to import flat files containing columns with different data types as NumPy arrays.
b. We can more easily import files of mixed data types as DataFrames, using the pandas functions read_csv() and read_table().
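A sketch of the read_csv() route (the CSV is written inline so the example is self-contained; its contents are assumed):

```python
import pandas as pd

# Create a small CSV with mixed dtypes, then import it as a DataFrame
with open("emp_sample.csv", "w") as f:
    f.write("ENAME,SAL,COMM\nram,1000,50\nnancy,2000,\n")

df = pd.read_csv("emp_sample.csv")
print(df.dtypes)   # ENAME object, SAL int64, COMM float64 (the missing value makes it float)
print(df.head())
```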
20. Import Data from Excel Files
2. Import any given sheet of your loaded workbook file as a DataFrame, by specifying either the sheet's name or its index.
3. It allows you to skip rows, rename columns and select only particular columns.
import pandas as pd
xls = pd.ExcelFile('sales.xlsx')   # workbook name assumed; the original load was lost
df1 = xls.parse(0)                 # first sheet by index
# Parse the first column of the second sheet and rename the column: df4
df4 = xls.parse(1, usecols=[0], names=['Country'])   # parameters assumed
import pandas as pd
# Create a Pandas Excel writer using XlsxWriter as the engine (setup assumed; original lost)
df = pd.DataFrame({'Data': [10, 20, 30]})
writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
writer.save() # Close the Pandas Excel writer and output the Excel file.
21. Import SAS and STATA Files
1. SAS: Statistical Analysis System mainly for business analytics and biostatistics
a. Advanced analytics
b. Multivariate analysis
c. Business intelligence
d. Data management
e. Predictive analytics
# Import the sas7bdat reader and load the file (import and filename assumed; original lost)
from sas7bdat import SAS7BDAT
import pandas as pd
import matplotlib.pyplot as plt
with SAS7BDAT('sales.sas7bdat') as file:
    # Save file to a DataFrame: df_sas
    df_sas = file.to_data_frame()
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()
# Import pandas
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_stata('disarea.dta')
print(df.head())
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()
22. Import HDF5 Files
import numpy as np; import h5py; import matplotlib.pyplot as plt # Import packages
data = h5py.File('LIGO_data.hdf5', 'r')   # filename assumed; the original open call was lost
group = data['strain']
for key in group.keys():
    print(key)
# Extract the strain values and build a matching time axis (details assumed)
strain = group['Strain'][()]
num_samples = 10000
time = np.arange(0, 1, 1 / num_samples)
# Plot data
plt.plot(time, strain[:num_samples])
23. Import from Relational Database (Ex: SQLite)
a. MySQL
b. PostgreSQL
c. SQLite
d. Sqlserver
e. Oracle
f. Hive
g. ..etc
2. SQLite database
3. SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')
table_names = engine.table_names()
print(table_names)
engine = create_engine('sqlite:///Chinook.sqlite')
con = engine.connect()
rs = con.execute("SELECT * FROM Album")  # query assumed (Chinook Album table)
df = pd.DataFrame(rs.fetchall())
con.close()
print(df.head())
# Import packages
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager (to avoid open and close of the connections)
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")  # query assumed
    df = pd.DataFrame(rs.fetchmany(size=3))
# Open engine in context manager, run query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")  # query assumed
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
print(df.head())
# Open engine in context manager, run query and save results to DataFrame: df1
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")  # query assumed
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()
print(df.equals(df1)) # Confirm that both methods yield the same result: does df equal df1?
# Open engine in context manager, run query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist \
on Album.ArtistID = Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
print(df.head()) # Print head of DataFrame df
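The same round trip can be seen with the standard-library sqlite3 module alone; a minimal sketch using an in-memory database (the table and row values below are illustrative, not the actual Chinook data):

```python
import sqlite3

# Build a tiny in-memory database, insert a row, then query it back.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE Album (AlbumId INTEGER, Title TEXT)")
con.execute("INSERT INTO Album VALUES (1, 'For Those About To Rock')")
rows = con.execute("SELECT * FROM Album").fetchall()
con.close()
```

The fetched rows come back as a list of tuples, which is exactly what pd.DataFrame(rs.fetchall()) consumes in the exercises above.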
24. Import web data
from urllib.request import urlretrieve
import pandas as pd
url_of_file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
urlretrieve(url_of_file, 'winequality-red.csv')
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(url, sep=';')  # read the file directly from the web
df.iloc[:, 0:1].hist()  # .ix is removed in modern pandas; use .iloc
plt.ylabel('count')
plt.show()
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00192/BreastTissue.xls'
xl = pd.read_excel(url, sheet_name=None)  # read all sheets into a dict of DataFrames
# Print the head of the first sheet, use sheet name: 'Data'
print(xl['Data'].head())
25. Import using urllib and requests packages
from urllib.request import urlopen, Request
url = "http://rritec.blogspot.in/"  # URL assumed from the next exercise
request = Request(url)
response = urlopen(request)
html = response.read()
print(html)
response.close()
# Import package
import requests
url = "http://rritec.blogspot.in/"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
text = r.text
print(text)
# Import packages
import requests
from bs4 import BeautifulSoup
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, "lxml")  # specify the parser explicitly
prettify_soup = soup.prettify()
print(prettify_soup)
# Import packages
import requests
from bs4 import BeautifulSoup
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc,"lxml")
guido_title = soup.title
print(guido_title)
guido_text = soup.get_text()
print(guido_text)
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'http://rritec.blogspot.in/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, "lxml")  # specify the parser explicitly
# Print the title of Guido's webpage
print(soup.title)
a_tags = soup.find_all('a')
for link in a_tags:
    print(link.get('href'))
5. The function json.load() will load the JSON into Python as a dictionary.
import json
with open('a_movie.json') as json_file:  # file name assumed
    json_data = json.load(json_file)
for k in json_data.keys():
    print(k + ': ', json_data[k])
6. Refer https://docs.python.org/3/library/json.html
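The same parsing can be tried on an in-memory JSON string with json.loads(); a minimal sketch (the document and key names below are illustrative, not the notes' movie file):

```python
import json

# A small JSON document (illustrative values)
doc = '{"Title": "Coco", "Year": "2017", "Genre": "Animation"}'
data = json.loads(doc)  # parsed into a Python dict

for k in data.keys():
    print(k + ': ', data[k])
```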
28. Movie and Wikipedia APIs
1. An API is a set of protocols and routines for building and interacting with software
applications.
4. An API is a bunch of code that allows two software programs to communicate with each
other.
import requests
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=coco'
# Package the request, send the request and catch the response: r
r = requests.get(url)
print(r.text)
json_data = r.json()
for k in json_data.keys():
    print(k + ': ', json_data[k])
Exercise 2: Wikipedia API
# Import package
import requests
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'
# Package the request, send the request and catch the response: r
r = requests.get(url)
json_data = r.json()
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
29. Twitter API
1. Login to twitter
6. Follow the tutorial, do the first exercise, and continue with the rest:
http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html
Exercise 1: Print Public tweets
https://labsblog.f-secure.com/2018/02/27/how-to-get-twitter-follower-data-using-python-and-tweepy/
30. Cleaning Data (ETL)
1. It is often said that 80% of data analysis is spent on the process of cleaning and
preparing the data (Dasu and Johnson 2003).
2. Data preparation is not just a first step, but must be repeated many times over the
course of analysis as new problems come to light or new data is collected.
3. The principles of tidy data provide a standard way to organize data values within a
dataset. Please refer to the white paper
https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
b. Missing data
c. Outliers
d. Duplicate rows
e. Untidy
30.1 Melt() data
2. The melt() function "unpivots" a DataFrame from wide format to long format. Observe
the exercise below.
# create a dataframe
import pandas as pd
df = pd.DataFrame({'A':{0:'a',1:'b',2:'c'},
'B':{0:1,1:3,2:5},
'C':{0:2,1:4,2:6},
'D':{0:7,1:9,2:11},
'E':{0:8,1:10,2:12}})
df_melt = pd.melt(df, id_vars=["A"],  # the melt call itself was missing here
                  value_vars=["B","C","D","E"],
                  var_name = "my_var",
                  value_name ="my_val")
df
df_melt
import pandas as pd
df = pd.read_csv("Py_cleaning_Tidydata.csv")
df
df_melt = pd.melt(df,  # the melt call itself was missing here
                  id_vars = 'name',
                  var_name = 'treatment',
                  value_name = 'result')
df_melt
30.2 Pivot data
1. Pivoting is the opposite of melting.
import pandas as pd
df = pd.read_csv("py_cleaning_pivot_data.csv")
df
df_pivot = df.pivot(index = "name",columns="treatment",values = "result")
df_pivot
import pandas as pd
import numpy as np
df = pd.read_csv("py_cleaning_pivot_data1.csv")
df
df_pivot = df.pivot_table(index="name", columns="treatment",
                          values="result", aggfunc=np.mean)  # aggfunc assumed for duplicate entries
df_pivot
30.3 Concatenating
import pandas as pd
df1 = pd.read_csv('Py_cleaning_pivot_concat1.csv')
df2 = pd.read_csv('Py_cleaning_pivot_concat2.csv')
df1
df2
concatenated = pd.concat([df1, df2])  # row-wise concatenation (the call was missing)
concatenated
concatenated.loc[0, :]
concatenated = pd.concat([df1, df2], ignore_index=True)  # reset the row labels (assumed step)
concatenated
# Import pandas
import pandas as pd
df1 = pd.read_csv('Py_cleaning_pivot_concat3.csv')
df2 = pd.read_csv('Py_cleaning_pivot_concat4.csv')
print(df1)
print(df2)
concatenated = pd.concat([df1, df2], axis=1)  # column-wise concatenation assumed for these files
print(concatenated)
e. The ? wildcard represents any single character, and the * wildcard represents any
number of characters.
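The two wildcards can be tried without touching the file system via fnmatch, which applies the same shell-style rules glob.glob() uses (the file names below are assumed for illustration):

```python
import fnmatch

# Shell-style wildcard matching, the same rules glob.glob() applies to file names
names = ['emp_part1.csv', 'emp_part12.csv', 'emp.csv']
one_char = fnmatch.filter(names, 'emp_part?.csv')  # ? = exactly one character
any_chars = fnmatch.filter(names, 'emp*.csv')      # * = any number of characters
```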
import glob
import pandas as pd
print("***** 01. Read all emp_part files from working directory *****")
csv_files = glob.glob('emp_part?.csv')
print(csv_files)
emp_list = []
for csv in csv_files:
    df = pd.read_csv(csv)
    # Append df to emp_list
    emp_list.append(df)
emp = pd.concat(emp_list)
print(emp)
30.4 Merge/Joins
import pandas as pd
emp_df1 = pd.read_csv('emp_merge.csv')
dept_df2 = pd.read_csv('dept_merge.csv')
# Merge two DataFrames on their common column(s): emp_dept
emp_dept = pd.merge(left=emp_df1, right=dept_df2)  # the merge call was missing
# Print emp_dept
print(emp_dept)
df1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'],'value':[1,2,3,4]})
df2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'],'value':[5,6,7,8]})
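With the two small frames above, the join key columns have different names, so left_on/right_on are needed; a minimal, runnable sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'qux', 'bar'], 'value': [5, 6, 7, 8]})
# Inner join (the default) on lkey = rkey; the overlapping 'value' columns
# get _x/_y suffixes automatically.
merged = df1.merge(df2, left_on='lkey', right_on='rkey')
```

Two 'foo' rows on the left match one on the right, and one 'bar' row matches two, so the result has four rows.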
30.5 Data Types Conversion
1. Converting categorical data into the 'category' dtype can make the DataFrame smaller in
memory.
2. If numeric data is loaded as a string (for example, a '-' placeholder in the data), the
column will be identified as the object datatype.
import pandas as pd
df = pd.read_csv('tips.csv')
print("***** 01. Observe the info of data frame and note down memory *****")
df.info()
df.sex = df.sex.astype('category')
df.smoker = df.smoker.astype('category')
df.total_bill = pd.to_numeric(df.total_bill,errors="coerce")
df.tip = pd.to_numeric(df.tip,errors="coerce")
print("***** 02. Observe the info of data frame and note down memory *****")
df.info()
print("***** 03. from 01 and 02, did you observe the memory difference? *****")
df
30.6 Regular expression operations
import re
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')  # use a raw string for regex patterns
# 02 - 2 Observe result
rs = pattern.match('123-456-7890')
bool(rs)
rs = pattern.match('1234-456-7890')
bool(rs)
pattern = re.compile(r'\$\d*\.\d{2}')  # raw string; \. matches a literal dot
# 03 - 2 Observe result
rs = pattern.match('$13.25')
bool(rs)
rs = pattern.match('13.25')
bool(rs)
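The two patterns above can be checked end to end; a minimal sketch using raw strings (with the dot escaped in the currency pattern, so it matches a literal decimal point):

```python
import re

# Raw strings avoid double escaping; \. matches a literal dot.
phone = re.compile(r'\d{3}-\d{3}-\d{4}')
money = re.compile(r'\$\d*\.\d{2}')

ok_phone = bool(phone.match('123-456-7890'))
bad_phone = bool(phone.match('12-456-7890'))   # too few digits in the first group
ok_money = bool(money.match('$13.25'))
bad_money = bool(money.match('13.25'))         # missing the dollar sign
```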
# 01. Import the regular expression module
import re
type(rs)
rs
rs
import pandas as pd
df = pd.read_csv('emp_duplicate.csv')
df.info()
df_no_duplicates = df.drop_duplicates()
df_no_duplicates.info()
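The same deduplication can be seen on an in-memory frame (the toy employee data below is assumed, since emp_duplicate.csv is not available here):

```python
import pandas as pd

# Toy frame (assumed data): the second row duplicates the first.
df = pd.DataFrame({'empno': [7369, 7369, 7499],
                   'ename': ['SMITH', 'SMITH', 'ALLEN']})
df_no_duplicates = df.drop_duplicates()
```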
30.8 Filling missing data
import pandas as pd
df = pd.read_csv('emp_missing.csv')
df.info()
COMM_mean = round(df.COMM.mean(),1)
# 04. Replace all the missing values in the comm column with the mean
df.COMM = df.COMM.fillna(COMM_mean)
df.info()
df
df_dropped = df.dropna()  # drop rows with missing values (the assignment was missing)
df_dropped.info()
30.9 Testing with asserts
5. Exercise 1: Assert
import pandas as pd
df = pd.read_csv('emp_missing.csv')
df_dropped = df.dropna()  # drop rows with missing values (the assignment was missing)
df_dropped.info()
assert pd.notnull(df_dropped).all().all()
df_dropped
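The fill-then-assert pattern from sections 30.8 and 30.9 can be combined on an in-memory frame (toy commission values assumed, since emp_missing.csv is not available here):

```python
import pandas as pd

# Toy frame (assumed data): one missing commission value.
df = pd.DataFrame({'COMM': [300.0, None, 500.0]})
COMM_mean = round(df.COMM.mean(), 1)   # mean of the non-missing values only
df.COMM = df.COMM.fillna(COMM_mean)
assert pd.notnull(df).all().all()      # no missing values remain
```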
31. Time Series Analysis
import pandas as pd
# Load data
df
df.index = pd.to_datetime(df.week)
df.info()
del(df['week'])
df['2016'].plot()
plt.show()
df.plot(grid=True)
plt.show()
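The DatetimeIndex operations used above (partial-string slicing like df['2016'] and resampling) can be tried on synthetic data, since the Google Trends CSV is not available here; weekly values and the 'diet' column name are assumed:

```python
import pandas as pd
import numpy as np

# Two years of synthetic weekly data on a DatetimeIndex
idx = pd.date_range('2016-01-03', periods=104, freq='W')
df = pd.DataFrame({'diet': np.arange(104)}, index=idx)

df_2016 = df.loc['2016']            # partial-string indexing selects one year
monthly = df.resample('MS').mean()  # downsample weekly data to monthly means
```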
4. Explore diet, gym, and finance searches in Google using Google Trends:
https://trends.google.com/trends/explore?date=all&q=diet,gym,finance
import pandas as pd
# Load data
df
df.info()
df.index = pd.to_datetime(df.month)
df.info()
del(df['month'])
df.info()
df.plot(grid=True)
plt.show()
df.info()
6. Correlation
i. Magnitude – The larger the magnitude (closer to 1 or -1), the stronger the
correlation
import numpy as np
np.random.seed(1)
# Positive Correlation with some noise
np.corrcoef(x, y)
plt.matplotlib.style.use('ggplot')
plt.scatter(x, y)
plt.show()
np.corrcoef(x, y)
f. Scatter graph
plt.scatter(x, y)
plt.show()
g. No/weak correlation (near zero)
np.corrcoef(x, y)
h.Scatter graph
plt.scatter(x, y)
plt.show()
i. Correlation matrix
import pandas as pd
df.corr()
j. Scatter matrix
plt.show()
plt.matshow(df.corr())
plt.xticks(range(len(df.columns)), df.columns)
plt.yticks(range(len(df.columns)), df.columns)
plt.colorbar()
plt.show()
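The positive-correlation case above can be generated end to end; a minimal sketch (the way x and y are constructed here is an assumption, chosen so the true correlation is about 0.89):

```python
import numpy as np

np.random.seed(1)
# Positive correlation with some noise (synthetic series, for illustration)
x = np.random.randn(1000)
y = x + np.random.randn(1000) / 2
r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
```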
a. Can we blindly believe a correlation value, or do we need to use some common sense?
b. http://www.tylervigen.com/spurious-correlations
8. Exercise 2: Auto correlation
df = pd.read_csv("MSFT.csv",index_col=0)
df.info()
df.index = pd.to_datetime(df.index)
df.info()
df = df.resample(rule='W').last()  # the how= keyword is removed in modern pandas
df
returns = df.pct_change()
# Import the acf function and the plot_acf function from statsmodels
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf
df = pd.read_csv("HRB.csv",index_col=0)
df.info()
df.index = pd.to_datetime(df.index)
df.info()
# understand data
plt.plot(df.index,df.Earnings) #or
plt.scatter(df.index,df.Earnings)
acf_array = acf(df)
print(acf_array)
plot_acf(df, alpha=1)
plt.show()
### Level 04 of 08: Machine Learning Models ###
13. Two definitions of Machine Learning are offered by Arthur and Tom.
14. Arthur Samuel described it as: "the field of study that gives computers the ability to learn
without being explicitly programmed." This is an older, informal definition.
15. Tom Mitchell provides a more modern definition: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance/accuracy measure P, if
its performance at tasks in T, as measured by P, improves with experience E."
iii. P = the probability that the program will win the next game.
16. In the past decade, machine learning has given us self-driving cars, practical speech
recognition, effective web search, and a vastly improved understanding of the human
genome.
17. It is the best way to make progress towards human-level Artificial Intelligence (AI).
a.Supervised learning
i. Classification
ii. Regression
b.Unsupervised learning
i. Clustering
32.1 Supervised learning
2. In supervised learning, we are given a data set and already know what our correct output
should look like, having the idea that there is a relationship between the input and the
output.
a. Regression (continuous values)
b. Classification (limited or categorical values)
4. In a regression problem, we are trying to predict results within a continuous output, meaning
that we are trying to map input variables to some continuous function. (y=a+b*x)
6. Example 1:
a.Given data about the size of houses on the real estate market, try to predict their price.
Price as a function of size is a continuous output, so this is a regression problem.
b.We could turn this example into a classification problem by instead making our output
about whether the house "sells for more or less than the asking price." Here we are
classifying the houses based on price into two discrete categories.
7. Example 2:
a.(a) Regression - Given a picture of a person, we have to predict their age on the basis
of the given picture
b.(b) Classification - Given a patient with a tumor, we have to predict whether the tumor
is malignant or benign.
8. Exercise 1: Explore pre-defined Datasets (example iris data)
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head(2))
10. Exercise 3: Create a Bunch with our own data (just for knowledge; not used in real
projects)
import numpy as np
import sklearn.datasets
examples = []
target = np.zeros((3,), dtype=np.int64)  # target array shape assumed (three examples)
target[0] = 0
target[1] = 1
target[2] = 0
32.1.1 k-nearest neighbours algorithm
32.1.1.1. Introduction
1. http://people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html
a.KNeighborsClassifier(n_neighbors=2):
b. KNeighborsClassifier(n_neighbors=3):
c. Refer https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
f. Note: how to calculate distance in case of classification problem
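For the note above, the Euclidean distance k-NN uses between two feature vectors can be computed directly; a minimal sketch (the two vectors are assumed, chosen to form a 3-4-5 triangle):

```python
import numpy as np

# Euclidean distance between two (assumed) feature vectors
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
distance = np.sqrt(np.sum((a - b) ** 2))   # sqrt(3^2 + 4^2) = 5
```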
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=6)
f = knn.fit(iris['data'], iris['target'])
print(f)
print(iris['data'].shape)
print(iris['target'].shape)
x_new = np.array([[5.6, 2.8, 3.9, 1.1]])  # an unseen observation (values assumed)
prediction = knn.predict(x_new)
print(prediction)
print(x_new.shape)
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'],
test_size=0.3,random_state=21, stratify=iris['target'])
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(knn.score(X_test, y_test))
How do you know that n_neighbors=6 is right? Is there a way to find it?
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np
iris = datasets.load_iris()
knn = KNeighborsClassifier()
X = iris['data']
y = iris['target']
param_grid = {'n_neighbors': np.arange(1, 50)}  # grid of k values to search (assumed)
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X, y)
knn_cv.best_params_
knn_cv.best_score_
Note: the parameter grid can also include 'weights': ['uniform', 'distance'] and other
distance metrics, e.g. Hamming distance:
https://en.wikipedia.org/wiki/Hamming_distance
Note 3: Explore RandomizedSearchCV; it is similar to GridSearchCV.
32.1.2 Linear Models
1. Logistic regression, despite its name, is a linear model for classification rather than regression.
3. In this model, the probabilities describing the possible outcomes of a single trial are modelled
using a logistic function.
4. A logistic function or logistic curve is a common "S" shape (sigmoid curve), with equation
5. The maths behind it: just walk through it. If you do not understand it, do not panic.
a. https://en.wikipedia.org/wiki/Logistic_function
b. https://en.wikipedia.org/wiki/Logistic_regression
6. Refer http://scikit-learn.org/dev/modules/linear_model.html#logistic-regression
import pandas as pd
df = pd.read_csv(filepath_or_buffer ="pima-indians-diabetes.txt")
df.info()
X = df[["pregnancies","glucose","diastolic","triceps","insulin","bmi","dpf","age"]]
type(X)
X.info()
#target variable
y=df["diabetes"]
type(y)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # split assumed
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
1. Precision: when the model predicts yes, how often is it correct? Precision = TP/(TP + FP).
2. True Positive Rate/Sensitivity/Recall: When it's actually yes, how often does it predict yes?
b. Specificity (True Negative Rate): TN/actual no = 174/(174+32) = 0.84
3. F Score: a weighted average of the true positive rate (recall) and precision
a. F1-score = 2 * (precision * recall) / (precision + recall)
4. Support: the number of actual occurrences of each class in the test data
6. The ROC curve is created by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings
7. In machine learning
b.The false-positive rate is also known as the fall-out or probability of false alarm
8. Read https://en.wikipedia.org/wiki/Confusion_matrix
9. https://en.wikipedia.org/wiki/Receiver_operating_characteristic
from sklearn.metrics import roc_curve
# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)  # the roc_curve call was missing
plt.plot(fpr, tpr)
plt.title('ROC Curve')
plt.show()
32.1.2.4. AUC computation
from sklearn.metrics import roc_auc_score
y_pred_prob = logreg.predict_proba(X_test)[:,1]
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))
32.1.2.5. Hyperparameter tuning with GridSearchCV
import numpy as np
from sklearn.model_selection import GridSearchCV
c_space = np.logspace(-5, 8, 15)  # grid of C values assumed
param_grid = {'C': c_space}
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)
32.1.2.6. Linear regression
1. Maths:
i. 5x = 6 + 3y
ii. y/2 = 3 − x
a.Slope-Intercept Form
b.Point-Slope Form
c. General Form
d.As a function
ii. It is called "Identity" because what comes out is identical to what goes in:
f. Constant Functions
1. Example y = 3
ii. No matter what value of "x", f(x) is always equal to some constant value.
y2 − 2 = 0
3√x − y = 6
x3/2 = 16
y =3x-6
a.Answer : y = x/4 + 5/2
3. What is regression?
b.What price will the customer pay for our product? (Regression) (0 to Inf)
5. Linear Regression
y = β0 + β1x1 + β2x2 + ...
6. Refer http://www.learnbymarketing.com/tutorials/linear-regression-by-hand-in-excel/
a.Linear relationship
b.Multivariate normality
c. No or little multicollinearity
d.No auto-correlation
e.Homoscedasticity
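Before fitting with scikit-learn, the slope and intercept of a least-squares line can be recovered with NumPy alone; a minimal sketch (synthetic, noise-free data, so the fit is exact):

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2 * x + 1                            # a perfect line: slope 2, intercept 1
slope, intercept = np.polyfit(x, y, 1)   # degree-1 least-squares fit
```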
8. Exercise 1: Linear regression with one variable
import numpy as np
import pandas as pd
df = pd.read_csv('gapminder.csv')
y = df['life'].values
X = df['fertility'].values
# sklearn package needs two dimensional array, hence reshape X and y into 2D.
y = y.reshape(-1, 1)
X = X.reshape(-1, 1)
# Import LinearRegression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# Fit the model to the data
reg.fit(X, y)
# Print R^2
print(reg.score(X, y))
prediction_space = np.linspace(min(X), max(X)).reshape(-1, 1)  # grid for the fitted line (assumed)
y_pred = reg.predict(prediction_space)
plt.plot(prediction_space, y_pred)
plt.show()
ii. Near 0: a poor model; in this case it is better to use the average value to predict.
help(r2_score)
a. Formula is: RMSE = sqrt(mean((y_true − y_pred)²))
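The RMSE formula can be evaluated from its definition on a few toy values (the numbers below are illustrative only):

```python
import numpy as np

# RMSE computed directly from its definition
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # sqrt((0.25 + 0 + 4) / 3)
```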
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error,r2_score
df = pd.read_csv('gapminder.csv')
X = (df[df.columns[[0,1,2,3,4,5,6,8]]].values)
y = df['life'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # split assumed
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
print("R^2: {}".format(r2_score(y_test,y_pred)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
std = np.std(y_test)
2. Exercise 5: Cross validation
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
cv_scores = cross_val_score(reg, X, y, cv=5)  # the cross_val_score call was missing
print(cv_scores)
3. If we increase the number of folds, the accuracy estimate improves; however, computation time increases.
4. Exercise 4: Linear regression using boston data(One More Example for practice)
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
print(type(boston))
print(boston.keys())
print(boston.data)
X = boston.data
print(boston.target)
y = boston.target
import numpy as np
print(" ##### 07. Split data as train and test ##### ")
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # split assumed
print(" ##### 08. Fit model using train data ##### ")
from sklearn import linear_model
reg_all = linear_model.LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
print(" ###### 10. Compute and print R^2 and RMSE ##### ")
from sklearn.metrics import mean_squared_error, r2_score
print("R^2: {}".format(r2_score(y_test, y_pred)))
print("RMSE: {}".format(np.sqrt(mean_squared_error(y_test, y_pred))))
32.1.3 Support Vector Machines (SVM)
32.1.3.1. Introduction
1. “Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for
a. Classification
b. Regression
3. Pros:
c. It is effective in cases where number of dimensions is greater than the number of samples.
d. It uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.
4. Cons:
a. It doesn't perform well when we have a large data set (>100K), because the required
training time is higher.
b. It also doesn’t perform very well, when the data set has more noise i.e. target classes are
overlapping
c. SVM doesn't directly provide probability estimates; these are calculated using an expensive
five-fold cross-validation in the SVC method of the Python scikit-learn library.
6. https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
from sklearn import svm
help(svm)
32.1.3.2. Support Vectors Classification (SVC)
9. Exercise 1: Iris flowers classification using SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import svm
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
random_state=42)
# there are various options associated with it, like changing the kernel, gamma and C value.
model = svm.SVC()
model.fit(X_train, y_train)
y_pred= model.predict(X_test)
r2_score(y_test,y_pred)
32.1.3.3. Tune parameters of SVM(SVC)
1. Tuning parameters effectively improves the model performance. Let’s look at the list of
parameters available with SVM.
from sklearn.model_selection import GridSearchCV
def svc_param_selection(X, y, nfolds):
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10],
                  'gamma': [0.001, 0.01, 0.1, 1]}  # search grid assumed
    grid_search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=nfolds)
    grid_search.fit(X, y)
    return (grid_search.best_params_, grid_search.best_score_)
svc_param_selection(X_train, y_train, 5)
32.1.4 Pre-processing of machine learning data
32.1.4.1. Outliers
d. Explore https://en.wikipedia.org/wiki/Box_plot
import numpy as np
arr = [10, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535,
617, 577, 471, 615, 583, 441, 562, 563, 527, 453, 530,
433, 541, 585, 704, 443, 569, 430, 637, 331, 511, 552, 496,
484, 566, 554, 472, 335, 440, 579, 341, 545, 615, 548, 604,
439, 556, 442, 461, 624, 611, 444, 578, 405, 487, 490, 496,
398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565,
415, 486, 668, 414, 665, 763, 557, 304, 404, 454, 689, 610,
483, 441, 657, 590, 492, 476, 437, 483, 529, 363, 711, 543,1000]
elements = np.array(arr)
plt.boxplot(elements)
q1,q3 =np.percentile(elements,[25,75])
q1
q3
iqr = q3-q1
lower = q1-1.5*iqr
lower
upper = q3+1.5*iqr
upper
np_arr = np.array(arr)
np_arr[np.logical_or(np_arr<lower,np_arr>upper)]  # the outlier values themselves
def reject_outliers(data):
    # keep only the values inside the IQR fences (the definition was missing)
    return data[np.logical_and(data >= lower, data <= upper)]
without_outliers = reject_outliers(np_arr)
np.count_nonzero(without_outliers)
np.count_nonzero(np_arr)
plt.boxplot(without_outliers)
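The IQR fence rule used above can be verified on a tiny array where the outlier is obvious (toy values, for illustration):

```python
import numpy as np

data = np.array([10., 12., 11., 13., 12., 100.])   # 100 is an obvious outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # the 1.5*IQR fences
clean = data[(data >= lower) & (data <= upper)]    # drop anything outside them
```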
32.1.4.2. Working with categorical features
1. Exercise 2: Exploring categorical features
# Import pandas
import pandas as pd
df = pd.read_csv('gapminder.csv')
df.boxplot('life', 'Region', rot=60)  # plot assumed: life expectancy per region
plt.show()
a. scikit-learn does not accept non-numerical features, hence we need to convert
non-numeric features into numeric ones.
Df1 = pd.DataFrame({'col1': ['a', 'b', 'a']})  # a small frame assumed for illustration
Df1
pd.get_dummies(Df1, prefix='col1')
# Create dummy variables: df_region
df_region = pd.get_dummies(df)
print(df_region.columns)
df_region = pd.get_dummies(df, drop_first=True)  # drop the first dummy column (assumed step)
print(df_region.columns)
https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
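A self-contained one-hot encoding example (the column and category values are assumed): drop_first removes one dummy column, which avoids perfect multicollinearity between the dummies.

```python
import pandas as pd

df = pd.DataFrame({'region': ['east', 'west', 'east']})
dummies = pd.get_dummies(df, drop_first=True)   # keeps only region_west
```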
y = df_region['life'].values
type(y)
X = (df_region[df_region.columns[[0,1,2,3,4,5,6,8,9,10,11,12,13]]].values)
type(X)
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])  # model and alpha grid assumed; the definition was missing
ridge_cv.fit(X, y)
print(ridge_cv)
32.1.4.4. Handling missing data
1. Data can have missing values for a number of reasons such as observations that were not
recorded and data corruption.
2. Handling missing data is important as many machine learning algorithms do not support
data with missing values.
3. We should learn
import pandas as pd
import numpy as np
df = pd.read_csv("house-votes-84.csv")
df.info()
print(df.isnull().sum())
# Drop missing values and print shape of new DataFrame
df = df.dropna()
print(df.shape)
5. Refer https://machinelearningmastery.com/handle-missing-data-python/
32.1.1 ML Pipeline (Putting it all together)
import pandas as pd
import numpy as np
df = pd.read_csv("house-votes-84.csv")
df=df.replace('y', 1)
df
df=df.replace('n', 0)
df
y = df['Class Name'].values
type(y)
X = (df[df.columns[[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]]].values)
type(X)
X.shape
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
steps = [('imputation', SimpleImputer(strategy='most_frequent')),  # imputation step assumed
         ('SVM', SVC())]
# Create the pipeline: pipeline
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # split assumed
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Compute metrics
print(classification_report(y_test, y_pred))
32.1.2 Tree Based Models
b. Internal node: one parent node, question giving rise to two children nodes.
32.1.2.1. Decision Tree for Classification
import pandas as pd
from sklearn.metrics import accuracy_score
df = pd.read_csv("Wisconsin_Breast_Cancer_Dataset.csv")
df.info()
X = df[["radius_mean","concave points_mean"]]
y=df["diagnosis"]
y = y.replace('M',1)
y = y.replace('B',0)
from sklearn.model_selection import train_test_split
SEED = 1  # seed value assumed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)
X_train.shape # (455, 2)
y_train.shape # (455,)
X_test.shape # (114, 2)
y_test.shape # (114,)
dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
acc = accuracy_score(y_test, y_pred)  # the accuracy computation was missing
print("Test set accuracy: {:.2f}".format(acc))
Note: Not bad! Using only two features, your tree was able to achieve an accuracy of 89%
# Instantiate logreg
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(random_state=1)
logreg.fit(X_train, y_train)
# predict
y_pred1 = logreg.predict(X_test)
acc1 = accuracy_score(y_test, y_pred1)
acc1
32.1.2.3. Information Gain (IG)
1. At each node, split the data based on information gain value or gini value
32.1.2.3.1. Entropy and Information Gain
Must read:
https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1
Refer:
https://medium.com/udacity/shannon-entropy-information-gain-and-picking-balls-from-buckets-5810d35d54b4
http://dni-institute.in/blogs/gini-index-work-out-example/
32.1.2.3.2. Gini Index
1. Gini Index is a metric that measures how often a randomly chosen element would be
incorrectly identified; an attribute with a lower Gini index should be preferred.
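Both impurity measures can be worked out by hand for one node; a small sketch (the class counts are assumed: a node holding 8 samples of class 0 and 2 of class 1):

```python
import numpy as np

# Class proportions at the node
p = np.array([8, 2]) / 10
gini = 1 - np.sum(p ** 2)          # 1 - (0.8^2 + 0.2^2) = 0.32
entropy = -np.sum(p * np.log2(p))  # about 0.722 bits
```

A pure node (all one class) gives gini = 0 and entropy = 0, which is why splits that lower these values are preferred.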
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
df = pd.read_csv("Wisconsin_Breast_Cancer_Dataset.csv")
df.info()
X = df.iloc[:,2:32]
type(X)
X.info()
y=df["diagnosis"]
y = y.replace('M',1)
y = y.replace('B',0)
from sklearn.model_selection import train_test_split
SEED = 1  # seed value assumed
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.2,
random_state=SEED,
stratify=y)
X_train.shape # (455, 2)
y_train.shape # (455,)
X_test.shape # (114, 2)
y_test.shape # (114,)
dt_entropy = DecisionTreeClassifier(max_depth=8,
criterion='entropy',
random_state=SEED)
dt_entropy.fit(X_train, y_train)
y_pred = dt_entropy.predict(X_test)
# Evaluate accuracy_entropy
accuracy_entropy = accuracy_score(y_test, y_pred)
dt_gini= DecisionTreeClassifier(max_depth=8,
criterion='gini',
random_state=SEED)
dt_gini.fit(X_train, y_train)
y_pred_gini = dt_gini.predict(X_test)
# Evaluate accuracy_gini and print both accuracies
accuracy_gini = accuracy_score(y_test, y_pred_gini)
print('Accuracy achieved by using entropy: ', accuracy_entropy)
print('Accuracy achieved by using the gini index: ', accuracy_gini)
Note: Notice how the two models achieve exactly the same accuracy. Most of the time, the gini
index and entropy lead to the same results. The gini index is slightly faster to compute and is the
default criterion used in the DecisionTreeClassifier model of scikit-learn.
32.1.3 Decision Tree For Regression
1. Decision trees can also be applied to regression problems, using the DecisionTreeRegressor
import pandas as pd
import numpy as np
df = pd.read_csv("auto-mpg.csv")
df.info()
df = pd.get_dummies(df)
df.info()
X = df.iloc[:,1:]
type(X)
X.info()
y = df["mpg"]
# Split into train/test sets (the seed value is illustrative)
from sklearn.model_selection import train_test_split
SEED = 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
X_train.shape # (313, 8)
y_train.shape # (313,)
X_test.shape # (79, 8)
y_test.shape # (79,)
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth=8,
min_samples_leaf=0.13,
random_state=SEED)
dt.fit(X_train, y_train)
# Compute y_pred
y_pred = dt.predict(X_test)
# Compute mse_dt and rmse_dt
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import r2_score
mse_dt = MSE(y_test, y_pred)
rmse_dt = mse_dt**(1/2)
sd = np.std(y_test)
r2_score(y_test, y_pred)
32.1.4 Linear regression vs regression tree
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
from sklearn.metrics import mean_squared_error as MSE
mse_lr = MSE(y_test, y_pred_lr)
rmse_lr = mse_lr**(1/2)
print('Linear Regression test set RMSE: {:.2f}'.format(rmse_lr))
32.2 Unsupervised Learning
2. Unsupervised learning allows us to approach problems with little or no idea what our results
should look like. We can derive structure from data where we don't necessarily know the
effect of the variables.
3. We can derive this structure by clustering the data based on relationships among the
variables in the data.
4. In simple words: unsupervised learning means there is no outcome to be predicted, and the algorithm just tries to find patterns in the data.
5. Clustering Examples:
b. Example 2: Finding homogeneous subgroups within a larger group, based on features such as income, educational attainment and gender.
e. Non-clustering: the "Cocktail Party Algorithm" allows you to find structure in a chaotic/disordered environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).
2. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster. Then, the algorithm iterates through two steps:
a. Reassign each data point to the cluster whose centroid is closest.
b. Recompute each cluster centroid as the mean of its assigned points.
3. These two steps are repeated until the within-cluster variation cannot be reduced any further.
4. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
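The within-cluster variation (what scikit-learn exposes as `inertia_`) can be computed directly. A small sketch with illustrative points and cluster assignments (not from the notes):

```python
import numpy as np

# Toy 2-D points already assigned to two clusters (illustrative data)
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster = mean of its assigned points
centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])

# Within-cluster variation: sum of squared Euclidean distances
# from each point to its assigned centroid
inertia = sum(((p - centroids[l]) ** 2).sum() for p, l in zip(points, labels))
print(centroids)
print(inertia)
```

This is the quantity that k-means drives down on each iteration, and the quantity plotted against k in the elbow-method exercise below.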
# Step 1 of 5: import required modules
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
iris = datasets.load_iris()
samples = iris.data
model = KMeans(n_clusters=3)
model.fit(samples)
labels = model.predict(samples)
print(labels)
new_samples = [[ 5.7, 4.4, 1.5, 0.4],[ 6.5, 3., 5.5, 1.8],[ 5.8, 2.7, 5.1, 1.9]]
new_labels = model.predict(new_samples)
print(new_labels)
xs = samples[:,0]
ys = samples[:,2]
plt.scatter(xs, ys, c=labels)
plt.show()
def species_label(theta):
if theta==0:
return iris.target_names[0]
if theta==1:
return iris.target_names[1]
if theta==2:
return iris.target_names[2]
#create dataframe
import pandas as pd
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = [species_label(t) for t in iris.target]
#create crosstab
pd.crosstab(df['species'], labels)
6. Clustering quality:
ks = range(1,11)
inertias = []
for k in ks:
model = KMeans(n_clusters=k)
model.fit(samples)
inertias.append(model.inertia_)
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
10. Exercise 1: Do the same job using the red wine data set; you will learn
a. StandardScaler
b. make_pipeline
https://datahexa.com/kmeans-clustering-with-wine-dataset/
### Level 05 of 08: Deep Learning ###
33.1 Introduction
1. Deep learning is a machine learning technique that teaches computers to do what comes
naturally to humans: learn by example
2. Most deep learning methods use neural network architectures, which is why deep learning
models are often referred to as deep neural networks.
3. The term “deep” usually refers to the number of hidden layers in the neural network.
Traditional neural networks only contain 2-3 hidden layers, while deep networks can have as
many as required.
4. Deep learning models are trained by using large sets of labelled data and neural network
architectures that learn features directly from the data without the need for manual feature
extraction.
1. Deep learning applications are used in industries from automated driving to medical devices.
3. Aerospace and Defence: Deep learning is used to identify objects from satellites that
locate areas of interest, and identify safe or unsafe zones for troops.
4. Medical Research: Cancer researchers are using deep learning to automatically detect
cancer cells. Teams at UCLA built an advanced microscope that yields a high-dimensional
data set used to train a deep learning application to accurately identify cancer cells.
5. Industrial Automation: Deep learning is helping to improve worker safety around heavy
machinery by automatically detecting when people or objects are within an unsafe distance
of machines.
6. Electronics: Deep learning is being used in automated hearing and speech translation. For
example, home assistance devices that respond to your voice and know your preferences
are powered by deep learning applications.
a.Number of children
import numpy as np
input_data = np.array([2, 3])
weights = {'node_0': np.array([1, 1]),
           'node_1': np.array([-1, 1]),
           'output': np.array([2, -1])}
node_0_value = (input_data * weights['node_0']).sum()
node_1_value = (input_data * weights['node_1']).sum()
hidden_layer_values = [node_0_value, node_1_value]
print(hidden_layer_values) #[5, 1]
output = (np.array(hidden_layer_values) * weights['output']).sum()
print(output) # 9
a. Linear
b. Non-linear
import numpy as np
weights = {'node_0': np.array([1, 1]), 'node_1': np.array([-1, 1]),'output': np.array([2, -1])}
input_data = np.array([2, 3])
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = np.tanh(node_0_input)
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = np.tanh(node_1_input)
hidden_layer_outputs = np.array([node_0_output, node_1_output])
output = (hidden_layer_outputs * weights['output']).sum()
print(output)
33.4 Deeper networks
import numpy as np
input_data = np.array([3, 5])
# example inputs/weights; only 'node_2' was given in the notes, the rest are illustrative
weights = {'node_0': np.array([2, 4]), 'node_1': np.array([4, -5]),
           'node_2': np.array([-1, 1]), 'node_3': np.array([2, 2]),
           'output': np.array([2, 7])}
# hidden layer 1
node_0_output_relu = np.maximum((input_data * weights['node_0']).sum(), 0)
node_1_output_relu = np.maximum((input_data * weights['node_1']).sum(), 0)
hidden_layer1_output_relu = np.array([node_0_output_relu, node_1_output_relu])
# hidden layer 2
node_2_output_relu = np.maximum((hidden_layer1_output_relu * weights['node_2']).sum(), 0)
node_3_output_relu = np.maximum((hidden_layer1_output_relu * weights['node_3']).sum(), 0)
hidden_layer2_output_relu = np.array([node_2_output_relu, node_3_output_relu])
# output layer
output = (hidden_layer2_output_relu * weights['output']).sum()
output_relu = np.maximum(output, 0)
print(output_relu)
https://stackoverflow.com/questions/32109319/how-to-implement-the-relu-function-in-numpy
33.5 Need for optimization
2. Hence, in order to get good predictions, choosing the right weights plays the main role.
def relu(my_input):
    return(max(0, my_input))

def predict_with_network(input_data, weights):
    node_0_input = (input_data * weights['node_0']).sum()
    node_0_output = relu(node_0_input)
    node_1_input = (input_data * weights['node_1']).sum()
    node_1_output = relu(node_1_input)
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    input_to_final_layer = (hidden_layer_outputs * weights['output']).sum()
    model_output = relu(input_to_final_layer)
    return(model_output)
#Step 3: Use above functions to predict
# Sample weights (illustrative)
weights_0 = {'node_0': np.array([2, 1]), 'node_1': np.array([1, 2]), 'output': np.array([1, 1])}
input_data = np.array([0, 3])
target_actual = 3
error_0 = predict_with_network(input_data, weights_0) - target_actual
# Create weights that cause the network to make perfect prediction (3): weights_1
weights_1 = {'node_0': np.array([2, 1]), 'node_1': np.array([1, 2]), 'output': np.array([1, 0])}
error_1 = predict_with_network(input_data, weights_1) - target_actual
print(error_0);print(error_1)
33.6 Gradient descent
1. How many
# Step 1 of 2: Calculate slope/gradient
import numpy as np
# Define weights (illustrative values)
weights = np.array([1, 2])
input_data = np.array([3, 4])
target = 6
learning_rate = 0.01
preds = (weights * input_data).sum()
error = preds - target
print(error)
gradient = 2 * input_data * error
# Step 2 of 2: update the weights and recompute the error
weights_updated = weights - learning_rate * gradient
preds_updated = (weights_updated * input_data).sum()
error_updated = preds_updated - target
print(error_updated)
import numpy as np
weights = np.array([0,2,1])
input_data = np.array([1,2,3])
target = 0
preds = (weights * input_data).sum()
error = preds - target
slope = 2 * input_data * error
print(slope)
#################################################
learning_rate = 0.01
weights_updated = weights - learning_rate * slope
preds_updated = (weights_updated * input_data).sum()
error_updated = preds_updated - target
print(error)
print(error_updated)
#################################################
def get_error(input_data, target, weights):
    error = (weights * input_data).sum() - target
    return(error)
def get_slope(input_data, target, weights):
    slope = 2 * input_data * get_error(input_data, target, weights)
    return(slope)
def get_mse(input_data, target, weights):
    errors = get_error(input_data, target, weights)
    mse = np.mean(errors**2)
    return(mse)
n_updates = 20
mse_hist = []
for i in range(n_updates):
    slope = get_slope(input_data, target, weights)
    weights = weights - 0.01 * slope
    mse_hist.append(get_mse(input_data, target, weights))
import matplotlib.pyplot as plt
plt.plot(mse_hist)
plt.xlabel('Iterations')
plt.ylabel('Mean Squared Error')
plt.show() # Notice that the mean squared error decreases as the number of iterations goes up.
33.7 Backpropagation
1. Update the weights using the error, and iterate until the predictions reach the actual target data.
2. Try to understand the process, however you will generally use a library that implements forward
and backward propagations.
4. If a variable z depends on the variable y, and y depends on the variable x, then z also depends on x, and we can write the chain rule as: dz/dx = (dz/dy) × (dy/dx)
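The chain rule can be checked numerically against a finite-difference estimate; a small sketch (the function choices are illustrative):

```python
# z = y**2 and y = 3*x, so dz/dx = (dz/dy) * (dy/dx) = 2*y * 3 = 18*x
def y(x):
    return 3 * x

def z(y_val):
    return y_val ** 2

x0 = 2.0
analytic = 2 * y(x0) * 3  # chain rule: dz/dy * dy/dx, evaluated at x0
h = 1e-6
# Central difference approximation of dz/dx
numeric = (z(y(x0 + h)) - z(y(x0 - h))) / (2 * h)
print(analytic, round(numeric, 3))
```

Backpropagation applies exactly this rule layer by layer, multiplying local derivatives as the error signal flows backwards through the network.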
a. If you have gone through 4 iterations of calculating slopes (using backward propagation)
and then updated weights, how many times must you have done forward propagation?
i. 0
ii. 1
iii. 4
iv. 8
b. If your predictions were all exactly right, and your errors were all exactly 0, the slope of
the loss function with respect to your predictions would also be 0. In that circumstance,
which of the following statements would be correct?
ii. The updates to all weights in the network would be dependent on the activation
functions.
iii. The updates to all weights in the network would be proportional to values from
the input data.
33.8 Creating keras Regression Model
1. Specify Architecture
v. Define output
2. Compile
i. Define optimizer
3. Fit
i. Applying backpropagation
4. Predict
import pandas as pd
df = pd.read_csv("hourly_wages.csv")
df
predictors = (df[df.columns[[1,2,3,4,5,6,7,8,9]]].values)
target = (df[df.columns[0]].values)
# Import necessary modules
import keras
from keras.models import Sequential
from keras.layers import Dense
n_cols = predictors.shape[1]
model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(n_cols,)))
model.add(Dense(1))
a. To compile the model, you need to specify the optimizer and loss function
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(predictors, target)
5. Step 4 of 4: predict
model.predict(predictors)
33.9 Creating keras Classification Models
4. Output layer has separate node for each possible outcome, and uses ‘softmax’
activation
Process:
2. You will use predictors such as age, fare and where each passenger embarked from to
predict who will survive.
# understand data
import pandas as pd
from keras.utils import to_categorical
df = pd.read_csv("titanic_all_numeric_train.csv")
predictors = df.drop(['survived'], axis=1).values
target = to_categorical(df.survived)
test_data = pd.read_csv("titanic_all_numeric_test.csv").values
import keras
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
n_cols = predictors.shape[1]
# Add the first layer
model.add(Dense(32, activation='relu', input_shape=(n_cols,)))
# Add the output layer
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
model.fit(predictors, target)
1. Save
2. Reload
3. Make predictions
model.save('model_file.h5')
from keras.models import load_model
my_model = load_model('model_file.h5')
predictions = model.predict(test_data)
predicted_prob_true = predictions[:,1]
# print predicted_prob_true
print(predicted_prob_true)
33.11 Understanding Model Optimization
c. Updates too small (if learning rate is low) or too large (if learning rate is high)
2. Scenario: Try to optimize a model at a very low learning rate, a very high learning rate,
and a "just right" learning rate. We need to look at the results after running this exercise,
remembering that a low value for the loss function is good
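The same effect can be seen on a toy 1-D problem before involving Keras, minimizing f(w) = w² with plain gradient descent (the rates and starting point are illustrative):

```python
def minimize(learning_rate, n_steps=20, w=2.0):
    # Gradient of f(w) = w**2 is 2*w; step against the gradient each iteration
    for _ in range(n_steps):
        w = w - learning_rate * 2 * w
    return w

# too low: barely moves; "just right": converges; too high: diverges
for lr in (0.0001, 0.1, 1.1):
    print(lr, minimize(lr))
```

With the tiny rate the weight stays close to its start, with the moderate rate it approaches the minimum at 0, and with the large rate the updates overshoot and the weight grows without bound.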
input_shape = (10,)
# Build a fresh, uncompiled model for each learning rate
def get_new_model(input_shape=input_shape):
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape=input_shape))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    return(model)

from keras.optimizers import SGD
lr_to_test = [0.000001, 0.01, 1]
for lr in lr_to_test:
    print('\n\nTesting model with learning rate: %f\n' % lr)
    model = get_new_model()
    my_optimizer = SGD(lr=lr)
    model.compile(optimizer=my_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(predictors, target)
3. Which of the following could prevent a model from showing an improved loss in its first
few epochs?
33.12 Model Validation
n_cols = predictors.shape[1]
input_shape = (n_cols,)
model = Sequential()
model.add(Dense(100, activation='relu', input_shape=input_shape))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# Fit the model with a validation split to monitor out-of-sample loss
hist = model.fit(predictors, target, validation_split=0.3)
# Import EarlyStopping
from keras.callbacks import EarlyStopping
n_cols = predictors.shape[1]
input_shape = (n_cols,)
# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape=input_shape))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)
# Fit the model, stopping early if validation loss stops improving
model.fit(predictors, target, epochs=30, validation_split=0.3,
          callbacks=[early_stopping_monitor])
model_1 = Sequential()
model_1.add(Dense(10, activation='relu', input_shape=input_shape))
model_1.add(Dense(2, activation='softmax'))
# Compile model_1
model_1.compile(optimizer='adam', loss='categorical_crossentropy',
metrics=['accuracy'])
# Create the new model: model_2
model_2 = Sequential()
model_2.add(Dense(100, activation='relu', input_shape=input_shape))
model_2.add(Dense(2, activation='softmax'))
# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy',
metrics=['accuracy'])
# Fit model_1
model_1_training = model_1.fit(predictors, target, epochs=15,
                               validation_split=0.2, verbose=False)
# Fit model_2
model_2_training = model_2.fit(predictors, target, epochs=15,
                               validation_split=0.2, verbose=False)
# Plot the validation loss of both models
plt.plot(model_1_training.history['val_loss'], 'r',
         model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation score')
plt.show()
5. Note: model_2 (the blue line in the graph) has less loss, so it is the better model.
a. In above exercise 3, you’ve seen how to experiment with wider networks. In this
exercise, you'll try a deeper network (more hidden layers).
input_shape = (n_cols,)
model_1 = Sequential()
model_1.add(Dense(50, activation='relu', input_shape=input_shape))
model_1.add(Dense(2, activation='softmax'))
# Compile model_1
model_1.compile(optimizer='adam', loss='categorical_crossentropy',
metrics=['accuracy'])
model_2 = Sequential()
model_2.add(Dense(50, activation='relu', input_shape=input_shape))
model_2.add(Dense(50, activation='relu'))
model_2.add(Dense(2, activation='softmax'))
# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy',
metrics=['accuracy'])
# Fit model 1
model_1_training = model_1.fit(predictors, target, epochs=15,
                               validation_split=0.2, verbose=False)
# Fit model 2
model_2_training = model_2.fit(predictors, target, epochs=15,
                               validation_split=0.2, verbose=False)
# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r',
         model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation score')
plt.show()
33.13 Model Capacity
Hidden Layers   Nodes Per Layer   Mean Squared Error   Next Step
1               100               5.4                  Increase Capacity
1               250               4.8                  Increase Capacity
2               250               4.4                  Increase Capacity
3               250               4.5                  Decrease Capacity
3               200               4.3                  Done
2. If we are not checking as shown above, there is a chance of overfitting the model.
### Level 06 of 08: Project on Deep Learning ###
import numpy as np
import matplotlib
matplotlib.use('agg')
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='3'
os.environ['CUDA_VISIBLE_DEVICES'] = ''
# keras imports for the dataset and building our neural network
from keras.datasets import mnist
from keras.models import Sequential, load_model
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils
import matplotlib.pyplot as plt
(X_train, y_train), (X_test, y_test) = mnist.load_data()
fig = plt.figure()
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.tight_layout()
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Class {}".format(y_train[i]))
    plt.xticks([])
    plt.yticks([])
fig
# In order to train our neural network to classify images we first have to unroll the
# height x width pixel format into one big vector - the input vector. So its length
# must be 28 * 28 = 784. But let's graph the distribution of our pixel values.
fig = plt.figure()
plt.subplot(2,1,1)
plt.imshow(X_train[0], cmap='gray', interpolation='none')
plt.title("Class {}".format(y_train[0]))
plt.xticks([])
plt.yticks([])
plt.subplot(2,1,2)
plt.hist(X_train[0].reshape(784))
plt.title("Pixel Value Distribution")
fig
# Note that the pixel values range from 0 to 255: the background majority close to 0,
# and those close to 255 representing the digit.
# Normalizing the input data helps to speed up the training. Also, it reduces the chance
# of getting stuck in local optima, since we're using stochastic gradient descent to find
# the optimal weights for the network.
# Let's reshape our inputs to a single vector and normalize the pixel values to lie
# between 0 and 1.
print("X_train shape", X_train.shape)
# Unroll each 28x28 image into a vector of length 784
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print(np.unique(y_train, return_counts=True))
# Let's encode our categories - digits from 0 to 9 - using one-hot encoding. The result is
# a vector with a length equal to the number of categories. The vector is all zeroes except
# in the position for the respective category. Thus a '5' will be represented by [0,0,0,0,0,1,0,0,0,0].
n_classes = 10
from keras.utils import np_utils
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
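One-hot encoding can be replicated in plain NumPy to see exactly what it produces (the digit labels below are illustrative):

```python
import numpy as np

def one_hot(y, n_classes):
    # Row i is all zeros except a 1 at column y[i]
    out = np.zeros((len(y), n_classes))
    out[np.arange(len(y)), y] = 1
    return out

print(one_hot(np.array([5, 0, 2]), 10))
# the first row represents '5': [0,0,0,0,0,1,0,0,0,0]
```

Each row has exactly one non-zero entry, which is what lets the softmax output layer be compared against it with categorical cross-entropy.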
1. Our pixel vector serves as the input. Then, two hidden 512-node layers, with enough
model complexity for recognizing digits. For the multi-class classification we add another
densely-connected (or fully-connected) layer for the 10 different output classes. For this
network architecture we can use the Keras Sequential Model. We can stack layers using
the .add() method.
2. When adding the first layer in the Sequential Model we need to specify the input shape
so Keras can create the appropriate matrices. For all remaining layers the shape is
inferred automatically.
3. In order to introduce nonlinearities into the network and elevate it beyond the capabilities
of a simple perceptron we also add activation functions to the hidden layers. The
differentiation for the training via backpropagation is happening behind the scenes
without having to implement the details.
4. We also add dropout as a way to prevent overfitting. Here we randomly keep some
network weights fixed when we would normally update them so that the network doesn't
rely too much on very few nodes.
5. The last layer consists of connections for our 10 classes and the softmax activation
which is standard for multi-class targets.
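The dropout idea from point 4 can be sketched in plain NumPy: during training, each activation is kept with probability 1-rate and the kept ones are scaled up ("inverted dropout"); the array values and rate below are illustrative:

```python
import numpy as np

rng = np.random.RandomState(0)
activations = np.ones((4, 5))  # pretend hidden-layer outputs
rate = 0.2                     # fraction of units to drop

# Inverted dropout: zero out ~20% of units, scale the rest by 1/(1-rate)
# so the expected activation magnitude is unchanged
mask = rng.rand(*activations.shape) > rate
dropped = activations * mask / (1 - rate)
print(dropped)
```

At prediction time no units are dropped; because of the scaling during training, the layer can then be used as-is.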
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
history = model.fit(X_train, Y_train,
                    batch_size=128, epochs=8,
                    verbose=2,
                    validation_data=(X_test, Y_test))
model_name = 'keras_mnist.h5'
model.save(model_name)
fig = plt.figure()
plt.subplot(2,1,1)
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.subplot(2,1,2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.tight_layout()
fig
Note: This learning curve looks quite good! We see that the loss on the training set is
decreasing rapidly for the first two epochs. This shows the network is learning to classify
the digits pretty fast. For the test set the loss does not decrease as fast but stays roughly
within the same range as the training loss. This means our model generalizes well to
unseen data.
Step 5 of 5: Evaluate the Model Performance
from keras.models import load_model
mnist_model = load_model('keras_mnist.h5')
predicted_classes = mnist_model.predict_classes(X_test)
# See which predictions match the labels, and which do not
correct_indices = np.nonzero(predicted_classes == y_test)[0]
incorrect_indices = np.nonzero(predicted_classes != y_test)[0]
print(len(correct_indices), "classified correctly")
print(len(incorrect_indices), "classified incorrectly")
# Plot 9 correct and 9 incorrect predictions
for i, correct in enumerate(correct_indices[:9]):
    plt.subplot(6,3,i+1)
    plt.imshow(X_test[correct].reshape(28,28), cmap='gray')
    plt.xticks([])
    plt.yticks([])
for i, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(6,3,i+10)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray')
    plt.xticks([])
    plt.yticks([])
# Import matplotlib
import matplotlib.pyplot as plt
# Load and display an image (any RGB image file will do; the filename is illustrative)
data = plt.imread('bricks.png')
plt.imshow(data)
plt.show()
data[:40, :40, 0] = 1
data[:40, :40, 1] = 0
data[:40, :40, 2] = 0
plt.imshow(data)
plt.show()
# One-hot encode string labels by hand (the labels/categories are illustrative)
import numpy as np
labels = ['shoe', 'shirt', 'shoe', 'dress']
categories = np.array(['shirt', 'dress', 'shoe'])
n_categories = 3
one_hot_encoding_labels = np.zeros((len(labels), n_categories))
for ii in range(len(labels)):
    # Find the location of this label in the categories variable
    jj = np.where(categories == labels[ii])
    one_hot_encoding_labels[ii, jj] = 1
one_hot_encoding_labels
### Level 07 of 08: NLU / NLP / Text Analytics/ Text Mining ###
36.1 Introduction
1. NLP is a way for computers to analyse, understand, and derive meaning from human
language in a smart and useful way.
2. By utilizing NLP, developers can organize and structure knowledge to perform tasks
such as automatic summarization, translation, named entity recognition, relationship
extraction, sentiment analysis, speech recognition, and topic segmentation.
3. Summarize blocks of text using Summarizer to extract the most important and central
ideas while ignoring irrelevant information.
4. Create a chat bot using Parsey McParseface, a language parsing deep learning model made by Google that uses Part-of-Speech tagging.
5. Automatically generate keyword tags from content using AutoTag, which leverages
LDA, a technique that discovers topics contained within a body of text.
6. Identify the type of entity extracted, such as it being a person, place, or organization
using Named Entity Recognition.
7. Use Sentiment Analysis to identify the sentiment of a string of text, from very negative
to neutral to very positive.
8. Reduce words to their root, or stem, using PorterStemmer, or break up text into
tokens using Tokenizer
36.2 Regular Expressions
5. In the below exercise, the 'r' in front tells Python the expression is a raw string. In a raw string, escape sequences are not parsed: '\n' is a single newline character, but r'\n' is two characters, a backslash and an 'n'.
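The difference is easy to verify with a quick check (plain Python, no regex needed):

```python
# '\n' is parsed as one newline character; r'\n' keeps both raw characters
print(len('\n'))    # 1
print(len(r'\n'))   # 2
print(list(r'\n'))  # ['\\', 'n']
```

This is why regex patterns are conventionally written as raw strings: the backslashes reach the `re` module untouched.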
import re
my_string = "Let's write RegEx! Won't that be fun? I sure think so. Can you find 4 sentences? Or perhaps, all 19 words?"
sentence_endings = r"[.?!]"
print(re.split(sentence_endings, my_string))
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))
spaces = r"\s+"
print(re.split(spaces, my_string))
digits = r"\d+"
print(re.findall(digits, my_string))
import re
re.match("b","abcdef") # No Match
re.search("d","abcdef") # Match
36.3 Tokenization
3. Many different theories and rules are available for creating tokens; you can also create your own rules using regular expressions.
4. Some examples:
b. Separating punctuation
5. Tokenization is useful; nltk also ships special-purpose tokenizers, e.g.:
d. TweetTokenizer: special class just for tweet tokenization, allowing you to separate
hashtags, mentions and lots of exclamation points!!!
# Import necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize(scene_one)
tokenized_sent = word_tokenize(sentences[1])
unique_tokens = set(word_tokenize(scene_one))
print(unique_tokens)
2. OR is represented using |
# only digits
tokenize_digits_and_words = r'\d+'
# only alphabets
tokenize_digits_and_words = r'[a-z]+'
# alphabets or digits
tokenize_digits_and_words = r'[a-z]+|\d+'
7. Example 2:
8. Example 3:
pattern1 = r"(\w+|#\d|\?|!)"
regexp_tokenize(my_string,pattern=pattern1)
tweets = ['This is the best #nlp exercise ive found online! #python',
pattern1 = r"#\w+"
regexp_tokenize(tweets[0], pattern1)
pattern2 = r"([#|@]\w+)"
regexp_tokenize(tweets[-1], pattern2)
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)
# Create a string
german_text="Wann gehen wir zur Pizza? 🍕 Und fahren Sie mit vorbei? 🚕 "
all_words = word_tokenize(german_text)
print(all_words)
# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))
emoji = "['\U0001F300-\U0001F5FF'|'\U0001F600-\U0001F64F'|'\U0001F680-\U0001F6FF'|'\u2600-\u26FF\u2700-\u27BF']"
print(regexp_tokenize(german_text, emoji))
plt.hist([1, 5, 5, 7, 7, 7, 9])
plt.show()
# Histogram of word lengths (illustrative: lengths of the words in the tweets above)
word_lengths = [len(w) for t in tweets for w in word_tokenize(t)]
plt.hist(word_lengths)
plt.show()
36.6 Word counts with bag of words
2. You need to first create tokens using tokenization, and then count up all the tokens.
from collections import Counter
from nltk.tokenize import word_tokenize
counter = Counter(word_tokenize(
    """The cat is in the box. The cat likes the box. The box is over the cat."""))
counter
counter.most_common(5)
# Import Counter
from collections import Counter
tokens = word_tokenize(article)
lower_tokens = [t.lower() for t in tokens]
bow_simple = Counter(lower_tokens)
print(bow_simple.most_common(15))
b. Lowercasing words
text = """The cat is in the box. The cat likes the box. The box is over the cat."""
from nltk.corpus import stopwords
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]
no_stops = [t for t in tokens if t not in stopwords.words('english')]
Counter(no_stops).most_common(2)
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
alpha_only = [t.lower() for t in tokens if t.isalpha()]
no_stops = [t for t in alpha_only if t not in stopwords.words('english')]
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
bow = Counter(lemmatized)
print(bow.most_common(10))
36.8 Gensim
3. Word vector
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_documents = ['The movie was about a spaceship and aliens.',
                'I really liked the movie!',
                'Awesome action scenes, but boring characters.',
                'The movie was awful! I hate alien films.',
                'Space is cool! I liked the movie.',
                'More space films, please!',]
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
dictionary = Dictionary(tokenized_docs)
dictionary.token2id
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus
9. This more advanced and feature-rich bag-of-words can be used in future exercises.
36.9 Tf-idf with gensim
8. Tf-idf formula: w(i,j) = tf(i,j) × log(N / df(i)), where tf(i,j) is the number of occurrences of term i in document j, df(i) is the number of documents containing term i, and N is the total number of documents.
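The weighting can be checked by hand before using gensim. A minimal sketch using the common tf × log(N/df) form (gensim's TfidfModel uses log base 2 by default and additionally normalizes the resulting vectors; the documents below are illustrative):

```python
import math

docs = [['movie', 'space', 'movie'],
        ['space', 'aliens'],
        ['movie', 'aliens', 'aliens']]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term)               # term frequency in this document
    df = sum(term in d for d in docs)  # number of docs containing the term
    return tf * math.log2(N / df)

print(tfidf('movie', docs[0]))   # frequent locally, but common across docs
print(tfidf('aliens', docs[2]))  # frequent locally, rarer globally
```

A term appearing in every document gets log(N/N) = 0, which is how tf-idf down-weights ubiquitous words.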
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
tfidf[corpus[1]]
36.10 Named Entity Recognition
c. ... etc
import nltk
sentence = '''In New York, I like to ride the Metro to visit MOMA and some restaurants rated well by Ruth Reichl.'''
tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent[:3]
print(nltk.ne_chunk(tagged_sent))
type(article)
article
sentences = nltk.sent_tokenize(article)
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences]
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label') and chunk.label() == 'NE':
            print(chunk)
36.11 Introduction to SpaCy
3. Open-source, with extra libraries and tools, e.g. the Displacy entity recognition visualizer.
8. Verify
a.https://demos.explosion.ai/displacy/
b.https://demos.explosion.ai/displacy-ent/
import spacy
# First-time users may need to execute in the anaconda cmd prompt:
# python -m spacy download en
nlp = spacy.load('en')
doc = nlp("""Berlin is the capital of Germany; and the residence of Chancellor Angela Merkel.""")
# nlp.entity
doc.ents
print(doc.ents[0], doc.ents[0].label_)
print(doc.ents[1], doc.ents[1].label_)
print(doc.ents[2], doc.ents[2].label_)
# Import spacy
import spacy
nlp = spacy.load('en')
doc = nlp(article)
# Print all found entities and their labels
for ent in doc.ents:
    print(ent.label_, ent.text)
11. Which are the extra categories that spacy uses compared to nltk in its named-entity
recognition?
36.12 Multilingual NER with polyglot
WIP
36.13 Building a "fake news" classifier
1. Which of the following are possible features for a text classification problem ?
c. Language.
b. Refer http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
import pandas as pd
df = pd.read_csv("fake_or_real_news.csv")
df.info()
print(df.head())
y = df.label
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer
# Transform the training data using only the 'text' column values: count_train
count_train = count_vectorizer.fit_transform(X_train)
# Transform the test data using only the 'text' column values: count_test
count_test = count_vectorizer.transform(X_test)
print(count_vectorizer.get_feature_names()[:10])
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
print(tfidf_vectorizer.get_feature_names()[:10])
print(tfidf_train.A[:5])
a.To get a better idea of how the vectors work, you'll investigate them by converting
them into pandas DataFrames
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())
print(count_df.head())
# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
print(tfidf_df.head())
# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)
Note: Which of the below is the most reasonable model to use when training a new
supervised model using text vector data?
a.Random Forests
b. Naive Bayes
c. Linear Regression
d.Deep Learning
5. Step 4 of 7: Training and testing the "fake news" model with CountVectorizer
a.Train and test a Naive Bayes model using the CountVectorizer data.
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
from sklearn import metrics
score = metrics.accuracy_score(y_test, pred)
print(score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)
6. Step 5 of 7: Training and testing the "fake news" model with TfidfVectorizer
a.In above step we evaluated the model using the CountVectorizer, you'll do the same
using the TfidfVectorizer with a Naive Bayes model
nb_classifier = MultinomialNB()
nb_classifier.fit(tfidf_train, y_train)
pred = nb_classifier.predict(tfidf_test)
from sklearn import metrics
score = metrics.accuracy_score(y_test, pred)
print(score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)
7. Improving the model, what are the possible next steps you could take to improve the
model?
b.Trying a new classification model.
a.Test a few different alpha levels using the Tfidf vectors to determine if there is a
better performing combination.
import numpy as np
# Define train_and_predict()
def train_and_predict(alpha):
    nb_classifier = MultinomialNB(alpha=alpha)
    nb_classifier.fit(tfidf_train, y_train)
    pred = nb_classifier.predict(tfidf_test)
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over a range of alpha values
alphas = np.arange(0, 1, 0.1)
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
9. Step 7 of 7: Inspecting your model
a. You can map the important vector weights back to actual words using some simple
inspection techniques.
class_labels = nb_classifier.classes_
feature_names = tfidf_vectorizer.get_feature_names()
# Zip the feature names together with the coefficient array and sort by weights
feat_with_weights = sorted(zip(nb_classifier.coef_[0], feature_names))
# Print the first class label and the top 20 feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])
# Print the second class label and the bottom 20 feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])
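The zip-and-sort trick above can be seen on toy data (the feature names and weights here are invented for illustration): the most negative weights land at the front of the sorted list, the most positive at the back.

```python
feature_names = ["ball", "election", "goal", "vote"]
weights = [0.9, -1.2, 1.1, -0.8]

# Pair each weight with its feature name, then sort by weight (ascending)
feat_with_weights = sorted(zip(weights, feature_names))
print(feat_with_weights[:2])   # strongest indicators of the first class
print(feat_with_weights[-2:])  # strongest indicators of the second class
```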
36.14 Dialog Flow
1. https://dialogflow.com/
36.15 RASA NLU
### Level 08 of 08: Projects on NLU/NLP ###
37. Introduction
38. EchoBot
def respond(message):
    bot_message = "I can hear you! You said: " + message
    return bot_message

# or, a bot that simply relays through respond()
def respond1(message):
    return respond(message)

respond1("hello")
user_template = "USER : {0}"
bot_template = "BOT : {0}"

import time

def send_message(message):
    # Print user_template including the user_message
    print(user_template.format(message))
    response = respond(message)
    # Wait 2 secs and print the bot template including the bot's response
    time.sleep(2)
    print(bot_template.format(response))

send_message("hello")
1. Exercise 3: Create a bot which can answer simple questions such as "What's your name?" and
"What's today's weather?"
a. You'll use a dictionary with these questions as keys, and the correct responses as values.
b. This means the bot will only respond correctly if the message matches exactly, which is a
big limitation. In later exercises you will create much more robust solutions.
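One way to soften the exact-match limitation is to normalize messages before the dictionary lookup. A minimal sketch (the normalize() helper is illustrative, not part of the exercise):

```python
import re

def normalize(message):
    # Lowercase and strip punctuation (keeping apostrophes) so that
    # "What's your NAME?" and "what's your name?" hit the same dictionary key
    return re.sub(r"[^\w\s']", "", message.lower()).strip()

print(normalize("What's your NAME?"))  # what's your name
```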
# Define variables
name = "Greg"
weather = "cloudy"
responses = {
    "what's your name?": "my name is {0}".format(name),
    "what's today's weather?": "the weather is {0}".format(weather),
    "default": "default message"
}

# Return the matching response if there is one, default otherwise
def respond(message):
    if message in responses:
        bot_message = responses[message]
    else:
        bot_message = responses["default"]
    return bot_message
# Choosing randomly between several responses makes the bot feel less robotic
import random

name = "Greg"
weather = "cloudy"
responses = {
    "what's your name?": [
        "my name is {0}".format(name),
        "they call me {0}".format(name),
        "I go by {0}".format(name)
    ],
    "what's today's weather?": [
        "the weather is {0}".format(weather),
        "it's {0} today".format(weather)
    ],
    "default": ["default message"]
}

def respond(message):
    if message in responses:
        bot_message = random.choice(responses[message])
    else:
        bot_message = random.choice(responses["default"])
    return bot_message
import re

pattern = "do you remember .*"
message = "do you remember when you ate strawberries in the garden"
match = re.search(pattern, message)
if match:
    print("String Matches")

# Groups: capture part of the match with parentheses
match = re.search(r"do you remember (.*)", message)
match.group(0)  # the whole match
match.group(1)  # just the captured group
3. Grammatical Transformation
import re

def swap_pronouns(phrase):
    if 'I' in phrase:
        return re.sub('I', 'you', phrase)
    if 'my' in phrase:
        return re.sub('my', 'your', phrase)
    else:
        return phrase
41. Understanding intents and entities
1. Intents: what the user wants to do, e.g.
i. "I'm hungry" -> the user wants to find food
2. Entities: the specific things mentioned in the message (places, dates, cuisines, ...)
a. The simplest approach is keyword matching with regular expressions
b. Drawback: only exact keyword matches are recognised
c. '|' is equivalent to OR
re.search(r"(hello|hey|hi)", "hey there!")

# findall: collect every capitalized word (a crude name detector)
pattern = re.compile('[A-Z]{1}[a-z]*')
message = """
Mary is a friend of mine,
she asked me to go to Moscow
"""
pattern.findall(message)  # ['Mary', 'Moscow']
def send_message(message):
    print(user_template.format(message))
    response = respond(message)
    print(bot_template.format(response))

# Map each intent to a single regex built from its keywords
keywords = {'greet': ['hello', 'hi', 'hey'],
            'goodbye': ['bye', 'farewell'],
            'thankyou': ['thank', 'thx']}

patterns = {}
for intent, keys in keywords.items():
    patterns[intent] = re.compile('|'.join(keys))
print(patterns)
def match_intent(message):
    matched_intent = None
    for intent, pattern in patterns.items():
        # If the pattern occurs in the message, record this intent
        if pattern.search(message):
            matched_intent = intent
    return matched_intent

responses = {'default': 'default message',
             'goodbye': 'goodbye for now',
             'greet': 'Hello you! :)',
             'thankyou': 'you are very welcome'}

def respond(message):
    intent = match_intent(message)
    key = "default"
    if intent in responses:
        key = intent
    return responses[key]

# Send messages
send_message("hello!")
send_message("bye byeee")
# Define find_name()
def find_name(message):
    name = None
    # A keyword that signals the user is naming themselves
    name_keyword = re.compile('name|call')
    # A capitalized word is our (crude) candidate for a name
    name_pattern = re.compile('[A-Z]{1}[a-z]*')
    if name_keyword.search(message):
        name_words = name_pattern.findall(message)
        if len(name_words) > 0:
            name = ' '.join(name_words)
    return name
# Define respond()
def respond(message):
    name = find_name(message)
    if name is None:
        return "Hi there!"
    else:
        return "Hello, {0}!".format(name)

# Send messages
send_message("call me Ishmael")
42. Word vectors
2. Machine learning: programs which can get better at a task by being exposed to more data
5. Word vectors: each word is represented as a dense vector of numbers, so that similar
words get similar vectors
f. GloVe algorithm
i. Cousin of word2vec
g. spaCy: ships with pre-trained GloVe word vectors
import spacy
nlp = spacy.load('en')  # if any FileNotFoundError, run from the Anaconda prompt: "python -m spacy download en"
nlp.vocab.vectors_length  # dimensionality of the word vectors

doc = nlp('hello can you help me?')
for token in doc:
    print("{} : {}".format(token, token.vector[:3]))
7. Similarity
c. Cosine similarity: measures the angle between two vectors
d. Exercise 1: Similarity
import spacy
nlp = spacy.load('en')
doc = nlp("cat")
doc.similarity(nlp("can"))
doc.similarity(nlp("dog"))
doc.similarity(nlp("cat"))
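The cosine similarity that .similarity() reports can be computed directly with NumPy; a minimal sketch:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a
print(cosine_similarity(a, b))                            # 1.0 (same direction)
print(cosine_similarity(a, np.array([-1.0, -2.0, -3.0])))  # -1.0 (opposite direction)
```

Because only the angle matters, a long and a short vector pointing the same way count as perfectly similar.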
a. Create a 2D array X with as many rows as there are sentences in the dataset, where each
row is a vector describing that sentence.
with open('nlp_atis_words.txt', "r") as word_list:
    sentences = word_list.read().split(',')
# Note: a second read() on an exhausted handle returns '', so the labels must
# come from their own file (filename illustrative)
with open('nlp_atis_labels.txt', "r") as label_list:
    labels = label_list.read().split(',')

nlp = spacy.load('en')
n_sentences = len(sentences)
embedding_dim = nlp.vocab.vectors_length

import numpy as np
X = np.zeros((n_sentences, embedding_dim))
for idx, sentence in enumerate(sentences):
    doc = nlp(sentence)
    X[idx, :] = doc.vector
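The same fill-a-matrix pattern can be run with a toy stand-in for nlp(sentence).vector (the embedding lookup here is invented for illustration):

```python
import numpy as np

# Tiny hand-made word vectors standing in for spaCy's pre-trained vectors
embedding = {"cheap": np.array([1.0, 0.0]),
             "flight": np.array([0.0, 1.0]),
             "hotel": np.array([0.5, 0.5])}

def sentence_vector(sentence, dim=2):
    # Average the vectors of the known words (doc.vector does much the same)
    vecs = [embedding[w] for w in sentence.split() if w in embedding]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

sentences = ["cheap flight", "cheap hotel"]
X = np.zeros((len(sentences), 2))
for idx, sentence in enumerate(sentences):
    X[idx, :] = sentence_vector(sentence)
print(X)
```

Each row of X now describes one sentence, which is exactly the shape a classifier expects.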
43. Intents and classification
1. Supervised Learning
44. Entity Extraction
1. Entity extraction: finding things like people, organisations, and dates in text
3. Context helps decide what a word is, e.g.
i. Spelling
ii. Capitalization
import spacy
nlp = spacy.load('en')
doc = nlp("my friend Mary has worked at Google since 2009")
for ent in doc.ents:
    print(ent.text, ent.label_)
5. Pattern Recognition
import re
nlp = spacy.load('en')
# Entity labels we care about for this exercise
include_entities = ['DATE', 'ORG', 'PERSON']

# Define extract_entities()
def extract_entities(message):
    # Create a dict to hold the entities, keyed by label
    ents = dict.fromkeys(include_entities)
    doc = nlp(message)
    for ent in doc.ents:
        if ent.label_ in include_entities:
            # Save interesting entities
            ents[ent.label_] = ent.text
    return ents
colors = ['black', 'red', 'blue']
items = ['shoes', 'handbag', 'jacket', 'jeans']

def entity_type(word):
    _type = None
    if word.text in colors:
        _type = "color"
    elif word.text in items:
        _type = "item"
    return _type
doc = nlp("let's see that jacket in red and some blue jeans")

def find_parent_item(word):
    # Iterate over the word's ancestors in the dependency parse
    for parent in word.ancestors:
        if entity_type(parent) == "item":
            return parent.text
    return None

def assign_colors(doc):
    # Match each color word to the item it modifies
    for word in doc:
        if entity_type(word) == "color":
            item = find_parent_item(word)
            print("item: {0} has color : {1}".format(item, word))

assign_colors(doc)
45. Robust NLU with Rasa
1. Installation (from the Anaconda prompt):
b. pip install rasa_nlu
c. activate base
2. RASA NLU is an open-source alternative to hosted NLU services:
a. Google https://dialogflow.com/
b. Microsoft https://www.luis.ai/
c. Facebook https://wit.ai/
a. Please note that I am using rasa_nlu version 0.13.1. If you are using any other version,
some of the functions may not work.
b. In this exercise you'll use Rasa NLU to create an interpreter, which parses incoming user
messages and returns a set of entities.
c. Your job is to train an interpreter using the MITIE entity recognition model in rasa NLU
from rasa_nlu.training_data import load_data
from rasa_nlu.config import RasaNLUModelConfig
from rasa_nlu.model import Trainer

args = {"pipeline": "mitie"}  # MITIE pipeline, per the exercise
config = RasaNLUModelConfig(configuration_values=args)
trainer = Trainer(config)
training_data = load_data("./training_data.json")
interpreter = trainer.train(training_data)
# Try it out
print(interpreter.parse("I'm looking for a Mexican restaurant in the North of town"))
46. Building a virtual assistant
1. In this chapter, you're going to build a personal assistant to help you plan a trip. It will be able to
respond to questions like "are there any cheap hotels in the north of town?" by looking inside a
hotels database for matching results. Other examples of this kind of task:
a. Scheduling a meeting
b. Booking a flight

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///hotels.db')
con = engine.connect()
result = con.execute("SELECT * FROM hotels")
df = pd.DataFrame(result.fetchall(), columns=result.keys())
con.close()
# Step 7 of 7: Print head of DataFrame df
print(df.head())
# or
import sqlite3
conn = sqlite3.connect('hotels.db')
con = conn.cursor()
con.execute("SELECT * FROM hotels")
con.fetchall()
# Parameterized query: '?' placeholders are filled in safely by sqlite3
import sqlite3
conn = sqlite3.connect('hotels.db')
c = conn.cursor()
par_area, par_price = "south", "hi"  # example parameter values
t = (par_area, par_price)
c.execute('SELECT * FROM hotels WHERE area=? AND price=?', t)
# Step 6 of 6: Print the results
print(c.fetchall())
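A self-contained version using an in-memory database (the schema and rows are invented for illustration) shows why the '?' placeholders matter: sqlite3 binds the values itself, so user input can never alter the SQL.

```python
import sqlite3

# Throwaway in-memory DB mimicking the hotels table
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE hotels (name text, price text, area text)")
c.execute("INSERT INTO hotels VALUES ('Grand Hotel', 'hi', 'south')")
c.execute("INSERT INTO hotels VALUES ('Cozy Cottage', 'lo', 'north')")

# Values are bound by the driver, never spliced into the SQL string
t = ('south', 'hi')
c.execute("SELECT name FROM hotels WHERE area=? AND price=?", t)
print(c.fetchall())  # [('Grand Hotel',)]
```

Building the query with string formatting instead would leave the bot open to SQL injection through the chat input.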
46.2 Exploring a DB with natural language
1. Example messages
a. Now you're going to implement a more powerful function for querying the hotels
database. The goal is to take arguments that can later be specified by other parts of
your code.
b. Specifically, your job here is to define a find_hotels() function which takes a single
argument - a dictionary of column names and values - and returns a list of matching
hotels from the database.
# Define find_hotels()
def find_hotels(params):
    # Create the base query
    query = 'SELECT * FROM hotels'
    # Add filter clauses for each of the parameters
    if len(params) > 0:
        filters = ["{}=?".format(k) for k in params]
        query += " WHERE " + " and ".join(filters)
    # Create the tuple of values
    t = tuple(params.values())
    # Open connection to DB
    conn = sqlite3.connect('hotels.db')
    # Create a cursor
    c = conn.cursor()
    # Execute the query
    c.execute(query, t)
    # Return the results
    return c.fetchall()
a. Here, you're going to put your find_hotels() function into action! Recall that it accepts a
single argument, params, which is a dictionary of column names and values.
params = {"area": "south", "price": "lo"}
print(find_hotels(params))
a. Now you'll write a respond() function which can handle messages like "I want an
expensive hotel in the south of town" and respond appropriately according to the
number of matching results in a database. This is important functionality for any
database-backed chatbot.
b. Your find_hotels() function from the previous exercises has already been defined for you,
along with a rasa NLU interpreter object which can handle hotel queries and a list of
responses, which you can explore in the Shell.
responses = ["I'm sorry :( I couldn't find anything like that",
             '{} is a great hotel!',
             '{} or {} would work!',
             '{} is one option, but I know others too :)']

# Define respond()
def respond(message):
    # Extract the entities
    entities = interpreter.parse(message)["entities"]
    # Initialize an empty params dictionary
    params = {}
    for ent in entities:
        params[ent["entity"]] = str(ent["value"])
    # Find hotels matching the parameters
    results = find_hotels(params)
    # Use the number of matches (capped at 3) to pick a response
    names = [r[0] for r in results]
    n = min(len(results), 3)
    return responses[n].format(*names)
46.3 Incremental slot filling and negation
1. Incremental filters
i. Now you'll write a bot that allows users to add filters incrementally, in case they
don't specify all of their preferences in one message.
# Define a respond function, taking the message and existing params as input
def respond(message, params):
    # Extract the entities and add them to the existing params
    entities = interpreter.parse(message)["entities"]
    for ent in entities:
        params[ent["entity"]] = str(ent["value"])
    # Find matching hotels and pick a response
    results = find_hotels(params)
    names = [r[0] for r in results]
    n = min(len(results), 3)
    return responses[n].format(*names), params

# Initialize params, then pass the messages to the bot one at a time
params = {}
for message in ["I want an expensive hotel", "in the north of town"]:
    print("USER: {}".format(message))
    response, params = respond(message, params)
    print("BOT: {}".format(response))
1. Negated Entities
a. assume that "not" or "n't" just before an entity means user wants to exclude this
import spacy
nlp = spacy.load('en')
doc = nlp("not sushi, maybe pizza?")  # entity tokens sit at positions 1 and 4
ents, negated_ents = [], []
indices = [1, 4]
start = 0
for i in indices:
    # The chunk of text preceding this entity
    phrase = "{}".format(doc[start:i])
    print(phrase)
    if "not" in phrase or "n't" in phrase:
        negated_ents.append(doc[i])
    else:
        ents.append(doc[i])
    start = i
negated_ents
a. Quite often you'll find your users telling you what they don't want - and that's important
to understand! In general, negation is a difficult problem in NLP. Here we'll take a very
simple approach that works for many cases.
b. A list of tests called tests has been defined for you. Explore it in the Shell - you'll find that
each test is a tuple consisting of:
i. A message string
ii. A dictionary containing the entities as keys, and a Boolean saying whether they
are negated as the value
c. Your job is to define a function called negated_ents() which looks for negated entities in
a message.
tests = [('no in the south not the north', {'north': False, 'south': True}),
('not north', {'north': False})]
# Define negated_ents()
def negated_ents(phrase):
    # Extract the entities using keywords (assumed for this exercise)
    ent_vals = ["south", "north"]
    ents = [e for e in ent_vals if e in phrase]
    # Find the end index of each entity and split the phrase into chunks
    ends = sorted([phrase.index(e) + len(e) for e in ents])
    start = 0
    chunks = []
    for end in ends:
        chunks.append(phrase[start:end])
        start = end
    result = {}
    # An entity is negated (False) if its chunk contains "not" or "n't"
    for chunk in chunks:
        for ent in ents:
            if ent in chunk:
                if "not" in chunk or "n't" in chunk:
                    result[ent] = False
                else:
                    result[ent] = True
    return result
a. Now you're going to put together some of the ideas from previous exercises, and allow
users to tell your bot about what they do and what they don't want, split across multiple
messages.
b. The negated_ents() function has already been defined for you. Additionally, a slightly
tweaked version of the find_hotels() function, which accepts a neg_params dictionary in
addition to a params dictionary, has been defined.
def find_hotels(params, neg_params):
    query = 'SELECT * FROM hotels'
    if len(params) > 0 or len(neg_params) > 0:
        # '=' filters for wanted values, '!=' for negated ones
        filters = ["{}=?".format(k) for k in params] + \
                  ["{}!=?".format(k) for k in neg_params]
        query += " WHERE " + " and ".join(filters)
    t = tuple(params.values()) + tuple(neg_params.values())
    # open connection to DB
    conn = sqlite3.connect('hotels.db')
    # create a cursor
    c = conn.cursor()
    c.execute(query, t)
    return c.fetchall()
# negated_ents() again, this time taking the entity values as an argument
def negated_ents(phrase, ent_vals):
    ents = [e for e in ent_vals if e in phrase]
    ends = sorted([phrase.index(e) + len(e) for e in ents])
    start = 0
    chunks = []
    for end in ends:
        chunks.append(phrase[start:end])
        start = end
    result = {}
    for chunk in chunks:
        for ent in ents:
            if ent in chunk:
                # Here True marks a negated entity
                result[ent] = "not" in chunk or "n't" in chunk
    return result

def respond(message, params, neg_params):
    entities = interpreter.parse(message)["entities"]
    ent_vals = [e["value"] for e in entities]
    negated = negated_ents(message, ent_vals)
    for ent in entities:
        if ent["value"] in negated and negated[ent["value"]]:
            neg_params[ent["entity"]] = str(ent["value"])
        else:
            params[ent["entity"]] = str(ent["value"])
    results = find_hotels(params, neg_params)
    names = [r[0] for r in results]
    n = min(len(results), 3)
    return responses[n].format(*names), params, neg_params
# Initialize params and neg_params
params = {}
neg_params = {}
for message in ["I want a cheap hotel", "but not in the north of town"]:
    print("USER: {}".format(message))
    response, params, neg_params = respond(message, params, neg_params)
    print("BOT: {}".format(response))
47. Dialogue
1. Everything you've built so far has statelessly mapped intents to actions & responses. It's amazing
how far you can get with that! But to build more sophisticated bots you will always want to add
some statefulness. That's what you'll do here, as you build a chatbot that helps users order
coffee. Have fun!
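The stateful idea can be sketched end-to-end before working through the exercises; everything here (states, intents, wording) is illustrative:

```python
# Three conversation states
INIT, CHOOSE_COFFEE, ORDERED = 0, 1, 2

def interpret(message):
    # Crude keyword-based intent detection
    msg = message.lower()
    if 'order' in msg:
        return 'order'
    if 'kenyan' in msg or 'colombian' in msg:
        return 'specify_coffee'
    return 'none'

# (state, intent) -> (next state, response)
policy = {
    (INIT, 'order'): (CHOOSE_COFFEE, 'ok, Colombian or Kenyan?'),
    (INIT, 'none'): (INIT, "I'm sorry - I'm not sure how to help you"),
    (CHOOSE_COFFEE, 'specify_coffee'): (ORDERED, 'perfect, the beans are on their way!'),
    (CHOOSE_COFFEE, 'none'): (CHOOSE_COFFEE, 'Colombian or Kenyan?'),
}

state = INIT
for message in ["I'd like to order some coffee", "kenyan"]:
    state, response = policy[(state, interpret(message))]
    print("BOT : {}".format(response))
```

The same user message ("kenyan") would get the fallback response in state INIT, but completes the order in state CHOOSE_COFFEE; that is the whole point of keeping state.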
1. Form filling
a. You'll often want your bot to guide users through a series of steps, such as when they're
placing an order.
b. In this exercise, you'll begin building a bot that lets users order coffee. They can choose
between two types: Colombian, and Kenyan. If the user provides unexpected input, your
bot will handle this differently depending on where they are in the flow.
c. Your job here is to identify the appropriate state and next state based on the intents and
response messages provided. For example, if the intent is "order", then the state
changes from INIT to CHOOSE_COFFEE.
d. A function send_message(policy, state, message) takes the policy, the current state and
message as arguments, and returns the new state as a result.
def send_message(policy, state, message):
    print("USER : {}".format(message))
    # Look up the new state and response in the policy
    new_state, response = policy[(state, interpret(message))]
    print("BOT : {}".format(response))
    return new_state

# A very simple keyword-based intent interpreter
def interpret(message):
    msg = message.lower()
    if 'order' in msg:
        return 'order'
    if 'kenyan' in msg or 'colombian' in msg:
        return 'specify_coffee'
    return 'none'
INIT = 0
CHOOSE_COFFEE = 1
ORDERED = 2
# Define the policy rules
policy = {
    (INIT, "order"): (CHOOSE_COFFEE, "ok, Colombian or Kenyan?"),
    (INIT, "none"): (INIT, "I'm sorry - I'm not sure how to help you"),
    (CHOOSE_COFFEE, "specify_coffee"): (ORDERED, "perfect, the beans are on their way!"),
}
messages = ["I'd like to order some coffee", "kenyan"]
state = INIT
for message in messages:
    state = send_message(policy, state, message)
a. Sometimes your users need some help! They will have questions and expect the bot to
help them.
b. In this exercise, you'll allow users to ask the coffee bot to explain the steps to them. Like
before, the answer they get will depend on where they are in the flow.
# Define send_messages()
def send_messages(messages):
    state = INIT
    for msg in messages:
        state = send_message(state, msg)

def send_message(state, message):
    print("USER : {}".format(message))
    new_state, response = policy_rules[(state, interpret(message))]
    print("BOT : {}".format(response))
    return new_state

def interpret(message):
    msg = message.lower()
    if 'order' in msg:
        return 'order'
    if 'kenyan' in msg or 'colombian' in msg:
        return 'specify_coffee'
    if 'what' in msg:
        return 'ask_explanation'
    return 'none'
INIT=0
CHOOSE_COFFEE=1
ORDERED=2
# Define the policy rules dictionary
policy_rules = {
    (INIT, "order"): (CHOOSE_COFFEE, "ok, Colombian or Kenyan?"),
    (INIT, "none"): (INIT, "I'm sorry - I'm not sure how to help you"),
    (INIT, "ask_explanation"): (INIT, "I'm a bot to help you order coffee beans"),
    (CHOOSE_COFFEE, "specify_coffee"): (ORDERED, "perfect, the beans are on their way!"),
}
send_messages([
    "what can you do for me?",
    "I'd like to order some coffee",
    "kenyan"
])
a. What happens if you make a suggestion to your user, and they don't like it? Your bot will
look really silly if it makes the same suggestion again right away.
b. Here, you're going to modify your respond() function so that it accepts and returns 4
arguments:
c. The user message as an argument, and the bot response as the first return value.
e. A suggestions list. When passed to respond(), this should contain the suggestions made
in the previous bot message. When returned by respond(), it should contain the current
suggestions.
f. An excluded list, which contains all of the results your user has already explicitly rejected.
g. Your function should add the previous suggestions to the excluded list whenever it
receives a "deny" intent. It should also filter out excluded suggestions from the
response.
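The deny-and-exclude logic described above can be sketched as follows; the intent detection, hotel names, and wording here are all assumptions for illustration, not the exercise's own code:

```python
# Illustrative sketch of the 4-argument respond() described above
def respond(message, params, suggestions, excluded):
    # A real bot would get the intent from its NLU layer; crude keyword match here
    intent = "deny" if "no" in message.lower() else "request"
    if intent == "deny":
        # The user rejected the previous suggestions - exclude them from now on
        excluded.extend(suggestions)
    options = ["Grand Hotel", "Cozy Cottage", "Hotel du Nord"]
    # Filter out everything the user has already rejected
    candidates = [h for h in options if h not in excluded]
    suggestions = candidates[:1]
    if suggestions:
        bot_message = "how about the {}?".format(suggestions[0])
    else:
        bot_message = "I'm out of ideas, sorry!"
    return bot_message, params, suggestions, excluded

msg, params, suggestions, excluded = respond("find me a hotel", {}, [], [])
print(msg)  # how about the Grand Hotel?
msg, params, suggestions, excluded = respond("no, something else", params, suggestions, excluded)
print(msg)  # how about the Cozy Cottage?
```

Because the rejected suggestion is carried forward in excluded, the bot never repeats an option the user has already turned down.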