
Python PPT (merged version)

The document provides an overview of Python programming concepts, including constants, reserved words, variables, and naming conventions. It covers assignment statements, numeric expressions, operator precedence, type conversions, user input, and comments. Additionally, it discusses conditional execution, including comparison and logical operators, boolean expressions, and the structure of if/elif/else statements.

Uploaded by

callenzcl
Copyright
© © All Rights Reserved

PYTHON LANGUAGE

2024-IS60516
MSc BA
Lecturer: Dr Selja Seppälä
Variables, expressions & statements
• Explain the difference between constants, reserved words and variables, with examples
• Correctly use the Python naming conventions and assign meaningful names
• Describe assignment statements with examples and write them in code
• Define and use the operators for writing numeric expressions, and list & use the operator precedence rules
• Describe and give examples of the different types in Python
• Write code to check for types and convert one type to another
• Use the Python function to get user input
• Write meaningful comments when appropriate

IS6061 | Python Language Overview | S. Seppälä


Summary

• Vocabulary
– Constants
– Reserved Words
– Variables, namespace, assignment
– Python naming conventions
• Sentences or Lines
• Expressions
• Operators & Precedence Rules
• Type & Type Conversions
• Comments

Source: Python for Everybody, www.py4e.com
Constants
• Fixed values such as numbers, letters, and strings are called “constants” because their value does not change
• Numeric constants are as you expect
• String constants use single quotes (') or double quotes (")
>>> print(123)
123
>>> print(98.6)
98.6
>>> print('Hello world')
Hello world
Reserved Words
• You cannot use reserved words as variable names / identifiers as
they have a special meaning for Python
False    class   return   is      finally
None     if      for      lambda  continue
True     def     from     while   nonlocal
and      del     global   not     with
as       elif    try      or      yield
assert   else    import   pass
break    except  in       raise
Variables
• A variable is a named place in the memory where a programmer can store
data and later retrieve the data using the variable “name”

• Programmers get to choose the names of the variables

• You can change the contents of a variable in a later statement

x = 12.2
y = 14
(Diagram: the name x refers to 12.2 and the name y refers to 14)
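The third bullet above (changing the contents of a variable in a later statement) can be sketched as follows. The replacement value 100 is an arbitrary choice for illustration and does not come from the slides.

```python
# Store two values under the names x and y.
x = 12.2
y = 14
print(x, y)

# A later assignment replaces what x refers to; y is untouched.
x = 100
print(x, y)
```

After the second assignment the old value 12.2 can no longer be reached through x.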
Naming variables

https://forms.office.com/e/wk9fTdpA3h

2024-IS6051 | Data | S. Seppälä


Python Variable Name Rules
• Must start with a letter or underscore _

• Must consist of letters, numbers, and underscores

• Case Sensitive

Good: spam eggs spam23 _speed
Bad: 23spam #sign var.12
Different: spam Spam SPAM
See naming conventions and other writing rules in the PEP 8 – Style Guide for Python Code:
https://realpython.com/python-pep8/
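One way to check the name rules is to ask Python itself. The sketch below is an illustration only: the helper name is_usable_name is hypothetical, and the extra candidate 'class' was added here to show that reserved words are also rejected. It relies on the built-in str.isidentifier() method and the standard keyword module.

```python
import keyword

# Candidate names taken from the Good/Bad examples, plus one reserved word.
candidates = ['spam', 'eggs', 'spam23', '_speed', '23spam', '#sign', 'var.12', 'class']

def is_usable_name(name):
    """Return True if name is a legal identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

for name in candidates:
    print(name, is_usable_name(name))
```

Names are case sensitive, so spam, Spam and SPAM all pass the check but refer to three different variables.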
The Standard Way
• The standard way for most things named in Python is “lower with
under”: lower case, with separate words joined by an underscore:

• this_is_a_var

• my_list

• square_root_function

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Mnemonic Variable Names
• Since we programmers are given a choice in how we choose our
variable names, there is a bit of “best practice”
• We name variables to help us remember what we intend to store
in them (“mnemonic” = “memory aid”)
• This can confuse beginning students because well-named
variables often “sound” so good that they must be keywords

http://en.wikipedia.org/wiki/Mnemonic
Sentences or Lines

x = 2          Assignment statement
x = x + 2      Assignment with expression
print(x)       Print statement

(Parts labelled on the slide: Variable, Operator, Constant, Function)


= is Assignment
• In many computer languages, = means assignment.

• my_int = my_int + 7

• lhs = rhs

• What assignment means is:
– evaluate the rhs of the =
– take the resulting value and associate it with the name on the lhs

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
A variable is a memory location used to store a value (0.6).

x = 3.9 * x * ( 1 - x )

The right side is an expression. With x holding 0.6, ( 1 - x ) is 0.4, and 3.9 * 0.6 * 0.4 evaluates to 0.936.
Once the expression is evaluated, the result is placed in (assigned to) x.
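The evaluation order described above can be tried directly. This is a minimal sketch; the call to round() is an addition here so that the printed value is not affected by floating-point rounding.

```python
x = 0.6

# The right-hand side is evaluated first, using the old value of x.
# Only then is the result assigned back to the name x.
x = 3.9 * x * (1 - x)

print(round(x, 3))
```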
Numeric Expressions
• Because of the lack of mathematical symbols on computer keyboards - we use “computer-speak” to express the classic math operations
• Asterisk is multiplication
• Exponentiation (raise to a power) looks different than in math

Operator   Operation
+          Addition
-          Subtraction
*          Multiplication
/          Division
**         Power
%          Remainder
Numeric Expressions
>>> xx = 2
>>> xx = xx + 2
>>> print(xx)
4
>>> yy = 440 * 12
>>> print(yy)
5280
>>> zz = yy / 1000
>>> print(zz)
5.28
>>> jj = 23
>>> kk = jj % 5
>>> print(kk)
3
>>> print(4 ** 3)
64

(Long division: 23 divided by 5 is 4 remainder 3, because 5 * 4 = 20 and 23 - 20 = 3)
Operator Precedence Rules
Highest precedence rule to lowest precedence rule:
• Parentheses are always respected
• Exponentiation (raise to a power)
• Multiplication, Division, and Remainder
• Addition and Subtraction
• Left to right
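The rules above can be traced on a single expression. The expressions below are illustrative choices, and the comments walk through the order in which Python applies the operators.

```python
# Without parentheses: power first, then / and * left to right, then +.
x = 1 + 2 ** 3 / 4 * 5
# 2 ** 3   -> 8
# 8 / 4    -> 2.0
# 2.0 * 5  -> 10.0
# 1 + 10.0 -> 11.0
print(x)

# Parentheses are evaluated first and change the result.
y = (1 + 2) ** 3 / 4 * 5
# (1 + 2)  -> 3
# 3 ** 3   -> 27
# 27 / 4   -> 6.75
# 6.75 * 5 -> 33.75
print(y)
```

When in doubt, adding parentheses makes the intended order explicit to the reader as well as to Python.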
What Does “Type” Mean?
• In Python variables, literals, and constants have a “type”
• Python knows the difference between an integer number and a string
• For example “+” means “addition” if something is a number and “concatenate” if something is a string (concatenate = put together)
>>> ddd = 1 + 4
>>> print(ddd)
5
>>> eee = 'hello ' + 'there'
>>> print(eee)
hello there
Python “types”
• integers: 5
• floats: 1.2
• booleans: True
• strings: "anything" or 'something'
• lists: [,] ['a',1,1.3]
• others (not seen yet)

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Type Matters
• Python knows what “type” everything is
• Some operations are prohibited
• You cannot “add 1” to a string
• We can ask Python what type something is by using the type() function
>>> eee = 'hello ' + 'there'
>>> eee = eee + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly
>>> type(eee)
<class 'str'>
>>> type('hello')
<class 'str'>
>>> type(1)
<class 'int'>
>>>
(Newer Python 3 releases word this error as: TypeError: can only concatenate str (not "int") to str)
Type Conversions
• When you put an integer and floating point in an expression, the integer is implicitly converted to a float
• You can control this with the built-in functions int() and float()
>>> print(float(99) + 100)
199.0
>>> i = 42
>>> type(i)
<class 'int'>
>>> f = float(i)
>>> print(f)
42.0
>>> type(f)
<class 'float'>
>>>
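int() and float() also convert strings, and they fail when the string does not look like a number. The sketch below is an illustration with made-up values ('123', 'hello bob'); the variable names are hypothetical.

```python
sval = '123'
ival = int(sval)         # string to integer
total = ival + 1         # arithmetic now works on the converted value
fval = float(ival)       # integer to float
truncated = int(98.6)    # int() drops the fractional part, it does not round

# A string that is not a number cannot be converted.
try:
    int('hello bob')
    converted = True
except ValueError:
    converted = False

print(total, fval, truncated, converted)
```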
User Input
• We can instruct Python to pause and read data from the user using the input() function
• The input() function returns a string
• If we want to read a number from the user, we must convert it from a string to a number using a type conversion function

nam = input('Who are you? ')
print('Welcome', nam)

Who are you? Chuck
Welcome Chuck

inp = input('Europe floor? ')
usf = int(inp) + 1
print('US floor', usf)

Europe floor? 0
US floor 1
Comments in Python
• Anything after a # is ignored by Python
• Why comment?
- Describe what is going to happen in a sequence of code
- Document who wrote the code or other ancillary information
- Turn off a line of code - perhaps temporarily
Acknowledgements / Contributions

These slides are Copyright 2010- Charles R. Severance (www.dr-chuck.com) of the University of Michigan School of Information and made available under a Creative Commons Attribution 4.0 License. Please maintain this last slide in all copies of the document to comply with the attribution requirements of the license. If you make a change, feel free to add your name and organization to the list of contributors on this page as you republish the materials.

Initial Development: Charles Severance, University of Michigan School of Information

Edits by Selja Seppälä: added slides on Variables, Assignment, Naming Conventions and a comment mentioning PEP 8. Removed some original slides.
CONDITIONAL EXECUTION

2024-IS60516
MSc BA
Lecturer: Dr Selja Seppälä
Conditional execution
• Use comparison operators in boolean expressions
• Describe what ASCII values are and compare strings based on these
• Use logical operators considering the principles of short-circuit evaluation
• Using methods that return boolean values
• Describe the flow of conditional structures
• Correctly use the if/elif/else clauses in if statements
• Write code for one-way, nested, two-way & multi-way decisions
• Describe the try/except structure, explain when it is used and write try/except code

IS1110 | Conditional Execution | S. Seppälä
Summary

• Conditional Steps
– Structures that Control Flow
– Conditional/Decision Structures

Structures that Control Flow
Relational and Logical Operators
• A condition is an expression (a Boolean expression)
– Involving relational (comparison) operators (such as < and >=)
– Logical operators (such as and, or, and not)
– Evaluates to either True or False
• Conditions used to make decisions
– Control loops
– Choose between options
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 1, © 2016 Pearson
Education, Inc., Hoboken, NJ. All rights reserved.
Relational/Comparison Operators
• Boolean expressions ask a question and produce a Yes or No result which we use to control program flow
• Boolean expressions using comparison operators evaluate to True / False or Yes / No
• Comparison operators look at variables but do not change the variables

Python   Meaning
<        Less than
<=       Less than or Equal to
==       Equal to
>=       Greater than or Equal to
>        Greater than
!=       Not equal

Remember: “=” is used for assignment.

http://en.wikipedia.org/wiki/George_Boole
ASCII Values
• ASCII values determine order used to compare strings with
relational operators.
• Associated with keyboard letters, characters, numerals
• ASCII values of the printable keyboard characters are numbers ranging from 32 to 126.
• A few ASCII values.
• Extended versions of ASCII also assign characters to some numbers
above 126.
• Functions chr(int) and ord(str) access ASCII values.

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 1, © 2016 Pearson
Education, Inc., Hoboken, NJ. All rights reserved.
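The built-in functions ord() and chr() make the ordering visible. The words compared below are arbitrary examples chosen for this sketch.

```python
code_upper = ord('A')    # ASCII value of 'A'
code_lower = ord('a')    # ASCII value of 'a'
letter = chr(66)         # character whose ASCII value is 66

# Every upper-case letter has a smaller value than every lower-case letter,
# so 'Zebra' sorts before 'apple' even though Z comes after a in the alphabet.
upper_first = 'Zebra' < 'apple'

sorted_words = sorted(['pear', 'Apple', 'banana'])

print(code_upper, code_lower, letter, upper_first, sorted_words)
```

Strings are compared character by character, so converting both sides with .lower() is a common way to get a case-insensitive comparison.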
Relational/Comparison Operators

Table 3.3 Relational operators.


Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 1, © 2016 Pearson
Education, Inc., Hoboken, NJ. All rights reserved.
Relational/Comparison Operators
• Some rules
• An int can be compared to a float.
• Otherwise, values of different types generally cannot be ordered with <, <=, > or >= (Python 3 raises a TypeError); == and != still work
• Relational operators can be applied to lists or tuples

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 1, © 2016 Pearson
Education, Inc., Hoboken, NJ. All rights reserved.
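A short sketch of these rules, using values invented for the illustration:

```python
# An int can be compared to a float.
int_vs_float = 3 < 3.5

# Lists and tuples are compared element by element, left to right.
lists_ordered = [1, 2, 3] < [1, 2, 4]
tuples_ordered = (2, 'a') > (1, 'z')

# Ordering a string against a number is not allowed in Python 3.
try:
    'a' < 1
    mixed_allowed = True
except TypeError:
    mixed_allowed = False

print(int_vs_float, lists_ordered, tuples_ordered, mixed_allowed)
```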
Boolean Expressions
A Boolean expression is a logical statement that is either TRUE or FALSE.

age == 5 # variable equal to numeric literal


first_name == "John" # variable equal to string literal
quantity != 0 # variable not equal to numeric literal
distance > 5.6 # variable greater than numeric literal
fuel_req < fuel_cap # variable less than variable
distance >= limit # variable greater than or equal to variable
stock <= reorder_point # variable less than or equal to variable
rate / 100 >= 0.1 # expression greater than or equal to literal
Based on: Murach's Python Programming, C3, Slide 5, © 2016, Mike Murach & Associates, Inc.
Boolean Expressions
How to assign a Boolean value to a variable
active = True # variable is set to Boolean True value
active = False # variable is set to Boolean False value

Based on: Murach's Python Programming, C3, Slide 5, © 2016, Mike Murach & Associates, Inc.
Logical Operators
• Logical operators are the reserved words and, or, and not
• Enables combining multiple relational operators
• Conditions that use these operators are called compound
conditions

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 1, © 2016 Pearson
Education, Inc., Hoboken, NJ. All rights reserved.
Logical Operators
Operator   Name
and        AND
or         OR
not        NOT

Order of precedence
1. NOT operator
2. AND operator
3. OR operator

Based on: Murach's Python Programming, C3, Slide 6, © 2016, Mike Murach & Associates, Inc.
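The order of precedence decides how a compound condition is grouped when there are no parentheses. The literal True/False values below stand in for real conditions and are used only to make the grouping easy to follow.

```python
# and binds tighter than or: this reads as True or (False and False).
a = True or False and False

# Parentheses force the or to be evaluated first.
b = (True or False) and False

# not binds tighter than and: this reads as (not False) and False.
c = not False and False

# Parentheses apply not to the whole and-expression.
d = not (False and False)

print(a, b, c, d)
```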
Logical Operators
• Given: cond1 and cond2 are conditions
– cond1 and cond2 is true only if both conditions are true
– cond1 or cond2 is true if either or both conditions are true
– not cond1 is false if the condition is true, and true if the condition is false
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 1, © 2016 Pearson
Education, Inc., Hoboken, NJ. All rights reserved.
Short-Circuit Evaluation
• Consider the condition cond1 and cond2
• If Python evaluates cond1 as false, it does not bother to
check cond2
• Similarly with cond1 or cond2
• If Python finds cond1 true, it does not bother to check
further
• Think why this feature helps for
(number != 0) and (m == (n / number))
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 1, © 2016 Pearson
Education, Inc., Hoboken, NJ. All rights reserved.
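The condition at the end of the slide can be wrapped in a function to see why the order matters. This is a sketch: the function names is_quotient and unsafe_order are hypothetical, and the test values are made up.

```python
def is_quotient(m, n, number):
    """Guard first: the division only runs when number is not zero."""
    return (number != 0) and (m == (n / number))

def unsafe_order(m, n, number):
    """Same two conditions in the opposite order: the division runs first."""
    return (m == (n / number)) and (number != 0)

print(is_quotient(5, 10, 2))   # both conditions are checked
print(is_quotient(5, 10, 0))   # stops after the first condition

try:
    unsafe_order(5, 10, 0)
except ZeroDivisionError:
    print('unsafe_order divided by zero')
```

Because Python stops as soon as the first condition is false, the guard condition protects the division that follows it.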
Methods that Return Boolean Values
Methods that return either True or False.

Table 3.4 Methods that return either True or False.

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 1, © 2016 Pearson
Education, Inc., Hoboken, NJ. All rights reserved.
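Table 3.4 is not reproduced in this copy, so the selection below is an assumption about which methods it lists. All of them are standard string methods that return True or False.

```python
starts = 'hello'.startswith('he')   # does the string begin with 'he'?
ends = 'hello'.endswith('lo')       # does it end with 'lo'?
letters = 'abc'.isalpha()           # only letters?
mixed = 'abc1'.isalpha()            # a digit makes this False
digits = '123'.isdigit()            # only digits?
upper = 'HELLO'.isupper()           # all cased letters upper case?
blank = '   '.isspace()             # only whitespace?

print(starts, ends, letters, mixed, digits, upper, blank)
```

Such method calls can be used anywhere a condition is expected, for example to test user input with isdigit() before converting it with int().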
Conditional/Decision Structures
One-Way Decisions
x = 5
print('Before 5')
if x == 5 :
    print('Is 5')
    print('Is Still 5')
    print('Third 5')
print('Afterwards 5')
print('Before 6')
if x == 6 :
    print('Is 6')
    print('Is Still 6')
    print('Third 6')
print('Afterwards 6')

Output:
Before 5
Is 5
Is Still 5
Third 5
Afterwards 5
Before 6
Afterwards 6
Indentation
• Increase indent after an if statement or for statement (after : )
• Maintain indent to indicate the scope of the block (which lines are affected
by the if/for)
• Reduce indent back to the level of the if statement or for statement to
indicate the end of the block
• Blank lines are ignored - they do not affect indentation
• Comments on a line by themselves are ignored with regard to indentation
increase / maintain after if or for
decrease to indicate end of block

x = 5
if x > 2 :
    print('Bigger than 2')
    print('Still bigger')
print('Done with 2')

for i in range(5) :
    print(i)
    if i > 2 :
        print('Bigger than 2')
    print('Done with i', i)
print('All Done')
Think About begin/end Blocks
A compound statement stretches across more than one line: a header followed by an indented block.

x = 5
if x > 2 :                      # header
    print('Bigger than 2')      # indented block
    print('Still bigger')       # indented block
print('Done with 2')

for i in range(5) :
    print(i)
    if i > 2 :
        print('Bigger than 2')
    print('Done with i', i)
print('All Done')
The syntax of the if statement
if boolean_expression:
    statements...
[elif boolean_expression:
    statements...]
...
[else:
    statements...]

Based on: Murach's Python Programming, C3, Slide 11, © 2016, Mike Murach & Associates, Inc.
Nested Decisions

x = 42
if x > 1 :
    print('More than one')
    if x < 100 :
        print('Less than 100')
print('All done')
Two-way Decisions
• Sometimes we want to do one thing if a logical expression is true and something else if the expression is false
• It is like a fork in the road - we must choose one or the other path but not both

(Flowchart: x = 4; if x > 2 is yes, print('Bigger'); if no, print('Not bigger'); both paths continue to print('All Done'))
Two-way Decisions
with else:

x = 4

if x > 2 :
    print('Bigger')
else :
    print('Smaller')

print('All done')
Multi-way
if x < 2 :
    print('small')
elif x < 10 :
    print('Medium')
else :
    print('LARGE')
print('All done')
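The multi-way decision can be wrapped in a function so that several values of x can be tried. The function name size_label and the test values are hypothetical; the branches are the ones from the slide.

```python
def size_label(x):
    """Return the label that the multi-way decision would print for x."""
    if x < 2:
        return 'small'
    elif x < 10:
        return 'Medium'
    else:
        return 'LARGE'

for value in [0, 2, 5, 10, 20]:
    print(value, size_label(value))
```

Only the first branch whose condition is true runs, so the boundary value 2 is 'Medium' and 10 is 'LARGE'.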
The try / except Structure

• You surround a dangerous section of code with try and except


• If the code in the try works - the except is skipped
• If the code in the try fails - it jumps to the except section
try / except
astr = 'Bob'
try:
    print('Hello')        # safe line
    istr = int(astr)      # dangerous line!
    print('There')        # will not print if the previous line fails to execute
except:
    istr = -1             # safety net

print('Done', istr)
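A variation on the slide's example, offered as a sketch: it names the specific exception (ValueError) instead of using a bare except, so that unrelated errors are not silently hidden. The function name to_int and its fallback parameter are hypothetical.

```python
def to_int(text, fallback=-1):
    """Convert text to an int, returning fallback if the conversion fails."""
    try:
        return int(text)      # dangerous line: may raise ValueError
    except ValueError:
        return fallback       # safety net

print('Done', to_int('Bob'))
print('Done', to_int('123'))
```

Keeping only the dangerous line inside the try block makes it clear which statement the safety net is meant to protect.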


FUNCTIONS

2024-IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Functions
• Explain what a function is and describe its composition
• Use Python built-in functions in your code
• Define your own functions when appropriate
• Explain the difference between parameters and arguments and how they are used
• Create functions with one or more parameters and call them using the adequate arguments
• Pass multiple arguments to functions by position, by name and with default values
• Return values from functions
• Describe the difference between local scope and global scope of a variable
• Write code with correct variable scope
• Import a library module

IS6061 | Functions | S. Seppälä


Summary

Python Functions: the Store & Reuse Pattern


• Function Definition & Utility
• Built-in Functions
• User-defined Functions
• Multiple Parameters/Arguments
• Scope of Variables
• Library Modules
• Using Functions

IS6061 | Functions | S. Seppälä


Function Definition
• In Python a function is some reusable code that takes
argument(s) as input, does some computation, and then
returns a result or results
• We define a function using the def reserved word (i.e., “defined
function”)
• We call/invoke the function by using the function name,
parentheses, and arguments in an expression
Why Have Functions?
• Support divide-and-conquer strategy
• Abstraction of an operation
• Reuse. Once written, use again
• Sharing. If tested, others can use
• Security. Well tested, then secure for reuse
• Simplify code. More readable.

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Python Functions
• There are two kinds of functions in Python.
- Built-in functions that are provided as part of Python - print(), input(), type(), float(), int() ...
- Functions that we define ourselves and then use
• We treat function names as “new” reserved words (i.e., we avoid them as variable names)
Built-in Functions
• Like miniature programs
• Receive input
• Process the input
• Have output
• Table 4.1 Some Python built-in functions.

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Built-in Functions
• Output of functions is a single value
• Function is said to return its output
• Items inside parentheses called arguments
• Examples:

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
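The examples from the slide are not reproduced in this copy. The calls below are a stand-in selection of standard built-in functions, each taking arguments in parentheses and returning a single value.

```python
whole = int(3.7)              # drops the fractional part
distance = abs(-3)            # absolute value
rounded = round(2.567, 2)     # rounds to two decimal places
length = len('Hello world')   # number of characters, including the space
largest = max(3, 9, 4)        # largest of the arguments

print(whole, distance, rounded, length, largest)
```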
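big = max('Hello world')

(Argument: 'Hello world'. Result: 'w'. Assignment: the result is stored in big.)

>>> big = max('Hello world')
>>> print(big)
w
>>> tiny = min('Hello world')
>>> print(tiny)

>>>
(The smallest character in 'Hello world' is the space, so print(tiny) shows what looks like an empty line.)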
User-defined Functions
Building our Own Functions
• We create a new function using the def keyword followed by
optional parameters in parentheses

• We indent the body of the function

• This defines the function but does not execute the body of the
function
def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")
    print('I sleep all night and I work all day.')
Figure 5.1 Function Parts

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Definitions and Uses
• Once we have defined a function, we can call (or invoke) it
as many times as we like

• This is the store and reuse pattern


x = 5
print('Hello')

def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")
    print('I sleep all night and I work all day.')

print('Yo')
print_lyrics()
x = x + 2
print(x)

Output:
Hello
Yo
I'm a lumberjack, and I'm okay.
I sleep all night and I work all day.
7
Arguments
• An argument is a value we pass into the function as its input when we call the function
• We use arguments so we can direct the function to do different kinds of work when we call it at different times
• We put the arguments in parentheses after the name of the function

big = max('Hello world')
('Hello world' is the argument)
Parameters
A parameter is a variable which we use in the function definition. It is a “handle” that allows the code in the function to access the arguments for a particular function invocation.

>>> def greet(lang):
...     if lang == 'es':
...         print('Hola')
...     elif lang == 'fr':
...         print('Bonjour')
...     else:
...         print('Hello')
...
>>> greet('en')
Hello
>>> greet('es')
Hola
>>> greet('fr')
Bonjour
>>>
User-defined Functions
• Defined by statements of the form
    def function_name(par1, par2, ...):
        indented block of statements
        return expression
• par1, par2 are variables (called parameters)
• The expression evaluates to a value of any type
• Header must end with a colon
• Each statement in the block is indented by the same amount
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
User-defined Functions
• 3 ways to pass arguments to parameters
• Pass by position: arguments in calling statement matched to
the parameters in function header based on order
• Pass by keyword/parameter name: arguments can be passed
to functions by using names of the corresponding parameters
• Pass by default value: parameters of a function can have
default values assigned to them when no values are passed to
them

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
User-defined Functions
• Parameters and return statements are optional in function
definitions
• Function names should describe the role performed

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Return Values
Often a function will take its arguments, do some computation, and
return a value to be used as the value of the function call in the
calling expression. The return keyword is used for this.

def greet():
    return "Hello"

print(greet(), "Glenn")
print(greet(), "Sally")

Output:
Hello Glenn
Hello Sally
Arguments, Parameters, and
Results
>>> big = max('Hello world')
>>> print(big)
w

def max(inp):          # inp is the parameter
    blah
    blah
    for x in inp:
        blah
        blah
    return 'w'         # the result

(Argument 'Hello world' goes in through the parameter inp; the result 'w' comes back out.)
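The "blah" lines above stand for whatever the real max() does internally. A runnable sketch of the same idea is shown below; my_max is a hypothetical name and this is not how the built-in is actually implemented.

```python
def my_max(inp):
    """Return the largest item in inp, or None if inp is empty."""
    largest = None
    for x in inp:                          # visit each item once
        if largest is None or x > largest:
            largest = x                    # remember the biggest so far
    return largest

big = my_max('Hello world')   # 'Hello world' is the argument, inp the parameter
print(big)
```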
Passing argument to parameter
• For each argument in the function invocation, the
argument’s associated object is passed to the
corresponding parameter in the function

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Passing a Value to a Function
• Example 3: Program shows there is no change in the value of the argument
• The argument in a function call is a variable
• Object pointed to by the argument variable (not the argument variable itself) is passed to a parameter variable
• Object is immutable, there is no possibility that value of the argument variable will be changed by a function call

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
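The program for Example 3 is not reproduced in this copy. The sketch below illustrates the same point with hypothetical names (triple, amount): assigning to the parameter inside the function does not change the caller's variable.

```python
def triple(num):
    num = 3 * num     # rebinds the local name num to a new int object
    return num

amount = 2
result = triple(amount)

# amount still refers to the original, immutable int object.
print(amount, result)
```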
Multiple Parameters /
Arguments
• We can define more than one parameter in the function definition
• We simply add more arguments when we call the function
• We match the number and order of arguments and parameters

def addtwo(a, b):
    added = a + b
    return added

x = addtwo(3, 5)
print(x)

Output:
8
Functions Having Several Parameters

• Must be the same number of arguments as parameters in the function
• Data types of arguments’ values must be compatible with data types expected by the parameters
• Must also be in the same order

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Functions Having Several Parameters
• FIGURE 4.3 Passing arguments to a function. (Figure not reproduced; it labels the function call and the function.)

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Passing by Parameter Name
• Arguments can be passed to functions by using names of
the corresponding parameters
• Instead of relying on position

• Given

• Could use
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
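The "Given" and "Could use" code from the slide is not reproduced in this copy. As a stand-in, the sketch below uses a hypothetical function future_value to show the same call made by position and by parameter name.

```python
def future_value(principal, rate, years):
    """Return principal grown at the given yearly rate for a number of years."""
    return principal * (1 + rate) ** years

# Pass by position: the order of the arguments matters.
by_position = future_value(1000, 0.05, 2)

# Pass by parameter name: the order no longer matters.
by_name = future_value(years=2, principal=1000, rate=0.05)

print(by_position == by_name)
```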
How to use default values
in your function definitions
• You can specify a default value for any parameter in a function definition by assigning a value to the parameter. However, the parameters with default values must be coded last in the function definition.
• When you call a function, any arguments that have default values are optional. But you can override the default value for an argument by supplying that argument.
Based on: Murach's Python Programming, C4, Slide 16, © 2016, Mike Murach & Associates, Inc.
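A minimal sketch of a default value, using a hypothetical function calculate_tax and a made-up default rate of 0.25:

```python
def calculate_tax(amount, rate=0.25):
    """The parameter with a default value (rate) is coded last."""
    return amount * rate

default_rate = calculate_tax(100)          # rate falls back to 0.25
overridden = calculate_tax(100, 0.5)       # default overridden by position
named = calculate_tax(rate=0.5, amount=100)  # default overridden by name

print(default_rate, overridden, named)
```

Writing def calculate_tax(rate=0.25, amount) instead would be a syntax error, because a parameter without a default cannot follow one with a default.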
How to use named arguments
in your calling statements
• To code a named argument, code the name of the argument as it appears in the function definition, an equals sign (=), and the value or variable for the argument.
• If you call a function without named arguments, you must code them in the same sequence that they’re coded in the function definition.
• If you call a function with named arguments, you don’t have to code the arguments in the sequence that they are coded in the function definition.
• It’s a good practice to use named arguments for functions that have many arguments. This can improve the readability of the code and reduce errors.
Based on: Murach's Python Programming, C4, Slide 18, © 2016, Mike Murach & Associates, Inc.
Scope of Variables
Scope of a variable is the portion of the program that can
refer to it
• Local scope: A variable created inside a function belongs to
the local scope of that function and can only be used inside
that function.
• Global scope: A variable created in the main body of the
Python code is a global variable and belongs to the global
scope.

Based on: https://www.w3schools.com/python/python_scope.asp


Scope of Variables
• Variable created inside a function can only be accessed
by statements inside that function
• Ceases to exist when the function is exited
• Variable is said to be local to function or to have local
scope
• If variables created in two different functions have the
same name
• They have no relationship to each other
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
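Local and global scope can be seen side by side in a short sketch. The names total, add_bonus and read_global are hypothetical.

```python
total = 10                 # global variable, created in the main body

def add_bonus():
    total = 99             # a new local variable that happens to share the name
    return total

def read_global():
    return total + 1       # no local total here, so the global one is read

local_result = add_bonus()

# The assignment inside add_bonus did not touch the global variable.
print(local_result, total, read_global())
```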
A function’s namespace
• Each function maintains a namespace for names defined
locally within the function.

• Locally means one of two things:

• a name assigned within the function

• an argument received by calling the function

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Based on: Python Namespace and Scope: https://www.programiz.com/python-programming/namespace
Named Constants
• Program sometimes employs a special constant used
several times in program
• Convention programmers use
• Create a global variable
• Name written in uppercase letters with words separated by
underscore
• In Python, programmer is responsible for not changing
value of the variable
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
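A sketch of the naming convention, with hypothetical constants (INTEREST_RATE, MINIMUM_AGE) and made-up values:

```python
# Named constants: global variables written in upper case with underscores.
INTEREST_RATE = 0.04
MINIMUM_AGE = 18

def yearly_interest(balance):
    """Interest earned on balance in one year."""
    return balance * INTEREST_RATE

def may_register(age):
    """True if age meets the minimum age."""
    return age >= MINIMUM_AGE

print(yearly_interest(100), may_register(18), may_register(17))
```

Python does not stop a program from reassigning INTEREST_RATE; the upper-case name is only a signal to other programmers that the value is meant to stay fixed.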
Library Modules
• A library module is a file with extension .py
• Contains functions and variables
• Can be used (imported) by any program
• can be created in IDLE or any text editor
• Looks like an ordinary Python program
• To gain access to the functions and variables
• place a statement of the form import moduleName
at the beginning of the program

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 4, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
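The import statement works the same way for modules you write yourself and for modules that ship with Python. The sketch below uses the standard math module, since it is available on every installation; a module of your own would be imported by its file name without the .py extension.

```python
import math   # placed at the beginning of the program

root = math.sqrt(16)         # function from the module
rounded_down = math.floor(3.7)
circle_area = math.pi * 2 ** 2   # variable (constant) from the module

print(root, rounded_down, circle_area)
```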
Library Modules
• Create a Module:
• Use a Module:

Based on: Python Modules, W3Schools: https://www.w3schools.com/python/python_modules.asp


Docstring
• A triple quoted string just after the def is called a
docstring

• docstring is documentation of the function's purpose, to be used by other tools to tell the user what the function is used for
• It is shown if you do a help on a function:
help(my_function)
Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Docstring Example

Based on: Python Docstring, www.programiz.com/python-programming/docstrings
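The example from the slide is not reproduced in this copy. A minimal stand-in, using a hypothetical function square, could look like this:

```python
def square(n):
    """Return the square of the number n."""
    return n * n

# The docstring is stored on the function and is what help(square) displays.
print(square.__doc__)
print(square(4))
```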


How to write a function
• Does one thing. If it does too many things, it should be
broken down into multiple functions (refactored)

• Readable. How often should we say this? If you write it, it should be readable

• Reusable. If it does one thing well, then when a similar situation (in another program) occurs, use it there as well.

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
More on functions
• Complete. A function should check for all the cases
where it might be invoked. Check for potential errors.

• Not too long. Kind of synonymous with do one thing. Use it as a measure of doing too much.

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
LOOPS & ITERATIONS

2024-IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Loops and iteration
• List the different types of repeating statements and explain the difference
• Choose between a while and a for loop depending on the context
• Write while and for loops using the correct syntax
• Adequately use loop control (iteration) variables
• Use the break and continue statements, and explain how they work
• Describe what the code in given while and for loops does
• Use None constants and variables
• Write code with the is and is not operators

IS6061 | Loops & Iteration | S. Seppälä
Summary

• Repetition
• Indefinite loops: iteration with a while loop
• Breaking out of a loop with break
• Finishing an iteration with continue
• Definite loops: iteration with a for loop
• Loop patterns
• The None value
• The is and is not operators

While and For statements
• The while statement is the more general repetition construct. It
repeats a set of statements while some condition is True.

• The for statement is useful for iteration, moving through all the
elements of a data structure, one at a time.

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Indefinite Loops
The while Loop
• The while loop repeatedly executes an indented block of
statements as long as a certain condition is met.
• A while loop has the form:
while condition:                    # header: the continuation condition
    indented block of statements    # body of the loop
• The continuation condition is a boolean expression that
evaluates to True or False
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 3, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
General approach to a while
• Outside the loop, initialize the Boolean condition with a loop control
(iteration) variable.
– Initialize the variable, typically outside of the loop and before
the loop begins.
• Somewhere inside the loop, perform some operation that changes the
state of the program, eventually making the Boolean condition False
and exiting the loop.
– Modify the value of the control variable during the course of
the loop.
• You have to have both!
Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Repeated Steps
(Flowchart: n = 5; test n > 0; if Yes, print(n), n = n - 1 and test again; if No, print('Blastoff!').)

Program:
n = 5
while n > 0 :
    print(n)
    n = n - 1
print('Blastoff!')
print(n)

Output:
5
4
3
2
1
Blastoff!
0

Loops (repeated steps) have iteration variables that change each time through a
loop. Often these iteration variables go through a sequence of numbers.
Breaking Out of a Loop
• The break statement ends the current loop and jumps to the
statement immediately following the loop

• It is like a loop test that can happen anywhere in the body of the
loop
while True:
    line = input('> ')
    if line == 'done' :
        break
    print(line)
print('Done!')

Sample run:
> hello there
hello there
> finished
finished
> done
Done!
Finishing an Iteration with continue
The continue statement ends the current iteration and jumps to the
top of the loop and starts the next iteration

while True:
    line = input('> ')
    if line[0] == '#' :
        continue
    if line == 'done' :
        break
    print(line)
print('Done!')

Sample run:
> hello there
hello there
> # don't print this
> print this!
print this!
> done
Done!
Indefinite Loops

• While loops are called “indefinite loops” because they keep
going until a logical condition becomes False
• The loops we have seen so far are pretty easy to examine to
see if they will terminate or if they will be “infinite loops”
• Sometimes it is a little harder to be sure if a loop will terminate
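As an illustration of a loop whose termination is hard to reason about, here is a small sketch based on the Collatz rule (halve even numbers, replace odd n with 3n + 1). The function name collatz_steps is a hypothetical choice for this example. The loop stops for every starting value anyone has tried, but whether it stops for all positive integers is a well-known open question.

```python
def collatz_steps(n):
    """Count the steps needed to reach 1 from a positive integer n."""
    steps = 0
    while n != 1:              # no obvious reason why this must become False
        if n % 2 == 0:
            n = n // 2         # even: halve it
        else:
            n = 3 * n + 1      # odd: triple it and add one
        steps = steps + 1
    return steps


print(collatz_steps(6))  # 8  (6, 3, 10, 5, 16, 8, 4, 2, 1)
```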


Definite Loops
Iterating over a set of items…
Definite Loops
• Quite often we have a list of items or the lines in a file -
effectively a finite set of things
• We can write a loop to run the loop once for each of the items in
a set using the Python for construct
• These loops are called “definite loops” because they execute an
exact number of times
• We say that “definite loops iterate through the members of a set”


A Simple Definite Loop
(Flowchart: Done? If No, move i ahead and print(i); if Yes, print('Blastoff!').)

for i in [5, 4, 3, 2, 1] :
    print(i)
print('Blastoff!')

Output:
5
4
3
2
1
Blastoff!

Definite loops (for loops) have explicit iteration variables
that change each time through a loop. These iteration
variables move through the sequence or set.
The for Loop
• Used to iterate through a sequence of values
• General form of a for loop:
for var in sequence:
    indented block of statements
• Sequence can be
• Arithmetic progression of numbers
• String
• List
• File object
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 3, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Looking at in...
• The iteration variable “iterates” through the sequence (ordered set)
• The block (body) of code is executed once for each value in the sequence
• The iteration variable moves through all of the values in the sequence

for i in [5, 4, 3, 2, 1] :    # i is the iteration variable; the list is a five-element sequence
    print(i)
The syntax of a for loop with the range() function
for int_var in range_function:
    statements...
The range() function
range(stop)
range(start, stop[, step])
Examples of the range() function
range(5) # 0, 1, 2, 3, 4
range(1, 6) # 1, 2, 3, 4, 5
range(2, 10, 2) # 2, 4, 6, 8
range(5, 0, -1) # 5, 4, 3, 2, 1
Based on: Murach's Python Programming, C3, Slide 32, © 2016, Mike Murach & Associates, Inc.
Step Values for the range() Function
• If a negative step value is used and the initial value is greater than
the terminating value, the range function generates a decreasing sequence

• Examples:
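The slide's own examples are not reproduced in this text. As a stand-in sketch, the calls below show a negative step producing a decreasing sequence, and what happens when the initial value is not greater than the terminating value. The variable names are illustrative only.

```python
# Negative step with start > stop: a decreasing sequence (stop is excluded).
by_twos = list(range(10, 0, -2))
countdown = list(range(5, 0, -1))

# Negative step with start < stop: nothing to generate.
empty = list(range(0, 5, -1))

print(by_twos)    # [10, 8, 6, 4, 2]
print(countdown)  # [5, 4, 3, 2, 1]
print(empty)      # []
```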

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 3, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Making “smart” loops
• Loop pattern to solve a problem
• The trick is “knowing” something about the whole loop when you are stuck
writing code that only sees one entry at a time

The pattern:
– Set some variables to initial values
– for thing in data: look for something or do something to each entry
separately, updating a variable
– Look at the variables

Example: sum values in data
x = 0                  # running total
for thing in data:
    x = x + thing
print(x)               # real total
The None Value
• Python has an object called None that is used to denote a lack of
value, and has no methods.
• The None keyword is used to define a null value, or no value at all.
• None is not the same as 0, False, or an empty string. None is a
data type of its own (NoneType) and only None can be None.

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 3, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved. & Python None Keyword, W3Schools
(https://www.w3schools.com/python/ref_keyword_none.asp)
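A short sketch of the points above, using a hypothetical variable named result: None compares unequal to 0, False and the empty string, and it has its own type.

```python
result = None

print(result is None)    # True
print(result == 0)       # False
print(result == False)   # False
print(result == '')      # False
print(type(result))      # <class 'NoneType'>
```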
The is and is not Operators
• Python has an is operator that can be used in logical expressions
• Implies “is the same as” (the very same object, not merely an equal value)
• Similar to, but stronger than ==
• is not also is a logical operator

smallest = None
print('Before')
for value in [3, 41, 12, 9, 74, 15] :
    if smallest is None :
        smallest = value
    elif value < smallest :
        smallest = value
    print(smallest, value)
print('After', smallest)
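To make the difference between is and == concrete, here is a small sketch with illustrative list names. Two separately built lists are equal in value but are not the same object, while a second name bound to the same list is.

```python
first = [3, 41, 12]
second = [3, 41, 12]   # equal contents, but a separate list object
alias = first          # another name for the very same object

print(first == second)      # True: same values
print(first is second)      # False: two different objects
print(alias is first)       # True: one object, two names
print(first is not second)  # True
```

For None, which is a single shared object, the comparison `x is None` is the usual idiom.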
STRINGS

2024-IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Strings
• Write strings in different scenarios
• Read and convert strings
• Use string indexing and slicing to access characters within strings
• Get the length of a string
• Loop through strings and count characters or substrings within them
• Perform string operations
• Use string libraries
• Describe what UTF-8 encoding is

IS6061 | Strings | S. Seppälä


Summary

• String type
• Non-printing characters
• Read & convert
• Indexing
• String length
• Looping through strings & counting
• More string operations: slicing, concatenation,
repetition, comparison
• String library: introduction & common functions
• UTF-8



String Data Type
• A string is a sequence of characters
• A string literal uses quotes: 'Hello' or "Hello"
• For strings, + means “concatenate”
• When a string contains numbers, it is still a string
• We can convert numbers in a string into a number using int()

>>> str1 = "Hello"
>>> str2 = 'there'
>>> bob = str1 + str2
>>> print(bob)
Hellothere
>>> str3 = '123'
>>> str3 = str3 + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
>>> x = int(str3) + 1
>>> print(x)
124
>>>
Strings
Can use single or double quotes:
S = "spam"
s = 'spam'
Just don't mix them
my_str = 'hi mom"    => ERROR (mismatched quotes)
Inserting an apostrophe:
A = "knight's" # mix up the quotes
B = 'knight\'s' # escape single quote with the \ character

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
And then there is """ """
• triple quotes preserve both the vertical and horizontal formatting
of the string
• allows you to type tables, paragraphs, whatever and preserve
the formatting

"""this is
a test
today"""

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Non-printing Characters
• If inserted directly, are preceded by a backslash (the escape
character \)
• new line '\n'
• tab '\t'

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
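A quick sketch of the two escape sequences listed above, with an illustrative string. Each escape is written with two keystrokes but counts as a single character.

```python
text = 'Name:\tChuck\nDone'

print(text)          # the tab and the newline are rendered
print(repr(text))    # repr() shows the escapes instead of rendering them
print(len('\n'))     # 1: a newline is one character
print(len('a\tb'))   # 3: a, tab, b
```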
Reading and Converting
• We prefer to read data in using strings and then parse and
convert the data as we need
• This gives us more control over error situations and/or bad user input
• Input numbers must be converted from strings

>>> name = input('Enter:')
Enter:Chuck
>>> print(name)
Chuck
>>> apple = input('Enter:')
Enter:100
>>> x = apple - 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'int'
>>> x = int(apple) - 10
>>> print(x)
90
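One way to get the "control over error situations" mentioned above is to wrap the conversion in try/except. The helper below is a sketch with a hypothetical name, to_int; it takes the text that input() would return and gives back None when the text is not a whole number.

```python
def to_int(text):
    """Convert text to an int, or return None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


print(to_int('100'))     # 100
print(to_int(' 42 '))    # 42: int() ignores surrounding whitespace
print(to_int('ten'))     # None
```

In a program, the call would look like `value = to_int(input('Enter:'))`, followed by a check for None.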
The Index
• Because the elements of a string are a sequence, we can
associate each element with an index, a location in the
sequence:
• positive values count up from the left, beginning with index 0
• negative values count down from the right, starting with -1

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Looking Inside Strings
• We can get at any single character in a string using the index
specified in square brackets
• The index value must be an integer and starts at zero
• The index value can be an expression that is computed

b a n a n a
0 1 2 3 4 5

>>> fruit = 'banana'
>>> letter = fruit[1]
>>> print(letter)
a
>>> x = 3
>>> w = fruit[x - 1]
>>> print(w)
n
Accessing an element: Summary
A particular element of the string is accessed by the index of the
element surrounded by square brackets [ ]
hello_str = 'Hello World'
print(hello_str[1]) => prints e
print(hello_str[-1]) => prints d
print(hello_str[11]) => ERROR

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Strings Have Length

The built-in function len gives us the length of a string

b a n a n a
0 1 2 3 4 5

>>> fruit = 'banana'
>>> print(len(fruit))
6
Looping Through Strings
• A definite loop using a for statement is much more elegant
• The iteration variable is completely taken care of by the for loop

fruit = 'banana'
for letter in fruit :
    print(letter)

The equivalent while loop has to manage the index itself:

index = 0
while index < len(fruit) :
    letter = fruit[index]
    print(letter)
    index = index + 1

Output (either loop):
b
a
n
a
n
a
Looping and Counting
This is a simple loop that loops through each letter in a
string and counts the number of times the loop encounters
the 'a' character

word = 'banana'
count = 0
for letter in word :
    if letter == 'a' :
        count = count + 1
print(count)
Looking Deeper into in
• The iteration variable “iterates” through the sequence (ordered set)
• The block (body) of code is executed once for each value in the sequence
• The iteration variable moves through all of the values in the sequence

for letter in 'banana' :    # letter is the iteration variable; 'banana' is a six-character string
    print(letter)
Slicing, the rules
• slicing is the ability to select a subsequence of the overall
sequence
• uses the syntax [start:finish], where:
• start is the index of where we start the subsequence
• : is the colon operator
• finish is the index of one after where we end the subsequence (“up
to but not including”)
• if either start or finish are not provided, it defaults to the
beginning of the sequence for start and the end of the sequence
for finish
Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
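The slicing rules can be checked with a few small slices. This sketch reuses the 'Hello World' string from the indexing summary; the particular start and finish values are illustrative.

```python
hello_str = 'Hello World'

print(hello_str[6:10])   # Worl   (indices 6, 7, 8, 9: finish is not included)
print(hello_str[:5])     # Hello  (start defaults to the beginning)
print(hello_str[6:])     # World  (finish defaults to the end)
print(hello_str[:])      # Hello World  (a copy of the whole string)
print(hello_str[-3:])    # rld    (negative indices count from the right)
```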
Extended Slicing
• also takes three arguments:
• [start:finish:countBy]
• defaults are:
• start is beginning, finish is end, countBy is 1
my_str = 'hello world'
my_str[0:11:2] => 'hlowrd'
• every other letter

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Basic String Operations
s = 'spam'
• length operator len()
len(s) => 4
• + is concatenate
new_str = 'spam' + '-' + 'spam-'
print(new_str) => spam-spam-
• * is repeat, the number is how many times
new_str * 3 => 'spam-spam-spam-spam-spam-spam-'

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
Some Details
• Both + and * on strings make a new string; they do not modify the
arguments
• The order of the operands is important for concatenation, irrelevant for
repetition
• The types required are specific. For concatenation you need two
strings, for repetition a string and an integer

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
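The three details above can be verified directly. This is a small sketch with illustrative variable names.

```python
s = 'spam'
t = s + '!'      # builds a new string
u = s * 2        # builds a new string

print(s, t, u)                 # spam spam! spamspam  (s itself is unchanged)
print('ab' + 'cd')             # abcd
print('cd' + 'ab')             # cdab: order matters for concatenation
print('ab' * 3 == 3 * 'ab')    # True: order does not matter for repetition

# Concatenation needs two strings; mixing in an integer raises TypeError.
try:
    s + 3
except TypeError as error:
    print('TypeError:', error)
```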
Using in as a Logical Operator
• The in keyword can also be used to check to see if one
string is “in” another string
• The in expression is a logical expression that returns True or False
and can be used in an if statement

>>> fruit = 'banana'
>>> 'n' in fruit
True
>>> 'm' in fruit
False
>>> 'nan' in fruit
True
>>> if 'a' in fruit :
...     print('Found it!')
...
Found it!
>>>
String Comparison
if word == 'banana':
    print('All right, bananas.')

if word < 'banana':
    print('Your word, ' + word + ', comes before banana.')
elif word > 'banana':
    print('Your word, ' + word + ', comes after banana.')
else:
    print('All right, bananas.')
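One thing worth knowing about the comparisons above is that they use the characters' integer values, so every uppercase letter sorts before every lowercase letter. A brief sketch with illustrative words:

```python
print('apple' < 'banana')    # True
print('Zebra' < 'apple')     # True: 'Z' (90) comes before 'a' (97)
print(ord('Z'), ord('a'))    # 90 97

# Converting both sides to one case gives the alphabetical order people expect.
print('Zebra'.lower() < 'apple'.lower())   # False
```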
String Library
• Python has a number of string functions which are in the string library
• These functions are already built into every string - we
invoke them by appending the function to the string variable
• These functions do not modify the original string, instead they
return a new string that has been altered

>>> greet = 'Hello Bob'
>>> zap = greet.lower()
>>> print(zap)
hello bob
>>> print(greet)
Hello Bob
>>> print('Hi There'.lower())
hi there
>>>
>>> stuff = 'Hello world'
>>> type(stuff)
<class 'str'>
>>> dir(stuff)
['capitalize', 'casefold', 'center', 'count', 'encode',
'endswith', 'expandtabs', 'find', 'format', 'format_map',
'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit',
'isidentifier', 'islower', 'isnumeric', 'isprintable',
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
'lstrip', 'maketrans', 'partition', 'replace', 'rfind',
'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split',
'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']

https://docs.python.org/3/library/stdtypes.html#string-methods

See main slides for examples of string library use.
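Since the main slides are not part of this text, here is a small sketch of a few of the methods listed by dir() above. The sample strings are illustrative; each call returns a new value and leaves the original string unchanged.

```python
line = '  Hello Bob  '
clean = line.strip()                  # remove whitespace at both ends

print(clean)                          # Hello Bob
print(clean.upper())                  # HELLO BOB
print(clean.replace('Bob', 'Jane'))   # Hello Jane
print(clean.startswith('Hello'))      # True
print('banana'.find('na'))            # 2: index of the first match
print('banana'.find('z'))             # -1: not found
print('banana'.count('a'))            # 3
print(repr(line))                     # '  Hello Bob  ' (the original is untouched)
```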


Two Kinds of Strings
Python 2.7.10
>>> x = '이광춘'
>>> type(x)
<type 'str'>
>>> x = u'이광춘'
>>> type(x)
<type 'unicode'>
>>>

Python 3.5.1
>>> x = '이광춘'
>>> type(x)
<class 'str'>
>>> x = u'이광춘'
>>> type(x)
<class 'str'>
>>>

In Python 3, all strings are Unicode


String Representation
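• Every character is "mapped" (associated) with an integer, its Unicode code point
• UTF-8 is an encoding of Unicode: it stores each code point as one to four bytes
• The function ord() takes a character and returns its Unicode integer value
(for ASCII characters this is also the UTF-8 byte value); chr() takes an
integer and returns the corresponding character.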

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017
Pearson Education, Ltd.
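A short sketch of ord(), chr() and UTF-8 encoding, using 'A' and the accented letter é as illustrative characters. It shows that a character's integer value and the bytes used to store it in UTF-8 are related but not the same thing.

```python
print(ord('A'))              # 65
print(chr(97))               # a

e_acute = chr(233)           # the character é
print(ord(e_acute))          # 233

# Encoding turns characters into bytes.
print('A'.encode('utf-8'))       # b'A': one byte
print(e_acute.encode('utf-8'))   # b'\xc3\xa9': two bytes
print(len(e_acute), len(e_acute.encode('utf-8')))   # 1 2
```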
FILES

2024-IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Files
• Explain how files are accessed (input-output stream)
• Explain what a file and a file handle are
• Use the open() & close() functions and the with statement to open & close files
• Use try/except to avoid issues when opening files
• Use the different file modes for reading and writing
• Use loops and file methods to read files
• Search in files
• Write to files

IS6061 | Files | S. Seppälä


Summary

• What is a file?
• Opening a file
• Reading a file
• Searching in a file
• Writing to a file

• The PY4E file of mail interactions used in the slide examples is available from
– www.py4e.com/code3/mbox.txt
• and a shortened version of the file is available from
– www.py4e.com/code3/mbox-short.txt
What is a file?

• A file is a collection of data stored on secondary storage (e.g.,
hard drive, thumb drive, cloud, etc.).
• A text file consists of a sequence of lines that end with a newline
character (\n).
• Newline is one character - not two.

Based on: PY4E, Chapter 7 & "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 6,
Copyright © 2017 Pearson Education, Ltd.
Accessing a file

• Accessing a file means establishing a connection between the file
and the program and moving data between the two (input-output
streams).
• The open() function returns a “file handle” - a variable (file
object) used to perform operations on the file.
• Reading from a disk is slow, therefore the data read from a file is
buffered in the file object for quick access.
• The file object thus contains a cache, i.e., a copy of information
from the file.

Based on: PY4E, Chapter 7 & "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 6,
Copyright © 2017 Pearson Education, Ltd.
Making a file object

my_file = open("my_file.txt", "r")

• my_file is the file object that contains the buffer of information.
• The open function creates the connection between the disk file
and the file object.
• The first quoted string ("my_file.txt") is the file name on disk,
the second is the mode to open it (here, "r" means to read)

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 6, Copyright © 2017
Pearson Education, Ltd.
Different modes

Table 6.1 File modes.

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 6, Copyright © 2017
Pearson Education, Ltd.
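Table 6.1 itself is not reproduced in this text. As a hedged illustration of three commonly used modes ('w' to write, 'a' to append, 'r' to read), the sketch below works on a throwaway file in a temporary folder; the file name demo.txt is an assumption made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')

    with open(path, 'w') as outfile:    # 'w' creates the file
        outfile.write('first\n')

    with open(path, 'w') as outfile:    # 'w' again: existing contents are replaced
        outfile.write('second\n')

    with open(path, 'a') as outfile:    # 'a' adds to the end of the file
        outfile.write('third\n')

    with open(path, 'r') as infile:     # 'r' reads (also the default mode)
        contents = infile.read()

print(contents, end='')
# second
# third
```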
Summary of open()

• handle = open(filename, mode)


• returns a handle used to manipulate the file
• filename is a string
• mode is optional
– 'r' to read the file
– 'w' if we are going to write to the file
• Example
fhand = open('mbox.txt', 'r')

Based on: PY4E, Chapter 7


Missing & Bad File Names

fname = input('Enter the file name: ')    # prompt for file name
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    quit()                                 # terminates the program

count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print('There were', count, 'subject lines in', fname)

Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt

Enter the file name: na na boo boo
File cannot be opened: na na boo boo

Based on: PY4E, Chapter 7
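The same idea can be packaged as a function, which avoids quit() and catches only file-related errors. This is a sketch, not the PY4E code: the function name count_subject_lines, the sample file and its contents are assumptions made so the example runs without mbox.txt.

```python
import os
import tempfile


def count_subject_lines(fname):
    """Return how many lines start with 'Subject:', or None if the file cannot be opened."""
    try:
        with open(fname) as fhand:
            count = 0
            for line in fhand:
                if line.startswith('Subject:'):
                    count = count + 1
            return count
    except OSError:          # covers missing files and other open() failures
        return None


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.txt')
    with open(path, 'w') as outfile:
        outfile.write('From: a@example.com\nSubject: hello\nSubject: again\n')

    found = count_subject_lines(path)
    missing = count_subject_lines(os.path.join(folder, 'na na boo boo'))

print(found)     # 2
print(missing)   # None
```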
Three read methods of a file object

• read(): Places the entire contents of the file (including the newline
characters) into a single string.
• readlines(): Read and return a list of lines from the stream.
• readline(): Read and return one line from the stream.

Based on: Murach's Python Programming, C7, Slide 10, © 2016, Mike Murach & Associates, Inc. & David I.
Schneider, An Introduction to Programming Using Python, Chapter 5, © 2016 Pearson Education, Inc.,
Hoboken, NJ. All rights reserved & https://docs.python.org/3.7/library/io.html#io.IOBase.readlines
Reading a file
• How to use a loop to read each line of the file
with open("members.txt") as file:
    for line in file:
        print(line, end="")
    print()

• How to read the entire file as a string
with open("members.txt") as file:
    contents = file.read()
    print(contents)

• How to read each line of the file
with open("members.txt") as file:
    member1 = file.readline()
    print(member1, end="")
    member2 = file.readline()
    print(member2)

• How to read the entire file as a list
with open("members.txt") as file:
    members = file.readlines()
    print(members[0], end="")
    print(members[1])

• The result that is printed to the console
John Cleese
Eric Idle

Based on: Murach's Python Programming, C7, Slide 11, © 2016, Mike Murach & Associates, Inc.
Opening a file in write mode and closing the file manually

outfile = open("test.txt", "w")
outfile.write("Test")
outfile.close()
⚠️ Risk of data loss if not closed!

Based on: Murach's Python Programming, C7, Slide 6, © 2016, Mike Murach & Associates, Inc.
Opening and closing files using with statements
💡 Preferred way to avoid issues

• The syntax of the with statement for file I/O
with open(file, mode) as file_object:
    statements...
• Code that opens a text file in write mode and automatically closes it
with open("test.txt", "w") as outfile:
    outfile.write("Test")
• Code that opens a text file in read mode and automatically closes it
with open("test.txt", "r") as infile:
    print(infile.readline())

Based on: Murach's Python Programming, C7, Slide 7, © 2016, Mike Murach & Associates, Inc.
The write method
• The write() method of a file object
write(str)
• How to write one line to a text file
with open("members.txt", "w") as file:
    file.write("John Cleese\n")
• How to append one line to a text file
with open("members.txt", "a") as file:
    file.write("Eric Idle\n")

Based on: Murach's Python Programming, C7, Slide 8, © 2016, Mike Murach & Associates, Inc.
Searching Through a File

• if statement in for loop to only print lines that meet some criteria.
• strip() removes “white space” from the left and right-hand sides of the string.
• Look for a string anywhere in a line as our selection criteria.

fhand = open('mbox-short.txt', 'r')
for line in fhand:
    line = line.strip()
    if line.startswith('From:'):
        print(line)
    if '@' in line:
        print(line)
fhand.close()
IS6061 | Files | S. Seppälä Based on: PY4E, Chapter 7
LISTS

2024-IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Lists
• Describe what data structures are & give examples
• Describe the properties of lists
• Create and populate lists using (empty) brackets and list methods
• Explain what a mutable object is & give examples of mutable and immutable objects
• Access and modify lists through indexing, slicing and list methods
• Use the split and join string methods to convert lists to/from strings
• Use list functions & Python built-in functions with lists
• Search in lists with "in"
• Construct lists with list comprehension

IS6061 | Lists | S. Seppälä


Summary

• Data Structures
• What is a list?
• Accessing and modifying lists
• List functions & methods
• List comprehension

IS6061 | Lists | S. Seppälä


Data Structure
• A particular way of organizing data in a computer to make some operation
easier or more efficient (e.g., an ordered sequence of characters is a way of
organizing data)
• Data structures are suited to solving certain problems, and they are often
associated with algorithms
• Implemented in a data type (e.g., a string)
• Roughly two kinds of data structures:
– Built-in data structures, data structures that are so common as to be provided by
default (e.g., lists, tuples, string, dictionaries, sets)
– User-defined data structures (classes in object-oriented programming) that are
designed for a particular task

Based on: PY4E, Chapter 8 & "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 7,
IS6061 | Lists | S. Seppälä
Copyright © 2017 Pearson Education, Ltd.
The list Object
• A list is a kind of collection: it lets us store many values in a single variable.
• A list is an ordered sequence of Python objects
– Objects can be of any type
– Objects do not have to all be the same type
– Constructed by writing items enclosed in square brackets and
separated by commas

friends = ['Joseph', 'Glenn', 'Sally']
carryon = ['socks', 'shirt', 'perfume']
many_types = ['red', 24, [5, 6], 98.6]
empty_list = []     # OR
other_empty_list = list()

Based on: PY4E, Chapter 8 & David I. Schneider, An Introduction to Programming Using Python, Chapter 2, ©
IS6061 | Lists | S. Seppälä
2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Looping through items in a list

With a for loop

scores = [70, 80, 90, 100]
total = 0
for score in scores:
    total += score
print(total)     # 340

With a while loop

scores = [70, 80, 90, 100]
total = 0
i = 0
while i < len(scores):
    total += scores[i]
    i += 1
print(total)     # 340

IS6061 | Lists | S. Seppälä Based on: Murach's Python Programming, C6, Slide 16, © 2016, Mike Murach & Associates, Inc.
Looking inside lists: indexing

• Access any single element in a list using an index specified in square brackets.
>>> friends = [ 'Joseph', 'Glenn', 'Sally' ]
>>> print(friends[1])
Glenn
>>>

• Indexing with list of lists (or list of tuples):
my_list = ['a', [1, 2, 3], 'z']
my_list[1][0]     # apply left to right
my_list[1] → [1, 2, 3]
[1, 2, 3][0] → 1

Based on: PY4E, Chapter 8 & "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 7,
IS6061 | Lists | S. Seppälä
Copyright © 2017 Pearson Education, Ltd.
Lists are Mutable
• Strings are “immutable” - we cannot change the contents of a string - we must make a new string to make any change
• Lists are “mutable” - we can change an element of a list using the index operator
>>> fruit = 'Banana'
>>> fruit[0] = 'b'
Traceback
TypeError: 'str' object does not support item assignment
>>> x = fruit.lower()
>>> print(x)
banana
>>> lotto = [2, 14, 26, 41, 63]
>>> print(lotto)
[2, 14, 26, 41, 63]
>>> lotto[2] = 28
>>> print(lotto)
[2, 14, 28, 41, 63]

IS6061 | Lists | S. Seppälä Based on: PY4E, Chapter 8


How long is a list?
• The len() function takes a list as a parameter and returns the number of elements in the list
• Actually len() tells us the number of elements of any set or sequence (e.g., a string)
>>> greet = 'Hello Bob'
>>> print(len(greet))
9
>>> x = [ 1, 2, 'joe', 99]
>>> print(len(x))
4
>>>
IS6061 | Lists | S. Seppälä Based on: PY4E, Chapter 8
Using the range function
• The range function returns a sequence of numbers that range from zero to one less than the parameter. In Python 3 it returns a range object, so we wrap it in list() to see the numbers.
>>> print(list(range(4)))
[0, 1, 2, 3]
>>> friends = ['Joseph', 'Glenn', 'Sally']
>>> print(len(friends))
3
>>> print(list(range(len(friends))))
[0, 1, 2]

• We can construct an index loop using for and an integer iterator
friends = ['Joseph', 'Glenn', 'Sally']

for friend in friends :
    print('Happy New Year:', friend)

for i in range(len(friends)) :
    friend = friends[i]
    print('Happy New Year:', friend)
IS6061 | Lists | S. Seppälä Based on: PY4E, Chapter 8
Concatenating Lists Using +

• We can create a new list by adding two existing lists together
>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> c = a + b
>>> print(c)
[1, 2, 3, 4, 5, 6]
>>> print(a)
[1, 2, 3]

IS6061 | Lists | S. Seppälä Based on: PY4E, Chapter 8


Slices

• A slice of a list is a sublist specified with colon (:) notation


– Analogous to a slice of a string
– Remember: Just like in strings, the second number is “up to but not
including”

Table 2.6 Meanings of slice notations
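The slice notations can be illustrated with a short sketch. The list letters and the variable names are assumptions chosen for illustration; they are not taken from Table 2.6.

```python
letters = ['a', 'b', 'c', 'd', 'e']

middle = letters[1:3]     # from index 1 up to but not including index 3
first_two = letters[:2]   # an omitted start means "from the beginning"
from_three = letters[3:]  # an omitted end means "through the last item"
last_two = letters[-2:]   # negative indexes count from the end
copy = letters[:]         # a new list with the same items

print(middle)      # ['b', 'c']
print(first_two)   # ['a', 'b']
print(from_three)  # ['d', 'e']
print(last_two)    # ['d', 'e']
print(copy == letters, copy is letters)  # True False
```

A slice always produces a new list, so changing copy afterwards leaves letters untouched.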


Based on: PY4E, Chapter 8 & David I. Schneider, An Introduction to Programming Using Python, Chapter 2, ©
IS6061 | Lists | S. Seppälä
2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
List methods

• Remember, a function is a small program (such as len) that takes some arguments (the values in the parentheses) and returns some value
• A method is a function called in a special way, the dot call. It is called in the context of an object (or a variable associated with an object).

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 7, Copyright © 2017
IS6061 | Lists | S. Seppälä
Pearson Education, Ltd.
List functions & methods
team = ["Seahawks", 2014, "CenturyLink Field"]
nums = [5, 10, 4, 5]
words = ["spam", "ni"]

Table 2.5 List operations

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 2, © 2016 Pearson
IS6061 | Lists | S. Seppälä
Education, Inc., Hoboken, NJ. All rights reserved.
Built-in Functions and Lists

• There are a number of functions built into Python that take lists as parameters
>>> nums = [3, 41, 12, 9, 74, 15]
>>> print(len(nums))
6
>>> print(max(nums))
74
>>> print(min(nums))
3
>>> print(sum(nums))
154
>>> print(sum(nums)/len(nums))     # find average
25.666666666666668

IS6061 | Lists | S. Seppälä Based on: PY4E, Chapter 8


List methods for modifying a list
stats = [48.0, 30.5, 20.2, 100.0]
inventory = ["staff", "hat", "shoes", "bread", "potion"]

stats.append(99.5)              # [48.0, 30.5, 20.2, 100.0, 99.5]
inventory.insert(3, "robe")     # ["staff", "hat", "shoes", "robe", "bread", "potion"]
inventory.remove("shoes")       # ["staff", "hat", "robe", "bread", "potion"]

inventory = ["staff", "hat", "robe", "bread"]
item = inventory.pop()          # item = "bread"
                                # inventory = ["staff", "hat", "robe"]
item = inventory.pop(1)         # item = "hat"
                                # inventory = ["staff", "robe"]
Based on: Murach's Python Programming, C6, Slide 8, © 2016, Mike Murach & Associates, Inc.
The split and join methods

• The split method turns a single string into a list of substrings.
• The join method turns a list of strings into a single string.
• Notice that these methods are inverses of each other.
• You can specify what delimiter (one or more characters) to use in the splitting.
• When you do not specify a delimiter, any run of whitespace (spaces, tabs, newlines) is treated like one delimiter.

Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 2, © 2016 Pearson
IS6061 | Lists | S. Seppälä
Education, Inc., Hoboken, NJ. All rights reserved.
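The inverse relationship between split and join, and the special handling of whitespace, can be sketched as follows. The sample strings and variable names are assumptions for illustration only.

```python
csv_line = "Snap,Crackle,Pop"

parts = csv_line.split(",")   # string -> list of substrings
rejoined = ",".join(parts)    # list of strings -> string

print(parts)                  # ['Snap', 'Crackle', 'Pop']
print(rejoined == csv_line)   # True: join undoes split

# Without a delimiter, runs of whitespace count as a single separator
# and leading/trailing whitespace is ignored.
messy = "  to   be  or not  "
words = messy.split()
print(words)                  # ['to', 'be', 'or', 'not']
print(" ".join(words))        # to be or not

# With an explicit delimiter, consecutive delimiters produce empty strings.
gaps = "a,,b".split(",")
print(gaps)                   # ['a', '', 'b']
```

The last case is a common surprise: only the no-argument form of split() merges repeated separators.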
The split and join methods
• These statements each display the list ['a', 'b', 'c'].
print("a,b,c".split(','))
print("a**b**c".split('**'))
print("a\nb\nc".split())
print("a b c".split())
print("a    b    c".split())

• This program shows how the join method is used to display items from a list of strings.
line = ["To", "be", "or", "not",
        "to", "be."]
print(" ".join(line))
krispies = ["Snap", "Crackle", "Pop"]
print(", ".join(krispies))
[Run]
To be or not to be.
Snap, Crackle, Pop
Based on: David I. Schneider, An Introduction to Programming Using Python, Chapter 2, © 2016 Pearson
IS6061 | Lists | S. Seppälä
Education, Inc., Hoboken, NJ. All rights reserved.
List operators

[1, 2, 3] + [4] → [1, 2, 3, 4]
[1, 2, 3] * 2 → [1, 2, 3, 1, 2, 3]
1 in [1, 2, 3] → True
[1, 2, 3] < [1, 2, 4] → True
Comparison goes index by index; the first difference determines the result.
Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 7, Copyright © 2017
IS6061 | Lists | S. Seppälä
Pearson Education, Ltd.
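The index-by-index comparison rule can be checked with a few extra cases. This is a small sketch; the example lists are assumptions chosen to show what happens at the first difference and when one list is a prefix of the other.

```python
# The first differing pair decides: 3 < 4, so the left list is smaller.
first_difference = [1, 2, 3] < [1, 2, 4]

# Later items are ignored once a difference is found: 2 > 1 decides here.
early_decision = [2] < [1, 99]

# If one list is a prefix of the other, the shorter list is smaller.
prefix_rule = [1, 2] < [1, 2, 0]

# Strings inside lists are compared alphabetically ('pear' < 'plum').
with_strings = ["apple", "pear"] < ["apple", "plum"]

# Repetition and membership from the slide, for completeness.
zeros = [0] * 3
missing = 4 not in [1, 2, 3]

print(first_difference, early_decision, prefix_rule, with_strings)  # True False True True
print(zeros, missing)  # [0, 0, 0] True
```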
Constructing lists with list comprehension

[n for n in range(1, 5)]   returns   [1, 2, 3, 4]
• Mark the comprehension with [].
• The first n is what we collect.
• for n in range(1, 5) is what we iterate through.
• Note that we iterate over a set of values and collect some (in this case all) of them.

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 7, Copyright © 2017
IS6061 | Lists | S. Seppälä
Pearson Education, Ltd.
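A comprehension can also transform each value or keep only some of them. The sketch below is an illustration under assumed names (squares, evens, evens_loop); it compares a filtered comprehension with the equivalent for loop.

```python
# Collect a transformed value for every item we iterate through.
squares = [n * n for n in range(1, 5)]

# Collect only some of the values by adding an if clause at the end.
evens = [n for n in range(1, 11) if n % 2 == 0]

# The same filter written as an ordinary loop with append().
evens_loop = []
for n in range(1, 11):
    if n % 2 == 0:
        evens_loop.append(n)

print(squares)              # [1, 4, 9, 16]
print(evens)                # [2, 4, 6, 8, 10]
print(evens == evens_loop)  # True
```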
DICTIONARIES

2024-IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Dictionaries
• Describe the properties of dictionaries and keys & values
• Differentiate between a list and a dictionary
• Create and populate dicts using (empty) curly braces, dict() and indexing
• Access dictionary keys & values using different approaches
• Explain what a dictionary traceback is and how to avoid it
• Use dictionaries to count items
• Use Python built-in dictionary methods
• Search in dictionaries with "in" and "not in"
• Create dictionaries of complex objects
• Delete items in dictionaries

IS6061 | Dictionaries |
S. Seppälä
Summary

• Types of collections
• Dictionaries
• Example: most common name
• The get() method
• Example: counting words
• Accessing keys & values
• Deleting items
• Dictionaries and lists as values of dict

IS6061 | Dictionaries |
S. Seppälä
Python dictionaries
• Dictionaries are Python’s most powerful data collection.
• Dictionaries allow us to do fast database-like operations in
Python.
• A dictionary is a collection of items accessed by key rather than by position (since Python 3.7 it also remembers insertion order).
• Each item is composed of a key and value pair:
dictionary_name = {key1:value1, key2:value2, ...}
– Key must be immutable
• strings, integers, tuples are fine
• lists are NOT
– Value can be anything
IS6061 | Dictionaries | Based on: PY4E, Chapter 9 & "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 9,
S. Seppälä Copyright © 2017 Pearson Education, Ltd.
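The rule that keys must be immutable can be tried out directly. A minimal sketch, assuming a hypothetical grid dictionary: a tuple is accepted as a key, while a list raises a TypeError.

```python
grid = {}

# Strings, integers and tuples are immutable, so they can be keys.
grid["origin"] = "string key"
grid[7] = "integer key"
grid[(0, 1)] = "tuple key"

# A list is mutable, so Python refuses to use it as a key.
try:
    grid[[0, 1]] = "list key"
    list_key_allowed = True
except TypeError as error:
    list_key_allowed = False
    message = str(error)

print(len(grid))         # 3
print(list_key_allowed)  # False
print(message)           # the message mentions that a list is unhashable
```

Values have no such restriction: a list is fine as a value, as the later slides on nested structures show.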
Code that creates dictionaries
# an empty dictionary
book_catalog = {}
# OR
purse = dict()

# strings as keys and values
countries = {"CA": "Canada",
             "US": "United States",
             "MX": "Mexico"}

# numbers as keys, strings as values
numbers = {1: "One", 2: "Two", 3: "Three",
           4: "Four", 5: "Five"}

# strings as keys, values of mixed types
movie = {"name": "The Holy Grail",
         "year": 1975,
         "price": 9.99}

IS6061 | Dictionaries |
Based on: Murach's Python Programming, C12, Slide 4, © 2016, Mike Murach & Associates, Inc.
S. Seppälä
Indexing in dictionaries
• Lists index their entries based on the position in the list.
• Dictionary entries have no numeric position, so we can't use the position to index the entries.
• The key acts as an index to find the associated value, i.e., dictionaries are indexed by keys.
• A dictionary can be searched to locate the value associated with a key.
>>> purse = dict()
>>> purse['money'] = 12
>>> purse['candy'] = 3
>>> purse['tissues'] = 75
>>> print(purse)
{'money': 12, 'candy': 3, 'tissues': 75}
>>> print(purse['candy'])
3
>>> purse['candy'] = purse['candy'] + 2
>>> print(purse)
{'money': 12, 'candy': 5, 'tissues': 75}

IS6061 | Dictionaries | Based on: PY4E, Chapter 9 & "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 9,
S. Seppälä Copyright © 2017 Pearson Education, Ltd.
The syntax for accessing a value

dictionary_name[key]

Code that gets a value from a dictionary


country = countries["MX"] # "Mexico"
country = countries["IE"] # KeyError: Key doesn't exist
Code that sets a value if the key is in the dictionary
countries["GB"] = "United Kingdom"
Code that adds a key/value pair if the key isn’t in the dictionary
countries["FR"] = "France"
IS6061 | Dictionaries |
Based on: Murach's Python Programming, C12, Slide 7, © 2016, Mike Murach & Associates, Inc.
S. Seppälä
Dictionaries are mutable

• Like lists, dictionaries are a mutable data structure


– you can change the object via various operations, such as index
assignment

my_dict = {'bill':3, 'rich':10}


print(my_dict['bill']) # prints 3
my_dict['bill'] = 100
print(my_dict['bill']) # prints 100

IS6061 | Dictionaries | Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 9, Copyright © 2017
S. Seppälä Pearson Education, Ltd.
Comparing lists and dictionaries
Dictionaries are like lists except that they use keys instead of
numbers to look up values
>>> lst = list()
>>> lst.append(21)
>>> lst.append(183)
>>> print(lst)
[21, 183]
>>> lst[0] = 23
>>> print(lst)
[23, 183]

>>> ddd = dict()
>>> ddd['age'] = 21
>>> ddd['course'] = 182
>>> print(ddd)
{'age': 21, 'course': 182}
>>> ddd['age'] = 23
>>> print(ddd)
{'age': 23, 'course': 182}
IS6061 | Dictionaries |
Based on: PY4E, Chapter 9
S. Seppälä
Diagram: how lst and ddd store their values (before lst[0] and ddd['age'] are changed to 23)

List lst
Key     Value
[0]     21
[1]     183

Dictionary ddd
Key          Value
['age']      21
['course']   182

IS6061 | Dictionaries |
Based on: PY4E, Chapter 9
S. Seppälä
Dictionary Tracebacks
• It is an error to reference a key which is not in the dictionary
• We can use the in operator to see if a key is in the dictionary

>>> ccc = dict()


>>> print(ccc['csev'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'csev'
>>> 'csev' in ccc
False

IS6061 | Dictionaries |
Based on: PY4E, Chapter 9
S. Seppälä
Counting with a dictionary
When we encounter a new name, we need to add a new entry in the
dictionary and if this is the second or later time we have seen the
name, we simply add one to the count in the dictionary under that
name.

counts = dict()
names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']
for name in names :
    if name not in counts:
        counts[name] = 1
    else :
        counts[name] = counts[name] + 1
print(counts)

{'csev': 2, 'cwen': 2, 'zqian': 1}
IS6061 | Dictionaries |
Based on: PY4E, Chapter 9
S. Seppälä
The get() method for dictionaries
The pattern of checking to see if a key is already in a dictionary and assuming a default value if the key is not there is so common that there is a method called get() that does this for us.

if name in counts:
    x = counts[name]
else :
    x = 0

x = counts.get(name, 0)     # 0 is the default value if the key does not exist (and no Traceback)

{'csev': 2, 'cwen': 2, 'zqian': 1}

IS6061 | Dictionaries |
Based on: PY4E, Chapter 9
S. Seppälä
Simplified Counting with get()
We can use get() and provide a default value of zero when
the key is not yet in the dictionary - and then just add one
(either to zero or to the current value of that key).

counts = dict()
names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']
for name in names :
    counts[name] = counts.get(name, 0) + 1     # 0 is the default
print(counts)

{'csev': 2, 'cwen': 2, 'zqian': 1}
IS6061 | Dictionaries |
Based on: PY4E, Chapter 9
S. Seppälä
Counting pattern
• The general pattern to count the words in a line of text is
– to split the line into words,
– then loop through the words and
– use a dictionary to track the count of each word independently.

counts = dict()
print('Enter a line of text:')
line = input('')

words = line.split()

print('Words:', words)

print('Counting...')
for word in words:
    counts[word] = counts.get(word,0) + 1
print('Counts', counts)

IS6061 | Dictionaries |
Based on: PY4E, Chapter 9
S. Seppälä
Dictionary Methods
Method         Description
clear()        Removes all the elements from the dictionary
copy()         Returns a (shallow) copy of the dictionary
fromkeys()     Returns a dictionary with the specified keys and value
get()          Returns the value of the specified key (or a default if the key is missing)
items()        Returns a view object containing a (key, value) tuple for each pair
keys()         Returns a view object containing the dictionary's keys
pop()          Removes the element with the specified key and returns its associated value
popitem()      Removes the last inserted key-value pair and returns it as a tuple
setdefault()   Returns the value of the specified key. If the key does not exist: inserts the key, with the specified value
update()       Updates the dictionary with the specified key-value pairs
values()       Returns a view object containing all the values in the dictionary
IS6061 | Dictionaries | Based on: W3 Schools, Python Dictionary Methods, www.w3schools.com/python/python_ref_dictionary.asp &
S. Seppälä Real Python, Dictionaries in Python, https://realpython.com/python-dicts/#built-in-dictionary-methods
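Several of the methods in the table can be seen working together in a short sketch. The stock dictionary and its values are assumptions made for illustration.

```python
stock = {"apples": 3, "pears": 5}

# setdefault() inserts a missing key, but leaves an existing key alone.
added = stock.setdefault("plums", 0)      # inserts 'plums': 0 and returns 0
existing = stock.setdefault("apples", 99) # returns 3, nothing changes

# update() overwrites existing keys and adds new ones.
stock.update({"pears": 6, "kiwis": 2})

# keys() returns a view: it reflects later changes to the dictionary.
keys = stock.keys()
stock["figs"] = 1
key_list = list(keys)

# popitem() removes and returns the most recently inserted pair.
last = stock.popitem()

# copy() makes an independent dictionary, so clear() does not affect it.
backup = stock.copy()
stock.clear()

print(added, existing)  # 0 3
print(key_list)         # ['apples', 'pears', 'plums', 'kiwis', 'figs']
print(last)             # ('figs', 1)
print(backup)           # {'apples': 3, 'pears': 6, 'plums': 0, 'kiwis': 2}
print(stock)            # {}
```

Wrapping a view in list(), as on the next slide, takes a snapshot that no longer changes with the dictionary.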
Retrieving Lists of Keys and Values
You can get a list of keys, values, or items (both) from a dictionary
>>> jjj = { 'chuck' : 1 , 'fred' : 42, 'jan': 100}
>>> print(list(jjj))
['chuck', 'fred', 'jan']
>>> print(list(jjj.keys()))
['chuck', 'fred', 'jan']
>>> print(list(jjj.values()))
[1, 42, 100]
>>> print(list(jjj.items()))
[('chuck', 1), ('fred', 42), ('jan', 100)]
>>>
Each element of the resulting items list is a “tuple”
IS6061 | Dictionaries |
Based on: PY4E, Chapter 9
S. Seppälä
The syntax for deleting an item
del dictionary_name[key]
Code that uses the del keyword to delete an item
del countries["MX"]
del countries["IE"] # KeyError: Key doesn't exist
Code that checks a key before deleting the item
code = "IE"
if code in countries:
    country = countries[code]
    del countries[code]
    print(country + " was deleted.")
else:
    print("There is no country for this code: " + code)

IS6061 | Dictionaries |
Based on: Murach's Python Programming, C12, Slide 10, © 2016, Mike Murach & Associates, Inc.
S. Seppälä
Two dictionary methods for deleting items

pop(key[, default_value])
clear()
Code that uses the pop() method to delete an item
country = countries.pop("US") # "United States"
country = countries.pop("IE") # KeyError
country = countries.pop("IE", "Unknown") # "Unknown"
Code that prevents a KeyError from occurring
code = "IE"
country = countries.pop(code, "Unknown country")
print(country + " was deleted.")
Code that uses the clear() method to delete all items
countries.clear()

IS6061 | Dictionaries |
Based on: Murach's Python Programming, C12, Slide 11, © 2016, Mike Murach & Associates, Inc.
S. Seppälä
A dictionary that contains other dictionaries as values (1)
contacts = {
    "Joel":
        {"address": "1500 Anystreet", "city": "San Francisco",
         "state": "California", "postalCode": "94110",
         "phone": "555-555-1111"},
    "Anne":
        {"address": "1000 Somestreet", "city": "Fresno",
         "state": "California", "postalCode": "93704",
         "phone": "125-555-2222"},
    "Ben":
        {"address": "1400 Another Street", "city": "Fresno",
         "state": "California", "postalCode": "93704",
         "phone": "125-555-4444"}
}

Code that gets values from embedded dictionaries
phone = contacts["Anne"]["phone"]     # "125-555-2222"
email = contacts["Anne"]["email"]     # KeyError
IS6061 | Dictionaries |
Based on: Murach's Python Programming, C12, Slide 29, © 2016, Mike Murach & Associates, Inc.
S. Seppälä
A dictionary that contains other dictionaries as values (2)

Code that checks whether a key exists within another key


key = "email"
if key in contacts["Anne"]:
    email = contacts["Anne"][key]
    print(email)
else:
    print("Sorry, there is no email address for this contact.")

Code that uses the get() method with embedded dictionaries


phone = contacts.get("Anne").get("phone") # "125-555-2222"
phone = contacts.get("Anne").get("email") # None
phone = contacts.get("Mike").get("phone") # AttributeError
phone = contacts.get("Mike", {}).get("phone") # None

IS6061 | Dictionaries |
Based on: Murach's Python Programming, C12, Slide 30, © 2016, Mike Murach & Associates, Inc.
S. Seppälä
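The chained get() calls above can be wrapped in a small helper so that a missing contact or a missing field never raises an error. This is a sketch under stated assumptions: the function name lookup, its default parameter, and the shortened contacts data are hypothetical.

```python
# Shortened, hypothetical version of the contacts dictionary.
contacts = {
    "Anne": {"city": "Fresno", "phone": "125-555-2222"},
    "Ben": {"city": "Fresno", "phone": "125-555-4444"},
}


def lookup(contacts, name, field, default=None):
    """Return contacts[name][field], or default if either key is missing."""
    # The empty dictionary stands in for a missing contact, so the
    # second get() is always called on a dictionary, never on None.
    return contacts.get(name, {}).get(field, default)


print(lookup(contacts, "Anne", "phone"))              # 125-555-2222
print(lookup(contacts, "Anne", "email"))              # None
print(lookup(contacts, "Mike", "phone"))              # None
print(lookup(contacts, "Mike", "phone", "no phone"))  # no phone
```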
A dictionary that contains lists as values

students = {"Joel":[85, 95, 70],
            "Anne":[95, 100, 100],
            "Mike":[77, 70, 80, 85]}

Code that gets a value from an embedded list
scores = students["Joel"]            # [85, 95, 70]
joel_score = students["Joel"][0]     # 85

IS6061 | Dictionaries |
Based on: Murach's Python Programming, C12, Slide 31, © 2016, Mike Murach & Associates, Inc.
S. Seppälä
TUPLES

2024-IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Tuples
• Explain what a tuple is and describe its characteristics
• Use the tuple() method
• Compare tuples to lists
• Explain the advantages of tuples
• Perform tuple assignment, packing, and unpacking
• Use tuple comparison & sorting
• Sort dictionaries using tuples

IS6061 | Tuples | S. Seppälä


Summary

• What is a tuple?
• The tuple() method
• Tuples vs. lists
• Why tuples?
• Tuples & assignment
• Tuples & dictionaries
• Comparing tuples
• Sorting tuples

IS6061 | Tuples | S. Seppälä


Tuples

• Tuples are immutable lists, i.e., ordered sequences of items that cannot be modified in place
• They are printed with (,)
• They can contain different data types

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 7, Copyright © 2017
IS6061 | Tuples | S. Seppälä Pearson Education, Ltd. & David I. Schneider, An Introduction to Programming Using Python, Chapter 2, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
The tuple() method
• Creates an empty tuple object
>>> my_tpl = tuple()
>>> print(type(my_tpl))
<class 'tuple'>
>>> print(my_tpl)
()

• Converts a list into a tuple
>>> my_tpl = tuple([5,'abc', 22.7])
>>> print(my_tpl)
(5, 'abc', 22.7)

• Creates a tuple
>>> my_tpl = tuple(("apple", "banana", "cherry")) # note the double round-brackets
>>> print(my_tpl)
('apple', 'banana', 'cherry')
IS6061 | Tuples | S. Seppälä Based on: Python Tuples, W3 Schools, https://www.w3schools.com/python/python_tuples.asp
Lists and Tuples

• Everything that works with a list works with a tuple except methods that modify the tuple ⇒ Items of a tuple cannot be directly deleted, sorted, or altered
• Thus indexing, slicing, len, print all work as expected
• However, none of the mutating operations work: append, extend, del

Based on: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Chapter 7, Copyright © 2017
IS6061 | Tuples | S. Seppälä Pearson Education, Ltd. & David I. Schneider, An Introduction to Programming Using Python, Chapter 2, © 2016
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
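The read-only operations that work on tuples, and the usual workarounds when a change is needed, can be sketched as follows. The tuple point and the other names are assumptions for illustration.

```python
point = (3, 1, 2)

# Indexing, slicing and len() behave exactly as they do for lists.
first = point[0]
rest = point[1:]
size = len(point)

# Methods that would modify the tuple do not exist.
try:
    point.append(4)
    append_worked = True
except AttributeError:
    append_worked = False

# Workarounds build a new object instead of changing the tuple in place.
ordered = sorted(point)   # sorted() returns a new list
longer = point + (4,)     # concatenation returns a new tuple
as_list = list(point)     # convert to a list when changes are needed
as_list[0] = 99

print(first, rest, size)  # 3 (1, 2) 3
print(append_worked)      # False
print(ordered)            # [1, 2, 3]
print(longer)             # (3, 1, 2, 4)
print(point, as_list)     # (3, 1, 2) [99, 1, 2]
```

Note the trailing comma in (4,): without it, (4) is just the integer 4 in parentheses.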
Tuples are like lists
• Tuples are another kind of sequence that functions much like a list
– They have elements which are indexed starting at 0
>>> x = ('Glenn', 'Sally', 'Joseph')
>>> print(x[2])
Joseph
>>> y = ( 1, 9, 2 )
>>> print(y)
(1, 9, 2)
>>> print(max(y))
9
>>> for iter in y:
...     print(iter)
...
1
9
2
>>>

– Tuples can also be sliced, concatenated, and repeated

Based on: PY4E & David I. Schneider, An Introduction to Programming Using Python, Chapter 2, © 2016
IS6061 | Tuples | S. Seppälä
Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Tuples are “immutable” (unlike lists)

>>> z = (5, 4, 3)
>>> z[2] = 0
Traceback:
TypeError: 'tuple' object does not support item assignment
>>> x = (3, 2, 1)
>>> x.sort()
Traceback:
AttributeError: 'tuple' object has no attribute 'sort'
>>> x.append(5)
Traceback:
AttributeError: 'tuple' object has no attribute 'append'
>>> x.reverse()
Traceback:
AttributeError: 'tuple' object has no attribute 'reverse'
>>>

IS6061 | Tuples | S. Seppälä Based on: PY4E, Chapter 10


Why tuples?

• An immutable list gives you a data structure with some integrity, some permanence if you will
• You know you cannot accidentally change one
• Since Python does not have to build tuple structures to be modifiable, they are simpler and more efficient in terms of memory use and performance than lists
• So, in our program when we are making “temporary variables” we prefer tuples over lists

IS6061 | Tuples | S. Seppälä Based on: PY4E, Chapter 10


Unpacking a tuple

• We can also put a tuple on the left-hand side of an assignment statement ⇒ variable unpacking
>>> (x, y) = (4, 'fred')
>>> print(y)
fred
>>> (a, b) = (99, 98)
>>> print(a)
99
• We can even omit the parentheses
tuple_values = (1, 2, 3)
a, b, c = tuple_values     # a = 1, b = 2, c = 3

Based on: PY4E, Chapter 10 & Murach's Python Programming, C6, Slide 45, © 2016, Mike Murach & Associates,
IS6061 | Tuples | S. Seppälä
Inc.
A function that returns a tuple

def get_location():
    # code that computes values for x, y, and z
    return x, y, z

Code that calls the get_location() function and unpacks the returned tuple
x, y, z = get_location()

IS6061 | Tuples | S. Seppälä Based on: Murach's Python Programming, C6, Slide 45, © 2016, Mike Murach & Associates, Inc.
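The get_location() example leaves the computation out, so here is a complete, runnable variant of the same idea. The function min_max and its sample scores are hypothetical; they only illustrate returning and unpacking a tuple.

```python
def min_max(values):
    """Return the smallest and largest value as a (low, high) tuple."""
    # The comma builds the tuple; parentheses are optional in a return.
    return min(values), max(values)


scores = [70, 80, 90, 100]

# The result can be kept as one tuple...
result = min_max(scores)

# ...or unpacked into separate variables in a single assignment.
low, high = min_max(scores)

print(result)     # (70, 100)
print(low, high)  # 70 100
```

The number of variables on the left must match the number of items in the returned tuple, otherwise Python raises a ValueError.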
Tuples and Dictionaries

The items() method in dictionaries returns a view of (key, value) tuples
>>> d = dict()
>>> d['csev'] = 2
>>> d['cwen'] = 4
>>> for (k,v) in d.items():
...     print(k, v)
...
csev 2
cwen 4
>>> tups = d.items()
>>> print(tups)
dict_items([('csev', 2), ('cwen', 4)])

IS6061 | Tuples | S. Seppälä Based on: PY4E, Chapter 10


Tuples are Comparable

The comparison operators work with tuples and other sequences. If the first
item is equal, Python goes on to the next element, and so on, until it finds
elements that differ.
>>> (0, 1, 2) < (5, 1, 2)
True
>>> (0, 1, 2000000) < (0, 3, 4)
True
>>> ( 'Jones', 'Sally' ) < ('Jones', 'Sam')
True
>>> ( 'Jones', 'Sally') > ('Adams', 'Sam')
True

IS6061 | Tuples | S. Seppälä Based on: PY4E, Chapter 10


Sorting Lists of Tuples
• We can take advantage of the ability to sort a list of tuples to get a sorted version of a dictionary
• We sort the dictionary by the key using the items() method and sorted() function
>>> d = {'a':10, 'b':1, 'c':22}
>>> t = sorted(d.items())
>>> t
[('a', 10), ('b', 1), ('c', 22)]
>>> for k, v in sorted(d.items()):
...     print(k, v)
...
a 10
b 1
c 22
• To sort by values instead of key:
– construct a list of tuples of the form (value, key)
– sort the list of tuples
>>> c = {'a':10, 'b':1, 'c':22}
>>> tmp = list()
>>> for k, v in c.items() :
...     tmp.append( (v, k) )
...
>>> print(tmp)
[(10, 'a'), (1, 'b'), (22, 'c')]
>>> tmp = sorted(tmp, reverse=True)
>>> print(tmp)
[(22, 'c'), (10, 'a'), (1, 'b')]
IS6061 | Tuples | S. Seppälä Based on: PY4E, Chapter 10
The top 10 most common words
fhand = open('romeo.txt')
counts = {}
for line in fhand:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0 ) + 1

lst = []
for key, val in counts.items():
    newtup = (val, key)
    lst.append(newtup)

lst = sorted(lst, reverse=True)

for val, key in lst[:10] :
    print(key, val)

Even shorter version using list comprehension
>>> c = {'a':10, 'b':1, 'c':22}
>>> print( sorted( [ (v,k) for k,v in c.items() ] ) )
[(1, 'b'), (10, 'a'), (22, 'c')]

IS6061 | Tuples | S. Seppälä Based on: PY4E, Chapter 10
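The same counting and sorting steps can be packaged as a function and tried on a short string instead of a file. This is a sketch, not the slide author's code: the function name top_words, its parameter n, and the sample sentence are assumptions.

```python
def top_words(text, n=10):
    """Return the n most common words in text as (word, count) pairs."""
    # Step 1: count each word with the get() pattern.
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

    # Step 2: build (count, word) tuples so that sorting compares counts first.
    pairs = []
    for word, count in counts.items():
        pairs.append((count, word))

    # Step 3: sort from largest to smallest count and keep the first n.
    pairs = sorted(pairs, reverse=True)
    return [(word, count) for count, word in pairs[:n]]


sample = "the cat and the hat and the bat"
print(top_words(sample, 2))  # [('the', 3), ('and', 2)]
```

Words with equal counts are ordered by the word itself, in reverse alphabetical order, because the second item of each tuple breaks the tie.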


NumPy
IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Credit for most of the IS6061 slide contents goes to Dr. Sampath Jayarathna. Additions are specified on the slides.

CS 620 / DASC 600
Introduction to Data Science & Analytics
Lecture 2- NumPy I
Dr. Sampath Jayarathna
Old Dominion University
Credit for some of the slides in this lecture goes to Jianhua Ruan UTSA
NumPy

• Stands for Numerical Python
• It is the fundamental package required for high performance computing and data analysis.

Image: Sam Taha, Python Data Science and Machine Learning Stack, 14 June 2017,
https://grandlogic.blogspot.com/2017/06/python-data-science-and-machine.html

Based on: Dr. Sampath Jayarathna, Lecture 2- NumPy I, CS 620 / DASC 600 - Introduction to Data
IS6061 | NumPy | S. Seppälä
Science & Analytics, Old Dominion University, https://www.cs.odu.edu/~sampath/courses/f19/cs620/
NumPy

• NumPy is important for numerical computations in Python because it is designed for efficiency on large arrays of data.
• It provides
– ndarray for creating multidimensional arrays
– Internal storage of data in a contiguous block of memory, independent of other built-in Python objects, which uses much less memory than built-in Python sequences
– Standard math functions for fast operations on entire arrays of data without having to write loops
– NumPy arrays are important because they enable you to express batch operations on data without writing any for loops.
• We call this vectorization.

IS6061 | NumPy | S. Seppälä Based on: Dr. Sampath Jayarathna, Lecture 2- NumPy I, CS 620 / DASC 600 - Introduction to Data Science
& Analytics, Old Dominion University, https://www.cs.odu.edu/~sampath/courses/f19/cs620/
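Vectorization can be illustrated by doing the same calculation with a plain list and with an array. A minimal sketch, assuming NumPy is installed; the prices data and variable names are made up for illustration.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]

# With a plain list, doubling every value needs an explicit loop.
doubled_list = []
for price in prices:
    doubled_list.append(price * 2)

# With an ndarray, the same operation applies to the whole array at once.
arr = np.array(prices)
doubled_arr = arr * 2

# Math functions also work on the entire array without a loop.
total = arr.sum()
average = arr.mean()

print(doubled_list)    # [20.0, 40.0, 60.0]
print(doubled_arr)     # [20. 40. 60.]
print(total, average)  # 60.0 20.0
```

By contrast, prices * 2 on the plain list would repeat the list rather than double its values, which is one reason arrays are preferred for numerical work.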
NumPy ndarray vs Python list

• One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python.
• Whenever you see “array,” “NumPy array,” or “ndarray” in the text, with few exceptions they all refer to the same thing: the ndarray object.
• NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.

Based on: Dr. Sampath Jayarathna, Lecture 2- NumPy I, CS 620 / DASC 600 - Introduction to Data
IS6061 | NumPy | S. Seppälä
Science & Analytics, Old Dominion University, https://www.cs.odu.edu/~sampath/courses/f19/cs620/
NumPy Arrays vs. Python Lists
NumPy Arrays
1. Homogeneous Data: NumPy arrays store elements of the same data type → more compact and memory-efficient than lists.
2. Fixed Data Type: NumPy arrays have a fixed data type, reducing memory overhead by eliminating the need to store type information for each element.
3. Contiguous Memory: NumPy arrays store elements in adjacent memory locations, reducing fragmentation and allowing for efficient access.
4. Array Metadata: NumPy arrays have extra metadata like shape, strides, and data type. However, this overhead is usually smaller than the per-element overhead in lists.
5. Performance: NumPy arrays are optimized for numerical computations, with efficient element-wise operations and mathematical functions. These operations are implemented in C, resulting in faster performance than equivalent operations on lists.

Python Lists
1. Datatype: Lists can hold different data types, but this can decrease memory efficiency and slow numerical operations.
2. Element Overhead: Lists in Python store additional information about each element, e.g. its type and reference count.
3. Memory Fragmentation: Lists may not store elements in contiguous memory locations, causing memory fragmentation and inefficiency.
4. Performance: Lists are not optimized for numerical computations and may have slower mathematical operations due to Python’s interpretation overhead. They are generally used as general-purpose data structures.
5. Functionality: Lists can store any data type, but lack specialized NumPy functions for numerical operations.
Based on: Python Lists VS Numpy Arrays, GeeksforGeeks, 25 August 2023,
IS6061 | NumPy | S. Seppälä
https://www.geeksforgeeks.org/python-lists-vs-numpy-arrays/
NumPy Arrays vs. Python Lists

Python Lists
• Heterogeneous Data
• Element Overhead
• Memory Fragmentation

NumPy Arrays
• Homogeneous Data
• Fixed Data Type
• Contiguous Memory
• Array Metadata

Based on: Python Lists VS Numpy Arrays, GeeksforGeeks, 25 August 2023,
IS6061 | NumPy | S. Seppälä
https://www.geeksforgeeks.org/python-lists-vs-numpy-arrays/
NumPy Arrays vs. Python Lists

• Heterogeneous Data
• Element Overhead
• Memory Fragmentation • Homogeneous Data
• Fixed Data Type
• Contiguous Memory
• Array Metadata

Based on: Based on: Python Lists VS Numpy Arrays, GeeksforGeeks, 25 August 2023,
IS6061 | NumPy | S. Seppälä https://www.geeksforgeeks.org/python-lists-vs-numpy-arrays/
ndarray

• ndarray is used for storage of homogeneous data
– i.e., all elements the same type
• Every array must have a shape and a dtype
• Supports convenient slicing, indexing and efficient vectorized computation
import numpy as np
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
print(arr1)        # the ints are converted to floats so all elements share one type
print(arr1.dtype)  # float64
print(arr1.shape)  # (5,)
print(arr1.ndim)   # 1

Creating ndarrays

Using list of lists


import numpy as np

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]  # list of lists
arr2 = np.array(data2)
print(arr2.ndim)   # 2
print(arr2.shape)  # (2, 4)

Creating ndarrays

array = np.array([[0,1,2],
                  [2,3,4]])
[[0 1 2]
 [2 3 4]]

array = np.zeros((2,3))
[[0. 0. 0.]
 [0. 0. 0.]]

array = np.ones((2,3))
[[1. 1. 1.]
 [1. 1. 1.]]

array = np.eye(3)
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

array = np.arange(0, 10, 2)
[0 2 4 6 8]

array = np.random.randint(0, 10, (3,3))
[[6 4 3]
 [1 5 6]
 [9 8 5]]
(random values; the output differs on each run)

arange is an array-valued version of the built-in Python range function
Arithmetic with NumPy Arrays

• Any arithmetic operation between equal-size arrays is applied element-wise:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print(arr)
[[1. 2. 3.]
[4. 5. 6.]]
print(arr * arr)
[[ 1. 4. 9.]
[16. 25. 36.]]
print(arr - arr)
[[0. 0. 0.]
[0. 0. 0.]]

Arithmetic with NumPy Arrays
• Arithmetic operations with scalars propagate the scalar argument to each element in the array:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print(arr)
[[1. 2. 3.]
[4. 5. 6.]]
print(arr **2)
[[ 1. 4. 9.]
[16. 25. 36.]]

• Comparisons between arrays of the same size yield boolean arrays:


arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
print(arr2)
[[ 0. 4. 1.]
[ 7. 2. 12.]]
print(arr2 > arr)
[[False True False]
[ True False True]]
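A short sketch of how scalar arithmetic and boolean comparisons are typically combined. It reuses `arr` from the slide; the other names (`inv`, `mask`, `selected`) are illustrative.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# The scalar is applied to every element.
inv = 1 / arr
half = arr * 0.5

# A comparison yields a boolean array of the same shape...
mask = arr > 3
print(mask)

# ...which can be used to select elements (the result is one-dimensional)
selected = arr[mask]
print(selected)

# ...or to count how many elements satisfy the condition.
print(mask.sum())
```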
Indexing and Slicing

• One-dimensional arrays are simple; on the surface they act similarly to Python lists:

arr = np.arange(10)
print(arr) # [0 1 2 3 4 5 6 7 8 9]
print(arr[5]) #5
print(arr[5:8]) #[5 6 7]
arr[5:8] = 12
print(arr) #[ 0 1 2 3 4 12 12 12 8 9]

Indexing and Slicing

• As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is
propagated (or broadcast) to the entire selection.
• An important first distinction from Python’s built-in lists is that array slices are views on
the original array.
– This means that the data is not copied, and any modifications to the view will be reflected in the
source array.
arr = np.arange(10)
print(arr) # [0 1 2 3 4 5 6 7 8 9]

arr_slice = arr[5:8]
print(arr_slice) # [5 6 7]
arr_slice[1] = 12345
print(arr) # [ 0 1 2 3 4 5 12345 7 8 9]
arr_slice[:] = 64
print(arr) # [ 0 1 2 3 4 64 64 64 8 9]
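If an independent copy is needed rather than a view, the slice can be copied explicitly. A minimal sketch (the names `view` and `independent` are illustrative):

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]                # a view: shares memory with arr
independent = arr[5:8].copy()  # a copy: has its own data

independent[:] = -1            # does not touch arr
print(arr)                     # still 0..9

view[:] = 64                   # writes through to arr
print(arr)

print(np.shares_memory(arr, view))         # True
print(np.shares_memory(arr, independent))  # False
```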
Indexing

• In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d[2]) # [7 8 9]

• Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements.
• So these are equivalent:

print(arr2d[0][2]) # 3
print(arr2d[0, 2]) # 3
Activity 3

• Consider the two-dimensional array, arr2d.

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

• Write code to slice this array to display the last column:
[[3] [6] [9]]

• Write code to slice this array to display the last 2 elements of the middle row:
[5 6]

NumPy – Slicing Activity

https://forms.office.com/e/ETprJEZuTS

References

• NumPy documentation: https://numpy.org/doc/stable/index.html
• Joe James, Python: NUMPY | Numerical Python Arrays Tutorial, https://youtu.be/8Mpc9ukltVA?si=nWIo1d-Lghi5DOxe
– See video comments for links to related videos and code.
• Python Lists VS Numpy Arrays, GeeksforGeeks, 25 August 2023, https://www.geeksforgeeks.org/python-lists-vs-numpy-arrays/
• Alammar, J (2018). A Visual Intro to NumPy and Data Representation, https://jalammar.github.io/visual-numpy/
– Excellent illustrations of the key concepts!

Pandas
IS6061
MSc BA
Lecturer: Dr Selja Seppälä

CS 620 / DASC 600 Introduction to Data Science & Analytics
Lecture 3 - Pandas
Dr. Sampath Jayarathna
Old Dominion University

Credit for most of the IS6061 slide contents goes to Dr. Sampath Jayarathna. Additions are specified on the slides.
Credit for some of the slides in this lecture goes to Jianhua Ruan UTSA
Why pandas?
• One of the most popular libraries that data scientists use
• Labeled axes to avoid misalignment of data
– When merging two tables, some rows may be different
• Missing values or special values may need to be removed or replaced

        height  Weight  Weight2  age  Gender
Amy     160     125     126      32   2
Bob     170     167     155      -1   1
Chris   168     143     150      28   1
David   190     182     NA       42   1
Ella    175     133     138      23   2
Frank   172     150     148      45   1

        salary   Credit score
Alice   50000    700
Bob     NA       670
Chris   60000    NA
David   -99999   750
Ella    70000    685
Tom     45000    660
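The two problems shown in the tables (rows that do not match, and placeholder values such as -1 or -99999) can be sketched as follows. The data is a cut-down version of the slide's tables, and treating -1 and -99999 as "missing" is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

body = pd.DataFrame({"height": [160, 170, 168], "age": [32, -1, 28]},
                    index=["Amy", "Bob", "Chris"])
finance = pd.DataFrame({"salary": [50000, np.nan, 60000, -99999]},
                       index=["Alice", "Bob", "Chris", "David"])

# Replace the special placeholder values with a proper missing value.
body["age"] = body["age"].replace(-1, np.nan)
finance["salary"] = finance["salary"].replace(-99999, np.nan)

# Rows are matched on their labels, not on their position;
# labels present in only one table get NaN in the other table's columns.
merged = body.join(finance, how="outer")
print(merged)
```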

IS6061 | NumPy | S. Seppälä Based on: Dr. Sampath Jayarathna, Lecture 3 - Pandas, CS 620 / DASC 600 - Introduction to Data Science & Analytics, Old Dominion University, https://www.cs.odu.edu/~sampath/courses/f19/cs620/
Overview

• Created by Wes McKinney in 2008, now maintained by many others.
– Author of one of the textbooks: Python for Data Analysis
• Powerful and productive Python data analysis and management library
• Panel Data System
– The name is derived from the term "panel data", an econometrics term for
data sets that include both time-series and cross-sectional data
• It's an open-source product.

Overview
• Python Library to provide data analysis features similar to: R, MATLAB, SAS
• Rich data structures and functions to make working with structured data fast,
easy and expressive.
– It contains data structures and data manipulation tools designed to make data
cleaning and analysis fast and easy in Python.
• It is built on top of NumPy
– The biggest difference is that pandas is designed for working with tabular or
heterogeneous data.
• Often used with
– numerical computing tools like NumPy and SciPy
– analytical libraries like statsmodels and scikit-learn
– data visualization libraries like matplotlib.

IS6061 | NumPy | S. Seppälä Based on: Dr. Sampath Jayarathna, Lecture 3 - Pandas, CS 620 / DASC 600 - Introduction to Data Science & Analytics, Old Dominion University, https://www.cs.odu.edu/~sampath/courses/f19/cs620/ & Wes McKinney, 5 Getting Started with pandas, Python for Data Analysis, 3rd edition, 2023, p. 123, https://wesmckinney.com/book/
Overview
• Key components provided by Pandas:
– Series
– DataFrame

from pandas import Series, DataFrame
import pandas as pd   # import convention

Series
• One-dimensional array-like object
• It contains a sequence of values (array of data of any NumPy data type) and
an associated array of data labels called its index. (Indexes can be strings or
integers or other data types.)
• By default, a Series is indexed with the integers 0 to N, where N = size - 1

from pandas import Series, DataFrame
import pandas as pd

obj = Series([4, 7, -5, 3])
print(obj)
print(obj.index)
print(obj.values)

#Output
0    4
1    7
2   -5
3    3
dtype: int64
RangeIndex(start=0, stop=4, step=1)
[ 4  7 -5  3]

Series – referencing elements
Creating a Series with an index identifying each data point with a label

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
#Output
d    4
b    7
a   -5
c    3
dtype: int64

print(obj2.index)
#Output
Index(['d', 'b', 'a', 'c'], dtype='object')

print(obj2.values)
#Output
[ 4  7 -5  3]

Using labels in the index when selecting single values or a set of values

print(obj2['a'])
#Output
-5

print(obj2.a)
#Output
-5

obj2['d'] = 10
print(obj2[['d', 'c', 'a']])
#Output
d    10
c     3
a    -5
dtype: int64

print(obj2[:2])
#Output
d    10
b     7
dtype: int64
Series – array/dict operations
Can be thought of as a dict.
Can be constructed from a dict directly.

obj3 = Series({'d': 4, 'b': 7, 'a': -5, 'c': 3})
print(obj3)
#output
d    4
b    7
a   -5
c    3
dtype: int64

print('b' in obj3)
#output
True

numpy array operations can also be applied, which will preserve the index-value link

obj4 = obj3[obj3>0]
print(obj4)
#output
d    4
b    7
c    3
dtype: int64

print(obj3**2)
#output
d    16
b    49
a    25
c     9
dtype: int64
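A small sketch of the index-value link being preserved when NumPy functions are applied to a Series. It rebuilds the same `obj3` as on the slide; `absolute` and `doubled` are illustrative names.

```python
import numpy as np
from pandas import Series

obj3 = Series({'d': 4, 'b': 7, 'a': -5, 'c': 3})

# NumPy functions return a Series with the same labels.
absolute = np.abs(obj3)
print(absolute)

# Scalar arithmetic also keeps each label attached to its value.
doubled = obj3 * 2
print(doubled['b'])

# Dict-like membership test works on the labels, not the values.
print('a' in obj3, 'z' in obj3)
```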
Series – from dictionary

sdata = {'Texas': 10, 'Ohio': 20, 'Oregon': 15, 'Utah': 18}   # dictionary
states = ['Texas', 'Ohio', 'Oregon', 'Iowa']
obj4 = Series(sdata, index=states)
print(obj4)
#output
Texas     10.0
Ohio      20.0
Oregon    15.0
Iowa       NaN     (missing value)
dtype: float64

The isnull and notnull functions in pandas should be used to detect missing data

print(pd.isnull(obj4))
#output
Texas     False
Ohio      False
Oregon    False
Iowa       True
dtype: bool

print(pd.notnull(obj4))
#output
Texas      True
Ohio       True
Oregon     True
Iowa      False
dtype: bool

print(obj4[obj4.notnull()])
#output
Texas     10.0
Ohio      20.0
Oregon    15.0
dtype: float64
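Once missing values are detected, they are usually either dropped or filled. A minimal sketch on the same Series; filling with 0 is only an example choice, not a recommendation for real data.

```python
import pandas as pd
from pandas import Series

sdata = {'Texas': 10, 'Ohio': 20, 'Oregon': 15, 'Utah': 18}
states = ['Texas', 'Ohio', 'Oregon', 'Iowa']
obj4 = Series(sdata, index=states)

# Count the missing entries (isna is an alias of isnull).
n_missing = obj4.isna().sum()
print(n_missing)

# Drop the missing entries; same result as obj4[obj4.notnull()].
cleaned = obj4.dropna()
print(cleaned)

# Or replace them with a chosen value.
filled = obj4.fillna(0)
print(filled)
```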

Series – auto alignment
sdata = {'Texas': 10, 'Ohio': 20, 'Oregon': 15, 'Utah': 18}
states = ['Texas', 'Ohio', 'Oregon', 'Iowa']
obj4 = Series(sdata, index=states)
print(obj4)
#Output
Texas     10.0
Ohio      20.0
Oregon    15.0
Iowa       NaN
dtype: float64

sdata = {'Texas': 10, 'Ohio': 20, 'Oregon': 15, 'Utah': 18}
states = ['Texas', 'Ohio', 'Oregon', 'Utah']
obj5 = Series(sdata, index=states)
print(obj5)
#Output
Texas     10
Ohio      20
Oregon    15
Utah      18
dtype: int64

print(obj4.add(obj5))
#output
Iowa       NaN
Ohio      40.0
Oregon    30.0
Texas     20.0
Utah       NaN
dtype: float64

• Series automatically aligns by index label in arithmetic operations
• Similar to a join operation
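When a label exists in only one of the two Series, the result is NaN. The `fill_value` argument of `add` can soften this. A sketch using the same `obj4` and `obj5`; note that a label missing on both sides still stays NaN.

```python
import pandas as pd
from pandas import Series

sdata = {'Texas': 10, 'Ohio': 20, 'Oregon': 15, 'Utah': 18}
obj4 = Series(sdata, index=['Texas', 'Ohio', 'Oregon', 'Iowa'])
obj5 = Series(sdata, index=['Texas', 'Ohio', 'Oregon', 'Utah'])

# Plain addition: NaN wherever one side has no value.
plain = obj4 + obj5
print(plain)

# fill_value=0 stands in for a value missing on ONE side only.
# Utah: missing in obj4, 18 in obj5 -> 18.0
# Iowa: NaN in obj4 and absent from obj5 -> still NaN
filled = obj4.add(obj5, fill_value=0)
print(filled)
```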

Series name and index name
Both the Series object itself and its index have a name attribute, which
integrates with other key areas of pandas functionality

sdata = {'Texas': 10, 'Ohio': 20, 'Oregon': 15, 'Utah': 18}


states = ['Texas', 'Ohio', 'Oregon', 'Iowa']
obj4 = Series(sdata, index=states)
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)
#output
state
Texas 10.0
Ohio 20.0
Oregon 15.0
Iowa NaN
Name: population, dtype: float64

Series name and index name

• Index of a series can be changed to a different index.


obj4.index = ['Florida', 'New York', 'Kentucky', 'Georgia']
print(obj4)
Florida 10.0
New York 20.0
Kentucky 15.0
Georgia NaN
Name: population, dtype: float64

• Index object itself is immutable.


obj4.index[2]='California'
TypeError: Index does not support mutable operations

print(obj4.index)
Index(['Florida', 'New York', 'Kentucky', 'Georgia'],
dtype='object')

Indexing, selection and filtering
• Series can be sliced/accessed with label-based indexes, or using position-
based indexes

S = Series(range(4), index=['zero', 'one', 'two', 'three'])
print(S['two'])
2
print(S[['zero', 'two']])
zero 0
two 2
dtype: int64
print(S[2])
2
print(S[[0,2]])
zero 0
two 2
dtype: int64
Indexing, selection and filtering
• Series can be sliced/accessed with label-based indexes, or using position-based indexes
S = Series(range(4), index=['zero', 'one', 'two', 'three'])

print(S[:2])
zero    0
one     1
dtype: int32

print(S['zero' : 'two'])    # label slices are inclusive of the end label
zero    0
one     1
two     2
dtype: int32

print(S[S > 1])
two      2
three    3
dtype: int32

print(S[-2:])
two      2
three    3
dtype: int32
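Because plain square brackets accept both labels and positions, the intent can be ambiguous. A sketch of the explicit alternatives, `.loc` (labels) and `.iloc` (positions), on the same Series. As far as I know, recent pandas releases warn when an integer key such as S[2] is treated as a position on a label index, so the explicit forms are the safer habit.

```python
from pandas import Series

S = Series(range(4), index=['zero', 'one', 'two', 'three'])

# Label-based access
print(S.loc['two'])
by_label = S.loc['zero':'two']   # end label included -> 3 elements

# Position-based access
print(S.iloc[2])
by_position = S.iloc[0:2]        # end position excluded -> 2 elements

print(by_label)
print(by_position)
```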

DataFrame
• A DataFrame is a tabular data structure comprised of rows and columns, akin to a
spreadsheet or database table.
• It can be treated as an ordered collection of columns
– Each column can be a different data type (numeric, string, boolean, etc.)
– Have both row and column indices
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
#output
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

• There are many ways to construct a DataFrame
• One of the most common is from a dict of equal-length lists or NumPy arrays
DataFrame – specifying columns and indices
• Order of columns/rows can be specified.
• Columns not in data will have NaN.
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['A', 'B', 'C', 'D', 'E'])

print(frame2)
   year   state  pop debt
A  2000    Ohio  1.5  NaN
B  2001    Ohio  1.7  NaN
C  2002    Ohio  3.6  NaN
D  2001  Nevada  2.4  NaN
E  2002  Nevada  2.9  NaN
(columns appear in the same order as specified; 'debt' is not in data, so it is initialized with NaN)

DataFrame – from nested dict of dicts
• Outer dict keys as columns and inner dict keys as row indices

pop = {'Nevada': {2001: 2.9, 2002: 2.9}, 'Ohio': {2002: 3.6, 2001: 1.7, 2000: 1.5}}
frame3 = DataFrame(pop)
print(frame3)
#output
      Nevada  Ohio
2000     NaN   1.5
2001     2.9   1.7
2002     2.9   3.6
Union of inner keys (in sorted order)

Transpose
print(frame3.T)
        2000  2001  2002
Nevada   NaN   2.9   2.9
Ohio     1.5   1.7   3.6
DataFrame – index, columns, values
frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)
state Nevada Ohio
year
2000 NaN 1.5
2001 2.9 1.7
2002 2.9 3.6

print(frame3.index)
Int64Index([2000, 2001, 2002], dtype='int64', name='year')

print(frame3.columns)
Index(['Nevada', 'Ohio'], dtype='object', name='state')

print(frame3.values)
[[nan 1.5]
[2.9 1.7]
[2.9 3.6]]

DataFrame – retrieving a column
A column in a DataFrame can be retrieved as a Series by dict-like
notation or as attribute
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

print(frame['state'])
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

print(frame.state)
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

DataFrame – getting rows

loc for using indexes and iloc for using positions


– loc gets rows (or columns) with particular labels from the index.
– iloc gets rows (or columns) at particular positions in the index
(so it only takes integers).

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['A', 'B', 'C', 'D', 'E'])

print(frame2)
   year   state  pop debt
A  2000    Ohio  1.5  NaN
B  2001    Ohio  1.7  NaN
C  2002    Ohio  3.6  NaN
D  2001  Nevada  2.4  NaN
E  2002  Nevada  2.9  NaN

print(frame2.loc['A'])
year     2000
state    Ohio
pop       1.5
debt      NaN
Name: A, dtype: object

print(frame2.loc[['A', 'B']])
   year state  pop debt
A  2000  Ohio  1.5  NaN
B  2001  Ohio  1.7  NaN

print(frame2.loc['A':'E', ['state','pop']])
    state  pop
A    Ohio  1.5
B    Ohio  1.7
C    Ohio  3.6
D  Nevada  2.4
E  Nevada  2.9

print(frame2.iloc[1:3])
   year state  pop debt
B  2001  Ohio  1.7  NaN
C  2002  Ohio  3.6  NaN

print(frame2.iloc[:,1:3])
    state  pop
A    Ohio  1.5
B    Ohio  1.7
C    Ohio  3.6
D  Nevada  2.4
E  Nevada  2.9

DataFrame – modifying columns
frame2['debt'] = 0
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5     0
B  2001    Ohio  1.7     0
C  2002    Ohio  3.6     0
D  2001  Nevada  2.4     0
E  2002  Nevada  2.9     0

frame2['debt'] = range(5)
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5     0
B  2001    Ohio  1.7     1
C  2002    Ohio  3.6     2
D  2001  Nevada  2.4     3
E  2002  Nevada  2.9     4

val = Series([10, 10, 10], index = ['A', 'C', 'D'])
frame2['debt'] = val
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5  10.0
B  2001    Ohio  1.7   NaN
C  2002    Ohio  3.6  10.0
D  2001  Nevada  2.4  10.0
E  2002  Nevada  2.9   NaN

Rows or individual elements can be modified similarly, using loc or iloc.
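A sketch of modifying individual elements with loc and iloc, reusing the `frame2` layout from the slides. The new values (5.5, 1.6, 3.2, 1.0) are made up for illustration.

```python
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['A', 'B', 'C', 'D', 'E'])
frame2['debt'] = 0.0

# One element, addressed by row label and column label.
frame2.loc['B', 'debt'] = 5.5

# One element, addressed by row position and column position
# (row 0 is 'A', column 2 is 'pop').
frame2.iloc[0, 2] = 1.6

# Several columns of one row at once.
frame2.loc['E', ['pop', 'debt']] = [3.2, 1.0]

print(frame2)
```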

DataFrame – removing columns

del frame2['debt']
print(frame2)
year state pop
A 2000 Ohio 1.5
B 2001 Ohio 1.7
C 2002 Ohio 3.6
D 2001 Nevada 2.4
E 2002 Nevada 2.9

More on DataFrame indexing

import numpy as np
data = np.arange(9).reshape(3,3)
print(data)
[[0 1 2]
 [3 4 5]
 [6 7 8]]

frame = DataFrame(data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

print(frame)
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8

print(frame['c1'])
r1    0
r2    3
r3    6
Name: c1, dtype: int32

print(frame.loc['r1'])
c1    0
c2    1
c3    2
Name: r1, dtype: int32

print(frame['c1']['r1'])
0

print(frame[['c1', 'c3']])
    c1  c3
r1   0   2
r2   3   5
r3   6   8

print(frame.loc[['r1','r3']])
    c1  c2  c3
r1   0   1   2
r3   6   7   8

Row slices
print(frame.iloc[:2])
    c1  c2  c3
r1   0   1   2
r2   3   4   5

print(frame[:2])
    c1  c2  c3
r1   0   1   2
r2   3   4   5

More on DataFrame indexing - 2
print(frame.loc[['r1', 'r2'], ['c1', 'c2']])
    c1  c2
r1   0   1
r2   3   4

print(frame.iloc[:2,:2])
    c1  c2
r1   0   1
r2   3   4

print(frame.loc['r1':'r3', 'c1':'c3'])
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8

A pandas Index can contain duplicate labels

v = DataFrame(np.arange(9).reshape(3,3), index=['a', 'a', 'b'], columns=['c1','c2','c3'])
print(v)
   c1  c2  c3
a   0   1   2
a   3   4   5
b   6   7   8

print(v.loc['a'])
   c1  c2  c3
a   0   1   2
a   3   4   5

More on DataFrame indexing - 3

print(frame)
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8

print(frame < 3)
       c1     c2     c3
r1   True   True   True
r2  False  False  False
r3  False  False  False

print(frame['c1'] > 0)
r1    False
r2     True
r3     True
Name: c1, dtype: bool

print(frame[frame['c1'] > 0])
    c1  c2  c3
r2   3   4   5
r3   6   7   8

frame[frame < 3] = 3
print(frame)
    c1  c2  c3
r1   3   3   3
r2   3   4   5
r3   6   7   8
Removing rows/columns
print(frame)
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8

drop returns a new object; the original frame is unchanged.

print(frame.drop(['r1']))
c1 c2 c3
r2 3 4 5
r3 6 7 8

print(frame.drop(['r1','r3']))
c1 c2 c3
r2 3 4 5

print(frame.drop(['c1'], axis=1))
c2 c3
r1 1 2
r2 4 5
r3 7 8
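Rows and columns can also be dropped in one call using the `index=` and `columns=` keywords instead of `axis`. A minimal sketch on the same `frame`; `smaller` is an illustrative name.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(9).reshape(3, 3),
                  index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

# Drop row 'r1' and column 'c1' in a single call.
smaller = frame.drop(index=['r1'], columns=['c1'])
print(smaller)

# The original is untouched because drop returns a new object.
print(frame.shape)
```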
Reindexing
• Alter the order of rows/columns of a DataFrame or order of a series
according to new index
frame2 = frame.reindex(columns=['c2', 'c3', 'c1'])
print(frame2)
    c2  c3  c1
r1   1   2   0
r2   4   5   3
r3   7   8   6

reindex returns a new object.

frame2 = frame.reindex(['r1', 'r3', 'r2', 'r4'])
print(frame2)
     c1   c2   c3
r1  0.0  1.0  2.0
r3  6.0  7.0  8.0
r2  3.0  4.0  5.0
r4  NaN  NaN  NaN
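Labels that do not exist in the original (such as 'r4') are filled with NaN by default, which also turns the integer columns into floats. A sketch of the `fill_value` option, which avoids both; using 0 as the filler is only an example.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(9).reshape(3, 3),
                  index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

# New label 'r4' is filled with 0 instead of NaN.
frame2 = frame.reindex(['r1', 'r3', 'r2', 'r4'], fill_value=0)
print(frame2)

# The original frame still has its three rows in the original order.
print(list(frame.index))
```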


References

• Pandas website: https://pandas.pydata.org/
• Wes McKinney, 5 Getting Started with pandas, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/
• Joe James, Python: Pandas Tutorial | Intro to DataFrames, https://youtu.be/e60ItwlZTKM?si=3qJU8Rtl2qMTEmPA
– See video comments for links to related videos and code.
• Alammar, J (2018). Visualizing Pandas' Pivoting and Reshaping Functions, https://jalammar.github.io/visualizing-pandas-pivoting-and-reshaping/
– Excellent illustrations of Pandas!



EXPLORATORY DATA ANALYSIS (EDA)

IS6061
PROGRAMMING FOR DATA AND BUSINESS ANALYTICS
MSc BIAS
Lecturer: Dr Selja Seppälä
Introduction

IS6061 | EDA | S. Seppälä


CRISP DM
The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process.
It has six sequential phases:
1. Business understanding – What does the business need?
2. Data understanding – What data do we have / need? Is it clean?
3. Data preparation (often cited as roughly 80% of the project effort) – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
5. Evaluation – Which model best meets the business objectives?
6. Deployment – How do stakeholders access the results?



What is Exploratory Data Analysis?

• A data analytics process to understand the data in depth and to learn from the different characteristics of the data, often with visual means.
• The critical process of performing initial investigations on data to:
– Identify the important variables in the dataset
– Discover patterns
– Spot anomalies
– Test hypotheses
– Check assumptions
with the help of summary statistics and graphical representations.

IS6061 | EDA | S. Seppälä Based on: Exploratory Data Analysis in Python, Simplilearn (https://youtu.be/MoM6mighOJM)
Usefulness of EDA

• Explore the data


• Assess a situation
• Determine how to proceed
• Decide what to do

IS6061 | EDA | S. Seppälä Based on: CS 109a: Data Science, Effective Exploratory Data Analysis and Visualization, Pavlos Protopapas & Kevin Rader, https://harvard-iacs.github.io/2018-CS109A/lectures/lecture-3/
Why Perform EDA?

• Helps us identify obvious errors
• Helps us better understand patterns within the data itself
• Detects outliers or anomalous events
• Finds interesting relations among the variables
• Transforms your data into a format which is easier to work with
• Gives you a basic overview of the data's properties
• Likely generates several questions for you to follow up in subsequent analysis.

IS6061 | EDA | S. Seppälä Based on: Exploratory Data Analysis in Python, Simplilearn (https://youtu.be/MoM6mighOJM) & CS 109a: Data Science, Effective Exploratory Data Analysis and Visualization, Pavlos Protopapas & Kevin Rader, https://harvard-iacs.github.io/2018-CS109A/lectures/lecture-3/
EDA Workflow

• Build a DataFrame from the data (ideally, put all data in this object)
• Clean the DataFrame. It should have the following properties
– Each row describes a single object
– Each column describes a property of that object
– Columns are numeric whenever appropriate
– Columns contain atomic properties that cannot be further decomposed
• Explore global properties. Use e.g. histograms, scatter plots, and
aggregation functions to summarize the data.
• Explore group properties. Use groupby and small multiples to compare
subsets of the data.
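The workflow above can be sketched on a tiny made-up dataset. The column names (`store`, `units`, `price`) and all values are hypothetical; plotting is left out to keep the example short.

```python
import pandas as pd

# 1. Build a DataFrame: one row per sale, one column per property.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "units": ["3", "5", "2", "8", "4"],   # numbers stored as text
    "price": [2.5, 2.5, 3.0, 3.0, 3.0],
})

# 2. Clean: make columns numeric whenever appropriate.
df["units"] = pd.to_numeric(df["units"])
df["revenue"] = df["units"] * df["price"]

# 3. Explore global properties with aggregation functions.
summary = df[["units", "revenue"]].agg(["mean", "max"])
print(summary)

# 4. Explore group properties with groupby.
by_store = df.groupby("store")["revenue"].sum()
print(by_store)
```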

EDA in Python vs. Excel

https://forms.office.com/e/UG0kcxuaZt



EDA in Python vs. Excel
• Python is open source & free
• Has a large developer community implementing the latest advances in research
– Many tutorials
– Strong community support
• Python integrates with many other libraries, e.g., ML
• More powerful data importing and manipulation
– Structured & unstructured data
– Support for many file types (no need to convert files as in Excel)
• Easier to automate
• Can handle large amounts of data quickly (Excel gets slow)
• Easier to find errors with Python error messages
• More advanced statistics and machine learning capabilities
• Has better, more advanced and state-of-the-art graphics capabilities
• Cross-platform stability
EDA in Python vs. Excel

When to use Excel?

• Small datasets
• Ease of use
• Small and one-time analyses
• Creating basic visualizations quickly


EDA with Pandas
Credit for the slides in this section goes to Neha Mathur.

Exploratory Data Analysis
by Neha Mathur
EDA using Pandas

Import data into the workspace (Jupyter Notebook, Google Colab, Python IDE)

Descriptive statistics

Removal of nulls

Visualization
1. Packages and data import
• Step 1: Import pandas into the workspace.
• import pandas as pd

• Step 2: Read the dataset into a pandas DataFrame. Input formats include:
• Excel: read_excel
• CSV: read_csv
• JSON: read_json
• HTML (read_html) and many more
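
As a small sketch of the two steps, the CSV text below stands in for a file on disk; the column names and values are made up, and in practice you would pass a file path to read_csv instead of a StringIO object.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file such as "people.csv".
csv_text = "name,age,city\nAna,34,Cork\nBen,29,Dublin\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
```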
Descriptive Stats (Pandas)
• Used to make preliminary assessments about the population distribution of the variable.
• Commonly used statistics:
1. Central tendency:
• Mean – The average value of all the data points: dataframe.mean()
• Median – The middle value when all the data points are put in an ordered list: dataframe.median()
• Mode – The data point that occurs most often in the dataset: dataframe.mode()
2. Spread: A measure of how far the data points are from the mean or median
• Variance – The mean of the squares of the individual deviations from the mean: dataframe.var() (pandas divides by n − 1 by default; pass ddof=0 to divide by n)
• Standard deviation – The square root of the variance: dataframe.std()
3. Skewness: A measure of asymmetry: dataframe.skew()
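
The statistics above can be tried on a single made-up column. The values are chosen only so the results are easy to check by hand; ddof=0 is passed so that the variance matches the "mean of squared deviations" definition.

```python
import pandas as pd

# Made-up scores, chosen so the results are easy to verify by hand.
df = pd.DataFrame({"score": [2, 4, 4, 4, 5, 5, 7, 9]})

mean = df["score"].mean()        # average of the values
median = df["score"].median()    # middle of the sorted values
mode = df["score"].mode()        # most frequent value(s), returned as a Series

# ddof=0 divides by n (population variance); the pandas default is ddof=1.
variance = df["score"].var(ddof=0)
std_dev = df["score"].std(ddof=0)

skewness = df["score"].skew()    # positive here: the tail is on the right

print(mean, median, mode.tolist(), variance, std_dev, skewness)
```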
Descriptive Stats (contd.)
Other methods to get a quick look at the data:
• describe(): Summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
• Syntax: dataframe.describe()
• info(): Prints a concise summary of the DataFrame, including the index dtype and columns, non-null values and memory usage.
• Syntax: dataframe.info()
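
A short sketch of both methods on a made-up DataFrame with one missing value. Writing the info() output to a buffer is optional and is done here only so the text can be inspected afterwards.

```python
import io

import numpy as np
import pandas as pd

# Made-up data with one missing height.
df = pd.DataFrame({"height": [1.6, 1.7, np.nan, 1.8],
                   "name": ["a", "b", "c", "d"]})

# describe() summarizes numeric columns by default and skips NaN values.
stats = df.describe()
print(stats)

# info() prints to the screen by default; buf redirects the text to a buffer.
buffer = io.StringIO()
df.info(buf=buffer)
print(buffer.getvalue())
```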
3. Null values

Detecting null values:
• isnull(): An alias for dataframe.isna(). It returns a DataFrame of boolean values indicating which values are missing.
• Syntax: dataframe.isnull()

Handling null values:
• Dropping rows with null values: dropna() deletes the rows or columns that contain null values.
• Replacing missing values: fillna() fills the missing values with a specified value, such as the mean or median.
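
Detection and both handling strategies can be sketched as follows. The column names and values are made up, and filling with the column median is just one possible choice.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column.
df = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0],
                   "fare": [7.25, 71.28, np.nan, 8.05]})

# Detecting: boolean mask of missing values, then a count per column.
missing_mask = df.isnull()
missing_per_column = missing_mask.sum()

# Handling, option 1: drop every row that contains a null value.
dropped = df.dropna()

# Handling, option 2: replace nulls with each column's median.
filled = df.fillna(df.median())

print(missing_per_column)
print(dropped)
print(filled)
```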
4. Visualization
• Univariate: Looking at one variable/column at a time
• Bar-graph
• Histograms
• Boxplot
• Multivariate: Looking at the relationship between two or more variables
• Scatter plots
• (Pie plots)
• Heatmaps (seaborn)
Bar-Graph, Histogram
and Boxplot
• Bar graph: A bar plot is a plot that presents
data with rectangular bars with lengths
proportional to the values that they represent.
• Boxplot : Depicts numerical data graphically
through their quartiles. The box extends from
the Q1 to Q3 quartile values of the data, with
a line at the median (Q2).
• Histogram: A histogram is a representation of
the distribution of data.
Scatterplot, Pieplot
• Scatterplot: Shows the data as a collection of points.
• Syntax: dataframe.plot.scatter(x='x_column_name', y='y_column_name')

• Pie plot: Proportional representation of the numerical data in a column.
• Syntax: dataframe.plot.pie(y='column_name')
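
A minimal sketch of one univariate and one bivariate view using the pandas plotting methods named above. The columns height_cm and weight_kg and their values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up measurements for illustration only.
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 170, 172, 180, 185],
    "weight_kg": [50, 58, 63, 68, 70, 82, 90],
})

# One figure with three panels: histogram, boxplot, scatter plot.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

df["height_cm"].plot.hist(ax=axes[0], bins=4, title="Histogram")
df.plot.box(ax=axes[1], title="Boxplot")
df.plot.scatter(x="height_cm", y="weight_kg", ax=axes[2], title="Scatter plot")

fig.tight_layout()
```

In a Jupyter notebook the figure is displayed when the cell finishes; in a script you would typically call plt.show() afterwards.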
Outlier detection
• An outlier is a point or set of data points that lie away from the rest of
the data values of the dataset.
• Outliers can often be identified by visualizing the data.
• For example:
• In a boxplot, the data points which lie outside the upper and lower bound can
be considered as outliers
• In a scatterplot, the data points which lie outside the groups of datapoints can
be considered as outliers
Outlier removal
• Calculate the IQR and the outlier bounds as follows:
– Calculate the first and third quartiles (Q1 and Q3)
– Calculate the interquartile range, IQR = Q3 − Q1
– Find the lower bound, which is Q1 − 1.5 × IQR
– Find the upper bound, which is Q3 + 1.5 × IQR
• Replace the data points that lie outside this range.
– They can be replaced by the mean or median.
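
The IQR rule can be sketched as follows. The values are made up, with one obvious outlier, and replacing outliers by the median is only one of the options mentioned above.

```python
import pandas as pd

# Made-up values with one obvious outlier (102).
values = pd.Series([10.0, 12.0, 12.0, 13.0, 12.0, 11.0, 14.0, 13.0, 15.0, 102.0])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds: 1.5 * IQR below Q1 and above Q3.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag points outside the bounds and replace them with the median.
is_outlier = (values < lower_bound) | (values > upper_bound)
cleaned = values.mask(is_outlier, values.median())

print(lower_bound, upper_bound)
print(cleaned.tolist())
```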
References
• More information on EDA tools and Pandas can be found at the links below:
• https://pandas.pydata.org/docs/user_guide/index.html
• https://pandas.pydata.org/docs/user_guide/missing_data.html
• https://pandas.pydata.org/docs/user_guide/visualization.html
EDA Examples

• Will Cukierski. (2012). Titanic - Machine Learning from Disaster. Kaggle. https://kaggle.com/competitions/titanic
– Ashwini Swain, EDA To Prediction (DieTanic), https://www.kaggle.com/code/ash316/eda-to-prediction-dietanic/notebook
• A Beginner’s Guide to Exploratory Data Analysis with Python, Deepnote, https://deepnote.com/@code-along-tutorials/A-Beginners-Guide-to-Exploratory-Data-Analysis-with-Python-f536530d-7195-4f68-ab5b-5dca4a4c3579


References

• Myatt, Glenn J., and Wayne P. Johnson. Making Sense of Data I: A Practical Guide to Exploratory Data Analysis and Data Mining, John Wiley & Sons, Incorporated, 2014. ProQuest Ebook Central, https://ebookcentral-proquest-com.ucc.idm.oclc.org/lib/uccie-ebooks/detail.action?docID=1729064.
• Martinez, Wendy L., et al. Exploratory Data Analysis with MATLAB, CRC Press LLC, 2017. ProQuest Ebook Central, https://ebookcentral-proquest-com.ucc.idm.oclc.org/lib/uccie-ebooks/detail.action?docID=5475665.
– Section 1.1: What is Exploratory Data Analysis
• Advanced exploratory data analysis (EDA):
• https://miykael.github.io/blog/2022/advanced_eda/
• https://github.com/miykael/miykael.github.io/blob/master/assets/nb/03_advanced_eda/nb_advanced_eda.ipynb


MATPLOTLIB
IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Chapter 9

Matplotlib tries to make easy things easy and hard things possible.
John Hunter
(creator and project leader of Matplotlib)

Quote source: Sandro Tosi, John Hunter, Merits of Matplotlib, in Matplotlib for Python Developers, Chapter 1, Packt, November 2009.
IS6061 | Matplotlib | S. Seppälä
Introduction
• Matplotlib is a multi-platform data visualization library for Python
– Designed for creating 2D and 3D plots and figures suitable for publication
• The project was started by John Hunter in 2002 to enable a MATLAB-like plotting interface in Python
– Matplotlib was modeled on MATLAB, because graphing was something that MATLAB did very well.
• Built on NumPy arrays
• Designed to work with the broader SciPy stack (e.g., pandas has good integration with matplotlib)
• Other data visualization toolkits use matplotlib for their underlying plotting (e.g., Seaborn)
• Supports interactive plotting from the IPython shell and now, Jupyter notebook
• Supports various GUI backends on all operating systems. Plots can be rendered in UI applications.
• Can export visualizations to all the common vector and raster graphics formats (PDF, SVG, JPG,
PNG, BMP, GIF, etc.)

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization & Python for Scientists, Chapter 20 matplotlib, rev 1.1, CJ Associates, 2014, p. 20-2.
Matplotlib terminology
• A Figure is one "picture".
– It has a border ("frame"), and other attributes.
– A Figure can be saved to a file.
• A Plot is one set of values graphed onto the Figure.
– A Figure can contain more than one Plot.
• Axes and Subplot are similar; the difference is how they get placed on the figure.
– Subplots allow multiple plots to be placed in a rectangular grid.
– Axes allow multiple plots to be placed at any location, including within other plots, or
overlapping.
• Matplotlib uses default objects for all of these, which are sufficient for simple plots.
– You can explicitly create any or all of these objects to fine-tune a graph.
– Most of the time, for simple plots, you can accept the defaults and get great-looking
figures.
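
The difference between grid-placed subplots and freely placed axes can be sketched as below. The sine and cosine data are placeholders, and the inset position is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# The Figure is the whole "picture".
fig = plt.figure(figsize=(6, 4))

# A subplot is placed on a rectangular grid (here a 1 x 1 grid).
main_ax = fig.add_subplot(1, 1, 1)
main_ax.plot(x, np.sin(x))

# An axes can be placed anywhere, even inside another plot.
# The list is [left, bottom, width, height] as fractions of the figure.
inset_ax = fig.add_axes([0.6, 0.6, 0.25, 0.25])
inset_ax.plot(x, np.cos(x))
```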
Based on: Python for Scientists, Chapter 20 matplotlib, rev 1.1, CJ Associates, 2014, p. 20-4 & Quick start guide, Matplotlib, https://matplotlib.org/stable/users/explain/quick_start.html
Using Matplotlib



Importing Matplotlib

import matplotlib.pyplot as plt


import numpy as np

import matplotlib as mpl

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
A simple line plot

In [14]: data = np.arange(10)

In [15]: data
Out[15]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]: plt.plot(data)

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Figures and subplots
In [17]: fig = plt.figure()

• In Jupyter, nothing will be shown until we use a few more commands.
• plt.figure has a number of options
– figsize will guarantee the figure has a certain size and aspect ratio if saved to disk
• You can’t make a plot with a blank figure.
– You have to create one or more subplots using add_subplot(rows, columns, panel number)

In [18]: ax1 = fig.add_subplot(2, 2, 1)

This means that the figure should be 2 × 2 (so up to four plots in total), and we’re selecting the first of four subplots (numbered from 1).

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Figures and subplots: Example
⚠️ In Jupyter notebooks, plots are reset after each cell is evaluated, so
you must put all the plotting commands in a single notebook cell.
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Axis methods
• Plot axis objects have various methods that create different types of plots
• It is preferred to use the axis methods over the top-level plotting functions like plt.plot.
• You can find a comprehensive catalogue of plot types in the matplotlib documentation.

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

ax1.hist(np.random.standard_normal(100), bins=20, color="black", alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.standard_normal(30))
ax3.plot(np.random.standard_normal(50).cumsum(), color="black", linestyle="dashed")

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
plt.subplots function
• To make creating a grid of subplots more convenient, matplotlib includes a
plt.subplots function that creates a new figure and returns a NumPy array
containing the created subplot objects:

In [25]: fig, axes = plt.subplots(2, 3)

In [26]: axes
Out[26]:
array([[<Axes: >, <Axes: >, <Axes: >],
       [<Axes: >, <Axes: >, <Axes: >]], dtype=object)

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Axes array
• The axes array can then be indexed like a two-dimensional array.
– E.g., axes[0, 1] refers to the subplot in the top row at the center.
• You can also indicate that subplots should have the same x- or y-axis
using sharex and sharey, respectively.
– This can be useful when you're comparing data on the same scale;
otherwise, matplotlib autoscales plot limits independently.
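
A small sketch of indexing the axes array and sharing axes. The random walks are placeholder data, and the seed is arbitrary; with sharex and sharey set, all panels end up with the same limits even though the two series differ in scale.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable

# 2 rows x 3 columns of subplots that share both axes.
fig, axes = plt.subplots(2, 3, sharex=True, sharey=True)

# axes is a 2D NumPy array: [row, column], both counted from 0.
axes[0, 1].plot(rng.standard_normal(50).cumsum())        # top row, center
axes[1, 2].plot(10 * rng.standard_normal(50).cumsum())   # bottom row, right

print(axes.shape)
print(axes[0, 1].get_ylim() == axes[1, 2].get_ylim())
```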

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Adjusting the spacing around subplots

• By default, matplotlib leaves a certain amount of padding around the outside of the subplots and in spacing between subplots.
– This spacing is all specified relative to the height and width of the plot, so that if you
resize the plot either programmatically or manually using the GUI window, the plot
will dynamically adjust itself.
• You can change the spacing using the subplots_adjust method on Figure
objects:

subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None)

• wspace and hspace control the percent of the figure width and figure height, respectively, to use as spacing between subplots.

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Example
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(2):
    for j in range(2):
        axes[i, j].hist(np.random.standard_normal(500),
                        bins=50,
                        color="black", alpha=0.5)
fig.subplots_adjust(wspace=0, hspace=0)

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Colors, Markers, and Line Styles

• Matplotlib’s line plot function accepts arrays of x and y coordinates and optional color styling options.
– For example, to plot x versus y with green dashes, you would execute:
ax.plot(x, y, linestyle="--", color="green")
– A number of color names are provided for commonly used colors, but you can use any
color on the spectrum by specifying its hex code (e.g., "#CECECE").

fig = plt.figure()
ax = fig.add_subplot()
ax.plot(np.random.standard_normal(30).cumsum(),
        color="black",
        linestyle="dashed", marker="o")

See documentation for line styles and marker options.

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Axis Range, Ticks and Tick Labels
Controlled by axes methods (modifying the y-axis consists of the same process, substituting y for x):
• Plot range: get_xlim and set_xlim
– ax.get_xlim() returns the current x-axis plotting range
– ax.set_xlim([0, 10]) sets the x-axis range to 0 to 10
– The pyplot function plt.xlim() acts on the active axes: called with no arguments it returns the current range, and called with arguments (e.g., plt.xlim([0, 10])) it sets the range

• Tick locations: set_xticks

In [42]: ticks = ax.set_xticks([0, 250, 500, 750, 1000])

• Tick labels: set_xticklabels

In [43]: labels = ax.set_xticklabels(["one", "two", "three", "four", "five"],
   ....:                             rotation=30, fontsize=8)
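
A short sketch of reading and setting the plot range on an axes object, and of the matching y-axis methods. The plotted line and the tick positions are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(1000))

# Set, then read back, the x-axis plotting range.
ax.set_xlim([0, 500])
current_xlim = ax.get_xlim()

# The y-axis uses the same pattern with y substituted for x.
ax.set_yticks([0, 500, 1000])
ax.set_yticklabels(["low", "mid", "high"])

print(current_xlim)
```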

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Labels, Legends and Title

Controlled by axes methods:


• Label: set_xlabel
In [44]: ax.set_xlabel("Stages")

• Title: set_title
In [45]: ax.set_title("My first matplotlib plot")
• The axes class has a set method that allows batch setting of plot
properties.
ax.set(title="My first matplotlib plot", xlabel="Stages", ylabel="Test")

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Example
fig, ax = plt.subplots()
ax.plot(np.random.standard_normal(1000).cumsum())
ticks = ax.set_xticks([0, 250, 500, 750, 1000])
labels = ax.set_xticklabels(["one", "two", "three", "four", "five"],
                            rotation=30, fontsize=8)
ax.set(title="My first matplotlib plot", xlabel="Stages", ylabel="Test")

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Adding legends
• There are a couple of ways to add a legend.
• The easiest is to pass the label argument when adding each piece of the plot
and then call ax.legend() to automatically create a legend.
• The legend method has several other choices
for the location loc argument.

fig, ax = plt.subplots()
ax.plot(np.random.randn(1000).cumsum(), color="black", label="one")
ax.plot(np.random.randn(1000).cumsum(), color="black", linestyle="dashed",
        label="two")
ax.plot(np.random.randn(1000).cumsum(), color="black", linestyle="dotted",
        label="three")
ax.legend()

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Saving Plots to File

• You can save the active figure to file using the figure object’s savefig
instance method.
• The file type is inferred from the file extension.
• For example, to save an SVG version of a figure, you need only type:
fig.savefig("figpath.svg")

• To control the dots-per-inch resolution, use dpi.


fig.savefig("figpath.png", dpi=400)

• See documentation for other savefig options
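
A minimal sketch of saving a figure. The temporary directory and the file name figure.png are assumptions made so the example cleans up after itself; bbox_inches="tight" is one of the other savefig options and trims surrounding whitespace.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Save into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    png_path = os.path.join(tmp_dir, "figure.png")  # type inferred from ".png"
    fig.savefig(png_path, dpi=200, bbox_inches="tight")
    saved_size = os.path.getsize(png_path)

print(saved_size, "bytes written")
```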

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Matplotlib & a tiny bit of
Pandas

https://forms.office.com/e/3RhG4fejm8
Plotting with Pandas



Plots in Pandas

• Pandas objects come equipped with their own plotting functions, which are
essentially wrappers around the matplotlib library.
– Think of matplotlib as a backend for pandas plots.
• Pandas Plot is a set of methods that can be used with a pandas
DataFrame or Series to plot various graphs from the data it holds.
• Pandas Plot simplifies the creation of graphs and plots, so you don’t
need to know the details of working with matplotlib.
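
Because the pandas plotting methods wrap matplotlib, they hand back a matplotlib Axes object that can be adjusted further. The sketch below uses made-up sales figures to show this.

```python
import pandas as pd
from matplotlib.axes import Axes

# Made-up yearly sales figures.
sales = pd.DataFrame({"units": [3, 5, 4, 6]}, index=[2020, 2021, 2022, 2023])

# The pandas plot call returns the matplotlib Axes it drew on.
ax = sales.plot(title="Units sold per year")

# From here on, ordinary matplotlib Axes methods apply.
ax.set_ylabel("Units sold")
ax.axhline(sales["units"].mean(), linestyle="--", color="gray")

print(isinstance(ax, Axes))
```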

Based on: Parul Pandey, Pandas Plot: Deep Dive Into Plotting Directly With Pandas, MLOps Blog, NeptuneAI, 23 August 2023, https://neptune.ai/blog/pandas-plot-deep-dive-into-plotting-directly-with-pandas
plot method

• Series and DataFrame have a plot method for making some basic plot
types.
– By default, plot() makes line plots.
s = pd.Series(np.random.standard_normal(10).cumsum(), index=np.arange(0, 100, 10))

s.plot()

• See documentation for plot options:


– Series plot
– DataFrame plot

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
df.plot method

• DataFrame’s plot method plots each of its columns as a different line on
the same subplot, creating a legend automatically.
df = pd.DataFrame(np.random.standard_normal((10, 4)).cumsum(0),
                  columns=["A", "B", "C", "D"],
                  index=np.arange(0, 100, 10))

plt.style.use('grayscale')
df.plot()

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Bar plot (1)

The plot.bar() and plot.barh() methods make vertical and horizontal bar
plots, respectively. In this case, the Series or DataFrame index will
be used as the x (bar) or y (barh) ticks.

fig, axes = plt.subplots(2, 1)
data = pd.Series(np.random.uniform(size=16),
                 index=list("abcdefghijklmnop"))
data.plot.bar(ax=axes[0], color="black", alpha=0.7)
data.plot.barh(ax=axes[1], color="black", alpha=0.7)

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Bar plot (2)
With a DataFrame, bar plots group the values in each row in bars, side by side,
for each value.
In [71]: df = pd.DataFrame(np.random.uniform(size=(6, 4)),
....: index=["one", "two", "three", "four", "five", "six"],
....: columns=pd.Index(["A", "B", "C", "D"], name="Genus"))

In [72]: df
Out[72]:
Genus         A         B         C         D
one    0.370670  0.602792  0.229159  0.486744
two    0.420082  0.571653  0.049024  0.880592
three  0.814568  0.277160  0.880316  0.431326
four   0.374020  0.899420  0.460304  0.100843
five   0.433270  0.125107  0.494675  0.961825
six    0.601648  0.478576  0.205690  0.560547

In [73]: df.plot.bar()

Based on: Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023, https://wesmckinney.com/book/plotting-and-visualization
Histogram

The DataFrame.hist() method calls matplotlib.pyplot.hist() on each Series in
the DataFrame, resulting in one histogram per column.
df = pd.DataFrame({
    'length': [1.5, 0.5, 1.2, 0.9, 3],
    'width': [0.7, 0.2, 0.15, 0.2, 1.1]
}, index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
df.hist(bins=3)

To draw the histogram of a single column, select it first:
df = pd.DataFrame({
    'length': [1.5, 0.5, 1.2, 0.9, 3],
    'width': [0.7, 0.2, 0.15, 0.2, 1.1]
}, index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
df['length'].hist(bins=3)

Based on: pandas.DataFrame.hist, pandas API reference, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html#pandas.DataFrame.hist
References

• Matplotlib website: https://matplotlib.org/
• Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition,
2023, https://wesmckinney.com/book/plotting-and-visualization
– Get the related Jupyter notebooks from https://github.com/wesm/pydata-book/tree/3rd-edition
• Jake VanderPlas, 4 Visualization with Matplotlib, Python Data Science Handbook, 2016,
O'Reilly Media, https://jakevdp.github.io/PythonDataScienceHandbook/
• Python for Scientists, Chapter 20 matplotlib, rev 1.1, CJ Associates, 2014
• Derek Banas, Matplotlib Tutorial: Matplotlib Full Course, 22 August 2020,
https://youtu.be/6GUZXDef2U0?si=aXrals6V1eB4dsZo (1:33:59 min)
– See the comments of the video for a table of contents with direct access to specific topics or
graphs.


SEABORN
IS6061
MSc BA
Lecturer: Dr Selja Seppälä
Credit to Matthias Bussonnier for the initial design and implementation of the logo. Source: Citing and logo, seaborn, https://seaborn.pydata.org/citing.html
IS6061 | Seaborn | S. Seppälä
Introduction

• Seaborn is a Python data visualization library based on matplotlib.


• It builds on top of matplotlib and integrates closely with pandas data
structures.
• It provides a high-level interface for drawing attractive and informative
statistical graphics.
• Seaborn helps you explore and understand your data.
– It aims to make visualization a central part of exploring and understanding
complex datasets.
• Its plotting functions operate on dataframes and arrays containing whole
datasets and internally perform the necessary semantic mapping and
statistical aggregation to produce informative plots.
• Its dataset-oriented, declarative API lets you focus on what the different
elements of your plots mean, rather than on the details of how to draw them.
Based on: seaborn: statistical data visualization, https://seaborn.pydata.org/index.html & An introduction to seaborn, https://seaborn.pydata.org/tutorial/introduction.html & Matplotlib, https://matplotlib.org
Example of Seaborn usage
# Import seaborn
import seaborn as sns

# Apply the default theme
sns.set_theme()

# Load an example dataset
tips = sns.load_dataset("tips")

# Create a visualization
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)

This plot shows the relationship between five variables in the tips dataset using a single call to the seaborn function relplot().

Based on: An introduction to seaborn, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
Seaborn in Jupyter notebooks

💡 If you’re working in a Jupyter notebook or an IPython terminal with
matplotlib mode enabled, you should immediately see the plot.
Otherwise, you may need to explicitly call matplotlib.pyplot.show().

import matplotlib.pyplot as plt
plt.show()


Modules of functions
• Seaborn’s code is hierarchically
structured, with modules of
functions.
• Most of the docs are structured
around these modules.
• Functions within a module share a
lot of underlying code and offer
similar features that may not be
present in other components of the
library.
• This facilitates switching between
different visual representations as
you explore a dataset, because
different representations often have
complementary strengths and
weaknesses.
Based on: Overview of seaborn plotting functions, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
Example: the distributions module (1)

• The distributions module defines functions that specialize in
representing the distribution of datapoints.

– This includes familiar methods like the histogram:

penguins = sns.load_dataset("penguins")
sns.histplot(data=penguins,
             x="flipper_length_mm",
             hue="species",
             multiple="stack")

Based on: Overview of seaborn plotting functions, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
Example: the distributions module (2)
• displot() is the figure-level function for the distributions
module.
• Its default behavior is to draw a histogram, using the same
code as histplot() behind the scenes:
sns.displot(data=penguins,
x="flipper_length_mm",
hue="species",
multiple="stack")

• To draw a kernel density plot instead, using the same code as
kdeplot(), select it using the kind parameter:
sns.displot(data=penguins,
            x="flipper_length_mm",
            hue="species",
            multiple="stack",
            kind="kde")

Based on: Overview of seaborn plotting functions, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
A high-level API for statistical graphics (1)

• Different questions are best answered by different plots. Seaborn makes
it easy to switch between different visual representations by using a
consistent dataset-oriented API.
• Statistical estimation
– Many seaborn functions will automatically perform the statistical estimation.
– When statistical values are estimated, seaborn will use bootstrapping to
compute confidence intervals and draw error bars representing the
uncertainty of the estimate.
– Statistical estimation in seaborn goes beyond descriptive statistics.
• E.g., it is possible to enhance a scatterplot by including a linear regression model
(and its uncertainty) using lmplot()
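
A hedged sketch of the lmplot() idea. The data are synthetic (a made-up linear relationship plus noise, with an arbitrary seed) rather than one of seaborn's example datasets, so the example does not need to download anything.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)  # arbitrary seed for repeatability

# Synthetic data: tip is roughly 15% of the bill plus random noise.
bill = rng.uniform(10, 50, size=80)
tips = pd.DataFrame({
    "total_bill": bill,
    "tip": 0.15 * bill + rng.normal(0, 1, size=80),
})

# Scatter plot with a fitted regression line and a shaded
# confidence interval (estimated by bootstrapping).
grid = sns.lmplot(data=tips, x="total_bill", y="tip")
```

lmplot() is a figure-level function, so it returns a grid object rather than a single matplotlib Axes.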

Based on: An introduction to seaborn, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
A high-level API for statistical graphics (2)

• Distributional representations
– Statistical analyses require knowledge about the distribution of variables in
your dataset. The seaborn function displot() supports several approaches
to visualizing distributions.
• Plots for categorical data
– Several specialized plot types in seaborn are oriented towards visualizing
categorical data. They can be accessed through catplot(). These plots
offer different levels of granularity.

Based on: An introduction to seaborn, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
Multivariate views on complex datasets (1)

• Some seaborn functions combine
multiple kinds of plots to quickly give
informative summaries of a dataset.
• One, jointplot(), focuses on a single
relationship.
– It plots the joint distribution between two
variables along with each variable’s
marginal distribution:
penguins = sns.load_dataset("penguins")
sns.jointplot(data=penguins,
x="flipper_length_mm",
y="bill_length_mm",
hue="species")
Based on: An introduction to seaborn, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
Multivariate views on complex datasets (2)

• The other, pairplot(), takes a broader view.
– It shows joint and marginal
distributions for all pairwise
relationships and for each variable:

sns.pairplot(data=penguins,
             hue="species")

Based on: An introduction to seaborn, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
Multivariate views on complex datasets (3)

• Lower-level tools for building figures.
– These tools work by combining axes-
level plotting functions with objects
that manage the layout of the figure,
linking the structure of a dataset to a
grid of axes.
– Both elements are part of the public
API, and you can use them directly to
create complex figures with only a few
more lines of code.

Based on: An introduction to seaborn, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
In Sum
• Opinionated defaults and flexible customization
– Seaborn creates complete graphics with a single function call with informative
axis labels and legends.
– Seaborn allows for several levels of customization.
• Multiple built-in themes and additional keyword arguments allowing even more control.
• Plots’ properties can be modified through both the seaborn API and by dropping down to
the matplotlib layer for fine-grained tweaking
• Relationship to matplotlib
– Seaborn’s integration with matplotlib allows you to use it across the many
environments that matplotlib supports, including exploratory analysis in
notebooks, real-time interaction in GUI applications, and archival output in a
number of raster and vector formats.
• Explore the Seaborn Example gallery:
https://seaborn.pydata.org/examples/index.html
Based on: An introduction to seaborn, Tutorial, seaborn, https://seaborn.pydata.org/tutorial/introduction.html
Advantages of Seaborn over Matplotlib
• Functionality
– Easier/simpler syntax
– Goes beyond basic plotting of Matplotlib with a variety of visualisation
patterns
– Specialises in statistical visualisation ⇒ used for data summarisation and
showing data distribution (EDA)
– Many default themes
– Handling multiple figures is simpler
– More integrated for working with Pandas data frames
– Extends Matplotlib to create beautiful graphs with a more straightforward
set of methods



Seaborn vs. Matplotlib
• Some differences
– Seaborn is more geared towards EDA with complex datasets.
– Seaborn has pre-built visualisation patterns that generate different parts of the figure automatically
(e.g., labels, legend, etc.).
– Seaborn has a simpler syntax.
• In sum, Seaborn allows you to focus on the data instead of the code to produce the
visualisations.
• Different questions are best answered by different plots. Seaborn makes it easy to switch
between different visual representations by using a consistent dataset-oriented API.
• Further differences:
– See complete lists of differences in:
• Comparing Python Data Visualization Tools: Matplotlib vs Seaborn, Endless Origins,
https://analyticsindiamag.com/comparing-python-data-visualization-tools-matplotlib-vs-seaborn/
• Difference Between Matplotlib VS Seaborn, GeeksforGeeks, 20 August 2022,
https://www.geeksforgeeks.org/difference-between-matplotlib-vs-seaborn/
– See a comparison of code examples in: Aizhamal Zhetigenova, Matplotlib vs. Seaborn,
CodeSolid.com, 2022, https://codesolid.com/matplotlib-vs-seaborn/
Titanic Example with Seaborn

• Example showing the use of Seaborn to explore the Titanic dataset


– Eugenio Rovira, Pandas & Seaborn - A guide to handle & visualize data in
Python, Mar 16, 2017, Last Updated: January 2021,
https://tryolabs.com/blog/2017/03/16/pandas-seaborn-a-guide-to-handle-visualize-data-elegantly



Matplotlib & a tiny bit of Pandas

https://forms.office.com/e/KQgguqrE4D



Example of how to describe a library: Plotly

• Plotly is an open source graphing library for Python. It makes interactive, publication-quality
2D and 3D graphs, such as line plots, scatter plots, area charts, bar charts, error bars, box
plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.
• Plotly is typically imported with the following command: import plotly.express as px

• Other useful information that can be added to the description:


– Libraries on which the package is based
– General structure of the library
– An example of code to generate a specific type of graph
– Etc.

• Based on: REFERENCE



References
• Seaborn website: https://seaborn.pydata.org/index.html
• Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software,
6(60), 3021, https://doi.org/10.21105/joss.03021.
• Overview of seaborn plotting functions, https://seaborn.pydata.org/tutorial/function_overview.html
• Wes McKinney, 9 Plotting and Visualization, Python for Data Analysis, 3rd edition, 2023,
https://wesmckinney.com/book/
• Aizhamal Zhetigenova, Matplotlib vs. Seaborn, CodeSolid.com, 2022,
https://codesolid.com/matplotlib-vs-seaborn/
• Difference Between Matplotlib VS Seaborn, GeeksforGeeks, 20 August 2022,
https://www.geeksforgeeks.org/difference-between-matplotlib-vs-seaborn/
• Derek Banas, Seaborn Tutorial: Seaborn Full Course, 1 September 2020,
https://youtu.be/6GUZXDef2U0?si=aXrals6V1eB4dsZo (59:33 min)
– See the comments of the video for a table of contents with direct access to specific graphs.
SciPy

IS6061
PROGRAMMING FOR DATA AND BUSINESS ANALYTICS
MSc BIAS
Lecturer: Dr Selja Seppälä
IS6061 | SciPy | S. Seppälä Logo: https://github.com/scipy/scipy.org/blob/main/static/images/logo.svg
SciPy

• SciPy is a scientific computation library that uses NumPy underneath.
• SciPy stands for Scientific Python.
• It provides more utility functions for optimization, stats and signal
processing.
• SciPy has optimized and added functions that are frequently used
in NumPy and Data Science.
(Source: https://www.w3schools.com/python/scipy/scipy_intro.php)


SciPy

• SciPy is a free and open-source Python library used for scientific
computing and technical computing. (Source: Wikipedia)
• SciPy is a collection of mathematical algorithms and convenience
functions built on the NumPy extension of Python. (Source:
https://scipy.github.io/devdocs/tutorial/general.html)
• SciPy contains modules for optimization, linear algebra, integration,
interpolation, special functions, fast Fourier Transform (FFT), signal and
image processing, ODE solvers and other tasks common in science and
engineering. (Source: Wikipedia)


SciPy sub-packages
• Special functions (scipy.special)
• Integration (scipy.integrate)
• Optimization (scipy.optimize)
• Interpolation (scipy.interpolate)
• Fourier Transforms (scipy.fft)
• Signal Processing (scipy.signal)
• Linear Algebra (scipy.linalg)
• Sparse Arrays (scipy.sparse)
• Sparse eigenvalue problems with ARPACK
• Compressed Sparse Graph Routines (scipy.sparse.csgraph)
• Spatial data structures and algorithms (scipy.spatial)
• Statistics (scipy.stats)
• Multidimensional image processing (scipy.ndimage)
• File IO (scipy.io)

Based on: SciPy User Guide, https://docs.scipy.org/doc/scipy/tutorial/index.html#user-guide


SciPy

• SciPy sub-packages need to be imported separately, for example:
>>> from scipy import linalg, optimize
• Some of the functions in these subpackages are also made available in
the scipy namespace to ease their use in interactive sessions and
programs.
• In older SciPy releases, many basic array functions from numpy were also
available at the top level of the scipy package; this is deprecated, so import
them from numpy directly.

(Source: https://scipy.github.io/devdocs/tutorial/general.html)
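As a minimal sketch of importing and using two sub-packages, the snippet below solves a small linear system with scipy.linalg.solve and finds a root with scipy.optimize.brentq. The matrix, the vector and the function cos(t) - t are illustrative values chosen for this example; they do not come from the SciPy tutorial.

```python
import numpy as np
from scipy import linalg, optimize

# Solve the linear system A x = b with the linalg sub-package.
# A and b are made-up example values: 3x + y = 9 and x + 2y = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print("solution of A x = b:", x)

# Find a root of cos(t) - t on the interval [0, 1] with the optimize sub-package.
# brentq needs the function to change sign between the two interval ends.
root = optimize.brentq(lambda t: np.cos(t) - t, 0.0, 1.0)
print("root of cos(t) - t:", root)
```

Each sub-package is accessed through its own name (linalg, optimize), which is why it has to be imported explicitly.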



SciPy Module Examples



Optimisers in SciPy

Optimizers in SciPy
• Optimizers are a set of procedures defined in SciPy that either
find the minimum value of a function, or the root of an equation.

Optimizing Functions
• Many Machine Learning algorithms essentially amount to a complex
function that needs to be minimized with the help of given data.
(Source: https://www.w3schools.com/python/scipy/scipy_optimizers.php)
Optimisers in SciPy

Minimizing a Function
• A function, in this context, represents a curve. Curves have high points and
low points.
– High points are called maxima.
– Low points are called minima.
• The highest point in the whole curve is called the global maximum, whereas
the other high points are called local maxima.
• The lowest point in the whole curve is called the global minimum, whereas the
other low points are called local minima.
Finding Minima
• We can use the scipy.optimize.minimize() function to minimize a function.
(Source: https://www.w3schools.com/python/scipy/scipy_optimizers.php)
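The following is a minimal sketch of scipy.optimize.minimize() in use. The objective function (a parabola with its lowest point at x = 3), the starting guess and the choice of the BFGS method are assumptions made for illustration only.

```python
from scipy.optimize import minimize


def objective(x):
    # minimize() passes x as a 1-D array, even for a single variable.
    return (x[0] - 3.0) ** 2 + 2.0


# x0 is the starting guess; BFGS is one of several available methods.
result = minimize(objective, x0=[0.0], method="BFGS")

print("converged:", result.success)
print("location of the minimum:", result.x)
print("value at the minimum:", result.fun)
```

Because this parabola has a single minimum, the local minimum that the optimiser finds is also the global one. For curves with several local minima, the result can depend on the starting guess.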
SciPy Sparse Data

What is Sparse Data


• Sparse data is data that has mostly unused elements (elements that don't carry
any information).
– It can be an array like this one: [1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]
• Sparse data: a data set where most of the item values are zero.
• Dense array: the opposite of a sparse array; most of the values are not zero.
• In scientific computing, when we are dealing with partial derivatives in linear
algebra we will come across sparse data.
• SciPy has a module, scipy.sparse that provides functions to deal with sparse
data.
(Source: https://www.w3schools.com/python/scipy/scipy_sparse_data.php)
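As a small sketch, the array from the slide can be stored in a compressed sparse row (CSR) matrix with scipy.sparse.csr_matrix. Wrapping the values in a one-row 2-D array is a choice made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The example array from the slide, stored as a single-row 2-D array.
dense = np.array([[1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]])

# The CSR format keeps only the non-zero values and their positions.
sparse = csr_matrix(dense)

print("stored values:", sparse.data)
print("column positions:", sparse.indices)
print("number of stored values:", sparse.nnz)

# Convert back to a regular (dense) array when needed.
print(sparse.toarray())
```

Only 3 of the 12 values are stored, which is where the memory saving comes from on large, mostly-zero data sets.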



SciPy Interpolation
What is Interpolation?
• Interpolation is a method for generating points between given points.
– For example: for points 1 and 2, we may interpolate and find points 1.33 and 1.66.
• Interpolation has many uses. In Machine Learning we often deal with missing data in a
dataset, and interpolation is often used to substitute those values.
– This method of filling values is called imputation.
• Interpolation is also often used where we need to smooth the discrete points in a dataset.
How to Implement it in SciPy?
• SciPy provides us with a module called scipy.interpolate which has many functions to deal
with interpolation.
• See example code and outputs in: SciPy Interpolation, SciPy Tutorial for Beginners | Overview of
SciPy library, Great Learning Team, Updated on Sep 9, 2022,
https://www.mygreatlearning.com/blog/scipy-tutorial/
Based on: SciPy Interpolation, W3 Schools, https://www.w3schools.com/python/scipy/scipy_interpolation.php
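A minimal sketch of one-dimensional interpolation, assuming scipy.interpolate.interp1d (one of several functions in the module) and made-up points that lie on the line y = 2x + 1:

```python
import numpy as np
from scipy.interpolate import interp1d

# Known points (illustrative data on the line y = 2x + 1).
x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = 2.0 * x_known + 1.0

# interp1d returns a function; linear interpolation is the default.
f = interp1d(x_known, y_known)

# Generate values between the known points x = 1 and x = 2.
x_new = np.array([1.25, 1.5, 1.75])
y_new = f(x_new)
print(y_new)
```

The returned function f can be called on any value inside the range of x_known; by default, values outside that range raise an error.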
SciPy Statistical Functions

• Statistical functions are available in scipy.stats.
• This module contains
– a large number of probability distributions,
– summary and frequency statistics,
– correlation functions and statistical tests,
– masked statistics,
– kernel density estimation,
– quasi-Monte Carlo functionality,
– and more.
(Source: https://docs.scipy.org/doc/scipy/reference/stats.html#module-scipy.stats)


SciPy Statistical Significance Tests

What is a Statistical Significance Test?

• In statistics, a result is statistically significant when it is unlikely to
have been produced by chance alone, i.e., there is likely a reason behind it.
• SciPy provides us with a module called scipy.stats, which has
functions for performing statistical significance tests.
(Source: https://www.w3schools.com/python/scipy/scipy_statistical_significance_tests.php)
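As one possible illustration, the sketch below runs an independent two-sample t-test with scipy.stats.ttest_ind. The two groups are synthetic samples generated for this example (the means, spread, sample size and seed are all assumptions), and the 0.05 threshold is a common convention rather than a fixed rule.

```python
import numpy as np
from scipy import stats

# Two synthetic groups whose true means differ by 5 (illustrative data only).
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=50.0, scale=5.0, size=100)
group_b = rng.normal(loc=55.0, scale=5.0, size=100)

# Null hypothesis: both groups have the same mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t statistic:", t_stat)
print("p-value:", p_value)

alpha = 0.05  # conventional significance level
if p_value < alpha:
    print("The difference in means is statistically significant.")
else:
    print("No statistically significant difference was detected.")
```

A small p-value means that a difference this large would rarely arise by chance if the two groups really had the same mean.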



INTRODUCTION TO DATA SCIENCE ALGORITHMS

IS6061
PROGRAMMING FOR DATA AND BUSINESS ANALYTICS
MSc BIAS
Lecturer: Dr Selja Seppälä
Data Science Process: CRISP DM
• The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a
process model that serves as the base for a data science process. It has
six sequential phases:
1. Business understanding – What does the business need?
2. Data understanding – What data do we have / need? Is it clean?
3. Data preparation – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
5. Evaluation – Which model best meets the business objectives?
6. Deployment – How do stakeholders access the results?
• Published in 1999 to standardize data mining processes across
industries, it has since become the most common methodology for data
mining, analytics, and data science projects.
(Source: https://www.datascience-pm.com/crisp-dm-2/)
IS6061 | Introduction to Algorithms | S. Seppälä
CRISP DM
(Source: https://www.sv-europe.com/crisp-dm-methodology/)
(Source: Zipporah Luna, CRISP-DM Phase 4: Modeling Phase, Published in Analytics Vidhya, Aug 16, 2021,
https://medium.com/analytics-vidhya/crisp-dm-phase-4-modeling-phase-b81f2580ff3)
What is an algorithm?

• “algorithms are a set of instructions that are executed to get the
solution to a given problem. Since algorithms are not language-specific,
they can be implemented in several programming languages. No standard
rules guide the writing of algorithms. They are resource- and
problem-dependent but share some common code constructs, such as
flow-control (if-else) and loops (do, while, for).”
(Source: https://www.upgrad.com/blog/data-structures-algorithm-in-python/)
Example of algorithm

Find the age of the youngest person in the DB


• Input: records including age of person
• Output: age of the youngest person
• Algorithm:
– Initialise variable to track smallest_so_far=None
– Loop through records in DB and assign age to variable age
– Test if smallest_so_far==None
• If True, assign current value of age to smallest_so_far
– Otherwise, test if current value of age is smaller than smallest_so_far
• If True, assign current value of age to smallest_so_far
– Return age of the youngest person
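The steps above can be sketched in Python as follows. The function name youngest_age, the representation of the DB as a list of dictionaries, and the sample records are assumptions made for illustration.

```python
def youngest_age(records):
    """Return the smallest age found in records, or None if records is empty."""
    smallest_so_far = None
    for record in records:
        age = record["age"]
        if smallest_so_far is None:
            # First record seen: it is the smallest so far by definition.
            smallest_so_far = age
        elif age < smallest_so_far:
            # Found a younger person: update the tracker.
            smallest_so_far = age
    return smallest_so_far


# Hypothetical records standing in for the DB.
people = [
    {"name": "Ann", "age": 34},
    {"name": "Ben", "age": 19},
    {"name": "Cara", "age": 52},
]
print(youngest_age(people))  # 19
```

In everyday code the built-in min() does the same job; writing the loop out makes the individual steps of the algorithm visible.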
Choice of algorithm
Choice of algorithm depends on
• Data
– Type of data: categorical, continuous
– Data structures, i.e., the organisations of data that the algorithms use.
(Source: "The Practice of Computing Using Python, 3rd/ E, GE", Punch & Enbody, Copyright © 2017 Pearson Education, Ltd.)
• E.g., int, float, str, bool, list, dictionary, tuple
• E.g., user-defined (that is, users define them): Stack, Queue, Linked List, Tree, Graph and HashMap.
(Source: https://www.upgrad.com/blog/data-structures-algorithm-in-python/)
• Type of thing you want to achieve (task)
– Classification
– Clustering
– Prediction
• While data structures help in organizing information, algorithms provide the guidelines to
solve the problem of data analysis. (Source: https://www.upgrad.com/blog/data-structures-algorithm-in-python/)
Some types of algorithms

• Tree Traversal Algorithms (a tree can be traversed in 3 different ways)
• Sorting Algorithms (to arrange data in a particular format)
• Searching Algorithms (checking and retrieving an element from
different data structures)
• Graph Algorithms (methods of traversing graphs using their
edges)
(Source: https://www.upgrad.com/blog/data-structures-algorithm-in-python/)
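To give one concrete instance of the sorting and searching categories, here is a minimal sketch of binary search on a sorted list. The function name binary_search and the sample ages are hypothetical, and binary search is only one of many searching algorithms.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return -1


# Sorting step: binary search requires the data to be ordered first.
ages = sorted([42, 17, 29, 63, 35])  # [17, 29, 35, 42, 63]

print(binary_search(ages, 35))  # 2
print(binary_search(ages, 50))  # -1
```

Each comparison halves the remaining search range, which is why the list has to be sorted before searching.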

(Source: www.knime.com/sites/default/files/2021-08/l4-ml-slides.pdf)
Types of machine learning (ML) techniques
(Source: https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d2929)
Types of machine learning (ML) techniques
• Supervised learning: algorithm trained on labelled data
• Unsupervised learning: algorithm trained on unlabelled data
• Semi-supervised learning: only part of the data in training set is
labelled; the model labels the data in the training set using a modified
unsupervised learning procedure (Source:
www.knime.com/sites/default/files/2021-08/l4-ml-slides.pdf)
• Reinforcement learning: an agent learns in an interactive environment
by trial and error using feedback from its own actions and experiences.
(Based on: https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292)
• Active learning: a special case of machine learning in which a learning
algorithm can interactively query a user (or some other information
source) to label new data points with the desired outputs. (Source:
https://en.wikipedia.org/wiki/Active_learning_(machine_learning))
(Source: https://tutorialforbeginner.com/machine-learning-algorithms)
(Source: https://towardsdatascience.com/machine-learning-algorithms-in-laymans-terms-part-1-d0368d769a7b)
ML algorithm vs. ML model

• A “model” in machine learning is the output of a machine learning algorithm
run on data.
• A model represents what was learned by a machine learning algorithm.
• The model is the “thing” that is saved after running a machine learning
algorithm on training data and represents the rules, numbers, and any other
algorithm-specific data structures required to make predictions.
• The best analogy is to think of the machine learning model as a “program.”
• The machine learning model “program” is comprised of both data and a
procedure for using the data to make a prediction.
(Source: https://machinelearningmastery.com/difference-between-algorithm-and-model-in-machine-learning/)
ML algorithm vs. ML model

• Specifically, an algorithm is run on data to create a model.
– Machine Learning => Machine Learning Model
• We also understand that a model is comprised of both data and a
procedure for how to use the data to make a prediction on new
data. You can think of the procedure as a prediction algorithm if
you like.
– Machine Learning Model == Model Data + Prediction Algorithm
(Source: https://machinelearningmastery.com/difference-between-algorithm-and-model-in-machine-learning/)
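To make the "Model Data + Prediction Algorithm" idea concrete, here is a minimal sketch using least-squares line fitting in plain Python. The function names fit_line and predict, the dictionary used to hold the model data, and the toy training data are hypothetical choices for illustration.

```python
def fit_line(xs, ys):
    """Learning algorithm: run on data, returns the model data (two numbers)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope: covariance of x and y divided by variance of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return {"slope": slope, "intercept": intercept}


def predict(model, x):
    """Prediction algorithm: uses the saved model data on new input."""
    return model["slope"] * x + model["intercept"]


# Toy training data lying exactly on y = 2x + 1.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(model)
print(predict(model, 5))
```

Here fit_line plays the role of the machine learning algorithm, the returned dictionary is the model data that would be saved, and predict is the procedure that uses that data on new inputs.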

Steps in choosing a model
1. Determine the size of the training data: if you have a small dataset, with few observations and a high
number of features, you can choose high bias/low variance algorithms (Linear Regression, Naïve
Bayes, Linear SVM). If your dataset is large and has a high number of observations compared to the
number of features, you can choose low bias/high variance algorithms (KNN, Decision trees).
2. Accuracy and/or interpretability of the output: if your goal is inference, choose restrictive models, as
they are more interpretable (Linear Regression, Least Squares). If your goal is higher accuracy,
choose flexible models (Bagging, Boosting, SVM).
3. Speed or training time: always remember that higher accuracy, as well as larger datasets, means
longer training time. Examples of algorithms that are easy to run and implement are Naïve Bayes, Linear
and Logistic Regression. Examples of algorithms that need more time to train are SVM, Neural
Networks, and Random Forests.
4. Linearity: first check the linearity of your data by fitting a straight line or by running a
logistic regression; you can also check the residual errors. Higher errors mean that the data is not
linear and needs complex algorithms to fit. If the data is linear, you can choose Linear Regression,
Logistic Regression, or Support Vector Machines. If it is non-linear: Kernel SVM, Random Forest, Neural Nets.
(Source: Zipporah Luna, CRISP-DM Phase 4: Modeling Phase, Published in Analytics Vidhya, Aug 16, 2021,
https://medium.com/analytics-vidhya/crisp-dm-phase-4-modeling-phase-b81f2580ff3)
Some useful terminology

• Independent variable(s): input feature(s)
• Dependent variable: output class/label/target
• Model fitting: the measure of how well a machine learning model generalizes to data
similar to that with which it was trained. A good model fit refers to a model that
accurately approximates the output when it is provided with unseen inputs.
(Source: https://www.educative.io/answers/definition-model-fitting)
We would like our algorithm to generalise to data it hasn’t seen before.
Issues:
– Overfitting: see next slides
– Underfitting: see next slides
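One hedged way to see underfitting and overfitting in numbers is to fit polynomials of different degrees and compare the error on the training data with the error on held-out data. The sketch below uses NumPy's polyfit on synthetic noisy sine data; the data, the noise level, the seed, the degrees (1, 3, 9) and the hold-out scheme are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data: a sine curve plus random noise (illustrative only).
rng = np.random.default_rng(seed=1)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Hold out every third point as "unseen" data.
test_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]


def train_and_test_mse(degree):
    """Fit a polynomial on the training points; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err


errors = {degree: train_and_test_mse(degree) for degree in (1, 3, 9)}
for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

A straight line (degree 1) is too simple for a sine curve, so both errors stay high, which is underfitting. A very flexible model keeps lowering the training error, and when its error on the held-out points stops improving or gets worse, that is the typical sign of overfitting.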
(Source: www.knime.com/sites/default/files/2021-08/l4-ml-slides.pdf)
Additional References

• Top 10 Data Science Algorithms You Must Know About, by TechVidvan:
https://techvidvan.com/tutorials/data-science-algorithms/
• Difference Between Algorithm and Model in Machine Learning, by Jason Brownlee:
https://machinelearningmastery.com/difference-between-algorithm-and-model-in-machine-learning/
Questions?
