
SIC - C - P - Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization - v1.0

Uploaded by

Raj Jaiswal
Copyright
© © All Rights Reserved

Samsung Innovation Campus
Coding and Programming
Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization


Chapter Description

Chapter objectives

‣ Learners will be able to collect large amounts of data of various types and organize them into a form that can be analyzed.
‣ Learners will be able to generate various descriptive statistics for organized data using Pandas.
‣ Learners will be able to visualize data using Python visualization libraries.

Chapter contents

‣ Unit 34. Using Python Modules
‣ Unit 35. Pandas Series for Data Processing
‣ Unit 36. Pandas DataFrame for Data Processing
‣ Unit 37. Data Tidying
‣ Unit 38. Time Series Data

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 3
Unit 34.

Using Python Modules

Unit learning objective (1/3) UNIT
34

Learning objectives

‣ Learners will be able to explain how modules group classes, functions, variables, and executable code, and why modules are needed.

‣ Learners will be able to read a module's documentation and work out how to use the functions, classes, and parameters that the module contains.

‣ Learners will be able to choose among the import, from, and as statements according to need.

‣ Learners will be able to generate as many random integers and floats as they want with the random module when unspecified test data is needed.

‣ Learners will be able to create and convert specific date and time values using the datetime module.


Learning overview

‣ Be able to use Python's standard modules and external modules.
‣ Be able to use the most commonly used modules of the standard library.
‣ Be able to create the modules you need yourself and use them in other code.
‣ Be able to import an entire module, or only specific classes or functions from it, and use them in your code.
‣ Although we haven't covered the concept of time series yet, be able to generate date and time data using datetime.

Concepts you will need to know from previous units

‣ Know how to use Python functions.
‣ Know how to use Python classes.


Keywords

import, Standard Module, External Module
random, pip install, datetime

Mission


1. Make a Dice Game!


What do we need to know about making a dice game?
‣ Today, we are going to make a dice game in which the computer and the user take turns rolling a fixed number of dice to see who gets the higher total. Each of them gets one chance to roll again and can decide which of the dice to hold and which to reroll.
‣ The dice game involves randomness and an element of chance. Instead of deciding on a value ourselves, we will let the computer pick one at random.
‣ Through this mission, we will learn how to make our own Python functions, which will allow us to name
a block of code and then reuse it later by calling the name.


2. Your Dice Game will look like this!


Key concept


1. Modules
1.1. What is a Module?
A module lets Python code be logically grouped, managed, and reused. Normally, one Python .py file is one module. A module may define functions, classes, or variables, and may also include executable code.
Simply put, it is a code file.
Modules are divided into standard modules and external modules.
‣ Standard Module: A module built into Python.
‣ External Module: A module made by a 3rd party.

To use a module, import it. A single import statement can bring one or more modules into the code, as shown below.

import ModuleName

Line 1
• Use this form when only one module is needed.

import ModuleName1[, ModuleName2, ModuleName3, …]

Line 1
• Use this form to import several modules at once.

After importing a module with 'import', access a specific variable, function, or class of the module as follows.

‣ ModuleName.Variable
‣ ModuleName.FunctionName()
‣ ModuleName.Class()
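
The import forms and the dotted access patterns above can be sketched as follows. This is only an illustration; the choice of math, random, and time is an assumption made for the example.

```python
import math            # one module
import random, time    # several modules in a single import statement

print(math.pi)             # ModuleName.Variable
print(math.sqrt(16))       # ModuleName.FunctionName()
rng = random.Random(7)     # ModuleName.Class() creates an object from a class
print(rng.randint(1, 6))
```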

1.2. Standard Module, Standard Library
Standard modules are installed together with Python. There is no need to memorize what the standard library contains, since you can look it up with a search engine or in the Python standard library documentation at https://docs.python.org/3/library/.

Typical Functions of Standard Libraries


‣ Date and Time Modules
‣ Numbers and Math Modules
‣ File System Modules
‣ Operating System Modules
‣ Reading and Writing Data Format Modules such as HTML, XML, and JSON
‣ Internet Protocol Modules such as HTTP, SMTP, and FTP
‣ Multimedia Data Modules such as sound and video
‣ Localized Information Modules such as calls and dates

Since the standard library is built-in, it does not require a separate installation process and can be used
immediately by simply importing.

Line 1
• A math module, one of the standard libraries with math-related functions

# It is used in the form of ModuleName.FunctionName()

Line 1~2
• 1: It is used in the form of ModuleName.FunctionName().
• 2: An example of using sin(x), one of the functions of the math module, to obtain a sine value.
To find out what other functions the math module has besides math.sin(x), visit
https://docs.python.org/3/library/math.html. In this way, you can learn the basics and then look up what you
need through searching.
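
A minimal sketch of the math example described above. The argument 1 is an assumption chosen for illustration.

```python
import math

# It is used in the form of ModuleName.FunctionName()
result = math.sin(1)   # sine of 1 radian
print(result)
```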

1.3. External Module, External Library
Since external modules are created by other people (for example, as open-source libraries), they must be installed before they can be used in code.
The recommended way to install an external library is to use pip. Since Python 3.4, pip is included by default with the Python binary installers.
pip is a tool that installs packages from PyPI (Python Package Index), a widely used package repository.
Ex If you need a module with a particular feature, search PyPI for candidate libraries and install the one you choose with pip.

Visit https://pypi.org/ to search for the functionality you need.

Examine the library's detail page and, if it offers the function you want, copy the pip install command shown under the module name.

pip install chart

Run Anaconda Prompt, activate the virtual environment you are currently using, and install the library there. The installation command used in this example is pip install chart.


Line 1, 3
• 1: Import the external library you installed, using the import statement.
• 3: Use histogram(x), one of the chart library's functions, which outputs a histogram chart.

If you run the code and get an error message like the one below, read the message carefully. It is a
message showing that the module cannot be found, and in this case, it is an error caused by the library
not being installed. This is a mistake that is made more often at the beginner level than expected, and
there are cases where a library is not installed in the virtual environment.
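
The "module cannot be found" situation can be reproduced safely with the sketch below. The helper try_import and the module name a_module_that_is_not_installed are hypothetical names introduced only for this illustration.

```python
import importlib

def try_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as err:
        # This is the error raised when a library has not been installed.
        print("Module not found:", err.name)
        return None

math_module = try_import("math")                        # standard library
missing = try_import("a_module_that_is_not_installed")  # hypothetical name
```

When this message appears for a library you expected to have, a likely cause is that pip install was run in a different virtual environment.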

Let’s practice searching, installing, and using a library that outputs emoticons.

1.4. Make Your Own Module and Bring It Up For Use
A Python .py file can be imported and used as a module, and the entire script in the file can also be run directly. Let's practice importing our own module by calculating hospital funds.
We sell event tickets to raise funds for hospitals.
Each individual participating in the event pays 5t + 3.
Here t is the number of tickets purchased by one person.


# Save the name of this file as fund_cal.py.

Line 1
• Save the name of this file as fund_cal.py.
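
The original code for this slide is not available, so the following is only a sketch of the idea under stated assumptions. The function name ticket_price and the sample ticket counts are hypothetical. To keep the example self-contained, it writes fund_cal.py into a temporary folder before importing it; in class you would simply save the file next to your notebook.

```python
import sys
import tempfile
from pathlib import Path

# Contents of the module file. The function name is an assumption.
MODULE_SOURCE = '''
def ticket_price(t):
    """Amount one participant pays for t tickets: 5t + 3."""
    return 5 * t + 3
'''

# Save the module as fund_cal.py and make its folder importable.
module_dir = tempfile.mkdtemp()
Path(module_dir, "fund_cal.py").write_text(MODULE_SOURCE)
sys.path.insert(0, module_dir)

import fund_cal  # the .py extension is not written here

tickets = [1, 2, 4]  # made-up ticket counts for three participants
total = sum(fund_cal.ticket_price(t) for t in tickets)
print("Total Donation Amount :", total)
```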


("Enter the total number of people who participated in the hospital donation event"))

("Total Donation Amount :", total)

Line 1
• Import and use the fund_cal.py file, the module created and saved above. Be careful not to write the .py extension after the file name.

Ex An example of importing an external file (*.py) as a module and calling only specific functions from it.
# Store this file under the name cal_n.py

Line 1, 3, 4, 6, 12, 13
• 1: Store this file under the name cal_n.py.
• 3: An algorithm that calculates the sum of all the numbers from 1 to the input n.
• 4: The two slashes (//) perform integer division, here dividing by two.
• 6: An algorithm that adds up all the consecutive numbers from 1 to the input n.

• 12: To check that the algorithm above works, it can be verified with print(sum_n(10)). In this example the line is commented out, because the file is only meant for practicing how to import a module.
• 13: print(sum_n2(100))
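
The original cal_n.py is not shown, so this is a sketch of what the two functions might look like. The closed-form version of sum_n follows the note about the two slashes; the loop-based body of sum_n2 is an assumption about what "adding consecutive numbers" means here.

```python
# Sketch of cal_n.py

def sum_n(n):
    # Sum of 1 + 2 + ... + n using the closed form; // is integer division.
    return n * (n + 1) // 2

def sum_n2(n):
    # Assumption: the same total, accumulated one number at a time.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

# Verification lines, commented out so nothing prints on import:
# print(sum_n(10))
# print(sum_n2(100))
```

A script in the same folder could then use from cal_n import sum_n, sum_n2 to call only these two functions.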


("Enter a desired number for calculation : "))

("The sum of adding from 1 until", n, "is: ", sum_v)

("The sum of adding consecutively from 1 until", n, "is: ", sum_vv)

Line 1
• Note that the .py extension is not written when importing a file as a module.

1.5. Python Syntax: From
A module contains many variables and functions, and it is extremely rare to use all of them. When you want to use only a specific function in the module, you can use the 'from' syntax in the following format.
• from Module import FunctionName
• from Module import ClassName
• from Module import VariableName

If we apply it to the math.sin(1) we practiced earlier, it looks like the following.

Line 2
• We can use it by only using the function name, as opposed to adding the module name ‘math’ in
the front.
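
A minimal sketch of the from form applied to the earlier sine example:

```python
from math import sin

# Only the function name is needed; there is no "math." prefix.
value = sin(1)
print(value)
```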

Several variables or functions of a module can also be imported at once in the following format.

Line 1
• Several function names can be imported at once by separating them with ','.
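
For illustration, a sketch that imports two functions and one variable from math in a single statement. The particular names chosen are an assumption.

```python
from math import sin, cos, pi

a = sin(pi / 2)
b = cos(0)
print(a, b)
```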

However, if writing the module name in front each time is inconvenient, you can code using only the function names. All functions of the module can be imported using the form 'from ModuleName import *'.
from math import *

Line 1
• Even though the function sin is not written after import, it can be used only with the function
name.


Line 1
• You can round down without writing the function ‘floor’ after import.

Line 1
• You can round up without writing the function ‘ceil’ after import.
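
A sketch of rounding down and up. After from math import * the two calls would look exactly the same; this example names the functions explicitly instead, which is a deliberate variation that keeps it clear where floor and ceil come from.

```python
from math import floor, ceil

down = floor(3.7)   # round down
up = ceil(3.2)      # round up
print(down, up)
```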

1.6. Python Syntax: As
A module name can be long, which makes code cumbersome to write, and names can clash when installing and using external libraries. In such cases, give the module a different, shorter name with the 'as' syntax.

import ModuleName as DesiredModuleName(Identifier)
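
A short sketch of renaming modules on import. The aliases dt and rnd are arbitrary choices for illustration.

```python
import datetime as dt
import random as rnd

first_day = dt.date(2000, 1, 1)   # the alias replaces the module name
roll = rnd.randint(1, 6)
print(first_day, roll)
```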

As explained for 'from', several variables or function names can be imported at once using ','. Each of them can also be renamed with 'as':
from Module import Variable as Name1, Function as Name2, Class as Name3

Line 1
• It can be called one after another using ’as’ even while changing it using abbreviations.
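
A sketch of combining from with as for several names at once. The new names PI and root are assumptions made for the example.

```python
from math import pi as PI, sqrt as root

print(PI)
side = root(16)
print(side)
```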


2. Using Typical Standard Libraries


2.1. Using the Random Module to Create Random Numbers
This module is used to generate random numbers. It can also be used to draw a random sample from a list.
Random means that the result is unpredictable every time.
Ex When you roll a dice, one number from 1 to 6 is selected at random.

1) Example of Generating Random Numbers
‣ random() is a function of the random module that returns a random float greater than or equal to 0 and less than 1.

Line 4, 5
• 4: random() returns a random float greater than or equal to 0 and less than 1.
• 5: Each time it is executed, a number between 0 and 1 is randomly returned. Another number returns
randomly when 10 loops are executed.
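
A sketch of the loop described above. The variable names are chosen for illustration, and the line numbers in the notes refer to the original slide code, not to this sketch.

```python
import random

values = []
for _ in range(10):
    x = random.random()   # 0.0 <= x < 1.0, different on every call
    values.append(x)
    print(x)
```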

2) An example of randomly sampling part of a list or a tuple
‣ sample(list name, number of samples) randomly extracts the given number of samples from the list. It is used for random sampling without duplication.
‣ Instead of a list, a tuple may also be used as the collection to sample from. A set must first be converted to a list or tuple, because random.sample() no longer accepts sets as of Python 3.11.

Line 4
• We randomly extract the six samples from a collection called ‘data’ and store them in the variable
a.
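
A sketch of the sampling step. The contents of data (the numbers 1 to 45) are an assumption; only the names data and a come from the slide notes.

```python
import random

data = list(range(1, 46))    # hypothetical collection of 45 numbers
a = random.sample(data, 6)   # six distinct elements, chosen at random
print(a)
```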

‣ The choice() function randomly extracts one element from the collection, so the number of samples is not specified.

Line 3, 4
• 3: Tuple collection made of dog names
• 4: Random elements are extracted and stored in the variable my_lovely_dog.
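
A sketch of choice() on a tuple. The dog names are made up; the variable name my_lovely_dog comes from the slide notes.

```python
import random

dogs = ("Max", "Bella", "Charlie", "Luna")   # hypothetical dog names
my_lovely_dog = random.choice(dogs)          # one element, picked at random
print(my_lovely_dog)
```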

random.random()

Returns the next random floating-point number in the half-open range [0.0, 1.0).

2.2. Functions of the Random Module that Generate a Real-Valued Distribution
random.uniform(a,b)

Returns a random float N such that a <= N <= b if a <= b, and b <= N <= a if b < a.
The end value b may or may not be included in the range, depending on floating-point rounding in the expression a + (b-a) * random().
Simply put, it returns a random float in the range whose minimum and maximum are a and b.

("If a <=b, then a <= N <=b,", x)


("If b < a, then b <=N <=a", y)
("Set to return any float in the range where a is the minimum value and b is the maximum value.")

Set to return any float in the range where a is the minimum value and b is the maximum value.
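
A sketch that reconstructs the print statements above. The bounds 1 and 10 are assumptions chosen for illustration.

```python
import random

x = random.uniform(1, 10)    # a <= b
y = random.uniform(10, 1)    # b < a works as well
print("If a <= b, then a <= N <= b,", x)
print("If b < a, then b <= N <= a,", y)
```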

2.3. Functions of the Random Module that Generate an Integer Distribution
random.randint(a, b)

Returns a random integer N such that a <= N <= b, with a and b as the minimum and maximum values, respectively.

Line 3
• Returns an integer between 1 and 100 and stores it in variable x.
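
A minimal sketch of the call described above. Both end values can be returned.

```python
import random

x = random.randint(1, 100)   # 1 <= x <= 100
print(x)
```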

random.randrange(start, stop, step )

Returns a randomly selected integer from range(start, stop, step). The stop value itself is never returned.
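
A sketch of randrange() with a step. The arguments 0, 10, and 2 are assumptions chosen so that only even numbers below 10 can come out.

```python
import random

even = random.randrange(0, 10, 2)   # one of 0, 2, 4, 6, 8
print(even)
```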

random.randint(a, b)

Returns any integer N that satisfies a <= N <= b.

TIP

‣ The random module should not be used for security purposes. For security or
encryption, it is recommended to use the secrets module.
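
A brief sketch of the secrets module mentioned in the tip. The token length and the six-digit PIN are assumptions for illustration.

```python
import secrets

token = secrets.token_hex(16)     # 16 random bytes as 32 hex characters
pin = secrets.randbelow(10**6)    # integer in the range 0 to 999999
print(token)
print(f"{pin:06d}")
```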

2.4. Time Module
time.sleep(secs)

It is the most commonly used function of the time module.
It suspends execution of the calling thread for the given number of seconds.

Line 5
• All executions are paused for 2 seconds at this part while the loop is in execution.
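
A sketch of pausing inside a loop. The slide pauses for 2 seconds; the pause is shortened here (an assumption) so the example finishes quickly.

```python
import time

PAUSE_SECONDS = 0.2   # the slide uses 2 seconds

start = time.perf_counter()
for i in range(3):
    print("step", i)
    time.sleep(PAUSE_SECONDS)   # execution stops here on every pass
elapsed = time.perf_counter() - start
print("elapsed:", round(elapsed, 1), "seconds")
```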


Line 3, 4
• 3: A function that returns the current time as the number of seconds elapsed since 00:00:00 on January 1st, 1970. The return value is a float that is difficult to read.
• 4: A function that returns the time in a form that we can understand.
Unix timestamp
‣ It is also called Epoch time. The elapsed time from 00:00:00(UTC) on January 1st, 1970 is converted into
seconds and expressed as an integer.
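
The slide code is missing, so the sketch below assumes the two functions are time.time() and time.ctime(); the second one in particular is a guess.

```python
import time

now = time.time()            # seconds since the epoch, as a float
readable = time.ctime(now)   # the same moment as a readable string
print(now)
print(readable)
```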


Let’s make a simple electronic watch.
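
One possible sketch of such a watch, not the original solution. The function name run_clock and its parameters are hypothetical, and the number of ticks is limited so the program ends on its own.

```python
import time

def run_clock(ticks=3, interval=1.0):
    """Print the current time once per interval and return what was shown."""
    shown = []
    for _ in range(ticks):
        text = time.strftime("%H:%M:%S", time.localtime())
        print(text)
        shown.append(text)
        time.sleep(interval)
    return shown

# Use interval=1.0 for a real one-second watch; shortened here.
history = run_clock(ticks=3, interval=0.1)
```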

time.strftime(‘Format’, Time object)

When the %y format code is used, a two-digit year can be parsed. Values 69 to 99 are mapped to 1969 to 1999, and values 0 to 68 are mapped to 2000 to 2068.

%a : Abbreviated weekday name
%b : Abbreviated month name
%d : Day of the month as a decimal number
%Y : Year with century as a decimal number

‣ https://docs.python.org/3/library/time.html?highlight=time#time.strftime
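
A sketch that uses the format codes listed above and shows the two-digit year mapping with time.strptime(). The sample date strings are assumptions.

```python
import time

# Format the current local time with the codes from the list.
print(time.strftime("%a %b %d %Y", time.localtime()))

# Parse a fixed date, then format it again.
fixed = time.strptime("2000-05-03", "%Y-%m-%d")
iso_text = time.strftime("%Y-%m-%d", fixed)
print(iso_text)

# Two-digit years parsed with %y
year_69 = time.strptime("69", "%y").tm_year   # mapped to 1969
year_68 = time.strptime("68", "%y").tm_year   # mapped to 2068
print(year_69, year_68)
```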


2.5. Using the Basics of the Datetime Module
As a module related to date and time, it is often used to create date formats. Below is a summary of
various cases that print out basic dates.
datetime.date.today()
‣ Returns the current date.

Line 3, 6
• 3: Stores the current date in the variable today.
• 6: Separates the information about year

Line 7, 8
• 7: Separates the information about month
• 8: Separates the information about day
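
A sketch of reading today's date and its parts. The line numbers in the notes refer to the original slide code, not to this sketch.

```python
import datetime

today = datetime.date.today()   # current date
print(today)
print(today.year)    # year
print(today.month)   # month
print(today.day)     # day
```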

datetime.datetime.now()
‣ Returns the current date and time, including hours, minutes, and seconds.

Line 3, 6, 7
• 3: Year, month, day, hour, minute, second
• 6: Separates the information about year
• 7: Separates the information about month

Line 8, 9, 10, 11
• 8: Separates the information about day
• 9: Separates the information about hour
• 10: Separates the information about minute
• 11: Separates the information about second
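
A sketch of datetime.datetime.now() and the attributes that separate each part. As before, the slide's line numbers do not apply to this sketch.

```python
import datetime

now = datetime.datetime.now()   # year, month, day, hour, minute, second
print(now)
print(now.year, now.month, now.day)
print(now.hour, now.minute, now.second)
```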

2.6. Using datetime.timedelta to Indicate the Difference Between Two Dates and Times
datetime.timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0,
hours=0, weeks=0)
All arguments are optional and default to zero. Each can be an integer or a float, and either positive or negative.

‣ Milliseconds are converted into 1000 microseconds.


‣ Minutes are converted into 60 seconds.
‣ Hours are converted into 3600 seconds.
‣ Weeks are converted into 7 days.
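
The conversions listed above can be observed directly, because a timedelta stores only days, seconds, and microseconds. The argument values are assumptions chosen for illustration.

```python
import datetime

delta = datetime.timedelta(weeks=1, hours=1, minutes=1, milliseconds=1)

print(delta.days)           # 7     (1 week -> 7 days)
print(delta.seconds)        # 3660  (1 hour + 1 minute -> 3600 + 60 seconds)
print(delta.microseconds)   # 1000  (1 millisecond -> 1000 microseconds)
```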

Let’s find the date that is 30 days from May 3rd, 2000.

Line 3
• A datetime object can be created by passing the year, month, day, hour, minute, second, and microsecond to datetime.datetime().
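
A sketch of the calculation. The variable names are assumptions.

```python
import datetime

start = datetime.datetime(2000, 5, 3)          # May 3rd, 2000
later = start + datetime.timedelta(days=30)    # 30 days afterwards
print(later)   # 2000-06-02 00:00:00
```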

2.7. Using Replace to Replace Elements of a Specific Time
datetime.replace(year=, month= , day= , hour=, minute=, second=, microsecond= )

‣ You can change the elements of a specific time.

Line 7
• Change the month to December and the day to the 30th.
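
A sketch of replace(). The starting date and time are assumptions. Note that replace() returns a new object and leaves the original unchanged.

```python
import datetime

moment = datetime.datetime(2000, 5, 3, 9, 30)
changed = moment.replace(month=12, day=30)   # December 30th, same time

print(moment)    # 2000-05-03 09:30:00
print(changed)   # 2000-12-30 09:30:00
```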

2.8. OS Modules with Functions Related to the Operating System
It is a module that has functions related to the operating system. With the os module you can create a new folder on your computer or look at the list of files inside a folder.

• os.mkdir("Folder Name"): Creates a folder.
• os.rmdir("Folder Name"): Deletes the folder, which must be empty.
• os.getcwd(): Returns the current working directory.
• os.listdir(): Returns the list of files and directories in a folder.


'UNIT34_Using Python Modules.ipynb']
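
A sketch that exercises the four functions inside a temporary folder, so nothing in your working directory is touched. The folder name practice_folder is hypothetical.

```python
import os
import tempfile

base = tempfile.mkdtemp()                    # scratch area for the demo
folder = os.path.join(base, "practice_folder")

print(os.getcwd())                           # current working directory
os.mkdir(folder)                             # create a folder
before = os.listdir(base)                    # ['practice_folder']
os.rmdir(folder)                             # delete the (empty) folder
after = os.listdir(base)                     # []
print(before, after)

os.rmdir(base)                               # clean up the scratch area
```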


Paper coding
Try to fully understand the basic concepts before moving on to the next step.
A weak grasp of the basic concepts will increase your burden in this course and may cause you to fail it.
It may be difficult now, but to complete this course successfully we suggest that you fully understand each concept before moving on to the next step.


Q1.
Randomly select three multiples of 5 from the range 0 to 100 using the random module and
print them in the form of a list.

Write the entire code and the expected output results in the
note.


Q2.
Use timedelta to create a program that prints the 100-day anniversary of a special day of yours. It doesn't have to be a 100-day anniversary, so feel free to make your own special anniversary calculator.

Write the entire code and the expected output results in the
note.


Let’s code


Steps for Writing Python Code


STEP 1
Ask the user to enter the number of dice. Then, convert the data type of the number of dice from string
to integer.

STEP 2
Ask the user to press any key to start the dice game.

STEP 3
We need to generate random numbers for rolling a dice. We’re going to import and use the random
module from the Python library. Write the line on the very top of the code.


STEP 4
Define a function named “roll_dice(n).” Make sure to create the function right after the import
statement. It will use a parameter “n,” which is the number of dice. We are going to call this function in
the main program for rolling a dice. Within the roll_dice(n) function, create an empty list that will store
the dice numbers rolled.

STEP 5
Add random numbers between 1 and 6 to the list of dice. Generate as many numbers as the number of dice passed in as the parameter. When you're done appending all the random numbers
required to the list, return the list of dice.
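
Steps 3 to 5 can be sketched as follows. This is one possible implementation, not the official solution; the fixed value of number_dice stands in for the user input of Step 1, and the message text is an assumption.

```python
import random

def roll_dice(n):
    """Roll n dice and return the results as a list."""
    dice = []                                # empty list for the rolled numbers
    for _ in range(n):
        dice.append(random.randint(1, 6))    # one die: 1 to 6
    return dice

number_dice = 5                              # would come from input() in Step 1
user_rolls = roll_dice(number_dice)
print("User first roll:", user_rolls)
```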


STEP 6
In the main programming area, call the roll_dice(n) function with number_dice as a parameter. Name the
list of dice returned by the function as “user_rolls.” Display the user’s first roll on the screen.


STEP 7
Get the user’s choices for holding on or rerolling each dice. Check if the user enters the right number of
choices that matches the number of dice. If the length of the user’s input and the number of dice do not
match, ask the user to re-enter the choices.
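
A sketch of the length check in Step 7. The helper name ask_choices, the prompt text, and the read parameter are hypothetical; read defaults to input and is only replaced here so the example can run without a keyboard.

```python
def ask_choices(number_dice, read=input):
    """Keep asking until the answer has one character per dice."""
    while True:
        choices = read("Enter one choice per dice: ")
        if len(choices) == number_dice:
            return choices
        print("You must enter", number_dice, "choices. Try again.")

# Simulated answers: the first is too short, the second is accepted.
answers = iter(["rr", "rhrhr"])
user_choices = ask_choices(5, read=lambda prompt: next(answers))
print(user_choices)
```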


STEP 8
We are going to use the sleep( ) function to pause the program for a few seconds. Import the time module
from the Python standard library. Write the import statement below the one that imports the random module.

STEP 9
Define a function named “roll_again(choices, dice_list),” two lines after the function definition of
roll_dice(n). Its parameters are the choices of either the user or the computer, and the list of dice.
Display a message, and pause the program for 3 seconds to wait for the dice to be rolled. Refer to
the note on time.sleep( ) below for using the sleep( ) function of the time module.

STEP 10
Step through the choices, and if a character within the string is “r” (roll again), replace the die at
that index of the dice list with a new random number between 1 and 6. When the for loop is done,
pause the program again for 3 seconds. The function updates the list in place, so it does not need a
return statement at the end.

Important

time.sleep(secs)

‣ Pauses your Python program for secs seconds. Here, secs is the number of seconds
that the Python code should pause execution, and the argument should be either an integer or a
float.
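
Steps 8 to 10 can be sketched as follows. The pause parameter is an addition made for this sketch so that the example can run without waiting (the steps above always pause for 3 seconds), and the message text is a placeholder.

```python
import random
import time


def roll_again(choices, dice_list, pause=3):
    """Re-roll, in place, every die whose choice character is 'r'."""
    print("Rolling again ...")
    time.sleep(pause)  # pause before the dice are re-rolled
    for i in range(len(choices)):
        if choices[i] == "r":
            dice_list[i] = random.randint(1, 6)
    time.sleep(pause)
    # no return statement: the caller's list has been updated in place


user_rolls = [2, 6, 3]
roll_again("r-r", user_rolls, pause=0)  # pause=0 only to keep the demo quick
print("Your new roll:", user_rolls)
```

Because the list is changed in place, the main program sees the new roll in user_rolls without assigning a return value.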

STEP 11
In the main programming area, call the function roll_again(choices, dice_list) with user_choices and
user_rolls as parameters. The list of dice will be updated through the execution of the function, based on
the user’s choices. Display the user’s new roll on the screen.

STEP 12
Now, it’s the computer’s turn to roll a dice. Call the function roll_dice(n) with number_dice as a
parameter. Name the list of dice returned by the function as “computer_rolls.” Display the computer’s
first roll on the screen.

STEP 13
Define a function named “computer_strategy(n),” which will decide on the computer’s choices for
whether to hold on or to re-roll each dice. It will use a parameter “n,” which is the number of dice.
Display a message that the computer is thinking, and pause the program for 3 seconds. Also, create an
empty list that will store the computer’s choices on each dice.

TIP

 You can create an empty list in two ways:


① Name of an empty list = [ ]
② Name of an empty list = list( )

Note that list[ ] is a syntax error, and ‘ ’ creates a string, not a list.

STEP 14
Next, step through each element in the list of dice for the computer. If the dice in the computer’s dice
list is less than 5, append “r” (roll again) to the computer’s choices list. If it is 5 or 6, append “-” (hold)
to the computer’s choices list. Lastly, return the list of choices that the computer has made.
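
Steps 13 and 14 can be sketched as follows. Two things here are assumptions made for this sketch: the function receives the computer's dice list directly instead of the number of dice n, so that the example is self-contained, and the pause parameter is added so that the example can run without waiting.

```python
import time


def computer_strategy(dice_list, pause=3):
    """Return 'r' (roll again) for dice below 5 and '-' (hold) for 5 or 6."""
    print("Computer is thinking ...")
    time.sleep(pause)
    choices = []  # empty list that will store the computer's choices
    for die in dice_list:
        if die < 5:
            choices.append("r")
        else:
            choices.append("-")
    return choices


print(computer_strategy([1, 5, 4, 6], pause=0))  # ['r', '-', 'r', '-']
```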

STEP 15
In the main programming area, call the function computer_strategy(n) with number_dice as a
parameter, and store the list of choices that it returns. These are the computer’s choices for
holding on or re-rolling each die.

STEP 16
Now, re-roll the computer’s dice. Call the function roll_again(choices, dice_list) with the computer’s
choices and computer_rolls as parameters. The list of dice will be updated through the execution of the
function, based on the computer’s choices. Display the computer’s new roll on the screen.

STEP 17
Define a function named “find_winner(cdice_list, udice_list),” which will determine the winner of the
dice game. Its parameters are the computer’s and the user’s lists of dice. With Python’s built-in
sum( ) function, calculate the total of the dice numbers in each list.
Then, display the totals on the screen.

Important

sum(iterable, start)

‣ Returns a number, the sum of all items in an iterable. Here, iterable is a required parameter: the
sequence, such as a list, of numbers to sum. start is an optional parameter (default 0): a value that
is added to the sum of the items.

STEP 18
If the user has a higher total, display that the user is the winner. If the computer has a higher total,
display that the computer is the winner. Otherwise, the user and the computer have the same total, so
display that it is a tie. The function only displays the result, so it does not return anything.
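
Steps 17 and 18 can be sketched as follows. The message texts and the sample dice lists are placeholders, not the original slide code.

```python
def find_winner(cdice_list, udice_list):
    """Print both totals and announce the winner of the dice game."""
    computer_total = sum(cdice_list)
    user_total = sum(udice_list)
    print("Computer total:", computer_total)
    print("User total:", user_total)
    if user_total > computer_total:
        print("The user wins!")
    elif computer_total > user_total:
        print("The computer wins!")
    else:
        print("It is a tie!")


find_winner([5, 6, 2], [6, 6, 3])
print(sum([1, 2, 3], 10))  # the optional start value 10 is added to the sum
```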

STEP 19
In the main programming area, call find_winner(cdice_list, udice_list) with computer_rolls and
user_rolls as parameters. These lists of dice have been updated by the previous function calls.
The program ends once the find_winner(cdice_list, udice_list) function has determined the winner of
the dice game.

Finalized Code for a Dice Game
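
The original code listing is not reproduced in this text. The following is one possible assembly of Steps 1 to 19, offered as a sketch under stated assumptions: the PAUSE constant, the prompts, the messages, the computer_choices name, the main( ) wrapper, and passing the dice list to computer_strategy are all choices made for this sketch.

```python
import random
import time

PAUSE = 3  # seconds to wait between messages


def roll_dice(n):
    dice = []
    for _ in range(n):
        dice.append(random.randint(1, 6))
    return dice


def roll_again(choices, dice_list):
    print("Rolling again ...")
    time.sleep(PAUSE)
    for i in range(len(choices)):
        if choices[i] == "r":
            dice_list[i] = random.randint(1, 6)
    time.sleep(PAUSE)


def computer_strategy(dice_list):
    print("Computer is thinking ...")
    time.sleep(PAUSE)
    choices = []
    for die in dice_list:
        choices.append("r" if die < 5 else "-")
    return choices


def find_winner(cdice_list, udice_list):
    computer_total = sum(cdice_list)
    user_total = sum(udice_list)
    print("Computer total:", computer_total)
    print("User total:", user_total)
    if user_total > computer_total:
        print("The user wins!")
    elif computer_total > user_total:
        print("The computer wins!")
    else:
        print("It is a tie!")


def main():
    number_dice = int(input("Enter the number of dice: "))
    input("Press Enter to start the dice game.")
    user_rolls = roll_dice(number_dice)
    print("Your first roll:", user_rolls)
    prompt = "Enter - to hold or r to roll again for each die: "
    user_choices = input(prompt)
    while len(user_choices) != number_dice:
        print("You must enter", number_dice, "choices.")
        user_choices = input(prompt)
    roll_again(user_choices, user_rolls)
    print("Your new roll:", user_rolls)
    computer_rolls = roll_dice(number_dice)
    print("Computer's first roll:", computer_rolls)
    computer_choices = computer_strategy(computer_rolls)
    roll_again(computer_choices, computer_rolls)
    print("Computer's new roll:", computer_rolls)
    find_winner(computer_rolls, user_rolls)
```

Call main( ) to play the game. It is left uncalled here so that the functions can be imported and tried one at a time.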

Pair programming

Pair Programming Practice


Guideline, mechanisms & contingency plan
Preparing pair programming involves establishing guidelines and mechanisms to help students pair
properly and to keep them paired. For example, students should take turns “driving the mouse.”
Effective preparation requires contingency plans in case one partner is absent or decides not to
participate for one reason or another. In these cases, it is important to make it clear that the active
student will not be punished because the pairing did not work well.

Pairing similar, not necessarily equal, abilities as partners


Pair programming can be effective when students of similar, though not necessarily equal, abilities
are paired as partners. Pairing mismatched students often can lead to unbalanced participation.
Teachers must emphasize that pair programming is not a “divide-and-conquer” strategy, but rather a
true collaborative effort in every endeavor for the entire project. Teachers should avoid pairing very
weak students with very strong students.
Motivate students by offering extra incentives
Offering extra incentives can help motivate students to pair, especially with advanced students.
Some teachers have found it helpful to require students to pair for only one or two assignments.

Prevent collaboration cheating
The challenge for the teacher is to find ways to assess individual outcomes while leveraging the
benefits of collaboration. How do you know whether a student learned or cheated? Experts recommend
revisiting course design and assessment, as well as explicitly and concretely discussing with
the students the behaviors that will be interpreted as cheating. Experts encourage teachers to make
assignments meaningful to students and to explain the value of what students will learn by
completing them.
Collaborative learning environment
A collaborative learning environment occurs anytime an instructor requires students to work together
on learning activities. Collaborative learning environments can involve both formal and informal
activities and may or may not include direct assessment. For example, pairs of students work on
programming assignments; small groups of students discuss possible answers to a professor’s question
during lecture; and students work together outside of class to learn new concepts. Collaborative
learning is distinct from projects where students “divide and conquer.” When students divide the
work, each is responsible for only part of the problem solving and there are very limited opportunities
for working through problems with others. In collaborative environments, students are engaged in
intellectual talk with each other.

Q1. Complete the creative artwork by consulting the documentation with your partner.
The sample code for this exercise draws an artwork using turtle graphics. turtle is one of Python’s
standard library modules and is very easy to use.
The documentation for the Python turtle graphics module is at the link below:
https://docs.python.org/3/library/turtle.html

Line 1, 6, 7, 10, 12, 15, 18, 21, 29
• 1: make a geometric rainbow pattern
• 6: turn background black
• 7: make 36 hexagons, each 10 degrees apart
• 10: make hexagon by repeating 6 times
• 12: pick color at position i
• 15: add a turn before the next hexagon
• 18: get ready to draw 36 circles
• 21: repeat 36 times to match the 36 hexagons
• 29: hide turtle to finish the drawing

When you select Help → Turtle Demo from the menu in Python IDLE, the demo window opens.
You can go through the demos one by one with your pair programming partner and choose a demo
that you want to use to change the artwork.

Unit 35.

Pandas Series for Data Processing

Learning objectives

 Be able to install the latest version of the pandas library in the current virtual environment

 Be able to distinguish between a Series and a DataFrame when a pandas-based data set is given

 Be able to convert list and dictionary data sets into Series

 Be able to select a specific data element from a Series data set and perform arithmetic
operations on it

 Be able to identify the features of the data and draw a suitable graph

Learning overview

 Learn how to install the pandas library and other basic dependent libraries in the virtual environment
 Learn about the two different types of pandas data structures
 Learn how to create a Series by using a list, a dictionary, NumPy, and scalar values
 Learn how to find a specific element in a Series
 Learn how to perform arithmetic operations within the Series structure
 Learn to visualize a Series data set by using matplotlib

Concepts you will need to know from previous units

 How to install and import python modules


 How to perform arithmetic operation
 Python data structures including list, dictionary, etc.

Keywords

Pandas, Series, Data, element, index, line graph, bar graph

Mission

What is population density?


Population density is the number of people per unit area of a region. It is normally expressed as the
number of people per 1 ㎢.
Population density can be used to explain the location, growth, and movement of many organisms.
For humans, population density is often used in urbanization, immigration, and population
statistics.
Statistics on global population density are tracked by the United Nations Statistics Division.
‣ Population density is easy to calculate.
Population density = Number of people / Area (㎢)

1. Convert the attached dictionary data into Series objects.
2. Calculate the change in the population of each city from 2020 to 2021 and express it as a bar graph.
3. Calculate the population density of each city and find the cities with the maximum and
minimum population density.
4. Visualize the population density calculation results.

Use the attached dictionary data to solve the mission.

The data source is https://unstats.un.org/; the data cover the 15 cities with the largest populations.

• population_2020 is a dictionary object that stores the total population of each city in 2020.

• population_2021 is a dictionary object that stores the total population of each city in 2021.

• area is a dictionary object that stores the area of each city, and the unit is ㎢.
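
The mission calculations can be sketched as follows. The attached dictionaries are not reproduced in this text, so the sketch uses made-up placeholder values for three hypothetical cities; they are not the UN data. The variable names other than population_2020, population_2021, and area are also chosen for illustration.

```python
import pandas as pd

# Placeholder values for three hypothetical cities, NOT the mission data.
population_2020 = {"City A": 1_000_000, "City B": 2_500_000, "City C": 400_000}
population_2021 = {"City A": 1_020_000, "City B": 2_450_000, "City C": 410_000}
area = {"City A": 500.0, "City B": 1000.0, "City C": 100.0}  # unit: km2

pop_2020 = pd.Series(population_2020)
pop_2021 = pd.Series(population_2021)
area_km2 = pd.Series(area)

change = pop_2021 - pop_2020   # elements are paired by city name
density = pop_2021 / area_km2  # people per km2

print(change)
print(density)
print("Highest density:", density.idxmax())
print("Lowest density:", density.idxmin())
```

With matplotlib installed, change.plot(kind="bar") and density.plot(kind="bar") would draw the two results as bar graphs.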

Key concept

1. Introduction of Pandas
Developed in 2008 by Wes McKinney, pandas has been open source since 2009 as a library for data
processing.
pandas was originally designed for operating on and analyzing time series data in finance, especially
stock prices. Such work required analytic tools for searching, indexing, refining, arranging,
reshaping, and slicing data, and pandas was developed as a solution.

Since becoming open source, pandas has grown through the efforts of many contributors and now
supports the data-related features listed below. The list was compiled with reference to
https://pandas.pydata.org/.
‣ Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing
‣ Intelligent data alignment using indices and labels
‣ Integrated handling of missing data
‣ Conversion of unorganized data into organized data
‣ Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services
‣ Ability to process data stored in CSV, Excel, and JSON formats
‣ Flexible pivoting and reshaping of data sets
‣ Slicing by index value, indexing, and subsetting of data sets
‣ Insertion and deletion of data set columns
‣ A powerful grouping tool with split-apply-combine functions for combining or transforming data
‣ High-performance merging and joining of data sets
‣ Performance optimization, with critical code paths written in Cython or C

2. How to Use Pandas


pandas is best suited to data manipulation. However, not every data-related problem can be solved
with pandas alone.
Rather than being a comprehensive data analysis tool, pandas is a data manipulation tool that offers
some analytic functions.
For more in-depth analysis, data manipulation, and data visualization in different fields, use libraries
such as SciPy, NumPy, scikit-learn, matplotlib, seaborn, and ggvis along with pandas. A great
advantage of pandas is that many Python-based libraries work with it.

Advantages of learning pandas
‣ Data processing is a prerequisite for machine learning and deep learning. Even outside the AI field,
being able to process and analyze data beyond what Excel offers is a great practical advantage in an
ordinary working environment.

[Figure: Python Pandas Features: cleaning up data, Python support, visualization, great handling of data,
optimized performance, grouping, merging and joining of datasets, alignment and indexing, unique data,
multiple file formats supported, input and output tools, mask data, handling missing data, lots of time series]

3. Pandas Has Two Different Data Structures


Series and DataFrame are two different data structure types of Pandas. It is extremely important to fully
understand those two data structures.
The difference between the two data structures is that while Series has one-dimensional array data,
DataFrame has two-dimensional array data.

Series: A sequentially arranged one-dimensional array. Index (yellow) and data (white) are paired 1:1.
DataFrame: A two-dimensional array consisting of index and column (blue). Combining multiple Series
can make one DataFrame.

4. Install Pandas Library


1) pandas officially supports the following Python versions: Python 3.7.1 and above, 3.8, and 3.9.
‣ See https://pandas.pydata.org/docs/getting_started/install.html#python-version-support for reference

2) There are many different ways to install pandas, but this unit only introduces installing pandas
from PyPI, in the same way as the external library installation introduced previously.

‣ Just like any other external library, pandas can be installed from PyPI by using pip.
‣ https://pypi.org/project/pandas/

3) Open Anaconda Prompt, activate the virtual environment, and then run pip install pandas.
Installation must be done after switching to the virtual environment that you are currently using. This
applies to every external library installation, yet many beginners get it wrong.

4) After installation, check the pandas.__version__ attribute to see the installed pandas version and
confirm that the installation succeeded.
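
A minimal check of the installed version, assuming pandas is already installed in the active environment, might look like this. The alias pd is the common convention rather than a requirement.

```python
import pandas as pd

# __version__ is a string such as "1.3.4"; the exact value depends on your installation
print(pd.__version__)
```

If the import itself fails with an ImportError, pandas is not installed in the Python environment that is running the code.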

5) Possible error after module installation (solution for ImportError).
‣ In general, an ImportError means that Python cannot find pandas in its list of installed libraries.
Python has a built-in list of directories that it searches to find a package.
‣ Most of the time, the error occurs because several versions of Python are installed on the computer and
the module was installed into a different Python installation from the one currently in use.
‣ Use the built-in Python module sys to locate the installation in use.

‣ On Linux/Mac, you can run a specific Python from the terminal to see which Python installation is used.
It is not recommended to use “/usr/bin/python” because that is the system’s default Python.
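
One way to use the sys module for this check is sketched below. It only prints information, so it is safe to run in any environment.

```python
import sys

# Full path of the Python interpreter that is running this code
print(sys.executable)

# Directories that Python searches, in order, when importing a module
for directory in sys.path:
    print(directory)
```

If the interpreter path is not inside the virtual environment where pandas was installed, the import fails even though the installation itself succeeded.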

6) The most precise and reliable installation method is to use conda.
‣ Conda is an open source package and environment management system that runs on Windows,
macOS, and Linux. Conda can easily and quickly install, run, and update packages and their
dependencies. It was originally developed for Python programs, but it can package and distribute
software for any language (Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN).
‣ Like other modules from PyPI, conda can be installed through pip install conda, but the result is
highly likely not to be the latest version.
‣ Because conda is included in all versions of Anaconda®, Miniconda, and Anaconda Repository, it is
recommended to install Anaconda or Miniconda first.

conda install PKGNAME==VERSION
For example: conda install pandas==3.1.4 (PKGNAME is pandas, VERSION is 3.1.4)

5. Optional Dependencies
When using pandas, specific methods often need additional dependency packages. Without them, those
methods fail. If you get an import error even though you used a method exactly as shown in a book or
lecture, check the dependency packages.
It is recommended to install the dependency packages before using pandas.
Ex: read_hdf() requires the pytables package, while DataFrame.to_markdown() requires the tabulate package.

The following table shows some of the optional dependencies required for this unit. It is mandatory to
install them before continuing this chapter.
For computation
- SciPy : Miscellaneous statistical functions
- numba : Alternative execution engine for rolling operations

Excel operation
- xlrd : Reading Excel
- xlwt : Writing Excel
- xlsxwriter : Writing Excel
- openpyxl : Reading / writing for xlsx files
- pyxlsb : Reading for xlsb files

HTML operation
- BeautifulSoup4 : HTML parser for read_html
- html5lib : HTML parser for read_html
- lxml : HTML parser for read_html

XML operation
- lxml : XML parser for read_xml and tree builder for to_xml

Data visualization
- matplotlib : Plotting library
- seaborn : Python data visualization library based on matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics.

SQL database
- SQLAlchemy : SQL support for databases other than sqlite
- psycopg2 : PostgreSQL engine for sqlalchemy
- pymysql : MySQL engine for sqlalchemy

6. Series
Series is the basic data structure of pandas.
It is similar to a NumPy array, but unlike an array, a Series has an index. As shown below, each
index value is paired 1:1 with a data value. Thus, the structure is similar to a dictionary, which has a
{key : value} structure.
Each index value functions as the address of its data.

index  data
0      1
1      2
2      3
3      4
4      5

As shown above, you do not need to designate an index separately to create a Series.
pandas automatically generates an integer index that starts from 0 and increases by 1 for each data item.
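
The automatically generated index can be seen in a short sketch like the following. The variable name s is chosen for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])  # no index is designated
print(s)
print(list(s.index))  # [0, 1, 2, 3, 4]
```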

TIP

 The Series index can be designated flexibly. It does not have to start from 0, and index values of
types other than integer can be used as well.
 When creating a Series, use the index parameter to designate the index values you want.

Visit https://pandas.pydata.org/docs/reference/series.html to check the available parameters for Series.
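
As a small illustration of the index parameter, the sketch below builds two Series from the same data with different index values. The names and values are made up for the example.

```python
import pandas as pd

# String labels as the index
by_label = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Integer index that does not start from 0
by_number = pd.Series([1, 2, 3], index=[10, 20, 30])

print(by_label)
print(by_number)
```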

7. Creating Series
There are many different ways to create a Series, but only four of them are covered in this chapter.
‣ Creating a Series by using a Python list
‣ Creating a Series by using a Python dictionary
‣ Creating a Series by using NumPy
‣ Creating a Series by using a scalar value (a single value, such as one integer, rather than a collection of values)

7.1. Creating series from a Python list or dictionary
pandas.Series(list or dictionary)

‣ Each dictionary key becomes an index value of the Series, while each dictionary value becomes the
corresponding element (data value) of the Series.

Series →      index             data
              Iron Man          2010
              Captain America   2011
              Thor              2013
              Winter Soldier    2014
              Ultron            2015
dictionary →  key               value
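
The conversion can be sketched with the values from the table above. The variable names marvel_dict, marvel, and titles are assumptions made for this sketch.

```python
import pandas as pd

marvel_dict = {
    "Iron Man": 2010,
    "Captain America": 2011,
    "Thor": 2013,
    "Winter Soldier": 2014,
    "Ultron": 2015,
}

# Dictionary: keys become the index, values become the data
marvel = pd.Series(marvel_dict)
print(marvel)

# List: the index is generated automatically, starting from 0
titles = pd.Series(list(marvel_dict))
print(titles)
```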

7.2. Creating series by using array data made with NumPy
‣ Series objects can be initialized from the arrays returned by various NumPy functions. Use
array-creating functions and methods to create a Series.

Line 4
• arange() creates an array of integers from 10 to 15

Line 6
• The array created with NumPy is converted into a Series object
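
A sketch of these two steps is shown below. The variable names are chosen for illustration and the line numbers of the original listing are not reproduced.

```python
import numpy as np
import pandas as pd

arr = np.arange(10, 16)  # the stop value 16 is excluded: 10, 11, ..., 15
s = pd.Series(arr)       # convert the NumPy array into a Series
print(s)
```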

Line 6
• Creates 6 normally distributed random numbers and stores them to variable r.

7.3. Creating series by using scalar value
‣ A scalar value is a single value, such as one integer, rather than a collection of values.
‣ First, create a simple Series object that holds a single data item.

Line 6
• Creates a Series object that consists of one index whose data is 4.
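
A sketch of creating a Series from a scalar follows. The second example, which repeats the scalar for every label of a given index, is an additional illustration and the labels are made up.

```python
import pandas as pd

single = pd.Series(4)  # one index (0) whose data is 4
print(single)

# With an index, the scalar is repeated for every index label
repeated = pd.Series(4, index=["a", "b", "c"])
print(repeated)
```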

Line 1, 3
• 1: Creates an array of integers from 0 to 5 in order
• 3: Uses the array data to create a Series object

TIP

 In general, a Series is used to express time series data whose index consists of dates and times.
Use the pandas.date_range() function to create a range of dates to use as index values.

Line 5, 7
• 5: Converts a list into series object to prepare data for practice
• 7: When printed, the indices will consist of integers starting from 0.

Line 1, 3
• 1: Creates a range of dates in the form pandas.date_range('start date', 'end date')
• 3: When printed, the result is displayed as a special index type, DatetimeIndex.

Line 1~3
• 1, 2: Replaces the index of marvel, which is a Series object, with the date range that was previously
created.
The number of values in the new index must match the number of elements in the object; otherwise,
an error occurs.
• 3: Checks the Series object with the changed index.
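
The steps of this tip can be sketched as follows. The list contents and the dates are placeholders chosen for this sketch; only the name marvel comes from the notes above.

```python
import pandas as pd

# A Series made from a list gets an integer index starting from 0
marvel = pd.Series(["Iron Man", "Captain America", "Thor", "Winter Soldier", "Ultron"])
print(marvel)

# Five daily dates; both the start date and the end date are included
dates = pd.date_range("2021-01-01", "2021-01-05")
print(dates)

# The new index must have as many values as the Series has elements
marvel.index = dates
print(marvel)
```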

8. Selecting a Specific Element (Data) from Series


In a Series object, each index value is paired 1:1 with a data value. Think of the index as the intrinsic
address of each data value and use it to find, arrange, and select data.
As explained earlier, there are two types of indices: the integer index that is automatically
created from 0 when nothing is designated, and the index label, for which you designate the index
names yourself.

Series with an integer index starting from 0
‣ Creates a Series object that consists of a general integer index

Series with index labels other than integers
‣ Creates a Series object that consists of index labels with designated names

• Series name.index: Returns all index values of the Series data
• Series name.values: Returns all element values of the Series data

Line 4
• Converts the list into series object to prepare data for practice.

Line 1
• Use the index property to get the index values of the Series.

Line 1
• Use the values property to get all of the data elements of the Series.
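As a small illustration of the two properties, assuming a hypothetical Series named sr:

```python
import pandas as pd

sr = pd.Series([1, 2, 3], index=["a", "b", "c"])

idx = sr.index    # Index object holding the labels
vals = sr.values  # NumPy array holding the data elements

print(idx)
print(vals)
```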



8.1. Selecting series element
‣ The way to select an element differs between an integer index and an index label.
‣ How to select an element by integer index

Line 6, 7
• 6: Selects data by a single index position.

• 7: Selects data over a range of index positions.
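A sketch of selection by integer position follows. It uses .iloc, which always selects by position; the slides may use plain square brackets instead, and the data here are placeholders.

```python
import pandas as pd

sr = pd.Series(["a", "b", "c", "d"])

first = sr.iloc[0]     # a single element by position
middle = sr.iloc[1:3]  # positions 1 and 2; the end position is excluded

print(first)
print(middle)
```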

‣ How to select an element from index label
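Selection by index label might look like the sketch below. The labels and values are hypothetical.

```python
import pandas as pd

sr = pd.Series([90, 85, 70], index=["kim", "lee", "park"])

one = sr["lee"]                # a single label
several = sr[["kim", "park"]]  # a list of labels
span = sr["kim":"lee"]         # a label slice includes the end label

print(one)
print(several)
print(span)
```

Note that, unlike a slice of integer positions, a label slice includes its end label.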



9. Basic Arithmetic Operation for Series Object (+, -, /, *, ...)


pandas performs arithmetic between data objects in three built-in stages. First, the data are aligned
by row and column index. Second, the elements at the same index (location) are paired up and the
operation is applied to each pair. Third, elements that could not be paired up are set to NaN.



9.1. Arithmetic operation for single series object

Line 7
• All of the elements in the sr Series object are multiplied by 2
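A minimal sketch of this element-wise operation, assuming a Series named sr as in the slide (the values are placeholders):

```python
import pandas as pd

sr = pd.Series([1, 2, 3])

# The scalar is applied to every element; the index is unchanged.
doubled = sr * 2
print(doubled)
```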



9.2. Operation with index for 2 series objects
If an index exists in only one of the two Series, the operation result for that index is NaN.

Line 8
• Check how each element value changes from the addition of s1 and s2.
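The alignment behavior can be sketched as follows. The names s1 and s2 follow the slide, but the labels and values are assumptions.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Only "b" and "c" exist in both Series; "a" and "d" become NaN.
total = s1 + s2
print(total)
```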


10. Pandas Provides Many Accessible Descriptive Statistics


The descriptive statistics methods listed below are available for both Series objects and DataFrame
objects.
count() Number of non-null observations
sum() Sum of values
mean() Mean of values
median() Median of values
mode() Mode of values
std() Standard deviation of the values
min() Minimum value
max() Maximum value
abs() Absolute value
prod() Product of values
cumsum() Cumulative sum
cumprod() Cumulative product


Line 10
• When printed, the elements at indices 3 and 4 are NaN (missing) data.


Line 12
• Returns the number of elements except for missing data.


Line 13
• Returns sum of the elements except for missing data.
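The handling of missing data by count() and sum() can be illustrated with a small Series whose elements at indices 3 and 4 are NaN. The values are placeholders, not the slide's data.

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan])

n_valid = sr.count()  # counts only the non-missing elements
total = sr.sum()      # NaN values are skipped by default

print(n_valid, total)
```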


11. Basics of the Data Visualization Tool Matplotlib


11.1. Drawing a line plot
Calling plt.plot(x-axis data, y-axis data) with the Series index as the x-axis data and the data
element of each index as the y-axis data draws a line plot.
A graph can also be drawn by passing the object directly: plt.plot(Series object or DataFrame object).
plt.plot(x-axis, y-axis)
Index → X-axis
Data → Y-axis


Line 2
• If no separate x values are given, the given values are treated as the sequence of y-axis values.


From the chart above, the x-axis runs from 0 to 3 and the y-axis from 1 to 4. This is because when a
single list or array is provided for plotting, matplotlib assumes that it is the sequence of y values and
automatically generates the x values. Python ranges start from 0, so the default x vector has the same
length as y but starts from 0. Thus, the x data become [0, 1, 2, 3].
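The automatically generated x values can be checked with a short sketch. The Agg backend line is an assumption added so that the example runs without a display; it is not part of the slides.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

plt.figure()
# Only y values are given, so matplotlib generates x = 0, 1, 2, 3.
(line,) = plt.plot([1, 2, 3, 4])

print(line.get_xdata())
print(line.get_ydata())
```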


Line 2
• Assigns both x-axis and y-axis


Line 1, 4, 6, 8
• 1: matplotlib is an external module that needs to be installed.
• 4: uses list comprehension.
• 6: linewidth is a property that adjusts the thickness of the line, while color designates a
specific color.
• 8: adds the title of the chart.


Line 9, 10, 13
• 9: adds the name of the x-axis of the chart.
• 10: adds the name of the y-axis of the chart.
• 13: plt.show() displays the graph.
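The steps described in these line notes can be sketched roughly as below. The original slide code is not available, so the color, line width, and label texts are assumptions, and the Agg backend line is added only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

x = list(range(100))
y = [v ** 2 for v in x]  # list comprehension

plt.figure()
plt.plot(x, y, linewidth=2, color="red")  # assumed style values
plt.title("y = x ** 2")
plt.xlabel("x")
plt.ylabel("y")
ax = plt.gca()
# In an interactive session, plt.show() would display the figure here.
```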


‣ This is the result of drawing the curve y = x ** 2 for x in the range [0, 99].



11.2. Drawing a bar plot
A bar plot represents each data value as a rectangular bar whose length is proportional to the value.
Data sizes can be compared through the relative difference in bar length. There are two bar types:
vertical and horizontal.

Vertical bar: plt.bar()
- Suitable for showing differences between data values at points that are separated in time.
- In other words, it is suitable for expressing time series data.

Horizontal bar: plt.barh()
- Suitable for showing differences between the sizes of variable values.
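A minimal sketch contrasting the two functions follows. The categories and values are invented for illustration, and the Agg backend line is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

# Vertical bars: values at successive points in time (placeholder data).
plt.figure()
vertical = plt.bar(["2019", "2020", "2021"], [3, 5, 4])

# Horizontal bars: values compared across variables (placeholder data).
plt.figure()
horizontal = plt.barh(["A", "B", "C"], [7, 2, 5])
```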


Unit 35. Pandas Series for Data Processing

Let’s code


Step 1
1) Consider which libraries are required to solve the mission. Find the necessary libraries, install
any that you don’t have, and import the modules in your code.
‣ NumPy for making arrays during data processing
‣ Pandas for creating Series objects
‣ Matplotlib for data visualization

2) Convert each of the three dictionaries into a Series object.
‣ Make sure to check the index labels of each Series object before operating on them, because if the
index labels or the number of data elements differ, the operation may return NaN.

Line 1~3
• 1: population_2020 is a dictionary that stores the total population of each city in 2020.
• 2: population_2021 is a dictionary that stores the total population of each city in 2021.
• 3: city_area is a dictionary that stores the area of each city.
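The conversion might be sketched as follows. The dictionary names come from the notes above, while the city names and numbers are placeholders rather than real statistics; s_2021 is an assumed name modeled on s_2020 and s_area, which appear later in the unit.

```python
import pandas as pd

# Placeholder data, not real population or area figures.
population_2020 = {"CityA": 1000, "CityB": 800, "CityC": 500}
population_2021 = {"CityA": 990, "CityB": 820, "CityC": 510}
city_area = {"CityA": 50, "CityB": 40, "CityC": 25}

# Dictionary keys become index labels; values become data elements.
s_2020 = pd.Series(population_2020)
s_2021 = pd.Series(population_2021)
s_area = pd.Series(city_area)

# The labels should match before doing arithmetic between the Series.
print(s_2020.index.equals(s_2021.index))
```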


Step 2
Use arithmetic between the two Series to calculate the population change of each city between the two years.


Line 1~2
• 1: Creates a new Series object by calculating the population change of each city (index
label) through subtraction (-) between the two Series.
• 2: Checks the data of the new Series object that stores the result of the operation.
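A sketch of the subtraction, using placeholder numbers. The name growth is taken from Step 3; s_2021 is an assumed name.

```python
import pandas as pd

# Placeholder data, not real population figures.
s_2020 = pd.Series({"CityA": 1000, "CityB": 800, "CityC": 500})
s_2021 = pd.Series({"CityA": 990, "CityB": 820, "CityC": 510})

# Elements are paired by index label before subtracting.
growth = s_2021 - s_2020
print(growth)
```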


Step 3
As previously explained, when visualizing data with a bar graph, a vertical bar graph is suitable for
time series data that have consecutive values, but a horizontal bar graph is suitable here because it
shows the difference in data between each variable (city name).
The x-axis of the graph is the index label of the Series object, and the y-axis consists of the data elements.
To draw a chart of the Series object 'growth', save the index information as the x-axis data and the
data values as the y-axis data.

Line 1~2
• 1: Extracts index value only.
• 2: Extracts data elements only.

When creating a bar graph, labels on the x-axis sometimes overlap if there are many of them. To
solve this problem, adjust the graph size so that the labels on the x-axis can be read clearly.
Use the figsize=(width, height) parameter of the plt.figure() function, given in inches, to adjust the
chart size. If no size is designated, the default chart size is 6.4 by 4.8 inches.
Use the plt.xticks() function to adjust the angle at which the labels are printed.

• rotation = the angle, in degrees, by which the labels are rotated counterclockwise.

• size = font size


Line 1, 3, 4, 5
• 1: Figure dimension (width, height) in inches.
• 3: A number that signifies an angle can be entered instead of 'vertical'. rotation=90 rotates
the labels 90 degrees counterclockwise.
• 4: size=10 refers to font size.
• 5: Use plt.barh() to plot a horizontal bar plot.
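Put together, the chart setup might look like the sketch below. The figure size and data are assumptions, the call order differs slightly from the slide's line numbers, and the Agg backend line is added only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data standing in for the 'growth' Series.
growth = pd.Series({"CityA": -10, "CityB": 20, "CityC": 10})
x = growth.index   # index labels
y = growth.values  # data elements

fig = plt.figure(figsize=(10, 5))  # width, height in inches (assumed)
bars = plt.barh(x, y)              # horizontal bar plot
plt.xticks(rotation=90, size=10)   # rotate tick labels, set font size
```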


‣ Plotting the data for the cities with the largest populations shows that the population growth rate
decreased in Tokyo.


Step 4
Let’s calculate population density.
The Series objects used in the calculation are s_2020, which holds the 2020 population data, and
s_area, which holds the area of each city.
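Population density is population divided by area, so the calculation can be sketched as a division between the two Series. The numbers are placeholders, and density is an assumed variable name.

```python
import pandas as pd

# Placeholder data, not real population or area figures.
s_2020 = pd.Series({"CityA": 1000, "CityB": 800, "CityC": 500})
s_area = pd.Series({"CityA": 50, "CityB": 40, "CityC": 25})

# Element-wise division, paired by index label.
density = s_2020 / s_area
print(density)
```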


Line 1~2
• 1: Extracts index values only.
• 2: Extracts data elements only.


Line 1, 5, 7, 8, 9
• 1: Figure dimension (width, height) in inches.
• 5: Use color parameter to designate color of the graph.
• 7: Adds the title of the chart.
• 8: Adds the name of x-axis of the chart.
• 9: Adds the name of y-axis of the chart.


Step 5
Find the cities with the highest and lowest population density by using the descriptive statistics
functions of the pandas Series object.

count() Number of non-null observations
sum() Sum of values
mean() Mean of values
median() Median of values
mode() Mode of values
std() Standard deviation of the values
min() Minimum value
max() Maximum value
abs() Absolute value
prod() Product of values
cumsum() Cumulative sum
cumprod() Cumulative product


Line 1
• Finds the minimum value among the entire data element values of the series object

Line 1
• Finds the maximum value among the entire data element values of the series object
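A short sketch of finding the extremes follows. min() and max() return the values; idxmin() and idxmax(), which the slides do not mention, are one way to get the matching city labels. The density values are placeholders.

```python
import pandas as pd

# Placeholder density values, not real figures.
density = pd.Series({"CityA": 20.0, "CityB": 12.5, "CityC": 30.0})

lowest = density.min()
highest = density.max()

# Index labels (city names) of the minimum and maximum values.
lowest_city = density.idxmin()
highest_city = density.idxmax()

print(lowest_city, lowest)
print(highest_city, highest)
```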

Unit 35. Pandas Series for Data Processing

Pair programming


Pair Programming Practice


Guideline, mechanisms & contingency plan
Preparing pair programming involves establishing guidelines and mechanisms to help students pair
properly and to keep them paired. For example, students should take turns “driving the mouse.”
Effective preparation requires contingency plans in case one partner is absent or decides not to
participate for one reason or another. In these cases, it is important to make it clear that the active
student will not be punished because the pairing did not work well.

Pairing similar, not necessarily equal, abilities as partners


Pair programming can be effective when students of similar, though not necessarily equal, abilities
are paired as partners. Pairing mismatched students often can lead to unbalanced participation.
Teachers must emphasize that pair programming is not a “divide-and-conquer” strategy, but rather a
true collaborative effort in every endeavor for the entire project. Teachers should avoid pairing very
weak students with very strong students.
Motivate students by offering extra incentives
Offering extra incentives can help motivate students to pair, especially with advanced students.
Some teachers have found it helpful to require students to pair for only one or two assignments.



Prevent collaboration cheating
The challenge for the teacher is to find ways to assess individual outcomes, while leveraging the
benefits of collaboration. How do you know whether a student learned or cheated? Experts
recommend revisiting course design and assessment, as well as explicitly and concretely discussing
with the students the behaviors that will be interpreted as cheating. Experts encourage teachers to
make assignments meaningful to students and to explain the value of what students will learn by
completing them.
Collaborative learning environment
A collaborative learning environment occurs anytime an instructor requires students to work together
on learning activities. Collaborative learning environments can involve both formal and informal
activities and may or may not include direct assessment. For example, pairs of students work on
programming assignments; small groups of students discuss possible answers to a professor’s question
during lecture; and students work together outside of class to learn new concepts. Collaborative
learning is distinct from projects where students “divide and conquer.” When students divide the
work, each is responsible for only part of the problem solving and there are very limited opportunities
for working through problems with others. In collaborative environments, students are engaged in
intellectual talk with each other.


Q1. The Spotify dataset has various column values other than the artist name, ranking, and
popularity that we used in the mission. Discuss with your classmate to create a special playlist
that is made with other column values.

Unit 36.

Pandas DataFrame
for Data Processing

Unit learning objective (1/3)

Learning objectives

 Be able to explain the structure in terms of a DataFrame when a two-dimensional data structure
is given.
 Be able to reliably convert a dictionary object into a DataFrame.

 Be able to rename rows, columns, and indexes in a DataFrame.

 Be able to create a DataFrame by importing CSV, Excel, or JSON data from an external file.

 Be able to create DataFrames by importing existing datasets through the API.

 Be able to edit row and column information for the created DataFrame.

Unit learning objective (2/3)

Learning overview

 Understand the basic structure of a DataFrame.


 Learn how to edit the names of rows, columns, and indexes in a DataFrame.
 Learn how to convert various data objects into DataFrames.
 Learn how to obtain data from a remote data service using the DataReader API.
 Learn how to delete, add, and merge rows, columns, and data elements of a DataFrame.

Concepts you will need to know from previous units

 Understanding Pandas Series Objects

 Understanding of two types of data structures in Pandas

 How to use Matplotlib library

Unit learning objective (3/3)

Keywords

DataFrame    Pandas I/O    Rows in DataFrame

Columns in DataFrame    Data Elements in DataFrame    DataReader API

Unit 36. Pandas DataFrame for Data Processing

Mission


“Import external data to find and sort data elements you want.”
Search and download the 2019 Most Streamed Tracks data set from the global streaming service Spotify
from Kaggle. Find out which artist has the most popular songs based on this data set!
Also, check whether the tempo of music is correlated with popularity through scatter plot data
visualization!
Finally, create a tracklist of your favorite artist and share it with your learning colleagues by creating a
playlist in Excel!

Unit 36. Pandas DataFrame for Data Processing

Key concept


1. DataFrame
A DataFrame is a two-dimensional labeled data structure made up of rows and columns. (A Series
differs from a DataFrame in that it is a one-dimensional array.)
Simply put, the tabular data and Excel spreadsheets that are commonly encountered in data analysis
have the DataFrame format.
Unlike a Series, which can have only one value per index, a DataFrame can have multiple values per
index label. Each Series becomes one column of the DataFrame.

https://www.geeksforgeeks.org/creating-a-pandas-dataframe/

Terms commonly used in data statistics and data science fields are summarized below.

Term         Description
DataFrame    The most basic spreadsheet-like data structure in statistics and machine learning
             models.
Feature      Typically, each column in the table is a feature. Similar terms include attributes and
             predictors.
Outcome      The goal of most data science projects is to predict some outcome. The features are
             used for prediction. Similar terms include dependent variable, response, goal, and
             output.
Record       Typically, each row in a table represents one record. Similar terms include recorded
             value, case, example, observation, pattern, sample, etc.
In other words, using the above terms, a DataFrame can be defined as a two-dimensional matrix
consisting of rows that each represent a record (case) and columns that each represent a feature
(variable).
To process a DataFrame object effectively, a column can be designated as the index. pandas also
supports multi/hierarchical indexes, so more complex data processing can be done effectively.
Please visit https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe to read
more about DataFrame definitions and uses from a Python and pandas perspective.


2. Creating a DataFrame
Multiple one-dimensional arrays of the same length are combined to form a DataFrame. The same
length means that the number of data elements is the same. In other words, multiple Series can be
combined to create one DataFrame.
Also, because several Series can be grouped together in a dictionary, creating a DataFrame can also be
understood as "converting a dictionary into a DataFrame".

2.1. DataFrame creation basics
pandas.DataFrame(The dictionary object you want to convert)

Key → column

List data → element


Line 3
• Convert a dictionary called data to a DataFrame
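A minimal sketch of the conversion follows. The dictionary name data comes from the note above; its keys and values are assumptions.

```python
import pandas as pd

# Each key becomes a column name; each list becomes that column's data.
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}

df = pd.DataFrame(data)
print(df)
```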


Line 2, 3
• 2: Store in series object s
• 3: Store in another series object s2


Line 5, 7, 8
• 5: Combine the two Series in a dictionary, assigning the key names one and two.
• 7: When the DataFrame is output, the key names of the dictionary become the column
names.
• 8: NaN occurs because the number of elements does not match between the two Series.
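The NaN behavior can be sketched as below. The names s, s2, one, and two follow the notes above, while the labels and values are placeholders.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["a", "b"])  # one element fewer

# Dictionary keys become the column names.
df = pd.DataFrame({"one": s, "two": s2})

# Row "c" has no value in column "two", so it is filled with NaN.
print(df)
```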
2.2. Set the name of the row (index) / column (columns) of the DataFrame
You can set the row and column names separately when creating a DataFrame or after creating it.
First, understand the structure and naming rules of DataFrames through the images below.
As described in the image, each column of a DataFrame is a Series object, and these Series objects
are combined into a matrix structure on the basis of their shared row index.

• Change all row indexes: DataFrame object name.index = Array of row indexes to change
• Change all column names (columns name): DataFrame object name.columns = Array of new
column names


Line 2, 3, 4, 6, 7, 9, 11
• 2: It is a dictionary data type.
• 3: Create a DataFrame object with the name df by assigning a dictionary.
• 4: Check the DataFrame created before the name change.
• 6: Change to the new index names.
• 7: Check the result of the index change.
• 9: Change to the new column names.
• 11: Check the result of the column name change.

As the execution result shows, changing the index or column names in this way modifies the
DataFrame itself: the previous names are not kept, and the DataFrame now carries the new index
and column names.
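A sketch of replacing all row and column names by assignment follows. The name df comes from the notes above; the data and the new names are assumptions.

```python
import pandas as pd

data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
df = pd.DataFrame(data)

# Assigning to .index or .columns changes df itself.
df.index = ["row1", "row2", "row3"]
df.columns = ["first", "second"]

print(df)
```

The new array must have the same number of entries as there are rows or columns.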

• Select part of a row and rename it: DataFrame object name.rename(index={existing index:
index to be replaced with new one, ...})
• Select part of a column and rename it: DataFrame object name.rename(columns={Existing
column name: new column name, ...})


Line 4, 6
• 4: Check the DataFrame created before the name change.
• 6: Change index 0 to a new name, new index0.


Line 1
• Change the column name col2 to the new name new col2.

The result of executing the code above differs from changing the entire index or all column names
by assignment: rename() returns a new DataFrame object rather than changing the original.
Use inplace=True to change the original object itself.


Line 4, 5
• 4: Make changes to the original DataFrame object itself through the inplace = True option.
• 5: Although only the column name has been changed, you can see that the original object itself has
been changed through the code above.
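The difference between the returned copy and an in-place change can be sketched as follows. The names col2, new col2, and new index0 follow the notes above; the data are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Without inplace, rename() returns a new object; df is untouched.
renamed = df.rename(columns={"col2": "new col2"})
unchanged_columns = list(df.columns)

# With inplace=True, df itself is modified and nothing is returned.
df.rename(index={0: "new index0"}, inplace=True)

print(renamed)
print(df)
```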


One More Step


‣ An alternative way to convert a dictionary to a DataFrame is to use .from_dict().
• A dictionary of array-like objects
• dictionary

pd.DataFrame.from_dict(dictionary object to convert, orient = 'columns')

‣ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html#pandas-dataframe-from-dict
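The effect of the orient parameter can be sketched as below. The dictionary contents are placeholders.

```python
import pandas as pd

data = {"a": [1, 2, 3], "b": [4, 5, 6]}

# orient="columns" (the default): keys become column names.
by_cols = pd.DataFrame.from_dict(data, orient="columns")

# orient="index": keys become row labels instead.
by_rows = pd.DataFrame.from_dict(data, orient="index")

print(by_cols)
print(by_rows)
```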


3. Converting various data objects to DataFrame


3.1. The pandas I/O API
Converting other data, such as a dictionary or a file, into a DataFrame is sometimes described as
inputting data to the DataFrame. Data is input and converted using the various IO tools (text, CSV,
HDF5, …) of pandas.
Ex Use the top-level "read" functions, such as pandas.read_csv(), which convert a file to a
DataFrame object.
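As a small illustration, pandas.read_csv() can be tried without a file on disk by reading from an in-memory text buffer. The column names and rows here are invented; with a real file, a path string would be passed instead.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for a file on disk.
csv_text = "artist,popularity\nA,90\nB,85\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```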

The list of IO tools organized in the pandas official documentation is as follows.
(https://pandas.pydata.org/docs/user_guide/io.html)

Format Type   Data Description        Reader           Writer
text          CSV                     read_csv         to_csv
text          Fixed-Width Text File   read_fwf
text          JSON                    read_json        to_json
text          HTML                    read_html        to_html
text          LaTeX                                    Styler.to_latex
text          XML                     read_xml         to_xml
text          Local Clipboard         read_clipboard   to_clipboard
binary        MS Excel                read_excel       to_excel
binary        OpenDocument            read_excel
binary        HDF5 Format             read_hdf         to_hdf
binary        Feather Format          read_feather     to_feather
binary        Parquet Format          read_parquet     to_parquet
binary        ORC Format              read_orc
binary        Stata                   read_stata       to_stata
binary        SAS                     read_sas
binary        SPSS                    read_spss
binary        Python Pickle Format    read_pickle      to_pickle
SQL           SQL                     read_sql         to_sql
SQL           Google BigQuery         read_gbq         to_gbq



3.2. About file paths
One of the most common mistakes beginners make when learning Python for data processing is
entering the wrong path to the file that holds the data.
File paths may seem like basic computing knowledge, but it is helpful to review them once.
The path of a file on the local computer can be expressed in two ways.

‣ Absolute path: This method writes out the entire path from the top-level starting point (the root
of the OS) to the file.

Ex On a Windows system, a file "sample.txt" on the desktop has the absolute path
C:\Users\UserID\Desktop\sample.txt. On any operating system (OS), not only Windows, an absolute
path contains every directory passed through from the top-level root to the file.
‣ Relative path: It is the "location of the file you want to find" based on "the current location".

/ : Move to the top-level directory (root)

./ : The current directory; can be omitted
../ : Parent directory of the current location (the file being created)
../../ : Directory two levels higher
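The distinction between the two kinds of path can be checked from Python with the standard pathlib module. This is only a sketch: the directory and file names are hypothetical and do not need to exist.

```python
from pathlib import Path

# A relative path, interpreted from the current working directory.
relative = Path("data") / "sample.txt"

# The same location written out from the root of the file system.
absolute = relative.resolve()

# "../" refers to the parent of the current directory.
in_parent = Path("..") / "sample.txt"

print(relative.is_absolute())  # False
print(absolute.is_absolute())  # True
print(in_parent)
```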

Path Description
<img src="picture.jpg"> The "picture.jpg" file is located in the same folder as the current page
<img src="images/picture.jpg"> The "picture.jpg" file is located in the images folder inside the current folder
<img src="/images/picture.jpg"> The "picture.jpg" file is located in the images folder at the root of the current website
<img src="../picture.jpg"> The "picture.jpg" file is located in the folder one level up from the current folder
Why do we need relative paths?
‣ An absolute path is a static string that tells you exactly where a file is located on a specific computer.
However, when dealing with paths, this static nature can be a disadvantage.

Ex What if the location of test.txt changes frequently, or the work has to run on operating systems
with different root directories? Because absolute paths are static, in the first case every document
written with absolute paths has to be rewritten, and in the second case a separate absolute path has
to be created and managed for each OS.


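Figure: A Linux directory tree with the top-level directories boot, bin, home, etc, lib, and usr. Starting from the current location ("You are here"), the file you want to access (my_script.sh) can be reached either by its absolute path from the root or by a relative path from the current directory.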


3.3. dataframe.head() for data review
head() returns the first n rows of a DataFrame. It is mainly used to check the intermediate result of
the DataFrame while data is being processed.
We cover it now because it is the method used most often when handling the DataFrames that we will
learn about soon.

DataFrame.head(n)

• With no argument, head() returns the first 5 rows of a DataFrame named df. If no number is entered,
only 5 rows are output.
• With a number n, head(n) returns the first n rows from the top.
• With a negative number such as -3, head() returns all rows except the last 3.
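
The three cases of head() can be sketched as follows. The DataFrame here is made-up sample data used only for illustration; any DataFrame behaves the same way.

```python
import pandas as pd

# Made-up sample data with 10 rows, for illustration only.
df = pd.DataFrame({"id": range(1, 11), "score": range(10, 110, 10)})

first_five = df.head()            # no argument: the first 5 rows
first_three = df.head(3)          # the first 3 rows
all_but_last_three = df.head(-3)  # every row except the last 3

print(len(first_five), len(first_three), len(all_but_last_three))  # 5 3 7
```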


TIP

• A DataFrame is a two-dimensional data object. With df.shape, you can check the size of the
DataFrame along each dimension.
• It returns the result as a tuple in the form (number of rows, number of columns).

 Other functions for reviewing data

• Use the len() function to get the number of rows.
• Print the column labels with the columns attribute.
• With the values attribute, only the values of the DataFrame are output. In practice, it is rarely used.
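
A short sketch of these review tools on a small made-up DataFrame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({"name": ["A", "B", "C"], "age": [22, 35, 58]})

n_rows, n_cols = df.shape         # tuple: (rows, columns)
row_count = len(df)               # number of rows
column_names = list(df.columns)   # column labels
values = df.values                # the underlying values as a NumPy array

print(df.shape)       # (3, 2)
print(row_count)      # 3
print(column_names)   # ['name', 'age']
```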


4. Converting a csv file to a DataFrame


4.1. Reading a file and creating a DataFrame object
In a csv file, data is separated by commas (,), hence the name csv (comma-separated values).
‣ Separate columns with commas
‣ Separate rows with newlines

pandas.read_csv("Path")

1) Default
‣ Download the Titanic Survivor data file from https://www.kaggle.com/c/titanic/data
• Enter the path of the downloaded file as a relative path.
• Output only the first 5 rows from the top and review them.
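
A minimal sketch of the default usage. So that it runs without the downloaded file, it reads a small made-up CSV text through io.StringIO; the column names and values are assumptions for illustration. With the real file, the relative path (for example "./data/titanic/test.csv") would be passed instead.

```python
import io
import pandas as pd

# Small made-up CSV text standing in for a downloaded file.
csv_text = """PassengerId,Name,Age
1,Kim,22
2,Lee,35
3,Park,58
4,Choi,41
5,Jung,19
6,Han,63
"""

# read_csv accepts a path or a file-like object.
titanic = pd.read_csv(io.StringIO(csv_text))

# Review only the first 5 of the 6 rows.
print(titanic.head())
```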

TIP

• In addition to the data itself, a DataFrame lets you look up metadata such as the column types, the
number of null values, and the data distribution. Use info() and describe() for this.

• info() is a method that shows the total number of rows and the data type of each column of the
imported DataFrame. When preprocessing data, make it a habit to check with this method first.

• RangeIndex: You can see the number of rows and the range of the DataFrame index (up to 417 here),
along with the number of columns (11).
• Dtype: The data type of each column (object can be thought of as a string).
• The output ends with a summary of all column information.

• With describe(), you can roughly check the data distribution of the imported DataFrame.
Ex When using this data in machine learning, knowing the distribution of the data is an important
first step toward improving performance. Make it a habit to use this method to get a rough picture
of the distribution.
mean Average value of the column
std Standard deviation
min Minimum value
max Maximum value
25% 25th percentile value
50% 50th percentile value
75% 75th percentile value
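
The two review methods can be sketched on a small made-up DataFrame (column names and values are assumptions for illustration):

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "name": ["Kim", "Lee", "Park", "Choi"],
    "age": [20, 30, 40, 50],
    "fare": [10.0, 20.0, 30.0, 40.0],
})

# info() prints the index range, column dtypes, and non-null counts.
df.info()

# describe() returns summary statistics; by default only numeric columns are included.
stats = df.describe()
print(stats.loc[["mean", "min", "50%", "max"]])
```

Because describe() returns a DataFrame, single statistics can be picked out with loc, for example stats.loc["mean", "age"].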

Notes when loading a csv file into a DataFrame


‣ Be sure to open the file and check its structure before loading it in code: check how the data set
is organized and set the appropriate parameters.
‣ Most csv files use a comma (,) as the delimiter to separate data. However, some files are
separated by tabs or other characters.
• In this case, set the sep parameter: use sep='\t' for tabs, or sep='|' if the file delimiter is '|'.
‣ To read only part of a file, enter the number of rows to read in the nrows parameter.
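
A minimal sketch of the sep and nrows parameters. The delimited texts are made up for illustration and are read through io.StringIO instead of from files.

```python
import io
import pandas as pd

# The same made-up data, once tab-separated and once pipe-separated.
tab_text = "name\tage\nKim\t22\nLee\t35\nPark\t58\n"
pipe_text = "name|age\nKim|22\nLee|35\nPark|58\n"

df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")

# nrows limits how many data rows are read.
df_first_two = pd.read_csv(io.StringIO(pipe_text), sep="|", nrows=2)

print(df_tab.equals(df_pipe))  # True
print(len(df_first_two))       # 2
```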

2) skiprows
‣ When importing a file, it specifies how many lines at the start of the file to skip. It can also be
set as a list containing the line numbers (counted from 0) to skip.
Ex [1, 4, 7]
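
Both forms of skiprows can be sketched as follows. The text, including its two leading note lines, is made up for illustration.

```python
import io
import pandas as pd

# Made-up file content: two note lines, then the header and three data rows.
text = "exported by tool\nversion 1\nname,age\nKim,22\nLee,35\nPark,58\n"

# An integer skips that many lines from the top of the file.
skip_count = pd.read_csv(io.StringIO(text), skiprows=2)

# A list skips exactly the listed line numbers (0-based): the two notes and "Lee,35".
skip_list = pd.read_csv(io.StringIO(text), skiprows=[0, 1, 4])

print(skip_count["name"].tolist())  # ['Kim', 'Lee', 'Park']
print(skip_list["name"].tolist())   # ['Kim', 'Park']
```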

3) header
‣ Check the csv file before loading it in code, and then set the row you want to use as the column
names. By default (header=0), the file is loaded as it is, and the row at index 0 of the csv becomes
the column header.

titanic = pd.read_csv("./data/titanic/test.csv", header=0)

header = 0 (default)
header = 2
header = None
header = None, names = ['a','b','c','d','e','f','g','h','i','j','k']

‣ When should header=None be used? Use it when, after checking the csv, you find that the first row
already contains data rather than column names. In this case, you can specify the column names with
the names parameter.
• names = [List to use as column names]
• With the header parameter, you can specify the row that becomes the column names.
• If names contains more column names than the data has columns, the extra columns are filled with NaN.
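
The header and names options can be sketched on two small made-up texts, one with a header row and one without. The column names are assumptions for illustration.

```python
import io
import pandas as pd

text_with_header = "name,age\nKim,22\nLee,35\n"
text_without_header = "Kim,22\nLee,35\n"

# Default (header=0): the first row becomes the column names.
default_header = pd.read_csv(io.StringIO(text_with_header))

# header=None: no row is used as a header; columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(text_without_header), header=None)

# header=None with names: supply the column names yourself.
named = pd.read_csv(io.StringIO(text_without_header), header=None,
                    names=["name", "age"])

# More names than data columns: the extra column holds only NaN.
extra = pd.read_csv(io.StringIO(text_without_header), header=None,
                    names=["name", "age", "city"])

print(list(no_header.columns))     # [0, 1]
print(extra["city"].isna().all())  # True
```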

4) encoding
‣ It can be used both when reading and when writing a file.
‣ Encoding can be understood as the process of representing characters outside the ASCII range
(which covers 0 to 127), such as Hangul, by using additional bytes so that the computer can handle them.
There are several different encoding schemes.
‣ The encoding parameter specifies the text encoding to use when loading a csv file.
Ex encoding = 'utf-8'

‣ Even with this option, the imported text may still look broken. In this case, the easiest solution
is to save the file from Excel with utf-8 as the encoding.
• See https://docs.python.org/3/library/codecs.html#standard-encodings for Python standard encoding
information.
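
A minimal sketch of the encoding parameter. The bytes below are made-up Hangul sample data encoded as cp949, one of the codecs in the Python standard encodings list; io.BytesIO stands in for a file saved in that encoding.

```python
import io
import pandas as pd

# Made-up Hangul data, encoded with cp949 instead of utf-8.
raw_bytes = "이름,나이\n김철수,22\n이영희,35\n".encode("cp949")

# Tell read_csv which encoding the bytes use so the text is decoded correctly.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="cp949")

print(list(df.columns))  # ['이름', '나이']
```

If the encoding passed to read_csv does not match the one used to save the file, the text typically appears broken or the read fails with a decoding error.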

5) index_col
‣ Specifies the column to be used as the row index (row labels).
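
A short sketch of index_col on made-up CSV text (the column names and values are assumptions for illustration):

```python
import io
import pandas as pd

text = "PassengerId,Name,Age\n892,Kim,22\n893,Lee,35\n"

# Use the PassengerId column as the row index instead of the default 0, 1, 2, ...
df = pd.read_csv(io.StringIO(text), index_col="PassengerId")

print(df.index.name)         # PassengerId
print(df.loc[893, "Name"])   # Lee
```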


5. Converting Excel file to DataFrame


5.1. Reading a file and creating a DataFrame object
Rows and columns in Excel correspond one-to-one to rows and columns in a DataFrame.

pandas.read_excel(Path, sheet_name = 0 , header= 1)

‣ Download data on NBA players (American professional basketball) from
https://www.kaggle.com/justinas/nba-players-data
‣ Because an Excel file can contain several sheets, you can select and load a specific sheet of the
file to be imported.
‣ Use the sheet_name parameter for this. It accepts a string with the name of the actual sheet, an
integer position starting from 0, or a list of these.
Ex If sheet_name = [0,1,2,"names"] is specified, the first, second, and third sheets and the sheet
named "names" are all loaded when the Excel file is imported.

• header specifies the position of the row to be used as the column names.
• Output only the first 5 rows from the top and review them.

Warning

• To use the read_excel function, an optional dependency must be installed in advance. The xlrd package
reads the older .xls format; recent pandas versions use the openpyxl package for .xlsx files.

conda install xlrd

As with csv, you can check the various parameters that can be used when loading a file at
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html?highlight=read_excel


6. Converting JSON File to DataFrame


6.1. JSON
JSON (JavaScript Object Notation) is a format created for sharing data. It consists of attribute-value
pairs (also called key-value pairs) and array data types (or any other serializable value).
How to use: "Dataname": value
It contains strings, numbers, arrays, booleans, and other objects, which are the basic data types of
JavaScript. In this way, data can be organized in layers (a hierarchy).
‣ If created as a file, the file extension is ".json"
‣ The Content-Type of an HTTP request is "application/json"

https://www.json.org/json-en.html

6.2. JSON object
JSON data consists of name-value pairs. Multiple pairs are separated by commas (,).

6.3. JSON array
A JSON array holds multiple JSON values separated by commas.
An object is expressed by enclosing it in curly braces ({}). An array is expressed by enclosing it in
square brackets ([]).
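
The difference between a JSON object and a JSON array can be sketched with Python's standard json module. The names and values are made up for illustration.

```python
import json

# A JSON object: name-value pairs inside curly braces.
json_object_text = '{"name": "Kim", "age": 22, "enrolled": true}'

# A JSON array: several values (here two objects) inside square brackets.
json_array_text = '[{"name": "Kim", "age": 22}, {"name": "Lee", "age": 35}]'

person = json.loads(json_object_text)  # becomes a Python dict
people = json.loads(json_array_text)   # becomes a Python list of dicts

print(type(person).__name__, type(people).__name__)  # dict list
print(people[1]["name"])                             # Lee
```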

object: { string : value }
array: [ value ]
value: string, number, object, array, true, false, or null
https://www.json.org/json-en.html

6.4. Reading a file and creating a DataFrame object
pandas.read_json(Path, orient=None)

• A URL can also be used as the path.

Orient parameter
‣ JSON data takes the form of a JSON object or, in its more complex form, a JSON array. It can also be
composed of various hierarchical structures.
‣ Just as we opened the csv or Excel file to check its structure before loading it, open the JSON file
first and check the structure of the string. Then select the orient option that suits the structure.
Rather than memorizing all the cases, a good approach is to experiment by trying the options one by
one. The default value is None.
'columns': {column: {index: value, ...}, ...}
'split': {"index": [index, ...], "columns": [column, ...], "data": [values, ...]}
'records': [{column: value}, ..., {column: value}]
'index': {index: {column: value, ...}, ...}
'values': [values, ...] (just the values array)
Warning
• Remember that what we are learning here is loading a JSON object into a DataFrame using pandas.
This is different from decoding JSON in plain Python, which converts it into Python objects.
Don't confuse the two.
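
The orient options can be tried on small JSON strings. This is a sketch with made-up data; the same rows are written in three of the layouts listed above and read through io.StringIO.

```python
import io
import pandas as pd

# The same two rows in three different JSON layouts (made-up data).
records_text = '[{"name": "Kim", "age": 22}, {"name": "Lee", "age": 35}]'
split_text = ('{"index": [0, 1], "columns": ["name", "age"], '
              '"data": [["Kim", 22], ["Lee", 35]]}')
index_text = '{"0": {"name": "Kim", "age": 22}, "1": {"name": "Lee", "age": 35}}'

from_records = pd.read_json(io.StringIO(records_text), orient="records")
from_split = pd.read_json(io.StringIO(split_text), orient="split")
from_index = pd.read_json(io.StringIO(index_text), orient="index")

print(from_records.shape, from_split.shape, from_index.shape)
```

Choosing the wrong orient for a given structure usually produces either an error or a DataFrame with rows and columns swapped, which is why checking the file first is recommended.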


7. Reading data from remote data service using DataReader API


To use a module, install it first.
‣ pip install pandas-datareader
The pandas-datareader package supports access to a variety of useful data sources. Its most useful
feature is that the data is loaded directly into a DataFrame.
Ex With just two lines of code, you can get 5 years of data on the 10-year constant maturity yield on
U.S. government bonds.

• Import the package.
• Save the data for the symbol GS10 from Federal Reserve Economic Data (FRED) as a DataFrame.
• Visualize the DataFrame as a line graph.

This is the same data set as https://fred.stlouisfed.org/series/GS10. Since the data covers 5 years of
yields, if you want to compare only the data from 2017 onward, set the start of the graph period on
the FRED page to January 1, 2017. Compare the graph we produced with the official FRED graph. Are they
the same?

The following remote data feeds are available. However, depending on the data source, you may need to
register for an API key, so be sure to check the usage notes for each source.
Ex Tiingo is a tracing platform that provides a data API with historical end-of-day prices for stocks,
mutual funds, and ETFs. A free registration is required to obtain an API key. Free accounts are rate
limited and can access a limited number of symbols (500 at the time of writing).
For the technical specifications of each data source, visit
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#remote-data-access to check
detailed information.
• Tiingo
• IEX
• Alpha Vantage
• Econdb
• Enigma
• Quandl
• St.Louis FED (FRED)
• Kenneth French's data library
• World Bank
• OECD
• Eurostat
• Thrift Savings Plan
• Nasdaq Trader symbol definitions
• Stooq
• MOEX
• Naver Finance
• Yahoo Finance

Ex How to use DataReader: Process life expectancy data by country around the world.
The World Bank provides thousands of datasets through DataReader. You can see the World Bank's data
catalog at https://datacatalog.worldbank.org/
Each World Bank dataset is identified by an indicator. Currently, there are about 1,897 indicators.
You can list them with the wb.get_indicators() function.

TIP

• This section shows how to use DataReader with World Bank data, but you can apply the same approach
to other data sources by following the link below to the official documentation. We use indicators
as a sample; as you follow along, compare with the official documentation and think, "This is how I
can find and use it."
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#indicators

• Import the package.
• The wb.get_indicators() function returns the full list of indicators.
• With iloc, you can get only the first 5 indicators from the whole list. We will study iloc in the
section on selecting DataFrame rows and columns.

If you want to know what data an indicator represents, you can search for it in the World Bank data
catalog.

Conversely, you can also search for the indicators you need to get the data you want.
You can search for an indicator with the wb.search() function. Let's practice searching for an
indicator and downloading the corresponding data.

• Use the wb.search() function to search for indicators related to life expectancy.
• Printing all the results gives a long list of search results. For now, let's print only 5 results.
Each indicator is country-specific.

• Use the wb.get_countries() function to get data for all countries.
• From this data, extract only the 'name', 'capitalCity', and 'iso2c' columns.

• You can download data by passing the indicator and the period you want to wb.download().
• In the output, you can see that only the data for Canada, the United States, and Mexico is returned,
not all countries. To download data for every country, use the parameter country='all'.

• You can get data for all countries by using the parameter country='all'.


One More Step


Let's find out which country recorded the lowest average life expectancy in each year, using the
life expectancy data by country.
‣ First, through pivoting, the data is restructured with countries as the index and years as the columns.

• The search uses the life expectancy indicator found in the previous indicator search practice.
• Create a new DataFrame by pivoting.
• If you check the pivoted result, the data is restructured into a country index and a column for
each year as planned.

DataFrame.idxmin(axis=0, skipna=True)
‣ Returns the index of the first occurrence of the minimum over the requested axis.
‣ https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.idxmin.html#pandas-dataframe-idxmin

• You can check which country had the lowest life expectancy in each year.

What do you think of the results? If you look into the history of each country to find out what
happened in that year, you can understand what seriously affects average life expectancy.
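
The pivot and idxmin steps can be sketched without a network connection. The values below are made up for illustration and are not real World Bank data; the column names country, year, and life_expectancy are assumptions as well.

```python
import pandas as pd

# Made-up long-format data: one row per (country, year) pair.
long_df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2001, 2000, 2001, 2000, 2001],
    "life_expectancy": [70.0, 71.0, 65.0, 72.0, 68.0, 60.0],
})

# Pivot: countries become the index, years become the columns.
pivoted = long_df.pivot(index="country", columns="year",
                        values="life_expectancy")

# For each year (column), return the country (index label) with the minimum value.
lowest_by_year = pivoted.idxmin(axis=0)

print(lowest_by_year.to_dict())  # {2000: 'B', 2001: 'C'}
```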


8. Manipulating Rows, Columns, and Elements in a DataFrame


To understand the structure of a DataFrame, it helps to add the concept of the axis of an array.
‣ axis = 0 (index)
• It works in the row direction. The result of the work appears as a row. It is easy to understand if
you imagine books being stacked on top of each other.
‣ axis = 1 (columns)
• It works in the column direction. The result of the work appears as a column. It is like placing
books side by side.
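
One way to see the two axes is to sum a small made-up DataFrame along each of them. With axis=0 the operation runs down the rows and gives one result per column; with axis=1 it runs across the columns and gives one result per row.

```python
import pandas as pd

# Made-up scores, for illustration only.
df = pd.DataFrame({"math": [90, 80], "english": [70, 60]},
                  index=["Kim", "Lee"])

column_totals = df.sum(axis=0)  # one total per column
row_totals = df.sum(axis=1)     # one total per row

print(column_totals.to_dict())  # {'math': 170, 'english': 130}
print(row_totals.to_dict())     # {'Kim': 160, 'Lee': 140}
```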

Let's prepare the data for practice first. In this lesson, we will use the data that we used in the
previous practice of converting JSON to a DataFrame.

• Use the 2021 Worldwide University Links data downloaded from Kaggle.
• Check the contents of the DataFrame. There are 1526 rows with indexes from 0 to 1525, and it
consists of 7 columns.
• Check the data structure by printing only the first 5 rows.

8.1. drop() to delete row, column
• Delete row: DataFrame object name.drop(row index or array (list-like), axis=0,
inplace=False)
• Delete column: DataFrame object name.drop(column name or array (list-like), axis=1,
inplace=False)
As with the rename() method, the inplace parameter determines whether the original data object is
changed when a row or column is deleted. If this parameter is not specified, the default value is
False (the original data is not changed).

• Delete rows by passing a list of their indexes with axis=0:
rank.drop([0,1,2,3,4,5,6,7,8,9,10], axis=0, inplace=False)

• Delete two columns by passing a list of their names with axis=1:
rank.drop(["students staff ratio", "gender ratio"], axis=1, inplace=False)
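
A runnable sketch of both calls on a small stand-in DataFrame. The rows are made up for illustration; only the two column names are taken from the calls above.

```python
import pandas as pd

# Made-up stand-in for the practice data.
rank = pd.DataFrame({
    "title": ["U1", "U2", "U3", "U4"],
    "students staff ratio": [11.0, 7.5, 9.1, 8.4],
    "gender ratio": ["46:54", "34:66", "49:51", "43:57"],
})

# axis=0 removes rows by index label; axis=1 removes columns by name.
without_rows = rank.drop([0, 1], axis=0)
without_cols = rank.drop(["students staff ratio", "gender ratio"], axis=1)

print(list(without_rows.index))    # [2, 3]
print(list(without_cols.columns))  # ['title']
print(rank.shape)                  # (4, 3): the original is unchanged
```

Because inplace defaults to False, each call returns a new DataFrame and rank itself keeps all of its rows and columns.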

8.2. Select row
• Select by index name (index label): loc
• Select by integer position index (integer position): iloc

Line 5
• Select the rows at integer positions 1 through 9 (position 10 is excluded) and store them in a new
DataFrame.
rank.iloc[1:10]

Line 1, 3
• 1: Use rename() to change the names. When a DataFrame that was selected from another DataFrame
is modified with the inplace=True option, pandas may issue a warning (for example,
SettingWithCopyWarning), because it cannot tell whether the object is a view of the original data or
a copy. To prevent this, pandas recommends copying the original to a new DataFrame with the
copy() method and then working on the copy.
• 3: To practice loc, change the integer indexes to string index labels.
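A minimal sketch of the copy-then-rename pattern described above. The data and the label mapping are hypothetical; the name index_position is borrowed from the later slides and is assumed here to be the DataFrame selected with iloc.

```python
import pandas as pd

rank = pd.DataFrame({
    "title": ["Univ A", "Univ B", "Univ C", "Univ D"],
    "score": [95.1, 94.3, 93.8, 92.0],
})

# Take an explicit copy of the selected rows before modifying them.
index_position = rank.iloc[0:3].copy()

# Replace the integer index with string labels, changing the copy in place.
index_position.rename(index={0: "a", 1: "b", 2: "c"}, inplace=True)

print(index_position.index.tolist())
print(rank.index.tolist())  # the original keeps its integer index
```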
Line 5, 7
• 5: You can see that the integer index has been changed to string index labels such as a, b, c,
and so on.
• 7: Select the rows with index labels a and b and store them in a new DataFrame.

df.loc[["a", "b"]]

8.3. Select column
• When selecting one column,
- DataFrame object name["column name"]
- DataFrame object name.column name: Only possible when the column name is a string that is a valid
Python identifier (for example, it contains no spaces).
• When selecting multiple (n) columns, pass a list of column names inside the brackets:
- DataFrame object name[["column name", "column name", "column name", ...]]
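The selection forms listed above return different object types. This sketch uses a hypothetical three-row DataFrame; the column names mirror the ones that appear in the slides.

```python
import pandas as pd

index_position = pd.DataFrame({
    "title": ["Univ A", "Univ B", "Univ C"],
    "location": ["UK", "US", "KR"],
    "gender ratio": ["46:54", "34:66", "50:50"],
})

one_series = index_position["title"]     # single brackets: a Series
one_frame = index_position[["title"]]    # list inside brackets: a DataFrame
by_attr = index_position.location        # attribute access: a Series
several = index_position[["title", "location"]]

# "gender ratio" contains a space, so only bracket access works for it.
ratio = index_position["gender ratio"]

print(type(one_series).__name__, type(one_frame).__name__)
print(several.shape)
```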

Line 7, 9
• 7: Select only one column named title and save it as a new DataFrame
• 9: Select only one column named title and save it as a new DataFrame

Line 11, 13
• 11: This form also selects one column. Compare the output of the two forms: in this one, the
name of the selected column is not printed.
• 13: This form also selects one column, and the name of the selected column is printed.

index_position[["title", "location", "students staff ratio"]]

index_position[["gender ratio"]]

index_position["gender ratio"]

8.4. Select data element
Select an element within the DataFrame by specifying its position as [row, column] coordinates.
• DataFrame object name.loc[row name, column name]
• DataFrame object name.iloc[row number, column number]

Line 5
• Select the data element in row number 7, column number 1.
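A short sketch of element selection with both accessors, assuming a small hypothetical DataFrame with string row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"title": ["Univ A", "Univ B", "Univ C"],
     "score": [95.1, 94.3, 93.8]},
    index=["a", "b", "c"],
)

by_name = df.loc["b", "score"]   # row label, column name
by_number = df.iloc[1, 1]        # row position, column position

# Two or more elements: a list of rows and a range of columns.
block = df.loc[["a", "b"], "title":"score"]

print(by_name, by_number)
print(block)
```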

df.iloc[7, 1]

Line 1, 3
• 1: Select 2 or more elements
• 3: You can also specify the range of column selection.

Unit Pandas DataFrame for Data Processing
36.

Let’s code

Step 1
Think about which libraries are necessary to solve the mission. Find the necessary libraries, install
any that are not yet installed, and load them in the code with import.
‣ Pandas for creating DataFrame objects
‣ Matplotlib for visualizing data
‣ Seaborn, one of the data visualization tools

After searching for and downloading the target data, create a DataFrame.
‣ Download Spotify's 2019 Top Songs list data file from
https://www.kaggle.com/prasertk/spotify-global-2019-moststreamed-tracks

song = pd.read_csv("./data/spotify/spotify_global_2019_most_streamed_tracks_audio_features.csv")

Line 1
• Load the downloaded csv file into the song DataFrame.

song.head(10)

Line 1
• Print only the first 10 rows of the DataFrame on the screen.

song.info()

Line 1
• This method shows the number of rows and the data type of each column of the imported
DataFrame. It is a DataFrame with a total of 1717 rows and 24 columns.

Line 1
• A method to check data distribution
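The code for this slide is not shown. Assuming the method meant here is describe(), which is the usual pandas call for summarizing a distribution, a sketch on hypothetical data looks like this (the column names Rank and Tempo are illustrative only):

```python
import pandas as pd

# Hypothetical miniature of the song table.
song = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Tempo": [98.0, 120.0, 150.0, 110.0],
})

# count, mean, std, min, quartiles, and max for each numeric column.
summary = song.describe()
print(summary)
```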

Step 2
Select only the necessary columns and create a new DataFrame.

Line 1
• Select several columns you want and load them into a new DataFrame called df.

Step 3
Filter only the data with an Artist_popularity of 95 or higher.
In this case, we can filter rows by applying logical operators to column values. A single logical
operator or several logical operators can be used at the same time. Here, we use one logical operator
to keep only the data with an Artist_popularity of 95 or higher.
The comparison value of 95 can be changed as you like during practice. However, the first thing to do
when importing an external file is to open the data and check its structure. Since this data only covers
Spotify's popular tracks of 2019, most of the artists score above 90.
A threshold of 90 or higher is therefore recommended so that the filter still discriminates between artists.
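Row filtering with one condition and with two combined conditions can be sketched as follows. The data is hypothetical; the column name Artist_popularity follows the slides, while Tempo and the threshold of 100 are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Artist": ["A", "B", "C", "D"],
    "Artist_popularity": [96, 91, 88, 95],
    "Tempo": [90, 120, 130, 110],
})

# One condition: keep rows with popularity of 90 or higher.
pop_song = df[df["Artist_popularity"] >= 90]

# Two conditions: wrap each in parentheses and combine with & (and) or | (or).
fast_pop = df[(df["Artist_popularity"] >= 90) & (df["Tempo"] > 100)]

print(pop_song)
print(fast_pop)
```

The parentheses are required because & binds more tightly than the comparison operators.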

Line 1
• Use a logical operator to create a new DataFrame named pop_song that contains only the rows with
a value of 90 or higher in the Artist_popularity column.

Who is the artist with the most songs on the list?

Line 1
• Counts how many times each unique data element (value) occurs and returns the counts as a
Series object.
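The description above matches the behavior of value_counts(). A sketch with hypothetical artist names:

```python
import pandas as pd

pop_song = pd.DataFrame({"Artist": ["A", "B", "A", "C", "A", "B"]})

# Occurrences of each unique value, sorted from most to least frequent.
rank = pop_song["Artist"].value_counts()

# The unique values themselves, in order of first appearance.
names = pop_song["Artist"].unique()

print(rank)
print(type(rank).__name__)
print(names)
```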

Line 1
• An artist named Post Malone has the most songs with 45 songs.

Line 1
• If you check the returned data type, it is Series.

TIP

• If pop_song["Artist"].unique() is used, all the non-overlapping unique values in the corresponding
DataFrame column are returned.

Step 4
Is BTS Really Popular?
If you are not a fan of BTS, you can filter data based on the English name of your favorite artist.
First, search whether the name of the artist you want to find exists in the current dataset.

Line 1
• If the returned result is True, the data exists. Remember that rank is a Series object made up of
the non-overlapping artist names and the number of songs for each.

We saw that the songs of the corresponding artist exist. Let's filter the data to see how many songs and
which songs are included.

Line 1, 3
• 1: pop_song is the DataFrame created in Step 3. Create a new DataFrame by collecting only the
rows with BTS in the Artist column.
• 3: You can see that there are 11 songs in total.
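The two parts of Step 4, the existence check and the filter, can be sketched on hypothetical data. The track names are placeholders, and checking membership through rank.index is one possible way to get the True/False result the slide describes.

```python
import pandas as pd

pop_song = pd.DataFrame({
    "Artist": ["BTS", "Post Malone", "BTS", "Halsey"],
    "Track": ["Song 1", "Song 2", "Song 3", "Song 4"],
})
rank = pop_song["Artist"].value_counts()

# The artist names are the index labels of the rank Series.
exists = "BTS" in rank.index

# Keep only the rows whose Artist value equals "BTS".
bts = pop_song[pop_song["Artist"] == "BTS"]
count = len(bts)

print(exists, count)
print(bts["Track"].tolist())
```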

Line 1
• Song list

Step 5
As the result of Step 3 shows, an American rapper named Post Malone has 45 songs in the dataset. He is
the artist with the most hit songs. Let's build a playlist by making a separate list of Post Malone's songs only.

Line 3
• Reset the index with reset_index().
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
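After filtering, the rows keep their old index labels, which is why the index is reset. A sketch with hypothetical tracks; drop=True is an assumption here and simply discards the old labels instead of keeping them as a column.

```python
import pandas as pd

pop_song = pd.DataFrame({
    "Artist": ["Post Malone", "BTS", "Post Malone"],
    "Track": ["Song 1", "Song 2", "Song 3"],
})

post = pop_song[pop_song["Artist"] == "Post Malone"]
print(post.index.tolist())      # labels inherited from pop_song

# Renumber the rows from 0 and discard the old labels.
playlist = post.reset_index(drop=True)
print(playlist.index.tolist())
```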

Step 6
Expressing the tempo by rank as a scatter plot
‣ A scatterplot is a visual representation of the relationship between two variables.
‣ A scatterplot is used to understand the relationship between two variables x and y. Each observation
is drawn on the coordinate plane as a point (x, y), with x and y as an ordered pair.

https://docs.tibco.com/pub/spotfire_server/7.10.0/doc/html/en-US/TIB_sfire-bauthor-consumer_usersguide/GUID-A8DC822E-35B3-4289-94CB-642DFFE5E88F.html

In a graph, you can see a relationship between two variables if one of them tends to increase (or
decrease) as the other increases.
The relationship is evaluated as strong or weak depending on how closely the points cluster around a
straight line.
What is the positive correlation in the figure below?
Ex If the price of a product falls as a company produces more of it, production and price have a
negative correlation. If sales increase as marketing investment increases, the two are positively
correlated.
Don't get confused! Whether a correlation is positive or negative tells you only its direction, not its
degree (strength).
[Figure: four scatter plots of y against x showing strong positive, weak positive, strong negative, and weak negative correlation]
You can easily plot correlation using Matplotlib's scatter.
‣ https://matplotlib.org/stable/gallery/shapes_and_collections/scatter.html#scatter-plot

‣ When visually checking the graph, there is no clear correlation between the two variables.

pandas.DataFrame.corr(method='pearson')

‣ Pandas provides a function that makes it easy to calculate the correlation coefficient.
‣ This function calculates the pairwise correlation of columns, excluding NA/Null values. The
following three methods can be used with it; pearson is used most often. Only numerical columns are
compared; string-type columns are left out of the calculation (in recent pandas versions, pass
numeric_only=True to exclude them).
‣ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
• pearson : standard correlation coefficient
• kendall : Kendall Tau correlation coefficient
• spearman : Spearman rank correlation
‣ The result is returned as a DataFrame, and the values are interpreted as follows.
• Closer to 1: as one variable increases, the other also increases
• Closer to -1: as one variable increases, the other decreases
• Closer to 0: there is no linear relationship between the two
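The interpretation of the coefficient can be checked on data built to be perfectly correlated. This is a sketch with hypothetical columns; select_dtypes(include="number") is used here as one version-independent way to leave string columns out.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],       # rises together with x
    "down": [5, 4, 3, 2, 1],      # falls as x rises
    "name": ["a", "b", "c", "d", "e"],  # string column, not comparable
})

# Keep only numeric columns, then compute pairwise Pearson correlation.
corr = df.select_dtypes(include="number").corr(method="pearson")
print(corr)
```

In the printed matrix, the x/up entry is at the positive extreme and the x/down entry at the negative extreme, while the name column does not appear at all.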

Line 2
• When you print the correlation coefficients, no variable has a value very close to 1.
Comparing the values with one another, tempo appears to have the least effect.

Line 2
• It is displayed in the form of a heatmap in a matrix.
• Display the color bar.

Unit Pandas DataFrame for Data Processing
36.

Pair programming

Pair Programming Practice


Guideline, mechanisms & contingency plan
Preparing pair programming involves establishing guidelines and mechanisms to help students pair
properly and to keep them paired. For example, students should take turns “driving the mouse.”
Effective preparation requires contingency plans in case one partner is absent or decides not to
participate for one reason or another. In these cases, it is important to make it clear that the active
student will not be punished because the pairing did not work well.

Pairing similar, not necessarily equal, abilities as partners


Pair programming can be effective when students of similar, though not necessarily equal, abilities
are paired as partners. Pairing mismatched students often can lead to unbalanced participation.
Teachers must emphasize that pair programming is not a “divide-and-conquer” strategy, but rather a
true collaborative effort in every endeavor for the entire project. Teachers should avoid pairing very
weak students with very strong students.
Motivate students by offering extra incentives
Offering extra incentives can help motivate students to pair, especially with advanced students.
Some teachers have found it helpful to require students to pair for only one or two assignments.

Prevent collaboration cheating
The challenge for the teacher is to find ways to assess individual outcomes, while leveraging the
benefits of collaboration. How do you know whether a student learned or cheated? Experts
recommend revisiting course design and assessment, as well as explicitly and concretely discussing
with the students the behaviors that will be interpreted as cheating. Experts encourage teachers to
make assignments meaningful to students and to explain the value of what students will learn by
completing them.
Collaborative learning environment
A collaborative learning environment occurs anytime an instructor requires students to work together
on learning activities. Collaborative learning environments can involve both formal and informal
activities and may or may not include direct assessment. For example, pairs of students work on
programming assignments; small groups of students discuss possible answers to a professor’s question
during lecture; and students work together outside of class to learn new concepts. Collaborative
learning is distinct from projects where students “divide and conquer.” When students divide the
work, each is responsible for only part of the problem solving and there are very limited opportunities
for working through problems with others. In collaborative environments, students are engaged in
intellectual talk with each other.

Q1. In the Spotify dataset, there are various column values in addition to the artist's name, rank,
and popularity we used in our mission. Discuss with your learning colleagues how to create special
playlists using different column values.

Ex A playlist of only upbeat music: use the "tempo" column to select an appropriate tempo.

TIP

 If you have made a playlist, save it as an Excel file and share it with your learning colleagues.
 Saving a DataFrame as an Excel file is very simple.

pandas.DataFrame.to_excel()

‣ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html

Unit 37.

Data Tidying

Unit learning objective (1/3) UNIT
37

Learning objectives

 Learners will be able to check for missing data within a given data frame.

 Learners will be able to delete or replace missing data.

 Learners will be able to perform descriptive statistics such as mean, median, mode, variance,
and standard deviation for data frames.

Unit learning objective (2/3) UNIT
37

Learning overview

 Differentiate situations that require data tidying


 Search and check missing data
 Delete missing data or replace it with another value
• Learn the basics of descriptive statistics such as mean, median, mode, variance, standard
deviation, and correlation coefficient
• Data visualization: learning about box plots, histograms, and scatter plots

Concepts you will need to know from previous units

 Know how to select and slice data elements in a data frame

 Know the basics of Matplotlib visualization

Unit learning objective (3/3) UNIT
37

Keywords

NaN, Tidy data, seaborn, Descriptive Statistics, Boxplot

UNIT Data Tidying
37.

Mission

Descriptive Statistics and Visualization of Student Grades


The University of California, Irvine provides a sample dataset for data learning.
Among these datasets, download and unzip student.zip from
https://archive.ics.uci.edu/ml/datasets/student%2Bperformance

‣ The student-mat.csv file is used as the target data for data analysis.
‣ It is converted into a data frame using Pandas.
‣ Identify the characteristics of the data.
‣ Check for missing data and, if any is found, run data pre-processing.
‣ Calculate the mean, median, and mode.
‣ Visualize in a box plot graph.
‣ Visualize all the variables in a histogram and scatterplot.

https://archive.ics.uci.edu/ml/datasets.php

UNIT Data Tidying
37.

Key concept

1. Why is Data Tidying Necessary?


So far, we have learned to create data frames and series, or to bring in other files and turn them into
data frames. In practice, the situation encountered most often when doing such tasks in the field is
that the data to be analyzed is in an abnormal form.
Abnormal data refers to cases where data is missing, units do not match, or values are duplicated
without rules. The process of organizing abnormal data is called data tidying.
This term was coined by Hadley Wickham (http://hadley.nz/) in his paper “Tidy Data” (Paper Link:
https://vita.had.co.nz/papers/tidy-data.html). The paper is available for download, so please check it out.

There is something that data scientists in the field always say:
“80% of data analysis is spent on tidying and preparing data.”
In addition, this work is tedious and repetitive because data preparation and tidying do not end in a
single pass; new problems are constantly found along the way and must be tidied again.
Therefore, Tidy Data provides a standard way of organizing data values within a data set. The
standard makes it easy to clean up initial data because you don’t have to start from scratch and
reinvent the procedure every time.

https://cfss.uchicago.edu/notes/tidy-data/

2. Definition of Tidy Data


“Tidy data is a standard way of mapping the meaning of a dataset to its structure.”
- Hadley Wickham

The 3 rules of Tidy Data


‣ Each variable must have its own column.
‣ Each observation must have its own row.
‣ Each value must have its own cell.

https://cfss.uchicago.edu/notes/
tidy-data/
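The three rules can be illustrated by reshaping a small table whose column headers are values (subject names) rather than variable names. This sketch uses hypothetical grades, and melt() is shown only as one possible pandas tool for the reshaping; the slides do not name it here.

```python
import pandas as pd

# Messy: "math" and "science" are values of a variable (subject), not variables.
messy = pd.DataFrame({
    "student": ["Ann", "Ben"],
    "math": [90, 75],
    "science": [85, 80],
})

# Tidy: one column per variable, one row per observation, one value per cell.
tidy = messy.melt(id_vars="student", var_name="subject", value_name="score")
print(tidy)
```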

3. Messy Data
The following is a summary of realistic situations in which data tidying is necessary.

Messy Data
‣ If the column is not the name of the variable, but the value itself
‣ If there is not just one variable in the column, but multiple variables
‣ If the variables are stored in both columns and rows (should be stored in columns)
‣ If missing data exists
‣ If it’s not a sample for the desired period of time
‣ If quantitative data is needed but the variable is qualitative
‣ If the data types of the values are wrong
‣ If there is duplication of data

4. Setting Up Data for Practice


seaborn.load_dataset

You can load the example datasets you need from seaborn’s online repository
(https://github.com/mwaskom/seaborn-data). By accessing this link, you can also see which data sets
are available. A small drawback is that an Internet connection is required to access the repository.
Since the result is returned as a Pandas data frame, it is useful for learning purposes.

Line 5
• Brings in the Titanic survivor data set. With cache=True, the local cache is checked first, and the
cached copy is used if it exists.

5. Checking for Missing Data


In many cases, the values of some data elements are missing from the data frame. In Pandas, which we
are learning now, missing data is displayed as NaN (Not a Number).
The following are the reasons why missing data occurs. There are many other reasons, but only the
common cases are summarized here.
‣ If two data sets are joined and some values have no counterpart in the other data set
‣ If the data from an external source is incomplete
‣ If the data is missing at the time of collection because it will be filled in later
‣ If the event continues to occur and accumulate despite an error in the value
‣ If the shape of the data is changed due to adding a new column or row (that was not checked during
data reshaping)

Line 1
• You can see the number of valid (non-null) values among the data held in each column.

The RangeIndex shows that each column has 891 data elements. However, the number of non-null data
is less than the total amount of data in columns such as age and deck. A simple calculation confirms
that the age column has 891 - 714 = 177 NaN data elements.

age: 891 - 714 = 177 (NaN)
deck: 891 - 203 = 688 (NaN)

If the dropna=False parameter is used in the df.value_counts() method, the count of missing data is
also included in the returned Series.
‣ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html

TIP

• If the parameter of df.value_counts() is set to dropna=True, or if the dropna parameter is not used
at all, the number of data excluding missing data is calculated.
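The effect of the dropna parameter can be seen on a single column. This sketch calls value_counts() on a hypothetical Series standing in for the deck column.

```python
import numpy as np
import pandas as pd

deck = pd.Series(["C", np.nan, "E", np.nan, "C"], name="deck")

with_nan = deck.value_counts(dropna=False)  # NaN gets its own count
without_nan = deck.value_counts()           # default dropna=True skips NaN

print(with_nan)
print(without_nan)
```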

6. Finding Missing Data


• isnull(): Returns True for Missing Data
• notnull(): Returns False for Missing Data

df.isnull()

NaN → True

df.notnull()

NaN → False

The .sum() method treats True and False as 1 and 0. Applying it to the result of isnull() gives the
total number of NaN values.

‣ If you check the results of this code, there are 177 missing values in the age column.
‣ There are two missing values each in the embarked and embark_town columns, and 688 in the deck
column.

Line 1
• Check the number of missing data in each column of the entire data set.

Line 1
• In the previous result series, if .sum() is applied once more, we can know the total number of
NaN values in the data frame.

Line 1
• df.count() returns the number of values other than missing data for each column. The number of
missing data can be obtained by subtracting this from the total amount of data.
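The three counting approaches described above can be compared on a small table. The values are hypothetical; the column names only echo the Titanic data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

per_column = df.isnull().sum()    # NaN count for each column
total = df.isnull().sum().sum()   # NaN count for the whole data frame
via_count = len(df) - df.count()  # total rows minus non-missing values

print(per_column)
print(total)
print(via_count)
```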

7. Deleting Missing Data


DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Delete the column.


‣ The total number of passengers is 891, and 688 do not have their deck information.
‣ Since the proportion of missing data is very high, it can be said that the column is meaningless from
the standpoint of processing and analyzing data. In this case, the most common way to deal with
missing data is to delete columns with missing data.

Line 1, 3
• 1: thresh=500 keeps only the columns that have at least 500 non-NaN values; all other columns
are deleted.
• 3: Since the deck column has only 203 non-NaN values (688 NaNs), fewer than 500, we can confirm
that it has been deleted.
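The meaning of thresh is easy to misread, so here is a sketch on a six-row table with hypothetical values. thresh counts the non-NaN values a column needs in order to be kept, not the number of NaNs that triggers deletion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0, 28.0, np.nan],       # 4 non-NaN values
    "deck": [np.nan, "C", np.nan, np.nan, np.nan, np.nan],  # 1 non-NaN value
    "fare": [7.25, 71.28, 8.05, 53.10, 8.46, 51.86],        # 6 non-NaN values
})

# axis=1 works on columns; keep those with at least 3 non-NaN values.
kept = df.dropna(axis=1, thresh=3)
print(kept.columns.tolist())
```

The age column survives even though it contains NaN, because it still has four valid values.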

Delete the row.


‣ 177 out of 891 passengers have no age data. If the passenger’s age is considered an important
variable in the data analysis, it is recommended to delete the rows of passengers without age data.
‣ With subset=['age'] and axis=0, all rows with a NaN value in the age column are deleted.
‣ With how='all', a row is deleted only if all of its data is NaN.

Line 1, 2
• 1: There were 177 rows without age data above, and we deleted them.
• 2: Therefore, it is natural that the number of resulting rows is 714.
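Row deletion with subset and with how='all' can be sketched on three hypothetical rows, one of which is entirely empty.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, np.nan],
    "deck": ["C", "E", np.nan],
})

# Delete rows whose age is NaN; other columns are not inspected.
age_known = df.dropna(subset=["age"], axis=0)

# Delete a row only when every value in it is NaN.
not_empty = df.dropna(how="all", axis=0)

print(len(age_known), len(not_empty))
```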

Delete the column.

Line 1
• Delete every column that contains at least one NaN.

8. Replacing Missing Data with Other Data


DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None,
downcast=None)
Deleting every column or row that contains missing data in the preprocessing stage is unfavorable
because it reduces the amount of data to be analyzed.
However, replacing NaN with arbitrary values such as 0 or 1 will distort the data analysis.
Therefore, missing data is usually replaced with a value that represents the distribution and
characteristics of the data set, such as the mean or the mode.

1) Replacing with the Mean
‣ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html?highlight=mean#pandas.DataFrame.mean

Line 4, 6
• 4: Load the Titanic dataset.
• 6: Checking the age column shows that some of the data is NaN.


Line 1
• The mean of the data in the age column is stored in avg_age.


Line 1, 2
• 1: The median, obtained with the median() method, can also be used as the replacement value
instead of the mean.
• 2: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html?highlight=median#pandas.DataFrame.median


Line 1
• fillna() substitutes the NaN elements with the value it is given. Here they are replaced with
median_age, the median calculated earlier.


Line 1
• We can see that the missing data has been replaced with the fill value.


Line 1
• No NaN remains in the age column after the replacement.
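The mean and median replacement steps can be sketched as follows. The ages are hypothetical, avg_age and median_age follow the variable names used on the slides, and the two result columns are names invented here so that both variants can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values.
df = pd.DataFrame({"age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0]})

avg_age = df["age"].mean()        # NaN values are skipped by default
median_age = df["age"].median()

# Assign the filled result to new columns so the original stays available.
df["age_mean_filled"] = df["age"].fillna(avg_age)
df["age_median_filled"] = df["age"].fillna(median_age)

print(avg_age, median_age)        # 34.25 30.5
print(df.isnull().sum())          # only the original 'age' column still has NaN
```

The median is less sensitive to extreme ages than the mean, which is one reason it may be preferred as the fill value.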

2) Replacing with the Maximum Value.
‣ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html?highlight=value_counts
‣ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html?highlight=idxmax#pandas.DataFrame.idxmax
‣ Let’s find the embarkation town with the most passengers, that is, the most frequent value in
embark_town, and use it to replace the missing data in that column.


Line 4
• Bring up the Titanic data set.


Line 1
• value_counts() returns a Series containing the counts of the unique values in the column. Missing data is excluded.


Line 1
• idxmax() returns the index at which the maximum value first occurs along the requested axis.


Line 1, 3
• 1: Using the fillna() method, the NaN elements are replaced with the name of the most frequent
embarkation town, stored in the variable most.
• 3: The missing data in the embark_town column has been replaced with Southampton, so no
missing data remains in the column.
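The three steps above (count the values, pick the most frequent one, fill with it) can be sketched on a small hypothetical column. The variable name most follows the slide; the town list is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation towns with two missing entries.
df = pd.DataFrame({"embark_town": [
    "Southampton", "Cherbourg", np.nan, "Southampton",
    "Queenstown", "Southampton", np.nan,
]})

counts = df["embark_town"].value_counts()  # NaN is excluded by default
most = counts.idxmax()                     # label with the highest count

df["embark_town"] = df["embark_town"].fillna(most)

print(most)                                # Southampton
print(df["embark_town"].isnull().sum())    # 0
```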

3) Replacing with Nearest Neighbor Values
‣ This is not true for all data, but neighboring data often have similar values. After checking the
characteristics of the dataset, the missing data is replaced with the value immediately before or
immediately after it.
• Changing the entire row index: dataframeobjectname.index = array of the new row indices
• Changing all column names: dataframeobject.columns = array of the new column names

‣ Print df_01 and df_02 and compare the two data frames to see how the positions of the missing data
have been filled.
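A minimal sketch of neighbor-based filling follows. It assumes that df_01 holds the forward-filled result and df_02 the backward-filled one; the slides do not show how these two frames were created, so this assignment and the data are hypothetical. The ffill() and bfill() methods are used here because recent pandas releases deprecate the method parameter of fillna().

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps.
df = pd.DataFrame({"value": [10.0, np.nan, 12.0, np.nan, np.nan, 15.0]})

df_01 = df.ffill()  # copy the previous valid value forward
df_02 = df.bfill()  # copy the next valid value backward

print(df_01["value"].tolist())  # [10.0, 10.0, 12.0, 12.0, 12.0, 15.0]
print(df_02["value"].tolist())  # [10.0, 12.0, 12.0, 15.0, 15.0, 15.0]
```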

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 349
UNIT Data Tidying
37.

Let’s code

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization
Let’s code UNIT
37

Step 1
Preparing Data and Creating Data Frame

‣ Although the data has been loaded, it is difficult to handle in this state. We can see that the
values in the downloaded file are separated by ‘;’.
‣ CSV files are usually separated by ‘,’, but this file is separated by semicolons, which makes the raw
output difficult to read.
‣ To specify the character that separates the data, use the parameter sep=‘separator character’.
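The effect of the sep parameter can be seen without the real file. The sketch below reads a hypothetical two-row excerpt from memory; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt of a semicolon-separated file.
raw = "school;sex;age;absences\nGP;F;18;6\nGP;M;17;4\n"

wrong = pd.read_csv(io.StringIO(raw))        # default sep=',' gives one wide column
df = pd.read_csv(io.StringIO(raw), sep=";")  # four proper columns

print(wrong.shape)  # (2, 1)
print(df.shape)     # (2, 4)
```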

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 351
Let’s code UNIT
37

Step 1
Checking Data Characteristics

Line 1
• The non-null count is the number of values in each column that are not null.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 353
Let’s code UNIT
37

Step 1
Searching for any NaN in the Data

Line 1
• Looking at the results, it can be confirmed that the number of NaN data for each column is 0.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 354
Let’s code UNIT
37

Step 1
Searching for any NaN in the Data

Line 1
• Obtain the number of columns with NaN data
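Both NaN checks can be sketched with isnull(). The small data frame is hypothetical, and the exact expressions used on the slides are not shown in the text, so this is one possible way to obtain the two numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values.
df = pd.DataFrame({
    "G1": [5, 7, np.nan, 15],
    "G3": [6, 6, 10, 15],
    "absences": [6, np.nan, np.nan, 2],
})

nan_per_column = df.isnull().sum()                  # Series: NaN count per column
columns_with_nan = int((nan_per_column > 0).sum())  # how many columns have any NaN

print(nan_per_column.to_dict())  # {'G1': 1, 'G3': 0, 'absences': 2}
print(columns_with_nan)          # 2
```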

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 355
Let’s code UNIT
37

Step 1
Understanding the Meaning of the Columns for Data Comprehension
‣ For data analysis, it is fundamental to understand the meaning of each column name of the data frame.
‣ For this data set, the column descriptions are written in detail in the study.txt file included in the
compressed file. In everyday practice, however, there are often no documents explaining the column
names. In that case, it is necessary to contact the person who provided the data to check what each
column of the data set means.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 356
Let’s code UNIT
37

Step 2
Let’s visualize the number of days students were absent in a histogram. The number of days of absence
is column 30, absences.
‣ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib-pyplot-hist

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 358
Let’s code UNIT
37

Step 2

Line 1, 6
• 1: Specify the variable to be graphed on the histogram
• 6: Add a grid to the graph
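A histogram along these lines might look like the sketch below. The absence counts, the number of bins, and the labels are hypothetical, and the Agg backend is selected only so that the sketch also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical absence counts standing in for the absences column.
df = pd.DataFrame({"absences": [0, 0, 2, 4, 4, 6, 6, 6, 10, 14, 16, 30]})

data = df["absences"]                      # variable to be graphed
counts, bin_edges, _ = plt.hist(data, bins=6)
plt.xlabel("absences")
plt.ylabel("number of students")
plt.title("Distribution of absences")
plt.grid(True)                             # add a grid to the graph
plt.close()                                # use plt.show() instead in a notebook
```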
Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 359
Let’s code UNIT
37

Step 2
Let’s draw histograms of the other numeric variables to perform an exploratory data analysis.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 360
Let’s code UNIT
37

Step 3
Find the basic descriptive statistics such as the mean, median, mode, variance, and standard deviation.
‣ As we learned in the previous lessons, we can check the basic descriptive statistics of a data
frame using Pandas’ describe() method.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 361
Let’s code UNIT
37

Step 3
Median
‣ The median is the middle value of the data when it is sorted by size.
‣ Suppose we want the median of studytime:

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 362
Let’s code UNIT
37

Step 3
Mode
‣ The mode is the most frequent value in the data.
‣ Suppose we want the mode of studytime:
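The median and the mode can be obtained as sketched below. The studytime values are hypothetical; only the column name comes from the slides.

```python
import pandas as pd

# Hypothetical studytime values.
df = pd.DataFrame({"studytime": [2, 3, 2, 3, 2, 1, 2, 4, 2, 1]})

median_value = df["studytime"].median()
# mode() returns a Series because several values can tie for most frequent.
mode_value = df["studytime"].mode()[0]

print(median_value)  # 2.0
print(mode_value)    # 2
```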

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 363
Let’s code UNIT
37

Step 3
Variance
‣ By calculating the variance, we can check whether the data is scattered or concentrated around the
mean. After selecting the variable of interest, use the var() method.
‣ Subtract the mean from each observed value, square the result, add all the squares, and divide by the
number of observations. That is, the variance is the average of the squared deviations. If you simply
add the deviations (observed value minus mean), you get zero, so they are squared before being added.

Line 1
• The smaller the result, the less scattered the data.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 364
Let’s code UNIT
37

Step 3
Standard Deviation
‣ When there is a lot of data, the mean is often used as the value that represents the data. The
standard deviation, one of the measures of dispersion, is a representative figure indicating how
spread out the data is around the mean. The standard deviation is the square root of the variance, so
its unit is identical to the unit of the data, unlike the variance, whose unit is squared. If the standard
deviation is close to 0, the data values are concentrated near the mean. The larger the standard
deviation, the more widely spread the data values are.

Line 1
• The smaller the result value, the smaller the degree of scatter in the data.
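The definitions of variance and standard deviation can be checked by hand on a short hypothetical series. One detail worth noting: by default pandas divides by n - 1 (sample variance, ddof=1), while the verbal definition "divide by the number of observations" corresponds to ddof=0.

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name="hypothetical_scores")

mean = s.mean()                             # 5.0
deviations = s - mean                       # these add up to zero
pop_var = (deviations ** 2).sum() / len(s)  # divide by n

sample_var = s.var()                        # pandas default: divide by n - 1

print(deviations.sum())                     # 0.0
print(pop_var, s.var(ddof=0))               # both 4.0
print(sample_var)                           # 32 / 7
print(s.std(ddof=0))                        # 2.0, the square root of 4.0
```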

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 365
Let’s code UNIT
37

Step 3
Visualization through Box Plot
‣ Necessary Concepts for Understanding Box Plots

Percentile: Data divided into a hundred equal parts

Quartile: Data divided into four equal parts

Median Value (Q2): The value at the middle of the data (half of the observations are greater than or
equal to it, and the other half are smaller than or equal to it)

The 3rd (Upper) Quartile (Q3): The median of the top 50% based on the median value. The value that
marks off the top 25% of the entire data.

The 1st (Lower) Quartile (Q1): The median of the bottom 50% based on the median value. The value that
marks off the bottom 25% of the entire data, so 75% of the data lie above it.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 366
Let’s code UNIT
37

Step 3
Visualization through Box Plot

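[Figure: anatomy of a box plot. The box spans from the lower quartile (Q1) to the upper quartile (Q3),
which is the interquartile range (IQR), with the median (Q2) inside it. Whiskers extend from the box to
Min and Max. Each of the four segments contains 25% of the data.]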
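The quartiles and the IQR behind a box plot can be computed directly with quantile(). The data below is hypothetical, and the 1.5 x IQR rule for flagging outliers is an assumption added here for illustration (it is the default whisker length in Matplotlib's boxplot); the slides do not state which rule they use.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], name="hypothetical_absences")

q1 = s.quantile(0.25)  # lower quartile
q2 = s.quantile(0.50)  # median
q3 = s.quantile(0.75)  # upper quartile
iqr = q3 - q1          # height of the box

# Assumed rule: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = s[(s < lower_fence) | (s > upper_fence)]

print(q1, q2, q3, iqr)    # 3.25 5.5 7.75 4.5
print(outliers.tolist())  # [50]
```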

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 367
Let’s code UNIT
37

Step 3
Visualization through Box Plot

Line 1
• 1st Semester Grades

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 368
Let’s code UNIT
37

Step 3
Visualization through Box Plot

Line 1
• Number of absent days

‣ Looking at the box plot of the number of absences, we can see that there are many outliers.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 369
Let’s code UNIT
37

Step 3
Visualization through Box Plot

‣ If you look at the box plot of the first semester grades, the second semester grades, the third semester
grades, the weekend night outs statistics, and the number of days of absence, you can get insights on
student performance.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 370
Let’s code UNIT
37

Step 4
Coefficient of Variation (CV)
‣ Data on variables with different units of measurement cannot be compared directly. This is because
when the magnitudes of the data differ, the deviation tends to be larger for the variable measured on
the larger scale.
Ex) The standard deviations of stock prices and gas prices cannot be compared directly.

‣ What we can use in these situations is the coefficient of variation.

‣ It is the standard deviation divided by the mean. It can be used to compare the spread of data on
different scales regardless of magnitude.

Coefficient of Variation (CV) = Standard Deviation / Mean

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 372
Let’s code UNIT
37

Step 4
Coefficient of Variation (CV)
‣ The describe() method does not show the coefficient of variation, but it can be calculated for the
whole data frame as follows.
‣ However, note that if the mean is 0 or close to 0, the coefficient of variation can become
extremely large.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 373
Let’s code UNIT
37

Step 4
Coefficient of Variation (CV)

Line 1
• Returns the CV of every column as a Series

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 374
Let’s code UNIT
37

Step 4
Coefficient of Variation (CV)

Line 2
• However, since the entire data may not be numeric, it is recommended to designate a specific
column, as specified in the content of the error message
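A sketch of both approaches follows: the CV of one designated column, and the CV of all numeric columns at once. The data is hypothetical, and numeric_only=True is used here as one way to skip non-numeric columns; the slides may handle the error differently.

```python
import pandas as pd

# Hypothetical data with one non-numeric column.
df = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS"],
    "G1": [10, 12, 14, 16],
    "absences": [0, 2, 4, 30],
})

# CV of one designated column.
cv_g1 = df["G1"].std() / df["G1"].mean()

# CV of every numeric column; numeric_only=True leaves 'school' out.
cv_all = df.std(numeric_only=True) / df.mean(numeric_only=True)

print(cv_all)
```

In this made-up example absences has a much larger CV than G1, so it is the more widely scattered variable relative to its mean.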

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 375
Let’s code UNIT
37

Step 4
Covariance
‣ The covariance represents the relationship between two variables.
• If the covariance is positive, the two variables have a positive relationship: they tend to increase together.
• If the covariance is negative, the two variables have a negative relationship: when one increases, the other tends to decrease.
‣ It is calculated by multiplying the deviations of the two variables from their respective means and
averaging the products. It extends the idea of variance to two or more variables.
‣ It can be calculated using the cov( ) method.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cov.html

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 376
Let’s code UNIT
37

Step 4
Covariance

Line 1
• Returns the covariance value of each column in the data frame.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 377
Let’s code UNIT
37

Step 4
Covariance

Line 1
• The covariance can also be calculated with NumPy’s cov() function. Passing two Series (columns)
returns the covariance matrix of the two.

‣ Analysis of the matrix above is as follows.


• The covariance values of G1 and G3: matrix elements (1,2) and (2,1) 12.18768232
• Variance of G1: matrix element (1,1) 11.01705327
• Variance of G3: matrix element (2,2) 20.9896164

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 378
Let’s code UNIT
37

Step 4
Covariance

Line 1, 2
• 1, 2: The variances obtained with var() are the same as the diagonal elements of the matrix
above.
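The relationship between the covariance matrix and var() can be verified on a small example. The grades below are hypothetical, so the numbers differ from the matrix on the slide; only the column names G1 and G3 are taken from it.

```python
import numpy as np
import pandas as pd

# Hypothetical grades.
df = pd.DataFrame({
    "G1": [5.0, 7.0, 10.0, 14.0, 15.0],
    "G3": [6.0, 6.0, 10.0, 15.0, 16.0],
})

cov_df = df.cov()                    # pandas: covariance of every column pair
cov_np = np.cov(df["G1"], df["G3"])  # NumPy: 2 x 2 covariance matrix

# Diagonal elements are the variances; off-diagonal elements are the covariance.
print(cov_np[0, 0], df["G1"].var())  # matrix element (1,1) and the variance of G1
print(cov_np[1, 1], df["G3"].var())  # matrix element (2,2) and the variance of G3
print(cov_np[0, 1], cov_np[1, 0])    # matrix elements (1,2) and (2,1)
```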

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 379
Let’s code UNIT
37

Step 4
Correlation Coefficient
‣ The covariance depends on the scale and unit of each variable. The correlation coefficient removes
this dependence on scale and shows the relationship between the data.
‣ The covariance can tell you the direction of the relationship between two variables, but it cannot
express the strength of the relationship, so a correlation coefficient is needed.
‣ Simply put, the correlation coefficient measures the degree to which two variables move together.
1.0: Complete positive correlation. This means that when one variable moves by a given amount, the
other moves in the same direction at the same rate.

0.0: It means that the two variables have no relationship.

-1.0: Complete negative correlation or inverse correlation. This means that when one variable moves by
a given amount, the other moves in the opposite direction.

‣ Can be calculated using the corr( ) method.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
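The scale independence of the correlation coefficient can be illustrated as follows. The grades are hypothetical, and the column G1_x100 is an invented rescaled copy used only to show that the covariance changes with the scale while the correlation does not.

```python
import numpy as np
import pandas as pd

# Hypothetical grades.
df = pd.DataFrame({
    "G1": [5.0, 7.0, 10.0, 14.0, 15.0],
    "G3": [6.0, 6.0, 10.0, 15.0, 16.0],
})
df["G1_x100"] = df["G1"] * 100  # same data on a different scale

r = df["G1"].corr(df["G3"])     # Pearson correlation by default
# Correlation = covariance divided by the product of the standard deviations.
r_manual = df["G1"].cov(df["G3"]) / (df["G1"].std() * df["G3"].std())

print(r, r_manual)                                          # identical values
print(df["G1"].cov(df["G3"]), df["G1_x100"].cov(df["G3"]))  # covariance grows 100 times
print(df["G1_x100"].corr(df["G3"]))                         # correlation is unchanged
```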

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 381
Let’s code UNIT
37

Step 4
Scatter Plot

Line 1
• The graph compares the results of the first and final tests. It shows that students with good
grades at the beginning often perform well until the end.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 382
Let’s code UNIT
37

Step 5
Draw a histogram and scatter plot of the variables you want to compare.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 383
UNIT Data Tidying
37.

Pair programming

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization
Pair programming UNIT
37

Pair Programming Practice


Guideline, mechanisms & contingency plan
Preparing pair programming involves establishing guidelines and mechanisms to help students pair
properly and to keep them paired. For example, students should take turns “driving the mouse.”
Effective preparation requires contingency plans in case one partner is absent or decides not to
participate for one reason or another. In these cases, it is important to make it clear that the active
student will not be punished because the pairing did not work well.

Pairing similar, not necessarily equal, abilities as partners


Pair programming can be effective when students of similar, though not necessarily equal, abilities
are paired as partners. Pairing mismatched students often can lead to unbalanced participation.
Teachers must emphasize that pair programming is not a “divide-and-conquer” strategy, but rather a
true collaborative effort in every endeavor for the entire project. Teachers should avoid pairing very
weak students with very strong students.
Motivate students by offering extra incentives
Offering extra incentives can help motivate students to pair, especially with advanced students.
Some teachers have found it helpful to require students to pair for only one or two assignments.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 385
Pair programming UNIT
37

Pair Programming Practice


Prevent collaboration cheating
The challenge for the teacher is to find ways to assess individual outcomes, while leveraging the
benefits of collaboration. How do you know whether a student learned or cheated? Experts
recommend revisiting course design and assessment, as well as explicitly and concretely discussing with
the students on behaviors that will be interpreted as cheating. Experts encourage teachers to make
assignments meaningful to students and to explain the value of what students will learn by
completing them.
Collaborative learning environment
A collaborative learning environment occurs anytime an instructor requires students to work together
on learning activities. Collaborative learning environments can involve both formal and informal
activities and may or may not include direct assessment. For example, pairs of students work on
programming assignments; small groups of students discuss possible answers to a professor’s question
during lecture; and students work together outside of class to learn new concepts. Collaborative
learning is distinct from projects where students “divide and conquer.” When students divide the
work, each is responsible for only part of the problem solving and there are very limited opportunities
for working through problems with others. In collaborative environments, students are engaged in
intellectual talk with each other.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 386
Pair programming UNIT
37

Q1.
After converting the universities_ranking.csv file in the practice folder ‘data folder’ into a data
frame, check whether there is missing data, and decide with your colleague whether to replace it
or delete it altogether.

The results of the discussion should be organized as actual code.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 387
Unit 38.

Time Series Data

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 388
Unit learning objective (1/3) UNIT
38

Learning objectives
 Be able to determine whether it is an analysis target using time series data when a data frame is
presented.

 Be able to explain the two data types of time series data.

 Be able to create date, time, interval, and array data as a time series data type.

 Be able to index and slice a data frame using DatetimeIndex.

 Be able to obtain a moving average, a representative descriptive statistic of time series data.

 Be able to visualize data as an area graph for comparison of two or more series data.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 389
Unit learning objective (2/3) UNIT
38

Learning overview

 Learn real-world applications of time series data analysis


 Learn types and elements of time series data types
 Learn how to generate time series data (date, time, interval, array of time series data)
 Learn how to index and slice time series data
 Learn basic descriptive statistics using time series data functions

Concepts you will need to know from previous units

 How to index and slice data frames and series

 How to create a data frame from an external file

 How to draw graphs using Matplotlib

 How to view the summary information of a data frame using the DataFrame.info() function

 How to find missing data and replace it

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 390
Unit learning objective (3/3) UNIT
38

Keywords

Time Series Data Timestamp Period

Moving Average Area Graph Categorical Data

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 391
UNIT Time Series Data
38.

Mission

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization
Mission UNIT
38

In the current global seafood production business, are most fish caught directly by fishermen, or are
they raised through aquaculture?
This may differ depending on the geographical location of each country or differences in food
culture. An organization called Our World in Data provides related data. Based on this, let's compare which
type of raw material supply accounts for the largest share of current global seafood production.
In addition, since this data is provided for each country, let's search for the countries you want and
create the related statistics separately.

https://ourworldindata.org/

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 393
Mission UNIT
38

This mission must comply with the tasks below.

‣ Data resource location: https://ourworldindata.org/fish-and-overfishing


‣ Data as of November 6, 2021 are downloaded and provided for practice.
df = pd.read_csv("./data//fish/capture-fisheries-vs-aquaculture.csv")
‣ Convert the csv file to a data frame.
‣ Delete unnecessary columns in the data frame.
‣ Convert the year data to a time series data type and convert it to an index.
‣ Transform country data into categorical data.
‣ Check if NaN data exists and replace it.
‣ Use visualization tools to see trends in aquaculture and direct catch data around the world (All
countries) from the 1960s to the present.
‣ Search for 3 or more individual countries and compare them through visualization.

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 394
Mission UNIT
38

The goal of this mission is to produce the same type of results as the graph below.
The graph below was created by data analysis experts.

Source: Food and Agriculture Organization of the United Nations (via World Bank)

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 395
UNIT Time Series Data
38.

Key concept

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization
Key concept UNIT
38

1. What is time series data?


Time series data is a set of time-ordered values collected over a certain time or period.
It can be explained a little more technically:

“A time series is data about one or more variables measured over a period of time at
specific intervals.”
Machine learning technology is widely used for predictive analysis in business. It analyzes business
problems such as stock prices, budgets, sales and asset flows, predictive maintenance, and sales
forecasting, and predicts future indicators by combining related components of the data such as trend,
cyclicality, and seasonality.
Pandas was created to handle financial data, and time series analysis can be the core of data
analysis using Pandas.
In addition to the financial field, time series data analysis is used in various fields.
‣ Medicine
‣ Meteorology
‣ Astronomy
‣ Economics
‣ IoT

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 397
Key concept UNIT
38

2. Time Series Data and Analysis used in the Real World


The image below is the Manufacturing Purchasing Managers’ Index (PMI) of the United States. Values
above 50 indicate economic expansion, and values below 50 indicate contraction. It is a meaningful
graph that can serve as a leading indicator of overall economic performance. This is an example of
time series data used in the real world. The predicted and actual figures are also compared, and the
predictions can likewise be generated by analyzing time series data accumulated in the past.
https://www.markiteconomics.com/Public/Release/PressReleases

http://www.Investing.com

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 398
Key concept UNIT
38

2. Time Series Data and Analysis used in the Real World


Time series data is sorted in time order, so the observations are continuous, and there is a high
probability that the data values are correlated with each other.
To understand time series data, you should first understand how Pandas represents dates, times, and
intervals. Pandas has many functions that can be used when transforming data to different frequencies
or when using a calendar that reflects business days and holidays for financial calculations.
Pandas was created to handle financial data. When analyzing financial data, converting the time series
data into the index of a data frame has many advantages.
As the example code will show, when external data is loaded and the date columns are checked, they
are in many cases of string or object type. Effective analysis starts with converting them to a Pandas
time series data type.
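A minimal sketch of that conversion follows. The column names date and close and the values are hypothetical; the point is that date strings become a DatetimeIndex, after which a whole month can be selected with a partial date string.

```python
import pandas as pd

# Hypothetical data as it might arrive from a CSV file: dates are plain strings.
df = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05", "2021-02-01", "2021-02-02"],
    "close": [100.0, 101.5, 99.0, 102.0],
})

df["date"] = pd.to_datetime(df["date"])  # strings -> datetime64 values
df = df.set_index("date")                # the index becomes a DatetimeIndex

january = df.loc["2021-01"]              # partial-string slicing by month

print(type(df.index).__name__)           # DatetimeIndex
print(january["close"].tolist())         # [100.0, 101.5]
```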

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 399
Key concept UNIT
38

3. Two Time Series Data Types in Pandas


• Timestamp: one point in time
• Period: the span between two points in time, i.e., a constant period between two timestamps

‣ Multiple timestamps gathered into an array become a DatetimeIndex.

Period 1 Period 2 Period 3

2021-01-01 2021-04-02 2021-07-02 2021-10-01

Timestamp 1 Timestamp 2 Timestamp 3 Timestamp 4
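The two types can be created directly, as in the sketch below. The dates are the ones shown in the diagram; choosing a quarterly period ("2021Q1") for the Period example is an assumption made here for illustration.

```python
import pandas as pd

ts = pd.Timestamp("2021-01-01")  # one point in time
period = pd.Period("2021Q1")     # a span: the first quarter of 2021

# Several timestamps gathered into an array form a DatetimeIndex.
idx = pd.DatetimeIndex(["2021-01-01", "2021-04-02", "2021-07-02", "2021-10-01"])

print(period.start_time, period.end_time)  # the two ends of the span
print(period.start_time <= ts <= period.end_time)
print(type(idx[0]).__name__)               # Timestamp
```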

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 400
Key concept UNIT
38

4. Elements of Time Series Data


Trend: A fluctuating pattern in the data that appears over the long term
(The values of the data increase or decrease in a reasonably predictable pattern.)

Seasonal: A fluctuating pattern in the data that appears within a unit period of time such as a week,
month, quarter, or half year
(The patterns of the data are repeated over a specific period.)

Cycle: A long-term fluctuation without a fixed period that appears over a span of at least two years
(The values of the data exhibit rises and falls that are not of a fixed frequency, often due to
economic conditions.)

Random: A completely irregular pattern that does not belong to the above three categories

Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 401
Key concept UNIT
38

4. Elements of Time Series Data


To discover insights, you need to decompose time series data into each of its four representative
components.

This way of analyzing time series data is called a time series additive model.
Trend factor + Cycle factor + Seasonal factor + Irregular/Random factor

$Y_t = T_t + C_t + S_t + I_t$
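The additive idea can be illustrated with a synthetic series. The sketch below is a simplification under stated assumptions: it builds a hypothetical series from a linear trend and a seasonal pattern that repeats every 4 steps, leaves the cycle and irregular factors at zero, and uses a moving average (rolling mean) whose window equals the seasonal length to remove the seasonal part.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=24, freq="D")
t = np.arange(24)

trend = 0.5 * t                                # T_t: rises by 0.5 per step
seasonal = np.tile([1.0, -1.0, 2.0, -2.0], 6)  # S_t: repeats every 4 steps, sums to 0
y = pd.Series(trend + seasonal, index=idx)     # Y_t = T_t + S_t (C_t = I_t = 0 here)

# A window equal to the seasonal length averages the seasonal term away,
# leaving a smooth line that rises by 0.5 per step like the trend.
smoothed = y.rolling(window=4).mean()

print(smoothed.dropna().diff().dropna().round(6).unique())  # [0.5]
```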


5. Creating an Interval of Date, Time, Frequency and Time


5.1. Create date and time with python datetime class
The datetime class is part of Python's datetime library. It can express a specific point in time in several forms: both date and time, the date without the time, or the time alone.
‣ https://docs.python.org/ko/3/library/datetime.html#module-datetime

Most of the functions of the datetime library's date and time classes are also supported by datetime.
The disadvantage is that it lacks the precision needed for massive computations on time series data.
The datetime class receives year, month, day, hour, minute, second, microsecond, and time zone as
arguments. The time arguments are optional; if omitted, they default to 0.
‣ To use a datetime in Pandas, convert it to a Timestamp object.

Line 7
• We need at least 3 parameters corresponding to year, month and day. Hour and minute default to 0.
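A minimal sketch of this default behavior, using an arbitrary example date (the date and variable name are assumptions):

```python
from datetime import datetime

# Only year, month, and day are required; the time fields default to 0.
new_year = datetime(2021, 1, 1)
print(new_year)  # 2021-01-01 00:00:00
```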

Line 1
• Set the hour and minute parameters to 15:50.

Line 4
• If you use the combine() method of the datetime class, you can create a datetime object using an
existing date or time object.
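The two notes above might look like the following sketch. The date, the time, and the variable names are illustrative assumptions.

```python
from datetime import date, datetime, time

# Pass hour and minute explicitly (15:50).
meeting = datetime(2021, 1, 1, 15, 50)

# Build the same moment from an existing date object and time object.
day_part = date(2021, 1, 1)
time_part = time(15, 50)
combined = datetime.combine(day_part, time_part)

print(meeting, combined)
```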

Line 1
• Current date and time, local time

Line 1
• If a time zone is passed to the now() method, a datetime object in that time zone is created.
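A short sketch of both calls. The fixed UTC+9 offset is an assumption picked for illustration; any tzinfo object can be passed.

```python
from datetime import datetime, timedelta, timezone

# Current local date and time (naive: no time zone attached).
local_now = datetime.now()

# Current date and time in a given time zone (here a fixed UTC+9 offset).
utc_plus_9 = timezone(timedelta(hours=9))
aware_now = datetime.now(utc_plus_9)

print(local_now)
print(aware_now)
```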

Line 1
• Get the current date and time and return only the date.

Line 1
• Get the current date and time and return only the time.
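As a small sketch (variable names are assumptions), the date() and time() methods split a datetime into its two parts:

```python
from datetime import date, datetime, time

now = datetime.now()
today = now.date()         # date part only
current_time = now.time()  # time part only

print(today, current_time)
```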


5.2. Create a point in time with Pandas timestamp
pandas.Timestamp()

The date and time created with datetime can also be created with pandas.Timestamp(). The difference
is that Timestamp is based on the datetime64 data type, which has higher precision than Python datetime.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html

Line 1
• Both date and time can be set when creating.

Line 1
• Created by specifying the time.
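A minimal sketch of creating Timestamps with and without a time part. The dates are arbitrary examples.

```python
import pandas as pd

date_only = pd.Timestamp("2021-01-01")           # time defaults to 00:00:00
with_time = pd.Timestamp(2021, 1, 1, 15, 50)     # year, month, day, hour, minute
from_string = pd.Timestamp("2021-01-01 15:50")   # same moment, parsed from text

print(date_only, with_time, from_string)
```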

Line 1
• If only time is specified, today's date is applied as default.


5.3. Create time intervals with Pandas Timedelta
So far, we have learned to create a single point of date and time. Now, we will learn how to express time
intervals. Pandas' Timedelta class, a subclass of Python's datetime.timedelta, represents a time interval.
Time intervals are important for time series data analysis because they are needed when counting the
number of days between dates or analyzing data by a specific time interval.
pandas.Timedelta(value, unit=None, **kwargs)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html

Line 5, 6, 7, 9
• 5: Create a specific date with datetime.
• 6: Create today's date.
• 7: Calculate tomorrow by adding one day with Timedelta.
• 9: Calculate how much time separates the birthday from tomorrow.
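The steps above can be sketched as follows. The birthday is a made-up date, and pandas Timestamps are used throughout (an assumption; the slide's own code is not reproduced here).

```python
import pandas as pd

birthday = pd.Timestamp(1973, 7, 1)           # hypothetical specific date
today = pd.Timestamp.now()                    # current date and time
tomorrow = today + pd.Timedelta(days=1)       # shift by a one-day interval

# Subtracting two Timestamps yields a Timedelta.
gap = tomorrow - birthday
print(gap.days, "days between the birthday and tomorrow")
```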

Line 1
• When calculating the number of days between two dates, data is returned in timedelta format.

Line 2, 5
• 2: The difference in time can also be calculated arithmetically.
• 5: You can see that 10 hours are added to today.


5.4. Represent a period with Period
pandas.Period()
https://pandas.pydata.org/docs/reference/api/pandas.Period.html?highlight=pd%20period
Most time series data analysis is event analysis over a specific time interval. An example is an analysis
of a company's sales over a specific period of time. However, when analyzing events by grouping
multiple periods, it is difficult to use only timestamps. Pandas provides a standardized time interval
through a class called Period to facilitate this kind of data organization and calculation.
Period creates a period based on a specified frequency such as daily, weekly, monthly, quarterly, or
yearly, and provides Timestamps that indicate its start time and end time.
A period can be created using a timestamp corresponding to a reference point and a frequency that
indicates the length of the period.
For example, a period corresponding to one month based on July 1973 can be created as follows.
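A minimal sketch of that July 1973 period. The variable name special_day follows the name used later in this unit; the exact call on the original slide is not reproduced here.

```python
import pandas as pd

# A one-month period anchored at July 1973 (freq "M" = monthly).
special_day = pd.Period("1973-07", freq="M")
print(special_day)  # 1973-07
```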

pd.Period has attributes that show its start time and end time.


https://pandas.pydata.org/docs/reference/api/pandas.Period.start_time.html#pandas.Period.start_time

Line 1
• Return the start time of the period.

Line 1
• Return the end time of the period.
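A short sketch of both attributes, assuming a monthly period for July 1973:

```python
import pandas as pd

special_day = pd.Period("1973-07", freq="M")

start = special_day.start_time  # first instant of the month
end = special_day.end_time      # last instant of the month

print(start)
print(end)
```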

A Period can be shifted through simple arithmetic. Adding an integer creates a new Period object shifted
by that many units of its frequency. In the example, +2 shifts the period by two months because the
frequency of special_day is one month.

Line 1
• The 2 does not mean "two months" in itself. The period is shifted by two units of the frequency
with which it was created, which here is one month.

Line 1
• The result shows that Pandas correctly determines the full extent of September 1973
(there are 30 days).
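The shift and the month-length check can be sketched like this, assuming special_day is the monthly period for July 1973:

```python
import pandas as pd

special_day = pd.Period("1973-07", freq="M")

# +2 moves two units of the period's own frequency (months here).
shifted = special_day + 2
print(shifted)                 # 1973-09
print(shifted.start_time)      # first instant of September 1973
print(shifted.end_time)        # last instant of September 1973
print(shifted.days_in_month)   # 30
```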


6. Time Series Data Indexing and Basic Use


6.1. Convert other data types to time series data Timestamp and indexing with DatetimeIndex
You can convert other data types to the datetime64 (Timestamp) type with pandas.to_datetime().
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None,
format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix',
cache=True)
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime
The key to working with time series data in Pandas is indexing using DatetimeIndex objects.
This indexing is very useful because it makes sorting and selecting data by date and time straightforward.

Line 2
• As example data, we use the international crude oil price data provided by the FED.

You can see that the data type of the date column is object.

Line 2
• The data type is changed to datetime64 type.
Line 1
• A column named N_Date is created.

TIP
 Sometimes, in the process of converting to timestamp, an error occurs if the data cannot be
converted. In this case, there is a way to force the conversion.
 If you use errors = 'coerce' parameter, NaT is forcibly assigned to data that cannot be converted.
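A minimal sketch of the conversion with errors='coerce'. The small data frame is made up for illustration; only the DATE and N_Date column names come from the slides.

```python
import pandas as pd

# Hypothetical sample: the last DATE value cannot be parsed as a date.
df = pd.DataFrame({
    "DATE": ["2021-01-04", "2021-01-05", "not a date"],
    "PRICE": [47.6, 49.9, 50.6],
})
print(df["DATE"].dtype)  # object

# errors="coerce" assigns NaT instead of raising an error.
df["N_Date"] = pd.to_datetime(df["DATE"], errors="coerce")
print(df)
```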

Line 1
• Delete the DATE column that was the existing object data type.

Line 1
• Designate the newly created datetime64 column as the index of the data frame.

Now, the dataframe is very easy to index or slice in chronological order because it supports the time
series index class.

Line 1
• This completes the data frame indexed by DatetimeIndex.
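The remaining steps (dropping the old DATE column and promoting N_Date to the index) can be sketched as follows. The sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["2021-01-04", "2021-01-05", "2021-01-06"],
    "PRICE": [47.6, 49.9, 50.6],
})
df["N_Date"] = pd.to_datetime(df["DATE"])

df = df.drop(columns="DATE")   # remove the old object-typed column
df = df.set_index("N_Date")    # use the datetime64 column as the index

print(type(df.index).__name__)  # DatetimeIndex
```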


6.2. Slicing using DatetimeIndex
In the previous lesson, we learned how to slice specific rows and columns of a dataframe. Here we see how
easy and convenient it is to slice time series data.

Line 1
• Select a specific date, and you can slice out only the data you want very conveniently. Isn't it very
intuitive?
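A small sketch of date-based slicing on a DatetimeIndex, using made-up prices:

```python
import pandas as pd

prices = pd.DataFrame(
    {"PRICE": [47.6, 49.9, 52.2, 61.7]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-02-01", "2021-03-01"]),
)

january = prices.loc["2021-01"]                # every row in January 2021
span = prices.loc["2021-01-05":"2021-02-01"]   # both endpoints are included

print(january)
print(span)
```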


7. Creating Time Series Data


7.1. Create time series of specific frequency from Timestamp array
Time series data can be created not only in steps of a single unit but also at a specific time interval,
that is, with a specific frequency.
You can use the freq parameter of pd.date_range() to create a time series with the frequency you want.
The default is one calendar day.
The principle is similar to creating a sequence of numbers with Python range().

pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None,
normalize=False, name=None, closed=None, **kwargs)

start: the start of date range, end: the end of date range, periods: the number of
timestamps to be created

For more details on freq, refer to
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

Representative values of freq:

‣ D: calendar day frequency
‣ B: business day frequency
‣ W: weekly frequency
‣ M: month end frequency
‣ MS: month start frequency
‣ Q: quarter end frequency

• Since periods is 5, five Timestamps are created.

Line 2
• With a naive index, where no time zone is set, count the number of timestamps created.

Line 2
• In case the time zone is set to Seoul

Line 2
• Using a month-end frequency as the freq parameter

Line 2
• Using a month-start frequency as the freq parameter
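The date_range() variations in this section can be sketched as follows. The start date is arbitrary, and the month-end case is left out because its alias is spelled differently across pandas versions.

```python
import pandas as pd

# Five daily timestamps, no time zone (naive); "D" is the default frequency.
five_days = pd.date_range(start="2021-01-01", periods=5)

# The same range with the time zone set to Seoul.
seoul = pd.date_range(start="2021-01-01", periods=5, tz="Asia/Seoul")

# Five timestamps at month-start frequency.
month_starts = pd.date_range(start="2021-01-01", periods=5, freq="MS")

print(len(five_days), five_days.tz)
print(seoul)
print(month_starts)
```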


7.2. Create time series of specific frequency from Period array
You can create time series data containing multiple periods and frequencies with period_range() . It
returns PeriodIndex as a result.
pandas.period_range(start=None, end=None, periods=None, freq=None, name=None)
https://pandas.pydata.org/docs/reference/api/pandas.period_range.html

Line 3, 4
• 3: Start of date range
• 4: End of date range

Line 5, 6
• 5: Number of periods to generate
• 6: Frequency, that is, the length of each period

Line 8
• The label of the index is period

Line 2
• Look at the data carefully. The actual start and end of each month are determined automatically.
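A minimal sketch of period_range() and the start and end of each generated period. The start month and the number of periods are arbitrary choices.

```python
import pandas as pd

# Three consecutive monthly periods starting in January 2021.
months = pd.period_range(start="2021-01", periods=3, freq="M")
print(months)

# Each period knows when its month actually starts and ends.
for p in months:
    print(p, p.start_time.date(), p.end_time.date())
```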


8. Moving Statistics Function (Rolling Window Calculations)


Pandas provides functions that make it easy to calculate moving statistics (rolling windows) for
DataFrame and Series.
The term “window” refers to the period or section of data over which a statistic is calculated.
The window moves automatically by the specified interval, the statistic is calculated at each position,
and the process is applied across the entire time series.


The moving average is the most widely used method in the analysis of financial time series data. It
smooths the short-term volatility of the target data so that long-term trends can be analyzed.

https://docs.wavefront.com/query_language_windows_trends.html

TIP
 Smoothing means removing the small fluctuations or discontinuities that noise introduces into the
data when sampling from large sample data.


DataFrame rolling function: pandas.DataFrame.rolling()
‣ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
Series rolling function: pandas.Series.rolling()
‣ https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html?highlight=series%20rolling#pandas.Series.rolling

Representative methods
• .rolling().mean(): Average in window
• .rolling().std(): Standard deviation in window
• .rolling().var(): Variance in window
• .rolling().sum(): Total in window
• .rolling().min(): Minimum value in window
• .rolling().max(): Maximum value in window
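To see how the window moves, here is a tiny sketch on five made-up values with a window of 3. The first two results are NaN because a full window is not yet available.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

window = s.rolling(window=3)
print(window.mean().tolist())  # [nan, nan, 2.0, 3.0, 4.0]
print(window.sum().tolist())   # [nan, nan, 6.0, 9.0, 12.0]
print(window.max().tolist())   # [nan, nan, 3.0, 4.0, 5.0]
```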


Let's practice the moving average concept using stock price data as an example.

Line 6, 7, 10
• 6: It is a Korean stock data set provided by Yahoo Finance in datareader.
• 7: If you want to practice with a wide window, it is better to specify a longer period.
• 10: Ticker code. Samsung

Line 11, 12, 13


• 11: Data specified from Yahoo Finance.
• 12: Start date of search, use timestamp stored in start variable
• 13: Search end date, using the timestamp stored in the end variable

Line 1
• You can see that this data set is already in DatetimeIndex.

Line 1
• It is easy to calculate the moving average data over 5 days.


Let's visualize each moving average data.

Line 1, 3
• 1: To compare with the moving average in the graph, only the closing price is stored separately.
• 3: Moving average over a 5-day window

Line 1, 3, 4, 5, 6, 8, 10, 11, 12, 13, 14

• 1: To compare with the moving average in the graph, only the closing price is stored separately.
• 3: Moving average over a 5-day window
• 4: Moving average over a 10-day window
• 5: Moving average over a 60-day window
• 6: Moving average over a 120-day window
• 8: Set the size of the chart
• 10: Express the closing price data as a line graph and name the label close. Since it is the
reference data, it is drawn a little thicker.
• 11: Express the 5-day moving average data as a line graph and name the label 5 window.
• 12: Express the 10-day moving average data as a line graph and name the label 10 window.
• 13: Express the 60-day moving average data as a line graph and name the label 60 window.
• 14: Express the 120-day moving average data as a line graph and name the label 120 window.
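The moving-average columns described above can be sketched without downloading anything. The closing prices here are synthetic (an assumption standing in for the Yahoo Finance data), and the plotting calls are left out.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices on 250 business days (made-up values).
index = pd.date_range("2021-01-01", periods=250, freq="B")
close = pd.Series(np.linspace(80000, 90000, len(index)), index=index, name="Close")

# One column per window size, labeled as in the notes above.
averages = pd.DataFrame({
    "close": close,
    "5 window": close.rolling(5).mean(),
    "10 window": close.rolling(10).mean(),
    "60 window": close.rolling(60).mean(),
    "120 window": close.rolling(120).mean(),
})
print(averages.tail())
```

Calling averages.plot() would draw all five lines on one chart, assuming matplotlib is installed.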


The figure below is the result graph.

UNIT Time Series Data
38.

Let’s code


Step 1
Data acquisition and data frame transformation

Line 8
• https://ourworldindata.org/fish-and-overfishing

Line 1
• Let's check the data type of each column of the data frame.


Step 2
Preprocessing including data cleaning

Line 1
• Delete unnecessary columns. Let's delete the code column because the statistics we want to compute
do not need it.

# Search NaN data. You can check the number of missing data for each column.

Line 1~3
• 1: Create a variable with the data you want to replace
• 2: Replace NaN with 0 so that it does not affect the sum statistic.
• 3: Checking the result confirms that all NaN values were replaced with 0 and that no NaN values
remain.
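A small sketch of counting and replacing missing values. The column names and numbers are assumptions, not the actual data set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Capture": [10.0, np.nan, 30.0],
    "Aquaculture": [np.nan, 2.0, 3.0],
})

print(df.isna().sum())        # number of NaN values per column

fill_values = {"Capture": 0, "Aquaculture": 0}  # replacement for each column
df = df.fillna(fill_values)

print(df.isna().sum())        # all zeros after the replacement
```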


Step 2
Change to time series data type and replace index

Line 1, 2
• 1: Year is an int64 column. Converting it directly to datetime causes problems, so the data type
must be changed first.
• 2: Specify the new_Year column, now in time series format, as the index.
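One way to do this conversion, sketched on made-up rows: cast the integer year to a string and parse it with an explicit format. Whether the original code used exactly this approach is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Entity": ["Japan", "Japan", "Japan"],   # hypothetical column name
    "Year": [1960, 1961, 1962],              # int64 years
    "Capture": [5.9, 6.3, 6.5],              # made-up values
})

# int64 -> str -> datetime64, reading the text as a four-digit year.
df["new_Year"] = pd.to_datetime(df["Year"].astype(str), format="%Y")
df = df.set_index("new_Year")

print(df.index)
```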

Line 3, 5
• 3: Delete the existing Year column because it is no longer needed.
• 5: Confirm the above processing result.

Line 1
• The result confirms that the index has been changed to a DatetimeIndex.

Line 1
• Sort the data frame by the index that was set.

Line 2
• You can see that the data type for the country name has been changed from object to categorical.


Step 3

Line 1
• Let's check how many unique data there are in the column with country name. There are data for
a total of 264 countries.

By adding up each country's catch and aquaculture production by year and visualizing the result, we can
see trends in global catch and aquaculture.
The data contains separate rows for each country and year. To get world totals, group the rows by year
with the pandas group operation, calculate the sum over all countries for each year, and visualize the
result.

Line 1
• The rows are grouped by the new_Year column and the result is stored in a new object called g.

Line 1
• If you check the saved result, you can see that the data frame was created based on the year.

Line 1
• Print the contents of the g object using a loop.


Step 4

Line 1
• Through group operations for each created group, the sum of each year is obtained and a new
data frame is created.
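The grouping and summing steps can be sketched on a tiny made-up table with two countries and two years. The column names are assumptions; new_Year and g follow the names used in the notes.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Entity": ["Japan", "Norway", "Japan", "Norway"],
        "Capture": [10.0, 20.0, 11.0, 21.0],
        "Aquaculture": [1.0, 2.0, 3.0, 4.0],
    },
    index=pd.DatetimeIndex(
        ["2000-01-01", "2000-01-01", "2001-01-01", "2001-01-01"], name="new_Year"
    ),
)

g = df.groupby(level="new_Year")   # one group per year

for year, rows in g:               # inspect the groups with a loop
    print(year.year, len(rows), "rows")

world = g[["Capture", "Aquaculture"]].sum()   # yearly totals over all countries
print(world)
```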

Visualize global data

‣ Compare the graph we created with the graph published by Our World in Data. You can produce a
result comparable to what experts process and visualize.

Line 1, 2, 3, 4, 5, 7, 8
• 1: Specify the style of the graph
• 2: Draw an area graph, in which the region under each line is filled with color.
• 3: Adjust the opacity of the color to increase the visibility of overlapping graphs.
• 4: Select the option with False
• 5: Specify the size of the graph
• 7: Show legend
• 8: The results show that, starting around 2010, the world has gradually been increasing aquaculture
production rather than the amount caught.


Step 5
Let's search for a specific country name and visualize it. You can do this easily if you remember the
practice of searching for artist names in the DataFrame lecture.

Line 1~3
• 1: Create a series of the non-duplicate names in the country column to search for a country.
• 2: Print only non-duplicate country names.
• 3: Checking the data type of the result shows that it has been successfully converted into a
series.
Line 1
• If it is True when the country name you want is searched, it means that there is data for the
country.

Search for 3 or more countries with different cultural and geographic conditions, visualize them, and
draw insights from the data.

Line 1~3
• 1~3: For each of the three countries, search for the country name you want in the series data
created earlier and create a separate data frame.
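The search-and-filter steps can be sketched as follows. The Entity column name, the countries, and the values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Entity": ["Japan", "Norway", "Peru", "Japan", "Norway", "Peru"],
    "Capture": [10.0, 20.0, 30.0, 11.0, 21.0, 31.0],
})

# Non-duplicate country names as a Series.
countries = pd.Series(df["Entity"].unique())
print(type(countries).__name__, countries.tolist())

# True means the data set contains rows for that country.
print((countries == "Japan").any())

# One separate data frame per country of interest.
japan = df[df["Entity"] == "Japan"]
norway = df[df["Entity"] == "Norway"]
peru = df[df["Entity"] == "Peru"]
print(japan)
```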

Unit 38.

Time Series Data

Pair programming

Pair programming UNIT
38

Pair Programming Practice


Guideline, mechanisms & contingency plan
Preparing pair programming involves establishing guidelines and mechanisms to help students pair
properly and to keep them paired. For example, students should take turns “driving the mouse.”
Effective preparation requires contingency plans in case one partner is absent or decides not to
participate for one reason or another. In these cases, it is important to make it clear that the active
student will not be punished because the pairing did not work well.

Pairing similar, not necessarily equal, abilities as partners


Pair programming can be effective when students of similar, though not necessarily equal, abilities
are paired as partners. Pairing mismatched students often can lead to unbalanced participation.
Teachers must emphasize that pair programming is not a “divide-and-conquer” strategy, but rather a
true collaborative effort in every endeavor for the entire project. Teachers should avoid pairing very
weak students with very strong students.
Motivate students by offering extra incentives
Offering extra incentives can help motivate students to pair, especially with advanced students.
Some teachers have found it helpful to require students to pair for only one or two assignments.

Pair programming UNIT
38

Pair Programming Practice


Prevent collaboration cheating
The challenge for the teacher is to find ways to assess individual outcomes while leveraging the
benefits of collaboration. How do you know whether a student learned or cheated? Experts
recommend revisiting course design and assessment, as well as explicitly and concretely discussing
with students which behaviors will be interpreted as cheating. Experts encourage teachers to make
assignments meaningful to students and to explain the value of what students will learn by
completing them.
Collaborative learning environment
A collaborative learning environment occurs anytime an instructor requires students to work together
on learning activities. Collaborative learning environments can involve both formal and informal
activities and may or may not include direct assessment. For example, pairs of students work on
programming assignments; small groups of students discuss possible answers to a professor’s question
during lecture; and students work together outside of class to learn new concepts. Collaborative
learning is distinct from projects where students “divide and conquer.” When students divide the
work, each is responsible for only part of the problem solving and there are very limited opportunities
for working through problems with others. In collaborative environments, students are engaged in
intellectual talk with each other.

Pair programming UNIT
38

Q1. With your learning colleagues, discuss and practice with the data you used in the key concept
section of this lecture, as shown below.

Change it to another company's data.


Try slicing based on the learning date.
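
As a starting point for the slicing task, the sketch below selects a date range from a frame indexed by dates. The `close` column, the dates, and the prices are all made up; substitute the company data you actually load.

```python
import pandas as pd

# Made-up daily closing prices indexed by date (not real market data).
dates = pd.date_range("2022-01-03", periods=6, freq="D")
prices = pd.DataFrame(
    {"close": [100.0, 101.0, 99.0, 102.0, 103.0, 104.0]},
    index=dates,
)

# Label-based slicing on a DatetimeIndex includes both the start and the end date
subset = prices.loc["2022-01-04":"2022-01-06"]
print(subset)
```

Unlike positional slicing with integers, date-label slicing with `.loc` keeps the end date in the result.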

Pair programming UNIT
38

Q2. Access the University of California, Irvine, one of the open dataset sources used in the previous
lecture, Data Organization Learning, and explore together with your learning colleagues which data
is well suited to taking advantage of time series data analysis.

End of
Document

ⓒ2022 SAMSUNG. All rights reserved.
Samsung Electronics Corporate Citizenship Office holds the copyright of this book.
This book is a literary property protected by copyright law so reprint and reproduction without permission are prohibited.
To use this book other than the curriculum of Samsung Innovation Campus or to use the entire or part of this book, you must receive written
consent from copyright holder.
