SIC - C - P - Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization - v1.0
Innovation
Campus
Coding and Programming
Chapter 7.
Chapter objectives
Learners will be able to collect various types of large amounts of data and organize them in a
form that can be analyzed.
Learners will be able to generate various descriptive statistics for organized data using Pandas.
Learners will be able to visualize data using Python visualization libraries.
Chapter contents
Samsung Innovation Campus Chapter 7. Data Processing, Descriptive Statistics, and Data Visualization 3
Unit 34.
Unit learning objective (1/3) UNIT
34
Learning objectives
Learners will be able to explain how a module groups classes, functions, variables, and
executable code, and why modules are needed.
Learners will be able to read a module's documentation and work out how to use the
functions, classes, and parameters that the module contains.
Learners can select and use the import, from, and as statements in modules appropriately
according to need.
Learners will be able to generate as many random integers and real numbers as they want using
the random module when they encounter situations where unspecified test data is needed.
Learners will be able to convert and generate values for a specific date and time using a
date and time module.
Learning overview
Keywords
Standard Module
import
External Module
Unit 34. Using Python Modules
Mission
Mission UNIT
34
Unit 34. Using Python Modules
Key concept
Key concept UNIT
34
1. Modules
1.1. What is a Module?
A module lets Python code be logically grouped, managed, and reused. Normally, one
Python .py file is one module. A module may define functions, classes, and variables, and
may also include executable code.
Simply put, it is a code file.
Modules are divided into standard modules and external modules. Standard modules are those
built into Python. Modules created by third parties are called external modules.
‣ Standard Module: A module built into Python.
‣ External Module: A module made by a third party.
To use a module, import it. The import statement can bring one or more modules into the
code, in the following forms.
‣ import ModuleName: when only one module is brought in for use.
‣ import ModuleName1, ModuleName2: when multiple modules are brought in.
After importing a module with ‘import’, access a specific variable, function, or class of the
module by writing the module name, a dot, and then the name.
‣ ModuleName.Variable
‣ ModuleName.FunctionName()
‣ ModuleName.Class()
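The forms above can be tried with standard modules. The following is a minimal sketch; the choice of math, random, and time and the seed value 42 are illustrative assumptions, not taken from the slides.

```python
import math               # import a single module
import random, time       # import several modules in one statement

print(math.pi)                   # ModuleName.Variable
print(math.sqrt(16))             # ModuleName.FunctionName()
generator = random.Random(42)    # ModuleName.Class() creates an object
print(generator.randint(1, 6))   # a method of that object
print(time.time() > 0)           # a function from the second imported module
```

Writing one import per line is the more common style, but both forms import the same modules.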
1.2. Standard Module, Standard Library
Standard modules are installed together with Python. There is no need to memorize what is in
the standard library, because it can be looked up with a search engine or in the Python standard
library documentation at https://docs.python.org/3/library/.
Since the standard library is built in, it does not require a separate installation process and can
be used immediately by simply importing it.
Line 1
• The math module, one of the standard library modules, provides math-related functions.
Line 1~2
• 1: A function is used in the form ModuleName.FunctionName().
• 2: An example of using sin(x), one of the functions of the math module, which returns the sine
of x.
To find out what functions the math module has besides math.sin(x), visit
https://docs.python.org/3/library/math.html. In this way, you can learn the basic usage and look
up what you need by searching.
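A minimal sketch of the math example described above. The original code screenshot is not available, so the variable names here are assumptions.

```python
import math

x = 1
result = math.sin(x)   # ModuleName.FunctionName(); x is an angle in radians
print(result)
```

Note that math.sin expects radians. Use math.radians(degrees) to convert an angle given in degrees.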
1.3. External Module, External Library
Since external modules are created by other people (for example, as open-source libraries), they
must be installed before they can be used in code.
The recommended and safe way to install an external library is to use pip. Since Python 3.4, pip
has been included by default in the Python binary installers, so it is easy to use.
pip is a utility that installs packages from PyPI (Python Package Index), a widely used Python
package index.
Ex If you want a module that has a particular feature, search PyPI for candidate libraries and
then install one with pip.
Visit https://pypi.org/ to search for the functionality you need.
Examine the library's detail page and, if it provides the desired functionality, copy the pip
install command shown under the module name.
pip install chart
Run Anaconda Prompt, activate the virtual environment you are currently using, and then install
the library. The installation command used in this example is pip install chart.
Line 1, 3
• 1: Import the external library you installed with the import statement.
• 3: Use histogram(x), one of the chart library functions, which outputs a histogram chart.
If you run the code and get an error message like the one below, read the message carefully. It
says that the module cannot be found, which in this case means that the library is not installed.
Beginners make this mistake more often than expected, and a common cause is that the library
was not installed in the virtual environment currently in use.
Let’s practice searching for, installing, and using a library that outputs emoticons.
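The "module cannot be found" situation can also be detected in code. The sketch below is one possible approach, not the course's own code; the helper name try_import and the module name surely_not_installed_module_xyz are hypothetical and chosen only for illustration.

```python
import importlib

def try_import(module_name):
    """Import a module by name; return None and print a hint if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        # err reads like: No module named 'surely_not_installed_module_xyz'
        print(f"{err}. Install it in the active environment, e.g. pip install {module_name}")
        return None

missing = try_import("surely_not_installed_module_xyz")  # hypothetical name, not installed
math_module = try_import("math")                         # standard module, always available
print(missing, math_module.sqrt(9))
```

If the message appears for a library you believe you installed, check that the virtual environment you installed it into is the one running the code.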
1.4. Make Your Own Module and Bring It Up For Use
A Python .py file can be imported and used as a module, and the entire script in the module file
can also be executed directly. Let’s practice importing modules by calculating hospital
funds.
We sell event tickets to raise funds for hospitals.
Each individual participating in the event pays 5t + 3.
t is the number of tickets purchased by one person.
Line 1
• Save this file as fund_cal.py.
("Enter the total number of people who participated in the hospital donation event"))
Line 1
• This imports and uses the fund_cal.py file, the module created and saved above. Be careful not
to write the .py extension of the file name.
Ex An example in which an external file (*.py) is imported as a module and only specific
functions are selected and called.
# Store this file under the name cal_n.py
Line 1, 3, 4, 6, 12, 13
• 1: Store this file under the name cal_n.py.
• 3: An algorithm that calculates the sum of all the numbers from 1 to the input n.
• 4: The two slashes (//) perform integer division by two.
• 6: An algorithm that adds all the consecutive numbers from 1 to the input n.
• 12: To check whether the algorithm above works, it is verified with print(sum_n(10)). In this
example, however, the line is commented out as unnecessary code, because the file is used to
practice importing the module.
• 13: print(sum_n2(100))
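The original cal_n.py listing is not reproduced in the text, so the following is a reconstruction under assumptions: sum_n is assumed to use the closed-form formula with //, and sum_n2 is assumed to use a loop. Only the function names and the file name come from the annotations above.

```python
# Store this file under the name cal_n.py

def sum_n(n):
    """Sum of 1..n using the formula n(n+1)/2."""
    return n * (n + 1) // 2   # // is integer division, so the result stays an int

def sum_n2(n):
    """Sum of 1..n by adding the consecutive numbers one at a time."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

# Verification calls, commented out so nothing is printed when the module is imported:
# print(sum_n(10))
# print(sum_n2(100))
```

An alternative to commenting the checks out is to place them under `if __name__ == "__main__":`, so they run only when the file is executed directly and not when it is imported.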
Line 1
• Note that the .py extension is not written when importing a file as a module.
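To see the whole round trip in one runnable script, the sketch below writes a tiny cal_n.py into a temporary folder and then imports it by name without the extension. The temporary folder and the sys.path handling are assumptions made only so the example is self-contained; in the course setting you would simply save cal_n.py next to your script and write import cal_n.

```python
import importlib
import pathlib
import sys
import tempfile

# Source text of a one-function module (a shortened stand-in for cal_n.py).
module_source = "def sum_n(n):\n    return n * (n + 1) // 2\n"

with tempfile.TemporaryDirectory() as folder:
    pathlib.Path(folder, "cal_n.py").write_text(module_source, encoding="utf-8")
    sys.path.insert(0, folder)                     # let Python search this folder for modules
    try:
        cal_n = importlib.import_module("cal_n")   # same effect as: import cal_n (no .py)
        total = cal_n.sum_n(10)
    finally:
        sys.path.remove(folder)                    # restore the module search path

print(total)
```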
1.5. Python Syntax: From
A module contains many variables and functions, and it is extremely rare to use all of
them. When you want to use only a specific function in the module, you can use the ‘from’ syntax
in the following format.
• from Module import FunctionName
• from Module import ClassName
• from Module import VariableName
If we apply it to the math.sin(1) we practiced earlier, it looks like the following.
Line 2
• The function can be called by its name alone, without writing the module name ‘math’ in
front of it.
Several variables or functions from a module can also be imported at once in the following
format.
Line 1
• Several function names can be imported at once, separated by ‘ , ’.
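A minimal sketch of the from syntax with the math module. Choosing sin, cos, and pi as the imported names is an assumption for illustration.

```python
from math import sin, cos, pi   # import several names at once, separated by commas

print(sin(1))        # called by function name only, without the math. prefix
print(cos(0))
print(sin(pi / 2))
```

Only the listed names become available. Writing math.sin(1) here would fail with a NameError, because the name math itself was never imported.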
If it is inconvenient to write the module name in front every time, you can write code using
only the function names. All the functions of a module can be imported with the form ‘from
ModuleName import *’.
from math import *
Line 1
• Even though the function sin is not named after import, it can be called by its function
name alone.
Line 1
• You can round down with floor() without naming the function ‘floor’ after import.
Line 1
• You can round up with ceil() without naming the function ‘ceil’ after import.
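A short sketch of the wildcard form with floor and ceil. The input values 3.7 and 3.2 are assumptions, since the original screenshots are not available.

```python
from math import *   # brings every public name of math into the current namespace

print(floor(3.7))    # rounds down to 3
print(ceil(3.2))     # rounds up to 4
print(sin(0))        # sin was not named after import, but is available too
```

The wildcard form is convenient for short experiments, but it can silently overwrite names you defined yourself (for example a variable called e or pi), so larger programs usually import names explicitly.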
1.6. Python Syntax: As
A module name can be long, which makes the code cumbersome to write. Names can also
collide when installing and using external libraries. In these cases, give the library another
name with the as syntax, typically a short alias.
When describing ‘from’, it was explained that several variable or function names can be
imported at once using ‘ , ’. Each of them can also be renamed with ‘as’.
from Module import Variable as Name1, Function as Name2, Class as Name3
Line 1
• Several names can be imported one after another, each renamed to an abbreviation with ’as’.
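A minimal sketch of both uses of as. The aliases dt, root, and PI are arbitrary names chosen for illustration.

```python
import datetime as dt                      # alias for a whole module
from math import sqrt as root, pi as PI    # aliases for individual names

print(dt.date(2024, 1, 1))   # the module is now reached through its alias
print(root(16))
print(PI)
```

After aliasing, only the alias is defined: datetime.date(...) or sqrt(16) would raise a NameError in this script.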
Line 4, 5
• 4: random() randomly generates a float greater than or equal to 0 and less than 1.
• 5: Each time it is executed, a random number between 0 and 1 is returned. A different number is
returned on each of the 10 loop iterations.
Line 4
• Six samples are randomly extracted from a collection called ‘data’ and stored in the
variable a.
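Extracting six samples without repetition matches the behavior of random.sample in the standard library. The sketch below assumes that function was used; the contents of data are hypothetical.

```python
import random

data = list(range(1, 46))     # hypothetical collection: the integers 1 to 45
a = random.sample(data, 6)    # six distinct elements chosen at random
print(a)
```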
Line 3, 4
• 3: A tuple collection made of dog names.
• 4: A random element is extracted and stored in the variable my_lovely_dog.
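Picking one element at random from a sequence is what random.choice does. The sketch below assumes that function was used; the dog names are made up, and only the variable name my_lovely_dog comes from the text.

```python
import random

dogs = ("Max", "Bella", "Coco", "Rocky")   # hypothetical tuple of dog names
my_lovely_dog = random.choice(dogs)        # one element chosen at random
print(my_lovely_dog)
```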
Returns the next random floating-point number in the range [0.0, 1.0), that is, 0.0 <= N < 1.0.
Returns a random float N such that a <= N <= b if a <= b, and b <= N <= a if b < a.
The end value b may or may not be included in the range, depending on floating-point
rounding in a + (b-a) * random().
Simply put, it returns a random float in the range where a is the minimum value and b is the
maximum value.
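The description above matches random.uniform(a, b) in the standard library, so this sketch assumes that is the function being discussed. The bounds 1.5 and 6.5 are arbitrary.

```python
import random

low, high = 1.5, 6.5
n = random.uniform(low, high)   # a <= N <= b when a <= b
m = random.uniform(high, low)   # the arguments may also be given in reverse order
print(n, m)
```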
Returns a random integer N between a and b, inclusive (a <= N <= b), where a is the minimum and b the maximum value.
Line 3
• Returns an integer between 1 and 100 and stores it in variable x.
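A sketch assuming the function described is random.randint(a, b), which includes both end points. The list of 1000 rolls is an addition that makes the inclusive range easy to check.

```python
import random

x = random.randint(1, 100)    # an integer N with 1 <= N <= 100
print(x)

rolls = [random.randint(1, 6) for _ in range(1000)]
print(min(rolls), max(rolls))  # with this many rolls, 1 and 6 both appear almost surely
```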
Returns a randomly selected integer from the sequence that runs from the start value up to, but not including, the stop value in increments of step.
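The description above corresponds to random.randrange(start, stop, step). The sketch assumes that function; the ranges used are arbitrary.

```python
import random

odd = random.randrange(1, 20, 2)   # one of 1, 3, 5, ..., 19 (stop value 20 is excluded)
digit = random.randrange(10)       # with one argument: one of 0, 1, ..., 9
print(odd, digit)
```

Unlike randint(a, b), randrange never returns the stop value, in the same way that range(start, stop) excludes stop.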
TIP
A module that generates random numbers should not be used for security purposes. For security or
encryption, it is recommended to use the secrets module.
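A brief sketch of the secrets module mentioned in the tip. The token length, the PIN range, and the color list are illustrative assumptions.

```python
import secrets

token = secrets.token_hex(16)        # 16 random bytes shown as 32 hexadecimal characters
pin = secrets.randbelow(10_000)      # an integer N with 0 <= N < 10000
color = secrets.choice(["red", "green", "blue"])
print(token, pin, color)
```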
Line 5
• While the loop is running, execution pauses for 2 seconds at this line.
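A sketch of pausing inside a loop with time.sleep. The slide pauses for 2 seconds; the delay is shortened to 0.2 seconds here, and the loop body is an assumption, so that the example finishes quickly.

```python
import time

start = time.perf_counter()
for count in range(3):
    print("working...", count)
    time.sleep(0.2)              # execution stops here; use time.sleep(2) for 2 seconds
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.1f} s")
```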
Line 3, 4
• 3: A function that gets the current time. It returns the time elapsed since 00:00:00 on
January 1st, 1970, in seconds. However, the return value is a real number that is difficult to
read.
• 4: A function that returns the time in a form that we can understand.
Unix timestamp
‣ It is also called Epoch time. The elapsed time from 00:00:00 (UTC) on January 1st, 1970 is converted
into seconds and expressed as an integer.
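A sketch of the two steps described above. time.time() is the standard call for the seconds since the epoch; which function the slide used for the readable form is not stated, so time.ctime() is an assumption.

```python
import time

now = time.time()            # seconds since 1970-01-01 00:00:00 UTC, as a float
print(now)
print(int(now))              # the integer part is the usual Unix timestamp
print(time.ctime(now))       # a readable local-time string
```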
When the %y format code is used, a two-digit year can be parsed. Values 69 to 99 are mapped to
1969 to 1999, and values 0 to 68 are mapped to 2000 to 2068.
‣ https://docs.python.org/3/library/time.html?highlight=time#time.strftime
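The two-digit year rule can be checked with time.strptime, and time.strftime formats a time with the same kind of codes. The format string in the last line is an illustrative choice.

```python
import time

print(time.strptime("69", "%y").tm_year)   # 69 is read as 1969
print(time.strptime("68", "%y").tm_year)   # 68 is read as 2068
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))   # current local time
```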
Line 3, 6
• 3: Gets the current date and stores it in the variable today.
• 6: Extracts the year.
Line 7, 8
• 7: Extracts the month.
• 8: Extracts the day.
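A sketch of the steps above, assuming the slide used datetime.date.today(). Only the variable name today comes from the text.

```python
import datetime

today = datetime.date.today()   # the current local date
print(today)
print(today.year)               # year as an integer
print(today.month)              # month, 1 to 12
print(today.day)                # day of the month
```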
Line 3, 6, 7
• 3: Year, month, day, hour, minute, second
• 6: Extracts the year.
• 7: Extracts the month.
Line 8, 9, 10, 11
• 8: Extracts the day.
• 9: Extracts the hour.
• 10: Extracts the minute.
• 11: Extracts the second.
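A sketch of reading each part of the current date and time, assuming the slide used datetime.datetime.now(). The variable name now is an assumption.

```python
import datetime

now = datetime.datetime.now()   # year, month, day, hour, minute, second (and microsecond)
print(now)
print(now.year, now.month, now.day)
print(now.hour, now.minute, now.second)
```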
Line 3
• A datetime object can be created by passing the year, month, day, hour, minute, second, and
microsecond to datetime.datetime.
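A minimal sketch of creating a datetime object from explicit values. The specific date and time are arbitrary.

```python
import datetime

# Arguments: year, month, day, hour, minute, second, microsecond
moment = datetime.datetime(2024, 5, 17, 9, 30, 15, 250000)
print(moment)
print(moment.microsecond)
```

Only year, month, and day are required. The time parts default to 0 when omitted.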
Line 7
• Change the month to December and the day to the 30th.
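One way to change individual fields is the replace() method of datetime objects, which returns a new object. Whether the slide used replace() is an assumption, and the starting date is arbitrary.

```python
import datetime

start = datetime.datetime(2024, 5, 17, 9, 30, 15)
changed = start.replace(month=12, day=30)   # new object; start itself is unchanged
print(start)
print(changed)
```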
Unit 34. Using Python Modules
Paper coding
Try to fully understand the basic concepts before moving on to the next step.
A lack of understanding of the basic concepts will increase your burden in learning this
course and may cause you to fail it.
It may be difficult now, but to complete this course successfully, we suggest that
you fully understand the concepts before moving on to the next step.
Paper coding UNIT
34
Q1.
Randomly select three multiples of 5 from the range 0 to 100 using the random module and
print them in the form of a list.
Write the entire code and the expected output results in the
note.
Q2.
Use timedelta to create a program that prints the 100-day anniversary of a special day of
yours. It doesn’t have to be a 100-day anniversary, so feel free to make your own special
anniversary calculator.
Write the entire code and the expected output results in the
note.
Unit 34. Using Python Modules
Let’s code
Let’s code UNIT
34
STEP 2
Ask the user to press any key to start the dice game.
STEP 3
We need to generate random numbers for rolling the dice. Import the random module from the
Python standard library by writing the import line at the very top of the code.
STEP 5
Add random numbers between 1 and 6 to the list of dice. Generate as many numbers as the
number of dice given as a parameter. When all the required random numbers have been appended
to the list, return the list of dice.
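The step above can be sketched as follows. The function name roll_dice(n) appears later in the text; the body is one possible implementation, not the course's reference solution.

```python
import random

def roll_dice(n):
    """Return a list with n random dice values, each between 1 and 6."""
    dice = []
    for _ in range(n):
        dice.append(random.randint(1, 6))   # both 1 and 6 are possible
    return dice

print(roll_dice(5))
```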
STEP 9
Define a function named “roll_again(choices, dice_list)” two lines after the function definition of
roll_dice(n). Its parameters are the choices of either the user or the computer and the list of dice.
Display a message, and pause the program for 3 seconds to wait for the computer to roll the dice. Refer
to the side note on the next slide for using the sleep( ) function of the time module.
Important
time.sleep(secs)
‣ Pauses your Python program for secs seconds. Here, secs is the number of seconds that the
Python code should pause execution, and the argument can be either an integer or a
float.
STEP 12
Now, it’s the computer’s turn to roll a dice. Call the function roll_dice(n) with number_dice as a
parameter. Name the list of dice returned by the function as “computer_rolls.” Display the computer’s
first roll on the screen.
# step 5 - computer's turn to roll
TIP
STEP 16
Now, it’s the computer’s turn to roll a dice. Call the function roll_dice(n) with number_dice as a
parameter. Name the list of dice returned by the function as “computer_rolls.” Display the computer’s
first roll on the screen.
Important
sum(iterable, start)
‣ Returns the sum of all items in an iterable. Here, iterable is a required parameter: the
sequence, such as a list, of numbers to sum. start is an optional parameter: a value that is
added to the result (0 if omitted).
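A minimal sketch of sum() with and without the start argument. The dice values in computer_rolls are made up for illustration.

```python
computer_rolls = [3, 5, 1, 6, 2]        # hypothetical dice values

print(sum(computer_rolls))              # 17: the sum of the items
print(sum(computer_rolls, 10))          # 27: the same sum plus the start value 10
```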
Unit 34. Using Python Modules
Pair programming
Pair programming UNIT
34
Q1. Complete the creative artwork by looking at the document with your colleagues.
The sample code below draws an artwork using turtle graphics. turtle is also one of Python’s
standard libraries and is very easy to use.
Descriptions of the Python turtle graphics module are available at the link below.
https://docs.python.org/3/library/turtle.html?highlight=turtle#module-turtle
Line 1, 6, 7, 10, 12, 15, 18, 21, 29
• 1: make a geometric rainbow pattern
• 6: turn background black
• 7: make 36 hexagons, each 10 degrees apart
• 10: make hexagon by repeating 6 times
• 12: pick color at position i
• 15: add a turn before the next hexagon
• 18: get ready to draw 36 circles
• 21: repeat 36 times to match the 36 hexagons
• 29: hide turtle to finish the drawing
When you select the Turtle Demo item in the Help menu of Python IDLE, the demo window
shown on the left opens. Go through the demos one by one with your pair programming
colleague and choose a demo you want to use to change the artwork.
Unit 35. Pandas Series for Data Processing
Unit learning objective (1/3) UNIT
35
Learning objectives
Be able to install the latest version of the pandas library in the current virtual environment
Be able to distinguish between a series and a dataframe when a pandas-based data set is given
Be able to select a specific data element from a series data set and perform arithmetic
operations
Learning overview
Learn how to install the pandas library and other basic dependent libraries in the virtual environment
Learn about two different types of pandas data structures
Learn how to create a series by using list, dictionary, numpy, and scalar values
Learn how to find a specific element in the series
Learn how to perform arithmetic operations within the series structure
Learn to visualize the series data set by using matplotlib
Keywords
Unit 35. Pandas Series for Data Processing
Mission
Mission UNIT
35
Line 1
• population_2020 is a dictionary object that stores the total population of each city in 2020.
Line 1
• population_2021 is a dictionary object that stores the total population of each city in 2021.
Line 1
• area is the area data of each city, and the unit is km².
Unit 35. Pandas Series for Data Processing
Key concept
Key concept UNIT
35
1. Introduction of Pandas
Developed in 2008 by Wes McKinney, pandas has been open source since 2009 as a library for data
processing.
pandas was originally designed for operating on and analyzing time series data in finance,
especially stock prices. Such operations required analytic tools for searching, indexing, refining,
arranging, reshaping, and slicing data, and pandas was developed as a solution.
Since becoming open source, pandas has come to support the following data-related functions
thanks to the efforts of many people. The list below was compiled by referring to
https://pandas.pydata.org/.
‣ Fast and efficient Series and DataFrame objects for manipulating data with integrated indexing
‣ Intelligent data alignment using indexes and labels
‣ Detection and integrated handling of missing data
‣ Conversion of unorganized data into organized data
‣ Built-in tools to read and write data between in-memory data structures and files, databases, and web services
‣ Ability to process data stored in CSV, Excel, and JSON formats
‣ Flexible pivoting and reshaping of data sets
‣ Slicing by index values, indexing, and subsetting of data sets
‣ Addition and deletion of data set columns
‣ A powerful data grouping tool with split-apply-combine functions to combine or transform data
‣ High-performance merging and joining of data sets
‣ Performance optimization through critical paths written in Cython or C
[Figure: Python Pandas Features. Labels: Cleaning up Data; Python support; Visualize; Merging and joining of datasets; Alignment and indexing; Unique data; Multiple file formats supported; Input and output tools; Mask data]
Series: Sequentially arranged one-dimensional array. Index (yellow) and data (white) are 1:1.
DataFrame: Two-dimensional array consisting of index and column (blue). Adding multiple series can make one dataframe.
2) There are many ways to install pandas, but this unit only covers downloading it from PyPI, in the
same way as the external library installation introduced previously.
4) After installation, check pandas.__version__ to see the installed pandas library version and
confirm that the installation completed properly.
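As a minimal sketch of the version check described above (the variable name installed_version is only illustrative):

```python
import pandas as pd

# __version__ is a string attribute, not a function call
installed_version = pd.__version__
print(installed_version)
```

If the import itself fails, pandas is not installed in the Python environment that is running the code.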
‣ On Linux/Mac, you can check in the terminal which Python installation is being used. Using
“/usr/bin/python” is not recommended because it is the system's default Python.
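The same check can also be done from inside Python. This is a small sketch using the standard library rather than the terminal command shown on the slide:

```python
import sys

# Absolute path of the interpreter that is running this code
interpreter_path = sys.executable
print(interpreter_path)
```

If the printed path points to the system Python, packages installed in another environment will not be visible here.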
conda install PKGNAME==VERSION
(PKGNAME is the package name, here pandas; VERSION is the version string, written as 3.1.4 on the slide)
5. Optional Dependencies
When using pandas, specific methods often need additional dependency packages. If they are not
installed, those methods fail. If you get an import error even though you called a method exactly as
shown in a book or lecture, check the dependency packages.
It is recommended to install the dependency packages you need before using pandas.
Ex read_hdf() requires the PyTables package, while DataFrame.to_markdown() requires the tabulate package.
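One way to check for optional dependencies before calling such methods is sketched below. The helper name is_installed is hypothetical; "tables" is the import name of the PyTables package.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return find_spec(module_name) is not None


# Report whether the two optional packages from the example are available
for name in ("tables", "tabulate"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```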
The following table shows some of the optional dependencies used in this unit. Install them before
continuing with this chapter.
For computation
- SciPy: Miscellaneous statistical functions
- numba: Alternative execution engine for rolling operations
6. Series
Series is the basic data structure of pandas.
It is similar to a NumPy array, but differs in that a series has an index. As shown in the figure below,
index values and data values correspond 1:1. The structure is therefore similar to a dictionary with
its {key : value} structure.
Each index value functions as the address of its data value.
index data
0 1
1 2
2 3
3 4
4 5
As the example above shows, you do not need to designate an index separately to create a series.
Pandas automatically generates an integer index that starts from 0 and increases by 1 for each data item.
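The index/data table above can be reproduced with a short sketch like this one (the variable name sr is an assumption):

```python
import pandas as pd

# No index is given, so pandas creates the integer index 0, 1, 2, ...
sr = pd.Series([1, 2, 3, 4, 5])
print(sr)
print(list(sr.index))
```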
TIP
Series indices can be designated flexibly. They do not have to start from 0, and index types other
than integers can be used as well.
When creating a series, use the index parameter to designate the index values you want.
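A minimal sketch of the index parameter, with illustrative values chosen only for this example:

```python
import pandas as pd

# String labels instead of the default integer index
sr_labels = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Integer labels that do not start from 0
sr_from_ten = pd.Series([1, 2, 3], index=[10, 11, 12])

print(sr_labels["b"])
print(sr_from_ten[11])
```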
7. Creating Series
There are many different ways to create a series, but only four of them are covered in this chapter.
‣ Creating a series from a Python list
‣ Creating a series from a Python dictionary
‣ Creating a series from a NumPy array
‣ Creating a series from a scalar value (a single value, such as an integer)
7.1. Creating series from python list or dictionary
pandas.Series(list or dictionary)
‣ The dictionary keys become the index of the series, while the dictionary values become the elements
(data values) of the series.
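Both cases can be sketched as follows. The fruit names and prices are made-up illustration values.

```python
import pandas as pd

fruits_list = ["apple", "banana", "cherry"]
prices_dict = {"apple": 3, "banana": 1, "cherry": 5}

s_list = pd.Series(fruits_list)   # index is generated automatically: 0, 1, 2
s_dict = pd.Series(prices_dict)   # keys become the index, values become the data

print(s_list)
print(s_dict)
```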
7.2. Creating series by using array data made with Numpy
‣ Series objects can be initialized by using different kinds of Numpy functions. Use array-creating
functions and methods to create series.
Line 4
• arange() creates an array of integers from 10 to 15
Line 6
• The array created with numpy is converted into series object
Line 6
• Creates 6 normally distributed random numbers and stores them to variable r.
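The NumPy-based creation described above might look like the sketch below. The original code is not shown in the text, so the use of np.random.default_rng and the seed value are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Integers 10..15 (the stop value 16 is excluded)
arr = np.arange(10, 16)
s_arr = pd.Series(arr)

# 6 normally distributed random numbers stored in r
rng = np.random.default_rng(seed=0)
r = rng.normal(size=6)
s_rand = pd.Series(r)

print(s_arr)
print(s_rand)
```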
7.3. Creating series by using scalar value
‣ A scalar value is a single value, such as an integer.
‣ First, create a simple series object that has a single data item.
Line 6
• Creates a series object consisting of one index entry whose data value is 4.
Line 1, 3
• 1: Creates an array of integers from 0 to 5 in order
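A sketch of creating a series from a scalar value. Repeating the scalar over the 0 to 5 index is an assumption about how the original example combines the two steps.

```python
import numpy as np
import pandas as pd

# A single scalar gives a series with one element at index 0
s_single = pd.Series(4)

# With an explicit index, the scalar is repeated for every index value
idx = np.arange(0, 6)              # integers 0..5
s_repeat = pd.Series(4, index=idx)

print(s_single)
print(s_repeat)
```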
TIP
In general, a series is used to express time series data, with dates and times as its index.
Use the pandas.date_range() function to create a range of dates for the index values.
Line 5, 7
• 5: Converts a list into series object to prepare data for practice
• 7: When printed, the indices will consist of integers starting from 0.
Line 1, 3
• 1: Creates a range of dates in the form pandas.date_range('start date', 'end date')
• 3: When printed, the result is the special index type DatetimeIndex.
Line 1~3
• 1, 2: Replaces the index of marvel, a series object, with the date range created previously.
Of course, the number of entries in the new index must match the number of elements in the
series to prevent an error.
• 3: Checks the series object with the changed index.
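The three steps of this TIP can be sketched together as follows. The contents of marvel and the dates are placeholders, since the original list is not shown in the text.

```python
import pandas as pd

# Placeholder data for practice; the index starts as 0, 1, 2
marvel = pd.Series(["Iron Man", "Thor", "Hulk"])

# Three daily timestamps; the result is a DatetimeIndex
dates = pd.date_range("2021-01-01", "2021-01-03")

# The new index must have as many entries as the series has elements
marvel.index = dates
print(marvel)
```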
Line 4
• Converts the list into series object to prepare data for practice.
Line 1
• Use the index property to look up the index values of the series
Line 1
• Use the values property to look up all of the data elements of the series
Line 6, 7
• 6: Selects data with a single index address
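The index and values properties and the selection by index address can be sketched as below. The data and labels are illustrative, and the list-based selection is an addition that the text does not describe.

```python
import pandas as pd

sr = pd.Series([10, 20, 30], index=["a", "b", "c"])

labels = sr.index          # Index object holding the labels
data = sr.values           # NumPy array holding the data elements

one = sr["b"]              # select with a single index address
several = sr[["a", "c"]]   # select several addresses with a list (assumed extension)

print(labels)
print(data)
print(one)
print(several)
```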
Line 7
• All of the elements in the sr Series object are multiplied by 2
Line 8
• Check how each element value changes from the addition of s1 and s2.
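A minimal sketch of both operations, with made-up values. sr, s1, and s2 are the names used in the notes above.

```python
import pandas as pd

sr = pd.Series([1, 2, 3])
doubled = sr * 2            # every element is multiplied by 2

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([10, 20, 30])
total = s1 + s2             # elements with the same index label are added

print(doubled)
print(total)
```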
Line 10
• When printed, the elements at indices 3 and 4 are NaN (missing data).
Line 12
• Returns the number of elements except for missing data.
Line 13
• Returns sum of the elements except for missing data.
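The missing-data behavior described in these notes can be reproduced with a sketch like this. The values are illustrative; count() and sum() are the pandas methods that skip NaN.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30])

# Index labels 3 and 4 exist only in s1, so the result there is NaN
result = s1 + s2

n_valid = result.count()   # number of elements except for missing data
total = result.sum()       # sum of the elements except for missing data

print(result)
print(n_valid, total)
```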
[Figure: the data values are plotted on the Y-axis]
Line 2
• If no separate x values are assigned, the given values are used as the sequence of y-axis values.
In the chart above, the x-axis runs from 0 to 3 and the y-axis from 1 to 4. This is because when a single
list or array is provided for plotting, matplotlib assumes it is a sequence of y values and generates the
x values automatically. Python ranges start from 0, so the default x vector has the same length as y
but starts from 0. Thus, the x data become [0, 1, 2, 3].
Line 2
• Assigns both x-axis and y-axis
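The two calling styles described above can be compared in a short sketch. The Agg backend line is an assumption added so that the example also runs without a display. The x values in the second call are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Only y values: matplotlib generates x = 0, 1, 2, 3
lines_y_only = plt.plot([1, 2, 3, 4])
x_generated = list(lines_y_only[0].get_xdata())

# Both x and y values assigned explicitly
lines_xy = plt.plot([10, 20, 30, 40], [1, 2, 3, 4])
x_given = list(lines_xy[0].get_xdata())

print(x_generated)
print(x_given)
```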
Line 1, 4, 6, 8
• 1: matplotlib is an external module that needs to be installed.
• 4: Uses a list comprehension.
• 6: linewidth is a property that adjusts the thickness of the line, while color designates a
specific color.
• 8: Adds the title of the chart.
Line 9, 10, 13
• 9: Adds the name of the x-axis of the chart.
• 10: Adds the name of the y-axis of the chart.
• 13: Function call that displays the graph.
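A sketch that combines the elements listed in these notes. The data, title, and label texts are placeholders, and the line numbers do not match the original code. The Agg backend line is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

x = list(range(1, 6))
y = [v ** 2 for v in x]                     # list comprehension

plt.figure()
plt.plot(x, y, linewidth=3, color="red")    # line thickness and color
plt.title("Square numbers")                 # title of the chart
plt.xlabel("value")                         # name of the x-axis
plt.ylabel("square of value")               # name of the y-axis
ax = plt.gca()                              # keep a handle to inspect the chart
plt.show()                                  # display the graph
```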
Let’s code
Step 1
1) Consider which libraries are required to solve the mission. Find the necessary libraries, install
any that you don’t have, and import the modules in the code.
‣ Numpy for making an array while data processing
‣ Pandas for creating series objects
‣ Matplotlib for data visualization
2) Convert each of the three dictionaries into a series object.
‣ Make sure to check the index labels of each series object before operating on them, because if the
index labels or the number of data items differ, the operation may return NaN.
Line 1~3
• 1: population_2020 is a dictionary that stores the total population of each city in 2020.
• 2: population_2021 is a dictionary that stores the total population of each city in 2021.
• 3: city_area is a dictionary that stores the area of each city.
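A sketch of this step follows. All city names and numbers are placeholders, not real statistics, and the name s_2021 is an assumption modeled on s_2020 and s_area from Step 4.

```python
import pandas as pd

# Placeholder figures for illustration only
population_2020 = {"City A": 1000, "City B": 800, "City C": 500}
population_2021 = {"City A": 990, "City B": 820, "City C": 510}
city_area = {"City A": 20, "City B": 40, "City C": 5}

s_2020 = pd.Series(population_2020)
s_2021 = pd.Series(population_2021)
s_area = pd.Series(city_area)

# The index labels must match before doing arithmetic between the series
print(s_2020.index.equals(s_2021.index))
print(s_2020.index.equals(s_area.index))
```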
Step 2
Use an arithmetic operation on two series to calculate the population change between the two years.
Line 1~2
• 1: Creates a new series object by calculating the population change of each city (index label)
with an arithmetic operation (-) on the two series.
• 2: Checks the data of the new series object that stores the result of the operation.
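The subtraction can be sketched as below with placeholder numbers. The name growth follows Step 3; s_2021 is an assumed name.

```python
import pandas as pd

# Placeholder figures for illustration only
s_2020 = pd.Series({"City A": 1000, "City B": 800, "City C": 500})
s_2021 = pd.Series({"City A": 990, "City B": 820, "City C": 510})

# Values with the same index label (city) are subtracted from each other
growth = s_2021 - s_2020
print(growth)
```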
Step 3
As previously explained, when visualizing data with a bar graph, a vertical bar graph is suitable for time
series data that have consecutive values, but a horizontal bar graph is suitable here because it shows
the difference in data between each variable (city name).
The x-axis of the graph is the index label of the series object, and the y-axis consists of the data elements.
To draw a chart of the series object ‘growth,’ save the index information to the x-axis and the data
values to the y-axis.
Line 1~2
• 1: Extracts the index values only.
• 2: Extracts the data elements only.
When creating a bar graph, labels sometimes overlap if there are many label names on the x-axis. To
solve this problem, adjust the graph size so that the label names on the x-axis are clearly visible.
Use the plt.figure() function with the figsize=(width, height) parameter, given in inches, to adjust the
chart size. If no size is designated, the default chart size is 6.4 by 4.8 inches.
Use the plt.xticks() function to adjust the angle at which the label names are printed.
Line 1, 3, 4, 5
• 1: Figure dimensions (width, height) in inches.
• 3: A number that signifies an angle can be entered instead of vertical. rotation=90 means
that the labels are rotated 90 degrees counterclockwise.
• 4: size=10 refers to the font size.
• 5: Use plt.barh() to draw a horizontal bar plot.
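A sketch of the chart code these notes describe. The data are placeholders, the statement order may differ from the original, and the Agg backend line is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder population change per city
growth = pd.Series({"City A": -10, "City B": 20, "City C": 10})
x = growth.index     # index values only
y = growth.values    # data elements only

fig = plt.figure(figsize=(8, 4))   # width, height in inches
bars = plt.barh(x, y)              # horizontal bar plot
plt.xticks(rotation=90, size=10)   # label angle and font size
plt.show()
```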
‣ Calculating with the data for the cities with the largest populations shows that population growth
decreased in Tokyo.
Step 4
Let’s calculate population density.
The series objects for calculation include s_2020 that has population data in 2020 and s_area that has
area data of each city.
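Population density is population divided by area, so the calculation can be sketched as a single division. The numbers and the name density are placeholders and assumptions.

```python
import pandas as pd

# Placeholder figures for illustration only
s_2020 = pd.Series({"City A": 1000, "City B": 800, "City C": 500})
s_area = pd.Series({"City A": 20, "City B": 40, "City C": 5})

# Values with the same index label (city) are divided element by element
density = s_2020 / s_area
print(density)
```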
Line 1~2
• 1: Extracts index values only.
• 2: Extracts data elements only.
Line 1, 5, 7, 8, 9
• 1: Figure dimension (width, height) in inches.
• 5: Use color parameter to designate color of the graph.
• 7: Adds the title of the chart.
• 8: Adds the name of x-axis of the chart.
• 9: Adds the name of y-axis of the chart.
Step 5
Calculate the cities with the highest and lowest population density by using the descriptive statistics
functions of the pandas series object.
Line 1
• Finds the minimum value among all data elements of the series object
Line 1
• Finds the maximum value among all data elements of the series object
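A sketch of this step with placeholder densities. min() and max() return the values; idxmin() and idxmax() are an addition not mentioned in the notes and return the matching index labels (city names).

```python
import pandas as pd

# Placeholder population densities for illustration only
density = pd.Series({"City A": 50.0, "City B": 20.0, "City C": 100.0})

lowest = density.min()
highest = density.max()
lowest_city = density.idxmin()
highest_city = density.idxmax()

print(lowest_city, lowest)
print(highest_city, highest)
```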
Pair programming
Q1. The Spotify dataset has various column values other than the artist name, ranking, and
popularity that we used in the mission. Discuss with your classmate to create a special playlist
that is made with other column values.
Unit 36. Pandas DataFrame for Data Processing
Learning objectives
Be able to explain the structure of a given two-dimensional data structure in terms of a DataFrame.
Be able to convert a given dictionary object into a DataFrame reliably.
Be able to create a DataFrame by importing external files in CSV, Excel, or JSON format.
Be able to edit the row and column information of a created DataFrame.
Key concept
1. DataFrame
A DataFrame is a two-dimensional labeled data structure made up of rows and columns. (A Series differs
from a DataFrame in that it is a one-dimensional array.)
Simply put, the tabular data and Excel spreadsheets commonly encountered in data analysis have
the DataFrame format.
Unlike a Series, which can have only one value per index label, a DataFrame can have multiple values
per index label. Each Series then becomes a column of the DataFrame.
https://www.geeksforgeeks.org/creating-a-pandas-dataframe/
Terms commonly used in data statistics and data science fields are summarized below.
DataFrame: The most basic spreadsheet-like data structure in statistics and machine learning models.
Feature: Typically, each column in the table is a feature. Similar terms include attributes and predictors.
Outcome: The goal of most data science projects is to predict some outcome. Features are used for the prediction. Similar terms include dependent variable, response, goal, and output.
Record: Typically, each row in a table represents one record. Similar terms include recorded value, case, example, observation, pattern, and sample.
In other words, using the above terms, a DataFrame can basically be defined as a two-dimensional
matrix consisting of rows that each represent a record (case) and columns that each represent a
feature (variable).
To process a DataFrame object effectively, a column can be designated as the index. Pandas also
supports multi-level (hierarchical) indexes, so more complex data processing can be done
effectively.
Please visit https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe to read
more about DataFrame definitions and uses from a Python and pandas perspective.
2. Creating a DataFrame
Multiple one-dimensional arrays of the same length are combined to form a DataFrame. The same
length means that the number of data elements is the same. In other words, multiple Series can be
combined to create one DataFrame.
Also, since several Series can be collected in a dictionary, this can also be understood as "converting
a dictionary into a DataFrame".
2.1. DataFrame creation basics
pandas.DataFrame(the dictionary object you want to convert)
Each dictionary key becomes a column name.
Line 3
• Convert a dictionary called data to a DataFrame
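A minimal sketch of this conversion. The contents of data are placeholders, since the original dictionary is not shown.

```python
import pandas as pd

# Each key becomes a column name; each list becomes that column's values
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
```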
Line 2, 3
• 2: Store in series object s
• 3: Store in another series object s2
Line 5, 7, 8
• 5: Combines the two Series and assigns them the key names one and two.
• 7: When the DataFrame is output, the key names of the dictionary become the column names.
• 8: NaN occurs because the two Series do not have the same number of elements.
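The NaN behavior can be reproduced with two Series of different lengths. The values are placeholders; the names s, s2, one, and two follow the notes above.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s2 = pd.Series([10, 20, 30, 40])

# The dictionary keys "one" and "two" become the column names
df = pd.DataFrame({"one": s, "two": s2})

# Row 3 exists only in s2, so column "one" holds NaN there
print(df)
```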
2.2. Set the name of the row (index) / column (columns) of the DataFrame
You can set the row and column names separately when creating a DataFrame or after creating it.
First, understand the structure and naming rules of DataFrames through the images below.
As described in the image, each column of a DataFrame is a series object, and these series objects have
a matrix structure where they are combined based on the index of the same row.
• Change all row indexes: DataFrame object name.index = Array of new row indexes
• Change all column names (columns): DataFrame object name.columns = Array of new column names
Line 2, 3, 4, 6, 7, 9, 11
• 2: It is a dictionary data type.
• 3: Create a DataFrame object named df from the dictionary.
• 4: Check the DataFrame created before the name change.
• 6: Change to the new index names.
• 7: Check the result of the index change.
• 9: Change to the new column names.
• 11: Check the result of the column name change.
As the previous code execution result shows, changing the index or column names does not keep the
DataFrame as it was before the change; the DataFrame itself is modified to use the new index and
column names.
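A sketch of assigning the whole index and all column names at once. The data and the new names are placeholders.

```python
import pandas as pd

data = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(data)

# Assignment replaces the labels on df itself; no new object is returned
df.index = ["row0", "row1"]
df.columns = ["c1", "c2"]
print(df)
```

The number of new labels must match the number of rows or columns, otherwise pandas raises an error.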
• Select part of a row and rename it: DataFrame object name.rename(index={existing index: new index, ...})
• Select part of a column and rename it: DataFrame object name.rename(columns={existing column name: new column name, ...})
Line 4, 6
• 4: Check the DataFrame created before the name change.
• 6: Change index 0 to a new name, new index0.
Line 1
• Change the column name col2 to the new name new col2.
The result of executing the code above differs from when the entire index or all column names were
changed: rename() returns a new DataFrame object rather than changing the original itself. You must
use inplace=True to change the original object.
Line 4, 5
• 4: Changes the original DataFrame object itself through the inplace=True option.
• 5: Only the column name was changed, but the output shows that the original object itself has
been modified.
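The difference between the two rename() behaviors can be sketched as follows. The data are placeholders; the new names follow the notes above.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Without inplace, rename() returns a new DataFrame and leaves df untouched
renamed = df.rename(index={0: "new index0"})
index_after_rename = list(df.index)

# With inplace=True, df itself is modified and nothing is returned
df.rename(columns={"col2": "new col2"}, inplace=True)

print(renamed)
print(df)
```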
‣ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html#pandas-dataframe-from-dict
‣ Absolute path: This method writes out the full path from the first starting point (the root of the OS) to the file.
Ex On a Windows system, "sample.txt" on the desktop is found at C:\Users\UserID\Desktop\sample.txt.
On Windows or any other operating system (OS), an absolute path locates the file by listing
every directory passed through from the top-level root.
‣ Relative path: The "location of the file you want to find" expressed relative to "the current location".
Ex What if the path of test.txt changes frequently, or the root directory differs between OSs?
Because absolute paths are static, the former requires rewriting every document that uses the
absolute path, and the latter requires creating and managing a separate absolute path for each OS.
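The difference between the two kinds of path can be checked with the standard pathlib module. This is a sketch; the directory and file names are hypothetical and the file does not need to exist.

```python
from pathlib import Path

# Relative path: interpreted from the current working directory
relative = Path("data") / "sample.txt"

# Absolute path: built here by prefixing the current working directory
absolute = Path.cwd() / relative

print(relative, relative.is_absolute())
print(absolute, absolute.is_absolute())
```

Because pathlib builds the separators for the current OS, the same relative path works on Windows, Linux, and Mac.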
[Figure: relative path example showing a scripts directory with your_script.sh and my_script.sh; caption: "You want to access this file"]
Line 1
• Returns rows from the top of a DataFrame named df. If no number is entered, only the first 5 rows
are output.
Line 1
• Returns as many rows from the top as the number entered in head().
Line 1
• If a negative number such as -3 is entered, all rows are returned except the last 3 rows.
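The three head() calls described above can be compared on a small sketch DataFrame with 10 placeholder rows:

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first_five = df.head()            # no number: first 5 rows
first_three = df.head(3)          # first 3 rows
all_but_last_three = df.head(-3)  # everything except the last 3 rows

print(len(first_five), len(first_three), len(all_but_last_three))
```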
TIP
A DataFrame is a data object consisting of a two-dimensional array. Use df.shape to check the size
of each dimension, that is, the number of rows and columns.
It returns the dimensions of the data frame in the form of a tuple.
Line 1
• Use the len() function to get the number of rows.
Line 1
• Print the column names.
Line 1
• Only the values of the DataFrame are output, without the index or column labels. In practice, it is not used much.
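The TIPs above (shape, len(), columns, values) can be tried together on a small DataFrame. This is a minimal sketch; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"name": ["a", "b", "c"], "age": [20, 30, 40]})

shape = df.shape                  # (number of rows, number of columns)
n_rows = len(df)                  # number of rows
column_names = list(df.columns)   # column labels
values = df.values                # underlying 2-D array without labels

print(shape, n_rows, column_names)
print(values)
```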
1) Default
‣ Download the Titanic Survivor data file from https://www.kaggle.com/c/titanic/data
Line 2
• Enter the path of the downloaded file as a relative path.
Line 1
• Output only the first 5 rows from the top and review.
TIP
A DataFrame lets you inspect not only the data but also metadata such as column types, the number of null values, and the data distribution. You can use info() and describe().
Line 1
• info() shows the total number of rows, the number of non-null values in each column, and the data types of the imported DataFrame. When preprocessing data, it is recommended to make a habit of checking with this method first.
Line 1
• You can roughly check the data distribution of the imported DataFrame.
Ex When using this data in machine learning, knowing the distribution of the data first is very important for improving performance. It is recommended to make a habit of using this method to get a rough picture of the distribution.
mean   Average value of all data
std    Standard deviation
min    Minimum value
max    Maximum value
25%    25th percentile value
50%    50th percentile value (median)
75%    75th percentile value
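As a rough illustration of info() and describe(), the sketch below uses a tiny hypothetical DataFrame in place of the Titanic file. By default describe() summarizes only the numeric columns and also reports a count row in addition to the statistics listed above.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (values invented for illustration).
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
})

df.info()                 # row count, non-null counts, and dtypes per column
summary = df.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```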
2) skiprows
‣ When importing a file, this specifies how many lines to skip at the top of the file. It can also be set to a list of the (0-based) line numbers to skip.
Ex [1, 4, 7]
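A minimal sketch of both forms of skiprows. The CSV text is hypothetical and is read from memory with io.StringIO so the example runs without a file.

```python
import io
import pandas as pd

# Hypothetical CSV content: one comment line, then a header and three records.
csv_text = "comment line\nname,score\nkim,90\nlee,85\npark,70\n"

# An integer skips that many lines from the top of the file.
skip_first = pd.read_csv(io.StringIO(csv_text), skiprows=1)

# A list skips the lines at those 0-based positions (here the comment and "lee").
skip_listed = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 3])

print(skip_first)
print(skip_listed)
```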
Line 1
• You can specify a row to be the column name.
3) header
‣ Check the csv file before loading it in the code, and then set header to the row you want to use as the column names. By default, the row at index 0 of the csv becomes the column header.
Line 1
• You can specify a row to be the column name.
header = 2
header = None
header = None, names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
‣ When should "header = None" be used? Use it when, after checking the csv, you find that the first row already contains data rather than column names. In this case, you can specify the column names with the names parameter.
• names = [List to use as column names]
Line 1
• You can specify a row to be the column name.
Line 1, 2
• 1: You can specify a row that becomes a column name.
• 2: NaN is returned when more column names are given than the data has columns.
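The header and names parameters can be sketched as below. The CSV text and column names are hypothetical, chosen only to show the two cases: a file whose column names sit on a later row, and a file that starts directly with data.

```python
import io
import pandas as pd

# Case 1 (hypothetical): a title line comes first, column names are on row index 1.
csv_with_title = "report 2021\nname,score\nkim,90\nlee,85\n"
with_header = pd.read_csv(io.StringIO(csv_with_title), header=1)

# Case 2 (hypothetical): the file starts with data, so supply the names yourself.
csv_data_only = "kim,90\nlee,85\n"
no_header = pd.read_csv(io.StringIO(csv_data_only), header=None, names=["name", "score"])

print(with_header)
print(no_header)
```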
4) encoding
‣ It can be used both when reading and when writing a file.
‣ Encoding can be understood as the process of representing characters outside the ASCII range (codes 0 to 127), such as Hangul, with additional bytes so that the computer can handle them. There are, however, several encoding schemes.
‣ The encoding parameter specifies the text encoding to use when loading a csv file.
Ex encoding = 'utf-8'
‣ There may be cases where the imported text looks broken even with this option. In this case, the easiest solution is to open the file in Excel, save it in UTF-8 format, and load it again.
• See https://docs.python.org/3/library/codecs.html#standard-encodings for Python standard encoding information.
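The effect of the encoding parameter can be seen with a small round trip. This sketch assumes a file saved in cp949 (a Korean encoding from Python's standard list); the file name and the data are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data containing Hangul characters.
df = pd.DataFrame({"name": ["김철수", "이영희"], "score": [90, 85]})

# Write the file in cp949, then read it back with the same encoding.
path = os.path.join(tempfile.mkdtemp(), "names_cp949.csv")
df.to_csv(path, index=False, encoding="cp949")
loaded = pd.read_csv(path, encoding="cp949")

# Reading the same bytes as utf-8 fails, because they are not valid UTF-8.
try:
    pd.read_csv(path, encoding="utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(loaded)
print("readable as utf-8:", decoded_as_utf8)
```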
5) index_col
‣ Specifies the column to be used as the row index.
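A minimal sketch of index_col, assuming a hypothetical CSV whose PassengerId column should become the row index:

```python
import io
import pandas as pd

# Hypothetical CSV text for illustration.
csv_text = "PassengerId,Name,Age\n1,Braund,22\n2,Cumings,38\n"

# PassengerId becomes the row index instead of a regular column.
df = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")
print(df)
```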
Line 2, 4
• 2: header specifies which row is used for the column names.
• 4: Output only the first 5 rows from the top and review.
Warning
To use the read_excel function, an optional dependency must be installed in advance: xlrd for legacy .xls files, or openpyxl for .xlsx files (xlrd 2.0 and later reads only .xls).
As with csv, you can check the various parameters available when loading a file at
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html?highlight=read_excel
https://www.json.org/json-en.html
Line 1
• JSON object
Line 1
• JSON array
object: { string : value }
array: [ value ]
value: string, number, object, array, true, false, null
https://www.json.org/json-en.html
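The JSON object and array forms can be tried in Python as follows. The JSON strings are hypothetical; json.loads is from the standard library, and pd.read_json is given a file-like object via io.StringIO.

```python
import io
import json
import pandas as pd

# Hypothetical JSON object: { string : value } pairs.
json_object = '{"name": "kim", "age": 20, "student": true, "nickname": null}'
# Hypothetical JSON array: [ value, value, ... ], here an array of objects.
json_array = '[{"name": "kim", "age": 20}, {"name": "lee", "age": 30}]'

parsed = json.loads(json_object)             # JSON object -> Python dict
df = pd.read_json(io.StringIO(json_array))   # array of objects -> DataFrame rows

print(parsed)
print(df)
```

Note that JSON true, false, and null become Python True, False, and None.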
Line 4
• A url can also be used in the path.
Line 2, 4
• 2: Import the package.
• 4: Save the data for the symbol GS10 from Federal Reserve Economic Data (FRED) as a DataFrame.
Line 6
• Visualize the DataFrame as a line graph.
TIP
This section explains how to use DataReader with World Bank data, but the same approach works for other data sources that you can find through the official documentation linked below. We use indicators as a sample; as you follow along, compare each step with the official documentation and keep thinking, "I can find and use other data this way."
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#indicators
Line 2, 4
• 2: Import the package.
• 4: The wb.get_indicators() function can get the full list of indicators.
Line 6
• iloc is covered in the DataFrame row and column selection section. Here it is used to get only the first 5 indicators of the whole list.
Line 3
• Use the wb.search() function to search for indicators related to life expectancy.
Line 5
• If you print all the results, you get a lot of search results. For now, let's print only 5 results. Each
indicator is country-specific.
Line 1, 3
• 1: Use the wb.get_countries() function to get data for all countries.
• 3: From all the data, extract only the 'name', 'capitalCity', and 'iso2c' columns.
Line 1, 3
• 1: You can download by putting the indicator and period you want to search in wb.download().
• 3: In the output, you can see that only the data for Canada, the United States, and Mexico are returned, not all countries. To download data for all countries, use the parameter country='all'.
Line 3
• You can get data for all countries by using the parameter country='all'.
Line 3, 5, 7
• 3: The indicator used for the search is the life expectancy indicator found in the previous indicator search practice.
• 5: Create a new DataFrame by pivoting.
• 7: If you check the pivoted result, the data is restructured as planned, with countries as the index and one column per year.
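The pivot step can be sketched without a network connection by using a small hand-made table. The layout (country, year, and life_expectancy as ordinary columns) and all values are assumptions for illustration; the actual World Bank download may need its index reset first to reach this shape.

```python
import pandas as pd

# Hypothetical long-format data: one row per (country, year) pair.
long_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    "life_expectancy": [80.1, 80.3, 60.5, 59.9],
})

# Pivot: countries become the index, each year becomes a column.
wide_df = long_df.pivot(index="country", columns="year", values="life_expectancy")

# For each year (column), find the country (index label) with the lowest value.
lowest_by_year = wide_df.idxmin()

print(wide_df)
print(lowest_by_year)
```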
Line 3
• You can check which country had the lowest life expectancy in each year.
What do you think of the results? If you search the history of each country to find out what happened in
that year, you can understand what seriously affects the average life expectancy in each country.
Line 3
• Using the 2021 Worldwide University Links data downloaded from Kaggle.
Line 5
• Check the contents of the DataFrame. There are 1526 rows, indexed from 0 to 1525, and 7 columns.
Line 1
• Check the data structure by printing only 5 rows.
Line 1
• Delete indexes 0 through 15.
Line 1
• Delete two columns.
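Deleting rows by index and deleting columns can both be done with drop(). This is a sketch on a hypothetical 20-row DataFrame; the column names are invented and do not come from the university ranking file.

```python
import pandas as pd

# Hypothetical data with 20 rows and 4 columns.
df = pd.DataFrame({
    "title": [f"univ{i}" for i in range(20)],
    "rank": list(range(1, 21)),
    "note": ["-"] * 20,
    "extra": [0] * 20,
})

# Delete the rows with index labels 0 through 15.
trimmed = df.drop(index=list(range(0, 16)))

# Delete two columns by name.
slim = trimmed.drop(columns=["note", "extra"])

print(slim)
```

drop() returns a new DataFrame, so df itself still has all 20 rows here.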
Line 5
• Select the rows at integer index positions 1 to 9 and store them in a new DataFrame. (Position 10 is excluded.)
rank.iloc[1:10]
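The slice rank.iloc[1:10] follows the usual Python rule that the end position is excluded. A minimal sketch with a hypothetical 12-row DataFrame named rank:

```python
import pandas as pd

# Hypothetical ranking data for illustration.
rank = pd.DataFrame({
    "title": [f"univ{i}" for i in range(12)],
    "score": list(range(100, 88, -1)),
})

# Positions 1 through 9 are selected; position 10 is not included.
index_position = rank.iloc[1:10]
print(index_position)
```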
Line 1, 3
• 1: Use rename() to change the name. When the original DataFrame is modified with the inplace=True option, pandas sometimes issues a warning (SettingWithCopyWarning). This happens because the DataFrame may share memory with the data it was selected from. To prevent this, pandas recommends copying the original to a new DataFrame with the copy() method and then working on the copy.
• 3: For the loc practice, change the integer indexes to string index labels.
Line 5, 7
• 5: You can see that the integer index has been changed to string index labels such as a, b, c.
• 7: Select the rows with index labels a and b and store them in a new DataFrame.
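The copy(), rename(), and loc steps described above can be sketched as follows. The data and the new column name total_score are hypothetical.

```python
import pandas as pd

# Hypothetical original data.
df = pd.DataFrame({"title": ["U1", "U2", "U3"], "score": [95, 90, 85]})

# Work on an independent copy so the original is left untouched.
work = df.copy()
work.rename(columns={"score": "total_score"}, inplace=True)

# Replace the integer index with string labels for the loc practice.
work.rename(index={0: "a", 1: "b", 2: "c"}, inplace=True)

# loc selects by label; a label slice such as "a":"b" includes both ends.
selected = work.loc[["a", "b"]]
sliced = work.loc["a":"b"]

print(selected)
```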
Line 7, 9
• 7: Select only one column named title and save it as a new DataFrame
• 9: Select only one column named title and save it as a new DataFrame
Line 11, 13
• 11: This form also selects one column; compare the output of the two forms. With this form, the column name of the selected column is not printed.
• 13: This form also selects one column, and the selected column name is printed.
index_position[["gender ratio"]]
index_position.gender ratio
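The different ways of selecting one column can be compared directly. In this sketch the data is hypothetical; single brackets and dot notation return a Series, while double brackets return a one-column DataFrame. Dot notation only works when the column name is a valid Python identifier, so a name containing a space, such as "gender ratio", must use brackets.

```python
import pandas as pd

# Hypothetical data with one column name that contains a space.
df = pd.DataFrame({
    "title": ["U1", "U2"],
    "gender ratio": ["54 : 46", "48 : 52"],
})

as_series = df["title"]           # Series
as_series_attr = df.title         # same Series via dot notation
as_frame = df[["gender ratio"]]   # DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame)
```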
Line 5
• Select the data element in row number 7, column number 1.
df.iloc[7, 1]
Line 1, 3
• 1: Select 2 or more elements
• 3: You can also specify the range of column selection.
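Selecting single elements, several elements, and ranges by position can be sketched with iloc. The DataFrame below is hypothetical, and using iloc for the element selection is an assumption about what the slides show.

```python
import pandas as pd

# Hypothetical 10-row data for illustration.
df = pd.DataFrame({
    "title": [f"univ{i}" for i in range(10)],
    "score": list(range(90, 100)),
    "students": list(range(1000, 1010)),
})

element = df.iloc[7, 1]               # row position 7, column position 1
several = df.iloc[[0, 2], [0, 1]]     # two rows and two columns by position
col_range = df.iloc[0:3, 1:3]         # row range and column range (ends excluded)

print(element)
print(several)
print(col_range)
```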
Unit 36. Pandas DataFrame for Data Processing
Let’s code
Step 1
Think about which libraries are necessary to solve the mission. Find the necessary libraries, install any that are not yet installed, and import them in the code with the import statement.
‣ Pandas for creating DataFrame objects
‣ Matplotlib for visualizing data
‣ Seaborn, one of the data visualization tools
After searching the target data and downloading, create a DataFrame.
‣ Download Spotify's 2019 Top Songs list data file from
https://www.kaggle.com/prasertk/spotify-global-2019-moststreamed-tracks
song = pd.read_csv("./data/spotify/spotify_global_2019_most_streamed_tracks_audio_features.csv")
Line 1
• Load the downloaded csv file into the song DataFrame.
song.head(10)
Line 1
• Print only the first 10 rows of the DataFrame on the screen.
song.info()
Line 1
• This method shows the total number of rows and the data types of the imported DataFrame. It is a DataFrame with a total of 1717 rows and 24 columns.
Line 1
• A method to check data distribution
Step 2
Select only the necessary columns and recreate the DataFrame.
Line 1
• Select several columns you want and load them into a new DataFrame called df.
Step 3
Filter only the data with Artist_popularity of 95 or higher.
Here we use data filtering, which selects rows by applying logical operators to column values. One logical operator or several can be used at the same time. In this case, one logical operator is enough to keep only the data with Artist_popularity of 95 or higher.
The comparison value of 95 can be changed as you like in the practice. However, remember that the first thing to do when importing an external file is to open the data directly and check its structure. Since this data contains only Spotify's popular tracks of 2019, most of the artists score over 90.
A threshold of 90 or higher is therefore recommended so that the filter actually discriminates.
Line 1
• Using a logical operator, create a new DataFrame named pop_song that contains only the rows with a value of 90 or higher in the Artist_popularity column.
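Column selection (Step 2) and row filtering (Step 3) can be sketched on a small stand-in for the Spotify data. The Artists and Artist_popularity column names come from the text; the Track and tempo columns and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the song DataFrame.
song = pd.DataFrame({
    "Artists": ["Post Malone", "BTS", "Artist X", "BTS"],
    "Track": ["Song A", "Song B", "Song C", "Song D"],
    "Artist_popularity": [96, 95, 80, 95],
    "tempo": [90.0, 120.0, 100.0, 130.0],
})

# Step 2: keep only the columns you need.
df = song[["Artists", "Track", "Artist_popularity"]]

# Step 3: one condition keeps rows with popularity of 90 or higher.
pop_song = df[df["Artist_popularity"] >= 90]

# Several conditions can be combined with & (and) or | (or), each in parentheses.
bts_pop = df[(df["Artist_popularity"] >= 90) & (df["Artists"] == "BTS")]

print(pop_song)
print(bts_pop)
```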
Who is the artist with the most songs on the list?
Line 1
• Counts the number of occurrences of each unique data element (value) and returns the counts as a Series object.
Line 1
• An artist named Post Malone has the most songs with 45 songs.
Line 1
• If you check the returned data type, it is Series.
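Counting songs per artist can be sketched with value_counts(). The artist list below is hypothetical; the name rank for the result follows the text.

```python
import pandas as pd

# Hypothetical filtered data for illustration.
pop_song = pd.DataFrame({
    "Artists": ["Post Malone", "BTS", "Post Malone",
                "Billie Eilish", "Post Malone", "BTS"],
})

# Count how often each artist appears; results are sorted from most to least.
rank = pop_song["Artists"].value_counts()
top_artist = rank.index[0]

print(rank)
print(type(rank).__name__, top_artist)
```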
Step 4
Is BTS Really Popular?
If you are not a fan of BTS, you can filter data based on the English name of your favorite artist.
First, search whether the name of the artist you want to find exists in the current dataset.
Line 1
• If the return result is True, it means that the data exists. Let's not forget! rank is a Series object made up of unique artist names and their total counts.
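One way to check whether an artist exists is to test membership in the index of rank, since the artist names are the index labels. This is a sketch with a hypothetical rank Series; the slides may use a different expression.

```python
import pandas as pd

# Hypothetical result of value_counts(): index = artist names, values = counts.
rank = pd.Series({"Post Malone": 3, "BTS": 2})

has_bts = "BTS" in rank.index
has_unknown = "Unknown Artist" in rank.index

print(has_bts, has_unknown)
```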
We saw that the songs of the corresponding artist exist. Let's filter the data to see how many songs and
which songs are included.
Line 1, 3
• 1: pop_song is the DataFrame created in Step 3. Create a new DataFrame by collecting only the rows with BTS in the Artists column.
• 3: You can see that there are 11 songs in total.
Line 1
• Song list
Step 5
As the result of Step 3 shows, an American rapper named Post Malone has 45 songs in the dataset. He is the artist with the most hit songs. Let's make a playlist by building a separate list of Post Malone's songs only.
Line 3
• Reset the index with reset_index.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
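After filtering, the rows keep their original index labels, which is why the index is reset for the playlist. A minimal sketch with hypothetical data; drop=True is one option that discards the old index instead of keeping it as a column.

```python
import pandas as pd

# Hypothetical filtered data for illustration.
pop_song = pd.DataFrame({
    "Artists": ["Post Malone", "BTS", "Post Malone"],
    "Track": ["Song A", "Song B", "Song C"],
})

# Keep only Post Malone; the index labels are still 0 and 2.
post_malone = pop_song[pop_song["Artists"] == "Post Malone"]

# Renumber the rows from 0 and drop the old index.
playlist = post_malone.reset_index(drop=True)

print(playlist)
```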
Step 6
Expressing the tempo by rank as a scatterplot
‣ A scatterplot is a visual representation of the relationship between two variables.
‣ A scatterplot is used to understand the relationship between two variables x and y. Each observation is drawn as a point (x, y), with x and y as an ordered pair, on the coordinate plane.
https://docs.tibco.com/pub/spotfire_server/7.10.0/doc/html/en-US/TIB_sfire-bauthor-consumer_usersguide/GUID-A8DC822E-35B3-4289-94CB-642DFFE5E88F.html
In a graph, you can see a relationship between two variables if one of them tends to increase (or decrease) as the other increases.
The relationship is judged strong or weak depending on how closely the points cluster around a straight line.
Which plots in the figure below show a positive correlation?
Ex If the price of a product falls as a company produces more of it, then production and price have a negative correlation. If sales increase as marketing investment increases, these two variables are positively correlated.
Don't get confused! Whether a correlation is positive or negative indicates its direction, not the strength of the relationship between the two variables.
[Figure: four scatterplots on x and y axes. Positive correlation: strong, weak. Negative correlation: strong, weak.]
You can easily plot correlation using Matplotlib's scatter.
‣ https://matplotlib.org/stable/gallery/shapes_and_collections/scatter.html#scatter-plot
‣ When visually checking the graph, there is no clear correlation between the two variables.
pandas.DataFrame.corr(method='pearson')
‣ Pandas supports a function that makes it easy to calculate the correlation coefficient.
‣ This function calculates the pairwise correlation of columns, excluding NA/null values. The following three methods can be used; pearson is the most commonly used. Only numeric columns are compared. String-type columns are left out (in pandas 2.0 and later, pass numeric_only=True to exclude them).
‣ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
• pearson : standard correlation coefficient
• kendall : Kendall Tau correlation coefficient
• spearman : Spearman rank correlation
‣ The result is returned as a DataFrame, and the values are interpreted as follows.
• Closer to 1: the two variables increase together
• Closer to -1: one increases as the other decreases
• Closer to 0: there is little or no linear relationship between the two
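A minimal sketch of corr() on hypothetical data. The numeric_only=True argument (available in recent pandas versions) is an assumption about the environment; it leaves the string column out of the calculation.

```python
import pandas as pd

# Hypothetical data: streams falls exactly as rank rises; tempo is unrelated.
df = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "streams": [500, 400, 300, 200, 100],
    "tempo": [120, 90, 150, 100, 130],
    "artist": ["a", "b", "c", "d", "e"],
})

# Pairwise Pearson correlation of the numeric columns only.
corr = df.corr(method="pearson", numeric_only=True)
print(corr)
```

Here rank and streams have a coefficient of -1 because one decreases by a fixed amount each time the other increases.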
Line 2
• If you print the correlation coefficient result, you can't see a variable that is very close to 1.
However, if the closest value is found among them, tempo appears to have the least effect.
Line 2
• It is displayed in the form of a heatmap in a matrix.
• Display the color bar.
Unit 36. Pandas DataFrame for Data Processing
Pair programming
Q1. In the Spotify dataset, there are various column values in addition to the artist's name, rank, and popularity we used in our mission. Discuss with your learning colleagues to create special playlists using different column values.
Ex A playlist of only upbeat music: use the "tempo" column to select an appropriate tempo.
TIP
If you have made a playlist, save it as an Excel file and share it with your learning colleagues.
Saving a DataFrame as an Excel file is very simple.
pandas.DataFrame.to_excel()
‣ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
Unit 37.
Data Tidying
Learning objectives
Learners will be able to check for missing data within a given data frame.
Learners will be able to perform descriptive statistics such as mean, median, mode, variance,
and standard deviation for data frames.
Learning overview
Keywords
Descriptive Statistics
Boxplot
Unit 37. Data Tidying
Mission
‣ The student-mat.csv file is used as the target data for data analysis.
‣ It is converted into a data frame using Pandas.
‣ Identify the characteristics of the data.
‣ Check for missing data and, if there is any, run data pre-processing.
‣ Calculate the mean, median, and mode.
‣ Visualize in a box plot graph.
‣ Visualize all the variables in a histogram and scatterplot.
https://archive.ics.uci.edu/ml/datasets.php
Unit 37. Data Tidying
Key concept
https://cfss.uchicago.edu/notes/tidy-data/
3. Messy Data
The following is a summary of realistic situations in which data tidying is necessary.
Messy Data
‣ If the column is not the name of the variable, but the value itself
‣ If there is not just one variable in the column, but multiple variables
‣ If the variables are stored in both columns and rows (should be stored in columns)
‣ If missing data exists
‣ If it’s not a sample for the desired period of time
‣ If quantitative data is needed but the variable is qualitative
‣ If the data types of the values are wrong
‣ If there is duplication of data
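The first case in the list (column headers that are values rather than variable names) is commonly tidied with melt(). The sketch below uses hypothetical data in which the years appear as column names; it also fixes the data type of the year and removes duplicate rows, two other items from the list.

```python
import pandas as pd

# Hypothetical messy data: the year values are used as column names.
messy = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [80.1, 60.5],
    "2020": [80.3, 59.9],
})

# Move the year columns into a single "year" variable, one row per observation.
tidy = messy.melt(id_vars="country", var_name="year", value_name="life_expectancy")

# The year was a column label (string); convert it to an integer.
tidy["year"] = tidy["year"].astype(int)

# Remove any fully duplicated rows.
tidy = tidy.drop_duplicates()

print(tidy)
```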
You can load the datasets you need from seaborn's online repository (https://github.com/mwaskom/seaborn-data). By visiting this link, you can also see which datasets are available. The small drawback is that an Internet connection is required to access the repository.
Since the result is returned as a Pandas DataFrame, it is useful for learning purposes.
Line 5
• Loads the Titanic survivors dataset. With cache=True, the local cache is checked first and used if the data is already there.
Line 1
• You can see the number of valid values among the data held in each column.
891 - 714 = 177 (NaN)
891 - 203 = 688 (NaN)
TIP
If the dropna parameter of df.value_counts() is set to dropna=True, or if the dropna parameter is not
used at all, the values are counted with missing data excluded.
df.isnull(): returns True where the value is NaN
df.notnull(): returns False where the value is NaN
‣ If you check the results of this code, there are 177 missing values in the age column.
‣ There are two missing values in the embark_town column and 688 in the deck column.
Line 1
• Count the missing data in each column of the entire data set.
Line 1
• In the previous result series, if .sum() is applied once more, we can know the total number of
NaN values in the data frame.
Line 1
• df.count() returns the number of values other than missing data for each column. The number of
missing data can be obtained by subtracting this from the total amount of data.
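The three ways of counting missing data described above can be sketched as follows. This is a minimal example on a small hypothetical data frame (the values are made up and only stand in for the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are illustrative, not Titanic data
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing_per_column = df.isnull().sum()     # NaN count for each column
total_missing = df.isnull().sum().sum()    # NaN count for the whole frame
missing_from_count = len(df) - df.count()  # same counts derived from count()

print(missing_per_column)
print(total_missing)
```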
Line 1, 3
• 1: thresh=500 keeps only the columns that have at least 500 non-NaN values. All other columns are
deleted.
• 3: The deck column has only 203 valid values (688 NaNs), which is fewer than 500, so we can confirm
that the whole column is deleted.
Line 1, 2
• 1: Above, 177 rows had no age data. These rows are deleted.
• 2: Therefore, it is expected that 714 rows remain in the result.
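The two uses of dropna() described above can be tried on a small frame. The sketch below uses a hypothetical five-row frame and thresh=3 instead of 500; the column names mirror the Titanic columns, but the values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, 54.0],        # 4 valid values
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],  # 1 valid value
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],       # 5 valid values
})

# thresh counts valid values: keep columns with at least 3 non-NaN entries
df_thresh = df.dropna(axis=1, thresh=3)
print(list(df_thresh.columns))

# Delete only the rows whose age is missing
df_age = df.dropna(subset=["age"], axis=0)
print(len(df_age))
```

Note that thresh is a minimum number of valid values, not a maximum number of NaN values.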
Line 1
• Delete every column that contains at least one NaN.
Line 4, 6
• 4: Bring up the Titanic data set.
Line 1
• The mean of the data in the age column is stored in avg_age.
Line 1, 2
• 1: The median, obtained with the median() method, can also be used as the replacement value instead
of the mean.
• 2: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html?highlight=median#pandas.DataFrame.median
Line 1
• NaN data elements are replaced using fillna(). Here they are filled with the median value
median_age calculated earlier.
Line 1
• We can see that the missing data has been replaced with the fill value.
Line 1
• After the replacement, no NaN values remain in the age column.
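The replacement steps above can be sketched as follows. The variable names avg_age and median_age follow the slides; the five age values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 26.0, 40.0, np.nan]})

avg_age = df["age"].mean()       # NaN values are skipped by default
median_age = df["age"].median()  # middle value of 22, 26, 40

# Assign the result back to the column to keep the filled values
df["age"] = df["age"].fillna(median_age)

print(df["age"].isnull().sum())
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it as a fill value for skewed data.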
Line 4
• Bring up the Titanic data set.
Line 1
• Return a Series containing the count of each unique value in the corresponding column. Missing
data is excluded.
Line 1
• df.idxmax() returns the index at which the maximum value first occurs along the requested axis.
Line 1, 3
• 1: Using the fillna() method, NaN data elements are replaced with the name of the most frequent
embarkation town, stored in the variable most.
• 3: Missing data in the embark_town column has been replaced by Southampton, so no missing data
remains in the column.
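For a qualitative column, the most frequent value can serve as the fill value. A minimal sketch on hypothetical data follows; the variable name most is taken from the slides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", np.nan,
                    "Southampton", "Queenstown", np.nan],
})

# Count each unique town (NaN excluded), then take the label with the largest count
counts = df["embark_town"].value_counts(dropna=True)
most = counts.idxmax()

df["embark_town"] = df["embark_town"].fillna(most)
print(most, df["embark_town"].isnull().sum())
```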
‣ Print df_01 and df_02 and compare the two data frames to see how the places where data was missing
have been filled!
UNIT Data Tidying
37.
Let’s code
Let’s code UNIT
37
Step 1
Preparing Data and Creating Data Frame
‣ Although the data has been loaded, it is difficult to handle in this state. We can see that the
downloaded data is delimited by ';'.
‣ CSV files are usually delimited by ',', but this file is delimited by semicolons, which makes it
difficult to check visually.
‣ To change the character that separates the data, specify it with the parameter sep='separator
character'.
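The effect of the sep parameter can be checked without the downloaded file. In the sketch below, the text in raw and its column names are hypothetical stand-ins for the semicolon-delimited file.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited content standing in for the real file
raw = "school;sex;age;absences\nGP;F;18;6\nGP;M;17;4\n"

# Without sep=';' the whole line would be read as a single column
df_wrong = pd.read_csv(io.StringIO(raw))
df = pd.read_csv(io.StringIO(raw), sep=";")

print(df_wrong.shape, df.shape)
```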
Step 1
Checking Data Characteristics
Line 1
• non-null is the number of entries that are not missing. If it equals the total number of rows, the
column has no null data.
Step 1
Searching for any NaN in the Data
Line 1
• Looking at the results, it can be confirmed that the number of NaN data for each column is 0.
Line 1
• Obtain the number of columns that contain NaN data.
Step 1
Understanding the Meaning of the Columns for Data Comprehension
‣ For data analysis, it is fundamental to understand the meaning of each column name of the data frame.
‣ For this data, the column names are described in detail in the study.txt file included in the
compressed file. In everyday work, however, there is often no document explaining the column
names. In that case, it is necessary to contact the person who provided the data to check the
information on the columns of the data set.
Step 2
Let’s visualize the number of days students were absent in a histogram. The number of days of absence
is stored in column 30, absences.
‣ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib-pyplot-hist
Line 1, 6
• 1: Specify the variable to be graphed on the histogram
• 6: Add a grid to the graph
Let’s make histograms of the other variables that have numeric data to perform an exploratory data
analysis.
Step 3
Find the basic descriptive statistics such as mean, median, mode, variance, and standard deviation etc.
‣ First of all, we learned in the previous lessons that we can check the results of various and basic
descriptive statistics of the corresponding data frame using Pandas’ describe() method.
Median
‣ The median is the middle value of the data when it is sorted by size.
‣ For example, the median of studytime can be obtained with the median() method.
Mode
‣ The mode is the most frequent value in the data.
‣ For example, the mode of studytime can be obtained with the mode() method.
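A minimal sketch of both statistics, assuming a hypothetical studytime Series with made-up values:

```python
import pandas as pd

# Hypothetical study-time categories for seven students
studytime = pd.Series([1, 2, 2, 2, 3, 4, 1], name="studytime")

median_value = studytime.median()  # middle value after sorting
mode_values = studytime.mode()     # a Series, because ties are possible

print(median_value)
print(mode_values.tolist())
```

mode() returns a Series rather than a single number, since several values can share the highest frequency.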
Variance
‣ By calculating the variance, we can check whether the data is scattered or concentrated around the
mean. After selecting the variable, the var() method is used.
‣ Subtract the mean from each observed value, square the result, add the squares up, and divide by
the number of observations. That is, the variance is the mean of the squared deviations. The
deviations are squared because simply adding up the deviations from the mean always gives zero.
Note that the var() method of Pandas divides by n - 1 (sample variance) by default.
Line 1
• The smaller the result, the less scattered the data is.
Standard Deviation
‣ When there is a lot of data, the average is often used as the value that represents the data. The
standard deviation, one of the measures of dispersion, is a representative figure indicating how
spread out the data is around the average. Because it is the square root of the variance, the
standard deviation has the same unit as the data, unlike the variance, whose unit is squared. If the
standard deviation is close to 0, the data values are concentrated near the average. The larger the
standard deviation, the more widely spread the data values are.
Line 1
• The smaller the result, the less scattered the data is.
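The relation between the deviations, the variance, and the standard deviation can be checked by hand. The sketch below uses eight made-up values and compares the manual calculation with var() and std().

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical observations
mean = s.mean()                            # 5.0

deviation_sum = (s - mean).sum()           # deviations cancel out to 0
pop_var = ((s - mean) ** 2).sum() / len(s) # mean of squared deviations
pop_std = pop_var ** 0.5                   # same unit as the data

sample_var = s.var()                       # Pandas default: divide by n - 1
print(pop_var, pop_std, sample_var)
```

Passing ddof=0 to var() or std() reproduces the divide-by-n version computed manually here.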
Visualization through Box Plot
‣ Necessary Concepts for Understanding Box Plots
• Median Value (Q2): The value at the middle of the data (half of the observations are greater than
or equal to it, and the other half are smaller than or equal to it).
• The 3rd (Upper) Quartile (Q3): The median of the upper 50% of the data, split at the median value.
It marks off the top 25% of the entire data.
• The 1st (Lower) Quartile (Q1): The median of the lower 50% of the data, split at the median value.
It marks off the bottom 25% of the entire data, that is, 75% of the data lie above it.
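The three values can be computed with the quantile() method. A minimal sketch on nine made-up values (Pandas interpolates linearly between observations by default):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])  # hypothetical, already sorted

# 0.25, 0.5 and 0.75 correspond to Q1, Q2 (median) and Q3
q1, q2, q3 = s.quantile([0.25, 0.5, 0.75])

print(q1, q2, q3)
```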
Line 1
• 1st Semester Grades
Line 1
• Number of absent days
‣ Looking at the graph, we can see that the box plot of the number of absences contains many
outliers.
‣ If you look at the box plot of the first semester grades, the second semester grades, the third semester
grades, the weekend night outs statistics, and the number of days of absence, you can get insights on
student performance.
Step 4
Coefficient of Variation (CV)
‣ Variables with different units of measurement or different magnitudes cannot be compared directly
by their standard deviations, because the deviation tends to be larger when the values themselves are
larger.
Ex: The standard deviations of stock prices and gas prices cannot be compared directly.
‣ The describe() method does not report the coefficient of variation, so it has to be calculated
separately as the standard deviation divided by the mean. For the whole data frame, it can be
obtained as follows.
‣ However, note that if the mean is 0 or close to 0, the coefficient of variation can become
extremely large.
Line 1
• Return the CV for the entire column as Series
Line 2
• However, since the entire data may not be numeric, it is recommended to designate a specific
column, as specified in the content of the error message
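A minimal sketch of the calculation, assuming the usual definition CV = standard deviation / mean. The frame and its column names (G1, absences, school) are hypothetical; the non-numeric column is excluded before dividing.

```python
import pandas as pd

df = pd.DataFrame({
    "G1": [10, 12, 14, 16],
    "absences": [0, 2, 4, 30],
    "school": ["GP", "GP", "MS", "MS"],   # non-numeric column
})

# CV for every numeric column, returned as a Series
numeric = df.select_dtypes(include="number")
cv = numeric.std() / numeric.mean()

# CV for one designated column
cv_g1 = df["G1"].std() / df["G1"].mean()

print(cv)
```

Because the CV has no unit, the spread of absences and of G1 can be compared directly even though their scales differ.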
Step 4
Covariance
‣ The covariance represents the relationship between two variables.
• If the covariance is positive, the two variables tend to move in the same direction (positive
relationship).
• If the covariance is negative, the two variables tend to move in opposite directions (negative
relationship).
‣ To calculate it, multiply the deviations of the two variables from their respective means and
average the products. It extends the idea of variance to two or more variables.
‣ It can be calculated using the cov() method.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cov.html
Line 1
• Returns the covariance value of each column in the data frame.
Line 1
• The covariance can also be calculated with the cov() function of NumPy. The result is the
covariance matrix of the two Series (columns).
Line 1, 2
• 1, 2: The variance of each column obtained with var() is the same as the corresponding diagonal
element of the matrix above.
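The three checks in this step can be reproduced on a small frame. The column names G1 and G3 and their values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"G1": [10, 12, 14, 16], "G3": [9, 13, 14, 18]})

cov_matrix = df.cov()                 # Pandas: covariance of every column pair
np_cov = np.cov(df["G1"], df["G3"])   # NumPy: 2 x 2 covariance matrix

# Diagonal elements are the variances of the individual columns
var_g1 = df["G1"].var()
var_g3 = df["G3"].var()

print(cov_matrix)
```

Both libraries divide by n - 1 by default here, which is why the results agree.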
Step 4
Correlation Coefficient
‣ The covariance depends on the scale and unit of each variable. The correlation coefficient removes
this dependence on scale and shows the relationship between the data.
‣ The covariance tells you the direction of the relationship between two variables, but not its
strength. A correlation coefficient is needed for that.
‣ Simply put, the correlation coefficient measures the degree to which two variables move together.
• 1.0: Complete positive correlation. When one variable moves by a certain amount, the other moves
in the same direction by a proportional amount.
• -1.0: Complete negative correlation, or inverse correlation. When one variable moves by a certain
amount, the other moves in the opposite direction by a proportional amount.
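The two extreme values can be reproduced with columns that are exact linear functions of each other. The sketch below uses made-up data and also computes the Pearson coefficient by hand, as the covariance divided by the product of the two standard deviations.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["same"] = 2 * df["x"] + 1        # moves with x
df["opposite"] = -3 * df["x"] + 10  # moves against x

corr = df.corr()                    # Pearson correlation by default

# Manual calculation: covariance rescaled by both standard deviations
manual = df["x"].cov(df["same"]) / (df["x"].std() * df["same"].std())

print(corr)
```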
Step 4
Scatter Plot
Line 1
• The graph displays the comparison results of the first and final tests. It shows that people with
good grades from the beginning often perform well until the end.
Step 5
Draw a histogram and scatter plot of the variables you want to compare.
UNIT Data Tidying
37.
Pair programming
Pair programming UNIT
37
Q1.
After converting the universities_ranking.csv file in the ‘data’ folder of the practice folder into
a data frame, check whether there is missing data, and decide with your colleague whether to replace
it or delete it altogether.
Unit 38.
Unit learning objective (1/3) UNIT
38
Learning objectives
Be able to determine whether it is an analysis target using time series data when a data frame is
presented.
Be able to create date, time, interval, and array data as a time series data type.
Be able to obtain a moving average, a representative descriptive statistic of time series data.
Be able to visualize data as an area graph for comparison of two or more series data.
Unit learning objective (2/3) UNIT
38
Learning overview
How to view a summary of the data frame using the DataFrame.info() function
Unit learning objective (3/3) UNIT
38
Keywords
UNIT Time Series Data
38.
Mission
Mission UNIT
38
In today's global seafood production business, is most fish caught directly by fishermen, or is it
produced through aquaculture?
This may differ depending on each country's geographical location or food culture. An organization
called Our World in Data provides related data. Based on this, let's compare which type of supply
accounts for the largest share of today's global seafood production.
In addition, since the data is provided for each country, let's pick a country you are interested in
and create statistics for it separately.
https://ourworldindata.org/
The goal of this mission is to produce the same type of results as the graph below.
The graph below was created by data analysis experts.
UNIT Time Series Data
38.
Key concept
Key concept UNIT
38
“A time series is data about one or more variables measured over a period of time at
specific intervals.”
Machine learning technology is widely used for business predictive analysis. It analyzes business
problems such as stock prices, budgets, sales and asset flows, predictive maintenance, and sales
forecasting, and predicts future indicators by combining related components of the data such as
trend, cyclicity, and seasonality.
Pandas was created to handle financial data, and time series analysis can be regarded as the core of
data analysis using Pandas.
In addition to the financial field, time series data analysis is used in various fields.
‣ Medicine
‣ Meteorology
‣ Astronomy
‣ Economics
‣ IoT
http://www.Investing.com
Random: A completely irregular pattern that does not belong to the above three categories
Analyzing time series data as the sum of these factors is called a time series additive model.
Trend factor + Cycle factor + Seasonal factor + Irregular/Random factor
$Y_t = T_t + C_t + S_t + I_t$
Most of the functions supported by the date and time classes of the datetime library are supported.
The disadvantage is that its precision is lower than what massive computations on time series data
require.
The datetime class receives year, month, day, hour, minute, second, microsecond, and time zone as
arguments. The time arguments are not required. If they are omitted, 0 is used as the default value.
‣ You can use a datetime in Pandas by converting it to a Timestamp object.
Line 7
• We need at least 3 parameters corresponding to year, month and day. Hour and minute are returned
as 0 by default.
Line 1
• The hour and minute parameters are set to 15:50.
Line 4
• If you use the combine() method of the datetime class, you can create a datetime object using an
existing date or time object.
Line 1
• Current date and time, local time
Line 1
• If a time zone is used as a parameter in the now method, a datetime object applied with the time
zone is created.
Line 1
• Gets the current date and time and returns only the date.
Line 1
• Gets the current date and time and returns only the time.
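The datetime operations described on the preceding slides can be sketched together. The concrete date 2022-01-15 is an arbitrary example, and UTC is used as the time zone only for illustration.

```python
from datetime import date, datetime, time, timezone

d1 = datetime(2022, 1, 15)            # year, month, day; time parts default to 0
d2 = datetime(2022, 1, 15, 15, 50)    # hour and minute set to 15:50

# combine() builds a datetime from existing date and time objects
d3 = datetime.combine(date(2022, 1, 15), time(15, 50))

now_local = datetime.now()              # current local date and time (naive)
now_utc = datetime.now(timezone.utc)    # time zone passed to now()

today = now_local.date()   # only the date part
clock = now_local.time()   # only the time part

print(d1, d2, d3)
```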
The date and time created with datetime can also be created with pandas.Timestamp(). The difference
is that the data type is datetime64, which has higher precision than Python's datetime.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html
Line 1
• Both date and time can be set when creating.
Line 1
• Created by specifying the time.
Line 1
• If only time is specified, today's date is applied as default.
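A minimal sketch of the ways of creating a Timestamp mentioned above. The date and time values are arbitrary examples.

```python
from datetime import datetime
import pandas as pd

ts1 = pd.Timestamp("2022-01-15 15:50")   # date and time from a string
ts2 = pd.Timestamp(year=2022, month=1, day=15, hour=15, minute=50)
ts3 = pd.Timestamp(datetime(2022, 1, 15, 15, 50))  # from a Python datetime

# Only the time is given, so today's date is filled in
ts_time_only = pd.Timestamp("15:50")

print(ts1, ts_time_only)
```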
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html
Line 5, 6, 7, 9
• 5: Create a specific date with datetime.
• 6: Create today's date.
• 7: Calculate the day plus one day using timedelta.
• 9: Calculate how much time lies between my birthday and tomorrow.
Line 1
• When calculating the number of days between two dates, data is returned in timedelta format.
Line 2, 5
• 2: The difference between two times can also be calculated arithmetically.
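The date arithmetic above can be sketched as follows. The birthday is hypothetical, and today is fixed to an arbitrary date (instead of datetime.now()) so that the result is reproducible.

```python
from datetime import datetime, timedelta

birthday = datetime(1990, 5, 20)       # hypothetical date
today = datetime(2022, 1, 15)          # fixed stand-in for datetime.now()
tomorrow = today + timedelta(days=1)   # one day later

elapsed = tomorrow - birthday          # subtracting dates gives a timedelta

# Differences between times work the same way
t1 = datetime(2022, 1, 15, 9, 0)
t2 = datetime(2022, 1, 15, 11, 30)
gap = t2 - t1

print(elapsed.days, gap)
```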
Line 1
• Return the start time of that point in time.
Line 1
• Return the end time of that point in time.
A Period can be shifted through simple arithmetic. Adding or subtracting an integer creates a new
Period object shifted by that many units of its frequency. In the example below, +2 shifts by two
months because the frequency of special_day is one month.
Line 1
• Do not read +2 as meaning two months in itself. The period is shifted by the unit (frequency) with
which it was created.
Line 1
• In the results, you can see that Pandas correctly determines the full extent of September 1973
(it has 30 days).
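A minimal sketch of the shift, assuming special_day is a monthly Period for July 1973 (the starting month is an assumption chosen so that +2 lands on September 1973).

```python
import pandas as pd

special_day = pd.Period("1973-07", freq="M")  # assumed starting month
shifted = special_day + 2                     # two frequency units = two months

print(shifted)             # the new period
print(shifted.start_time)  # first instant of the period
print(shifted.end_time)    # last instant of the period
```

With a daily frequency (freq="D"), the same +2 would shift the period by two days instead.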
Line 2
• As example data, we use the international crude oil price data provided by the FED.
Line 2
• The data type is changed to the datetime64 type.
Line 1
• A column named N_Date is created.
TIP
Sometimes, in the process of converting to timestamp, an error occurs if the data cannot be
converted. In this case, there is a way to force the conversion.
If you use errors = 'coerce' parameter, NaT is forcibly assigned to data that cannot be converted.
Line 1
• Delete the DATE column that was the existing object data type.
Line 1
• Designate the newly created datetime64 column as the index of the data frame.
Line 1
• This completes the data frame indexed by DatetimeIndex.
Line 1
• Select a specific date, and you can slice only the data you want very conveniently. Isn't it very
intuitive?
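The whole sequence, from converting the DATE column to slicing by date, can be sketched on a few rows. The DATE and N_Date names follow the slides; the price column, its values, and the deliberately invalid entry are hypothetical. Dropping the NaT row before indexing is an extra step added here so that slicing works cleanly.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["2021-01-04", "2021-01-05", "not a date", "2021-02-01"],
    "price": [47.6, 49.9, 50.1, 53.5],
})

# errors="coerce" turns values that cannot be converted into NaT
df["N_Date"] = pd.to_datetime(df["DATE"], errors="coerce")
n_invalid = df["N_Date"].isnull().sum()

# Drop the old object column, remove NaT rows, and use the new column as index
df = df.drop(columns="DATE").dropna(subset=["N_Date"]).set_index("N_Date")

january = df.loc["2021-01"]   # select all rows of January 2021
print(january)
```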
start: the start of date range, end: the end of date range, periods: the number of
timestamps to be created
Line 2
• The time zone is not set (naive state). Check the number of timestamps created.
Line 2
• In case the time zone is set to Seoul
Line 2
• The freq parameter is set to a frequency based on the end of each month.
Line 2
• The freq parameter is set to a frequency based on the start of each month.
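The four date_range() variants above can be sketched as follows. The start date and number of periods are arbitrary. For the month-end case, the sketch passes the offset object pd.offsets.MonthEnd() rather than a string alias, because the string alias for month end differs between Pandas versions.

```python
import pandas as pd

# No time zone set: naive timestamps, daily frequency by default
naive = pd.date_range(start="2022-01-01", periods=6)

# Same range, but localized to Seoul
seoul = pd.date_range(start="2022-01-01", periods=6, tz="Asia/Seoul")

# Month-end and month-start frequencies
month_end = pd.date_range(start="2022-01-01", periods=3,
                          freq=pd.offsets.MonthEnd())
month_start = pd.date_range(start="2022-01-01", periods=3, freq="MS")

print(len(naive), seoul.tz)
print(month_end)
print(month_start)
```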
Line 3, 4
• 3: Start of date range
• 4: End of date range
Line 5, 6
• 5: Number of frequencies to generate
• 6: Length of period
Line 8
• The label of the index is period
Line 2
• Look at the data carefully. The actual start and end of each month are determined automatically.
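A minimal sketch of a period-indexed Series, assuming monthly periods starting in January 2022 and made-up values.

```python
import pandas as pd

# Three monthly periods; the index label type is Period
periods = pd.period_range(start="2022-01", periods=3, freq="M")
s = pd.Series([10, 20, 30], index=periods)

for p in periods:
    # start_time and end_time follow the real calendar length of each month
    print(p, p.start_time.date(), p.end_time.date())
```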
https://docs.wavefront.com/query_language_windows_trends.html
TIP
Smoothing means removing the small, unwanted fluctuations or discontinuities that noise introduces into data sampled from a large data set.
Representative method    Description
.rolling().mean()        Mean in window
.rolling().std()         Standard deviation in window
.rolling().var()         Variance in window
.rolling().sum()         Total in window
.rolling().min()         Minimum value in window
.rolling().max()         Maximum value in window
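The methods in the table can be tried on a tiny series. This is a minimal sketch with made-up numbers and a window of 3; the first two rows are NaN because a full window is not yet available.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
window = s.rolling(window=3)  # each result uses the current and 2 previous values

summary = pd.DataFrame({
    "mean": window.mean(),
    "std": window.std(),
    "var": window.var(),
    "sum": window.sum(),
    "min": window.min(),
    "max": window.max(),
})
print(summary)
```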
Line 6, 7, 10
• 6: Use the Korean stock data set that datareader provides from Yahoo Finance.
• 7: If you want to practice with a wide window, specify a longer period.
• 10: Ticker code for Samsung.
Line 1
• You can see that this data set already has a DatetimeIndex.
Line 1
• The 5-day moving average is easy to calculate.
Line 1, 3
• 1: To compare with the moving average in the graph, only the closing price is stored separately.
• 3: Mean over a 5-day window
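Since the stock download itself is not reproduced here, the sketch below uses made-up closing prices on business days to illustrate the same idea: keep the closing price separately and compute a 5-day moving average next to it. The column names `Close` and `MA5` are assumptions.

```python
import pandas as pd

# Hypothetical closing prices on eight consecutive business days.
dates = pd.date_range(start="2021-01-04", periods=8, freq="B")
close = pd.Series([100, 102, 101, 105, 107, 106, 110, 108],
                  index=dates, name="Close", dtype="float64")

# Mean over a 5-day window; the first four values are NaN.
ma5 = close.rolling(window=5).mean()

compare = pd.DataFrame({"Close": close, "MA5": ma5})
print(compare)
```

Plotting both columns of `compare` on one graph makes the smoothing effect of the moving average visible.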
Unit 38. Time Series Data
Let’s code
Step 1
Data acquisition and data frame transformation
Line 8
• Data source: https://ourworldindata.org/fish-and-overfishing
Line 1
• Let's check the data type of each column of the data frame.
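A small stand-in data frame can illustrate the check. The column names (`Entity`, `Code`, `Year`, `Aquaculture`, `Capture`) and all values are hypothetical placeholders, not the real data set.

```python
import pandas as pd

# Hypothetical sample shaped like a country-by-year production table.
df = pd.DataFrame({
    "Entity": ["Korea", "Korea", "Norway", "Norway"],
    "Code": ["KOR", "KOR", "NOR", "NOR"],
    "Year": [2000, 2001, 2000, 2001],
    "Aquaculture": [1.0, None, 3.0, 4.0],
    "Capture": [5.0, 6.0, None, 8.0],
})

# One dtype per column; Year is an integer, the production columns are floats.
print(df.dtypes)
```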
Step 2
Preprocessing including data cleaning
Line 1
• Delete unnecessary columns. The code column is not needed for the statistics we want to compute, so let's delete it.
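Dropping a column could look like the sketch below. The column name `Code` and the sample values are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Entity": ["Korea", "Norway"],
    "Code": ["KOR", "NOR"],
    "Year": [2000, 2000],
    "Capture": [5.0, 7.0],
})

# drop returns a new data frame without the listed columns.
df = df.drop(columns=["Code"])
print(df.columns.tolist())
```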
# Search for NaN data. You can check the number of missing values in each column.
Line 1~3
• 1: Create a variable that holds the replacement value.
• 2: Replace NaN with 0 so that it does not affect the sum statistic.
• 3: Checking the result confirms that every NaN was replaced with 0 and that no NaN values remain.
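Counting and replacing missing values can be sketched as follows, on a hypothetical sample with assumed column names.

```python
import pandas as pd

df = pd.DataFrame({
    "Entity": ["Korea", "Korea", "Norway"],
    "Aquaculture": [1.0, None, 3.0],
    "Capture": [5.0, 6.0, None],
})

# Number of missing values in each column before cleaning.
missing_before = df.isna().sum()
print(missing_before)

fill_value = 0                # replacement value kept in a variable
df = df.fillna(fill_value)    # 0 leaves later sums unchanged

# Every count should now be zero.
missing_after = df.isna().sum()
print(missing_after)
```

Replacing with 0 is reasonable here only because the later statistic is a sum; for a mean it would pull the result down.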
Step 2
Change to time series data type and replace index
Line 1, 2
• 1: Year is of type int64. Converting it to datetime as it is does not give the intended result, so its data type must be changed first.
• 2: Set the new_Year column, now in time series format, as the index.
Line 3, 5
• 3: Delete the existing Year column because it is no longer needed.
• 5: Confirm the above processing result.
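The conversion steps can be sketched like this. Casting the year to a string and parsing it with `format="%Y"` is one possible approach and an assumption here; the original code may differ. Only the names `Year` and `new_Year` come from the text.

```python
import pandas as pd

df = pd.DataFrame({
    "Entity": ["Korea", "Korea", "Norway"],
    "Year": [2000, 2001, 2000],
    "Capture": [5.0, 6.0, 7.0],
})

# Cast the integer year to text, then parse it as a four-digit year.
df["new_Year"] = pd.to_datetime(df["Year"].astype(str), format="%Y")

df = df.set_index("new_Year")   # use the parsed dates as the index
df = df.drop(columns=["Year"])  # the original integer column is no longer needed

print(df)
print(df.index)
```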
Line 1
• Look at the result to confirm that the index is now a DatetimeIndex.
Line 1
• Sort the data frame by the index that was set.
Line 2
• You can see that the data type for the country name has been changed from object to categorical.
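Sorting by the date index and converting the country column to a categorical type might look like this sketch. The column name `Entity` and the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"Entity": ["Norway", "Korea", "Norway", "Korea"],
     "Capture": [8.0, 6.0, 7.0, 5.0]},
    index=pd.to_datetime(["2001", "2001", "2000", "2000"], format="%Y"),
)
df.index.name = "new_Year"

df = df.sort_index()                            # order rows by date
df["Entity"] = df["Entity"].astype("category")  # repeated names become categories

print(df.dtypes)
```

A categorical column stores each distinct name once, which saves memory when the same country appears in many rows.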
Step 3
Line 1
• Let's check how many unique values the country name column contains. There is data for a total of 264 countries.
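Counting distinct names is a one-line operation. The sketch below uses a tiny hypothetical column, so its count is 3 rather than 264.

```python
import pandas as pd

entities = pd.Series(["Korea", "Korea", "Norway", "Peru"], name="Entity")

# nunique counts each distinct value once.
n_countries = entities.nunique()
print(n_countries)
```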
Adding up each country's catch and aquaculture production by year lets us visualize trends in global catch and aquaculture.
The data contains a separate row for each country and year. Using the pandas group operation on the year, we can calculate the world total (the sum of the data from every country) for each year and visualize it.
Line 1
• Group the rows by the new_Year column and store the result in a new object called g.
Line 1
• If you check the saved result, you can see that the groups were created based on the year.
Line 1
• Print the contents of the g object using a loop.
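Grouping by year and looping over the groups can be sketched as below. The sample values and column names are hypothetical; only `new_Year` and `g` are taken from the text.

```python
import pandas as pd

df = pd.DataFrame(
    {"Entity": ["Korea", "Norway", "Korea", "Norway"],
     "Aquaculture": [1.0, 3.0, 2.0, 4.0],
     "Capture": [5.0, 7.0, 6.0, 8.0]},
    index=pd.to_datetime(["2000", "2000", "2001", "2001"], format="%Y"),
)
df.index.name = "new_Year"

# Group rows that share the same index value (the year).
g = df.groupby(level="new_Year")

# Each iteration yields the group key and the rows belonging to it.
for year, group in g:
    print(year.year)
    print(group)
```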
Step 4
Line 1
• Apply a sum to each group to obtain the total for each year, and create a new data frame from the result.
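The per-year total can be sketched with a group sum. Column names and values are hypothetical, and `world` is an illustrative name for the resulting data frame.

```python
import pandas as pd

df = pd.DataFrame(
    {"Entity": ["Korea", "Norway", "Korea", "Norway"],
     "Aquaculture": [1.0, 3.0, 2.0, 4.0],
     "Capture": [5.0, 7.0, 6.0, 8.0]},
    index=pd.to_datetime(["2000", "2000", "2001", "2001"], format="%Y"),
)
df.index.name = "new_Year"

# Sum the numeric columns within each year: one row per year.
world = df.groupby(level="new_Year")[["Aquaculture", "Capture"]].sum()
print(world)
```

Selecting the numeric columns before summing keeps the country names out of the aggregation.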
Visualize global data
‣ Compare the graph we created with the graph published by Our World in Data. You can produce results at the level of what experts process and visualize.
Line 1, 2, 3, 4, 5, 7, 8
• 1: Specify the style of the graph.
• 2: Draw an area graph, in which the region under each line is filled with color.
• 3: Adjust the opacity of the color to increase the visibility of overlapping graphs.
• 4: Set the option to False.
• 5: Specify the size of the graph.
• 7: Show the legend.
• 8: If you check the results, you can see that since about 2010 the world has gradually been increasing aquaculture production rather than the amount caught.
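The plotting steps can be sketched as follows. Several details are assumptions: the `ggplot` style, the alpha value, the figure size, and that the option set to False is `stacked`. The numbers are made up and do not reproduce the real trend.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly world totals.
world = pd.DataFrame(
    {"Aquaculture": [10.0, 30.0, 60.0], "Capture": [50.0, 55.0, 52.0]},
    index=pd.to_datetime(["2000", "2010", "2020"], format="%Y"),
)

plt.style.use("ggplot")       # graph style
ax = world.plot(
    kind="area",              # area graph: region under each line is filled
    alpha=0.5,                # semi-transparent so overlapping areas stay visible
    stacked=False,            # assumed option: overlay the areas, do not stack
    figsize=(10, 5),          # size of the graph
)
ax.legend()                   # show the legend
fig = ax.figure               # in a notebook, plt.show() would display it
```

With `stacked=False` the two areas overlap, which is why a lower opacity helps when comparing them.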
Step 5
Let's search for a specific country name and visualize its data. This is easy if you remember the practice of searching for artist names in the DataFrame lecture.
Line 1~3
• 1: To search for a country, store only the non-duplicate names from the country column as series data.
• 2: Print only the non-duplicate country names.
• 3: Checking the type of the processed data confirms that it was successfully converted into a series.
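Collecting the distinct country names into a series might look like this sketch. The column name `Entity` and the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Entity": ["Korea", "Korea", "Norway", "Peru", "Peru"],
    "Capture": [5.0, 6.0, 7.0, 8.0, 9.0],
})

# unique() returns an array of distinct values; wrap it in a Series.
countries = pd.Series(df["Entity"].unique())

print(countries)
print(type(countries))
```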
Line 1
• If searching for the country name you want returns True, data exists for that country.
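A True/False search over the name series can be sketched as below, with hypothetical names. Note that Python's `in` operator tests a series' index labels, not its values, so the sketch compares values explicitly.

```python
import pandas as pd

countries = pd.Series(["Korea", "Norway", "Peru"])

# Compare every value with the name, then ask whether any comparison matched.
has_norway = (countries == "Norway").any()
has_atlantis = (countries == "Atlantis").any()

print(has_norway)
print(has_atlantis)
```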
Search for 3 or more countries with different cultural and geographic conditions, visualize their data, and draw insights from it.
Line 1~3
• 1, 2, 3: Each line searches the series data created earlier for one country name you want and creates a separate data frame for that country.
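Creating one data frame per country can be sketched with boolean filtering. The three country names, the column names, and the values are all hypothetical choices.

```python
import pandas as pd

df = pd.DataFrame(
    {"Entity": ["Korea", "Norway", "Peru", "Korea", "Norway", "Peru"],
     "Aquaculture": [1.0, 3.0, 0.5, 2.0, 4.0, 0.6],
     "Capture": [5.0, 7.0, 9.0, 6.0, 8.0, 9.5]},
    index=pd.to_datetime(["2000", "2000", "2000", "2001", "2001", "2001"],
                         format="%Y"),
)
df.index.name = "new_Year"

# One data frame per country, keeping the date index for plotting.
korea = df[df["Entity"] == "Korea"]
norway = df[df["Entity"] == "Norway"]
peru = df[df["Entity"] == "Peru"]

print(korea)
```

Each result keeps the yearly index, so it can be plotted the same way as the world totals.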
Unit 38. Time Series Data
Pair programming
Q1. With your learning colleagues, discuss and practice with the data used in the key concept section of this lecture, as shown below.
Q2. Access the University of California, Irvine site, one of the open dataset sources used in the previous lecture (Q1. Data Organization Learning), and explore together with your learning colleagues which data sets are well suited to the advantages of time series data analysis.
End of Document
ⓒ2022 SAMSUNG. All rights reserved.
Samsung Electronics Corporate Citizenship Office holds the copyright of this book.
This book is a literary property protected by copyright law, so reprinting and reproduction without permission are prohibited.
To use this book outside the curriculum of Samsung Innovation Campus, or to use all or part of this book, you must receive written consent from the copyright holder.