SIC_AI_Chapter 3. Exploratory Data Analysis_v2.1


Samsung Innovation Campus
Artificial Intelligence Course

Chapter 3.
NumPy Arrays: Optimized Numerical Computation & Pandas: Exploratory Data Analysis


Chapter Description

Chapter objectives

 Understand the precise use of NumPy and be able to process data efficiently.
 Learn the basics of NumPy arrays, indexing, and slicing and the various ways of their application.
 Learn to create and handle series and data frame objects.
 Learn the appropriate methods of optimal model execution for data preprocessing using the
Pandas library to explore and convert data.
 Be able to find the appropriate analysis method by implementing a data visualization suitable for
the data scale.

Chapter contents

 Unit 1. NumPy Array Data Structure for Optimal Computational Performance


 Unit 2. Optimal Data Exploration Through Pandas
 Unit 3. Pandas Data Preprocessing for Optimal Model Execution
 Unit 4. Data Visualization For Various Data Scales

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 3


Unit 1. NumPy Array Data Structure for Optimal Computational Performance
1.1. NumPy Arrays
1.2. NumPy Array Basics
1.3. NumPy Array Operations
1.4. NumPy Indexing and Slicing
1.5. Array Transposition and Axis Swap



1.1. NumPy Arrays UNIT
01

Data Structure
The most important concepts in programming are data types, data structures, and algorithms.
Knowing the clear differences between these three concepts is essential for easily dealing with
various programming languages and solving many errors.


Data Type
‣ First, a data type, in computer science and programming languages, is a classification that identifies
a kind of data, such as floats, integers, Booleans, characters, and strings, and also determines the
size of the data. Each value should be stored with the data type that fits it. Data types differ by
programming language, but most languages have inherited their data type concepts from the C
language.
‣ Memory is inexpensive today, but when C was introduced, memory was expensive and storing a lot of
data was difficult. As a result, C was designed to use as little space as necessary, which is why
the size of each data type matters.
‣ For example, in C, the integer data types are char (compatible with integers), short, int, and
long, and each can have a different byte size.


Data Structure
‣ Second, a data structure in computer science refers to the organization, management, and storage of
data that enables efficient access and modification. How quickly data can be found depends heavily on
the data structure. Finding data is a matter of algorithms (which will be explained later), so a good
data structure is a prerequisite for a good algorithm.
‣ To be more specific, a data structure is a group of data values, the relationships between them,
and the functions or commands applicable to the data. Carefully selected data structures make it
possible to use more efficient algorithms.
‣ An effectively designed data structure allows operations to be performed with minimal resources,
such as execution time or memory capacity.
‣ There are several types of data structures, each tailored to particular operations and purposes.
‣ When designing a program, selecting the most appropriate data structure should be a priority. This
is because, when building a large-scale system, the implementation difficulty and the final
product’s performance depend heavily on the data structure.
‣ Once the data structure is selected, it becomes relatively clear which algorithm needs to be applied.
Sometimes this order is reversed: the target operation requires a particular algorithm, and that
algorithm performs best with a particular data structure. In either case, it is essential to select
an appropriate data structure.


Algorithm
‣ Third, an algorithm is a formulation of a set of procedures or methods for solving a solvable
problem; it is a step-by-step procedure for executing a calculation.
‣ Algorithms are especially important in machine learning and deep learning, fields we will cover
later, because these fields require working with a lot of data.
‣ In the end, since the performance of an algorithm is directly related to the performance of the
data structure, it is most important to know precisely where to use each data structure.


Data Structures in Python


First, the following are the types of data structures used in computer science:

Data Structure
  Primitive: Integer, Float, Char, String
  Linear
    List: Sequential List, Linked List (Singly Linked List, Doubly Linked List, Circular Linked List)
    Stack
    Queue
    Deque
  Non-Linear
    Tree: General Tree, Binary Tree
    Graph
  File: Sequential File, Indexed File, Direct File


Types of Data Structures (1/2)


‣ Here, however, we will only concern ourselves with the data structures of the Python language. They
can be broadly divided into primitive data structures, which hold the basic types, and non-primitive
data structures, which effectively store multiple values of the basic data types.
‣ The most commonly and conveniently used non-primitive data structure in Python is the list. This is
because a list has no fixed size and can store several different data types (integer, character,
etc.) together.
‣ However, this flexibility becomes a drawback in machine learning and deep learning, where large
amounts of exclusively numerical data must be processed quickly.
‣ This is because computers convert the decimal numbers understood by humans into binary form to
perform fast operations, and they work fastest when all elements share the same numerical type, so
that each element can be accessed and calculated directly without repetition.
‣ In the end, computers can only perform operations when letters and characters are converted into
numbers.


Types of Data Structures (2/2)


‣ Compensating for such shortcomings is the array data structure. Python, however, does not directly
support the array; rather, the array data structure can be used through the NumPy library.

Data Structure
  Primitive: Integer, Float, String, Boolean
  Non-Primitive: Array, List, Tuple, Dictionary, Set, File
    Linear: Stacks, Queues
    Non-linear: Graphs, Trees


Method of Memory Storage Per Data Structure


‣ To understand why the array data structure enables such rapid operations, we first need to know how
each data structure is stored in memory.
‣ In most languages, an array is a group of data of the same data type, stored contiguously in
memory.
‣ Each value in it is called an element of the array. Elements are distinguished by a number called
an index, which always starts at zero. Most data types can be stored in an array, and an array can
be one-dimensional, two-dimensional, or three-dimensional, depending on how it is configured.
‣ Previously, it was explained that a data structure is a concept focused primarily on storing data
effectively. Finding out how data is stored in the array and in the list is the most reliable way to
understand the array’s pros and cons and the reason for its use.


Storage and Access of Data in an Array (1/2)


‣ The following shows how data stored in an array is accessed and how new data is added and deleted.
Compare it with the list that follows.
‣ Suppose three pieces of data, strings representing the following colors, are stored in an array.

Blue Yellow Red

‣ Each element can be accessed through its index. An index is a number representing a position in
the order. Data is stored sequentially in contiguous locations of memory, as shown in the figure below.

Memory
Blue
Yellow
Red


Storage and Access of Data in an Array (2/2)


‣ Since the data is stored in contiguous locations, the memory address of any element can be found
from its index, so any element can be accessed directly. The figure below shows access going first
to the red room, the third room (we describe a variable that stores one value as a room), and then
to the blue room. This random access is possible through the index.

Random Access
a[0]   a[1]     a[2]
Blue   Yellow   Red
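
The random access described above can be sketched in Python as follows. This is a minimal illustration, assuming NumPy is installed; the variable names are hypothetical and the three colors are taken from the figure.

```python
import numpy as np

# The three colors from the figure, stored in one array.
colors = np.array(["Blue", "Yellow", "Red"])

# Random access: go straight to any position through its index.
third = colors[2]   # the third room
first = colors[0]   # the first room
print(third, first)
```

Neither access needs to visit the elements in between, which is the contrast with the list described later.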


Adding and Deleting Data From the Array (1/4)


‣ Another feature of the array is that adding or deleting data at a specific position requires more
computation and space than in the list: the CPU needs more time to do the work, and more space in
memory is also needed.
‣ Consider adding the value “Green” at the second position.

a[0]   a[1]     a[2]
Blue   Yellow   Red

‣ First, we need to secure additional space at the end of the array.

a[0]   a[1]     a[2]   a[3]
Blue   Yellow   Red    (empty)


Adding and Deleting Data From the Array (2/4)


‣ To add data at the second position, the data from the second position onward must move one place
to the right, one element at a time.

a[0]   a[1]      a[2]     a[3]
Blue   (empty)   Yellow   Red

‣ The “Green” data is then added to the empty space.

a[0]   a[1]    a[2]     a[3]
Blue   Green   Yellow   Red


Adding and Deleting Data From the Array (3/4)


‣ Conversely, to delete the second element, the “Green” value, remove the element first, then move
the following values one place to the left, one by one, so there is no empty space.

a[0]   a[1]    a[2]     a[3]
Blue   Green   Yellow   Red

a[0]   a[1]      a[2]     a[3]
Blue   (empty)   Yellow   Red

a[0]   a[1]     a[2]   a[3]
Blue   Yellow   Red    (empty)


Adding and Deleting Data From the Array (4/4)


‣ The deletion is completed by releasing the last remaining space.

a[0]   a[1]     a[2]
Blue   Yellow   Red
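
The shifting steps walked through on these slides can be sketched in plain Python. This is only an illustration of the idea, not how any library implements it; `insert_at` and `delete_at` are hypothetical helper names, and a Python list stands in for the block of contiguous rooms.

```python
def insert_at(cells, index, value):
    """Insert value at index by shifting later elements one place right."""
    cells = cells + [None]                      # secure one extra room at the end
    for i in range(len(cells) - 1, index, -1):
        cells[i] = cells[i - 1]                 # move each element to the right
    cells[index] = value                        # fill the freed room
    return cells


def delete_at(cells, index):
    """Delete the element at index by shifting later elements one place left."""
    cells = list(cells)
    for i in range(index, len(cells) - 1):
        cells[i] = cells[i + 1]                 # move each element to the left
    cells.pop()                                 # release the last room
    return cells


colors = ["Blue", "Yellow", "Red"]
with_green = insert_at(colors, 1, "Green")
restored = delete_at(with_green, 1)
print(with_green)
print(restored)
```

The number of moves grows with the number of elements after the target position, which is the extra computation the text refers to.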


Saving and Accessing Data From the List (1/4)


‣ The following shows how data is stored in the list data structure.

Blue → Yellow → Red

‣ To understand lists, you must first understand the concept of the pointer. In simple terms, a
pointer is an address value that points to a certain value.
‣ As shown in the figure above, the blue room (again, a room stands for a variable holding one
value) points to the yellow room. That is, the blue room holds the address of the yellow room’s
memory location. In the same way, each room can point to the next room.


Saving and Accessing Data From the List (2/4)


‣ Lists are not stored sequentially but in separate locations.

Memory
[Yellow | Pointer]
[Blue | Pointer]
[Red | Pointer]


Saving and Accessing Data From the List (3/4)


‣ Since the pointer of the “Blue” value refers to the address of the “Yellow” value, and the pointer
of the “Yellow” value refers to the address of the “Red” value, the order is maintained.

Memory
[Blue | Pointer] → [Yellow | Pointer] → [Red | Pointer]


Saving and Accessing Data From the List (4/4)


‣ Since the data is stored at non-contiguous addresses, elements must be accessed sequentially, each
one through the pointer of the element that precedes it.

Sequential Access
Blue → Yellow → Red


Adding and Deleting Data From the List


‣ Now, let’s look at the case of adding the “Green” data.

Before:  Blue → Yellow → Red
After:   Blue → Green → Yellow → Red

‣ As shown in the picture above, “Green” is added by changing pointers only: the blue room now
points to the green room, and the green room points to the yellow room.
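
The pointer-based list described on these slides corresponds to a singly linked list. The sketch below is a minimal illustration in plain Python; the `Node` class and `to_values` helper are hypothetical names introduced here, and object references play the role of pointers.

```python
class Node:
    """One room: a value plus a pointer to the next room."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def to_values(head):
    """Sequential access: follow the pointers from the first room onward."""
    values = []
    node = head
    while node is not None:
        values.append(node.value)
        node = node.next
    return values


# Build Blue -> Yellow -> Red.
red = Node("Red")
yellow = Node("Yellow", red)
blue = Node("Blue", yellow)

# Add Green after Blue: only two pointers change, nothing is shifted.
green = Node("Green", blue.next)   # green points to yellow
blue.next = green                  # blue points to green

order = to_values(blue)
print(order)
```

Insertion touches two pointers regardless of length, while reaching the n-th element requires walking through every element before it.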


What is NumPy?
NumPy, short for Numerical Python, is a fundamental package for data science.

‣ It is a Python library that provides multidimensional array objects, various derived objects
(matrices, etc.), and an assortment of routines for fast operations on arrays.
‣ It supports discrete Fourier transforms, basic linear algebra, basic statistical operations, random
simulations, etc.
‣ The ndarray object is the core of the NumPy package. It holds n-dimensional arrays of
homogeneous data types.
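
As a minimal sketch of the ndarray object described above (assuming NumPy is installed; the sample values and the name `matrix` are hypothetical):

```python
import numpy as np

# A small two-dimensional array of homogeneous integers.
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(type(matrix))    # the ndarray class
print(matrix.ndim)     # number of dimensions
print(matrix.dtype)    # the single data type shared by every element
print(matrix.mean())   # one of the basic statistical routines
```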


Distinguish the differences between NumPy arrays and standard Python sequences
(Lists, Tuples, Dictionaries, Sets).
‣ NumPy arrays have a fixed size when created, as opposed to Python lists (which can grow
dynamically).
‣ Changing the size of an ndarray creates a new array and deletes the original.
‣ The elements of a NumPy array all have the same data type; thus, each takes up the same size
in memory.
‣ Lists, however, can hold elements of different types and sizes.
‣ NumPy arrays make advanced mathematical and other operations on large amounts of data easy.
‣ Normally, such operations are performed more efficiently and with less code than when performed
with Python’s built-in sequences.
‣ More and more scientific and mathematical Python-based packages are using NumPy arrays. These
typically support Python sequence input, but they convert such input to NumPy arrays before
processing and often output NumPy arrays.
‣ In other words, relying only on Python’s built-in sequences is not enough to use today’s
scientific and mathematical Python-based software efficiently. One also needs to know how to use
NumPy arrays.


Why Use NumPy Arrays? (1/3)


‣ Previously, we learned that Python does not directly support arrays but provides the array data
structure through the NumPy library.
‣ The array concepts learned in other programming languages are not too different in Python.
‣ In programming, reducing explicit loops is a common way to increase performance.
‣ In large-scale computations, a loop performs one computation per repetition, which causes poor
performance.
‣ NumPy arrays allow a wide variety of data processing tasks to be written as concise array
expressions instead of loops.
‣ Replacing explicit loops with array expressions is called vectorization. Vectorized array
operations are typically several times faster than pure Python operations, and often ten or even a
hundred times faster. Broadcasting, another method that we will learn later, is a very powerful
vector operation.
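
To make the contrast concrete, here is a minimal sketch that computes the same squares with an explicit loop and with a vectorized expression. It assumes NumPy is installed; the names and the array size are hypothetical.

```python
import numpy as np

values = np.arange(1_000)

# Loop version: one Python-level multiplication per element.
squares_loop = [v * v for v in values]

# Vectorized version: a single expression over the whole array.
squares_vec = values * values

print(np.array_equal(squares_loop, squares_vec))
```

Both give the same values; the vectorized form simply hands the repetition to NumPy instead of the Python interpreter.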


Why Use NumPy Arrays? (2/3)


‣ Since a NumPy array operation uses an internal iteration implemented in C, it is faster than
iteration in Python, and linear algebraic operations are performed as vectorized operations.
‣ In linear algebra, vectorization is also the name of the linear transformation that converts a
matrix into a column vector, as shown below.
‣ Since a vectorized operation works on the whole vector at once, it can be used instead of for and
while loops.

A 2x2 matrix $A$ becomes the column vector $\operatorname{vec}(A)$ through vectorization:

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \qquad \operatorname{vec}(A) = \begin{bmatrix} a \\ c \\ b \\ d \end{bmatrix}$$
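
The column-stacking transformation vec(A) can be reproduced in NumPy with a column-major reshape. This is a sketch under the assumption that the numbers 1 to 4 stand in for a, b, c, d; the variable names are hypothetical.

```python
import numpy as np

# a=1, b=2, c=3, d=4 in the notation of the formula.
A = np.array([[1, 2],
              [3, 4]])

# order="F" reads the matrix column by column (column-major order),
# so the result stacks the columns of A on top of each other.
vec_A = A.reshape(-1, 1, order="F")
print(vec_A.ravel())
```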

‣ In scikit-learn, the NumPy array is the basic data structure. In other words, the NumPy array
should be used as the standard input/output for machine learning.


Why Use NumPy Arrays? (3/3)


‣ Scikit-Learn is a Python machine learning library.
‣ Since scikit-learn receives data in the form of a NumPy array as input, all data to be used in the
future must be converted into a NumPy array.



1.2. NumPy Array Basics UNIT
01

How to import NumPy


NumPy Library Import
‣ For better readability of the code, NumPy is abbreviated to np. This is a convention followed by
nearly everyone who writes NumPy code, so it is easily understood.
‣ In Python, objects generally have properties and methods. A property is another Python object
stored inside the object, and a method is a function that gives access to the internal data of the
object. A property can be accessed in the form np.attribute_name.

Line 1
• as introduces an alias, that is, an abbreviated name.
Line 2
• To view the version of NumPy, use the built-in property __version__.
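
A minimal sketch of the two lines described above, assuming NumPy is installed:

```python
import numpy as np        # "as np" gives the package the alias np
print(np.__version__)     # version string of the installed NumPy
```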


NumPy Array Basics


NumPy ndarray
‣ An n-dimensional array object is called an ndarray. It can be used to store and process large datasets.
‣ Arrays are fast and flexible: mathematical operations on an entire block of data use a syntax
similar to that of operations between scalar elements.

First, an array can be created from a sequence (list, tuple, array, set).


Here is one more array.

Check the address of each array using id(), the function that returns an object’s address.
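
A sketch of this step, assuming the two arrays hold the values 1, 3, 5, 7, 9 shown on the following slide. Note that id() returns an object's identity, which is its memory address in the standard CPython interpreter; the actual numbers differ on every run.

```python
import numpy as np

arr1 = np.array([1, 3, 5, 7, 9])
arr2 = np.array([1, 3, 5, 7, 9])   # same values, separate object

print(id(arr1), id(arr2))
print(id(arr1) == id(arr2))
```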


The values inside are the same, but the addresses are different because each array is its own
object.

arr1 → 2246716570240: 1 3 5 7 9
arr2 → 2247660678400: 1 3 5 7 9

‣ In Python, the = symbol is an assignment operator that assigns address values; it does not copy
the values themselves.


Let’s assign the address arr1 to the variable arr3.

Line 9
• After assigning the address values, we can see that the address values are the same.
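
A minimal sketch of this assignment, with the array values assumed from the earlier slide. Because both names refer to one object, a change made through one name is visible through the other.

```python
import numpy as np

arr1 = np.array([1, 3, 5, 7, 9])
arr3 = arr1                      # no copy: arr3 refers to the same object

print(id(arr1) == id(arr3))

arr3[0] = 100                    # a change made through arr3 ...
print(arr1[0])                   # ... is also seen through arr1
```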


We can see that the address values are the same.

arr1 → 2246716570240: 1 3 5 7 9
arr3 → 2246716570240 (the same object as arr1)
arr2 → 2247660678400: 1 3 5 7 9


If you want to copy only the value, you can use np.copy(). See help(np.copy) for more detail.

Then let’s check the address value.

Line 12
• Since only the value was copied, you can see that the address value is different.
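
A sketch of copying values with np.copy(); the name `arr4` is hypothetical and the array values are assumed from the earlier slides.

```python
import numpy as np

arr1 = np.array([1, 3, 5, 7, 9])
arr4 = np.copy(arr1)             # new object holding the same values

print(id(arr1) == id(arr4))

arr4[0] = 100                    # changing the copy ...
print(arr1[0])                   # ... leaves the original untouched
```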


Creating Arrays with Tuples

Creating Arrays with Dictionaries


Creating Arrays with Sets

‣ Normally, lists are commonly used to create arrays.
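
The following sketch shows one way to build arrays from a tuple, a dictionary, and a set. The conversions chosen for the dictionary and the set (taking the values, and sorting the members) are assumptions made here for illustration, and all sample data is hypothetical.

```python
import numpy as np

# Tuple: converted element by element, just like a list.
from_tuple = np.array((1, 3, 5))

# Dictionary: here only the values are taken (an assumption of this sketch).
scores = {"a": 10, "b": 20}
from_dict = np.array(list(scores.values()))

# Set: converted to a sorted list first, since a set has no fixed order.
unique = {3, 1, 2}
from_set = np.array(sorted(unique))

print(from_tuple, from_dict, from_set)
```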


Creating Arrays with arange


‣ This is the same concept as the range in standard Python.

‣ For NumPy arrays, use np.arange.
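
A minimal sketch comparing the built-in range with np.arange, assuming NumPy is installed; the name `odds` is hypothetical.

```python
import numpy as np

print(list(range(5)))          # built-in range
print(np.arange(5))            # NumPy counterpart, returns an ndarray

odds = np.arange(1, 10, 2)     # start, stop (exclusive), step
print(odds)
```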


Use the built-in len() function and the size property to get the size of an array.
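
A short sketch of both, with hypothetical arrays. For a one-dimensional array the two agree; for a two-dimensional array, len() counts only the rows while size counts every element.

```python
import numpy as np

arr = np.arange(10)
grid = np.arange(6).reshape(2, 3)

print(len(arr), arr.size)      # same number for a 1-D array
print(len(grid), grid.size)    # rows versus total number of elements
```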


The NumPy array deals with each element by converting it to the same data type. (1/3)
‣ An important characteristic of NumPy arrays is that they process elements of a single type
quickly and efficiently.
‣ An array can be created from values of different types, but every value is converted into
one common type.

Line 28
• An array is created from values of different types: integers, floats, and Booleans.


The NumPy array deals with each element by converting it to the same data type. (2/3)
‣ Because one of the values contains a decimal point (.), the whole array is converted into floats
when integer, float, and Boolean types are mixed.

Line 30
• The data type of a NumPy array is reported as a numpy.dtype object.
Line 31
• An array is created from values of different types: integers, floats, and strings.


The NumPy array deals with each element by converting it to the same data type. (3/3)

Line 33
• Because of the string '111', the whole array is converted into strings when integer, float, and
string types are mixed.
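
The type conversion described on these three slides can be sketched as follows. The sample values are hypothetical apart from the string '111' mentioned in the text.

```python
import numpy as np

# Integer, float, and Boolean together: everything becomes a float.
mixed_numbers = np.array([1, 2.5, True])
print(mixed_numbers, mixed_numbers.dtype)

# Integer, float, and string together: everything becomes a string.
mixed_with_text = np.array([1, 2.5, "111"])
print(mixed_with_text, mixed_with_text.dtype)
```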


The Data Types of NumPy Arrays are as follows:

Data Type                                                        Explanation
int8, int16, int32, int64, int_, uint8, uint16, uint32, uint64   Integer
float16, float32, float64, float128, float_                      Floating point
bool_                                                            Boolean
string_, unicode_                                                String

‣ The 8 in int8 means 8 bits, which determines the range of values that the type can represent.
https://docs.scipy.org/doc/NumPy-1.17.0/reference/arrays.dtypes.html


Explanation of 8 bits

‣ Since one bit represents two values, 8 bits can represent 2 to the power of 8, that is, 256
different integers. For signed integers, the values range from -128 to 127. The upper limit is 127,
and not 128, because zero takes up one of the 128 non-negative values.
‣ If only non-negative values are used, the space used for the range of negative numbers can be
given over to positive numbers. In this case, u, the first letter of unsigned (meaning without a
sign), is used.
‣ Therefore, uint8 ranges from 0 to 255.
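
These ranges can be checked directly with np.iinfo, which reports the limits of an integer type. A minimal sketch, assuming NumPy is installed; the variable names are hypothetical.

```python
import numpy as np

int8_info = np.iinfo(np.int8)
uint8_info = np.iinfo(np.uint8)

print(2 ** 8)                              # number of values 8 bits can represent
print(int8_info.min, int8_info.max)        # signed range
print(uint8_info.min, uint8_info.max)      # unsigned range
```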


Data Types of NumPy Arrays

Line 35
• Here, U10 represents a Unicode string of at most 10 characters, and the < symbol indicates little-endian byte order.
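
As an illustration, an array of strings whose longest element has 10 characters gets a 10-character Unicode dtype. The sample words are hypothetical, and the byte-order prefix shown when printing depends on the machine.

```python
import numpy as np

# The longest string, "strawberry", has 10 characters.
fruits = np.array(["apple", "strawberry"])

print(fruits.dtype)            # Unicode dtype sized for the longest string
print(fruits.dtype.itemsize)   # bytes per element (4 bytes per character)
```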


Creating Arrays With the linspace Function (1/2) – See help(np.linspace) for more detail.
‣ Linear means that something can be expressed as a linear combination of the elements of a set A.
‣ That is, the elements of set A are multiplied by constants and added together. This type of
expression is called a linear combination.
‣ Representing a sequence of basic elements as a linear combination in this way is called linear.
A linear combination pairs coefficients with variables, and the solution can be obtained
through the resulting equation.
‣ The figure below shows the concept of linearity in mathematics and daily life.

Figure: Linear relationship between time and the height of the water bottle


Creating Arrays With the linspace Function (2/2)

Line 38
• Five numbers in the linear space from 1 to 10.
Line 39
• 20 numbers in the linear space from 10 to 10.
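
A sketch of the first case described above, five evenly spaced numbers from 1 to 10. The name `five_points` is hypothetical; note that, unlike arange, linspace includes the end point by default.

```python
import numpy as np

# start, stop (inclusive), number of points
five_points = np.linspace(1, 10, 5)
print(five_points)
```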


Creating Arrays With np.zeros() and np.ones()

Line 40
• Be careful not to skip the “s.”
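
A minimal sketch of both functions; the length 5 is an arbitrary choice for illustration.

```python
import numpy as np

zeros = np.zeros(5)   # five elements, all 0.0
ones = np.ones(5)     # five elements, all 1.0
print(zeros)
print(ones)
```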


Multidimensional Arrays
A multidimensional array refers to an array of two or more dimensions. It can be made from a
nested list, that is, a list inside a list.

Line 43
• The first element of this list is itself a list, and the first element of that inner list is 1.
Line 44
• Two-dimensional array


Check the dimensions of the array with ndim.

Then, let’s check the one-dimensional array that we learned before.
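
A sketch that builds a two-dimensional array from a nested list and compares ndim for two-dimensional and one-dimensional arrays. The sample values and names are hypothetical.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]     # a list inside a list
first_of_first = nested[0][0]       # first element of the first inner list

arr2d = np.array(nested)            # two-dimensional array
arr1d = np.array([1, 3, 5, 7, 9])   # one-dimensional array

print(first_of_first)
print(arr2d.ndim, arr1d.ndim)
```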


We will also make a three-dimensional array.

First surface:   Second surface:
1 2              5 6
3 4              7 8


Three-Dimensional Array

Line 50
• A surface is two-dimensional, and surfaces can be stacked to form three dimensions. It is easy to
understand if you think of each surface as a sheet of paper.
Line 52
• When extracting a value from a three-dimensional array, it is easier to approach it step by step,
one dimension at a time.
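
A sketch of a three-dimensional array built from two stacked 2x2 surfaces, followed by step-by-step extraction. The values 1 to 8 and all names are assumptions chosen for illustration.

```python
import numpy as np

arr3d = np.array([[[1, 2], [3, 4]],     # first surface
                  [[5, 6], [7, 8]]])    # second surface

print(arr3d.ndim, arr3d.shape)

# Approach the value step by step: surface, then row, then column.
second_surface = arr3d[1]
first_row = second_surface[0]
value = first_row[1]
print(value)
```

The same element can be reached in one step with `arr3d[1, 0, 1]`.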


Creating two-dimensional shapes using np.zeros()

Line 53
• Designate it as two rows and three columns in the form of a tuple.

‣ By default, the elements are created as floats.
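
A minimal sketch of a two-row, three-column array of zeros; the name `zeros_2x3` is hypothetical.

```python
import numpy as np

zeros_2x3 = np.zeros((2, 3))    # shape given as a tuple: (rows, columns)
print(zeros_2x3)
print(zeros_2x3.dtype)          # float by default
```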


To create the array with a different type, specify the type with the dtype parameter.


Use the astype() method if you want to change the type to a float.
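
A sketch combining the dtype parameter and the astype() method; the names are hypothetical. astype() returns a new array, so the result is assigned to a variable.

```python
import numpy as np

int_zeros = np.zeros((2, 3), dtype=int)   # created as integers
print(int_zeros.dtype)

float_zeros = int_zeros.astype(float)     # converted copy with float elements
print(float_zeros.dtype)
```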


Properties of the NumPy Array


Explanation of Each Property
‣ Note that since these are properties and not methods, () should not be attached.

Line 63
• size: the total number of elements
Line 64
• shape: the numbers of rows and columns, returned as a tuple


Note that for a one-dimensional array, the shape tuple contains only the number of elements.
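
A sketch of the size and shape properties for a two-dimensional and a one-dimensional array; the sample arrays are hypothetical.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])
line = np.array([1, 3, 5, 7, 9])

print(grid.size)     # total number of elements (no parentheses: a property)
print(grid.shape)    # (rows, columns)
print(line.shape)    # one-element tuple for a 1-D array
```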


Reshape
How to Reshape

Line 70
• Be sure to input the shape as a tuple.


Checking the shape of the object

Line 71
• When checked, the shape of the original object remains the same, because reshape returns a new
object. Thus, the result should be assigned to a variable to keep it.
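
A minimal sketch of reshape, with hypothetical names. The original array keeps its shape; only the returned object has the new one.

```python
import numpy as np

arr = np.arange(6)
reshaped = arr.reshape((2, 3))   # shape passed as a tuple

print(arr.shape)                 # the original is unchanged
print(reshaped.shape)            # the returned object has the new shape
```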


The shape can be changed directly.
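
Changing the shape directly can be sketched as an assignment to the shape property. This is an illustration with a hypothetical array; unlike reshape, it modifies the array in place.

```python
import numpy as np

arr = np.arange(6)
arr.shape = (3, 2)    # the array itself now has three rows and two columns
print(arr)
```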


Random Numbers
Data following the standard normal distribution (mean 0, standard deviation 1) is generated
as random numbers. See help(np.random.randn) for more detail.

Line 78
• The generated data follows the standard normal distribution and has a shape of two rows and three
columns.
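
A minimal sketch of this call; the name `data` is hypothetical and the values differ on every run.

```python
import numpy as np

data = np.random.randn(2, 3)   # two rows, three columns of standard normal samples
print(data)
print(data.shape)
```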


Creating data2 with 100 rows and 100 columns


Go ahead and create data3 as well.

‣ The average is close to zero but not exactly zero. As more data is generated, the average gets
closer to zero.
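
The effect of sample size on the average can be sketched as follows. The 100 by 100 shape of data2 comes from the text; the shape of data3 and the fixed seed are assumptions made here so the sketch is repeatable.

```python
import numpy as np

np.random.seed(0)                       # fixed seed, an assumption for repeatability

data2 = np.random.randn(100, 100)       # 10,000 samples
data3 = np.random.randn(1000, 1000)     # 1,000,000 samples (hypothetical size)

print(data2.mean())
print(data3.mean())                     # usually closer to zero with more data
```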


Comparing Python Lists vs. NumPy Arrays: Processing Speed (1/4)


‣ Let’s compare a Python list and a NumPy array that each store a million integers.
‣ If you run the example below, you can see that the code using NumPy is much faster than the code
written in pure Python.
‣ Through np, the abbreviation of NumPy, you can create an array of 1 million integers starting
from 0 using the arange function.

‣ Here is a list holding the same data for comparison's sake.


Comparing Python Lists vs. NumPy Arrays: Processing Speed (2/4)

Line 88
• The type is reported as numpy.ndarray, which means an array with n dimensions.


Comparing Python Lists vs. NumPy Arrays: Processing Speed (3/4)


‣ The following compares the speed of multiplying each of the 1 million values by 2, repeated 1000
times, for the NumPy array and for the Python list.

‣ %time is an IPython magic command that reports the time of a single execution. Magic commands are
special commands designed to make general tasks and other operations easy to control in the IPython
system.
‣ Magic commands are prefixed with a % sign.

‣ It took 46.5 seconds. There is a difference of more than 20 times.


‣ Once again, understanding the pros and cons of the arrays and lists we learned earlier is
important for performing large-scale big data and AI operations in the future. This is a clear
reason to learn and use NumPy for such operations.
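
The comparison can be sketched outside IPython with time.perf_counter, since %time only works in IPython. This is a minimal sketch under stated assumptions: the names are hypothetical, and the number of repetitions is reduced from the 1000 used in the text so that it runs quickly. Measured times depend on the machine.

```python
import time

import numpy as np

n = 1_000_000
reps = 10   # reduced from 1000 to keep the sketch short-running

my_arr = np.arange(n)          # NumPy array of 1 million integers
my_list = list(range(n))       # Python list with the same values

start = time.perf_counter()
for _ in range(reps):
    arr_doubled = my_arr * 2                      # vectorized multiplication
numpy_seconds = time.perf_counter() - start

start = time.perf_counter()
for _ in range(reps):
    list_doubled = [x * 2 for x in my_list]       # element-by-element loop
list_seconds = time.perf_counter() - start

print(f"NumPy: {numpy_seconds:.3f} s, list: {list_seconds:.3f} s")
```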


Comparing Python Lists vs. NumPy Arrays: Processing Speed (4/4)


‣ The results may differ on each user’s system, as we just saw on the screen.
‣ Wall time, also called wall-clock time, is the total elapsed time from the start to the end of a
program run, including CPU time, I/O, subprograms, etc.
‣ To summarize once more, vectorization is one of the biggest reasons for using NumPy.
‣ Vectorization means performing array operations without explicit loops or indexing.
‣ NumPy’s vectorization runs the array operations in optimized, compiled C code behind the scenes.
For example, loop statements that calculate the weights in machine learning can be replaced with
NumPy vectorization.


Adding Elements to NumPy Arrays


A value is added to one-dimensional data. A one-dimensional array has rank 1. See
help(np.append) for more detail.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 69


1.2. NumPy Array Basics UNIT
01

NumPy.append() method (1/3)


‣ You can extend a NumPy array by calling the np.append() function. Note that, unlike with lists,
the plus ‘+’ operator cannot be used for this purpose.
‣ When an array has more than one axis, you can specify the direction. For a 2D array, the axis
argument set to 0 means vertical extension, while 1 means horizontal extension.
‣ np.append() does not modify the original array; it returns a new array. For the changes to remain,
the result should be assigned to a variable.
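
‣ A minimal sketch of the points above. The array names and values are made up for illustration and are not the slide's original code.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)            # returns a new array; a itself is unchanged
print(a)                       # [1 2 3]
print(b)                       # [1 2 3 4]

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# axis=0: vertical extension, the new data needs the same number of columns
rows_added = np.append(m, [[7, 8, 9]], axis=0)
print(rows_added.shape)        # (3, 3)

# axis=1: horizontal extension, the new data needs the same number of rows
cols_added = np.append(m, [[10], [20]], axis=1)
print(cols_added.shape)        # (2, 4)
```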

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 70


1.2. NumPy Array Basics UNIT
01

NumPy.append() method (2/3)

Line 96
• axis=0 is the row direction, so new rows are added.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 71


1.2. NumPy Array Basics UNIT
01

NumPy.append() method (3/3)

Line 98
• axis=1 is the column direction, so new columns are added.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 72


1.2. NumPy Array Basics UNIT
01

Deleting Elements from NumPy Arrays


Method of Deleting Elements from Arrays (1/2)

Line 102
• Delete the corresponding index element.
Line 103
• The deletion returns a new array. The original array remains the same because the result was
not assigned to a variable.
Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 73
1.2. NumPy Array Basics UNIT
01

Method of Deleting Elements from Arrays (2/2)

Line 106
• Delete the first row.
Line 108
• Delete the second column.
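
‣ The deletion steps can be sketched as follows, assuming the slides use np.delete (the code itself is not shown in the text). The arrays are hypothetical examples.

```python
import numpy as np

a = np.array([10, 20, 30, 40])
np.delete(a, 1)                 # the result is discarded, so a is unchanged
print(a)                        # [10 20 30 40]
a = np.delete(a, 1)             # assign the result to keep the change
print(a)                        # [10 30 40]

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
no_first_row = np.delete(m, 0, axis=0)    # delete the first row
no_second_col = np.delete(m, 1, axis=1)   # delete the second column
print(no_first_row)
print(no_second_col)
```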

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 74




1.3. NumPy Array Operations UNIT
01

Basic Operations
We learned earlier that a distinctive feature of NumPy array operations is that data can be
processed collectively without a for loop, which is called vectorization. In this case, arithmetic
operations between arrays of the same size are applied element by element.
For this reason, popular machine learning and deep learning libraries, such as scikit-learn,
TensorFlow, and PyTorch, are built on NumPy or designed to work closely with NumPy arrays.
‣ This is a summary of the characteristics of NumPy that we’ve learned so far.
• A low-level, high-performance library implemented in C
• Supports fast, memory-efficient operations on the multidimensional array type ndarray
• Supports various computational functions such as linear algebra, random number generation,
Fourier transforms, etc.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 76


1.3. NumPy Array Operations UNIT
01

Comparing Python Lists and NumPy Arrays: + Operator

Line 4
• In a list, the + operator concatenates the two lists.

Line 6
• In a NumPy array, the + operator adds the arrays element by element.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 78


1.3. NumPy Array Operations UNIT
01

If two arrays have the same shape, you can perform element-wise addition, subtraction,
multiplication, and division operations.
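
‣ A short sketch contrasting the + operator on lists and on arrays, followed by the four arithmetic operations on same-shaped arrays. The values are chosen for illustration only.

```python
import numpy as np

list_a, list_b = [1, 2, 3], [4, 5, 6]
joined = list_a + list_b      # lists: concatenation
print(joined)                 # [1, 2, 3, 4, 5, 6]

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
added = a + b                 # arrays: element-wise, [5 7 9]
subtracted = a - b            # [-3 -3 -3]
multiplied = a * b            # [ 4 10 18]
divided = a / b               # [0.25 0.4  0.5 ]
print(added, subtracted, multiplied, divided)
```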

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 79


1.3. NumPy Array Operations UNIT
01

Comparing Python Lists and NumPy Arrays: Multiplication

Line 10
• In a list, the * operator means repetition.
Line 11
• In a NumPy array, the * operator multiplies each element.
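
‣ A minimal sketch of the difference, with illustrative values:

```python
import numpy as np

values = [1, 2, 3]
repeated_list = values * 2        # list: the whole list is repeated
print(repeated_list)              # [1, 2, 3, 1, 2, 3]

arr = np.array(values)
doubled = arr * 2                 # array: every element is multiplied by 2
print(doubled)                    # [2 4 6]
```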

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 80


1.3. NumPy Array Operations UNIT
01

repeat & tile

Line 14
• In NumPy, the repeat function (np.repeat) is used to repeat each element.

Line 15
• In NumPy, the tile function (np.tile) is used to repeat the whole array.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 81


1.3. NumPy Array Operations UNIT
01

When using the tile method, the array can be repeated into a two-dimensional form.
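
‣ A sketch of repeat and tile, including the two-dimensional form. The input array is a made-up example.

```python
import numpy as np

a = np.array([1, 2, 3])

repeated = np.repeat(a, 2)        # each element repeated: [1 1 2 2 3 3]
tiled = np.tile(a, 2)             # whole array repeated: [1 2 3 1 2 3]
tiled_2d = np.tile(a, (2, 2))     # 2 rows, each row is the array repeated twice

print(repeated)
print(tiled)
print(tiled_2d)
```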

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 82


1.3. NumPy Array Operations UNIT
01

Arrays also support exponentiation, applied element by element.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 83


1.3. NumPy Array Operations UNIT
01

Arrays also support comparison operations. Each element is checked against the condition, and
the result holds True where the condition is met and False otherwise.
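
‣ Exponentiation and comparison on an array can be sketched like this (illustrative values):

```python
import numpy as np

a = np.array([1, 2, 3, 4])

squared = a ** 2                        # [ 1  4  9 16]
greater_than_2 = a > 2                  # [False False  True  True]
matches = a == np.array([1, 0, 3, 0])   # [ True False  True False]

print(squared, greater_than_2, matches)
```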

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 84


1.3. NumPy Array Operations UNIT
01

Universal Functions (1/3)


‣ Universal functions operate element by element on whole arrays. See help(np.ufunc) for more
detail.
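
‣ A small sketch of universal functions at work. np.sqrt, np.exp, and np.log are used as examples; the input values are arbitrary.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])

roots = np.sqrt(x)                # applied to every element: [1. 2. 3. 4.]
round_trip = np.log(np.exp(x))    # log undoes exp, giving x back
is_ufunc = isinstance(np.sqrt, np.ufunc)

print(roots, round_trip, is_ufunc)
```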


Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 85


1.3. NumPy Array Operations UNIT
01

Universal Functions (2/3)


‣ NumPy functions:

Function                       Explanation                                        Universal function?
sin, cos, tan                  Trigonometric functions                            Yes
arcsin, arccos, arctan         Inverse trigonometric functions                    Yes
round                          Rounds to a given number of decimals               Yes
floor                          Returns the largest integer not greater than x     Yes
ceil                           Returns the smallest integer not less than x       Yes
fix                            Rounds to the nearest integer toward 0             Yes
prod                           Returns the product of the array elements          No
cumsum                         Returns the cumulative sums of an array            No
sum, mean, var, std, median    Statistical functions                              No
exp, log                       Exponential and logarithmic functions              Yes
unique                         Returns the unique values of an array              No
min, max, argmax, argmin       Minimum, maximum, and the corresponding indices    No

https://docs.scipy.org/doc/numpy-1.17.0/reference/

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 86


1.3. NumPy Array Operations UNIT
01

Universal Functions (3/3)


‣ Statistical methods of NumPy arrays:

Method            Explanation
mean              Average
var               Variance
std               Standard deviation
sum               Total sum
cumsum            The cumulative sums of an array
max, min          The maximum and the minimum of an array
argmax, argmin    Indices of the maximum and the minimum

https://docs.scipy.org/doc/numpy-1.17.0/reference/

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 87


1.3. NumPy Array Operations UNIT
01

Statistical methods of NumPy arrays:

Line 33
• Sum

Line 34
• Average

Line 35
• Average 5.5

Line 36
• Deviation (difference from average)

Line 37
• Square of Deviation

Line 39
• Sum of Squared Deviation

Line 40
• The sum of the squared deviations divided by the number of elements is called the variance.

‣ This result can be calculated directly through the process above, or the var() method can be used.

Line 42
• The square root of the variance is the standard deviation.

‣ This result can be calculated directly through the process above, or the std() method can be used.
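
‣ The step-by-step calculation can be sketched as below. The input x = 1, 2, ..., 10 is an assumption chosen so that the average is 5.5, as on the slide; the variable names are illustrative.

```python
import numpy as np

x = np.arange(1, 11)                      # 1..10 (assumed input)

total = x.sum()                           # sum: 55
average = x.mean()                        # average: 5.5
deviation = x - average                   # difference from the average
squared_deviation = deviation ** 2        # square of each deviation
sum_sq = squared_deviation.sum()          # sum of squared deviations: 82.5
variance = sum_sq / len(x)                # divided by the number of elements: 8.25
std_dev = np.sqrt(variance)               # square root of the variance

# The built-in methods give the same results
print(variance, x.var())
print(std_dev, x.std())
```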

Line 44
• The cumsum() method is used to find the cumulative sum of x.

Line 47
• In this way, you can reshape the array into two rows and three columns. If you specify only the
number of rows, pass -1 for the other dimension and NumPy works it out on its own.
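
‣ A sketch of cumsum() and of reshaping with -1, using a hypothetical six-element array:

```python
import numpy as np

x = np.arange(1, 7)               # [1 2 3 4 5 6]

running_total = x.cumsum()        # [ 1  3  6 10 15 21]

explicit = x.reshape(2, 3)        # two rows and three columns
inferred = x.reshape(2, -1)       # only the rows given; columns are inferred

print(running_total)
print(explicit)
print(inferred.shape)             # (2, 3)
```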


‣ Statistical functions can also be used for each row and column.

Line 52
• The average of 1 and 4 is 2.5. When axis = 0, the operation runs down the rows, giving one result
per column.

Line 53
• The average of 1, 2, and 3 is 2. When axis = 1, the operation runs across the columns, giving one
result per row.
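
‣ The two axis settings can be sketched with a 2 x 3 array whose values match the averages quoted above (the array itself is an assumption):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

column_means = m.mean(axis=0)     # one value per column: [2.5 3.5 4.5]
row_means = m.mean(axis=1)        # one value per row:    [2. 5.]

print(column_means)
print(row_means)
```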

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 95


1.3. NumPy Array Operations UNIT
01

About Random Numbers in NumPy

Line 54
• Setting the seed with np.random.seed fixes the sequence of generated random numbers, so the same
values are produced every time the code is run.


‣ Return random integers from ‘low’ (inclusive) to ‘high’ (exclusive). Refer to help(np.random.randint).

Line 55
• A random number is returned from integers 0 to 9.

Line 56
• This is how you would return a random number from integers 1 to 10.
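
‣ A sketch of seeding and of randint's inclusive low and exclusive high. The seed value 0 and the sample sizes are arbitrary choices, not taken from the slides.

```python
import numpy as np

np.random.seed(0)                                 # arbitrary seed value
first = np.random.randint(0, 10, size=5)          # integers from 0 to 9

np.random.seed(0)                                 # same seed again
second = np.random.randint(0, 10, size=5)         # same numbers as first

one_to_ten = np.random.randint(1, 11, size=1000)  # integers from 1 to 10

print(first, second)
print(one_to_ten.min(), one_to_ten.max())
```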


‣ Write the code below to check if the code will return 10 every time.

Line 58
• In this case, 11 is never returned because the upper bound is exclusive, so the loop repeats
indefinitely. Click (interrupt kernel) to stop the repetition.
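
‣ The same check can be made without an endless loop. This is an alternative sketch, not the slide's code: it draws a fixed number of samples (the seed and sample count are arbitrary) and looks for 10 and 11 among them.

```python
import numpy as np

np.random.seed(42)                                  # arbitrary seed value
draws = np.random.randint(1, 11, size=100_000)      # low=1 inclusive, high=11 exclusive

saw_ten = bool((draws == 10).any())                 # 10 can be returned
saw_eleven = bool((draws == 11).any())              # 11 is outside the range

print(saw_ten, saw_eleven)
```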

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 99


1.3. NumPy Array Operations UNIT
01

Final Summary
‣ Universal functions:

‣ NumPy functions:

Line 65
• Returns the unique values, with duplicates removed.

Line 68
• Rounded to two decimal places.

‣ Universal functions:

Line 77
• Returns the index of the highest value as an integer.
Line 78
• Returns the index of the lowest value as an integer.
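
‣ The functions from this summary can be sketched together as follows. The input values are made up for illustration.

```python
import numpy as np

with_duplicates = np.array([3, 1, 2, 3, 1])
unique_values = np.unique(with_duplicates)        # sorted, duplicates removed: [1 2 3]

rounded = np.round(np.array([3.14159, 2.71828]), 2)   # two decimal places: [3.14 2.72]

data = np.array([4, 9, 2, 7])
index_of_max = data.argmax()                      # position of 9
index_of_min = data.argmin()                      # position of 2

print(unique_values, rounded, index_of_max, index_of_min)
```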

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 106


1.3. NumPy Array Operations UNIT
01

Linear Algebra of the NumPy Array


‣ NumPy supports not only simple operations of arrays but also matrix (two-dimensional array)
operations for linear algebra.
‣ Among the various operations, let’s first look at how to obtain the matrix product, transpose,
inverse, and determinant.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 107


1.3. NumPy Array Operations UNIT
01

Matrix Operations
‣ The following is the method for obtaining matrix product, transpose matrix, inverse matrix, and
determinant for matrices A and B.
• matrix product
• transpose matrix
• inverse matrix
• determinant

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 108


1.3. NumPy Array Operations UNIT
01

Matrix Operations
‣ For matrix operations, make 2x2 matrices A and B as follows:

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 109


1.3. NumPy Array Operations UNIT
01

Matrix Operations
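‣ Example of the matrix product of matrices A and B. Both methods can be used. Note that the *
operator on NumPy arrays multiplies element by element and is not the matrix product.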

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 110


1.3. NumPy Array Operations UNIT
01

Matrix Operations
‣ The following is an example of finding the transpose matrix of matrix A. Both methods can be used.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 111


1.3. NumPy Array Operations UNIT
01

Matrix Operations
‣ How to find the inverse matrix of matrix A

‣ Example of obtaining a determinant of matrix A
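
‣ The four operations can be sketched as follows. The matrices A and B are hypothetical 2x2 examples, and the particular calls (np.dot, the @ operator, .T, np.transpose, np.linalg.inv, np.linalg.det) are one reasonable choice; the slides' own code is not shown in the text.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

product_dot = np.dot(A, B)        # matrix product
product_at = A @ B                # same result with the @ operator

transposed_attr = A.T             # transpose via the T attribute
transposed_func = np.transpose(A) # transpose via the function

inverse = np.linalg.inv(A)        # inverse matrix
determinant = np.linalg.det(A)    # determinant, 1*4 - 2*3 = -2

print(product_dot)
print(inverse)
print(determinant)
```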

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 112


1.3. NumPy Array Operations UNIT
01

Coding Exercise #0101

Follow the practice steps in the ‘ex_0101.ipynb’ file.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 113


1.3. NumPy Array Operations UNIT
01

Coding Exercise #0102

Follow the practice steps in the ‘ex_0102.ipynb’ file.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 114




1.4. NumPy Indexing and Slicing UNIT
01

Indexing and Slicing


Selecting elements in an array by specifying a position or a condition is called indexing.
Selecting elements in an array by specifying a range is called slicing.
‣ To select an element at a specific location in a one-dimensional array, specify the element’s
position as shown below. Positions in the array start at 0.
‣ Array Name[Location]

Indexing
                  The first                    The last
Index                 0     1     2     3     4
Negative index       -5    -4    -3    -2    -1

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 116


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ Indexing a one-dimensional array is like Python list indexing.
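
‣ A minimal sketch with a hypothetical five-element array, showing positive and negative positions:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

first = a[0]          # position 0 is the first element
last = a[4]           # position 4 is the last element
last_neg = a[-1]      # -1 also refers to the last element
first_neg = a[-5]     # -5 also refers to the first element

print(first, last, last_neg, first_neg)
```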

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 117


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ Multi-dimensional arrays can also be accessed like nested list indexing.

Line 5
• Accesses one specific value in the multi-dimensional array.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 118


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ Unlike list indexing, a multidimensional array can also be accessed by separating the dimensions
with commas (,). Each dimension separated by commas is called an axis. Since b is a
two-dimensional array, we can treat it as a matrix. Values in columns 1 and 2 can be accessed as on
the previous page.
‣ They can also be accessed as shown below.
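
‣ Both access styles can be sketched as follows. The array b here is a made-up 2 x 3 example, not necessarily the one on the slide.

```python
import numpy as np

b = np.array([[1, 2, 3],
              [4, 5, 6]])

list_style = b[0][1]      # like nested lists: row 0 first, then element 1
comma_style = b[0, 1]     # NumPy style: row and column separated by a comma

print(list_style, comma_style)    # both select the same element
print(b[1, 2])                    # row 1, column 2
```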

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 119


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ Not only can the array’s elements be retrieved, but the value at a given index can also be
changed as follows.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 120


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ To select multiple elements from a one-dimensional array, pass a list of positions as follows.
‣ The outer brackets are the indexing brackets, and the inner brackets belong to the list.
‣ Array name [[]]
‣ Array name [[position 1, position 2, ..., position n]]

Line 10
• Elements 10, 30, and 40 located at positions 1, 3, and 4 were taken from the one-dimensional
array a1.
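
‣ A sketch of this selection. The contents of a1 are an assumption chosen so that positions 1, 3, and 4 hold 10, 30, and 40.

```python
import numpy as np

a1 = np.array([0, 10, 20, 30, 40, 50])    # assumed contents of a1

selected = a1[[1, 3, 4]]                  # outer brackets index, inner brackets are a list
print(selected)                           # [10 30 40]
```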

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 121


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ To select an element at a specific location in a two-dimensional array, specify the positions of
the row and the column as follows.
‣ Array name [row_position, column_position]
‣ If only the array name [row_position] is entered without a “column_position,” the entire designated
row is selected.
‣ This is how a specific element is selected and retrieved by indexing in a two-dimensional array.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 122


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ In the two-dimensional array a2, the element with a row_position of 0 and a column_position of 2
is selected and retrieved as follows.

‣ As shown below, you can also change the value after selecting an element by specifying the
positions of rows and columns in a two-dimensional array.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 123


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ A method of obtaining the entire row by specifying the “row_position” in a two-dimensional array is
shown below.

‣ The entire row can also be changed by specifying a specific row in a two-dimensional array, as
shown below.
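
‣ The element and row operations on a two-dimensional array can be sketched as below. The contents of a2 and the replacement values are hypothetical.

```python
import numpy as np

a2 = np.arange(10, 100, 10).reshape(3, 3)   # [[10 20 30] [40 50 60] [70 80 90]]

element = a2[0, 2]          # row 0, column 2
a2[2, 2] = 95               # change a single element

row = a2[1].copy()          # the entire second row (copied before it is changed)
a2[1] = [45, 55, 65]        # replace the entire second row

print(element)
print(row)
print(a2)
```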

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 124


1.4. NumPy Indexing and Slicing UNIT
01

Indexing
‣ To select several elements in a two-dimensional array, do as follows.
‣ Array name [[row_position 1, row_position 2, …, row_position n],[column_position 1,
column_position 2, …, column_position n]]
‣ The following is an example of selecting multiple elements by specifying the positions of rows and
columns in a two-dimensional array.

Line 17
• The row and column positions are paired up: 10 is taken from the row [10, 20, 30] and 55 from
the row [45, 55, 65], so 10 and 55 are returned.
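
‣ A sketch of this pairwise selection. The array contents and the index lists [0, 1] and [0, 1] are assumptions consistent with the values 10 and 55 quoted above.

```python
import numpy as np

a2 = np.array([[10, 20, 30],
               [45, 55, 65],
               [70, 80, 90]])

# Positions are paired: (row 0, column 0) and (row 1, column 1)
picked = a2[[0, 1], [0, 1]]
print(picked)               # [10 55]
```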

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 125


1.4. NumPy Indexing and Slicing UNIT
01

Selecting the Array by Specifying the Conditions


‣ Array Name[Condition]

Line 19
• Only elements that meet the conditions of “a>3” are returned.
Line 20
• Only even numbers are returned.
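
‣ Selection by condition can be sketched with a hypothetical array a:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

greater_than_3 = a[a > 3]         # elements that satisfy a > 3
evens = a[a % 2 == 0]             # elements that are even

print(greater_than_3)             # [4 5 6]
print(evens)                      # [2 4 6]
```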

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 126


1.4. NumPy Indexing and Slicing UNIT
01

Array Slicing
Instead of selecting one element through indexing, slicing selects a portion of the array by
specifying a range.
‣ For a one-dimensional array, slicing specifies the positions of the beginning and the end, as shown
below.
‣ Array[start_position:end_position]
‣ If the start position is not specified, it defaults to 0, and the range becomes “0 to end
position -1.” If the end position is not specified, it defaults to the array’s length, and
the range becomes “start_position to the end of the array.”

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 127


1.4. NumPy Indexing and Slicing UNIT
01

Slicing
‣ In a one-dimensional array, slicing can also be performed without specifying a “start_position” or
an “end_position,” as shown below.
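
‣ One-dimensional slicing, with and without explicit start and end positions, can be sketched as follows (illustrative array):

```python
import numpy as np

a = np.array([0, 10, 20, 30, 40, 50])

middle = a[1:4]       # positions 1 to 3; position 4 is not included
head = a[:3]          # start omitted: positions 0 to 2
tail = a[3:]          # end omitted: position 3 to the end of the array
everything = a[:]     # both omitted: the whole array

print(middle, head, tail, everything)
```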

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 128


1.4. NumPy Indexing and Slicing UNIT
01

Slicing
‣ Now, let’s consider the case for a two-dimensional array.

‣ Try slicing the array above.

Line 27
• Select from the beginning through the second row of “arr2d.”

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 129


1.4. NumPy Indexing and Slicing UNIT
01

Slicing
‣ It is also possible to slice multi-dimensional arrays by passing several slices, one per axis.

‣ As shown above, slicing alone always gives a view of the array with the same number of dimensions.
An integer index and a slice can be used together to obtain a lower-dimensional slice.
‣ For example, if you want to select only the first two columns in the second row, do as follows.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 130


1.4. NumPy Indexing and Slicing UNIT
01

Slicing
‣ To select only the third column in the first two rows, do as follows.

‣ A colon by itself selects the entire axis, so a slice with the original number of dimensions is
returned.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 131


1.4. NumPy Indexing and Slicing UNIT
01

Slicing

Code              Shape of Slice
arr[:2, 1:]       (2, 2)
arr[2]            (3,)
arr[2, :]         (3,)
arr[2:, :]        (1, 3)
arr[:, :2]        (3, 2)
arr[1, :2]        (2,)
arr[1:2, :2]      (1, 2)
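
‣ The shapes in the table can be checked with a 3 x 3 array. The values in arr are an assumption; only the shape matters here.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

shapes = {
    "arr[:2, 1:]": arr[:2, 1:].shape,
    "arr[2]": arr[2].shape,
    "arr[2, :]": arr[2, :].shape,
    "arr[2:, :]": arr[2:, :].shape,
    "arr[:, :2]": arr[:, :2].shape,
    "arr[1, :2]": arr[1, :2].shape,
    "arr[1:2, :2]": arr[1:2, :2].shape,
}
for code, shape in shapes.items():
    print(code, shape)

second_row_first_two = arr[1, :2]     # integer index + slice: 1D result
first_two_rows_third = arr[:2, 2]     # third column of the first two rows
```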

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 132


1.4. NumPy Indexing and Slicing UNIT
01

Selecting in Boolean Values

‣ This Boolean array indicates, element by element, whether each entry in the names array is “Bob.”

‣ By indexing with this condition, only the entries whose name is “Bob” are returned.
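
‣ A sketch of Boolean selection by name. The names array and the accompanying data array are made-up examples; each name is assumed to label one row of data.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.arange(28).reshape(7, 4)    # one row of data per name (assumed layout)

is_bob = names == "Bob"               # Boolean array, True where the name is "Bob"
bob_rows = data[is_bob]               # only the rows that belong to "Bob"

print(is_bob)
print(bob_rows)
```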

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 133


1.4. NumPy Indexing and Slicing UNIT
01

Fancy Indexing
np.empty returns a new array of the given shape and type without initializing its entries. See
help(np.empty) for more detail.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 134


1.4. NumPy Indexing and Slicing UNIT
01

Fancy Indexing
‣ To select rows in a specific order, pass an ndarray or a list of integers containing the desired order.

‣ If you use negative numbers as an index, rows are selected from the end.
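
‣ Fancy indexing on rows can be sketched like this. The 8 x 4 array and the chosen row orders are illustrative; np.empty is used only to allocate the array before each row is filled with its own row number.

```python
import numpy as np

arr = np.empty((8, 4))        # uninitialized 8 x 4 array
for i in range(8):
    arr[i] = i                # fill row i with the value i

in_order = arr[[4, 3, 0, 6]]      # rows 4, 3, 0, 6 in that order
from_end = arr[[-3, -5, -7]]      # rows counted from the end: 5, 3, 1

print(in_order)
print(from_end)
```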

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 135


1.4. NumPy Indexing and Slicing UNIT
01

Fancy Indexing
‣ To select only the values that are multiples of 5, the Boolean array for the condition can be
stored in a variable and then used as the index.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 136


1.4. NumPy Indexing and Slicing UNIT
01

Fancy Indexing
‣ It can also be obtained by combining two conditions.
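
‣ A sketch of both cases. The array, the second condition (greater than 10), and the variable names are assumptions. Note that conditions on arrays are combined with & (and) or | (or), and each condition needs its own parentheses.

```python
import numpy as np

values = np.arange(1, 21)                     # 1..20

is_multiple_of_5 = values % 5 == 0            # Boolean array stored in a variable
multiples_of_5 = values[is_multiple_of_5]     # [ 5 10 15 20]

# Two conditions combined with &
large_multiples = values[(values % 5 == 0) & (values > 10)]   # [15 20]

print(multiples_of_5)
print(large_multiples)
```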

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 137




1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition
Array transposition is a special form of reshaping that returns a view in which the data shape has
changed without copying the data. ndarray has a transpose method and a special attribute named T.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 139


1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition
‣ The transpose method can change the order of the dimensions of an ndarray. T reverses the order of
all dimensions. Since this is a frequently used case of transpose, it is provided as a shortcut.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 140


1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition
‣ Linear algebra, such as matrix multiplication, decompositions, determinants, and other square-matrix
math, is an important part of any library dealing with arrays.
‣ Multiplying two two-dimensional arrays with the * operator yields the product of each pair of
corresponding elements, not the matrix product.
‣ Matrix multiplication is calculated using the dot function, which is available both in the NumPy
namespace (np.dot) and as an array method.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 141


1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition
‣ Transposition is often used in matrix calculations, for example when np.dot is used to find the
inner product of a matrix.
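
‣ The difference between * and np.dot, and the inner product with a transpose, can be sketched as follows. The matrix X is a hypothetical 3 x 2 example.

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4],
              [5, 6]])

elementwise = X * X               # product of corresponding elements, shape (3, 2)
inner = np.dot(X.T, X)            # matrix product of X transposed and X, shape (2, 2)
inner_method = X.T.dot(X)         # the same, using the array method

print(elementwise)
print(inner)
```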

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 142


1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition
‣ First, let’s turn a 2D array with 3 rows and 4 columns into one with 4 rows and 3 columns.
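
‣ A sketch of this step, assuming the array is built with np.arange and the change is made by transposition. np.shares_memory is used only to confirm that the result is a view, not a copy.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)     # 3 rows, 4 columns

t = arr.T                             # 4 rows, 3 columns
same = arr.transpose()                # equivalent to arr.T for a 2D array

print(t.shape)                        # (4, 3)
print(np.shares_memory(arr, t))       # True: no data was copied
```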

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 143


1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition
‣ This time, let’s try with a three-dimensional array.
‣ For a three-dimensional array, the transpose method takes a tuple of axis numbers and rearranges
the axes in that order.

Line 16
• Second page, second row, fourth column.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 144


1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition

Line 18
• The order of the first and second axes was reversed, and the last axis remained the same.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 145


1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition
‣ There is a method called swapaxes in the ndarray, which takes two axis numbers and swaps those
two axes.
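
‣ The three-dimensional case can be sketched as below. The array shape (2, 2, 4) and the axis orders are assumptions consistent with the slide notes (second page, second row, fourth column; first and second axes exchanged).

```python
import numpy as np

arr = np.arange(16).reshape(2, 2, 4)      # 2 pages, 2 rows, 4 columns

value = arr[1, 1, 3]                      # second page, second row, fourth column

# Exchange the first and second axes; the last axis stays in place
transposed = arr.transpose((1, 0, 2))

# swapaxes exchanges exactly two axes
swapped = arr.swapaxes(1, 2)              # shape becomes (2, 4, 2)

print(value)
print(transposed.shape, swapped.shape)
```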

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 146


1.5. Array Transposition and Axis Swap UNIT
01

Array Transposition

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 147


1.5. Array Transposition and Axis Swap UNIT
01

Coding Exercise #0103

Follow the practice steps in the ‘ex_0103.ipynb’ file.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 148


Unit 2.

Optimal Data Exploration Through Pandas
2.1. Pipelines: Data Structures According to Data Types
2.2. Pandas Series and DataFrames
2.3. Merging and Binding DataFrames
2.4. DataFrame Sorting and Multi-Index
2.5. Examining the Characteristics of Data Through Descriptive Statistics and Data Samples
Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 149
2.1. Pipelines: Data Structures According to Data Types UNIT
02

Pipelines
Structural Perspective of Data Types
‣ Collected data, from a structural perspective (schema structure or computability), can be divided
into three categories: structured, unstructured, and semi-structured data.
• Structured data refers to data that follows a fixed schema (form), is stored in fixed fields such
as those of an RDB or a spreadsheet, and is consistent in value and format.
• Unstructured data refers to data that does not have a schema structure and is not stored in fixed
fields, such as data from social media, web bulletin boards, and NoSQL stores.
• Semi-structured data refers to data that has a schema (form) structure and contains metadata,
such as XML, HTML, weblogs, system logs, and alarms, but is inconsistent in value and format.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 150


2.1. Pipelines: Data Structures According to Data Types UNIT
02

Python Data Structure Pipeline


‣ Data collected from various sources have various forms and properties, as shown in the figure
below. Thus, for analysis, it is necessary to integrate various data types in the same format so that
computers can understand them.

[Figure: Python data structure pipeline. Source: Internet, File, Database. Format, from unstructured
to structured: plain text, CSV, HTML/XML, JSON, table form. Form for processing: List, Tuple, Set;
Array, Matrix; Frame, Series; Dictionary.]

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 151


2.2. Pandas Series and DataFrames UNIT
02

Pandas Outline
A table is one of the easiest forms of data for a person to understand. Therefore, the ability to
handle tabular data well is the basis of analysis. In Python, the table form is called a DataFrame
and is implemented in the pandas library.
‣ The pandas library is built on NumPy but specializes in more complex data analysis.
‣ While a NumPy array holds data of a single data type, pandas can process different data types
together.
‣ The pandas library is an optimal tool for collecting and organizing data.
‣ Pandas is an important tool that can handle most of the work in data science and is essential for
data scientists.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 153


2.2. Pandas Series and DataFrames UNIT
02

Pandas for Data Analysis


Essential Library for Data Analysis
‣ Statistics-based effective representation and data collection summary
‣ Data alignment and manipulation to merge and weave various types of data
‣ Processing data such as collection, transformation, and function application can be applied to the
entire data bundle.
‣ Since NumPy is a general arithmetic data processing-based library, pandas must be used to process
tabular data for statistics or analysis. Pandas also provides time series processing capabilities that
NumPy lacks.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 154


2.2. Pandas Series and DataFrames UNIT
02

About pandas Data Structure:


‣ To learn about pandas, it is important to familiarize oneself with Series and DataFrame data
structures.
Series Series Series Series
Sex Bloodtype Height Weight
Male 3 23 68.2
Female 2 22 53
Male 4 24 80.1
Male 3 23 85.7
Female 1 20 49.5
Female 2 21 52
Female 1 22 45.3

DataFrame

‣ The basic data structures of pandas include Series (1D) and DataFrame (2D). A DataFrame is a
container for Series, and a Series is a container for scalars (0D). Data can be added to or deleted
from them in a dictionary-like manner.
Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 155
2.2. Pandas Series and DataFrames UNIT
02

Series
A Series is a one-dimensional array of sequentially listed values.
‣ Since it is an array, all elements in the series must belong to one data type.
‣ Simple series can be created from sequences such as lists, tuples, and arrays.
‣ The word ‘series’ is singular AND plural. Its Latin origin, ’serere,’ means to join or connect.
‣ By default, a series has an integer index. The label of the first item is 0, the label of the
second item is 1, and it increases in this manner.
‣ The values attribute of a series holds all the values in the series.
‣ The index attribute holds the index of the series.
‣ The index.values attribute is an array of all index values.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 156


2.2. Pandas Series and DataFrames UNIT
02

How to Create a Series

‣ Creating from a List

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 157


2.2. Pandas Series and DataFrames UNIT
02

How to Create a Series

‣ Creating from NumPy Arrays. See help(pd.Series) for more detail.
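
‣ Creating a Series from a list and from a NumPy array can be sketched as follows. The values are illustrative.

```python
import numpy as np
import pandas as pd

from_list = pd.Series([10, 20, 30])                   # from a Python list
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))     # from a NumPy array

print(from_list)
print(from_list.index)        # default integer index starting at 0
print(from_list.values)       # the values as an array
print(from_array.dtype)       # float64
```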

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 158


2.2. Pandas Series and DataFrames UNIT
02

How to Create a Series


‣ Series can also be created with array-like, Iterable, dict, or scalar value.

Line 06
• We can see that the type of the object is Series.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 159


2.2. Pandas Series and DataFrames UNIT
02

How to Create a Series


‣ To deal with DataFrames in the future, you must first understand the series data and think of the
column part as a series in the figure below.

[Figure: a DataFrame with its rows and columns labeled]

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 160


2.2. Pandas Series and DataFrames UNIT
02

Creating a Series with Dictionary

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 161


2.2. Pandas Series and DataFrames UNIT
02

Creating a Series with Dictionary


‣ The index can be retrieved separately using the index attribute of the Series class.

‣ Retrieving the array of data values separately is also possible. For that, the values attribute of
the Series class is used.
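
‣ A sketch of a Series built from a dictionary, with its index and values retrieved separately. The dictionary contents are made up for illustration.

```python
import pandas as pd

dict_data = {"a": 1, "b": 2, "c": 3}      # keys become the index labels
sr = pd.Series(dict_data)

idx = sr.index                            # the index on its own
val = sr.values                           # the data values on their own

print(sr)
print(idx)
print(val)
```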

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 162


2.2. Pandas Series and DataFrames UNIT
02

Creating a Series with List

Line 16
• The index is represented by an integer RangeIndex object covering 0 to 4. The stop value of the
range itself is not included.

‣ The data value array maintains the order of the list element array of list_data, which is the original
data.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 163


2.2. Pandas Series and DataFrames UNIT
02

DataFrame and Series


‣ Let’s create a simple DataFrame and learn how to manipulate data in a series. This data stores
passenger data of the Titanic. It is assumed that the passenger’s name (string), age (integer),
and gender (male/female) data are known.

Name Age Sex


Braund, Mr. Owen Harris 22 male
Allen, Mr. William Henry 35 male
Bonnell, Miss. Elizabeth 58 female

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 164


2.2. Pandas Series and DataFrames UNIT
02

DataFrame and Series


‣ A DataFrame must be used to represent the preceding table and store data in the form of a table.
The column names, also called column headers, are usually taken from the keys of a dictionary.

Samsung Innovation Campus Chapter 3. Exploratory Data Analysis 165


2.2. Pandas Series and DataFrames UNIT
02

DataFrame and Series

‣ A DataFrame is a two-dimensional data structure that can store a different data type (strings, integers, floating-point values, categorical data, etc.) in each column.
‣ The table has three columns, each with a column label. The column labels are “Name,” “Age,” and “Sex.”

‣ Column “Name” consists of text data in which each value is a string, column “Age” holds integers, and column “Sex” is text data.
‣ It is similar to how a spreadsheet represents a data table.
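
The Titanic table above could be built as follows. This is a sketch using the three rows shown earlier, with the dictionary keys serving as column labels:

```python
import pandas as pd

# Dictionary keys become the column labels; each list is one column.
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
print(df)
```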


The column of a DataFrame is a series.



‣ If you are only interested in the “Age” column and want to pull out the column, use the dictionary
key value.

Line 19
• If you are familiar with dictionaries, selecting a single column is very similar to selecting a dictionary value by its key.


Line 21
• It is possible to index a series.
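
A minimal sketch of pulling out the “Age” column and indexing the resulting Series (the DataFrame is rebuilt here so the example is self-contained):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
        "Age": [22, 35, 58],
    }
)

ages = df["Age"]   # select one column, like looking up a dictionary key
print(type(ages))  # a single column is a pandas Series
print(ages[0])     # index the Series by its label 0
```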



‣ Let’s create a DataFrame from a list of lists, which organizes the data in two-dimensional form.

Line 23
• Since no column names are specified, the columns appear as a RangeIndex object.
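
A small sketch of this behavior, using hypothetical values:

```python
import pandas as pd

# Each inner list is one row; no column names are given.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

print(df.columns)  # a RangeIndex, because no names were specified
```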


Creating a series by designating index

Line 29
• Note that string data in the series is recognized as the object data type.


‣ The attributes of the series are as follows.

Line 32
• Designate the index name with the name attribute.



‣ The methods of the series are as follows.


Line 39
• Only the unique values are returned.
Line 40
• This is the number of unique values.



‣ Now, let’s try it with duplicated values.


Line 44
• Only the unique values are returned.
Line 46
• The value_counts() method counts how many times each value occurs. It is used quite often, so be sure to remember it.
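
A minimal sketch of these three methods on a hypothetical Series with duplicated values:

```python
import pandas as pd

ser = pd.Series([1, 2, 2, 3, 3, 3])  # hypothetical values

print(ser.unique())        # the distinct values
print(ser.nunique())       # how many distinct values there are
print(ser.value_counts())  # how often each value occurs
```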



‣ Series can be indexed and sliced like a NumPy array.



‣ Operations in the series are also possible.


ser1   ser2   ser1+ser2

0      4      4
1      3      4
2      2      4
3      1      4
4      0      4


ser1   ser2   ser1*ser2

0      4      0
1      3      3
2      2      4
3      1      3
4      0      0
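
The two tables above can be reproduced with a short sketch; ser1 and ser2 hold the values shown in the tables:

```python
import pandas as pd

ser1 = pd.Series([0, 1, 2, 3, 4])
ser2 = pd.Series([4, 3, 2, 1, 0])

added = ser1 + ser2       # element-wise addition
multiplied = ser1 * ser2  # element-wise multiplication

print(added.tolist())
print(multiplied.tolist())
```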



‣ Methods Related to Statistics in the Series
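
The slide's own code is not reproduced here; as a hedged illustration, a few commonly used statistical methods of a Series look like this (values are hypothetical):

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])  # hypothetical values

print(ser.sum())     # total
print(ser.mean())    # arithmetic mean
print(ser.median())  # median
print(ser.std())     # sample standard deviation
print(ser.max(), ser.min())
```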



‣ The lambda function can be applied to the series using the apply function.

Line 65
• For the lambda function, refer to what we learned in Python Basics.
Line 67
• Add 10 to each element and return the result as a Series.
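
A minimal sketch of apply() with a lambda function, using hypothetical values:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])  # hypothetical values

# apply() calls the lambda once per element and returns a new Series.
result = ser.apply(lambda x: x + 10)
print(result.tolist())
```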


Pandas Series and DataFrame Practice


Two-Dimensional Array
‣ As mentioned, a DataFrame is a two-dimensional array consisting of rows and columns. The two-dimensional array is the most commonly used form of data for computers.
‣ In image processing and data analysis, most data is searched and processed in the form of a two-dimensional array. The DataFrame data structure of pandas originated from the data frame of R, a popular statistical package.
‣ It is similar to a two-dimensional NumPy array, but the data type may differ from column to column. It is ideal for storing data from CSV files, Excel spreadsheets, SQL tables, etc. It has properties such as columns and index.
‣ A DataFrame can be thought of as a collection of Series objects. Pandas DataFrames allow for data representation similar to Excel spreadsheets.
‣ A DataFrame consists of rows and columns. Rows are observations or instances. Columns are variables.



‣ A pandas DataFrame is very convenient for reading an external data set, especially when the data is stored in comma-separated values (CSV) format.
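
As a sketch of reading CSV data, the example below uses an in-memory string instead of a file so that it runs anywhere; with a real file, a path would be passed to read_csv() instead. The column names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for an external file.
csv_text = "name,age\nTom,15\nAlice,10\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```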



‣ DataFrame objects can be created from dictionary objects.


Things to Check After Bringing in the DataFrame


1) df.info()
‣ This method prints information about a DataFrame, including the index dtype and columns, non-null values, and memory usage. See help(df.info) for more detail.
2) df.head()


Line 9
• It shows only the first five rows. It is used to quickly view the state or values of the data.


Line 10
• It shows only the last five rows.


Line 11
• Used to check the column names.
• iris_df.columns  # caution: columns is an attribute, not a method, so () is not used.


Line 12
• Used to check the index.
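
The checks described on these slides can be sketched together. The iris_df below is a small hypothetical stand-in with made-up rows, not the data set loaded in the slides:

```python
import pandas as pd

# Hypothetical stand-in for the iris data used in the slides.
iris_df = pd.DataFrame(
    {
        "sepal.length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
        "sepal.width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9],
        "variety": ["Setosa"] * 6,
    }
)

iris_df.info()          # dtypes, non-null counts, memory usage
print(iris_df.head())   # first five rows
print(iris_df.tail())   # last five rows
print(iris_df.columns)  # attribute, so no parentheses
print(iris_df.index)    # attribute, so no parentheses
```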



‣ You can also change the column name, as shown below.

Line 14
• You can see that the “.” in the column names has changed to “_”.
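
One way to make this change is sketched below. The column names are hypothetical, and the slide may use a different approach, such as assigning a new list of names:

```python
import pandas as pd

iris_df = pd.DataFrame(
    {"sepal.length": [5.1, 4.9], "sepal.width": [3.5, 3.0], "variety": ["Setosa", "Setosa"]}
)

# Replace the literal "." in every column name with "_".
iris_df.columns = iris_df.columns.str.replace(".", "_", regex=False)
print(iris_df.columns.tolist())
```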



‣ Indexing and slicing are also possible in DataFrames.


Several columns are brought into the two-dimensional form, as shown below.



‣ A DataFrame can also be created from a two-dimensional array. Let’s turn the table below into a DataFrame built from a two-dimensional array.

Name    Age   Sex      School

Tom     15    Male     middle
Alice   10    Female   elementary


‣ As shown above, column names are not recognized when only the data is entered. Thus, you have to specify the columns yourself when creating the DataFrame.
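
A sketch of building the Tom/Alice table from a two-dimensional list, passing the column names explicitly:

```python
import pandas as pd

data = [
    ["Tom", 15, "Male", "middle"],
    ["Alice", 10, "Female", "elementary"],
]

# Without columns=..., the column labels would be 0, 1, 2, 3.
df = pd.DataFrame(data, columns=["Name", "Age", "Sex", "School"])
print(df)
```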



‣ The row index and column names of a DataFrame df, available as df.index and df.columns, can be changed by assigning a new list to the corresponding attribute.
• # Row Index Change: DataFrame Object.index = new row index list
• # Column Name Change: DataFrame Object.columns = new column name list
‣ At this point, no index has been specified.
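
A minimal sketch of both assignments; the labels stu1/stu2 and the two-column layout are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame([["Tom", 15], ["Alice", 10]])  # default integer labels

df.index = ["stu1", "stu2"]   # new row index list
df.columns = ["Name", "Age"]  # new column name list
print(df)
```

The new list must have the same length as the axis it replaces; otherwise pandas raises an error.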



‣ Columns can also be changed.



‣ It is also possible to select and replace only part of the row index or column names using the rename() method of the DataFrame. Note, however, that the original object is not modified directly; a new DataFrame object is returned. Use the inplace=True option to change the original object.
• # Row Index Change: DataFrame Object.rename(index={Existing index: New index, …})
• # Column Name Change: DataFrame Object.rename(columns={Existing name: New name, …})
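
A sketch of rename() with hypothetical labels, showing that the original is left unchanged unless the result is reassigned or inplace=True is used:

```python
import pandas as pd

df = pd.DataFrame(
    {"student_name": ["Tom", "Alice"], "age": [15, 10]},
    index=["stu1", "stu2"],
)

# rename() returns a new DataFrame; df itself is not modified.
renamed = df.rename(index={"stu2": "student2"}, columns={"student_name": "stu_name"})

# inplace=True modifies the object it is called on (a copy here).
df_inplace = df.copy()
df_inplace.rename(index={"stu2": "student2"}, inplace=True)

print(df.index.tolist(), renamed.index.tolist(), df_inplace.index.tolist())
```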


Line 29
• Looking at it again, notice that it didn’t change. Let’s try the inplace=True option.



‣ If you don’t want to use the inplace=True option, you can assign the new object to the same name, but it’s a bit cumbersome.
‣ If you want to change the index 'stu2' to 'student2', do it as follows.



‣ Column names can be changed in the same way.
‣ When changing 'student_name' to 'stu_name', the following two methods can be used.


Deleting Rows and Columns


‣ To delete a row or column of a DataFrame, use the drop() method. To delete rows, enter axis=0 as the axis option, or omit the option, since axis=0 is the default. If axis=1 is entered as the axis option, columns are deleted. To delete multiple rows or columns at the same time, enter them in the form of a list.
• Delete Row: DataFrame Object.drop(row index or list, axis=0)
• Delete Column: DataFrame Object.drop(column name or list, axis=1)

Line 36
• Convert the data into a DataFrame with the DataFrame() function and save it to the variable df.



‣ Let’s create a DataFrame with an index as follows.

‣ Always make a habit of copying the original before deleting rows. If data is deleted by mistake, it can then be restored from the copy instead of being loaded again, which saves a lot of time when the amount of data is large.
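
A sketch of copying first and then dropping rows and columns. The scores and the labels stu1 to stu3 are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [90, 80, 70], "english": [98, 89, 95], "music": [85, 95, 100]},
    index=["stu1", "stu2", "stu3"],
)

df2 = df.copy()                     # work on a copy, keep the original
df2 = df2.drop("stu2", axis=0)      # delete one row
df3 = df.drop(["stu1", "stu3"])     # delete several rows (axis=0 is the default)
df4 = df.copy().drop("math", axis=1)  # delete one column

print(df2.index.tolist(), df3.index.tolist(), df4.columns.tolist())
```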


Line 43
• The original remains.



‣ The same goes for deleting columns.

Line 49
• Replicate the DataFrame df and store it in the variable df4. Delete one column of df4.



‣ Use the row index to select one row.

Line 57-1
• Use the loc indexer.
Line 57-2
• Use the iloc indexer.
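
A minimal sketch of the two indexers, with hypothetical data: loc selects by index label, iloc by integer position.

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [90, 80, 70], "music": [85, 95, 100]},
    index=["stu1", "stu2", "stu3"],
)

row_by_label = df.loc["stu1"]   # label-based selection
row_by_position = df.iloc[0]    # position-based selection

print(row_by_label)
print(row_by_position)
```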


Selecting Columns

Line 63
• Select only the “math” score data. Save it to variable math1.


Line 65
• Select only the “english” score data. Save it to variable english.

‣ Select “music” and “science” score data. Save it to variable music_sci.

‣ If you want to keep the two-dimensional form (a DataFrame) even though there is only one column, pass the column name inside a list, that is, use double brackets [[ ]].
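
The column selections on these slides can be sketched as follows; the scores are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [90, 80], "english": [98, 89], "music": [85, 95], "science": [70, 75]}
)

math1 = df["math"]                    # one column -> Series
music_sci = df[["music", "science"]]  # list of columns -> DataFrame
math2 = df[["math"]]                  # one column in a list -> DataFrame

print(type(math1), type(music_sci), type(math2))
```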


Designating Index

‣ There is no index yet, but one of the columns may be selected as the index.
‣ Designate the column “Name” as the new index and reflect the change in the df object.

Line 74
• If the “Name” header is printed one line below the other column headers, the column has been set as the index.
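
A sketch of designating a column as the index with set_index(). The names and scores are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["honggildong", "hongeedong"], "music": [85, 95], "phys_tra": [100, 90]}
)

# set_index() returns a new DataFrame; reassign to reflect the change in df.
df = df.set_index("Name")
print(df)
```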


Line 77
• Select one specific element of the DataFrame df (“music” score of “honggildong”).


Line 79
• Select two or more elements of the DataFrame df (“music” and “phys_tra” scores of
“honggildong”).


Line 81
• This can also be done with the integer index iloc.


Line 83
• It is also possible through slicing.


Line 87
• Select an element from two or more rows and columns of df (“music” and “phys_tra” scores of “honggildong” and “hongeedong”).
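
The element selections described on the preceding slides can be sketched together. The scores and the third student are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["honggildong", "hongeedong", "hongsamdong"],
        "math": [90, 80, 70],
        "music": [85, 95, 100],
        "phys_tra": [100, 90, 90],
    }
).set_index("Name")

a = df.loc["honggildong", "music"]                # one element
b = df.loc["honggildong", ["music", "phys_tra"]]  # two elements by label
c = df.iloc[0, [1, 2]]                            # same, by integer position
d = df.loc["honggildong", "music":"phys_tra"]     # same, by label slicing
e = df.loc[["honggildong", "hongeedong"], ["music", "phys_tra"]]  # 2 x 2 block

print(a, b.tolist(), e.shape)
```

Note that label slicing with loc includes the end label, unlike position slicing with iloc.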


Add a Column

Line 94
• Add a “kor” score column to the DataFrame df. The value of the data is 80.


Add a Row


Line 97
• Add a new row – enter the same element value.


Line 99
• Add a new row – enter an array of element values.
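
A sketch of adding a “kor” column and then two rows. The labels and scores are hypothetical, and the example uses numeric columns only so that a single fill value fits every column:

```python
import pandas as pd

df = pd.DataFrame({"math": [90, 80], "music": [85, 95]}, index=["stu1", "stu2"])

df["kor"] = 80                   # new column, same value in every row
df.loc["stu3"] = 0               # new row, same value in every column
df.loc["stu4"] = [60, 70, 90]    # new row from a list of values

print(df)
```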

‣ help(df.reset_index)


Line 109
• When the index is set this way, pay close attention when adding rows.


Change the DataFrame Element


Line 113
• Designate the column “name” as a new index and reflect the changes to the df object.


Line 115
• Changing a specific element of the DataFrame df: there are several ways to change the “phy” score of “stu1.”


Line 121
• Changing several elements of the DataFrame df at once: the “mus” and “phy” scores of “stu1.”
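
A sketch of changing elements through loc and iloc. The scores are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"mus": [85, 95], "phy": [100, 90]}, index=["stu1", "stu2"])

df.loc["stu1", "phy"] = 80                   # one element, by label
df.iloc[0, 1] = 70                           # the same element, by position
df.loc["stu1", ["mus", "phy"]] = [50, 100]   # several elements at once

print(df)
```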


Transpose the DataFrame


Line 127
• Transposing the DataFrame df (using the transpose() method).


Line 129
• Transposing the DataFrame df again (using the T attribute).
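
A minimal sketch of both forms of transposition, with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"math": [90, 80, 70], "music": [85, 95, 100]}, index=["stu1", "stu2", "stu3"])

df_t = df.transpose()  # method form: rows and columns are swapped
df_tt = df_t.T         # attribute form: transposing again restores the layout

print(df_t)
```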



‣ Index Setting

Line 132
• A specific column is set as a row index of a DataFrame.


Line 136
• Multi-index



‣ Index

Line 138-1
• Definition of the dictionary

Line 138-2
• Converting the dictionary into a DataFrame. Specify the index as [r0, r1, r2].


Line 139
• Redesignate the index to [r0, r1, r2, r3, r4].


Line 140
• Fill in the NaN value generated by reindexing with the number 0.
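
The reindexing steps can be sketched as follows. The column names c0/c1 and the values are assumptions; the index labels follow the slides:

```python
import pandas as pd

dict_data = {"c0": [1, 2, 3], "c1": [4, 5, 6]}
df = pd.DataFrame(dict_data, index=["r0", "r1", "r2"])

new_index = ["r0", "r1", "r2", "r3", "r4"]
ndf = df.reindex(new_index)                 # new rows r3 and r4 are NaN
ndf2 = df.reindex(new_index, fill_value=0)  # new rows are filled with 0

print(ndf)
print(ndf2)
```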


Line 142-1
• Definition of the dictionary

Line 142-2
• Converting the dictionary into a DataFrame. Specify the index as [r0, r1, r2].


Line 143
• Reset the row index to the default integer index.


Line 145-1
• Definition of the dictionary

Line 145-2
• Converting the dictionary into a DataFrame. Specify the index as [r0, r1, r2].


Line 146
• Sort row indices in descending order.


Line 147-1
• Definition of the dictionary
Line 147-2
• Converting the dictionary into a DataFrame. Specify the index as [r0, r1, r2].


Line 148
• Sort in descending order based on column c1.
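
Resetting and sorting, as described on the last few slides, can be sketched in one place. The column c0 and all values are hypothetical; c1 and the index labels follow the slides:

```python
import pandas as pd

df = pd.DataFrame({"c0": [1, 2, 3], "c1": [4, 5, 6]}, index=["r0", "r1", "r2"])

reset = df.reset_index()                           # old index becomes a column
by_index = df.sort_index(ascending=False)          # row labels, descending
by_c1 = df.sort_values(by="c1", ascending=False)   # values of c1, descending

print(reset)
print(by_index)
print(by_c1)
```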


Series Operations
Series vs. Numbers
‣ Adding a number to a Series object adds that number to each element of the Series and returns the calculated result as a new Series object.


Line 3
• Create a pandas series with data from a dictionary.
Line 4
• Divide the student’s scores by 200 per subject.
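
A sketch of a Series-versus-number operation. The scores are hypothetical; the division by 200 follows the slide:

```python
import pandas as pd

student1 = pd.Series({"kor": 100, "eng": 80, "math": 90})

# The scalar is applied to every element; the result is a new Series.
percentage = student1 / 200
print(percentage)
```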


Series vs. Series


‣ For every index label in the two Series, the elements that share the same label are calculated together.

[Figure: two Series with the index labels kor, eng, and math]

Line 5
• Create a pandas series with data from a dictionary.
Line 5-4 ~ 5-7
• Perform the four fundamental arithmetic calculations based on the scores of each student per
subject.


[Figure: result DataFrame with the rows ‘addition’, ‘subtraction’, ‘multiplication’, and ‘division’ and one column per subject (kor, eng, math)]

Line 6
• Combine the results of the arithmetic operation into a DataFrame (series -> DataFrame).

‣ In the example above, the two Series list the subject names (index labels) in different orders. However, pandas aligns the elements by subject name and calculates between the scores that share the same subject name.
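
The alignment behavior can be sketched as follows. The scores are hypothetical; the two Series deliberately list the subjects in different orders:

```python
import pandas as pd

student1 = pd.Series({"kor": 100, "eng": 80, "math": 90})
student2 = pd.Series({"math": 80, "kor": 90, "eng": 80})

addition = student1 + student2
subtraction = student1 - student2
multiplication = student1 * student2
division = student1 / student2

# Combine the four result Series into one DataFrame (Series -> DataFrame).
result = pd.DataFrame(
    [addition, subtraction, multiplication, division],
    index=["addition", "subtraction", "multiplication", "division"],
)
print(result)
```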



‣ If the two Series have different numbers of elements, or have the same size but different index labels, any label that exists in only one of them is treated as not-a-number (NaN), meaning that there is no valid value.

[Figure: two Series; one with the index labels kor, eng, and math, the other with math and kor]

Line 7
• Create a pandas series with data from a dictionary.

Line 7-4 ~ 5-7


• Perform the four fundamental arithmetic calculations based on the scores of each student per
subject.
(Series vs. Series)


[Figure: result DataFrame with the rows ‘addition’, ‘subtraction’, ‘multiplication’, and ‘division’ and one column per subject (kor, eng, math)]

Line 8
• Combine the results of the arithmetic operation into a DataFrame (series -> DataFrame).


Operation Method
‣ As you learned in the previous slides, the result is NaN when an index label is not shared by both objects or when a value is NaN. To avoid this result, set a fill value with the fill_value option of the operation methods.

[Figure: two Series; one with the index labels kor, eng, and math, the other with math and kor]

Line 10
• Create a pandas series with data from a dictionary.
Line 10-4 ~ 5-7
• Perform the four fundamental arithmetic calculations based on the scores of each student per
subject.
(Use the Operation Method)
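
A sketch contrasting the plain operator with the add() method and its fill_value option. The scores, and the NaN placed in “kor”, are hypothetical:

```python
import numpy as np
import pandas as pd

student1 = pd.Series({"kor": np.nan, "eng": 80, "math": 90})
student2 = pd.Series({"math": 80, "kor": 90})

plain = student1 + student2                    # NaN for kor and eng
filled = student1.add(student2, fill_value=0)  # missing values count as 0

print(plain)
print(filled)
```

The sub(), mul(), and div() methods accept the same fill_value option.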


[Figure: result DataFrame with the rows ‘addition’, ‘subtraction’, ‘multiplication’, and ‘division’ and one column per subject (kor, math, eng)]

Line 11
• Combine the results of the arithmetic operation into a DataFrame (series -> DataFrame).


DataFrame Operations
DataFrame operations can be understood as an extension of Series operations. The two objects are first aligned on their row and column indexes, and the calculation is then carried out one by one between corresponding elements.


DataFrame vs. Numbers


‣ The seaborn library will be covered later in visualization; here, it is used only to load the data.

Line 12
• Create a DataFrame by selecting two columns, age and fare, from the titanic dataset.


Line 13
• Create a DataFrame by selecting two columns, age and fare, from the titanic dataset.
Line 13-3
• Shows only the first five lines.


Line 14-1
• Add 10 to the DataFrame.
Line 14-2
• Shows only the first five lines.

‣ The shape of the existing DataFrame is maintained; only the element values are replaced with the newly calculated values, and the result is returned as a new DataFrame object.
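
A sketch of adding a number to a DataFrame. Instead of loading the titanic data set through seaborn as the slides do, it uses a small hand-made DataFrame with hypothetical age and fare values, so it runs without a download:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the age and fare columns of the titanic data.
df = pd.DataFrame({"age": [22.0, 38.0, np.nan], "fare": [7.25, 71.5, 8.0]})

addition = df + 10  # 10 is added to every element; NaN stays NaN
print(addition.head())
```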


DataFrame vs. DataFrame


‣ Elements in the same row and column positions of each DataFrame are calculated. If an element is missing on one side or is NaN, the calculation result is treated as NaN.

Line 15
• Create a DataFrame by selecting two columns, age and fare, from the titanic dataset.


Line 16
• Add 10 to the DataFrame.


Line 17-1
• Calculate between DataFrames (addition - df).
Line 17-2
• Shows only the last five rows.
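
A sketch of an operation between two DataFrames, again with a small hypothetical stand-in for the titanic columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, 38.0, np.nan], "fare": [7.25, 71.5, 8.0]})
addition = df + 10

# Elements at the same row and column positions are subtracted.
subtraction = addition - df
print(subtraction.tail())
```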


Coding Exercise #0104

Follow practice steps on 'ex_0104.ipynb' file


Coding Exercise #0105_Edited

Follow practice steps on 'ex_0105_Edited.ipynb' file



Unit 2.

Optimal Data Exploration Through Pandas

2.1. Pipelines: Data Structures According to Data Types
2.2. Pandas Series and DataFrames
2.3. Merging and Binding DataFrames
2.4. DataFrame Sorting and Multi-Index
2.5. Examining the Characteristics of Data Through Descriptive Statistics and Data Samples
2.3. Merging and Binding DataFrames UNIT
02

DataFrame Manipulation
Merging DataFrames:

A

Name             Gender   Age
Harry Potter     Male     23
David Baker      Male     31
John Smith       Male     22
Juan Martinez    Male     36
Jane Connor      Female   30

B

Name             Position    Wage
John Smith       Intern      25000
Alex Du Bois     Team Lead   75000
Joanne Rowling   Manager     90000
Jane Connor      Manager     70000


Merging DataFrames: A.Name and B.Name used as key


Merging DataFrames: Inner join


A.Name        Gender   Age   B.Name        Position   Wage

John Smith    Male     22    John Smith    Intern     25000
Jane Connor   Female   30    Jane Connor   Manager    70000


Merging DataFrames: Left join

A B

Name Gender Age Name Position Wage


Harry Potter Male 23 John Smith Intern 25000
David Baker Male 31 Alex Du Bois Team Lead 75000
John Smith Male 22 Joanne Rowling Manager 90000
Juan Martinez Male 36 Jane Connor Manager 70000
Jane Connor Female 30


Merging DataFrames: Left join

A.Name Gender Age B.Name Position Wage


Harry Potter Male 23 NA NA NA
David Baker Male 31 NA NA NA
John Smith Male 22 John Smith Intern 25000
Juan Martinez Male 36 NA NA NA
Jane Connor Female 30 Jane Connor Manager 70000


Merging DataFrames: Right join

A B

Name Gender Age Name Position Wage


Harry Potter Male 23 John Smith Intern 25000
David Baker Male 31 Alex Du Bois Team Lead 75000
John Smith Male 22 Joanne Rowling Manager 90000
Juan Martinez Male 36 Jane Connor Manager 70000
Jane Connor Female 30


Merging DataFrames: Right join

A.Name Gender Age B.Name Position Wage


John Smith Male 22 John Smith Intern 25000
NA NA NA Alex Du Bois Team Lead 75000
NA NA NA Joanne Rowling Manager 90000
Jane Connor Female 30 Jane Connor Manager 70000


Merging DataFrames: Full outer join

A B

Name Gender Age Name Position Wage


Harry Potter Male 23 John Smith Intern 25000
David Baker Male 31 Alex Du Bois Team Lead 75000
John Smith Male 22 Joanne Rowling Manager 90000
Juan Martinez Male 36 Jane Connor Manager 70000
Jane Connor Female 30


Merging DataFrames: Full outer join

A.Name Gender Age B.Name Position Wage


Harry Potter Male 23 NA NA NA
David Baker Male 31 NA NA NA
John Smith Male 22 John Smith Intern 25000
Juan Martinez Male 36 NA NA NA
Jane Connor Female 30 Jane Connor Manager 70000
NA NA NA Alex Du Bois Team Lead 75000
NA NA NA Joanne Rowling Manager 90000


Merging and Binding DataFrames:

Line 1, 2
• Inner join

Line 3
• Left join

Line 4
• Right join
Line 5
• Full outer join
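The code listing these line notes refer to is not reproduced here. As a minimal sketch, assuming tables A and B from the preceding slides and "Name" as the key, the four join types can be written with pd.merge():

```python
import pandas as pd

# Tables A and B as shown on the slides above.
A = pd.DataFrame({
    "Name": ["Harry Potter", "David Baker", "John Smith", "Juan Martinez", "Jane Connor"],
    "Gender": ["Male", "Male", "Male", "Male", "Female"],
    "Age": [23, 31, 22, 36, 30],
})
B = pd.DataFrame({
    "Name": ["John Smith", "Alex Du Bois", "Joanne Rowling", "Jane Connor"],
    "Position": ["Intern", "Team Lead", "Manager", "Manager"],
    "Wage": [25000, 75000, 90000, 70000],
})

inner = pd.merge(A, B, on="Name")               # inner join (the default)
left = pd.merge(A, B, on="Name", how="left")    # keep every row of A
right = pd.merge(A, B, on="Name", how="right")  # keep every row of B
outer = pd.merge(A, B, on="Name", how="outer")  # keep every row of both

print(len(inner), len(left), len(right), len(outer))  # 2 5 4 7
```

Rows without a match on the other side are filled with NaN, which pandas uses where the slides show NA.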


Merging and Binding DataFrames:

Line 1
• Bind vertically by matching the column names.

Line 2
• Bind horizontally by matching the indices.
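A small sketch of the two binding directions, using hypothetical DataFrames df1, df2, and df3 (these names and values are illustrative, not taken from the original listing):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
df3 = pd.DataFrame({"c": [9, 10]})

# Bind vertically: rows are stacked and columns are matched by name.
vertical = pd.concat([df1, df2])

# Bind horizontally: columns are placed side by side and rows are matched by index.
horizontal = pd.concat([df1, df3], axis=1)

print(vertical.shape)    # (4, 2)
print(horizontal.shape)  # (2, 3)
```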


Binding DataFrames
‣ When data is spread across several sources, it may be necessary to combine or bind it into one
DataFrame. In pandas, the functions used to combine or bind DataFrames include concat(), merge(), and join().
‣ pandas.concat(list of DataFrames)
‣ If the axis is not specified, the default option axis=0 is applied and the DataFrames are connected in the
up/down (row) direction.

‣ Column “d” is filled with NaN for rows 0, 1, 2, and 3, which come from df1, because df1 has no column “d.”


Binding DataFrames
‣ See help(pd.concat) for more detail. The ignore_index=True option discards the original indices and renumbers the rows from 0.
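The following sketch illustrates both points, assuming two hypothetical DataFrames in which only df2 has a column "d" (the values are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": ["a0", "a1", "a2", "a3"], "b": ["b0", "b1", "b2", "b3"]},
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {"a": ["a2", "a3", "a4", "a5"], "b": ["b2", "b3", "b4", "b5"],
     "d": ["d2", "d3", "d4", "d5"]},
    index=[2, 3, 4, 5],
)

# Default: the original indices are kept, so 2 and 3 appear twice.
stacked = pd.concat([df1, df2])

# ignore_index=True: the result is renumbered 0, 1, ..., 7.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked["d"].isna().sum())   # 4 rows from df1 have no value for "d"
print(list(renumbered.index))
```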


Binding DataFrames
‣ The axis=1 option binds the DataFrames in the left/right (column) direction.


Merging DataFrames
‣ The merge() function merges two DataFrames by a certain criterion in a manner similar to SQL’s
join command. The column or index used as the criterion is referred to as a key. The key must exist
in both DataFrames.

# Creating a DataFrame with Stock Market Data


Merging DataFrames

Hanmi
Pharmaceutical

NS Shopping

E-mart
Green Cross Medical
Science Corporation


Merging DataFrames

Harim Co.
Meritz
Financial
Group
E-mart

Samyang

‣ The key here should be an id column whose values are unique (non-overlapping).


Merging DataFrames
‣ See help(pd.merge) for more detail. Let’s simply pass the two DataFrames as arguments and merge them.

E-mart E-mart
Samyang Samyang
Chong Kun Chong Kun
Dang Dang
Group Group
ModeTour ModeTour
Reit Reit


Merging DataFrames

E-mart E-mart
Samyang Samyang
Chong Kun Chong Kun
Dang Dang
Group Group
ModeTour ModeTour
Reit Reit

‣ If there are no common column names, an error occurs.
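A minimal sketch of this failure, using hypothetical column names. With no shared column name, pd.merge() raises pandas.errors.MergeError. One way around it, which is an assumption here and not necessarily the approach used in the original exercise, is to name the key columns explicitly with left_on and right_on:

```python
import pandas as pd

left = pd.DataFrame({"id_1": [1, 2, 3], "name_1": ["x", "y", "z"]})
right = pd.DataFrame({"id_2": [2, 3, 4], "name_2": ["y", "z", "w"]})

try:
    pd.merge(left, right)          # no common column names
    failed = False
except pd.errors.MergeError:
    failed = True

# Name the key column on each side explicitly.
merged = pd.merge(left, right, left_on="id_1", right_on="id_2")

print(failed)        # True
print(len(merged))   # 2 (ids 2 and 3 exist on both sides)
```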


Merging DataFrames
‣ Here is how to solve the error.
‣ Load the file again.


Merging DataFrames
‣ Here is how to solve the error.

E-mart E-mart
Samyang Samyang
Chong Kun Chong Kun
Dang Dang
Group Group
ModeTour ModeTour
Reit Reit


Merging DataFrames
‣ The on=None and how='inner' options are applied as default values. The on=None option means
that all columns common to the two DataFrames are used as the reference (key).
‣ The how='inner' option means that only rows whose key values appear in both DataFrames (the
intersection) are returned.
‣ The merge returned the five stocks that exist in both DataFrames, based on the column “id.”
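The defaults can be illustrated with a small sketch. The two DataFrames below are hypothetical stand-ins for the stock data (ids, names, and values are invented), sharing only the column "id":

```python
import pandas as pd

stock_price = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "stock_name": ["Company A", "Company B", "Company C", "Company D"],
    "price": [100, 200, 300, 400],
})
stock_value = pd.DataFrame({
    "id": [3, 4, 5],
    "name": ["Company C", "Company D", "Company E"],
    "eps": [1.5, 2.5, 3.5],
})

# on=None: the only common column, "id", becomes the key.
# how='inner': only ids present in both DataFrames are kept.
merged = pd.merge(stock_price, stock_value)

print(list(merged["id"]))   # [3, 4]
```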


Merging DataFrames
‣ Let’s deliberately rename the columns and see what happens when both the left DataFrame and
the right DataFrame have the columns “id_1” and “name_1.”

E-mart E-mart
Samyang Samyang
Chong Kun Chong Kun
Dang Dang
Group Group
ModeTour ModeTour
Reit Reit

‣ When the column names are the same in both DataFrames, the rows are merged based on the
values in those common columns.


Merging DataFrames
‣ If you use how='left', all the companies in the left DataFrame are returned, and the columns for
those not found in the right DataFrame are filled with NaN.

Hanmi
Pharmaceutical

NS Shopping
E-mart E-mart
Green Cross
Medical Science
Corporation
Samyang Samyang
Chong Kun Dang Chong Kun
Group Dang Group
Cuckoo Electronics

ToolGen Inc.
ModeTour
ModeTour Reit
Reit


Merging DataFrames
‣ If you use how='right', all the companies in the right DataFrame are returned, and the columns for
those not found in the left DataFrame are filled with NaN.

Harim Co.
Meritz
Financial
Group
E-mart E-mart

Samyang Samyang
Hankook
Tire
NHN
Entertainment
Chong Kun Chong Kun
Dang Group Dang Group
ModeTour ModeTour
Reit Reit
Samsung
Biologics


Merging DataFrames
‣ If you use how='outer', all the rows from both the left and right DataFrames are returned.

Hanmi
Pharmaceutical

NS Shopping

E-mart E-mart
Green Cross
Medical Science
Corporation
Samyang Samyang
Chong Kun Dang Chong Kun
Group Dang Group
Cuckoo Electronics

ToolGen Inc.
ModeTour
ModeTour Reit Reit
Harim
Co.
Meritz
Financial
Group
Hankook
Tire
NHN
Entertainment
Samsung
Biologics


Coding Exercise #0106

Follow practice steps on 'ex_0106.ipynb' file


Coding Exercise #0107

Follow practice steps on 'ex_0107.ipynb' file

2.4. DataFrame Sorting and Multi-Index UNIT
02

DataFrame Manipulation
Sorting:
‣ It is possible to sort the rows of a DataFrame using one or more columns.

Line 1
• Sort in ascending order.

Line 2
• Sort in descending order.
Line 3
• Sort using two columns.
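The three sorts can be sketched with sort_values(). The DataFrame and its column names below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "dept": ["x", "y", "x", "y"],
    "score": [70, 90, 80, 90],
})

asc = df.sort_values(by="score")                    # ascending (default)
desc = df.sort_values(by="score", ascending=False)  # descending
# Two columns: dept ascending, then score descending within each dept.
two = df.sort_values(by=["dept", "score"], ascending=[True, False])

print(list(asc["score"]))   # [70, 80, 90, 90]
print(list(two["name"]))
```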


Hierarchical indexing with MultiIndex:

Line 2-1
• Column names
Line 2-2
• Labels for the outer layer
Line 2-3
• Labels for the inner layer


Hierarchical indexing with MultiIndex:

Line 2-4
• Create a list of tuples with the labels.
Line 2-5
• Create the MultiIndex.
Line 2-6
• Apply the MultiIndex.
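The steps in the line notes above can be sketched as follows. The data, labels, and level names are hypothetical, since the original listing is not reproduced here:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    columns=["x", "y"],                 # column names
)
outer = ["A", "A", "B", "B"]            # labels for the outer layer
inner = [1, 2, 1, 2]                    # labels for the inner layer

tuples = list(zip(outer, inner))        # list of (outer, inner) tuples
index = pd.MultiIndex.from_tuples(tuples, names=["group", "item"])
df.index = index                        # apply the MultiIndex

print(df.loc["A"])          # all rows of outer group "A"
print(df.loc[("B", 2)])     # one row addressed by both levels
```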

2.5 Examining the Characteristics of Data Through Descriptive Statistics and UNIT
Data Samples 02

DataFrame Summarization
Grouping and Summarizing:


Pivoting:
‣ Manipulate the indices and the columns and then summarize.


Pivoting:
‣ Index by 'Size' and 'Type.' Columns by 'Location.' Values provided by the 'B' column.


Pivoting:

‣ The same as the graph on the right, but fill the missing values with 0.


Pivoting:

‣ The same as the graph on the right with the aggregation function specified.
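A sketch of the three pivot variants described so far, assuming a hypothetical DataFrame with the columns 'Size', 'Type', 'Location', 'A', and 'B' named on the slides (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Size": ["S", "S", "L", "L", "S", "L"],
    "Type": ["X", "Y", "X", "Y", "X", "X"],
    "Location": ["East", "West", "East", "East", "East", "West"],
    "A": [1, 2, 3, 4, 5, 6],
    "B": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Index by Size and Type, columns by Location, values from B (mean by default).
pt = pd.pivot_table(df, index=["Size", "Type"], columns="Location", values="B")

# Same table, with missing combinations filled with 0.
pt0 = pd.pivot_table(df, index=["Size", "Type"], columns="Location",
                     values="B", fill_value=0)

# Same table, with the aggregation function given explicitly.
pts = pd.pivot_table(df, index=["Size", "Type"], columns="Location",
                     values="B", aggfunc="sum", fill_value=0)

print(pt)
print(pts)
```

Any aggregation name accepted by pandas, such as "median", can be passed as aggfunc in the same way.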


Pivoting:
‣ Index by 'Location.' Columns by 'Size' and 'Type.' Values provided by the 'B' column.


Pivoting:

Line 18
• The columns are now a MultiIndex object.


Pivoting:
‣ The aggregation function is numpy.median().


Pivoting:
‣ Group averages of the columns 'A' and 'B'


Pivoting:
‣ Now, with groupby() method. The result is the same.


Pivoting:
‣ Aggregate the columns 'A' and 'B' differently.
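The last three slides can be sketched together. The DataFrame below is hypothetical, and the choice of sum for 'A' and median for 'B' is only an example of aggregating columns differently:

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["X", "Y", "X", "Y"],
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, 40.0],
})

# Group averages of A and B with pivot_table().
means_pivot = pd.pivot_table(df, index="Type", values=["A", "B"], aggfunc="mean")

# The same group averages with groupby().
means_group = df.groupby("Type")[["A", "B"]].mean()

# A different aggregation for each column.
mixed = df.groupby("Type").agg({"A": "sum", "B": "median"})

print(means_pivot)
print(means_group)
print(mixed)
```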


Statistics:

Line 1
• Column sums
Line 2
• Row sums
Line 3
• Column averages without skipping the missing values
Line 4
• Descriptive statistics of the columns (variables)


Statistics:

Line 5
• Non-missing values along the columns
Line 6
• Correlation between the column 'A' and the column 'B'
Line 7
• Correlation matrix taking the numeric variables pair-wise
Line 8
• Correlations between 'A' and the other numeric variables
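The eight operations listed above could look like the following sketch. The DataFrame is hypothetical, and the exact calls in the original listing may differ:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
})

col_sums = df.sum(axis=0)                    # column sums (NaN skipped)
row_sums = df.sum(axis=1)                    # row sums
col_means = df.mean(axis=0, skipna=False)    # NaN if a column has a missing value
summary = df.describe()                      # descriptive statistics per column
counts = df.count()                          # non-missing values per column
corr_ab = df["A"].corr(df["B"])              # correlation between A and B
corr_matrix = df.corr()                      # pair-wise correlation matrix
corr_with_a = df.corrwith(df["A"])           # correlation of A with every column

print(col_means)
print(corr_with_a)
```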


Missing value detection and processing

Line 1
• A DataFrame with True where missing values are found.
Line 2
• Count the missing values for each column.
Line 3
• Proportions of the missing values for each column.
Line 4
• Drop rows where one or more missing values are found.


Missing value detection and processing.

Line 5
• Drop columns where one or more missing values are found.
Line 6
• Drop the rows with fewer than 3 non-missing values.
Line 7
• Fill the missing values with 0.
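A sketch of the seven missing-value operations above on a hypothetical DataFrame (the original listing is not reproduced here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, 2.0, 3.0, 4.0],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [1.0, 2.0, 3.0, np.nan],
})

mask = df.isnull()                        # True where a value is missing
n_missing = df.isnull().sum()             # missing count per column
share = df.isnull().mean()                # missing proportion per column
rows_complete = df.dropna(axis=0)         # drop rows with any missing value
cols_complete = df.dropna(axis=1)         # drop columns with any missing value
rows_min3 = df.dropna(axis=0, thresh=3)   # keep rows with at least 3 non-missing values
filled = df.fillna(0)                     # replace missing values with 0

print(n_missing)
print(rows_min3)
```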


Coding Exercise #0108

Follow practice steps on 'ex_0108.ipynb' file


Coding Exercise #0109

Follow practice steps on 'ex_0109.ipynb' file



Unit 3.

Pandas Data Preprocessing for Opti-


mal Model Execution
3.1. Data Preprocessing 3.4. Checking and Processing
Duplicate Data
3.2. Identifying Data Properties
3.5. Data Feature Engineering
3.3. Checking for Missing Data



3.1. Data Preprocessing UNIT
03

Data Preprocessing
Data scientist survey:
‣ What do data scientists spend the most time doing?

• Cleaning and organizing data: 60%
• Collecting data sets: 19%
• Mining data for patterns: 9%
• Refining algorithms: 4%
• Building training sets: 3%
• Other: 5%

https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#790d18c36f63


Data scientist survey:


‣ What do data scientists least enjoy doing?

• Cleaning and organizing data: 57%
• Collecting data sets: 21%
• Building training sets: 10%
• Refining algorithms: 4%
• Mining data for patterns: 3%
• Other: 5%

https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#790d18c36f63


Operations:

Cleaning

Transformation/
Reshaping

Integration/
Join


Operations:

Scaling/Normalization

Imputation of missing
values

Outlier treatment


Algorithm considerations:
‣ Often, machine learning algorithms can be categorized into “Tree-like” and “non-Tree-like.”
• a) Tree-like algorithms: Tree, Random Forest, AdaBoost, XGBoost, etc.
• b) Non-Tree-like algorithms: Linear Regression, Logistic Regression, SVM, Neural Network, etc.
‣ Scaling/Normalization and Outlier treatment can be needed for non-Tree-like algorithms, but not for
Tree-like algorithms.
‣ This is because Tree-like algorithms partition the configuration space into “patches,” mostly
unaffected by scales or outliers.
‣ Other preprocessing operations are equally applicable for both Tree-like and non-Tree-like.



3.2. Identifying Data Properties UNIT
03

Scale
Data Classification
‣ Numeric (continuous-scale) data and categorical data are the most basic types of structured data.
‣ Numeric data include continuous data, such as wind speed and duration, and discrete data, such
as the number of times an event occurs.
‣ Categorical data take values from a fixed set of categories, such as city names (Washington, New
York, Los Angeles) or types of vehicle (bus, taxi, truck).
‣ Binary data is a special case of categorical data with only two possible values, such as 0/1, yes/no,
or true/false.
‣ Categorical data whose values are ranked, such as ratings (1, 2, 3, 4, 5), are called ordinal data.


Why Classify Data?


‣ Data science software such as R and Python uses this data type information to improve
computational performance.
‣ The type of data determines how calculations on the variable are performed.
‣ For example, ordinal data can be represented as an ordered factor in R (or as an ordered Categorical
in pandas) so that charts, tables, and statistical models keep the order desired by the user.
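As an illustration of how the data type drives computation, the sketch below stores a hypothetical ordinal variable as an ordered categorical in pandas. The labels "low", "medium", and "high" are assumptions chosen for the example:

```python
import pandas as pd

levels = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
rating = pd.Series(["low", "high", "medium", "low"]).astype(levels)

# Sorting and comparisons follow the declared order, not alphabetical order.
ordered_values = list(rating.sort_values())
above_low = (rating > "low").sum()

print(ordered_values)   # ['low', 'low', 'medium', 'high']
print(rating.max())     # high
print(above_low)        # 2
```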



3.3. Checking for Missing Data UNIT
03

Missing Value
Missing Value
‣ A missing value refers to data that is absent or empty.
‣ Missing values can distort the analysis results or make it impossible to apply functions.
‣ In some cases, the missing values in a variable are not related to other variables, so they can be
deleted or replaced according to the situation.
‣ Assuming that each variable follows a specific probability distribution, the distribution parameters
are estimated and used for the replacement.
‣ Common replacement methods are mean replacement, median replacement, and mode replacement.


Checking for Missing Data

Line 1-1
• Import Library.

Line 1-2
• Load the Titanic Dataset.


Checking for Missing Data

Line 1-3
• Calculate the number of NaNs in the Deck column.


Checking for Missing Data

Line 2
• Find the missing data using the isnull() method.


Checking for Missing Data

Line 3
• Find the non-missing data using the notnull() method.


Checking for Missing Data

Line 4
• Calculate the number of missing data using the isnull() method.


Applying the thresh=500 option to the dropna() method with axis=1 keeps only the columns that have at
least 500 non-NaN values.

Line 5
• Delete all columns with fewer than 500 non-NaN values. This removes the deck column (688 NaN values out of 891).


Limit the subset to the column ‘age.’

Line 6
• Delete all rows without age data in the age column. Age column (177 NaN values out of 891).
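Lines 5 and 6 can be sketched on a small hypothetical frame instead of the full Titanic dataset. Here thresh=5 plays the role that 500 plays for the 891-row data. In pandas, the dropna() parameter is spelled thresh and counts the minimum number of non-NaN values required to keep a row or column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22, np.nan, 26, 35, np.nan, 54, 2, 27, 14, 4],            # 2 missing
    "deck": [np.nan, "C", np.nan, "C", np.nan, np.nan, np.nan, "E",
             np.nan, np.nan],                                          # 7 missing
    "fare": [7.3, 71.3, 7.9, 53.1, 8.1, 51.9, 21.1, 11.1, 30.1, 16.7],
})

# Keep only columns with at least 5 non-NaN values: "deck" (3 values) is dropped.
trimmed = df.dropna(axis=1, thresh=5)

# Drop the rows where "age" is missing; other columns are not checked.
age_known = df.dropna(subset=["age"], axis=0)

print(list(trimmed.columns))   # ['age', 'fare']
print(len(age_known))          # 8
```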


Replacing Missing Values


Replacing the Missing Data with the Mean

Line 7-1
• Delete all rows without age data in the age column. Age Column (177 NaN values out of 891).


Replacing the Missing Data with the Mean

Line 7-2
• Calculate the mean of the age column. (Excluding NaN values)
Line 7-3
• Print the first 10 values in the age column. (In row 5, the NaN value is replaced by the mean.)
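A minimal sketch of mean replacement on a hypothetical ten-value age Series with one NaN in row 5:

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
                name="age")

mean_age = age.mean()           # NaN values are skipped by default
filled = age.fillna(mean_age)   # replace NaN with the mean

print(round(mean_age, 2))       # 28.11
print(filled.head(10))
```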


Practice with Missing Data


Checking for Missing Data

Line 8-1
• Load the titanic dataset.
Line 8-2
• Print the NaN value in row 829 of the embark_town column.


Replacing the Missing Data with the Mode

Line 9
• The NaN value of the embark_town column is replaced with the value that appears the most
among the boarding cities.


Replacing the Missing Data with the Mode

Line 10
• Print row 829 of the embark_town column. (The NaN value is replaced by the value of
most_freq.)


Replacing the Missing Data with the Mode

Line 11
• Replace the NaN value of embark_town in row 829 with the immediately preceding value (row 828).
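The two replacement strategies, the most frequent value and the immediately preceding value, can be compared on a short hypothetical Series. The name most_freq follows the slide text, while the data is invented:

```python
import numpy as np
import pandas as pd

town = pd.Series(
    ["Southampton", "Southampton", "Cherbourg", np.nan, "Queenstown", "Southampton"],
    name="embark_town",
)

# Mode replacement: use the value that appears most often.
most_freq = town.value_counts(dropna=True).idxmax()
by_mode = town.fillna(most_freq)

# Forward fill: use the value of the row just before the NaN.
by_ffill = town.ffill()

print(by_mode[3])    # Southampton
print(by_ffill[3])   # Cherbourg
```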



3.4. Checking and Processing Duplicate Data UNIT
03

Processing Duplicate Data


Checking for Duplicate Data

Line 12
• Create a DataFrame with duplicate data.


Checking for Duplicate Data

Line 13
• Find the duplicate values among the entire row data of the DataFrame.


The data in row 1 becomes True because it duplicates the previous row 0.

Line 14
• Find the duplicate value in the specific column data of the DataFrame.


drop_duplicates() Method

Line 15
• Remove the duplicate rows from the DataFrame.


Column Reference Corresponding to the Subset Option

Line 16
• Remove duplicate rows based on columns c2 and c3.
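Lines 12 to 16 can be sketched as follows. The column names c1, c2, and c3 follow the slide text, and the values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "c1": ["a", "a", "b", "a", "b"],
    "c2": [1, 1, 1, 2, 2],
    "c3": [1, 1, 2, 2, 2],
})

dup_rows = df.duplicated()           # True where a whole row repeats an earlier row
dup_c2 = df["c2"].duplicated()       # True where a c2 value repeats an earlier one
unique_rows = df.drop_duplicates()   # remove fully duplicated rows
unique_c2c3 = df.drop_duplicates(subset=["c2", "c3"])   # compare only c2 and c3

print(list(dup_rows))      # [False, True, False, False, False]
print(len(unique_c2c3))    # 3
```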



3.5. Data Feature Engineering UNIT
03

Feature Engineering
Derived variables:
‣ Create one or more variables based on other variable(s).
Ex From the “Date” variable, derive the “Year,” “Month,” “Day,” “Weekday,” etc.

Date Year Month Day Weekday


2009-8-21 2009 8 21 Friday
2013-8-27 2013 8 27 Tuesday
2019-2-11 2019 2 11 Monday
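A short sketch of deriving these variables with the pandas .dt accessor. The dates are written zero-padded here so that parsing is unambiguous:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2009-08-21", "2013-08-27", "2019-02-11"]})
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Weekday"] = df["Date"].dt.day_name()

print(df)
```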


Derived variables:
‣ Create one or more variables based on other variable(s).

Ex From the “X_coordinate” and “Y_coordinate,” derive the “Distance” variable:

$$\text{Distance} = \sqrt{\text{X\_coordinate}^2 + \text{Y\_coordinate}^2}$$

X_coordinate Y_coordinate Distance


3 4 5
5 12 13
9 12 15


Derived variables:
‣ Create one or more variables based on other variable(s).

Ex From the “Price” and “Area (m²),” derive the “PricePerArea” variable:


$$\text{PricePerArea} = \text{Price} / \text{Area}$$

Price Area PricePerArea


300,000 100 3,000
525,000 150 3,500
160,000 80 2,000
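Both derived variables, Distance and PricePerArea, reduce to element-wise column arithmetic. A sketch using the values from the two tables above:

```python
import numpy as np
import pandas as pd

points = pd.DataFrame({"X_coordinate": [3, 5, 9], "Y_coordinate": [4, 12, 12]})
points["Distance"] = np.sqrt(points["X_coordinate"] ** 2 + points["Y_coordinate"] ** 2)

homes = pd.DataFrame({"Price": [300_000, 525_000, 160_000], "Area": [100, 150, 80]})
homes["PricePerArea"] = homes["Price"] / homes["Area"]

print(list(points["Distance"]))      # [5.0, 13.0, 15.0]
print(list(homes["PricePerArea"]))   # [3000.0, 3500.0, 2000.0]
```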


Derived variables:
‣ Principal components with or without dimensional reduction can be regarded as derived variables.
‣ In general, rotated coordinates can be regarded as derived variables.
‣ We can apply mathematical functions such as log() to a variable to create a derived variable.

With log(), we can ameliorate a positively skewed distribution, making it more “normal.”
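The effect of the log transform on skewness can be checked numerically. The sample below is hypothetical, with one large value creating a long right tail:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55, 400], dtype=float)
log_x = np.log(x)   # derived variable; requires strictly positive values

# Sample skewness before and after the transform.
print(x.skew(), log_x.skew())
```

For data that contains zeros, np.log1p() is a common alternative, since log(0) is undefined.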


Dealing with categorical variables: Dummy variables


‣ A dummy variable is a derived variable that can have only 0 or 1 as a value.
‣ A categorical variable generates (category count − 1) dummy variables.

Ex “Gender” variable with two category values: “male” and “female.”


→ Only one dummy variable is required: “Gender_male.”

Gender Gender_male
female 0
male 1


Dealing with categorical variables: Dummy variables


‣ A dummy variable is a derived variable that can have only 0 or 1 as a value.
‣ A categorical variable generates (category count − 1) dummy variables.

Ex “Species” variable with three category values: “setosa,” “versicolor,” and “virginica.”
→ Only two dummy variables are required: “Species_versicolor” and “Species_virginica.”

Species Species_versicolor Species_virginica


setosa 0 0
versicolor 1 0
virginica 0 1
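In pandas, this table can be produced with get_dummies(). The drop_first=True option drops the first category ("setosa"), leaving category count − 1 columns:

```python
import pandas as pd

df = pd.DataFrame({"Species": ["setosa", "versicolor", "virginica"]})

dummies = pd.get_dummies(df, columns=["Species"], drop_first=True, dtype=int)

print(dummies)
```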


Dealing with categorical variables: One hot encoding


‣ Similar to dummy variables.
‣ In this case, there are as many dummy variables as the count of unique category values.

Ex One hot encoding for a variable that can have integer values 0~9.
→ Integer variables often represent category or class rather than the numeric value.

X0 X1 X2 X3 X4 X5 X6 X7 X8 X9

0 0 0 0 0 1 0 0 0 0
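A sketch of one hot encoding for a digit variable. Declaring all ten categories up front is a design choice made here so that every column X0 to X9 exists even when a digit does not occur in the data:

```python
import pandas as pd

# Hypothetical observations; the first one is the value 5 from the table above.
digits = pd.Series(pd.Categorical([5, 0, 9], categories=list(range(10))))

one_hot = pd.get_dummies(digits, prefix="X", prefix_sep="", dtype=int)

print(list(one_hot.columns))
print(one_hot.iloc[0].tolist())   # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```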


Dealing with categorical variables: Label encoding


‣ Encodes a categorical variable by assigning integers to the category values.
‣ Creates a new numeric variable (derived variable).
‣ OK with “Tree-like” algorithms, but unsafe with “non-Tree-like” algorithms.

Ex “Capital” with five category values [“London”, “Moscow”, “Paris”, “Seoul”, “Washington”]
can be encoded as an integer variable “Capital_int” where 0 ↔ “London,” 1 ↔ “Moscow,”
2 ↔ “Paris,” 3 ↔ “Seoul,” 4 ↔ “Washington.”
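One way to obtain this mapping in pandas, shown as a sketch rather than the only option, is to use the integer codes of a Categorical whose category order is fixed explicitly:

```python
import pandas as pd

capital = pd.Series(["Seoul", "London", "Paris", "Washington", "Moscow", "Paris"])
categories = ["London", "Moscow", "Paris", "Seoul", "Washington"]

# The position of each value in `categories` becomes its integer label.
capital_int = pd.Categorical(capital, categories=categories).codes

print(list(capital_int))   # [3, 0, 2, 4, 1, 2]
```

The integers carry no real ordering, which is why the encoding is unsafe for algorithms that treat the values as magnitudes.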


Turning numerical variables into categorical:


‣ From a numerical variable, we can derive a categorical one that contains intervals as values.
‣ These intervals can be quantiles or custom-made.
‣ From the categorical variable, the corresponding dummy variables can be generated as usual.

Ex A numerical “Age” variable can be converted into “Age_10,” “Age_20,” etc.

Age Age_10 Age_20 Age_30 Age_40 Age_50 Age_60 Age_70 Age_80


8 0 0 0 0 0 0 0 0
17 1 0 0 0 0 0 0 0
26 0 1 0 0 0 0 0 0
63 0 0 0 0 0 1 0 0
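The table above can be reproduced with pd.cut() followed by get_dummies(). This sketch assumes ten-year intervals that are closed on the left, and it treats the under-10 group (labeled Age_0 here, a hypothetical name) as the dropped baseline category:

```python
import pandas as pd

age = pd.Series([8, 17, 26, 63], name="Age")

edges = list(range(0, 100, 10))                 # 0, 10, ..., 90
labels = [f"Age_{lo}" for lo in edges[:-1]]     # Age_0, Age_10, ..., Age_80

# right=False makes each interval [lo, lo + 10).
age_group = pd.cut(age, bins=edges, labels=labels, right=False)

# drop_first=True removes Age_0, so an 8-year-old is all zeros.
dummies = pd.get_dummies(age_group, drop_first=True, dtype=int)

print(dummies)
```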


Converting to a Common Measurement Unit


Convert miles per gallon into kilometers per liter (km/L).

Line 2-1
• Generate df with the read_csv() function.


Convert miles per gallon into kilometers per liter (km/L).

Line 2-3
• Designate the column name.


The round(2) method rounds a value to two decimal places.

Line 3-1
• Define the conversion factor from miles per gallon to kilometers per liter (mpg_to_kpl = 0.425).

Line 3-3
• Add the result of multiplying the mpg column by 0.425 to the new column (kpl).

Line 3-7
• Round the kpl column to two decimal places.
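A sketch of the conversion on a hypothetical mpg column. The factor is computed from 1 mile ≈ 1.60934 km and 1 US gallon ≈ 3.78541 L, which gives roughly 0.425:

```python
import pandas as pd

df = pd.DataFrame({"mpg": [18.0, 15.0, 36.0]})

mpg_to_kpl = 1.60934 / 3.78541                   # about 0.425 km/L per mpg
df["kpl"] = (df["mpg"] * mpg_to_kpl).round(2)    # round to two decimal places

print(df)
```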


Data Type Conversion


Check the original data type.

Line 4-1
• Check the data type of each column.


Check the original data type.

Line 4-4
• Check the original value of the horsepower column.


Convert the integer data type into a character data type using a dictionary.

Line 5
• Check the original value of the origin column.


Convert the integer data type into a character data type using a dictionary.

Line 6-1
• Convert the integer data type into a character data type.
Line 6-3 ~ 6-4
• Check the original value and the data type of the origin column.


Convert the string into a categorical type.

Line 7-1
• Convert the string data type of the origin column into a categorical type.

Line 7-4
• Convert the categorical type back to the string type.

Line 8
• Convert the integer type of the model year column into a categorical type.
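
The categorical conversions above can be sketched with astype. The sample rows are hypothetical; the column names origin and model year follow the text.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "origin": ["USA", "JPN", "EU", "USA"],
    "model year": [70, 71, 70, 82],
})

df["origin"] = df["origin"].astype("category")   # string -> category
print(df["origin"].dtype)

df["origin"] = df["origin"].astype("str")        # category -> back to string
print(df["origin"].dtype)

df["model year"] = df["model year"].astype("category")   # integer -> category
print(df["model year"].cat.categories)
```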

Categorical Data Conversion (Division of Sections)


Convert continuous variables into categorical discrete variables.

Line 9-1
• Generate df with the read_csv() function.

Line 9-3 ~ 9-4
• Designate the column name.

Line 9-6 ~ 9-8
• Replace the missing-data marker ("?") in the horsepower column, drop those rows, and convert the column to float.

Line 9-6
• Change “?” to np.nan.

Line 9-7
• Delete the missing data row.

Line 9-8
• Convert the string to a float.

Line 10
• Use the np.histogram function to obtain the list of boundary values that divide the data into three bins.
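
A short sketch of finding the bin boundaries with np.histogram. The six horsepower values are hypothetical; bins=3 follows the text. The variable names count and bin_dividers are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values (already cleaned and converted to float).
hp = pd.Series([46.0, 75.0, 90.0, 110.0, 150.0, 230.0])

# np.histogram returns the count per bin and the bin edges.
count, bin_dividers = np.histogram(hp, bins=3)
print(count)          # number of values in each of the 3 bins
print(bin_dividers)   # 4 boundary values for 3 equal-width bins
```

Three bins always need four boundary values: the minimum, two interior cut points, and the maximum.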

Use the include_lowest=True option so that the first bin includes its lowest boundary value.

['Low output', 'Normal output', 'High output']

Line 11-1
• Designate the names for the 3 bins.

Line 11-3 ~ 11-6
• Assign each value to one of the three bins with the pd.cut function.

Line 11-3
• Data array

Line 11-4
• Boundary value list

Line 11-5
• Bin names

Line 11-6
• Include the first boundary value.
Line 11-8
• Print the first 15 rows of the horsepower column and the hp_bin column.

Normal output
Normal output
Normal output
Normal output
Normal output
High output
High output
High output
High output
High output
High output
Normal output
Normal output
High output
Low output
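
The binning step can be sketched with pd.cut. The horsepower values are hypothetical, so the labels they receive differ from the 15 rows listed above; the bin names and the include_lowest=True option follow the text.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values.
hp = pd.Series([46.0, 75.0, 90.0, 110.0, 150.0, 230.0])
_, bin_dividers = np.histogram(hp, bins=3)

bin_names = ["Low output", "Normal output", "High output"]

hp_bin = pd.cut(
    x=hp,                  # data array
    bins=bin_dividers,     # boundary value list
    labels=bin_names,      # bin names
    include_lowest=True,   # include the first boundary value
)
print(pd.DataFrame({"horsepower": hp, "hp_bin": hp_bin}))

# Without include_lowest, the minimum value falls outside the first bin.
without_lowest = pd.cut(hp, bins=bin_dividers, labels=bin_names)
print(without_lowest.isna().sum())
```

The intervals are closed on the right by default, which is why the smallest value needs include_lowest=True to land in the first bin.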

Dummy Variable
To use categorical data representing categories in machine learning algorithms, convert them
into dummy variables represented by the number 0 or 1.

['Low output', 'Normal output', 'High output']

Line 12-1
• Generate df with the read_csv() function.

Line 12-3 ~ 12-4
• Designate the column name.
Line 12-6 ~ 12-8
• Replace the missing-data marker ("?") in the horsepower column, drop those rows, and convert the column to float.

Line 12-6
• Change "?" to np.nan.
Line 12-7
• Delete the missing data row.

Line 12-8
• Convert the string into a float.
Line 10
• Find a list of boundary values divided by three bins using np.histogram.

Line 12-12
• Designate the name for three bins.

One-Hot Vector
The get_dummies() function converts each distinct value of a categorical variable into its own dummy
variable, which gives a one-hot representation.

Line 13-1 ~ 13-4
• Assign each value to one of the three bins using pd.cut.
Line 13-1
• Data array
Line 13-2
• Boundary value list

Line 13-3
• Bin names

Line 13-4
• Include the first boundary value.

Line 13-6
• Convert the categorical data in the hp_bin column into dummy variables.

Resulting dummy-variable columns: Low output, Normal output, High output
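
A minimal sketch of the dummy-variable conversion with pd.get_dummies. The four hp_bin values are hypothetical. Passing dtype=int is a choice made here so that the result shows 0 and 1; recent pandas versions return True/False by default.

```python
import pandas as pd

bin_names = ["Low output", "Normal output", "High output"]

# Hypothetical hp_bin column, stored as an ordered set of categories.
hp_bin = pd.Series(
    pd.Categorical(
        ["Normal output", "High output", "Low output", "Normal output"],
        categories=bin_names,
    ),
    name="hp_bin",
)

# One new 0/1 column per category; exactly one column is 1 in each row.
horsepower_dummies = pd.get_dummies(hp_bin, dtype=int)
print(horsepower_dummies)
```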

Normalization
Normalize the numeric columns of a DataFrame so that differences in scale between the columns
(variables) do not degrade model performance.

Line 14-1
• Check the maximum value in the statistical summary of the horsepower column.

Line 14-4
• Divide every value in the horsepower column by the absolute value of the column maximum and
store the result.
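
Dividing by the absolute maximum can be sketched as follows. The three horsepower values are hypothetical.

```python
import pandas as pd

# Hypothetical horsepower values.
df = pd.DataFrame({"horsepower": [46.0, 115.0, 230.0]})

print(df["horsepower"].describe())     # summary statistics, including max

# Scale by the absolute value of the maximum; the largest value becomes 1.0.
df["horsepower"] = df["horsepower"] / abs(df["horsepower"].max())
print(df)
```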

Line 15-1
• Check the maximum value (max) and the minimum value (min) in the statistical summary of the
horsepower column.
Line 15-4 ~ 15-6
• Subtract the minimum from every value in the horsepower column, divide by the difference between
the maximum and the minimum, and store the result.
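
Because this step checks both the maximum and the minimum, one common reading is min-max scaling, which maps the column onto the range 0 to 1. The sketch below assumes that reading; the horsepower values and the intermediate names min_x and min_max are hypothetical.

```python
import pandas as pd

# Hypothetical horsepower values.
df = pd.DataFrame({"horsepower": [46.0, 138.0, 230.0]})

print(df["horsepower"].describe())     # summary statistics, including min and max

# Min-max scaling: (x - min) / (max - min)
min_x = df["horsepower"] - df["horsepower"].min()
min_max = df["horsepower"].max() - df["horsepower"].min()
df["horsepower"] = min_x / min_max
print(df)
```

Unlike dividing by the maximum alone, this form also sends the smallest value to exactly 0.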

Coding Exercise #0110

Follow the practice steps in the 'ex_0110.ipynb' file.

Unit 4.

Data Visualization For Various Data Scales
4.1. Intro to Data Visualization
4.2. Graphs for Continuous Data Summary
4.3. Graphs for Categorical Data Summary
4.4. Visualization for Matplotlib & pandas
4.5. Advanced Graphing with Seaborn

4.1. Intro to Data Visualization UNIT
04

Exploring Univariate Data and Multivariate Data


Univariate data refers to data with one variable and is divided into quantitative (continuous) and qualitative (categorical) data.

Variable | Variable Type | Graph Type
Univariate (One Variable) | Continuous Data | Histogram, Box Plot, Violin Plot, Kernel Density Curve
Univariate (One Variable) | Categorical Data (nominal, ordinal) | Bar Chart, Pie Chart

Multivariate data refers to data with two or more variables and is likewise divided into quantitative (continuous) and qualitative (categorical) data.

Variable | Variable Type | Graph Type
Multivariate (Two or More Variables) | Continuous Data | Scatter Plot, Line Plot, Time Series Plot
Multivariate (Two or More Variables) | Categorical Data (nominal, ordinal) | Mosaic Chart

4.2. Graphs for Continuous Data Summary UNIT
04

Basic Visualization Types


Univariate visualization:
‣ One continuous numeric variable: Boxplot

‣ A boxplot is composed of a box, whiskers, outliers, etc.

Univariate visualization:
‣ One continuous numeric variable: Histogram

[Figure: histogram of x; vertical axis: Frequency]
‣ Shows the absolute or relative frequencies of each interval.

‣ The interval width can be adjusted.

Multivariate visualization:
‣ Two continuous numeric variables: Scatter plot

‣ Identify whether a linear relation exists between the two variables.

‣ Different categories can be denoted by different colors or markers (symbols).

Multivariate visualization:
‣ Two continuous numeric variables and one categorical variable: Multiple Scatter plots

‣ Different categories can be plotted separately.
‣ You should make sure that the axis ranges match for proper comparison.

4.3. Graphs for Categorical Data Summary UNIT
04

Basic Visualization Types


Univariate visualization:
‣ One categorical variable: Bar plot

[Figure: bar plot of categories A, B, C, D; vertical axis: Frequency]

‣ Shows the absolute or relative frequencies of each category (type).

Univariate visualization:
‣ One categorical variable: Pie chart

‣ Shows the proportions of each category (type).

Bivariate visualization:
‣ One continuous numeric variable & one categorical variable: Multiple Boxplots

[Figure: side-by-side boxplots for categories A and B]
‣ The number of categories (types) = the number of boxplots

Bivariate visualization:
‣ One continuous numeric variable & one categorical variable: Multiple Histograms

[Figure: two histograms, one for category A and one for category B; vertical axis: Frequency]

‣ The number of categories (types) = the number of histograms
‣ You should make sure that the axis ranges match for proper comparison.

Bivariate visualization:
‣ Two categorical variables: Bar plot

[Figure: bar plot of categories A, B, C, D with bars colored by a secondary variable (m, f)]

‣ Use color to distinguish the categories of the secondary variable.

‣ Use color and dodged bars to distinguish the categories of the secondary variable.

Recommendations
In the following bar plot, can you see a big difference among the categories?
‣ Apparently, yes?

[Figure: bar plot of categories A, B, C, D with the vertical axis running from 70.0 to 71.0]

In the following bar plot, can you see a big difference between the categories?
‣ When the vertical axis starts at zero, you see little difference.

[Figure: bar plot of categories A, B, C, D with the vertical axis running from 0 to 70]
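
The effect of the vertical-axis range can be reproduced with a short Matplotlib sketch. The four values are hypothetical numbers chosen to lie between 70 and 71, and the Agg backend line is only there so the sketch also runs without a display; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [70.2, 70.8, 70.5, 70.9]   # hypothetical values

fig, (ax_zoom, ax_zero) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated axis: small differences look large.
ax_zoom.bar(categories, values)
ax_zoom.set_ylim(70.0, 71.0)
ax_zoom.set_title("Axis from 70 to 71")

# Axis starting at zero: the same differences look small.
ax_zero.bar(categories, values)
ax_zero.set_ylim(0, 75)
ax_zero.set_title("Axis from 0")

fig.tight_layout()
```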

Sometimes, 3D effects should be avoided.


‣ In a 3D pie chart, it is hard to distinguish the relative proportions due to the perspective.

[Figure: 3D pie chart with slices labeled 43.1%, 38.0%, 8.9%, 5.4%, and 4.7%]

4.4. Visualization for Matplotlib & pandas UNIT
04

Basic Matplotlib Visualization


Bar plot
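
A minimal bar plot sketch with the pyplot interface. The categories, heights, and styling are hypothetical, since the code on the original slide is not preserved in this text.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

x = ["A", "B", "C", "D"]   # hypothetical categories
y = [3, 7, 5, 2]           # hypothetical frequencies

plt.figure()
bars = plt.bar(x, y, color="green", alpha=0.5)
plt.xlabel("Category")
plt.ylabel("Frequency")
plt.title("Bar plot")
```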

Histogram
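
A histogram sketch with plt.hist. The data are random numbers generated for illustration, and bins=20 is an arbitrary choice; changing it adjusts the interval width.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=3, size=500)   # hypothetical sample

plt.figure()
# hist returns the counts, the bin edges, and the drawn bars.
counts, edges, patches = plt.hist(data, bins=20, color="green", alpha=0.5)
plt.xlabel("x")
plt.ylabel("Frequency")
```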

Multiple boxplots
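
Several boxplots can share one set of axes by passing a list of arrays to plt.boxplot. The three groups below are hypothetical random samples.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# One array per category (hypothetical data).
groups = [rng.normal(loc, 2, 100) for loc in (10, 15, 20)]

plt.figure()
result = plt.boxplot(groups)               # one box per array
plt.xticks([1, 2, 3], ["A", "B", "C"])     # boxes sit at positions 1, 2, 3
plt.ylabel("Value")
```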

Line plot
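
A line plot sketch with plt.plot. The sine curve and the styling values are hypothetical choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
y = np.sin(x)

plt.figure()
# plot returns a list with one Line2D object per plotted line.
lines = plt.plot(x, y, color="red", linestyle="--", linewidth=2)
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("Line plot")
```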

Scatter plots with the plot() function and linestyle='none':

Scatter plots with scatter() function:
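
The two ways of drawing a scatter plot can be compared side by side: plot() with markers and no connecting line, and scatter(). The data and styling values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)   # hypothetical linear relation plus noise

# Variant 1: plot() with a marker and no connecting line.
plt.figure()
markers = plt.plot(x, y, marker="o", markersize=4, linestyle="none", color="red")

# Variant 2: scatter(), which also accepts per-point sizes and colors.
plt.figure()
points = plt.scatter(x, y, c="blue", marker="o", s=20, alpha=0.5)
```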

Arguments of the plot() function:

Argument Explanation
color Color
alpha Transparency
linewidth Line width
linestyle Line style
marker Marker type
markersize Marker size
markerfacecolor Marker color inside
markeredgecolor Color of the marker edge
markeredgewidth Width of the marker edge

https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html

Values of the linestyle argument:

linestyle Explanation
'none' No line
':' Dotted line
'--' Dashed line
'-.' Dash dot
'-' Continuous line
'steps' In steps (older Matplotlib versions only; newer versions use the drawstyle argument instead)

Values of the marker argument:

marker Explanation
'.' Point
',' Pixel
'o' Circle
'^' Triangle up
'v' Triangle down
's' Square
'*' Star
'+' Plus sign
'x' X character
'D' Diamond
'p' Pentagon
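
The arguments and values listed in the tables above can be combined in a single plot() call. The data points and the specific values chosen here are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot(
    [1, 2, 3, 4], [1, 4, 9, 16],   # hypothetical data
    color="blue",                  # line color
    alpha=0.8,                     # transparency
    linewidth=1.5,                 # line width
    linestyle=":",                 # dotted line
    marker="s",                    # square markers
    markersize=8,                  # marker size
    markerfacecolor="yellow",      # fill color of the marker
    markeredgecolor="black",       # color of the marker edge
    markeredgewidth=1.0,           # width of the marker edge
)
```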

Matplotlib Visualization with Objects


Import required modules.

Visualization with a figure object:

Multiple plots within the same axes:
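
A sketch of the object-based style: create a figure, add one set of axes to it, and draw two curves on those axes. The curves, the axes position, and the labels are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig = plt.figure(figsize=(6, 4))
# [left, bottom, width, height] as fractions of the figure size.
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

# Two plot() calls on the same axes overlay the curves.
ax.plot(x, np.sin(x), color="red", label="sin(x)")
ax.plot(x, np.cos(x), color="blue", linestyle="--", label="cos(x)")

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Two curves on the same axes")
ax.legend(loc="upper right")
```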

Multiple plots in separate axes:

Multiple plots in an array of axes:
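
An array of axes can be created with plt.subplots and indexed like a NumPy array. The four plots and their data are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)

# A 2 x 2 grid; axes is a 2D array of Axes objects.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

axes[0, 0].plot(x, x ** 2)
axes[0, 0].set_title("Line plot")

axes[0, 1].hist(rng.normal(size=200), bins=15)
axes[0, 1].set_title("Histogram")

axes[1, 0].scatter(x, np.sqrt(x))
axes[1, 0].set_title("Scatter plot")

axes[1, 1].boxplot(rng.normal(size=100))
axes[1, 1].set_title("Boxplot")

fig.tight_layout()
```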

pandas Visualization
Import modules and data.

Histogram:
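
pandas objects expose plotting through the plot accessor, which draws with Matplotlib. The sketch below uses a hypothetical mpg column filled with random values.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"mpg": rng.normal(23, 7, 200)})   # hypothetical data

# Series.plot.hist returns the Matplotlib Axes it drew on.
ax = df["mpg"].plot.hist(bins=10, color="orange", alpha=0.6)
ax.set_xlabel("mpg")
```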

Bar plot:

Single scatter plot:

An array of scatter plots:
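
A single scatter plot and an array of scatter plots can be sketched with DataFrame.plot.scatter and pandas.plotting.scatter_matrix. The three numeric columns are hypothetical values generated for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
weight = rng.normal(3000, 800, 100)
df = pd.DataFrame({                      # hypothetical data
    "weight": weight,
    "horsepower": 0.04 * weight + rng.normal(0, 10, 100),
    "mpg": 45 - 0.007 * weight + rng.normal(0, 2, 100),
})

# Single scatter plot of two columns.
ax = df.plot.scatter(x="weight", y="mpg", alpha=0.5)

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(df, figsize=(6, 6), alpha=0.5, diagonal="hist")
```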

Coding Exercise #0111

Follow the practice steps in the 'ex_0111.ipynb' file.

Coding Exercise #0112

Follow the practice steps in the 'ex_0112.ipynb' file.

4.5. Advanced Graphing with Seaborn UNIT
04

Seaborn Visualization Library


What the Seaborn library provides:
‣ Internal dataset: load_dataset()
‣ Basic graphic types: distplot(), jointplot(), kdeplot(), rugplot(), barplot(), countplot(), etc.
‣ Arrays: pairplot(), PairGrid(), FacetGrid(), etc.
‣ With regression (trend) line: lmplot(), jointplot(), etc.
‣ Special graphic types: heatmap(), clustermap(), etc.
‣ Applied graph types: violinplot(), swarmplot(), stripplot(), etc.

Install the Seaborn library.


‣ Run the “!pip install seaborn” command first to install the Seaborn library.
‣ Import modules.

Import data set.


‣ Read in the data set ‘mpg’ from the library.

Histogram:

Histogram + KDE:

Histogram + Rug:

KDE (Kernel Density Estimation):

Scatter plot:

Scatter plot + regression line:

Scatter plot + regression line (using ‘hue’ argument):
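
A scatter plot with one regression line per category can be sketched with sns.lmplot and its hue argument. The columns weight, mpg, and origin mirror names from the 'mpg' data set, but the values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 90
weight = rng.normal(3000, 600, n)
df = pd.DataFrame({                      # hypothetical data
    "weight": weight,
    "mpg": 45 - 0.007 * weight + rng.normal(0, 2, n),
    "origin": np.repeat(["usa", "japan", "europe"], n // 3),
})

# hue splits the points by category and fits a separate line for each.
g = sns.lmplot(data=df, x="weight", y="mpg", hue="origin")
```

lmplot returns a FacetGrid, so the same call can produce multiple plots by also passing col or row.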

Scatter plot + regression line (multiple plots):

Hex:

KDE (Kernel Density Estimation):

One categorical variable + one numeric variable. Aggregated by the mean.

One categorical variable + one numeric variable. Aggregated by the median.

One categorical variable only. Frequency table is implicitly calculated.

Two categorical variables. Use of 'hue' argument.

Horizontal boxplot:

Vertical boxplot:

Multiple boxplots:

Multiple boxplots. Two categorical variables + one numeric variable. Use of 'hue' argument.
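
Boxplots grouped by two categorical variables can be sketched with sns.boxplot: one variable on the x-axis and the other passed as hue. The column names and values below are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({                      # hypothetical data
    "origin": np.tile(["usa", "japan"], 40),
    "cylinders": np.repeat(["4 cyl", "8 cyl"], 40),
    "mpg": rng.normal(23, 5, 80),
})

# One group of boxes per origin; within each group, one box per cylinders value.
ax = sns.boxplot(data=df, x="origin", y="mpg", hue="cylinders")
```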

Multiple violin plots:

Strip plot:

Swarm plot:

Violin plot + Swarm plot:

Color palette (Frequency table shown as bar plot):

Seaborn Visualization Library (with more examples)


Import modules.

Scatter plot array:

Line 2
• Read in the 'iris' data from the library.

Scatter plot array (Color the markers with 'species'.):

Scatter plot array (Color the markers with 'species'. Apply palette.):

PairGrid (An array of scatter plots):

PairGrid (An array of different visualization types):

FacetGrid (Multiple histograms):

FacetGrid (Multiple scatter plots):

Heat map:

Line 11
• Read in 'mpg' data from the library.

Heat map (Correlation matrix):

Heat map (Default heatmap):

Heat map (Use 'cmap' instead of 'palette'.):

Heat map (Adjust the center.):
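
The heat map steps can be sketched as a correlation matrix drawn with sns.heatmap, using cmap for the colors and center to fix the midpoint of the color scale. The data are hypothetical, and the colormap "coolwarm" is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
weight = rng.normal(3000, 800, 100)
df = pd.DataFrame({                      # hypothetical data
    "weight": weight,
    "horsepower": 0.04 * weight + rng.normal(0, 10, 100),
    "mpg": 45 - 0.007 * weight + rng.normal(0, 2, 100),
})

corr = df.corr()   # correlation matrix of the numeric columns

# heatmap takes cmap (not palette); center=0 puts the neutral color at zero.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", center=0, vmin=-1, vmax=1)
```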

Coding Exercise #0113

Follow the practice steps in the 'ex_0113.ipynb' file.

End of
Document

Samsung Innovation Campus


ⓒ2022 SAMSUNG. All rights reserved.
Samsung Electronics Corporate Citizenship Office holds the copyright of book.
This book is a literary property protected by copyright law so reprint and reproduction without permission are prohibited.
To use this book other than the curriculum of Samsung Innovation Campus or to use the entire or part of this book, you must receive written
consent from copyright holder.
