SIC_AI_Chapter 3. Exploratory Data Analysis_v2.1
Samsung Innovation Campus
Artificial Intelligence Course
Chapter 3. Exploratory Data Analysis
Chapter objectives
Understand the precise use of NumPy and be able to process data efficiently.
Learn the basics of NumPy arrays, indexing, and slicing, and the various ways they are applied.
Learn to create and handle series and data frame objects.
Learn the appropriate data preprocessing methods, using the Pandas library to explore and convert data.
Be able to find the appropriate analysis method by implementing a data visualization suitable for
the data scale.
Chapter contents
Data Structure
The most important concepts in programming are data types, data structures, and algorithms. Knowing the clear differences between these three concepts is essential for working comfortably with various programming languages and resolving many errors.
Data Type
‣ First, a data type, in computer science and programming languages, is a classification that identifies a kind of data, such as floats, integers, Booleans, characters, and strings, and also determines the size of the data. Data should always be handled with the applicable data type. Data types differ by programming language, but most languages inherit the data type concepts of the C language, which has influenced many later languages.
‣ Memory is inexpensive today, but when C was introduced, it was expensive and difficult to store a lot of data. As a result, C was designed to store data in as little space as necessary, which eventually led to the data size limits of each type.
‣ For example, in C, integer data types are divided into char (compatible with integers), short, int, and long, each with a different byte size.
Data Structure
‣ Second, a data structure, in computer science, refers to the organization, management, and storage of data that enables efficient access and modification. Finding data in the fastest way relies heavily on the data structure. In other words, since finding data is a matter of algorithms (which will be explained later), a good data structure can be said to be a necessary and sufficient condition for a good algorithm.
‣ To be more specific, the data structure refers to a group of data values, a relationship between data,
and a function or command applicable to the data. Carefully selected data structures make it
possible to use more efficient algorithms.
‣ An effectively designed data structure allows operations to be performed with minimal resources,
such as execution time or memory capacity.
‣ There are several types of data structures, each of which is tailored for each operation and purpose.
‣ When designing various programs, it should be the priority to consider and select the most
appropriate data structure. This is because when manufacturing a large-scale system, the
implementation difficulty and the final product’s performance depend heavily on the data structure.
‣ Once the data structure is selected, it becomes relatively clear which algorithm needs to be applied.
There are times when this order is reversed: when the target operation necessarily requires a particular algorithm, or when a given algorithm produces the best performance with a particular data structure. In any case, it is essential to select an appropriate data structure.
Algorithm
‣ Third, an algorithm is a formulation of a set of procedures or methods to solve any solvable problem
and refers to a step-by-step procedure for executing a calculation.
‣ Algorithms are especially important in the fields of machine learning and deep learning, areas we will cover later, because they require working with a lot of data.
‣ In the end, as the performance of the algorithm is directly related to the performance of the data
structure, it can be said that it is most important to know precisely where to use a certain data
structure.
(Figure: classification of data structures.
• Primitive data structures: Integer, Float, Char, String, Boolean.
• Non-primitive data structures:
  - Linear: Array, List, Tuple, Set, Dictionary, File, Stack, Queue, Deque, and linked lists (singly, doubly, and circular).
  - Non-linear: Tree (general tree, binary tree) and Graph.)
‣ Each element may be accessed through each index. An index is a number representing an order.
Data is sequentially stored in a contiguous location of memory, as shown in the figure below.
(Figure: an array stored in contiguous memory. Elements Blue, Yellow, and Red occupy a[0], a[1], and a[2] and can be reached by random access through their indices. Inserting Green between existing elements requires shifting the following elements, e.g., Blue, Green, Yellow, Red in a[0] to a[3]; deleting it shifts them back.)
‣ To understand lists, you must first understand the concept of the pointer. In simple terms, the
pointer is an address value that points to a certain value.
‣ As shown in the figure above, the blue room (the concept of a variable representing one value will
be expressed as a room) points to the yellow room. Let’s say the blue room has the address of the
yellow room’s memory location. Then each room can point to the next room.
(Figure: a list in memory. Each room (Blue, Yellow, Red) holds a pointer to the next room, and elements are reached by sequential access. Adding Green only requires updating the pointers of its neighbors.)
‣ As shown in the picture above, to add the green, only the pointers are changed: the blue points to the green, and the green points to the yellow.
What is NumPy?
NumPy stands for Numerical Python, which is a basic package for Data Science.
‣ It is a Python Library that provides multidimensional array objects, various derived objects
(matrices, etc.), and an assortment of routines for fast operation on arrays.
‣ It supports discrete Fourier transforms, basic linear algebra, basic statistical operations, random
simulations, etc.
‣ The ndarray object is the core of the NumPy package. It processes n-dimensional arrays of homogeneous data types.
Distinguish the differences between NumPy arrays and standard Python sequences
(Lists, Tuples, Dictionaries, Sets).
‣ NumPy arrays have a fixed size when generated as opposed to Python Lists (which can grow
dynamically).
‣ A change in the size of the ndarray creates a new array and deletes the original.
‣ The elements of a NumPy array are all of one homogeneous data type; thus, they all have the same size in memory.
‣ Python lists, however, allow elements of various types and sizes.
‣ NumPy arrays easily calculate advanced mathematical and other types of operations on large
numbers of data.
‣ Normally, such operations are performed more efficiently and with less code than when performed
with Python’s built-in sequences.
‣ More and more scientific and mathematical Python-based packages are using NumPy arrays. These
typically support Python sequence input, but they convert such input to NumPy arrays before
processing and often output NumPy arrays.
‣ In other words, it is insufficient to rely only on Python’s built-in sequences to efficiently use today’s
scientific/mathematical Python-based software. One also needs to have a knowledge of NumPy
array to increase efficiency.
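As a minimal sketch of this efficiency claim (the array size and variable names are illustrative, not from the lesson's code), compare squaring a large sequence with a Python-level loop versus one vectorized NumPy expression:

```python
import numpy as np

data = list(range(1_000_000))
arr = np.arange(1_000_000)

# Built-in sequence: a Python-level loop over every element.
squares_list = [x * x for x in data]

# NumPy array: a single vectorized expression evaluated in C.
squares_arr = arr ** 2

# Both produce the same values; the NumPy version is typically much faster.
print(squares_list[:3], squares_arr[:3])
```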
‣ A 2x2 matrix A = [[a, b], [c, d]] becomes, through vectorization (column stacking), vec(A) = (a, c, b, d)^T.
‣ In scikit-learn, the NumPy array is the basic data structure. In other words, NumPy arrays should be used as the standard input/output for machine learning.
Line 1
• "as" stands for alias, meaning the module name is abbreviated.
Line 2
• To view the version of NumPy, use the built-in attribute __version__.
First, an array may be created from a sequence (list, tuple, set).
Check the address of each array using the id function, which returns an object's address.
The actual values inside are the same, but the addresses are confirmed to be different because each array is its own object.
(Figure: arr1 and arr2 both contain 1 3 5 7 9 but have different addresses, e.g., 2246716570240 and 2247660678400.)
Line 9
• After assigning the address values, we can see that the address values are the same.
(Figure: after the assignment, arr1 and arr3 share the same address, while arr2 remains a separate object with the same values.)
If you want to copy only the value, you can use np.copy(). See help(np.copy) for more detail.
Line 12
• Since only the value was copied, you can see that the address value is different.
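The address behavior described above can be sketched as follows (the printed id values differ per run; the comparisons are what matter):

```python
import numpy as np

arr1 = np.array([1, 3, 5, 7, 9])
arr2 = np.array([1, 3, 5, 7, 9])
print(id(arr1) == id(arr2))   # False: same values, but each is its own object

arr3 = arr1                   # assignment copies only the reference
print(id(arr1) == id(arr3))   # True: both names point to the same object

arr4 = np.copy(arr1)          # np.copy() copies only the values
arr4[0] = 99
print(arr1[0])                # 1: the original is untouched
```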
Use the len() function and the size attribute for the size of the array.
The NumPy array deals with each element by converting it to the same data type. (1/3)
‣ One important characteristic of NumPy arrays is that they process data of a single type quickly and effectively.
‣ However, although an array can be created from elements of different types, all elements are converted into one common type.
Line 28
• Different types of arrays can be created, such as integers, floats, and Booleans.
The NumPy array deals with each element by converting it to the same data type. (2/3)
‣ When integer, float, and Boolean types are together, the array is converted into floats (note the "." after each integer in the output).
Line 30
• The data type of a NumPy array is returned as a numpy dtype object.
Line 31
• Different types of arrays are created, such as integers, floats, and strings.
The NumPy array deals with each element by converting it to the same data type. (3/3)
Line 33
• When integer, float, and string types are together, the array is converted into strings (note that the numbers appear quoted, like '111', in the output).
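A small sketch of the type conversion rules just described (the element values are illustrative):

```python
import numpy as np

# Integers, floats, and Booleans together: everything becomes a float.
a = np.array([1, 2.5, True])
print(a.dtype)        # float64 (True becomes 1.0)

# Integers, floats, and strings together: everything becomes a string.
b = np.array([1, 2.5, '111'])
print(b.dtype.kind)   # 'U' for Unicode string
```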
bool_ : Boolean
‣ The 8 in int8 means 8 bits, which determines the range of values this type can represent.
https://docs.scipy.org/doc/NumPy-1.17.0/reference/arrays.dtypes.html
Explanation of 8 bits
‣ Since a bit represents two values, 8 bits can represent 2 to the power of 8, or up to 256 integers. For signed integers, this range is -128 to 127. The positive range ends at 127, not 128, because zero occupies one of the 256 values.
‣ If only non-negative values are used, the space reserved for negative numbers can be given over to positive numbers. In this case the prefix u, the first letter of unsigned (meaning no sign), is used.
‣ Therefore, uint8 ranges from 0 to 255.
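These ranges can be checked directly with np.iinfo:

```python
import numpy as np

# np.iinfo reports the representable range of an integer dtype.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128 127
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255
```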
Line 35
• Here, U represents Unicode, 10 means strings of up to 10 characters, and the < symbol indicates little-endian byte order.
Creating Arrays With the linspace Function (1/2) – See help(np.linspace) for more detail.
‣ Linear means that something can be expressed as a linear combination of the elements of a set A.
‣ That is, the elements of set A are each multiplied by constants and added together, forming an expression that belongs to the span of A. This type of expression is called a linear combination.
‣ Representing a quantity as a linear combination of basic elements in this way is called linear. A linear combination pairs coefficients with variables, and a solution can be obtained from the resulting equation.
‣ The figure below shows the concept of linearity in mathematics and daily life.
(Figure: linear relationship between time and the height of the water in a bottle.)
Line 38
• Five numbers in the linear space from 1 to 10.
Line 39
• 20 numbers in the linear space from 1 to 10.
Line 40
• Be careful not to skip the “s.”
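A sketch of np.linspace calls like those described above (the exact arguments in the lesson's code cells are assumptions here):

```python
import numpy as np

# Five evenly spaced numbers in the closed interval [1, 10].
print(np.linspace(1, 10, 5))    # [ 1.    3.25  5.5   7.75 10.  ]

# By default the end point is included; endpoint=False excludes it.
print(np.linspace(0, 1, 5, endpoint=False))
```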
Multidimensional Arrays
A multidimensional array refers to an array of two or more dimensions. It can be made in the form of a list of lists.
Line 43
• The first element in this list is also a list, so the first element in that list is 1.
Line 44
• Two-dimensional Array
Three-Dimensional Array
Line 50
• A surface is two-dimensional and can be stacked. It’s easy to understand if you think of the
surface as a piece of paper.
Line 52
• When extracting a three-dimensional value, it’s easier to think of it as a method of approaching
step by step.
Line 53
• Designate it as two rows and three columns in the form of a tuple.
Use the astype() method if you want to change the type to a float.
Line 63
• Size
Line 64
• Converting the row and columns into a tuple
It should be noted that for one-dimensional arrays, the shape tuple returns just the number of elements.
Reshape
How to Reshape
Line 70
• Be sure to input the shape as a tuple.
Line 71
• When checked, the shape of the original object remains the same; reshape returns a new object, so the result must be assigned to create a new array.
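A short sketch of the reshape behavior (the array contents are illustrative):

```python
import numpy as np

arr = np.arange(6)                 # [0 1 2 3 4 5]

reshaped = arr.reshape((2, 3))     # the shape is given as a tuple
print(arr.shape)                   # (6,)   -- the original object is unchanged
print(reshaped.shape)              # (2, 3) -- so the result must be assigned

# -1 lets NumPy infer the remaining dimension automatically.
print(arr.reshape((3, -1)).shape)  # (3, 2)
```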
Random Numbers
Data following the standard normal distribution (mean 0, standard deviation 1) is generated as random numbers. See help(np.random.randn) for more detail.
Line 78
• The generated data follows the standard normal distribution with shapes in two rows and three
columns.
‣ The average is close to zero but not completely zero. If more data is generated, then the results get
closer to zero.
Line 88
• It returns the result in the form of numpy.ndarray. This means that it is an array with n dimensions.
‣ %time is an IPython magic command that returns a single execution time. This magic command is a
special command designed to easily control general tasks and other operations in the IPython
system.
‣ Magic commands are labeled with a % sign.
Line 96
• axis=0 is a row.
Line 98
• axis=1 is a column.
Line 102
• Delete the corresponding index element.
Line 103
• Although the element was deleted, the array remains the same because no assignment was
made.
1.2. NumPy Array Basics (Unit 01)
Line 106
• Delete the first row.
Line 108
• Delete the second column.
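The delete behavior described in these annotations can be sketched as follows (the array values are illustrative):

```python
import numpy as np

arr = np.arange(1, 7).reshape(2, 3)     # [[1 2 3], [4 5 6]]

# np.delete returns a new array; arr itself is unchanged unless reassigned.
no_first_row = np.delete(arr, 0, axis=0)
no_second_col = np.delete(arr, 1, axis=1)

print(no_first_row)    # [[4 5 6]]
print(no_second_col)   # [[1 3], [4 6]]
print(arr.shape)       # (2, 3) -- the original remains the same
```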
Basic Operations
We learned earlier that a unique feature of NumPy array operations is that data can be processed collectively without a for loop, which is called vectorization. In this case, arithmetic operations between arrays of the same size are applied element by element.
‣ For this reason, popular machine learning and deep learning libraries, such as scikit-learn, TensorFlow, PyTorch, etc., were all built on top of NumPy.
‣ This is a summary of the characteristics of NumPy that we’ve learned so far.
• A low-level, high-performance library implemented in C
• Supports fast, memory-effective multidimensional array ndarray operations
• Supports various computational functions such as linear algebra, random number generator,
Fourier transform, etc.
Line 4
• In a list, the + operator means connecting the given elements.
Line 6
• In a NumPy array, each element is calculated.
If the shape of the array is the same, you can perform addition, subtraction, multiplication, and
division operations.
Line 10
• In a list, the * operator means repetition.
Line 11
• Multiplication in a NumPy array returns the multiplication result of each element.
Line 14
• In NumPy, the repeat method is used to repeat each element.
Line 15
• In NumPy, the tile method is used for repeating the array.
When using the tile method, the array can be repeated into a two-dimensional form.
The array can also perform comparison operations. After checking whether each element
matches the conditions, it returns True if there is a match and False if otherwise.
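These operator differences between lists and arrays can be sketched in a few lines (the values are illustrative):

```python
import numpy as np

a = [1, 2, 3]
b = np.array([1, 2, 3])

print(a + a)             # [1, 2, 3, 1, 2, 3] -- list + concatenates
print(b + b)             # [2 4 6]            -- array + adds elementwise
print(a * 2)             # [1, 2, 3, 1, 2, 3] -- list * repeats
print(b * 2)             # [2 4 6]            -- array * multiplies elementwise

print(np.repeat(b, 2))   # [1 1 2 2 3 3] -- repeat each element
print(np.tile(b, 2))     # [1 2 3 1 2 3] -- repeat the whole array

print(b > 1)             # [False  True  True] -- elementwise comparison
```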
Method Explanation
mean Average
var Variance
std Standard deviation
sum Total sum
cumsum The cumulative sums of an array
max, min The maximum and the minimum of an array
argmax, argmin Indices of the maximum and the minimum
https://docs.scipy.org/doc/NumPy-1.17.0/reference/
Line 33
• Sum
Line 34
• Average
Line 35
• Average 5.5
Line 36
• Deviation (difference from average)
Line 37
• Square of Deviation
Line 39
• Sum of Squared Deviation
Line 40
• The sum of the squared deviations divided by the number of elements is called the variance.
‣ This result can be calculated directly through the process above, or the var() method can be used.
Line 42
• The square root of the variance is the standard deviation.
‣ This result can also be calculated directly, or the std() method can be used.
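The derivation above, using the 1-to-10 example implied by the mean of 5.5, can be checked against var() and std():

```python
import numpy as np

x = np.arange(1, 11)              # 1..10, mean 5.5

dev = x - x.mean()                # deviation (difference from the mean)
var = (dev ** 2).sum() / x.size   # sum of squared deviations / n
std = np.sqrt(var)                # standard deviation = sqrt(variance)

print(np.isclose(var, x.var()))   # True
print(np.isclose(std, x.std()))   # True
```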
Line 44
• The cumsum() method is used to find the cumulative sum of x.
Line 47
• In this way, you can designate it as two rows and three columns. However, if you designate only
a row, use -1 to change the rest on its own.
Line 52
• The average of 1 and 4 is 2.5. When the axis = 0, the operation is carried out along the columns.
Line 53
• The average of 1, 2, and 3 is 2. When the axis = 1, the operation is carried out along the rows.
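A sketch of the axis behavior just described, using the 2x3 values implied by the annotations:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.mean(axis=0))   # [2.5 3.5 4.5] -- down each column (1 and 4 average to 2.5)
print(m.mean(axis=1))   # [2. 5.]       -- across each row (1, 2, 3 average to 2)
```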
Line 54
• Initializing the seed value using np.random.seed() will generate a fixed sequence of random numbers when generating random values.
Line 55
• A random number is returned from integers 0 to 9.
Line 56
• This is how you would return a random number from integers 1 to 10.
Line 58
• In this case, 11 is never returned, so the loop repeats indefinitely. Click the (interrupt kernel) button to stop the repetition.
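Seeding and integer sampling can be sketched as follows (the sample size is illustrative):

```python
import numpy as np

np.random.seed(0)                  # fixing the seed makes the draws reproducible
a = np.random.randint(0, 10, 5)    # five integers from 0 to 9 (10 is excluded)

np.random.seed(0)
b = np.random.randint(0, 10, 5)    # the same seed yields the same numbers

print((a == b).all())              # True
print(0 <= a.min() and a.max() <= 9)   # True
```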
Final Summary
‣ Universal functions:
‣ NumPy functions:
Line 65
• Returns only unique values that are not repeated.
Line 68
• Rounded to two decimal places.
Line 77
• Returns the index location of the highest value as an integer.
Line 78
• Returns the index location of the lowest value as an integer.
Matrix Operations
‣ The following are the methods for obtaining the matrix product, transpose matrix, inverse matrix, and determinant of matrices A and B.
• matrix product
• transpose matrix
• inverse matrix
• determinant
‣ For matrix operations, make 2x2 matrices A and B as follows:
‣ Example of the product of matrices A and B. Both methods can be used.
‣ The following is an example of finding the transpose matrix of matrix A. Both methods can be used.
‣ How to find the inverse matrix of matrix A.
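A sketch of all four operations with illustrative 2x2 matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

product = np.dot(A, B)       # matrix product; A @ B and A.dot(B) are equivalent
transposed = A.T             # transpose; np.transpose(A) is equivalent
inverse = np.linalg.inv(A)   # inverse matrix
det = np.linalg.det(A)       # determinant: 1*4 - 2*3, approximately -2

print(product)               # [[19 22], [43 50]]
print(transposed)            # [[1 3], [2 4]]
```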
Indexing
(Figure: index positions in a five-element array. Positive indices 0 to 4 count from the first element; negative indices -5 to -1 count back from the last, so the first element is index 0 or -5 and the last is index 4 or -1.)
‣ Indexing a one-dimensional array is like Python list indexing.
‣ Multi-dimensional arrays can also be approached like list indexing.
Line 5
• Can access one specific multi-dimensional value.
‣ Unlike list indexing, in a multidimensional array the dimensions can be separated by commas (,). A dimension separated by commas is called an axis. Since b is a two-dimensional array, we can treat it as a matrix. Values in columns 1 and 2 can be accessed as on the previous page.
‣ It can also be accessed as shown below.
‣ Not only can the array’s elements be imported, but the value of the corresponding index can be
changed as follows.
‣ To select multiple elements from a one-dimensional array, store them as follows.
‣ The outer brackets are brackets for indexing, and the inner brackets are brackets for the list.
‣ Array name [[]]
‣ Array name [[position 1, position 2, ..., position n]]
Line 10
• Elements 10, 30, and 40 located at positions 1, 3, and 4 were taken from the one-dimensional
array a1.
‣ To select an element at a specific location in a two-dimensional array, specify the positions of rows
and columns as follows.
‣ Array name [row_position, column_position]
‣ If only the array name [row_position] is entered without a “column_position,” the entire designated
row is selected.
‣ It is a method of selecting and importing a specific element by indexing in a two-dimensional array.
‣ In the two-dimensional array a2, an element having a row_position of 0 and a column_position of 2
is selected and imported as follows.
‣ As shown below, you can also change the value after selecting an element by specifying the
positions of rows and columns in a two-dimensional array.
‣ A method of obtaining the entire row by specifying the “row_position” in a two-dimensional array is
shown below.
‣ The entire row can also be changed by specifying a specific row in a two-dimensional array, as
shown below.
‣ To select several elements in a two-dimensional array, do as follows.
‣ Array name [[row_position 1, row_position 2, ..., row_position n], [column_position 1, column_position 2, ..., column_position n]]
‣ The following is an example of selecting multiple elements by specifying the positions of rows and
columns in a two-dimensional array.
Line 17
• If you select the row first from [10, 20, 30] and the column from [45, 55, 65], then 10 and 55 are
returned.
Line 19
• Only elements that meet the conditions of “a>3” are returned.
Line 20
• Only even numbers are returned.
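The Boolean conditions in these annotations can be sketched as follows (the array values are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

print(a[a > 3])        # [4 5 6] -- only elements meeting the condition
print(a[a % 2 == 0])   # [2 4 6] -- only even numbers

# Conditions combine with & (and) / | (or); parentheses are required.
print(a[(a > 2) & (a % 2 == 0)])   # [4 6]
```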
Array Slicing
Instead of selecting one element through indexing, slicing selects a portion of the array by
specifying a range.
‣ For a one-dimensional array, slicing specifies the positions of the beginning and the end, as shown
below.
‣ Array[start_position:end_position]
‣ If the start position is not specified, the start position becomes 0, and the range becomes “0 to end
position -1.” If the end position is not specified, the “end position” becomes the array’s length, and
the range becomes “start_position to end of the array.”
Slicing
‣ In a one-dimensional array, slicing is performed without specifying a ”start_position” and an
“end_position,” as shown below.
‣ Now, let’s consider the case for a two-dimensional array.
Line 27
• Choose from the beginning to the second line of “arr2d.”
‣ It is also possible to slice multi-dimensional arrays by crossing several indices.
‣ As shown above, slicing always gives a view of the array in the same dimension. An integer index
and a slice can be used together to obtain a lower-dimension slice.
‣ For example, if you want to select only the first two columns in the second row, do as follows.
‣ If you select only the third column in the first two rows, do as follows.
‣ If you just use a colon, you choose the entire axis, so the slice of the original dimension is returned.
arr[2] -> shape (3,)
arr[2, :] -> shape (3,)
arr[2:, :] -> shape (1, 3)
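The shapes above can be reproduced with a small 3x3 example:

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)

print(arr[2].shape)      # (3,)   -- an integer index drops a dimension
print(arr[2, :].shape)   # (3,)   -- same result
print(arr[2:, :].shape)  # (1, 3) -- a slice keeps the dimension
```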
‣ This Boolean array determines whether the name “Bob” is in the names array.
‣ By indexing under this condition, only those whose name is “Bob” can be returned.
Fancy Indexing
Return a new array of given shape and type without initializing entries. See help(np.empty) for
more detail.
‣ To select rows in a specific order, pass an ndarray or a list containing the desired order.
‣ If you use negative numbers as indices, rows are selected from the end.
‣ To index values that are only multiples of 5, the Boolean array for the condition can be put into the
variable and obtained.
‣ It can also be obtained by combining two conditions.
Array Transposition
Array transposition is a special operation that returns a view in which the data shape has changed without copying the data. ndarray has a transpose method and a special attribute named T.
‣ Transpose can change the dimensional order of the ndarray. T can be used to reverse the order of all
dimensions with transpose. Since this is a frequently used function, it was made into a shortcut
function.
‣ Linear algebra, such as matrix multiplication, decomposition, determinants, and operations on square matrices, is an important part of any library dealing with arrays.
‣ Multiplying two two-dimensional arrays by * operators yields the product of each corresponding
element, not the multiplication of the matrix.
‣ Matrix multiplication is calculated using a dot function in the NumPy namespace and an array
method.
‣ It is often used in matrix calculation, and np.dot is also used to find the inner product of matrices.
‣ First, let’s reshape 3 rows and 4 columns in 2D into 4 rows and 3 columns.
‣ This time, let’s try with a three-dimensional array.
‣ Even in a three-dimensional array, the transpose method receives a tuple of axis numbers and permutes the axes accordingly.
Line 16
• Second page, second row, fourth column.
Line 18
• The order of the first and second axes was reversed, and the last axis remained the same.
‣ There is a method called swapaxes in ndarray, which receives two axis numbers and swaps those axes.
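A sketch of transpose, T, and swapaxes on a three-dimensional array (the shape is illustrative):

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)

print(arr.transpose(1, 0, 2).shape)  # (3, 2, 4) -- first two axes exchanged
print(arr.swapaxes(1, 2).shape)      # (2, 4, 3) -- axes 1 and 2 swapped
print(arr.T.shape)                   # (4, 3, 2) -- T reverses all axes
```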
Pipelines
Structural Perspective of Data Types
‣ Collected data, from a structural perspective (schema structure or computability), can be divided
into three categories: structured, unstructured, and semi-structured data.
• Structured data refers to data that has a structure-based form of the structured schema (form)
and is stored in fixed fields such as RDB and spreadsheet and is consistent in value and format.
• Unstructured data refers to data that does not have a schema structure and is not stored in fixed
fields such as social media, web bulletin boards, and NoSQL.
• Semi-structured data refers to data that has a schema (formal) structure and contains metadata, such as XML, HTML, weblogs, system logs, and alarms, and is inconsistent in value and format.
(Figure: examples of data forms, e.g., structured data in a database table form, and semi-structured JSON resembling a dictionary.)
Pandas Outline
A table is the most optimal form of data that a person can understand. Therefore, the ability to
handle tabular data well is the basis of analysis. In Python, the table form is called a DataFrame
and is implemented as a pandas library.
‣ The pandas library is built based on NumPy but specializes in more complex data analysis.
‣ While NumPy processes only the same array of data types, Pandas can process different data types.
‣ The pandas library is an optimal tool for collecting and organizing data.
‣ Pandas is an important tool that can handle most of the work in data science and is essential for
data scientists.
DataFrame
‣ The basic data structures of pandas include Series (1D) and DataFrame (2D). DataFrame is a
container for Series, and Series is a container for scalar (0D). They can add or delete data in a
dictionary manner.
2.2. Pandas Series and DataFrames (Unit 02)
Series
A series is a one-dimensional array of sequentially listed values.
‣ Since it is an array, all elements in the series must belong to one data type.
‣ Simple series can be created from sequences such as lists, tuples, and arrays.
‣ The word ‘series’ is singular AND plural. Its Latin origin, ’serere,’ means to join or connect.
‣ The default value of the series is an integer index. The label of the first item is 0, the label of the
second item is 1, and it increases in this manner.
‣ The attribute value of the series is the list of all values in the series.
‣ The attribute value of the index means an index of a series.
‣ The attribute value of the index.values is an array of all index values.
Line 06
• We can see that the data type is a Series.
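A minimal Series example showing the values, index, and index.values attributes described above (the data is illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30])

print(type(s).__name__)   # Series
print(s.values)           # [10 20 30] -- the data as a NumPy array
print(s.index)            # RangeIndex(start=0, stop=3, step=1)
print(s.index.values)     # [0 1 2]
```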
DataFrame
(Figure: a DataFrame consists of rows and columns.)
‣ Selecting the data value array separately is also possible. At that time, the values attribute of the
series class is used.
Line 16
• The index is represented by an integer RangeIndex object over the range 0 to 4. The stop value of the range is not included.
‣ The data value array maintains the order of the list element array of list_data, which is the original
data.
‣ A DataFrame is a two-dimensional data structure that stores various data types (letters, integers,
floating point values, categorical data, etc.) in a column.
‣ Each table has three columns with a column label. The column labels are “Name,” “Age,” and “Sex.”
‣ Column “Name” consists of text data in which each value is a string, column “Age” consists of whole numbers, and column “Sex” is text data.
‣ It is similar to data table representation in a spreadsheet.
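A hypothetical table like the one described (the names and values are illustrative, not from the lesson):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],   # text data (strings)
    "Age": [24, 31],            # whole numbers
    "Sex": ["F", "M"],          # text data
})

print(df.dtypes)    # Name and Sex are object (text), Age is int64
print(df["Age"])    # selecting one column returns a Series
```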
Series
Line 19
• If you are familiar with dictionaries, selecting a single column is very similar to selecting a dictionary value based on a key.
Line 21
• It is possible to index a series.
Line 23
• Since the column name is not specified, it appears as a RangeIndex object.
Line 29
• Note that string data types in the series are recognized in the form of objects.
Line 32
• Designate the index name with the name attribute.
Line 39
• Only non-duplicate values are returned.
Line 40
• This is the number of values that are not duplicates.
Line 44
• Only non-duplicate values are returned.
Line 46
• Using the value_counts() method, the same value is added to the number. It’s a method that is
used quite often, so be sure to remember it.
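The unique/nunique/value_counts trio can be sketched with a tiny Series (the values are illustrative):

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])

print(s.unique())         # ['a' 'b' 'c'] -- non-duplicate values only
print(s.nunique())        # 3 -- the number of non-duplicate values
print(s.value_counts())   # a: 3, b: 1, c: 1 -- count per value
```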
Line 65
• We will use the lambda function that we learned in Python Basics.
• Add 10 to each element and return it to the series.
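The lambda annotation above corresponds to something like the following (the values are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# apply() runs the function on each element and returns a new Series.
result = s.apply(lambda x: x + 10)
print(result.tolist())   # [11, 12, 13]
```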
Line 9
• It shows only the first five rows. It is used to quickly view the state or values of the data.
Line 10
• It shows only the last five rows.
Line 11
• Used to check the column names.
• iris_df.columns (caution: since columns is an attribute, not a method, () is not used).
Line 12
• Used to check the index.
Line 14
• You can see that it has changed from . to _.
Several columns are brought into the two-dimensional form, as shown below.
‣ As shown above, the column name is not recognized when the data is entered. Thus, you have to
set the columns yourself and create a DataFrame.
Line 29
• Looking at it again, notice that it didn’t change. Let’s try the inplace=True option.
Line 36
• Converting a DataFrame with a DataFrame() function. Save to variable df.
‣ Always be in the habit of copying the original before deleting rows. If data is deleted incorrectly, it can be checked against and restored from the copy, which saves time when the amount of data is large.
Line 43
• The original remains.
Line 49
• Replicate the DataFrame df and store it in the variable df4. Delete one column of df4.
Line 57-1
• Use the loc indexer.
Line 57-2
• Use the iloc indexer.
Selecting Columns
Line 63
• Select only the “math” score data. Save it to variable math1.
Line 65
• Select only the “english” score data. Save it to variable english.
‣ Select “music” and “science” score data. Save it to variable music_sci.
‣ If you want to store it in a two-dimensional form, and there is only one column, use the list form[].
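Column selection, including the one-column list form, can be sketched with a hypothetical score table (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"math": [90, 80], "english": [70, 85], "music": [95, 60]})

math1 = df["math"]                    # single label -> Series (1-D)
math2 = df[["math"]]                  # list form [] -> DataFrame (2-D)
music_sci = df[["music", "english"]]  # several columns -> DataFrame

print(type(math1).__name__)   # Series
print(type(math2).__name__)   # DataFrame
print(music_sci.shape)        # (2, 2)
```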
Designating Index
‣ There is no index, but an index may be selected among the columns.
‣ Designate the column “Name” as a new index and reflect the changes to the df object.
Line 74
• If the name part moves down, it is set as an index.
Line 77
• Select one specific element of the DataFrame df (“music” score of “honggildong”).
Line 79
• Select two or more elements of the DataFrame df (“music” and “phys_tra” scores of
“honggildong”).
Line 81
• This can also be done with the integer index iloc.
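The loc/iloc pair from these annotations can be sketched with a hypothetical score table (names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"music": [90, 70], "phys_tra": [80, 60]},
                  index=["honggildong", "hongeedong"])

print(df.loc["honggildong", "music"])   # 90 -- label-based indexer
print(df.iloc[0, 0])                    # 90 -- integer-position indexer

# Two or more elements at once:
print(df.loc["honggildong", ["music", "phys_tra"]].tolist())   # [90, 80]
```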
Line 83
• It is also possible through slicing.
Line 87
• Select an element from two or more rows and columns of df (”music” and “phys_tra” scores of
“honggildong” and “hongeedong”).
Add a Column
Line 94
• Add a ”kor” score column to the DataFrame df. The value of the data is 80.
Add a Row
Line 97
• Add a new row – enter the same element value.
Line 99
• Add a new row – enter an array of element values.
‣ help(df.reset_index)
Line 109
• When the index is set this way, pay close attention when adding rows.
Line 113
• Designate the column “name” as a new index and reflect the changes to the df object.
Line 115
• A method for changing a specific element for the DataFrame df: There are various methods for
changing a “phy” score of “stu1.”
Line 121
• A method for changing several elements of stu1 df: “mus” and “phy” scores of “stu1.”
Line 127
• Transposing the DataFrame df (using the method).
Line 129
• Transposing the DataFrame df again (using class attributes).
Line 132
• A specific column is set as a row index of a DataFrame.
Line 136
• Multi-index
Line 138-1
• Definition of the dictionary
Line 138-2
• Converting the dictionary into a DataFrame. Specify the index as [r0, r1, r2].
Line 139
• Redesignate the index to [r0, r1, r2, r3, r4].
Line 140
• Fill in the NaN value generated by reindexing with the number 0.
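The reindexing steps above can be sketched as follows (the dictionary values are assumed):

```python
import pandas as pd

# Dictionary -> DataFrame with index [r0, r1, r2].
d = {"c0": [1, 2, 3], "c1": [4, 5, 6], "c2": [7, 8, 9]}
df = pd.DataFrame(d, index=["r0", "r1", "r2"])

# Redesignate the index to [r0..r4]; without fill_value the new
# rows r3 and r4 would be NaN, so fill them with 0 instead.
df2 = df.reindex(["r0", "r1", "r2", "r3", "r4"], fill_value=0)
```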
Line 142-1
• Definition of the dictionary
Line 142-2
• Converting the dictionary into a DataFrame. Specify the index as [r0, r1, r2].
Line 143
• Reset the row index to integer.
Line 145-1
• Definition of the dictionary
Line 145-2
• Converting the dictionary into a DataFrame. Specify the index as [r0, r1, r2].
Line 146
• Sort row indices in descending order.
Line 147-1
• Definition of the dictionary
Line 147-2
• Converting the dictionary into a DataFrame. Specify the index as [r0, r1, r2].
Line 148
• Sort in descending order based on column c1.
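The two sorting steps can be sketched as follows (the values are assumed):

```python
import pandas as pd

df = pd.DataFrame({"c1": [2, 1, 3]}, index=["r0", "r1", "r2"])

# Sort row indices in descending order.
by_index = df.sort_index(ascending=False)
# Sort in descending order based on column c1.
by_value = df.sort_values(by="c1", ascending=False)
```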
Series Operations
Series vs. Numbers
‣ Adding a number to a series object adds a number to each of the series’ individual elements and
converts the calculated result into a series object.
Line 3
• Create a pandas series with data from a dictionary.
Line 4
• Divide the student’s scores by 200 per subject.
Line 5
• Create a pandas series with data from a dictionary.
Line 5-4 ~ 5-7
• Perform the four fundamental arithmetic calculations based on the scores of each student per
subject.
(output table: one column per subject; rows addition, subtraction, multiplication, division)
Line 6
• Combine the results of the arithmetic operation into a DataFrame (series -> DataFrame).
‣ In the example above, the subject-name order given by the index differs between the two series. However, pandas aligns entries with the same subject name (index) and adds the scores of matching subjects.
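The index alignment between two series can be sketched as follows (subjects and scores assumed):

```python
import pandas as pd

# The subject (index) order deliberately differs between the two series.
student1 = pd.Series({"kor": 90, "eng": 80, "mat": 70})
student2 = pd.Series({"mat": 50, "kor": 60, "eng": 70})

# pandas aligns entries by index label before adding,
# so each subject is added to the same subject.
added = student1 + student2
```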
Line 7
• Create a pandas series with data from a dictionary.
(output table: one column per subject; rows addition, subtraction, multiplication, division)
Line 8
• Combine the results of the arithmetic operation into a DataFrame (series -> DataFrame).
Operation Method
‣ As you learned in the previous slides, when an index is not shared or a value is NaN, the result is NaN. To avoid this, set a fill value for the operation.
Line 10
• Create a pandas series with data from a dictionary.
Line 10-4 ~ 10-7
• Perform the four fundamental arithmetic calculations based on the scores of each student per
subject.
(Use the Operation Method)
(output table: rows addition, subtraction, multiplication, division)
Line 11
• Combine the results of the arithmetic operation into a DataFrame (series -> DataFrame).
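The fill_value idea can be sketched as follows (scores assumed; note that fill_value only helps when at least one side has a value for the index):

```python
import pandas as pd

s1 = pd.Series({"kor": 90, "eng": 80, "mat": 70})
s2 = pd.Series({"kor": 60, "eng": 70})   # no "mat" score at all

plain = s1 + s2                   # "mat" has no partner -> NaN
safe = s1.add(s2, fill_value=0)   # the missing side is treated as 0
```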
DataFrame Operations
DataFrame operations can be understood as an extension of series operations: the two operands are first aligned by their row/column indices, and the calculation is then performed element by element between corresponding entries.
Line 12
• Create a DataFrame by selecting two columns, age and fare, from the titanic dataset.
Line 13
• Create a DataFrame by selecting two columns, age and fare, from the titanic dataset.
Line 13-3
• Shows only the first five lines.
Line 14-1
• Add 10 to the DataFrame.
Line 14-2
• Shows only the first five lines.
‣ While maintaining the form of the existing DataFrame, only the element value is replaced with a
new calculated value and returned as a new DataFrame object.
Line 15
• Create a DataFrame by selecting two columns, age and fare, from the titanic dataset.
Line 16
• Add 10 to the DataFrame.
Line 17-1
• Calculate between DataFrames (addition - df).
Line 17-2
• Shows only the last five lines.
DataFrame Manipulation
Merging DataFrames:
(diagrams: join types between DataFrames A and B)
Line 1, 2
• Inner join
Line 3
• Left join
Line 4
• Right join
Line 5
• Full outer join
Line 1
• Bind vertically by matching the column names.
Line 2
• Bind horizontally by matching the indices.
Binding DataFrames
‣ When data is divided across several places, it may be necessary to combine or bind it into one. In pandas, functions used to combine or bind DataFrames include concat(), merge(), and join().
‣ pandas.concat(list of DataFrames)
‣ If the axis is not specified, the default option axis=0 is applied, and the DataFrames are connected in the up/down row direction.
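A minimal sketch of pd.concat() in both directions (the column names and values are assumed):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5], "b": [6], "d": [7]})

# Default axis=0: connect in the up/down row direction,
# matching columns by name (missing columns become NaN).
rows = pd.concat([df1, df2], ignore_index=True)

# axis=1: bind in the left/right column direction, matching indices.
cols = pd.concat([df1, df1], axis=1)
```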
‣ Rows 0, 1, 2, and 3 derived from df1 are filled with NaN because df1 has no column “d.”
‣ See help(pd.concat) for the ignore_index option.
‣ The axis=1 option binds the DataFrames in the left/right column direction.
Merging DataFrames
‣ The merge() function combines two DataFrames on a given criterion, in a manner similar to SQL's join command. The column or index used as the reference is called a key. The key must exist in both DataFrames.
(left stock DataFrame sample: Hanmi Pharmaceutical, NS Shopping, E-mart, Green Cross Medical Science Corporation)
(right stock DataFrame sample: Harim Co., Meritz Financial Group, E-mart, Samyang)
‣ See help(pd.merge) for more detail. Let’s just put two DataFrames as parameters and merge them.
(merged result preview: E-mart, Samyang, Chong Kun Dang Group, ModeTour Reit)
‣ Here is how to solve the error: bring the file in again.
‣ The on=None and how='inner' options are applied as default values. The on=None option means that all columns common to the two DataFrames are used as the reference (key).
‣ The how='inner' option means that rows are extracted only when the key value exists in both DataFrames (an intersection).
‣ We merged and returned the five stocks that exist in both DataFrames, based on the column “id.”
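The default merge behavior can be sketched as follows (the stock names follow the slides, but the table contents are assumed):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3],
                     "name": ["NS Shopping", "E-mart", "Samyang"]})
right = pd.DataFrame({"id": [2, 3, 4],
                      "price": [100, 200, 300]})

# on=None: all common columns ("id") become the key.
# how='inner': keep only keys present in BOTH DataFrames.
merged = pd.merge(left, right)
```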
‣ Let's deliberately rename the key columns to “id_1” and “name_1” on the left DataFrame and on the right DataFrame and see what happens.
‣ When the two columns are the same, the results are merged based on the values in the common
column name.
‣ If you use how='left', all the companies in the left DataFrame will be returned, and those not on the right will be treated as NaN.
‣ If you use how='right', all the companies in the right DataFrame will be returned, and those not on the left will be treated as NaN.
‣ If you use how='outer', all the data on the left and right is returned.
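The three non-inner join types can be sketched together (tiny assumed tables):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})
right = pd.DataFrame({"id": [2, 3], "price": [10, 20]})

left_join = pd.merge(left, right, on="id", how="left")    # all left rows
right_join = pd.merge(left, right, on="id", how="right")  # all right rows
outer_join = pd.merge(left, right, on="id", how="outer")  # union of keys
```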
DataFrame Manipulation
Sorting:
‣ It is possible to sort the rows of a DataFrame using one or more columns.
Line 1
• Sort in ascending order.
Line 2
• Sort in descending order.
Line 3
• Sort using two columns.
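The three sorts can be sketched as follows (the values are assumed):

```python
import pandas as pd

df = pd.DataFrame({"grade": [2, 1, 1], "score": [70, 90, 80]})

asc = df.sort_values(by="score")                    # ascending (default)
desc = df.sort_values(by="score", ascending=False)  # descending
# Two columns: sort by grade first, then by score within each grade.
two = df.sort_values(by=["grade", "score"])
```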
Line 2-1
• Column names
Line 2-2
• Labels for the outer layer
Line 2-3
• Labels for the inner layer
Line 2-4
• Create a list of tuples with the labels.
Line 2-5
• Create the MultiIndex.
Line 2-6
• Apply the MultiIndex.
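The MultiIndex construction steps above can be sketched as follows (the labels are assumed):

```python
import pandas as pd

outer = ["math", "math", "lang", "lang"]   # labels for the outer layer
inner = ["mid", "final", "mid", "final"]   # labels for the inner layer
tuples = list(zip(outer, inner))           # list of (outer, inner) tuples
mi = pd.MultiIndex.from_tuples(tuples)     # create the MultiIndex

# Apply the MultiIndex to the columns.
df = pd.DataFrame([[90, 85, 70, 75]], columns=mi)
```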
DataFrame Summarization
Grouping and Summarizing:
Pivoting:
‣ Manipulate the indices and the columns and then summarize.
‣ Index by 'Size' and 'Type.' Columns by 'Location.' Values provided by the 'B' column.
‣ The same as the graph on the right, but fill the missing values with 0.
‣ Index by 'Location.' Columns by 'Size' and 'Type.' Values provided by the 'B' column.
Line 18
• Now, a MultiIndex object appears in the columns.
‣ The aggregation function is np.median().
‣ Group averages of the columns 'A' and 'B'
‣ Now, with the groupby() method. The result is the same.
‣ Aggregate the columns 'A' and 'B' differently.
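A sketch showing that pivot_table() and groupby() can produce the same summary (the data is assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "Size": ["S", "S", "L", "L"],
    "Location": ["East", "West", "East", "West"],
    "B": [1, 2, 3, 4],
})

# Index by Size, columns by Location, values from B (mean by default).
pivot = df.pivot_table(index="Size", columns="Location", values="B")

# The same summary with the groupby() method plus unstack().
grouped = df.groupby(["Size", "Location"])["B"].mean().unstack()
```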
Statistics:
Line 1
• Column sums
Line 2
• Row sums
Line 3
• Column averages without skipping the missing values
Line 4
• Descriptive statistics of the columns (variables)
Line 5
• Non-missing values along the columns
Line 6
• Correlation between the column 'A' and the column 'B'
Line 7
• Correlation matrix taking the numeric variables pair-wise
Line 8
• Correlations between 'A' and the other numeric variables
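The statistics calls above can be sketched on a tiny assumed DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [2.0, 4.0, 6.0]})

col_sums = df.sum()               # column sums
row_sums = df.sum(axis=1)         # row sums
col_means = df.mean()             # column averages
described = df.describe()         # descriptive statistics of the columns
corr_ab = df["A"].corr(df["B"])   # correlation between A and B
corr_matrix = df.corr()           # pair-wise correlation matrix
```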
Line 1
• A DataFrame with True where missing values are found.
Line 2
• Count the missing values for each column.
Line 3
• Proportions of the missing values for each column.
Line 4
• Drop rows where one or more missing values are found.
Line 5
• Drop columns where one or more missing values are found.
Line 6
• Drop the rows with fewer than 3 non-missing values (thresh=3).
Line 7
• Fill the missing values with 0.
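The missing-value inspection and treatment calls above can be sketched as follows (the data is assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

mask = df.isnull()           # True where missing values are found
counts = df.isnull().sum()   # missing values per column
props = counts / len(df)     # proportions of missing values
no_na_rows = df.dropna()     # drop rows with any missing value
filled = df.fillna(0)        # fill the missing values with 0
```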
Data Preprocessing
Data scientist survey:
‣ What do data scientists spend the most time doing?
‣ Cleaning and organizing data: 60%; Collecting data sets: 19%; Mining data for patterns: 9%; Refining algorithms: 4%; Building training sets: 3%; Other: 5%
(Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#790d18c36f63)
‣ Cleaning and organizing data: 57%; Collecting data sets: 21%; Building training sets: 10%; Refining algorithms: 4%; Mining data for patterns: 3%; Other: 5% (same Forbes survey source)
Operations:
‣ Cleaning
‣ Transformation/Reshaping
‣ Integration/Join
Operations:
Scaling/Normalization
Imputation of missing
values
Outlier treatment
Algorithm considerations:
‣ Often, machine learning algorithms can be categorized into “Tree-like” and “non-Tree-like.”
• a) Tree-like algorithms: Decision Tree, Random Forest, AdaBoost, XGBoost, etc.
• b) Non-Tree-like algorithms: Linear Regression, Logistic Regression, SVM, Neural Network, etc.
‣ Scaling/Normalization and Outlier treatment can be needed for non-Tree-like algorithms, but not for
Tree-like algorithms.
‣ This is because Tree-like algorithms partition the configuration space into “patches,” mostly
unaffected by scales or outliers.
‣ Other preprocessing operations are equally applicable for both Tree-like and non-Tree-like.
Scale
Data Classification
‣ Continuous-scale (numeric) data and categorical data are the most basic types of structured data.
‣ Numeric data includes continuous data, such as wind speed and duration, and discrete data, such as the frequency of occurrence of events.
‣ Categorical data refers to the city names of each country (Washington, New York, Los Angeles) or
the type of car (bus, taxi, truck).
‣ Binary data is a special case of categorical data that takes one of two values, such as 0/1, yes/no, or true/false.
‣ Categorical data whose values are ranked, such as ratings (1, 2, 3, 4, 5), are called ordinal data.
Missing Value
‣ A missing value refers to data that is absent or empty.
‣ Missing values can distort analysis results or make it impossible to apply some functions.
‣ In some cases, the missing values in a variable are unrelated to the other variables, so missing values must be deleted or replaced depending on the situation.
‣ Assuming that each variable follows a specific probability distribution, the distribution parameters
are estimated and replaced.
‣ There are mean replacement, median replacement, and mode replacement.
Line 1-1
• Import Library.
Line 1-2
• Load the Titanic Dataset.
Line 1-3
• Calculate the number of NaNs in the Deck column.
Line 2
• Find the missing data using the isnull() method.
Line 3
• Find the missing data using the notnull() method.
Line 4
• Calculate the number of missing data using the isnull() method.
Applying the thresh=500 option to the dropna() method deletes every column that has fewer than 500 non-NaN values.
Line 5
• Delete all columns with more than 500 NaN values. Deck column (688 NaN values out of 891).
Line 6
• Delete all rows without age data in the age column. Age column (177 NaN values out of 891).
Line 7-1
• Delete all rows without age data in the age column. Age Column (177 NaN values out of 891).
Line 7-2
• Calculate the mean of the age column. (Excluding NaN values)
Line 7-3
• Print the first 10 data in the age column. (In row 5, NaN values are replaced by the mean.)
Line 8-1
• Load the titanic dataset.
Line 8-2
• Print the NaN data of column embark_town and row 829.
Line 9
• The NaN value of the embark_town column is replaced with the value that appears the most
among the boarding cities.
Line 10
• Print the NaN data of row 829 and column embark_town. (NaN value is replaced by value
most_freq.)
Line 11
• Change the NaN value of embark_town to the immediately preceding value of row 828.
Line 12
• Create a DataFrame with duplicate data.
Line 13
• Find the duplicate values among the entire row data of the DataFrame.
Row 1 becomes True because it duplicates the preceding row 0.
Line 14
• Find the duplicate value in the specific column data of the DataFrame.
drop_duplicates() Method
Line 15
• Find the duplicate value in the specific column data of the DataFrame.
Line 16
• Remove duplicate rows based on columns c2 and c3.
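The duplicated()/drop_duplicates() steps can be sketched as follows (the data is assumed):

```python
import pandas as pd

# Row 1 deliberately repeats row 0.
df = pd.DataFrame({"c1": [1, 1, 2], "c2": ["a", "a", "b"], "c3": [0, 0, 1]})

dup_rows = df.duplicated()            # True where a row repeats an earlier one
unique = df.drop_duplicates()         # remove fully duplicated rows
by_cols = df.drop_duplicates(subset=["c2", "c3"])  # based on columns c2 and c3
```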
Feature Engineering
Derived variables:
‣ Create one or more variables based on other variable(s).
Ex From the “Date” variable, derive the “Year,” “Month,” “Day,” “Weekday,” etc.
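The “Date” example can be sketched with the .dt accessor (the dates are assumed):

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2024-01-15", "2024-03-02"])})

# Derive Year, Month, Day, and Weekday from the Date variable.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Weekday"] = df["Date"].dt.day_name()
```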
Distance = √(X_coordinate² + Y_coordinate²)
‣ Principal components with or without dimensional reduction can be regarded as derived variables.
‣ In general, rotated coordinates can be regarded as derived variables.
‣ We can apply mathematical functions such as log() to a variable to create its derived variable.
Gender Gender_male
female 0
male 1
Ex “Species” variable with three category values: “setosa,” “versicolor,” and “virginica.”
→ Only two dummy variables are required: “Species_versicolor” and “Species_virginica.”
Ex One hot encoding for a variable that can have integer values 0~9.
→ Integer variables often represent category or class rather than the numeric value.
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
0 0 0 0 0 1 0 0 0 0
Ex “Capital” with five category values [“London”, “Moscow”, “Paris”, “Seoul”, “Washington”]
can be encoded as an integer variable “Capital_int” where 0 ↔ “London,” 1 ↔ “Moscow,”
2 ↔ “Paris,” 3 ↔ “Seoul,” 4 ↔ “Washington.”
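The integer encoding of the “Capital” example can be sketched with pandas categoricals (the code values follow alphabetical category order):

```python
import pandas as pd

s = pd.Series(["Seoul", "London", "Paris", "Moscow", "Washington"])

# Categories are sorted alphabetically, so London=0, Moscow=1,
# Paris=2, Seoul=3, Washington=4, matching the slide's mapping.
cat = s.astype("category")
codes = cat.cat.codes
```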
Line 2-1
• Generate df with the read_csv() function.
Line 2-3
• Designate the column name.
The round(2) method rounds to two decimal places.
Line 3-1
• Convert mile per gallon to kilometer per liter (mpg_to_kpl = 0.425).
Line 3-3
• Add the result of multiplying the mpg column by 0.425 to the new column (kpl).
Line 3-7
• Round the kpl column to two decimal places.
Line 4-1
• Check the data type of each column.
Line 4-4
• Check the original value of the horsepower column.
Line 5
• Check the original value of the origin column.
Convert the integer data type into a string data type using a dictionary.
Line 6-1
• Convert the integer data type into a character data type.
Line 6-3 ~ 6-4
• Check the original value and the data type of the origin column.
Line 7-1
• Convert the string data type of the origin column into a categorical type.
Line 7-4
• Convert the categorical type back to the string type.
Line 8
• Convert the integer type of the model year column into a categorical type.
Line 9-1
• Generate df with the read_csv() function.
Line 9-6
• Change “?” to np.nan.
Line 9-7
• Delete the missing data row.
Line 9-8
• Convert the string to a float.
Line 10
• Find the list of boundary values that divide the data into three bins, using the np.histogram function.
Use the include_lowest=True option to include the lowest boundary value.
Line 11-1
• Designate the names for the 3 bins.
Line 11-3
• Data array
Line 11-4
• Boundary value list
Line 11-5
• Bin names
Line 11-6
• Include the first boundary value.
Line 11-8
• Print the first 15 rows of the horsepower column and the hp_bin column.
(output: the first 15 hp_bin values, e.g., Normal output, High output, Low output)
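The np.histogram + pd.cut binning described above can be sketched as follows (the horsepower values are assumed; the course applies this to the auto-mpg dataset):

```python
import numpy as np
import pandas as pd

hp = pd.Series([46, 68, 90, 113, 150, 200, 230])  # assumed horsepower values

# np.histogram with bins=3 returns the counts and the 4 boundary values.
counts, bin_edges = np.histogram(hp, bins=3)

bin_names = ["Low output", "Normal output", "High output"]
hp_bin = pd.cut(hp,                   # data array
                bins=bin_edges,       # boundary value list
                labels=bin_names,     # bin names
                include_lowest=True)  # include the first boundary value
```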
Dummy Variable
To use categorical data representing categories in machine learning algorithms, convert them
into dummy variables represented by the number 0 or 1.
Line 12-1
• Generate df with the read_csv() function.
Line 12-6
• Change ”?” to np.nan.
Line 12-7
• Delete the missing data row.
Line 12-8
• Convert the string into a float.
Line 10
• Find the list of boundary values that divide the data into three bins, using np.histogram.
Line 12-12
• Designate the name for three bins.
One-Hot Vector
If the get_dummies() method is used, all original values of the categorical variables are each
converted into new dummy variables.
Line 13-3
• Bin names
Line 13-4
• Include the first boundary value.
Line 13-6
• Convert the categorical data in the hp_bin column into dummy variables.
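The get_dummies() conversion can be sketched as follows (the hp_bin values are assumed):

```python
import pandas as pd

hp_bin = pd.Series(["Low output", "Normal output",
                    "High output", "Normal output"])

# Each original category value becomes its own 0/1 dummy column.
dummies = pd.get_dummies(hp_bin)
```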
Normalization
The DataFrame is normalized so that differences in the relative scale of the numeric columns (variables) do not degrade performance.
Line 14-1
• Check the maximum value with statistical summary information of the horsepower column.
Line 14-4
• Divide every value by the absolute value of the maximum of the horsepower column and store the result.
Line 15-1
• The maximum value (max) and the minimum value (min) are checked by the statistical summary
information of the horsepower column.
Line 15-4 ~ 15-6
• Subtract the minimum value from every value and divide by the difference between the maximum and the minimum (min-max normalization).
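Both normalization recipes can be sketched as follows (the horsepower values are assumed):

```python
import pandas as pd

hp = pd.Series([46.0, 113.0, 230.0])

# Max-abs scaling: divide by the absolute maximum value.
max_abs = hp / abs(hp.max())

# Min-max scaling: subtract the minimum and divide by (max - min).
min_max = (hp - hp.min()) / (hp.max() - hp.min())
```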
Univariate (One Variable):
‣ Continuous Data: Histogram, Box Plot, Violin Plot, Kernel Density Curve
‣ Categorical Data (nominal, ordinal): Bar Chart, Pie Chart
Refers to data with one variable and is divided into quantitative and qualitative data.
Multivariate (Two or More Variables):
‣ Continuous Data: Scatter Plot, Line Plot, Time Series Plot
‣ Categorical Data (nominal, ordinal): Mosaic Chart
Univariate visualization:
‣ One continuous numeric variable: Histogram
(figure: histogram of x; vertical axis Frequency)
‣ Shows the absolute or relative frequencies of each interval.
Univariate visualization:
‣ One continuous numeric variable: Histogram
(figure: histogram of x with a different interval width; vertical axis Frequency)
‣ The interval width can be adjusted.
Multivariate visualization:
‣ Two continuous numeric variables: Scatter plot
‣ Identify whether a linear relation exists between the two variables.
Multivariate visualization:
‣ Two continuous numeric variables: Scatter plot
‣ Different categories can be denoted by different colors or markers (symbols).
Multivariate visualization:
‣ Two continuous numeric variables and one categorical variable: Multiple Scatter plots
‣ Different categories can be plotted separately.
‣ You should make sure that the axis ranges match for proper comparison.
(figure: bar chart of categories A, B, C, D; vertical axis Frequency)
Univariate visualization:
‣ One categorical variable: Pie chart
Bivariate visualization:
‣ One continuous numeric variable & one categorical variable: Multiple Boxplots
‣ The number of categories (types) = the number of boxplots
Bivariate visualization:
‣ One continuous numeric variable & one categorical variable: Multiple Histograms
(figure: histograms for categories A and B with matching x ranges 0 to 20; vertical axis Frequency)
Bivariate visualization:
‣ Two categorical variables: Bar plot
Bivariate visualization:
‣ Two categorical variables: Bar plot
‣ Use color and dodged bars to distinguish the categories of the secondary variable.
Recommendations
In the following bar plot, can you see a big difference among the categories?
‣ Apparently, yes?
(figure: bar plot of categories A, B, C, D with the vertical axis ranging only from 70.0 to 71.0)
In the following bar plot, can you see a big difference between the categories?
‣ When the vertical axis starts at zero, you see little difference.
(figure: the same bar plot with the vertical axis starting at 0)
Histogram
Multiple boxplots
Line plot
Argument Explanation
color Color
alpha Transparency
linewidth Line width
linestyle Line style
marker Marker type
markersize Marker size
markerfacecolor Marker color inside
markeredgecolor Color of the marker edge
markeredgewidth Width of the marker edge
https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html
linestyle Explanation
'none' No line
':' Dotted line
'--' Dashed line
'-.' Dash dot
'-' Continuous line
'steps' In steps
marker Explanation
'.' Point
',' Pixel
'o' Circle
'^' Triangle up
'v' Triangle down
's' Square
'*' Star
'+' Plus sign
'x' X character
'D' Diamond
'p' Pentagon
pandas Visualization
Import modules and data.
Histogram:
Bar plot:
Histogram:
Histogram + KDE:
Histogram + Rug:
Scatter plot:
Hex:
Horizontal boxplot:
Vertical boxplot:
Multiple boxplots:
Multiple boxplots: two categorical variables + one numeric variable, using the 'hue' argument.
Strip plot:
Swarm plot:
Line 2
• Read in the 'iris' data from the library.
Scatter plot array (Color the markers with 'species.' Apply palette.):
Heat map:
Line 11
• Read in 'mpg' data from the library.