
Volume 9, Issue 7, July – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24JUL769

A Machine Learning Model for Training Your AI


Akaninyene Udoeyop

Abstract:- Artificial Intelligence is playing an increasing role in solving some of the world’s biggest problems. Machine Learning Models, within the context of reinforcement learning, define and structure a problem in a format that can be used to learn about an environment in order to find an optimal solution. This includes the states, actions, rewards, and other elements in a learning environment. This also includes the logic and policies that guide learning agents to an optimal or nearly optimal solution to the problem. This paper outlines a process for developing machine learning models. The process is extensible and can be applied to solve various problems. This includes a process for implementing data models using multi-dimensional arrays for efficient data processing. We include an evaluation of learning policies, assessing their performance relative to manual and automated approaches.

I. INTRODUCTION

Artificial Intelligence (AI) is a field of study that utilizes computer systems and data to solve problems. The term “Artificial Intelligence” is often used interchangeably with the term “Machine Learning”. However, it is important to distinguish between the two concepts. Machine learning is a type of AI that involves training computer systems to solve problems through learning and adapting from experience.

There are several approaches to machine learning, including Supervised and Unsupervised Learning. This paper will focus on reinforcement learning. Reinforcement learning is a machine learning technique that involves utilizing a reward-based model to train a learning agent, then enabling the agent to make decisions based on the data accumulated during training.

This paper will outline a process to develop and train data models using reinforcement learning.

A. Reinforcement Learning
Reinforcement learning is a machine learning method that involves rewarding desired behavior and punishing undesired behavior. This involves learning agents interacting with an environment and observing the results or rewards from their actions. The objective in reinforcement learning is to learn the optimal behavior based on experience.

This section outlines the reinforcement learning policies and the Q-learning algorithm that will be utilized in this paper.

• Policies
A policy is a function that maps states to actions. It is essentially a strategy that guides the learning agent’s actions while it is interacting with the environment. Policies can be either deterministic or stochastic:

• Deterministic - Maps states to actions with an expected reward value.
• Stochastic - Maps states to a probabilistic distribution over actions.

The difference between a deterministic and stochastic policy is in the way that they choose actions. A deterministic policy will choose the optimal action that maximizes the reward value. This is considered a “greedy” policy. A stochastic policy will choose a random action some percentage of the time, and otherwise will follow a greedy policy. This is known as the Epsilon-greedy policy.

• Q-Learning Algorithm
Q-Learning is a machine learning algorithm that assigns reward values to state-action combinations. The algorithm estimates a function that is closely related to the policy. This function is called the value function. With Q-Learning, the reward values are numerical, are produced by the value function, and indicate how “good” or “bad” the outcome of an action is. With Q-Learning, the Q-Value is the metric used to measure an action at a particular state.
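For reference, the following sketch shows the standard Q-value update in JavaScript, the language used for the simulation later in this paper. The learning rate alpha, the discount factor gamma, and the Map-based storage are illustrative assumptions; the simplified data model described in Section III stores observed rewards directly rather than applying this update.

// Standard (textbook) Q-learning update; not taken from this paper's implementation.
const qTable = new Map(); // key: `${state}|${action}` -> Q-value

function getQ(state, action) {
  return qTable.get(`${state}|${action}`) ?? 0; // unseen pairs default to 0
}

function updateQ(state, action, reward, nextState, actions, alpha = 0.1, gamma = 0.9) {
  // Best Q-value obtainable from the next state over all possible actions.
  const bestNext = Math.max(...actions.map(a => getQ(nextState, a)));
  const current = getQ(state, action);
  // Move the current estimate toward the observed reward plus the discounted future value.
  qTable.set(`${state}|${action}`, current + alpha * (reward + gamma * bestNext - current));
}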

• Machine Learning Models
A machine learning model is a framework that represents the environmental components and policies used to train learning agents to detect patterns and solve problems. They are trained using techniques such as reinforcement learning. Developing a machine learning model involves defining the states, actions, rewards, and other elements in the environment, then selecting learning policies that guide learning agents to an optimal or nearly optimal solution to a problem.


II. RELATED WORK

Approaches to developing machine learning models are evolving. There are various theories, algorithms, and policies that are available for solving problems using machine learning. Data scientists, for example, can build tens to hundreds of models before arriving at one that meets some acceptance criteria (e.g. AUC cutoff, accuracy threshold)[1]. The process for building a machine learning model is not an exact science. Emily Sullivan, in her research, outlines considerations and rationale behind selecting a learning model and mentions some of these challenges using Neural Networks[2]. MIT researchers Manasi Vartak et al. describe the challenges in building models that adequately correlate to the problem being solved[1].

There are some efforts to organize and standardize machine learning models. In their work, Manasi Vartak et al. are developing a system for the management of machine learning models called ModelDB. While tools like ModelDB can be useful for managing models, this paper focuses on the process of taking the problem and deconstructing it into components that are used to build a learning model, then designing, training, and testing the learning model. James Wexler et al., in their research, developed a tool called What-If to evaluate the performance of several machine learning models[3]. In their study, they show how selecting the appropriate model, algorithm, and criteria can significantly impact the efficiency and the accuracy of results produced by the data model.

The advantages of modeling have been known for many years. In 2013, Christopher M. Bishop touched on some of these in his research on model-based approaches to machine learning[4]. These advantages included the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models.

III. MACHINE LEARNING MODEL FOR REINFORCEMENT LEARNING

A machine learning model is a framework that defines the states, actions, rewards, logic, data models, and learning policies required to identify patterns or make predictions using machine learning.

Fig 1: Machine Learning Model

The following outlines a process for deriving machine learning models. The process can be used along with learning policies to solve problems. This includes an approach for implementing the data model used for reinforcement learning.

A. The Learning Egg Problem
To assist in explaining the process for developing a machine learning model, we will use the example of The Learning Egg problem. The problem consists of an egg sitting on a platform that rotates left and right, where the egg can lean to the left and the right to shift the platform in either direction. The objective is for the platform to be balanced and not moving, or as close to being balanced as possible.

B. Define Problems and Goals
The objective in reinforcement learning is to solve a problem through learning from experience. In order to solve the problem, it should be broken down into sub-problems and sub-goals that can be mapped into a machine learning model. Sub-problems and sub-goals should include elements that are quantifiable and limited to factors that affect the overall objective.


Fig 2: Sub-Problem to Sub-Goal Mapping

For the Learning Egg, the problems and goals are defined in the following sections.

• Problems

• The platform is unbalanced when it is not positioned at 0°.
• The platform rotates to the left and to the right.

Note that each component of the problem affects the objective of the platform being balanced. The platform’s rotation position and the egg’s position can be quantified. The objective of the problem is for the platform to also not be moving. This can be quantified by measuring the platform position over time. The egg's movement left and right can also be quantified.

• Goals

• For the platform to be balanced, or as close to being balanced as possible.
• For the platform to not be moving.

Note that the state of the platform being balanced or unbalanced is based on the platform position and speed, which are quantifiable. Being balanced is not based on the position of the egg, although the egg’s position influences the position of the platform.

C. Define Environmental Components
After defining the problem and the goal, the environment should be defined. This section outlines the environmental components that should be derived, followed by their application in The Learning Egg problem. These elements will provide the foundation for the machine learning model.

• States
A state consists of a combination of quantifiable factors in the environment that affect the problem and the goal. A state factor is defined as a measurable and quantifiable value, also called a state value, that reflects an environmental condition at a point in time. Each state value should have an objective that is aligned with the overall goal. A distinct state is essentially a snapshot of the set of state values at a point in time.

Fig 3: State to Objective Mapping

For The Learning Egg problem, the goal is defined as “For the platform to be balanced and not moving, or as close to being balanced as possible”. This goal can be quantified by the platform position and speed. The state for this problem can be defined by the following state values:

• Platform Position - The number of degrees that the platform has rotated away from a balanced position of 0°. The objective of this state value is for the platform position to be 0°, or as close to 0° as possible.
• Platform Speed - The change in the platform position over time. As the platform rotates left and right, it increases and decreases in speed. The objective of this state value is to be at a speed of 0, which indicates that the platform is not moving.

platform speed = (previous platform position − current platform position) / time

The objective for this state value is to be zero, which indicates that the platform is not moving.

Note that the factors associated with the egg are not included in the state, as the goal only references the platform and not the egg.

• Actions
An action is an operation that the learning agent, which is the egg in this example, can take to change the state of the environment. Actions are directly enacted by the learning agent and result in a change in one or more state values.

Actions should be defined along with a reward metric to gauge how “good” or “bad” the result of an action is. A reward metric is a numerical value that represents how positive or negative the result of an action is. An optimal action is the action that will result in the highest reward or the “best” outcome from the action.

For The Learning Egg problem, the only action that can be taken is for the egg to lean to the left or right, or to stand up-right. This can be quantified by the egg’s position.


• Egg Position - A number representing the egg’s position.

The resulting platform position and speed after the egg moves can be measured as the reward for the action. Thus, the egg position represents the action, and the combination of the platform position and speed represents the reward metric. The closer the platform is to being balanced without moving, the better the reward.

• Rewards
Rewards are quantifiable observations of the environment that indicate how positive or negative the result of an action is. They need to be quantifiable because they need to indicate a state of positivity or negativity to integrate into a learning model. Rewards are observed after an action is taken and give the machine learning model feedback that is used to decide on future actions. Each reward should map to a state objective and should be able to measure the result of an action within the context of the objective.

For The Learning Egg problem, the sum of the platform position and the platform speed was used as the reward metric. The objective of the platform being balanced can be measured by the platform angle position. The objective that the platform not be moving while balanced can be measured by the platform speed. The goal for the reward is for both the platform position and speed to be zero, which indicates that the platform is balanced and not moving.

Fig 4: State to Objective Mapping

D. Define Environment Rules and Constraints
Once the environment’s components have been defined, rules and constraints should be defined for them. Rules should describe how the environment behaves as it relates to state values and actions. Constraints define what the environment’s components can and cannot do. Constraints should be defined as limitations for the state values and actions. This can be implemented by defining a numerical range, scale, and limits for the environment’s components. Environment rules and constraints for The Learning Egg problem are defined in the following sections.

• Platform Position
The platform can rotate from -50° to 50°. Thus, the platform position value will range from -50 to 50, with the value 0 meaning that the platform is balanced. Negative values indicate that the platform has rotated left. Positive values mean that the platform has rotated right.

Fig 5: Platform Position

• Egg Position
The egg can move into the following 5 positions:

Fig 6: Egg Position

The egg position will be represented by the numbers -2, -1, 0, 1, and 2, with each number denoting the respective position above.


• Platform Speed
The platform speed is defined as the change in the platform position over time:

platform speed = (previous platform position − current platform position) / time

Since the platform position value is a representation of degrees, the platform speed will be denoted in degrees/second. The platform speed can range from -2 to 2, with negative values indicating the speed going in the left direction, and positive values indicating the speed going in the right direction. A platform speed of 0 means that the platform is not moving.

The following table outlines how the egg position affects the platform speed:

Table 1: Egg Position Effect on Platform Speed
Egg Position | Platform Speed Change
Hard Left Lean | Shifts platform speed left at -2 degrees/second
Soft Left Lean | Shifts platform speed left at -1 degree/second
Stand Up-Right | Platform speed does not change
Soft Right Lean | Shifts platform speed right at 1 degree/second
Hard Right Lean | Shifts platform speed right at 2 degrees/second

This table essentially shows that, if the egg leans left, the platform speed will shift to the left. Also, if the egg leans right, then the platform speed will shift to the right. The harder the lean of the egg, the greater the change in speed for the platform. Note that the platform speed will not change when the egg is standing up-right. This simply means that if the platform is in motion, it will remain at the same speed. The egg standing up-right does not mean that the platform is not moving.

E. Define Data Model
A Data Model is a data structure and format for representing data. In machine learning, data models are used to store the information learned from interacting with the environment. There are several types of data models. In Q-Learning, a Q-table is the data model used to store the reward values, or Q-values, for actions taken at a specific state. Q-tables format the states as the rows in the table and the actions as the columns in the table.

Table 2: Q-table Example
        | Action 1 | Action 2
State 1 | QScore(State1, Action1) | QScore(State1, Action2)
State 2 | QScore(State2, Action1) | QScore(State2, Action2)
State 3 | QScore(State3, Action1) | QScore(State3, Action2)

The Q-table serves as an effective lookup table for scenarios where a single action is taken in a single state. Many problems, however, have multiple state factors and multiple actions that affect the environment at a single point in time. The Q-table operates in a 2-dimensional space, where the state is the y-axis, and the action is the x-axis.

Because of this limitation, we derived a data model design to support multiple state factors and multiple actions. Instead of the 2-dimensional space that the Q-table operates in, this data model design takes a multi-dimensional approach. Like the Q-table, this data model design, as a representation of the environment, should include the states, actions, and rewards defined for the environment. These are mapped into the columns of the table. The table should be structured with the first columns representing state values, followed by columns that represent the actions, then followed by a reward column. This approach is extensible and can be applied to a diverse set of problems. This approach is flexible in that the number of states and actions can vary.

State val 1 State val 2 State val 3 ... Action 1 Action 2 ... Reward
Fig 7: State Action Reward Table Column Structure

• Based on this Format, the Table for The Learning Egg Example will be Structured as Follows:

• Platform Position - Number of degrees that the platform has leaned away from a balanced position of 0° before egg movement
• Platform Speed - Degrees per second that the platform is moving before egg movement
• Action - The position that the egg is moving to
• Reward - Sum of the platform position and speed.

Platform Position Platform Speed Egg Position Platform Position + Platform Speed
Fig 8: Platform/Egg State Action Reward Table Column Structure


As the data model is populated with data, the learning agent will gain an understanding of potential outcomes from its actions. The following is a sample of what the data model can look like as it is being populated:

Table 3: Data Model Trained for Platform Position and Speed

Platform Position | Platform Speed | Egg Position | Reward | Resulting Platform Position | Resulting Platform Speed
25° | 1 | Hard Left Lean | 23 | 24° | -1
25° | 1 | Soft Left Lean | 25 | 25° | 0
25° | 1 | Stand Up-Right | 27 | 26° | 1
25° | 1 | Soft Right Lean | 28 | 27° | 2
25° | 1 | Hard Right Lean | 28 | 27° | 2

The data model would be populated when the platform has experienced the position of 25° while moving at a speed of 1 degree/second multiple times. When the egg makes a movement from that platform position and speed, it observes the resulting platform position and speed, then calculates their sum as the reward. Based on the information in the data table above, we can infer that at a platform position of 25°, and moving at a speed of 1 degree/second, the optimal action is a Hard Left Lean because that results in the reward of 23. Since the goal of The Learning Egg problem is to get to a platform position and speed of 0, the reward value closest to 0 is considered optimal. Thus, the reward value of 23, being the closest value to 0 for this platform position and speed, would be considered optimal. This would lead the learning agent to take a Hard Left Lean if it were pursuing the optimal action.

Note that the data table represents a small subset of a larger dataset. The data model can store data for every combination of platform position, platform speed, egg action, and reward.

Fig 9: Data Table Subset of Larger Dataset
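As a sketch of the greedy choice described above, the following JavaScript selects the egg action whose stored reward is closest to 0 for a given platform position and speed. The object layout and function name are illustrative assumptions rather than the paper's actual implementation.

// Pick the egg action whose learned reward is closest to 0 (the balanced goal).
// `rewardsForState` maps each egg action (-2..2) to its stored reward for one state.
function greedyAction(rewardsForState) {
  let best = null;
  let bestDistance = Infinity;
  for (const [action, reward] of Object.entries(rewardsForState)) {
    const distance = Math.abs(reward); // distance from the ideal reward of 0
    if (distance < bestDistance) {
      bestDistance = distance;
      best = Number(action);
    }
  }
  return best;
}

// Using the sample rows above (platform position 25°, speed 1 degree/second):
const sample = { "-2": 23, "-1": 25, "0": 27, "1": 28, "2": 28 };
greedyAction(sample); // -2, i.e. Hard Left Lean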


F. Select Data Structure for Data Model
In Computer Science, data structures are software components that are formatted for organizing, processing, retrieving and storing data. These can be used for a variety of purposes, including representing a data model in reinforcement learning. The data structure utilized for this approach is a multi-dimensional array. In Computer Science, an array is essentially a list of values with indexes that point to specific locations on the list. A multi-dimensional array is essentially a list of lists. The choice to use nested arrays was made due to arrays' fast lookup complexity of O(1). This would allow the learning agent to efficiently look up and update rewards as it learns. Multi-dimensional arrays also support the Q-table inspired data model described in Section 2.5.

• Multi-Dimensional Array
The multi-dimensional array represents the data from the data model. Thus, each column of the data table should be mapped to indexes of the multi-dimensional array. The rewards should be stored in the array with the indexes used to look up and update their respective rewards.

From the data model format described in Section 2.5, we can derive the multi-dimensional array below by mapping each of the columns of the table to an index of the array. This approach allows the lookup and updating of reward values to occur at an efficiency of O(1).

State val 1 State val 2 State val 3 ... Action 1 Action 2 ... Reward
array[state_val_1][state_val_2][state_val_3][action_1][action_2] = reward
Fig 10: Multi-Dimensional Array State Action Reward Data Model

This approach used for The Learning Egg problem produces the following multi-dimensional array:

Platform Position Platform Speed Egg Position Reward


array[platform_position][platform_speed][egg_position] = reward
Fig 11: Multi-Dimensional Array Platform/Egg State Action Reward Data Model
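The following sketch shows one way such a nested array could be allocated and indexed in JavaScript. Because the platform position, platform speed, and egg position can be negative, the indices are offset by their minimum values here; the offsetting and the assumption of whole-degree positions are implementation choices made for illustration, not details specified in this paper.

// Nested array indexed by [platformPosition][platformSpeed][eggPosition] -> reward.
// Ranges follow the constraints defined in Section D: position -50..50, speed -2..2, egg -2..2.
const POS_MIN = -50, POS_MAX = 50;
const SPEED_MIN = -2, SPEED_MAX = 2;
const EGG_MIN = -2, EGG_MAX = 2;

const rewards = Array.from({ length: POS_MAX - POS_MIN + 1 }, () =>
  Array.from({ length: SPEED_MAX - SPEED_MIN + 1 }, () =>
    new Array(EGG_MAX - EGG_MIN + 1).fill(null))); // null = not yet learned

// O(1) update after observing the reward for a state-action pair.
function setReward(position, speed, eggPosition, reward) {
  rewards[position - POS_MIN][speed - SPEED_MIN][eggPosition - EGG_MIN] = reward;
}

// O(1) lookup of the learned reward (null if the pair has not been sampled yet).
function getReward(position, speed, eggPosition) {
  return rewards[position - POS_MIN][speed - SPEED_MIN][eggPosition - EGG_MIN];
}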

With this array, reward values can be looked up and updated by using the platform position, platform speed, and egg position to index. This allows managing the data via indexing, not searching.

G. Environmental Logic and Policy
A machine learning model involves deriving policies that can be used by a learning agent to guide its future actions. In order to implement a policy, each rule and constraint should be implemented within the logic of the environment in software. For The Learning Egg problem, the following rules and constraints were integrated into the environmental logic of the program:

• The platform can rotate from -50° to 50°. Thus, the platform position can range from -50 to 50.
• The following are the possibilities for egg position:

• Hard Left Lean
• Soft Left Lean
• Stand Up-Right
• Soft Right Lean
• Hard Right Lean

• The egg position will be represented by the numbers -2, -1, 0, 1, and 2, with each number denoting the respective position above.
• The platform speed can range from -2 to 2, with negative values indicating the speed going in the left direction, and positive values indicating the speed going in the right direction.
• The following table outlines how the egg position affects the platform speed:

Table 4: Egg Position Effect on Platform Speed


Egg Position | Platform Speed Change
Hard Left Lean | Decreases platform speed at -2 degrees/second
Soft Left Lean | Decreases platform speed at -1 degree/second
Stand Up-Right | Platform speed does not change
Soft Right Lean | Increases platform speed at 1 degree/second
Hard Right Lean | Increases platform speed at 2 degrees/second
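The following sketch shows how these rules and constraints could be encoded as a single simulation step, assuming a one-second time step in which the lean first shifts the platform speed and the new speed then rotates the platform. This ordering is inferred from the sample data in Table 3 and is an assumption, since the paper's JavaScript implementation is not shown.

const clamp = (value, min, max) => Math.min(max, Math.max(min, value));

// One simulation step: apply the egg's lean (-2..2) and return the new platform state.
// The speed is shifted by the lean, then the platform rotates by the new speed, with both
// values clamped to the constraints above (speed -2..2 deg/s, position -50..50 degrees).
function step(state, eggPosition) {
  const speed = clamp(state.speed + eggPosition, -2, 2);
  const position = clamp(state.position + speed, -50, 50);
  const reward = position + speed; // reward metric: sum of platform position and speed
  return { position, speed, reward };
}

// Example matching the first sample row of Table 3: position 25°, speed 1, Hard Left Lean (-2).
step({ position: 25, speed: 1 }, -2); // { position: 24, speed: -1, reward: 23 }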

IV. TRAINING PHASE

Reinforcement learning is an iterative process that involves training the learning agent by running simulations of the problem. During these simulations, the learning agent will interact with the environment and populate the data model, thus learning about the environment. There are various approaches and factors that go into training a learning agent. We will evaluate deterministic, stochastic, manual, automated, and other approaches. For the approaches that involve reinforcement learning, the data model is populated with reward data.


• Learning Policy and Approach
Now that we have developed a data model, we are faced with the challenge of “how” to train our data model. Are some approaches more efficient than others? Do some approaches adapt better to changing environments? Why choose one approach over another? How do these approaches compare to manual or automated solutions? In this section, we will cover the rationale behind why policies and approaches to training were chosen.

• Manual Approach
A manual approach involves a human being controlling the actions of the agent in the environment. This approach does not use automation or machine learning. For The Learning Egg problem, a person would control the egg movement with a computer keyboard. The person would move the egg left and right with the goal of balancing the platform.

• Automated Approach
An automated approach involves the agent following a pre-programmed policy where actions and conditions are hard-coded to achieve a goal. No machine learning is used for this approach as the behavior is predetermined. For The Learning Egg problem, the following outlines the behavior of the egg for an automated approach used for this research:

• If the platform angle is greater than 0°, move the egg one position to the left until it is in a Hard Left Lean position.
• If the platform angle is less than 0°, move the egg one position to the right until it is in a Hard Right Lean position.
• Otherwise, move the egg to a Stand Up-Right position.

• Deterministic Policy
A deterministic policy selects actions that are projected to yield the highest reward. This type of policy is considered “greedy”. For The Learning Egg problem, the following outlines the behavior for the deterministic approach used in this research:

• When the platform is at an angle that has not been experienced yet, the egg should make a random move from the set of all possible moves.
• When the platform is at a previously experienced angle, and the egg has made moves from that angle and learned from the rewards, make the move that has yielded the highest reward based on experience.

• Stochastic Policy
In reinforcement learning, stochastic policies will choose a random action from a state some percentage of the time, and otherwise will follow a greedy policy. The notion of choosing a random action is referred to as “exploring”. The exploration rate is the percentage of actions that should be taken by the learning agent in an exploring fashion vs a greedy fashion.

• For The Learning Egg Problem, the Following Outlines the Behavior for the Stochastic Approach used in this Research:

• For a percentage of the time that matches the exploration rate, the egg should make a random move from the set of all possible moves.
• The rest of the time, the egg should make the move that has yielded the highest reward based on experience.

• Exploratory Policy
The learning approaches referenced earlier have their pros and cons, which we will discuss later in the document. For this research, we are implementing and testing an approach called an Exploratory Policy. The policy prioritizes exploring new states and actions when little is known about the environment. Once enough information has been learned about the environment, the policy then guides the learning agent using a deterministic policy.

• For The Learning Egg Problem, the Following Outlines Behavior Using an Exploratory Policy:

• When the platform is at an angle that has not been experienced yet, the egg should make a random move from the set of all possible moves.
• When the platform is at a previously experienced angle, and the egg has made moves and learned from actions from that angle:

• If the egg has learned from one or some actions, but has not learned from all of the actions to take from that platform angle, the egg makes a random move from the set of actions that have not been taken from that state.
• If the egg has learned from every move from that angle, make the move that has yielded the highest reward based on experience.

This approach prioritizes exploring early in the learning process, then prioritizes maximizing rewards once information has been learned.

• Enumeration Policy
There are other approaches to training in reinforcement learning. One of these approaches will be referred to as an enumeration policy. This type of policy explicitly iterates through every combination of the states and actions in the environment until the data model is filled with information.

• For The Learning Egg Problem, the Following Outlines Behavior Using an Enumeration Policy:

• The platform is forced into every position iteratively one-by-one.
• At each position, the egg makes every possible move and learns from the rewards from its actions.
• This is continued until every action has been taken from every state.
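The following sketch shows how the greedy, epsilon-greedy, and Exploratory selection rules described above could be written in JavaScript. It reuses the illustrative getReward lookup from the earlier data-model sketch, interprets the "best" reward as the one closest to 0 (following Section III.E), and omits the per-action sample-count threshold discussed later under Sampling; all of these are assumptions made for illustration.

const ACTIONS = [-2, -1, 0, 1, 2]; // Hard Left Lean ... Hard Right Lean

const tried = s => ACTIONS.filter(a => getReward(s.position, s.speed, a) !== null);
const untried = s => ACTIONS.filter(a => getReward(s.position, s.speed, a) === null);
const randomFrom = list => list[Math.floor(Math.random() * list.length)];

// Greedy: among sampled actions, pick the one whose reward is closest to 0;
// if nothing has been learned for this state yet, fall back to a random move.
function greedy(s) {
  const known = tried(s);
  if (known.length === 0) return randomFrom(ACTIONS);
  return known.reduce((best, a) =>
    Math.abs(getReward(s.position, s.speed, a)) < Math.abs(getReward(s.position, s.speed, best)) ? a : best);
}

// Epsilon-greedy: explore with probability equal to the exploration rate, otherwise act greedily.
function epsilonGreedy(s, explorationRate) {
  return Math.random() < explorationRate ? randomFrom(ACTIONS) : greedy(s);
}

// Exploratory Policy: sample untried actions first, then switch to the greedy choice.
function exploratory(s) {
  const remaining = untried(s);
  return remaining.length > 0 ? randomFrom(remaining) : greedy(s);
}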


Note that there are both software and physical limitations for this type of approach. In a physical setting, physics, cost, and safety can render this policy not feasible. This policy is also inefficient from a software complexity standpoint, and does not scale well, which limits its use for software-based machine learning.

• Sampling
Sampling is when a learning agent takes an action from a state and observes the result. In some learning environments, sampling produces relatively consistent results or rewards. This is common in many software-based simulations. However, some learning environments have factors that can lead to varied results when the same action is taken from the same state. This can occur in many physical environments where factors such as weather, turbulence, and air resistance can affect the consistency of the rewards from sampling. When this is the case, multiple samples may need to be taken in order to derive an average or a distribution of rewards. The Exploratory Policy allows for a number of samples in order for a state-action combination to be considered “learned”. This will allow the learning agent to explore unknown states until all of the actions from a state have been sampled an adequate number of times. After all of the actions from a state are learned, a learning agent following the Exploratory Policy will then take the greedy action.

• Learning Egg Simulation
The Learning Egg is simulated via a web application that utilizes JavaScript to implement the logic, physics, learning model, and policies for solving the problem. The egg can be moved manually, via automation, and by several reinforcement learning techniques. Moving the egg will affect the position and speed of the platform based on the rules and constraints defined in Section 3.4. The egg can be manually moved left and right by a user pressing the left and right keys on the computer keyboard.

Fig 12: The Learning Egg Web Application

• The Balancing Game
Performance for each learning approach is measured by simulating The Balancing Game. The goal of the game is to balance the platform for 5 seconds. The platform is considered balanced when its position is between -1° and 1° for 5 seconds. Some approaches did not result in a balanced egg; however, a rolling 5 second average of the platform’s angle is tracked. The best rolling 5 second average would measure how close the platform came to being balanced during the simulation.
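A sketch of how the balance condition and the rolling 5 second average might be tracked, assuming the platform angle is sampled with millisecond timestamps; the sampling scheme and data structures are assumptions, not details given in the paper.

// Track how long the platform has stayed within -1°..1° and a rolling 5 second average angle.
const WINDOW_MS = 5000;
let balancedSinceMs = null; // time the platform last entered the balanced band
const samples = [];         // recent { timeMs, angle } pairs for the rolling average

function observe(timeMs, angle) {
  // Balance condition: the angle stays between -1° and 1° for a full 5 seconds.
  if (Math.abs(angle) <= 1) {
    if (balancedSinceMs === null) balancedSinceMs = timeMs;
  } else {
    balancedSinceMs = null;
  }
  // Rolling 5 second average of the platform angle.
  samples.push({ timeMs, angle });
  while (samples.length && samples[0].timeMs < timeMs - WINDOW_MS) samples.shift();
  const rollingAvg = samples.reduce((sum, s) => sum + s.angle, 0) / samples.length;
  const isBalanced = balancedSinceMs !== null && timeMs - balancedSinceMs >= WINDOW_MS;
  return { isBalanced, rollingAvg };
}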


• Training Machine Learning Model
Approaches that utilize reinforcement learning will require training. This occurs by the egg being guided by a learning policy to make movements while populating the data model with learned data. This can result in the egg balancing during training; however, some approaches did not result in the egg balancing.

After training the data model, we tested the quality of the learned data by introducing “turbulence”, or instability, into the environment and testing how quickly the egg would rebalance the platform by following a greedy policy. This was executed by shifting the egg and the platform all the way to the right, then continuously pressing the right arrow key on the keyboard to constantly manually move the egg to the right. This is all done while the egg tries to rebalance by following a greedy policy. This introduction of instability into the environment will test how the learning policies react to unforeseen changes to the environment.

• Performance and Metrics
The following metrics will be used to measure the performance of the manual, automated, and machine learning approaches:

• Best Angle From Balanced - The best rolling 5 second average of the platform’s angle. This would measure how close the platform came to being balanced during the simulation.
• Number of Moves Until Balanced - The number of movements that the egg has to make in order to balance. This would measure the efficiency of the egg movements, where balancing using fewer moves would be considered more efficient.
• Time Until Balanced - The amount of time that the egg took to balance. This would measure efficiency in terms of time, with a faster time considered more efficient.
• Time Until Rebalanced - The amount of time that the egg took to balance after instability is introduced into the environment and a greedy policy is followed. This would measure how quickly the learning policies react to unforeseen changes to the environment, with a faster time considered more efficient.

V. RESULTS

The Learning Egg problem was simulated using the manual, automated, and machine learning techniques discussed in this document. The table below shows the average result of running 10 simulations for each approach:

Table 5: Results from Simulations


Learning Policy | Best Angle From Balanced | Number of Moves Until Balanced | Time Until Balanced | Time to Rebalance
Exploratory | 0° | 11,992.3 | 00:18:56.8 | 00:00:55.3
Greedy | 0° | 6268.7 | 00:15:22.5 | 00:01:31.9
Epsilon-Greedy (10%) | 0° | 13,802.1 | 00:18:46.9 | 00:01:20.7
Enumeration | 0° | 25,025.9 | 01:25:46.4 | 00:00:07.1
Automated | 3.08° | Does Not Balance | Does Not Balance | N/A
Manual | 5.77° | Does Not Balance | Does Not Balance | N/A
Epsilon-Greedy (50%) | 33.10° | Does Not Balance | Does Not Balance | Does Not Rebalance
Epsilon-Greedy (90%) | 37.64° | Does Not Balance | Does Not Balance | Does Not Rebalance

The Epsilon-Greedy policies are denoted with the exploration rate percentage. Note that non-machine-learning approaches did not utilize a machine learning data model, thus could not be used to rebalance the platform. The “Time to Rebalance” column for manual and automated approaches uses “N/A” to indicate this.

A. Balancing Performance
From these results, we see that only the Exploratory, Greedy, Epsilon-Greedy (10%), and Enumeration policies resulted in the platform being balanced. The Automated approach outperformed the Manual approach with best 5 second rolling averages of 3.08° and 5.77° respectively. Epsilon-Greedy policies with moderate to high exploration rates performed the worst. Policies with exploration rates of 50% and 90% yielded best 5 second rolling averages of 33.10° and 37.64° respectively.

For the machine learning approaches that resulted in a balanced platform during training, the Greedy policy achieved this in the least number of moves from the egg on average, which was 581.7. The Exploratory and Epsilon-Greedy (10%) policies required an average of 11,992.3 and 13,802.1 moves respectively, with the former slightly outperforming the latter. The Enumeration policy performed the worst in that it required an average of 25,025.9 moves. This is due to the policy iterating through every combination of state and action.

• Note: The following times are formatted as follows: HH:MM:SS

• HH - 2 digits representing the number of hours
• MM - 2 digits representing the number of minutes
• SS - 2 digits representing the number of seconds

The Greedy policy also held the best average performance of 00:03:12.5 with respect to the time that it took to balance the platform during training. The Exploratory and Epsilon-Greedy (10%) policies required an average time of 00:16:46.8 and 00:18:46.9 respectively for the platform to balance, with the former slightly outperforming the latter. The Enumeration policy performed the worst with an average time of 01:25:46.4.


B. Rebalancing Test
After training, we tested the quality of the learned data by introducing turbulence and observing how quickly a greedy policy can use the learned data to rebalance. This tested how well the learning policies react to unforeseen changes to the environment. Interestingly, the Greedy policy performed poorly with an average rebalancing time of 59.9 seconds. This was due to the learning agent not exploring and learning enough of the environment to know what to do in the states introduced by instability. The Epsilon-Greedy (10%) performed even worse at an average time of 1 minute and 0.7 seconds. Although the learning agent is able to balance during training, the learning agent would require significantly more training to reduce the rebalancing time. The trade-off there would be that more training would increase the amount of time required for adequate training. The Exploratory and Enumeration policies performed the best, averaging 18.3 and 7.1 seconds respectively. The Exploratory policy prioritizes exploring states and actions that have not been visited before. This adds some range to the set of states that get explored, while minimizing re-visiting states. The Exploratory policy’s trait of switching to a greedy policy after enough information is known about a state allowed the learning agent to move closer to the goal faster than Epsilon-Greedy policies. The Enumeration policy yielded a comprehensive dataset of every state and action. This produced a data model that was robust enough to handle turbulence well.

C. Knowledge Map Visualization
The Learning Egg web application includes a knowledge map, which is a 2-dimensional grid that shows a visualization of the coverage in knowledge. The map shows which platform position/speed states the egg has been trained for. The x-axis represents the platform position, and the y-axis represents the platform speed.

Fig 13: Knowledge Map Visualization

Each grid cell represents the number of actions, at a specific platform position/speed state, that the learning agent has been
trained for.

Fig 14: Knowledge Map Grid Cell Color Mappings
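A sketch of how each cell's value could be derived from the reward array in the earlier data-model sketch: for a given platform position and speed, count how many of the five egg actions have a learned (non-null) reward. The names reuse the earlier illustrative sketch and are not taken from the paper's code.

// Number of egg actions (out of 5) that have been trained for a given position/speed cell.
function trainedActionCount(position, speed) {
  return [-2, -1, 0, 1, 2].filter(egg => getReward(position, speed, egg) !== null).length;
}

// Build the knowledge map grid: the x-axis is platform position, the y-axis is platform speed.
const knowledgeMap = [];
for (let speed = -2; speed <= 2; speed++) {
  const row = [];
  for (let position = -50; position <= 50; position++) {
    row.push(trainedActionCount(position, speed));
  }
  knowledgeMap.push(row);
}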

D. Extensibility
The process for developing machine learning models that is outlined in this document was designed to be extensible. This means that it can be applied to other problems. The same principles used for The Learning Egg problem can be used to define the environment and learning policies for other problems. One example of how this approach can be applied would be a drone.

The Learning Egg problem used the following data model to represent its environment.

Platform Position Platform Speed Egg Position Reward


Fig 15: Learning Egg Platform/Egg State Action Reward Data Model


Instead of balancing a platform, the goal for a drone would be to balance the drone based on position and speed. Instead of an egg moving, the drone’s propellers are moving. The drone could use a similar data model to The Learning Egg.

Drone Position Drone Speed Propeller Position Reward


Fig 16: Drone/Propeller State Action Reward Data Model

• The following Outlines a Process for Developing a Machine Learning Model to Solve Problems Using Reinforcement Learning:

• Define the problem and the goal
• Define environment components (states, actions, rewards)
• Define environment rules and constraints
• Define a data model
• Develop a data structure for the data model
• Implement the environmental logic and policies in software
• Train the learning agent to learn an optimal solution

VI. CONCLUSION

Machine learning is a useful tool for solving problems that manual or purely automated solutions cannot. In order to solve these problems, they will need to be mapped to a machine learning model. We have outlined an approach to building machine learning models using reinforcement learning that is extensible. Through simulations, we have observed the performance of various learning policies and approaches. We observed that the Exploratory policy performed well across the board. Due to its directive to explore the unknown, and to maximize the reward once enough information is known, the policy allowed for adequate exploration while minimizing revisiting states. The policy also took a greedy approach once enough information was known about a state, which moved the egg and platform closer to the goal of being balanced faster. It is important to select a data structure that optimizes performance and facilitates learning. This is why multi-dimensional arrays proved useful due to their fast lookup complexity of O(1). The Learning Egg problem provides an example of how using a learning policy that explores states and actions that have not been explored before, then switching to a deterministic policy, can be effective. This paper gives insight into how to build a machine learning model, not only for The Learning Egg problem, but to solve a variety of problems.

REFERENCES

[1]. Manasi Vartak, Harihar Subramanyam, Wei-En Lee, Srinidhi Viswanathan, Saadiyah Husnoo, Samuel Madden, Matei Zaharia, 2016. ModelDB: A System for Machine Learning Model Management.
[2]. Emily Sullivan, 2022. Understanding from Machine Learning Models. The British Journal for the Philosophy of Science, Volume 73, Number 1.
[3]. James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda Viegas, and Jimbo Wilson, 2020. The What-If Tool: Interactive Probing of Machine Learning Models. IEEE Transactions on Visualization and Computer Graphics, Vol. 26, No. 1.
[4]. Christopher M. Bishop, 2013. Model-Based Machine Learning. Phil Trans R Soc A 371: 20120222.
