A Machine Learning Model For Training Your AI
II. RELATED WORK

Approaches to developing machine learning models are evolving. There are various theories, algorithms, and policies that are available for solving problems using machine learning. Data scientists, for example, can build tens to hundreds of models before arriving at one that meets some acceptance criteria (e.g. AUC cutoff, accuracy threshold)[1]. The process for building a machine learning model is not an exact science. Emily Sullivan, in her research, outlines considerations and rationale behind selecting a learning model and mentions some of these challenges using Neural Networks[2]. MIT researchers Manasi Vartak et al. describe the challenges in building models that adequately correlate to the problem being solved[1].

There are some efforts to organize and standardize machine learning models. In their work, Manasi Vartak et al. are developing a system for the management of machine learning models called ModelDB. While tools like ModelDB can be useful for managing models, this paper focuses on the process of taking a problem, deconstructing it into components that are used to build a learning model, and then designing, training, and testing the learning model. James Wexler et al., in their research, developed a tool called What-If to evaluate the performance of several machine learning models[3]. In their study, they show how selecting the appropriate model, algorithm, and criteria can significantly impact the efficiency and the accuracy of results produced by the data model.

The advantages of modeling have been known for many years. In 2013, Christopher M. Bishop touched on some of these in his research on model-based approaches to machine learning[4]. These advantages included the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models.

III. MACHINE LEARNING MODEL FOR REINFORCEMENT LEARNING

A machine learning model is a framework that defines the states, actions, rewards, logic, data models, and learning policies required to identify patterns or make predictions using machine learning.
The following outlines a process for deriving machine learning models. The process can be used along with learning policies to solve problems. This includes an approach for implementing the data model used for reinforcement learning.

A. The Learning Egg Problem
To assist in explaining the process for developing a machine learning model, we will use the example of The Learning Egg problem. The problem consists of an egg sitting on a platform that rotates left and right, where the egg can lean to the left and the right to shift the platform in either direction. The objective is for the platform to be balanced and not moving, or as close to being balanced as possible.

B. Define Problems and Goals
The objective in reinforcement learning is to solve a problem through learning from experience. In order to solve the problem, it should be broken down into sub-problems and sub-goals that can be mapped into a machine learning model. Sub-problems and sub-goals should include elements that are quantifiable and limited to factors that affect the overall objective.
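As a rough illustration only (not taken from the paper's implementation), the decomposition for The Learning Egg could be captured in a small JavaScript structure like the following; the wording of the sub-problems is an assumption based on the objectives described later in this section:

// Hypothetical sub-problem to sub-goal mapping for The Learning Egg.
// Each sub-goal is quantifiable and limited to factors that affect the overall objective.
const subGoals = [
  { subProblem: "the platform is rotated away from level", subGoal: "platform position of 0 degrees" },
  { subProblem: "the platform is still moving", subGoal: "platform speed of 0 degrees per second" }
];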
Fig 2: Sub-Problem to Sub-Goal Mapping

For the Learning Egg, the problems and goals are defined in the following sections.

Problems

States
A state consists of a combination of quantifiable factors in the environment that affect the problem and the goal. A state factor is defined as a measurable and quantifiable value, also called a state value, that reflects an environmental condition at a point in time. Each state value should have an objective that is aligned with the overall goal. A distinct state is essentially a snapshot of the set of state values at a point in time.
For The Learning Egg problem, the platform’s speed is one of its state values. The objective for this state value is to be zero, which indicates that the platform is not moving.

Note that the factors associated with the egg are not included in the state, as the goal only references the platform and not the egg.

Actions
An action is an operation that the learning agent, which is the egg in this example, can take to change the state of the environment. Actions are directly enacted by the learning agent and result in a change in one or more state values.

Actions should be defined along with a reward metric to gauge how “good” or “bad” the result of an action is. A reward metric is a numerical value that represents how positive or negative the result of an action is. An optimal action is the action that will result in the highest reward or the “best” outcome from the action.

For The Learning Egg problem, the only action that can be taken is for the egg to lean to the left or right, or to stand upright. This can be quantified by the egg’s position.
Rewards
Rewards are quantifiable observations of the environment that indicate how positive or negative the result of an action is. They need to be quantifiable because they need to indicate a state of positivity or negativity to integrate into a learning model. Rewards are observed after an action is taken and give the machine learning model feedback that is used to decide on future actions. Each reward should map to a state objective and should be able to measure the result of an action within the context of the objective.

For The Learning Egg problem, the sum of the platform position and the platform speed was used as the reward metric. The objective of the platform being balanced can be measured by the platform angle position. The objective that the platform not be moving while balanced can be measured by the platform speed. The goal for the reward is for both the platform position and speed to be zero, which indicates that the platform is balanced and not moving.

Fig 4: State to Objective Mapping

D. Define Environment Rules and Constraints
Once the environment’s components have been defined, rules and constraints should be defined for them. Rules should describe how the environment behaves as it relates to state values and actions. Constraints define what the environment’s components can and cannot do. Constraints should be defined as limitations for the state values and actions. This can be implemented by defining a numerical range, scale, and limits for the environment’s components. Environment rules and constraints for The Learning Egg problem are defined in the following sections.

Platform Position
The platform can rotate from -50° to 50°. Thus, the platform position value will range from -50 to 50, with the value 0 meaning that the platform is balanced. Negative values indicate that the platform has rotated left. Positive values mean that the platform has rotated right.
Egg Position
The egg can move into 5 positions. The egg position will be represented by the numbers -2, -1, 0, 1, and 2, with each number denoting one of these positions.
Platform Speed
The platform speed is defined as the change in the platform position over time, measured in degrees per second.
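As a rough JavaScript sketch of the rules and constraints above (the constant and function names are illustrative assumptions rather than the simulation's actual code, and the mapping of -2 to a hard left lean is inferred rather than stated):

// Constraints for The Learning Egg environment.
const PLATFORM_MIN = -50;                // degrees; platform fully rotated left
const PLATFORM_MAX = 50;                 // degrees; platform fully rotated right
const EGG_POSITIONS = [-2, -1, 0, 1, 2]; // the five egg positions (-2 assumed to be a hard left lean)

// Platform speed: change in platform position over time, in degrees per second.
function platformSpeed(previousPosition, currentPosition, elapsedSeconds) {
  return (currentPosition - previousPosition) / elapsedSeconds;
}

// Reward metric described above: sum of the platform position and platform speed.
// A value of 0 indicates the platform is balanced and not moving.
function reward(platformPosition, speed) {
  return platformPosition + speed;
}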
State val 1 | State val 2 | State val 3 | ... | Action 1 | Action 2 | ... | Reward
Fig 7: State Action Reward Table Column Structure
Based on this format, the table for The Learning Egg example will be structured as follows:

Platform Position - Number of degrees that the platform has leaned away from a balanced position, or 0°, before egg movement
Platform Speed - Degrees per second that the platform is moving before egg movement
Action - The position that the egg is moving to
Reward - Sum of the platform position and speed

Platform Position | Platform Speed | Egg Position | Platform Position + Platform Speed
Fig 8: Platform/Egg State Action Reward Table Column Structure
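To make the column structure of Fig 8 concrete, a single row of the data model could be represented as a plain JavaScript object; this is a sketch, the field names are assumptions, and the values are taken from the worked example discussed below:

// One row of the Platform/Egg state-action-reward table.
const sampleRow = {
  platformPosition: 25, // degrees, before egg movement
  platformSpeed: 1,     // degrees per second, before egg movement
  eggPosition: -2,      // action: the position the egg is moving to (-2 assumed to be a hard left lean)
  reward: 23            // platform position + platform speed observed after the move
};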
As the data model is populated with data, the learning agent will gain an understanding of potential outcomes from its actions. The following is a sample of what the data model can look like as it is being populated:

The data model would be populated when the platform has experienced the position of 25° while moving at a speed of 1 degree/second multiple times. When the egg makes a movement from that platform position and speed, it observes the resulting platform position and speed, then calculates their sum as the reward. Based on the information in the data table above, we can infer that at a platform position of 25°, and moving at a speed of 1 degree/second, the optimal action is a Hard Left Lean because that results in the reward of 23. Since the goal of The Learning Egg problem is to get to a platform position and speed of 0, the reward value closest to 0 is considered optimal. Thus, the reward value of 23, being the closest value to 0 for this platform position and speed, would be considered optimal. This would lead the learning agent to take a Hard Left Lean if it were pursuing the optimal action.

Note that the data table represents a small subset of a larger dataset. The data model can store data for every combination of platform position, platform speed, egg action, and reward.
State val 1 | State val 2 | State val 3 | ... | Action 1 | Action 2 | ... | Reward
array[state_val_1][state_val_2][state_val_3][action_1][action_2] = reward
Fig 10: Multi-Dimensional Array State Action Reward Data Model

Using this approach for The Learning Egg problem produces the following multi-dimensional array:
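The populated array itself is not reproduced here. The following is a minimal JavaScript sketch, with assumed identifiers, of how such a multi-dimensional data model and an optimal-action lookup could be implemented, using the rule described above that the reward closest to 0 is considered optimal:

// rewards[platformPosition][platformSpeed][eggAction] holds the learned reward.
const rewards = {};

function recordReward(position, speed, eggAction, observedReward) {
  rewards[position] = rewards[position] || {};
  rewards[position][speed] = rewards[position][speed] || {};
  rewards[position][speed][eggAction] = observedReward;
}

// The optimal action for a state is the one whose learned reward is closest to 0,
// since a platform position and speed of 0 means balanced and not moving.
function optimalAction(position, speed) {
  const actions = rewards[position] && rewards[position][speed];
  if (!actions) return null;
  let best = null;
  for (const action of Object.keys(actions)) {
    if (best === null || Math.abs(actions[action]) < Math.abs(actions[best])) {
      best = action;
    }
  }
  return best;
}

For example, recordReward(25, 1, -2, 23) would store the outcome from the worked example above, and optimalAction(25, 1) would then return the action whose stored reward is closest to 0.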
IV. TRAINING PHASE

Reinforcement learning is an iterative process in which the learning agent is trained by running simulations of the problem. During these simulations, the learning agent will interact with the environment and populate the data model, thus learning about the environment. There are various approaches and factors that go into training a learning agent. We will evaluate deterministic, stochastic, manual, automated, and other approaches. For the approaches that involve reinforcement learning, the data model is populated with reward data.
Learning Policy and Approach
Now that we have developed a data model, we are faced with the challenge of “how” to train our data model. Are some approaches more efficient than others? Do some approaches adapt better to changing environments? Why choose one approach over another? How do these approaches compare to manual or automated solutions? In this section, we will cover the rationale behind why policies and approaches to training were chosen.

Stochastic Policy
In reinforcement learning, stochastic policies will choose a random action from a state some percentage of the time and will otherwise follow a greedy policy. The notion of choosing a random action is referred to as “exploring”. The exploration rate is the percentage of actions that should be taken by the learning agent in an exploring fashion vs a greedy fashion.

For The Learning Egg problem, the following outlines the behavior for the stochastic approach used in this research (a short code sketch of this behavior follows the policy outlines below):
For a percentage of the time that matches the exploration rate, the egg should make a random move from the set of all possible moves.
The rest of the time, the egg should make the move that has yielded the highest reward based on experience.

For The Learning Egg problem, the following outlines the behavior using an enumeration policy:
The platform is forced into every position iteratively, one by one.
At each position, the egg makes every possible move and learns from the rewards from its actions.
This is continued until every action has been taken from every state.
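A minimal JavaScript sketch of the stochastic (Epsilon-Greedy) behavior outlined above; the function signature and identifiers are assumptions rather than the simulation's actual code:

// Stochastic policy: explore with a probability equal to the exploration rate,
// otherwise take the move that has yielded the best reward based on experience.
function chooseMove(explorationRate, possibleMoves, bestKnownMove) {
  if (Math.random() < explorationRate || bestKnownMove === null) {
    // Explore: random move from the set of all possible moves.
    return possibleMoves[Math.floor(Math.random() * possibleMoves.length)];
  }
  // Exploit: follow the greedy choice learned from experience.
  return bestKnownMove;
}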
Note that there are both software and physical limitations for this type of approach. In a physical setting, physics, cost, and safety can render this policy not feasible. This policy is also inefficient from a software complexity standpoint, and does not scale well, which limits its use for software-based machine learning.

Sampling
Sampling is when a learning agent takes an action from a state and observes the result. In some learning environments, sampling produces relatively consistent results or rewards. This is common in many software-based simulations. However, some learning environments have factors that can lead to varied results when the same action is taken from the same state. This can occur in many physical environments where factors such as weather, turbulence, and air resistance can affect the consistency of the rewards from sampling. When this is the case, multiple samples may need to be taken in order to derive an average or a distribution of rewards.

Exploratory Policy
The Exploratory Policy allows for a number of samples in order for a state-action combination to be considered “learned”. This will allow the learning agent to explore unknown states until all of the actions from a state have been sampled an adequate number of times. After all of the actions from a state are learned, a learning agent following the Exploratory Policy will then take the greedy action.

Learning Egg Simulation
The Learning Egg is simulated via a web application that utilizes JavaScript to implement the logic, physics, learning model, and policies for solving the problem. The egg can be moved manually, via automation, and by several reinforcement learning techniques. Moving the egg will affect the position and speed of the platform based on the rules and constraints defined in Section 3.4. The egg can be manually moved left and right by a user pressing the left and right keys on the computer keyboard.
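As a rough JavaScript sketch of the sampling and Exploratory Policy behavior described earlier in this section (the sample threshold, state keys, and identifiers are assumptions):

// Track how many times each state-action combination has been sampled,
// along with a running average of the observed rewards.
const samples = {}; // samples[stateKey][action] = { count, averageReward }

function recordSample(stateKey, action, observedReward) {
  samples[stateKey] = samples[stateKey] || {};
  const entry = samples[stateKey][action] || { count: 0, averageReward: 0 };
  entry.averageReward = (entry.averageReward * entry.count + observedReward) / (entry.count + 1);
  entry.count += 1;
  samples[stateKey][action] = entry;
}

// Exploratory Policy: keep sampling until every action from the state has been
// sampled an adequate number of times, then take the greedy action
// (here, the action whose average reward is closest to the goal of 0).
function exploratoryAction(stateKey, possibleActions, requiredSamples) {
  const entries = samples[stateKey] || {};
  const unexplored = possibleActions.filter(
    a => !entries[a] || entries[a].count < requiredSamples
  );
  if (unexplored.length > 0) return unexplored[0]; // this state is not fully learned yet
  return possibleActions.reduce((best, a) =>
    Math.abs(entries[a].averageReward) < Math.abs(entries[best].averageReward) ? a : best
  );
}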
The Balancing Game
Performance for each learning approach is measured by simulating The Balancing Game. The goal of the game is to balance the platform for 5 seconds. The platform is considered balanced when its position is between -1° and 1° for 5 seconds. Some approaches did not result in a balanced egg; however, a rolling 5 second average of the platform’s angle is tracked. The best rolling 5 second average would measure how close the platform came to being balanced during the simulation.

Training Machine Learning Model
Approaches that utilize reinforcement learning will require training. This occurs by the egg being guided by a learning policy to make movements while populating the data model with learned data. This can result in the egg balancing during training; however, some approaches did not result in the egg balancing.

After training the data model, we tested the quality of the learned data by introducing “turbulence”, or instability, into the environment and testing how quickly the egg would rebalance the platform by following a greedy policy. This was executed by shifting the egg and the platform all the way to the right, then continuously pressing the right arrow key on the keyboard to constantly manually move the egg to the right. This is all done while the egg tries to rebalance by following a greedy policy. This introduction of instability into the environment will test how the learning policies react to unforeseen changes to the environment.

Performance and Metrics
The following metrics will be used to measure the performance of the manual, automated, and machine learning approaches:

Best Angle From Balanced - The best rolling 5 second average of the platform’s angle. This would measure how close the platform came to being balanced during the simulation.
Number of Moves Until Balanced - The number of movements that the egg has to make in order to balance. This would measure the efficiency of the egg movements, where balancing using fewer moves would be considered more efficient.
Time Until Balanced - The amount of time that the egg took to balance. This would measure efficiency in terms of time, with a faster time considered more efficient.
Time Until Rebalanced - The amount of time that the egg took to balance after instability is introduced into the environment and a greedy policy is followed. This would measure how quickly the learning policies react to unforeseen changes to the environment, with a faster time considered more efficient.

V. RESULTS

The Learning Egg problem was simulated using the manual, automated, and machine learning techniques discussed in this document. The table below shows the average result of running 10 simulations for each approach:
The Epsilon-Greedy policies are denoted with the exploration rate percentage. Note that non-machine-learning approaches did not utilize a machine learning data model, thus could not be used to rebalance the platform. The “Time to Rebalance” column for manual and automated approaches uses “N/A” to indicate this.

Greedy (10%) policies required an average of 11,992.3 and 13,802.1 moves respectively, with the former slightly outperforming the latter. The Enumeration policy performed the worst in that it required an average of 25,025.9 moves. This is due to the policy iterating through every combination of state and action.
B. Rebalancing Test
After training, we tested the quality of the learned data by introducing turbulence and observing how quickly a greedy policy can use the learned data to rebalance. This tested how well the learning policies react to unforeseen changes to the environment. Interestingly, the Greedy policy performed poorly with an average rebalancing time of 59.9 seconds. This was due to the learning agent not exploring and learning enough of the environment to know what to do in the states introduced by instability. The Epsilon-Greedy (10%) policy performed even worse at an average time of 1 minute and 0.7 seconds. Although the learning agent is able to balance during training, the learning agent would require significantly more training to reduce the rebalancing time. The trade-off there would be that more training would increase the amount of time required for adequate training. The Exploratory and Enumeration policies performed the best, averaging 18.3 and 7.1 seconds respectively. The Exploratory policy prioritizes exploring states and actions that have not been visited before. This adds some range to the set of states that get explored, while minimizing re-visiting states. The Exploratory policy’s trait of switching to a greedy policy after enough information is known about a state allowed the learning agent to move closer to the goal faster than the Epsilon-Greedy policies. The Enumeration policy yielded a comprehensive dataset of every state and action. This produced a data model that was robust enough to handle turbulence well.

C. Knowledge Map Visualization
The Learning Egg web application includes a knowledge map, which is a 2-dimensional grid that shows a visualization of the coverage in knowledge. The map shows which platform position/speed states the egg has been trained for. The x-axis represents the platform position, and the y-axis represents the platform speed.
Each grid cell represents the number of actions, at a specific platform position/speed state, that the learning agent has been trained for.
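As a sketch of how such a knowledge map could be derived from the data model (assuming a samples structure keyed by "position,speed" strings, as in the earlier sketch; this is illustrative, not the web application's actual code):

// Build the 2-dimensional knowledge map: x-axis = platform position, y-axis = platform speed.
// Each cell counts how many actions have been trained for that position/speed state.
function buildKnowledgeMap(samples) {
  const map = {};
  for (const stateKey of Object.keys(samples)) {
    const [position, speed] = stateKey.split(",");
    map[position] = map[position] || {};
    map[position][speed] = Object.keys(samples[stateKey]).length;
  }
  return map;
}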
Instead of balancing a platform, the goal for a drone would be to balance the drone based on position and speed. Instead of an egg moving, the drone’s propellers are moving. The drone could use a similar data model to The Learning Egg.