UKF ppt
• Here,
x – state vector
f – process model
h – observation model
u – control input
z – observation vector
w – process/control noise
v – measurement/observation noise
• The most famous estimator for such systems is the Kalman Filter, which is optimal for linear systems.
• However, most real-world systems are non-linear in nature.
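The system description above can be made concrete with a toy non-linear model. The 1D constant-velocity dynamics and the range-only sensor below are illustrative assumptions, not taken from the slides; they only show how f, h, u, w and v fit together:

```python
import numpy as np

# Hypothetical example: x = [position, velocity], range-only sensor.
# The dynamics and sensor are assumptions for illustration only.

def f(x, u, w, dt=0.1):
    """Process model: constant-velocity motion with control acceleration u."""
    pos, vel = x
    return np.array([pos + vel * dt,        # integrate position
                     vel + u * dt + w])     # integrate velocity + process noise w

def h(x, v):
    """Observation model: noisy range to the origin."""
    pos, _ = x
    return abs(pos) + v                     # range + measurement noise v

x1 = f(np.array([1.0, 2.0]), u=0.0, w=0.0)  # one prediction step
z1 = h(x1, v=0.0)                           # one observation
```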
UNSCENTED KALMAN FILTER
• Instead of handling the non-linearity with Jacobians, as the EKF does, the UKF represents the current state as a probability distribution with a mean and covariance.
• A set of “sigma points” are drawn from the
probability distribution.
• The UKF is also not completely accurate. It can still diverge to an incorrect state when trying to combine data from multiple sensors.
UKF FLOW
The formalisation of the UKF for the discrete system mentioned above is as follows:
• Define an augmented state vector xa, of length M, that concatenates the process/control noise and measurement noise terms with the state variables as:
• The augmented state vector and associated augmented state covariance, P a, are initialised
with:
• Where the initial augmented mean is the expected value of the initial (regular) state and Pk is the (regular) state covariance.
• The current augmented state and covariance are used to generate the set of sigma points, X, using:
• Where (m) and (c) denote whether the weight is used for a mean calculation or a covariance calculation, and beta is used to incorporate prior knowledge of the distribution around the mean.
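The sigma-point and weight calculation above can be sketched as follows. This is the common scaled (Van der Merwe) form; the alpha and kappa defaults are typical values and an assumption, since the slides do not give them:

```python
import numpy as np

# Sketch of scaled sigma-point generation for an augmented state xa
# (length M) with covariance Pa.  alpha/kappa defaults are assumptions.

def sigma_points(xa, Pa, alpha=1e-3, beta=2.0, kappa=0.0):
    M = len(xa)
    lam = alpha**2 * (M + kappa) - M
    S = np.linalg.cholesky((M + lam) * Pa)      # matrix 'square-root'
    X = np.empty((2 * M + 1, M))
    X[0] = xa
    for i in range(M):
        X[1 + i] = xa + S[:, i]                 # plus-direction points
        X[1 + M + i] = xa - S[:, i]             # minus-direction points
    Wm = np.full(2 * M + 1, 0.5 / (M + lam))    # (m) mean weights
    Wc = Wm.copy()                              # (c) covariance weights
    Wm[0] = lam / (M + lam)
    Wc[0] = Wm[0] + (1 - alpha**2 + beta)       # beta injects prior knowledge
    return X, Wm, Wc
```

The weights sum to one for the mean, so the sigma-point cloud reproduces the original mean and covariance exactly.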
THE PREDICT STEP
• The predict step begins with the sigma points being propagated through the system
model:
• The parameters passed to the process model are highly application-specific: the values change depending on the application.
THE UPDATE STEP
• For the update step, the sigma points that were updated in the predict step are
propagated through the observation model:
• The mean and covariance of the observation-transformed sigma points are calculated:
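The weighted mean and covariance of transformed sigma points, used in both the predict and update steps, can be sketched as below. The array shapes and the names Wm/Wc are assumptions carried over from the sigma-point generation:

```python
import numpy as np

# Sketch: Z is (2M+1, n), one transformed sigma point per row (the output
# of f or h); Wm/Wc are the mean/covariance weights.

def unscented_moments(Z, Wm, Wc):
    z_mean = Wm @ Z                   # weighted mean of the sigma points
    R = Z - z_mean                    # 'sigma point residuals'
    cov = (Wc[:, None] * R).T @ R     # weighted sum of outer products
    return z_mean, cov
```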
UPDATE CONTD.
• Followed by the cross-covariance:
• Where z is the current set of observations, and the current covariance is updated with:
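The cross-covariance, Kalman gain and corrected estimate described above can be sketched together. The residual names Xr/Zr (predict and update 'sigma point residuals') are illustrative, and the explicit inverse is for clarity only:

```python
import numpy as np

# Sketch of the UKF update.  Xr/Zr are the predict/update 'sigma point
# residuals' (sigma points minus their mean); Wc are covariance weights.

def ukf_update(x_prior, P_prior, z, z_mean, Pzz, Xr, Zr, Wc):
    Pxz = (Wc[:, None] * Xr).T @ Zr       # cross-covariance
    K = Pxz @ np.linalg.inv(Pzz)          # Kalman gain
    x = x_prior + K @ (z - z_mean)        # corrected state estimate
    P = P_prior - K @ Pzz @ K.T           # corrected covariance
    return x, P
```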
HW/SW CODESIGN
• The predict and update steps are highly application-specific. For example, consider a 3D rigid-body dynamics model of a multi-rotor micro-UAV which accounts for gravity and air resistance. Changing to a different system would require the predict model to be updated to reflect the new system.
• The same applies to the update step: if the sensors change, the observation parameters must change with them, making the update step equally application-specific.
• For greater efficiency, the application-independent parts are implemented in hardware, while the application-specific parts are implemented in software, which is easier to develop and is portable.
• There are three ways of implementing the design on hardware:
1. Serial design
2. Parallel design
3. Pipeline design
SERIAL DESIGN
• The Serial design strategy is to minimise the area and power consumption
as much as possible with the intent to include the design into a greater
SoC.
• As such, the design forgoes one of the main benefits of hardware
implementations: wide parallelism.
• The UKF algorithm can be logically divided into two parts: the predict step
(for the Serial design, we consider sigma points generation as part of the
predict step) and the update step.
• The algorithm must first be initialised with an augmented state estimate
and covariance before any calculations can begin.
STATE MACHINE OF SERIAL DESIGN
• The Serial design IP core has five top-level states: an idle state, two initialisation states and
one state each for the two parts of the UKF.
PREDICT STEP
• The predict step for the Serial design generates the new set of sigma points and calculates the a priori state estimate.
• A block diagram of the predict step architecture is shown below:
• The predict step begins by using the current augmented state vector and covariance to
calculate new sigma points.
• To calculate the new set of sigma points, first the matrix 'square-root' of the current
augmented covariance must be calculated.
TRIANGULAR EQUATION SOLVER
• In addition to the matrix 'square-root', the Cholesky Decomposition is also used in the Kalman gain calculation, which involves a matrix inversion.
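The triangular-solve path described here can be sketched in a few lines: factor the observation covariance with a Cholesky Decomposition, then apply forward elimination and back substitution rather than forming an explicit inverse. The function name is illustrative:

```python
import numpy as np

# Sketch of the matrix right 'divide' K = Pxz @ inv(Pzz) via triangular
# solves, as used in the Kalman gain calculation.

def solve_via_cholesky(Pzz, Pxz):
    L = np.linalg.cholesky(Pzz)        # Pzz = L @ L.T, L lower-triangular
    Y = np.linalg.solve(L, Pxz.T)      # forward elimination:  L Y = Pxz^T
    K = np.linalg.solve(L.T, Y).T      # back substitution:    L^T X = Y
    return K
```

Avoiding the explicit inverse is both cheaper and numerically better behaved, which is why the hardware reuses the triangular equation solver here.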
MATRIX MULTIPLY-ADD
• The matrix multiply-add data path is a standard element-wise multiplication and
accumulation.
• The element-wise calculation is given by:
• The elements of the matrix to be added, C, can simply be injected into the accumulation
directly, instead of performing an additional matrix addition after a matrix multiplication.
• The data path of the same is shown below:
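The element-wise multiply-accumulate with C injected as the accumulator's initial value can be sketched as below; the explicit loops mirror the hardware datapath rather than aiming for speed:

```python
import numpy as np

# Sketch of D = A @ B + C where C seeds the accumulator, so no separate
# matrix-addition pass is needed after the multiplication.

def matrix_multiply_add(A, B, C):
    rows, inner = A.shape
    cols = B.shape[1]
    D = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            acc = C[i, j]                    # inject C into the accumulation
            for k in range(inner):
                acc += A[i, k] * B[k, j]     # multiply-accumulate
            D[i, j] = acc
    return D
```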
CALCULATING MEAN AND COVARIANCE
• Calculating the mean and covariance of the transformed sigma points are very similar operations, so both can be calculated by the same datapath.
• Calculation of mean from the predict step is given as:
• To avoid the subtraction operation, we can assume Xi= so the covariance reduces to:
UPDATE STEP
• The update step corrects the a priori state estimate with a set of observations to generate the
new state estimate.
• Many of the calculations in the update step are very similar to the predict step. The block
diagram is shown below:
CONTD.
• The update step starts with the copying of the current sensor
observations.
• Similar to the predict step, the observation mean is used to
calculate the update ‘sigma point residuals' (subtract) before
the covariances are calculated.
• The observation covariance is calculated with the update 'sigma point residuals', and the cross-covariance is calculated with both the predict and update 'sigma point residuals'.
• After the current state estimate and covariance are
calculated, they are, like the predict step, written back to the
processor memory buffer as well as, respectively, the
augmented state and covariance local memory blocks.
PARALLEL DESIGN
• The Parallel design reintroduces the main benefit of hardware implementations: wide
parallelism.
• This design strategy uses far more resources than the Serial design, but also increases performance.
• The design does so by encapsulating certain parts of the major datapaths into a sub-module
called a processing element (PE), then uses multiple instances of these PEs in parallel, allowing
multiple elements of an algorithm to be calculated at once.
• Two new modules are introduced: a memory 'prefetch', which fetches data from a serial memory block and places it into parallel memory blocks, and a memory 'serialiser', which collects results from the parallel scheme and outputs them in serial fashion.
TOP LEVEL DESIGN
• Instead of having digital control lines, the control register has been incorporated into the
memory map of the memory buffer.
• Instead of a simple FIFO, the memory buffer for the Parallel design has a proper internal
memory map to ensure the control information and data is coherent between the processor
and the IP core.
STATE MACHINE OF PARALLEL DESIGN
• The IP core is controlled by a state machine which has 5 states: idle, init, sig_gen, predict and
update.
• During the init state, the processor initialises the internal memory of the IP core with initial
values for the augmented state and covariance.
CONTD.
• The sig_gen state handles the calculation of the latest set of
sigma points.
• After the new sigma points have been propagated through
the predict model, the predict state uses the transformed
sigma points to calculate the a priori state and covariance.
• Similarly, the update state uses the update transformed sigma
points to calculate the current state and covariance.
• The predict and update steps may be performed together if
valid observations are available, or independently as required.
SIGMA POINTS GENERATION
• The sig_gen module takes the matrix 'square-root' of the augmented covariance, multiplies the result by a weighting matrix, and then adds the augmented state column-wise.
• The main difference from the Serial design is the need to introduce a memory prefetch as
well as a memory serialiser module.
TRISOLVE
• For the Parallel design, the fused multiply-add module and feedback FIFO have been encapsulated to form a processing element which can be instantiated multiple times in parallel.
MATRIX MULTIPLY-ADD
• The entire datapath from the Serial design has been enclosed as one processing element and
additional PEs are added to handle calculations in parallel.
PREDICT STEP
• The architecture for the predict step is shown below:
• The processor may initiate a predict step once it has placed valid transformed sigma points into
the memory buffer.
• The sigma point residuals are once again calculated first before the covariance calculation.
• Memory serialisers are necessary after the mean and covariance calculation as the memory
buffer is a serial memory.
UPDATE STEP
• The update step for the Parallel design is very similar to the update step in the Serial design.
• First, the prefetch module converts the transformed sigma points into a parallel memory
structure.
• The mean and 'sigma point residuals' are calculated, then used to calculate the observation
covariance.
• The update 'sigma point residuals' are also combined with the predict 'sigma point residuals',
which were calculated during the predict step, to calculate the cross covariance between the two
system models.
PIPELINE DESIGN
• The Pipeline design combines the main benefit of hardware implementations, wide parallelism, with a 'high-level' pipeline to increase performance even further.
• This design strategy uses the most resources but also has the highest performance in terms
of algorithm throughput.
• It has three major steps: sig_gen, predict, update.
• The top-level block diagram of the Pipeline design is shown below:
STAGES OF UKF PIPELINE
• The sig_gen step contains two large matrix operations: trisolve and the matrix multiply-add.
• It is broken into two stages, which form the first two stages of the pipeline.
• The third stage is the software 'stage' where the processor propagates the sigma points
through the system models.
• The final two stages are simply for the predict and update steps.
SIGMA POINTS GENERATION
• To start the sig_gen module, the processor must first place the current augmented state and
covariance estimate into the memory buffer.
• The first stage (sig_gen (a)) contains the matrix 'square-root' and a prefetch module to hold the augmented state vector.
• The second stage (sig_gen (b)) contains just the matrix multiply-add.
• The sig_gen module is able to accept new data (i.e. the augmented state and covariance of another
UKF instance) once the first stage is completed.
PREDICT STEP
• The functionality of the predict step in the Pipeline design is also very similar to the Parallel design
except that the a priori state estimate and covariance are output to a FIFO.
• This is because these values are necessary during calculations in the update step and there are no
longer any local memory blocks for the augmented state and covariance; once the values are output
into the FIFO, the predict module can continue with the next UKF instance.
• The processor may initiate a predict step once it has placed a valid set of transformed sigma points into
the memory buffer.
• The processor must propagate the sigma points, generated from the sig_gen module, through both the
predict model as well as the update model as both the predict and update steps are calculated
together in succession.
UPDATE STEP
• The update module is functionally similar to the Parallel design but has some key practical differences since none of the hardware is reused.
• As an example application, consider a multi-rotor UAV performing SLAM. The UAV linear and angular rates are controlled via the inputs:
CONTD.
• Where uxyz is the desired linear motion along the x, y and z axes, the second input is the desired angular motion about the roll, pitch and yaw rotational axes, and the remaining terms are the zero-mean Gaussian control noise.
• Landmarks in the environment are represented using the inverse depth parameterisation. For the i-th landmark Li:
• Where xi, yi, zi are the co-ordinates of the UAV, in the world frame, when the landmark was first seen; alpha and beta are the azimuth and elevation to the landmark, respectively, when it was first seen.
• rho is the inverse depth (i.e. rho = 1/d, where d is the distance to the landmark).
• The inverse depth parameterisation provides low linearisation errors at low parallax and has the
ability to represent any distance from the system immediately.
• Features effectively at infinity are normally unstable, requiring additional processing to treat or discard those sensor readings; in this parameterisation, however, the inverse depth is simply treated as zero.
SENSOR MODEL
• The only sensor used here is a pinhole camera, fixed to the front of the UAV with its aperture aligned perpendicular to the UAV x/roll axis.
• The sensor readings are simply the camera/image frame co-ordinates to the landmark.
• The camera model, giving the co-ordinates of some point P in the environment in the camera frame, is:
• Where fu and fv are the distances from the centre of the aperture of the camera to the centre of the image plane; xP, yP, zP are the Cartesian co-ordinates of the point P in the UAV frame; and I is a zero-mean Gaussian noise term.
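Since the camera looks along the UAV x/roll axis, xP acts as the depth in a pinhole projection. The sketch below assumes that convention (the exact sign and axis conventions are assumptions, not stated in the slides):

```python
# Hedged sketch of the pinhole camera model: image co-ordinates of a
# point (xP, yP, zP) in the UAV frame, with the camera looking along +x.
# Axis/sign conventions are assumptions for illustration.

def project(p_uav, fu, fv, noise=0.0):
    xP, yP, zP = p_uav
    u = fu * (yP / xP) + noise     # horizontal image co-ordinate
    v = fv * (zP / xP) + noise     # vertical image co-ordinate
    return u, v
```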
PREDICT MODEL
• The predict model uses a dead reckoning model and the control inputs to predict the motion of the
UAV.
• The positions of known landmarks are also tracked so let the state vector be:
• Where p = [px; py; pz] is the Cartesian position of the UAV in the world frame.
• The predict model, f is then:
UPDATE MODEL
• The update model uses new measurements of one of the landmarks to update the state of both
the UAV and that landmark.
• The observation model ‘h’ is given by:
• Where vk is the observation noise and xL, yL, zL are the co-ordinates of the landmark in the world frame, calculated via:
• where Li,xyz is the Cartesian position of the UAV in the world frame when the i-th landmark was first
seen and pk-1 is the a priori estimate of the position of the UAV in the world frame.
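Recovering the landmark's world-frame position from the inverse depth parameterisation can be sketched as below: the landmark lies at distance 1/rho from the first-seen position, along the unit ray given by the azimuth and elevation. The direction-vector convention is a common choice and an assumption, not necessarily the slides' exact one:

```python
import numpy as np

# Hedged sketch: landmark world position from inverse depth.  m(alpha,
# beta) is a unit ray; the angle convention below is an assumption.

def landmark_world(L_xyz, alpha, beta, rho):
    m = np.array([np.cos(beta) * np.cos(alpha),   # ray from azimuth alpha
                  np.cos(beta) * np.sin(alpha),   # and elevation beta
                  np.sin(beta)])
    return np.asarray(L_xyz) + m / rho            # first-seen pos + (1/rho) m
```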
CONTD.
• As the UAV moves around, the set of tracked landmarks changes. If a new landmark is detected, the state vector must be expanded and initialised with the new information.
• Adding a new landmark to the tracking is done by passing the current state and observation to an
inverse sensor model, h-1, given by:
• where lx, ly, lz are the co-ordinates of the newly detected landmark in the world frame.
• New landmarks are detected when observations cannot be associated with existing known landmarks.
SIMULATION MODEL
• The initial augmented state vector is:
• The length of the state vector is 7 + 6n where n is the number of known features, the number
of observation variables is 2 and the augmented state vector has 15 + 6n variables.
• The maximum number of landmarks considered in this simulation is 3 (i.e. n = 3) so the
maximum state and augmented state vector has 25 and 33 variables respectively.
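The vector-size bookkeeping above can be checked with a short sketch (7 UAV pose variables plus 6 per landmark, with 6 control-noise and 2 observation-noise variables in the augmented vector):

```python
# Sketch of the state-vector sizes from the slide: state = 7 + 6n,
# augmented = 15 + 6n (state + 6 control noise + 2 measurement noise).

def vector_lengths(n_landmarks):
    state = 7 + 6 * n_landmarks        # UAV pose + inverse-depth landmarks
    augmented = 15 + 6 * n_landmarks   # state + noise terms
    return state, augmented
```

For n = 3 this gives the 25 and 33 variables quoted above.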
• The control noise terms are modelled with covariances [0.002812, 0.004349, 0.002248] m·s⁻¹ and [0.01993, 0.03476, 0.03223] rad·s⁻¹ respectively.
•The latency of the algorithm depends on the number of landmarks that are visible at any
given time step.
• For the SLAM solution, however, multiple update steps need to be performed depending on how many observations were made in a single time step; in some cases, no observations of landmarks were made and so no update step was performed.
•In addition to this, because the update step updates the augmented state and covariance,
subsequent update steps require the sigma points to be re-sampled.
•In this example, the predict state and augmented state vector would only need to have 7
and 13 variables respectively (only including the UAV pose and control noise but not the
measurement noise)
•While the update state and augmented state vector would have 13 and 15 variables
respectively (including the UAV pose, landmark position and measurement noise).
• Segmenting the UKF in this way benefits the hardware/software approach more than microprocessor-based approaches since, with appropriate choices of processing elements, the time complexity of each UKF instance could be reduced even further.
IMPLEMENTATION OF HW/SW CODESIGN
• All three variants of the HW/SW codesign were implemented for a wide range of parameters in order to demonstrate the flexibility and effectiveness of the design.
• Implementations for three example applications are
presented: an expanded implementation of the nanosatellite
application, a theoretical implementation which features a
large number of observation variables, and an
implementation where alternative parameterisation schemes
for the number of PEs were explored.
ANALYSIS OVERVIEW
• For all implementations described in this chapter, synthesis and implementation runs were targeted at the Zynq-7000 XC7Z045 at a target frequency of 100 MHz.
• Resource utilisation of the device by the IP core is reported by Vivado post-implementation.
• The power analysis is done via the Xilinx Power Estimator (XPE) post-implementation; all
power estimates exclude the device static power dissipation and the processing system
power draw.
• The execution time (latency) for any hardware part is measured via behavioural simulation in
Vivado Simulator, assuming a clock frequency of 100 MHz; this assumption was validated
post-implementation for all designs.
• The entire IP core utilises synchronous logic and is on a single clock domain which makes
confirming the proper distribution of the assumed clock signals, in this case 100 MHz,
relatively straightforward.
• The execution time (latency) of any software part is measured via the ARMv7 Performance
Monitor Unit (PMU) which counts processor clock cycles between two epochs; because the
number of processor clock cycles to perform a given task can vary, each measurement was
conducted at least 10 times and the average latency measured is reported here.
EXAMPLE APPLICATION: NANOSATELLITES
• The initial numbers of PEs were chosen to be multiples of the number of augmented state
variables so that the major datapaths remained data efficient.
• If the number of PEs is not a multiple of the size of the matrix, then the last iteration of the
calculations will not have enough data to fill all the PEs making the datapath slightly
inefficient.
CONTD.
• Synthesis results for the Pipeline design can be seen below.
• The Pipeline design uses a huge amount of resources compared to the Serial and Parallel
designs.
• The Pipeline 2 PE implementation uses nearly the same amount of resources as the Parallel
10 PE implementation.
• This most likely makes the Pipeline infeasible on low-end devices, although in mid-range
devices the design could still potentially be part of a SoC for low numbers of processing
elements.
POWER CONSUMPTION
• A power consumption breakdown for the hardware IP core of the Serial and Parallel designs
is shown below :
• The power consumption of the Serial design is reasonably low, due to the area-efficiency design goals and the heavy utilisation of the FPGA clock-enable resources to disable modules that are not currently in use.
CONTD.
• A power consumption breakdown of the IP core for the Pipeline design is shown below:
• The power consumption of the Pipeline design is much larger than the Serial and Parallel
designs.
• However, the performance gains of the Pipeline design may outweigh the downsides in
power consumption, especially for a constellation.
TIMING ANALYSIS
• A breakdown of the execution time (latency) of different modules for the Serial and Parallel designs is shown below:
• The design spends a large amount of the time propagating the sigma points through the two
system models.
• For the hardware part, the majority of time is spent in the sig_gen step. The two modules in
the sig_gen step, the triangular linear equations solver and the matrix multiply-add, are both
large matrix operations which scale with the number of augmented state variables.
CONTD.
• A breakdown of the time spent in different modules for the Pipeline design is shown below:
CONTD.
• A timing diagram for the whole pipeline is shown below:
• There is additional overhead when writing the augmented state/covariance into the memory buffer at the start of each UKF instance, and when reading the current state estimate from the memory buffer at the end of each UKF instance, so the overall latency ends up being roughly 10 times the longest stage.
EXAMPLE APPLICATION: LARGE NUMBER OF OBSERVATION VARIABLES
• This application is presented to explore what happens when there are more observation variables than state variables, i.e. for Mobs > Mstate.
• Since the update step is generally the most complex
sub-module in any variant of the HW/SW codesign,
increasing the number of observation variables may
have a disproportionate impact on the
implementation of the IP core.
SYNTHESIS RESULT
• Synthesis results for the Serial design and a range of processing elements for the Parallel design are shown below:
• Resource usage is dominated by the number of processing elements rather than by changes in the number of state or observation variables.
• The additional power usage, in this example, appears to be entirely from the BRAMs.
• The update step does use more memory than either the sig_gen or the predict steps which
means that increasing the number of observation variables leads to these memories being
larger and could be why this implementation has a slightly higher power consumption.
CONTD.
• The power estimate for the Pipeline design is shown below:
• Unlike the Parallel design, the Pipeline design only shows increases in power consumption for
the 5+ processing element cases; however, the majority of increases are in the signals, logic
and DSPs rather than the BRAMs.
• The Pipeline design does not use as much memory as the Serial/Parallel designs because
many of the intermediate products need not be stored and so the increase in power
consumption may simply be from the increase in activity in the update step.
TIMING ANALYSIS
• The latency across each step for the Serial and Parallel designs is shown below:
• The IP core now spends roughly the same amount of time in the sig_gen and update steps,
likely because of the trisolve module.
• Overall, the IP core for this implementation is slightly slower than the IP core in the
nanosatellite implementation.
CONTD.
• The latency across each step for the Pipeline design is shown below:
• As with the Serial and Parallel designs, the increase in observation variables causes the cost of the update step to outweigh the reduction in augmented state variables.
LATENCY: UKF STEPS
• A closer look at the latency of each of the sig_gen,
predict and update steps is presented in this section.
• It can be seen in previous implementations that the
IP core as a whole suffers from diminishing returns as
the number of processing elements increases.
• For example, in the nanosatellite application, going from the Serial design to the 2 PE Parallel design reduces the execution time by 90 µs, but adding another 8 PEs to implement the 10 PE case only reduces the execution time by a further 60 µs.
SIGMA POINTS GENERATION
• The figure below shows a graph of the latency of the sig_gen step versus the number of
processing elements.
• The first thing to note is that the latency of the trisolve module barely changes with increasing numbers of processing elements.
• The Cholesky Decomposition cannot be effectively parallelised, so instantiating additional processing elements for this module appears to be a waste of resources in the sig_gen step.
• Conversely, the matrix multiply-add greatly benefits from the additional processing elements.
• The trisolve module therefore remains the main bottleneck in the sig_gen datapath regardless of how many processing elements are instantiated.
PREDICT STEP
• The figure shows a graph of the latency of the predict step versus the number of processing
elements.
•It can be seen that none of the modules in the predict datapath
disproportionately cause any congestion; furthermore, all modules
appear to benefit from additional processing elements.
•For inefficient processing element numbers, i.e. a non-multiple of
the state variables, additional processing elements actually slightly
increase the latency.
•However, the total latency of the predict step is much lower than
the other two steps.
•Even though the predict step benefits from additional processing
elements, it may not be necessary to use them since the other steps
take much longer in terms of overall latency anyway.
UPDATE STEP
• The figure below shows a graph of the latency of the update step versus the number of processing elements.
•As with the predict step, additional processing elements reduce the
latency of every module in the update step.
•Unlike the sig_gen step, the trisolve module here actually does
decrease in latency when additional processing elements are used.
•This is most likely due to the fact that the trisolve module here is
used for the matrix right 'divide'; i.e. the Cholesky Decomposition
followed by forward elimination then back substitution.
•Although the Cholesky Decomposition cannot be effectively
parallelised, the forward elimination and back substitution can be,
meaning those operations benefit from additional processing
elements.
LATENCY: AUGMENTED STATE VARIABLES
• The increase in augmented state variables, state variables and observation variables have
different impacts on each of the steps for the IP core.
• In this section, an implementation exploring the effect these variables have on the latency of
the design is examined.
• Consider an application with an even split between the number of state and observation variables and perfect system models (i.e. Mstate = M/2, Mobs = M/2).
• A graph of the latency versus the number of augmented state variables for the Serial design is
shown below:
CONTD.
• The Cholesky Decomposition in both steps, as well as the large matrix multiplication for sigma
points generation, dominate the execution time, especially as the state vector gets larger.
• A graph of the latency of each step versus the number of augmented state variables for the 5
PE case is shown below:
CONTD.
• The number of augmented state variables was capped at a much lower level compared to the previous
image in order to show some of the small effects of the processing elements more clearly.
• In both cases, and as seen in the previous implementations, the sig_gen step takes the longest out of the
three steps.
• The increase in the sig_gen step's latency also rises faster than the other two steps.
• A graph of the latency of each step versus the number of augmented state variables for 10 PE is shown
below:
•Small dips in the overall latency can be seen in both cases.
•This is because the parallelisation scheme of many of the modules discussed previously is
most efficient when the number of processing elements is some multiple of the size of the
matrix being calculated.
• For example, consider the matrix multiply-add: if the row size of the matrix to be multiplied is 10, and 10 processing elements are used, then the calculation requires only one iteration, as each processing element calculates one row.
•If the row size of the matrix to be multiplied is 11-20, then the number of iterations
necessary is 2.
•Thus, for matrices of size 11-19, the module is now somewhat inefficient, since not all
processing elements are used every iteration.
•In the ‘Latency vs. augmented state variables for the Parallel design (5 PE)’ figure, small
dips can be seen at every multiple of 5 for the total latency and the sig_gen curves. This is
likely because of the large matrix multiply-add during the sig_gen step.
•In the ‘Latency vs. augmented state variables for the Parallel design (10 PE)’ figure,
although there is a very obvious dip at M = 20, after that the curves are more or less smooth.
•As the augmented state vector grows much larger than the number of processing elements,
the impact of the parallelisation becomes smaller.
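The iteration-count argument behind these latency dips can be sketched directly; the helper names are illustrative:

```python
import math

# Sketch: with P processing elements, a row dimension R takes
# ceil(R / P) iterations; efficiency drops when R is not a multiple of P.

def iterations(R, P):
    return math.ceil(R / P)

def pe_efficiency(R, P):
    # fraction of PE slots doing useful work across all iterations
    return R / (iterations(R, P) * P)
```

For P = 10, row sizes 10 and 20 give one and two fully-utilised iterations, while row size 11 also needs two iterations but leaves nearly half the PE slots idle, which matches the dips at multiples of the PE count.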
• The figure below shows the 10 processing element case for much larger augmented state vectors, where the complexity, at O(M^2.5), is not quite as poor as the Serial design's.
•The Figure below shows the latency for the 20 processing element case with two power series
fits for augmented state variables lower and higher than the number of processing elements.
•There is an increase in complexity as the augmented state vector passes the 20 mark
and at these low numbers of augmented state variables (compared to the number of
processing elements), the complexity itself is even less than quadratic.