This summer we were given the task of programming a Mars lander simulation and autopilot. We received
a C++ program that would call our functions, render the Mars lander, and visualize
its speed and altitude.
Source of the starter program: Andrew Gee and Gabor Csanyi
I decided to extend the exercise by training an autopilot using NeuroEvolution of Augmenting
Topologies (NEAT). I used the Python library neat-python to handle the training, so I only had
to provide a fitness function. That was done by embedding the Python interpreter in the C++
program and splitting the work into two threads: one thread runs the neural networks and evolves
them, while the other runs the Mars lander simulation whenever a fitness needs to be calculated.
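The Python side only has to hand neat-python a fitness function. Below is a minimal sketch of what that looks like, assuming a hypothetical lander module exposed by the embedding C++ program, whose run_simulation call hands a network to the simulation thread and blocks until a fitness comes back (the module and its function name are my illustration, not the actual interface):

```python
import neat
import lander   # hypothetical module registered by the embedding C++ program

def eval_genomes(genomes, config):
    # Evaluate every genome in the generation by asking the C++ side
    # to fly the lander with the corresponding network.
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        genome.fitness = lander.run_simulation(net)   # blocks on the simulation thread

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     "neat_config.txt")
population = neat.Population(config)
winner = population.run(eval_genomes, 300)   # evolve for up to 300 generations
```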
What is so NEAT about it?
Traditionally, neural networks are trained using gradient descent, but that requires
a differentiable loss function. That is rather difficult in this scenario, since we can only
tell how well the autopilot performed, not what change would make it better. A solution would
be Q-learning, but that is often fiddly and hard to get working.
An alternative is neuro-evolution, where a population of neural networks is evaluated, each
network is awarded a fitness, and the population for the next generation is created through
selection, crossover and mutation. NEAT extends this by also allowing the topology of the
networks to change. All we need to provide are the inputs and outputs of the network and
a fitness function.
Inputs/Outputs
We need to provide the network with all the information it might need, and get back a target
orientation, a thrust level and a signal to deploy the parachute. It is quite important
to think about the form in which we present these quantities. Using Cartesian coordinates is
a rather bad idea. It is better to use polar-style coordinates built from an orthonormal basis:
v1, the direction of the position vector; v2, the direction of the velocity component
perpendicular to the position vector; v3, the vector perpendicular to both position and velocity.
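As a sketch of how such a basis might be constructed (my own illustration, not part of the provided code):

```python
import numpy as np

def trajectory_basis(position, velocity):
    # v1: unit vector along the position (the "up" direction).
    v1 = position / np.linalg.norm(position)
    # v2: unit vector along the part of the velocity perpendicular to v1.
    # (A purely vertical trajectory would need a fallback choice here.)
    v_perp = velocity - np.dot(velocity, v1) * v1
    v2 = v_perp / np.linalg.norm(v_perp)
    # v3: perpendicular to both, completing the right-handed basis.
    v3 = np.cross(v1, v2)
    return v1, v2, v3
```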
As inputs we give it the position and velocity, but we can skip the components that are always
zero, so we pass the v1 component of position and the v1 and v2 components of velocity. Since
the weights in the neural network have a limited range and accuracy (especially under
neuro-evolution, because it is hard for evolution to work out which weights should be small and
which should be large), we need to pass each input several times at different scales, so that
at least one copy falls roughly into the 0.05-20 range.
We do this by providing x, x/sqrt(max_x), and x/max_x for each input x.
For the outputs we just need to realign the ranges, and for target_orientation normalize
the resulting vector.
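Here is a rough sketch of both steps. The scale constants, the output layout (three orientation components, a throttle, a parachute signal) and the assumed (-1, 1) output range are my assumptions rather than the exact values used:

```python
import math
import numpy as np

MAX_ALTITUDE = 1.0e7   # m, illustrative upper bound for the scenario
MAX_SPEED = 1.0e4      # m/s, illustrative

def scaled(x, max_x):
    # Present each raw input at three scales so that at least one copy
    # falls roughly into the 0.05-20 range the weights handle well.
    return [x, x / math.sqrt(max_x), x / max_x]

def build_inputs(pos_v1, vel_v1, vel_v2):
    return scaled(pos_v1, MAX_ALTITUDE) + scaled(vel_v1, MAX_SPEED) + scaled(vel_v2, MAX_SPEED)

def decode_outputs(raw, v1, v2, v3):
    # Target orientation: combine the basis vectors and normalize.
    direction = raw[0] * v1 + raw[1] * v2 + raw[2] * v3
    target_orientation = direction / np.linalg.norm(direction)
    # Throttle: realign an assumed (-1, 1) output onto (0, 1).
    throttle = min(max((raw[3] + 1.0) / 2.0, 0.0), 1.0)
    deploy_parachute = raw[4] > 0.0
    return target_orientation, throttle, deploy_parachute
```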
Fitness function
The initial idea for the fitness function was to take the negative of the lander's remaining
energy. For simplicity this was changed to the negative difference between the lander's
mechanical energy and that of a landed lander, used whenever it crashed or timed out in orbit.
If it landed, it was instead rated by how much fuel remained. This way the autopilot is motivated
to deorbit (since the atmosphere will absorb a lot of the lander's energy) and to use as much
fuel as it wants to get closer to landing, while successful autopilots are motivated to conserve
as much fuel as possible.
The problem with the negative energy difference was that the energy spans too large a range:
when fitnesses were normalized onto a 0-1 scale, most genomes ended up rated around 0.9 (i.e.
as nearly landed). To provide more differentiation, the function was changed to the negative
logarithm of the remaining energy (which is safe, since the minimal energy, the amount that can
still be absorbed at landing, is greater than 1). That spaced the genomes out reasonably. Since
this gave an average rating of about -5, positive fitnesses were scaled to a similar magnitude.
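A sketch of this rating scheme for a single run; the constant is illustrative, not the exact value used:

```python
import math

FUEL_WEIGHT = 5.0   # illustrative: keeps landed fitnesses at a magnitude similar to the ~ -5 average

def scenario_fitness(landed, energy_above_landed, fuel_fraction):
    # energy_above_landed: mechanical energy relative to a landed lander;
    # by construction it exceeds 1 even for a just-acceptable touchdown.
    # fuel_fraction: fraction of fuel remaining, 0..1.
    if landed:
        return FUEL_WEIGHT * fuel_fraction
    return -math.log(energy_above_landed)   # crashed, or timed out still in orbit
```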
Then, if we choose to run multiple subscenarios (i.e. incorporate into the fitness the ability
to deal with different scenarios), we need to make sure that positive ratings cannot overpower
negative ones: we want an autopilot that handles any scenario, rather than one that handles
some scenarios perfectly.
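One possible way to combine subscenario scores with that constraint in mind (a sketch, not necessarily the scheme used in training):

```python
def overall_fitness(subscenario_scores):
    # Cap each positive score so that one well-handled scenario cannot
    # outweigh a badly failed one; negative scores pass through unchanged.
    return sum(min(score, 1.0) for score in subscenario_scores)
```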
Training strategy
The first idea for training the networks was to start with an easy scenario (a descent
from 1 km) and, whenever one network in the population solves it, move on to the next scenario.
The altitudes would keep increasing and, once orbital altitudes were reached, the horizontal
velocity would gradually be increased until the lander started from a full orbit.
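The progression might be laid out roughly like this; the altitudes and speeds are placeholders, not the actual schedule:

```python
def curriculum():
    scenarios = []
    # Vertical descents from increasing altitudes.
    for altitude in (1e3, 1e4, 1e5, 1e6):
        scenarios.append({"altitude": altitude, "horizontal_speed": 0.0})
    # At orbital altitude, ramp the horizontal speed up towards a full orbit.
    orbital_speed = 3.0e3   # placeholder for the circular-orbit speed at that altitude
    for fraction in (0.25, 0.5, 0.75, 1.0):
        scenarios.append({"altitude": 1e6, "horizontal_speed": fraction * orbital_speed})
    return scenarios
```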
This worked reasonably well until the lander reached orbit; at that point it had problems
learning to deorbit (a completely new skill rather than an improvement of an old one). A way
to overcome this turned out to be to train a population in which a majority could land from
an already-deorbited scenario, and then repeatedly train it on the orbital scenario for a few
generations to see whether some random change leads to a solution, trying again from the saved
population if not.
The next problem turned out to be that networks trained to land well from orbit didn't work
so well when landing from a stationary start at high altitude. A way to solve this is
to introduce multiple subscenarios and award fitness for the overall result across them.
One more thing to watch out for is that NEAT purges species for stagnation when their fitness
doesn't improve for a set number of generations. Since we have scenarios of increasing
difficulty, this would often lead to the species that did best in the previous scenario getting
purged. A way to accommodate this is to offset all fitnesses by a large enough amount each time
the scenario changes, so that the switch can never trigger stagnation.
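A sketch of that offsetting trick, wrapped so it plugs into neat-python's Population.run(); the step size is an assumption and only has to exceed the spread of fitnesses:

```python
SCENARIO_OFFSET_STEP = 100.0   # assumed; must be larger than the fitness spread

def make_evaluator(scenario_index, fitness_fn):
    # Returns an eval_genomes function for Population.run(). The offset grows
    # with every scenario switch, so a species' best-ever fitness always rises
    # across the change and stagnation cannot be triggered by it.
    offset = scenario_index * SCENARIO_OFFSET_STEP
    def eval_genomes(genomes, config):
        for genome_id, genome in genomes:
            genome.fitness = offset + fitness_fn(genome, config)
    return eval_genomes
```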
Conclusion and ideas for improvement
It took 2 months (~1500 hours) of training to reach the point where a single genome could land
in diverse situations, though it still wasn't as good as a programmatic autopilot I wrote.
My guess is that it would need one more month to surpass that. If I had not been lazy and had
made a pooled evaluator (parallelizing the fitness evaluation), it would be trainable in about
4 days on a 16-CPU machine, which is quite acceptable.
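neat-python ships such a pooled evaluator as neat.ParallelEvaluator. A sketch of how it would be wired up, with run_lander_simulation standing in for a hypothetical hook into the C++ simulation (making the embedded simulation callable from worker processes is the part I never did):

```python
import neat

def eval_single_genome(genome, config):
    net = neat.nn.FeedForwardNetwork.create(genome, config)
    return run_lander_simulation(net)   # hypothetical hook into the simulation

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     "neat_config.txt")
population = neat.Population(config)

pe = neat.ParallelEvaluator(16, eval_single_genome)   # 16 worker processes
winner = population.run(pe.evaluate, 300)
```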
(Figure: the trained network; it contains many useless nodes.)
During the training I tried various strategies. One was rotating between scenarios, but that had
very negative results; it was probably too unstable because only a single scenario was used at
a time. A way to solve that would be to rotate sets of ~3 subscenarios of similar composition
(a vertical descent, a deorbiting scenario, a random scenario). Another way might be to add
momentum to the fitness, so the fitness given to a genome would be 1/3 from the current scenario,
2/9 from the previous one, 4/27 from the one before, and so on. A newly created genome would
inherit the fitness of its parents, possibly with a slight mutation. I didn't manage to test this
idea, but it would be interesting to see whether it could be used to speed up training.
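A sketch of that momentum idea (untested, with the weights as described above):

```python
DECAY = 2.0 / 3.0   # gives weights 1/3, 2/9, 4/27, ... to the current and past scenarios

def momentum_fitness(carried_over, current_scenario_fitness):
    # carried_over: the momentum-weighted fitness inherited from earlier
    # scenarios (or from the genome's parents when it is newly created).
    return DECAY * carried_over + (1.0 - DECAY) * current_scenario_fitness
```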
Since the hardest part for the autopilot was learning how to deorbit (it is rather a step
change), I introduced the periareion altitude (divided by the atmosphere height) as one of the
inputs. This didn't seem to help: the networks instead found a different way of deorbiting
that didn't use the periareion input.
To reach very good results, it might help to start with, say, 8 small populations of around
100 genomes each. When they all reach a specific point, merge the populations pairwise and
continue until the next specific point, and so on. I expect this would lead to greater diversity
than going through the whole process with one population of 800. The reason is that when scenario
difficulty increases, it can be hard for some genomes to keep progressing once another genome
has already solved the previous scenario and the next difficulty step is large.