NEAT Mars Lander

24 August 2019
This summer we were given the task of programming a Mars lander simulation and autopilot. We received a C++ program that would call our functions, render the Mars lander, and visualize its speed and altitude.
Source: Andrew Gee and Gabor Csanyi
I decided to extend the exercise by training an autopilot using NeuroEvolution of Augmenting Topologies (NEAT). I used the Python library neat-python to handle the training, so I only had to provide a fitness function. I did that by embedding Python in the C++ program and splitting it into two threads: one thread runs the neural networks and evolves them, while the other runs the Mars lander simulation whenever a fitness needs to be calculated.
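For context, this is roughly how a neat-python training loop is wired up. It is only a sketch: neat_config.txt and run_lander_simulation are placeholder names for my config file and for the call that hands a network over to the C++ simulation thread.

```python
import neat

def eval_genomes(genomes, config):
    # evaluate each genome by flying the lander with its network
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        genome.fitness = run_lander_simulation(net)  # placeholder: runs the simulation and returns a fitness

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     'neat_config.txt')
population = neat.Population(config)
winner = population.run(eval_genomes, 300)  # evolve for up to 300 generations
```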

What is so NEAT about it?

Traditionally, neural networks are trained using gradient descent, but that requires a differentiable loss function. That is rather difficult in this scenario since we can only tell how well the autopilot performed, not what change would make it better. One solution would be Q-learning, but that is often fiddly and hard to get working.

An alternative is neuro-evolution, where a population of neural networks is tested, each is awarded a fitness, and the population for the next generation is created by evolution. NEAT extends this by also allowing the topology of the neural network to change. We only need to select the inputs and outputs of the network and a fitness function.

Inputs/Outputs

We need to provide the network with all the information it might need and get back a target orientation, a thrust level, and a signal to deploy the parachute. It is quite important to think about the form in which we pass these. Using Cartesian coordinates is a rather bad idea; it is better to use a polar-style basis (v1 - the direction of the position vector, v2 - the direction of the velocity component perpendicular to the position vector, v3 - the vector perpendicular to both position and velocity).
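As a rough sketch (assuming NumPy arrays for the Cartesian state; lander_basis is just an illustrative name), the basis can be built like this:

```python
import numpy as np

def lander_basis(position, velocity):
    # v1: radial direction, along the position vector
    v1 = position / np.linalg.norm(position)
    # v3: normal to the plane of motion
    v3 = np.cross(position, velocity)
    if np.linalg.norm(v3) < 1e-9:
        # purely radial motion (e.g. a vertical descent): pick any perpendicular
        v3 = np.cross(v1, np.array([0.0, 0.0, 1.0]))
        if np.linalg.norm(v3) < 1e-9:
            v3 = np.cross(v1, np.array([0.0, 1.0, 0.0]))
    v3 /= np.linalg.norm(v3)
    # v2: in-plane, perpendicular to v1 (direction of the tangential velocity)
    v2 = np.cross(v3, v1)
    return v1, v2, v3
```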

As inputs we give it position and velocity, but we can skip the components that are always zero, so we pass position-v1, velocity-v1 and velocity-v2. Since the weights in the neural network have a limited range and accuracy (especially with neuro-evolution, because it is hard for evolution to work out which weights should be small and which should be large), we need to pass each of the inputs multiple times at different scales, so that at least one copy lands roughly in the 0.05-20 range. We do this by providing x, x/sqrt(max_x) and x/max_x for each input x.
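In code this is just a small helper (a sketch; scaled_inputs and max_x are illustrative names):

```python
def scaled_inputs(x, max_x):
    # pass the same quantity at three scales so that at least one version
    # falls into a comfortable range for the network's weights
    return [x, x / max_x ** 0.5, x / max_x]
```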

For the outputs we just need to rescale the ranges, and for target_orientation also normalize the vector.
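The exact decoding isn't spelled out above, so the following is only a guess at one reasonable scheme, assuming neat-python's default sigmoid activation (outputs roughly in 0-1):

```python
import numpy as np

def decode_outputs(raw_outputs):
    # hypothetical decoding: three components for orientation, one for thrust,
    # one for the parachute signal
    o1, o2, o3, thrust, parachute = raw_outputs
    target_orientation = np.array([o1, o2, o3]) * 2.0 - 1.0   # shift into [-1, 1]
    target_orientation /= np.linalg.norm(target_orientation)  # make it a unit vector
    throttle = min(max(thrust, 0.0), 1.0)                      # clamp to the valid thrust range
    deploy_parachute = parachute > 0.5                          # treat as a boolean signal
    return target_orientation, throttle, deploy_parachute
```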

Fitness function

The initial idea for the fitness function was to take the negative of the lander's remaining energy. For simplicity this was changed to the negative difference between the lander's mechanical energy and that of a landed lander, applied whenever it crashed or timed out in orbit. If it landed, it would instead be rated by how much fuel remained. This way the autopilot is motivated to deorbit (since the atmosphere will absorb a lot of the lander's energy) and to use up as much fuel as it needs to get closer to landing, while successful autopilots are motivated to conserve as much fuel as possible.

The problem with taking the negative energy difference was that the energies span too wide a range: when fitnesses were normalized onto a 0-1 scale, most genomes were rated around 0.9 (as if nearly landed). To provide more differentiation, the function was changed to the negative logarithm of the remaining energy difference (which was safe, since the minimal energy difference - the energy that can be absorbed at landing - is more than 1). That spaced the genomes out reasonably. Since this gave an average rating of about -5, positive fitnesses were set to be of similar magnitude.
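Putting the pieces together, the fitness looks roughly like this (a sketch; the outcome fields and the factor of 5 on the fuel reward are my own illustrative choices based on the "similar magnitude" remark):

```python
import math

def fitness(outcome):
    if outcome.landed:
        # successful landings ranked by remaining fuel, scaled to be of
        # similar magnitude to the ~-5 average of the failed attempts
        return 5.0 * outcome.remaining_fuel_fraction
    # crashed or timed out: penalize by the log of the mechanical energy
    # above that of a landed lander (always > 1, so the log is safe)
    excess_energy = outcome.mechanical_energy - outcome.landed_energy
    return -math.log(excess_energy)
```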

If we then choose to run multiple subscenarios (i.e. incorporate into the fitness the ability to deal with different scenarios), we need to make sure that positive ratings don't overpower negative ones: we want an autopilot that handles any scenario rather than one that handles some perfectly.

Training strategy

The first idea for training the networks was to start with an easy scenario (a descent from 1 km) and, whenever one network in the population solved it, move on to the next scenario. The altitudes would keep increasing, and once they reached orbital altitudes, the horizontal velocity would gradually be increased until the lander started in orbit.

This worked reasonably well until the lander started in orbit; at that point it had problems learning to deorbit, as that was a completely new skill rather than an improvement of an old one. A way to overcome this turned out to be to train a population in which a majority could land from an already-deorbited scenario, then repeatedly train it for a few generations to see whether some random change leads to a solution, and if not, try again.

The next problem was that networks trained to land from orbit didn't work so well when landing from a stationary start at high altitude. A way to solve that is to introduce multiple subscenarios and award fitness for the overall result across them.

One more thing to watch out for is that neuro-evolution purges species for stagnation when their fitness doesn't improve for a set number of generations. Since we have scenarios of increasing difficulty, this would often lead to the species that did best in the previous scenario getting purged. A way to compensate is to offset the fitness by enough to make stagnation impossible each time the scenario is changed, as sketched below.
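A minimal way to do the offsetting (a sketch; the margin and the bookkeeping are illustrative):

```python
fitness_offset = 0.0

def on_scenario_change(best_fitness_so_far, margin=1.0):
    # bump every later fitness by more than the best seen so far, so that
    # switching to a harder scenario can never look like stagnation
    global fitness_offset
    fitness_offset += abs(best_fitness_so_far) + margin

def offset_fitness(raw_fitness):
    return raw_fitness + fitness_offset
```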

Conclusion and ideas for improvement

It took 2 months (~1500 hours) to train to the point where a single genome could land in diverse situations, though it still wasn't as good as a programmatic autopilot I wrote. My guess is that it would need one more month to surpass it. If I had not been lazy and had made a pooled evaluator (parallelizing the fitness evaluation), it would be trainable in about 4 days on a 16-CPU machine, which is quite acceptable.
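For reference, neat-python's built-in ParallelEvaluator does exactly this in a pure-Python setup; eval_single_genome and run_lander_simulation are placeholders for the per-genome simulation call, and hooking it up to the embedded C++ simulation would need extra plumbing.

```python
import neat

def eval_single_genome(genome, config):
    net = neat.nn.FeedForwardNetwork.create(genome, config)
    return run_lander_simulation(net)  # placeholder for the simulation call

pe = neat.ParallelEvaluator(16, eval_single_genome)  # 16 worker processes
winner = population.run(pe.evaluate, 300)
```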
The trained network; it contains many useless nodes.
During the training I tried various strategies. One was rotating scenarios, but that had very negative results; it was probably too unstable because I only used a single scenario at a time. A way to solve that would be to use ~3 subscenarios of similar combinations (a vertical descent, a deorbit, a random scenario). Another might be to add momentum to the fitness, so the fitness given to a genome would be 1/3 from the current scenario, 2/9 from the previous one, 4/27 from the one before, and so on (see the sketch below). When a new genome is created, it would inherit the fitness of its parents, possibly with a slight mutation. I didn't manage to test this idea, but it would be interesting to see whether it could speed up training.
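The momentum idea would amount to an exponential moving average over scenarios (untested, as noted; a sketch with illustrative names):

```python
def momentum_fitness(current_fitness, previous_momentum, keep=2/3):
    # weights work out to 1/3 for the current scenario, 2/9 for the previous
    # one, 4/27 for the one before that, and so on
    return (1 - keep) * current_fitness + keep * previous_momentum
```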

Since the hardest part for the autopilot was learning how to deorbit (it is rather a step change), I introduced the periareion altitude (divided by the atmosphere height) as one of the inputs. This didn't seem to help: the networks instead found a different way of deorbiting that didn't use the periareion input.

To reach very good results, it might help to start with, say, 8 small populations of 100 each. When they all reach a specific point, join the populations pairwise and continue until the next milestone, and so on. I expect this would lead to greater diversity than going through the whole process with one population of 800, because when the scenario difficulty increases, it can be hard for some genomes to keep up once another genome has solved the previous scenario and the difficulty step is now large.