I am currently wrapping up my implementations of all six OpenAI Spinning Up algorithms. Here are some of the challenges I faced along the way and how I approached them.

Challenge 1: extensive literature

Learning reinforcement learning theory can be overwhelming. Not only are there a large number of key papers, but the ones listed in this link are only a representative and somewhat outdated sample.

One striking aspect of Spinning Up is that while it presents six key algorithms, these algorithms represent only a small subset of the reinforcement learning landscape. The focus is on policy-gradient and actor-critic algorithms: Vanilla Policy Gradient (VPG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), Twin-Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC).

A Taxonomy of RL Algorithms from Spinning Up

Spinning Up only touches on algorithms under the Policy Optimization / Q-Learning branch of this tree, and the tree itself is far from complete. A brief glance at the lecture topics in Berkeley’s Deep Reinforcement Learning Course quickly shows how many topics are not included in Spinning Up, for example pure value-function methods, planning and model-based learning, offline reinforcement learning, inverse reinforcement learning, and meta-learning / transfer learning. While I have watched these very good lectures, they can only act as an introduction, with each lecture referencing a handful of research papers for further reading. I have resolved to learn as much of this material as possible, with the recognition that it will take time to learn and commit to long-term memory.

Given that I have only scratched the surface, my solution to the sheer size of the literature is one that I have used in other areas: a learning approach similar to iterative deepening. Rather than trying to master each topic in a fully depth-first manner, I’ll traverse the main topics (planning and model-based learning, offline RL, IRL, meta-learning, etc.), re-watch the lectures, and pick the one paper in each area that I am most interested in. While I have already done a fair amount of reading about generative modeling, I plan to continue with the same approach in that area, and also with the readings from the AI Alignment Curriculum. There is so much to read that I feel I must truncate the search depth for each topic, under the assumption that I will specialize in a small number of niches in the future. One other thing I like about this approach is that it encourages forming a primitive mental map of the concept space before zooming in too far on one area.
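To make the analogy concrete, here is a toy sketch of iterative deepening over a made-up topic tree (the structure and names are purely illustrative): every branch gets a shallow pass before any branch gets a deeper one.

import sys

topics = {
    "model-based RL": {"world models": {}, "planning": {}},
    "offline RL": {"conservative methods": {}},
    "inverse RL": {},
    "meta-learning": {},
}

def iterative_deepening(tree, max_depth):
    # Revisit the whole tree once per depth limit: broad passes first, deeper passes
    # later, which is roughly how I am revisiting lectures before picking papers.
    for limit in range(max_depth + 1):
        def dfs(node, depth):
            if depth > limit:
                return
            for name, children in node.items():
                yield (depth, name)
                yield from dfs(children, depth + 1)
        yield from dfs(tree, 0)

for depth, topic in iterative_deepening(topics, max_depth=1):
    sys.stdout.write("  " * depth + topic + "\n")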

Challenge 2: translating papers to code

While Spinning Up has for the most part done a great job of clarifying algorithmic details that some research papers gloss over, there were times when not even the Spinning Up pseudocode served as an unambiguous description of the algorithms. Fortunately, most of the time authors also publish reference implementations of their algorithms, so it is possible to suss out the details missing from the paper.

A fair amount of detail needed for a working PPO implementation is left out of the paper and pseudocode. To illustrate this point, see this great blog post: The 37 Implementation Details of Proximal Policy Optimization. I won’t elaborate on all of the missing details because they are already so well described in that post. When my implementation of PPO was not behaving as expected (especially on Atari), I systematically went through the blog post and the code in OpenAI Baselines to track down the discrepancies.
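To give a flavor of the kind of detail that matters, here is a minimal PyTorch sketch (not the Baselines code; the function name and defaults are mine) of the clipped surrogate objective combined with per-minibatch advantage normalization, which, as I recall, is one of the details discussed in that post:

import torch

def ppo_policy_loss(logp_new, logp_old, adv, clip_ratio=0.2):
    # Detail: normalize advantages within each minibatch before computing the loss.
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    # Probability ratio between the new and old policy for the sampled actions.
    ratio = torch.exp(logp_new - logp_old)
    # PPO-Clip surrogate: take the pessimistic minimum of the clipped and unclipped terms.
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    return -torch.min(ratio * adv, clipped).mean()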

As another example, the descriptions of off-policy algorithms left out a crucial detail that took considerable effort to discover. The description of DDPG (see the equations for details) uses the variable $d$ to indicate whether or not an episode has terminated. This description is correct on the whole, but not fully qualified. In practice, a distinction must be drawn between episode completion due to termination, where done should equal True (the agent died and the environment went to an absorbing state), and completion because the episode was truncated after exceeding a user-defined maximum number of steps, where done should equal False. Originally I was also setting done=True in the truncation case, which severely degraded DDPG performance. My intuition for why this caused so much trouble: the Bellman backup target has the form $y = r + \gamma (1 - d)\, Q_{\text{targ}}(s', \mu_{\text{targ}}(s'))$, so with $d = 1$ the bootstrap term is zeroed out and truncated transitions are treated as having no future value, even when the agent was correctly predicting high future rewards. This led to the agent learning a poor value function, and consequently very poor policies, since in actor-critic architectures the policy gradient is based on the value function. After fixing this issue, my DDPG implementation worked like a charm, very closely matching the Spinning Up benchmarks.
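Newer Gymnasium versions return separate terminated and truncated flags, but with the older single done flag the episode length has to be tracked by hand. Here is a minimal sketch of the correction (the names are my own, not Spinning Up’s):

def corrected_done(env_done: bool, ep_len: int, max_ep_len: int) -> float:
    """Return the d used in the Bellman target: 1.0 only for true termination.

    If the episode ended only because it hit max_ep_len, return 0.0 so that the
    target y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s')) still bootstraps
    from the next state.
    """
    return 0.0 if ep_len == max_ep_len else float(env_done)


# A truncated episode keeps bootstrapping; a real termination does not.
assert corrected_done(True, ep_len=1000, max_ep_len=1000) == 0.0
assert corrected_done(True, ep_len=412, max_ep_len=1000) == 1.0

The same bookkeeping applies to the other off-policy algorithms that store transitions in a replay buffer.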

My main takeaways from these experiences pull in opposite directions. On one hand, it is definitely valuable to be able to write down code that corresponds to a somewhat ambiguous mathematical description, and I got better at this as I progressed through the algorithms. On the other hand, sometimes very important details are simply left out, so instead of getting stuck for long periods of time, I think it would be better to have a lower threshold for consulting reference implementations than I held myself to throughout this project.

Challenge 3: debugging code after implementation

Debugging reinforcement learning algorithms is definitely a challenge. The code is of above-average complexity, but I have worked with significantly more complex production code in the domain of systems and distributed systems. Part of the difference is that Spinning Up is a deliberately simplified learning environment, whereas production code must handle every possible use case and failure mode.

However, one distinguishing feature of implementing ML / RL code is that it is highly numerical. So while production systems might have more cyclomatic complexity, numerical programs are complex in a different way. Numerical computing libraries like PyTorch, TensorFlow, and NumPy do a great job of abstracting over the most complex and fraught parts of numerical implementations, so users are spared from implementing things like matrix multiplication or autodifferentiation, but as a user of these libraries it is still important to understand the underlying complexity in order to use the abstractions properly. For example, it is important to be sure that input and output tensors have the expected shape, especially when broadcasting can make mismatched shapes silently compatible. It is important to understand exactly how autodifferentiation behaves so that gradients aren’t accidentally propagated and applied by optimizers when they shouldn’t be. As great as these libraries are, they don’t try to protect the user from subtle numerical issues like underflow and overflow. So even in cases where the cyclomatic complexity is lower, I have found implementing and debugging RL algorithms to be considerably harder than non-numerical algorithms. These factors, combined with stochasticity, make debugging RL algorithms a real challenge. I often found myself searching large tensor haystacks for anomalously low or high values to understand what was going wrong in my code. For example, see this commit removing debugging statements.
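To make that concrete, here is a small PyTorch sketch, with made-up shapes and values, of two of the defensive habits I mean: checking tensor shapes before broadcasting can hide a mismatch, and computing Bellman targets under torch.no_grad() so gradients are never propagated through the target network:

import torch

batch = 64
q = torch.randn(batch)            # critic output for (s, a), shape (64,)
q_next = torch.randn(batch, 1)    # target-critic output for (s', a'), accidentally (64, 1)

# Broadcasting would silently turn q - target into a (64, 64) tensor and the loss
# would still "work", so fix and check shapes explicitly before mixing tensors.
q_next = q_next.squeeze(-1)
assert q_next.shape == q.shape, (q_next.shape, q.shape)

rew = torch.randn(batch)
done = torch.zeros(batch)
gamma = 0.99

# Compute the Bellman target without building a graph, so the optimizer can never
# accidentally push gradients through the target network's output.
with torch.no_grad():
    target = rew + gamma * (1 - done) * q_next

critic_loss = ((q - target) ** 2).mean()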

I haven’t made an effort to implement parallelism in my algorithms, which is one reason for the lower level of complexity. I have implemented highly parallel and concurrent programs in my day jobs, so I decided not to focus on this aspect for this project. In some ways this was a good choice: by keeping my programs simple I avoided a whole class of bugs that would have made debugging even more difficult. On the other hand, training is a lot slower, which ends up being painful while wrapping up and running benchmarks. For example, running SAC with three random seeds for 3 million timesteps each on 6 MuJoCo tasks takes about 7 days. In some ways this doesn’t really matter: I can start a training run and focus on other things while it runs. But when I realized about 5 days in that I had chosen an incorrect parameter, I started to regret not implementing parallelism. In the end I still think it was best to avoid the extra time spent on parallelism, because I am confident that I can handle that part in future production implementations, and waiting an extra week while I do other things doesn’t actually matter because I have no deadlines.

When developing on a single machine before large-scale deployment, debugging RL algorithms often requires a different approach than debugging non-numerical code, where print statements are often effective. (Some like interactive debuggers, but I generally prefer good old-fashioned print statements in the majority of cases.) For RL algorithms, print statements alone are often not sufficient; the most effective way to debug them is usually to inspect graphs of key metrics. For instance, is the average episode return per epoch as expected? Comparing returns to the Spinning Up benchmarks or to the benchmarks published in the original papers was a good way to discern whether my implementation was in the right ballpark. Many other metrics ended up being useful. While each algorithm usually has a few key metrics specific to its implementation, others tend to be useful across all algorithms, for example actor and critic loss and average Q or V value, among others. After some practice, a brief glance at these metrics gives a good idea of whether things are functioning as expected or something is wrong. That said, for some classes of bugs, more algorithm-specific metrics may be needed to root-cause the issue.
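As an illustration of the kind of per-epoch readout I relied on, here is a minimal sketch, assuming episode returns, Q-value estimates, and losses are collected into lists during the epoch (the names below are mine, not the Spinning Up logger’s API):

import numpy as np

def log_epoch(epoch, ep_returns, q_values, pi_losses, q_losses):
    stats = {
        "Epoch": epoch,
        "AverageEpRet": np.mean(ep_returns),
        "MaxEpRet": np.max(ep_returns),
        "AverageQ": np.mean(q_values),
        "LossPi": np.mean(pi_losses),
        "LossQ": np.mean(q_losses),
    }
    # A quick glance at these numbers, or a plot of them over epochs, is usually enough
    # to tell whether training is on track; e.g. AverageQ blowing up often points at a
    # broken target computation or a done-flag bug.
    print("\t".join(f"{k}: {v:.3f}" for k, v in stats.items()))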

This metric-driven approach to debugging is not unique to RL. Debugging large-scale distributed systems requires a similar approach, but it can often be considerably harder because of the sheer number of kinds of faults, and because broad knowledge is required even to understand what a metric coming from a particular subsystem means. Since the metrics generally come from a number of heterogeneous systems, often owned by different teams, external vendors, or the open-source community, they frequently have different levels of observability built in. Often the metric that is needed happens to be disabled because tracking it is too data intensive, and turning it on would blow up the observability infrastructure. Other times the metric is simply not built into the code, so adding the observability would require a pull request, a CI build, integration with the metrics collection system, and a redeployment (and that’s if you have enough control to do all of those steps in a timely manner, which is often not the case with vendor or open-source code). These barriers are often high enough that it is better to look for a proxy for the same metric, like counting log lines that are already available, even if imperfect. I have to say: I really like debugging, and I especially like it when it reaches levels of complexity that can only be effectively measured quantitatively. I think that is the commonality between metric-driven debugging for RL and for distributed systems. In RL there are many events at a micro level (timesteps, episodes, epochs, etc.), whereas in distributed systems there are many events at a macro level (machines, processes, requests, faults, etc.).

To quote Spinning Up: “Measure everything. Do a lot of instrumenting to see what’s going on under-the-hood. The more stats about the learning process you read out at each iteration, the easier it is to debug—after all, you can’t tell it’s broken if you can’t see that it’s breaking.” It is interesting to compare and contrast this with observability for large-scale systems, where there is an active tradeoff between observability and cost/performance. In that setting, fine-grained metrics, or metrics that scale with the number of machines or processes in the system, are likely to cause severe problems in monitoring infrastructure because of the vast amount of data produced. Handling larger volumes of data requires scaling the infrastructure to process it: larger machines with sufficient memory, and a scaled write path to perform actions like downsampling, indexing, and compression. The data itself also becomes unwieldy, leading to higher storage and archival costs, along with bloated indexes that can hurt query performance. In some cases the metric data is so large that it effectively can’t be queried unless the query is known at collection time and pre-computed. On top of this, a data retention policy becomes necessary, which has the unfortunate side effect of erasing valuable history. All of this works against the “measure everything” mantra. Of course, I did find it useful to measure everything at small scale during implementation, but I imagine that large-scale RL training deployments would run into similar problems. For this reason, it is likely important to have different metric “verbosities” that can be configured for local development versus production.
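One way to reconcile the two settings, sketched below with made-up names, is to tag every metric with a verbosity level and let the deployment configuration decide how much actually gets recorded:

from enum import IntEnum

class MetricLevel(IntEnum):
    ESSENTIAL = 0    # episode returns, losses: always cheap enough to keep
    DIAGNOSTIC = 1   # Q-value statistics, gradient norms
    DEBUG = 2        # per-layer histograms, raw tensors: local development only

# "Measure everything" locally; dial this down for large-scale training runs.
CONFIGURED_LEVEL = MetricLevel.DEBUG

def record(name, value, level=MetricLevel.DIAGNOSTIC):
    # Stand-in for whatever metrics backend is in use; only emit metrics at or
    # below the configured verbosity.
    if level <= CONFIGURED_LEVEL:
        print(f"{name}={value}")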

Challenge 4: tracking experiments

One difficulty I encountered was effectively tracking my experiments within the provided framework. The framework does solve the persistence problem for tracking experiments, and has basic visualization features (graph metric X over all epochs), but there were times when I wanted more. For example, I often wanted to keep track of the best set of hyperparameters for a given combination of algorithm and environment. This became harder over time as my memory faded with respect to what had worked well in the past. The best solution I found was literally to plot all relevant experiments and look for the best curve. This was challenging because colors were reused across so many time series and the graphs became too crowded to read. I found myself wishing for a declarative query language over the logged experiment results, to ask “What were the top k experiments for algorithm x on environment y with respect to average reward over all timesteps?”
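Something close to that query can be cobbled together outside the framework. The sketch below assumes each run directory under data/ contains the tab-separated progress.txt written by the Spinning Up logger, with AverageEpRet as one of its columns (adjust the names if your output differs):

import glob
import pandas as pd

def top_k_runs(name_filter, k=5, metric="AverageEpRet"):
    """Rank runs whose path contains name_filter by the mean of metric over all epochs."""
    scored = []
    for path in glob.glob("data/**/progress.txt", recursive=True):
        if name_filter not in path:
            continue
        df = pd.read_csv(path, sep="\t")
        if metric in df.columns:
            scored.append((df[metric].mean(), path))
    return sorted(scored, reverse=True)[:k]

# "What were the top 5 SAC runs on Walker2d by average return over all epochs?"
for score, path in top_k_runs("sac-benchmark-mytorch-walker"):
    print(f"{score:10.1f}  {path}")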

Another source of friction was keeping track of the most recent benchmark run for each (algorithm, environment) pair in order to replicate the Spinning Up Benchmark Graphs. I first needed to choose a standardized benchmark experiment name, which I did using the following scheme:

# Lowercase the environment name and append a Unix timestamp so runs sort chronologically.
python -m spinup.run sac_mytorch \
    ... \
    --exp_name "sac-benchmark-mytorch-$(echo $experiment_name | tr '[:upper:]' '[:lower:]')-$(date +%s)"

Then I would have to do bash gymnastics in order to plot only the most recent run for each algorithm for a particular environment:

# Collect the most recent benchmark run per algorithm for the Walker environment.
experiments=$( ls -t data | grep benchmark | sort -r -u -t '-' -k 1,5 | grep walker \
    | xargs -I {} echo -n "$(pwd)/data/{} " )

# Plot the selected runs, labeling each by its algorithm name (the first '-'-separated field).
python -m spinup.run plot -s 11 $(echo $experiments) \
    --legend $(echo $experiments | tr -s ' ' "\n" | cut -d'/' -f 7 | cut -d'-' -f 1) \
    --value AverageEpRet

I applaud Spinning Up for embracing the unix philosophy and providing a command-line interface so that I could achieve this goal. The experiment tracking in this framework works very well for what it was meant to be: a tightly-scoped pedagogical device. For a production project, experiment tracking frameworks like Weights & Biases or similar would be preferred.