BenchMARL - Benchmarking Multi-Agent Reinforcement Learning

BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL) using TorchRL. BenchMARL lets you quickly compare different MARL algorithms, tasks, and models while being systematically grounded in its two core tenets: reproducibility and standardization.


  • We introduce BenchMARL, a training library for benchmarking MARL algorithms, tasks, and models backed by TorchRL.
  • BenchMARL already contains a variety of SOTA algorithms and tasks.
  • BenchMARL is grounded in its core tenets: standardization and reproducibility.

What is BenchMARL 🧐?

BenchMARL is a Multi-Agent Reinforcement Learning (MARL) training library created to enable reproducibility and benchmarking across different MARL algorithms and environments. Its mission is to present a standardized interface that allows easy integration of new algorithms and environments to provide a fair comparison with existing solutions. BenchMARL uses TorchRL as its backend, which grants it high performance and state-of-the-art implementations. It also uses hydra for flexible and modular configuration, and its data reporting is compatible with marl-eval for standardised and statistically strong evaluations.

BenchMARL core design tenets are:

  • Reproducibility through systematic grounding and standardization of configuration
  • Standardised and statistically-strong plotting and reporting
  • Experiments that are independent of the algorithm, environment, and model choices
  • Breadth over the MARL ecosystem
  • Easy implementation of new algorithms, environments, and models
  • Leveraging the know-how and infrastructure of TorchRL, without reinventing the wheel

Why would I BenchMARL 🤔?

Why would you BenchMARL, I hear you ask. Well, you can BenchMARL to compare different algorithms, environments, and models, to check how your new research stacks up against existing solutions, or simply to get a picture of the landscape if you are new to the domain.

Why does it exist?

We created it because, compared to other ML domains, RL has always been more fragmented in terms of shared community standards, tools, and interfaces. In MARL, this problem is even more evident, with new libraries being frequently introduced that focus on specific algorithms, environments, or models. Furthermore, these libraries often implement components from scratch, without leveraging the know-how of the single-agent RL community. In fact, the great majority of components used in MARL are shared with single-agent RL (e.g., the PPO loss underlying MAPPO, models, probability distributions, replay buffers, and much more).

This fragmentation of the domain has led to a reproducibility crisis, recently highlighted in a NeurIPS paper [1]. While its authors propose a set of tools for statistically-strong reporting of results, there is still the need for a standardized library to run such benchmarks. This is where BenchMARL comes in. Its mission is to provide a benchmarking tool for MARL, leveraging the components of TorchRL for a solid RL backend.

How do I use it?

Command line

Simple! To run an experiment from the command line, do:

python benchmarl/ algorithm=mappo task=vmas/balance

To run multiple experiments (a benchmark), you can do:

python benchmarl/ -m algorithm=mappo,qmix,masac task=vmas/balance,vmas/sampling seed=0,1

Hydra multirun supports many launchers in the backend. The default launcher runs experiments sequentially, but parallel and slurm launchers are also available.
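For instance, a parallel launcher can be selected in the hydra defaults list, provided the corresponding launcher plugin is installed (hydra-joblib-launcher in this sketch; it is not shipped by BenchMARL itself):

```yaml
# Illustrative hydra config override (assumes hydra-joblib-launcher is installed)
defaults:
  - override hydra/launcher: joblib
```

The same override can also be passed on the command line as `hydra/launcher=joblib` during a multirun.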


Script

Run an experiment:

from benchmarl.algorithms import MappoConfig
from benchmarl.environments import VmasTask
from benchmarl.experiment import Experiment, ExperimentConfig
from benchmarl.models.mlp import MlpConfig

experiment = Experiment(
    task=VmasTask.BALANCE.get_from_yaml(),
    algorithm_config=MappoConfig.get_from_yaml(),
    model_config=MlpConfig.get_from_yaml(),
    critic_model_config=MlpConfig.get_from_yaml(),
    seed=0,
    config=ExperimentConfig.get_from_yaml(),
)
experiment.run()

Run a benchmark:

from benchmarl.algorithms import MappoConfig, MasacConfig, QmixConfig
from benchmarl.benchmark import Benchmark
from benchmarl.environments import VmasTask
from benchmarl.experiment import ExperimentConfig
from benchmarl.models.mlp import MlpConfig

benchmark = Benchmark(
    algorithm_configs=[
        MappoConfig.get_from_yaml(),
        QmixConfig.get_from_yaml(),
        MasacConfig.get_from_yaml(),
    ],
    tasks=[
        VmasTask.BALANCE.get_from_yaml(),
        VmasTask.SAMPLING.get_from_yaml(),
    ],
    seeds={0, 1},
    model_config=MlpConfig.get_from_yaml(),
    critic_model_config=MlpConfig.get_from_yaml(),
    experiment_config=ExperimentConfig.get_from_yaml(),
)
benchmark.run_sequential()


The goal of BenchMARL is to bring different MARL environments and algorithms under the same interfaces to enable fair and reproducible comparison and benchmarking. BenchMARL is a full-pipeline unified training library with the goal of enabling users to run any comparison they want across our algorithms and tasks in just one line of code. To achieve this, BenchMARL interconnects components from TorchRL, which provides an efficient and reliable backend.

The library has a default configuration for each of its components. While parts of this configuration are supposed to be changed (for example experiment configurations), other parts (such as tasks) should not be changed to allow for reproducibility. To aid in this, each version of BenchMARL is paired to a default configuration.

Let’s now introduce each component in the library.

Experiment. An experiment is a training run in which an algorithm, a task, and a model are fixed. Experiments are configured by passing these values alongside a seed and the experiment hyperparameters. The experiment hyperparameters cover both on-policy and off-policy algorithms, discrete and continuous actions, and probabilistic and deterministic policies (as they are agnostic of the algorithm or task used). An experiment can be launched from the command line or from a script.

Benchmark. In the library, we call a benchmark a collection of experiments that can vary in task, algorithm, or model. A benchmark shares the same experiment configuration across all of its experiments. Benchmarks allow comparing different MARL components in a standardized way. A benchmark can be launched from the command line or from a script.

Algorithms. Algorithms are an ensemble of components (e.g., loss, replay buffer) which determine the training strategy. Here is a table of the algorithms currently implemented in BenchMARL.

| Name  | On/Off policy | Actor-critic | Full-observability in critic | Action compatibility  | Probabilistic actor |
|-------|---------------|--------------|------------------------------|-----------------------|---------------------|
| MAPPO | On            | Yes          | Yes                          | Continuous + Discrete | Yes                 |
| IPPO  | On            | Yes          | No                           | Continuous + Discrete | Yes                 |
| MASAC | Off           | Yes          | Yes                          | Continuous + Discrete | Yes                 |
| ISAC  | Off           | Yes          | No                           | Continuous + Discrete | Yes                 |

Tasks. Tasks are scenarios from a specific environment which constitute the MARL challenge to solve. They differ along many dimensions; here is a table of the environments currently in BenchMARL.

| Environment | Tasks | Cooperation               | Global state | Reward function                | Action space          | Vectorized |
|-------------|-------|---------------------------|--------------|--------------------------------|-----------------------|------------|
| VMAS        | 5     | Cooperative + Competitive | No           | Shared + Independent + Global  | Continuous + Discrete | Yes        |
| MPE         | 8     | Cooperative + Competitive | Yes          | Shared + Independent           | Continuous + Discrete | No         |

BenchMARL uses the TorchRL MARL API for grouping agents. In competitive environments like MPE, for example, teams will be in different groups. Each group has its own loss, models, buffers, and so on. Parameter sharing options refer to sharing within the group. See the example on creating a custom algorithm for more info.
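As a schematic illustration of this grouping idea (the group names and component placeholders below are hypothetical, not BenchMARL's actual API), a group map pairs each group of agents with its own set of components:

```python
# Hypothetical sketch of TorchRL-style agent grouping: each group owns its
# own components, and parameter sharing applies only within a group.
group_map = {
    "agents": ["agent_0", "agent_1"],
    "adversaries": ["adversary_0", "adversary_1"],
}

# One loss/buffer placeholder per group (names are illustrative only)
per_group_components = {
    group: {"loss": f"{group}_loss", "buffer": f"{group}_buffer"}
    for group in group_map
}
```

In a competitive MPE task, for example, the two teams above would train against each other while each keeps its own loss and replay buffer.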

Models. Models are neural networks used to process data. They can be used as actors (policies) or, when requested, as critics. We provide a set of base models (layers) and a SequenceModel to concatenate different layers. All the models can be used with or without parameter sharing within an agent group. Here is a table of the models implemented in BenchMARL.

| Name | Decentralized | Centralized with local inputs | Centralized with global input |
|------|---------------|-------------------------------|-------------------------------|

And the ones that are work in progress

| Name | Decentralized | Centralized with local inputs | Centralized with global input |
|------|---------------|-------------------------------|-------------------------------|


BenchMARL has many features. In this section we will dive deep into the features that correspond to our core design tenets, but there are many more cool nuggets here and there, such as:

  • A test CI with integration and training test routines that are run for all simulators and algorithms
  • Integration in the official TorchRL ecosystem for dedicated support
  • Experiment checkpointing and restoring using torch
  • Experiment logging compatible with many loggers (wandb, csv, mlflow, tensorboard). The wandb logger is fully compatible with experiment restoring and will automatically resume the run of the loaded experiment.

In the following we illustrate the features which are core to our tenets.

Fine-tuned public benchmarks

In the fine_tuned folder, we are collecting tested hyperparameters for specific environments to help users bootstrap their benchmarking. You can simply run the scripts in this folder to use the proposed hyperparameters.

We will tune benchmarks for you and publish the configs and benchmarking plots publicly on Wandb.

Currently available ones are:


In the following, we report a table of the results:


[Figures: sample efficiency curves (all tasks), performance profile, aggregate scores]


Reporting and plotting

Reporting and plotting is compatible with marl-eval. If experiment.create_json=True (this is the default in the experiment config) a file named {experiment_name}.json will be created in the experiment output folder with the format of marl-eval. You can load and merge these files using the utils in eval_results to create beautiful plots of your benchmarks. No more struggling with matplotlib and latex!
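As a sketch of what such post-processing can look like using only the standard library (the nested environment → task → algorithm layout below is an assumption for illustration, not the exact marl-eval schema; the eval_results utilities provide the real implementation):

```python
import json

def merge_result_files(paths):
    """Merge several JSON result files into one nested dict.

    Assumes a nested environment -> task -> algorithm layout
    (an illustrative assumption, not the exact marl-eval schema).
    """
    merged = {}
    for path in paths:
        with open(path) as f:
            data = json.load(f)
        for env, tasks in data.items():
            for task, algos in tasks.items():
                # Later files add their algorithms alongside earlier ones
                merged.setdefault(env, {}).setdefault(task, {}).update(algos)
    return merged
```

Merging one file per experiment in this way yields a single dict covering all algorithms on all tasks, ready to be handed to the plotting utilities.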

[Plots: aggregate scores, sample efficiency, performance profile]


The project can be configured either from the script itself or via hydra. Each component in the project has a corresponding yaml configuration in the BenchMARL conf tree. Components' configurations are loaded from these files into python dataclasses that act as schemas for validation of parameter names and types. That way we keep the best of both worlds: separation of all configuration from code and strong typing for validation! You can also directly load and validate configuration yaml files without using hydra from a script by calling ComponentConfig.get_from_yaml().
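The dataclass-as-schema pattern can be sketched as follows (the MlpConfig fields and the helper name here are illustrative inventions, not BenchMARL's actual code):

```python
from dataclasses import dataclass, fields

# Illustrative schema dataclass (field names are assumptions for this sketch)
@dataclass
class MlpConfig:
    num_cells: list
    activation: str

def validated_from_dict(cls, cfg: dict):
    """Build a config dataclass, rejecting unknown parameter names."""
    allowed = {f.name for f in fields(cls)}
    unknown = set(cfg) - allowed
    if unknown:
        raise ValueError(f"Unknown parameters for {cls.__name__}: {unknown}")
    return cls(**cfg)
```

In BenchMARL the same idea sits behind ComponentConfig.get_from_yaml(), which loads the yaml file before validating it against the dataclass schema.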

Here are some examples on how you can override configurations:


Override the experiment configuration:

python benchmarl/ task=vmas/balance algorithm=mappo experiment.evaluation=true experiment.train_device="cpu"


Override the algorithm configuration:

python benchmarl/ task=vmas/balance algorithm=masac algorithm.num_qvalue_nets=3 algorithm.target_entropy=auto algorithm.share_param_critic=true


Override the task configuration. Be careful: for benchmarking stability, this is not suggested.

python benchmarl/ task=vmas/balance algorithm=mappo task.n_agents=4


Override the model configuration, for example choosing a sequence model:

python benchmarl/ task=vmas/balance algorithm=mappo model=sequence "model.intermediate_sizes=[256]" "model/layers@model.layers.l1=mlp" "model/layers@model.layers.l2=mlp" "+model/layers@model.layers.l3=mlp" "model.layers.l3.num_cells=[3]"

Check out the section on how to configure BenchMARL and our examples.


One of the core tenets of BenchMARL is allowing users to leverage the existing algorithm and task implementations to benchmark their newly proposed solutions.

For this reason, we expose standard interfaces with simple abstract methods for algorithms, tasks, and models. To introduce your solution into the library, you just need to implement the abstract methods exposed by these base classes, which use objects from the TorchRL library.
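Schematically, the pattern looks like this (the Task interface and its methods below are invented for illustration; BenchMARL's real base classes expose different, TorchRL-based methods):

```python
from abc import ABC, abstractmethod

# Invented interface for illustration only, not BenchMARL's actual Task class
class Task(ABC):
    @abstractmethod
    def max_steps(self) -> int:
        """Maximum number of steps per episode."""

    @abstractmethod
    def has_global_state(self) -> bool:
        """Whether the task exposes a global state to centralized critics."""

# A new task plugs into the library by implementing every abstract method
class MyBalanceTask(Task):
    def max_steps(self) -> int:
        return 100

    def has_global_state(self) -> bool:
        return False
```

Once every abstract method is implemented, the new component can be mixed and matched with all existing algorithms and models, which only ever talk to it through the base interface.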

Here is an example on how you can create a custom algorithm.

Here is an example on how you can create a custom task.

Here is an example on how you can create a custom model.

Next steps

BenchMARL has just been born and is constantly looking for collaborators to extend and improve its capabilities. If you are interested in joining the project, please reach out!

The next steps will include extending the library as well as fine-tuning sets of benchmark hyperparameters to make them available to the community.

  1. Gorsane, Rihab, et al. "Towards a standardised performance evaluation protocol for cooperative MARL." Advances in Neural Information Processing Systems 35 (2022): 5510-5521.

Matteo Bettini
PhD Candidate

Matteo’s research is focused on studying heterogeneity and resilience in multi-agent and multi-robot systems.