Reinforcement learning (RL) has become integral to modern large language model development. Beyond the traditional post-training use of Reinforcement Learning from Human Feedback (RLHF) to align models with human preferences, RL on verifiable rewards has emerged as a powerful technique for extending LLM capabilities. This approach has gained particular importance as high-quality pre-training data becomes increasingly scarce.
Recent breakthroughs demonstrate the effectiveness of this paradigm, including OpenAI's reasoning models (o1 and o3) and the openly available DeepSeek R1 models. The latest frontier extends LLM reinforcement learning to multi-turn settings, where models function as agents interacting with environments to solve complex tasks. This represents a significant step toward creating LLMs that can serve as effective agents across diverse domains.
Given the potential of reinforcement learning, the open source ecosystem has experienced rapid growth, with numerous libraries emerging that embody different design principles and optimization choices. This analysis examines the leading RL frameworks from a technical perspective, evaluating their respective advantages and limitations. We conducted this research while developing a reinforcement learning library, systematically analyzing existing solutions to understand the design decisions and architectural trade-offs inherent in each approach. By sharing these insights, we aim to help researchers and practitioners navigate the current landscape more effectively and make informed choices about which tools best suit their specific requirements and use cases.
We acknowledge that some of the comparison points can be subjective – we made our best effort to compare the libraries on objective criteria as much as possible, and provide links to the relevant code in the libraries so you can get your own impressions. If you feel something is misrepresented please let us know. Also keep in mind that these libraries are evolving at a very fast pace, so always consult the documentation and code of the different libraries for the latest updates.
We chose these specific libraries based on their recent activity, relevance to diverse use cases (from RLHF to agentic RL), and their representation of different architectural philosophies in the open-source landscape.
TRL: A popular library from Hugging Face, tightly integrated with the Hugging Face ecosystem, focusing on the training aspect of RL.
Verl: A popular high-performance, feature-rich RL stack from ByteDance optimized for scalability and support for advanced training techniques.
OpenRLHF: One of the earliest popular open-source RLHF libraries, it is both easy to use and high-performance. Like Verl, it has paved the way for many subsequent frameworks.
RAGEN: A notable extension of Verl that adds a focus on multi-turn conversations and diverse RL environments.
NeMo-RL: NVIDIA's comprehensive post-training framework, designed with clean interfaces and a focus on structured data flow, but also with scalability and high performance in mind.
ROLL: A new library from Alibaba for RLHF, reasoning, and multi-turn agentic training, designed to be flexible for researchers and scalable for large-scale production use.
AReaL: An RL library from Ant Research with a focus on asynchronous training to improve training throughput and scalability.
Verifiers: Built on TRL, this framework simplifies the implementation of multi-turn RL with verifiable rewards, prioritizing ease of use.
SkyRL: A new RL library from Berkeley with a focus on multi-turn agentic training, using a simple and flexible design that allows both high-performance execution and adaptation to a wide variety of scenarios.
Reinforcement learning libraries aim to simplify the process of training RL policies that can solve complex problems. Users define their specific problems along with reward functions that measure solution quality, while the library handles the underlying training mechanics to develop effective policies. Common problems to train a policy on include:
Coding, where the reward is whether the code is correct (e.g., measured by whether it passes a suite of unit tests),
Computer use (i.e., solving tasks with a computer), where the reward is whether the task is solved successfully,
Formulating mathematical proofs, where the reward is +1 if the proof is a valid proof for a given proposition and 0 otherwise, and
Playing games, where the reward is the final score achieved in the game, or whether the policy could beat the game.
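To make the idea of a verifiable reward from the list above more concrete, here is a minimal sketch of a reward function for the math case. The `extract_final_answer` helper and the `Answer:` marker are illustrative assumptions, not taken from any of the libraries discussed below.

```python
# Minimal sketch of a verifiable reward for math problems (illustrative only).
# The completion comes from the policy; the reference answer from the dataset.

def extract_final_answer(completion: str) -> str:
    """Hypothetical helper: take whatever follows the last 'Answer:' marker."""
    marker = "Answer:"
    return completion.rsplit(marker, 1)[-1].strip() if marker in completion else completion.strip()

def math_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(completion) == reference_answer.strip() else 0.0

# Example: math_reward("... so the result is 391. Answer: 391", "391") -> 1.0
```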
As mentioned in the introduction, there are a number of different use cases for RL libraries for LLMs:
Reinforcement Learning from Human Feedback (RLHF): This was the original RL post-training use case and aligns the LLM with human preferences. As part of that, a preference dataset is collected from humans, from which a reward model is learned, which is then used to run RL to adapt the LLM.
Reasoning models: This use case uses tasks such as solving mathematical or scientific problems or coding to improve the reasoning capabilities of LLMs. Instead of outputting the answer immediately, the LLM learns to output “thinking” tokens first, which can boost the performance of the models. The thinking process is often learned by running RL on the problem with verifiable rewards. Since the LLM gives a single answer at the end of the reasoning process, this is often referred to as “single-step” RL.
Agentic and multi-step RL: In this setting, the LLM operates as an autonomous agent within an environment, executing multiple sequential actions to accomplish a task. This represents the most sophisticated configuration and places the highest demands on the supporting library. The library must coordinate both the environment and the LLM, while handling problems that may require very different numbers of steps across various instances.
The core of an RL library can be broken down into two fundamental components:
The generator that sets the LLM up to interact with an environment to solve problems and compute the rewards.
The trainer that updates the model based on reward feedback given during the generation phase.
The way a framework is designed and integrates these components reveals its core philosophy and use cases. Some libraries are more focused on RLHF, some are optimized for training reasoning models, and others have more powerful abstractions for generation and environments, which makes them better suited for multi-turn RL and training agents.
We will first dive a little deeper into the design tradeoffs that can be made in these components.
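Conceptually, the interplay between these two components boils down to a loop like the following. This is a schematic sketch with made-up interfaces (`Generator`, `Trainer`, `sample_batch`, and so on), not the API of any particular library.

```python
from typing import Any, Iterable, Protocol

class Generator(Protocol):
    def generate(self, problems: Iterable[Any]) -> list[Any]: ...
    def sync_weights(self, weights: Any) -> None: ...

class Trainer(Protocol):
    def update(self, trajectories: list[Any]) -> dict: ...
    def get_weights(self) -> Any: ...

def rl_loop(dataset, generator: Generator, trainer: Trainer, num_steps: int, batch_size: int) -> None:
    for _ in range(num_steps):
        problems = dataset.sample_batch(batch_size)    # prompts / environment parameters
        trajectories = generator.generate(problems)    # roll out the current policy and compute rewards
        trainer.update(trajectories)                   # e.g. a PPO or GRPO update on the trajectories
        generator.sync_weights(trainer.get_weights())  # push new weights to the inference engine
```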
The generation phase is often the most expensive part of the computation. It involves running inference on the LLM as well as executing the actions in the environment and computing the rewards. Often, there is a dataset of problems, which might contain initial prompts with the problem statement and/or parameters of the environment that the agent should operate in. In the case of verifiable rewards, there is also a notion of the correct answer to the problem. For each problem, the current policy generates a trajectory, which consists of the sequence of tokens produced by the LLM and encodes the actions taken in the environment. Different libraries represent these trajectories in different ways, but the OpenAI format is a common choice.
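As a rough illustration, a single trajectory in the OpenAI chat format might look like the following; the exact fields (and whether rewards are stored alongside the messages) differ between libraries.

```python
# One possible representation of a single trajectory using OpenAI-style chat
# messages; the exact fields (e.g. "reward") are illustrative assumptions.
trajectory = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 17 * 23?"},
        {"role": "assistant", "content": "17 * 23 = 391. Answer: 391"},
    ],
    "reward": 1.0,    # e.g. from a verifiable reward function
    "answer": "391",  # reference answer from the dataset, if available
}
```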
The generation can be arbitrarily complex and use techniques such as branching when exploring a larger solution space in a tree-like manner. Libraries that make it easy to define a custom generate function for generating the rollouts enable the maximum amount of flexibility for custom sampling, branching, and searching.
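A sketch of what such a custom generate function could look like is shown below; it samples several candidate continuations per prompt and keeps the best one. The `llm.sample` and `reward_fn` callables are placeholders, not a specific library's API.

```python
# Sketch of a custom generate function that branches the rollout: sample several
# candidate continuations per prompt and keep the highest-reward one.

def generate_with_branching(llm, prompt: str, reward_fn, num_branches: int = 4) -> dict:
    candidates = [llm.sample(prompt) for _ in range(num_branches)]
    scored = [(reward_fn(candidate), candidate) for candidate in candidates]
    best_reward, best_completion = max(scored, key=lambda pair: pair[0])
    return {"prompt": prompt, "completion": best_completion, "reward": best_reward}
```

More elaborate schemes can branch again from intermediate states to explore the solution space in a tree-like manner.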
The environment is an important part of the generator. In earlier RL libraries used for RLHF or for training reasoning models, there was often no explicit environment. Instead, each output of the LLM was scored either with a reward model or with an explicit reward in the form of a custom reward function. Later, more explicit APIs were added, often similar to OpenAI’s Gym interface with init, step, and reset functions. This allows tool use during trajectory generation, as well as multi-step interactions between the LLM and the environment. In many cases, the actual execution of the environment happens in a separate container/process or on a remote server, so the environment can be isolated.
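A minimal sketch of such a gym-style interface for a text environment is shown below; the class, method names, and return types are illustrative and vary between libraries.

```python
class ToolEnv:
    """Minimal gym-style text environment; treat this as an illustrative sketch."""

    def __init__(self, task: dict, max_turns: int = 5):
        self.task = task            # e.g. {"prompt": "...", "answer": "..."}
        self.max_turns = max_turns
        self.turns = 0

    def reset(self) -> str:
        """Return the initial observation (here: the task prompt)."""
        self.turns = 0
        return self.task["prompt"]

    def step(self, action: str) -> tuple[str, float, bool]:
        """Execute the LLM's action (e.g. a tool call or a final answer) and
        return (observation, reward, done)."""
        self.turns += 1
        observation = f"tool output for: {action}"  # placeholder for real tool execution
        done = self.turns >= self.max_turns or action.startswith("FINAL ANSWER:")
        reward = 1.0 if done and self.task["answer"] in action else 0.0
        return observation, reward, done
```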
The trainer is where the core optimization loop resides: it takes the trajectory data collected in the generation phase and produces the new policy. Most libraries have standardized around supporting both the PPO and GRPO training algorithms. In terms of the sharding backend for splitting up the model among different GPUs, there are several choices, with the Hugging Face Trainer, FSDP, DeepSpeed, and Megatron being the most popular. FSDP is among the simpler choices with tight PyTorch integration, DeepSpeed has more aggressive parameter and optimizer offloading, and Megatron is the most performant at larger scale. Several libraries derive their training code from Hugging Face’s TRL library, and many others use Verl’s training code under the hood, which itself supports FSDP and Megatron.
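As a small worked example of the algorithmic side, the group-relative advantage computation at the heart of GRPO can be sketched as follows; details such as the normalization epsilon vary between implementations.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO: normalize each reward by the
    mean and std of the rewards within the group of rollouts for the same prompt.
    `rewards` has shape (num_prompts, rollouts_per_prompt)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```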
We want to keep the comparisons as objective as possible, so we focus on comparison points that can be objectively deduced from the libraries’ code and documentation. We link to the relevant places to enable users to form their own view and make their own choices.
Adoption: While any metrics to describe the adoption of a library are imperfect, we include the first release, number of stars, number of total issues, and contributors as a proxy for the adoption of each library. Given a library that satisfies your use case (e.g., RLHF, training reasoning models, training agents) and has the features you need, more widely adopted libraries are often preferable since they are tested in more scenarios and therefore generally more robust, and it will be easier to find help.
System Properties: Besides listing the use case the library is targeted at, we also list the properties that each library can have:
Efficiency: supports collocation of inference/training or async training
Scalability: supports inference and training with many GPUs on a cluster
Modularity: has clear interfaces for various components of the library that can be implemented to change the behavior of the library
Flexibility: supports a wide range of features or can be modified easily to support different settings
Components: We link the trainer, generator, and environment components described in the section “Components of an RL library” for each of the libraries. The components might not always exactly follow the division into trainer, generator, and environment – for example, some libraries unify the generator and environment into a combined multi-turn environment. We also include the following information about the training backend, inference engine, and environment layer:
Training backend: We list which training backends the library supports (Hugging Face Trainer, FSDP, DeepSpeed, or Megatron) and which algorithms the different frameworks support out of the box.
Inference engine: The most popular inference engines are vLLM, SGLang, Hugging Face transformers, and external inference engines accessed via the OpenAI API. Possible deployment modes include collocating the inference engine and training engine or deploying them separately. Some RL frameworks support external inference engines not managed by the framework. Collocating training and inference can help reduce GPU usage, but it is less flexible. Separating training and inference is often preferable for long-horizon tasks with stragglers, for asynchronous inference workloads, when different GPU types should be used for training and inference, or when training and inference need to scale independently. Another design decision is whether multiple inference engines are supported and whether load balancing between them is done client side or server side.
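As an example of the external-engine pattern, vLLM and SGLang both expose OpenAI-compatible servers, so rollouts can be sampled roughly as follows; the URL, API key, and model name below are placeholders for your own deployment.

```python
# Sampling rollouts from an external, OpenAI-compatible inference server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-policy-model",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    n=4,              # several rollouts per prompt, e.g. to form GRPO groups
    temperature=1.0,  # keep sampling stochastic during RL
)
completions = [choice.message.content for choice in response.choices]
```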
Async support: Whether the library can overlap training and generation by running these two activities asynchronously. Doing so can improve resource utilization, especially for very long rollouts with stragglers, but it complicates the system and may destabilize the learning dynamics. There are different ways to synchronize the weights updated by the training engine to the inference engines. Some methods form a torch distributed process group among one or more ranks of the training engine and all inference engines, then execute the weight synchronization via NCCL, gloo, or CUDA IPC; this syncs weights quickly but is inflexible. Other methods checkpoint the model weights to storage and load them into the generators, which is more flexible but slower.
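A schematic sketch of the process-group approach is shown below; how ranks are assigned, when the group is formed, and how the inference engine ingests the tensors all differ between libraries.

```python
# Schematic NCCL-style weight sync: a training rank and all inference ranks join
# a torch.distributed process group, and the training rank broadcasts the
# updated weights after each optimizer step (or every few steps).
import torch
import torch.distributed as dist

def broadcast_weights(state_dict: dict[str, torch.Tensor], group, src_rank: int = 0) -> None:
    """Collective call: run on the source training rank and on every inference
    rank; tensors on non-source ranks are overwritten in place."""
    for tensor in state_dict.values():
        dist.broadcast(tensor, src=src_rank, group=group)
```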
Environment layer: Different libraries have varying support for environments. Some libraries only support single-turn interactions without an explicit environment abstraction. Some libraries have an environment abstraction, but in order to add a custom environment, the library itself needs to be changed or forked. Some libraries support implementing custom environments outside of the library. In general, the less coupled the environment is to the generation process and the easier it is to integrate existing environments (e.g., in the form of external remote services), the better. In order to make the comparison as objective as possible and acknowledge the fact that different users have different requirements and preferences, we link example environments or documentation on how to integrate new environments for each library so the reader can make up their own mind.
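To illustrate the "external remote service" pattern mentioned above, an environment running behind an HTTP API could be wrapped roughly like this; the `/reset` and `/step` endpoints are hypothetical, as there is no standard protocol shared by the libraries.

```python
# Sketch of wrapping an environment that runs as a remote service.
import requests

class RemoteEnv:
    def __init__(self, base_url: str):
        self.base_url = base_url

    def reset(self) -> str:
        return requests.post(f"{self.base_url}/reset").json()["observation"]

    def step(self, action: str) -> tuple[str, float, bool]:
        payload = requests.post(f"{self.base_url}/step", json={"action": action}).json()
        return payload["observation"], payload["reward"], payload["done"]
```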
Orchestrator: RL involves different components (environments, inference engine, training engine) that interact with each other and may need to be orchestrated in complex ways, e.g., if generation and training are collocated on the same resources or done asynchronously. In addition, each component consists of many different processes or containers that have to be run at a large scale. It is therefore beneficial to have an orchestrator that takes care of scheduling, communication, fault tolerance, and autoscaling. Due to its flexibility, scalability, and support for heterogeneous computing, many of the RL libraries have standardized on Ray for orchestration.
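As a small illustration of this pattern, rollout workers can be run as Ray actors and scheduled across a cluster roughly as follows; the `RolloutWorker` class and its methods are made up for this sketch, not taken from any of the libraries above.

```python
import ray

ray.init()

@ray.remote(num_gpus=1)
class RolloutWorker:
    def __init__(self, model_path: str):
        self.model_path = model_path  # in practice: load an inference engine here

    def generate(self, prompts: list[str]) -> list[dict]:
        # Placeholder: run the inference engine and environment, return trajectories.
        return [{"prompt": p, "completion": "...", "reward": 0.0} for p in prompts]

# Spread rollout generation across several GPU workers.
workers = [RolloutWorker.remote("my-policy-model") for _ in range(4)]
futures = [w.generate.remote(["What is 17 * 23?"]) for w in workers]
trajectories = [t for batch in ray.get(futures) for t in batch]
```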
This table gives a general overview of the libraries and links to the different components so you can find out how the interfaces are written and how flexible they are.
| Framework | First release / ⭐ Stars / 📝 Total issues / 🧑💻 Contributors | Target use case / system properties | Components (Data structures) | Training backend / Algorithms | Inference engine | Async support | Environment layer | Orchestrator |
|---|---|---|---|---|---|---|---|---|
| TRL (Hugging Face) | Jan 2023 ⭐ 14.4k 📝 1.9k 🧑💻 365 | RLHF, reasoning / flexibility, scalability | | Hugging Face Trainer / SFT, DPO, GRPO, PPO, others | Hugging Face, vLLM | ❌ | ❌ | — |
| Verl (ByteDance) | Nov 2024 ⭐ 10.2k 📝 1.0k 🧑💻 253 | RLHF, reasoning, agents / flexibility, efficiency, scalability | | FSDP, DeepSpeed, Megatron / SFT, DPO, GRPO, PPO | vLLM, SGLang | 🚧 RFC BufferDataset | 🚧 | Ray |
| OpenRLHF | Jul 2023 ⭐ 7.2k 📝 0.7k 🧑💻 73 | RLHF / flexibility, efficiency, scalability | | DeepSpeed / SFT, DPO, GRPO, PPO, others | Hugging Face, vLLM | ✅ (--async_train) | 🚧 via python function | Ray |
| RAGEN | Jan 2025 ⭐ 2.1k 📝 0.1k 🧑💻 14 | agents / modularity, flexibility, scalability, efficiency | | Verl backend / GRPO, PPO | Hugging Face, vLLM, SGLang | ❌ | ✅ | Ray |
| AReaL (Ant Group) | Feb 2025 ⭐ 1.9k 📝 0.1k 🧑💻 17 | RLHF, reasoning, agents / efficiency, scalability | | DeepSpeed, Megatron / GRPO, PPO | vLLM, SGLang | ✅ | ✅ | Ray (optional) |
| Verifiers | Feb 2025 ⭐ 1.4k 📝 0.1k 🧑💻 6 | reasoning, agents / flexibility, modularity | | Hugging Face Trainer / GRPO | vLLM, OpenAI | ✅ | ✅ | — |
| ROLL (Alibaba) | May 2025 ⭐ 1.3k 📝 0.0k 🧑💻 23 | RLHF, reasoning, agents / modularity, flexibility, scalability, efficiency | | DeepSpeed, Megatron / GRPO, PPO, others | vLLM, SGLang | ❌ (planned) | ✅ | Ray |
| NeMo-RL (Nvidia) | Mar 2025 ⭐ 0.5k 📝 0.2k 🧑💻 29 | RLHF, reasoning, agents / modularity, flexibility, scalability, efficiency | | FSDP, Megatron / SFT, DPO, GRPO | vLLM | ✅ | ✅ | Ray |
| SkyRL (UC Berkeley) | Jun 2025 ⭐ 0.5k 📝 0.0k 🧑💻 9 | agents / modularity, flexibility, scalability, efficiency | optional: Environment | FSDP, DeepSpeed / GRPO, PPO | vLLM, SGLang, OpenAI | ✅ | ✅ | Ray |
In this section we discuss some of the above libraries in more detail and try to give you an idea of the context they were built in and what they will be most useful for. Some of these remarks are necessarily a bit more subjective than the table above, and many of the libraries are evolving quickly, so take everything with a grain of salt.
TRL: Optimized for simplicity and integration with the Hugging Face ecosystem, e.g., datasets, transformers, accelerate, and PEFT. It was built for text-based post-training of LLMs (e.g., for reasoning, RLHF) and supports supervised fine-tuning as well as DPO in addition to GRPO and PPO. It is a good choice if you are mainly interested in the text-based RL setting without environment interactions. However, TRL doesn’t natively support multi-turn RL with arbitrary environments or more flexible generation schemes, which prompted the development of other libraries like Verifiers that follow similar design decisions as TRL but add support for multi-turn RL and environments. Overall, TRL defines fewer internal interfaces than some of the other libraries, which makes it simpler but also less adaptable.
Verl: Verl was built for performance and scalability; it supports all the mature training frameworks like FSDP, DeepSpeed, and Megatron. Like TRL, it originated as a library for training reasoning models with single-turn interactions and no environment, but it is being extended with tool calling and multi-turn RL, and more flexible environment/agent support is emerging. Generation and training are fairly tightly coupled for performance reasons. Verl has a large community, is very actively developed, and is among the most mature open-source RL libraries for LLMs. There are already several projects building on top of Verl and extending it in various ways [1] [2] [3] (with a full list here).
OpenRLHF: OpenRLHF was built for RLHF and has very good support for reward models and different optimization algorithms. It also supports asynchronous training as well as collocation of generation and training. There is some support for environments / agentic RL but no dedicated interfaces. OpenRLHF is mature and has a sizable community, and there are libraries built on top of it that extend it more towards reasoning as well as multi-agent training domains.
RAGEN: RAGEN is built on top of Verl; it integrates a more explicit environment interface and extends Verl with better agent and multi-turn RL support. It also makes it easier to define custom environments.
AReaL: This library focuses especially on performance and on optimizing the asynchronous execution of generation and training. To achieve maximal performance, it implements interruptible rollouts and algorithmic modifications to PPO to combat data staleness during asynchronous execution.
Verifiers: Addresses some of the shortcomings of TRL and adds support for environments and multi-turn tool use. Like TRL, Verifiers is based on the Hugging Face transformers Trainer. It is a simple framework and popular among researchers.
ROLL: ROLL is designed to appeal to a diverse range of users (large AI labs, practitioners and researchers) by providing a large set of interfaces and making the library configurable.
NeMo-RL: NeMo-RL is one of the more recent libraries; it has clearly defined interfaces for the various components and clear data representations for trajectories and other relevant data. It is built with multi-turn RL and environment support in mind. Like Verl, it is built for performance and scalability, but it is not as mature as Verl yet and has fewer examples and less of an ecosystem.
SkyRL: SkyRL incorporates learnings from many of the above libraries and offers simple but flexible interfaces that make it easy to support a wide range of settings (e.g., sync or async one-off pipelining, collocated or disaggregated generation and training, external or integrated inference engines, and different methods for weight synchronization). It is not as mature as some of the other libraries yet and has less of an ecosystem.
One question you might naturally ask yourself is: “Which library should I use?”. The answer will depend on the setting you are operating in and your use cases, and you should make up your own mind (hopefully the comparison points in this blog post will help you). We can give some general guidelines that might help guide your decision:
If you are optimizing for training large models or have strict performance requirements, looking at Verl is arguably a good choice. It is among the most mature libraries, but its focus on performance and scalability also comes at the cost of some flexibility. If you are looking for more flexible environment and agent support, it will be worth looking into RAGEN too (if you like Verl) or some of the other frameworks like NeMo-RL, SkyRL, ROLL, or AReaL.
If you are a researcher and value simplicity and flexibility the most (e.g. you want to try out new algorithms or approaches that are not commonly used and spend a lot of time changing the source code), it is worth checking out verifiers.
Finally, if you are working on your own library or are interested in this space, consider submitting a talk to Ray Summit 2025 where we will have a dedicated track for post training and reinforcement learning.