While much work in data science to date has focused on algorithmic scale and sophistication, safety — that is, safeguards against harm — is a domain no less worth pursuing. This is particularly true in applications like self-driving vehicles, where a machine learning system’s poor judgment might contribute to an accident.

Safety Gym is designed for reinforcement learning agents, or AI that's progressively spurred toward goals via rewards (or punishments). These agents learn by trial and error, which can be a risky endeavor: in the course of exploring, they sometimes attempt dangerous behaviors that cause real harm.
As a remedy, OpenAI proposes constrained reinforcement learning, a form of reinforcement learning that adds cost functions the agent must keep below specified limits. In contrast to common practice, where an agent's behavior is shaped by a single reward function hand-tuned to favor the desired objectives, constrained agents figure out the trade-offs needed to achieve explicitly defined outcomes.
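In other words, where standard reinforcement learning folds safety into one hand-tuned scalar, constrained reinforcement learning keeps reward and cost as separate signals and imposes an explicit budget on the cost. The sketch below illustrates the distinction; the step interface, the fine, and the 5% budget are illustrative assumptions, not Safety Gym's actual API.

```python
from typing import NamedTuple

class Step(NamedTuple):
    reward: float  # task reward, e.g., progress toward the goal
    cost: float    # safety cost, e.g., 1.0 when a collision occurs

COST_BUDGET = 0.05  # illustrative: average cost per step must stay at or below 5%

def standard_rl_return(steps, collision_fine=1.0):
    # Standard RL: safety is folded into one scalar via a fixed, hand-picked fine.
    return sum(s.reward - collision_fine * s.cost for s in steps)

def constrained_rl_objective(steps):
    # Constrained RL: reward and cost stay separate; the agent maximizes
    # reward only among behaviors whose average cost respects the budget.
    total_reward = sum(s.reward for s in steps)
    avg_cost = sum(s.cost for s in steps) / max(len(steps), 1)
    return total_reward, avg_cost <= COST_BUDGET
```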
“In normal [reinforcement learning], you would pick the collision fine at the beginning of training and keep it fixed forever,” OpenAI explains in a blog post. “The problem here is that if the pay-per-trip is high enough, the agent may not care whether it gets in lots of collisions (as long as it can still complete its trips) … [But in] constrained [reinforcement learning,] you would pick the acceptable collision rate at the beginning of training, and adjust the collision fine until the agent is meeting that requirement.”
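The adjustment OpenAI describes can be implemented in a Lagrangian style: treat the collision fine as a multiplier that rises while the agent exceeds the acceptable collision rate and falls when it stays under. The toy loop below is a sketch under assumed names (`TARGET_RATE`, `run_episode`); a real implementation would update a policy with the fine-shaped reward rather than use a canned response curve.

```python
TARGET_RATE = 0.05  # acceptable collision rate, fixed at the start of training
LR_FINE = 1.0       # how quickly the fine adapts

def run_episode(fine, risk_appetite=0.3):
    """Hypothetical stand-in for one training episode.

    Returns (trips_completed, collision_rate). A higher fine makes the
    imaginary agent drive more cautiously, trading trips for safety.
    """
    caution = min(1.0, fine)
    collision_rate = risk_appetite * (1.0 - caution)
    trips = 10.0 * (1.0 - 0.5 * caution)
    return trips, collision_rate

fine = 0.0  # the collision fine is learned, not hand-picked
for epoch in range(50):
    trips, collision_rate = run_episode(fine)
    # Lagrangian-style update: raise the fine while the agent is over the
    # acceptable collision rate, lower it (never below zero) otherwise.
    fine = max(0.0, fine + LR_FINE * (collision_rate - TARGET_RATE))

print(f"fine converged to {fine:.2f}; collision rate {collision_rate:.3f} "
      f"(target {TARGET_RATE})")
```

Run as written, the fine settles at the value where the toy agent's collision rate matches the 5% target, which is the behavior the quote describes: the designer fixes the acceptable risk, and training discovers the fine.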