by Paul Christiano @OpenAI at #NIPS2017 Aligned Artificial Intelligence Workshop

Overview of problem

  • RL can go beyond human level w/ a well-defined goal
  • risk: a world dominated by AI so complex that an unaided human can't understand it
  • without being able to train AI to help humans understand it


  • easy to train content feeds to maximize engagement, hard to maximize goodness
  • easy to train an advisor to give advice that looks good, hard to train it to give genuinely good advice
  • easy to train an autonomous corp for profit, hard to train it for good corporate governance

Reward learning

  • Goal : get human values into ML objective
  • Difficulty: we can't observe values directly, so they're hard to identify


The hard case: we want to train systems to adopt a policy that
A. outperforms the human, but
B. humans can't recognize as good.
The machine can't be told it's better, because the human providing the training signal can't see that it is.

Option 1

  • Specify model: P(behavior | idealized preferences)
  • Observe behavior, infer preferences, optimize
  • Challenge 1: model itself must capture complex human reasoning and limitations, can’t just learn it
    • current trend in ML is to let data speak for itself
  • Challenge 2: might not play nice with your RL algo
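Option 1 amounts to Bayesian inference under a specified behavior model. A minimal toy sketch (my own illustration, not code from the talk): behavior is assumed Boltzmann-rational, and we invert that model to get a posterior over reward hypotheses. The hypothesis names, the action set, and `BETA` are all made-up illustration.

```python
import math

# Toy model: an agent picks among 3 actions; each reward hypothesis
# assigns a value to each action. The *specified model* of behavior is
# Boltzmann-rational: P(action | rewards) proportional to exp(BETA * reward).
REWARD_HYPOTHESES = {
    "likes_A": [1.0, 0.0, 0.0],
    "likes_B": [0.0, 1.0, 0.0],
    "likes_C": [0.0, 0.0, 1.0],
}
BETA = 2.0  # rationality parameter -- part of the model we must specify

def likelihood(action, rewards):
    """P(action | reward hypothesis) under the Boltzmann model."""
    z = sum(math.exp(BETA * r) for r in rewards)
    return math.exp(BETA * rewards[action]) / z

def infer_preferences(observed_actions):
    """Posterior over reward hypotheses given behavior (uniform prior)."""
    posterior = {h: 1.0 for h in REWARD_HYPOTHESES}
    for a in observed_actions:
        for h, rewards in REWARD_HYPOTHESES.items():
            posterior[h] *= likelihood(a, rewards)
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}
```

Both challenges show up even here: the Boltzmann assumption is doing all the work (Challenge 1), and the discrete posterior doesn't plug directly into an RL loop (Challenge 2).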

Option 2 - this talk is about it!

  • give human better tools to understand situation
  • hope stated preferences converge towards idealized preferences
  • infer preference from the (human + tools) system rather than human alone


Where do we get tools that ensure the human + tools system is smart enough to train our AI?
Idea: use AI we aren’t currently training

Analogy: Expert iteration and AlphaZero

  • EI is similar but w/ a very simple “human” who uses MCTS to optimize the probability of winning
  • Can we apply same idea using a real human, who isn’t maximizing a simple goal?
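The analogy can be made concrete as a tiny amplify–distill loop (my own sketch, not code from the talk): `amplify` plays the role of MCTS — a slow expert built by consulting the current agent on subquestions — and `distill` plays the role of network training (here just a lookup table). The toy task (summing a tuple of numbers) and all function names are assumptions for illustration.

```python
def decompose(question):
    """Split a tuple-of-numbers question into two simpler halves."""
    mid = len(question) // 2
    return [question[:mid], question[mid:]] if len(question) > 1 else [question]

def combine(question, answers):
    """Recombine subanswers; for the summing task, just add them."""
    return sum(answers)

def base_agent(question):
    """Weak starting agent: only handles single-number questions."""
    return question[0] if len(question) == 1 else 0

def amplify(agent, question):
    """Slow expert (the MCTS analogue): consult the agent on subquestions."""
    return combine(question, [agent(sub) for sub in decompose(question)])

def distill(expert, questions):
    """Fast imitator (the network analogue): memorize the expert's answers."""
    table = {q: expert(q) for q in questions}
    return lambda q: table.get(q, 0)

def iterate(agent, questions, rounds):
    """Expert iteration: amplify, then distill, repeatedly."""
    for _ in range(rounds):
        expert = lambda q, a=agent: amplify(a, q)
        agent = distill(expert, questions)
    return agent
```

After two rounds on questions up to length four, the distilled agent answers `(1, 2, 3, 4)` correctly even though the base agent could only handle single numbers — capability grows with each iteration.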

Relationship to CIRL

  • Amplification - helping human understand situation, clarify their values, etc — is a reasonable strategy in the CIRL game
    • In addition to understanding the dynamics of amplification, we should understand it as a CIRL strategy:
      • what forms of amplification are a good strategy?
      • under what assumptions about the human is it a good strategy?
  • Main advantage of amplification is possible robustness to model misspecification

Can amp work with humans

  • is there a capability ceiling?
  • can we preserve alignment?
    Best case: amplification can work no better than it would if the human were perfectly imitated

HCH: Humans Consulting HCH
would it work?

  • many questions can be broken down into simpler pieces, so there's some hope
  • but many problems:
    • the root human can't look at an arbitrarily complex task
    • error rates grow as you iterate
    • security falls as you iterate
    • problems become increasingly exotic
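The error-rate worry can be made quantitative with a back-of-envelope model (mine, not from the talk): if each node in the consultation tree independently errs with probability p, the chance the whole tree is error-free decays exponentially with tree size.

```python
def p_tree_correct(p_node, branching, depth):
    """Probability a full consultation tree makes no error, assuming each
    node errs independently with probability p_node."""
    nodes = sum(branching ** k for k in range(depth + 1))
    return (1 - p_node) ** nodes

# Even a 1% per-node error rate leaves a depth-5 binary tree
# (63 nodes) fully correct only ~53% of the time.
```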

requires both theory and experiment

Can amp work with machines

  • can we control compounding errors?
  • can we make training fast enough?

What he has actually been working on recently:
Can we get this working with ML?
(getting this working with groups of humans seems hard, so the ML version is the easier problem)

Toy Domains

  • Consider training a question answering system in simple algorithmic environment

    • graph reachability
    • algebraic reasoning
    • summing over wildcard matching
  • Previous work assumes we have access to ground truth. But if humans don't have time to look at all the data, they can't provide the ground truth.
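The graph-reachability task shows the kind of decomposition an amplified question-answerer could learn. A minimal sketch (my own; the example graph and function signature are made up): "can you get from s to t in at most 2**k steps?" splits into two half-length subquestions about a midpoint, so no single answerer ever reasons about the whole path.

```python
EDGES = {(0, 1), (1, 2), (2, 3)}  # tiny example path graph (assumed)
NODES = range(4)

def reachable(s, t, k):
    """Can we get from s to t in at most 2**k edges? Answered by
    recursively asking two simpler subquestions about a midpoint m."""
    if s == t or (s, t) in EDGES:
        return True
    if k == 0:
        return False
    # subquestions: is there an m reachable from s, from which t is reachable?
    return any(reachable(s, m, k - 1) and reachable(m, t, k - 1)
               for m in NODES)
```

Each recursive call is a strictly simpler question, which is exactly the structure amplification needs: the top-level answerer only combines two subanswers.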