Hi everyone, I am Baihan Lin, from Columbia
University.
I am presenting our work: "A Story of Two
Streams: Reinforcement Learning Models from
Human Behavior and Neuropsychiatry".
This is work done at IBM Research with Guillermo Cecchi, Djallel Bouneffouf, Jenna Reinen, and Irina Rish.
Human decision making is different from traditional
reinforcement learning.
At each state, the available actions each have a corresponding reward and cost. An RL agent aims to find the optimal policy that maximizes the expected cumulative reward over all future scenarios.
The human mind, however, undergoes much more complicated processes. Costs and rewards can come in different forms, and thus carry different weights in the search for an optimal policy. For instance, one might not worry too much about certain costs. Or one might ignore rewards of certain types.
The timescale of the human decision-making process is also different. Some might focus on present-day pleasure without caring about long-term benefits. These are all reward biases hidden in each person's character and idiosyncrasies. In other words, it is not that simple.
Insights from neuroscience and psychiatry
concur with this intuition.
The brain's reward system often involves multiple regions that represent prediction errors and values across various circuits and mechanisms.
These interacting systems drive downstream
behaviors like motivation, approach behavior
and action selection.
In the clinical literature, many psychiatric disorders are known to involve different types of reward-processing bias. Presented here is a subset of these neuropsychiatric conditions, including Parkinson's, addiction, Alzheimer's, chronic pain, dementia, and ADHD.
From the evolutionary psychiatry point of
view, we can also view mental disorders as
extreme points in a continuous spectrum of
behaviors and traits developed for various
purposes during evolution.
Somewhat less extreme versions of those traits can actually be beneficial in specific environments. For instance, ADHD-like fast-switching attention can be life-saving in certain environments.
Therefore, modeling these disorder-related reward biases can help us build better AI.
Let's get back to the traditional reinforcement
learning problem, where we have an agent in
a certain state.
It takes an action given its current policy, and the environment reveals the next state and a reward.
In this paper, we are more interested in the reward processing mechanism of the agent, which drives its learning and action selection.
In the reinforcement learning setting, we
are given the reward parameters and the environment,
and we wish to learn an optimal policy.
In the inverse reinforcement learning setting, we are given the reward parameters and the policy, or the behavioral trajectories, and we aim to recover the reward functions of the environment.
In the behavioral modeling setting, on the other hand, we are usually given behavioral trajectories and the environment payoffs, and we wish to infer the reward parameters that drive the agent to its current behavior.
To address all three problem settings, we propose Split Q-Learning.
Split Q-Learning extends Q-Learning by introducing a two-stream reward processing mechanism. It assumes that the reward comes in two streams, positive and negative, and that each stream goes through the learning process independently of the other.
We introduce four tunable parameters for the two streams. First, we have two parameters for the weights on the reward histories; a smaller value means we discount the past more. Then we have two parameters that account for reward perception; a larger value means the reward of that stream is amplified to a certain degree.
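To make this concrete, here is a minimal tabular sketch of the two-stream update, in the spirit of the description above; the exact update form and the names lam_pos, lam_neg, w_pos, and w_neg are our paraphrase of the four parameters, not verbatim from the paper.

```python
import numpy as np

def split_q_update(Q_pos, Q_neg, s, a, r, s_next,
                   alpha=0.1, gamma=0.95,
                   lam_pos=1.0, lam_neg=1.0, w_pos=1.0, w_neg=1.0):
    # Split the scalar reward into its positive and negative streams.
    r_pos, r_neg = max(r, 0.0), min(r, 0.0)
    # Each stream learns independently: lam_* weights the reward history
    # (smaller = discount the past more), w_* scales how strongly the
    # reward of that stream is perceived.
    Q_pos[s, a] = lam_pos * Q_pos[s, a] + alpha * (
        w_pos * r_pos + gamma * np.max(Q_pos[s_next]) - lam_pos * Q_pos[s, a])
    Q_neg[s, a] = lam_neg * Q_neg[s, a] + alpha * (
        w_neg * r_neg + gamma * np.max(Q_neg[s_next]) - lam_neg * Q_neg[s, a])

def select_action(Q_pos, Q_neg, s):
    # Actions are chosen greedily on the merged estimate Q+ + Q-.
    return int(np.argmax(Q_pos[s] + Q_neg[s]))
```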
Given this setup, we can simulate different neuropsychiatric conditions with our model, based on the clinical literature. For instance, the chronic pain agent puts a larger weight on the entire negative stream than on the positive stream. The addiction agent forgets negative history faster than the other agents. And Parkinson's patients' behaviors are associated with a magnified perception of negative reward, or pain.
Besides these mental variants, we also have a moderate version that we call M, a standard version, and two single-stream variants called Positive Split QL and Negative Split QL.
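As a rough illustration, such variants can be encoded as settings of the four parameters; the values below are toy choices of ours for demonstration, not the calibrated parameters from the paper or the clinical literature.

```python
# Toy parameter settings for a few agent variants (illustrative values only).
AGENTS = {
    "standard":     dict(lam_pos=1.0, lam_neg=1.0, w_pos=1.0, w_neg=1.0),
    "chronic_pain": dict(lam_pos=0.5, lam_neg=1.0, w_pos=0.5, w_neg=2.0),  # negative stream dominates
    "addiction":    dict(lam_pos=1.0, lam_neg=0.3, w_pos=1.0, w_neg=1.0),  # forgets negative history fast
    "parkinsons":   dict(lam_pos=1.0, lam_neg=1.0, w_pos=1.0, w_neg=2.0),  # magnified negative perception
    "positive_ql":  dict(lam_pos=1.0, lam_neg=0.0, w_pos=1.0, w_neg=0.0),  # single-stream variant
    "negative_ql":  dict(lam_pos=0.0, lam_neg=1.0, w_pos=0.0, w_neg=1.0),  # single-stream variant
}
```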
First, we tested our approach on a simulated Markov Decision Process with non-Gaussian reward distributions. For instance, we have five states. From the initial state A, we can choose to go left or right, and from there, the next action samples its reward from a certain distribution. Presented here is an example scenario where the two reward distributions are bimodal. As shown, the orange arm, that is, the right arm, has the better expected reward.
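To give a feel for such non-Gaussian rewards, here is a toy version of the two arms; the mixture components are hypothetical numbers of ours, not the ones from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_reward(arm):
    # Toy bimodal rewards: each arm is a two-component Gaussian mixture.
    if arm == "right":            # the better (orange) arm: mean 7
        mu = rng.choice([4.0, 10.0])
    else:                         # the left arm: a tempting high mode, lower mean 3
        mu = rng.choice([-6.0, 12.0])
    return rng.normal(mu, 1.0)
```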
Over 500 iterations, we observe that our Split Q-Learning outperforms Q-Learning and Double Q-Learning, both in the speed of converging to the right arm and in the final accumulated reward.
The second row gives a closer look at what is going on in the Q tables of the two streams, where we can see that the split mechanism offers a robust regularization of the Q-value estimation.
We also tested on the Iowa Gambling Task, a very popular psychology experiment for modeling human decision making.
The agent gets to pick from these four decks
of cards, trying to maximize its final reward.
Computationally, it is a multi-armed bandit task with four arms, where two of the arms are better, giving a higher expected payoff. As the reward distribution plot here shows, the reward distributions are highly non-Gaussian, a regime where Q-Learning usually underperforms.
In this paper, we compared the two most common payoff schemes for the task, IGT schemes 1 and 2.
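For a flavor of the setup, here is a minimal bandit form of Split QL on an IGT-like task; the deck payoffs below are stand-in numbers, not the actual values of scheme 1 or 2.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_payoff(deck):
    # Toy IGT-like decks: decks 0 and 1 give big gains but bigger occasional
    # losses; decks 2 and 3 give small gains and small losses (the better decks).
    gain = [100, 100, 50, 50][deck]
    loss = [-250, -250, -50, -50][deck] if rng.random() < 0.5 else 0
    return gain, loss  # the positive and negative streams arrive separately

Q_pos, Q_neg = np.zeros(4), np.zeros(4)
alpha, lam_p, lam_n, w_p, w_n = 0.1, 1.0, 1.0, 1.0, 1.0

for t in range(500):
    # Epsilon-greedy choice on the merged estimate Q+ + Q-.
    a = int(np.argmax(Q_pos + Q_neg)) if rng.random() > 0.1 else int(rng.integers(4))
    r_pos, r_neg = draw_payoff(a)
    # Bandit case of the split update (single state, so no bootstrap term).
    Q_pos[a] = lam_p * Q_pos[a] + alpha * (w_p * r_pos - lam_p * Q_pos[a])
    Q_neg[a] = lam_n * Q_neg[a] + alpha * (w_n * r_neg - lam_n * Q_neg[a])
```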
We soon see that the Split QL backbone does a great job of distinguishing the mental agents from one another.
The short-term dynamics here, shown as the percentage of actions choosing the better decks, reveal interesting trajectories. For instance, we see that the addiction agent picked up the better decks most rapidly, but as time went on, it was caught up by the others. This is because it does not carry much memory of negative rewards and thus tends to stick to short-term payoffs.
The t-SNE plots of the behavioral trajectories also show that these agents form distinct clusters according to their reward biases.
We also tested our approach on the PacMan game in a nonstationary setting. For the positive and negative streams, we introduced three independent nonstationarities. In reward muting, we randomly mute one, both, or neither of the streams. For instance, if the positive stream is muted, eating a PacDot yields a reward of zero.
In reward scaling, we randomly amplify one, both, or neither of the streams. For instance, if the positive stream is scaled, each reward is 100 times larger than before. In reward flipping, we randomly flip one, both, or neither of the streams. For instance, if the positive stream is flipped, each reward becomes the negative of that reward.
To simulate a lifelong learning setting, we
have this stochastic change happen every n
rounds.
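Here is a small sketch of these three perturbations as just described; the function and variable names are ours.

```python
import random

def perturb(r_pos, r_neg, mode, which):
    # Apply one nonstationarity ('mute', 'scale', or 'flip') to the
    # selected stream(s): 'pos', 'neg', 'both', or 'none'.
    hit = lambda stream: which in (stream, "both")
    if mode == "mute":     # a muted stream yields zero reward
        r_pos = 0 if hit("pos") else r_pos
        r_neg = 0 if hit("neg") else r_neg
    elif mode == "scale":  # a scaled stream is amplified 100x
        r_pos = 100 * r_pos if hit("pos") else r_pos
        r_neg = 100 * r_neg if hit("neg") else r_neg
    elif mode == "flip":   # a flipped stream changes sign
        r_pos = -r_pos if hit("pos") else r_pos
        r_neg = -r_neg if hit("neg") else r_neg
    return r_pos, r_neg

# Lifelong-learning setting: redraw the affected stream(s) every n rounds.
which = random.choice(["pos", "neg", "both", "none"])
```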
Here, the orange curve is our Split QL agent. The average score over 1000 iterations demonstrates that Split QL is very competitive in the stationary setting and beats the baselines in almost all nonstationary settings.
If we look more closely at the reward-flipping scenario, we also validate our evolutionary psychiatry intuition: several mental variants of Split QL perform even better than standard Split QL and all the baselines. For instance, the ADHD agent, thanks to its fast-switching bias, adapts quite well to these conditions.
Here are some mental agents in action after
1000 iterations of training.
We can see more interesting behaviors going
on.
For instance, the chronic pain agent doesn't seem to care much about the reward, so it only moves when an imminent threat is present, to avoid being eaten by a ghost. The addiction agent, on the other hand, only cares about eating as many dots as possible, even if that means following dangerously close behind a ghost.
There are many interesting directions from here. We aim to further investigate the optimal reward bias under different criteria, such as surviving the longest or achieving the highest score. We are also interested in seeing multiple reward-biased agents interacting in the same arena. And we are learning the reward-bias parameters from real patient data, in order to better tune our models for clinical applications.
Last but not least, we are going deep, testing the two-stream mechanism in deep Q-networks.
Thank you for your interest.
If you have any questions, feel free to reach out to us. The full paper can be accessed on arXiv, and the code is available in our GitHub repository.
