Acknowledgments: What I write here is far from comprehensive, but I present these words as a sample of the diverse relationships and experiences that cumulatively provoke immense gratitude and a nostalgia for the present.
As computational agents are increasingly used beyond research labs, their success will depend on their ability to learn new skills and adapt to their dynamic, complex environments. If human users without programming skills can transfer their task knowledge to agents, learning can accelerate dramatically, reducing costly trials. The TAMER framework guides the design of agents whose behavior can be shaped through signals of approval and disapproval, a natural form of human feedback. More recently, TAMER+RL was introduced to enable human feedback to augment a traditional reinforcement learning (RL) agent that learns from a Markov decision process's (MDP) reward signal. We address limitations of prior work on TAMER and TAMER+RL, contributing in two critical directions. First, the four successful techniques for combining human reward with RL from prior TAMER+RL work are tested on a second task, and these techniques' sensitivities to parameter changes are analyzed. Together, these examinations yield more general and prescriptive conclusions to guide others who wish to incorporate human knowledge into an RL algorithm. Second, TAMER+RL has thus far been limited to a sequential setting, in which training occurs before learning from MDP reward. In this paper, we introduce a novel algorithm that shares the same spirit as TAMER+RL but learns simultaneously from both reward sources, enabling the human feedback to come at any time during the reinforcement learning process. We call this algorithm simultaneous TAMER+RL. To enable simultaneous learning, we introduce a new technique that appropriately determines the magnitude of the human model's influence on the RL algorithm throughout time and state-action space.
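As a minimal sketch of the simultaneous-learning idea described above (and not the paper's algorithm), the following Python code biases Q-learning's action selection with a learned human-reward model through a per-state-action influence weight that is raised by recent feedback and decays as autonomous learning proceeds. The class, the weight table w, the coefficient beta, and the decay schedule are illustrative assumptions about how the human model's influence might be controlled over time and state-action space.

# Sketch: Q-learning whose action selection is biased by a learned human-reward
# model H_hat, with an influence weight per (state, action) that is raised by
# recent human feedback and decays as MDP-driven learning proceeds (assumed mechanism).
import random
from collections import defaultdict

class SimultaneousTamerRLSketch:
    def __init__(self, actions, alpha=0.1, gamma=0.99, beta=1.0, decay=0.999):
        self.actions = list(actions)
        self.Q = defaultdict(float)   # MDP action-value estimates
        self.H = defaultdict(float)   # H_hat(s, a): model of human reward
        self.w = defaultdict(float)   # influence weight per (s, a)
        self.alpha, self.gamma, self.beta, self.decay = alpha, gamma, beta, decay

    def act(self, s, epsilon=0.1):
        # Action biasing: add the weighted predicted human reward to Q before the argmax.
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions,
                   key=lambda a: self.Q[(s, a)] + self.beta * self.w[(s, a)] * self.H[(s, a)])

    def human_feedback(self, s, a, h, lr=0.2):
        # Supervised update of H_hat toward the trainer's signal; feedback also
        # restores this state-action pair's influence weight.
        self.H[(s, a)] += lr * (h - self.H[(s, a)])
        self.w[(s, a)] = 1.0

    def rl_update(self, s, a, r, s_next):
        # Standard Q-learning on MDP reward; the human model's local influence decays.
        target = r + self.gamma * max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])
        self.w[(s, a)] *= self.decay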
We present a case study of applying a framework for learning from numeric human feedback, TAMER, to a physically embodied robot. In doing so, we also provide the first demonstration of the ability to train multiple behaviors by such feedback without algorithmic modifications and of a robot learning from free-form human-generated feedback without any further guidance or evaluative feedback. We describe transparency challenges specific to a physically embodied robot learning from human feedback and adjustments that address these challenges.
As computational learning agents move into domains that incur real costs (e.g., autonomous driving or financial investment), it will be necessary to learn good policies without numerous high-cost learning trials. One promising approach to reducing the sample complexity of learning a task is knowledge transfer from humans to agents. Ideally, methods of transfer should be accessible to anyone with task knowledge, regardless of that person's expertise in programming and AI. This paper focuses on allowing a human trainer to interactively shape an agent's policy via reinforcement signals. Specifically, the paper introduces "Training an Agent Manually via Evaluative Reinforcement," or TAMER, a framework that enables such shaping. Differing from previous approaches to interactive shaping, a TAMER agent models the human's reinforcement and exploits its model by choosing actions expected to be most highly reinforced. Results from two domains demonstrate that lay users can train TAMER agents without defining an environmental reward function (as in an MDP) and indicate that human training within the TAMER framework can reduce sample complexity over autonomous learning algorithms.
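The core mechanism described above, modeling the human's reinforcement and acting greedily with respect to that model, can be sketched in a few lines of Python. This is a simplified illustration (tabular, with no credit assignment over delayed feedback); the class and parameter names are assumptions for exposition, not the paper's implementation.

from collections import defaultdict

class TamerAgentSketch:
    def __init__(self, actions, step_size=0.2):
        self.actions = list(actions)
        self.H = defaultdict(float)   # H_hat(s, a): predicted human reinforcement
        self.step_size = step_size

    def act(self, s):
        # Exploit the model: choose the action expected to be most highly
        # reinforced (no environmental reward, no discounting of the future).
        return max(self.actions, key=lambda a: self.H[(s, a)])

    def update(self, s, a, h):
        # Supervised step toward the trainer's scalar feedback h for (s, a).
        self.H[(s, a)] += self.step_size * (h - self.H[(s, a)])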
As learning agents move from research labs to the real world, it is increasingly important that human users, including those without programming skills, be able to teach agents desired behaviors. Recently, the TAMER framework was introduced for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals. Past work on TAMER showed that shaping can greatly reduce the sample complexity required to learn a good policy, can enable lay users to teach agents the behaviors they desire, and can allow agents to learn within a Markov Decision Process (MDP) in the absence of a coded reward function. However, TAMER does not allow this human training to be combined with autonomous learning based on such a coded reward function. This paper leverages the fast learning exhibited within the TAMER framework to hasten a reinforcement learning (RL) algorithm's climb up the learning curve, effectively demonstrating that human reinforcement and MDP reward can be used in conjunction with one another by an autonomous agent. We tested eight plausible TAMER+RL methods for combining a previously learned human reinforcement function, Ĥ, with MDP reward in a reinforcement learning algorithm. This paper identifies which of these methods are most effective and analyzes their strengths and weaknesses. Results from these TAMER+RL algorithms indicate better final performance and better cumulative performance than either a TAMER agent or an RL agent alone.
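As a hedged illustration of the general combination idea rather than any specific method tested in the paper, the Python sketch below shows two common ways a previously learned Ĥ could be folded into an RL algorithm: shaping the reward signal and biasing action selection. The weight beta and its annealing schedule are assumptions made for the example.

def shaped_reward(r_mdp, h_hat_sa, beta):
    """Reward shaping: hand the RL algorithm r' = r + beta * H_hat(s, a)."""
    return r_mdp + beta * h_hat_sa

def biased_greedy_action(q, h_hat, s, actions, beta):
    """Action biasing: H_hat influences action selection but not the value update."""
    return max(actions, key=lambda a: q[(s, a)] + beta * h_hat[(s, a)])

def annealed_beta(beta0, decay, step):
    # Assumed schedule: let the human model dominate early and fade with experience.
    return beta0 * (decay ** step)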
Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic (i.e., conducted in unconnected episodes of activity that often end in either goal or failure states) or continuing (i.e., indefinitely ongoing). Another point of difference is whether the learning agent highly discounts the value of future reward (a myopic agent) or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation. This study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic.
Several studies have demonstrated that teaching agents by human-generated reward can be a powerful technique. However, the algorithmic space for learning from human reward has hitherto not been explored systematically. Using model-based reinforcement learning from human reward in goal-based, episodic tasks, we investigate how anticipated future rewards should be discounted to create behavior that performs well on the task that the human trainer intends to teach. We identify a "positive circuits" problem with low discounting (i.e., high discount factors) that arises from an observed bias among humans towards giving positive reward. Empirical analyses indicate that high discounting (i.e., low discount factors) of human reward is necessary in goal-based, episodic tasks and lend credence to the existence of the positive circuits problem.
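A small worked example, with hypothetical reward numbers chosen only for illustration, shows the positive-circuits concern: when a trainer's feedback is mostly positive, a cycle of mildly rewarded actions can accumulate more discounted return than heading to the goal unless future reward is heavily discounted.

def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Hypothetical predicted human rewards:
goal_path = [0.5, 0.5, 1.0]   # three steps that reach the goal
circuit = [0.3] * 200         # circling through a small, positively rewarded loop

for gamma in (0.99, 0.1):     # low discounting vs. high discounting
    print(gamma,
          round(discounted_return(goal_path, gamma), 2),
          round(discounted_return(circuit, gamma), 2))
# With gamma = 0.99, the circuit's return (about 26) dwarfs the goal path's
# (about 2), so a return-maximizing agent would keep circling; with gamma = 0.1,
# the goal path (about 0.56) beats the circuit (about 0.33).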