

Training a Robot via Human Feedback: A Case Study

W. Bradley Knox1, Peter Stone2, and Cynthia Breazeal1
1 Massachusetts Institute of Technology Media Lab, 20 Ames Street, Cambridge, MA USA
2 University of Texas at Austin, Dept. of Computer Science, Austin, TX USA

Abstract. We present a case study of applying a framework for learning from numeric human feedback—TAMER—to a physically embodied robot. In doing so, we also provide the first demonstration of the ability to train multiple behaviors by such feedback without algorithmic modifications and of a robot learning from free-form human-generated feedback without any further guidance or evaluative feedback. We describe transparency challenges specific to a physically embodied robot learning from human feedback and adjustments that address these challenges.

1 Introduction

As robots increasingly collaborate with people and otherwise operate in their vicinity, it will be crucial to develop methods that allow technically unskilled users to teach and customize behavior to their liking. In this paper we focus on teaching a robot by feedback signals of approval and disapproval generated by live human trainers, the technique of interactive shaping. These signals map to numeric values, which we call "human reward". In comparison to learning from demonstration [1], teaching by such feedback has several potential advantages. An agent can display its learned behavior while being taught by feedback, increasing responsiveness to teaching and ensuring that teaching is focused on the task states experienced when the agent behaves according to its learned policy. A feedback interface can be independent of the task domain. And we speculate that feedback requires less expertise than control and places less cognitive load on the trainer. Further, a reward signal is relatively simple in comparison to a control signal; given this simplicity, teaching by human-generated reward is a promising technique for improving the effectiveness of low-bandwidth myoelectric and EEG-based interfaces, which are being developed to enable handicapped users to control various robotic devices.

Fig. 1. A training session with the MDS robot Nexi. The artifact used for trainer interaction can be seen on the floor immediately behind Nexi, and the trainer holds a presentation remote by which reward is delivered.

In this paper, we reductively study the problem of learning from live human feedback in isolation, without the added advantages of learning from demonstration or similar methods, to increase our ability to draw insight about this specific style of teaching and learning. This paper demonstrates for the first time that TAMER [6]—a framework for learning from human reward—can be successfully applied on a physically embodied robot. We detail our application of TAMER to enable the training of interactive navigation behaviors on the Mobile-Dexterous-Social (MDS) robot "Nexi". Figure 1 shows a snapshot of a training session. In this domain, Nexi senses the relative location of an artifact that the trainer can move, and the robot chooses at intervals to turn left or right, to move forward, or to stay still. Specifically, the choice is dependent on the artifact's relative location and is made according to what the robot has learned from the feedback signals provided by the human trainer. The artifact can be moved by the human, permitting the task domain itself—not just training—to be interactive.
The evaluation in this paper is limited with respect to who trains the agent (the first author). However, multiple target behaviors are trained, giving the evaluation a different dimension of breadth than previous TAMER experiments, which focused on the speed or effectiveness of training to maximize a predetermined performance metric [9,11,8]. Though a few past projects have considered this problem of learning from human reward [4,21,20,16,18,13,9], only two of these implemented their solution for a robotic agent. In one such project [13], the agent learned partially in simulation and from hardcoded reward, demonstrations, and human reward. In another [18], the human trainer, an author of that study, followed a predetermined algorithm of giving positive reward for desired actions and negative reward otherwise. This paper describes the first successful teaching of a robot purely by free-form human reward.

One contribution of this paper is the description of how a system for learning from human reward—TAMER—was applied to a physically embodied robot. A second contribution is explicitly demonstrating that different behaviors can be trained by changing only the reward provided to the agent (and the trainer's interaction with its environment). Isbell et al. [4] showed the potential for such personalization by human reward in a virtual online environment, but it has not previously been demonstrated for robots or for TAMER.

2 Background on TAMER

TAMER (for Training an Agent Manually via Evaluative Reinforcement) is a solution to the problem of how an agent can learn to perform a sequential task given only real-valued feedback on its behavior from a human trainer. This problem is defined formally by Knox [6]. The human feedback—"human reward"—is delivered through push buttons, spoken word, or any other easy-to-learn interface. The human's feedback is the only source of feedback or evaluation that the agent receives. However, TAMER and other methods for learning from human reward can be useful even when other evaluative information is available, as has been shown previously [21,5,17,11]. The TAMER algorithm described below has additionally been extended to learn in continuous action spaces through an actor-critic algorithm [22] and to provide additional information to the trainer—either action confidence or summaries of past performance—creating changes in the quantity of reward instances given and in learned performance [14].

Motivation and philosophy of TAMER

The TAMER framework is designed around two insights. First, when a human trainer evaluates some behavior, she considers the long-term impact of that behavior, so her feedback signal contains her full judgment of the desirability of the targeted behavior. Second, a human trainer's feedback is only delayed by how long it takes to make and then communicate an evaluation. TAMER assumes that trainers' feedback is focused on recent behavior; as a consequence, human reward is considered a trivially delayed, full judgment of the desirability of behavior. Following these insights and TAMER's assumption of behavior-focused feedback, TAMER avoids the credit assignment problem inherent in reinforcement learning. It instead treats human reward as fully informative about the quality of recent actions from their corresponding states.
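To make this learning problem concrete before describing TAMER's mechanics, the following Python sketch shows one possible shape of such an interaction loop: the agent acts, the trainer occasionally presses a button mapped to +1 or −1, and that scalar signal is the only thing the learner ever receives. The interfaces (env, reward_model, feedback_device) and the simplification of crediting feedback to the step during which it arrives are our own illustrative assumptions, not part of the TAMER framework; TAMER's actual delay-weighted credit assignment is described in the next subsection.

    import time

    def run_training_session(env, reward_model, feedback_device, actions, step_seconds=1.5):
        """Skeleton of an interactive-shaping loop (illustrative, not TAMER's code).

        The only learning signal is the numeric human reward collected from
        feedback_device; there is no task reward and no demonstration.
        """
        while feedback_device.training_enabled():
            state = env.observe()
            # Greedily exploit the current model of human reward.
            action = max(actions, key=lambda a: reward_model.predict(state, a))
            env.execute(action)
            time.sleep(step_seconds)            # let the action play out
            h = feedback_device.poll_reward()   # sum of +1/-1 presses this step
            if h != 0:
                # Credit the feedback to the pair that was just executing
                # (a simplification of TAMER's delay-weighted credit assignment).
                reward_model.update(state, action, h)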
Mechanics of TAMER

The TAMER framework consists of three modules, as illustrated in Figure 2: credit assignment to create labels from delayed reward signals for training samples; supervised learning from those samples to model human reward; and action selection using the human reward model. The three modules are described below.

TAMER models a hypothetical human reward function, RH : S × A → R, that predicts numeric reward based on the current state and action values (and thus is Markovian). This modeling, corresponding to the "supervised learner" box in Figure 2, uses a regression algorithm chosen by the agent designer; we call the model R̂H. Learning samples for modeling are constructed from experienced state-action pairs and the real-valued human reward credited to each pair, as outlined below.

The TAMER algorithm used in this paper (the "full" algorithm with "delay-weighted aggregate reward", described in detail by Knox [6]) addresses the small delay in providing feedback by spreading each human reward signal among multiple recent state-action pairs, contributing to the label of each pair's resultant sample for learning R̂H. These samples, each with a state-action pair (s, a) as input and a post-assignment reward label ĥ as the output, are shown as the product of the "credit assigner" box in Figure 2. Each sample's share of a reward signal is calculated from an estimated probability density function for the delay in reward delivery, fdelay.

Fig. 2. An information-flow diagram illustrating the TAMER framework.

To choose actions at some state s (the "action selector" box of Figure 2), a TAMER agent directly exploits the learned model R̂H and its predictions of expected reward. When acting greedily, a TAMER agent chooses the action a = argmax_a R̂H(s, a). This is equivalent to performing reinforcement learning with a discount factor of 0, where reward acquired from future actions is not considered in action selection (i.e., action selection is myopic). In practice, almost all TAMER agents thus far have been greedy, since the trainer can punish the agent to make it try something different, reducing the need for other forms of exploration.
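As a rough illustration of the two computations just described, the Python sketch below computes one time step's share of a delayed reward signal under an assumed uniform delay density and selects actions greedily from a reward model. The class and function names are ours, and the windowed-overlap calculation is a simplified stand-in for the full delay-weighted aggregate reward procedure detailed by Knox [6].

    class UniformDelayCredit:
        """Credit assignment under an assumed Uniform(delay_min, delay_max)
        density over how long feedback lags the behavior it targets."""

        def __init__(self, delay_min=0.2, delay_max=0.8):
            self.delay_min = delay_min   # seconds; illustrative values
            self.delay_max = delay_max

        def credit(self, reward_time, step_start, step_end):
            """Fraction of a reward signal credited to the time step
            [step_start, step_end]: the mass of the delay density that falls
            inside that step when measured backward from reward_time."""
            earliest = reward_time - self.delay_max
            latest = reward_time - self.delay_min
            overlap = max(0.0, min(step_end, latest) - max(step_start, earliest))
            return overlap / (self.delay_max - self.delay_min)


    def greedy_action(reward_model, state, actions):
        """Myopic TAMER-style action selection: argmax over predicted reward."""
        return max(actions, key=lambda a: reward_model.predict(state, a))

Under these assumptions, a reward of +1 delivered at time t would add credit(t, step_start, step_end) to the label of each recent sample whose time step overlaps the window [t − 0.8, t − 0.2].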
Putting TAMER in context

Although reinforcement learning was inspired by models of animal learning [19], it has seldom been applied to reward created by non-expert humans. We and others concerned with the problem of learning from human reward (sometimes called interactive shaping) seek to understand how reinforcement learning can be adapted to learn from reward generated by a live human trainer, a goal that may be critical to the usability of reinforcement learning by non-experts. TAMER, along with other work on interactive shaping, makes progress towards a second major form of teaching, one that will complement but not supplant learning from demonstration (LfD). In contrast to LfD, interactive shaping is a young approach. A recent survey of LfD for robots cites more than 100 papers [1]; this paper describes the second project to involve training of robots exclusively from human reward (and the first from purely free-form reward).

In comparison to past methods for learning from human reward, TAMER differs in three important ways: (1) TAMER addresses delays in human evaluation through credit assignment, (2) TAMER learns a model of human reward (R̂H), and (3) at each time step, TAMER myopically chooses the action that is predicted to directly elicit the maximum reward (argmax_a R̂H(s, a)), eschewing consideration of the action's effect on future state. Accordingly, other algorithms for learning from human reward [4,21,20,16,18,13] do not directly account for delay, do not model human reward explicitly, and are not fully myopic (i.e., they employ discount factors greater than 0). However, nearly all previous approaches for learning from human-generated reward are relatively myopic, with abnormally high rates of discounting. Myopia creates certain limitations, including the need for the trainer to communicate what behavior is correct in any context (e.g., going left at a certain corner); a non-myopic algorithm instead would permit communication of correct outcomes (e.g., reaching a goal or failure state), lessening the communication load on trainers (while ideally still allowing behavior-based feedback, which people seem inclined to give). However, the myopic trend in past work was only recently identified and justified by Knox and Stone [10], who built upon this understanding to create the first successful algorithm to learn non-myopically from human reward [12]. Along with their success in a 30-state grid world, they also showed that their non-myopic approach needs further human-motivated improvements to scale to more complex tasks. Complementing this continuing research into non-myopic approaches, this paper focuses on applying an established and widely successful myopic approach to a robotic task, showing that TAMER can be used flexibly to teach a range of behaviors and drawing lessons from its application.

TAMER has been implemented successfully in a number of simulation domains commonly used in reinforcement learning research: mountain car [9], balancing cart pole [11], Tetris [9], 3 vs. 2 keep-away soccer [17], and a grid-world task [10]. In comparison, other interactive-shaping approaches have been applied in at most two domains.

3 The MDS Robot Nexi

A main contribution of this paper is the application of TAMER to a physical robot, shown in Figure 3a. The Mobile-Dexterous-Social robot platform is designed for research at the intersection of mobility, manipulation, and human-robot interaction [2]. The mobile base of the MDS platform has 2 degrees of freedom, with two powered wheels and one unpowered, stability-adding wheel. The robot estimates its environmental state through a Vicon Motion Capture system that determines the 3-dimensional locations and orientations of the robot and the training artifact; in estimating its own position and orientation, the robot employs both the Vicon data and information from its wheel encoders. In addition to the Vicon system, the robot has a number of other sensing capabilities that are not employed in this work.

4 TAMER Algorithm for Interactive Robot Navigation

We implemented the full TAMER algorithm as described generally in Section 2 and in detail by Knox [6], using the delay-weighted aggregate reward credit assignment system described therein. From the robot's estimation of the position and orientation of itself and the training artifact, two features are extracted and used as input to R̂H along with the action.
The first feature is the distance in meters from the robot to the training artifact, and the second is the angle in radians from the robot's position and orientation to the artifact. Figure 3b shows these state features and the four possible actions: turn left, turn right, move forward, or stay still.

Fig. 3. (a) The MDS robot Nexi. (b) Nexi's action and state spaces, as presented to TAMER.

In implementing a version of TAMER that learns interactive navigational behaviors, we specified the following components. R̂H is modeled by the k-nearest neighbors algorithm; more detail is given later in this section. The training interface is a presentation remote that can be held in the trainer's hand. Two buttons map to positive and negative reward, giving values of +1 and −1 respectively. Also, an additional button on the remote toggles the training mode on and off. When toggled on, TAMER chooses actions and learns from feedback on those actions; when off, TAMER does not learn further but does demonstrate learned behavior (see Knox [6] for details about toggling training). Another button both turns training off and forces the robot to stay still. This safety function is intended to avoid collisions with objects in the environment.

The probability density function fdelay, which is used by TAMER's credit assignment module and describes the probability of a certain delay in feedback from the state-action pair it targets, is a Uniform(-0.8 seconds, -0.2 seconds) distribution, as has been employed in past work [6]. (Two minor credit-assignment parameters are not explained here but are nonetheless part of the full TAMER algorithm; for this instantiation, these are ε_p = 1 and c_min = 0.5.) The duration of time steps varies by the action chosen (for reasons discussed in Section 5). Moving forward and staying each last 1.5 seconds; turns occur for 2.5 seconds. When moving forward, Nexi attempts to move at 0.075 meters per second, and Nexi seeks to turn at 0.15 radians per second. Since changes in intended velocity—translational or rotational—require a period of acceleration, the degree of movement during a time step was affected by whether the same action had occurred in the previous time step.

R̂H is modeled using k-nearest neighbors with a separate sub-model per action (i.e., there is no generalization between actions), as shown in Algorithm 1. The number of neighbors k is dynamically set to the floor of the square root of the number of samples gathered for the corresponding action, growing k with the sample size to counteract the lessening generalization caused by an increase in samples and to reduce the impact of any one experienced state-action pair, reducing potential erratic behavior caused by mistaken feedback. The distance metric is the Euclidean distance given the 2-dimensional feature vectors of the queried state and the neighboring sample's state. In calculating the distance, each vector element v is normalized within [0, 1] by (v − v_min)/(v_max − v_min), where v_max and v_min are respectively the maximum and minimum values observed across training samples in the dimension of v.

Algorithm 1 Inference by k-Nearest Neighbors
Given: Euclidean distance function d over state features
Input: query q with state q.s and action q.a, and a set of samples M_a for each action a; each sample has state features, an action, and a reward label ĥ
1: k ← floor(sqrt(|M_q.a|))
2: if k = 0 then
3:   R̂H(q.s, q.a) ← 0
4: else
5:   knn ← ∅
6:   preds_sum ← 0
7:   for i = 1 to k do
8:     nn ← argmin_{m ∈ M_q.a \ knn} d(m, q)
9:     knn ← knn ∪ {nn}
10:    dist ← d(nn, q)
11:    prediction_i ← nn.ĥ × max(1 − (dist/2), 1/(1 + (5 × dist)))
12:    preds_sum ← preds_sum + prediction_i
13:  end for
14:  R̂H(q.s, q.a) ← preds_sum / k
15: end if
16: return R̂H(q.s, q.a)

To help prevent one or a few highly negative rewards during early learning from making Nexi avoid the targeted action completely, we bias R̂H toward values of zero. This biasing is achieved by reducing the value of each neighbor by a factor determined by its distance d from the queried state, with larger distances resulting in larger reductions. The bias factor is calculated as the maximum of linear and hyperbolic decay functions, as shown in line 11 of Algorithm 1.

Lastly, when multiple actions have the same predicted reward from the current state, such ties are broken by repeating the previous action. This approach lessens the number of action changes, which is intended to reduce early feedback error caused by ambiguously timed changes in actions (as discussed in Section 5). Accordingly, at the first time step, during which all actions are tied with a value of 0, a random action is chosen and repeated until non-zero feedback is given.
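To make Algorithm 1 concrete, the following Python sketch mirrors its inference step for one action's sub-model, assuming the state features have already been normalized to [0, 1]; the class and variable names are illustrative and not taken from the authors' implementation.

    import numpy as np

    class KnnRewardModel:
        """Per-action k-nearest-neighbor model of human reward (after Algorithm 1).

        samples[a] holds (features, h) pairs, where features are the normalized
        (distance, angle) state features and h is the credited human reward.
        """

        def __init__(self, actions):
            self.samples = {a: [] for a in actions}

        def update(self, state_features, action, credited_reward):
            self.samples[action].append(
                (np.asarray(state_features, dtype=float), credited_reward))

        def predict(self, state_features, action):
            """Return the reward prediction, biased toward zero for distant neighbors."""
            M = self.samples[action]
            k = int(np.sqrt(len(M)))                 # line 1: k = floor(sqrt(|M_a|))
            if k == 0:
                return 0.0                           # lines 2-3: no samples yet
            q = np.asarray(state_features, dtype=float)
            dists = np.array([np.linalg.norm(f - q) for f, _ in M])
            preds_sum = 0.0
            for i in np.argsort(dists)[:k]:          # lines 7-13: k nearest neighbors
                dist = dists[i]
                h = M[i][1]
                # line 11: bias factor = max of linear and hyperbolic decay
                preds_sum += h * max(1.0 - dist / 2.0, 1.0 / (1.0 + 5.0 * dist))
            return preds_sum / k                     # line 14

Tie-breaking between actions with equal predictions (repeating the previous action) is left to the action-selection code in this sketch.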
5 Results and Discussion

We now describe the results of training the robot and discuss challenges and lessons provided by implementing TAMER in this domain.

Behaviors taught

Five different behaviors were independently taught by the first author, each of which is illustrated in Figure 4a:

– Go to – The robot turns to face the artifact, then moves forward, and stops before the artifact with little space between the two.
– Keep conversational distance – The robot goes to the artifact and stops at an approximate distance from the training artifact that two people would typically keep between each other during conversation (about 2 feet).
– Look away – The robot should turn away from the artifact, stopping when facing the opposite direction. The robot never moves forward.
– Toy tantrum – When the artifact is near the front of the robot, it does not move (as if the artifact is a toy that is in the robot's possession, satisfying the robot). Otherwise, the robot turns from side to side (as if in a tantrum to get the toy back). The robot never moves forward.
– Magnetic control – When the artifact is behind the robot, it acts as if the artifact repels it. The repulsion is akin to one end of a magnet repelling another magnet that faces it with the same pole. Specifically, when the artifact is near the center of the robot's back, the robot moves forward. If the artifact is behind its left shoulder, it turns right, moving that shoulder forward (and vice versa for the right shoulder). If the artifact is not near the robot's back, the robot does not move.

Videos of the successful training sessions—as well as some earlier, unsuccessful sessions—can be seen at http://bradknox.net/nexi. Figure 4b contains heat maps of the learned reward model for each behavior at the end of successful training.

Fig. 4. (a) Iconic illustrations of the five interactive navigational behaviors that were taught to the MDS robot Nexi (described in Section 5). Each gray square represents a category of state space. The arrow indicates the desired action in such a state; the lack of an arrow corresponds to the stay action. (b) Heat maps showing the reward model that was learned at the end of each successful training session. Nexi is shown by a transparent birds-eye rendering of the robot, with Nexi facing the top of the page. The map colors communicate the value of the reward prediction for taking that action when the artifact is in the corresponding location relative to Nexi. A legend indicating the mapping between colors and prediction values for each behavior is given on the right. The small triangle, if visible, represents the location of the artifact at the end of training and subsequent testing of the behavior. (Note that in all cases the triangle is in a location that should make the robot stay still, a point of equilibrium.)

Adjustments for physical embodiment

All of the videos were recorded during a one-day period of training and refinement of our implementation of the TAMER algorithm, during which we specifically adjusted action durations, the effects of chosen actions, and the communication of the robot's perceptions to the trainer. Nearly all sessions that ended unsuccessfully failed because of issues of transparency, which we addressed before or during this period. These transparency issues were mismatches between the state-action pair currently occurring and what the trainer believed to be occurring. The two main points of confusion and their solutions are described below.

The start and end of actions

As mentioned previously in Section 4, there can be a delay between the robot taking an action (e.g., turn right at 0.15 rad/s) and the robot visibly performing that action. This delay occurs specifically after any change in action. This offset between the robot's and the trainer's understandings of an action's duration (i.e., of a time step) can cause reward to be misattributed to the wrong action. The durations of each action—2.5 seconds for turns and 1.5 seconds otherwise—were chosen to ensure that the robot will carry out an action long enough that its visible duration can be targeted by the trainer.

The state of the training artifact

The position of the artifact, unlike that of the robot, was estimated from only Vicon data. When the artifact moved beyond the range of the infrared Vicon cameras, its position was no longer updated. The most common source of failed training sessions was a lack of awareness by the trainer of this loss of sensing. In response to this issue, an audible alarm was added that fired whenever the artifact could not be located, alerting the trainer that the robot's belief about the artifact was no longer changing.
The transparency issues above are illustrative of the types of challenges that are likely to occur with any physically embodied agent trained by human reward. Such issues are generally absent in simulation. In general, the designer of the learning environment and algorithm should seek to minimize cases in which the trainer gives feedback for a state-action pair that was perceived by the trainer but did not occur from the learning algorithm's perspective, causing misattributed feedback. Likewise, mismatches in the perceived timing of state-action pairs could be problematic in any feedback-based learning system. Such challenges are related to but different from the correspondence problem in learning from demonstration [15,1], which is the problem of how to map from a demonstrator's state and action space to that of the emulating agent. Especially relevant is work by Crick et al. [3], which compares learning from human controllers who see a video feed of the robot's environment to learning from humans whose perceptions are matched to those of the robot, yielding a more limited sensory display. Their sensing-matched demonstrators performed worse at the task yet created learning samples that led to better performance.

Table 1. Training times for each behavior. The middle column shows the cumulative duration of active training time, and the right column shows the duration of the entire session, including time when agent learning is disabled.

Target behavior      Active training time (min.)   Total time (min.)
Go to                27.3                          38.5
Keep conv. dist.     9.5                           11.4
Look away            5.9                           7.9
Toy tantrum          4.7                           6.9
Magnetic control     7.3                           16.4

Training observations

The go to behavior was taught successfully early on, after which the aforementioned state transparency issues temporarily blocked further success. After the out-of-range alarm was added, the remaining four behaviors were taught successfully in consecutive training sessions. Table 1 shows the times required for training each behavior. Note that the latter four behaviors—which differ from go to training in that they were taught using an anecdotally superior strategy (for space considerations, described at http://bradknox.net/nexi), had the alarm, and benefitted from an additional half day of trainer experience—were taught in considerably less time.

6 Conclusion

In this paper, we described an application of TAMER to teach a physically embodied robot five different interactive navigational tasks. The feasibility of training these five behaviors constitutes the first focused demonstration of the possibility of using human reward to flexibly teach multiple robotic behaviors, and of TAMER to do so in any task domain. Further work with numerous trainers and other task domains will be critical to establishing the generality of our findings. Additionally, in preliminary work, we have adapted TAMER to permit feedback on intended actions [7], for which we plan to use Nexi's emotive capabilities to signal intention. One expected advantage of such an approach is that unwanted, even harmful actions can be given negative reward before they occur, allowing the agent to learn what actions to avoid without ever taking them. We are also developing methods for non-myopic learning from human reward, which will permit reward that describes higher-level features of the task (e.g., goals) rather than only correct or incorrect behavior, reducing the training burden on humans and permitting more complex behavior to be taught in shorter training sessions.

Acknowledgments

This work has taken place in the Personal Robots Group (PRG) at MIT and the Learning Agents Research Group (LARG) at UT Austin. PRG research is supported in part by NSF (award 1138986, Collaborative Research: Socially Assistive Robots). LARG research is supported in part by NSF (IIS-0917122), ONR (N00014-09-1-0658), and the FHWA (DTFH61-07-H-00030).
We thank Siggi Örn, Nick DePalma, and Adam Setapen for their generous support in operating the MDS robot.

References

1. B. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
2. C. Breazeal, M. Siegel, M. Berlin, J. Gray, R. Grupen, P. Deegan, J. Weber, K. Narendran, and J. McBean. Mobile, dexterous, social robots for mobile manipulation and human-robot interaction. In SIGGRAPH '08: ACM SIGGRAPH 2008 New Tech Demos, 2008.
3. C. Crick, S. Osentoski, G. Jay, and O. C. Jenkins. Human and robot perception in large-scale learning from demonstration. In Proceedings of the 6th International Conference on Human-Robot Interaction, pages 339–346. ACM, 2011.
4. C. Isbell, M. Kearns, S. Singh, C. Shelton, P. Stone, and D. Kormann. Cobot in LambdaMOO: An adaptive social statistics agent. In Proceedings of the 5th Annual International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2006.
5. W. B. Knox and P. Stone. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2010.
6. W. B. Knox. Learning from Human-Generated Reward. PhD thesis, Department of Computer Science, The University of Texas at Austin, August 2012.
7. W. B. Knox, C. Breazeal, and P. Stone. Learning from feedback on actions past and intended. In Proceedings of the 7th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Late-Breaking Reports Session, March 2012.
8. W. B. Knox, B. D. Glass, B. C. Love, W. T. Maddox, and P. Stone. How humans teach agents: A new experimental perspective. International Journal of Social Robotics, Special Issue on Robot Learning from Demonstration, 4(4):409–421, 2012.
9. W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In The 5th International Conference on Knowledge Capture, September 2009.
10. W. B. Knox and P. Stone. Reinforcement learning from human reward: Discounting in episodic tasks. In 21st IEEE International Symposium on Robot and Human Interactive Communication (Ro-Man), September 2012.
11. W. B. Knox and P. Stone. Reinforcement learning with human and MDP reward. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), June 2012.
12. W. B. Knox and P. Stone. Learning non-myopically from human-generated reward. In International Conference on Intelligent User Interfaces (IUI), March 2013.
13. A. León, E. Morales, L. Altamirano, and J. Ruiz. Teaching a robot to perform task through imitation and on-line feedback. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 549–556, 2011.
14. G. Li, H. Hung, S. Whiteson, and W. B. Knox. Using informative behavior to increase engagement in the TAMER framework. May 2013.
15. C. L. Nehaniv and K. Dautenhahn. The correspondence problem. In Imitation in Animals and Artifacts, page 41, 2002.
16. P. Pilarski, M. Dawson, T. Degris, F. Fahimi, J. Carey, and R. Sutton. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In IEEE International Conference on Rehabilitation Robotics (ICORR), pages 1–7. IEEE, 2011.
17. M. Sridharan. Augmented reinforcement learning for interaction with non-expert humans in agent domains. In Proceedings of the IEEE International Conference on Machine Learning Applications, 2011.
18. H. Suay and S. Chernova. Effect of human guidance and state space size on interactive reinforcement learning. In 20th IEEE International Symposium on Robot and Human Interactive Communication (Ro-Man), pages 1–6, 2011.
19. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
20. A. Tenorio-Gonzalez, E. Morales, and L. Villaseñor-Pineda. Dynamic reward shaping: Training a robot by voice. In Advances in Artificial Intelligence – IBERAMIA, pages 483–492, 2010.
21. A. Thomaz and C. Breazeal. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6-7):716–737, 2008.
22. N. A. Vien and W. Ertel. Reinforcement learning combined with human feedback in continuous state and action spaces. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pages 1–6. IEEE, 2012.