2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021
We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interacti... more We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains 15 fully interactive home-sized scenes with 108 rooms populated with rigid and articulated objects. The scenes are replicas of realworld homes, with distribution and the layout of objects aligned to those of the real world. iGibson 1.0 integrates several key features to facilitate the study of interactive tasks: i) generation of high-quality virtual sensor signals (RGB, depth, segmentation, LiDAR, flow and so on), ii) domain randomization to change the materials of the objects (both visual and physical) and/or their shapes, iii) integrated sampling-based motion planners to generate collision-free trajectories for robot bases and arms, and iv) intuitive human-iGibson interface that enables efficient collection of human demonstrations. Through experiments, we show that the full interactivity of the scenes enables agents to learn useful visual representations that accelerate the training of downstream manipulation tasks. We also show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of human demonstrated (mobile) manipulation behaviors. iGibson 1.0 is open-source, equipped with comprehensive examples and documentation. For more information, visit our project website:
2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021
Humans can robustly follow a visual trajectory defined by a sequence of images (i.e. a video) reg... more Humans can robustly follow a visual trajectory defined by a sequence of images (i.e. a video) regardless of substantial changes in the environment or the presence of obstacles. We aim at endowing similar visual navigation capabilities to mobile robots solely equipped with a RGB fisheye camera. We propose a novel probabilistic visual navigation system that learns to follow a sequence of images with bidirectional visual predictions conditioned on possible navigation velocities. By predicting bidirectionally (from start towards goal and vice versa) our method extends its predictive horizon enabling the robot to go around unseen large obstacles that are not visible in the video trajectory. Learning how to react to obstacles and potential risks in the visual field is achieved by imitating human teleoperators. Since the human teleoperation commands are diverse, we propose a probabilistic representation of trajectories that we can sample to find the safest path. We evaluate our navigation system quantitatively and qualitatively in multiple simulated and real environments and compare to state-of-theart baselines. Our approach outperforms the most recent visual navigation methods with a large margin with regard to goal arrival rate, subgoal coverage rate, and success weighted by path length (SPL). Our method also generalizes to new robot embodiments never used during training.
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
Leveraging multimodal information with recursive Bayesian filters improves performance and robust... more Leveraging multimodal information with recursive Bayesian filters improves performance and robustness of state estimation, as recursive filters can combine different modalities according to their uncertainties. Prior work has studied how to optimally fuse different sensor modalities with analytical state estimation algorithms. However, deriving the dynamics and measurement models along with their noise profile can be difficult or lead to intractable models. Differentiable filters provide a way to learn these models end-to-end while retaining the algorithmic structure of recursive filters. This can be especially helpful when working with sensor modalities that are high dimensional and have very different characteristics. In contact-rich manipulation, we want to combine visual sensing (which gives us global information) with tactile sensing (which gives us local information). In this paper, we study new differentiable filtering architectures to fuse heterogeneous sensor information. As case studies, we evaluate three tasks: two in planar pushing (simulated and real) and one in manipulating a kinematically constrained door (simulated). In extensive evaluations, we find that differentiable filters that leverage crossmodal sensor information reach comparable accuracies to unstructured LSTM models, while presenting interpretability benefits that may be important for safety-critical systems. We also release an open-source library for creating and training differentiable Bayesian filters in PyTorch, which can be found on our project website: https://sites.google.com/view/ multimodalfilter.
The exploration mechanism used by a Deep Reinforcement Learning (RL) agent plays a key role in de... more The exploration mechanism used by a Deep Reinforcement Learning (RL) agent plays a key role in determining its sample efficiency. Thus, improving over random exploration is crucial to solve long-horizon tasks with sparse rewards. We propose to leverage an ensemble of partial solutions as teachers that guide the agent's exploration with action suggestions throughout training. While the setup of learning with teachers has been previously studied, our proposed approach - Actor-Critic with Teacher Ensembles (AC-Teach) - is the first to work with an ensemble of suboptimal teachers that may solve only part of the problem or contradict other each other, forming a unified algorithmic solution that is compatible with a broad range of teacher ensembles. AC-Teach leverages a probabilistic representation of the expected outcome of the teachers' and student's actions to direct exploration, reduce dithering, and adapt to the dynamically changing quality of the learner. We evaluate a v...
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Planning in realistic environments requires searching in large planning spaces. Affordances are a... more Planning in realistic environments requires searching in large planning spaces. Affordances are a powerful concept to simplify this search, because they model what actions can be successful in a given situation. However, the classical notion of affordance is not suitable for long horizon planning because it only informs the robot about the immediate outcome of actions instead of what actions are best for achieving a longterm goal. In this paper, we introduce a new affordance representation that enables the robot to reason about the long-term effects of actions through modeling what actions are afforded in the future, thereby informing the robot the best actions to take next to achieve a task goal. Based on the new representation, we develop a learning-to-plan method, Deep Affordance Foresight (DAF), that learns partial environment models of affordances of parameterized motor skills through trial-and-error. We evaluate DAF on two challenging manipulation domains and show that it can effectively learn to carry out multi-step tasks, share learned affordance representations among different tasks, and learn to plan with high-dimensional image inputs. Additional material at https://sites.google.com/stanford.edu/daf
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
A robotic agent navigating autonomously needs to perceive and track the motion of objects and oth... more A robotic agent navigating autonomously needs to perceive and track the motion of objects and other agents in its surroundings to be able to plan and execute robust and safe trajectories. To be useful for planning and execution, the motion should be perceived in 3D Cartesian space. However, most recent multi-object tracking (MOT) research have focused on tracking people and moving objects in 2D RGB video sequences. This is because searching in 3D is costly, there is a lack of annotated datasets with 3D labels of moving agents and a relative scarcity of data with 3D sensor modalities. In this work we present JRMOT, a novel 3D MOT system that integrates information from RGB images and 3D point clouds to achieve real-time, state-of-the-art tracking performance. Our system incorporates recent advancements in neural network based re-identification as well as 2D and 3D detectors and descriptors. We integrate them into a joint probabilistic dataassociation framework within a multi-modal recursive Kalman architecture to achieve online, real-time 3D MOT. As part of our work, we release the JRDB dataset, a novel large scale 2D+3D dataset and benchmark annotated with over 2 million boxes and 3500 time consistent 2D+3D trajectories across 54 indoor and outdoor scenes. This dataset, that we use to train and test our model, contains over 60 minutes of data including 360 • cylindrical RGB video and 3D pointclouds, in social settings. The presented 3D MOT system demonstrates state-ofthe-art performance against competing methods on the popular 2D tracking KITTI benchmark and serves as a competitive 3D tracking baseline for our dataset and benchmark. Moreover, our tests on our social robot JackRabbot indicate that the system is capable of tracking multiple pedestrians fast and reliably. The ROS code of our tracker is publicly available at https://sites.google.com/view/jrmot
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in hu... more Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in human environments, such as office buildings and homes. While related literature has addressed the co-navigation problem focused on the scalability with the number of pedestrians in open spaces, typical indoor environments present the additional challenge of constrained spaces such as corridors, doorways and crosswalks that limit maneuverability and influence patterns of pedestrian interaction. We present an approach based on reinforcement learning to learn policies capable of dynamic adaptation to the presence of moving pedestrians while navigating between desired locations in constrained environments. The policy network receives guidance from a motion planner that provides waypoints to follow a globally planned trajectory, whereas the reinforcement component handles the local interactions. We explore a compositional principle for multi-layout training and find that policies trained in a small set of geometrically simple layouts successfully generalize to unseen and more complex layouts that exhibit composition of the simple structural elements available during training. Going beyond wall-world like domains, we show transfer of the learned policy to unseen 3D reconstructions of two real environments (market, home). These results support the applicability of the compositional principle to real-world environments and indicate promising usage of agent simulation within reconstructed environments for tasks that involve interaction. https://ai.stanford.edu/∼cdarpino/socialnavconstrained/
2019 International Conference on Robotics and Automation (ICRA), 2019
When operating in unstructured environments such as warehouses, homes, and retail centers, robots... more When operating in unstructured environments such as warehouses, homes, and retail centers, robots are frequently required to interactively search for and retrieve specific objects from cluttered bins, shelves, or tables. Mechanical Search describes the class of tasks where the goal is to locate and extract a known target object. In this paper, we formalize Mechanical Search and study a version where distractor objects are heaped over the target object in a bin. The robot uses an RGBD perception system and control policies to iteratively select, parameterize, and perform one of 3 actions-push, suction, grasp-until the target object is extracted, or either a time limit is exceeded, or no high confidence push or grasp is available. We present a study of 5 algorithmic policies for mechanical search, with 15,000 simulated trials and 300 physical trials for heaps ranging from 10 to 20 objects. Results suggest that success can be achieved in this long-horizon task with algorithmic policies in over 95% of instances and that the number of actions required scales approximately linearly with the size of the heap. Code and supplementary material can be found at http://ai.stanford.edu/mech-search.
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
A video prediction model that generalizes to diverse scenes would enable intelligent agents such ... more A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns highfidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to stateof-the-art models, GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules. Visualization and more details are at https: // sites. google. com/ view/ ghvae .
2019 International Conference on Robotics and Automation (ICRA), 2019
We present a navigation system that combines ideas from hierarchical planning and machine learnin... more We present a navigation system that combines ideas from hierarchical planning and machine learning. The system uses a traditional global planner to compute optimal paths towards a goal, and a deep local trajectory planner and velocity controller to compute motion commands. The latter components of the system adjust the behavior of the robot through attention mechanisms such that it moves towards the goal, avoids obstacles, and respects the space of nearby pedestrians. Both the structure of the proposed deep models and the use of attention mechanisms make the system's execution interpretable. Our simulation experiments suggest that the proposed architecture outperforms baselines that try to map global plan information and sensor data directly to velocity commands. In comparison to a hand-designed traditional navigation system, the proposed approach showed more consistent performance.
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, tor... more Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners, we can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space. We propose ReLMoGen-a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. To validate our method, we apply ReLMoGen to two types of tasks: 1) Interactive Navigation tasks, navigation problems where interactions with the environment are required to reach the destination, and 2) Mobile Manipulation tasks, manipulation tasks that require moving the robot base. These problems are challenging because they are usually long-horizon, hard to explore during training, and comprise alternating phases of navigation and interaction. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. In all settings, ReLMoGen outperforms state-of-the-art Reinforcement Learning and Hierarchical Reinforcement Learning baselines. ReLMoGen also shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots.
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by a... more Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by allowing them to learn from human demonstrations collected via teleoperation, but has mostly been limited to single-arm manipulation. However, many real-world tasks require multiple arms, such as lifting a heavy object or assembling a desk. Unfortunately, applying IL to multi-arm manipulation tasks has been challenging-asking a human to control more than one robotic arm can impose significant cognitive burden and is often only possible for a maximum of two robot arms. To address these challenges, we present MULTI-ARM ROBOTURK (MART), a multiuser data collection platform that allows multiple remote users to simultaneously teleoperate a set of robotic arms and collect demonstrations for multi-arm tasks. Using MART, we collected demonstrations for five novel two and three-arm tasks from several geographically separated users. From our data we arrived at a critical insight: most multi-arm tasks do not require global coordination throughout its full duration, but only during specific moments. We show that learning from such data consequently presents challenges for centralized agents that directly attempt to model all robot actions simultaneously, and perform a comprehensive study of different policy architectures with varying levels of centralization on our tasks. Finally, we propose and evaluate a base-residual policy framework that allows trained policies to better adapt to the mixed coordination setting common in multi-arm manipulation, and show that a centralized policy augmented with a decentralized residual model outperforms all other models on our set of benchmark tasks. Additional results and videos at https://roboturk. stanford.edu/multiarm
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Searching for objects in indoor organized environments such as homes or offices is part of our ev... more Searching for objects in indoor organized environments such as homes or offices is part of our everyday activities. When looking for a desired object, we reason about the rooms and containers the object is likely to be in; the same type of container will have a different probability of containing the target depending on which room it is in. We also combine geometric and semantic information to infer what container is best to search, or what other objects are best to move, if the target object is hidden from view. We use a 3D scene graph representation to capture the hierarchical, semantic, and geometric aspects of this problem. To exploit this representation in a search process, we introduce Hierarchical Mechanical Search (HMS), a method that guides an agent's actions towards finding a target object specified with a natural language description. HMS is based on a novel neural network architecture that uses neural message passing of vectors with visual, geometric, and linguistic information to allow HMS to process data across layers of the graph while combining semantic and geometric cues. HMS is trained on 1000 3D scene graphs and evaluated on a novel dataset of 500 3D scene graphs with dense placements of semantically related objects in storage locations, and is shown to be significantly better than several baselines at finding objects. It is also close to the oracle policy in terms of the median number of actions required. Additional qualitative results can be found at https://ai.stanford.edu/mech-search/hms
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
The process of learning a manipulation task depends strongly on the action space used for explora... more The process of learning a manipulation task depends strongly on the action space used for exploration: posed in the incorrect action space, solving a task with reinforcement learning can be drastically inefficient. Additionally, similar tasks or instances of the same task family impose latent manifold constraints on the most effective action space: the task family can be best solved with actions in a manifold of the entire action space of the robot. Combining these insights we present LASER, a method to learn latent action spaces for efficient reinforcement learning. LASER factorizes the learning problem into two sub-problems, namely action space learning and policy learning in the new action space. It leverages data from similar manipulation task instances, either from an offline expert or online during policy learning, and learns from these trajectories a mapping from the original to a latent action space. LASER is trained as a variational encoder-decoder model to map raw actions into a disentangled latent action space while maintaining action reconstruction and latent space dynamic consistency. We evaluate LASER on two contact-rich robotic tasks in simulation, and analyze the benefit of policy learning in the generated latent action space. We show improved sample efficiency compared to the original action space from better alignment of the action space to the task space, as we observe with visualizations of the learned action space manifold.
2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019
Reinforcement Learning (RL) of contact-rich manipulation tasks has yielded impressive results in ... more Reinforcement Learning (RL) of contact-rich manipulation tasks has yielded impressive results in recent years. While many studies in RL focus on varying the observation space or reward model, few efforts focused on the choice of action space (e.g. joint or end-effector space, position, velocity, etc.). However, studies in robot motion control indicate that choosing an action space that conforms to the characteristics of the task can simplify exploration and improve robustness to disturbances. This paper studies the effect of different action spaces in deep RL and advocates for variable impedance control in end-effector space (VICES) as an advantageous action space for constrained and contact-rich tasks. We evaluate multiple action spaces on three prototypical manipulation tasks: Path Following (task with no contact), Door Opening (task with kinematic constraints), and Surface Wiping (task with continuous contact). We show that VICES improves sample efficiency, maintains low energy consumption, and ensures safety across all three experimental setups. Further, RL policies learned with VICES can transfer across different robot models in simulation, and from simulation to real for the same robot. Further information is available at https://stanfordvl.github.io/vices.
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully le... more A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021
We present JRDB, a novel egocentric dataset collected from our social mobile manipulator JackRabb... more We present JRDB, a novel egocentric dataset collected from our social mobile manipulator JackRabbot. The dataset includes 64 minutes of annotated multimodal sensor data including stereo cylindrical 360 • RGB video at 15 fps, 3D point clouds from two 16 planar rays Velodyne LiDARs, line 3D point clouds from two Sick Lidars, audio signal, RGB-D video at 30 fps, 360 • spherical image from a fisheye camera and encoder values from the robot's wheels. Our dataset incorporates data from traditionally underrepresented scenes such as indoor environments and pedestrian areas, all from the ego-perspective of the robot, both stationary and navigating. The dataset has been annotated with over 2.4 million bounding boxes spread over 5 individual cameras and 1.8 million associated 3D cuboids around all people in the scenes totaling over 3500 time consistent trajectories. Together with our dataset and the annotations, we launch a benchmark and metrics for 2D and 3D person detection and tracking. With this dataset, which we plan on extending with further types of annotation in the future, we hope to provide a new source of data and a test-bench for research in the areas of egocentric robot vision, autonomous navigation, and all perceptual tasks around social robotics in human environments.
We present Interactive Gibson Benchmark, the first comprehensive benchmark for training and evalu... more We present Interactive Gibson Benchmark, the first comprehensive benchmark for training and evaluating Interactive Navigation solutions. Interactive Navigation tasks are robot navigation problems where physical interaction with objects (e.g. pushing) is allowed and even encouraged to reach the goal. Our benchmark comprises two novel elements: 1) a new experimental simulated environment, the Interactive Gibson Environment (iGibson 0.5 †), that generates photo-realistic images of indoor scenes and simulates realistic physical interactions of robots and common objects found in these scenes; 2) the Interactive Navigation Score, a novel metric to study the interplay between navigation and physical interaction of Interactive Navigation solutions. We present and evaluate multiple learning-based baselines in Interactive Gibson Benchmark, and provide insights into regimes of navigation with different trade-offs between navigation, path efficiency and disturbance of surrounding objects. We make our benchmark publicly available 3 and encourage researchers from related robotics disciplines (e.g. planning, learning, control) to propose, evaluate, and compare their Interactive Navigation solutions in Interactive Gibson Benchmark.
Humans can routinely follow a trajectory defined by a list of images/landmarks. However, traditio... more Humans can routinely follow a trajectory defined by a list of images/landmarks. However, traditional robot navigation methods require accurate mapping of the environment, localization, and planning. Moreover, these methods are sensitive to subtle changes in the environment. In this paper, we propose a Deep Visual MPC-policy learning method that can perform visual navigation while avoiding collisions with unseen objects on the navigation path. Our model PoliNet takes in as input a visual trajectory and the image of the robot's current view and outputs velocity commands for a planning horizon of N steps that optimally balance between trajectory following and obstacle avoidance. PoliNet is trained using a strong image predictive model and traversability estimation model in a MPC setup, with minimal human supervision. Different from prior work, PoliNet can be applied to new scenes without retraining. We show experimentally that the robot can follow a visual trajectory when varying start position and in the presence of previously unseen obstacles. We validated our algorithm with tests both in a realistic simulation environment and in the real world. We also show that we can generate visual trajectories in simulation and execute the corresponding path in the real environment. Our approach outperforms classical approaches as well as previous learning-based baselines in success rate of goal reaching, sub-goal coverage rate, and computational load.
2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021
We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interacti... more We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains 15 fully interactive home-sized scenes with 108 rooms populated with rigid and articulated objects. The scenes are replicas of realworld homes, with distribution and the layout of objects aligned to those of the real world. iGibson 1.0 integrates several key features to facilitate the study of interactive tasks: i) generation of high-quality virtual sensor signals (RGB, depth, segmentation, LiDAR, flow and so on), ii) domain randomization to change the materials of the objects (both visual and physical) and/or their shapes, iii) integrated sampling-based motion planners to generate collision-free trajectories for robot bases and arms, and iv) intuitive human-iGibson interface that enables efficient collection of human demonstrations. Through experiments, we show that the full interactivity of the scenes enables agents to learn useful visual representations that accelerate the training of downstream manipulation tasks. We also show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of human demonstrated (mobile) manipulation behaviors. iGibson 1.0 is open-source, equipped with comprehensive examples and documentation. For more information, visit our project website:
2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021
Humans can robustly follow a visual trajectory defined by a sequence of images (i.e. a video) reg... more Humans can robustly follow a visual trajectory defined by a sequence of images (i.e. a video) regardless of substantial changes in the environment or the presence of obstacles. We aim at endowing similar visual navigation capabilities to mobile robots solely equipped with a RGB fisheye camera. We propose a novel probabilistic visual navigation system that learns to follow a sequence of images with bidirectional visual predictions conditioned on possible navigation velocities. By predicting bidirectionally (from start towards goal and vice versa) our method extends its predictive horizon enabling the robot to go around unseen large obstacles that are not visible in the video trajectory. Learning how to react to obstacles and potential risks in the visual field is achieved by imitating human teleoperators. Since the human teleoperation commands are diverse, we propose a probabilistic representation of trajectories that we can sample to find the safest path. We evaluate our navigation system quantitatively and qualitatively in multiple simulated and real environments and compare to state-of-theart baselines. Our approach outperforms the most recent visual navigation methods with a large margin with regard to goal arrival rate, subgoal coverage rate, and success weighted by path length (SPL). Our method also generalizes to new robot embodiments never used during training.
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
Leveraging multimodal information with recursive Bayesian filters improves performance and robust... more Leveraging multimodal information with recursive Bayesian filters improves performance and robustness of state estimation, as recursive filters can combine different modalities according to their uncertainties. Prior work has studied how to optimally fuse different sensor modalities with analytical state estimation algorithms. However, deriving the dynamics and measurement models along with their noise profile can be difficult or lead to intractable models. Differentiable filters provide a way to learn these models end-to-end while retaining the algorithmic structure of recursive filters. This can be especially helpful when working with sensor modalities that are high dimensional and have very different characteristics. In contact-rich manipulation, we want to combine visual sensing (which gives us global information) with tactile sensing (which gives us local information). In this paper, we study new differentiable filtering architectures to fuse heterogeneous sensor information. As case studies, we evaluate three tasks: two in planar pushing (simulated and real) and one in manipulating a kinematically constrained door (simulated). In extensive evaluations, we find that differentiable filters that leverage crossmodal sensor information reach comparable accuracies to unstructured LSTM models, while presenting interpretability benefits that may be important for safety-critical systems. We also release an open-source library for creating and training differentiable Bayesian filters in PyTorch, which can be found on our project website: https://sites.google.com/view/ multimodalfilter.
The exploration mechanism used by a Deep Reinforcement Learning (RL) agent plays a key role in de... more The exploration mechanism used by a Deep Reinforcement Learning (RL) agent plays a key role in determining its sample efficiency. Thus, improving over random exploration is crucial to solve long-horizon tasks with sparse rewards. We propose to leverage an ensemble of partial solutions as teachers that guide the agent's exploration with action suggestions throughout training. While the setup of learning with teachers has been previously studied, our proposed approach - Actor-Critic with Teacher Ensembles (AC-Teach) - is the first to work with an ensemble of suboptimal teachers that may solve only part of the problem or contradict other each other, forming a unified algorithmic solution that is compatible with a broad range of teacher ensembles. AC-Teach leverages a probabilistic representation of the expected outcome of the teachers' and student's actions to direct exploration, reduce dithering, and adapt to the dynamically changing quality of the learner. We evaluate a v...
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Planning in realistic environments requires searching in large planning spaces. Affordances are a... more Planning in realistic environments requires searching in large planning spaces. Affordances are a powerful concept to simplify this search, because they model what actions can be successful in a given situation. However, the classical notion of affordance is not suitable for long horizon planning because it only informs the robot about the immediate outcome of actions instead of what actions are best for achieving a longterm goal. In this paper, we introduce a new affordance representation that enables the robot to reason about the long-term effects of actions through modeling what actions are afforded in the future, thereby informing the robot the best actions to take next to achieve a task goal. Based on the new representation, we develop a learning-to-plan method, Deep Affordance Foresight (DAF), that learns partial environment models of affordances of parameterized motor skills through trial-and-error. We evaluate DAF on two challenging manipulation domains and show that it can effectively learn to carry out multi-step tasks, share learned affordance representations among different tasks, and learn to plan with high-dimensional image inputs. Additional material at https://sites.google.com/stanford.edu/daf
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
A robotic agent navigating autonomously needs to perceive and track the motion of objects and oth... more A robotic agent navigating autonomously needs to perceive and track the motion of objects and other agents in its surroundings to be able to plan and execute robust and safe trajectories. To be useful for planning and execution, the motion should be perceived in 3D Cartesian space. However, most recent multi-object tracking (MOT) research have focused on tracking people and moving objects in 2D RGB video sequences. This is because searching in 3D is costly, there is a lack of annotated datasets with 3D labels of moving agents and a relative scarcity of data with 3D sensor modalities. In this work we present JRMOT, a novel 3D MOT system that integrates information from RGB images and 3D point clouds to achieve real-time, state-of-the-art tracking performance. Our system incorporates recent advancements in neural network based re-identification as well as 2D and 3D detectors and descriptors. We integrate them into a joint probabilistic dataassociation framework within a multi-modal recursive Kalman architecture to achieve online, real-time 3D MOT. As part of our work, we release the JRDB dataset, a novel large scale 2D+3D dataset and benchmark annotated with over 2 million boxes and 3500 time consistent 2D+3D trajectories across 54 indoor and outdoor scenes. This dataset, that we use to train and test our model, contains over 60 minutes of data including 360 • cylindrical RGB video and 3D pointclouds, in social settings. The presented 3D MOT system demonstrates state-ofthe-art performance against competing methods on the popular 2D tracking KITTI benchmark and serves as a competitive 3D tracking baseline for our dataset and benchmark. Moreover, our tests on our social robot JackRabbot indicate that the system is capable of tracking multiple pedestrians fast and reliably. The ROS code of our tracker is publicly available at https://sites.google.com/view/jrmot
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in hu... more Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in human environments, such as office buildings and homes. While related literature has addressed the co-navigation problem focused on the scalability with the number of pedestrians in open spaces, typical indoor environments present the additional challenge of constrained spaces such as corridors, doorways and crosswalks that limit maneuverability and influence patterns of pedestrian interaction. We present an approach based on reinforcement learning to learn policies capable of dynamic adaptation to the presence of moving pedestrians while navigating between desired locations in constrained environments. The policy network receives guidance from a motion planner that provides waypoints to follow a globally planned trajectory, whereas the reinforcement component handles the local interactions. We explore a compositional principle for multi-layout training and find that policies trained in a small set of geometrically simple layouts successfully generalize to unseen and more complex layouts that exhibit composition of the simple structural elements available during training. Going beyond wall-world like domains, we show transfer of the learned policy to unseen 3D reconstructions of two real environments (market, home). These results support the applicability of the compositional principle to real-world environments and indicate promising usage of agent simulation within reconstructed environments for tasks that involve interaction. https://ai.stanford.edu/∼cdarpino/socialnavconstrained/
2019 International Conference on Robotics and Automation (ICRA), 2019
When operating in unstructured environments such as warehouses, homes, and retail centers, robots... more When operating in unstructured environments such as warehouses, homes, and retail centers, robots are frequently required to interactively search for and retrieve specific objects from cluttered bins, shelves, or tables. Mechanical Search describes the class of tasks where the goal is to locate and extract a known target object. In this paper, we formalize Mechanical Search and study a version where distractor objects are heaped over the target object in a bin. The robot uses an RGBD perception system and control policies to iteratively select, parameterize, and perform one of 3 actions-push, suction, grasp-until the target object is extracted, or either a time limit is exceeded, or no high confidence push or grasp is available. We present a study of 5 algorithmic policies for mechanical search, with 15,000 simulated trials and 300 physical trials for heaps ranging from 10 to 20 objects. Results suggest that success can be achieved in this long-horizon task with algorithmic policies in over 95% of instances and that the number of actions required scales approximately linearly with the size of the heap. Code and supplementary material can be found at http://ai.stanford.edu/mech-search.
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
A video prediction model that generalizes to diverse scenes would enable intelligent agents such ... more A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns highfidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to stateof-the-art models, GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules. Visualization and more details are at https: // sites. google. com/ view/ ghvae .
2019 International Conference on Robotics and Automation (ICRA), 2019
We present a navigation system that combines ideas from hierarchical planning and machine learnin... more We present a navigation system that combines ideas from hierarchical planning and machine learning. The system uses a traditional global planner to compute optimal paths towards a goal, and a deep local trajectory planner and velocity controller to compute motion commands. The latter components of the system adjust the behavior of the robot through attention mechanisms such that it moves towards the goal, avoids obstacles, and respects the space of nearby pedestrians. Both the structure of the proposed deep models and the use of attention mechanisms make the system's execution interpretable. Our simulation experiments suggest that the proposed architecture outperforms baselines that try to map global plan information and sensor data directly to velocity commands. In comparison to a hand-designed traditional navigation system, the proposed approach showed more consistent performance.
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, tor... more Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners, we can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space. We propose ReLMoGen-a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. To validate our method, we apply ReLMoGen to two types of tasks: 1) Interactive Navigation tasks, navigation problems where interactions with the environment are required to reach the destination, and 2) Mobile Manipulation tasks, manipulation tasks that require moving the robot base. These problems are challenging because they are usually long-horizon, hard to explore during training, and comprise alternating phases of navigation and interaction. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. In all settings, ReLMoGen outperforms state-of-the-art Reinforcement Learning and Hierarchical Reinforcement Learning baselines. ReLMoGen also shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots.
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by a... more Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by allowing them to learn from human demonstrations collected via teleoperation, but has mostly been limited to single-arm manipulation. However, many real-world tasks require multiple arms, such as lifting a heavy object or assembling a desk. Unfortunately, applying IL to multi-arm manipulation tasks has been challenging-asking a human to control more than one robotic arm can impose significant cognitive burden and is often only possible for a maximum of two robot arms. To address these challenges, we present MULTI-ARM ROBOTURK (MART), a multiuser data collection platform that allows multiple remote users to simultaneously teleoperate a set of robotic arms and collect demonstrations for multi-arm tasks. Using MART, we collected demonstrations for five novel two and three-arm tasks from several geographically separated users. From our data we arrived at a critical insight: most multi-arm tasks do not require global coordination throughout its full duration, but only during specific moments. We show that learning from such data consequently presents challenges for centralized agents that directly attempt to model all robot actions simultaneously, and perform a comprehensive study of different policy architectures with varying levels of centralization on our tasks. Finally, we propose and evaluate a base-residual policy framework that allows trained policies to better adapt to the mixed coordination setting common in multi-arm manipulation, and show that a centralized policy augmented with a decentralized residual model outperforms all other models on our set of benchmark tasks. Additional results and videos at https://roboturk. stanford.edu/multiarm
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
Searching for objects in indoor organized environments such as homes or offices is part of our ev... more Searching for objects in indoor organized environments such as homes or offices is part of our everyday activities. When looking for a desired object, we reason about the rooms and containers the object is likely to be in; the same type of container will have a different probability of containing the target depending on which room it is in. We also combine geometric and semantic information to infer what container is best to search, or what other objects are best to move, if the target object is hidden from view. We use a 3D scene graph representation to capture the hierarchical, semantic, and geometric aspects of this problem. To exploit this representation in a search process, we introduce Hierarchical Mechanical Search (HMS), a method that guides an agent's actions towards finding a target object specified with a natural language description. HMS is based on a novel neural network architecture that uses neural message passing of vectors with visual, geometric, and linguistic information to allow HMS to process data across layers of the graph while combining semantic and geometric cues. HMS is trained on 1000 3D scene graphs and evaluated on a novel dataset of 500 3D scene graphs with dense placements of semantically related objects in storage locations, and is shown to be significantly better than several baselines at finding objects. It is also close to the oracle policy in terms of the median number of actions required. Additional qualitative results can be found at https://ai.stanford.edu/mech-search/hms
2021 IEEE International Conference on Robotics and Automation (ICRA), 2021
The process of learning a manipulation task depends strongly on the action space used for explora... more The process of learning a manipulation task depends strongly on the action space used for exploration: posed in the incorrect action space, solving a task with reinforcement learning can be drastically inefficient. Additionally, similar tasks or instances of the same task family impose latent manifold constraints on the most effective action space: the task family can be best solved with actions in a manifold of the entire action space of the robot. Combining these insights we present LASER, a method to learn latent action spaces for efficient reinforcement learning. LASER factorizes the learning problem into two sub-problems, namely action space learning and policy learning in the new action space. It leverages data from similar manipulation task instances, either from an offline expert or online during policy learning, and learns from these trajectories a mapping from the original to a latent action space. LASER is trained as a variational encoder-decoder model to map raw actions into a disentangled latent action space while maintaining action reconstruction and latent space dynamic consistency. We evaluate LASER on two contact-rich robotic tasks in simulation, and analyze the benefit of policy learning in the generated latent action space. We show improved sample efficiency compared to the original action space from better alignment of the action space to the task space, as we observe with visualizations of the learned action space manifold.
2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019
Reinforcement Learning (RL) of contact-rich manipulation tasks has yielded impressive results in ... more Reinforcement Learning (RL) of contact-rich manipulation tasks has yielded impressive results in recent years. While many studies in RL focus on varying the observation space or reward model, few efforts focused on the choice of action space (e.g. joint or end-effector space, position, velocity, etc.). However, studies in robot motion control indicate that choosing an action space that conforms to the characteristics of the task can simplify exploration and improve robustness to disturbances. This paper studies the effect of different action spaces in deep RL and advocates for variable impedance control in end-effector space (VICES) as an advantageous action space for constrained and contact-rich tasks. We evaluate multiple action spaces on three prototypical manipulation tasks: Path Following (task with no contact), Door Opening (task with kinematic constraints), and Surface Wiping (task with continuous contact). We show that VICES improves sample efficiency, maintains low energy consumption, and ensures safety across all three experimental setups. Further, RL policies learned with VICES can transfer across different robot models in simulation, and from simulation to real for the same robot. Further information is available at https://stanfordvl.github.io/vices.
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully le... more A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021
We present JRDB, a novel egocentric dataset collected from our social mobile manipulator JackRabb... more We present JRDB, a novel egocentric dataset collected from our social mobile manipulator JackRabbot. The dataset includes 64 minutes of annotated multimodal sensor data including stereo cylindrical 360 • RGB video at 15 fps, 3D point clouds from two 16 planar rays Velodyne LiDARs, line 3D point clouds from two Sick Lidars, audio signal, RGB-D video at 30 fps, 360 • spherical image from a fisheye camera and encoder values from the robot's wheels. Our dataset incorporates data from traditionally underrepresented scenes such as indoor environments and pedestrian areas, all from the ego-perspective of the robot, both stationary and navigating. The dataset has been annotated with over 2.4 million bounding boxes spread over 5 individual cameras and 1.8 million associated 3D cuboids around all people in the scenes totaling over 3500 time consistent trajectories. Together with our dataset and the annotations, we launch a benchmark and metrics for 2D and 3D person detection and tracking. With this dataset, which we plan on extending with further types of annotation in the future, we hope to provide a new source of data and a test-bench for research in the areas of egocentric robot vision, autonomous navigation, and all perceptual tasks around social robotics in human environments.
We present Interactive Gibson Benchmark, the first comprehensive benchmark for training and evalu... more We present Interactive Gibson Benchmark, the first comprehensive benchmark for training and evaluating Interactive Navigation solutions. Interactive Navigation tasks are robot navigation problems where physical interaction with objects (e.g. pushing) is allowed and even encouraged to reach the goal. Our benchmark comprises two novel elements: 1) a new experimental simulated environment, the Interactive Gibson Environment (iGibson 0.5 †), that generates photo-realistic images of indoor scenes and simulates realistic physical interactions of robots and common objects found in these scenes; 2) the Interactive Navigation Score, a novel metric to study the interplay between navigation and physical interaction of Interactive Navigation solutions. We present and evaluate multiple learning-based baselines in Interactive Gibson Benchmark, and provide insights into regimes of navigation with different trade-offs between navigation, path efficiency and disturbance of surrounding objects. We make our benchmark publicly available 3 and encourage researchers from related robotics disciplines (e.g. planning, learning, control) to propose, evaluate, and compare their Interactive Navigation solutions in Interactive Gibson Benchmark.
Humans can routinely follow a trajectory defined by a list of images/landmarks. However, traditio... more Humans can routinely follow a trajectory defined by a list of images/landmarks. However, traditional robot navigation methods require accurate mapping of the environment, localization, and planning. Moreover, these methods are sensitive to subtle changes in the environment. In this paper, we propose a Deep Visual MPC-policy learning method that can perform visual navigation while avoiding collisions with unseen objects on the navigation path. Our model PoliNet takes in as input a visual trajectory and the image of the robot's current view and outputs velocity commands for a planning horizon of N steps that optimally balance between trajectory following and obstacle avoidance. PoliNet is trained using a strong image predictive model and traversability estimation model in a MPC setup, with minimal human supervision. Different from prior work, PoliNet can be applied to new scenes without retraining. We show experimentally that the robot can follow a visual trajectory when varying start position and in the presence of previously unseen obstacles. We validated our algorithm with tests both in a realistic simulation environment and in the real world. We also show that we can generate visual trajectories in simulation and execute the corresponding path in the real environment. Our approach outperforms classical approaches as well as previous learning-based baselines in success rate of goal reaching, sub-goal coverage rate, and computational load.
Uploads
Papers by Roberto Martin