In intelligent environments, computer systems do not merely serve as passive input devices waiting for user interaction but actively analyze their environment and adapt their behavior to changes in environmental parameters. One essential ability toward this goal is analyzing the mood, emotions, and dispositions a user experiences while interacting with such intelligent systems. Features that allow inferring such parameters can be extracted from auditory as well as visual sensory input streams. In the visual domain, facial expressions in particular are known to contain rich information about a user's emotional state and can be detected using static and/or dynamic image features. During interaction, facial expressions are rarely performed in isolation; most of the time they co-occur with movements of the head. Optical-flow-based facial features are therefore often compromised by additional motions: parts of the optical flow may be caused by rigid head motions, while other parts reflect deformations resulting from facial expressivity (non-rigid motions). In this work, we propose first steps towards an optical-flow-based separation of rigid head motions from non-rigid motions caused by facial expressions. We suggest that, after their separation, both head movements and facial expressions can serve as a basis for recognizing a user's emotions and dispositions, and thus allow a technical system to adapt effectively to the user's state.
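A minimal first-step sketch of such a separation — not the paper's actual method — is to fit a single global parametric (here affine) motion model to the flow field and attribute the residual to non-rigid facial motion; function and variable names are illustrative assumptions:

```python
import numpy as np

def separate_flow(points, flow):
    """Fit a global affine motion to an optical flow field (least squares)
    and attribute the residual to non-rigid motion such as facial
    deformation. points: (N, 2) pixel coordinates; flow: (N, 2) vectors."""
    A = np.hstack([points, np.ones((len(points), 1))])  # rows [x, y, 1]
    # Solve flow ~= A @ P for the 3x2 affine parameter matrix P.
    P, *_ = np.linalg.lstsq(A, flow, rcond=None)
    rigid = A @ P            # flow explained by the global (head) motion
    nonrigid = flow - rigid  # residual attributed to facial expression
    return rigid, nonrigid
```

For a flow field that really is affine, the non-rigid residual vanishes; expression-induced deformations remain in the residual.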
Communications in computer and information science, 2016
Biologically inspired computational models of visual processing often rely on conventional frame-based cameras for data acquisition. The Dynamic Vision Sensor (DVS), by contrast, emulates the main processing sequence of the mammalian retina and generates spike trains that encode temporal changes in the luminance distribution of a visual scene. Based on this sparse input representation, we propose neural mechanisms for initial motion estimation and integration that are functionally related to the dorsal stream of the visual cortical hierarchy. We adapt the spatio-temporal filtering scheme originally suggested by Adelson and Bergen to make it consistent with the input representation generated by the DVS. To regulate the overall activation of single neurons against a pool of neighboring cells, we incorporate a competitive stage that operates on both the spatial and the feature domain. The impact of this normalization stage is evaluated using information-theoretic measures. Results of optical flow estimation were analyzed using synthetic ground-truth data.
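The competitive stage can be sketched as divisive normalization over a joint spatial and feature pool; the 3x3 pooling window, the saturation constant, and all names below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def divisive_normalization(responses, sigma=0.1):
    """Divide each activation by pooled activity over the feature axis
    and a 3x3 spatial neighborhood (periodic borders via np.roll).
    responses: (H, W, F) array, e.g. motion-energy responses for F
    directions of motion."""
    pool = responses.sum(axis=-1, keepdims=True)  # pool over features
    spatial = np.zeros_like(pool)
    for dy in (-1, 0, 1):                         # 3x3 spatial pool
        for dx in (-1, 0, 1):
            spatial += np.roll(np.roll(pool, dy, axis=0), dx, axis=1)
    return responses / (sigma + spatial)
```

Uniform input stays uniform after normalization, while locally dominant responses are preserved relative to their neighborhood.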
Fine-grained action segmentation in long untrimmed videos is an important task for many applications such as surveillance, robotics, and human-computer interaction. To understand subtle and precise actions within a long time period, second-order information (e.g. feature covariance) or higher is reported to be effective in the literature. However, extracting such high-order information is considerably non-trivial. In particular, the dimensionality increases exponentially with the information order, so gaining more representation power also increases the computational cost and the risk of overfitting. In this paper, we propose an approach to representing high-order information for temporal action segmentation via a simple yet effective bilinear form. Specifically, our contributions are: (1) from the multilinear perspective, we derive a bilinear form of low complexity, assuming that the three-way tensor has low-rank frontal slices; (2) rather than learning the tensor entries from data, we sample them from different underlying distributions and prove that the underlying distribution influences the information order; (3) we employ our bilinear form as an intermediate layer in state-of-the-art deep neural networks, enabling high-order information to be represented effectively and efficiently in complex deep models. Our experimental results demonstrate that the proposed bilinear form outperforms previous state-of-the-art methods on the challenging temporal action segmentation task. Data, models, and code are available on our project page: https://vlg.inf.ethz.ch/projects/BilinearTCN/.
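The idea of sampling rather than learning bilinear weights can be illustrated with rank-1 random slices, so that each feature x^T W_k x reduces to two inner products; the Gaussian sampling choice and all names here are assumptions for illustration, not the paper's construction:

```python
import numpy as np

def sampled_bilinear_features(x, num_features=16, seed=0):
    """Bilinear features z_k = x^T W_k x with sampled (not learned)
    rank-1 slices W_k = u_k v_k^T, so each feature reduces to
    (u_k . x) * (v_k . x) and costs O(d) instead of O(d^2)."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    u = rng.standard_normal((num_features, d))
    v = rng.standard_normal((num_features, d))
    return (u @ x) * (v @ x)
```

Because the slices are fixed by the seed rather than trained, the layer adds second-order interactions without additional parameters to learn.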
2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), 2021
Digitization is advancing rapidly in many predominantly analogue domains such as healthcare. In healthcare, synergies with modern information technology (IT) have become integral to communication and collaboration. A comprehensible language is therefore important to enable a frictionless exchange of information between domain experts. The Business Process Model and Notation (BPMN) 2.0 is a promising notation that may serve as such a lingua franca. Although BPMN 2.0 is widely applied by experts in business and industry, little is known about how it is adopted in healthcare. To assess how BPMN 2.0 is deployed in healthcare, we conducted a preliminary eye-tracking study in which n = 16 healthcare professionals comprehended a particular BPMN 2.0 process model. The results indicate that BPMN 2.0 may be a candidate lingua franca for fostering the comprehensible exchange of information as well as collaboration between healthcare and IT.
The purpose of the presented model architecture is to capture dynamics and adaptation of the system states in order to yield insights into neural circuitry and functionality. However, to present the model in a physiologically more realistic setting and to provide testable neurophysiological parameter values, we modified Eq. (1) and Eq. (6) by incorporating a membrane capacitance C and a leak reversal potential E_l. The core equations read
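The modified equations themselves are not reproduced in this excerpt. For a single-compartment conductance-based neuron, equations of this family typically take the form below; only C and E_l appear in the text above, so the excitatory and inhibitory conductance terms and their symbols are assumptions for illustration:

```latex
C \,\frac{dv(t)}{dt} = -g_{\mathrm{leak}}\,\bigl(v(t) - E_l\bigr)
  + g_{\mathrm{exc}}(t)\,\bigl(E_{\mathrm{exc}} - v(t)\bigr)
  + g_{\mathrm{inh}}(t)\,\bigl(E_{\mathrm{inh}} - v(t)\bigr)
```

Each conductance term drives the membrane potential v(t) toward its reversal potential, and C sets the membrane time scale.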
Convolutional neural networks are gaining ever more popularity in image classification tasks, since they are often able to outperform even human classifiers. While much research has targeted network architecture optimization, the optimization of the labeled training data has not yet been explicitly addressed. Because labeling training data is time-consuming, it is often performed by less experienced domain experts or even outsourced to online services. Unfortunately, this results in labeling errors, which directly impact the classification performance of the trained network. To overcome this problem, we propose an interactive visual analysis system that helps to spot and correct errors in the training dataset. For this purpose, we identified instance interpretation errors, class interpretation errors, and similarity errors as frequently occurring error types that should be resolved to improve classification performance. After these errors are detected, users are guided towards...
Standard separation index. Standard separation index for experiment 3, the adapter tone experiment (green lines), and experiment 4, the timing experiment (blue lines). The peak of the green curve is shifted by a maximum of 6 dB for increased adapter tone intensity and gains a sensitivity increment of 52%, compared to a 6 dB shift and a sensitivity increment of 53% for no-noise inputs (compare Fig. 5). The peak of the blue curve is shifted by a maximum of 32 dB, compared to 38 dB between stimuli of different ITDs. The sensitivity increment for this shift is 101%, compared to 106%.
In our modern industrial society, the group of older people (generation 65+) is constantly growing. Many members of this group are severely affected in their health and suffer from disability and pain. Chronic illness and pain lower the patient's quality of life, and accurate pain assessment is therefore needed to facilitate effective pain management and treatment. In the future, automatic pain monitoring may enable healthcare professionals to assess and manage pain in an increasingly objective way. To this end, the goal of our SenseEmotion project is to develop automatic pain- and emotion-recognition systems for successful assessment and effective personalized management of pain, particularly for the generation 65+. In this paper, the recently created SenseEmotion Database for pain- vs. emotion-recognition is presented. Data of 45 healthy subjects were collected for this database; for each subject, approximately 30 min of multimodal sensory data were recorded. For a comprehensive understanding of pain and affect, three rather different data modalities are included in this study: biopotentials, camera images of the facial region, and, for the first time, audio signals. Heat stimulation was applied to elicit pain, and affective image stimuli accompanied by sound stimuli were used to elicit emotional states.
Facial point detection is gaining importance in computer vision, as it plays a vital role in several applications such as facial expression recognition and human behavior analysis. In this work, we propose an approach that locates 49 facial points via neural networks in a cascaded regression fashion. The localization process starts by detecting the face, continues with a face-cropping refinement step, and finally arrives at the facial point locations through five cascades of regressors. In particular, we perform a guided initialization using holistic features extracted from the entire face patch. The point locations are then refined in the next four cascades using local features extracted from patches enclosing the prior point estimates. Generalization capability was improved by performing feature selection at each cascade. Evaluating our approach on samples gathered from four challenging databases, we achieved an average localization error per point ranging between 0.72% and 1.57% of the face width. The proposed approach was further evaluated on the 300-W challenge, where we achieved results competitive with those of state-of-the-art approaches and commercial software packages. Moreover, our approach showed better generalization capability. Finally, we validated the proposed enhancements by studying the impact of several factors on point localization accuracy.
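The cascade itself can be sketched as an additive refinement loop; the feature extractor and per-stage regressors below are placeholders, not the networks used in the work:

```python
import numpy as np

def run_cascade(extract_features, stages, init_points):
    """Cascaded regression: every stage predicts a coordinate update
    from features extracted around the current point estimates, and the
    estimates are refined additively."""
    points = np.asarray(init_points, dtype=float)
    for stage in stages:
        feats = extract_features(points)   # e.g. local patch features
        points = points + stage(feats)     # additive refinement step
    return points
```

With toy stages that each close half the remaining gap to a target, the estimate converges geometrically, which mirrors why a handful of cascades suffices in practice.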
Adaptation to the statistics of sensory inputs is an essential ability of neural systems and extends their effective operational range. A broad operational range facilitates reacting to sensory inputs of different granularities and is thus a crucial factor for survival. The computation of auditory cues for the spatial localization of sound sources, particularly the interaural level difference (ILD), has long been considered a static process. Novel findings suggest that this process of ipsi- and contra-lateral signal integration is highly adaptive and depends strongly on recent stimulus statistics. Here, adaptation aids the encoding of auditory perceptual space at various granularities. To investigate the mechanism of auditory adaptation in binaural signal integration in detail, we developed a neural model architecture simulating functions of the lateral superior olive (LSO) and the medial nucleus of the trapezoid body (MNTB), composed of single-compartment conductance-based neurons. Neurons in the MNTB serve as an intermediate relay population; their signal is integrated by the LSO population at the circuit level to represent excitatory and inhibitory interactions of the input signals. The circuit incorporates an adaptation mechanism operating at the synaptic level, based on local inhibitory feedback signals. The model's predictive power is demonstrated in various simulations replicating physiological data. Incorporating this adaptation mechanism shifts neural responses towards the most effective stimulus range based on recent stimulus history. The model demonstrates that a single LSO neuron quickly adapts to these stimulus statistics and can thus encode an extended range of ILDs in the ipsilateral hemisphere. Most significantly, we provide a unique measurement of the adaptation efficacy of LSO neurons.
A prerequisite of normal function is an accurate interaction of inhibitory and excitatory signals, a precise encoding of time, and a well-tuned local feedback circuit. We suggest that the mechanisms of temporal competitive-cooperative interaction and the local feedback mechanism jointly sensitize the circuit to enable a response shift towards contra-lateral and ipsi-lateral stimuli, respectively.
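The synaptic adaptation driven by local inhibitory feedback can be caricatured as a gain that tracks recent response levels; all names and constants below are illustrative assumptions, not the model's actual parameters:

```python
def adapt_inhibitory_gain(responses, g=1.0, eta=0.1, target=0.5):
    """Increase the inhibitory gain when recent responses exceed a
    target operating point and decrease it otherwise, shifting the
    neuron's sensitive range toward the recent stimulus statistics.
    Returns the gain after each input."""
    history = []
    for r in responses:
        g += eta * (r - target)   # local feedback update
        history.append(g)
    return history
```

Under persistently strong stimulation the gain grows until responses return to the operating point, which is the qualitative behavior behind the response-range shift described above.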
Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, surgical robotics, and other tasks requiring subtle and precise operations over a long period. In this paper we propose a novel bilinear pooling operation, used in the intermediate layers of a temporal convolutional encoder-decoder network. In contrast to previous work, our bilinear pooling is learnable and can hence capture more complex local statistics than its conventional counterpart. In addition, we introduce exact lower-dimensional representations of our bilinear forms, so that dimensionality is reduced without information loss and without extra computation. We perform extensive experiments to quantitatively analyze our model and show performance superior to other state-of-the-art pooling methods on various datasets.
Figure 1: 3D human bodies with various shapes and poses are automatically generated to interact with the scene. Appropriate human-scene contact is encouraged, and human-scene surface interpenetration is discouraged.
2022 International Conference on Robotics and Automation (ICRA), May 23, 2022
We propose a framework for robust and efficient training of Dense Object Nets (DON) [1], focusing on industrial multi-object robot manipulation scenarios. DON is a popular approach for obtaining dense, view-invariant object descriptors, which can be used for a multitude of downstream tasks in robot manipulation, such as pose estimation and state representation for control. However, the original work [1] focused training on singulated objects, with limited results on instance-specific, multi-object applications. Additionally, a complex data collection pipeline, including 3D reconstruction and mask annotation of each object, is required for training. In this paper, we further improve the efficacy of DON with a simplified data collection and training regime that consistently yields higher precision and enables robust keypoint tracking with lower data requirements. In particular, we train on multi-object data instead of singulated objects, combined with a well-chosen augmentation scheme. We additionally propose an alternative to the original pixelwise loss formulation that offers better results and is less sensitive to hyperparameters. Finally, we demonstrate the robustness and accuracy of our framework on a real-world robotic grasping task.
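The baseline pixelwise formulation that the paper improves upon can be sketched as a contrastive loss over matched and non-matched pixel pairs; this shows only the baseline idea, not the paper's alternative loss, and the names are illustrative:

```python
import numpy as np

def pixelwise_contrastive_loss(desc_a, desc_b, matches, non_matches,
                               margin=0.5):
    """Pull matched pixel pairs together in descriptor space and push
    non-matches beyond a margin. desc_a, desc_b: (H, W, D) descriptor
    images; matches / non_matches: iterables of ((ya, xa), (yb, xb))
    pixel-coordinate pairs between the two images."""
    def d(pa, pb):
        return np.linalg.norm(desc_a[pa] - desc_b[pb])
    m = sum(d(pa, pb) ** 2 for pa, pb in matches) / max(len(matches), 1)
    n = sum(max(0.0, margin - d(pa, pb)) ** 2
            for pa, pb in non_matches) / max(len(non_matches), 1)
    return m + n
```

The match term is a plain squared distance, while the non-match term is a hinge: pairs already farther apart than the margin contribute nothing.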
Proceedings of the International Symposium on Auditory and Audiological Research, 2019
The development of spatially registered auditory maps in the external nucleus of the inferior colliculus in young owls, and their maintenance in adult animals, is visually guided and evolves dynamically. To investigate the underlying neural mechanisms of this process, we developed a model of stabilized neo-Hebbian correlative learning augmented by an eligibility signal and a temporal trace of activations. This three-component learning algorithm facilitates the stable yet flexible formation of spatially registered auditory space maps composed of conductance-based, topographically organized neural units. Spatially aligned maps are learned for visual and auditory input stimuli that arrive in temporal and spatial registration. The reliability of visual sensory inputs can be used to regulate the learning rate in the form of an eligibility trace. We show that shifting visual sensory inputs at the onset of learning shifts the topography of the auditory space maps accordingly. Simulation results explain why a shift of auditory maps in mature animals is possible only if corrections are induced in small steps. We conclude that learning spatially aligned auditory maps is flexibly controlled by reliable visual sensory neurons and can be formalized by a biologically plausible unsupervised learning mechanism.
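The three components — correlative update, temporal trace, and eligibility gating — can be sketched in one step; the Oja-style stabilization term and all constants are illustrative assumptions, not the model's actual rule:

```python
import numpy as np

def learning_step(w, pre, post, eligibility, trace, lr=0.05, decay=0.8):
    """One step of a correlative (Hebbian) update gated by an
    eligibility signal, with a decaying trace of pre/post correlations
    and an Oja-style term keeping the weights bounded. Returns the
    updated weight matrix and trace."""
    trace = decay * trace + np.outer(post, pre)  # temporal trace
    dw = lr * eligibility * (trace - (post ** 2)[:, None] * w)
    return w + dw, trace
```

When the eligibility signal (e.g. derived from visual reliability) is zero, no learning occurs regardless of the correlations; this is the mechanism by which the visual pathway gates map formation.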
A patient (HJA) with bilateral occipital lobe damage to ventral cortical areas V2, V3, and V4 was tested on a texture segmentation task involving texture-bar detection in an array of oriented lines. Performance in detecting a target shape was assessed as increasing orientation noise was added to the background lines. Control participants found the task easier when the background lines had the same orientation or were only slightly shifted in orientation. HJA performed poorly with all backgrounds, but particularly so when the background lines had the same or almost the same orientations. The results suggest that V1 alone is not sufficient to perform easy texture segmentation, even when the background of the display is a homogeneous texture. Ventral extra-striate cortical areas are needed to detect texture boundaries. We suggest that extra-striate visual areas enhance the borders between target and background, while also playing a role in reducing the signal from homogeneous texture backgrounds.
Automated analysis of facial expressions is a well-investigated research area in computer vision, with imminent applications such as human-computer interaction (HCI). The work conducted here proposes new methods for the automated evaluation of facial expressions in image sequences of color and depth data. In particular, we present the main components of our system: accurate estimation of the observed person's head pose, followed by facial feature extraction and, third, classification. Through the application of dimensional affect models, we move beyond the strict categories, i.e. basic emotions, on which most state-of-the-art facial expression recognition techniques focus. This is important because, in most HCI applications, classical basic emotions occur only sparsely and are hence often inadequate to guide the dialog with the user. To resolve this issue, we suggest mapping to the so-called "Circumplex model of affect", which enables us to dete...
In intelligent environments, computer systems not solely serve as passive input devices waiting f... more In intelligent environments, computer systems not solely serve as passive input devices waiting for user interaction but actively analyze their environment and adapt their behavior according to changes in environmental parameters. One essential ability to achieve this goal is to analyze the mood, emotions and dispositions a user experiences while interacting with such intelligent systems. Features allowing to infer such parameters can be extracted from auditive, as well as visual sensory input streams. For the visual feature domain, in particular facial expressions are known to contain rich information about a user's emotional state and can be detected by using either static and/or dynamic image features. During interaction facial expressions are rarely performed in isolation, but most of the time co-occur with movements of the head. Thus, optical flow based facial features are often compromised by additional motions. Parts of the optical flow may be caused by rigid head motions, while other parts reflect deformations resulting from facial expressivity (non-rigid motions). In this work, we propose the first steps towards an optical flow based separation of rigid head motions from nonrigid motions caused by facial expressions. We suggest that after their separation, both, head movements and facial expressions can be used as a basis for the recognition of a user's emotions and dispositions and thus allow a technical system to effectively adapt to the user's state.
Communications in computer and information science, 2016
Biologically inspired computational models of visual processing often utilize conventional frame-... more Biologically inspired computational models of visual processing often utilize conventional frame-based cameras for data acquisition. Instead, the Dynamic Vision Sensor (DVS) emulates the main processing sequence of the mammalian retina and generates spike-trains to encode temporal changes in the luminance distribution of a visual scene. Based on such sparse input representation we propose neural mechanisms for initial motion estimation and integration functionally related to the dorsal stream in the visual cortical hierarchy. We adapt the spatio-temporal filtering scheme as originally suggested by Adelson and Bergen to make it consistent with the input representation generated by the DVS. In order to regulate the overall activation of single neurons against a pool of neighboring cells, we incorporate a competitive stage that operates upon the spatial as well as the feature domain. The impact of such normalization stage is evaluated using information theoretic measures. Results of optical flow estimation were analyzed using synthetic ground truth data.
Fine-grained action segmentation in long untrimmed videos is an important task for many applicati... more Fine-grained action segmentation in long untrimmed videos is an important task for many applications such as surveillance, robotics, and human-computer interaction. To understand subtle and precise actions within a long time period, second-order information (e.g. feature covariance) or higher is reported to be effective in the literature. However, extracting such high-order information is considerably non-trivial. In particular, the dimensionality increases exponentially with the information order, and hence gaining more representation power also increases the computational cost and the risk of overfitting. In this paper, we propose an approach to representing high-order information for temporal action segmentation via a simple yet effective bilinear form. Specifically, our contributions are: (1) From the multilinear perspective, we derive a bilinear form of low complexity, assuming that the three-way tensor has low-rank frontal slices. (2) Rather than learning the tensor entries from data, we sample the entries from different underlying distributions, and prove that the underlying distribution influences the information order. (3) We employed our bilinear form as an intermediate layer in state-of-the-art deep neural networks, enabling to represent high-order information in complex deep models effectively and efficiently. Our experimental results demonstrate that the proposed bilinear form outperforms the previous state-of-the-art methods on the challenging temporal action segmentation task. One can see our project page for data, model and code: https://vlg.inf.ethz.ch/projects/BilinearTCN/.
2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), 2021
Digitization is advancing rapidly in many prevalently analogue domains such as healthcare. For th... more Digitization is advancing rapidly in many prevalently analogue domains such as healthcare. For the latter domain, the synergies with modern information technologies (IT) have become an integral part regarding communication and collaboration. For this reason, a comprehensible language is of importance in order to allow a frictionless exchange of information between domain experts. The Business Process Model and Notation (BPMN) 2.0 represents a promising notation that may be applied as lingua franca. Although the BPMN 2.0 is widespread applied by experts in business and industry, little experience exists how BPMN 2.0 is adopted in healthcare. In order to assess how BPMN 2.0 is deployed in healthcare, we conducted a preliminary eye tracking study, in which n = 16 professionals from healthcare comprehended a particular BPMN 2.0 process model. The results indicate that BPMN 2.0 might be a candidate for a lingua franca to foster the comprehensible exchange of information as well as collaboration between healthcare and IT.
The purpose of the presented model architecture is to capture dynamics and adaptation of the syst... more The purpose of the presented model architecture is to capture dynamics and adaptation of the system states to yield insights on neural circuitry and functionality. However, to present the model in a physiologically more realistic setting and to provide testable neurophysiological parameter values, we modified Eq. (1) and Eq. (6) by incorporating membrane capacitance C and leak reversal potential E l. The core equations read
Convolutional neural networks gain more and more popularity in image classification tasks since t... more Convolutional neural networks gain more and more popularity in image classification tasks since they are often even able to outperform human classifiers. While much research has been targeted towards network architecture optimization, the optimization of the labeled training data has not been explicitly targeted yet. Since labeling of training data is time-consuming, it is often performed by less experienced domain experts or even outsourced to online services. Unfortunately, this results in labeling errors, which directly impact the classification performance of the trained network. To overcome this problem, we propose an interactive visual analysis system that helps to spot and correct errors in the training dataset. For this purpose, we have identified instance interpretation errors, class interpretation errors and similarity errors as frequently occurring errors, which shall be resolved to improve classification performance. After we detect these errors, users are guided towards...
Standard separation index. Standard separation index for experiment 3, adapter tone experiment (g... more Standard separation index. Standard separation index for experiment 3, adapter tone experiment (green lines) and 4, timing experiment (blue lines). The peak of the green curve is shifted by a maximum of 6dB for increased adapter tone intensity and gains an sensitivity increment of 52%, compared to 6dB shift and sensitivity increment of 53% for no noise inputs (compare Fig. 5). The peak of the blue curve is shifted by a maximum of 32dB, compared to 38dB between stimuli of different ITDs. The increment of sensitivity for this shift is 101%, compared to 106%.
In our modern industrial society the group of the older (generation 65+) is constantly growing. M... more In our modern industrial society the group of the older (generation 65+) is constantly growing. Many subjects of this group are severely affected by their health and are suffering from disability and pain. The problem with chronic illness and pain is that it lowers the patient's quality of life, and therefore accurate pain assessment is needed to facilitate effective pain management and treatment. In the future, automatic pain monitoring may enable health care professionals to assess and manage pain in a more and more objective way. To this end, the goal of our SenseEmotion project is to develop automatic painand emotion-recognition systems for successful assessment and effective personalized management of pain, particularly for the generation 65+. In this paper the recently created SenseEmotion Database for pain-vs. emotion-recognition is presented. Data of 45 healthy subjects is collected to this database. For each subject approximately 30 min of multimodal sensory data has been recorded. For a comprehensive understanding of pain and affect three rather different modalities of data are included in this study: biopotentials, camera images of the facial region, and, for the first time, audio signals. Heat stimulation is applied to elicit pain, and affective image stimuli accompanied by sound stimuli are used for the elicitation of emotional states.
Facial point detection gains an increasing importance in computer vision as it plays a vital role... more Facial point detection gains an increasing importance in computer vision as it plays a vital role in several applications such as facial expression recognition and human behavior analysis. In this work, we propose an approach to locate 49 facial points via neural networks in a cascade regression fashion. The localization process starts by detecting the face, followed by a face cropping refinement task and lastly arriving at the facial point location through five cascades of regressors. In particular, we perform a guided initialization using holistic features extracted from the entire face patch. Then, the points location is refined in the next four cascades using local features extracted from patches enclosing the prior estimates of the points. The generalization capability was improved by performing feature selection at each cascade. By evaluating our approach on samples gathered from four challenging databases, we achieved a location average error for each point ranging between 0.72 % and 1.57 % of the face width. The proposed approach was further evaluated according to the 300-w challenge, where we achieved competitive results to those obtained by state-of-the-art approaches and commercial software packages. Moreover, our approach showed better generalization capability. Finally, we validated the proposed enhancements by studying the impact of several factors on the point localization accuracy.
Adaptation to the statistics of sensory inputs is an essential ability of neural systems and extends their effective operational range. A broad operational range makes it possible to react to sensory inputs of different granularities and is thus a crucial factor for survival. The computation of auditory cues for the spatial localization of sound sources, particularly the interaural level difference (ILD), has long been considered a static process. Novel findings suggest that this process of ipsi- and contra-lateral signal integration is highly adaptive and depends strongly on recent stimulus statistics. Here, adaptation aids the encoding of auditory perceptual space at various granularities. To investigate the mechanism of auditory adaptation in binaural signal integration in detail, we developed a neural model architecture simulating functions of the lateral superior olive (LSO) and the medial nucleus of the trapezoid body (MNTB), composed of single-compartment conductance-based neurons. Neurons in the MNTB serve as an intermediate relay population. Their signal is integrated by the LSO population at the circuit level to represent excitatory and inhibitory interactions of input signals. The circuit incorporates an adaptation mechanism operating at the synaptic level, based on local inhibitory feedback signals. The model's predictive power is demonstrated in various simulations replicating physiological data. Incorporating this adaptation mechanism facilitates a shift of neural responses towards the most effective stimulus range based on recent stimulus history. The model demonstrates that a single LSO neuron quickly adapts to these stimulus statistics and can thus encode an extended range of ILDs in the ipsilateral hemisphere. Most significantly, we provide a unique measurement of the adaptation efficacy of LSO neurons. A prerequisite of normal function is an accurate interaction of inhibitory and excitatory signals, a precise encoding of time, and a well-tuned local feedback circuit. We suggest that the mechanisms of temporal competitive-cooperative interaction and the local feedback mechanism jointly sensitize the circuit to enable a response shift towards contra-lateral and ipsi-lateral stimuli, respectively.
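The adaptive gain-control idea can be illustrated with a minimal rate-based sketch: an LSO unit receives excitation from the ipsilateral ear and MNTB-relayed inhibition from the contralateral ear, and a slow inhibitory gain adapts via local feedback from the unit's own output. The time constants, feedback gains, and rectified subtraction below are hypothetical placeholders rather than the paper's conductance-based model:

```python
import numpy as np

def simulate_lso(ipsi, contra, tau_a=20.0, gain0=1.0, dt=1.0):
    """Toy rate model of one LSO unit with an adaptive inhibitory gain."""
    gain = gain0
    rates, gains = [], []
    for e, i in zip(ipsi, contra):
        r = max(e - gain * i, 0.0)  # rectified excitation minus inhibition
        # Local feedback: high output strengthens inhibition, shifting the
        # responsive ILD range towards the recent stimulus statistics;
        # a leak term pulls the gain back towards its baseline.
        gain += dt / tau_a * (0.1 * r - 0.05 * (gain - gain0))
        rates.append(r)
        gains.append(gain)
    return np.array(rates), np.array(gains)
```

With a sustained loud ipsilateral stimulus, the gain slowly rises, re-centering the unit's dynamic range, which mirrors the response shift described in the abstract.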
Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, surgical robotics, and other settings requiring subtle and precise operations over long time periods. In this paper we propose a novel bilinear pooling operation, which is used in intermediate layers of a temporal convolutional encoder-decoder network. In contrast to previous work, our proposed bilinear pooling is learnable and can hence capture more complex local statistics than its conventional counterpart. In addition, we introduce exact lower-dimensional representations of our bilinear forms, so that the dimensionality is reduced without suffering information loss or requiring extra computation. We perform extensive experiments to quantitatively analyze our model and show superior performance compared to other state-of-the-art pooling methods on various datasets.
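A common way to make bilinear pooling both learnable and low-dimensional is a factorized form in which the full outer product is never materialized. The sketch below (plain NumPy, with arbitrary dimensions) illustrates this general idea, not the paper's exact parameterization:

```python
import numpy as np

def lowrank_bilinear_pool(X, U, V):
    """Low-rank bilinear pooling over time.

    X: (T, D) per-frame features; U, V: (D, R) learnable projections.
    Computes z_t[r] = (U^T x_t)[r] * (V^T x_t)[r], which equals the
    rank-R projection of the outer product x_t x_t^T without ever
    forming the full D x D matrix, then mean-pools over time.
    """
    Z = (X @ U) * (X @ V)   # (T, R) elementwise product of projections
    return Z.mean(axis=0)   # temporal pooling

rng = np.random.default_rng(1)
D, R, T = 16, 8, 10
U, V = rng.normal(size=(D, R)), rng.normal(size=(D, R))
X = rng.normal(size=(T, D))
z = lowrank_bilinear_pool(X, U, V)
```

Because U and V are plain matrices, both can be trained by backpropagation when the same computation is expressed in a deep-learning framework.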
Figure 1: 3D human bodies with various shapes and poses are automatically generated to interact with the scene. Appropriate human-scene contact is encouraged, and human-scene surface interpenetration is discouraged.
2022 International Conference on Robotics and Automation (ICRA), May 23, 2022
We propose a framework for robust and efficient training of Dense Object Nets (DON) [1], with a focus on industrial multi-object robot manipulation scenarios. DON is a popular approach to obtain dense, view-invariant object descriptors, which can be used for a multitude of downstream tasks in robot manipulation, such as pose estimation and state representation for control. However, the original work [1] focused training on singulated objects, with limited results on instance-specific, multi-object applications. Additionally, a complex data collection pipeline, including 3D reconstruction and mask annotation of each object, is required for training. In this paper, we further improve the efficacy of DON with a simplified data collection and training regime that consistently yields higher precision and enables robust tracking of keypoints with lower data requirements. In particular, we focus on training with multi-object data instead of singulated objects, combined with a well-chosen augmentation scheme. We additionally propose an alternative to the original pixelwise loss formulation that offers better results and is less sensitive to hyperparameters. Finally, we demonstrate the robustness and accuracy of our proposed framework on a real-world robotic grasping task.
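For context, a pixelwise contrastive objective in the spirit of the original DON can be sketched as follows: matched pixel pairs across two views are pulled together in descriptor space, while non-matching pairs are pushed apart by at least a margin. Descriptor shapes, the margin value, and the pair format are simplified assumptions:

```python
import numpy as np

def pixelwise_contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    """Contrastive loss over dense descriptor maps.

    desc_a, desc_b: (H, W, C) descriptor maps for two views of a scene.
    matches / non_matches: (N, 2, 2) arrays of pixel-index pairs,
    each entry ((ya, xa), (yb, xb)).
    """
    def dists(pairs):
        a = desc_a[pairs[:, 0, 0], pairs[:, 0, 1]]  # (N, C) descriptors, view a
        b = desc_b[pairs[:, 1, 0], pairs[:, 1, 1]]  # (N, C) descriptors, view b
        return np.linalg.norm(a - b, axis=1)

    match_loss = (dists(matches) ** 2).mean()                 # pull together
    hinge = np.maximum(0.0, margin - dists(non_matches))      # push apart
    return match_loss + (hinge ** 2).mean()
```

The loss is zero exactly when all matches coincide in descriptor space and all non-matches are separated by more than the margin.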
Proceedings of the International Symposium on Auditory and Audiological Research, 2019
The development of spatially registered auditory maps in the external nucleus of the inferior colliculus in young owls, and their maintenance in adult animals, is visually guided and evolves dynamically. To investigate the underlying neural mechanisms of this process, we developed a model of stabilized neoHebbian correlative learning that is augmented by an eligibility signal and a temporal trace of activations. This three-component learning algorithm facilitates the stable, yet flexible, formation of spatially registered auditory space maps composed of conductance-based, topographically organized neural units. Spatially aligned maps are learned for visual and auditory input stimuli that arrive in temporal and spatial registration. The reliability of visual sensory inputs can be used to regulate the learning rate in the form of an eligibility trace. We show that by shifting visual sensory inputs at the onset of learning, the topography of auditory space maps is shifted accordingly. Simulation results explain why a shift of auditory maps in mature animals is possible only if corrections are induced in small steps. We conclude that learning spatially aligned auditory maps is flexibly controlled by reliable visual sensory neurons and can be formalized by a biologically plausible unsupervised learning mechanism.
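A three-factor Hebbian update of this flavor (correlation term, low-pass eligibility trace, reliability-gated learning rate, plus a stabilizing decay) can be sketched as below. All constants and the Oja-style stabilization term are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def hebbian_eligibility_update(W, pre, post, elig, reliability,
                               lr=0.05, tau_e=5.0, dt=1.0):
    """One step of correlative learning gated by an eligibility trace.

    The trace low-pass filters pre/post coincidences, the reliability
    signal (e.g. from vision) scales the effective learning rate, and an
    Oja-style decay term keeps the weights bounded (stabilization).
    """
    elig = elig + dt / tau_e * (np.outer(post, pre) - elig)
    W = W + lr * reliability * (elig - W * (post ** 2)[:, None])
    return W, elig
```

Setting the reliability signal to zero freezes learning, which is one simple way to model the visual gating of auditory map plasticity described above.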
A patient (HJA) with bilateral occipital lobe damage to ventral cortical areas V2, V3 and V4 was tested on a texture segmentation task involving texture-bar detection in an array of oriented lines. Performance in detecting a target shape was assessed as increasing orientation noise was added to the background lines. Control participants found the task easier when the background lines had the same orientation or were only slightly shifted in orientation. HJA performed poorly with all backgrounds, but particularly so when the background lines had the same or almost the same orientations. The results suggest that V1 alone is not sufficient to perform easy texture segmentation, even when the background of the display is a homogeneous texture. Ventral extra-striate cortical areas are needed in order to detect texture boundaries. We suggest that extra-striate visual areas enhance the borders between the target and background, while also playing a role in reducing the signal from homogeneous texture backgrounds.
Automated analysis of facial expressions is a well-investigated research area in the field of computer vision, with promising applications such as human-computer interaction (HCI). The work conducted here proposes new methods for the automated evaluation of facial expressions in image sequences of color and depth data. In particular, we present the main components of our system: accurate estimation of the observed person's head pose, followed by facial feature extraction and, third, classification. Through the application of dimensional affect models, we move beyond the use of strict categories, i.e. basic emotions, which most state-of-the-art facial expression recognition techniques focus on. This is important because in most HCI applications classical basic emotions occur only sparsely, and hence are often inadequate to guide the dialog with the user. To resolve this issue we suggest a mapping to the so-called "Circumplex model of affect", which enables us to dete…
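Mapping categorical labels onto the circumplex can be sketched as a lookup into a two-dimensional valence-arousal space, optionally expressed in polar form (affect quality as angle, intensity as radius). The coordinates below are hypothetical placements, since exact positions vary across studies:

```python
import math

# Hypothetical valence/arousal coordinates for a few affective states on
# the circumplex; the exact placement differs between studies.
CIRCUMPLEX = {
    "happy":   (0.8, 0.5),
    "sad":     (-0.7, -0.4),
    "angry":   (-0.6, 0.7),
    "relaxed": (0.6, -0.5),
}

def to_polar(valence, arousal):
    """Express an affective state as (angle, radius) on the circumplex:
    the angle encodes the quality of affect, the radius its intensity."""
    return math.atan2(arousal, valence), math.hypot(valence, arousal)
```

A dimensional output of this form lets a dialog system react to gradual changes in user state instead of waiting for a rare, clear-cut basic emotion.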
Papers by Heiko Neumann