
CHI 2018, April 21–26, 2018, Montréal, QC, Canada (CHI 2018 Honourable Mention)

Communication Behavior in Embodied Virtual Reality

Harrison Jesse Smith (1,2) and Michael Neff (1,2)
1 Oculus Research, Sausalito, CA, USA
2 University of California, Davis, CA, USA
[email protected], [email protected]

ABSTRACT
Embodied virtual reality faithfully renders users' movements onto an avatar in a virtual 3D environment, supporting nuanced nonverbal behavior alongside verbal communication. To investigate communication behavior within this medium, we had 30 dyads complete two tasks using a shared visual workspace: negotiating an apartment layout and placing model furniture on an apartment floor plan. Dyads completed both tasks under three different conditions: face-to-face, embodied VR with visible full-body avatars, and no embodiment VR, where the participants shared a virtual space, but had no visible avatars. Both subjective measures of users' experiences and detailed annotations of verbal and nonverbal behavior are used to understand how the media impact communication behavior. Embodied VR provides a high level of social presence with conversation patterns that are very similar to face-to-face interaction. In contrast, providing only the shared environment was generally found to be lonely and appears to lead to degraded communication.

ACM Classification Keywords
H.4.3. Communications Applications: Computer conferencing, teleconferencing, and videoconferencing

Author Keywords
Computer-mediated communication, virtual reality, embodiment, social presence.

DOI: https://doi.org/10.1145/3173574.3173863

INTRODUCTION
Modern communication is frequently mediated by technology. Each technology offers its own set of affordances [12], and yet it is difficult to match the immediacy and richness offered through the multimodality of face-to-face communication – so much so that the Department of Energy estimates that roughly eight percent of US energy is used to support passenger transport to enable face-to-face communication [1]. This work explores how embodied virtual reality (VR) can support communication around a spatial task. Embodied virtual reality means that a person's movements are tracked and then used to drive an avatar in a shared virtual world. Using a head mounted display (HMD), participants view the world through the avatar's eyes, and the avatar's movements reflect those of their own body, effectively embodying them in the virtual world. This technology allows people to interact in a shared, 3D environment and employ both verbal and nonverbal communication. In this work, we use marker-based, optical motion capture to track participants and render their bodies as simple 3D meshes in the environment, with an eyebrow ridge and nose, but no other facial features or facial animation (Fig. 1-C). Such an embodiment supports manual gesture, locomotion and verbal dialog, but limited hand movement and no facial expressions.

To understand how such a representation (embodVR) performs as a communication tool, we compare it to two other conditions. The first is the gold standard: face-to-face communication (F2F). The second is a variant of VR in which both participants can see the same shared environment, but their avatars are not visible to themselves or each other (no_embodVR); one task provided a representation of participants' own hands to facilitate object interaction.

Employing a within-subject design, 30 dyads interacted with each other in each of the three conditions, performing two tasks in each. They were told to role-play being new roommates, and in the first task, they were given a floor plan of their new apartment and had to agree on which rooms should be used for the living room, dining room and each of their bedrooms. In the second task, they had to agree on a furniture arrangement by placing representative furniture on the floor plan. These tasks offer an appropriate test bed for interactions in which people must work with a shared visual representation. The first task does not require manipulation of the environment and the second does. The technologies were evaluated based on both participants' subjective impressions and a detailed analysis of their actual verbal and nonverbal communication behavior.

We expected to see three distinct levels of performance, where F2F performs best, followed by embodVR and then no_embodVR. This echoes earlier work comparing face-to-face, audio/video, and audio-only communication. Instead, we found very similar behavior between F2F and embodVR, but a marked drop off for no_embodVR. Recordings (real and virtual) of the apartment layout task were annotated for both verbal and nonverbal behavior. For most factors, there was no significant difference between F2F and embodVR, often with a significant drop off for no_embodVR. This suggests that participants employed similar communication patterns in F2F and embodied virtual reality. Subjective measures provide insight into the "felt experience" of using the technology. On most, but not all, measures of social presence, the same pattern emerged of no significant difference between embodVR and F2F, but a significant drop off for no_embodVR. Much more detail is provided below.

RELATED WORK
Most real-time collaboration mediums can be grouped into three different categories: face-to-face, tools that support audio communication only, such as a telephone, and tools that support audio and visual communication, such as video conferencing (for detailed reviews of computer-mediated communication, please see [16, 47]). Previous works have established a hierarchy in which face-to-face interactions are clearly superior to audio-only for most tasks involving spatial information or negotiation [47]. The role video plays is less clear-cut, however. While providing a visual channel can theoretically aid in mutual understanding (or conversational grounding), tools with video feeds often do not perform significantly better than audio-only equivalents [35, 32, 43, 25]. There are a number of fundamental issues preventing video conferencing tools from reaching the effectiveness of face-to-face interactions:

1.
Interlocutors connected by video feeds are not co-present: they cannot fully observe their partner (visibility) or their partner’s physical context (visual co-presence). Fussell and Setlock argue that there is “clear evidence” that visual copresence improves task performance [16] and helps support grounding [18]. Visibility allows people’s gestures to be seen, which is important for representational gestures, and co-presence allows them to point to features in the environment. Co-presence reduces the need for verbal grounding. In embodied virtual reality, participants are naturally copresent in the same virtual space, without a need to try to engineer these features into a remote video system. 2. Most video feeds are stationary, positioned to provide closeup representations of the remote partner’s face or upper torso. Such positioning can hinder movement-centric tasks, and prevent transmission of posture and gesture cues. Most non-stationary video feeds, such as smart phone cameras, are controlled by the remote partner, which can result in disorienting camera movements and sub-optimal camera angles [26]. In embodied virtual reality, participants control their view of the scene by adjusting their gaze. 3. Offsets between screens and cameras make it difficult or impossible to establish mutual gaze. Eye contact is important for conversational management and establishing intimacy between partners, and its absence can reduce the perceived social presence of the communication tool. The difficulty in establish eye contact can disrupt conversational management behaviors [27]. Observing gaze also allows one to infer a partner’s focus of attention, which can aid grounding. The Role of Shared Visual Workspaces in Collaboration Tasks When considering remote collaboration tools with a visual component, it is helpful to draw distinctions between shared visual information pertaining to the state of a task (shared visual workspace) and visual information depicting the state of the remote partner. Many previous works have focused on shared visual workspaces as a key component of effective remote collaborations [19, 20, 21]. Because the current Paper 289 CHI 2018, April 21–26, 2018, Montréal, QC, Canada study focuses on evaluating the impacts of embodiment, all conditions incorporate a shared visual workspace. When performing complex tasks with many different possible states, a shared visual workspace can help partners synchronize their mental models of the task state and aid in the development of common ground: Gergle et. al. [19] found that, when completing a puzzle, the speed and efficiency with which dyads completed the task increased when a shared workspace was present. Participants were less likely to verbally verify their actions, relying on the visual information to transmit the necessary communicative cues. In a study conducted by Fussell et al. [15], partners completed bicycle repair tasks under various conditions: the task was most efficiently completed in the face-to-face condition, and participants attributed this to the presence of an effective shared visual workspace. Interestingly, a shared visual workspace does not always result in more efficient communication: in the same study by Fussell et al., the addition of a video feed depicting the task did not show significant advantages over the audio-only condition: participants mentioned that it was difficult to make effective deictic gestures [15]. In related studies, Kraut et. al. 
[32, 30] performed an experiment where novice/expert pairs of bicycle mechanics interacted via audio and audio/video media. While the presence of a shared visual workspace did assist in enabling shared common ground and more proactive help-giving by the expert, it did not ultimately result in higher quality solutions. The Role of Visual Behavior in Collaboration Tasks Video feeds that show a remote partner allow users to communicate with a wide array of nonverbal behaviors (such as gaze, gesture or posture) that can influence characteristics of the interaction; comparisons of audio telephony to face-to-face conversation indicate that people use more words and turns in audio-only conditions [16]. When people can gesture at a work space, they can use deictic utterances (“that", “those", “there", etc.) rather than relying on longer, more descriptive phrases ("the table between the door and the window"). Studying a drawing task, Bly found that gesturing behavior decreased both in count and as a percentage of drawing surface actions as interaction moved from from face-to-face to a video-link to telephone only [10]. Clark and Krych [13] found that the presence of a visual workspace results in many more deictic utterances and gestures. Some experimental work has shown that, if these gestures are appropriately supported, resulting communication is more efficient, task performance increases or users rate the quality of the experience higher [17, 28, 2]. The presence of nonverbal modalities may have additional, subtler effects. Peksa et. al. found that, in VR settings, orienting avatars towards a user results in the user taking significantly more turns in a conversation [41]. Fussell and Setlock [16] mention that conversational turn taking becomes more formal in the absence of visual cues. Bailensen et. al. found that, when virtual humans made eye contact with users, female users tended to stand further away [5]. Evaluating Communication Tools A useful remote communication tool should be efficient, allowing users to quickly achieve high-quality solutions. Therefore, Page 2 CHI 2018 Honourable Mention two commonly used metrics are the quality of task solutions achieved and their times-to-completion [32, 39, 48]. More nuanced insights can be gained by annotating and analyzing interactions: length, frequency, and initiation of conversational turns [37], gaps between turns [25], gaze patterns [4], and overlapping turns [43]. For example, overly-elaborate, highly redundant messages may indicate that a communication tool does not adequately support backchanneling [29, 31], which can result in decreased mutual understanding between partners. Presence of deixis, shorter spaces between conversational turns, shorter, more frequent conversational turns and the presence of interruptions can all provide important clues about how comfortable users feel within a medium. More subjective measures are obtained through user surveys. One such measure comes from Social Presence theory, which argues that technologies differ in their abilities to present a sense of other participants’ goals and motives [44]. A widelyused test for social presence, semantic differencing [40], asks users to evaluate the communication tool by rating it along multiple pairs of bipolar adjectives. It has been shown to be sensitive to differences between 2D video, 3D video, and faceto-face interactions[23]. It has also been used to distinguish between video conferencing tools that do and do not support spatiality [24, 22]. 
Networked minds [9, 8, 7] is an alternative survey approach focused on whether the user experienced presence; both surveys were administered in the current work.

METHODS
We designed a study to evaluate communication differences between face-to-face interactions (F2F), VR with a motion-capture tracked avatar providing an embodied representation of the users (embodVR) and VR with no avatars (no_embodVR). All conditions included a visually shared work space.

Participants
A total of 60 subjects (30 male, 30 female) were recruited through a professional agency, along with a backup roster of friends, remote coworkers, and neighbors. In all cases, care was taken to make sure participant pairs were strangers prior to beginning the experiment. Participants were paired into 30 same-gender dyads to limit the number of combinations and remove a potentially confounding factor of strangers being less comfortable with the roommate scenario when dealing with an opposite gender stranger. Participants were aged 18-56 (M=36.5, SD=10.0). The experiment took approximately 3 hours to complete and participants were compensated with gift cards. IRB approval was obtained ahead of the experiment.

Study Design
The experiment employed a 1x3 design in which the factor, communication medium, was (1) face-to-face, (2) virtual reality with a shared workspace, audio channel, and visible, fully-embodied, motion-capture driven avatars, or (3) virtual reality with a shared workspace and audio channel, but no avatars. To control for variations within participant pairs, we employed a within-subjects design where each dyad performed the tasks under each of the three conditions. In order to prevent subjects from reusing previous decisions in subsequent conditions, three different floor plans were utilized. The order in which the floor plans were utilized was constant across all dyads, confounding its effect with that of factor ordering. The impacts of both were minimized by using a fully-counterbalanced design (five repetitions of each of the six possible factor orderings; a small ordering sketch follows the first task description below). Post-experiment tests showed that order never had a significant impact on any of the factors examined.

Because the effect of a communication medium can depend upon the type of task being conducted, our participants completed two distinct tasks under each condition: first a negotiation task, then a consensus-building workspace-manipulation task. These tasks are both social and relied on a shared visual workspace, allowing comparison of the role of the shared workspace with the visual presence of the interlocutor. We chose tasks that were familiar, but allowed detailed discussion.

Task 1: Negotiating Room Uses
Participants were instructed to role play a pair of roommates moving into a new apartment. They were given a miniature version of the apartment floor plan, which contained a labeled bathroom and kitchen, a lake, a road, and four unassigned rooms, labeled 'A' through 'D'. These labels facilitated easy verbal references. Participants then decided which of these rooms would be each participant's bedroom, the dining room, and the living room. The participants were given incompatible preferences (both wanted the same bedroom, and different rooms to be the living room) and told to role play as they saw fit in justifying their preferences. Participants were given five minutes for the task and rooms were assigned by the researcher if consensus was not reached.
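For concreteness, the fully counterbalanced ordering referenced in the Study Design above can be sketched in a few lines of Python. This is purely illustrative (the paper does not describe any such script), and the function and variable names are invented for the example.

```python
from itertools import permutations
from random import shuffle

# Illustrative sketch of the fully counterbalanced design described above:
# each of the six possible orderings of the three conditions is assigned to
# five dyads, for 30 dyads in total.
CONDITIONS = ("F2F", "embodVR", "no_embodVR")

def build_schedule(reps_per_ordering=5):
    orderings = list(permutations(CONDITIONS))          # 6 possible orderings
    schedule = [list(o) for o in orderings for _ in range(reps_per_ordering)]
    shuffle(schedule)                                    # randomize which dyad gets which ordering
    return {f"dyad_{i + 1:02d}": order for i, order in enumerate(schedule)}

if __name__ == "__main__":
    for dyad, order in sorted(build_schedule().items()):
        print(dyad, "->", " / ".join(order))
```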
Task 2: Building Consensus for Furniture Placement
In the second task, participants placed furniture inside the apartment for which they had just assigned rooms. To foster inter-partner communication during the process, participants were asked to take turns placing furniture while adhering to a specific protocol. For each turn, one participant would select a piece of furniture and suggest two different locations for it inside the apartment, justifying each location. Their partner would then suggest a third option and justify it. Then, both partners would discuss the options and select one together. After this, the participants would switch roles for the next turn. Participants completed this task for ten minutes.

Procedure
Upon arriving at the testing facility, participants completed a consent form and a short demographic survey. During this period, the researcher confirmed that both participants were complete strangers, and instructed them not to speak or otherwise interact with each other prior to beginning the experiment. They were then fitted with motion capture suits and optical markers in accordance with OptiTrack's Baseline + 13 Skeleton. Each participant then played through the Oculus Touch Tutorial [38] to familiarize them with the Oculus Rift head-mounted display (HMD) and Touch controllers. Participants were then told the rules of the tasks, positioned in the motion capture area, and fitted with the HMD and controllers (for VR conditions). Before the first task of each condition, participants played a short bluffing game (approximately 1 minute) to familiarize themselves with interacting in the condition. At the conclusion of the second task, participants were given a survey to obtain their impressions of the task outcomes and the communication medium. This process was repeated for each of the remaining conditions. At the conclusion of the third condition, participants were given one additional survey to gather information about their most and least favorite communication medium. Then, the researcher performed an experimental debrief with both participants, and encouraged the participants to discuss their survey answers and their general impressions of all three conditions.

Condition Implementation and Data Collection
For all conditions, subjects were recorded with three GoPro cameras. Audio was recorded either with lapel microphones or through the HMD microphone. For the VR conditions, the point-of-view (POV) video of each participant was recorded during the interaction. In addition, the various transformations of each object within the scene were recorded at 20 frames per second, allowing us to create videos post-hoc of the interaction with color-coded avatars, including making avatars visible that had been hidden during the No Embodiment condition. These videos were used for later analysis.

Face-To-Face (F2F)
In the face-to-face condition, participants performed the tasks facing each other from across a table (Fig. 1-A and Fig. 1-B). Furniture was represented by foam boxes with full-color orthographic images of the furniture model on each side.

Virtual Reality with Full Embodiment (embodVR)
In the Full Embodiment VR condition, participants appeared inside of a grey room, on opposite sides of a table containing an apartment floor plan (Fig. 1-C); in actuality, participants were located on opposite sides of the motion capture space, facing opposite directions (Fig. 1-D).
Table, furniture, and floor plan dimensions matched those of the face-to-face condition. Participants and their VR devices were tracked with a 24-camera (Prime 17W) OptiTrack motion capture system. Their positioning and pose was used to drive avatars and cameras within a customized Unity scene. The lag of the entire system was under 50 milliseconds; none of the participants mentioned noticeable lag during their exit interviews. The HMDs employed were Oculus Rifts: immersive, head-mounted displays with 2160x1200 resolution, 90 Hz refresh rate, and 110° FOV [6]. See the supplementary video for examples of participants wearing the devices. The Rifts employed integrated microphones and in-ear earbuds to block out ambient noise and transmit mono audio between participants (via a local VoIP server). Participants also used hand-held Oculus Touch controllers to pick up furniture and make various hand shapes (fists, thumbs-up, pointing with index finger).

Virtual Reality without Embodiment (no_embodVR)
The No Embodiment VR condition was almost identical to the Full Embodiment VR condition. In the first task, however, neither avatar was visible to participants. In the second task, participants could view their own hands to assist in picking up furniture, but could not see any part of their partner's avatar nor the rest of their body. The workspace was fully visible.

Table 1. Annotators would apply one or more of these tags to each observed gesture.
Reference Object or Location: Deictic (or pointing) gesture to an object or location.
Reference Person: Deictic gesture at self or interlocutor.
Spatial or Distance: Gestures conveying more complex spatial or distance information, such as a path through the apartment.
Backchannel: Acknowledgments of interlocutor, including head nods and manual gestures.
Representation: Metaphoric and iconic hand movements, illustrative of an idea (but not fitting in "Spatial or Distance").
Emotional or Social: Gestures conveying strong emotions or other social information.
Beat: Small movements of the hand in rhythm with the spoken prosody.
Self-adaptor: Self-manipulations not designed to communicate, such as nose scratches.

MEASURES, RESULTS AND DISCUSSION
To measure the effects of virtual reality and embodied avatars on participant interaction, we employed several different types of measurements. For readability, we will group each measure description with related results and discussion in the sections below. Results are summarized in Table 3 and Figure 2.

ANNOTATED PARTICIPANT BEHAVIOR
Measure
Following the trials, a remote team annotated verbal and nonverbal behaviors exhibited by each dyad during the floor-plan negotiation task. This provided objective data on communication patterns to complement the subjective measures. Annotators labeled each gesture performed by each participant with its type. Following McNeill's [34] assertion that gesture should not be viewed categorically, but as having levels of different dimensions, annotators were allowed to apply more than one tag to a gesture. The gesture types are based on McNeill's [33] proposal of deictic, beat, iconic and metaphoric, but some dimensions were either collapsed or subdivided to focus the analysis on the most relevant behavior for the tasks conducted here. Gesture types are shown in Table 1. Gesture may be redundant with or provide information not available in the verbal channel.
As a simple example, consider a person pointing at room A on the floor plan. If they say "I want room A," the gesture is (largely) redundant. If they say "I want this room," the utterance cannot be understood without the gesture. Annotators were instructed to add a Novel Content tag to any gesture that contained information not available through the verbal channel.

Participants' dialog was annotated at two levels of granularity, as summarized in Table 2. Utterances are individual sentences or sentence-like units of dialog. Conversational turns denote a period when one person holds the floor before it passes to the other and may contain one or more utterances.

Table 2. Speech is tagged at the two levels specified, with individual tags listed below each level.
Utterance (a section of speech; a sentence or comparable):
Pragmatic: Task-related suggestions and discussion.
Social or Emotional: Strongly social or emotional utterances, such as "I'm very excited."
Non-task Discussion: Discussion not related to the task.
Backchannel: Verbal acknowledgements that indicate listening, such as "uh huh".
Complete Reference: Fully qualified references that can be completely understood from the utterance, like "I'd like room A".
Reference Pronoun: The use of terms like "this" or "that" to refer to things, such as "I'd like this room."
Conversational Turn (the duration for which one person holds the floor before the other takes over; labeled with how the person gets the turn):
Interruption: The person takes the floor by interrupting the other.
No Marker: No clear indication of how the floor was obtained.
Verbal Hand Over: The interlocutor verbally passed the floor to the speaker.
Nonverbal Hand Over: The interlocutor nonverbally passed the floor to the speaker.

In the Face-To-Face condition, these annotations were made based on audio and video feeds from three camera angles. For the virtual reality conditions, annotations were made based on video feeds, audio and color-coded audio waveforms, POV footage for each participant, and multiple scene reconstructions in which avatars were always visible and color-coded. To minimize the effects of individual annotators, all three of a dyad's task conditions were annotated by the same annotators. To ensure high-quality annotation data, all tasks were annotated independently by two different annotators. Mismatches between the two annotators were resolved by a third annotator and quality checks were performed by the research team.

Results
A similar statistical approach was used for all data reported in this section. Repeated measures ANOVAs were run to determine if each dependent value varied significantly across the three conditions of F2F, embodVR and no_embodVR. Mauchly's test for sphericity was run on all data and corrections by Greenhouse-Geisser and Huynh-Feldt were applied as needed (both of these always succeeded). Type I error was corrected for using False Discovery Rate correction. When significant variation was found in the ANOVA, Bonferroni-corrected pairwise t-tests were run to determine which factors varied. Significance was evaluated at the p < 0.05 level.
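The following Python sketch illustrates the shape of the statistical procedure described above: a repeated-measures ANOVA over the three conditions followed by Bonferroni-corrected pairwise t-tests. It is a simplified stand-in, not the authors' analysis code; sphericity tests and the False Discovery Rate correction across measures are omitted, and the column names are assumptions.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

CONDITIONS = ["F2F", "embodVR", "no_embodVR"]

def analyze_measure(df: pd.DataFrame, dv: str) -> None:
    """Expects one row per dyad and condition, with columns 'dyad', 'condition', and dv."""
    # Repeated-measures ANOVA across the three conditions.
    anova = AnovaRM(data=df, depvar=dv, subject="dyad", within=["condition"]).fit()
    print(anova.anova_table)

    # Bonferroni-corrected pairwise paired t-tests between conditions.
    pairs = list(combinations(CONDITIONS, 2))
    pvals = []
    for a, b in pairs:
        x = df[df["condition"] == a].sort_values("dyad")[dv].to_numpy()
        y = df[df["condition"] == b].sort_values("dyad")[dv].to_numpy()
        pvals.append(ttest_rel(x, y).pvalue)
    reject, p_adj, _, _ = multipletests(pvals, method="bonferroni")
    for (a, b), p, sig in zip(pairs, p_adj, reject):
        print(f"{a} vs {b}: adjusted p = {p:.3f}{' *' if sig else ''}")
```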
Analysis of utterances focused on the distribution of utterance types. Their frequency is shown in Figure 2e. Condition had a significant effect on the occurrence of pragmatic utterances, with pragmatic utterances occurring significantly less frequently in the embodVR condition than in F2F. Condition also had a significant effect on the use of referential pronouns, with significantly fewer referential pronoun uses in the no_embodVR condition than in F2F and embodVR.

Analysis of conversational turns focused on the frequency of conversational turns and the manner in which the turn was begun. Condition had a significant effect on the frequency of turns, with significantly fewer turns occurring in the no_embodVR condition than in the embodVR condition. The data shows a tendency for the same relationship between the F2F and no_embodVR conditions (p = 0.097). The relative frequency of the manner by which a conversational turn began is shown in Figure 2f. Condition had a significant effect on the frequency of interruptions, with interruptions occurring more frequently in the F2F condition than in either embodVR or no_embodVR.

Analysis of nonverbal behavior focused on the frequency of gesturing, types of gestures employed and novelty of information carried by the gestures. The analysis showed that both F2F and embodVR had significantly higher gesturing rates than no_embodVR, but there was no significant difference between them (Figure 2a). In many dyads, one partner gestures more than the other. In no_embodVR, the less frequent gesturer made about 30 percent of the gestures, compared to 40 percent for F2F and embodVR. This disparity was significant between F2F and no_embodVR, and between embodVR and no_embodVR, but not between F2F and embodVR. See Figure 2g.

The frequency of gesture types is summarized in Fig. 2d. Since a single gesture may display features of more than one type, the total of frequency counts may exceed 100% (in practice, it was ~120%). References to objects or locations were the most frequent, followed by representational gestures and gestures displaying complex spatial or distance information. The categories "Reference Person" and "Emotional or Social" never occurred more than 5% of the time and were dropped from the analysis. All remaining categories, except "Backchannel," showed significant differences. In every case except for "Self-adaptors", there was no significant difference between F2F and embodVR, but both of these differed significantly from no_embodVR. A higher proportion of gestures in no_embodVR were representational or beats and a lower proportion were object/location references and spatial/distance gestures. Self-adaptors were a higher proportion of gestures in F2F and no_embodVR, compared with embodVR, with no significant difference between these two categories.

Novel content was analyzed both in terms of the proportion of gestures that were so tagged and the number of novel content gestures per minute (Figure 2c). In both cases, there were no statistical differences between F2F and embodVR, but gesture with novel content was significantly lower in no_embodVR.

Discussion
Overall, the verbal behavior is more consistent across conditions and the nonverbal behavior shows greater variation, reflecting in part the visual nature of the nonverbal channel. Floor management is largely accomplished using nonverbal cues, such as gaze and posture [27], so it is more difficult over audio-only channels. The lower turn frequency in no_embodVR likely reflects the difficulty of efficiently obtaining and relinquishing the floor in this condition.

Figure 1. Dyads performed the first (A) and second (B) tasks in the face-to-face conditions. In virtual reality conditions, avatars appeared across the table from each other (C), but were actually positioned on opposite sides of the motion capture stage (D). In the embodVR condition, participants were able to see both avatars (E). In the no_embodVR condition, participants were unable to see their partner and could only see their hands in the second task, to assist with furniture manipulation (F).

The largely consistent utterance behavior suggests people are not making major changes in their conversational style, either to accommodate for the lack of nonverbal information in no_embodVR or due to any perceived differences between F2F and embodVR. Reference pronouns such as 'this' or 'that' often require a gesture or visual indication to clarify their meaning. It is therefore reasonable that they occur significantly less frequently in the no_embodVR condition, where there is no way to accompany the utterance with a visible gesture. Interestingly, participants used such pronouns at the same rate in the F2F and embodVR conditions, suggesting participants felt comfortable clarifying their pronoun usage through avatar gesturing.

The dominance of pragmatic utterances across all three conditions indicates that participants remained focused on their task. However, there was a significant decrease in pragmatic utterances between the F2F and embodVR conditions. This may partially be explained by slightly more backchanneling and non-task discussion in the embodVR condition, though neither of these differences was significant (M_F2F = 29.9, M_embodVR = 32.5, adjusted p = 0.29). The higher rate of interruptions in the F2F condition may reflect that participants have more visual information on their interlocutor (gaze, facial expressions) and hence have a greater sense of when it is possible to take the floor.

It was anticipated that people would gesture less in no_embodVR, when they can neither see themselves nor their partner. It is perhaps surprising that they still averaged 9.1 gestures per minute. More interesting is that there is no significant difference between the gesture rate in F2F and embodVR. This suggests that people gesture at more or less normal rates in embodied VR, even given the limitations of holding Touch controllers. The greater disparity in gesture rate within dyads for no_embodVR suggests that there may be entrainment behavior that occurs when people have visual access to their partner. Such entrainment is a feature of rapport, and visually displaying it during an interaction may be a way to increase the felt rapport [45].

It is reasonable that people make a significantly lower proportion of gestures that are tied to the visual environment ("Reference Object" and "Spatial") in no_embodVR. Similarly, they make a higher proportion of gestures without environmental references ("Representation" and "Beats"). Nonetheless, people still made referential gestures almost four times a minute in no_embodVR. These were often redundant, such as accompanying the utterance "I'd like room A." with a gesture pointing to the room, but when they weren't they could generate substantial confusion. It is again important to note that, aside from self-adaptors, there were no significant differences between F2F and embodVR, offering further evidence that normal communication patterns transfer over to embodied VR.
We did not develop a measure for gesture complexity, but after viewing the corpus, it appears that the gestures people make in no_embodVR are less complex and often smaller. The spatial gestures were generally the most complex in the corpus and often involved illustrating traffic flow in the apartment, how noise might travel or the relative location of rooms. While spatial gestures still occurred, none of these particularly complex forms were observed in the no_embodVR. It also does not appear that this level of detail was transferred to the verbal channel. Rather, some details of the arguments were left out. To better understand the variation in self-adaptor behavior, we conducted a follow-up analysis looking at self-adaptor rate. They occurred 1.4 times per minute for F2F, 0.54 for embodVR and 0.68 for no_embodVR. The F2F rate was significantly higher than the other two. Self-adaptors are associated with anxiety and the personality trait of neuroticism [46, 3, 14, 11], so one possible explanation is that people are more anxious standing across the table from a flesh-and-blood person than they are in VR. It is also possible that they are more engaged in VR, so manipulate less, or the Touch Controllers make it more difficult to perform self-manipulations, although these occurred and there was some additional adjusting of the HMD. With regards to novel content, once again F2F and embodVR show comparable performance. It is reasonable to expect novel content to be lower in no_embodVR as people make a conscious decision not to encode information on a channel that cannot be seen. The fact that people are continuing to encode novel content in gesture when not seen means that part of their message is lost, which can lead to misunderstandings. It is interesting to note that despite there being no physical limits on people’s movements in the embodVR, they respected standard rules of proxemics, but would occasionally move into each others space in no_embodVR where there was no indication of a body. Page 6 CHI 2018 Honourable Mention CHI 2018, April 21–26, 2018, Montréal, QC, Canada Figure 2. Sub figure a shows the average number of gestures performed per minute for each condition. Subfigure b shows the percentage of gestures that fall into each annotated category (note that, because some gestures fit multiple categories, totals for each condition can add up to over 100%). Subfigure c shows the rate and percentage of gestures which introduced novel content into the discussions (for example, point at a location while referring to it by a referential pronoun). Subfigure d shows the mean number of conversational turns taken per minute. Subfigure e shows the percent of utterances that fall in each annotated category (note that, because some gestures fit multiple categories, totals for each condition can add up to over 100%). Subfigure f shows the frequencies of the manners by which conversational turns were started. Subfigure g shows the ratio of gestures performed by the more frequent gesturer and less frequent gesturer in each dyad. Subfigure h shows the mean social presence scores, with standard errors of the mean, as measured by the semantic difference questionaire. Subfigure i shows the most and least favorite conditions, as reported by participants at the end of the experiment. All error bars show standard error of the mean. 
Paper 289 Page 7 CHI 2018 Honourable Mention SEMANTIC DIFFERENCE MEASURE OF SOCIAL PRESENCE Measure Social presence, or the sense of interacting with another, is a key factor in communication and a long term goal for virtual reality systems. To measure the degree of social presence afforded by each condition, participants completed a semantic difference survey immediately after completing the second task in each condition. Similar to previous works, our survey consisted of eight bipolar word pairs (e.g. "coldwarm", "impersonal-personal", "colorless-colorful") selected from [44]. Using a seven-point Likert scale, participants rated the degree to which they felt each adjective in the pair described the communication medium. This is a common technique and previous studies have found that communication mediums with higher degrees of social presence are often rated as warmer, more personal, and more colorful [23, 42, 36] Results A reliability analysis was conducted on the results of the semantic difference surveys by calculating Chronbach’s alpha, which yielded a good internal consistency of 0.82. An average social presence score was then calculated from the factor responses for each participant and each condition. The mean and standard error are shown in Fig. 2h. Results of a repeated measures ANOVA, followed by pairwise comparisons using paired t-tests, indicate that both F2F and embodVR showed significantly higher perceptions of social presence than no_embodVR (medium effect size, Cohen’s d of 0.62 and 0.65 respectively). There was no significant difference between F2F and embodVR (negligible effect size). Discussion Semantic Difference Measure of Social Presence While it is not surprising that no_embodVR showed the lowest social presence, we expected F2F to still show greater social presence than embodVR, especially given that the current avatar is relatively primitive, lacking facial expressions, a full set of hand movement, muscle deformations, etc. Despite this, both the results and comments in the surveys and exit interview seem to indicate that people felt a high level of social presence with their interlocutor when the avatar was present. NETWORKED MINDS MEASURE OF SOCIAL PRESENCE Measure Following the semantic difference survey, participants were asked to reply to an additional 36 prompts on a 7-point Likert scale (Please see supplemental material.). These questions were based on the Networked Minds survey [9, 8, 7], an alternative measure of social presence, as well as including additional items deemed relevant for this study. Results In the data from the long survey, 6 cells (of 6,480) were blank because subjects forgot to circle answers. These blanks were replaced with the average score of all participants in that condition. In addition, one subject missed 24 questions in the no_embodVR condition. Data for that subject and condition was excluded from the analysis. Paper 289 CHI 2018, April 21–26, 2018, Montréal, QC, Canada A factor analysis was performed on the full set of questions, described in detail in the supplemental material. It yielded six factors, four of which were maintained: Clarity of Communication, Social Awareness, Conversation Management, and Disconnection to Partner. Chronbach’s alpha for these four factors produced alpha’s of 0.92, 0.86, 0.81, and 0.76 respectively. ANOVAs showed that all four factors were significantly affected by condition at the p < 0.05 level, as summarized in Table 3 and detailed in the supplemental document. 
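As a reference for the reliability and effect-size figures reported in this and the following section, here is a minimal NumPy sketch of Cronbach's alpha and a two-sample Cohen's d. It is illustrative only; the exact effect-size variant used by the authors is not specified in the text, and the function names are invented for the example.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_respondents, n_items) array of Likert ratings,
    e.g. the eight bipolar semantic-difference items per participant and condition."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances / total_variance)

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Effect size for two condition means, using the average of the two
    sample variances as the pooled variance (one common convention)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2.0)
    return float((x.mean() - y.mean()) / pooled_sd)
```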
Posthoc analysis showed that there was no significant difference between embodVR and F2F on Clarity of Communication, Conversation Management and Disconnection to Partner. For Conversation Management, embodVR performed significantly better than no_embodVR. F2F and embodVR performed significantly better than no_embodVR on the other two factors, showing the same pattern as with the semantic difference measure. Effect sizes were medium in each case, except between F2F and no_embodVR, where it was small. For Social Awareness, a three-level order appears where F2F performs better than embodVR which performs better than no_embodVR, with means of 6.16, 5.81 and 4.31 respectively. There was a medium effect size between F2F and embodVR (Cohen’s d = 0.51). There was a large effect size between both F2F and embodVR when compared to no_embodVR (Cohen’s d of 1.3 and 1.1 respectively). Discussion Three of the factors, including connection with partner, showed the same pattern as semantic differencing, with no significant difference between F2F and embodVR, but a degradation for no_embodVR, offering further evidence that people experienced similar social presence in F2F and embodVR. For Social Awareness, the degradation had a large effect size when comparing either F2F or embodVR with no_embodVR, but there was also a medium size difference between F2F and embodVR. Looking at the individual components, the factors “I could tell what my partner was paying attention to." and “I am confident I understood the emotions expressed by my partner." have the largest impact on this difference between F2F and embodVR. Gaze and emotions are highly dependent on eye movement and facial expressions, both missing from embodVR, so this may explain its lower performance to F2F on this factor. PARTICIPANT PREFERENCES AND EXIT INTERVIEWS Measure At the conclusion of the final condition, participants were asked to list their favorite and least favorite communication medium in the experiment, along with reasons for their answers. Following the debrief, informal exit interviews were conducted with the last 52 participants. These began with a prompt such as “Do either of you have any questions for me or any impressions you’d like to share?". Results Results for participants most and least preferred interface are shown in Fig. 2i. EmbodVR was most preferred by 39 participants and least preferred by 0. F2F was most preferred Page 8 CHI 2018 Honourable Mention by 15 and least preferred by 17. Four people selected both embodVR and F2F as their favorite. No_embodVR was most preferred by 2 and least preferred by 42. Discussion: Participant Preferences To gain a deeper understanding of the preference results, we categorized the written reasons given for people’s most and least preferred interfaces. One obvious explanation for the preference of embodVR is novelty, and there is some evidence for this. Of those who preferred embodied VR, five mentioned something that related to novelty in their explanation. For those that least-preferred face-to-face, the lack of novelty came up for seven participants, many of whom thought it was “boring". Novelty does not seem to be a complete explanation, both because no_embodVR is also novel and the many other justifications offered. 
Of those who preferred embodied VR, eight mentioned they liked seeing their interlocutor, seven thought the interface was fun and/or exciting, six explicitly mentioned the importance of body language and being able to show things in the environment, with comments like “I got tone and body language." Five people thought the interface was more personal and social. Three of the people who preferred embodied VR mentioned a sense of presence, with comments “I enjoyed seeing my partners body language/movement. I felt like she was in room with me even though she wasn’t", “[It] allowed me to interact on a personal level with a stranger while still being intimate and fully immersed", and even “seeing and hearing a real person is amazing". Of those that least preferred no_embodVR, 12 commented that it felt impersonal, sterile or they had less connection with their partner. For example, saying “it felt the most distant and least like there was another person across from me." For some, the abstraction of a grey, faceless avatar used in the embodVR condition increased their comfort with the interaction. Four people mentioned this who listed the condition as their favorite, with comments “being in VR makes it somehow less intimidating arguing over room space when you don’t know your partner well", “[I] felt less shy and self-conscious than I would have otherwise because of this interface", “[embodVR was best] because it was a fun and safe environment to navigate things I wanted. In-person it felt awkward [F2F], last round [no_embodVR] it just felt like I could do it by myself. I didn’t have a connection to my partner." Another commented, “the [face-to-face was my least favorite] for sure. I do not like conflict and in the [two VR conditions] it was easier to voice my opinion." We anticipated that the lack of facial expressions would be an issue. There is some evidence that it was in the surveys, but less than anticipated. Two people who mentioned embodied VR as their preferred interface said that they felt the lack of facial expressions, e.g. “But lack of facial expression made hard to know his actual feeling". Two people that preferred face-to-face mentioned facial expressions in their explanation. Three people that preferred face-to-face also mentioned that they it was easier to see expressions and understand their partner, which may include facial expressions. Paper 289 CHI 2018, April 21–26, 2018, Montréal, QC, Canada Discussion: Exit Interviews The exit interviews are particular useful for understanding the more social aspects of people’s experience with the system. Most people felt that having an avatar improved communication over the no_embodVR condition (28/52 mentioned this). For example, P10 commented “I think by seeing the avatar ... this changes the entire outcome of the conversation....I think based on body language ... it makes it easier to have a communication and to find an agreement." P59 noted how the body focused her attention: “Actually being able to see another body across the table, your focus ... you’re automatically drawn into that situation and you’re way more focused. And it is amazing how much being able to gesticulate, having that ability to gesticulate, how that is so much part of the communication process....". The richness of embodied communication and how it impacted decision making was highlighted by P30 “I was more influenced by his body language in [embodVR], because I could see if he really liked it, or he was just making an effort to make something different." 
In looking at the mechanisms that helped in communication, some people mentioned the role of gestures and pointing. Three people pointed out that the embodied avatar allowed them to anticipate what the other person was thinking earlier and prepare a response, for example "It was nice to ... actually see the person and then seeing them actually begin reaching to an object, so you can ... you're already thinking of your reaction to it, what you're going to say, ... so you're preemptively thinking of how am I going to speak to him about it, like what am I going to say, how am I going to disagree, like 'oh, I knew that piece, I didn't like that piece', alright, I'm already thinking about how I'm going to disagree with him." [P36]

People noticed the lack of facial expressions in the avatar, with ten participants suggesting that adding facial expressions would be helpful and three noting the need for eye contact. Five participants commented that it was easier to read expressions in the face-to-face condition.

Participant comments suggested that they feel more alone and cold in the no_embodVR condition and a much greater sense of social presence with the avatar (44 of 52 participants commented on this in some form). Eleven participants commented that it felt like the other avatar was standing directly in front of them, even though they knew the person was physically standing on the other side of the room and faced away from them. Indicative comments include: P60 saying "I felt like you could see me. That was the weird thing. ... That changes your behavior." and P59 replying "I really felt like I was talking to that grey thing." P21 commented, "I felt like the avatar was really interesting, because even though I couldn't see her facial expression, I could see her body movement and I felt her. I felt her presence there."

A surprising finding is some evidence that people changed the competitiveness of their interaction depending on the interface used. They felt they were being more considerate and empathetic when a body was present than in no_embodVR, where they were willing to be more competitive, aggressive and less likely to compromise. Nine participants made comments related to the no_embodVR interface depersonalizing the interaction. Four commented that they were more aggressive, more direct, more willing to argue or less compromising in no_embodVR. Two of these found the task easier in VR because it was depersonalized. Two other subjects thought they were more objective in the avatar condition, especially without a body. Two other participants felt less present in no_embodVR so wanted to control the situation more. A different dyad, outside of the nine, expressed that it was easier to "bicker" when not face-to-face. One other subject felt that body language made it easier to reach agreement. An additional person commented that turn taking was more formal in no_embodVR.

Some participants felt more comfortable in VR than in F2F. For example, P55 said "I just felt like it was easier for me to be more relaxed and more myself [in VR] because ... I don't know ... it just gave me a safe place to do it ... like, in person, I wouldn't want to upset you, but when it became virtual reality it was a little different ... like ... I don't know, I just felt like I relaxed more." This could be related to some participants feeling less revealed in VR, even if overall social presence was similar.

Table 3. Results from ANOVAs and significant post-hoc t-tests for all computed measures. Verbal and nonverbal measures are calculated on the annotation data from the floor plan task. Significance values for post-hoc results are reported in Figure 2.
Semantic Difference / Avg. Semantic Difference Score: F(2,116) = 18.07, p_adj < 0.001; F2F, embodVR > no_embodVR
Detailed Survey / Clarity of Communication: F(2,116) = 12.13, p_adj < 0.001; F2F, embodVR > no_embodVR
Detailed Survey / Social Awareness: F(2,116) = 91.30, p_adj < 0.001; F2F > embodVR > no_embodVR
Detailed Survey / Conversation Management: F(2,116) = 4.89, p_adj = 0.012; embodVR < no_embodVR (lower is better)
Detailed Survey / Disconnection to Partner: F(2,116) = 13.39, p_adj < 0.001; F2F, embodVR < no_embodVR (lower is better)
Utterance Type / Pragmatic: F(2,56) = 4.36, p_adj = 0.043; F2F > embodVR
Utterance Type / Social/Emotional: F(2,56) = 1.62, p_adj = 0.260; no significance
Utterance Type / Non-Task Discussion: F(2,56) = 0.89, p_adj = 0.500; no significance
Utterance Type / Backchannel: F(2,56) = 1.75, p_adj = 0.260; no significance
Utterance Type / Complete Reference: F(2,56) = 0.61, p_adj = 0.547; no significance
Utterance Type / Reference Pronoun: F(2,56) = 13.04, p_adj < 0.001; F2F, embodVR > no_embodVR
Turn Frequency / Conversational Turn Frequency: F(2,56) = 3.94, p = 0.025; embodVR > no_embodVR
Turn Type / Verbal Handover: F(2,56) = 1.77, p_adj = 0.179; no significance
Turn Type / Nonverbal Handover: F(2,56) = 3.65, p_adj = 0.064; no significance
Turn Type / Interruptions: F(2,56) = 6.9, p_adj = 0.001; F2F > embodVR, no_embodVR
Turn Type / No Marker: F(2,56) = 2.36, p_adj = 0.138; no significance
Gesture Behavior / Gesture Frequency: F(2,58) = 34.75, p_adj < 0.001; F2F, embodVR > no_embodVR
Gesture Behavior / Gesture Disparity: F(2,58) = 9.83, p_adj < 0.001; F2F, embodVR > no_embodVR
Gesture Type / Reference Object or Location: F(2,58) = 18.91, p_adj < 0.001; F2F, embodVR > no_embodVR
Gesture Type / Spatial or Distance: F(2,58) = 14.01, p_adj < 0.001; F2F, embodVR > no_embodVR
Gesture Type / Backchannel: F(2,58) = 2.06, p_adj = 0.14; no significance
Gesture Type / Representation: F(2,58) = 6.60, p_adj = 0.004; F2F, embodVR < no_embodVR
Gesture Type / Beat: F(2,58) = 5.46, p_adj = 0.008; no_embodVR > embodVR
Gesture Type / Self-adaptor: F(2,58) = 7.78, p_adj = 0.002; F2F, no_embodVR > embodVR
Novel Gesture Content / Percent of Blocks with Novel Content: F(2,58) = 53.94, p_adj < 0.001; F2F, embodVR > no_embodVR
Novel Gesture Content / Novel Content Blocks Per Minute: F(2,58) = 46.27, p_adj < 0.001; F2F, embodVR > no_embodVR

CONCLUSIONS
Embodied virtual reality and face-to-face interaction showed remarkably similar verbal and nonverbal communicative behavior, with the anticipated drop off for VR without bodies. Having a tracked body in the virtual world seems to help people feel that they are really interacting with another person: all but one subjective measure showed no significant difference for social presence between F2F and embodVR, with lower social awareness possibly reflecting the lack of facial information. There was a clear preference for including a body in the experience as people felt "alone" in no_embodVR and ratings dropped. Removing the body decreased referential pronoun usage and lowered the frequency with which participants took conversational turns.

There are, of course, limitations to the work. The first is that this study examined a particular context in which users have a shared visual work space. The activities included a negotiation task and a design task. Behavior may vary for different environments and different activities. A second limitation is that while we measure conversational behavior and subjective experience, we don't measure the effectiveness of the conversation. Both of these issues point to interesting follow-up work. For example, it would be interesting to examine social conversation to see whether facial motion plays a more dominant role here. Facial animation was excluded from this study both due to technical limitations and in order to focus on the impact of body movement. The study also used relatively low-fidelity models. It would be interesting to see if behavior and experience change with photo-realistic models that include facial animation.

Acknowledgements
We gratefully thank the team at Oculus Research. In particular, this work would not have been possible without Ronald Mallet, Alexandra Wayne, Matt Vitelli, Ammar Rizvi, Joycee Kavatur and the annotation team.

REFERENCES
1. U.S. Department of Energy Advanced Research Projects Agency Energy (ARPA-E). 2017. FACSIMILE APPEARANCE TO CREATE ENERGY SAVINGS (FACES). (2017).
2. Leila Alem and Jane Li. 2011. A study of gestures in a video-mediated collaborative assembly task. Advances in Human-Computer Interaction 2011 (2011), 1.
3. M. Argyle. 1988. Bodily communication. Taylor & Francis.
4. Michael Argyle and Mark Cook. 1976. Gaze and mutual gaze. (1976).
5. Jeremy N Bailenson, Jim Blascovich, Andrew C Beall, and Jack M Loomis. 2001. Equilibrium theory revisited: Mutual gaze and personal space in virtual environments. Presence: Teleoperators and Virtual Environments 10, 6 (2001), 583–598.
6. Atman Binstock. 2015. Powering the Rift. (2015). https://www.oculus.com/blog/powering-the-rift/
7. Frank Biocca, Judee Burgoon, Chad Harms, and Matt Stoner. 2001. Criteria and scope conditions for a theory and measure of social presence. Presence: Teleoperators and Virtual Environments (2001).
8. Frank Biocca, Chad Harms, and Judee K Burgoon. 2003. Toward a more robust theory and measure of social presence: Review and suggested criteria. Presence: Teleoperators and Virtual Environments 12, 5 (2003), 456–480.
9. Frank Biocca, Chad Harms, and Jenn Gregg. 2001. The networked minds measure of social presence: Pilot test of the factor structure and concurrent validity. In 4th annual international workshop on presence, Philadelphia, PA. 1–9.
10. Sara A Bly. 1988. A use of drawing surfaces in different collaborative settings. In Proceedings of the 1988 ACM conference on Computer-supported cooperative work. ACM, 250–256.
16.
11. A. Campbell and J. Rushton. 1978. Bodily communication and personality. The British Journal of Social and Clinical Psychology 17, 1 (1978), 31–36.
12. Herbert H Clark, Susan E Brennan, and others. 1991. Grounding in communication. Perspectives on Socially Shared Cognition 13, 1991 (1991), 127–149.
13. Herbert H Clark and Meredyth A Krych. 2004. Speaking while monitoring addressees for understanding. Journal of Memory and Language 50, 1 (2004), 62–81.
14. P. Ekman and W. V. Friesen. 1972. Hand movements. Journal of Communication 22 (1972), 353–374.
15. Susan R Fussell, Robert E Kraut, and Jane Siegel. 2000. Coordination of communication: Effects of shared visual context on collaborative work. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work. ACM, 21–30.
16. Susan R Fussell and Leslie D Setlock. 2014. Computer-mediated communication. Handbook of Language and Social Psychology. Oxford University Press, Oxford, UK (2014), 471–490.
17. Susan R Fussell, Leslie D Setlock, Jie Yang, Jiazhi Ou, Elizabeth Mauer, and Adam DI Kramer. 2004. Gestures over video streams to support remote collaboration on physical tasks. Human-Computer Interaction 19, 3 (2004), 273–309.
18. Darren Gergle, Robert E Kraut, and Susan R Fussell. 2004a. Action as language in a shared visual space. In Proceedings of the 2004 ACM Conference on Computer Supported Cooperative Work. ACM, 487–496.
19. Darren Gergle, Robert E Kraut, and Susan R Fussell. 2004b. Language efficiency and visual technology: Minimizing collaborative effort with visual information. Journal of Language and Social Psychology 23, 4 (2004), 491–517.
20. Darren Gergle, Robert E Kraut, and Susan R Fussell. 2013. Using visual information for grounding and awareness in collaborative tasks. Human–Computer Interaction 28, 1 (2013), 1–39.
21. Darren Gergle, Carolyn P Rosé, and Robert E Kraut. 2007. Modeling the impact of shared visual information on collaborative reference. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1543–1552.
22. Jörg Hauber, Holger Regenbrecht, Mark Billinghurst, and Andy Cockburn. 2006. Spatiality in videoconferencing: Trade-offs between efficiency and social presence. In Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work. ACM, 413–422.
23. Jörg Hauber, Holger Regenbrecht, Aimee Hills, Andrew Cockburn, and Mark Billinghurst. 2005. Social presence in two- and three-dimensional videoconferencing. (2005).
24. Aimèe Hills, Jörg Hauber, and Holger Regenbrecht. 2005. Videos in space: A study on presence in video mediating communication systems. In Proceedings of the 2005 International Conference on Augmented Tele-existence. ACM, 247–248.
25. Ellen A Isaacs and John C Tang. 1994. What video can and cannot do for collaboration: A case study. Multimedia Systems 2, 2 (1994), 63–73.
26. Steven Johnson, Madeleine Gibson, and Bilge Mutlu. 2015. Handheld or handsfree?: Remote collaboration via lightweight head-mounted displays and handheld devices. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 1825–1836.
27. Adam Kendon. 1967. Some functions of gaze-direction in social interaction. Acta Psychologica 26 (1967), 22–63.
28. David Kirk and Danae Stanton Fraser. 2006. Comparing remote gesture technologies for supporting collaborative physical tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1191–1200.
29. Robert M Krauss, Connie M Garlock, Peter D Bricker, and Lee E McMahon. 1977. The role of audible and visible back-channel responses in interpersonal communication. Journal of Personality and Social Psychology 35, 7 (1977), 523.
30. Robert E Kraut, Susan R Fussell, and Jane Siegel. 2003. Visual information as a conversational resource in collaborative physical tasks. Human-Computer Interaction 18, 1 (2003), 13–49.
31. Robert E Kraut, Steven H Lewis, and Lawrence W Swezey. 1982. Listener responsiveness and the coordination of conversation. Journal of Personality and Social Psychology 43, 4 (1982), 718.
32. Robert E Kraut, Mark D Miller, and Jane Siegel. 1996. Collaboration in performance of physical tasks: Effects on outcomes and communication. In Proceedings of the 1996 ACM Conference on Computer Supported Cooperative Work. ACM, 57–66.
33. David McNeill. 1992. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago.
34. D. McNeill. 2005. Gesture and Thought. University of Chicago Press.
35. Ian E Morley and Geoffrey M Stephenson. 1969. Interpersonal and inter-party exchange: A laboratory simulation of an industrial negotiation at the plant level. British Journal of Psychology 60, 4 (1969), 543–545.
36. Kristine L Nowak and Frank Biocca. 2003. The effect of the agency and anthropomorphism on users' sense of telepresence, copresence, and social presence in virtual environments. Presence: Teleoperators and Virtual Environments 12, 5 (2003), 481–494.
37. Brid O'Conaill, Steve Whittaker, and Sylvia Wilbur. 1993. Conversations over video conferences: An evaluation of the spoken aspects of video-mediated communication. Human-Computer Interaction 8, 4 (1993), 389–428.
38. Oculus. 2016. Oculus Touch Tutorial. Oculus Store. (2016).
39. Gary M Olson and Judith S Olson. 2000. Distance matters. Human-Computer Interaction 15, 2 (2000), 139–178.
40. Charles Egerton Osgood, George J Suci, and Percy H Tannenbaum. 1964. The Measurement of Meaning. University of Illinois Press.
41. Tomislav Pejsa, Michael Gleicher, and Bilge Mutlu. 2017. Who, Me? How Virtual Agents Can Shape Conversational Footing in Virtual Reality. Springer International Publishing, Cham, 347–359.
42. Jan Richter, Bruce H Thomas, Maki Sugimoto, and Masahiko Inami. 2007. Remote active tangible interactions. In Proceedings of the 1st International Conference on Tangible and Embedded Interaction. ACM, 39–42.
43. Abigail J Sellen. 1995. Remote conversations: The effects of mediating talk with technology. Human-Computer Interaction 10, 4 (1995), 401–444.
44. John Short, Ederyn Williams, and Bruce Christie. 1976. The Social Psychology of Telecommunications. (1976).
45. Linda Tickle-Degnen and Robert Rosenthal. 1990. The nature of rapport and its nonverbal correlates. Psychological Inquiry 1, 4 (1990), 285–293.
46. P. Waxer. 1977. Nonverbal cues for anxiety: An examination of emotional leakage. Journal of Abnormal Psychology 86, 3 (1977), 306–314.
47. Steve Whittaker. 2003. Theories and methods in mediated communication. The Handbook of Discourse Processes (2003), 243–286.
48. Steve Whittaker, Erik Geelhoed, and Elizabeth Robinson. 1993. Shared workspaces: How do they work and when are they useful? International Journal of Man-Machine Studies 39, 5 (1993), 813–842.