
Structuring dimensions for collaborative systems evaluation

2012, ACM Computing Surveys (CSUR)

https://doi.org/10.1145/2089125.2089128


Structuring Dimensions for Collaborative Systems Evaluation

PEDRO ANTUNES, University of Lisbon
VALERIA HERSKOVIC, SERGIO F. OCHOA, and JOSE A. PINO, University of Chile, Santiago

Collaborative systems evaluation is always necessary to determine the impact a solution will have on the individuals, groups, and the organization. Several methods of evaluation have been proposed. These methods comprise a variety of approaches with various goals. Thus, the need for a strategy to select the most appropriate method for a specific case is clear. This research work presents a detailed framework to evaluate collaborative systems according to given variables and performance levels. The proposal assumes that evaluation is an evolving process during the system lifecycle. Therefore, the framework, illustrated with two examples, is complemented with a collection of guidelines to evaluate collaborative systems according to product development status.

Categories and Subject Descriptors: H.5.3 [Information Interfaces and Presentation]: Group and Organization Interfaces—Evaluation/methodology, Theory and models; K.6.1 [Management of Computing and Information Systems]: Project and People Management—Life cycle, Systems development

General Terms: Measurement, Human Factors, Management

Additional Key Words and Phrases: Collaborative systems evaluation, human-computer interaction, interaction assessment, evaluation dimensions, evaluation guidelines

ACM Reference Format: Antunes, P., Herskovic, V., Ochoa, S. F., and Pino, J. A. 2012. Structuring dimensions for collaborative systems evaluation. ACM Comput. Surv. 44, 2, Article 8 (February 2012), 28 pages. DOI = 10.1145/2089125.2089128 http://doi.acm.org/10.1145/2089125.2089128

This article was partially supported by the Portuguese Foundation for Science and Technology (PTDC/EIA/102875/2008), a Conicyt PhD scholarship, Fondecyt (Chile) Grants No. 11060467 and 1080352, and LACCIR Project No. R0308LAC004.

Authors' addresses: P. Antunes, Department of Informatics, University of Lisbon; email: [email protected]; V. Herskovic, S. F. Ochoa, and J. A. Pino, Computer Science Department, University of Chile; emails: {vherskov, sochoa, jpino}@dcc.uchile.cl.

1. INTRODUCTION

The evaluation of collaborative systems is an important issue in the field of Computer Supported Cooperative Work (CSCW). Appropriate evaluation justifies investments, appraises stakeholders' satisfaction, or redirects systems development toward successful requirements matching. Several specific evaluation methods have been proposed [Herskovic et al. 2007] beyond those intended for Information Systems in general. However, many collaborative systems seem to be poorly evaluated. A study of 45 articles from eight years of the CSCW conference revealed that almost one third of the presented collaborative systems were not evaluated in a formal way [Pinelle and Gutwin 2000]. Even when evaluations are done, many of them seem to be performed in an ad hoc way, depending on the researchers' interests or practical adequacy for a specific setting [Inkpen et al. 2004; Greenberg and Buxton 2008]. This shows a need for a strategy that helps choose suitable collaborative systems evaluation methods.

This article proposes a framework to evaluate a collaborative system under development or procurement, as well as a set of guidelines to select the appropriate evaluation techniques. We understand evaluation as an evolving process that is in some way associated with the conception, design, construction, and deployment activities of a system development. The guidelines also address the case of a collaborative system being purchased by an organization. We consider two main structuring dimensions in order to frame the various contingencies of the evaluation process. One of these dimensions defines the set of relevant evaluation variables, and the other concerns the levels of human performance under evaluation. The considered evaluation variables are realism, generalization, precision, system detail, system scope, and invested time. The adopted levels of human performance consider role-based, rule-based, and knowledge-based tasks. The approach is generally applicable to all types of collaborative systems.

Section 2 analyzes the major problems associated with collaborative systems evaluation. Section 3 discusses the related work. In particular, it describes, categorizes, and compares several well-known evaluation methods. Section 4 describes the proposed framework for evaluation. Section 5 presents the collection of guidelines for evaluation. Section 6 contains two case studies of collaborative systems evaluation. Finally, Section 7 presents the conclusions and further work.

2. STUDYING COLLABORATIVE SYSTEMS EVALUATION

2.1. Why Is Collaborative Systems Evaluation So Difficult?

The success of a collaborative system depends on multiple factors, including group characteristics and dynamics, the social and organizational context in which it is inserted, and the positive and negative effects of technology on the group's tasks and processes. Therefore, evaluation should attempt to measure several effects on multiple, interdependent stakeholders and in various domains. What distinguishes collaborative systems from other information systems is indeed the need to evaluate their impact with an eclectic approach.

Ideally, a single collaborative systems evaluation method should cover the individual, group, and organizational domains, assessing whether or not the system is successful at the combination of those realms. Unfortunately, no such single method is currently available, and none may ever be. The fundamental cause is related to the granularity and time scale of the information obtained at these three domains [Newell 1990].

—The information pertaining to the individual is usually gathered at the cognitive level, focusing on events occurring in a timeframe on the order of a few minutes or even seconds.
—Group information is gathered at the interaction/communication level, addressing activities occurring in the range of several minutes and hours. —The information regarding organizational impact concerns much longer timeframes, usually in the order of days, months, and even years. Moreover, the results of an evaluation should be weighted by the degree of certainty in them, which depends on the maturity of what is being evaluated. At the inception phase, the product to be evaluated may be just a concept or a collection of design ideas, so the results have a high degree of uncertainty. When the development reaches full deployment, the product may then be tested in much more far-reaching and systematic ways, providing evaluators with an increased degree of certainty and relatively precise results. The dependence between product development and evaluation is noticeable in the star model illustrated in Figure 1 [Hix and Hartson 1993]. Evaluation is a central ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. Structuring Dimensions for Collaborative Systems Evaluation 8:3 Fig. 1. The dependence between product development and evaluation. aspect of a broad collection of activities aiming to develop a product, but it has to compete with the other activities for attention, relevance, and critical resources, such as people, time and money. 2.2. Why Evaluate and How to Evaluate? McGrath [1984] characterized the purpose of conducting an evaluation as addressing three main goals: precision, generalizability, and realism. The first goal concerns the precision of the data obtained by the instrument being used. This goal is inherently linked with the capability to control the dependent and independent variables, the subjects, and the experiment. Laboratory experiments are usually selected to accomplish this high level of control. Generalizability concerns the extent to which the obtained results may be applied to a population. High-set goals on generalizability usually imply adopting large-scale inquiries and surveys, while low generalizability is obtained by interviewing a small audience. Realism addresses how closely the obtained results represent real-world conditions by considering the work setting, the population of users, and the tasks, stimulus, time stress, and absence of observers, etc. Laboratory experiments have been criticized for providing low realism, especially with collaborative systems, whereas field studies have been considered to score high on realism but low on precision. Overall, the ideal evaluation should maximize the three goals; for instance, using multiple evaluation methods and triangulating the obtained results. Nevertheless, McGrath [1984] states this sort of evaluation would be very costly and difficult to carry out, which ultimately may have to be considered utopian. McGrath then identifies the major compromising strategies adopted to overcome the costs of an ideal evaluation: —field strategies to make direct observations of realistic work. —experimental strategies based on artificial experimental settings to study specific activities with high precision. ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. 8:4 P. Antunes et al. —respondent strategies to obtain evidence by sampling a large and representative population. —theoretical strategies that use theory to identify the specific variables of interest. 2.3. What to Evaluate? 
Pinsonneault and Kraemer [1989] defined one of the pioneering collaborative systems evaluation frameworks addressing the practical aspects related to the exact object under evaluation. The framework adopts an input-process-output view to conceptualize the relationship between the technology support and other factors related to the group, group behavior, and work context. —Contextual variables are the important factors in the group behavior. Contextual variables belong to five major categories: personal, situational, group structure, task characteristics, and technology characteristics (e.g., anonymity and type of communication). —Group process, as defined by the characteristics of the group interaction, including decisional characteristics, communicational characteristics, and interpersonal characteristics. —Outcomes of the group process as affected by the technology support, including taskrelated outcomes and group-related outcomes. This framework has been highly influential, especially because it created a common foundation for comparing multiple collaborative systems experiments. Also, the distinction between group process and outcomes highlights two quite different evaluation dimensions commonly found in the literature, the former usually addressing questions of meaning (e.g., ethnography [Hughes et al. 1994] and groupware walkthrough [Pinelle and Gutwin 2002]), and the latter addressing questions of cause and effect (e.g., value creation [Briggs et al. 2004]). Other collaborative systems evaluation frameworks, such as the ones proposed by Hollingshead and McGrath [1995] and Fjermestad and Hiltz [1999], are based on this framework. Regarding more recent evaluation frameworks, Neale et al. [2004] proposed a simplified evaluation framework, basically consisting of two categories. One encompasses the contextual variables already mentioned. The other category concerns the level of work coupling attained by the work group, which combines technology characteristics with group process characteristics. Along with this proposition, Neale et al. [2004] also recommend blending the different types of evaluation. Araujo et al. [2002] also proposed a simplified framework based on four dimensions: group context (which seems consensual in every framework), system usability, level of collaboration (similar to the level of work coupling), and cultural impact. Cultural impact is seen as influencing the other dimensions, thus introducing a feedback loop in the input-process-output view. 2.4. When to Evaluate? The timing of the evaluation is inherently associated with the development process. It is common to distinguish between the preliminary and final development stages [DeSanctis et al. 1994; Guy 2005]. The preliminary stage affords what has been designated formative evaluation [Scriven 1967], which mainly serves to provide feedback to the designers about the viability of design ideas, usability problems, perceived satisfaction with the technology, possible focal points for innovation, and alternative solutions, and also feedback about the development process itself. The final stage, which sometimes is designated summative evaluation, provides complete and definitive information about the developed product and its impact on the users, the group, and the organization. ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. Structuring Dimensions for Collaborative Systems Evaluation 8:5 3. 
RELATED WORK This section begins by describing the way we built a relevant corpus of papers to be analyzed, and also the literature review method used to classify the evaluation strategies. These retrieved strategies were split into two subsets: 1) evaluation methods that are presented in Section 3.2, and 2) evaluation frameworks that are described in Section 3.3. 3.1. Literature Review Methodology We began our search for articles in the literature concerning collaborative systems evaluation by exploring various ways to get a large initial corpus of papers. The main technique to obtain papers was to search using pertinent search engines, such as, Google Scholar and the ACM Digital Library, through combinations of keywords containing terms related with CSCW and evaluation (e.g., groupware evaluation, collaborative systems assessment, etc.). The proceedings from several relevant conferences and workshops in the area (e.g., CSCW, ECSCW, CHI, WETICE) were reviewed to find additional papers to add to the corpus. Then, we examined references of already found relevant papers, and searched through Google Scholar for papers citing those we had found. Each paper was carefully reviewed in order to determine if it had merit to be part of the set of preselected articles. The large set thus built was reduced by filtering out papers that did not present a distinctive evaluation proposal. The initial analysis of the corpus of papers identified several types of proposals to evaluate collaborative systems. Some papers presented ad hoc techniques or tools (e.g., questionnaires) defined specifically to evaluate a particular application. Such papers were not considered in our analysis because we were interested in finding evaluation methods with a clear and reusable evaluation strategy. Papers reporting just evaluation tools (i.e., single instruments intended to measure system variables) were also removed from the main corpus when they did not include an evaluation process. Once we filtered out the tools and nonreusable evaluation proposals, we analyzed the remaining contributions and realized those proposals could be classified in two categories: evaluation methods and evaluation frameworks. We define evaluation methods as procedures used to apply evaluation tools with a specific goal. For example, the Perceived Value evaluation method [Antunes and Costa 2003] uses evaluation tools such as questionnaires and checklists with the goal of determining the organizational impact of meetingware. We define evaluation frameworks as macro-strategies used to organize the evaluation process. Several evaluation methods and tools may be included in an evaluation framework. After classifying the articles into these two categories, the analysis of the contributions was focused on the evaluation methods category. Each subset was then expanded to include seminal evaluation methods that have been adapted to the collaboration context. The careful analysis of these selected papers led us to define a set of relevant inquiries that can be applied to each method in order to classify them more properly: (1) purpose of evaluation (why), (2) evaluation tools being used (how), (3) outcomes of the evaluation (what), and (4) moment in which the evaluation is conducted (when). Section 3.2 presents this classification, which is complemented with a narrative summary of the procedures adopted by each method. 
Moreover, we classified the evaluation methods by publication date, which served to build an understanding of their emergence and subsequent life. This classification allowed us to construct the timeline presented in Appendix A. The timeline analysis shows some identifiable patterns.

—The adaptation of single-user evaluation methods, developed in the Human-Computer Interaction field, to the specific context of collaborative systems. This has occurred, for instance, with walkthroughs (structured walkthroughs, cognitive walkthroughs, groupware walkthroughs), heuristic evaluation (heuristic evaluation, heuristic evaluation based on the mechanics of collaboration), and scenario-based evaluation.
—The assimilation of perspectives, methods, and techniques from other fields beyond technology development. The clearest example is ethnography (observational studies, quick-and-dirty ethnography, workplace studies), but the cognitive sciences also seem to have an impact (KLM, cognitive walkthroughs, computational GOMS).
—The increasing complexity of the evaluation context. Most early methods (e.g., structured walkthroughs, KLM, discount methods) seem to focus on very specific variables measured under controlled conditions, while some of the later methods seem to consider broader contextual issues (e.g., multi-faceted evaluation, perceived value, evaluating collaboration in co-located environments, lifecycle-based approach).

Finally, we also analyzed the proposals concerning evaluation frameworks. Section 3.3 presents the most representative ones.

3.2. Sample of Evaluation Methods

This section presents a sample of collaborative systems evaluation methods. Table I presents a summarized characterization of the selected evaluation methods, describing the purpose of the evaluation (why), the evaluation tools being used in each method (how), the outcomes of the evaluation (what), and the moment in which evaluation is conducted (when). Then, we present a brief description of the steps involved in each evaluation method.

Table I. Characterization of Evaluation Methods

Method | Why | How | What | When
GHE | Precision | Software analysis, checklist | Effectiveness, efficiency, satisfaction | Summative
GWA | Precision | Software analysis | Effectiveness, efficiency, satisfaction | Formative
CUA | Precision | Software analysis | Effectiveness, efficiency, satisfaction | Formative
GOT | Realism | Observation, checklist | Effectiveness, efficiency, satisfaction | Summative
HPM | Precision | Interaction analysis | Group performance | Formative
QDE | Realism | Observation | Redesign | Summative
PAN | Generalizability | Formal analysis | Efficiency | Formative
PVA | Realism | Questionnaire, checklist | Organizational impact | Formative
SBE | Realism/Precision | Interviews | Organizational contributions | Formative
COS | Realism | Interviews, observation | Redesign | Formative
TTM | Generalizability | Interviews, observation | Predicted actual use | Formative
KMA | Generalizability | Software analysis, checklist | Knowledge circulation | Formative

Groupware Heuristic Evaluation (GHE) [Baker et al. 2002]. GHE is based on eight groupware heuristics, which act as a checklist of characteristics a collaborative system should have. Evaluators who are experts in them examine the interface, recording each problem they encounter, the violated heuristic, a severity rating, and optionally, a solution to the problem. The problems are then filtered, classified, and consolidated into a list, which is used to improve the application.
Groupware Walkthrough (GWA) [Pinelle and Gutwin 2002]. A scenario is a description of an activity or set of tasks, which includes the users, their knowledge, the ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. Structuring Dimensions for Collaborative Systems Evaluation 8:7 intended outcome, and circumstances surrounding it. Evaluators construct scenarios by observing users and identifying episodes of collaboration. Each evaluator, taking the role of all users or one in particular, walks through the tasks in a laboratory setting, recording each problem he encounters. A meeting is then conducted to analyze the results of the evaluation. Collaboration Usability Analysis (CUA) [Pinelle et al. 2003]. Evaluators map collaborative actions to a set of collaboration mechanisms, or fine-grained representations of basic collaborative actions, which may be related with elements in the user interface. The resulting diagrams capture details about task components, a notion of the flow through them, and the task distribution. Groupware Observational User Testing (GOT) [Gutwin and Greenberg 2000]. GOT involves evaluators observing how users perform particular tasks supported by a system in a laboratory setting. Evaluators either monitor users having problems with a task, or ask users to think aloud about what they are doing to gain insight into their work. Evaluators focus on collaboration and analyze users’ work through predefined criteria, such as the mechanics of collaboration. Human-Performance Models (HPM) [Antunes et al. 2006]. Evaluators first decompose the physical interface into several shared workspaces. Then, they define critical scenarios focused on the collaborative actions for the shared workspaces. Finally, evaluators compare group performance in the critical scenarios to predict execution times. “Quick-and-dirty” Ethnography (QDE) [Hughes et al. 1994]. Evaluators do brief ethnographic workplace studies to provide a general sense of the setting for designers. QDE suggests the deficiencies of a system, supplying designers with the key issues that bear on acceptability and usability, thus allowing existing and future systems to be improved. Performance Analysis (PAN) [Baeza-Yates and Pino 1997; Baeza-Yates and Pino 2006]. The application to be studied is modeled as a task to be performed by a number of people in a number of stages, and the concepts of result quality, time, and total amount of work done are defined. The evaluators must define a way to compute the quality (e.g., group recall in a collaborative retrieval task), and maximize the quality vs. work done, either analytically or experimentally. Perceived Value (PVA) [Antunes and Costa 2003]. PVA begins by developers identifying relevant components for system evaluation. Then, users and developers negotiate the relevant system attributes to be evaluated by users. After the users have worked with the system, they fill out an evaluation map by noting whether the components support the attributes or not. Using these ratings, a metric representing the PV is calculated. Scenario-Based Evaluation (SBE) [Haynes et al. 2004]. SBE uses field evaluation. Evaluators perform semi-structured interviews with users to discover scenarios, or detailed descriptions of activities, and claims about them. Then, focus groups validate these findings. 
The frequency and percentage of positive claims help quantify the organizational contributions of the system, and the positive and negative claims about existing and envisioned features provide information to aid in redesign. Cooperation Scenarios (COS) [Stiemerling and Cremers 1998]. Evaluators conduct field studies, semi-structured interviews, and workplace visits. They thus identify scenarios, cooperative behavior, users involved and their roles, and the relevant context. For each role involved in the cooperative activity, evaluators analyze the new design to see how the task changes and who benefits from the new technology. Then, the prototype is presented as a scenario in a workshop with users to discover design flaws. Knowledge Management Approach (KMA) [Vizcaı́no et al. 2005]. Evaluation using KMA measures whether the system helps users detect knowledge flows and ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. 8:8 P. Antunes et al. disseminate, store and reuse knowledge. The knowledge circulation process is comprised of six phases (knowledge creation, accumulation, sharing, utilization, internalization), which are also the areas evaluated by this approach. The evaluation is performed by answering questions associated with each area. Technology Transition Model (TTM) [Briggs et al. 1998]. TTM predicts the actual system use as a function of the intent to use the system, the value that users attribute to it, how frequently it will be used, and the perceived cost of transition. This model proposes that users weigh all factors affecting the perceived value of a system, producing an overall value corresponding to their perception of the usefulness of the system. Users’ opinions are obtained by interviews, archival analysis, and observations. These opinions are the basis to predict actual use of the system. The collaborative application can thus be evaluated to increase the speed of its acceptance, while reducing the risk of technology transition. 3.3. Evaluation Frameworks This section presents existing macro-strategies to performing evaluation. Several frameworks adopt an input-process-output view [Pinsonneault and Kraemer 1989; Ross et al. 1995; Damianos et al. 1999; Araujo et al. 2002; Huang 2005], while others include evaluation in the software development cycle [Hix and Hartson 1993; Baecker et al. 1995; Veld et al. 2003; Huang 2005]. The star model [Hix and Hartson 1993] proposes evaluation as the central phase in the software development cycle. This means evaluation should be conducted after every development step. Baecker et al. [1995] regard development as an iterative process of design, implementation, and evolution, and apply appropriate evaluation methods after each development phase. The concept design is evaluated through interviews, the functional design through usability tests, the prototype through heuristics, the delivered system through usability tests, and finally, the system evolution is evaluated through interviews and questionnaires. Huang [2005] proposes a lifecycle strategy. An evaluation plan is defined before starting development, considering five domains: context, content, process, stakeholders, and success factors. The plan is improved at each cycle after analyzing the evaluation results. The E-MAGINE framework [Veld et al. 2003] has a similar structure: first, a meeting and an interview are done to establish the evaluation goals and group profile. 
This information guides the selection of evaluation methods and tools which will be used. Damianos et al. [1999] present a framework based on Pinsonneault and Kraemer’s proposal [1989]. The framework has four levels: requirement, capability, service, and technology. Appropriate methods should be selected at each level to conduct the evaluation. At the requirement level, evaluation concerns the overall system quality. At the capability level, evaluation addresses the system capabilities. At the service level, evaluation is focused on performance and cost. Finally, the technology level concerns benchmarking technical issues. The PETRA strategy combines the perspective of the evaluator and the perspective of the users, or participants [Ross et al. 1995]. In this way, it aims to achieve a balance between theoretical and practical methods. The CSCW Lab proposes four dimensions to consider when evaluating collaborative systems: group context, usability, collaboration, and cultural impact [Araujo et al. 2002]. Each dimension is a step of the evaluation process, which consists of characterizing the group and work context, measuring usability strengths and weaknesses and collaboration capabilities, and studying the impact of the application over time. ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. Structuring Dimensions for Collaborative Systems Evaluation 8:9 Fig. 2. Variables adopted for the evaluation framework. 4. COLLABORATIVE SYSTEMS EVALUATION FRAMEWORK 4.1. Variables Section 2.1 introduced the need to choose variables to assess a collaborative system under development. We should, then, characterize our framework according to a set of variables providing insights on the evaluation methods to be applied. A starting point is McGrath’s evaluation goals mentioned in Section 2.2. These goals are fundamental to laying out the evaluation methodology. For our evaluation framework, it seems thus appropriate to choose variables associated to these goals; if the evaluation methods change in succeeding evaluations, these variables will reflect the new evaluation methodology, illustrated in Figure 2. Precision, generalization, and realism are our first three variables to describe the evaluation method. Precision focuses on the accuracy of the measuring tools, generalization concerns the extent (in terms of population) to which the method must be applied, and realism refers to whether the evaluation will use real settings or not. It is important to incorporate the level of system detail (depth) as one of the dimensions characterizing the evaluation activities. This dimension concerns the granularity of the evaluation. Evaluation methods with a high level of system detail (e.g., mouse movements of a user) will provide more specific and accurate information to improve the system under review. Another dimension we would like to incorporate in the evaluation framework is the scope (breadth) of the system being evaluated. An evaluation having a large value for this variable would mean the system being evaluated has many functionalities and components being assessed. This variable complements the detail dimension. The breadth dimension can help identify the scope of a system that could be covered with a particular evaluation method. We note that while the first three variables in our framework consider theoretical issues, the system detail and scope concern the product development state. 
Finally, an invested time variable describes the time used by the evaluators to carry out the work. This variable may not be completely independent from other variables, notably, detail and scope (since, e.g., a coarse-grain evaluation narrowing to a few functionalities will probably require little invested time). However, from a more practical ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. 8:10 P. Antunes et al. standpoint, it is an important variable to distinguish the efficient evaluation methods from those which are not. Therefore, invested time is included in the framework. Other variables could be considered for our framework; however after analyzing several of these variables—such as evaluation cost/effort, feedback richness, or required expertise—we realized they could be inferred in some way by relating the results of the proposed dimensions. Moreover, the selected variables seem adequate to analyze evaluation methods, as shown in following sections. Figure 2 shows a radar-graph representation of the evaluation variables. A specific method is represented by a dot in each of the axes (variables). Each axis has a scale from 0 (or minimum value) in the origin to a certain maximum value. These dots may be joined to show a certain evaluation shape. It may be noticed a numeric value for the area within a shape does not make much sense, since the scales are not the same for each variable. However, a light evaluation procedure will probably have low values for most or all variables, whereas a heavy one will probably score high in several evaluation variables. 4.2. Performance Levels Reason [2008] proposed a three-layered model of human performance in organizational contexts by extending a proposal by Rasmussen and Jensen [1974]. We will apply this model to the specific context of collaborative systems evaluation. The model categorizes human performance according to two dimensions: situation and situation control. According to the situation dimension, the organizational activities may be classified as (1) routine, when the activities are well known by the performers and accomplished in an almost unconscious way; (2) planned, when the activities have been previously analyzed by the organization and thus there are available plans and procedures to guide the performers accomplishing the intended goals; and (3) novel, when the way to achieve the intended goals is unknown to the organization and thus human performance must include problem analysis and decision-making activities. Shared workspaces, workflow systems, and group support systems are good examples of collaborative systems technology supporting the routine, planned, and novel dimensions. The other dimension concerns the level of control the performers may exert while accomplishing the set goals. The control may be mechanical, when a human action is performed according to a predefined sequence imposed by the technology. The control may be human, when the technology does not impose any predefined action sequence. Finally the control may be mixed, when it opportunistically flows between humans and the technology. These two dimensions serve to lay down the following performance levels as illustrated in Figure 3. —Role-based performance encompass routine tasks performed with mechanical control at the individual level. Any group activity at this level is basically considered as a collection of independent activities. 
—Rule-based performance concerns tasks accomplished with some latitude of decision from humans but within the constraints of a specific plan imposed by technology. Unlike the previous level, the group activities are perceived as a collection of coordinated activities.
—Knowledge-based performance concerns interdependent tasks performed by humans in the scope of group and organizational goals.

Fig. 3. Performance levels, adapted from Reason [2008].

This model highlights the increasing sophistication of human activity, in which simple (from the perspective of the organization) individual roles are complemented with more complex coordinated activities and supplemented by even more complex knowledge-based and information-rich activities. The group becomes more important than the individual. We will use this model to delineate three distinct collaborative systems evaluation scenarios.

4.3. Evaluation Scenarios

Our evaluation scenarios follow the three-layer view previously mentioned.

Role-based scenario. The evaluation data is gathered at the individuals' cognitive level, focusing on events occurring during a timeframe in the order of minutes or even seconds. The most adequate evaluation methods to employ in this scenario adopt laboratory settings and considerable instrumentation (e.g., key logging). To gather the data, the evaluators must accurately specify the roles and activities, and the subjects must act exactly according to the instructions, under strict mechanical control. In these circumstances the system detail is high (e.g., keystrokes and mouse movements) but the system scope is low (e.g., roles associated with some particular functions). This scenario also trades off realism towards higher precision and generalizability. The time invested in this type of evaluation tends to be low and is mostly used in the preparation of the experiment. The various trade-offs associated with this evaluation scenario are illustrated in Figure 4.

Fig. 4. Role-based evaluation.

Rule-based scenario. The evaluation data now concerns several subjects who must coordinate themselves to accomplish a set of tasks. The relevant events occur over several minutes and hours, instead of minutes or less. The system details being considered have large granularity (e.g., exchanged messages instead of keystrokes). The system scope also increases to include more functions. The evaluation methods employed in this scenario may still adopt laboratory settings, although using less instrumentation. This scenario also represents trading off realism in favor of precision and generalizability. As with the role-based scenario, the evaluators must plan the subjects' activities in advance; however, the subjects should be given more autonomy since control concerns the coordination level and not individual actions. The time invested in this type of evaluation is higher than in the previous case, since the data gathering takes more time and the data analysis is less straightforward (e.g., requiring debriefing by the participants). The trade-offs associated with this evaluation scenario are illustrated in Figure 5.

Fig. 5. Rule-based scenario.

Knowledge-based scenario. The evaluation is focused mostly on the organizational impact, and thus concerns much longer timeframes, usually on the order of days, months, and even years, since the technology assimilation and the perception of value to the organization may take a long time to emerge and stabilize. The evaluation scenario is also considerably different when compared to the other scenarios, involving, for instance, knowledge management, creativity, and decision-making abilities. Considering these main goals, it is understandable that the system detail has coarse granularity, favoring broad issues such as perceived utility or value to business. The system scope may be wider for exactly the same reason. The evaluators may not specify the roles and activities beforehand, since the subjects have significant latitude for decision, which leads to open situations beyond the control of the evaluators. Considering the focus on knowledge, the trade-off is usually to reduce the precision and generalizability in favor of realism. All these differences imply the laboratory setting is not the most appropriate for the knowledge-based scenario, and point more in favor of more qualitative settings. Two examples of such evaluation methods employed in this scenario are case studies and ethnographic studies. These techniques need significant time to gather data in the field, and also time to transcribe, code, and analyze the obtained data. The trade-offs associated with this evaluation scenario are illustrated in Figure 6.

Fig. 6. Knowledge-based scenario.

4.4. Discussion

Analysis of the scenarios described above highlights interesting issues to ponder when considering a collaborative system evaluation. Regarding the ensemble of variables, the rule-based scenario seems to be the most balanced in the adopted trade-offs. On the contrary, the role-based and knowledge-based scenarios show a clear tendency for the extremes. The role-based scenario emphasizes detail, precision, generalization, and time at the cost of scope and realism. On the opposite side, the knowledge-based scenario shows a clear emphasis on scope and realism at the expense of detail, precision, generalization, and time. These differences highlight the so-called instrumentalist and inter-subjectivist strategies, which have been quite influential in the CSCW field [Pidd 1996; Neale et al. 2004; Guy 2005]. The instrumentalist strategy is mostly focused on accumulating knowledge through experimentation, whereas the inter-subjectivist strategy is concerned with interpreting the influences of the technology on individuals, groups, and the organization.

Analysis of some individual variables may also give additional insights about the collaborative system evaluation. One such variable is invested time, which is distinct for the three discussed scenarios. From a very pragmatic perspective, the selection of the evaluation scenarios could be based on the time one is willing to invest in the evaluation process. Such considerations would lead to a preference for the rule-based and role-based scenarios and a devaluation of the knowledge-based scenarios. Nevertheless, this approach may not be feasible due to lack of system detail, for example, whenever evaluating design ideas. This approach also has some negative implications, such as emphasizing details of little importance to the organization.

System detail and scope are also related to the strategy adopted in developing the system.
For instance, a breadth-first strategy indicates a strong initial focus on broad functionality, which would mandate an evaluation starting with a knowledge-based scenario that later on continues with role-based and rule-based scenarios. On the contrary, a depth-first strategy indicates a strong preference for ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. 8:14 P. Antunes et al. Fig. 7. Evaluation lifecycle. fully developing a small functional set, which would mandate an evaluation starting with a role-based scenario that then proceeds with rule-based and knowledge-based scenarios. Figure 7 provides an overview of these evaluation issues. The two dotted lines show the limits suggested by the three evaluation scenarios. The arrows show the possible directions of the evaluation strategy and their basic assumptions. The arrows in Figure 7 indicate possible evaluation processes adopted according to various biases. The arrow’s starting point indicates which type of evaluation should be done first, while the end point suggests where to finish the evaluation process. This graphical representation also affords equating the collaborative system evaluation on other dimensions. For instance, the specific control and situation characteristics of one particular system may determine the effort involved in evaluation. Consider a database under evaluation that only supports mechanical control. Then, we may reckon the low dotted line shown on Figure 7 corresponds to the most adequate evaluation. An instrumentalist strategy should be adopted, assessing for instance the database usability. In the case of a workflow system, where control is mixed between the system and the users, we may consider the evaluation should be extended beyond the instrumentalist strategy, for example, contemplating the conformity of the system with organizational procedures and rules. 5. EVALUATION GUIDELINES This section presents a set of guidelines for the techniques and instruments used in collaborative systems evaluation. Figure 8 shows the evaluation methods which were presented in Section 3.2, organized by considering the role, rule, and knowledge-based categories. The knowledge-based evaluation emphasizes variables pertaining more to the organization and group than to the individual performance. Examples of metrics which can be delivered by these methods include interaction, participation, satisfaction, consensus, usefulness, and cost reductions. On the contrary, the role-based evaluation stresses the importance of the individual performance. Metrics that can be obtained using these methods are efficiency and usability. ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. Structuring Dimensions for Collaborative Systems Evaluation 8:15 Fig. 8. Classification of evaluation methods. The rule-based evaluation may be seen as being in the middle of the extremes. Some metrics may include the organizational goals, for example, conformance to regulations, while others may concern group performance, such as productivity. Some methods may belong to one or two categories depending on the nature of the instruments involved in each method, for example, cooperation scenarios (COS) are located in the area between rule-and knowledge-based evaluation methods, because it has elements belonging to both categories. This classification allows evaluators to choose an appropriate method for their particular evaluation scenario. 
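As an illustration of how this classification can be operationalized, the sketch below (not part of the original article; the names TYPICAL_METRICS and categories_for are hypothetical) encodes the role-, rule-, and knowledge-based categories together with the metrics the text above says they typically deliver, so an evaluator can look up which category to start from given the metric of interest. The assignment of individual methods to categories (Figure 8) is not reproduced here.

```python
# Illustrative sketch of the Section 5 classification (assumed helper names,
# metric lists taken from the examples given in the text above).

TYPICAL_METRICS = {
    "role-based": {"efficiency", "usability"},
    "rule-based": {"conformance to regulations", "productivity"},
    "knowledge-based": {"interaction", "participation", "satisfaction",
                        "consensus", "usefulness", "cost reductions"},
}

def categories_for(metric: str) -> list[str]:
    """Return the evaluation categories whose methods typically deliver `metric`."""
    return [cat for cat, metrics in TYPICAL_METRICS.items() if metric in metrics]

if __name__ == "__main__":
    print(categories_for("usability"))     # ['role-based']
    print(categories_for("productivity"))  # ['rule-based']
```

A lookup of this kind only narrows the search to a category; choosing a concrete method within that category still depends on the development status guidelines discussed next.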
We have also developed guidelines to select an evaluation method that depends on the development status of the product being assessed. We consider which of the following stages the product is in: conception (during analysis and design), implementation (during coding and software refinement), production (the product is already being used), reengineering (the product is being structurally redesigned), or procurement (the product is going to be acquired by the organization). Figure 9 presents a summary of these guidelines. The rationale behind these recommendations is closely related to evaluation activities embedded in a typical software process. Validating the proposal of a collaborative system is mandatory during product conception or implementation phases. This validation typically involves a knowledge-based method intended to assess product usefulness for the organization. Further evaluation is usually justified if the results from this initial assessment are satisfactory, but the product requires some improvements. Following the same line of reasoning, rule-based methods should be applied before role-based. If we want to evaluate already-implemented products (i.e., products in production, reengineering, or procurement stage), the most suitable evaluation method will depend on what triggered the evaluation process, for example, refinement, redesign, or acquisition of a product. All guidelines and the rationale behind each are described in the following. If the product to be evaluated is in the conception stage, then the evaluation should be oriented towards obtaining coarse-grain information to help understand the role of the tool within the organization, the users’ expectations and needs, the business case, and the work context. This information, usually obtained from knowledge-based ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. 8:16 P. Antunes et al. Fig. 9. Summary of guidelines for selection of evaluation methods. evaluation methods, may be very useful to specify or refine the user and software requirements, to establish the system scope, to identify product/business risks, and to validate a product design. Performing this assessment, evaluators should adopt an inter-subjectivist view over the collected data, considering qualitative and interactive ways to obtain data, and using various activities such as field studies, focus groups, and meetings. Rule-based and role-based methods do not provide a clear benefit in this stage because they require, at least, having a prototype of the system. If the product is in the implementation stage, a knowledge-based evaluation method is recommended, because it will serve to understand if the product can address organizational goals. This evaluation also provides coarse-grain information concerning the issues/components requiring improvements. This type of evaluation is optional if the product was already evaluated with a knowledge-based method during the conception phase. However, the implemented product could differ from its design; therefore, assuming the implemented system is still aligned with the organization needs and users’ expectations could be a mistake. When the available time and budget allow additional evaluation actions, the process may be complemented with a rule-based evaluation method, which would provide information necessary to adjust the product to the actual working scenario. For example, adjustments to concrete business processes may be identified this way. 
Optionally, a role-based evaluation method could also be used to fine-tune the product to the users. In case of rule/role-based assessments, the evaluation setting may be configured to assess the users’ activities in a controlled or mixed environment, which may utilize ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. Structuring Dimensions for Collaborative Systems Evaluation 8:17 laboratory settings. The evaluators may also adopt a more experimental view of the collected data. An evaluation may also serve to determine the current impact a system in production has on its business operations. Therefore, the first recommended evaluation action considers diagnosing the current situation using precise information obtained from the actual production system. A role-based evaluation method may then be used to gather such information. As with the previous case, if the available time and budget allow additional evaluation actions, rule-based and knowledge-based methods could subsequently be applied. The aim could be identifying concrete performance issues and improving organizational behavior. Rule-based methods will provide performance diagnosis information and knowledge-based methods will contribute to identifying the impact of the legacy system at an organizational level. Many organizations often decide to reengineer a legacy system. The main purpose is to change the organizational behavior by extending the system support. The existing system may be used to guide this reengineering. In such a case it is recommended to start with a rule-based evaluation to avoid anchoring the evaluation on too finegrained or too coarse-grained information. This type of evaluation helps identify particular improvement areas, which should be addressed in the reengineering process. Nevertheless, a subsequent knowledge-based evaluation may determine the impact of the reengineered product on the organizational strategy. If the reengineering process involves significant changes to the systems’ functions, user interfaces or interaction paradigms, a role-based evaluation may also be recommended. It allows focusing the evaluation on particular components and also getting fine-grain and accurate information to perform the reengineering. Often an evaluation action occurs when procuring a product. In such cases, the evaluation should start with a knowledge-based method, in order to understand whether the system functionality matches the organizational needs. Eventually, if the evaluators must also assess the system support of the organizational context and specific business processes, then the recommendation is to perform a rule-based evaluation, which will identify strengths and weaknesses of the product as support of particular activities in the organization. Besides these generic recommendations, the evaluators should also ponder the specific characteristics of the product under evaluation, namely the control and situation dimensions, which impact the evaluation scenario. The knowledge-based evaluation is naturally most adequate to products giving latitude of decision to the users and supporting interaction, collaboration, and decision-making. The evaluators should also ponder risk analysis. The risk adverse evaluator will set up a complete evaluation process by considering a combination of the three evaluation types, starting with knowledge-based and finishing with role-based scenarios. The risk taker evaluator will probably concentrate the evaluation only on the knowledge-based issues. 
The payoff of this high-risk approach is streamlining the evaluation efforts while focusing upon the issues that may have highest impact on the organization. The associated risk is the potential lack of quality of the outcomes. 6. THE COLLABORATIVE SYSTEM INCREMENTAL EVALUATION PROCESS This section describes two case studies of collaborative systems evaluation. The first one involves the evaluation of a requirements inspection tool for a governmental agency [Antunes et al. 2006; Ferreira et al. 2009]. The second one shows the evaluation processes of a mobile-shared workspace supporting construction inspection activities for a private construction company [Ochoa et al. 2008]. ACM Computing Surveys, Vol. 44, No. 2, Article 8, Publication date: February 2012. 8:18 P. Antunes et al. 6.1. Evaluation of a Collaborative Software-Requirements Inspection Tool Software-requirements inspection is a well-known software engineering task. It engages a group of reviewers in the process of evaluating how well a software product under development accomplishes a set of previously established requirements. In a very simplified view, the tool under evaluation requests a group of software reviewers to synchronously complete a matrix with their perceived correlations between software requirements and specifications (from totally irrelevant to highly relevant). This matrix allows the reviewers to identify areas where software development has been underachieving and also to define priorities for further developing technical specifications. This tool has been subject to two formal evaluation procedures, the first one being a knowledge-based evaluation and the second, a role-based evaluation. The next sections briefly describe the new procedures, then we present a discussion about the overall evaluation process. 6.1.1. Knowledge-Based Evaluation. From a goal-oriented perspective, the major goal is to obtain a matrix of correlations expressing the reviewers’ perspectives, expectations, and worries about the software under development. The selection of correlations is necessarily a qualitative task in which the reviewers must agree upon the most appropriate link between what is being implemented and how the implementation corresponds to the reviewers’ expectations. This task is naturally complex because there are several reviewers involved who may have different perspectives about the software application, interpretations of what is involved in application development, hidden agendas, etc. The tool supports the negotiation and reconciliation of these conflicting views. Taking these problems into consideration, the initial evaluation step was focused on assessing the value brought by the tool to the evaluators, not only on assessing the software development but also on resolving their conflicting views in a productive and satisfactory way. This initial evaluation step therefore focused on knowledge-based issues. The adopted evaluation method was based on Cooperation Scenarios (COS) [Stiemerling and Cremers 1998] using scenario-based workshops to elicit design flaws. The evaluation procedure was set up as follows. The tool was evaluated in two pilot experiments involving two reviewers each. All of the reviewers were knowledgeable in software development, project management, requirements negotiation with outsourcing organizations, and software analysis and design. The pilot experiments were accomplished in the reviewers’ workplace, which was a governmental agency responsible for the national pension system. 
The participants' task was to assess a project concerning the introduction of a new formula for computing pensions in the future. The specific goal set for the pilot experiments was to construct a matrix correlating a list of user requirements with a list of technical requirements, so that priorities could be set early in the project. The lists of user and technical requirements were specified at the beginning of the pilot experiments with help and approval from one of the most experienced participants. The evaluation itself was thus focused on negotiating and completing the correlations. The matrix under evaluation had 8 × 24 = 192 potential correlations to evaluate.

Each pilot experiment started with a brief tutorial about the tool, which took approximately 15 minutes. Then, a pair of reviewers used the tool until a consensus was obtained. During the experiment, whenever necessary, additional help about the tool was provided to the reviewers. Afterwards, we asked the reviewers to complete a questionnaire with open questions about the tool's most positive and negative aspects, as well as closed questions concerning the tool's functionality and usability. The results are displayed in Table II.

Table II. Results from the Questionnaire. The closed questions covered Functionality (convenience: available functions and their appropriateness; accuracy: reflecting the users' opinions; agreement: with the inspection method) and Usability (comprehension: understanding the tool; learning: how to use the tool; operability: effort controlling the inspection), each scored on a scale from 1 (lowest) to 5 (highest).

Regarding functional issues, the obtained results indicate that the tool was convenient to use and accurate concerning the evaluators' view of the project. We also obtained positive indications about the consensus mechanism built into the tool, the reviewers' understanding of the overall positions of others, the ease of finding agreements, and the simplicity of revising their own opinions. Additionally, the reviewers agreed that the outcomes reflected their own opinions.

Concerning usability issues, the obtained results indicated that the participants could understand the working logic behind the tool and that they easily learned to deal with its functionality, as well as with the negotiation process. However, it was pointed out that the tool was difficult for inexperienced users to use. Another drawback was related to poor performance, since the tool spent too much time synchronizing data. Several other minor functional and user interface details were also raised by the participants, for example, the absence of graphical information and the difficulties in obtaining a summary view of the negotiation.

Another interesting outcome from this evaluation was evidence that the tool provided learning opportunities. Participants obtained new insights about negotiating software requirements. These two pilot experiments thus gave very rich indications about the value of this tool to the organization and to the group, as well as potential areas for improving the tool. The adopted evaluation approach also proved adequate to elicit knowledge-based design flaws and come up with design recommendations.

6.1.2. Role-Based Evaluation. The second formal evaluation procedure was aimed at evaluating in detail the user interaction with the tool. It was therefore a role-based evaluation.
The user interaction with the tool was centered on the notion of shared workspace. Shared workspaces are becoming ubiquitous, allowing users to share information and to organize activities in very flexible and dynamic ways, usually relying on a simple graphical metaphor. This evaluation procedure thus aimed at optimizing the shared workspace use, assuming such optimization would increase the evaluators' already positive opinion about the tool. The adopted evaluation approach was to analytically devise different options for shared workspace use and predict their performance. The method applied well-known human information processing models to measure the shared workspace performance and to draw conclusions about several design options. The adopted model was the Keystroke-Level Model (KLM) [Card et al. 1980]. KLM is relatively simple to use and has been successfully applied to evaluate single-user applications, although it had to be adapted to the collaborative systems context for this evaluation [Ferreira et al. 2009]. Based on KLM, each user interaction may be converted into a sequence of mental and motor operators, whose individual execution times have been empirically established and validated by psychological experiments. This way we could find out which sequence of operators would minimize the execution time of a particular shared workspace implementation. We modeled three low-level functions associated with the shared workspace usage: locating correlations, selecting correlations, and negotiating correlation values. Several alternative designs for these functionalities were analytically evaluated. The adopted approach offered a common criterion, based on execution time, to compare the various implementations and find out which implementation would offer the best performance. Table III shows the obtained results, highlighting that Design A has better overall performance than Design B.

Table III. Results of KLM Evaluation of Shared Workspace
Design conditions: a) 3 users, with a.1) no scroll (75% probability) and a.2) scroll (25% probability); b) 6 users, with b.1) no scroll (75% probability) and b.2) scroll (25% probability). For each condition, the table gives the predicted execution time and the corresponding operator sequence under Design A and Design B; the predicted times range from 5 s. (for MMPKKPKK) to 11.3 s. (for MPKKMPKKPKKMMPKKPKK), with intermediate values of 8.6 s. and 9.8 s.
Note: Operators: M—Mental; P—Pointer; K—Key.

6.1.3. Discussion. Overall, these evaluations allowed us to obtain several insights about the tool. The initial experiments were mostly focused on broader organizational and group issues, such as positive/negative effects, convenience, and respect for the participants' opinions. Although the obtained results were characterized by low precision and generalizability, they were very insightful for further development and contributed to our perception of the value the organization attributed to the tool. The final experiments addressed fine-grained details about the tool usage and allowed us to experiment with alternative functionality and, ultimately, adopt the functionality that would offer the best performance. These latter results were characterized by high precision and generalizability, although they had low realism. In both cases the time invested in the evaluation was low, although for different reasons: in the first case because we adopted a pilot study approach, and in the second because we adopted an analytic approach.
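To illustrate the kind of analysis performed in the role-based evaluation of Section 6.1.2, the following sketch estimates execution times from KLM operator sequences and compares two designs using the 75%/25% no-scroll/scroll weighting. The operator durations are the commonly cited single-user values associated with KLM [Card et al. 1980] (M of about 1.35 s, P of about 1.1 s, K of about 0.2 s); they are assumptions for illustration only, so the resulting figures approximate but do not reproduce the calibrated times reported in Table III.

# Illustrative KLM estimate; operator times are the commonly cited single-user
# values, not necessarily the calibration used in the article's adapted model.
OPERATOR_TIME = {"M": 1.35,  # mental preparation
                 "P": 1.10,  # pointing with a mouse or stylus
                 "K": 0.20}  # key or button press (skilled typist)

def klm_time(sequence: str) -> float:
    """Predicted execution time (seconds) for a string of KLM operators."""
    return sum(OPERATOR_TIME[op] for op in sequence)

def expected_time(no_scroll_seq: str, scroll_seq: str,
                  p_no_scroll: float = 0.75) -> float:
    """Probability-weighted time, mirroring the 75%/25% no-scroll/scroll split."""
    return (p_no_scroll * klm_time(no_scroll_seq)
            + (1 - p_no_scroll) * klm_time(scroll_seq))

# Operator sequences taken from Table III; designs compared by expected time.
design_a = expected_time("MMPKKPKK", "MPKPKMMPKKPKK")
design_b = expected_time("MMPKKPKK", "MPKKMPKKPKKMMPKKPKK")
print(f"Design A: {design_a:.1f} s, Design B: {design_b:.1f} s")

The same comparison logic extends to any number of alternative designs, since each candidate is reduced to a single expected execution time.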
The system detail was quite different between the two evaluations. In the first case it was very low (positive/negative aspects), while in the second case it was very high (keystrokes). Conversely, in the first case the system scope was high (the whole application) and in the second case it was very low (a few functions).

6.2. Evaluation of an Application to Support Construction Inspection Activities

Construction projects typically involve a main contractor, which in turn outsources several parts of the whole project, such as electrical facilities, gas/water/communication networks, painting, and architecture. The companies in charge of these activities usually work concurrently and they need to be coordinated because the work they are doing is highly interrelated. In fact, the project progress rate and the product quality increase when all these actors appropriately coordinate among themselves. The main contractor usually acts as the manager responsible for the coordination process. The inspection activities play a key role in this process. The goal of these activities is to diagnose the status of the construction project elements and, based on that diagnosis, to determine the need to approve, reject, or modify the built elements. Each inspection is carried out by one or more inspectors using paper-based blueprints. These inspectors work alone (doing independent tasks) or form an inspection team (when their examinations are interrelated). The inspection process requires that these inspectors be on the move recording the contingency issues (problems identified by one inspector) related to particular components of the project. Periodically, the main contractor informs the subcontracted companies about the list of contingency issues they have to address. The process required to deal with these issues may involve the work of more than one subcontractor and, of course, at least one additional inspection. In order to support the inspection activities and help coordinate the problem-solving process, a mobile shared workspace named COIN (COnstruction INspector) was developed. This collaborative system manages construction projects composed of sets of digital blueprints, which are able to store annotations done with a stylus on a Tablet PC. The system also supports mobile collaboration among the users, and data sharing (file transfer and data synchronization) between two mobile computing devices. Two types of evaluations were applied to this tool: knowledge- and rule-based evaluations. The following two sections describe the evaluation processes; a third section presents a discussion of the obtained results.

6.2.1. Knowledge-Based Evaluation. During the first stage of the project, a Scenario-Based Evaluation (SBE) strategy [Haynes et al. 2004] was used to identify the scenarios and requirements involved in the construction inspection process. Two formal evaluations were done using this strategy: the first one during the software conception phase and the second one during the design phase. Each one involved two steps: (1) individual interviews with construction inspectors and (2) a focus group to validate the interview results. Three experienced construction inspectors participated in the evaluation during the conception phase. Each interview was about one hour long.
The participants had to characterize the work scenarios to be supported, and also specify and prioritize the functionalities required to carry out the inspection process. The results showed consensus on the types and features of the scenarios to be supported by the tool. However, there was no consensus on the functionalities the tool should provide to the inspectors. After the interviews, the results were written up and given back to the inspectors. A week after that, a focus group was conducted in order to try to reach an agreed set of functionalities to implement in the software tool. The focus group was about three hours long, and most of the participants changed their perception about which functionalities were most relevant to support the collaborative inspection process. A consensus was obtained after that session. The most important functionalities related to collaboration were the following: (1) transparent communication among inspectors, (2) selective visualization of digital annotations, (3) annotation filtering by several criteria, (4) unattended and on-demand annotation synchronization (between two inspectors), and (5) awareness of users' availability and location.

During the COIN design process, a preliminary prototype was used to validate the development team's proposals to deal with the requirements identified in the previous phase. Once again, an SBE strategy was used. Before the individual interviews, the inspectors received a training session lasting about 30 minutes. After that, each one explored the prototype features for about 45 minutes. Finally, a one-hour interview was done with each inspector. The main goal of the interview was to identify positive, negative, and missing aspects of the tool, and to determine whether the functionalities included in the prototype were enough to support a collaborative inspection process. The results showed a long list of specific and detailed comments, with some overlap among the inspectors' opinions. As in the previous evaluation process, these issues were written up and given back to the participants.

After a week we did the focus group session, where the COIN prototype was reviewed again, as were the inspectors' comments. The session's main goal was to categorize the inspectors' comments into the following three categories: (1) critical (it must be included in the tool), (2) recommended (it is a good idea to include it), and (3) optional (it could be included if there is enough time). The focus group took about three and a half hours, and identified 12 critical, 17 recommended, and 8 optional issues. The developers were in the session (as observers) to get the requirements directly from the source. The effort of carrying out the second evaluation was at least double that of the first; however, the result was highly accurate, detailed, and valuable, which allowed us to adjust the proposed components in order to deal with the inspectors' comments. The development team members recognized these comments were key to improving the matching between the prototype functionality and the inspectors' needs.
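As an illustration of what the agreed collaboration functionalities imply for the information the tool must manage, the following sketch models a digital annotation record together with simple filtering and synchronization operations. All field names and operations are hypothetical; they are not taken from COIN's actual implementation, which is not described here.

# Hypothetical data sketch; COIN's real data model is not described in the article.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Annotation:
    """A stylus annotation attached to a digital blueprint."""
    annotation_id: str
    blueprint_id: str          # which blueprint the mark belongs to
    inspector_id: str          # author, needed for selective visualization
    specialty: str             # e.g., "electrical", "painting" (a filtering criterion)
    status: str                # e.g., "open", "assigned", "solved"
    created_at: datetime = field(default_factory=datetime.now)

def filter_annotations(annotations: List[Annotation], **criteria) -> List[Annotation]:
    """Annotation filtering by several criteria (functionality 3)."""
    return [a for a in annotations
            if all(getattr(a, k) == v for k, v in criteria.items())]

def synchronize(local: List[Annotation], remote: List[Annotation]) -> List[Annotation]:
    """On-demand synchronization between two inspectors (functionality 4):
    a naive union keyed by annotation_id; conflict handling is omitted."""
    merged = {a.annotation_id: a for a in local}
    merged.update({a.annotation_id: a for a in remote})
    return list(merged.values())

ann = Annotation("a-01", "bp-3", "inspector-2", "electrical", "open")
print(filter_annotations([ann], specialty="electrical", status="open"))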
However, it is important to acknowledge that the opinions of three inspectors are not enough to determine the inspection requirements of a construction company. A larger number of participants implies not only more general and validated results, but also a larger evaluation effort.

6.2.2. Rule-Based Evaluation. Once the first version of COIN was delivered, an empirical evaluation was conducted with the tool at the Computer Science Department of the University of Chile. A variant of the Cooperation Scenarios (COS) evaluation method [Stiemerling and Cremers 1998] was used in this case. The evaluation covered an area of approximately 2000 m², spread over two floors. This area mainly included offices, meeting rooms, laboratories, and public spaces. Forty labels simulating contingency issues were attached to the physical infrastructure and electrical facilities. Two civil engineers, who had participated in the previous evaluation process, conducted the reviewing process. They first used COIN running on a Tablet PC to carry out the inspection, and then repeated the process using physical blueprints. In both cases an observer followed the activities of each inspector in order to verify the coherence between the inspectors' opinions and the empirical observation. In addition, these observers recorded the time involved in particular tasks of the inspection process. The engineers agreed beforehand on a common strategy to conduct both inspection processes. The strategy consisted of performing two tasks in sequence: gathering the contingency issues and determining the coherence between the inspectors' annotations. During the first evaluation round, the inspectors identified the contingency issues and created the corresponding annotations using COIN. Subsequently, they met to review each annotation, and they decided the reviews were consistent. Afterwards, the labels simulating contingency issues were changed and relocated, to reproduce the experimental conditions of the first round. The inspection process was then repeated using blueprints. Finally, the engineers were interviewed to assess their impressions of using the tool to support inspection processes. The observers provided information about the duration of several activities involved in the inspection process, such as contingency gathering and the integration of annotations. The idea was to establish a sound parameter against which to compare the whole inspection process with and without COIN usage. Table IV shows the results of the inspections using the tool and the blueprints, respectively. These results indicate an improvement in the elapsed times when COIN is used.

Table IV. Results of the Inspection Process
Experience                         With COIN     Without COIN
Labels Found                       37            38
Inspection-Elapsed Time            23 minutes    35 minutes
Annotations Review-Elapsed Time    6 minutes     9 minutes
Total Elapsed Time                 29 minutes    44 minutes

During the interview with the inspectors, both indicated they preferred to use the collaborative application for several reasons. (1) Digital maps are easier to use than paper-based blueprints.
(2) Writing annotations on the screen of a Tablet PC is more comfortable than writing them on a blueprint placed on a wall. (3) User mobility improves when COIN is used. (4) Reviewing annotations is faster when using the tool because both Tablet PCs can be put together, so the distance between the annotations being compared is small (which eases the process). Although the use of COIN shows positive results, they do not represent a great improvement over the current inspection process. The most important advantages of using COIN are related to the coordination process. Table V shows several interesting improvements in terms of coordination activities.

Table V. Results of Coordination Activities
Experience                            With COIN      Without COIN
Time for Retrieving Blueprints        < 2 minutes    Go to the main contractor's office
Time to Integrate Annotations         < 1 minute     1 hour (*)
Time for Reporting Annotations        < 2 minutes    Go to the main contractor's office
Tasks Creation Elapsed Time           35 minutes     40 minutes (*)
Contingencies Report Creation Time    < 2 minutes    1–2 hours (*)
Note: (*) Estimations done by the inspectors.

For example, when COIN is used, digital blueprints can be retrieved from the main contractor's server through a Web service accessed via Ethernet or a cellular network. This operation took less than two minutes and avoided the trip to the main contractor's office, which was required in the paper-based case. Moreover, the process of integrating the inspectors' annotations took less than a minute when the collaborative system was utilized. By contrast, the integration could have taken about one hour for paper-based inspection. The time it took to report the annotations to the main contractor is also considerably reduced with the system use. The time spent creating the tasks related to the annotations is similar in both cases. However, the creation of the contingencies report is considerably reduced when COIN is used. This evaluation process gives us useful preliminary information to understand the possible impact of the tool in the construction inspection scenario. However, a large number of observations is required to get a more accurate diagnosis of the tool's impact on a real construction company.

6.2.3. Discussion. In the first two evaluations (i.e., when SBE was used) just three inspectors were involved because of the effort required to carry out these evaluations. The evaluation effort (mainly time) in SBE grows considerably with each additional participant. Clearly this is a method which provides a high degree of realism when it is applied to a large number of participants. However, it also requires a large investment of time. The reward for that work is an agreed set of specific and detailed (positive, negative, and missing) issues, which must be considered during the development of a collaborative supporting tool. Having these issues identified is highly important in determining how well the product under development matches the users' needs. The second evaluation process (i.e., when COS was used) provides an interesting strategy to obtain a diagnosis of the tool's usefulness and of its impact on the process in which it is utilized. The feedback is detailed and precise; however, it requires a large number of participants to generalize the results. This implies an increase in the evaluation effort. Such effort could be reduced by using agents running in the background and recording the time involved in the various tasks; the observer would then no longer be required. Finally, the main limitation of the COS evaluation method could be the level of realism of the obtained results. If the testing scenario (laboratory) is similar to the real scenario, then the results will be representative. Otherwise, the evaluation effort could be meaningless.
If COS is going to be used for an evaluation process, then it is important to consider the cost of setting up a testing scenario similar to the real one.

7. CONCLUSIONS AND FUTURE WORK

The second section of this article starts with a tough question: why is collaborative systems evaluation so difficult? As we have thoroughly discussed, there is no single culprit. Indeed, the difficulties are practical (e.g., dealing with many subjects and groups), theoretical (e.g., addressing different cognitive levels, specifying satisfying criteria), and methodological (e.g., dependence of the evaluation on the development process). The various evaluation methods reviewed in this article and the timeline showing their emergence corroborate these complexities. Many of these methods are not competing for the same goal; instead, they complement the whole framework necessary to evaluate collaborative systems. The evaluator's task, then, is to define the necessary trade-offs and select a set of satisfactory evaluation methods. This article tries to ease this task.

To accomplish this goal, we started by identifying the set of variables which may be necessary to build a comprehensive evaluation framework. Such a framework must deliver a balanced albeit concise combination of variables addressing the practical, theoretical, and methodological issues that make collaborative systems evaluation so difficult. We defined six variables: generalization, precision, realism, system detail, system scope, and invested time. The generalization, precision, and realism variables fundamentally concern theoretical issues regarding how satisfying the evaluation results may be to the evaluator. The system detail and system scope variables concern methodological issues associated with the product development strategy. The invested time variable concerns a very practical issue: assessing the amount of time available to the evaluator to conduct the evaluation.

Yet these six variables still constitute quite a complex evaluation framework, and we must ease the evaluator's decision-making task. Thus, we have also considered three performance levels: role-based, rule-based, and knowledge-based performance. These levels of performance lay out the relative importance attributed to each one of the six variables previously described. For instance, the role-based level assigns high importance to the generalization, precision, and system detail variables, and low importance to realism, invested time, and system scope. Overall, the performance levels define three distinct evaluation scenarios aiming to reduce the number of choices considered by the evaluator without significantly compromising the comprehensiveness of the evaluation process.

Given the evaluation scenarios, we then discussed which evaluation lifecycle, that is, which combination of scenarios, could be adopted by the evaluator. The discussion is essentially based on two criteria: bias for invested time and product development. Considering the bias for invested time, the issue is to recommend the evaluation lifecycle and corresponding scenarios that are cost-effective with respect to the time spent doing the evaluation. The product development criterion, on the other hand, is concerned with aligning the evaluation with the development cycle, which may have adopted a depth-first or breadth-first approach.
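As a minimal illustration of how these evaluation scenarios can support the evaluator's decision making, the following sketch encodes the variable priorities of the role-based level, as stated above, in a simple lookup table. The rule-based and knowledge-based rows would be completed from the framework's full description and are therefore left out here.

# Sketch of the framework's scenario priorities as a lookup table.
# Only the role-based entries are stated explicitly in this section; the
# rule-based and knowledge-based rows would be completed from the framework.
PRIORITIES = {
    "role-based": {
        "generalization": "high", "precision": "high", "system detail": "high",
        "realism": "low", "invested time": "low", "system scope": "low",
    },
    # "rule-based": {...},
    # "knowledge-based": {...},
}

def high_priority_variables(level: str):
    """Variables an evaluator should privilege when adopting a given scenario."""
    return [v for v, importance in PRIORITIES[level].items() if importance == "high"]

print(high_priority_variables("role-based"))
# ['generalization', 'precision', 'system detail']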
Thus this approach leads the evaluator towards a fairly straightforward decision-making process that considers the product being developed, the development lifecycle, and the time available to evaluate the product. Finally, we also relate the existing evaluation methods to the evaluation scenarios just mentioned, thus easing the definition of the concrete evaluation plan. The article also describes two case studies illustrating the use of the evaluation framework and showing how the three evaluation scenarios complement each other towards assessing prototypes at various levels of granularity.

The main contributions of this article are twofold. The most important one offers decision-making support to evaluators wishing to disentangle the inherent complexity of collaborative systems evaluation. The proposed approach covers the whole endeavor, ranging from the selection of evaluation variables and the definition of satisfying criteria to the adoption of an evaluation lifecycle. The second contribution lays out a foundation for classifying evaluation methods. The evaluation methods seem to emerge in a very ad hoc way and cover quite distinct goals regarding why, how, what, and when to evaluate. This situation makes it difficult to classify them in a comprehensive way. We have proposed a classification highlighting their major distinctions. We hope this classification will be helpful to future research and practice in the CSCW area.

APPENDIX

Appendix A. Timeline of Evaluation Methods (1978–2008)

[Yourdon 1978] Structured walkthroughs
[Card et al. 1980] Keystroke-Level Model (KLM)
[Suchman 1987] Ethnomethodological studies
[Nielsen 1989] Discount usability engineering
[Nielsen and Molich 1990] Heuristic evaluation
[Wharton et al. 1994] Cognitive walkthroughs
[Tang 1991] Observational studies
[Bias 1991] Interface walkthroughs
[Polson et al. 1992] Cognitive walkthroughs
[Rowley and Rhoades 1992] Cognitive jogthrough
[Urquijo et al. 1993] Breakdown analysis
[Twidale et al. 1994] Situated evaluation
[Nielsen 1994] Usability inspection
[Nielsen 1994] Heuristic evaluation
[Ereback and Höök 1994] Cognitive walkthrough
[Bias 1994] Pluralistic usability walkthrough
[Hughes et al. 1994a] Quick-and-dirty ethnography
[Hughes et al. 1994b] Evaluative ethnography
[Plowman et al. 1995] Workplace studies
[Gutwin et al. 1996] Usability studies
[Van Der Veer et al. 1996] Groupware task analysis
[Baeza-Yates and Pino 1997] Formal evaluation of collaborative work
[Stiemerling and Cremers 1998] Cooperation scenarios
[Ruhleder and Jordan 1998] Video-based interaction analysis
[Briggs et al. 1998] Technology Transition Model
[Neale and Carroll 1999] Multi-faceted evaluation for complex, distributed activities
[Gutwin and Greenberg 1999] Evaluation of workspace awareness
[Gutwin and Greenberg 2000] Mechanics of collaboration
[Carroll 2000] Scenario-based design
[Van Der Veer 2000] Task-based groupware design
[Steves et al. 2001] Usage evaluation
[Baker et al. 2001] Heuristic evaluation based on the mechanics of collaboration
[Sonnenwald et al. 2001] Innovation diffusion theory
[Baker et al. 2002] Groupware heuristic evaluation
[Cockton and Woolrych 2002] Discount methods
[Pinelle and Gutwin 2002] Groupware walkthrough
[Pinelle et al. 2003] Collaboration usability analysis
[Antunes and Costa 2003] Perceived value
[Haynes et al. 2004] Scenario-based evaluation
[Convertino et al. 2004] Activity awareness
[Humphries et al. 2004] Laboratory simulation methods
[Inkpen et al. 2004] Evaluating collaboration in co-located environments
[Kieras and Santoro 2004] Computational GOMS
[Briggs et al. 2004] Satisfaction Attainment Theory
[Vizcaíno et al. 2005] Knowledge management approach
[Baeza-Yates and Pino 2006] Performance analysis
[Antunes et al. 2006] Human performance models
[Pinelle and Gutwin 2008] Tabletop collaboration usability analysis

REFERENCES

ANTUNES, P. AND COSTA, C. 2003. Perceived value: A low-cost approach to evaluate meetingware. In Proceedings of CRIWG'03. Lecture Notes in Computer Science, vol. 2806, 109–125.
ANTUNES, P., FERREIRA, A., AND PINO, J. 2006. Analyzing shared workspace design with human-performance models. In Proceedings of CRIWG'06. Lecture Notes in Computer Science, vol. 4154, 62–77.
ANTUNES, P., RAMIRES, J., AND RESPÍCIO, A. 2006. Addressing the conflicting dimension of groupware: A case study in software requirements validation. Comput. Informatics 25, 523–546.
ARAUJO, R., SANTORO, F., AND BORGES, M. 2002. The CSCW lab for groupware evaluation. In Proceedings of CRIWG'02. Lecture Notes in Computer Science, vol. 2440, 222–231.
BAECKER, R. M., GRUDIN, J., BUXTON, W., AND GREENBERG, S., EDS. 1995. Human-Computer Interaction: Toward the Year 2000. Morgan Kaufmann, San Francisco, CA.
BAEZA-YATES, R. AND PINO, J. 1997. A first step to formally evaluate collaborative work. In Proceedings of the ACM International Conference on Supporting Group Work (GROUP '97). 55–60.
BAEZA-YATES, R. AND PINO, J. 2006. Towards formal evaluation of collaborative work and its application to information retrieval. Info. Res. 11, 4.
BAKER, K., GREENBERG, S., AND GUTWIN, C. 2001. Heuristic evaluation of groupware based on the mechanics of collaboration. In Proceedings of the 8th IFIP International Conference on Engineering for Human-Computer Interaction. Lecture Notes in Computer Science, vol. 2254, 123–140.
BAKER, K., GREENBERG, S., AND GUTWIN, C. 2002. Empirical development of a heuristic evaluation methodology for shared workspace groupware. In Proceedings of the ACM Conference on Computer Supported Cooperative Work. 96–105.
BIAS, R. 1991. Interface-walkthroughs: Efficient collaborative testing. IEEE Softw. 8, 5, 94–95.
BIAS, R. 1994. The pluralistic usability walkthrough: coordinated empathies. In Usability Inspection Methods, J. Nielsen and R. Mack, Eds., John Wiley & Sons, New York, 63–76.
BRIGGS, R., ADKINS, M., MITTLEMAN, D., KRUSE, J., MILLER, S., AND NUNAMAKER, J. 1998. A technology transition model derived from field investigation of GSS use aboard the U.S.S. CORONADO. J. Manage. Info. Syst. 15, 3, 151–195.
BRIGGS, R., QURESHI, S., AND REINIG, B. 2004. Satisfaction attainment theory as a model for value creation. In Proceedings of the 37th Annual Hawaii International Conference on System Sciences, IEEE Computer Society Press.
CARD, S., MORAN, T., AND NEWELL, A. 1980. The keystroke-level model for user performance time with interactive systems. Commun. ACM 23, 7, 396–410.
CARROLL, J. 2000. Making Use: Scenario-Based Design of Human-Computer Interactions. The MIT Press, Cambridge, MA.
COCKTON, G. AND WOOLRYCH, A. 2002. Sale must end: should discount methods be cleared off HCI's shelves? Interactions 9, 5, 13–18.
CONVERTINO, G., NEALE, D., HOBBY, L., CARROLL, J., AND ROSSON, M. 2004. A laboratory method for studying activity awareness. In Proceedings of the 3rd Nordic Conference on Human-Computer Interaction. 313–322.
DAMIANOS, L., HIRSCHMAN, L., KOZIEROK, R., KURTZ, J., GREENBERG, A., WALLS, K., LASKOWSKI, S., AND SCHOLTZ, J. 1999. Evaluation for collaborative systems. ACM Comput. Surv. 31, 2, 15.
DESANCTIS, G., SNYDER, J., AND POOLE, M. 1994. The meaning of the interface: a functional and holistic evaluation of a meeting software system. Decis. Supp. Syst. 11, 319–335.
EREBACK, A. AND HÖÖK, K. 1994. Using cognitive walkthrough for evaluating a CSCW application. In Proceedings of the Conference Companion on Human Factors in Computing Systems, 91–92.
FERREIRA, A., ANTUNES, P., AND PINO, J. 2009. Evaluating shared workspace performance using human information processing models. Info. Res. 14, 1, 388.
FJERMESTAD, J. AND HILTZ, S. 1999. An assessment of group support systems experimental research: methodology and results. J. Manage. Info. Syst. 15, 3, 7–149.
GREENBERG, S. AND BUXTON, B. 2008. Usability evaluation considered harmful (some of the time). In Proceedings of the 26th Annual SIGCHI Conference on Human Factors in Computing Systems, 111–120.
GUTWIN, C. AND GREENBERG, S. 1999. The effects of workspace awareness support on the usability of real-time distributed groupware. ACM Trans. Comput.-Human Interact. 6, 3, 243–281.
GUTWIN, C. AND GREENBERG, S. 2000. The mechanics of collaboration: Developing low cost usability evaluation methods for shared workspaces. In Proceedings of the IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, 98–103.
GUTWIN, C., ROSEMAN, M., AND GREENBERG, S. 1996. A usability study of awareness widgets in a shared workspace groupware system. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, 258–267.
GUY, E. 2005. "...real, concrete facts about what works...": integrating evaluation and design through patterns. In Proceedings of the International ACM SIGGROUP Conference on Supporting Group Work, 99–108.
HAYNES, S., PURAO, S., AND SKATTEBO, A. 2004. Situating evaluation in scenarios of use. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, 92–101.
HERSKOVIC, V., PINO, J. A., OCHOA, S. F., AND ANTUNES, P. 2007. Evaluation methods for groupware systems. In Proceedings of CRIWG, Lecture Notes in Computer Science, vol. 4715, 328–336.
HIX, D. AND HARTSON, H. R. 1993. Developing User Interfaces: Ensuring Usability Through Product and Process. John Wiley & Sons, Inc., New York, NY.
HUANG, J. P. H. 2005. A conceptual framework for understanding collaborative systems evaluation. In Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise. 215–220.
HUGHES, J., KING, V., RODDEN, T., AND ANDERSEN, H. 1994a. Moving out from the control room: ethnography in system design. In Proceedings of the ACM Conference on Computer Supported Cooperative Work. 429–439.
HUGHES, J., SHARROCK, W., RODDEN, T., O'BRIEN, J., ROUNCEFIELD, M., AND CALVEY, D. 1994b. Field Studies and CSCW. Lancaster University, Lancaster, U.K.
HUMPHRIES, W., NEALE, D., MCCRICKARD, D., AND CARROLL, J. 2004. Laboratory simulation methods for studying complex collaborative tasks. In Proceedings of the 48th Annual Meeting of the Human Factors and Ergonomics Society, 2451–2455.
INKPEN, K., MANDRYK, R., DIMICCO, J., AND SCOTT, S. 2004. Methodology for evaluating collaboration behaviour in co-located environments. In Proceedings of the ACM Conference on Computer Supported Cooperative Work.
KIERAS, D. AND SANTORO, T. 2004. Computational GOMS modeling of a complex team task: lessons learned. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 97–104.
MCGRATH, J. 1984. Groups: Interaction and Performance. Prentice-Hall, Englewood Cliffs, NJ.
NEALE, D. AND CARROLL, J. 1999. Multi-faceted evaluation for complex, distributed activities. In Proceedings of the Conference on Computer Support for Collaborative Learning. 425–433.
NEALE, D., CARROLL, J., AND ROSSON, M. 2004. Evaluating computer-supported cooperative work: models and frameworks. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, 112–121.
NEWELL, A. 1990. Unified Theories of Cognition. Harvard University Press, Cambridge, MA.
NIELSEN, J. 1989. Usability engineering at a discount. In Designing and Using Human-Computer Interfaces and Knowledge Based Systems, G. Salvendy and M. Smith, Eds., Elsevier Science Publishers, Amsterdam, 394–401.
NIELSEN, J. 1994. Usability inspection methods. In Proceedings of the Conference on Human Factors in Computing Systems, 413–414.
NIELSEN, J. AND MOLICH, R. 1990. Heuristic evaluation of user interfaces. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, 249–256.
OCHOA, S., PINO, J., BRAVO, G., DUJOVNE, N., AND NEYEM, A. 2008. Mobile shared workspaces to support construction inspection activities. In Collaborative Decision Making: Perspectives and Challenges, P. Zarate, J. Belaud, G. Camileri, and F. Ravat, Eds., IOS Press, Amsterdam, 211–220.
PIDD, M. 1996. Tools for Thinking. J. Wiley & Sons, Chichester.
PINELLE, D. AND GUTWIN, C. 2000. A review of groupware evaluations. In Proceedings of the 9th IEEE WETICE Infrastructure for Collaborative Enterprises, 86–91.
PINELLE, D. AND GUTWIN, C. 2002. Groupware walkthrough: adding context to groupware usability evaluation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 455–462.
PINELLE, D. AND GUTWIN, C. 2008. Evaluating teamwork support in tabletop groupware applications using collaboration usability analysis. Pers. Ubiq. Comput. 12, 3, 237–254.
PINELLE, D., GUTWIN, C., AND GREENBERG, S. 2003. Task analysis for groupware usability evaluation: modeling shared-workspace tasks with the mechanics of collaboration. ACM Trans. Comput.-Human Interact. 10, 4, 281–311.
PINSONNEAULT, A. AND KRAEMER, K. 1989. The impact of technological support on groups: an assessment of the empirical research. Decis. Supp. Syst. 5, 3, 197–216.
PLOWMAN, L., ROGERS, Y., AND RAMAGE, M. 1995. What are workplace studies for? In Proceedings of the 4th European Conference on Computer-Supported Cooperative Work, 309–324.
POLSON, P. G., LEWIS, C., RIEMAN, J., AND WHARTON, C. 1992. Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. Int. J. Man-Mach. Stud. 36, 5, 741–773.
RASMUSSEN, J. AND JENSEN, A. 1974. Mental procedures in real-life tasks: a case study of electronic trouble shooting. Ergonomics 17, 293–307.
REASON, J. 2008. The Human Contribution: Unsafe Acts, Accidents and Heroic Recoveries. Ashgate, Surrey, UK.
ROSS, S., RAMAGE, M., AND ROGERS, Y. 1995. PETRA: participatory evaluation through redesign and analysis. Interact. Comput. 7, 4, 335–360.
ROWLEY, D. AND RHOADES, D. 1992. The cognitive jogthrough: a fast-paced user interface evaluation procedure. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 389–395.
RUHLEDER, K. AND JORDAN, B. 1998. Video-based interaction analysis (VBIA) in distributed settings: a tool for analyzing multiple-site, technology-supported interactions. In Proceedings of the Participatory Design Conference, 195–196.
SCRIVEN, M. 1967. The methodology of evaluation. In Perspectives of Curriculum Evaluation, R. Tyler, R. Gagne, and M. Scriven, Eds., Rand McNally, Chicago, 39–83.
SONNENWALD, D., MAGLAUGHLIN, K., AND WHITTON, M. 2001. Using innovation diffusion theory to guide collaboration technology evaluation: work in progress. In Proceedings of the IEEE International Workshop on Enabling Technologies, 114–119.
STEVES, M., MORSE, E., GUTWIN, C., AND GREENBERG, S. 2001. A comparison of usage evaluation and inspection methods for assessing groupware usability. In Proceedings of the International ACM SIGGROUP Conference on Supporting Group Work, 125–134.
STIEMERLING, O. AND CREMERS, A. 1998. The use of cooperation scenarios in the design and evaluation of a CSCW system. IEEE Trans. Softw. Eng. 24, 12, 1171–1181.
SUCHMAN, L. 1987. Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge University Press, Cambridge, U.K.
TANG, J. 1991. Findings from observational studies of collaborative work. Intern. J. Man-Machine Stud. 34, 2, 143–160.
TWIDALE, M., RANDALL, D., AND BENTLEY, R. 1994. Situated evaluation for cooperative systems. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, 441–452.
URQUIJO, S., SCRIVENER, S., AND PALMEN, H. 1993. The use of breakdown analysis in synchronous CSCW system design. In Proceedings of the 3rd European Conference on Computer Supported Cooperative Work—ECSCW 93, 289–302.
VAN DER VEER, G. 2000. Task based groupware design: putting theory into practice. In Proceedings of the Symposium on Designing Interactive Systems. 326–337.
VAN DER VEER, G., LENTING, B., AND BERGEVOET, B. 1996. GTA: Groupware task analysis—modeling complexity. Acta Psychologica 91, 3, 297–322.
VELD, M. A. A. H. I. T., ANDRIESSEN, J. H. E., AND VERBURG, R. M. 2003. E-MAGINE: The development of an evaluation method to assess groupware applications. In Proceedings of the 12th International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, 153–158.
VIZCAÍNO, A., MARTINEZ, M., ARANDA, G., AND PIATTINI, M. 2005. Evaluating collaborative applications from a knowledge management approach. In Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise (WETICE'05), 221–225.
WHARTON, C., RIEMAN, J., LEWIS, C., AND POLSON, P. 1994. The cognitive walkthrough method: a practitioner's guide. In Usability Inspection Methods, J. Nielsen and R. Mack, Eds., John Wiley & Sons, New York, 105–140.
YOURDON, E. 1978. Structured Walkthroughs. Yourdon Inc., New York.

Received February 2010; revised June 2010; accepted August 2010