
Ricardo Muñoz Martín / Tomás Conde Ruano
PETRA Research Group, Granada

Effects of Serial Translation Evaluation

Evaluating, in translation, is a prototypical concept with many extensions. Readers tend to view it as a matter of quality, adequacy, or acceptability, whereas other stakeholders conceive of it as activities, or parts thereof, such as proofreading, correcting, revising, editing, assessing, grading, and so on. Means and goals also differ quite often. These circumstances, together with the enormously varied personal criteria and standards of evaluators, support the generally accepted view that evaluation cannot be studied in depth. Our aim is to find out whether studying the way subjects actually perform evaluations might reveal regularities that help us better understand what is at stake. That is, we are trying to find out whether there is some order in evaluators' subjectivities, as reflected in their observable behavior. The overarching purpose of this research line is to map intra- and intergroup coincidences and differences in evaluating translations, and to look for correlations with other parameters, such as age, level of expertise, group orientation, etc. To do so, we operatively define evaluation as "the set of activities carried out by a subject which end up with at least an overall qualitative judgment of a text, or a pair of texts, independently of the way subjects envisioned and performed the task". This preliminary study is a piece of descriptive-relational research, for it seeks to depict what already exists in a group or population, and it investigates the connection between variables that are already present in that group or population.

1 Goals and Hypotheses

Evaluating several translations from the same original is quite unnatural in the market: it comes up almost exclusively in translator training and translator hiring. However, training and hiring are crucial for the industry.
Can the repeated activity teach us something about evaluating translations? Does the repetition have an influence on the outcome of the evaluation? Those were the questions raised in this study, which is part of a larger effort by Tomás Conde within the activities of the group Expertise and Environment in Translation (PETRA). We wanted to move away from popular approaches to evaluating translations that focus on mistakes and assign arbitrary values to poorly defined categories. To do so, some concepts had to be operationalized. A text segment is "any portion of a text singled out for analysis". Since evaluators do not only mark mistakes, we defined phenomenon as "what motivates an evaluator to act on a particular text segment". Phenomena were classified into two groups: (1) normalized phenomena, which include typos, punctuation and spelling, formatting variations, syntax, weights and measurements, and the like, where an authority (in Spanish, normally the RAE) sanctions a proper option; and (2) not-normalized phenomena, such as optional word order, register, and different interpretations of the original, where appropriateness is a matter of degree. Phenomena were taken to be more or less salient according to the number of subjects who worked on them. Hence, a text segment where seven evaluators perform an action is considered more salient than one where only three do. Here, saliency is restricted to phenomena where more than half of the evaluators coincide. Our current data couple each phenomenon with its corresponding action. An action is "any mark introduced by the evaluator on the text". In this study, actions have been limited to those present in the evaluated translation. Therefore, phenomena which are identified but not acted upon go unnoticed.
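The saliency rule above (a phenomenon counts as salient when more than half of the ten evaluators act on the same text segment) can be sketched as follows. The data model, with (evaluator, segment) pairs, is a hypothetical simplification of our records, not the actual database layout:

```python
from collections import defaultdict

def saliency(actions):
    """actions: iterable of (evaluator_id, segment_id) pairs.
    Returns, per segment, how many distinct evaluators acted on it."""
    counts = defaultdict(set)
    for evaluator, segment in actions:
        counts[segment].add(evaluator)
    return {seg: len(evs) for seg, evs in counts.items()}

def salient_segments(actions, n_evaluators=10):
    # "salient >5": segments singled out by more than half of the
    # evaluators, i.e. at least 6 out of 10
    return {seg for seg, n in saliency(actions).items() if n > n_evaluators // 2}

# Hypothetical example: six evaluators act on segment s1, two on s2
actions = [("I02", "s1"), ("I03", "s1"), ("I04", "s1"),
           ("I05", "s1"), ("I06", "s1"), ("I07", "s1"),
           ("I02", "s2"), ("I03", "s2")]
print(salient_segments(actions))  # → {'s1'}
```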
Further, on-line changes which are later modified again, which might be very informative, have not been taken into account either, due to current data-collection procedures. Hence, in this report each phenomenon is paired with a response from the subject. On the other hand, focusing on marked text segments is a safe criterion to identify phenomena. Professionals and some teachers often cite the amount of work needed to fix a translation as a criterion to evaluate it, but their actions may entail varying quantities of work depending on their nature (e.g. conceptual vs. grammatical), systematicity (e.g. spelling vs. correct figures), and other aspects (such as purpose and relevance, and even the skills of the subject) which may be specific to each phenomenon. Actions, taken as phenomena indicators and as behavioral units, are informative. The classification of actions into types emerged from observation of the evaluated translations. Our guiding principles were to avoid overlaps and to cover as many instances as possible with the fewest categories in which behavior can be undeniably understood to follow a certain pattern. Hence, actions have been operationalized in their quantity and in their types. There are, certainly, a number of cases where an action includes elements which belong to two different types. In such cases, it has been counted as two different actions and is reflected in the figures accordingly. Actions observed so far can be divided into those made in the body of the text and those made in the margin (which here also includes the space before and after the body of the text). The distinction is relevant because actions in the margin cannot be thought of as aiming to improve the text; rather, they probably aim to inform another reader of the existence of phenomena in the text, or serve as personal reminders. This distinction might prove worthy of becoming one of the main parameters in a potential classification of activities within translation evaluation.
An effort was made to keep categories symmetrical on both sides, and up to now both actions in the margin and actions in the body of the texts can be classified into additions, suppressions, changes, marks, annotations, and comments. An addition is "any procedure which results in adding some alphanumeric chain of symbols to the translation". Suppression is the opposite procedure. Changes imply the combination of both. Marks refer to "any introduction of signs which are not aimed to become part of the translation, and usually consist of symbols, underlining, abbreviations, question marks, etc." Annotations are "alphanumeric chains which do not cross out a translation text segment but might constitute an alternative to it", for example, alternative solutions scribbled in the margin. Comments are annotations which deal with the action or the phenomenon at stake, but do not (only) provide a text segment to substitute for some translation fragment. Further, some evaluators chose to code their marks so as to classify phenomena in some way; typically, an underlining system with a three- or four-color code was used. This classification is neither homogeneous nor totally sharp, but it responds to the nature of the phenomena quite well, does not demand a strong heuristic effort, and brings about a considerable reduction of undetermined phenomena (5.99%). Preliminary observation showed that evaluators did not seem to rely on a set of clear or conscious criteria to evaluate translations, and that many of those who professed to adhere to a more or less elaborate set of parameters turned out to apply it rather unevenly in actual practice. However, their actual behavior showed some interesting, statistically significant tendencies. As in the case of action types, we chose to start with obvious quantitative parameters.
Hence, we defined demand as "the set of conscious and unconscious expectations an evaluator seems to think that a translation should meet", to try to accommodate evaluators' various standards. Demand was operationalized from two perspectives: (1) level, that is, whether evaluators seem to expect more or less from a translation, as reflected in their quality judgments; and (2) evenness, the uniformity or lack of variation in the level of demand. The second perspective may indicate the existence of clear and/or stable criteria for evaluating, or else an attempt to pursue even-handedness of some sort. Finally, order effects were defined as "any consistent tendency in serial evaluation behavior across evaluators which cannot be explained as a feature of the translations when considered separately".

2 Materials and methods

35 students in their fourth year of the translation degree at the University of Granada were invited to "assess / correct / proofread / edit / revise four sets of 12 translations each, corresponding to four originals, according to their beliefs and intuition, and to the best of their knowledge." This report presents the data of ten of these subjects, the first set to have been completely analyzed. As for the texts, originals A and C dealt with politics, and B and D with technical procedures for painting machinery. They had been selected from The Economist and from an online technical list, and all four were brief but complete texts which seemed realistic commissions. The translations had been carried out by students from an earlier course, and they were chosen from among those which had not been assigned a very good grade by the teacher, so as to avoid an excellent rendering serving as a model for the subjects when evaluating other translations from the same set. The sets were alternately sequenced, as described above, to prompt subjects to think of them as four separate tasks.
Translations within each set were randomly ordered and coded for blind intervention by subjects and analysts. Originals and translations were provided as digital files; printouts were provided upon request, and subjects were also allowed to print them out. The only constraints imposed on all subjects alike were the following: they had to (1) process the translations in the order they were given; (2) work on a whole set of translations from the same original in a single session; and (3) classify translations into one of four categories: very good, good, bad, and very bad. We searched for three types of order effects: (1) between sets; (2) within each set; and (3) within each text. Since translations were evaluated in the same order, task effects were analyzed simply by checking changes in the progression through the sets. For set effects, translations were grouped into three subsets, so that subset I includes translations 1-4 from each set, subset II includes translations 5-8, and subset III includes translations 9-12. For effects within translations, originals were divided into three sections (initial, middle, and final) of roughly identical length, and translations were divided accordingly. Data were entered in a Microsoft Access database and later analyzed with SPSS 12.0. Bivariate correlations, frequency counts and descriptive statistics were the most useful analyses.

3 Results and discussion: subjects' profiles

We first need to describe the subjects and their behaviors. The first parameter was final quality judgments, which were assigned numerical values to allow for computing averages: very bad, 1; bad, 2; good, 3; very good, 4.

translation      A      B      C      D
 1              2.1    2.4    2.2    3.3
 2              2.2    2.6    2.2    2.2
 3              1.2    2.7    1.6    1.8
 4              1.5    1.7    1.2    1.6
 5              1.6    2.5    2.0    2.6
 6              1.1    1.9    2.2    2.4
 7              2.1    2.8    1.3    2.3
 8              1.6    2.4    3.0    2.0
 9              1.8    2.6    2.5    1.6
10              2.9    2.6    2.3    2.0
11              2.1    2.2    2.3    1.9
12              2.4    2.2    1.5    2.0
Set aver.      1.883  2.383  2.025  2.141

Table 1: Average quality judgments

Table 1 displays average quality judgments for the translations. In set A, translations A02 and A10 received the best grades, whereas A03 and A06 got the lowest. The median value of all translations was 2.1. Technical translations got higher grades than "general" ones. Translations' lengths did not correlate with quality judgments, although I05 and I03 tended to consider long translations good (correlations of 0.295 and 0.291, respectively, significant at 0.05).

Graphic 1: Frequency of quality judgment averages in the task

Graphic 1 shows the frequency of average quality judgments in the task, which comes close to a fairly typical distribution (a Gaussian bell curve), except that the curve is displaced to the left, probably because the translations were chosen from among the worst ones. Only nine translations were deemed Good or Very good (right columns). When the continuum (1-4) is divided into three equal intervals, only two translations reach the highest third (darkest background).

3.1 Demand

3.1.1 Level

Table 2 tries to capture some specifics of subjects' behavior. Correlations between evaluators' quality judgments were statistically significant between I02 and I06 (0.426), I04 and I10 (0.627), I05 and I07 (0.375), and I07 with I08 (0.384) and with I09 (0.596). Evaluator I03 has the best opinion of the translations (general average, 2.74). Other generous or lenient evaluators are I04 (2.62 average), I11 (2.45 average) and I07 (2.27 average). On the other hand, I06 is the most demanding evaluator (1.52 average), followed by I05 (1.69 average). Graphic 2 shows that subjects can be classified into three groups: I05 and I06 are the most demanding evaluators; I03, I04, I10, and I11 are the lenient ones; and I02, I07, I08 and I09 are in between. The intermediate group is remarkably homogeneous.
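The quantitative notions used in this section (numerically coded judgments, level of demand as a mean, evenness as dispersion, and bivariate Pearson correlations between evaluators) can be sketched as follows. The judgment vectors are hypothetical illustrations, not the study's data:

```python
# Quality judgments coded very bad=1, bad=2, good=3, very good=4.
from statistics import mean, stdev

def pearson(x, y):
    """Plain Pearson product-moment correlation between two vectors."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical judgments by three evaluators over the same 12 translations
judgments = {
    "Ea": [2, 3, 1, 2, 2, 1, 2, 2, 2, 3, 2, 2],
    "Eb": [3, 3, 2, 3, 3, 2, 3, 2, 3, 4, 3, 3],
    "Ec": [1, 4, 1, 1, 3, 2, 4, 1, 2, 3, 1, 4],
}

for ev, scores in judgments.items():
    # level of demand = mean judgment; evenness = spread of judgments
    print(ev, "level:", round(mean(scores), 2), "evenness (s.d.):", round(stdev(scores), 2))

# bivariate correlation between two evaluators' judgments
print(round(pearson(judgments["Ea"], judgments["Eb"]), 3))
```

A lower mean reads as a more demanding evaluator, and a lower standard deviation as a more even one; significance testing (as done in SPSS) is omitted from the sketch.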
subject    A aver. (s.d.)    B aver. (s.d.)    C aver. (s.d.)    D aver. (s.d.)    Total aver. (s.d.)
I02        2.08 (0.996)      1.82 (0.603)      2.25 (1.055)      2.92 (0.669)      2.28 (0.926)
I03        2.83 (0.835)      2.64 (0.505)      2.92 (0.669)      2.58 (0.669)      2.74 (0.675)
I04        2.42 (0.793)      2.73 (0.786)      2.83 (0.835)      2.50 (0.905)      2.62 (0.822)
I05        1.08 (0.289)      2.25 (0.622)      1.67 (0.651)      1.75 (1.138)      1.69 (0.829)
I06        1.83 (0.835)      1.50 (0.674)      1.17 (0.389)      1.58 (0.996)      1.52 (0.772)
I07        2.25 (0.965)      2.92 (0.793)      2.17 (0.937)      1.75 (0.452)      2.27 (0.893)
I08        2.08 (0.900)      2.42 (0.900)      2.33 (1.155)      2.17 (1.030)      2.25 (0.978)
I09        2.00 (0.853)      2.92 (0.793)      2.42 (0.793)      1.58 (0.515)      2.23 (0.881)
I10        2.67 (0.492)      –                 2.33 (0.778)      –                 2.50 (0.659)
I11        2.25 (0.965)      2.58 (1.084)      2.50 (0.905)      2.45 (0.934)      2.45 (0.951)

Table 2: Quality judgments, per subject and set

3.1.2 Evenness

Graphic 3 displays evaluators' average quality judgments per set. I05 seems especially tough in set A (1.08 set average) when compared to the general average, and I02 is generous in set D (2.92). On the other hand, I08 is regular throughout the sets (2.25 subject average), followed by I11, I04 and I03, who have better general opinions of the translations. Lenient evaluators (and the medium evaluator I08) seem more even than the rest in all texts.

Graphic 2: Quality judgments, per subject

Graphic 3: Set averages of subjects' quality judgments

3.2 Actions

3.2.1 Quantity

The number of actions correlates significantly with quality judgments (−0.535) when considered text by text, but not when analyzed by subject. The total number of actions is 11909 (table 3). Within sets, C and D show the largest variations, which may amount to up to four times as many actions between translations.

text/set    A aver. (s.d.)     B aver. (s.d.)    C aver. (s.d.)     D aver. (s.d.)
01          44.90 (21.702)     24.00 (9.684)     27.00 (15.727)      8.80 (5.181)
02          37.10 (15.366)     18.00 (5.249)     18.20 (6.374)      15.80 (8.664)
03          56.80 (20.471)     16.30 (4.347)     22.60 (8.579)      22.70 (7.675)
04          56.70 (25.975)     19.90 (6.226)     29.60 (11.138)     26.00 (5.598)
05          42.40 (20.250)     16.90 (4.701)     19.80 (8.053)      10.40 (5.296)
06          62.70 (24.784)     22.10 (6.557)     14.50 (10.157)     12.50 (8.100)
07          32.90 (15.871)     14.90 (7.534)     24.10 (10.682)     12.00 (6.716)
08          47.60 (20.007)     17.60 (8.072)     23.40 (19.945)     16.80 (8.257)
09          63.30 (18.331)     18.30 (8.433)     13.10 (7.370)      11.70 (4.877)
10          30.10 (17.272)     14.60 (8.605)     13.50 (7.706)      17.50 (7.634)
11          37.20 (17.561)     15.50 (7.200)     14.50 (8.567)      17.70 (6.701)
12          35.30 (15.151)     17.30 (8.629)     22.90 (9.597)      13.40 (5.621)
Set aver.   45.58 (19.4)       17.95 (7.103)     20.27 (10.32)      15.44 (6.693)

Table 3: Quantity of actions, per translation

Subjects differ widely in the number of actions carried out (table 4). I02 performed a total of 627 actions, while I10 reached 1877, around three times as many.

set/subject    I02    I03     I04     I05     I06     I07     I08     I09     I10     I11     aver.
A             23.08   59.00   42.50   62.50   37.58   28.42   26.33   47.83   72.67   55.92   45.58
B              8.91   24.92   15.50   30.83   15.83   14.00   14.67   13.33   30.17   34.50   20.27
C             13.75   19.25   20.67   14.67   19.92   11.17   15.58   12.08   25.17   27.25   17.95
D              6.50   16.33   13.58   16.33   16.50   11.33   16.25   11.83   28.42   17.33   15.44
Total aver.   13      30      23      31      22      16      18      21      39      34
Nr. actions   627     1434    1107    1492    1078    779     874     1021    1877    1620

Table 4: Quantity of actions, per subject

Graphic 4 displays the average quantity of actions that each subject performed in every set, ordered from left to right by decreasing total quantity of actions. Four of the five subjects who performed the most actions were lenient evaluators, and medium evaluators (light background) performed fewer actions than demanding ones (dark background). Most medium evaluators also show the smallest differences in the number of actions undertaken between the first and the following sets.

Graphic 4: Set averages for subjects' quantity of actions

3.2.2 Types

Marking is the only type of action which correlates significantly at 0.01 with quality judgments (so do changes in the margin, but there were very few cases). Actions co-occur in certain patterns.
Adding in text strongly correlates with changing in text (0.896), with suppressing (0.864), and with marking (0.721). Other correlations show emergent profiles of coherent behavior: evaluators seem either to fix the translations for later use (text-oriented) or to provide explanations of their actions to the translator or the researcher (feedback-oriented). Graphic 5 shows the distribution of the five most common actions among the subjects. Subjects I02 and I08 only classify phenomena, whereas I03, I04 and I09 focus on changing, adding, and suppressing in the body of texts.

type of action    Classification   Mark   Addition   Suppression   Change   Note   Doubtful   Other   Total
total actions               2180   1517       1156           855     5720    273         27     181   11909

subject           I02    I03     I04     I05     I06     I07    I08    I09     I10     I11     Total
total actions     627    1434    1107    1492    1078    779    874    1021    1877    1620    11909

Table 5: Types of actions, per subject (type and subject totals)

Evaluators I03, I04, I07, I09 and I11 tended to act on the texts by adding, suppressing, and changing text segments. At the opposite pole, I02 and I08 were oriented towards offering feedback to another reader. The rest did not seem to have a clear pattern of behavior. When contrasted with their level of demand (Graphic 5), demanding evaluators preferred to just mark phenomena, medium evaluators tended to classify more, and lenient evaluators to change and suppress.

Graphic 5: Types of actions, per subject

Comments did not yield any clear pattern. However, I05 (one of the most demanding evaluators) was the subject who made the most comments (37.5% of all), followed by I10 (14.4%). On the other hand, the subjects who wrote the fewest comments were I04 (1.4%) and I03 (2.8%), the two most lenient subjects.
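The text- vs. feedback-orientation contrast can be sketched as a simple tally over action types. The type names follow the classification above, but the per-subject counts and the majority rule are illustrative assumptions, not the procedure actually used in the study:

```python
# In-text actions aim to fix the translation itself; margin actions
# (classifying, marking, annotating, commenting) aim to inform a reader.
IN_TEXT = {"addition", "suppression", "change"}
FEEDBACK = {"classification", "mark", "annotation", "comment"}

def orientation(counts):
    """counts: dict mapping action type -> number of actions by one subject.
    Returns the dominant orientation and the share of in-text actions."""
    text = sum(n for t, n in counts.items() if t in IN_TEXT)
    feedback = sum(n for t, n in counts.items() if t in FEEDBACK)
    total = text + feedback
    label = "text-oriented" if text > feedback else "feedback-oriented"
    return label, text / total

# Hypothetical subjects: one who mostly rewrites, one who only classifies
print(orientation({"change": 900, "addition": 300, "suppression": 150, "mark": 50}))
print(orientation({"classification": 600, "mark": 20, "comment": 10}))
```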
3.3 Summary of subjects' profiles

Evaluators showed consistent tendencies (1) to adopt a given level of demand, and (2) to confront different texts with a certain degree of evenness. Their actions may be (3) more or less abundant, (4) text-oriented or feedback-oriented, and (5) supported with few or many comments. Table 6 displays a summary of these variables. Column I shows the level of demand, from the most lenient (1) to the most demanding (3). Column II displays the level of evenness in two stages, even (1) and uneven (2). Column III reflects the quantity of actions, from the fewest (1) to the most abundant (3). Column IV ranks subjects from the most feedback-oriented (1) to the most text-oriented (4). Finally, column V ranks subjects according to the number of comments introduced, from the fewest (1) to the most abundant (4).

subject   Level (I)   Even (II)   Quant (III)   Type (IV)   Comm (V)
I02           2           1            1            1           3
I03           1           2            3            4           1
I04           1           2            2            4           1
I05           3           1            3            3           4
I06           3           1            2            2           2
I07           2           1            1            3           3
I08           2           2            1            1           1
I09           2           1            2            4           1
I10           1           2            3            1           3
I11           1           2            3            3           2

Table 6: Summary of subjects' characteristics

In brief, demanding evaluators tend to be feedback-oriented and uneven in their level of demand. Medium evaluators tend to perform few actions and to be fairly uneven. Lenient evaluators seem more homogeneous: they are text-oriented, perform many actions, and tend to be quite even in their judgments. Of course, ten evaluators are too few for the data to hold any consistent truth, but the results are interesting since they point to potentially consistent tendencies and relationships between variables.

4 Results and discussion: order effects

4.1 Order effects in the whole task

Graphic 4 above showed that the number of actions decreases dramatically from set A to the rest in all subjects. This is the first and most obvious order effect, and might be due to the students' lack of experience as evaluators.
They would start by performing many actions, only to progressively realize that this meant too much work or was unnecessary. Graphic 6 shows that classifications were the only type of action that increased from set A through D. The reason was that one of the subjects stopped changing and started classifying during the task. This supports the notion that the decrease in actions might be due to subjects' adjusting their effort to the task.

Graphic 6: Types of actions in different sets

Graphic 7 shows that normalized phenomena account for only ca. 5% of all actions, and that salient phenomena >5 (those singled out by at least 6 out of 10 evaluators) stay around 20% in sets A, C, and D. Text B was the first technical translation, and students were not familiar with the subject matter. This might explain the drop in coincidences. The relative increase of normalized phenomena within salient phenomena probably indicates that evaluators felt uncomfortable with the text, since they refrained a little from venturing into potentially questionable actions.

Graphic 7: Percentage of salient phenomena >5 in each set

4.2 Order effects within sets

Table 7 shows the number of actions in the three subsets within each set. There is a general tendency to reduce the quantity of actions per subset, which may be due to an improvement in efficiency, such as the one we might expect as the product of the use of a macrostrategy. Again, this supports the notion of the evaluators learning how to carry out the task as they were doing it. The exception is set D, where subset III has more actions than subset II, but it also has a lower quality judgment average.

set/subset       I      II     III
A              1955    1856    1659
B               974     818     640
C               782     715     657
D               733     517     603
Subset total   4444    3906    3559

Table 7: Number of actions per subset

While there is a tendency for most types of action to appear less in subsets II and III across sets, suppressions increase in sets B and D; additions and changes, in set D; and classifications, in sets C and D. The increase of suppressions throughout three sets may be taken to indicate that evaluators had a clearer notion of the relevance of the information. The evaluators turned to classifications throughout sets C and D, perhaps as a way to spare effort.

Graphic 8: Percentage of salient >5 phenomena, in each subset

Graphic 8 shows that salient phenomena, i.e., coincidences between subjects' actions, are higher in subset II, probably an indicator of subjects' similar contextualization. On the other hand, the drop in normalized phenomena in subset II might be explained as a low point in their level of self-confidence.

4.3 Order effects within the texts

Table 8 shows the relationship between quality judgments and number of actions in different translation sections. Evaluators seem to identify phenomena and perform actions in all sections of the translations, but the further down in the text, the stronger the effect on their judgment of the quality of the translation as a whole. Interestingly, this does not correspond to the percentage of salient phenomena, which drops in central sections, mainly due to the reduction of not-normalized phenomena. Thus, after the first section, these subjects seem to have become more assertive but also less personal, while they feel more comfortable with the task in the third section. The tendency towards increasing significance is evident at sentence level, since actions in the first sentences of the translations do not correlate with quality judgments.
The relationship between quality judgments and actions in translation text segments which received special typographic treatment or else stood out due to their position in the text, such as titles, headings, captions and the like, showed a lower significance than regular segments. Hence, visual prominence was ruled out as an explanation for the first- and last-sentence results. Quality judgments are independent of the quantity of actions introduced when considered by subject. Lenient evaluators do perform more actions than demanding evaluators, and medium evaluators perform the fewest, as shown in graphic 10.

                           Pearson     Sig. (bil.)
Segment    outstanding     −0.324*     0.025
           regular         −0.525**    0.000
Sentence   first           −0.057      0.701
           last            −0.514**    0.000
           rest            −0.522**    0.000
Section    initial         −0.411**    0.004
           central         −0.548**    0.000
           final           −0.597**    0.000
** Correlation significant at 0.01; * significant at 0.05

Table 8: Relationship between quality judgments and actions

Graphic 9: Percentage of salient phenomena (>5) in initial, central, and final sections of translations

Graphic 10: Quantity of actions in initial, central, and final sections by lenient, medium, and demanding evaluators

Graphic 11: Quantity of actions in initial, central, and final sections of translations, per average quality judgment

Another interesting effect can be traced (graphic 11) when the number of actions in translations' sections (their initial, middle, and final parts) is correlated to average quality judgments. Bad and Good translations show a similar pattern of subjects' behavior, where initial sections contain a number of actions which slightly decreases in central sections, only to rise again minimally in final sections.
Very bad translations, however, show a steady increase in the number of actions across sections, and Very good translations present a constant decrease in the number of actions as the text progresses. This might point to an emotional involvement of evaluators.

5 Conclusion

Technical translations received higher quality judgments than general translations. As expected, a higher number of actions usually corresponds to lower quality judgments, but some order effects modify these results (see below). We found statistically significant correlations between some subjects' final quality judgments, usually in pairs (in a group of ten). A couple of subjects seemed to consistently value information completeness above text length. Our framework also worked quite well to distinguish between product-oriented and feedback-oriented evaluators, since marking, adding, suppressing, and changing in the body of texts strongly correlate. Intersubject comparison allowed us to distinguish lenient, medium, and demanding evaluators, who showed consistent behavioral trends. Lenient evaluators seemed more even in their demands throughout the task and performed more actions on the translations, especially changes and suppressions, since they seemed to be product-oriented. Medium evaluators performed the fewest actions, were fairly uneven in their demands, and tended to classify. Demanding evaluators were uneven in their demand and tended to perform just minimal feedback-oriented actions. As for order effects, salient phenomena increased from one translation set to the next. A nearly constant decrease in the number of actions, with a steep difference between the first and second sets, seems to point to a learning curve whose goal is minimizing effort while performing well in the task. However, these drops do not correlate with final quality judgments. In fact, first-sentence actions did not usually affect final quality judgments.
Classifications rose towards the end of the task, and so did suppressions, additions, and changes. Since these seem to be closer to product-oriented evaluation, this mode might be thought of as demanding less cognitive effort, perhaps due to the absence of additional metalinguistic, probably conscious, demands. The farther down the text, the stronger the correlation between the number of actions and final quality judgments, an increase paralleled by a constant increase in salient phenomena from one translation set to the next. This might indicate a stronger correlation between salient phenomena and quality judgments than between more individual phenomena and quality judgments. This is, perhaps, a starting point for a motivated set of translation evaluation criteria. At single-text level, subjects tended to use the first third of the text for contextualization purposes, as an addressee would do, but this departs from what might be expected of a professional, who knows that first impressions may have more influence on the reader's opinion. This might be a subject of study from a pedagogical perspective, to discern whether specific training improves subjects' performance at the beginning of texts. In the second third, subjects refrain a little from acting upon not-normalized phenomena. This could be a symptom that subjects may have just developed a macrostrategy or macrostructure, with perhaps a number of rules or criteria, and stick to it. Not-normalized phenomena did rise in the third section, where subjects probably felt more confident about their performance and could free mental resources by sticking to their plans. If so, then reducing effort does not seem incompatible with developing ad hoc mental structures to handle evaluations. On the contrary, the reduction of not-normalized actions might be a good indicator of the existence of mental constructions of some sort governing or interacting with the evaluation process.
All these are just speculations based on the results of ten subjects, but their quantity and nature seem to support the notion that the framework is useful for studying translation evaluation. And it does so in a way that makes it compatible with second-generation cognitive paradigms such as situated cognition. In the near future, we will cross-analyze these variables with larger numbers of subjects and also across different population groups: apart from translation students, we will study professional translators, translation teachers, and addressees. Comments are more than welcome. Full data and colored graphics are available upon request.