General Preference Ranking Guidelines
Apple Confidential–Internal Use Only
How to Use This Document
This document serves as a general reference for the Preference Ranking annotation
projects: the Summarization (CYU) and Text Composition (TC) tasks.
For detailed information on the requirements and specifications of each
individual task, please refer to the specific guidelines for each task.
Overview
Users reach out to digital assistants for various reasons: to ask for specific information, to give instructions (e.g., create a
passage, write code), or simply to chat. Because of that, the majority of user requests are conversational and might be filled
with colloquialisms, idioms, or unfinished phrases. Just like in human-to-human interaction, a user might comment on the
digital assistant’s response or ask a follow-up question. While a digital assistant is very capable of generating human-like
conversations, it still has limitations. For example, it is challenging for the assistant to judge how accurate or safe (not
harmful) its response is. This is where your role as an analyst comes into play. The purpose of this project is to evaluate
digital assistant responses to ensure they are relevant, accurate, concise, and safe.
• Summarization (CYU): The task involves evaluating summarization responses generated by a virtual assistant. The
process consists of examining instructions and original input text (which could be emails, chat messages, or other
forms of text) and then evaluating the assistant's summary based on specific criteria.
• Text Composition: The Text Composition tasks involve generating or evaluating various text types, such as replies,
rewrites, tone adjustments, and text-to-list or table transformations. These tasks focus on ensuring the text follows
instructions, remains grounded in the input, is comprehensive, and maintains proper tone, style, and grammar.
Please find below a summary of each task type involved in the Text Composition project:
o Tone Adjust: For this task, you will evaluate, based on the present dimensions, how well the responses
rewrite a text in a specific tone (e.g., professional, concise, friendly). The tone is adjusted without changing
the message or adding new content. The goal is to modify the style and tone to match the requested preset while
preserving meaning and clarity. The available preset tones, chosen per user request, are concise, professional,
and friendly.
o Key Points Transform: For this task, you will evaluate, based on the present dimensions, how well the responses
condense the input text into a prioritized list of key points. The most important points are listed first,
preserving the semantic meaning of the input. The goal is to create a clear, concise list of bullet points
summarizing the essential aspects of the original text.
o List Transform: For this task, you will evaluate, based on the present dimensions, how well the responses
convert the input text into a bullet-point list that maintains the original order and content. Each point should
be concise and distinct. The goal is to simplify the input content into a clear, logically ordered list without
changing meaning.
o Mail Long Reply: For this task, you will evaluate, based on the present dimensions, how well the responses
generate a full-form email response from an input mail, a user-selected reply snippet, and user-provided
answers to information-gathering questions. The long reply is comprehensive and must align with the
snippet, the input mail, and any additional information gathered from questionnaires. The goal is to formulate a
full, coherent email response addressing all relevant points of the input email.
o Mail Short Reply: For this task, you will evaluate, based on the present dimensions, how well the responses
generate short, snippet-based replies to emails. These responses offer a summary of the final mail
reply options (e.g., "Got it!", "Thanks for the update!"). When presented with an email from a sender, a list of
suggested responses is generated for the receiver to select from. The receiver will then choose a response that is
meaningful and appropriate to the ongoing conversation and reply to the sender. Possible scenarios for the chosen
response include, but are not limited to, gratitude, clarification, and agreement. The goal is to
evaluate the brief snippet or topic suggestions for the user to finalize the full email reply.
o Message and Mail Short Reply: For this task, you will evaluate how well, based on present dimensions, the
o Message Reply: For this task, you will evaluate, based on the present dimensions, how well the responses
generate replies to a message exchange in a conversation. The reply must be meaningful and appropriate to
the ongoing conversation. Some scenarios for this chosen response include gratitude, enjoyment, opinion,
clarification, empathy, apology, agreement, status updates, etc. The goal is to provide a meaningful, direct reply
in conversational style with no additional steps.
o Proofreading: For this task, you will evaluate, based on the present dimensions, how well the responses
review and correct text for grammar, spelling, punctuation, and style without changing the meaning. The
goal is to correct all errors in grammar, punctuation, and structure while retaining the original content’s intent
and tone.
o Questionnaire for Mail Reply: For this task, you will evaluate, based on the present dimensions, how well the
responses generate a set of questions and answers to gather additional information needed based on an
input email. The questionnaire supplements the mail reply snippet by helping the user gather any missing or
unclear information from the initial email. The goal is to craft questions that are relevant to the email’s context
and the selected reply snippet, ensuring that all necessary details are collected to complete the final mail reply.
o Rewrite: For this task, you will evaluate, based on the present dimensions, how well the responses rewrite
input text in a different format, style, or tone while keeping the original meaning intact. The goal is to produce a
rewritten version of the input, improving fluency, clarity, and structure while maintaining the content’s core
message.
o Table Transform: For this task, you will evaluate, based on the present dimensions, how well the responses
turn textual data into a table format with headers and organized rows/columns. The output table should
represent the data concisely and clearly. The goal is to transform the input data into a well-structured table with
accurate headers and values, maintaining the meaning and integrity of the original content.
The Response Evaluation task involves two steps: Single Response Rating and Preference Ranking. The following sections
provide an overview of the Single Response Rating and Preference Ranking steps and the dimensions they use, based on the
guidelines.
For more detailed information on these steps, please visit the corresponding sections in the guidelines for each project
(and, in the case of Text Composition, task type).
SINGLE RESPONSE RATING
In this step, each response is analyzed against predefined criteria called dimensions and then rated to determine its overall
quality. Dimensions outline the key factors or parameters that contribute to the overall quality or effectiveness of a response.
This step evaluates one response based on specific criteria (e.g., following instructions, groundedness, comprehensiveness)
and assigns it a rating, typically using a scale (e.g., fully following, partially following, or not following instructions).
Single response rating involves evaluating a particular response across the following dimensions:
1. Following instructions
2. Composition
3. Comprehensiveness
4. Groundedness
5. Harmfulness
6. Satisfaction
NOTE: Each dimension has a set of dedicated criteria. For the definition and examples of each
dimension and how it should be treated for each specific task, please visit the corresponding
guideline for each project (and, in the case of Text Composition, task type).
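For analysts who keep their own notes while rating, the six dimensions listed above can be pictured as one record per response. The following is only an illustrative sketch and not part of any annotation tool: the names `DIMENSIONS`, `ResponseRating`, and `missing_dimensions` are hypothetical, and the rating labels for most dimensions are defined in the task-specific guidelines, not here.

```python
from dataclasses import dataclass, field

# The six dimensions named in this document (identifiers are hypothetical).
DIMENSIONS = (
    "following_instructions",
    "composition",
    "comprehensiveness",
    "groundedness",
    "harmfulness",
    "satisfaction",
)


@dataclass
class ResponseRating:
    """One analyst's ratings for a single response (assumed structure)."""

    response_id: str
    # Maps a dimension name to the label chosen by the analyst.
    ratings: dict = field(default_factory=dict)

    def missing_dimensions(self):
        """Return the dimensions that have not been rated yet, in list order."""
        return [d for d in DIMENSIONS if d not in self.ratings]


# Example: only one dimension has been rated so far.
rating = ResponseRating("A", {"following_instructions": "partially following"})
print(rating.missing_dimensions())
```

A check like `missing_dimensions` is just a reminder aid. Which dimensions actually apply to a given task is determined by the guideline for that task.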
PREFERENCE RANKING
Once each response is evaluated (Single Response Rating), the next step is to compare and rank two or more responses against
each other (Preference Ranking). This evaluation is based on the previously defined dimensions, including the Harmfulness
(for TC) and Satisfaction (for both TC and CYU) dimensions. Responses are ranked based on how well they meet the evaluation
criteria, determining which response is better or more appropriate in the given context.
4. COMMENTS
Comment Example and Rationale

Good
Comment (Example 1): Responses E & C (rank 1 and 2) provided the best responses with clear and concise information. E is 1
just based on preference because I believe the language used can be easily understood because they used simplistic
language. C is a little more technical with its terminology. Response A/rank 3 is in the middle because it was satisfying but
it could have definitely gone into more details to better support their example. Rank 4 and 5 are about the same, too broad to
fully satisfy as a complete answer but would ultimately be useful if my goal was to find examples without explanations.
Rationale: Both comments compare the responses to each other, pointing out their strengths and weaknesses in relation to
the criteria used for evaluation. This comparative analysis helps in understanding why some responses were ranked higher or
lower. In Example 1, the commenter mentions a very specific reason to prefer response E compared to C based on language
simplicity.

Bad
Comment: I ranked the responses based on the helpful, truthful, and harmless principles.
Rationale: This comment is not informative for someone who tries to understand the reasons behind the ranking. Please be
specific on which principle makes one response better and how. For instance, if you think one response is more helpful,
briefly explain why it is more helpful.

Comment: The ranking is done based on their content and formatting.
Rationale: Please be specific on how one response is better than the other in terms of content and formatting. For instance,
it is not clear how the content and formatting differ among the responses.

Comment: All responses are truthful and helpful to varying degrees.
Rationale: Please be specific on how Truthfulness and Helpfulness vary across responses. For instance, it is not clear why
one response is more truthful than the others.

Comment: Response C mentions the book that is not about romantic relations.
Rationale: The comment is not informative enough. Be specific and provide the title of the book or any other reference from
the response that will help engineers to understand what book you mention.

Comment: It would be the best response but contains issues.
Rationale: The comment requires specific information about why the response "would be the best" and what "issues" it
contains.
5. RESOURCES
Domain: Coding/Computer Science/STEM
Description: These websites provide resources related to programming.
Links:
• Stack Overflow
• Compiler
• Tutorials Point
• Geeks for Geeks
• Python Standard Library
• Programming Cheat Sheets
• Codecademy Cheatsheets

Domain: Coding
Description: Code editors are tools typically used by programmers and web developers to write and edit code. These tools
can be used to efficiently review responses. This is the list of code editors approved by Apple.
Links:
• Visual Studio Code
o Allows you to test locally on your Mac.
o Run “Make me Admin” in Self Service to temporarily elevate privileges for the installation.
• SublimeText
• Notepad++
• Xcode
• BBEdit
• Eclipse
• MacVim