General Preference Ranking Guidelines


Apple Confidential–Internal Use Only
How to Use This Document

This document serves as a general reference for two Preference Ranking annotation
projects: Summarization (CYU) and Text Composition (TC).
For detailed information on the requirements and specifications of each
individual task, please refer to the specific guidelines for each task.

Overview

Users reach out to digital assistants for various reasons: to ask for specific information, to give instructions (e.g., create a passage, write
code), or simply to chat. Because of that, the majority of user requests are conversational and might be filled with colloquialisms, idioms, or
unfinished phrases. Just like in human-to-human interaction, a user might comment on the digital assistant’s response or ask a follow-up
question. While a digital assistant is very capable of generating human-like conversations, it still has limitations. For example, it is
challenging for the assistant to judge how accurate or safe (not harmful) the response is. This is where your role as an analyst comes into
play. The purpose of this project is to evaluate digital assistant responses to ensure they are relevant, accurate, concise, and safe.

Task | Sub-Task | Description
Response Evaluation | a. Provide ratings for each response | You will be shown at least 2 responses. At this stage you will provide ratings for each response (you might be shown a subset of rating questions). The evaluation will be done based on a set of predetermined dimensions.
Response Evaluation | b. Compare responses | You will then go through a series of comparisons (if more than 2 responses are shown) and rate which side you prefer. The evaluation will be done based on a set of predetermined dimensions.
Comments | N/A | Leave a detailed comment explaining why one side was preferred over the other, or why they are the same.

Note that these guidelines serve as a general reference for two Preference Ranking annotation projects: Summarization
(CYU) and Text Composition (TC).

Please find below a general overview of both projects:

• Summarization (CYU): The task involves evaluating summarization responses generated by a virtual assistant. The
process consists of examining instructions and original input text (which could be emails, chat messages, or other
forms of text) and then evaluating the assistant's summary based on specific criteria.

• Text Composition: The Text Composition tasks involve generating or evaluating various text types, such as replies,
rewrites, tone adjustments, and text-to-list or table transformations. These tasks focus on ensuring the text follows
instructions, remains grounded in the input, is comprehensive, and maintains proper tone, style, and grammar.

Please find below a summary of each task type involved in the Text Composition project:

o Tone Adjust: For this task, you will evaluate, based on the present dimensions, how well the responses rewrite
a text in a specific tone (e.g., professional, concise, friendly). The tone is adjusted without changing
the message or adding new content. The goal is to modify the style and tone to match the preset requested by
the user (concise, professional, or friendly) while preserving meaning and clarity.

o Key Points Transform: For this task, you will evaluate, based on the present dimensions, how well the responses
condense the input text into a prioritized list of key points. The most important points are listed first,
preserving the semantic meaning of the input. The goal is to create a clear, concise list of bullet points
summarizing the essential aspects of the original text.

o List Transform: For this task, you will evaluate, based on the present dimensions, how well the responses
convert the input text into a bullet-point list that maintains the original order and content. Each point should
be concise and distinct. The goal is to simplify the input content into a clear, logically ordered list without
changing meaning.

o Mail Long Reply: For this task, you will evaluate, based on the present dimensions, how well the responses
generate a full-form email response from an input mail, a user-selected reply snippet, and user-provided
answers to information-gathering questions. The long reply is comprehensive and must align with the
snippet, the input mail, and any additional information gathered from questionnaires. The goal is to formulate a
full, coherent email response addressing all relevant points of the input email.

o Mail Short Reply: For this task, you will evaluate, based on the present dimensions, how well the responses
generate short, snippet-based replies to emails. These responses offer a summary of the final mail
reply options (e.g., "Got it!", "Thanks for the update!"). When presented with an email from a sender, a list of
suggested responses is generated for the receiver to select from. The receiver will then choose a response that is
meaningful and appropriate to the ongoing conversation and reply to the sender. Possible scenarios for the chosen
response include, but are not limited to, gratitude, clarification, and agreement. The goal is to
evaluate the brief snippet or topic suggestions for the user to finalize the full email reply.

o Message and Mail Short Reply: For this task, you will evaluate, based on the present dimensions, how well the
responses generate brief responses for messages and emails. For messages, the response includes all
necessary information to be sent directly. In a mail reply, a suggested response is merely a topic or snippet of a
potential full reply mail in the next stage (e.g., "Got it!"); it doesn’t include a complete email format or
detailed information. Conversely, in a message reply, a suggested response includes all the necessary information
and may incorporate emojis or figurative language along with text; this response is ready to be sent directly. It's
crucial to recognize that in certain situations, no reply, represented by an empty response, may be appropriate.
The goal is to provide concise, context-appropriate responses or snippets that are relevant to the ongoing
conversation.

o Message Reply: For this task, you will evaluate, based on the present dimensions, how well the responses
reply to a message exchange in a conversation. The reply must be meaningful and appropriate to
the ongoing conversation. Some scenarios for this chosen response include gratitude, enjoyment, opinion,
clarification, empathy, apology, agreement, status updates, etc. The goal is to provide a meaningful, direct reply
in conversational style with no additional steps.

o Proofreading: For this task, you will evaluate, based on the present dimensions, how well the responses
review and correct text for grammar, spelling, punctuation, and style without changing the meaning. The
goal is to correct all errors in grammar, punctuation, and structure while retaining the original content’s intent
and tone.

o Questionnaire for Mail Reply: For this task, you will evaluate, based on the present dimensions, how well the
responses generate a set of questions and answers to gather the additional information needed, based on an
input email. The questionnaire supplements the mail reply snippet by helping the user gather any missing or
unclear information from the initial email. The goal is to craft questions that are relevant to the email’s context
and the selected reply snippet, ensuring that all necessary details are collected to complete the final mail reply.

o Rewrite: For this task, you will evaluate, based on the present dimensions, how well the responses rewrite
input text in a different format, style, or tone while keeping the original meaning intact. The goal is to produce a
rewritten version of the input, improving fluency, clarity, and structure while maintaining the content’s core
message.

o Table Transform: For this task, you will evaluate, based on the present dimensions, how well the responses
turn textual data into a table format with headers and organized rows/columns. The output table should
represent the data concisely and clearly. The goal is to transform the input data into a well-structured table with
accurate headers and values, maintaining the meaning and integrity of the original content.

The Response Evaluation task involves two steps: Single Response Rating and Preference Ranking. The following sections
provide an overview of Single Response Rating and Preference Ranking, and of the dimensions each step uses, based on the
guidelines.

For more detailed information on these steps, please visit the corresponding sections in the guidelines for each project
(and, in the case of Text Composition, task type).



Single Response Rating

In this step, each response is analyzed against predefined criteria called dimensions and then rated to determine its overall quality. Dimensions outline the
key factors or parameters that contribute to the overall quality or effectiveness of a response. Each response is evaluated on specific criteria (e.g.,
following instructions, groundedness, comprehensiveness) and assigned a rating, typically on a scale (e.g., fully following, partially following, or not
following instructions).

Single response rating involves evaluating a particular response across the following dimensions:
1. Following instructions
2. Composition
3. Comprehensiveness
4. Groundedness
5. Harmfulness
6. Satisfaction

NOTE: Each dimension has a set of dedicated criteria. For the definition and examples of each
dimension and how it should be treated for each specific task, please visit the corresponding
guideline for each project (and, in the case of Text Composition, task type).



Preference Ranking

Once each response is evaluated (Single Response Rating), the next step is to compare and rank the responses against each other (Preference
Ranking). This evaluation is based on the previously defined dimensions and includes the Harmfulness (for TC) and Satisfaction (for both TC and CYU)
dimensions. It compares two or more responses and ranks them based on how well they meet the evaluation criteria, determining which
response is better or more appropriate in the given context.

NOTE: Each dimension has a set of dedicated criteria. For the definition and examples of each
dimension and how it should be treated for each specific task, please visit the corresponding
guideline for each project (and, in the case of Text Composition, task type).


Summary of dimension applicability

Consolidated Guidelines

Dimension | Evaluation Type
Following Instructions | Single Response Rating
Comprehensiveness | Single Response Rating
Groundedness | Single Response Rating
Composition | Single Response Rating
Harmfulness | Single Response Rating
Satisfaction | Single Response Rating
Harmfulness (just for TC) | Preference Ranking
Satisfaction (for both TC and CYU) | Preference Ranking
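
As a rough illustration only, the applicability table can be written down as a small lookup structure. The sketch below is not part of any annotation tool: the names APPLICABILITY and dimensions_for are hypothetical, and it assumes that rows without a project qualifier apply to both TC and CYU, which the table does not state explicitly.

```python
# Hypothetical encoding of the dimension applicability table.
# Assumption: rows without a project qualifier apply to both TC and CYU.
BOTH = frozenset({"TC", "CYU"})

APPLICABILITY = [
    # (dimension, evaluation type, projects)
    ("Following Instructions", "Single Response Rating", BOTH),
    ("Comprehensiveness", "Single Response Rating", BOTH),
    ("Groundedness", "Single Response Rating", BOTH),
    ("Composition", "Single Response Rating", BOTH),
    ("Harmfulness", "Single Response Rating", BOTH),
    ("Satisfaction", "Single Response Rating", BOTH),
    ("Harmfulness", "Preference Ranking", frozenset({"TC"})),
    ("Satisfaction", "Preference Ranking", BOTH),
]


def dimensions_for(evaluation_type, project):
    """Return the dimensions rated for a given step and project, in table order."""
    return [
        dimension
        for dimension, step, projects in APPLICABILITY
        if step == evaluation_type and project in projects
    ]


if __name__ == "__main__":
    for project in ("TC", "CYU"):
        print(project, dimensions_for("Preference Ranking", project))
```

Under these assumptions, looking up Preference Ranking for CYU yields only Satisfaction, while TC yields both Harmfulness and Satisfaction, matching the last two rows of the table.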

Comments

Explain your ranking in the comments. Please be specific on why one response is better than the other.

NOTE: The comment should always be written in English regardless of your locale.

Comment Example (Good), Example 1:
Responses E & C (rank 1 and 2) provided the best responses with clear and concise information. E is 1 just based on preference because I believe
the language used can be easily understood because they used simplistic language. C is a little more technical with its terminology. Response A/rank
3 is in the middle because it was satisfying but it could have definitely gone into more details to better support their example. Rank 4 and 5 are
about the same, too broad to fully satisfy as a complete answer but would ultimately be useful if my goal was to find examples without explanations.

Comment Example (Good), Example 2:
C is the most complete and most informative. Ranked 1.
B is ranked 2 and it is very close to Rank 1 but missing some details on XXX.
A and D are ranked 3 and 4 because they are too generic and lack of details. A is slightly easier to understand.
E expresses biased judgement to YYY and is ranked as worst.

Rationale (Examples 1 and 2):
Both comments compare the responses to each other, pointing out their strengths and weaknesses in relation to the criteria used for evaluation. This
comparative analysis helps in understanding why some responses were ranked higher or lower.
In Example 1, the commenter mentions a very specific reason to prefer response E over C, based on language simplicity.
In both comments, the language is clear and concise. The comments contain the key points that drive the decision making.
Overall, these comments are well-structured, informative, and transparent in explaining the ranking process.


Comment Example (Bad): All responses are concise, follow instructions, harmless and satisfying.
Rationale: Unless all responses are equally good, such a comment is not informative, as it does not tell us why one
response is better than the other.

Comment Example (Bad): I ranked the responses based on the helpful, truthful, and harmless principles.
Rationale: This comment is not informative for someone who tries to understand the reasons behind the ranking. Please be
specific on which principle makes one response better and how. For instance, if you think one response is more helpful,
briefly explain why it is more helpful.

Comment Example (Bad): The ranking is done based on their content and formatting.
Rationale: Please be specific on how one response is better than the other in terms of content and formatting. For instance,
it is not clear how the content and formatting differ among the responses.


Comment Example (Bad): The above response can be considered helpful as the response is directly relevant to the user request, and satisfies what the user
request is asking for. The response correctly follows all the instructions in the user request. It is truthful as the response’s
information is completely accurate, without any false facts or misleading statements. It can be considered harmless as the
response does NOT contain moral judgement, or any opinionated content especially on controversial or sensitive topics.
Rationale: It is unclear which response the comment is referring to. Plus, it does not explain why a certain ranking is given.

Comment Example (Bad): All responses are truthful and helpful to varying degrees.
Rationale: Please be specific on how Truthfulness and Helpfulness vary across responses. For instance, it is not clear why one
response is more truthful than the others.

Comment Example (Bad): Response C mentions the book that is not about romantic relations.
Rationale: The comment is not informative enough. Be specific and provide the title of the book or any other reference from the
response that will help engineers understand which book you mean.

Comment Example (Bad): It would be the best response but contains issues.
Rationale: The comment requires specific information about why the response "would be the best" and what "issues" it contains.


Resources

Important Note
Use only the resources listed in the table. Refrain from using web-based services that require copying and pasting responses for
verification, as this could violate privacy regulations. However, you can still use web-based services for general research on the topic.

Domain: Coding/Computer Science/STEM
Description: These websites provide resources related to programming.
Links:
• Stack Overflow
• Compiler
• Tutorials Point
• Geeks for Geeks
• Python Standard Library
• Programming Cheat Sheets
• Codecademy Cheatsheets

Domain: Math/STEM
Description: These websites offer free resources to learners in various domains. They can be used to verify content for Math-related topics.
Links:
• https://www.khanacademy.org
• WolframAlpha


Domain: Health
Description: These websites provide trustworthy resources for health/medical topics. See this and this for more resources.
Links:
• CDC
• NIH
• Mayo Clinic
• NHS
• Medline
• Merck Manual
• UHC
• Lurie

Domain: Coding
Description: Code editors are tools typically used by programmers and web developers to write and edit code. These tools can be used to
efficiently review responses. This is the list of code editors approved by Apple.
Links:
• Visual Studio Code
  o Allows you to test locally on your Mac.
  o Run “Make me Admin” in Self Service to temporarily elevate privileges for the installation.
• SublimeText
• Notepad++
• Xcode
• BBEdit
• Eclipse
• MacVim


Congratulations!
You have completed the guidelines!
