Natural Language Processing

Assignment 1
Type of Questions: MCQ

Number of Questions: 9 Total Marks: 9

Question 1: What would be the number of tokens for the following sentence after:
a) word level tokenization by space, and b) character level tokenization?

“All good things come to an end.”

1. 8, 25

2. 7, 31

3. 8, 31

4. 7, 25

Answer: 2
Solution: Character-level tokenization gives 31 tokens, since the six spaces and the final period each count as a character:
[‘A’, ‘l’, ‘l’, ‘ ’, ‘g’, ‘o’, ‘o’, ‘d’, ‘ ’, ‘t’,
‘h’, ‘i’, ‘n’, ‘g’, ‘s’, ‘ ’, ‘c’, ‘o’, ‘m’, ‘e’, ‘ ’,
‘t’, ‘o’, ‘ ’, ‘a’, ‘n’, ‘ ’, ‘e’, ‘n’, ‘d’, ‘.’]

Word-level tokenization by space gives 7 tokens (the final period stays attached to “end.”):

[‘All’, ‘good’, ‘things’, ‘come’, ‘to’, ‘an’, ‘end.’]
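The two counts can be checked with a few lines of Python. This is a minimal sketch that assumes the sentence is typed with plain ASCII characters and single spaces; the variable names are illustrative only.

```python
sentence = "All good things come to an end."

# (a) Word-level tokenization: split on single spaces only, so the
# final period stays attached to the last word.
word_tokens = sentence.split(" ")

# (b) Character-level tokenization: every character, including each
# space and the final period, becomes its own token.
char_tokens = list(sentence)

print(len(word_tokens), len(char_tokens))
```

The word-level count is 7 and the character-level count is 31, matching option 2.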

Question 2: If we use the regular expression "\.[ ]+" (python syntax) for sentence
tokenization, what problems may we face?

1. ‘.’(dot) may be part of an abbreviation

2. There may not be a space after the end-of-sentence.

3. There may be more than one space after the end-of-sentence.
4. A sentence may end with punctuation other than ‘.’ (dot).
Answer: 1, 2, 4
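Solution: The pattern matches a dot followed by one or more spaces. It therefore splits after the dot of an abbreviation (option 1), misses a boundary that has no space after the dot (option 2), and misses sentences that end with ‘?’ or ‘!’ (option 4). More than one space after the dot is not a problem, because “[ ]+” matches any number of spaces.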
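The behaviour of the pattern can be seen by running it on a few short inputs. The sketch below uses Python's standard `re` module; the helper name `split_sentences` and the four example strings are made up for illustration and are not part of the assignment.

```python
import re

# The pattern from the question: a literal dot followed by one or more spaces.
BOUNDARY = re.compile(r"\.[ ]+")

def split_sentences(text):
    """Split text wherever the pattern matches (the matched dot is dropped)."""
    return BOUNDARY.split(text)

# Hypothetical inputs, one per answer option.
abbreviation = split_sentences("Dr. Smith arrived. He sat down.")  # option 1
no_space = split_sentences("It works.Really.")                     # option 2
many_spaces = split_sentences("One.   Two.")                       # option 3
other_punct = split_sentences("Is it done? Yes it is.")            # option 4

for result in (abbreviation, no_space, many_spaces, other_punct):
    print(result)
```

The abbreviation input is split into three pieces instead of two, the no-space and question-mark inputs are not split at all, and the input with several spaces is split correctly.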

Question 3: A text processing system found the following sentence in a document. What are the most probable reasons for the two hyphens?

“With general-purpose computers becoming more and more power-ful, multimedia
devices like iPod or Walkman have become rare.”
1. Sententially determined hyphen, End-of-line hyphen
2. Sententially determined hyphen, Lexical hyphen
3. Sententially determined hyphen, Sententially determined hyphen
4. Lexical hyphen, Sententially determined hyphen
Answer: 1
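Solution: “Powerful” is a word without any hyphens, so the hyphen in “power-ful” is most likely an end-of-line hyphen. “General-purpose” is hyphenated because the two words jointly modify the noun “computers”, which makes it a sententially determined hyphen.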

Question 4: Consider the imaginary words “starking” and “ylding”. If we pass them through the Porter Stemmer algorithm, what would be the outputs?
1. stark, yld
2. star, yld
3. starking, ylding
4. stark, ylding
Answer: 4
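Solution: The Porter Stemmer removes the suffix “-ing” only if the remaining stem contains a vowel. The stem “stark” contains the vowel “a”, so “starking” becomes “stark”. In “yld” the initial “y” counts as a consonant (“y” is treated as a vowel only when it follows a consonant), so the stem has no vowel and “ylding” is left unchanged.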
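The rule that decides this question is the "(*v*) ING" rule of step 1b of the Porter algorithm: the suffix is dropped only when the remaining stem contains a vowel. The sketch below implements just that one rule, not the full stemmer, and the function names are illustrative. For these two words the later steps of the algorithm should not change the result.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels, and y is a vowel
    only when the preceding letter is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # A leading y is a consonant; otherwise y is a consonant only
        # when the previous letter is a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def contains_vowel(stem):
    """True if at least one position in the stem is a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def strip_ing(word):
    """Apply only the '(*v*) ING -> (empty)' rule of step 1b."""
    if word.endswith("ing"):
        stem = word[:-3]
        if contains_vowel(stem):
            return stem
    return word

print(strip_ing("starking"), strip_ing("ylding"))
```

The same condition explains why a real word such as "sing" is not reduced to "s": the would-be stem has no vowel.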

Question 5: What is the order relation between the frequencies of any function word wf and any content word wc?

1. freq(wf) > freq(wc)

2. freq(wf) < freq(wc)

3. freq(wf) ≈ freq(wc)

4. Not comparable

Answer: 1
Solution: Function words belong to a closed set of words and are limited in number.
Thus any function word is generally more frequent in a text than any content word.

Question 6: Consider the following text:

Mr. Bennet was among the earliest of those who waited on Mr. Bingley.
He had always intended to visit him, though to the last always assuring
his wife that he should not go; and till the evening after the visit was paid
she had no knowledge of it.

Find the running-average TTR for a window of length 40. (Assume that the text is
tokenized by spaces only.)

1. 0.7917

2. 0.8389

3. 1.2632

4. 0.7533

Answer: 2
Solution: We get 48 tokens after tokenization.
For all the sliding windows in [1, 40], [2, 41], ..., [9, 48] we get the TTR values as follows.
[0.825, 0.825, 0.825, 0.85, 0.825, 0.85, 0.85, 0.85, 0.85]
The average is 0.8389 (rounded to 4 decimal places).
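The calculation can be reproduced with a short script. This is a sketch under the question's assumption of space-only tokenization; it also assumes that types are compared case-sensitively with punctuation left attached (so "He" and "he" are different types), which is what yields the values listed above. The function name `running_average_ttr` is illustrative.

```python
text = (
    "Mr. Bennet was among the earliest of those who waited on Mr. Bingley. "
    "He had always intended to visit him, though to the last always assuring "
    "his wife that he should not go; and till the evening after the visit was paid "
    "she had no knowledge of it."
)

def running_average_ttr(tokens, window):
    """Return the TTR of every sliding window and the mean of those TTRs."""
    ttrs = [
        # TTR of one window = number of distinct tokens / window length.
        len(set(tokens[start:start + window])) / window
        for start in range(len(tokens) - window + 1)
    ]
    return ttrs, sum(ttrs) / len(ttrs)

tokens = text.split(" ")
ttrs, average = running_average_ttr(tokens, 40)
print(len(tokens), ttrs, round(average, 4))
```

With 48 tokens and a window of 40 there are 48 - 40 + 1 = 9 windows, four with 33 types and five with 34 types, so the mean is (4 × 0.825 + 5 × 0.85) / 9 ≈ 0.8389.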

Question 7: Which of the following phenomena make word segmentation difficult in the Sanskrit language?

1. Inflectional Morphology

2. Ambiguity in the function of punctuation

3. Sandhi

4. None of these

Answer: 3
Solution: Refer to lecture 5

Question 8: Which of the following algorithms can be used for automatically creating
decision trees?

1. Gradient Descent

2. ID3

3. Adaboost

4. C4.5

Answer: 2, 4
Solution: Refer to lecture 5

Question 9: If the TTR (type-to-token ratio) of a book after the first 500 tokens
is r500 and after the first 50000 tokens is r50k, then what is the expected order of these
two ratios?

1. r500 < r50k

2. r500 > r50k

3. r500 ≈ r50k

4. Not comparable

Answer: 2
Solution: Refer to lecture 4
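The trend behind this answer can be illustrated on a much smaller scale with the 48-token passage from Question 6: as more of a text is read, repeated words accumulate, so the ratio of types to tokens tends to fall. The sketch below is only an illustration on that short passage, not the 500 versus 50000 token setting of the question, and it assumes space-only, case-sensitive tokenization.

```python
text = (
    "Mr. Bennet was among the earliest of those who waited on Mr. Bingley. "
    "He had always intended to visit him, though to the last always assuring "
    "his wife that he should not go; and till the evening after the visit was paid "
    "she had no knowledge of it."
)

def ttr(tokens):
    """Type-to-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

tokens = text.split(" ")
early = ttr(tokens[:10])  # TTR after the first 10 tokens
full = ttr(tokens)        # TTR after all 48 tokens

print(round(early, 4), round(full, 4))
```

The first 10 tokens are all distinct (TTR 1.0), while the full passage has 38 types over 48 tokens (TTR about 0.7917).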
