Assignment 1
Type of Questions: MCQ
Question 1: What would be the number of tokens for the following sentence after
a) word-level tokenization by space, and b) character-level tokenization?
All good things come to an end.
1. 8, 25
2. 7, 31
3. 8, 31
4. 7, 25
Answer: 2
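Solution: Word-level tokenization by space gives 7 tokens:
[‘All’, ‘good’, ‘things’, ‘come’, ‘to’, ‘an’, ‘end.’]
Character-level tokenization gives 31 tokens, since spaces and the final dot are counted: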
[‘A’, ‘l’, ‘l’, ‘ ’, ‘g’, ‘o’, ‘o’, ‘d’, ‘ ’, ‘t’,
‘h’, ‘i’, ‘n’, ‘g’, ‘s’, ‘ ’, ‘c’, ‘o’, ‘m’, ‘e’, ‘ ’,
‘t’, ‘o’, ‘ ’, ‘a’, ‘n’, ‘ ’, ‘e’, ‘n’, ‘d’, ‘.’]
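The two counts can be checked with a few lines of Python. This is a minimal sketch: the sentence is the one spelled out by the character list above, and the variable names are illustrative only.

```python
sentence = "All good things come to an end."

# a) Word-level tokenization: split on single spaces.
#    Punctuation stays attached to the word, so "end." is one token.
word_tokens = sentence.split(" ")

# b) Character-level tokenization: every character is a token,
#    including the spaces and the final dot.
char_tokens = list(sentence)

print(len(word_tokens), len(char_tokens))
```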
Question 2: If we use the regular expression "\.[ ]+" (Python syntax) for sentence
tokenization, what problems may we face?
3. There may be more than one space after the end of a sentence.
4. A sentence may end with punctuation other than ‘.’ (dot).
Answer: 1, 2, 4
Solution:
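How the pattern behaves can be checked directly with Python's standard `re` module. This is a minimal sketch: the sample strings are made up for illustration, and the abbreviation case is an extra observation rather than one of the options listed above.

```python
import re

pattern = r"\.[ ]+"  # a dot followed by one or more spaces

# Several spaces after a dot are handled, because [ ]+ matches one or more.
# '?' and '!' are never treated as sentence boundaries, so those sentences stay merged.
sample = "It rained.   We stayed in. Did you go out? Yes! We did."
sentences = re.split(pattern, sample)
print(sentences)

# A dot inside an abbreviation is also followed by a space, so it triggers a split.
abbrev = re.split(pattern, "Mr. Bennet waited on Mr. Bingley. He left.")
print(abbrev)
```

Note that `re.split` drops the matched dot and spaces, which is why the pieces lose their final dot.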
1. freq(wf) > freq(wc)
3. freq(wf) ≈ freq(wc)
4. Not comparable
Answer: 1
Solution: Function words belong to a closed set of words and are limited in number.
Thus any function word is generally more frequent in a text than any content word.
Mr. Bennet was among the earliest of those who waited on Mr. Bingley.
He had always intended to visit him, though to the last always assuring
his wife that he should not go; and till the evening after the visit was paid
she had no knowledge of it.
Find the running-average TTR for a window of length 40. (Assume that the text is
tokenized by spaces only.)
1. 0.7917
2. 0.8389
3. 1.2632
4. 0.7533
Answer: 2
Solution: We get 48 tokens after tokenization.
For all the sliding windows in [1, 40], [2, 41], ..., [9, 48] we get the TTR values as follows.
[0.825, 0.825, 0.825, 0.85, 0.825, 0.85, 0.85, 0.85, 0.85]
The average is 0.8389 (rounded to 4 decimal places).
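The window-by-window values can be reproduced with a short script. This is a minimal sketch, assuming tokens are compared case-sensitively and with punctuation attached (so "He" and "he" are different types); the variable names are illustrative.

```python
text = (
    "Mr. Bennet was among the earliest of those who waited on Mr. Bingley. "
    "He had always intended to visit him, though to the last always assuring "
    "his wife that he should not go; and till the evening after the visit was paid "
    "she had no knowledge of it."
)

tokens = text.split(" ")  # tokenize by spaces only
window = 40

# TTR of each sliding window = distinct tokens (types) / window length
ttrs = [
    len(set(tokens[i:i + window])) / window
    for i in range(len(tokens) - window + 1)
]

running_avg = sum(ttrs) / len(ttrs)
print(len(tokens), ttrs, round(running_avg, 4))
```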
1. Inflectional Morphology
2. Ambiguity in the function of punctuation
3. Sandhi
4. None of these
Answer: 3
Solution: Refer to lecture 5
Question 8: Which of the following algorithms can be used for automatically creating
decision trees?
1. Gradient Descent
2. ID3
3. Adaboost
4. C4.5
Answer: 2, 4
Solution: Refer to lecture 5
Question 9: If the TTR (type-to-token ratio) value in a book after the first 500 tokens
is r500 and after the first 50000 tokens is r50k, then what is the expected order of these
two ratios?
3. r500 ≈ r50k
4. Not comparable
Answer: 2
Solution: Refer to lecture 4
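The intuition is that new types appear more and more slowly as a text grows, so TTR tends to fall with length. The sketch below illustrates this on synthetic data rather than a real book: it assumes word frequencies roughly follow a Zipf-like 1/rank distribution, and the vocabulary size, seed and names are arbitrary choices made for the illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical vocabulary with Zipf-like weights (weight proportional to 1/rank).
vocab_size = 5000
vocab = [f"w{rank}" for rank in range(1, vocab_size + 1)]
weights = [1.0 / rank for rank in range(1, vocab_size + 1)]

# A synthetic "book" of 50000 tokens drawn from that distribution.
tokens = random.choices(vocab, weights=weights, k=50_000)

def ttr(toks):
    """Type-to-token ratio: distinct tokens divided by total tokens."""
    return len(set(toks)) / len(toks)

r500 = ttr(tokens[:500])  # TTR after the first 500 tokens
r50k = ttr(tokens)        # TTR after the first 50000 tokens
print(r500, r50k)
```

With only 5000 possible types, r50k cannot exceed 0.1, whereas the first 500 tokens contain proportionally far more distinct words.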