Language Technology, Exercise 0 (Due Friday 9/12 11:59pm on Blackboard)

Part I. Dollar Word. (trivial)

Find all dollar words (case-insensitively) in a reasonably sized vocabulary list that you find on the Web.
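
A minimal sketch of the check, assuming the usual definition of a dollar word (a=1, b=2, ..., z=26, and the letter values of the word sum to exactly 100). The function names are hypothetical, and restricting to purely alphabetic ASCII words is an assumption rather than part of the assignment:

```python
def word_value(word):
    """Sum of letter values (a=1, ..., z=26), ignoring case."""
    return sum(ord(c) - ord("a") + 1 for c in word.lower())


def is_dollar_word(word):
    # Assumption: only purely alphabetic ASCII words are considered.
    return word.isascii() and word.isalpha() and word_value(word) == 100


def dollar_words(words):
    """Return the distinct dollar words, lowercased, in first-seen order."""
    seen = set()
    result = []
    for w in words:
        w = w.strip().lower()
        if w not in seen and is_dollar_word(w):
            seen.add(w)
            result.append(w)
    return result
```

With a vocabulary file of one word per line, something like dollar_words(open(path)) would do, where path is whatever list you downloaded.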


Part II. Part-of-Speech Stats from Corpus. (easy)

Each sentence in this corpus is a sequence of word/tag pairs:

  The/DT man/NN saw/VBD a/DT boat/NN ./.

Find all part-of-speech (POS) tags for each word, and
1) sort all words by frequency (from most frequent to least)
2) for each word, sort its POS tags by frequency within the word (also from the most frequent)

Read from standard input. Print to standard output:

the	12390	DT 12300	NNP 90
...
man	1000	NN 900	VB 100
...

(This means the word "man" occurred 1000 times: 900 times as a noun and 100 times as a verb.)

Note: counting is case-insensitive; ties are broken lexicographically.
Use a <TAB> between the word and its frequency, between the frequency and the list of POS tags,
and between POS tag/frequency pairs, but a space between each tag and its frequency:

man<TAB>1000<TAB>NN 900<TAB>VB 100

Corpus (39,832 sentences; ~1 million words):

	http://acl.cs.qc.edu/~lhuang/teaching/nlp/02-21.wordtagpairs
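
The counting and sorting described above can be sketched as follows. This is one possible approach, not a reference solution: the function name pos_stats is hypothetical, words are lowercased while tags are left as-is, and each token is split on its last slash on the assumption that a word may contain a slash but a tag does not.

```python
from collections import Counter, defaultdict


def pos_stats(lines):
    """Return one formatted output row per word, most frequent word first."""
    word_counts = Counter()
    tag_counts = defaultdict(Counter)
    for line in lines:
        for token in line.split():
            # Split on the LAST slash (assumption: tags never contain "/").
            word, sep, tag = token.rpartition("/")
            if not sep:
                continue  # skip tokens that are not word/tag pairs
            word = word.lower()
            word_counts[word] += 1
            tag_counts[word][tag] += 1

    rows = []
    # Sort by descending frequency; break ties lexicographically.
    for word, n in sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        tags = sorted(tag_counts[word].items(), key=lambda kv: (-kv[1], kv[0]))
        fields = [word, str(n)] + ["%s %d" % (t, c) for t, c in tags]
        rows.append("\t".join(fields))
    return rows
```

To read from standard input and print to standard output, one could import sys and print each row of pos_stats(sys.stdin).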


Part III. Extend the n-ary Tree class from the slides, and write:

1. a post-order traversal method. [trivial]
2. a linearization method that prints a "linearized form" of the tree from which you can convert back to a Tree. [trivial]
3. a delinearization (static) method that converts it back to a Tree. [medium after in-class discussions]
4. redo 3 using recursion instead of a stack. [medium-hard; optional]

For example, if we run:

t = Tree(1, [Tree(2, [Tree(5), Tree(3, [Tree(4)])])])
print(t)
t.pp()
print(t.postorder())
print(t.linearize())
print(Tree.delinearize(t.linearize()))

it should print the following results:

(1 (2 (5) (3 (4))))
 1
 | 2
 | | 5
 | | 3
 | | | 4
5 4 3 2 1
1 2 5 NIL 3 4 NIL NIL NIL NIL
(1 (2 (5) (3 (4))))
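
For orientation, here is a rough sketch of questions 1-3. The Tree class below is a hypothetical stand-in for the one from the slides (the attribute names label and children are assumptions, and pp() is omitted), so treat it as one possible shape rather than the expected answer. Note that delinearize rebuilds labels as strings, which print the same as the original integers.

```python
class Tree:
    """Hypothetical stand-in for the n-ary Tree class from the slides."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children if children is not None else []

    def __str__(self):
        parts = [str(self.label)] + [str(c) for c in self.children]
        return "(" + " ".join(parts) + ")"

    def postorder(self):
        # Children first (left to right), then the node itself.
        parts = [c.postorder() for c in self.children]
        parts.append(str(self.label))
        return " ".join(parts)

    def linearize(self):
        # Pre-order labels, with "NIL" closing each subtree.
        parts = [str(self.label)]
        parts.extend(c.linearize() for c in self.children)
        parts.append("NIL")
        return " ".join(parts)

    @staticmethod
    def delinearize(s):
        stack = []  # nodes whose closing NIL has not been seen yet
        root = None
        for token in s.split():
            if token == "NIL":
                root = stack.pop()  # this subtree is complete
            else:
                node = Tree(token)
                if stack:
                    stack[-1].children.append(node)
                stack.append(node)
        return root  # the last node closed is the root
```

The stack holds the path from the root to the node currently being built, so each new label becomes a child of the node on top of the stack.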


Notes:
0. "NIL" ends a tree. 
1. question 3 involves "@staticmethod" (google it); alternatively, you can write a global function.
2. there should be exactly one space between each element in the linearization.
3. questions 1 & 2 prepare you for the majority of Quiz 0, while question 3 prepares you for its hardest question as well as HW2.
4. CS students should spend less than one hour on this exercise.
5. linguistics students might have trouble with question 4, but question 3 should be fine.
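
For the optional question 4, a recursive version might look roughly like the sketch below. It is written as a global function (delinearize_recursive is a hypothetical name) over a minimal stand-in Tree class whose attribute names are assumptions, and it assumes well-formed input; a malformed string would raise an IndexError.

```python
class Tree:
    """Minimal hypothetical stand-in for the Tree class from the slides."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children if children is not None else []

    def __str__(self):
        parts = [str(self.label)] + [str(c) for c in self.children]
        return "(" + " ".join(parts) + ")"


def delinearize_recursive(s):
    tokens = s.split()

    def parse(i):
        """Parse one subtree starting at tokens[i]; return (tree, next index)."""
        node = Tree(tokens[i])
        i += 1
        # Every token before this subtree's closing NIL starts a child subtree.
        while tokens[i] != "NIL":
            child, i = parse(i)
            node.children.append(child)
        return node, i + 1  # skip this subtree's closing NIL

    tree, _ = parse(0)
    return tree
```

Here the call stack plays the role of the explicit stack: each call to parse owns one node until its matching NIL is reached.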