Language Technology, Spring 2013

FINAL PROJECT

Important Dates ---------------------------------

Fri 4/19 or Mon 4/22: please talk to me about your topic.
Tue 4/23: proposal due (should include plans for data, eval, and baseline!).
Thu 4/25: optional feedback to students whose proposals need major revisions.
Mon 5/13: midway presentation and feedback.
Sun 5/26: final report due.

Proposal Must Include At Least -------------------

0. A clear statement of the *goal* of the project. What's the problem?
1. The linguistic relevance of this project, as well as the computational perspective. Why is the problem interesting, relevant, and reasonably hard?
2. Description of the existing methods for this problem. What has been done before, and how is it solved today?
3. Description of the *new methods* you propose to use.
4. Concrete description of the *data* that you will use, how/where to get it, and what (pre-)processing is needed to make it usable.
5. Description of the *evaluation* method that you will use. How do you know if you're successful?
6. Description of a *baseline* method, i.e. something that you can implement in *two hours* to attack the problem.

See Also: Heilmeier Catechism.

TOPIC --------------------------------------------

1. A natural topic could be based on homework assignments: either combining two homeworks or improving upon one homework (see examples below). Preferably, the topic is part of a larger research project you are working on, and/or (ideally) something fun and exciting!
2. The topic should have something to do with [statistical] processing of the *structure* of natural language: e.g., a bag-of-words approach (such as a topic model) is *not* allowed.

EXAMPLE TOPICS -----------------------------------

Here is a sample of possible topics, many of which are from past years (at USC). These are not off-limits, but remember that the instructor makes the final decision to approve each topic, and expects and will reward *originality*.
Please talk to me about your topic before writing the proposal.

* Improving the accuracy of statistical parsing, e.g., head-lexicalization, parent-annotation, etc. (based on HW3). Data: HW3 or Penn Treebank (same below for parsing).
* Improving the efficiency of statistical parsing, e.g., A* parsing, coarse-to-fine parsing, etc. (based on HW3).
* $k$-best parsing (output the top-$k$ parse trees for a given sentence) (based on HW3).
* Dependency parsing, either PCFG or discriminative (variant of HW3).
* HMM word alignment (extensions of HW4 and HW2). Data: Canadian Hansards, UN or EU Proceedings.
* English respacing (word segmentation) (variant of HW1, combined with techniques in HW4). You have to compare supervised with unsupervised approaches. Data: any sizable English text.
* Chinese or Japanese word segmentation, either supervised or unsupervised (variant of HW2, combined with techniques in HW4). Data: Penn Chinese Treebank.
* Back-transliteration from Japanese Katakana, Mandarin Chinese, or Cantonese (combining techniques in HW2 and HW4), either supervised (if you can find the annotated data) or unsupervised (recommended). Data: list of Japanese/Chinese transliterations of foreign names; CMU pronunciation dictionary; Chinese pronunciation dictionary. (Note: you just need to work on one Asian language.)
* Translate Korean pronunciations of Chinese words into Japanese pronunciations. Data: collected manually from newspapers.
* Automatically correct misheard song lyrics. Data: www.kissthisguy.com.
* Unsupervised part-of-speech tagging (extensions of HW2 & HW4). Data: Penn Treebank.
* Learn phoneme changes across a pair of related languages (e.g. Uzbek and Turkish). Data: cognate pairs extracted from dictionaries.
* Mad Gab generation (language game). Also: Shannon Game, Hangman. Data: CMU pronunciation lexicon.
* Transliteration of Greek from the Greek alphabet to the Latin alphabet. Data: 5000 Greek words in Latin script taken from discussion forums.
* Translate between ancient Greek (morphologically rich, free word-order) and English. Data: Perseus Project, 7 million words.
* Convert natural language to image schemas. Data: 2129 preposition labels and 200 NL descriptions for 89 scenes.
* Translate passages from Dante's Divine Comedy from Italian into English, maintaining verse. Data: original text of Divine Comedy, plus CMU pronunciation lexicon.
* YOUR ORIGINAL IDEA
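
To give a sense of what a "two-hour" baseline (proposal item 6) could look like, here is a minimal sketch for the English respacing topic. It uses greedy longest-match segmentation against a word list. This is only one possible baseline, not a prescribed method: the function name `max_match`, the `max_len` cutoff, and the toy vocabulary are assumptions made up for illustration, and a real project would load its vocabulary from training text.

```python
def max_match(text, vocab, max_len=20):
    """Greedy left-to-right longest-match segmentation (hypothetical baseline)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; if no vocabulary word matches,
        # fall back to emitting a single character (the j == i + 1 case).
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


# Toy vocabulary, assumed for illustration only.
vocab = {"the", "cat", "sat", "on", "mat"}
print(" ".join(max_match("thecatsatonthemat", vocab)))
```

A baseline like this gives you a number to beat: score its output against the original spaced text using your evaluation method from item 5, then report how much your proposed method improves on it.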