COMPUTER SYSTEMS RESEARCH
Computational Linguistics Project ideas
Fall/Spring 2007 - 2008
- Existing learning/teaching materials and references
- NLTK (nltk.sourceforge.net). Good source of code and project
ideas, and it's also got a very nice collection of pre-processed
corpus materials, including a sampler of some of the LDC's greatest
hits. See especially:
- Nitin Madnani, Getting Started on Natural Language Processing with
Python, ACM Crossroads Xrds13-4,
http://www.acm.org/crossroads/xrds13-4/natural_language.html .
- Electronic Grammar modules (used with high school students):
writing programs to solve practical problems with words, texts
and grammar. http://nltk.org/index.php/Electronic_Grammar.
- The NLTK book, http://nltk.org/index.php/Book, which includes over
200 graded exercises along with introductions to programming and
NLP, some of which should be accessible to high school students.
- The Computational Linguistics Olympiad
- CSLU Toolkit, http://cslu.cse.ogi.edu/toolkit/ . A comprehensive
suite of tools to enable exploration, learning, and research into
speech and human-computer interaction.
- Ciezielska-Ciupek, M. 2001. Teaching with the internet and corpus
materials: Preparation of the ELT materials using the internet and
corpus resources. In Lewandowska-Tomaszczyk, B. (ed) PALC 2001:
Practical Applications in Language Corpora. Lodz Studies in
Language, 7. Frankfurt: Peter Lang, p.521-531.
- Sun, Y-C. & Wang, L-Y. 2003. Concordancers in the EFL classroom:
Cognitive approaches and collocation difficulty. CALL, 16/1,
p. 83-94.
- Using corpora in L1: Paul Thompson at the University of Reading has
worked with primary school children; Julia Blake & Tim Shortis in
secondary schools (cf. their paper at BAAL 2007).
- Machine translation
- Implementing IBM Model 1
- Building a complete end-to-end statistical machine translation
system, e.g. using MOSES (http://www.statmt.org/wmt07/baseline.html)
- Supervised learning (e.g. using a Naive Bayes classifier)
- Word sense disambiguation
- Spam filtering (e.g. using spam message databases)
- Document classification (e.g. using the 20 Newsgroups corpus)
- Unsupervised techniques
- Implementing language models using the SRI LM toolkit
- Writing a bigram part-of-speech tagger, including Baum-Welch
training and Viterbi search.
- Studying, critiquing and building a mini document ranking system
based on PageRank.
- Odd one out: use simple similarity measures to pick the odd one out
from a given set of words. E.g., in (Honda, Toyota, Sony, BMW,
Mercedes), Sony is the odd word (not a car company). Or, in (India,
China, Japan, Romania, Korea), Romania is the odd one (not an Asian
country). The programming logic could be as simple as extracting
features for each word and then selecting a word as "odd" if, after
removing it from the set, the remaining members share the maximum
number of features. Or, something more sophisticated: use a cosine
similarity measure and pick the word with the lowest cosine
similarity to the rest of the group as the odd one.
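Both odd-one-out strategies described above can be sketched in a few
lines of Python. This is only an illustration: the FEATURES table is a
hypothetical, hand-made stand-in for features a real project would
extract from a corpus, and averaging the cosine over the other words is
one possible reading of "least cosine with the rest of the group".

```python
import math

# Hypothetical hand-made feature sets (an assumption for illustration);
# a real project would derive these from corpus data.
FEATURES = {
    "Honda": {"company", "japanese", "cars", "motorcycles"},
    "Toyota": {"company", "japanese", "cars"},
    "Sony": {"company", "japanese", "electronics"},
    "BMW": {"company", "german", "cars", "motorcycles"},
    "Mercedes": {"company", "german", "cars"},
}


def odd_by_shared_features(words, features):
    """Pick the word whose removal leaves the most features shared by all."""
    def shared_without(word):
        rest = [features[w] for w in words if w != word]
        return len(set.intersection(*rest))
    return max(words, key=shared_without)


def cosine(a, b):
    """Cosine similarity between two binary feature sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def odd_by_cosine(words, features):
    """Pick the word with the lowest average cosine to the other words."""
    def avg_sim(word):
        others = [w for w in words if w != word]
        return sum(cosine(features[word], features[w]) for w in others) / len(others)
    return min(words, key=avg_sim)


if __name__ == "__main__":
    words = list(FEATURES)
    print(odd_by_shared_features(words, FEATURES))
    print(odd_by_cosine(words, FEATURES))
```

With this toy table both functions select Sony; with noisier, corpus
derived features the two strategies may well disagree, which is itself
worth investigating.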
- Corpus and grammar building/exploration
- Investigating some linguistic, sociolinguistic or stylistic aspect
of the student's choice in blogs or constructing a Web corpus.
[Reading LanguageLog, www.languagelog.org, would probably be a great
start! -PSR]
- Building a small Web corpus and then doing collocation extraction or
text classification. E.g. how do sports reports differ from music
reviews, or tabloid journalism from broadsheet journalism, or
Democrat authors from Republicans, or what do female bloggers write
about more frequently than male bloggers?
[An exercise I wrote, at
http://www.umiacs.umd.edu/~resnik/nlstat_tutorial_summer1998/Lab_ngrams.html ,
might be useful here. -PSR]
- Generating simple English sentences using a simple
substitution-based grammar. E.g. start by generating from a grammar like
"(the|a(n)) (big|little|smelly|argumentative) (cat|dog|teacher)
(ate|played with|jumped over|kicked|knew|typed on) (the|a(n))
(lazy|silly|old|fluffy|dusty|horrible) (white|fat|....)
(fox|school|telephone|keyboard)", and then represent some
constraints as a filter over random replacements (i.e. if a random
replacement creates a violation of a constraint, make a new random
replacement). For example, foxes aren't dusty, schools aren't lazy
and can't be eaten, keyboards can't be known, etc.
- Evaluating either the grammar checker or the readability statistics
that MS Word provides; then trying to design improvements, either as
a specification for a better piece of software, or as a real program
which does some things automatically that MS Word can't do.
- Spidering parallel texts that are published daily by the
EU, and then exploring translations.
- Writing a KWIC concordancer in Python, to get students used to
manipulating lots of text.
- Using the Sketch Engine and associated corpora
(http://www.sketchengine.co.uk/ ), e.g. to compare and contrast
behaviour of "clever" vs. "intelligent" or "strong" vs. "powerful".
- Using http://corpus.byu.edu/ (formerly view.byu.edu) to do similar
sorts of lexical explorations on material from the British National
Corpus or Time Magazine corpus.
- Using the Linguist's Search Engine (lse.umiacs.umd.edu) to explore
Web data by searching for syntactic structures.
- Writing or extending a grammar and evaluating its coverage
- Surveying different approaches to parsing and writing a simple
definite clause grammar
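For the sentence-generation idea earlier in this section (random
substitution followed by a constraint filter), a minimal Python sketch
might look like the following. Assumptions: the slot lists copy only
the words named in the example grammar (the elided "white|fat|...."
slot uses just the two words given), the FORBIDDEN pairs encode the
four example constraints, and a violation triggers a redraw of the
whole sentence rather than of a single slot, which is a simplification.

```python
import random

# One list per slot of the example grammar; "a" stands for "a(n)".
SLOTS = [
    ["the", "a"],
    ["big", "little", "smelly", "argumentative"],
    ["cat", "dog", "teacher"],
    ["ate", "played with", "jumped over", "kicked", "knew", "typed on"],
    ["the", "a"],
    ["lazy", "silly", "old", "fluffy", "dusty", "horrible"],
    ["white", "fat"],
    ["fox", "school", "telephone", "keyboard"],
]

# Word pairs that may not co-occur (hypothetical encoding of the
# constraints: foxes aren't dusty, schools aren't lazy and can't be
# eaten, keyboards can't be known).
FORBIDDEN = {
    ("dusty", "fox"),
    ("lazy", "school"),
    ("ate", "school"),
    ("knew", "keyboard"),
}


def violates(choice):
    """True if the chosen words contain any forbidden pair."""
    return any(a in choice and b in choice for a, b in FORBIDDEN)


def fix_articles(words):
    """Turn "a" into "an" before a word starting with a vowel letter."""
    fixed = []
    for i, word in enumerate(words):
        if word == "a" and i + 1 < len(words) and words[i + 1][0] in "aeiou":
            word = "an"
        fixed.append(word)
    return fixed


def generate(rng=random):
    """Draw random sentences until one passes the constraint filter."""
    while True:
        choice = [rng.choice(slot) for slot in SLOTS]
        if not violates(choice):
            return " ".join(fix_articles(choice))


if __name__ == "__main__":
    for _ in range(5):
        print(generate())
```

Representing constraints as data rather than code makes it easy for
students to add new ones and observe how the output changes.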
- Other
- Code-breaker exercise: given a text message, such as "meet me in the
park at 10", write a program that converts it into a cryptic code
message and a decoder that retrieves the original message. For
example, one idea is an odd-even scheme: split the message into two
halves and interleave them, so that the first half lands on the odd
positions and the second half on the even positions. This would
generate the code message "MEE_EPTA_RMKE__AITN__1T0H" (spaces shown
as underscores). To decipher this code, just read all the odd
characters and then all the even characters (treating spaces as
regular characters). Alternatives include block codes, character
substitution, etc.
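A short Python sketch that reproduces the example code message above.
It assumes the reading under which that string round-trips: the
encoder interleaves the first and second halves of the message, and
the decoder reads the odd positions followed by the even positions.
The function names are illustrative only.

```python
def encode(message):
    """Interleave the first half of the message with the second half."""
    half = (len(message) + 1) // 2  # first half gets the extra character
    first, second = message[:half], message[half:]
    out = []
    for i in range(half):
        out.append(first[i])
        if i < len(second):
            out.append(second[i])
    return "".join(out)


def decode(code):
    """Read the odd positions (1st, 3rd, ...) and then the even ones."""
    return code[0::2] + code[1::2]


if __name__ == "__main__":
    # Spaces are written as underscores, as in the example.
    plain = "MEET_ME_IN_THE_PARK_AT_10"
    secret = encode(plain)
    print(secret)
    print(decode(secret))
```

Swapping the two functions gives the complementary scheme, in which the
odd characters of the message are written out first and the decoder
does the interleaving.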
- Other corpus suggestions
- Project Gutenberg
- Reuters RCV1 news corpus
- Enron e-mail corpus
- Wikipedia (downloadable as an XML file)
- Europarl parallel translations (http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl/)
- Parallel Bibles and Web page translations (http://www.umiacs.umd.edu/~resnik/parallel/)