NLTK Project Ideas
This page describes a variety of possible natural language processing projects that can be undertaken using NLTK. Several past projects are now a core part of NLTK. Please add other project ideas.
As far as possible, code that is developed in these projects should build on existing NLTK modules, especially the interface classes and APIs. In general, you should ensure that you complement your code with appropriate testing data. You are strongly encouraged to write a doctest document which will both explain the functionality of your code to users and also provide unit tests, in the style adopted in the NLTK HOWTOs (source).
The distinction between Computationally Oriented and Linguistically Oriented Projects is not hard and fast, since every project should have some mix of both aspects.
Computationally Oriented Projects
- Re-implement any NLTK functionality for a language other than English (tokenizer, tagger, chunker, parser, etc). You will probably need to collect suitable corpora, and develop corpus readers.
- Develop an interface between NLTK and the Xerox FST toolkit, using new Python-xfst bindings available from Xerox (contact Steven Bird for details).
- Build a compiler for finite state transducers (cf Xerox's xfst or Gertjan van Noord's FSA Utilities).
- Develop a maximum-entropy POS tagger for NLTK, e.g. http://www.inf.ed.ac.uk/resources/nlp/local_doc/MXPOST.html, http://nlp.stanford.edu/software/tagger.shtml
- Test and extend NLTK's MapReduce functionality for Hadoop, available in nltk_contrib.hadoop
- Develop a chunker that uses transformation-based learning, adapting NLTK's Brill Tagger to chunk tags (see Ramshaw 1995).
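Following Ramshaw and Marcus, chunking can be recast as a tagging task by giving each token an IOB tag (B-NP, I-NP, O); once the data is in that form, a sequence tagger such as NLTK's Brill tagger can be trained on it directly. A minimal sketch of the encoding step, where `tree_to_iob` is a hypothetical helper (not part of NLTK's API):

```python
def tree_to_iob(chunks):
    """Flatten a list of (phrase_type, [words]) chunks into (word, tag)
    pairs using IOB encoding. Tokens outside any chunk are passed in as
    ('O', [words]) groups."""
    iob = []
    for label, words in chunks:
        for i, word in enumerate(words):
            if label == 'O':
                iob.append((word, 'O'))
            elif i == 0:
                iob.append((word, 'B-' + label))   # first token of a chunk
            else:
                iob.append((word, 'I-' + label))   # continuation token
    return iob

sent = [('NP', ['the', 'little', 'dog']), ('O', ['barked']),
        ('NP', ['the', 'cat'])]
print(tree_to_iob(sent))
```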
- Create a module for lexical distributional similarity, cf http://www.aclweb.org/anthology/P06-1046.pdf
- Bootstrap lexical relations, cf http://www.stanford.edu/%7Ejurafsky/paper887.pdf
- Develop a lexical-chain based WSD system, using the similarity measures defined on WordNet, and evaluate it using the SEMCOR corpus (corpus reader provided in NLTK).
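A sketch of the greedy chain-building step of such a system. In the real project, the `related` test would be a WordNet similarity measure (e.g. path similarity over candidate synsets via NLTK's WordNet interface); here a toy relatedness table stands in so the control flow is visible:

```python
# Toy stand-in for a WordNet-based relatedness test.
TOY_RELATED = {
    frozenset(['car', 'vehicle']), frozenset(['vehicle', 'truck']),
    frozenset(['banana', 'fruit']),
}

def related(w1, w2, table=TOY_RELATED):
    return frozenset([w1, w2]) in table

def build_chains(words):
    """Greedily attach each word to the first chain containing a related
    word; otherwise start a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, c) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

print(build_chains(['car', 'banana', 'vehicle', 'truck', 'fruit']))
```

Disambiguation then picks, for each ambiguous word, the sense whose chain is strongest.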
- Add graph visualization functionality to NLTK's dependency parser.
- Build and train a statistical Named Entity Recognizer for MUC-type entities (e.g., person, location, organisation, cardinal, duration, measure, date), to be included with NLTK's data distribution.
- Implement a chatbot that incorporates a more sophisticated dialogue model than nltk.chat.eliza.
- Build a chatbot that adapts machine-translation technology to map from an input utterance (source language) to an appropriate response (target language); for example, by using word alignments, "translation" probabilities and language models.
- Build an extensible state-based dialogue manager.
- Implement a categorial grammar parser, including semantic representations, cf nltk_contrib.lambek.
- Develop a prepositional phrase attachment classifier, using the ppattach corpus for training and testing.
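A sketch of the feature-extraction side, assuming the standard (verb, noun1, preposition, noun2) quadruples that the ppattach corpus provides; the resulting feature dictionaries could then be fed to a classifier such as nltk.NaiveBayesClassifier.train together with the gold attachment labels ('V' or 'N'):

```python
def pp_features(verb, noun1, prep, noun2):
    """Lexical features for PP attachment: the bare quadruple plus two
    conjoined features that capture verb/preposition and
    noun/preposition affinities."""
    return {'verb': verb, 'noun1': noun1, 'prep': prep, 'noun2': noun2,
            'verb_prep': verb + '+' + prep,
            'noun1_prep': noun1 + '+' + prep}

feats = pp_features('ate', 'pizza', 'with', 'anchovies')
print(feats)
```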
- Using the VerbOcean data (http://semantics.isi.edu/ocean/verbocean.unrefined.2004-05-20.txt.gz), which captures semantic relationships between verbs, generate a semantic network of verb relationships, and implement a tree traversal algorithm that can calculate the similarity between two verbs, e.g. "fly" and "crash". A demo of such a system is available at http://falcon.isi.edu/cgi-bin/graph-analysis/view-graph.pl
- News stories from different sources often contain contradictory information regarding a particular event, such as the number of people killed in an earthquake. Build a numerical expression recogniser and resolver that can identify equality and contradiction between numerical expressions such as: "5 adults" != "3 children and 2 adults", but "5 people" = "3 children and 2 adults".
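A toy resolver for exactly this example. It normalizes each expression to per-category counts, and treats a general term as compatible with a mix of its subcategories; the inline `HYPERNYMS` table is a stand-in for a real taxonomy such as WordNet:

```python
import re

# Stand-in taxonomy: in a real system this would come from WordNet.
HYPERNYMS = {'children': 'people', 'adults': 'people'}

def parse(expr):
    """'3 children and 2 adults' -> {'children': 3, 'adults': 2}"""
    return {w: int(n) for n, w in re.findall(r'(\d+)\s+(\w+)', expr)}

def generalize(counts):
    """Lift counts to the hypernym level: {'children': 3, 'adults': 2}
    becomes {'people': 5}."""
    g = {}
    for w, n in counts.items():
        key = HYPERNYMS.get(w, w)
        g[key] = g.get(key, 0) + n
    return g

def consistent(e1, e2):
    c1, c2 = parse(e1), parse(e2)
    if c1 == c2:
        return True
    # Only compare at the general level if one side used general terms.
    one_side_general = (all(w not in HYPERNYMS for w in c1) or
                        all(w not in HYPERNYMS for w in c2))
    return one_side_general and generalize(c1) == generalize(c2)

print(consistent('5 people', '3 children and 2 adults'))
print(consistent('5 adults', '3 children and 2 adults'))
```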
- Develop a system for encoding lexicons that can be incorporated into existing NLTK code for parsing feature-based grammars (cf the treatment of lexicon files in PC-PATR). Ideally this should include readers/writers for a number of existing lexical formats, as well as support for creating new lexicons.
- Build a GUI-based grammar development environment that will help users identify and fix bugs in their grammars.
- Write a program to generate referring expressions: assume a collection of entities having attributes for shape, color, size, etc, then generate a noun phrase that mentions enough attributes in order to uniquely identify the intended entity (e.g. "the small green book")
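One possible starting point is Dale and Reiter's incremental algorithm: walk through attributes in a fixed preference order, keeping any attribute value that rules out at least one remaining distractor, until only the target entity is left. A sketch, with a made-up entity representation:

```python
def describe(target, entities, attrs=('size', 'color')):
    """Generate a distinguishing noun phrase for target among entities.
    Each entity is a dict with a 'type' (head noun) plus attributes."""
    distractors = [e for e in entities if e is not target]
    words = []
    for attr in attrs:
        if not distractors:
            break
        value = target[attr]
        # Keep this attribute only if it rules out some distractor.
        if any(d[attr] != value for d in distractors):
            words.append(value)
            distractors = [d for d in distractors if d[attr] == value]
    return 'the ' + ' '.join(words + [target['type']])

books = [
    {'type': 'book', 'size': 'small', 'color': 'green'},
    {'type': 'book', 'size': 'small', 'color': 'red'},
    {'type': 'book', 'size': 'large', 'color': 'green'},
]
print(describe(books[0], books))   # -> "the small green book"
```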
- Implement the TextTiling algorithm for segmenting text, cf http://people.ischool.berkeley.edu/~hearst/research/tiling.html
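A bare-bones sketch of TextTiling's block-comparison step: slide over the gaps between sentences, compare the word distributions of the k sentences on either side with cosine similarity, and propose topic boundaries at similarity minima. Hearst's full algorithm adds token normalization, smoothing, and depth scoring on top of this:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words (Counters)."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    denom = (math.sqrt(sum(v * v for v in a.values())) *
             math.sqrt(sum(v * v for v in b.values())))
    return num / denom if denom else 0.0

def gap_scores(sentences, k=2):
    """Similarity score at each inter-sentence gap; low scores suggest
    topic boundaries."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - k):gap], Counter())
        right = sum(bags[gap:gap + k], Counter())
        scores.append(cosine(left, right))
    return scores

sents = ["the cat sat", "the cat slept",
         "stocks fell sharply", "stocks rose again"]
scores = gap_scores(sents)
print(scores)   # lowest score at the gap between the two topics
```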
- Develop a full-featured concordance application, cf AntConc
- Develop a command-line interface allowing POS-tagged corpora to be searched using a combination of words, POS tags, and regular expressions over these (issue 51)
- Implement an interface to the Google or Yahoo web search API, allowing access to hit counts and possibly contextual snippets, for a given search term (issue 259)
- Implement a method for space-efficient syntax tree layout (issue 114)
- Port or wrap the Morpha lemmatizer (ftp://ftp.informatics.susx.ac.uk/pub/users/johnca/morph.tar.gz)
- Generate possible words of a language given a phoneme/grapheme inventory, syllable structure, phonotactic constraints, word structure, etc (issue 481)
- Create an interface to the Curran and Clark CCG parser.
- Create interconnectivity between NLTK and UIMA pipelines, in order to combine available modules and analysis engines (issue 522)
- Implement LDA-style topic models (issue 521)
Linguistically Oriented Projects
- Develop a morphological analyser for a language of your choice.
- Implement a rule-based language-independent syllabifier, e.g. http://homepages.inf.ed.ac.uk/sgwater/papers/conll05.pdf (mentioned in J&M p372)
- Develop a non-trivial grammar fragment that can be parsed with nltk.parse.featurechart.
- Implement a TGrep2 interpreter for querying treebanks, cf http://tedlab.mit.edu/%7Edr/TGrep2/
- Build a shallow discourse parser, which takes chunked or parsed sentences as input and yields a discourse structure as output, cf. SPADE, the Penn Discourse Treebank (PDTB), Prasad et al., Dinesh et al.
- Write a soundex function that is appropriate for a language you are interested in. If the language has clusters (consonants or vowels), consider how reliably people can discriminate the second and subsequent member of a cluster. If these are highly confusable, ignore them in the signature. If the *order* of segments in a cluster leads to confusion, normalize this in the signature (e.g. sort each cluster alphabetically, so that a word like treatments would be normalized to rtaemtenst, before the code is computed).
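For reference, the classic English Soundex (simplified: h/w are treated like vowels) makes a natural baseline to adapt; the project is then to redesign the code table and the cluster handling for the target language:

```python
# Standard English Soundex code table: similar-sounding consonants
# share a digit; vowels, h, w, y carry no code.
CODES = {}
for chars, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                     ('l', '4'), ('mn', '5'), ('r', '6')]:
    for ch in chars:
        CODES[ch] = digit

def soundex(word):
    """Letter + up to three digits; adjacent letters with the same code
    collapse to one digit, and uncoded letters break such runs."""
    word = word.lower()
    result = word[0].upper()
    prev = CODES.get(word[0], '')
    for ch in word[1:]:
        code = CODES.get(ch, '')
        if code and code != prev:
            result += code
        prev = code
    return (result + '000')[:4]

print(soundex('Robert'), soundex('Rupert'))   # both R163
```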
- Develop a text classification system which efficiently classifies documents in two or three closely related languages. Consider the discriminating features between languages despite their apparent similarity. Implementation should be evaluated using unseen data.
- Explore the phonotactic system of a language you are interested in. Compare your findings to a published phonological or grammatical description of the same language.
- Implement a structured text rendering module which takes linguistic data from a source such as Shoebox and generates XML based lexicon or interlinear text based on user preferences for field exports.
- Develop a grammatical paradigm generation function which takes some form of tagged text as input and generates paradigm representations of related linguistic features.
- Develop an automatic essay assessment tool, cf http://www.pearsonkt.com/prodIEA.shtml
- Build character n-gram models for different languages using the UDHR corpus (included with NLTK), and use these to generate hypothetical proper names in these languages (cf. Pywordgen)
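A sketch of the bigram case. In the real project the model would be trained on UDHR text per language; here a tiny inline word list stands in for the corpus, and `^`/`$` mark word boundaries:

```python
import random
from collections import defaultdict

def train_bigrams(words):
    """Map each character to the list of characters observed after it
    (duplicates retained, so sampling follows the observed frequencies)."""
    model = defaultdict(list)
    for w in words:
        chars = '^' + w.lower() + '$'
        for a, b in zip(chars, chars[1:]):
            model[a].append(b)
    return model

def generate(model, rng, maxlen=10):
    """Sample characters from the model until the end marker or maxlen."""
    ch, out = '^', ''
    while len(out) < maxlen:
        ch = rng.choice(model[ch])
        if ch == '$':
            break
        out += ch
    return out.capitalize()

model = train_bigrams(['maria', 'marco', 'mario', 'carla'])
print(generate(model, random.Random(0)))
```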
- Develop a program for unsupervised learning of phonological rules, using the method described by Goldwater and Johnson, cf http://www.aclweb.org/anthology/W04-0105.pdf
- Use WordNet to infer lexical semantic relationships on the entries of a Shoebox lexicon for some arbitrary language.
- Develop support for competitive grammar writing, cf http://aclweb.org/anthology/W08-0212.pdf
- Implement Sproat et al's work on normalization of non-standard words, cf http://scholar.google.com/scholar?cluster=12806876023589028341 (issue 146)
- Write a wrapper for CRFSuite.