University of Arizona, Department of Computer Science

CSc 120: n-Grams

Background

The notion of n-grams (where n is a number, e.g., for n = 3, we have 3-grams, for n = 4, we have 4-grams, etc.) is a simple idea that has many applications, including computational linguistics, speech recognition, machine translation, plagiarism detection, genetic sequence analysis, and so on (see this Wikipedia page). This program involves writing a program to compute n-grams and count the number of times each n-gram is encountered.

An illustrative example explaining the notion of n-grams, together with an algorithm for computing n-grams, are given here.

Expected Behavior

Write a program, in a file ngrams.py, that behaves as follows:

Use input() (without any argument) to read the name of an input file. Read the contents of this file into a list of words, splitting at whitespace.
For each word in this list of words, strip away any punctuation at the beginning or end of the word (see below). Discard any empty strings that may result from stripping out of punctuation characters.
Use input() (without any argument) to read in an integer value n.
Construct n-grams as described here and count the number of occurrences of each distinct n-gram. These counts should be maintained in a case-insensitive way, i.e., "hello", "HELLO", and "hElLo" should all be considered to be the same word.
Find the maximum number of times M any n-gram occurrs. Print out all n-grams that occur M times, one per line, as described under Output below. Some examples are given here.

Before computing n-grams, the list of words obtained from the input files should be "cleaned". This can be done as follows:

Remove any punctuation characters from either end of each word (punctuation characters that are not at the end of a word need not be removed. For example, the list of words
['"Uh-oh!"', 'she', 'thought.', '"Trouble!"']
would give the following list after cleaning:
['Uh-oh', 'she', 'thought', 'Trouble']
Discard any empty strings, which can result from "cleaning" strings consisting entirely of punctuation characters.

Input

The input will be a file consisting of ordinary text and punctuation. An example is given here.

Output

The output of your program should be a sequence of those n-grams that occur the largest number of times in the input file. N-grams should be printed out one per line according to the following format:

"{:d} -- {}".format(M, ngram_string)

where:

M is the maximum number of times any n-gram occurrs, and
ngram_string is a string onbained by concatenating together the words in the n-gram, with a single blank letter separating each pair of adjacent words, and no additional blank characters at the beginning or end of the string.

Some examples are given here.

Assertions

Your program should use assert statements to check the following (however, see below for replacements for asserts in situations where asserts are difficult to state).

For each method, any pre-conditions on the arguments for that method. The assert should be placed at or very soon after the beginning of the method.
For each loop that either (i) computes a value; or (ii) transforms some data, at least one assert that holds in the body of the loop. You can choose what the invariant is, but it must be something that:
- reflects the computation of the loop; and
- is not simply a statement of the iteration condition (or some expression whose value follows directly from the iteration condition).

Asserts are not necessary for loops that neither compute values nor transform data, e.g., loops that simply read in data or print out results.

This level of assertion-checking is almost certainly overkill, but we'll do this for a little while in order to get more experience with pre-conditions and loop invariants and to practise working with assert statements.

Try to make your asserts as precise and specific as you can.

Replacements for asserts

In some situations, it may be difficult or impossible to write an assert that captures what you want to capture. In such situations, in place of an assert you can write a comment giving the invariant or assumption you want to state. Such a comment should be written as follows:

### INVARIANT: ...your invariant in English and/or Python

### ASSUMPTION: ...your assumption in English and/or Python

Programming Requirements

Your program should implement (at least) the following classes along with the methods specified.

class Input

An object of this class represents information about the input data. This class should implement (at least) the following methods:

__init__(self) : uses input() to read in the name of an input file, opens the file, and initializes an attribute of the object with the file object returned by open().

wordlist(self) : reads in the contents of the input file, splits it into a list of words, removes any punctuation at either end of the word (punctuation characters inside the word should not be removed), discards any empty strings that may result, and returns the resulting list.

class Ngrams

An object of this class represents n-gram information about the input list of words. This class should implement (at least) the following methods:

__init__(self) : uses input() to read in the value of n for computing n-grams and initializes an attribute of the object with this value. It also initializes the data structure that will be used to keep track of the number of times each n-gram occurs in the input.

update(self, ngram) : updates the count for the n-gram ngram.

process_wordlist(self, wordlist) : processes the list of words wordlist read from the input file to compute the number of times each n-gram occurs in wordlist.

print_max_ngrams(self) : prints out the n-grams that have the highest number of occurrences according to the output format specified above (see Output).

Examples

Several examples of this data analysis are given here.

Testing

You may want to consider some of the following for black-box testing purposes:

empty input file;
input file has fewer words than the value of n specied for n-grams
n = 0
input words on just a single line
input words one per line on multiple lines
n-grams where some of the words have punctuation on either side
n-grams where some of the words have upper/lower-case differences

In addition, you should examine your code to identify inputs for white-box testing.