CSc 120: n-Grams
Background
The notion of n-grams (where n is a number, e.g., for n = 3, we have 3-grams,
for n = 4, we have 4-grams, etc.) is a simple idea that has many applications,
including computational linguistics, speech recognition, machine translation, plagiarism detection,
genetic sequence analysis, and so on
(see this Wikipedia page).
This program involves writing a program to compute n-grams and count the number of times
each n-gram is encountered.
An illustrative example explaining the notion of n-grams, together with an algorithm for computing
n-grams, are given here.
Expected Behavior
Write a program, in a file ngrams.py, that behaves as follows:
-
Use input() (without any argument) to read the name of an input file.
Read the contents of this file into a list of words, splitting at whitespace.
-
For each word in this list of words, strip away any punctuation at the beginning or end
of the word (see below). Discard any empty strings that may result from stripping out
of punctuation characters.
-
Use input() (without any argument) to read in an integer value n.
-
Construct n-grams as described here and
count the number of occurrences of each distinct n-gram. These counts should be
maintained in a case-insensitive way, i.e., "hello",
"HELLO", and "hElLo" should all be considered to
be the same word.
-
Find the maximum number of times M any n-gram occurrs. Print out all n-grams
that occur M times, one per line, as described under Output below.
Some examples are given here.
Before computing n-grams, the list of words obtained from the input files should be
"cleaned". This can be done as follows:
-
Remove any punctuation characters from either end of each word (punctuation characters
that are not at the end of a word need not be removed.
For example, the list of words
['"Uh-oh!"', 'she', 'thought.', '"Trouble!"']
would give the following list after cleaning:
['Uh-oh', 'she', 'thought', 'Trouble']
-
Discard any empty strings, which can result from "cleaning" strings consisting
entirely of punctuation characters.
Input
The input will be a file consisting of ordinary text and punctuation. An example
is given here.
Output
The output of your program should be a sequence of those n-grams that occur the
largest number of times in the input file.
N-grams should be printed out one per line according to the following format:
"{:d} -- {}".format(M, ngram_string)
where:
-
M is the maximum number of times any n-gram occurrs, and
-
ngram_string is a string onbained by concatenating together the
words in the n-gram, with a single blank letter separating each pair of adjacent
words, and no additional blank characters at the beginning or end of the string.
Some examples are given here.
Assertions
Your program should use assert statements to check the following
(however, see below for replacements for asserts in situations where asserts are
difficult to state).
-
For each method, any pre-conditions on the arguments for that method. The
assert should be placed at or very soon after the beginning of
the method.
-
For each loop that either (i) computes a value; or (ii) transforms
some data, at least one assert that holds in the body of the
loop. You can choose what the invariant is, but it must be something that:
-
reflects the computation of the loop; and
-
is not simply a statement of the iteration condition (or some expression
whose value follows directly from the iteration condition).
Asserts are not necessary for loops that neither compute values nor transform data,
e.g., loops that simply read in data or print out results.
This level of assertion-checking is almost certainly overkill, but we'll do this for a little
while in order to get more experience with pre-conditions and loop invariants and to
practise working with assert statements.
Try to make your asserts as precise and specific as you can.
Replacements for asserts
In some situations, it may be difficult or impossible to write an assert that
captures what you want to capture. In such situations, in place of an assert
you can write a comment giving the invariant or assumption you want to state.
Such a comment should
be written as follows:
### INVARIANT: ...your invariant in English and/or Python
or
### ASSUMPTION: ...your assumption in English and/or Python
Programming Requirements
Your program should implement (at least) the following classes along with the methods specified.
-
class Input
-
An object of this class represents information about the input data. This class should
implement (at least) the following methods:
-
__init__(self) : uses input() to read in the name
of an input file, opens the file, and initializes an attribute of the object
with the file object returned by open().
-
wordlist(self) : reads in the contents of the input file, splits it
into a list of words, removes any punctuation at either end of the word (punctuation
characters inside the word should not be removed), discards any empty strings that
may result, and returns the resulting list.
-
class Ngrams
-
An object of this class represents n-gram information about the input list of words.
This class should implement (at least) the following methods:
-
__init__(self) : uses input() to read in the value
of n for computing n-grams and initializes an attribute of the object with
this value. It also initializes the data structure that will be used to keep track
of the number of times each n-gram occurs in the input.
-
update(self, ngram) : updates the count for the n-gram
ngram.
-
process_wordlist(self, wordlist) : processes the list of words
wordlist read from the input file to compute the number of times
each n-gram occurs in wordlist.
-
print_max_ngrams(self) : prints out the n-grams that have the
highest number of occurrences according to the output format specified above
(see Output).
Examples
Several examples of this data analysis are given
here.
Testing
You may want to consider some of the following for black-box testing purposes:
- empty input file;
- input file has fewer words than the value of n specied for n-grams
- n = 0
- input words on just a single line
- input words one per line on multiple lines
- n-grams where some of the words have punctuation on either side
- n-grams where some of the words have upper/lower-case differences
In addition, you should examine your code to identify inputs for white-box testing.