CSc 120: n-Grams: Example and Algorithm
Identifying n-grams: an example
Suppose we have a list of words [aa bb cc aa bb cc], and suppose that the n for
computing n-grams is 3. Each sequence of n elements from the list is an n-gram.
In this case, since n = 3, we look at 3-word sequences. Moving across the list L-to-R,
these are:
aa bb cc (list positions 0, 1, 2)
bb cc aa (list positions 1, 2, 3)
cc aa bb (list positions 2, 3, 4)
aa bb cc (list positions 3, 4, 5)
Since the 3-gram aa bb cc occurs twice, it has a count of 2.
An algorithm for computing n-gram frequencies
Given a value for n, the algorithm for computing n-grams for a list of words L is as follows:
-
If len(L) < n there are no n-grams to find. Stop.
-
# Initialization
Construct the first n-gram and initialize its count:
-
Remove the first n words of L.
-
Put the words you removed (in the same order) into current.
-
The length of current is now n and the length of L has been reduced by n.
-
Update the count for current in your database.
-
# Iterate through the list L
Construct the next n-gram:
-
Remove the first word from the beginning of current. (This makes the length of
current become n–1.)
-
Remove the first word from the beginning of L and append it to the end of current.
(This makes the length of current become n again, and reduces the length of
L by 1.)
-
Update the count for current in your database.
-
If L is empty, stop. Otherwise, go back to step 3.
An even simpler approach is to extract the n-gram as a slice, of length n, of the list of words.
By changing the position in the list where the slice starts, you can extract all of the n-grams
(but be careful to stop at the right place).
Implementation tricks
The example above talked about lists, and from a programming perspective it's probably simplest
to use a list for the sequence of words for which we are computing n-grams. On the other side,
a dictionary makes it very simple to keep track of how many times we have seen something.
Unfortunately, lists cannot be used as keys in Python dictionaries, which means we can't
go directly from a list of words to its count in some sort of dictionary.
One trick that gets around this is to concatenate the list
into a string, which we can then use as a key to a dictionary. But we have to be careful how we do
this, for two reasons:
-
if we just concatenate the strings in the list, the way we did in
assignment 1, we could end up
confusing different n-grams—for example, the following two 3-grams would end up
with the same concatenated strings:
fall entree speak
fallen trees peak
As a result, our counts for these 3-grams would get mixed up, and the program would compute
incorrect results.
-
This concatenation trick is just a way to simplify our programming. At the end, the information
the program prints out (e.g., about n-gram counts) must be in terms of n-grams of words, not
concatenated strings. So whatever we do over here, we have to be able to figure out a way to
go from the concatenated strings back to the n-grams.
Can you think of how you might be able to modify this idea of concatenating a list of strings into
a string to something that would address the two issues mentioned above? (Hint: Can you
think of a way to demarcate the boundaries between adjacent words when you concatenate them?)