University of Arizona, Department of Computer Science

CSc 120: n-Grams: Example and Algorithm

Identifying n-grams: an example

Suppose we have a list of words [aa bb cc aa bb cc], and suppose that the n for computing n-grams is 3. Each sequence of n elements from the list is an n-gram. In this case, since n = 3, we look at 3-word sequences. Moving across the list L-to-R, these are:
aa bb cc   (list positions 0, 1, 2)
bb cc aa   (list positions 1, 2, 3)
cc aa bb   (list positions 2, 3, 4)
aa bb cc   (list positions 3, 4, 5)

Since the 3-gram aa bb cc occurs twice, it has a count of 2.

An algorithm for computing n-gram frequencies

Given a value for n, the algorithm for computing n-grams for a list of words L is as follows:

  1. If len(L) < n there are no n-grams to find. Stop.
  2. # Initialization
    Construct the first n-gram and initialize its count:
  3. # Iterate through the list L
    Construct the next n-gram:
  4. Update the count for current in your database.
  5. If L is empty, stop. Otherwise, go back to step 3.

An even simpler approach is to extract the n-gram as a slice, of length n, of the list of words. By changing the position in the list where the slice starts, you can extract all of the n-grams (but be careful to stop at the right place).

Implementation tricks

The example above talked about lists, and from a programming perspective it's probably simplest to use a list for the sequence of words for which we are computing n-grams. On the other side, a dictionary makes it very simple to keep track of how many times we have seen something. Unfortunately, lists cannot be used as keys in Python dictionaries, which means we can't go directly from a list of words to its count in some sort of dictionary.

One trick that gets around this is to concatenate the list into a string, which we can then use as a key to a dictionary. But we have to be careful how we do this, for two reasons:

  1. if we just concatenate the strings in the list, the way we did in assignment 1, we could end up confusing different n-grams—for example, the following two 3-grams would end up with the same concatenated strings:
    fall entree speak
    fallen trees peak
    As a result, our counts for these 3-grams would get mixed up, and the program would compute incorrect results.

  2. This concatenation trick is just a way to simplify our programming. At the end, the information the program prints out (e.g., about n-gram counts) must be in terms of n-grams of words, not concatenated strings. So whatever we do over here, we have to be able to figure out a way to go from the concatenated strings back to the n-grams.

Can you think of how you might be able to modify this idea of concatenating a list of strings into a string to something that would address the two issues mentioned above? (Hint: Can you think of a way to demarcate the boundaries between adjacent words when you concatenate them?)