University of Arizona, Department of Computer Science

CSc 120: n-Grams Examples

Input file information

The examples below are based on input files of various sizes:

  1. Small files of nonsense words:

    in0.txt : 6 words
    in1.txt : 15 words
  2. Abraham Lincoln's speeches:

    in2.txt : 264 words. Gettysburg Address
    in3.txt : 702 words. Second Inaugural Address
  3. Various articles from The New York Times:
    nyt01.txt: 253 words
    nyt02.txt: 593 words
    nyt03.txt: 708 words
    nyt04.txt: 1125 words
    nyt05.txt: 621 words
    nyt06.txt: 1136 words
    nyt07.txt: 712 words
    nyt08.txt: 988 words
    nyt09.txt: 749 words
    nyt10.txt: 507 words
    nyt11.txt: 1108 words
    nyt12.txt: 821 words
    nyt13.txt: 240 words
    nyt14.txt: 250 words
    nyt15.txt: 451 words
    nyt16.txt: 784 words
    nyt17.txt: 631 words
    nyt18.txt: 452 words
    nyt19.txt: 664 words
    nyt20.txt: 631 words
    nyt21.txt: 6885 words
    nyt22.txt: 5908 words

Examples

  1. Input: in0.txt
    Output:
    n = 1      n = 2         n = 3
    2 -- bb    2 -- bb cc    2 -- aa bb cc
    2 -- cc    2 -- aa bb
    2 -- aa
  2. Input: in1.txt
    Output:
    n = 1       n = 2           n = 3
    6 -- ccc    4 -- aaa bbb    4 -- aaa bbb ccc
                4 -- bbb ccc
  3. Input: in2.txt
    Output:
    n = 1         n = 2              n = 3
    11 -- that    3 -- the people    2 -- dedicated to the
    11 -- the     3 -- to the
                  3 -- we cannot
                  3 -- it is
  4. Input: in3.txt
    Output:
    n = 1        n = 2          n = 3
    58 -- the    7 -- of the    2 -- cause of the
                                2 -- war rather than
                                2 -- rather than let
                                2 -- by whom the
                                2 -- this interest was
                                2 -- whom the offense
                                2 -- the cause of
                                2 -- drawn with the
  5. Input: nyt01.txt
    Output:
    n = 1        n = 2          n = 3
    15 -- the    3 -- in the    2 -- in december 1995
  6. Input: nyt02.txt
    Output:
    n = 1        n = 2              n = 3
    26 -- the    7 -- andy biggs    2 -- american family publishers
                                    2 -- biggs would like
                                    2 -- i m not
                                    2 -- phone calls from
                                    2 -- biggs of phoenix
                                    2 -- andy biggs of
                                    2 -- i ve got
  7. Input: nyt21.txt
    Output:
    n = 1         n = 2           n = 3
    381 -- the    38 -- in the    7 -- the rose bowl
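
In each example above, the output for a given n appears to list every n-gram tied for the highest count, printed as "count -- n-gram". The following is a minimal Python sketch of that idea, not the assignment's reference solution. The function names tokenize and most_frequent_ngrams are hypothetical. The tokenization rule (lowercase, with punctuation treated as a separator) is an assumption inferred from entries such as "i m not" and "in december 1995". The sample text is an assumed six-word input chosen to be consistent with the counts shown for in0.txt; the actual file contents are not given here. Ties are sorted alphabetically in this sketch, whereas the examples above do not follow any obvious ordering.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters and digits.

    Assumption: punctuation acts as a separator, so "I'm" becomes
    the two words "i" and "m".
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def most_frequent_ngrams(words, n):
    """Return (highest_count, ngrams) for all n-grams tied for the top count."""
    # Slide a window of n consecutive words across the list and count each one.
    counts = Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    if not counts:
        return 0, []
    best = max(counts.values())
    # Sorting the ties is a choice made for this sketch only.
    return best, sorted(g for g, c in counts.items() if c == best)


# Hypothetical six-word input, consistent with the counts shown for in0.txt.
words = tokenize("aa bb cc aa bb cc")
for n in (1, 2, 3):
    best, grams = most_frequent_ngrams(words, n)
    print(f"n = {n}")
    for gram in grams:
        print(f"{best} -- {gram}")
```

Joining each window into a single string keeps the Counter keys directly printable; a tuple of words would work equally well as a key if the words need to stay separate.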