CSc 120: Fake News
The primary purpose of this assignment is to work with linked lists.
Background
Fake news
has become a topic of interest in current times, but it is certainly nothing new. As this article points out,
"Fake news"
was a tatic used in the Roman era;
this
U of A special collections exhibit demonstrates its use during the Reformation.
This programming assignment involves writing a
program to analyze some recent data about fake news articles and identify
what they most commonly focus on.
The input data for this assignment are derived from
www.kaggle.com/mrisdal/fake-news. For use in this
assignment, the original data file has been modified as follows:
-
In order to simplify processing, non-ASCII characters have been replaced by blanks.
-
In order to reduce file sizes and make the analysis more manageable, the
bodies of the articles have been removed. The analysis in this assignment
is therefore limited to the titles (i.e., headlines) of articles.
-
Only US news sources have been considered.
-
For each discussion thread, only the first posting has been retained, i.e.,
comments and replies have been omitted.
If you would like to examine or work with the original un-edited data, see
the website mentioned above.
Expected Behavior
Write a program, in a file fake-news.py, that behaves as follows.
-
Read in the name of an input file using
input('File: ').
-
Read in the data and process it as specified below. You can
use the csv module to read the input data (see
Using Python's csv module below, under Miscellaneous).
-
Extract the title from each input line;
-
Process the title so obtained as follows:
-
Divide the title into a Python list of words by splitting
on whitespace and punctuation, so that no word in the resulting
list contains any whitespace or punctuation character.
Use string.punctuation for your punctuation characters and
string.whitespace for whitespace characters. You will have to
import the string module for this.
-
From the resulting list, discard words with length ≤ 2. This
gets rid of small fragments of words that are created by the
splitting process in the previous step (e.g., the word
"Tom's"
gets split into "Tom" and "s",
and this step gets rid of "s") as well as
common words, like "a", "an",
and "is", that convey little information here.
(Not all common words are excluded, but we'll leave that be as a
concession to keeping things from getting overly complex.)
-
Convert all words to lower case to make counting word occurrences
case-insensitive.
We will refer to the resulting list of words as the
"cleaned list".
-
Use a linked list to maintain a count of the number of times
each word occurs across all of the input data (but remember that this
program limits itself to just the title of each news item). To this end,
for each word w in the cleaned list:
-
if w occurs in the linked list, update its count appropriately;
-
otherwise, add w to the linked list with count
initialized to 1.
-
After the data file has been read in and processed, sort the linked list
of words in descending order by count. The relative order of different
words that have the same count is up to you.
An algorithm for sorting linked lists is given
here.
-
Read in an integer value N using
input('N: ').
N should be ≥ 0.
-
Traverse the sorted list to find the count for the node in the sorted list
at position Nth (the very first ndoe is at
position 0). We refer to this value
as k.
For example, suppose the linked list has the following values for
word counts (in descending order of count):
monsoon: 10
heat: 7
dry: 7
chili: 5
canyon: 5
bicycle: 3
room: 2
and N = 3, then the word at position 3 is chili,
so k = 5 (because chili has count 5).
-
Print out all the words in the sorted list that have count ≥ k
(see Output format below). The order in which they are printed
out is up to you.
Data Structures
You should use linked lists to keep track of words encountered in the
input data and their occurrence counts. You are free to either adapt the code
shown in the class lecture notes for this, e.g., by using any additional
attributes you may need, or to write your own.
SInce the primary purpose of this assignment is to practice using linked
lists, code that tries to sidestep linked lists, e.g., by copying data
to or from Python lists and doing some or all of the actual work of the
program in Python lists, will not get credit. (Python's csv module returns
a Python list when reading from a file. This is OK. Your code should
then process the strings in that list and insert them into your linked list,
and subsequent processing should be done using linked lists.)
Each line in the input file represents data about a particular news post.
Lines have the following format:
-
A line that begins with the character "#"
is a comment and should be ignored.
-
A non-comment line contains the following fields. (Here,
"..." indicates a field that has no bearing on this
programming problem and whose description has been omitted to
reduce clutter.)
0
identifier
|
1
...
|
2
author
|
3
date published
|
4
title of story
|
5
language
|
6
...
|
7
site URL
|
8
country
|
9–19
...
|
The highlighted field (field 4, "title of story") is the
field this program will analyze.
Print out the top k words by count from the sorted list using the
statement
print("{} : {:d}".format(word, count))
where word is a string (a word occurring in some title in the
input file) and count is an integer value giving the number
of occurrences of that string in the title field of posts in the input file.
Programming Requirements
Your program should implement (at least) the following classes along with the
methods specified.
-
class Node
-
An object of this class represents information about a word. It should
contain (at least) the following attributes:
-
_word: the word string.
-
_count: a count of the number of occurrences of the word.
-
_next: a reference to the next node in the linked list.
It should implement (at least) the following methods:
-
__init__(self, word): initializes the object's
attributes as follows: _word is set to
word; _count is set to 1;
_next is set to None.
-
word(self): returns the value of _word.
-
count(self): returns the value of _count.
-
next(self): returns the value of _next.
-
set_next(self, target): sets the _next
attribute to target.
-
incr(self): increments the value of _count.
-
__str__(self) and, optionally,
__repr__(self).
-
class LinkedList
-
An object of this class represents a linked list. It should contain
(at least) the following attributes:
-
_head: a reference to the first node in the list.
It should implement (at least) the following methods:
-
__init__(self): initializes _head
to None.
-
is_empty(self): returns a Boolean indicating whether
or not the list contains any nodes.
-
head(self): returns a reference to the first node
in the linked list, None if the list is empty.
-
update_count(self, word): if word
is present in the list, increments its count; otherwise, adds a node
for it with count initialized to 1.
-
rm_from_hd(self): removes the first node in the
linked list (updates the list's _head attribute
appropriately) and returns this node. It generates an error if this
method is invoked on an empty list.
-
insert_after(self, node1, node2):
node1 and node2 are references
to nodes. Inserts node2 into the list containing
node1 so that the node that comes after
node1 is node2.
-
sort(self): sorts the linked list in descending order
by _count. See here for an algorithm.
-
get_nth_highest_count(self, n): returns the
count associated with the node in the linked list at position
n.
-
print_upto_count(self, n): print out all the words
that have count at least n.
-
__str__(self) and, optionally,
__repr__(self).
Note It is OK to use or adapt the code you were given for
the short problems for this assignment. If you do so, however, then for
each such method that you did not write yourself, you must add a
comment above the method indicating that it is taken from, or based on,
code you were given as part of the short problem.
Errors
In order to facilitate grading, we will use specific error messages in this
assignment, as indicated below.
The following are errors:
-
Input file cannot be read.
Program behavior: Use a try statement to detect this.
Give an error message and quit.
Error message:
"ERROR: Could not open file " + filename
-
Input value N cannot be read or cannot be converted
to an integer.
Program behavior: Use a try statement to detect this.
Give an error message and quit.
Error message:
"ERROR: Could not read N"
-
Input value N is negative.
Program behavior: Use an assert to detect this. An
assert failure will terminate the program. No error message is needed.
Examples
Some examples of the behavior of this program are given
here.
Miscellaneous
Using Python's csv module
In general, CSV files can have a complex structure such that they
cannot be parsed using split(","). The file
csv-example.py,
which uses a small example input file
dummy.csv, shows how
Python's csv module can be used to read CSV files.
Sorting a linked list
An algorithm for sorting linked lists is given
here.