University of Arizona, Department of Computer Science

CSc 120: Fake News

The primary purpose of this assignment is to work with linked lists.

Background

Fake news has become a topic of interest in current times, but it is certainly nothing new. As this article points out, "Fake news" was a tatic used in the Roman era; this U of A special collections exhibit demonstrates its use during the Reformation.

This programming assignment involves writing a program to analyze some recent data about fake news articles and identify what they most commonly focus on.

The input data for this assignment are derived from www.kaggle.com/mrisdal/fake-news. For use in this assignment, the original data file has been modified as follows:

In order to simplify processing, non-ASCII characters have been replaced by blanks.
In order to reduce file sizes and make the analysis more manageable, the bodies of the articles have been removed. The analysis in this assignment is therefore limited to the titles (i.e., headlines) of articles.
Only US news sources have been considered.
For each discussion thread, only the first posting has been retained, i.e., comments and replies have been omitted.

If you would like to examine or work with the original un-edited data, see the website mentioned above.

Expected Behavior

Write a program, in a file fake-news.py, that behaves as follows.

Read in the name of an input file using
input('File: ').
Read in the data and process it as specified below. You can use the csv module to read the input data (see Using Python's csv module below, under Miscellaneous).
- Extract the title from each input line;
- Process the title so obtained as follows:
  - Divide the title into a Python list of words by splitting on whitespace and punctuation, so that no word in the resulting list contains any whitespace or punctuation character. Use string.punctuation for your punctuation characters and string.whitespace for whitespace characters. You will have to import the string module for this.
  - From the resulting list, discard words with length ≤ 2. This gets rid of small fragments of words that are created by the splitting process in the previous step (e.g., the word "Tom's" gets split into "Tom" and "s", and this step gets rid of "s") as well as common words, like "a", "an", and "is", that convey little information here. (Not all common words are excluded, but we'll leave that be as a concession to keeping things from getting overly complex.)
  - Convert all words to lower case to make counting word occurrences case-insensitive.
  We will refer to the resulting list of words as the "cleaned list".
- Use a linked list to maintain a count of the number of times each word occurs across all of the input data (but remember that this program limits itself to just the title of each news item). To this end, for each word w in the cleaned list:
  - if w occurs in the linked list, update its count appropriately;
  - otherwise, add w to the linked list with count initialized to 1.
- After the data file has been read in and processed, sort the linked list of words in descending order by count. The relative order of different words that have the same count is up to you. An algorithm for sorting linked lists is given here.
Read in an integer value N using
input('N: ').
N should be ≥ 0.
Traverse the sorted list to find the count for the node in the sorted list at position N^th (the very first ndoe is at position 0). We refer to this value as k.
For example, suppose the linked list has the following values for word counts (in descending order of count):
monsoon: 10 heat: 7 dry: 7 chili: 5 canyon: 5 bicycle: 3 room: 2
and N = 3, then the word at position 3 is chili, so k = 5 (because chili has count 5).
Print out all the words in the sorted list that have count ≥ k (see Output format below). The order in which they are printed out is up to you.

Data Structures

You should use linked lists to keep track of words encountered in the input data and their occurrence counts. You are free to either adapt the code shown in the class lecture notes for this, e.g., by using any additional attributes you may need, or to write your own.

SInce the primary purpose of this assignment is to practice using linked lists, code that tries to sidestep linked lists, e.g., by copying data to or from Python lists and doing some or all of the actual work of the program in Python lists, will not get credit. (Python's csv module returns a Python list when reading from a file. This is OK. Your code should then process the strings in that list and insert them into your linked list, and subsequent processing should be done using linked lists.)

Input format

Each line in the input file represents data about a particular news post. Lines have the following format:

A line that begins with the character "#" is a comment and should be ignored.

A non-comment line contains the following fields. (Here, "..." indicates a field that has no bearing on this programming problem and whose description has been omitted to reduce clutter.)

0
identifier 1
... 2
author 3
date published 4
title of story 5
language 6
... 7
site URL 8
country 9–19
...

The highlighted field (field 4, "title of story") is the field this program will analyze.

Output format

Print out the top k words by count from the sorted list using the statement

print("{} : {:d}".format(word, count))

where word is a string (a word occurring in some title in the input file) and count is an integer value giving the number of occurrences of that string in the title field of posts in the input file.

Programming Requirements

Your program should implement (at least) the following classes along with the methods specified.

class Node

An object of this class represents information about a word. It should contain (at least) the following attributes:

_word: the word string.

_count: a count of the number of occurrences of the word.

_next: a reference to the next node in the linked list.

It should implement (at least) the following methods:

__init__(self, word): initializes the object's attributes as follows: _word is set to word; _count is set to 1; _next is set to None.

word(self): returns the value of _word.

count(self): returns the value of _count.

next(self): returns the value of _next.

set_next(self, target): sets the _next attribute to target.

incr(self): increments the value of _count.

__str__(self) and, optionally, __repr__(self).

class LinkedList

An object of this class represents a linked list. It should contain (at least) the following attributes:

_head: a reference to the first node in the list.

It should implement (at least) the following methods:

__init__(self): initializes _head to None.

is_empty(self): returns a Boolean indicating whether or not the list contains any nodes.

head(self): returns a reference to the first node in the linked list, None if the list is empty.

update_count(self, word): if word is present in the list, increments its count; otherwise, adds a node for it with count initialized to 1.

rm_from_hd(self): removes the first node in the linked list (updates the list's _head attribute appropriately) and returns this node. It generates an error if this method is invoked on an empty list.

insert_after(self, node1, node2): node1 and node2 are references to nodes. Inserts node2 into the list containing node1 so that the node that comes after node1 is node2.

sort(self): sorts the linked list in descending order by _count. See here for an algorithm.

get_nth_highest_count(self, n): returns the count associated with the node in the linked list at position n.

print_upto_count(self, n): print out all the words that have count at least n.

__str__(self) and, optionally, __repr__(self).

Note It is OK to use or adapt the code you were given for the short problems for this assignment. If you do so, however, then for each such method that you did not write yourself, you must add a comment above the method indicating that it is taken from, or based on, code you were given as part of the short problem.

Errors

In order to facilitate grading, we will use specific error messages in this assignment, as indicated below.

The following are errors:

Input file cannot be read.
Program behavior: Use a try statement to detect this. Give an error message and quit.
Error message: "ERROR: Could not open file " + filename
Input value N cannot be read or cannot be converted to an integer.
Program behavior: Use a try statement to detect this. Give an error message and quit.
Error message: "ERROR: Could not read N"
Input value N is negative.
Program behavior: Use an assert to detect this. An assert failure will terminate the program. No error message is needed.

Examples

Some examples of the behavior of this program are given here.

Miscellaneous

Using Python's csv module

In general, CSV files can have a complex structure such that they cannot be parsed using split(","). The file csv-example.py, which uses a small example input file dummy.csv, shows how Python's csv module can be used to read CSV files.

Sorting a linked list

An algorithm for sorting linked lists is given here.