concordance.py: 34 points
In this lab you will create a concordance. What’s a concordance? It’s an index to the words of a text or of a body of texts. For example, if you are writing an essay about Shakespeare’s view of kingship, you might want to look at the instances in his plays where the word “king” is used. There are a lot of these instances! You can find them all by looking at a concordance to Shakespeare—look up the word “king” and you will get references by Play, Act, Scene and Line Number, to every use of this word in every one of Shakespeare’s plays. The Oberlin College library has concordances to Shakespeare, Chaucer, Dante, Vergil, Plato, and even to Joyce’s Finnegans Wake.
One of the issues that the creator of a concordance faces is how to refer to the address of a specific use of a word. We are going to use line numbers. Here is one small portion of the output of our concordance for The Love Song of J. Alfred Prufrock by T.S. Eliot:
etherized 3
evening 2 19 86
evenings 56
eyes 62 63
So the word “etherized” appears on line 3, “evening” appears three times, on lines 2, 19, and 86, and so forth. In this lab you will write a program that asks the user for the name of a text file, and then prints a concordance of the words in that file.
Hint
We need to store line numbers, possibly one or possibly many, for each word in the text. This is a problem of association: we want to associate line numbers with words. Dictionaries are the structures to use for this. In dictionary terminology, we use keys to look up values. For our situation the words of the text will be our keys, and the line numbers on which a word is found will be our values. The line numbers for each word should be stored in increasing order, which happens naturally if we append them as we read the file from top to bottom. Lists are good for this, and are easy to use. Altogether, our concordance will be stored as a dictionary, where the keys are words (i.e., strings) and the values are lists of line numbers.
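As a rough illustration of this structure, the sketch below records a few of the Prufrock entries shown earlier. The helper name add_occurrence is hypothetical, and using setdefault is only one of several reasonable ways to create the list the first time a word is seen.

```python
def add_occurrence(concordance, word, line_number):
    """Record that word appears on line_number (hypothetical helper name)."""
    # setdefault returns the list already stored for this word, or stores
    # and returns a new empty list if the word has not been seen before.
    concordance.setdefault(word, []).append(line_number)


concordance = {}
add_occurrence(concordance, "evening", 2)
add_occurrence(concordance, "etherized", 3)
add_occurrence(concordance, "evening", 19)
add_occurrence(concordance, "evening", 86)

print(concordance["evening"])    # [2, 19, 86]
print(concordance["etherized"])  # [3]
```

Because line numbers are appended in the order the lines are read, each list ends up in increasing order without any sorting.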
Your program should start by asking the user for the name of a file; your code should work for any properly formatted text file in the same folder as your program. Your program should try to open the file, making sure to handle any errors (e.g. FileNotFoundError) with try/except statement(s) which print error message(s) to the user and exit the program.
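One possible shape for this step is sketched below. The function name open_or_exit, the wording of the messages, and the exit code 1 are all assumptions made for illustration; the lab only requires that errors are caught, reported, and followed by an exit.

```python
import sys


def open_or_exit(filename):
    """Open filename for reading, or print an error message and exit.

    open_or_exit is a hypothetical helper name, not one required by the lab.
    """
    try:
        return open(filename, "r")
    except FileNotFoundError:
        print(f"Sorry, I could not find a file named '{filename}'.")
    except OSError as error:
        # Covers other failures, such as a folder name or a permissions problem.
        print(f"Sorry, I could not open '{filename}': {error}")
    sys.exit(1)  # reached only when one of the except branches ran


# Example use (commented out so the sketch does not prompt when imported):
# infile = open_or_exit(input("Enter the name of a file: "))
```

FileNotFoundError is listed before OSError because it is a more specific kind of OSError; with the order reversed, the specific message would never be printed.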
After successfully opening a file, your program should then read the file one line at a time (a for loop does this nicely). Then each word in the line should be stripped of punctuation marks, converted to lowercase, and added to your concordance with its line number. After the entire file is processed, you should print all of the words that are keys of your concordance, in alphabetical order, along with the list of line numbers for each word. Finally, at the end you should print the number of lines in the file and the number of unique words found.
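The word-cleaning step might look something like the sketch below. It assumes that words are separated by whitespace and that "punctuation" means the ASCII characters in string.punctuation; the function name clean_words is hypothetical.

```python
import string


def clean_words(line):
    """Return the lowercase words on one line, without surrounding punctuation."""
    cleaned = []
    for raw_word in line.split():  # split() also discards the trailing newline
        word = raw_word.strip(string.punctuation).lower()
        if word:  # tokens such as "!!!!" become empty strings and are skipped
            cleaned.append(word)
    return cleaned


print(clean_words('Three three ThReE "five"\n'))  # ['three', 'three', 'three', 'five']
print(clean_words("!!!! --\n"))                   # []
```

Note that strip only removes punctuation from the two ends of a token, so an apostrophe inside a word such as "don't" is kept. Whether that is the desired behavior is a design decision this sketch does not settle.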
For example, consider the file Test.txt:
one!!

Three three ThReE "five"
!!!! --
Five five Five five Five three?


Eight eIght eiGht eigHt eighT eight eight eight
Here is the output we want from this file:
eight 8 8 8 8 8 8 8 8
five 3 5 5 5 5 5
one 1
three 3 3 3 5
I found 8 lines containing 4 unique words.
The word “one” appears once on the first line of the file; “three” appears thrice on line 3 and once on line 5. Lines 2, 4, 6, and 7 are either blank, contain only spaces, or consist only of punctuation characters, so they contribute no words to the concordance, although they still count toward the total of 8 lines.
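To check the expected output by hand, the whole process can be sketched on an in-memory copy of Test.txt. This is one possible approach rather than the required solution. The names build_concordance and format_report are hypothetical, and the exact contents of the wordless lines (empty versus spaces only) are assumed, since the description above only says that lines 2, 4, 6, and 7 contain no words.

```python
import string

# Test.txt as a list of lines. Which wordless lines are empty and which
# hold only spaces is an assumption made for this sketch.
sample_lines = [
    "one!!",
    "",
    'Three three ThReE "five"',
    "!!!! --",
    "Five five Five five Five three?",
    "",
    "   ",
    "Eight eIght eiGht eigHt eighT eight eight eight",
]


def build_concordance(lines):
    """Return (concordance, line_count) for an iterable of text lines."""
    concordance = {}
    line_count = 0
    for line_number, line in enumerate(lines, start=1):
        line_count = line_number  # every line counts, even one with no words
        for raw_word in line.split():
            word = raw_word.strip(string.punctuation).lower()
            if word:
                concordance.setdefault(word, []).append(line_number)
    return concordance, line_count


def format_report(concordance, line_count):
    """Return the output lines: one per word in alphabetical order, then a summary."""
    report = []
    for word in sorted(concordance):
        numbers = " ".join(str(n) for n in concordance[word])
        report.append(f"{word} {numbers}")
    report.append(
        f"I found {line_count} lines containing {len(concordance)} unique words."
    )
    return report


concordance, line_count = build_concordance(sample_lines)
for row in format_report(concordance, line_count):
    print(row)
```

Run on the list above, this prints the same five lines as the expected output. An open file can be passed to build_concordance in place of the list, since a for loop over a file also yields one line at a time.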
There are three issues to consider with this program: