CSCI 150: Lab 6
Concordances
Due: 10PM on Tuesday, October 29th
The purpose of this lab is to:
- Practice with lists and dictionaries
- Create a practical program for text analysis
Getting Started
You will want to download the following test files for your program:
- Prufrock.txt (T.S. Eliot's The Love Song of J. Alfred Prufrock)
- Jabberwocky.txt (The Lewis Carroll poem)
- Test.txt (a file for testing your line numbering)
Concordances
In this lab you will create a concordance. What is a concordance? It is an index to the words of a text or of a body of texts. For example, if you are writing an essay about Shakespeare's view of kingship, you might want to look at the instances in his plays where the word "king" is used. There are a lot of these instances. You can find them all by looking at a concordance to Shakespeare -- look up the word "king" and you will get references by Play, Act, Scene and Line Number, to every use of this word in every one of Shakespeare's plays. The Oberlin College library has concordances to Shakespeare and Chaucer and Dante and Vergil and Plato, and even to Joyce's Finnegans Wake. It has several concordances to the Bible and the Qur'an and the Guanzi. In fact, the library has more than 150 books whose titles start with "A concordance to ..."
One of the issues that the creator of a concordance faces is how to refer to a specific use of a word. We are going to take the easy way out and just use line numbers. This is great for making a concordance to a single poem, and less practical for a novel. Here is one small portion of the output of our concordance for The Love Song of J. Alfred Prufrock by T.S. Eliot:
etherized 3
evening 2 17 77
evenings 50
eyes 55 56
So the word "etherized" appears on line 3, "evening" appears 3 times, on lines 2, 17 and 77, and so forth. In this lab you will write a program that asks the user for the name of a text file, and then prints a concordance of the text in that file.
Data Structures: Lists and Dictionaries
The interesting parts of this lab are the structures we use to create the concordance. We need to store line numbers, possibly one and possibly many, for each word in the text. This is a problem of association -- we want to associate line numbers with words. Dictionaries are the structures to use for this. Dictionaries are designed to efficiently associate one datum with another. In dictionary terminology, keys are the things we use to look up values. The keys act like indexes. For our situation the words of the text will be our keys; the line numbers on which a word is found will be the value associated with that word. The line numbers themselves should be sequential -- we want to store them in increasing order. Lists are good for this, and are easy to use. Altogether, our concordance will be stored as a dictionary, where the keys are words (strings) and the values are lists of line numbers.
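As a rough illustration of this shape (not part of the required program), here is a tiny concordance built by hand from the Prufrock sample shown earlier. The variable names are only choices made for this sketch.

```python
# A hand-built concordance using entries from the Prufrock sample:
# keys are words (strings), values are lists of line numbers
# kept in increasing order.
Concordance = {
    "etherized": [3],
    "evening": [2, 17, 77],
}

eveningLines = Concordance["evening"]   # look up a word to get its list
print(eveningLines)                     # [2, 17, 77]
print(len(eveningLines))                # 3, the number of occurrences recorded
```

In the real program the dictionary starts out empty and grows as the file is read.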
We have talked about both dictionaries and lists in class. Here are reminders of how these structures work in Python:
Dictionaries
Ages = {} # sets Ages to be an empty dictionary
Ages["Sophia"] # returns the value associated with "Sophia",
# presumably her age. Throws a KeyError if
# "Sophia" is not a key of Ages.
Ages["Sophia"] = 3 # Makes "Sophia" a key and associates 3 with it.
Ages["Bob"] = math.inf # Makes "Bob" a key and associates infinity with it (needs "import math").
del Ages["Tom"] # removes key "Tom" and the value associated with it.
Ages.keys() # returns a "view" of the keys of Ages. You can
# treat this like a list of the keys.
len(Ages) # returns the number of keys in Ages
for person in Ages: # Iterates over the keys in Ages.
Lists
Numbers = [] # sets Numbers to be an empty list
Numbers[3] # returns the fourth entry of Numbers.
# Throws an IndexError if Numbers does not have
# 4 entries.
Numbers.append(18) # adds entry 18 onto the end of Numbers
del Numbers[3] # removes the fourth entry of Numbers, shifting
# later entries down.
for x in Numbers: # Iterates over the list Numbers.
Numbers.sort() # Reorders the elements of Numbers so they are sorted
Finally, in this lab we will make repeated use of several methods of the String class:
String Methods
In these examples assume s is a string.
s.strip() # returns a string like s except with leading and
# trailing spaces and newline characters deleted
" bob ".strip() # "bob"
s.strip(p) # Here p is a string of punctuation characters to be
# deleted. This returns a string like s, only with
# all of the characters of p, in any order, deleted
# from the front and back of s.
"(bob!*!!".strip( "(!*" ) # "bob"
s.split( ) # returns a list of the "words" in s, using
# white space as the separator between words
"The time is now!".split( ) # ["The", "time", "is", "now!"]
s.split( delim ) # returns a list of the "words" in s, using the
# string delim as the separator between words.
"3/20/2019".split( "/" ) # ["3", "20", "2019"]
s.lower() # returns a copy of s with all letters converted to lowercase
s.upper() # returns a copy of s with all letters converted to uppercase
"aBC(De)fG".lower() # "abc(de)fg"
Your Program
concordance.py: 38 pts
Your program should ask the user for the name of one file. That file should be in the same folder as your program. Most text files have names that end in ".txt", so be sure to type this as part of the file name you enter when you are running the program. Your program should open this file (make sure to handle exceptions and print error messages to the user), then read the file one line at a time (a for loop does this nicely), counting the line numbers (only count the non-blank lines; the first non-blank line should be numbered 1). Each word in the line should be stripped of punctuation marks, converted to lowercase, and added to your concordance with its line number. After the entire file is processed, you should print all of the words that are keys of your concordance, in alphabetical order, along with the list of line numbers for each word. Finally, at the end you should print the number of lines in the file and the number of unique words found.
For example, consider the file Test.txt:
one!!

Two Two
!!!! --
four four four four
five five Five! 'five five
Here is the output we want from this file:
five 5 5 5 5 5
four 4 4 4 4
one 1
two 2 2
I found 5 lines containing 4 unique words.
The word "one" appears once on the first line of the file; "two" appears twice on the line numbered 2 (we ignored the blank line between 1 and 2). There is a line 3, but the "words" on it consist only of punctuation characters so they are never added to the concordance.
There are 3 issues to consider with this program:
- How to read the file line-by-line, counting the line numbers (see Lab 5)
- How to get the individual words from a line, strip off their punctuation, and convert them to lowercase (see String Methods above)
- How to add the words and their line numbers to the concordance.
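For the first issue, one possible approach is sketched below. The function names NumberedLines and TryNumberedLines are hypothetical, and collecting the lines into a list of pairs is only one way to organize the work; your program may just as well process each line as it is read. The sketch also assumes that a line containing only white space counts as blank.

```python
def NumberedLines(fileName):
    # Return a list of (lineNumber, text) pairs for the non-blank lines.
    numbered = []
    lineNumber = 0
    with open(fileName, "r") as f:
        for line in f:                  # reads the file one line at a time
            text = line.strip()         # drop the newline and outer spaces
            if text == "":
                continue                # assumption: whitespace-only lines are blank
            lineNumber = lineNumber + 1
            numbered.append((lineNumber, text))
    return numbered

def TryNumberedLines(fileName):
    # Same as NumberedLines, but report a problem instead of crashing.
    try:
        return NumberedLines(fileName)
    except OSError as e:                # includes FileNotFoundError
        print("Could not read", fileName, "-", e)
        return None
```

Returning None on failure lets the caller decide whether to ask for another file name or stop.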
Handling the dictionary
As we have said, the Concordance is a dictionary. At the start of your program you will create an empty dictionary for your concordance:
Concordance = { }
Each time you come across a word you need to know if it is already a key in your concordance:
if word in Concordance.keys():
In this lab, the values stored in the dictionary are lists of line numbers. So whenever you add a word as a new key into the dictionary, you will use a list as its value (where the list only contains the current line number lineNumber):
Concordance[word] = [lineNumber] # make a list containing lineNumber and add it to the Concordance under word
Instead, if the word already exists as a key in the dictionary, then you append the current line number to the end of the list for that word:
lineNumbers = Concordance[word] # gets the list of line numbers stored for word in Concordance
lineNumbers.append(lineNumber) # appends lineNumber to the end of the list saved under word in Concordance
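Putting the two cases together, a function for recording a word might look something like the sketch below. This is one possible arrangement, not the required one; it tests `word in Concordance`, which behaves the same as the check against Concordance.keys() above.

```python
def AddWord(word, lineNumber, Concordance):
    # Record that word was found on lineNumber.
    if word in Concordance:
        # Known word: extend its existing list of line numbers.
        Concordance[word].append(lineNumber)
    else:
        # New word: start a one-element list for it.
        Concordance[word] = [lineNumber]

Concordance = {}
AddWord("two", 2, Concordance)
AddWord("two", 2, Concordance)
AddWord("one", 1, Concordance)
print(Concordance)   # {'two': [2, 2], 'one': [1]}
```

Because lines are read in order, appending keeps each list in increasing order without any extra sorting.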
After processing the entire text file you need to print all of the words in alphabetical order, followed by their line numbers. You can't directly sort a key structure; you need to first convert it to a list and then sort it. You should print the list of line numbers in a nice way so that it looks like the examples below.
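A minimal sketch of the sort-then-print step follows. The helper FormatEntry is hypothetical, and separating the items with single spaces is an assumption about the expected layout.

```python
def FormatEntry(word, Concordance):
    # Build one output line: the word followed by its line numbers.
    pieces = [word]
    for n in Concordance[word]:
        pieces.append(str(n))       # join() needs strings, not ints
    return " ".join(pieces)

Concordance = {"two": [2, 2], "one": [1], "four": [4, 4, 4, 4]}

words = list(Concordance.keys())    # convert the key view to a list
words.sort()                        # alphabetical order
for word in words:
    print(FormatEntry(word, Concordance))
```

Run as written, this prints the entries for "four", "one" and "two", in that order.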
The design of your program is up to you, but you should certainly divide the work to be done among several functions. One way to do this would be to use the following functions:
- RemovePunctuation(s):
- This function returns a new string that has the letters of s translated to lowercase, with all of the punctuation removed.
- AddWord(word, lineNumber, Concordance):
- This handles the work of recording that a given word was found on the given lineNumber in the dictionary Concordance
- PrintEntry(word, Concordance):
- This handles one word of the output. Concordance is the dictionary storing the concordance, so Concordance[word] is the list of line numbers on which word occurs.
- main( ):
- This gets the file name, opens the file, and has a loop to read the file one line at a time, then splits the line into words. RemovePunctuation prepares the word for adding to the concordance; AddWord actually does the addition. Finally, a loop over the keys of the concordance calls PrintEntry on each word to handle the output.
Testing your work
Here are several files that will help you test out your program:
- Prufrock.txt (T.S. Eliot's The Love Song of J. Alfred Prufrock)
- Jabberwocky.txt (The Lewis Carroll poem)
- Test.txt (a file for testing your line numbering)
File Test.txt should give you the following output:
five 5 5 5 5 5
four 4 4 4 4
one 1
two 2 2
I found 5 lines containing 4 unique words.
If you get a different output, there is a problem with either your line numbering or the way you are stripping punctuation. The other files are mainly useful for checking punctuation; there are many different punctuation characters used in these files and you should remove all of them. Look carefully at your output. If you see what appears to be a blank word followed by line numbers, it probably arises in the following way. The split() method separates a string into words by using white space as the delimiter, so some "words" might be just sequences of punctuation characters, such as "!!!". When you strip off the punctuation you are left with the empty string. Before you add a word and its line number to the concordance, you should check if the word is the empty string; if it is, just don't add it.
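The empty-string check could be sketched as follows. CleanWords is a hypothetical helper, and using string.punctuation from the standard library is an assumption: it covers only ASCII punctuation, so check whether your test files contain other characters that it misses.

```python
import string

def RemovePunctuation(s):
    # Strip punctuation from both ends of s, then convert to lowercase.
    return s.strip(string.punctuation).lower()

def CleanWords(line):
    # Return the cleaned words of line, leaving out any that end up empty.
    kept = []
    for rawWord in line.split():
        word = RemovePunctuation(rawWord)
        if word != "":              # "!!!!" and "--" become "" and are skipped
            kept.append(word)
    return kept

print(CleanWords("!!!! --"))            # []
print(CleanWords("five Five! 'five"))   # ['five', 'five', 'five']
```

Note that strip() only removes characters from the ends of a word, so punctuation in the middle (as in "don't") is left alone.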
If you want to play with your concordance, here are a few additional files you might work with:
- Beowulf.txt (translated to modern English by Hall)
- DavidCopperfield.txt (all 626 pages of the Dickens novel)
- Inferno.txt (the first third of Dante's Divine Comedy, translated by Norton)
- KingLear.txt (the Shakespeare play)
- Republic.txt (Plato's "Republic")
Thanks to Project Gutenberg, there are many, many more text files available on the Web.
Wrap Up - 2 pts
As with every lab, your last job prior to submission is to complete a brief write-up by filling out a Google Form.
Handin
You now just need to electronically hand in all your files. As a reminder:
cd             # changes to your home directory
cd cs150       # goes to your cs150 folder
handin         # starts the handin program
               # class is 150
               # assignment is 6
               # file/directory is lab06
lshand         # should show that you've handed in something
You can also specify the options to handin from the command line:
cd ~/cs150     # goes to your cs150 folder
handin -c 150 -a 6 lab06
File Checklist
You should have submitted the following files:
concordance.py