CSCI 150: Lab 6

Concordances
Due: 10PM on Tuesday, October 29th

The purpose of this lab is to:

Practice with lists and dictionaries
Create a practical program for text analysis

Getting Started

You will want to download the following test files for your program:

Prufrock.txt (T.S. Eliot's The Love Song of J. Alfred Prufrock)
Jabberwocky.txt (the Lewis Carroll poem)
Test.txt (a file for testing your line numbering)

Concordances

In this lab you will create a concordance. What is a concordance? It is an index to the words of a text or of a body of texts. For example, if you are writing an essay about Shakespeare's view of kingship, you might want to look at the instances in his plays where the word "king" is used. There are a lot of these instances. You can find them all by looking at a concordance to Shakespeare -- look up the word "king" and you will get references, by Play, Act, Scene and Line Number, to every use of this word in every one of Shakespeare's plays. The Oberlin College library has concordances to Shakespeare and Chaucer and Dante and Vergil and Plato, and even to Joyce's Finnegans Wake. It has several concordances to the Bible and the Qur'an and the Guanzi. In fact, the library has more than 150 books whose titles start with "A concordance to ..."

One of the issues that the creator of a concordance faces is how to refer to a specific use of a word. We are going to take the easy way out and just use line numbers. This is great for making a concordance to a single poem, and less practical for a novel. Here is one small portion of the output of our concordance for The Love Song of J. Alfred Prufrock by T.S. Eliot:

      etherized 3
      evening 2 17 77
      evenings 50
      eyes 55 56

So the word "etherized" appears on line 3, "evening" appears 3 times, on lines 2, 17 and 77, and so forth. In this lab you will write a program that asks the user for the name of a text file, and then prints a concordance of the text in that file.

Data Structures: Lists and Dictionaries

The interesting parts of this lab are the structures we use to create the concordance. We need to store line numbers, possibly one and possibly many, for each word in the text. This is a problem of association -- we want to associate line numbers with words. Dictionaries are the structures to use for this. Dictionaries are designed to efficiently associate one datum with another. In dictionary terminology, keys are the things we use to look up values. The keys act like indexes. For our situation the words of the text will be our keys; the line numbers on which a word is found will be the value associated with that word. The line numbers themselves should be sequential -- we want to store them in increasing order. Lists are good for this, and are easy to use. Altogether, our concordance will be stored as a dictionary, where the keys are words (strings) and the values are lists of line numbers.
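As a small illustration, the Prufrock excerpt shown earlier could be stored as follows. This is only a sketch of the data layout; the variable name excerpt is made up for this example.

```python
# A concordance maps each word (a string) to the list of
# line numbers on which that word appears, in increasing order.
excerpt = {
    "etherized": [3],
    "evening": [2, 17, 77],
    "evenings": [50],
    "eyes": [55, 56],
}

print(excerpt["evening"])        # the list of line numbers for "evening"
print(len(excerpt["evening"]))   # how many times "evening" was recorded
```

Looking up a word gives back the whole list, so the number of occurrences is simply the length of that list.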

We have talked about both dictionaries and lists in class. Here are reminders of how these structures work in Python:

Dictionaries


    Ages = {}                       # sets Ages to be an empty dictionary
    
    Ages["Sophia"]                  # returns the value associated with "Sophia", 
                                    # presumably her age.  Throws a KeyError if 
                                    # "Sophia" is not a key of Ages.
                                    
    Ages["Sophia"] = 3              # Makes "Sophia" a key and associates 3 with it.
    
    Ages["Bob"] = math.inf          # Makes "Bob" a key and associates infinity with it
                                    # (requires "import math" at the top of the file).
    
    del Ages["Tom"]                 # removes key "Tom" and the value associated with it.
    
    Ages.keys()                     # returns a "view" of the keys of Ages. You can 
                                    # treat this like a list of the keys.
                                    
    len(Ages)                       # returns the number of keys in Ages
    
    for person in Ages:             # Iterates over the keys in Ages.
        print(person, Ages[person]) # Here: prints each key and its value.

Lists


    Numbers = []                    # sets Numbers to be an empty list
    
    Numbers[3]                      # returns the fourth entry of Numbers. 
                                    # Throws an IndexError if Numbers does not have
                                    # 4 entries.
                                    
    Numbers.append(18)              # adds entry 18 onto the end of Numbers
    
    del Numbers[3]                  # removes the fourth entry of Numbers, shifting
                                    # later entries down.
                
    for x in Numbers:               # Iterates over the list Numbers.
        print(x)                    # Here: prints each entry.
    
    Numbers.sort()                  # Reorders the elements of Numbers so they are sorted

Finally, in this lab we will make repeated use of several methods of the String class:

String Methods

In these examples assume s is a string.


    s.strip()                       # returns a string like s except with leading and
                                    # trailing whitespace (spaces, tabs, newlines) deleted
                                    
    " bob ".strip()                 # "bob"
    
    s.strip(p)                      # Here p is a string of punctuation characters to be
                                    # deleted.  This returns a string like s, only with
                                    # all of the characters of p, in any order, deleted
                                    # from the front and back of s (never from the middle).
                                    
    "(bob!*!!".strip( "(!*" )       # "bob"
    
    s.split( )                      # returns a list of the "words" in s, using 
                                    # white space as the separator between words
                                    
    "The time is now!".split( )     # ["The", "time", "is", "now!"]
    
    s.split( delim )                # returns a list of the "words" in s, using the 
                                    # string delim as the separator between words.
                                    
    "3/20/2019".split( "/" )        # ["3", "20", "2019"]
    
    s.lower()                       # returns a copy of s with all letters converted to lowercase
                                    
    s.upper()                       # returns a copy of s with all letters converted to uppercase
                                    
    "aBC(De)fG".lower()             # "abc(de)fg"

Your Program

concordance.py: 38 pts

Your program should ask the user for the name of one file. That file should be in the same folder as your program. Most text files have names that end in ".txt", so be sure to type this as part of the file name you enter when you are running the program. Your program should open this file (make sure to handle exceptions and print error messages to the user), then read the file one line at a time (a for loop does this nicely), counting the line numbers (only count the non-blank lines; the first non-blank line should be numbered 1). Each word in the line should be stripped of punctuation marks, converted to lowercase, and added to your concordance with its line number. After the entire file is processed, you should print all of the words that are keys of your concordance, in alphabetical order, along with the list of line numbers for each word. Finally, at the end you should print the number of lines in the file and the number of unique words found.
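One possible way to handle a file that cannot be opened is sketched below. The helper name open_text_file is hypothetical and not part of the lab specification; it assumes that catching OSError (which covers a missing file) and returning None is an acceptable way to report the failure.

```python
def open_text_file(file_name):
    """Return an open file object, or None if the file cannot be opened.

    Hypothetical helper for illustration only.
    """
    try:
        return open(file_name, "r")
    except OSError as error:
        # FileNotFoundError and PermissionError are both kinds of OSError.
        print("Could not open", file_name, "-", error)
        return None


# Typical use inside main() would look something like this:
#     file_name = input("Enter the name of a text file: ")
#     source = open_text_file(file_name)
#     if source is not None:
#         for line in source:
#             ...
#         source.close()
```

Returning None lets the caller decide what to do next, for example ending the program or asking for another name.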

For example, consider the file Test.txt:

    one!!
          
    Two Two
    !!!! --
    four four four four
      
    five five Five!   'five five

Here is the output we want from this file:

    five 5 5 5 5 5
    four 4 4 4 4
    one 1
    two 2 2
    I found 5 lines containing 4 unique words.

The word "one" appears once on the first line of the file; "two" appears twice on the line numbered 2 (we ignored the blank line between 1 and 2). There is a line 3, but the "words" on it consist only of punctuation characters so they are never added to the concordance.
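The numbering rule can be sketched on its own, without reading a file. The function name number_lines and the list test_lines are made up for this example; test_lines simply reproduces the contents of Test.txt shown above.

```python
def number_lines(lines):
    """Return (line_number, text) pairs, skipping blank lines.

    Hypothetical helper: only non-blank lines receive a number,
    and the first non-blank line is numbered 1.
    """
    numbered = []
    line_number = 0
    for line in lines:
        if line.strip() == "":      # nothing but whitespace: do not count it
            continue
        line_number += 1
        numbered.append((line_number, line.strip()))
    return numbered


test_lines = [
    "one!!\n",
    "      \n",
    "Two Two\n",
    "!!!! --\n",
    "four four four four\n",
    "  \n",
    "five five Five!   'five five\n",
]

for number, text in number_lines(test_lines):
    print(number, text)
```

Note that the punctuation-only line still receives the number 3 because it is not blank; it just contributes no words.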

There are 3 issues to consider with this program:

How to read the file line-by-line, counting the line numbers (see Lab 5)
How to get the individual words from a line, strip off their punctuation, and convert them to lowercase (see String Methods above)
How to add the words and their line numbers to the concordance.
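The second issue in the list above can be sketched with the string methods from earlier. This sketch assumes that string.punctuation from the standard library is a suitable set of characters to strip; the lab does not prescribe a particular set, so you may need a different one.

```python
import string

line = "five five Five!   'five five\n"   # the last line of Test.txt

words = []
for raw in line.split():                   # split on white space
    cleaned = raw.strip(string.punctuation).lower()
    words.append(cleaned)

print(words)   # ['five', 'five', 'five', 'five', 'five']
```

Because strip only removes characters from the two ends of a string, punctuation in the middle of a word (such as the apostrophe in "don't") is left alone.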

Handling the dictionary

As we have said, the Concordance is a dictionary. At the start of your program you will create an empty dictionary for your concordance:


    Concordance = { }

Each time you come across a word you need to know if it is already a key in your concordance:


    if word in Concordance.keys():

In this lab, the values stored in the dictionary are lists of line numbers. So whenever you add a word as a new key into the dictionary, you will use a list as its value (where the list only contains the current line number lineNumber):


    Concordance[word] = [lineNumber]      # make a list containing lineNumber and add it to the Concordance under word

Instead, if the word already exists as a key in the dictionary, then you append the current line number to the end of the list for that word:


    lineNumbers = Concordance[word] # gets the list of line numbers stored for word in Concordance
    lineNumbers.append(lineNumber)  # appends lineNumber to the end of the list saved under word in Concordance
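Putting the two cases together gives something like the sketch below. It is one possible shape for the bookkeeping, not the required solution; the sample words and line numbers are taken from the Prufrock excerpt.

```python
def AddWord(word, lineNumber, Concordance):
    """Record that word occurs on line lineNumber (a minimal sketch)."""
    if word in Concordance:
        # Existing key: extend the list already stored for this word.
        Concordance[word].append(lineNumber)
    else:
        # New key: start a one-element list.
        Concordance[word] = [lineNumber]


Concordance = {}
AddWord("evening", 2, Concordance)
AddWord("etherized", 3, Concordance)
AddWord("evening", 17, Concordance)
print(Concordance)
```

Since the file is read from top to bottom, each list ends up in increasing order without any extra sorting.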

After processing the entire text file you need to print all of the words in alphabetical order, followed by their line numbers. The view returned by keys() has no sort() method, so first convert it to a list and then sort that list. Print each word's line numbers on the same line as the word, separated by spaces, so that the output looks like the examples in this lab.
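A minimal sketch of the output step, using a small hand-built dictionary. The helper name format_entry is hypothetical; it builds the text for one line of output so that it can be printed or inspected.

```python
Concordance = {"two": [2, 2], "one": [1], "four": [4, 4, 4, 4]}


def format_entry(word, Concordance):
    """Return one output line: the word followed by its line numbers."""
    numbers = " ".join(str(n) for n in Concordance[word])
    return word + " " + numbers


words = list(Concordance.keys())   # convert the key view to a list
words.sort()                       # alphabetical order
for word in words:
    print(format_entry(word, Concordance))
```

The line numbers are converted with str because join only works on strings.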

The design of your program is up to you, but you should certainly divide the work to be done among several functions. One way to do this would be to use the following functions:

RemovePunctuation(s): This function returns a new string that has the letters of s translated to lowercase, with all of the punctuation removed.
AddWord(word, lineNumber, Concordance): This handles the work of recording that a given word was found on the given lineNumber in the dictionary Concordance.
PrintEntry(word, Concordance): This handles one word of the output. Concordance is the dictionary storing the concordance, so Concordance[word] is the list of line numbers on which word occurs.
main( ): This gets the file name, opens the file, and has a loop to read the file one line at a time, then splits the line into words. RemovePunctuation prepares the word for adding to the concordance; AddWord actually does the addition. Finally, a loop over the keys of the concordance calls PrintEntry on each word to handle the output.

Testing your work

Here are several files that will help you test out your program:

Prufrock.txt (T.S. Eliot's The Love Song of J. Alfred Prufrock)
Jabberwocky.txt (the Lewis Carroll poem)
Test.txt (a file for testing your line numbering)

File Test.txt should give you the following output:

    five 5 5 5 5 5
    four 4 4 4 4
    one 1
    two 2 2
    I found 5 lines containing 4 unique words.

If you get a different output, there is a problem either with your line numbering or with the way you are stripping punctuation. The other files are mainly useful for checking punctuation; there are many different punctuation characters used in these files and you should remove all of them. Look carefully at your output. If you see what appears to be a blank word followed by line numbers, it probably arises as follows. The split() method separates a string into words by using white space as the delimiter, so some "words" might be just sequences of punctuation characters, such as "!!!". When you strip off the punctuation you are left with the empty string. Before you add a word and its line number to the concordance, you should check if the word is the empty string; if it is, just don't add it.
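The empty-string check can be seen in a short sketch. It again assumes string.punctuation as the set of characters to strip, and the second sample line is invented for illustration.

```python
import string

lines = ["!!!! --\n", "Hello, -- world!\n"]   # first line is from Test.txt

for line in lines:
    kept = []
    for raw in line.split():
        word = raw.strip(string.punctuation).lower()
        if word != "":            # skip "words" that were only punctuation
            kept.append(word)
    print(kept)
```

The first line produces an empty list and the second produces ['hello', 'world'], so nothing blank would ever reach the concordance.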

If you want to play with your concordance, here are a few additional files you might work with:

Beowulf.txt (translated to modern English by Hall)
DavidCopperfield.txt (all 626 pages of the Dickens novel)
Inferno.txt (the first third of Dante's Divine Comedy, translated by Norton)
KingLear.txt (the Shakespeare play)
Republic.txt (Plato's "Republic")

Thanks to Project Gutenberg, there are many, many more text files available on the Web.

Wrap Up - 2 pts

As with every lab, your last job prior to submission is to complete a brief write-up by filling out a Google Form.

Handin

You now just need to hand in all your files electronically. As a reminder:

 
        cd             # changes to your home directory
        cd cs150       # goes to your cs150 folder
        handin         # starts the handin program
                        # class is 150
                        # assignment is 6
                        # file/directory is lab06
        lshand         # should show that you've handed in something

You can also specify the options to handin from the command line:

 
        cd ~/cs150     # goes to your cs150 folder
        handin -c 150 -a 6 lab06

File Checklist

You should have submitted the following files:

      concordance.py