9:59pm, Friday May 10th
You may work with a partner on this assignment.
In class, we have been discussing how graph structures can be used to represent relationships between groups of objects. For this assignment, you will be writing a program that allows you to play the "Kevin Bacon Game". A person's "Bacon Number" is computed based on the number of movies of separation between that person and the actor Kevin Bacon. For example, if you are Kevin Bacon, then your Bacon Number is 0. If you were in a movie with Kevin Bacon, your number would be 1. If you weren't in a movie with Kevin Bacon, but were in a movie with someone who was, your Bacon Number would be 2. In short, your Bacon Number is one greater than the smallest Bacon Number of any of your co-stars.
Note that this is a take-off on Erdos numbers, and the two can be combined to form the more elusive Erdos-Bacon number.
For fun and some additional background, you can try out the Oracle of Bacon at the University of Virginia.
You will be writing a class called BaconNumber that will read a data
file and allow you to interactively query the system for the Bacon Number
and path for any actor in the database. The program should require a single
argument which is the filename containing the information on people and the
roles they played in a movie. An optional second argument can be used to
specify the initial center. After reading in the data, the program
should then prompt the user for commands until an end-of-file (CTRL-D) is
reached (hasNextLine() will return false).
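The prompt loop described above can be sketched roughly as follows. This is only an illustration, not a required design: the class name CommandLoop and the helper splitCommand are made up for this example, and the body simply echoes what it parsed where a real program would dispatch to the commands described later in this handout.

```java
import java.io.PrintStream;
import java.util.Scanner;

class CommandLoop {
    // Splits "find James Earl Jones" into {"find", "James Earl Jones"}.
    // A command with no argument (e.g. "avgdist") gets an empty argument.
    static String[] splitCommand(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            return new String[] { trimmed, "" };
        }
        return new String[] { trimmed.substring(0, space),
                              trimmed.substring(space + 1).trim() };
    }

    // Prompts until hasNextLine() returns false (end-of-file / CTRL-D).
    // Returns the number of non-blank command lines that were read.
    static int run(Scanner in, PrintStream out) {
        int handled = 0;
        out.print("Enter a command: ");
        while (in.hasNextLine()) {
            String[] cmd = splitCommand(in.nextLine());
            if (!cmd[0].isEmpty()) {
                // Placeholder: a real program would dispatch on cmd[0] here.
                out.println("command=" + cmd[0] + " argument=" + cmd[1]);
                handled++;
            }
            out.print("Enter a command: ");
        }
        out.println();
        return handled;
    }

    public static void main(String[] args) {
        run(new Scanner(System.in), System.out);
    }
}
```

Taking the Scanner and PrintStream as parameters makes it easy to feed the loop a fixed string of commands while testing.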
Similar to what you did in past labs, if the filename argument begins with "http:" you should treat it as a URL and read the file from the network. This will enable you to play the game without having to download the entire file. To open a Scanner from a URL, you just need to do something similar to the following:
Scanner s = new Scanner( new URL("http://www.cs.oberlin.edu/").openStream() );
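One possible way to wrap that choice in a helper is sketched below. The class name DataSource and its method names are made up for this example; only the "http:" prefix test and the Scanner/URL calls come from the description above.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

class DataSource {
    // True when the argument should be read from the network.
    static boolean isUrl(String source) {
        return source.startsWith("http:");
    }

    // Opens a Scanner on either a URL or a local file, depending on the prefix.
    static Scanner open(String source) throws IOException {
        if (isUrl(source)) {
            return new Scanner(new URL(source).openStream());
        }
        return new Scanner(new File(source));
    }
}
```

The rest of the program can then read lines from the returned Scanner without caring where the data came from.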
% java -Xmx2g BaconNumber imdb.full.txt  # plays the game with the full data set centered at "Kevin Bacon (I)"
% java -Xmx2g BaconNumber imdb.pre1950.txt "Bela Lugosi"  # plays the game with the center set to "Bela Lugosi"
% java -Xmx2g BaconNumber http://www.cs.oberlin.edu/~gr151/imdb/imdb.no-tv-v.txt  # plays the game with the no TV/V data set centered at "Kevin Bacon (I)"
The movie data file contains information on what movies a performer appears in. Every line contains information on one person appearing in one movie. The lines are formatted as follows:
<performer name>|<movie title>
The vertical pipe character '|' can be used to determine where the name
ends and the title begins. There will only be one '|' on a line and there
are no empty names or titles. java.lang.String has a number of methods that can
be used to divide up the line (e.g., split("\\|")).
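As a small illustration of the split approach, the sketch below parses one line into its two fields. The class name CreditLine and its field names are hypothetical; the format and the split("\\|") call are the ones described above.

```java
class CreditLine {
    final String performer;
    final String title;

    CreditLine(String performer, String title) {
        this.performer = performer;
        this.title = title;
    }

    // Parses one "<performer name>|<movie title>" line.
    // The pipe must be escaped because split() takes a regular expression.
    static CreditLine parse(String line) {
        String[] parts = line.split("\\|");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected exactly one '|': " + line);
        }
        return new CreditLine(parts[0], parts[1]);
    }
}
```

For example, CreditLine.parse("Kevin Bacon (I)|Animal House (1978)") yields the performer "Kevin Bacon (I)" and the title "Animal House (1978)".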
I have supplied several data files of varying sizes for you to work with. (Don't download them to your CS account, see below.)
Rather than cluttering up your account with these files, you can either use the links above as URLs or, once you have your lab folder created, run 151lab11setup from a lab machine to get symbolic links to the files in the current directory. Don't submit the imdb files when you hand in the assignment.
For anything other than the small database, you'll almost certainly need to increase the amount of memory allowed via the -Xmx argument.
Your program should read in the specified file and, in the default case, choose "Kevin Bacon (I)" as the initial center. There are a number of commands you are to support in order to query the database and change the center.
find <name>
Find the shortest path from the current center to <name>. The output should be of the format
<name1> -> <movie1> -> <name2> -> <movie2> -> ... -> Kevin Bacon (I) (n)
where <name1> is the person specified by the user and the movies and actors in between show the path from that actor to the current center. The '(n)' should indicate the Bacon Number. E.g., "find James Earl Jones" in the "full" database yields
James Earl Jones -> Magic 7, The (2008) (TV) -> Kevin Bacon (I) (1)
and in the "no-tv-v" set:
James Earl Jones -> Blood Tide (1982) -> Mary Louise Weller -> Animal House (1978) -> Kevin Bacon (I) (2)
Note that your links may differ, but the path length should be the same.
If someone is disconnected from the center simply print
<name> is unreachable
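One way to produce this output is a breadth-first search from the center that remembers which vertex each vertex was discovered from. The sketch below assumes a single adjacency map in which both actors and movies are vertices (so it also assumes no actor name is identical to a movie title). The names PathFinder, link, bfsParents, and describe are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PathFinder {
    // Adds an undirected actor <-> movie edge.
    static void link(Map<String, List<String>> graph, String actor, String movie) {
        graph.computeIfAbsent(actor, k -> new ArrayList<>()).add(movie);
        graph.computeIfAbsent(movie, k -> new ArrayList<>()).add(actor);
    }

    // BFS from the center; maps every reachable vertex to the vertex it was
    // discovered from (the center itself maps to null).
    static Map<String, String> bfsParents(Map<String, List<String>> graph, String center) {
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parent.put(center, null);
        queue.add(center);
        while (!queue.isEmpty()) {
            String v = queue.remove();
            for (String w : graph.getOrDefault(v, Collections.emptyList())) {
                if (!parent.containsKey(w)) {
                    parent.put(w, v);
                    queue.add(w);
                }
            }
        }
        return parent;
    }

    // Walks the parent links from name back to the center and formats the path.
    static String describe(Map<String, String> parent, String name) {
        if (!parent.containsKey(name)) {
            return name + " is unreachable";
        }
        StringBuilder sb = new StringBuilder(name);
        int hops = 0;
        for (String v = parent.get(name); v != null; v = parent.get(v)) {
            sb.append(" -> ").append(v);
            hops++;
        }
        // Each co-star step crosses two edges (actor -> movie -> actor).
        return sb + " (" + (hops / 2) + ")";
    }
}
```

Because the search starts at the center, one BFS can answer every later find until the center changes.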
recenter <name>
Change the center to the given name if it exists in the database. If the name is not found, print an appropriate message and do not change the center.
avgdist
Calculate the average Bacon Number for the given center among all connected nodes. Your output should be the following
<avg><tab><name><space>(<number reachable>,<number unreachable>)
The average should only be for the nodes reachable from the center. In the top250 database, I get the following
3.5942556977039737 Kevin Bacon (I) (11803,663)
and in the "no-tv-v" set I get
3.003019238870945 Kevin Bacon (I) (2221752,164815)
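Judging by the table output shown later for the top250 set, the reachable count includes the center itself: the distance counts there sum to 11803, and the weighted sum 42423 divided by 11803 gives about 3.594. A minimal sketch of the formatting under that reading follows. The class name AverageDistance and its parameters are hypothetical, and it assumes you already have a map from each reachable actor (center included) to its Bacon Number.

```java
import java.util.Map;

class AverageDistance {
    // baconNumbers: every reachable actor (including the center, at 0) -> Bacon Number.
    // totalActors: number of actors in the whole database.
    static String format(String center, Map<String, Integer> baconNumbers, int totalActors) {
        long sum = 0;
        for (int d : baconNumbers.values()) {
            sum += d;
        }
        int reachable = baconNumbers.size();
        double avg = (double) sum / reachable;
        // <avg><tab><name><space>(<number reachable>,<number unreachable>)
        return avg + "\t" + center + " (" + reachable + "," + (totalActors - reachable) + ")";
    }
}
```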
topcenter <n>
For each actor in the current connected component (i.e., the one containing the current center), calculate the average Bacon distance to all actors in that component. Then print a table of the n best centers (i.e., the ones whose average Bacon distance is the smallest). NOTE: this can take a very long time on larger data sets.
In the top 250 set, my program finds "Robert Duvall (11803,663)" is the best center (~2.699) and the worst center is "Kumeko Otowa (11803,663)" (~6.378).
Here's the output from my running topcenter 5 on the top250 dataset:
Enter a command: topcenter 5
2.6989748369058715 robert duvall
2.7369312886554265 harrison ford (i)
2.741930017792087 robert de niro
2.776666949080742 john ratzenberger
2.798017453189867 alec guinness
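The topcenter computation can be sketched as one BFS per actor in the component. To keep the example short it assumes a simplified co-star map (actor to the set of actors sharing a movie), which is not the representation suggested elsewhere in this handout and would be very large for the bigger data sets. The names TopCenter, distances, average, and best are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TopCenter {
    // BFS over the co-star map; returns the distance to every reachable actor.
    static Map<String, Integer> distances(Map<String, Set<String>> coStars, String start) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String actor = queue.remove();
            for (String other : coStars.getOrDefault(actor, Collections.emptySet())) {
                if (!dist.containsKey(other)) {
                    dist.put(other, dist.get(actor) + 1);
                    queue.add(other);
                }
            }
        }
        return dist;
    }

    static double average(Collection<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    // Returns the n actors in the center's component with the smallest average
    // distance, breaking ties alphabetically so the result is deterministic.
    static List<String> best(Map<String, Set<String>> coStars, String center, int n) {
        List<String> candidates = new ArrayList<>(distances(coStars, center).keySet());
        Map<String, Double> avg = new HashMap<>();
        for (String actor : candidates) {
            avg.put(actor, average(distances(coStars, actor).values()));
        }
        candidates.sort((a, b) -> {
            int byAverage = Double.compare(avg.get(a), avg.get(b));
            return byAverage != 0 ? byAverage : a.compareTo(b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

With V actors and E co-star links in the component this does V searches of O(V + E) work each, which is why the command is slow on the larger files.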
table
Print a table of the counts of Bacon Numbers for the given center from 0 up to the longest.
In the top250 database I get:
Table of distances for Kevin Bacon (I)
Number 0: 1
Number 1: 87
Number 2: 539
Number 3: 4462
Number 4: 5786
Number 5: 840
Number 6: 88
Unreachable: 663
In the no-tv-v database I get:
Table of distances for Kevin Bacon (I)
Number 0: 1
Number 1: 3344
Number 2: 408925
Number 3: 1425751
Number 4: 349704
Number 5: 30061
Number 6: 3482
Number 7: 380
Number 8: 92
Number 9: 12
Unreachable: 164815
and for the full database I get:
Table of distances for Kevin Bacon (I)
Number 0: 1
Number 1: 5920
Number 2: 646684
Number 3: 1653925
Number 4: 289613
Number 5: 24138
Number 6: 2738
Number 7: 361
Number 8: 64
Number 9: 6
Unreachable: 176859
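Building such a table amounts to counting how many actors have each Bacon Number. A minimal sketch is shown below, assuming the Bacon Numbers of all reachable actors (center included) are already available. The class name DistanceTable and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class DistanceTable {
    // baconNumbers: one entry per reachable actor, including 0 for the center.
    // totalActors: number of actors in the whole database.
    static List<String> build(String center, Collection<Integer> baconNumbers, int totalActors) {
        int longest = 0;
        for (int d : baconNumbers) {
            longest = Math.max(longest, d);
        }
        int[] counts = new int[longest + 1];
        for (int d : baconNumbers) {
            counts[d]++;
        }
        List<String> lines = new ArrayList<>();
        lines.add("Table of distances for " + center);
        for (int d = 0; d <= longest; d++) {
            lines.add("Number " + d + ": " + counts[d]);
        }
        lines.add("Unreachable: " + (totalActors - baconNumbers.size()));
        return lines;
    }
}
```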
You may opt to include additional commands for consideration towards extra credit. Document any additional commands you implement in the README file, and be sure to explain what each one does and how someone could use it.
Here are some suggestions
The longest Bacon Number I found in the 'imdb.no-tv-v.txt' dataset for Kevin Bacon was 9 ("Andrea Parlato" and others). "Kevin Bacon (I)" has an average distance value of ~2.994 while "Sean Connery" has ~2.955 indicating that he is a better center than Kevin Bacon. The Oracle of Bacon has a top 1000 list of centers which could be used to search for better values.
As we have been discussing graphs, it should be no surprise that a good way to represent these acting relationships is through a graph. There are a number of ways in which this can be done. If you want to maintain a simple graph, however, you might make both movies and actors vertices, with each edge simply representing the relationship between an actor and a movie that actor appeared in.
While an undirected graph could be used, the resulting path length will be double the Bacon Number, because every step between co-stars passes through a movie vertex. You would need to divide the path length by 2 or use weights of 0.5 for the edges. Another technique would be to create a directed graph and weight the edges from actors to movies as 0 and the edges from movies to actors as 1. Then, using Dijkstra's algorithm, you can find the shortest path, with all actors and actresses listed for a movie considered equally.
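As one illustration of the idea, the sketch below keeps actors and movies in two separate maps and uses a plain breadth-first search rather than Dijkstra's algorithm: stepping from an actor through a movie to its cast counts as one hop, which has the same effect as the 0/1 edge weights. This is only one possible design, and the names BaconGraph, addCredit, and baconNumbers are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BaconGraph {
    final Map<String, List<String>> moviesOf = new HashMap<>(); // actor -> movies
    final Map<String, List<String>> castOf = new HashMap<>();   // movie -> actors

    // Records one "<performer name>|<movie title>" line.
    void addCredit(String performer, String title) {
        moviesOf.computeIfAbsent(performer, k -> new ArrayList<>()).add(title);
        castOf.computeIfAbsent(title, k -> new ArrayList<>()).add(performer);
    }

    // Returns the Bacon Number of every actor reachable from the center.
    Map<String, Integer> baconNumbers(String center) {
        Map<String, Integer> dist = new HashMap<>();
        if (!moviesOf.containsKey(center)) {
            return dist; // unknown center: nothing is reachable
        }
        Set<String> seenMovies = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(center, 0);
        queue.add(center);
        while (!queue.isEmpty()) {
            String actor = queue.remove();
            for (String movie : moviesOf.get(actor)) {
                if (!seenMovies.add(movie)) {
                    continue; // this movie's cast was already expanded
                }
                for (String coStar : castOf.get(movie)) {
                    if (!dist.containsKey(coStar)) {
                        dist.put(coStar, dist.get(actor) + 1);
                        queue.add(coStar);
                    }
                }
            }
        }
        return dist;
    }
}
```

Keeping the two kinds of vertices in separate maps also avoids any clash between an actor name and an identical movie title.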
Remember that it is best to build and test your program incrementally. Construct your Graph class and be sure to include test cases in the main method.
If you decide to either use or model part of your implementation off of what is in the book, be sure to give proper credit in the methods or comments at the start of the file.
You can improve your results by appending a "(I)" to a name and retrying the operation if it isn't found in the database before giving up. (IMDB has been adding that to the end of a number of entries.)
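A minimal sketch of that retry is shown below, assuming the set of names in the database is available as a Set<String>. The class name NameLookup and the method resolve are hypothetical.

```java
import java.util.Set;

class NameLookup {
    // Returns the name as stored in the database, trying "<name> (I)" when the
    // plain name is missing. Returns null when neither form is present.
    static String resolve(Set<String> knownNames, String name) {
        if (knownNames.contains(name)) {
            return name;
        }
        String withSuffix = name + " (I)";
        return knownNames.contains(withSuffix) ? withSuffix : null;
    }
}
```

With this in place, "find Kevin Bacon" and "find Kevin Bacon (I)" would both resolve to the same entry.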
Use handin to submit the following files:
If you adhered to the honor code in this assignment, add the following statement to your README file:
I have adhered to the Honor Code in this assignment.
If you work with a partner, just submit one solution per team.
Information courtesy of The Internet Movie Database (http://www.imdb.com/). Used with permission. The data should only be used for personal and non-commercial purposes.