Summer School Data Science

Program Overview

For your final program, we will be creating a program that analyzes data. This will give you practice using Scanner objects both to read data from a file and to parse the data within it, calculating statistics from data, and seeing how computer programs can help answer research questions in data science.

We will be using a data set of anonymized Teaching Assistant evaluation scores collected by the Statistics Department of the University of Wisconsin-Madison. Our research question for this dataset is:

“Do classes offered during the summer have higher or lower evaluation scores than those offered during the regular year?”

This data is in a file called tae.prn that is already in your GitHub repository. You can think of this file as being a spreadsheet. The file is organized into five columns that represent the following:

  1. Course instructor (categorical, 25 categories);
  2. Course (categorical, 26 categories);
  3. Summer or regular semester (binary) 1=Summer, 2=Regular;
  4. Class size (numerical); and
  5. Course Evaluation Scores (categorical) 1=Low, 2=Medium, 3=High

and each row represents a different teaching evaluation, recorded as 5 numbers (one per column) separated by spaces.

For our purpose, we will be comparing the Course Evaluation Scores between Summer and Regular semester courses. Thus, we will not need to look at every number in the data set; instead, we will be looking only at the values in columns 3 and 5 (the others exist to answer other research questions one might pose).

Your program should read in this data, calculate the average course evaluation for Summer classes, calculate the average course evaluation for Regular semester courses, and print out the two averages. Ultimately, the output of your program should look something like:

The average evaluation of a summer class is: x.xxxxxxx
The average evaluation of a regular class is: x.xxxxxxx

where the xs will be replaced with the actual number you calculate.

To test your program, we have provided a second, smaller data set in the same format called sample.prn. If you run your program with this data set, your output should be:

The average evaluation of a summer class is: 1.5714285
The average evaluation of a regular class is: 2.2173913

Program Instructions

Inside your GitHub repository, you will find a file called CourseEvals.java. Open this file in Visual Studio Code either by using the “File” > “Open” dialog box or by double-clicking the CourseEvals.java file in the “Explorer” area on the left side of your IDE. Once again, the skeleton of the program is already provided for you in this file. All you need to do is implement the program in the public static void main(String[] args) method. Your program will:

  1. Take in the name of a file from the command line arguments. As in Part 1 of the lab, print an error message and gracefully exit the program if this input is missing.
  2. Loop through every line of the file and read in the necessary data.
  3. Calculate the average course evaluation scores for both summer and regular courses.
  4. Print the average evaluations for summer and regular courses.

Reading Data From File

You will need to use one Scanner to read the file line by line, and another Scanner to read the numbers within each line. For more details, please refer to Warmup Part 7 and Part 8.
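The two-Scanner pattern described above can be sketched roughly as follows. This is only an illustration, not the required solution: the class name TwoScannerSketch, the helper semesterAndScore, and the two rows of data are all made up for this example, and the outer Scanner here wraps a String so that the sketch runs on its own. In your program the outer Scanner would instead be built from the data file.

```java
import java.util.Scanner;

class TwoScannerSketch {
    // Hypothetical helper: returns {column 3, column 5} from one row of five integers.
    static int[] semesterAndScore(String line) {
        Scanner lineScanner = new Scanner(line); // inner Scanner: one number at a time
        lineScanner.nextInt();                   // column 1: instructor (ignored)
        lineScanner.nextInt();                   // column 2: course (ignored)
        int semester = lineScanner.nextInt();    // column 3: 1=Summer, 2=Regular
        lineScanner.nextInt();                   // column 4: class size (ignored)
        int score = lineScanner.nextInt();       // column 5: evaluation score
        lineScanner.close();
        return new int[] { semester, score };
    }

    public static void main(String[] args) {
        // Made-up rows standing in for the contents of a data file.
        String fakeFile = "23 3 1 19 3\n15 3 2 17 2\n";

        Scanner fileScanner = new Scanner(fakeFile); // outer Scanner: one line at a time
        while (fileScanner.hasNextLine()) {
            String line = fileScanner.nextLine();
            if (line.trim().isEmpty()) {
                continue; // skip blank lines, if any
            }
            int[] pair = semesterAndScore(line);
            System.out.println("semester=" + pair[0] + " score=" + pair[1]);
        }
        fileScanner.close();
    }
}
```

Creating a fresh inner Scanner for each line keeps the parsing of one row from spilling over into the next.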

To calculate the average scores, keep track of two running totals: one to which you add every course evaluation score for a summer course, and one to which you add every course evaluation score for a regular course. You should also keep track of the count of summer courses and the count of regular courses. Then, after you have gone through the whole file, you can calculate each average by dividing the evaluation total by the corresponding count of courses.
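As a rough sketch of the running-total idea, the example below works on a small in-memory array rather than a file. The class name RunningAverageSketch, the helper averages, and the five rows are made up for illustration; each row holds only a semester code and a score.

```java
class RunningAverageSketch {
    // Hypothetical helper: each row is {semester, score}; returns {summer average, regular average}.
    static double[] averages(int[][] rows) {
        int summerTotal = 0, summerCount = 0;
        int regularTotal = 0, regularCount = 0;

        for (int[] row : rows) {
            int semester = row[0];
            int score = row[1];
            if (semester == 1) {          // 1 = Summer
                summerTotal += score;
                summerCount++;
            } else if (semester == 2) {   // 2 = Regular
                regularTotal += score;
                regularCount++;
            }
        }

        // Multiplying by 1.0 first avoids integer division.
        // If a count is zero, the result is NaN rather than a crash.
        double summerAverage = 1.0 * summerTotal / summerCount;
        double regularAverage = 1.0 * regularTotal / regularCount;
        return new double[] { summerAverage, regularAverage };
    }

    public static void main(String[] args) {
        // Made-up data: two summer rows and three regular rows.
        int[][] rows = { {1, 3}, {2, 2}, {1, 2}, {2, 3}, {2, 1} };
        double[] result = averages(rows);
        System.out.println("Summer average: " + result[0]);   // prints 2.5
        System.out.println("Regular average: " + result[1]);  // prints 2.0
    }
}
```

The averages are computed only once, after the loop has finished, because the totals and counts are not complete until every row has been seen.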

Skipping Unneeded Data

If you do not care about the number in a particular column, but you need to read a number from a column after it, you can call scanner.nextInt() to read in the number without saving it anywhere, effectively ignoring it. For example, to skip the first two numbers in a line but save the third one, the code would look like this:

// Read and discard the first two numbers on the line.
for (int i = 0; i < 2; i++) {
    scanner.nextInt();
}
// Keep the third number.
int third = scanner.nextInt();

Float Division

Note that, unlike Python, Java does not have separate operators for integer and floating-point division. Instead, the / operator performs integer division (discarding the remainder) when both operands are integers, and floating-point division when either operand is a floating-point number. Because both your total and your count are integers, first multiply one of them by 1.0 to convert it to a floating-point value, and then perform the division.
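The difference is easy to see in a small experiment. The class name DivisionSketch, its two helpers, and the values 7 and 2 are made up for this illustration.

```java
class DivisionSketch {
    // Both operands are ints, so Java discards the remainder.
    static int integerQuotient(int total, int count) {
        return total / count;
    }

    // Multiplying by 1.0 first makes the left operand floating-point,
    // so the division keeps the fractional part.
    static double realQuotient(int total, int count) {
        return 1.0 * total / count;
    }

    public static void main(String[] args) {
        int total = 7;
        int count = 2;
        System.out.println(integerQuotient(total, count)); // prints 3
        System.out.println(realQuotient(total, count));    // prints 3.5
        // Pitfall: here the integer division happens first, so this prints 3.0.
        System.out.println(total / count * 1.0);
    }
}
```

Order matters: the multiplication by 1.0 has to happen before the division, otherwise the fractional part is already gone by the time the value becomes floating-point.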

Compiling and Running Your Program

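In your terminal window, type the following, giving the name of the data file (sample.prn for testing, tae.prn for the full data set) as the command-line argument:

javac CourseEvals.java
java CourseEvals sample.prn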
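Whatever you type after the class name on the java command line arrives in the args array of main. Step 1 of the instructions asks for a graceful exit when that argument is missing; one possible shape for the check is sketched below. The class name ArgCheckSketch, the helper fileNameFrom, and the wording of the messages are assumptions for illustration, so follow whatever Part 1 of the lab asked for.

```java
class ArgCheckSketch {
    // Hypothetical helper: returns the file name from the arguments,
    // or null when no argument was given.
    static String fileNameFrom(String[] args) {
        if (args.length < 1) {
            return null;
        }
        return args[0];
    }

    public static void main(String[] args) {
        String fileName = fileNameFrom(args);
        if (fileName == null) {
            // Assumed message wording; any clear explanation works.
            System.out.println("Error: please give the name of the data file.");
            return; // leave main normally instead of crashing
        }
        System.out.println("Would read data from: " + fileName);
    }
}
```

Checking args.length before touching args[0] is what prevents an ArrayIndexOutOfBoundsException when the program is run with no arguments.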

Committing Your Program to GitHub

Once you’ve finished your program, don’t forget to add and commit your changes to git, then push them to your GitHub repository. Instructions for how to do so can be found earlier in the warmup.