Penguin Data Science

penguin.py: 15 points


One of the benefits of using computers to solve problems is that they can process data very quickly to help us discover important facts about the real world. In this part of the lab, we will use Python to perform some data science on observations that biologists recorded about three species of penguins on different islands around Antarctica. In other words, we will use data science as an interdisciplinary approach to answer questions related to biodiversity.

Gentoo Penguin (source Andrew Shiva at Wikipedia, CC-BY-SA 4.0).

Special thanks and credit to Professor Allison Horst at the University of California Santa Barbara for making this data set public: Twitter post and thread with more information and GitHub repository.

Data About Penguins

You have been provided with a read_data() function in penguin.py that reads in all of the data from the penguins_data.csv file. This file contains data for about 342 real-life penguins. Calling read_data() returns a list of all of the penguins we will be working with.

penguins = read_data()

Each penguin is a list containing the six values described in the table below.

List Index    Information       Type
0             species           str
1             home island       str
2             bill length       float
3             bill depth        float
4             flipper length    float
5             body mass         float

For example, one penguin might be represented as the following list:

penguin = ["Adelie", "Torgersen", 39.1, 18.7, 181.0, 3750.0]

Since each penguin is itself a list of values, a list of multiple penguins is a two-dimensional list, such as the following list that contains 5 penguins:

five_penguin_list = [["Adelie", "Torgersen", 39.1, 18.7, 181.0, 3750.0],
                     ["Adelie", "Biscoe", 37.8, 18.3, 174.0, 3400.0],
                     ["Gentoo", "Biscoe", 46.1, 13.2, 211.0, 4500.0],
                     ["Gentoo", "Biscoe", 50.0, 16.3, 230.0, 5700.0],
                     ["Chinstrap", "Dream", 46.5, 17.9, 192.0, 3500.0]]

The penguins list returned by the read_data() function has the same structure as the five_penguin_list above, except that it contains 342 penguins instead of only 5.
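To make the two-dimensional structure concrete, here is a small sketch of how indexing works on a list of penguins. The variable name sample is only illustrative; it holds two penguins copied from the example list above.

```python
# Two penguins copied from the example list above.
sample = [["Adelie", "Torgersen", 39.1, 18.7, 181.0, 3750.0],
          ["Gentoo", "Biscoe", 46.1, 13.2, 211.0, 4500.0]]

first = sample[0]          # the whole first penguin (a list of six values)
bill_length = first[2]     # index 2 holds the bill length
species = sample[1][0]     # first index picks the penguin, second picks the value
body_mass = sample[1][5]   # index 5 holds the body mass

print(len(sample))         # how many penguins are in the list
print(bill_length)
print(species, body_mass)
```

The first index selects a penguin and the second index selects one of its six values, following the table above.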

ReadMe

We will read in all the penguin observations from a file, so you do not need to make any assignments like the above (they merely illustrate what the data looks like).

Program Goal

Your goal in this program is to use that list of penguins to discover possible differences between the three species of penguins (Adelie, Chinstrap, and Gentoo) based on their data. In particular, your program should do the following:

  1. Ask the user for a species name (either “Adelie”, “Chinstrap”, or “Gentoo”).
  2. Ask the user which type of measurement they want to see (either bill length or body mass; you do not have to handle bill depth or flipper length).
  3. Based on the species chosen by the user in Step 1, create a new list containing only the penguins that belong to this species.
  4. Calculate and then print the average, minimum, and maximum of the measurements selected by the user in Step 2 (bill length or body mass) for the list of penguins created in Step 3.

During Steps 1 and 2, you should make sure the user enters a valid option. If they do not, print a message telling them what mistake they made, and then exit the program. Tip: you might want to provide the user with a menu, as we did with the image filters in Lab 4.
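One possible way to organize the validity checks is sketched below. The names VALID_SPECIES, is_valid_species(), and measurement_index() are hypothetical helpers invented for this illustration; they are not part of the provided penguin.py, and the exact menu text you accept from the user is up to you.

```python
# Hypothetical helpers for illustration only; not part of penguin.py.
VALID_SPECIES = ["Adelie", "Chinstrap", "Gentoo"]


def is_valid_species(name):
    """Return True if name exactly matches one of the three species."""
    return name in VALID_SPECIES


def measurement_index(choice):
    """Return the list index for a measurement choice, or None if invalid.

    Assumes the user types the measurement name in lowercase.
    """
    if choice == "bill length":
        return 2
    elif choice == "body mass":
        return 5
    return None


print(is_valid_species("Gentoo"))       # True
print(is_valid_species("Emperor"))      # False
print(measurement_index("body mass"))   # 5
print(measurement_index("wingspan"))    # None
```

In the full program, the value passed to these helpers would come from input(). When a check fails, the program would print its error message and stop, for example by returning early from the main function.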

Reminder

As you make progress on your program, don’t forget to commit and push your changes regularly!

Useful Functions

To complete this assignment, the following functions will help us.

find_species(penguins, species):

The find_species() function performs Step 3 above: it takes in a list of all of the penguins in the data and returns a smaller list that contains only the penguins of a particular species. This can be done by following these steps:

  1. Create a new empty list called filtered.
  2. Loop over each penguin in the penguins list.
    1. Check if the current penguin’s species (in index 0, i.e., penguin[0]) is equal to species.
    2. If so, append the current penguin to the filtered list.
  3. Return the filtered list.

For example, say we have the same five penguins as we did above.

five_penguin_list = [["Adelie", "Torgersen", 39.1, 18.7, 181.0, 3750.0],
                     ["Adelie", "Biscoe", 37.8, 18.3, 174.0, 3400.0],
                     ["Gentoo", "Biscoe", 46.1, 13.2, 211.0, 4500.0],
                     ["Gentoo", "Biscoe", 50.0, 16.3, 230.0, 5700.0],
                     ["Chinstrap", "Dream", 46.5, 17.9, 192.0, 3500.0]]

Then, if you want to create a list of only the Adelie penguins from those five, you can call find_species(five_penguin_list, "Adelie"), which should return a list with the two Adelie penguins.

filtered = [["Adelie", "Torgersen", 39.1, 18.7, 181.0, 3750.0],
            ["Adelie", "Biscoe", 37.8, 18.3, 174.0, 3400.0]]
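The steps listed for find_species() can be sketched as follows. This is one straightforward reading of those steps rather than the only acceptable solution. The list sample is a shortened, illustrative data set built from penguins shown earlier.

```python
def find_species(penguins, species):
    """Return a new list holding only the penguins of the given species."""
    filtered = []
    for penguin in penguins:
        # Index 0 of each penguin holds its species name.
        if penguin[0] == species:
            filtered.append(penguin)
    return filtered


# Shortened sample data for illustration.
sample = [["Adelie", "Torgersen", 39.1, 18.7, 181.0, 3750.0],
          ["Gentoo", "Biscoe", 46.1, 13.2, 211.0, 4500.0],
          ["Gentoo", "Biscoe", 50.0, 16.3, 230.0, 5700.0],
          ["Chinstrap", "Dream", 46.5, 17.9, 192.0, 3500.0]]

gentoos = find_species(sample, "Gentoo")
print(len(gentoos))   # 2
```

Because the comparison uses ==, the species string must match exactly, including capitalization. The original penguins list is left unchanged.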

Reminder

Once you’ve implemented the find_species function, remember to commit and push your changes!

find_measurements(filtered, index):

In order to perform Step 4, we need to work with either the bill length or body mass of all of the penguins of a given species (returned as filtered from our find_species() function). To get those measurements, we will use the find_measurements() function.

The find_measurements() function is very similar to find_species(), except we are only saving a particular measurement from each penguin, instead of the entire penguin. This function should:

  1. Create a new empty list called measurements.
  2. Loop over each penguin in the filtered list.
    1. Grab the measurement from penguin[index] (index = 2 if the user chose bill length and index = 5 if they chose body mass).
    2. Save the measurement in the measurements list.
  3. Return the measurements list.

For example, say we have the same two Adelie penguins as we did above:

filtered = [["Adelie", "Torgersen", 39.1, 18.7, 181.0, 3750.0],
            ["Adelie", "Biscoe", 37.8, 18.3, 174.0, 3400.0]]

Then, if you want to create a list of all of their bill lengths, you can call find_measurements(filtered, 2), which should return the following list:

measurements = [39.1, 37.8]
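A minimal sketch of find_measurements(), following the steps above, might look like this. The list gentoos is illustrative sample data using two Gentoo penguins from the earlier example.

```python
def find_measurements(filtered, index):
    """Return a list holding the value at position index from each penguin."""
    measurements = []
    for penguin in filtered:
        measurements.append(penguin[index])
    return measurements


# Illustrative sample: two Gentoo penguins from the earlier example list.
gentoos = [["Gentoo", "Biscoe", 46.1, 13.2, 211.0, 4500.0],
           ["Gentoo", "Biscoe", 50.0, 16.3, 230.0, 5700.0]]

print(find_measurements(gentoos, 2))   # [46.1, 50.0]
print(find_measurements(gentoos, 5))   # [4500.0, 5700.0]
```

The only difference from find_species() is that every penguin contributes one value, so no if check is needed inside the loop.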

Reminder

Remember to commit and push your changes before moving on!

find_average(measurements):

For the find_average() function, we will want to add together all the numbers in the input measurements list, then divide that total by the count of numbers in the list and return the result.
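One way to write this with an explicit loop is sketched below. It assumes measurements is not empty, since dividing by a count of zero would raise an error.

```python
def find_average(measurements):
    """Return the mean of the numbers in measurements (assumed non-empty)."""
    total = 0.0
    for value in measurements:
        total = total + value
    return total / len(measurements)


print(find_average([3750.0, 3400.0]))   # 3575.0
```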

find_max(measurements) and find_min(measurements):

For the find_max() and find_min() functions, we will need to loop through the values in measurements. Within the loop, we will keep track of which value is currently the largest (for find_max()) or smallest (for find_min()). You should not use Python’s built-in max() or min() functions here. In addition, you should not use min or max as variable names, because doing so would hide the built-in functions of the same names.
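The "keep track of the current largest or smallest" idea can be sketched as follows. Starting from the first element is one common choice and assumes the list is not empty. The variable names largest and smallest are illustrative and avoid the built-in names.

```python
def find_max(measurements):
    """Return the largest value in measurements (assumed non-empty)."""
    largest = measurements[0]
    for value in measurements:
        if value > largest:
            largest = value
    return largest


def find_min(measurements):
    """Return the smallest value in measurements (assumed non-empty)."""
    smallest = measurements[0]
    for value in measurements:
        if value < smallest:
            smallest = value
    return smallest


bill_lengths = [39.1, 37.8, 46.1, 50.0, 46.5]
print(find_max(bill_lengths))   # 50.0
print(find_min(bill_lengths))   # 37.8
```

Starting from measurements[0] rather than from 0 keeps the functions correct even when every value is negative or very large.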

ReadMe

As a hint of how to loop over each penguin contained in a list of penguins (i.e., a list of lists), we can use the following code:

for penguin in penguins:
    # do something with penguin, which is a list of measurements

Reminder

Commit and push your changes as you complete each function!

Correct Answers

Species      Measurement    Min       Average      Max
Adelie       Bill Length    32.1      38.7914      46.0
Adelie       Body Mass      2850.0    3700.6623    4775.0
Chinstrap    Bill Length    40.9      48.8338      58.0
Chinstrap    Body Mass      2700.0    3733.0882    4800.0
Gentoo       Bill Length    40.9      47.5049      59.6
Gentoo       Body Mass      3950.0    5076.0163    6300.0