Machine Learning
and Data Mining

Announcements

I will post announcements here. Please check often to make sure you do not miss any important information.

  • 12/2: The Reflective Writing Assignment is assigned.
  • 11/14: Homework 3 is assigned.
  • 10/14: Homework 2 is assigned.
  • 9/21: Homework 1 is assigned.
  • 9/19: The group project is assigned.
  • 9/9: I have added a new section to the course webpage: "Course Topics and Readings". This section provides an overview of which topics were covered in the course on which days. For each topic, it also lists the corresponding chapter in the course textbook, any required readings (beyond the textbook), and any additional readings that might be helpful.
  • 9/7: I will be moving the Thursday morning office hour (formerly 9:30-10:30) to either 11:00-Noon or 3:00-4:00 on Thursday. Please see an upcoming email for more details.
  • 9/7: Lecture slides for the first topic (Introduction to Machine Learning and Data Mining) have been posted to Blackboard.
  • 8/29: Welcome to CSCI374! Today is the first day of class. We meet in King 306.

Course Topics and Readings

Topic 1A: Introduction to Machine Learning and Data Mining (8/29-9/2)

Course Textbook: Chapter 1

Additional Helpful Readings


Topic 1B: Supervised Learning [with k-Nearest Neighbor] (9/7-9/14)

Course Textbook: Chapter 8.1-8.2 (k-Nearest Neighbor)

Additional Required Readings

  • Kotsiantis, S.B. Supervised Machine Learning: A Review of Classification Techniques. Informatica, vol. 31, 2007, pp. 249-268. URL: http://www.informatica.si/index.php/informatica/article/view/148
    Description: a general overview paper of different supervised learning approaches with a good comparison of their strengths and weaknesses. Cited ~1500 times in the literature.
  • Domingos, Pedro. A Few Useful Things to Know about Machine Learning. Communications of the ACM, vol. 55, no. 10, October 2012, pp. 78-87. URL: http://dl.acm.org/citation.cfm?id=2347755
    Description: a useful overview of supervised learning that highlights important issues, tips, and challenges. Cited ~500 times in the literature.
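Since this topic introduces supervised learning through k-Nearest Neighbor, here is a minimal sketch of the algorithm on made-up 2-D data. The function and data names are illustrative only, not part of any assignment:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train is a list of ((x1, x2), label) pairs; sort by distance to the query
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # take a majority vote among the k closest labels
    top_labels = [label for _, label in nearest[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# toy training data: two well-separated clusters
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # a
print(knn_predict(train, (5.5, 5.5)))  # b
```

Note that real implementations also need to handle ties and to normalize attributes so that no single attribute dominates the distance.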

Additional Helpful Readings

    • ESL Textbook: Chapter 2.1-2.3
      Note: in this course, we use a more general definition of quantitative data to include all data that can be measured and represented by numbers (including discrete variables such as nominals and ordinals, for which we can simply assign the first category the value of "1", the second category the value of "2", etc.). Thus, we also use a broader definition of qualitative data to mean data that cannot be represented by numbers.
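The encoding described in the note above can be illustrated with a short snippet (the category names here are made up):

```python
# Hypothetical example: encoding nominal categories as integers,
# following the course's broader definition of quantitative data.
colors = ["red", "green", "blue", "green", "red"]

# Assign the first category seen the value 1, the second 2, etc.
codes = {}
encoded = []
for c in colors:
    if c not in codes:
        codes[c] = len(codes) + 1
    encoded.append(codes[c])

print(encoded)  # [1, 2, 3, 2, 1]
```

Keep in mind that this mapping imposes an ordering on the categories, which is meaningful for ordinals but arbitrary for nominals.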

    Topic 2: Evaluating Performance (9/16-9/19)

    Course Textbook: Chapter 5 (especially 5.2, 5.5, 5.6)
    Note: when the textbook uses the word "hypothesis", it is referring to the model learned by the machine learning algorithm from the training set that maps an instance's attributes to a label. Mathematically, a hypothesis is a function h(x1, x2, x3) -> y. So evaluating hypotheses means evaluating the performance of the model learned by the algorithm from the training set.
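To make the note above concrete: a hypothesis is just a learned function from attributes to a label, and evaluating it means measuring how often its predictions match the true labels on a test set. The toy model and data below are made up for illustration:

```python
# A toy "hypothesis" h(x1, x2, x3) -> y: predict 1 when the attribute sum is large.
# In practice, h would be produced by a learning algorithm from the training set.
def h(x1, x2, x3):
    return 1 if x1 + x2 + x3 > 5 else 0

# A made-up test set of (attributes, true label) pairs
test_set = [((1, 2, 3), 1), ((0, 1, 1), 0), ((4, 4, 4), 1)]

# Evaluating the hypothesis = comparing its predictions to the true labels
correct = sum(1 for x, y in test_set if h(*x) == y)
accuracy = correct / len(test_set)
print(accuracy)  # 1.0
```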

    Additional Helpful Readings

    • ISL Textbook: Chapter 2.2
    • ESL Textbook: Chapter 7
      Note: this chapter is very technical, so think of it more as a reference to refer back to if you ever need to really analyze the mathematical modeling of errors in prediction. Those interested are more than welcome to read it.

    Topic 3: Decision Trees (9/21-9/28)

    Course Textbook: Chapter 3
    Note: since we did not cover Chapter 2, you are not responsible for understanding the comparison between ID3 and Candidate-Elimination (another supervised learning algorithm) in Section 3.6, although the Section is still worth reading as it provides some interesting information about ID3 that can be understood without knowing about Candidate-Elimination.

    Additional Helpful Readings

    • ISL Textbook: Chapter 8.1 (CART)
    • ESL Textbook: Chapter 9.2 (CART)
      Note: this section has more details about CART than the ISL Textbook, but it is also more technical.
    • Therneau, Terry M. & Atkinson, Elizabeth M. An Introduction to Recursive Partitioning Using the RPART Routines. Mayo Foundation. June 29, 2015. URL: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
      This manual for the most popular CART implementation in R provides many useful details about Entropy, the Gini index, surrogate variables, etc., as well as details about how to use the R implementation of CART and interpret its results.
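Since Entropy and the Gini index recur throughout this topic, here is a minimal sketch of both impurity measures. The class counts below are the familiar 9-positive/5-negative distribution used in the course textbook's decision tree chapter:

```python
import math

# Class distribution at a hypothetical decision tree node: 9 positive, 5 negative
counts = [9, 5]
total = sum(counts)
probs = [c / total for c in counts]

# Entropy: -sum over classes of p * log2(p)
entropy = -sum(p * math.log2(p) for p in probs if p > 0)

# Gini index: 1 - sum over classes of p^2
gini = 1 - sum(p * p for p in probs)

print(round(entropy, 3))  # 0.94
print(round(gini, 3))     # 0.459
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed; decision tree learners choose splits that reduce them the most.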

    Topic 4: Bayesian Learning (9/30-10/5, 10/24)

    Course Textbook: Chapter 6 (primarily Sections 6.1 - 6.2, 6.9-6.11)


    Topic 5: Sequential Learning (10/5-10/14)

    Course textbook: not included


    Topic 6: Regression and Neural Networks (10/31-11/14)

    Course Textbook: Chapter 4

    Additional Helpful Readings

    • ISL Textbook: Chapter 3 (Linear Regression), Chapter 4.2-4.3 (Logistic Regression)
    • ESL Textbook: Chapter 3.1-3.2 (Linear Regression), Chapter 4.4 (Logistic Regression), Chapter 11 (Neural Networks)
    • LeCun, Yann, Bengio, Yoshua, & Hinton, Geoffrey. Deep Learning. Nature, vol. 7553, May 28, 2015, pp. 436-444. URL: https://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf
      Description: a great, short summary of Deep Learning by some of the biggest names in the area.
    • Schmidhuber, Jürgen. Deep Learning in Neural Networks: An Overview. Neural Networks, vol. 61, Jan. 2015, pp. 85-117. URL: https://arxiv.org/abs/1404.7828
      Description: a more in-depth summary (with many great references) of Deep Learning by one of the key researchers on RNNs.
    • Goodfellow, Ian, Bengio, Yoshua, & Courville, Aaron. Deep Learning. Book in preparation for MIT Press. 2016. URL: http://www.deeplearningbook.org/
      Description: an upcoming textbook on Deep Learning, currently available online in HTML

    Topic 7: Association Rule Mining (11/16-21)

    Course Textbook: not included

    Additional Helpful Readings

    • ESL Textbook: Chapter 14.2
      Note: the beginning of this section is pretty math heavy, but it becomes more accessible a little bit into Chapter 14.2.2 and should be easier to read through the end of the section.

    Topic 8: Clustering (11/28)

    Course Textbook: Chapter 6.12

    Additional Helpful Readings


    Topic 9: Recommender Systems (11/30)

    Course Textbook: not included

    Syllabus

    Course Overview

    Machine learning and data mining are closely related capabilities that enable computers to learn to perform tasks without explicit programming, as well as to discover interesting information from data. This course explores topics within machine learning and data mining, including classification, unsupervised learning, and association rule mining. Students will gain hands-on practice with popular machine learning and data mining algorithms, as well as discuss challenges, issues, and solutions for working with the complexities of real-world data.


    Instructor

    Dr. Adam Eck
    Office: King 231
    Email (preferred contact method): adam.eck@oberlin.edu
    Response policy: emails received by 5PM on a weekday will receive a response the following morning. Emails received during the weekend will receive a response sometime on Monday.


    Meeting Time and Location

    Time: 10:00-10:50 AM MWF
    Location: King 306


    Office Hours

    11:00 AM-Noon Mondays (King 231)
    11:00 AM-Noon Thursdays (King 231)
    2:00-3:00 PM Fridays (The Local)


    Course Prerequisites

    CSCI 151 (Required)
    MATH 220 (Recommended)


    Textbook

    Mitchell, Tom M. Machine Learning. WCB/McGraw-Hill, Boston, MA, 1997.

    Additional recommended textbooks are listed below under Additional Resources on the course website.


    Course Objectives

    1. Exposure to a breadth of topics related to machine learning and data mining.
    2. Understanding of supervised and unsupervised machine learning.
    3. Hands-on experience with commonly used algorithms and software tools for machine learning and data mining.
    4. Practice implementing algorithms described in pseudocode.
    5. Consideration of the potential impact of machine learning and data mining on society and real-world applications.
    6. Refinement of experimentation, analysis, and technical writing skills.
    7. Training in identifying problems of interest, developing solutions, and working in teams on a substantial student-driven project.

    Course Topics

    1. Background information for machine learning and data mining (introduction, notation and terminology, probability and statistics, etc.) (Chapters 1 and 8)
    2. Empirical evaluation of algorithm performance (Chapter 5)
    3. Decision trees (Chapter 3)
    4. Bayesian learning (Chapter 6)
    5. Neural networks and deep learning (Chapter 4)
    6. Support vector machines
    7. Sequential learning
    8. Association rule mining
    9. Clustering (Chapter 6.12)
    10. Recommender systems
    11. Advanced topics [time permitting]

    Class Participation

    Students are expected to attend each class period, read in advance any assigned readings, pay attention during class (e.g., not be distracted by working on laptops, smartphones, tablets, etc.), and actively participate in class discussions. If students know in advance that they must miss a class period, they are expected to inform the professor at least one business day in advance. If a student has more than one unexpected absence in a two-week period, they should discuss the reason(s) for the absences with the professor (by email or during office hours).

    This term we will be using Piazza for class discussion. The system is highly catered to getting you help fast and efficiently from classmates and the instructor. Rather than emailing questions, I encourage you to post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com. You can find our class page at: https://piazza.com/oberlin/fall2016/csci374/home


    Exams

    The only exam in this course will be a take-home pre-test exam, administered at the beginning of the semester. This pre-test is intended to assess how much of the background knowledge necessary for the course students already possess, as well as what needs to be covered by the instructor. As such, students should attempt to answer any question for which they believe they know the answer, while also marking “I don’t know” for questions for which they have never had an opportunity to acquire the associated knowledge and skills (e.g., if a question asks the student to find the partial derivative of an equation, and the student does not know how to find partial derivatives, then the student should simply indicate “I don’t know”). Grading will be based solely on whether students turn in the exam; assigned grades for the pre-test will not be based on the correctness of answers to individual questions. You are not expected to know how to answer every question on the pre-test before taking this course, so please do not feel discouraged if there are questions you do not know how to answer.


    Quizzes

    In the absence of regular exams, quizzes will be occasionally administered to evaluate individual student learning throughout the course of the semester, as well as to identify important course concepts that could benefit from additional instruction and practice. These quizzes will be either administered (1) at the beginning of lectures during the regularly scheduled course meeting time, or (2) as take-home quizzes, to be turned in at the beginning of the next course meeting time.


    Homework Exercises

    There will be three homework exercises administered throughout the semester, each providing students an opportunity for hands-on practice with the course content. Each exercise will consist of three main components: (1) the implementation of one or more machine learning or data mining algorithms studied in class, (2) experimentation with these algorithms on various data sets, and (3) a written technical report describing what the student accomplished and discovered as part of the assignment.


    Reflection Assignment

    Throughout the semester, students will also read articles highlighting the application (or potential application) of machine learning and data mining to real-world problems. Based on these articles, students will formulate their own opinions about the potential benefits and drawbacks of machine learning and data mining on our society. These opinions will be expressed through a written “reflection” assignment, where evidence supporting the author’s opinion should be provided and appropriately cited.


    Final Project

    In place of a final exam, students will be required to work in teams for this Final Project assignment. Each team of students will be required to: (1) choose a project, (2) write a proposal identifying the problem of interest along with a proposed solution (presented to the class around the middle of the semester), (3) develop a solution, (4) report on their analysis and future work, and (5) present their project (during the final weeks of the semester).

    The goal of this project is to provide students with an opportunity to explore their own interests within machine learning and data mining, beyond what is covered by class lectures and readings or completed in the homework assignments. For example, some students might choose to explore the application of machine learning and data mining to a particular real-world problem, finding appropriate data and investigating how different algorithms might perform on that data. Additionally, some students might choose to implement additional algorithms not considered in the homework assignments to practice with additional representations and learning approaches. Each project will be chosen by the group's members to reflect the members' own interests.

    This project will require substantial participation by the members of each group, so it will be assigned sufficiently early in the semester so that students have time to successfully complete the project.


    Grades

    Here is the distribution of points for the course grade: (1) Class Participation = 10%, (2) Pre-test Exam = 5%, (3) Quizzes = 10%, (4) Homework Exercises = 30%, (5) Reflection Assignment = 10%, and (6) Final Project = 35%.


    Late Submission Policy

    When permitted by the Oberlin calendar (e.g., before the reading period), late submissions of assignments, exercises, etc. will be accepted but will be subject to a percent deduction penalty:

    1 second to 1 hour late: 5% deduction
    1 hour, 1 second – 24 hours late: 10% deduction
    24 hours, 1 second – 48 hours late: 20% deduction
    Each additional 24 hour period late: additional 10% deduction

    For example, assume an assignment is due at 11:59 PM Friday, October 7. Student X turns in the assignment at 12:15 AM Saturday, October 8, causing a 5% deduction penalty (for a maximum possible score of 95%) due to turning in the assignment late, but less than one hour late.

    Student Y later turns in the same assignment at 5:00 PM on Saturday, October 8, causing a 10% deduction penalty (for a maximum possible score of 90%) due to being more than 1 hour but less than 24 hours late.

    Finally, Student Z turns in the same assignment at 12:00 PM on Tuesday, October 11, causing a 40% deduction penalty (for a maximum possible score of 60%) for being more than 72 hours but less than 96 hours late.
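The deduction schedule above can be sketched as a small function. This is a rough illustration of the policy, not an official grade calculator:

```python
import math

# Rough sketch of the late-submission deduction schedule; hours_late is
# measured from the deadline. Illustrative only, not an official calculator.
def late_deduction(hours_late):
    if hours_late <= 0:
        return 0
    if hours_late <= 1:
        return 5
    if hours_late <= 24:
        return 10
    if hours_late <= 48:
        return 20
    # each additional 24-hour period past 48 hours adds another 10%
    extra_periods = math.ceil((hours_late - 48) / 24)
    return 20 + 10 * extra_periods

print(late_deduction(0.25))  # 5  (Student X: ~16 minutes late)
print(late_deduction(17))    # 10 (Student Y: ~17 hours late)
print(late_deduction(84))    # 40 (Student Z: ~84 hours late)
```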


    Disabilities

    The College makes reasonable accommodations for persons with disabilities. Students should notify the Office of Disability Services located in Peters G-27/G-28 and their instructor of any disability related needs.  For more information, see http://new.oberlin.edu/office/disability-services/index.dot.  Any student eligible for and needing academic adjustments or accommodations because of a disability (including non-visible disabilities such as chronic diseases, learning disabilities, head injury, attention deficit/hyperactive disorder, or psychiatric disabilities) is requested to speak with the professor.

    Academic Dishonesty

    Students are expected to adhere to the Oberlin College Honor Code. Any violations will be reported to the Honor Code Committee.

    Different assignments in this course will have different expectations with respect to the Honor Code, which will be clearly explained in the assignment instructions (in case of confusion, please contact the professor for clarification). For example, the pre-test and quizzes are meant to assess individual knowledge, and thus must be completed independently (without reference to study materials, textbooks, etc. unless explicitly permitted). On the other hand, the final project is a group exercise and students are required to closely collaborate with other students (within their groups) to successfully complete their projects. Between these two polar ends of the spectrum, students are encouraged to discuss the homework assignments with their peers, but (1) students must acknowledge with whom they discussed their assignment on the written report, and (2) students are not allowed to share or show their code to one another, nor discuss implementation details (discussions should be done at a higher level about the algorithms, experimental design, etc. and not about source code). Please note: looking at source code from machine learning libraries or other sources is strictly forbidden for the homework assignments. However, use of pre-existing software and libraries might be acceptable for the final group projects, provided the students receive explicit permission from the professor.

    For every assignment, students must indicate whether they followed the Honor Code in completing the assignment. If so, students should end each assignment by writing "I have adhered to the Honor Code in this assignment.”

    Assignments

    Assignments will be posted here as they are assigned in class.

    Pre-test

    Assignment: Pre-test Assignment (LaTeX Source)
    Due Date: 9/2 at 11:59 PM (submit on Blackboard)


    Homework Assignments

    Homework 1: Decision Trees Assignment (Updated 10/4)
    Snapshot 1 Due: 9/28 at 11:59 PM (submit on Blackboard)
    Snapshot 2 Due: 10/7 at 11:59 PM (submit on Blackboard)
    Final Due Date: 10/12 at 11:59 PM (submit on Blackboard)

    Helpful Tips:

    • For information about k-fold cross-validation, see Section 5.1.3 of the ISL Textbook (especially Figure 5.5 on Page 181).
    • The purpose of using a random seed is to be able to replicate the "randomness" created by the (pseudo-)random number generator used by the computer. That is, when you specify a random seed, every random number generated will occur in the exact same order as any other time you used the exact same random seed. Different random seeds will generate different sequences of "random" numbers.
      • Why are we concerned about this? In science, all experiments should be reproducible. So by specifying a random seed, we make it possible for our program to still use randomness to split the training and test sets (so that each instance is just as likely as any other to be used for training or testing), and yet we can also verify the results of the experiment by running it again to duplicate its results.
      • This means that if you run your program a second time with a particular random seed (for a given algorithm and data set), the output should be exactly the same as the first time you ran it with the same random seed. Whether your program does indeed produce the same results each time the same random seed is used might be considered when grading your assignment.
    • To randomly shuffle a list in Python, use the following code snippet:

      import random
      random.seed(yourRandomSeed)    # seed the generator for reproducibility
      shuffled = list(yourInstances) # copy so the original list is unchanged
      random.shuffle(shuffled)       # shuffle the copy in place
    • To randomly shuffle a list in Java, use the following code snippet:

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;
      import java.util.Random;

      Random rng = new Random(yourRandomSeed);      // seed for reproducibility
      List shuffled = new ArrayList(yourInstances); // copy so the original list is unchanged
      Collections.shuffle(shuffled, rng);           // shuffle the copy with the seeded RNG
    • For Java users: a Makefile is a script that compiles your code (using javac) to create your .class files. By turning in a Makefile, you do not need to also turn in your .class files, and it makes it easier for me to compile your code. Not to mention, Makefiles are good practice to make compiling easier in general, especially as your projects get more complex (and you eventually want to also create JAR files).

      A couple of links describing how to create Java Makefiles include:
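Tying together the cross-validation and random-seed tips above, here is a minimal sketch of a seeded k-fold split in Python. The function and variable names are made up, and this is illustrative only, not the required implementation:

```python
import random

def k_fold_splits(instances, k, seed):
    # Shuffle a copy with a fixed seed so the experiment is reproducible.
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    # Deal the shuffled instances into k folds; each fold serves once
    # as the test set while the remaining folds form the training set.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# The same seed always produces the same splits.
splits_a = list(k_fold_splits(range(10), 5, seed=42))
splits_b = list(k_fold_splits(range(10), 5, seed=42))
print(splits_a == splits_b)  # True
```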

    Report Sections: For your report, you should include the following sections:

    1. Introduction: an overview of the homework assignment, plus a brief description of what the report will contain (e.g., "In this report, we describe our implementation of two decision tree algorithms in Section 2, followed by...")
    2. Implementation: a description of how you implemented your homework assignment, including any design decisions made (e.g., which language did you use, what data structures did you use or create, how did you keep track of important information such as the partial counts for missing values, etc.). There should be enough here that someone would know how to look through your code after reading the section.
    3. Experimental Setup: what experiments did you run, and what parameters did you use (e.g., which random seeds, any other parameters such as α in CART's pruning method, how did you choose those parameters, etc.).
    4. Results: what results did you find and what lessons did you learn? I will be looking for the following as part of your analysis:
      • Presentation: present your results in an easy-to-read format, such as a table containing the accuracy measured during your experiments (columns = algorithms, rows = data sets). This is how you want to visualize your results for your readers.
      • Observation: what should your readers see when they look at your presentation? Discuss each important observation, rather than assuming your readers will see them. This allows you to reinforce what you want the readers to take away from your presentation.
      • Explanation: explain to your readers why your observations occurred. What caused each observation? For example, if you observed that C4.5 performed better than ID3 on the Optical Digit data set, this is your opportunity to argue why that happened. Imagine you created C4.5 -- you want to explain to the reader why your solution gives better performance than others. Important note: it is not a valid explanation to say "We observe that accuracy increased in C4.5 from ID3 because the accuracy numbers were higher in Table 1." That is circular reasoning -- that is why you made the observation, not what caused the observation to happen in the first place. Important note 2: if your implementation did not work as you expected, here is your opportunity to explain why. Were there bugs? Or maybe a difficult design choice that didn't work as expected?
      • Meaning: explain to your reader why the observations are meaningful or important. Again, imagine you created C4.5. Now you can argue that because it does better across data sets, it should be favored over ID3. Why is what you created/accomplished valuable to others? Important note: if your observations were unexpected, maybe because your implementation didn't work as intended, here you could talk about how you would improve your implementation to address the problems discussed under "explanation".
    5. Conclusion: wrap up with a one paragraph summary of the report. If you have future work you want to perform, you could describe that in a second paragraph.

    Homework 2: Naive Bayes Assignment
    Snapshot 1 Due: 10/28 at 11:59 PM (submit on Blackboard)
    Snapshot 2 Due: 11/04 at 11:59 PM (submit on Blackboard)
    Final Due Date: 11/07 at 11:59 PM (submit on Blackboard)

    Homework 3: Neural Networks Assignment
    Final Due Date: 11/28 at 11:59 PM (submit on Blackboard)


    Reflective Writing Assignment

    Assignment: Reflective Writing Assignment
    Final Due Date: 12/09 at 11:59 PM (submit on Blackboard)

    Final Project

    Assignment: Group Project Assignment

    More information about the final group project will be posted soon.

    Additional Resources

    Recommended Textbooks

    Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. Springer-Verlag, 2009. Website: http://statweb.stanford.edu/~tibs/ElemStatLearn/

    James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. An Introduction to Statistical Learning (with Applications in R). Springer-Verlag, 2013. Website: http://www-bcf.usc.edu/~gareth/ISL/


    Related Videos

    NB: these videos are posted to inspire thought and do not necessarily represent the views or opinions of the professor. Each video is copyright TED and is the original material of its presenters.

    TED Talks on Machine Learning: A collection of thought-provoking talks by machine learning researchers and practitioners at TED conferences.


    Jeremy Howard: "The Wonderful and Terrifying Implications of Computers that can Learn" from TEDxBrussels (Dec 2014) [Machine Learning, Applications, Society]


    Kenneth Cukier: "Big data is better data" from TEDSalon Berlin 2014 (June 2014) [Data, Society]


    Jennifer Golbeck: "The curly fry conundrum: Why social media 'likes' say more than you might think" from TEDxMidAtlantic 2013 (October 2013) [Correlation/Causation, Privacy, Society]
    Original PNAS Paper: http://www.pnas.org/content/110/15/5802.full