Program Details
Degree: Masters
Course Language: English
Intakes
Program start date: 2013-02-19
Application deadline: -

Overview

Data Mining: Learning from Large Data Sets is a graduate-level course on applying, analyzing, and evaluating state-of-the-art techniques from statistics, algorithms, and discrete and convex optimization for learning from large data sets.


Topics

  • Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
  • Fast nearest neighbor methods (Shingling, locality sensitive hashing)
  • Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
  • Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
  • Active learning (uncertainty sampling, pool-based methods, label complexity)
  • Dimension reduction (random projections, nonlinear methods)
  • Data streams (Sketches, coresets, applications to online clustering)
  • Recommender systems
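
As an illustration of the "fast nearest neighbor methods" topic, the following is a minimal sketch of shingling plus min-hashing for estimating Jaccard similarity. It is not taken from the course materials: the function names, the 3-character shingle length, the 128 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-character shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _salted_hash(seed, item):
    """64-bit hash of item under the hash function indexed by seed.

    Salted BLAKE2b is an illustrative stand-in for a random hash family.
    """
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """Min-hash signature: the minimum hash value of a non-empty set
    under each of num_hashes hash functions."""
    return [min(_salted_hash(seed, x) for x in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the
    Jaccard similarity of the underlying sets."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

A typical use would be to compute `minhash_signature(shingles(doc))` once per document and compare signatures with `estimate_jaccard` instead of comparing full shingle sets. Locality sensitive hashing then bands these signatures so that only likely-similar pairs are compared at all.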

Details

  • VVZ Information:
  • Recitations:
    • Tue 13-14 in CAB G 61. Last names starting with A-L
    • Fri 14-15 in NO C 6. Last names starting with M-Z
  • Textbook: A. Rajaraman, J. Ullman. Mining of Massive Data Sets.

Homeworks

  • Self Assessment Questions
  • Homework 1
  • Homework 2
  • Homework 3
  • Homework 4
  • Homework 5
  • Homework 6

Solutions

  • Homework 1
  • Homework 2
  • Homework 3
  • Homework 4
  • Homework 5
  • Homework 6

Lecture Notes

  • February 19: Introduction
  • February 26: Approximate Retrieval; Min-hashing
  • March 5: Locality sensitive hashing
  • March 12: SVMs; online convex programming
  • March 19: (Parallel) stochastic gradient descent
  • March 26: Feature selection via l1-regularization; multi-class/structured prediction
  • April 9: Active Learning
  • April 16: Large scale unsupervised learning (Online k-means, coresets)
  • April 23: Large scale unsupervised learning (Online EM, coresets, anomaly detection)
  • April 30: Exploration-exploitation tradeoffs (k-armed bandits, upper confidence sampling)
  • May 7: Contextual bandits
  • May 14: Submodular functions (properties, algorithms and applications)
  • May 28: Recommending sets (structured prediction, online submodular optimization)
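
To make the April 30 topic (k-armed bandits, upper confidence sampling) concrete, here is a minimal sketch of the UCB1 index policy from the Auer, Cesa-Bianchi, and Fischer paper listed under Relevant Readings. It assumes rewards in [0, 1]. The names `ucb1` and `bernoulli_arms` and the simulated Bernoulli environment are hypothetical and are not part of the course materials.

```python
import math
import random


def bernoulli_arms(means, seed=0):
    """Hypothetical test environment: arm i pays 1.0 with probability
    means[i] and 0.0 otherwise."""
    rng = random.Random(seed)

    def pull(arm):
        return 1.0 if rng.random() < means[arm] else 0.0

    return pull


def ucb1(pull, num_arms, horizon):
    """Run UCB1 for horizon rounds and return (counts, reward_sums).

    pull(arm) must return a reward in [0, 1].
    """
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            # Initialization: play each arm once.
            arm = t - 1
        else:
            plays_so_far = t - 1

            def index(i):
                # Empirical mean plus an upper confidence bonus.
                bonus = math.sqrt(2.0 * math.log(plays_so_far) / counts[i])
                return sums[i] / counts[i] + bonus

            arm = max(range(num_arms), key=index)
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums
```

For example, `ucb1(bernoulli_arms([0.2, 0.8]), num_arms=2, horizon=2000)` returns per-arm pull counts and reward sums. The bonus term shrinks as an arm is played more often, so rarely played arms keep getting explored while the arm with the best empirical mean receives most of the pulls.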

Recitations

  • Feb 26: Hadoop tutorial
  • Mar 5: LSH & NN
  • Mar 12: SVM
  • Mar 19: Project 1 - Approximate Retrieval
  • April 9: Online SVMs (HW3/Loss Functions/L1 Regularization)
  • April 16: Project 2 - Large-scale Classification
  • April 23: Active Learning
  • April 30: Unsupervised Learning
  • May 7: Exploration-Exploitation
  • May 14: Project 3 - Recommender Systems

Old Exams

  • Data Mining Exam, 2012 Spring

Relevant Readings

  • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
  • Jure Leskovec, Eric Horvitz. Planetary-Scale Views on a Large Instant-Messaging Network.
  • Manuel Gomez Rodriguez, Jure Leskovec, Andreas Krause. Inferring Networks of Diffusion and Influence.
  • James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs.
  • Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing.
  • Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent.
  • Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent.
  • Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines.
  • John Duchi, Shai Shalev-Shwartz, Yoram Singer, Tushar Chandra. Efficient Projections onto the l1-Ball for Learning in High Dimensions.
  • Nathan Ratliff, J. Andrew (Drew) Bagnell, and Martin Zinkevich. (Online) Subgradient Methods for Structured Prediction.
  • Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning.
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification.
  • Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for k-Means Clustering Based on Weak Coresets.
  • Chris Bishop. Pattern Recognition and Machine Learning.
  • Percy Liang, Dan Klein. Online EM for Unsupervised Models.
  • Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets.
  • Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.
  • Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation.
  • Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere.
  • Matthew Streeter, Daniel Golovin. An Online Algorithm for Maximizing Submodular Functions.
  • Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments.
  • Andreas Krause, Daniel Golovin. Submodular Function Maximization.
  • Yisong Yue, Carlos Guestrin. Linear Submodular Bandits and their Application to Diversified Retrieval.

Related Courses

  • CS345a: Data Mining at Stanford University
  • 15-826: Multimedia Databases and Data Mining at Carnegie Mellon University
  • CS/CNS/EE 253: Advanced Topics in Machine Learning at Caltech