The Kaggle Competitions: An Introduction to CAMCOS Fall 2015


Guangliang Chen
Math/Stats Colloquium, San Jose State University
August 6, 2015

Outline
- Introduction to Kaggle
- Description of projects
- Summary

CAMCOS in Fall 2015: A glance
- Made possible by a proposal by Dr. Bremer (no outside sponsor this time)
- Theme of the program is data science (we all like data science)
  - Many online courses
  - Many universities are starting to offer a degree in this field
  - High demand for data science graduates is projected, from both industry and academia
- Projects of this CAMCOS are selected from the online competitions at Kaggle.com

Basic facts about Kaggle
- Kaggle is a Silicon Valley start-up, and Kaggle.com is its online platform hosting many data science competitions.
- Founded by Anthony Goldbloom in 2010 in Melbourne, and moved to San Francisco in 2011.
- It uses a crowdsourcing approach which relies on the fact that there are countless strategies that can be applied to any predictive modelling task and it is impossible to know at the outset which technique or analyst will be most effective.
- Hal Varian, Chief Economist at Google, described Kaggle as "a way to organize the brainpower of the world's most talented data scientists and make it accessible to organizations of every size".

How it works
- Companies, with the help of Kaggle, post their data as well as a description of the problem on the website;
- Participants (from all over the world) experiment with different techniques and submit their best results to a scoreboard to compete;
- After the deadline passes, the winning team receives a cash reward (which could be as much as several million dollars) and the company obtains "a worldwide, perpetual, irrevocable and royalty-free license".

Achievements and impact of Kaggle
- Kaggle claims more than 358,000 data scientists on its jobs board (picture on next slide)
- Customers include many big companies and organizations such as NASA, Merck, GE, Microsoft, Facebook, Allstate and Mayo Clinic
- It has advanced the state of the art in different fields, such as HIV research, traffic forecasting and mapping dark matter.
- It has led to academic papers and continued interest in further innovation.

[Figure: Kaggle jobs board]

Potential benefits of participating in the Kaggle competitions
- Experience with large, complex, interesting, real data
- Learning (new knowledge and skills)
- Become a part of the data science community
- Cash prize
- Can get you a job

Back to CAMCOS
The projects of this CAMCOS are selected from Kaggle competitions:
- Project 1: Digit recognizer
  - Duration: July 25, 2012 to December 31, 2015
  - Award: knowledge (no cash prize)
- Project 2: Springleaf marketing
  - Duration: August 14 to October 19
  - Award: $100,000

Project 1: Digit recognition
Given an image of a handwritten single digit, determine what it is by machine:
- Training images must be given;
- Need to learn a rule (classifier) and apply it to new images

MNIST handwritten digits
The MNIST database of handwritten digits, formed by Yann LeCun of NYU, has a total of 70,000 examples from approximately 250 writers:
- The images are 28 × 28 in size
- The training set contains 60,000 images while the test set has 10,000
- It is a benchmark dataset used by many people to test their algorithms

Visualization of the data set
1. The average writer.
2. PCA plot of each digit cloud

[Figure: PCA plots of the individual digit clouds, continued over two slides]

The general classification problem
Given data and their class labels $(x_i, y_i) \in \mathbb{R}^d \times \{1, \dots, J\}$, $1 \le i \le n$, find a function $f$ (in some function space) by minimizing
$$\sum_{i=1}^{n} L(y_i, f(x_i))$$
where $L$ is a loss function (e.g., $\ell_1$ or $\ell_2$ distance).
- It is an instance of supervised learning.
- Statistically, this is a regression problem (with categorical outcomes), often done with logistic regression.
- Lots of applications: document classification, spam email detection, etc.
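As a rough illustration of the quantity being minimized, the sketch below evaluates the summed loss of a fixed classifier on a toy sample. The names `l1_loss`, `zero_one_loss`, `empirical_risk`, `threshold_rule` and the toy data are hypothetical, and the 0-1 loss is included only as a common alternative that the slide does not mention.

```python
from typing import Callable, Sequence, Tuple


def l1_loss(y_true: int, y_pred: int) -> float:
    """l1 distance between the true and the predicted label."""
    return float(abs(y_true - y_pred))


def zero_one_loss(y_true: int, y_pred: int) -> float:
    """0-1 loss (an assumption, not from the slide): 1 for a mistake, else 0."""
    return 0.0 if y_true == y_pred else 1.0


def empirical_risk(
    f: Callable[[Tuple[float, ...]], int],
    xs: Sequence[Tuple[float, ...]],
    ys: Sequence[int],
    loss: Callable[[int, int], float] = l1_loss,
) -> float:
    """Sum of L(y_i, f(x_i)) over the training sample."""
    return sum(loss(y, f(x)) for x, y in zip(xs, ys))


def threshold_rule(x: Tuple[float, ...]) -> int:
    """Toy classifier: class 1 to the left of 0.5, class 2 otherwise."""
    return 1 if x[0] < 0.5 else 2


xs = [(0.1,), (0.4,), (0.6,), (0.9,)]
ys = [1, 1, 2, 1]

print(empirical_risk(threshold_rule, xs, ys))                 # 1.0 (one mistake)
print(empirical_risk(threshold_rule, xs, ys, zero_one_loss))  # 1.0
```

Training a classifier then amounts to searching a family of candidate functions for the one with the smallest value of this sum.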

Some classifiers from the literature
- Nearest subset classifiers: kmeans
- Nearest neighbors classifiers: kNN
- Linear classifiers, such as
  - Logistic regression
  - Naive Bayes classifier
  - Linear discriminant analysis (LDA)
  - Support vector machine (SVM)
- Other: decision trees, perceptron, neural networks, etc.

Nearest subset classifiers
The idea is to assign a new point to the closest class of training points:
$$\hat{j} = \operatorname*{argmin}_{1 \le j \le J} \operatorname{dist}(x, C_j)$$
by using some kind of distance metric:
- kmeans: using only the center of each $C_j$
- Local kmeans: using the center of the $k$ closest points from each $C_j$, where $k \in \mathbb{Z}^+$
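The two variants can be sketched in a few lines of plain Python. This is only an illustration under assumptions: Euclidean distance is used as the metric, and the function names and toy points are hypothetical.

```python
import math
from collections import defaultdict


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def group_by_class(X, y):
    """Collect the training points of each class C_j."""
    classes = defaultdict(list)
    for point, label in zip(X, y):
        classes[label].append(point)
    return classes


def kmeans_classify(x, X, y):
    """Global version: compare x with the center of each class C_j."""
    classes = group_by_class(X, y)
    return min(classes, key=lambda j: math.dist(x, centroid(classes[j])))


def local_kmeans_classify(x, X, y, k):
    """Local version: use the center of the k points of C_j closest to x."""
    classes = group_by_class(X, y)

    def local_dist(j):
        nearest = sorted(classes[j], key=lambda p: math.dist(x, p))[:k]
        return math.dist(x, centroid(nearest))

    return min(classes, key=local_dist)


# Class 1 is stretched along the horizontal axis, class 2 is compact.
X = [(0.0, 0.0), (10.0, 0.0), (5.0, 3.0), (5.0, 4.0)]
y = [1, 1, 2, 2]
query = (10.0, 2.0)

print(kmeans_classify(query, X, y))           # 2
print(local_kmeans_classify(query, X, y, 1))  # 1
```

On this toy set the global rule picks class 2 because the center of the stretched class 1 is far from the query, while the local rule picks class 1 because one of its points is very close.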

Nearest neighbors classifiers
kNN assigns a class label to a new point based on the labels of its k closest training points.
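A minimal sketch of this rule, assuming Euclidean distance and a plain majority vote (the slide does not specify either); `knn_classify` and the toy data are hypothetical names.

```python
import math
from collections import Counter


def knn_classify(x, X, y, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(zip(X, y), key=lambda pair: math.dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    # Ties between classes go to the class seen first, i.e. the closer neighbor.
    return votes.most_common(1)[0][0]


X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = [0, 0, 0, 1, 1, 1]

print(knn_classify((1, 1), X, y, k=3))  # 0
print(knn_classify((4, 4), X, y, k=3))  # 1
```

Sorting all training points costs O(n log n) per query, which is fine for a toy example; for data of MNIST size a dedicated nearest-neighbor search structure would usually be preferred.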

Some quick experimental results
- The error rate of the global kmeans classifier is about 18%.
- The error rate of the local kmeans classifier (for k = 1) is 3.1%.
- Error rate of the kNN classifier (for different k) is shown below:

[Figure: kNN error rate versus k]

Comments on the kmeans/kNN classifiers
- Instance-based learning (or lazy learning)
- Simple to implement
- Algorithmic complexity depends only on the nearest neighbors search
- The choice of k is important
- Cannot handle skewed distributions
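One common way to see why the choice of k matters is to estimate the error rate for several values of k on held-out points. The sketch below uses leave-one-out evaluation, which is an assumption here and not necessarily how the reported error rates were obtained; all names and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(x, X, y, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(zip(X, y), key=lambda pair: math.dist(x, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def loo_error_rate(X, y, k):
    """Leave-one-out error: classify each point using all the other points."""
    errors = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        if knn_predict(X[i], X_rest, y_rest, k) != y[i]:
            errors += 1
    return errors / len(X)


# Two well separated clusters plus one point of class 1 placed inside
# the class 0 cluster, playing the role of a noisy label.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

for k in (1, 3):
    print(k, loo_error_rate(X, y, k))
```

With k = 1 the single noisy point misleads all four of its neighbors, so five of the nine points are misclassified; with k = 3 the vote absorbs the noise and only the noisy point itself is counted as an error.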

Linear classifiers
For two classes, linear classifiers typically have the following form:
$$f(x) = \begin{cases} 1, & \text{if } w^T x - b > 0; \\ 0, & \text{otherwise} \end{cases}$$
where $w, b$ are learned from training samples.

The above rule is equivalent to using a hyperplane as the classification decision boundary.
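The decision rule itself is a one-liner. The sketch below assumes the sign convention $w^T x - b > 0$ for class 1 and uses hand-picked, hypothetical values of `w` and `b` rather than learned ones.

```python
def linear_classify(x, w, b):
    """Return 1 if w^T x - b > 0, and 0 otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if score > 0 else 0


# Hypothetical parameters: the decision boundary is the line x1 + x2 = 1.
w, b = (1.0, 1.0), 1.0

print(linear_classify((2.0, 2.0), w, b))  # 1 (positive side of the hyperplane)
print(linear_classify((0.0, 0.0), w, b))  # 0 (negative side)
print(linear_classify((0.5, 0.5), w, b))  # 0 (exactly on the boundary)
```

Points on the hyperplane itself receive label 0 because the inequality is strict.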

Building linear classifiers
There are two classes of methods for training $w, b$:
- Distribution-based (statistical methods): model the conditional density functions $P(x \mid C_j)$
  - Linear discriminant analysis (LDA): assuming Gaussian conditional distributions and performing a likelihood ratio test (when having only two categories)
  - Naive Bayes classifier: using Bayes' rule $P(C_j \mid x) \propto P(C_j) P(x \mid C_j)$ and selecting priors $P(C_j)$
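To make the Bayes rule step concrete, here is a small Gaussian naive Bayes sketch. It rests on assumptions the slide does not state: priors are taken to be the class frequencies, features are treated as independent given the class, and each feature is modeled by a one-dimensional Gaussian. All function names and data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior P(C_j) and per-feature mean and variance for each class."""
    groups = defaultdict(list)
    for point, label in zip(X, y):
        groups[label].append(point)
    model = {}
    for label, pts in groups.items():
        n = len(pts)
        columns = list(zip(*pts))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Pick the class maximizing log P(C_j) + sum_k log P(x_k | C_j)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
        return math.log(prior) + log_lik

    return max(model, key=log_posterior)


X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 4.8), (3.8, 5.2)]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)

print(predict_gaussian_nb(model, (1.1, 2.0)))  # a
print(predict_gaussian_nb(model, (4.1, 5.1)))  # b
```

Working with log probabilities avoids numerical underflow when many features are multiplied together, which matters for inputs with hundreds of pixels.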

- Optimization-based (discriminative methods): solve
$$\min_{w,b} \; R(w) + \gamma \sum_{i=1}^{n} L\bigl(y_i, \mathbf{1}_{w^T x_i - b > 0}\bigr)$$
where
  - $R(w)$: regularization term
  - $L(y_i, \mathbf{1}_{w^T x_i - b > 0})$: loss of the prediction
  - $\gamma$: tradeoff constant

Examples of this class include
  - Support vector machine (SVM)
  - Perceptron
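As an example of a discriminative method, the sketch below trains a perceptron with the classic mistake-driven update. It is a simplified illustration: there is no regularization term, labels are assumed to be 0 or 1, the rule is taken as $w^T x - b > 0$, and the function names, learning rate and toy data are hypothetical.

```python
def predict(x, w, b):
    """Linear rule: 1 if w^T x - b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else 0


def train_perceptron(X, y, epochs=100, lr=1.0):
    """Learn (w, b) by correcting one misclassified point at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, y):
            error = target - predict(x, w, b)  # +1, 0 or -1
            if error != 0:
                # Move the hyperplane toward the correct side of x.
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b -= lr * error
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return w, b


# Linearly separable toy problem: logical AND of two binary inputs.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

w, b = train_perceptron(X, y)
print([predict(x, w, b) for x in X])  # [0, 0, 0, 1]
```

The loop stops early only when the data are linearly separable; otherwise it runs for the full number of epochs and returns the last iterate.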

Challenges of the project
- Large amount of high dimensional data (60,000 × 784)
- Great variability in the ways people write the digits (i.e., strong noise)
- Similar digits, e.g., {7, 9} and {3, 5, 8}

[Figure: examples of the similar digits 7 and 9, and 3, 5 and 8]

Lots of classifiers to try (and beat)

Why you want to work on this project
- Data format is simple (easy to get started)
- Data set is well understood (as it has been extensively studied)
- The competition will provide tutorials to help you
- Lots of existing algorithms in the literature (good chance to learn)
- Can develop a solid background in classification

Project 2: Springleaf marketing
First, some background information:
- Springleaf is a company that operates in the financial services industry and does business in consumer lending
- Direct offers mailed to potential customers provide great value to the customers, and they are an important marketing strategy used by Springleaf
- They want to improve their strategy to better target customers who truly need loans and seem to be good candidates
- They hosted this competition by providing training data and asking you to predict which customers will respond to a direct mail offer

Description of the data
Both the training and test data sets are 9 mb, in csv format:
- Each row corresponds to one customer (>145,000 customers in the training set alone)
- The columns represent the anonymized customer information (a mix of continuous and categorical variables): ID, VAR_1, VAR_2, ..., VAR_15
- The response variable is binary and labeled target
- There are many missing values

Challenges of this project
- You need to be able to open/load the data files (enormous amount of complex business data)
- Need to deal with categorical variables
- Need to handle missing values
- Need to do feature selection (and get rid of lots of redundant information)
- Need to build a good classifier
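The preprocessing steps listed above can be sketched on a tiny in-memory table. Everything here is a set of illustrative assumptions rather than the approach used in the project: median imputation for numeric columns, one-hot encoding with missing values as their own level, and dropping constant columns as the crudest form of redundancy removal. The column names are hypothetical and are not the competition's variables.

```python
from statistics import median


def impute_numeric(column):
    """Replace missing values (None) with the median of the observed ones."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def one_hot(column, missing_label="MISSING"):
    """Encode a categorical column as 0/1 indicators, one per level."""
    filled = [missing_label if v is None else v for v in column]
    levels = sorted(set(filled))
    rows = [[1 if v == level else 0 for level in levels] for v in filled]
    return levels, rows


def drop_constant_columns(columns):
    """Keep only the columns that take more than one distinct value."""
    return {name: col for name, col in columns.items() if len(set(col)) > 1}


income = [30.0, None, 50.0, 70.0]  # numeric column with a gap
state = ["CA", None, "TX", "CA"]   # categorical column with a gap

print(impute_numeric(income))  # [30.0, 50.0, 50.0, 70.0]

levels, rows = one_hot(state)
print(levels)  # ['CA', 'MISSING', 'TX']
print(rows)    # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

table = {"flag": [1, 1, 1, 1], "age": [25, 40, 31, 58]}
print(list(drop_constant_columns(table)))  # ['age']
```

On a file of the size described here, the same ideas would normally be applied column by column with a data-frame library rather than with Python lists.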

No need to win the competition (time is short and you are competing with 59+ teams)

Why you want to work on this project
Because it is challenging!

Characteristics of an ideal candidate
- Good linear algebra knowledge (have taken 129A)
- Know probability and statistics well (have taken 163 and 164)
- Excellent programming skills (in Matlab, R, Python)
- Hard-working
- Team player
- Eager to learn

Thank you for your attention!
- Introduction to the Kaggle competitions
- Description of course projects (both are about classification)
  - Digit recognition
  - Springleaf marketing
- Thanks to Woodward Foundation for support
- Contact: guangliang.chen@sjsu.edu

Questions?