Math 203 Course Page
MATH 203: Applied Mathematics, Computing & Statistics Projects (CAMCOS)
Fall 2015, San Jose State University
Projects in this CAMCOS
Selected from the online Kaggle competitions (www.kaggle.com):
- Digit Recognizer: Conventional classification with handwritten digits data
- Springleaf Marketing: Classification with modern business data that contains a large number of continuous and categorical variables and many missing values
Intended Learning Outcomes
Ideally, after the semester is over, everybody will
- Thoroughly understand the topic of classification and learn many of the existing approaches;
- Master three programming languages (MATLAB, R, Python);
- Be familiar with various software packages that implement different types of classifiers;
- Gain first-hand research experience with large amounts of complex data
Weekly Progress
Final report [pdf]
Final presentation: Group 1 [pptx] [pdf] Group 2 [pptx] [pdf]
Assignments for Team 1
- Give an overview of the program (theme, competition, problem, goal and data)
- Describe your group's problem and data, display some classes (singular values and 3D projection), and explain what challenges there are
- Talk about the different kinds of methods and their results
- Instance-based classifiers: (local) kmeans, (weighted) knn (a rough sketch of the local kmeans idea follows this list)
- PCA + Bayes
- Linear classifiers such as LDA/QDA, SVM (including kernel, multiclass, PCA)
- Neural networks
- Summary (display a table of results obtained by the methods and explain which methods are better and why). If possible, also report the running times. Finally, you may want to submit the best method to Kaggle.com and compare your result with others
- Talk about preprocessing techniques like deskewing and report results. Mention a few other approaches that could be explored
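For the instance-based classifiers item above, here is a minimal Python sketch of one common reading of the local kmeans idea: assign a test point to the class whose local mean (the average of its k nearest training neighbors within that class) is closest. This is only an illustration, not the groups' actual code, and the choice k=5 is arbitrary.

```python
import numpy as np

def local_kmeans_predict(X_train, y_train, X_test, k=5):
    """Assign each test point to the class whose local k-mean is closest."""
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        best_dist, best_class = np.inf, None
        for c in classes:
            Xc = X_train[y_train == c]              # training points in class c
            d = np.linalg.norm(Xc - x, axis=1)      # distances from x to class c
            nearest = Xc[np.argsort(d)[:k]]         # k nearest neighbors within class c
            local_mean = nearest.mean(axis=0)       # local class centroid
            dist = np.linalg.norm(x - local_mean)
            if dist < best_dist:
                best_dist, best_class = dist, c
        preds.append(best_class)
    return np.array(preds)
```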
Assignments for Team 2
- Describe the competition and your data, and explain what challenges there are (large size, categorical variables, and missing values)
- Talk about your techniques to handle the above challenges
- Present the different kinds of methods and their results (report both overall accuracy and individual class accuracy)
- Ensemble learning methods such as random forest and XGBoost (point out that they can handle categorical variables directly)
- Convert categorical variables to numerical ones and then apply knn, LDA, logistic regression, svm, etc. If possible, compare different ways of conversion (a rough sketch of one conversion follows this list)
- Summary (display a table of results obtained by the different methods and explain which methods are better and why). If possible, also report the running times.
- Acknowledgement (thank Woodward for funding, Dr. Simic for all the support, etc.)
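As a companion to the conversion item above, here is a hedged Python sketch of one common way to convert categorical variables to numeric ones (one-hot encoding) before applying a numeric classifier. The file name, the "target" column, and the split are placeholders for illustration, not the actual Springleaf setup.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                # hypothetical Springleaf-style file
y = df["target"]                             # hypothetical response column
X = df.drop(columns=["target"])

# One-hot encode the categorical (object-typed) columns; numeric columns pass through.
X = pd.get_dummies(X, dummy_na=True)         # dummy_na keeps missingness as its own level
X = X.fillna(X.median())                     # simple fill for remaining numeric NaNs

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```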
Resources to learn neural networks
- Excellent online blog by Nielsen;
- YouTube videos;
- Stanford deep learning website
Team 1 assignments
- Develop an outline for your group presentation and estimate how much time you need
- Use LDA followed by PCA on residuals to reduce dimensionality for kernel SVM (one possible reading of this idea is sketched after this list)
- Try the separate PCA idea in the one-versus-all scenario to see whether it is better
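The following is a hedged Python sketch of one possible reading of the "LDA followed by PCA on residuals" item: keep the LDA coordinates, run PCA on the part of each feature vector lying outside the LDA subspace, and feed both sets of coordinates to a kernel SVM. The construction and the number of residual components are guesses for illustration, not the group's prescribed recipe.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def lda_pca_features(X_train, y_train, X_test, n_residual=20):
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    Q, _ = np.linalg.qr(lda.scalings_)       # orthonormal basis of the LDA subspace
    def split(X):
        Z = X @ Q                            # coordinates in the LDA subspace
        R = X - Z @ Q.T                      # residual: part of X outside that subspace
        return Z, R
    Z_tr, R_tr = split(X_train)
    Z_te, R_te = split(X_test)
    pca = PCA(n_components=n_residual).fit(R_tr)
    return np.hstack([Z_tr, pca.transform(R_tr)]), np.hstack([Z_te, pca.transform(R_te)])

# Usage (X_train, y_train, X_test, y_test assumed to be numeric arrays):
# F_tr, F_te = lda_pca_features(X_train, y_train, X_test)
# svm = SVC(kernel="rbf", C=10, gamma="scale").fit(F_tr, y_train)
# print(svm.score(F_te, y_test))
```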
Team 2 assignments
- Develop an outline for your group presentation and estimate how much time you need
- Compare the following three options to classify the data
- Ensemble learning using all variables
- Ensemble learning using all variables + kernel SVM using continuous variables
- Ensemble learning using categorical variables + kernel SVM using continuous variables
For kernel SVM, use nearest neighbors to set the gamma parameter and try different values of C (a rough sketch follows this list). Also, try several different 20% subsamples of the negative group to see how stable the method is.
- Try different parameters for neural networks and see what the best result is
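A hedged Python sketch of the kernel SVM item above: estimate a length scale from average nearest-neighbor distances, convert it to gamma, and sweep C. The neighbor count k=5 and the list of C values are illustrative; the per-class variant mentioned in later weeks would simply restrict the neighbor search to each class before averaging.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def gamma_from_knn(X, k=5):
    """Set gamma = 1/(2*sigma^2) with sigma the average k-nearest-neighbor distance."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
    dist, _ = nn.kneighbors(X)
    sigma = dist[:, 1:].mean()                        # drop the zero self-distance column
    return 1.0 / (2.0 * sigma ** 2)

# Usage (X_train, y_train, X_val, y_val assumed given):
# gamma = gamma_from_knn(X_train)
# for C in [0.1, 1, 10, 100]:
#     svm = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
#     print(C, svm.score(X_val, y_val))
```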
Practiced talks for the MAA meeting
Discussed nonparametric discriminant analysis and separate PCA for kernel SVM.
Assignments
- For group 1
- Set the tuning parameter in Gaussian kernel by average distance of points to their kNN in each class
- Find a way to use both the signs and the absolute values of w^T x + b in pairwise SVM
- Redo LDA+SVM with higher dimensional projections (selected based on the decay of the generalized eigenvalues)
- Continue exploring t-SNE
- When you have time, read this paper (equation 10) on diffusion maps which provide a different way to embed data
- Keep the ensemble methods in mind
- For group 2
- Implement the kNN classifier on your own, with various options such as metrics and weights (a rough sketch follows this list)
- Set the tuning parameter in Gaussian kernel by average distance of points to their kNN in each class
- Take 20% of the negatives and all of the positives to test the Gaussian SVM
- Review how you preprocessed the original data and see whether any part should be modified
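For the "implement kNN on your own" item above, here is a minimal Python sketch with a pluggable distance metric and optional inverse-distance weighting. The particular metrics and weight scheme are illustrative choices, not a prescription.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=5, metric="euclidean", weighted=False):
    preds = []
    for x in X_test:
        if metric == "euclidean":
            d = np.linalg.norm(X_train - x, axis=1)
        elif metric == "l1":
            d = np.abs(X_train - x).sum(axis=1)
        elif metric == "cosine":
            d = 1 - (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
        else:
            raise ValueError(metric)
        idx = np.argsort(d)[:k]                        # indices of the k nearest neighbors
        if weighted:
            votes = {}
            for label, di in zip(y_train[idx], d[idx]):
                votes[label] = votes.get(label, 0.0) + 1.0 / (di + 1e-12)   # inverse-distance votes
            preds.append(max(votes, key=votes.get))
        else:
            preds.append(Counter(y_train[idx]).most_common(1)[0][0])        # majority vote
    return np.array(preds)
```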
Lecture: XGBoost (by Sha) and SVM
See here for a presentation (similar to Sha's lecture) and here for more resources.
Assignments
- Finish any leftover tasks from previous weeks
- Read the following lectures on SVM
Make sure you fully understand SVM (we'll have more discussions on the SVM ideas next time).
- LibSVM - A Library for Support Vector Machines: [website] [Matlab code], which is a state-of-the-art implementation. You may also want to check out LibLinear [website] [Matlab code], a fast, linear classifier specially designed for data with millions of instances and features.
- Explore a comprehensive SVM website for more references and software.
- Here is another centralized website to explore (if you have time)
- Present your results at next meeting (let's reserve one hour for each group).
Lecture: Random forest (by Xiaoyan) and LDA+QR (by Wilson)
References on random forests
- Trees, Bagging, Random Forests and Boosting by T. Hastie of Stanford;
- Trees and Random Forests by Adele Cutler of Utah State (A slightly longer version is here);
- This short article provides an introduction to the R randomForest package
References on multiclass logistic regression
- Wikipedia page;
- This short write-up describes a little bit about multiclass logistic regression;
- The Matlab function is mnrfit.m (documentation is here).
Assignments
- Team 1
- Redo multiclass logistic regression using the Matlab function mnrfit.m
- Study and implement SVM. Teach us how it works and present your results in the next meeting
- Read the references on random forest (at some point we will need to try it too)
- Team 2
- Optimize kNN by (1) trying different values of k, (2) incorporating the cosine of the angle and the correlation coefficient as metrics, and (3) introducing weights
- Compare logistic regression with/without PCA and try different classifiers on the probabilities
- Explore the classifier XGBoost and present in the next meeting both how it works and your results (a rough starter sketch follows this list)
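A hedged sketch of getting started with XGBoost through its Python scikit-learn wrapper. The parameter values are illustrative only, and the digits data is merely a stand-in so the snippet runs end to end; swap in your own (numerically encoded) Springleaf matrices.

```python
import xgboost as xgb
from sklearn.datasets import load_digits            # stand-in data, just to make the sketch runnable
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = xgb.XGBClassifier(
    n_estimators=200,     # number of boosted trees
    max_depth=6,          # depth of each tree
    learning_rate=0.1,    # shrinkage
)
clf.fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```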
Lecture: Bayes classifier (MAP)
Assignments for both teams
- Try the following techniques to handle the singular covariance matrices that arise in LDA:
- Previously proposed strategies include pseudo-inverse LDA, PCA+LDA, and regularized LDA (two of these are sketched after this list). These papers [1] [2] [3] review those methods while proposing a new technique (QR+LDA).
- Here is another technique: Two dimensional LDA (2DLDA). This paper - A note on 2DLDA - actually shows that 2DLDA is not always better than LDA in terms of discriminating power.
- Test the following classifiers in the LDA space:
- Kmeans
- Minimum training error
- Maximum A Posteriori (MAP)
- Record both the descriptions of the above methods and experimental results in your report. Also, prepare to present them in the next meeting.
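A hedged Python sketch of two of the listed workarounds for the singular covariance problem: regularized (shrinkage) LDA and PCA followed by LDA. The shrinkage setting and the 50 PCA components are illustrative, not tuned values.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Regularized LDA: shrink the within-class covariance toward a multiple of the identity.
reg_lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

# PCA+LDA: reduce dimension first so the covariance estimate is no longer singular.
pca_lda = make_pipeline(PCA(n_components=50), LinearDiscriminantAnalysis())

# Usage (X_train, y_train, X_test, y_test assumed given):
# for name, model in [("regularized LDA", reg_lda), ("PCA+LDA", pca_lda)]:
#     model.fit(X_train, y_train)
#     print(name, model.score(X_test, y_test))
```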
No lecture
Assignments for both teams
- Add both your slides (that were presented today) and your report to Google Drive, if you haven't done so already.
- Learn the following linear classifiers on your own (we will go over them in the next meeting)
- Try the following Matlab functions or packages on your data set:
- LDA: Try the Matlab built-in function 'classify.m' and also this package
- Logistic Regression: Try both 'mnrfit.m' and 'glmfit.m', which are built into Matlab (a rough Python alternative is sketched below)
If you have purchased the Matlab 2015b student version (for $99, which comes with lots of toolboxes), you should check out the Statistics and Machine Learning toolbox, where you can find most classifiers already implemented; see this webpage for an overview.
- Present your results in the next meeting. In the meantime, add descriptions of these algorithms to your report.
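An optional, hedged aside for anyone working in Python rather than Matlab: a multinomial logistic regression baseline roughly analogous to mnrfit.m, via scikit-learn. This is an alternative illustration, not part of the assignment.

```python
from sklearn.linear_model import LogisticRegression

def fit_multiclass_logreg(X_train, y_train):
    # scikit-learn's default solver handles multiple classes with a softmax
    # (multinomial) model, similar in spirit to mnrfit.m's nominal regression.
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Usage (X_train, y_train, X_test, y_test assumed to be numeric arrays):
# model = fit_multiclass_logreg(X_train, y_train)
# print("accuracy:", model.score(X_test, y_test))
# probs = model.predict_proba(X_test)    # per-class probabilities
```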
Team 2 only
- Implement the various options, such as scaling and the handling of missing values, as discussed today (a rough preprocessing sketch follows this list)
- Test the knn classifier with k up to 10 on the cleaned data. Also try different metrics, such as Euclidean, L1, cosine of the angle, and correlation coefficient. Compare knn with LDA and logistic regression.
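A hedged Python sketch of the preprocessing options mentioned above: simple missing-value imputation followed by per-column scaling, chained into a kNN classifier. The median strategy and k=5 are placeholders, not the group's actual choices.

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Impute missing entries with the column median, then standardize each column,
# then classify with kNN on the cleaned features.
knn = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5),
)

# Usage (X_train with NaNs for missing entries, y_train assumed given):
# knn.fit(X_train, y_train)
```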
No lecture
Assignments for BOTH teams
- Download NX client from this website and install it on your laptop. Afterwards, follow the instructions stored under the Google Drive folder 'Common/Golub/' to configure your account (Both account information and a key file can also be found there). Make sure you can connect to Golub and start using it before the weekend.
- Start your report (and write as much as you can) by following the outline I wrote on the board today. Use LaTeX. I hope to see a first version in PDF format by next meeting.
- Prepare slides that include visuals (such as figures and tables) to convey ideas and results. Each group should limit its presentation to 25 minutes (+ 5 additional minutes for further questions).
- Read the tutorial on LDA. We will go over it in the next meeting. More references can be found here:
Team 1 assignments
- Re-do the local kmeans classifier, trying different numbers of neighbors (up to 10).
- Re-run the weighted KNN classifier with different kinds of weights and different choices of k, and display all the results in a figure (a rough plotting sketch follows this list). Make sure you label each axis and add legends to increase readability. Always try to beautify your figures, as they will be included in the report and possibly later in a formal publication.
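A hedged Python sketch of the kind of labeled figure requested: held-out accuracy versus k for two standard weighting schemes. The scikit-learn classifier here is a stand-in for whatever kNN variant your group implemented, and the file name is arbitrary.

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

def plot_knn_accuracies(X_tr, y_tr, X_te, y_te, ks=range(1, 11)):
    """Plot held-out accuracy versus k for uniform and inverse-distance weights."""
    for weights in ["uniform", "distance"]:
        accs = [KNeighborsClassifier(n_neighbors=k, weights=weights)
                .fit(X_tr, y_tr).score(X_te, y_te) for k in ks]
        plt.plot(list(ks), accs, marker="o", label=f"{weights} weights")
    plt.xlabel("k (number of neighbors)")
    plt.ylabel("classification accuracy")
    plt.legend()
    plt.title("Weighted kNN accuracy")
    plt.savefig("knn_weights.png", dpi=200)
```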
Team 2 assignments
- Continue with those tasks listed in the previous week (Week 2)
- Present your method and results to the class (you can also talk about some basics of R programming as well as important lines of your code)
Lecture: MATLAB programming and applications of matrix SVD
MATLAB scripts and data used in class
See Google Drive folder.
Team 1 assignments
- Perform PCA on the digits data and show the center, principal directions, and components for each digit (a rough visualization sketch follows this list)
- Implement the kmeans classifier (both global and local). For the local kmeans classifier, try different numbers of neighbors. Meanwhile, you will need to do nearest neighbor search, which can take a lot of time. I suggest reducing the dimensionality of the data first (by PCA)
- Implement the knn classifier and run it with different choices of k (again, dimensionality reduction would help with speed).
You need to be able to reproduce the results I presented at the department colloquium. A very simple tutorial on knn classification is available here: [PDF]. If you want to learn a Python implementation of KNN classification, see KNN in Python.
- Learn weighted KNN classification: Reference 1, Reference 2 (more to be added)
- Record results and present them in the next meeting. Meanwhile, add to the report descriptions of your data, what you did, and what you obtained.
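A hedged Python sketch of the per-digit PCA visualization: fit PCA within one digit class and display the class mean ("center") and a few principal directions as 28x28 images. Loading the Kaggle CSV into the arrays X (n_samples x 784) and y (digit labels) is assumed to happen elsewhere; the digit and the number of directions are arbitrary.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def show_digit_pca(X, y, digit=3, n_dirs=4, side=28):
    Xd = X[y == digit]                                   # all images of this digit
    pca = PCA(n_components=n_dirs).fit(Xd)
    panels = [Xd.mean(axis=0)] + list(pca.components_)   # center, then principal directions
    titles = ["center"] + [f"direction {i + 1}" for i in range(n_dirs)]
    fig, axes = plt.subplots(1, n_dirs + 1, figsize=(2 * (n_dirs + 1), 2))
    for ax, img, title in zip(axes, panels, titles):
        ax.imshow(img.reshape(side, side), cmap="gray")
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```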
Team 2 assignments
- Perform an initial analysis of the data set and report its size, the numbers of continuous and categorical variables, the number of missing values and their locations, etc. (a rough sketch follows this list)
- Apply PCA to the continuous variables and see if you can gain any insight (display the principal coordinates in different colors representing the two groups)
- Check the correlation of each categorical variable with the response variable and see if you can spot any useful or useless variables
- Continue exploring the online forums on Kaggle to learn as many insights as possible
- Learn KNN classification with categorical data: webpage 1, R function
- Record results and present them in the next meeting. Meanwhile, add to the report descriptions of your data, what you did, and what you obtained.
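A hedged Python sketch of the initial look at the data with pandas; the file name is a placeholder for the actual Kaggle download.

```python
import pandas as pd

df = pd.read_csv("train.csv")                            # hypothetical file name

print("shape:", df.shape)
num_cols = df.select_dtypes(include="number").columns    # continuous/numeric variables
cat_cols = df.select_dtypes(exclude="number").columns    # categorical (non-numeric) variables
print("continuous variables:", len(num_cols))
print("categorical variables:", len(cat_cols))

missing = df.isna().sum()                                # missing values per column
print("total missing values:", int(missing.sum()))
print("columns with the most missing values:")
print(missing.sort_values(ascending=False).head(10))
```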
Lecture: Matrix SVD
Assignments
- Read the CAMCOS handbook
- Read the SVD handout
- Explore Kaggle.com and the competition pages
- Download your data set
- Install/check software (MATLAB, R, Python) and perform very preliminary experiments (e.g., you need to be able to load the data). Try to understand your data as much as you can.
- Here are some resources to learn the three programming languages: Matlab tutorials, DataCamp's Introduction to R course, Google's Python Class
- Start version 0 of the report by listing the section titles such as abstract, introduction/background, methodology, experiments, etc. Meanwhile, you can already include a description of the Kaggle competitions and the particular one your group will work on. LaTeX preferred (or required).
The slides presented in the department colloquium are here: [PDF]