Homework #2 |
Spring 2007 |
Boosting and Bagging |
Due: Friday, March 9 |
The written exercises can be found in this pdf file.
For the programming part of this assignment, you will implement the boosting algorithm AdaBoost as well as a related algorithm called "bagging", in both cases using decision trees as the base classifiers. You will then be asked to experiment with your implementation on a number of provided datasets, and to write up a brief report on your findings.
(Minor note on terminology: Classifier means the same thing as hypothesis, and base classifier, in the context of boosting, is exactly the same as a weak classifier.)
The first part of the assignment is to write an R function called boost.trees that implements the AdaBoost algorithm discussed in class. Your function should take as input training and test sets in the form of data frames, as well as the number of rounds of boosting T. It should then run AdaBoost for T rounds, using the decision-tree algorithm that is built in to R as the base learner. The function should then return the predictions of the final combined classifier on the given training and test examples, as well as the training and test error rate of the combined classifier following each of the T rounds.
"Bagging" (short for "bootstrap aggregating") is a different method for combining decision trees or other base classifiers. Similar to boosting, the base learning algorithm is run repeatedly in a series of rounds. However, the manner in which the base learner is called is different than in boosting. In particular, on each round, the base learner is trained on what is often called a "bootstrap replicate" of the original training set. Suppose the training set consists of m examples. Then a bootstrap replicate is a new training set that also consists of m examples, and which is formed by repeatedly selecting uniformly at random and with replacement m examples from the original training set. This means that the same example may appear multiple times in the bootstrap replicate, or it may appear not at all.
Thus, on each of T rounds of bagging, a bootstrap replicate is created from the original training set. A base classifier is then trained on this replicate, and the process continues. After T rounds, a final combined classifier is formed which simply predicts with the majority vote of all of the base classifiers.
Here is pseudocode:
The second part of the assignment is to write an R function called bag.trees that implements this bagging procedure. The input parameters and returned values of this function should be just like those for your boosting function. And like the boosting function, your bagging function should use decision trees as the base learner.
R has a package for learning decision trees called rpart. We will use this package as the base learner. To avoid having to learn the intricacies of using this package, we have provided a simplified function called tree.learn that should be used for computing decision trees. This function takes as input training and test sets, computes a decision tree, and returns the predictions of the decision tree on the training and test examples. In addition, the function accepts an optional argument that can be used to assign different weights to the different training examples as is required for boosting. You also can optionally limit the depth of the trees that are learned.
We are providing three "real-world" datasets, plus one "artificial" dataset.
The letter dataset contains descriptions of the characters "C" and "G", the goal being to distinguish between these two letters. The class label is either C or G. There are 16 attributes for things like the width of the letter and the total number of pixels turned on. There are 500 training and 1009 test examples. More detailed information about this dataset and the various attributes is available here (obviously, we used only the letters C and G).
The credit dataset classifies people described by a set of attributes as good or bad credit risks. There are 20 attributes encoding credit history, purpose of the loan, employment status, etc. There are 400 training and 600 test examples. More detailed information about this dataset and the various attributes is available here (we used the "original" dataset, and did not make use of the "cost matrix").
The spam dataset classifies email messages as spam or ham. The 57 attributes mainly encode the number of times that certain words or characters occur. There are 1000 training and 3601 test examples. More detailed information about this dataset and the various attributes is available here (where we renamed and abbreviated the feature names in a straightforward way).
The dataset twonorm is generated artificially using a provided R function. (Note that the dataset will be slightly different each time it is loaded.) There are two equally occurring classes, c0 and c1. Examples have 20 attributes, each of which is generated as a normal distribution with variance one. Attributes for class c0 examples are generated with mean c, while attributes for class c1 examples have mean -c, where c is the constant 2/sqrt(20). There are 500 training and 2000 test examples. (In addition, we are providing a generator, so you can create datasets of whatever size you wish.)
Your code need only work on binary (two-class) datasets, such as those that we are providing. Note also that the class label to be predicted on each of the provided datasets is called "Class".
You will need to download all of the files in this directory, or you can download this zip file to get everything all at once. After you have done so, and placed the files in your R working directory, you can load all of the datasets using the command source("load-data.R"). Assuming this works correctly, the various training and test datasets will exist in the variables letter.train, letter.test, credit.train, etc.
In addition, the file trees.R contains the tree.learn function described above, as well as templates for the boost.trees and bag.trees functions. Your functions must follow this template, and must accept the specified input parameters and return values. (However, you can add additional return values. You also may modify the functions to take additional optional input parameters, provided that the default values of these additional parameters cause the function to behave normally as specified.)
Making sure that algorithms of this kind are working correctly is essential, but can be difficult since it is not always obvious what correct behavior looks like. For this reason, it is important to take extra care to test your implementations. One way of doing so is by trying small datasets on which it might be possible to compute the correct answer by hand. Also, because R is an interactive environment, it is easy to run your code one line at a time, and to check interactively that each line is doing the right thing.
Once you have your boosting and bagging implementations working, your next step is to run some experiments using them on the provided datasets. You should try out both boosting and bagging on the four provided datasets. You also should experiment both with deep trees and very shallow trees (say, of depth only one, often called "decision stumps"). Thus, there are (at least) sixteen combinations to consider: there are four datasets, on each of which you can try any combination of bagging or boosting on deep or shallow trees.
Your aim should be to compare the algorithms, and to try to figure out which algorithm is most effective when, and why. Your experiments certainly should include a look at how the test error evolves as a function of the number of rounds by running the various algorithms for at least a few hundred rounds. Compare the behavior and performance of the different algorithms on the different datasets, and check for overfitting. Also compare performance to using the base learner on its own, that is, without bagging or boosting. Be as perceptive and observant as possible, and try to think about how to explain what you observe. Think about experiments to test your ideas, or to explore the algorithms further. For instance, you might look at how the training and test errors of the base classifiers are evolving, or you might make a plot of the margin distribution, or you might plot error as a function of the number of training examples. You can also try creating new synthetic datasets, or modifying the provided datasets (for instance, by randomly flipping some of the labels) to see how performance is affected. You might be led to think about ways of improving these algorithms, which you are welcome to try out.
With regard to bagging which we did not discuss in class, you should specifically think about what the underlying assumptions might be for such an algorithm, and thus what the right conditions might be for it to be effective at producing a combined classifier that is more accurate than the decision trees that it is using as base classifiers. How do these assumptions compare to the assumptions for boosting? Again, you might try to devise experiments for testing your conjectures.
The end result of these experiments should be a brief written report, about 1-2 pages in length (not counting figures), describing in precise terms what experiments you ran, what observations you made, and what you can say about the questions and issues discussed above. You certainly should include plots showing the results of your experiments. This report should include at least the following:
You should turn in the file trees.R with the functions boost.trees and bag.trees filled in. Also turn in any other files that you may have created to carry out your explorations, for instance containing additional R functions. These should all be submitted electronically using moodle. You also should turn in your written report in hard copy together with your written exercises.
Part of your grade will be based on having a complete, correct and documented implementation of AdaBoost and bagging. We intend to test your code automatically, which is why it is important that you adhere to the given specifications. Your code also should not be unreasonably slow, and certainly should run on all of the given datasets in under a few minutes. Your written report should be thoughtful, perceptive, clear, concise, and should include an element of ingenuity in how you go about exploring the algorithms and datasets.
The programming part of this assignment will be worth about 75 points.