COS 402: Artificial Intelligence

Homework #7

Machine learning

Fall 2010

Due: Tuesday, January 11
(but one part of the assignment is due Thursday, January 6)


Dean's date reminder:  Unlike the other homeworks, this one must be turned in on time.  This final homework is due on "dean's date," the latest possible due date allowed by university policy.  As per university rules, this also means that you will need written permission from the appropriate dean to turn it in late.


Part I: Written Exercises

The written exercises are available in PDF in two versions, depending on which edition of R&N you are using.  Those using the 2nd edition should access this version of the exercises.  For the 3rd edition, use this version.


Part II:  Programming

The topic of this assignment is machine learning for supervised classification problems.  Here are the main components of the assignment:

  1. Implementation of the machine learning algorithm of your choice.
  2. Comparison of your learning algorithm to those implemented by your fellow students on a small set of benchmark datasets.
  3. A systematic experiment of your choice using your algorithm.
  4. A short written report describing and discussing what you did and what results you got.

For this assignment, you may choose to work individually or in pairs.  You are encouraged to work in pairs since you are likely to learn more, have more fun and have an easier time overall.  (The written exercises should still be done individually though.)

Note that part of this assignment must be turned in by Thursday, January 6.  See "What to turn in" below.  Also, be sure to plan your time carefully as the systematic experiment may take hours or days to run (depending on what you decide to do for this part).


A machine learning algorithm

The first part of the assignment is to implement a machine learning algorithm of your choice.  We have discussed several algorithms including naive Bayes, decision trees, AdaBoost, SVM's and neural nets.  R&N discuss others including decision stumps and nearest neighbors.  There are a few other algorithms that might be appropriate for this assignment such as the (voted) perceptron algorithm and bagging.  You may choose any of these to implement.  More details of these algorithms are given below, in some cases with pointers for further reading.  For several of these algorithms, there are a number of design decisions that you will need to make; for instance, for decision trees, you will need to decide on a splitting criterion, pruning strategy, etc.  In general, you are encouraged to experiment with these algorithms to try to make them as accurate as you can.  You are welcome to try your own variants, but if you do, you should compare to the standard vanilla version of the algorithm as well.

If you are working individually, you should implement one algorithm.  If you are working with a partner, the two of you together should implement two algorithms.  You are welcome to implement more algorithms if you wish.

If it happens that you have previously implemented a learning algorithm for another class or independent project, you should choose a different one for this homework.

For this assignment, you may wish to do outside reading about your chosen algorithm, but by no means are you required to do so.  Several books on machine learning have been placed on reserve at the Engineering Library.   Although outside reading is allowed, as usual, copying, borrowing, looking at, or in any way making use of actual code that you find on-line or elsewhere is not permitted.  Please be sure to cite all your outside sources (other than lecture and R&N) in your report.

Notes on debugging:  It can be very hard to know if a machine learning program is actually working.  With a sorting program, you can feed in a set of numbers and see if the result is a sorted list.  But with machine learning, we usually do not know what the "correct" output should be.  Here are some suggestions for debugging your program:


Comparison on benchmark datasets

We have set up a mechanism whereby you will be able to compare the performance of your program to that of your fellow students.  The idea is to simulate what happens in the real world where you need to build a classifier using available labeled data, and then you use that classifier on new data that you never got to see or touch during training.  Here is how it works:

We are providing four benchmark datasets described below.  Each dataset includes a set of labeled training examples and another set of unlabeled test examples.  Once you have your learning algorithm written, debugged and working, you should try training your algorithm on the training data and producing a set of predictions on all of the test examples.  These predictions can then be submitted using DropBox.  If you then press the "Check All Submitted Files" button, your submitted predictions will be compared to the correct test labels and the resulting test error rate will be posted here where everyone can see how well everyone else's programs are performing.

The website will show, for each such submission, the date submitted, the author(s) of the program, a short description that you provide of the learning algorithm used, and the test error rate achieved.  The name listed as "author" will be a name that you provide.  So, if you wish to remain anonymous on the website, you can do so by using a made-up name of your choice, or even a random sequence of letters, but something that will allow you to identify your own entry.  (However, please refrain from using a name that might be considered offensive or inappropriate.  Also, please avoid using strange characters that might cause problems for our scripts.)

The "description" you provide in submitting your test predictions should clearly describe the algorithm you are using, and any important design decisions you made (such as parameter settings).  This one or two sentence description should be as understandable as possible to others in the class.  For instance, try to avoid cryptic abbreviations.  (In contrast, the "author" you provide in submitting your test predictions can be any name you wish.)  While being brief, try to give enough information that a fellow classmate might have a reasonable chance of reproducing your results.

Once you have seen your results on test data, you may wish to try to improve your algorithm, or you may wish to try another algorithm altogether (although the assignment does not require you to do so).  Once you have done this, you may submit another set of test predictions.  However, to avoid the test sets becoming overused (leading to statistically meaningless results), each student will be limited to submitting three sets of predictions for each benchmark dataset.  Note that this limit is per student, not per team; in other words, if you are working as a pair, then together you can submit up to six sets of predictions per dataset.

Your grade will not depend on how accurate a classifier you are able to produce relative to the rest of the class.  Although you are encouraged to do the best you can, this should not be regarded as anything more than a fun (I hope) communal experiment exploring various machine learning algorithms.  This also means that, in choosing an algorithm to implement, it is more important to choose an algorithm that interests you than to choose one that you expect will give the best accuracy.  The greater the variety of algorithms implemented, the more interesting the results of our class experiment will be, even if some of those algorithms perform poorly.


A systematic experiment

The third part of this assignment is to run a single systematic experiment.  For instance, you might want to produce a "learning curve" such as the one shown in R&N Figure 18.11a (2nd ed.) / 18.35a (3rd ed.).  In such a curve, the accuracy (or error) is measured as the training set size is varied over some range.  To be more specific, here is how you might produce such a curve.  The provided test datasets cannot be used here since they are unlabeled, and since you are limited to making only three sets of predictions on each.  Instead, you can split the provided training set into two subsets, one for training, and the other for measuring performance.  For instance, if you have 2000 training examples, you might hold out 1000 of those examples for measuring performance.  You can then run your learning algorithm on successively larger subsets of the remaining 1000 examples, say of size 20, 50, 100, 200, 500, 1000.  Each run of your algorithm will generate a classifier whose error can be measured on the held-out set of 1000 examples.  The results can then be plotted using matlab, gnuplot, excel, etc.  Note that such an experiment, though requiring multiple runs of the same algorithm, can be programmed to run entirely automatically, thus reducing the amount of human effort required, and also substantially reducing the chance of making mistakes.
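
To make this concrete, here is a rough sketch of how such a learning-curve experiment might be automated.  It assumes you have already pulled the labeled training data out of the provided DataSet object into arrays; the names trainEx, trainLabel, and the classifier MyClassifier are placeholders for whatever your own code uses.

// Sketch of an automated learning-curve experiment.  trainEx/trainLabel hold
// the labeled training examples; MyClassifier is a stand-in for your learner.
import java.util.Arrays;

public class LearningCurve {
    static void run(int[][] trainEx, int[] trainLabel) {
        int holdOut = 1000;                              // examples reserved for evaluation
        int[] sizes = {20, 50, 100, 200, 500, 1000};

        for (int size : sizes) {
            // train on the first `size` examples that come after the held-out block
            int[][] trX = Arrays.copyOfRange(trainEx, holdOut, holdOut + size);
            int[]   trY = Arrays.copyOfRange(trainLabel, holdOut, holdOut + size);
            MyClassifier c = new MyClassifier(trX, trY); // hypothetical constructor that trains

            // measure error on the held-out block
            int mistakes = 0;
            for (int i = 0; i < holdOut; i++)
                if (c.predict(trainEx[i]) != trainLabel[i])
                    mistakes++;
            System.out.println(size + "\t" + ((double) mistakes / holdOut));
        }
    }
}

The printed pairs (training-set size, held-out error) can then be plotted directly.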

This is just one possible experiment you might wish to run.  There are many other possibilities.  For instance, if you are using neural nets, you might want to plot accuracy as a function of the number of epochs of training.  Or if you are using boosting, you might plot accuracy as a function of the number of rounds.  Another possibility is to compare the accuracy of two different variants of the same algorithm, for instance, decision trees with and without boosting, or decision trees with two different splitting criteria.

This general approach of holding out part of the training set may also be helpful for improving the performance of your learning algorithm without using the "real" test set.  For instance, if your algorithm has a parameter (like the learning rate in neural nets) that needs to be tuned, you can try different settings and see which one seems to work the best on held out data.  (This could also count as a systematic experiment.)  You might then use this best setting of the learning rate to train on the entire training set and to generate test predictions that you submit for posting on the class website.

In general, such held-out sets should consist of about 500-1000 examples for reliable results.

Note that systematic experiments of this kind can take a considerable amount of computation time to complete, in some cases, many hours or even days, depending on the experiment and the algorithm being used.  Therefore, it is very important that you start on this part of the assignment as early as possible.

If you are working as a pair, it is okay to do just one experiment.  However, whatever experiments you do should involve at least two of the algorithms that you implemented.  For instance, you might produce learning curves for both.


A written report

The fourth part of this assignment is to write up your results clearly but concisely in a short report.  Your report should include all of the following (numbers in brackets indicate roughly how many paragraphs you might want to write for each bullet):

If you are working as a pair, the two of you together only need to submit a single, jointly authored report (in which case, your report might be slightly longer than indicated by the numbers above).


The code we are providing

We are providing a class DataSet for storing a dataset, and for reading one in from data files that we provide or that you generate yourself for testing.  Each dataset is described by an array of training examples, an array of labels and an array of test examples.  Each example is itself an array of attribute values.  There are two kinds of attributes: numeric and discrete.  Numeric attributes have numeric values, such as age, height, weight, etc.  Discrete attributes can only take on values from a small set of discrete values, for instance, sex (male, female), eye color (brown, blue, green), etc.  Below, we also refer to binary attributes; these are numeric attributes that happen to only take the two values 0 and 1.

Numeric attributes are stored by their actual value as an integer (for simplicity, we don't allow floating point values).  Discrete attributes are stored by an index (an integer) into a set of values.  The DataSet class also stores a description of each attribute including its name, and, in the case of discrete attributes, the list of possible values.  Labels are stored as integers which must be 0 or 1 (we will only consider two-class problems).  The names of the two classes are also stored as part of the DataSet class.

A dataset is read in from three files with the names <stem>.names, <stem>.train and <stem>.test.  The first contains a description of the attributes and classes.  The second and third contain the labeled training examples and unlabeled test examples.  A typical <stem>.names file looks like the following:

yes   no
age         numeric
eye-color   brown  blue  green

The first line must contain the names of the two classes, which in this case are called "yes" and "no".  After this follows a list of attributes.  In this case, the second line of the file says that the first attribute is called "age", and that this attribute takes numeric values.  The next line says that the second attribute is called "eye-color", and that it is a discrete attribute taking the three values "brown", "blue" and "green".

A typical <stem>.train file might look like this:

33   blue   yes
15   green  no
25   green  yes

There is one example per line consisting of a list of attribute values (corresponding to those described in the <stem>.names file), followed by the class label.

A <stem>.test file has exactly the same format except that the label is omitted, such as the following:

33 green
19 blue

The DataSet class has a constructor taking a file-stem as an argument that will read in all three files and set up the public fields of the class appropriately.  The .train and .names files are required, but not the .test file (if no .test file is found, a non-fatal warning message will be printed and an empty test set established).

Working with several different kinds of attributes can be convenient when coding a dataset but a nuisance when writing a machine learning program.  For instance, neural nets prefer all of the data to be numeric, while decision trees are simplest to describe when all attributes are binary.  For this reason, we have provided additional classes that will read in a dataset and convert all of the attributes so that they all have the same type.  This should make your job much, much simpler.  Each of these classes is in fact a subclass of DataSet (see the references listed on the course home page for an explanation of subclasses and how to use them), and each has a constructor taking as argument a file-stem.  The three classes are NumericDataSet, BinaryDataSet and DiscreteDataSet, which convert any dataset into data that is entirely numeric, binary or discrete.  (In addition, BinaryDataSet is a subclass of NumericDataSet.)  So, for instance, if you are using neural nets and want your data to be entirely numeric, simply load the data using a command like this:

ds = new NumericDataSet(filestem);

Using these subclasses inevitably has the effect of changing the attributes.  When converting discrete attributes to numeric or binary, a new binary attribute is created for each value.  For instance, the eye-color attribute will become three new binary attributes: eye-color=brown, eye-color=blue and eye-color=green; if eye-color is blue on some example, then eye-color=blue would be given the value 1, and the others the value 0.  A numeric (non-binary) attribute is converted to binary by creating new binary attributes in a similar fashion.  Thus, the numeric attribute age would be replaced by the binary attributes age>=19, age>=25, age>=33.  If age actually is 25, then age>=19 and age>=25 would be set to 1, while age>=33 would be set to 0.  When converting a numeric (including binary) attribute to discrete, we simply regard it as a discrete attribute that can take on the possible values of the original numeric attribute.  Thus, in this example, age would now become a discrete attribute that can take on the values "15", "19", "25" and "33".  Note that all ordering information has been lost between these values.

If you produce your own dataset, it is important to know that the provided code assumes that numeric attributes can only take on a fairly small number of possible values.  If you try this code out with a numeric attribute that takes a very large number of values, you probably will run into memory and efficiency issues.  All of the provided datasets have been set up so that this should not be a problem.

The DataSet class also includes a method printTestPredictions that will print the predictions of your classifier on the test examples in the format required for submission.  The output of this method should be stored in a file called <stem>.testout and submitted using DropBox.

We also are providing an interface called Classifier that your learning program and the computed classifier (hypothesis) should adhere to.  This interface has three methods: predict, which will compute the prediction of the classifier on a given example; algorithmDescription, which simply returns a very brief but understandable description of the algorithm you are using for inclusion on the course website; and author, which returns the "author" of the program as you would like it to appear on the website (can be your real name or a pseudonym).  A typical class implementing this interface will also include a constructor where the actual learning takes place.

A very simple example of a class implementing the Classifier interface is provided in BaselineClassifier.java which also includes a simple main for loading a dataset, training the classifier and printing the results on test data.
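
To give a feel for the shape of such a class, here is a bare-bones sketch of a classifier implementing the interface, modeled loosely on the idea of always predicting the majority class.  The exact signatures should be copied from the provided Classifier.java and BaselineClassifier.java; in particular, the assumption below is that predict takes an example as an int[] of attribute values and returns a 0/1 label.

// Sketch only: a majority-vote classifier.  The predict signature (int[] -> int)
// is an assumption; copy the real one from the provided Classifier interface.
public class MajorityClassifier implements Classifier {
    private int majorityLabel;

    // learning happens in the constructor: remember the most common training label
    public MajorityClassifier(int[] trainLabels) {
        int ones = 0;
        for (int y : trainLabels)
            ones += y;                                   // labels are 0 or 1
        majorityLabel = (2 * ones >= trainLabels.length) ? 1 : 0;
    }

    public int predict(int[] example) {
        return majorityLabel;                            // ignores the example entirely
    }

    public String algorithmDescription() {
        return "Always predicts the majority class of the training set.";
    }

    public String author() {
        return "anonymous aardvark";                     // a pseudonym is fine
    }
}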

All code and data files can be obtained from this directory, or all at once from this zip file.  Data is included in the data subdirectory.

Documentation on the provided Java classes is available here.


The datasets we are providing

We are providing four datasets, all consisting of real-world data suitably cleaned and simplified for this assignment.

The first two datasets consist of optical images of handwritten digits.  Some examples are shown in R&N Figure 20.29 (2nd ed.) / 18.36 (3rd ed.).  (The data we are providing actually comes from the same source, although ours have been prepared somewhat differently.)  Each image is a 14x14 pixel array, with 4 pixel-intensity levels.  The goal is to recognize the digit being represented.  In the first and easier dataset with file-stem ocr17, the goal is to distinguish 1's from 7's.  In the second and harder dataset with file-stem ocr49, the goal is to distinguish 4's from 9's.

The third dataset consists of census information.  Each example corresponds to a single individual with attributes such as years of education, age, race, etc.  The goal is to predict whether this individual has an income above or below $50,000.  The file-stem is census.

The fourth dataset consists of DNA sequences of length 60.  The goal is to predict whether the site at the center of this window is a "splice" or "non-splice" site.  The file-stem is dna.

The DNA dataset consists of 1000 training examples and 2175 test examples.  All of the other datasets consist of 2000 training examples and 4000 test examples.

It might happen that you think of ways of figuring out the labels of the test examples, for instance, manually looking at the OCR data to see what digit is represented, or finding these datasets on the web.  Please do not try anything of this kind, as it will ruin the spirit of the assignment.  The test examples should not be used for any purpose other than generating predictions that you then submit.  You should pretend that the test examples will arrive in the future after the classifier has already been built and deployed.


The code that you need to write

Your job is to create one or more Classifier classes implementing your learning algorithm and the generated classifier.  Since we do not intend to do automatic testing of your code, you can do this however you wish.  You also will probably need to write some kind of code to carry out your systematic experiment.

Because we will not be relying on automatic testing, we ask that you make an extra effort to document your code well to make it as readable as possible.


What to turn in

For the purposes of submitting to DropBox, we have divided this assignment into two parts.

On DropBox, under the assignment called "Machine Learning test predictions", you should turn in the following:

In addition, under the assignment called "Machine Learning code", you should turn in the following:

Finally, your program report should be submitted in hard copy as described on the assignments page.

If you are working with a partner, the two of you together only need to submit your code once, and you only need to prepare and turn in a single written report.  Be sure that it is clear who your partner is.  In all cases, the written exercises should be completed and turned in individually.

You do not need to submit any code or anything in writing by Thursday, January 6.  The only thing you need to submit by that date is a single round of predictions on each of the test sets.  The reason this part is due before the rest of the assignment is so that you will have time to compare your results to those of your fellow classmates when you write up your report.  You can continue to submit test predictions (up to three rounds, including the one due on January 6), up until the assignment due date (Tuesday, January 11).


What you will be graded on

You will be graded on completing each of the components of this assignment, as described above.  More emphasis will be placed on your report than on the code itself.  You have a great deal of freedom in choosing how much work you want to put into this assignment, and your grade will in part reflect how great a challenge you decide to take on.  Creativity and ingenuity will be one component of your grade.  Here is a rough breakdown, with approximate point values in parentheses, of how much each component is worth:

As noted above, your grade will not at all depend on how well your algorithm actually worked, provided that its poor performance is not due to an incorrect or incomplete implementation.

Full credit for this assignment will be worth around 85 points.  However, exceptional effort will receive extra credit of 5-20 points.


Algorithms you can choose from

Here is a list of machine learning algorithms you can choose from for the programming assignment.  Most of these are described further in the books requested for reserve at the Engineering Library (see below).  A few additional pointers are also provided below.  (Note that to access some of these links, you may need to be on a computer that is inside the Princeton network.  If you have difficulty accessing any of these, please let me know.)

Be sure to take note of the R&N errata detailed on the written exercises for this homework.


Decision trees

These were discussed in class, and also in R&N Section 18.3.  To implement them, you will need to decide what splitting criterion to use, when to stop growing the tree and what kind of pruning strategy to use.
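
For example, the information-gain splitting criterion discussed in class and in R&N can be computed from the label counts on each side of a candidate split; here is a sketch for the two-class case used in this assignment.

// Sketch: binary entropy and information gain of a candidate split, given how
// many positive/negative examples fall on the left and right sides of the test.
public class SplitCriterion {
    // entropy of a two-class sample with `pos` positives out of `total`
    static double entropy(int pos, int total) {
        if (total == 0 || pos == 0 || pos == total) return 0.0;
        double p = (double) pos / total;
        return -p * (Math.log(p) / Math.log(2))
               - (1 - p) * (Math.log(1 - p) / Math.log(2));
    }

    // information gain of splitting (posL,negL | posR,negR) off a parent node
    static double infoGain(int posL, int negL, int posR, int negR) {
        int nL = posL + negL, nR = posR + negR, n = nL + nR;
        double parent = entropy(posL + posR, n);
        return parent - ((double) nL / n) * entropy(posL, nL)
                      - ((double) nR / n) * entropy(posR, nR);
    }
}

Growing the tree then amounts to choosing, at each node, the test with the highest gain and recursing on the two sides until your stopping or pruning rule applies.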


AdaBoost

This algorithm was discussed in class, and also in R&N Section 18.4 (2nd ed.) / 18.10 (3rd ed.).  AdaBoost is an algorithm that must be combined with another "weak" learning algorithm, so you will need to implement at least one other algorithm (which might work well if you are working as a pair).  Natural candidates for weak learning algorithms include decision stumps or decision trees.  You also will need to decide how the weak learner will make use of the weights Dt on the training examples.  One possibility is to design a weak learner that directly minimizes the weighted training error.  The other option is to select a random subset of the training examples on each round of boosting by resampling according to distribution Dt.  This means repeatedly selecting examples from the training set, each time selecting example i with probability Dt(i). This is done "with replacement", meaning that the same example may appear in the selected subset several times.  Typically, this procedure will be repeated N times, where N is the number of training examples.

Finally, you will need to decide on the number of rounds of boosting.  This is usually in the 100's to 1000's.
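
If you go the resampling route, the sampling step itself is short; here is a sketch, assuming the distribution Dt is stored as a double array that sums to one over the N training examples.

import java.util.Random;

// Sketch: draw N example indices, with replacement, according to the boosting
// distribution Dt (a double[] of length N summing to 1).
public class Resample {
    // draw a single index i with probability Dt[i] by walking the cumulative sum
    static int drawOne(double[] Dt, Random rand) {
        double r = rand.nextDouble();        // uniform in [0,1)
        double cum = 0.0;
        for (int i = 0; i < Dt.length; i++) {
            cum += Dt[i];
            if (r < cum) return i;
        }
        return Dt.length - 1;                // guard against rounding error
    }

    static int[] sample(double[] Dt, Random rand) {
        int[] chosen = new int[Dt.length];   // N draws, with replacement
        for (int j = 0; j < chosen.length; j++)
            chosen[j] = drawOne(Dt, rand);
        return chosen;
    }
}

The selected examples are then handed to the weak learner as an ordinary, unweighted training set for that round.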

See also this overview paper, or this survey.


Support-vector machines (SVM's)

This algorithm was discussed in class, and also in R&N Section 20.6 (2nd ed.) / 18.9 (3rd ed.).  Even so, we did not describe specific algorithms for implementing it, so if you are interested, you will need to do some background reading.  One of the books on reserve is all about kernel machines (including SVM's).  You can also have a look at this tutorial paper, as well as the references therein and some of the other resources and tutorials at www.kernel-machines.org.  The SMO algorithm is a favorite technique for computing SVM's.

Since implementing SVM's can be pretty challenging, you might instead want to implement the very simple (voted) perceptron algorithm, another "large margin" classifier described below which also can be combined with the kernel trick, and whose performance is substantially similar to that of SVM's.


Neural networks

This algorithm was discussed in class, and also in R&N Section 20.5 (2nd ed.) / 18.7 (3rd ed.) in considerable detail.  You will need to choose an architecture, and you have a great deal of freedom in doing so.  You also will need to choose a value for the "learning rate" parameter, and you will need to decide how long to train the network for.  You might want to implement just a single-layer neural network, or you might want to experiment with larger multi-layer networks.  You can also try maximizing likelihood rather than minimizing the sum of squared errors, as described in the 2nd ed. of R&N, Eq. (20.13) and the surrounding text.  (This seems to have been dropped from the 3rd ed., although maximum likelihood is explained in Section 20.2.1.)  It can be proved that taking this approach with a single-layer net has the important advantage that gradient ascent can never get stuck in a local maximum.  (This amounts to a tried-and-true statistical method called logistic regression, which is similar to, but different from, what is called logistic regression in the 3rd ed.)
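
As one concrete instance of the maximum-likelihood approach, here is a sketch of a single sigmoid unit trained by stochastic gradient ascent on the log-likelihood (i.e., logistic regression).  It assumes the examples have already been converted to numeric feature vectors with 0/1 labels, for instance via NumericDataSet.

// Sketch: one epoch of stochastic gradient ascent for a single sigmoid unit
// (logistic regression).  x[i] is a numeric feature vector, y[i] is 0 or 1.
public class SigmoidUnit {
    double[] w;          // one weight per feature
    double bias;
    double eta = 0.01;   // learning rate -- a parameter you will need to tune

    SigmoidUnit(int numFeatures) { w = new double[numFeatures]; }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    void epoch(double[][] x, int[] y) {
        for (int i = 0; i < x.length; i++) {
            double z = bias;
            for (int j = 0; j < w.length; j++) z += w[j] * x[i][j];
            double err = y[i] - sigmoid(z);             // gradient of the log-likelihood
            bias += eta * err;
            for (int j = 0; j < w.length; j++) w[j] += eta * err * x[i][j];
        }
    }

    int predict(double[] example) {
        double z = bias;
        for (int j = 0; j < w.length; j++) z += w[j] * example[j];
        return (z >= 0) ? 1 : 0;
    }
}

A multi-layer network adds hidden units and back-propagates the same kind of error term through them.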


Naive Bayes

We discussed this algorithm in class much earlier in the semester, but it can be used as a very simple algorithm for classification learning.  It is described in R&N at the very end of Section 13.6 (2nd ed.) / 13.5 (3rd ed.), and also in the middle of Section 20.2 (2nd ed.) / 20.2.2 (3rd ed.).  Although simple, and although the naive independence assumptions underlying it are usually wrong, this algorithm often works better than expected.  In estimating probabilities, you will probably want to work with log probabilities and use Laplace ("add-one") smoothing as in HW#5.  This algorithm works best with discrete attributes.
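
Here is a sketch of the prediction rule with add-one smoothing, assuming you have already tallied, during training, how often each value of each discrete attribute occurs within each class.

// Sketch: naive Bayes prediction over discrete attributes with add-one smoothing.
// counts[j][y][v] = # training examples of class y whose attribute j has value v;
// classCount[y]   = # training examples of class y;  numValues[j] = # values of attr j.
public class NaiveBayesSketch {
    static int predict(int[] example, int[][][] counts, int[] classCount, int[] numValues) {
        int total = classCount[0] + classCount[1];
        double best = Double.NEGATIVE_INFINITY;
        int bestClass = 0;
        for (int y = 0; y <= 1; y++) {
            // log prior, also smoothed (add one per class)
            double logp = Math.log((classCount[y] + 1.0) / (total + 2.0));
            // add log P(attribute j = observed value | class y) for each attribute
            for (int j = 0; j < example.length; j++)
                logp += Math.log((counts[j][y][example[j]] + 1.0)
                                 / (classCount[y] + numValues[j]));
            if (logp > best) { best = logp; bestClass = y; }
        }
        return bestClass;
    }
}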


Decision stumps

This is probably the simplest classifier on this list (and the least challenging to implement).  Decision stumps are briefly touched upon in R&N Section 18.4 (2nd ed.) / 18.10 (3rd ed.).  A decision stump is a decision tree consisting of just a single test node.  Given data, it is straightforward to search through all possible choices for the test node to build the decision stump with minimum training error.  These make good, truly weak, weak hypotheses for AdaBoost.
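
With binary attributes (as produced by BinaryDataSet), the exhaustive search is especially simple; here is a sketch.

// Sketch: find the decision stump with minimum training error over binary
// attributes.  x[i][j] is 0 or 1; y[i] is the 0/1 label of example i.
public class StumpSearch {
    int bestAttr;      // attribute to test
    int predIfOne;     // label predicted when the attribute is 1
    int predIfZero;    // label predicted when the attribute is 0

    StumpSearch(int[][] x, int[] y) {
        int n = x.length, d = x[0].length, bestErr = n + 1;
        for (int j = 0; j < d; j++) {
            // count the label-1 examples on each side of the test "x[.][j] == 1?"
            int onesWhen1 = 0, totalWhen1 = 0, onesWhen0 = 0, totalWhen0 = 0;
            for (int i = 0; i < n; i++) {
                if (x[i][j] == 1) { totalWhen1++; onesWhen1 += y[i]; }
                else              { totalWhen0++; onesWhen0 += y[i]; }
            }
            int p1 = (2 * onesWhen1 >= totalWhen1) ? 1 : 0;   // majority label on each side
            int p0 = (2 * onesWhen0 >= totalWhen0) ? 1 : 0;
            int err = (p1 == 1 ? totalWhen1 - onesWhen1 : onesWhen1)
                    + (p0 == 1 ? totalWhen0 - onesWhen0 : onesWhen0);
            if (err < bestErr) { bestErr = err; bestAttr = j; predIfOne = p1; predIfZero = p0; }
        }
    }

    int predict(int[] example) {
        return (example[bestAttr] == 1) ? predIfOne : predIfZero;
    }
}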


Nearest neighbors

We did not discuss this algorithm in detail in class, but it is discussed in R&N Section 20.4 (2nd ed.) / 18.8.1 (3rd ed.).  The idea is simple: during training, all we do is store the entire training set.  Then given a test example, we find the training example that is closest to it, and predict that the label of the test example is the same as the label of its closest neighbor.  As described, this is the 1-nearest neighbor algorithm.  In the k-nearest neighbor algorithm, we find the k closest training examples and predict with the majority vote of their labels.  In either case, it is necessary to choose a distance function for measuring the distance between examples.
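
For instance, with the all-numeric representation provided by NumericDataSet, a natural choice is (squared) Euclidean distance; here is a sketch of 1-nearest-neighbor prediction.

// Sketch: 1-nearest-neighbor prediction with squared Euclidean distance.
// trainEx[i] is a numeric attribute vector and trainLabel[i] its 0/1 label.
public class NearestNeighbor {
    static int predict(int[] example, int[][] trainEx, int[] trainLabel) {
        long bestDist = Long.MAX_VALUE;
        int bestLabel = 0;
        for (int i = 0; i < trainEx.length; i++) {
            long dist = 0;
            for (int j = 0; j < example.length; j++) {
                long diff = example[j] - trainEx[i][j];
                dist += diff * diff;                 // squared Euclidean distance
            }
            if (dist < bestDist) { bestDist = dist; bestLabel = trainLabel[i]; }
        }
        return bestLabel;
    }
}

The k-nearest neighbor version instead keeps the k smallest distances and takes a majority vote of the corresponding labels.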


(Voted) perceptron algorithm

The perceptron algorithm (mentioned in a footnote on page 742 of R&N 2nd ed., and discussed in more detail in Section 18.6.3 of the 3rd ed.) is one of the oldest learning algorithms, and also a very simple algorithm.  Like SVM's, the algorithm's purpose is to learn a separating hyperplane defined by a weight vector w.  Starting with an initial guess for w, the algorithm proceeds to cycle through the examples in the training set.  If example x is on the "correct" side of the hyperplane defined by w, then no action is taken.  Otherwise, yx is added to w.  The algorithm has some nice theoretical properties, and can be combined with the kernel trick.  Also, there is a version of the algorithm in which the average of all of the weight vectors computed along the way is used to define the final weight vector for the output hypothesis.  All this is described in this paper.  [Unfortunately, this paper has some annoying typos in Figure 1.  The initialization line should read: "Initialize: k := 1, v1 := 0, c1 := 0."  Also, the line that reads "If ŷ = y then..." should instead read "If ŷ = yi then..."]
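
The core update is only a few lines; here is a sketch of the basic (unvoted) version, assuming the 0/1 labels have been converted to -1/+1 and the examples to numeric feature vectors.  The voted version additionally records the intermediate weight vectors and their survival counts, as in the paper.

// Sketch: the basic perceptron update.  Labels y[i] are assumed to be -1 or +1
// (convert the 0/1 labels in the data), and x[i] is a numeric feature vector.
public class Perceptron {
    static double[] train(double[][] x, int[] y, int epochs) {
        double[] w = new double[x[0].length];        // initial guess: all zeros
        for (int t = 0; t < epochs; t++) {
            for (int i = 0; i < x.length; i++) {
                double dot = 0;
                for (int j = 0; j < w.length; j++) dot += w[j] * x[i][j];
                if (y[i] * dot <= 0)                 // mistake: x is on the wrong side
                    for (int j = 0; j < w.length; j++)
                        w[j] += y[i] * x[i][j];      // add yx to w
            }
        }
        return w;
    }
}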


Bagging

This is an "ensemble" method similar to boosting, somewhat simpler though not quite as effective overall.  As in boosting, we assume access to a "weak" or base learning algorithm.  This base learner is run repeatedly on different subsets of the training set.  Each subset is chosen by selecting N of the training examples with replacement from the training set, where N is the number of training examples.  This means that we select one of the training examples at random, then another, then another and so on N times.  Each time, however, we are selecting from the entire training set, so that some examples will appear more than once, and some won't appear at all.  The base learner is trained on this subset, and the entire procedure is repeated some number of times (usually around 100).  These "weak" or base hypotheses are then combined into a single hypothesis by a simple majority vote.  For more detail, see this paper.  This algorithm works best with an algorithm like decision trees as the base learner.


Random forests

This is yet another "ensemble" method, similar but somewhat more sophisticated than bagging, and designed specifically for use with decision trees.  This algorithm is known to often perform very well in terms of test accuracy.  For more detail, see this paper.  If you are working as a pair, implementing this single method (which involves both an ensemble method, and a tree-growing algorithm) will satisfy the requirement of implementing two learning algorithms.


Other

If you are interested in implementing some other algorithm not listed here, please contact me first.


Books on reserve at the Engineering Library