Frequently Asked Questions (Clustering)

What methods from Point2D can I use? You can use any methods from its API, which you can consult in the Point2D documentation. You won't need most of the available methods, but you might want to use distanceTo() to compute Euclidean distances.
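
For example, here is a tiny snippet that computes the distance between two points (the class name Point2DDemo and the coordinates are made up for illustration):

import edu.princeton.cs.algs4.Point2D;
import edu.princeton.cs.algs4.StdOut;

public class Point2DDemo {
    public static void main(String[] args) {
        Point2D p = new Point2D(0.0, 0.0);
        Point2D q = new Point2D(3.0, 4.0);
        StdOut.println(p.distanceTo(q));  // Euclidean distance: 5.0
    }
}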

Do I have to implement both a minimum spanning tree and connected components algorithm from scratch? No! You can use two data types available in algs4.jar, namely KruskalMST and CC. The first one lets you compute a minimum spanning tree and the second one lets you compute the connected components of a graph. Their APIs are very short, so it should be easy to find the methods you need.
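
To illustrate, here is a minimal sketch of how the two types plug together; the class name, the tiny graph, and its edge weights are made up:

import edu.princeton.cs.algs4.CC;
import edu.princeton.cs.algs4.Edge;
import edu.princeton.cs.algs4.EdgeWeightedGraph;
import edu.princeton.cs.algs4.Graph;
import edu.princeton.cs.algs4.KruskalMST;
import edu.princeton.cs.algs4.StdOut;

public class MstCcDemo {
    public static void main(String[] args) {
        // a small weighted graph on 4 vertices
        EdgeWeightedGraph g = new EdgeWeightedGraph(4);
        g.addEdge(new Edge(0, 1, 1.0));
        g.addEdge(new Edge(1, 2, 5.0));
        g.addEdge(new Edge(2, 3, 1.5));
        g.addEdge(new Edge(0, 3, 4.0));

        // compute a minimum spanning tree with Kruskal's algorithm
        KruskalMST mst = new KruskalMST(g);

        // copy the MST edges into an unweighted Graph ...
        Graph h = new Graph(4);
        for (Edge e : mst.edges()) {
            int v = e.either();
            h.addEdge(v, e.other(v));
        }

        // ... and compute its connected components
        CC cc = new CC(h);
        StdOut.println(cc.count());          // number of components
        StdOut.println(cc.connected(0, 3));  // whether 0 and 3 are connected
    }
}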

What exactly is a cluster and why am I computing one? A cluster is a group of points that are close to each other. This is a very broad definition, and in fact there are many reasonable ways of clustering points. We chose one based on minimum spanning trees so you can practice using this technique you learned in class.

You are implementing something to compute clusters so you can reduce the dimension of the input by contracting all dimensions that are in the same cluster. This makes the classifier part of the algorithm more efficient, since you only have to deal with arrays of length \(k\) instead of length \(m\) (so if \(k\) is much smaller than \(m\), this is a significant saving). Additionally, if you think about what the underlying machine learning model is doing, grouping together close locations aggregates the data in a useful way, so reducing dimensions like this can actually make the model better.
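
As a concrete sketch of that contraction (everything here is hypothetical: clusterOf(i) is an assumed helper returning the index, between \(0\) and \(k - 1\), of the cluster containing dimension \(i\); k and m are assumed instance variables; and summing is one natural way to combine the coordinates of a cluster):

int[] reduceDimensions(int[] sample) {
    int[] reduced = new int[k];
    for (int i = 0; i < m; i++)
        reduced[clusterOf(i)] += sample[i];  // aggregate all dimensions of a cluster
    return reduced;
}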

Frequently Asked Questions (WeakLearner)

What is the best way to test my implementation? See the next section.

When training a weak learner, how should I break ties between two weak learners that have the same weight of correctly predicted points? The assignment doesn't specify, so it doesn't matter.

If you are curious about our tie-breaking rule, here is what the reference solution does: in case of a tie, always pick the smallest possible dimension predictor; if a tie still persists, pick the smallest possible value predictor; and if a tie persists even then, pick the smallest sign predictor.

I am not sure how to implement the constructor. Any advice? This is probably the trickiest part of the assignment, so you might have to spend some time thinking about how to do it efficiently. Start by trying to get the less efficient \(O(kn^2)\) solution to work. To do that, notice that you can compute the weight of correct predictions for all possible weak learners. How? Try all possible values of \(d_p\) (there are \(k\) of them), all possible values of \(s_p\) (there are only 2 of them), and all values of \(v_p\), and compute the weight of correct predictions for each of these weak learners (by looping through all \(n\) input points and computing the weak learner's prediction). But can't \(v_p\) be any non-negative integer? Yes, but notice that we only care about values that are coordinates of our input points (there are \(n\) of them). A weak learner with a value of \(v_p\) between two consecutive coordinates of our input makes exactly the same predictions on our input, so its weight of correct predictions is the same. To visualize this, look at the example from the assignment statement:

[Figure: grid of points (left) and predictor on grid (right)]

Note that moving the line up by anything less than \(1\) (so setting \(v_p\) to anything at least \(4\) and less than \(5\)) makes the exact same predictions on the input.
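
If it helps, here is a hedged sketch of that brute-force search. It assumes the n, k, input, labels, and weights arrays from the constructor, the variable names are illustrative, and the prediction rule assumed in the comment (predict \(s_p\) when the coordinate is at most \(v_p\)) should be checked against the assignment statement:

double best = Double.NEGATIVE_INFINITY;
for (int d = 0; d < k; d++) {                        // all k dimension predictors
    for (int s = 0; s <= 1; s++) {                   // both sign predictors
        for (int t = 0; t < n; t++) {                // all n candidate thresholds
            int v = input[t][d];
            double weight = 0.0;
            for (int i = 0; i < n; i++) {
                // assumed rule: predict s if the coordinate is at most v
                int prediction = (input[i][d] <= v) ? s : 1 - s;
                if (prediction == labels[i]) weight += weights[i];
            }
            if (weight > best) {
                best = weight;
                // remember (d, v, s) as the best stump found so far
            }
        }
    }
}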

Once you get this solution to work, think about how to avoid repeatedly going through all the points for every possible weak learner, by finding a cleverer way to compute the weight of correct predictions. Hint: if you sorted your points by a certain coordinate, could you find the best decision stump for that coordinate efficiently? In other words, suppose you sorted all the points by coordinate \(0\); could you find the best decision stump with \(d_p = 0\) in \(\Theta(n)\) time?
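
If you want to check your idea against one possible shape of the answer, here is a hedged sketch of such a sweep for a single dimension and sign, after a \(\Theta(n \log n)\) sort. All names are illustrative, and the assumed prediction rule (predict \(s_p\) for coordinates at most \(v_p\)) should be checked against the assignment statement:

private static void sweep(int[][] input, double[] weights, int[] labels,
                          int d, int s) {
    int n = input.length;

    // sort point indices by their d-th coordinate
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) order[i] = i;
    java.util.Arrays.sort(order, (a, b) -> Integer.compare(input[a][d], input[b][d]));

    // with the threshold below every coordinate, every point is predicted 1 - s
    double weight = 0.0;
    for (int i = 0; i < n; i++)
        if (labels[i] == 1 - s) weight += weights[i];

    // sweep the threshold upward; passing point i flips its prediction to s,
    // which updates the weight of correct predictions in constant time
    for (int t = 0; t < n; t++) {
        int i = order[t];
        weight += (labels[i] == s) ? weights[i] : -weights[i];

        // only record a candidate once every point sharing this coordinate has
        // been passed -- forgetting this is the duplicate-coordinate bug
        if (t == n - 1 || input[order[t + 1]][d] != input[i][d]) {
            // candidate stump (d_p = d, v_p = input[i][d], s_p = s) has this weight
        }
    }
}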

Frequently Asked Questions (Testing the WeakLearner)

We created a few small test cases, the stump_i.txt files, that are good for finding corner-case bugs in your WeakLearner code. For each of them, the corresponding stump_i.ans.txt file contains the parameters of the best decision stump for that particular data set (i.e., what your code should find).

Here is some code you can use to read one of these examples and print the trained weak learner parameters:

// requires: import edu.princeton.cs.algs4.In; import edu.princeton.cs.algs4.StdOut;
public static void main(String[] args) {
    In datafile = new In(args[0]);

    // number of points (n) and number of dimensions (k)
    int n = datafile.readInt();
    int k = datafile.readInt();

    // the n input points, each with k integer coordinates
    int[][] input = new int[n][k];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < k; j++) {
            input[i][j] = datafile.readInt();
        }
    }

    // the label (0 or 1) of each point
    int[] labels = new int[n];
    for (int i = 0; i < n; i++) {
        labels[i] = datafile.readInt();
    }

    // the weight of each point
    double[] weights = new double[n];
    for (int i = 0; i < n; i++) {
        weights[i] = datafile.readDouble();
    }

    // train a decision stump and print its parameters
    WeakLearner weakLearner = new WeakLearner(input, weights, labels);
    StdOut.printf("vp = %d, dp = %d, sp = %d\n", weakLearner.valuePredictor(),
                  weakLearner.dimensionPredictor(), weakLearner.signPredictor());
}

If you make the main method of your WeakLearner the above, you can determine your output for each of the examples like so:

~/Desktop/fraud> java-algs4 WeakLearner stump_1.txt

All the examples are based on the point sets shown in the following images:

[Figure: example point sets]

The first example is helpful for testing the basic logic of your implementation. The next two examples help you make sure you are correctly considering the weights. The last two examples are helpful for debugging one of the most common bugs in the efficient solution: forgetting that multiple points can have the same coordinate.

Frequently Asked Questions (BoostingAlgorithm)

How do I normalize the weights and why do I have to? Normalizing can be done by summing all the weights and then dividing each individual weight by this sum. This ensures that the resulting weights sum to 1.
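
For instance, a minimal sketch of that normalization step (assuming the n and weights variables, and that all weights are positive so the sum is nonzero):

double sum = 0.0;
for (int i = 0; i < n; i++) sum += weights[i];   // total weight
for (int i = 0; i < n; i++) weights[i] /= sum;   // now the weights sum to 1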

We do this because otherwise we might run into numeric overflow: we are repeatedly doubling weights, so if we never normalized, after 100 mistakes the weight of a point would be \(2^{100}\), which is larger than what a long can represent in Java.

Why does this model work well to classify data? The intuition is that we are training several weak learners, each of which is better at predicting certain collections of points. Individually, each weak learner is only slightly good at predicting, but when you combine all of them, they become really good.

Also, we must confess that in reality the exact model that you are implementing doesn't really work well for many scenarios. The complete AdaBoost algorithm is slightly more complex, and this extra complexity makes it work really well in all possible scenarios (one can even show theoretical guarantees on how well it works). We chose to have you learn and implement this simpler version to make your lives easier, but we encourage you to look up how the full version works and why.

Are there other types of weak learners? Certainly! Any simple model that is very efficient to train and make predictions with is a good candidate for a weak learner. The only thing you need is a model that is better than assigning labels at random. You are implementing decision stumps here because they are the simplest to understand and they are also really popular in practice.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Implement all the methods in the Clustering data type. Go through the algorithm's description in the statement and think about what instance variables you'll need, as well as what temporary variables you'll need in the constructor to compute them. Don't be wasteful! Only store as an instance variable what you really need in order to support the instance methods.
  2. Implement all the methods in the WeakLearner data type using the \(O(kn^2)\) algorithm. Note that this data type is independent of Clustering, so you don't have to use that in this part of the assignment.
  3. Implement the BoostingAlgorithm data type. You'll need to create a single Clustering object in the constructor, and a WeakLearner every time the iterate() method gets called. Don't throw away these WeakLearners! Store them in a linked list or similar, since you need them in the predict() method to compute their predictions and take a majority vote (see the skeleton after this list).
  4. Once you've gotten all of this to work and passed all tests on tigerfile except for one timing test on WeakLearner, try to find an \(O(nk \log n)\) solution and then implement it. Don't get rid of your inefficient solution; save it in case you need to look at it again while implementing the more efficient one.
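
Here is a hedged skeleton of the structure described in step 3. The field names (reducedInput, labels), the reduceDimensions and predict calls, and the bookkeeping are all assumptions for illustration; follow the assignment's API and all the actual work is elided:

// requires java.util.ArrayList and java.util.List
private Clustering clustering;                  // created once, in the constructor
private List<WeakLearner> learners = new ArrayList<>();
private double[] weights;                       // one normalized weight per point
// reducedInput and labels are assumed instance variables set up in the constructor

public void iterate() {
    WeakLearner learner = new WeakLearner(reducedInput, weights, labels);
    learners.add(learner);                      // keep it around for predict()
    // then double the weight of every point the learner mispredicts,
    // and renormalize the weights
}

public int predict(int[] sample) {
    int[] reduced = clustering.reduceDimensions(sample);  // assumed method name
    int zeros = 0, ones = 0;
    for (WeakLearner learner : learners) {
        if (learner.predict(reduced) == 0) zeros++;
        else ones++;
    }
    return (ones > zeros) ? 1 : 0;              // majority vote; check the tie rule
}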

If you get stuck or confused, it might help to revisit the lecture slides and the precept exercises from Precept 10. Also, ask questions on Ed or go to office hours.