You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.
Problem 1 Section 12.5.2 of Modern Information Retreival describes the use of RED, GREEN, and BLUE averages for an image to serve as features on which images can be compared. One problem with this approach is that a picture that has all pixels with RED at 50% intensity and BLUE at 50% intensity (and no GREEN) will "look" the same as an image with all pixels in the left half at full RED intensity (no BLUE or GREEN) and all pixels in the right half at full BLUE intensity (no RED or GREEN). Suppose that instead we divide the image into four regions by cutting it in half vertically and horizontally and take the RED, BLUE and GREEN averages for each region.
Part a How would two images be compared if we use RED, BLUE and GREEN averages in each of the four regions for each image (so that an image is characterized by 3 vectors of 4 components: one vector for RED, one for BLUE, and one for GREEN)? What would be the similarity function?
Part b Would the use of four regions help the inaccuracy described at the beginning of this problem? Why or why not? In general, what are the pros and cons of dividing each image into four regions?
Problem 2 Consider a "wagon-wheel graph": a graph consisting of 6 nodes -- a, b, c, d, e, f -- in a cycle and a seventh node, x, connected to each of the six nodes. The edges have the following similarity weights (higher number is more similar):
(a,b) has weight 10 (b,c) has weight 4 (c,d) has weight 4 (d,e) has weight 10 (e,f) has weight 4 (f,a) has weight 4 (x,a) has weight 1 (x,b) has weight 1 (x,c) has weight 8 (x,d) has weight 1 (x,e) has weight 1 (x,f) has weight 8Find clusters for this graph using hierarchical agglomerative clustering. In specific, use the algorithm which starts with n clusters for an n-node graph, one node in each cluster, and in each of n-1 stages merges the two clusters that are most similar (breaking ties arbitrarily). Measure the similarity between clusters as the similarity of the most similar pair of nodes, one from each cluster. Show the result of each of the n-1 stages.
Problem 3
Part a Find the minimum cut tree of the graph of Problem 2.
Part b Cluster the graph of Problem 2 using the minimum cut tree of Part a. Use the clustering algorithm that finds the centroid of the minimum cut tree and removes it to create the first-level clustering. How many levels of clustering can you get using this minimum cut tree?
Problem 4 Consider the graph that is simply a 4-cycle -- a,b,c,d -- with edge similarity weights:
(a,b) has weight 4 (b,c) has weight 1 (c,d) has weight 2 (d,a) has weight 2Cluster this graph by applying the clustering algorithm based on random walks, as described in
<=2 P (i,j) visitfor each pair of nodes (i,j). Use the similarity measure derived from these values and a threshold of your choosing to decompose the graph into clusters. Why did you choose that threshold? Are there other reasonable choices?