COS 429 - Computer Vision
Fall 2019
Assignment 3: Tracking
Due Thursday, November 21
Changelog and Clarifications
- 11/8/2019: Frames that do not have ground truth rectangles have rect = [-0.5, -0.5, -0.5, -0.5], not rect = [0, 0, 0, 0] as before.
- 11/11/2019: Add hint for 2.2.
- 11/11/2019: Clarify test code for defineActiveLevels in part 3.1.
- 11/14/2019: Add hint for defineActiveLevels in part 3.1.
- 11/17/2019: Clarify assumptions for pyramid levels in part 3.1.
- 11/25/2019: Make various miscellaneous fixes now that the assignment submission period has passed.
In this assignment you will be building a face tracker based on the
Lucas-Kanade algorithm. For background, you should read the
lecture slides and the Lucas-Kanade paper
and look at the
notes for Lecture 13.
Part 1. Preliminaries (25 points)
Do this:
- Begin by downloading the starter code and the data. Once you have unzipped the contents, please place the data directory inside the cos429_f19_assignment3 directory. The starter code contains several files with functions for various pieces of the LK algorithm, which you will need to use/implement. Familiarize yourself with the following contents:
- LK.py - This file contains various functions for implementing the LK algorithm.
- LKinitParams - This function defines a number of parameters to the LK algorithm. You will alter some of these later on and observe their effect on performance.
- LKonCoImage - This function predicts the motion of a given rectangle from one frame to the next. You will implement much of this functionality below.
- LKonPyramid - This function performs the same motion prediction as LKonCoImage on a pair of image pyramids. It does this by calling LKonCoImage for each level of the pyramid.
- LKonSequence - This function performs LK tracking across multiple frames of a video by calling LKonPyramid on pairs of frames and tracking rectangle positions and motions across frames. You will implement this functionality below.
- data - This directory contains several image sequences and ground truth rectangles for each frame. You will use these sequences for testing and to demonstrate your results.
- coi.py - This file contains functions defining operations on "coordinate images", which are defined below in part 1.3. You will need to implement drawFaceSequence and call other functions defined here.
- rect.py - This file contains functions that manipulate rectangles, which are discussed in part 1.2. You will implement rect2uvs below.
- uvs.py - This file contains functions computing/manipulating uvs motion models. You will implement several functions within this file.
- test.py - This file contains snippets of code used to run/test your implementation for each part of this assignment. See the bottom of the file for sample usage. Note that passing these tests does not guarantee the correctness of your code. You are responsible for thoroughly testing your own code. We will grade based on your code, not based on the test outputs.
1.1 Motion (10 points)
We will be working with two motion models:
-
Translation by $(u,v)$, where the coordinate $(x,y)$ is updated to a new
location $(x',y')$ according to
$$
x'=x+u
$$
$$
y'=y+v
$$
-
Translation by $(u,v)$ and scale by $s$, where the scale $s$ is defined by the
change of the object width from $w$ to $w'$ as:
$$
s = \frac{w'-w}{w}
$$
Thus $s=0$ means no scale, $s=1$ means the object doubles in size, $s=-0.5$
means it shrinks in half, and $s=-1$ means it shrinks to a singularity.
To define the motion of a point $(x,y)$ due to scale, one needs to
define the "center" of motion. We will refer to this arbitrary point as
$(x_0,y_0)$. Thus scaling the point $(x,y)$ means
$$
x' = x + (x - x_0)\cdot s
$$
$$
y' = y + (y - y_0)\cdot s
$$
After scaling, we also apply the translation, producing the motion equations:
$$
x' = x + u + (x - x_0)\cdot s
$$
$$
y' = y + v + (y - y_0)\cdot s
$$
In the code, the file uvs.py contains functions that manipulate
$(u,v,s)$ motion models. The data structure they use is simply a vector
of 5 elements: [u, v, s, x0, y0].
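For concreteness, here is a minimal sketch of applying such a motion vector to a single point (illustrative only; apply_uvs is our own name and not part of the starter code, whose helpers live in uvs.py):
def apply_uvs(mot, x, y):
    # Apply a [u, v, s, x0, y0] motion to the point (x, y).
    u, v, s, x0, y0 = mot
    x_new = x + u + (x - x0) * s
    y_new = y + v + (y - y0) * s
    return x_new, y_new

# Example: translate by (2, 1) and grow by 50% about the center (0, 0).
print(apply_uvs([2.0, 1.0, 0.5, 0.0, 0.0], 4.0, 4.0))  # -> (8.0, 7.0)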
Do this and turn in:
- In uvs.py, write the body of the function uvsInv, which inverts the motion mot. The location of motinv's
center of motion [x0', y0'] should be the projection of
[x0, y0] by the motion mot.
- Include in your write-up a code snippet showing your implementation of uvsInv.
- Include in your write-up the output of the commands in part 1.1 of test.py
1.2 Rectangles (10 points)
In frame-to-frame alignment, our goal is to estimate these parameters,
$(u,v)$ or $(u,v,s)$, from a local part of a given pair of images. For
simplicity, we will limit ourselves to rectangular areas of the image. Our
rectangles are defined as rect = [xmin, xmax, ymin, ymax] and the
file rect.py contains functions to manipulate rectangles.
(Technical note: the sides of a rect can take on non-integer values, where
a rectangle may include only part of some pixels. Pixels are centered on
the integer grid, thus a rectangle that contains exactly pixel [1,3]
would be rect = [0.5, 1.5, 2.5, 3.5].)
Given 2 rectangles (of the same aspect ratio), we can compute the $(u,v,s)$
motion between them. This could be useful to define an
initial motion for the tracking if you have a guess of 2 rectangles that define
the object (e.g. by running your face detector from Assignment 2).
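As a rough sketch of that computation (assuming the rect format [xmin, xmax, ymin, ymax] defined above and taking the center of the first rectangle as the center of motion; the actual rect2uvs in rect.py may use different conventions):
def uvs_between_rects(rect1, rect2):
    # Sketch of the (u, v, s) motion taking rect1 to rect2.
    w1, h1 = rect1[1] - rect1[0], rect1[3] - rect1[2]
    w2, h2 = rect2[1] - rect2[0], rect2[3] - rect2[2]
    # average the width- and height-derived scales (see the note below)
    s = 0.5 * ((w2 - w1) / w1 + (h2 - h1) / h1)
    cx1, cy1 = (rect1[0] + rect1[1]) / 2, (rect1[2] + rect1[3]) / 2
    cx2, cy2 = (rect2[0] + rect2[1]) / 2, (rect2[2] + rect2[3]) / 2
    # with (x0, y0) at rect1's center, that center maps to (cx1 + u, cy1 + v)
    u, v = cx2 - cx1, cy2 - cy1
    return [u, v, s, cx1, cy1]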
Do this and turn in:
- In rect.py, finish the function rect2uvs. If the aspect ratio of the 2 rectangles is not exactly
the same, just use the average of the 2 possible scales.
- Include in your write-up a code snippet showing your implementation of rect2uvs.
- Include in your write-up the output of
the commands in part 1.2 of test.py
1.3 Image sequence (5 points)
Included with the starter code (in the data directory)
are the 'woman' and 'man' video sequences of faces from
http://vision.ucsd.edu/~bbabenko/project_miltrack.html.
Each frame of the video is just an image (here it's actually stored as a set
of .png files). We could represent the image as just a numpy array, as
you have done in the previous assignments, but we will be cutting and
warping images, and it is useful for an image to be able to have a
coordinate system that is not necessarily rooted at the top-left corner.
Here we will use a "coordinate image" structure (often called coimage
or coi in the code). It is just a wrapper around the image with
some extra fields, for example:
coi =
im: [240x320 double]
origin: [30 20]
label: 'img00005'
level: 1
The level field will be useful later to keep track of which
pyramid-level this image represents.
The file coi.py contains functions to create and manipulate
these images. See the constructor function coimage in coi.py for more
information.
The class FaceSequence (in coi.py) is there to
simplify access to the tracking sequences downloaded above. See part 1.3 in test.py for some example usage.
To initialize:
fs = FaceSequence('path/to/data/woman')
Then, to access, e.g., the 5th image and read it into a coimage
structure, you can do:
coi = fs.readImage(4)
Note that FaceSequence is 0-indexed.
If you want to access a sequence of images, say every 3rd image starting from
the 2nd, do:
fs.next = 1
fs.step = 3
coi1 = fs.readNextImage() # the 2nd
coi2 = fs.readNextImage() # the 5th
Additionally, fs contains the "ground truth" rectangles stored with
the clip in
rect = fs.gt_rect[0, :] # rectangle for 1st image
Beware: only some frames have valid ground truth rectangles; for the rest,
rect = [-0.5, -0.5, -0.5, -0.5].
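When looping over frames (for example in drawFaceSequence), one simple way to detect this sentinel is a check like the following (has_ground_truth is our name for illustration, not a starter-code function):
import numpy as np

def has_ground_truth(rect):
    # True if rect is a real ground-truth box, not the -0.5 sentinel.
    return not np.allclose(np.asarray(rect), -0.5)

print(has_ground_truth([-0.5, -0.5, -0.5, -0.5]))  # False
print(has_ground_truth([10.5, 60.5, 20.5, 80.5]))  # True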
Do this and turn in:
- In coi.py, write the function drawFaceSequence, which will display an animation of the video with the rectangles from rects
drawn on it. Use the plt.pause() command in your for loop to make sure
matplotlib shows the frames at a reasonable speed. For example,
including plt.pause(0.2) should display the animation at roughly 5
frames per second. Be sure to properly handle images that do not have ground truth rectangles.
- Include in your write-up a code snippet showing your implementation of drawFaceSequence.
- Include in your write-up the last frame of the animation produced by part 1.3 of test.py. You can take a screenshot or save the figure.
Part 2: LK at a Single Level (35 points)
2.1 (u,v) motion (20 points)
Review the tracking lectures. The Lucas-Kanade algorithm repeatedly warps
the "current" image backward according to the current motion to be similar
to the "previous" image inside the area defined by the "previous rectangle".
Let $N$ be the number of pixels inside the rectangle. Recall that we need to
compute a motion update $x = (u,v)$ (for translation-only motion)
that satisfies:
$$
Ax=b
$$
where $A$ is an $N \times 2$ matrix, each of whose rows is the image
gradient $(dx, dy)$ at some pixel of the previous image, and $b$ is an $N
\times 1$ column vector of errors (image intensity differences) between the
previous and current image. To solve this with least squares, we compute
$$
x = (A^T A)^{-1} A^T b.
$$
The new combined motion (input motion + update) should be
new_mot_u = mot_u + x_u
new_mot_v = mot_v + x_v
Notice that $A$, hence $(A^T A)^{-1}$, does not change between iterations:
we only need to re-warp the current image according to the updated motion.
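For intuition, a minimal numpy sketch of one translation-only update follows; the array names Ix, Iy (gradients of the previous image inside the rectangle) and err (previous minus warped current) are illustrative, and the variable names in LK.py differ:
import numpy as np

N = 100  # number of pixels inside the rectangle (illustrative)
Ix, Iy = np.random.randn(N), np.random.randn(N)   # previous-image gradients
err = np.random.randn(N)                          # previous - warped current

A = np.stack([Ix, Iy], axis=1)     # N x 2
AtA = A.T @ A                      # 2 x 2; fixed across iterations
AtAinv = np.linalg.pinv(AtA)
u, v = AtAinv @ (A.T @ err)        # the (u, v) increment
# translation-only update of the accumulated motion:
#   new_mot_u = mot_u + u
#   new_mot_v = mot_v + v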
The function LKonCoImage implements the LK algorithm on a single
level.
A default set of params can be generated with LKinitParams.
Do this and turn in:
- Finish the missing code inside LKonCoImage in LK.py
to estimate the $(u,v)$ motion increment. You need to edit 4 places:
- Compute $A$.
- Compute $A^T A$ and its inverse (variables AtA and
AtAinv). (Hint: np.linalg.pinv).
- Compute mot_update using the above equations and it (the error image, previous - current).
- Update the motion mot based on mot_update.
- Now call LKonCoImage on pairs of nearby frames, using the ground truth
rectangle fs.gt_rect[i, :] as the prect.
The input motion can be blank: [0, 0, 0, x0, y0].
Note: for the scale estimation (which you will implement below) to be
numerically stable, [x0, y0] should be close to the center of the
rectangle (rectCenter() can help here).
If you set the show_fig parameter to True, you should be able to see
the animation of the iterations.
This will display a plot showing three images:
- The top image shows the cropped region within the bounding box in the initial frame, plus the pyramid level selected from that image (for Part 3).
- The middle shows the current fit rectangle region in the second frame, plus the iteration count and current uvs values.
- The bottom shows a heatmap of error values per pixel, called $b$ above, with the mean squared error listed.
- Include in your write-up code snippets showing the lines that you added to LKonCoImage.
- Submit in your write-up the final mean squared error value produced by running part 2.1 in test.py. How much less is this value than the initial value?
2.2 (u,v,s) motion (10 points)
Now we will implement motion with scale. The formulas are very similar:
$x = (u, v, s)$ is the mot_update we want to solve for, and so
each row of $A$ has another column:
$$
A_i = (dx_i \; dy_i \; ww_i)
$$
where $ww = dx \cdot (x-x_0) + dy \cdot (y-y_0)$
for each pixel inside the rectangle. (Thus, $A^T A$ will be a $3 \times 3$
matrix.)
Hint: coiPixCoords might be useful here.
Finally the motion update is a bit more complex:
new_mot_u = mot_u + x_u + x_u * mot_s
new_mot_v = mot_v + x_v + x_v * mot_s
new_mot_s = mot_s + x_s + x_s * mot_s
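A sketch of the corresponding least-squares step (again with illustrative names; xs, ys stand for the pixel coordinates inside the rectangle, e.g. as returned by coiPixCoords, and (x0, y0) is the center of motion):
import numpy as np

N = 100
Ix, Iy, err = np.random.randn(N), np.random.randn(N), np.random.randn(N)
xs, ys = np.random.rand(N) * 50, np.random.rand(N) * 50
x0, y0 = 25.0, 25.0

ww = Ix * (xs - x0) + Iy * (ys - y0)
A = np.stack([Ix, Iy, ww], axis=1)                      # N x 3
x_u, x_v, x_s = np.linalg.pinv(A.T @ A) @ (A.T @ err)   # (u, v, s) increment
# then compose with the accumulated motion as above:
#   new_mot_u = mot_u + x_u + x_u * mot_s
#   new_mot_v = mot_v + x_v + x_v * mot_s
#   new_mot_s = mot_s + x_s + x_s * mot_s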
Do this and turn in:
- Finish the code in LKonCoImage to deal with
the case do_scale=True
- Include in your write-up code snippets showing the lines that you added to LKonCoImage.
- Submit in your write-up the error values and final plots produced with and without do_scale turned on for the commands in part 2.2 of test.py.
2.3 Analysis (5 points)
Experiment with the scale version of LKonCoImage and answer the
following questions in your write-up:
- How far from the correct motion (in u, v, and s) can the initial motion be
such that a single level of LK still finds the correct motion? You can give
numerical ranges for your answer.
Keep the default max_iter for this question.
Hint: play with the input motion.
- How consistent is the result (when it can converge) given different
initial motions? Convergence here means that the algorithm was able to
find the correct motion (or was very close). If it found a motion that is
nowhere close to correct, then it did not converge.
- Can you make it more consistent by changing the parameters max_iter, uvs_min_significant,
and err_change_thr? Explain what each of these does and whether/how changing
them changes the result.
Part 3: Multi-level LK (25 points)
The function LKonPyramid implements multi-resolution LK.
It should call LKonCoImage for each level of a Gaussian image
pyramid. The image pyramid is built using
pyr = coPyramid(coi, levs)
Each time we go "up" a level the image is subsampled by a factor of 2.
The origin of the coordinate system is kept at the same pixel, but the width
and height of any object is reduced to 1/2. A feature at coordinate
x_L2 on level 2 would have coordinate x_L1 = x_L2 * 2
on level 1. The function rectChangeLevel implements level
conversions for rectangles.
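For example, converting a single coordinate between levels under this convention looks like the following sketch (rectChangeLevel applies the same idea to all four sides of a rect):
def coord_change_level(x, from_level, to_level):
    # Each level up halves all coordinates, so convert by a power of 2.
    return x * 2.0 ** (from_level - to_level)

print(coord_change_level(10.0, 2, 1))  # 20.0: level-2 coordinate 10 is level-1 coordinate 20
print(coord_change_level(20.0, 1, 3))  # 5.0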
3.1 Active Levels (10 points)
The first task is to figure out which levels of the pyramid we want to work
on. The function defineActiveLevels in LK.py should return a vector
[L_start, ..., L_end], inclusive, of the levels we can work on.
L_start for now should be the lowest level of the given pyramid
(since we are willing to work on very big images), but the top level needs
to be computed so that the number of pixels inside the rectangle is
not smaller than the parameter min_pix[0].
You may assume that in a pyramid, levels always start at 1, levels are in order, and levels are not skipped.
Hint: Be sure you are using all of the given arguments (prect, mep1, mep2), as all of them
impose constraints on the active levels.
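For intuition (an approximation, not necessarily the exact rule the starter code expects): if the rectangle covers $N_1$ pixels at level 1, it covers roughly $N_1 / 4^{L-1}$ pixels at level $L$, so the top active level is the largest $L$ with $N_1 / 4^{L-1} \geq$ min_pix[0]; the levels actually present in mep1 and mep2 further restrict the usable range.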
Do this and turn in:
- Fill in defineActiveLevels in LK.py. Include in your write-up the printed output of defineActiveLevels in part 3.1 of test.py. Make sure to modify min_pix as described above.
- Include in your write-up a code snippet showing your implementation of defineActiveLevels.
3.2 Changing Levels (10 points)
We also need to update the motion as we move from level to level. This
is done by calling the function uvsChangeLevel in uvs.py.
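As a sanity check on what such a conversion must do, substituting the halved coordinates into the motion equations from part 1.1 shows that u, v, x0, and y0 scale with the coordinates while s is unchanged. A sketch of that idea (our own, not necessarily the exact form expected by uvsChangeLevel):
def uvs_change_level_sketch(mot, from_level, to_level):
    # Rescale a [u, v, s, x0, y0] motion between pyramid levels.
    f = 2.0 ** (from_level - to_level)
    u, v, s, x0, y0 = mot
    return [u * f, v * f, s, x0 * f, y0 * f]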
Do this and turn in:
- Implement the body of uvsChangeLevel in uvs.py. Include in your write-up the printed outputs of uvsChangeLevel in part 3.2 of test.py.
- Include in your write-up a code snippet showing your implementation of uvsChangeLevel.
3.3 Analysis (5 points)
Answer the following questions in your write-up:
- Run LKonPyramid on a pair of images from each example sequence ('man' and 'woman'). Note that the further apart the images are in the sequence the worse your results are likely to be. Then, submit in your write-up the figures generated by LKonPyramid in part 3.3 of test.py. There should be a figure generated for each level.
- How far can init_motion be from the correct motion? How does this compare with your results from part 2.2 on a single level? Note that part 3 uses test frames that are further away, so you should rerun part 2.2 with the new pair of frames.
Part 4: Incremental Tracking for a Sequence (15 points)
4.1 LK on a Sequence (10 points)
The function LKonSequence runs LKonPyramid on a sequence in an
incremental way: 1→2, 2→3, 3→4, ...
In other words, it should go through the sequence and call LKonPyramid on pairs of consecutive frames.
mots and rects correspond to mot and rect from prior parts
of this assignment, and each has length seq_length + 1. For each pair of frames, we
find a new motion and rectangle; the extra element is because we also store the initial motion and rectangle
in mots and rects.
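In outline, the incremental loop resembles the following generic sketch (the function names and argument lists here are illustrative; the real code calls LKonPyramid with the signatures defined in LK.py):
def track_sequence(frames, init_rect, init_mot, predict_motion):
    # Generic incremental tracker: predict_motion stands in for LKonPyramid.
    mots, rects = [init_mot], [init_rect]
    for prev_frame, cur_frame in zip(frames[:-1], frames[1:]):
        mot, rect = predict_motion(prev_frame, cur_frame, rects[-1], mots[-1])
        mots.append(mot)
        rects.append(rect)
        # in LKonSequence, the next initial motion is assumed similar to mot,
        # with its center (x0, y0) damped toward the center of the new rect
    return mots, rects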
Do this and turn in:
- Implement the missing lines in the loop of
LKonSequence in LK.py.
- The code should call LKonPyramid
for each pair of images, record the resulting motion and rectangle in the
mots and rects matrices, and update init_mot for the next frame.
- You can assume that the motion to the next frame is similar
to the motion to the current frame, with some damping to make the mot center lie close to the center of prect.
- Recall that the input motion's center $(x_0,y_0)$ should be close to the center of prect.
- Include in your write-up a code snippet showing your implementation of LKonSequence.
4.2 Analysis (5 points)
Do this and turn in:
- Run LKonSequence on each of the example sequences (run on both 'woman' and 'man' sequences).
- For each example sequence, how long is the longest run of consecutive frames you can track, and under what circumstances does tracking break?
- Why do you think it breaks under these circumstances?
- Submit in your write-up a frame from each sequence ('woman' and 'man') that causes tracking to fail.
Submitting
This assignment is due Thursday, November 21, 2019 at 11:59 PM. Please see the general notes on submitting your assignments, as well as the late policy and the collaboration policy.
Submissions will be done through Gradescope. This assignment has 2 submissions:
- Assignment 3 Written: Submit one single PDF containing your write-up, including code snippets of all the code you added, answers to questions, and all requested figures.
- Assignment 3 Code: Please submit all of your code files. This should include coi.py, LK.py, rect.py, uvs.py, and test.py.
Please note that as was the case for Assignment 2, Assignment 3 Code is worth 0 points on Gradescope. We will grade the write-up and code together, but we will put scores into Assignment 3 Written only.
Last update: 13-Dec-2019 03:17:23