COS 429 - Computer Vision
Fall 2017
In this assignment you will build a face tracker based on the
Lucas-Kanade algorithm. For background, you should read
the Lucas-Kanade paper
and look at the
notes for lectures 13 and 14.
Do this:
We will be working with two motion models: pure translation, $(u,v)$, and translation plus scale, $(u,v,s)$.
To define the motion of a point $(x,y)$ due to scale, one needs to define the "center" of motion. We will refer to this arbitrary point as $(x_0,y_0)$. Thus scaling the point $(x,y)$ means $$ x' = x + (x-x_0)\,s $$ $$ y' = y + (y-y_0)\,s $$
After scaling, we also apply the translation, producing the motion equations: $$ x' = x + u + (x-x_0)\,s $$ $$ y' = y + v + (y-y_0)\,s $$
In the code, the directory uvs/ contains functions that manipulate $(u,v,s)$ motion models. The data structure they use is simply a row vector of 5 elements: [u v s x0 y0]. For example, the function
wc = uvsCWarp(mot, c)
takes a motion model mot = [u v s x0 y0] and coordinate c = [x,y] and produces the new warped point wc = [x',y'].
Note that most of the functions are written in a vectorized form: they accept multiple rows of motions (and/or points), hence you see a lot of code such as mot(:,U).
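For example, here is a minimal numeric sanity check of the warp equations above (the values are made up):
mot = [2 -1 0.1 160 120];          % [u v s x0 y0]
c   = [100 80];                    % [x y]
wc  = [c(1) + mot(1) + (c(1) - mot(4)) * mot(3), ...
       c(2) + mot(2) + (c(2) - mot(5)) * mot(3)];
% wc = [96 75], the same answer uvsCWarp(mot, c) should give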
Do this and turn in:
motinv = uvsInv(mot)
in uvsInv.m, which inverts the motion mot. The location of motinv's center of motion [x0',y0'] should be the projection of [x0,y0] by the motion mot. As with the other functions, your function may receive several rows of motions as input and should output the same number of rows, with each motion inverted.
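Hint (a sketch of one possible derivation, not the only route): expanding the forward model gives $x' = x(1+s) + u - x_0 s$, so solving for $x$ yields $$ x = \frac{x' - u + x_0 s}{1+s}, $$ which can be rearranged back into the motion equations' form around the new center $x_0' = x_0 + u$. A quick check: applying mot and then motinv to any point (e.g. via uvsCWarp) should return that point.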
In frame-to-frame alignment, our goal is to estimate these parameters, $(u,v)$ or $(u,v,s)$, from a local part of a given pair of images. For simplicity, we will limit ourselves to rectangular areas of the image. Our rectangles are defined as rect = [xmin xmax ymin ymax] and the directory rect/ contains functions to manipulate rectangles. (Technical note: the sides of a rect can take on non-integer values, where a rectangle may include only part of some pixels. Pixels are centered on the integer grid, thus a rectangle that contains exactly pixel [1,3] would be rect = [0.5 1.5 2.5 3.5].)
Given 2 rectangles (of the same aspect ratio), we can compute the $(u,v,s)$ motion between them. This could be useful to define an initial motion for the tracking if you have a guess of 2 rectangles that define the object (e.g. by running your face detector from assignment 2!).
Do this and turn in:
mot = rect2uvs(r1, r2)
in rect2uvs.m that defines this motion. If the aspect ratio of the 2 rects is not exactly the same, just use the average of the 2 possible scales. (Hint: notice the functions rectCenter(r) and rectSize(r) in the rect/ directory).
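A minimal sketch of the idea (assuming rectCenter returns [x y] and rectSize returns [width height]; verify against the functions in rect/):
c1 = rectCenter(r1);  c2 = rectCenter(r2);
sz1 = rectSize(r1);   sz2 = rectSize(r2);
s   = mean(sz2 ./ sz1) - 1;          % average of the two possible scales
uv  = c2 - c1;                       % moves r1's center onto r2's center
mot = [uv(1) uv(2) s c1(1) c1(2)];   % center of motion at r1's center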
Included with the starter code (in the data/ subdirectory) are the 'girl' and 'david' video sequences of faces from http://vision.ucsd.edu/~bbabenko/project_miltrack.html .
Each frame of the video is just an image (here it's actually stored as a set of png files). We could represent the image as just a Matlab matrix, as you have done in the previous assignments, but we will be cutting and warping images, and it is useful for an image to carry a coordinate system that is not necessarily rooted at the top-left corner. Here we will use a "coordinate image" struct (often called coimage or coi in the code). It is just a wrapper around the matrix with some extra fields, for example:
coi =
        im: [240x320 double]
    origin: [30 20]
     label: 'img00005'
     level: 1
The level field will be useful later to keep track of which pyramid-level this image represents.
The directory coi/ contains functions to create and manipulate these images. See the constructor function coimage.m for more information.
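For illustration only, a hypothetical, hand-built equivalent of the struct above (in practice, use the coimage.m constructor):
% Hypothetical: what the fields of a coimage amount to. Prefer the
% coimage.m constructor over building the struct by hand.
coi.im     = im2double(imread('img00005.png'));   % pixel data
coi.origin = [30 20];     % origin of the coordinate system (see coimage.m)
coi.label  = 'img00005';  % a name, handy for plots and debugging
coi.level  = 1;           % pyramid level this image represents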
The class FaceSequence (coi/FaceSequence.m) is there to simplify access to the tracking sequences described above. To initialize:
fs = FaceSequence('path/to/data/girl');
Then, to access e.g. the 5th image and read it into a coimage structure, you can do:
coi = fs.readImage(5);
(NOTE: this is probably the image img00004.png - don't be confused.) If you want to access a sequence of images, say every 3rd image starting from the 2nd, do:
fs.next = 2;  fs.step = 3;
coi1 = fs.readNextImage();   % the 2nd
coi2 = fs.readNextImage();   % the 5th
...
Additionally, fs contains the "ground truth" rectangles stored with the clip in
rect = fs.gt_rect(1,:); % rectangle for 1st image
Beware: only some frames have a valid ground truth rectangle; for the others, rect = [0 0 0 0];
Do this and turn in:
drawFaceSequence(fs, from, step, number, rects)
in drawFaceSequence.m that will display an animation of the video with the rectangles from rects drawn on it. Use the pause() command in your for loop to make sure Matlab draws each frame and the animation is not shown too fast. For example, including pause(0.2) should display the animation at roughly 5 frames per second.
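One possible skeleton (a sketch, not a reference solution; for simplicity it ignores coi.origin when drawing):
function drawFaceSequence(fs, from, step, number, rects)
fs.next = from;  fs.step = step;
for i = 1:number
    coi = fs.readNextImage();
    imshow(coi.im, []);  hold on;
    r = rects(i,:);                  % [xmin xmax ymin ymax]
    if any(r)                        % skip the [0 0 0 0] "no truth" rects
        rectangle('Position', [r(1) r(3) r(2)-r(1) r(4)-r(3)], 'EdgeColor', 'r');
    end
    hold off;
    pause(0.2);                      % ~5 frames per second
end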
Review the tracking lectures. The Lucas-Kanade algorithm repeatedly warps the "current" image backward according to the current motion to be similar to the "previous" image inside the area defined by the "previous rectangle".
Let $N$ be the number of pixels inside the rectangle. Recall that we need to compute a motion update $x = (u,v)$ (for translation-only motion) that satisfies: $$ Ax=b $$ where $A$ is an $N \times 2$ matrix, each of whose rows is the image gradient $(dx_i, dy_i)$ at some pixel of the previous image, and $b$ is an $N \times 1$ column vector of errors (image intensity differences) between the previous and current image. To solve this with least squares, we compute $$ x = (A^T A)^{-1} A^T b. $$
The new combined motion (current motion + update) should be
new_mot_u = mot_u + x_u;
new_mot_v = mot_v + x_v;
Notice that $A$, and hence $(A^T A)^{-1}$, does not change between iterations: we only need to re-warp the current image according to the updated motion.
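In Matlab the update is one least-squares solve per iteration; with A and b as defined above (and mot a [u v s x0 y0] row vector), something like:
AtA = A' * A;            % 2x2; compute once, A does not change between iterations
x   = AtA \ (A' * b);    % least-squares update [u; v]
mot(1) = mot(1) + x(1);  % new u
mot(2) = mot(2) + x(2);  % new v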
The function LKonCoImage implements the LK algorithm on a single level:
[mot, err, imot] = LKonCoImage(prevcoi, curcoi, prect, init_mot, params)
A default set of params can be generated with LKinitParams();
Do this and turn in:
Note: for the scale estimation (which you will implement below) to be numerically stable, [x0,y0] should be close to the center of the rectangle (rectCenter() can help here).
If you set the show_fig parameter to 1, you should be able to see the animation of the iterations.
This will display a plot showing three images: the previous image inside the rectangle, the current image warped by the current motion estimate, and the difference between them.
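A usage sketch (the signature is as above; init_mot here is an identity motion centered on the rectangle, per the note about [x0,y0]):
params = LKinitParams();
params.show_fig = 1;                 % animate the iterations
c = rectCenter(prect);
init_mot = [0 0 0 c(1) c(2)];        % identity motion, centered on prect
[mot, err, imot] = LKonCoImage(prevcoi, curcoi, prect, init_mot, params);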
Now we will implement motion with scale. The formulas are very similar: $x = (u, v, s)$ is the mot_update we want to solve for, and so each row of $A$ has another column: $$ A_i = (dx_i \; dy_i \; ww_i) $$ where $ww = dx \cdot (x-x_0) + dy \cdot (y-y_0)$ for each pixel inside the rectangle. (Thus, $A^T A$ will be a $3 \times 3$ matrix.)
Finally the motion update is a bit more complex:
new_mot_u = mot_u + x_u + x_u * mot_s;
new_mot_v = mot_v + x_v + x_v * mot_s;
new_mot_s = mot_s + x_s + mot_s * x_s;
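In code, the extra column of $A$ might be built like this (a sketch; Ix, Iy, xs, ys are hypothetical names for the gradients and pixel coordinates inside the rectangle, all N-by-1 vectors):
% (x0, y0) is the center of motion, close to the rectangle's center.
ww = Ix .* (xs - x0) + Iy .* (ys - y0);
A  = [Ix, Iy, ww];            % N-by-3, so A'*A is 3-by-3
x  = (A' * A) \ (A' * b);     % update [u; v; s]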
Do this and turn in:
The function LKonPyramid.m implements the multi-resolution LK. It should call LKonCoImage() for each level of a Gaussian image pyramid. The image pyramid is built using
pyr = coPyramid(coi, levs);
Each time we go "up" a level, the image is subsampled by a factor of 2. The origin of the coordinate system is kept at the same pixel, but the width and height of any object are reduced to 1/2. A feature at coordinate x_L2 on level 2 would have coordinate x_L1 = x_L2 * 2 on level 1. The function rectChangeLevel() implements level conversions for rectangles.
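More generally, a coordinate $x$ on level $m$ becomes $x \cdot 2^{\,m-n}$ on level $n$: going up one level halves coordinates, going down one level doubles them.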
The first task is to figure out which levels of the pyramid we want to work on. The function defineActiveLevels should return a vector [L_start : L_end], inclusive, of the levels we can work on. L_start for now should be the lowest level of the given pyramid (since we are willing to work on very big images), but the top level needs to be computed so that the number of pixels in the area of the rectangle is not smaller than specified in the parameter min_pix(1).
We also need to update the motion as we move from level to level. This is done by calling the function uvsChangeLevel() (in the uvs/ subdirectory).
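One way to compute the top level (a sketch; assumes rectSize returns [width height] and that the threshold is reachable as params.min_pix(1)):
% The rectangle's pixel count shrinks by 4x per level, so the highest
% usable level still keeps at least min_pix(1) pixels inside the rect.
sz = rectSize(prect);                                  % [w h] at level L_start
k  = max(0, floor(log2(prod(sz) / params.min_pix(1)) / 2));
L_end  = min(L_start + k, L_top);                      % L_top: top level of pyr
levels = L_start : L_end;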
Do this and turn in:
Function LKonSequence runs LKonPyramid on a sequence in an incremental way: 1→2, 2→3, 3→4, ...
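Schematically, the incremental structure looks like this (a sketch; the exact interfaces are in the starter code, and uvsRect is a hypothetical helper that moves the rectangle by the estimated motion):
prevcoi = fs.readNextImage();
mot = init_mot;
for k = 1:nframes-1
    curcoi = fs.readNextImage();
    [mot, err] = LKonPyramid(prevcoi, curcoi, prect, mot, params);
    prect   = uvsRect(mot, prect);   % hypothetical: carry the rect forward
    prevcoi = curcoi;                % 1->2, 2->3, 3->4, ...
end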
Do this and turn in:
Try running your finished algorithm on your own video sequence, using initial rectangles produced by your face detection algorithm from assignment 2!
If you have a different partner than in assignment 2, you may use either partner's detector.
This is a more open-ended task to help you practice for working on your final projects, where you will not have instructions to guide you. Use your best judgement as to how to solve the problems described below, keeping in mind the goal of demonstrating your LK implementation using a dataset captured "in the wild."
Do this and turn in:
Write this bounding box into a file matching the format of the ground truth face rectangles from the provided sequences (e.g. girl_gt.txt for the girl sequence).
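For example (a sketch only; the exact file format is an assumption, so check girl_gt.txt and mirror it exactly):
% Hypothetical: all_rects holds one [xmin xmax ymin ymax] row per frame.
fid = fopen('myface_gt.txt', 'w');
for i = 1:size(all_rects, 1)
    fprintf(fid, '%g %g %g %g\n', all_rects(i,:));
end
fclose(fid);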
The Dropbox link to submit your assignment is here.