COS 429 - Computer Vision
Fall 2016
In this assignment you will build a face tracker based on the
Lucas-Kanade algorithm. For background, you should read
the Lucas-Kanade paper
and look at the
notes for lecture 12.
Begin by downloading the starter code and data, unzipping it, changing to the directory containing the code, and running hw3start to put all the subdirectories on Matlab's path.
The script hw3test.m has some useful commands and tests for the assignment.
1.1 Motion
We will be working with two motion models: pure translation, parameterized by $(u,v)$, and translation plus scale, parameterized by $(u,v,s)$.
To define the motion of a point $(x,y)$ due to scale, one needs to define the "center" of motion. We will refer to this arbitrary point as $(x_0,y_0)$. Thus scaling the point $(x,y)$ means $$ x' = x+(x-x_0)\cdot s $$ $$ y' = y+(y-y_0)\cdot s $$
After scaling, we also apply the translation, producing the motion equations: $$ x' = x+u+(x-x_0)\cdot s $$ $$ y' = y+v+(y-y_0)\cdot s $$
In the code, the directory uvs/ contains functions that manipulate $(u,v,s)$ motion models. The data structure they use is simply a row vector of 5 elements: [u v s x0 y0]. For example, the function
wc = uvsCWarp(mot, c)
takes a motion model mot = [u v s x0 y0] and coordinate c = [x,y] and produces the new warped point wc = [x',y'].
Note that most of the functions are written in a vectorized form: they accept multiple rows of motions (and/or points), hence you will see code such as mot(:,U).
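For example, applying the motion equations above by hand (the numbers here are made up purely for illustration):

mot = [2 3 0.5 0 0];         % u=2, v=3, s=0.5, center of motion at the origin
wc  = uvsCWarp(mot, [1 1]);  % x' = 1 + 2 + (1-0)*0.5 = 3.5
                             % y' = 1 + 3 + (1-0)*0.5 = 4.5, so wc = [3.5 4.5]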
In the uvs/ subdirectory, write the body of the function
motinv = uvsInv(mot)
which inverts the motion mot. The location of motinv's center of motion [x0',y0'] should be the projection of [x0,y0] by the motion mot. Like the other functions, yours may receive several rows of motions as input and should output the same number of rows, with each motion inverted. Try to do it without a loop.
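One way to derive the inverse (a sketch, not necessarily the intended solution): composing the inverse with the forward motion must give the identity. With the center moved to $x_0' = x_0 + u$, $y_0' = y_0 + v$, solving $x' = x + u + (x-x_0)\cdot s$ for $x$ gives $u' = -u$, $v' = -v$, and $s' = -s/(1+s)$. A vectorized body could then look like:

motinv = [ -mot(:,1), ...                % u'  = -u
           -mot(:,2), ...                % v'  = -v
           -mot(:,3)./(1+mot(:,3)), ...  % s'  = -s/(1+s)
           mot(:,4)+mot(:,1), ...        % x0' = x0 + u
           mot(:,5)+mot(:,2) ];          % y0' = y0 + v

You can sanity-check any such formula by warping a point with mot, then with uvsInv(mot), and verifying that you get the original point back.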
Include in your readme.txt the output of the commands in part 1.1 of hw3test.m.
1.2 Rectangles
In frame-to-frame alignment, our goal is to estimate these parameters, $(u,v)$ or $(u,v,s)$, from a local part of a given pair of images. For simplicity, we will limit ourselves to rectangular areas of the image. Our rectangles are defined as rect = [xmin xmax ymin ymax], and the directory rect/ contains functions to manipulate rectangles. (Technical note: the sides of a rect can take on non-integer values, so a rectangle may include only part of some pixels. Pixels are centered on the integer grid; thus a rectangle that contains exactly the pixel [1,3] would be rect = [0.5 1.5 2.5 3.5].)
Given two rectangles (of the same aspect ratio), we can compute the $(u,v,s)$ motion between them. This can be useful later for defining an initial motion for tracking when you have two rectangles that are guesses for the object's location (e.g. from running the face detector).
In the rect/ subdirectory, finish the function
mot = rect2uvs(r1, r2)
that computes this motion. If the aspect ratios of the two rects are not exactly the same, just use the average of the two possible scales. (Hint: note the functions rectCenter(r) and rectSize(r) in the rect/ directory.)
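A possible sketch, assuming rectCenter(r) returns the center as [x y] and rectSize(r) returns [width height] (check the actual conventions in rect/):

c1 = rectCenter(r1);  c2 = rectCenter(r2);
s  = mean(rectSize(r2) ./ rectSize(r1)) - 1;  % average of the two scale estimates
mot = [c2 - c1, s, c1];                       % [u v s x0 y0], centered on r1's center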
Include in your readme.txt the output of the commands in part 1.2 of hw3test.m.
1.3 Image sequence
Included with the starter code (in the data/ subdirectory) are the 'girl' and 'david' video sequences of faces from http://vision.ucsd.edu/~bbabenko/project_miltrack.html .
Each frame of the video is just an image (here it is actually stored as a set of PNG files). We could represent the image as just a Matlab matrix, as you have done in the previous assignments, but we will be cutting and warping images, and it is useful for an image to carry a coordinate system that is not necessarily rooted at the top-left corner. Here we will use a "coordinate image" struct (called coimage or coi in the code). It is just a wrapper around the matrix with some extra fields, for example:
coi = 
        im: [240x320 double]
    origin: [30 20]
     label: 'img00005'
     level: 1
The level field will be useful later to keep track of which pyramid-level this image represents.
The directory coi/ contains functions to create and manipulate these images. See the constructor function coimage.m for more information.
The class FaceSequence (coi/FaceSequence.m) simplifies access to the tracking sequences downloaded above. To initialize:
fs = FaceSequence('path/to/data/girl');
Then, to access e.g. the 5th image and read it into a coimage structure, you can do:
coi = fs.readImage(5);
(NOTE: this is probably the image img00004.png - don't be confused.) If you want to access a sequence of images, say every 3rd image starting from the 2nd, do:
fs.next = 2; fs.step = 3;
coi1 = fs.readNextImage();  % the 2nd
coi2 = fs.readNextImage();  % the 5th
...
Additionally, fs contains the "ground truth" rectangles stored with the clip in
rect = fs.gt_rect(1,:); % rectangle for 1st image
Beware: only some frames have valid ground-truth rectangles; for the other frames, rect = [0 0 0 0].
In the coi/ subdirectory, write the function
drawFaceSequence(fs, from, step, number, rects)
that will display an animation of the video with the rectangles from rects drawn on it. Use the pause() command in your for loop to make sure Matlab draws each frame and the animation is not shown too fast. For example, including pause(0.2) should display the animation at roughly 5 frames per second.
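A minimal sketch of what such a function could look like (it assumes rects holds one [xmin xmax ymin ymax] row per displayed frame, and ignores the coimage origin, which is fine for frames read straight from a FaceSequence):

function drawFaceSequence(fs, from, step, number, rects)
fs.next = from;  fs.step = step;
for i = 1:number
    coi = fs.readNextImage();
    imshow(coi.im, []);  hold on;
    r = rects(i,:);
    if any(r)   % skip the all-zero "no ground truth" rectangles
        plot(r([1 2 2 1 1]), r([3 3 4 4 3]), 'g-', 'LineWidth', 2);
    end
    hold off;
    pause(0.2);  % roughly 5 frames per second
end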
Review slides 24-27 of the Lucas-Kanade lecture. The algorithm repeatedly warps the "current" image backward according to the current motion, making it similar to the "previous" image inside the area defined by the "previous" rectangle.
Let $N$ be the number of pixels inside the rectangle. Recall that we need to compute a motion update $x = (u,v)$ (for translation-only motion) that satisfies $$ Ax=b $$ where $A$ is an $N \times 2$ matrix, each of whose rows is the image gradient $(dx, dy)$ at some pixel of the previous image, and $b$ is an $N \times 1$ column vector of errors (image intensity differences) between the previous and current image. To solve this with least squares, we compute $$ x = (A^T A)^{-1} A^T b. $$
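In Matlab the least-squares solve is a one-liner. A sketch with illustrative variable names (not necessarily those used in the starter code):

A = [Ix(:) Iy(:)];           % Ix, Iy: gradients of the previous image inside the rect
x = (A' * A) \ (A' * b(:));  % b: intensity differences; backslash avoids an explicit inverse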
The new combined motion (input motion + update) should be
new_mot_u = mot_u + x_u;
new_mot_v = mot_v + x_v;
Notice that $A$, and hence $(A^T A)^{-1}$, does not change between iterations: we only need to re-warp the current image according to the updated motion.
The function LKonCoImage implements the LK algorithm on a single level:
[mot, err, imot] = LKonCoImage(prevcoi, curcoi, prect, init_mot, params)
A default set of params can be generated with params = LKinitParams();
2.1 (u,v) motion
Finish the missing code inside LKonCoImage.m to estimate the $(u,v)$ motion increment. You need to edit four places.
Now call LKonCoImage on pairs of nearby frames, using the ground truth rectangle fs.gt_rect(i,:) as the prect. The input motion can be blank: [0 0 0 x0 y0].
Note: for the scale estimation (which you will implement below) to be numerically stable, [x0,y0] should be close to the center of the rectangle (rectCenter() can help here).
If you set the show_fig parameter to 1, you should be able to see the animation of the iterations.
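Putting the pieces together, a test call could look like this (the frame indices are arbitrary, and the params.show_fig field name is an assumption based on the parameter described above):

fs = FaceSequence('path/to/data/girl');
prevcoi = fs.readImage(1);
curcoi  = fs.readImage(3);
prect   = fs.gt_rect(1,:);
c       = rectCenter(prect);   % keep [x0,y0] at the rectangle center
params  = LKinitParams();
params.show_fig = 1;           % watch the iterations
[mot, err] = LKonCoImage(prevcoi, curcoi, prect, [0 0 0 c(1) c(2)], params);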
2.2 (u,v,s) motion
Now we will implement motion with scale. The formulas are very similar: $x = (u, v, s)$ is the mot_update we want to solve for, so each row of $A$ gains a third column: $$ A_i = (dx_i \; dy_i \; ww_i) $$ where $ww = dx \cdot (x-x_0) + dy \cdot (y-y_0)$ for each pixel inside the rectangle. (Thus, $A^T A$ will be a $3 \times 3$ matrix.)
Finally, the motion update is a bit more complex:
new_mot_u = mot_u + x_u + x_u * mot_s;
new_mot_v = mot_v + x_v + x_v * mot_s;
new_mot_s = mot_s + x_s + mot_s * x_s;
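A sketch of the extra pieces for the scale case, again with illustrative names (X and Y stand for the pixel coordinates inside the rectangle):

ww = Ix(:).*(X(:) - x0) + Iy(:).*(Y(:) - y0);  % third column of A
A  = [Ix(:) Iy(:) ww];
x  = (A' * A) \ (A' * b(:));                   % x = [du; dv; ds]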
Finish the code in LKonCoImage to deal with the case do_scale=1.
2.3 Experimenting with single-scale LK
Experiment with LKonCoImage and answer the following questions:
The function LKonPyramid.m implements the multi-resolution LK. It should call LKonCoImage() for each level of a Gaussian image pyramid. The image pyramid is built using
pyr = coPyramid(coi, levs);
Each time we go "up" a level, the image is subsampled by a factor of 2. The origin of the coordinate system is kept at the same pixel, but the width and height of any object are reduced to 1/2. A feature at coordinate x_L2 on level 2 would have coordinate x_L1 = 2*x_L2 on level 1. The function rectChangeLevel() implements level conversions for rectangles.
3.1 Picking active levels
The first task is to figure out which levels of the pyramid to work on. The function defineActiveLevels should return a vector [L_start : L_end], inclusive, of the levels to use. For now, L_start should be the lowest level of the given pyramid (since we are willing to work on very big images), but the top level must be computed so that the number of pixels inside the rectangle does not fall below the parameter min_pix(1).
Fill in defineActiveLevels.m
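Since going up one level shrinks the rectangle's area by a factor of 4, the top level can be computed in closed form. A sketch (top_level is a hypothetical name; clamp to whatever your pyramid actually provides):

A = prod(rectSize(prect));                              % rectangle area at L_start
L_end = L_start + floor(log(A / min_pix(1)) / log(4));  % area shrinks 4x per level
L_end = min(L_end, top_level);                          % stay inside the pyramid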
3.2 Updating the motion
We also need to update the motion as we move from level to level. This is done by calling the function uvsChangeLevel() (in the uvs/ subdirectory). Implement the body of this function.
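A possible sketch, assuming a signature uvsChangeLevel(mot, l1, l2) that converts motions from level l1 to level l2: coordinates shrink by a factor of 2 per level going up, so u, v, x0, and y0 must be rescaled, while s is a ratio and stays unchanged.

f = 2^(l1 - l2);                           % e.g. f = 1/2 going up one level
mot(:,[1 2 4 5]) = mot(:,[1 2 4 5]) * f;   % rescale u, v, x0, y0; s is unchanged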
3.3 Experimenting with multi-scale LK
Experiment with LKonPyramid on various examples. How much further can init_motion be from the true motion?
The function LKonSequence runs LKonPyramid on a sequence in an incremental way: 1→2, 2→3, 3→4, ...
Implement the missing lines in the loop of LKonSequence.m. The code should call LKonPyramid for each pair of images, record the resulting motion and rectangle in the mots and rects matrices, and update init_motion for the next frame. You can assume that the motion to the next frame is similar to the motion to the current frame, perhaps with some dampening. Recall that the input motion's center $(x_0,y_0)$ should be close to the center of prect.
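A hedged sketch of the loop body (all names are illustrative, and the LKonPyramid signature is assumed to mirror LKonCoImage; the rectangle is propagated by warping two of its corners with the documented uvsCWarp):

[mot, err] = LKonPyramid(prevcoi, curcoi, prect, init_motion, params);
mots(i,:)  = mot;
p1 = uvsCWarp(mot, prect([1 3]));   % warp the [xmin ymin] corner
p2 = uvsCWarp(mot, prect([2 4]));   % warp the [xmax ymax] corner
prect = [p1(1) p2(1) p1(2) p2(2)];
rects(i,:) = prect;
c = rectCenter(prect);
init_motion = [0.5*mot(1:3), c];    % dampened guess, re-centered on the new prect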
How long can you track, and under what situations does it break?
The Dropbox link to submit your assignment is here.