COS 429 - Computer Vision
Fall 2017
In this assignment you will build a face tracker based on the
Lucas-Kanade algorithm. For background, you should read
the Lucas-Kanade paper
and look at the
notes for lectures 13 and 14.
Do this:
We will be working with two motion models: pure translation, $(u,v)$, and translation plus scale, $(u,v,s)$.
To define the motion of a point $(x,y)$ due to scale, one needs to define the "center" of motion. We will refer to this arbitrary point as $(x_0,y_0)$. Thus scaling the point $(x,y)$ means $$ x' = x + (x-x_0)\,s $$ $$ y' = y + (y-y_0)\,s $$
After scaling, we also apply the translation, producing the motion equations: $$ x' = x + u + (x-x_0)\,s $$ $$ y' = y + v + (y-y_0)\,s $$
In the code, the directory uvs/ contains functions that manipulate $(u,v,s)$ motion models. The data structure they use is simply a row vector of 5 elements: [u v s x0 y0]. For example, the function
wc = uvsCWarp(mot, c)
takes a motion model mot = [u v s x0 y0] and coordinate c = [x,y] and produces the new warped point wc = [x',y'].
Note that most of the functions are written in a vectorized form: they accept multiple rows of motions (and/or points), hence you see a lot of code such as mot(:,U).
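For example, here is a minimal numeric sanity check of the warp equations above (the values are made up):
mot = [2 -1 0.1 160 120];          % [u v s x0 y0]
c   = [100 80];                    % [x y]
wc  = [c(1) + mot(1) + (c(1) - mot(4)) * mot(3), ...
       c(2) + mot(2) + (c(2) - mot(5)) * mot(3)];
% wc = [96 75], the same answer uvsCWarp(mot, c) should give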
Do this and turn in:
motinv = uvsInv(mot)
in uvsInv.m, which inverts the motion mot. The location of motinv's center of motion [x0',y0'] should be the projection of [x0,y0] by the motion mot. As with the other functions, your function may receive several rows of motions as input and should output the same number of rows, with each motion inverted.
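Hint (a sketch of one possible derivation, not the only route): expanding the forward model gives $x' = x(1+s) + u - x_0 s$, so solving for $x$ yields $$ x = \frac{x' - u + x_0 s}{1+s}, $$ which can be rearranged back into the motion equations' form around the new center $x_0' = x_0 + u$. A quick check: applying mot and then motinv to any point (e.g. via uvsCWarp) should return that point.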
In frame-to-frame alignment, our goal is to estimate these parameters, $(u,v)$ or $(u,v,s)$, from a local part of a given pair of images. For simplicity, we will limit ourselves to rectangular areas of the image. Our rectangles are defined as rect = [xmin xmax ymin ymax] and the directory rect/ contains functions to manipulate rectangles. (Technical note: the sides of a rect can take on non-integer values, where a rectangle may include only part of some pixels. Pixels are centered on the integer grid, thus a rectangle that contains exactly pixel [1,3] would be rect = [0.5 1.5 2.5 3.5].)
Given 2 rectangles (of the same aspect ratio), we can compute the $(u,v,s)$ motion between them. This could be useful to define an initial motion for the tracking if you have a guess of 2 rectangles that define the object (e.g. by running your face detector from assignment 2!).
Do this and turn in:
mot = rect2uvs(r1, r2)
in rect2uvs.m that defines this motion. If the aspect ratio of the 2 rects is not exactly the same, just use the average of the 2 possible scales. (Hint: notice the functions rectCenter(r) and rectSize(r) in the rect/ directory).
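A minimal sketch of the idea (assuming rectCenter returns [x y] and rectSize returns [width height]; verify against the functions in rect/):
c1 = rectCenter(r1);  c2 = rectCenter(r2);
sz1 = rectSize(r1);   sz2 = rectSize(r2);
s   = mean(sz2 ./ sz1) - 1;          % average of the two possible scales
uv  = c2 - c1;                       % moves r1's center onto r2's center
mot = [uv(1) uv(2) s c1(1) c1(2)];   % center of motion at r1's center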
Included with the starter code (in the data/ subdirectory) are the 'girl' and 'david' video sequences of faces from http://vision.ucsd.edu/~bbabenko/project_miltrack.html .
Each frame of the video is just an image (here it's actually stored as a set of png files). We could represent the image as just a Matlab matrix, as you have done in the previous assignments, but we will be cutting and warping images, and it is useful for an image to carry a coordinate system that is not necessarily rooted at the top-left corner. Here we will use a "coordinate image" struct (often called coimage or coi in the code). It is just a wrapper around the matrix with some extra fields, for example:
coi =
        im: [240x320 double]
    origin: [30 20]
     label: 'img00005'
     level: 1
The level field will be useful later to keep track of which pyramid-level this image represents.
The directory coi/ contains functions to create and manipulate these images. See the constructor function coimage.m for more information.
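For illustration only, a hypothetical, hand-built equivalent of the struct above (in practice, use the coimage.m constructor):
% Hypothetical: what the fields of a coimage amount to. Prefer the
% coimage.m constructor over building the struct by hand.
coi.im     = im2double(imread('img00005.png'));   % pixel data
coi.origin = [30 20];     % origin of the coordinate system (see coimage.m)
coi.label  = 'img00005';  % a name, handy for plots and debugging
coi.level  = 1;           % pyramid level this image represents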
The class FaceSequence (coi/FaceSequence.m) is there to simplify access to the tracking sequences described above. To initialize:
fs = FaceSequence('path/to/data/girl');
Then, to access e.g. the 5th image and read it into a coimage structure, you can do:
coi = fs.readImage(5);
(NOTE: this is probably the image img00004.png - don't be confused.) If you want to access a sequence of images, say every 3rd image starting from the 2nd, do:
fs.next = 2;  fs.step = 3;
coi1 = fs.readNextImage();   % the 2nd
coi2 = fs.readNextImage();   % the 5th
...
Additionally, fs contains the "ground truth" rectangles stored with the clip in
rect = fs.gt_rect(1,:); % rectangle for 1st image
Beware: only some frames have a valid ground truth rectangle; for the others, rect = [0 0 0 0];
Do this and turn in:
drawFaceSequence(fs, from, step, number, rects)
in drawFaceSequence.m that will display an animation of the video with the rectangles from rects drawn on it. Use the pause() command in your for loop to make sure Matlab draws each frame and the animation is not shown too fast. For example, including pause(0.2) should display the animation at roughly 5 frames per second.
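One possible skeleton (a sketch, not a reference solution; for simplicity it ignores coi.origin when drawing):
function drawFaceSequence(fs, from, step, number, rects)
fs.next = from;  fs.step = step;
for i = 1:number
    coi = fs.readNextImage();
    imshow(coi.im, []);  hold on;
    r = rects(i,:);                  % [xmin xmax ymin ymax]
    if any(r)                        % skip the [0 0 0 0] "no truth" rects
        rectangle('Position', [r(1) r(3) r(2)-r(1) r(4)-r(3)], 'EdgeColor', 'r');
    end
    hold off;
    pause(0.2);                      % ~5 frames per second
end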
Review the tracking lectures. The Lucas-Kanade algorithm repeatedly warps the "current" image backward according to the current motion to be similar to the "previous" image inside the area defined by the "previous rectangle".
Let $N$ be the number of pixels inside the rectangle. Recall that we need to compute a motion update $x = (u,v)$ (for translation-only motion) that satisfies: $$ Ax=b $$ where $A$ is an $N \times 2$ matrix, each of whose rows is the image gradient $(dx_i, dy_i)$ at some pixel of the previous image, and $b$ is an $N \times 1$ column vector of errors (image intensity differences) between the previous and current image. To solve this with least squares, we compute $$ x = (A^T A)^{-1} A^T b. $$
The new combined motion (current motion + update) should be
new_mot_u = mot_u + x_u;
new_mot_v = mot_v + x_v;
Notice that $A$, and hence $(A^T A)^{-1}$, does not change between iterations: we only need to re-warp the current image according to the updated motion.
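In Matlab the update is one least-squares solve per iteration; with A and b as defined above (and mot a [u v s x0 y0] row vector), something like:
AtA = A' * A;            % 2x2; compute once, A does not change between iterations
x   = AtA \ (A' * b);    % least-squares update [u; v]
mot(1) = mot(1) + x(1);  % new u
mot(2) = mot(2) + x(2);  % new v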
The function LKonCoImage implements the LK algorithm on a single level:
[mot, err, imot] = LKonCoImage(prevcoi, curcoi, prect, init_mot, params)
A default set of params can be generated with LKinitParams();
Do this and turn in:
Note: for the scale estimation (which you will implement below) to be numerically stable, [x0,y0] should be close to the center of the rectangle (rectCenter() can help here).
If you set the show_fig parameter to 1, you should be able to see the animation of the iterations.
This will display a plot showing three images: the previous image inside the rectangle, the current image warped by the current motion estimate, and the difference between them.
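A usage sketch (the signature is as above; init_mot here is an identity motion centered on the rectangle, per the note about [x0,y0]):
params = LKinitParams();
params.show_fig = 1;                 % animate the iterations
c = rectCenter(prect);
init_mot = [0 0 0 c(1) c(2)];        % identity motion, centered on prect
[mot, err, imot] = LKonCoImage(prevcoi, curcoi, prect, init_mot, params);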
Now we will implement motion with scale. The formulas are very similar: $x = (u, v, s)$ is the mot_update we want to solve for, and so each row of $A$ has another column: $$ A_i = (dx_i \; dy_i \; ww_i) $$ where $ww = dx \cdot (x-x_0) + dy \cdot (y-y_0)$ for each pixel inside the rectangle. (Thus, $A^T A$ will be a $3 \times 3$ matrix.)
Finally the motion update is a bit more complex:
new_mot_u = mot_u + x_u + x_u * mot_s;
new_mot_v = mot_v + x_v + x_v * mot_s;
new_mot_s = mot_s + x_s + mot_s * x_s;
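In code, the extra column of $A$ might be built like this (a sketch; Ix, Iy, xs, ys are hypothetical names for the gradients and pixel coordinates inside the rectangle, all N-by-1 vectors):
% (x0, y0) is the center of motion, close to the rectangle's center.
ww = Ix .* (xs - x0) + Iy .* (ys - y0);
A  = [Ix, Iy, ww];            % N-by-3, so A'*A is 3-by-3
x  = (A' * A) \ (A' * b);     % update [u; v; s]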
Do this and turn in:
The function LKonPyramid.m implements the multi-resolution LK. It should call LKonCoImage() for each level of a Gaussian image pyramid. The image pyramid is built using
pyr = coPyramid(coi, levs);
Each time we go "up" a level, the image is subsampled by a factor of 2. The origin of the coordinate system is kept at the same pixel, but the width and height of any object are reduced to 1/2. A feature at coordinate x_L2 on level 2 would have coordinate x_L1 = x_L2 * 2 on level 1. The function rectChangeLevel() implements level conversions for rectangles.
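More generally, a coordinate $x$ on level $m$ becomes $x \cdot 2^{\,m-n}$ on level $n$: going up one level halves coordinates, going down one level doubles them.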
The first task is to figure out which levels of the pyramid we want to work on. The function defineActiveLevels should return a vector [L_start : L_end], inclusive, of the levels we can work on. L_start for now should be the lowest level of the given pyramid (since we are willing to work on very big images), but the top level needs to be computed so that the number of pixels in the area of the rectangle is not smaller than specified in the parameter min_pix(1).
We also need to update the motion as we move from level to level. This is done by calling the function uvsChangeLevel() (in the uvs/ subdirectory).
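One way to compute the top level (a sketch; assumes rectSize returns [width height] and that the threshold is reachable as params.min_pix(1)):
% The rectangle's pixel count shrinks by 4x per level, so the highest
% usable level still keeps at least min_pix(1) pixels inside the rect.
sz = rectSize(prect);                                  % [w h] at level L_start
k  = max(0, floor(log2(prod(sz) / params.min_pix(1)) / 2));
L_end  = min(L_start + k, L_top);                      % L_top: top level of pyr
levels = L_start : L_end;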
Do this and turn in:
Function LKonSequence runs LKonPyramid on a sequence in an incremental way: 1→2, 2→3, 3→4, ...
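Schematically, the incremental structure looks like this (a sketch; the exact interfaces are in the starter code, and uvsRect is a hypothetical helper that moves the rectangle by the estimated motion):
prevcoi = fs.readNextImage();
mot = init_mot;
for k = 1:nframes-1
    curcoi = fs.readNextImage();
    [mot, err] = LKonPyramid(prevcoi, curcoi, prect, mot, params);
    prect   = uvsRect(mot, prect);   % hypothetical: carry the rect forward
    prevcoi = curcoi;                % 1->2, 2->3, 3->4, ...
end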
Do this and turn in:
Try running your finished algorithm on your own video sequence, using initial rectangles produced by your face detection algorithm from assignment 2!
If you have a different partner than in assignment 2, you may use either partner's detector.
This is a more open-ended task to help you practice for working on your final projects, where you will not have instructions to guide you. Use your best judgement as to how to solve the problems described below, keeping in mind the goal of demonstrating your LK implementation using a dataset captured "in the wild."
Do this and turn in:
Write this bounding box into a file matching the format of the ground truth face rectangles from the provided sequences (e.g. girl_gt.txt for the girl sequence).
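For example (a sketch only; the exact file format is an assumption, so check girl_gt.txt and mirror it exactly):
% Hypothetical: all_rects holds one [xmin xmax ymin ymax] row per frame.
fid = fopen('myface_gt.txt', 'w');
for i = 1:size(all_rects, 1)
    fprintf(fid, '%g %g %g %g\n', all_rects(i,:));
end
fclose(fid);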
The Dropbox link to submit your assignment is here.