CS 291A -- Mixed and Augmented Reality
Assignment 4 Option: AR Magic Mirror


Change Log: ...


For this assignment, you will build a "magic mirror" application which augments the view from your laptop camera with 3D graphics. You will also implement computer vision techniques to recognize gestures from the user for interaction. This application will build upon the camera calibration and video capture code you implemented in the previous assignment. Starting with a basic video viewer, add functionality step by step as you create an AR Magic Mirror.

1. Mirrored Video Display

The first step is to implement a basic video viewer which grabs frames from either a camera or a video file. As in the last assignment, accept a single optional command-line argument which specifies the path to a video file. If the argument is omitted, the application should start video capture from your camera. You can re-use the code you developed in the previous assignment or use one of the sample solutions we will make available in Slack.

As a recommendation, OpenGL code should not perform extensive computation in the display() function. Computation should be performed in the idle() function (registered with glutIdleFunc()). The display() function should only redraw elements to the screen. For your application, grab a frame from the camera or video file in the idle() function, store it in a global structure, and draw the image in the display() function.
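For reference, here is a minimal sketch of this structure, assuming a global cv::VideoCapture named g_capture and a global cv::Mat named g_frame (both names are placeholders for whatever you use in your own code):

    #include <opencv2/opencv.hpp>
    #include <GL/glut.h>              // <GLUT/glut.h> on macOS

    cv::VideoCapture g_capture;       // opened from the camera or a video file in main()
    cv::Mat          g_frame;         // most recently grabbed frame

    void idle()
    {
        cv::Mat frame;
        if (g_capture.read(frame))    // grab the next frame here, not in display()
        {
            g_frame = frame;
            glutPostRedisplay();      // ask GLUT to call display() again
        }
    }

    void display()
    {
        glClear(GL_COLOR_BUFFER_BIT);
        // ... draw g_frame, e.g. as a textured quad or with glDrawPixels() ...
        glutSwapBuffers();
    }

    // registered during initialization:
    //   glutDisplayFunc(display);
    //   glutIdleFunc(idle);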

To simulate a mirror, your application must flip the video image horizontally before displaying it.
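With OpenCV this is a one-line operation; for example, assuming the current frame is stored in g_frame:

    cv::flip(g_frame, g_frame, 1);   // flip code 1 mirrors around the vertical axis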

2. Face Detection

The OpenCV library offers many options for object detection, and comes with pre-trained classifiers for faces, eyes, and other objects. These are found under data/haarcascades in the OpenCV distribution, and are typically installed at .../opencv4/haarcascades. Use the CascadeClassifier class to load the file haarcascade_frontalface_default.xml (or pick a different one from other recommendations, if you think they will work better).

Before running the classifier, you must create a grayscale copy of your image. Also, object detection is computationally intensive and is most likely too slow to run in real time on the full-size camera image, so create a downsampled version of the camera image before running the classifier. Resize (cv::resize) the image to a width of 320 pixels.

Run face detection using the function CascadeClassifier::detectMultiScale(). Use a scale factor of 1.1 and minNeighbors = 2. For the flags, use CASCADE_SCALE_IMAGE (CV_HAAR_SCALE_IMAGE in older OpenCV versions), which speeds up computation, and CASCADE_FIND_BIGGEST_OBJECT (CV_HAAR_FIND_BIGGEST_OBJECT) so that at most one object is returned.
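As a rough sketch of the whole detection step (the cascade path and the variable names are placeholders; adjust them to your setup):

    #include <opencv2/objdetect.hpp>
    #include <opencv2/imgproc.hpp>
    #include <vector>

    cv::CascadeClassifier g_faceCascade;
    // loaded once at startup, e.g.:
    // g_faceCascade.load(".../opencv4/haarcascades/haarcascade_frontalface_default.xml");

    std::vector<cv::Rect> detectFace(const cv::Mat& frame)
    {
        cv::Mat gray, smallGray;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);     // classifier needs grayscale
        double scale = 320.0 / gray.cols;                  // downsample to a width of 320
        cv::resize(gray, smallGray, cv::Size(), scale, scale);

        std::vector<cv::Rect> faces;
        g_faceCascade.detectMultiScale(smallGray, faces,
                                       1.1,                // scale factor
                                       2,                  // minNeighbors
                                       cv::CASCADE_SCALE_IMAGE | cv::CASCADE_FIND_BIGGEST_OBJECT);
        return faces;                                      // rectangles are in smallGray coordinates
    }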

The face detector should return a rectangle covering your face, but its fit varies from person to person. Scale the height and width of the rectangle as necessary so that it approximately covers your face.

Finally, display a yellow rectangle around the face in the camera image. Toggle the rectangle using the 'r' key.
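One possible way to do this, assuming detection ran on the 320-pixel-wide image and g_showRect is a flag toggled in your 'r' key handler (drawing the rectangle with OpenGL lines works just as well):

    bool g_showRect = true;

    void drawFaceRect(cv::Mat& frame, const cv::Rect& faceSmall)
    {
        if (!g_showRect) return;
        double s = frame.cols / 320.0;                   // undo the downsampling
        cv::Rect face(cvRound(faceSmall.x * s), cvRound(faceSmall.y * s),
                      cvRound(faceSmall.width * s), cvRound(faceSmall.height * s));
        cv::rectangle(frame, face, cv::Scalar(0, 255, 255), 2);   // BGR yellow
    }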


3. From 2D to 3D

Now that you have detected the face, you will create an augmented reality experience by rendering a 3D object which spins around the head of the viewer. First, you must determine the 3D position of the head based on the 2D rectangle given by the face detector. You need several pieces of information to determine this:
1. The intrinsic parameters of the camera (found in the previous assignment).
2. The height of your head in centimeters (measured by you).
3. The height of your face as seen by the camera (given by the face detector).

Using the basic projection equation,

x = f_x * X / Z + c_x
y = f_y * Y / Z + c_y

where (f_x, f_y) is the focal length and (c_x, c_y) the principal point from your intrinsic calibration, (X, Y, Z) is a 3D point in camera coordinates, and (x, y) is its projection in the image, you can determine the distance from the camera to your head. Figuring out how to compute Z (the distance from the camera to the head) from the three pieces of information above is part of this assignment. Once you have the depth, you can also determine the 3D locations of the four corners of the rectangle given by the face detector, as well as its center point.
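Once Z is known, unprojecting a 2D corner is a direct inversion of the equation above; a minimal sketch (deriving Z itself is left to you):

    cv::Point3f unproject(cv::Point2f p, float Z, float fx, float fy, float cx, float cy)
    {
        float X = (p.x - cx) * Z / fx;   // invert x = f_x * X / Z + c_x
        float Y = (p.y - cy) * Z / fy;   // invert y = f_y * Y / Z + c_y
        return cv::Point3f(X, Y, Z);
    }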

Now that you have the 3D coordinates of the head, use OpenGL to render red 3D points at the four corners of the rectangle. The projection should exactly match the 2D yellow rectangle which you drew in the previous step. The red 3D points should be toggled with the 'r' key.
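A sketch of the point rendering, assuming the OpenGL projection matrix is already set up from your camera intrinsics (as in the previous assignment) and corners3d is a std::vector<cv::Point3f> holding the unprojected corners:

    glDisable(GL_LIGHTING);
    glColor3f(1.0f, 0.0f, 0.0f);       // red
    glPointSize(8.0f);
    glBegin(GL_POINTS);
    for (const cv::Point3f& c : corners3d)
        glVertex3f(c.x, c.y, c.z);
    glEnd();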

 

4. Spinning Teapots

Render a teapot which spins around the center point of the head. This should be toggled with the 'o' key. By default, OpenGL lighting is off and polygons are drawn with a flat, unlit color. Turn on lighting by enabling GL_LIGHTING and GL_LIGHT0. Also, use the color-material mechanism so that the teapot's material is taken from glColor(): enable GL_COLOR_MATERIAL and call glColorMaterial() for front and back faces with the ambient and diffuse components. Be sure to disable these states again afterwards.
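One possible state setup for lit, colored rendering (a sketch, not the only correct order):

    glEnable(GL_LIGHTING);
    glEnable(GL_LIGHT0);
    glEnable(GL_COLOR_MATERIAL);
    glColorMaterial(GL_FRONT_AND_BACK, GL_AMBIENT_AND_DIFFUSE);

    glColor3f(0.8f, 0.2f, 0.2f);           // applied as the teapot's ambient/diffuse material
    // ... translate to the head center, rotate, and draw, e.g. glutSolidTeapot(size) ...

    glDisable(GL_COLOR_MATERIAL);          // restore state afterwards
    glDisable(GL_LIGHT0);
    glDisable(GL_LIGHTING);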

The trick here is to simulate the teapot being occluded by the head. This effect can be achieved using the OpenGL depth buffer. Make sure to allocate the depth buffer during GLUT initialization using the GLUT_DEPTH flag. Enable depth buffering using glEnable( GL_DEPTH_TEST ). This should only be enabled during teapot rendering, and disabled afterwards.

Before rendering the teapot, render a 3D filled rectangle around the head, but render only to the depth buffer. Look at glColorMask() for an idea of how to do this. Now, the teapot will be occluded by the phantom rectangle when it is behind it.
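A sketch of the depth-only "phantom" rectangle, assuming corners3d holds the four 3D corners of the face rectangle (remember to clear the depth buffer each frame by including GL_DEPTH_BUFFER_BIT in glClear()):

    glEnable(GL_DEPTH_TEST);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);   // write depth only, no color
    glBegin(GL_QUADS);
    for (const cv::Point3f& c : corners3d)
        glVertex3f(c.x, c.y, c.z);
    glEnd();
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);       // re-enable color writes

    // ... render the spinning teapot here; it is hidden wherever it falls behind the rectangle ...

    glDisable(GL_DEPTH_TEST);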

Your code should also be able to render a cone or a torus instead of a teapot. The user will step through the list of objects using the gesture recognizer implemented in the next part.



5. Gesture Recognition

You will implement two different techniques to allow the user to activate a button on the screen. Set up two boxes that appear at the top left and top right corners of the screen. The left box should be red, and the right box should be green. When the red box is clicked, the system should select the previous object in the list to render; when the green box is clicked, it should select the next object.

Image Segmentation by Background Subtraction

The first technique is to use background subtraction. The concept of background subtraction is to compare the current camera image to a background image, and use the difference to label each pixel as either background or foreground.

First capture and store a background image when the 'b' key is pressed. Use the down-sampled grayscale image you created for face detection.

At each frame, compute the per-pixel absolute difference between the current image and background image. Resize the absolute difference image back to full size. Then threshold each pixel by 10 to separate background from foreground. (There are OpenCV functions to perform these tasks.)
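A sketch of this per-frame computation, assuming g_background is the downsampled grayscale image stored when 'b' was pressed, smallGray is the current downsampled grayscale frame, and frame is the full-size camera image:

    cv::Mat diff, diffFull, foreground;
    cv::absdiff(smallGray, g_background, diff);                        // per-pixel |current - background|
    cv::resize(diff, diffFull, frame.size());                          // back to full resolution
    cv::threshold(diffFull, foreground, 10, 255, cv::THRESH_BINARY);   // foreground = 255, background = 0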

When the 'g' key is pressed, toggle the display of the thresholded absolute difference image in place of the camera image.

Now, detect when a button is clicked by counting how many foreground pixels there are in the box. This can be performed in one line using OpenCV. When the number of foreground pixels in a box changes from zero to nonzero, that box is clicked. Change the color of the box to yellow while there are foreground pixels in the box.
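For example, assuming buttonRect is the cv::Rect of one of the corner boxes in full-image coordinates and foreground is the thresholded mask from above:

    int  count  = cv::countNonZero(foreground(buttonRect));   // foreground pixels inside the box
    bool inside = (count > 0);
    // a "click" occurs when inside changes from false to true between frames;
    // keep the previous value in a per-button flag and compare it each frame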



Motion Detection with Optical Flow

Our second technique for gesture recognition is to use optical flow. Optical flow is a classical computer vision technique for determining the movement of pixels between two images. The flow image represents the motion vector for each pixel between the previous frame and the current frame.

Use the OpenCV function calcOpticalFlowFarneback() to compute the flow between the previous frame and the current frame. Again, use the downsampled image, because this is a computationally expensive task. Use a pyramid scale of 0.5, one pyramid level, a window size of 3 pixels, one iteration, polyN = 5, and polySigma = 1.1.
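A sketch of the call with these parameters, assuming prevSmallGray and smallGray are consecutive downsampled grayscale frames:

    cv::Mat flow;   // CV_32FC2: per-pixel (dx, dy) motion vectors
    cv::calcOpticalFlowFarneback(prevSmallGray, smallGray, flow,
                                 0.5,    // pyramid scale
                                 1,      // pyramid levels
                                 3,      // window size
                                 1,      // iterations
                                 5,      // polyN
                                 1.1,    // polySigma
                                 0);     // flags
    prevSmallGray = smallGray.clone();   // keep the current frame for the next call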

Now, compute the magnitude of the flow vector at each pixel location. You will need to iterate over the pixels in the flow image and compute the L2 norm of the vector. Resize the flow magnitude image back to full size. Then, threshold the flow magnitude by 10 to find pixels of significant motion.
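A sketch of this step (cv::split followed by cv::magnitude achieves the same result without an explicit loop), again assuming frame is the full-size camera image:

    cv::Mat mag(flow.size(), CV_32F);
    for (int y = 0; y < flow.rows; ++y)
        for (int x = 0; x < flow.cols; ++x)
        {
            cv::Point2f v = flow.at<cv::Point2f>(y, x);
            mag.at<float>(y, x) = std::sqrt(v.x * v.x + v.y * v.y);   // L2 norm of the flow vector
        }

    cv::Mat magFull, motion;
    cv::resize(mag, magFull, frame.size());                          // back to full resolution
    cv::threshold(magFull, motion, 10, 255, cv::THRESH_BINARY);      // pixels of significant motion
    motion.convertTo(motion, CV_8U);                                 // match the background-subtraction mask type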

This thresholded image can be used as in the previous part to determine when a button is clicked.

The 'f' key should switch between the background subtraction and optical flow techniques. The 'g' key should toggle display of the thresholded flow magnitude image.

Here are Quicktime movies of some test sequences: ARChatTest1.mov, ARChatTest2.mov

The thresholds for both background subtraction and optical flow may have to be adjusted to get reasonable performance with the selection (try e.g. 40 instead of 10 for background subtraction and 20 instead of 10 for optical flow).

As calibration values for the camera that took these videos, please use:
 
focal length fc: 684.58685   685.28020
principal point cc: 281.92771   238.46315
distortion (k1,k2,p1,p2): 0.03096   -0.15837   0.00213   -0.00555
 
and you can use 15 for the face height. (The exact number mostly just sets the unit for your real-world measurements, but we asked you to use centimeters, and 15 is approximately the right value in cm.)

Submission

Use Gauchospace to submit all files (including code and a README.TXT describing your approach, any difficulties you had, and all external sources you used).