CS190N/CS290N -- Assignment 3: qwertyprints

John Brevik and Rich Wolski --- Winter, 2006


Hunt and Peck

Recall from the episode entitled "Sacrifice" that Charlie uses the typing cadence recorded for the passwords used to access the victim's secret files to identify the killer. Specifically, from keystroke cadence, he determines that the same person who accessed the files on the victim's home computer also accessed them at work, and this information (presumably) leads him to the victim's graduate student. In retrospect, the cadence analysis was probably unnecessary. Your assignment is to duplicate Charlie's cadence analysis (necessary or otherwise). For each part of the assignment, you will be given a set of files, each containing a sample of the cadence generated by a different known typist. You will also get a set of cadence files, each containing a sample from an unknown typist. Your job is twofold: to test whether a known sample and an unknown sample were produced by the same typist, and to determine the most likely known typist for each unknown sample. You are free to use whatever method you think best, but you must provide a short write-up describing your methodology and briefly justifying it.

Part 0

As a warm-up, as advertised in class, the first part of the assignment is to work through the hand calculations that determine the maximum-likelihood estimator for the pair (μ, σ) from a normal population, based on a sample S = (x1, ..., xn). Please provide a short write-up (handwritten is fine) showing your work.
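For reference, the quantity being maximized is the likelihood of the sample under the normal model; one standard way to set up the calculation (a starting point, not a substitute for showing your own work) is:

```latex
\mathcal{L}(\mu,\sigma \mid S)
  = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right),
\qquad
\ln \mathcal{L}(\mu,\sigma \mid S)
  = -\frac{n}{2}\ln(2\pi) - n\ln\sigma
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 .
```

Setting the partial derivatives of the log-likelihood with respect to μ and σ to zero yields the estimator pair you are asked to derive.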

File Format

For the remainder of the assignment, you will be working with cadence data contained in text files. Cadence sample files have the following format. Each line of a cadence file contains timing data for a keystroke pair (in milliseconds) separated by white space. The characters at the beginning of a line, up to the ":", indicate the keystroke pair to which the timing data following the ":" pertains.

For example, a cadence file containing the following data

w a: 322.20 140.53 154.15 168.14 155.37 169.53 160.40 169.30
w e: 80.20
w h: 81.57
w i: 69.32 59.33 48.95
indicates that the keystroke pair "wa" was typed 8 times and that the 8 timings of "wa" were 322.20, 140.53, 154.15, 168.14, 155.37, 169.53, 160.40, and 169.30 milliseconds. You should not infer that these timings are listed in the order in which their author generated them. Similarly, "we" occurred once, "wh" occurred once, and "wi" occurred 3 times with the timings given. You should not assume that lines within a cadence file are sorted in any particular order. Letter-space and space-letter transitions are omitted, as are any transitions involving a non-printing or control character.
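A parser for this format can be sketched as follows (the function name and the dictionary layout are our own choices, not part of the assignment):

```python
# Sketch of a cadence-file reader. Returns {pair: [timings in ms]},
# e.g. {"wa": [322.20, 140.53, ...], "we": [80.20], ...}.

def read_cadence(path):
    """Parse one cadence file into a dict mapping a keystroke pair to its timings."""
    samples = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            pair, _, rest = line.partition(":")
            key = "".join(pair.split())          # "w a" -> "wa"
            timings = [float(tok) for tok in rest.split()]
            # Defensive: do not assume a pair appears on only one line.
            samples.setdefault(key, []).extend(timings)
    return samples
```

Because lines may appear in any order, accumulating with `setdefault` rather than assigning keeps the reader correct even if a pair were split across lines.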

Part 1

In the tar file located HERE you will find known samples and unknown samples generated by your previous enjoyment of Martha Stewart's biography. Each known file (indicated by a file name not beginning with a capital 'X') contains the cadence data for the first half of the Martha bio as typed by a student in the class. Each "mystery" file contains the second half of some student's Martha Stewart typing effort. Thus, there is a one-to-one mapping between known files and mystery files.

You should devise and implement a hypothesis test that can determine whether a known file and a mystery file were typed by the same typist at significance level alpha = 0.05. You should then use this test to match each mystery file with its corresponding known file, thereby identifying the typist responsible for each mystery file.
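One possible building block for such a test, assuming per-pair timings are roughly normal and the per-pair sample counts are large enough for a large-sample z approximation (a simplification; with small counts a proper two-sample t-test would be more faithful):

```python
import math

def z_test_same_mean(xs, ys):
    """Two-sided p-value for H0: the two timing samples share a mean.
    Uses a large-sample z approximation (our simplifying assumption)."""
    nx, ny = len(xs), len(ys)
    if nx < 2 or ny < 2:
        return 1.0                      # too few observations to test
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    if se == 0:
        return 1.0
    z = (mx - my) / se
    # Two-sided p-value from the standard normal CDF, via erf.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

This tests a single keystroke pair; how you combine per-pair p-values across a whole file (fraction of rejections, Fisher's method, etc.) is part of the methodology you must design and justify.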

You should also come up with a way to determine which known file is most likely to be the first half of the text for which a given mystery file is the second half. Note that there are 29 sample pairs in the data set; at alpha = 0.05, one would expect at least one Type I error even if your hypothesis test is working perfectly. To recover that missed identification if the test is working (or to perform the matching if it is not), you should develop some type of scoring method for determining the most likely match for each known-file/mystery-file pairing.
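One candidate scoring method, sketched under the assumption that each pair's timings in the known file can be modeled as normal (the `min_count` threshold and per-observation normalization are our own tunable choices):

```python
import math

def log_likelihood_score(known, mystery, min_count=4):
    """Average log-likelihood of the mystery timings under per-pair normal
    models fit from the known file. Both arguments map pair -> [timings]."""
    total, used = 0.0, 0
    for pair, xs in known.items():
        ys = mystery.get(pair, [])
        if len(xs) < min_count or not ys:
            continue                    # skip pairs with too little training data
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        if var == 0:
            continue
        for y in ys:
            total += -0.5 * math.log(2 * math.pi * var) - (y - mu) ** 2 / (2 * var)
            used += 1
    # Normalize per observation so mystery files of different sizes compare fairly.
    return total / used if used else float("-inf")
```

The known file with the highest score for a given mystery file is its most likely match under this model.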

Part B

In this part of the assignment you will use your newly developed expertise in qwertyprint analysis to solve the problem Charlie solved. In the tar file located HERE you will find 10 cadence files -- 5 with file names that end in ".train" and 5 with obscure names beginning with "X" and ending in ".exp". Each file contains the cadence measurements for someone (not enrolled in the class) repeatedly typing the same password: penitentiary. However, unlike in Part 1 of this assignment, there is not a one-to-one mapping between known typists and unknown typists. That is, there are some ".train" files that do not have ".exp" files generated by the same author, and vice versa. However, no author of a ".train" file authored more than one ".exp" file, and similarly, no author of an ".exp" file authored more than one ".train" file.

As in Part 1, your job is to conduct hypothesis tests (with significance level alpha=0.05) and a likelihood ranking to determine which authors of ".exp" files also authored a ".train" file by finding matchings between the two.
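Because the mapping is not one-to-one, the matching logic must be able to report "no match." One way to organize it, sketched with hypothetical `score` and `rejects` callables standing in for whatever Part 1 machinery you build:

```python
def match_exp_to_train(exp_files, train_files, score, rejects):
    """For each .exp sample, pick the best-scoring .train sample, but report
    None when the hypothesis test rejects every candidate (the .exp author
    may have no .train file at all). 'score(e, t)' returns a likelihood-style
    score; 'rejects(e, t)' returns True when the test rejects same-typist."""
    matches = {}
    for e in exp_files:
        candidates = [t for t in train_files if not rejects(e, t)]
        matches[e] = max(candidates, key=lambda t: score(e, t)) if candidates else None
    return matches
```

A greedy per-file choice like this is only one design; since at most one match per author exists in either direction, you might instead score all pairings jointly and justify whichever assignment strategy you adopt.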

What to Turn In

For Part 0, you should turn in your handiwork in exploring the likelihood calculation by hand.

For Part 1, you should turn in the following:

For Part B, you should turn in the following:

For the Graduate Student

Develop a method for hypothesis testing and/or likelihood scoring based on Bayesian inference techniques as opposed to the pedestrian frequentist approach favored by your instructors.

Hints from Your Uncle Norman and Aunt Heloise

Spend a little time designing the code you plan to use and implement it carefully. Notice that it is easy to make assumptions about the contents of the sample files that are untrue. For example, we split the Martha Stewart biography roughly into two even parts with respect to the keystroke transition count. That does not mean that the number of transitions for each pair is evenly split. There might have been more "en" transitions in the first half than in the second, or vice versa. Your implementation should be prepared for this possibility.

Similarly, typing accuracy varies from half to half and more obviously from typist to typist. You should not assume that all files of any type contain exactly the same set of keystroke transitions for each pair, or that the counts for the transitions that are common to all files are the same. For example, one known file might contain a different number of "th" transitions than another and both may be different from the actual number of "th" transitions that occurred in the text.
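A defensive way to honor both warnings is to compare only the transitions the two files actually share, subject to a minimum-count threshold (the threshold value is a tunable choice of ours, not the assignment's):

```python
def comparable_pairs(a, b, min_count=3):
    """Keystroke pairs present in both cadence dicts with at least min_count
    timings on each side; everything else is ignored rather than assumed."""
    return [k for k in a
            if k in b and len(a[k]) >= min_count and len(b[k]) >= min_count]
```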

Your favorite Aunt and Uncle also discovered a few tidbits that you might enjoy. The first is that it is often helpful to generate your own data from known distributions to see how well your method works, and how sensitive it is to perturbations of its underlying assumptions. For example, your Auntie Heloise cooked up a batch of cadence files from normal distributions with known means and variances so that we could see how well our hypothesis testers and likelihood scorers worked when we knew ahead of time what the answer should be.
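Auntie Heloise's trick can be sketched as follows (the function name and the (mean, stddev) model format are hypothetical; seed the generator so your experiments are repeatable):

```python
import random

def synth_cadence(models, counts, seed=0):
    """Generate a synthetic cadence sample from known per-pair normal models.
    'models' maps pair -> (mean, stddev) in ms; 'counts' maps pair -> how many
    timings to draw. The returned dict matches the parsed-file layout."""
    rng = random.Random(seed)           # fixed seed => reproducible experiments
    return {pair: [rng.gauss(mu, sd) for _ in range(counts[pair])]
            for pair, (mu, sd) in models.items()}
```

Feeding two samples drawn from identical models into your tester should (usually) fail to reject, and samples from shifted models should reject; varying the shift tells you how sensitive your method is.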

Perhaps the biggest lesson that came out of these investigations is that, unfortunately, the size of the sample matters. You should experiment with different minimum sample sizes when working through your solutions. Notice that if you choose a large minimum sample size, you cut down on the number of keystroke pairs that are eligible for comparison; smaller sample sizes imply more eligible pairs, but each pair then gives your method less to work with.