CS190N/CS290N -- Assignment 3: qwertyprints
Hunt and Peck
Recall from the episode entitled "Sacrifice" that Charlie uses the typing
cadence recorded for the passwords used to access the victim's secret
files to identify who the killer is. Specifically, from keystroke cadence, he
determines that the same person who accessed the files on the victim's
computer at home also accessed them at work, and this information (presumably)
leads him to the victim's graduate student. In retrospect, analysis of the
cadence was probably unnecessary.
Your assignment is to duplicate Charlie's cadence analysis (necessary or
otherwise). For each part of the assignment, you will be given a set of files
each containing a sample of the cadence generated by a different known
typist. You will also get a set of cadence files each containing a sample
from an unknown typist. Your job is twofold:
- design a hypothesis test, using a significance level
alpha = 0.05, for discerning one typist from another by testing
hypotheses of the form H0: known typist A = unknown
typist X. Ideally, this method would retain 95% of the true
null hypotheses and reject all the false ones.
- devise a method for determining the most likely typist of each unknown
sample from among the authors of the known samples. Notice that this
is a different and less restrictive exercise, but you should find a
method that provides fairly clear-cut answers.
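As a sketch of what such a test could look like, here is one hypothetical per-key-pair approach. It assumes the cadence data has already been parsed into dicts mapping key pairs to lists of timings (parsing is not shown), and it picks Welch's two-sample t-test purely for illustration; the choice of method is yours, and any well-justified test is acceptable.

```python
# Hypothetical per-pair test: for each key pair observed in both
# samples, test whether the two sets of timings could come from the
# same typist. Welch's t-test is one reasonable choice; the min_n
# cutoff is an illustrative parameter, not a prescribed value.
from scipy.stats import ttest_ind

def pairwise_pvalues(known, unknown, min_n=5):
    """known, unknown: {key_pair: [timings in ms]}.
    Returns {key_pair: p-value} for H0: same timing distribution,
    restricted to pairs with at least min_n observations on each side."""
    pvals = {}
    for pair in known.keys() & unknown.keys():
        a, b = known[pair], unknown[pair]
        if len(a) >= min_n and len(b) >= min_n:
            # equal_var=False gives Welch's test (unequal variances)
            pvals[pair] = ttest_ind(a, b, equal_var=False).pvalue
    return pvals
```

You would still need to decide how to combine the per-pair p-values into a single accept/reject decision at alpha = 0.05.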
You are free to use whatever method you think best, but you must provide a
short write-up describing your methodology and briefly justifying it.
Part 0
As a warm up, as advertised in class, the first part of the assignment is
to go through the hand calculations to determine the
maximum-likelihood estimator for the pair (mu, sigma) from a normal
population, based on a sample S = (x_1,
..., x_n). Please provide a short write-up
(handwritten is fine) showing your work.
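The derivation itself is your job, but as a reminder of the setup, the quantity you are maximizing is the likelihood of the sample under a normal model (usually easier to handle in log form):

```latex
% Likelihood of S = (x_1, ..., x_n) under N(\mu, \sigma^2):
L(\mu, \sigma \mid S) = \prod_{i=1}^{n}
    \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

% Log-likelihood, the usual starting point for the hand calculation:
\ell(\mu, \sigma) = -n \ln \sigma - \frac{n}{2}\ln(2\pi)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2
```

Setting the partial derivatives of the log-likelihood with respect to mu and sigma to zero yields the estimators.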
File Format
For the remainder of the assignment, you will be working with cadence data
contained in test files.
Cadence sample files have the following format. Each line of a cadence file
contains timing data for keystroke pairs (in milliseconds) separated by white
space. Characters at the beginning of a line up to the ":" indicate the
keystroke pair to which the timing data following the ":" pertains.
For example, a cadence file containing the following data
w a: 322.20 140.53 154.15 168.14 155.37 169.53 160.40 169.30
w e: 80.20
w h: 81.57
w i: 69.32 59.33 48.95
indicates that the keystroke pair "wa" was typed 8 times and that the 8
timings of "wa" were 322.20, 140.53, 154.15, 168.14, 155.37, 169.53,
160.40, and 169.30 milliseconds. You should not infer that the "wa"
pairs generating these timings occurred in the order that is shown when their
author generated them. Similarly, "we" occurred once, "wh" occurred once, and
"wi" occurred 3 times with the timings given. You should not assume
that lines within a cadence file are sorted in any particular order.
Letter-space and space-letter transitions are omitted, as are any transitions
involving a non-printing or control character.
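The format above can be read with a few lines of code. Below is a minimal sketch of a parser (the function name and return shape are my own choices, not part of the assignment) that honors the caveats above: it assumes nothing about line order, about which pairs appear, or about per-pair counts.

```python
# Sketch of a cadence-file parser. Each line looks like
#   "w a: 322.20 140.53 ..."
# where the characters before the ":" name the keystroke pair and the
# whitespace-separated numbers after it are timings in milliseconds.
from collections import defaultdict

def parse_cadence_file(path):
    """Return {key_pair: [timings in ms]} for one cadence file."""
    samples = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or ":" not in line:
                continue  # tolerate blank or malformed lines
            pair_part, _, timing_part = line.partition(":")
            pair = "".join(pair_part.split())  # "w a" -> "wa"
            samples[pair].extend(float(t) for t in timing_part.split())
    return dict(samples)
```

On the example file above, this would yield 8 timings for "wa", 1 each for "we" and "wh", and 3 for "wi".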
Part 1
In the tar file located HERE you will find
known samples and unknown samples generated by your previous enjoyment of
Martha Stewart's biography. Each known file (indicated by a file name not
beginning with a capital 'X') contains the cadence data for the first half of
the Martha bio as typed by a student in the class. Each "mystery" file
contains the second half of some student's Martha Stewart typing effort.
Thus, there is a one-to-one mapping between known files and mystery files.
You should devise and implement a hypothesis test that can determine whether
a known file and a mystery file were typed by the same typist with
significance alpha = 0.05. You should then use this test to match each
mystery file with its corresponding known file and thereby identify the
typist responsible for each mystery file.
You should also come up with a way to determine which known file is most
likely to be the first half of the text for which the mystery file is the
second half. Note that there are 29 sample pairs in the data set. At
alpha = 0.05, you should expect roughly one type 1 error (29 x 0.05 = 1.45)
even if your hypothesis test is working perfectly. To recover that one
falsely rejected identification (or to make the match when the hypothesis
test is inconclusive), you should develop some type of scoring method for
determining the most likely match for each known-file/mystery-file pairing.
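One hypothetical way to build such a score, assuming you already have a set of per-pair p-values for each candidate pairing, is Fisher's method for combining p-values. The function names here are illustrative, and this is only one of many defensible scoring schemes.

```python
# Fisher's method: combine p-values p_1..p_k into -2 * sum(ln p_i).
# A LARGER combined statistic means stronger evidence that the two
# files differ, so the best match is the candidate with the SMALLEST
# score. This is a sketch, not the required method.
import math

def fisher_score(pvalues):
    """Fisher's combined statistic over a list of p-values."""
    eps = 1e-300  # guard against log(0)
    return -2.0 * sum(math.log(max(p, eps)) for p in pvalues)

def best_match(pvals_by_known):
    """pvals_by_known: {known_file: [p-values vs. one mystery file]}.
    Return the known file whose combined score is smallest, i.e. the
    candidate least inconsistent with the mystery sample."""
    return min(pvals_by_known, key=lambda k: fisher_score(pvals_by_known[k]))
```

Fisher's method formally assumes the combined p-values are independent, which is only approximately true for key-pair timings; that caveat belongs in your write-up if you adopt anything like this.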
Part B
In this part of the assignment you will use your newly developed expertise in
qwertyprint analysis to solve the problem Charlie solved. In the tar file
located HERE you will find 10 cadence
files -- 5 with file names that end in ".train" and 5 with obscure names
beginning with "X" and ending in ".exp". Each file contains the cadence
measurements for someone (not enrolled in the class) repeatedly typing the
same password: penitentiary. However,
unlike in Part 1 of this assignment, there is not a one-to-one mapping between
known typists and unknown typists. That is, there are some ".train" files
that do not have ".exp" files generated by the same author, and vice versa.
No author of a ".train" file authored more than one ".exp" file, however, and
similarly, no author of a ".exp" file authored more than one ".train" file.
As in Part 1, your job is to conduct hypothesis tests
(with significance level alpha=0.05) and a likelihood ranking to determine
which authors of ".exp" files also authored a ".train" file by finding
matchings between the two.
What to Turn In
For Part 0, you should turn in your handiwork in exploring the likelihood
calculation by hand.
For Part 1, you should turn in the following:
- You should turn in a two-column text file showing the matchings between
known typists (represented by the known files) in column 1
and unknown typists (represented by the mystery files beginning
with "X") in column 2
that your hypothesis testing reveals. Each line of your text file should
correspond to a separate matching.
- You should turn in a similar two-column text file where the matching is
determined by a likelihood "score" rather than a hypothesis test.
- You should provide a short-but-clear written description of the
methodologies you used to generate these text files.
- You should turn in all of the code necessary for us to reproduce your
results, again, with the admonition that if the TA can't build and execute
your system easily, it will be considered an incorrect solution (even if
your matchings are correct).
- You should include a README file with any information you believe is
necessary for the TA to achieve this "ease-of-use" criterion.
For Part B, you should turn in the following:
- You should turn in a two-column text file showing the matchings between
known typists in column 1 and unknown typists
in column 2
that your hypothesis testing reveals. Each line of your text file should
correspond to a separate matching. Note that in this case, not all ".train"
files will have a matching ".exp" and vice versa.
- You should turn in a short write-up discussing, again, the methodologies
you used and also why you chose the matchings you did. In this case,
particularly, you may find yourself making intuitive decisions about the
"guilt" or "innocence" of your typists. You should provide written insight
into that intuition as well as into the thinking you used to derive your
solutions.
- You should turn in all of the code necessary for us to reproduce your
results, again, with the admonition that if the TA can't build and execute
your system easily, it will be considered an incorrect solution (even if
your matchings are correct).
- You should include a README file with any information you believe is
necessary for the TA to achieve this "ease-of-use" criterion.
For the Graduate Student
Develop a method for hypothesis testing and/or likelihood
scoring based on Bayesian inference
techniques as opposed to the pedestrian frequentist approach favored by your
instructors.
Hints from Your Uncle Norman and Aunt Heloise
Spend a little time designing the code you plan to use and implement it
carefully. Notice that it is easy to make assumptions about the contents of
the sample files that are untrue. For example, we split the Martha Stewart
biography roughly into two even parts with respect to the keystroke transition
count. That does not mean that the number of transitions for each pair
is evenly split. There might have been more "en" transitions in the first
half than in the second, or vice versa. Your implementation should be
prepared for this possibility.
Similarly, typing accuracy varies
from half to half and more obviously from typist to typist. You should not
assume that all files of any type contain exactly the same set of
keystroke transitions for each pair, or that the counts for the transitions
that are common to all files are the same. For example, one known file
might contain a
different number of "th" transitions than another and both may be different
from the actual number of "th" transitions that occurred in the text.
Your favorite Aunt and Uncle also discovered a few tidbits that you might
enjoy. The first is that it is often helpful to generate your own data
from known distributions to see how well your method works, and how sensitive
it is to perturbation of its underlying assumptions from known data. For
example, your Auntie Heloise cooked up a batch of cadence files from normal
distributions with known means and variances to see how well her hypothesis
testers and likelihood scorers worked when she knew what the answer should be
ahead of time.
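Auntie Heloise's trick might look something like the sketch below: fabricate a cadence file whose per-pair timings come from normal distributions with parameters you chose, then check that your tester gives the expected answer. The pairs and parameters shown are invented for illustration.

```python
# Write a synthetic cadence file in the assignment's format, drawing
# each pair's timings from a normal distribution with known parameters.
import random

def write_synthetic_cadence(path, params, seed=None):
    """params: {key_pair: (mean_ms, stddev_ms, count)}.
    Emits one "a b: t1 t2 ..." line per key pair."""
    rng = random.Random(seed)  # seed for reproducible experiments
    with open(path, "w") as f:
        for pair, (mu, sigma, n) in params.items():
            timings = " ".join(f"{rng.gauss(mu, sigma):.2f}"
                               for _ in range(n))
            f.write(f"{pair[0]} {pair[1]}: {timings}\n")
```

Two files generated from the same parameters should be accepted as the "same typist" by your test about 95% of the time at alpha = 0.05; files generated from well-separated means should be rejected essentially always.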
Perhaps the biggest lesson that came out of these investigations is that
unfortunately the size of the sample matters. You should experiment with
different minimal sample sizes when working through your solutions. Notice
that if you choose a large minimum sample size, you cut down on the number of
keystroke pairs that are eligible for comparison. Smaller sample sizes
imply more pairs, but each pair then gives your method less to
work with.