CS190N Numb3rs -- The Pilot

The situation:

Los Angeles is plagued by a serial rapist. He has struck 14 times in a seemingly random pattern around the city. We have little chance of predicting where he will strike next, but can we somehow use these locations to find the rapist?

Our math-hero Charlie draws the analogy between the locations of his strikes and the locations of water droplets from a sprinkler. Given the locations of all the droplets, we would have a very good idea where the sprinkler is located! Discussion Question: What assumptions are being made when we draw this analogy?

Now, how can we re-express this type of problem into something more mathematical, a setting in which we can use, well, numb3rs? I think you'll agree that a natural starting point would be to plot the strikes as a set of points in the xy-plane. We're looking for the "source" of the points, something like the "center." Is there a good way to measure the center of a set of points? Assuming that the points really are "randomly" distributed in some sense about the center, how closely will the center that we measure approximate the true center? Good questions!

Walk the line

Before we get to this problem, let's take a close look at how we would understand the problem in one dimension. In other words, let's imagine that all the crimes were committed along the same (straight) street, so that we could describe each point as a single number on the number line -- the x-axis, if you will. Now, what might we possibly mean by the "center" of a bunch of numbers? The most natural answer is "average," and in fact our approach here can be applied to more than just fighting crime -- why, it could even be used if you were interested in, say, the average height of all the male UCSB students, based on your measurement of 14 of them chosen at random.

How does this work? It relates to the so-called normal distribution, which is a way of describing how numbers are distributed. You are probably already familiar with the uniform distribution on [0,1], whether you know it or not: It is the distribution that functions like rand() in C are supposed to follow once their output is divided by RAND_MAX. The uniform distribution has the property that numbers near the center and away from the center all have an equal chance of being picked. Graphically, this distribution looks like a horizontal line of height 1, above the interval x=0 to x=1. (Note: This picture is really describing what histograms of samples from this population would look like if you had "perfectly representative" samples of large size, scaled so that the total area is equal to 1.)
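If you'd like to see that flat line appear for yourself, here is a minimal sketch in Python (our choice of language, not something the original page uses). It draws a large sample from a uniform random number generator, sorts it into bins, and scales the bar heights so the total area is 1. The function name `uniform_histogram`, the sample size, the number of bins, and the seed are all made up for illustration.

```python
import random

def uniform_histogram(n_samples=100_000, n_bins=10, seed=0):
    """Return bar heights of an area-1 histogram of uniform [0, 1) samples."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_samples):
        u = rng.random()                      # uniform on [0, 1)
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    bin_width = 1.0 / n_bins
    # Dividing by (n_samples * bin_width) scales the total area to 1.
    return [c / (n_samples * bin_width) for c in counts]

heights = uniform_histogram()
n_bins = len(heights)
for i, h in enumerate(heights):
    print(f"[{i / n_bins:.1f}, {(i + 1) / n_bins:.1f}): height {h:.3f}")
```

With a sample this large, every bar should come out close to height 1, which is the horizontal line described above. Smaller samples will look bumpier.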

In the real world, a lot of things are not distributed uniformly. Going back to the example of heights of male UCSB students, we would be much more likely to see values near, say 5'10" than near 7'1". The way the heights are distributed would be roughly symmetric, tall near the middle, and tapering off at the ends in the famous "bell-shaped" pattern: in other words, normally distributed populations have more measurements near the center and fewer far from the center. Normal distributions can be centered anywhere; the mean or population average, which we call μ for short, of a normal distribution is located right at the peak of the bell. Also, the "bell" can be tall and thin or shorter and wider; the "spread" of a normal distribution is measured by its standard deviation, σ, which is the distance from the mean to either inflection point on the "bell."
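The two claims about the shape of the bell (peak at μ, inflection points at μ ± σ) can be checked numerically. The sketch below uses Python's standard `statistics.NormalDist` class (Python 3.8 or later). The mean of 70 inches is an assumed value chosen only to match the 5'10" example; the helper `curvature` is a hypothetical name for a simple finite-difference estimate of the second derivative.

```python
from statistics import NormalDist

MU = 70.0      # assumed mean in inches (5'10"), for illustration only
SIGMA = 2.0    # assumed standard deviation in inches
bell = NormalDist(MU, SIGMA)

def curvature(x, h=1e-3):
    """Central-difference estimate of the second derivative of the PDF at x."""
    return (bell.pdf(x + h) - 2 * bell.pdf(x) + bell.pdf(x - h)) / h**2

# The curve is highest at the mean.
print(bell.pdf(MU) > bell.pdf(MU + 1) and bell.pdf(MU) > bell.pdf(MU - 1))

# Just inside mu + sigma the curve bends downward (negative curvature);
# just outside it bends upward. The sign change marks the inflection point.
print(curvature(MU + 0.9 * SIGMA) < 0)
print(curvature(MU + 1.1 * SIGMA) > 0)
```

All three lines should print True. Changing SIGMA moves the inflection points in or out, which is exactly the "tall and thin" versus "shorter and wider" trade-off.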

On the right-hand side of this page is a cute applet to graph a normal distribution having any mean and standard deviation (which they call A and B for some bizarre reason!!) that you choose. They also have a mathematical expression (labeled "PDF") defining the equation for a normal distribution with given μ = A and σ = B, but we won't worry a huge amount about that at the moment. A couple of things we do need to know about the normal distribution are that

OK, what does all this have to do with making the world safer for Truth, Justice, and The American Way of Life? Simmer down -- we'll get there.

For now, let's go back to our problem about the heights of male students. Suppose that someone has told you that their heights are normally distributed with standard deviation 2 inches. Then, as noted above, 95% of the population will be within 1.96 · 2 = 3.92 inches of the mean, whatever the mean is. Formally, we would say that if you choose a male student at random, the probability that his height is within 3.92 inches of the mean is 95%, or .95. But the really useful observation is that the tail also wags the dog: If the randomly chosen student's height is within 3.92 inches of the mean, then the mean is within 3.92 inches of that student's height! I know that last sentence looks obvious, but chew on it for a minute. Go ahead. I'll wait for you. OK, so even if you know nothing about the mean, you can get information about it by choosing a male student at random and measuring his height: You are 95% confident that the average is somewhere within 3.92 inches of that student's height. Suppose that your randomly chosen student is 5'9"; you are then 95% confident that the average height of all male students is between about 5'5" and about 6'1". It will be useful to keep in mind the difference between probability and confidence. Probability is about what's going to happen with a random event, while confidence is about a non-random thing, in this case the average height of all students, that is unknown.
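Here is a small Python sketch of the argument above. It first works out the interval for the 5'9" student, then simulates the "tail wags the dog" idea: pretend we secretly know the true mean, pick many students at random, and count how often the interval built around each student's height actually contains that mean. The true mean of 70 inches, the number of trials, the seed, and the function names are all assumptions made for the demonstration; only the 2 inch standard deviation and the 1.96 multiplier come from the text.

```python
import random

SIGMA = 2.0              # standard deviation from the text, in inches
MARGIN = 1.96 * SIGMA    # 3.92 inches

def interval_from_one_height(height):
    """95% confidence interval for the mean, built from a single height."""
    return (height - MARGIN, height + MARGIN)

def coverage(true_mean, trials=100_000, seed=1):
    """Fraction of single-student intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        student = rng.gauss(true_mean, SIGMA)   # one randomly chosen height
        low, high = interval_from_one_height(student)
        if low <= true_mean <= high:
            hits += 1
    return hits / trials

low, high = interval_from_one_height(69.0)      # the 5'9" student
print(f"{low:.2f} to {high:.2f} inches")        # prints "65.08 to 72.92 inches"
print(coverage(true_mean=70.0))                 # should land near 0.95
```

The first line matches the "about 5'5" to about 6'1"" range in the text. The second shows what the 95% refers to: the procedure catches the mean about 95 times out of 100, even though any particular interval either contains the fixed, unknown mean or does not.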


Sample Means have Distributions, Too

How do we gain more certainty about the mean height? Why, we collect more data, of course -- we increase our sample size. If you think about it for a minute, I think you'll agree that it makes sense that the bigger the sample, the closer its average is likely to be to the average of the whole population -- for example, it's perfectly possible to choose a student randomly who is 6'6", but it's practically impossible to choose 100 students at random and have their average height be 6'6"! The question is "How much more certainty do we gain?" Let's denote the mean of a sample by x̄, as opposed to the mean μ of the whole population. Now, let's imagine the process of taking a bunch of random samples with a specific number of students, say n, in each sample, and measuring their means. These are random numbers, just as the heights of randomly chosen male students are, but what are the characteristics of these sample means? More fun facts: