Note that there are two distinct issues here. The first is simply to hide the one picture within the other picture in a way that makes it difficult to detect; the second is to hide it in such a way that it is difficult to extract, even if you know something is there. We will take on both of these challenges, although ultimately, paralleling the development in the current Numb3rs episode, we are less concerned with unobtrusiveness (remember Charlie's "persistent artifacts"!) than we are with security. In fact, we'll present a general method for hiding files using other files, wherein it's not only essentially impossible for an intruder to read them, it's essentially impossible for an intruder to know they even exist. This method was developed originally to accommodate "plausible deniability"; to lift text directly from the original paper,
The concept of operations is that the user of such a file system could, if placed under compulsion, reveal (say) the three passwords used to protect the directories with his email archive, his tax records and his love letters, but keep quiet about the directory containing his trade secrets. The opponent would have no means of proving that such a directory exists.
First of all, suppose for simplicity that all the files in this discussion are exactly the same size, say n bits, and suppose further that there is one file in each hierarchy level. Call the files we wish to hide F1, ..., Fm. We are going to create a larger number of cover files, say C1, ..., Ck, all of which look random, but for which each of the hidden files is the exclusive OR of some subset of the Ci. Exactly which of the Ci are involved for a given file Fj is determined by that file's password Pj, which needs to be k bits in length. Specifically, the file Ci is to be used for Fj if the ith bit position in Pj is equal to 1.
For my entertainment dollar, the cleanest way to express the construction of the filesystem is using the language of linear algebra rather than the relatively clumsy language of Boolean functions. In order to do this, we need to take a step backward and see how far we can abstract the usual theory of linear algebra. In other words, we need to do us a little abstract algebra.
u + v and au are in V                            (closure)
u + v = v + u                                    (commutative law)
u + (v + w) = (u + v) + w,   a(bu) = (ab)u       (associative laws)
a(u + v) = au + av,   (a + b)u = au + bu         (distributive laws)
One of the primary uses of linear algebra is to provide a framework for the theory of solving systems of linear equations in several unknowns. If you think about it for a sec, then you'll see that even to solve an arbitrary linear equation in a single unknown, you get into trouble if you're not able to perform division with your scalars. For example, 2x + 3 = 8 is a very nice linear equation over ℤ, but it has no solution in ℤ, because we can't divide 5 by 2. In other words, for us to be able to solve linear equations, each nonzero scalar needs to have a multiplicative inverse. So to answer the question posed in the above paragraph, the real numbers are superior to the integers as a set of scalars in linear algebra because of this multiplicative-inverse business. Aside from that, there's nothing particularly special about ℝ. For example, the set of rational numbers ℚ also has the multiplicative-inverse property and therefore makes a perfectly decent set of scalars; in a very concrete way, the complex numbers ℂ are even better than ℝ.
In general, a set for which there are defined sum and product operations with respective identities, usually called 0 and 1, which obey the usual commutative, associative, and distributive laws and for which every element has an additive inverse is called a ring; the canonical example of a ring is the set ℤ. A ring for which every nonzero element has a multiplicative inverse is called a field. Fields are the appropriate structures for doing linear algebra, as observed above. I also note here that Gaussian elimination, the most straightforward process by which to solve systems of linear equations, is possible over any field at all.
So, just how many different fields are there in the world, anyway? It probably will not surprise you that the answer is "infinitely many," although after you have named the reals, the rationals, and the complexes, you might have a hard time thinking of any others. No prob -- we can expand your field repertoire in a jif!
Recall the mod operator from somewhere in your education; let's work mod 6 for the time being. A computer scientist would view "mod 6" as a procedure, along the lines of "subtract 6 from your doohickey until you end up with something less than 6, and spit out the result." One says things like "23 mod 6 = 5," and the statement "23 mod 6 = 11" is fairly universally dismissed as false.
To a mathematician, on the other hand, "mod 6" is a whole way of life. To use the phrase "mod 6" in mathematics means that you are doing arithmetic within the set ℤ⁄6, which is the set you get from the integers when you set 6 equal to 0. You write 23 ≡ 5 (mod 6), or, less artificially, just 23 = 5 (mod 6), but equally well 23 = 11 (mod 6). There's nothing privileged about the number 5; it's just a convenient representative for the set of all of the (equal) numbers ..., -7, -1, 5, 11, 17, 23, ... mod 6.
I don't want to spend too much time here; note that for any n, ℤ⁄n is a ring, because it inherits this structure from the integers. ℤ⁄6 is not a field, however. For example, you can't divide 3 by 2, since any multiple of 2 is either 0, 2, or 4. On the other hand, ℤ⁄2 is a field! This is extremely easy to verify, since there is only one nonzero element, namely 1, and it obviously has a multiplicative inverse, namely itself. When we're thinking of ℤ⁄2 as a field, we often write it as F2, the (unique!) field of two elements.
Since 1 + 1 = 0 in F2, addition in F2 is the same as the "exclusive or" (XOR) operation you're used to from whatever class you study things like "exclusive or." The reason I made the detour into algebra-land, when we already had one piece of notation to express the problem, is that it's really nice to be able to express this operation algebraically. The XOR of two bit strings then becomes their vector sum, for example; thus, we have the language of matrix multiplication available to us in order to solve Boolean equations with XORs in them, and we have the usual theory from linear algebra to guarantee the existence of solutions of such equations.
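To make the identification concrete, here's a tiny sketch in C of vector addition over F2. The function name and the sizes are mine, purely for illustration:

```c
#include <assert.h>

/* Over F2, 1 + 1 = 0, so componentwise addition mod 2 of two bit
   vectors is exactly their bitwise XOR. */
void f2_add(const unsigned char *u, const unsigned char *v,
            unsigned char *sum, int n)
{
    int i;
    for (i = 0; i < n; i++)
        sum[i] = u[i] ^ v[i];   /* (u[i] + v[i]) mod 2 */
}
```

In particular, adding any vector to itself gives the zero vector, which is the property the whole scheme leans on below.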
If we stack the cover files C1, ..., Ck as the rows of a k × n matrix C, then the defining requirement on each password Pi (viewed as a row vector of k bits) becomes simply

    Pi · C = Fi.
So far we've done a good job of wishing, but how do we actually go about encoding our files? Well, let's start with any C1, ..., Ck that we like, and suppose we have a file F and a password P for it. Then we calculate the product P · C, which of course is almost certainly not equal to our desired F. But here's what we're going to do: Choose any one of the files, say C*, that actually did contribute to the sum, and replace it with the row C′ = C* + P · C + F, yielding an updated matrix C′. By the way we chose C* in the first place, the row C′ does contribute to the product P · C′, and in fact we can calculate the entire product in terms of the original product P · C. Namely, it's equal to the sum of all the Cj that were there before, except with C′ in place of C*. The first summand in C′ is the old C*, so putting that with the rest of the product gives back P · C; then you get the rest of C′, namely P · C + F. In toto, then, P · C′ = P · C + P · C + F, and since anything plus itself is 0 over F2, this last thingy is equal to F, which is just what we were after!
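Here is that single-file construction end to end as a small C sketch. The sizes, the function names, and the choice of the first contributing cover as C* are all illustrative assumptions of mine, not part of the original scheme's specification:

```c
#include <assert.h>
#include <stdlib.h>

#define K 8    /* number of cover files (illustrative) */
#define N 16   /* bits per file (illustrative)         */

/* P . C over F2: XOR together the covers selected by the password bits. */
static void extract_file(unsigned char cover[K][N], const unsigned char pw[K],
                         unsigned char out[N])
{
    int i, j;
    for (j = 0; j < N; j++)
        out[j] = 0;
    for (i = 0; i < K; i++)
        if (pw[i])
            for (j = 0; j < N; j++)
                out[j] ^= cover[i][j];
}

/* Hide F: pick a cover C* that contributes to P . C and replace it with
   C* + (P . C) + F, so that the new product P . C collapses to F. */
static void hide_file(unsigned char cover[K][N], const unsigned char pw[K],
                      const unsigned char f[N])
{
    unsigned char s[N];
    int i, j;
    extract_file(cover, pw, s);        /* s = old P . C */
    for (i = 0; i < K; i++)
        if (pw[i]) {                   /* C* = first contributing cover */
            for (j = 0; j < N; j++)
                cover[i][j] ^= s[j] ^ f[j];
            return;
        }
}
```

After hide_file runs, extracting with the same password returns F exactly, while the covers themselves still look random.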
The next step is to understand how to encode multiple files. In order to do this, we need to do a little more linear algebra. Recall that if u and v are row vectors of the same length, their scalar product is just the product u vT, where T stands for transpose. The scalar product of two vectors is, of course, a scalar. Two vectors are orthogonal if their scalar product is 0. I should warn you that orthogonality is a little weird working over F2, because a vector with an even number of 1s is orthogonal to itself! Finally, a set of vectors {ui} is called orthonormal if the scalar product of each ui with itself is 1 and each ui is orthogonal to all the other uj.
Now, to pull off the trick with multiple files, we're going to start by making sure that our passwords form an orthonormal set. If we have m files, and therefore m passwords, we can form the passwords into an m × k matrix, which I'll call P for the password matrix. Note that PPT = Im, the m × m identity matrix.
So suppose the ith file Fi is currently equal to Pi C, and we want to modify Fi to Fi + D. The goal is somehow to modify the cover files Cj so that we can both extract the new Fi + D using Pi and also still extract all the other files with their passwords. Let's modify our matrix C to C + PiT D. (Reality check: PiT is k × 1, and D is 1 × n, so the product is k × n, so it makes sense to add it to C.) Now,

    Pi (C + PiT D) = Pi C + Pi (PiT D).

The first term in the sum is the old Fi. Since matrix multiplication is associative, the second term is equal to (Pi PiT) D. Now, watch this! We chose the passwords to be orthonormal, so Pi PiT = 1. Therefore the above calculation for Pi (C + PiT D) reduces to Fi + 1·D = Fi + D, just as desired.

The other thing that we need to check is that if we choose any other file Fr, r ≠ i, the product Pr (C + PiT D) is equal to the product Pr C, so we still get back our unmodified file Fr. The calculation looks pretty much the same:

    Pr (C + PiT D) = Pr C + Pr (PiT D) = Fr + (Pr PiT) D.

Orthonormality comes to the rescue again, telling us that Pr PiT = 0, and therefore Pr (C + PiT D) = Pr C = Fr.
There are undoubtedly a bunch of good reasons why you should appreciate the full beauty and splendor of PNG by reading its most recent full specification, but offhand, I can't think of what those reasons are. We'll try to cover the basics and then get down to brass tacks instead.
First, from a high level, you can think of a PNG as a kind of image communication protocol. A PNG file is a little like a network packet. It has a header that identifies it as containing data that should be interpreted as a PNG image and a series of sections -- called "chunks" -- that describe how the image is represented. Chunks carry different kinds of data and "metadata" that can then be composed into an image. Each chunk has a length, a chunk header, potentially some data if the length is not zero, and a CRC for the chunk, occurring in that order. Unlike a network packet, however, chunks are variable length and can come in (almost) any order within the file. Thus a PNG file is meant to be interpreted as an identifying header followed by a "stream" of chunks that can be parsed in the order they arrive to construct an image.
So why have yet another specification? I mean, TIFF, JPEG, GIF? Seems like the last thing we should be spending our time on is another specification for images. I'm not sure what to say about that philosophically, but it is clear that PNG is designed to fill in some "gaps" that its designers perceived these other standards to have. PNG uses lossless compression, which makes it suitable as a format for image editing (as opposed to JPEG). It also supports "alpha channels" (these are not TV channels with specific significance levels) so that pixel transparency can be represented, and "gamma correction" that encodes brightness information so that differences in local graphics hardware can be removed.
There are other issues which you can read about for yourself, but the bottom line is that PNG is fairly general and lossless, which makes it an excellent choice for us to use in this class.
There appear to be three size choices for pixel representation: 8 bits per pixel, 24 bits per pixel, and 48 bits per pixel. The 8-bit version encodes color differently than the 24-bit and 48-bit versions. In an 8-bit image, the pixel value is used as an index into a color palette (which comes with the PNG file as its own chunk) to indicate what color the pixel should be. Clearly, pixels are thus limited to 256 different colors.
The more common representation (called "truecolor") uses 3 bytes (24 bits in total) to represent the red, green, and blue intensities, one per byte. Thus a pixel can take on one of 256^3 color values, where the color is an R-G-B composition of three 8-bit intensities. We'll stick with truecolor images in this class.
For completeness, the 24-bit representation has a double-precision big sibling in which each R-G-B triplet is represented by three 16-bit values. According to the FAQ, these images are used primarily by scientists and the entertainment industry (sounds like this class to me), and as a result, viewers and editors of images at this precision are hard to come by.
Both the 24-bit and 48-bit representations can also come with an alpha opacity value -- a byte in the 24-bit case, or a 16-bit value in the 48-bit case -- or not. Thus your options for color are 24-bit RGB, 32-bit RGBA, 48-bit RGB, and 64-bit RGBA.
So that's the deal with PNG. Armed with this information and a few other details, such as the understanding that bytes are stored in network-byte-order format, you, a sufficient quantity of computer science machismo, and a couple of pots of coffee should be able to whip up a PNG image parser in no time (well, maybe a little time). However, as promised, this class will not require you to get in touch with your inner hacker quite so intimately. Instead, I urge you to take full advantage of the lovely work your friends in the land of Open Source have deigned to perform.
In particular, there is a PNG manipulation package called libpng that comes pretty much as standard equipment on any Linux system, including those available in the CSIL and GSL.
Allow me to briefly apologize to those of you who are Java programmers. I expected that since the primary use for PNG is to encode images for the web, there would most certainly be a Java package, since we all know that Java is the web programming language of choice. I'll leave it to you to decide whether it is instructive to realize that the gold standard for web image processing is written in C (not even C++) and has a C interface by default. Now back to our regularly scheduled program.
The libpng library is fairly high quality and portable enough to work out of the box on both Linux and OSX. It provides a fairly straightforward interface to implementations of the various PNG parsing and manipulation functions that are so painstakingly discussed in the specification. If you read the documentation for libpng and you like what you see, feel free to use its interface directly.
However.
The maintainer of the PNG home page, Greg Roelofs, has written a set of simple wrappers for libpng functions that allow you to read and write PNG images without getting too caught up with the specification's semantics. With a small set of modifications that I needed to make my code work, I've placed this code HERE and also in this tarfile. Using Greg's interface, I've written a sample program that reads and writes PNG images as an example that should prove useful as you develop your code. We'll spend a little time understanding this code in preparation for your assignment.
The first thing to understand is that Greg's interface expects the PNG file to be opened "rb" -- read binary -- when reading.
    fd = fopen(Fname, "rb");
    if (fd == NULL) {
        fprintf(stderr, "couldn't open file %s\n", Fname);
        exit(1);
    }

No big deal. The next step is to read up the header information, primarily so that you can allocate the space needed to hold the image in memory. His code is clever enough to allow all sorts of options, including a line-at-a-time option, but we'll skip those fancy ideas for now.
    rc = readpng_init(fd, &width, &height);
    if (rc != 0) {
        fprintf(stderr, "readpng_init error %d\n", rc);
        exit(1);
    }

The variable rc is an error code, and width and height are out parameters indicating the dimensions of the image in pixels. Greg's code uses an internal static buffer to hold the header information, so you thread heads need to be careful here.
To read up the image bytes, the next step is
    image = readpng_get_image(display_exp, &channels, &rowbytes);
    if (image == NULL) {
        fprintf(stderr, "readpng_get_image fails\n");
        fclose(fd);
        readpng_cleanup(FALSE);
        exit(1);
    }

Here, Greg gives you the number of bytes in a row, which you could also calculate as width * channels in our case. These values come through the libpng interface, however, which makes me think that if there is fancy business going on, the rowbytes computation is not that simple.
The fact that rowbytes comes back also gives you a clue as to how libpng stores the image. As you might expect from a C programmer, image is a pointer to a one-dimensional array that holds a two-dimensional matrix of values stored in row major order (sorry FORTRAN fans).
And there you go. Your image is in memory in row major order, and the fun can now begin.
The first bit of fun is to print out the R-G-B and alpha values just so you can get your eyeballs around what they look like. Your Aunt Heloise found this feature to be particularly useful in developing her recipe for steganography. Perhaps you will too.
First -- a little typing (pun obviously intended with regard to your previous assignment). The pointer variable image points to an array of unsigned characters. We'll just make red, green, blue, and alpha unsigned char as well. Next, if we don't have an output file name specified, print the values to stdout and quit.
    if (Oname[0] == 0) {
        for (i = 0; i < height; i++) {
            curr = &(image[i*rowbytes+0]);
            for (j = 0; j < width; j++) {
                red = curr[0];
                green = curr[1];
                blue = curr[2];
                if (channels == 4) {
                    alpha = curr[3];
                    curr += 4;
                    printf("r: %d, g: %d, b: %d, alpha: %d\n",
                           red, green, blue, alpha);
                } else {
                    printf("r: %d, g: %d, b: %d\n", red, green, blue);
                    curr += 3;
                }
            }
        }
        fclose(fd);
        readpng_cleanup(FALSE);
        exit(0);
    }

What could be simpler? Writing out your own PNG? Let's see...
Greg's write interface is a little bit more clunky than his read interface for our problem. In Greg's defense, I've lifted the interface calls from two different codes he's written: one to display a PNG in an X window and one to write a PNG from a PNM file. In the latter, he uses a large structure to pass the values to his interface, and for reasons that are not entirely clear to me he defines it as a global. So be it. Near the top of the file, we find
    mainprog_info Wpng_info;   /* global in Greg's code */

    int main(int argc, char *argv[])
    {
        int i, j;
        . . .

and later
    Wpng_info.infile = NULL;
    Wpng_info.outfile = NULL;
    Wpng_info.image_data = NULL;
    Wpng_info.row_pointers = NULL;
    Wpng_info.filter = FALSE;
    Wpng_info.interlaced = FALSE;
    Wpng_info.have_bg = FALSE;
    Wpng_info.have_time = FALSE;
    Wpng_info.have_text = 0;
    Wpng_info.gamma = 0.0;

Notice all of the goodies one gets to consider with respect to writing one's own PNGs. In this class, we'll keep it pretty simple. These are the initial default values that Greg uses. We'll modify them a bit later.
First, though, we open the file "wb" and tell Greg's code that this is the file we want to use for output.
    od = fopen(Oname, "wb");
    if (od == NULL) {
        fprintf(stderr, "couldn't open output file %s\n", Oname);
        fclose(fd);
        readpng_cleanup(FALSE);
        exit(1);
    }
    Wpng_info.outfile = od;

Now here we have an artifact from whence Greg comes. The write code assumes the file you are writing is in PNM format, which uses the same representation as PNG's image data format. The identifier numbers, though, look like 2 * channels for some reason. Thus, if you want to write out a 4-channel PNG, you need to set the pnmtype field to 8 in the output structure. That you can set it, though, is kind of cool. It means, for example, that you can add an alpha channel to your favorite non-alpha image. Similarly, you can remove an alpha channel from a 4-channel image if so desired. The PNMtype variable in my code is set via a command-line argument to indicate whether you want a 4-channel or 3-channel PNG as an output file. In the case where you are adding a channel, the opacity is set to 255 (fully opaque) in this code, but you could change that.
    if ((PNMtype == 8) && (channels == 3)) {
        printf("converting 3 channel to 4 channel\n");
        new_image = (unsigned char *)malloc(height*width*4);
        new_rowbytes = width*4;
        if (new_image == NULL) {
            exit(1);
        }
        memset(new_image, 0, height*width*4);
        for (i = 0; i < height; i++) {
            curr = &(image[i*rowbytes+0]);
            new_curr = &(new_image[i*new_rowbytes+0]);
            for (j = 0; j < width; j++) {
                new_curr[0] = curr[0];
                new_curr[1] = curr[1];
                new_curr[2] = curr[2];
                new_curr[3] = 255;   /* fully opaque */
                curr += 3;
                new_curr += 4;
            }
        }
        free(image);
        image = new_image;
        rowbytes = new_rowbytes;
    } else if ((PNMtype == 6) && (channels == 4)) {
        printf("converting 4 channel image to 3\n");
        /* same idea in reverse: copy R, G, B and drop the alpha byte */
        new_image = (unsigned char *)malloc(height*width*3);
        new_rowbytes = width*3;
        if (new_image == NULL) {
            exit(1);
        }
        for (i = 0; i < height; i++) {
            curr = &(image[i*rowbytes+0]);
            new_curr = &(new_image[i*new_rowbytes+0]);
            for (j = 0; j < width; j++) {
                new_curr[0] = curr[0];
                new_curr[1] = curr[1];
                new_curr[2] = curr[2];
                curr += 4;
                new_curr += 3;
            }
        }
        free(image);
        image = new_image;
        rowbytes = new_rowbytes;
    }

And then we tell Greg's code what kind of image we want
    Wpng_info.pnmtype = PNMtype;
    Wpng_info.height = height;
    Wpng_info.width = width;
    Wpng_info.sample_depth = 8;
    wc = writepng_init(&Wpng_info);
    if (wc != 0) {
        fprintf(stderr, "writepng_init: error %d\n", wc);
        fclose(fd);
        fclose(od);
        readpng_cleanup(FALSE);
        exit(1);
    }

And then we tell Greg's code where to find the data and to write it out
    Wpng_info.image_data = image;
    Wpng_info.row_pointers =
        (unsigned char **)malloc(height * sizeof(unsigned char *));
    if (Wpng_info.row_pointers == NULL) {
        fclose(fd);
        fclose(od);
        writepng_cleanup(&Wpng_info);
        exit(1);
    }
    for (i = 0; i < height; i++) {
        Wpng_info.row_pointers[i] = &(image[i*rowbytes+0]);
    }
    wc = writepng_encode_image(&Wpng_info);
    if (wc != 0) {
        fprintf(stderr, "writepng_encode_image error %d\n", wc);
        fclose(fd);
        fclose(od);
        writepng_cleanup(&Wpng_info);
        exit(1);
    }

There are a couple of subtleties here, but not many. First, his interface expects a 1-D array of row pointers. I believe the reason for this interface decision has to do with interlaced files, which can be generated a row at a time. No matter.
And finally, we clean up.
    free(Wpng_info.row_pointers);
    free(image);
    writepng_cleanup(&Wpng_info);
    exit(0);

Piece of cake, no? This interface, in all of its simplicity, is quite powerful. Testing it a bit, it understands all of the 24-bit images I've tried with it, including those that are heavily compressed. Certainly for the class assignment you won't need anything more sophisticated.
Recall that the steganography algorithm we have discussed assumes that the original set of cover files contains random data. That "random" data, however, could be somewhat less than random and the algorithm will still work. Indeed, if, say, each cover file contained an equal portion of the bits in a PNG image, that would be fine. Moreover, if we had a way to disseminate the bits from a PNG into a set of covers, perform the magic dance of the F2, and then reform the image, we'd just about have it. As long as we could choose a password carefully, we'd be able to
- read in a PNG image for the hiding file,
- read in the PNG image of the file we wish to hide,
- form a set of cover files by distributing the bits from the hiding PNG,
- perform the steganographic encryption using the covers and the hidden file, and
- reform the image from the resulting set of covers.

There are a few details here that might bear discussion. The first concerns how the bits are to be disseminated from the hiding image to the cover files and back again. You could tessellate the hiding image with cover files, but then one "tile" would be scrambled while the others would remain unaltered.
A better way -- and the one we plan to pursue here -- is (assuming we have k cover files in all) to put every kth bit in cover file i, starting with bit i of the file.
For example, let's say that there are four cover files and they completely cover the image. We can put bits 0, 4, 8, 12, ... in cover file 0 (this is a computer science class, after all), bits 1, 5, 9, 13, ... in cover file 1, etc.
Recall that to pull the steganography trick off, your cover files all have to be the same size, and the same size as the file you are trying to hide. Since the division is not likely to come out even, you should assume that your cover files will be padded with zeros whenever necessary.
When would that be necessary? Clearly, for this method to work, the size of each cover file has to be the smallest integer at least as large as the size of the hidden file that evenly divides the (padded) size of the hiding file. In short, for this assignment, the sizes matter.
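Here's a minimal C sketch of the dissemination and its inverse, under the deal-out rule above: bit t of the hiding file lands in cover t mod k at position t/k, and the covers are zero-padded past the end. All the sizes and names are illustrative:

```c
#include <assert.h>

#define KCOV 4                               /* number of covers (illustrative) */
#define NBITS 10                             /* hiding-file length in bits      */
#define COVLEN ((NBITS + KCOV - 1) / KCOV)   /* cover length, rounded up        */

/* Deal bit t of the hiding file into cover (t mod KCOV), slot (t / KCOV). */
static void disseminate(const unsigned char *bits,
                        unsigned char cover[KCOV][COVLEN])
{
    int i, j, t;
    for (i = 0; i < KCOV; i++)
        for (j = 0; j < COVLEN; j++)
            cover[i][j] = 0;                 /* zero padding for the tail */
    for (t = 0; t < NBITS; t++)
        cover[t % KCOV][t / KCOV] = bits[t];
}

/* The inverse: collect the bits back out of the covers in order. */
static void reform(unsigned char cover[KCOV][COVLEN], unsigned char *bits)
{
    int t;
    for (t = 0; t < NBITS; t++)
        bits[t] = cover[t % KCOV][t / KCOV];
}
```

Disseminate, scramble the covers with the F2 machinery, and reform; as long as reform is the exact inverse of disseminate, the round trip is lossless.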
Speaking of passwords, the next issue that we need to get clear is how we'll be determining a password. Your favorite aunt and uncle have given this some thought, and the answer is that we will use a linear congruential generator (see below), seeded with the product, taken mod some maximum value, of the bytes that make up the ASCII string containing the password. Thus the algorithm for generating a k-bit password might be represented as
    acc = 1.0;
    for (i = 0; i < strlen(s); i++) {
        v = (double)s[i];
        acc = acc * v;
    }
    v = fmod(acc, (double)mod);
    seed = (unsigned int)v;
    SEED(seed);
    for (k = 0; k < number_of_cover_files; k++) {
        if (RAND() > 0.5)
            cover_file_password[k] = 1;
        else
            cover_file_password[k] = 0;
    }

where mod defines the maximum possible value + 1 for the seed.
The only wrinkle left is to run the steganographic algorithm using this password and then move the bits back out of the covers into their proper places; then you have a password-protected hidden image (albeit not one that is too well protected, as we shall see).
At the same time, it's pretty clear that random numbers are useful. In this very class, we do the Monte Carlo thing enough that rolling a fair die or pulling black and white balls out of an urn would be rather onerous. (Sorry, I've always wanted to use the words "urn" and "onerous" in the same sentence.) Heck, your favorite programming language probably has a "random number generator" built into it, in spite of all those naysayers and their chestnuts.
So where does that leave us? What the random number generators really do is generate pseudo-random numbers, which pretty much means "stuff computers spit out because they can't do really random numbers." The pseudo-random numbers in question are almost invariably designed to imitate random numbers drawn from U(0, 1).

So how "random" is pseudo-random? It's often expressed that you want your numbers to "be random enough" for your purposes, but to me a better way to express it is that you want them to look random enough -- but the whole idea of randomness is very slippery. It's not as though you can look at sequences of numbers and definitively identify them as "random" or "not random"; randomness is more about the process that generates the values, and I don't want to get into a philosophical discussion about whether truly random processes even exist! (We also typically want them to look independent enough, which is a totally different issue, but nobody seems to mention that little fact. The world is a perplexing place.)

And the question of how random the numbers need to look is also a point of consideration: For example, if you're doing Monte Carlo simulations for some dingbat class based on some TV show, you probably don't have to be too fussy; on the other hand, if you're basing a cryptographic scheme for sending sensitive government secret messages on your pseudo-random number generator, your standards should be pretty high. Fortunately, there are various tests that one can use on a sequence of pseudo-random numbers that can at least identify glaring differences between its statistical properties and those of a (hypothetical) sequence coming from a "true" random process.
The most straightforward approach, and one that happens to work decently if you do things right, is the linear congruential generator, or LCG. (Not to be mistaken for the Lira Collectors Guild, which is the Italian version of the PNG.) In this case, the acronym is beautifully descriptive: We generate integers, mod some modulus m, by starting with a seed X0 and using the (linear) iterative rule

    Xi+1 = a Xi + c   (mod m)
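In code, one step of the iteration is a one-liner; the function name here is mine, purely for illustration:

```c
#include <assert.h>

/* One LCG step: X_{i+1} = (a * X_i + c) mod m. */
static unsigned long lcg_next(unsigned long x, unsigned long a,
                              unsigned long c, unsigned long m)
{
    return (a * x + c) % m;
}
```

With m = 7, a = 2, c = 3, and X0 = 5, successive calls produce 6, 1, 5, 6, 1, ...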
The question of "how good" a random number generator is can mean at least two different things, depending on which branch of the government you're working for. The first interpretation of this question is "Does the sequence of numbers that my generator produces have enough properties in common with a 'truly' random sequence that it's possible to use the random numbers for Monte Carlo simulations?" In this space, some LCGs are pretty darn good, while others are not; by using a few more or less straightforward tests, which I'll go into a little bit later, we can fairly reliably separate out the bad from the good.
The second way to interpret the question is "Does the sequence of numbers that my generator produces provide enough unpredictability so that I can be sure that an interloper can't figure out the next number it will generate?" For these purposes, LCGs are downright lousy, as we will show.
This flaw in the numbers generated by LCGs can be exploited fairly directly using our good friend Linear Algebra. The fact is that if you have four consecutive integer outputs from an LCG, say x, y, z, w, then the matrix

    | x  y  1 |
    | y  z  1 |
    | z  w  1 |

has determinant equal to 0 mod m. Why? Because y = ax + c, z = ay + c, and w = az + c (mod m), so the second column equals a times the first column plus c times the third column, mod m. In other words, the columns are linearly dependent mod m, and a matrix with linearly dependent columns has determinant 0.
So what does that mean to you the cracker? Well, that means that if you just calculate the determinant of the above matrix as integers, what you end up with is a multiple of m, and not too big a multiple at that. Chances are that if you have, oh, six or seven outputs in a row, so that you can perform this procedure three or four times with different matrices, and you take the greatest common factor of the determinants you get, you'll find m, or else a really small multiple of m. Given m, you've essentially reduced the problem to the one-parameter problem of finding a, since a and m determine c from just two outputs. If you have, say, a 20-bit modulus, checking all (m, a) pairs might be computationally challenging, but finding m and then checking somewhere around 2^20 values of a by brute force is eminently possible.
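Here's a sketch of the modulus-recovery step in C (the LCG parameters in the usage below and all the names are illustrative): compute the determinant for every window of four consecutive outputs and gcd the results.

```c
#include <assert.h>

/* Determinant of [[x, y, 1], [y, z, 1], [z, w, 1]], expanded along the
   top row.  For consecutive LCG outputs this is a multiple of m. */
static long long det3(long long x, long long y, long long z, long long w)
{
    return x * (z - w) - y * (y - z) + (y * w - z * z);
}

static long long gcd_ll(long long a, long long b)
{
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) {
        long long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* gcd the determinants over every window of four consecutive outputs;
   the result is a multiple of m, and usually m itself. */
static long long recover_modulus(const long long *out, int n)
{
    long long g = 0;
    int i;
    for (i = 0; i + 3 < n; i++)
        g = gcd_ll(g, det3(out[i], out[i + 1], out[i + 2], out[i + 3]));
    return g;
}
```

With eight or so observed outputs you get four or five windows to gcd together, which is usually enough to squeeze the gcd down to m or a small multiple of it.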
Having more or less discussed the security problems with LCGs, I'd like to spend a bit of time on how to make sure that yours at least delivers reasonably high-quality quasi-random product.
But even that doesn't get you out of the woods. For example, let's take m = 7, a = 2, c = 3, X0 = 5. Then the sequence looks like

    5, 2·5 + 3 = 6, 2·6 + 3 = 1, 2·1 + 3 = 5, 6, 1, ...

(It's even worse if you use X0 = 4.) This little example illustrates the general principle that if m is prime, the only way to construct an LCG with cycle length m is to choose a = 1 -- which doesn't make for good pseudo-random numbers.
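You can watch the cycle collapse yourself; this little sketch counts the steps until the seed recurs (and assumes the seed lies on its cycle, as in the examples above):

```c
#include <assert.h>

/* Count LCG steps until the sequence returns to x0.
   Caution: loops forever if x0 is not on a cycle. */
static int cycle_length(unsigned long a, unsigned long c, unsigned long m,
                        unsigned long x0)
{
    unsigned long x = (a * x0 + c) % m;
    int len = 1;
    while (x != x0) {
        x = (a * x + c) % m;
        len++;
    }
    return len;
}
```

For the example above it reports a cycle of length 3 from X0 = 5, a fixed point at X0 = 4, and the full cycle of 7 only when a = 1.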
Using a modulus that is the product of distinct primes suffers from the same problem. Unless you use a = 1, a is going to fail to be 1 mod at least one of those primes, and this shortens up the cycle. In his weighty tome Seminumerical Algorithms: The Art of Computer Programming, Vol. 2, Donald Knuth is kind enough to tell us the following:

The LCG Xi+1 = a Xi + c   (mod m) has cycle length m if and only if

- c is relatively prime to m;
- a − 1 is divisible by every prime that divides m; and
- a − 1 is divisible by 4 if m is divisible by 4.
But even that isn't enough!
The next thing to do is to examine the distribution of outputs that you get. Donny goes bolting for the χ² test -- you divide the set of integers mod m into d bins, calculate the expected frequency of each bin as (bin size)⁄m, then take as your sample a run of at least 5d or so values and calculate the χ² statistic, using d − 1 degrees of freedom in the table. The Knuthster notes that if you make your sample size really really big, χ² can hide "locally nonrandom behavior" and goes on to suggest, "Perhaps a chi-square test should be made for a number of different values of n." I'd say n = 100 and d = 10 is a good choice at the "low" end, on up to n = 1000 and maybe d = 50 at the other -- but that's just me. Knuth makes no specific suggestions.
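A sketch of the binning test in C, with illustrative parameters and names. Note that a suspiciously perfect statistic is as damning as a huge one -- a generator that just cycles through the residues in order fills every bin exactly evenly:

```c
#include <assert.h>

#define MAXBINS 64

/* Chi-square statistic for n LCG outputs dropped into d equal bins. */
static double chi2_stat(unsigned long a, unsigned long c, unsigned long m,
                        unsigned long x0, int n, int d)
{
    int count[MAXBINS] = {0};     /* assumes d <= MAXBINS */
    unsigned long x = x0;
    int i, b;
    double expected, stat = 0.0, diff;

    for (i = 0; i < n; i++) {
        x = (a * x + c) % m;
        count[(int)(x * (unsigned long)d / m)]++;  /* bin = floor(x*d/m) */
    }
    expected = (double)n / d;
    for (b = 0; b < d; b++) {
        diff = count[b] - expected;
        stat += diff * diff / expected;
    }
    return stat;
}
```

Compare the result against the χ² table with d − 1 degrees of freedom; values far out in either tail should make you nervous.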
The next thing you can do is to test whether there is some dependency between the bins into which adjacent numbers fall. Accordingly, let's take adjacent pairs from a run (Xi, Xi+1), (Xi+2, Xi+3), ... (Notice that we don't re-use any values as we did in the geometric examination we did earlier!!) and place them in a d × d table according to which bins they fell into. The expected frequencies of the cell entries can be calculated simply from the number of values in each bin, so we can calculate from our run of values a χ² statistic to compare against a critical value with (d − 1)² degrees of freedom, where d is again the number of bins. This time there are lots more cells, so you need a larger sample for a given value of d.
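A sketch of the pair test, again with illustrative sizes and names; the deliberately awful generator in the usage note shows how dependence between successive values piles all the pairs into just a couple of cells:

```c
#include <assert.h>
#include <math.h>

#define DBINS 4   /* bins per axis (illustrative) */

/* Serial (pair) test: drop non-overlapping pairs (X_i, X_{i+1}) into a
   DBINS x DBINS table and compute chi-square against uniform counts. */
static double serial_chi2(unsigned long a, unsigned long c, unsigned long m,
                          unsigned long x0, int npairs)
{
    int count[DBINS][DBINS] = {{0}};
    unsigned long x = x0, u, v;
    int i, r, s;
    double expected, stat = 0.0, diff;

    for (i = 0; i < npairs; i++) {
        u = (a * x + c) % m;     /* two fresh values per pair -- */
        v = (a * u + c) % m;     /* no re-use across pairs       */
        count[(int)(u * DBINS / m)][(int)(v * DBINS / m)]++;
        x = v;
    }
    expected = (double)npairs / (DBINS * DBINS);
    for (r = 0; r < DBINS; r++)
        for (s = 0; s < DBINS; s++) {
            diff = count[r][s] - expected;
            stat += diff * diff / expected;
        }
    return stat;
}
```

For the pathological a = 1, c = 1, m = 4 generator, every output is its predecessor plus one, so only two of the sixteen cells ever fill and the statistic explodes.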
As per my rant back in the days at the Typing Pool, χ² methods aren't all that appropriate for numerical data. On the other hand, in some sense this LCG thingy really does generate categorical data, with categories numbered 0, ..., m − 1. On the other other hand, the values it generates often get divided by m and used to imitate the U(0,1) distribution, which is numerical. If we choose to treat the numbers as numerical, we can use the Kolmogorov-Smirnov test, which in addition to being the first experiment in cross-marketing between statistics and the distilled-spirits industry, is also a test of goodness of fit between a sample and a proposed distributional model. The idea for Kolmogorov-Smirnov is very simple: Calculate the cumulative distribution function Fsamp from your sample and that for the proposed model Fmodel. The test statistic is just

    D = max over x of |Fsamp(x) − Fmodel(x)|.
Of course, in this case the distribution we're hypothesizing is U(0,1), for the outputs of our LCG all divided by m. Again there is the question of how big a sample to take in order to detect both "local nonrandomness" and "global nonrandomness," so a couple of different sample sizes are probably in order, just as for the χ² test given above. The inscrutable Knuth actually suggests doing a bunch of K-S tests for a fixed length, say 1000, recording the values of the test statistic you get, and then doing a K-S test with those values, using the asymptotic distribution function F(x) = 1 − e^(−2x²). This is a weird enough idea to appeal to me, but DK doesn't really give much insight as to why he thinks it's a good idea.
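Here's a sketch of computing D for a sample against U(0,1), with illustrative names; the careful bit is checking the empirical CDF just below and just above each jump, since the maximum gap occurs at a sample point:

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

static int cmp_double(const void *p, const void *q)
{
    double a = *(const double *)p, b = *(const double *)q;
    return (a > b) - (a < b);
}

/* K-S statistic D = max_x |Fsamp(x) - Fmodel(x)| against U(0,1).
   Fsamp is a step function, so the max is attained at a sample point,
   approached from below (i/n) or above ((i+1)/n).  For U(0,1),
   Fmodel(x) is just x.  Note: this sorts the sample in place. */
static double ks_stat(double *sample, int n)
{
    double d = 0.0, lo, hi;
    int i;
    qsort(sample, n, sizeof(double), cmp_double);
    for (i = 0; i < n; i++) {
        lo = fabs(sample[i] - (double)i / n);
        hi = fabs(sample[i] - (double)(i + 1) / n);
        if (lo > d) d = lo;
        if (hi > d) d = hi;
    }
    return d;
}
```

As a sanity check, the "perfectly spread" sample (2i + 1)/(2n) yields D = 1/(2n), the smallest value any n-point sample can achieve.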
Recall the above observation that ordered t-tuples of consecutive outputs of an LCG fall on a relatively small number of hyperplanes. The idea of the spectral test is that the distance between these hyperplanes tells you something about how good -- the term Knuth uses is "accurate" -- your random number generator is. He calls the accuracy νt, which is defined by the equation

    1⁄νt = the maximum distance between hyperplanes, in the t-dimensional unit cube, taken over all families of equidistant parallel hyperplanes covering all your points.