Modern crime often leaves an electronic trail. Finding and preserving that evidence requires careful methods as well as technical skill.
Just two decades ago, there were no digital forensics tools as we know them today. Instead, practitioners had to repurpose tools that had been developed elsewhere. For example, disk backup software was used for collection and preservation, and data recovery tools were used for media analysis. Although these approaches worked, they lacked control, repeatability, and known error rates.
The situation began to change in 1993. That year, the U.S. Supreme Court held in the case of Daubert v. Merrell Dow Pharmaceuticals that any scientific testimony presented in court must be based on a theory that is testable, that has been scrutinized and found favorable by the scientific community, that has a known or potential error rate, and that is generally accepted. Although the case didn’t directly kick off the demand for digital forensics tools, it gave practitioners grounds for arguing that validated tools were needed not just for good science and procedure but as a matter of law. Since then there has been a steady development of techniques for what has come to be called technical exploitation.
Probably the single most transformative technical innovation in the field has been the introduction of hash functions, first as a means for ensuring the integrity of forensic data, and later as a way to recognize specific files.
In computer science a hash function maps a sequence of characters (called a string) to a binary number of a specific size—that is, a fixed number of bits. A 16-bit hash function can produce 2¹⁶ = 65,536 different values, whereas a 32-bit hash function can produce 2³² = 4,294,967,296 possible values. Hash functions are designed so that changing a single character in the input results in a completely different output. Although many different strings will have the same hash value—something called a hash collision—the more bits in the hash, the smaller the chance of such an outcome. (The name hash comes from the way hash functions are typically implemented as a two-step process that first chops and then mixes the data, much the same way one might make hash in the kitchen.)
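These properties are easy to see in practice. The sketch below builds a 32-bit hash by truncating a SHA-256 digest from Python's standard hashlib library; the truncation to four bytes is my own illustrative choice, not a standard forensic technique. Changing one letter of the input produces an entirely different value:

```python
import hashlib

def hash32(s: str) -> int:
    """Map a string to a 32-bit hash: a SHA-256 digest truncated to 4 bytes."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 0 .. 2**32 - 1

# A single-character change yields a completely different 32-bit value.
a = hash32("The quick brown fox")
b = hash32("The quick brown fux")
print(f"{a:08x}")
print(f"{b:08x}")
```

Because only 2³² values are possible, truncating this aggressively makes collisions far more likely than with the full 256-bit digest, which is why real forensic tools keep the full output.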
Hashing was invented by Hans Peter Luhn and first described in a 1953 IBM technical memo; it’s been widely used for computerized text processing since the 1960s. For example, because every sentence in a document can be treated as a string, hashing makes it possible to rapidly see if the same paragraph ever repeats in a long document: Just compute the hash value for each paragraph, put all of the hashes into a list, sort the list, and see if any number occurs two or more times.
If there is no repeat, then no paragraph is duplicated. If a number does repeat, then it’s necessary to look at the corresponding paragraphs to determine whether the text really does appear twice, or whether the duplicate is the result of a hash collision. Using hashes in this manner is quicker than working directly with the paragraphs because it is much faster for computers to compare numbers than sequences of words—even when you account for the time to perform the hashing.
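That procedure can be sketched in a few lines of Python. This version uses a dictionary keyed by hash value rather than an explicit sort, which has the same effect, and it performs the final text comparison to rule out a collision, just as described above:

```python
import hashlib

def find_repeated_paragraphs(text: str) -> list[str]:
    """Return paragraphs that occur more than once, located via their hashes."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    seen = {}      # hash value -> first paragraph seen with that hash
    repeats = []
    for p in paragraphs:
        h = hashlib.sha256(p.encode("utf-8")).hexdigest()
        if h in seen:
            # Same hash: compare the actual text to rule out a collision.
            if seen[h] == p and p not in repeats:
                repeats.append(p)
        else:
            seen[h] = p
    return repeats

doc = "First point.\n\nSecond point.\n\nFirst point.\n\nThird point."
print(find_repeated_paragraphs(doc))  # -> ['First point.']
```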
In 1979, Ralph Merkle, then a Stanford University doctoral student, invented a way to use hashing for computer security. Merkle’s idea was to use a hash function that produced more than 100 bits of output and additionally had the property of being one-way. That is, it was relatively easy to compute the hash of a string, but it was nearly impossible, given a hash, to find a corresponding string. The essence of Merkle’s idea was to use a document’s 100-bit one-way hash as a stand-in for the document itself. Instead of digitally certifying a 50-page document, for example, the document could be reduced to a 100-bit hash, which could then be certified. Because there are so many different possible hash values (2¹⁰⁰ is about 10³⁰ combinations), Merkle reasoned that an attacker could not take the digital signature from one document and use it to certify a second document—because to do so would require that both documents had the same hash value.
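The stand-in property is simple to demonstrate. A modern one-way hash such as SHA-256 reduces a document of any length to the same fixed-size digest, which is what a signing algorithm would then certify. (This sketch shows only the hashing step; producing the actual signature requires public-key cryptography, which is omitted here.)

```python
import hashlib

# Documents of very different lengths reduce to digests of identical,
# fixed size; the digest, not the document, is what gets signed.
short_doc = "Hello."
long_doc = "Lorem ipsum dolor sit amet. " * 50_000  # roughly a 50-page document

d1 = hashlib.sha256(short_doc.encode("utf-8")).digest()
d2 = hashlib.sha256(long_doc.encode("utf-8")).digest()

print(len(d1), len(d2))  # both 32 bytes (256 bits), regardless of input size
```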
Merkle got his degree, and today digital signatures applied to hashes are the basis of many cybersecurity systems. They protect credit card numbers sent over the Internet, certify the authenticity and integrity of code run on iPhones, and validate keys used to play digital music.
The idea of hashing has been applied to other areas as well—in particular, forensics. One of the field’s first and continuing uses of hashing was to establish chain of custody for forensic data. Instead of hashing a document or a file, the hash function is applied to the entire disk image. Many law enforcement organizations will create two disk images of a drive and then compute the hash of each image. If the values match, then the copies are assumed to each be a true copy of the data that were on the drive. Any investigator with a later copy of the data can calculate the hash and see if it matches the original reported value. Hashing is so important that many digital forensics tools automatically perform this comparison.
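A minimal sketch of that verification step, assuming the disk images are ordinary files on disk: hash each image in chunks (so even very large images can be processed in constant memory) and accept a later copy only if its hash matches the original's. The function names here are illustrative, not from any particular forensic tool.

```python
import hashlib

def image_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hash of a disk image, reading 1 MiB at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify_copy(original: str, copy: str) -> bool:
    """A later copy is trusted only if its hash matches the original image's."""
    return image_hash(original) == image_hash(copy)
```

In practice an investigator would compare against the hash value recorded at acquisition time rather than rehashing the original, but the comparison itself is the same.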
A second use for hashing is to identify specific files. This approach takes advantage of the property that it is extraordinarily unlikely for two files to have the same hash value, so a hash can label a file in much the same way a fingerprint can identify a person.
Today forensic practitioners distribute databases containing file hashes. These data sets can be used to identify known goods, such as programs distributed as part of operating systems, or known bads, such as computer viruses, stolen documents, or child pornography. Recently, several groups, including my team at the Naval Postgraduate School, have applied cryptographic hashing to blocks of data smaller than files, taking advantage of the fact that even relatively short 512-byte and 4,096-byte segments of files can be highly identifying.
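The block-level version of this idea can be sketched as follows: hash every 4,096-byte block of a drive image and test whether any block matches a set of hashes computed from a known file. This toy version assumes the matching content is block-aligned on the drive, a simplification that real sector-hashing tools must work around.

```python
import hashlib

SECTOR = 4096  # block size; 512-byte sectors work the same way

def sector_hashes(data: bytes) -> set[str]:
    """Hash each 4,096-byte block of a drive image or file."""
    return {
        hashlib.sha256(data[i:i + SECTOR]).hexdigest()
        for i in range(0, len(data), SECTOR)
    }

def contains_known_content(drive: bytes, known: set[str]) -> bool:
    """Flag a drive if any of its blocks matches a known file's block hashes."""
    return not sector_hashes(drive).isdisjoint(known)
```

Because the comparison is a set intersection, a drive's blocks can be checked against a database of hundreds of millions of known-file hashes without examining the file contents themselves.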
File and sector identification with hashing means that a hard drive containing millions of files can be automatically searched against a database containing the hashes of hundreds of millions of files in a relatively short amount of time, perhaps just a few hours. The search can be done without any human intervention.