Connecting the Dots
By Brian Hayes
Can the tools of graph theory and social-network studies unravel the next big plot?
Can the tools of graph theory and social-network studies unravel the next big plot?
DOI: 10.1511/2006.61.400
In the five years since that wrenching Tuesday morning when hijacked aircraft sliced into the World Trade Center and the Pentagon, Americans have been living with a new undercurrent of worry and mistrust. Naturally, there's fear of further attacks. But there's also concern that measures taken to forestall such attacks could erode traditional rights and liberties. In recent months, controversy has erupted over reports that government agencies are monitoring Internet and telephone communications as well as financial transactions. Some of the surveillance programs are said to be sifting through gigantic data sets, scanning for patterns that might reveal criminal intent or activity.
Brian Hayes
The debate over these programs has focused mainly on legal and political questions. Are constitutional and statutory safeguards being respected? What about laws that bar intelligence agencies from spying on American citizens? Do the programs strike an appropriate balance between the right to privacy and the need for security? These are important issues, but I shall leave them to others. Here I want to ask a different kind of question: What can one expect to learn through such wholesale screening and data-mining operations? Do the communications patterns of terrorists have a signature so distinctive that computer algorithms can detect signs of a conspiracy amid trillions of other telephone calls or e-mail messages?
In addressing these questions I face an obvious impediment: Very little reliable information on the nature and scope of the surveillance programs has been made public. However, mathematicians and computer scientists have tackled problems very similar to those confronting an intelligence analyst trying to make sense of surveillance data. And social scientists have long taken an interest in the networks that bind people together—including networks of criminals and terrorists. Perhaps by combining insights from these fields we can make some plausible guesses.
The newly revealed surveillance programs seem to include several distinct activities. Some involve eavesdropping—listening in on telephone conversations or recording the content of Internet messages. A follow-the-money program gathers information from a banking clearinghouse. But the reports I find most intriguing mention efforts to analyze a database of telephone calls with the aim of tracing links among conspirators. The database includes no sound recordings or any other hints about what might have been said in a conversation; it merely lists the telephone numbers at the two ends of each call and gives the date and time when a call began and ended.
This "call detail" database sounded very familiar. Several years ago I had read of experiments done with a similar database—almost surely an earlier version of the one that is now said to be under government scrutiny. The experiments were tests of algorithms in the mathematical field known as graph theory, which studies network-like structures. The phone-call database was a useful test bed because it can be viewed as an enormous mathematical graph. I wrote about this work in an earlier column in American Scientist (January-February 2000).
Vague allusions to the database, or "call graph," appeared in the first public accounts of the new surveillance programs. Writing in The New York Times last December, Eric Licht-blau and James Risen noted, "[National Security Agency] technicians, besides actually eavesdropping on specific conversations, have combed through large volumes of phone and Internet traffic in search of patterns that might point to terrorism suspects." The nature of the operation became clearer in May when Leslie Cauley wrote in USA Today that at least three telephone companies are voluntarily supplying call-detail records to the NSA. Two of those companies later denied that they participate in the program, and USA Today retracted that part of the story. The third company, AT&T, has declined to comment on the substance of the report, and so has the NSA. When AT&T was sued for allegedly violating privacy statutes, the Bush administration moved to suppress the suits on the grounds that litigating the matter would reveal state secrets. As this issue of American Scientist goes to press, the facts remain murky.
The NSA is the U.S. espionage service with responsibility for cryptography and "signals intelligence." Although its budget and staffing are secret, it is often said to be the largest of the U.S. intelligence agencies and also, incidentally, the largest employer of mathematicians in the United States and perhaps in the world. And it is assumed to possess prodigious computing resources.
Exploration of the call graph belongs to the branch of signals intelligence known as traffic analysis. In a battlefield situation, you might intercept an enemy's radio transmissions but be unable to read their encrypted content. Nevertheless, just counting the messages can yield valuable information. A flurry of activity might signal an impending troop movement; sudden radio silence could be even more ominous. If you can identify the source and the intended recipient of each message—in effect, constructing a call graph—you can learn even more, since lines of communication often reveal something about the organization of a military force.
The search for meaningful patterns in telephone records could rely on similar principles, but the problem is much harder. In the military situation, messages between enemy units are readily identified as such. In the telephone database, calls among a few dozen conspirators would all too easily get lost in the background noise of other conversations.
The records in the call database are collected not for the sake of national security but for mundane commercial purposes. In order to send you an itemized bill at the end of the month, a phone company needs to keep track of every call completed, with the originating and receiving phone numbers and the starting and ending times. The largest companies handle roughly 250 million toll calls a day, and so a month's worth of data amounts to several billion call records. AT&T reports that its database of retained records is approaching two trillion calls and more than 300 terabytes of data.
Apart from billing, the call graph has other uses within the phone company—some of which are not too different from what the NSA may be doing, and almost as secretive. Historical calling patterns can be used to detect fraud, and some patterns are also of interest in marketing. For example, a company that offers a discounted rate within a "calling circle" can use information from the call graph to estimate the costs and benefits of the program.
In principle, the same kind of traffic data found in telephone call-detail records could also be compiled for other communications channels. For instance, Federal Express and other courier services keep digitized records of their deliveries, which could readily be transformed into a database of senders and receivers. Curiously, the most digital medium of all—the Internet—does not provide for routine retention of who-speaks-to-whom data; there's no direct need for it, since customers do not pay by the message. However, there is no technological barrier to collecting detailed statistics on e-mail messages and other kinds of Internet traffic. A "packet sniffer" installed on the network backbone would simply need to scan the headers of messages and record the to and from addresses. (It's even possible that equipment reportedly installed by the NSA at certain Internet switching centers could have this purpose.)
Digging into the call graph is a form of data mining—and the process could not be more aptly named. Heaving aside hundreds of terabytes of extraneous data is the digital equivalent of a major earthmoving project. Before firing up the steam shovels, it would be helpful to know what we're looking for. What are the communications patterns characteristic of dangerous plotters and connivers?
Brian Hayes
A good place to turn for an answer to this question is the community of scholars who study social networks: the structures of groups of people as defined by the connections between them. (Of course the community of social-network scholars is itself a social network, held together by many interpersonal connections.)
A seminal paper in this literature, "The strength of weak ties," was published in 1973 by Mark S. Granovetter, now of Stanford University. Granovetter observed that when people are strongly connected—when they are close friends, say, or family members, or working colleagues—the ties between them are usually symmetrical and also obey a rule that might be called triangularity. Symmetry implies that if A is a friend of B, then B is also a friend of A. Triangularity says that if A is friendly with both B and C, then B and C should also be friends with each other. Of course these are merely tendencies, and anyone can cite exceptions (unrequited love, pathological triangles), but for purposes of analysis it's useful to ask what a society would look like if symmetry and triangularity were strictly enforced rules. The answer is that the social structure would consist entirely of perfect cliques—groups in which every person is linked to every other person.
Strong bonds between individuals would seem to be the very stuff of social cohesion, but Granovetter's theory suggests a paradoxical effect. Locally, strong ties create highly robust structures, but on a larger scale they also isolate one group from another. Because of the all-or-nothing nature of strong ties, distinct cliques become island universes that cannot communicate with one another. What really holds the world together, Granovetter argues, are weak ties between casual acquaintances. These relationships are often symmetrical but seldom tri-angular: You can chat with a bank teller every week without getting to know all of the teller's other customers. Such weak ties, which at first seem socially insignificant, provide vital cross-links between cliques. The prevailing network structure, according to Granovetter, consists of clusters tightly bound internally by strong ties and loosely linked to other clusters by weak ties.
Social-network theory has obvious affinities with mathematical graph theory—even though people who work in the two fields tend to form distinct clusters linked only by weak ties. Graph theory brings its own vocabulary, as well as a more abstract view of the subject matter: Formally, a graph is a set of vertices together with a set of edges, where each edge connects a pair of vertices. This rather opaque definition can be interpreted in various ways, but in practice graph theorists, like social networkers, draw lots of diagrams with dots and lines.
What do the principles of social networks and graph theory tell us about the structure of terrorist cells? The very word "cell" offers a clue: It suggests compartmentalization. And indeed the lore of spy rings and resistance fighters speaks of limiting communication so that if one person is captured others will not be put in jeopardy. At the same time, however, the members of the group have to keep in touch in order to make plans and carry them out.
An illuminating case study comes from a rather different context: price-fixing by manufacturers of electrical equipment in the 1950s. The social network of the colluding managers and executives was examined by Wayne E. Baker of the University of Chicago and Robert R. Faulkner of the University of Massachusetts. They found that "the structure of illegal networks is driven primarily by the need to maximize concealment, rather than the need to maximize efficiency." Nevertheless, the price-fixing and bid-rigging simply could not be accomplished without communication among the conspirators, especially in the case of the biggest machinery. Despite the risks, executives had to meet face-to-face to coordinate their plans.
Networks of terrorists apparently face the same conflicting imperatives. Valdis E. Krebs, a consultant who usually applies social-network analysis to business problems, has used the same tools to map relations among the September 11 hijackers. In a paper written just a few weeks after the attacks, he found the network surprisingly sparse. Although every hijacker could be connected to every other via some path through the network, many of the paths were quite long, passing through three or four intermediaries. This attenuated structure would make communication extremely inefficient.
Krebs later revised his analysis, as more information became available. He has posted a new map at the Web site http://orgnet.com/prevent.html. Here he reaches a different conclusion. Starting with two men who were already under suspicion in January of 2000, Krebs finds that known linkages lead to all 19 hijackers, and to other conspirators as well. Each node of the network is tied to the two initial subjects either directly or through a single intermediary.
José A. Rodríguez of the University of Barcelona has created a similar network map for the bombing of commuter trains in Madrid on March 11, 2004. Rodríguez recorded several kinds of strong links among the conspirators. Some had ties of kinship or had been childhood friends; others congregated at a shop owned by two of the subjects; some were veterans of earlier wars or terrorist actions. Looking at just the 13 men who actually placed and detonated the explosives, Rodríguez found that the strong ties produced a somewhat strange network. A core of six people formed a clique: Each one was linked to all the others. But the remaining members were only loosely associated or were completely disconnected from the main group.
The outlook changed entirely when Rodríguez included some 70 persons associated with the plot in various ways and when he mapped weak ties as well as strong ones. The weak ties denote pairs of people connected by financial transactions, casual encounters and the like. This larger and fuller network looks much like what Granovetter's theory would predict. There are several dense clusters, within which most nodes are strongly connected, but the clusters communicate with one another only via comparatively loose and unreliable couplings. For example, one cluster is made up of Spanish citizens from whom the bombers obtained explosives; most paths from this subgroup to the rest of the network pass through a single node, a vulnerable choke-point.
The social networks described above were constructed retrospectively. The starting point was a complete list of the known members of the group, along with enough biographical information to fill in the links. Discerning the same structure in advance—when an attack is still in the planning stage and most of the plotters are unknown—would be much harder, especially when working from impersonal data such as logs of phone calls or e-mails.
Sketching out such networks surely falls within the NSA's mandate, which might explain the agency's interest in the call graph. One scenario is easy to imagine. Someone has come under suspicion, based on information from other sources, and by consulting the call graph the NSA learns whom that person has been talking with in recent weeks or months. The result is a "ring" of contacts surrounding the subject. Then each of the contacts is investigated in the same way, producing a second ring of contacts-of-contacts. This process could be continued further, although the exponential growth of the graph will soon take in most of the population (especially if the subject has answered a call from a telemarketer or has ordered a pizza). Of more interest are instances where the subject's contacts are also contacts of one another, since such triangular links suggest strong ties.
The tricky part of this network analysis is not finding the links but knowing which of them are significant. It may well be true that if intelligence agencies had "connected the dots," all of the September 11 hijackers could have been linked to the two who were spotted in January of 2000. But thousands of other people would have been linked to those individuals as well. Showing the network of conspirators in isolation is misleading; the graph is actually embedded in a vastly larger structure.
Direct access to the call graph would be a convenience in tracing the associations of known suspects, but it is not necessary. For any named individual, the same records could be obtained under court order, as they are by law-enforcement agencies during criminal investigations. Indeed, if the call graph is used only for such purposes, it seems like quite an extravagance—sifting through 1012 records for the few hundred or few thousand calls that might be of interest.
But news reports hint at a more ambitious function for the call graph: not merely tracking down the associates of a known malefactor but rather discovering a plot without any prior guidance, merely by searching the archive for "patterns that might point to suspects." This sounds like magic: You cast your gaze over the vast and intricate web of the call graph, and without even knowing the names behind the phone numbers, you perceive some pattern of linkages that's a danger sign. Can this trick be transferred from the world of magic to the world of algorithms?
If such a signature pattern exists, the findings of social-network theory suggest it should involve some distinctive combination of strong and weak ties. A terrorist cell, seen from the point of view of telephone traffic, might be a set of people who talk among themselves a lot but have little to say to the rest of the world. Thus the pattern that rings the alarm bells would be a dense subgraph in comparative isolation from its surroundings.
Only the NSA knows whether it can actually spot such patterns, but a somewhat simpler problem can serve as a proxy in estimating the difficulty of the task. The proxy problem is that of finding a large clique within the call graph. This process was studied in the late 1990s by James Abello, then at AT&T Bell Laboratories, and his colleagues.
Finding the largest clique in a graph is a classic hard problem. The brute-force method simply examines every possible subset of vertices and checks to see if all of them are connected. The number of subsets grows so rapidly that this algorithm runs out of oomph even for a graph with 50 vertices; it would be unthinkable for 50 million. The only practical alternatives are approximate and probabilistic methods, which usually converge on a good solution but can't promise to find the best one.
Brian Hayes
The graph used for Abello's experiments included a single day's records; it had 53,767,087 vertices (corresponding to telephone numbers) and more than 170 million edges (representing calls). The algorithm started with a small clique and tried to build a bigger one. In a first stage, the program repeatedly searched for a new vertex connected with all those already in the set. When no more vertices of this kind could be found, the program switched to another strategy, looking for opportunities to remove one vertex from the clique in exchange for adding two others. Grinding through the day's database took about five hours on a machine with four processors and four gigabytes of memory.
The largest cliques found had 30 vertices. Consider what this means: In a group of 30 telephone numbers, each one either called or was called by all of the other 29 numbers in the course of a single day, for a total of at least 435 calls. That's quite a busy calling circle! But there wasn't just one such clique: Abello's group found more than 14,000 distinct cliques of size 30.
The curious result of this experiment suggests several conclusions, all of them tentative. First, the computational power needed for analyzing call graphs appears to be readily available—though it's not yet on every desktop. Abello's experiments were able to chew through one day's worth of data; today's hardware could doubtless manage much larger bites.
Second, it looks like finding instances of a given pattern within the call graph is not the problem. The problem is defining a pattern selective enough to identify a target group without also branding 14,000 others as possible terrorists. The algorithms must somehow distinguish a few dozen people intent on mayhem from other groups of the same size and structure who are planning a family reunion, canvassing the neighborhood for a lost cat, running for city council or war-dialing to win free concert tickets from a radio station.
No matter what methods are brought to bear on the problem, the intelligence agencies face a formidable task: To survey a huge population (potentially all six billion of us) looking for a tiny subgroup (those planning violence). It's analogous to screening for a rare disease. Even if the test is right 99 percent of the time, almost all the positive results will necessarily be false positives.
Abello (who is now with DIMACS, the Center for Computer Science and Discrete Mathematics at Rutgers University, and with Ask.com) thinks that the task of tracing terrorists through call graphs would be difficult, but he also notes that additional information can be extracted from the graphs, apart from the simple pattern matching I have described here. For example, cliques and other clusters have interesting dynamics; some persist from day to day but others vanish. Information like this might help to distinguish one kind of group from another. Abello also mentions extensive recent work on identifying self-organized communities in other contexts, from chat-room participants to eBay customers.
It's in the nature of secret intelligence programs that most of us will never know for sure what the programs do, how well they work or even whether they exist. Nevertheless, in a democracy citizens can't entirely cede responsibility for what their government may be doing behind the black curtain. To have an informed opinion, we need to puzzle out the facts as best we can. Besides, it's an interesting puzzle.
My own opinion, so far, remains ill-formed. Tracking terrorists through call graphs looks like a hard problem. But just because I'm stumped certainly doesn't mean it can't be done!
Whether or not call graphs lead to hidden terrorist cells, they may be just the ticket for other tasks. Here's one idea. The Bush administration has expressed displeasure with the public disclosure of all the new surveillance programs, and would like to know who leaked the news. The call graph might be an ideal device for answering that question. One need merely list, on the one hand, all those who had access to the information, and on the other hand the journalists who ultimately reported the story. Search in the graph for direct or indirect connections between those two sets of vertices. The irony is that whoever released the information probably understood quite clearly this potential for exposure.
© Brian Hayes
Click "American Scientist" to access home page
American Scientist Comments and Discussion
To discuss our articles or comment on them, please share them and tag American Scientist on social media platforms. Here are links to our profiles on Twitter, Facebook, and LinkedIn.
If we re-share your post, we will moderate comments/discussion following our comments policy.