COMPUTING SCIENCE
Connecting the Dots
Can the tools of graph theory and social-network studies unravel the next big plot?
Brian Hayes
Clique Here
Only the NSA knows whether it can actually spot such patterns, but a
somewhat simpler problem can serve as a proxy in estimating the
difficulty of the task. The proxy problem is that of finding a large
clique within the call graph. This process was studied in the late
1990s by James Abello, then at AT&T Bell Laboratories, and his colleagues.
Finding the largest clique in a graph is a classic hard problem. The
brute-force method simply examines every possible subset of vertices
and checks to see if all of them are connected. The number of
subsets grows so rapidly that this algorithm runs out of oomph even
for a graph with 50 vertices; it would be unthinkable for 50
million. The only practical alternatives are approximate and
probabilistic methods, which usually converge on a good solution but
can't promise to find the best one.


The graph used for Abello's experiments included a single day's
records; it had 53,767,087 vertices (corresponding to telephone
numbers) and more than 170 million edges (representing calls). The
algorithm started with a small clique and tried to build a bigger
one. In a first stage, the program repeatedly searched for a new
vertex connected with all those already in the set. When no more
vertices of this kind could be found, the program switched to
another strategy, looking for opportunities to remove one vertex
from the clique in exchange for adding two others. Grinding through
the day's database took about five hours on a machine with four
processors and four gigabytes of memory.
The largest cliques found had 30 vertices. Consider what this means:
In a group of 30 telephone numbers, each one either called or was
called by all of the other 29 numbers in the course of a single day,
for a total of at least 435 calls. That's quite a busy calling
circle! But there wasn't just one such clique: Abello's group found
more than 14,000 distinct cliques of size 30.
The curious result of this experiment suggests several conclusions,
all of them tentative. First, the computational power needed for
analyzing call graphs appears to be readily available—though
it's not yet on every desktop. Abello's experiments were able to
chew through one day's worth of data; today's hardware could
doubtless manage much larger bites.
Second, it looks like finding instances of a given pattern within
the call graph is not the problem. The problem is defining a pattern
selective enough to identify a target group without also branding
14,000 others as possible terrorists. The algorithms must somehow
distinguish a few dozen people intent on mayhem from other groups of
the same size and structure who are planning a family reunion,
canvassing the neighborhood for a lost cat, running for city council
or war-dialing to win free concert tickets from a radio station.
No matter what methods are brought to bear on the problem, the
intelligence agencies face a formidable task: To survey a huge
population (potentially all six billion of us) looking for a tiny
subgroup (those planning violence). It's analogous to screening for
a rare disease. Even if the test is right 99 percent of the time,
almost all the positive results will necessarily be false positives.
Abello (who is now with DIMACS, the Center for Computer Science and
Discrete Mathematics at Rutgers University, and with Ask.com) thinks
that the task of tracing terrorists through call graphs would be
difficult, but he also notes that additional information can be
extracted from the graphs, apart from the simple pattern matching I
have described here. For example, cliques and other clusters have
interesting dynamics; some persist from day to day but others
vanish. Information like this might help to distinguish one kind of
group from another. Abello also mentions extensive recent work on
identifying self-organized communities in other contexts, from
chat-room participants to eBay customers.
» Post Comment