The Britney Spears Problem
Tracking who's hot and who's not presents an algorithmic challenge
Gently Down the Stream
Are there better algorithms for stream problems waiting to be discovered? Some of the limits of what's possible were outlined in 1996 by Noga Alon, Yossi Matias and Mario Szegedy, who were all then at AT&T Bell Laboratories. They framed the issue in terms of an infinite series of frequency moments , analogous to the more familiar statistical moments (mean, variance, and so on). The zero-order frequency moment F 0 is the number of distinct elements in the stream; the first-order moment F 1 is the total number of elements; F 2 is a quantity called the repeat rate; still higher moments describe the "skew" of the stream, giving greater emphasis to the most-frequent elements. All of the frequency moments are defined as sums of powers of m i , where m i is the number of times that item i appears in a stream, and the sum is taken over all possible values of i .
It's interesting that calculating F 1 is so easy. We can determine the exact length of a stream with a simple, deterministic algorithm that runs in constant space. Alon, Matias and Szegedy proved that no such algorithm exists for any other frequency moment. We can get an exact value of F 0 (the number of distinct elements) only by supplying much more memory. Even approximating F 0 in constant space is harder: It requires a nondeterministic algorithm, one that makes essential use of randomness. The same is true of F 2 . For the higher moments, F 6 and above, there are no constant-space methods at all.
All this mathematics and algorithmic engineering seems like a lot of work just for following the exploits of a famous "pop tart." But I like to think the effort might be justified. Years from now, someone will type "Britney Spears" into a search engine and will stumble upon this article listed among the results. Perhaps then a curious reader will be led into new lines of inquiry.
- Alon, Noga, Yossi Matias and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing , pp. 20–29.
- Chakrabarti, Amit, Graham Cormode and Andrew McGregor. 2007. A near-optimal algorithm for computing the entropy of a stream. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms , pp. 328–335.
- Charikar, Moses, Kevin Chen and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming , pp. 693–702.
- Cormode, Graham, and S. Muthukrishnan. 2005. What's hot and what's not: tracking most frequent items dynamically. ACM Transactions on Database Systems 30(1):249–278.
- Demaine, Erik D., Alejandro López-Ortiz and J. Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms , pp. 348–360.
- Dimitropoulos, Xenofontas, Paul Hurley and Andreas Kind. 2008. Probabilistic lossy counting: an efficient algorithm for finding heavy hitters. ACM SIGCOMM Computer Communication Review 38(1):7–16.
- Estan, Cristian, and George Varghese. 2003. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems 21(3):270–313.
- Fischer, Michael J., and Steven L. Salzberg. 1982. Solution to problem 81–5. Journal of Algorithms 3(4):375–379.
- Karp, Richard M., Scott Shenker, and Christos H. Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28(1):51–55.
- Manku, Gurmeet Singh, Sridhar Rajagopalan and Bruce G. Lindsay. 1998. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data , pp. 426–435.
- Manku, Gurmeet Singh, and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases , pp. 346–357.
- Metwally, Ahmed, Divyakant Agrawal and Amr El Abbadi. 2008. Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology , pp. 618–629.