Seeing between the Pixels
The Mind's Eye
When you ponder how best to represent a picture, a promising approach is to ask how people or animals see images. What kinds of data structures does the brain employ for visual information?
The input end of the visual system is clearly a pixel-based device: The retina of the eye has much in common with the array of photosensors in a digital camera. Nevertheless, we do not see in pixels. No one experiences the visual surround as an array of colored dots—not even Seurat. At the level of conscious awareness, what we see are faces, trees, trombones, streetlights, paintings—and all of these objects seem to have a continuous and unpixelated surface, no matter how closely we look at them. Evidently visual information is re-encoded somewhere along the neural pathways of the eye and the brain.
A part of the brain where image representation has been studied extensively is a region known variously as the primary visual cortex, the striate cortex or simply V1. In the 1960s David Hubel of Harvard University and Torsten Wiesel of the Rockefeller University recorded the response of individual V1 neurons when animals were shown various simple patterns, such as spots and stripes. These experiments and later work revealed that the "receptive field" of a typical V1 neuron has a distinctive geometry, with three main characteristics. First, the receptive field is localized: The cell responds most strongly to stimuli in a particular area of visual space. The field is also oriented: Each neuron has a favored axis for stripes or elongated features. And finally the field is most sensitive to variations in luminance over a specific size range; in other words, it has a preferred band of spatial frequencies. Thus the V1 cortex seems to classify features according to their position, orientation and angular size.
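The three properties of a V1 receptive field can be captured in a single mathematical model. A conventional choice (not named in the text above, so take this as an illustration rather than a description of any particular neuron) is the Gabor function: a sinusoidal grating windowed by a Gaussian envelope, which is at once localized, oriented and tuned to a band of spatial frequencies.

```python
import numpy as np

def gabor_field(size, x0, y0, theta, freq, sigma):
    """A Gabor function: a sinusoidal grating under a Gaussian window.
    It is localized (near x0, y0), oriented (along angle theta) and
    band-pass (centered on spatial frequency freq) -- the three
    characteristics of a V1 receptive field described in the text."""
    ys, xs = np.mgrid[0:size, 0:size]
    x, y = xs - x0, ys - y0
    xr = x * np.cos(theta) + y * np.sin(theta)  # rotate to preferred axis
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr)

def response(field, patch):
    """A bare-bones linear model of a neuron's response: the inner
    product of its receptive field with an image patch."""
    return float(np.sum(field * patch))
```

A cell modeled this way fires strongly when shown stripes at its preferred position, orientation and spacing, and only weakly otherwise.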
Why should the mammalian visual system favor this particular way of representing images? That's a good question, which neurobiological experiments have not yet answered. Bruno A. Olshausen of the University of California at Davis and David J. Field of Cornell University have therefore approached the problem from the opposite direction. Instead of looking into the brain for clues to how it encodes images, they look at images of natural scenes and ask what encoding would give the simplest or most efficient representation. (In this context a "natural" scene is not necessarily the forest primeval; it could be a city street, or even a page of text. What the term excludes are artificial patterns such as random visual noise.)
The technique adopted by Olshausen and Field is a distant cousin of Fourier analysis. The aim is to find a set of "basis functions" that can be combined in various proportions to generate an image. The sine and cosine functions of Fourier analysis are one such basis set, but it is not the best set for natural images. Following earlier work by John G. Daugman of the University of Cambridge, Olshausen and Field argue that the optimal basis functions are those that yield a sparse encoding for images. What this means is that any single image is likely to excite only a few of the V1 neurons. The basis set should also be complete, in the sense that it can account for the features of any natural image.
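The idea of a sparse encoding can be made concrete with a toy numerical sketch. The dictionary of basis functions below is random rather than learned, and matching pursuit is a generic greedy encoder, not the method Olshausen and Field describe; the point is only that an image built from a few basis functions can be represented with a few active coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy dictionary: 128 unit-norm basis functions for 64-pixel patches
# (random here, purely for illustration -- not a learned basis).
n_pixels, n_basis = 64, 128
phi = rng.standard_normal((n_pixels, n_basis))
phi /= np.linalg.norm(phi, axis=0)

# Build a patch as a weighted sum of just three basis functions.
a_true = np.zeros(n_basis)
a_true[rng.choice(n_basis, size=3, replace=False)] = rng.standard_normal(3)
patch = phi @ a_true

def matching_pursuit(x, phi, n_iter=10, tol=1e-9):
    """Greedy sparse encoding: repeatedly subtract the basis function
    most correlated with the current residual."""
    a, r = np.zeros(phi.shape[1]), x.copy()
    for _ in range(n_iter):
        c = phi.T @ r                  # correlations with all functions
        k = np.argmax(np.abs(c))
        if abs(c[k]) < tol:
            break
        a[k] += c[k]
        r -= c[k] * phi[:, k]
    return a

a_hat = matching_pursuit(patch, phi)
# Only a handful of the 128 coefficients end up nonzero: the encoding
# is sparse, just as a natural image excites only a few V1 neurons.
```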
A set of functions that satisfy these criteria is not something that can be cooked up analytically. Olshausen and Field search for a sparse basis set through an iterative learning procedure, which might even resemble the mechanism by which a developing organism (or perhaps an evolving species) learns to make sense of visual input. Several images are selected as a training set, from which many small square patches are extracted at random. The functions chosen to describe these patches are initially arbitrary; they are refined by repeatedly making small changes and accepting a change if the resulting functions yield a sparser representation. With 16-by-16 patches, the procedure takes a few hours to converge on a basis set.
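The flavor of such an iterative learning procedure can be sketched in a few lines. Everything here is a drastic simplification of my own devising: the training "patches" are noisy copies of a few hidden features rather than pieces of natural images, the coefficients come from a crude soft-thresholding step, and the update rule is a generic reconstruction-error gradient, not Olshausen and Field's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_basis = 16, 8

# Hidden "true" features, standing in for the structure of natural scenes.
truth = rng.standard_normal((n_pixels, n_basis))
truth /= np.linalg.norm(truth, axis=0)

# Start from an arbitrary set of basis functions, as the text describes.
phi = rng.standard_normal((n_pixels, n_basis))
phi /= np.linalg.norm(phi, axis=0)

lam, lr = 0.1, 0.05                     # sparseness weight, learning rate
for _ in range(2000):
    # Draw a random training patch: a noisy multiple of one hidden feature.
    x = truth[:, rng.integers(n_basis)] * rng.normal(1.0, 0.1)
    # Infer sparse coefficients: keep only the correlations that beat
    # the sparseness penalty (soft thresholding).
    c = phi.T @ x
    a = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)
    # Nudge the basis functions to shrink the reconstruction error,
    # then renormalize them -- a small accepted change per step.
    residual = x - phi @ a
    phi += lr * np.outer(residual, a)
    phi /= np.linalg.norm(phi, axis=0)
```

In a full-scale run the patches would come from real images and the basis set would be larger; with 16-by-16 patches, as the text notes, convergence takes hours rather than seconds.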
What kinds of basis functions emerge from this process? They look nothing like the stripes and checkerboards of the discrete cosine transform. Instead most of the functions are elongated ellipses, which seem at first glance to be scattered randomly over the square patches. Examining the entire set, however, shows that nearly all combinations of position, orientation and spatial frequency are represented, so that each function responds to a specific combination of these three properties. Of course position, orientation and spatial frequency are just the features detected by V1 neurons. Finding a resemblance between the basis functions and the V1 receptive fields is not a proof that the brain employs functions of this particular form, but the result is encouraging and suggests strategies for further experimental work.
Even if we knew the brain's own graphics file format, we would not necessarily want to adopt it for computer graphics files. A biological precedent is not binding on technology. Furthermore—and here I depart on a perilous flight of speculation—the brain's encoding may be well adapted only to image analysis and understanding, not to image generation. Because of a curious asymmetry in mammalian sensory architecture, the brain has no need ever to recreate a pixel array. Although audible channels of communication go both ways—we have ears to hear with and a mouth to speak with—in the electromagnetic spectrum we have receptors but no projectors. Thus the images we receive and interpret never have to be reconstructed for display or transmission (unless you are Seurat, painting with pixels). It is easy enough to imagine a planet where creatures have organs for both input and output of images; their visual cortex would doubtless be different from ours. Maybe we should check out the V1 neurons of Teletubbies.
© Brian Hayes