Fast Stereovision by Coherence Detection
R.D. Henkel
Institute of Theoretical Neurophysics,
University of Bremen
Postfach 330 440
D-28334 Bremen, Germany
Email: [email protected]
Abstract
In human stereovision, a vivid sensation of depth is created by the small relative displacements of objects in the retinal images of the left and right eye. Utilizing only two different views of a scene, the range of disparities which simple disparity units can estimate is severely limited by aliasing effects. A new computational approach to stereo vision utilizes these aliasing effects in a coherence detection scheme, allowing the calculation of dense disparity maps over a large range of disparities with plausible biological hardware. The fast, non-iterative algorithm simultaneously creates, within a single network structure, a disparity map exhibiting hyperacuity and a verification count of the disparity data, and fuses the left and right input images into the cyclopean view of the scene.
1 Introduction
The structure of 3D-space is hidden in the small relative displacements between the left and right views of a scene (disparities). Besides the small displacements of image parts, numerous other image variations occur between the two stereo views, making the reliable computation of disparities a nontrivial problem. Sensor noise and different transfer functions of the left and right imaging systems introduce stochastic signal variations, whereas the varying viewing geometry leads to a variety of systematic image variations, including occlusion effects and foreshortening. In addition, since most object surfaces in real-world scenes display specular reflection, the intensities observed by the imaging systems are not directly correlated with the object surfaces, but nearly always have a viewpoint-dependent component which moves independently of the surface in question.
Classical approaches to stereo vision try to counteract this whole set of distorting signal variations with two basic algorithmic paradigms, known as feature-based and area-based approaches [1]. In feature-based stereo algorithms, the intensity data is first converted to a set of features assumed to be a more stable image property than the raw intensities. The matching stage operates only on these extracted image features. In contrast, area-based approaches directly compare intensity values within small image patches of the left and right view, and try to maximize the correlation between these patches. To assure stable performance, area-based stereo algorithms need suitably chosen correlation measures and a sufficiently large patch size.
Both types of stereo algorithms create computational problems which can be directly attributed to the basic assumptions inherent in these approaches. For example, only a few specific feature-classes are generally utilized in feature-based algorithms. Therefore most areas of the stereo images end up in the "no feature present" class, which is not considered further in the matching process. This leads to a tremendous data reduction which speeds up processing, but makes it impossible to calculate disparity estimates for most of the image regions. In order to obtain dense disparity maps, one is forced to interpolate these missing values.
To complicate matters further, every feature detected in the left image can potentially be matched with every feature of the same class in the right image. This is the classical "false matches" problem, which is basic to all feature-based stereo algorithms. The problem can only be solved by the introduction of additional constraints on the final solution of the matching problem.

These constraints are usually derived from reasonable assumptions about the physical properties of object surfaces, and rule out certain combinations of matches. Classical constraints include the uniqueness of a match, figural continuity, and the preserved ordering of matches along horizontal scanlines [2]. In conjunction with the features extracted from the images, the constraints define a complicated error measure which can be minimized by direct search techniques or through cooperative processes.
The problems inherent to feature-based stereo algorithms can be reduced simply by increasing the number of feature-classes considered in the matching process. In the extreme case, one might utilize a continuum of feature-classes. For example, the locally computed Fourier phase can be used for classifying local intensity variations into feature-classes indexed by the continuous phase value [3].
Using such a continuum of feature-classes, with the feature index derived
from some small image area, is very similar to standard area-based stereo algorithms. In these algorithms, not only a single feature value, but the full vector
of image intensities over a small image patch is used for matching.
Classical area-based approaches minimize the deviation or maximize the correlation between patches of the left and right view. A large enough patch size assures stable performance (via the central-limit theorem). The computation of the correlation measure and the subsequent maximization of this value turns out to be a computationally expensive process, since extensive search is required in configuration space. This problem is usually solved within hierarchical approaches, where disparity data obtained at coarse spatial scales is used to restrict searching at finer scales. However, for generic image data there is no guarantee that the disparity information obtained at the coarse scales is valid. The disparity estimate might be wrong, might have a different value than at finer scales, or might not be present at all. Thus hierarchical approaches will fail under various circumstances.
A third way of calculating disparities is known as phase-based methods [4, 5]. These approaches derive Fourier-phase images from the raw intensity data. Extraction of the Fourier phase can be considered as a local contrast equalization, reducing the effects of many intensity variations between the two
stereo views. Algorithmically, these approaches are in fact gradient-based optical flow methods, with the time derivative approximated by the difference between the left and right Fourier-phase images [6]. The Fourier phase exhibits wrap-around, making it again necessary to employ hierarchical methods, with the already discussed drawbacks. Furthermore, additional steps have to be taken to ensure the exclusion of regions with ill-defined Fourier phase [7].

Figure 1: The velocity of an image patch manifests itself as the main texture direction in the space-time flow field traced out by the intensity pattern in time (left). Sampling such flow patterns at discrete time points can create aliasing effects which lead to wrong estimates if the velocity of the flow is too fast (right). Using optical flow estimation techniques for disparity calculations, this problem is always present, since only the two samples obtained from the left and right eye are available for flow estimation.
The new approach to stereo vision presented in this paper rests on simple disparity estimators which also employ classical optical flow estimation techniques. However, essential to the new approach are the aliasing effects of these disparity units in connection with a simple coherence detection scheme. The new algorithm is able to calculate dense disparity maps, verification counts and the cyclopean view of the scene within a single computational structure.
2 Coherence Based Stereo
The estimation of disparity shares many similarities with the computation of optical flow. But having available only two discrete "time" samples, namely the images of the left and right view, creates an additional problem in disparity estimation. The discrete sampling of visual space leads to aliasing effects which limit the working ranges of simple disparity detectors.
For an explanation consider Figure 1. If a small surface patch is shifted over time, the intensity pattern of the patch traces out a corresponding flow pattern in spacetime. The principal texture direction of this flow pattern indicates the velocity of the image patch. It can be estimated without difficulty if the intensity data for all time points is available (Fig. 1, left). Even if the flow pattern cannot be sampled continuously, but only at some discrete time points, the shift can be estimated without ambiguity if this shift is not too large (Fig. 1, middle).
However, if the shift between the two samples exceeds a certain limit, this becomes impossible (Fig. 1, right). The wrong estimates are caused by simple aliasing in the "time" direction; an everyday example of this effect is sometimes seen as motion reversal in movies.
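The wrap-around is easy to reproduce numerically. The following short Python sketch (an illustration with assumed numbers, not material from the paper) estimates the shift of a single-frequency grating from its phase difference between two samples; the estimate aliases once the true shift exceeds half the wavelength:

```python
import numpy as np

# A grating of wavelength lam is shifted by d_true between the "left" and
# "right" sample.  A phase-based unit only measures the phase difference
# modulo 2*pi, so the recovered shift aliases for |d_true| > lam / 2.
lam = 8.0                      # grating wavelength in pixels (assumed)
k = 2 * np.pi / lam            # corresponding wavevector

for d_true in [1.0, 3.0, 5.0]:
    phase = k * d_true
    wrapped = (phase + np.pi) % (2 * np.pi) - np.pi   # wrap into (-pi, pi]
    print(f"true shift {d_true:3.1f} px -> estimated {wrapped / k:4.1f} px")
# 1.0 and 3.0 are recovered correctly; 5.0 aliases to -3.0.
```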
To formalize, let $I_L(\varphi)$ be the image intensity of a small patch in the left view of a scene, with corresponding Fourier transform $\hat I_L(k_\varphi)$. Moving the left camera on a linear path to the position of the right camera, we obtain a local flow field very similar to Figure 1, namely $I(\varphi, \tau) = I_L(\varphi + d\tau)$. Here $d$ is the disparity of the image patch and the shift parameter $\tau$ runs from 0 to 1. The Fourier transform of $I(\varphi, \tau)$ follows from elementary calculus as $\hat I(k_\varphi, k_\tau) = \hat I_L(k_\varphi)\,\delta(k_\tau - d\,k_\varphi)$. Now, if the spectrum of $\hat I_L(k_\varphi)$ is bounded by some maximum wavevector $k_\varphi^{\max}$, i.e. if $\hat I_L(k_\varphi) = 0$ for $|k_\varphi| > k_\varphi^{\max}$, we find as highest wavevector of the flow field $\hat I(k_\varphi, k_\tau)$ in the $\tau$-direction $k_\tau^{\max} = d\,k_\varphi^{\max}$. However, the maximal representable wavevector in this direction is given by sampling theory as $k_\tau^{\max} = \pi/\Delta\tau$. Since sampling in the $\tau$-direction is done with a step size of $\Delta\tau = 1$, we obtain as an upper bound for sampling the flow field without aliasing effects

$$|d| < \pi / k_\varphi^{\max} = \tfrac{1}{2}\,\lambda_{\min}. \qquad (1)$$
Equation (1) states that the range of reliable disparity estimates for a simple detector is limited by the largest wavevector present in the image data. This size-disparity scaling is well known in the context of the spatial frequency channels assumed to exist in the visual cortex. Cortical cells respond to spatial frequencies up to about twice their peak frequency, therefore limiting the range of detectable disparities to values less than $\lambda_{\mathrm{opt}}/4$, where $\lambda_{\mathrm{opt}}$ is the preferred wavelength of the cell. This is known as Marr's quarter-cycle limit [8, 9].
Since image data is usually sampled in the spatial direction with some fixed receptor spacing $\Delta\varphi$, the highest wavevector $k_\varphi^{\max}$ which can be present in the data after retinal sampling is given by $k_\varphi^{\max} = \pi/\Delta\varphi$. This leads to the requirement that $|d| < \Delta\varphi$: without additional processing steps, only disparities less than the receptor spacing can be estimated reliably by a simple disparity unit.
Equation (1) immediately suggests a way to extend the aliasing-limited working range of disparity detectors: spatial prefiltering of the image data before or during the disparity calculation reduces $k_\varphi^{\max}$ and in turn increases the disparity range. In this way, larger disparities can be estimated, but at the cost of simultaneously reducing the spatial resolution of the resulting disparity map.
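Numerically, the size-disparity scaling of Equation (1) plays out as in the following sketch (the cutoff wavelengths are illustrative assumptions):

```python
import numpy as np

# Eq. (1): a simple unit is alias-free only for |d| < pi / k_max = lambda_min / 2,
# where lambda_min is the shortest wavelength surviving the prefilter.
for lambda_min in [2.0, 4.0, 8.0, 16.0]:    # prefilter cutoffs in pixels (assumed)
    k_max = 2 * np.pi / lambda_min          # largest remaining wavevector
    d_max = np.pi / k_max                   # alias-free disparity bound
    print(f"lambda_min = {lambda_min:4.1f} px -> |d| < {d_max:3.1f} px")
# With receptor spacing 1 px and no prefiltering, lambda_min = 2 px, so only
# sub-pixel disparities are reliable; coarser prefilters widen the range but
# lower the spatial resolution of the disparity map.
```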
Another way of modifying the disparity range is the application of a preshift to the input data of the detectors before the disparity calculation. However, modification of the disparity range by preshifting requires prior knowledge of the correct preshift to be applied, which is a nontrivial problem. One could resort again to hierarchical coarse-to-fine schemes by using disparity estimates obtained at some coarse spatial scale to adjust the processing at finer spatial scales, but the drawbacks inherent to hierarchical schemes have already been elaborated.
Instead of counteracting the aliasing effects discussed, one can utilize them within a new computational paradigm. Basic to the new approach is a stack of simple disparity estimators, all responding to a common view direction, with each unit $i$ having some preshift or presmoothing applied to its input data. Such a stack might even be composed of different types of disparity units. Due to random preshifts and presmoothing, the units within the stack will have different and slightly overlapping working ranges of reliable disparity estimates, $D_i = [d_i^{\min}, d_i^{\max}]$.
If an object seen in the common view direction of the stack has true disparity $d$, the stack will be split by the stimulus into two disjoint classes: the class $C$ of detectors with $d \in D_i$ for all $i \in C$, and the rest of the stack, $\bar{C}$, where $d \notin D_i$. All disparity detectors $i \in C$ will code more or less the true disparity, $d_i \approx d$, but the estimates of detectors belonging to $\bar{C}$ will be subject to random aliasing effects, depending in a complicated way on image content and the specific disparity ranges $D_i$ of the units. Thus, we will have $d_i \approx d \approx d_j$ whenever units $i$ and $j$ belong to $C$, and random values otherwise. A simple coherence detection within each stack, i.e. searching for all units with $d_i \approx d_j$ and extracting the largest cluster found, will be sufficient to single out $C$. The true disparity $d$ in the common view direction of the stack can be estimated as an average over the detected cluster:

$$d \approx \langle d_i \rangle_{i \in C}.$$
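As a concrete illustration, a minimal Python sketch of this coherence detection for a single stack might look as follows (my paraphrase of the scheme; the threshold, the cluster rule, and the example numbers are assumptions of this sketch):

```python
import numpy as np

def coherence_detect(estimates, eps=0.4):
    """Single out the coherent subpopulation C within one disparity stack.

    estimates: raw disparity guesses d_i of the units in the stack; units in
    C agree near the true disparity, aliased units scatter randomly.
    Returns the cluster average <d_i> over C and the indices of C."""
    d = np.asarray(estimates, dtype=float)
    # pairwise coherence: units i, j are coherent if |d_i - d_j| < eps
    coherent = np.abs(d[:, None] - d[None, :]) < eps
    seed = coherent.sum(axis=1).argmax()        # unit with the most partners
    cluster = np.flatnonzero(coherent[seed])    # largest coherent cluster C
    return d[cluster].mean(), cluster

# Example stack: five units code the true disparity ~2.1, three alias.
d_hat, C = coherence_detect([2.05, 2.12, 2.08, 2.15, 2.10, -1.7, 0.4, 3.9])
print(d_hat)   # ~2.10, averaged over the coherent cluster
```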
The coherence detection scheme has to be repeated for every view direction and leads to a fully parallel algorithm for disparity calculation. Neighboring disparity stacks responding to different view directions estimate disparity independently of each other.

Since coherence detection is based on analyzing the multi-unit activity within a stack, the scheme turns out to be extremely robust against single-unit failure. As long as the density of disparity estimators remains high enough along a specific view direction, no substantial loss of network performance will be noticed.
3 Computational Structure
A computational structure realizing coherence-based stereo is quite simple (Figure 2). Identical disparity units are arranged into horizontal disparity layers, and left and right image data is fed into the network of disparity layers along diagonally running data lines. Thus each disparity layer receives the stereo data with a certain fixed preshift applied. This varying preshift between disparity layers leads to slightly different working ranges of neighboring layers. Disparity units stacked vertically above each other are collected into the disparity stacks, which are analyzed for coherently coding subpopulations of disparity units.
In most of the simulations reported below, the responses of two disparity units $i$ and $j$ in a given stack are marked as coherent with each other whenever $|d_i - d_j| < \varepsilon$, with $\varepsilon = 0.3$ to $0.5$. Only the largest cluster found in each stack is read out by the coherence mechanism.
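Putting the layer construction and the per-stack readout together, a minimal sketch of the network in Figure 2 could look like this; the generic estimator `unit`, the shift convention, and the border handling via `np.roll` are assumptions of this illustration, not the paper's implementation:

```python
import numpy as np

def coherence_stereo(left, right, max_disp, unit, eps=0.4):
    """Layered network sketch: layer k preshifts the right view by k pixels,
    so its candidate map is reliable near disparity k; each vertical stack is
    then searched for its largest coherent cluster, whose mean is read out.

    unit(l, r) is any simple disparity estimator returning a per-pixel
    residual shift, reliable only for sub-pixel disparities (cf. Eq. (1))."""
    layers = []
    for k in range(-max_disp, max_disp + 1):
        shifted = np.roll(right, -k, axis=1)    # fixed preshift of layer k
        layers.append(k + unit(left, shifted))  # preshift + residual estimate
    stack = np.stack(layers)                    # shape (n_layers, h, w)

    disparity = np.empty(left.shape)
    for idx in np.ndindex(left.shape):          # one stack per view direction
        d = stack[(slice(None),) + idx]
        coherent = np.abs(d[:, None] - d[None, :]) < eps
        cluster = coherent[coherent.sum(axis=1).argmax()]
        disparity[idx] = d[cluster].mean()      # read out the largest cluster
    return disparity
```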
This type of coherence detection can be realized easily with biological hardware, for example with neuronally implemented limit-cycle oscillators [10]. Assuming appropriate link structures, these neural oscillators will synchronize their responses, but only if they code approximately the same stimulus value [11]. A valid disparity estimate would show up in such a network as a strong coherent neuronal signal.
Figure 2: The network structure for calculating stereo by coherence detection. Simple disparity estimators are arranged in horizontal layers which have slightly overlapping working ranges. Image data is fed into the network along diagonally running data lines. Within each of the vertical disparity stacks, coherently coding subpopulations of disparity units are detected, and the average disparity value of these pools is finally read out by the mechanism.

For the disparity estimators, various kinds of circuitry can be used. In the simulations reported below, units based on motion-energy filtering [12, 13, 14], units based on standard optical flow estimation techniques [6], and units based on algorithms for the estimation of principal texture direction [16] were utilized. All these disparity estimators can be realized by simple spatial filter operations combined with local nonlinearities; at least the units based on motion-energy filtering are currently discussed as models for cortical processing by complex cells.
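As one concrete example of such an elementary unit, the sketch below implements a gradient-based estimator in the spirit of the optical flow techniques of [6]; the Gaussian window size and the regularization constant are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_unit(left, right, sigma=3.0):
    """Gradient-based disparity unit: for small shifts, I_R ~ I_L + d * dI/dx,
    so a local least-squares estimate is d ~ <g (I_R - I_L)> / <g^2> with
    g = dI/dx.  For larger shifts the estimate becomes unreliable, which is
    exactly the aliasing behavior the coherence scheme exploits."""
    gx = np.gradient(0.5 * (left + right), axis=1)      # horizontal gradient
    num = gaussian_filter(gx * (right - left), sigma)   # local correlation
    den = gaussian_filter(gx * gx, sigma)               # local gradient energy
    return num / (den + 1e-8)                           # regularized estimate
```

A function like this could be passed as the `unit` argument of the `coherence_stereo` sketch above; motion-energy or texture-direction units would slot into the same place.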
The structure of the new network superficially resembles earlier cooperative schemes used to disambiguate false matches [8, 15], but the dynamics and the link structures of the new network are quite different. In the new scheme, the disparity stacks do not interact with each other; thus no direct spatial facilitation of disparity estimates takes place. Such spatial facilitation is a central ingredient of almost all other cooperative algorithms. Some small spatial facilitation enters coherence-based stereo only through the spatially extended receptive fields of the elementary disparity units used.
Interaction in the coherence detection occurs only along the lines defining the disparity stacks. Essentially, this interaction is of the excitatory type, while usual cooperative schemes have either no interactions or inhibitory ones along such lines. Furthermore, coherence detection is a fast and non-iterative process, whereas classical cooperative algorithms require many iterations before approaching some steady state.
Finally, the disparity units in classical cooperative schemes are basically binary units, indicating the existence or non-existence of a possible match. In coherence-based stereo, the disparity units are disparity estimators, giving back a guess of what value the correct disparity might have. For this reason, the disparity maps obtained by classical cooperative schemes and by the new coherence-based stereo algorithm have a very different quality: classical schemes return integer-valued maps, whereas the maps obtained with the new scheme have continuous values, displaying hyperacuity (i.e. sub-pixel precision).
Figure 3: The performance of coherence-based stereo on classical test images. A) The Pentagon. The overpasses are recovered well in the disparity map, and also some of the trees in the center of the building are visible. B) Renault automobile part. Even for the relatively structureless background a good disparity estimate could be obtained. Both disparity maps were calculated with an input image size of 256x256 pixels.

Since coherence detection is an opportunistic scheme, extensions of the neuronal network to multiple spatial scales and to combinations of different types of disparity estimators are trivial. Additional units are simply included in the appropriate coherence stack. The coherence scheme will combine only the information from the coherently acting units and ignore the rest of the data. For example, adding units with asymmetric receptive fields [17] will result in more precise disparity estimates close to object boundaries.
Presumably, this is the situation in a biological context: various disparity
units with rather random properties are grouped into stacks responding to common view directions. The coherence-detection scheme singles out precisely those
units which have detected the true disparity. The responses of all other units
are ignored.
4 Results
Figure 3 shows some results calculated with the new algorithm on two classical test images. Data from several spatial scales was combined in each disparity stack to yield the final disparity maps. The results obtained with coherence-based stereo on these standard test images are comparable to usual area-based approaches. The resulting disparity maps are dense, with clearly defined object borders.

Figure 4: When enough image detail is present, disparity is estimated by the new algorithm with subpixel precision (E). In other image areas, a low verification count (F) indicates unreliable estimates. This can be used to mask unreliable data (G). A simple superposition of the left and right input pictures displays double contours (C), but the coherence detection scheme is able to fuse both stereo pictures to the cyclopean view of the scene (D). Note the change of perspective between the left and right input images and the cyclopean view.
Within the new coherence-based stereo algorithm, a simple verification count can be derived from the relative number of coherently acting disparity units in each stack, i.e. by calculating the ratio $N(C)/N(C \cup \bar{C})$, where $N(C)$ is the number of units in class $C$. The verification count can be used to mark unreliable estimates.
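A minimal sketch of the count, using the same cluster rule as the earlier snippets (again my illustration, not the paper's code):

```python
import numpy as np

def verification_count(estimates, eps=0.4):
    """Ratio N(C) / N(C u C-bar) of coherently acting units in one stack;
    values near 1 indicate a reliable disparity estimate."""
    d = np.asarray(estimates, dtype=float)
    coherent = np.abs(d[:, None] - d[None, :]) < eps
    return coherent.sum(axis=1).max() / d.size   # largest cluster / stack size

# Masking unreliable estimates, mimicking Figure 4G (threshold assumed):
# reliable = verification_count(stack) > 0.5
```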
Artificial stimuli, as well as images taken in technical contexts, are sometimes composed of large structureless regions. In those areas where the sensor noise is stronger than the intensity variation correlated with the object surfaces, a reliable estimation of disparities becomes impossible (parts of Figure 4E). With coherence-based stereo, these areas can be detected by a low verification count (Figure 4F). Using the verification count for masking all unreliable disparity estimates, coherence-based stereo can mimic a standard feature-based algorithm when confronted with difficult data (Figure 4G).
However, in contrast to classical feature-based approaches, none of the usual constraints basic to feature-based algorithms are explicitly built into the coherence detecting network. Also, with coherence-based stereo, disparity estimates will be obtained in all image areas with sufficient structure, regardless of the type of features present there. This is not the case with classical feature-based algorithms, where one gets estimates only in image regions with the preselected type of features.
As Figures 4E-G show in comparison to Figures 4A-B, the maps which are output by the coherence-based stereo algorithm are aligned neither with the left nor with the right view of the scene. This is a direct result of the network structure used (cf. Figure 2). The diagonally running data lines in the network correspond to the lines of sight from the left and right eye positions, but the coherence detection is done within the vertical disparity stacks of the network.
Figure 5: The new scheme works well with very different source images. A) Filling-in occurs in sparse random-dot stereograms from coarser spatial channels (top: stereo pair with 3% black dots, bottom right: true disparity, bottom left: calculated disparity). B) Images without any localized image features are also handled well. C) Dense disparity maps are obtained for image data of natural scenery, which usually has sufficient detail for disparity estimation in all image areas.
The links defining these stacks split the angle between the left and right data lines in half. Reprojecting data lines and coherence-detecting links back into 3D-space shows that the disparity stacks actually analyze the image data along view lines splitting the angle between the left and right view directions in half, i.e., along the cyclopean view direction. Therefore, any data which is output by the coherence-based stereo algorithm is aligned with the cyclopean view of the scene.
It is very easy to obtain the cyclopean view of the scene itself. Let $I_i^L$ and $I_i^R$ be the left and right input data of disparity unit $i$, respectively. A simple summation over all coherently coding disparity units,

$$I^C = \langle I_i^L + I_i^R \rangle_{i \in C},$$

gives the image intensities of the cyclopean view (Figure 4D).
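In code, the fusion is again just a cluster-restricted average; the sketch below assumes the per-unit input layout of the earlier snippets:

```python
import numpy as np

def cyclopean_intensity(I_left, I_right, cluster):
    """I^C = <I_i^L + I_i^R> over the coherent cluster C of one stack.

    I_left, I_right: per-unit input intensities of the stack (already
    preshifted by the network); cluster: indices of C, e.g. as returned
    by coherence_detect above."""
    iL = np.asarray(I_left, dtype=float)
    iR = np.asarray(I_right, dtype=float)
    return (iL[cluster] + iR[cluster]).mean()   # fused cyclopean intensity
```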
Interestingly, the image constructed in this way is not severely distorted even in areas where the disparity estimate is poor. Note that the fusion of the left and right input images into the cyclopean view is not an interpolation between the left and right image data. This can be seen by observing the change in perspective between the calculated cyclopean view and the left and right input images. The simple superposition of both stereo images (Figure 4C) additionally displays double contours which are absent in the cyclopean view.
Coherence-based stereo works equally well with very different source images. In its multi-scale version, the coarser spatial channels interpolate disparity values at finer spatial channels when no estimate can be obtained there. Figure 5A displays results of a disparity calculation with a random-dot stereogram having a pixel density of only 3%. Due to filling-in, the true disparity map is recovered to a large extent even in image areas with no detail. In contrast to other cooperative schemes, the filling-in occurs instantaneously in coherence-based stereo; no iterative refinement or propagation of disparity values is necessary as with other approaches.
As already mentioned, various types of disparity estimators can be used within the basic algorithm. Fig. 5B shows results obtained with a disparity unit based on an algorithm for the estimation of principal texture direction. Classical feature-based stereo algorithms would fail with this stimulus set, since the stereo data lacks any localized image features [18]. Finally, in Fig. 5C, a disparity map obtained from an outdoor scene is shown. Natural imagery usually has sufficient detail present in all image areas and at all spatial scales, so dense and stable disparity maps can be obtained with the coherence-based stereo algorithm.
Acknowledgements

Fruitful discussions and encouraging support from H. Schwegler are acknowledged. Image data was supplied by Gerard Medioni, USC Institute for Robotics and Intelligent Systems (Image 3B), the Kiel Cognitive Systems Group, G. Sommer, Christian-Albrechts-University of Kiel (Fig. 4), the Max-Planck-Institute for Biological Cybernetics, AG Bülthoff (Fig. 5B), and Bob Bolles, AIC, SRI International (Fig. 5C).
References

[1] U. R. Dhond and J. K. Aggarwal, Structure from Stereo - A Review, IEEE Trans. Syst., Man and Cybern. 19, 1489-1510, 1989.

[2] J. P. Frisby, Stereo Correspondence and Neural Networks, in "Handbook of Brain Theory", ed. M. A. Arbib, 937-941, MIT Press 1995.

[3] J. Weng, Image Matching Using the Windowed Fourier Phase, Int. J. of Comp. Vis. 3, 211-236, 1993.

[4] T. D. Sanger, Stereo Disparity Computation Using Gabor Filters, Biol. Cybern. 59, 405-418, 1988.

[5] M. R. M. Jenkin and A. D. Jepson, Recovering Local Surface Structure through Local Phase Difference Methods, CVGIP 59, 72-93, 1994.

[6] J. L. Barron, D. J. Fleet and S. S. Beauchemin, Performance of Optical Flow Techniques, Int. J. Comp. Vis. 12, 43-77, 1994.

[7] D. J. Fleet and A. D. Jepson, Stability of Phase Information, IEEE Trans. Patt. Anal. Mach. Intel. 15, 1253-1268, 1993.

[8] R. Blake and H. R. Wilson, Neural Models of Stereoscopic Vision, TINS 14, 445-452, 1991.

[9] D. Marr and T. Poggio, A Computational Theory of Human Stereo Vision, Proc. R. Soc. Lond. B 204, 301-328, 1979.

[10] H. R. Wilson and J. D. Cowan, Excitatory and Inhibitory Interactions in Localized Populations of Model Neurons, Biophys. J. 12, 1-24, 1972.

[11] R. D. Henkel, Segmentation in Scale Space, in: "Computer Analysis of Images and Patterns, CAIP '95 Proceedings", LNCS 970, 41-48, Springer 1995.

[12] N. Qian and Y. Zhu, Physiological Computation of Binocular Disparity, to appear in Vision Research '97.

[13] G. C. DeAngelis, I. Ohzawa, and R. D. Freeman, Depth is Encoded in the Visual Cortex by a Specialized Receptive Field Structure, Nature 352, 156-159, 1991.

[14] E. H. Adelson and J. R. Bergen, Spatiotemporal Energy Models for the Perception of Motion, J. Opt. Soc. Am. A 2, 284-299, 1985.

[15] K. Prazdny, Detection of Binocular Disparities, Biol. Cybern. 52, 93-99, 1985.

[16] J. Bigun, G. H. Granlund and J. Wiklund, Multidimensional Orientation Estimation with Applications to Texture Analysis and Optical Flow, IEEE Trans. Patt. Anal. and Mach. Intl. 13, 775-790, 1991.

[17] A. Fusiello, V. Roberto, and E. Trucco, A Symmetry-based Stereo Algorithm, Research Report udmi/53/96/rr, submitted to CVPR '97, Machine Vision Lab, University of Udine, Italy, 1996.

[18] P. A. Arndt, H. A. Mallot, and H. H. Bülthoff, Human Stereovision without Localized Image Features, Biol. Cybern. 72, 279-293, 1995.