Dynamic textures

2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001

Dynamic textures are sequences of images of moving scenes that exhibit certain stationarity properties in time; these include sea-waves, smoke, foliage, whirlwind but also talking faces, traffic scenes etc. We present a novel characterization of dynamic textures that poses the problems of modelling, learning, recognizing and synthesizing dynamic textures on a firm analytical footing. We borrow tools from system identification to capture the "essence" of dynamic textures; we do so by learning (i.e. identifying) models that are optimal in the sense of maximum likelihood or minimum prediction error variance. For the special case of second-order stationary processes we identify the model in closed form. Once learned, a model has predictive power and can be used for extrapolating synthetic sequences to infinite length with negligible computational cost. We present experimental evidence that, within our framework, even low dimensional models can capture very complex visual phenomena.

Stefano Soatto
UCLA Computer Science, Los Angeles, CA 90095, and Washington University, St. Louis
soatto@ucla.edu, soatto@ee.wustl.edu

Gianfranco Doretto
UCLA Computer Science, Los Angeles, CA 90095
doretto@cs.ucla.edu

Ying Nian Wu
UCLA Statistics, Los Angeles, CA 90095
ywu@stat.ucla.edu

(This research is supported in part by NSF grant IIS-9876145 and ARO grant DAAD19-99-1-0139. We wish to thank Prabhakar Pundir and Alessandro Chiuso.)

1. Introduction

Consider a sequence of images of a moving scene. Each image is an array of positive numbers that depend upon the shape, pose and motion of the scene, as well as upon its material properties (reflectance distribution) and on the light distribution of the environment. It is well known that the joint reconstruction of photometry and geometry is an intrinsically ill-posed problem: from any (finite) number of images it is not possible to uniquely recover all unknowns (shape, motion, reflectance and light distribution).

Traditional approaches to scene reconstruction rely on fixing some of the unknowns, either by virtue of assumption or by restricting the experimental conditions, while estimating the others¹. However, such assumptions can never be validated from visual data, since it is always possible to construct scenes with different photometry and geometry that give rise to the same images². The ill-posedness of the most general visual reconstruction problem, and the remarkable consistency of the solution as performed by the human visual system, reveal the importance of priors for images [29]. They are necessary to fix the arbitrary degrees of freedom and render the problem well-posed. In general, one can use the extra degrees of freedom to the benefit of the application at hand: one can fix photometry and estimate geometry (e.g. in robotic vision), fix geometry and estimate photometry (e.g. in image-based rendering), or recover a combination of the two that satisfies some additional optimality criterion, for instance the minimum description length of the sequence of video data [23]. Given this arbitrariness in the reconstruction and interpretation of visual scenes, it is clear that there is no notion of a true interpretation, and the criterion for correctness is somewhat arbitrary. In the case of humans, the interpretation that leads to a correct Euclidean reconstruction (one that can be verified by other sensory modalities, such as touch) has obvious appeal, but there is no way in which the correct Euclidean interpretation can be retrieved from visual signals alone.
In this paper we will analyze sequences of images of moving scenes solely as visual signals. "Interpreting" and "understanding" a signal amounts to inferring a stochastic model that generates it. The "goodness" of the model can be measured in terms of the total likelihood of the measurements or in terms of its predictive power: a model should be able to give accurate predictions of future signals. Such a model will involve a combination of photometry, geometry and dynamics, and will be designed for maximum likelihood or minimal prediction error variance. Notice that we will not require that the reconstructed photometry or geometry be correct (in the Euclidean sense), for that is intrinsically impossible without involving (visually) non-verifiable prior assumptions. But the model must be capable of predicting future measurements. In a sense, we look for an "explanation" of the image data that allows us to recreate and extrapolate it. It can therefore be thought of as the compressed version or the "essence" of the sequence of images.

Footnote 1: For instance, in stereo and structure from motion one assumes that (most of) the scene has Lambertian reflection properties, and exploits such an assumption to establish correspondence and estimate shape. Similarly, in shape from shading one assumes constant albedo and exploits changes in irradiance to recover shape.

Footnote 2: For example, a sequence of images of the sea at sunset could have been originated by a very complex and dynamic shape (the surface of the sea) with constant reflection properties (homogeneous material, water), but also by a very simple shape (e.g. the plane of a television monitor) with a non-homogeneous radiance (the televised spatio-temporal signal). Similarly, the appearance of a moving Lambertian cube can be mimicked by a spherical mirror projecting a light distribution to match the albedo of the cube.

1.1. Prior related work

There has been extensive work in the area of 2D texture analysis, recognition and synthesis. Most of the approaches use statistical models [13, 29, 21, 22, 5, 19, 4, 12], while a few others rely on deterministic structural models [8, 28]. Another distinction is that some of them work directly on the pixel values, while others project image intensity onto a set of basis functions (the most common methods use Gabor filters [14, 3] and steerable filters [10, 13]). There have been many physically based algorithms that target the visual appearance of specific phenomena [7, 9, 20, 26]. These methods are computationally intensive, customized for particular textures, and allow no parameters to control the simulation once a model is inferred.

On the other hand, there has been comparatively little work in the specific area of dynamic textures. Schödl et al. [24] address the problem by finding transition points in the original video sequence where the video can be looped back on itself in a minimally obtrusive way; the process involves morphing techniques to smooth out visual discontinuities. Wei and Levoy [28] have also suggested extending their approach to dynamic textures by creating a repeatable sequence. The approach is clearly very restrictive and obtains a relatively quick solution for a small subset of problems without explicitly inferring a model. Bar-Joseph [2] uses multiresolution analysis (MRA) tree merging for the synthesis and merging of 2D textures and extends the idea to dynamic textures. For 2D textures, new MRA trees are constructed by merging MRA trees obtained from the input; the algorithm differs from De Bonet's [5], which operates on a single texture sample. The idea is extended to dynamic textures by constructing MRA trees using a 3D wavelet transform. Impressive results were obtained for the 2D case, but only a finite-length sequence is synthesized after computing the combined MRA tree.
Our approach captures the essence of a dynamic texture in the form of a dynamic model, and an infinite-length sequence can be generated in real time using parameters computed off-line and, for the case of second-order processes, in closed form. Szummer and Picard's work [27] on temporal texture modelling uses a similar approach towards capturing dynamic textures. They use the spatio-temporal autoregressive model (STAR), which imposes a neighborhood causality constraint even in the spatial domain. This severely restricts the textures that can be captured: the STAR model fails to capture rotation, acceleration and other simple non-translational motions, and it works directly on the pixel intensities rather than on a lower-dimensional representation of the image. We incorporate spatial correlation without imposing causal restrictions, as will become clear in the following sections, and can capture more complex motions, including ones where the STAR model is ineffective (see [27], from which we borrow some of the data processed in Section 5).

1.2. Contributions of this work

This work presents several novel aspects in the field of dynamic textures. On the issue of representation, we present a novel definition of dynamic texture that is general (even the simplest instance can capture second-order processes with an arbitrary covariance sequence) and precise (it allows making analytical statements and drawing from the rich literature on system identification). On learning, we propose two criteria: total likelihood or prediction error. For the case of a second-order model we give a closed-form solution of the learning problem. On recognition, we show how similar textures tend to cluster in model space. On synthesis, we show that even the simplest model (first-order ARMA with white IID Gaussian input) captures a wide range of textures.
Our algorithm is simple to implement, efficient to learn and fast to simulate; it allows one to generate infinitely long sequences from short input sequences and to control parameters in the simulation.

2. Representation of dynamic textures

For a single image, one can say it is a texture if it is a realization from a stationary stochastic process with spatially invariant statistics [29]. This definition captures the intuitive notion of texture discussed earlier. For a sequence of images (dynamic texture), individual images are clearly not independent realizations from a stationary distribution: there is a temporal coherence intrinsic in the process that needs to be captured. The underlying assumption, therefore, is that individual images are realizations of the output of a dynamical system driven by an independent and identically distributed (IID) process. We now make this concept precise as an operative definition of dynamic texture.

2.1. Definition of dynamic texture

Let {I(t)}, t = 1, ..., τ, be a sequence of images. Suppose that at each instant of time we can measure a noisy version y(t) of the image, y(t) = I(t) + w(t), where w(t) is an independent and identically distributed sequence drawn from a known distribution (Footnote 4), resulting in a positive measured sequence. We say that the sequence is a (linear) dynamic texture if there exist a set of n spatial filters φ_α, α = 1, ..., n, and a stationary distribution q(·) such that, calling x(t) ≐ [φ_1(I(t)), ..., φ_n(I(t))]^T, we have

  x(t+1) = Σ_{k=0}^{ν} A_k x(t−k) + B v(t)

with v(t) an IID realization from the density q(·), for some choice of matrices A_0, ..., A_ν, B and initial condition x(0) = x_0. Without loss of generality we can assume ν = 0, since we can always augment the state of the above model to be x̃(t) ≐ [x(t)^T, x(t−1)^T, ..., x(t−ν)^T]^T. Therefore, a dynamic texture is associated with an auto-regressive, moving average (ARMA) process with unknown input distribution q(·):

  x(t+1) = A x(t) + B v(t),   v(t) ~ q(·) IID
  y(t) = I(t) + w(t),   with x(t) = [φ_1(I(t)), ..., φ_n(I(t))]^T   (1)

with x(0) = x_0 given, and q(·) unknown. One can obviously extend the definition to an arbitrary non-linear model of the form x(t+1) = f(x(t), v(t)), leading to the concept of non-linear dynamic textures.

Footnote 4: This distribution can be inferred from the physics of the imaging device. For CCD sensors, for instance, a good approximation is a Poisson distribution with intensity related to the average photon count.

2.2. Filters and dimensionality reduction

The definition of dynamic texture above entails a choice of filters φ_α, α = 1, ..., n. These filters are also inferred as part of the learning process for a given dynamic texture. There are several criteria for choosing a suitable class of filters, ranging from biological motivations to computational efficiency. In the trivial case, we can take φ to be the identity, and therefore look at the dynamics of individual pixels in (1). We view the choice of filters as a dimensionality reduction step, and seek a decomposition of the image in the simple (linear) form

  I(t) = Σ_{α=1}^{n} x_α(t) γ_α   (2)

where γ ≐ [γ_1, ..., γ_n] can be an orthonormal basis of R^m, a set of principal components, or a wavelet filter bank. An alternative non-linear choice of filters can be obtained by processing the image with a filter bank and representing it with the collection of positions of the maximal response in the passband [18]. In this paper we will restrict our attention to linear filters. In [25] we discuss how to extend some of these results to special classes of nonlinear filters.

3. Learning dynamic textures

The maximum-likelihood formulation of the dynamic texture learning problem can be posed as follows: given noisy measurements y(1), ..., y(τ), find

  Â, B̂, γ̂, x̂_0, q̂ = arg max p(y(1), ..., y(τ))

subject to the model (1), with v(t) an IID realization from q(·). The inference method depends crucially upon what type of representation we choose for q(·).
Note that the above inference problem involves the hidden variables x(t) multiplying the unknown parameter A, and the realizations v(t) multiplying the unknown parameter B; it is therefore intrinsically non-linear even if the original state model is linear. In general one could use iterative techniques that alternate between estimating (sufficient statistics of) the conditional density of the state and maximizing the likelihood with respect to the unknown parameters, in a fashion similar to the expectation-maximization (EM) algorithm [6]. In order for such iterative techniques to converge to a unique minimum, canonical model realizations need to be considered, corresponding to particular forms for the matrices A and B. We discuss such realizations in Section 4, where we also present a closed-form solution for a wide class of linear dynamic textures.

3.1. Representation of the driving distribution

So far we have managed to defer addressing the fact that the unknown driving distribution belongs, in principle, to an infinite-dimensional space, and therefore something needs to be said about how this issue is dealt with algorithmically. We consider three ways to approach this problem. One is to transform this into a finite-dimensional inference problem by choosing a parametric class of densities. This is done in the next section, where we postulate that the unknown driving density belongs to a finite-dimensional parameterization of a class of exponential densities, so that the inference problem is reduced to a finite-dimensional optimization. The exponential class is quite rich and includes, in particular, multi-modal as well as skewed densities, although our experiments show that even a single Gaussian model achieves good results. When the dynamic texture is represented by a second-order stationary process we will show in Section 4.2 that, contrary to popular belief, a closed-form solution can be obtained.
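To make the bilinear difficulty concrete, here is a toy alternating least-squares sketch in Python: it alternates between estimating the states given the filters and re-estimating orthonormal filters given the states. This is an illustrative caricature of the estimate-states/maximize-parameters idea, not the EM algorithm of [6] and not the method proposed in this paper; all data and dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 200, 5, 300

# Toy data from a stable linear model: y(t) = C x(t) + noise.
A_true = 0.9 * np.eye(n)
C_true, _ = np.linalg.qr(rng.standard_normal((m, n)))
x = rng.standard_normal(n)
Y = np.empty((m, tau))
for t in range(tau):
    x = A_true @ x + 0.1 * rng.standard_normal(n)
    Y[:, t] = C_true @ x + 0.01 * rng.standard_normal(m)

# Alternate between (i) state estimation given C and (ii) filter estimation
# given the states; each sub-problem is linear even though the joint
# problem is bilinear in (C, x).
C = np.linalg.qr(rng.standard_normal((m, n)))[0]   # random initial filters
for it in range(20):
    X = C.T @ Y                                    # (i) least-squares states (C^T C = I)
    U, _, Vt = np.linalg.svd(Y @ X.T, full_matrices=False)
    C = U @ Vt                                     # (ii) orthonormal C via Procrustes
X = C.T @ Y                                        # final states for the final C
A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])           # dynamics from consecutive state pairs

print(np.linalg.norm(Y - C @ X) / np.linalg.norm(Y))
```

The Procrustes step (SVD of Y X^T) is the closed-form minimizer of the fit error over orthonormal C, which is what makes each half of the alternation easy even though the joint problem is not.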
The second alternative is to represent the density q(·) via a finite number of fair samples drawn from it; the model (1) can be used to represent the evolution of the conditional density of the state given the measurements, and the density is evolved by updating the samples so that they remain a fair realization of the conditional density as time evolves. Algorithms of this sort are called "particle filters" [16]; the CONDENSATION filter is the best-known instance in the computer vision community. The third alternative is to treat (1) as a semi-parametric statistical problem, where one of the parameters (q) lives in the infinite-dimensional manifold of probability densities that satisfy certain regularity conditions, endowed with a Riemannian metric (corresponding to Fisher's information matrix), and to design gradient descent algorithms with respect to the natural connection, as has been done in the context of independent component analysis (ICA). This avenue is considerably more laborious and we therefore do not consider it in this study.

3.2. Compression and denoising

Due to the equivalence between stationary second-order covariance sequences and first-order Gauss-Markov models, we can represent a dynamic texture with its model parameters and the input sequence. This, in general, will result in a lossy compression of the original sequence, where the error is given by the innovation process (see [25] for more details). In the experimental section we show a simple example of how a compressed sequence can be obtained. The model we describe in this paper can also be used to perform suboptimal denoising of the original sequence. It is immediate to see that the denoised sequence is given by

  Î(t) ≐ Ĉ x̂(t)   (3)

where Ĉ is the maximum likelihood estimate of C (see model (4) below) and x̂(t) is obtained from x̂(t+1) = Â x̂(t) + B̂ v̂(t).
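The sample-based representation just described (the second alternative) can be sketched as a bootstrap particle filter for a toy instance of model (1). Dimensions, noise levels, matrices and the multinomial resampling scheme below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, tau, N = 4, 10, 50, 500   # state dim, obs dim, length, particles

# A toy instance of model (1): x(t+1) = A x(t) + B v(t), y(t) = C x(t) + w(t).
A = 0.9 * np.eye(n)
B = 0.1 * np.eye(n)
C = rng.standard_normal((m, n))
x = np.zeros(n)
ys = []
for t in range(tau):
    x = A @ x + B @ rng.standard_normal(n)
    ys.append(C @ x + 0.1 * rng.standard_normal(m))

# Bootstrap particle filter: propagate the samples through the dynamics,
# weight them by the observation likelihood, then resample, so that they
# remain a fair representation of the conditional density of the state.
particles = rng.standard_normal((N, n))
for y in ys:
    particles = particles @ A.T + rng.standard_normal((N, n)) @ B.T
    resid = y - particles @ C.T                       # (N, m) residuals
    logw = -0.5 * np.sum(resid**2, axis=1) / 0.1**2   # Gaussian log-likelihood
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)                  # multinomial resampling
    particles = particles[idx]

x_est = particles.mean(axis=0)   # conditional-mean estimate of x(tau)
print(np.linalg.norm(x_est - x))
```

A Gaussian q(·) is used here only to keep the toy self-checking; the point of the sample representation is precisely that q(·) need not be Gaussian.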
4. A closed-form solution for learning second-order stationary processes

It is well known that a stationary second-order process with arbitrary covariance can be modelled as the output of a linear dynamical system driven by white, zero-mean Gaussian noise [17]. In our case, we will therefore assume that there exist a positive integer n, a process x(t) with initial condition x_0 in R^n and symmetric positive-definite matrices Q and R such that

  x(t+1) = A x(t) + v(t),   x(0) = x_0,   v(t) ~ N(0, Q)
  y(t) = C x(t) + w(t),   w(t) ~ N(0, R)   (4)

for some matrices A in R^{n×n} and C in R^{m×n}. The problem of model identification consists in estimating the model parameters A, C, Q, R from measurements y(1), ..., y(τ). Note that B in the model (1) is such that B B^T = Q.

4.1. Uniqueness and canonical model realizations

The first observation concerning the model (4) is that the choice of matrices A, C, Q is not unique, in the sense that there are infinitely many such matrices that give rise to exactly the same sample paths y(t) starting from suitable initial conditions. This is immediately seen by substituting A with T A T^{-1}, C with C T^{-1} and Q with T Q T^T, and choosing the initial condition T x_0, where T in GL(n) is any invertible n×n matrix. In other words, the basis of the state space is arbitrary, and any given process has not a unique model, but an equivalence class of models

  [A, C, Q] ≐ { T A T^{-1}, C T^{-1}, T Q T^T | T in GL(n) }.

In order to identify a unique model of the type (4) from a sample path y(t), it is therefore necessary to choose a representative of each equivalence class: such a representative is called a canonical model realization, in the sense that it does not depend on the choice of basis of the state space (because the basis has been fixed).
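The equivalence class just described is easy to verify numerically: transforming (A, C) by any invertible T, while transforming the state and the input accordingly, leaves the output sequence unchanged. A small sketch with arbitrary made-up matrices (T is kept well-conditioned by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, tau = 3, 6, 20

A = 0.8 * np.eye(n)
C = rng.standard_normal((m, n))
T = rng.standard_normal((n, n)) + 3 * np.eye(n)   # generic invertible basis change

# Transformed model: A -> T A T^{-1}, C -> C T^{-1}, x0 -> T x0.
Ti = np.linalg.inv(T)
A2, C2 = T @ A @ Ti, C @ Ti

x0 = rng.standard_normal(n)
vs = rng.standard_normal((tau, n))                # one shared input realization

def run(A_, C_, x_, vmap):
    # Simulate x(t+1) = A_ x(t) + vmap v(t), y(t) = C_ x(t+1).
    ys = []
    for v in vs:
        x_ = A_ @ x_ + vmap @ v
        ys.append(C_ @ x_)
    return np.array(ys)

Y1 = run(A, C, x0, np.eye(n))
Y2 = run(A2, C2, T @ x0, T)                       # input enters in the new basis (Q -> T Q T^T)
print(np.max(np.abs(Y1 - Y2)))                    # identical outputs up to round-off
```

This is exactly why a canonical realization is needed: the data alone cannot distinguish members of the class, so the basis must be fixed by convention.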
While there are many possible choices of canonical models (see for instance [15]), we are interested in one that is "tailored" to the data, in the sense of having a diagonal state covariance. Such a model is called balanced [1]. Since we are interested in data dimensionality reduction, we will make the following assumption about the model (4):

  rank(C) = n,   n ≪ m   (5)

and choose the canonical model that makes the columns of C orthonormal:

  C^T C = I_n   (6)

where I_n is the identity matrix of dimension n×n. As we will see shortly, this assumption results in a unique model that is tailored to the data in the sense of defining a basis of the state space such that its covariance Σ_x ≐ lim_{τ→∞} E[x(τ) x(τ)^T] is asymptotically diagonal.

The problem we set out to solve can then be formulated as follows: given measurements of a sample path of the process, y(1), ..., y(τ), τ ≫ n, estimate Â, Ĉ, Q̂, R̂, a canonical model of the process {y(t)}. Ideally, we would want the maximum likelihood solution from the finite sample:

  Â(τ), Ĉ(τ), Q̂(τ), R̂(τ) = arg max_{A,C,Q,R} p(y(1), ..., y(τ))   (7)

however, in this section we only derive a closed-form suboptimal solution in the sense of Frobenius; the asymptotically optimal solution can also be obtained in closed form, but is beyond the scope of this paper and is described in [25].

4.2. Closed-form solution

Let Y_1^τ ≐ [y(1), ..., y(τ)] in R^{m×τ}, with X_1^τ and W_1^τ defined similarly, and notice that

  Y_1^τ = C X_1^τ + W_1^τ;   C in R^{m×n},   C^T C = I_n   (8)

Now let Y_1^τ = U Σ V^T be the singular value decomposition (SVD) [11], with U in R^{m×n}, V in R^{τ×n}, Σ = diag(σ_1, ..., σ_n), and consider the problem of finding the best estimate of C in the sense of Frobenius: Ĉ(τ), X̂(τ) = arg min ||W_1^τ||_F subject to (8). It follows immediately from the fixed-rank approximation property of the SVD [11] that the unique solution is given by

  Ĉ(τ) = U,   X̂(τ) = Σ V^T   (9)

by our assumptions (5) and (6). Â can be determined uniquely, again in the sense of Frobenius, by solving the following linear problem: Â(τ) = arg min_A ||X_2^τ − A X_1^{τ−1}||_F, which is trivially done in closed form using the estimate of X from (9):

  Â(τ) = Σ V^T D_1 V (V^T D_2 V)^{-1} Σ^{-1}   (10)

where D_1 = [0 0; I_{τ−1} 0] and D_2 = [I_{τ−1} 0; 0 0]. Notice that Â, Ĉ, X̂ are uniquely determined up to a change of sign of the components of U and V. Also note that

  E[x(t) x(t)^T] ≈ (1/τ) X̂ X̂^T = (1/τ) Σ V^T V Σ = (1/τ) Σ²   (11)

which is diagonal. Thus the resulting model is balanced. Finally, the sample input noise covariance Q can be estimated from

  Q̂(τ) = (1/τ) Σ_{t=1}^{τ−1} v̂(t) v̂(t)^T   (12)

where v̂(t) ≐ x̂(t+1) − Â(τ) x̂(t). Should Q̂ not be full rank, its dimensionality can be further reduced by computing the SVD Q̂ = U_v Σ_v U_v^T, where Σ_v = diag(σ̄_1, ..., σ̄_{n_v}) with n_v < n, and letting B̂ be such that B̂ B̂^T = Q̂.

In the algorithm above we have assumed that the order n of the model was given. In practice, this needs to be inferred from the data. Following [1], we propose to determine the model order empirically from the singular values σ_1, ..., σ_n, by choosing n as the cutoff where the value of σ drops below a threshold. A threshold can also be imposed on the difference between adjacent singular values.

Figure 1. Prediction error. To verify the quality of the model learned, we have used a fixed number of principal components in the representation (2) and considered sub-sequences of the original data set of varying length. We have used such sub-sequences to learn the parameters of the model in the maximum-likelihood sense, and then used the model to predict the next image. Using one criterion for learning (ML) and another one for validation (prediction error) is informative, for it challenges the model. The average prediction error per pixel is shown as a function of the length of the training sequence (for the smoke sequence), expressed in gray scale within a range of 256 levels. The average error per pixel decreases and becomes stable after some critical length. Mean and standard deviation over repeated trials are shown as an errorbar plot.

4.3. Asymptotic properties

The solution given above is, strictly speaking, incorrect, because the first SVD does not take into account the fact that X has a very particular structure (i.e. it is the state of a linear dynamical model). It is possible, however, to adapt the algorithm to take this into account while still achieving a closed-form solution that can be proven to be asymptotically efficient, i.e. to approach the maximum likelihood solution. Such an optimal algorithm is beyond the scope of this paper and has been described in [25].

5. Experiments

We have developed a MATLAB implementation of the algorithm described in Section 4: learning a typical sequence takes about 5 minutes on a 1 GHz Pentium III PC, and synthesis can be performed at frame rate. In our implementation we have used τ between 50 and 150, with correspondingly small values of the model order n and input rank n_v (chosen as described in Section 4). Figures 3 and 4 show the behavior of the algorithm on a representative set of experiments. In each case of Figure 4, on the first row we show a few images from the original dataset; on the second row we show a few extrapolated samples.

5.1. Synthesis

Figure 3 shows the overall compression error as a function of the dimension of the state space (top row) as well as the prediction error as a function of the length of the learning set (bottom row). For very regular sequences, the prediction error decreases monotonically; however, for highly complex scenes (e.g. a talking face, smoke), it is not monotonic. As explained in Section 4, we choose the model order n and learn the parameters of the model. We do so for different lengths τ of the same training sequence; in order to cross-verify our models, we test the prediction error. For each length τ, we predict the frame τ+1 (not part of the training set) and compute the prediction error per pixel in gray levels. The results are shown in Figure 1. The average error per pixel decreases and becomes stable after some critical length: the predicting power of the model grows with the length of the training sequence, and so does its ability to capture the spatio-temporal dynamics. Furthermore, beyond this "critical" length there is no improvement in the predicting power: the spatio-temporal dynamics has been captured and the use of a longer sequence does not provide additional information.

5.2. Recognition

We use the Kullback-Leibler divergence I(p_1||p_2) to compute the discrepancy between different dynamic textures (represented by the probability density functions p_1 and p_2). In Figure 2 we display this quantity plotted against the sequence length T. We have taken different realizations of the textures river and smoke and have computed the distance of the former realizations against themselves and against the latter. It is evident that similar textures tend to cluster together.

Figure 2. Textures belonging to the same class tend to cluster together in the sense of Kullback-Leibler. Distances are computed amongst three realizations of the river sequence and three of the smoke sequence, w.r.t. the former. The cluster of graphs on top refers to "smoke w.r.t. river" distances, and the ones below refer to "river w.r.t. river" distances. The K-L divergences are computed using Monte Carlo methods.

5.3. Compression

In this section we present a preliminary comparison between the storage requirements for the estimated parameters and the original space requirements of the texture sequences, to get an estimate of the sequence compression capabilities of our model. Just as anecdotal evidence of the potential of this method for video compression, we point out that the storage requirement of the original dataset is O(mτ), while that of the model is O(n² + mn + n n_v + n_v τ), where n ≪ m, n_v ≤ n, and n_v is the effective rank of Q̂. For the sake of example, let m = 100 × 100 (the size of each image) and τ = 100 (the length of the sequence). In order to store the original sequence one would need 10⁶ numbers. The components of the model that are necessary to re-create an approximation of the sequence are Â, Ĉ, B̂ and the input sequence v̂(t). Typical values that result in "acceptable" lossy compression in our sequences are n = 20 and n_v = 10. Therefore, one would need n² numbers (for Â), mn − n(n+1)/2 (for Ĉ, counting the orthogonality constraints), n × n_v numbers for B̂ and, finally, n_v × τ numbers for the input sequence v̂(t). For the particular choices above, in order to store a model of the original sequence one would need about 2 × 10⁵ numbers. Of course, a more systematic evaluation of the potential of this model for compression is due. For very long sequences (large τ), the algorithm presented above can be modified in order to avoid computing the SVD of a very large matrix.
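The closed-form procedure of Section 4.2 and the synthesis step of Section 5 fit in a few lines of Python. The sketch below identifies (Ĉ, Â, Q̂) from a synthetic sequence and then extrapolates new frames of arbitrary length; data, dimensions and noise levels are invented for illustration, and the shift-matrix product in (10) is replaced by an equivalent least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n, tau = 400, 8, 120

# Synthetic "texture": frames generated by a stable model of type (4).
A_true = 0.95 * np.linalg.qr(rng.standard_normal((n, n)))[0]
C_true, _ = np.linalg.qr(rng.standard_normal((m, n)))
x = rng.standard_normal(n)
Y = np.empty((m, tau))
for t in range(tau):
    x = A_true @ x + 0.1 * rng.standard_normal(n)
    Y[:, t] = C_true @ x + 0.01 * rng.standard_normal(m)

# (9): C_hat = U, X_hat = Sigma V^T from the rank-n SVD of Y.
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
C_hat = U[:, :n]
X_hat = np.diag(S[:n]) @ Vt[:n, :]

# (10), solved as least squares: A_hat = argmin_A ||X_2^tau - A X_1^{tau-1}||_F.
A_hat = X_hat[:, 1:] @ np.linalg.pinv(X_hat[:, :-1])

# (12): input covariance from one-step residuals, factored as Q = B B^T.
V_res = X_hat[:, 1:] - A_hat @ X_hat[:, :-1]
Q_hat = V_res @ V_res.T / tau
Uq, Sq, _ = np.linalg.svd(Q_hat)
B_hat = Uq @ np.diag(np.sqrt(Sq))

# Synthesis: drive the learned model with white noise for as long as desired.
x = X_hat[:, -1]
frames = []
for t in range(300):                      # any length, negligible cost per frame
    x = A_hat @ x + B_hat @ rng.standard_normal(n)
    frames.append(C_hat @ x)
```

For the storage comparison of Section 5.3, with m = 100 × 100, τ = 100, n = 20 and n_v = 10, the count n² + (mn − n(n+1)/2) + n·n_v + n_v·τ comes to 201,390 numbers, versus 10⁶ for the raw frames, roughly a five-fold saving.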
In particular, the model can be identified from a shorter subsequence, and the identified model can then be used to compute the input (in innovation form) using a simple linear Kalman filter. For details on how to do this see [25].

6. Discussion

We have introduced a novel representation of dynamic textures and associated algorithms to perform learning and synthesis of sequences from training data. We have demonstrated experimentally that even the simplest choice of model (a linear stochastic system driven by Gaussian white noise) can capture complex visual phenomena. The algorithm is simple to implement, efficient to learn and fast to simulate. Some of these results may be useful for image compression and for image-based rendering and synthesis of image sequences.

Figure 3. Compression error as a function of the dimension of the state space n (top row), and extrapolation error as a function of the length of the training set τ (bottom row). Column (a): river sequence; column (b): face sequence; column (c): toilet sequence.

References

[1] K. Arun and S. Kung. Balanced approximation of stochastic systems. SIAM Journal on Matrix Analysis and Applications, 11, pages 42-68, 1990.
[2] Z. Bar-Joseph, R. El-Yaniv, D. Lischinski, and M. Werman. Texture mixing and texture movie synthesis using statistical learning. To appear in IEEE Transactions on Visualization and Computer Graphics.
[3] J. Bigun and J. M. du Buf.
N-folded symmetries by complex moments in Gabor space and their application to unsupervised texture segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), pages 80-87, January 1994.
[4] G. Cross and A. Jain. Markov random field texture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, pages 25-40, 1983.
[5] J. de Bonet and P. Viola. Multiresolution sampling procedure for analysis and synthesis of texture images. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, 1998.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:185-197, 1977.
[7] D. Ebert, W. Carlson, and R. Parent. Solid spaces and inverse particle systems for controlling the animation of gases and fluids. The Visual Computer, 10(4), pages 179-190, September 1994.
[8] A. Efros and T. Leung. Texture synthesis by non-parametric sampling. In Seventh International Conference on Computer Vision, Corfu, Greece, 1999.
[9] A. Fournier and W. Reeves. A simple model of ocean waves. In ACM SIGGRAPH Proceedings, pages 75-84, 1986.
[10] W. Freeman and E. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9), pages 891-906, 1991.
[11] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 1996.
[12] M. Hassner and J. Sklansky. The use of Markov random fields as models of texture. In Image Modeling. Academic Press, 1981.
[13] D. Heeger and J. Bergen. Pyramid-based texture analysis/synthesis. In ACM SIGGRAPH Conference Proceedings, August 1995.
[14] A. K. Jain and F. Farrokhnia. Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 24:1167-1186, 1991.
[15] T. Kailath. Linear Systems. Prentice Hall, Englewood Cliffs, NJ, 1980.
[16] J. Liu, R. Chen, and T. Logvinenko.
A theoretical framework for sequential importance sampling and resampling. Technical report, Stanford University, Department of Statistics, January 2000.
[17] L. Ljung. System Identification: Theory for the User. Prentice Hall, Englewood Cliffs, NJ, 1987.
[18] S. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:674-693, 1989.
[19] R. Paget and D. Longstaff. A nonparametric multiscale Markov random field model for synthesising natural textures. In Fourth International Symposium on Signal Processing and its Applications, volume 2, pages 744-747, August 1996.
[20] D. Peachey. Modeling waves and surf. In SIGGRAPH Conference Proceedings, pages 65-74, 1986.
[21] K. Popat and R. Picard. Novel cluster-based probability model for texture synthesis, classification, and compression. In Proceedings SPIE Visual Communications and Image Processing, Boston, 1993.

Figure 4. (a) river sequence, (b) smoke sequence, (c) toilet sequence, (d) face sequence. For each of them, the top row shows samples of the original sequence (the toilet sequence has been borrowed from [27]); the bottom row shows samples of the extrapolated sequence. All the data are available on-line at http://www.vision.ucla.edu/projects/dynamic-textures.html

[22] J. Portilla and E. Simoncelli. Texture representation and synthesis using correlation of complex wavelet coefficient magnitudes. CSIC, Madrid, April 1999.
[23] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.
[24] A. Schödl, R. Szeliski, D. Salesin, and I. Essa. Video textures. In Proceedings of ACM SIGGRAPH Conference, New Orleans, LA, January 2000.
[25] S. Soatto and A. Chiuso. Dynamic data factorization. Technical Report CSD-2001/01, UCLA Computer Science Department, March 2001.
[26] J. Stam and E. Fiume. Depicting fire and other gaseous phenomena using diffusion processes.
In SIGGRAPH Conference Proceedings, pages 129-136, 1995.
[27] M. Szummer and R. W. Picard. Temporal texture modeling. In IEEE International Conference on Image Processing, Lausanne, Switzerland, volume 3, September 1996.
[28] L. Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In SIGGRAPH Conference Proceedings, 2000.
[29] S. Zhu, Y. Wu, and D. Mumford. Filters, random fields and maximum entropy (FRAME): towards a unified theory for texture modeling. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, June 1996.