RJournal 2011-2 Lin Shang

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

54

C ONTRIBUTED R ESEARCH A RTICLES

rainbow: An R Package for Visualizing Functional Time Series


Han Lin Shang Abstract Recent advances in computer technology have tremendously increased the use of functional data, whose graphical representation can be innite-dimensional curves, images or shapes. This article describes four methods for visualizing functional time series using an R add-on package. These methods are demonstrated using age-specic Australian fertility data from 1921 to 2006 and monthly sea surface temperatures from January 1950 to December 2006. of the data. The article proceeds as follows. Visualization methods of functional time series are rst reviewed. Illustrated by two data sets, the visualization methods are then demonstrated using the rainbow package. Conclusions are given in the end.

Data sets
The visualization methods are demonstrated using age-specic Australian fertility rates and monthly sea surface temperatures, both included in rainbow. The details of these two data sets are described below. Figure 1 shows annual age-specic Australian fertility rates between ages 15 and 49 observed from 1921 to 2006. These data were obtained from the Australian Bureau of Statistics (Cat No, 3105.0.65.001, Table 38). The fertility rates are dened as the number of live births at 30th June each year, per 1000 of the female resident population of the same age.
1921 1963 2006 250 Fertility rate 0 15 50 100 150 200

Introduction
Recent advances in computer technology have enabled researchers to collect and store highdimensional data. When the high-dimensional data are repeatedly measured over a period of time, a time series of functions can be observed. Although one can display high-dimensional time series by adapting multivariate techniques, it is important to take smoothness of functions into account (Ramsay and Dalzell, 1991). It is the smooth property of functions that separates functional time series from multivariate time series. Unlike longitudinal time series, functional time series mitigates the problem of missing data by an interpolation or smoothing technique, and so functional time series can be thought of as continuous. Visualization aids the discovery of characteristics in data that might not have been apparent in mathematical models and summary statistics. Yet visualization has not yet received much attention in the literature of functional data analysis. Notable exceptions are the phase-plane plot of Ramsay and Ramsey (2002), which highlights important distributional characteristics using the rst and second derivatives of functional data, and the singular value decomposition (SVD) plot of Zhang et al. (2007), which displays the changes in latent components in relation to the increases of the sample size or dimensionality. Another exception is the rainbow plot of Hyndman and Shang (2010), which can simultaneously display of functional data and identify possible outliers. The aim of this article is to collect the R code that implements these graphical techniques. The R code of phase-plane plot is included in the fda package (Ramsay et al., 2011), while others are included in the rainbow package (Shang and Hyndman, 2011). In addition, this article also presents the use of animation, which can easily be used with all three graphical techniques in order to visualize time-varying features The R Journal Vol. 3/2, December 2011

20

25

30 Age

35

40

45

50

Figure 1: Smoothed Australian fertility rates between ages 15 and 49 observed from 1921 to 2006. Although the four graphical techniques work equally well for plotting un-smoothed multivariate data, functional data ought to be smooth in nature. Therefore, the fertility rates were smoothed using a weighted median smoothing B-spline, constrained to be concave (see He and Ng, 1999; Hyndman and Ullah, 2007, for details). Figure 2 shows monthly sea surface temperatures (in C) from January 1950 to December 2006. These data were obtained from National Oceanic and Atmospheric Administration (http://www.cpc.noaa.gov/ data/indices/sstoi.indices). These sea surface ISSN 2073-4859

C ONTRIBUTED R ESEARCH A RTICLES

55

temperatures were measured by moored buoys in the Nio region", dened as the area within the coordinate 0 10 South and 90 80 West.
Smoothed sea surface temperature 1950 1978 2006 28

24

ity rates occurred around 1961, followed by a rapid decrease during the 1980s, due to the increasing use of contraceptive pills. Then there is an increase in fertility rates at higher ages in the most recent years, which may be caused by a tendency to postpone childbearing while pursuing careers. The rainbow plot is useful to reveal pattern changes for functional time series with a trend. It was produced by the following code:
# load the package used throughout this article library("rainbow") # plot.type = "function", curves are plotted by time # the most recent curve is shown in purple # the distant past cure is shown in red plot(Australiasmoothfertility, plot.type = "functions", plotlegend = TRUE) plot(ElNinosmooth, plot.type = "functions", plotlegend = TRUE)

20

22

26

6 Month

10

12

Figure 2: Smoothed monthly sea surface temperatures (in C) from January 1950 to December 2006. The sea surface temperatures were smoothed using a smoothing spline with the smoothing parameter determined by generalized cross validation. Each curve represents smoothed sea surface temperatures in each year.

For functional time series without a trend (e.g., Figure 2), the rainbow plot can still be used by constructing other order indexes, such as halfspace location depth and highest density regions. The colors of curves are then chosen in a rainbow color palette according to the ordering of depth or density.
1960 1985 1997 28 20 22 24 26

Rainbow plot
The rainbow plot is a graphical display of all the functional data, with the only additional feature being a rainbow color palette based on an ordering of the data. By default, the rainbow plot displays functional data that are naturally ordered by time. Functional data can also be ordered by halfspace location depth (Tukey, 1975) and highest density regions (Hyndman, 1996). The depth and density orderings lead to the developments of functional bagplot and functional HDR boxplot, described in the next subsections. As the referees pointed out, the rainbow plot (with the default rainbow color palette) may not be suitable for readers with color blindness. To mitigate this problem, the plot.fds function allows users to specify their preferred color, such as the heat or terrain palettes. In addition to the computer-screen based RGB colors, the plot.fds function allows users to utilize the perceptually-based Hue-Chroma-Luminance (HCL) colors included in the colorspace package (Ihaka et al., 2011). The use of HCL colors is superior to RGB colors for readability and color separation, and it is thus preferred (Zeileis et al., 2009). Figure 1 presents the rainbow plot of the smoothed fertility rates in Australia between ages 15 and 49 observed from 1921 to 2006. The fertility rates from the distant past years are shown in red, while the most recent years are shown in violet. The peak of fertilThe R Journal Vol. 3/2, December 2011

Smoothed sea surface temperature

Functional time series visualization methods

6 Month

10

12

Figure 3: Rainbow plot with depth ordering. The median curve is shown in black.

Smoothed sea surface temperature

1960 1988 1997

20

22

24

26

28

6 Month

10

12

Figure 4: Rainbow plot with density ordering. The mode curve is shown in black. Figures 3 and 4 present the rainbow plots of sea ISSN 2073-4859

56

C ONTRIBUTED R ESEARCH A RTICLES

Smoothed sea surface temperature

surface temperatures ordered by halfspace location depth and highest density regions. The colors reect the ordering and follow the order of the rainbow. The curves closest to the center of the data set are shown in red, whereas the most outlying curves are shown in violet. The curves are plotted in the order of depth and density, so the red curves are mostly obscured, but the violet curves are clearly seen even if they overlap with the majority of the data. These rainbow plots were produced using the following code.
# plot.type="depth", curves are plotted by depth # depth is distance between median and each curve # median curve (black line) is the center plot(ElNinosmooth, plot.type = "depth", plotlegend = TRUE) # plot.type="density", curves are plotted by density # mode (black line) has the highest density plot(ElNinosmooth, plot.type = "density", plotlegend = TRUE)

region bounded by all curves corresponding to the points within the bivariate fence region. The colors of bivariate outliers are matched to the same colors of functional outliers. Figures 5 and 6 display the bivariate and functional bagplots of the sea surface temperature data.
1982 1983 1997 1998

20

22

24

26

28

6 Month

10

12

Functional bagplot
Adopting from the idea of projection pursuit (Cook et al., 1995), Hyndman and Shang (2010) use a robust functional principal component analysis to decompose functional data into the rst two functional principal components and their principal component scores. As surrogates for functional data, the bivariate principal component scores can be ordered by Tukeys halfspace location depth and plotted in a familiar twodimensional graph.
20

Figure 6: The functional bagplot. The detected outliers in the sea surface temperature data are the years 1982-1983 and 1997-1998. The sea surface temperatures during 1982-1983 began in June 1982 with a moderate increase, then there were abnormal increases between September 1982 and June 1983 (Timmermann et al., 1999). The sea surface temperatures during 1997-1998 were also unusual: they became extremely warm in the latter half of 1997, and stayed high for the early part of 1998. In Figure 5, the dark gray region shows the 50% bag, and the light gray region exhibits the customary 99% fence. These convex hulls correspond directly to the equivalent regions with similar colors and shading in the functional bagplot (in Figure 6). Points outside these regions are dened as outliers. The different colors for these outliers enable the functional outliers to be matched to the bivariate outliers. The red asterisk marks the Tukey median of the bivariate principal component scores, and the solid black curve shows the median curve. The dotted blue line in the functional bagplot gives 95% pointwise condence intervals for the median curve. These bagplots were produced using the following code.

10

15

1997q
q

PC score 2

1982q
q

q q q q
q qq

q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q

q q q

q q q

qq q q

q q q

1983q
q q

1998q
q

10

10

5 PC score 1

10

15

20

Figure 5: The bivariate bagplot. Following Jones and Rice (1992) and Sood et al. (2009), the functional bagplot is considered as a mapping of the bivariate bagplot (Rousseeuw et al., 1999) of the rst two robust principal component scores to the functional curves. The functional bagplot displays the median curve, and the inner and outer regions. The inner region is dened as the region bounded by all curves corresponding to the points in the bivariate bag. Hence, 50% of curves are in the inner region. The outer region is similarly dened as the The R Journal Vol. 3/2, December 2011

# plot.type = "bivariate", the bivariate principal # component scores are displayed # type = "bag" requests the bagplot fboxplot(ElNinosmooth, plot.type = "bivariate", type = "bag", ylim = c(-10, 20), xlim = c(-10, 20)) # plot.type = "functional", the bivariate pc scores # are matched to corresponding curves fboxplot(ElNinosmooth, plot.type = "functional", type = "bag")

ISSN 2073-4859

C ONTRIBUTED R ESEARCH A RTICLES

57

The bivariate principal component scores can also be ordered by the highest density regions. The highest density regions are the quantiles of a two-dimensional Parzen-Rosenblatt kernel density estimate, where the bandwidths are chosen by a plug-in method (Hyndman, 1996). In comparison to a depth-measure approach, the density-measure approach is able to display multimodality if it is present in the data. The functional HDR boxplot is a mapping of the bivariate HDR boxplot (Hyndman, 1996) of the rst two robust principal component scores to the functional curves. The functional HDR boxplot displays the modal curve (i.e., the curve with the highest density), and the inner and outer regions. The inner region is dened as the region bounded by all the curves corresponding to the points inside the 50% bivariate HDR. Thus, 50% of curves are in the inner region. The outer region is similarly dened as the region bounded by all the curves corresponding to the points within the outer bivariate HDR. The colors of bivariate outliers are matched to the same colors of functional outliers. Figures 7 and 8 display the bivariate and functional HDR boxplots of the sea surface temperature data set. As with any outlier detection methods, the coverage probability of the outer region needs to be pre-specied. If we set the coverage probability of the outer region to be 93%, then the outliers detected would match the results obtained by the bagplot. This indicates that these outliers are not only far from the median, but also have the lowest density. In Figure 7, the dark and light gray regions show the 50% HDR and the 93% outer HDR, respectively. These correspond directly to the equivalent regions with similar colors and shading in the functional HDR boxplot (in Figure 8). Points outside these outer regions are identied as the outliers. The use of different colors for these outliers enables the functional outliers to match with the bivariate outliers. The red dot in the bivariate HDR boxplot marks the mode of bivariate principal component scores, and it corresponds to the solid black curve in the functional HDR boxplot.
20

Smoothed sea surface temperature

20

22

24

26

28

Functional highest density region (HDR) boxplot

1982 1983

1997 1998

6 Month

10

12

Figure 8: The functional HDR boxplot. These HDR boxplots were produced using the following code.
# type = "hdr" requests the HDR boxplot # alpha requests the coverage probability of inner # and outer HDR regions, customarily c(0.05,0.5) fboxplot(ElNinosmooth, plot.type = "bivariate", type = "hdr", alpha = c(0.07,0.5), ylim = c(-10,20), xlim = c(-10,20)) fboxplot(ElNinosmooth, plot.type = "functional", type = "hdr", alpha = c(0.07,0.5))

Singular value decomposition (SVD) plot


Zhang et al. (2007) proposed an interactive plot for visualizing patterns of functional data and multivariate data. They use the idea of projection pursuit to nd only those low-dimensional projections that expose interesting features of the high-dimensional point cloud. As a popular projection pursuit technique, singular value decomposition (SVD) decomposes high-dimensional smoothed multivariate data into singular columns, singular rows, and singular values ordered by the amount of explained variance. Zhang et al. (2007) discretize a set of functional data on a dense grid, denoted as f ( xi ) = [ f 1 ( xi ), , f n ( xi )] , for i = 1, , p, where p is the number of covariates, and n is the number of curves. Let {ri ; i = 1, , p} and {c j ; j = 1, , n} be the row and column vectors of the (n p) matrix f ( xi ), respectively. The SVD of f ( xi ) is dened as
T T T f ( xi ) = s1 u1 v1 + s2 u2 v2 + + sK uK vK ,

10

15

1997q
q

PC score 2

1982q
q

q q q q q q q q q q q q q q q q q q q q q q

q q q

q q q

qq q q

q q q q qq q q qq q q q q q

q q q q

1983q
q

1998q
q

10

5 PC score 1

10

15

20

Figure 7: The bivariate HDR boxplot. The R Journal Vol. 3/2, December 2011

where the singular columns u1 , , uK form K orthonormal basis functions for the column space spanned by {c j }; the singular rows v1 , , vK form K orthonormal basis functions for the row space spanned by {ri }; and T symbolizes vector transpose. The vectors {uk } and {vk } are called singular column and singular row, respectively. The scalars s1 , , sK T are called singular values. The matrix {sk uk vk ; k = 1, , K } is referred to as the SVD component. ISSN 2073-4859

10

58

C ONTRIBUTED R ESEARCH A RTICLES

Original data

SVD1
2

SVD2

28

Smoothed sea surface temperature

28

26

26

24

24

22

22

20

6 Month

10

12

20

6 Month

10

12

6 Month

10

12

SVD3
1.5 30

Reconstruction
1.5

Residual

1.0

28

0.5

26

0.0

0.5

24

1.0

22

1.5

20

6 Month

10

12

6 Month

10

12

1.5

1.0

0.5

0.0

0.5

1.0

6 Month

10

12

Figure 9: The animation of SVD plot is started by clicking on any graph when using Adobe Reader. The interactive plot of Zhang et al. (2007) captures the changes in the singular columns, as the number of curves gradually increases. Similarly, it also captures the changes in the singular rows, as the number of covariates gradually increases. The interactive plot simultaneously presents the column and row information of a two-way matrix, to relate the matrix to the corresponding curves, to show local variation, and to highlight interactions between columns and rows of a two-way matrix. By using the animate package (Grahn, 2010), a set of dynamic movies can be created (see Figure 9 for an example). The advantage of creating these movies is the ability to demonstrate the time-varying features of the singular components, and to highlight outliers if present in the data. The animation is started by clicking on any graph when using Adobe reader. Figure 9 shows the SVD plot of the sea surface temperature data set. The rst SVD component captures the seasonal pattern, while the second and third SVD components show the contrasts of sea surface temperatures among different months. Note that the SVD2 and SVD3 are on a much smaller scale in comparison to the SVD1, because the SVD1 accounts for the most of the curves variation. The functional time series can be approximated by the summation of the rst three SVD components. From the residual plot, outliers can be identied if they differ signicantly from zero. The animated SVD plot was produced in a A L TEXeditor, such as WinEdt. For creating one aniThe R Journal Vol. 3/2, December 2011 mation in Figure 9, a series of 57 gures (as there are 57 years in the sea surface temperature data set) are linked together as follows
\% load a package in WinEdt \usepackage{animate} \% link a series of 57 figures for producing \% one animation \begin{figure} \animategraphics{figure_}{1}{57} \end{figure}

In R, the non-animated SVD plot can be produced using the following code.
# order represents the number of SVD components # as the number of SVD components increases # the residuals should be centered around zero # plot can be suppressed by setting plot = FALSE SVDplot(ElNinosmooth, order = 3, plot = TRUE)

Conclusions
This article described four graphical methods included in the rainbow package, for visualizing functional time series. These methods can be categorized into graphical techniques using projection pursuit. Each of these methods has its unique advantages for revealing the characteristics of functional time series. Some of the methods enable us to identify and analyze abnormal observations, while others can be very useful in visualizing trend. Overall, these graphical ISSN 2073-4859

C ONTRIBUTED R ESEARCH A RTICLES

59

methods present a summary of functional time series, and should be considered as the rst step of functional time series analysis.

Acknowledgements
Thanks to the editors and reviewers for constructive comments and suggestions that have signicantly improved this article. Thanks also to Professor Rob Hyndman for helping with R code.

J. O. Ramsay and J. B. Ramsey. Functional data analysis of the dynamics of the monthly index of nondurable goods production. Journal of Econometrics, 107(1-2):327344, 2002. URL http://ideas.repec.org/a/eee/econom/ v107y2002i1-2p327-344.html. J. O. Ramsay, H. Wickham, S. Graves, and G. Hooker. fda: Functional Data Analysis, 2011. URL http: //CRAN.R-project.org/package=fda. R package version 2.2.6. P. J. Rousseeuw, I. Ruts, and J. W. Tukey. The bagplot: a bivariate boxplot. The American Statistician, 53 (4):382387, 1999. URL http://www.questia.com/ googleScholar.qst?docId=5001888966. H. L. Shang and R. J. Hyndman. rainbow: Rainbow plots, bagplots and boxplots for functional data, 2011. URL http://CRAN.R-project.org/package= rainbow. R package version 2.6. A. Sood, G. M. James, and G. J. Tellis. Functional regression: A new model for predicting market penetration of new products. Marketing Science, 28(1):36 51, 2009. URL http://mktsci.journal.informs. org/cgi/content/abstract/mksc.1080.0382v1. A. Timmermann, J. Oberhuber, A. Bacher, M. Esch, M. Latif, and E. Roeckner. Increased El Nio Frequency in a Climate Model Forced by Future Greenhouse Warming. Nature, 398(6729):694 697, 1999. URL http://www.nature.com/nature/ journal/v398/n6729/abs/398694a0.html. J. W. Tukey. Mathematics and the picturing of data. In R. D. James, editor, Proceedings of the International Congress of Mathematicians, volume 2, pages 523 531. Canadian mathematical congress, Vancouver, 1975. A. Zeileis, K. Hornik, and P. Murrell. Escaping rgbland: selecting colors for statistical graphics. Computational Statistics and Data Analysis, 53(9): 32593270, 2009. URL http://www.sciencedirect. com/science/article/B6V8V-4VM43VH-1/2/ b63f4a1a558083ab2b98f12e446d1ac7. L. Zhang, J. S. Marron, H. Shen, and Z. Zhu. Singular value decomposition and its visualization. Journal of Computational and Graphical Statistics, 16(4):833854, 2007. URL http://pubs.amstat.org/doi/abs/10. 1198/106186007X256080?journalCode=jcgs

Bibliography
D. Cook, A. Buja, J. Cabrera, and C. Hurley. Grand tour and projection pursuit. Journal of Computational and Graphical Statistics, 4(3):155172, 1995. URL http://www.jstor.org/stable/1390844. A. Grahn. The Animate Package, 2010. URL http://www.tug.org/texlive/Contents/live/ texmf-dist/doc/latex/animate/animate.pdf. X. He and P. Ng. COBS: qualitatively constrained smoothing via linear programming. Computational Statistics, 14(3):315337, 1999. URL http://papers. ssrn.com/sol3/papers.cfm?abstract_id=185108. R. J. Hyndman. Computing and graphing highest density regions. The American Statistician, 50(2):120 126, 1996. URL http://www.jstor.org/stable/ 2684423. R. J. Hyndman and H. L. Shang. Rainbow plots, bagplots, and boxplots for functional data. Journal of Computational and Graphical Statistics, 19(1):2945, 2010. URL http://pubs.amstat.org/doi/pdf/10. 1198/jcgs.2009.08158. R. J. Hyndman and M. S. Ullah. Robust forecasting of mortality and fertility rates: A functional data approach. Computational Statistics & Data Analysis, 51(10):49424956, 2007. URL http://portal.acm. org/citation.cfm?id=1241107.1241182. R. Ihaka, P. Murrell, K. Hornik, and A. Zeileis. colorspace: Color Space Manipulation., 2011. URL http: //CRAN.R-project.org/package=colorspace. R package version 1.1-0. M. C. Jones and J. A. Rice. Displaying the important features of large collections of similar curves. The American Statistician, 46(2):140145, 1992. URL http://www.jstor.org/stable/2684184. J. O. Ramsay and C. J. Dalzell. Some tools for functional data analysis (with discussion). Journal of the Royal Statistical Society: Series B, 53(3):539572, 1991. URL http://www.jstor.org/stable/2345586.

Han Lin Shang Department of Econometrics and Business Statistics, Monash University, Cauleld East, Victoria 3145, Melbourne, Australia. HanLin.Shang@monash.edu.au

The R Journal Vol. 3/2, December 2011

ISSN 2073-4859

You might also like