
Synthesis of Hand Clapping Sounds

2007, IEEE Transactions on Audio, Speech & Language Processing

Leevi Peltola (a), Cumhur Erkut (a)*, Perry R. Cook (b), and Vesa Välimäki (a)

(a) Helsinki University of Technology, Laboratory of Acoustics and Audio Signal Processing, P.O. Box 3000, FI-02015 TKK, Espoo, Finland. Phone: +358-9-451-5349, Fax: +358-9-460-224
(b) Princeton University, Department of Computer Science and Department of Music, Princeton, New Jersey 08544-2087, USA
Email: lpeltola@cc.hut.fi, Cumhur.Erkut@tkk.fi, prc@cs.princeton.edu, Vesa.Valimaki@tkk.fi (*Corresponding author)

Abstract—We present two physics-based analysis, synthesis, and control systems for synthesizing hand clapping sounds. Both rely on the separation of sound synthesis and event generation, and both are capable of producing individual hand claps or mimicking the asynchronous or synchronized applause of a group of clappers. The synthesis models consist of resonator filters whose coefficients are derived from experimental measurements. The difference between the systems lies mainly in the statistical event generation: the first system allows efficient parametric synthesis of large audiences, as well as flocking and synchronization by simple rules, while the second provides parametric extensions for the synthesis of various clapping styles and enhanced control strategies. The synthesis and control models of both systems are implemented as software running in real time at the audio sample rate, and they are available for download at http://ccrma-www.stanford.edu/software/stk and http://www.acoustics.hut.fi/go/clapd.

Index Terms—Acoustic signal processing, acoustic resonator filters, control systems, emotions, signal synthesis

EDICS: AUD-HWSW, AUD-AUMM

This research was funded by the Academy of Finland (project numbers 104934 and 105651).

I. INTRODUCTION

Clapping of hands is a very popular audible activity in every culture [3]. Its most common function is to show approval and favor, although it may also be used as a rhythmic instrument. Despite this popularity, there are only a few studies on the analysis of hand claps. In his pilot study, Repp [3] analyzed the sound of hand clapping as an individual sound-generating activity. His study showed that the configuration of the hands and the clapping rate provide perceptually important cues. These results are in accordance with other perceptual studies, which show that the properties of objects, such as size, material, hardness, and shape, can be estimated surprisingly well from their sound (see [4] for a review). Néda and his colleagues investigated the tendency of a large number of people to synchronize their clapping into a rhythmic applause and found that the synchronization is achieved by period doubling of the clapping rhythm [5], [6]. They later introduced a revised model for the generation of synchronized applause events [7]. However, a parametric sound synthesis model that incorporates these findings remained to be developed.

In general, sound synthesis is used by musicians and the movie industry, as well as in virtual reality applications and computer games. Synthetic hand clapping can be used, for example, in sports games to model the applause of the audience, or to make the applause in live recordings more intense. It can also be used as feedback in virtual reality and computer games: a user can be rewarded with an enthusiastic applause, or negative feedback can be given with a bored applause.
Applause is also one of the basic General MIDI [8] sounds, and a handclap is one of the percussion sounds of MIDI channel no. 10. However, nearly all MIDI implementations of applause are based on one or a few PCM recordings, rather than a flexible and convincing parametric synthesis model (the exceptions are early analog synthesizers and drum machines, such as the Roland TR series). Some historical analog and sample-based drum machines have included handclaps as one of their many percussion "voices." Some of these devices, such as the Roland TR series (http://www.roland.com/about/en/development-history.html), attempted to provide "chorused" versions of single claps by multi-triggered envelopes controlling filtered noise. Our algorithms also use filtered noise to excite simple resonant filters to model the sound of single handclaps. However, we are more interested in modeling the behavior (statistics) of ensembles of individual clappers in audience settings, under different emotional and social conditions.

There are many studies on the physics-based synthesis of everyday sounds. Cook introduced the PhISEM (Physically Informed Stochastic Event Modeling) algorithm for the synthesis of complex multi-dimensional multiple-particle systems, such as maracas, ice cubes in a glass, or rain drops [9], [10]. He also developed an analysis/synthesis system for walking sounds [11]. Fontana proposed a synthesis and control model for crushing, walking, and running sounds [12], [13]. Lukkari and Välimäki developed a synthesis and control model for a wind chime [14]. Interactive multimedia implementations based on physics-based synthesis and control models of everyday sounds have also been reported [15], [16], [17]. Common to all these works is that the synthesis (sound generation) and control (event generation, usually stochastic) models are separated. The synthesis part is generally a simplified model of a single sound-emitting process [4]; for example, modal synthesis can be used for sound generation [18]. These grains of sound are then combined using a higher-level control model that is also based on the physics of the sound event in question. A similar strategy has recently been used in [19], where a granular synthesizer was controlled by a pulse-coupled network of spiking neurons, although the event and sound generation models were not physically matched.

The advantage of physics-based synthesis is that the models are easy to control, as the physical parameters can be tuned. Furthermore, because the control parameters are based on the physics of the model, they can be controlled interactively by sensors. A physics-based sound model can be realized at several degrees of fidelity: simplified synthesis models can be used in portable devices such as mobile phones or electronic games, whereas more detailed models are used if more fidelity is needed and more computational power is available. Another potential application area of physics-based sound and control models is structured audio coding [20], which was introduced as part of the MPEG-4 standard.

This paper presents two physics-based synthesis and control systems for synthesizing hand clapping sounds. The first one, ClapLab, is a general system for clapping analysis/synthesis implemented in STK [1]. It has been demonstrated in various conferences and workshops [21], [22], but a technical description of the system has not been published previously.
The second system, ClaPD, provides parametric extensions for the synthesis of various clapping styles and enhanced control strategies for one clapper, as well as for an ensemble of clappers [2]. ClaPD is implemented in the Pure Data (Pd) environment [23]. In order to synthesize accurate hand-clapping sounds as part of a more complex display representing a listening environment, artificial reverberation has been optionally included within both systems. (ClaPD uses freeverb~ [24], which implements the standard Schroeder-Moorer reverb model [25], [26] and consists of eight comb filters and four all-pass filters on each stereo channel; the filter coefficients on the left and right channels are slightly different to create a stereo effect. ClapLab uses the built-in Chowning-Schroeder reverberator JCRev, constructed of four comb filters and three all-pass filters per channel.) These two systems are presented consecutively in this paper. Finally, conclusions are drawn and further research directions are indicated.

II. CLAPLAB

The block diagram of Fig. 1 shows the general system architecture for clapping analysis/synthesis. This is a simplified form of the walking analysis/synthesis system described in [11]. To perform analysis, a monophonic (one-person) clapping sound file is input to the system, which first performs envelope extraction as described in (1) and (2):

e(n) = (1 - b(n)) |x(n)| + b(n) e(n-1),    (1)

where

b(n) = b_up,    if |x(n)| > e(n-1)
       b_down,  otherwise.                 (2)

In this "envelope follower," the input signal is first rectified (absolute value), and the rectified signal is then passed through a nonlinear one-pole filter. If the rectified input is greater than the current output of the filter, a rapid "attack" tracking coefficient b_up is used. If the rectified input is less than the current filter output, a slower "release" coefficient b_down is used. The filter gain coefficient is always set to 1 - b(n), to keep the total dc gain of the filter equal to 1.0. Typical values for a 22,050-Hz sample rate clapping/walking file are b_up = 0.8 and b_down = 0.995. The envelope signal e(n) is sampled at 100 Hz. Fig. 2 shows the results of processing signals through this envelope follower; the top sound file is of female clapping and the lower sound file is of male clapping.
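For concreteness, the envelope follower of (1) and (2) can be sketched in a few lines of Python. The function name, the NumPy implementation, and the decimation step to the 100-Hz envelope rate are our own choices; the coefficient values follow the text.

```python
import numpy as np

def envelope_follower(x, b_up=0.8, b_down=0.995, fs=22050, env_rate=100):
    """Nonlinear one-pole envelope follower of Eqs. (1)-(2)."""
    x = np.abs(x)                              # rectify the input
    e = np.empty(len(x))
    prev = 0.0
    for n in range(len(x)):
        b = b_up if x[n] > prev else b_down    # fast attack, slow release
        prev = (1.0 - b) * x[n] + b * prev     # one-pole filter with unity dc gain
        e[n] = prev
    return e[::fs // env_rate]                 # envelope signal at (roughly) 100 Hz
```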
After envelope extraction, the system estimates the average frequency of events (claps) using three algorithms: autocorrelation, the average magnitude difference function (AMDF), and center-clipped zero-crossing detection [27]. These three algorithms "vote" based on their confidence (peak-to-noise ratio for autocorrelation, null-to-energy ratio for the AMDF, etc.), and a decision is made as to the average frequency of events. For well-recorded single-person clapping sounds with minimal background noise, all methods usually agree within a very small rounding error. Using this estimate, the system then marks all event beginnings in the envelope file and also builds a table of the individual event boundaries in the original audio file. A threshold on positive-going spikes in the derivative of the envelope is used to define event beginnings. The mean and standard deviation of the event period are stored. The individual segmented audio files are then processed by low-order (2 or 4 poles) linear prediction [28] to determine the resonance of the claps.

Means and standard deviations of resonance frequencies and Q values are computed and stored (the analysis verified that clapping periods followed a normal distribution). Fig. 3 shows the spectrum of a single clap and the 2nd-order LPC filter fitted to that clap. The residual (error) signal after LPC is a short, exponentially decaying burst of noise.

For non-parametric resynthesis, the system can regenerate the original clapping by concatenating the individual PCM clapping event segments. It can also speed up or slow down the clapping by simply changing the spacing between clapping events (by overlap-add if faster clapping is desired). The system can also concatenate the claps in random order, resulting in a sound file that can be played longer without sounding repetitive.

For parametric synthesis, a simple exponentially decaying noise source is used to excite a resonant filter whose parameters are controlled by the average LPC resonance and Q parameters (and standard deviations) determined previously. (The noise envelope is the impulse response of a one-pole filter with its pole at radius 0.95 at a 22,050-Hz sample rate, corresponding to a 60-dB decay time of 6 ms.) The following resonant filter is used in ClapLab:

y(n) = A_0 x(n) + 2 R cos(θ) y(n-1) - R^2 y(n-2),    (3)

where A_0 is a gain factor that makes the magnitude response unity at the resonance frequency, R is the pole radius, and θ is the pole angle. The clapping rate is controlled by the average clapping frequency and standard deviation of the person being modeled. Of course, all parameters can be scaled or replaced by values obtained from sliders, sensors, or data from a simulation (such as a game or virtual reality system). Fig. 4 shows the waveform and spectrogram of parametrically generated clapping. Eight subjects (four males and four females) were enlisted to record clapping. Table I shows the average clapping period and its standard deviation, and the average and standard deviation of the resonant filter center frequency, for all eight subjects.

A. Efficient Parametric Synthesis of Large Audiences

Based on the parameters of Table I, a variety of simulated clapping "characters" can be easily generated. For large audiences, the system does not actually require one filter/noise source per "person." A fairly small number of virtual "clappers" can be constructed and their filter settings reset in round-robin fashion (the least recently used filter is reallocated for the next clap). This is the same technique employed in the Physically Informed Stochastic Event Model (PhISEM) [9]. The overall noisy nature of large-audience applause masks most artifacts that arise from re-using the least recently used filter. For the case of a very large audience with no synchronization (see the next section), a counter is not required for each clapper. Rather, a Poisson event probability can be used at each time step to determine whether a clap should occur. This is also the technique used in PhISEM, where it is shown that the Poisson event waiting time can be computed at a fairly arbitrary sample rate (once the probability is established for a given time interval). The results are more effective, especially in stereo or multi-channel reproduction where claps are placed at specific locations, if a separate data structure (counter, spatial location, means and standard deviations of frequency, amplitude, and filter parameters) is maintained for each clapper. Artificial stereo reverberation further enhances the crowd effect.
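To make the large-audience scheme concrete, here is a minimal offline sketch, assuming Poisson-triggered claps and the resonator of (3) excited by the decaying noise burst described above. The function names, the 1-ms control rate, and the spread of center frequencies (loosely based on Table I) are our own; a real-time implementation would instead reuse a small pool of filters in round-robin fashion rather than mixing every clap additively.

```python
import numpy as np
from scipy.signal import lfilter

fs = 22050
rng = np.random.default_rng(0)

def clap(fc, R=0.96, dur=0.05):
    """One synthetic clap: decaying noise burst into the two-pole resonator of Eq. (3)."""
    n = np.arange(int(dur * fs))
    burst = rng.standard_normal(len(n)) * 0.95 ** n      # ~6-ms noise envelope
    theta = 2.0 * np.pi * fc / fs
    a = [1.0, -2.0 * R * np.cos(theta), R * R]           # resonator denominator
    A0 = abs(1 - 2*R*np.cos(theta)*np.exp(-1j*theta) + R*R*np.exp(-2j*theta))
    return lfilter([A0], a, burst)                       # unity gain at resonance

def crowd(n_clappers=50, seconds=3.0, mean_period=0.27):
    """Unsynchronized applause: every millisecond, each clapper claps with
    probability dt / mean_period (a Poisson event model, as in PhISEM)."""
    out = np.zeros(int(seconds * fs) + int(0.05 * fs))
    fcs = np.clip(rng.normal(1700, 600, n_clappers), 400, 4000)  # per-clapper resonance
    step = int(0.001 * fs)                               # 1-ms control rate
    p = 0.001 / mean_period
    for start in range(0, int(seconds * fs), step):
        for i in np.flatnonzero(rng.random(n_clappers) < p):
            c = clap(fcs[i])
            out[start:start + len(c)] += c
    return out / np.max(np.abs(out))
```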
B. Flocking and Synchronization

One interesting aspect of applause, and of other "flocking-like" behaviors, is that the clappers sometimes fall in and out of synchrony. A simple and novel mechanism to model this effect, called the "affinity knob," was added to ClapLab. This slider ranges from 0.0 to 1.0 and controls the relative synchronization between the clappers in an audience. A very simple algorithm was employed to implement affinity: at each clap event, each individual clapper looks at the "master clapper" and drives its clapping timing toward the master's by setting its next clap event to HIS_NEXT_EVENT + affinity * (MASTER_NEXT_EVENT - HIS_NEXT_EVENT). The master clapper still exhibits slight random variations in timing. The "slave" clappers exhibit as much timing randomness as their servitude level allows, ranging from full randomness with affinity = 0 to no randomness with affinity = 1.0. All clappers retain their own randomizations of amplitude, frequency, and resonance. Thus, with affinity set to 1.0, all clappers clap at exactly the same times, but those times exhibit the random timing of the master clapper, and each individual slave clapper still has random timbral qualities. Fig. 5 shows clapping spectrograms with affinity set to 0.0, 0.25, 0.5, 0.75, and 1.0. The same flocking algorithm can be applied to other collections of sound producers, such as birds, crickets, or even musical instruments. ClapLab is implemented in STK [1] and runs in real time at audio sample rates.
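The affinity rule can be simulated offline as follows. This is a sketch with our own naming; the 270-ms mean period and the 8-ms timing jitter are arbitrary illustrative values, while the update formula is the one given above.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(affinity, n_clappers=10, n_claps=20, period=0.27, jitter=0.008):
    """Clap times of a master and n_clappers slaves under the affinity rule:
    next = own_next + affinity * (master_next - own_next)."""
    master, slaves, times = 0.0, np.zeros(n_clappers), []
    for _ in range(n_claps):
        master_next = master + period + rng.normal(0.0, jitter)   # master jitters freely
        own_next = slaves + period + rng.normal(0.0, jitter, n_clappers)
        slaves = own_next + affinity * (master_next - own_next)   # pull toward master
        master = master_next
        times.append(slaves.copy())
    return np.array(times)

# the per-clap spread across clappers shrinks as affinity goes from 0 toward 1
print([simulate(a).std(axis=1).mean().round(4) for a in (0.0, 0.5, 1.0)])
```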
III. SYNTHESIS OF VARIOUS MONOPHONIC CLAPPING STYLES

Acoustically, a hand clap corresponds to the rapid formation and excitation of a cavity between the two hands. Motivated by Repp's study [3], which relates the spectral features of various clapping styles to the configuration of the hands and indicates that this configuration provides perceptually important cues, we decided to simulate the cavity resonance separately by including a hand-configuration parameter in the synthesis model. For this purpose, we conducted measurements in which the hand configuration was a controlled variable, following Repp's suggestions for hand configurations ("modes") that correlate well with the measured spectral features in his study [3]. Assuming that the highest peak of the spectrum corresponds to the cavity resonance, we conducted a high-order LPC analysis to extract it and allocated the second-order resonator of ClapLab in (3) for parametric resynthesis of this cavity resonance. This strategy caused two problems: (i) the overall spectral characteristics of a single hand clap over the whole frequency band could not be accounted for, and (ii) exciting the cavity resonator with an exponentially decaying noise burst resulted in a synthetic signal with an unrealistic attack. Both problems stem from the relatively high Q-factor of the cavity resonator; the unrealistic attack does not occur with the technique described in Section II, which typically uses a lower-Q resonator. For a simple solution to both problems, we focused on the excitation signal: the first problem was solved by band-pass filtering the excitation noise, and the second by applying an envelope to it. The first problem could alternatively be solved by tuning another resonator, as explained in Section II, and running it in parallel with the cavity resonator. However, this strategy would probably over-parameterize the model and increase the computational load in the case of multiple clappers. This section provides a detailed summary of our measurements, analysis, and synthesis model.

A. Measurements

The measurements were made in an anechoic chamber. Two AKG 480B microphones were positioned at a distance of one meter from the subject's hands. The first microphone was positioned directly in front of the subject, whereas the second was at an angle of 60 degrees to the left of the clapper. (The second microphone was included for probing the directional characteristics of the hand claps; these characteristics were not used in the present analysis.) The microphone signals were sampled at 44,100 Hz by a Digigram VX Pocket V2 soundcard (http://www.digigram.com/products/VXpocket.html). The movement of the body when clapping caused some vibration of the floor, which was removed from the recordings using a fifth-order Chebyshev Type I high-pass filter with a cut-off frequency of 100 Hz. The mean center frequencies in Table I, as well as our preliminary investigation of the collected data, showed that the frequencies of interest for clapping are higher than this cut-off frequency and thus are not affected by the high-pass filtering.

The measurements were made with three subjects. A sequence of five claps in each clapping mode was recorded, and the positioning of the hands was photographed for each mode; these photos can be seen in Fig. 6. A short description of each clapping mode is as follows. In the P (parallel) modes, the hands are kept parallel and flat, so that in P1 each finger of the left hand is aligned with the corresponding finger of the right hand. The position of the right hand is varied from palm-to-palm (P1) to fingers-to-palm (P3), with P2 corresponding to their midpoint. In the A (angle) modes, the position of the right hand is varied in a similar fashion, from palm-to-palm (A1) to fingers-to-palm (A3), but the hands form a right angle. Finally, the curvature of the hands is varied in the A1 mode, resulting in a flat (A1-) or a very cupped (A1+) configuration compared to A1.

B. Analysis

In this application, a moderate prediction error can be tolerated in favor of a smooth spectral characteristic of the LPC encoding filter around the cavity resonance. The LPC encoder poles were inspected for several filter orders to estimate a reasonable trade-off between accuracy and spectral smoothness; analysis filters of orders around 70 gave good results for our purposes. Fig. 7 presents an example of the LP spectrum of a single hand clap in mode A2, the frequency response of the resonator fitted to the most significant peak (the cavity resonance), and, for comparison, the FFT of the signal with a small offset. Comparing our results to Repp's [3], we can observe the similarities in the spectra: modes where the palms strike each other (P2, A1, A1+) have a spectral peak below 1 kHz, whereas if the fingers of one hand strike the palm of the other (P3, A3, A1-), the spectral peak is closer to 2 kHz. Fig. 8 illustrates the mean value and the standard deviation of the center frequency f_c, the bandwidth B, and the gain g of the strongest resonance for each clapping mode. Here, the gain is a dimensionless scaling factor that tunes the A_0 coefficient of the resonator in (3) so that the LP and FFT spectra match at the peak level.
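The stored triplets (f_c, B, g) can be mapped onto the resonator of (3) with the standard two-pole design formulas. This mapping (pole radius from bandwidth, pole angle from center frequency) is a common textbook choice that we assume here for illustration, not a detail taken verbatim from the paper.

```python
import numpy as np

def resonator_coeffs(fc, B, g, fs=44100):
    """Two-pole resonator of Eq. (3) from a center frequency fc (Hz),
    bandwidth B (Hz), and dimensionless gain g."""
    R = np.exp(-np.pi * B / fs)          # pole radius from the bandwidth
    theta = 2.0 * np.pi * fc / fs        # pole angle from the center frequency
    # normalize to unity gain at resonance, then apply the measured gain g
    A0 = g * abs(1 - 2*R*np.cos(theta)*np.exp(-1j*theta) + R*R*np.exp(-2j*theta))
    return [A0], [1.0, -2.0 * R * np.cos(theta), R * R]   # (numerator, denominator)
```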
The statistics are estimated from each clap of each subject (i.e., 15 claps in total). As the number of events is so small, it is difficult to analyze the exact shape of the distribution; a Gaussian distribution was assumed to obtain indicative results.

Once the cavity resonances of the hand clapping recordings are extracted, they can be filtered out of the original signal. This spectral flattening technique, called inverse filtering [30], is widely used in speech synthesis and coding. The resonator filter is inverted, i.e., its numerator and denominator are interchanged, and the inverted filter is then applied to the original signal. The resulting excitation signals were collectively modeled as a band-pass filtered noise burst with an exponential attack and an (optional) decay envelope. The parameters of these blocks were extracted by examining the average spectral and temporal characteristics of the inverse-filtered excitation signals and were tuned by hand. The following parameters gave good results for the band-pass filter H_BP(z) and the envelope e(n):

H_BP(z) = (1 - z^{-2}) / (1 + 0.2 z^{-1} + 0.22 z^{-2}),    (4)

e(n) = 0.99^(140-n),  n ≤ 140
       0.99^(n-140),  140 < n ≤ 600 (optional decay).       (5)

The short (3.2-ms) exponentially rising attack segment roughly corresponds to the dynamic formation of the cavity. The decay is generally handled by the cavity resonator, although in some cases the cavity resonator provides a shorter decay time than our measurements indicate. Usually this difference is not perceived; however, we included an optional extension of the excitation decay time to compensate for it.

C. Parametric Resynthesis in ClaPD

A simplified block diagram of the synthesis process can be seen in Fig. 9. (The band-pass filter H_BP(z) and the mode selection are not shown in the diagram.) When the system is triggered, new cavity resonator coefficients are calculated based on the mean values and deviations obtained from the analysis (see Fig. 8). New coefficients are computed for each clap, so that there is some variation from clap to clap. A trigger also launches the envelope generator, which passes the enveloped noise signal to the resonator.
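Putting (4), (5), and the resonator together, a single ClaPD-style clap can be sketched as below. The per-trigger jitter of the resonator coefficients follows the description above; the specific mode statistics passed in, and the coefficient mapping reused from the earlier sketch, are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(2)
fs = 44100

def excitation(use_decay=True):
    """Band-pass filtered noise burst with the attack/decay envelope of Eq. (5)."""
    n = np.arange(600 if use_decay else 141)
    env = np.where(n <= 140, 0.99 ** (140 - n), 0.99 ** (n - 140))
    noise = rng.standard_normal(len(n))
    # H_BP(z) = (1 - z^-2) / (1 + 0.2 z^-1 + 0.22 z^-2), Eq. (4)
    return lfilter([1.0, 0.0, -1.0], [1.0, 0.2, 0.22], noise) * env

def clapd_clap(fc_mean=1100.0, fc_std=100.0, B_mean=300.0, B_std=50.0, g=1.0):
    """One clap: resonator coefficients re-drawn per trigger from the mode
    statistics (cf. Fig. 8), driven by the enveloped excitation."""
    fc = rng.normal(fc_mean, fc_std)
    B = max(rng.normal(B_mean, B_std), 1.0)
    theta = 2.0 * np.pi * fc / fs
    R = np.exp(-np.pi * B / fs)          # mapping as in the Section III.B sketch
    A0 = g * abs(1 - 2*R*np.cos(theta)*np.exp(-1j*theta) + R*R*np.exp(-2j*theta))
    return lfilter([A0], [1.0, -2.0 * R * np.cos(theta), R * R], excitation())
```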
IV. CONTROL MODEL FOR ONE CLAPPER

In Repp's study [3], the average onset-to-onset interval (OOI) among his 20 subjects was 250 ms, ranging between 196 ms and 366 ms. The standard deviation varied between 2.8 ms and 13.6 ms (1% and 5%), with an average of 6.8 ms (2.7%). There was a small difference in clapping rates between the genders: males clapped slightly slower (average OOI = 265 ms) than females (average OOI = 236 ms). The articles by Néda et al. [5], [6] also give some information on clapping rates. They measured the clapping rate of 73 subjects: first the subjects were asked to clap naturally, as they would after a good performance, and then to clap in the manner they would during a synchronized applause. The average OOI of natural clapping was roughly 250 ms, and in the synchronized mode about 500 ms.

Although these OOI values are in accordance with the ClapLab parameters presented in Table I, we wanted to verify the average OOI of 250 ms for tuning our control model, and to enhance it with the statistics of basic expressions, such as the level of enthusiasm. This section presents our experiments and the resulting statistical control model of one clapper.

A. Statistics of a Clapping Sequence

During the acoustically-oriented "mode" measurements outlined in Section III, some emotionally-oriented experiments for statistical control modeling were also made. First, the subjects were asked to clap their hands naturally, as they would after an average performance. They were then asked to produce a very enthusiastic and a very bored sequence of hand clapping. This procedure was repeated twice. After the different modes of clapping were recorded, the subjects were asked to repeat a sequence of normal, enthusiastic, and bored clapping, but this time in clapping mode A2. The results obtained from these measurements are presented in Table II. They can only give some indication of the clapping rates at different levels of enthusiasm, as the number of subjects was only three and only six sequences of clapping were recorded for each clapper. Thus, we also take into account the clapping-rate measurements of Repp [3] and Néda et al. [5], [6].

Basic statistics for the implementation can be derived from these combined results. The clapping rate was chosen to vary between 4.17 Hz for enthusiastic clapping (OOI = 240 ms) and 2.5 Hz for bored clapping (OOI = 400 ms). Even though the measurement data show that the OOI for bored clapping averages 600 ms, the result sounds more realistic if the clapping rate is somewhat faster; especially in the case of bored clapping, our subjects tended to exaggerate the slow clapping rate.

During the analysis, we observed systematic OOI fluctuations in the clapping sequences. For example, at the start of a sequence it takes some time for the subjects to find their comfortable clapping frequency, which explains why the variation of the OOI is usually larger at the start of a sequence. An example of a clapping sequence in which the variation of the OOI is larger at the start and at the end can be seen in Fig. 10; the height of each bar indicates the time interval between claps (OOI). In this sequence, the subject was asked to clap naturally. The clapping rhythm is also disturbed just before the end of the sequence. The variation of the clapping rate is not entirely random; it resembles the musical performance rules of accelerandi (speeding up) and rallentandi (slowing down) [31], [32]. Such rules can also be used in the control model of a clapping sequence to model the fluctuation of the clapping rate. Especially at the end of a clapping sequence, the tempo seems to slow down a little. This corresponds to the Final Ritard performance rule, which was used in the control model of walking and running sounds to model the decreasing tempo that precedes stopping [13], [31]. The phenomenon is clearly visible in long enthusiastic clapping sequences, where the subject has difficulty maintaining the fast clapping rate. This decay of the clapping rate, caused by exhaustion, can also be considered one reason for the transition to synchronized clapping. Néda et al. [5], [6] proposed that the audience needs to double its natural clapping period for the synchronization to be found.

In the single-clapper control model within ClaPD, the user can control the length of a clapping sequence as well as the level of enthusiasm. The clapping rate varies between OOI = 240 ms for enthusiastic clapping and OOI = 400 ms for bored clapping. The variation of the OOI is slightly exaggerated to 10% of the clapping period and is assumed to have a triangular distribution. It is doubled for the first two seconds of a clapping sequence to model the time it takes people to find their comfortable clapping frequency. The Final Ritard is also implemented, so that during the last third of the sequence the OOI is increased after each clap by 2% of its original value.
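A minimal sketch of this single-clapper event generator follows. The function name and the enthusiasm parameter, mapped linearly between the bored and enthusiastic OOIs, are our own; the jitter, settling-in, and Final Ritard rules follow the description above, with the ritard accumulated clap by clap as one plausible reading.

```python
import numpy as np

rng = np.random.default_rng(3)

def clap_onsets(length_s=10.0, enthusiasm=1.0):
    """Onset times for one clapper.  enthusiasm in [0, 1] maps from bored
    (OOI = 400 ms) to enthusiastic (OOI = 240 ms)."""
    base = 0.400 - enthusiasm * (0.400 - 0.240)    # nominal OOI in seconds
    t, ritard, onsets = 0.0, 0.0, []
    while t < length_s:
        spread = 0.10 * base                       # triangular jitter, 10% of the OOI
        if t < 2.0:
            spread *= 2.0                          # settling in: doubled variation
        if t > 2.0 * length_s / 3.0:
            ritard += 0.02 * base                  # Final Ritard: +2% per clap
        t += rng.triangular(base - spread, base, base + spread) + ritard
        onsets.append(t)
    return np.array(onsets)
```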
V. ENSEMBLE CONTROL MODELING

Néda and his colleagues explained that synchronization is achieved by period doubling of the clapping rhythm [5], [6], a feature that does not exist in many other systems known to synchronize [33]. One example of a synchronized applause (extracted from a live recording) is illustrated in Fig. 11(a). Some fluctuation can be seen in the envelope from the start of the clip. The synchronization becomes audible at about two seconds, and it takes about two more seconds to reach a very clear synchronization. The OOI of the synchronized clapping is roughly 400 ms in this example.

A simplified block diagram of a control model that incorporates the period doubling, as implemented in ClaPD, can be seen in Fig. 12. Each clapper is modeled individually and is aware of its current OOI (measured in milliseconds). The user can control the number of clappers. In the asynchronous mode, each clapper runs independently at its own natural clapping rate. The individual OOIs are drawn from a symmetric triangular distribution Tri(220, 70), so that the OOI lies between 150 and 290 ms. This rate is slightly faster than those proposed by Néda et al. [5], [6], because our measurements indicated faster clapping rates, and these faster rates produce a more convincing synthetic applause.

In the synchronized mode, each clapper aims to clap at the same rate (frequency locking) and absolute time (phase locking) as a lead oscillator. Each clapper calculates its phase difference (measured in milliseconds) with the lead oscillator. Since the lead oscillator has a constant OOI of 440 ms, the phase difference can be considered uniformly distributed, U(0, 440), with a mean of 220 ms. If a clapper trails behind the lead oscillator (phase difference < 220 ms), its clapping rate is accelerated; similarly, if the clapper is ahead of the lead oscillator (phase difference > 220 ms), its clapping rate is slowed down. These operations are called entrainment [33].

A parameter K ∈ [0, 1] determines the entrainment range within the phase cycle and weights the acceleration/deceleration curve (a function of the phase difference and the current OOI) during the entrainment. (K roughly corresponds to the (1 - affinity) parameter of ClapLab; K = 0 indicates good synchronization.) The acceleration/deceleration ranges are depicted in Fig. 13. Within its range, the acceleration is calculated by

OOI_next = OOI_lead + K (OOI_current - OOI_lead) / 2 - PhaseDiff / (c1 + c2 K),    (6)

and the deceleration by

OOI_next = OOI_lead + K (OOI_current - OOI_lead) / 2 + (OOI_lead - PhaseDiff) / (c1 + c2 K),    (7)

where c1 and c2 are constants. In our experiments, c1 = 3 and c2 = 4 provided a good match to the observed dynamics of synchronized applause. Finally, when the mode is switched back to asynchronous, a clapper is decoupled from the lead oscillator, and its clapping rate is sped up by

OOI_next = OOI_current / (c1 + c2 K)    (8)

until the natural clapping rate (150 ms < OOI < 290 ms) is reached. Again, the constants were tuned by hand; c1 = 1.3 and c2 = -0.25 provided good results. (These algorithms can be inspected further in the source file cosc.c of ClaPD.)
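For illustration, the per-clap entrainment updates of (6)-(8) can be sketched as follows. The function names are ours, and the no-update zones are our reading of Fig. 13 (no update below 20+70K ms and above 420-70K ms of phase difference); constants follow the text.

```python
def entrain(ooi_current, phase_diff, K, ooi_lead=440.0):
    """Per-clap OOI update toward the lead oscillator, Eqs. (6)-(7).
    All quantities are in milliseconds; phase_diff lies in [0, ooi_lead)."""
    c1, c2 = 3.0, 4.0
    shared = ooi_lead + K * (ooi_current - ooi_lead) / 2.0
    lo, hi = 20.0 + 70.0 * K, 420.0 - 70.0 * K
    if lo <= phase_diff < 220.0:                   # trailing: speed up, Eq. (6)
        return shared - phase_diff / (c1 + c2 * K)
    if 220.0 <= phase_diff <= hi:                  # leading: slow down, Eq. (7)
        return shared + (ooi_lead - phase_diff) / (c1 + c2 * K)
    return ooi_current                             # no-update zones of Fig. 13

def release(ooi_current, K):
    """Decoupling from the lead oscillator, Eq. (8): speed back up until
    the natural OOI range (150-290 ms) is reached."""
    if ooi_current > 290.0:
        return ooi_current / (1.3 - 0.25 * K)      # c1 = 1.3, c2 = -0.25
    return ooi_current
```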
The synthetic results of our control model are convincing, and the process of finding and losing synchronization sounds realistic. An example of a synthetic clapping sequence with synchronization is illustrated in Fig. 11(b) for K = 0. The sequence starts in the asynchronous mode; after approximately two seconds the user triggers the synchronized mode, and the control model enforces the entrainment.

VI. CONCLUSIONS AND FUTURE WORK

We have presented two physics-based synthesis and control systems for synthesizing hand clapping sounds. A technical description of ClapLab has been provided. ClapLab is implemented in the Synthesis ToolKit (STK) and is available for download [1]. Parametric extensions for the synthesis of various clapping styles and enhanced control strategies for one clapper, as well as for an ensemble of clappers, were introduced. The extended synthesis and control models are implemented as a Pd library (ClaPD), which is also available for download [2].

The measurements reported in this paper were conducted with a relatively limited number of subjects; more reliable values for the model coefficients may be obtained from a larger set. The perceptual testing of the algorithms and implementations presented in this paper is also an important future direction.

The noise envelope generator in ClaPD was originally designed to create isolated claps. When modeling multiple clappers, the control messages can arrive frequently and irregularly, and the clap currently being processed may be discarded when a new event arrives. Although this defect did not cause any audible effects while testing the ensemble control model for multiple clappers, a better scheduling algorithm may prevent potential problems.

Even though the results of the stochastic ensemble control modeling are satisfactory, there are many possibilities for future studies. For example, the globally coupled two-mode stochastic oscillators of [7] could be implemented. The dynamics of the ensemble control model could be further improved by relaxing the probabilistic nature of the control algorithm and implementing simple nonlinear dynamic models based on phase-coupled oscillators [34], [35]. Alternatively, the pulse-coupled oscillator network reported in [19] may be considered. Note that these models generally assume acoustically identical oscillators and concentrate on their rate variations. However, as shown in [3] and verified by the measurements presented in this paper, each clapper can alter the acoustics of their clap. Incorporating such acoustical variations in social clapping and looking for systematic strategies appears to be yet another interesting and important future direction.

VII. ACKNOWLEDGMENTS

Thanks to Dennis Puri, who began the Princeton clapping project as undergraduate independent work in 2001. L. Peltola's and C. Erkut's research is funded by the Academy of Finland (projects 104934 and 105651).

VIII. REFERENCES

[1] P. R. Cook and G. P. Scavone. The Synthesis ToolKit (STK). In Proc. Int. Computer Music Conf., pp. 164-166, Beijing, China, October 1999. For an updated version of STK, see http://ccrma-www.stanford.edu/software/stk/.
[2] L. Peltola. Analysis, Parametric Synthesis, and Control of Hand Clapping Sounds. Master's thesis, Helsinki University of Technology, 2004. Available online at http://www.acoustics.hut.fi/publications/files/theses/lpeltola_mst/.
[3] B. H. Repp. The sound of two hands clapping: An exploratory study. J. Acoust. Soc. Am., 81(4):1100-1109, April 1987.
[4] D. Rocchesso and F. Fontana, editors. The Sounding Object. Edizioni di Mondo Estremo, Firenze, Italy, 2003. Available online at http://www.soundobject.org/SObBook/SObBook_JUL03.pdf.
[5] Z. Néda, E. Ravasz, Y. Brechet, T. Vicsek, and A.-L. Barabási. Physics of the rhythmic applause. Physical Review E, 61:6987-6992, 2000.
[6] Z. Néda, E. Ravasz, Y. Brechet, T. Vicsek, and A.-L. Barabási. The sound of many hands clapping. Nature, 403:849-850, 2000.
[7] Z. Néda, A. Nikitin, and T. Vicsek. Synchronization of two-mode stochastic oscillators: a new model for rhythmic applause and much more. Physica A: Statistical Mechanics and its Applications, 321:238-247, 2003.
[8] MIDI Manufacturers Association Incorporated. MIDI Manufacturers Association, October 2004. Available online at http://www.midi.org/.
[9] P. R. Cook. Physically informed sonic modeling (PhISM): Synthesis of percussive sounds. Computer Music J., 21(3):38-49, 1997.
[10] P. R. Cook. FOFs, wavelets, and particles. In Real Sound Synthesis for Interactive Applications, pages 149-168. A. K. Peters, Natick, MA, 2002.
[11] P. R. Cook. Modeling Bill's gait: Analysis and parametric synthesis of walking sounds. In Proc. Audio Eng. Soc. 22nd Conf. on Virtual, Synthetic, and Entertainment Audio, Helsinki, Finland, 2002.
[12] F. Fontana. Chapter 6: Auxiliary work - example of physics-based synthesis. In Physics-based Models for the Acoustic Representation of Space in Virtual Environments. PhD thesis, Univ. of Verona, 2003. Available online at http://profs.sci.univr.it/~fontana/paper/20.pdf.
[13] F. Fontana and R. Bresin. Physics-based sound synthesis and control: crushing, walking and running by crumpling sounds. In Proc. XIV Colloquium on Musical Informatics, Firenze, Italy, 2003. Available online at http://profs.sci.univr.it/~fontana/paper/21.pdf.
[14] T. Lukkari and V. Välimäki. Modal synthesis of wind chime sounds with stochastic event triggering. In Proc. 6th Nordic Signal Processing Symposium (NORSIG 2004), pages 212-215, Espoo, Finland, June 2004. Available online at http://wooster.hut.fi/publications/norsig2004/61_LUKKA.PDF.
[15] D. Rocchesso, R. Bresin, and M. Fernström. Sounding objects. IEEE Multimedia, 10(2):42-52, 2003.
[16] M. Rath and D. Rocchesso. Continuous sonic feedback from a rolling ball. IEEE Multimedia, 12(2):60-69, 2005.
[17] F. Avanzini, S. Serafin, and D. Rocchesso. Interactive simulation of rigid body interaction with friction-induced sound generation. IEEE Trans. Speech and Audio Process., 13(5):1073-1081, 2005.
[18] J.-M. Adrien. The missing link: modal synthesis. In G. De Poli, A. Piccialli, and C. Roads, editors, Representations of Musical Signals, pages 269-297. The MIT Press, Cambridge, MA, 1991.
[19] E. R. Miranda and J. Matthias. Granular sampling using a pulse-coupled network of spiking neurons. In F. Rothlauf et al., editors, EvoWorkshops 2005, Lecture Notes in Computer Science 3449, pages 539-544. Springer-Verlag, Berlin, Germany.
[20] B. L. Vercoe, W. G. Gardner, and E. D. Scheirer. Structured audio: creation, transmission, and rendering of parametric sound representations. Proc. IEEE, 86(3):922-940, 1998.
[21] P. R. Cook. Physics-based sound synthesis for graphics and interactive applications. ACM SIGGRAPH 2003 Course Notes #36.
[22] P. R. Cook. Physics-based synthesis of sound effects. Game Developers Conference Tutorial, San Jose, CA, 2003.
Slides available at http://www.gamasutra.com/features/gdcarchive/2003/Cook_Perry.ppt.
[23] M. Puckette. Pure Data: another integrated computer music environment. In Proc. Second Intercollege Computer Music Concerts, pages 37-41, Tachikawa, Japan, 1996. Pure Data is available online at http://pure-data.iem.at.
[24] O. Matthes. freeverb~, November 2004. Available online at http://www.akustische-kunst.de/maxmsp/.
[25] M. R. Schroeder. Natural sounding artificial reverberation. J. Audio Eng. Soc., 10(3):219-224, 1962.
[26] J. A. Moorer. About this reverberation business. Computer Music J., 3(2):13-28, 1979.
[27] L. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal. A comparative performance study of several pitch detection algorithms. IEEE Trans. Acoust., Speech, and Signal Process., 24(5):399-418, 1976.
[28] J. Makhoul. Linear prediction: A tutorial review. Proc. IEEE, 63(4):561-580, 1975.
[29] M. Karjalainen, P. A. A. Esquef, P. Antsalo, A. Mäkivirta, and V. Välimäki. Frequency-zooming ARMA modeling of resonant and reverberant systems. J. Audio Eng. Soc., 50(12):1012-1029, 2002.
[30] J. O. Smith III. Physical Audio Signal Processing: Digital Waveguide Modeling of Musical Instruments and Audio Effects. August 2004 draft, http://ccrma.stanford.edu/~jos/pasp04/.
[31] A. Friberg and J. Sundberg. Does music performance allude to locomotion? A model of final ritardandi derived from measurements of stopping runners. J. Acoust. Soc. Am., 105(3):1469-1484, 1999.
[32] R. Bresin, A. Friberg, and S. Dahl. Toward a new model for sound control. In Proc. Conf. on Digital Audio Effects, pages 45-49, Limerick, Ireland, 2001.
[33] S. Strogatz. Sync: Rhythms of Nature, Rhythms of Ourselves. Penguin Books, England, 2003.
[34] Y. Kuramoto and I. Nishikawa. Statistical macrodynamics of large dynamical systems: Case of a phase transition in oscillator communities. J. of Statistical Physics, 49:569-605, 1987.
[35] J. A. Acebrón, L. L. Bonilla, C. J. P. Vicente, F. Ritort, and R. Spigler. The Kuramoto model: a simple paradigm for synchronization phenomena. Rev. Mod. Phys., 77(1):137-185, 2005.

FIGURE CAPTIONS

Fig. 1. General system architecture for clapping analysis/synthesis.
Fig. 2. Envelopes of claps.
Fig. 3. The spectrum of a single clap and the 2nd-order LPC filter fitted to that clap.
Fig. 4. The waveform and spectrogram of parametrically generated clapping.
Fig. 5. Spectrograms of 10 clappers at different affinity levels.
Fig. 6. Different hand clapping modes, after Repp [3]. (Photographs of the configurations P1, P2, P3, A1, A2, A3, A1+, and A1-.)
Fig. 7. LP spectrum of a single clap, the resonator fitted to model it, and, for comparison, the FFT spectrum of the clap.
Fig. 8. The mean value and standard deviation of the center frequency, bandwidth, and gain.
Fig. 9. Block diagram of the synthesis of a single hand clap.
Fig. 10. (a) Example of the OOIs of a clapping sequence where the variation of the OOI is larger at the start. (b) Example of the OOIs of a clapping sequence with decreasing tempo at the end (Final Ritard).
Fig. 11. (a) Envelope of a synchronized applause extracted from a live recording. (b) Envelope of a synthetic synchronized applause (K = 0).
Fig. 12. Simplified block diagram of the synchronization model.
Fig. 13. Acceleration/deceleration determination ranges on the phase cycle: no update for PhaseDiff in [0, 20+70K] ms, speed-up in [20+70K, 220] ms, slow-down in [220, 420-70K] ms, and no update in [420-70K, 440] ms.

TABLES

TABLE I
CLAP STATISTICS

Subject | Mean Clapping Period (s) | STD (s) | Mean Center Freq. (Hz) | STD (Hz)
M1      | 0.256                    | 0.0093  | 1203                   | 278
M2      | 0.327                    | 0.0083  | 435                    | 40
M3      | 0.276                    | 0.0128  | 3863                   | 1009
M4      | 0.265                    | 0.0060  | 1193                   | 243
F1      | 0.238                    | 0.0061  | 1519                   | 219
F2      | 0.284                    | 0.0077  | 2243                   | 775
F3      | 0.298                    | 0.0100  | 1515                   | 239
F4      | 0.285                    | 0.0116  | 1928                   | 764

TABLE II
ONSET-TO-ONSET INTERVALS OBTAINED FROM THE MEASUREMENTS

Style        | Average OOI | Minimum OOI | Maximum OOI | STD of OOI within a sequence
Natural      | 403 ms      | 316 ms      | 610 ms      | 25 ms
Enthusiastic | 323 ms      | 232 ms      | 547 ms      | 21 ms
Bored        | 612 ms      | 362 ms      | 829 ms      | 29 ms