Video Content Marketing The Making of Clips
Journal of Marketing
PrePrint, Unedited
All rights reserved. Cannot be reprinted without the express
permission of the American Marketing Association.
April 3, 2018
1 Senior Data Scientist, Netflix, Netflix Corporate Headquarters, 100 Winchester Circle, Los Gatos, CA 95032, alexl@netflix.com.
2 Assistant Professor of Marketing, Leavey School of Business, Santa Clara University, 500 El Camino Real, Santa Clara, CA 95053, Tel. (408) 554-4798, wshi@scu.edu.
3 Lumry Family Associate Professor, Harvard Business School, Harvard University, Boston, MA 02163, USA, Tel. (617) 495-6125, tteixeira@hbs.edu.
4 PepsiCo Professor of Consumer Science, Robert H. Smith School of Business, University of Maryland, College Park, MD 20742, USA, Tel. (301) 405-2162, mwedel@rhsmith.umd.edu.
Acknowledgement
The order of authors is alphabetical. The authors thank nViso for data collection and processing
by the web-based face tracking system, and Netflix for running the field experiment. This study
was supported by Robert H. Smith School of Business, Harvard Business School, and Leavey
School of Business.
ABSTRACT
Consumers have an increasingly wide variety of options available to entertain themselves. This
poses a challenge for content aggregators who want to effectively promote their video content
online via the original trailers of movies, sitcoms, and video games. Marketers are now seeking
to produce much shorter video clips to promote their content on a variety of digital channels.
This research is the first to propose an approach to produce such clips and to study their effectiveness. Automated facial expression tracking is
used to study viewers’ real-time emotional responses when watching comedy movie trailers
online. These data are used to predict viewers’ intentions to watch the movie and its box office
success. The authors then propose an optimization procedure for cutting scenes from trailers to
produce clips and test it in an online experiment and in a field experiment. The results provide
evidence that the production of short clips using the proposed methodology can be an effective content marketing strategy for online aggregators.
The Internet has drastically reduced barriers to the distribution of video content. This has
caused an unprecedented proliferation of sitcoms, scripted series, documentaries, and long- and
short-format movies. Online content aggregators are making this vast array of video material
readily available to consumers for on-demand streaming. For short-format user-generated video,
there is YouTube. For video games, there is Twitch. For broadcast and cable shows, there is
Hulu. And for movies and web series, there are Netflix and Amazon.
Given that consumers have such a wide variety of options available to entertain
themselves, a challenge for online content aggregators is how to effectively promote their video
content. Synopses, critics’ reviews, and viewer ratings are important, but the best way for a
consumer to evaluate the quality of video content and to determine if she wants to see it, is for
her to watch a sample. For that reason, video content producers have historically used trailers as
their main marketing tool. This started around 1920 when movie theatres produced snippets of
upcoming films with simple text overlays and showed these “trailing” a feature film to entice
viewers to return to the theatre. The National Screen Service, a company that wrote scripts and
produced trailers on behalf of movie studios, was founded soon thereafter. It developed a
template for trailer design that included a montage and music, and held a monopoly over the
creation and distribution of movie trailers that lasted into the 1950s, when more competitors
entered the market. Movie trailer production has evolved into an industry with dozens of
independent production houses charging upwards of $500,000 for a trailer (Last 2004).
Nowadays there are trailers not only for movies, but also for sitcoms, for video games, and even
for books. These trailers are typically two- to three-minute videos created by selecting and
editing scenes from the original video content, and adding music and other sound effects. Their
purpose is to elicit a sample of the emotions that viewers will experience when watching the full
content (Kerman 2004). At least a dozen websites are uniquely devoted to showing trailers, and trailers
themselves have become some of the most popular forms of entertainment on the web.
But, given consumers’ ever-shorter attention spans, the original trailers for movies,
sitcoms, and video games are becoming less effective marketing tools on some digital channels,
particularly those that do not support sound (for example, email and social media). Therefore,
online aggregators have cut the trailers that they obtain from production studios down to “clips”
of 30, 20, or sometimes as little as 10 seconds. The newfound marketing problem for content
aggregators has become not one of creating promotional trailers, but rather one of editing down
the trailer content provided by trailer production studios into formats that are suitable for digital
channels, because providing just the first few seconds of a trailer for online viewing is not always effective. Thus,
marketers need better tools to produce these short clips from original trailers. Yet, despite its
importance, there is no academic research that assists in the development of these tools, or that
even helps to understand if and how these short clips can induce consumers to experience a representative sample of the emotions evoked by the full content.
This paper is the first, to our knowledge, to look at how marketers of video-based content
should edit trailers to produce (shorter) clips that help consumers decide whether to watch the
content. While marketing departments at content aggregators often have limited control over the
specific content of the trailer, they do have control over which scenes to select from the trailer in
producing a shorter clip. Our conceptual framework and methodology are therefore founded on
the notion that the scene is the basic semantic building block of video content and also the unit over which the marketer has editing control.
To illustrate our method, we focus on the creation of clips for comedy movies as an
application. We collected and analyzed four different types of data and propose an optimization procedure for producing clips. First, in an online experiment, participants
were shown movie trailers and their intentions to watch the full movie were measured. These
data were used to calibrate a model that explains viewing preferences based on the audio-visual
scene structure of trailers, as well as the real-time emotional responses evoked from each viewer.
Second, we collected information on various ratings and box office sales of the movies in
question. This allows us to validate in-market the role of the scene structure, under the control of
the marketer, and assess the intermediary role of the emotions evoked. We account for content-
specific effects by controlling for movie ratings and variations of trailers for the same movie (we
do not code and insert content variables because the online aggregator marketer has no control
over them: instead, we measure consumers’ emotional response to that content). Third, after
understanding how the scene structure is associated with the emotional response, intention to
watch, and box office sales, we optimized the editing of trailers to produce short film clips for
use on digital channels in which sound is (e.g., IMDb) or is not (e.g., Facebook) supported.
Fourth, we validate the optimization procedure in an online experiment and in a large-scale field experiment with one of the world’s largest video content aggregators. We show that (a) it is
superior to the currently used heuristic approach for producing clips and (b) it can be automated and scaled up.
Online content aggregators can apply our methodology to design clips that deliver an
optimal emotional experience and consequently induce higher watching intentions and sales for
their video content. The proposed framework applies not only to the comedy movies / trailers /
clips that we examined as a prototypical case, but to all movie genres, and more generally to
other types of digital content (news, books, TV shows, etc.) that are being marketed via clips.
Apart from contributing to the literature on content marketing and movie marketing in particular
(Eliashberg, Hui, and Zhang 2007; Faber and O’Guinn 1984; Litman 1983), this paper also aims
to contribute to the literature on online advertising (Teixeira, Wedel, and Pieters 2012), and fits within emerging research on the automation and personalization of marketing content (Wedel and Kannan 2016).
This paper is structured as follows. First, we review the relevant literature and provide a
general conceptual framework for how the scene structure of trailers affects consumers’ viewing
experience and watching intentions. Then we develop an empirical model of watching intentions
and propose an optimization tool to produce clips. Subsequently, we describe the empirical
results of the model estimation for data on comedy movies and the prediction of their box office
success. We then show the managerial implications of optimal clip production by testing the
optimal clips through simulation, an online experiment, and a large-scale field experiment.
Lastly, we discuss the insights obtained and the potential usage of our tools by marketers of
online content, and reflect on future developments regarding automation and personalization.
The movie industry and its box office performance have seen ample research in marketing.
Litman (1983) showed the relationships between box office success and determinants such as
time of release, distributor, movie genre, production costs, and Academy Awards. Faber and
O’Guinn (1984) then confirmed that the effect of movie previews and movie excerpts (such as
trailers) on movie-going behavior is stronger than the effects of word of mouth and critics’
reviews. Reviews from critics were shown to play a role by Eliashberg and Shugan (1997).
Sharda and Delen (2006) showed that the success of a movie is determined by the number of
screens on which the movie is shown during the initial launch and the stars featured in the movie.
Eliashberg, Hui, and Zhang (2007) demonstrated further that the scripts of trailers could be used
to forecast a movie’s return on investment. More recently, Boksem and Smidts (2015) showed,
using electroencephalography (EEG) measures, that emotions are important predictors of movie
preferences and box office success. Despite significant research being conducted to explain the
causes and drivers of successful movies, less work has been done on movie marketing per se.
Since around 50 percent of a major Hollywood studio’s movie budget is spent on marketing,
with the other half going into production, this gap in the literature is rather puzzling.
Movie marketing is a big business today. According to a report by Statista, global cinema
advertising spending was $2.7 billion in 2015 and is expected to total $3.3 billion in 2020. The
main tool for movie and other video-based content marketers is currently the trailer. By
including scenes from the movie that elicit a sample of the emotions that viewers will experience
while watching the full movie, a trailer allows viewers to form expectations of the experience of
watching the entire film (Kerman 2004). Trailer music is used to support the emotional
experience elicited by the montage, but the majority of trailers use “library” music (Shannon-
Jones 2011) that is not from the movie’s soundtrack itself. Indeed, movie trailers have been
shown to be the most influential factor impacting consumers' intentions to watch a movie (Faber and O’Guinn 1984).
Unfortunately, trailers for movies and other video content are no longer always directly
useful in their original format as marketing tools for online content aggregators and distributors,
given concerns over consumers’ ever-shorter attention spans. The present research focuses on the
scene as the elementary building block of trailers and as the basis for creating short “clips” for
online marketing. Figure 1 visualizes our conceptual framework. It reflects the fact that scenes
are the basic audio-visual building blocks of video content, and that the creative design of a
trailer involves a montage of selected scenes from that content. Marketers at online streaming
services have no influence on the plot, narrative, or script of the trailer, but rather need to edit the trailer down into shorter clips.
Our framework in Figure 1, which applies not only to movies but generally to other
categories of video content as well, is therefore based on the recognition that the problem of
modern-day online content marketers is one of editing down content, as opposed to producing it
from scratch. The cutting of scenes from the trailer produces a clip that retains important
elements of the movie’s emotional experience, which aims to generate a positive intention to
watch the full content as a response to viewing the clip. Practitioners tend to view trailer
“cutting” more as an art than a science and tend to use ad-hoc methods in trailer design by, for
example, applying “a lot more cutting” or adding “an unexpected jolt of some kind or a
wonderful piece of music” (Hart 2014). Ultimately, to produce clips marketers often simply use
the first few scenes of the trailer. A more rigorous approach is called for.
Emotions play an essential role in experiencing movies and shows, and thus in the trailers
and clips created to promote this content (Boksem and Smidts 2015). Movies draw audiences in
because they provide a concentrated emotional experience (Hewig et al. 2005; McGraw and
Warren 2010). Each movie genre is comprised of a prototypical narrative that is designed to
elicit a central emotion; for example, horror movies evoke fear, tragedies evoke sadness, and
comedies evoke happiness (Grodal 1997). In the present study, we focus on comedy movies and
happiness as their central emotion. Our conceptual framework in Figure 1 shows that the key
problem that needs to be addressed for marketers to create effective clips is to identify which
scenes of the original trailer evoke the highest level of that central emotion among viewers. Only
after knowing the intensity and timing of the central emotion can content marketers edit down
long-form trailers to shorter clips that are potentially even more effective than the original
trailers.
The psychology literature on events in film has focused on the scene as a unit of analysis
(Zacks, Speer, and Reynolds 2009), and has shown that viewers parse a film into events based on
the perceptual information that defines and delineates the scenes in the film (Cutting, Brunick,
and Candan 2012). Our framework therefore revolves around the audio-visual scene structure of
trailers (Figure 1). Marketers control the way consumers experience a trailer, and a clip in
particular, through the pacing and length of scene cuts. Consumer behavior research has shown
that pacing and sequencing (Galak, Kruger, and Loewenstein 2011; 2013; Ratner, Kahn, and
Kahneman 1999; Zauberman, Diehl, and Ariely 2006) — in the present context, the number, and
length of scenes — and delays and interruptions (Nelson and Meyvis 2008; Nowlis, Mandel, and
McCabe 2004) — in the present context, scene transitions and cuts — are prime components of the consumption experience. Fast-paced consumption reduces enjoyment
due to overly fast satiation (Galak et al. 2013), whereas a slower consumption, sometimes with
an interruption, slows down satiation and leads to a more enjoyable overall experience (Nelson
and Meyvis 2008). We therefore predict that the pacing of the scenes in a comedy movie trailer
will exert a similar impact: happiness levels will generally improve across the sequence of scenes
in a trailer, but a fast-paced trailer with a larger number of scenes results in a lower level of
happiness, and consequently in a lower watching intention. Prior research also addressed the role of anticipation, showing that delaying a pleasurable
consumption will lead to greater consumption enjoyment because the utility of anticipating a
pleasant consumption outweighs the utility of waiting. Loewenstein and Prelec (1993) also
showed that people prefer anticipating the best outcome at the end of a consumption experience.
We therefore predict that if the key scene (typically the longest) is placed later in a comedy
trailer, happiness and watching intentions will be higher. Beyond the visual scene structure, trailers rely on audio: dialogue, sound
effects, and especially music. Past literature has shown that sound plays a dual role in
experiences: It orients attention and intensifies emotions. The intensity (volume) of the sound has been shown to affect arousal and emotional responses in a graded
manner (Bradley and Lang 2000; Lang 1995; Lang, Bradley, and Cuthbert 1997). Therefore, we
expect a positive impact of moment-to-moment overall and music volume in a comedy trailer on
happiness and consequently on watching intention. Moreover, because some of the digital media
in which clips are commonly placed (e.g., email and social media) do not support sound, it is
important to be able to predict consumers’ reaction to a clip that does not include any audio or
music. In our framework, we therefore allow for the possibility that several aspects of audio and
music volume (including start, peak, trend, and end volumes) can have an impact on emotions
and viewing intentions. The aspects that exert an influence on the experience depend on the
context (Zauberman et al. 2006) and, given a lack of prior literature, we do not formulate specific predictions for these effects.
Given that the goal of trailer and clip design is to produce a representative emotional
experience, it is necessary to characterize the emotional content of trailer scenes. While the entire
emotional experience throughout trailer consumption may be significant, research has shown that
the peak and end points of the emotional experience are disproportionately more important in the
overall evaluation of the experience by consumers (Baumgartner, Sujan, and Padgett 1997;
Fredrickson and Kahneman 1993). Hence, we predict that scenes with high peak and end
happiness result in higher watching intentions. In addition, research has shown that the general trend of the emotional experience
impacts the overall enjoyment (Elpers, Wedel, and Pieters 2003). We therefore predict that a positive trend in happiness over the course of a trailer results in higher watching intentions. Table 1 summarizes these predictions for the audio-visual scene structure of
the trailers that we focus on, and their downstream impact on emotions and watching intentions
in the context of comedies and happiness as their central emotion, as guided by the literature.
With respect to the visual scene structure, the number, length, and sequencing of scenes should
have a measurable impact on the viewer’s emotions. With respect to audio, total sound volume
as well as the music-only volume of scenes should have a direct impact on watching intentions,
as well as an indirect impact, because they evoke or intensify the central emotion of the genre.
Not all moments in the trailer are expected to have a significant impact, but rather the start, peak, trend, and end of these trajectories.
In the next section, we explain the methodology employed to collect emotional reactions
to trailers in order to understand their role in consumers’ intentions to watch movies. We focus
on trailers for comedy movies as a prototypical case. Comedy has been the leading genre in the
last two decades, with a little more than 2,000 comedies produced and a market share of over 20
percent. The average gross revenue was about $20 million per movie. The main goal of comedy
movies is to elicit joy and laughter among the audience (McGraw and Warren 2010) and
effective trailers for comedy movies are designed to induce happiness as the central emotion
(Grodal 1997). We thus focus on happiness as the key emotion in the illustration of our
methodology, but also look at the role of surprise and disgust as secondary emotions. In the next
section, we explain the method used to parse out trailers into scenes and to measure emotions from viewers' facial expressions.
METHODOLOGY
An online experiment was conducted in collaboration with the company nViso, in which
facial expressions and watching intentions were collected for participants watching 100 comedy
movie trailers. Each participant was asked to view a webpage that contained 12 comedy movie
trailers in a setup that mimicked what he or she would encounter on trailer websites from IMDb,
iTunes, or YouTube. Participants watched the trailer in their natural environment, at home, at
work, etcetera, which increases the external validity of data collection. Facial expressions were
recorded remotely through the webcams on participants' computers. At the end of the
experiment, the participants were asked to answer questions regarding their evaluations of the
trailers and the corresponding movies, as well as their intentions to watch the movies. To make
the study incentive-compatible, participants entered a lottery to win the DVD of the movie that
they most wanted to watch. Each participant also received $5 in the form of an Amazon gift card as compensation.
A total of 122 paid participants were recruited online. The participants had a mean age of
24 and an age range from 18 to 68, with 28 percent being men. Participants had to have access to
a personal computer with a webcam and high-speed Internet connection and have near-perfect
vision without glasses or contact lenses. Male participants with a full mustache or beard were
excluded.
A total of 100 comedy movie trailers were taken from public access video channels.
Thirteen comedy subgenres were selected, including nine drama comedies, eight animation
comedies, seven action comedies, seven romantic comedies, four horror comedies, four indie
comedies, four parodies, two dark comedies, and one each from political comedy, sci-fi comedy,
slapstick, sports comedy, and late-night comedy. The trailer for a movie typically comes in
different versions with different lengths developed for different viewing situations and/or for
different audiences. Two different versions of the trailer for each movie were included in the
present study to separately identify trailer-specific features (e.g., scene usage, sound, volume)
from movie-specific features (e.g., stars, casting, plot). Taking the movie “Project X” as an
example, one trailer is one minute and 37 seconds and contains 29 scenes, while the second
trailer is two minutes and 26 seconds and contains 32 scenes and has louder sound on average.
Each participant was only exposed to one version of the trailer for the same movie. Overall, 100
trailers for 50 comedies were used in the study; these trailers were selected from a pool of 100
comedy trailers through a balanced incomplete block design. Data collection generated a massive
dataset of upwards of 1.5 million (participant × time × movie) emotional reactions to audio and
video content scenes. The design minimized spillover effects by randomizing the order of the
trailers shown to each participant. One randomly selected comedy trailer was used as a control
stimulus to form an individual-specific emotional baseline and was shown to all participants at the start of the session.
Procedure
The participants were asked for consent to participate in the experiment and to be
recorded via their webcam. Participants needed to be in a well-lit environment and at most 60
centimeters (2 ft.) away from their webcams. Participants were requested to refrain from eating,
chewing, drinking, or talking. Although this request may have had some impact on the external
validity of the study, it was necessary to ensure accurate recording of facial expressions and reliable emotion measurement.
Each participant was shown a random series of 12 trailers. The length of each trailer was
between one and three minutes. After each trailer, participants were asked five questions about
their previous exposure and their evaluation of the trailer and the movie. Watching intention
(WatchMovie) was measured on a scale ranging from one to seven, with seven being the highest
intention, indicating how much participants would like to watch the movie after they had been
exposed to the trailer. After all trailers were shown, participants were asked to answer questions
about their demographics and their general movie-going behavior. At the end of the experiment,
participants were entered in a raffle in which they had a one-in-10 chance to win a free DVD.
They were asked to choose one or more movies from any of the movies they had just watched in
the experiment. If they won, one movie was selected from the choices they made.
Data Collection
The facial expressions of emotions were collected, calculated, and provided to the
researchers in raw data form by the company nViso, which provides real-time cloud computing
to measure consumers’ emotional reactions in online experiments. For each second that a
participant watched a trailer, a probability was calculated indicating the intensity of the emotion.
An emotional profile was created for each participant, containing the moment-to-moment
measures of happiness and other emotions. The original videos of participants’ expressions were
not retained because of privacy concerns, as outlined in IRB regulations for the study. There
were 122 participants in the online questionnaire data and 104 participants provided valid
emotion data. Ninety participants completed the entire questionnaire and had a valid emotion
profile, indicating full compliance with the instructions. Five participants did not provide valid
control data for the calibration trailer, and therefore the final sample consisted of 85 participants
from whom we obtained complete data, which is comparable to the sample sizes commonly used
by nViso in its online tests. For each of the 100 trailers, the data from participants who had seen
the movie previously were removed. Therefore, the number of participants per trailer varied.
Measuring emotions had been a long-standing problem (Mauss and Robinson 2009) until
Ekman and Friesen developed the Facial Action Coding System (FACS) to systematically
categorize emotions by coding instant facial muscular changes (Ekman and Friesen 1978). The
FACS decomposes facial movements into anatomically based “action units,” reflecting the
muscular activity that produces the facial appearance. For example, happiness is characterized by
two primary and three secondary action units. Recently, the Expression Descriptive Units that
measure the interactions among facial muscular movements (Antonini et al. 2006), as well as
Appearance Parameters that consider global facial features (Sorci et al. 2010), have been used to
augment emotion recognition. Although initially emotions had to be assessed by trained coders,
nowadays several off-the-shelf software solutions are available to provide automatic and real-time coding of facial expressions; we use the software developed by nViso.
This software has been used in previous marketing studies (Teixeira, Wedel, and Pieters
2012; Teixeira, Picard, and Kaliouby 2014). It has been proven to outperform the work of non-
expert coders and to be approximately as accurate as that of expert coders (Bartlett et al. 1999).
The automated algorithm used by nViso splits the video recording of the user’s face into separate
frames, and then uses the facial expression in each static frame to identify the probability of the
occurrence of six basic emotions (happiness, surprise, fear, disgust, sadness and anger) based on
a Multinomial logit model. The explanatory variables include the measurements from Ekman’s
FACS, the Expression Descriptive Units, and the Appearance Parameters (MacCallum and
Gordon 2011; Sorci et al. 2010; more details are provided in Web Appendix I).
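For intuition only, a generic multinomial logit classifier of this kind can be sketched as follows in Python; this is not nViso's proprietary algorithm, and the feature matrix and labels below are placeholders standing in for frame-level facial features (e.g., action units, descriptive units, appearance parameters) and manually coded emotion labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    EMOTIONS = ["happiness", "surprise", "fear", "disgust", "sadness", "anger"]

    # One row of facial features per video frame, with manually coded labels for training.
    X_train = np.random.rand(600, 40)                # placeholder feature matrix
    y_train = np.random.choice(EMOTIONS, size=600)   # placeholder labels

    clf = LogisticRegression(multi_class="multinomial", max_iter=1000).fit(X_train, y_train)

    # For each frame of a new recording, obtain a probability for each basic emotion.
    frame_features = np.random.rand(1, 40)
    probabilities = dict(zip(clf.classes_, clf.predict_proba(frame_features)[0]))

The per-frame probabilities can then be averaged to the one-second resolution used in the analysis.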
The algorithm has been validated using 11-fold cross validation on a database of 1,271
images of facial expressions, manually coded for the expression of emotions by 33 human
coders. This cross-validation yielded a (normed) correlation of .76 (Sorci et al. 2010, Table 4, p.
800). This result is comparable to those for similar automated algorithms, for which
classification accuracies ranging from .78 to .88 have been reported (Brodny et al. 2016; McDuff
et al. 2013). Sorci et al. (2010) also reported other supportive evidence on the performance of the
nViso algorithm and compared it to neural networks, based on Histogram Intersection and
Kullback-Leibler measures.
The movie trailer video and audio content of all 100 trailers was analyzed using image and audio processing software, as follows.
Scene cuts: Scene cuts in the movie trailers were detected automatically using scene-detection software,
based solely on the frame image data. Based on the scene cuts, we calculated the following
variables for use in the analysis: the total number of scenes, the average length of scenes, and the relative position of the longest scene in the trailer.
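To make the scene-cut step concrete, the following is a minimal sketch of frame-based cut detection using OpenCV; it illustrates the general technique rather than the detection tool actually used in the study, and the threshold value is an assumption.

    import cv2

    def detect_scene_cuts(path, threshold=0.5):
        # Return frame indices at which a scene cut is likely, based on the
        # dissimilarity of color histograms of consecutive frames.
        cap = cv2.VideoCapture(path)
        cuts, prev_hist, frame_idx = [], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                # Correlation near 1 means similar frames; a sharp drop signals a cut.
                if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                    cuts.append(frame_idx)
            prev_hist, frame_idx = hist, frame_idx + 1
        cap.release()
        return cuts

From the detected cuts, the number of scenes, the average scene length, and the position of the longest scene follow directly.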
Audio Volume: We extracted two types of volume data. One is Total Volume: Amplitude
data were extracted every millisecond from MP3 audio files using the sound processing software SoX and aggregated to a per-second
basis to match the video data. The other is Music Volume: By removing vocals utilizing SoX, the
music was separated from the audio files, and its volume was calculated as described above. For
both the total volume data and total music volume data, based on our conceptual framework, we
calculated the following variables to be used in the analysis: the moment-to-moment volume
across the trailer, the trend of volume over the course of the trailer, the average volume in the
start scene, the average volume in the end scene, and the scene with peak volume. Figure 2
shows an example of total volume and music volume from one movie trailer (the vertical dashed lines indicate scene cuts).
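As an illustration of the volume extraction (our sketch, not the authors' exact pipeline; the SoX "oops" effect is assumed here as a stand-in for the vocal-removal step, and file names are placeholders):

    import subprocess
    import numpy as np
    import soundfile as sf

    subprocess.run(["sox", "trailer.mp3", "trailer.wav"], check=True)             # decode the trailer audio
    subprocess.run(["sox", "trailer.wav", "music_only.wav", "oops"], check=True)  # crude vocal removal

    def per_second_volume(wav_path):
        # Mean absolute amplitude per second, to align with the per-second emotion data.
        samples, rate = sf.read(wav_path)
        if samples.ndim > 1:
            samples = samples.mean(axis=1)          # mix stereo down to mono
        n_sec = len(samples) // rate
        return np.array([np.abs(samples[s * rate:(s + 1) * rate]).mean() for s in range(n_sec)])

    total_volume = per_second_volume("trailer.wav")
    music_volume = per_second_volume("music_only.wav")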
The trailers are comprised of, on average, 23 scenes (Min = 1, Max = 56, SD = 14.23),
with an average length of 11 seconds (SD = 17.6). The total volume is .053 dB (SD = .025),
while the peak volume is .097 dB (SD = .046). The music volume is substantially lower, .017 dB
Emotion Variables
Happiness is the focal moment-to-moment emotion in our application. Based on our conceptual framework, aggregate measures of this emotion for each trailer were
calculated as follows: Start, the total emotional intensity during the first scene; the Trend,
calculated via a linear fit to each emotion curve; Peak, the average happiness of the scene with
the highest average emotion level; PeakIndex, the location of that scene in the trailer; and End,
the total emotional intensity during the last scene. Figure 3 shows an example of moment-to-
moment happiness and its summary measures for the movie trailer for Men in Black 3. Several
comedy subgenres may rely on other, concomitant emotions of happiness, most importantly
surprise in spoof and action comedies, and disgust in dark, satire, and horror comedies (McGraw
and Warren 2010). Therefore, we retained moment-to-moment surprise and disgust as secondary
emotions in comedy trailers and calculated the Start, Peak, Trend, and End measures for these
two emotions as well. Tests based on a random effects linear model show that there is no
significant trend in any of the emotion variables across the sequence in which the trailers are
shown.
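A minimal sketch of how these summary measures could be computed from a per-second happiness series and the scene index of each second follows; the variable names are illustrative, not the authors' code.

    import numpy as np

    def emotion_summary(happiness, scene_of_second):
        happiness = np.asarray(happiness, dtype=float)
        scenes = np.asarray(scene_of_second)
        scene_ids = np.unique(scenes)
        scene_means = {k: happiness[scenes == k].mean() for k in scene_ids}
        peak_scene = max(scene_means, key=scene_means.get)
        slope = np.polyfit(np.arange(len(happiness)), happiness, 1)[0]   # linear-fit trend
        return {
            "Start": happiness[scenes == scene_ids[0]].sum(),    # total intensity in the first scene
            "Trend": slope,
            "Peak": scene_means[peak_scene],                     # average happiness of the peak scene
            "PeakIndex": int(peak_scene),                        # location of the peak scene
            "End": happiness[scenes == scene_ids[-1]].sum(),     # total intensity in the last scene
        }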
Control Variables
While there are countless variables that one could incorporate into the model, we chose to
incorporate the ones readily available to marketers of online video content aggregators. In that
spirit, we obtain several control variables for movies extracted from the online databases
“Internet Movie Database” (IMDb), owned by Amazon, “Rotten Tomatoes,” and “The
Numbers,” including MPAA Reviews, ratings (from IMDb), number of ratings above 3.5 (from
Rotten Tomatoes, log-transformed), and release time (whether the movie is released during the holiday season).
STATISTICAL MODEL
In our conceptual framework (see Figure 1 and Table 1): a) audio and video features affect viewers' moment-to-moment emotions; b) aggregate measures of these emotions,
including start, peak, and end levels, affect intentions to watch the movie; c) in addition to their
indirect effects via emotions, some aggregate audio and video measures may affect watching
intentions directly; and d) finally, watching intentions, together with aggregate emotion measures
and high-level movie characteristics, such as reviews and ratings, impact box office revenue.
The statistical methodology reflects the postulated theoretical relations. Whereas prior
research has mostly used emotions as explanatory variables, here we model the moment-to-
moment emotional response jointly with the end-point behaviors of prime interest (watching
intention and box office revenues). There are three sub-models combined in this joint model, one
for the (longitudinal) happiness data, one for watching intention data, and one for box office
revenue data. The happiness and watching intention sub-models are connected through
individual-specific random effects (Tsiatis and Davidian 2004). We account for unobserved
individual differences through a hierarchical formulation. Given that there are over 40 predictor
variables, we simultaneously apply Bayesian variable selection to each of the three model components.
First, the measured happiness of individual i for trailer j at time t is modeled, in equation (1), as the sum of an underlying true emotion trajectory θijt (modeled as described below) and an error term. Whereas prior research has treated observed emotions as error-free explanatory
variables (for example, Teixeira, Wedel, and Pieters 2012), here measurement/classification
error, denoted as ξijt , is accommodated. It is obvious that ξijt is not separately identified from
other sources of error, say ςijt , and therefore subsumed in the model’s error term: eijt = ξijt +
ςijt. The error terms eijt are assumed to be independently Normally distributed, and because the underlying trajectory θijt captures the temporal dependence in the data, we
assume them to be uncorrelated over time. The underlying emotion trajectory θijt in equation (1)
is expressed, in equation (2), as a linear function of subject-specific random effects and the moment-to-moment audio and video features of the trailer.
Here, 𝐖1i (t) are subject-specific random effects (see below). As for the moment-to-moment
audio and video features, Sjt represents the index of the scene at time t; Vjt represents the total
audio volume of trailer j at time t; and Mjt represents the volume of music of trailer j at time t.
The matrix 𝐗1j contains the trailer-specific aggregate video and audio variables (start, peak, end,
Second, an ordered logit model is developed for the watching intentions, with yij
representing individual i’s intention to watch movie j, modeled in equation (3) as a function of a latent utility yij∗ and threshold parameters τd.
Here, D = 7 and 𝐖2i contains subject-specific effects, similar to 𝐖1i(t). The threshold
parameters satisfy the order constraint τ1 < τ2 < ... < τD, and the first and last thresholds are fixed
for identification (Lenk, Wedel, and Böckenholt 2006). The matrix 𝐗2i,j contains the predictor
variables, including the aggregate measures of emotions, and the aggregate video and audio
variables extracted from the movie trailers, which are our main explanatory variables. This
model is linked to the model for happiness through the dependence between 𝐖1i (t) and 𝐖2i .
These random effects capture unobserved individual-specific effects in the intercept and the
trend of happiness, respectively. Specifically, 𝐖1i(t) and 𝐖2i are expressed, in equations (4) and (5), as linear functions of individual-specific random effects.
The random effects for the intercept and the slope are denoted as u1i and u2i, and together with
u3i are assumed to follow Normal distributions. The parameters ν1 and ν2 capture the association between the random effects in the happiness and watching intention sub-models.
Third, we model log-box office revenues at the movie level as a function of the predicted watching intentions, averaged across individuals, and movie-level control variables (equation (6)).
Here, y∙j∗ = ∑i yij∗/N, and 𝐗3j contains the aggregate emotion measures and control variables
described in the previous section. Note that equations (1) to (6) constitute a system of
simultaneous equations that are jointly estimated. Gross box office revenue for each of the
movies corresponding to the trailers used in the study was obtained from the IMDb database.
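In summary form, the three jointly estimated sub-models can be written as follows; this is a schematic reconstruction from the description above, in which $H_{ijt}$ denotes the measured happiness of individual $i$ for trailer $j$ at time $t$, and the coefficient symbols and logit link are our notation rather than necessarily those of the original equations (1)-(6):

(1) $H_{ijt} = \theta_{ijt} + e_{ijt}$, with $e_{ijt} = \xi_{ijt} + \varsigma_{ijt}$,
(2) $\theta_{ijt} = \mathbf{W}_{1i}(t) + \beta_S S_{jt} + \beta_V V_{jt} + \beta_M M_{jt} + \mathbf{X}_{1j}'\boldsymbol{\beta}_1$,
(3) $\Pr(y_{ij} \le d) = \mathrm{logit}^{-1}(\tau_d - y_{ij}^{*})$, with $y_{ij}^{*} = \mathbf{W}_{2i} + \mathbf{X}_{2i,j}'\boldsymbol{\beta}_2$, $d = 1, \ldots, D$,
(4)-(5) $\mathbf{W}_{1i}(t) = u_{1i} + u_{2i}\,t$, $\quad \mathbf{W}_{2i} = \nu_1 u_{1i} + \nu_2 u_{2i} + u_{3i}$,
(6) $\log(\mathrm{BoxOffice}_j) = \gamma_0 + \gamma_1\,\bar{y}_{\cdot j}^{*} + \mathbf{X}_{3j}'\boldsymbol{\gamma}_3 + \varepsilon_j$, with $\bar{y}_{\cdot j}^{*} = \sum_i y_{ij}^{*}/N$.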
To identify a parsimonious model that has fewer explanatory variables, we apply the
Gibbs Variable Selection (GVS) procedure developed by Dellaportas, Forster, and Ntzoufras
(2000) to efficiently search for the best subset of predictor variables in 𝐗1ij , 𝐗 2ij , and 𝐗 3j for each
of the three model components, respectively. We use a variable selection approach because this
allows us to use a limited subset of predictor variables in hold-out data collection and model
validation. In the GVS approach, the coefficients of the regression model are assumed to have
spike-and-slab prior distributions: a mixture of a point mass at 0 (the spike) and a diffuse distribution (the slab). A binary inclusion indicator Ik is specified for each covariate k in
equations (2), (3), and (6), with Ik = 0 indicating the absence of the covariate k in the model and Ik = 1 indicating its presence:

(7)   βk = 0 if Ik = 0 (spike), and βk = ηk if Ik = 1 (slab).
The joint density is P(Ik, βk) = P(βk|Ik)P(Ik). The effect size parameter ηk is assumed to have a diffuse Normal prior distribution.
The model is estimated with Markov Chain Monte Carlo (MCMC), using the JAGS software.
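As an illustration of the spike-and-slab construction in equation (7), sketched here in PyMC for a generic linear regression (the original estimation used JAGS; the priors and placeholder data below are assumptions):

    import numpy as np
    import pymc as pm

    X = np.random.rand(100, 5)    # placeholder predictor matrix
    y = np.random.rand(100)       # placeholder outcome

    with pm.Model():
        I = pm.Bernoulli("I", p=0.5, shape=5)                  # inclusion indicators I_k
        eta = pm.Normal("eta", mu=0.0, sigma=10.0, shape=5)    # slab effect sizes eta_k
        beta = pm.Deterministic("beta", I * eta)               # spike-and-slab coefficients beta_k
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        pm.Normal("obs", mu=pm.math.dot(X, beta), sigma=sigma, observed=y)
        trace = pm.sample(1000, tune=1000)                     # posterior means of I give inclusion probabilities

The posterior means of the indicators I correspond to the inclusion probabilities reported in Table 3.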
The integrated model described above is used to produce optimal short movie clips that
can be inserted in emails, messages, and social media, and in the apps, landing pages, and user
interfaces of content providers. Online advertising channels are idiosyncratic in the video
formats they accept, specifically regarding whether or not they support audio. For example,
YouTube plays videos with sound by default, while on Facebook, 85 percent of videos are
watched without sound (Patel 2016), and Netflix only allows GIF format clips without audio and
subtitles in its promotional emails. We produce clips of about 30 seconds in length (but, in
principle, any desired length is possible) and design optimal trailers both with and without audio.
For the former, we use the full model described above and for the latter, we recalibrate the model without the audio variables.
Let Sj = {Sj,1 , … , SjTj } be the sequence of scene indicators across Tj , the length of trailer j,
and let there be Kj = n(Sj) scenes for trailer j. The criterion optimized is the mean of the posterior predictive distribution of the watching intention of the clip, yj(Sj∗) = ∫ E[yij | Sj∗, Φ] f(Φ|data) dΦ,
where Φ contains all model parameters, f(Φ|data) denotes the posterior distribution of the
parameters, and Sj∗ = {S∗j,1, … , S∗j,T∗j} denotes the sequence of scene indicators across the length of
the clip, Tj∗. The algorithm we propose to find an optimal clip of approximately 30 seconds
(or any other target length Δ) for trailer j is a backward elimination algorithm that, one by one, eliminates scenes from the trailer:
1. Initialize the clip as the full trailer, with K∗j = Kj scenes;
2. Eliminate, in turn, each scene k, for k = 1, … , K∗j, by deleting all S∗j,t = k and the corresponding elements of the audio and video variables, and compute the predicted watching intention yj(Sj,−k) of each resulting candidate clip;
3. Retain the clip without scene ℓ = arg mink[yj(Sj,−k)] by removing all time periods t for which S∗j,t = ℓ, and set K∗j = Kj − 1; and
4. If |Tj∗ − Δ| < ϵ, stop; if not, return to step 2. In the application, we use ϵ = 1 second and Δ ≈ 30 seconds.
This backward elimination procedure belongs to the class of greedy algorithms that make a locally optimal decision at each stage of the algorithm. For example, at the first pass for
trailer j, it eliminates the most redundant scene from the trailer to produce a clip with Kj − 1 scenes. Because one scene is eliminated at each pass
for a trailer j, it will provide a solution in less than Kj steps. It avoids the need to enumerate all
possible configurations of scenes, which would be required to find the globally optimal solution.
In some cases, the proposed backward elimination strategy may thus not produce a globally
optimal solution, but it will yield a locally optimal approximation of that solution (Couvreur and
Bresler 2000). For online content aggregators, this approach has two benefits. It allows the
movie marketer to optimally create clips of any length shorter than the original trailer, and it can be automated and applied at scale.
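A schematic implementation of the backward elimination step is given below; predict_intention stands in for the posterior predictive mean of watching intention from the fitted model, and the greedy selection rule shown (keep the candidate clip with the highest predicted intention) is a simplification of the criterion above.

    def optimize_clip(scene_seconds, predict_intention, target_len=30, eps=1):
        # scene_seconds: dict mapping scene id -> list of second indices of that scene.
        # predict_intention: function mapping a set of retained scene ids to the
        # predicted mean watching intention of the resulting clip.
        retained = set(scene_seconds)                                       # start from the full trailer
        length = lambda scenes: sum(len(scene_seconds[k]) for k in scenes)  # clip length in seconds
        while length(retained) > target_len + eps and len(retained) > 1:
            # Remove each remaining scene in turn and score the shorter candidate clip.
            candidates = {k: predict_intention(retained - {k}) for k in retained}
            # Greedily drop the scene whose removal leaves the best-scoring clip.
            retained.remove(max(candidates, key=candidates.get))
        return retained

Because each pass removes exactly one scene, the procedure terminates in fewer passes than there are scenes in the trailer.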
RESULTS
Model Comparison
We first test alternative specifications of the joint model with the purpose of investigating
the contribution of the happiness and secondary emotion measures, the video characteristics, the
audio characteristics, and the intention as a predictor of box office revenue. We calculate several
measures of model fit, including Akaike Information Criterion (AIC), the Deviance Information
Criterion (DIC; Spiegelhalter et al. 2002), and the Watanabe-Akaike Information Criterion
(WAIC). The latter was proposed by Gelman et al. (2014) as a computationally convenient
predictive measure that is based on the entire posterior distribution rather than a point estimate
(as is the case for the AIC and DIC statistics). A smaller value for these statistics indicates a better fit.
We examine six models: 1) the full model, and five models for which we remove the
following sets of predictor variables from the full model: 2) the measures of the secondary
emotions surprise and disgust, 3) the measures of all emotion variables of happiness, surprise,
and disgust, 4) all video variables, 5) all audio variables, and 6) intention as a predictor of box
office sales. Table 2 shows the AIC, DIC, and WAIC statistics for each of the models. The model
without the audio variables shows the largest reduction in (predictive) fit relative to the full
model, followed by the model without the video variables. The importance of audio that this
reveals has ramifications for the design of clips for media that do not support audio. Dropping
watching intention as a predictor of box office success significantly reduces the (predictive) fit of
the model. However, while dropping all emotion measures simultaneously reduces model fit
significantly, the model without the (start, trend, peak, and end) measures of surprise and disgust
fits better than the full model. Apparently, for our sample of movie trailers, these two secondary
emotions do not play a significant role in the formation of watching intentions. We therefore
report the results of the model without the secondary emotions (model 2, main model) below.
Table 3 displays the posterior means of the inclusion probabilities obtained from the
Bayesian variable selection. The table shows that for the happiness model, only the music peak
has a very low inclusion probability (.046). For the watching intention model, variables that have
very low inclusion probabilities are the sequence number of the scenes (.030), the average scene
length (.053), the peak volume (.016), the end volume (.028), the music peak (.029), the music end
volume (.029), and, to a lesser extent, the trend in music volume (.142). Thus, while most of the
audio and video variables affect moment-to-moment happiness, only a few (longest scene, total
and music volume in the first scene, and the trend in volume) affect watching intention directly.
Almost all emotion measures affect watching intention, but the inclusion probability of average
happiness in the watching intention model is relatively low (.196). For the box office model, only a subset of the variables has a high inclusion probability.
In our application, a very large number of models is searched over and even the set of
most promising models may be large. We therefore use a heuristic cutoff on the inclusion
probabilities to select the “best” model and based on Table 3, a standard cutoff of .2 is employed.
We investigate the sensitivity to the cutoff by varying it from .10 to .35 in steps of .05 and re-
estimating the model, including the variables, based on that cutoff. All coefficients of predictor
variables in the emotion and box office models that are significant (have a credible interval that
does not cover zero) stay the same, regardless of the cutoff used, but for the watching intention
model, as the cutoff changes there are some relatively minor variations in the significance of individual coefficients.
In Table 4, we present the estimates of the final model. The table shows that video
features directly impact momentary feelings of happiness. The level of happiness increases with
increasing scene sequences (Scene). The number of scenes in a trailer (SceneNum) has a negative
effect on happiness, confirming that fast-paced comedy trailers tend to result in a lower level of
happiness. We find that longer scenes placed later in the trailers (SceneLongestInd) increase
happiness significantly. These findings confirm our predictions (see Table 1). As for audio, we
find that its moment-to-moment volume has a significant positive instantaneous effect on
happiness, as predicted (Table 1). We did not have specific predictions on the effects of the
sound volume measures, but peak volume (VolumePeak) and increasing trend in volume
(VolumeTrend) decrease happiness. End volume (VolumeEnd) has a positive effect. Music
volume (Music) has a negative moment-to-moment effect on happiness, but louder music at the start of the trailer (MusicStart) has a positive effect on happiness. We next turn to the drivers of
watching intentions. As predicted by peak-end theory (Fredrickson and Kahneman 1993), both
the peak happiness (HappinessPeak) and the happiness experienced at the end of the trailer
(HappinessEnd) have a positive effect on watching intentions (Table 1). In line with our predictions, the overall trend in happiness over the course of the trailer affects watching intentions
positively. Finally, the association between the random intercepts in the happiness and watching
intention models is significant, with a higher variation in the level of happiness being associated
with lower watching intentions. These findings confirm our predictions (Table 1).
Alongside the indirect effects of video and audio variables on watching intentions via the
happiness experienced, there are also direct effects. Longer scenes placed later in the trailers
(SceneLongestInd) increase watching intentions significantly, over and above their effect on
happiness, as predicted (Table 1). Further, increasing volume (VolumeTrend) decreases not only
happiness but also watching intentions directly. However, louder music at the start of the trailer
(MusicStart), while improving happiness, has a negative direct effect on watching intentions.
In the box office revenue model, several of the control variables have a significant
impact, including ratings from Rotten Tomatoes (NumRatingabove3.5; positive effect) and
MPAA reviews (MPAA = PG, PG13, and R; negative effect). Finally, and importantly, watching intention has a significant positive effect on box office revenue.
The parameter estimates in Table 4 were used as inputs to the stepwise scene selection
algorithm to produce an optimal movie clip of about 30 seconds in length for each of the 50 pairs
of trailers. Because some media for which these clips are intended do not allow for sound, clips
were produced both with and without sound. For this purpose, two models with and without the
sound variables were used. As a benchmark for comparison, we use the current practice to
produce clips by selecting the first 30 seconds of the trailer. The results are shown in Table 5.
Optimal movie clips with audio: Table 5 shows that the optimal clips consist on average
of 3.6 scenes, while the benchmark has more and thus shorter scenes, 4.8 on average. The
predicted average watching intention of the optimal clips (7-point scale) is considerably higher
(3.83) than that of the benchmark clips (2.91). The predicted watching intention of the original
trailer is 3.32 (SD = .55), and thus the shorter optimal clip results in an even higher intention to
watch the movie than the original trailer. The average difference in watching intention between
the optimal and benchmark clips is almost a full point (.92) on the seven-point scale, and over 90
percent of the optimal clips have higher watching intentions than the corresponding benchmark
clips. These watching intentions translate to a predicted 3.17 percent improvement in expected box office revenue.
Recall that the data contains two versions of the trailer for each movie. For each of the
two versions of a trailer, we produced a clip using our algorithm. On average, the difference in
predicted watching intention between these two clips for the same movie was 1.29 (SD = .76).
From each pair of clips, we selected the one with the highest predicted watching intention. These
clips had an average predicted watching intention across all movies of 4.21 (SD = .92), which
translates to a 4.80 percent predicted increase in box office revenue. Thus, because the two
different trailers have a wider range of scenes from the movie, selecting the best clip from the
pair results in considerably higher watching intentions and predicted box office success.
Optimal silent movie clips: For clips produced without audio (“silent clips”), the optimal
clip contains 3.6 scenes, on average, while the benchmark contains 4.8 scenes. These results are
not noticeably different from those for clips with audio. The predicted watching intentions for the
optimal silent clips are 3.80 (seven-point scale), which is only somewhat lower than those for the
optimal clips with audio (Table 5). Yet, 42 percent of the silent clips result in a higher watching
intention than their counterparts with audio. Note that this does not reflect a lack of contribution
of audio to watching intentions, but reveals that it is possible to eliminate scenes from the trailer
in such a way that even the resulting silent clips still have a high intention of being watched.
The optimal silent clips result in higher watching intentions than the original trailer (3.32)
and the benchmark silent clips (3.28). The average predicted difference in watching intention
between the optimal and benchmark silent clips is about a half point (.5), and over 90 percent of
the optimal silent clips have higher watching intentions than the silent benchmark clips. These
higher watching intentions result in a predicted 1.75 percent improvement in expected box office
revenue. We conducted a similar analysis using the best of the two versions of the clip for each
movie. On average, the difference in predicted watching intention between the two silent clips
for the same movie was .79 (SD = .45). For the best silent clip in each pair, the average predicted
watching intention is 4.07 (SD= .66), which translates to a 2.45 percent increase in box office
revenue.
We thus demonstrate the beneficial effects of optimizing movie clips via simulation. To
investigate consumers’ response to the actual clips in hold-out validations, we conduct two
experiments. The first is an online experiment and the second is a large-scale field experiment.
We selected the five best-performing clips with audio and the five best-performing silent
clips from the simulation analyses. The movie titles included “Dark Shadow,” “Mirror Mirror,”
“The Odd Life of Timothy Green,” “Project X,” “Rock of Ages,” “Some Guy Who Kills
People,” “Wanderlust,” and “What to Expect When You Are Expecting.” Two clips overlapped
between the two sets of five: “Project X” and “Mirror Mirror.” For each clip, we produced the
actual benchmark and optimal movie clips by editing the digital video file of the trailers based on
the proposed procedure. The clips were produced in GIF format and were about 30 seconds long.
One-hundred and seventy-five undergraduate and graduate students were recruited for the
experiment and participated for extra course credit. To make the study incentive-compatible,
participants were entered in a lottery for the chance to win a $50 gift card to be used to go and
see the movie they liked the most. Some platforms that do not allow for sound (such as
Facebook.com) do allow clips to show subtitles. We therefore also added subtitles to the
optimized silent clips, as this may increase comprehension of the narrative. We showed each
participant five clips in a randomized order. To avoid spillover effects, we only showed one version of the clip for each movie to each participant. After
watching each clip online, the participants were asked to answer three questions based on seven-
point scales to assess their evaluation of the clip (How much do you like this movie clip?), the
movie (How would you rate the movie based on this trailer?), and their intention to watch the movie.
We obtained usable data on 169 respondents. The average of the three evaluation
measures was analyzed with MANOVA, which reveals strong evidence of the performance of
the optimized clips over the benchmark clips with audio (p < .001, partial eta-squared ƞ2 = .08)
and without audio (p < .001, ƞ2 = .08), and of the effect of adding subtitles to silent clips (p
< .001, ƞ2 = .13). Relative to the benchmark, the optimization procedure significantly improves
the measures for each type of movie clip, with moderate effect sizes. The results for the average
of the three evaluation measures are presented in Table 6 for each of the five movie clips
separately. For all three types of clips, improvement over the benchmark is among the largest for
“Mirror Mirror”: For the clip with audio, evaluations increase by 21.9 percent (p = .003, ƞ2
= .16), while for silent clips without subtitles, they increase by 17.1 percent (p = .056, ƞ2 = .16),
and for silent clips with subtitles, they increase by 32.3 percent (p = .002, ƞ2 = .21). The benefits
of the proposed procedure are smallest for silent clips, and adding subtitles may thus be
important when auto-play videos are muted, which is something companies currently do not
seem to do. This hold-out study, in which actual clips were produced and presented to a new
sample of respondents, thus provides evidence of the effectiveness of the proposed model and
optimization procedure.
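For reference, a sketch of the corresponding analysis, assuming a long-format data frame with one row per participant-clip evaluation; the file and column names are illustrative, not the study's actual data:

    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    df = pd.read_csv("clip_evaluations.csv")   # columns (assumed): clip_eval, movie_eval, watch_intent, condition
    fit = MANOVA.from_formula("clip_eval + movie_eval + watch_intent ~ condition", data=df)
    print(fit.mv_test())                       # multivariate tests of the optimized-vs-benchmark effect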
To test whether optimized clips indeed improve the actual viewing behavior of actual
customers, we worked with Netflix’s messaging team and conducted a field experiment to test
our approach on an email campaign for one of Netflix’s original romantic comedy movies, just
before its launch in August 2017. First, we collected data in a facial-tracking experiment
(through nViso) with a sample of 41 participants viewing the trailer of this movie, using the
procedure described in the methodology section. With our model estimates reported in Table 4
and usable facial-expression data from 40 participants and from scene and other characteristics
of the trailer, we produced silent optimized and benchmark clips of 19 seconds in GIF format. In addition to comparing the optimized
clip to the benchmark clip, we also compared it to a static image, as these are still frequently used in promotional emails.
Using a stratified sampling procedure, Netflix users were allocated to strata based on a
unique combination of their region, device type (e.g., iPhone, Android, Apple TV, etc.),
payment type (e.g., debit, credit, etc.), tenure (1, 2, … years), and plan (basic, standard,
premium). Then, using machine-generated random numbers, the participants in each stratum
were randomly assigned to one of three conditions: 1) the baseline with a static image, 2) the
benchmark clip, and 3) the optimized clip. Each condition has an equal number of participants
from each of the strata. In total, 40,000 Netflix customers from non-U.S., English-speaking
countries were involved. Each participant received a promotional email from Netflix with the
optimized clip, benchmark clip, or static image embedded. The emails had the same subject line
and supporting text, and the clips looped. Next, we report Netflix’s standard statistics on the resulting streaming behavior.
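For illustration, stratified random assignment of users to the three conditions could be implemented as follows; the column names and data frame are assumptions made for this sketch, not Netflix's actual schema.

    import numpy as np
    import pandas as pd

    CONDITIONS = ["static_image", "benchmark_clip", "optimized_clip"]
    STRATA = ["region", "device_type", "payment_type", "tenure", "plan"]

    def assign_conditions(users: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
        rng = np.random.default_rng(seed)
        def assign(group):
            # Shuffle users within the stratum, then deal the three conditions out evenly.
            shuffled = group.sample(frac=1, random_state=int(rng.integers(1 << 31)))
            shuffled["condition"] = [CONDITIONS[i % 3] for i in range(len(shuffled))]
            return shuffled
        return users.groupby(STRATA, group_keys=False).apply(assign)

Dealing conditions out within each stratum guarantees (up to rounding) an equal number of users per condition in every stratum.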
First, compared to the static image, the optimized and benchmark clips perform
significantly better in terms of the average number of Streaming Hours with a moderate effect
size (an average lift of 1.60 percent, p < .001, Cohen’s h = .253), showing that customers are
more receptive to clips than to static images. In addition, while a .30 percent higher Watching
Percentage (customers who watched at least 70 percent of the movie) for the optimized clip
relative to the benchmark is not significant, the optimized clip reduces the percentage of Short
Viewers (customers who viewed less than 6 minutes of the movie) by 10.5 percent (p = .058,
Odds ratio = 1.121), and reduces the Bad Player Ratio (percentage of Short Viewers divided by
Watching Percentage) by 12.5 percent (p < .001, Odds ratio = 1.170), compared to the
benchmark clip. Although the effect sizes are relatively small, the optimized clip enhanced downstream viewing behavior relative to the benchmark.
The results of the field test, while preliminary, are encouraging. This is especially the
case because the results rely only on a single silent movie clip, because the Netflix original
comedy movie was not part of the model calibration data, because the samples came from very
different populations, and because streaming behaviors occur far downstream from exposure to
the clips. Nevertheless, the field test showed that streaming behavior was meaningfully improved by the optimized clip.
DISCUSSION
Movie trailers have long been regarded as the movie industry's most effective marketing
tool (Faber and O’Guinn 1984), but original two- to three- minute trailers, not only for movies
but also for sitcoms and video games, are becoming less effective in new digital media.
Marketers are therefore seeking to produce much shorter video clips to promote their content in
these media. Film clips are ads for movies, akin to those for video games and TV shows, but
different from food, car, and electronics commercials in that they are made up of samples of the
product that is being promoted. Viewers of clips thus experience a sample of the emotions that
they will experience when they go see the movie. The challenge for marketers resides in
identifying how many and which scenes of the trailer to show in a short video clip. The goal of
this research is to support movie marketers in this effort by investigating how to cut the trailers
provided by trailer production houses down to short clips that are suitable for today’s electronic
media, while eliciting an emotional experience that is representative of the movie and stimulates the intention to watch it.
But how to optimally sample the trailer content remains an open question. The present research developed a joint modeling
framework for moment-to-moment emotions, watching intentions, and box office success
(Figure 1) to support this goal. The framework centers on the scene as the basic building block of
movies, trailers, and clips. The proposed method helps marketers to select those scenes from a
trailer that render short clips the most effective. The findings of the analyses, simulations, and
online and field tests show that our approach enables the design of short clips that not only
increase consumers’ intentions to watch the movie, but that also improve predicted box office performance.
This research marks a first attempt to investigate the effectiveness of clips, with the
application focusing on clips for the comedy movie genre. While happiness, as the central
emotion of the genre, has strong effects, we do not find an effect of concomitant emotions, such
as surprise and disgust that might be significant in spoof, action, dark, satire, and horror
comedies. This result might be caused by the sample of participants and movies in the present
study, and future research should further examine the role of such concomitant emotions. We do
expect the proposed approach to be directly applicable to other movie genres which elicit a
different central emotion (Grodal 1997). Further refinement of the approach for that purpose may
be useful.
More broadly, the approach applies whenever a sample of the product is marketed, and applications arise not only for movies, but also for
immersive games, for TV shows on HBO and Netflix, for news items shown on news sites and
news aggregators, such as Flipboard, and even for books (Arons 2013). The manner in which our
approach can be extended to support the marketing of these other types of products requires
further study. In such future research, face tracking may be combined with measures such as those derived from EEG, which have recently been shown to be predictive of movie preferences (Boksem and Smidts 2015).
Online ads have a short “shelf life” compared to traditional forms of advertising, such as
TV commercials. As such, online marketers need to constantly create new content for these ads
to attract and retain consumers’ attention online. The traditional production process of ads is
expensive and often slow and therefore marketers are increasingly considering automation to
produce variations of ads as quickly as possible and with low budgets. Our approach to
advertising online content via short clips can be automated, scaled up, and personalized. Once
representative calibration data is available on which the models in question have been trained,
film clips can be automatically produced using the proposed algorithm. Taking this one step
further, using customer-level data, our procedure could be utilized to customize the selection of
scenes to produce personalized clips that maximize the elicited response from each individual
customer. The pursuit of the automation and personalization of the content of movie clips holds
promise to greatly enhance marketing effectiveness (Wedel and Kannan 2016). We hope the
present study provides a starting point for these future research avenues, and consequently
improves the effectiveness of the marketing of movies and other video content.
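To illustrate how such automated clip production might be implemented, the sketch below uses a backward-greedy heuristic in the spirit of backward-greedy subset selection (Couvreur and Bresler 2000): scenes are dropped one at a time, always removing the scene whose deletion hurts a predicted response score the least, until the clip fits a target duration. The function predict_response is a hypothetical stand-in for a calibrated response model; it is not the estimator used in this paper.

# Minimal sketch of backward-greedy scene selection for a clip.
# predict_response() is a hypothetical stand-in for a calibrated model that
# scores a candidate set of scene indices.
select_clip_scenes <- function(scene_durations, target_duration, predict_response) {
  selected <- seq_along(scene_durations)            # start from the full trailer
  while (sum(scene_durations[selected]) > target_duration && length(selected) > 1) {
    # Score every candidate set obtained by dropping one remaining scene
    scores <- sapply(selected, function(s) predict_response(setdiff(selected, s)))
    # Remove the scene whose deletion lowers the predicted response the least
    selected <- setdiff(selected, selected[which.max(scores)])
  }
  selected
}

# Example with hypothetical scene lengths (in seconds) and a dummy scorer
durations <- c(6, 4, 8, 5, 7, 3)
dummy_score <- function(scenes) sum(durations[scenes]) - 0.5 * length(scenes)
select_clip_scenes(durations, target_duration = 15, predict_response = dummy_score)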
References
Arons, Rachel (2013), “The Awkward Art of Book Trailers,” The New Yorker, (accessed
awkward-art-of-book-trailers]
Bartlett, Marian Stewart, Joseph C. Hager, Paul Ekman, and Terrence J. Sejnowski (1999),
253–63.
Baumgartner, Hans, Mita Sujan, and Dan Padgett (1997), “Patterns of Affective Reactions to
Boksem, Maarten A. S. and Ale Smidts (2015), “Brain Responses to Movie Trailers Predict
Bradley, Margaret M., and Peter J. Lang (2000), “Affective Reactions to Acoustic Stimuli,”
for Emotion Recognition Based on Facial Expressions,” In: 9th International Conference
Cornfield, Jerome (1951), “A Method for Estimating Comparative Rates from Clinical Data.
Applications to Cancer of the Lung, Breast, and Cervix,” Journal of the National Cancer
Couvreur, Christophe, and Yoram Bresler (2000), “On the Optimality of the Backward Greedy
Algorithm for the Subset Selection Problem,” SIAM Journal on Matrix Analysis and
Cutting, James E., Kaitlin L. Brunick, and Ayse Candan (2012), “Perceiving Event Dynamics
Dellaportas, P., J. J. Forster, and Ioannis Ntzoufras (2000), “Bayesian Variable Selection Using
Ekman, Paul and Wallace V. Friesen (1978), Facial Action Coding System: A Technique for the
Eliashberg, Jehoshua and Steven M. Shugan (1997), “Film Critics: Influencers or Predictors?”
-------, Sam K. Hui, and Z. John Zhang (2007), “From Story Line to Box Office: A New
Elpers, Josephine L.C.M. Woltman, Michel Wedel, and Rik G. M. Pieters (2003), “Why Do
Faber, Ronald J. and Thomas C. O’Guinn (1984), “Effect of Media Advertising and Other
Fasel, Beat and Juergen Luettin (2003), “Automatic Facial Expression Analysis: A Survey,”
Fredrickson, Barbara L., and Daniel Kahneman (1993), “Duration Neglect in Retrospective
44-55.
Galak, Jeff, Justin Kruger, and George Loewenstein (2011), “Is Variety the Spice of Life? It All
Depends on The Rate of Consumption,” Judgment and Decision Making, 6(3), 230-238.
-------, (2013), “Slow down! Insensitivity to Rate of Consumption Leads to Avoidable Satiation,”
George, Edward I., and Robert E. McCulloch (1997), “Approaches for Bayesian Variable
Gelman, Andrew, Jessica Hwang, and Aki Vehtari (2014), “Understanding Predictive
Information Criteria for Bayesian Models,” Statistics and Computing, 24(6), 997-1016.
Grodal, Torben (1997), “Moving Pictures: A New Theory of Film Genres, Feeling, and
Hart, Hugh (2014), “9 (Short) Storytelling Tips from A Master of Movie Trailers.” Fast
http://www.fastcocreate.com/3031012/9-short-storytelling-tips-from-a-master-of-movie-
trailers]
Hewig, Johannes, Dirk Hagemann, Jan Seifert, Mario Gollwitzer, Ewald Naumann, and Dieter
Bartussek (2005), “A Revised Film Set for The Induction of Basic Emotions,” Cognition
Hui, Sam K, Tom Meyvis and Henry Assael (2014), “Analyzing Moment-To-Moment Data
Kellaris, James J. and Ronald C. Rice (1993), “The Influence of Tempo, Loudness, and Gender
Kernan, Lisa (2004), “Coming Attractions: Reading American Movie Trailers,” in Texas Film
and Media Studies Series. Austin: University of Texas Press, 1st Edition.
Lang, Peter. J. (1995), “The Emotion Probe: Studies of Motivation and Attention,” American
-------, Margaret M. Bradley, and Bruce N. Cuthbert (1997), “Motivated attention: Affect,
Activation, and Action,” in Attention and Orienting: Sensory and Motivational Processes,
Last, J. (2004), “Opening Soon,” Wall Street Journal, (accessed May 1, 2004), [available
http://www.opinionjournal.com].
Lenk, Peter, Michel Wedel, and Ulf Böckenholt (2006), “Bayesian Estimation of Circumplex
71(1), 33-55.
Loewenstein, George F., and Dražen Prelec, (1993), “Preferences for Sequences of Outcomes,”
MacCallum, David and Alistair Gordon (2011), “Say It to My Face! Applying Facial Imaging to
McDuff, Daniel, Rana El Kaliouby, David Demirdjian, and Rosalind Picard (2013), “Predicting
Online Media Effectiveness Based on Smile Responses Gathered Over the Internet,” In
Automatic Face and Gesture Recognition (FG), 10th IEEE International Conference and
Workshops, 1-7.
McGraw, A. Peter and Caleb Warren (2010), “Benign Violations: Making Immoral Behavior
Mauss, Iris B. and Michael D. Robinson (2009), “Measures of Emotions: A Review,” Cognition
Nelson, Leif D., and Tom Meyvis (2008), “Interrupted Consumption: Disrupting Adaptation to
Newcombe, Robert G. (1998), “Interval Estimation for the Difference between Independent
Nowlis, Stephen M., Naomi Mandel, and Deborah Brown McCabe (2004), “The Effect of a
Patel, Sahil (2016), “85 Percent of Facebook Video Is Watched without Sound,” Digiday,
facebook-video/]
Ratner, Rebecca K., Barbara E. Kahn, and Daniel Kahneman (1999), “Choosing Less-Preferred
Experiences for The Sake of Variety,” Journal of Consumer Research 26(1), 1-15.
Shannon-Jones, Samantha (2011), “Trailer Music: A Look at The Overlooked,” The Oxford
2011/10/26/trailermusi/].
Sharda, Ramesh and Dursun Delen (2006), “Predicting Box-Office Success of Motion Pictures
Shiv, Baba, and Stephen M. Nowlis, (2004), “The Effect of Distractions While Tasting a Food
Sorci, Matteo, Gianluca Antonini, Javier Cruz, Thomas Robin, Michel Bierlaire, and J-Ph Thiran
(2010), “Modelling Human Perception of Static Facial Expressions,” Image and Vision
Spiegelhalter David J., Nicola G. Best, Bradley P. Carlin, and Angelika Van Der Linde (2002),
“Bayesian Measures of Model Complexity and Fit,” Journal of the Royal Statistical
Teixeira, Thales, Michel Wedel, and Rik Pieters (2012), “Emotion-Induced Engagement in Internet
-------, Rosalind Picard and Rana El Kaliouby (2014), “Why, When, And How Much to Entertain
Tsiatis, Anastasios A. and Marie Davidian (2004), “Joint Modeling of Longitudinal and Time-To-
Wedel, Michel and P.K. Kannan (2016), “Marketing Analytics for Data-Rich Environments,”
Wilson, Edwin B. (1927), “Probable Inference, the Law of Succession, and Statistical Inference,”
Zacks, Jeffrey M., Nicole K. Speer, and Jeremy R. Reynolds (2009), “Segmentation in Reading
Footnotes:
https://www.nytimes.com/2017/08/30/business/media/nfl-six-second-commercials.html]
3. The Random Intercept, Index of the Longest Scene, the Audio Trend, Happiness Peak,
and Happiness Trend are “significant” for all cutoffs. The signs of the coefficients are
stable, but Average Happiness (c.o. < .2), Happiness End (c.o. = .2), Music Start (c.o. =
.2), and Volume Start (c.o. > .3) are significant only for some of the cutoffs.
4. We calculate the p-value based on tests on the relationship between proportions of two
groups (Newcombe 1998; Wilson 1927), using the stats package in R (prop.test) and set
the alternative hypothesis as greater or less. We calculate the Odds ratio as an effect-size
measure (e.g., Cornfield 1951), and calculate Cohen’s h for the lift measure of Streaming Hours.
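To make the computation in this footnote concrete, the following R sketch shows how the proportion test, the odds ratio, and Cohen’s h can be obtained; the counts below are placeholders, not the data from the field experiment.

# Illustrative only: placeholder counts, not the field-experiment data.
# Compare the share of Short Viewers under the benchmark and optimized clips.
short_viewers <- c(480, 430)      # hypothetical counts of Short Viewers
exposed <- c(10000, 10000)        # hypothetical numbers of exposed customers

# One-sided test on the difference between two proportions
# (Newcombe 1998; Wilson 1927), via stats::prop.test
prop.test(x = short_viewers, n = exposed, alternative = "greater")

p1 <- short_viewers[1] / exposed[1]
p2 <- short_viewers[2] / exposed[2]

# Odds ratio as an effect-size measure (Cornfield 1951)
odds_ratio <- (p1 / (1 - p1)) / (p2 / (1 - p2))

# Cohen's h for a lift in proportions (used for the Streaming Hours lift)
cohens_h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

odds_ratio
cohens_h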
Delays and interruptions | Nelson and Meyvis 2008; Nowlis et al. 2004; Shiv and Nowlis 2004 | Average scene length (-); longest scene index number (+). | Average scene length (-); longest scene index number (+).
Audio
Moment-to-moment level | Bradley and Lang 2000; Lang 1995; Lang et al. 1997 | Volume level (+); Music level (+). | Volume level (+); Music level (+).
Start, peak, end, and trend | Zauberman et al. 2006 | Volume start, peak, end, and trend (+/-); Music start, peak, end, and trend (+/-). | Volume start, peak, end, and trend (+/-); Music start, peak, end, and trend (+/-).
Emotions
Happiness | Baumgartner et al. 1997; Fredrickson and Kahneman 1993; Elpers et al. 2003 | Happiness start (+/-), peak (+), end (+), and trend (+). |
Note: Expected directions (+, -, or +/-) of the effects are in parentheses.
TABLE 2: MODEL COMPARISON STATISTICS FOR THE FULL MODEL AND FIVE MODELS THAT ARISE BY REMOVING VARIABLES FROM THE FULL MODEL

FIGURE 1: CONCEPTUAL FRAMEWORK
Notes: Scenes are the basic building blocks of video content. The audio-visual scene structure of
movies elicits an intended emotional response. Scenes from the movie are selected for the trailer,
and scenes from the trailer are selected to produce a clip. The clip provides a representative
emotional experience and results in the intention to watch the full movie.
FIGURE 2: TOTAL SOUND VOLUME AND MUSIC VOLUME FOR ONE TRAILER (MEN IN BLACK 3)
Note: The solid line indicates the total sound volume (in dB). The dotted line indicates the music volume (in dB) with vocals removed.
Vertical dashed lines indicate scene cuts.
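As an illustration of how such a volume profile can be computed, the R sketch below derives a windowed RMS volume in dB from a mono waveform; the waveform is simulated here, and the dB values are relative to full scale. Separating the music track with vocals removed, as in the figure, requires source-separation tools that are not shown.

# Windowed RMS volume in dB for a simulated mono waveform (illustration only)
sample_rate <- 44100
t <- seq(0, 2, length.out = 2 * sample_rate)
waveform <- 0.3 * sin(2 * pi * 440 * t) * (1 + 0.5 * sin(2 * pi * 0.5 * t))

window_size <- sample_rate %/% 10                    # 100 ms windows
n_windows <- length(waveform) %/% window_size
volume_db <- sapply(seq_len(n_windows), function(w) {
  idx <- ((w - 1) * window_size + 1):(w * window_size)
  20 * log10(sqrt(mean(waveform[idx]^2)) + 1e-12)    # RMS converted to dB
})

plot(volume_db, type = "l", xlab = "Window (100 ms)", ylab = "Volume (dB)")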
FIGURE 3: HAPPINESS PROFILE AND HAPPINESS AND SCENE MEASURES FOR ONE INDIVIDUAL FOR A
SAMPLE TRAILER (MEN IN BLACK 3)
Notes: The happiness measure ranges from 0 to 1 (the algorithm assigns a probability based on three sets of facial expression
measurements; details about the measurement are provided in Web Appendix I). Vertical dashed lines indicate scene cuts. The middle
shaded area is the region of the scene with the happiness peak; left and right shaded areas are start and end scenes. The horizontal
dashed line indicates 75% of the peak value; the dotted line is a linear fit used to represent the happiness trend.
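The scene-level happiness measures shown in the figure can be computed along the following lines in R; this is a minimal sketch that assumes a per-frame happiness vector and a vector of scene labels, and the exact operationalization used in the paper (e.g., how the start and end regions are delimited) may differ.

# Minimal sketch: happiness peak, start, end, and trend from a moment-to-moment
# profile. `happiness` is a per-frame series in [0, 1]; `scene` labels each frame.
happiness_measures <- function(happiness, scene) {
  peak_value <- max(happiness)
  peak_scene <- scene[which.max(happiness)]                    # scene containing the peak
  start_value <- mean(happiness[scene == scene[1]])            # mean over the first scene
  end_value <- mean(happiness[scene == scene[length(scene)]])  # mean over the last scene
  trend_slope <- unname(coef(lm(happiness ~ seq_along(happiness)))[2])  # linear trend
  c(peak = peak_value, peak_scene = peak_scene,
    start = start_value, end = end_value, trend = trend_slope)
}

# Example with simulated data: 300 frames spread over 5 scenes
set.seed(1)
h <- pmin(pmax(0.3 + cumsum(rnorm(300, 0, 0.02)), 0), 1)
s <- rep(1:5, each = 60)
happiness_measures(h, s)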
WEB APPENDIX
The facial recognition algorithm developed by Sorci et al. (2010) and used by nViso
specifies the probability of six basic emotions (Ekman and Friesen 1971), namely, “happiness”,
“surprise”, “fear”, “disgust”, “sadness”, and “anger”. Using a multinomial logit model, the
algorithm assigns a probability to each of these emotions based on three sets of facial expression
measurements: those based on the Facial Action Coding System (FACS), those based on
Expression Descriptive Units (EDU), and those based on Appearance Parameters (AP).
The FACS developed by Ekman and Friesen (1978) is the leading standard for measuring
facial expressions. All visible movements of facial muscles are categorized into “action units” (AUs), and emotions are identified based on a unique combination of these AUs.
For example, happiness is characterized by two primary and three secondary action units. Zhang and Ji (2005) validated the classification of emotions based on the AUs and supplemented the
classification with auxiliary AUs and transient facial features, such as wrinkles and furrows.
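As a toy illustration of AU-based emotion identification, the R sketch below matches a set of detected action units against commonly cited AU combinations (e.g., AU6 plus AU12 for happiness); the combinations listed are illustrative and are not the exact mapping used by the nViso system.

# Toy illustration: map a set of detected action units (AUs) to an emotion label.
# The AU combinations below are commonly cited examples, not the study's mapping.
au_combinations <- list(
  happiness = c(6, 12),         # cheek raiser + lip corner puller
  surprise = c(1, 2, 5, 26),    # brow raisers + upper lid raiser + jaw drop
  sadness = c(1, 4, 15)         # inner brow raiser + brow lowerer + lip corner depressor
)

classify_aus <- function(detected_aus) {
  matches <- sapply(au_combinations, function(combo) all(combo %in% detected_aus))
  if (any(matches)) names(au_combinations)[which(matches)[1]] else "other"
}

classify_aus(c(6, 12))          # returns "happiness"
classify_aus(c(1, 2, 5, 26))    # returns "surprise"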
The EDU was developed by Antonini et al. (2006), after recognizing that face
recognition also involves spatial configuration of facial features (Cabeza and Kato 2000; Farah et
al. 1998). The EDU encodes the interactions among facial features (e.g., the interactions between
eyebrows and mouth), in addition to the isolated AUs identified in FACS. Lastly, AP was
developed by Sorci et al. (2010) to provide a description of a face as a global entity.
Sorci et al. (2010) compared three different models: (1) the model with only measures
from FACS as explanatory variables; (2) the model with EDU and significant measures from
FACS in model 1 as explanatory variables; (3) the model with AP, and significant measures from
EDU and FACS in model 2 as explanatory variables. The model is specified as:
\[
\text{Emotion}_j = \text{Intercept}_j +
\begin{cases}
\displaystyle\sum_{k=1}^{K_1} I_{kj}^{K_1}\,\beta_{kj}^{K_1}\,\mathit{FACS}_k^{K_1}, & \text{Model 1}\\[6pt]
\displaystyle\sum_{k=1}^{K_2} I_{kj}^{K_2}\,\beta_{kj}^{K_2}\,(\mathit{FACS}\cup\mathit{EDU})_k^{K_2}, & \text{Model 2}\\[6pt]
\displaystyle\sum_{k=1}^{K_3} I_{kj}^{K_3}\,\beta_{kj}^{K_3}\,(\mathit{FACS}\cup\mathit{EDU}\cup\mathit{AP})_k^{K_3}, & \text{Model 3}
\end{cases}
\tag{W1}
\]
in which Emotion_j includes “happiness”, “surprise”, “fear”, “disgust”, “sadness”, “anger”, “neutral”, “other”, and “I don’t know”; K1, K2, and K3 represent the total number of measurements from FACS, from FACS and EDU, and from FACS, EDU, and AP, respectively; I_{kj} is an indicator variable that equals 1 if the k-th measurement is included for emotion j and 0 otherwise; β_{kj} is the corresponding coefficient; and the intercept captures the average effect of factors that are not included.
The models are estimated by maximum likelihood, and model comparison statistics show
that the full model (model 3) performs best and is used as the final model in the facial expression recognition algorithm.
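A compact way to fit a multinomial logit of emotion labels on facial measurements is sketched below using nnet::multinom in R; the data frame, column names, and simulated values are placeholders, and the sketch omits the variable-selection indicators of the original specification.

# Minimal sketch: multinomial logit of emotion labels on facial measurements.
# `faces` is a hypothetical data frame with one row per face image; the columns
# stand in for FACS-, EDU-, and AP-based measurements.
library(nnet)

set.seed(2)
faces <- data.frame(
  emotion = factor(sample(c("happiness", "surprise", "neutral"), 200, replace = TRUE)),
  facs_au6 = rnorm(200), facs_au12 = rnorm(200),   # FACS-based measurements
  edu_brow_mouth = rnorm(200),                     # EDU-based measurement
  ap_global = rnorm(200)                           # appearance parameter
)

# Analogue of model 3: all three sets of measurements as explanatory variables
fit <- multinom(emotion ~ facs_au6 + facs_au12 + edu_brow_mouth + ap_global,
                data = faces, trace = FALSE)

# Predicted emotion probabilities for the first few faces
head(predict(fit, type = "probs"))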
References:
Antonini, Gianluca, Matteo Sorci, Michel Bierlaire, and Jean-Philippe Thiran (2006), "Discrete
Choice Models for Static Facial Expression Recognition," In Advanced Concepts for Intelligent
Vision Systems, 710-721. Springer Berlin/Heidelberg.
Cabeza, Roberto, and Takashi Kato (2000), "Features Are Also Important: Contributions of
Featural And Configural Processing To Face Recognition," Psychological Science 11(5), 429-
433.
Ekman, Paul, and Wallace V. Friesen (1971), "Constants across Cultures in the Face and
Emotion," Journal of Personality and Social Psychology 17(2), 124-129.
Ekman, Paul, and Wallace V. Friesen (1978), Facial Action Coding System Investigator’s Guide.
Consulting Psychologists Press, Palo Alto, CA.
Farah, Martha J., Kevin D. Wilson, Maxwell Drain, and James N. Tanaka (1998), "What Is 'Special' about Face Perception?," Psychological Review, 105(3), 482-498.
Zhang, Yongmian, and Qiang Ji (2005) "Active and Dynamic Information Fusion for Facial
Expression Understanding from Image Sequences," IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(5), 699-714.
model{
  for (j in 1:numTrailer){
    # Observation-level (moment-to-moment) likelihood for trailer j;
    # its body is not reproduced in this excerpt:
    # for (i in 1:nObsMatrix[,j]){ ... }
    # Trailer-level happiness measures, averaged over the observations of trailer j
    happiness_peaktemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],21,j])/nObsMatrix[,j]
    happiness_peakIndextemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],24,j])/nObsMatrix[,j]
    happiness_endtemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],27,j])/nObsMatrix[,j]
    happiness_starttemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],30,j])/nObsMatrix[,j]
    # Linear predictor of the (log) box office equation
    mu_bo[j] <- vbo[1]*WatchMovieTemp[j] + vbo[2]*Holiday[j]
      + vbo[3]*mpaaPG[j] + vbo[4]*mpaaPG13[j] + vbo[5]*mpaaR[j]
      + vbo[6]*log(Numratingabove3_5[j]+1)
      + vbo[7]*happiness_avg_temp[j] + vbo[8]*happiness_peaktemp[j]
      + vbo[9]*happiness_endtemp[j] + vbo[10]*happiness_starttemp[j]
      + vbo[11]*happiness_coftemp[j]
      + vbo[12]
    LGBoxOffice[j] ~ dnorm(mu_bo[j], sigma_bo)
  } #j
  # Respondent-level random effects
  for (i in 1:nRespondent){
    for (f in 1:3){
      u[i,f] ~ dnorm(0, sigma_m[f])
    }
  }
  # Priors
  for (r1 in 1:3){
    sigma_m[r1] ~ dgamma(0.01, 0.01)
  }
  for (d in 1:D){
    tau0[d] ~ dnorm(0, 0.001)
  }
  tau <- sort(tau0)   # ordered cutoff parameters
  for (m in 1:nAlpha){
    alpha[m] ~ dnorm(0, 0.01)
  }
  for (l in 1:nV){
    v[l] ~ dnorm(0, 0.01)
  }
  sigma ~ dgamma(0.01, 0.01)