
© 2018, American Marketing Association

Journal of Marketing
PrePrint, Unedited
All rights reserved. Cannot be reprinted without the express
permission of the American Marketing Association.

VIDEO CONTENT MARKETING: THE MAKING OF CLIPS

Xuan Liu1, Savannah Shi2, Thales Teixeira3, and Michel Wedel4

April 3, 2018

1 Senior Data Scientist, Netflix, Netflix Corporate Headquarters, 100 Winchester Circle, Los Gatos, CA 95032, alexl@netflix.com.

2 Assistant Professor of Marketing, Leavey School of Business, Santa Clara University, 500 El Camino Real, Santa Clara, CA 95053, Tel. (408) 554-4798, wshi@scu.edu.

3 Lumry Family Associate Professor, Harvard Business School, Harvard University, Boston, MA 02163, USA, Tel. (617) 495-6125, tteixeira@hbs.edu.

4 PepsiCo Professor of Consumer Science, Robert H. Smith School of Business, University of Maryland, College Park, MD 20742, USA, Tel. (301) 405-2162, mwedel@rhsmith.umd.edu.

Acknowledgement

The order of authors is alphabetical. The authors thank nViso for data collection and processing

by the web-based face tracking system, and Netflix for running the field experiment. This study

was supported by Robert H. Smith School of Business, Harvard Business School, and Leavey

School of Business.

VIDEO CONTENT MARKETING: THE MAKING OF CLIPS

ABSTRACT

Consumers have an increasingly wide variety of options available to entertain themselves. This

poses a challenge for content aggregators who want to effectively promote their video content

online via the original trailers of movies, sitcoms, and video games. Marketers are now seeking

to produce much shorter video clips to promote their content on a variety of digital channels.

This research is the first to propose an approach to produce such clips and to study their

effectiveness, focusing on comedy movies as an application. Web-based facial-expression tracking is

used to study viewers’ real-time emotional responses when watching comedy movie trailers

online. These data are used to predict viewers’ intentions to watch the movie and its box office

success. The authors then propose an optimization procedure for cutting scenes from trailers to

produce clips and test it in an online experiment and in a field experiment. The results provide

evidence that the production of short clips using the proposed methodology can be an effective

tool to market movies and other online content.

Keywords: Video Content Marketing, Trailers, Clips, Emotions, Facial-Expression Tracking



The Internet has drastically reduced barriers to the distribution of video content. This has

caused an unprecedented proliferation of sitcoms, scripted series, documentaries, and long- and

short-format movies. Online content aggregators are making this vast array of video material

readily available to consumers for on-demand streaming. For short-format user-generated video,

there is YouTube. For video games, there is Twitch. For broadcast and cable shows, there is

Hulu. And for movies and web series, there are Netflix and Amazon.

Given that consumers have such a wide variety of options available to entertain

themselves, a challenge for online content aggregators is how to effectively promote their video

content. Synopses, critics’ reviews, and viewer ratings are important, but the best way for a

consumer to evaluate the quality of video content and to determine if she wants to see it, is for

her to watch a sample. For that reason, video content producers have historically used trailers as

their main marketing tool. This started around 1920 when movie theatres produced snippets of

upcoming films with simple text overlays and showed these “trailing” a feature film to entice

viewers to return to the theatre. The National Screen Service, a company that wrote scripts and

produced trailers on behalf of movie studios, was founded soon thereafter. It developed a

template for trailer design that included a montage and music, and held a monopoly over the

creation and distribution of movie trailers that lasted into the 1950s, when more competitors

entered the market. Movie trailer production has evolved into an industry with dozens of

independent production houses charging upwards of $500,000 for a trailer (Last 2004).

Nowadays there are trailers not only for movies, but also for sitcoms, for video games, and even

for books. These trailers are typically two- to three-minute videos created by selecting and

editing scenes from the original video content, and adding music and other sound effects. Their

purpose is to elicit a sample of the emotions that viewers will experience when watching the full

content (Kerman 2004). At least a dozen websites are uniquely devoted to showing trailers (e.g.,

traileraddict.com, booktrailersforreaders.com, IMDb.com, comingsoon.net), and trailers

themselves have become some of the most popular forms of entertainment on the web.

But, given consumers’ ever-shorter attention spans, the original trailers for movies,

sitcoms, and video games are becoming less effective marketing tools on some digital channels,

particularly those that do not support sound (for example, email and social media). Therefore,

online aggregators have cut the trailers that they obtain from production studios down to “clips”

of 30, 20, or sometimes as little as 10 seconds. The newfound marketing problem for content

aggregators has become not one of creating promotional trailers, but rather one of editing down

the trailer content provided by trailer production studios into formats that are suitable for digital

marketing channels. However, according to a manager at Netflix, “The current approach of

providing just the first few seconds of a trailer for online viewing is not always effective.” Thus,

marketers need better tools to produce these short clips from original trailers. Yet, despite its

importance, there is no academic research that assists in the development of these tools, or that

even helps to understand if and how these short clips can induce consumers to experience a

sample of emotions and, ultimately, watch the full content.

This paper is the first, to our knowledge, to look at how marketers of video-based content

should edit trailers to produce (shorter) clips that help consumers decide whether to watch the

content. While marketing departments at content aggregators often have limited control over the

specific content of the trailer, they do have control over which scenes to select from the trailer in

producing a shorter clip. Our conceptual framework and methodology are therefore founded on

the notion that the scene is the basic semantic building block of video content and also the

elementary unit for the production of trailers and clips.



To illustrate our method, we focus on the creation of clips for comedy movies as an

application. We collected and analyzed four different types of data and propose an optimization

algorithm for clip production. First, in an online facial-expression-tracking experiment, viewers

were shown movie trailers and their intentions to watch the full movie were measured. These

data were used to calibrate a model that explains viewing preferences based on the audio-visual

scene structure of trailers, as well as the real-time emotional responses evoked from each viewer.

Second, we collected information on various ratings and box office sales of the movies in

question. This allows us to validate in-market the role of the scene structure, under the control of

the marketer, and assess the intermediary role of the emotions evoked. We account for content-

specific effects by controlling for movie ratings and variations of trailers for the same movie (we

do not code and insert content variables because the online aggregator marketer has no control

over them: instead, we measure consumers’ emotional response to that content). Third, after

understanding how the scene structure is associated with the emotional response, intention to

watch, and box office sales, we optimized the editing of trailers to produce short film clips for

use on digital channels in which sound is (e.g., IMDb) or is not (e.g., Facebook) supported.

Fourth, we validated the proposed approach in an online experiment, as well as in a large-scale

field experiment with one of the world’s largest video content aggregators. We show that (a) it is

superior to the currently used heuristic approach for producing clips and (b) it can be automated

and, thus, is scalable.

Online content aggregators can apply our methodology to design clips that deliver an

optimal emotional experience and consequently induce higher watching intentions and sales for

their video content. The proposed framework applies not only to the comedy movies / trailers /

clips that we examined as a prototypical case, but to all movie genres, and more generally to

other types of digital content (news, books, TV shows, etc.) that are being marketed via clips.

Apart from contributing to the literature on content marketing and movie marketing in particular

(Eliashberg, Hui, and Zhang 2007; Faber and O’Guinn 1984; Litman 1983), this paper also aims

to contribute to the literature on online advertising (Teixeira, Wedel, and Pieters 2012), and fits

the recent trend, in practice, toward shorter advertising messages.1

This paper is structured as follows. First, we review the relevant literature and provide a

general conceptual framework for how the scene structure of trailers affects consumers’ viewing

experience and watching intentions. Then we develop an empirical model of watching intentions

and propose an optimization tool to produce clips. Subsequently, we describe the empirical

results of the model estimation for data on comedy movies and the prediction of their box office

success. We then show the managerial implications of optimal clip production by testing the

optimal clips through simulation, an online experiment, and a large-scale field experiment.

Lastly, we discuss the insights obtained and the potential usage of our tools by marketers of

online content, and reflect on future developments regarding automation and personalization.

LITERATURE REVIEW AND CONCEPTUAL FRAMEWORK

The movie industry and its box office performance have seen ample research in marketing.

Litman (1983) showed the relationships between box office success and determinants such as

time of release, distributor, movie genre, production costs, and Academy Awards. Faber and

O’Guinn (1984) then confirmed that the effect of movie previews and movie excerpts (such as

trailers) on movie-going behavior is stronger than the effects of word of mouth and critics’

reviews. Reviews from critics were shown to play a role by Eliashberg and Shugan (1997).

Sharda and Delen (2006) showed that the success of a movie is determined by the number of

screens on which the movie is shown during the initial launch and the stars featured in the movie.

Eliashberg, Hui, and Zhang (2007) demonstrated further that the scripts of trailers could be used

to forecast a movie’s return on investment. More recently, Boksem and Smidts (2015) showed,

using electroencephalography (EEG) measures, that emotions are important predictors of movie

preferences and box office success. Despite significant research being conducted to explain the

causes and drivers of successful movies, less work has been done on movie marketing per se.

Since around 50 percent of a major Hollywood studio’s movie budget is spent on marketing,

with the other half going into production, this gap in the literature is rather puzzling.

Movie marketing is a big business today. According to a report by Statista, global cinema

advertising spending was $2.7 billion in 2015 and is expected to total $3.3 billion in 2020. The

main tool for movie and other video-based content marketers is currently the trailer. By

including scenes from the movie that elicit a sample of the emotions that viewers will experience

while watching the full movie, a trailer allows viewers to form expectations of the experience of

watching the entire film (Kerman 2004). Trailer music is used to support the emotional

experience elicited by the montage, but the majority of trailers use “library” music (Shannon-

Jones 2011) that is not from the movie’s soundtrack itself. Indeed, movie trailers have been

shown to be the most influential factor affecting consumers' intentions to watch a movie (Faber

and O’Guinn 1984).

Unfortunately, trailers for movies and other video content are no longer always directly

useful in their original format as marketing tools for online content aggregators and distributors,

given concerns over consumers’ ever-shorter attention spans. The present research focuses on the

scene as the elementary building block of trailers and as the basis for creating short “clips” for

online marketing. Figure 1 visualizes our conceptual framework. It reflects the fact that scenes

are the basic audio-visual building blocks of video content, and that the creative design of a

trailer involves a montage of selected scenes from that content. Marketers at online streaming

services have no influence on the plot, narrative, or script of the trailer, but rather need to edit

and cut scenes to produce an effective promotional clip.

Our framework in Figure 1, which applies not only to movies but generally to other

categories of video content as well, is therefore based on the recognition that the problem of

modern-day online content marketers is one of editing down content, as opposed to producing it

from scratch. The cutting of scenes from the trailer produces a clip that retains important

elements of the movie’s emotional experience, which aims to generate a positive intention to

watch the full content as a response to viewing the clip. Practitioners tend to view trailer

“cutting” more as an art than a science and tend to use ad-hoc methods in trailer design by, for

example, applying “a lot more cutting” or adding “an unexpected jolt of some kind or a

wonderful piece of music” (Hart 2014). Ultimately, to produce clips marketers often simply use

the first few scenes of the trailer. A more rigorous approach is called for.

[INSERT FIGURE 1 ABOUT HERE]

Emotions play an essential role in experiencing movies and shows, and thus in the trailers

and clips created to promote this content (Boksem and Smidts 2015). Movies draw audiences in

because they provide a concentrated emotional experience (Hewig et al. 2005; McGraw and

Warren 2010). Each movie genre is built around a prototypical narrative that is designed to

elicit a central emotion; for example, horror movies evoke fear, tragedies evoke sadness, and

comedies evoke happiness (Grodal 1997). In the present study, we focus on comedy movies and

happiness as their central emotion. Our conceptual framework in Figure 1 shows that the key

problem that needs to be addressed for marketers to create effective clips is to identify which

scenes of the original trailer evoke the highest level of that central emotion among viewers. Only

after knowing the intensity and timing of the central emotion can content marketers edit down

long-form trailers to shorter clips that are potentially even more effective than the original

trailers.

The psychology literature on events in film has focused on the scene as a unit of analysis

(Zacks, Speer, and Reynolds 2009), and has shown that viewers parse a film into events based on

the perceptual information that defines and delineates the scenes in the film (Cutting, Brunick,

and Candan 2012). Our framework therefore revolves around the audio-visual scene structure of

trailers (Figure 1). Marketers control the way consumers experience a trailer, and a clip in

particular, through the pacing and length of scene cuts. Consumer behavior research has shown

that pacing and sequencing (Galak, Kruger, and Loewenstein 2011; 2013; Ratner, Kahn, and

Kahneman 1999; Zauberman, Diehl, and Ariely 2006) — in the present context, the number and

length of scenes — and delays and interruptions (Nelson and Meyvis 2008; Nowlis, Mandel, and

McCabe 2004) — in the present context, scene transitions and cuts — are prime components

affecting consumption experiences. A fast-paced consumption leads to a decrease in enjoyment

due to overly fast satiation (Galak et al. 2013), whereas a slower consumption, sometimes with

an interruption, slows down satiation and leads to a more enjoyable overall experience (Nelson

and Meyvis 2008). We therefore predict that the pacing of the scenes in a comedy movie trailer

will exert a similar impact: happiness levels will generally improve across the sequence of scenes

in a trailer, but a fast-paced trailer with a larger number of scenes results in a lower level of

happiness, and consequently in a lower watching intention. Prior research also addressed the

impact of the consumption sequence. As Nowlis et al. (2004) demonstrated, a delay in



consumption will lead to greater consumption enjoyment because the utility of anticipating a

pleasant consumption outweighs the utility of waiting. Loewenstein and Prelec (1993) also

showed that people prefer anticipating the best outcome at the end of a consumption experience.

We therefore predict that if the key scene (typically the longest) is placed later in a comedy

trailer, happiness levels and watching intention will be improved.

Of particular importance in producing clips is sound, which includes voice, special

effects, and especially music. Past literature has shown that sound plays a dual role in

experiences: It orients attention and intensifies emotions. The intensity (volume) of the sound has

been shown to particularly amplify the emotional experience if it occurs in a synchronized

manner (Bradley and Lang 2000; Lang 1995; Lang, Bradley, and Cuthbert 1997). Therefore, we

expect a positive impact of moment-to-moment overall and music volume in a comedy trailer on

happiness and consequently on watching intention. Moreover, because some of the digital media

in which clips are commonly placed (e.g., email and social media) do not support sound, it is

important to be able to predict consumers’ reaction to a clip that does not include any audio or

music. In our framework, we therefore allow for the possibility that several aspects of audio and

music volume (including start, peak, trend, and end volumes) can have an impact on emotions

and viewing intentions. The aspects that exert an influence on the experience depend on the

context (Zauberman et al. 2006) and given a lack of prior literature, we do not formulate specific

predictions on the effects of these specific measures on happiness or watching intentions.

Given that the goal of trailer and clip design is to produce a representative emotional

experience, it is necessary to characterize the emotional content of trailer scenes. While the entire

emotional experience throughout trailer consumption may be significant, research has shown that

the peak and end points of the emotional experience are disproportionately more important in the

overall evaluation of the experience by consumers (Baumgartner, Sujan, and Padgett 1984;

Fredrickson and Kahneman 1993). Hence, we predict that scenes with high peak and end

happiness result in higher watching intentions. In addition, research has shown that the general

trend of the emotional experience — whether it is increasing, stable, or decreasing — also

impacts the overall enjoyment (Elpers, Wedel, and Pieters 2003). We therefore predict that a

positive trend in happiness results in higher watching intentions.

In Table 1, we summarize the predicted relationships between the scene-level factors of

the trailers that we focus on, and their downstream impact on emotions and watching intentions

in the context of comedies and happiness as their central emotion, as guided by the literature.

With respect to the visual scene structure, the number, length, and sequencing of scenes should

have a measurable impact on the viewer’s emotions. With respect to audio, total sound volume

as well as the music-only volume of scenes should have a direct impact on watching intentions,

as well as an indirect impact, because they evoke or intensify the central emotion of the genre.

Not all moments in the trailer are expected to have a significant impact, but rather the start, peak,

end, and trajectory of audio and emotions should matter.

[INSERT TABLE 1 ABOUT HERE]

In the next section, we explain the methodology employed to collect emotional reactions

to trailers in order to understand their role in consumers’ intentions to watch movies. We focus

on trailers for comedy movies as a prototypical case. Comedy has been the leading genre in the

last two decades, with a little more than 2,000 comedies produced and a market share of over 20

percent. The average gross revenue was about $20 million per movie. The main goal of comedy

movies is to elicit joy and laughter among the audience (McGraw and Warren 2010) and

effective trailers for comedy movies are designed to induce happiness as the central emotion

(Grodal 1997). We thus focus on happiness as the key emotion in the illustration of our

methodology, but also look at the role of surprise and disgust as secondary emotions. In the next

section, we explain the method used to parse out trailers into scenes and to measure emotions

moment to moment from a large sample of viewers’ reactions.

METHODOLOGY

An online experiment was conducted in collaboration with the company nViso2, in which

facial expressions and watching intentions were collected for participants watching 100 comedy

movie trailers. Each participant was asked to view a webpage that contained 12 comedy movie

trailers in a setup that mimicked what he or she would encounter on trailer websites from IMDb,

iTunes, or YouTube. Participants watched the trailer in their natural environment, at home, at

work, etcetera, which increases the external validity of data collection. Facial expressions were

recorded remotely through the webcams on participants' computers. At the end of the

experiment, the participants were asked to answer questions regarding their evaluations of the

trailers and the corresponding movies, as well as their intentions to watch the movies. To make

the study incentive-compatible, participants entered a lottery to win the DVD of the movie that

they most wanted to watch. Each participant also received $5 in the form of an Amazon gift card

if they completed the experiment.

Participants and Stimuli

A total of 122 paid participants were recruited online. The participants had a mean age of

24 and an age range from 18 to 68, with 28 percent being men. Participants had to have access to

a personal computer with a webcam and high-speed Internet connection and have near-perfect

vision without glasses or contact lenses. Male participants with a full mustache or beard were

excluded.

A total of 100 comedy movie trailers were taken from public access video channels.

Thirteen comedy subgenres were selected, including nine drama comedies, eight animation

comedies, seven action comedies, seven romantic comedies, four horror comedies, four indie

comedies, four parodies, two dark comedies, and one each from political comedy, sci-fi comedy,

slapstick, sports comedy, and late-night comedy. The trailer for a movie typically comes in

different versions with different lengths developed for different viewing situations and/or for

different audiences. Two different versions of the trailer for each movie were included in the

present study to separately identify trailer-specific features (e.g., scene usage, sound, volume)

from movie-specific features (e.g., stars, casting, plot). Taking the movie “Project X” as an

example, one trailer is one minute and 37 seconds and contains 29 scenes, while the second

trailer is two minutes and 26 seconds and contains 32 scenes and has louder sound on average.

Each participant was only exposed to one version of the trailer for the same movie. Overall, 100

trailers for 50 comedies were used in the study; these trailers were selected from a pool of 100

comedy trailers through a balanced incomplete block design. Data collection generated a massive

dataset of upwards of 1.5 million (participant × time × movie) emotional reactions to audio and

video content scenes. The design minimized spillover effects by randomizing the order of the

trailers shown to each participant. One randomly selected comedy trailer was used as a control

stimulus to form an individual-specific emotional baseline and was shown to all participants at

the beginning of the experiment.

Procedure

The participants were asked for consent to participate in the experiment and to be

recorded via their webcam. Participants needed to be in a well-lit environment and at most 60

centimeters (2 ft.) away from their webcams. Participants were requested to refrain from eating,

chewing, drinking, or talking. Although this request may have had some impact on the external

validity of the study, it was necessary to ensure accurate recording of facial expressions; compliance was checked afterwards.

Each participant was shown a random series of 12 trailers. The length of each trailer was

between one and three minutes. After each trailer, participants were asked five questions about

their previous exposure and their evaluation of the trailer and the movie. Watching intention

(WatchMovie) was measured on a scale ranging from one to seven, with seven being the highest

intention, indicating how much participants would like to watch the movie after they had been

exposed to the trailer. After all trailers were shown, participants were asked to answer questions

about their demographics and their general movie-going behavior. At the end of the experiment,

participants were entered in a raffle in which they had a one-in-10 chance to win a free DVD.

They were asked to choose one or more movies from any of the movies they had just watched in

the experiment. If they won, one movie was selected from the choices they made. The whole

experiment took up to 45 minutes.

Data Collection

The facial expressions of emotions were collected, calculated, and provided to the

researchers in raw data form by the company nViso, which provides real-time cloud computing

to measure consumers’ emotion reactions in online experiments. For each second that a

participant watched a trailer, a probability was calculated indicating the intensity of the emotion.

An emotional profile was created for each participant, containing the moment-to-moment

measures of happiness and other emotions. The original videos of participants’ expressions were

not retained because of privacy concerns, as outlined in IRB regulations for the study. There

were 122 participants in the online questionnaire data and 104 participants provided valid

emotion data. Ninety participants completed the entire questionnaire and had a valid emotion

profile, indicating full compliance with the instructions. Five participants did not provide valid

control data for the calibration trailer, and therefore the final sample consisted of 85 participants

from whom we obtained complete data, which is comparable to the sample sizes commonly used

by nViso in its online tests. For each of the 100 trailers, the data from participants who had seen

the movie previously were removed. Therefore, the number of participants per trailer varied from

three to 19, with an average of 8.43 (SD = 3.14) participants.

Emotion Measurements and Their Validity

Measuring emotions had been a long-standing problem (Mauss and Robinson 2009) until

Ekman and Friesen developed the Facial Action Coding System (FACS) to systematically

categorize emotions by coding instant facial muscular changes (Ekman and Friesen 1978). The

FACS decomposes facial movements into anatomically based “action units,” reflecting the

muscular activity that produces the facial appearance. For example, happiness is characterized by

two primary and three secondary action units. Recently, the Expression Descriptive Units that

measure the interactions among facial muscular movements (Antonini et al. 2006), as well as

Appearance Parameters that consider global facial features (Sorci et al. 2010), have been used to

augment emotion recognition. Although initially emotions had to be assessed by trained coders,

nowadays several off-the-shelf software solutions are available to provide automatic and

accurate moment-to-moment identification of emotions (Fasel and Luettin 2003).

This software has been used in previous marketing studies (Teixeira, Wedel, and Pieters

2012; Teixeira, Picard, and Kaliouby 2014). It has been proven to outperform the work of non-

expert coders and to be approximately as accurate as that of expert coders (Bartlett et al. 1999).

The automated algorithm used by nViso splits the video recording of the user’s face into separate

frames, and then uses the facial expression in each static frame to identify the probability of the

occurrence of six basic emotions (happiness, surprise, fear, disgust, sadness and anger) based on

a Multinomial logit model. The explanatory variables include the measurements from Ekman’s

FACS, the Expression Descriptive Units, and the Appearance Parameters (MacCallum and

Gordon 2011; Sorci et al. 2010; more details are provided in Web Appendix I).
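For readers unfamiliar with this model class, the sketch below illustrates the general form of a multinomial logit (softmax) mapping from facial-feature measurements to probabilities over the six emotions; the features, weights, and intercepts are hypothetical placeholders, not nViso's proprietary implementation:

```python
import numpy as np

EMOTIONS = ["happiness", "surprise", "fear", "disgust", "sadness", "anger"]

def emotion_probabilities(features, weights, intercepts):
    """Multinomial logit: map one frame's facial-feature vector to emotion probabilities.

    features:   1-D array of facial measurements for a single video frame
                (e.g., FACS action-unit activations) -- hypothetical inputs.
    weights:    array of shape (6, n_features), one row of coefficients per emotion.
    intercepts: array of shape (6,), one intercept per emotion.
    """
    utilities = intercepts + weights @ features   # linear predictor for each emotion
    utilities = utilities - utilities.max()       # subtract the max for numerical stability
    exp_u = np.exp(utilities)
    probs = exp_u / exp_u.sum()                   # softmax: probabilities sum to one
    return dict(zip(EMOTIONS, probs))
```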

The algorithm has been validated using 11-fold cross validation on a database of 1,271

images of facial expressions, manually coded for the expression of emotions by 33 human

coders. This cross-validation yielded a (normed) correlation of .76 (Sorci et al. 2010, Table 4, p.

800). This result is comparable to those for similar automated algorithms, for which

classification accuracies ranging from .78 to .88 have been reported (Brodny et al. 2016; McDuff

et al. 2013). Sorci et al. (2010) also reported other supportive evidence on the performance of the

nViso algorithm and compared it to neural networks, based on Histogram Intersection and

Kullback-Leibler measures.

Movie Trailer Video and Audio Variables

The movie trailer video and audio content of all 100 trailers was analyzed using image

and audio processing software, which yielded the following variables.

Scene cuts: Scene cuts in the movie trailers were detected automatically using the “Scene

Detector” (http://www.scene-detector.com), which is software that detects the scene boundaries

based solely on the frame image data. Based on the scene cuts, we calculated the following

variables for use in the analysis: the total number of scenes, the average length of scenes, and the

location of the longest scene in the trailer.
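As an illustration (not the software's actual output format), the following sketch derives these three scene-structure variables from a hypothetical list of detected scene-cut times:

```python
def scene_structure(cut_times, trailer_length):
    """Derive scene-structure variables from detected scene-cut timestamps.

    cut_times:      sorted scene-cut times in seconds, e.g. [4.0, 9.5, 17.2] (hypothetical).
    trailer_length: total trailer duration in seconds.
    """
    boundaries = [0.0] + list(cut_times) + [trailer_length]
    lengths = [end - start for start, end in zip(boundaries[:-1], boundaries[1:])]
    scene_num = len(lengths)                                # total number of scenes
    avg_scene_length = sum(lengths) / scene_num             # average scene length in seconds
    longest_scene_index = lengths.index(max(lengths)) + 1   # 1-based position of the longest scene
    return scene_num, avg_scene_length, longest_scene_index
```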



Audio Volume: We extracted two types of volume data. One is Total Volume: Amplitude

data were extracted every millisecond from MP3 audio files using the sound processing software

SoX (http://sox.sourceforge.net). The absolute values were averaged on a second-by-second

basis to match the video data. The other is Music Volume: By removing vocals utilizing SoX, the

music was separated from the audio files, and its volume was calculated as described above. For

both the total volume data and total music volume data, based on our conceptual framework, we

calculated the following variables to be used in the analysis: the moment-to-moment volume

across the trailer, the trend of volume over the course of the trailer, the average volume in the

start scene, the average volume in the end scene, and the scene with peak volume. Figure 2

shows an example of total volume and music volume from one movie trailer (the vertical dashed

lines indicate the scene cuts).

[INSERT FIGURE 2 ABOUT HERE]

The trailers contain, on average, 23 scenes (Min = 1, Max = 56, SD = 14.23),

with an average length of 11 seconds (SD = 17.6). The total volume is .053 dB (SD = .025),

while the peak volume is .097 dB (SD = .046). The music volume is substantially lower, .017 dB

(SD = .016), with an average peak volume of .041 dB (SD = .038).
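For illustration, the second-by-second volume series described above could be computed as in the sketch below; it assumes the audio has first been converted to WAV and uses scipy rather than SoX, which is a substitution on our part:

```python
import numpy as np
from scipy.io import wavfile

def per_second_volume(wav_path):
    """Average absolute amplitude for every second of audio, to match the 1 Hz video grid."""
    rate, samples = wavfile.read(wav_path)        # rate = samples per second
    if samples.ndim > 1:                          # average stereo channels down to mono
        samples = samples.mean(axis=1)
    samples = np.abs(samples.astype(float))
    n_seconds = len(samples) // rate              # drop the trailing partial second
    samples = samples[: n_seconds * rate]
    return samples.reshape(n_seconds, rate).mean(axis=1)
```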

Emotion Variables

Intensities of happiness were measured for each participant on a second-by-second basis.

Based on our conceptual framework, aggregate measures of this emotion for each trailer were

calculated as follows: Start, the total emotional intensity during the first scene; the Trend,

calculated via a linear fit to each emotion curve; Peak, the average happiness of the scene with

the highest average emotion level; PeakIndex, the location of that scene in the trailer; and End,

the total emotional intensity during the last scene. Figure 3 shows an example of moment-to-

moment happiness and its summary measures for the movie trailer for Men in Black 3. Several

comedy subgenres may rely on other emotions concomitant with happiness, most importantly

surprise in spoof and action comedies, and disgust in dark, satire, and horror comedies (McGraw

and Warren 2010). Therefore, we retained moment-to-moment surprise and disgust as secondary

emotions in comedy trailers and calculated the Start, Peak, Trend, and End measures for these

two emotions as well. Tests based on a random effects linear model show that there is no

significant trend in any of the emotion variables across the sequence in which the trailers are

shown.

[INSERT FIGURE 3 ABOUT HERE]
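A possible implementation of these aggregate measures, assuming a second-by-second happiness series and each second's scene index are available, is sketched below; the variable names are ours and purely illustrative:

```python
import numpy as np

def happiness_summaries(happiness, scene_index):
    """Aggregate a second-by-second happiness series into Start, Trend, Peak, PeakIndex, End.

    happiness:   1-D array with one happiness intensity per second of the trailer.
    scene_index: integer array of the same length giving each second's scene number.
    """
    scenes = np.unique(scene_index)                       # scene labels in ascending order
    scene_means = np.array([happiness[scene_index == s].mean() for s in scenes])
    return {
        "Start": happiness[scene_index == scenes[0]].sum(),               # total intensity in the first scene
        "Trend": np.polyfit(np.arange(len(happiness)), happiness, 1)[0],  # slope of a linear fit
        "Peak": scene_means.max(),                                        # mean happiness of the peak scene
        "PeakIndex": int(scenes[scene_means.argmax()]),                   # location of that scene in the trailer
        "End": happiness[scene_index == scenes[-1]].sum(),                # total intensity in the last scene
    }
```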

Control Variables

While there are countless variables that one could incorporate into the model, we chose to

incorporate the ones readily available to marketers of online video content aggregators. In that

spirit, we obtained several control variables for the movies, extracted from the online databases

“Internet Movie Database” (IMDb), owned by Amazon, “Rotten Tomatoes,” and “The

Numbers,” including MPAA Reviews, ratings (from IMDb), number of ratings above 3.5 (from

Rotten Tomatoes, log-transformed), and release time (whether the movie is released during the

summer or Christmas holiday season).

STATISTICAL MODEL

In our conceptual framework (see Figure 1 and Table 1): a) audio and video features

affect the moment-to-moment emotional experience; b) aggregate measures of emotions,

including start, peak, and end levels, affect intentions to watch the movie; c) in addition to their

indirect effects via emotions, some aggregate audio and video measures may affect watching

intentions directly; and d) finally, watching intentions, together with aggregate emotion measures

and high-level movie characteristics, such as reviews and ratings, impact box office revenue.

The statistical methodology reflects the postulated theoretical relations. Whereas prior

research has mostly used emotions as explanatory variables, here we model the moment-to-

moment emotional response jointly with the end-point behaviors of prime interest (watching

intention and box office revenues). There are three sub-models combined in this joint model, one

for the (longitudinal) happiness data, one for watching intention data, and one for box office

revenue data. The happiness and watching intention sub-models are connected through

individual-specific random effects (Tsiatis and Davidian 2004). We account for unobserved

individual differences through a hierarchical formulation. Given that there are over 40 predictor

variables, we simultaneously apply Bayesian variable selection to each of the three model

components to select the specific measures that predict watching intention.

First, the logit-transformed happiness probabilities for individual $i$ watching trailer $j$ at time $t$ are denoted as $h_{ijt}$ and are modeled as

(1) $h_{ijt} = \theta_{ijt} + e_{ijt}$.

Here, $\theta_{ijt}$ is the underlying true emotion trajectory (modeled as described below). Whereas previous research treated moment-to-moment measurements of emotions as fixed exogenous variables (for example, Teixeira, Wedel, and Pieters 2012), here measurement/classification error, denoted as $\xi_{ijt}$, is accommodated. Because $\xi_{ijt}$ is not separately identified from other sources of error, say $\varsigma_{ijt}$, it is subsumed in the model's error term: $e_{ijt} = \xi_{ijt} + \varsigma_{ijt}$. The error terms $e_{ijt}$ are assumed to be independently Normally distributed, and because the emotions are classified independently on a frame-by-frame basis, it is not unreasonable to assume them to be uncorrelated over time. In equation (1), the underlying emotion trajectory $\theta_{ijt}$ is expressed as

(2) $\theta_{ijt} = \mathbf{W}_{1i}(t) + \mathbf{X}_{1j}\boldsymbol{\beta} + \zeta_1 S_{jt} + \zeta_2 V_{jt} + \zeta_3 M_{jt}$

Here, $\mathbf{W}_{1i}(t)$ are subject-specific random effects (see below). As for the moment-to-moment audio and video features, $S_{jt}$ represents the index of the scene at time $t$; $V_{jt}$ represents the total audio volume of trailer $j$ at time $t$; and $M_{jt}$ represents the music volume of trailer $j$ at time $t$. The matrix $\mathbf{X}_{1j}$ contains the trailer-specific aggregate video and audio variables (start, peak, end, and trend) described in the previous section.

Second, an ordered logit model is developed for the watching intentions, with $y_{ij}$ representing individual $i$'s intention to watch movie $j$, modeled as a function of the latent variable $y_{ij}^{*}$ as follows:

(3) $y_{ij}^{*} = \mathbf{X}_{2ij}\boldsymbol{\alpha} + \mathbf{W}_{2i}$,

(4) $y_{ij} = \begin{cases} 1, & y_{ij}^{*} < \tau_1 \\ d, & \tau_{d-1} < y_{ij}^{*} < \tau_d, \quad d = 2, \ldots, D-1 \\ D, & y_{ij}^{*} > \tau_{D-1} \end{cases}$

Here, $D = 7$ and $\mathbf{W}_{2i}$ contains subject-specific effects, similar to $\mathbf{W}_{1i}(t)$. The threshold parameters satisfy the order constraint $\tau_1 < \tau_2 < \ldots < \tau_{D-1}$, and the first and last thresholds are fixed for identification (Lenk, Wedel, and Böckenholt 2006). The matrix $\mathbf{X}_{2ij}$ contains the predictor variables, including the aggregate measures of emotions and the aggregate video and audio variables extracted from the movie trailers, which are our main explanatory variables. This model is linked to the model for happiness through the dependence between $\mathbf{W}_{1i}(t)$ and $\mathbf{W}_{2i}$. These random effects capture unobserved individual-specific effects in the intercept and the trend of happiness, respectively. Specifically, $\mathbf{W}_{1i}(t)$ and $\mathbf{W}_{2i}$ are expressed as:

(5) $\mathbf{W}_{1i}(t) = u_{1i} + u_{2i}\, t$
    $\mathbf{W}_{2i} = \nu_1 u_{1i} + \nu_2 u_{2i} + u_{3i}$

The random effects for the intercept and the slope are denoted as $u_{1i}$ and $u_{2i}$, and together with $u_{3i}$ they are assumed to follow Normal distributions. The parameters $\nu_1$ and $\nu_2$ capture the association between the happiness and watching intention models.
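For intuition only, the sketch below shows how an ordered logit turns a latent intention into probabilities over the seven response categories; the latent utility and threshold values are illustrative, not estimates from this study:

```python
import numpy as np

def ordered_logit_probs(y_star, thresholds):
    """Probabilities of the D ordinal response categories for a given latent utility y*.

    y_star:     systematic part of the latent intention (equation 3).
    thresholds: increasing cut points tau_1 < ... < tau_{D-1}; the values used below are illustrative.
    """
    tau = np.asarray(thresholds, dtype=float)
    cdf = 1.0 / (1.0 + np.exp(-(tau - y_star)))   # logistic CDF evaluated at each threshold
    cdf = np.concatenate(([0.0], cdf, [1.0]))
    return np.diff(cdf)                            # P(y = 1), ..., P(y = D)

# Seven-point watching-intention scale (D = 7) with hypothetical thresholds
probs = ordered_logit_probs(y_star=1.2, thresholds=[-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
```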

Third, we model log-box office revenues at the movie level as a function of predicted watching intentions, as follows:

(6) $\ln(r_j) = \nu_0 + \nu\, y_{\cdot j}^{*} + \mathbf{X}_{3j}\boldsymbol{\lambda} + \varepsilon_j$,

Here, $y_{\cdot j}^{*} = \sum_i y_{ij}^{*} / N$, and $\mathbf{X}_{3j}$ contains the aggregate emotion measures and control variables described in the previous section. Note that equations (1) to (6) constitute a system of simultaneous equations that are jointly estimated. Gross box office revenue for each of the movies corresponding to the trailers used in the study was obtained from the IMDb database for the year in which the study was conducted.

To identify a parsimonious model with fewer explanatory variables, we apply the Gibbs Variable Selection (GVS) procedure developed by Dellaportas, Forster, and Ntzoufras (2000) to efficiently search for the best subset of predictor variables in $\mathbf{X}_{1j}$, $\mathbf{X}_{2ij}$, and $\mathbf{X}_{3j}$ for each of the three model components, respectively. We use a variable selection approach because it allows us to use a limited subset of predictor variables in hold-out data collection and model validation. In the GVS approach, the coefficients of the regression model are assumed to have spike-and-slab prior distributions, a mixture of a point mass at 0 and a diffuse distribution elsewhere. Specifically, an auxiliary indicator variable $I_k$ is introduced for each covariate in equations (2), (3), and (6), with $I_k = 0$ indicating the absence of covariate $k$ from the model and $I_k = 1$ indicating its presence. A (generic) regression coefficient $\beta_k$ in any of these equations is then specified as:

(7) $\beta_k = \begin{cases} 0, & \text{if } I_k = 0 \text{ (spike)} \\ \eta_k, & \text{if } I_k = 1 \text{ (slab)} \end{cases}$

The joint density is $P(I_k, \beta_k) = P(\beta_k \mid I_k) P(I_k)$. The effect-size parameter $\eta_k$ is assumed to have a mixture prior: $P(\eta_k \mid I_k) = (1 - I_k) \times N(\tilde{\mu}, \tilde{\tau}^2) + I_k \times N(0, \tilde{\sigma}^2)$, where $(\tilde{\mu}, \tilde{\tau}^2)$ is a pseudo-prior that requires tuning and $\tilde{\sigma}^2$ is the fixed prior variance of $\eta_k$. A Bernoulli prior distribution is assumed for the indicator: $I_k \sim \text{Bern}(.5)$.

The model is estimated with Markov Chain Monte Carlo (MCMC), using the JAGS

software, with the code provided in Web Appendix II.
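To connect the indicators in equation (7) to the posterior inclusion probabilities reported later (Table 3), here is a minimal post-processing sketch; the indicator draws, covariate names, and cutoff are placeholders rather than output of the actual JAGS run:

```python
import numpy as np

def inclusion_probabilities(indicator_draws, names, cutoff=0.2):
    """Posterior inclusion probabilities from Gibbs draws of the indicators I_k.

    indicator_draws: array of shape (n_draws, n_covariates) holding sampled 0/1 indicators
                     (placeholder for the MCMC output).
    names:           covariate names, length n_covariates.
    cutoff:          heuristic threshold used to retain covariates.
    """
    probs = indicator_draws.mean(axis=0)                        # share of draws with I_k = 1
    selected = [name for name, p in zip(names, probs) if p >= cutoff]
    return dict(zip(names, probs)), selected
```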

OPTIMAL DESIGN OF CLIPS

The integrated model described above is used to produce optimal short movie clips that

can be inserted in emails, messages, and social media, and in the apps, landing pages, and user

interfaces of content providers. Online advertising channels are idiosyncratic in the video

formats they accept, specifically regarding whether or not they support audio. For example,

YouTube plays videos with sound by default, while on Facebook, 85 percent of videos are

watched without sound (Patel 2016), and Netflix only allows GIF format clips without audio and

subtitles in its promotional emails. We produce clips of about 30 seconds in length (but, in

principle, any desired length is possible) and design optimal clips both with and without audio.

For the former, we use the full model described above and for the latter, we recalibrate the model

while excluding the audio variables.

Let $S_j = \{S_{j,1}, \ldots, S_{j,T_j}\}$ be the sequence of scene indicators across $T_j$, the length of trailer $j$, and let there be $K_j = n(S_j)$ scenes in trailer $j$. The criterion optimized is the mean of the posterior distribution of the predicted watching intention of the movie:

(8) $y_j(S_j^{*}) = \int y_{\cdot j}(S_j^{*} \mid \Phi)\, f(\Phi \mid \text{data})\, d\Phi$,

where $\Phi$ contains all model parameters, $f(\Phi \mid \text{data})$ denotes the posterior distribution of the parameters, and $S_j^{*} = \{S_{j,1}, \ldots, S_{j,T_j^{*}}\}$ denotes the sequence of scene indicators across the length of the clip, $T_j^{*}$. The algorithm we propose to find an optimal clip of approximately 30 seconds (or any other length) for trailer $j$ is a backward elimination algorithm that, one by one, eliminates the scenes whose removal reduces $y_j(S_j^{*})$ the least. The algorithm works as follows:

1. Start with the complete trailer, $S_j = \{S_{j,1}, \ldots, S_{j,T_j}\}$;

2. Eliminate each scene $k = 1, \ldots, K_j^{*}$ in turn, delete the corresponding elements from $V_j$, $M_j$, and $h_{ij}$, and calculate $y_j(S_j^{-k}) = \sum_{r=1}^{R} \sum_{i=1}^{N} y_{ij}(S_j^{-k} \mid \Phi^r) / NR$, with $r$ a draw from the Gibbs sampler;

3. Retain the clip without scene $\ell = \arg\min_k [y_j(S_j^{-k})]$ by removing all time periods $t$ for which $S_{j,t} = \ell$, and set $K_j^{*} = K_j - 1$; and

4. If $|T_j^{*} - \Delta| < \epsilon$, stop; if not, return to step 2. In the application, we use $\epsilon = 1$ second and set $\Delta$ to 30 seconds.

The proposed backward elimination algorithm is part of a class of “greedy” algorithms

that make a locally optimal decision at each stage of the algorithm. For example, at the first pass for trailer $j$, it eliminates the most redundant scene from the trailer to produce a clip with $K_j - 1$ scenes. This simple backward selection algorithm is computationally attractive because, for trailer $j$, it will provide a solution in fewer than $K_j$ steps. It avoids the need to enumerate all

possible configurations of scenes, which would be required to find the globally optimal solution.

In some cases, the proposed backward elimination strategy may thus not produce a globally

optimal solution, but it will yield a locally optimal approximation of that solution (Couvreur and

Bresler 2000). For online content aggregators, this approach has two benefits. It allows the

movie marketer to optimally create clips of any length shorter than the original trailer, and it can

be done very fast.
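As a schematic illustration (not the authors' implementation), the backward elimination procedure above can be rendered in Python as follows; predict_intention is a placeholder for the posterior-mean criterion in equation (8), which the estimated model would supply:

```python
def optimize_clip(scenes, scene_lengths, predict_intention, target=30.0, tol=1.0):
    """Greedy backward elimination of scenes until the clip is about `target` seconds long.

    scenes:            scene identifiers of the trailer, in order.
    scene_lengths:     dict mapping scene identifier -> scene length in seconds.
    predict_intention: function taking a list of retained scenes and returning the
                       posterior-mean predicted watching intention (stand-in for equation 8).
    """
    clip = list(scenes)
    while sum(scene_lengths[s] for s in clip) > target + tol and len(clip) > 1:
        # Predicted intention of the clip with each remaining scene removed in turn.
        candidates = {s: predict_intention([t for t in clip if t != s]) for s in clip}
        # Drop the scene whose removal hurts the predicted intention the least.
        clip.remove(max(candidates, key=candidates.get))
    return clip
```

Following the verbal description of the procedure, each pass drops the scene whose removal reduces the predicted watching intention the least, and the loop stops once the clip reaches roughly the target length.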

RESULTS

Model Comparison

We first test alternative specifications of the joint model with the purpose of investigating

the contribution of the happiness and secondary emotion measures, the video characteristics, the

audio characteristics, and the intention as a predictor of box office revenue. We calculate several

measures of model fit, including Akaike Information Criterion (AIC), the Deviance Information

Criterion (DIC; Spiegelhalter et al. 2002), and the Watanabe-Akaike Information Criterion

(WAIC). The latter was proposed by Gelman et al. (2014) as a computationally convenient

predictive measure that is based on the entire posterior distribution rather than a point estimate

(as is the case for the AIC and DIC statistics). A smaller value for these statistics indicates a

better fitting or predicting model.

We examine six models: 1) the full model, and five models for which we remove the

following sets of predictor variables from the full model: 2) the measures of the secondary

emotions surprise and disgust, 3) the measures of all emotion variables of happiness, surprise,

and disgust, 4) all video variables, 5) all audio variables, and 6) intention as a predictor of box

office sales. Table 2 shows the AIC, DIC, and WAIC statistics for each of the models. The model

without the audio variables shows the largest reduction in (predictive) fit relative to the full

model, followed by the model without the video variables. The importance of audio that this

reveals has ramifications for the design of clips for media that do not support audio. Dropping

watching intention as a predictor of box office success significantly reduces the (predictive) fit of

the model. However, while dropping all emotion measures simultaneously reduces model fit

significantly, the model without the (start, trend, peak, and end) measures of surprise and disgust

fits better than the full model. Apparently, for our sample of movie trailers, these two secondary

emotions do not play a significant role in the formation of watching intentions. We therefore

report the results of the model without the secondary emotions (model 2, main model) below.

[INSERT TABLE 2 ABOUT HERE]

Bayesian Variable Selection

Table 3 displays the posterior means of the inclusion probabilities obtained from the

Bayesian variable selection. The table shows that for the happiness model, only the music peak

has a very low inclusion probability (.046). For the watching intention model, variables that have

very low inclusion probabilities are the sequence number of the scenes (.030), the average scene

length (.053), the peak volume (.016), the end volume (.028), the music peak (.029), the music end volume (.029), and, to a lesser extent, the trend in music volume (.142). Thus, while most of the

audio and video variables affect moment-to-moment happiness, only a few (longest scene, total

and music volume in the first scene, and the trend in volume) affect watching intention directly.

Almost all emotion measures affect watching intention, but the inclusion probability of average

happiness in the watching intention model is relatively low (.196). For the box office model, only

the IMDb ratings have a low inclusion probability (.055).

In our application, a very large number of models is searched over and even the set of

most promising models may be large. We therefore use a heuristic cutoff on the inclusion

probabilities to select the “best” model and based on Table 3, a standard cutoff of .2 is employed.

We investigate the sensitivity to the cutoff by varying it from .10 to .35 in steps of .05 and re-

estimating the model with the variables included at that cutoff. All coefficients of predictor

variables in the emotion and box office models that are significant (have a credible interval that

does not cover zero) stay the same, regardless of the cutoff used, but for the watching intention

model, as the cutoff changes there are some relatively minor variations in the significance of

coefficients.3 We discuss the estimates of the final model next.

[INSERT TABLE 3 ABOUT HERE]

Interpretation of the Parameter Estimates

In Table 4, we present the estimates of the final model. The table shows that video

features directly impact momentary feelings of happiness. The level of happiness increases with

increasing scene sequences (Scene). The number of scenes in a trailer (SceneNum) has a negative

effect on happiness, confirming that fast-paced comedy trailers tend to result in a lower level of

happiness. We find that longer scenes placed later in the trailers (SceneLongestInd) increase

happiness significantly. These findings confirm our predictions (see Table 1). As for audio, we

find that its moment-to-moment volume has a significant positive instantaneous effect on

happiness, as predicted (Table 1). We did not have specific predictions on the effects of the

sound volume measures, but peak volume (VolumePeak) and increasing trend in volume

(VolumeTrend) decrease happiness. End volume (VolumeEnd) has a positive effect. Music

volume (Music) has a negative moment-to-moment effect on happiness, but louder music at the

start of the trailer improves happiness (MusicStart).

The moment-to-moment experience of happiness throughout the trailer positively affects

watching intentions. As predicted by peak-end theory (Fredrickson and Kahneman 1993), both

the peak happiness (HappinessPeak) and the happiness experienced at the end of the trailer

(HappinessEnd) have a positive effect on watching intentions (Table 1). In line with our

prediction, an increasing trend in happiness (HappinessTrend) also affects watching intentions

positively. Finally, the association between the random intercepts in the happiness and watching

intention models is significant, with a higher variation in the level of happiness being associated

with lower watching intentions. These findings confirm our predictions (Table 1).

Alongside the indirect effects of video and audio variables on watching intentions via the

happiness experienced, there are also direct effects. Longer scenes placed later in the trailers

(SceneLongestInd) increase watching intentions significantly, over and above their effect on

happiness, as predicted (Table 1). Further, increasing volume (VolumeTrend) decreases not only

happiness but also watching intentions directly. However, louder music at the start of the trailer

(MusicStart), while improving happiness, has a negative direct effect on watching intentions.

In the box office revenue model, several of the control variables have a significant

impact, including ratings from Rotten Tomatoes (NumRatingabove3.5; positive effect) and

MPAA reviews (MPAA = PG, PG13, and R; negative effect). Finally, and importantly, watching

intentions (WatchIntention), as predicted by the watching intention model, have a significant

positive impact on box office revenues.

[INSERT TABLE 4 ABOUT HERE]

MANAGERIAL IMPLICATIONS: OPTIMAL CLIP PRODUCTION AND TESTING

Optimal Clip Production and Predictions

The parameter estimates in Table 4 were used as inputs to the stepwise scene selection

algorithm to produce an optimal movie clip of about 30 seconds in length for each of the 50 pairs

of trailers. Because some media for which these clips are intended do not allow for sound, clips

were produced both with and without sound. For this purpose, two models with and without the

sound variables were used. As a benchmark for comparison, we use the current practice to

produce clips by selecting the first 30 seconds of the trailer. The results are shown in Table 5.

Optimal movie clips with audio: Table 5 shows that the optimal clips consist on average

of 3.6 scenes, while the benchmark has more and thus shorter scenes, 4.8 on average. The

predicted average watching intention of the optimal clips (7-point scale) is considerably higher

(3.83) than that of the benchmark clips (2.91). The predicted watching intention of the original

trailer is 3.32 (SD = .55), and thus the shorter optimal clip results in an even higher intention to

watch the movie than the original trailer. The average difference in watching intention between

the optimal and benchmark clips is almost a full point (.92) on the seven-point scale, and over 90

percent of the optimal clips have higher watching intentions than the corresponding benchmark

clips. These watching intentions translate to a predicted 3.17 percent improvement in expected

box office revenue for the optimal clips.
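Under the log-linear specification in equation (6), and holding the other predictors fixed, a change $\Delta y_{\cdot j}^{*}$ in the average predicted intention multiplies expected revenue by $\exp(\nu\, \Delta y_{\cdot j}^{*})$, so the implied percentage improvement is $100 \times [\exp(\nu\, \Delta y_{\cdot j}^{*}) - 1]$, with $\nu$ the estimated intention coefficient; this is one way to read the revenue improvements reported in this section.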

Recall that the data contains two versions of the trailer for each movie. For each of the

two versions of a trailer, we produced a clip using our algorithm. On average, the difference in

predicted watching intention between these two clips for the same movie was 1.29 (SD = .76).

From each pair of clips, we selected the one with the highest predicted watching intention. These

clips had an average predicted watching intention across all movies of 4.21 (SD = .92), which

translates to a 4.80 percent predicted increase in box office revenue. Thus, because the two

different trailers have a wider range of scenes from the movie, selecting the best clip from the

pair results in considerably higher watching intentions and predicted box office success.

[INSERT TABLE 5 ABOUT HERE]



Optimal silent movie clips: For clips produced without audio (“silent clips”), the optimal

clip contains 3.6 scenes, on average, while the benchmark contains 4.8 scenes. These results do not noticeably differ from those for clips with audio. The predicted watching intentions for the

optimal silent clips are 3.80 (seven-point scale), which is only somewhat lower than those for the

optimal clips with audio (Table 5). Yet, 42 percent of the silent clips result in a higher watching

intention than their counterparts with audio. Note that this does not reflect a lack of contribution

of audio to watching intentions, but reveals that it is possible to eliminate scenes from the trailer

in such a way that even the resulting silent clips still generate high watching intentions.

The optimal silent clips result in higher watching intentions than the original trailer (3.32)

and the benchmark silent clips (3.28). The average predicted difference in watching intention

between the optimal and benchmark silent clips is about a half point (.5), and over 90 percent of

the optimal silent clips have higher watching intentions than the silent benchmark clips. These

higher watching intentions result in a predicted 1.75 percent improvement in expected box office

revenue. We conducted a similar analysis using the best of the two versions of the clip for each

movie. On average, the difference in predicted watching intention between the two silent clips

for the same movie was .79 (SD = .45). For the best silent clip in each pair, the average predicted

watching intention is 4.07 (SD= .66), which translates to a 2.45 percent increase in box office

revenue.

We thus demonstrate the beneficial effects of optimizing movie clips via simulation. To

investigate consumers’ response to the actual clips in hold-out validations, we conduct two

experiments. The first is an online experiment and the second is a large-scale field experiment.

Evaluation of Optimal Clips with an Online Experiment

We selected the five best-performing clips with audio and the five best-performing silent

clips from the simulation analyses. The movie titles included “Dark Shadow,” “Mirror Mirror,”

“The Odd Life of Timothy Green,” “Project X,” “Rock of Ages,” “Some Guy Who Kills

People,” “Wanderlust,” and “What to Expect When You Are Expecting.” Two clips overlapped

between the two sets of five: “Project X” and “Mirror Mirror.” For each clip, we produced the

actual benchmark and optimal movie clips by editing the digital video file of the trailers based on

the proposed procedure. The clips were produced in GIF format and were about 30 seconds long.

One-hundred and seventy-five undergraduate and graduate students were recruited for the

experiment and participated for extra course credit. To make the study incentive-compatible,

participants were entered in a lottery for the chance to win a $50 gift card to be used to go and

see the movie they liked the most. Some platforms that do not allow for sound (such as

Facebook.com) do allow clips to show subtitles. We therefore also added subtitles to the

optimized silent clips, as this may increase comprehension of the narrative. We showed each

participant five clips in a randomized order. To avoid spillover effects, we only showed one

(randomly selected) version of a clip (optimized or benchmark) to each participant. After

watching each clip online, the participants were asked to answer three questions based on seven-

point scales to assess their evaluation of the clip (How much do you like this movie clip?), the

movie (How would you rate the movie based on this trailer?), and their intention to watch the

movie (Would you like to watch this movie?).

We obtained usable data on 169 respondents. The three evaluation
measures were analyzed jointly with MANOVA, which reveals strong evidence of the superior performance of

the optimized clips over the benchmark clips with audio (p < .001, partial eta-squared η² = .08)
and without audio (p < .001, η² = .08), and of the effect of adding subtitles to silent clips (p
< .001, η² = .13). Relative to the benchmark, the optimization procedure significantly improves
the measures for each type of movie clip, with moderate effect sizes. The results for the average

of the three evaluation measures are presented in Table 6 for each of the five movie clips

separately. For all three types of clips, improvement over the benchmark is among the largest for

“Mirror Mirror”: For the clip with audio, evaluations increase by 21.9 percent (p = .003, η²
= .16), while for silent clips without subtitles, they increase by 17.1 percent (p = .056, η² = .16),
and for silent clips with subtitles, they increase by 32.3 percent (p = .002, η² = .21). The benefits
of the proposed procedure are smallest for silent clips without subtitles; adding subtitles may thus be
important when auto-play videos are muted, something companies currently do not
seem to do. This hold-out study, in which actual clips were produced and presented to a new

sample of respondents, thus provides evidence of the effectiveness of the proposed model and

optimization procedure.

[INSERT TABLE 6 ABOUT HERE]

Evaluation of Optimal Clips with a Field Experiment

To test whether optimized clips indeed improve the viewing behavior of actual
customers, we worked with Netflix’s messaging team and conducted a field experiment to test

our approach on an email campaign for one of Netflix’s original romantic comedy movies, just

before its launch in August 2017. First, we collected data in a facial-tracking experiment

(through nViso) with a sample of 41 participants viewing the trailer of this movie, using the

procedure described in the methodology section. Using the model estimates reported in Table 4,
usable facial-expression data from 40 participants, and the scene and other characteristics
of the trailer, we produced silent optimized and benchmark clips of 19 seconds in GIF format
(without subtitles), according to Netflix’s requirements. In addition to comparing the optimized
clip to the benchmark clip, we also compared it to a static image, as static images are still frequently
used in Netflix’s email campaigns.

Using a stratified sampling procedure, Netflix users were allocated to strata based on a

unique combination of their region, device type (e.g., iPhone, Android, Apple TV, etc.),

payment type (e.g., debit, credit, etc.), tenure (1, 2, … years), and plan (basic, standard,

premium). Then, using machine-generated random numbers, the participants in each stratum

were randomly assigned to one of three conditions: 1) the baseline with a static image, 2) the

benchmark clip, and 3) the optimized clip. Each condition has an equal number of participants

from each of the strata. In total, 40,000 Netflix customers from non-U.S., English-speaking

countries were involved. Each participant received a promotional email from Netflix with the

optimized clip, benchmark clip, or static image embedded. The emails had the same subject line

and supporting text, and the clips looped. Next, we report Netflix’s standard statistics on the

variables directly related to streaming behavior, as well as effect sizes.4

First, compared to the static image, the optimized and benchmark clips perform

significantly better in terms of the average number of Streaming Hours with a moderate effect

size (an average lift of 1.60 percent, p < .001, Cohen’s h = .253), showing that customers are

more receptive to clips than to static images. In addition, while a .30 percent higher Watching

Percentage (customers who watched at least 70 percent of the movie) for the optimized clip

relative to the benchmark is not significant, the optimized clip reduces the percentage of Short

Viewers (customers who viewed less than 6 minutes of the movie) by 10.5 percent (p = .058,

Odds ratio = 1.121), and reduces the Bad Player Ratio (percentage of Short Viewers divided by

Watching Percentage) by 12.5 percent (p < .001, Odds ratio = 1.170), compared to the

benchmark clip. Although the effect sizes are relatively small, the optimized clip enhanced

streaming behavior compared to the benchmark clip.



The results of the field test, while preliminary, are encouraging. This is especially the

case because the results rely only on a single silent movie clip, because the Netflix original

comedy movie was not part of the model calibration data, because the samples came from very

different populations, and because streaming behaviors occur far downstream from exposure to

the clips. Nevertheless, the field test showed that streaming behavior was meaningfully

impacted by optimizing the clip with the proposed approach.

DISCUSSION

Movie trailers have long been regarded as the movie industry's most effective marketing

tool (Faber and O’Guinn 1984), but original two- to three-minute trailers, not only for movies

but also for sitcoms and video games, are becoming less effective in new digital media.

Marketers are therefore seeking to produce much shorter video clips to promote their content in

these media. Film clips are ads for movies, akin to those for video games and TV shows, but

different from food, car, and electronics commercials in that they are made up of samples of the

product that is being promoted. Viewers of clips thus experience a sample of the emotions that

they will experience when they go see the movie. The challenge for marketers resides in

identifying how many and which scenes of the trailer to show in a short video clip. The goal of

this research is to support movie marketers in this effort by investigating how to cut the trailers

provided by trailer production houses down to short clips that are suitable for today’s electronic

media, while eliciting an emotional experience that is representative of the movie and stimulates

people to go and watch it.

But how to optimally sample the trailer content remains an open question. The

contribution of this research lies in the development of a theoretical and methodological

framework for moment-to-moment emotions, watching intentions, and box office success

(Figure 1) to support this goal. The framework centers on the scene as the basic building block of

movies, trailers, and clips. The proposed method helps marketers select those scenes from a
trailer that render short clips most effective. The findings of the analyses, simulations, and

online and field tests show that our approach enables the design of short clips that not only

increase consumers’ intentions to watch the movie, but that also improve predicted box office

success and streaming behavior.

This research marks a first attempt to investigate the effectiveness of clips, with the

application focusing on clips for the comedy movie genre. While happiness, as the central

emotion of the genre, has strong effects, we do not find an effect of concomitant emotions, such
as surprise and disgust, which might be significant in spoof, action, dark, satire, and horror

comedies. This result might be caused by the sample of participants and movies in the present

study, and future research should further examine the role of such concomitant emotions. We do

expect the proposed approach to be directly applicable to other movie genres which elicit a

different central emotion (Grodal 1997). Further refinement of the approach for that purpose may

be useful.

In a broader context, our approach involves content marketing. In content marketing, a

sample of the product is marketed, and applications arise not only for movies, but also for

immersive games, for TV shows on HBO and Netflix, for news items shown on news sites and

news aggregators, such as Flipboard, and even for books (Arons 2013). The manner in which our

approach can be extended to support the marketing of these other types of products requires

further study. In such future research, face tracking may be combined with measures such as

those derived from EEG, which have recently been shown to be predictive of movie preferences

and box office success (Boksem and Smidts 2015).

Online ads have a short “shelf life” compared to traditional forms of advertising, such as

TV commercials. As such, online marketers need to constantly create new content for these ads

to attract and retain consumers’ attention online. The traditional production process of ads is
expensive and often slow, and marketers are therefore increasingly considering automation to

produce variations of ads as quickly as possible and with low budgets. Our approach to

advertising online content via short clips can be automated, scaled up, and personalized. Once
representative calibration data are available and the models have been trained on them,
film clips can be produced automatically using the proposed algorithm. Taking this one step

further, using customer-level data, our procedure could be utilized to customize the selection of

scenes to produce personalized clips that maximize the elicited response from each individual

customer. The pursuit of the automation and personalization of the content of movie clips holds

promise to greatly enhance marketing effectiveness (Wedel and Kannan 2016). We hope the

present study provides a starting point for these future research avenues, and consequently

improves the effectiveness of the marketing of movies and other online content.

References

Arons, Rachel (2013), “The Awkward Art of Book Trailers,” The New Yorker, (accessed

February 5, 2015), [available at http://www.newyorker.com/books/page-turner/the-

awkward-art-of-book-trailers]

Bartlett, Marian Stewart, Joseph C. Hager, Paul Ekman, and Terrence J. Sejnowski (1999),

“Measuring Facial Expressions by Computer Image Analysis,” Psychophysiology, 36 (2),

253–63.

Baumgartner, Hans, Mita Sujan, and Dan Padgett (1997), “Patterns of Affective Reactions to

Advertisements: The Integration of Moment-To-Moment Responses into Overall

Judgments,” Journal of Marketing Research, 34 (2), 219–232.

Boksem, Maarten A. S. and Ale Smidts (2015), “Brain Responses to Movie Trailers Predict

Individual Preferences for Movies and Their Population-Wide Commercial Success,”

Journal of Marketing Research, 52(4), 482-492.

Bradley, Margaret M., and Peter J. Lang (2000), “Affective Reactions to Acoustic Stimuli,”

Psychophysiology, 37(02), 204-215.

Brodny, Grzegorz, Agata Kołakowska, Agnieszka Landowska, Mariusz Szwoch, Wioleta

Szwoch, and Michał R. Wróbel (2016), “Comparison of Selected Off-The-Shelf Solutions

for Emotion Recognition Based on Facial Expressions,” In: 9th International Conference

on Human System Interactions (HSI), 397-404.

Cornfield, Jerome (1951), “A Method for Estimating Comparative Rates from Clinical Data.

Applications to Cancer of the Lung, Breast, and Cervix,” Journal of the National Cancer

Institute, 11, 1269–1275.



Couvreur, Christophe, and Yoram Bresler (2000), “On the Optimality of the Backward Greedy

Algorithm for the Subset Selection Problem,” SIAM Journal on Matrix Analysis and

Applications, 21 (3), 797-808.

Cutting, James E., Kaitlin L. Brunick, and Ayse Candan (2012), “Perceiving Event Dynamics

and Parsing Hollywood Films,” Journal of Experimental Psychology: Human Perception

and Performance, 38 (6), 1476-1490

Dellaportas, P., J. J. Forster, and Ioannis Ntzoufras (2000), “Bayesian Variable Selection Using

the Gibbs Sampler,” Biostatistics-Basel, 5 (May), 273-286.

Ekman, Paul and Wallace V. Friesen (1978), Facial Action Coding System: A Technique for the

Measurement of Facial Movement, Consulting Psychologists Press, Palo Alto.

Eliashberg, Jehoshua and Steven M. Shugan (1997), “Film Critics: Influencers or Predictors?”

Journal of Marketing, 61 (2), 68-78.

-------, Sam K. Hui, and Z. John Zhang (2007), “From Story Line to Box Office: A New

Approach for Green-Lighting Movie Scripts,” Management Science, 53 (6), 881–893.

Elpers, Josephine L.C.M. Woltman, Michel Wedel, and Rik G. M. Pieters (2003), “Why Do

Consumers Stop Viewing Television Commercials? Two Experiments on the Influence of

Moment-to-Moment Entertainment and Information Value,” Journal of Marketing

Research, 40(4), 437-453.

Faber, Ronald J. and Thomas C. O’Guinn (1984), “Effect of Media Advertising and Other

Sources on Movie Selection,” Journalism Quarterly, 61 (2), 371–377.

Fasel, Beat and Juergen Luettin (2003), “Automatic Facial Expression Analysis: A Survey,”

Pattern Recognition, 36, 259-275.



Fredrickson, Barbara L., and Daniel Kahneman (1993), “Duration Neglect in Retrospective

Evaluation of Affective Episodes,” Journal of Personality and Social Psychology, 65 (1),

44-55.

Galak, Jeff, Justin Kruger, and George Loewenstein (2011), “Is Variety the Spice of Life? It All

Depends on The Rate of Consumption,” Judgment and Decision Making, 6(3), 230-238.

-------, (2013), “Slow down! Insensitivity to Rate of Consumption Leads to Avoidable Satiation,”

Journal of Consumer Research, 39(5),993-1009.

George, Edward I., and Robert E. McCulloch (1997), “Approaches for Bayesian Variable

Selection,” Statistica Sinica, 7, 339-373

Gelman, Andrew, Jessica Hwang, and Aki Vehtari (2014), “Understanding Predictive

Information Criteria for Bayesian Models,” Statistics and Computing, 24(6), 997-1016.

Grodal, Torben (1997), “Moving Pictures: A New Theory of Film Genres, Feeling, and

Cognition,” Oxford: Clarendon Press.

Hart, Hugh (2014), “9 (Short) Storytelling Tips from A Master of Movie Trailers.” Fast

Company (accessed January 29, 2015), [available at

http://www.fastcocreate.com/3031012/9-short-storytelling-tips-from-a-master-of-movie-

trailers]

Hewig, Johannes, Dirk Hagemann, Jan Seifert, Mario Gollwitzer, Ewald Naumann, and Dieter

Bartussek (2005), “A Revised Film Set for The Induction of Basic Emotions,” Cognition

and Emotion, 19 (7), 1095–1109.

Hui, Sam K, Tom Meyvis and Henry Assael (2014), “Analyzing Moment-To-Moment Data

Using a Bayesian Functional Linear Model: Application to TV Show Pilot Testing,”

Marketing Science, 33(2), 222–240.



Kellaris, James J. and Ronald C. Rice (1993), “The Influence of Tempo, Loudness, and Gender
of Listener on Responses to Music,” Psychology and Marketing, 10, 15–29.

Kerman, Lisa (2004), “Coming Attractions: Reading American Movie Trailers,” in Texas Film

and Media Studies Series. Austin: University of Texas Press, 1st Edition.

Lang, Peter. J. (1995), “The Emotion Probe: Studies of Motivation and Attention,” American

Psychologist, 50(5), 372.

-------, Margaret M. Bradley, and Bruce N. Cuthbert (1997), “Motivated attention: Affect,

Activation, and Action,” in Attention and Orienting: Sensory and Motivational Processes,

Mahwah, NJ: Lawrence Erlbaum, 97-135.

Last, J. (2004), “Opening Soon,” Wall Street Journal, (accessed May 1, 2004), [available at
http://wwww.opinionjournal.com].

Lenk, Peter, Michel Wedel, and Ulf Böckenholt (2006), “Bayesian Estimation of Circumplex
Models Subject to Prior Theory Constraints and Scale-Usage Bias,” Psychometrika,
71(1), 33–55.

Litman, Barry R. (1983), “Predicting Success of Theatrical Movies: An Empirical Study,”

Journal of Popular Culture, 16 (9), 159–175.

Loewenstein, George F., and Dražen Prelec, (1993), “Preferences for Sequences of Outcomes,”

Psychological Review, 100(1), 91-108.

MacCallum, David and Alistair Gordon (2011), “Say It to My Face! Applying Facial Imaging to

Understanding Consumer Emotional Response,” AMSRS Conference 2011.

McDuff, Daniel, Rana El Kaliouby, David Demirdjian, and Rosalind Picard (2013), “Predicting

Online Media Effectiveness Based on Smile Responses Gathered Over the Internet,” In

Automatic Face and Gesture Recognition (FG), 10th IEEE International Conference and

Workshops, 1-7.

McGraw, A. Peter and Caleb Warren (2010), “Benign Violations: Making Immoral Behavior

Funny,” Psychological Science, 21 (8), 1141-1149.

Mauss, Iris B. and Michael D. Robinson (2009), “Measures of Emotions: A Review,” Cognition

and Emotion, 23 (2), 209–237.

Nelson, Leif D., and Tom Meyvis (2008), “Interrupted Consumption: Disrupting Adaptation to

Hedonic Experiences,” Journal of Marketing Research, 45(6), 654-664.

Newcombe, Robert G. (1998), “Interval Estimation for the Difference between Independent

Proportions: Comparison of Eleven Methods,” Statistics in Medicine,17, 873–890.

Nowlis, Stephen M., Naomi Mandel, and Deborah Brown McCabe (2004), “The Effect of a

Delay Between Choice and Consumption on Consumption Enjoyment.” Journal of

Consumer Research, 31(3), 502-510.

Patel, Sahil (2016), “85 Percent of Facebook Video Is Watched without Sound,” Digiday,

(accessed February 13, 2018), [available at https://digiday.com/media/silent-world-

facebook-video/]

Ratner, Rebecca K., Barbara E. Kahn, and Daniel Kahneman (1999), “Choosing Less-Preferred

Experiences for The Sake of Variety,” Journal of Consumer Research 26(1), 1-15.

Shannon-Jones, Samantha (2011), “Trailer Music: A Look at The Overlooked,” The Oxford

Student, (accessed October 26, 2011), [available at http://oxfordstudent.com/

2011/10/26/trailermusi/].

Sharda, Ramesh and Dursun Delen (2006), “Predicting Box-Office Success of Motion Pictures

with Neural Networks,” Expert Systems with Applications, 30 (2), 243–254.



Shiv, Baba, and Stephen M. Nowlis, (2004), “The Effect of Distractions While Tasting a Food

Sample: The Interplay of Informational and Affective Components in Subsequent

Choice,” Journal of Consumer Research, 31(3), 599-608.

Sorci, Matteo, Gianluca Antonini, Javier Cruz, Thomas Robin, Michel Bierlaire, and J-Ph Thiran

(2010), “Modelling Human Perception of Static Facial Expressions,” Image and Vision

Computing, 28, 790–806.

Spiegelhalter, David J., Nicola G. Best, Bradley P. Carlin, and Angelika Van Der Linde (2002),

“Bayesian Measures of Model Complexity and Fit,” Journal of the Royal Statistical

Society Series B-Statistical Methodology, 64, 583-616.

Teixeira, Thales, Michel Wedel, and Rik Pieters (2012), “Emotion-Induced Engagement in Internet
Video Ads,” Journal of Marketing Research, 49 (2), 144–159.

-------, Rosalind Picard and Rana El Kaliouby (2014), “Why, When, And How Much to Entertain

Consumers in Advertisements? A Web-Based Facial Tracking Field Study,” Marketing

Science, 33(6), 1– 19.

Tsiatis, Anastasios A. and Marie Davidian (2004), “Joint Modeling of Longitudinal and Time-To-

Event Data: An Overview,” Statistica Sinica, 14, 809–834.

Wedel, Michel and P.K. Kannan (2016), “Marketing Analytics for Data-Rich Environments,”

Journal of Marketing, 80 (6), 97-121.

Wilson, Edwin B. (1927), “Probable Inference, the Law of Succession, and Statistical Inference,”

Journal of the American Statistical Association, 22(158), 209-212.

Zacks, Jeffrey M., Nicole K. Speer, and Jeremy R. Reynolds (2009), “Segmentation in Reading

and Film Understanding,” Journal of Experimental Psychology: General, 138, 307–327.



Footnotes:

1. Six-Second Commercials Are Coming to N.F.L. Games on Fox, by Sapna Maheshwari,

Aug 30, 2017 (accessed Sep 21, 2017), [available at

https://www.nytimes.com/2017/08/30/business/media/nfl-six-second-commercials.html]

2. nViso: Artificial Intelligence Emotion Recognition Software, www.nViso.ch

3. The Random Intercept, Index of the Longest Scene, the Audio Trend, Happiness Peak,

and Happiness Trend are “significant” for all cutoffs. The signs of the coefficients are

stable, but Average Happiness (c.o. < .2), Happiness End (c.o. = .2), Music Start (c.o. =

.2), and Volume Start (c.o. > .3) are significant only for some of the cutoffs.

4. We calculate the p-value based on tests on the relationship between proportions of two

groups (Newcombe 1998; Wilson 1927), using the stats package in R (prop.test) and set

the alternative hypothesis as greater or less. We calculate the Odds ratio as an effect-size

measure (e.g., Cornfield 1951), and calculate Cohen’s h for the lift measure of Streaming

Hours because the Odds ratio cannot be calculated in this case.



TABLE 1: THEORETICAL PREDICTIONS FOR THE EFFECTS OF VIDEO AND


AUDIO CHARACTERISTICS ON HAPPINESS AND WATCHING INTENTIONS

Primitives | References | Variables affecting happiness | Variables affecting watching intentions

Video
Pacing and sequence | Galak et al. 2011, 2013; Ratner et al. 1999 | Number of scenes (-); scene sequence (+) | Number of scenes (-); scene sequence (+)
Delays and interruptions | Nelson and Meyvis 2008; Nowlis et al. 2004; Shiv and Nowlis 2004 | Average scene length (-); longest scene index number (+) | Average scene length (-); longest scene index number (+)

Audio
Moment-to-moment level | Bradley and Lang 2000; Lang 1995; Lang et al. 1997 | Volume level (+); music level (+) | Volume level (+); music level (+)
Start, peak, end, and trend | Zauberman et al. 2006 | Volume start, peak, end, and trend (+/-); music start, peak, end, and trend (+/-) | Volume start, peak, end, and trend (+/-); music start, peak, end, and trend (+/-)

Emotions
Happiness | Baumgartner et al. 1997; Fredrickson and Kahneman 1993; Elpers et al. 2003 | n/a | Happiness start (+/-), peak (+), end (+), and trend (+)

Note: Expected direction (+, -, or +/-) of the effects in parentheses.

TABLE 2: MODEL COMPARISON STATISTICS FOR THE FULL MODEL AND FIVE
MODELS THAT ARISE BY REMOVING VARIABLES FROM THE FULL MODEL

Model AIC DIC WAIC


1. Full Model 233519.2 233459.6 253212.9
2. Without Surprise and Disgust 233505.3 233444.4 253009.8
3. Without Emotion Variables 233592.9 233524.1 253406.2
4. Without Video Variables 233709.6 233640.5 254027.4
5. Without Audio Variables 233760.5 233675.9 257211.0
6. Without Intention 233603.9 233543.6 253536.8

TABLE 3: INCLUSION PROBABILITIES OF VARIABLES IN THE JOINT MODEL,


ESTIMATED WITH BAYESIAN VARIABLE SELECTION

Happiness Model Watch Intention Model Box Office Model


Variables Prob. Variables Prob. Variables Prob.
Scene 1.000 SceneNum .030 Holiday .481
SceneNum 1.000 SceneLengthAvg .053 MPAA=PG .637
SceneLengthAvg .395 SceneLongestInd .452 MPAA=PG13 .622
SceneLongestInd 1.000 Volume Peak .016 MPAA=R 1.000
Volume 1.000 Volume End .028 Ratings (from IMDb) .055
Volume Peak 1.000 Volume Start .416 NumRatingAbove3.5 1.000
Volume End 1.000 Volume Trend .714 Happiness Avg .265
Volume Start 1.000 Music Peak .029 Happiness Peak .226
Volume Trend 1.000 Music End .029 Happiness End .209
Music 1.000 Music Start .271 Happiness Start .225
Music Peak .046 Music Trend .142 Happiness Trend .230
Music End .346 Happiness Avg .196
Music Start .352 Happiness Peak .984
Music Trend .546 Happiness End .223
Happiness Start .898
Happiness Trend .625
Notes: The posterior means of inclusion in the model are reported.
Estimates that are greater than the cutoff of .2 are in bold.

TABLE 4: PARAMETER ESTIMATES CAPTURING THE EFFECTS OF


VARIABLES ON HAPPINESS, WATCHING INTENTION AND BOX OFFICE
PERFORMANCE IN THE JOINT MODEL
Model Component | Variables | Posterior Mean | Posterior SD | 2.50% | 97.50%
Happiness Video Scene .002 .000 .001 .003
SceneNum -.023 .004 -.030 -.016
SceneLengthAvg -.005 .002 -.010 .000
SceneLongestInd .027 .002 .022 .031
Audio Volume .210 .066 .078 .341
Volume Peak -.032 .003 -.037 -.027
Volume End .010 .003 .003 .017
Volume Start -.005 .003 -.012 .001
Volume Trend -.008 .002 -.013 -.004
Music Music -.351 .104 -.553 -.144
Music End -.007 .004 -.015 .002
Music Start .029 .005 .019 .038
Music Trend .006 .004 -.002 .014
Linkage Random Intercept -.743 .188 -1.097 -.381
Random Slope .619 9.864 -18.660 19.650
Watch Intention Video SceneLongestInd .165 .065 .037 .293
Audio Volume Start -.126 .083 -.286 .035
Volume Trend -.221 .065 -.351 -.093
Music Music Start -.161 .082 -.321 -.003
Happiness Happiness Peak .514 .124 .273 .762
Happiness End .358 .139 .083 .631
Happiness Start -.111 .122 -.354 .126
Happiness Trend .237 .090 .064 .414
τ1 -.249 .383 -.999 .502
τ2 .692 .383 -.059 1.442
Threshold τ3 1.236 .384 .483 1.989
Parameters τ4 1.944 .388 1.183 2.705
τ5 2.794 .396 2.017 3.571
τ6 4.155 .416 3.340 4.970
Box Office Intercept -2.951 1.745 -6.407 .442
WatchIntention .562 .270 .033 1.089
Holiday .930 1.068 -1.205 2.953
MPAA=PG -2.666 1.052 -4.726 -.562
MPAA=PG13 -2.914 1.049 -5.009 -.888
MPAA=R -3.792 .928 -5.615 -1.974
NumRatingAbove3.5 1.986 .097 1.802 2.186
Happiness Happiness Avg 1.645 1.550 -1.450 4.684
Happiness Peak -.589 .852 -2.265 1.099
Happiness End .397 1.184 -1.884 2.751
Happiness Start -1.004 .996 -2.958 .925
Happiness Trend -1.375 .746 -2.847 .077
Notes: Parameters for which the 95% credible interval does not cover zero are in bold.

TABLE 5: RESULTS OF OPTIMAL CLIPS RESULTING FROM STEPWISE


REMOVAL OF SCENES FROM THE TRAILER, COMPARED TO A BENCHMARK
WITH THE FIRST SCENES OF THE TRAILER

Movie clips with audio Optimal clip Benchmark clip


Average number of scenes 3.57 (2.07) 4.76 (3.20)
Predicted watching intention 3.83 (.92) 2.91 (.68)
Difference in watching intentions .92 (.77)
Percentage of clips with positive improvement 90.91%
Improvement in box office revenue 3.17%
Movie clips without audio Optimal clip Benchmark clip
Average number of scenes 3.56 (2.07) 4.76 (3.20)
Predicted watching intention 3.80 (.68) 3.28 (.55)
Difference in watching intentions .49 (.43)
Percentage of clips with positive improvement 90.91%
Improvement in box office revenue 1.75%
Average difference between clips with and without audio .04 (.53)
Percentage of clips with audio better than those without 57.58%
Notes: SD in parentheses.
One trailer had only one scene, so no optimization was performed for it.
The optimization of movie clips without audio is based on the model without sound variables.

TABLE 6: AVERAGES OF THE THREE EVALUATION MEASURES FOR THE


OPTIMIZED AND BENCHMARK CLIPS IN THE ONLINE VALIDATION
EXPERIMENT

Clips With Sound


Odd Life of Timothy | Dark Shadow | Mirror Mirror | Project X | Rock of Ages
Benchmark Average 3.86 3.61 4.16 3.58 3.24
Optimized Average 4.42 3.97 5.07 4.38 3.74
% Difference 14.70 9.94 21.92 22.55 15.48
Clips Without Sound
What to Expect | Some Guy | Mirror Mirror | Project X | Wanderlust
Benchmark Average 3.66 2.96 3.86 3.52 3.78
Optimized (no subtitles) Average 4.02 3.33 4.52 4.00 4.03
% Difference 9.75 12.49 17.10 13.65 6.38
Optimized (subtitles) Average 4.52 3.82 5.11 4.35 4.58
% Difference 23.40 29.16 32.27 23.67 20.98

FIGURE 1: CONCEPTUAL FRAMEWORK:


EMOTIONAL IMPACT OF SCENE STRUCTURE OF MOVIES, TRAILERS AND
CLIPS ON WATCHING INTENTIONS

Notes: Scenes are the basic building blocks of video content. The audio-visual scene structure of
movies elicits an intended emotional response. Scenes from the movie are selected for the trailer,
and scenes from the trailer are selected to produce a clip. The clip provides a representative
emotional experience and results in the intention to watch the full movie.

FIGURE 2: TOTAL SOUND VOLUME AND MUSIC VOLUME FOR ONE TRAILER (MEN IN BLACK 3)

[Line plot omitted in text extraction: Volume (in dB) on the y-axis against Time (in sec) on the x-axis.]

Note: The solid line indicates the total sound volume (in dB). The dotted line indicates the music volume (in dB) with vocals removed.
Vertical dashed lines indicate scene cuts.

FIGURE 3: HAPPINESS PROFILE AND HAPPINESS AND SCENE MEASURES FOR ONE INDIVIDUAL FOR A
SAMPLE TRAILER (MEN IN BLACK 3)

[Line plot omitted in text extraction: Happiness on the y-axis against Time (in sec) on the x-axis.]

Notes: The happiness measure ranges from 0 to 1 (the algorithm assigns a probability based on three sets of facial expression
measurements; details about the measurement are provided in Web Appendix I). Vertical dashed lines indicate scene cuts. The middle
shaded area is the region of the scene with the happiness peak; left and right shaded areas are start and end scenes. The horizontal
dashed line indicates 75% of the peak value; the dotted line is a linear fit used to represent the happiness trend.

Video Content Marketing: The Making of Clips

Xuan Liu, Savannah Shi, Thales Teixeira, and Michel Wedel

WEB APPENDIX

WEB APPENDIX I: FACIAL EXPRESSION RECOGNITION ALGORITHM IN SORCI ET


AL. 2010.

The facial recognition algorithm developed by Sorci et al. (2010) and used by nViso
specifies the probability of six basic emotions (Ekman and Friesen 1971), namely, “happiness”,
“surprise”, “fear”, “disgust”, “sadness”, and “anger”. Using a multinomial logit model, the
algorithm assigns a probability to each of these emotions based on three sets of facial expression
measurements: those based on the Facial Action Coding System (FACS); those based on the
Expression Descriptive Units (EDU); and those based on Appearance Parameters (AP).
The FACS developed by Ekman and Friesen (1978) is the leading standard for measuring
facial expressions. All visible movements of muscular activity on a face are categorized into
“action units” (AUs) and emotions are identified based on a unique combination of these AUs.
For example, happiness is characterized by two primary and three secondary action units. Zhang
et al. (2005) validated the classification of emotions based on the AUs and supplemented the
classification with auxiliary AUs and transient facial features, such as wrinkles and furrows.
The EDU was developed by Antonini et al. (2006), after recognizing that face
recognition also involves spatial configuration of facial features (Cabeza and Kato 2000; Farah et
al. 1998). The EDU encodes the interactions among facial features (e.g., the interactions between
eyebrows and mouth), in addition to the isolated AUs identified in FACS. Lastly, AP was
developed by Sorci et al. (2010) to provide a description of a face as a global entity.
Sorci et al. (2010) compared three different models: (1) the model with only measures
from FACS as explanatory variables; (2) the model with EDU and significant measures from
FACS in model 1 as explanatory variables; (3) the model with AP, and significant measures from
EDU and FACS in model 2 as explanatory variables. The model is specified as:
Emotion j  Intercept j   k 1 I kjK1  kjK1 FACSkK1 ,
K1
Model 1


K2 K
(W1)  I 2
k 1 kj
 kjK2 ( FACS  EDU ) kK2 , Model 2


K3 K
I 3
k 1 kj
 kjK3 ( FACS  EDU  AP) kK3 , Model 3
in which Emotion_j includes “happiness”, “surprise”, “fear”, “disgust”, “sadness”, “anger”,
“neutral”, “other”, and “I don’t know”; K1, K2, and K3 represent the total number of
measurements from FACS, from FACS and EDU, and from FACS, EDU, and AP, respectively; I is the
indicator variable, which equals 1 if the k-th measurement is included for emotion j, and 0 otherwise; and
the intercept captures the average effect of factors that are not included.
The models are estimated by maximum likelihood, and model comparison statistics show
that the full model (model 3) performs best; it is used as the final model in the facial expression
recognition algorithm.

References:

Antonini, Gianluca, Matteo Sorci, Michel Bierlaire, and Jean-Philippe Thiran (2006), "Discrete
Choice Models for Static Facial Expression Recognition," In Advanced Concepts for Intelligent
Vision Systems, 710-721. Springer Berlin/Heidelberg.

Cabeza, Roberto, and Takashi Kato (2000), "Features Are Also Important: Contributions of
Featural And Configural Processing To Face Recognition," Psychological Science 11(5), 429-
433.

Ekman, Paul, and Wallace V. Friesen (1971), "Constants across Cultures in the Face and
Emotion," Journal of Personality and Social Psychology 17(2), 124-129.

Ekman, Paul, and Wallace V. Friesen (1978), Facial Action Coding System Investigator’s Guide.
Consulting Psychologists Press, Palo Alto, CA.

Farah, Martha J., Kevin D. Wilson, Maxwell Drain, and James N. Tanaka (1998), “What Is
‘Special’ about Face Perception?,” Psychological Review, 105(3), 482-498.

Zhang, Yongmian, and Qiang Ji (2005), "Active and Dynamic Information Fusion for Facial
Expression Understanding from Image Sequences," IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(5), 699-714.

WEB APPENDIX II: JAGS CODE

model{
  for (j in 1:numTrailer){
    for (i in 1:nObsMatrix[,j]){

      ### moment-to-moment emotion (happiness) model
      y[index[i,j], 1, j] ~ dnorm(0, 0.01)
      for (t in 2:tMatrix[,j]){
        theta[index[i,j],t,j] <- zeta[1]*s[t,j] +
          zeta[2]*x4_eq1[index[i,j],7,j] + zeta[3]*x4_eq1[index[i,j],15,j] +
          zeta[4]*x4_eq1[index[i,j],16,j] + zeta[5]*Vol[t,j] +
          zeta[6]*x4_eq1[index[i,j],32,j] + zeta[7]*x4_eq1[index[i,j],34,j] +
          zeta[8]*x4_eq1[index[i,j],35,j] + zeta[9]*x4_eq1[index[i,j],14,j] +
          zeta[10]*Music[t,j] + zeta[11]*x4_eq1[index[i,j],38,j] +
          zeta[12]*x4_eq1[index[i,j],39,j] + zeta[13]*x4_eq1[index[i,j],17,j] +
          u[index[i,j], 1] + u[index[i,j], 2] * t
        y[index[i,j],t,j] ~ dnorm(theta[index[i,j],t,j], sigma)
      }

      ## end-point watch intention model (cumulative logit with thresholds tau)
      mu_c[index[i,j],j] <- alpha[1]*x4_eq1[index[i,j],16,j] +
        alpha[2]*x4_eq1[index[i,j],35,j] + alpha[3]*x4_eq1[index[i,j],14,j] +
        alpha[4]*x4_eq1[index[i,j],39,j] + alpha[5]*x4_eq1[index[i,j],21,j] +
        alpha[6]*x4_eq1[index[i,j],30,j] + alpha[7]*x4_eq1[index[i,j],12,j] +
        v[1]*u[index[i,j], 1] + v[2]*u[index[i,j], 2] + u[index[i,j], 3]

      logit(Q[index[i,j],j,1]) <- tau[1] - mu_c[index[i,j],j]
      p[index[i,j],j,1] <- Q[index[i,j],j,1]
      for (d in 2:(D-1)) {
        logit(Q[index[i,j],j,d]) <- tau[d] - mu_c[index[i,j],j]
        p[index[i,j],j,d] <- Q[index[i,j],j,d] - Q[index[i,j],j,(d-1)]
      }
      p[index[i,j],j,D] <- 1 - Q[index[i,j],j,(D-1)]
      C[index[i,j],j] ~ dcat(p[index[i,j],j,1:D])
      C_Pred[index[i,j],j] <- sum(p[index[i,j],j,1:D]*(1:D))
    } #i

    ## end-point (log) box-office revenue model
    WatchMovieTemp[j] <- sum(C_Pred[index[1:nObsMatrix[,j],j],j])/nObsMatrix[,j]
    happiness_peak_numtemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],2,j])/nObsMatrix[,j]
    happiness_peak_durationtemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],5,j])/nObsMatrix[,j]
    happiness_avg_temp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],9,j])/nObsMatrix[,j]
    happiness_coftemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],12,j])/nObsMatrix[,j]
    happiness_peaktemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],21,j])/nObsMatrix[,j]
    happiness_peakIndextemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],24,j])/nObsMatrix[,j]
    happiness_endtemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],27,j])/nObsMatrix[,j]
    happiness_starttemp[j] <- sum(x4_eq1[index[1:nObsMatrix[,j],j],30,j])/nObsMatrix[,j]

    mu_bo[j] <- vbo[1]*WatchMovieTemp[j] + vbo[2]*Holiday[j] +
      vbo[3]*mpaaPG[j] + vbo[4]*mpaaPG13[j] + vbo[5]*mpaaR[j] +
      vbo[6]*log(Numratingabove3_5[j]+1) + vbo[7]*happiness_avg_temp[j] +
      vbo[8]*happiness_peaktemp[j] + vbo[9]*happiness_endtemp[j] +
      vbo[10]*happiness_starttemp[j] + vbo[11]*happiness_coftemp[j] + vbo[12]
    LGBoxOffice[j] ~ dnorm(mu_bo[j], sigma_bo)
  } #j

  ## respondent-level random effects and their precisions
  for (i in 1:nRespondent){
    for (f in 1:3){
      u[i,f] ~ dnorm(0, sigma_m[f])
    }
  }
  for (r1 in 1:3){
    sigma_m[r1] ~ dgamma(0.01, 0.01)
  }

  ## ordered thresholds for the watch intention model
  for (d in 1:D){
    tau0[d] ~ dnorm(0, 0.001)
  }
  tau <- sort(tau0)

  ## priors for regression coefficients and precisions (JAGS dnorm uses precision)
  for (m in 1:nAlpha){
    alpha[m] ~ dnorm(0, 0.01)
  }
  for (l in 1:nV){
    v[l] ~ dnorm(0, 0.01)
  }
  sigma ~ dgamma(0.01, 0.01)
  for (ll in 1:nZeta){
    zeta[ll] ~ dnorm(0, 0.01)
  }
  sigma_bo ~ dgamma(0.01, 0.01)
  for (ll in 1:nvbo){
    vbo[ll] ~ dnorm(0, 0.01)
  }
}
