
Transferring Knowledge From Text to Video:


Zero-Shot Anticipation for Procedural Actions
Fadime Sener, Rishabh Saraf, and Angela Yao

Abstract—Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem
by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-
scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts
coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model,
we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive
experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for
training models.

Index Terms—Deep learning, action anticipation, zero-shot learning, video analysis

Fadime Sener is with the University of Bonn, 53113 Bonn, Germany. E-mail: sener@cs.uni-bonn.de.
Rishabh Saraf is with the Indian Institute of Technology (ISM) Dhanbad, Dhanbad, Jharkhand 826004, India. E-mail: rishabh.15je001745@am.iitism.ac.in.
Angela Yao is with the National University of Singapore, Singapore 119077. E-mail: ayao@comp.nus.edu.sg.

Manuscript received 15 May 2021; revised 1 May 2022; accepted 18 October 2022. Date of publication 1 November 2022; date of current version 5 May 2023. This work was supported by the National Research Foundation, Singapore, under its NRF Fellowship for AI under Grant NRF-NRFFAI1-2019-0001. (Corresponding author: Fadime Sener.) Recommended for acceptance by K. Saenko. Digital Object Identifier no. 10.1109/TPAMI.2022.3218596

1 INTRODUCTION

IMAGINE a not-so-distant future where robot chefs service our kitchens. How can we learn and embody cooking as a general skill? Perhaps by reading all the recipes on the web? Or by watching all the cooking videos on YouTube? Learning and generalizing from a set of instructions, be it in text, image, or video form, is a highly challenging and open problem faced by those working in computer vision, natural language understanding and robotics.

In this work, we limit our scope to training the next 'robochef' to predict subsequent steps as it watches a human cook a never-before-seen dish. We frame the problem as action recognition and anticipation in a zero- or few-shot learning scenario. In addition to recognition, it will be important for the robot to anticipate what happens in the future to ensure a safe and smooth collaborative experience with the human [1], [2]. We also place importance on zero-shot or few-shot learning as it likely reflects how service robots will be introduced to the home [3], [4]. Models (and robots) can be pre-trained extensively, but they will likely be deployed in never-before-encountered scenarios. Successfully anticipating the next steps of a never-before-seen dish would require leveraging and generalizing from previously learned procedural knowledge.

Instructional data, especially cooking recipes, can be found readily on the web [5], [6], [7]. The richest forms are multimodal, e.g., images plus text or videos with narrations. Such data could be used to build automated systems that learn visual models from videos, enabling the advancement of virtual assistants or service robots learning new skills. However, learning complex multi-step procedures requires significant amounts of data. Despite the abundance of instructional data online, it is still difficult to find sufficient examples in multimodal form. Furthermore, learning the steps' visual appearance would require temporally aligned data, which is less common and/or expensive to annotate. Several works [8], [9] have been proposed to tackle learning video representations without manual supervision. However, these methods can only provide solutions for coping with misaligned narrations and cannot handle missing or unrelated narration.

Our strategy is to separate procedural learning from the visual perception problem. We learn procedural knowledge from text corpora; these are readily available and large scale, on the order of millions of recipes [8], [10], [11]. Knowledge from the text is then transferred to video so that visual perception is simplified to a grounding task done via aligned video and text (Fig. 1). Early examples of transferring knowledge from one domain to another, like [12], [13], work with images, sound, and text. Ngiam et al. [12] represent image and sound using separate stacked autoencoders and fuse them into a multimodal representation space. Srivastava and Salakhutdinov [13] learn a joint space for image and text for information retrieval from unimodal or multimodal queries. More recently, VideoBERT [14] has been proposed for the joint modeling of video and text. Unlike these works, which assume parallel data during training, our text models are trained on large text corpora alone, and the video models are trained on scarce parallel data and can be used for zero-shot queries.

More specifically, we encode text and/or video in a multimodal embedding space.

Fig. 1. An overview of our model. We first learn procedural knowledge from large text corpora and transfer it to the visual domain to anticipate the
future. Our system comprises four RNNs: a sentence encoder, a sentence decoder, a video encoder, and a recipe network.

The context vectors, derived from either video or text, are fed into a recipe network that models the recipe's sequential structure and makes the following step predictions. Fig. 1 shows an overview. The use of text and language information to help train video models is not new. Several works employ accompanying narrations [7], recipes [11] or film scripts [15] to use text as weak boundaries, and several learn joint text-video models [8], [14]. Our work is similar in spirit in that we also want to leverage these auxiliary sources of information to reduce the labeling effort. However, the previous works mainly focus on using text to minimize the labeling effort for large-scale video datasets. In contrast, our work learns entire models out of text and then transfers said models from the natural language domain to the visual domain via aligned video and text representations. To the best of our knowledge, we are the first to learn and transfer a cross-domain model.

Our work breaks new ground in procedural activity understanding in two ways. First and foremost, we anticipate upcoming actions under a zero-shot setting, as we target making predictions for never-before-seen dishes. We achieve this by generalizing cooking knowledge from large-scale text corpora and then transferring the knowledge to the visual domain. This approach relieves us of the burden and impracticality of providing annotations for a virtually unlimited number of categories (dishes) and sub-categories (instructional steps). Our work is the first to tackle the problem of procedural activity understanding in this form; prior works in recognition are severely limited in the number of categories and steps [16], [17], [18], while works in anticipation rely on strong supervision [19], [20], [21].

Our work's second novelty is that we do not work with a closed set of labels derived from word tags. Instead, we train with and also predict full sentences, e.g., 'Cook the chicken wing until both sides are golden brown.' versus 'cook chicken'. This design choice makes the problem more challenging but also brings several advantages. First, it adds qualifiers and richness to the instruction, since natural language conveys much more information than simple text labels [22], [23]. Second, it allows for anticipation of not only actions but also objects and attributes. Finally, as a byproduct, it facilitates data collection, as the number of class-based annotations grows exponentially with the number of actions, objects, and attributes, leading to very long-tailed distributions [24].

When transferring procedural knowledge from text recipes to videos, we need to ground the text domain with the video and vice versa. This requires video with temporally aligned captions; to the best of our knowledge, YouCookII [23] is the only dataset with such labels. However, YouCookII lacks diversity in the number of dishes (89 dishes) and therefore in the number of possible recipe steps. As such, we collect and present our new Tasty Videos dataset V2, a diverse set of 4022 different cooking recipes collected from the Tasty website (https://tasty.co/), each accompanied by a video, ingredient list, and temporally aligned recipe steps. Video footage is taken from a fixed bird's-eye view and focuses almost exclusively on the cooking instructions, making it well-suited for procedural understanding.

Our main contributions are summarized as follows:

- We are the first to explore action anticipation in a zero-shot setting by generalizing knowledge from text corpora and transferring it to the visual domain.
- We propose a modular hierarchical model for learning multi-step procedures with text and visual context. Our model generalizes cooking knowledge and predicts coherent and plausible instructions for multiple steps into the future. The rich natural language predictions score higher in NLP metrics than state-of-the-art video captioning methods applied directly to the (future) video.
- We present a new and highly diverse dataset of cooking recipes. The dataset is publicly available at https://cvml.comp.nus.edu.sg/tasty and will be of interest to those working in procedural video understanding, action recognition and anticipation, as well as other multimodal research in video and text.

A preliminary version of this paper was published in [25]. The current paper extends the previous work's model by integrating a temporal segment proposal method into the video encoder and additional losses at the recipe encoder to improve convergence. We add experiments comparing against recipe generation networks [26] and verify that our hierarchical architecture better generalizes to

previously unseen dishes or recipes. Finally, we extend the dataset by 60%, from 2511 to 4022 videos.

2 RELATED WORKS

2.1 Procedural Activity Modeling
Understanding procedural activities and their sub-activities has typically been addressed as a supervised temporal video segmentation and recognition problem [17], [18], [27]. A variety of models have been used, including conditional random fields [28], hidden Markov models [29], RNNs [30] and, recently, temporal convolutional networks [31], [32]. Data- and label-wise, these methods require video sequences in which every frame is labeled exhaustively, making it difficult to work at a large scale.
Alternative lines of work are either weakly supervised, using cues from accompanying narrations [16], [33], [34] or sub-activity orderings [35], [36], [37], or are fully unsupervised [38], [39]. Our work is similar to those using text cues; however, we do not rely on aligned video and text modalities for learning the activity models [16], [34]. Temporally aligned narrations cannot always be assumed for instructional videos, as it is far more natural for people to talk about an action before/after performing it, or to talk about an alternative action/object. Instead, we use a large corpus of unlabeled data in the text domain and only a very small set of aligned data for grounding the visual evidence.

2.2 Action Anticipation
Action anticipation is the forecast of not-yet-observed actions into the future. Let t_a be the 'anticipation time', i.e., how many seconds in advance to anticipate the next action. The task of action anticipation is then to predict the upcoming action t_a seconds before it starts. In many recent works, t_a is considered to be 1 second [40], [41] but can vary between zero [20] and several seconds [1].
Early works in forecasting activities have been limited to simple movement primitives, such as reaching and placing [1], or personal interactions like hand-shaking and hugging [20], [42]. More recent works anticipate up to thousands of action classes [24] by defining actions as the composition of a verb and a noun. However, this does not scale as the number of actions grows, leading to a long-tail distribution. For example, in the new Epic-100 dataset [43], 92% of the actions are tail classes. Methods for anticipation include RNNs for encoding the observations [41], [44], predicting future features and performing future action classification on these [45], [46], employing knowledge distillation [47], and transferring knowledge from other sources such as word embeddings [48] or visual attribute classifiers [40]. Moreover, recent interest in egocentric vision has started a new line of approaches dedicated to such recordings using gaze [49] or hand-object interaction regions [50], [51].
Dense anticipation extends action anticipation to forecast multiple actions into the future. Examples include [19], [52], who propose two-stage methods that first perform action segmentation of the observed sequence before using the frame-wise labels as input for anticipation. The work of [53] bypasses the initial segmentation and performs dense anticipation in a single stage directly, using a temporal aggregation framework. Our proposed approach can also predict actions multiple steps into the future, but unlike these methods, we do not work in a fully supervised framework. Furthermore, we do not require repetitions of activity sequences for training. Moreover, we are the first to predict the future in the form of sentences instead of category-based predictions as in [19], [52], [53]. Similar to us, recent work [54] also predicts sentences for subsequent actions by extending their anticipation framework [44].

2.3 Zero- and Few-Shot Learning in Video
Zero- and few-shot learning is more popular in the image domain, and we refer the interested reader to two recent surveys [55], [56]. Extensions to the video domain have been less explored. Early works on zero-shot learning on videos rely on attribute-oriented feature representations, which are then used to categorize the unseen videos [57]. More recent works train temporal models that map video features to a semantic embedding space of categorical labels [58], [59], [60] or sentence representations [61]. In a similar spirit to the embedding-based approaches, we ground the video to text by mapping video representations to the semantic space of step-wise instructions. Our approach extends the research on the zero-shot recognition of simple actions to the complex multi-step activities of procedural cooking videos. 'Zero-shot' in our case refers to making predictions for previously unseen recipes.

2.4 Modeling Instructional Text in NLP
Cooking is a popular domain in NLP research since recipes are rich in natural language yet reasonably limited in scope. Cooking recipes are employed in tasks such as food recognition [62], recommender systems [63], and indexing and retrieval [10], [64]. Modeling the procedural aspects of text and generating coherent recipes date back several decades [65], [66]. Early works focus on parsing the recipes to extract verbs and ingredients [11], [67], [68], [69]. For example, [67] generate plans from textual instructions, [68] map recipes to action graphs, and [70] use parsed instructions to make robots cook pancakes.
More recently, neural network-based solutions have become popular, especially targeting the coherence of the generated recipes. For example, [71] train an encoder-decoder [72] with a checklist mechanism to keep track of the ingredients given as input. [73] propose a reinforcement learning-based solution with discourse-aware rewards to encourage generating instructions in the correct order. [74] produce personalized recipes by fusing users' previously consumed recipes with an attention mechanism.

2.5 Cooking Domain in Vision
In vision, cooking has been explored for procedural and fine-grained activity recognition [17], [18], [23], [24], temporal segmentation [17], [23], video-text alignment [33], [75] and captioning [76], [77], [78]. There are several cooking and kitchen datasets [17], [23], [24], [33], [34]; What's Cooking [33] and YouCookII [23] are the most similar to ours, featuring videos and accompanying recipe texts. YouCookII, however, has limited diversity with only 89 dishes; What's Cooking is large scale (180K recipes) but lacks temporal alignments between the video and recipe texts.

Fig. 2. Our system is composed of four RNNs: a sentence encoder and a decoder, a video encoder, and a recipe RNN. Given the ingredients as initial
input and context in either text or visual form, the recipe RNN recurrently predicts future steps. The sentence decoder converts predicted future steps
back into natural language. We continue predicting future steps by repeatedly feeding the next steps encoded by the sentence or video encoder.

Some recent methods investigate learning image-text embeddings for image-based recipe retrieval. For example, [10] learn a joint embedding space of the recipes, encoded with skip-thought vectors [79], and the associated food images using a pairwise ranking loss. This baseline is extended by learning with a triplet loss [64] and hard sample mining [80]. Instead of retrieval, [81] recently propose a method to generate images from recipe text using an instruction encoder.
An alternative application is to generate recipes from images. [26] predicts the ingredients of food images and uses the ingredients as input for a transformer-based decoder. However, [26] generates an entire recipe as one continuous text block, so recipes can only be as long as the maximum length allowed by the decoder (150 words). [82] splits recipes into several chunks and predicts the instructions for each chunk guided by position encoders.

3 MODELING SEQUENTIAL INSTRUCTIONS

Sequence-to-sequence learning [72] has made it possible to successfully generate continuous text and build dialogue systems [83], [84]. Recurrent neural networks are used to learn rich representations of sentences [79], [85], [86] in an unsupervised manner, using the extensive amount of text that exists in book and web corpora. Examples include skip-thought vectors [79] and FastSent [85], both of which are effective for generation tasks. However, for instructional text, such as cooking recipes, such representations do not fully capture the underlying sequential nature of the instruction set, and generations are not always coherent from one step to the next. As such, we propose a hierarchical model and dedicate two RNNs to representing the sentences and the steps of the recipe individually: the sentence encoder and the recipe RNN, respectively. A third RNN decodes predicted recipe steps back into sentence form for human-interpretable results (the sentence decoder). These three RNNs are learned jointly as an auto-encoder in an initial training step. A fourth RNN encoding visual evidence (the video encoder) is then learned in a subsequent step to replace the sentence encoder, enabling interpretation and future prediction from video data. An overview is shown in Fig. 2, while details of the RNNs are given in Sections 3.1, 3.2, and 3.3.

3.1 Sentence Encoder and Decoder
The sentence encoder produces a fixed-length vector representation of each textual recipe step. We use a bi-directional LSTM but, rather than representing a sentence by the last step's hidden vector, we apply a (temporal) max-pooling over each dimension of the hidden units. This type of architecture and, in particular, the temporal pooling have been shown to be successful in sentence encoding [87]. More formally, let sentence s_j from step j of a recipe (we assume each step is one sentence) be represented by M_j words, i.e., s_j = \{w_j^t\}_{t=1,\ldots,M_j}, and let x_j^t be the word embedding of word w_j^t. For each sentence j, at each (word) step t, the bi-directional LSTM-based sentence encoder, SE, outputs y_j^t:

y_j^t = \mathrm{SE}\big[\overrightarrow{\mathrm{LSTM}}\{x_j^1, \ldots, x_j^t\};\ \overleftarrow{\mathrm{LSTM}}\{x_j^{M_j}, \ldots, x_j^t\}\big],   (1)

which is a concatenation of the hidden states from the forward and backward passes of the LSTM. The overall sentence representation r_j is determined by a dimension-independent max-pooling over the time steps, i.e.,

(r_j)_d = \max_{t \in \{1,\ldots,M_j\}} (y_j^t)_d,   (2)

where (\cdot)_d, d \in \{1, \ldots, D\}, indicates the d-th element of the D-dimensional bi-directional LSTM outputs y_j^t.
The sentence decoder, SD, is an LSTM-based neural language model that converts the fixed-length representation of the steps back into human-interpretable sentences. More specifically, given the vector prediction \hat{r}_j from the recipe RNN for step j, it decodes the sentence \hat{s}_j:

\hat{s}_j = \mathrm{SD}\big(\mathrm{LSTM}\{\hat{r}_j\}\big) = \{\hat{w}_j^1, \ldots, \hat{w}_j^{\hat{M}_j}\}.   (3)
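Concretely, Eqs. (1)-(3) describe a bi-directional LSTM whose per-word outputs are max-pooled over time, plus an LSTM language model conditioned on a step vector. The PyTorch sketch below mirrors that structure with the layer sizes reported in Section 3.6 (256-d word embeddings, 512 hidden units); the class and argument names, the padding-free batching, and the way the step vector initializes the decoder's hidden state are our own simplifications under those assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Bi-directional LSTM over word embeddings, max-pooled over time (Eqs. (1)-(2))."""

    def __init__(self, vocab_size=30171, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # forward and backward hidden states are concatenated -> 2 * hidden dims
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, word_ids):            # word_ids: (batch, M_j) token indices
        x = self.embed(word_ids)            # (batch, M_j, emb_dim)
        y, _ = self.lstm(x)                 # (batch, M_j, 2 * hidden) = y_j^t
        r, _ = y.max(dim=1)                 # dimension-wise max over time -> r_j
        return r                            # (batch, 2 * hidden)

class SentenceDecoder(nn.Module):
    """LSTM language model that turns a step vector back into words (Eq. (3))."""

    def __init__(self, vocab_size=30171, emb_dim=256, hidden=512, step_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.init_h = nn.Linear(step_dim, hidden)    # condition on the step vector r_hat_j
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, r_hat, word_ids):              # teacher-forced decoding
        h0 = torch.tanh(self.init_h(r_hat)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        y, _ = self.lstm(self.embed(word_ids), (h0, c0))
        return self.out(y)                           # (batch, M_j, vocab) logits
```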

3.2 Recipe RNN
We model the sequential ordering of recipe steps with a recipe encoder (RE), an LSTM that takes as input \{r_j\}_{j=1,\ldots,N}, i.e., fixed-length representations of the steps of a recipe with N steps, where j indicates the step index. At each recipe step, the hidden state of the RE, h_j, can be considered a fixed-length representation of all recipe steps \{s_1, \ldots, s_j\} seen up to step j; we directly use this hidden state vector as a prediction of the sentence representation for step j+1, i.e.,

\hat{r}_{j+1} = h_j = \mathrm{RE}\big(\mathrm{LSTM}\{r_0, \ldots, r_j\}\big).   (4)

The hidden state of the last step, h_N, can be considered a representation of the entire recipe. Due to the standard recursion of the hidden states in the LSTM, each hidden state vector, and therefore each future step prediction, is conditioned on the previous steps. This allows predicting recipe steps that are plausible and coherent with respect to the previous steps.
Recipes usually include an ingredient list, a rich source of information that can also serve as a strong modeling cue [10], [26], [71]. To incorporate the ingredients, we form an ingredient vector I for each recipe in the form of a one-hot encoding over a vocabulary of ingredients. I is then transformed with a separate fully connected layer in the recipe RNN to serve as the initial input, i.e., r_0 = f(I). Note that our model is fully deterministic, so the same ingredient vector input will always lead to the same first instruction.

3.3 Video Encoder
For inference, we would like the recipe RNN to interpret sentences from both text and visual inputs. The modular nature of our model allows us to conveniently replace the sentence encoder with an analogous video encoder, VE. Suppose the j-th video segment c_j is composed of C_j frames, i.e., c_j = \{f_j^t\}_{t=1,\ldots,C_j} (we overload the word index t from Eqs. (1) and (2) to also denote the frame index, as the two are directly analogous in our encoders). Each frame f_j^t is represented as a high-level CNN feature vector; we use the output of the last fully connected layer of ResNet-50 [88] before the softmax layer. Similar to the sentence encoding r_j in Eqs. (1) and (2), we determine the video encoding vector v_j by applying a temporal max-pooling over each dimension of the video segment representations z_j^t:

(v_j)_d = \max_{t \in \{1,\ldots,C_j\}} (z_j^t)_d, \quad \text{where}   (5)

z_j^t = \mathrm{VE}\big[\overrightarrow{\mathrm{LSTM}}\{f_j^1, \ldots, f_j^t\};\ \overleftarrow{\mathrm{LSTM}}\{f_j^{C_j}, \ldots, f_j^t\}\big].   (6)

The video encoder, VE, is trained such that v_j can directly replace r_j. The inputs to our video encoder are frames from the video segments that correspond to individual recipe steps. We train our method on videos using ground truth segments for the video encoder. For testing, we use temporal segments based either on fixed temporal windows or on predicted segment proposals [23].

3.4 Model Learning
Our full model is learned in two stages. First, the sentence encoder (SE), recipe RNN (RE) and sentence decoder (SD) are jointly trained end-to-end. Given a recipe of N steps, we define our decoder loss, L_d, as the negative log probability of each reconstructed word t for each step j:

L_d(s_1, \ldots, s_N) = -\sum_{j=1}^{N}\sum_{t=1}^{M_j} \log P\big(w_j^t \mid w_j^{t'<t}, \hat{r}_j\big),   (7)

where P(w_j^t \mid w_j^{t'<t}, \hat{r}_j) is parameterised by a softmax function at the output layer of the sentence decoder to estimate the distribution over the words w in our vocabulary V. The overall objective is then summed over all recipes in the corpus. The loss is computed only when the LSTM is learning to decode a sentence. This first training stage is unsupervised, as the sentence encoder and decoder and the recipe RNN require only text inputs, which can easily be scraped from the web without human annotations.
In a second step, we train the video encoder (VE) while keeping the recipe RNN (RE) and sentence decoder (SD) fixed. We simply replace the sentence encoder with the video encoder while applying the same loss function as defined in Eq. (7). This step is supervised, as it requires video segments of each step that are temporally aligned with the corresponding sentences.
In addition to the word-based reconstruction loss of Eq. (7), we propose an L2 regularizer that encourages the predicted next-step representation \hat{r} of Eq. (4) to be faithful to the observed representation r of Eq. (2), i.e., the recipe loss L_r:

L_r = \sum_{j=1}^{N} \big(r_j - \hat{r}_j\big)^2.   (8)

We use the decoder loss, L_d, and the recipe loss, L_r, together with a weighting hyperparameter \alpha:

L(s_1, \ldots, s_N) = L_d + \alpha L_r,   (9)

where \hat{r}_j is the predicted output for step j and r_j is the input from the sentence encoder for step j.

3.5 Inference
During inference, we provide the ingredient vector r_0 as an initial input to the recipe RNN, which then outputs the predicted vector \hat{r}_1 for the first step of the video (see Fig. 2). We use the sentence decoder to generate the corresponding first sentence, \hat{s}_1. Then, we sample a sequence of frames from the video and apply the video encoder to generate v_1, which we again provide as an input to the recipe RNN. The output prediction of the recipe RNN, \hat{r}_2, is for the second step of the video. We again use the sentence decoder to generate the corresponding sentence \hat{s}_2.
Our model is not limited to one-step-ahead predictions: for further predictions, we can simply apply the predicted output \hat{r}_j as the contextual input r_j. During training, instead of always feeding in the ground truth r_j, we sometimes (with probability 0.5 after the 5th epoch) use our predictions \hat{r}_j as the input for the next-step predictions, which makes the model more robust to being fed poor predictions [89].
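Eqs. (4) and (7)-(9) amount to unrolling an LSTM over the step representations, reading each hidden state as the prediction of the following step, and combining a word-level cross-entropy with the L2 recipe loss. The sketch below is our own illustration of that objective, assuming the SentenceDecoder from the previous sketch; the ingredient projection f, the use of token id 0 for padding, a leading BOS token in each step, and the loss reductions are assumptions, while alpha = 0.1 follows Section 3.6.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecipeRNN(nn.Module):
    """LSTM over step vectors; hidden state h_j serves as the prediction r_hat_{j+1} (Eq. (4))."""

    def __init__(self, step_dim=1024, n_ingredients=3769):
        super().__init__()
        self.ingredient_fc = nn.Linear(n_ingredients, step_dim)   # r_0 = f(I)
        self.lstm = nn.LSTM(step_dim, step_dim, batch_first=True)

    def forward(self, ingredients, step_vecs):
        # ingredients: (batch, n_ingredients) one-hot; step_vecs: (batch, N, step_dim) = r_1..r_N
        r0 = self.ingredient_fc(ingredients).unsqueeze(1)         # (batch, 1, step_dim)
        inputs = torch.cat([r0, step_vecs[:, :-1]], dim=1)        # r_0, r_1, ..., r_{N-1}
        h, _ = self.lstm(inputs)                                  # h[:, j] predicts step j+1
        return h                                                  # (batch, N, step_dim)

def total_loss(decoder, recipe_rnn, ingredients, step_vecs, step_word_ids, alpha=0.1):
    """L = L_d + alpha * L_r (Eqs. (7)-(9)) for one batch of recipes."""
    r_hat = recipe_rnn(ingredients, step_vecs)                    # predicted step vectors
    # decoder loss L_d: word-level NLL of every step, teacher-forced with a BOS token
    logits = decoder(r_hat.flatten(0, 1), step_word_ids.flatten(0, 1)[:, :-1])
    targets = step_word_ids.flatten(0, 1)[:, 1:]
    l_d = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), ignore_index=0)    # id 0 assumed to be padding
    # recipe loss L_r: keep the predicted r_hat_j close to the observed r_j
    l_r = ((step_vecs - r_hat) ** 2).sum(dim=-1).mean()
    return l_d + alpha * l_r
```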
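At inference time (Section 3.5), the ingredient vector seeds the recipe RNN, each prediction is decoded into a sentence, and the encoding of the newly observed video segment is fed back in as context; the prediction itself can also be fed back to roll out several steps ahead. The loop below is a schematic built on the previous sketches: greedy_decode and video_encoder are placeholders for the sentence decoder's greedy decoding and the video encoder of Section 3.3, and the stepwise use of recipe_rnn.lstm is our own simplification, not the authors' exact interface.

```python
import torch

@torch.no_grad()
def predict_steps(recipe_rnn, video_encoder, sentence_decoder, greedy_decode,
                  ingredients, observed_segments, extra_future_steps=2):
    """Alternate between predicting a step and grounding it with the observed video."""
    hidden = None
    r0 = recipe_rnn.ingredient_fc(ingredients).unsqueeze(1)       # (1, 1, step_dim)
    r_hat, hidden = recipe_rnn.lstm(r0, hidden)                   # prediction for step 1
    sentences = [greedy_decode(sentence_decoder, r_hat[:, -1])]   # decode s_hat_1

    for frame_feats in observed_segments:                         # (1, C_j, feat_dim) per step
        v = video_encoder(frame_feats).unsqueeze(1)               # v_j replaces r_j as context
        r_hat, hidden = recipe_rnn.lstm(v, hidden)                # predict the following step
        sentences.append(greedy_decode(sentence_decoder, r_hat[:, -1]))

    for _ in range(extra_future_steps):                           # multi-step rollout: feed the
        r_hat, hidden = recipe_rnn.lstm(r_hat, hidden)            # prediction back as context
        sentences.append(greedy_decode(sentence_decoder, r_hat[:, -1]))
    return sentences
```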

TABLE 1
Comparisons of Our Tasty V1 and V2 Datasets With Relevant Datasets

Dataset             #videos   #images/frames   #unique ingredients   avg. #steps   #segments   avg. segment duration   avg. video duration
Tasty V1            2511      4.1M             1228                  9             21243       5s                      54s
Tasty V2            4022      8.6M             1542                  10            37530       6s                      71s
Recipe1M [10]       -         887K             3769                  9             -           -                       -
YouCookII [23]      2000      15.8M            828                   8             15400       19.6s                   303s
Epic-Kitchens [24]  432       11.5M            -                     -             39596       1.9s                    426s

Recipe1M includes textual recipes and recipe images, while YouCookII and Epic-Kitchens are video-based datasets.

3.6 Implementation and Training Details
We use a vocabulary V of 30171 words provided by the Recipe1M dataset [10]; words are represented by a 256-dimensional embedding shared by the sentence encoder and decoder. We use the ingredient vocabulary from the training set of Recipe1M; the one-hot ingredient encodings are mapped into a 1024-dimensional vector r_0. The RNNs are all single-layer LSTMs implemented in PyTorch; SE, VE, and SD have 512 hidden units, while RE has 1024. We train our model using the Adam optimizer [90] with a batch size of 50 recipes and a learning rate of 0.001. We train our text-based model (SE, RE, SD) for 50 epochs and the visual model (VE, RE, SD) for 25 epochs. We use \alpha = 0.1 for L_r. The text-based model trained with L_r converges faster, so we train this variant for only 10 epochs.

4 TASTY VIDEOS DATASET V1 AND V2

In our original publication [25], we released the Tasty Videos Dataset with 2511 unique recipes, which we refer to as V1. We have since extended the dataset to 4022 unique recipes, the Tasty Videos Dataset V2. All text recipes and videos are collected from Buzzfeed's Tasty website (https://tasty.co). We make publicly available the links to each recipe page, computed features, and temporal annotations (https://cvml.comp.nus.edu.sg/tasty) for both versions of the dataset.
In our dataset, each recipe has an ingredient list, step-wise instructions, and a video demonstrating the preparation of the dish. The videos in this dataset are captured with a fixed overhead camera and focus entirely on preparing the dish (see Fig. 2). This viewpoint removes the added challenge of distractors and irrelevant actions. This simplification is not reflective of in-the-wild environments, but it does allow us to focus our scope on modeling the sequential nature of instructional videos, which is already a highly challenging and open research topic. The videos are designed to be sufficiently informative visually without the need for any narrations.
Tasty V2 features 400 test, 400 validation, and 3222 training instances. When creating the splits of both Tasty V1 and V2, we grouped the recipes based on their similarities, e.g., muffins, tarts, and pizza, and then split the recipes in each category into training, validation, and testing sets. The recipes in the test set can be further split into those with similarities in the training set, e.g., "strawberry pretzel cheesecake" versus "carrot cake cheesecake", and those without similarities, e.g., "okonomiyaki". The Tasty V1 test set has 183 recipes with similarities to the training set and 72 without. Tasty V2 has more recipes overall, so there are more test recipes with similarities; 366 recipes are with similarities and 34 are without.
As the number of recipes increases, the overlapping steps across recipes also increase. To capture the extent of overlap, we create exhaustive verb and ingredient pairs for each sentence. 34% of the pairs are unseen during evaluation in Tasty V1, and 24% in the larger Tasty V2. If we simply concatenate the verbs and ingredients in each sentence to represent the sentences, then 64% and 62% of the tuples are unseen during evaluation in Tasty V1 and V2, respectively.
Other datasets feature crowd-sourced text [23], [24]; the recipes in our dataset are written by experts. This ensures specificity and richness in the instructions. For each recipe step, which corresponds to a single sentence, we annotate the temporal boundaries of the step in the video. We omit annotating steps without visual correspondences, such as alternative recommendations, non-visualized instructions like 'Preheat oven.', and stylistic statements such as 'Enjoy!'. For both Tasty V1 and V2, we define a split ratio of 8:1:1 for the training, validation, and testing sets.
We present the statistics of our datasets in Table 1 and in Fig. 3. We compare our dataset to the relevant datasets Recipe1M and YouCookII, as well as to the large-scale activity dataset Epic-Kitchens, in Table 1. Recipe1M is a large-scale dataset with, as the name suggests, approximately one million recipes, each with a recipe name, a list of ingredients, a sequence of instructions, and images of the final dish. YouCookII is a collection of cooking videos from YouTube with around 2000 videos of 89 dishes. Videos are captured from a third-person viewpoint. Each dish has an average of 22 videos, each with an average of 8 steps. The videos are annotated with the temporal boundaries of each step and their corresponding descriptions. Epic-Kitchens is a large-scale egocentric activity dataset with 39K action segments with category-based labels and is frequently used for action recognition and anticipation.
Tasty V2 extends the number of visual segments by 76%. It includes around 37K segments that correspond to single-sentence recipe steps. This number is comparable to the large-scale Epic-Kitchens dataset, which contains 39K segments, and is higher than the number of segments in YouCookII with its 15K segments. Compared to the 1M recipes of Recipe1M, our dataset with 4022 recipes covers 40% of the 3769 ingredients in Recipe1M. Compared to YouCookII, our dataset includes a more diverse list of dishes (4022 versus 89) and ingredients (1542 versus 828). One notable difference between these datasets is the segment/step granularity, which is on the order of a few seconds in our dataset and in Epic-Kitchens but is coarser for YouCookII (19.6 seconds on average). The videos in our dataset are short (on average 54/71 seconds for V1/V2) yet contain a challenging number of steps (on average 9/10 for V1/V2). The recipes in YouCookII and Recipe1M have a similar number of steps (8-9 on average).
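The step-overlap statistic quoted above (34%/24% of verb-ingredient pairs unseen at test time) can be reproduced in spirit by exhaustively pairing the verbs and ingredient mentions of each sentence and checking test pairs against the training set. The snippet below is a schematic of that counting; the tagger choice (NLTK's POS tagger) and the simple string matching of ingredient mentions are our assumptions rather than the exact protocol.

```python
from itertools import product
import nltk   # assumes the 'punkt' and 'averaged_perceptron_tagger' resources are installed

def verb_ingredient_pairs(sentence, ingredient_vocab):
    """All exhaustive (verb, ingredient) pairs mentioned in one recipe step."""
    tokens = nltk.word_tokenize(sentence.lower())
    verbs = {w for w, tag in nltk.pos_tag(tokens) if tag.startswith('VB')}
    ingredients = {w for w in tokens if w in ingredient_vocab}
    return set(product(verbs, ingredients))

def unseen_pair_ratio(train_sentences, test_sentences, ingredient_vocab):
    """Fraction of test-time (verb, ingredient) pairs never seen in the training sentences."""
    train_pairs = set()
    for s in train_sentences:
        train_pairs |= verb_ingredient_pairs(s, ingredient_vocab)
    test_pairs = [p for s in test_sentences for p in verb_ingredient_pairs(s, ingredient_vocab)]
    unseen = sum(p not in train_pairs for p in test_pairs)
    return unseen / max(len(test_pairs), 1)
```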

Fig. 3. Dataset distributions.

We note that the Tasty dataset also has the potential for tasks beyond anticipation, such as temporal action segmentation, dense video captioning, and object state recognition. In this paper, we compute several video captioning baselines, and we encourage the community to further develop models to tackle our challenging zero-shot dataset.

5 EXPERIMENTS: TEXT

5.1 Datasets and Evaluation Measures
We experiment with Recipe1M [10], YouCookII [23], and Tasty Videos V1 and V2.
We use the ingredients and instructions from the training split of the Recipe1M dataset to learn our sentence encoder (Eq. (1)), sentence decoder (Eq. (3)), and recipe RNN (Eq. (4)). To learn the video encoder (Eq. (6)), we use the aligned instructions and video data from the training split of either the YouCookII or the Tasty dataset. We evaluate our model's prediction capabilities with text inputs from Recipe1M and with video and text inputs from YouCookII and Tasty Videos.
Our predictions are in sentence form; evaluating the quality of generated sentences is known to be difficult in captioning and natural language generation [91], [92]. We apply a variety of measures to offer a broad assessment. First, we target the matching of ingredient and verb keywords, since they indicate the next active objects and actions and are analogous to the assessments of action anticipation [24]. Second, we evaluate using the sentence-matching scores BLEU (BiLingual Evaluation Understudy) [93] and METEOR (Metric for Evaluation of Translation with Explicit ORdering) [94], which are also used by video captioning methods [76], [77], [78]. BLEU computes an n-gram-based precision for predicted sentences w.r.t. the ground truth sentences. METEOR creates an alignment between the ground truth and predicted sentence using exact word matches, stems, synonyms, and paraphrases; it then computes a weighted F-score with an alignment fragmentation penalty.
Sentence scores like BLEU and METEOR are best at indicating sentences with precise word matches to the ground truth (GT). There are variations between sentences conveying the same idea in natural language, so automated scores may fail to match sentences a human would consider equivalent. This is true even for text with very specific language, such as cooking recipes. For example, for the ground truth sentence 'Garnish with the remaining wasabi and sliced green onions.', our method predicts 'Transfer to a serving bowl and garnish with reserved scallions.'. For a human reader, this is half correct, since 'scallions' and 'green onions' are synonyms, yet this example would have only a BLEU1 score of 30.0, a BLEU4 of 0.0, and a METEOR of 11.0. As another example, for the ground truth sentence 'Place patties on the grill, and cook for 5 minutes per side.' versus our model's prediction 'Place on the grill, and cook for about 10 minutes, turning once.', we would have a BLEU1 score of 65.0, a BLEU4 of 44.0, and a METEOR of 29.0. In this regard, we note that BLEU and METEOR scores offer only a limited ability to evaluate the predicted sentences.
The gold standard for evaluating dialogue generation [95] and captioning [96] is human subject ratings. Therefore, we conduct a user study and ask people to assess how well the predicted step matches the ground truth in meaning; if it does not match, we ask whether the prediction would be plausible for future steps. This gives flexibility in case predictions do not follow the exact aligned order of the ground truth, e.g., due to missing steps not being predicted or steps that are slightly out of order (see Figs. 4, 13 and 12).

Fig. 4. Predictions of our text-based method for ‘ Candied Bacon Sticks’ along with the automated scores and human ratings. For ‘ HUMAN1 (HUM1)’
we asked the raters to directly assess how well the predicted steps match the corresponding Ground Truth (GT) sentences; for ‘ HUMAN2 (HUM2)’,
we asked them to judge if the predicted step is still a plausible future prediction (see Section 6.6). Our prediction for step 6 matches the GT well, while
that for step 5 does not. However, according to the 'HUMAN2 (HUM2)' score, our step 5 prediction is still a plausible future action.
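The gap between automated scores and human judgment in the 'green onions' versus 'scallions' example above is easy to examine with the standard NLTK implementations of BLEU and METEOR. The snippet below is only a sanity check of the metrics, not the exact evaluation pipeline behind the reported numbers; tokenization, smoothing, and the resulting values depend on those implementation choices.

```python
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score   # needs the WordNet corpus

gt = "Garnish with the remaining wasabi and sliced green onions."
pred = "Transfer to a serving bowl and garnish with reserved scallions."

ref, hyp = word_tokenize(gt.lower()), word_tokenize(pred.lower())
smooth = SmoothingFunction().method1

bleu1 = sentence_bleu([ref], hyp, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu([ref], hyp, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
meteor = meteor_score([ref], hyp)   # credits stems and WordNet synonyms

print(f"BLEU1={bleu1:.2f}  BLEU4={bleu4:.2f}  METEOR={meteor:.2f}")
# The surface overlap between the two sentences is small, so BLEU4 collapses towards
# zero even though 'scallions' and 'green onions' name the same garnish.
```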

Fig. 5. Comparisons on Recipe1M’s test set. (a) recall of ingredients predicted by our model (‘ Ours’), skip-thought vectors (‘ ST’), and our model
trained without ingredients (‘ Ours noING’). (b) verb recall of our model (‘ Ours’) versus ‘ ST’. (c) BLEU1, BLEU4, METEOR scores for our model
(‘ Ours’) versus ‘ ST’. The x-axes in the plots indicate the step number being predicted in the recipe; each curve begins on the first prediction, i.e., the
(j+1)-th step after having received steps 1 to j as input.

5.2 Learning of Procedural Knowledge
We first verify the learning of procedural knowledge with a text-only model, i.e., the sentence encoder, sentence decoder, and the recipe RNN, by evaluating on Recipe1M's test set of 51K recipes. For a recipe of N steps, we evaluate our model's ability to predict steps j+1 to N, conditioning on steps 1 to j as input context. For comparison, we look at the generations from the commonly used sequence-to-sequence model, skip-thought (ST) vectors [79]. Skip-thought vectors are trained to decode temporally adjacent sentences from a current encoding, i.e., given step j to the encoder, the decoder predicts steps j+1 and j-1; they have been shown to be successful in generating continuous text [71], [83], [84].
We train the skip-thought vectors on the training set of the Recipe1M dataset; because the model is not designed to accept an ingredient list as a 0th or initialization step, we make skip-thought predictions only from the second step onwards. We report our results for the recipes in the entire test set of the Recipe1M dataset in Figs. 5a, 5b and 5c. We report scores of the predicted steps averaged over multiple recipes. Only those recipes that have at least j steps contribute to the average for step j.

5.2.1 Key Ingredients
We first look at our model's ability to predict ingredients and verbs on the Recipe1M dataset. For ingredients, we also compare with a variant of our model without any ingredients ('Ours noING'), where we train our network without ingredient inputs. To evaluate recall, we do not directly cross-reference the ingredient list but instead limit the evaluation to ingredients mentioned explicitly in the recipe steps. This is necessary to avoid ambiguities that may arise from specific instructions such as 'add chicken, onion, and bell pepper' versus the more vague 'add remaining ingredients'. Furthermore, the ingredient lists in Recipe1M are often automatically generated and may be incomplete.
In Fig. 5a, we compare the recall of the ingredients detected in our predicted steps versus steps generated by skip-thought vectors and by our model trained without ingredient inputs. We can see that our model's predictions successfully incorporate relevant ingredients, with recall rates as high as 39.6% for the predicted next step, 31.0% for the second, 24.8% for the third, and 20.2% for the predicted fourth step. The overall recall decreases for the later steps. This is likely due to the increased difficulty once the overall number of ingredient occurrences decreases, which tends to happen in later steps. Based on the ground truth, we observe that the majority of the ingredients occur in the early and middle steps and fewer in the last steps. The last steps usually relate to the already completed dish and do not explicitly mention as many ingredients as the earlier steps.
Compared to skip-thought vectors, our predictions' ingredient recall is higher regardless of whether or not ingredients are provided as an initial input. Without ingredient input, the overall recall is lower, but after the initial step, our model's recall increases sharply, i.e., once it receives some context. Our model without the ingredient input still performs better than the skip-thought predictions. We attribute this to the strength of our model in generalizing across related recipes, which allows it to predict relevant co-occurring ingredients. Our predictions include common ingredients such as salt, butter, eggs and water, and also recipe-specific ones such as couscous, zucchini, or chocolate chips. While skip-thought vectors predict some common ingredients, they fail to predict recipe-specific ingredients.

5.2.2 Key Verbs
Key verbs indicate the main action of a step and are also cues for future steps, both immediate (e.g., 'mix' after 'adding' ingredients into a bowl) and long-term (e.g., 'bake' after 'preheating' the oven). We tag the verbs in the training recipes with a natural language toolkit [98] and select the 250 most frequent verbs for evaluation. Similar to ingredients, we check the recall only for those verbs appearing in the ground truth steps. In the ground truth steps, there are between 1.55 and 1.85 verbs per step, i.e., steps often include multiple verbs, such as 'add and mix'.
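The ingredient and verb recall used throughout this section counts how many of the keywords (ingredients, or the 250 most frequent verbs) mentioned in the ground-truth step also appear in the predicted step. A schematic of that recall is below; the simple word-level matching is our simplification of the protocol described above.

```python
def keyword_recall(pred_sentence, gt_sentence, keyword_vocab):
    """Fraction of keywords in the GT step that the predicted step recovers."""
    pred_words = set(pred_sentence.lower().split())
    gt_keywords = {w for w in gt_sentence.lower().split() if w in keyword_vocab}
    if not gt_keywords:                 # steps without keywords do not contribute
        return None
    return len(gt_keywords & pred_words) / len(gt_keywords)

# Averaged over all test recipes that have at least j steps, once with the ingredient
# vocabulary and once with the 250 most frequent verbs, this is the kind of per-step
# recall plotted in Fig. 5a and Fig. 5b.
```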

Fig. 6. Comparisons with GPT-2 [97] on Recipe1M’s test set. Compared to our text-based model, GPT-2 has lower performance in all scores, indicat-
ing the importance of a dedicated hierarchical model like ours.
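The paper does not detail how the GPT-2 baseline of Fig. 6 was configured, but a comparable autoregressive baseline can be set up with an off-the-shelf language model fine-tuned on recipe text: condition on the ingredients and the observed steps, then generate a continuation as the next step. The sketch below uses the Hugging Face transformers API and is entirely our assumption about such a baseline, not the authors' setup.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")      # ideally fine-tuned on Recipe1M text

def gpt2_next_step(ingredients, observed_steps, max_new_tokens=40):
    """Generate a candidate next recipe step from the ingredients and the steps seen so far."""
    prompt = "Ingredients: " + ", ".join(ingredients) + "\n" + "\n".join(observed_steps) + "\n"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    continuation = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return continuation.split("\n")[0]               # first generated line as the next step
```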

Fig. 5b shows that our model recalls up to 30.6% of the verbs with the predicted next step. Our model's performance is poor in the first steps, due to ambiguities when given only the ingredients without any further knowledge of the recipe. After the first steps, our model's performance quickly increases and stays consistent across the remaining steps. In comparison, the ST model's best recall is only 20.1% for the next-step prediction.

5.2.3 Sentences
Key ingredients and verbs alone do not capture the rich instructional nature of recipe steps; compare, e.g., 'whisk' and 'egg' to 'Whisk the eggs till light and fluffy'. As such, we also evaluate the quality of the entire predicted sentences based on the BLEU1, BLEU4, and METEOR scores. We compare with skip-thought vectors in Fig. 5c. Our BLEU1 scores are consistently high, at around 22.7 for the next-step predictions, with a slight decrease towards the end of the recipe. Predictions further than the next step have lower scores, though they stay above 12.0. The BLEU4 scores are highest for the very first step, around 5.8, and range between 0.7 and 4.3 over the remaining steps. The high performance at the beginning is because many of the recipes start with common instructions (e.g., 'Preheat oven to X degrees', 'In a large skillet, heat the oil'). For similar reasons, we also do well towards the end of recipes, where instructions for serving and garnishing are common (e.g., 'Season with salt and pepper'). Trends for the METEOR score are similar to our BLEU1 scores. METEOR scores are above 13.0 for the next-step predictions and do not go lower than 6.5 for the further-step predictions.
Our proposed method outperforms skip-thought predictions across the board. In fact, our predictions up to four steps into the future surpass the predictions made by skip-thought vectors only one step ahead. This can be attributed to the dedicated long-term modeling of the recipe RNN; as such, we are able to incorporate the context from all sentence inputs up until the present. In contrast, skip-thoughts are Markovian in nature and can only take the current step into account.
Our model predicts coherent and plausible instructional sentences, as shown in Fig. 4. One interesting and unexpected outcome of our model is that it also makes recommendations. In cooking recipes, one finds not only strict recipe steps but also suggestions based on the writer's experience (e.g., 'If using wooden skewers, make sure to soak in water.'). Our learned model also generates such suggestions. For example, for the ground truth 'If it's too loose at this point, place it in the freezer for a little while to let freeze.', our model predicts 'If you freeze it, it will be easier to eat'.
We also compare our model with a transformer-based architecture, GPT-2 [97], in Fig. 6. We observe a trend similar to that of the ST vectors [79] in all scores, indicating a significant gap and highlighting the importance of a dedicated hierarchical model like ours.

5.3 Encoder Modularity
Since our network is modular, we check the interchangeability of the sentence encoder by replacing it with skip-thought vectors trained on the Recipe1M dataset, as provided by [10]. For this experiment, we train the recipe RNN and sentence decoder jointly using the pre-trained skip-thought vectors as sentence representations. The recipe RNN and sentence decoder are trained with the same parameter settings as our full model in all the ablation studies.
Fig. 7 compares the sentence scores of our joint model ('Ours'), our joint model trained without ingredient inputs ('Ours noING'), and our model using the pre-trained skip-thought vectors ('ST vectors') when X%, where X ∈ {0, 25, 50, 75}, of a recipe is observed. Our sentence encoder performs on par with skip-thought vectors as an encoding. An advantage of our model, however, is that our encoder and decoder can be trained jointly, and there is no need for separate pre-training of a sentence auto-encoder, as required when using skip-thought vectors as input. Similar to our observations for ingredient recall, we see that ingredient information is very important for predicting sentences, especially for the initial steps. In subsequent steps, once 25% or 50% of the recipe steps are observed (i.e., enough context), the model's performance starts to improve.

5.4 Amount of Training Data
At the core of our method is the transfer of knowledge from text resources to solve a challenging visual problem. We evaluate the effectiveness of the knowledge transfer by varying the amount of training data from Recipe1M used for pre-training. We train our method with 100%, 50%, and 25% of the Recipe1M training set. We present our results in Table 2. Looking at the scores averaged over all the predicted steps on the Tasty Videos dataset, we observe a decrease in all evaluation measures as we limit the amount of data from Recipe1M (see 'ours text' 100%, 50%, 25%, and 0%), with the most significant decrease occurring for the BLEU4 score. When using less text data, our method's BLEU4 score drops from 3.30 to 2.42 when half of Recipe1M is used, and to 2.03 when a quarter of the dataset is used. While we observe a similar decrease in the ingredient detection scores, the decrease in the BLEU1, METEOR, and verb scores remains less significant. If there is no pre-training, i.e., when the model is learned only on text from Tasty Videos V1 ('ours text (0%)'), the decrease in scores is noticeable for all evaluation criteria.

Fig. 7. Ablations on the interchangeability of the sentence encoder and the influence of ingredient inputs, evaluated on Recipe1M’s test set. We com-
pare the sentence scores of our joint model (‘ Ours’), our joint model without ingredient inputs (‘ Ours noING’), and our model where the sentence
encoder is replaced with pre-trained skip-thought vectors (‘ ST vectors’). ‘ X% seen’ refers to the number of steps the model receives as input, while
predicting the remaining (100 − X)%.

These results verify that pre-training has a significant effect on our method's performance.

TABLE 2
Evaluations on Tasty V1's Test Set for the Textual Model When the Number of Training Recipes Varies

5.5 Loss Variants
We train our recipe network using a decoder loss, L_d (Eq. (7)), and additionally propose a recipe loss, L_r (Eq. (8)), in Section 3. Table 3 presents our experiments analyzing the influence of the recipe loss, evaluated on the test set of Recipe1M. Overall, the recipe loss L_r, as expected, increases the next-step prediction performance. The increase is more than 0.5% for verb and ingredient recall, and 0.7, 0.1, and 0.3 for the BLEU1, BLEU4, and METEOR scores. However, using the recipe loss significantly decreases the scores in the further stages, particularly the sentence scores. The most significant decrease is observed for the BLEU4 scores, which drop to 0.4 four steps into the future ('next+3'). Only the ingredient recall does not significantly decrease and is at least 4% higher than our model trained with the decoder loss alone. However, upon closer inspection of individual predictions, we see that the recipe loss has a tendency to promote repetitive outputs. Repetitions are a common but undesirable outcome for natural language generation [99]. In this case, as consecutive steps are more likely to describe the handling of common items or ingredients, the ingredient recall is not as directly affected.
Beam search [100] is known to improve the performance of text-based generation algorithms [101]. In Table 4, we evaluate our model, trained with the decoder loss L_d, using greedy and beam search. Greedy decoding selects the word with the highest probability at every decoding step. Beam search keeps track of a beam of k possible generations and updates them at every step of decoding by ranking them according to the model likelihood. Although it is k times more expensive than greedy search, it improves performance. In our experiments we use k = 5. This improves both the ingredient and verb scores by 0.8%, and our sentence scores by 0.7, 0.3, and 0.5 for BLEU1, BLEU4, and METEOR, respectively. We observe similar improvements over the later steps as well. We report the beam search decoding results only in Table 4, while in our other results tables, out of fairness, we use greedy search.
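Greedy and beam decoding differ only in how the sentence decoder's per-word distribution is consumed. A compact beam-search sketch is shown below with k = 5 as in the experiments; the stepwise decode_step interface is a placeholder for the LSTM sentence decoder described in Section 3.1, and length normalization is omitted for brevity.

```python
def beam_search(decode_step, start_state, bos_id, eos_id, k=5, max_len=30):
    """decode_step(prev_token, state) -> (1-D tensor of log probs over the vocab, new_state)."""
    beams = [(0.0, [bos_id], start_state)]            # (cumulative log prob, tokens, state)
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beams:
            if tokens[-1] == eos_id:                  # keep finished hypotheses as they are
                candidates.append((score, tokens, state))
                continue
            log_probs, new_state = decode_step(tokens[-1], state)
            top_lp, top_ids = log_probs.topk(k)       # expand each beam by its k best words
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((score + lp, tokens + [idx], new_state))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:k]
    return beams[0][1]                                # highest-scoring word sequence
```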

5.6 State-of-the-Art Comparisons


There are several works [26], [71] that generate instructional
text for the cooking domain. Similar to our work, such
approaches also target learning procedural knowledge in
text. Inverse cooking [26] generates recipes for food images.

TABLE 3
Comparisons for the Decoder Loss, L_d, and the Decoder + Recipe Loss, L_d + L_r, on Recipe1M's Test Set

TABLE 4
We Compare Greedy (gr) and Beam (b) Search When Decoding Sentences on Recipe1M's Test Set


Fig. 8. We compare our visual model when tested with GT segments, ‘ Visual GT’, and our textual model, ‘ Text’, on the Tasty Videos V1’s test set for
next-step predictions using the recall of predicted ingredients and verbs as well as sentence scores. Compared to our text-based model, our visual
model has lower performance but follows similar trends.

It uses a two-stage transformer approach: the first transformer predicts ingredients from a recipe picture, while the second transformer generates the recipe from these ingredients. Inverse cooking uses a subset of the Recipe1M dataset, removing the recipes that contain no images and those with fewer than two ingredients or steps. For a fair comparison, we train and test our model using this subset, which features 252.5k and 54.5k recipes for training and testing, respectively. Their network is limited to generating a maximum of 150 words per recipe. Since our model has no such limit on generation length, for a fair comparison we also truncate our model's generations after 150 words. We compare our model to the publicly released Inverse cooking model by directly providing the ground truth ingredients to their recipe generation transformer.
We first evaluate our model's capabilities in recipe generation and compare our generations to Inverse cooking in Fig. 9 for individual steps. Both methods are evaluated using GT ingredients. Overall, our model outperforms 'Inverse' for recipe generation for individual steps. Our method's performance is significantly higher for ingredient recall, by at least 10% for all steps. Similarly, for the BLEU4 score, our model's first-step prediction is 4.8, while Inverse's is only 1.0. However, for the later steps, we perform comparably on this metric. We observe similar gaps between the first-step predictions in the BLEU1, METEOR, and verb scores.
Next, we evaluate 'Inverse' for next-step prediction. For each step, we feed the ground truth sentences of the previous steps into Inverse's transformer-based sentence decoder, which generates a sentence for the next step. The results in Fig. 10 show that, although the transformer decoder has access to the complete history of the GT recipe until the next step, our method significantly outperforms this recipe generation network on all scores. For ingredients and verbs, the difference is more than 15% for all steps. Inverse's performance degrades towards the end of the recipe, while ours stays consistently high for all metrics. The most significant decrease for Inverse is in BLEU4, which drops to almost zero after the 5th step, whereas ours is invariably above 1.5.

6 EXPERIMENTS: VIDEO

6.1 Video Predictions on Tasty V1
We first evaluate our model for making predictions on video inputs on Tasty Videos V1's test set. To explore the importance of video partitioning, we consider the following settings for inference: one according to ground truth segments, 'ours visual (GT)', and one based on fixed temporal windows, 'ours visual (window)'. For the temporal window experiments, the videos are partitioned into chunks of fixed-size windows up to the latest observation. We sequentially feed the representations of these chunks, obtained from the video encoder, into our recipe RNN. Overall, our method is relatively robust to the window size, as shown in Table 6 for BLEU4 scores. We empirically select a window of 170 for Tasty Videos. In both settings, every fifth frame in the GT or window-based segments is sampled, and the visual features of these frames are fed into the video encoder. The representations from the video encoder are then fed to the recipe RNN as context vectors. Through the video encoder, our model can interpret visual evidence and make plausible predictions of the next steps; these are exemplified in Figs. 12 and 13, where it can be seen that our visual model corrects itself after observing new evidence.
The results are shown in Table 5. Compared to using ground truth segments, 'ours visual (GT)', using fixed window segments, 'ours visual (window)', results in a decrease in performance, with the most extreme drop on the most challenging sentence score, BLEU4 (around 17%), and on the ingredient scores (around 18%). For the verb, BLEU1, and METEOR scores, the decrease is not as big (lower than 10%).

TABLE 5
Evaluations on Tasty Videos V1 for Our Visual and Text-Based Model Along With Comparisons Against Video Captioning Methods [78], [102]
Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on June 08,2024 at 14:04:13 UTC from IEEE Xplore. Restrictions apply.
SENER ET AL.: TRANSFERRING KNOWLEDGE FROM TEXT TO VIDEO: ZERO-SHOT ANTICIPATION FOR PROCEDURAL ACTIONS 7847

TABLE 6
Window Size Selection on the Tasty V1 and YouCookII Datasets

window s. 30 50 70 90 110 130 150 170 190 210 230


Tasty V1 0.75 0.90 0.93 1.06 1.18 1.09 1.07 1.23 1.09 1.19 1.06
YouCookII 0.60 1.10 1.38 1.32 1.32 1.28 1.30 1.20 1.17 1.20 1.22

Reported are the BLEU4 scores.

from text to video is as expected. The video results, how-


ever, follow similar trends as the text; see, for example,
Fig. 9. We evaluate our model for recipe generation given ingredients
Fig. 8, where we provide step-wise comparisons of our tex- and compare our performance to a recipe generation network,
tual and visual models (GT). We further investigate the ‘Inverse’ [26], on a subset of Recipe1M.
influence of the ingredients on the performance of our
method. When ingredients are not provided, ‘ours text
noING’, our method fails to make plausible predictions. The background frames that are irrelevant to the recipe steps.
performance decrease is mainly noticeable in the ingredient We suspect using large windows misses important cues for
scores and the BLEU4 scores. the YouCookII steps.
In some instructional scenarios, there may be semi- Similar to our observations on Tasty V1, using window
aligned text that accompanies the video, e.g., narrations. We segments, ‘ours visual (window)’, instead of ground truth,
test such a setting by training the sentence and video ‘ours visual (GT)’, results in decreased accuracy. The larg-
encoder, as well as sentence decoder and recipe RNN est decrease is observed for the BLEU4 and ingredient
jointly, to make future step predictions. For this, the sen- scores, by 16% and 17%, respectively, highlighting the
tence and video context vectors are first concatenated and importance of achieving high scores for these metrics.
then passed through a linear layer before feeding them as The decrease for BLEU1 is the smallest, by 4%. Compar-
input to the Recipe RNN. Overall, the results are better than ing the performance of our visual with the textual model,
our video-alone results but not better than our text-alone ‘ours text’, the textual results are better overall on Tasty
results (see ‘ours video-text’ in Table 5). Even with joint train- V1 than YouCookII.
ing, it is still challenging to make improvements, which we
attribute to the diversity in our videos and the variations in
the text descriptions for similar visual inputs. On the other 6.3 Video Predictions on Tasty V2
hand, when there is accompanying text, our model can be In addition to using fixed window-based segments, we train
adapted easily and improve prediction performance. the transformer-based proposal decoder from Zhou et al.
[23] on the Tasty V2 dataset to generate segment proposals.
First, using non-maximum suppression, at each iteration,
6.2 Video Predictions on YouCookII the longest proposal is selected, and those with an IoU with
To further validate the effectiveness of our model on pub- the longest proposal that is greater than a threshold of 0.2
licly available datasets with sentence-level annotations, we are discarded. Then, to obtain non-overlapping proposals,
also evaluate it on the YouCookII dataset. Table 7 compares the overlapping regions among the overlapping proposals
our visual model evaluated with ground truth segments, are divided w.r.t. their lengths; an example can be seen in
‘ours visual (GT)’, and temporal windows, ‘ours visual (win- Fig. 11. We evaluate the quality of the proposals using mean
dow)’, on YouCookII’s validation set. For the ‘ours visual intersection over union (IoU), which is 39.59%.
(window)’ experiments, a window of 70 frames is empirically The results are shown in Table 8. To validate our method’s
selected, see Table 6. YouCookII includes a large number of performance on Tasty V2, we use ground truth segments,
‘video (GT)’, fixed temporal windows, ‘video (window)’, and
TABLE 7 segment proposals, ‘video (proposal)’, as input to our video
Evaluation of Our Visual and Text-Based Models and Compari- encoder during inference. Compared to using temporal win-
son Against Two Video Captioning Methods [78], [103] on dows, proposals improve ingredient scores by 2% and
YouCookII’s Validation Set BLEU4 by 12%. The other scores show slight differences.

Fig. 10. We employ the recipe generation network ‘Inverse’ [26] for next-
step prediction and compare our performance. Both methods are tested
on a subset of Recipe1M.
Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on June 08,2024 at 14:04:13 UTC from IEEE Xplore. Restrictions apply.
7848 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 6, JUNE 2023

Fig. 11. Example proposals from a proposal decoder [78] trained on


Tasty V2.

This small performance gap between the window and pro-


posal-based segments indicates the difficulty of partitioning
our videos and motivates exploring more robust ways of
generating segments/proposals. The text-based results, ‘ours
text’, maintain the highest scores in all metrics, encouraging
us to further develop visual models to tackle in anticipation
of our challenging zero-shot dataset.

6.4 Supervised versus Zero-Shot Learning


YouCookII is a suitable dataset to compare the differences
between supervised and zero-shot learning. As the provided
splits for this dataset are not zero-shot and overlap in the
dishes for training and test, we create our own splits based
on distinct dishes for this ablation study. We divide the data-
set into four splits based on the 89 dishes, 22 dishes per split,
and use three splits for training and half of the videos in the
fourth split for testing. In the zero-shot setting, the videos
from the other half of the fourth split are unused, while in the
supervised setting, they are included as part of the training.
We report our results as averages over the four cross-folds
in Table 9. As expected, the predictions are better when the
model is trained under a supervised setting than a zero-shot
setting. This is true for all inputs, with the same drop as
observed previously when moving from text to video inputs
and when moving from ground truth video segments to fixed
window segments. However, the difference between the
supervised versus zero-shot setting (‘sup. visual’ versus ‘zero
visual’) is surprisingly much smaller than the difference
between a supervised setting with and without pre-training
Fig. 13. Next-step predictions for ‘Garlic Knots’ shown in blue. After bak-
on Recipe1M (‘sup. visual’ versus ‘sup. visual w/o pre-train’). ing in step 7, our visual model predicts that the dish should be served.
This suggests that having a large corpus for pre-training is Yet when presented with visual evidence of the garlic butter in step 8, it
more useful than repeated observations for a specific dish. correctly predicts that the knots should be brushed with the mixture in
Fig. 14 shows a detailed transition from the zero-shot sce- step 9.
nario (no videos about the evaluated dish in the training set)
to a one-shot setting (only one similar video) and incremental
addition of training videos until the fully supervised case
(average of 11 videos from the same dish). One can see that
performance increases as more videos are added, indicating
that the model is learning and that more than 11 videos (cur-
rent supervised setting) will further improve the supervised
performance.

6.5 Comparisons to Video Captioning


We show that knowledge transfer considerably improves
our method’s predictions, see Section 5.4. To further vali-
date our claims, we compare our method against different
video captioning methods in Tables 5, 10 and 7 for the Tasty
Videos V1, V2 and YouCookII datasets, respectively. Unlike
predicting future steps, captioning methods generate sen-
tences after observing visual data. In principle, this should
be an easier task than predicting the future.
We compare our model on the validation set of YouCoo-
Fig. 12. Next-step predictions from our visual model for ‘ Salted Caramel
Hot Chocolate’ in blue. Note that our model predicts the next steps with- kII against two captioning methods [78], [103] in Table 7.
out having observed the corresponding video segment. The End-to-end masked transformer [78] performs dense
Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on June 08,2024 at 14:04:13 UTC from IEEE Xplore. Restrictions apply.
SENER ET AL.: TRANSFERRING KNOWLEDGE FROM TEXT TO VIDEO: ZERO-SHOT ANTICIPATION FOR PROCEDURAL ACTIONS 7849

TABLE 8
Evaluation of the Visual and Text-Based Models and Compari-
sons Using Different Segments on Tasty V2

Fig. 14. Zero-shot (‘ Zero’) versus supervised (‘ Sup.’) comparison on


YouCookII when the number of training videos from the same dish is
increased. When more videos from the same dish are added into the
training set, the BLEU4 score increases.

video captioning by both localizing steps and generating significant margin for all scores. We also test a state-of-the-
descriptions for these steps. Instead of separating the cap- art video captioning method, ‘End-to-end [78]’, on Tasty V1
tioning problem into the two stages of proposal generation and get a BLEU4 and METEOR score of 0.54 and 5.48,
and captioning, [78] produce proposals and descriptions respectively (versus our future prediction scores 1.23 /
simultaneously. Their work is composed of a transformer- 11.00). The poor performance is likely due to the increased
based [104] video encoder for context-aware features, a pro- dish diversity and the difficulty of our dataset compared to
posal decoder similar to [23] that localizes action proposal YouCookII.
candidates, and finally, a transformer-based decoder that Finally, we train the End-to-end masked transformer [78]
generates captions. TempoAttn [103] is an RNN-based on our Tasty V2 for captioning. Table 10 compares our
encoder-decoder with attention. A variant of Tem- model to this work using ground truth segments. Similar to
poAttn [103] is trained on YouCookII by Zhou et al. [78] our observations on other datasets, although ‘End-to-
after several changes have been made to the model for a fair end [78]’, incorporates context into predictions and predicts
comparison, including adding a Bi-LSTM context encoder the current observation, our method, which predicts the
and adding temporal attention. next steps, outperforms it for all sentence scores.
In Table 7, we see that even though the anticipation task is
more difficult than captioning, our method outperforms both 6.6 Human Ratings
of the captioning methods for the BLEU4 and METEOR As automated scores such as BLEU and METEOR are not
scores. Compared to the state-of-the-art video captioning fully representative of the correctness of the predicted steps,
method, [78], our visual model achieves a METEOR score we also ask humans to evaluate our model’s predictions. We
twice as high and a BLEU4 score four times higher. We attri- invite three volunteers to assess how well the anticipated
bute the better performance of our method compared to the steps match the ground truth with scores 0 (‘not at all’), 1
captioning methods to the pre-training on the Recipe1M data- (‘somewhat’), or 2 (‘very well’). If the prediction receives a
set, which allows our model to generalize. Note that for You- score of 0, we additionally ask the participant to judge if the
CookII, as we use all the videos in the training set, our predicted step is still a plausible future prediction, again
training is no longer a zero-shot but a supervised scenario. with the same scores of 0 (‘not at all’), 1 (‘somewhat’), or 2
Table 5 compares our model against different captioning (‘very likely’). The study is done on a subset of 30 recipes from
methods on the Tasty V1 dataset. We also test S2VT [102], Recipe1M’s test set, each with seven steps. Ratings are com-
an RNN-based encoder-decoder, on the ground truth seg- pared to the automated sentence scores in Fig. 15.
ments of Tasty V1 for captioning. Our visual model outper- In Fig. 15, the upper graph (a) shows the results for the
forms this baseline, especially for ingredient recall, by 13%, human raters. In this plot, ‘exact match’ corresponds to
and with an improvement of 0.3 in the BLEU4 score. To humans assessing if the predicted steps match the ground
highlight the difficulty of predicting future steps compared truth, i.e. the plausibility of the prediction regarding the
to captioning, we train S2VT for predicting the next step ground truth sentence. Raters report a score close to 1 for the
from the observation of the current step, ‘S2VT [102] next initial step predictions, indicating that our method, even by
(GT)’. Our visual model outperforms this variation with a only seeing the ingredients, can start predicting plausible
steps. Scores increase towards the end of the recipe and are
lowest at step 3. ‘future match’ corresponds to humans assess-
TABLE 9
Comparison of Our Zero-Shot and Supervised Setting on You- ing if a step is a plausible future prediction given all previous
CookII, Computed Using 4-Fold Cross-Validation steps. The average score of the predicted steps being a possi-
ble future prediction is consistently high across all steps.
Even if the predicted step does not exactly match the ground

TABLE 10
Captioning Results of End-to-End [78] versus Our Method Eval-
uated for Next-Step Prediction on the Tasty V2 Dataset

BLEU1 BLEU4 METEOR


End-to-end (GT) [78] 18.33 1.14 6.73
ours visual (GT) 19.34 1.71 12.64

Both methods are evaluated using GT segments.


Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on June 08,2024 at 14:04:13 UTC from IEEE Xplore. Restrictions apply.
7850 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 6, JUNE 2023

Currently, our method only employs textual recipes and


ingredient keys. It can be further improved using additional
cues such as the amount of ingredients, which are crucial
for real-life instructions. However, this will likely require a
dedicated architecture for handling such data to keep track
of the ingredient amounts. Further improvements could be
achieved by aggregating information from multiple similar
recipes or from user feedback and comments.

REFERENCES
[1] H. S. Koppula and A. Saxena, “Anticipating human activities
using object affordances for reactive robotic response,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 14–29, Jan. 2016.
Fig. 15. We conduct a user study and ask human raters to asses how [2] C. Wu, J. Zhang, B. Selman, S. Savarese, and A. Saxena, “Watch-Bot:
well the predicted sentences match the ground truth sentences. We Unsupervised learning for reminding humans of forgotten actions,”
present the comparison of human ratings (a) versus automated sen- in Proc. IEEE Int. Conf. Robot. Automat., 2016, pp. 2479–2486.
tence scores (b). [3] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, “One-shot
visual imitation learning via meta-learning,” in Proc. Conf. Robot
truth, human raters still consider it possible in the future, Learn., 2017, pp. 357–368.
[4] N. S€underhauf et al., “The limits and potentials of deep learning
including the previously low rating for step 3. Overall, the
for robotics,” Int. J. Robot. Res., vol. 37, pp. 405–420, 2018.
ratings indicate that the predicted steps are plausible. [5] Y. Tang et al., “COIN: A large-scale dataset for comprehensive
The lower graph (b) in Fig. 15 shows automated scores instructional video analysis,” in Proc. IEEE Conf. Comput. Vis. Pat-
for the same set of recipes used in our user study. The left tern Recognit., 2019, pp. 1207–1216.
[6] Wikihow, “How to do anything,” 2005. [Online]. Available:
plot shows the standard scores for the predicted sentences http://www.wikihow.com/
matching the ground truth. Overall, the trends are very sim- [7] D. Zhukov, J.-B. Alayrac, R. G. Cinbis, D. Fouhey, I. Laptev, and
ilar to the user study, including the low-scoring step 3. To J. Sivic, “Cross-task weakly supervised learning from instruc-
match the second setting of the user study, we compute the tional videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
2019, pp. 3532–3540.
sentence scores between the predicted sentence s^j , and the [8] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J.
next four future ground truth steps fsj ; sjþ1 ; sjþ2 ; sjþ3 g and Sivic, “HowTo100M: Learning a text-video embedding by watch-
select the step with the maximum score as our future match. ing hundred million narrated video clips,” in Proc. IEEE Int. Conf.
Comput. Vis., 2019, pp. 2630–2640.
These scores are plotted in the lower right plot of Fig. 15. [9] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zis-
Similar to the second setting in the human study, the sen- serman, “End-to-end learning of visual representations from
tence scores increase overall. uncurated instructional videos,” in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit., 2020, pp. 9879–9889.
[10] A. Salvador et al., “Learning cross-modal embeddings for cook-
7 CONCLUSION ing recipes and food images,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., 2017, pp. 3068–3076.
In this paper, we posed a new problem setting of zero-shot [11] J. Malmaud, E. Wagner, N. Chang, and K. Murphy, “Cooking
action anticipation. We presented a model that can general- with semantics,” in Proc. ACL Workshop Semantic Parsing, 2014,
pp. 33–38.
ize instructional knowledge from the text domain and be [12] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng,
applied to videos. Using this model, we tackled the chal- “Multimodal deep learning,” in Proc. Int. Conf. Mach. Learn.,
lenging task of predicting the steps of complex tasks from 2011, pp. 689–696.
[13] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning
visual data. Our model produces coherent and plausible with deep Boltzmann machines,” in Proc. Int. Conf. Neural Inf.
future steps from both text and video inputs. Such a task Process. Syst., 2012, pp. 2231–2239 .
has been to date otherwise not possible because of the scar- [14] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid,
city in annotated training data. Our evaluation shows that “VideoBERT: A joint model for video and language representa-
tion learning,” in Proc. IEEE Int. Conf. Comput. Vis., 2019,
our anticipation method is more competitive than all other pp. 7463–7472.
baselines, even when compared against video captioning [15] Y. Zhu et al., “Aligning books and movies: Towards story-like
methods that have access to the visual data. While we score visual explanations by watching movies and reading books,” in
Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 19–27.
well for keyword recall like ingredients and verbs, our sen- [16] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and
tence scores, like the challenging BLEU4, are still poor. We S. Lacoste-Julien, “Unsupervised learning from narrated instruc-
believe this highlights the difficulty of our task and thus tion videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
aim for improvements in our future work. 2016, pp. 4575–4583.
[17] H. Kuehne, A. Arslan, and T. Serre, “The language of actions:
To complement our new task and model, we presented a Recovering the syntax and semantics of goal-directed human
diverse dataset of 4022 cooking videos and recipes. All the activities,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
videos are annotated with the temporal boundaries of the 2014, pp. 780–787.
textual recipe steps. Our dataset includes cooking videos [18] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, “A database
for fine grained activity detection of cooking activities,” in Proc.
with various dish categories, cookware, and ingredients IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 1194–1201.
and provides researchers with a rich database to study the [19] Y. A. Farha, A. Richard, and J. Gall, “When will you do what?
challenging zero-shot anticipation problem. We also hope Anticipating temporal occurrences of activities,” in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5343–5352.
that its diversity will motivate researchers to study tasks [20] T. Lan, T.-C. Chen, and S. Savarese, “A hierarchical representa-
beyond anticipation, such as dense video captioning, tem- tion for future action prediction,” in Proc. Eur. Conf. Comput. Vis.,
poral s egmentation, visual grounding, and retrieval. 2014, pp. 689–704.
Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on June 08,2024 at 14:04:13 UTC from IEEE Xplore. Restrictions apply.
SENER ET AL.: TRANSFERRING KNOWLEDGE FROM TEXT TO VIDEO: ZERO-SHOT ANTICIPATION FOR PROCEDURAL ACTIONS 7851

[21] Y. Zhou and T. L. Berg, “Temporal perception and prediction in [45] J. Gao, Z. Yang, and R. Nevatia, “RED: Reinforced encoder-
ego-centric video,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, decoder networks for action anticipation,” in Proc. Brit. Mach.
pp. 4498–4506. Vis. Conf., 2017.
[22] D. Lin, C. Kong, S. Fidler, and R. Urtasun, “Generating multi- [46] M. Zolfaghari, O.€ Çiçek, S. M. Ali, F. Mahdisoltani, C. Zhang, and T.
sentence lingual descriptions of indoor scenes,” in Proc. Brit. Brox, “Learning representations for predicting future activities,”
Mach. Vis. Conf., 2015. 2019, arXiv: 1905.03578.
[23] L. Zhou, C. Xu, and J. J. Corso, “Towards automatic learning of [47] V. Tran, Y. Wang, and M. Hoai, “Back to the future: Knowl-
procedures from web instructional videos,” in Proc. 32nd AAAI edge distillation for human action anticipation,” 2019, arXiv:
Conf. Artif. Intell. 13th Innov. Appl. Artif. Intell. Conf. 8th AAAI 1904.04868.
Symp. Educ. Adv. Artif. Intell., 2018, Art. no. 930. [48] G. Camporese, P. Coscia, A. Furnari, G. M. Farinella, and L. Bal-
[24] D. Damen et al., “Scaling egocentric vision: The EPIC-KITCH- lan, “Knowledge distillation for action anticipation via label
ENS dataset,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 753–771. smoothing,” 2020, arXiv: 2004.07711.
[25] F. Sener and A. Yao, “Zero-shot anticipation for instructional [49] Y. Shen, B. Ni, Z. Li, and N. Zhuang, “Egocentric activity predic-
activities,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 862–871. tion via event modulated attention,” in Proc. Eur. Conf. Comput.
[26] A. Salvador, M. Drozdzal, X. Giro-i-Nieto, and A. Romero, Vis., 2018, pp. 202–217.
“Inverse cooking: Recipe generation from food images,” in Proc. [50] M. Liu, S. Tang, Y. Li, and J. Rehg, “Forecasting human object
IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10445–10454. interaction: Joint prediction of motor attention and egocentric
[27] A. Richard and J. Gall, “Temporal action detection using a statis- activity,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 704–721.
tical language model,” in Proc. IEEE Conf. Comput. Vis. Pattern [51] E. Dessalene, M. Maynord, C. Devaraj, C. Fermuller, and Y. Aloi-
Recognit., 2016, pp. 3131–3140. monos, “Egocentric object manipulation graphs,” 2020, arXiv:
[28] M. Hoai, Z.-Z. Lan, and F. De la Torre, “Joint segmentation and 2006.03201.
classification of human actions in video,” in Proc. IEEE Conf. [52] Q. Ke, M. Fritz, and B. Schiele, “Time-conditioned action antici-
Comput. Vis. Pattern Recognit., 2011, pp. 3265–3272. pation in one shot,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
[29] C. Lea, A. Reiter, R. Vidal, and G. D. Hager, “Segmental spatio- ognit., 2019, pp. 9917–9926.
temporal CNNs for fine-grained action segmentation,” in Proc. [53] F. Sener, D. Singhania, and A. Yao, “Temporal aggregate repre-
Eur. Conf. Comput. Vis., 2016, pp. 36–52. sentations for long-range video understanding,” in Proc. Eur.
[30] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, “A multi- Conf. Comput. Vis., 2020, pp. 154–171.
stream bi-directional recurrent neural network for fine-grained [54] T. Mahmud, M. Billah, M. Hasan, and A. K. Roy-Chowdhury,
action detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog- “Prediction and description of near-future activities in video,”
nit., 2016, pp. 1961–1970. 2020, arXiv:1908.00943.
[31] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, [55] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a
“Temporal convolutional networks for action segmentation and few examples: A survey on few-shot learning,” ACM Comput.
detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Surv., vol. 53, no. 3, pp. 1–34, 2020.
2017, pp. 1003–1012. [56] W. Wang, V. W. Zheng, H. Yu, and C. Miao, “A survey of zero-
[32] Y. A. Farha and J. Gall, “MS-TCN: Multi-stage temporal convolu- shot learning: Settings, methods, and applications,” ACM Trans.
tional network for action segmentation,” in Proc. IEEE Conf. Com- Intell. Syst. Technol., vol. 10, pp. 1–37, 2019.
put. Vis. Pattern Recognit., 2019, pp. 3570–3579. [57] C. Gan, T. Yang, and B. Gong, “Learning attributes equals multi-
[33] J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, source domain generalization,” in Proc. IEEE Conf. Comput. Vis.
and K. Murphy, “What’s cookin’? Interpreting cooking videos Pattern Recognit., 2016, pp. 87–97.
using text, speech and vision,” in Proc. Conf. North Amer. [58] Y. Zhu, Y. Long, Y. Guan, S. Newsam, and L. Shao, “Towards
Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2015, universal representation for unseen action recognition,” in Proc.
pp. 143–152. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9436–9445.
[34] O. Sener, A. R. Zamir, S. Savarese, and A. Saxena, “Unsupervised [59] M. Hahn, A. Silva, and J. M. Rehg, “Action2Vec: A crossmodal
semantic parsing of video collections,” in Proc. IEEE Int. Conf. embedding approach to action learning,” 2019, arXiv:1901.00484.
Comput. Vis., 2015, pp. 4480–4488. [60] B. Brattoli, J. Tighe, F. Zhdanov, P. Perona, and K. Chalupka,
[35] D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Connectionist tempo- “Rethinking zero-shot video classification: End-to-end training
ral modeling for weakly supervised action labeling,” in Proc. for realistic applications,” in Proc. IEEE Conf. Comput. Vis. Pattern
Eur. Conf. Comput. Vis., 2016, pp. 137–153. Recognit., 2020, pp. 4612–4622.
[36] A. Richard, H. Kuehne, and J. Gall, “Weakly supervised action [61] B. Zhang, H. Hu, and F. Sha, “Cross-modal and hierarchical
learning with RNN based fine-to-coarse modeling,” in Proc. IEEE modeling of video and text,” in Proc. Eur. Conf. Comput. Vis.,
Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1273–1282. 2018, pp. 385–401.
[37] C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles, [62] L. Herranz, W. Min, and S. Jiang, “Food recognition and recipe
“D3TW: Discriminative differentiable dynamic time warping for analysis: Integrating visual content, context and external knowl-
weakly supervised action alignment and segmentation,” in Proc. edge,” 2018, arXiv: 1801.07239.
IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3541–3550. [63] W. Min, S. Jiang, S. Wang, J. Sang, and S. Mei, “A delicious recipe
[38] F. Sener and A. Yao, “Unsupervised learning and segmentation analysis framework for exploring multi-modal recipes with vari-
of complex activities from video,” in Proc. IEEE Conf. Comput. ous attributes,” in Proc. 25th ACM Int. Conf. Multimedia, 2017,
Vis. Pattern Recognit., 2018, pp. 8368–8376. pp. 402–410.
[39] A. Kukleva, H. Kuehne, F. Sener, and J. Gall, “Unsupervised learn- [64] M. Carvalho, R. Cadene, D. Picard, L. Soulier, N. Thome, and M.
ing of action classes with continuous temporal embedding,” in Proc. Cord, “Cross-modal retrieval in the cooking context: Learning
IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12058–12066. semantic text-image embeddings,” in Proc. 41st Int. ACM SIGIR
[40] A. Miech, I. Laptev, J. Sivic, H. Wang, L. Torresani, and D. Tran, Conf. Res. Develop. Inf. Retrieval, 2018, pp. 35–44.
“Leveraging the present to anticipate the future in videos,” in [65] K. J. Hammond, “CHEF: A model of case-based planning,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, Proc. AAAI Conf. Artif. Intell., 1986, pp. 267–271.
pp. 2915–2922. [66] R. Dale, “Cooking up referring expressions,” in Proc. Annu. Meeting
[41] A. Furnari and G. M. Farinella, “What would you expect? Antici- Assoc. Comput. Linguistics Hum. Lang. Technol., 1989, pp. 68–75.
pating egocentric actions with rolling-unrolling LSTMs and [67] M. Tenorth, D. Nyga, and M. Beetz, “Understanding and execut-
modality attention,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, ing instructions for everyday manipulation tasks from the world
pp. 6251–6260. wide web,” in Proc. IEEE Int. Conf. Robot. Automat., 2010,
[42] C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual pp. 1486–1491.
representations from unlabeled video,” in Proc. IEEE Conf. Com- [68] C. Kiddon, G. T. Ponnuraj, L. Zettlemoyer, and Y. Choi,
put. Vis. Pattern Recognit., 2016, pp. 98–106. “Mise en place: Unsupervised interpretation of instructional
[43] D. Damen et al., “Rescaling egocentric vision,” 2020, arXiv: recipes,” in Proc. Conf. Empir. Methods Natural Lang. Process.,
2006.13256. 2015, pp. 982–992.
[44] T. Mahmud, M. Hasan, and A. K. Roy-Chowdhury, “Joint pre- [69] J. Jermsurawong and N. Habash, “Predicting the structure of
diction of activity labels and starting times in untrimmed vid- cooking recipes,” in Proc. Conf. Empir. Methods Natural Lang. Pro-
eos,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5784–5793. cess., 2015, pp. 781–786.

Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on June 08,2024 at 14:04:13 UTC from IEEE Xplore. Restrictions apply.
7852 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 6, JUNE 2023

[70] M. Beetz et al., “Robotic roommates making pancakes,” in Proc. [95] C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J.
IEEE 11th Int. Conf. Humanoid Robots, 2011, pp. 529–536. Pineau, “How NOT to evaluate your dialogue system: An empir-
[71] C. Kiddon, L. Zettlemoyer, and Y. Choi, “Globally coherent text ical study of unsupervised evaluation metrics for dialogue
generation with neural checklist models,” in Proc. Conf. Empir. response generation,” in Proc. Conf. Empir. Methods Natural Lang.
Methods Natural Lang. Process., 2016, pp. 329–339. Process., 2016.
[72] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence [96] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei, “Jointly localizing and
learning with neural networks,” in Proc. Int. Conf. Neural Inf. Pro- describing events for dense video captioning,” in Proc. IEEE
cess. Syst., 2014, pp. 3104–3112. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7492–7500.
[73] A. Bosselut, A. Celikyilmaz, X. He, J. Gao, P.-S. Huang, and Y. [97] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutsk-
Choi, “Discourse-aware neural rewards for coherent text gener- ever, “Language models are unsupervised multitask learners,”
ation,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguis- OpenAI Blog, vol. 1, 2019, Art. no. 9.
tics: Hum. Lang. Technol., 2018, pp. 173–184. [98] NLTK, “Natural language toolkit,” 2018. [Online]. Available:
[74] H. H. Lee et al., “RecipeGPT: Generative pre-training based http://www.nltk.org/
cooking recipe generation and evaluation system,” in Proc. Com- [99] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curi-
panion Proc. Web Conf., 2020, pp. 181–184. ous case of neural text degeneration,” 2019, arXiv: 1904.09751.
[75] A. S. Lin et al., “A recipe for creating multimodal aligned data- [100] “Speech understanding systems. Summary of results of the five-
sets for sequential tasks,” 2020, arXiv: 2005.09606. year research effort at Carnegie-Mellon University,” Dept. Com-
[76] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele, put. Sci., Interim Report Carnegie-Mellon Univ., Pittsburgh, PA,
“Translating video content to natural language descriptions,” in USA, Aug. 1977.
Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 433–440. [101] M. Freitag and Y. Al-Onaizan, “Beam search strategies for neural
[77] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. machine translation,” 2017, arXiv: 1702.01806.
Pinkal, “Grounding action descriptions in videos,” Trans. Assoc. [102] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell,
Comput. Linguistics, vol. 1, pp. 25–36, 2013. and K. Saenko, “Sequence to sequence-video to text,” in Proc.
[78] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to- IEEE Int. Conf. Comput. Vis., 2015, pp. 4534–4542.
end dense video captioning with masked transformer,” in Proc. [103] L. Yao et al., “Describing videos by exploiting temporal structure,”
IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8739–8748. in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4507–4515.
[79] R. Kiros et al., “Skip-thought vectors,” in Proc. Int. Conf. Neural [104] A. Vaswani et al., “Attention is all you need,” in Proc. Int. Conf.
Inf. Process. Syst., 2015, pp. 3294–3302. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[80] H. Wang, D. Sahoo, C. Liu, E.-P. Lim, and S. C. H. Hoi, “Learning
cross-modal embeddings with adversarial networks for cooking Fadime Sener received the BSc degree in com-
recipes and food images,” in Proc. IEEE Conf. Comput. Vis. Pattern puter engineering from Hacettepe University, in
Recognit., 2019, pp. 11564–11573. 2011, the MSc degree in computer engineering
[81] B. Zhu and C.-W. Ngo, “CookGAN: Causality based text-to- from Bilkent University, in 2013, and the PhD
image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog- degree from the University of Bonn, in 2021. In
nit., 2020, pp. 5518–5526. 2015, she joined the Visual Computing Group of
[82] H. Wang, G. Lin, S. C. Hoi, and C. Miao, “Decomposed genera- Angela Yao, Institute of Computer Science, Uni-
tion networks with structure prediction for recipe generation versity of Bonn as a doctoral researcher. Her
from food images,” 2020, arXiv: 2007.13374. research focuses on automatic human activity
[83] K. Cho et al., “Learning phrase representations using RNN understanding in videos through vision and
encoder–decoder for statistical machine translation,” in Proc. language.
Conf. Empir. Methods Natural Lang. Process., 2014, pp. 1724–1734.
[84] O. Vinyals and Q. V. Le, “A neural conversational model,” in
Proc. Int. Conf. Mach. Learn. Deep Learn. Workshop, 2015. Rishabh Saraf received the Integrated MTech
[85] F. Hill, K. Cho, and A. Korhonen, “Learning distributed repre- degree in mathematics and computing from the
sentations of sentences from unlabelled data,” in Proc. Conf. Indian Institute of Technology Dhanbad. He is cur-
North Amer. Chapter Assoc. Comput. Linguistics Hum. Lang. Tech- rently working as a data scientist with Rakuten,
nol., 2016, pp. 1367–1377. Inc. In 2020, he was an intern with IBM India Soft-
[86] J. Ba, R. Kiros, and G. E. Hinton, “Layer normalization,” ware Labs. In 2019, he was an intern with the
2016, arXiv:1607.06450. CVML Group of Angela Yao, NUS. In 2018, he
[87] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, was an intern with the University of Manitoba. He
“Supervised learning of universal sentence representations from worked on various projects in computer vision,
natural language inference data,” in Proc. Conf. Empir. Methods NLP, mathematical statistics, entity matching and
Natural Lang. Process., 2017, pp. 670–680. linking for master data management.
[88] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
ognit., 2016, pp. 770–778. Angela Yao received the BASc degree in engi-
[89] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sam- neering science from the University of Toronto, in
pling for sequence prediction with recurrent neural networks,” 2006, and the master’s and PhD degrees from
in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 1171–1179. ETH Zurich, in 2008 and 2012, respectively. Since
[90] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti- 2018, she is an Assistant Professor with the
mization,” 2014, arXiv:1412.6980. School of Computing, National University of Sin-
[91] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus- gapore, where she leads the Computer Vision and
based image description evaluation,” in Proc. IEEE Conf. Comput. Machine Learning Group. Her group’s research
Vis. Pattern Recognit., 2015, pp. 4566–4575. ranges from low-level enhancement to high-level
[92] A. Lopez, “Statistical machine translation,” ACM Comput. Surv., semantic interpretation of images and video.
vol. 40, no. 3, 2008, Art. no. 8.
[93] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method
for automatic evaluation of machine translation,” in Proc. 40th Annu. " For more information on this or any other computing topic,
Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
please visit our Digital Library at www.computer.org/csdl.
[94] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT
evaluation with improved correlation with human judgments,”
in Proc. Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl.
Summarization, 2005, pp. 65–72.

Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on June 08,2024 at 14:04:13 UTC from IEEE Xplore. Restrictions apply.

You might also like