Mmm2014 ML DX Ak
1 Introduction
Researchers in multimedia information systems, visual information retrieval and
information retrieval in general have lately put more and more emphasis on
research regarding users’ context. A common definition of context is: “Context is
any information that can be used to characterize the situation of an entity. An
entity is a person, place, or object that is considered relevant to the interaction
between a user and an application, including the user and applications
themselves.” [1] A user’s intention, defined as “a thing intended; an aim or plan”¹,
is therefore part of a user’s context.
In multimedia information systems user intentions are manifold. In search
scenarios users might want to find multimedia data to gain knowledge, or to
entertain themselves. In publishing scenarios users might intend to communicate
ideas or share feelings with others. To learn about intentions, the users have to
answer why they want to search, share, or store a video or image. Figure 1
shows two images. Image (A) was taken to preserve a bad feeling. The
photographer noted: “because i [sic] was feeling sad at that time and everything
seems as sharp and hard to me as the endins [sic] of this plant.” Image (B), on
the other hand, was taken for a functional reason. The photographer claimed:
“I’m origami folder, and I took this photo to archive my work and share it with
other origami folders.”
¹ Oxford Dictionaries, http://oxforddictionaries.com/
Fig. 1. Two sample photos from our test data set taken with different intentions.
To this end we have created a data set in which 1,309 images shared on the
internet have been annotated by their owners to indicate why the images were
taken. The images were randomly selected from the Flickr web site, and their
publishers were contacted and asked to take part in a survey. An additional
crowdsourced verification step was carried out with the help of Amazon
Mechanical Turk (AMT). The data set is publicly available for scientific use
under a Creative Commons Attribution License². Note that this is not a test
data set in the common sense of multimedia retrieval: no queries or topics are
given for the data set, nor can it be considered a ground truth. Its value lies
in (i) being a first data source for research on user intentions in multimedia,
and (ii) providing a common basis for different research groups due to its open
nature. In this respect it is comparable to the infamous AOL search log data,
which also came without topics but was appreciated by the research community
for making data available, though not for being released without asking the
users. In our case, however, we asked the photographers for permission to
release the data.
This paper describes the data set starting with a short overview on related
work and research on user intentions. Then the acquisition process is outlined
and basic statistics and information on the data set are given. We conclude the
paper with a discussion on the impact of the data set and give an outlook on
future work.
2 Related Work
Data sets for multimedia retrieval and computer vision have quite a long history,
as it is commonly agreed that building on each other’s research results can only
work if methods and data are made available. A discussion of the Corel data
set, which was often employed in subsets, and of its implications is presented
in [2]. Today, a well-known and well-received data set is, for instance, the
MIR-Flickr [3] data set. Since 2010 it has provided 1,000,000 images from Flickr
along with metadata including tags, title, license and EXIF. Other examples, to
pick just a few out of many, are the Caltech-256 Object Category Data set [4],
which consists of more than 30,000 images in 256 categories, and the PASCAL
data set, which was developed for the PASCAL Visual Object Classes (VOC)
Challenge [5].
² Note that the URL is not given due to the double blind review process.
The problem of capturing the intention of multimedia information system
users is diverse, so different approaches have been tried. A preliminary survey on
the creation of videos has been presented in [6]. Similarly, [7], [8] and [9]
investigate the intentions people have for capturing photos with phone cameras,
and [10] investigates intentions for capturing photos independently of the
camera used. Intentions for watching online videos have been investigated in
[11]. As part of a survey on user (sub-)groups in multimedia information
systems, the goal-directedness of users is investigated in [12]. User intentions
for searching images are discussed in [13], where a taxonomy of user intentions
for image retrieval is also presented. A taxonomy of intention classes for
online video search is discussed in [14]. An application of the research on user
intentions for image search is discussed in [15], where the result view of
Flickr is adapted to the automatically detected search intention class.
4 Data Set
The resulting data set consists of 1,309 samples. Each sample contains
information about the image collected from three main sources: (i) data taken
from Flickr’s API, including EXIF metadata, (ii) data added by the photographer
in the course of the survey, and (iii) data added by the turkers in the HITs.
An example of the ratings of an instance from the data set is given in Table 1.
The ratings were given to the image shown in Figure 2.
Fig. 2. Sample image from the data set. The photographer described the intention for
taking the photo as “a reminder of the beautiful Island were [sic] my father came from”.
Table 1. Example of a data item from the data set giving the rating on the image
shown in Figure 2. A value of -2 corresponds to strongly disagree on the Likert scale,
while a value of 2 denotes strongly agree.

                          Photogr.  Turkers
Recall situation:            2       2,  0,  1,  0,  2
Preserve good feeling:       2      -2,  1,  0,  0,  1
Publish online:              2       0,  0,  0,  1,  2
Show to family & friends:    2       1,  2,  1,  1,  0
Support task of mine:        0      -2,  1,  1,  1, -2
Preserve bad feeling:       -2       0, -2,  0, -2, -2
Degree of manipulation:      -      -1, -2, -1,  0, -2
Readability:                 -      -2,  2,  0,  0,  2
Infer intention:             -       0,  2,  1,  0,  0
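
A single data item of this kind can be pictured as a small record type. The following Python sketch is only an illustration: the class name `IntentionSample`, its field names, and the use of `None` for statements the photographer did not rate are assumptions and do not describe the actual file layout of the published data set. The values are those of Table 1.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class IntentionSample:
    """Hypothetical container for one data item (all names are illustrative)."""

    # Photographer's Likert rating per statement (-2 .. 2);
    # None where Table 1 shows "-" (no photographer rating).
    photographer: Dict[str, Optional[int]] = field(default_factory=dict)
    # The five turker ratings per statement, in the order given in Table 1.
    turkers: Dict[str, List[int]] = field(default_factory=dict)

    def turker_mean(self, statement: str) -> float:
        """Average of the turker ratings for one statement."""
        return float(mean(self.turkers[statement]))


sample = IntentionSample(
    photographer={
        "Recall situation": 2,
        "Preserve good feeling": 2,
        "Publish online": 2,
        "Show to family & friends": 2,
        "Support task of mine": 0,
        "Preserve bad feeling": -2,
        "Degree of manipulation": None,
        "Readability": None,
        "Infer intention": None,
    },
    turkers={
        "Recall situation": [2, 0, 1, 0, 2],
        "Preserve good feeling": [-2, 1, 0, 0, 1],
        "Publish online": [0, 0, 0, 1, 2],
        "Show to family & friends": [1, 2, 1, 1, 0],
        "Support task of mine": [-2, 1, 1, 1, -2],
        "Preserve bad feeling": [0, -2, 0, -2, -2],
        "Degree of manipulation": [-1, -2, -1, 0, -2],
        "Readability": [-2, 2, 0, 0, 2],
        "Infer intention": [0, 2, 1, 0, 0],
    },
)

print(sample.turker_mean("Recall situation"))
```

Keeping the photographer's single rating and the five turker ratings side by side per statement makes it straightforward to compare the owner's stated intention with what outside viewers inferred.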
Using the IP addresses of the survey participants and turkers logged by our
web server, we were able to assign locations to both groups and thereby get a
rough idea of their countries of origin. The survey participants, i.e. the
actual photographers, are spread over 95 different countries. Around 38% of the
participants were from English-speaking countries such as the USA, the UK and
Australia.
In contrast to the widespread distribution of photographers from all over the
world, the majority of turkers (the people doing the validation on AMT) were
from India, with only a small percentage from other countries. Figure 3 gives an
overview of the absolute number of participants from the six top countries
(top) and of the turkers’ locations (bottom). A trend towards an increasing
share of Indian turkers on AMT was already noticed by Ross et al. [16] in 2009.
They observed that the share of Indian workers went from 5% in November 2008 to
36% in November 2009, so the distribution of turkers in our survey is not too
unusual.
A first and pressing question was to what degree the employed intention
classes were redundant. We therefore investigated whether the six classes were
pairwise correlated. Table 2 shows the correlation matrix. The most interesting
correlations are found among the intentions show to family and friends,
recall situation and preserve good feeling; the highest correlation, with a
value of 0.45, is between preserve good feeling and recall situation. The
remaining correlation coefficients are too small to speak of a meaningful
correlation. Overall, the values indicate that, for the given data set, the six
intention classes are not pairwise redundant and therefore none of them can be
removed.
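
A pairwise redundancy check of this kind can be sketched in a few lines of Python. The text does not state which correlation coefficient was used, so the Pearson coefficient below is an assumption, as are the function names; the toy ratings are made up for illustration and are not taken from the data set.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long rating lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(ratings: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Correlate every intention class with every other one."""
    names = list(ratings)
    return {a: {b: pearson(ratings[a], ratings[b]) for b in names} for a in names}


# Made-up Likert ratings for four photos (NOT from the data set).
toy = {
    "recall situation": [2, 1, -1, 0],
    "preserve good feeling": [2, 0, -2, 1],
    "preserve bad feeling": [-2, -1, 1, 0],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Two classes would be candidates for merging only if their coefficient were close to 1 or -1 across the whole data set; moderate values such as the reported 0.45 do not justify dropping a class.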
With each instance validated on AMT by five turkers, the obvious question is
whether the turkers agree. To quantify inter-rater agreement we chose
Krippendorff’s alpha (α) [17], specifically the R implementation provided by the
irr package.
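
To make the statistic concrete, the following Python sketch computes Krippendorff's α for the simplest case: no missing ratings and the interval difference function (a - b)². It is not the irr implementation used for the paper, and both the choice of the interval metric and the arrangement of the input (one list of turker ratings per rated statement of a single image) are assumptions, since the text does not specify them. The input values are the six intention rows of Table 1.

```python
from itertools import permutations
from typing import Sequence


def krippendorff_alpha_interval(units: Sequence[Sequence[float]]) -> float:
    """Krippendorff's alpha with the interval metric (a - b) ** 2.

    `units` holds one sequence of ratings per rated unit. Units with fewer
    than two ratings cannot be paired and are skipped.
    """
    pairable = [list(u) for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences between ratings of the same
    # unit, each unit weighted by 1 / (number of its ratings - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences over all pairs of ratings,
    # regardless of the unit they belong to.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e


# Turker ratings for the six intention statements of Table 1.
table1_intentions = [
    [2, 0, 1, 0, 2],     # Recall situation
    [-2, 1, 0, 0, 1],    # Preserve good feeling
    [0, 0, 0, 1, 2],     # Publish online
    [1, 2, 1, 1, 0],     # Show to family & friends
    [-2, 1, 1, 1, -2],   # Support task of mine
    [0, -2, 0, -2, -2],  # Preserve bad feeling
]
alpha = krippendorff_alpha_interval(table1_intentions)
print(round(alpha, 4))
```

A value of 1 means perfect agreement, values around 0 mean agreement no better than chance, and negative values indicate systematic disagreement.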
Photographers, six top countries (absolute numbers):
United States    248
United Kingdom   225
Italy             78
Germany           68
France            66
Australia         65

Turkers (share of workers):
India           86.6%
United States    4.8%
Canada           3.1%
Pakistan         1.6%
Other            3.8%

Fig. 3. Locations of the survey participants (photographers, top graph) and the turkers
(workers on AMT in the validation step, bottom graph).
Table 3. Descriptive statistics for α indicating the inter-rater agreement for the six
intention classes compared to the three other classes.

            Intentions    Other
mean          0.1467      0.2321
variance      0.0316      0.0693
minimum      -0.2361     -0.2291
maximum       0.7096      0.9437

            Intentions    Other
mean          0.5707      0.5104
variance      0.0157      0.1337
minimum       0.4330     -0.0499
maximum       0.7710      0.8571
Fig. 4. Histograms of α (x-axis) against frequency (y-axis), indicating the inter-rater
agreement for the six intention classes compared to the three other classes.
Fig. 5. Plot of the ranked values for α, with the rank on the x-axis and α on the y-axis,
indicating the inter-rater agreement for the six intention classes compared to the three
other classes.
5 Conclusions
In this paper we have presented a data set of 1,309 photos. These photos were
collected from Flickr, and the photographers participated in a survey that tried
to find out why the images were taken. To the best of our knowledge, the data
set is to date the only openly available data set⁴ dealing with user intentions
in multimedia. We consider this one of the first steps towards joint research
on user intentions in multimedia information systems on a common basis. Besides
providing anecdotal evidence on actual intentions for taking photos, the data
lends itself to text mining and pattern analysis, which might lead to insights
into why people actually take photos and put them online. Ultimately this
understanding will help in providing better tools and algorithms for multimedia
search, retrieval, distribution, storage and communication.
While the data set is a valuable tool for improving the understanding of user
intentions in creating digital photos, it has several shortcomings. First of
all, the data set only includes photos that have already been shared and are
available to the public. Hence, the data set is biased towards a sharing
intention. Also, the actual intention is hard to determine, even for the
original photographer or uploader of the image. This additional step of
abstraction is something users do not appreciate: in face-to-face interviews we
often heard the answer “I don’t know”. We assume that those who were not
willing to formulate their explicit intention for taking the photo either
aborted the study or did not even start it. Still, there are multiple answers
that do not define the intention but explain the content of the image.
Furthermore, the data set is rather noisy. Instances with rich information
are mixed with instances that are most likely fakes or random answers.
⁴ Note that the URL is not given due to the double blind review process.
In the near future we want to investigate the data set in full detail. First steps
towards using the data set to infer photographers’ intentions have shown
promising results. Manual selection of a subset with richly annotated instances
is a further, crucial next step.
References
1. Abowd, G., Dey, A., Brown, P., Davies, N., Smith, M., Steggles, P.: Towards a
better understanding of context and context-awareness. In Gellersen, H.W., ed.:
Handheld and Ubiquitous Computing. Volume 1707 of Lecture Notes in Computer
Science. Springer Berlin Heidelberg (1999) 304–307
2. Müller, H., Marchand-Maillet, S., Pun, T.: The truth about Corel - evaluation in
image retrieval. In Lew, M., Sebe, N., Eakins, J., eds.: Image and Video Retrieval.
Volume 2383 of Lecture Notes in Computer Science. Springer Berlin Heidelberg
(2002) 38–49
3. Huiskes, M.J., Lew, M.S.: The MIR Flickr retrieval evaluation. In: MIR ’08:
Proceedings of the 2008 ACM International Conference on Multimedia Information
Retrieval, New York, NY, USA, ACM (2008)
4. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical
report, California Institute of Technology (2007)
5. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL
Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88
(2010) 303–338
6. Lux, M., Huber, J.: Why did you record this video? An exploratory study on user
intentions for video production. In: Image Analysis for Multimedia Interactive
Services (WIAMIS), 2012 13th International Workshop on. (May 2012) 1–4
7. Kindberg, T., Spasojevic, M., Fleck, R., Sellen, A.: The ubiquitous camera: An
in-depth study of camera phone use. Pervasive Computing, IEEE 4(2) (2005)
42–50
8. Mäkelä, A., Giller, V., Tscheligi, M., Sefelin, R.: Joking, storytelling, artsharing,
expressing affection: a field trial of how children and their social network
communicate with digital images in leisure time. In: Proceedings of the SIGCHI
conference on Human factors in computing systems, ACM (2000) 548–555
9. Van House, N., Davis, M., Ames, M., Finn, M., Viswanathan, V.: The uses of
personal networked digital imaging: an empirical study of cameraphone photos
and sharing. In: CHI’05 Extended Abstracts on Human Factors in Computing
Systems, ACM (2005) 1853–1856
10. Lux, M., Kogler, M., del Fabro, M.: Why did you take this photo: a study on
user intentions in digital photo productions. In: Proceedings of the 2010 ACM
workshop on Social, adaptive and personalized multimedia interaction and access.
SAPMIA ’10, New York, NY, USA, ACM (2010) 41–44
11. Lagger, C., Lux, M., Marques, O.: Which video do you want to watch now?
Development of a prototypical intention-based interface for video retrieval.
Multimedia on the Web, Workshop on 0 (2011) 45–48
12. Kemman, M., Kleppe, M., Beunders, H.: Who are the users of a video search
system? Classifying a heterogeneous group with a profile matrix. In: Image Analysis
for Multimedia Interactive Services (WIAMIS), 2012 13th International Workshop
on. (May 2012) 1–4
13. Lux, M., Kofler, C., Marques, O.: A classification scheme for user intentions in
image search. In: CHI ’10 Extended Abstracts on Human Factors in Computing
Systems. CHI EA ’10, New York, NY, USA, ACM (2010) 3913–3918
14. Hanjalic, A., Kofler, C., Larson, M.: Intent and its discontents: The user at the
wheel of the online video search engine. In: Proceedings of the ACM international
conference on Multimedia 2012, Nara, JP (Nov 2012)
15. Kofler, C., Lux, M.: Dynamic presentation adaptation based on user intent
classification. In: Proceedings of the 17th ACM international conference on
Multimedia. MM ’09, New York, NY, USA, ACM (2009) 1117–1118
16. Ross, J., Irani, L., Silberman, M.S., Zaldivar, A., Tomlinson, B.: Who are the
crowdworkers?: shifting demographics in Mechanical Turk. In: Proceedings of the
28th international conference extended abstracts on Human factors in
computing systems. CHI EA ’10, New York, NY, USA, ACM (2010) 2863–2872
17. Krippendorff, K.: Computing Krippendorff’s alpha reliability. Departmental Papers
(ASC) (2007) 43