This paper considers the problem of fine-grained image recognition with a growing vocabulary. Since in many real-world applications we often have to add a new object category or visual concept with just a few images to learn from, it is crucial to develop a method that is able to generalize the recognition model from existing classes to new classes. Deep convolutional neural networks are capable of constructing powerful image representations; however, these networks usually rely on a logistic loss function that cannot handle the incremental learning problem. In this paper, we present a new method that can efficiently learn a new class given only a limited number of training examples, which we evaluate on the problems of food and clothing recognition. To illustrate the performance of our proposed method on the task of recognizing different kinds of food: when using only 1.3% of the training examples per category, we achieved about 73% of the performance (as measured by F1-score) obtained when using all available training data.
Sarcasm is a peculiar form of sentiment expression, where the surface sentiment differs from the implied sentiment. The detection of sarcasm in social media platforms has been applied in the past mainly to textual utterances where lexical indicators (such as interjections and intensifiers), linguistic markers, and contextual information (such as user profiles or past conversations) were used to detect the sarcastic tone. However, modern social media platforms allow users to create multimodal messages where audiovisual content is integrated with the text, making the analysis of a single mode in isolation partial. In our work, we first study the relationship between the textual and visual aspects in multimodal posts from three major social media platforms, i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to quantify the extent to which images are perceived as necessary by human annotators. Moreover, we propose two different computational frameworks to detect sarcasm that integrate the textual and visual modalities. The first approach exploits visual semantics trained on an external dataset, and concatenates the semantic features with state-of-the-art textual features. The second method adapts a visual neural network initialized with parameters trained on ImageNet to multimodal sarcastic posts. Results show the positive effect of combining modalities for the detection of sarcasm across platforms and methods.
Emojis are an extremely common occurrence in mobile communications, but their meaning is open to interpretation. We investigate motivations for their usage in mobile messaging in the US. This study asked 228 participants about the last time that they used one or more emojis in a conversational message, and collected that message, along with a description of the emojis' intended meaning and function. We discuss functional distinctions between: adding emotional or situational meaning, adjusting tone, making a message more engaging to the recipient, conversation management, and relationship maintenance. We discuss lexical placement within messages, as well as social practices. We show that the social and linguistic functions of emojis are complex and varied, and that supporting emojis can facilitate important conversational functions.
User interfaces for web image search engine results differ significantly from interfaces for traditional (text) web search results, supporting a richer interaction. In particular, users can see an enlarged image preview by hovering over a result image, and an 'image preview' page allows users to browse further enlarged versions of the results, and to click through to the referral page where the image is embedded. No existing work investigates the utility of these interactions as implicit relevance feedback for improving search ranking, beyond using clicks on images displayed in the search results page. In this paper we propose a number of implicit relevance feedback features based on these additional interactions: hover-through rate, 'converted-hover' rate, referral page click-through, and a number of dwell time features. Also, since images are never self-contained, but always embedded in a referral page, we posit that clicks on other images that are embedded on the same referral webpage as a given image can carry useful relevance information about that image. We also posit that query-independent versions of implicit feedback features, while not expected to capture topical relevance, will carry feedback about the quality or attractiveness of images, an important dimension of relevance for web image search. In an extensive set of ranking experiments in a learning-to-rank framework, using a large annotated corpus, the proposed features give statistically significant gains of over 2% compared to a state-of-the-art baseline that uses standard click features.
Automatic detection of interesting moments in video has many real-world applications such as video summarization and efficient online video browsing. In this paper, we present a lightweight and scalable solution to this problem based on user mouse activity while watching video. Unlike previous approaches that analyze video content to infer interestingness, we leverage the implicit user feedback obtained from thousands of online video watching sessions. This makes our method computationally efficient and scalable to billions of videos. Most importantly, our approach can handle a variety of video genres because we make no assumption about what constitutes interestingness: we let the crowd tell us through their mouse activity. By analyzing 106,212 user sessions collected from a popular online video website, we show that mouse activity is highly indicative of interestingness, and that our approach performs competitively with several state-of-the-art methods.
Animated GIFs have been around since 1987 and recently gained more popularity on social networking sites. Tumblr, a large social networking and microblogging platform, is a popular venue to share animated GIFs. Tumblr users follow blogs, generating a feed of posts, and choose to "like" or to "reblog" favored posts. In this paper, we use these actions as signals to analyze the engagement of over 3.9 million posts, and conclude that animated GIFs are significantly more engaging than other kinds of media. We follow this finding with deeper visual analysis of nearly 100k animated GIFs and pair our results with interviews with 13 Tumblr users to find out what makes animated GIFs engaging. We found that the animation, lack of sound, immediacy of consumption, low bandwidth and minimal time demands, the storytelling capabilities and utility for expressing emotions were significant factors in making GIFs the most engaging content on Tumblr. We also found that engaging GIFs contained faces and had higher motion energy, uniformity, resolution and frame rate. Our findings connect to media theories and have implications for the design of effective content dashboards, video summarization tools and ranking algorithms to enhance engagement.
The New Yorker publishes a weekly captionless cartoon. More than 5,000 readers submit captions for it. The editors select three of them and ask the readers to pick the funniest one. We describe an experiment that compares a dozen automatic methods for selecting the funniest caption. We show that negative sentiment, human-centeredness, and lexical centrality most strongly match the funniest captions, followed by positive sentiment. These results are useful for understanding humor and also in the design of more engaging conversational agents in text and multimodal (vision+text) systems. As part of this work, a large set of cartoons and captions is being made available to the community.
This article presents how semantic web technologies have been applied to enrich existing content within the SEMUSICI project. The SEMUSICI project has the goal of researching how semantic web technologies can be applied to digital libraries, and how this can improve searchability and accessibility. The project takes the results from the eContent project HARMOS, which defined a musical taxonomy
... Paloma de Juan and Carlos . ... This taxonomy aims to cover the whole spectrum of music practice and teaching, focusing on pedagogical aspects, such as technique (general or specific of an instrument), mechanics, musicology, musical elements (rhythm, melody, harmony, form ...
Recommender systems are based mainly on collaborative filtering algorithms, which only use the ratings given by the users to the products. When context is taken into account, there might be difficulties when it comes to making recommendations to users who are placed in a context other than the usual one, since their preferences will not correlate with the preferences of those in the new context. In this paper, a hybrid collaborative filtering model is proposed, which provides recommendations based on the context of travelling users. A combination of a user-based collaborative filtering method and a semantic-based one has been used. Contextual recommendation may be applied in multiple social networks that are spreading worldwide. The resulting system has been tested on 11870.com, a good example of a social network where context is a primary concern.
Communications in Computer and Information Science, 2009
This paper details a Dublin Core Application Profile defined for cataloguing musical resources described within the European eContentPlus project Variazioni. The metadata model is based on FRBR and has been formalised with DC-Text and implemented in an available web portal where users and music institutions can catalogue their musical assets in a collaborative way.
Papers by Paloma de Juan