
Methods and Techniques for Segmentation of Consumers in Social Media
PhD Thesis

Óscar Muñoz García (MSc Artificial Intelligence)


Departamento de Inteligencia Artificial
ETS de Ingenieros Informáticos

Supervisors
Asunción Gómez Pérez (PhD Computer Science, MBA)
Raúl García Castro (PhD Computer Science and Artificial Intelligence)

2015
Committee appointed by His Magnificence the Rector of the Universidad
Politécnica de Madrid, on the ____ day of ________ of ____.

President:

Member:

Member:

Member:

Secretary:

Substitute:

Substitute:

The defence and reading of the Thesis took place on the ____ day of
________ of ____ at the Escuela Técnica Superior de Ingenieros Informáticos.

Grade:

THE PRESIDENT          THE MEMBERS

THE SECRETARY

To Mari. Thank you for your understanding during all the time I have
devoted to this thesis.
To my parents. Thank you for all your support and encouragement, without
which I would not have made it this far.
To my daughter Lucía.
Acknowledgements

This thesis represents the final stage of a long period of my life that I
would never have completed without the help of many people, whom I thank
for their inestimable support, which is worth its weight in gold.
First of all, I want to acknowledge all the co-authors of the research
works that have contributed to the contents included in this thesis:
Silvia, Inés, Nuria, March, Beatriz, Gloria, Javier, Daniel, Jesús,
David, Guadalupe, Auxi, Socorro, Elena, Víctor, and Carlos. This
thesis would not have been possible without their hard work.
Havas Media Group deserves special recognition. I want to acknowledge
my colleagues there for all their lessons about marketing and
advertising. I could not imagine my professional career from now
on without their support and training. In particular, I have no words to
express my gratitude to Gloria.
I also want to acknowledge the Spanish Centre for the Development
of Industrial Technology that has partially supported this research
under the CENIT program in the context of the Social Media Project
(CEN-20101037). Thanks a lot to all the partners in this project.
Finally, I want to acknowledge my supervisors, Asun and Raúl, for
their guidance, reviews and patience, during and before the writing
of this thesis. I hope I have lived up to their expectations.
Abstract

Social media has revolutionised the way in which consumers relate to
each other and to brands. Opinions published in social media influence
purchase decisions as strongly as advertising campaigns do. Consequently,
marketers are increasing their efforts and investments to obtain
indicators that measure brand health from the digital content generated
by consumers.
Given the unstructured nature of social media content, the technology
used for processing it often implements Artificial Intelligence
techniques, such as natural language processing, machine learning and
semantic analysis algorithms.
This thesis contributes to the State of the Art with a model for
structuring and integrating the information posted on social media,
and a number of techniques whose objectives are the identification
of consumers and their socio-demographic and psychographic
segmentation. The consumer identification technique is based on the
fingerprint of the devices that consumers use to surf the Web, and it
tolerates the frequent changes that such fingerprints undergo. The
psychographic profiling techniques infer the position of consumers in
the purchase funnel and classify opinions according to a series of
marketing attributes. Finally, the socio-demographic profiling
techniques obtain the place of residence and gender of consumers.
Resumen

Los medios sociales han revolucionado la manera en la que los
consumidores se relacionan entre sí y con las marcas. Las opiniones
publicadas en dichos medios tienen un poder de influencia en las decisiones
de compra tan importante como las campañas de publicidad. En
consecuencia, los profesionales del marketing cada vez dedican mayores
esfuerzos e inversión a la obtención de indicadores que permitan medir
el estado de salud de las marcas a partir de los contenidos digitales
generados por sus consumidores.
Dada la naturaleza no estructurada de los contenidos publicados en
los medios sociales, la tecnología usada para procesar dichos
contenidos a menudo implementa técnicas de Inteligencia Artificial, tales
como algoritmos de procesamiento de lenguaje natural, aprendizaje
automático y análisis semántico.
Esta tesis contribuye al estado de la cuestión con un modelo que
permite estructurar e integrar la información publicada en medios
sociales, y una serie de técnicas cuyos objetivos son la identificación
de consumidores, así como la segmentación psicográfica y
sociodemográfica de los mismos. La técnica de identificación de
consumidores se basa en la huella digital de los dispositivos que utilizan para
navegar por la Web y es tolerante a los cambios que se producen
con frecuencia en dicha huella digital. Las técnicas de segmentación
psicográfica descritas obtienen la posición en el embudo de compra
de los consumidores y permiten clasificar las opiniones en función de
una serie de atributos de marketing. Finalmente, las técnicas de
segmentación sociodemográfica permiten obtener el lugar de residencia y
el género de los consumidores.
Contents

1 INTRODUCTION
1.1 Thesis Structure
1.2 Dissemination of Results

2 STATE OF THE ART
2.1 Semantic Vocabularies for Representing Social Media Information
2.1.1 Conclusions
2.2 Techniques for Tracking Users in the Web
2.2.1 Techniques for Capturing Web Activity
2.2.1.1 Technique Based on Web Logs
2.2.1.2 Technique Based on Web Beacons
2.2.1.3 Technique Based on JavaScript Tags
2.2.1.4 Technique Based on Packet Sniffing
2.2.2 Techniques for Identifying Unique Users
2.2.2.1 Technique Based on Cookies
2.2.2.2 Technique Based on Fingerprint
2.2.3 Conclusions
2.3 Technique for Detecting the Evolution of Temporary Records
2.3.1 Early Binding Algorithm [Li et al., 2011]
2.3.2 Late Binding Algorithm [Li et al., 2011]
2.3.3 Adjusted Binding Algorithm [Li et al., 2011]
2.3.4 Conclusions
2.4 Social Media Analysis Applied to Market Research
2.4.1 KPIs Based on Social Media Analysis
2.4.2 Conclusions
2.5 Marketing Background
2.5.1 The Consumer Decision Journey
2.5.2 The Marketing Mix
2.5.3 Research on Human Emotions
2.5.4 Owned, Paid and Earned Media
2.5.5 Marketing Technology
2.5.6 Conclusions
2.6 Analysis of Social Media Content
2.6.1 Lemmatisation and Part-Of-Speech Tagging
2.6.2 Normalisation of Microposts
2.6.3 Sentiment Analysis
2.6.4 Identification of Wishes
2.6.5 Detection of Place of Residence
2.6.6 Detection of Gender
2.6.7 Conclusions
2.7 Open Research Problems

3 APPROACH
3.1 Objectives
3.2 Contributions to the State of the Art
3.3 Assumptions
3.4 Hypotheses
3.5 Restrictions

4 RESEARCH METHODOLOGY
4.1 Terminology
4.2 Research Methodology
4.3 Method Followed for Obtaining the Artefacts Provided by this Thesis
4.3.1 Method Followed for Ontology Engineering
4.3.2 Method Followed for the Data Mining Techniques
4.3.2.1 Business Understanding
4.3.2.2 Data Understanding
4.3.2.3 Data Preparation
4.3.2.4 Modelling
4.3.2.5 Evaluation
4.3.2.6 Deployment

5 SOCIAL MEDIA ONTOLOGY FOR CONSUMER ANALYTICS
5.1 Ontology Modules
5.2 Notation Used
5.3 Core Ontology Module
5.4 Publication Channels Module
5.5 Contents Module
5.6 Users Module
5.7 Opinions Module
5.8 Topics and Keywords Module
5.9 Geographical Locations Module

6 MORPHOSYNTACTIC CHARACTERISATION OF SOCIAL MEDIA CONTENTS
6.1 Types of Social Media Analysed
6.2 Distribution of Part-of-Speech Categories
6.2.1 Distribution of Nouns
6.2.2 Distribution of Adjectives
6.2.3 Distribution of Adverbs
6.2.4 Distribution of Determiners
6.2.5 Distribution of Conjunctions
6.2.6 Distribution of Pronouns
6.2.7 Distribution of Prepositions
6.2.8 Distribution of Punctuation Marks
6.2.9 Distribution of Verbs
6.3 Hypothesis Validation

7 TECHNIQUE FOR UNIQUE USER IDENTIFICATION BASED ON EVOLVING DEVICE FINGERPRINT DETECTION
7.1 Data Understanding Activity
7.1.1 Collect Initial Data Task
7.1.2 Describe Data Task
7.1.3 Explore Data Task
7.1.4 Verify Data Quality Task
7.2 Data Preparation Activity
7.2.1 Select Data Task
7.2.2 Clean Data Task
7.2.3 Construct Data Task
7.3 Modelling Activity
7.3.1 Select Modelling Technique Task
7.3.1.1 Cluster Signature
7.3.1.2 Similarity Computation
7.3.1.3 Attribute Weight Computation
7.3.2 Generate Test Design Task
7.3.3 Build Model Task
7.3.3.1 X-Real-IP Header
7.3.3.2 X-Forwarded-For Header
7.3.3.3 User-Agent Header
7.3.3.4 Accept Header
7.3.3.5 Accept-Language Header
7.3.3.6 Accept-Charset Header
7.3.3.7 Accept-Encoding Header
7.3.3.8 Cache-Control Header
7.3.3.9 Plugins
7.3.3.10 Fonts
7.3.3.11 Video
7.3.3.12 Time zone
7.3.3.13 Session Storage
7.3.3.14 Local Storage
7.3.3.15 Internet Explorer Persistence
7.4 Evaluation
7.4.1 Evaluation Metrics
7.4.1.1 Rand Index
7.4.1.2 Error Rate
7.4.1.3 Recall
7.4.1.4 Specificity
7.4.1.5 False Positive Rate
7.4.1.6 False Negative Rate
7.4.1.7 Precision
7.4.1.8 F-measure
7.4.1.9 Purity
7.4.2 Evaluation Results
7.4.2.1 Variant Based on Uniform Weights
7.4.2.2 Variant Based on Attribute Entropy
7.4.2.3 Variant Based on Time Decay
7.4.2.4 Variant Based on Attribute Entropy and Time Decay
7.4.2.5 Comparison of the Variants
7.5 Hypothesis Validation

8 TECHNIQUES FOR SEGMENTATION OF CONSUMERS FROM SOCIAL MEDIA CONTENT
8.1 Common Elements Used by the Techniques
8.1.1 Collect Initial Data Task
8.1.2 Data Preparation Activity
8.1.2.1 Select Data Task
8.1.2.2 Clean Data Task
8.1.2.3 Construct Data Task
8.1.3 Rule-based Modelling Technique
8.2 Technique for Detecting Consumer Decision Journey Stages
8.2.1 Data Understanding Activity
8.2.1.1 Collect Initial Data Task
8.2.1.2 Describe Data Task
8.2.1.3 Explore Data Task
8.2.1.4 Verify Data Quality Task
8.2.2 Modelling Activity
8.2.2.1 Select Modelling Technique Task
8.2.2.2 Build Model Task
8.3 Technique for Detecting Marketing Mix Attributes
8.3.1 Data Understanding Activity
8.3.1.1 Collect Initial Data Task
8.3.1.2 Describe Data Task
8.3.1.3 Explore Data Task
8.3.1.4 Verify Data Quality Task
8.3.2 Modelling Activity
8.3.2.1 Select Modelling Technique Task
8.3.2.2 Build Model Task
8.4 Technique for Detecting Emotions
8.4.1 Data Understanding Activity
8.4.1.1 Collect Initial Data Task
8.4.1.2 Describe Data Task
8.4.1.3 Explore Data Task
8.4.1.4 Verify Data Quality Task
8.4.2 Modelling Activity
8.4.2.1 Select Modelling Technique Task
8.4.2.2 Generate Test Design Task
8.4.2.3 Build Model Task
8.5 Technique for Detecting Place of Residence
8.5.1 Data Understanding Activity
8.5.1.1 Collect Initial Data Task
8.5.1.2 Describe Data Task
8.5.1.3 Explore Data Task
8.5.2 Data Preparation Activity
8.5.3 Modelling Activity
8.5.3.1 Select Modelling Technique Task
8.5.3.2 Generate Test Design Task

8.6 Technique for Detecting Gender
8.6.1 Data Understanding Activity
8.6.1.1 Collect Initial Data Task
8.6.1.2 Describe Data Task
8.6.1.3 Explore Data Task
8.6.2 Data Preparation Activity
8.6.3 Modelling Activity
8.6.3.1 Select Modelling Technique Task
8.6.3.2 Generate Test Design Task
8.7 Evaluation
8.7.1 Evaluation Metrics
8.7.1.1 Accuracy
8.7.1.2 Recall
8.7.1.3 Precision
8.7.1.4 F-measure
8.7.2 Evaluation Results
8.7.2.1 Technique for Detecting Consumer Decision Journey Stages
8.7.2.2 Technique for Detecting Marketing Mix Attributes
8.7.2.3 Technique for Detecting Emotions
8.7.2.4 Technique for Detecting Place of Residence
8.7.2.5 Technique for Detecting Gender
8.8 Validation of Hypotheses

9 CONCLUSIONS AND FUTURE WORK
9.1 Social Media Data Model for Consumer Analytics
9.2 Morphosyntactic Characterisation of Social Media Contents
9.3 Technique for Unique User Identification Based on Evolving Device Fingerprint
9.4 Techniques for Segmentation of Consumers from Social Media Content
9.4.1 Technique for Detecting Consumer Decision Journey Stages
9.4.2 Technique for Detecting Marketing Mix Attributes
9.4.3 Technique for Detecting Emotions
9.4.4 Technique for Identifying the Place of Residence of Social Media Users
9.4.5 Technique for Identifying the Gender of Social Media Users
9.4.6 Normalisation of User-Generated Content
9.4.7 Evaluation of Scalability

List of Figures

2.1 Process followed by the technique based on web logs (adapted from Kaushik [2007])
2.2 Process followed by the technique based on web beacons (adapted from Kaushik [2007])
2.3 Process followed by the technique based on JavaScript tags (adapted from Kaushik [2007])
2.4 Process followed by the tags or web beacons techniques for gathering data from multiple sites (adapted from Kaushik [2007])
2.5 Process followed by the technique based on packet sniffing (adapted from Kaushik [2007])
2.6 Consumer Decision Journey stages adopted in this thesis

3.1 Contributions to the State of the Art
3.2 Relationships between the objectives, contributions, assumptions, hypotheses and restrictions

4.1 Relations between methodology, methods, techniques, processes, activities and tasks (adapted from Gómez-Pérez et al. [2004])
4.2 Iterative research methodology using exploratory and experimental approaches
4.3 Web mining framework (adapted from Hu and Cercone [2004])
4.4 The CRISP-DM reference model (adapted from Shearer [2000])

5.1 Ontology network
5.2 Social Graph Ontology Modules
5.3 Class Example
5.4 Object Property Example
5.5 Inverse Object Properties Example
5.6 Class Inheritance Example
5.7 Property Inheritance Example
5.8 Instances Example
5.9 Core ontology module of the SGO
5.10 Publication Channels module of the SGO
5.11 Contents module of the SGO
5.12 Users module of the SGO
5.13 Opinions module of the SGO
5.14 Topics and Keywords module of the SGO
5.15 Locations module of the SGO

7.1 Format of the data used by the technique for unique user identification based on evolving device fingerprint detection
7.2 Daily distribution of visitors during the period of study
7.3 Daily distribution of visits during the period of study
7.4 Daily distribution of page views during the period of study
7.5 Distribution of the activity records captured by unique user
7.6 Distribution of visits per country
7.7 Disagreement decay for the X-Real-IP header (second interval)
7.8 Disagreement decay for the X-Real-IP header (first interval)
7.9 Agreement decay for the X-Real-IP header
7.10 Agreement decay for the X-Forwarded-For header
7.11 Disagreement decay for the User-Agent header
7.12 Agreement decay for the User-Agent header
7.13 Disagreement decay for the Accept header
7.14 Agreement decay for the Accept header
7.15 Disagreement decay for the Accept-Language header
7.16 Agreement decay for the Accept-Language header
7.17 Disagreement decay for the Accept-Charset header
7.18 Agreement decay for the Accept-Charset header
7.19 Disagreement decay for the Accept-Encoding header
7.20 Agreement decay for the Accept-Encoding header
7.21 Disagreement decay for the Cache-Control header
7.22 Agreement decay for the Cache-Control header
7.23 Disagreement decay for the Plugins attribute
7.24 Agreement decay for the Plugins attribute
7.25 Disagreement decay for the Fonts attribute (second interval)
7.26 Disagreement decay for the Fonts attribute (first interval)
7.27 Agreement decay for the Fonts attribute
7.28 Disagreement decay for the Video attribute
7.29 Agreement decay for the Video attribute
7.30 Disagreement decay for the Time zone attribute
7.31 Agreement decay for the Time zone attribute
7.32 Disagreement decay for the Session Storage attribute
7.33 Agreement decay for the Session Storage attribute
7.34 Disagreement decay for the Local Storage attribute
7.35 Agreement decay for the Local Storage attribute
7.36 Disagreement decay for the Internet Explorer persistence attribute
7.37 Agreement decay for the Internet Explorer persistence attribute
7.38 Performance of the variants evaluated for the technique for unique user identification based on evolving device fingerprint detection

8.1 Initial Data Collection task executed by the content-analysis techniques
8.2 Data Preparation Activity implemented by the content-analysis techniques
8.3 Clean data task executed by the content-analysis techniques
8.4 Construct data task executed by the content-analysis techniques
8.5 Format of the data used by the technique for detecting Consumer Decision Journey stages
8.6 Distribution of the texts along the media sources and sectors for the Consumer Decision Journey gold standard
8.7 Distribution of the texts along the Consumer Decision Journey categories
8.8 Example annotation of a post according to a Consumer Decision Journey category using Amazon Mechanical Turk
8.9 Format of the data used by the technique for detecting Marketing Mix attributes
8.10 Example annotation of a post according to a Marketing Mix Category using Amazon Mechanical Turk
8.11 Format of the data used by the technique for detecting emotions
8.12 Example annotation of a post according to an Emotions category using Amazon Mechanical Turk
8.13 Data format of the corpus used by the technique for detecting the place of residence of social media users
8.14 Example of user profile location metadata
8.15 Example of an output of the Google Geocoding API
8.16 Example execution of table location filtering process
8.17 Example of user profile description metadata
8.18 Example of location extraction from content
8.19 Data format of the corpus used by the technique for detecting the gender of social media users
8.20 Example of user profile name metadata
8.21 Dependency tree obtained from a tweet that mentions a user
8.22 Accuracy of the Consumer Decision Journey classifier for English
8.23 Accuracy of the Consumer Decision Journey classifier for Spanish
8.24 Accuracy of the Consumer Decision Journey classifier by sector
8.25 Accuracy of the Marketing Mix classifier for English
8.26 Accuracy of the Marketing Mix classifier for Spanish
8.27 Accuracy of the emotions classifier
8.28 Accuracy of the emotions classifier by sector
8.29 Accuracy of the emotions classifier by social media type
8.30 Performance of the gender recognition approaches

List of Tables

2.1 Prefixes that can be declared in a web server log file
2.2 Identifiers that can be declared in a web server log file
2.3 Subcategories of the Marketing Mix elements
2.4 Categories for the sentiment classification, organised according to their polarity
2.5 Relations between the conceptual framework of emotions used in this thesis and the Wordnet-Affect taxonomy
2.6 Example lemmatisation and part-of-speech tagging of an example text

5.1 Vocabularies selected for defining the Social Graph Ontology
5.2 Properties of the class sioc:UserAccount
5.3 Properties of the class sioc:Post (1/2)
5.4 Properties of the class sioc:Post (2/2)
5.5 Properties of the class sioc:Forum
5.6 Properties of the class marl:Opinion
5.7 Properties of the class skos:Concept
5.8 Properties of the class sioc:Community
5.9 Properties of the class rdfg:Graph
5.10 Properties of the class sioc:Site
5.11 Properties of the class foaf:Document
5.12 Properties of the class schema:Review
5.13 Property of the class sioc:Role
5.14 Properties of the class foaf:Agent
5.15 Properties of the class foaf:Person
5.16 Properties of the class foaf:Activity
5.17 Properties of the class sgo:Cookie
5.18 Properties of the class sgo:Fingerprint
5.19 Properties of the class tzont:PoliticalRegion
5.20 Properties of the class tzont:Country
5.21 Properties of the class tzont:State
5.22 Properties of the class tzont:County
5.23 Properties of the class tzont:City
5.24 Properties of the class schema:Continent
5.25 Properties of the class tzont:TimeZone

6.1 Distribution of part-of-speech categories by social media type

7.1 Statistics associated with the number of records gathered per unique user
7.2 Distribution of visits for the 10 countries that generated the most site activity
7.3 Entropy of fingerprint attributes
7.4 Cross-entropy between pairs of fingerprint attributes
7.5 Conditional entropy between pairs of fingerprint attributes
7.6 User-Agent values for Google, Bing, and Yahoo! robots
7.7 Disagreement decay of fingerprint attributes
7.8 Agreement decay of fingerprint attributes
7.9 Evaluation results for the variant based on uniform weights
7.10 Evaluation results for the variant based on attribute entropy
7.11 Evaluation results for the variant based on time decay
7.12 Evaluation results for the variant based on attribute entropy and time decay
7.13 Comparison of the variants with the best performance

8.1 Examples of the linguistic patterns for identifying Consumer De-


cision Journey stages . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.2 Primary and secondary sentiments . . . . . . . . . . . . . . . . . 214
8.3 Distribution of texts for the sentiment corpus by social media type 215

xxiv
8.4 Distribution of texts for the sentiment corpus by domain . . . . . 216
8.5 Distribution of texts for the sentiment corpus for the training and
test sets by domain . . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.6 Excerpt from sentiments in Badele3000 . . . . . . . . . . . . . . . 221
8.7 Examples of rules for classifying emotions . . . . . . . . . . . . . 223
8.8 Collocations of “odio” in Badele3000 . . . . . . . . . . . . . . . . 223
8.9 Accuracy of the place of residence identification approaches . . . . 252
8.10 Coverage of the gender recognition approaches . . . . . . . . . . . 253
8.11 Confusion matrix with the results of the approach based on men-
tions to users. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

9.1 Rule reordering example . . . . . . . . . . . . . . . . . . . . . . . 264

Chapter 1

INTRODUCTION

The rise of Web 2.0 technologies and social media has enabled users to author their
own content. This has populated the Web with huge amounts of user-generated
content that can be exploited for many different and interesting purposes, such
as explaining or predicting real world outcomes through opinion mining, which
provides a valuable tool for market research. Data scientists in almost every
industry that is exposed to public opinion are under pressure to deal with the
explosive growth of social media. Such professionals must be aware of what is
said about the issues that affect their business in different social media channels.
Social media are media in which information is created by the interaction
of users, who express their opinions freely and spontaneously. This has
revolutionised the way in which organisations and consumers interact. Users have
massively adopted these channels to engage in conversations about content,
products, and brands, while organisations are striving to adapt proactively to the
threats and opportunities that this new dynamic environment poses. Social
media are a knowledge mine about users, communities, preferences and opinions,
with the potential to positively impact marketing and product development
activities [Weber, 2007].
In the marketing field, media and society digitalisation has revolutionised the
rules of traditional brand communication with an explosion of channels and
possibilities for brands to contact consumers. Brands and media agencies are facing
a big challenge in developing systems that ensure the best communication strategy
for the brand (in terms of cost, effectiveness and efficiency). Activities such as
word-of-mouth advertising, where products or brands are promoted via oral or
written communication, have successfully adapted to social media through viral
processes. It is becoming essential to know the views of consumers towards brands
and products for designing advertisement campaigns, estimating future sales and
deciding the strategy to follow when launching a new brand image. According to
a Nielsen [2012b] report, 70% of social media users take into account the product
experience published by other users; 65% say they search for information about
brands, products and services; 53% express positive comments on brands; and
50% express complaints at least once per month.
Social media monitoring tools are being used successfully in a range of
domains (including market research, online publishing, etc.). However, the tools
currently available to analyse social media do not fully leverage the rich and
complex information structure generated by users. Most of these tools elaborate
their reports from metrics based on volume of posts, opinion polarity about the
subject that is being studied, and users’ reputation. Although such metrics are
good indicators of a subject’s popularity and relevance, these metrics are often
inadequate for capturing complex multi-modal dimensions of the subjects to be
measured that are relevant to business, and must be complemented with ad-hoc
studies such as opinion polls. Therefore, existing opinion-mining techniques must
be extended for discovering other aspects of discourse, such as consumer intents,
mood and emotions. Overcoming some of the limitations of current tools to man-
age and analyse the information produced in social media is a pending challenge
that this thesis addresses.
The main goal of this thesis is to provide a data model and a set of
techniques based on Web user tracking and natural language processing
for extracting semantic information from the contents generated
by consumers in social media. In the following paragraphs we introduce the
specific contributions of this thesis to the State of the Art.
The disparity of formats, mechanisms for accessing the information, content
sizes, and metadata hinders the collection, integration and processing of the
content published in social media, forcing the use of specific methods and techniques
for each kind of media. In this thesis, we provide a data model for the
marketing domain that can be used for standardising and normalising the
information that can be extracted from social media about consumers,
brands, media and opinions of consumers about brands (C1).
The distributed nature of the Web and the disparity of devices that can
be used to access social media (PCs, smartphones, tablets, smart TVs, etc.)
make it difficult to track the actions performed by users for web analytics purposes.
Unique user identification is a key task within the web analytics data
collection process, and is useful for measuring the effectiveness of online advertising
campaigns, among other applications. The fingerprinting technique consists in
tracking user activity on a set of sites by capturing technical information about
the browser and the machine that the user employs to navigate the Web. Browser
fingerprinting has been demonstrated to be an effective method for unique user
identification when the device used to navigate the Web does not support cookies.
However, as the attributes used for generating browser fingerprints evolve,
multiple distinct fingerprint records are created for the same user, leading to incorrect
unique user identification. This thesis contributes to the State of the Art
with a technique for unique user identification that detects browser
fingerprint evolution (C3).
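The effect of fingerprint evolution can be illustrated with a minimal sketch that derives an identifier by hashing a few browser attributes. The attribute names, the sample values and the choice of SHA-256 are assumptions made only for this example; they are not the attribute set or the technique evaluated later in this thesis.

```python
import hashlib


def fingerprint_id(attributes: dict) -> str:
    """Hash browser attributes into a single identifier (illustrative only)."""
    # Sort the keys so that equal attribute sets always give the same digest.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute values for the same user before and after an upgrade.
before_update = {
    "user_agent": "ExampleBrowser/30.0",
    "screen": "1920x1080x24",
    "timezone": "UTC+1",
    "plugins": "pdf-viewer,flash",
}
after_update = dict(before_update, user_agent="ExampleBrowser/31.0")

same_identifier = fingerprint_id(before_update) == fingerprint_id(after_update)
print(same_identifier)  # False: the upgrade produces a second fingerprint record
```

A naive exact-match comparison such as this one would count the user twice, which is the situation that a technique for detecting fingerprint evolution has to resolve.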
In the last decade, the availability of digital user-generated documents from
social media has dramatically increased. This massive growth of user-generated
content has also affected traditional shopping behaviour. Customers have em-
braced new communication channels such as microblogs and social networks that
enable them not just to talk with friends and acquaintances about their shopping
experience, but also to search for opinions expressed by complete strangers as
part of their decision making processes. Uncovering how customers feel about
specific products or brands and detecting purchase habits and preferences has
traditionally been a costly and highly time-consuming task which involved the
use of methods such as focus groups and surveys. However, the new scenario calls
for a deep assessment of current market research techniques in order to better
interpret and profit from this ever-growing stream of attitudinal data.
With this purpose, we present a novel analysis and classification of
user-generated content according to which of the four stages of the
Consumer Decision Journey [Court et al., 2009] it belongs to (i.e. the purchase
process from the moment when a customer is aware of the existence of the product
to the moment when he or she buys, experiences and talks about it) (C4.1).
Using a corpus of short texts written in English and Spanish and extracted from
different social media, this thesis identifies a set of linguistic patterns for each
purchase stage that are then used in a rule-based classifier. Additionally,
we use machine-learning algorithms to automatically identify business
indicators such as the Marketing Mix elements [McCarthy and Brogowicz,
1981] (C4.2).
Sentiment analysis of social media is of commercial interest as user-generated
content published in the Web reaches and influences many potential customers.
Most work in this field has focused on opinion polarity (positive or negative) and,
therefore, does not specify the kind of sentiment related to that opinion. In order
to provide this information, this thesis establishes four polarised categories
that capture the main sentiments that can be found on social media:
satisfaction-dissatisfaction (SD), trust-fear (TF), love-hate (LH), and
happiness-sadness (HS). It develops a rule-based system that classifies
texts in Spanish from those social media, according to this sentiment
classification with respect to a brand, company or product. The rules
have been written in a simple grammar after (linguistically) analysing a corpus
of different business domains whose texts had been manually classified (C4.3).
Characterising users through demographic attributes is a necessary step
before conducting opinion surveys from information published by such users in
social media. In this thesis, we describe, compare and evaluate different
techniques for the identification of the attributes “gender” (C4.4) and
“place of residence” (C4.5) by mining the metadata associated with the users,
the content that they publish and share, and their friendship networks.
Natural language processing techniques are a key technology for analysing
user-generated content. Although some efforts have been made to structure social
media information, such as Twitlogic [Shinavier, 2010], there is still a need for
approaches that are able to cope with the different channels in the Social Web
and with the challenges they pose. The content published in social media is
characterised by the use of casual language; social media posts contain texts that
vary in length from short sentences in microblogs to medium-size articles in web
logs. Very often the text published in social media contains misspellings, is
written completely in uppercase or lowercase letters, or is composed of set phrases,
among other characteristics that challenge existing content analysis techniques,
leading to problems regarding the accuracy of natural language processing tools
like part-of-speech taggers. As an example, for the Spanish language, the absence
of an accent in a word may give such word a completely different meaning.
As a minor contribution, this thesis studies the differences in the
language used in heterogeneous social media sources, by analysing the
distribution of the part-of-speech categories extracted from the morphological
analysis of a sample of texts published in such sources, showing that
normalising user-generated content is a necessary step before analysing social
media posts, particularly on Twitter1 (C2). Therefore, the content analysis
techniques proposed by this thesis implement a stage that performs a morphological
normalisation of user-generated content that makes use of on-line and collectively
developed resources, including Wikipedia2 and an SMS lexicon. The results
obtained demonstrate that the normalisation of user-generated content slightly
improves the accuracy of the content analysis techniques presented in this thesis.

1 http://twitter.com
2 http://www.wikipedia.org

1.1 Thesis Structure

This thesis is structured as follows:

• Chapter 2 reviews the State of the Art and identifies the open research
problems addressed in this thesis.

• Chapter 3 presents the objectives of this thesis, which were defined
according to the open research problems identified in Chapter 2. In addition, we
present the contributions to the State of the Art, as well as the assumptions
and hypotheses on which our contributions rely. Finally, we describe
the restrictions, which define the scope of the different contributions.

• Chapter 4 presents the research methodology and the method followed
for obtaining the artefacts provided by this thesis, which is inspired by an
existing framework for web mining. For defining the model of the data
warehouse we have followed an existing methodology for building ontology
networks. For addressing the rest of the phases defined by the framework,
we have followed an existing data mining process model.

• Chapter 5 describes the data model that we have designed for representing
the information extracted from social media for the marketing domain.

• Chapter 6 characterises the different kinds of social media according to the
morphosyntactic characteristics of the textual content published in such
media.

• Chapter 7 provides a technique for uniquely identifying users in social media
based on the fingerprint of their devices, regardless of the evolution of such
fingerprints. The chapter also presents the evaluation results and describes
the data set used for evaluating the technique.

• Chapter 8 presents a collection of techniques for extracting sociodemographic
and psychographic profiles from social media users applied to the
marketing domain, through the analysis of the opinions they express about
brands, as well as from the profiles published by them in social networks.
The chapter also presents the evaluation results and describes the data sets
used for evaluating the techniques.

• Finally, Chapter 9 presents research conclusions and possible future lines of
research and innovation.

1.2 Dissemination of Results
Some of the contributions produced within the framework of this thesis have been
published in international peer-reviewed journals, conferences and workshops. In
the following we list the contributions along with the publications that support
them.
The technique proposed for uniquely identifying users in social media based
on the fingerprint of their devices has been published in an international journal:

Óscar Muñoz-García, Javier Monterrubio-Martín, Daniel García-Aubert.
Detecting browser fingerprint evolution for identifying unique users.
International Journal of Electronic Business, 10(2):120–141, 2012,
ISSN 1470-6067, DOI 10.1504/IJEB.2012.051116.

The techniques proposed for classifying user-generated content into Consumer
Decision Journey stages and Marketing Mix elements have been published in an
international journal indexed by JCR:

Silvia Vázquez, Óscar Muñoz-García, Inés Campanella, Marc Poch,
Beatriz Fisas, Nuria Bel, Gloria Andreu. A classification of
user-generated content into Consumer Decision Journey stages.
Neural Networks, 58:68–81, October 2014, ISSN 0893-6080,
DOI 10.1016/J.NEUNET.2014.05.026.

The technique proposed for detecting emotions has been published in the
proceedings of a Spanish conference:

Guadalupe Aguado-de-Cea, María Auxiliadora Barrios, María Socorro
Bernardos, Inés Campanella, Elena Montiel-Ponsoda, Óscar
Muñoz-García, Víctor Rodríguez. Análisis de sentimientos en
un corpus de redes sociales. In Proceedings of the 31st AESLA
(Asociación Española de Lingüística Aplicada) International Conference,
San Cristóbal de la Laguna, Tenerife, Spain, April 2014.

The techniques proposed for identifying the place of residence and gender of
social media users have been published in a Spanish journal:

Óscar Muñoz-García, Jesús Lanchas Sampablo, David Prieto Ruíz.
Characterising social media users by gender and place of residence.
Procesamiento del Lenguaje Natural, 51:57–64, September
2013, ISSN 1135-5948.

The characterisation of the different kinds of social media according to the
morphosyntactic characteristics of the textual content published in such media
has been published in the proceedings of an international workshop:

Óscar Muñoz-García, Carlos Navarro. Comparing user generated
content published in different social media sources. In
Proceedings of the NLP can u tag #user generated content ?! via
lrec-conf.org Workshop co-located with the Eighth International Conference on
Language Resources and Evaluation (LREC 2012), pp. 1–8, Istanbul,
Turkey, 26 May 2012.

Finally, the approach that we follow for performing morphological
normalisation of social media posts has been published in the proceedings of a Spanish
workshop:

Óscar Muñoz-García, Silvia Vázquez Suárez, Nuria Bel. Exploiting
Web-based collective knowledge for micropost normalisation.
In Proceedings of the Tweet Normalization Workshop co-located
with the 29th Conference of the Spanish Society for Natural Language
Processing (SEPLN 2013), pp. 10–14, Madrid, Spain, 20 September
2013, ISSN 1613-0073.

Chapter 2

STATE OF THE ART

This chapter reviews the State of the Art regarding the objectives of this thesis.
The information published in social media consists of connected data by na-
ture, due to the interlinked nature of social networks. Therefore, graph-based
data models are an appropriate way of representing the relationships between
the users and contents included in social media. Section 2.1 describes existing
semantic vocabularies that can be used for representing social media information.
Such vocabularies will be reused in this thesis to provide a normalised schema for
structuring the information published in social media.
This thesis provides a technique for unique user identification, which is an
essential step for tracking the activity of users in the Web. Section 2.2 describes
the existing techniques for tracking users in the Web, while Section 2.3 describes
a technique for detecting the evolution of temporary records, upon which our
technique for identifying unique users is based.
Additionally, this thesis has a strong business context, and its objectives are
devoted to solving specific problems related to the marketing field. Section 2.4
describes the State of the Art on social media analysis applied to market
research, while Section 2.5 introduces the marketing background upon which the
contributions of our thesis are based.
Finally, many of the contributions of this thesis rely on natural language
processing techniques applied to the analysis of textual content published in social
media, whose State of the Art is described in Section 2.6. In the following we
detail the State of the Art and the existing research problems related to it.

2.1 Semantic Vocabularies for Representing Social Media Information
Social media and the online communities built around them are silos whose users,
contents, topics, etc. are rarely connected to one another (e.g. Twitter data is
not connected with Facebook3 data), except for minor service integrations (e.g.
publishing a tweet whenever a status update is made in a LinkedIn4 account).
In addition, there is no unified data format for expressing the
information posted to the different social media. For example, the data format used by
the Facebook Graph API5 does not match the one used by the Twitter API6,
and neither of them matches the content syndication formats RSS7 and Atom [Nottingham
and Sayre, 2005], commonly used by weblogs and news publication sites.
Format heterogeneity and cross-social network integration issues hinder data
gathering and the integrated analysis of the data published in social media. SIOC
[Breslin et al., 2006] is a Semantic Web ontology designed to cope with these
issues. It uses RDF8 for representing data published in social media, allowing
posts, authors, topics, and other concepts to be linked regardless of the specific
social network, therefore providing a mechanism for integrating information related to
online communities.
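To make this concrete, the following minimal sketch describes one post and its author with SIOC and Dublin Core terms and prints the result as N-Triples. It relies only on the Python standard library; the example.org URIs, the post text, the date and the helper functions are invented for illustration, and a real application would more likely use a dedicated RDF library.

```python
SIOC = "http://rdfs.org/sioc/ns#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Hypothetical resources: the same pattern applies whatever the source network.
post = "http://example.org/post/1"
author = "http://example.org/user/alice"

triples = [
    (uri(post), uri(RDF_TYPE), uri(SIOC + "Post")),
    (uri(post), uri(SIOC + "has_creator"), uri(author)),
    (uri(post), uri(SIOC + "content"), literal('I love my new "phone"')),
    (uri(post), uri(DCTERMS + "created"), literal("2012-05-26T10:00:00Z")),
    (uri(author), uri(RDF_TYPE), uri(SIOC + "UserAccount")),
]

ntriples = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
print(ntriples)
```

Because the post and the account are plain URIs, triples produced from different social networks can be merged into a single graph without format conversion.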
The SIOC vocabulary is linked with FOAF [Graves et al., 2007] for repre-
senting information about users and user-accounts. FOAF defines a data model
of persons and relationships between persons, including mappings with other Se-
mantic Web vocabularies, like Schema.org9 .
Schema.org is a vocabulary designed for marking up HTML10 pages to improve
indexing and metadata visualisation by search providers like Google11, Yahoo!12
and Bing13. This vocabulary includes a rich set of classes and properties that can
be used for complementing the ones provided by SIOC and FOAF for annotating
users and contents.

3 http://www.facebook.com
4 http://www.linkedin.com
5 http://developers.facebook.com/docs/graph-api
6 https://dev.twitter.com
7 http://www.rssboard.org/rss-specification
8 http://www.w3.org/TR/rdf11-concepts
9 http://schema.org
10 http://www.w3.org/TR/html5
11 http://www.google.com
12 http://www.yahoo.com

Additionally, SIOC reuses the Dublin Core vocabulary14 for adding metadata
to posts (e.g. title, summary, publication date) using properties standardised
by DCMI (Dublin Core Metadata Initiative)15.
The SIOC specification16 suggests using SKOS [Miles et al., 2005] for
representing topics according to which contents can be categorised. SKOS is an RDF
vocabulary that provides a model for representing conceptual schemes, such as
thesauri, classification schemes, subject heading lists, taxonomies, and other kinds
of controlled vocabularies, within the framework of the Semantic Web.
Regarding geo-localisation of contents and users, FOAF is linked with the
WGS8417 vocabulary that allows annotating resources with geographical coordi-
nates. In addition, for representing time zones and political regions (e.g. coun-
tries and states) the ontology Time Zone18 can be used. Schema.org also provides
ontology elements for describing spatial features of web resources.
SIOC provides neither ontology elements nor a recommendation for
annotating the content with the results of natural language analysis processes.
Nevertheless, there exist multiple vocabularies that can be used for performing this
task. As an example, the categorisation model ISOcat [Kemps-Snijders et al.,
2008] can be used for annotating contents with linguistic information based on a
standardised set of categories.
With respect to Opinion Mining, Marl [Westerski et al., 2011] is an ontology
used for annotating and describing opinions according to the polarity expressed
in them with respect to specific entities (e.g. brands, persons) mentioned in
social media. Therefore, it provides ontology elements for classifying opinions
into three possible categories of polarity (i.e. positive, negative, neutral) and for
quantifying such polarity according to a numeric scale. Additionally, the Onyx
ontology [Sánchez-Rada and Iglesias, 2013] allows categorising opinions into a
broader set of emotions, like the ones described by the Wordnet-Affect taxonomy
[Valitutti et al., 2004].

13 http://www.bing.com
14 http://dublincore.org/documents/dcmi-terms
15 http://dublincore.org
16 http://rdfs.org/sioc/spec
17 http://www.w3.org/2003/01/geo
18 http://www.w3.org/2006/timezone

Multiple instances of social graphs can be used to perform analyses with
different data sets (e.g. for analysing different domains or markets). These instances
can be treated separately with RDF named graphs, and each named graph can
be described using graph description metadata, such as that provided by
the RDFG vocabulary [Carroll et al., 2005].
Finally, PROV-O (PROV Ontology)19 provides a set of ontology elements
that can be used for representing and exchanging information about the provenance
of data generated by different systems. Therefore, it can be used within the
social media field for indicating the content authoring entities and referencing
publication sources. PROV-O has been mapped with the Dublin Core
vocabulary20, which in turn is mapped with FOAF. Thus, expressing social media facts
using the FOAF and Dublin Core vocabularies automatically adds provenance
information through the existing mappings.
All these vocabularies are rich enough for describing general-purpose social
graphs. However, during our survey we have not found vocabularies that allow
describing some of the concepts related to the Marketing domain that this thesis
deals with and that will be explained in the following sections. Nor does there
exist a unified model that integrates the different vocabularies.

2.1.1 Conclusions
Open Research Problem 1. While there exist data models for representing
information captured from social media, either generic or social-network-specific,
there are no schemas that integrate such information with marketing-specific
classifications and KPIs (Key Performance Indicators) obtained from the analysis of
the content generated by consumers and the activity produced by them in
social media. Therefore, the existing vocabularies may be extended with ontology
elements that model marketing-related knowledge.

19 http://www.w3.org/TR/prov-o
20 http://www.w3.org/TR/prov-dc

Open Research Problem 2. The existing data models for representing social
media information characterise the metadata that accompany the content
published in the different kinds of media. However, there is no characterisation of
such media according to the linguistic features of the textual contents published
in them.

2.2 Techniques for Tracking Users in the Web


User tracking consists in registering the activity of users as they interact with
one or more websites so that such activity can be related with specific, uniquely
identified users.
The tracking of users is an essential activity in order to perform Web
Analytics. Web Analytics is the professional discipline designed to draw conclusions,
define strategies, or establish business rules on the basis of data collected in all
web environments over which a company has control [Maldonado, 2009]. Web
Analytics is a professional discipline because there is an industry related to Business
Intelligence, Market Research and Marketing that demands professionals with
Web Analytics skills who provide insights to their customers. Web Analytics
allows studying the behaviour of users in websites and drawing conclusions, such as
why they came to the site and from where, why they left and where they went,
why they do not perform the actions we were expecting them to perform, or
what search terms were used to get to the website. The strategies and business
rules that Web Analytics enables are oriented to drive a continuous improvement of
the online experience that customers and potential customers have, leading them
to the website's desired outcomes [Kaushik, 2009].
Web Analytics is used for measuring the performance of websites in a
commercial context, providing a measurement model for Digital Marketing that allows
the effectiveness and impact of advertisement campaigns in digital
media to be quantified. Data gathered by applying Web Analytics (e.g. number of persons that
have viewed a banner) are typically compared against KPIs (e.g. outreach of
a campaign) and used to improve the audience response to marketing campaigns
(e.g. moving the banner to a site with a larger audience). The most significant KPIs
depend on counting unique visitors.

Within a Web Analytics context, the data collection process consists in record-
ing the activity generated by users while they interact with a set of websites. Such
recorded activity may contain records about advertisement impressions, clicks on
web page hyperlinks, and other navigational information. Collected data is useful
for a number of marketing activities, such as analysing advertisement campaign
outreach or performing behavioural targeting, which involves tracking the on-line
activities of users in order to deliver tailored ads to them. Specifically, ad
targeting techniques, such as the one described by Deane et al. [2011], rely on data in
which users are uniquely identified. For collecting such data, the activity itself must
first be captured. After that, such activity must be associated with unique visitors.
Visits and unique visitors are the basic web metrics required for nearly every
web metric calculation [Kaushik, 2009]. As defined by the Digital Analytics
Association [Burby and Brown, 2007]:

Definition 1. A visit is an interaction, by an individual, with a website
consisting of one or more requests for an analyst-definable unit of content (i.e. page
view).

Definition 2. The KPI unique visitors refers to the number of inferred indi-
vidual people (filtered for spiders and robots), within a designated reporting time-
frame, with activity consisting of one or more visits to a site. Each individual is
counted only once in the unique visitor measure for the reporting period.
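Definition 2 can be read as a filter followed by a distinct count. The sketch below applies it to a handful of made-up visit records; the record layout, the visitor identifiers and the substring-based robot check are assumptions for illustration only, not the filtering procedure used in this thesis.

```python
from datetime import date

# Hypothetical visit records: (visitor identifier, visit date, user agent).
visits = [
    ("u1", date(2012, 5, 1), "Mozilla/5.0"),
    ("u1", date(2012, 5, 3), "Mozilla/5.0"),
    ("u2", date(2012, 5, 7), "Mozilla/5.0"),
    ("bot", date(2012, 5, 7), "ExampleBot/1.0 (crawler)"),
    ("u3", date(2012, 6, 2), "Mozilla/5.0"),
]

ROBOT_MARKERS = ("bot", "crawler", "spider")  # assumed, simplistic heuristic


def is_robot(user_agent: str) -> bool:
    """Flag user agents that contain a typical robot marker."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in ROBOT_MARKERS)


def unique_visitors(records, start: date, end: date) -> int:
    """Count distinct non-robot visitors with a visit in [start, end]."""
    return len({
        visitor
        for visitor, day, user_agent in records
        if start <= day <= end and not is_robot(user_agent)
    })


print(unique_visitors(visits, date(2012, 5, 1), date(2012, 5, 31)))  # 2
```

In the example, visitor u1 makes two visits in May but is counted once, and the robot is excluded, so the reporting period has two unique visitors.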

At least six of the eight critical web metrics defined by Kaushik [2009] depend
on uniquely identifying users (i.e. unique visitors, time on page, time on site,
bounce rate, exit rate, and engagement). The other two are visits and conversion
rate. Conversion rate can be calculated by taking into account either unique
visitors or visits, depending on business objectives.
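The choice of denominator changes the reported figure, as the following minimal sketch shows. All the numbers are hypothetical, and the function simply divides the number of conversions by whichever base the analyst selects.

```python
def conversion_rate(conversions: int, denominator: int) -> float:
    """Return conversions divided by visits or by unique visitors."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return conversions / denominator


# Hypothetical figures for one reporting period.
conversions = 30
visits = 1200
unique_visitors = 800

rate_per_visit = conversion_rate(conversions, visits)
rate_per_visitor = conversion_rate(conversions, unique_visitors)
print(f"{rate_per_visit:.2%} per visit, {rate_per_visitor:.2%} per unique visitor")
```

With these figures the rate is 2.50% per visit and 3.75% per unique visitor, so the two variants should not be compared with each other across reports.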

2.2.1 Techniques for Capturing Web Activity


There are four main ways of capturing the activity (a.k.a. clickstream data) of
website users [Kaushik, 2007]: web logs, web beacons, JavaScript tags, and packet
sniffing. This section describes these approaches and analyses their advantages
and disadvantages.

Figure 2.1: Process followed by the technique based on web logs (adapted from
Kaushik [2007]). [Diagram not reproduced; its labels are Web Servers and Log files,
with steps numbered 1 to 4.]

2.2.1.1 Technique Based on Web Logs

Web logs are a classic system for capturing clickstream data. This technique is
implemented by web servers and consists in registering one log entry each time
there is a request to a web server by a web client. In such log-based systems, the
web server triggers the log action when it receives a request from the client.
Figure 2.1 shows the process followed by this technique. The steps of this
process are the following:

1. A user requests a resource (e.g. a web page) through its URL [Berners-Lee,
1994].

2. The request is sent to a web server.

3. The server receives the request and creates a record in its log describing the
request.

4. Finally, the server sends the resource to the user.
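The four steps above can be sketched as a toy request handler that writes a log record before returning the resource. The in-memory site, the function name and the entry layout are assumptions for illustration; real web servers implement this internally and with many more fields.

```python
from datetime import datetime, timezone

# Hypothetical in-memory "site": URL path -> resource body.
RESOURCES = {"/foo/bar.html": "<html>Hello</html>"}


def handle_request(method: str, path: str, log: list) -> tuple:
    """Steps 3 and 4: record the request in the log, then send the resource."""
    timestamp = datetime.now(timezone.utc).strftime("%H:%M:%S")
    log.append(f"{timestamp} {method} {path}")  # step 3: create the log record
    if path in RESOURCES:
        return 200, RESOURCES[path]  # step 4: send the resource to the user
    return 404, ""


server_log = []
# Steps 1 and 2: the user requests a URL and the request reaches the server.
status, body = handle_request("GET", "/foo/bar.html", server_log)
print(status, server_log)
```

Note that the entry is written whether or not the resource exists, so failed requests also appear in the log.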

The format of web server logs has been standardised by W3C21 . The standard
proposes to describe log files as a sequence of log entries preceded by a header
with one or more of the metadata described next:

Version. Specifies the version of the log file format used.


21
http://www.w3.org/TR/WD-logfile.html

15
Prefix Description
c Client.
s Server.
r Remote.
cs Client to Server.
sc Server to Client.
sr Server to Remote Server. This prefix is used by proxies.
rs Remote Server to Server. This prefix is used by proxies.
x Application specific identifier.

Table 2.1: Prefixes that can be declared in a web server log file

Fields. Specifies the fields recorded in the log. Each field is declared with
an optional prefix and a field identifier. The prefix refers to the direction
of the information transfer, while the identifier refers to the type of data
of the entry. For example, the field cs-method refers to the HTTP method
[Fielding and Reschke, 2014b] used for data transfer from client to server.
Table 2.1 shows the list of available prefixes, while Table 2.2 shows the
identifiers that can be registered, indicating whether each one requires a
prefix.

Software. Identifies the software that generated the log.

Start-Date. The date and time at which the log was started.

End-Date. The date and time at which the log was finished.

Date. The date and time at which the entry was added.

Remark. Comment information. Analysis tools should ignore data recorded in
this field.

Listing 2.1 shows an example log file that includes a header in which the
version used (line 1), the recording date (line 2), and the fields registered
(line 3) are specified. The registered fields correspond to the timestamp of
each request, the HTTP method used, and the URI of the resource requested.
The technique based on logs is the most accessible of all the techniques
for recording web activity, since most web servers implement it. Also, there are
numerous tools that allow the analysis of logs, such as AWStats22, Webalizer23
and Analog24.

Identifier  Description                                         Prefix  Type
date        Date at which transaction completed.                No      Date
time        Time at which transaction completed.                No      Time
time-taken  Time taken for transaction to complete in seconds.  No      Fixed
bytes       Number of bytes transferred.                        No      Integer
cached      Records whether a cache hit occurred.               No      Integer
ip          IP [Postel, 1981] address and port.                 Yes     Address
dns         DNS name [Mockapetris, 1987].                       Yes     Name
status      Status code [Fielding and Reschke, 2014b].          Yes     Integer
comment     Comment returned with status code.                  Yes     Text
method      HTTP method.                                        Yes     Name
uri         URI [Berners-Lee et al., 2005].                     Yes     URI
uri-stem    Stem portion alone of URI (omitting query).         Yes     URI
uri-query   Query portion alone of URI.                         Yes     URI

Table 2.2: Identifiers that can be declared in a web server log file

1 #Version: 1.0
2 #Date: 12-Jan-1996 00:00:00
3 #Fields: time cs-method cs-uri
4 00:34:23 GET /foo/bar.html
5 12:21:16 GET /foo/bar.html
6 12:45:52 GET /foo/bar.html
7 12:57:34 GET /foo/bar.html

Listing 2.1: Example log file
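As an illustration, a log such as the one in Listing 2.1 can be read with a few lines of Python. The following is a minimal sketch, not a complete implementation of the W3C draft: the function name parse_extended_log and the sample data are assumptions made for this example, only the #Fields directive is interpreted, and quoted values containing spaces are not handled.

```python
from typing import Dict, Iterator, List


def parse_extended_log(lines: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per log entry, keyed by the names declared in #Fields."""
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Fields: time cs-method cs-uri".
            name, _, value = line[1:].partition(":")
            if name.strip().lower() == "fields":
                fields = value.split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # this sketch simply skips malformed entries
        yield dict(zip(fields, values))


# Sample data copied from the example log (without the listing line numbers).
SAMPLE_LOG = """\
#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
""".splitlines()

entries = list(parse_extended_log(SAMPLE_LOG))
```

Each entry becomes a dictionary whose keys are the declared field names, so later analysis steps do not depend on the column order chosen by a particular server.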
The main criticism of this technique is that the information captured in log
files is often too technical (HTTP errors [Fielding and Reschke, 2014b], browser
types, etc.) to be used directly for business purposes (e.g. marketing
intelligence). Similarly, the volume of information recorded in the logs is too
large, since the server records the download of every resource it provides
(style sheets, images, etc.), regardless of whether it is worth measuring or
not. Therefore, the log files must be conveniently filtered prior to their
analysis.

22 http://awstats.sourceforge.net
23 http://www.webalizer.org
24 http://www.analog.cx
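The filtering step mentioned above could look as follows. This is only a sketch under stated assumptions: the suffix list STATIC_SUFFIXES and the function keep_page_requests are hypothetical, and a real deployment would choose its own criteria for what is worth measuring.

```python
from typing import Dict, List

# Hypothetical list of resource types that are not worth measuring.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def keep_page_requests(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only GET requests whose URI stem does not look like a static asset."""
    kept = []
    for entry in entries:
        stem = entry["cs-uri"].split("?", 1)[0].lower()  # drop the query part
        if entry.get("cs-method") == "GET" and not stem.endswith(STATIC_SUFFIXES):
            kept.append(entry)
    return kept


raw_entries = [
    {"time": "00:34:23", "cs-method": "GET", "cs-uri": "/foo/bar.html"},
    {"time": "00:34:24", "cs-method": "GET", "cs-uri": "/css/site.css"},
    {"time": "00:34:24", "cs-method": "GET", "cs-uri": "/img/logo.png?v=2"},
]
page_views = keep_page_requests(raw_entries)
```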
The technique based on logs is able to register any activity that implies an
HTTP request [Fielding and Reschke, 2014a] from the client to the server.
However, it is not able to register user behaviour on web pages that does not
require a resource download operation. Such behaviour is becoming more common
due to dynamic web pages.

2.2.1.2 Technique Based on Web Beacons

The web beacons technique consists in placing banners, or 1 × 1 pixel transparent
images, in web pages by means of HTML img tags. When these tags are processed,
a request to a tracking server is performed, which triggers the recording of the
activity.
Figure 2.2 shows the process followed by this technique. The steps of this
process are the following:

1. A user requests a web page through its URL.

2. The request is sent to a web server.

3. The server sends the web page including an image of 1 × 1 pixels whose
URL points to a data collection server.

4. When the web page is loaded in the user’s browser, a request for the image
is sent to the data collection server.

5. The data collection server sends the image to the user, taking advantage of
the HTTP protocol for managing cookies in the user’s device, and capturing
user data, such as the web page that the user is viewing, the IP address of
the user’s device, the timestamp of the activity, etc.
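The behaviour of the data collection server in steps 4 and 5 can be sketched as a single function. This is an illustrative sketch only: handle_beacon_request is a hypothetical name, and taking the viewed page from the Referer header is an assumption (the page URL could equally be passed as a parameter of the image URL).

```python
import base64
from datetime import datetime, timezone
from typing import Dict, Tuple

# Bytes of a 1x1 GIF image, the payload returned for every beacon request.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def handle_beacon_request(
    request_headers: Dict[str, str], client_ip: str
) -> Tuple[Dict[str, str], bytes, Dict[str, str]]:
    """Return (response headers, response body, activity record)."""
    record = {
        # Assumption: the page that embeds the image arrives as the Referer.
        "page": request_headers.get("Referer", ""),
        "user_agent": request_headers.get("User-Agent", ""),
        "cookie": request_headers.get("Cookie", ""),
        "ip": client_ip,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    response_headers = {
        "Content-Type": "image/gif",
        "Content-Length": str(len(PIXEL_GIF)),
        "Cache-Control": "no-store",  # ask the browser to request it every time
    }
    return response_headers, PIXEL_GIF, record


headers, body, record = handle_beacon_request(
    {"Referer": "http://example.org/landing", "User-Agent": "TestBrowser/1.0"},
    "203.0.113.7",
)
```

Keeping the handler free of any web framework makes the two outcomes of the request explicit: the image goes back to the browser, and the record goes to the analytics store.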

Web beacons are used not only to capture information relating to the navigation
of web pages; they can also be inserted into email messages, so that KPIs
about an email sent can be recorded (e.g. number of email views). However,
users often disable the download of images within their email applications.

Figure 2.2: Process followed by the technique based on web beacons (adapted
from Kaushik [2007])

2.2.1.3 Technique Based on JavaScript Tags

The JavaScript tags technique is the most widely used nowadays, and multiple
commercial tools implement it (e.g. Adobe Marketing Cloud25, IBM EMM26,
webtrends27, and Google Analytics28). It consists in placing JavaScript [ECMA,
2011] code within HTML pages, so that, when an event to be measured occurs,
the scripting code is evaluated. Such code includes a request to a tracking
server. Thus, when the script is evaluated, the request is performed and the
activity is recorded.
Figure 2.3 shows the process followed by this technique. The steps of this
process are the following:

1. A user requests a web page through its URL.

2. The request is sent to a web server.

3. The server sends the web page including a script of JavaScript code assigned
to different events (e.g. web page load, click on an active item).

4. When an event is triggered, its assigned JavaScript code is executed. Such
code includes sending an HTTP request to a data collection server.

5. The data collection server processes the request, taking advantage of the
HTTP protocol for managing cookies in the user’s device, and capturing
user data, such as the web page that the user is viewing, the IP address of
the user’s device, the timestamp of the activity, etc.

25 http://www.adobe.com/en/solutions/digital-marketing.html
26 http://www.ibm.com/software/products/category/enterprise-marketing-management
27 http://webtrends.com
28 http://www.google.com/analytics

Both the technique based on web beacons and the technique based on JavaScript
tags allow collecting the web activity produced in multiple websites into a
single data collection system. Figure 2.4 illustrates this scenario.

Figure 2.3: Process followed by the technique based on JavaScript tags (adapted
from Kaushik [2007])

Figure 2.4: Process followed by the tags or web beacons techniques for gathering
data from multiple sites (adapted from Kaushik [2007])
2.2.1.4 Technique Based on Packet Sniffing

The packet sniffing technique consists in inspecting the IP packets exchanged
between web browsers and web servers. Packet sniffers can be implemented as a
software layer over the web server, or as an independent module that intercepts
and analyses the packets sent by web browsers before re-routing them to web
servers.
Figure 2.5 shows the process followed by this technique. The steps of this
process are the following:

1. A user requests a web page through its URL.

2. The request is intercepted in its route to the web server by a packet sniffer
that extracts the request data from the HTTP header of the request.

3. The packet sniffer re-routes the request to the web server.

4. The web server sends its response to the user’s browser. The response is
intercepted by the packet sniffer, which extracts the information about the
web page being served. Additionally, some sniffers add JavaScript tags to
the web page, with the aim of obtaining additional information, once the
browser processes the scripts.

5. The packet sniffer re-routes the response to the web browser.
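The extraction performed in step 2 can be illustrated with a small parser for the head of an HTTP request. This is a sketch under simplifying assumptions: the request is unencrypted HTTP/1.1, it has already been reassembled from the captured packets, and the function name extract_request_data is introduced here only for illustration.

```python
from typing import Dict


def extract_request_data(raw_request: bytes) -> Dict[str, str]:
    """Extract the data of interest from the head of a raw HTTP request."""
    # The head ends at the first empty line; any body is ignored here.
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    request_line, *header_lines = head.split("\r\n")
    method, target, version = request_line.split(" ", 2)

    headers: Dict[str, str] = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    return {
        "method": method,
        "target": target,
        "version": version,
        "host": headers.get("host", ""),
        "user_agent": headers.get("user-agent", ""),
        "referer": headers.get("referer", ""),
    }


captured = (
    b"GET /foo/bar.html?x=1 HTTP/1.1\r\n"
    b"Host: example.org\r\n"
    b"User-Agent: TestBrowser/1.0\r\n"
    b"\r\n"
)
request_data = extract_request_data(captured)
```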

Figure 2.5: Process followed by the technique based on packet sniffing (adapted
from Kaushik [2007])
2.2.2 Techniques for Identifying Unique Users
This section describes the existing techniques for identifying unique users. Sec-
tion 2.2.2.1 describes the widely used technique based on cookies, while Sec-
tion 2.2.2.2 describes a novel technique based on the fingerprint of the device
used for browsing the Web.

2.2.2.1 Technique Based on Cookies

Among the techniques for uniquely identifying users, the one based on
cookies is the most widespread. A cookie is a message sent to a web browser from
a web server. The browser stores the message and sends it back to the server each
time the web browser requests a page from the server. The web server can send
two different kinds of cookies:

1. Session cookies, which have a lifetime limited to the user interaction with
the website.

2. Persistent cookies, which remain on the machine of the user until a date of
cookie expiration.

The second type of cookies is the one used for user identification. Each time a
request comes from a web browser to a web server, the server checks if a specific
cookie exists on the client. If the cookie exists, the server obtains it and reads
a unique user identifier stored on it. If the cookie does not exist, the server
generates a new one, with a new unique user identifier, and sends it to the client.
Typically, cookies used to identify users contain a user identifier, unique and
anonymous, which identifies the browser. Therefore, this type of cookie identifies
the browsers used by users to access the Web. If a user uses multiple devices, the
same user will be counted multiple times as a unique user (once per device).
Cookies may be disabled in web browsers, or not supported by certain devices,
such as smart TVs, so the user identification technique based on cookies cannot
be universally applied. In addition, the browser may be configured to delete
cookies periodically, or they can be erased by anti-spyware applications.
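The server-side check described in this section (read the identifier if the cookie exists, otherwise generate a new one) can be sketched with the Python standard library. The cookie name visitor_id, the one-year lifetime and the function identify_visitor are assumptions made for this example.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional, Tuple

COOKIE_NAME = "visitor_id"          # hypothetical cookie name
ONE_YEAR = 365 * 24 * 60 * 60       # assumed lifetime of the persistent cookie


def identify_visitor(cookie_header: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (visitor id, Set-Cookie value or None if no cookie must be sent)."""
    jar = SimpleCookie(cookie_header or "")
    if COOKIE_NAME in jar:
        # Returning browser: reuse the identifier stored in the cookie.
        return jar[COOKIE_NAME].value, None

    # New browser: generate an anonymous identifier and a persistent cookie.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["max-age"] = ONE_YEAR
    out[COOKIE_NAME]["path"] = "/"
    return visitor_id, out[COOKIE_NAME].OutputString()


known_id, no_cookie = identify_visitor("visitor_id=abc123; theme=dark")
new_id, set_cookie = identify_visitor(None)
```

Because the identifier lives in the browser, the same person using two devices yields two identifiers, which is the limitation noted above.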

2.2.2.2 Technique Based on Fingerprint

The technique based on fingerprint is an alternative to the technique based on
cookies. This technique consists in identifying users from a number of attributes
of the web browser, or attributes that can be queried through it. These
attributes are either sent from the web browser to the web server within the
headers of each HTTP request, or become available once a page has been loaded in
the browser, so that their values can be sent to the web server using the
JavaScript tags technique explained before.
Eckersley [2010] demonstrated the effectiveness of this technique by extracting
and collecting the fingerprints of 470,161 browsers. After analysing the data,
Eckersley [2010] obtained the following conclusions:

• 83.6% of browsers have a unique fingerprint.

• In addition, 94.2% of the browsers with Adobe Flash Player29 or Java Vir-
tual Machine30 installed have a unique fingerprint. This is because, making
use of these technologies, more data are available for differentiating one
browser from another (e.g. the fonts installed on the system).

• The entropy [Shannon, 1948] associated with the distribution of fingerprints
is 18.1 bits, which means that, if a browser is taken at random, at most one
in 286,777 browsers shares the same fingerprint.

• However, the fingerprint of each web browser may change quickly. The
proportion of unstable fingerprints was 37.4% during the period of study.
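The relation between the entropy figure and the "one in 286,777" figure in the list above can be made explicit. The following is a sketch that assumes the usual definitions of self-information and Shannon entropy; the symbols F (the set of observed fingerprints), P(f) and I(f) are introduced here only for illustration.

```latex
I(f) = -\log_2 P(f), \qquad
H(F) = -\sum_{f \in F} P(f)\,\log_2 P(f)

% A fingerprint carrying b bits of information is shared, on average,
% by one in 2^b browsers:
\log_2 286{,}777 \approx 18.13
\quad\Longleftrightarrow\quad
2^{18.13} \approx 286{,}777
```

Under these definitions, 2^18.1 is only about 281,000, so the reported value of 18.1 bits appears to be a rounding of approximately 18.13 bits.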

The approaches for implementing user identification based on browser
fingerprint are described next [Eckersley, 2010].

Use the fingerprint as a global identifier. The strength of this technique is
that, while cookies can be removed, disabled or not supported by certain
web browsers or specific devices (e.g. smartphones and set-top boxes), a
fingerprint can always be obtained. The weakness of this technique is that
changes in the client (e.g. updating the browser version) imply changes in
the fingerprint and, therefore, unique user identification fails, since
distinct fingerprints may correspond to the same user.

29 http://www.adobe.com/products/flashplayer.html
30 http://www.java.com

Use the fingerprint along with the IP address assigned to the user. The
strength of this approach is that it improves accuracy with respect to using
the fingerprint as a global identifier, since adding the IP address to the
fingerprint increases its entropy. However, the weakness of this approach is
that it fails in environments where the IP address may change, as occurs when
using DHCP [Droms, 1997].

Use the fingerprint along with the IP address to regenerate cookies. The
strength of this technique is that the correspondences between the cookies and
the fingerprints of the users are maintained, so the fingerprint can be used to
identify users with a previously assigned cookie when that cookie is lost due to
expiration or is deleted by anti-spyware software.

Eckersley [2010] proposes to construct the fingerprint from the attributes de-
scribed next.

User-Agent header. This HTTP header contains information about the device
used for requesting the web resource, like the browser version, and the
operating system installed in such device.

Accept header. This HTTP header determines the MIME [Freed and Borenstein,
1996] type of the content expected in the response to an HTTP request.
E.g.:

• The value text/html indicates that a web page in HTML format is expected.
• The value image/jpeg indicates that an image in JPEG format31 is expected.
• The value text/* indicates that any kind of text content is expected.
• The value */* indicates that any kind of content is expected.

31 http://www.jpeg.org
Accept-Language header. This HTTP header determines the language ex-
pected in the response from a set of standard ones defined by Alvestrand
[1995].

Accept-Charset header. This HTTP header indicates the charset expected in
the response (e.g. UTF-8 [Yergeau, 2003]).

Accept-Encoding header. This HTTP header determines the encoding or
compression format expected in the response. Frequent values are gzip and
deflate.

Cookies enabled. Represents the browser’s capability for accepting cookies.
This attribute is set to true when the browser responds with cookie values
when asked by the web server. Otherwise the attribute is set to false.

Installed plugins. This attribute is composed of the names of the plugins
installed in the web browser, their versions, and their assigned MIME types.

Installed fonts. The fonts installed in the computer where the browser is run-
ning.

Video. The video resolution and colour depth configured in such computer.

Time zone. The time zone of the user.

Session Storage. The capability of the browser for storing session data32 through
key-value pairs.

Local Storage. The capability of the browser for storing local data through
key-value pairs.

IE Persistence. The capability for persisting data when the user’s browser is
Internet Explorer33. This capability is enabled by modifying XML34 DOM
(Document Object Model)35 elements through JavaScript code.

32 http://www.w3.org/TR/webstorage
33 http://windows.microsoft.com/internet-explorer
34 http://www.w3.org/TR/xml11
35 http://www.w3.org/DOM
The User-Agent and Accept headers are sent via HTTP from the user’s
browser to the web server. The rest of the attributes are sent to the tracking
server by applying the technique based on JavaScript tags explained in
Section 2.2.1.3.
An advantage of the browser fingerprinting technique is that a thorough
selection of fingerprint attributes may lead to cross-browser identification
(i.e. assigning users to multiple browsers). Boda et al. [2012] have shown that
a subset of browser-independent attributes is enough to uniquely identify most
users.
A disadvantage of existing browser fingerprinting techniques is the evolution
of the fingerprint over time, since the fingerprint makes use of attributes whose
values may change. Therefore, the tracking server may interpret that two different
fingerprints of the same browser correspond to different browsers. To solve this
problem, Eckersley [2010] describes an algorithm for detecting the evolution of
fingerprints. This algorithm consists in measuring the lexical similarity between
pairs of different fingerprints. If this similarity exceeds a threshold
(θ = 0.85), the two fingerprints are considered to represent the same user. This
algorithm can be significantly improved if different weights are assigned to the
fingerprint attributes, according to their importance, or if the time elapsed
between fingerprint registrations is taken into account.
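The threshold-based matching just described can be sketched as follows. This is not Eckersley's implementation: the similarity measure used here (the ratio computed by difflib.SequenceMatcher over a serialised fingerprint) is a stand-in chosen for illustration, and the attribute names and the functions similarity and match_fingerprint are hypothetical. Only the threshold value 0.85 comes from the text.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

THETA = 0.85  # similarity threshold taken from the text


def similarity(fp_a: Dict[str, str], fp_b: Dict[str, str]) -> float:
    """Lexical similarity in [0, 1] between two fingerprints (stand-in measure)."""
    keys = sorted(set(fp_a) | set(fp_b))
    text_a = "\n".join(f"{k}={fp_a.get(k, '')}" for k in keys)
    text_b = "\n".join(f"{k}={fp_b.get(k, '')}" for k in keys)
    return SequenceMatcher(None, text_a, text_b).ratio()


def match_fingerprint(
    new_fp: Dict[str, str], known: List[Dict[str, str]]
) -> Optional[int]:
    """Return the index of the most similar known fingerprint above THETA."""
    best_index, best_score = None, THETA
    for index, old_fp in enumerate(known):
        score = similarity(new_fp, old_fp)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


fp_old = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/3.6.3",
    "timezone": "-60",
    "video": "1280x800x24",
}
fp_updated = dict(fp_old, user_agent="Mozilla/5.0 (X11; Linux x86_64) Firefox/3.6.4")
fp_other = {
    "user_agent": "Opera/9.80 (Windows NT 6.1)",
    "timezone": "300",
    "video": "1920x1080x32",
}
```

With these sample values, a browser update that changes a single character of the User-Agent string still matches the stored fingerprint, while an unrelated browser does not.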

2.2.3 Conclusions
The metric unique visitors measures the audience of a site in terms of people that
have accessed site contents.
Counting unique visitors of websites is an essential activity in order to perform
Web Analytics, since many Web Analytics KPIs depend on individuals counted
only once (e.g. new visitors, return visitors, etc.).
There are many techniques to capture user activity, such as recording server
logs, using web bugs or JavaScript tags that make use of HTTP, HTML, and
JavaScript capabilities for triggering events that cause the registration of such
activity, or inspecting complex low-level network packets exchanged between
browsers and web servers.
The techniques most used for uniquely identifying users from captured web
activity are the ones that combine cookies with web bugs or JavaScript tags
[Harding et al., 2001]. This approach is being affected by several factors, such
as strict privacy restrictions implemented by web browsers [Kaushik, 2007] or the
use of new devices for navigating the Web that do not support cookies (e.g. many
set-top boxes and certain video game consoles). Furthermore, several security
programs, such as anti-spyware ones, remove cookies periodically, making it
difficult to trace recurring visits to websites [Kaushik, 2007]. Thus, these
security measures, enabled to protect the privacy of users, affect basic
aggregated metrics obtained with Web Analytics, from which valuable business
insights can be derived, such as the number of unique visitors of a website, or
the bounce rate.

Open Research Problem 3. An alternative to cookies for uniquely identifying
users consists in capturing distinctive technical attributes of the system used
by such users to navigate the Web (i.e. their browser fingerprint). While
Eckersley [2010] demonstrated the effectiveness of this technique, it is
not entirely accurate, since the browser fingerprint is built from attributes
that evolve over time. Thus, changes in the values of fingerprint attributes
lead to existing users being incorrectly counted as new users.

2.3 Technique for Detecting the Evolution of Temporary Records
Li et al. [2011] describe a method for detecting the evolution of temporary
records. This method takes into account the time elapsed between the capture of
the records being compared, introducing the concept of time decay and defining
the probabilities described next.

Definition 3. Disagreement decay is the probability that an entity changes
the value of an attribute A within the time Δt. This probability is denoted by
$d^{\neq}(A, \Delta t)$ [Li et al., 2011].

Definition 4. Agreement decay is the probability that two different entities
share the same value of A within the time Δt. This probability is denoted by
$d^{=}(A, \Delta t)$ [Li et al., 2011].
In addition, Li et al. [2011] describe two algorithms to learn agreement and
disagreement decays from existing training data, and different ways of calculating
the similarity between two records taking into account the probabilities defined
above, and the cardinality of the attributes (e.g. single-valued or multivalued).
Finally, the three algorithms for clustering temporal records described next are
provided.

2.3.1 Early Binding Algorithm [Li et al., 2011]


This algorithm processes the records in ascending time order. For each record,
the algorithm creates a new cluster, or adds it to an existing cluster.
Specifically, given a record $r$ and a set of clusters $C_1, \ldots, C_n$, the
algorithm consists in the execution of the following steps:

1. Calculate the similarity between $r$ and each $C_i$, $i \in [1, n]$.

2. Let $sim(r, C_x)$ be the similarity between $r$ and a cluster $C_x$. Choose
the cluster $C$ with the highest similarity.

(a) If $sim(r, C) > \theta$, add $r$ to $C$, where $\theta$ is a threshold
that indicates a high similarity.
(b) Otherwise, create a new cluster $C_{n+1}$ for $r$.

3. Update the signature (i.e. the cluster description) of the cluster to which
$r$ has been added.

Given a set of $n$ records to be clustered, the computational complexity of this
algorithm is $O(n^2)$ (i.e. quadratic complexity), because, in the worst case,
the algorithm compares each pair of records once.
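The steps of early binding can be sketched in a few lines. This is a simplified illustration rather than the algorithm of Li et al. [2011]: the similarity function is supplied by the caller, time decay is not modelled, and the toy similarity below (string similarity against the last record of each cluster, used as the cluster signature) is an assumption made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Record = Tuple[int, str]  # (timestamp, attribute value); hypothetical layout


def early_binding(
    records: List[Record],
    sim: Callable[[Record, List[Record]], float],
    theta: float,
) -> List[List[Record]]:
    """Cluster records in ascending time order, deciding immediately."""
    clusters: List[List[Record]] = []
    for record in records:  # records are assumed to be sorted by time
        best_cluster, best_sim = None, theta
        for cluster in clusters:
            score = sim(record, cluster)
            if score > best_sim:
                best_cluster, best_sim = cluster, score
        if best_cluster is None:
            clusters.append([record])    # step 2(b): open a new cluster
        else:
            best_cluster.append(record)  # step 2(a); last record = signature
    return clusters


def last_record_similarity(record: Record, cluster: List[Record]) -> float:
    """Toy similarity: compare with the most recent record of the cluster."""
    return SequenceMatcher(None, record[1], cluster[-1][1]).ratio()


records = [(1, "alice smith"), (2, "bob jones"), (3, "alice smyth"), (4, "bob jonas")]
clusters = early_binding(sorted(records), last_record_similarity, theta=0.8)
```

The inner loop over existing clusters is the part that runs for each newly arrived record, which is why the cost per record grows with the number of clusters already created.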

2.3.2 Late Binding Algorithm [Li et al., 2011]


The strength of this algorithm is that, unlike the previous algorithm, in which
decisions are made early, it stores information about all the comparisons
between records and clusters and takes the decisions at the end of the process,
improving accuracy.
To store the information of the comparisons, the algorithm makes use of a
data structure that stores a bipartite graph $(N_r, N_C, E)$ in which each node
$n_r$ represents a record, each node $n_C$ represents a cluster, and each edge
$(n_r, n_C) \in E$ is labelled with the probability that record $r$ belongs to
cluster $C$.
The algorithm is implemented in two phases, called Evidence Collection and
Decision Making.

1. The Evidence Collection phase creates the bipartite graph and calculates
the weight for each edge. This step behaves in a similar way to the previous
algorithm, but storing all the probabilities instead of taking early decisions.

2. The Decision Making phase deletes edges with lower weights until each
record r belongs to a unique cluster C.

The weakness of this algorithm is that it adds a further analysis phase which
increases processing time, in comparison to early binding, which runs in a single
phase. In addition, early binding has lower memory usage requirements than
late binding, as for each cluster the early binding algorithm maintains only the
last record that was added. In contrast, the late binding algorithm maintains all
records within the cluster as the cluster signature.
The computational complexity of the late binding algorithm is also $O(n^2)$.

2.3.3 Adjusted Binding Algorithm [Li et al., 2011]


The strength of this algorithm is that, unlike previous algorithms, it allows com-
paring records with clusters created after the arrival of any record, improving
accuracy over the previous algorithms.
This algorithm starts after executing any of the previous algorithms, and
consists in the execution of the following steps:

1. Initialisation. Set the initial assignment as the result of early or late
binding.

2. Estimation. Compute the similarity of each record-cluster pair as it is done
in the first step of late binding.

3. Maximisation. Choose the clustering with the maximum probability as in
step 2 of late binding.

4. Termination. Repeat steps 2-3 until the results converge or oscillate.

The weakness of this algorithm is that it adds further steps of quadratic
computational complexity ($O(n^2)$) that have to be executed after running early
binding or late binding. Thus, the number of iterations to run over the data
makes this algorithm less scalable than the other ones.

2.3.4 Conclusions
One of the objectives of this thesis is to study the feasibility of a novel
browser identification technique in a real-time scenario, where the tracking
server assigns fingerprints to particular users as they arrive at the system. Of
the three algorithms described before, the most suitable for this scenario is
early binding, for the reasons explained next.

• The adjusted binding approach is discarded, due to the scalability reasons
explained before.

• In addition, in a real-time scenario, there is always a set of zero or more
clusters created previously, and only one record to classify on each
invocation of the algorithm, so the computational complexity of early binding
and late binding is reduced to $O(n)$ (i.e. linear complexity).

Therefore, and given that late binding adds a second processing phase and has
higher memory requirements than early binding, the early binding algorithm is
the most suitable for achieving the objective of this research.

2.4 Social Media Analysis Applied to Market Research
The Internet has transformed the way in which consumers’ word-of-mouth (i.e.
non-formal exchange of information between at least two individuals, which is
perceived as trustworthy) is created and propagated [De Bruyn and Lilien, 2008;
Gupta and Harris, 2010; Kozinets et al., 2010]. Digitised customer feedback
information (i.e. electronic word-of-mouth or e-WOM) can be accessed any time
and anywhere through diverse social media such as blogs, social networks,
customer reviews, and forums, which further increases its influence among fellow
customers [Dellarocas, 2003; Schindler and Bickart, 2005]. Nowadays, a person
who is looking for information about some product is not limited to asking
friends or relatives about it; instead, he or she can expand this search by
consulting user reviews, specialised blogs or even brief opinions stated by
microbloggers. According to a survey by Nielsen [2012a], 70% of global consumers
trust buyers’ reviews, while 92% of consumers indicate they trust
recommendations from peers, family and word-of-mouth above other forms of
advertising.
This shopping scenario, although disruptive for traditional business models,
opens up opportunities for corporations to grow, innovate and improve their
relationship with customers [Hennig-Thurau et al., 2010]. Marketers are in an
advantageous position to monitor and derive a benefit from this unparalleled
volume of consumer conversations, which are increasingly taking place in social
media channels. Accordingly, companies have reorganised their traditional
methods of gathering customer opinions (such as polls and surveys) in order to
adapt them to these new media. This novel source of consumer data is not only
extremely massive and complex but also completely unfiltered, which facilitates
a real-time, deeper comprehension of consumers’ needs and thoughts [Han et al.,
2014]. This in turn improves the level of responsiveness to reputation crises,
emergencies and similar situations. However, although the proliferation of
social media has allowed organisations and companies to collect a massive amount
of information about users’ opinions, the majority of this user-generated
content is unstructured and, therefore, hard to interpret, classify and
summarise.
In order to address these new requirements, fields such as Sentiment Analysis and
Opinion Mining [Liu, 2012] have developed technology to automatically analyse
user-generated content. Research in these areas initially addressed several
aspects, such as subjectivity detection, automatic classification of opinionated
texts, and automatic opinion summarisation. At the beginning, the main objective
of these fields was limited to summarising the overall opinion expressed in
these user-generated texts, generally based on the distinction between positive
and negative comments conveyed by buyers. However, the task started to evolve
[Cambria et al., 2013; Cambria and White, 2014] and currently there is a broader
interest in carrying out a very fine-grained analysis of the available data
[Gangemi et al., 2014]. The content of the user-generated texts is so rich and
varied that it can be analysed from very different perspectives. For example, in
works such as Asur and Huberman [2010]; Joshi et al. [2010]; Sadikov et al.
[2009], the authors make predictions about the profit of movies from
user-generated content of microblogs, reviews and blogs.
However, the validity of social metrics [Sterne, 2010] depends to a large extent
on the population over which they are applied. Social media users cannot
be considered a representative sample until the vast majority of people regularly
use social media. Therefore, until then, it is necessary to identify the
different strata of users in terms of socio-demographic attributes (e.g. gender,
age or geographical origin) in order to weight their opinions according to the
proportion of each stratum in the population [Gayo-Avello, 2011]. As an example,
the comparison performed by Mislove et al. [2011] between the U.S. and Twitter
populations along three axes (place of residence, gender and race) showed that
Twitter users significantly overrepresent the densely populated regions of the
U.S., are predominantly male, and represent a highly non-random sample of the
overall race/ethnicity distribution.

2.4.1 KPIs Based on Social Media Analysis


In the world of marketing and business, predicting real-world outcomes is a chal-
lenging task that normally requires indicators from heterogeneous data sources.
For instance, traditional media content analysis has been used to forecast the
financial market [Chan, 2003; Fung et al., 2003; Tetlock et al., 2008], and sev-
eral works have demonstrated connections between online content and customer
behaviour (e.g. purchase decisions).
Since social media feeds can be effective indicators of real-world performance
[Asur and Huberman, 2010], different forecasting models have been studied for
using online chatter to predict real world outcomes related to the sales of different
kinds of goods, such as movies [Asur and Huberman, 2010; Mishne and Glance,
2006; Zhang and Skiena, 2009] or books [Gruhl et al., 2005].
Predictive models range from gross income predictions [Asur and Huberman,
2010; Joshi et al., 2010; Mishne and Glance, 2006; Sharda and Delen, 2006; Zhang
and Skiena, 2009] to revenue estimations per product distributor (i.e. stores that
offer a product or service) [Mishne and Glance, 2006] or spike predictions in sales
ranks [Gruhl et al., 2005]. Besides, social media plays an increasingly important
role in how customers discover and engage with various forms of content, including
traditional media, such as TV. In this line, a study by Nielsen [Subramanyam,
2011] found correlations between online buzz and TV ratings.
Many social media have started to be exploited to obtain the indicators that
enable such prediction models (e.g. from Twitter [Asur and Huberman, 2010],
blog feeds [Gruhl et al., 2005; Mishne and Glance, 2006], review texts [Joshi et al.,
2010], online news [Zhang and Skiena, 2009]). Indicators are based on volume,
sentiment analysis, or combinations between them and economic data or product
metadata.
Volume-based indicators can be simple or composite. Among the simple
predictors we find the raw count of posts referring to a brand [Gruhl et al.,
2005; Mishne and Glance, 2006; Zhang and Skiena, 2009], the number of mentions
of a brand (i.e. count of entity references, taking into account that one post
can mention the same entity multiple times) [Zhang and Skiena, 2009], or the
number of unique authors that refer to the brand. Among the composite predictors
we find the post rate [Asur and Huberman, 2010] (which denotes the rate at which
publications about particular topics are created, i.e. the number of posts about
a topic divided by time) and the post-per-source (which measures the average
number of posts published about a topic in particular feed sources, e.g. a set of
forums). These volume-based indicators have been demonstrated to be effective.
For example, spikes in references to books in blogs are likely to be followed by
spikes in their sales [Gruhl et al., 2005].
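The volume-based indicators described above reduce to simple counts and ratios. The following sketch assumes a hypothetical post structure with author, source and mentions keys, and measures time in hours; none of these choices come from the cited works.

```python
from typing import Dict, List, Union

Post = Dict[str, Union[str, int]]  # hypothetical keys: author, source, mentions


def volume_indicators(posts: List[Post], period_hours: float) -> Dict[str, float]:
    """Compute simple and composite volume-based indicators for one brand."""
    post_count = len(posts)
    sources = {post["source"] for post in posts}
    return {
        "post_count": post_count,
        "mentions": sum(int(post["mentions"]) for post in posts),
        "unique_authors": len({post["author"] for post in posts}),
        # Composite indicators.
        "post_rate": post_count / period_hours,
        "posts_per_source": post_count / len(sources) if sources else 0.0,
    }


sample_posts = [
    {"author": "ana", "source": "forum-a", "mentions": 2},
    {"author": "ana", "source": "forum-b", "mentions": 1},
    {"author": "luis", "source": "forum-a", "mentions": 1},
    {"author": "eva", "source": "forum-a", "mentions": 3},
]
indicators = volume_indicators(sample_posts, period_hours=8)
```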
Sentiment analysis-based indicators are based on the hypothesis that products
that are talked about positively will produce better results than those discussed
negatively, because positive and negative opinions influence people as they prop-
agate through a social network. Basic sentiment-based predictors include the
numbers of positive, negative and non-neutral posts (i.e. positive plus negative)
about a brand [Mishne and Glance, 2006]. Composite indicators include the
positive and negative ratios [Zhang and Skiena, 2009] (i.e. the number of positive
or negative posts divided by the total number of posts), and the mean or the
variance of sentiment values [Mishne and Glance, 2006]. Other important
composite sentiment-based indicators include the Net Promoter Score℠ (NPS,
footnote 36), the polarity index and the subjectivity index. NPS is commonly used
to gauge the loyalty of a firm’s customer relationships [Zhang and Skiena, 2009].
NPS can be approximated by dividing the difference between the numbers of
positive and negative posts by the total number of posts. The polarity index is
calculated in different manners: by dividing the number of posts with positive
sentiment by the number of posts with negative sentiment [Asur and Huberman,
2010; Mishne and Glance, 2006], or by dividing the number of posts with positive
sentiment by the number of non-neutral posts [Zhang and Skiena, 2009].
Subjectivity is measured by dividing the number of non-neutral posts by the
number of neutral or total publications [Zhang and Skiena, 2009].
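
The composite sentiment-based indicators above reduce to simple ratios over the counts of positive, negative and neutral posts. The following sketch computes them side by side; the function name, the dictionary keys and the decision to return `None` when a denominator is zero are assumptions made for illustration, not part of the cited works.

```python
def sentiment_indicators(positive, negative, neutral):
    """Composite sentiment indicators from post counts per polarity."""
    total = positive + negative + neutral
    non_neutral = positive + negative

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio is reported as None.
        return numerator / denominator if denominator else None

    return {
        "positive_ratio": ratio(positive, total),
        "negative_ratio": ratio(negative, total),
        # Approximation of NPS: (positive - negative) / total.
        "nps_approx": ratio(positive - negative, total),
        # The two variants of the polarity index mentioned in the text.
        "polarity_pos_over_neg": ratio(positive, negative),
        "polarity_pos_over_non_neutral": ratio(positive, non_neutral),
        # The two variants of the subjectivity index mentioned in the text.
        "subjectivity_over_neutral": ratio(non_neutral, neutral),
        "subjectivity_over_total": ratio(non_neutral, total),
    }
```

For instance, with 60 positive, 20 negative and 20 neutral posts, the approximated NPS is (60 - 20) / 100 = 0.4 and the two polarity variants are 60 / 20 = 3 and 60 / 80 = 0.75.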
Low-level textual feature-based indicators, combined with metadata features,
have also been shown to achieve good performance [Joshi et al., 2010].
Such textual features include term n-grams, part-of-speech n-grams and
dependency relations.
All these indicators can be combined with other numerical and categorical
predictors, such as product metadata [Joshi et al., 2010; Mishne and Glance,
2006; Sharda and Delen, 2006; Zhang and Skiena, 2009], advertising investment,
overall budget [Joshi et al., 2010; Zhang and Skiena, 2009], number of product
distributors [Mishne and Glance, 2006; Zhang and Skiena, 2009], or even the
Time Value of Money [Zhang and Skiena, 2009].
The forecasting models used range from linear or logistic regression models
[Asur and Huberman, 2010; Joshi et al., 2010; Zhang and Skiena, 2009] to k-
nearest neighbour models (k-NN) [Zhang and Skiena, 2009]. Gruhl et al. [2005]
base their models on time-series analysis and construct a moving average predic-
tor [Box and Jenkins, 1990], a weighted least squares predictor, and a Markov
predictor. Sharda and Delen [2006] convert the forecasting problem into a clas-
sification problem by discretising the continuous predicted variables to a finite
number of categories, and then they use a neural network model for performing
the classification.
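
Two of the ideas above can be sketched in a few lines of Python: a moving-average forecast and the discretisation of a continuous target into classes. The rolling mean used here is a deliberate simplification and is not necessarily the predictor built by Gruhl et al. [2005]; the thresholds and labels are hypothetical values chosen only for illustration.

```python
from bisect import bisect_right


def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or len(series) < window:
        raise ValueError("window must be positive and no longer than the series")
    return sum(series[-window:]) / window


def discretise(value, thresholds, labels):
    """Map a continuous value to a class label.

    `thresholds` must be sorted in ascending order and `labels` must have
    exactly one more element than `thresholds`.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("labels must have len(thresholds) + 1 elements")
    return labels[bisect_right(thresholds, value)]
```

A classifier (e.g. a neural network, as in Sharda and Delen [2006]) would then be trained to predict the label returned by `discretise` instead of the continuous value.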
Finally, the scale of the data is a key aspect when analysing online content. To
give an idea, the work presented by Asur and Huberman [2010] uses 2.98 million
tweets from 1.2 million users, with feeds extracted hourly over three months;
the Nielsen study about social TV uses data from 250 TV programs and 150
million social media sites; and in Gruhl et al. [2005] the authors analyse the daily
rank values of 2,340 books over a period of four months.

36 Service mark owned by Bain & Company (http://www.netpromotersystem.com)

2.4.2 Conclusions
The proliferation of new social media channels provides marketing practitioners
with a huge quantity of data about consumer preferences, likes and dislikes. This
large amount of data provides more and richer information, which is nevertheless
lost when only manual methods are available to analyse it.
In comparison with traditional quantitative techniques such as questionnaires,
the collection of opinions extracted from social media sources is less intrusive,
since it enables the gathering of spontaneous perceptions and desires of
consumers, without introducing any bias. In addition, the possibility of doing this
in real time is a clear advantage over other techniques based on retrospective
data. Overall, this allows for more efficient and complex business decision making
based on a comprehensive assessment of users’ propensity to buy and the concrete
opinions shared about a brand or product.
Open Research Problem 4. While there are approaches for obtaining KPIs
derived from the volume of posts about the opinionated entities, or the polarity
of opinion about them, there are other KPIs that cannot be obtained due to the
lack of techniques for analysing user-generated content that allow consumers to be
classified according to the multiple socio-demographic and psychographic attributes
commonly used in the field of marketing for consumer segmentation.
The next section describes the marketing and psychological backgrounds upon
which the set of socio-demographic and psychographic attributes is based.

2.5 Marketing Background


Marketing is the process of communicating the value of a product or service to
consumers for the purpose of selling that product or service to them. If marketing
has one goal, it is to understand the most adequate way to reach consumers and
offer them the product or service recommended for them. To that end, it is
important to become familiar with the various buying processes that consumers
go through depending on the product at hand. Furthermore, it is of great value
to be able to detect the different stages that consumers go through during this
process, as well as the conditioning factors that produce a shift from one stage
to another.
In the past, the construction of the media plan for a media agency was far
less complicated as there were fewer media (e.g. TV, printed newspapers).
Back then, placing an advert on television would guarantee the delivery of the
marketing message to the consumer. Nowadays, however, the task of reaching the
consumer is no longer that straightforward, due to the fragmentation of both
traditional and digital media.
Marketing teams today are swimming in data: online, offline, internal,
external, customer demographics, Web Analytics, media modelling, visibility,
impressions, click-through rates, conversions, engagement metrics (see [Burby and
Brown, 2007] for some examples). The most important thing to remember is that
all that brand teams really want is to connect with their customers, or potential
customers, in a personal and meaningful way.
The goal for marketers today is first to tie all their disparate proprietary
data together. But that’s only step one. To send appropriate messages to recep-
tive consumers, brands need to be able to identify and segment customers and
prospective customers using predictive attributes: What are they likely to buy?
How are they thinking? And what is the best way to reach them?
To optimise media spending, marketers also need to look for solutions that
effectively manage their campaigns and divide consumers into psychographic and
demographic clusters, a way for marketers and their agencies to overlay
proprietary data and look for the right targets based on who they are, what they
have done, what they like and what they’re likely to buy. Consumers thus
benefit from pertinent and meaningful communications from the brands,
which take into account their context, preferences and particular needs, avoiding
the over-saturation of mass marketing. There is nothing worse for a customer
than receiving “junk” advertising for something that they do not need or want,
or that they already have.
This section presents the theoretical marketing background related to the
work presented in this thesis. We introduce the Consumer Decision Journey
[Court et al., 2009] and Marketing Mix [Borden, 1964] models, as well as a
summary of psychological research on human emotions, which are the conceptual
frameworks upon which the analytic tools we propose are based. Additionally, we
describe the different kinds of media that marketers must deal with nowadays and
the different kinds of tools used for solving the problems arising in each
media type.

2.5.1 The Consumer Decision Journey


The Purchase Funnel, proposed in the early twentieth century by Lewis [1903],
is a marketing model that illustrates the purchase process in several stages, from
the moment when a customer is aware of the existence of the product (awareness)
to the moment when he or she buys the product (purchase). The model has evolved
over the years and at present there are many different purchase funnel
models, some of them with many different intermediate stages. However, the
basic conceptual framework and stages remain the same in all of them [De Bruyn
and Lilien, 2008; Franzen and Goessens, 1999].
Modern versions of the purchase funnel model take into account the influence
of the Internet and social media on the decision-making path of the customer, and
also include a post-purchase stage. The version of the purchase funnel proposed
by Forrester [Noble et al., 2010] is a good example of the introduction of new
technologies and social media into the classic model by Elmo Lewis [Lewis, 1903].
This work highlights the great influence of user-generated content on the final
purchase decision of the customers. In the model proposed by McKinsey [Court
et al., 2009], the Consumer Decision Journey, the traditional funnel shape of the
decision journey is transformed into a purchasing loop, and the notion of trigger
(the cause that leads potential customers to start investigating the brand and
therefore to enter the purchase funnel) is introduced.
Knowing the exact stage of the decision journey where the customer is
located is essential in order to design specific promotional campaigns, interact with
customers at the appropriate touch-points and improve customer relationship
management (CRM) systems [Edelman, 2010]. To discover this, the analysis of
the different social media channels is crucial, since the online conversations
between potential customers play a very important role in the purchase decision
pathway [Divol et al., 2012]. Findings of Ng and Hill [2009] and Gupta and
Harris [2010] revealed that consumers do actively search the Web for opinions
without commercial bias prior to making a purchase decision. Pookulangara and
Koesler [2011] state that, in addition to transforming the evaluation and purchase
stages, online social networks enable consumers to become advocates of their
preferred brands. Related work by other researchers found that online consumer
conversations influence purchase decisions in a variety of ways, which include
reinforcing product involvement [Wang et al., 2012]. De Bruyn and Lilien [2008]
studied which factors affect consumers in the various phases of their online decision
making processes, and found that while tie strength (i.e. closeness of relationship
between two individuals) facilitates awareness, it has no apparent power over
triggering interest or decision to buy. In summary, it is safe to say that social
media have drastically changed the shopping experience, which calls for further
research in this area.
While the shopping experience of some goods involves very little deliberation
and an emotional response (e.g. greeting cards), other products require deeper
forethought either because their cost is significantly higher or because the
consequences of making a good or bad decision are much more profound (e.g. life
insurance, mortgages) [Vaughn, 1986]. Similarly, the duration and intensity of
the different purchase phases might be affected by the features of the product
being purchased or evaluated (e.g. novelty, price) as well as by buyers’ char-
acteristics (e.g. their previous experience with the brand) [van Bruggen et al.,
2010].
In this work we adopt the following, widely agreed, purchase stages:
awareness, evaluation, purchase, and post-purchase experience. This straightforward
model can be easily applied to a wide variety of products and purchase contexts.
Our aim is to use a consumer decision-making model whose basic
stages can be reasonably traced in a big data scenario consisting of online
consumer texts, rather than a sophisticated conceptual model that
incorporates customer experience complexity to its fullest. Figure 2.6 illustrates the
model adopted as conceptual framework in this work.
The first stage, awareness, refers to the very first contact of the customer with
the product or brand, with or without the desire to purchase. Customers
usually convey their interest through references or expressions about the advertising
campaigns.
In the evaluation phase, the customer already knows the product or brand
and evaluates it, frequently with respect to other similar products or brands. In
this step, buyers actively investigate the brand in comparison with its
competitors (asking for opinions, formulating questions, consulting product reviews, etc.)
and/or express their preference towards a specific brand or product.
In the purchase stage customers either explicitly convey their decision to buy
the product or make comments referring to the transaction involved when buying
the item.
Finally, the post-purchase experience phase refers to the moment when
customers, having tried the product, criticise it, recommend it or simply talk about
their personal experience with it.

Awareness -> Evaluation -> Purchase -> Post-purchase Experience

Figure 2.6: Consumer Decision Journey stages adopted in this thesis
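
As a small illustration of how the four stages adopted here can be handled programmatically, the sketch below models them as an enumeration and computes the share of labelled posts per stage. It assumes that each post has already been assigned a stage by some classifier; the names `Stage` and `stage_distribution` are hypothetical and do not correspond to any component described in this thesis.

```python
from collections import Counter
from enum import Enum


class Stage(Enum):
    """Consumer Decision Journey stages adopted in this work."""
    AWARENESS = "awareness"
    EVALUATION = "evaluation"
    PURCHASE = "purchase"
    POST_PURCHASE = "post-purchase experience"


def stage_distribution(labels):
    """Return the fraction of labelled posts that falls in each stage."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Stages without any post are reported with a share of 0.0.
    return {stage: (counts[stage] / total if total else 0.0) for stage in Stage}
```

Comparing such distributions before and after a campaign is one way to check, for example, whether awareness has risen.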

2.5.2 The Marketing Mix


The concept of “Marketing Mix” was coined by Borden [1964], who identified
twelve marketing elements to manage business operations in a more profitable
way. McCarthy and Brogowicz [1981] reduced these twelve elements to just four:
Product, Price, Promotion, and Place (the “4P’s”). These four elements usually
imply different subcategories that can vary depending on the interests of the
marketing company. For example, the element Product could be subdivided into
Quality, Design and Warranty; within Place one could distinguish Point of Sale
and Customer Service, and Promotion also has different subcategories such as
Sponsorship, Loyalty Marketing, and Advertisement (which can also be divided
into different subtypes of advertisement depending on the media used). The 4P’s
Marketing Mix framework is used by marketers all over the world, who take it
as a basis to develop their operational marketing plans.
Table 2.3 identifies the subcategories into which we have divided each element
of the Marketing Mix framework. In this thesis, we have developed classifiers
for the following subcategories: “quality”, “design”, “point of sale”, “customer
service”, “price”, “promotion”, “sponsorship” and “advertisement”.

Product     Place               Price    Promotion
Quality     Point of Sale       Price    Promotion
Design      Customer Service             Sponsorship
Warranty                                 Loyalty Marketing
                                         Advertisement

Table 2.3: Subcategories of the Marketing Mix elements
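
The hierarchy of Table 2.3 can be written down as a simple mapping, which makes it easy to look up the Marketing Mix element of a subcategory and to see which subcategories have no classifier in this thesis. The variable and function names below are hypothetical and serve only as an illustration of the data structure.

```python
# Marketing Mix elements and their subcategories, as listed in Table 2.3.
MARKETING_MIX = {
    "Product": ["Quality", "Design", "Warranty"],
    "Place": ["Point of Sale", "Customer Service"],
    "Price": ["Price"],
    "Promotion": ["Promotion", "Sponsorship", "Loyalty Marketing", "Advertisement"],
}

# Subcategories for which classifiers have been developed, as stated in the text.
CLASSIFIED = {
    "Quality", "Design", "Point of Sale", "Customer Service",
    "Price", "Promotion", "Sponsorship", "Advertisement",
}


def element_of(subcategory):
    """Return the Marketing Mix element that a subcategory belongs to."""
    for element, subcategories in MARKETING_MIX.items():
        if subcategory in subcategories:
            return element
    raise KeyError(subcategory)


def unclassified_subcategories():
    """List the subcategories of Table 2.3 that have no classifier."""
    return [
        sub
        for subcategories in MARKETING_MIX.values()
        for sub in subcategories
        if sub not in CLASSIFIED
    ]
```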

2.5.3 Research on Human Emotions


Sentiment studies have been present in different areas and for different purposes.
Many researchers have pursued different approaches to analyse human emotions,
feelings, opinions, preferences and evaluations, and, unfortunately, there is no
agreement on the nature and number of basic human emotions. From the psychol-
ogy field, we can distinguish two main traditions [Gendron and Feldman Barrett,
2009]:

1. the basic emotion tradition, founded on the study of the basic and instinc-
tive emotions, mainly with an evolutionary approach, and

2. the appraisal tradition, focused on the individual evaluation of world
objects.

Within the first approach, we find the works of Plutchik [1989] and Ekman
[2005], among others. Plutchik proposed a taxonomy of eight multidimensional
emotions grouped into four categories, namely, joy-sadness, trust-disgust,
fear-anger, and surprise-anticipation; whereas Ekman differentiated six primary
universal (innate and cross-cultural) emotions, which can be recognised from facial
expressions: happiness, sadness, anger, disgust, surprise, and fear.
One of the main representatives of the second tradition is Arnold [1960], who
created a classification of eleven primary emotions (anger, aversion, courage, de-
jection, desire, despair, fear, hate, hope, love, sadness). Following also the ap-
praisal tradition, but applying the prototype approach [Rosch, 1978], we find the
work of Shaver et al. [1987], who distinguished six primary emotions (love, joy,
anger, sadness, fear, and perhaps, surprise) with (related) groups of descriptors
drawn from a lexicon of words with emotional connotation (for instance, nervous-
ness and anxiety as descriptors of fear).
A comprehensive definition of emotion that comprises all these approaches in
this field is given by Kleinginna and Kleinginna [1981]:

    Emotion is a complex set of interactions among subjective and
    objective factors, mediated by neural-hormonal systems, which can

    1. give rise to affective experiences such as feelings of arousal,
       pleasure/displeasure;
    2. generate cognitive processes such as emotionally relevant
       perceptual effects, appraisals, labelling processes;
    3. activate widespread physiological adjustments to the arousing
       conditions; and
    4. lead to behaviour that is often, but not always, expressive, goal
       directed, and adaptive.

Since emotions are affected by the context in which they are produced [Phillips
and Baumgartner, 2002], the taxonomies proposed in the psychological domain
were adapted for consumption-related studies, the field in which we are inter-
ested. In this sense, Richins [1997] elaborated the Consumption Emotions Set
(CES) taxonomy, which distinguished between emotions and mood, and grouped
emotions into sixteen clusters (e.g. fear: scared, afraid and panicky). In the same
line, Westbrook and Oliver [1991] showed that affective experiences (which can be
understood here as emotions) coexisted and were related to consumer satisfaction
and dissatisfaction, which is the traditional approach used to measure consumer
experiences.
In line with Artificial Intelligence studies, Ortony et al. [1990] proceeded
on the assumption that progress in psychological research on emotion could be
attained through an analysis of the cognitions that underlie emotions. To this
end, their account of emotions is given in terms of classes of emotion types, and
not in terms of specific words. An important guiding principle in developing the
theory was that it should be specific enough to permit empirical testing and to
serve as the basis for a computationally tractable model of emotions to be used in
Artificial Intelligence. This perspective is also very relevant for our work.
In this thesis we have established the categories of sentiments shown in Ta-
ble 2.4 as our conceptual framework. This conceptual framework is based on
Ekman [2005]; Richins [1997]; Shaver et al. [1987], and consists of the follow-
ing four polarized categories: SD (satisfaction-dissatisfaction), TF (trust-fear),
LH (love-hate) and HS (happiness-sadness), where the first one, SD, subsumes
the other three (i.e. a text classified as TF, LH or HS is also categorized as
SD). This decision is based on previous works (e.g. Oliver [1989]; Westbrook
and Oliver [1991]) that confirm that the satisfaction-dissatisfaction scale conceals
much more fine-grained sentiments. Finally, Table 2.5 shows the relationship
between our conceptual framework and the Wordnet-Affect taxonomy [Valitutti
et al., 2004], already introduced in Section 2.1, meaning that a given category
of our conceptual framework subsumes the corresponding set of categories in the
Wordnet-Affect taxonomy.

            Polarity
Category    +               −
SD          satisfaction    dissatisfaction
TF          trust           fear
HS          happiness       sadness
LH          love            hate

Table 2.4: Categories for the sentiment classification, organised according to their
polarity

Category          Wordnet-Affect
Satisfaction      Liking, Gratitude, Positive expectation, Calmness, Affection,
                  Contentment.
Dissatisfaction   Dislike, Annoyance.
Happiness         Self pride, Joy.
Sadness           Shame, Anxiety, Sadness.
Love              Love.
Hate              Hate, Indignation, Bad temper, Fury, Huffiness, Dander.
Trust             Positive hope, Fearlessness.
Fear              Negative Fear.

Table 2.5: Relations between the conceptual framework of emotions used in this
thesis and the Wordnet-Affect taxonomy
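
The relations in Tables 2.4 and 2.5, together with the rule that SD subsumes the other three scales, can be expressed as two lookup tables. The sketch below is only an illustration of that mapping: the names `SCALE_OF`, `WORDNET_AFFECT` and `classify` are hypothetical, and the Wordnet-Affect labels are copied as plain lower-case strings from Table 2.5 rather than read from the actual resource.

```python
# Table 2.4: each category belongs to a polarised scale and has a sign.
SCALE_OF = {
    "satisfaction": ("SD", "+"), "dissatisfaction": ("SD", "-"),
    "trust": ("TF", "+"), "fear": ("TF", "-"),
    "happiness": ("HS", "+"), "sadness": ("HS", "-"),
    "love": ("LH", "+"), "hate": ("LH", "-"),
}

# Table 2.5: Wordnet-Affect categories subsumed by each framework category.
WORDNET_AFFECT = {
    "satisfaction": ["liking", "gratitude", "positive expectation",
                     "calmness", "affection", "contentment"],
    "dissatisfaction": ["dislike", "annoyance"],
    "happiness": ["self pride", "joy"],
    "sadness": ["shame", "anxiety", "sadness"],
    "love": ["love"],
    "hate": ["hate", "indignation", "bad temper", "fury", "huffiness", "dander"],
    "trust": ["positive hope", "fearlessness"],
    "fear": ["negative fear"],
}

# Reverse index from a Wordnet-Affect label to the framework category.
CATEGORY_OF = {
    label: category
    for category, labels in WORDNET_AFFECT.items()
    for label in labels
}


def classify(wordnet_affect_label):
    """Return (category, polarity, scales) for a Wordnet-Affect label."""
    category = CATEGORY_OF[wordnet_affect_label.strip().lower()]
    scale, polarity = SCALE_OF[category]
    # SD subsumes TF, LH and HS, so every classified text is also SD.
    return category, polarity, {scale, "SD"}
```

For example, `classify("Joy")` yields the category happiness with positive polarity on both the HS and SD scales.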

2.5.4 Owned, Paid and Earned Media


Marketers distinguish three types of media: owned, paid, and earned [Corcoran,
2009].

• Owned media refers to those media controlled by brands, such as their
websites, mobile apps, blogs and any communication channel that brands may
have on social media platforms like Twitter, Facebook or Instagram (footnote 37),
to mention just a few. The role of these media is to build longer-term
relationships with existing customers.

• Paid media refers to the media that brands pay to leverage a channel.
It includes traditional offline mass media channels (e.g. TV, radio, print
and out of home advertising, sponsorships), as well as online channels like
display ads and paid search.

• Earned media refers to opinions about the brands exchanged between con-
sumers, and brands’ contents sharing through word-of-mouth mechanisms.
The content published in social media is mostly of this kind.

Brands must listen carefully to what happens in all these channels, as if they
were customers. Companies struggle to integrate and analyse the huge volume of
interactions coming from paid, owned and earned media, with the aim of achieving
a holistic 360° approach to brand communication that will lead to more efficient
and effective marketing campaigns.

37 http://instagram.com

2.5.5 Marketing Technology


In the online marketing field, Big Data Analytics is a big challenge that companies
and agencies are facing with applications that address different brand-customer
communication dimensions individually. Such applications are described next.

Programmatic advertising. These systems aim to automate the process of
paid-media planning (i.e. buying of advertisement spaces), performing
Big Data analysis for finding ad placement plans that should lead to
optimum performance KPIs (e.g. maximising the click-through rate of display
advertising). Demand-Side Platforms (DSPs) like MediaMath (footnote 38), and
Data Management Platforms (DMPs) like Oracle Bluekai (footnote 39) belong to
this category. The scope of these applications is limited to sites with web
advertising capabilities.

Site analytics and digital customer experience management. These systems
are devoted to analysing and optimising brand-customer communication
processes on owned digital media (i.e. sites owned by the brand). Within
this group we find Web Analytics applications (e.g.
Adobe Marketing Cloud, IBM EMM, webtrends, and Google Analytics)
and solutions for digital customer experience management and customer
behaviour analysis (e.g. IBM Tealeaf, footnote 40).

The scope of these applications is generally limited to brands’ sites and
microsites. Recently, services like Google Analytics have extended
measurement capabilities to mobile apps.

Social media analytics and social CRM. Within these systems we find
applications for measuring brand reputation on earned media (i.e. media not
controlled by the brand, like social networks, Web 2.0, etc.) and
applications for social CRM (i.e. community management in social networks).
Regarding social media monitoring applications, given the massive amount
of posts published every day through different social media, having a system
able to evaluate the global sentiment towards an entity
(e.g. brand or product) is becoming a must for marketing experts. This is
one of the main reasons for the increased attention that sentiment analysis
has received in the last few years. There are already several
commercial tools able to provide a polarity figure measuring the attitude
towards a brand or any other queried topic, such as Radian6 (footnote 41),
Sysomos (footnote 42) and Brandwatch (footnote 43). Market analysts and social
media researchers in general use these and similar tools to classify opinions
about brands in terms of polarity (positive or negative). The state of the art
regarding techniques for sentiment analysis is described in Section 2.6.3.
Social CRM applications implement features for monitoring social media
opinions and conversations, and for communicating with consumers using
the same social networks where the opinions have been captured. Example
applications of this kind are HootSuite (footnote 44) and TweetDeck (footnote 45).

38 http://www.mediamath.com
39 http://www.bluekai.com
40 http://www-01.ibm.com/software/info/tealeaf

2.5.6 Conclusions
There are vast amounts of data related to advertising and communication
activities that are underexploited, much of it currently in formats that cannot
be treated, processed or used. Companies are sitting on “gold mines” without
even realising it, and the power of data utilisation is beyond measure.
The first step to influence social media conversations is to understand them
to the fullest. In other words, managers and marketers need to know and
understand the content of these conversations and, further, be able to classify them
into categories that are relevant for their day-to-day tasks, such as Consumer
Decision Journey stages and Marketing Mix elements.
In the first case (purchase funnel stages), to monitor in real time and react
accordingly to the experiences and needs that customers are sharing,
advertisers must know in which purchase stages consumers are gained and lost in
order to refine touch points, impact consumers and achieve the desired result (e.g.
a transaction). Other applications include the analysis of the shopping
behaviour of users in comparison with rival brands, confirming whether
a particular marketing strategy has had the desired effect on purchase
attitudes (e.g. whether there has been a rise in awareness after the launch of an
advertising campaign), and exploring whether the distribution of users across
Consumer Decision Journey stages is seasonally affected.
In the second case (Marketing Mix elements), uncovering the exact content
of the dialogues that customers are having, e.g. which product attributes worry
them the most, lets marketers and advertisers keep better track of consumers’
mind-set.
The combination of these two categories (purchase funnel stages and
Marketing Mix elements) gives answers to highly significant questions that have
an influence on the position of the brand in the market, such as: which are the
features by which a brand is known, which are the elements that are driving
awareness of the brand (e.g. price), which characteristics of the product make it
desirable and which characteristics are not relevant.

41 http://www.salesforcemarketingcloud.com
42 http://www.sysomos.com
43 http://www.brandwatch.com
44 http://hootsuite.com
45 http://tweetdeck.twitter.com

Open Research Problem 5. While there are tools for analysing brand health
in earned media through the analysis of the polarity of the opinions produced by
consumers when talking about the brand, there are no approaches that specifically
address the classification of electronic word-of-mouth according to the Consumer
Decision Journey, which would be useful for market analysis purposes.

Open Research Problem 6. Additionally, there are no tools for identifying
the Marketing Mix elements consumers are referring to when publishing opinions
about brands in social media.

2.6 Analysis of Social Media Content
This section describes existing activities and techniques for the analysis of the
textual contents published in social media that are related to the contributions
of this thesis.
Specifically, we describe the lemmatisation and part-of-speech tagging tasks
[Jurafsky and Martin, 2009] and introduce content normalisation approaches
[Alegria et al., 2013; Sproat et al., 2001], which are fundamental preliminary steps in
all the techniques provided by this thesis.
After that, we describe the related work regarding sentiment analysis [Liu,
2012] and discuss existing research results on automatic identification of wishful
sentences [Goldberg et al., 2009], which are the areas where we have found the most
similarities with our work, both in terms of objectives and of technologies used.
Finally, we describe the existing techniques for detecting the gender and place
of residence of social media users, upon which our techniques for recognising
socio-demographic attributes are based.

2.6.1 Lemmatisation and Part-Of-Speech Tagging


Many content-analysis techniques rely on particular Natural Language Processing
tools to lemmatise (i.e. to group together different inflected forms of a word to
process them as one single element) and to add morphological information (i.e.
part-of-speech, to distinguish between homographs such as “walk-verb” or
“walk-noun”, verb tense, and person). Thus, a text such as

This Volkswagen I got my eye on is so sexy

gets the representation shown in Table 2.6, where the first column shows the
words in the text, the second column shows lemmas corresponding to each word,
and the third column the part-of-speech tag, where DT means determiner, NN
means common noun singular, NNP means proper noun singular, PRP means
personal pronoun, VBD means verb in past tense, IN means preposition, VBZ
means verb in present tense in third person singular, RB refers to adverb, and JJ
to adjective.

Word          Lemma         Part-Of-Speech
This          this          DT
Volkswagen    volkswagen    NNP
I             i             PRP
got           get           VBD
my            my            PRP
eye           eye           NN
on            on            IN
is            be            VBZ
so            so            RB
sexy          sexy          JJ

Table 2.6: Lemmatisation and part-of-speech tagging of an example text

Example tools for part-of-speech tagging and lemmatisation are Freeling [Padró
and Stanilovsky, 2012] and TreeTagger [Schmid, 1994]. Such tools usually make
use of standardised vocabularies of tags (e.g. Santorini [1991] defines a tag-set for
English and Leech and Wilson [1996] define a tag-set normally used for the Span-
ish language). Generally, such tools provide more features beyond lemmatisation
and part-of-speech tagging. As an example, Freeling is an open-source multilin-
gual language processing library providing a wide range of analysers for several
languages, including named entity detection and classification, dependency pars-
ing and nominal co-reference resolution, among others.
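
To show how the output of such tools is typically consumed, the sketch below stores the rows of Table 2.6 as (word, lemma, tag) triples and keeps only the lemmas of content words. The triples are copied by hand from the table; no tagger is invoked, and the `Token` type and `content_lemmas` function are hypothetical names, not part of Freeling or TreeTagger.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One analysed token: surface form, lemma and part-of-speech tag."""
    word: str
    lemma: str
    pos: str


# Rows of Table 2.6, copied by hand (not produced by running a tagger).
TAGGED = [
    Token("This", "this", "DT"), Token("Volkswagen", "volkswagen", "NNP"),
    Token("I", "i", "PRP"), Token("got", "get", "VBD"),
    Token("my", "my", "PRP"), Token("eye", "eye", "NN"),
    Token("on", "on", "IN"), Token("is", "be", "VBZ"),
    Token("so", "so", "RB"), Token("sexy", "sexy", "JJ"),
]


def content_lemmas(tokens, tag_prefixes=("NN", "VB", "JJ")):
    """Keep the lemmas of nouns, verbs and adjectives, in text order."""
    return [t.lemma for t in tokens if t.pos.startswith(tag_prefixes)]
```

Filtering by tag prefix groups related tags together, so that NN and NNP are both treated as nouns and VBD and VBZ as verbs.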

2.6.2 Normalisation of Microposts


The activity of normalising user-generated content is a crucial step before analysing
social media posts, particularly on Twitter. User-generated content published in
social media (especially in microblogs) is characterised by informality, brevity,
frequent grammatical errors and misspellings, and by the use of abbreviations,
acronyms, and emoticons. These features add difficulties to text mining
processes, which frequently make use of tools designed for dealing with texts
that conform to the canons of standard grammar and spelling [Hovi et al., 2013].
The micropost normalisation activity enhances the accuracy of NLP tools
when applied to short fragments of texts published in social media, e.g. the
syntactic normalisation of tweets improves the accuracy of existing part-of-speech
taggers [Codina and Atserias, 2012].
There are several techniques that can be combined for micropost normalisa-
tion, which are described next.

1. Pre-processing the micropost for detecting, removing and transforming
specific social network’s metalanguage elements (e.g. hashtags, user names,
URLs) into standard language constructions; e.g. Kaufmann and Jugal
[2010] propose several rules for dealing with hashtags and user names.

2. Performing orthographic correction of content by relying on lexical resources
like SMS lexicons for identifying abbreviations. Lists of correct forms are
also used for performing spell correction, e.g. Gamallo et al. [2013] rely on a
list of correct forms in Spanish generated by an automatic conjugator from
the lemmas found in the Real Academia Española Dictionary (DRAE, footnote 46).

As an example result of the micropost normalisation task, the following
micropost published on Twitter

#worstfeeling buying a fresh laptop..then ur screen blowz out :((

may be normalised to the following text (footnote 47):

worst feeling is buying a fresh laptop.. then your screen blowz out.
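
The two normalisation techniques described above can be combined in a short Python sketch. The regular expressions, the tiny SMS lexicon and the vocabulary used to split hashtags are hypothetical stand-ins for real lexical resources, and the rules are a simplification rather than those proposed by Kaufmann and Jugal [2010]; in particular, the sketch does not insert missing words or fix punctuation, so its output differs slightly from the normalised text shown above.

```python
import re

# Hypothetical lexical resources; real systems rely on much larger ones.
SMS_LEXICON = {"ur": "your", "u": "you", "gr8": "great"}
VOCABULARY = {"worst", "feeling"}

URL_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"[:;=]-?[()DPp]+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def segment(tag, vocabulary):
    """Greedy longest-match split of a hashtag body into known words.

    The tag is returned unchanged when it cannot be fully split.
    """
    words, i = [], 0
    while i < len(tag):
        for j in range(len(tag), i, -1):
            if tag[i:j] in vocabulary:
                words.append(tag[i:j])
                i = j
                break
        else:
            return tag
    return " ".join(words)


def normalise(post):
    """Apply metalanguage pre-processing and abbreviation expansion."""
    text = URL_RE.sub("", post)          # drop URLs
    text = EMOTICON_RE.sub("", text)     # drop simple emoticons
    text = MENTION_RE.sub(r"\1", text)   # "@user" -> "user"
    text = HASHTAG_RE.sub(
        lambda m: segment(m.group(1).lower(), VOCABULARY), text)
    text = WORD_RE.sub(
        lambda m: SMS_LEXICON.get(m.group(0).lower(), m.group(0)), text)
    return re.sub(r"\s+", " ", text).strip()
```

Applied to the example micropost, the sketch splits the hashtag into "worst feeling", expands "ur" to "your" and removes the emoticon, while leaving the misspelling "blowz" untouched because it is not in the lexicon.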

2.6.3 Sentiment Analysis


According to Pang and Lee [2008], the analysis of emotions, opinions and
appraisals regarding commercial companies gained momentum from 2001 onwards,
following slightly different perspectives and, consequently, using terminological variations:
sentiment analysis, opinion mining, brand monitoring, buzz monitoring, online
anthropology, market influence analytics, conversation mining, online consumer
intelligence, or user-generated content analysis are some of the terms used. These
terminological divergences reflect differences in the connotation that each research
group wants to project in their work, as well as the different uses given in the
different epistemological communities.
46 http://www.rae.es/recursos/diccionarios/drae
47 Example extracted from the paper by Kaufmann and Jugal [2010].

In this thesis, we have adopted a term satisfying the psychological, the lin-
guistic and the computational projections: sentiment analysis, where sentiment is
conceptualised as emotion in Clore et al. [1987] (a detectable human reaction, i.e.
traceable, identifiable and with a particular valence). Undetermined cognitive
states, with no specific sign either positive or negative, like surprise or boredom,
and bodily states, such as sleepiness are excluded from the study. We also leave
out the analysis of mood, because we agree with previous work by Thayer [1989]
and Ekman [1994] in the sense that mood is a relatively persistent and often sub-
tle emotional state, which is different from emotion, as mood is less intense and
variable, less likely to be related to a particular event, and thus less likely to be
readily identifiable. Although we will mainly use the term sentiment, sometimes
emotion will be employed, both terms matching the definition just stated.
Pang and Lee [2008] and Liu [2010] have produced comprehensive surveys
describing the different approaches followed in sentiment analysis research. They
have reviewed and discussed a wide collection of related works. In general, deter-
mining which sentiment is conveyed in a text is seen as a classification problem,
which can be addressed with machine-learning techniques (supervised or unsuper-
vised) [Mullen and Collier, 2004], rule-based systems [Chetviorkin et al., 2011;
Ding and Liu, 2007], or combinations of them [Prabowo and Thelwall, 2009;
Rentoumi et al., 2010]. Machine-learning classifiers have been fed with
different features extracted from the text, like the simple presence of words (or
n-grams in general) in the message, part-of-speech annotations or TF-IDF (Term
Frequency – Inverse Document Frequency) measures. Rule-based systems have
been applied both on plain texts and on part-of-speech annotated texts.
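
As an illustration of the TF-IDF measure mentioned above, the following minimal sketch computes one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency). The function name and the toy documents are assumptions introduced for illustration; the surveyed systems may use other weighting variants.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per tokenised document."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of three tokenised messages (hypothetical data).
docs = [["great", "phone"], ["bad", "phone"], ["great", "great", "battery"]]
vectors = tf_idf(docs)
print(vectors[2])
```

A term that occurs in every document receives a weight of zero, which is why very frequent words contribute little to the feature vectors fed to a classifier.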
Many of these systems rely on sentiment lexicons, where each lexical unit
is associated with a sentiment category and, sometimes, also with a score specifying
the degree of association. These lexical units can be extracted automatically
(e.g. from other dictionaries) or, less commonly, manually. The works by
Hatzivassiloglou and McKeown [1997], and Turney [2002] are examples of the
first approach. An instance of the second one is Taboada et al. [2011], whose
sentiment dictionaries were created manually to produce a system for measuring
the semantic orientation of texts. Some publicly available lexicons for English
are SentiWordnet [Esuli and Sebastiani, 2006], the MPQA (Multi-Perspective
Question Answering) Subjectivity Lexicon [Wiebe et al., 2005], and the Harvard
General Inquirer [Stone et al., 1966]. A multilingual perspective is being ad-
dressed by the Eurosentiment project [Buitelaar et al., 2013], whose main goal is
to provide a shared language resource pool for fostering sentiment analysis.
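
The way a sentiment lexicon with categories and scores can drive a polarity decision is sketched below. The four-entry lexicon, its scores and the function name are hypothetical and do not correspond to any of the lexicons cited above; the sketch also ignores negation and intensification, which the cited systems handle in different ways.

```python
# Hypothetical toy lexicon: lexical unit -> (sentiment category, score in [0, 1]).
LEXICON = {
    "good": ("positive", 0.6),
    "excellent": ("positive", 1.0),
    "bad": ("negative", 0.6),
    "awful": ("negative", 1.0),
}


def polarity(tokens):
    """Sum signed lexicon scores and map the total to a polarity label."""
    score = 0.0
    for token in tokens:
        category, strength = LEXICON.get(token.lower(), (None, 0.0))
        if category == "positive":
            score += strength
        elif category == "negative":
            score -= strength
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("The screen is excellent but the battery is bad".split()))
```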
However, studies on languages other than English are still scarce. For
Spanish, we can mention Brooke et al. [2009], who adapted the lexicon-based
sentiment analysis system described in Taboada et al. [2011] by automatically
translating the core lexicons and adapting other resources; Sidorov et al. [2013],
who presented an analysis of various parameter settings for the most popular
machine-learning classifiers; and Vilares et al. [2013], who used the syntactic
structure of the text to deal with some linguistic constructions (e.g. negation).
All in all, most of the research in sentiment analysis focuses on polarity
classification. Some examples of projects that go beyond polarity can be found in
Strapparava and Mihalcea [2007], which summarises the evaluation of sentiment
analysis systems carried out for the SemEval 2007 task on “Affective Text”. The
data consisted of news headlines extracted from news websites and/or
newspapers, which were annotated according to their valence (i.e. polarity) and/or six
emotions (anger, disgust, fear, joy, sadness, and surprise) by different evaluators.
Three systems participated in the annotation of the six emotions: SWAT [Katz
et al., 2007], UA [Kozareva et al., 2007] and UPAR7 [Chaumartin, 2007]; only
the last one followed a linguistic approach. None of them outperformed the
others for all emotions. The organisers concluded that the gap between the re-
sults obtained by the systems and the upper bound represented by the annotator
agreement suggested that there was room for future improvements.

2.6.4 Identification of Wishes


The first attempt to automatically classify sentences containing wishes was
performed by Goldberg et al. [2009]. The authors reported that, after a manual
annotation of a corpus of wishful texts, a number of linguistic patterns related
to the expression of wishes were identified. These patterns were used to
automatically extract the sentences that contained wishes. The precision
reported by Goldberg et al. [2009] was 80%, but this figure was obtained by
combining these linguistic patterns with the most frequent words, and for
user-generated texts related to the area of politics. When applying the same
method to product reviews, precision falls to 56%.
More recent works in this area are those carried out by Wu and He [2011]
and Ramanand et al. [2010]. In these studies the authors investigate methods to
automatically identify different types of wishes (specifically the wish to suggest
and the wish to purchase) and find linguistic patterns to extract them.
Ramanand et al. [2010] also used linguistic patterns to discover two specific
types of wishes, as mentioned before: sentences that make suggestions about
existing products, and sentences that indicate purchasing interest. Note that
the wish types of Ramanand et al. [2010] are similar to the evaluation and
purchase stages of the Consumer Decision Journey that we address in this thesis.
Ramanand et al. [2010] reported precision and recall of 62% and 48.5%
respectively for suggestions, and of 86.7% and 57.8% for purchase.
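
To make the pattern-based approach and its evaluation more concrete, the following minimal sketch classifies sentences with two regular expressions and computes precision and recall against manual annotations. The patterns are hypothetical and much simpler than those of Goldberg et al. [2009] or Ramanand et al. [2010]; the function names are likewise assumptions.

```python
import re

# Hypothetical patterns for illustration only; not those of the cited works.
PATTERNS = {
    "purchase": re.compile(r"\b(want|plan|going) to buy\b", re.IGNORECASE),
    "suggestion": re.compile(
        r"\b(i wish (it|they) had|should (have|add|include))\b", re.IGNORECASE
    ),
}


def classify_wish(sentence):
    """Return the first wish type whose pattern matches, or None."""
    for wish_type, pattern in PATTERNS.items():
        if pattern.search(sentence):
            return wish_type
    return None


def precision_recall(predicted, gold, wish_type):
    """Precision and recall of one wish type against gold annotations."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == wish_type and g == wish_type for p, g in pairs)
    fp = sum(p == wish_type and g != wish_type for p, g in pairs)
    fn = sum(p != wish_type and g == wish_type for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(classify_wish("I want to buy the new model"))
```

Precision measures how many of the sentences extracted for a wish type are correct, while recall measures how many of the annotated sentences of that type are found, which is why both figures are reported for each wish type.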

2.6.5 Detection of Place of Residence

The identification of the geographical origin of social media users has been tackled
in the past by several research works.
Mislove et al. [2011] estimate geographical location for Twitter users by ex-
ploiting the self-reported location field in the user profile. Content-analysis ap-
proaches are appropriate when the user location is not self-declared in the user
profile. Cheng et al. [2010] propose to obtain user location based on content anal-
ysis. The authors use a generative probabilistic model that relates terms with
geographic focuses on a map, placing 51% of Twitter users within 100 miles of
their actual location. Backstrom et al. [2008] also described a probabilistic model.
Chang et al. [2012] follow a similar approach, which consists of estimating the
city distribution of the use of each word. In addition, Rao et al. [2010] describe a
method for obtaining user regional origin from content analysis, testing different
models based on Support Vector Machines (SVM) [Cortes and Vapnik, 1995],
and achieving an accuracy of 71% when applying a model of socio-linguistic features.
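
The idea of relating the words used by a user to a distribution over cities can be sketched as follows. This is a deliberately simplified, naive Bayes-style estimate with add-one smoothing and a uniform prior over cities; it is not the model of Cheng et al. [2010], Backstrom et al. [2008] or Chang et al. [2012], and the function names and training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labelled_posts):
    """Count word occurrences per city from (city, tokens) pairs."""
    counts = defaultdict(Counter)
    for city, tokens in labelled_posts:
        counts[city].update(tokens)
    return counts


def estimate_city(counts, tokens):
    """Return the city maximising the smoothed log-likelihood of the tokens."""
    vocabulary = {word for city_counts in counts.values() for word in city_counts}
    best_city, best_score = None, -math.inf
    for city, word_counts in counts.items():
        total = sum(word_counts.values())
        # Add-one smoothing so that unseen words do not rule out a city.
        score = sum(
            math.log((word_counts[w] + 1) / (total + len(vocabulary)))
            for w in tokens
        )
        if score > best_score:
            best_city, best_score = city, score
    return best_city


# Hypothetical training data: posts whose author location is known.
model = train([
    ("Madrid", ["retiro", "metro", "madrid"]),
    ("Houston", ["rodeo", "nasa", "houston"]),
])
print(estimate_city(model, ["metro", "retiro"]))
```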

2.6.6 Detection of Gender
With respect to gender identification, Mislove et al. [2011] use the user name
for identifying his/her gender, achieving a coverage (i.e. proportion of users
classified) of 64.2%. Burger et al. [2011] propose to use more metadata and
content features for training an automatic classifier. Using only the full name
of the users, an accuracy of 0.89 is reached. An accuracy of 0.92 is achieved by
using the descriptions of the users, their screen names and the text of the tweets
published by them.
Rao et al. [2010] authored another relevant related work regarding gender
identification. In this case the proposed method, based on SVM, tries to distin-
guish the author gender exclusively from the content and style of their writing.
This solution needs an annotated seed corpus with authors classified as male or
female, to create the model used by the SVM classifier. In this case the accuracy
of the best model is 0.72, lower than that obtained when considering the full
name of the author.
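
The notion of coverage used above (the proportion of users for whom a gender can be assigned) can be illustrated with a minimal name-based sketch. The four-entry name list and the function names are hypothetical; the cited works rely on far larger name resources and, in the case of Burger et al. [2011], on a trained classifier.

```python
# Hypothetical first-name list; real systems rely on large name registries.
NAME_GENDER = {
    "maria": "female",
    "lucia": "female",
    "carlos": "male",
    "javier": "male",
}


def gender_from_name(full_name):
    """Look up the first token of the declared name; None if unknown."""
    parts = full_name.strip().lower().split()
    return NAME_GENDER.get(parts[0]) if parts else None


def coverage(names):
    """Proportion of users for whom a gender could be assigned."""
    classified = sum(gender_from_name(name) is not None for name in names)
    return classified / len(names) if names else 0.0


users = ["Maria Lopez", "Carlos Ruiz", "xX_gamer_Xx", "Javier"]
print(coverage(users))
```

Coverage and accuracy are complementary: a name-based lookup can be accurate on the users it classifies while leaving many users, such as those with pseudonyms, unclassified.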

2.6.7 Conclusions
Lemmatisation and part-of-speech tagging tools offer text processing and lan-
guage annotation facilities to NLP application developers, lowering the cost of
building those applications.
Social media user-generated content has particular characteristics
(informality, brevity, frequent grammar errors and misspellings, excessive use of
abbreviations, acronyms and emoticons, etc.). Text mining relies on tools that
cannot handle this broad range of variations in a language. Therefore, the task of
linguistic normalisation is a necessary step before performing NLP activities like
part-of-speech tagging.

Open Research Problem 7. Regarding sentiment analysis, while polarity
detection has been addressed for many languages, including English and Spanish,
and there are techniques for detecting emotions beyond polarity classification for
English, there are no existing approaches for identifying emotions for the Spanish
language.

The work we present in this thesis offers a more in-depth analysis of
user-generated content than sentiment analysis. In our work, we identify critical
information about consumer behaviour: we provide information about how customers
are distributed along the four stages of the Consumer Decision Journey and about
the nature of their comments in terms of categories of the Marketing Mix. The
automatic identification of wishful sentences is the area where we have found the
most similarities with our work, both in terms of objectives and of the
technologies used. To the best of our knowledge, there is no previous work that
addresses these tasks. Nevertheless, the identification of wishful sentences
offers some similarities that allow for a basic comparison.
Author and content metadata are not enough for capturing socio-demographic
attributes like gender and place of residence. For example, not all social media
channels record the gender or the geographical location of their users. Some
channels, such as Twitter, allow their authors to specify their geographical
location via a free text field. However, this text field is often left empty,
or filled with ambiguous information (e.g. Paris - France vs. Paris - Texas), or
with other data that is useless for obtaining real geographical information (e.g.
“Neverland”).

Open Research Problem 8. The existing techniques for identifying the place
of residence of social media users do not combine different metadata that may
improve their accuracy. Among the metadata that can be used for this purpose
are the descriptions included in users’ profiles, the friendship networks, and the
locations found in the content shared and produced by them.

Open Research Problem 9. The existing techniques for identifying the gender
of social media users achieve good coverage and accuracy by using features
extracted from metadata about users, as well as from the content published
by them in the form of character n-grams. However, none of them takes
advantage of the linguistic information that can be extracted from the content, such
as gender concord (a.k.a. agreement). This may improve the proportion of users
with a gender identified when it is not possible to recognise it from the user's
profile metadata.

2.7 Open Research Problems
We have identified the following open research problems in the State of the Art
that are addressed in this thesis.

1. There is a lack of data models for modelling the information that can be
extracted from social media for the marketing domain.

2. There is a lack of a characterisation of social media according to linguistic
features of the textual contents published on them.

3. The technique for uniquely identifying users in the Web based on the fin-
gerprint of their navigation devices fails when such fingerprint evolves over
time.

4. There is a lack of techniques for classifying consumer opinions according to
multiple socio-demographic and psychographic attributes commonly used
in the field of marketing for consumer segmentation.

5. There are no techniques for the classification of electronic word-of-mouth
according to the Consumer Decision Journey framework.

6. There are no techniques for identifying Marketing Mix attributes in
consumer opinions.

7. There are no techniques for detecting emotions in Spanish that go beyond
polarity detection.

8. The existing techniques for identifying the place of residence of social media
users do not take advantage of combining useful metadata that may improve
their accuracy.

9. The existing techniques for identifying the gender of social media users do
not take advantage of the linguistic information that can be extracted from
the content, such as gender concord.

Chapter 3

APPROACH

In this chapter we describe the objectives pursued by this thesis together with its
main contributions. We also present the hypotheses along with the restrictions
and assumptions upon which our research relies.

3.1 Objectives
The goal of this thesis is to provide techniques for extracting consumer
segmentations from the content generated by consumers in social me-
dia, their profile metadata, and their activities when navigating social
media websites.
According to the overall objective and to the open research problems identified
in the State of the Art (see Chapter 2), we have defined the specific objectives of
this thesis, which are described next.

O1. To provide a normalised schema for structuring the information
published in social media that can be used for marketing purposes.

As stated in the Open Research Problem 1, there are no data models
for representing information captured from social media that integrate
marketing-specific classifications and KPIs obtained from the analysis of
the content generated by consumers and their social network profiles, as
well as from the activity produced by them in social media.
The data model described in this thesis will allow integrating, in a single
format, data from social media as well as the data inferred by applying the
analysis techniques presented in this thesis. In addition, the model will
unify the semantics of the information extracted from heterogeneous sites,
by linking social media instances (e.g. posts, users, topics) regardless of their
specific publication channels.

O2. To characterise the different social media types from the point
of view of the morphosyntactic characteristics of their textual
contents.
As shown by the Open Research Problem 2, there is no characterisation
of the different kinds of social media with respect to the linguistic
characteristics of the content published on these media.

O3. To provide a fingerprint-based technique for identifying the
activity of consumers in different websites that is able to detect
changes in the device fingerprint.
As shown by the Open Research Problem 3, the existing techniques for
counting unique visitors are losing effectiveness, because of privacy
restrictions and of new devices for navigating the Web. The fingerprinting
technique deals with such restrictions and devices but is quite sensitive to changes
in the attributes of the web browser, which leads to counting unique visitors
imprecisely.

O4. To provide a collection of automatic techniques for extracting
consumer segmentations according to their demographic and
psychographic traits, from the analysis of content generated by them in
social media.
As reflected by the Open Research Problem 4, there are no techniques for
obtaining many of the demographic and psychographic attributes used in
marketing from which to obtain KPIs beyond the polarity of opinion and
the volume of publications. In this work we propose to automate the
identification of a collection of socio-demographic and psychographic attributes
from the content generated by consumers, by providing a set of individual
techniques for capturing each of these attributes. We aim for an analytic
technology that is able to perform a fine-grained analysis and that provides
information about consumer behaviour. Automating the activities
oriented to capturing these attributes from social media is necessary in
order to drastically reduce the analysis time and the effort required to process
the large amount of data available.
Specifically, this objective is limited to the following sub-objectives:

O4.1. To provide techniques for classification of consumer opinions
produced in social media according to the Consumer Decision
Journey framework.
As shown by the Open Research Problem 5, there are no techniques
that address the classification of consumer opinions according to the
Consumer Decision Journey framework. Our objective in this work is
to build a classifier for English and Spanish that assigns e-WOM
(electronic word-of-mouth) short texts to one single phase of the so-called
Consumer Decision Journey (see Section 2.5.1). Such a classification of
texts into the different stages of the purchase process places customers
at the exact moment of their purchase journey.
O4.2. To provide techniques for classification of consumer opinions
produced in social media according to the Marketing Mix
framework.
As shown by the Open Research Problem 6, there are no techniques
that address the classification of consumer opinions according to the
Marketing Mix framework. Our objective in this work is to build
a classifier for English and Spanish that assigns comments published by
consumers about brands to the Marketing Mix elements (see Section 2.5.2)
expressed in them. Classifying texts extracted from different
social media channels according to the one or more Marketing Mix elements
they belong to tells us which marketing-related issues customers are
talking about.
O4.3. To provide a technique for identifying emotions expressed by
consumers in social media for Spanish.
As shown by the Open Research Problem 7, there are no techniques
that address the identification of emotions for the Spanish language
that go beyond polarity detection (i.e. automatically discovering
pleasure or displeasure in texts). Specifically, this thesis addresses the
identification of emotions according to the eight categories shown in
Table 2.4 (satisfaction, dissatisfaction, trust, fear, happiness, sadness,
love, and hate), overcoming the limitations of current sentiment
analysis approaches, which analyse only the polarity of the sentiments
expressed in user messages written in Spanish. Classification of
user-generated content according to the emotions expressed in it might
be useful not only for several Business Intelligence fields such as
marketing, sales, or customer service but also for public opinion analysis,
where research on people's behaviour is crucial.

O4.4. To provide a technique for recognising the place of residence
of social media users that improves the accuracy of existing
techniques.
As shown by the Open Research Problem 8, different approaches and
kinds of metadata can be used for improving the accuracy of existing
techniques. Our objective is to define and validate a technique that
exploits user profile descriptions, friendship networks, and geographical
entity recognition within contents for detecting the place of residence
of social media users.

O4.5. To provide a technique for recognising the gender of social
media users that improves the coverage of the techniques
based on profile metadata by exploiting the linguistic
information that can be extracted from the content written in
Spanish.
As shown by the Open Research Problem 9, the existing techniques for
gender identification do not take into account the linguistic
information that can be extracted from content analysis for improving their
coverage.

3.2 Contributions to the State of the Art
This thesis contributes to the State of the Art with a data model and a set of
techniques that address the objectives described in the previous section. The
contributions of this thesis are explained next.

C1. A normalised schema for representing the information extracted
from heterogeneous social media about brands, consumers and
opinions of consumers about brands, useful for the marketing
domain.
This schema includes concepts and attributes for modelling the content and
metadata defined explicitly in social media. In addition to these explicitly-
defined data, the schema provides concepts and attributes for representing
the data enrichments inferred when applying the user identification tech-
nique (C2) and consumer segmentation techniques (C3).
The schema has been designed as a semantic data model defined by an
ontology network reusing ontologies widely used in the Semantic Web and
Linked Data fields.

C2. A descriptive characterisation of social media types from the point
of view of the morphosyntactic characteristics of the content
published on them.
We have processed and characterised corpora of user-generated content ex-
tracted from different social media sources. Specifically, we have studied
differences of the language used in distinct types of social media content by
analysing the distribution of part-of-speech categories in such sources.

C3. A technique for the identification of unique users from the finger-
print of the devices they use when interacting with social media,
which is tolerant to changes in such fingerprint.
This thesis will contribute to the State of the Art with an algorithm, based
on the fingerprinting technique defined by Eckersley [2010], which allows
identifying unique visitors accurately, regardless of changes in browser
attributes. To do so, our algorithm is able to detect the evolution of a
fingerprint and, therefore, to effectively group distinct fingerprints that
correspond to the same user.

C4. A collection of techniques for extracting socio-demographic and
psychographic profiles from social media users applied to the
marketing domain.
The socio-demographic variables considered include gender and place of
residence, while the psychographic information includes purchase intention,
Marketing Mix elements, and emotional perceptions about brands.
Specifically, this thesis provides the following contributions to the State of
the Art.

C4.1. A technique for classifying consumer opinions produced in
social media according to the Consumer Decision Journey
stages for texts written in English and Spanish.
We have developed a classifier based on the identification of linguistic
patterns in short texts. These linguistic patterns were then used as
part of a set of rules to classify each particular text into one of the
Consumer Decision Journey stages.
C4.2. A technique for classifying consumer opinions produced in
social media according to the Marketing Mix framework for
texts written in English and Spanish.
We have developed a classifier based on machine-learning techniques,
specifically on Decision Tree (DT) learning algorithms.
C4.3. A technique for analysing consumer opinions written in Span-
ish according to the emotions expressed in such opinions that
goes beyond polarity identification by identifying the follow-
ing sentiment categories: satisfaction, dissatisfaction, trust,
fear, happiness, sadness, love, and hate.
We have developed a technique for classifying the texts of a corpus
of consumer opinions about brands according to the sentiment they
express. Unlike many existing solutions that focus on polarity
classification, which deal with English texts and extract documents from
specific channels and a few domains, in the work presented in this
thesis we are interested in an eight-sentiment classification of Spanish
texts that consist of documents with different sizes and characteristics
from diverse social media and product domains.
C4.4. A technique for identifying the place of residence of social me-
dia users that improves the accuracy of existing techniques.
The technique proposed exploits the metadata declared by social media
users in their social network profiles, the locations included in the
contents published and shared by them, and their friendship networks.
C4.5. A technique for identifying the gender of social media users
that exploits the gender concord existing for the Spanish lan-
guage.
The technique proposed exploits the metadata declared by social
media users in their social network profiles and takes advantage of
the linguistic concord existing in certain languages like Spanish for
determining the gender of the users mentioned in the content produced
by other users.

Figure 3.1 depicts the contributions to the State of the Art of this thesis. The
contributions of this thesis can be grouped into three tiers.
• The Earned Media Knowledge Base provides the data warehouse for storing
marketing-oriented structured information extracted from social media or
inferred from it. The contribution C1 provides the ontology network that
models such data warehouse.

• The Inference Layer provides the engine that can reason about the facts
extracted from social media producing new inferences. The contribution C3
provides a technique for identifying users uniquely from their web activity,
while the contribution C4 provides a collection of techniques for segmenting
consumers from the information shared and published by them in social
media.

• The Social Media Characterisation tier provides observations on social
media content attributes that may be considered for producing the algorithms
of the Inference Layer. The contribution C2 provides a characterisation
from the point of view of the morphosyntactic attributes of the content
published in social media.

Figure 3.1: Contributions to the State of the Art (Social Media
Characterisation tier: C2; Inference Layer: C3, and C4 with C4.1 to C4.5;
Earned Media Knowledge Base: C1)
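
As an illustration of the idea behind contribution C3 (grouping distinct fingerprints that correspond to the same device when some attributes change), the following minimal sketch matches a new fingerprint against the stored ones by attribute overlap. It is not the algorithm proposed in this thesis nor the one by Eckersley [2010]: the attribute names, the similarity measure and the 0.75 threshold are assumptions made only for illustration.

```python
def similarity(fp_a, fp_b):
    """Fraction of attributes with identical values in both fingerprints."""
    keys = set(fp_a) | set(fp_b)
    same = sum(fp_a.get(k) == fp_b.get(k) for k in keys)
    return same / len(keys) if keys else 1.0


def assign_device(known, fingerprint, threshold=0.75):
    """Return a device id for the fingerprint; `known` is updated in place.

    `known` is a list of stored fingerprints whose index is the device id.
    """
    best_id, best_sim = None, 0.0
    for device_id, stored in enumerate(known):
        sim = similarity(stored, fingerprint)
        if sim > best_sim:
            best_id, best_sim = device_id, sim
    if best_id is not None and best_sim >= threshold:
        known[best_id] = fingerprint  # remember the evolved fingerprint
        return best_id
    known.append(fingerprint)  # unseen device
    return len(known) - 1


# Hypothetical fingerprints with four attributes each.
known = []
fp1 = {"user_agent": "UA/1.0", "screen": "1920x1080", "timezone": "+1", "fonts": "a,b,c"}
fp2 = dict(fp1, user_agent="UA/1.1")  # same device after a browser update
fp3 = {"user_agent": "Other/5", "screen": "800x600", "timezone": "-5", "fonts": "x"}
ids = [assign_device(known, fp) for fp in (fp1, fp2, fp3)]
print(ids)
```

Storing the most recent fingerprint of each device lets gradual changes accumulate without breaking the link to the original visitor, at the cost of a higher risk of merging two similar devices when the threshold is too low.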

3.3 Assumptions
The models and techniques proposed in this thesis rely on the following assump-
tions.

Assumption 1. It is possible to structure the content published on social media
(and the associated metadata) according to a single normalised data schema.

Assumption 2. The information structured according to the data model
proposed, including data explicitly defined in social media and data enrichments
obtained by our analysis techniques, can be used for higher-level Business Intelligence
processes, like the ones presented in Section 2.4.

Assumption 3. Consumers’ demographic and psychographic profiles (feelings,
interests, etc.) can be obtained from social media, even if those profiles are not
declared explicitly by the user, by analysing the content published and shared by
such consumers, as well as other metadata, such as profile information and
friendship networks.

3.4 Hypotheses
The overall research hypothesis of this work is that it is possible to extract
information useful for marketing activities from the content and activity generated
by consumers in social media, despite the heterogeneity of textual contents and
metadata, and disparate access devices. The specific hypotheses are described
next.

Hypothesis 1. The contents published in social media statistically present
different morphosyntactic features depending on the specific kind of media where they
have been published.

Hypothesis 2. The online activity generated by consumers in social media can
be grouped and identified effectively through the digital fingerprint of their devices
by using the technique described in this thesis, even when such fingerprint varies
over time.
The technique must outperform the existing approach authored by Eckersley
[2010], whose accuracy, false positive rate, and coverage (i.e. percentage of
browsers classified) are 0.991, 0.0086 and 65% respectively.

Hypothesis 3. Consumers utilise different expressions along the four stages of
the Consumer Decision Journey. Therefore, if we are able to identify the
particular linguistic expressions used in each of the stages of the purchase process,
we will be able to classify texts along the different phases and, consequently, we
will be able to approximate distributions of consumers in different moments of the
Consumer Decision Journey process.
Although there are no existing techniques for identifying Consumer Decision
Journey stages from user-generated content, the results provided by this
thesis must be in line with existing approaches for the identification of wishes,
whose precision varies from 56% to 86.7%, depending on the wish type.

Hypothesis 4. The vocabulary used by consumers when publishing comments
about brands in social media can be used to identify the Marketing Mix attributes
they are referring to. Therefore, if we are able to identify the particular lexical
elements that refer to such attributes, we will be able to classify text according to
the Marketing Mix framework and, consequently, we will be able to approximate
distributions of consumers that refer to the distinct Marketing Mix elements.

Hypothesis 5. Consumers utilise different expressions to express their
sentiment about brands beyond their pleasure and displeasure about brand products,
specifically for expressing the satisfaction, dissatisfaction, trust, fear, love, hate,
happiness, and sadness sentiments. Thus, if we are able to identify the
particular linguistic expressions used for each of these sentiments, we will be able to
classify texts along the different emotions and, consequently, we will be able to
approximate distributions of consumers according to fine-grained sentiments about
brands.

Hypothesis 6. The homophily existing between the users of a social network
[McPherson et al., 2001] can be used for improving the accuracy of existing
techniques for identifying their place of residence (from 51% to 71%). Specifically,
the friendship network of a given user can be used for estimating her/his place of
residence, as most of her/his friends may share her/his location.

Hypothesis 7. The linguistic concord existing in the posts written in Spanish that
explicitly mention social media users can be exploited for enhancing the coverage
of the gender identification techniques that make use of the name declared by users
in their profiles.
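
The intuitions behind Hypotheses 6 and 7 can be illustrated with a minimal sketch. It is not the technique proposed in this thesis: the majority-vote rule, the `min_share` and `window` parameters, the function names and the six-entry lexicon of gender-marked Spanish forms are assumptions made only for illustration, and a real implementation of the concord analysis would rely on part-of-speech tagging rather than on a fixed word list.

```python
import re
from collections import Counter


def place_from_friends(friend_locations, min_share=0.5):
    """Majority vote over the known locations of a user's friends."""
    known = [loc for loc in friend_locations if loc]
    if not known:
        return None
    place, votes = Counter(known).most_common(1)[0]
    # Only answer when more than `min_share` of the friends agree.
    return place if votes / len(known) > min_share else None


# Hypothetical mini-lexicon of gender-marked Spanish adjectives.
GENDERED = {
    "guapa": "female", "lista": "female", "cansada": "female",
    "guapo": "male", "listo": "male", "cansado": "male",
}


def gender_from_concord(post, user, window=3):
    """Infer the gender of a mentioned user from a nearby gender-marked form."""
    tokens = re.findall(r"@?\w+", post.lower())
    target = "@" + user.lower()
    for i, token in enumerate(tokens):
        if token == target:
            # Inspect the few tokens that follow the mention.
            for candidate in tokens[i + 1 : i + 1 + window]:
                if candidate in GENDERED:
                    return GENDERED[candidate]
    return None


print(place_from_friends(["Madrid", "Madrid", None, "Paris"]))
print(gender_from_concord("@ana estás muy guapa hoy", "ana"))
```

Both functions return None when the evidence is insufficient, which reflects that these signals are meant to complement, not replace, the metadata declared by users in their profiles.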

3.5 Restrictions
Restriction 1. The technique for identifying unique users from their online
activity is restricted to the identification of the unique devices that they use for
browsing the Web. The consolidation of multiple devices into a unique user
identity (e.g. relating her smartphone and tablet fingerprints) is out of the scope of the
technique proposed. Cross-device and cross-site identification can be performed by
combining logged sessions with fingerprint records or third-party cookies and does
not pose a research problem.

Restriction 2. The techniques for the analysis of user-generated content
presented in this thesis are restricted to textual content. Therefore the analysis of
audio-visual content is out of the scope of this thesis.

Restriction 3. This thesis provides techniques for inferring psychographic char-


acteristics of consumers related with their position in the Consumer Decision
Journey and the Marketing Mix attributes they consider when talking about prod-
ucts and brands. The mining of other psychographic characteristics, such as hob-
bies or interests, is out of the scope of this thesis.

Restriction 4. This thesis provides techniques for inferring socio-demographic


characteristics of consumers related with their gender and place of residence. The
mining of other socio-demographic characteristic used in the marketing domain,
such as age or purchasing power, is out of the scope of this thesis.

Restriction 5. The technique for detecting Consumer Decision Journey stages


in user-generated content is limited to the English and Spanish languages. Other
languages are out of the scope of this thesis.

Restriction 6. The technique for detecting Marketing Mix elements in user-


generated-content is limited to the English and Spanish languages. Other lan-
guages are out of the scope of this thesis.

Restriction 7. The technique for detecting emotions in user-generated content


is limited to the Spanish language. Other languages are out of the scope of this
thesis.

Restriction 8. The text-mining techniques provided by this thesis have been
evaluated with corpora extracted from social media consisting of posts mentioning
brands of the following commercial sectors: automotive, banking, beverages,
sports, telecommunications, food, retail, and utilities. The accuracy of the tech-
niques may vary significantly when applied to posts mentioning brands belonging
to other sectors.

Restriction 9. The deployment of the techniques proposed by this thesis in an
industrial environment, as well as the validation of their scalability, are out of
the scope of this thesis. Nevertheless, we have performed some preliminary tests
regarding scalability whose results are shown in Section 9.4.7.

Restriction 10. We have chosen Freeling for executing the lemmatisation,
part-of-speech tagging and dependency parsing tasks of contribution C4, because it is
customisable, extensible and robust, and offers high reliability for Spanish. The
evaluation results could vary slightly if other computational linguistics software
were used.

Restriction 11. As an exception, for contribution C2 we have used TreeTagger
for Spanish due to project technology requirements at the time the
study was performed. Therefore, the part-of-speech distributions provided may
also vary with the use of a different part-of-speech tagger.

Finally, to conclude this chapter, Figure 3.2 shows the relationships among the
objectives, contributions, assumptions, hypotheses and restrictions of this thesis.

Objectives (achieve)   Contributions (apply to)   Hypotheses, Assumptions and Restrictions
O1                     C1                         A1, A2
O2                     C2                         H1, R11
O3                     C3                         H2, R1
                                                  R9
O4                     C4                         A3, R2, R3, R4, R8, R10
O4.1                   C4.1                       H3, R5
O4.2                   C4.2                       H4, R6
O4.3                   C4.3                       H5, R7
O4.4                   C4.4                       H6
O4.5                   C4.5                       H7

Figure 3.2: Relationships between the objectives, contributions, assumptions,
hypotheses and restrictions

Chapter 4

RESEARCH METHODOLOGY

This chapter describes the research methodology followed for obtaining the con-
tributions of this work. Before describing the methodology, Section 4.1 provides
definitions for the terms methodology, method, techniques, process, activity and
task, which appear frequently in this thesis. After providing these definitions,
Section 4.2 describes the research methodology, and Section 4.3 details the meth-
ods followed for obtaining the ontology and techniques provided by this thesis.

4.1 Terminology
Throughout the literature, the terms methodology, method, technique, process,
activity, etc. are used interchangeably. Therefore, for the sake of clarity, in this
thesis we have adopted several IEEE48 definitions, which are described in detail
in different sources [IEEE, 1990, 1995a,b, 1997; Sommerville, 2007] and shown in
Figure 4.1.

Definition 5. A methodology is a comprehensive, integrated series of techniques
or methods that create a general system theory of how a class of
thought-intensive work ought to be performed [IEEE, 1995a].

Definition 6. Methods are parts of methodologies. A method is a set of “orderly
processes or procedures used in the engineering of a product or in performing
a service” [Sommerville, 2007]. Methods are composed of processes.
48 http://www.ieee.org

[Diagram: a Methodology is composed of Methods and Techniques; a Method is
composed of Processes; a Process is composed of Activities; an Activity is
composed of Tasks; Techniques specify methods, processes, activities and tasks.]

Figure 4.1: Relations between methodology, methods, techniques, processes,
activities and tasks (adapted from Gómez-Pérez et al. [2004])

Definition 7. Techniques are parts of methodologies. Techniques are “the
application of accumulated technical or management skills and methods in the
creation of a product or in performing a service” [IEEE, 1990]. Techniques detail
methods and their components (processes, activities and tasks).

Definition 8. A process is a set of activities whose goal is the development or
the evolution of software [Sommerville, 2007].

Definition 9. An activity is a defined body of work to be performed, including
its required input and output information [IEEE, 1997]. Activities can be divided
into zero or more tasks.

Definition 10. A task is the smallest unit of work subject to management ac-
countability. A task is a well-defined work assignment for one or more project
members. Related tasks are usually grouped to form activities [IEEE, 1995b].

4.2 Research Methodology

This research was motivated by the need of the marketing field to measure
and understand the effects of earned media during advertising campaigns.
Therefore, we initially defined a broad research problem: to develop techniques
for acquiring marketing-oriented knowledge from the unstructured content
published in social media. To refine this research problem and define the
objectives and hypotheses of the thesis, we followed an iterative methodology
consisting of two stages (see Figure 4.2).
In the first stage we used an exploratory approach [Kothari, 2004]. The objective
of exploratory research is to define the research problem and the hypotheses
to be tested. Accordingly, in the first stage we reviewed the State of the Art on
approaches for knowledge acquisition from user-generated content and user activity,
as well as the marketing background of our thesis. This review of the State of
the Art, which was presented in Chapter 2, helped us to specify in more detail
the definition of the research problem and the hypotheses of our work.
Therefore, we defined our research problem more precisely in terms of providing
techniques for extracting consumer segmentations from the content generated by
consumers in social media, their profile metadata, and their activities when
navigating social media websites. The objectives, as well as the hypotheses on which

[Diagram: an iterative cycle with two stages. Explorative Research: Review of the
State of the Art; Define Problem, Hypotheses, and Objectives. Experimental
Research: Propose Solution; Design Experiments & Evaluate.]

Figure 4.2: Iterative research methodology using exploratory and experimental
approaches

we rely to propose a solution for this problem were presented in Chapter 3.
Once we had defined the research problem, we proceeded to the second stage,
where we followed an experimental approach [Dodig-Crnkovic, 2002; Kothari,
2004]. Our objective in the experimental research was to propose a solution based
on the hypotheses to fulfil the research objectives and design experiments to val-
idate the hypotheses. In this stage we investigated existing techniques in other
research fields such as Natural Language Processing and Information Retrieval
which might help to reach the objectives. Then we adapted these techniques to
the requirements defined by the particularities of our research. After this, we
designed the experiments to validate the proposed solutions, using well-known
evaluation metrics. Next, we carried out an abstraction exercise over the pro-
cedure that we had followed when developing the techniques, and designing and
executing the experiments. The objective was to elicit commonalities in the form
of data models, activities, and tasks. Thus, with these components we produced
the contributions of this thesis.
We performed six iterations, one per technique provided (contributions C3,
C4.1, C4.2, C4.3, C4.4, and C4.5). The ontology (contribution C1) was continuously
refined during the execution of each iteration. The morphosyntactic
characterisation of social media contents (contribution C2) was produced at a
preliminary stage of the first iteration.

4.3 Method Followed for Obtaining the Artefacts Provided by this Thesis
Extracting knowledge from social media information requires: (i) building a data
warehouse from which insights can be obtained by querying it, and (ii) applying
different analysis techniques for obtaining knowledge from the data warehouse, such as
graph and time series analyses. The method that we have followed for obtaining
the artefacts provided by this thesis is inspired by an existing framework defined
by Hu and Cercone [2004] for Web mining and Business Intelligence reporting.
This framework follows the data warehousing approach proposed by Kimball et al.
[1998] and Kimball and Ross [2002], and provides guidelines for performing research

[Diagram: Data Capture (clickstream, sale, customer, product, etc.) ->
Data Webhouse Construction (clickstream, sale, customer, product, etc.) ->
Mining, OLAP (rules, prediction models, cubes, reports, etc.) ->
Pattern Evaluations & Deployment]

Figure 4.3: Web mining framework (adapted from Hu and Cercone [2004])

on data extracted from the Web, including guidelines for the data warehouse
construction, among other activities.
Figure 4.3 illustrates the data flow proposed by the framework, which involves
the following phases:

1. Data Capture. This phase consists in capturing and cleansing data coming
from heterogeneous web data sources.

2. Data Webhouse Construction. This phase consists in creating a database
for storing the data gathered in the previous phase. To do this, the
database requirements are analysed, the database schema is defined, and
the data captured are transformed according to this schema.

3. Mining, OLAP. This phase consists in the execution of data mining tasks
in order to derive useful knowledge from the data stored in the database
created in the previous phase.

4. Pattern Evaluations and Deployment. This phase consists in the evaluation
of the models obtained in the previous phase, as well as in the
deployment of the validated models.

We follow two methods for dealing with the data mining phases defined by
Hu and Cercone [2004]:

• For addressing the Data Webhouse Construction phase we follow the method-
ology proposed by Suárez-Figueroa et al. [2012] for constructing ontology
networks. Section 4.3.1 describes the method followed for constructing the
social media data model that will be described in Chapter 5.

• For addressing the other phases (Data Capture; Mining, OLAP; and Pattern
Evaluations and Deployment) we follow the CRISP-DM reference process model
[Shearer, 2000], which is a framework that describes a set of generic activities and
tasks that any data mining process may implement. Section 4.3.2 describes
the method followed by the data-mining techniques proposed by this thesis
that will be described in chapters 7 and 8.

4.3.1 Method Followed for Ontology Engineering


We have followed the NeOn methodology for building ontology networks
[Suárez-Figueroa et al., 2012] for engineering the social media data model provided by
this thesis. This methodology: (i) proposes the processes and activities
involved in the construction of ontology networks, (ii) defines two ontology
development life cycle models, (iii) identifies and describes a set of scenarios for
building ontology networks, and (iv) provides a set of methodological guidelines
for performing some of the processes and activities proposed.
Specifically, we have implemented the Reusing Ontological Resources scenario,
as we have reused existing ontologies in the construction of our data model. The
sequence of activities in this scenario is the following:
1. Ontology Search. This activity consists in finding candidate ontologies
or ontology modules to be reused. We have searched for the candidate
ontological resources that satisfy the requirements using search services for
the Web.

2. Ontology Assessment. This activity consists in checking an ontology
against the user’s requirements, such as usability, usefulness, abstraction, and
quality. After executing this activity we obtained a list of candidate ontologies
for being reused, which has been described in Section 2.1.

3. Ontology Comparison. This activity consists in finding differences between
two or more ontologies or between two or more ontology modules.

4. Ontology Selection. This activity consists in choosing the most suitable
ontologies or ontology modules among those available in an ontology repository
or library, for a concrete domain of interest and associated tasks. The
result of this activity has been a selection of ontologies for being reused,
which are listed in Table 5.1 of Chapter 5.

5. Ontology Integration. This activity consists in integrating one ontology
into another ontology. The ontologies selected have been imported into the
ontology network depicted in Figure 5.1 of Chapter 5.

Apart from the activities defined by this scenario, we have implemented the
following activities (definitions literally taken from Suárez-Figueroa et al. [2012]):

Ontology Annotation. It refers to the activity of enriching the ontology with
additional information, e.g. metadata or comments. We have commented
each new ontology element.

Ontology Conceptualisation. It refers to the activity of organising and
structuring the information (data, knowledge, etc.), obtained during the acquisition
process, into meaningful models at the knowledge level and according
to the ontology requirements specification document. This activity is
independent of the way in which the ontology implementation will be carried
out. Prior to the Ontology Reuse Process we identified the concepts,
attributes and relations that the ontology network must cover.

Ontology Documentation. It refers to the collection of documents and
explanatory comments generated during the entire ontology building process.
This thesis includes the documentation of the developed ontology network.

Ontology Elicitation. It is a knowledge acquisition activity in which
conceptual structures (i.e. T-Box) and their instances (i.e. A-Box) are acquired
from domain experts. In our case, we obtained conceptual structures and
types from the marketing frameworks described in the State of the Art (see
sections 2.5.1, 2.5.2 and 2.5.3).

Ontology Enrichment. It refers to the activity of extending an ontology with
new conceptual structures (e.g. concepts, roles and axioms). After performing
the Ontology Integration activity, there were missing ontology elements
for modelling some concepts, attributes and properties identified during
the conceptualisation phase. Therefore we enriched the ontology network
with our own ontology elements, which have been grouped under a specific
namespace.

Ontology Environment Study. It refers to the activity of analysing the
environment in which the ontology is going to be developed. Such environment
has been described in Section 2.5.

Ontology Implementation. It refers to the activity of generating computable
models according to the syntax of a formal representation language (e.g.
RDFS49 and OWL50). Our ontology has been implemented using OWL.

Ontology Modularisation. It refers to the activity of identifying one or more
modules in an ontology with the purpose of supporting reuse or maintenance.
We have structured our ontology into seven modules that are described
in Chapter 5.

Ontology Summarisation. It refers to the activity of providing an abstract
or summary of the ontology content. We have summarised the ontology
network using a UML [OMG, 2011] representation, which has been included
in Chapter 5.

Regarding the ontology development life cycle, we have selected an
iterative-incremental ontology network life cycle model, as requirements were changing
during the ontology development.

49 http://www.w3.org/TR/rdf-schema
50 http://www.w3.org/TR/owl2-primer

4.3.2 Method Followed for the Data Mining Techniques
This research is framed within the reference model CRISP-DM (Cross Industry
Standard Process for Data Mining), applied to the extraction of information from
social media. Therefore, we have instantiated the activities and tasks within this
process for performing our research.
Figure 4.4 shows the activities involved in the CRISP-DM process. Next, each
of the activities is described, as well as the tasks that have been instantiated by
the contributions of this thesis.

[Diagram: the CRISP-DM activities Business Understanding, Data Understanding,
Data Preparation, Modeling, Evaluation and Deployment, arranged around the Data.]

Figure 4.4: The CRISP-DM reference model (adapted from Shearer [2000])

4.3.2.1 Business Understanding

This initial activity focuses on understanding the project objectives and require-
ments from a business perspective, then converting this knowledge into a data
mining problem definition and a preliminary plan designed to achieve the objec-
tives.
The result of this activity has been included in Chapter 2 where the State of
the Art, and specifically the marketing frameworks have been described, as well
as in Chapter 3, where the objectives, contributions, assumptions, hypotheses
and restrictions of this research have been detailed.

4.3.2.2 Data Understanding

This activity starts with initial data collection and proceeds with tasks that enable
data analysts to become familiar with the data, identify data quality problems,
discover first insights into the data, and/or detect interesting subsets to form
hypotheses regarding hidden information.
The tasks involved in this activity are the following:

Collect Initial Data. The goal of this task is to acquire the data used for learn-
ing purposes.

Describe Data. This task consists in examining the “gross” or “surface” properties
of the acquired data, describing the format of the data and the quantity
of the data, as well as any other relevant features that have been discovered.

Explore Data. This task addresses data mining questions using querying, vi-
sualisation, and reporting techniques, obtaining distributions of key at-
tributes, relations between pairs of attributes and other simple statistical
analyses.

Verify Data Quality. This task examines the quality of the data, addressing
questions such as data completeness.

The data mining techniques proposed by this thesis, which are described in
chapters 7 and 8, implement this activity.
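
The Describe Data, Explore Data and Verify Data Quality tasks can be illustrated with a minimal Python sketch over a small collection of posts. The record layout (dictionaries with `text` and `sector` keys) and the function names are assumptions made for this example only and do not correspond to the data model or tooling used in this thesis.

```python
from collections import Counter
from typing import Dict, List, Optional

Post = Dict[str, Optional[str]]


def describe_data(posts: List[Post]) -> Dict[str, object]:
    """Describe Data: report the quantity of records and the attributes present."""
    return {"records": len(posts), "attributes": sorted({k for p in posts for k in p})}


def explore_data(posts: List[Post], attribute: str) -> Counter:
    """Explore Data: distribution of the non-empty values of one attribute."""
    return Counter(p[attribute] for p in posts if p.get(attribute))


def verify_completeness(posts: List[Post], attribute: str) -> float:
    """Verify Data Quality: fraction of records with a value for the attribute."""
    if not posts:
        return 0.0
    return sum(1 for p in posts if p.get(attribute)) / len(posts)


# Hypothetical sample of collected posts.
sample = [
    {"text": "Great service", "sector": "banking"},
    {"text": "Too expensive", "sector": "banking"},
    {"text": "No signal again", "sector": None},
    {"text": "Love this shop", "sector": "retail"},
]
summary = describe_data(sample)
sector_distribution = explore_data(sample, "sector")
sector_completeness = verify_completeness(sample, "sector")
```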

4.3.2.3 Data Preparation

This activity covers all tasks needed to construct the final dataset —data that
will be fed into the modelling tools—, from the initial raw data.
The tasks involved in this activity are the following:

Select Data. The goal of this task is to decide on the data to be used for anal-
ysis. Criteria include relevance to the data mining goals, quality, and tech-
nical constraints such as limits on data volume or data types.

Clean Data. The goal of this task is to raise the data quality to the level required
by the selected analysis techniques. This task may involve the selection
of clean subsets of the data, the insertion of suitable defaults, or more
ambitious techniques such as the estimation of missing data by modelling.

Construct Data. This task performs data preparation operations such as the
production of transformed values for existing attributes.

This activity is also implemented by the data mining techniques proposed by
this thesis, which are described in chapters 7 and 8.
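
The three Data Preparation tasks can be chained as in the following minimal sketch. The field names (`text`, `language`, `source`), the default value `"unknown"` and the derived attribute `word_count` are hypothetical choices made for illustration; the actual preparation steps of each technique are described in the corresponding chapters.

```python
from typing import Dict, Iterable, List, Optional

Post = Dict[str, Optional[str]]


def select_data(posts: Iterable[Post], language: str) -> List[Post]:
    """Select Data: keep posts that have text and are in the target language."""
    return [p for p in posts if p.get("text") and p.get("language") == language]


def clean_data(post: Post) -> Post:
    """Clean Data: normalise whitespace and insert a default for a missing field."""
    cleaned = dict(post)
    cleaned["text"] = " ".join((post.get("text") or "").split())
    cleaned["source"] = post.get("source") or "unknown"
    return cleaned


def construct_data(post: Post) -> Dict[str, object]:
    """Construct Data: derive a new attribute from an existing one."""
    constructed: Dict[str, object] = dict(post)
    constructed["word_count"] = len((post.get("text") or "").split())
    return constructed


def prepare(posts: Iterable[Post], language: str = "es") -> List[Dict[str, object]]:
    """Run selection, cleaning and construction in sequence."""
    return [construct_data(clean_data(p)) for p in select_data(posts, language)]


# Hypothetical raw records.
raw = [
    {"text": "  Me  encanta mi coche ", "language": "es", "source": None},
    {"text": None, "language": "es", "source": "twitter"},
    {"text": "I love my car", "language": "en", "source": "twitter"},
]
dataset = prepare(raw)
```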

4.3.2.4 Modelling

This activity applies one or more techniques for obtaining a final model. When
the performance of the model obtained depends on parameters, such parameters
are calibrated to optimal values.
The tasks involved in this activity are the following:

Select Modelling Technique. The goal of this task is to select the actual mod-
elling technique to be used (e.g. decision-tree building, rule-set engineering).

Generate Test Design. The goal of this task is to generate a procedure to
test the model for quality and validity. This involves choosing evaluation
metrics like precision or recall, and separating the dataset into training and
test sets.

Build Model. The goal of this task is to create the model. This typically
involves running a modelling tool on the prepared dataset and performing
human supervision on the model, depending on the modelling technique
chosen.

This activity is also implemented by the data mining techniques proposed by
this thesis, which are described in chapters 7 and 8.
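
The Generate Test Design task mentions two ingredients, a training/test separation and evaluation metrics such as precision and recall. The following minimal sketch shows one possible way of computing them; the 80/20 split, the fixed random seed and the function names are assumptions for this example, not the experimental settings of this thesis.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(items: Sequence[T], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle reproducibly and separate the items into training and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall(gold: Sequence[str], predicted: Sequence[str],
                     label: str) -> Tuple[float, float]:
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN) for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical gold annotations and classifier output for four posts.
gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "pos", "neg"]
precision, recall = precision_recall(gold, predicted, label="pos")
train, test = train_test_split(list(range(10)))
```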

4.3.2.5 Evaluation

This activity consists in evaluating the model obtained in order to verify that it
is of high quality from a data analysis perspective and that it properly
achieves the business objectives.
This activity is also implemented by the data mining techniques proposed by
this thesis, which are described in chapters 7 and 8.

4.3.2.6 Deployment

This activity integrates the model obtained into the application that will make
use of it.
As stated by Restriction 9, the deployment of the data mining techniques pro-
posed by this thesis is part of the future work. However, we have performed some
preliminary tests regarding scalability whose results are shown in Section 9.4.7.

Chapter 5

SOCIAL MEDIA ONTOLOGY FOR CONSUMER ANALYTICS

This chapter describes the ontology for representing the information extracted
from social media as well as the knowledge about consumers that can be inferred
from such information by applying the analysis techniques presented in this thesis,
which are described in the following chapters.
The social media ontology has been defined as an ontology network, called
Social Graph Ontology (SGO)51. This ontology reuses existing semantic vocabularies,
which have already been described in Section 2.1. The reused vocabularies
are enumerated in Table 5.1.
Figure 5.1 shows the import relations between the Social Graph Ontology
and the rest of the vocabularies (non-dashed lines). In addition, the dashed lines
represent the existing import relations between the vocabularies reused. The
colours associated with each vocabulary are used to denote the namespaces to which
the classes and properties of the ontology network belong.

51 The Social Graph Ontology OWL implementation has not been made public due to the
exploitation rights defined by the Social TV Project (TSI-100600-2013-53).

Vocabulary           Prefix    Namespace
SIOC                 sioc      http://rdfs.org/sioc/ns#
FOAF                 foaf      http://xmlns.com/foaf/0.1/
schema.org           schema    http://schema.org/
Dublin Core          dcterms   http://purl.org/dc/terms/
SKOS                 skos      http://www.w3.org/2004/02/skos/core#
ISOcat               isocat    http://www.isocat.org/ns/dcr.rdf#
Marl                 marl      http://purl.org/marl/ns#
Onyx                 onyx      http://www.gsi.dit.upm.es/ontologies/onyx/ns#
WGS84                geo       http://www.w3.org/2003/01/geo/wgs84_pos#
Time Zone Ontology   tzont     http://www.w3.org/2006/timezone#
Named Graphs         rdfg      http://www.w3.org/2004/03/trix/rdfg-1/

Table 5.1: Vocabularies selected for defining the Social Graph Ontology
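
The prefixes of Table 5.1 are used throughout this chapter to abbreviate class and property names (e.g. sioc:Post). As a minimal sketch of how such a prefixed name maps to a full identifier, the following Python function expands a name using a subset of the table; the function name `expand` is hypothetical, and the `sgo` prefix is left out on purpose because its namespace is not given in the table.

```python
from typing import Dict

# Subset of the prefixes listed in Table 5.1.
PREFIXES: Dict[str, str] = {
    "sioc": "http://rdfs.org/sioc/ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "marl": "http://purl.org/marl/ns#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'sioc:Post' into a full URI."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise KeyError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local_name


post_class = expand("sioc:Post")
agent_class = expand("foaf:Agent")
```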

[Diagram: import relations among the vocabularies onyx, marl, tzont, isocat,
schema, skos, geo, rdfg, sioc, dcterms and foaf, with sgo at the centre.]

Figure 5.1: Ontology network

5.1 Ontology Modules
The ontology is divided into seven ontology modules that are shown in
Figure 5.2. The arrows represent usages of ontology elements contained in the
modules pointed to by such arrows.
The modules of the Social Graph Ontology are the following:

1. The Core Ontology Module defines the main components of the ontology
(see Section 5.3).

2. The Publication Channels Module defines the ontology elements in charge
of representing information related to the content publication media (see
Section 5.4).

3. The Contents Module defines the ontology elements used for representing in-
formation related to the contents published in social media (see Section 5.5).

4. The Users Module defines the ontology elements used for representing in-
formation related to social media users (see Section 5.6).

5. The Opinions Module defines the ontology elements used for representing
information related to opinions expressed within the contents (see Section 5.7).

6. The Topics Module defines the ontology elements used for representing in-
formation related to the topics that the contents are about (see Section 5.8).

7. Finally, the Locations Module defines the ontology elements used for
representing information related to the geographical locations associated with users
and contents (see Section 5.9).

Before explaining the SGO modules in detail, we briefly summarise next the
notation used for describing the ontology.

[Diagram: SGO ontology modules: Topics and Keywords, Publication Channels,
Opinions, Contents, Core, Users, and Locations.]

Figure 5.2: Social Graph Ontology Modules

5.2 Notation Used


We use UML [OMG, 2011] class diagrams for representing the elements contained
within the ontology modules.
Figure 5.3 shows an example class. The title of the box represents the class
name (Site), prefixed by its namespace abbreviation (sioc). Within the box we
find the data type properties of the class (e.g. rank) prefixed by their namespace
abbreviation (e.g. sgo), and followed by their XML Schema data type52
(e.g. decimal). The class may include a URL [Berners-Lee, 1994] property if its
identifier is dereferenceable. The criterion chosen for deciding whether a class can be
identified by a URL is the existence of a resource on the Web pointed to by the URL.
Optionally, a class with a non-dereferenceable URI [Berners-Lee et al., 2005] may

52 http://www.w3.org/TR/xmlschema-2

sioc:Site

URL
sgo:rank: decimal
sgo:monthlyVisitors: nonNegativeInteger
sgo:pagesPerVisit: nonNegativeInteger
sgo:visitsPerVisitor: nonNegativeInteger
sgo:minutesPerVisitor: nonNegativeInteger
sgo:backlinks: nonNegativeInteger
sgo:percentageMale: decimal
sgo:percentageFemale: decimal

Figure 5.3: Class Example

[Diagram: foaf:Agent (foaf:name, foaf:age, foaf:mbox, dcterms:language,
dcterms:description) linked to sgo:Activity (dcterms:created) by an arrow
labelled sgo:hasActivity, with cardinalities 1 and *.]

Figure 5.4: Object Property Example

include a URI property with a hint on how it can be constructed from some of
its properties in order to guarantee the uniqueness of identifiers.
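
The construction hints used in the diagrams of this chapter, such as concat(site, accountName) for user accounts and hash(post, text) for opinions, can be sketched as follows. This is only one possible reading of those hints: the separator, the percent-encoding, the choice of SHA-1, the `/opinion/` path segment and the function names are all assumptions made for this example.

```python
import hashlib
from urllib.parse import quote


def user_account_uri(site_url: str, account_name: str) -> str:
    """Build a URI by concatenating the site URL and the account name."""
    return site_url.rstrip("/") + "/" + quote(account_name, safe="")


def opinion_uri(base_uri: str, post_url: str, text: str) -> str:
    """Build a URI from a hash of the post URL and the opinion text."""
    digest = hashlib.sha1((post_url + "\n" + text).encode("utf-8")).hexdigest()
    return base_uri.rstrip("/") + "/opinion/" + digest


# Hypothetical identifiers used only for illustration.
account = user_account_uri("http://social.example.org/", "12345")
opinion_a = opinion_uri("http://data.example.org", "http://social.example.org/p/1", "great car")
opinion_b = opinion_uri("http://data.example.org", "http://social.example.org/p/1", "bad service")
```

Because both functions are deterministic, the same combination of property values always yields the same identifier, while different combinations yield different ones (up to hash collisions in the second case).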
Figure 5.4 shows an example object property, represented by a labelled
arrow with the name of the property (e.g. hasActivity) prefixed by its namespace
abbreviation (sgo). The direction of the arrow denotes the domain
and range of the property. The range is the class pointed to by
the arrow (e.g. sgo:Activity), while the domain is the other class (e.g. foaf:Agent).
Properties are annotated with their domain and range cardinalities. Some object
properties may have inverse object properties. We denote these cases with a
bidirectional arrow annotated with the name of the property and its inverse, as
shown in Figure 5.5.
Class inheritance is represented with UML notation as shown in Figure 5.6,
where the classes foaf:Organisation and foaf:Person are subclasses of the class
foaf:Agent. In an analogous way, property inheritance is represented as shown
in Figure 5.7. In the example, the properties sioc:reply_of, sioc:has_reply, sgo:copies
and sgo:shares are subproperties of the property sioc:related_to.

[Diagram: sioc:Forum (URI: concat(site, type); dcterms:type) and sioc:Site (URL;
sgo:rank: decimal; sgo:monthlyVisitors, sgo:pagesPerVisit, sgo:visitsPerVisitor,
sgo:minutesPerVisitor and sgo:backlinks: nonNegativeInteger; sgo:percentageMale
and sgo:percentageFemale: decimal) linked by a bidirectional arrow labelled
sioc:has_host / sioc:host_of, with * to * cardinality.]

Figure 5.5: Inverse Object Properties Example

foaf:Agent
foaf:name
foaf:age
foaf:mbox
dcterms:language
dcterms:description

foaf:Organisation foaf:Person
foaf:givenName
foaf:familyName
schema:jobTitle
foaf:gender

Figure 5.6: Class Inheritance Example

Finally, instances are represented as shown in Figure 5.8, with underlined
names for instances and dashed lines for the instantiation relationship. In the
example, the resources marl:Positive, marl:Neutral and marl:Negative are instances
of the class marl:Polarity.

[Diagram: class sioc:Post with the self-referencing object properties sgo:copies,
sgo:shares, sioc:has_reply and sioc:reply_of (all with * to * cardinality) drawn
as subproperties of sioc:related_to.]

sioc:Post
  sioc:content (language tagged)
  dcterms:identifier
  dcterms:created
  dcterms:title
  dcterms:dateCopyrighted
  dcterms:medium
  sioc:num_views
  sioc:num_replies
  geo:lat
  geo:long
  schema:wordCount
  schema:contentRating
  schema:articleBody
  schema:isFamilyFriendly
  sgo:numLikes: xsd:nonNegativeInteger
  sgo:numShares: xsd:nonNegativeInteger
  sgo:impact: xsd:decimal
  sgo:reach: xsd:nonNegativeInteger
  sgo:engagement: xsd:decimal
  sgo:relevance: xsd:decimal
  sgo:isPromotion: xsd:boolean

Figure 5.7: Property Inheritance Example

marl:Polarity

marl:Positive

marl:Neutral

marl:Negative

Figure 5.8: Instances Example

5.3 Core Ontology Module
Figure 5.9 shows a UML representation of the core ontology module. The classes
defined by this module are the following:

• The class sioc:UserAccount represents user accounts defined for specific
social media. The properties defined for the sioc:UserAccount class are
shown in Table 5.2.

• The class sioc:Post represents specific contents published in publication
channels by social media users. Such contents can take the form of text,
video, image, etc. The properties defined for this class are shown in Tables
5.3 and 5.4.

• The class sioc:Forum represents publication channels into which users publish
contents. The properties defined for this class are shown in Table 5.5.

• The class marl:Opinion represents opinions extracted from posts. The properties
defined for this class are shown in Table 5.6.

• The class skos:Concept represents the subjects that contents are about,
around which online communities are organised, or in which users are interested.
It also represents the specific entities (e.g. brands) about which users
express opinions. As an indeterminate number of types of subjects and entities may
be the object of opinions of social media users, we have chosen not to create specific
concept subclasses, but to annotate such concepts with standard semantic
and syntactic categories. Further details are provided in Section 5.8. The
properties defined for this class are shown in Table 5.7.

• The class sioc:Community represents online communities of users that share
interest in specific topics. The object properties defined for this class are
shown in Table 5.8.

• Finally, the class rdfg:Graph represents named graphs that correspond to
specific social graph instances. Such instances can be used for grouping
specific data analysis projects. The data properties defined for this class
are shown in Table 5.9.
[Class diagram: Social Graph Ontology. Classes and their data properties:]

rdfg:Graph: URI; rdfs:label; dcterms:description
skos:Concept: URI: concat(language, prefLabel); skos:prefLabel (language tagged)
sioc:Community
sioc:Forum: URI: concat(site, type); dcterms:type
sioc:UserAccount: URI: concat(site, accountName); foaf:nick; foaf:accountName;
  dcterms:created; dcterms:modified; sgo:verified: xsd:boolean;
  sgo:private: xsd:boolean; sgo:outreach: xsd:decimal; sgo:influence: xsd:decimal;
  sgo:numPosts, sgo:numFollowers, sgo:numFollowing and sgo:numLikes:
  xsd:nonNegativeInteger; sgo:declaredLocation: xsd:string
sioc:Post: URL; sioc:content (language tagged); dcterms:identifier; dcterms:created;
  dcterms:title; dcterms:dateCopyrighted; dcterms:medium; sioc:num_views;
  sioc:num_replies; geo:lat; geo:long; schema:wordCount; schema:contentRating;
  schema:articleBody; schema:isFamilyFriendly; sgo:numLikes, sgo:numShares and
  sgo:reach: xsd:nonNegativeInteger; sgo:impact, sgo:engagement and
  sgo:relevance: xsd:decimal; sgo:isPromotion: xsd:boolean
marl:Opinion: URI: hash(post, text); marl:opinionText; marl:polarityValue

[Object properties shown, all with * cardinalities: dcterms:isPartOf, sioc:topic,
sioc:has_subscriber / sioc:subscriber_of, sioc:follows, sioc:has_container /
sioc:container_of, dcterms:contributor, dcterms:references, sioc:has_creator /
sioc:creator_of, sgo:copies, sgo:shares, sioc:has_reply / sioc:reply_of,
sioc:related_to, marl:hasOpinion / marl:extractedFrom, marl:describesObject.]

Figure 5.9: Core ontology module of the SGO

Property Description
URI An instance of sioc:UserAccount can be uniquely identified by a
URI constructed with the URL of the website in which the user
account is registered, together with the account name of the user
in the site.
sioc:follows User account followed by the user account being described, as for
example a Facebook friend or a Twitter followee.
sioc:subscriber_of Publication channel to which the user account is subscribed.
sioc:topic Subject in which the owner of the user account is interested.
dcterms:isPartOf Online community to which a user belongs.
dcterms:contributor Other account that can contribute to the content published by the
user account being described, or that can publish in its name.
foaf:nick Nick of the user in the publication channel (e.g. the screen name
in the case of Twitter).
foaf:accountName Id of the user in the publication channel (in the case of Facebook
and Twitter numeric identifiers are used).
foaf:page Web page that describes the user profile in the publication channel
being defined.
sioc:avatar An image that represents the user in the publication channel.
sioc:account_of Person or organisation that owns the user account.
sioc:has_function Role that the user plays in the publication channel (e.g. influencer,
owner, etc.).
dcterms:created Date and time of creation of the user account.
dcterms:modified Date and time of modification of the user account.
sgo:verified Determines if the publication channel has verified the person or
organisation that has been declared as the owner of the user ac-
count.
sgo:private Determines whether the profile defined by the user account and
the content produced can be only accessed by authorised users, or
are publicly available in the Web.
sgo:outreach KPI that measures the overall outreach of the user account in
terms of outreach metrics like the one provided by Kred (http:
//kred.com).
sgo:influence KPI that measures the overall influence of the user account in
terms of influence metrics like the one provided by Klout (http:
//klout.com) or Kred.
sgo:numPosts KPI that measures the number of posts published by the user
account.
sgo:numFollowers KPI that measures the number of followers of the user account.
sgo:numFollowing KPI that measures the number of user accounts followed by the
user account being described.
sgo:numLikes Number of likes that have been received by the user account.
sgo:withHeldIn Country in which the user account has been banned due to legal
restrictions, etc.
sgo:declaredLocation Location declared by a user in her/his profile of a given social
medium.

Table 5.2: Properties of the class sioc:UserAccount

Property Description
URL Posts can be uniquely identified by the URL of the web resources
that they annotate.
sioc:has_container Channel in which the post has been published. This property
is the inverse of the property sioc:container_of, which has been
defined by Table 5.5.
sioc:has_creator User account that has published the post being described. This
property is the inverse of the property sioc:creator_of, which has
been defined by Table 5.2.
dcterms:contributor User account that has contributed to the post being described.
dcterms:references User account being mentioned in the post.
sioc:related_to Other post related to the content being described.
sioc:reply_of Publication to which the post being described is a reply. This
property is a sub property of the property sioc:related_to.
sioc:has_reply Post that is a reply to the content being described. This property
is also a sub property of sioc:related_to, and the inverse of the
property sioc:reply_of.
sgo:shares Post that is being spread by the post being described, for example
by using a retweet when disseminating through Twitter. This
property is also a sub property of sioc:related_to, and the inverse
of the property sioc:reply_of.
sgo:copies Other post whose content has been copied fully or partially in
the post being described, without explicitly declaring it in the
content’s metadata (e.g. by setting the retweet flag when the
publication channel is Twitter). This property is also a sub property
of sioc:related_to.
marl:hasOpinion Object property that relates the post with an opinion contained
in it.
sioc:topic Keyword included in the content of the post, or subject that the post
is about.
sioc:content Textual content of the post. The value of this property may be
annotated with its language according to the mechanisms provided
by RDF for tagging the language of string literals.
sioc:links_to Multimedia content (videos, photos, etc.) linked from the post.
dcterms:identifier Identifier assigned by the publication channel to the post.
dcterms:created Publication date of the post.
dcterms:dateCopyrighted Copyright date of the post.

Table 5.3: Properties of the class sioc:Post (1/2)

Property Description
dcterms:medium Main format of the content (text, video, etc.).
foaf:based_near Location from which the content has been published.
geo:lat Geographical latitude from which the content has been published.
geo:long Geographical longitude from which the content has been pub-
lished.
schema:articleBody Content of the post in HTML format.
sioc:num_views KPI that measures the number of views of the content.
sioc:num_replies KPI that measures the number of replies to the content.
sgo:numLikes KPI that measures the number of times that the content has been
liked.
sgo:numShares KPI that measures the number of times that the content has been
shared.
sgo:impact KPI that measures the degree in which the content has been
viewed and shared.
sgo:reach KPI constructed from the sum of the influence of the author
of the post and of the users that have disseminated the post.
sgo:engagement KPI that measures the engagement of the content.
sgo:relevance KPI that measures the relevance of the content. It is calculated
as an aggregation of the KPIs of the post, the author and the site.
schema:wordCount Number of words included in the content.
sgo:isPromotion Indicates if the post contains an advertising message.
schema:isFamilyFriendly Indicates if the post does not include sensitive content (e.g.
violence).
sgo:withHeldIn Country in which the post has been banned due to legal restric-
tions, etc.
schema:contentRating Rating of the post according to its publication channel (e.g. Twitter
rates tweets according to their degree of dissemination).

Table 5.4: Properties of the class sioc:Post (2/2)
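
The description of sgo:reach in Table 5.4 can be illustrated with a minimal Python sketch. The function and parameter names are hypothetical, and rounding the result to a non-negative integer is an assumption based on the xsd:nonNegativeInteger range that Figure 5.9 gives for this property; the thesis does not state how influence values are normalised before they are added.

```python
def reach(author_influence, disseminator_influences):
    """Add the influence of the author to that of every disseminating user."""
    total = author_influence + sum(disseminator_influences)
    # Assumption: the KPI is stored as a non-negative integer.
    return max(0, round(total))


# Example: an author with influence 40.2 whose post was shared by two users.
example = reach(40.2, [10.0, 5.5])
```

Under these assumptions, a post that nobody has shared has a reach equal to the rounded influence of its author.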

Property Description
URI An instance of the class sioc:Forum can be uniquely identified
by a URI constructed with the URL of the website to which the
publication channel belongs together with the type of publication
channel.
sioc:has_subscriber User account that is subscribed to the publication channel
being described. This property is the inverse of the property
sioc:subscriber_of, which has been defined by Table 5.2.
sioc:container_of Post published within the publication channel being described.
dcterms:type Type of publication channel (e.g. weblog, microblog, social net-
work, etc.).
sioc:has_host Website to which the publication channel belongs.

Table 5.5: Properties of the class sioc:Forum

Property Description
URI An instance of marl:Opinion can be uniquely identified by a hash
constructed from the concatenation of the URL of the post from which
the opinion has been extracted and the text of the opinion.
marl:extractedFrom Post from which the opinion has been extracted. This property
is the inverse of the property marl:hasOpinion, which has been
defined by Table 5.3.
marl:describesObject Entity that is the object of the opinion being described (e.g. a
brand or product).
marl:opinionText Text of the opinion.
marl:polarityValue Numeric value of the opinion polarity. The Marl ontology speci-
fication (http://www.gsi.dit.upm.es/ontologies/marl/) rec-
ommends using a real number in the interval [0, 1] for this value.
marl:hasPolarity Category of the opinion polarity (i.e. positive, negative or neutral).
onyx:hasEmotionCategory Kind of emotion expressed in the opinion.
sgo:hasPurchaseStage Purchase stage in the Consumer Decision Journey.
sgo:hasMarketingMixAttribute Marketing Mix attribute.

Table 5.6: Properties of the class marl:Opinion


Property Description
URI An instance of the class skos:Concept can be uniquely identified by a
URI that includes the language and the label of the topic or keyword.
skos:prefLabel Label of the concept. The value of this property may be annotated with
its language according to the mechanisms provided by RDF for tagging
the language of string literals.
isocat:datcat Lexical or semantic classification of the concept expressed according to
an ISOcat [Kemps-Snijders et al., 2008] category. The categories reused
by this module are the following: verb, adjective, noun, common noun,
proper noun, named entity, location, organisation, person, male, female,
metadata tag, trademark (i.e. brand or product), and domain (i.e. busi-
ness sector).

Table 5.7: Properties of the class skos:Concept
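
The identifier schemes given in the URI rows of Tables 5.2, 5.5, 5.6 and 5.7 can be sketched in Python as follows. This is only an illustration: the base namespace, the path layout, the percent-encoding of the components and the use of SHA-1 as the hash function are all assumptions, since the tables state only which values are concatenated or hashed.

```python
import hashlib
from urllib.parse import quote

BASE = "http://example.org/sgo/"  # hypothetical namespace, not taken from the thesis


def _segment(value):
    # Percent-encode a component so that it is safe inside a URI path.
    return quote(value, safe="")


def user_account_uri(site, account_name):
    # sioc:UserAccount: concat(site, accountName)
    return f"{BASE}account/{_segment(site)}/{_segment(account_name)}"


def forum_uri(site, channel_type):
    # sioc:Forum: concat(site, type)
    return f"{BASE}forum/{_segment(site)}/{_segment(channel_type)}"


def concept_uri(language, pref_label):
    # skos:Concept: concat(language, prefLabel)
    return f"{BASE}concept/{_segment(language)}/{_segment(pref_label)}"


def opinion_uri(post_url, opinion_text):
    # marl:Opinion: hash of the post URL concatenated with the opinion text.
    # Assumption: SHA-1 over the UTF-8 encoding of the concatenation.
    digest = hashlib.sha1((post_url + opinion_text).encode("utf-8")).hexdigest()
    return f"{BASE}opinion/{digest}"
```

Because every identifier is derived only from the values of the instance, processing the same account, channel, concept or opinion twice yields the same URI, which avoids duplicated instances in the graph.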


Property Description
sioc:topic Subject (or topic) around which the community has been con-
structed.
dcterms:isPartOf Broader community of which the community being described is
part.

Table 5.8: Properties of the class sioc:Community


Property Description
URI Used for uniquely identifying the graph.
rdfs:label Name assigned to the social graph instance.
dcterms:description Text that describes the social graph instance.

Table 5.9: Properties of the class rdfg:Graph

5.4 Publication Channels Module
This module describes the classes and properties related to content publication
channels, i.e. sites and sections within sites where social media contents are
published. Figure 5.10 shows a UML representation of this module, which includes
the class sioc:Site that describes websites. The properties defined for this class
are shown in Table 5.10.

[UML class diagram relating the classes sioc:Forum and sioc:Site through the
inverse properties sioc:has_host and sioc:host_of; the properties of sioc:Site
are listed in Table 5.10.]

Figure 5.10: Publication Channels module of the SGO


Property Description
URL The URLs of the websites are used for identifying the instances
of this class.
sgo:rank KPI that ranks the site according to a relevance metric like
Google’s PageRank [Page et al., 1999] or MozRank (http://moz.
com/learn/seo/mozrank).
sgo:monthlyVisitors KPI that measures the average number of unique visitors to the
site per month. The correct identification of unique visitors may
rely on techniques like the ones described in Section 2.2.2, or on
the contribution of this thesis to the State of the Art described in
Chapter 7.
sgo:visitsPerVisitor KPI that measures the average number of visits to the site per
visitor and month.
sgo:pagesPerVisit KPI that measures the average number of pages viewed by a vis-
itor per visit.
sgo:minutesPerVisitor KPI that measures the average time in minutes spent by a visitor
of the site per visit.
sgo:backlinks KPI that measures the number of links to the site from other web
pages.
sgo:percentageMale Percentage of male visitors.
sgo:percentageFemale Percentage of female visitors.
sioc:host_of Publication channels that belong to the site being described. This
property is the inverse of the property sioc:has_host, which has
been defined by Table 5.5.

Table 5.10: Properties of the class sioc:Site

5.5 Contents Module
This module describes the classes and properties related to the contents
published in social media. Figure 5.11 shows a UML representation of this module.
The classes defined by this module are the following:

• The class foaf:Document represents any kind of multimedia document
published online. The properties defined for this class are shown in Table 5.11.

• The class schema:Review is used by social media analysts, community
managers, or CRM operators to create annotations of posts. The properties
defined for this class are shown in Table 5.12.

The classes tzont:PoliticalRegion and tzont:Country are defined within the
module that deals with geographical locations (see Section 5.9).

[UML class diagram relating the class sioc:Post to the classes foaf:Document
(sioc:links_to), tzont:PoliticalRegion (foaf:based_near), tzont:Country
(sgo:withheldIn) and schema:Review (schema:review).]

Figure 5.11: Contents module of the SGO

Property Description
URL Documents can be uniquely identified by the URLs of the resources that
they annotate.

Table 5.11: Properties of the class foaf:Document


Property Description
dcterms:created Date of creation of the review.
dcterms:creator Name of the reviewer.
schema:reviewBody Text of the review.
schema:keywords Tags assigned by the reviewer to the post.
sgo:starred Indicates if the post has been highlighted by the reviewer.
sgo:checked Indicates if the review task has finished.
sgo:status Status of the actions derived from the review.
sgo:priority Priority of the review.

Table 5.12: Properties of the class schema:Review

5.6 Users Module


This module describes the classes and properties related to social media users.
Figure 5.12 shows a UML representation of this module. The classes defined by
this module are the following:

• The class sioc:Role represents roles that the user accounts play in social
media, like influencer, content propagator, etc. The property defined for
this class is shown in Table 5.13.

• The class foaf:Agent defines persons or organisations that own user ac-
counts. The properties defined for this class are shown in Table 5.14.

• The class foaf:Organisation is used for describing organisations. We have
not defined additional properties for the class foaf:Organisation. This class
is a subclass of foaf:Agent.

• The class foaf:Person is used for defining persons. This class is a subclass
of foaf:Agent. The properties defined for this class are shown in Table 5.15.

• The class foaf:Image is a subclass of the class foaf:Document, which has been
described in Section 5.5. This class is used for defining images assigned to
user accounts.

• The class foaf:PersonalProfileDocument is also a subclass of the class
foaf:Document. This class is used for defining web pages that describe user
accounts.

• The class sgo:Activity is used for registering an activity record captured
by a tracking server. This activity record can be associated with a cookie,
a fingerprint, or both. The properties defined for this class are shown in
Table 5.16.

• The class sgo:Cookie is used for describing cookies installed in web browsers
used by users. A cookie is used in the context of this thesis as a mechanism
for uniquely identifying browsers, as it has been described in Section 2.2.2.1.
The properties defined for this class are shown in Table 5.17.

• The class sgo:Fingerprint is used for describing device fingerprints. The
properties defined for this class are shown in Table 5.18.

The classes tzont:PoliticalRegion and tzont:Country are defined within the
module that deals with geographical locations (see Section 5.9).

[UML class diagram relating the classes sioc:UserAccount, sioc:Role,
foaf:Agent, foaf:Organisation, foaf:Person, foaf:Document, foaf:Image,
foaf:PersonalProfileDocument, sgo:Activity, sgo:Cookie, sgo:Fingerprint,
tzont:Country and tzont:PoliticalRegion; their properties are listed in
Table 5.2 and in Tables 5.13 to 5.18.]

Figure 5.12: Users module of the SGO

Property Description
sioc:function_of User account that plays the role being described. This property
is the inverse of the property sioc:has_function, which has been
defined by Table 5.2.

Table 5.13: Property of the class sioc:Role

Property Description
foaf:name Name of the agent.
foaf:age Age of the agent.
foaf:mbox E-mail of the agent.
foaf:page Web page owned by the agent (e.g. weblog, homepage, etc.).
foaf:account A user account owned by the agent. This property is the inverse
of sioc:account_of, which has been defined by Table 5.2.
foaf:based_near Normalised geographical location of the agent (e.g. place of
residence). In Section 8.5 we provide a technique for identifying the
place of residence of social media users.
dcterms:language Language spoken by the agent.
dcterms:description Description declared by the user about herself/himself in her/his
profile of the social medium.
sgo:hasActivity Activity record registered for the agent.

Table 5.14: Properties of the class foaf:Agent

Property Description
foaf:givenName Given name (e.g. first name) of the person being described.
foaf:familyName Family name (e.g. last name) of the person being described.
schema:jobTitle Profession of the person being described.
foaf:gender Gender of the person (“male” or “female”). In Section 8.6 we
provide a technique for identifying the gender of social media users.

Table 5.15: Properties of the class foaf:Person

Property Description
dcterms:created Timestamp at which the activity record has been gathered. It is
defined at the granularity of milliseconds.
sgo:hasCookie Cookie assigned to a web browser when registering the activity.
sgo:hasFingerprint Fingerprint of a device when registering the activity.

Table 5.16: Properties of the class sgo:Activity

Property Description
URI An instance of the class sgo:Cookie can be uniquely identified by
a URI constructed with a hash created from the name, the value,
the domain and the path of the cookie.
rdfs:label Name of the cookie.
dcterms:created Date and time of creation of the cookie.
dcterms:valid Expiry date and time of the cookie.
sgo:value Value assigned to the cookie.
sgo:domain Domain scope of the cookie.
sgo:path Path scope of the cookie.
sgo:isSecure Determines whether the cookie can only be sent using secure con-
nections.
sgo:httpOnly Determines whether the cookie can only be sent through HTTP
[Fielding and Reschke, 2014a].

Table 5.17: Properties of the class sgo:Cookie


Property Description
URI An instance of the class sgo:Fingerprint can be uniquely identified
by a URI constructed with a hash created from all fingerprint
attributes (i.e. the ones described in Section 2.2.2.2).
dcterms:created Date and time of creation of the fingerprint.
sgo:xRealIP IP address [Postel, 1981] of the user’s device.
sgo:xForwardedFor IP address of the user’s device followed by the IP addresses of
the proxy servers between the device and the web server that has
registered the fingerprint.
sgo:userAgent Information about the device (browser, operating system, etc.)
used by the user.
sgo:accept Kind of content requested by the device to the web server when
such server registered the fingerprint.
sgo:acceptLanguage Language expected by the device.
sgo:acceptCharset Charset expected by the device.
sgo:acceptEncoding Encoding or compression format expected by the device.
sgo:cacheControl Directive that specifies the caching mechanisms to be applied
along the request-response chain.
sgo:plugins Plugins installed in the web browser used by the device.
sgo:fonts Fonts installed in the device.
sgo:video Video settings of the device.
sgo:timeZone Time zone of the device’s user.
sgo:sessionStorage Indicates if the device supports data persistence which is available
during a navigation session.
sgo:localStorage Indicates if the device supports persistent data which is available
beyond a navigation session.
sgo:iePersistence Indicates whether the device supports data persistence when the
browser is Internet Explorer.

Table 5.18: Properties of the class sgo:Fingerprint
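
The hash-based identifiers of sgo:Cookie (Table 5.17) and sgo:Fingerprint (Table 5.18) can be sketched as follows. The hash function (SHA-256), the field separator and the sorting of the attribute names are assumptions made for this illustration; the tables only state which values take part in the hash.

```python
import hashlib

SEPARATOR = "\x1f"  # assumption: a control character unlikely to occur in the values


def cookie_id(name, value, domain, path):
    """Hash of the name, value, domain and path of a cookie (Table 5.17)."""
    joined = SEPARATOR.join([name, value, domain, path])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def fingerprint_id(attributes):
    """Hash of all the attributes of a device fingerprint (Table 5.18).

    `attributes` maps property names (e.g. "sgo:userAgent") to their values.
    Sorting by name makes the result independent of the insertion order.
    """
    parts = [f"{name}={attributes[name]}" for name in sorted(attributes)]
    return hashlib.sha256(SEPARATOR.join(parts).encode("utf-8")).hexdigest()
```

Using an explicit separator prevents two different cookies from producing the same concatenated string, for example when the end of a name and the beginning of a value are interchanged.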

5.7 Opinions Module
This module describes the classes and properties related to the opinions
expressed by consumers in their posts. Figure 5.13 shows a UML representation of
this module. The classes defined by this module are the following:

• The class marl:Polarity indicates the polarity of the opinion. There are
three possible instances of this class: marl:Positive, marl:Negative and
marl:Neutral.

• The class onyx:EmotionCategory is used for indicating the kind of emotion
expressed within an opinion according to the categories defined in the
State of the Art (see Table 2.4). Therefore, we have defined the following
instances for this class: sgo:Satisfaction, sgo:Dissatisfaction, sgo:Love,
sgo:Hate, sgo:Happiness, sgo:Sadness, sgo:Trust and sgo:Fear. This thesis
provides a technique for identifying these emotion categories in Section 8.4.

• The class sgo:PurchaseStage is used for indicating the purchase stage ex-
pressed by a consumer according to the categories defined in the State
of the Art (see Figure 2.6). Therefore, we have defined the following in-
stances for this class: sgo:Awareness, sgo:Evaluation, sgo:Purchase and
sgo:PostpurchaseExperience. This thesis provides a technique for identi-
fying these Consumer Decision Journey stages in Section 8.2.

• The class sgo:MarketingMixAttribute is used for indicating the Marketing
Mix attributes to which consumers refer within their opinions according to
the categories defined in the State of the Art (see Table 2.3). Therefore,
we have defined the following instances for this class: sgo:CustomerService,
sgo:Sponsorship, sgo:Quality, sgo:Promotion, sgo:Advertisement, sgo:Price,
sgo:Design, sgo:PointOfSale, sgo:Warranty and sgo:LoyaltyMarketing. This
thesis provides a technique for identifying these Marketing Mix attributes in
Section 8.3, with the exception of Warranty and Loyalty Marketing, which are
out of scope.

[UML class diagram relating the class marl:Opinion to the classes
onyx:EmotionCategory (onyx:hasEmotionCategory), marl:Polarity
(marl:hasPolarity), sgo:PurchaseStage (sgo:hasPurchaseStage) and
sgo:MarketingMixAttribute (sgo:hasMarketingMixAttribute), together with the
instances of these four classes enumerated above.]

Figure 5.13: Opinions module of the SGO
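
The closed sets of instances defined by this module can be represented, for illustration, with the following Python sketch, which validates the annotations of an opinion and serialises them as subject-predicate-object tuples. The class and field names are hypothetical, and treating the purchase stage as optional and the emotion categories and Marketing Mix attributes as multi-valued is our reading of the cardinalities in Figure 5.13, not a statement of the thesis.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator, List, Optional, Tuple


class Polarity(Enum):
    POSITIVE = "marl:Positive"
    NEGATIVE = "marl:Negative"
    NEUTRAL = "marl:Neutral"


class PurchaseStage(Enum):
    AWARENESS = "sgo:Awareness"
    EVALUATION = "sgo:Evaluation"
    PURCHASE = "sgo:Purchase"
    POSTPURCHASE_EXPERIENCE = "sgo:PostpurchaseExperience"


EMOTIONS = frozenset({
    "sgo:Satisfaction", "sgo:Dissatisfaction", "sgo:Love", "sgo:Hate",
    "sgo:Happiness", "sgo:Sadness", "sgo:Trust", "sgo:Fear",
})
MARKETING_MIX = frozenset({
    "sgo:CustomerService", "sgo:Sponsorship", "sgo:Quality", "sgo:Promotion",
    "sgo:Advertisement", "sgo:Price", "sgo:Design", "sgo:PointOfSale",
    "sgo:Warranty", "sgo:LoyaltyMarketing",
})


@dataclass
class Opinion:
    uri: str
    text: str
    polarity: Polarity
    purchase_stage: Optional[PurchaseStage] = None  # assumed optional
    emotions: List[str] = field(default_factory=list)
    marketing_mix: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values that are not instances defined by the module.
        unknown = (set(self.emotions) - EMOTIONS) | (set(self.marketing_mix) - MARKETING_MIX)
        if unknown:
            raise ValueError(f"unknown instances: {sorted(unknown)}")

    def triples(self) -> Iterator[Tuple[str, str, str]]:
        yield (self.uri, "rdf:type", "marl:Opinion")
        yield (self.uri, "marl:opinionText", self.text)
        yield (self.uri, "marl:hasPolarity", self.polarity.value)
        if self.purchase_stage is not None:
            yield (self.uri, "sgo:hasPurchaseStage", self.purchase_stage.value)
        for emotion in self.emotions:
            yield (self.uri, "onyx:hasEmotionCategory", emotion)
        for attribute in self.marketing_mix:
            yield (self.uri, "sgo:hasMarketingMixAttribute", attribute)
```

For example, `Opinion("ex:op1", "great price", Polarity.POSITIVE, PurchaseStage.EVALUATION, ["sgo:Satisfaction"], ["sgo:Price"])` yields six tuples, one per annotation plus the type and the text.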

5.8 Topics and Keywords Module
This module describes the instances used for annotating the topics and keywords
included in social media content.
Figure 5.14 shows a UML representation of this module. Note that there
exist part-of relationships between some of the categories shown in the figure
(e.g. between noun types). The definition of this mereology is out of the scope
of this work, since it is specified by ISOcat [Kemps-Snijders et al., 2008].

[UML class diagram: instances of skos:Concept are linked through the property
isocat:datcat to the following ISOcat data categories.]

isocat:DC-1424 (verb)
isocat:DC-1230 (adjective)
isocat:DC-1333 (noun)
isocat:DC-1256 (common noun)
isocat:DC-1371 (proper noun)
isocat:DC-4339 (location)
isocat:DC-2275 (named entity)
isocat:DC-2979 (Organisation)
isocat:DC-2978 (Person)
isocat:DC-2950 (female)
isocat:DC-2949 (male)
isocat:DC-414 (trademark)
isocat:DC-5436 (metadata tag)
isocat:DC-2212 (domain)

Figure 5.14: Topics and Keywords module of the SGO
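
As an illustration of how a topic or keyword could be annotated with these categories, the following Python sketch maps category names to the ISOcat identifiers shown in Figure 5.14 and produces the corresponding annotations for a concept. The `concept:` prefix, the function name and the tuple representation are hypothetical; only the identifiers and the properties skos:prefLabel and isocat:datcat come from the module.

```python
# ISOcat data categories as they appear in Figure 5.14.
ISOCAT = {
    "verb": "isocat:DC-1424",
    "adjective": "isocat:DC-1230",
    "noun": "isocat:DC-1333",
    "common noun": "isocat:DC-1256",
    "proper noun": "isocat:DC-1371",
    "location": "isocat:DC-4339",
    "named entity": "isocat:DC-2275",
    "organisation": "isocat:DC-2979",
    "person": "isocat:DC-2978",
    "female": "isocat:DC-2950",
    "male": "isocat:DC-2949",
    "trademark": "isocat:DC-414",
    "metadata tag": "isocat:DC-5436",
    "domain": "isocat:DC-2212",
}


def annotate_concept(language, pref_label, categories):
    """Return (subject, predicate, object) tuples for a skos:Concept."""
    subject = f"concept:{language}/{pref_label}"  # hypothetical URI layout
    triples = [(subject, "skos:prefLabel", f'"{pref_label}"@{language}')]
    for category in categories:
        # A KeyError signals a category that the module does not reuse.
        triples.append((subject, "isocat:datcat", ISOCAT[category]))
    return triples


madrid = annotate_concept("es", "Madrid", ["proper noun", "location"])
```

A concept may carry several categories at once, which is why the same keyword can be annotated both lexically (proper noun) and semantically (location).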

5.9 Geographical Locations Module
This module describes the classes and properties related to the locations of
users and contents. Figure 5.15 shows a UML representation of this module. The
classes defined by this module are the following:

• The class tzont:PoliticalRegion represents a location that corresponds to any
kind of political region (e.g. country, state, city). The properties defined
for this class are shown in Table 5.19.

• The class tzont:Country represents a political region that corresponds to a
country. The properties defined for this class are shown in Table 5.20.

• The class tzont:State represents a political region that corresponds to an
administrative region of first level within a country (e.g. state, autonomous
community). The properties defined for this class are shown in Table 5.21.

• The class tzont:County represents a political region that corresponds to
an administrative region of second level within a country (e.g. county,
province). The properties defined for this class are shown in Table 5.22.

• The class tzont:City represents a political region that corresponds to an
administrative region of third level within a country (e.g. city, town, village,
settlement). The properties defined for this class are shown in Table 5.23.

• The class schema:Continent represents a continent of the world. The
properties defined for this class are shown in Table 5.24.

• The class tzont:TimeZone represents a time zone to which a political region
belongs. The properties defined for this class are shown in Table 5.25.

[UML class diagram: the classes tzont:Country, tzont:State, tzont:County and
tzont:City specialise tzont:PoliticalRegion; political regions are linked by
tzont:hasParentRegion to their parent region, which may be a
schema:Continent, and by tzont:hasTimeZone to a tzont:TimeZone.]

Figure 5.15: Locations module of the SGO

Property Description
dcterms:identifier Identifier of the political region.
rdfs:label Name of the political region.
geo:lat Representative latitude of the political region.
geo:long Representative longitude of the political region.
tzont:hasParentRegion Region (political or continent) to which a political region belongs.
This property is used for modelling the part-of relationship among
geographical political entities (City, County, State and Country).
tzont:hasTimeZone Time zone to which a political region belongs.

Table 5.19: Properties of the class tzont:PoliticalRegion

Property Description
URI The instances of the class tzont:Country can be uniquely identified
by a URI constructed from the identifier of the country.

Table 5.20: Properties of the class tzont:Country

Property Description
URI The instances of the class tzont:State can be uniquely identified
by a URI constructed from the identifiers of the country and the
state.

Table 5.21: Properties of the class tzont:State

Property Description
URI The instances of the class tzont:County can be uniquely identified
by a URI constructed from the identifiers of the country, the state,
and the county.

Table 5.22: Properties of the class tzont:County

Property Description
URI The instances of the class tzont:City can be uniquely identified by
a URI constructed from the identifiers of the country, the state,
the county and the city.

Table 5.23: Properties of the class tzont:City

Property Description
URI The instances of the class schema:Continent can be uniquely iden-
tified by a URI constructed from the identifier of the continent.
dcterms:identifier Identifier of the continent.
rdfs:label Name of the continent.

Table 5.24: Properties of the class schema:Continent

Property Description
URI The instances of the class tzont:TimeZone can be uniquely iden-
tified by a URI constructed from the name of the time zone.
rdfs:label Name of the time zone.
tzont:GMToffset Difference of the time zone from Greenwich Meridian Time
(GMT).

Table 5.25: Properties of the class tzont:TimeZone
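
The hierarchical identifiers described in Tables 5.20 to 5.23 can be illustrated with a short Python sketch in which the URI of a region embeds the identifiers of its ancestors, so that the regions reachable through tzont:hasParentRegion can be derived from the URI itself. The base namespace, the "/" separator and the sample identifiers are assumptions made for this example; the continent of a country is not encoded in the URI and therefore is not derived here.

```python
BASE = "http://example.org/sgo/location/"  # hypothetical namespace


def region_uri(*identifiers):
    """Concatenate identifiers from the broadest region to the narrowest.

    One identifier gives a country, two a state, three a county, four a city.
    """
    if not 1 <= len(identifiers) <= 4:
        raise ValueError("expected between one and four identifiers")
    return BASE + "/".join(identifiers)


def ancestors(uri):
    """URIs obtained by following tzont:hasParentRegion, nearest first."""
    path = uri[len(BASE):].split("/")
    return [region_uri(*path[:n]) for n in range(len(path) - 1, 0, -1)]


# Hypothetical identifiers for a city, its county, its state and its country.
city = region_uri("ES", "MD", "M", "Madrid")
parents = ancestors(city)
```

With this layout, the part-of chain City, County, State and Country mentioned in Table 5.19 is recovered by successively dropping the last identifier.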

Chapter 6

MORPHOSYNTACTIC
CHARACTERISATION OF
SOCIAL MEDIA CONTENTS

In this chapter, we make use of a part-of-speech tagger to process and characterise
a corpus of user-generated content extracted from different social media sources.
Specifically, we have studied differences in the language used in distinct types of
social media content by analysing the distribution of part-of-speech categories in
such sources. The chapter is structured as follows:

• Firstly, Section 6.1 describes the kinds of social media that we have com-
pared, from which we have extracted the contents to be analysed.

• Secondly, Section 6.2 explains the distributions of part-of-speech categories
by type of social media.

• Finally, Section 6.3 presents the conclusions of the analysis, validating the
first hypothesis of this thesis: the contents published in social media statis-
tically present different morphosyntactic features depending on the specific
kind of media where they have been published.

6.1 Types of Social Media Analysed
We have characterised several types of social media by extracting and
analysing a random sample of 10,000 textual contents, uniformly distributed
among the following media types:

Blogs. We have extracted the texts of posts published in feeds of blog publishing
platforms such as Wordpress53 and Blogger54 . Content published in these
sites usually consists on medium-sized posts and small comments about
such posts.

Forums. We have scraped the text of the comments published in web forums
built with vBulletin55 and phpBB56 technologies. Content published
on these sites consists of dialogues between users in the form of a
chronologically ordered sequence of short comments.

Microblogs. We have extracted the short messages published on Twitter and
Tumblr57 by querying their APIs. Content published in these sources
consists of short pieces of text (e.g. maximum 140 characters for Twitter).

Social networks. We have extracted the messages published on Facebook and
Google Plus58 by querying their APIs. Content published on these sites ranges
from short statuses or comments to medium-sized posts.

Review sites. We have scraped the text of the comments published on Ciao59 and
Dooyoo60, and the reviews published on Amazon61. The length of the content
published on these sites is also variable.
53 http://wordpress.org
54 http://www.blogger.com
55 http://www.vbulletin.com
56 http://www.phpbb.com
57 http://www.tumblr.com
58 http://plus.google.com
59 http://www.ciao.com
60 http://www.dooyoo.com
61 http://www.amazon.com

Audio-visual content publishing sites. We have extracted the textual comments
associated with the audio-visual content published on YouTube62 and
Vimeo63. Textual content published on these sites takes the form of short
comments.

News publishing sites. We have extracted the articles from the feeds published
by such sources. Sites of this kind can be classified as traditional
editorially controlled media. However, comments posted by article readers
can be catalogued as user-generated content. Thus, content published on
news sites consists of articles and short comments about such articles.

Other sites not classified in the categories above (e.g. Content Management
Systems) that publish their content as structured feeds, or that have
a known HTML structure to which a scraping technique can be applied.
The content published on these sites is heterogeneous.

6.2 Distribution of Part-of-Speech Categories


To study the distribution of part-of-speech (PoS) categories in
user-generated content, we have collected a corpus of 10,000 posts written in
Spanish, obtained from the sources described in the previous section. The posts
extracted are related to the telecommunications domain. We have performed the
PoS analysis by implementing a GATE [Cunningham et al., 2011] pipeline, with
TreeTagger [Schmid, 1994] as the PoS tagger. Therefore, the PoS distributions
obtained are based on an automatic tagger. A previous work by García Moya
[2008] includes an evaluation of TreeTagger with a Spanish parameterisation when
applied to a corpus of news articles. The precision, recall and F-measure obtained
in that evaluation were 0.8831, 0.8733 and 0.8782, respectively.
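As a quick consistency check of the three figures quoted above, the F-measure can be recomputed as the harmonic mean of precision and recall. The sketch below assumes that the balanced F-measure (F1) is the one reported; the function name fMeasure is illustrative only.

```javascript
// Balanced F-measure: harmonic mean of precision and recall.
function fMeasure(precision, recall) {
  return (2 * precision * recall) / (precision + recall);
}

// Precision and recall quoted in the text for TreeTagger on Spanish news.
console.log(fMeasure(0.8831, 0.8733).toFixed(4));
```

Rounded to four decimals, the result agrees with the reported F-measure of 0.8782.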
Table 6.1 shows the distributions obtained. The TreeTagger tag-set for
Spanish64 determines the PoS categories. As shown in the table, there are variations
in the distribution of these categories with respect to the publication source.
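The kind of per-source distribution reported in Table 6.1 can be sketched as a simple counting step over tagged tokens. The code below is not the GATE pipeline used in this work; it assumes, for illustration only, that the tagger output has already been flattened into objects with a source field (the media type) and a category field (the PoS category).

```javascript
// Each token is assumed to look like { source: "Blogs", category: "Noun" }.
// Returns, for every source, the share of tokens in each PoS category.
function posDistribution(tokens) {
  const counts = new Map(); // source -> Map(category -> count)
  const totals = new Map(); // source -> number of tokens
  for (const { source, category } of tokens) {
    if (!counts.has(source)) counts.set(source, new Map());
    const bySource = counts.get(source);
    bySource.set(category, (bySource.get(category) || 0) + 1);
    totals.set(source, (totals.get(source) || 0) + 1);
  }
  const result = {};
  for (const [source, bySource] of counts) {
    result[source] = {};
    for (const [category, n] of bySource) {
      result[source][category] = n / totals.get(source);
    }
  }
  return result;
}
```

Shares computed this way are relative to the tokens of each source, which is what makes sources of very different sizes comparable.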
62 http://www.youtube.com
63 http://vimeo.com
64 ftp://ftp.ims.uni-stuttgart.de/pub/corpora/spanish-tagset.txt

Table 6.1: Distribution of part-of-speech categories by social media type
PoS Category                        News   Blogs  Audiov.  Reviews  Microbl.  Forums  Other  S. Net.  All
Noun                                30.9%  30.0%  29.0%  23.2%  33.7%  22.0%  26.6%  32.7%  27.4%
  Common                            53.3%  56.9%  50.5%  71.5%  50.4%  68.8%  60.9%  50.2%  59.2%
  Proper                            42.3%  37.3%  42.9%  23.8%  36.1%  25.7%  34.1%  43.1%  34.6%
  Foreign word                      0.2%  0.5%  1.4%  0.5%  1.8%  0.9%  0.7%  1.0%  0.8%
  Measure unit (e.g. GHz)           0.2%  0.8%  0.0%  0.6%  0.1%  0.2%  0.2%  0.2%  0.3%
  Month name (e.g. Feb)             0.5%  1.1%  0.4%  0.1%  0.1%  0.3%  0.5%  0.4%  0.4%
  Acronym (e.g. UN)                 0.3%  0.5%  0.5%  0.1%  0.3%  0.5%  0.3%  0.5%  0.3%
  Letter of the alphabet (e.g. b)   0.6%  1.1%  2.3%  1.0%  4.0%  1.7%  1.0%  1.9%  1.5%
  Alphanumeric code (e.g. A4)       2.2%  1.5%  1.9%  0.9%  1.1%  1.2%  1.9%  1.1%  1.5%
  Symbol (e.g. $, £)                0.4%  0.3%  0.1%  1.4%  6.1%  0.7%  0.5%  1.5%  1.3%
Adjective                           8.6%  8.3%  6.4%  8.2%  9.4%  7.1%  8.4%  6.2%  8.0%
  Quantity ordinal                  4.6%  2.7%  1.4%  1.5%  0.4%  1.1%  1.7%  1.1%  1.9%
  Quantity cardinal                 34.7%  30.6%  28.5%  22.0%  33.0%  24.8%  34.3%  25.5%  29.6%
  Quantity other                    7.5%  12.0%  14.5%  23.6%  7.4%  23.3%  13.8%  19.3%  15.7%
  Other                             53.3%  54.8%  55.6%  53.0%  59.1%  50.8%  50.1%  54.1%  52.9%
Adverb                              2.5%  3.4%  3.2%  4.9%  3.9%  4.5%  3.7%  3.4%  3.8%
  Negation                          18.2%  18.1%  29.7%  23.9%  36.2%  30.0%  30.6%  29.1%  27.4%
  Other                             81.8%  81.9%  70.3%  76.1%  63.8%  70.0%  69.4%  70.9%  72.6%
Determiner                          11.5%  9.8%  7.6%  8.0%  5.8%  8.0%  8.7%  7.5%  8.5%
Conjunction                         6.1%  7.8%  6.6%  9.7%  6.2%  10.1%  8.7%  7.4%  8.3%
  Adversative coordinating          2.4%  3.1%  3.9%  5.7%  7.0%  5.7%  4.1%  3.7%  4.6%
  Negative coordinating             0.3%  0.9%  0.7%  1.5%  1.0%  1.5%  1.3%  1.3%  1.2%
  Other coordinating                44.3%  44.2%  36.6%  29.3%  36.6%  32.5%  38.9%  41.6%  36.7%
  "que"                             28.5%  26.9%  27.0%  34.4%  26.1%  31.7%  29.5%  26.7%  30.1%
  Subordinating (finite clauses)    2.2%  3.1%  1.6%  4.4%  1.4%  3.0%  2.9%  2.2%  3.0%
  Subordinating (infinite clauses)  10.6%  9.7%  18.7%  10.8%  10.7%  11.1%  10.2%  12.0%  10.8%
  Other subordinating               11.7%  12.0%  11.5%  13.9%  17.2%  14.6%  13.1%  12.6%  13.5%
Pronoun                             1.9%  3.4%  5.0%  5.6%  4.7%  5.8%  4.3%  4.4%  4.4%
  Demonstrative                     23.7%  24.3%  15.4%  20.2%  15.1%  13.9%  18.3%  16.2%  17.8%
  Interrogative                     0.7%  0.9%  0.0%  0.8%  1.8%  1.1%  0.6%  0.8%  0.9%
  Personal (clitic)                 17.1%  16.0%  11.4%  11.4%  16.3%  17.2%  14.6%  12.8%  14.6%
  Personal (non-clitic)             15.7%  22.1%  37.3%  44.3%  42.9%  50.3%  39.0%  42.5%  40.8%
  Possessive                        38.4%  34.3%  33.0%  21.2%  22.0%  15.9%  24.8%  24.6%  23.4%
  Relative                          4.3%  2.4%  2.8%  2.1%  1.9%  1.6%  2.7%  3.1%  2.4%
Preposition                         15.2%  14.6%  11.8%  12.7%  8.2%  11.9%  12.9%  11.5%  12.6%
  Portmanteau word "al"             3.8%  3.1%  3.4%  2.8%  2.1%  2.8%  3.1%  3.0%  3.1%
  Portmanteau word "del"            7.6%  4.2%  3.9%  4.5%  3.2%  3.9%  4.3%  4.8%  4.8%
  Other                             88.6%  92.7%  92.8%  92.6%  94.7%  93.3%  92.6%  92.3%  92.1%
Punctuation mark                    10.7%  8.5%  12.9%  9.4%  8.3%  9.2%  9.7%  10.5%  9.7%
  Full stop                         4.9%  17.1%  41.5%  8.7%  29.8%  25.5%  13.2%  25.0%  16.8%
  Comma                             48.9%  54.5%  29.1%  50.1%  25.2%  44.1%  44.7%  33.8%  43.7%
  Colon                             3.8%  3.8%  2.4%  5.4%  13.9%  4.8%  5.2%  15.2%  6.6%
  Semicolon                         1.0%  0.9%  1.3%  0.5%  0.6%  0.5%  0.5%  0.6%  0.7%
  Dash                              2.5%  1.4%  3.4%  1.5%  0.7%  2.1%  3.6%  3.3%  2.4%
  Ellipsis                          2.9%  4.3%  7.7%  8.8%  16.3%  8.4%  6.2%  9.2%  7.4%
  Slash                             0.5%  0.0%  0.0%  0.6%  3.8%  0.1%  0.3%  0.1%  0.5%
  Percent sign                      1.3%  1.1%  0.0%  0.9%  0.0%  0.7%  1.3%  0.4%  0.9%
  Left parenthesis                  13.4%  6.2%  5.2%  8.8%  2.1%  5.1%  11.1%  4.1%  8.1%
  Right parenthesis                 13.4%  6.2%  4.7%  8.6%  4.1%  5.5%  11.0%  4.6%  8.3%
  Quotation symbol                  7.5%  4.5%  4.6%  6.2%  3.5%  3.2%  2.9%  3.6%  4.5%
Verb                                12.0%  13.8%  16.8%  17.8%  19.1%  20.5%  16.4%  16.0%  16.7%
  To be ("estar")                   1.6%  1.9%  0.5%  1.7%  1.1%  1.5%  1.5%  1.3%  1.5%
  To have ("haber")                 5.8%  3.5%  2.4%  3.5%  2.0%  3.2%  3.9%  1.9%  3.4%
  Lexical past participle           16.0%  13.4%  11.7%  10.2%  5.8%  10.0%  12.2%  8.9%  10.8%
  Lexical finite                    47.2%  48.8%  48.5%  46.8%  50.1%  50.2%  48.5%  51.8%  48.8%
  Lexical gerund                    1.0%  0.7%  0.3%  0.9%  0.4%  0.8%  0.8%  1.1%  0.8%
  Lexical infinitive                20.4%  22.9%  28.1%  25.5%  32.0%  26.7%  25.0%  26.9%  26.0%
  Modal                             1.5%  1.8%  0.8%  1.4%  0.8%  1.9%  1.6%  1.9%  1.6%
  To be ("ser") past part.          0.6%  0.3%  0.6%  0.9%  0.1%  0.2%  0.4%  0.1%  0.4%
  To be ("ser") infinitive          0.4%  0.4%  0.4%  0.6%  0.3%  0.3%  0.5%  0.5%  0.4%
  To be ("ser") other               5.6%  6.4%  6.8%  8.7%  7.3%  5.3%  5.7%  5.6%  6.3%
  "Se" (as particle)                0.7%  0.6%  0.7%  0.5%  0.7%  0.7%  0.6%  0.6%  0.6%

The distribution of all PoS categories in news publishing sites and blogs is
very similar, because the posts published in these sources share a similar writing
style, as neither imposes limits on the size of such posts.
In addition, the sources not classified (i.e. "other") have a distribution similar
to that of the combination of all sources. This may be due to the heterogeneity of
the publications contained in the web pages that have not been classified as a
specific content type.
Next, we discuss some relevant insights obtained from the distribution of each
PoS category.

6.2.1 Distribution of Nouns


As shown in Table 6.1, the distribution of common and proper nouns in forums
and reviews differs markedly from that of the other sources. It seemed strange to
us that proper nouns were used less in the sources where discussions about
specific product models are raised than in the other sources. After examining a
sample of 100 texts, we noticed that in those sources product names are often
written in lower case, which leads to an incorrect PoS annotation. After
reprocessing the corpus using gazetteers that include proper names in lower case,
we found that this is a problem with TreeTagger precision. This problem makes
entity recognition less accurate when it relies on a previous step of detecting
proper nouns using PoS tagging. Although the use of gazetteers improves entity
detection, this solution is domain-dependent.
In addition, foreign words are less used in news than in other sources, because
the style rules of traditional media require avoiding such foreign words, as far as
possible, whenever a Spanish word exists.
Finally, the relatively high share of the letter-of-the-alphabet category is due
to TreeTagger tagging errors (especially when analysing short texts published on
Twitter).

6.2.2 Distribution of Adjectives


As shown in Table 6.1, the share of adjectives of quantity is close to 50% for
most of the sources (adding quantity ordinal, quantity cardinal, and other). The
most commonly used adjectives of quantity are the cardinals and the least used are
the ordinals, whose use is insignificant in all sources except news publishing sites.
The rest of the quantifying adjectives (quantity other) are used quite frequently
in forums and reviews, because such sites include publications of quantitative
evaluations and comparisons of products. Specifically, on these sites we find
multiplicative (e.g. doble, triple), partitive (e.g. medio, tercio), and indefinite
quantity adjectives (e.g. mucho, poco, bastante).

6.2.3 Distribution of Adverbs

The adverbs of negation (e.g. jamás, nada, no, nunca, tampoco) are used
more frequently in the sources that limit post length. Moreover, there
is an inverse correlation between the size of the texts and the use of adverbs of
negation. The detection of such negations is essential when performing sentiment
analysis, since they reverse the sentiment of the opinion about specific entities.
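To illustrate why short texts can show a high share of negation adverbs, the following sketch measures the proportion of tokens in a text that belong to the example list given above. It is a plain lexical lookup, not the PoS-based analysis performed in this chapter (for instance, it cannot tell whether "nada" acts as an adverb or as a pronoun), and the names NEGATION_ADVERBS and negationRatio are illustrative assumptions.

```javascript
// Example adverbs of negation listed in the text.
const NEGATION_ADVERBS = new Set(["jamás", "nada", "no", "nunca", "tampoco"]);

// Share of word tokens in the text that appear in the negation list.
function negationRatio(text) {
  // \p{L}+ matches runs of letters, including accented ones.
  const tokens = text.toLowerCase().match(/\p{L}+/gu) || [];
  if (tokens.length === 0) return 0;
  const negations = tokens.filter((t) => NEGATION_ADVERBS.has(t)).length;
  return negations / tokens.length;
}
```

For a short message such as "No me gusta nada", two of the four tokens are in the list, so a single negated opinion already dominates the ratio.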

6.2.4 Distribution of Determiners

Determiners are used to a lesser extent in microblogs than in the other media
types (especially news and blogs), because the limitation of post length (e.g.
140 characters on Twitter) requires that posts be written more concisely, and
therefore grammatical categories that carry little meaning tend to be used less.

6.2.5 Distribution of Conjunctions

With respect to conjunctions, the distribution of coordinating conjunctions is
higher in sources where the texts are longer (i.e. news and blogs), and lower in
sources where posts are shorter, especially in forums and reviews, because these
sources have a question-answer structure dominated by short sentences. Coordinating
conjunctions, like punctuation marks, are useful in opinion mining for identifying
opinion chunks.

6.2.6 Distribution of Pronouns
The distribution of personal pronouns (e.g. yo, tú, mí) is higher in microblogs,
reviews, forums and audio-visual content publishing sites because, in these sources,
conversations between the users that generate the content are predominant, in
contrast to the narrative style of news and blog articles.
Generally, pronouns make it difficult to identify entities within opinions, be-
cause such entities are not explicitly mentioned when using pronouns.

6.2.7 Distribution of Prepositions


As with determiners, prepositions are used to a lesser extent in microblogs
than in the other media types, because of the concise language used.

6.2.8 Distribution of Punctuation Marks


Full stops are used less in news than in other sources, because news articles
contain longer sentences, which require other kinds of punctuation marks
(e.g. commas). In the rest of the social media sources, short and self-contained
phrases are usually written, which implies a higher density of full stops.
The use of commas is lower in sources where there is less writing, that is, on
Twitter and sites with comments on audio-visual content.
The heavy use of the colon and slash in microblogs is due to the inclusion of
these characters in the emoticons and the sources cited through links embedded
in tweets.
Ellipses are more used in microblogs than in the rest of the sources, because
of the limitation of the size of the messages. In this source, unfinished messages
are posted frequently, so ellipses are added to express that such messages are
incomplete. Furthermore, some Twitter clients truncate messages longer than
140 characters, and automatically add the ellipsis.
Finally, parentheses and other less common punctuation marks (e.g.
percent sign) are used less in microblogs, because of the limited length of the
tweets and the difficulty of entering these characters on mobile terminals.

6.2.9 Distribution of Verbs
With respect to verbs, their use is more extensive in forums and microblogs, in
proportion to the rest of the PoS categories, than in the other social media sources.
A reason for this may be that intentions and actions are expressed more often in
these sources.
In addition, there is less use of the past participle within microblogs than in
other sources. This is because microblogs are used to transmit immediate
experiences, so most of the posts are communicated in the present tense. Similarly,
the infinitive of lexical verbs is used more in microblogs.
Finally, lexical finite verbs are used similarly in all the social media channels.

6.3 Hypothesis Validation


We have demonstrated that the distribution of PoS categories varies across
different social media types, which validates Hypothesis 1. Since PoS tagging is a
preliminary step for many NLP techniques, the performance of such techniques may
vary according to the social media source from which the user-generated content
has been extracted.
As an example, a disambiguation strategy for topic identification may use
nouns as context for performing disambiguation. Thus, sources with a higher
proportion of nouns will provide more context than sources in which that
proportion is smaller. The proportion of other categories may affect
the performance of other techniques (e.g. adjectives and adverbs in sentiment
analysis).

Chapter 7

TECHNIQUE FOR UNIQUE


USER IDENTIFICATION
BASED ON EVOLVING
DEVICE FINGERPRINT
DETECTION

As we have explained in Section 2.2.2.2, any technique for identifying users based
on browser fingerprints must be accompanied by an algorithm that detects different
fingerprints corresponding to a single browser, because browser fingerprints
change very often [Eckersley, 2010].
This chapter describes a novel technique that takes into account the temporal
evolution of fingerprints, as well as the entropy of fingerprint attributes, to
weight each fingerprint attribute according to its discriminative power.
This technique consists of the instantiation of a set of activities defined by
the CRISP-DM methodology [Shearer, 2000]. These activities are the following:

1. The Data Understanding activity collects the fingerprint data and analyses
them from different perspectives, ensuring that they are valid for model
learning purposes. This activity is explained in Section 7.1.

2. The Data Preparation activity covers all the tasks required to construct the
dataset used for learning and evaluating the technique, including ensuring
that users are uniquely identified and removing non-human activity from
it. This activity is explained in Section 7.2.

3. The Modelling activity consists in selecting the modelling technique and in


learning the specific models that will be used for identifying unique users.
This activity is explained in Section 7.3.

4. The Evaluation activity consists in evaluating the models obtained. This


activity is explained in Section 7.4.

Next, each of the activities is described. After that, in Section 7.5 we validate
the hypothesis formulated in Section 3.4 regarding unique user identification
through device fingerprints.

7.1 Data Understanding Activity


This activity consists in the ordered execution of the following tasks:

1. The Collect Initial Data task consists in obtaining the activity produced in
websites. This task is described in Section 7.1.1.

2. The Describe Data task consists in performing a description of the format


and volume of the data gathered. This task is described in Section 7.1.2.

3. The Explore Data task consists in performing a deeper statistical analysis of


data from several viewpoints to ensure that the data are valid for modelling
purposes. This task is described in Section 7.1.3.

4. The Verify Data Quality task consists in examining the quality of the data
by attending to the analyses performed in the previous tasks. This task is
described in Section 7.1.4.

7.1.1 Collect Initial Data Task

This task consists in collecting the activity produced by users in websites as well
as in collecting their fingerprints. Such fingerprints are made of a set of values
for several HTTP headers [Fielding and Reschke, 2014b] and other attributes
accessible by executing JavaScript [ECMA, 2011], Java or Flash code within the
browser.
This task gathers the same HTTP headers as Eckersley [2010] (User-Agent,
Accept, Accept-Language, Accept-Encoding, and Accept-Charset). Such
headers have been described in Section 2.2.2.2. In addition, this task collects the
values for the additional HTTP headers described next.

X-Real-IP header. This non-standard header identifies the IP address [Postel,


1981] of the user’s device. The Nginx reverse proxy [Reese, 2008], which is
used in our implementation, adds this header. This reverse proxy receives
every message sent from the web browser, and redirects it to the tracking
server, which processes and persists the activity record.

X-Forwarded-For header. This header is a multivalued attribute that includes


the IP address of the web browser machine, as well as the IP addresses of
the successive proxy servers that have routed the HTTP message [Reese,
2008]. The Nginx proxy also adds this header.

Cache-Control header. This header is used to specify directives that must be


obeyed by all caching mechanisms along the HTTP request/response chain.
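As a rough illustration of how a tracking server might read these three headers, the sketch below extracts them from a header map. It is not the implementation used in this work: the function name clientAddresses and the output field names are hypothetical, and the input is assumed to be a plain object with lower-cased header names, which is how Node.js exposes request headers.

```javascript
// headers: plain object with lower-cased header names (an assumption).
function clientAddresses(headers) {
  // X-Forwarded-For is a comma-separated list: client first, then each proxy.
  const forwarded = (headers["x-forwarded-for"] || "")
    .split(",")
    .map((ip) => ip.trim())
    .filter((ip) => ip.length > 0);
  return {
    xRealIP: headers["x-real-ip"] || null,
    xForwardedFor: forwarded,
    cacheControl: headers["cache-control"] || null,
  };
}
```

When no intermediate proxy is involved, the list obtained from X-Forwarded-For contains a single address, which is expected to coincide with X-Real-IP.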

Unlike the approach followed by Eckersley [2010], our work does not make
use of the Cookies Enabled attribute. The rest of the attributes (Plugins, Fonts,
Video, Time Zone, Session Storage, Local Storage, and IE Persistence) have been
collected by using a technique implemented by Eckersley [2010], which consists
of the execution of a combination of JavaScript, Java and Flash code.
To obtain the Plugins attribute it is necessary to determine the user's browser,
since the way in which this information is accessed depends on it.

• In the case of the Mozilla Firefox65, Google Chrome66, Apple Safari67, and
Opera68 browsers, this attribute is obtained through the DOM (Document
Object Model) by accessing the navigator.plugins element. This element
contains an array of objects and each object contains the name, the
description, and the version of a plugin. Listing 7.1 shows the JavaScript code for
obtaining the Plugins attribute for these browsers.

• In the case of Internet Explorer a different technique is applied because
most versions of this browser do not include plugin information in their DOM.
This technique relies on the PluginDetect JavaScript library69, which receives
a list of the plugins to be detected and returns the information related
to these plugins. Specifically, we have obtained information for the following
plugins: Java, QuickTime70, DevalVR71, Shockwave72, Flash, Windows
Media Player73, Silverlight74, and Acrobat75.

The Fonts attribute is obtained through a Flash component. Therefore it
cannot be obtained if Flash is not installed on the user's device. To extract the
fonts information from the Flash component we make use of the jQuery Flash
library76, which allows querying Flash objects from JavaScript. Listing 7.2 shows
the JavaScript code for obtaining the Fonts attribute.
To extract video information we access the screen object included in the
browsers’ DOM. Specifically, we obtain the values for the following attributes:

• The attribute height, which contains the number of vertical pixels in the
device’s screen.
65 http://www.mozilla.org/firefox
66 http://www.google.es/chrome/browser
67 http://www.apple.com/safari
68 http://www.opera.com
69 http://www.pinlady.net/PluginDetect
70 http://www.apple.com/quicktime
71 http://www.devalvr.com
72 http://www.adobe.com/shockwave
73 http://windows.microsoft.com/en-us/windows/windows-media-player
74 http://www.microsoft.com/silverlight
75 http://www.adobe.com/products/acrobat.html
76 http://jquery.lukelutman.com/plugins/flash

var plugins = navigator.plugins;
var plist = [];

for (var i = 0; i < plugins.length; i++) {
    // Name, description and file name of the plugin
    plist[i] = plugins[i].name + "; ";
    plist[i] += plugins[i].description + "; ";
    plist[i] += plugins[i].filename + ";";

    // MIME types handled by the plugin
    for (var n = 0; n < plugins[i].length; n++)
        plist[i] += " (" + plugins[i][n].description + "; " +
            plugins[i][n].type + "; " + plugins[i][n].suffixes + ")";

    plist[i] += ". ";
}

// Sort the entries so that the result does not depend on enumeration order
plist.sort();

Listing 7.1: Script for obtaining the Plugins attribute

var fonts = "";
var obj = document.getElementById("flashfontshelper");

// Query the Flash helper object, if it is present and scriptable
if (obj && typeof(obj.GetVariable) != "undefined") {
    fonts = obj.GetVariable("/:user_fonts");
    fonts = fonts.replace(/,/g, ", ");
    fonts += " (via Flash)";
}

if (fonts == "")
    fonts = "No Flash fonts detected";

Listing 7.2: Script for obtaining the Fonts attribute

• The attribute width, which contains the number of horizontal pixels in the
device’s screen.

• The attribute colorDepth, which contains information about the number of
colours supported by the user's device.

Listing 7.3 shows the JavaScript code for obtaining the Video attribute.

video = screen.width + "x" + screen.height + "x" + screen.colorDepth;

Listing 7.3: Script for obtaining the Video attribute

timezone = (new Date()).getTimezoneOffset();

Listing 7.4: Script for obtaining the Time Zone attribute

1 sessionStorage.fingerprint = "yes";
2 sessionStorageCapability = (sessionStorage.fingerprint == "yes");

Listing 7.5: Script for obtaining the Session Storage attribute

The Time Zone attribute is obtained, as in previous cases, by using JavaScript
code. To do so, an instance of the object Date is created and its method
getTimezoneOffset is called. This method returns the difference in minutes between
UTC (Coordinated Universal Time) and the local time zone. Listing 7.4
shows the JavaScript code for obtaining the Time Zone attribute.
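The sign convention of the value returned by getTimezoneOffset is easy to misread: the value is positive when local time is behind UTC and negative when it is ahead. The helper below, whose name formatUtcOffset is illustrative and not part of the technique, converts the collected value into a conventional UTC offset label.

```javascript
// Converts the value of Date.prototype.getTimezoneOffset() (in minutes)
// into a label such as "UTC+01:00".
function formatUtcOffset(offsetMinutes) {
  // The offset is UTC minus local time, so the sign has to be flipped.
  const sign = offsetMinutes <= 0 ? "+" : "-";
  const abs = Math.abs(offsetMinutes);
  const hours = String(Math.floor(abs / 60)).padStart(2, "0");
  const minutes = String(abs % 60).padStart(2, "0");
  return "UTC" + sign + hours + ":" + minutes;
}
```

For example, a browser whose local time is one hour ahead of UTC reports -60, which the helper renders as "UTC+01:00".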
The technique for obtaining the Session Storage and Local Storage attributes
consists of finding out whether the browser allows storing session or local data.
To do so, the objects sessionStorage and localStorage are used. Listings 7.5 and
7.6 show the JavaScript code for obtaining these attributes. The process followed
by both scripts is the following:

1. Firstly, we try to store a value in the object sessionStorage (or localStorage)


for the fingerprint keyword (line 1).

2. Next, we query the value for the fingerprint keyword stored in the object
sessionStorage (or localStorage) (line 2).

(a) If the value obtained is equal to the one assigned in Step 1, then the
browser is able to store session (or local) data.
(b) Otherwise, the browser is not able to do so.
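The two steps above can also be written as a single defensive function. The sketch below is a variation on the listings, not the code used in this work: it assumes that the storage object follows the standard Web Storage interface (setItem, getItem, removeItem) and it takes a function that returns the storage object, because in some browser configurations merely accessing sessionStorage or localStorage throws an exception. The name canStore is illustrative.

```javascript
// getStorage: function returning a Web Storage object,
// e.g. function () { return window.sessionStorage; }
function canStore(getStorage) {
  try {
    const storage = getStorage();
    storage.setItem("fingerprint", "yes");                     // Step 1
    const ok = storage.getItem("fingerprint") === "yes";       // Step 2
    storage.removeItem("fingerprint");                         // leave no trace
    return ok;
  } catch (e) {
    // Storage disabled, blocked by privacy settings, or quota exhausted.
    return false;
  }
}
```

In a browser, canStore(function () { return window.localStorage; }) would play the role of the Local Storage check, returning false instead of raising an error when storage is unavailable.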

The technique for obtaining the IE Persistence attribute consists of finding out
whether the browser allows modifying XML DOM elements. Listing 7.7 shows the

1 localStorage.fingerprint = "yes";
2 localStorageCapability = (localStorage.fingerprint == "yes");

Listing 7.6: Script for obtaining the Local Storage attribute

1 oDiv.setAttribute("fingerprint", "yes");
2 oDiv.save("oXMLStore");
3 ieStorageCapability = (oDiv.getAttribute("fingerprint") == "yes");

Listing 7.7: Script for obtaining the IE Persistence attribute

JavaScript code for obtaining this attribute. The process followed by this script
is the following:

1. Firstly, we try to store a value in a div object for an attribute called fin-
gerprint (line 1).

2. Next, the div object is stored within the browser’s cache (line 2).

3. After that, we query the value for the fingerprint keyword stored in the
browser’s cache (line 3).

(a) If the value obtained is equal to the one assigned in Step 1, then the
browser is able to store data within the Internet Explorer cache.
(b) Otherwise, the browser is not able to do so.

In the experiment conducted in this thesis, we have collected the data
by using a web tracking server based on cookies, generating
records containing fingerprint attributes and a user identifier. These records have
been obtained using the JavaScript tags technique for capturing web activity
explained in Section 2.2.1.3, combined with the technique based on cookies for
identifying users explained in Section 2.2.2.1.

7.1.2 Describe Data Task


Regarding data format, the dataset used has the structure shown in Figure 7.1,
which reflects the elements of the Social Graph Ontology that the technique
reads or writes, hiding those properties not required by the technique. The
data format consists of a set of activity records captured by the tracking server.
Each activity record corresponds to a single user and is related to a fingerprint
and a cookie that uniquely identifies a given user. The classes and properties
included in the diagram have already been described in Section 5.6.

[Class diagram "Unique User Identification Data Format":
 foaf:Agent (1) sgo:hasActivity (*) sgo:Activity, with property dcterms:created;
 sgo:Activity (*) sgo:hasFingerprint (0..1) sgo:Fingerprint;
 sgo:Activity (*) sgo:hasCookie (0..1) sgo:Cookie, with property dcterms:created;
 sgo:Fingerprint properties: sgo:xRealIP, sgo:xForwardedFor, sgo:userAgent,
 sgo:accept, sgo:acceptLanguage, sgo:acceptCharset, sgo:acceptEncoding,
 sgo:cacheControl, sgo:plugins, sgo:fonts, sgo:video, sgo:timeZone,
 sgo:sessionStorage, sgo:localStorage, sgo:iePersistence]

Figure 7.1: Format of the data used by the technique for unique user identification
based on evolving device fingerprint detection

With respect to data quantity, the data used in the experiment conducted in
this thesis consists of a set of 18,391 records extracted from a website between
September 28 and October 19, 2011.

7.1.3 Explore Data Task


This task characterises the data from different viewpoints to ensure that the
dataset is rich enough for model training. Specifically the objective of this task
is to describe the distribution of the data with respect to unique visitors, web
browsers used and countries of origin of the activity collected, and to study the
characteristics of the fingerprint attributes from an Information Theory [Shannon
and Warren, 1949] perspective.
Next, we characterise the data used in our experiment according to the pre-
vious guidelines.
During the period of the study, 10,834 unique visitors visited the website from
which we have extracted the data, distributed daily as shown in Figure 7.2. Such
visitors may include humans and web crawlers. These users made a total of 11,932
site visits distributed daily as shown in Figure 7.3. The visitors registered a total
of 18,391 web page views distributed daily as shown in Figure 7.4. Each web
page view generates a record within the fingerprint log used.

[Web analytics chart for www.pocketinvaders.com, 28 Sep 2011 - 19 Oct 2011:
 visitors per day; 10,501 people visited this site]

Figure 7.2: Daily distribution of visitors during the period of study

[Web analytics chart for www.pocketinvaders.com, 28 Sep 2011 - 19 Oct 2011:
 visits per day; all traffic sources sent a total of 11,932 visits]

Figure 7.3: Daily distribution of visits during the period of study

[Web analytics chart for www.pocketinvaders.com, 28 Sep 2011 - 19 Oct 2011:
 page views per day; pages on this site were viewed a total of 18,425 times]

Figure 7.4: Daily distribution of page views during the period of study

On average, each visitor viewed 1.7 pages and spent about one minute
on the website. The bounce rate (i.e. percentage of visitors leaving the site
after viewing a single web page) was 79%, while the percentage of new visitors
was 84%, so about 16% of the users had visited the website
before the study began. The minimum number of web pages viewed by a single
visitor was 1, while the maximum was 389. Figure 7.5 shows the distribution of
web pages viewed per user. Table 7.1 shows the summary statistics of
the distribution of the number of records captured per user. It includes

Figure 7.5: Distribution of the activity records captured by unique user

Statistic Value
Count 10,834
Mean 1.7
Standard Deviation 5.34
Coefficient of Variation 314.5%
Minimum 1
Maximum 389
Range 388
Systematic Error 1,921.62
Kurtosis 60,503.1

Table 7.1: Statistics associated with the number of records gathered per unique user

measures of tendency, variability and shape.
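Several of the statistics in Table 7.1 can be recomputed from the list of per-user record counts. The sketch below is illustrative only: the function name summarise is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the estimator used for the table is not stated. As a quick check of the reported mean, 18,391 records divided by 10,834 unique users gives approximately 1.7 records per user.

```javascript
// counts: array with the number of records captured for each unique user.
function summarise(counts) {
  const n = counts.length;
  const mean = counts.reduce((a, b) => a + b, 0) / n;
  // Sample variance (n - 1 in the denominator); this choice is an assumption.
  const variance =
    counts.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1);
  const sd = Math.sqrt(variance);
  const min = Math.min(...counts);
  const max = Math.max(...counts);
  return { count: n, mean, sd, cv: sd / mean, min, max, range: max - min };
}

// Mean number of records per user implied by the totals quoted in the text.
console.log((18391 / 10834).toFixed(1));
```

The coefficient of variation is returned as a ratio, so it has to be multiplied by 100 to be compared with the percentage shown in the table.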


With respect to web browsers, there is a representation of the most used
browsers in the sample (39% of the activity was generated by Google Chrome,
30% by Mozilla Firefox, 18% by Internet Explorer, 6% by Android77 , 3% by Apple
Safari, and 4% by other non-catalogued browsers). This distribution affects the
diversity of values for different attributes, such as the User-Agent header, or the
plugins installed.
The sample used in our experiment contains activity generated in 63 different
countries, as reflected in Figure 7.6.

77 http://www.android.com

[Map overlay for www.pocketinvaders.com, 28 Sep 2011 - 19 Oct 2011:
 11,932 visits came from 63 countries/territories]

Figure 7.6: Distribution of visits per country

Table 7.2 shows the activity generated by the users of the 10 countries with
the most visits to the site from which data has been extracted. The table shows the
number of visits, the average page views per visit, the average time
spent on the website per visit, the percentage of new visits,
and the bounce rate.
The distribution of countries affects different fingerprint attributes, such as
the time zone and the Accept-Language header.
Table 7.3 shows the entropy [Shannon, 1948] of each fingerprint attribute.
The first column shows the variable name assigned to the attribute. The second
column shows the attribute itself. The third column indicates the entropy of
the attribute in our dataset. The fourth column shows the entropy obtained by
Eckersley [2010] for the same attributes. The entropy associated with headers
X-Real-IP, X-Forwarded-For and Cache-Control was not studied by Eckersley
[2010], while the attribute that indicates whether the browser supports cookies has
not been used in our work. On the other hand, the entropy associated with Accept
HTTP headers was studied jointly by Eckersley [2010], whereas in our work it has
Country     Visits   Pages   Time Spent (min:s)   New Visits   Bounce Rate
Spain       5277     1.73    1:20                 75.67%       75.52%
Mexico      2173     1.34    0:41                 90.43%       84.49%
Argentina   904      1.41    0:39                 87.94%       79.98%
Chile       874      1.49    0:53                 92.11%       79.41%
Colombia    578      1.40    0:45                 89.79%       80.28%
Venezuela   360      1.66    0:59                 90.28%       70.28%
Peru        325      1.37    0:48                 90.46%       80.62%
Unknown     173      1.51    1:47                 91.91%       82.08%
USA         151      1.42    0:45                 93.38%       78.15%
Ecuador     130      1.38    0:32                 93.08%       81.54%

Table 7.2: Distribution of visits for the 10 countries that generated more site
activity

X     Attribute          H(X)        H(X) [Eckersley, 2010]
X1    X-Real-IP          12.5061     –
X2    X-Forwarded-For    12.52       –
X3    User-Agent         7.51458     10
X4    Accept             2.05302     6.09 (joint value for X4 to X7)
X5    Accept-Language    3.68173     6.09 (joint value for X4 to X7)
X6    Accept-Charset     1.89086     6.09 (joint value for X4 to X7)
X7    Accept-Encoding    1.81318     6.09 (joint value for X4 to X7)
X8    Cache-Control      0.299063    –
X9    Plugins            11.7677     15.4
X10   Fonts              8.38331     13.9
X11   Video              5.50273     4.83
X12   Time zone          2.30895     3.04
X13   Session storage    0.299995    2.12 (joint value for X13 to X15)
X14   Local storage      0.297941    2.12 (joint value for X13 to X15)
X15   IE persistence     0.560692    2.12 (joint value for X13 to X15)
–     Cookies enabled    –           0.353

Table 7.3: Entropy of fingerprint attributes

been studied separately; the same happens with browser storage capabilities. The
quantitative differences between the entropy values obtained in our work and
those obtained by Eckersley [2010] are due to the size of the datasets; the larger
dataset of Eckersley [2010] contains data from 470,161 browsers, whereas our
dataset contains 10,834.
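
As an illustration of how the entropy values in Table 7.3 can be obtained, the following minimal sketch computes the Shannon entropy of the values observed for one attribute. It assumes that entropy is measured in bits (base-2 logarithm), as in Eckersley [2010]; the function name `shannon_entropy` and the toy sample are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of a sequence of observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    # H(X) = -sum p(x) * log2 p(x) over the distinct observed values
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical toy sample of time-zone values (not real data).
time_zones = ["+01:00", "+01:00", "-06:00", "-03:00"]
print(shannon_entropy(time_zones))
```

An attribute that takes the same value in every record has zero entropy, whereas an attribute whose values are all different reaches the maximum, log2 of the number of records.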

Table 7.4 shows the cross-entropy values H(X, Y), i.e. the joint entropy,
obtained when pairs of fingerprint attributes are combined. The pairs of
fingerprint attributes with more discriminative power are the X-Real-IP header
combined with the plugins or the fonts, and the X-Forwarded-For header combined
with the plugins, the fonts, or the Accept header.
The cross-entropy values of X-Real-IP and X-Forwarded-For are quite similar
because the value of the former is always included within the value of the latter,
and most fingerprint records do not correspond to a proxy route.
Attributes: X1 X-Real-IP, X2 X-Forwarded-For, X3 User-Agent, X4 Accept,
X5 Accept-Language, X6 Accept-Charset, X7 Accept-Encoding, X8 Cache-Control,
X9 Plugins, X10 Fonts, X11 Video, X12 Time zone, X13 Session storage,
X14 Local storage, X15 IE persistence.

H(X, Y)  X1     X2     X3     X4     X5     X6     X7     X8     X9     X10    X11    X12    X13    X14    X15
X1       –
X2       12.5   –
X3       12.6   12.6   –
X4       12.6   12.7   7.8    –
X5       12.6   12.6   8.3    4.2    –
X6       12.6   12.6   7.5    2.6    3.8    –
X7       12.6   12.6   7.6    3      4.2    2.4    –
X8       12.5   12.5   7.7    2.3    3.9    2.2    2.1    –
X9       12.7   12.7   12.3   12     12     11.8   11.9   11.8   –
X10      12.7   12.7   11.1   9.3    9.8    9.2    9.2    8.5    12     –
X11      12.6   12.6   9.7    6.3    7.6    6.3    6.4    5.7    12.3   10.7   –
X12      12.5   12.5   9      4.2    5.4    4.1    4.1    2.6    12     9.4    7.4    –
X13      12.5   12.5   7.5    2.3    3.8    2.1    2      0.6    11.8   8.5    5.7    2.6    –
X14      12.5   12.5   7.5    2.3    3.8    2.1    2      0.6    11.8   8.5    5.7    2.6    0.3    –
X15      12.5   12.5   7.5    2.9    3.9    2.2    2.2    0.9    11.8   8.5    5.9    2.9    0.8    0.8    –

Table 7.4: Cross-entropy between pairs of fingerprint attributes

Attributes: X1 X-Real-IP, X2 X-Forwarded-For, X3 User-Agent, X4 Accept,
X5 Accept-Language, X6 Accept-Charset, X7 Accept-Encoding, X8 Cache-Control,
X9 Plugins, X10 Fonts, X11 Video, X12 Time zone, X13 Session storage,
X14 Local storage, X15 IE persistence.

H(X|Y)  X1     X2     X3     X4     X5     X6     X7     X8     X9     X10    X11    X12    X13    X14    X15
X1      –      0      5.13   10.59  8.88   10.66  10.74  12.22  0.91   4.3    7.13   10.21  12.21  12.21  11.95
X2      0.01   –      5.13   10.60  8.89   10.67  10.75  12.23  0.91   4.3    7.14   10.22  12.23  12.23  11.97
X3      0.13   0.13   –      5.73   4.65   5.65   5.83   7.36   0.58   2.71   4.24   6.68   7.22   7.22   6.96
X4      0.13   0.13   0.26   –      0.57   0.74   1.18   2.04   0.23   0.91   0.79   1.94   1.98   1.98   1.93
X5      0.05   0.05   0.82   2.2    –      1.92   2.36   3.64   0.24   1.44   2.08   3.13   3.54   3.54   3.31
X6      0.05   0.04   0.03   0.57   0.13   –      0.63   1.88   0.08   0.82   0.83   1.81   1.82   1.81   1.59
X7      0.05   0.04   0.13   0.94   0.49   0.55   –      1.78   0.12   0.79   0.94   1.75   1.75   1.75   1.67
X8      0.01   0.01   0.15   0.28   0.26   0.28   0.27   –      0.04   0.13   0.24   0.28   0.3    0.3    0.3
X9      0.17   0.16   4.83   9.94   8.33   9.96   10.07  11.51  –      3.6    6.79   9.73   11.52  11.52  11.21
X10     0.17   0.17   3.58   7.24   6.14   7.31   7.36   8.21   0.22   –      5.16   7.09   8.23   8.23   7.95
X11     0.13   0.12   2.23   4.24   3.9    4.44   4.63   5.44   0.53   2.28   –      5.12   5.36   5.36   5.33
X12     0.01   0.01   1.47   2.19   1.76   2.23   2.24   2.29   0.27   1.02   1.92   –      2.28   2.28   2.3
X13     0.01   0.01   0.01   0.23   0.16   0.22   0.23   0.3    0.05   0.15   0.16   0.27   –      0.01   0.28
X14     0.01   0.01   0.01   0.22   0.15   0.22   0.23   0.29   0.05   0.14   0.16   0.27   0.01   –      0.28
X15     0.01   0.01   0.01   0.44   0.19   0.26   0.41   0.56   0.01   0.13   0.39   0.55   0.54   0.54   –

Table 7.5: Conditional entropy between pairs of fingerprint attributes

Finally, Table 7.5 shows the entropy of every fingerprint attribute when the
value of another attribute is known (i.e. conditional entropy). The columns
in the table correspond to the known attribute Y, while the rows indicate the
attribute X whose entropy we want to know, given a known value of Y. As can
be seen in the table, there is no uncertainty for the attribute X-Real-IP when
the header X-Forwarded-For is known. This is because the value of the former is
always included in the value of the latter. In addition, many fingerprint attributes
provide little information beyond that provided by others (e.g. there is not much
uncertainty for the time zone attribute when the value of the header
X-Forwarded-For is known).
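
The relation between the pairwise values H(X, Y) and the conditional entropy can be sketched as follows, using the identity H(X|Y) = H(X, Y) - H(Y). This is a minimal illustration that assumes entropy in bits; the function names and the toy IP values are hypothetical and do not come from the dataset.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of hashable observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def joint_entropy(xs, ys):
    """H(X, Y): entropy of the paired observations (x, y)."""
    return entropy(list(zip(xs, ys)))


def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y): uncertainty left in X once Y is known."""
    return joint_entropy(xs, ys) - entropy(ys)


# Hypothetical toy sample: every X-Forwarded-For value contains the X-Real-IP.
real_ip = ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"]
forwarded_for = ["1.1.1.1", "9.9.9.9, 1.1.1.1", "2.2.2.2", "3.3.3.3"]

print(conditional_entropy(real_ip, forwarded_for))  # X-Real-IP given X-Forwarded-For
print(conditional_entropy(forwarded_for, real_ip))  # X-Forwarded-For given X-Real-IP
```

In this toy sample, knowing X-Forwarded-For removes all uncertainty about X-Real-IP, while the opposite direction leaves some uncertainty, which mirrors the asymmetry between H(X1|X2) and H(X2|X1) in Table 7.5.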

7.1.4 Verify Data Quality Task


The study of the dataset used in our experiment shows that the data is diverse
enough to perform model training, from the point of view of records per unique
user (from 1 to 389), web browsers and countries of origin. In addition, as shown
by the study of the entropy, no single variable is enough by itself to determine
unique users, and neither is any combination of variables. Therefore, the dataset
will be useful for stressing the model in order to demonstrate its classification
power.

7.2 Data Preparation Activity
This activity consists of the ordered execution of the following tasks:

1. The Select Data task consists in deciding the data to be used for the
analysis, removing from the dataset the fingerprint records that may lead
to deficiencies in the model resulting from the learning phase. This task is
described in Section 7.2.1.

2. The Clean Data task consists in cleansing the dataset in order to ensure
that it contains activity records corresponding to human agents uniquely
identified. This task is described in Section 7.2.2.

3. The Construct Data task consists in performing data transformations on the
values of some of the fingerprint attributes gathered. This task is described
in Section 7.2.3.

7.2.1 Select Data Task


As the goal of this technique is to uniquely identify users from web activity
records, the records used for model learning must contain the activity of users
uniquely identified, i.e. the dataset must not contain activity records assigned to
multiple identifiers that correspond to the same user.
Additionally, the dataset must not contain non-human activity (i.e. records
generated by robots).
Therefore, the activity corresponding to users with multiple identifiers and
the activity generated by non-human agents must be removed. This is performed
in the task described next.

7.2.2 Clean Data Task


This task cleans the dataset in order to satisfy the selection criteria identified in
the previous section.
As the users in the dataset used in the experiment conducted in this thesis
have been collected using the technique based on cookies, users may have been
identified more than once, due to the problems identified in Section 2.2.2.1. To

Search Engine   User-Agent
Google          Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Bing            Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Yahoo!          Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Table 7.6: User-Agent values for Google, Bing, and Yahoo! robots

deal with this issue, we have only taken into account the users identified before
the data-gathering period, removing the activity records of those users that were
first identified after the initial gathering date. To do so, all the activity
records related to a cookie created after the initial date have been removed.
This data cleansing action allows us to evaluate the performance of the technique
with respect to a gold standard based on cookies that does not include multiple
identifiers for single users.
Finally, since the technique is focused on users, this task filters out the activity
generated by web crawlers. To do this, it discards the activity records with a
User-Agent header whose value is recognised as a robot. For example, Table 7.6
shows the values corresponding to Google, Bing, and Yahoo! robots.
Since not all robots are identified through the User-Agent header, this tech-
nique implements an additional mechanism that consists in filtering the activity
produced by agents that perform more than 3 requests every 0.5 seconds. To de-
termine that two records are from the same agent, such records must be identical.
In the experiment conducted in this thesis, by using this method, we have
filtered 73 records produced by crawlers.
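
The two filters described above can be sketched as follows. This is only an illustrative sketch under several assumptions that are not stated in the text: records are dictionaries with a `timestamp` field in seconds and a `user_agent` field; robots are recognised by substring matching against the markers of Table 7.6; two records belong to the same agent when all their fields except the timestamp are identical; and "more than 3 requests every 0.5 seconds" is read as four or more requests within any 0.5-second window. All names are hypothetical.

```python
from collections import defaultdict

ROBOT_MARKERS = ("Googlebot", "bingbot", "Yahoo! Slurp")  # taken from Table 7.6
MAX_REQUESTS = 3        # more than this many requests...
WINDOW_SECONDS = 0.5    # ...within this time window marks an agent as a robot


def is_declared_robot(user_agent):
    """True if the User-Agent value contains a known robot marker."""
    return any(marker in user_agent for marker in ROBOT_MARKERS)


def agent_key(record):
    """Records with identical attribute values (timestamp aside) share a key."""
    return tuple(sorted((k, v) for k, v in record.items() if k != "timestamp"))


def find_fast_agents(records):
    """Return the keys of agents exceeding MAX_REQUESTS in some window."""
    stamps_by_agent = defaultdict(list)
    for record in records:
        stamps_by_agent[agent_key(record)].append(record["timestamp"])
    fast = set()
    for key, stamps in stamps_by_agent.items():
        stamps.sort()
        # Compare each request with the one MAX_REQUESTS positions earlier.
        for i in range(MAX_REQUESTS, len(stamps)):
            if stamps[i] - stamps[i - MAX_REQUESTS] <= WINDOW_SECONDS:
                fast.add(key)
                break
    return fast


def remove_robots(records):
    """Drop records from declared robots and from too-fast agents."""
    fast = find_fast_agents(records)
    return [r for r in records
            if not is_declared_robot(r["user_agent"]) and agent_key(r) not in fast]


# Hypothetical toy records.
human = {"user_agent": "Mozilla/5.0 (X11; Linux)", "ip": "10.0.0.1"}
bot = {"user_agent": "Mozilla/5.0 (compatible; bingbot/2.0; "
                     "+http://www.bing.com/bingbot.htm)", "ip": "10.0.0.2"}
fast_agent = {"user_agent": "unknown agent", "ip": "10.0.0.3"}
records = ([dict(human, timestamp=t) for t in (0.0, 5.0)]
           + [dict(bot, timestamp=1.0)]
           + [dict(fast_agent, timestamp=t) for t in (2.0, 2.1, 2.2, 2.3)])
kept = remove_robots(records)
print([r["ip"] for r in kept])
```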

7.2.3 Construct Data Task


For each fingerprint record, this task stores the attribute values within a database,
according to the format explained in Section 7.1.2. We apply a compression
function to the values of the Plugins and Fonts attributes, so they can be included
within the parameters of the HTTP GET requests [Fielding and Reschke, 2014b]
that are sent from the browser to the tracking server, as the data obtained from
such attributes can be extensive. The compression function used by this technique
is the cryptographic hash function SHA-1 [Eastlake and Jones, 2001]. Similar
one-way functions could be applied to other attributes to avoid persisting personal
data (e.g. IP addresses, time zones, etc.), thus helping to protect users' privacy.
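
A minimal sketch of such a compression step is shown below, using the SHA-1 implementation of the Python standard library. The way in which the individual plugin or font names are joined before hashing (newline separator, UTF-8 encoding, original order) is an assumption made for illustration; the text does not specify it. The function name and the sample values are hypothetical.

```python
import hashlib


def compress_attribute(values):
    """Reduce a list of plugin or font names to a 40-character SHA-1 digest."""
    joined = "\n".join(values)  # assumption: newline-separated, original order
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# Hypothetical plugin list reported by a browser.
plugins = ["Shockwave Flash 11.0", "QuickTime Plug-in 7.7", "Java Applet Plug-in"]
digest = compress_attribute(plugins)
print(len(digest), digest)
```

Whatever the length of the original list, the digest has a fixed length of 40 hexadecimal characters, so it fits easily in a query-string parameter. Since the digest only supports equality tests, this choice is consistent with the equality-based similarity used for these attributes.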

7.3 Modelling Activity
This activity consists in the ordered execution of the following tasks:

1. The Select Modelling Technique task consists in selecting and describing
a modelling technique to be applied for unique user identification purposes.
This task is described in Section 7.3.1.

2. The Generate Test Design task consists in defining the approach followed
for evaluating the technique. This task is described in Section 7.3.2.

3. The Build Model task consists in learning the model used for identifying
unique users. This task is described in Section 7.3.3.

Next, each of these tasks is described.

7.3.1 Select Modelling Technique Task


This section describes the classification approach (i.e. the modelling technique)
used for unique user identification.
We have adapted the early binding algorithm introduced in Section 2.3.1.
The input of this algorithm is a sequence of fingerprints R ordered by ascending
timestamp. The output of the algorithm is a set of clusters 𝒞, in which each
cluster C ∈ 𝒞 includes the fingerprints in R identified as belonging to the same
browser. Listing 7.8 formalises the proposed algorithm. The steps executed in
the algorithm are explained next.

1. Firstly, we initialise the set of clusters 𝒞 to the empty set (line 3).

2. Next, for each fingerprint r_i we calculate the maximum similarity between
such fingerprint and each cluster C_j generated so far (line 5). Similarity
computation between clusters and fingerprints is explained in Section 7.3.1.2.

(a) If the maximum similarity is greater than or equal to a threshold θ, then
there exists a cluster C to which we can add the fingerprint r_i that is
being processed, so we execute the following steps (lines 6-8):
i. Obtain the cluster that is most similar to the fingerprint (line 6).
ii. Add the fingerprint to such cluster (line 7).
iii. Update the cluster signature (line 8). Such signature is used to
compare candidate fingerprints with the cluster. Section 7.3.1.1
describes the steps that must be followed for updating the signature.

(b) If the maximum similarity is less than the threshold θ, then there does
not exist a cluster C to which we can add the fingerprint, so we execute
the following steps (lines 10-12):
i. Create a new cluster C and add the fingerprint r_i to it (line 10).
ii. Add the cluster C to the set of clusters 𝒞 (line 11).
iii. Generate a new signature for cluster C (line 12). Section 7.3.1.1
describes the steps that must be followed for creating the signature.

3. Finally, the set of clusters 𝒞 is returned (line 15).

 1 function ClusterFingerprints(R)
 2 begin
 3   𝒞 ⇐ ∅
 4   for each r_i ∈ R do
 5     if max_{C_j ∈ 𝒞} sim(r_i, C_j) ≥ θ then
 6       C ⇐ arg max_{C_j ∈ 𝒞} sim(r_i, C_j)
 7       C ⇐ C ∪ {r_i}
 8       UpdateSignature(r_i, C)
 9     else
10       C ⇐ {r_i}
11       𝒞 ⇐ 𝒞 ∪ {C}
12       CreateSignature(r_i, C)
13     end if
14   end for
15   return 𝒞
16 end

Listing 7.8: Algorithm for clustering fingerprints of the same browser
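
The control flow of Listing 7.8 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiment: the similarity and signature operations are passed in as functions, and the toy versions used in the example (fraction of equal attributes, signature equal to the last fingerprint) are hypothetical simplifications of the operations described in Sections 7.3.1.1 and 7.3.1.2.

```python
def cluster_fingerprints(records, sim, create_signature, update_signature, theta):
    """Assign each fingerprint to its most similar cluster, or open a new one.

    `records` is assumed to be sorted by ascending timestamp.
    """
    clusters = []
    for r in records:
        # Most similar existing cluster, or None if no cluster exists yet.
        best = max(clusters, key=lambda c: sim(r, c), default=None)
        if best is not None and sim(r, best) >= theta:
            best["members"].append(r)
            update_signature(r, best)
        else:
            new_cluster = {"members": [r]}
            clusters.append(new_cluster)
            create_signature(r, new_cluster)
    return clusters


# Toy helpers, hypothetical and for illustration only.
def toy_sim(r, cluster):
    """Fraction of attributes (timestamp aside) equal to the cluster signature."""
    keys = [k for k in r if k != "t"]
    return sum(r[k] == cluster["signature"][k] for k in keys) / len(keys)


def toy_set_signature(r, cluster):
    """Use the attribute values of the last fingerprint as the signature."""
    cluster["signature"] = {k: v for k, v in r.items() if k != "t"}


records = [
    {"t": 0, "ua": "A", "tz": "+1", "lang": "es"},
    {"t": 1, "ua": "B", "tz": "-6", "lang": "en"},
    {"t": 2, "ua": "A2", "tz": "+1", "lang": "es"},
]
clusters = cluster_fingerprints(records, toy_sim, toy_set_signature,
                                toy_set_signature, theta=0.6)
print([len(c["members"]) for c in clusters])
```

In the example, the third fingerprint differs from the first one only in its User-Agent value, so its similarity to the first cluster (2/3) exceeds the threshold and it joins that cluster instead of opening a new one.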

7.3.1.1 Cluster Signature

The signature of a cluster allows computing the similarity between the cluster
and the candidate fingerprints for inclusion in it. Such signature is a
tuple (V, T_e, T_l), in which:
• V = (C.X_1, ..., C.X_i, ..., C.X_15) is a sequence in which each component
corresponds to the value observed for the attribute X_i ∈ 𝒳 in the last
fingerprint added to the cluster C, where 𝒳 is the set of attributes.

• T_e = (t_e(C.X_1), ..., t_e(C.X_i), ..., t_e(C.X_15)) is a sequence in which each
component corresponds to the timestamp of the first observation of the value
C.X_i for the attribute X_i within a fingerprint added to the cluster C.

• T_l = (t_l(C.X_1), ..., t_l(C.X_i), ..., t_l(C.X_15)) is a sequence in which each
component corresponds to the timestamp of the last observation of the value
C.X_i for the attribute X_i within a fingerprint added to the cluster C.
Next, the operations for creating and updating cluster signatures are
described.

Signature creation. Listing 7.9 details the operation for creating a cluster
signature. The inputs of this operation are the fingerprint r and the cluster C,
whose signature will be created from r. When this operation is executed,
the cluster C contains only the fingerprint r. Thus, the first time that the
value of an attribute X_i is observed for the cluster C (i.e. t_e(C.X_i))
corresponds to the timestamp of fingerprint creation r.t, as does the
last time that the value of an attribute X_i is observed for the cluster C (i.e.
t_l(C.X_i)).

Signature updating. Listing 7.10 details the operation for updating a cluster
signature. The inputs of this operation are the fingerprint r and the cluster
C, whose signature we want to update from r. In this operation, for each
fingerprint attribute X_i we execute the following steps (lines 4-13):

1. Compute the similarity s between the fingerprint attribute value r.X_i and
the cluster attribute value C.X_i (line 4). The similarity computation between
attributes is defined in Section 7.3.1.2.

1 procedure CreateSignature(r, C)
2 begin
3   for each X_i ∈ 𝒳 do
4     t_e(C.X_i) ⇐ r.t
5     t_l(C.X_i) ⇐ r.t
6     C.X_i ⇐ r.X_i
7   end for
8 end

Listing 7.9: Operation for creating a cluster signature

(a) If the similarity s is less than a threshold θ_l, we consider that the
value of the attribute has changed. Thus, we assign the timestamp
of fingerprint creation r.t to t_e(C.X_i) (lines 5-6). We have
considered θ_l = 0.5 in our experiment.
(b) If the similarity s is greater than a threshold θ_h, we consider that
the value of the attribute has not changed. Thus, we maintain the
value of t_e(C.X_i) (lines 7-8). We have considered θ_h = 0.9 in our
experiment.
(c) If θ_l ≤ s ≤ θ_h, we consider that the attribute maintains its value
with probability s, so it changes its value with probability 1 − s.
Thus, we estimate the instant of time in which the attribute
changed its value by combining the timestamp of the current attribute
value with the fingerprint creation timestamp, as shown in line 10.
2. Assign the fingerprint creation timestamp r.t to t_l(C.X_i) (line 12).
3. Assign the attribute value X_i of fingerprint r (i.e. r.X_i) to the cluster
signature for the attribute X_i (i.e. C.X_i) (line 13).

 1 procedure UpdateSignature(r, C)
 2 begin
 3   for each X_i ∈ 𝒳 do
 4     s ⇐ sim(r.X_i, C.X_i)
 5     if s < θ_l then
 6       t_e(C.X_i) ⇐ r.t
 7     else if s > θ_h then
 8       (∗ The previous value of t_e(C.X_i) is maintained ∗)
 9     else
10       t_e(C.X_i) ⇐ s · t_e(C.X_i) + (1 − s) · r.t
11     end if
12     t_l(C.X_i) ⇐ r.t
13     C.X_i ⇐ r.X_i
14   end for
15 end

Listing 7.10: Operation for updating a cluster signature
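
The two signature operations can be sketched in Python as follows, mirroring Listings 7.9 and 7.10 with the thresholds θ_l = 0.5 and θ_h = 0.9 used in the experiment. The data layout (a cluster as a dictionary holding `value`, `t_e` and `t_l` maps, a fingerprint as a dictionary with a timestamp `t`) and the toy attribute similarity are assumptions made for illustration.

```python
THETA_L = 0.5  # below this similarity the attribute value is considered changed
THETA_H = 0.9  # above this similarity the attribute value is considered unchanged


def create_signature(r, cluster, attributes):
    """Initialise the signature from the only fingerprint in the cluster."""
    cluster["value"] = {x: r[x] for x in attributes}
    cluster["t_e"] = {x: r["t"] for x in attributes}  # first observation
    cluster["t_l"] = {x: r["t"] for x in attributes}  # last observation


def update_signature(r, cluster, attributes, sim_attr):
    """Update the signature after adding fingerprint r to the cluster."""
    for x in attributes:
        s = sim_attr(x, r[x], cluster["value"][x])
        if s < THETA_L:
            cluster["t_e"][x] = r["t"]            # value changed
        elif s > THETA_H:
            pass                                  # t_e is maintained
        else:
            # Value kept with probability s, changed with probability 1 - s.
            cluster["t_e"][x] = s * cluster["t_e"][x] + (1 - s) * r["t"]
        cluster["t_l"][x] = r["t"]
        cluster["value"][x] = r[x]


# Hypothetical attribute similarity, used only to exercise the three branches.
def toy_sim_attr(x, a, b):
    if a == b:
        return 1.0
    return 0.7 if x == "ua" else 0.0


ATTRIBUTES = ("ua", "tz", "lang")
cluster = {}
create_signature({"t": 0.0, "ua": "A", "tz": "+1", "lang": "es"}, cluster, ATTRIBUTES)
update_signature({"t": 10.0, "ua": "A2", "tz": "+1", "lang": "en"},
                 cluster, ATTRIBUTES, toy_sim_attr)
print(cluster["t_e"], cluster["t_l"])
```

In the example, the time zone keeps its first-observation timestamp, the language is reset to the timestamp of the new fingerprint, and the User-Agent value receives an interpolated timestamp between both.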

7.3.1.2 Similarity Computation

The similarity between a fingerprint r and a cluster C is calculated as the weighted
average of the similarities between the values of each of the attributes in the
fingerprint and the values of the same attributes in the signature of the cluster
(see Equation 7.1). Section 7.3.1.3 explains the different alternatives for obtaining
the weights w_X.

$$ \mathrm{sim}(r, C) = \frac{\sum_{X \in \mathcal{X}} \mathrm{sim}(r.X, C.X) \cdot w_X}{\sum_{X \in \mathcal{X}} w_X} \tag{7.1} $$

With respect to similarity between fingerprint attribute values and cluster
signature attribute values, the similarity measure considered for most of
the attributes is the equality (see Equation 7.2).

$$ \mathrm{sim}(r.X, C.X) = \begin{cases} 1 & r.X = C.X \\ 0 & r.X \neq C.X \end{cases} \tag{7.2} $$

When the fingerprint attribute being compared is X_2 (i.e. X-Forwarded-For
header), we apply the index proposed by Jaccard [1901] for measuring the
similarity between sets (see Equation 7.3), as suggested by Li et al. [2011] for
multivalued attributes.

$$ \mathrm{sim}(r.X_2, C.X_2) = \mathrm{Jaccard}(r.X_2, C.X_2) = \frac{|r.X_2 \cap C.X_2|}{|r.X_2 \cup C.X_2|} \tag{7.3} $$

Finally, if the fingerprint attribute being compared is X_3 (i.e. User-Agent
header), we apply a similarity calculated from the normalized Levenshtein [1966]
distance shown in Equation 7.4. Such distance is appropriate for the attribute
X_3, because the value of the User-Agent header changes slightly over time, due
to browser or operating system version updates.

$$ \mathrm{sim}(r.X_3, C.X_3) = 1 - \frac{\mathrm{Levenshtein}(r.X_3, C.X_3)}{\max_{v \in \{r.X_3, C.X_3\}} \mathrm{length}(v)} \tag{7.4} $$
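
The similarity measures of Equations 7.1 to 7.4 can be sketched as follows. The sketch is illustrative: the Levenshtein distance is implemented directly with the standard dynamic-programming recurrence, the handling of empty values (similarity 1) is an assumption, and the function names and the `sim_by_attribute` mapping are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def sim_equality(a, b):
    """Equation 7.2: 1 if both values are equal, 0 otherwise."""
    return 1.0 if a == b else 0.0


def sim_jaccard(a, b):
    """Equation 7.3: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # assumption for empty sets


def sim_levenshtein(a, b):
    """Equation 7.4: 1 minus the distance normalised by the longest length."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def sim_record(r, signature, weights, sim_by_attribute):
    """Equation 7.1: weighted average of the per-attribute similarities."""
    total = sum(sim_by_attribute.get(x, sim_equality)(r[x], signature[x]) * w
                for x, w in weights.items())
    return total / sum(weights.values())


# Hypothetical values for three attributes, with uniform weights.
measures = {"X2": sim_jaccard, "X3": sim_levenshtein}
fingerprint = {"X2": {"1.1.1.1"}, "X3": "abcd", "X12": "+1"}
signature = {"X2": {"1.1.1.1", "2.2.2.2"}, "X3": "abcf", "X12": "+1"}
weights = {"X2": 1.0, "X3": 1.0, "X12": 1.0}
print(sim_record(fingerprint, signature, weights, measures))
```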

7.3.1.3 Attribute Weight Computation

The algorithm described above has been tested with four different variants. These
variants consist in using different weights for the similarity computation
between the values of the fingerprint attributes and the cluster signature
attributes. Next, the four variants are described.

Variant based on uniform weights. The first variant is the simplest one
and consists in assigning the same weight to all fingerprint attributes, as
shown by Equation 7.5.

$$ w_X = 1 \tag{7.5} $$

Variant based on attribute entropy. The second variant consists in using
the entropy of the attribute as the attribute weight, as shown by
Equation 7.6.

$$ w_X = H(X) \tag{7.6} $$

Variant based on time decay. The third variant takes into account attribute
agreement and disagreement decays. Equation 7.7 shows how to calculate the
attribute weight according to this variant for single-valued attributes (i.e.
all attributes with the exception of the X-Forwarded-For header).

$$ w_X = \begin{cases} 1 - d^{=}(X, \Delta t_l) & s > \theta_h \\ 1 - d^{\neq}(X, \Delta t_e) & s < \theta_l \\ 1 - s \cdot d^{=}(X, \Delta t_l) - (1 - s) \cdot d^{\neq}(X, \Delta t_e) & \theta_l < s < \theta_h \end{cases} \tag{7.7} $$

As defined by Li et al. [2011], given a similarity s = sim(r.X, C.X) between
two values of an attribute, with probability s the two values are
the same and we shall use the complement of the agreement decay as the
attribute weight. On the other hand, with probability 1 − s, the values are
different and we shall use the complement of the disagreement decay as the
attribute weight. Thus, the attribute weight is computed by combining the
complements of the agreement and disagreement decays. For high similarity values
(i.e. s > θ_h = 0.9), we only use the complement of the agreement decay, while
for low similarity values (i.e. s < θ_l = 0.5), we only use the complement of
the disagreement decay.
With respect to the time periods Δt used for computing the disagreement decay,
we take into account the time elapsed between fingerprint capturing r.t and
the first time that the current attribute value was observed in the cluster,
as shown in Equation 7.8.

$$ \Delta t_e = |r.t - t_e(C.X)| \tag{7.8} $$

On the other hand, for computing the agreement decay, we take into account
the time elapsed between r.t and the last time that the current attribute
value was observed in the cluster, as shown in Equation 7.9.

$$ \Delta t_l = |r.t - t_l(C.X)| \tag{7.9} $$

Finally, for the X-Forwarded-For header, we only take into account the agreement
decay, as explained by Li et al. [2011], since such header is a multivalued
attribute. Thus, in such case, we calculate the attribute weight as
shown in Equation 7.10.

$$ w_{X_2} = 1 - d^{=}(X_2, \Delta t_l) \tag{7.10} $$

Variant based on attribute entropy and time decay. The last variant takes
into account both attribute evolution and entropy. Therefore, the attribute
weights are obtained by multiplying the weight obtained according to the
previous variant by the attribute entropy, as shown in Equation 7.11 for
single-valued attributes.

$$ w_X = \begin{cases} H(X) \cdot (1 - d^{=}(X, \Delta t_l)) & s > \theta_h \\ H(X) \cdot (1 - d^{\neq}(X, \Delta t_e)) & s < \theta_l \\ H(X) \cdot (1 - s \cdot d^{=}(X, \Delta t_l) - (1 - s) \cdot d^{\neq}(X, \Delta t_e)) & \theta_l < s < \theta_h \end{cases} \tag{7.11} $$

Finally, for the X-Forwarded-For header we calculate the attribute weight
as shown in Equation 7.12.

$$ w_{X_2} = H(X_2) \cdot (1 - d^{=}(X_2, \Delta t_l)) \tag{7.12} $$
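
The four weighting variants can be sketched as a single function, as shown below. This is an illustrative sketch: the decay models are passed in as functions of the elapsed time, the constant toy decays used in the example are hypothetical, and similarities exactly equal to a threshold are treated as belonging to the intermediate case, which is an assumption since the equations above use strict inequalities.

```python
THETA_L = 0.5
THETA_H = 0.9


def decay_weight(s, agreement_decay, disagreement_decay, dt_l, dt_e,
                 multivalued=False):
    """Time-decay weight; multivalued attributes use only the agreement decay."""
    if multivalued or s > THETA_H:
        return 1.0 - agreement_decay(dt_l)
    if s < THETA_L:
        return 1.0 - disagreement_decay(dt_e)
    return 1.0 - s * agreement_decay(dt_l) - (1.0 - s) * disagreement_decay(dt_e)


def attribute_weight(variant, entropy, s, agreement_decay, disagreement_decay,
                     dt_l, dt_e, multivalued=False):
    """Weight w_X for one of the variants: uniform, entropy, decay, entropy_decay."""
    if variant == "uniform":
        return 1.0
    if variant == "entropy":
        return entropy
    w = decay_weight(s, agreement_decay, disagreement_decay, dt_l, dt_e, multivalued)
    if variant == "decay":
        return w
    if variant == "entropy_decay":
        return entropy * w
    raise ValueError(f"unknown variant: {variant}")


# Hypothetical constant decays, used only to make the arithmetic easy to follow.
def toy_agreement(dt):
    return 0.2


def toy_disagreement(dt):
    return 0.6


for s in (1.0, 0.7, 0.0):
    print(s, attribute_weight("decay", 2.0, s, toy_agreement, toy_disagreement, 5.0, 60.0))
```

With these toy decays, a fully similar value is weighted by the complement of the agreement decay, a fully dissimilar value by the complement of the disagreement decay, and an intermediate similarity by a mixture of both.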

7.3.2 Generate Test Design Task


The test designed consists in performing a 2-fold cross-validation with the gold
standard previously constructed. The gold standard consists of a corpus of
activity records with users identified by using the technique based on cookies.
We have ensured in the Clean Data task (see Section 7.2.2) that there is a
unique cookie that identifies every single user.
The evaluation results are discussed in Section 7.4.

7.3.3 Build Model Task


This task consists in learning the model used for unique user identification. It
consists of the following steps:

1. Obtain the entropy for each fingerprint attribute.

2. Obtain the evolution parameters (i.e. agreement decay and disagreement
decay) for each fingerprint attribute.

The result of applying Step 1 to the dataset has been shown in Table 7.3. The
results of Step 2 are described next.
We have implemented the algorithms described by Li et al. [2011] for learn-
ing agreement and disagreement decays. Once we have obtained the temporal
values of these probabilities for each attribute, we have performed simple regres-
sion analyses, obtaining explanatory models for the agreement and disagreement
decays.
Each model corresponds to a function d^p(X, Δt), where

• p is the type of decay (d^≠(X, Δt) for disagreement decay and d^=(X, Δt) for
agreement decay),

• X is the fingerprint attribute, and

• Δt is a time increment, such that Δt ∈ [0, ∞).

The time unit of measurement that we used in our experiment is the minute,
although we maintain a precision of five fractional digits for time units because
user activity timestamps are defined at the granularity of milliseconds.
In addition, each function d^p(X, Δt) complies with the properties defined by
Li et al. [2011] for agreement and disagreement decays:

• Any value of d^p(X, Δt) is defined within the interval [0, 1].

• d^p(X, Δt) is a monotonically increasing function.

Tables 7.7 and 7.8 show the disagreement and agreement decays, respectively, for
the fingerprint attributes.
The attributes with faster disagreement decays include the X-Real-IP and the
User-Agent headers. In the case of the X-Real-IP header, IP addresses [Postel,
1981] tend to change with DHCP (Dynamic Host Configuration Protocol) [Droms,
1997] assignments, mostly in mobile environments. In addition, browser versions

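Attribute          Disagreement decay

X-Real-IP          $d^{\neq}(X_1, \Delta t) = \begin{cases} 0.0033855 + 0.00348067\sqrt{\Delta t} & 0 < \Delta t < 1047.895444 \\ \sqrt{-0.23349 + 0.00721289\sqrt{\Delta t}} & 1047.895444 \le \Delta t < 29245.06883 \\ 1 & \Delta t \ge 29245.06883 \end{cases}$

X-Forwarded-For    N/A when the attribute is multivalued [Li et al., 2011]

User-Agent         $d^{\neq}(X_3, \Delta t) = \begin{cases} 0 & 0 < \Delta t < 133.8336005 \\ -0.0047762 + 0.0000356876\,\Delta t & 133.8336005 \le \Delta t < 28154.77084 \\ 1 & \Delta t \ge 28154.77084 \end{cases}$

Accept             $d^{\neq}(X_4, \Delta t) = \begin{cases} (0.279051 + 0.00387899\sqrt{\Delta t})^2 & \Delta t < 34543.93180 \\ 1 & \Delta t \ge 34543.93180 \end{cases}$

Accept-Language    $d^{\neq}(X_5, \Delta t) = \begin{cases} e^{-4.09968 + 3.23283 \cdot 10^{-9} \Delta t^2} & \Delta t < 35610.94887 \\ 1 & \Delta t \ge 35610.94887 \end{cases}$

Accept-Charset     $d^{\neq}(X_6, \Delta t) = \begin{cases} (0.0281337 + 4.35781 \cdot 10^{-10} \Delta t^2)^2 & \Delta t < 47224.69002 \\ 1 & \Delta t \ge 47224.69002 \end{cases}$

Accept-Encoding    $d^{\neq}(X_7, \Delta t) = \begin{cases} e^{-4.67385 + 0.000115853\,\Delta t} & \Delta t < 40342.93458 \\ 1 & \Delta t \ge 40342.93458 \end{cases}$

Cache-Control      $d^{\neq}(X_8, \Delta t) = \begin{cases} e^{-4.41392 + 0.0222535\sqrt{\Delta t}} & \Delta t < 39341.62217 \\ 1 & \Delta t \ge 39341.62217 \end{cases}$

Plugins            $d^{\neq}(X_9, \Delta t) = \begin{cases} (0.0835439 + 0.00570466\sqrt{\Delta t})^2 & \Delta t < 25808.56167 \\ 1 & \Delta t \ge 25808.56167 \end{cases}$

Fonts              $d^{\neq}(X_{10}, \Delta t) = \begin{cases} (0.012879 + 0.0605308 \ln \Delta t)^2 & 0 < \Delta t < 1102.506998 \\ \sqrt{-0.230069 + 0.00692895\sqrt{\Delta t}} & 1102.506998 \le \Delta t < 31515.49207 \\ 1 & \Delta t \ge 31515.49207 \end{cases}$

Video              $d^{\neq}(X_{11}, \Delta t) = \begin{cases} (0.233452 + 0.00417302\sqrt{\Delta t})^2 & \Delta t < 33742.54047 \\ 1 & \Delta t \ge 33742.54047 \end{cases}$

Time zone          $d^{\neq}(X_{12}, \Delta t) = \begin{cases} (0.12658 + 0.000024006\,\Delta t)^2 & \Delta t < 36383.40415 \\ 1 & \Delta t \ge 36383.40415 \end{cases}$

Session storage    $d^{\neq}(X_{13}, \Delta t) = \begin{cases} e^{-5.91823 + 0.0303167\sqrt{\Delta t}} & \Delta t < 38108.32197 \\ 1 & \Delta t \ge 38108.32197 \end{cases}$

Local storage      $d^{\neq}(X_{14}, \Delta t) = \begin{cases} e^{-6.04222 + 0.0306873\sqrt{\Delta t}} & \Delta t < 38768.20650 \\ 1 & \Delta t \ge 38768.20650 \end{cases}$

IE persistence     $d^{\neq}(X_{15}, \Delta t) = \begin{cases} e^{-6.29214 + 0.0331916\sqrt{\Delta t}} & \Delta t < 35936.88071 \\ 1 & \Delta t \ge 35936.88071 \end{cases}$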

Table 7.7: Disagreement decay of fingerprint attributes

tend to be updated frequently (e.g. Google Chrome updates itself automatically),
which changes the value of the User-Agent header. The attributes with slower
disagreement decays include the Accept* headers. These headers tend to be stable,
since they specify attributes such as the user language or the expected character
encoding. The agreement decays of most of the fingerprint attributes are linear,
with fast agreement decays (the agreement decay is 1 before the 3rd
minute). The agreement decays of the attributes X-Real-IP and X-Forwarded-For
grow even faster than the others, since the same IP address can be assigned
using NAT (Network Address Translator) [Egevang, 1994] to different machines at
the same time.

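Attribute          Agreement decay

X-Real-IP          $d^{=}(X_1, \Delta t) = \begin{cases} e^{-5.59227 + 0.550263 \ln \Delta t} & \Delta t < 25923.46755 \\ 1 & \Delta t \ge 25923.46755 \end{cases}$

X-Forwarded-For    $d^{=}(X_2, \Delta t) = \begin{cases} e^{-5.06924 + 0.498956 \ln \Delta t} & \Delta t < 25840.37422 \\ 1 & \Delta t \ge 25840.37422 \end{cases}$

User-Agent         $d^{=}(X_3, \Delta t) = \begin{cases} e^{-10.1715 + 0.993395 \ln \Delta t} & \Delta t < 27976.76015 \\ 1 & \Delta t \ge 27976.76015 \end{cases}$

Accept             $d^{=}(X_4, \Delta t) = \begin{cases} 0.000033494 + 0.000033083\,\Delta t & \Delta t < 30225.99238 \\ 1 & \Delta t \ge 30225.99238 \end{cases}$

Accept-Language    $d^{=}(X_5, \Delta t) = \begin{cases} 0.0000347399 + 0.0000330993\,\Delta t & \Delta t < 30211.06972 \\ 1 & \Delta t \ge 30211.06972 \end{cases}$

Accept-Charset     $d^{=}(X_6, \Delta t) = \begin{cases} 0.0000339481 + 0.0000330917\,\Delta t & \Delta t < 30218.03207 \\ 1 & \Delta t \ge 30218.03207 \end{cases}$

Accept-Encoding    $d^{=}(X_7, \Delta t) = \begin{cases} 0.0000333693 + 0.0000330786\,\Delta t & \Delta t < 30230.01671 \\ 1 & \Delta t \ge 30230.01671 \end{cases}$

Cache-Control      $d^{=}(X_8, \Delta t) = \begin{cases} -0.0112165 + 0.00005847\,\Delta t - 8.0021 \cdot 10^{-10} \Delta t^2 & \Delta t < 28104.69216 \\ 1 & \Delta t \ge 28104.69216 \end{cases}$

Plugins            $d^{=}(X_9, \Delta t) = \begin{cases} e^{-10.1405 + 0.98999 \ln \Delta t} & \Delta t < 28086.17546 \\ 1 & \Delta t \ge 28086.17546 \end{cases}$

Fonts              $d^{=}(X_{10}, \Delta t) = \begin{cases} 0.0000332742 + 0.0000330753\,\Delta t & \Delta t < 30233.03570 \\ 1 & \Delta t \ge 30233.03570 \end{cases}$

Video              $d^{=}(X_{11}, \Delta t) = \begin{cases} 0.0000673283 + 0.0000331751\,\Delta t & \Delta t < 30141.05976 \\ 1 & \Delta t \ge 30141.05976 \end{cases}$

Time zone          $d^{=}(X_{12}, \Delta t) = \begin{cases} 0.0000337212 + 0.0000337212\,\Delta t & \Delta t < 30222.05737 \\ 1 & \Delta t \ge 30222.05737 \end{cases}$

Session storage    $d^{=}(X_{13}, \Delta t) = \begin{cases} 0.0000331289 + 0.0000330721\,\Delta t & \Delta t < 30235.96539 \\ 1 & \Delta t \ge 30235.96539 \end{cases}$

Local storage      $d^{=}(X_{14}, \Delta t) = \begin{cases} 0.0000331289 + 0.0000330721\,\Delta t & \Delta t < 30235.96539 \\ 1 & \Delta t \ge 30235.96539 \end{cases}$

IE persistence     $d^{=}(X_{15}, \Delta t) = \begin{cases} 0.0000331737 + 0.0000330731\,\Delta t & \Delta t < 30235.04982 \\ 1 & \Delta t \ge 30235.04982 \end{cases}$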

Table 7.8: Agreement decay of fingerprint attributes

Next, the agreement and disagreement decays are described for the fingerprint
attributes used in this work.

7.3.3.1 X-Real-IP Header

Figure 7.7 shows the values learned for the disagreement decay of the X-Real-IP
header in blue, while the regression model obtained is shown in green. The
regression function is not defined for the interval [0, 1047.89544) because, for
values in this interval, the radicand produces negative numbers. For this interval,
the model shown in Figure 7.8 has been obtained by performing an additional
regression specifically for the interval [0, 1047.89544). Joining both models, the
disagreement decay of the X-Real-IP header is described by Equation 7.13.

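$$ d^{\neq}(X_1, \Delta t) = \begin{cases} 0.0033855 + 0.00348067\sqrt{\Delta t} & 0 < \Delta t < 1047.895444 \\ \sqrt{-0.23349 + 0.00721289\sqrt{\Delta t}} & 1047.895444 \le \Delta t < 29245.06883 \\ 1 & \Delta t \ge 29245.06883 \end{cases} \tag{7.13} $$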
Figure 7.9 shows the model learned for the agreement decay of the X-Real-IP
header, which is described by Equation 7.14.

$$ d^{=}(X_1, \Delta t) = \begin{cases} e^{-5.59227 + 0.550263 \ln \Delta t} & \Delta t < 25923.46755 \\ 1 & \Delta t \ge 25923.46755 \end{cases} \tag{7.14} $$

Figure 7.7: Disagreement decay for the X-Real-IP header (second interval)

Figure 7.8: Disagreement decay for the X-Real-IP header (first interval)

Figure 7.9: Agreement decay for the X-Real-IP header

7.3.3.2 X-Forwarded-For Header

The X-Forwarded-For header is a multivalued attribute (i.e. it contains multiple
IP addresses), differing from the rest of the fingerprint attributes, which are
single-valued (i.e. they only contain one value per attribute).
As stated by Li et al. [2011], for multivalued attributes only agreement decay
must be learned due to the following reasons: (i) having different values for such
attributes does not indicate record un-match, and (ii) sharing the same value for
such attributes is additional evidence for record match.
Therefore, for the X-Forwarded-For header we have only learned its agreement
decay. Figure 7.10 shows the model learned, which is described by Equation 7.15.

Figure 7.10: Agreement decay for the X-Forwarded-For header


$$ d^{=}(X_2, \Delta t) = \begin{cases} e^{-5.06924 + 0.498956 \ln \Delta t} & \Delta t < 25840.37422 \\ 1 & \Delta t \ge 25840.37422 \end{cases} \tag{7.15} $$
As can be observed, the model is quite similar to the one corresponding to
the agreement decay for the X-Real-IP header, because the value of the X-Real-IP
header is always included within the values of the X-Forwarded-For header,
and, in most cases, the X-Forwarded-For header includes a unique value that
corresponds to the value of the X-Real-IP header. In addition, as shown in
Table 7.5, the values of their conditional entropy are very low: H(X1|X2) = 0
and H(X2|X1) = 0.01.

7.3.3.3 User-Agent Header

Figure 7.11 shows the model learned for the disagreement decay for the User-Agent
header. As can be seen, this header changes in a linear fashion, more slowly
than the X-Real-IP header. Therefore, it is a more stable fingerprint attribute.
Equation 7.16 describes the disagreement decay of the User-Agent header.



$$ d^{\neq}(X_3, \Delta t) = \begin{cases} 0 & 0 < \Delta t < 133.8336005 \\ -0.0047762 + 0.0000356876\,\Delta t & 133.8336005 \le \Delta t < 28154.77084 \\ 1 & \Delta t \ge 28154.77084 \end{cases} \tag{7.16} $$

Figure 7.11: Disagreement decay for the User-Agent header

Figure 7.12: Agreement decay for the User-Agent header

Figure 7.12 shows the model learned for the agreement decay of the User-
Agent header, which is described by Equation 7.17.

$$ d^{=}(X_3, \Delta t) = \begin{cases} e^{-10.1715 + 0.993395 \ln \Delta t} & \Delta t < 27976.76015 \\ 1 & \Delta t \ge 27976.76015 \end{cases} \tag{7.17} $$
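
As an illustration of how the learned decay models can be evaluated, the following sketch implements the two piecewise functions of the User-Agent header (Equations 7.16 and 7.17), with Δt expressed in minutes. The clamping of the results to [0, 1] and the value returned for Δt = 0 are assumptions added to guard against rounding at the interval boundaries; the function names are hypothetical.

```python
import math


def clamp01(value):
    """Keep a decay value inside [0, 1] (assumed safeguard against rounding)."""
    return min(1.0, max(0.0, value))


def ua_disagreement_decay(dt):
    """Disagreement decay of the User-Agent header (Equation 7.16)."""
    if dt < 133.8336005:
        return 0.0
    if dt < 28154.77084:
        return clamp01(-0.0047762 + 0.0000356876 * dt)
    return 1.0


def ua_agreement_decay(dt):
    """Agreement decay of the User-Agent header (Equation 7.17)."""
    if dt <= 0:
        return 0.0  # assumption: ln is undefined at 0, so no decay is applied
    if dt < 27976.76015:
        return clamp01(math.exp(-10.1715 + 0.993395 * math.log(dt)))
    return 1.0


# Decay values after one hour, one day, one week, and three weeks (in minutes).
for minutes in (60, 1440, 10080, 30240):
    print(minutes, ua_disagreement_decay(minutes), ua_agreement_decay(minutes))
```

Evaluating both functions on an increasing grid of Δt values is a simple way to check the two properties required of decay functions: values within [0, 1] and monotonic growth.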

7.3.3.4 Accept Header

Figure 7.13 shows the model learned for the disagreement decay of the Accept
header, which is described by Equation 7.18.

$$ d^{\neq}(X_4, \Delta t) = \begin{cases} (0.279051 + 0.00387899\sqrt{\Delta t})^2 & \Delta t < 34543.93180 \\ 1 & \Delta t \ge 34543.93180 \end{cases} \tag{7.18} $$

Figure 7.14 shows the model learned for the agreement decay of the Accept
header, which is described by Equation 7.19.


$$ d^{=}(X_4, \Delta t) = \begin{cases} 0.000033494 + 0.000033083\,\Delta t & \Delta t < 30225.99238 \\ 1 & \Delta t \ge 30225.99238 \end{cases} \tag{7.19} $$

Figure 7.13: Disagreement decay for the Accept header

Figure 7.14: Agreement decay for the Accept header

7.3.3.5 Accept-Language Header

Figure 7.15 shows the model learned for the disagreement decay of the
Accept-Language header, which is described by Equation 7.20. As can be seen in the
figure, this disagreement decay grows very slowly (a browser is very unlikely
to change the language it requests from web servers).

$$ d^{\neq}(X_5, \Delta t) = \begin{cases} e^{-4.09968 + 3.23283 \cdot 10^{-9} \Delta t^2} & \Delta t < 35610.94887 \\ 1 & \Delta t \ge 35610.94887 \end{cases} \tag{7.20} $$

Figure 7.16 shows the model learned for the agreement decay of the Accept-
Language header, which is described by Equation 7.21.

Figure 7.15: Disagreement decay for the Accept-Language header

Figure 7.16: Agreement decay for the Accept-Language header


$$ d^{=}(X_5, \Delta t) = \begin{cases} 0.0000347399 + 0.0000330993\,\Delta t & \Delta t < 30211.06972 \\ 1 & \Delta t \ge 30211.06972 \end{cases} \tag{7.21} $$

7.3.3.6 Accept-Charset Header

Figure 7.17 shows the model learned for the disagreement decay of the Accept-
Charset header, which is described by Equation 7.22. As happened with the
previous header, such disagreement decay grows very slowly.

Figure 7.17: Disagreement decay for the Accept-Charset header

Figure 7.18: Agreement decay for the Accept-Charset header


\[
d^{\neq}(X_6, \Delta t) =
\begin{cases}
(0.0281337 + 4.35781 \cdot 10^{-10} \, \Delta t^2)^2 & \Delta t < 47224.69002 \\
1 & \Delta t \geq 47224.69002
\end{cases}
\tag{7.22}
\]

Figure 7.18 shows the model learned for the agreement decay of the Accept-
Charset header, which is described by Equation 7.23.


\[
d^{=}(X_6, \Delta t) =
\begin{cases}
0.0000339481 + 0.0000330917 \, \Delta t & \Delta t < 30218.03207 \\
1 & \Delta t \geq 30218.03207
\end{cases}
\tag{7.23}
\]

7.3.3.7 Accept-Encoding Header

Figure 7.19 shows the model learned for the disagreement decay of the
Accept-Encoding header, which is described by Equation 7.24. As with the
Accept-Language and Accept-Charset headers, this disagreement decay grows very
slowly.

\[
d^{\neq}(X_7, \Delta t) =
\begin{cases}
e^{-4.67385 + 0.000115853 \, \Delta t} & \Delta t < 40342.93458 \\
1 & \Delta t \geq 40342.93458
\end{cases}
\tag{7.24}
\]

Figure 7.20 shows the model learned for the agreement decay of the Accept-
Encoding header, which is described by Equation 7.25.

Figure 7.19: Disagreement decay for the Accept-Encoding header

Figure 7.20: Agreement decay for the Accept-Encoding header


\[
d^{=}(X_7, \Delta t) =
\begin{cases}
0.0000333693 + 0.0000330786 \, \Delta t & \Delta t < 30230.01671 \\
1 & \Delta t \geq 30230.01671
\end{cases}
\tag{7.25}
\]

7.3.3.8 Cache-Control Header

Figure 7.21 shows the model learned for the disagreement decay of the
Cache-Control header, which is described by Equation 7.26. As with the
Accept-Language, Accept-Charset, and Accept-Encoding headers, this disagreement
decay grows very slowly.

\[
d^{\neq}(X_8, \Delta t) =
\begin{cases}
e^{-4.41392 + 0.0222535 \sqrt{\Delta t}} & \Delta t < 39341.62217 \\
1 & \Delta t \geq 39341.62217
\end{cases}
\tag{7.26}
\]

Figure 7.22 shows the model learned for the agreement decay of the
Cache-Control header, which is described by Equation 7.27.


\[
d^{=}(X_8, \Delta t) =
\begin{cases}
-0.0112165 + 0.00005847 \, \Delta t - 8.0021 \cdot 10^{-10} \, \Delta t^2 & \Delta t < 28104.69216 \\
1 & \Delta t \geq 28104.69216
\end{cases}
\tag{7.27}
\]

Figure 7.21: Disagreement decay for the Cache-Control header

Figure 7.22: Agreement decay for the Cache-Control header

7.3.3.9 Plugins

Figure 7.23 shows the model learned for the disagreement decay of the Plugins
installed within the browser, which is described by Equation 7.28.

\[
d^{\neq}(X_9, \Delta t) =
\begin{cases}
(0.0835439 + 0.00570466 \sqrt{\Delta t})^2 & \Delta t < 25808.56167 \\
1 & \Delta t \geq 25808.56167
\end{cases}
\tag{7.28}
\]

Figure 7.24 shows the model learned for the agreement decay of the Plugins
attribute, which is described by Equation 7.29.

\[
d^{=}(X_9, \Delta t) =
\begin{cases}
e^{-10.1405 + 0.98999 \ln \Delta t} & \Delta t < 28086.17546 \\
1 & \Delta t \geq 28086.17546
\end{cases}
\tag{7.29}
\]

Figure 7.23: Disagreement decay for the Plugins attribute

Figure 7.24: Agreement decay for the Plugins attribute

7.3.3.10 Fonts

Figure 7.25 shows the model learned for the disagreement decay of the Fonts
attribute. The regression function is not defined in the interval [0, 1102.506998),
as the radicand expression produces negative numbers there. For that interval,
the model shown in Figure 7.26 has been obtained by performing an additional
regression restricted to [0, 1102.506998). Joining both models, the
disagreement decay of the Fonts attribute is described by Equation 7.30.

Figure 7.25: Disagreement decay for the Fonts attribute (second interval)

Figure 7.26: Disagreement decay for the Fonts attribute (first interval)



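\[
d^{\neq}(X_{10}, \Delta t) =
\begin{cases}
(0.012879 + 0.0605308 \ln \Delta t)^2 & 0 < \Delta t < 1102.506998 \\
\sqrt{-0.230069 + 0.00692895 \sqrt{\Delta t}} & 1102.506998 \leq \Delta t < 31515.49207 \\
1 & \Delta t \geq 31515.49207
\end{cases}
\tag{7.30}
\]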
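
The three-branch model can be sketched in Python as follows. This sketch reads the second branch as the square root of −0.230069 + 0.00692895√Δt, which is the reading consistent with the radicand becoming negative below 1102.506998 and with the decay reaching 1 at 31515.49207. The function and constant names are hypothetical, and the clamp of the radicand at zero is an assumption added to absorb floating-point rounding at the interval boundary.

```python
import math

FIRST_INTERVAL_END = 1102.506998   # below this value the radicand is negative
SATURATION_POINT = 31515.49207     # from this value on the decay is 1


def disagreement_decay_fonts(dt: float) -> float:
    """Disagreement decay of the Fonts attribute (Equation 7.30)."""
    if dt <= 0:
        raise ValueError("Equation 7.30 is only defined for dt > 0")
    if dt < FIRST_INTERVAL_END:
        # First interval: additional regression on ln(dt).
        return (0.012879 + 0.0605308 * math.log(dt)) ** 2
    if dt < SATURATION_POINT:
        radicand = -0.230069 + 0.00692895 * math.sqrt(dt)
        # Assumption: clamp tiny negative values caused by rounding.
        return math.sqrt(max(0.0, radicand))
    return 1.0
```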
Figure 7.27 shows the model learned for the agreement decay of the Fonts
attribute, which is described by Equation 7.31.


\[
d^{=}(X_{10}, \Delta t) =
\begin{cases}
0.0000332742 + 0.0000330753 \, \Delta t & \Delta t < 30233.03570 \\
1 & \Delta t \geq 30233.03570
\end{cases}
\tag{7.31}
\]

Figure 7.27: Agreement decay for the Fonts attribute

7.3.3.11 Video

Figure 7.28 shows the model learned for the disagreement decay of the Video
attribute, which is described by Equation 7.32.

\[
d^{\neq}(X_{11}, \Delta t) =
\begin{cases}
(0.233452 + 0.00417302 \sqrt{\Delta t})^2 & \Delta t < 33742.54047 \\
1 & \Delta t \geq 33742.54047
\end{cases}
\tag{7.32}
\]

Figure 7.29 shows the model learned for the agreement decay of the Video
attribute, which is described by Equation 7.33.


\[
d^{=}(X_{11}, \Delta t) =
\begin{cases}
0.0000673283 + 0.0000331751 \, \Delta t & \Delta t < 30141.05976 \\
1 & \Delta t \geq 30141.05976
\end{cases}
\tag{7.33}
\]

Figure 7.28: Disagreement decay for the Video attribute

Figure 7.29: Agreement decay for the Video attribute

7.3.3.12 Time zone

Figure 7.30 shows the model learned for the disagreement decay of the Time zone
attribute, which is described by Equation 7.34.


\[
d^{\neq}(X_{12}, \Delta t) =
\begin{cases}
(0.12658 + 0.000024006 \, \Delta t)^2 & \Delta t < 36383.40415 \\
1 & \Delta t \geq 36383.40415
\end{cases}
\tag{7.34}
\]

Figure 7.31 shows the model learned for the agreement decay of the Time
zone attribute, which is described by Equation 7.35.

Figure 7.30: Disagreement decay for the Time zone attribute

Figure 7.31: Agreement decay for the Time zone attribute


\[
d^{=}(X_{12}, \Delta t) =
\begin{cases}
0.0000337212 + 0.0000337212 \, \Delta t & \Delta t < 30222.05737 \\
1 & \Delta t \geq 30222.05737
\end{cases}
\tag{7.35}
\]

7.3.3.13 Session Storage

Figure 7.32 shows the model learned for the disagreement decay of the Session
storage attribute, which is described by Equation 7.36. As with the
Accept-Language, Accept-Charset, Accept-Encoding, and Cache-Control attributes,
this disagreement decay grows very slowly.

Figure 7.32: Disagreement decay for the Session Storage attribute

Figure 7.33: Agreement decay for the Session storage attribute

\[
d^{\neq}(X_{13}, \Delta t) =
\begin{cases}
e^{-5.91823 + 0.0303167 \sqrt{\Delta t}} & \Delta t < 38108.32197 \\
1 & \Delta t \geq 38108.32197
\end{cases}
\tag{7.36}
\]

Figure 7.33 shows the model learned for the agreement decay of the Session
storage attribute, which is described by Equation 7.37.


\[
d^{=}(X_{13}, \Delta t) =
\begin{cases}
0.0000331289 + 0.0000330721 \, \Delta t & \Delta t < 30235.96539 \\
1 & \Delta t \geq 30235.96539
\end{cases}
\tag{7.37}
\]

7.3.3.14 Local Storage

Figure 7.34 shows the model learned for the disagreement decay of the Local
storage attribute, which is described by Equation 7.38.

\[
d^{\neq}(X_{14}, \Delta t) =
\begin{cases}
e^{-6.04222 + 0.0306873 \sqrt{\Delta t}} & \Delta t < 38768.20650 \\
1 & \Delta t \geq 38768.20650
\end{cases}
\tag{7.38}
\]

Figure 7.35 shows the model learned for the agreement decay of the Local
storage attribute, which is described by Equation 7.39.

Figure 7.34: Disagreement decay for the Local storage attribute

Figure 7.35: Agreement decay for the Local Storage attribute


\[
d^{=}(X_{14}, \Delta t) =
\begin{cases}
0.0000331289 + 0.0000330721 \, \Delta t & \Delta t < 30235.96539 \\
1 & \Delta t \geq 30235.96539
\end{cases}
\tag{7.39}
\]

7.3.3.15 Internet Explorer Persistence

Figure 7.36 shows the model learned for the disagreement decay of the Internet
Explorer persistence attribute, which is described by Equation 7.40.

\[
d^{\neq}(X_{15}, \Delta t) =
\begin{cases}
e^{-6.29214 + 0.0331916 \sqrt{\Delta t}} & \Delta t < 35936.88071 \\
1 & \Delta t \geq 35936.88071
\end{cases}
\tag{7.40}
\]

Figure 7.37 shows the model learned for the agreement decay of the Internet
Explorer persistence attribute, which is described by Equation 7.41.


\[
d^{=}(X_{15}, \Delta t) =
\begin{cases}
0.0000331737 + 0.0000330731 \, \Delta t & \Delta t < 30235.04982 \\
1 & \Delta t \geq 30235.04982
\end{cases}
\tag{7.41}
\]

Figure 7.36: Disagreement decay for the Internet Explorer persistence attribute

Figure 7.37: Agreement decay for the Internet Explorer persistence attribute

7.4 Evaluation
We have evaluated the four variants of the technique described in this chapter
for uniquely identifying users based on the fingerprint of their devices, which
are the following:

1. Assigning equal weight to each fingerprint attribute.

2. Assigning the entropy of the attribute as attribute weight.

3. Taking into account agreement and disagreement decays.

4. Combining attribute entropy with agreement and disagreement decay.

As described in Section 7.3.2 we have used a corpus of activity records with users
identified with the technique based on cookies as gold standard.
The evaluation has been performed with different values of θ (i.e. threshold at
which it is considered that two fingerprints correspond to the same browser). For
each variant and threshold, we have measured algorithm performance, according
to a set of evaluation metrics.
For the variants that require training a decay and/or entropy model (i.e. all
except the one based on uniform weights), we have performed 2-fold
cross-validation, dividing the dataset into two subsets. We have randomly
assigned records to each subset, so that both subsets are equal in size. For each
subset, we have learned decay and entropy values, and evaluated the algorithm
performance with the other subset. This lets us recommend the best algorithm
variant and compare our results with previous work.
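
The 2-fold procedure can be sketched as follows. This is only an illustration: `train` and `evaluate` are hypothetical callables standing for the learning of the decay and entropy models and for the computation of the evaluation metrics, and the fixed random seed is an assumption made for reproducibility.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def two_fold_split(records: Sequence[Any], seed: int = 0) -> Tuple[List[Any], List[Any]]:
    """Randomly divide the records into two subsets of equal size."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def two_fold_cross_validation(
    records: Sequence[Any],
    train: Callable[[List[Any]], Any],
    evaluate: Callable[[Any, List[Any]], Any],
    seed: int = 0,
) -> List[Any]:
    """Learn a model on each subset and evaluate it on the other one."""
    first, second = two_fold_split(records, seed)
    return [evaluate(train(first), second), evaluate(train(second), first)]
```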
This section is structured as follows:

• Section 7.4.1 describes the metrics used for evaluating the technique.

• Section 7.4.2 presents the evaluation results obtained for each variant and
threshold, comparing such results and obtaining an optimum setting.

7.4.1 Evaluation Metrics

The technique proposed for unique user identification can be evaluated as a clus-
tering algorithm since its objective is to group fingerprint records corresponding
to unique users.
Most of the metrics used for evaluating this work interpret the clustering as
a set of decisions, one for each of the N (N − 1)/2 pairs of elements (i.e. pairs of
fingerprint records). In this context:

• TP is the number of true positive decisions. A true positive decision assigns
two fingerprints corresponding to the same user to the same cluster.

• TN is the number of true negative decisions. A true negative decision
assigns two fingerprints corresponding to distinct users to different clusters.

• FP is the number of false positive decisions. A false positive decision assigns
two fingerprints corresponding to distinct users to the same cluster.

• FN is the number of false negative decisions. A false negative decision
assigns two fingerprints corresponding to the same user to different clusters.

Taking into account the TP, TN, FP, and FN indicators, the metrics used
for evaluating the performance of the technique for unique user identification are
described next.
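
The four indicators can be obtained by enumerating all pairs of records, as in the following minimal sketch. The function name and the representation of the clustering as two parallel lists of labels (cluster assigned by the technique, user in the gold standard) are assumptions of this example.

```python
from itertools import combinations
from typing import Hashable, Sequence, Tuple


def pair_counts(predicted: Sequence[Hashable], gold: Sequence[Hashable]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) over the N(N - 1)/2 pairs of records."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_cluster = predicted[i] == predicted[j]
        same_user = gold[i] == gold[j]
        if same_cluster and same_user:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_user:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn
```

The enumeration is quadratic in the number of records, which is acceptable for an illustration but may need a count-based formulation for large corpora.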

7.4.1.1 Rand Index

The Rand Index metric [Rand, 1971] measures the percentage of correct clustering
decisions. Equation 7.42 shows its definition.

\[
RI = \frac{TP + TN}{TP + FP + TN + FN}
\tag{7.42}
\]
The range of this metric is [0, 1]. We consider values above 0.9 satisfactory.

7.4.1.2 Error Rate

The Error Rate metric [Kohavi and Provost, 1998] measures the percentage of
incorrect decisions. Equation 7.43 shows its definition.

\[
Error = \frac{FP + FN}{TP + FP + FN + TN}
\tag{7.43}
\]
The range of this metric is [0, 1]. We consider values below 0.1 satisfactory,
as Error = 1 − RI.

7.4.1.3 Recall

The Recall metric [Kowalski, 1997] (a.k.a. sensitivity or hit rate) is the true
positive rate. Equation 7.44 shows its definition.

\[
Recall = \frac{TP}{TP + FN}
\tag{7.44}
\]
The range of this metric is [0, 1]. We consider values above 0.85 satisfactory.

7.4.1.4 Specificity

The Specificity metric [Kohavi and Provost, 1998] is the true negative rate. Equa-
tion 7.45 shows its definition.

\[
Specificity = \frac{TN}{FP + TN}
\tag{7.45}
\]
The range of this metric is [0, 1]. We consider values above 0.9 satisfactory.

7.4.1.5 False Positive Rate

Equation 7.46 defines the False Positive Rate metric [Kohavi and Provost, 1998]
(a.k.a. fall-out).

\[
FPR = \frac{FP}{FP + TN}
\tag{7.46}
\]
The range of this metric is [0, 1]. We consider values below 0.1 satisfactory.

7.4.1.6 False Negative Rate

Equation 7.47 defines the False Negative Rate [Kohavi and Provost, 1998] metric.

\[
FNR = \frac{FN}{FN + TP}
\tag{7.47}
\]
The range of this metric is [0, 1]. We consider values below 0.15 satisfactory,
as FNR = 1 − Recall.

7.4.1.7 Precision

The Precision metric [Kowalski, 1997] is defined as the positive predictive value.
Equation 7.48 shows its definition.

\[
Precision = \frac{TP}{TP + FP}
\tag{7.48}
\]
The range of this metric is [0, 1]. We consider values above 0.9 satisfactory.

7.4.1.8 F-measure

The F-measure metric [Larsen and Aone, 1999] combines the precision and recall
metrics, offering an overall view of how the technique behaves. It is defined as
the harmonic mean of precision and recall. Equation 7.49 shows its definition.

\[
F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}
\tag{7.49}
\]
The range of this metric is [0, 1]. We consider values above 0.87 satisfactory,
taking into account the minimum satisfactory values of Precision and Recall.
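
As a hedged illustration, Equations 7.42 to 7.49 can be computed from the four indicators as follows. The function name and the dictionary keys are hypothetical, and the sketch assumes that no denominator is zero.

```python
from typing import Dict


def pairwise_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the pair-based metrics of Equations 7.42 to 7.49."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "rand_index": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "recall": recall,
        "specificity": tn / (fp + tn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }
```

With the minimum satisfactory values of Precision (0.9) and Recall (0.85), Equation 7.49 gives 2 · 0.9 · 0.85 / 1.75 ≈ 0.874, which is where the 0.87 threshold for the F-measure comes from.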

7.4.1.9 Purity

This metric, defined by Zhao and Karypis [2001], represents the purity of the
clusters. To calculate it, each fingerprint cluster obtained is assigned to the
most frequent user in that cluster. Then, the classification performance is
measured as the number of fingerprint records assigned correctly to a cluster,
divided by the total number of records.
Let Ω = {ω1, ω2, ..., ωk} be the set of clusters obtained, C = {c1, c2, ..., cj} the
set of users, and N the total number of fingerprint records. The Purity metric
is obtained as shown by Equation 7.50.

\[
Purity(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|
\tag{7.50}
\]

The range of this metric is [0, 1]. We consider values above 0.85 satisfactory.
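
A minimal sketch of the Purity computation follows, again assuming that the clustering is given as two parallel lists of labels (hypothetical representation): the cluster assigned to each record and the user of that record in the gold standard.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def purity(predicted: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Purity of a clustering (Equation 7.50)."""
    users_by_cluster = defaultdict(list)
    for cluster, user in zip(predicted, gold):
        users_by_cluster[cluster].append(user)
    # For each cluster, count the records of its most frequent user.
    correctly_assigned = sum(
        Counter(users).most_common(1)[0][1] for users in users_by_cluster.values()
    )
    return correctly_assigned / len(gold)
```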

7.4.2 Evaluation Results


This section presents the evaluation results for the variants of the technique and
compares them, obtaining an optimum combination of variant and threshold at
which it is considered that two fingerprints correspond to the same user’s browser.

7.4.2.1 Variant Based on Uniform Weights

This variant assigns the same weight to all the fingerprint attributes. Therefore,
all of these attributes have the same importance for determining whether two
fingerprints correspond to the same browser.

Measure               θ = 0.7   θ = 0.75  θ = 0.8   θ = 0.85  θ = 0.9   θ = 0.95
Rand Index            0.9978    0.9994    0.9998    0.9998    0.9996    0.9994
Error Rate            0.0011    0.0003    0.0001    0.0001    0.0002    0.0003
Recall                0.41      0.74      0.91      0.91      0.63      0.39
Specificity           0.99836   0.99965   0.99985   0.99986   0.99995   0.99996
False Positive Rate   0.00164   0.00035   0.00015   0.00014   0.00005   0.00004
False Negative Rate   0.59      0.26      0.09      0.09      0.37      0.61
Precision             0.20      0.67      0.85      0.86      0.92      0.91
F-measure             0.27      0.7       0.88      0.88      0.75      0.54
Purity                0.44      0.77      0.92      0.92      0.96      0.96

Table 7.9: Evaluation results for the variant based on uniform weights

Table 7.9 shows the evaluation results corresponding to different values of θ,
from which the following insights can be obtained:

• The Rand Index and Error Rate metrics are good for all the values assigned
to θ.

• The Specificity and False Positive Rate metrics are good for all the values
assigned to θ.

• The Recall and False Negative Rate metrics are good for θ = 0.8 and θ = 0.85.

• The Precision metric is good for θ = 0.9 and θ = 0.95, although for these
values the Recall is not admissible.

• The F-measure metric is acceptable for θ = 0.8 and θ = 0.85.

• The Purity metric is good for θ ≥ 0.8.

The values that optimise the corresponding metrics among all the variants are
marked in bold in Table 7.9.

7.4.2.2 Variant Based on Attribute Entropy

This variant uses the entropy of each fingerprint attribute as its weight.
Therefore, each attribute has an importance that is proportional to the
quantity of information that it provides for distinguishing a fingerprint record
from another, or for clustering fingerprints that correspond to the same user. As an
example, the plugins installed in the browser will have more weight than the time
zone.
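
As an illustration of why some attributes weigh more than others, the following sketch computes the entropy of an attribute from the values observed in a corpus. It assumes that the entropy is the Shannon entropy measured in bits, which is an assumption of this example, and the sample values are invented for illustration only.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the observed values of one attribute."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical observations: plugin lists vary much more than time zones.
plugins = ["flash,java", "flash", "java,pdf", "pdf"]
time_zones = ["UTC+1", "UTC+1", "UTC+1", "UTC"]
weights = {
    "plugins": shannon_entropy(plugins),
    "time_zone": shannon_entropy(time_zones),
}
```

In this toy sample the plugins attribute takes four equally likely values and so carries more information than the time zone, which is almost constant.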
Table 7.10 shows the evaluation results corresponding to different values of θ,
from which the following insights can be obtained:

• The Rand Index and Error Rate metrics are good for all the values assigned
to θ.

• The Specificity and False Positive Rate metrics are good for all the values
assigned to θ.

• The Recall and False Negative Rate metrics are not as good as with other
variants.

• The Precision metric is good for all values of θ.

• The F-measure metric is not as good as with other variants.

• The Purity metric is good for all the values of θ.

The values that optimise the corresponding metrics among all the variants are
marked in bold in Table 7.10.

Measure               θ = 0.7   θ = 0.75  θ = 0.8   θ = 0.85  θ = 0.9   θ = 0.95
Rand Index            0.9996    0.9996    0.9996    0.9994    0.9994    0.9994
Error Rate            0.0002    0.0002    0.0002    0.0003    0.0003    0.0003
Recall                0.64      0.64      0.64      0.44      0.41      0.40
Specificity           0.99995   0.99995   0.99995   0.99996   0.99996   0.99996
False Positive Rate   0.00005   0.00005   0.00005   0.00004   0.00004   0.00004
False Negative Rate   0.36      0.36      0.36      0.56      0.59      0.60
Precision             0.92      0.92      0.92      0.91      0.91      0.91
F-measure             0.76      0.75      0.75      0.60      0.56      0.56
Purity                0.95      0.95      0.96      0.96      0.96      0.97

Table 7.10: Evaluation results for the variant based on attribute entropy

7.4.2.3 Variant Based on Time Decay

This variant derives the weight of each fingerprint attribute from its
corresponding agreement and disagreement decays. Therefore, each attribute has
an importance proportional to the probability of change or sharing between
fingerprint records.
Table 7.11 shows the evaluation results corresponding to different values of θ,
from which the following insights can be obtained:

• The Rand Index and Error Rate metrics are good for all the values assigned
to θ.

• The Specificity and False Positive Rate metrics are good for all the values
assigned to θ.

• The Recall and False Negative Rate metrics are not as good as with other
variants.

• The Precision metric is good for θ = 0.95.

• The F-measure metric is not as good as with other variants.

• The Purity metric is good for θ = 0.9 and θ = 0.95.

The values that optimise the corresponding metrics among all the variants are
marked in bold in Table 7.11.

Measure               θ = 0.7   θ = 0.75  θ = 0.8   θ = 0.85  θ = 0.9   θ = 0.95
Rand Index            0.9977    0.9986    0.9991    0.9994    0.9997    0.9995
Error Rate            0.0012    0.0007    0.0005    0.0003    0.0002    0.0002
Recall                0.31      0.36      0.42      0.61      0.74      0.53
Specificity           0.99832   0.99921   0.99964   0.99981   0.99991   0.99996
False Positive Rate   0.00168   0.00079   0.00036   0.00019   0.00009   0.00004
False Negative Rate   0.69      0.64      0.58      0.39      0.26      0.47
Precision             0.15      0.31      0.53      0.76      0.89      0.92
F-measure             0.2       0.22      0.47      0.68      0.81      0.67
Purity                0.32      0.45      0.61      0.79      0.89      0.95

Table 7.11: Evaluation results for the variant based on time decay

7.4.2.4 Variant Based on Attribute Entropy and Time Decay

This variant derives the weight of each fingerprint attribute from a combination
of its corresponding entropy and its agreement and disagreement decays. Therefore,
each attribute has an importance proportional to the quantity of information that
it adds for distinguishing a fingerprint from another, as well as to the probability
of change or sharing between fingerprint records.
Table 7.12 shows the evaluation results corresponding to different values of θ,
from which the following insights can be obtained:

• The Rand Index and Error Rate metrics are good for all the values assigned
to θ.

• The Specificity and False Positive Rate metrics are good for all the values
assigned to θ.

• The Recall and False Negative Rate metrics are good for θ = 0.7, θ = 0.75,
and θ = 0.8.

• The Precision metric is good for all the values of θ.

• The F-measure metric is good for θ = 0.7, θ = 0.75, and θ = 0.8.

• The Purity metric is good for all values of θ.

The values that optimise the corresponding metrics among all the variants are
marked in bold in Table 7.12.

Measure               θ = 0.7   θ = 0.75  θ = 0.8   θ = 0.85  θ = 0.9   θ = 0.95
Rand Index            0.9998    0.9998    0.9998    0.9996    0.9994    0.9994
Error Rate            0.0001    0.0001    0.0001    0.0002    0.0002    0.0002
Recall                0.89      0.88      0.87      0.62      0.46      0.45
Specificity           0.9999    0.99992   0.99993   0.99995   0.99996   0.99996
False Positive Rate   0.00010   0.00008   0.00007   0.00005   0.00004   0.00004
False Negative Rate   0.11      0.12      0.13      0.38      0.54      0.55
Precision             0.9       0.91      0.93      0.92      0.91      0.91
F-measure             0.9       0.9       0.9       0.74      0.61      0.6
Purity                0.88      0.91      0.94      0.95      0.95      0.96

Table 7.12: Evaluation results for the variant based on attribute entropy and
time decay

7.4.2.5 Comparison of the Variants

Figure 7.38 shows a ROC (Receiver Operating Characteristic) graph [Egan, 1975]
with plots representing the algorithm variants with different thresholds. A ROC
space is defined by False Positive Rate and Recall (or True Positive Rate) metrics
as x and y axes respectively, which depicts relative trade-offs between true positive
(benefits) and false positive (costs). The best possible prediction method would
yield a point in the upper left corner or coordinate (0,1) of the ROC space,
representing no false negatives and no false positives (perfect classification).
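
As a sketch of how a variant is placed in the ROC space, the following code maps the four indicators to a point (False Positive Rate, Recall). The distance to the corner (0, 1) is included only as one possible numeric summary of "closeness to the upper left corner"; it is an assumption of this example and it ignores the remaining metrics, so it does not replace the comparison made in the text. The two sample points are taken from Tables 7.9 and 7.12 for θ = 0.8.

```python
import math
from typing import Tuple


def roc_point(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return the (False Positive Rate, Recall) coordinates in the ROC space."""
    return fp / (fp + tn), tp / (tp + fn)


def distance_to_perfect(point: Tuple[float, float]) -> float:
    """Euclidean distance from a ROC point to perfect classification (0, 1)."""
    fpr, tpr = point
    return math.hypot(fpr, 1.0 - tpr)


uniform_weights = (0.00015, 0.91)    # Table 7.9, threshold 0.8
entropy_and_decay = (0.00007, 0.87)  # Table 7.12, threshold 0.8
```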
Therefore, the best-performing variants are in the upper left corner of the
figure. These are the variant that uses the same weight for all fingerprint
attributes (for θ = 0.8 and θ = 0.85), and the variant that takes into account
both entropy and decay (for θ = 0.7, θ = 0.75 and θ = 0.8). Taking into account
entropy or decay separately (second and third variants) does not produce better
results than the uniform weights variant.

Figure 7.38: Performance of the variants evaluated for the technique for unique
user identification based on evolving device fingerprint detection

Table 7.13 compares the variants that provide the best results (optimum results
in bold), showing the following insights:

• The Rand Index and Error Rate metrics are the same for all the variants
and thresholds.

• The Recall and False Negative Rate metrics are slightly better for the
variant that assigns the same weight to all the attributes (first variant),
although they are acceptable for the variant that takes into account decay
and entropy (fourth variant).

• On the other hand, Specificity and False Positive Rate are slightly better
for the fourth variant, although they are acceptable for the first variant.

• Precision is better for the fourth variant (0.9 or higher).

• F-measure is higher for the fourth variant (0.9).

• In addition, the algorithm achieves its best Purity value for the fourth
variant with θ = 0.8 (Purity = 0.94).

In summary, the variant that behaves best is the one that takes into account
entropy and decay, since it provides the maximum values of Rand Index,
F-measure and Purity.

                          Uniform weights     Decay and entropy
Measure                   θ = 0.8   θ = 0.85  θ = 0.7   θ = 0.75  θ = 0.8
Rand index                0.9998    0.9998    0.9998    0.9998    0.9998
Error rate                0.0001    0.0001    0.0001    0.0001    0.0001
Recall (or sensitivity)   0.91      0.91      0.89      0.88      0.87
Specificity               0.99985   0.99986   0.9999    0.99992   0.99993
False positive rate       0.00015   0.00014   0.00010   0.00008   0.00007
False negative rate       0.09      0.09      0.11      0.12      0.13
Precision                 0.85      0.86      0.9       0.91      0.93
F-measure                 0.88      0.88      0.9       0.9       0.9
Purity                    0.92      0.92      0.88      0.91      0.94

Table 7.13: Comparison of the best-performing variants

7.5 Hypothesis Validation


Our algorithm behaves better than the algorithm proposed by Eckersley [2010]:
the accuracy of the latter is 0.991, while the accuracy of ours is 0.9998. The
false positive rate of Eckersley [2010] is 0.0086, while ours is almost zero
(0.00007). Moreover, the algorithm described by Eckersley [2010] classifies only
65% of the fingerprints (those for which the browser has the Java Virtual
Machine or Flash installed). By contrast, our algorithm makes a classification in
all cases, regardless of whether Flash or the Java Virtual Machine is installed.
The evaluation of our approach for unique user identification validates
Hypothesis 2 of this work, since our technique allows grouping and identifying
the activity generated by website visitors through the digital fingerprint of
their devices, even when such fingerprint varies over time, with higher
performance across different metrics than the previously existing approach.

Chapter 8

TECHNIQUES FOR SEGMENTATION OF CONSUMERS FROM SOCIAL MEDIA CONTENT

This chapter describes another main contribution of this thesis to the State of the
Art, which consists of a collection of techniques for extracting socio-demographic
and psychographic profiles of social media users, applied to the marketing
domain, through the analysis of the opinions they express about brands, as well as
of the profiles they publish in social networks. Specifically, these techniques
are the following:

• A technique for classifying consumer opinions produced in social media
according to the Consumer Decision Journey stages, which is described in
Section 8.2.

• A technique for classifying consumer opinions produced in social media
according to the Marketing Mix framework, which is described in Section 8.3.

• A technique for analysing consumer opinions written in Spanish according
to the emotions expressed in such opinions, which is described in Section 8.4.

• A technique for obtaining the place of residence of social media users, which
is described in Section 8.5.

• A technique for identifying the place of residence of social media users,
which is described in Section 8.6.

Additionally, the contributions of this thesis that perform content analysis
rely on a common task for gathering the corpora used for learning and evaluation
purposes, a common activity for pre-processing user-generated content before
modelling, and a modelling technique based on rule matching. Section 8.1
describes these common elements.
Finally, the evaluation results are presented in Section 8.7. After that, in
Section 8.8 we validate the hypotheses formulated in Section 3.4 regarding
sociodemographic and psychographic segmentation of consumers.
The techniques described in this chapter implement generic activities and
tasks defined by the CRISP-DM methodology [Shearer, 2000], which has been
described in Section 4.3.2.

8.1 Common Elements Used by the Techniques

The content-analysis techniques described in this thesis have been trained and
evaluated with corpora extracted from social media. Section 8.1.1 describes the
data collection task used for obtaining such corpora, while Section 8.1.2 describes
the technique used for preparing the corpora used by the content-analysis contri-
butions of this thesis.
In addition, two techniques presented in this thesis (i.e. the technique for
detecting Consumer Decision Journey stages and the technique for identifying
emotions) make use of rule-based models, which rely on a variety of linguistic
information such as lexical items or morphosyntactic features (e.g. future tense).
Such models have been developed following the modelling technique described in
Section 8.1.3.

[Activity diagram of the Collect Initial Data task, with the steps Search,
Extract, and Opinion Clipping]

Figure 8.1: Initial Data Collection task executed by the content-analysis
techniques

8.1.1 Collect Initial Data Task


This task implements the Collect Initial Data generic task of the CRISP-DM
methodology [Shearer, 2000] (see Section 4.3.2.2). It aims to find and
retrieve, from different social media, textual content that mentions brands. The
workflow followed by this task is shown in Figure 8.1 and consists of the ordered
execution of the steps described next.

Search. This step consists of defining a pool of brands with a list of lexical
variants for each one (e.g. “Coca Cola” and “Coke” for the brand Coca
Cola), and using social media search services to look for texts written in
a set of target languages that mention any of those brands, retrieving
the links returned in the search results.
In our work we used the search services provided by Google
(https://developers.google.com/custom-search), Facebook
(https://developers.facebook.com/docs/graph-api), and Twitter
(https://dev.twitter.com/docs/api/1.1/get/search/tweets).

Extract. This step consists of retrieving and extracting the textual content
referred to by the links of the search results.
Texts from structured data sources (i.e. from Twitter and Facebook) are
directly retrieved from the values of the message attribute included in
the structured data object obtained by querying the corresponding REST
[Fielding, 2000] API.
Texts from unstructured data sources (i.e. web pages) are obtained by
applying a scraping technique that removes the HTML mark-up.

Opinion Clipping. Once the texts from each specific social media format have
been collected, this step extracts the paragraphs (i.e. clips) that mention
the selected brands (i.e. that contain at least one term of the list of terms
used by the Search step).
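
The Opinion Clipping step can be sketched as follows. The sketch assumes that paragraphs are separated by blank lines and that a brand term is matched as a whole word regardless of case; both choices, as well as the function name, are assumptions of this example.

```python
import re
from typing import List, Sequence


def clip_opinions(text: str, brand_terms: Sequence[str]) -> List[str]:
    """Return the paragraphs (clips) that mention at least one brand term."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in brand_terms
    ]
    return [p for p in paragraphs if any(pat.search(p) for pat in patterns)]
```

For instance, with the lexical variants “Coca Cola” and “Coke”, only the paragraphs containing one of those two terms are kept as clips.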

8.1.2 Data Preparation Activity


This activity implements the Data Preparation generic activity of the CRISP-DM
methodology (see Section 4.3.2.3). Once the content is retrieved, the goal of this
activity is to filter out the texts that are not relevant, either because they do
not mention the brand, are written in a language other than the target language,
or do not contain user-generated content.
In addition, NLP (Natural Language Processing) tools were used to obtain
the linguistic information upon which the content-analysis techniques were based.
The texts were processed and annotated with linguistic information such as part-
of-speech, verb tense, and person. For these NLP tools to work properly, it
was also crucial to normalise the texts that contain many typos, abbreviations,
emoticons, etc.
To enhance the performance of the content-analysis techniques described
in this thesis, the data preparation activity performs a morphological
normalisation of user-generated content. This normalisation makes use of several
gazetteers extracted from different collaboratively developed open data sources,
including an SMS lexicon and Wikipedia. Wikipedia has been used in the past for
different
NLP activities, such as text categorisation [Gabrilovich and Markovitch, 2006],
topic identification [Coursey et al., 2009], measuring the semantic similarity be-
tween texts [Gabrilovich and Markovitch, 2007], and word sense disambiguation
[Mihalcea, 2007], among others.
This activity consists of the ordered execution of the tasks shown in Figure 8.2,
which are described next.

[Activity diagram of the Data Preparation activity, with the tasks Select Data,
Clean Data, and Construct Data]

Figure 8.2: Data Preparation Activity implemented by the content-analysis
techniques

8.1.2.1 Select Data Task

As described in Section 8.1.1, the Collect Initial Data task looks for content
written in a target language that refers to a commercial brand. To do so, it
uses the content retrieval APIs provided by social media. Such APIs may output
false positives of the following kinds:

1. Posts that syntactically contain a brand term but do not refer to the brand
itself. This is mainly due to the use of ambiguous terms (e.g. “Orange”
may refer to a telecommunications company, a fruit or a colour).

2. Posts that are not written in the target language, either because the social
network’s API does not have language detection capabilities, or because it
retrieves posts that have been tagged with a given language but are not
actually written in that language.

To deal with these situations, this task establishes the criteria for selecting
the textual content to be used from the collected raw data. Such content must
satisfy the following criteria:

1. The text of each post must contain a mention of a commercial brand. For
automatically selecting the correct senses, two lists are added to the data
selection criteria:

• A list of mandatory terms that includes terms related to senses that
refer to the brand (e.g. a text that contains “phone” or “mobile” may
refer to the telecommunication company Orange).

• A list of forbidden terms that includes terms related to senses in which
we are not interested (e.g. a text that contains “fruit” or “dessert” is
more likely to refer to a sense of the word Orange other than the
telecommunication company).

2. The text must be written in the target language for which the model will
be learned.

The task described next deals with removing from the dataset the content that
does not satisfy the previous criteria.
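
One way to apply the two lists is to keep a text only if it contains at least one mandatory term and no forbidden term. The following sketch illustrates this rule; the function names are hypothetical, and whole-word, case-insensitive matching is an assumption of the example.

```python
import re
from typing import Sequence


def _contains_any(text: str, terms: Sequence[str]) -> bool:
    """True if the text contains at least one of the terms as a whole word."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        for term in terms
    )


def refers_to_brand(text: str, mandatory_terms: Sequence[str],
                    forbidden_terms: Sequence[str]) -> bool:
    """Apply the mandatory and forbidden term lists to one text."""
    if _contains_any(text, forbidden_terms):
        return False
    return _contains_any(text, mandatory_terms)
```

With the lists of the Orange example, a text mentioning “mobile” is kept, while a text mentioning “fruit”, or mentioning neither list, is discarded.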

8.1.2.2 Clean Data Task

This task consists of removing the content that is not relevant for the goal of
the activity to be performed after preparing the data. The workflow followed by
this task is shown in Figure 8.3 and consists of the ordered execution of the
steps described next.

[Figure 8.3 (activity diagram, not reproduced). Inputs: paragraphs and the data
selection criteria (language, forbidden terms, mandatory terms). Steps: Filter
Irrelevant Content, Filter Language, Filter SPAM, Manual Revision. Output:
cleansed paragraphs.]

Figure 8.3: Clean data task executed by the content-analysis techniques

Filter Irrelevant Content. This step consists in automatically filtering the
texts that syntactically contain one of the brand terms used for looking
up the opinions, but do not refer to the correct sense (i.e. the brand). For
doing so, this task takes out the texts that contain at least one forbidden
term or that do not contain at least one mandatory term.
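
The filtering criterion above can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function name `is_relevant`, the word-level matching and the example term lists are assumptions made for the example.

```python
import re

def is_relevant(text, mandatory_terms, forbidden_terms):
    """Keep a text only if it has a mandatory term and no forbidden term.

    Matching is done on lowercased word tokens (an assumption; the thesis
    does not specify how terms are matched against the text).
    """
    words = set(re.findall(r"\w+", text.lower()))
    has_mandatory = any(term.lower() in words for term in mandatory_terms)
    has_forbidden = any(term.lower() in words for term in forbidden_terms)
    return has_mandatory and not has_forbidden

# Hypothetical term lists for the brand "Orange"
mandatory = ["phone", "mobile"]
forbidden = ["fruit", "dessert"]
kept = [t for t in ["My Orange mobile has no signal",
                    "Orange is my favourite fruit"]
        if is_relevant(t, mandatory, forbidden)]
```

Under these assumptions only the first text is kept, since the second one contains a forbidden term and no mandatory term.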

Filter Language. This step consists in automatically removing the texts that
are not written in the language for which the texts are being extracted. To
do so, we have implemented a language detection component that combines
multiple language classifiers and returns the language which has been de-
tected the most by such classifiers. The language classifiers used are the
following:

• Freeling’s [Padró and Stanilovsky, 2012] language identification module.

• The Java Text Categorising Library⁸¹, which implements the text
categorisation algorithm described by Cavnar and Trenkle [1994].

• The LingPipe⁸² toolkit for computational linguistics.

• The language identification components provided by the Apache Tika⁸³
framework.

• The JLangDetect⁸⁴ library.
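
The voting scheme can be illustrated with the sketch below. The classifiers are represented by hypothetical callables that return a language code; they stand in for the libraries listed above and do not reproduce their real APIs. The source does not state how ties are resolved, so the tie-breaking here (first language to reach the top count) is an assumption.

```python
from collections import Counter

def detect_language(text, classifiers):
    """Return the language detected by most classifiers (simple majority vote)."""
    votes = Counter(classify(text) for classify in classifiers)
    # most_common() keeps first-seen order among equal counts (assumed tie rule)
    return votes.most_common(1)[0][0]

def is_target_language(text, classifiers, target):
    """True if the majority vote coincides with the target language."""
    return detect_language(text, classifiers) == target

# Hypothetical stand-ins for the five real classifiers
fake_classifiers = [lambda t: "es", lambda t: "en", lambda t: "es"]
language = detect_language("me encanta este coche", fake_classifiers)
```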

Filter SPAM. Since the text extraction technique applied in the Data Collection
Task for unstructured formats may return pieces of text included in
advertisements or navigation options of the web page, this step discards
those texts in which brands are not part of the main content of the document,
following the guidelines of Ntoulas et al. [2006].

After studying a representative set of 1,000 texts extracted from web pages,
we decided that a text (with at least one occurrence of a brand) is invalid (i.e.
it does not belong to the main content) unless it includes at least 30% of
words belonging to the following list of grammatical categories: adpositions,
determiners, conjunctions and pronouns.

To get the grammatical category of each word of the texts we made use of
a part-of-speech tagger (see Section 2.6.1). Specifically, we used Freeling.

⁸¹ http://textcat.sourceforge.net
⁸² http://alias-i.com/lingpipe
⁸³ http://tika.apache.org
⁸⁴ http://github.com/melix/jlangdetect
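
The 30% heuristic of the Filter SPAM step can be sketched as follows, assuming the text has already been part-of-speech tagged. The coarse tag names (`ADP`, `DET`, `CONJ`, `PRON`) are illustrative assumptions and are not the tagset produced by Freeling.

```python
# Coarse tags assumed for adpositions, determiners, conjunctions and pronouns
FUNCTION_TAGS = {"ADP", "DET", "CONJ", "PRON"}

def is_main_content(tagged_tokens, threshold=0.30):
    """Accept a text if at least `threshold` of its words are function words.

    `tagged_tokens` is a list of (word, tag) pairs from any POS tagger.
    """
    if not tagged_tokens:
        return False
    function_words = sum(1 for _, tag in tagged_tokens if tag in FUNCTION_TAGS)
    return function_words / len(tagged_tokens) >= threshold

menu_like = [("Buy", "VERB"), ("cheap", "ADJ"), ("Nike", "NOUN"),
             ("shoes", "NOUN"), ("online", "ADV")]
sentence = [("I", "PRON"), ("bought", "VERB"), ("a", "DET"), ("pair", "NOUN"),
            ("of", "ADP"), ("Nike", "NOUN"), ("shoes", "NOUN"),
            ("for", "ADP"), ("my", "DET"), ("brother", "NOUN")]
```

The intuition is that navigation menus and advertisement snippets are mostly lists of content words, whereas running text contains a substantial share of function words.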

Manual Revision. This step consists in manually reviewing the texts obtained
after applying the automatic filtering heuristics described above, discarding
irrelevant or useless contents, such as texts written in other languages that
are not detected by the Filter Language step, or texts referring to
senses other than the brand that are not detected by the Filter Irrelevant
Content step.
The final corpus obtained after performing this step consists of the remaining
texts, with annotations of the source from which they have been
collected, the brand mentioned in the texts, and the domain to which they
belong.

8.1.2.3 Construct Data Task

The content analysis techniques presented in this thesis rely on linguistic patterns.
In order to match these patterns with texts, the texts have to be processed and
annotated with linguistic information such as part-of-speech, verb tense, and person.
Linguistic processing is carried out by an automatic tagger. However, such a
tagger cannot work properly with user-generated texts such as the ones our techniques
analyse, because social media user-generated texts contain a large number
of misspellings, abbreviations and jargon words. Badly written texts cause a
large number of errors in the part-of-speech annotation process and, consequently,
without a normalisation phase the developed classifiers do not work correctly. For
dealing with this issue we have implemented the workflow shown in Figure 8.4.
The phases involved in this task are described next.

Sanitise. This phase transforms the text received by removing non-printable
characters (i.e. control and format characters like the null character) and
by converting different variations of the space character (e.g. non-breaking
space, tab) into the standard whitespace symbol.
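
A minimal sketch of this phase is shown below. It assumes that "control and format characters" correspond to the Unicode general categories `Cc` and `Cf`, and that every character Python treats as whitespace (including line breaks) is mapped to a plain space; both choices are assumptions for illustration.

```python
import unicodedata

def sanitise(text):
    """Replace space variants with ' ' and drop control/format characters."""
    cleaned = []
    for ch in text:
        if ch.isspace():
            # Tab, non-breaking space, etc. become a standard space.
            cleaned.append(" ")
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Control (e.g. null) and format (e.g. zero-width space) characters.
            continue
        else:
            cleaned.append(ch)
    return "".join(cleaned)

sample = sanitise("a\u00a0b\tc\x00d\u200be")
```

The whitespace test comes first because the tab character itself belongs to the control category and would otherwise be removed instead of converted.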

Tokenise. This phase receives the text to be normalised and breaks it into
words, Twitter metalanguage elements (e.g. hash-tags, user IDs), emoticons,
URLs, etc. The output (i.e. the list of tokens) is sent to the Classify
Tokens phase.
In our experiments, we used Freeling for social media content tokenisation.
Its specific tokenisation rules and its user map module were adapted for
dealing with smileys and particular elements typically used in Twitter, such
as hash-tags, RTs, and user IDs.

[Figure 8.4 (activity diagram, not reproduced). Input: cleansed paragraphs.
Phases: Sanitise, Tokenise, Classify Token, Normalise Twitter Metalanguage
Element, Classify OOV Word, Check & Correct Spell, Concatenate Normalised
Forms. Resources: standard language dictionary, SMS dictionary, spell checker
dictionary. Output: normalised posts.]

Figure 8.4: Construct data task executed by the content-analysis techniques

Classify Tokens. The input of this phase is the list of tokens generated by the
tokeniser. It classifies each of them into one of the following categories:

• Twitter metalanguage elements (i.e. hash-tags, user IDs, RTs and
URLs). Such elements are detected by matching regular expressions
against the token (e.g. if a token starts with the symbol “#”, then it
is a hash-tag). Each token classified in this category is sent to the
Normalise Twitter Metalanguage Element phase.
• Words contained in a standard language dictionary, excluding proper
nouns. Each token classified in this category is sent to the Concatenate
Normalised Forms phase.
• Out-Of-Vocabulary (OOV) words. These are words that are neither
found in a standard dictionary nor Twitter metalanguage elements.
Each token classified in this category is sent to the Classify OOV Word
phase.

We use the part-of-speech tagging module of Freeling within this phase. As
we deactivate Freeling’s probability assignment and unknown word guesser
module, all the words that are not contained in Freeling’s POS-tagging
dictionaries are not marked with a tag and are considered as OOV words.
Our standard vocabularies are, thus, the Freeling dictionaries themselves
for English and Spanish. Additionally, for Spanish we have extended the
standard vocabulary with a list of correct forms generated from the lemmas
found in the Real Academia Española Dictionary (DRAE) by Gamallo et al.
[2013].
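
The three-way classification can be sketched as follows. The regular expression and the in-memory vocabulary set are simplifying assumptions: in the thesis the vocabulary lookup is delegated to Freeling's dictionaries, which this example does not call.

```python
import re

# Hash-tags, user IDs, the "RT" marker and URLs (illustrative patterns)
METALANGUAGE = re.compile(r"^(#\w+|@\w+|RT|https?://\S+)$")

def classify_token(token, vocabulary):
    """Return 'TWITTER', 'IN_VOCABULARY' or 'OOV' for a single token.

    `vocabulary` is a set of lowercased standard-dictionary word forms.
    """
    if METALANGUAGE.match(token):
        return "TWITTER"
    if token.lower() in vocabulary:
        return "IN_VOCABULARY"
    return "OOV"

toy_vocabulary = {"i", "love", "this", "phone"}
labels = [classify_token(t, toy_vocabulary)
          for t in ["RT", "@user", "I", "luv", "this", "#phone"]]
```

Each label determines the next phase: Twitter elements go to metalanguage normalisation, in-vocabulary words go directly to concatenation, and OOV words go to the Classify OOV Word phase.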

Classify OOV Word. This phase receives every token classified as
out-of-vocabulary by the previous phase and detects whether it is correct, wrong,
or unknown. If the token is wrong, it returns the correct form of the token.
The phase executes the following steps:

1. Firstly, the token is looked up in a secondary dictionary for those words
that are not in a standard dictionary but that are known to correspond
to correct forms (mostly proper nouns). The search disregards both
case and accents. We have populated this secondary dictionary by
making use of the list of article titles from Wikipedia⁸⁵. To speed
up the process of querying the Wikipedia article titles (31,528,653 for
English and 4,391,392 for Spanish), we uploaded them to an HBase
store⁸⁶. In order to increase the coverage of this dictionary, we incorporated
into it two lists of first names obtained from the United States
Census Bureau⁸⁷ and from the Spanish National Institute of Statistics⁸⁸.
The list of first names for the English language contains 1,218
male names and 4,273 female names, while the list for the Spanish
language contains 18,679 male names and 19,817 female names.
(a) If an exact match of the token is found in the dictionary (e.g.
both forms are capitalised), then the token is classified as Correct
and sent to the Concatenate Normalised Forms phase with no
variation.
(b) If the token is found with variations of case or accentuation, then
the token is classified as Variation and its correct form is sent to the
Concatenate Normalised Forms phase.
(c) If the token is not found in the dictionary, then the process continues
in step 2.
2. The token is looked up in an SMS dictionary that contains tuples with
the SMS term and its corresponding correct form. The search is
case-insensitive, and does not consider accent marks. We have populated
such a dictionary with 898 commonly used SMS terms for English
extracted from different web sources. For Spanish, we have reused the
SMS dictionary of the Spanish Association of Internet Users⁸⁹, which
contains 53,281 entries.
(a) If the token is found in the SMS dictionary, then it is classified as
Variation and its correct form is retrieved and sent to the Concatenate
Normalised Forms phase.
(b) If the token is not found in the dictionary, then it is sent to the
Check and Correct Spell phase.

⁸⁵ http://en.wikipedia.org/wiki/Wikipedia:Database_download
⁸⁶ http://hbase.apache.org
⁸⁷ http://www.census.gov
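
The two lookups of this phase can be sketched with in-memory dictionaries, as below. The helper names (`fold`, `build_index`, `classify_oov`) and the label `Unresolved` (meaning "send to the Check and Correct Spell phase") are assumptions for the example; the thesis stores the secondary dictionary in HBase, which is not used here.

```python
import unicodedata

def fold(word):
    """Lowercase a word and strip its accent marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def build_index(correct_forms):
    """Map each folded form to the correct forms that produce it."""
    index = {}
    for form in correct_forms:
        index.setdefault(fold(form), []).append(form)
    return index

def classify_oov(token, secondary_index, sms_dictionary):
    """Return (label, form) following steps 1 and 2 of the phase."""
    candidates = secondary_index.get(fold(token))
    if candidates:
        if token in candidates:
            return "Correct", token          # step 1(a): exact match
        return "Variation", candidates[0]    # step 1(b): case/accent variation
    correct = sms_dictionary.get(fold(token))
    if correct is not None:
        return "Variation", correct          # step 2(a): SMS term
    return "Unresolved", token               # step 2(b): spell checking needed

index = build_index(["Málaga", "Orange"])
sms = {"u": "you", "gr8": "great"}
```

Folding both the dictionary keys and the token makes the search insensitive to case and accents while still allowing an exact-match check against the stored forms.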

Check and Correct Spell. This phase checks the spelling of the token received
and returns its correct form when possible. To do so, it executes the
following steps:

1. Firstly, the token is matched against regular expressions to find whether
it contains characters (or sequences of characters) repeated more than
twice (e.g. “loooooollll” and “hahaha”).
(a) If the token contains repeated characters (or sequences of characters),
then the repeated ones are removed (e.g. “lol” and “ha”),
and the resulting form is sent back to the Classify OOV Word
phase, since the new form may be included in the correct words
set.
(b) If the token does not contain repeated characters (or sequences of
characters), then the process continues in step 2.
2. The token is sent to an existing spell checking and correction
implementation. We make use of Jazzy⁹⁰, an open-source Java library. For
the creation of the spell checker dictionaries used by Jazzy, we made
use of the different varieties of English and Spanish dictionaries⁹¹. The
resulting dictionaries contain 237,667 terms for English and 683,462
terms for Spanish.
(a) If the spell checking is correct, then the token is classified as Correct
and sent to the Concatenate Normalised Forms phase without
a variation.
(b) If the spell checking is not correct, then the token is classified
as Variation and the first correct form returned by the spelling
corrector is sent to the Concatenate Normalised Forms phase.
(c) If the spell checker is not able to propose a correct form, the token
is classified as Unknown and is sent to the Concatenate Normalised
Forms phase without a variation.

⁸⁸ http://www.ine.es/inebmenu/indice.htm
⁸⁹ http://aui.es
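
Step 1 of this phase can be illustrated with a single regular expression. The pattern below is an assumption (the thesis does not give its expressions): it collapses any character or sequence of characters that occurs three or more times in a row into one occurrence.

```python
import re

# A group followed by at least two more copies of itself
REPEATED = re.compile(r"(.+?)\1{2,}")

def collapse_repetitions(token):
    """Reduce characters or sequences repeated more than twice to one copy."""
    return REPEATED.sub(r"\1", token)

examples = [collapse_repetitions(t) for t in ["loooooollll", "hahaha", "cool"]]
```

Letters that are legitimately doubled, as in "cool", are left untouched because they are repeated only twice. The collapsed form would then be sent back to the Classify OOV Word phase.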

Normalise Twitter Metalanguage Element. This phase performs a syntactic
normalisation of Twitter metalanguage elements. Specifically, it executes
the rules enumerated next.

1. Remove the sequence of characters “RT” followed by a mention of a
Twitter user (marked by the symbol “@”) and, optionally, by a colon
punctuation mark;
2. Remove user IDs that are not preceded by a coordinating or subordinating
conjunction, a preposition, or a verb;
3. Remove the word “via” followed by a user mention at the end of the
tweet;
4. Remove all the hash-tags found at the end of the tweet;
5. Remove all the “#” symbols from the hash-tags that are maintained;
6. Remove all the hyper-links contained within the tweet;
7. Remove ellipsis points that are at the end of the tweet, followed by a
hyper-link;
8. Replace underscores with blank spaces; and
9. Divide camel-cased words into multiple words (e.g. “DoNotLike” is
converted to “Do Not Like”).

⁹⁰ http://jazzy.sourceforge.net
⁹¹ http://sourceforge.net/projects/jazzydicts

As an example, after applying metalanguage normalisation, the tweet

RT @AshantiOmkar: Fun moments with @ShwetaMohan at the
O2! She was wearing a #DVY #DarshanaVijayYesudas outfit!
http://t.co/...

is converted into the text

Fun moments with Shweta Mohan at the O2! She was wearing a
DVY Darshana Vijay Yesudas outfit!

which is easier for a part-of-speech tagger to process.
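
The normalisation rules can be approximated with regular expressions, as in the sketch below. Two simplifications are assumptions of this example: rule 2 needs part-of-speech information, which is replaced here by a small hypothetical word list (`KEEP_AFTER`), and the "@" symbol of the user IDs that are kept is dropped, as the example tweet suggests. Underscore replacement and camel-case splitting are applied only to hash-tags and user IDs.

```python
import re

# Hypothetical stand-in for "preceded by a conjunction, preposition or verb"
KEEP_AFTER = {"with", "and", "or", "to", "from", "by", "for", "that", "is", "was"}

def split_element(name):
    """Rules 8 and 9: underscores to spaces, camel case to separate words."""
    name = name.replace("_", " ")
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)

def normalise_tweet(text, keep_after=KEEP_AFTER):
    text = re.sub(r"\bRT\s+@\w+:?\s*", "", text)                    # rule 1
    text = re.sub(r"(\.{3}|…)\s*(?=https?://\S+\s*$)", " ", text)   # rule 7
    text = re.sub(r"https?://\S+", "", text)                        # rule 6
    text = re.sub(r"\bvia\s+@\w+\s*$", "", text)                    # rule 3
    text = re.sub(r"(\s*#\w+)+\s*$", "", text)                      # rule 4
    text = re.sub(r"#(\w+)", lambda m: split_element(m.group(1)), text)  # rule 5

    def mention(match):                                             # rule 2
        previous, user = match.group(1), match.group(2)
        if previous.lower() in keep_after:
            return previous + " " + split_element(user)
        return previous

    text = re.sub(r"(\w+)\s+@(\w+)", mention, text)
    text = re.sub(r"(^|\s)@\w+", r"\1", text)   # mentions with no word before
    return re.sub(r"\s+", " ", text).strip()

tweet = ("RT @AshantiOmkar: Fun moments with @ShwetaMohan at the O2! "
         "She was wearing a #DVY #DarshanaVijayYesudas outfit! http://t.co/...")
normalised = normalise_tweet(tweet)
```

With these assumptions the function reproduces the conversion of the example tweet shown above; a faithful implementation would take the decision of rule 2 from the tagger output instead of a word list.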

Concatenate Normalised Forms. This phase receives the normalised form of
each token and concatenates these forms to produce the normalised post.

8.1.3 Rule-based Modelling Technique


The techniques for detecting Consumer Decision Journey stages and for identifying
emotions in user-generated content are based on the recognition of patterns
as sequences of particular words. These patterns are part of what we call
“linguistic rules”: a description of the pattern as particular conditions that have to
be met in order to consider the text an example of a particular category. The
general structure of the linguistic rules is shown next.

<Linguistic Pattern> → <Classification Action>

The antecedent of the rule reflects the pattern/template of an expression in
natural language and the consequent defines an action to be performed, which
consists in modifying a numerical value associated to a given category.
Listing 8.1 shows the BNF (Backus-Naur Form) grammar [Backus et al., 1963]
according to which the rules are expressed.

<rule set> ::= (<chunk rule> | <classification rule>)*
<chunk rule> ::= <pattern> "."
<classification rule> ::= <pattern> "->" <action>
<pattern> ::= (<word> | <lemma> | <part of speech> |
              <lemma and part of speech> | <entity> |
              <any number of words between> |
              <max number of words between>)+
<word> ::= '"' <string> '"'
<lemma> ::= <string>
<part of speech> ::= "[" <string> "]"
<lemma and part of speech> ::= <string> "#" <string>
<entity> ::= "ENTITY"
<any number of words between> ::= "*"
<max number of words between> ::= "/" <positive integer number> "/"
<action> ::= <class> <operation> <value>
<class> ::= <string>
<operation> ::= "+" | "-" | "*"
<value> ::= <decimal number>

Listing 8.1: BNF grammar of the linguistic rules
Rules can be defined either as classification rules, which perform classification
actions (e.g. incrementing the value for a given category), or as chunk rules,
which divide the text into fragments.
The first component of classification and chunk rules is a linguistic pattern (see
pattern in Listing 8.1). Such a pattern describes the relevant features of an expression
in natural language at the morphosyntactic level. Each word of the pattern can
be represented by itself (e.g. “girls”), or as its lemma (e.g. girl), either alone
or with (some components of) its part-of-speech tag (e.g. girl#N). Sometimes
only the part-of-speech tag is important (e.g. [N]), and at other times only the
maximum number (e.g. /1/) or the existence (e.g. *) of words matters.
This allows for quite a flexible specification (see Sections 8.2.2.2 and 8.4.2.3 for
examples).
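
To make the pattern elements concrete, the sketch below matches a pattern, given as a list of elements, against a sequence of (word, lemma, part-of-speech) tuples. It is an illustrative matcher and not the rule engine of the thesis: the `ENTITY` element is omitted, part-of-speech elements are compared as tag prefixes (an assumption based on the "some components of" remark), and the function names are hypothetical.

```python
import re

def element_matches(element, token):
    """Check one non-wildcard pattern element against a (word, lemma, pos) token."""
    word, lemma, pos = token
    if element.startswith('"') and element.endswith('"'):   # "girls": literal word
        return word.lower() == element[1:-1].lower()
    if element.startswith("[") and element.endswith("]"):   # [N]: part of speech
        return pos.startswith(element[1:-1])
    if "#" in element:                                       # girl#N: lemma and tag
        lemma_part, pos_part = element.split("#", 1)
        return lemma.lower() == lemma_part.lower() and pos.startswith(pos_part)
    return lemma.lower() == element.lower()                  # girl: lemma

def match_from(pattern, tokens, start):
    """Return the end index of a match beginning at `start`, or None."""
    if not pattern:
        return start
    head, rest = pattern[0], pattern[1:]
    gap = re.fullmatch(r"/(\d+)/", head)
    if head == "*" or gap:
        # "*" skips any number of words, "/n/" skips at most n words
        max_skip = len(tokens) - start if head == "*" else int(gap.group(1))
        for skipped in range(max_skip + 1):
            if start + skipped > len(tokens):
                break
            end = match_from(rest, tokens, start + skipped)
            if end is not None:
                return end
        return None
    if start < len(tokens) and element_matches(head, tokens[start]):
        return match_from(rest, tokens, start + 1)
    return None

def find_pattern(pattern, tokens):
    """Return (start, end) of the first match of `pattern` in `tokens`, or None."""
    for start in range(len(tokens)):
        end = match_from(pattern, tokens, start)
        if end is not None:
            return start, end
    return None

tagged = [("I", "i", "PRP"), ("bought", "buy", "VBD"), ("a", "a", "DT"),
          ("new", "new", "JJ"), ("car", "car", "NN")]
```

For instance, the pattern `buy#V /2/ [N]` matches "bought a new car" in the tagged sentence, whereas `buy#V /1/ [N]` does not, because two words separate the verb from the noun.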
Regarding classification rules (see classification rule in Listing 8.1), such rules
perform an arithmetic operation over a value corresponding to a given category
whenever the linguistic pattern is matched against the text. The operations
available are addition, subtraction and multiplication, denoted by the operators
“+”, “−” and “∗”, respectively. The addition and subtraction operations are
used to designate the polarity of a classification, as in the case of sentiment
analysis (e.g. the adjective “smart” can be modelled with “+1”, while “fool” can
be modelled with “−1”). The multiplication operation is useful to invert the
polarity of a unit (e.g. the negation particle “no” can be modelled with “∗ −1”),
and to increase or decrease its value (e.g. the adverb “very” can be modelled
with “∗2”, while “little” can be modelled with “∗0.5”).
The rule engine executes the following steps for classifying a text:

1. Firstly, the lemma and the part-of-speech tag of every token (i.e. lexical
unit) included in the text are obtained, outputting a sequence of tuples
made up of the token, its lemma and its morphosyntactic category. There-
fore, this step performs the lemmatisation and part-of-speech tagging of the
tokens received as described in Section 2.6.1.
In our experiments, the morphosyntactic annotations were added by the
use of the Freeling part-of-speech tagger. Therefore, the part-of-speech
tags used for English are those defined by Santorini [1991] and for Spanish
those standardised by Leech and Wilson [1996].

2. In this second step, a sentence splitter divides the texts. Additionally, the
set of chunk rules is applied in order to divide the text into the different
sequence units to be analysed (e.g. the conjunction “and” can determine
two units: the one on its left side and the one on its right side).
In our experiments we reused Freeling’s sentence splitter.

3. The third step consists in identifying the linguistic patterns that match the
entire text or a part of the text obtained in the previous step. For each
sequence unit, it identifies the antecedents of the rules that match all or
part of the unit. If several matching antecedents overlap on the same part
of the unit:

(a) If their corresponding consequents affect the same category, it selects
the first rule among the most restrictive ones (i.e. among the ones
that match the longest text, the one found in first place). Once the
matching expressions have been detected, a tuple made up of the category
(e.g. “PURCHASE”), the operation (e.g. “+”), and the value
(e.g. “1”) of the consequent is appended to a list of operations.

(b) If their corresponding consequents affect different categories, a tuple
for each category is appended.

Otherwise, i.e. if there is no matching expression, nothing is appended
to the list.

As misspellings are likely to be found in user-generated content, this matching
step is not case-sensitive and does not take into account accent marks.
Therefore, all the words and lemmas contained both in the rules and in
the texts are transformed to lowercase and accent marks are stripped from
them.

4. When all the units of the text have been processed, the list of operations
is computed. First the sum operations are carried out (i.e. the positive
and the negative values are added up) and then the product operations are
applied to the result of that addition (e.g. “* -1” for inverting the value
due to a negation, or “* 2” for doubling the value due to an intensifying
adverb).

5. As a result of that computation, a numeric value is obtained for each category
contained in the consequents of the rules.
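
The computation of steps 4 and 5 can be sketched as follows. The list of operations is assumed to be a list of (category, operation, value) tuples, as produced in step 3; the function name and the category labels are illustrative.

```python
from collections import defaultdict

def compute_scores(operations):
    """Add up the '+'/'-' values per category, then apply the '*' factors."""
    sums = defaultdict(float)
    factors = defaultdict(lambda: 1.0)
    for category, operation, value in operations:
        if operation == "+":
            sums[category] += value
        elif operation == "-":
            sums[category] -= value
        elif operation == "*":
            factors[category] *= value
        else:
            raise ValueError("unknown operation: " + operation)
    categories = set(sums) | set(factors)
    return {category: sums[category] * factors[category] for category in categories}

# "not very smart": smart -> +1, very -> *2, no/not -> * -1
scores = compute_scores([("POSITIVE", "+", 1), ("POSITIVE", "*", 2),
                         ("POSITIVE", "*", -1)])
```

Because the products are applied after the additions, the order in which the rules were matched within the text does not affect the final value of a category.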

In the case of chunk rules (see chunk rule in Listing 8.1), such rules split a text
into the fragments delimited by the linguistic pattern. For example, the
rule “[CC] .” implies that whenever a coordinating conjunction is found within a
text, such text will be divided into two fragments, the one before the coordinating
conjunction, and the one after the coordinating conjunction. Classification rules
will apply to each fragment separately.
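
The effect of a chunk rule such as “[CC] .” can be illustrated with the sketch below, which handles only the simplest case of a pattern made of a single part-of-speech element. The function name and the (word, tag) input format are assumptions for the example.

```python
def split_on_tag(tagged_tokens, tag="CC"):
    """Split a tagged text into the fragments delimited by tokens with `tag`."""
    fragments, current = [], []
    for word, pos in tagged_tokens:
        if pos == tag:
            # The delimiter itself is not part of any fragment.
            if current:
                fragments.append(current)
            current = []
        else:
            current.append((word, pos))
    if current:
        fragments.append(current)
    return fragments

tagged_text = [("I", "PRP"), ("like", "VBP"), ("it", "PRP"), ("and", "CC"),
               ("she", "PRP"), ("hates", "VBZ"), ("it", "PRP")]
units = split_on_tag(tagged_text)
```

Each resulting unit would then be matched against the classification rules independently.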

8.2 Technique for Detecting Consumer Decision Journey Stages

In order to achieve one of the objectives of this thesis, i.e. to develop a technique for
automatically classifying short user-generated texts into stages of the Consumer
Decision Journey, we have carried out the activities described next.

1. The Data Understanding activity collects a corpus of texts generated by
consumers, creates a gold standard from the gathered corpus and verifies
that the gold standard is valid for learning purposes. The instantiation of
this activity for gathering the corpus for the detection of Consumer Decision
Journey stages is explained in Section 8.2.1.

2. The Data Preparation activity covers all the tasks required to construct the
dataset for learning and evaluating the technique, including data cleans-
ing and content normalisation. This activity is common to other content-
analysis techniques, and has been described in Section 8.1.2.

3. The Modelling activity engineers a rule-based model for classifying
user-generated content into Consumer Decision Journey stages. This activity is
explained in Section 8.2.2.

8.2.1 Data Understanding Activity


This activity consists in the ordered execution of the following tasks:

1. The Collect Initial Data task consists in gathering the corpus and creating
the gold standard required for learning purposes. This task is described in
Section 8.2.1.1.

2. The Describe Data task consists in performing a description of the format
and volume of the gold standard. This task is described in Section 8.2.1.2.

3. The Explore Data task consists in performing a deeper statistical analysis
of the gold standard from several viewpoints to ensure that it is valid for
modelling purposes. This task is described in Section 8.2.1.3.

4. The Verify Data Quality task consists in examining the quality of the gold
standard by attending to the analyses performed in the previous tasks. This
task is described in Section 8.2.1.4.

8.2.1.1 Collect Initial Data Task

This task applies the approach described in Section 8.1.1 for retrieving textual
contents mentioning commercial brands from different social media, and con-
structs the gold standard required for model creation and evaluation.
In order to identify the linguistic patterns utilised to express the different
stages of the Consumer Decision Journey, and also to carry out the evalua-
tions, this task builds a gold standard by manually annotating a corpus of user-
generated content according to the Consumer Decision Journey stages that can
be derived from such content. To do so, human annotators are asked to tag each
text with just one label following the description provided below.

Awareness. All the texts that refer to advertisement campaigns or opinions
about advertisements, generally expressed in first person. These texts
should contain information about the user’s experience with respect to the
advertisement or the knowledge of the brand. For example⁹²:

I love Ford’s ad

⁹² The examples included in this thesis correspond to individual comments about brands.
Therefore, the opinions presented in this document do not necessarily correspond to the view of
the author, nor do they represent the majority judgement of consumers.
Evaluation. All the texts that state interest in and/or show active research
on the brand or product. For example:

My daughter and I are looking for a Fiat-like van in good condition

The annotator should also annotate as evaluation all the texts that express
a preference (positive or negative) and from which we cannot infer user
experience. For instance:

Well, I’d rather fly with Emirates than with Ryanair

Purchase. All the texts that explicitly express the decision to buy, generally
conveyed in first person and in future tenses. Texts that refer to the exact
moment of the purchase also belong to this stage. For example:

The car is in the authorized dealer, I’m buying it tomorrow

Post-purchase. All the texts that explicitly refer to a past purchase and/or an
actual user experience, generally expressed in first person, in present
as well as in past tenses. Texts that convey the possession or the use of
some product are also annotated as “post-purchase”, even if they express no
opinion about it. Some examples:

We went on the Mazda

I bought a 2002 Jaguar two days ago

I’ve been using a pair of Nike for the past two years, and I’m
delighted

However, not all the texts in the corpus clearly pertained to one of the Consumer
Decision Journey categories. A large number of the
texts did not imply user experience, or the stages appeared mixed. Therefore,
we established two other categories under which the human annotators could tag
the texts: ambiguous and no corresponding. The specific instructions to annotate
these kinds of texts are the following:
Ambiguous. All the texts where the author recommends or criticises the prod-
uct or brand but they do not imply active evaluation or user experience.
Also, all the texts in which one cannot distinguish if the author is express-
ing a post-purchase experience or an evaluation, or all those texts where
the author explicitly recommends some product or brand. For instance:

I want the Mazda

I love the clothes from Zara

I advise you to buy this Bimbo bread

Ambiguous texts are discarded from the gold standard.

No corresponding. All the texts that contain news headlines or corporate or
informative messages about the brand or product, without user’s opinions or
statements. This category also includes all the questions where one cannot
infer user experience, evaluation, or purchase intention, texts that express
user experience, evaluation or purchase intention of a third person, and
texts that imply the sale of the product and do not contain user experience.
Some examples:

Nike opens its first shop in Madrid

My father bought the gasoline 1.6 gls full

Land Rover car year ’99 for sale

In the experiment conducted in this thesis, two experts on marketing annotated
each text as belonging to one of the four Consumer Decision Journey
stages (i.e. awareness, evaluation, purchase or post-purchase). All the annotations
were then checked by one reviewer with social sciences background and by
two reviewers with computational linguistics background. The consensus between
annotators was sought during the execution of this process.
8.2.1.2 Describe Data Task

Regarding data format, the dataset used has a structure containing the text
gathered, plus other metadata, and its classification. The data schema –a view of
the Social Graph Ontology with the ontology elements required by this technique–
is shown in Figure 8.5. The classes and properties included in the diagram have
been already described in Chapter 5.
Regarding its volume, the dataset used for modelling and evaluating the technique
for detecting Consumer Decision Journey stages (i.e. the gold standard)
consists of 13,980 opinions written in English and 22,731 opinions written in
Spanish.
The length of the texts ranged from 2 to 194 words. The texts were collected
from five different social media sources (forums, blogs, reviews, social networks,
and microblogs) and refer to different domains: automotive industry, banking,
beverages, sports, telecommunication, food, retail and utilities.
The opinions were selected by looking for a set of 72 particular trademarks of
the different domains (or business sectors).

[Figure 8.5 (class diagram, not reproduced). Classes shown: marl:Opinion,
sioc:Post, sioc:Forum, sgo:PurchaseStage (with the instances sgo:awareness,
sgo:evaluation, sgo:purchase and sgo:postpurchase), and skos:Concept for the
trademark (isocat:DC-414) and for the domain (isocat:DC-2212). Properties
shown: marl:describesObject, marl:extractedFrom, marl:optinionText
(language-annotated), sgo:hasPurchaseStage, sioc:has_container, sioc:topic,
dcterms:type and isocat:datcat.]

Figure 8.5: Format of the data used by the technique for detecting Consumer
Decision Journey stages
8.2.1.3 Explore Data Task

This task characterises the data from different viewpoints to ensure that the gold
standard is rich enough for model learning purposes. Specifically, the objective
of this task is to describe the distribution of the data with respect to media
sources, business sectors, and Consumer Decision Journey categories.
Figure 8.6 shows the distribution of the texts along the media sources and
business sectors for which the data used in our experiments were gathered, while
Figure 8.7 shows the distribution of texts along the Consumer Decision Journey
categories.

                 Automotive  Banking  Beverages  Sports  Telecom.  Food  Retail  Utilities
Social Networks         678      122        746     778       809     0     248          9
Reviews                 420      647          7     351       661     0     170          0
Microblogs             2488     2165       6792    2671      1940  3876    3140       1871
Blogs                   219      608        768     498       553   371     110         55
Forums                  673      652         54     731       720    23      72          5

Figure 8.6: Distribution of the texts along the media sources and sectors for the
Consumer Decision Journey gold standard

              Automotive  Banking  Beverages  Sports  Telecom.  Food  Retail  Utilities
Postpurchase         474      411        327     886      1395   340     569        411
Purchase              55       12         53     113        66    34     242        167
Evaluation           125       42         29     116        82    49      99         39
Awareness            195      146        514     138       182   293      89         60

Figure 8.7: Distribution of the texts along the Consumer Decision Journey categories

8.2.1.4 Verify Data Quality Task

This task examines the quality of the data, ensuring that the gold standard is
valid enough for modelling the classifier.
Thanks to the variety of sectors selected, it was possible to have a cross-domain
and cross-source perspective, which allowed us to generalise
the linguistic rules proposed and to study the relation among different stages,
product typology and number of texts produced.
All the texts of the corpus were written by users of different sites and social
media, thus we found a lot of grammatical errors and misspellings, which posed
additional difficulties for pattern identification. Moreover, all the texts were in
English and Spanish but with different geographical language varieties in both cases
(e.g. American Spanish, European Spanish, American English, British English),
thus some lexical units were especially hard to detect.
We observed that there is a general tendency to comment on or analyse the quality
and features of expensive or high involvement products, while cheaper ones
received much less feedback. Particularly, in the case of cars, mobile providers or
sports clothing and shoes (sectors Automotive Industry, Telecommunication, and
Sports, respectively), we observed that customers tend to write more evaluative
texts, investigating the pros and cons of different brands before buying them.
Users are also inclined to comment on their personal experiences with the product
after using it. Accordingly, it is more difficult to find evaluative messages about
consumer-packaged goods such as beverages or food, whose cost is typically much
lower. In these cases, consumers require less deliberation, show less involvement,
and usually do not compare these products with their competitors before
purchasing them. However, in the case of cheaper products, consumers tend to
pay much more attention to the advertising campaigns (awareness). Correspondingly,
the number of comments about their post-purchase experience is also lower
for this kind of product.
As can be seen in Figure 8.7, the number of texts per category is unbalanced
along the different stages of the Consumer Decision Journey for the
different business sectors. Despite these differences across domains, we consider
the corpus varied enough for learning and evaluating the technique for identifying
Consumer Decision Journey stages, since it consists of a random sample of the
posts produced for the domains being monitored, and the overall volume of texts
for each stage is adequate for learning and evaluation purposes.
In order to estimate how reliable the annotation was, an excerpt of the classified
corpus (1,000 texts) along with the annotation criteria were given to a group
of annotators through the Amazon Mechanical Turk⁹³ annotation services (see
question 3 in Figure 8.8 for an example). Each text was classified by two different
anonymous human annotators and compared against the annotation in the
gold standard. To measure the inter-annotator agreement we chose Fleiss’ kappa
metric [Fleiss, 1973], which takes the value of 1 for a perfect agreement between
the annotators and 0 (or a negative number) if the agreement is the same as (or
worse than) the one expected by chance. In our case, the value for this metric was
0.503, which is generally regarded as a moderate value.

⁹³ http://www.mturk.com
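
For reference, Fleiss' kappa can be computed as in the sketch below. It follows the standard definition of the metric and assumes that the input is a matrix with one row per text and one column per category, where each cell holds the number of annotators that assigned the category to the text, and that every text received the same number of annotations. The toy matrices are invented for illustration and are not data from the thesis.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix of per-item category counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement for each item, then averaged over items
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts]
    observed = sum(per_item) / n_items

    # Agreement expected by chance from the overall category proportions
    proportions = [sum(row[j] for row in counts) / (n_items * n_raters)
                   for j in range(n_categories)]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)

perfect = fleiss_kappa([[2, 0], [0, 2]])
chance = fleiss_kappa([[2, 0], [0, 2], [1, 1], [1, 1]])
```

In the first toy matrix both annotators always agree, so kappa is 1; in the second the observed agreement equals the agreement expected by chance, so kappa is 0.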
Figure 8.8: Example annotation of a post according to a Consumer Decision
Journey category using Amazon Mechanical Turk

8.2.2 Modelling Activity
The goal of this activity is to develop an automatic classifier for identifying
Consumer Decision Journey stages within user-generated content. This activity
consists in the ordered execution of the following tasks.

1. The Select Modelling Technique task consists in selecting and describing a
modelling technique to be applied for identifying Consumer Decision
Journey stages in user-generated content.

2. The Build Model task consists in implementing a rule set against which the
posts will be matched in order to identify the Consumer Decision Journey
stages.

Next, each of these tasks is described.

8.2.2.1 Select Modelling Technique Task

The goal of this technique is to classify arbitrary content into zero or one
Consumer Decision Journey stages. For doing so, this technique relies on the
rule-based modelling technique described in Section 8.1.3.
Therefore, the resulting classifier matches the textual content received against
a rule set and outputs a numeric value for each of the four categories; a value
distinct from zero means that the post is classified into the corresponding
Consumer Decision Journey stage.
As described in Section 8.1.3, the selected rule-based classification technique
may output several candidate classification categories. However, as the output of
the Consumer Decision Journey classifier must consist of a single category, the
following heuristic is executed after rule matching:

(a) If the text is classified into one Consumer Decision Journey stage (i.e. only
one category has a value distinct from zero), then the classifier outputs such
category.

(b) If the text is classified into more than one Consumer Decision Journey stage
(i.e. more than one category has a value distinct from zero), then the one that
corresponds to the latest stage in the Consumer Decision Journey workflow
(shown in Figure 2.6) is selected, discarding the rest of the classifications.

(c) If the text cannot be classified into any of the stages (i.e. all the categories
have a zero value associated), then the classifier finishes without returning a
classification.
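The heuristic above can be sketched in a few lines. This is only an
illustration: the function name select_stage and the dictionary layout are
hypothetical, and the stage order (awareness, evaluation, purchase,
postpurchase) is assumed from the order in which the stages are presented in
this chapter rather than taken from Figure 2.6.

```python
# Assumed workflow order, from earliest to latest stage.
STAGE_ORDER = ["awareness", "evaluation", "purchase", "postpurchase"]


def select_stage(scores):
    """Return the single stage chosen by the heuristic, or None.

    `scores` maps a stage name to the numeric value produced by rule matching.
    """
    matched = [stage for stage in STAGE_ORDER if scores.get(stage, 0) != 0]
    # (a) one match: return it; (b) several matches: keep the latest stage;
    # (c) no match: no classification is returned.
    return matched[-1] if matched else None
```

Cases (a) and (b) collapse into the same expression because the latest of a
single matched stage is that stage itself.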

8.2.2.2 Build Model Task

This task consists in the development of the rule set capable of recognising frag-
ments of text from which a stage of the Consumer Decision Journey can be
derived, therefore classifying the social media posts which embed such fragments
of text according to the stage detected.
Although this task has been mainly executed by researchers of the group Tech-
nologies of Language Resources (TRL) of the Institut Universitari de Lingüı́stica
Aplicada of the Universitat Pompeu Fabra94 , we include its description in this
thesis for self-containment purposes. The result of the joint work regarding the
identification of Consumer Decision Journey stages in user-generated content has
been published by Vázquez et al. [2014].
A set of linguistic patterns was compiled by studying the gold standard in
order to distinguish among the different stages of the Consumer Decision Journey.
The developed classifier was based on the recognition of these particular linguistic
expressions.
Linguistic rules were built to match the occurrence of a lemma and its synonyms
or antonyms (to increase recall); the particular context where they could
occur is used as a restriction. The description of the context includes
morphosyntactic information as obtained with the tagger. The inclusion of
morphosyntactic information makes it possible to differentiate, for example,
between “I bought”, an expression related to postpurchase, and “I’m buying”,
related to purchase.
Some examples of linguistic patterns for matching Consumer Decision Journey
stages are given in Table 8.1. For example, the first pattern matches the gerund
form of the verb “to laugh”, followed by a preposition (any), the word “a” and
the lemma “commercial” at a maximum distance of one word.
94 http://www.iula.upf.edu/trl/rpresuk.htm

Language   Linguistic Pattern                    CDJ Stage
English    laugh#VBG [IN] “a” /1/ commercial     Awareness
English    wonder if ENTITY [MD] offer           Evaluation
English    i “will” buy                          Purchase
English    i call#VBD /1/ customer service       Postpurchase
Spanish    [PP1] [VA] gustar [DI] vídeo          Awareness
Spanish    estar#VIP1 buscar#VG                  Evaluation
Spanish    ir#VI1S “a” pillar [D]                Purchase
Spanish    [PP1] quedar#VI1 con ENTITY           Postpurchase

Table 8.1: Examples of the linguistic patterns for identifying Consumer Decision
Journey stages

In the development of these linguistic patterns, we started by looking for the
most frequent content words, bigrams and trigrams in the texts of each stage,
trying to relate them to just one of the phases, but the results were not
satisfactory. On the one hand, the most frequent bigrams and trigrams did not
help to clearly identify any specific stage. On the other hand, content words
used individually allowed us to identify some portions of texts as belonging to
one of the stages of the Consumer Decision Journey, but the recall and precision
were very low.
Therefore, we decided to use these lexical elements (i.e. the most frequent
content words) as starting point to build sets of more restrictive rules that in-
cluded morphosyntactic features, functional words, and synonyms and antonyms.
The inclusion of morphosyntactic tags allowed us to easily differentiate, for ex-
ample, between “I bought” used in postpurchase experience and “I will buy” that
classifies into the purchase stage. The introduction of functional words permitted
us to identify more complex expressions, as for example, “I’m going to buy” or
“thinking of buying something”. Finally, with the use of synonyms, antonyms and
other meaning-related words, we could increase the recall of our system.
In order to identify the morphological variations of the tokens, we used the
lemmas of the most frequent words (if we needed the exact word we put inverted
commas round it). This saved us from creating a pattern for each form of the word.
Additionally, we added morphosyntactic tags to specify what tense of the verb
or what morphological element we wanted to identify. Different heuristics for
engineering the rules for every stage are discussed next.

Identifying Awareness. As discussed in previous sections, in the texts
belonging to the awareness stage authors tend to comment on, criticise or talk
about their experience with respect to specific advertising campaigns or
promotions of the selected product or brand. Therefore, the rules that we
created to identify sentences pertaining to this stage (996 for English and 65
for Spanish) mostly rely on particular lexical items belonging to the
advertisement word family. Some examples are: “advertisement”, “campaign”,
“promotion”, “video”, “sign”, etc. In the initial analysis of this kind of
texts, we created more restrictive rules, matching longer portions of text.
However, further analysis of the classifier results showed that, when using
more lexicalised and less restrictive rules (with a small set of part-of-speech
tags and functional words), the final results of the classifier were equal or
even better.

Identifying Evaluation. Rules designed to identify evaluative texts (440 for
English and 167 for Spanish) showed more complexity than those created
to distinguish awareness. For this Consumer Decision Journey stage, rules
are longer and contain more morphosyntactic information, although the
weight of the lexical elements continues to be high. Generally, the rules of
this class are more restrictive than those for awareness.

Since in this step the user tends to compare products or brands, a great
amount of the rules identify comparative constructions. For example: “all
the best /1/” or “more [AQ] than”.

There are also rules which incorporate specific vocabulary usually used to
convey preference or comparisons such as “stand out”, “prefer”, “recom-
mend” and “suggest”.

Identifying Purchase. For this stage we defined 1,267 rules for English and
906 rules for Spanish. Generally, users tend to write a lot of comments
before and after purchasing a product, but the number of remarks about
the specific moment of the transaction is low. Additionally, the number of
different ways to express this specific stage is also smaller than for
other stages. We identified a set of verbs, generally expressed in future
tenses, whose meaning is related to “buy” or implies a purchase: “acquire”,
“hunt down”, “reserve”, “try”, “grab”, etc.

Identifying Postpurchase Experience. This is the stage with the most
complex rules (710 for English and 769 for Spanish). We found that there is
a strong relation between the type of product and the linguistic expression
of the postpurchase experience, which is ambiguous in many cases.
Consequently, for this stage, we decided to build rules with a considerable
amount of morphosyntactic information (to consider past tenses of the
verbs, for example) and lexical elements related to postpurchase customer
services (e.g. “complaint”, “unsubscribe”).

The rules have been defined to be used within the technique described in
Section 8.1.3, and are thus expressed according to the grammar shown in
Listing 8.1. The objective of the classifier is to obtain the Consumer Decision
Journey category into which a social media post can be classified. Therefore,
all the rules consist of a linguistic pattern to be matched and a classification
action that makes the numeric value associated to a category distinct from zero
whenever the linguistic pattern is matched, meaning that the text can be
classified into the given Consumer Decision Journey stage. Of all the possible
numeric operations that can be modelled with the rules grammar, this task only
makes use of addition operations, specifically adding one unit to the category
for which a pattern has been matched. An example of a linguistic rule obtained
by this activity is shown next.

about [TO] get /2/ tablet → PURCHASE + 1
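To make the behaviour of such a rule concrete, the following sketch matches a
simplified pattern against a tagged text and applies the addition action. It
is not the grammar of Listing 8.1 nor the engine described in Section 8.1.3: it
only supports plain lemmas, part-of-speech tags in square brackets and distance
markers, and it assumes that /n/ allows up to n intervening tokens. The names
parse, matches_at and apply_rule, as well as the (lemma, tag) token layout, are
hypothetical.

```python
def parse(pattern):
    """Turn a simplified pattern into (kind, value, max_gap) elements."""
    elements = []
    gap = 0
    for part in pattern.split():
        if part.startswith("/") and part.endswith("/"):
            gap = int(part.strip("/"))  # distance marker applies to the next element
        else:
            kind = "tag" if part.startswith("[") else "lemma"
            elements.append((kind, part.strip("[]"), gap))
            gap = 0
    return elements


def matches_at(elements, tokens, position):
    """Check whether the elements match the (lemma, tag) tokens from `position`."""
    if not elements:
        return True
    kind, value, gap = elements[0]
    for skipped in range(gap + 1):
        index = position + skipped
        if index >= len(tokens):
            break
        lemma, tag = tokens[index]
        if (tag if kind == "tag" else lemma) == value:
            if matches_at(elements[1:], tokens, index + 1):
                return True
    return False


def apply_rule(pattern, category, tokens, scores):
    """Add one unit to `category` if the pattern matches anywhere in the text."""
    elements = parse(pattern)
    if any(matches_at(elements, tokens, start) for start in range(len(tokens))):
        scores[category] = scores.get(category, 0) + 1
    return scores


# "I am about to get a new tablet", already lemmatised and tagged.
tokens = [("i", "PRP"), ("be", "VBP"), ("about", "IN"), ("to", "TO"),
          ("get", "VB"), ("a", "DT"), ("new", "JJ"), ("tablet", "NN")]
print(apply_rule("about [TO] get /2/ tablet", "PURCHASE", tokens, {}))
```

Under these assumptions the example text matches, since only two tokens
separate “get” from “tablet”, so the value of the PURCHASE category becomes
distinct from zero.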

8.3 Technique for Detecting Marketing Mix Attributes
In order to achieve one objective of our research, i.e. to develop a technique for
automatically classifying short user-generated texts into one or more of the
Marketing Mix categories, we have carried out the same activities as in the
previous technique (Data Understanding, Data Preparation, and Data Modelling),
which are described next.

8.3.1 Data Understanding Activity


As in the previous technique, this activity consists in the ordered execution of the
tasks Collect Initial Data, Describe Data, Explore Data and Verify Data Quality,
which are described next.

8.3.1.1 Collect Initial Data Task

This task applies the approach described in Section 8.1.1 for retrieving textual
contents mentioning commercial brands from different social media and constructs
the gold standard required for model creation and evaluation.
After retrieving the corpus, this task generates the gold standard used for
modelling and evaluating the Marketing Mix classifier. For doing so, human
annotators are asked to tag each text according to the following instructions:

Quality. All the texts that refer to the quality, performance, or positive or nega-
tive characteristics of a product that affect its user experience. For example:

Converse are extremely uncomfortable from the moment you put them on

Design. All the texts that include a reference about specific traits or features of
the product such as size, colour, packaging, presentation, and styling. For
example:

Anybody notices the car? GQ’s design collaboration with Citroen

Customer Service. All the texts that refer to the responsiveness and service
given by companies to customers in every stage of the Consumer Decision
Journey. Also texts that refer to technical and post-purchase support to
current and prospective customers. For example:

@MissTtheTeacher hiya, nope, I’m not through there. I’ve been
on at that Scottish Power mob for weeks. Their customer service
is laughable

Point of Sale. All the texts that include a mention to the physical place where
the product can be found and purchased. Similarly, texts that convey dif-
ficulty with finding the product in the right distribution channels such as
supermarkets, stores, outlets, dealerships, and stations. For example:

About to spend mad money at this Nike store!

Promotion. All the texts that refer to marketing strategies oriented to increase
demand such as contests, freebies, coupons, competitions, discounts, gifts,
and offers. For example:

@Jennorocks lego promotion on at Shell garages :)

Price. Texts that refer to the cost, value or price of the product. It may also
comprise texts that refer to specific price promotion such as discounts and
price cut, in which case the text should be annotated as Price and also as
Promotion. This category also includes texts with numerical references to
product prices. Some examples:

This Volkswagen I got my eye on is so sexy & it’s an affordable price

@carllongs on lighter hearted note soreen on offer at tesco! 80p

£1.47 and four slices have holes in them?! What on earth war-
burtons http://t.co/S9jSKS3LMo

Sponsorship. Texts that refer to awards, competitions, teams, foundations, per-
sons, charity fundraising, concerts and alike events which are organised, en-
dorsed or financially supported by the company or brand. Some examples:

Breaking News Sainsbury’s becomes title sponsor of the first Sport
Relief Games

School event this morning was sponsored by Scottish Power. Thinking
of charging an extra 10% without telling them

Advertisement. All the texts that include a reference to public, paid brand
announcements or messages broadcasted in the media or placed in outdoor
settings. Some examples:

These tv adverts are great aren’t they, Rory “interestin” McIlroy
on Santander, and best of of all Kerry Katona on pay day loans,
priceless!

The lidl ad on Rte Two just now had delicious written on the
screen. Surely its delicious? or is it subliminal advertising. #lidl

As in the previous technique, two marketing experts annotated each text
as belonging to one or more Marketing Mix elements (i.e. quality, design, point
of sale, customer service, sponsorship, advertisement, promotion and price), and
the annotations were then checked by one reviewer with a social sciences
background and by two reviewers with a computational linguistics background,
seeking consensus between annotators and reviewers.

8.3.1.2 Describe Data Task

Regarding data format, the data schema used by this technique is analogous to
the one used in the previous technique, but including Marketing Mix annotations
instead of Consumer Decision Journey ones (see Figure 8.9).
The opinions used in the experiment conducted in this thesis were the same as
those used for the technique for detecting Consumer Decision Journey stages.
Therefore, their characteristics regarding volume are the ones described in
Section 8.2.1.2.

[Class diagram: Data Format for the Marketing Mix Identification Technique.
Classes: marl:Opinion, sioc:Post, sioc:Forum, skos:Concept (trademark,
isocat:DC-414), skos:Concept (domain, isocat:DC-2212) and
sgo:MarketingMixAttribute. Properties: marl:describesObject,
marl:extractedFrom, marl:opinionText (language-annotated),
sgo:hasMarketingMixAttribute, sioc:has_container, sioc:topic and dcterms:type.
Marketing Mix attributes: sgo:design, sgo:quality, sgo:sponsorship,
sgo:customerService, sgo:price, sgo:promotion, sgo:advertisement and
sgo:pointOfSale.]

Figure 8.9: Format of the data used by the technique for detecting Marketing
Mix attributes

8.3.1.3 Explore Data Task

As the corpus used for learning and evaluating the classifier of this technique
is the same as the one used for the technique for identifying Consumer Decision
Journey stages in user-generated content, its distribution across social media
sources and business sectors is the same (i.e. the one shown in Figure 8.6).

8.3.1.4 Verify Data Quality Task

In the construction of the corpus we observed the difficulty of filtering texts
by their belonging to one of the Marketing Mix categories; the great majority
of the texts are irrelevant for our classification, given that just a small
group of them involves Marketing Mix elements (25% of the corpus). Nevertheless,
we consider the corpus varied enough for learning and evaluating this technique,
since it consists of a random sample of the posts produced for the domains
being monitored and the overall volume of texts for each category is adequate
for learning and evaluation purposes.
As in the previous technique, we used Amazon Mechanical Turk for estimating
annotation reliability (see question 2 in Figure 8.10 for an example). The value
for Fleiss’ kappa was 0.397, which is generally regarded as a fair value.

Figure 8.10: Example annotation of a post according to a Marketing Mix Cate-
gory using Amazon Mechanical Turk

8.3.2 Modelling Activity
The goal of this activity is to develop an automatic classifier for identifying Mar-
keting Mix attributes within user-generated content. This activity consists in the
ordered execution of the following tasks.

1. The Select Modelling Technique task consists in selecting and describing a
modelling technique to be applied for identifying Marketing Mix attributes
in user-generated content.

2. The Build Model task consists in implementing a machine-learning classifier
that identifies the Marketing Mix attributes.

Next, each of these tasks is described.

8.3.2.1 Select Modelling Technique Task

In order to automate the classification of texts based on the Marketing Mix
elements conveyed in them, this technique makes use of the Decision Tree (DT)
modelling technique defined by Quinlan [1993].
Specifically one binary classifier per Marketing Mix category is trained. Each
binary classifier determines whether the post belongs or not to a given Marketing
Mix category. Therefore, the classification for each category is made between the
positive class (for example, Advertisement) and the negative class (for example,
No Advertisement).
As a given text can belong to more than one category due to the use of mul-
tiple binary classifiers, we built a multi-category classifier that combines all the
binary classifiers in a process that iteratively identifies the set of Marketing Mix
attributes expressed in each text, returning the set of Marketing Mix attributes
for which its corresponding binary classifiers outputted a positive class.
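The combination of the binary classifiers can be sketched as follows. The
function classify_marketing_mix and the keyword-based predicates are
hypothetical stand-ins used only to show the control flow; in the technique
described here each predicate would be a trained decision tree.

```python
def classify_marketing_mix(text, binary_classifiers):
    """Return the categories whose binary classifier outputs the positive class."""
    return {
        category
        for category, is_positive in binary_classifiers.items()
        if is_positive(text)
    }


# Hypothetical keyword predicates standing in for the trained decision trees.
toy_classifiers = {
    "price": lambda text: "price" in text.lower(),
    "promotion": lambda text: "offer" in text.lower() or "discount" in text.lower(),
    "advertisement": lambda text: "advert" in text.lower(),
}

print(classify_marketing_mix("Great discount, the price dropped", toy_classifiers))
```

Because every classifier is consulted independently, a text may receive
several categories or none at all.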
We also tried classifiers based on the Logistic Regression model [le Cessie
and van Houwelingen, 1992], but the results were better with the DT classifiers
in terms of precision and recall. Additionally, a DT exposes the features that
are relevant for classification and is therefore easily interpretable by humans.
This made the results of these classifiers very useful for final visualisation
and human consumption purposes. This is a very important feature for creating
real-life applications in the marketing field, since it makes it possible to
visually show the customers of marketing agencies the criteria followed for text
classification. Moreover, the DT model can also be manually revised in order to
remove terms that appear as relevant features due to biased samples. For
example, “trainer” appeared as one of the discriminative features to decide if
a text belongs to the “design” category for the sports domain. With the direct
visualisation we could identify and eliminate it.

8.3.2.2 Build Model Task

This task consists in applying a machine-learning technique for learning the auto-
matic classifier for identifying Marketing Mix attributes in the content generated
by consumers. For doing so, the task executes the following steps:

1. Build Learning Datasets. This step constructs an individual learning dataset
for each Marketing Mix category, as each individual classifier is trained with
its own corpus containing positive and negative examples for a given category.
In the experiment conducted in this thesis, for each category (advertising,
customer service, design, point of sale, price, promotion, quality, and
sponsorship) we built a dataset with all the texts manually annotated as
belonging to that category as positive examples, and all the texts that do not
belong to it as negative examples. The number of positive examples in the
datasets ranged between 85 and 1,046 texts.

2. Part-Of-Speech Tagging. This step consists in tokenising, lemmatising
and annotating the texts with their corresponding part-of-speech tags, as
described in Section 2.6.1.
In our experiments, we made use of Freeling for executing this step.

3. Filter Stop-Words. This step consists in removing a list of stop-words
from the list of tuples output by the previous step, by attending to
their lemmas and part-of-speech tags. Such stop-words include not only
functional words but also brands and proper nouns.
The output of this step consists only of the lemmas of adjectives, verbs
(with the exception of auxiliary verbs) and common nouns, considering the
rest of the categories irrelevant or less important for the identification of
the Marketing Mix attributes.

4. Features Vector Construction. This step receives the filtered output of
the previous step and generates a vector of features.
We adopted a bag-of-words approach where words occurring in texts are
used as features of a vector. Thus, each text is represented as the occurrence
(or frequency) of words in it. This approach embodies the intuition that
the more frequent the word is in the texts of the class (i.e. Marketing Mix
element selected), the more representative it is of the content and therefore
of the class.

5. Features Selection. This step applies a chi-square feature selection method
in order to reduce vector dimensions by selecting the most relevant features.
The idea behind this feature selection method is that the most relevant
words to distinguish positive examples are those that are distributed most
differently in the positive and negative class examples.

6. Model Training. This step uses the vectors previously created for learning
a set of C4.5 [Quinlan, 1993] decision tree classifiers as implemented in Weka
[Hall et al., 2009].
The results for the negative class are generally much better than those
obtained for the positive class due to the larger number of texts of the
negative class used to train the classifiers. However, as the main objective of
our work is being able to introduce this tool in a real marketing scenario, we
find that it is preferable to classify a text in a negative class if the classifier
does not find enough cues than to erroneously classify it in a positive class.
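Steps 4 and 5 can be illustrated with a small, self-contained sketch. It is
not the Weka-based implementation used in the experiments and it does not
train the decision trees of step 6; the function names chi_square,
select_features and to_vector, and the toy documents, are assumptions made for
illustration. Each document is assumed to be the list of lemmas left after
stop-word filtering.

```python
from collections import Counter


def chi_square(pos_with, neg_with, pos_without, neg_without):
    """Chi-square statistic of a 2x2 term/class contingency table."""
    total = pos_with + neg_with + pos_without + neg_without
    denominator = (
        (pos_with + pos_without)      # positive documents
        * (neg_with + neg_without)    # negative documents
        * (pos_with + neg_with)       # documents containing the term
        * (pos_without + neg_without) # documents without the term
    )
    if denominator == 0:
        return 0.0
    difference = pos_with * neg_without - neg_with * pos_without
    return total * difference ** 2 / denominator


def select_features(documents, labels, k):
    """Return the k lemmas whose distribution differs most between classes."""
    vocabulary = sorted({lemma for document in documents for lemma in document})
    n_positive = sum(1 for label in labels if label)
    n_negative = len(labels) - n_positive
    scores = {}
    for term in vocabulary:
        pos_with = sum(1 for d, y in zip(documents, labels) if y and term in d)
        neg_with = sum(1 for d, y in zip(documents, labels) if not y and term in d)
        scores[term] = chi_square(
            pos_with, neg_with, n_positive - pos_with, n_negative - neg_with
        )
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(vocabulary, key=lambda term: (-scores[term], term))[:k]


def to_vector(document, features):
    """Bag-of-words vector: frequency of each selected feature in the document."""
    counts = Counter(document)
    return [counts[feature] for feature in features]


# Toy dataset for a hypothetical "price" classifier (True = positive class).
documents = [["price", "cheap"], ["price", "expensive"],
             ["store", "open"], ["ad", "funny"]]
labels = [True, True, False, False]
features = select_features(documents, labels, k=2)
vectors = [to_vector(document, features) for document in documents]
```

The resulting vectors, together with the labels, are the kind of input that a
decision tree learner would receive in the training step.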

8.4 Technique for Detecting Emotions


In order to achieve one objective of this thesis, i.e. to develop a technique for
automatically classifying short user-generated texts into one or more emotions,
we have carried out the same activities as in the previous techniques, which are
described next.

8.4.1 Data Understanding Activity


This activity consists in the ordered execution of the same tasks that were exe-
cuted for the previous techniques. These tasks are explained next.

8.4.1.1 Collect Initial Data Task

This task applies the approach described in Section 8.1.1 for retrieving textual
contents mentioning commercial brands from different social media, and
constructs the gold standard required for model creation and evaluation. Several
people with different backgrounds participated in this task.
The gold standard is created by annotating the gathered corpus according to
the conceptual framework defined in Section 2.5.3. Annotators were asked to tag
each text with zero or more labels. In order to understand the sentiments
involved in each category (and to help annotate the corpus), we have specified
the secondary sentiments related to each of them. The set of sentiments is based
on a reformulation of Richins [1997] and Shaver et al. [1987]; there is a list
of them for each sense within a category (see Table 8.2).
In the experiment conducted in this thesis we gathered a corpus of posts
written in Spanish about several commercial brands from various social media
and different business/market domains. The manual annotation of the texts was
carried out first by a person who annotated the resulting corpus of the Data Gath-
ering Activity according to the conceptual framework of emotions/sentiments (see
Section 2.5.3). This person followed some specific guidelines (e.g. if a secondary
sentiment in Table 8.2 was identified for a text, then it was classified under its
corresponding basic sentiment). This annotation process was supervised by two
more people, who examined the annotations and discussed them with the annotator
in case of disagreement. The three came from different backgrounds, though
closely related to the project field: the annotator had an advertising and
public relations background, one of the reviewers was an expert in social
sciences and the other one came from the computational side.

Primary           Secondary Sentiments
Trust             - Optimism, Hope, Security
Satisfaction      - Fulfilment, Contentment
Happiness         - Joy, Gladness, Enjoyment, Delight, Amusement
                  - Joviality, Enthusiasm, Jubilation
                  - Pride, Triumph
Love              - Passion, Excitement, Euphoria, Ecstasy
Fear              - Nervousness, Alarm, Anxiety, Tenseness, Apprehension, Worry
                  - Shock, Fright, Terror, Panic, Hysteria, Mortification
Dissatisfaction   - Dislike, Rejection, Revulsion, Disgust
                  - Irritation, Aggravation, Exasperation, Frustration, Annoyance
Sadness           - Depression, Defeat, Unhappiness, Anguish, Sorrow, Agony
                  - Melancholy
                  - Disappointment, Hopelessness, Dejection
                  - Shame, Humiliation, Guilt, Regret, Remorse
                  - Alienation, Isolation, Loneliness, Insecurity
Hate              - Rage, Fury, Wrath, Hostility, Ferocity
                  - Bitterness, Resentment, Spite, Contempt, Vengefulness
                  - Envy, Jealousy

Table 8.2: Primary and secondary sentiments

8.4.1.2 Describe Data Task

Regarding data format, the data schema used by this technique (see Figure 8.11)
is analogous to the one used in the previous techniques, but including emotion
annotations instead of Consumer Decision Journey or Marketing Mix ones.
Regarding volume and other gross attributes of the texts gathered, the corpus
we have used in our experiments is made up of 26,505 texts (709,095 words) in
Spanish taken from different channels including blogs, forums, microblogs
(specifically, from Twitter), product review sites, and social networks
(specifically, from Facebook). These texts are related to several brands
belonging to nine business sectors. Their choice is based on their relevance for
the media agency that participated in this work, Havas Media Group95, and on the
number of opinions that they generate according to their social media monitoring
tools. These domains also constitute a representative set of both
low-involvement and high-involvement products, i.e. products which are bought
frequently and with a minimum of thought and effort (e.g. soft drinks) and
products for which the buyer is prepared to spend considerable time and effort
(e.g. cars), as well as of products with different cost.

95 http://www.havasmg.com

[Class diagram: Data Format for the Emotions Identification Technique.
Classes: marl:Opinion, sioc:Post, sioc:Forum, skos:Concept (trademark,
isocat:DC-414), skos:Concept (domain, isocat:DC-2212) and onyx:EmotionCategory.
Properties: marl:describesObject, marl:extractedFrom, marl:opinionText
(language-annotated), onyx:hasEmotionCategory, sioc:has_container, sioc:topic
and dcterms:type. Emotion categories: sgo:satisfaction, sgo:dissatisfaction,
sgo:love, sgo:hate, sgo:happiness, sgo:sadness, sgo:trust and sgo:fear.]

Figure 8.11: Format of the data used by the technique for detecting emotions

Social Media Type    Distribution of texts
Blogs                19%
Forums               18%
Microblogs           39%
Review sites         10%
Social Networks      14%

Table 8.3: Distribution of texts for the sentiment corpus by social media type

8.4.1.3 Explore Data Task

This task characterises the data from different viewpoints to ensure that the
gold standard is rich enough for model learning purposes. Specifically, the
objective of this task is to describe the distribution of the data with respect
to media sources, business sectors, and emotion categories.
The distributions of the texts in the gold standard by social media type and
by business sector are shown in Tables 8.3 and 8.4 respectively.
According to the resulting annotation, only 27% of the texts could be said to
express a sentiment (14% expressed satisfaction, 13% expressed dissatisfaction,
1% expressed trust, 1% expressed fear, 1% expressed happiness, 0.5% expressed
sadness, 2% expressed love, and 3% expressed hate)96. The remaining 73% was
annotated as neutral regarding sentiments.

Domain                        Number of brands   Distribution of texts
Foods                         4                  7%
Automotive industry           10                 10%
Financial services            10                 11%
Drinks                        3                  24%
Cosmetics                     6                  7%
Sports                        2                  12%
Insurance companies           12                 11%
Telecommunication services    11                 10%
Tourism                       7                  8%

Table 8.4: Distribution of texts for the sentiment corpus by domain

8.4.1.4 Verify Data Quality Task

An excerpt of the classified corpus (300 texts) along with the annotation criteria
was given to a new annotator. This allowed us to estimate how reliable the
manual annotation was. To measure the inter-annotator agreement we chose
Cohen’s kappa metric [Cohen, 1960], which takes the value of 1 for perfect
agreement between annotators and 0 (or a negative number) if the agreement is
the same as (or worse than) expected by chance. In our case, the value for this
metric was 0.511, which is generally regarded as a moderate value.
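For two annotators, the metric can be computed as in the following minimal
sketch. The function name cohen_kappa and the input layout (two parallel lists
with one label per text) are assumptions for illustration, not the code used
in the experiments.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n_items = len(labels_a)
    # Observed proportion of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n_items
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n_items ** 2)
    # Undefined (division by zero) when both annotators always use one same label.
    return (observed - expected) / (1 - expected)
```

When the observed agreement equals the agreement expected by chance the
function returns 0, which corresponds to the lower reference point mentioned
above.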
Additionally, another excerpt of the classified corpus (1,000 texts) along with
the annotation criteria was given to a group of annotators through the Amazon
Mechanical Turk annotation services (see question 1 in Figure 8.12), as we did
in the previous techniques. Each text was classified by two different anonymous
human annotators and compared against the annotation in the gold standard.
To measure the inter-annotator agreement we chose Fleiss’ kappa metric (while
Cohen’s metric evaluates the agreement between two annotators, Fleiss’ metric
lets us evaluate the agreement for more annotators). The value for this metric
was 0.415, which is also regarded as a moderate value.

96 The reason why the addition of these percentages is over 27% is the subsumption by SD.

Figure 8.12: Example annotation of a post according to an Emotions category
using Amazon Mechanical Turk

8.4.2 Modelling Activity
The goal of this activity is to develop an automatic classifier for identifying emo-
tions within user-generated content. This activity consists in the ordered execu-
tion of the following tasks.

1. The Select Modelling Technique task consists in selecting and describing a
modelling technique to be applied for identifying emotions within
user-generated content.

2. The Generate Test Design task consists in generating a mechanism to test
the model for quality and validity.

3. The Build Model task consists in implementing a rule set against which the
posts will be matched in order to identify the emotion categories.

Next, each of these tasks is described.

8.4.2.1 Select Modelling Technique Task

The goal of this technique is to classify arbitrary content into zero or more
emotion categories. To do so, this technique relies on the rule-based modelling
technique described in Section 8.1.3.
Therefore, the resulting classifier matches the textual content received against
a rule set and outputs a numeric value for each of the four sentiment
polarities. A value greater than zero means that the post is classified in the
positive category of the given polarity, and a value lower than zero means that
the post is classified in its negative category.
Then, the numerical values are discretised to obtain the specific sentiment
categories in which the text has been classified, i.e. a positive value corresponds
to the positive emotion of a category and a negative value to the negative one
(see Table 2.4). If the value of a category is 0, the text is neutral with respect to
that category.
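
The discretisation step can be sketched as follows. This is only an illustration: the function name is hypothetical, the polarity codes are passed in as plain strings, and the generic labels "positive", "negative" and "neutral" stand in for the concrete emotion names of Table 2.4.

```python
def discretise(scores):
    """Map the numeric value of each polarity to a categorical label.

    `scores` is a dict such as {"LH": -2, "SD": 0}; the sign of each value
    decides the label, and 0 means neutral for that polarity.
    """
    labels = {}
    for polarity, value in scores.items():
        if value > 0:
            labels[polarity] = "positive"
        elif value < 0:
            labels[polarity] = "negative"
        else:
            labels[polarity] = "neutral"
    return labels
```

Note that the magnitude of the value is discarded at this point; only its sign determines the category.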

Domain                        Number of texts in    Number of texts in
                              the training set      the evaluation set
Foods                         592                   995
Automotive industry           86                    2,657
Financial services            411                   1,214
Drinks                        284                   2,106
Cosmetics                     572                   828
Sports                        451                   2,892
Insurance companies           334                   1,050
Telecommunication services    460                   2,601
Tourism                       408                   999

Table 8.5: Distribution of texts for the sentiment corpus for the training and test
sets by domain

8.4.2.2 Generate Test Design Task

In the experiment performed in this thesis, the annotated corpus was used to train
and evaluate the system. The training set used to create the rules contained a
sample of 80% of the texts annotated with a sentiment in the corpus (i.e. 13%
of the whole corpus), while the evaluation set contained a sample of 58% of the
corpus; both samples were made up of randomly chosen texts. Table 8.5 shows,
for each domain, the number of texts that have been considered in the training
and evaluation sets.
Finally, the quality measures used for evaluating the classifier are the ones
described in the evaluation section (see Section 8.7.2.3).

8.4.2.3 Build Model Task

The goal of this task is to build a classifier for analysing the sentiment of
user-generated content. To do so, this task engineers a set of rules capable of
recognising fragments of text from which a consumer emotion can be derived,
thereby classifying the social media posts which embed such fragments of text
according to the emotion detected.
This task has been mainly executed by a team of the Ontology Engineering
Group of the Universidad Politécnica de Madrid97, in which the author of this
thesis was not involved. However, the description of this task is included for
the sake of self-containment. The result of the joint work regarding the
identification of Consumer Decision Journey stages in user-generated content has
been published by Aguado de Cea et al. [2014].
The classification rules were compiled by studying the gold standard, as well as
by reusing two existing linguistic resources: Badele3000 [Bernardos and Barrios,
2008] and Calíope [Aguado de Cea and Bernardos, 2007]. Such resources are
described next.

Badele3000. Badele3000 is a domain-independent lexical-semantic database
with information about the 3,300 most frequent nouns in Spanish. The
theoretical linguistic foundation of this resource is the Meaning-Text Theory
(MTT) [Mel’čuk, 1996], especially the concept of Lexical Function (LF),
which relates two lexical units (the base and a certain value of the LF for
that base) accounting for the paradigmatic relations and the syntagmatic
relations (or collocations98) between those lexical units.

For example, if the base is “rain”, the relation of intensification is expressed
by “heavy”, i.e. its magnified (intensified) form is Magn(rain) = heavy,
while the magnified value of “wind” is Magn(wind) = strong. These data
let us know that rain goes with “heavy” but wind goes with “strong” and
that these are typical collocations of the English language to express that
rain and wind are intense.

The database contains more than 20,000 linguistic collocations. Additionally,
lexical units are organised in a hierarchical structure in which each
lexical unit is classified according to a semantic label (SL) hierarchy, which
usually corresponds to the hyperonym or immediate generic term. A lexical
unit ‘inherits’ the values of the LFs defined for the SL under which it is
classified.

Regarding those lexical units corresponding to sentiments, in Badele3000
they are classified under the semantic label sentimiento (sentiment)99 or one
of its children: sentimiento positivo (positive sentiment) and sentimiento
negativo (negative sentiment) (see Table 8.6). Therefore, the next step
was to obtain those lexical units (verbs, adjectives, etc.) that are values
of the LFs for the SL sentimiento, its children (“positive sentiment” and
“negative sentiment”) and its grandchildren (the nouns for sentiments).
So, for example, we obtained verbs such as embargar (be overwhelmed by)
expressing that a sentiment “exists (affecting someone)”, i.e. in terms of LFs,
Func1(sentimiento) = embargar (a alguien), and we could infer that
it also combined with the lexical units corresponding to sentiments such
as tristeza (sadness), emoción (emotion), alegría (happiness), etc. We also
obtained collocates which are specific for particular sentiments, but cannot
be used with other sentiments. For instance, “apoderarse (de alguien)”
(to be possessed by) can be used with miedo (fear), but not with alegría
(happiness).

In this way, we automatically obtained a list of collocates of Spanish nouns
for sentiments, which we could directly reuse in the creation of our rules.

Semantic Label                               Lemma
Sentimiento (sentiment)                      Deseo (wish)
                                             Ansiedad (anxiety)
                                             Sorpresa (surprise)
Sentimiento positivo (positive sentiment)    Amor (love)
                                             Felicidad (happiness)
                                             Satisfacción (satisfaction)
                                             Seguridad (security)
Sentimiento negativo (negative sentiment)    Dolor (pain)
                                             Pena (sadness)
                                             Desesperación (desperation)
                                             Miedo (fear)
                                             Sufrimiento (suffering)
                                             Odio (hatred/hate)
                                             Inseguridad (insecurity)

Table 8.6: Excerpt from sentiments in Badele3000

97
http://www.oeg-upm.net
98
A collocation is a partly or fully fixed sequence of words established through repeated use.
99
The translations of the example into English have been made for the sake of clarity.

Calíope. Calíope is a web application designed to help users learn contextualised
terms in English and Spanish by, first, providing examples of their use in
context and, second, showing the lexical-semantic relationships among
them. For these purposes, it manages two corpora, one for Spanish
and another one for English, as well as a glossary of terms for both
languages. Among all of Calíope’s functionalities, the ones that are noteworthy
for our work are the following:

• Addition of new texts to the corpus. This allowed us to include our
corpus in Calíope, which facilitated the retrieval of the vocabulary on
sentiments.
• Filtering of texts. This let us choose the texts we wanted to analyse.
• Frequency of words. This facility and the part-of-speech annotation
helped us to establish the most relevant words by grammatical cate-
gory. We used this result as one of the starting points for creating the
rules.
• Concordances of a term —i.e. occurrences of a term in the texts— and
co-occurrences of several terms (which are not necessarily adjacent).
These functionalities provided us with the contexts of the terms we
needed to examine in order to draw patterns/templates for the an-
tecedents of our rules.
The training set analysed to create the rules contained a randomly chosen
sample of 80% of the texts annotated with a sentiment in the corpus (i.e. 13%
of the gold standard). However, as explained before, the annotated corpus was
not the only source used to create the rules; they were also based on the set of
collocations of common sentiments obtained from Badele3000 and on the semantic
relations (reflected by the LFs) existing between them. This information was very
valuable because it helped us to derive expressions in the antecedents of the rules
and the sentiment category in their corresponding consequents.
Table 8.7 shows some rules created for the Love-Hate (LH) polarity. They
were written after having analysed the concordances of “odio” (hate/hatred),
found in the corpus via Calíope, and its collocations, retrieved from Badele3000
(see Table 8.8).

Meaning in Spanish           Meaning in English             Rules
mi/este odio a/por marca     my/this hatred against/for     [D] odio#NC [SP] ENTITY → LH - 1
                             brand
siento odio a/por marca      I feel hatred against brand    sentir#V odio#NC [SP] ENTITY → LH - 1
(cómo/cada día) odio (más)   I feel an increasing/growing   odiar#V a#SP /1/ ENTITY → LH - 1
a (el/la/esta/...) marca     hatred against/for brand       odiar#V más#RG a#SP /1/ ENTITY → LH - 2
                             What a hatred I feel           cómo odiar#V a#SP /1/ ENTITY → LH - 2
                             against/for brand
marca es (muy/tan/...)       brand is (very/so/...)         ENTITY ser#V odioso#A → LH - 1
odiosa                       hateful                        ENTITY ser#V muy#RG odioso#A → LH - 2

Table 8.7: Examples of rules for classifying emotions
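
To illustrate how rules of this kind can be matched against a post, the sketch below applies a few antecedents from Table 8.7 to a lemmatised and part-of-speech-tagged token sequence. It is not the rule engine used in the thesis and rests on several assumptions: tokens are encoded as lemma#TAG strings, a bracketed tag such as [SP] matches any token carrying that tag, the consequent "LH - 1" is read as the value -1 for the LH polarity, only the strongest value per polarity is kept, and rules containing the /1/ operator are left out because its semantics are not described here.

```python
def token_matches(pattern, token):
    """`[TAG]` matches any token with that tag; anything else must match exactly."""
    if pattern.startswith("[") and pattern.endswith("]"):
        return token.endswith("#" + pattern[1:-1])
    return pattern == token


def apply_rules(tokens, rules):
    """Return {polarity: value}, keeping the strongest value fired per polarity."""
    scores = {}
    for antecedent, (polarity, value) in rules:
        width = len(antecedent)
        # Slide the antecedent over the token sequence (contiguous match only).
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            if all(token_matches(p, t) for p, t in zip(antecedent, window)):
                if abs(value) > abs(scores.get(polarity, 0)):
                    scores[polarity] = value
                break
    return scores


# Antecedents taken from Table 8.7; the (polarity, value) encoding is an assumption.
RULES = [
    (["[D]", "odio#NC", "[SP]", "ENTITY"], ("LH", -1)),
    (["sentir#V", "odio#NC", "[SP]", "ENTITY"], ("LH", -1)),
    (["ENTITY", "ser#V", "odioso#A"], ("LH", -1)),
    (["ENTITY", "ser#V", "muy#RG", "odioso#A"], ("LH", -2)),
]

# "siento odio por marca", lemmatised and tagged, with the brand replaced by ENTITY
tokens = ["sentir#V", "odio#NC", "por#SP", "ENTITY"]
scores = apply_rules(tokens, RULES)
```

The resulting scores can then be discretised by sign to obtain the emotion category for each polarity.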

Lexical Function   Semantic Relation reflected by the LF              Value
FinFunc0           Dejar de existir (L) (to stop existing)            Desaparecer (to vanish)
IncepFunc0         Empezar a existir (L) (to start existing)          Emanar (to arise)
IncepFunc0         Empezar a existir (L)                              Nacer (to arise)
Func1              Afectar a algo/alguien (L) (to affect sth/sb)      Anidar (en algo/alguien) (to nest)
Func1              Afectar a algo/alguien (L)                         Palpitar (en alguien) (to beat)
Func1              Afectar a algo/alguien (L)                         Latir (en alguien) (to beat)
Func1              Afectar a algo/alguien (L)                         Embargar (a alguien) (to be overwhelmed by)
IncepPredMinus     Disminuir (L) (to decrease)                        Disminuir (to decrease)
IncepPredPlus      Aumentar (L) (to increase)                         Aumentar (to increase)
Manif              Mostrar (L) (to show)                              Mostrar (to show)
Oper1              Hacer (L) (to do)                                  Sentir (to feel)
Oper1              Hacer (L)                                          Tener (to feel)
Real1-M            Hacer lo esperable (con L) (to do the expected)    Ocultar (to conceal)
Real1-M            Hacer lo esperable (con L)                         Disimular (to disguise)

Table 8.8: Collocations of “odio” in Badele3000

8.5 Technique for Detecting Place of Residence

The goal of this technique is to identify the place of residence of users, defining
“place of residence of a user” as the geographical location where a user usually
lives. To achieve this goal we have carried out the same activities as with the
previous technique, which are described next.

8.5.1 Data Understanding Activity

This activity consists in the ordered execution of the Collect Initial Data, Describe
Data and Explore Data tasks. Next we explain each of these tasks.

8.5.1.1 Collect Initial Data Task

We have collected a corpus of users extracted from Twitter whose place of resi-
dence was known beforehand. For each user, we have extracted the location and
description declared in his/her profile, his/her timeline (i.e. tweets and retweets),
as well as the list of followers and users followed by the user. Additionally, we
have extracted the locations, descriptions and timeline of each user included in
the list of followers and followed.
We have restricted the number of friends for each user to 20 (10 followers
plus 10 persons followed by the user to be characterised), since Twitter limits the
number of calls to its API. Additionally we have restricted the number of tweets
analysed to 20, for the same reason, including tweets authored by the user and
retweets.

8.5.1.2 Describe Data Task

The dataset used has a structure containing data about 1,080 users, the content
shared and published by them, and the existing relationships among them and
other users. The data format also relates each user with a normalised geographical
location that represents her/his place of residence, defining a gold standard. Such
location is defined at the level of city and related with its administrative region of
second level (e.g. county, province), the administrative region of first level (e.g.
state, autonomous community), and the corresponding country. Additionally, the
data format relates the contents to the named entities of type location extracted
from them.
The data schema is shown in Figure 8.13. The classes and properties included
in the diagram have been already described in Chapter 5.

[Class diagram “Data Format for the Place of Residence Identification
Technique”. Classes: foaf:Agent, sioc:UserAccount, sioc:Post, tzont:Region,
skos:Concept. Properties: sioc:account_of, sioc:creator_of, sioc:follows,
dcterms:description, sgo:declaredLocation, sioc:content (language tagged),
foaf:based_near, sioc:topic, isocat:datcat (isocat:DC-4339, location).]

Figure 8.13: Data format of the corpus used by the technique for detecting the
place of residence of social media users

8.5.1.3 Explore Data Task

The users in the evaluation set are distributed among 11 different countries (Ar-
gentina, Chile, Colombia, Spain, USA, Japan, Mexico, South Africa, Switzerland,
Uruguay and Venezuela). Such users share and publish content in different lan-
guages (mainly in Spanish and English).

8.5.2 Data Preparation Activity


During this activity, we have pre-processed the contents published by the users, as
well as their descriptions in their profiles, by applying the common tasks defined
in Section 8.1.2. Nevertheless, we have not cleansed posts referring to particular
brands during the Clean Data task, as we consider all the content relevant for
extracting locations from them.

8.5.3 Modelling Activity


The goal of this activity is to develop an automatic classifier for detecting the
place of residence of social media users. This activity consists in the ordered
execution of the following tasks:

1. The Select Modelling Technique task consists in selecting and describing a
modelling technique to be applied for creating the classifier.

2. The Generate Test Design task consists in generating a mechanism to test
the model for quality and validity.

Next, each of these tasks is described.

8.5.3.1 Select Modelling Technique Task

We have experimented with five different approaches for detecting the place of
residence of a given social media user. Such approaches are summarised next.

1. Use the metadata about locations that users include in their social
network profiles.

2. Analyse the friendship networks of the users for inferring their place of
residence when it cannot be retrieved from location metadata.

3. Perform text mining of the descriptions written by users about themselves
in their profiles for inferring their place of residence when it cannot be
retrieved from location metadata.

4. Perform text mining of the content published and shared by social media
users for inferring their place of residence when it cannot be retrieved from
location and description metadata.

5. Combine the previous approach with the approach based on friendship net-
works into a content-based and network-based hybrid approach.

Next we explain every approach.

Approach based on metadata about locations of users. This approach
corresponds to the one implemented by Mislove et al. [2011]. The approach
makes use of the location metadata in the user profile, for example the
location attribute returned by the Twitter API when querying user details100.
Figure 8.14 shows the location attribute in an example Twitter user profile.
Users may express their location in different forms through this attribute,
such as geographical coordinates, or the name of a location (e.g. a city, a
country, a province, etc.). Therefore, a normalisation stage is required in
order to obtain a standard form for each location.

Figure 8.14: Example of user profile location metadata

For normalising the location this approach makes use of a geocoding API.
Our implementation uses Google Maps web services. This approach invokes
a method of the geocoding API that analyses a location and returns a
normalised tuple composed of a set of components that define the location,
including latitude, longitude, locality, and country, among others. For
example, if the request “santiago” is sent to the web service, the response will
be a tuple containing “Chile” as the country and “Santiago” as the locality,
among other location components. The complete list of components is
given in the API documentation101. Please note that this query does not
provide enough information for disambiguating locations, e.g. “santiago”
may refer to many geographical locations, including Santiago de Chile and
Santiago de Compostela (Spain). Therefore the precision of this approach
depends on how users describe their location when filling in their profiles.
For example, geographical coordinates will define locations accurately, while
combinations of city and country (e.g. “Guadalajara, Spain”) will enhance
disambiguation (although not completely). In addition, this approach does
not return a place of residence when users have not filled in the location
field of their social network profile. The approaches described next deal
with these precision and coverage issues. Figure 8.15 shows an example
output of the Google Geocoding API, while Listing 8.2 formalises the step
executed by this approach.

Figure 8.15: Example of an output of the Google Geocoding API

1 function ResidenceFromLocationData(user)
2 begin
3   return GeoCode(location(user))
4 end

Listing 8.2: Approach based on metadata about locations of users

100
http://dev.twitter.com/docs/api/1.1/get/users/show
101
http://developers.google.com/maps/documentation/geocoding
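
As an illustration of the approach based on location metadata, the sketch below normalises the free-text location of a profile through a pluggable geocoder. It does not call any real web service: the Location type, the toy gazetteer and the function names are hypothetical stand-ins for the geocoding API, chosen so that the example is self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Location:
    """Normalised location; a real geocoder returns more components."""
    country: str
    locality: Optional[str] = None


# Hypothetical stand-in for a geocoding service: a tiny hand-made gazetteer.
TOY_GAZETTEER = {
    "santiago": Location(country="Chile", locality="Santiago"),
    "guadalajara, spain": Location(country="Spain", locality="Guadalajara"),
}


def toy_geocode(raw: str) -> Optional[Location]:
    """Look up a free-text location, ignoring case and surrounding spaces."""
    return TOY_GAZETTEER.get(raw.strip().lower())


def residence_from_location_data(
    profile: dict, geocode: Callable[[str], Optional[Location]]
) -> Optional[Location]:
    """Return the normalised declared location, or None if it is missing."""
    raw = (profile.get("location") or "").strip()
    return geocode(raw) if raw else None
```

Passing the geocoder as a parameter keeps the coverage issue visible: an empty or unknown location yields None, which is the case the remaining approaches try to cover.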

Approach based on friendship networks. This approach exploits the inherent
homophily of social networks [McPherson et al., 2001] for obtaining the
place of residence of users. Listing 8.3 summarises the steps executed by
this approach, which are described next.

1  function ResidenceFromFriends(u)
2  begin
3    l ⇐ ResidenceFromLocationData(u)
4    if l = ∅ then
5      L ⇐ ∅
6      for each f in friends(u) do
7        L ⇐ L ∪ {GeoCode(location(f))}
8      end for
9      l ⇐ MostFrequentLocation(L)
10   end if
11   return l
12 end

Listing 8.3: Approach based on friendship networks

1. Firstly, we execute the previous approach for obtaining the place of
residence of a given user. If a result is obtained, the process finishes.
If not, the steps described next are executed (line 3).
2. Secondly, the friends of the user in her online community are collected.
After that, the location of each friend is obtained by using the geocoding
API. The normalised locations obtained are appended to a list
(lines 6-8).
3. Finally, the list obtained in the previous step is filtered iteratively,
selecting on each iteration the locations that contain the most frequent
value for a given location component, starting from the
country and finishing in the city, until there is only one location in the
set. First the locations whose country is the most frequent are selected,
then the locations whose first-order civil entity (e.g. a state in the USA or
an autonomous community in Spain) is the most frequent, and so forth.
The location that remains in the list after completing the iterations is
selected as the place of residence of the user. This approach ensures
that the most frequent regions in the friendship network of the user
are selected (line 9). Figure 8.16 shows an example of this process.
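
The iterative filtering of the last step can be sketched as follows. This is an illustrative reading of the description above rather than the original implementation: locations are assumed to be tuples ordered from the most general to the most specific component (country, first-level region, second-level region, city), the function name is hypothetical, and ties are broken by order of appearance.

```python
from collections import Counter


def most_frequent_location(locations):
    """Keep, level by level, the locations sharing the most frequent value.

    `locations` is a list of equally long tuples ordered from country down
    to city; None entries (friends without a usable location) are ignored.
    """
    candidates = [loc for loc in locations if loc is not None]
    if not candidates:
        return None
    for level in range(len(candidates[0])):
        # Most frequent value of this component among the remaining candidates.
        top_value, _ = Counter(loc[level] for loc in candidates).most_common(1)[0]
        candidates = [loc for loc in candidates if loc[level] == top_value]
        if len(set(candidates)) == 1:
            break
    return candidates[0]


friends = [
    ("Spain", "Madrid", "Madrid", "Madrid"),
    ("Spain", "Andalusia", "Seville", "Seville"),
    ("Spain", "Andalusia", "Seville", "Dos Hermanas"),
    ("Chile", "Santiago Metropolitan Region", "Santiago", "Santiago"),
    ("Spain", "Andalusia", "Seville", "Seville"),
]
winner = most_frequent_location(friends)
```

In this example Spain is retained first, then Andalusia, then the province of Seville, and finally the city of Seville. Note that the candidates are kept in a list rather than a set, since the frequencies would be lost if duplicated locations were merged.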

Figure 8.16: Example execution of table location filtering process

Approach based on descriptions about users. This approach exploits the
description published by users about themselves in their profiles for obtaining
their place of residence, for example the description attribute returned
by the Twitter API when querying a user profile. Listing 8.4 summarises
the steps executed by this approach, which are described next.

1  function ResidenceFromDescription(u)
2  begin
3    l ⇐ ResidenceFromLocationData(u)
4    if l = ∅ then
5      desc ⇐ description(u)
6      E ⇐ NamedEntities(desc, language(desc))
7      L ⇐ ∅
8      for each entity in E do
9        if isLocation(entity) then
10         L ⇐ L ∪ {GeoCode(entity)}
11       end if
12     end for
13     l ⇐ MostFrequentLocation(L)
14   end if
15   return l
16 end

Listing 8.4: Approach based on descriptions about users

1. Firstly, we execute the first approach (approach based on metadata
about locations of users). If a result is obtained, the process finishes.
Otherwise, the steps described next are executed (line 3).
2. Secondly, we obtain the user self-description attribute. Such attribute
usually consists of a sentence that has to be processed for extracting
the geographical locations mentioned in the text (line 5). Figure 8.17
shows the self-description attribute in an example Twitter user profile.
3. After obtaining the description of the user, we perform an entity
detection and classification process, by using an entity recognition and
identification component for the language detected by the Construct
Data Task for the user’s description (line 6). To do so, we make
use of Freeling, which provides an entity recognition and classification
module for English, Spanish, Galician and Portuguese. Such module
also implements multi-word detection, which allows recognising
locations named by multiple words (e.g. “United Kingdom”).
4. After that, we filter the named entities obtained in the previous step
taking only the entities that correspond to a location. Such entities are
sent one by one to the geocoding API for obtaining a set of normalised
locations (lines 8-12).
5. As several locations may be obtained in the previous step due to
multiple named entities contained in the description, once the normalised
locations have been obtained, we select only one location by following
the same selection approach described in step 3 of the approach
explained previously, returning one location as the place of residence of
the user (line 13).

Figure 8.17: Example of user profile description metadata

Approach based on content. This approach consists in mining the contents
published (e.g. tweets) and shared (e.g. retweets) by users to obtain their
place of residence. As performed by Cheng et al. [2010], this approach
extracts the location named entities from the user-generated content.
Listing 8.5 summarises the steps executed by this approach, which are described
next.

1  function ResidenceFromPosts(u)
2  begin
3    l ⇐ ResidenceFromDescription(u)
4    if l = ∅ then
5      L ⇐ ∅
6      for each text in publications(u) do
7        E ⇐ NamedEntities(text, language(text))
8        for each ent in E do
9          if isLocation(ent) then
10           L ⇐ L ∪ {GeoCode(ent)}
11         end if
12       end for
13     end for
14     l ⇐ MostFrequentLocation(L)
15   end if
16   return l
17 end

Listing 8.5: Approach based on content

1. Firstly, we attempt to execute the previous approach to obtain a
location from user profile metadata (line 3). If a result is obtained, the
process finishes with a location. Otherwise, the process continues in
the following step.

2. If the previous step does not return a location, we obtain the textual
contents published and shared by the user. We process each document
obtaining a list of normalised locations mentioned in the content shared
and produced by the user by applying the same entity recognition
technique as in the previous approach (lines 6-13). Figure 8.18 shows an
example extraction of the locations contained in the content published
by a Twitter user.

3. Finally, we select the place of residence of the user from the list of
locations obtained in the previous step, by applying the same location
selection criteria used for the previously described approaches (line
14).

Hybrid approach. This approach combines the previous ones. Listing 8.6
summarises the steps executed by this approach.

Figure 8.18: Example of location extraction from content

1  function ResidenceHybrid(u)
2  begin
3    l ⇐ ResidenceFromPosts(u)
4    if l = ∅ then
5      L ⇐ ∅
6      for each f in friends(u) do
7        L ⇐ L ∪ {ResidenceFromPosts(f)}
8      end for
9      l ⇐ MostFrequentLocation(L)
10   end if
11   return l
12 end

Listing 8.6: Hybrid approach
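
The chain of fallbacks followed by the description-based, content-based and hybrid approaches can be sketched in Python as shown next. This is an illustrative sketch under several assumptions: users are plain dictionaries with optional location, description, posts and friends keys; the entity recogniser and the geocoder are injected as functions and replaced here by toy stubs; and the selection of the most frequent location is simplified to a plain majority vote instead of the level-by-level filtering described earlier.

```python
from collections import Counter


def most_frequent(locations):
    """Simplified stand-in for the selection step: plain majority vote."""
    found = [loc for loc in locations if loc is not None]
    return Counter(found).most_common(1)[0][0] if found else None


def locations_in(text, named_entities, geocode):
    """Geocode every entity of type 'location' recognised in `text`."""
    found = (geocode(name) for name, kind in named_entities(text) if kind == "location")
    return [loc for loc in found if loc is not None]


def residence_from_description(user, named_entities, geocode):
    loc = geocode(user["location"]) if user.get("location") else None
    if loc is None:
        description = user.get("description") or ""
        loc = most_frequent(locations_in(description, named_entities, geocode))
    return loc


def residence_from_posts(user, named_entities, geocode):
    loc = residence_from_description(user, named_entities, geocode)
    if loc is None:
        found = []
        for text in user.get("posts", []):
            found.extend(locations_in(text, named_entities, geocode))
        loc = most_frequent(found)
    return loc


def residence_hybrid(user, named_entities, geocode):
    loc = residence_from_posts(user, named_entities, geocode)
    if loc is None:
        loc = most_frequent(
            residence_from_posts(friend, named_entities, geocode)
            for friend in user.get("friends", [])
        )
    return loc


# Hypothetical stubs standing in for the geocoding API and the entity recogniser.
TOY_GAZETTEER = {"madrid": "Madrid, Spain", "sevilla": "Seville, Spain"}


def toy_geocode(name):
    return TOY_GAZETTEER.get(name.strip(" .,").lower())


def toy_entities(text):
    words = [word.strip(".,") for word in text.split()]
    return [(word, "location") for word in words if word.lower() in TOY_GAZETTEER]
```

Each function first delegates to the cheaper approach and only falls back to the more expensive source of evidence (description, posts, friends) when no location has been found.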

8.5.3.2 Generate Test Design Task

As the technique that we propose does not perform learning from data, the whole
dataset has been used for evaluation purposes.

8.6 Technique for Detecting Gender


The goal of this technique is to identify the gender of social media users. To
achieve this goal we have carried out the Data Understanding, Data Preparation,
and Modelling activities, which are described in the following sections.
It is important to remark that the research conducted for defining this tech-
nique (definition, experiments, evaluation, etc.) has been co-authored with mem-
bers of the Acceso Group102 .

8.6.1 Data Understanding Activity

This activity consists in the ordered execution of the Collect Initial Data, Describe
Data and Explore Data tasks, which are explained next.

8.6.1.1 Collect Initial Data Task

We have collected a random sample consisting of authors who have written a
tweet in Spanish, as well as tweets that mention those authors, between 29th May
2012 and 27th March 2013, by using the Twitter API. A subset of the users
collected has been manually annotated with their corresponding gender
by a human annotator to create a gold standard.
Additionally this technique makes use of two lists of first names that have
been previously classified by gender (one list for male names, and one list for
female names). These lists have been extracted from a dataset published by the
Spanish National Institute of Statistics.

102
http://www.acceso.com

[Class diagram “Data Format for the Gender Identification Technique”.
Classes: foaf:Person, sioc:UserAccount, sioc:Post. Properties: sioc:account_of,
dcterms:references, foaf:givenName, foaf:gender, foaf:nick, sioc:content
(language tagged).]

Figure 8.19: Data format of the corpus used by the technique for detecting the
gender of social media users

8.6.1.2 Describe Data Task

The dataset used has a structure containing users annotated with their first names
and gender, as well as contents that mention them. The data schema is shown
in Figure 8.19. The classes and properties included in the diagram have been
already described in Chapter 5.

8.6.1.3 Explore Data Task

The dataset of users and tweets contains 69,261 users, and their corresponding
tweets written in Spanish, of which 1,509 users have been annotated with their
gender in the gold standard. The gold standard includes 558 female users, 621
male users and 330 neutral users. Neutral users are those accounts that belong
to an organisation or another kind of non-human agent.
The lists of male and female names contain 18,697 and 19,817 first names,
respectively.

8.6.2 Data Preparation Activity


During this activity, we have pre-processed the contents included in the gold
standard by applying the common tasks defined in Section 8.1.2. Specifically, we
have performed language identification for filtering users that do not have tweets
written in Spanish, and we have performed content normalisation. We have not
cleansed posts referring to particular brands during the Clean Data task, as
we consider all the content relevant for extracting mentions to users from them.
In addition, the lists of male and female names have been curated to create a
gender dictionary, from which unisex names have been excluded for classification
purposes, given the ambiguity that they introduce. After the curation process
(removing the first names that appear in both lists) the male first names list is
reduced to 18,391 entries and the female names list to 19,511. Some examples of
removed first names are “Pau”, “Loreto” and “Reyes”, as they are valid for both
males and females in Spain.
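
The curation of the name lists amounts to a set difference, as the following sketch shows. The function name and the lower-casing of names are assumptions made for illustration; the example names are a tiny made-up sample, not the actual lists.

```python
def build_gender_dictionary(male_names, female_names):
    """Build a name -> gender mapping, excluding names present in both lists.

    Returns the mapping together with the set of excluded (unisex) names.
    """
    male = {name.strip().lower() for name in male_names}
    female = {name.strip().lower() for name in female_names}
    unisex = male & female

    gender_of = {name: "male" for name in male - unisex}
    gender_of.update({name: "female" for name in female - unisex})
    return gender_of, unisex


gender_of, unisex = build_gender_dictionary(
    ["Pau", "Carlos", "Reyes"],
    ["Pau", "Reyes", "Lucía"],
)
```

Since every unisex name is removed from both lists, the two lists shrink by the same number of entries, which is consistent with the figures reported above (306 names removed from each list).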

8.6.3 Modelling Activity


The goal of this activity is to develop an automatic classifier for detecting the
gender of social media users. This activity consists in the ordered execution of
the following tasks:

1. The Select Modelling Technique task consists in selecting and describing a
modelling technique to be applied for creating the classifier.

2. The Generate Test Design task consists in generating a mechanism to test
the model for quality and validity.

Next, each of these tasks is described.

8.6.3.1 Select Modelling Technique Task

We have experimented with two different approaches for detecting the gender of
a given social media user. Such approaches are summarised next.

1. Look for the names declared in users’ profiles within dictionaries that asso-
ciate first names with their corresponding genders.

2. Exploit the linguistic gender concord that occurs in the Spanish language
when a name is not declared in the user profile.

Next we describe each approach.

Approach based on metadata about users. This approach exploits publicly
available metadata associated with the user profile. Such metadata may
include the user name, for example the name and the screen name
Twitter attributes. Figure 8.20 shows the name attribute in an example
Twitter user profile.

Figure 8.20: Example of user profile name metadata

The approach makes use of the gender dictionary created in the Data
Preparation Activity (see Section 8.6.2). Given a user account, its name metadata
is looked up in the dictionaries and, if a match is found, we propose the
gender associated with the dictionary where the first name has been found as
the gender of the user.
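
A minimal sketch of this lookup is shown next. The tokenisation of the name attribute and the decision to return the first token found in the dictionary are assumptions; the dictionary in the usage example is a made-up sample.

```python
import re


def gender_from_name(display_name, gender_of):
    """Return the gender of the first name token found in `gender_of`.

    `gender_of` maps lower-cased first names to "male" or "female".
    Tokens are maximal runs of letters, so digits, underscores and
    symbols in the name attribute are ignored.
    """
    for token in re.findall(r"[^\W\d_]+", display_name.lower()):
        if token in gender_of:
            return gender_of[token]
    return None


sample_dictionary = {"lucía": "female", "carlos": "male"}
```

Names that do not contain any dictionary entry, such as those of organisations, yield None and are left to the content-based approach.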

Regarding multilingualism, the gender dictionary is a language-dependent
resource. However, there are many resources readily available on the Web
for easily populating new dictionaries, such as the population censuses
published as open data by many countries.

Approach based on content. This approach exploits the information provided
by mentions to users. For example, in the following tweet

I’m going to visit my uncle @Daureos in Florida,

the author is providing explicit information about the gender of the user
mentioned. We know that @Daureos is male because of the word “uncle”
written before the user identifier. The same happens in English with other
family relationships, such as mother or father.
We propose an approach for the Spanish language that performs a depen-
dency parsing of the text with the aim of determining the gender of the
terms related with the user mentioned. Therefore, for each tweet in which
the user is mentioned, we attempt to estimate the gender of the user. Note
that not all mentions to users provide information for estimating their gen-
ders (e.g. “via @user” and “/cc @user” at the end of the tweet). The
dependency parser used is TXALA [Atserias et al., 2005].
The steps executed by this technique are the following:

1. Firstly, we execute the technique based on user name metadata
described previously. If a gender is obtained, the process finishes.
2. If a gender is not identified in the previous step, we obtain all the posts
that mention the user.
3. For each post, we perform a dependency parsing. Figure 8.21 shows
the dependency tree obtained from a tweet that mentions a given user.
Once the dependency tree has been obtained, we assign a gender to the
user for the post analysed according to the following heuristics:
(a) If the gender of the term in the parent node of the branch where
the user is mentioned is male or female, we consider that the user
is male or female accordingly (e.g. “Mi tío Daureos”).
(b) If any of the child nodes of the node corresponding to the user
mention corresponds to a term with a specific gender, we consider
that the gender of the user corresponds to the gender of such term
(e.g. “Vio a Daureos enfermo y triste”).
(c) If there is a noun adjunct as the predicate of an attributive
sentence where the user is the subject, we assign the gender of the
noun adjunct as the gender of the user (e.g. “Daureos es
trabajador”).
4. Finally, we select the gender that has been assigned most often to the
user across the posts analysed.
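
Heuristics (a) and (b), together with the final vote, can be sketched as follows. The sketch does not use a real dependency parser: trees are assumed to be nested dictionaries with form, tag and children keys, gender is read from EAGLES-style tags under the assumption that it is the third character for nouns (e.g. NCMS000) and the fourth for adjectives (e.g. AQ0MS0), and heuristic (c) is omitted. The tree edges in the usage example are also an assumption.

```python
from collections import Counter


def gender_from_tag(tag):
    """Read the gender from an EAGLES-style tag, or return None."""
    if tag.startswith("N") and len(tag) > 2:
        code = tag[2]
    elif tag.startswith("A") and len(tag) > 3:
        code = tag[3]
    else:
        return None
    return {"M": "male", "F": "female"}.get(code)


def gender_in_post(node, mention, parent=None):
    """Apply heuristics (a) and (b) to the node whose form equals `mention`."""
    if node["form"] == mention:
        related = ([parent] if parent is not None else []) + node.get("children", [])
        for other in related:  # the parent is checked first, then the children
            gender = gender_from_tag(other["tag"])
            if gender:
                return gender
        return None
    for child in node.get("children", []):
        gender = gender_in_post(child, mention, node)
        if gender:
            return gender
    return None


def gender_from_mentions(trees, mention):
    """Majority vote over the genders obtained from each post."""
    votes = Counter(g for g in (gender_in_post(t, mention) for t in trees) if g)
    return votes.most_common(1)[0][0] if votes else None


# "Felicidades a mi cuñado Calamonte ..." (edges assumed for illustration)
congratulation = {"form": "Felicidades", "tag": "NP00SP0", "children": [
    {"form": "a", "tag": "SPS00", "children": [
        {"form": "cuñado", "tag": "NCMS000", "children": [
            {"form": "mi", "tag": "DP1CSS"},
            {"form": "Calamonte", "tag": "NP00SP0"},
        ]},
    ]},
]}
# "via Calamonte": a mention that carries no gender information
attribution = {"form": "via", "tag": "SPS00", "children": [
    {"form": "Calamonte", "tag": "NP00SP0"},
]}
```

In the first tree the parent of the mention is the masculine noun “cuñado”, so the post votes for male; the second tree contributes no vote.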

[Dependency tree; the nodes are listed below, the edges between them are not
recoverable from the extracted text]
func       synt       form                     lemma                    tag
top        sn         Felicidades              felicidades              NP00SP0
sp-mod     grup-sp    a                        a                        SPS00
sn-mod     sn         CM_de_El_Corte_Inglés    cm_de_el_corte_inglés    NP00V00
obj-prep   sn         cuñado                   cuñado                   NCMS000
adj-mod    s-a-ms     nuevo                    nuevo                    AQ0MS0
term       F-term     .                        .                        Fp
espec      espec-ms   mi                       mi                       DP1CSS
sn-mod     w-ms       Calamonte                calamonte                NP00SP0

Figure 8.21: Dependency tree obtained from a tweet that mentions a user

8.6.3.2 Generate Test Design Task

As the technique that we propose does not perform learning from data, the whole
dataset has been used for evaluation purposes.
As described in Section 8.6.1.1 the whole dataset is used for measuring the
coverage of the technique (i.e. the proportion of users that can be annotated with
a gender), and a subset that has been manually annotated with gender is used
for measuring the precision and recall.

8.7 Evaluation

This section evaluates the techniques for the segmentation of consumers from
content presented in this chapter. Section 8.7.1 describes the metrics used for
evaluating the techniques, while Section 8.7.2 presents the evaluation results.

8.7.1 Evaluation Metrics

For evaluating the techniques for segmentation of consumers from social media
content, we made use of a set of metrics commonly used in machine learning for
evaluating supervised classifiers. In this context:

• TP is the number of true positive decisions. It indicates the number of
instances that have been classified as belonging to a particular class, and
actually belong to such class.

• TN is the number of true negative decisions. It indicates the number of
instances that have not been classified as belonging to a particular class,
and actually do not belong to such class.

• FP is the number of false positive decisions. It indicates the number of
instances that have been classified as belonging to a particular class, and
actually do not belong to such class.

• FN is the number of false negative decisions. It indicates the number of
instances that have not been classified as belonging to a particular class,
and actually belong to such class.

Taking into account the TP, TN, FP, and FN indicators, the metrics used
for evaluating the performance of the techniques are
described next.
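
As an illustration of the four indicators, the sketch below counts them for one class by treating every other class as negative. The function name and the sample labels are hypothetical and are not taken from the evaluation corpus.

```python
from typing import Sequence, Tuple


def confusion_counts(actual: Sequence[str], predicted: Sequence[str],
                     target: str) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for one class, treating the rest as negative."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == target and a == target:
            tp += 1      # classified as the class, and belongs to it
        elif p != target and a != target:
            tn += 1      # not classified as the class, and does not belong
        elif p == target and a != target:
            fp += 1      # classified as the class, but does not belong
        else:
            fn += 1      # not classified as the class, but belongs to it
    return tp, tn, fp, fn


# Hypothetical labels, for illustration only.
actual = ["awareness", "purchase", "awareness", "evaluation"]
predicted = ["awareness", "awareness", "purchase", "evaluation"]
counts = confusion_counts(actual, predicted, "awareness")
print(counts)
```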

8.7.1.1 Accuracy

The Accuracy metric [Kohavi and Provost, 1998] measures the percentage of
correct decisions. Equation 8.1 shows its definition.

\[
\mathit{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{8.1}
\]

The range of this metric is [0..1]. We consider values over 0.85 to be
satisfactory for this metric.

8.7.1.2 Recall

The Recall metric [Kowalski, 1997] (a.k.a. sensitivity or hit rate) is the true
positive rate. Equation 8.2 shows its definition.
\[
\mathit{Recall} = \frac{TP}{TP + FN} \tag{8.2}
\]

The range of this metric is [0..1]. For the evaluations of this section, we
consider satisfactory values for this metric those that are over 0.30.

8.7.1.3 Precision

The Precision metric [Kowalski, 1997] is defined as the positive predictive value.
Equation 8.3 shows its definition.

\[
\mathit{Precision} = \frac{TP}{TP + FP} \tag{8.3}
\]

The range of this metric is [0..1]. For the evaluations of this section, we
consider satisfactory values for this metric those that are over 0.65.

8.7.1.4 F-measure

The F-measure metric [Larsen and Aone, 1999] combines the precision and recall
metrics offering an overall vision of how the technique behaves. It is defined as
the harmonic mean of precision and recall. Equation 8.4 shows its definition.

\[
F_1 = \frac{2 \cdot \mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \tag{8.4}
\]

The range of this metric is [0..1]. For the evaluations of this section, we
consider satisfactory values for this metric those that are over 0.41, taking into
account the minimum Precision and Recall satisfactory values.
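
The metrics of this section can be written down directly from the four indicators. The sketch below is illustrative only (the function names are ours); it also shows how the 0.41 threshold for the F-measure follows from the minimum satisfactory precision (0.65) and recall (0.30).

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of correct decisions (Equation 8.1)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def recall(tp: int, fn: int) -> float:
    """True positive rate (Equation 8.2)."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp: int, fp: int) -> float:
    """Positive predictive value (Equation 8.3)."""
    return tp / (tp + fp) if tp + fp else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (Equation 8.4)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Threshold derived from the minimum satisfactory precision and recall.
threshold = f_measure(0.65, 0.30)
print(round(threshold, 2))
```

The guards against empty denominators are a design choice of this sketch: a class with no positive decisions yields 0.0 instead of raising an error.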

8.7.2 Evaluation Results


This section presents the results of the evaluations performed on the techniques
described in this chapter. The section is structured as follows:

• Section 8.7.2.1 presents the evaluation results obtained for the technique
for detecting Consumer Decision Journey stages.

• Section 8.7.2.2 presents the evaluation results obtained for the technique
for detecting Marketing Mix attributes.

• Section 8.7.2.3 presents the evaluation results obtained for the technique
for detecting emotions.

• Section 8.7.2.4 presents the evaluation results obtained for the technique
for detecting the place of residence of social media users.

• Section 8.7.2.5 presents the evaluation results obtained for the technique
for detecting the gender of social media users.

8.7.2.1 Technique for Detecting Consumer Decision Journey Stages

We have evaluated our technique for detecting Consumer Decision Journey stages
from user-generated content. Overall, the textual classification achieves a
precision of 0.74 and a recall of 0.35, which gives an F-measure of 0.48.
Figures 8.22 and 8.23 show the results by category and language.
In general, the rules achieved satisfactory results in terms of precision, especially
in the awareness, evaluation, and purchase stages for English, and awareness for
Spanish. Recall was lower than precision because the rules were designed to be
very specific in order to minimise the number of false positives.
Overall, the stage where we obtained the best results is awareness, particularly for
Spanish.


Figure 8.22: Accuracy of the Consumer Decision Journey classifier for English

We also report the classification results across the different business
sectors (Figure 8.24) in order to evaluate how the difficulty of the classification
depends on the domain. We found that banking and beverages were the business
sectors where we obtained the best results, with the highest F-measure values.
The distinction among the different stages of the Consumer Decision Journey
is not always clear, due to the ambiguity of short texts. Frequently, belonging to
one stage or another is strongly related to the type of product, and the
differentiation among stages can only be performed by applying extra-linguistic knowledge.
Sentences such as “I like this beer” and “I like this car” were frequently found
in the corpus. In the first case, it is very likely that the user has already tried
the product (postpurchase experience), since it would be strange for a customer
to state that he likes a drink (or some food) without actually tasting it. In the
second case, instead, the actual consumption of the product is less probable, and
the customer can like the car just because of its television advertisement or its
design, for example. These kinds of ambiguities are especially frequent between


Figure 8.23: Accuracy of the Consumer Decision Journey classifier for Spanish

evaluation and postpurchase experience, and the linguistic patterns are not able
to capture the differences between them since they are expressed through the
same linguistic expression.
A further classification of products depending on domain-dependent features
could be useful in order to discriminate between evaluation and postpurchase
experience in these types of ambiguous cases.
Finally, there are multiple geographic varieties of English and Spanish that
present lexical differences. This makes pattern identification more difficult,
since lexical units differ from one variety to another and are especially hard
to detect. Further work in this line (i.e. improving the normalisation process by
transforming lexical units to a canonical form) could help to improve the recall
results.


Figure 8.24: Accuracy of the Consumer Decision Journey classifier by sector

8.7.2.2 Technique for Detecting Marketing Mix Attributes

We have also evaluated how the decision tree classifiers perform in classifying
each short text according to the Marketing Mix element (or elements)
expressed. We have used 10-fold cross-validation for evaluating the
developed classifiers. We have obtained an overall precision of 0.75 and an overall
recall of 0.37, which gives an F-measure of 0.50. The results obtained in this task for
English and Spanish can be seen in Figures 8.25 and 8.26, respectively.
As observed in the figures, recall is generally low (except for Advertisement),
ranging from 0.04 to 0.80 for Spanish and from 0.09
to 0.83 for English. There appears to be a direct relation between the number
of texts of the positive class used to train the model and the corresponding
results in terms of recall and precision. For example, in Spanish the classifier
trained with the smallest number of texts was the one for the positive
class of Customer Service, for which we only had 85 short texts. Its
classification results are 0.04 and 0.38 for recall and precision, respectively. In the same


Figure 8.25: Accuracy of the Marketing Mix classifier for English

line are the results for English; one of the Marketing Mix elements trained with
fewer texts of the positive class (238) is Point of Sale, and the results obtained
are accordingly the lowest ones: a recall of 0.09 and a precision of 0.48. We can observe
the same relation in the models trained with a larger number of texts; both in
Spanish and English, the Advertisement classifier was trained with many positive
examples, and this class achieved very good results in terms of recall
as well as precision (0.80 and 0.83 for recall and 0.88 and 0.93 for precision, for
Spanish and English, respectively).
It is also interesting to see how some Marketing Mix elements are much more
difficult to identify than others. For example, we can observe that the element
Quality is very hard to classify, even increasing the number of texts used to train
the model. In Spanish the number of texts used as positive examples is 371
and we obtained 0.18 and 0.56 of recall and precision respectively. However, in
English, where the model was trained with a larger number of texts as positive
examples (1,046 texts), the results are in line with those obtained for Spanish:


Figure 8.26: Accuracy of the Marketing Mix classifier for Spanish

0.13 of recall and 0.61 of precision. These differences in difficulty among elements
are due to the dispersion of the vocabulary used to talk about some Marketing
Mix elements. For example, we observed that customers could talk about Quality
by referring to comfort (in the Automotive industry, for example), to
security (in Banking, for instance) or to taste (for Food or Beverages).
Therefore, the reference to Quality can be made through a great variety of topics
that are domain dependent, and thus the reference to this element is much
more varied than the reference to other Marketing Mix elements such as Price or
Advertisement. The linguistic cues are more dispersed, and thus the classifier finds
it more difficult to relate a word to a specific class.
Although the results should be improved, especially in terms of recall, we consider
that, as a first attempt to automatically classify and filter user-generated content
from social media in terms of Marketing Mix elements, the results obtained
are very encouraging, and very satisfactory for elements such as Advertisement.
Finally, as happened with the technique for identifying Consumer Decision
Journey stages, the language varieties affect the precision results. For example,

the term “commercial” in American Spanish means “advertising spot”, while in
European Spanish it means “sales person”. While the former meaning should be
associated with the Advertising category, the latter meaning should be associated
with the Point of Sale category.

8.7.2.3 Technique for Detecting Emotions

We have evaluated our system against a set of randomly chosen texts that cover
58% of the corpus, as described in Section 8.4.2.2. The overlap coefficient between
the training set and the evaluation set was 0.14, which is small enough for us to
consider the results of the evaluation reliable.
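
For reference, the overlap coefficient between two sets is commonly defined as the size of their intersection divided by the size of the smaller set. The sketch below assumes that definition and hypothetical data; the items actually compared in our experiment are those described in Section 8.4.2.2.

```python
def overlap_coefficient(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical training and evaluation items, for illustration only.
training = {"me encanta", "no funciona", "lo recomiendo", "muy caro"}
evaluation = {"me encanta", "llega tarde", "buen precio"}
value = overlap_coefficient(training, evaluation)
print(round(value, 2))
```

A value close to 0 means that the two sets share few items, whereas a value of 1 means that the smaller set is fully contained in the larger one.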
Figure 8.27 shows the precision and recall obtained for each emotion of our
conceptual framework. We can see the number of texts classified under each
emotion both by our system and by the human annotator. The overall recall is
49.73% and the precision is 71.78%. If we used the F-measure as an indicator of
the best results, these would correspond to satisfaction and dissatisfaction. This
fact is not surprising, since the majority of the texts expressing sentiment in the
corpus and, therefore, in the training corpus belong to one of these two categories.
Figure 8.28 shows the precision and recall obtained for each domain and Fig-
ure 8.29 shows the precision and recall obtained for each type of media.
We have also compared our results to the ones provided by an existing com-
mercial tool for detecting polarity of opinions, owned by Havas Media Group.
Such system is also rule-based and the rules follow a similar approach, although
the antecedent only supports components made of lemma and part-of-speech and
the consequent only considers one category that captures the negative and the
positive opinions, instead of the four ones (reflecting eight sentiments) of our
work.
An important difference between the two experiments is the size of the corpora
used for evaluation: the corpus used for evaluating the polarity classifier contained
3,705 texts while ours contained 15,428 texts.
The polarity system has a recall of 20.82% (lower than ours, 49.73%) and a
precision of 84.85% (higher than ours, 71.78%). However, when we reduce our four
categories to one (putting together negative polarities on one side and positive


Figure 8.27: Accuracy of the emotions classifier

ones on the other) the results show a recall of 58.48% and a precision of 84.42%.
Thus, under such circumstances, we can affirm that our system achieves a similar
precision to and a better recall than the previous system, while being based on a
finer-grained classification.


Figure 8.28: Accuracy of the emotions classifier by sector



Figure 8.29: Accuracy of the emotions classifier by social media type

Approach                                      Accuracy
Based on metadata about locations of users    0.81
Based on friendship networks                  0.86
Based on descriptions about users             0.81
Content-based                                 0.81
Hybrid                                        0.81

Table 8.9: Accuracy of the place of residence identification approaches

8.7.2.4 Technique for Detecting Place of Residence

We have evaluated the five different approaches implemented by this technique
against the evaluation data set described in Section 8.5.3.2. The evaluation results
are shown in Table 8.9. All the approaches achieve the same accuracy (0.81), with the
exception of the one based on friendship networks, which improves the accuracy
to 0.86, outperforming the approaches described in the State of the Art, which
achieve accuracies from 0.51 to 0.71.
Regarding the approaches that perform named entity recognition for detect-
ing the locations included in the description of user profiles, or in the content
published and shared by those users, we have evaluated this step by using the
training set published by the Concept Extraction Challenge of the #MSM2013
Workshop [Basave et al., 2013]. Such training set consists of a corpus of 2,815
micro-posts written in English. The precision obtained is 0.52, while the recall is
0.43 (F1 = 0.47).

8.7.2.5 Technique for Detecting Gender

We have evaluated the coverage (i.e. proportion of users classified) of the two
gender recognition approaches described by this technique against the whole eval-
uation data set described in Section 8.6.3.2. The approach based on profile meta-
data has been able to classify 46,030 users (9,284 female users and 36,746 male
users), achieving a coverage of 66% of the corpus. By contrast, the approach
based on mentions to users has classified 46,396 users (9,386 female users and
37,010 male users), improving the coverage up to 67%. Table 8.10 compares the
coverage of both approaches.
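
As a worked check of the coverage figures, the sketch below recomputes them from the counts reported in Table 8.10, taking coverage as the number of classified users divided by the number of all users. The function name is ours, and the formula is our reading of the definition of coverage given above.

```python
def coverage(female: int, male: int, not_identified: int) -> float:
    """Proportion of users for which a gender could be assigned."""
    classified = female + male
    return classified / (classified + not_identified)


# Counts reported in Table 8.10.
user_names = coverage(9_284, 36_746, 23_231)
mentions = coverage(9_386, 37_010, 22_864)
print(f"{user_names:.0%} {mentions:.0%}")
```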
In addition, we have checked the automatic classification with respect to the

Approach            Female        Male           Not Identified
User Names          9,284 (13%)   36,746 (53%)   23,231 (34%)
Mentions to Users   9,386 (14%)   37,010 (53%)   22,864 (33%)
Coverage Gain       +102          +264           (Total Gain = +1%)

Table 8.10: Coverage of the gender recognition approaches

  


  
 
  

     

          

Figure 8.30: Performance of the gender recognition approaches

gold standard, obtaining an overall accuracy of 0.9 for the approach based on
user names, and of 0.84 for the approach based on mentions to users. By gender,
for the approach based on user names, the precision obtained is 0.98 for male
users and 0.97 for female users, while the recall is 0.8 and 0.87, respectively. For
the approach based on mentions to users, the precision obtained is 0.8 for male
users and 0.79 for female users, while the recall is 0.85 and 0.95, respectively.
Therefore, the approach based on mentions to users achieves a smaller precision,
but increases the recall with respect to the approach that only makes use of user
names. Figure 8.30 compares the performance of the two approaches.
As explained in Section 8.6.2, we perform automatic language identification
during the Clean Data task for filtering out users that do not write in Spanish. The
false positives introduced by the language identification component, whose accuracy
is 0.9302, may cause the inclusion in the evaluation corpus of authors that
are not Spanish speakers, penalising the recall of the method.
Table 8.11 shows the confusion matrix for the approach based on mentions

                 Predicted class
Actual class     Male    Female    No gender
Male             530     42        49
Female           10      528       20
No gender        130     97        103

Table 8.11: Confusion matrix with the results of the approach based on mentions
to users.

to users. Users manually annotated as “no gender” correspond to non-personal
Twitter accounts (e.g. a brand or a corporation), while those automatically
classified as “no gender” are the users for which the algorithm was not able
to identify a gender. The confusions occur mainly between the male and
female classes and the residual class. As the table reflects, there is not a significant
number of confusions between male and female users (i.e. male users classified
as female and vice versa). Most of the errors correspond to male or female users
that could not be classified by the gender recognition technique.
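
The per-gender precision and recall of this approach can be recomputed from the confusion matrix in Table 8.11. The sketch below is illustrative (the variable and function names are ours); rows are actual classes and columns are predicted classes.

```python
labels = ["male", "female", "no gender"]

# Rows: actual class. Columns: predicted class (values from Table 8.11).
matrix = [
    [530, 42, 49],
    [10, 528, 20],
    [130, 97, 103],
]


def class_precision(m, i: int) -> float:
    """Correct predictions of class i over all predictions of class i."""
    return m[i][i] / sum(row[i] for row in m)


def class_recall(m, i: int) -> float:
    """Correct predictions of class i over all actual members of class i."""
    return m[i][i] / sum(m[i])


scores = {
    label: (round(class_precision(matrix, i), 2),
            round(class_recall(matrix, i), 2))
    for i, label in enumerate(labels[:2])
}
print(scores)
```

The values obtained are close to those reported above for the approach based on mentions to users (about 0.8 precision for both genders, and recall of 0.85 and 0.95).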

It is difficult to make a direct comparison of our technique with the previous
works described in the State of the Art (Section 2.6.6), since our classifier has
been designed for the Spanish language and the other ones have been trained and
evaluated with corpora of English speakers. If we ignore this fact, the technique
developed by Mislove et al. [2011] identifies a gender for 64.2% of the users,
while ours achieves a coverage of 66.45%. Additionally, we have achieved a lower
accuracy than Burger et al. [2011], who achieved 0.92. However, the technique
proposed by Burger et al. [2011] requires more than 100,000 users in the training
data set (together with the tweets authored by them), while our technique does
not require training a classifier as it relies on linguistic knowledge, avoiding the
cost of corpus annotation by humans.

Regarding the distributions by gender, Mislove et al. [2011] identified 71.8%
of the U.S. population that uses Twitter as male. In our case, we identified
79.8% of the users as male, confirming that Spanish speakers on Twitter were
also predominantly male within the period of the experiment (May 2012 - March
2013).

8.8 Validation of Hypotheses
The evaluation performed on our approach for identifying Consumer Decision
Journey stages in user-generated content validates Hypothesis 3, since our
technique is able to classify texts along the different phases with an acceptable
accuracy, and with precision results similar to those of the works on identification of wishes.
Consequently our technique is able to approximate distributions of consumers
(i.e. the authors of the texts) in the exact moment of the Consumer Decision
Journey process.
The evaluation performed on our approach for detecting Marketing Mix
attributes in user-generated content validates Hypothesis 4, since our technique is
able to classify texts according to the Marketing Mix framework with an acceptable
accuracy, and consequently is able to approximate distributions of consumers
(i.e. the authors of the texts) that refer to the distinct Marketing Mix elements.
The evaluation performed on our approach for detecting emotions in
user-generated content validates Hypothesis 5, since our technique is able to identify
expressions of satisfaction, dissatisfaction, trust, fear, love, hate, happiness, and
sadness within user-generated content with an acceptable accuracy, and consequently
is able to approximate distributions of consumers (i.e. the authors of the
texts) that express the different kinds of sentiment.
Regarding place of residence detection, the evaluation performed validates
Hypothesis 6, since the most accurate approach is the one based on friendship
networks. Therefore, the homophily that characterises social networks can be
exploited for determining the place of residence of social media users. The results
obtained show that the social network is a valuable source of information for
obtaining the socio-demographic attributes of single users.
Finally, the evaluation results regarding gender detection show that the ap-
proach that exploits the gender concord existing in the contents that explicitly
mention social media users when a gender cannot be retrieved from user’s meta-
data improves the coverage of the gender identification technique. This validates
Hypothesis 7.

Chapter 9

CONCLUSIONS AND FUTURE WORK

Social media has been at the centre of attention of advertising agencies as it has
come to form part of the media addressed by marketing activities. Advertising
agencies have been exploring possible ways to use this new media as a mechanism
for producing word-of-mouth. Therefore, social media is being considered as a
platform in viral marketing strategies. One of the expected benefits of this
thesis is to provide marketers and business experts with tools for understanding
the principal functions of social media from a marketing point of view, that is,
for disentangling the effect that social media has on consumer behaviour during
the various stages of the decision making process.
As the main conclusion, the techniques described in this thesis can be
implemented within applications that aim at observing consumers in social media
and extracting socio-demographic and psychographic information from them.
We have defined an ontology network that structures the information published
in social media that is useful for marketing analysis purposes, and we have
characterised such media by analysing the morphosyntactic characteristics
of the content published on them. Additionally, we have provided a technique
for uniquely identifying social media users using the fingerprint of their devices,
regardless of the changes that occur frequently in these fingerprints. We also have
provided a collection of techniques for obtaining psychographic segmentations
of consumers in terms of their position in the purchase funnel, the marketing
attributes of the brands they refer to, and their sentiment about these brands.
Finally, we have described a set of techniques for identifying two sociodemo-
graphic attributes from social media users, i.e. their place of residence and their
gender.
Next, we detail the conclusions for each of the contributions of this thesis to
the State of the Art.

9.1 Social Media Data Model for Consumer Analytics
We have developed an ontology that models information that can be extracted
from social media about consumers. Such information can be directly retrieved
from social media data or inferred from users’ activity and opinions.
By combining and structuring the directly and indirectly retrieved data we are
able to store enriched consumer-related information in a graph-based database
to be analysed in different manners by marketing professionals.
As an example, through a CRM connection (e.g. implemented by a plugin
of a CRM system) this information could be delivered to standard business
applications and be accessible for daily business decisions.

9.2 Morphosyntactic Characterisation of Social Media Contents
Natural language processing (NLP) techniques are a key piece for analysing the
content published in social media. Social media content presents the characteristics
of non-editorially controlled media, as opposed to the content published in
traditional media. In this context, social media communication has moved from
daily publications to real-time interactions. Thus, when applying NLP techniques
to the user-generated content published in social media, we find issues of text
quality that hinder the application of such techniques.

Moreover, if we analyse social media sources separately, we find that there are
differences in language styles, expressiveness degrees, and levels of formalism that
are conditioned by factors such as content length or publication pace. Namely,
text length varies from short sentences posted in Twitter to medium-size articles
published in blogs, and very often the text published in social media contains
misspellings, is completely written in uppercase or lowercase letters, or is composed
of set phrases, to mention a few characteristics that make social media content
analysis challenging. Specifically, we have demonstrated that the distribution of
part-of-speech categories varies across different social media types. Since
part-of-speech tagging is a previous step for many NLP techniques, the performance of
such techniques may vary according to the social media source from which the
user-generated content has been extracted.

9.3 Technique for Unique User Identification Based on Evolving Device Fingerprint
Unique user identification is an essential activity in order to obtain accurate
results from Web Analytics, since many Web Analytics metrics depend on mea-
suring unique visitors. The most widespread technique for uniquely identifying
users is the one based on cookies. However, such technique is not completely
effective because cookies can be removed, disabled, or not supported.
Recently, a new technique for user identification has been proposed. Such
technique consists of capturing the fingerprint of the machine that the user uses
for navigating the Web. One drawback of this technique is that such fingerprint
changes over time, so that the registration of fingerprints must be accompanied
by a mechanism for detecting its temporal evolution. In this thesis, we have
described an algorithm that allows clustering fingerprints that correspond to the
same user, regardless of fingerprint evolution. The evaluation results demonstrate
the effectiveness of the algorithm, and improve on previous results.
The algorithm proposed can be used instead of the technique based on cookies,
or as a complement to this technique for regenerating cookies when such cookies
are removed. If the algorithm is used as an alternative to the technique based on

cookies, every time an activity record is registered, the fingerprint obtained must
be compared to each cluster of browser fingerprints generated before, because of
the linear complexity of the algorithm. In contrast, if the algorithm is used as a
mechanism for regenerating cookies, the fingerprint must be compared with existing
clusters only when the cookie is deleted, which significantly reduces the computational
resources needed for identifying users and further increases the accuracy of
unique user identification. Moreover, this variant could be supplemented with
the use of Internet Explorer data persistence and web storage capabilities.
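
The difference between the two modes of use can be sketched as follows. This is only an illustration under stated assumptions: the similarity measure, the 0.75 threshold, the attribute names, and the representation of clusters as a list of signatures are hypothetical and do not reproduce the clustering algorithm described in this thesis.

```python
from typing import Dict, List, Optional

Fingerprint = Dict[str, str]


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Share of attributes with equal values (illustrative measure only)."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)


def match_cluster(fp: Fingerprint, signatures: List[Fingerprint],
                  threshold: float = 0.75) -> Optional[int]:
    """Linear scan: compare the fingerprint with every cluster signature."""
    best_index, best_score = None, threshold
    for index, signature in enumerate(signatures):
        score = similarity(fp, signature)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index


def identify(cookie_id: Optional[int], fp: Fingerprint,
             signatures: List[Fingerprint]) -> int:
    """Use the cookie when present; fall back to fingerprints otherwise."""
    if cookie_id is not None:       # cheap path: no comparisons needed
        return cookie_id
    index = match_cluster(fp, signatures)
    if index is None:               # unseen browser: open a new cluster
        signatures.append(dict(fp))
        index = len(signatures) - 1
    return index                    # value used to regenerate the cookie


# Usage: a browser whose cookie was deleted and whose user agent changed.
signatures = [{"ua": "Firefox/38", "tz": "+1", "lang": "es",
               "screen": "1920x1080"}]
changed = {"ua": "Firefox/39", "tz": "+1", "lang": "es",
           "screen": "1920x1080"}
user = identify(None, changed, signatures)
print(user)
```

In this sketch the linear scan over the cluster signatures runs only when no cookie is available, which is the saving that the cookie-regeneration variant relies on.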
Our algorithm improves the accuracy of unique browser identification over
previous approaches, making it possible to count unique visitors effectively and
thus to measure the impact of digital advertisement campaigns better in environments
where existing techniques fail (e.g. mobile devices or smart TVs which do not support
cookies). Therefore, the algorithm measures the audience of on-line campaigns
effectively regardless of the device and security restrictions, which enhances decision
support. Previous approaches were temporally constrained because of cookie
deletion or fingerprint attribute changes, and reporting periods were affected
by such temporal constraints. Our approach enables tracking user activity for
longer, since it allows recovering from fingerprint changes (or cookie deletions
when combined with the user identification technique based on cookies). Thus,
website or advertisement campaign monitoring periods can be longer without
losing accuracy. In addition, advertisers will benefit from more precise
audience measures that avoid counting the same browser more than once. This will
have a positive impact on media planning optimisation, allowing better budget
distribution over different online media and enhancing performance metrics and user
profiling.
The algorithm can be executed as a batch process or in real-time as new
fingerprints arrive at the system. A real-time version of the algorithm would
require optimisations to reduce the number of comparisons between fingerprint
and cluster signatures, reducing processing time.
A disadvantage of the technique described in this document is the amount of
additional JavaScript code to be added to web pages in order to get some
fingerprint attributes. Such scripting code could prevent certain advertising media from
adopting the technique. Nevertheless, importing external JavaScript definitions

reduces the code to be inserted in web pages to one line.
Finally, with respect to the ethical aspects of user tracking, Sison and J.
[2005] discuss issues relating to privacy in the on-line advertising domain. It is
important to remark that the aim of this research is not to invade user privacy,
but to count uniquely the users that visit a given website. Thus, we are not
interested in personal data about users, but in accurate Web Analytics measures
at the aggregated level. Moreover, browser fingerprinting does not pose a
threat to user privacy when appropriate anonymization techniques are applied,
for instance, transforming data by applying cryptographic functions, such as SHA-1
[Eastlake and Jones, 2001], to fingerprint attribute values. In any case, technologies
implementing our technique, and other similar ones, should follow policies such as
“Do Not Track” [Mayer et al., 2011], which enables users to opt out of tracking by
websites they do not visit, including analytics services and advertising networks.
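
As an illustration of the anonymization step mentioned above, the sketch below replaces each fingerprint attribute value by its SHA-1 digest using Python's standard `hashlib` module. The attribute names and values are hypothetical.

```python
import hashlib
from typing import Dict


def anonymise(fingerprint: Dict[str, str]) -> Dict[str, str]:
    """Replace each attribute value by the hex SHA-1 digest of the value."""
    return {
        name: hashlib.sha1(value.encode("utf-8")).hexdigest()
        for name, value in fingerprint.items()
    }


# Hypothetical fingerprint attributes, for illustration only.
raw = {"user_agent": "Mozilla/5.0", "timezone": "+1"}
hashed = anonymise(raw)
print(hashed["timezone"])
```

Equal values always map to equal digests, so hashed fingerprints can still be compared attribute by attribute without storing the original values.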

9.4 Techniques for Segmentation of Consumers from Social Media Content
This section presents the conclusions regarding the techniques provided by this
thesis for segmenting consumers according to the contents they publish and share
in social media.
Future lines of work include experimenting with the detection of more demo-
graphic and psychographic user characteristics which are relevant to the market-
ing and communication domains, including: age, political orientation and inter-
ests, among others.

9.4.1 Technique for Detecting Consumer Decision Journey Stages
We have presented a novel technique for analysing user-generated texts in terms
of their belonging to one of the four stages of the Consumer Decision Journey.
Using a corpus made up of texts extracted from different social media sources and
pertaining to several business sectors, we manually identified specific linguistic

patterns and used them in a rule-based classifier to unambiguously distinguish
among texts related to the different stages. We achieved an overall precision of
0.78 and 0.65, and an overall recall of 0.34 and 0.39, for English and Spanish,
respectively.
To our knowledge, this is the first attempt to automatically obtain Consumer
Decision Journey business indicators from user-generated content using rule-based
classifiers. The automatic identification of these business indicators is much
needed in order to drastically reduce the time and effort that marketing analysts
devote to identifying them manually. Due to the novelty of this research area, much
work remains to be done, including adapting the technique to other languages and
researching possible methods to improve the overall recall. Lastly, we also plan to
include more business sectors in order to make the system more robust.
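
Since the precision is markedly higher than the recall, a single summary figure can help when comparing languages and tracking future improvements in recall. The following sketch assumes that the balanced F-measure (the harmonic mean of precision and recall) is an acceptable summary; it is derived here from the figures reported above and was not part of the evaluation itself.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall (precision, recall) reported above for the rule-based classifier.
reported = {"English": (0.78, 0.34), "Spanish": (0.65, 0.39)}

for language, (precision, recall) in reported.items():
    print(f"{language}: F1 = {f1_score(precision, recall):.2f}")
```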

9.4.2 Technique for Detecting Marketing Mix Attributes


We have developed machine-learning classifiers that enable us to identify Marketing
Mix elements in user-generated texts. This allows a more accurate, fine-grained
analysis of consumer buzz (i.e. it not only establishes purchase stages but also
identifies relevant, common topics of conversation among customers throughout
their shopping experiences) and, in consequence, enables marketers to make
better-informed business decisions. The system has been implemented by training a
set of Decision Tree classifiers, achieving an overall precision of 0.76 and 0.75,
and an overall recall of 0.44 and 0.31, for English and Spanish, respectively.
As with the Consumer Decision Journey classifier, to our knowledge,
this is the first attempt to automatically obtain Marketing Mix business indicators
from user-generated content using machine-learning classifiers, reducing the effort
that marketing analysts devote to earned media analysis. Also, due to the novelty of
this research, much work remains to be done, such as adapting the technique to other
languages, improving the recall, or training the classifiers on texts from new
business sectors.

9.4.3 Technique for Detecting Emotions


In this thesis, we have developed a rule-based technique that classifies Spanish
texts from different social media channels according to four polarised categories
(satisfaction-dissatisfaction, trust-fear, love-hate and happiness-sadness)
that capture the main sentiments expressed through these channels.
The results of the evaluation of the technique (49.73% recall and 71.78%
precision) are quite satisfactory, considering the fine-grained classification.
Nevertheless, refining and expanding the set of rules (which currently consists of
more than 1200 rules) can improve the results. We have identified a set of future
lines of work, which are described next.
Rules that are too specific match few texts, thus making it necessary to have
a huge set of rules in order to cover all the domains. However, this specificity
leads to a higher accuracy, i.e. when an antecedent matches (part of) a text, the
system would very likely classify it correctly.
In addition, we have devised several ways to expand the set of rules by deriving
new rules from the existing ones:

• Replacing words or lemmas with others that do not appear in the analysed
corpus. The ideal substitutes are the synonyms of the ones actually ex-
amined in the same context. For verbs, good replacement candidates are
those that are collocates of the same sentiment. Badele3000 can provide us
with this information. As we have seen, it can help us to retrieve domain-
independent collocations of common sentiments, along with the semantic
relation between the terms of those collocations. For example, since both
sentir (to feel) and tener (to have) are values of the LF Func1 for odio
(hatred) (see Table 8.8), rule (1) could be added, as it is equivalent to the
following rule (2).

tener#V odio#NC [SP] ENTITY → LH - 1 (1)

sentir#V odio#NC [SP] ENTITY → LH - 1 (2)

• Elaborating less restrictive rules, i.e. omitting some of the elements in the
antecedent. This generalisation would likely lead to a larger coverage. Nevertheless,
there is no guarantee that the resulting rules would not decrease
the accuracy of the system. A new evaluation should be carried out for
each new rule in order to know its impact. Accordingly, a trade-off between
coverage and accuracy is sometimes necessary. For example, since
texts without occurrences of the entity have been discarded, a shallower
approach, where the entity is not part of the antecedent, could be considered.
Thus, we could derive rule (3) from rule (4) by omitting ENTITY.
Another example could be removing a lemma and taking into account only
its part-of-speech tag. For instance, rule (4) comes from rule (3) by replacing
the adverb muy (very) with any non-negative adverb ([RG]). However,
this rule is not correct, since poco (little) is an adverb that diminishes the
degree of the adjective, while muy intensifies it.

muy#RG odioso#A → LH – 2 (3)

ENTITY ser#V [RG] odioso#A → LH – 2 (4)

We could also benefit from resources with domain knowledge (e.g. an ontology
of the products of a field). In that case, we could write less specific
antecedents in our rules and use that knowledge instead.

• Re-ordering the components of the antecedent. In Spanish, this can be done
not only by switching between passive and active voice, but also by using
hyperbaton (Spanish has a rather free word order, so several orderings of the
same words can form correct sentences). For instance, the positions
of nouns and their adjectives are often interchangeable. Thus, both fuerte
odio and odio fuerte are correct (see Table 9.1).

Meaning in Spanish              Meaning in English                                Rules
siento fuerte odio a/por marca  I feel strong/forceful hatred against/for brand   [D] fuerte#A odio#NC [SP] ENTITY → LH - 1
siento odio fuerte a/por marca  I feel strong/forceful hatred against/for brand   sentir#V odio#NC fuerte#A [SP] ENTITY → LH - 1

Table 9.1: Rule reordering example
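
The substitution and re-ordering strategies described above can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: a rule is modelled as a tuple of antecedent tokens plus a consequent, the table of equivalent lemmas is assumed to have been obtained beforehand from a lexical resource, and all function names are hypothetical.

```python
from itertools import product

# A rule is (antecedent tokens, consequent), e.g. rule (2) in the text.
Rule = tuple[tuple[str, ...], str]


def expand_by_substitution(rule: Rule, equivalents: dict[str, list[str]]) -> list[Rule]:
    """Derive rules by replacing tokens with equivalent ones (original rule first)."""
    antecedent, consequent = rule
    options = [[token] + equivalents.get(token, []) for token in antecedent]
    return [(tuple(choice), consequent) for choice in product(*options)]


def expand_by_reordering(rule: Rule, first: str, second: str) -> list[Rule]:
    """Derive a rule by swapping two adjacent tokens, e.g. a noun and its adjective."""
    antecedent, consequent = rule
    tokens = list(antecedent)
    variants = [rule]
    for i in range(len(tokens) - 1):
        if tokens[i] == first and tokens[i + 1] == second:
            swapped = tokens[:i] + [second, first] + tokens[i + 2:]
            variants.append((tuple(swapped), consequent))
    return variants


rule_2 = (("sentir#V", "odio#NC", "[SP]", "ENTITY"), "LH - 1")
# Hypothetical equivalence table: verbs sharing a lexical function value for odio.
equivalents = {"sentir#V": ["tener#V"]}
substituted = expand_by_substitution(rule_2, equivalents)

noun_first = (("sentir#V", "odio#NC", "fuerte#A", "[SP]", "ENTITY"), "LH - 1")
reordered = expand_by_reordering(noun_first, "odio#NC", "fuerte#A")
```

Each derived rule would still need to be evaluated individually, since, as noted above, generalisation may reduce accuracy.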

As we have explained, some rules are created by using domain-independent
resources and procedures. Thus, besides evaluating our system with this new set
of rules, we also plan to apply it to new domains in order to analyse its generality.
Finally, the grammar allows for quite a flexible specification at the
morphosyntactic level, but sometimes information at the syntactic dependency level
can be useful too. For instance, knowing the scope of a negation could help to
determine the units to be computed by the classifier.

9.4.4 Technique for Identifying the Place of Residence of Social Media Users
The evaluation results obtained for the technique for identifying the place of
residence of social media users show that the approaches that make use of the
user’s community achieve better performance than the ones based on the analysis
of the content published and shared by the user. While most of the members of a
user’s community share the user’s place of residence (because of the homophily
principle in social networks), the mentions of locations included in the content
published by users are not necessarily related to their place of residence.
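
The intuition behind the community-based approaches can be sketched as a majority vote over the places of residence known for the members of a user's community. This is a simplified illustration under stated assumptions, not the exact technique evaluated in this thesis; the function name and the `min_share` threshold are hypothetical.

```python
from collections import Counter
from typing import Optional


def infer_residence(community_locations: list[Optional[str]],
                    min_share: float = 0.5) -> Optional[str]:
    """Return the most frequent known location if it reaches the given share."""
    known = [location for location in community_locations if location]
    if not known:
        return None
    place, count = Counter(known).most_common(1)[0]
    return place if count / len(known) >= min_share else None


# Hypothetical community: None marks members whose location is unknown.
community = ["Madrid", "Madrid", "Sevilla", None]
residence = infer_residence(community)
```

The threshold trades coverage for precision: a higher value leaves more users unlabelled but only assigns a place when the community is clearly concentrated in it.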

9.4.5 Technique for Identifying the Gender of Social Media Users
We have achieved very satisfactory results for gender identification by just making
use of user profile metadata, since the precision obtained is high and the technique
used is very simple with respect to computational complexity, which leads to a
straightforward set up in a production environment.
The approach based on mentions of users increases the recall in the cases
where the technique based on user metadata is not able to identify the
gender, because in Spanish there is grammatical gender agreement
between nouns and other part-of-speech categories (e.g.
adjectives and pronouns).
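
The combination of both approaches can be sketched as a fallback chain: the profile metadata is consulted first, and the texts that mention the user are analysed only when the metadata gives no answer. The sketch below is a rough illustration under strong assumptions: the two lexicons are tiny hypothetical examples, and gendered words are simply counted, without the linguistic analysis needed to verify that they actually agree with the mentioned user.

```python
import re
from collections import Counter
from typing import Optional


def gender_from_profile(display_name: str, name_lexicon: dict[str, str]) -> Optional[str]:
    """Look up the tokens of the profile name in a first-name lexicon."""
    for token in re.findall(r"\w+", display_name.lower()):
        if token in name_lexicon:
            return name_lexicon[token]
    return None


def gender_from_mentions(mentions: list[str], agreement_lexicon: dict[str, str]) -> Optional[str]:
    """Vote over gender-marked words found in texts that mention the user."""
    votes = Counter()
    for text in mentions:
        for token in re.findall(r"\w+", text.lower()):
            if token in agreement_lexicon:
                votes[agreement_lexicon[token]] += 1
    ranked = votes.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None  # no evidence, or a tie
    return ranked[0][0]


def identify_gender(display_name, mentions, name_lexicon, agreement_lexicon):
    """Use profile metadata first; fall back to mentions to increase recall."""
    return (gender_from_profile(display_name, name_lexicon)
            or gender_from_mentions(mentions, agreement_lexicon))


# Hypothetical lexicons, for illustration only.
names = {"maría": "female", "carlos": "male"}
agreement = {"guapo": "male", "guapa": "female", "simpático": "male", "simpática": "female"}
```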
This technique can be extended in the future with the use of facial analysis
techniques, like the one proposed by Bekios-Calfa et al. [2014], as many users
publish their photograph in their social media profiles.

9.4.6 Normalisation of User-Generated Content


The text classifiers described in this thesis make use of an approach for
normalising user-generated content that relies on existing, collectively
developed web resources. We have found that such resources, useful for many NLP
tasks, are also valid for the task of micropost normalisation.
With respect to the future lines of work, we plan to adapt the normaliser
to new languages by incorporating the corresponding dictionaries, and to improve
the existing lexicons by using additional available resources, such as the
anchor texts of intra-wiki links.
Finally, we plan to improve the normalisation of typos that affect multiword
expressions, where several words should be merged into one (e.g. the
Spanish expression “a cerca de” should be transformed into “acerca de”), as
well as cases where joined words should be split (e.g. “realmadrid” should be
transformed into “real madrid”), by using existing word breaking techniques such
as the one described by Wang et al. [2011].
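
To make the word-splitting case concrete, the following sketch segments a joined token with dynamic programming over a vocabulary, preferring the segmentation with the fewest words. It is a deliberately simple stand-in and not the technique of Wang et al. [2011]; the vocabulary shown is hypothetical, and in practice it would come from the normaliser's lexicons.

```python
from typing import Optional


def split_joined_words(token: str, vocabulary: set[str]) -> Optional[list[str]]:
    """Return the segmentation of token with the fewest vocabulary words, or None."""
    n = len(token)
    # best[i] holds the best segmentation found for token[:i].
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical vocabulary, for illustration only.
vocabulary = {"real", "madrid", "a", "cerca", "acerca", "de"}
words = split_joined_words("realmadrid", vocabulary)
```

Preferring fewer words is only a heuristic; a frequency-based score would be needed to choose between competing segmentations of the same length.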

9.4.7 Evaluation of Scalability


Because of their scale, brands’ earned media mentions extracted from social media
channels and gathered by marketing and communications agencies can be
considered “Big Data”, as they are characterised by a huge volume of data, high
velocity of production, and high heterogeneity [O’Leary, 2013].
Media agencies like GroupM[103] or Havas Media Group extract more than 1,200
million posts a year from their social media monitoring tools, including mentions
of their monitored brands and their competitors. This represents a volume of more
than 1.5 TB of raw data, mainly consisting of text, associated content and authors’
metadata. This volume grows very significantly when the data is processed, augmented
with different classifications, and integrated and indexed within databases.
The high velocity at which data is produced is a challenge, as data needs to
be processed faster than content is produced, at a near real-time pace, even if the
content is batch-processed.
In addition, variety along several dimensions (e.g. content quality,
multilinguality, multiplicity of formats, diversity of technologies and techniques
to be integrated) has conditioned the infrastructure developed to evaluate the
scalability of the work presented in this thesis.
[103] http://www.groupm.com

We have performed a preliminary test of the scalability of the software com-
ponents by integrating them within a Big Data processing platform. However,
a more rigorous validation of the scalability of the techniques presented in this
work in a Big Data scenario is still pending.
Specifically, we have integrated the techniques for consumer segmentation
presented in this thesis into a Big Data infrastructure. This infrastructure is based
on Hadoop-related[104] technologies, namely, Flume[105] for real-time consumption of
posts, Hive and MapReduce for batch processing and data aggregation, HDFS for
temporary data storage, and HBase for storing the linguistic resources queried by
our classifiers. Once the data are processed, they are indexed in a Solr[106] cloud
environment, and aggregation results are uploaded to relational databases with
OLAP capabilities. Processes have been developed using the Scala[107] programming
language.
Measurements of the time required for the multi-classification of each piece of
text show that it takes an average of 0.46 seconds per post (note that the length of
the text varies across different sources). Therefore, we find the approach useful
for automatically tagging the data stream continuously extracted and analysed by
marketing companies.
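
A back-of-the-envelope calculation relates this average time to the volumes mentioned earlier in this section. It is only a rough estimate under simplifying assumptions: posts arrive uniformly over the year, classification parallelises perfectly across workers, 1 TB is taken as 10^12 bytes, and the reported figures ("more than" 1,200 million posts and 1.5 TB) are treated as exact.

```python
import math

POSTS_PER_YEAR = 1_200_000_000       # posts extracted per year (reported lower bound)
SECONDS_PER_POST = 0.46              # average multi-classification time per post
RAW_BYTES_PER_YEAR = 1.5e12          # 1.5 TB of raw data, assuming 1 TB = 10**12 bytes
SECONDS_PER_YEAR = 365 * 24 * 3600

# Total classification time if every post were processed sequentially.
total_seconds = POSTS_PER_YEAR * SECONDS_PER_POST

# Minimum number of parallel workers needed to keep pace with production.
min_parallel_workers = math.ceil(total_seconds / SECONDS_PER_YEAR)

# Average size of a raw post, including associated content and metadata.
avg_bytes_per_post = RAW_BYTES_PER_YEAR / POSTS_PER_YEAR

print(f"sequential processing time: {total_seconds / SECONDS_PER_YEAR:.1f} years")
print(f"minimum parallel workers:   {min_parallel_workers}")
print(f"average raw post size:      {avg_bytes_per_post:.0f} bytes")
```

Under these assumptions, a single sequential process could not keep pace with production, which is consistent with distributing the classification over a cluster.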

[104] http://hadoop.apache.org
[105] http://flume.apache.org
[106] http://lucene.apache.org/solr
[107] http://www.scala-lang.org

REFERENCES

Aguado de Cea, G., Barrios, M., Bernardos, S., Campanella, I., Montiel-Ponsoda,
E., Muñoz-García, O., and Rodríguez, V. (2014). Análisis de sentimientos en
un corpus de redes sociales. In Proceedings of the 31st International Conference
of the Spanish Association of Applied Linguistics, AESLA’14, pages 18–20, San
Cristóbal de la Laguna, Tenerife, Spain.

Aguado de Cea, G. and Bernardos, S. (2007). Calíope: herramienta para gestionar
un corpus y un glosario de términos informáticos. In Proceedings of the
6th Annual Conference of the European Association of Languages for Specific
Purposes, AELFE’07, pages 292–299, Lisbon, Portugal.

Alegria, I., Aranberri, N., Fresno, V., Gamallo, P., Padró, L., San Vicente, I.,
Turmo, J., and Zubiaga, A. (2013). Introducción a la tarea compartida tweet-
norm 2013: Normalización léxica de tuits en español. In Alegria, I., Aranberri,
N., Fresno, V., Gamallo, P., Padró, L., San Vicente, I., Turmo, J., and Zu-
biaga, A., editors, Proceedings of the tweet normalisation workshop co-located
with 29th conference of the Spanish Society for Natural Language Processing,
SEPLN’13, pages 1–9, Madrid, Spain.

Alvestrand, H. T. (1995). RFC 1766 – Tags for the identification of languages.
https://www.ietf.org/rfc/rfc1766.txt.

Arnold, M. (1960). Emotion and personality: psychological aspects. Emotion and
Personality. Columbia University Press.

Asur, S. and Huberman, B. A. (2010). Predicting the future with social media.
In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web
Intelligence and Intelligent Agent Technology - Volume 1, WI-IAT’10, pages
492–499, Washington DC, USA. IEEE Computer Society.

Atserias, J., Comelles, E., and Mayor, A. (2005). TXALA: un analizador libre de
dependencias para el castellano. Procesamiento del Lenguaje Natural, 35:455–
456.

Backstrom, L., Kleinberg, J., Kumar, R., and Novak, J. (2008). Spatial variation
in search engine queries. In Proceedings of the 17th international World Wide
Web Conference, WWW’08, pages 357–366, Beijing, China. ACM.

Backus, J. W., Bauer, F. L., Green, J., Katz, C., McCarthy, J., Perlis, A. J.,
Rutishauser, H., Samelson, K., Vauquois, B., Wegstein, J. H., van Wijngaarden,
A., and Woodger, M. (1963). Revised report on the algorithm language ALGOL
60. Communications of the ACM, 6(1):1–17.

Basave, A. E. C., Varga, A., Rowe, M., Stankovic, M., and Dadzie, A.-S. (2013).
Making sense of microposts (#msm2013) concept extraction challenge. In
Proceedings of the Concept Extraction Challenge at the Workshop on ’Mak-
ing Sense of Microposts’ co-located with the 22nd International World Wide
Web Conference, WWW’13, pages 1–15, Rio de Janeiro, Brazil.

Bekios-Calfa, J., Buenaposada, J. M., and Baumela, L. (2014). Robust gender
recognition by exploiting facial attributes dependencies. Pattern Recognition
Letters, 36:228–234.

Bernardos, S. and Barrios, M. (2008). Data model for a lexical resource based on
lexical functions. Research in Computing Science, 27:9–22.

Berners-Lee, T. (1994). RFC 1738 – Uniform Resource Locators (URL).
https://www.ietf.org/rfc/rfc1738.txt.

Berners-Lee, T., Fielding, R. T., and Masinter, L. (2005). RFC 3986 - Uniform
Resource Identifier (URI): generic syntax.
https://www.ietf.org/rfc/rfc3986.txt.

Boda, K., Földes, A., Gulyás, G., and Imre, S. (2012). User tracking on the web
via cross-browser fingerprinting. Information Security Technology for Applica-
tions, 7161:31–46.

Borden, N. H. (1964). The concept of the marketing mix. Journal of Advertising
Research, 4(2):2–7.

Box, G. E. P. and Jenkins, G. (1990). Time series analysis, forecasting and
control. Holden-Day, Incorporated.

Breslin, J. G., Decker, S., Harth, A., and Bojars, U. (2006). SIOC: an approach
to connect Web-based communities. International Journal of Web Based Com-
munities, 2(2):133–142.

Brooke, J., Tofiloski, M., and Taboada, M. (2009). Cross-linguistic sentiment
analysis: from English to Spanish. In Proceedings of the 7th International
Conference on Recent Advances in NLP, RANLP’09, Borovets, Bulgaria.

Buitelaar, P., Arcan, M., Iglesias, C. A., Sánchez-Rada, J. F., and Strappar-
ava, C. (2013). Linguistic linked data for sentiment analysis. In Chiarcos, C.,
Cimiano, P., Declerck, T., and McCrae, J. P., editors, Proceedings of the 2nd
Workshop on Linked Data in Linguistics: Representing and Linking Lexicons,
Terminologies and Other Language Data. Collocated with the Conference on
Generative Approaches to the Lexicon, LDL’13, pages 1–8, Pisa, Italy. Associ-
ation for Computational Linguistics.

Burby, J. and Brown, A. (2007). Web Analytics definitions.
http://www.digitalanalyticsassociation.org/Files/PDF_standards/WebAnalyticsDefinitionsVol1.pdf.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating
gender on Twitter. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, EMNLP’11, pages 1301–1309, Edinburgh, United
Kingdom. Association for Computational Linguistics.

Cambria, E., Schuller, B., Xia, Y., and Havasi, C. (2013). New avenues in opinion
mining and sentiment analysis. Intelligent Systems, IEEE, 28(2):15–21.

Cambria, E. and White, B. (2014). Jumping NLP curves: a review of natural
language processing research. Computational Intelligence Magazine, IEEE,
9(2):48–57.

Carroll, J. J., Bizer, C., Hayes, P., and Stickler, P. (2005). Named graphs, prove-
nance and trust. In Proceedings of the 14th International Conference on World
Wide Web, WWW’05, pages 613–622, Chiba, Japan. ACM.

Cavnar, W. B. and Trenkle, J. M. (1994). N-gram-based text categorization.
In Proceedings of the Third Annual Symposium on Document Analysis and
Information Retrieval, SDAIR’94, pages 161–175, Las Vegas, USA.

Chan, W. S. (2003). Stock price reaction to news and no-news: Drift and reversal
after headlines. Journal of Financial Economics, 70:223–260.

Chang, H., Lee, D., Eltaher, M., and Lee, J. (2012). @phillies tweeting from
philly? predicting twitter user locations with spatial word usage. In Proceedings
of the 2012 IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining, ASONAM’12, pages 111–118, Istanbul, Turkey.

Chaumartin, F.-R. (2007). Upar7: A knowledge-based system for headline sentiment
tagging. In Proceedings of the 4th International Workshop on Semantic
Evaluations, SemEval’07, pages 422–425, Prague, Czech Republic. Association
for Computational Linguistics.

Cheng, Z., Caverlee, J., and Lee, K. (2010). You are where you tweet: a
content-based approach to geo-locating twitter users. In Proceedings of the
19th ACM International Conference on Information and Knowledge Manage-
ment, CIKM’10, pages 759–768, Toronto, Canada. ACM.

Chetviorkin, I. I., Braslavski, P. I., and Loukachevitch, N. V. (2011). Rule based
approach to sentiment analysis. In Proceedings of the Sentiment Analysis Track
at the Russian Information Retrieval Evaluation Seminar, ROMIP’11.

Clore, G. L., Ortony, A., and Foss, M. A. (1987). The psychological foundations of
the affective lexicon. Journal of Personality and Social Psychology, 53(4):751–
755.

Codina, J. and Atserias, J. (2012). What is the text of a tweet? In Proceedings of
@NLP can u tag #user generated content?! via lrec-conf.org, LREC’12, pages
29–33, Istanbul, Turkey. ELRA.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20(1):37–46.

Corcoran, S. (2009). Defining earned, owned and paid media.
http://blogs.forrester.com/interactive_marketing/2009/12/defining-earned-owned-and-paid-media.html.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning,
20(3):273–297.

Coursey, K., Mihalcea, R., and Moen, W. (2009). Using encyclopedic knowledge
for automatic topic identification. In Proceedings of the Thirteenth Confer-
ence on Computational Natural Language Learning, CoNLL’09, pages 210–218,
Boulder, Colorado, USA. Association for Computational Linguistics.

Court, D., Elzinga, D., Mulder, S., and Vetvik, O. J. (2009). The consumer
decision journey. McKinsey Quarterly, 3:1–11.

Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts,
I., Gorrell, G., Funk, A., Roberts, A., Damljanovic, D., Heitz, T., Greenwood,
M. A., Saggion, H., Petrak, J., Li, Y., and Peters, W. (2011). Text Processing
with GATE (Version 6). The University of Sheffield, Department of Computer
Science.

De Bruyn, A. and Lilien, G. (2008). A multi-stage model of word-of-mouth influence
through viral marketing. International Journal of Research in Marketing,
25(3):151–163.

Deane, J., Loren, P., and Terry, R. (2011). Behavioural targeting in online ad-
vertising using web surf history analysis and contextual segmentation. Inter-
national Journal of Electronic Business, 9(3):271–291.

Dellarocas, C. (2003). The digitization of word of mouth: Promise and challenges
of online feedback mechanisms. Management Science, 49(10):1407–1424.

Ding, X. and Liu, B. (2007). The utility of linguistic rules in opinion mining.
In Proceedings of the 30th Annual International ACM SIGIR Conference, SI-
GIR’07, pages 811–812, Amsterdam, The Netherlands. ACM.

Divol, R., Edelman, D., and Sarrazin, H. (2012). Demystifying social media.
McKinsey Quarterly, 12(2):66–77.

Dodig-Crnkovic, G. (2002). Scientific methods in Computer Science. In Proceedings
of the Conference for the Promotion of Research in IT at New Universities
and at University Colleges in Sweden, Skövde, Sweden.

Droms, R. (1997). RFC 2131 – Dynamic Host Configuration Protocol.
https://www.ietf.org/rfc/rfc2131.txt.

Eastlake, D. and Jones, P. (2001). RFC 3174 – US Secure Hash Algorithm 1
(SHA1). https://tools.ietf.org/html/rfc3174.

Eckersley, P. (2010). How unique is your Web browser? In Atallah, M. and
Hopper, N., editors, Privacy Enhancing Technologies, volume 6205 of Lecture
Notes in Computer Science, pages 1–18. Springer Berlin Heidelberg, Berlin,
Heidelberg.

ECMA (2011). Standard ECMA-262. ECMAScript language specification.
http://www.ecma-international.org/ecma-262/5.1/.

Edelman, D. (2010). Branding in the Digital Age: You’re Spending Your Money
in All the Wrong Places. Harvard Business Review.

Egan, J. (1975). Signal detection theory and ROC-analysis. Academic Press series
in cognition and perception. Academic Press.

Egevang, K. (1994). RFC 1631 – The IP Network Address Translator (NAT).
https://www.ietf.org/rfc/rfc1631.txt.

Ekman, P. (1994). Moods, emotions, and traits. In Ekman, P. and Davidson, R.,
editors, The Nature of Emotion: Fundamental Questions, SAS Series, pages
56–58. Oxford University Press.

Ekman, P. (2005). Emotion in the Human Face. Series in Affective Science.
Oxford University Press.

Esuli, A. and Sebastiani, F. (2006). SENTIWORDNET: A publicly available
lexical resource for opinion mining. In Proceedings of the 5th Conference on
Language Resources and Evaluation, LREC’06, pages 417–422, Genoa, Italy.

Fielding, R. T. (2000). Architectural Styles and the Design of Network-based
Software Architectures. PhD thesis, University of California, Irvine. AAI9980887.

Fielding, R. T. and Reschke, J. (2014a). RFC 7230 – Hypertext Transfer Protocol
(HTTP/1.1): Message Syntax and Routing. https://tools.ietf.org/html/rfc7230.

Fielding, R. T. and Reschke, J. (2014b). RFC 7231 – Hypertext Transfer Protocol
(HTTP/1.1): Semantics and Content. https://tools.ietf.org/html/rfc7231.

Fleiss, J. L. (1973). The equivalence of weighted kappa and the intraclass cor-
relation coefficient as measures of reliability. Educational and Psychological
Measurement, 33:613–619.

Franzen, G. and Goessens, C. (1999). Brands & advertising: how advertising
effectiveness influences brand equity. Admap.

Freed, N. and Borenstein, N. (1996). RFC 2045 - Multipurpose Internet Mail
Extensions (MIME) Part One. https://www.ietf.org/rfc/rfc2045.txt.

Fung, G. P. C., Yu, J. X., and Lam, W. (2003). Stock prediction: Integrating
text mining approach using real-time news. In Proceedings of 2003 IEEE Inter-
national Conference on Computational Intelligence for Financial Engineering,
CIFER’03, pages 395–402, Hong Kong, China.

Gabrilovich, E. and Markovitch, S. (2006). Overcoming the brittleness bottleneck
using wikipedia: Enhancing text categorisation with encyclopaedic knowledge.
In Proceedings of the 21st National Conference on Artificial Intelligence,
volume 2 of AAAI’06, pages 1301–1306, Boston, Massachusetts, USA. AAAI
Press.

Gabrilovich, E. and Markovitch, S. (2007). Computing semantic relatedness using
Wikipedia-based explicit semantic analysis. In Proceedings of the 20th Interna-
tional Joint Conference on Artificial Intelligence, IJCAI’07, pages 1606–1611,
Hyderabad, India. Morgan Kaufmann Publishers Inc.

Gamallo, P., Garcia, M., and Pichel, J. R. (2013). A method to lexical normali-
sation of tweets. In Alegria, I., Aranberri, N., Fresno, V., Gamallo, P., Padró,
L., San Vicente, I., Turmo, J., and Zubiaga, A., editors, Proceedings of the
Tweet Normalization Workshop co-located with 29th Conference of the Span-
ish Society for Natural Language Processing, SEPLN’13, pages 44–48, Madrid,
Spain.

Gangemi, A., Presutti, V., and Reforgiato Recupero, D. (2014). Frame-based
detection of opinion holders and topics: A model and a tool. Computational
Intelligence Magazine, IEEE, 9(1):20–30.

García Moya, L. (2008). Un etiquetador morfológico para el español de Cuba.
Master’s thesis, Universidad de Oriente. Facultad de Matemática y Computación,
Santiago de Cuba, Cuba.

Gayo-Avello, D. (2011). Don’t turn social media into another ’literary digest’
poll. Communications of the ACM, 54(10):121–128.

Gendron, M. and Feldman Barrett, L. (2009). Reconstructing the past: A century
of ideas about emotion in psychology. Emotion Review, 1(4):316–339.

Goldberg, A. B., Fillmore, N., Andrzejewski, D., Xu, Z., Gibson, B., and Zhu, X.
(2009). May all your wishes come true: A study of wishes and how to recog-
nize them. In Proceedings of Human Language Technologies: The 2009 Annual
Conference of the North American Chapter of the Association for Computa-
tional Linguistics, NAACL’09, pages 263–271, Boulder, Colorado. Association
for Computational Linguistics.

Gómez-Pérez, A., Fernández-López, M., and Corcho, O. (2004). Ontological
Engineering: with examples from the areas of Knowledge Management, e-Commerce
and the Semantic Web. First Edition. Advanced Information and Knowledge
Processing. Springer.

Graves, M., Constabaris, A., and Brickley, D. (2007). FOAF: connecting people
on the Semantic Web. Cataloging & Classification Quarterly, 43:191–202.

Gruhl, D., Guha, R., Kumar, R., Novak, J., and Tomkins, A. (2005). The pre-
dictive power of online chatter. In Proceedings of the 11th ACM SIGKDD
International Conference on Knowledge Discovery in Data Mining, KDD’05,
pages 78–87, Chicago, Illinois, USA. ACM.

Gupta, P. and Harris, J. (2010). How e-WOM recommendations influence product
consideration and quality of choice: a motivation to process information
perspective. Journal of Business Research, 63(9–10):1041–1049.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten,
I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD
Explorations Newsletter, 11(1):10–18.

Han, X., Wei, W., Miao, C., Mei, J., and Song, H. (2014). Context-aware personal
information retrieval from multiple social networks. Computational Intelligence
Magazine, IEEE, 9(2):18–28.

Harding, W., Reed, A., and Gray, R. (2001). Cookies and web bugs: What they
are and how they work together. Information Systems Management, 18:17–24.

Hatzivassiloglou, V. and McKeown, K. R. (1997). Predicting the semantic orientation
of adjectives. In Proceedings of the 8th Conference on European Chapter
of the Association for Computational Linguistics, EACL’97, pages 174–181,
Madrid, Spain. Association for Computational Linguistics.

Hennig-Thurau, T., Malthouse, E. C., Friege, C., Gensler, S., Lobschat, L., Ran-
gaswamy, A., and Skiera, B. (2010). The impact of new media on customer
relationships. Journal of Service Research, 13(3):311–330.

Hovi, E., Markman, V., Martell, C., and Uthus, D. (2013). Analyzing microtext.
In Proceedings of the 2013 AAAI Spring Symposia, AAAI’13, page vii, Palo
Alto, California, USA. Association for the Advancement of Artificial Intelli-
gence.

Hu, X. and Cercone, N. (2004). A data Warehouse/OLAP framework for web us-
age mining and business intelligence reporting. International Journal of Com-
putational Intelligence Systems, 19:585–606.

IEEE (1990). IEEE standard glossary of software engineering terminology. IEEE
Standard 610.12-1990, Standards Coordinating Committee of the Computer
Society of the IEEE.

IEEE (1995a). IEEE guide for software quality assurance planning. IEEE Standard
730.1-1995, Software Engineering Standards Committee of the IEEE
Computer Society.

IEEE (1995b). IEEE standard for developing software life cycle processes. IEEE
Standard 1074-1995, IEEE Computer Society.

IEEE (1997). IEEE standard for developing software life cycle processes. IEEE
Standard 1074-1997, IEEE Computer Society.

Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion
des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles,
37:547–579.

Joshi, M., Das, D., Gimpel, K., and Smith, N. A. (2010). Movie reviews and
revenues: An experiment in text regression. In Human Language Technologies:
The 2010 Annual Conference of the North American Chapter of the Association
for Computational Linguistics, HLT’10, pages 293–296, Los Angeles, California,
USA. Association for Computational Linguistics.

Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics, and
Speech Recognition. Prentice Hall.

Katz, P., Singleton, M., and Wicentowski, R. (2007). SWAT-MP: The SemEval-
2007 Systems for Task 5 and Task 14. In Proceedings of the 4th Interna-
tional Workshop on Semantic Evaluations, SemEval’07, pages 308–313, Prague,
Czech Republic. Association for Computational Linguistics.

Kaufmann, M. and Jugal, K. (2010). Syntactic normalization of twitter messages.
In Proceedings of the International Conference on Natural Language Processing,
ICON’10, pages 2–8, Kharagpur, India.

Kaushik, A. (2007). Web Analytics: an hour a day. John Wiley & Sons, Incor-
porated.

Kaushik, A. (2009). Web Analytics 2.0: the art of online accountability and
science of customer centricity. Wiley.

Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., and Wright, S. E. (2008).
ISOcat: corralling data categories in the wild. In Calzolari, N., Choukri, K.,
Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., and Tapias, D., editors,
Proceedings of the 6th International Conference on Language Resources and
Evaluation, LREC’08, pages 887–891, Marrakech, Morocco. European Language
Resources Association (ELRA).

Kimball, R., Reeves, L., Thornthwaite, W., Ross, M., and Thornwaite, W. (1998).
The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Devel-
oping and Deploying Data Warehouses. John Wiley & Sons, Inc., New York,
NY, USA, 1st edition.

Kimball, R. and Ross, M. (2002). The Data Warehouse Toolkit: The Complete
Guide to Dimensional Modelling. John Wiley & Sons, Inc., New York, USA,
2nd edition.

Kleinginna, P. R. and Kleinginna, A. M. (1981). A categorized list of emotion
definitions, with suggestions for a consensual definition. Motivation and Emotion,
5(4):345–379.

Kohavi, R. and Provost, F. (1998). Glossary of terms. Machine Learning,
30(2/3):271–274.

Kothari, C. (2004). Research Methodology: Methods and Techniques. New Age
International Publishers Limited, second edition.

Kowalski, G. (1997). Information Retrieval Systems. Theory and Implementation.
Kluwer Academic Publishers.

Kozareva, Z., Navarro, B., Vázquez, S., and Montoyo, A. (2007). UA-ZBSA:
A headline emotion classification through Web information. In Proceedings of
the 4th International Workshop on Semantic Evaluations, SemEval’07, pages
334–337, Prague, Czech Republic. Association for Computational Linguistics.

Kozinets, R. V., de Valck, K., Wojnicki, A. C., and Wilner, S. J. (2010). Net-
worked narratives: Understanding word-of-mouth marketing in online commu-
nities. Journal of Marketing, 74(2):71–89.

Larsen, B. and Aone, C. (1999). Fast and effective text mining using linear-time
document clustering. In Proceedings of the 5th ACM SIGKDD international
conference on Knowledge discovery and data mining, KDD’99, pages 16–22,
San Diego, California, USA.

le Cessie, S. and van Houwelingen, J. (1992). Ridge estimators in logistic
regression. Applied Statistics, 41(1):191–201.

Leech, G. and Wilson, A. (1996). EAGLES. Recommendations for the
morphosyntactic annotation of corpora.
http://www.ilc.cnr.it/EAGLES/annotate/annotate.html.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions,
and reversals. Soviet Physics Doklady, 10(8):707–710.

Lewis, E. (1903). Advertising department: Catch-line and argument. The
Book-Keeper, 15:124–128.

Li, P., Dong, X. L., Maurino, A., and Srivastava, D. (2011). Linking temporal
records. Proceedings of the VLDB Endowment, 4(11):956–967.

Liu, B. (2010). Sentiment analysis and subjectivity. In Indurkhya, N. and
Damerau, F. J., editors, Handbook of Natural Language Processing, Second Edition,
pages 1–38. CRC Press, Taylor and Francis Group, Boca Raton, USA.

Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool.

Maldonado, S. (2009). Analítica Web: medir para triunfar. ESIC Editorial,
Pozuelo de Alarcón, Madrid.

Mayer, J., Narayanan, A., and Stamm, S. (2011). Do Not Track: a Universal
third-party Web tracking opt out.
https://tools.ietf.org/html/draft-mayer-do-not-track-00.

McCarthy, E. J. and Brogowicz, A. A. (1981). Basic marketing: a managerial
approach. Irwin Series in Marketing. R.D. Irwin.

McPherson, M., Smith-Lovin, L., and Cook, J. M. (2001). Birds of a feather:
Homophily in social networks. Annual Review of Sociology, 27(1):415–444.

Mel’čuk, I. (1996). Lexical functions: A tool for the description of lexical relations
in a lexicon. In Wanner, L., editor, Lexical functions in lexicography and natural
language processing, Studies in language companion series, pages 37–102. John
Benjamins, Amsterdam, Philadelphia, USA.

Mihalcea, R. (2007). Using Wikipedia for automatic word sense disambiguation.
In Sidner, C. L., Schultz, T., Stone, M., and Zhai, C., editors, Proceedings of
the North American Chapter of the Association for Computational Linguistics,
NAACL-HLT’07, pages 196–203, Rochester, NY, USA. The Association for
Computational Linguistics.

Miles, A., Matthews, B., Wilson, M., and Brickley, D. (2005). SKOS core: simple
knowledge organisation for the Web. In Proceedings of the 2005 International
Conference on Dublin Core and Metadata Applications: Vocabularies in Prac-
tice, DCMI’05, pages 1:1–1:9, Madrid, Spain. Dublin Core Metadata Initiative.

Mishne, G. and Glance, N. (2006). Predicting movie sales from blogger
sentiment. In Proceedings of the AAAI Symposium on Computational Approaches
to Analysing Weblogs, AAAI-CAAW’06, pages 155–158.

Mislove, A., Lehmann, S., Ahn, Y.-Y., Onnela, J.-P., and Rosenquist, J. N.
(2011). Understanding the demographics of Twitter users. In Proceedings of the
5th International AAAI Conference on Weblogs and Social Media, ICWSM’11,
pages 554–557, Barcelona, Spain.

Mockapetris, P. (1987). RFC 1035 – Domain Names – Implementation and
Specification. https://www.ietf.org/rfc/rfc1035.txt.

Mullen, T. and Collier, N. (2004). Sentiment analysis using support vector ma-
chines with diverse information sources. In Proceedings of Conference on Em-
pirical Methods in Natural Language Processing, EMNLP’04, pages 412–418.

Ng, S. and Hill, S. R. (2009). The impact of negative word-of-mouth in Web
2.0 on brand equity. In Proceedings of the 2009 ANZMAC Annual Conference,
ANZMAC’09, Melbourne, Australia. Monash University.

Nielsen (2012a). Global trust in advertising and brand messages.
http://www.nielsen.com/us/en/insights/reports/2013/global-trust-in-advertising-and-brand-messages.html.

Nielsen (2012b). State of the media – the social media report.
http://www.nielsen.com/us/en/insights/reports/2012/state-of-the-media-the-social-media-report-2012.html.

Noble, S., Cooperstein, D. M., Kemp, M. B., and Munchbach, C. (2010). It’s time
to bury the marketing funnel – an empowered report.
https://www.forrester.com/Its+Time+To+Bury+The+Marketing+Funnel/fulltext/-/E-res57495.

Nottingham, M. and Sayre, R. (2005). RFC 4287 – The Atom Syndication For-
mat. https://tools.ietf.org/html/rfc4287.

Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam
web pages through content analysis. In Proceedings of the 15th International
Conference on World Wide Web, WWW’06, pages 83–92, Edinburgh, Scotland,
UK. ACM.

O’Leary, D. (2013). Artificial intelligence and big data. Intelligent Systems,
IEEE, 28(2):96–99.

Oliver, R. (1989). Processing of the satisfaction response in consumption: A
suggested framework and research propositions. Journal of Consumer Satisfaction,
Dissatisfaction and Complaining Behaviour, 2(1):1–16.

OMG (2011). OMG Unified Modelling Language (OMG UML), Superstructure.
http://www.omg.org/spec/UML/2.4.1/Superstructure/PDF/.

Ortony, A., Clore, G., and Collins, A. (1990). The Cognitive Structure of Emo-
tions. Cambridge University Press.

Padró, L. and Stanilovsky, E. (2012). FreeLing 3.0: towards wider
multilinguality. In Proceedings of the Language Resources and Evaluation
Conference, LREC’12, pages 2473–2479, Istanbul, Turkey. ELRA.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank cita-
tion ranking: Bringing order to the Web. Technical Report 1999-66, Stanford
InfoLab.

Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations
and Trends in Information Retrieval, 2(1-2):1–135.

Phillips, D. M. and Baumgartner, H. (2002). The role of consumption emotions
in the satisfaction response. Journal of Consumer Psychology, 12(3):243–252.

Plutchik, R. (1989). Emotion: Theory, Research, and Experience. Acad. Press.

Pookulangara, S. and Koesler, K. (2011). Cultural influence on consumers’ usage
of social networks and its’ impact on online purchase intentions. Journal of
Retailing and Consumer Services, 18(4):348–354.

Postel, J. (1981). RFC 791 – Internet Protocol - DARPA Internet Program,
Protocol Specification. https://www.rfc-editor.org/rfc/rfc791.txt.

Prabowo, R. and Thelwall, M. (2009). Sentiment analysis: a combined approach.
Journal of Informetrics, 3(2):143–157.

Quinlan, R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann
Publishers, San Mateo, CA.

Ramanand, J., Bhavsar, K., and Pedanekar, N. (2010). Wishful thinking: Find-
ing suggestions and ’buy’ wishes from product reviews. In Proceedings of the
NAACL HLT 2010 Workshop on Computational Approaches to Analysis and
Generation of Emotion in Text, CAAGET’10, pages 54–61, Los Angeles, Cali-
fornia, USA. Association for Computational Linguistics.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods.
Journal of the American Statistical Association, 66(336):846–850.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent
user attributes in Twitter. In Proceedings of the 2nd International Workshop on
Search and Mining User-Generated Contents, SMUC’10, pages 37–44, Toronto,
Canada. ACM.

Reese, W. (2008). Nginx: the high-performance web server and reverse proxy.
Linux Journal.

Rentoumi, V., Petrakis, S., Klenner, M., Vouros, G. A., and Karkaletsis, V.
(2010). United we stand: Improving sentiment analysis by joining machine
learning and rule-based methods. In Calzolari, N., Choukri, K., Maegaard,
B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., and Tapias, D., editors,
Proceedings of the 7th International Conference on Language Resources and
Evaluation, LREC’10, pages 1089–1094, Valletta, Malta. European Language
Resources Association (ELRA).

Richins, M. L. (1997). Measuring emotions in the consumption experience.
Journal of Consumer Research, 24(2):127–146.

Rosch, E. (1978). Principles of categorization. In Rosch, E. and Lloyd, B., editors,
Cognition and Categorization, pages 27–48. John Wiley & Sons Inc.

Sadikov, E., Parameswaran, A. G., and Venetis, P. (2009). Blogs as predictors of
movie success. In Proceedings of the Third International ICWSM Conference,
ICWSM’09, pages 304–307.

Sánchez-Rada, J. F. and Iglesias, C. A. (2013). Onyx: describing emotions on the
Web of data. In Proceedings of the First International Workshop on Emotion
and Sentiment in Social and Expressive Media: Approaches and Perspectives
from AI, volume 1096 of ESSEM’13, pages 71–82, Torino, Italy. AI*IA, Italian
Association for Artificial Intelligence, CEUR-WS.

Santorini, B. (1991). Part-Of-Speech tagging guidelines for the Penn Treebank
project (3rd revision, 2nd printing). Technical report, Department of Linguis-
tics, University of Pennsylvania.

Schindler, R. and Bickart, B. (2005). Published word of mouth: referable,
consumer-generated information on the internet. Online Consumer Psychology:
Understanding and Influencing Consumer Behaviour in the Virtual World,
pages 35–61.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In
Proceedings of the International Conference on New Methods in Language
Processing, NeMLaP’94, Manchester, UK.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System
Technical Journal, 27:379–423 and 623–656.

Shannon, C. E. and Warren, W. (1949). The mathematical theory of
communication. University of Illinois Press.

Sharda, R. and Delen, D. (2006). Predicting box-office success of motion pictures
with neural networks. Expert Systems with Applications, 30(2):243–254.

Shaver, P., Schwartz, J., Kirson, D., and O’Connor, C. (1987). Emotion knowl-
edge: further exploration of a prototype approach. Journal of Personality and
Social Psychology, 52(6):1061–1086.

Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining.
Journal of Data Warehousing, 5(4):13–22.

Shinavier, J. (2010). Real-time #SemanticWeb in <= 140 chars. In Proceedings
of the WWW2010 Workshop on Linked Data on the Web, WWW’10, Raleigh,
North Carolina, USA.

Sidorov, G., Miranda-Jiménez, S., Viveros-Jiménez, F., Gelbukh, A.,
Castro-Sánchez, N., Velásquez, F., Díaz-Rangel, I., Suárez-Guerra, S., Treviño, A.,
and Gordon, J. (2013). Empirical study of machine learning based approach
for opinion mining in tweets. In Proceedings of the 11th Mexican International
Conference on Advances in Artificial Intelligence - Volume Part I, MICAI’12,
pages 1–14, San Luis Potosí, Mexico. Springer-Verlag.

Sison, A. and J., F. (2005). Ethical aspects of e-commerce: data subjects and
content. International Journal of Internet Marketing and Advertising, 3:5–18.

Sommerville, I. (2007). Software Engineering. International Computer Science
Series. Addison-Wesley, eighth edition.

Sproat, R., Black, A. W., Chen, S., Kumar, S., Ostendorf, M., and Richards, C.
(2001). Normalization of non-standard words. Computer Speech & Language,
15(3):287–333.

Sterne, J. (2010). Social Media Metrics: How to Measure and Optimize Your
Marketing Investment. John Wiley & Sons.

Stone, P. J., Dunphy, D. C., Smith, M. S., and Ogilvie, D. M. (1966). The General
Inquirer: A Computer Approach to Content Analysis. M.I.T. Press.

Strapparava, C. and Mihalcea, R. (2007). SemEval-2007 Task 14: Affective Text.
In Proceedings of the 4th International Workshop on Semantic Evaluations,
SemEval’07, pages 70–74, Prague, Czech Republic. Association for
Computational Linguistics.

Suárez-Figueroa, M. C., Gómez-Pérez, A., and Fernández-López, M. (2012).
The NeOn methodology for ontology engineering. In Suárez-Figueroa, M. C.,
Gómez-Pérez, A., Motta, E., and Gangemi, A., editors, Ontology Engineering
in a Networked World, chapter 2, pages 9–34. Springer.

Subramanyam, R. (2011). The relationship between social media buzz
and TV ratings.
http://www.nielsen.com/us/en/insights/news/2011/the-relationship-between-social-media-buzz-and-tv-ratings.html.

Taboada, M., Brooke, J., Tofiloski, M., Voll, K., and Stede, M. (2011). Lexicon-
based methods for sentiment analysis. Computational Linguistics, 37(2):267–
307.

Tetlock, P. C., Saar-Tsechansky, M., and Macskassy, S. (2008). More than words:
Quantifying language to measure firms’ fundamentals. Journal of Finance,
63(3):1437–1467.

Thayer, R. (1989). The Biopsychology of Mood and Arousal. Oxford University
Press, New York, NY.

Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied
to unsupervised classification of reviews. In Proceedings of the 40th Annual
Meeting on Association for Computational Linguistics, ACL’02, pages 417–424,
Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Valitutti, A., Strapparava, C., and Stock, O. (2004). Developing affective lexical
resources. PsychNology Journal, 2(1):61–83.

van Bruggen, G. H., Antia, K. D., Jap, S. D., Reinartz, W. J., and Pallas, F.
(2010). Managing marketing channel multiplicity. Journal of Service Research
(JSR), 13(3):331–340.

Vaughn, R. (1986). How advertising works: A planning model revisited. Journal
of Advertising Research, 26:57–66.

Vilares, D., Alonso, M., and Gómez-Rodríguez, C. (2013). Clasificación de
polaridad en textos con opiniones en español mediante análisis sintáctico de
dependencias. Procesamiento del Lenguaje Natural, 50(0).

Vázquez, S., Muñoz-García, O., Campanella, I., Poch, M., Fisas, B., Bel, N., and
Andreu, G. (2014). A classification of user-generated content into consumer
decision journey stages. Neural Networks, 56:68–81.

Wang, K., Thrasher, C., and Hsu, P. B.-J. (2011). Web Scale NLP: a case study
on URL word breaking. In Proceedings of the 20th international conference on
World Wide Web, WWW’11, pages 357–366, Hyderabad, India. ACM.

Wang, X., Yu, C., and Wei, Y. (2012). Social media peer communication and
impacts on purchase intentions: A consumer socialization framework. Journal
of interactive marketing: a quarterly publication from the Direct Marketing
Educational Foundation, 26(4):198–209.

Weber, L. (2007). Marketing to the social Web: how digital customer communities
build your business. Wiley.

Westbrook, R. A. and Oliver, R. L. (1991). The dimensionality of consumption
emotion patterns and consumer satisfaction. Journal of Consumer Research,
18(1):84–91.

Westerski, A., Iglesias, C. A., and Tapia, F. (2011). Linked opinions: describing
sentiments on the structured Web of Data. In Proceedings of the 4th Interna-
tional Workshop on Social Data on the Web, SDoW’11, Bonn, Germany.

Wiebe, J., Wilson, T., and Cardie, C. (2005). Annotating expressions of opinions
and emotions in language. Language Resources and Evaluation, 39(2-3):165–
210.

Wu, X. and He, Z. (2011). Identifying wish sentence in product reviews. Journal
of Computational Information Systems, 7:1607–1613.

Yergeau, F. (2003). RFC 3629 – UTF-8, a transformation format of ISO 10646.
https://tools.ietf.org/html/rfc3629.

Zhang, W. and Skiena, S. (2009). Improving movie gross prediction through
news analysis. In Proceedings of the 2009 IEEE/WIC/ACM International Joint
Conference on Web Intelligence and Intelligent Agent Technology - Volume 01,
WI-IAT’09, pages 301–304, Washington, DC, USA. IEEE Computer Society.

Zhao, Y. and Karypis, G. (2001). Criterion functions for document clustering:
experiments and analysis. Technical report, Department of Computer Science,
University of Minnesota.
