You Are What Apps You Use: Demographic Prediction Based On User's Apps
[Figure. (a) ROC curve for gender prediction (x-axis: False Positive Rate; y-axis: True Positive Rate); ’Male’ is treated as the positive class. (b) Effect of training set size on gender prediction (x-axis: number of train users). (c) Effect of user’s app count averaged over all demographics (x-axis: number of apps per user).]
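As a reading aid for panel (a), the following is a minimal sketch of how an ROC curve and its area can be derived from classifier scores when ’male’ is the positive class. The paper does not describe how its curve was computed, so the function names (`roc_curve_points`, `auc`) and the eight toy users are hypothetical illustrations, not the authors' code or data.

```python
from typing import List, Sequence, Tuple


def roc_curve_points(
    labels: Sequence[str], scores: Sequence[float], positive: str = "male"
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points, one per distinct score."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Rank users from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last user in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores P(male) for eight users (illustrative data only).
labels = ["male", "male", "female", "male", "female", "male", "female", "female"]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

points = roc_curve_points(labels, scores)
print(points)
print(auc(points))  # 0.8125 for this toy data
```

In practice a library routine such as scikit-learn's `roc_curve` would typically be used instead; the explicit loop is shown only to make the construction of the curve visible.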
The second method, also adopted by (Seneviratne et al. 2015), aggregates the installed apps to category level based on Google Play categorization. In our dataset, there are apps from 48 categories. We take the number of apps in each category as the features, which yields an accuracy of 74.6 %.

The third method employs the Truncated Singular Value Decomposition (TSVD). (Hu et al. 2007) also employ TSVD, but instead of using the SVD components directly as features for predicting the demographics of web users, they adopt a recommender system approach. Setting the number of dimensions to 48, we obtain a gender prediction accuracy of 76.9 %. This shows that rather than using the Google Play categories of the apps, it is better to use the same number of SVD components learned in an unsupervised manner. However, the performance is clearly worse than without any dimensionality reduction, and even by increasing the number of SVD components, we were unable to exceed the performance of the logistic regression with all features, although with 500 components, the accuracy is already 81.8 %.

In conclusion, none of the explored dimensionality reduction methods helped us to improve the gender prediction accuracy. We should also note that although TSVD can help to reduce the data dimensionality to about one-tenth of the original without losing much in accuracy, this still does not necessarily help with the space complexity of the method. The reason is that unlike the SVD components, the original bag-of-apps features are very sparse and the logistic regression implementation we use⁵ supports sparse matrices.

⁵ http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Related Work

Characterizing the demographics of Twitter users has been studied by (Mislove et al. 2011) who infer geography, gender, and race of the users based on self-reported locations and the names of the users. They find large deviations from the demographic distribution of the overall population. (Duggan et al. 2015) provide a more extensive demographic comparison of five social media platforms based on telephone interviews. (Goel, Hofman, and Sirer 2012) look into the demographics and behavior of web users, whereas (Weber and Jaimes 2011) study the same for search-engine users.

Demographic prediction based on users’ apps has previously been studied by (Seneviratne et al. 2015) who predict the users’ gender. In their earlier work (Seneviratne et al. 2014), they also predict language, country, relationship status, and whether the user is a parent, but instead of predicting these attributes directly, they first predict which apps are associated with the attributes and then check whether a user has apps corresponding to a given demographic attribute. We extend these works by studying new demographics (age, race, and income), showing that increasing the training dataset size drastically improves the prediction accuracy, and comparing various dimensionality reduction methods for the app data.

Others have studied demographic prediction, e.g., based on website visits (Hu et al. 2007; Goel, Hofman, and Sirer 2012), social network features (Brea et al. 2014; Al Zamal, Liu, and Ruths 2012), call patterns (Sarraute, Blanc, and Burroni 2014), Twitter followers (Culotta, Ravi, and Cutler 2015) and profiles (Chen et al. 2015), and location data (Riederer et al. 2015). Related to demographic prediction, (Chittaranjan, Blom, and Gatica-Perez 2013) investigate the predictability of personality traits based on apps and other smartphone usage features. There was also an app for predicting personality based on installed apps⁶.

⁶ http://www.idigitaltimes.com/what-do-your-apps-say-about-you-new-app-iphone-here-tell-you-410883

Conclusions and Discussion

We studied the demographic prediction problem based on the list of used apps. Large differences in the predictability were observed between the six demographic attributes studied in this work, gender being the most predictable and income being the hardest to predict. The apps contributing the most to the predictions were identified for each attribute, revealing some expected patterns: dating apps are used, although not exclusively, by single people, and high-income people are more likely to use LinkedIn, whereas lower-income people prefer an app called Job Search.

We also studied various dimensionality reduction methods for high-dimensional app data (8 840 unique apps), finding that SVD yields superior results compared to aggregating the apps on app category level, but the best results are obtained simply by using the raw list of apps. Finally, we looked into the effect of the training set size and the number of apps on the predictability and showed that both of these factors can have an impact of over 10 % on the prediction accuracy. Interestingly, the predictability increases the more apps the user has used, but after 100 apps, the prediction accuracy starts to decrease. The accuracy drop from users with 50-150 apps to users with more than 150 apps was found to be statistically significant.

Several interesting questions are left for future work. First, we note that demographic attributes are most likely not independent, and therefore, predicting the attributes simultaneously, employing multi-label prediction techniques, could improve the performance. Second, we plan to study the demographics of various popular apps to understand potential biases in their userbases compared to the whole population. Third, it would be interesting to study the usage patterns of different demographic groups (as done previously in the context of web search (Weber and Castillo 2010)) to better understand the effects of demographic biases.

Acknowledgments

We would like to thank Verto Analytics for providing the dataset, Timo Smura for useful discussions, and Janaki Koirala for conducting some of the initial experiments.

References

Al Zamal, F.; Liu, W.; and Ruths, D. 2012. Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In Proc. ICWSM.

Brea, J.; Burroni, J.; Minnoni, M.; and Sarraute, C. 2014. Harnessing mobile phone social network topology to infer users demographic attributes. In Proc. SNA-KDD.

Chen, X.; Wang, Y.; Agichtein, E.; and Wang, F. 2015. A comparative study of demographic attribute inference in Twitter. In Proc. ICWSM.

Chittaranjan, G.; Blom, J.; and Gatica-Perez, D. 2013. Mining large-scale smartphone data for personality studies. Personal and Ubiquitous Computing 17(3):433–450.

Culotta, A.; Ravi, N. K.; and Cutler, J. 2015. Predicting the demographics of Twitter users from website traffic data. In Proc. ICWSM.

Duggan, M.; Ellison, N. B.; Lampe, C.; Lenhart, A.; and Madden, M. 2015. Social media update 2014. http://www.pewinternet.org/2015/01/09/social-media-update-2014/.

Goel, S.; Hofman, J. M.; and Sirer, M. I. 2012. Who does what on the web: A large-scale study of browsing behavior. In Proc. ICWSM.

Hu, J.; Zeng, H.-J.; Li, H.; Niu, C.; and Chen, Z. 2007. Demographic prediction based on user’s browsing behavior. In Proc. WWW.

Mislove, A.; Lehmann, S.; Ahn, Y.-Y.; Onnela, J.-P.; and Rosenquist, J. N. 2011. Understanding the demographics of Twitter users. In Proc. ICWSM.

Riederer, C.; Zimmeck, S.; Phanord, C.; Chaintreau, A.; and Bellovin, S. M. 2015. “I don’t have a photograph, but you can have my footprints” — revealing the demographics of location data. In Proc. ICWSM.

Sarraute, C.; Blanc, P.; and Burroni, J. 2014. A study of age and gender seen through mobile phone usage patterns in Mexico. In Proc. ASONAM, 836–843. IEEE.

Seneviratne, S.; Seneviratne, A.; Mohapatra, P.; and Mahanti, A. 2014. Predicting user traits from a snapshot of apps installed on a smartphone. ACM SIGMOBILE Mobile Computing and Communications Review 18(2):1–8.

Seneviratne, S.; Seneviratne, A.; Mohapatra, P.; and Mahanti, A. 2015. Your installed apps reveal your gender and more! ACM SIGMOBILE Mobile Computing and Communications Review 18(3):55–61.

Weber, I., and Castillo, C. 2010. The demographics of web search. In Proc. SIGIR.

Weber, I., and Jaimes, A. 2011. Who uses web search for what: and how. In Proc. WSDM, 15–24.