You Are What Apps You Use: Demographic Prediction Based On User's Apps
[Figure. (a) ROC curve for gender prediction (True Positive Rate vs. False Positive Rate); 'Male' is treated as the positive class. (b) Effect of training set size on gender prediction (x-axis: Number of train users). (c) Effect of user's app count averaged over all demographics (x-axis: Number of apps per user).]
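As a rough illustration of two points made in the dimensionality-reduction discussion in this part of the paper, the sketch below (i) turns a user's installed apps into per-category counts and (ii) compares how many numbers a sparse bag-of-apps matrix and a dense reduced matrix would have to store. The package names, the category mapping, and the user and app counts are hypothetical and are not taken from the paper's dataset; the storage estimate is a simplification that ignores index overhead.

```python
from collections import Counter

# Hypothetical app-to-category mapping; the paper relies on the Google Play
# categorization (48 categories in its dataset), not on these toy values.
APP_CATEGORY = {
    "com.example.chat": "Communication",
    "com.example.mail": "Communication",
    "com.example.runner": "Health & Fitness",
    "com.example.puzzle": "Game",
}
CATEGORIES = sorted(set(APP_CATEGORY.values()))


def category_counts(installed_apps):
    """Return one app count per category, in the fixed order of CATEGORIES."""
    counts = Counter(
        APP_CATEGORY[app] for app in installed_apps if app in APP_CATEGORY
    )
    return [counts[c] for c in CATEGORIES]


def stored_values(n_users, avg_apps_per_user, n_components):
    """Rough number of stored values: sparse bag-of-apps vs. dense reduced matrix.

    Assumes the sparse matrix stores about one value per installed app and the
    reduced matrix stores every entry (an illustrative simplification).
    """
    sparse = n_users * avg_apps_per_user
    dense = n_users * n_components
    return sparse, dense


# Illustrative inputs only.
features = category_counts(
    ["com.example.chat", "com.example.mail", "com.example.puzzle"]
)
sparse, dense = stored_values(n_users=1000, avg_apps_per_user=80, n_components=500)
```

With these made-up numbers the dense reduced matrix holds more values than the sparse original, which is the kind of situation in which reducing the number of dimensions would not reduce memory use.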
The second method, also adopted by (Seneviratne et al. 2015), aggregates the installed apps to category level based on Google Play categorization. In our dataset, there are apps from 48 categories. We take the number of apps in each category as the features, which yields an accuracy of 74.6 %.

The third method employs the Truncated Singular Value Decomposition (TSVD). (Hu et al. 2007) also employ TSVD, but instead of using the SVD components directly as features for predicting the demographics of web users, they adopt a recommender system approach. Setting the number of dimensions to 48, we obtain a gender prediction accuracy of 76.9 %. This shows that rather than using the Google Play categories of the apps, it is better to use the same number of SVD components learned in an unsupervised manner. However, the performance is clearly worse compared to not using any dimensionality reduction, and even by increasing the number of SVD components, we were unable to exceed the performance of the logistic regression with all features, although with 500 components, the accuracy is already 81.8 %.

In conclusion, none of the explored dimensionality reduction methods helped us to improve the gender prediction accuracy. We should also note that although TSVD can help to reduce the data dimensionality to about one-tenth of the original without losing much in accuracy, this still does not necessarily help with the space complexity of the method. The reason is that unlike the SVD components, the original bag-of-apps features are very sparse and the logistic regression implementation we use⁵ supports sparse matrices.

⁵ modules/generated/sklearn.linear_model.LogisticRegression.html

Related Work

Characterizing the demographics of Twitter users has been studied by (Mislove et al. 2011) who infer geography, gender, and race of the users based on self-reported locations and the names of the users. They find large deviances from the demographic distribution of the overall population. (Duggan et al. 2015) provide a more extensive demographic comparison of five social media platforms based on telephone interviews. (Goel, Hofman, and Sirer 2012) look into the demographics and behavior of web users, whereas (Weber and Jaimes 2011) study the same for search-engine users.

The demographic prediction based on user’s apps has been previously studied by (Seneviratne et al. 2015) who predict the users’ gender. In their previous work (Seneviratne et al. 2014), they also predict language, country, relationship status, and whether the user is a parent, but instead of predicting these attributes directly, they first predict which apps are associated with the attributes and then check whether a user has apps corresponding to a given demographic attribute. We extend these works by studying new demographics (age, race, and income), showing that increasing the training dataset size drastically improves the prediction accuracy, and comparing various dimensionality reduction methods for the app data.

Acknowledgments

We would like to thank Verto Analytics for providing the dataset, Timo Smura for useful discussions, and Janaki Koirala for conducting some of the initial experiments.

References
