Welfare Effects of Personalized Rankings
Abstract
Many online retailers offer personalized recommendations that help consumers make their choices. While standard recommendation algorithms are designed to guide consumers to the most relevant items, retailers may have strong incentives to deviate from these standard algorithms and instead steer consumer search toward the most profitable options. In this paper, we ask whether such profit-driven distortions arise in practice and study to what extent recommender systems benefit consumers. Using data from a large-scale randomized experiment in which an online retailer introduced personalized rankings, we show that personalized rankings lead to more active consumer search and induce users to buy a greater variety of items relative to uniform bestseller-based rankings. To study whether these changes benefit users, we estimate a search model and use it to measure consumer surplus under alternative ranking algorithms. Our model captures flexible taste heterogeneity by combining a consumer search model from marketing with a latent factorization approach from computer science. Using this model, we estimate heterogeneous tastes from individual data on both search histories and personalized rankings, while explicitly recognizing that rankings have a direct causal effect on search behavior. We show that, although the current personalization algorithm does put a positive weight on maximizing profitability, personalized rankings still benefit both users and the retailer. Based on this case study, we argue that online retailers generally have incentives to adopt consumer-centric personalization algorithms, which help them balance the extraction of short-term profits against long-term growth.
* This paper was previously circulated under the title “The Long Tail Effect of Personalized Rankings.” We thank
a group of researchers whose insightful comments and suggestions have greatly improved the paper: Eric Anderson,
Bart Bronnenberg, Jean-Pierre Dubé, Brett Gordon, Rafael Greminger, George Gui, Wes Hartmann, Ella Honka,
Yufeng Huang, Malika Korganbekova, Sanjog Misra, Harikesh Nair, Sridhar Narayanan, Navdeep Sahni, Stephan
Seiler, Anna Tuchman, and Caio Waisman, as well as seminar participants at Stanford University, Kellogg School of
Management, Chicago Booth, Virtual Quant Marketing Seminar, European Quant Marketing Seminar, and Consumer
Search Digital Seminar. We also thank James Ryan for his excellent research assistance. We are greatly indebted to
Vinny DeGenova, Malika Korganbekova, Evan Magnusson, Sunanda Parthasarathy, and Cole Zuber for helping us
access Wayfair data. All errors are our own.
Table 1: The effect of personalized rankings on users’ search and purchase behavior and on
the retailer’s revenues. Since users are randomly assigned to these two groups, the difference between
the values in columns 1 and 2 estimates the causal effect of personalized rankings. The last two columns
report t-statistics and p-values from the corresponding two-tailed tests of mean differences. ¹ To preserve data confidentiality, the last three rows report revenues and prices only relative to their values in the non-personalized sample.
< 0.0001), search more than twice as many items on average (1.92 vs. 0.77, p-value < 0.0001), and
are 9% more likely to make a purchase (with purchase rates 2.54% vs. 2.33%, p-value < 0.0001).
These results suggest that personalized rankings are successful at displaying highly relevant items,
which encourages users to search more and makes them more likely to find a good match. Table 1
also shows that personalization increases the expected revenue per user by 4.5% (p-value 0.0997).
At the same time, we do not find any strong evidence that personalized rankings are nudging users
toward expensive items. If anything, users in the personalized group purchase items with slightly
lower prices, although the observed decrease of 4% is economically small. Therefore, the observed
change in revenues is mostly driven by higher transaction volume rather than higher prices paid by
the users.
We also find that personalization redistributes demand from bestsellers to previously unpopular
items. Figure 2 shows the purchase shares of different items with and without personalization.
The figure orders items by their purchase frequency among non-personalized users, with the most
popular items located on the left. Personalized rankings reduce the popularity of bestseller items
but increase the demand for relatively unpopular items. As personalization boosts the demand for
many items outside the top 80-100 bestsellers, the distribution of shares gains a thicker right tail.
In other words, personalized rankings generate a long tail effect in consumption by substantially
increasing the variety of purchased items.
When put together, these results support the idea that personalization allows users to discover
appealing niche items that would otherwise be hard to find (Anderson, 2006). By niche items, we mean those for which different users have substantially different utilities; such items are loved by some users and hated by others.9 Several pieces of evidence suggest that Wayfair’s
personalized rankings indeed guide users toward desirable niche items. First, we find that person-
alization not only increases the purchase rate but also makes users buy a larger variety of items. This finding
suggests that each user is now discovering items that match their unique tastes better but that are
not necessarily appealing to others. Second, we also find that personalized rankings increase sales
of beds with unusual designs and uncommon combinations of attributes. Several bed styles become
substantially less popular under personalization, including Beachy, Modern Rustic, Traditional,
American Traditional, and Ornate Traditional. The examples in Figure 7 (top panel) of Appendix
B show that beds of these styles are largely standard beds in neutral colors that are made of wood
and use common upholstery materials. By contrast, styles that become substantially more popular
under personalization include Slick and Chic Modern, Glam, Modern Contemporary, and Bold and
Eclectic Modern. Figure 7 (bottom panel) shows that beds of these styles have unusual shapes,
provocative colors, and highly original designs. Overall, these contrasting examples suggest that
9 Johnson and Myatt (2006) formalize the idea of niche items using the concept of demand rotations: an item j is more “niche” than item k if the demand for item j represents a clockwise rotation of item k’s demand curve. In other
words, an item is more “niche” when consumers are more divided in their preferences for this item. Bar-Isaac et al.
(2012) use this formal definition of niche items in their theoretical search model.
Table 2: Stylized example. In this example, the retailer finds it optimal to steer users toward profitable
but low-utility niche items (i.e., items B and C) instead of a more uniformly appealing but less profitable item
A, which reduces consumer surplus. *The expected profit in Panel C is defined as the purchase probability
from Panel B multiplied by the per-unit profit πj from Panel A.
the personalized ranking algorithm successfully identifies users with relatively unusual tastes and
guides them toward appealing niche items.
$u_{ij} = -\alpha_i p_{j,t(i)} + x_j'\beta_i + \xi_j'\theta_i + \varepsilon_{ij} \qquad (1)$

where $p_{j,t(i)}$ is the price of item j on the day t(i) of user i's visit, $x_j$ is a vector of $K_O$ observed item attributes, $\xi_j$ is a vector of $K_L$ latent attributes, and $\alpha_i$, $\beta_i$, and $\theta_i$ are user i's price sensitivity and
attribute preferences. This model mimics the information environment in which Wayfair users make
choices. We assume that for items in the set Ji , the user observes prices pj,t(i) and item attributes
xj and ξj without searching, which is the case in the category of beds where users observe prices
and basic attributes of beds on the category page. Our dataset includes information on bed size,
material, and style, the attributes we include in the vector xj . At the same time, users observe item
images from which they can infer other bed attributes (e.g., material quality, design originality, and
perceived comfort), which we capture by introducing latent attributes ξj . Since attributes ξj are
observed to the users but not the researcher, we will need to estimate them together with other
parameters.
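To make this structure concrete, the following minimal sketch (illustrative Python with placeholder dimensions and distributions, not the code or estimates used in the paper) computes pre-search utilities that combine observed and latent attributes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K_O, K_L = 1000, 243, 20, 5   # users, items, observed (K_O) and latent (K_L) attribute counts

prices = rng.uniform(200, 1500, size=J)              # p_{j,t(i)}, held fixed here for simplicity
x = rng.integers(0, 2, size=(J, K_O)).astype(float)  # observed attributes x_j (binary dummies)
xi = rng.normal(0, 1, size=(J, K_L))                 # latent attributes xi_j (estimated in practice)

alpha = rng.lognormal(-6, 0.3, size=N)               # price sensitivities alpha_i > 0
beta = rng.normal(0, 1, size=(N, K_O))               # tastes for observed attributes beta_i
theta = rng.normal(0, 1, size=(N, K_L))              # tastes for latent attributes theta_i

# Pre-search utility delta_ij = -alpha_i * p_j + x_j' beta_i + xi_j' theta_i
# (equation (1) without the match value eps_ij)
delta = -np.outer(alpha, prices) + beta @ x.T + theta @ xi.T   # N x J matrix
```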
The term ε_ij is the match quality of item j for user i, which captures additional information that the user acquires upon closer inspection of the product page (e.g., information that is only revealed on that page). We assume the match values ε_ij are distributed i.i.d. type I extreme value, which delivers the closed-form expressions in equations (2) and (4) below.
Consumer search. Users search according to the simultaneous search model, similar to that
in De Los Santos et al. (2012) and Honka (2014). They observe the pre-search utilities of items
$\delta_{ij} = -\alpha_i p_{j,t(i)} + x_j'\beta_i + \xi_j'\theta_i$ but have to search in order to learn the match values ε_ij. We assume
that users must visit the product page of item j in order to learn the match value εij , and that
this value is completely revealed to them upon opening that page. The user chooses how many and
which items to search, thus committing to a specific search set S ⊆ Ji . For each item included in
the set S, the user has to pay a search cost cij which we parameterize below. Once the search set
S is selected, the user examines all items in this set, learns their exact utilities uij , and chooses the
highest-utility item.
Given the assumed distribution of match values εij , the expected net benefit to user i of searching
all items in the set S, denoted by miS , is the difference between the expected maximum utility of
items in this set and the total cost of searching these products:
$m_{iS} = \mathbb{E}\max_{j \in S}\{u_{ij}\} - \sum_{j \in S} c_{ij} = \log\Big(\sum_{j \in S} \exp(\delta_{ij})\Big) - \sum_{j \in S} c_{ij} \qquad (2)$
where the outside option is always included in S and has zero pre-search utility and zero search cost
(δi0 = 0, ci0 = 0). It is optimal for user i to choose the search set S that maximizes the expected
net benefit miS . Following De Los Santos et al. (2012), we smooth the choice set probabilities by
adding a mean-zero stochastic noise term ηiS to the expected benefit miS of each potential search
set. The stochastic term ηiS might be interpreted as reflecting errors in an individual’s assessment
of the net expected gain of searching the set S. As we will demonstrate below, smoothing choice
probabilities significantly simplifies estimation.11 In practice, we do not lose much from adding an
idiosyncratic error term; at the same time, we gain substantial robustness to measurement error
in observed searches.
Assuming η_iS is distributed i.i.d. type I extreme value with scale parameter σ_η, the probability that user i finds it optimal to search the set S, conditional on observing the ranking R, is:

$P_{iS|R} = \frac{\exp(m_{iS}/\sigma_\eta)}{\sum_{S' \subseteq J_i} \exp(m_{iS'}/\sigma_\eta)} \qquad (3)$
where J_i is the set of items the user observes under the ranking R. Note that the summation in the denominator runs over all subsets of J_i, so the ranking affects search behavior both through the composition of J_i and through the position-dependent search costs c_ij.
11 Alternatively, we could add a stochastic term directly to the pre-search utility δ_ij (Ursu, 2018; Honka and Chintagunta, 2015). However, the estimation procedure would then require computing search likelihoods by simulating the search behavior of artificial users, which is computationally expensive. We would also need to artificially smooth the resulting frequency estimators, which might generate bias of an unknown sign and magnitude (Chung et al., 2018).
Conditional on choosing the search set S, user i purchases the item with the highest realized utility, so the probability of purchasing item j is:

$P_{ij|S} = \frac{\exp(\delta_{ij})}{\sum_{k \in S} \exp(\delta_{ik})} \quad \text{for } j \in S \qquad (4)$
The purchase probability is zero for items outside the set S, because we assume the user has to
search an item before buying it.
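To fix ideas, here is a minimal brute-force sketch of the net benefit in (2) and the smoothed search set probabilities in (3) (a toy Python illustration with made-up numbers, not our estimation code, which instead relies on the bound in (11)-(12) below):

```python
import itertools
import numpy as np

def net_benefit(delta_i, c_i, S):
    """Expected net benefit m_iS of searching set S (equation (2)); the outside
    option with zero pre-search utility and zero search cost is always included."""
    utils = np.concatenate(([0.0], delta_i[list(S)]))
    return np.log(np.exp(utils).sum()) - c_i[list(S)].sum()

def search_set_probs(delta_i, c_i, sigma_eta=1.0):
    """Smoothed search set probabilities (equation (3)) by enumerating all subsets.
    Feasible only for a handful of items; with 48 displayed items the summation
    must be bounded and subsampled as in equations (11)-(12)."""
    n = len(delta_i)
    sets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
    m = np.array([net_benefit(delta_i, c_i, S) for S in sets])
    w = np.exp((m - m.max()) / sigma_eta)   # stabilized softmax over sets
    return sets, w / w.sum()

# toy example: 4 items on the page, equal search costs
rng = np.random.default_rng(1)
delta_i, c_i = rng.normal(0, 1, 4), np.full(4, 0.3)
sets, probs = search_set_probs(delta_i, c_i)
```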
The outside option plays an important role, as it enables us to model users who either search
but do not buy anything, or decide not to search at all after seeing item rankings. We could, in
principle, remove data on users without purchases and write a simpler model in which users must
always make a purchase. However, since less than 10% of users make purchases, doing so would
imply discarding over 90% of the data on searches and rankings that are highly informative about
taste heterogeneity. An advantage of our model, therefore, is that it enables us to use more of the
available data in estimation. It also allows us to understand whether personalized rankings can
persuade some users to make a purchase by showing them highly relevant items.
Following Ursu (2018), we model position effects by assuming that the search cost cij depends on
the position in which item j is displayed to user i. We assume that the item in the first position has
a search cost c, which we interpret as a baseline search cost. The items in the subsequent positions have search costs c + p_2, c + p_3, . . . , c + p_48, where we interpret p_r as the position effect for position r (for r = 2, . . . , 48). In theory, we could estimate all position effects p_2, . . . , p_48 nonparametrically, thus recovering a flexible position effect curve. In practice, we simplify the problem by estimating only the position effects p_2, . . . , p_10 for the first ten positions, and we separately estimate p_11, which captures the average position effect for the remaining positions 11 to 48. Estimating more flexible search cost functions makes little difference for our qualitative results, mostly because the vast majority of
searches and purchases land on the first ten positions. Since items closer to the bottom of the page
are more difficult to locate and search, we expect them to have higher estimated position effects pr
(and therefore higher search costs cij ). Lastly, we assume that rankings affect choices only indirectly
via search costs but not directly by influencing utilities uij . This assumption follows the results in
Ursu (2018) who shows in a randomized experiment that rankings shift search probabilities but do
not directly affect purchase probabilities conditional on search.
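As an illustration, a small sketch of this search cost parameterization (the numbers below are placeholders, not our estimates):

```python
import numpy as np

def search_costs(c, p, positions):
    """Search cost c_ij as a function of display position: c for position 1,
    c + p[r] for positions 2-10, and a pooled effect p[11] for positions 11-48."""
    costs = np.empty(len(positions))
    for i, r in enumerate(positions):
        if r == 1:
            costs[i] = c                   # baseline search cost
        elif r <= 10:
            costs[i] = c + p[r]            # position-specific effects p_2..p_10
        else:
            costs[i] = c + p[11]           # average effect for positions 11-48
    return costs

# illustrative numbers only
p = {r: 0.05 * (r - 1) for r in range(2, 11)}
p[11] = 0.6
costs = search_costs(c=0.3, p=p, positions=range(1, 49))
```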
Latent Factor Approach. Our empirical framework combines a simultaneous search model from
economics and marketing with the latent factorization idea from computer science (Koren et al.,
2009). In this model, users have heterogeneous preferences for both observed item attributes xj and
multiple latent attributes ξ_j. It helps to think about this combination in the context of the discrete choice literature, where tastes over a small number of unobserved attributes can generate flexible substitution patterns across items.12,13

Non-personalized rankings
We also model how the retailer selects rankings for each user, both in the personalized and non-
personalized groups. Our main goal here is to capture the main aspects of both ranking algorithms.
Although we do not know the exact algorithms the company uses, from our private conversations
with Wayfair we know which variables serve as key inputs to each of them. Our general strategy is
to specify a family of algorithms with that structure and estimate the unknown parameters directly
from the observed rankings.
We model non-personalized rankings as follows. To construct rankings for user i, the retailer
sorts items in the order of decreasing indices vij , showing only R̄ items with the highest index
values. This process generates the effective choice set Ji of user i. Since in our application more
than 90% of users only look at the first page, we simplify the analysis by assuming that the user
does not see other items outside of the set Ji . We then set R̄ = 48 which corresponds to the number
of items Wayfair shows on the first page of the category list. The index v_ij is given by:

$v_{ij} = \tilde{w}\big(-\bar{\alpha}\, p_{j,t(i)} + x_j'\bar{\beta} + \xi_j'\bar{\theta}\big) + \tilde{\gamma}' z_{j,t(i)} + \mu^h_{ij} \qquad (5)$
12 There is also a smaller body of papers estimating demand models in which tastes depend on multiple unobserved attributes (Elrod and Keane, 1995; Goettler and Shachar, 2001; Keane, 2004; Athey and Imbens, 2007).
13 In a parallel effort, Armona et al. (2021) also estimate a simultaneous search model with latent attributes. They
do not incorporate ranking effects into their estimation and rely on a different inequality-based estimation technique.
They also use the estimated model to predict competition and analyze mergers, thus focusing on a completely different
research question.
where ᾱ, β̄, θ̄ are users' mean tastes, w̃ is the weight that the retailer puts on the mean utility, t(i) is the day of user i's category visit, z_{j,t(i)} is the M × 1 vector of time-varying item-specific characteristics, γ̃ is the M × 1 vector of regression coefficients, and µ^h_{ij} is a stochastic term distributed i.i.d. type I extreme value with scale parameter σ_{µ^h}. The vector z_{j,t(i)} in our application includes
indicators for new and sponsored items, which accounts for periods when Wayfair promotes such
items at the top of the page. The stochastic term µhij captures exogenous variation that the
algorithm is adding to the non-personalized rankings.
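A minimal sketch of this ranking rule (illustrative Python with placeholder weights, not the retailer's actual algorithm), sorting items by noisy indices and keeping the top R̄ = 48:

```python
import numpy as np

def nonpersonalized_ranking(mean_delta, z, w, gamma, R_bar=48, rng=None):
    """Sort items by the index v_j = w * mean_delta_j + gamma' z_j + Gumbel noise
    (equation (5)) and return the top R_bar item ids in display order."""
    rng = rng or np.random.default_rng()
    noise = rng.gumbel(size=len(mean_delta))    # type I extreme value term mu^h
    v = w * mean_delta + z @ gamma + noise
    return np.argsort(-v)[:R_bar]               # highest-index items first

# illustrative call: 243 items, 2 time-varying characteristics (new, sponsored)
rng = np.random.default_rng(2)
mean_delta = rng.normal(0, 1, 243)
z = rng.integers(0, 2, size=(243, 2)).astype(float)
ranking = nonpersonalized_ranking(mean_delta, z, w=1.0, gamma=np.array([0.5, 0.8]), rng=rng)
```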
To estimate position effects, we use exogenous variation in rankings generated by the non-
personalized algorithm. This algorithm computes item popularity indices that capture how fre-
quently users searched and purchased a given item in the past few months. Assuming that user
tastes remain stable over time, we can capture the historical popularity of items in the model (5)
using the average tastes λ̄.14 The algorithm then adjusts popularity indices v_ij for new and sponsored items (captured by the γ′z_{j,t(i)} term), and it adds some random noise before showing rankings
to users. Therefore, the algorithm infuses non-personalized rankings with exogenous variation. Our
general strategy is to isolate such variation using the stochastic term µhij , which allows us to observe
how the same item moves across positions for reasons unrelated to unobserved demand shocks. This
variation enables us to estimate position effects by measuring to what extent shifting an item to a
higher position increases its search and purchase probabilities.
Given the extreme value distribution of the stochastic terms µ^h_{ij}, the probability that the retailer chooses the ranking R has a closed-form solution given by the so-called “exploded logit” formula (Punj and Staelin, 1978). Since the scale parameter σ_{µ^h} is not separately identified from the coefficients w̃ and γ̃, we divide all terms in equation (5) by σ_{µ^h} so that the stochastic term µ^h_{ij} has a variance of one, and w = w̃/σ_{µ^h} and γ = γ̃/σ_{µ^h} are the normalized coefficients we will recover during estimation. Without loss of generality, assume that the retailer shows item j = 1 in the first position, item j = 2 in the second position, and so on until position R̄, which is filled with item j = R̄. Then the probability that the retailer finds ranking R optimal is given by:

$P_{iR} = \prod_{r=1}^{\bar{R}} \frac{\exp\big(w\,\bar{u}_{ir} + \gamma' z_{r,t(i)}\big)}{\sum_{J \geq k \geq r} \exp\big(w\,\bar{u}_{ik} + \gamma' z_{k,t(i)}\big)} \qquad (6)$

where ū_{ir} is the mean utility of the item in position r from equation (5). The product in (6) is taken across all R̄ positions displayed to the user. The first term of this product is the probability that item j = 1 has the highest index among all J items, the second term is the probability that item j = 2 has the highest index among the remaining J − 1 items, and so on.
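For concreteness, a toy sketch of the exploded logit likelihood in (6) (illustrative Python; the scores and item ids are made up):

```python
import numpy as np

def exploded_logit_loglik(score, order, all_items):
    """Log-likelihood of an observed ranking under the exploded logit in (6).
    score[j] collapses w*u_bar[j] + gamma'z[j] into one number per item; 'order'
    lists the displayed items from position 1 downward. Each position contributes
    a softmax over the items not yet placed."""
    remaining = set(all_items)
    ll = 0.0
    for j in order:
        s = np.array([score[k] for k in remaining])
        m = s.max()
        ll += score[j] - (m + np.log(np.exp(s - m).sum()))  # stable log-softmax term
        remaining.remove(j)
    return ll

# toy example: 5 items, 3 displayed positions
score = {j: v for j, v in enumerate([1.2, 0.4, -0.3, 0.9, 0.0])}
ll = exploded_logit_loglik(score, order=[0, 3, 1], all_items=range(5))
```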
14 Ideally, we would have data on the exact inputs that went into the non-personalized ranking algorithm on each day (e.g., the exact historical popularity indices of items used by the algorithm). We were, unfortunately, unable to obtain such data. As a robustness check, we experimented with estimating our model on October-November data while approximating historical popularity using purchases made by users in September, and we arrived at similar qualitative results.
Personalized rankings
We also model how the retailer selects personalized rankings. In doing so, we pursue two objectives.
First, to reverse engineer the existing personalization algorithm, we need to specify a class of possible
algorithms and estimate the unknown parameters from the data. Second, we use this model to
extract information about heterogeneous tastes from the personalized rankings shown to different
users. Our model recognizes that rankings partly reveal information the retailer has acquired
about each user’s tastes from their browsing histories. Therefore, ranking data potentially contains
rich information about taste heterogeneity. Our approach contributes to the prior literature by
incorporating these data into estimation, while at the same time recognizing that the retailer may
modify personalized rankings based on criteria other than utility maximization (e.g., profitability).
This aspect of the model distinguishes our approach from the prior work, which primarily recovered
taste heterogeneity using panel data or repeat purchases (Rossi et al., 1996, 2012) or individual
search data (Kim et al., 2010).
To model personalized rankings, as before we assume that the retailer sorts items in the order
of decreasing indices vij and shows only R̄ = 48 highest-index items to user i. The indices vij are
defined as:

$v_{ij} = \tilde{w}_u\,\delta_{ij} + \tilde{w}_\pi\,\pi_{ij} + \mu^p_{ij} \qquad (7)$
where $\delta_{ij} = -\alpha_i p_{j,t(i)} + x_j'\beta_i + \xi_j'\theta_i$ is the pre-search utility defined in (1), π_ij is the variable measuring the profitability of selling item j to user i, w̃_u and w̃_π are the weights the retailer puts on the two objectives (maximizing utility and profitability), and µ^p_{ij} is a stochastic term distributed i.i.d. type I extreme value with scale parameter σ_{µ^p}. To measure profitability π_ij, we would ideally use data on item-specific margins (i.e., average price net of marginal cost). Since we do not have access to such data, we measure profitability using the item's current price, p_{ij}.15
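A small sketch of this personalized ranking rule (again illustrative Python with placeholder weights, not Wayfair's algorithm):

```python
import numpy as np

def personalized_ranking(delta_i, profit_i, w_u, w_pi, R_bar=48, rng=None):
    """Rank items for one user by v_ij = w_u * delta_ij + w_pi * pi_ij + noise
    (equation (7)); the noise reflects the retailer's imperfect knowledge of tastes."""
    rng = rng or np.random.default_rng()
    v = w_u * delta_i + w_pi * profit_i + rng.gumbel(size=len(delta_i))
    return np.argsort(-v)[:R_bar]

# placeholder inputs: one user's pre-search utilities, prices as the profit proxy
rng = np.random.default_rng(3)
delta_i = rng.normal(0, 1, 243)
prices = rng.uniform(200, 1500, 243)
ranking = personalized_ranking(delta_i, prices / prices.std(), w_u=1.0, w_pi=0.2, rng=rng)
```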
The term µpij captures the fact that the retailer does not perfectly observe the tastes δij and
personalizes rankings based on a noisy estimate. As discussed in Section 2.2, the amount of per-
sonalization each user gets will generally depend on how much Wayfair knows about them. Users
with richer browsing histories will have their item lists more personalized than those who rarely
visit the website. From this perspective, it would make sense to assume that the variance σµp is
lower for users with longer histories. A limitation of our data, however, is that we do not observe
users’ histories prior to our observation period. We therefore simplify the model by assuming the
same variance σ_{µ^p} for all users. In this sense, our model of rankings captures the average amount of personalization across users.
15 We also attempted to impute item-specific marginal costs mc_j from the inverted demand system and used these
cost estimates as an alternative measure of profitability. This alternative approach generated qualitatively similar
results, although we did obtain a somewhat lower estimated utility weight wu .
The probability that the retailer shows ranking R to user i then takes the same exploded logit form as in (6):

$P_{iR} = \prod_{r=1}^{\bar{R}} \frac{\exp\big(w_u\,\delta_{ir} + w_\pi\,\pi_{ir}\big)}{\sum_{J \geq k \geq r} \exp\big(w_u\,\delta_{ik} + w_\pi\,\pi_{ik}\big)} \qquad (8)$

where, as with non-personalized rankings, the product runs across all R̄ positions displayed to the user, and w_u = w̃_u/σ_{µ^p} and w_π = w̃_π/σ_{µ^p} are the normalized weights we recover in estimation.
The ranking model in (7) approximates the actual personalized ranking algorithm of Wayfair.
This model simplifies the actual algorithm in several ways. First, due to data limitations, we con-
sider only two potential objectives of the retailer, maximizing short-term utility δij and short-term
profitability π_ij. While such a model does not capture the retailer's other incentives (e.g., maximizing
long-term profits, advertising revenues), it does allow us to study the trade-off between choosing
utility-centric and profit-centric rankings. Second, we do not consider the problem of choosing
“truly optimal” rankings in which the retailer chooses one ranking out of all possible item or-
derings. Such a model would be impossible to solve because the number of possible rankings is
astronomically large. To circumvent this issue, we instead consider a simplified model in which
the retailer sorts items by simple indices vij , which we parameterize and recover from the data.
This model can be viewed as an approximation to the full optimization problem. By considering
this simplified ranking rule, we obtain a more practical model that is straightforward enough to
estimate and yet sufficiently flexible to help us study the retailer’s incentives.
Despite its apparent simplicity, the ranking model in (7) is capable of capturing a wide range
of personalization strategies. First, note that by changing the weights, the retailer can either show
utility-based rankings (high wu ), or profitability-based rankings (high wπ ), or adopt an interior
solution that puts substantial weights on both objectives. Choosing a non-zero weight wπ is also
not equivalent to moving expensive items to the top of the page for all users. An expensive item will
only be shown at the top positions if it yields a sufficiently high utility δij . For example, the retailer
will only show an expensive item to users who have been revealed to have low price sensitivity or
who have sufficiently strong tastes for this item’s attributes.
The log-likelihood contribution of user i combines the likelihoods of observed rankings, searches, and purchases:

$\log L_i(\lambda_i; \omega) = \log P_i(R_i) + \log P_i(S_i|R_i) + \log P_i(y_i|S_i) \qquad (9)$
where ω is a vector of all unknown parameters, Pi (Ri ) is the likelihood of observing the rankings
Ri , Pi (Si |Ri ) is the likelihood of choosing a search set Si given the rankings shown to user i, and
Pi (yi |Si ) is the likelihood of a purchase yi ∈ Si given the search set Si . We compute the exact values
of log-likelihoods for purchases and rankings, log Pi (yi |Si ) and log Pi (Ri ), using the derived closed
form solutions in (4), (6), and (8). In addition, we approximate the search likelihood, log Pi (Si |Ri ),
as described below. We then estimate parameters ω and types λ_i by maximizing the total log-likelihood across the N users in the dataset, computing it as $\text{LogL}(y, S, R) = (1/N)\sum_{i=1}^{N} \log L_i(\lambda_i; \omega)$.
The main challenge comes from computing the term log P_i(S_i|R_i), which contains a summation over a very large number of possible search sets S′:

$\log P_i(S_i|R_i) = m_{iS}/\sigma_\eta - \log \sum_{S' \subseteq J_i} \exp(m_{iS'}/\sigma_\eta) \qquad (10)$

We bound this expression from below using the fact that a softmax probability is no smaller than the product of the corresponding pairwise sigmoid terms:

$\log P_i(S_i|R_i) \geq \sum_{S' \subseteq J_i} \log \sigma\!\left(\frac{m_{iS} - m_{iS'}}{\sigma_\eta}\right) \qquad (11)$
where σ(x) = 1/(1 + e^{−x}) is the sigmoid function. Therefore, instead of maximizing the log-likelihood
function in (10), we can maximize a lower bound of this function. While replacing the objective
function with its lower bound may introduce bias, in practice we found that the bound is sufficiently
tight, and that we are able to recover the true values of parameters when using it in estimation.
Since applying this bound moves the logarithm function under the summation, we can form
an unbiased estimator of the summation via subsampling. Specifically, we sample potential search
sets S 0 from the universe of all possible search sets on Ji , and we approximate the right-hand side
expression in (11) using the following unbiased estimator:
$\frac{|\mathcal{S}_i|}{|B_i|} \sum_{S' \in B_i} \log \sigma\big((m_{iS} - m_{iS'})/\sigma_\eta\big) \qquad (12)$
where 𝒮_i is the set of all possible search sets user i could form given rankings R_i, and B_i ⊆ 𝒮_i is a sample of (feasible) alternative sets that differ from the set S_i actually selected by user i. In Appendix C, we explain how we select the alternative search sets B_i in actual estimation.
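A compact sketch of this subsampled estimator (toy inputs; the actual sampling of alternative sets follows Appendix C):

```python
import numpy as np

def subsampled_bound(m_S, m_alts, n_total_sets, sigma_eta=1.0):
    """Unbiased subsampling estimator of the lower bound (equations (11)-(12)):
    m_S is the net benefit of the chosen set, m_alts the net benefits of the |B_i|
    sampled alternative sets, n_total_sets the number of feasible sets |S_i|."""
    x = (m_S - np.asarray(m_alts)) / sigma_eta
    log_sigmoid = -np.logaddexp(0.0, -x)   # log sigma(x), numerically stable
    return n_total_sets / len(m_alts) * log_sigmoid.sum()

# toy numbers: chosen set's benefit vs. 5 sampled alternatives out of 1000 feasible sets
approx_ll = subsampled_bound(m_S=1.4, m_alts=[0.2, 0.9, -0.5, 1.1, 0.3], n_total_sets=1000)
```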
Search cost c and scale parameter ση . The baseline search cost c is identified from the
average number of items searched by users. Conditional on the selected search set Si , the purchase
decision depends only on tastes but not on search costs; therefore, we can separate search costs
from taste parameters by using conditional purchase probabilities. Additionally, we identify the
scale parameter ση from the extent to which the model can explain the observed searches using
item prices pjt and attributes xj and ξj . The estimated value of ση then indicates whether the
observed search sets provide useful information for estimating taste heterogeneity.
Tastes αi and βi . We now explain how we identify mean tastes as well as taste heterogeneity
from rankings, searches, and purchases in the personalized sample. We identify the mean tastes
β̄ from two key sets of moments: (a) users’ propensity to search and purchase items with certain
values of attributes xj , and (b) the retailer’s propensity to display items with certain values of
attributes x_j in the top positions of personalized rankings.
Latent attributes ξj and heterogeneity of tastes θi . We now discuss how we identify latent
attributes and associated taste heterogeneity. Consider a pair of beds j and k that are frequently
searched together and often displayed together in the top positions of personalized rankings. These
beds must appeal to users with similar tastes. Suppose, however, that these co-search and co-
ranking patterns do not correlate with any of the item attributes xj we observe in the data. Given
the structure of the utility model, we then infer that beds j and k are located close to each other
in the space of latent attributes ξj . We can then rationalize the data patterns above by assuming
that the two beds have similar values of a certain latent attribute and that a significant share of
users has strong tastes for this attribute. Therefore, we identify latent attributes ξj from co-search
and co-ranking patterns that cannot be explained by the similarity of observed attributes xj .
We emphasize that it is not possible to identify the individual parameters of the latent factorization ξ_j′θ_i. To see this, note that if a latent attribute ξ_j^m were perfectly collinear with some observed attribute x_j^k, it would be impossible to separate the corresponding taste parameters θ_i^m and β_i^k from each other. The latent attributes ξ_j are also interchangeable and therefore suffer from the label switching problem. But while we cannot identify the separate parameters of this factorization, we can identify (a) the sum ξ_j′θ̄, which essentially plays the role of an item-specific residual similar to Berry et al. (1995) fixed effects, and (b) the correlations of utilities across items that the factorization induces.17
17 Consistent with this identification argument, our estimated model predicts the highest correlation of utilities for
those pairs of items that are frequently searched together and are displayed together to the same users. For example,
we predict correlation of utilities 0.4-0.5 for pairs of items that are searched together by at least 100 users in the
data, whereas we estimate the correlation below 0.05 for pairs of items that are never searched together.
18 This is a salient feature of most consumer search datasets. For example, the average user searches 2.3
options when booking hotel rooms (Chen and Yao, 2016), clicks on 1.8-2.4 items when shopping for cosmetics
products (Morozov et al., 2021), visits 1.1-2.0 auto dealerships (Yavorsky and Honka, 2019), requests 2-3 quotes
for auto insurance (Honka, 2014), and examines 2.3 vehicles when comparing used cars (Gardete and Antill, 2019).
Ranking algorithm weights wu and wπ . To identify the weights wu and wπ in the ranking
algorithm (7), we contrast the actual tastes of users with the personalized rankings they are shown.
It helps to think about this estimation problem as recovering two objects: the utility weight wu
and the ratio of weights wπ /wu . We identify the utility weight wu from the extent to which users
see the highest-utility items at the top positions. This weight will generally be high if the retailer
knows users’ tastes well and shows rankings that are highly aligned with these tastes. We note that
the estimated value of the weight wu will also show to what extent personalized rankings contain
relevant information for recovering users’ tastes. Next, to identify the ratio wπ /wu , we examine
whether highly profitable items (i.e., high πij ) are more frequently shown in the top positions than
would be optimal under utility-based rankings with wπ = 0. For example, if the retailer displays
expensive, high-margin beds more often than warranted by the mean price sensitivity ᾱ, the size
of the discrepancy indicates how much weight the retailer puts on profitability relative to utility.
4 Estimation Results
We estimate our model using the Maximum Likelihood approach described in Section 3.3. The
main computational challenge is how to deal with a large number of users N and items J. To ease
the computational burden, we estimate the non-personalized model using a random subsample of N^hold = 50,000 users from the non-personalized group. We then construct an equal-sized personalized subsample of N^pers = 50,000 users by randomly drawing them from the personalized group. We
then use the remaining data from both groups, which are not included in the estimation samples,
for out-of-sample model validation (see Appendix D for details). One practical issue is how to select
items. It would be overly ambitious to attempt estimation of latent factors ξj for all thousands of
items in our data, especially given that most of these items were never purchased in our sample.
To reduce the product assortment to a more manageable size, we keep only items that have market
shares above 0.1%, which leaves us with J = 243 unique items.
Table 3: Selected parameter estimates from the model of non-personalized rankings. We obtain these estimates using the Maximum Likelihood estimation procedure developed in Section 3.3. The sample used in this estimation includes N = 50,000 users randomly selected from the non-personalized group. We observe these users choosing among J = 243 unique items. The reported LogL value corresponds to the average log-likelihood value at the optimum, where the average is taken across the N users in the sample.
panel, dashed line) also shows the predicted search probabilities when we fix search costs for all
positions at the average level of $25. The comparison reveals that shutting down position effects
generates a more even distribution of searches across positions and reduces the search rates of items
at the top five positions by a factor of 2-3. Note, however, that higher positions still attract more
searches, reflecting the fact that these positions are generally filled with more appealing items.
The strong estimated position effects are consistent with the results of our randomized experiment,
which reveal that personalized rankings substantially affect users’ search and purchase behavior
(see Section 2.2).
(we relax this assumption below). Comparing the resulting estimates λ̄U and λ̄R allows us to con-
trast which attributes users are looking for and which attributes are promoted by the personalized
rankings. If the retailer uses personalized rankings to steer users away from high-utility items and
toward profitable items, we would then expect the implied tastes λ̄R to deviate from the actual
user tastes λ̄U .
Figure 4 shows the estimates λ̄U and λ̄R for 20 observed attributes that enter the utility func-
tion (1).19 These attributes include price as well as several binary variables capturing a bed's size,
material, and style. Although these two sets of estimates were obtained from two completely differ-
ent information sources, the estimated tastes λ̄U and λ̄R are substantially aligned. The correlation
between the two sets of point estimates is 0.42. Beds with certain attributes (e.g., Size King, Style Farmhouse, and Style Posh) are more appealing to users (positive λ̄U) and tend to be promoted in the top positions of personalized rankings (positive λ̄R). By contrast, beds with attributes
such as Material Metal, Style American, and Style Sleek are relatively undesirable to users (negative
λ̄U ), and the current algorithm either does not show them at all or displays them at the bottom of
the page (negative λ̄R ).
Despite this general alignment, we do observe that for several attributes, personalized rankings
deviate from the actual tastes of users. Notably, we estimate somewhat lower price sensitivity
from rankings than choices, implying that the retailer is more likely to show expensive items than
implied by the actual users’ tastes. We also obtain estimates of different signs for several attributes,
including Size Queen, Style Glam, and Style Contemporary. While these attributes are desirable
19 Since λ̄U is only identified with respect to the normalized scale σε, and λ̄R is identified with respect to another
scale σµp , we cannot directly compare tastes λ̄R and λ̄U with each other. We therefore normalize each taste vector,
dividing it by the estimated taste parameter λ̄k for one of the attributes that is not shown in the figure.
Table 4: Selected parameter estimates from the model of personalized rankings. We obtain these
estimates using the Maximum Likelihood estimation procedure developed in Section 3.3. The sample used in this estimation includes N = 50,000 users randomly selected from the personalized group. We observe
these users choosing among J = 243 unique items. The reported LogL value corresponds to the average
log-likelihood value at the optimum, where the average is taken across N users in the sample.
according to the estimated users’ tastes, the current algorithm tends to show items with these
attributes in relatively low positions on the page. Overall, we find that what personalized rankings
show is correlated (but is not perfectly aligned) with the actual users’ tastes. It is then natural to
ask whether and to what extent the documented deviations from the actual tastes affect consumer
welfare.
Table 5: The effect of personalized rankings on consumer surplus. We compute the change in the
expected ex-ante surplus in the first row using the equation (13). In the remaining rows, we decompose the
total surplus change into the effects of changing (a) the total search costs paid by the user, (b) the purchase
probability, and (c) the expected utility conditional on making a purchase (see equation (14) in Appendix C for details). The aggregate estimates of the surplus change are computed by multiplying the average ∆CS estimate by the total number of users in the personalized sample (N = 1,930,992).
The estimated ranking weights are ŵu = 1.078 for utility and ŵπ = 0.179 for profitability. So while personalized rankings are mostly aligned with users' utilities δ_ij, the retailer also puts considerable weight on profitability, thus occasionally showing items that are expensive but not necessarily the most appealing to users.
We now ask how this profit-driven distortion affects consumer welfare. To this end, we generate
a sample of users from the estimated distribution of tastes λi , and we simulate their behavior
under two ranking algorithms: personalized and non-personalized. For the personalized scenario
we generate rankings using the estimated model in (7), whereas in the non-personalized scenario
we generate rankings from the model in (5) that we estimated from the non-personalized sample.
In both scenarios, we define the ex-ante consumer surplus of each user λ_i as follows:

$CS_i(\lambda_i) = \frac{1}{\alpha_i} \sum_{R_i \in \mathcal{R}} P_{R_i} \sum_{S_i \in \mathcal{S}} P_{iS_i|R_i} \left[\log\Big(\sum_{j \in S_i} \exp(\delta_{ij})\Big) - \sum_{j \in S_i} c_{ij}\right] \qquad (13)$
where the first summation is over all possible rankings R, the second summation is over all possible
search sets S, and the expression in the brackets is the expected net utility from searching all
items in the set Si . We approximate these two summations using simulation, i.e., we draw rankings
Ri and search sets Si for each user i and average across the resulting net utility values. Having
computed the consumer surplus CSi (λi ) for each user, we then average across users to compute
the expected consumer surplus CS. Below, we also decompose CS into the effect of total incurred
search costs, the purchase probability, and the utility obtained conditional on a purchase (see the
Appendix C for the formal derivations).
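A schematic sketch of this simulation (illustrative Python; draw_ranking, draw_search_set, and costs_for are hypothetical stand-ins for the estimated ranking, search, and search cost models):

```python
import numpy as np

def consumer_surplus(alpha_i, draw_ranking, draw_search_set, delta_i, costs_for,
                     n_sims=1000, rng=None):
    """Simulation approximation of equation (13): average the net benefit of the
    drawn search set over simulated rankings and search sets, scaled by 1/alpha_i."""
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(n_sims):
        ranking = draw_ranking(rng)                   # R_i from the ranking model
        S = draw_search_set(ranking, rng)             # S_i from the smoothed search model
        utils = np.concatenate(([0.0], delta_i[S]))   # outside option has delta = 0
        total += np.log(np.exp(utils).sum()) - costs_for(ranking, S).sum()
    return total / (alpha_i * n_sims)

# toy usage with stand-in models
rng = np.random.default_rng(4)
delta_i = rng.normal(0, 1, 48)
cs = consumer_surplus(
    alpha_i=0.005,
    draw_ranking=lambda r: r.permutation(48),
    draw_search_set=lambda rank, r: rank[: 1 + r.integers(3)],  # search 1-3 top items
    delta_i=delta_i,
    costs_for=lambda rank, S: np.full(len(S), 0.3),
    rng=rng,
)
```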
Table 5 shows the welfare results. We find that personalized rankings increase the average
consumer surplus by $3.1, an increase of approximately 23% compared to non-personalized rankings.
This surplus increase constitutes around 1.5-2% of the average item price in this category. While
this effect may seem small, one should keep in mind that the probability of purchase in the estimated
model is less than 5%. Therefore, the surplus increase conditional on purchasing is almost 40%
of the average item’s price. Aggregating this estimate across all users in the initial sample of
N = 1, 930, 992, we obtain that personalized rankings generate a substantial additional surplus of
$5.94 million.
5 Conclusion
In this paper, we explored whether personalized rankings in online retail benefit consumers. We be-
gan by analyzing the results of a randomized experiment which revealed that personalized rankings
make consumers search more, increase the purchase probability, and redistribute demand toward
relatively unpopular, niche items. We then developed and estimated a choice model, which enabled
us to recover heterogeneous users’ tastes from data on searches, purchases, and observed rankings,
and which helped us to empirically separate tastes from position effects. Having estimated this
model, we showed that personalized rankings substantially increase both consumer surplus and the retailer's revenues.

21 See “A governance framework for algorithmic accountability and transparency” (Koene et al., 2019).
Anderson, C. (2006): The long tail: Why the future of business is selling less of more, Hachette
Books.
Armona, L., G. Lewis, and G. Zervas (2021): “Learning Product Characteristics and Con-
sumer Preferences from Search Data,” Available at SSRN 3858377.
Athey, S., D. Blei, R. Donnelly, F. Ruiz, and T. Schmidt (2018): “Estimating heteroge-
neous consumer preferences for restaurants and travel time using mobile location data,” in AEA
Papers and Proceedings, vol. 108, 64–67.
Athey, S. and G. W. Imbens (2007): “Discrete choice models with multiple unobserved choice
characteristics,” International Economic Review, 48, 1159–1192.
Bar-Isaac, H., G. Caruana, and V. Cuñat (2012): “Search, design, and market structure,”
American Economic Review, 102, 1140–60.
Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile prices in market equilibrium,”
Econometrica: Journal of the Econometric Society, 841–890.
Carr, D. (2013): “Giving viewers what they want,” The New York Times, 24, 2013.
Chen, Y. and S. Yao (2016): “Sequential Search with Refinement: Model and Application with
Click-stream Data,” Management Science, forthcoming.
Choi, H. and C. F. Mela (2019): “Monetizing online marketplaces,” Marketing Science, 38,
948–972.
Chung, J., P. K. Chintagunta, and S. Misra (2018): “Estimation of Sequential Search Mod-
els,” Working Paper.
Compiani, G., G. Lewis, S. Peng, and W. Wang (2021): “Online Search and Product Rank-
ings: A Double Index Approach,” Available at SSRN 3898134.
De los Santos, B. and S. Koulayev (2017): “Optimizing click-through in online rankings with
endogenous search refinement,” Marketing Science, 36, 542–564.
Donnelly, R., F. R. Ruiz, D. Blei, and S. Athey (2019): “Counterfactual inference for
consumer choice across many product categories,” arXiv preprint arXiv:1906.02635.
Elrod, T. and M. P. Keane (1995): “A factor-analytic probit model for representing the market
structure in panel data,” Journal of Marketing Research, 32, 1–16.
Faddoul, M., G. Chaslot, and H. Farid (2020): “A longitudinal analysis of YouTube’s pro-
motion of conspiracy videos,” arXiv preprint arXiv:2003.03318.
Farrell, M. H., T. Liang, and S. Misra (2020): “Deep Learning for Individual Heterogeneity,”
arXiv preprint arXiv:2010.14694.
Ferreira, P., X. Zhang, R. Belo, and M. Godinho de Matos (2016): “Welfare Properties
of Recommender Systems: Theory and Results from a Randomized Experiment,” Available at
SSRN 2856794.
Gardete, P. M. and M. H. Antill (2019): “Guiding Consumers through Lemons and Peaches:
A Dynamic Model of Search over Multiple Characteristics,” Working Paper.
Goettler, R. L. and R. Shachar (2001): “Spatial competition in the network television in-
dustry,” RAND Journal of Economics, 624–656.
Goli, A., D. Reiley, and Z. Hongkai (2021): “Personalized Versioning: Product Strategies
Constructed from Experiments on Pandora,” Ph.D. thesis.
Gopalan, P., J. M. Hofman, and D. M. Blei (2013): “Scalable recommendation with poisson
factorization,” arXiv preprint arXiv:1311.1704.
Holtz, D., B. Carterette, P. Chandar, Z. Nazari, H. Cramer, and S. Aral (2020): “The
Engagement-Diversity Connection: Evidence from a Field Experiment on Spotify,” Available at
SSRN.
Honka, E. (2014): “Quantifying search and switching costs in the US auto insurance industry,”
The RAND Journal of Economics, 45, 847–884.
Hosanagar, K., R. Krishnan, and L. Ma (2008): “Recommended for you: The impact of profit incentives on the relevance of online recommendations,” ICIS 2008 Proceedings, 31.
Jacobs, B. J., B. Donkers, and D. Fok (2016): “Model-based purchase predictions for large
assortments,” Marketing Science, 35, 389–404.
Jannach, D. and G. Adomavicius (2017): “Price and profit awareness in recommender systems,”
arXiv preprint arXiv:1707.08029.
Johnson, J. P. and D. P. Myatt (2006): “On the simple economics of advertising, marketing,
and product design,” American Economic Review, 96, 756–784.
Keane, M. (2004): “Modeling health insurance choice using the heterogeneous logit model.”
Koren, Y., R. Bell, and C. Volinsky (2009): “Matrix factorization techniques for recom-
mender systems,” Computer, 42, 30–37.
Lee, D. and K. Hosanagar (2019): “How do recommender systems affect sales diversity? A
cross-category investigation via randomized field experiment,” Information Systems Research,
30, 239–259.
MacKenzie, I., C. Meyer, and S. Noble (2013): “How retailers can keep up with consumers,”
McKinsey & Company, 18, 1.
McBride, S. (2014): “Written direct testimony of Stephan McBride (on behalf of Pandora Media, Inc).”
Morozov, I., S. Seiler, X. Dong, and L. Hou (2021): “Estimation of preference heterogeneity
in markets with costly search,” Marketing Science.
Narayanan, S. and K. Kalyanam (2015): “Position effects in search advertising and their
moderators: A regression discontinuity approach,” Marketing Science, 34, 388–407.
Panniello, U., S. Hill, and M. Gorgoglione (2016): “The impact of profit incentives on
the relevance of online recommendations,” Electronic Commerce Research and Applications, 20,
87–104.
Punj, G. N. and R. Staelin (1978): “The choice process for graduate business schools,” Journal
of Marketing research, 15, 588–598.
Rossi, P. E., G. M. Allenby, and R. McCulloch (2012): Bayesian statistics and marketing,
John Wiley & Sons.
Rossi, P. E., R. E. McCulloch, and G. M. Allenby (1996): “The value of purchase history
data in target marketing,” Marketing Science, 15, 321–340.
Roth, C., A. Mazières, and T. Menezes (2020): “Tubes and bubbles topological confinement
of YouTube recommendations,” PloS one, 15, e0231703.
Ruiz, F. J., S. Athey, D. M. Blei, et al. (2020): “Shopper: A probabilistic model of consumer
choice with substitutes and complements,” Annals of Applied Statistics, 14, 1–27.
Solsman, J. E. (2018): “YouTube's AI is the puppet master over most of what you watch,” URL: https://www.cnet.com/news/youtube-ces-2018-neal-mohan.
Ursu, R. M. (2018): “The Power of Rankings: Quantifying the Effect of Rankings on Online
Consumer Search and Purchase Decisions,” Marketing Science, 37, 530–552.
Yavorsky, D. and E. Honka (2019): “Consumer Search in the US Auto Industry: The Value
of Dealership Visits,” Available at SSRN 3499305.
Zhou, B. and T. Zou (2021): “Competing for Recommendations: The Strategic Impact of
Personalized Product Recommendations in Online Marketplaces,” Available at SSRN 3761350.
Figure 6: Randomization checks. The graphs show the proportion of new users assigned to the person-
alized group as a function of the date of the first visit (top panel) and the time of the first visit (bottom
panel). To identify the timing of the first visit, we focus on users who did not visit the website in September,
and we document the day and time of their first visit in October. The dashed lines in both graphs indicate
the target assignment rate of 95%.
Table 6: Randomization checks: visit timing. The first two columns show the values of variables for
users assigned to the personalized and the non-personalized group, the third column shows the difference
between these two groups, and the last column reports p-values from the two-sided t-tests. To identify the
timing of the first visit, we focus on users who did not visit the website in September and document the day
and time of their first visit in October. Therefore, the variables day of the first visit and time of the first
visit are computed only for these selected users. Significance: * p < 0.05.
Table 7: Randomization checks: demographics. The table compares values of demographic variables
in the two experimental groups. All demographic variables were acquired by Wayfair from a third-party data
vendor. Income and net worth variables are imputed from the information on users’ zip codes. Significance:
* p < 0.05.
Table 8: The correlation in observed attributes in the sets of searched and recommended items.
This table is based on three observed attributes: material, design, and available colors. In columns 1-4, we
regress an indicator that at least one item in positions 2-10 has a given attribute on the indicator that the
item in the first position has the same attribute. In columns 5-8, we consider all items in the personalized
rankings that have a given attribute, and we regress the search indicator of the second such item on the search indicator of the first (i.e., topmost) item with this attribute.
Table 9: The correlation in observed attributes in the sets of searched and recommended items.
This table is based on four observed attributes: style, upholstery material, wood tone, and wood type. In
columns 1-4, we regress an indicator that at least one item in positions 2-10 has a given attribute on the
indicator that the item in the first position has the same attribute. In columns 5-8, we consider all items in
the personalized rankings that have a given attribute, and we regress the search indicator of the second such
item on the search indicator of the first (i.e., topmost) item with this attribute.
Figure 7: Examples of items that lose market shares (top panel) and gain market shares
(bottom panel) under the personalized ranking algorithm compared to the non-personalized
one.
Maximizing the likelihood. We estimate structural parameters ω and users' types λ_i by maximizing the total log-likelihood across the N users in the dataset, computing it as $\text{LogL}(y, S, R) = (1/N)\sum_{i=1}^{N} \log L_i(\lambda_i; \omega)$, where the individual log-likelihood log L_i(λ_i; ω) is defined in equation (9).
Given that we need to estimate latent factors ξj and user-specific tastes λi together with other
structural parameters, we have to deal with a large-scale estimation problem. In practice, how-
ever, the estimation of individual types λi can be efficiently parallelized across many GPU cores,
which makes updating these parameters for all N = 50, 000 users relatively quick. In addition,
because the log-likelihood function we maximize admits a simple closed-form solution, we are able
to further speed up the estimation by supplying the expressions for the first and second derivatives.
To avoid overfitting, we impose L2 regularization on individual types, thus shrinking them to the
population mean λ̄. When estimating the model, we tune the value of the shrinkage parameter in
this regularization by maximizing the out-of-sample validation metric defined below.
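A sketch of this penalized objective for one user (illustrative Python; the shrinkage parameter and numbers are placeholders chosen only for the example):

```python
import numpy as np

def penalized_objective(loglik_i, lam_i, lam_bar, shrink):
    """Per-user objective with L2 regularization: the individual type lam_i is
    shrunk toward the population mean lam_bar; 'shrink' is the tuning parameter
    chosen by maximizing the out-of-sample validation metric."""
    return loglik_i - shrink * np.sum((lam_i - lam_bar) ** 2)

# toy call: a user's log-likelihood of -3.2, 5-dimensional type vector
obj = penalized_objective(-3.2, np.array([0.4, -1.0, 0.2, 0.0, 0.9]), np.zeros(5), shrink=0.1)
```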
One practical consideration is how to choose the number of latent attributes K_L. Allowing for a flexible distribution of tastes is crucial for our application. While some flexibility helps, too much flexibility can lead to overfitting. To understand why this issue may arise in our application, consider a version of our model with a large number of latent attributes ξ_j: the more latent attributes we allow, the more closely the model can rationalize the observed searches and rankings in-sample, at the cost of worse out-of-sample performance.
Decomposing welfare changes. In Section 4.3, we compute how personalized rankings affect
the expected consumer surplus CS. We compute ex-ante consumer surplus using the formula in
(13). We approximate the first summation in (13) by drawing NR = 10 rankings for each user
from the appropriate ranking model (i.e., from the appropriate distribution of item rankings Ri ).
To approximate the second summation, we randomly simulate feasible search sets Si using the
following algorithm. If the estimated model fits well, most users will be predicted to search no more than a few items.
where C_algo is the expected total search cost paid during the search process, P_algo is the expected purchase probability, E_algo is the expected utility conditional on the purchase, and algo ∈ {pers, uni} indexes the ranking algorithm (personalized or uniform). The first part of the
decomposition, labeled search costs, captures to what extent personalization induces more active
search, thus making users burn more utility on search costs. The second part, labeled purchase
probability, captures whether personalized rankings make users more likely to make a purchase.
Finally, the third part, labeled match value, captures to what extent some users find items that
better match their individual tastes, while retaining their buyer status. We report the results of
this decomposition in Table 5 of Section 4.3.
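Since equation (14) itself is not reproduced above, the following is only a hypothetical sketch of one accounting identity consistent with the three labeled parts (our own illustration; the paper's exact formula may differ):

```python
def decompose_surplus_change(C, P, E):
    """Hypothetical decomposition: if surplus is roughly P*E - C, the change splits
    exactly into search cost, purchase probability, and match value terms.
    C, P, E are dicts of expected values keyed by 'pers' and 'uni'."""
    search_costs = -(C['pers'] - C['uni'])               # extra search costs burned
    purchase_prob = (P['pers'] - P['uni']) * E['uni']    # more users buying at old match value
    match_value = P['pers'] * (E['pers'] - E['uni'])     # better matches for buyers
    return search_costs, purchase_prob, match_value

# toy numbers only
parts = decompose_surplus_change(
    C={'pers': 0.9, 'uni': 0.7},
    P={'pers': 0.05, 'uni': 0.045},
    E={'pers': 310.0, 'uni': 300.0},
)
```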
Table 10: Model fit. The table compares the moments predicted by the estimated model to those observed in the test sample of N = 50,000 users whose data were not used in estimation.