Abstract: In a Vendor-Managed Inventory (VMI) system, the supplier makes the inventory-management decisions for the retailer; the retailer is not responsible for placing orders. There is a dearth of optimization models for replenishment strategies in VMI systems, and industry relies on well-understood but simple models, e.g., the newsvendor rule. In this article, we propose a methodology based on reinforcement learning, which is rooted in the Bellman equation, to determine a replenishment policy in a VMI system with consignment inventory. We also propose rules based on the newsvendor rule. Our numerical results show that our approach can outperform the newsvendor rule.

Keywords: Vendor-Managed Inventory, Supply Chains, Simulation

EMJ Focus Areas: Quantitative Methods & Models

In the past two decades, interest in supply chain management (SCM) has grown rapidly because of the increasing competition in today's global markets. The introduction of products with shortening life cycles and the heightened expectations of customers have forced businesses to invest in and focus attention on their supply chains. A driving force behind effective SCM is the effective collaboration of the numerous activities involved. A lack of coordination in the associated activities generally results in the well-known bullwhip effect, and more specifically in low service levels, high inventory, and high transportation costs. To overcome these problems, many forms of supply chain coordination, such as vendor-managed inventory (VMI), continuous replenishment, and collaborative planning, forecasting, and replenishment (CPFR), have been implemented in recent years in supply chain software. Known as direct replenishment or supplier-managed inventory in its early days, VMI was popularized in the late 1980s and is now widely used in various industries. According to a 2003 technology survey (Automotive Warehouse Distributors Association, 2003), over 60% of manufacturers in the U.S. have implemented VMI. A survey of electronic components (Survey, 2003) and a survey of the grocery industry (Kurt Salmon Associates, 2002) also support the contention of a broad use of VMI. VMI has been widely used in the following industries:
• Grocery: Campbell Soup Company (Clark and McKenny, 1994) and Barilla SpA (Hammond, 1994),
• Electronics industry: Intel (Kanellos, 1998) and Hewlett-Packard (Waller et al., 1999),
• Food manufacturing: Kraft Inc. and Mott's USA (Emigh, 1999),
• Petrochemical: Shell Chemical (Hibbard, 1998),
• Men's undergarments: Fruit of the Loom (Cetinkaya and Lee, 2000).

The use of VMI at Wal-Mart on a large scale has attracted a great deal of attention in the industrial world.

In a VMI program, the supplier assumes control of the inventory management for one or more retailers (Fry, 2002). The supplier monitors the inventory level at the retailers and makes the decisions related to the quantities of replenishment and the timing of shipments to the retailers. It has been claimed that this can play a significant role in reducing the bullwhip effect. Compared to traditional retailer-managed inventory (RMI) programs, in which retailers are responsible for placing orders and are entirely responsible for their own inventory shortages or excesses, VMI can bring benefits to both retailers and manufacturers, as discussed next. Some of the advantages for the retailer are as follows:
• Increase in the service level: As the supplier has access to databases at the retailer and to the current demand at the retailer, the supplier can better coordinate its own activities of supplying materials to its various customers. As a result, the out-of-stock situation is rarely encountered at the retailer, improving its service levels.
• Reduction of inventory levels: Since the supplier has access to inventory databases at the retailer, the supplier is able to coordinate replenishment decisions in a way that keeps inventory levels at the retailer from being excessive. This leads to reduced holding costs and increased inventory turns at the retailer.
• Reduction of ordering and planning costs: Ordering and planning costs are eliminated (or reduced) for the retailer since the responsibility of ordering/planning is shifted to the supplier.

From the supplier's perspective, the advantages are:
• Ease in coordination of the supply process: Since the supplier has access to the retailer's inventory database, as well as its recent sales transactions, the supplier can more easily coordinate its own activities of supplying goods to various retailers. In particular, it is not overloaded by orders at the beginning of the week or month, as is the case in RMI. Often in RMI the supplier simply cannot meet all the demand since orders …
In this article, we focus on a two-echelon system with one Distribution Center (DC) and the retailers that the DC serves (see Exhibit 1). We will assume that the route followed by the truck (or fleet of trucks) is already known, and we make no attempt to optimize it. The supplier pays the holding cost of the inventory at the retailers (in a consignment-inventory VMI system). Whenever a stock-out occurs at a retailer, the supplier is penalized. The timing and quantity of the order from the manufacturer to the DC must also be decided upon, which we assume is done via a (Q, R) policy. The review of the inventory at the retailers is performed when the truck fleet comes back from the retailers to the DC, and the new decision is then made regarding the number of trucks to be sent in the next period. We refer to the period between two successive inventory reviews as a cycle.

Exhibit 1. The 2-Echelon System Considered in this Article

The customer demand at a retailer is described by the following notation:
• di denotes the demand from the ith customer, which has a discrete uniform distribution.
• N is a Poisson-distributed random variable with parameter λ; essentially, N stands for the number of customers that arrive in one cycle, and the value of λ depends on the retailer.

Some additional notation that we need is:
• cT: the transportation cost, i.e., the cost of operating one truck for unit time.
• hr: the holding cost of one item for unit time at a given retailer.
• pr: the stock-out penalty cost at a given retailer.
• Rev: the revenue generated for the supplier when the sale of a product takes place at the retailer.
• Cap: the capacity of one truck.
• tDC-RET: the time of travel between the DC and a retailer.
• tRET-RET: the time of travel between retailers.
• tRET: the service time at a retailer.
• tDC: the service time at the DC.
• t0: the lead time for the DC's order from the manufacturer.
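To make the demand notation concrete, the following is a minimal Python sketch of how one cycle's customer demand at a retailer could be simulated: N customers arrive according to a Poisson distribution with rate λ, and each customer demands a uniform amount between a and b. The function name and the use of NumPy are our own illustrative choices, not part of the article.

```python
import numpy as np

def sample_cycle_demand(lam, a, b, rng=None):
    """Simulate one cycle's demand at a retailer: D = d_1 + ... + d_N,
    with N ~ Poisson(lam) and each d_i uniform between a and b.

    The article describes d_i as discrete uniform; since the exhibits later
    list non-integer ranges such as (1, 1.5), a continuous uniform draw
    (rng.uniform) would be the natural substitute in those cases.
    """
    rng = rng or np.random.default_rng()
    n = rng.poisson(lam)                            # number of customers in the cycle
    return int(rng.integers(a, b + 1, size=n).sum())

# Example (hypothetical parameters): lam = 0.3 customers per cycle, demand in {1, 2}
# demand = sample_cycle_demand(0.3, 1, 2)
```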
…
The value of the retailer's order-up-to level S* is determined in the newsvendor model by solving:

F(S^*) = \frac{p}{p+h},

where p is the penalty cost, h is the holding cost, and F(.) is the cumulative distribution function of the customer demand in the cycle. Let T denote our cycle time and L denote the lead time for the retailer in question (see Exhibit 2). L is equal to the truck-fleet transportation time from the DC to the retailer in question. Hence, the inventory replenishment decision made at the start of the current cycle will affect the inventory system from the time the current cycle starts until the truck fleet arrives at the retailer in the next cycle. In our setting, the customer demand is assumed to be a compound Poisson process with a demand forecast update and is defined by:

D = \sum_{i=1}^{N} d_i,

where d_i, the amount of demand for the ith customer, has a discrete uniform distribution, U(a, b). In the current cycle, the value of the demand forecast update (I) is known and will be denoted by g. So, in unit time, the mean μ_current and variance σ²_current of the customer demand can be given as (Ross, 2002):

\mu_{\text{current}} = g \cdot E(N) \cdot E(d) = \frac{g\lambda(a+b)}{2},

\sigma^2_{\text{current}} = g^2\left[E(N)\,\mathrm{Var}(d) + \mathrm{Var}(N)\,E^2(d)\right] = g^2\left[\lambda\frac{(b-a)^2}{12} + \lambda\left(\frac{a+b}{2}\right)^2\right] = \frac{g^2\lambda(a^2+b^2+ab)}{3}.

Now, if we assume T to be a discrete random variable, the mean μ_cycle and the variance σ²_cycle of the demand during T time units can be computed (Nahmias, 2001) as:

\mu_{\text{cycle}} = \mu_{\text{current}} \cdot \mu_T, \qquad \sigma^2_{\text{cycle}} = \mu_T\,\sigma^2_{\text{current}} + \mu_{\text{current}}^2\,\sigma_T^2,

where μ_T and σ²_T are the mean and the variance of T. During L time units in the next cycle, the value of the demand forecast update I is unknown; therefore we use the expected value of I, E(I), to estimate it. In unit time of the next cycle, the mean μ_next and variance σ²_next of the customer demand can be computed as follows:

\mu_{\text{next}} = E(I) \cdot E(N) \cdot E(d) = \frac{E(I)\,\lambda(a+b)}{2},

\sigma^2_{\text{next}} = E^2(I)\left[E(N)\,\mathrm{Var}(d) + \mathrm{Var}(N)\,E^2(d)\right] = E^2(I)\left[\lambda\frac{(b-a)^2}{12} + \lambda\left(\frac{a+b}{2}\right)^2\right] = \frac{E^2(I)\,\lambda(a^2+b^2+ab)}{3}.

Again, assuming that L is a discrete random variable, the mean μ_lead and the variance σ²_lead of the demand experienced during L time units in the next cycle can be computed as:

\mu_{\text{lead}} = \mu_{\text{next}} \cdot \mu_L, \qquad \sigma^2_{\text{lead}} = \mu_L\,\sigma^2_{\text{next}} + \mu_{\text{next}}^2\,\sigma_L^2,

where μ_L and σ²_L are the mean and the variance of L, respectively. Then, via the central limit theorem, the customer demand in the replenishment cycle (T) and lead time (L) can be approximated (Ross, 2002) as:

\mu = \mu_{\text{cycle}} + \mu_{\text{lead}}, \qquad \sigma^2 = \sigma^2_{\text{cycle}} + \sigma^2_{\text{lead}}.

If the demand is normally distributed with mean μ and variance σ², the optimal order-up-to level (S*) is:

S^* = \phi^{-1}\!\left(\frac{p}{p+h}\right)\sigma + \mu,

where φ⁻¹(.) is the inverse of the cumulative distribution function of the standard normal distribution. Let S*(i,k) denote the order-up-to level of retailer i and product k based on the newsvendor heuristic. From these values, one can determine the number of trucks to be sent as follows:

a = \left\llbracket \frac{\sum_{i=1}^{n}\sum_{k=1}^{m}\left(S^*(i,k) - x_{ik}\right)}{\text{Cap}} \right\rrbracket,

where ⟦.⟧ denotes the nearest integer of the quantity inside the brackets and x_ik denotes the current inventory for retailer i and product k.
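As a worked illustration of the calculations above, the following Python sketch (our own, using NumPy and SciPy) assembles the cycle and lead-time demand moments, computes the newsvendor order-up-to level S*, and converts the total shortfall across retailers and products into a number of trucks. All function and argument names are ours, not the article's.

```python
import numpy as np
from scipy.stats import norm

def order_up_to_level(g, E_I, lam, a, b, mu_T, var_T, mu_L, var_L, p, h):
    """Newsvendor order-up-to level S* for one retailer/product,
    using the normal approximation of (cycle + lead-time) demand."""
    # Per-unit-time moments of the compound Poisson demand (current / next cycle).
    mu_cur  = g * lam * (a + b) / 2.0
    var_cur = g**2 * lam * (a**2 + a*b + b**2) / 3.0
    mu_nxt  = E_I * lam * (a + b) / 2.0
    var_nxt = E_I**2 * lam * (a**2 + a*b + b**2) / 3.0
    # Demand over the random cycle length T and the lead time L.
    mu_cycle, var_cycle = mu_cur * mu_T, mu_T * var_cur + mu_cur**2 * var_T
    mu_lead,  var_lead  = mu_nxt * mu_L, mu_L * var_nxt + mu_nxt**2 * var_L
    mu = mu_cycle + mu_lead
    sigma = np.sqrt(var_cycle + var_lead)
    return norm.ppf(p / (p + h)) * sigma + mu       # S* = phi^{-1}(p/(p+h)) * sigma + mu

def trucks_to_send(S_star, x, cap):
    """Round the total shortfall over all retailers/products to a number of trucks."""
    return int(round(float(np.sum(S_star - x)) / cap))

# Example with hypothetical numbers: two retailer-product pairs vs. current stock.
# S_star = np.array([40.0, 55.0]); x = np.array([10.0, 20.0]); trucks_to_send(S_star, x, 100)
```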
Cost Structure
We summarize the cost economics model used in our simulation model. We have four elements for the costs and revenues: the holding cost (hr) at the retailers for each product (note that this cost is absorbed by the supplier in our model), the revenue (Rev) transmitted to the supplier when a sale occurs, the stock-out penalty (pr), which is charged to the supplier, and the transportation cost (cT) per truck per unit time for the supplier.

Solution Methodology
The problem that we have presented above can also be solved via a more detailed model that looks at the Markov chains underlying the VMI system. In this section, we first present the underlying stochastic process that we have used for developing the model and then discuss the method used for solution.

Semi-Markov Decision Process
The stochastic process that we use in our model is called the semi-Markov decision process (SMDP) (Bertsekas, 1995). The SMDP is characterized by a set of states, actions, rewards, transition times, and transition probabilities. In each decision-making state, the agent can choose from a set of actions. The random trajectory followed by the system depends on the action chosen in each state and upon the random variables that govern the system's behavior. Underlying the semi-Markov process is an embedded Markov chain. Let p(i,a,j) denote the probability of transitioning from state i to state j in the Markov chain associated with action a. Similarly, let r(i,a,j) denote the immediate reward (revenue) earned in the same transition. Then, the expected immediate reward earned in state i by selecting action a is:

\bar{r}(i,a) = \sum_{j} p(i,a,j)\,r(i,a,j).

In an SMDP model, the time of transition from one state to the next is also a random variable with an arbitrary distribution. So, if t(i,a,j) denotes the mean time in transiting from state i to state j under a, the mean time in any transition out of i under a is given by:

\bar{t}(i,a) = \sum_{j} p(i,a,j)\,t(i,a,j).

Typically, when we study the system over an infinite time horizon, the objective function is to maximize a cumulative function of the immediate rewards (or costs) earned in each state transition. Two performance metrics frequently used in stochastic control theory are: the average reward, which is the expected reward per unit time calculated over an infinitely long trajectory, and the discounted reward, which is the expected total discounted reward calculated over an infinitely long trajectory. Solving the SMDP with respect to the discounted reward metric requires solution of the following Bellman equation:

J(i) = \max_{a \in A(i)} \sum_{j} p(i,a,j)\left[r(i,a,j) + \int_{0}^{\infty} \exp(-\gamma t)\,J(j)\,dF_{ia}(t)\right],

where F_ia(t) denotes the cumulative distribution function of the time spent in transition out of i under a, γ denotes the discount rate (with exp(−γt) being the discount factor over a time interval of length t), A(i) denotes the set of actions allowed in state i, and J(i) denotes the value function for state i. The solution to the SMDP can be expressed as a policy π, which prescribes π(i) as the action to be taken in state i.

We now consider the average reward performance metric. If the starting state for the system is s, the average reward (performance criterion) of a given policy π can be defined as:

\liminf_{k \to \infty} \frac{E\left\{\sum_{m=1}^{k} r\left(i_m, \pi(i_m), i_{m+1}\right) \mid i_1 = s\right\}}{E\left\{\sum_{m=1}^{k} t\left(i_m, \pi(i_m), i_{m+1}\right) \mid i_1 = s\right\}},

where E denotes the expectation operator and i_m denotes the state in the mth jump of the embedded Markov chain associated with policy π. It can be shown that if the Markov chain associated with every policy is "regular," then the above criterion does not depend on the starting state. Our goal in average-reward SMDPs is to maximize the above function, which is also called the long-run average reward of the SMDP. As γ → 0, which implies that exp(−γt) → 1, it can be shown from the Bellman equation for discounted reward that optimization with respect to discounting becomes equivalent to optimizing with respect to the average-reward criterion. The use of such a small value of γ, so that the discount factor tends to 1, is called a vanishing-discount approach. In this article, we will use a discounting algorithm, but, via the vanishing-discount approach, will solve an average-reward problem.

SMDPs can be solved via the classical dynamic programming method (Bertsekas, 1995), provided the number of states is small and the system has a tractable transition probability model; however, for a problem with a large number of states, which is true of the problem considered in this article, it generally becomes difficult to store all the transition probabilities, p(.,.,.), and the elements of the value function, J(.). This is attributed to the curse of dimensionality (Bellman, 1957). If the transition probabilities are too difficult to find in an exact manner because of the complex stochastics of the underlying random process, the analysis is said to suffer from the curse of modeling (Sutton and Barto, 1998); this may be true even of problems with a few states. One approach to avoiding these twin curses of dynamic programming is to use the methodology called reinforcement learning (RL), which we describe next.

Reinforcement Learning (RL)
RL has attracted a considerable amount of attention recently (see Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996; and Gosavi, 2003 for textbooks on this topic). Watkins (1989) discovered the Q-Learning algorithm, one of the most popular algorithms in RL, to solve the MDP in a simulator. It was extended to the SMDP by Bradtke and Duff (1995), who used a continuous reward rate. We use the algorithm proposed in Gosavi (2003) for the lumpsum reward, which is applicable to the problem here. The algorithm's convergence has been established in Gosavi (2007). The algorithm is described next.

Steps in the Q-Learning Algorithm
Step I. Let S denote the set of states and A(i) denote the set of actions allowed in state i. Initialize the Q-factors, Q(i,u) = 0, for all i ∈ S and all u ∈ A(i). Set m = 0 and γ to a very small positive number, e.g., 0.001. Compute the step size using a rule such as α = A/(B+m), where A and B are constants, e.g., A = 99 and B = 100. Set MAX_STEPS to a large positive integer, e.g., 10,000. Start the system simulation.

Step II. While m < MAX_STEPS do:
Let the system start in state i.
1. With probability of 1/|A(i)| (note that |X| denotes the number of elements in set X), choose an action a ∈ A(i) that maximizes Q(i,a). (In other words, compare all the Q-factors for state i, and choose the action for which the Q-factor is the maximum.)
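Only the first part of Step II is visible in this extract, so the following Python sketch is our own reading of the update that Steps I and II set up: a tabular, discounted-reward Q-Learning rule for an SMDP with a lumpsum reward, run inside a simulator, with a very small γ so that the vanishing-discount device approximates the average-reward objective. The exploration scheme, the tabular Q-factors (the article itself stores Q-factors in a neural network), and all names are our simplifications.

```python
import math
import random
from collections import defaultdict

def q_learning_smdp(simulate_transition, actions, start_state,
                    gamma=0.001, A=99.0, B=100.0, max_steps=10_000, seed=0):
    """Sketch of Q-Learning for a discounted-reward SMDP with lumpsum rewards.

    simulate_transition(i, a) -> (j, r, t): next state j, lumpsum reward r
    (e.g., revenues minus holding, stock-out, and transportation costs, per
    the Cost Structure section), and the random time t spent in the transition.
    actions(i) -> list of actions (e.g., numbers of trucks) allowed in state i.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)                      # Step I: Q(i, u) = 0 for all (i, u)
    i = start_state
    for m in range(max_steps):                  # Step II: while m < MAX_STEPS
        alpha = A / (B + m)                     # step size, e.g., 99 / (100 + m)
        acts = actions(i)
        # Our reading of the partially visible exploration rule:
        # mostly act greedily, occasionally try a random action.
        if rng.random() < 1.0 / len(acts):
            a = rng.choice(acts)
        else:
            a = max(acts, key=lambda u: Q[(i, u)])
        j, r, t = simulate_transition(i, a)     # simulate one SMDP transition
        best_next = max(Q[(j, u)] for u in actions(j))
        # Q(i,a) <- (1 - alpha) Q(i,a) + alpha [ r + exp(-gamma * t) * max_u Q(j,u) ]
        Q[(i, a)] = (1 - alpha) * Q[(i, a)] + alpha * (r + math.exp(-gamma * t) * best_next)
        i = j
    return Q
```

With γ as small as 0.001, exp(−γt) stays close to 1 for typical transition times, which is how the vanishing-discount device steers the discounted update toward the average-reward objective discussed above.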
…works well if the Q-factors are linear or nearly linear functions.

We now describe how our value function is approximated. The inventory at each retailer and at the DC is encoded using "buckets" (Sutton and Barto, 1998). The state is composed of: the retailers' inventory x = {x11, ..., xmn}, where x_ik denotes the retailer's inventory of the kth product at the ith retailer; the DC's inventory y = {y1, ..., ym}, where y_k denotes the DC's inventory of the kth product; and the retailers' demand forecast update I = {I11, ..., Imn}, where I_ik denotes the forecast update of the kth product at the ith retailer. Here, m is the number of product items and n is the number of retailers. The actual state space is too large to store all the Q-factors explicitly. We therefore used a neural network, for which the state space must typically first be encoded. We encode the inventory to generate signals (levels) for the neural network as follows. Let the inventory for a given retailer have a maximum value of b and a minimum value of a, with c, d, and e chosen in a manner such that a < c < d < e < b and c − a = d − c = e − d = b − e. Inventory values from a to c will be assumed to have a signal of 1, those from c to …

[Figure: state space, region layers, and feature-extraction layers of the neural network used for the value function.]

… where x_ik is the current inventory of retailer i and product k, and Inv(i,k) … (In the above, ⟦.⟧ denotes the nearest integer for the quantity inside the brackets.) If Y(i,k) < 0, then Inv(i,k) = 0; otherwise, Inv(i,k) = Y(i,k). Note that this rule uses the newsvendor (Q, R) relationship.

Numerical Experiments
In this section, we describe the results of our numerical experiments with the technique developed in the previous section. We consider a scenario with two products and ten retailers. The parameters of each product at the different retailers are listed in Exhibit 4. The values of the parameters of each product at the DC, the transportation times, and the truck-related information are listed in Exhibit 5. The parameters in Exhibits 4 and 5 constitute our baseline case; the parameters for the other cases are defined in Exhibit 6, and Exhibit 7 presents the results obtained with all eight cases that we study.

Exhibit 4. Parameters (Demand Arrival, Holding Costs, Stock-Out Penalties, and Revenues) for the Retailers and the Products

Product  Retailer  λ      Demand Uniform(a,b)  hr     pr  Rev
1        7         0.3    (1, 1.5)             0.03   4   5
1        9         0.15   (2, 3)               0.04   4   5
1        10        0.2    (1, 3)               0.05   4   5
2        1         0.5    (0.5, 1.5)           0.04   3   5
2        2         0.1    (2, 3)               0.06   3   5
2        3         0.15   (1, 3)               0.04   3   5
2        4         0.3    (0.5, 2)             0.05   3   5
2        5         0.35   (0.5, 1.5)           0.03   3   5
2        6         0.25   (1, 2)               0.06   3   5
2        7         0.4    (0.5, 1.5)           0.04   3   5
2        8         0.2    (1, 3)               0.03   3   5
2        9         0.15   (1.5, 2.5)           0.05   3   5
2        10        0.25   (1, 2)               0.03   3   5

Exhibit 5. Parameters at the DC, Transportation Times, and Truck-Related Information

Parameter  Product 1           Product 2
pDC        1                   1
(Q, R)     (500, 150)          (500, 150)
K          50                  50
tDC-ret    Uniform (2, 4)
tret-DC    Uniform (2, 4)
tret-ret   Uniform (0.5, 1)
tDC        Uniform (0.2, 0.3)
tret       Uniform (0.01, 0.015)
t0         Uniform (30, 50)    Uniform (30, 50)
cT         10
Cap        100

Exhibit 6 (fragment):
λ    Original values    Increase the value by 50%

Exhibit 7 (column headings):
Case    pr    hr    λ    Newsvendor heuristic    RL-based algorithm    % Improvement
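The bucket encoding of the inventory described in the value-function passage above can be sketched in Python as follows. The passage is cut off after assigning signal 1 to values in [a, c], so mapping the remaining three equal-width buckets to signals 2, 3, and 4 is our assumption, as is the function name.

```python
def inventory_signal(x, a, b):
    """Encode an inventory level x in [a, b] as a signal in {1, 2, 3, 4}.

    c, d, and e split [a, b] into four equal-width buckets, as in the text:
    a < c < d < e < b with c - a = d - c = e - d = b - e.  Values in [a, c]
    get signal 1; assigning 2, 3, and 4 to the remaining buckets is our
    assumption, since the extract stops mid-sentence.
    """
    width = (b - a) / 4.0
    c, d, e = a + width, a + 2 * width, a + 3 * width
    if x <= c:
        return 1
    if x <= d:
        return 2
    if x <= e:
        return 3
    return 4
```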