DeepCount: Crowd Counting with WiFi via Deep Learning
Shangqing Liu∗§, Yanchao Zhao∗†§, Fanggang Xue∗, Bing Chen∗ and Xiang Chen‡
∗ College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China
† Collaborative Innovation Center of Novel Software Technology and Industrialization, China
‡ Department of Computer Science and Technology, Nantong University, China
§ Co-First Authors
arXiv:1903.05316v1 [cs.LG] 13 Mar 2019

Abstract—Recent research in wireless sensing has achieved increasingly intelligent results: the location and activity of humans can now be sensed by means of WiFi devices. However, most current work on human environment perception is limited to single-person settings, because an environment with multiple people is far more complicated than one with a single person. To address human behavior perception in multi-human environments, we propose DeepCount, a solution that achieves crowd counting (inferring the number of people) in a closed environment from WiFi signals via deep learning, which is, to our knowledge, the first of its kind for multi-human environments. Since using WiFi to count a crowd directly is too complicated, we apply deep learning: a Convolutional Neural Network (CNN) automatically extracts the relationship between the number of people and the channel, and a Long Short-Term Memory (LSTM) network resolves the dependencies between the number of people and Channel State Information (CSI). To reduce the massive amount of labelled data required by deep learning, we add an online learning mechanism that determines whether someone is entering or leaving the room via an activity recognition model and uses this to correct the deep learning model in a fine-tuning stage, which in turn reduces the required training data and makes our method evolve over time. DeepCount is implemented and evaluated on commodity WiFi devices. With massive training samples, our end-to-end learning approach achieves an average prediction accuracy of 86.4% in an environment of up to 5 people. Moreover, with the amendment mechanism, in which the activity recognition model detects door switches to obtain the change in crowd size and amend the predictions of the deep learning model, the accuracy rises to 90%.

Index Terms—Crowd Counting, WiFi Sensing, Deep Learning, Human Activity Recognition

I. INTRODUCTION

WiFi signals now span entire cities and reveal sensing abilities such as activity recognition, human identification, localization and beyond. However, current WiFi sensing applications are only effective in single-person scenarios, which greatly restricts the use of WiFi environmental sensing [1] [2] [3] [4]. Research on indoor crowd counting is the basis of multi-object environmental sensing and of various potential applications, e.g. tour guiding and crowd control. Meanwhile, crowd counting in a WiFi environment is a very challenging task, as the WiFi signal is highly arbitrary owing to the uncertainty of the states of people in the room. Thus, traditional signal processing and pattern recognition methods for activity recognition [5] [6] [7] are powerless to extract the information needed for crowd counting from multiple overlapping signals.

Traditionally, the most popular crowd-counting approaches are based on computer vision techniques using camera images [8] [9] [10] [11]. Recently, electromagnetic-wave methods have been implemented with special hardware [12] [13]. However, camera-based approaches may suffer from blind spots in corners or the absence of light, and they also introduce privacy issues. Special hardware systems such as WiTrack [12] mainly measure Time-of-Flight (TOF) via FMCW (Frequency Modulated Continuous Wave) to provide a dedicated, well-constructed signal, but these devices suffer from high deployment cost and are thus not comparable to the ubiquitous WiFi deployment. Some researchers have also used smartphones [14]–[16] to infer the number of speakers in a dialog; however, these methods are not device-free and are not friendly to the elderly and children. Based on these observations, we believe that WiFi-based crowd counting is very meaningful and can solve the above problems well.

In this paper, our proposed DeepCount solution exploits the fact that human activities produce distinct multi-path distortions and unique waveform patterns in WiFi signals. During WiFi signal propagation, the physical amplitude and phase information are distorted greatly by human activities in the environment. Hence, we can utilize the time series of Channel State Information (CSI) values captured in the WiFi signals for sensing. Compared with the existing WiFi crowd-counting system Electronic Frog Eye [17], which predicts the number of people using a Grey model and cannot fully utilize both phase and amplitude information, we utilize both dimensions together with powerful deep learning methods.

However, to turn this idea into a working system, we face a variety of technical challenges. The first is how to extract feature values that can model the relationship between CSI values and people counts. After a detailed analysis of a large amount of CSI data, we found that traditional features such as entropy, maximum or variance do not meet the demand. Due to the uncertainty of the states of the people in the room, we cannot find the correlation and pattern through simple mathematical modeling. In this case, we need to preserve the original features of the data as much as possible during data processing, so we only perform simple denoising on the amplitude and phase information.

The second technical challenge is to extract the counting model from complex overlapping signals. Traditional training methods such as Support Vector Machines (SVMs) and Bayesian classifiers cannot capture the overlapping features under background noise. Recent advances in deep learning, and the success of its application in computer vision, shed light on resolving our problem. Given the extremely complicated relationship between CSI waveforms and crowd counts, we utilize neural networks, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), to construct this complicated model.

The third challenge is the massive data required to build the model with deep learning, and how to build a robust model with a proper neural network structure. Regarding data acquisition, the training process can easily lead to overfitting with inadequate or biased data. Thus, we design a massive and carefully planned data collection process to ensure both quality and quantity. Specifically, we collect 6 different activities from up to 10 people, covering different indoor behaviors such as walking and talking, with different participants. Then, to improve diversity, we split the data into different time-window lengths and label them properly. Regarding the network structure, we use CNNs and LSTMs for the deep neural network model. In addition, regularization and exponential decay methods are applied to avoid overfitting.

The fourth challenge is how to adapt our deep learning model over time. Although the deep learning method has a strong learning ability, the learned model can degrade over time with slight environment changes. To overcome this, our basic idea is to use our activity recognition model under the condition of a single person entering or leaving the room; then we can infer a state change for the current model. At the same time, if the deep learning model gives an erroneous result, meaning the increment or decrement of the number of people in the room is greater than 1 compared with the previous slot, we can correct this result and fine-tune the parameters of the last layer of the neural network in our deep learning model. With this mechanism, we eventually improve the accuracy of the WiFi counting model up to 90% in a relatively robust manner.

Our method fully utilizes the amplitude and phase characteristics of CSI with a specifically designed CNN-LSTM network, where the CNN extracts deep features while the LSTM handles the time-series signal. Meanwhile, to make our model flexible enough to adapt to the time evolution of crowd counting in an online manner, we add an online learning mechanism that corrects our deep learning model by fine-tuning the last-layer parameters of the neural network. This endows our method with time-evolving features and thus greatly improves its practicality.

The main contributions of this paper can be summarized as follows:
• We theoretically analyze the correlation between crowd counts and the variation of CSI, and utilize deep learning to characterize this relationship. To the best of our knowledge, this is the first solution that counts people from WiFi signals using neural networks, and it proposes a new approach to solving such problems.
• We propose the DeepCount system and adopt a deep learning approach to solve multi-person context-awareness problems. We use LSTM and CNN layers to extract features automatically, followed by a softmax layer for crowd counting.
• To further improve the performance of DeepCount, we add an online learning mechanism based on our activity recognition model; experiments show that with this mechanism we eventually reach 90% accuracy.
• We introduce simple and effective denoising methods that eliminate noise while preserving the characteristics of the data as much as possible.

The remainder of this paper is organized as follows. Section II provides background on neural networks and related work in WiFi sensing. Section III analyses the characteristics of CSI and the reasons for choosing deep learning to solve this problem. Section IV discusses the design of DeepCount in detail. The implementation and evaluation are presented in Section V, followed by conclusions in Section VI.

II. BACKGROUND AND RELATED WORK

A. Deep Neural Networks

With the rapid development of artificial intelligence, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have demonstrated impressive advances in numerous tasks.

Recurrent Neural Networks: RNNs are designed for processing sequential information thanks to their ability to preserve previous information in the current state. An RNN applies the same network at each time step to capture state information and combines this information with the next input. However, RNNs suffer from the vanishing gradient problem, which leads to failure in learning long sequences. LSTM was motivated to overcome this limitation by introducing a new structure, the memory cell, which adds forget gates to the simple RNN. This mechanism enables it to capture long-term dependencies. Hence, RNNs with LSTM cells have been widely used for processing long sequences.

Convolutional Neural Networks: CNNs have recently achieved many state-of-the-art results in computer vision and NLP thanks to their ability to automatically extract high-level features. The traditional CNN block contains three parts: a convolutional layer, an activation function and a pooling layer. The main function of the convolutional layer is to extract features automatically with a filter. The filter is a small square, commonly of shape (3,3) or (5,5), which computes dot products across the dimensions of the input data. An activation function such as Sigmoid or ReLU follows every convolutional layer to perform a non-linear transform. The pooling layer reduces the dimension of the input data while retaining the most important information. There are different types of pooling, such as max and average; max pooling extracts the maximum value in a predefined area. These three parts are the basic blocks of CNNs.
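As a toy illustration of one such block, the NumPy sketch below applies a single 3×3 averaging filter, a ReLU, and 2×2 max pooling to a 6×6 input. The input and filter values are made up purely for illustration; they are not part of DeepCount's trained model.

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Valid 2-D convolution (cross-correlation, as in most deep learning libraries)."""
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # dot product with the filter
    return out

def relu(x):
    return np.maximum(x, 0)  # non-linear activation

def max_pool(x, size=2, stride=2):
    """Keep only the maximum value in each size x size area."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)  # toy input "image"
k = np.ones((3, 3)) / 9.0                     # 3x3 averaging filter
y = max_pool(relu(conv2d(x, k)))              # one conv -> ReLU -> pool block
print(y.shape)  # (2, 2): 6x6 -> 4x4 after valid convolution -> 2x2 after pooling
```

Stacking several such blocks, as DeepCount does later, progressively shrinks the spatial size while deepening the learned features.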

B. WiFi Sensing

WiFi sensing can be applied to activity recognition, indoor localization and user authentication. Here we briefly discuss the related work.

Activity Recognition: In single-person environments, many researchers have utilized CSI values for large-scale activity recognition [18] [6] [7] and small-scale motion recognition [19] [20] [21]. In terms of large-scale activity recognition, WiFall [22] detected fall activity with an accuracy of 87% using a novel detection method. E-eyes [1] used CSI histograms as fingerprints to separate different daily activities. CARM [5] proposed a CSI-speed model and a CSI-activity model to analyse the correlation between CSI values and human activities. WiDir [2] used the physical Fresnel zone model to estimate the moving direction of humans. In terms of small-scale motion recognition, WiFinger [4] used CSI to recognize a set of digit gestures (0-9), and WiHear [3] used directional antennas to "hear" people talk by capturing CSI variations caused by lip movement.

Indoor Localization: Another important branch of WiFi sensing uses CSI for indoor localization [23] [24]. SpotFi [25] utilized phase information from CSI values with the MUSIC algorithm and achieved decimeter-level accuracy. Based on this, Li et al. [26] from Peking University improved the MUSIC algorithm and proposed a dynamic-MUSIC algorithm to measure the angle-of-arrival (AoA) of signals. LiFS [27] localized a target without offline training with the help of "clean" subcarriers. Some researchers have also used deep learning techniques with CSI for localization: DeepFi [28] trained the weights and biases between network layers as fingerprints for localization.

User Authentication: CSI can be used for authentication and privacy protection because it contains the state information of the wireless channel. Liu et al. [29] proposed a system to construct user profiles resilient to the presence of a spoofer. WiWho [30] and WifiU [31] used WiFi devices to analyse unique gait patterns for human identification.

In recent work similar to ours, Xi et al. [17] proposed a system called FCC, which uses WiFi signals to estimate the number of people in an indoor environment. However, FCC applied the Verhulst theory, which is not sufficiently reliable, and its dilatation-based crowd profiling algorithm cannot accurately describe the correlation between CSI and crowd counts. Similarly, Domenico et al. [32] tried to find the correlation between the number of people and CSI, using the Euclidean distance between two CSI waveforms for identification, which is not powerful enough to depict the relationship between the number of people and CSI. Shi et al. [33] applied a Deep Neural Network (DNN) for user authentication; their system achieved over 94% and 91% authentication accuracy with 11 subjects through walking and stationary activities respectively. Compared with these, our training process uses LSTM to capture long-term dependencies and CNN to extract features automatically, improving the performance of the neural network on crowd counting.

III. ANALYSIS OF WIFI COUNTING

A. Overview of CSI

In a wireless communication environment, Channel State Information (CSI) represents the nature of the wireless link: it describes how the WiFi signal propagates from the transmitter to the receiver under the combined effects of scattering and distance-dependent fading. Today's WiFi devices are usually equipped with multiple transmitting and multiple receiving antennas for high data rates, known as MIMO (Multiple Input, Multiple Output). Each physical link between a pair of transmitting and receiving antennas carries multiple subcarriers at the same time, so at each time point t we obtain rich data. The received WiFi signal can be expressed as:

    Y(t) = H(t)X(t) + N    (1)

where Y(t) and X(t) are the received and transmitted signals, H(t) is the Channel Frequency Response (CFR), and N is the noise vector. Furthermore, the CFR can be expressed as:

    H(t) = Σ_{k=1}^{N} a_k(t) e^{-j2πf τ_k(t)}    (2)

where a_k(t) denotes the channel attenuation and initial phase offset of the k-th path, and e^{-j2πf τ_k(t)} denotes the phase offset caused by the propagation delay. CSI is the estimation of the CFR, and it contains both amplitude and phase information. The amplitude represents the strength of the WiFi signal, and the phase represents the periodic variation of the signal with the propagation distance: physically, each time the phase advances by one period, the signal has propagated the length of one wavelength. Hence, the amplitude and phase are sufficient for us to sense environmental changes.

B. WiFi Counting Model

From the prior analysis, WiFi sensing has three kinds of applications: activity recognition, indoor localization and human authentication. Since different human activities have different effects on WiFi signals, some simple machine learning methods can separate these known activities based on CSI. However, using WiFi signals to detect the number of people in a room is much more complicated, because the state of each person in the room is unknown. We are unable to separate the WiFi signal into specified activities and further determine the number of people in the room; in other words, we are unable to manually extract features that are directly related to the number of people.

Since we cannot manually extract features, can we look for another way to solve this problem? Perhaps we can draw inspiration from computer vision, where researchers use deep learning approaches [34] [35] to extract features automatically and obtain state-of-the-art performance. In some specific tasks, these applications even exceed human performance. Hence, we solve this problem with neural networks. We have three reasons to believe that deep learning can solve this problem.
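Since each CSI sample is a complex number, the amplitude and phase discussed above can be separated directly from it. A minimal sketch with synthetic data follows; the 6×30 shape merely mirrors the 6 Tx-Rx antenna pairs × 30 subcarriers used later in the paper, and the values are random stand-ins for real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic complex CSI matrix: 6 Tx-Rx antenna pairs x 30 subcarriers
H = rng.normal(size=(6, 30)) + 1j * rng.normal(size=(6, 30))

amplitude = np.abs(H)    # signal strength per subcarrier
phase = np.angle(H)      # wrapped phase in (-pi, pi]

# Sanity check: each complex sample is recovered from its polar form
H_reconstructed = amplitude * np.exp(1j * phase)
print(np.allclose(H, H_reconstructed))  # True
```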

Fig. 1: Raw normalized CSI amplitude and phase information.
Fig. 2: Amplitudes of different subcarriers at different activities.
Fig. 3: Phases of different subcarriers at different activities.
Fig. 4: CDF of normalized amplitudes of CSI values at a fixed activity.
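Before detailing those reasons, a toy example of why non-linearity matters for classification: a single linear layer (a weighted sum of inputs, as in equation (3) below) cannot represent XOR, while one hidden layer with an activation function can. The weights here are hand-picked for illustration, not learned.

```python
import numpy as np

def linear(x, w, b):
    return x @ w + b                    # equation (3): a weighted sum of inputs

def mlp(x, W1, b1, w2, b2):
    h = np.maximum(x @ W1 + b1, 0.0)    # hidden layer with ReLU activation
    return h @ w2 + b2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked (not learned) weights: h1 = relu(x1 + x2 - 1), h2 = relu(x1 + x2),
# output = h2 - 2*h1, which equals XOR(x1, x2) on the four binary inputs
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([-1.0, 0.0])
w2 = np.array([-2.0, 1.0])

out = mlp(X, W1, b1, w2, 0.0)
print(out)  # [0. 1. 1. 0.] -- no single linear(x, w, b) layer can produce this
```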

1) Neural networks can effectively solve nonlinear classification problems: In a traditional linear model, the output is a linear weighted sum of the inputs:

    y = Σ_i w_i x_i + b    (3)

where the w_i and b are the parameters of the model. Equation (3) is a linear transformation that can solve linear classification problems, but it does not meet the requirements of nonlinear classification problems; for those, we need multi-layered neurons with activation functions. From Fig. 1, we find that the relationships between crowd counts and the amplitude and phase information are non-linear. In theory, if the hidden layer contains enough neurons, a neural network can fit any complex nonlinear function. Hence, with a proper neural network architecture, we may get better results.

2) CSI values exhibit large variability among different subcarriers: In our experiment, a laptop with three antennas serves as the receiver and a device with two antennas as the transmitter. Each WiFi link between a transmitter antenna and a receiver antenna has 30 subcarriers, so in total we get 180 subcarrier streams. From Fig. 2, we can see that the CSI amplitude during a specific activity varies across subcarriers, and that different activities produce different signals; a similar property is shown in Fig. 3. Based on these observations, amplitude and phase information are sufficient as input data, and we can use LSTMs to capture the long-term dependencies during activity segments and CNNs to extract features automatically.

3) CSI values are stable at a fixed subcarrier: Although CSI values differ across subcarriers, they also show great stability over time at the same subcarrier, which suggests that the model we build may be robust. Fig. 4 plots the CDF of the standard deviation of CSI amplitudes: 80% of the standard deviations are below 20% of the average value at a fixed subcarrier, which shows that CSI values are much more stable than RSSI values.

Fig. 5: Framework of DeepCount. (Flowchart: raw CSI feeds two pipelines: an activity recognition pipeline with Butterworth filtering, PCA, DWT feature extraction and HMM classification to detect door switches, and a counting pipeline with phase sanitization, amplitude noise removal, offline training and online testing; when a door switch is detected and the predicted change in count exceeds 1, the wrong sample is relabelled and used to retrain the model.)

IV. SYSTEM DESIGN

A. DeepCount Overview

Our DeepCount system uses a neural network to automatically learn the relationship between crowd counts and CSI, based on the fact that human activities have a significant impact on WiFi signals. From Fig. 5 we can see that there are two modules in our DeepCount system: an activity recognition model and a deep learning model. DeepCount uses a router with two transmit antennas to send signals and a laptop with three receive antennas to receive them; hence, our data contains 180 streams, and we can extract phase and amplitude information from each stream. The activity recognition model, used under the condition of a single person entering or leaving the room, relies only on the amplitude information and contains three parts: activity recognition preprocessing, feature extraction and classification. In the preprocessing part, we apply a Butterworth filter and PCA to remove noise and electromagnetic interference from the environment. We then extract features from the amplitude information by DWT (Discrete Wavelet Transform) and use these features to construct our activity recognition model with an HMM (Hidden Markov Model). With this model, we can distinguish the door-switch activity, i.e. someone opening the door and entering the room, or closing the door and leaving it, from other activities. Our deep learning model contains three parts: counting model preprocessing, offline training and online testing.

Unlike systems such as Electronic Frog Eye [17], which use amplitude information alone, both phase and amplitude participate in the preprocessing of our counting model. We need to eliminate the obvious noise in this information before using it for offline training. First, we divide the data set into 3 levels: Dataset-fixed contains activities and positions that are both fixed; Dataset-semi allows volunteers to conduct free activities at a fixed position, choosing actions which may be combinations of the fixed ones; Dataset-open allows volunteers to do anything anywhere in the room. DeepCount then employs a deep neural network with CNNs and LSTMs to train on these samples; with a large number of weights and biases, the deep neural network extracts feature-based fingerprints which can effectively represent the relationship between crowd counts and CSI variations. In the online test stage, we use the trained models to predict the number of people in the room. At the same time, we monitor the door-switch activity among other human activities using the activity recognition model. If someone enters or leaves the room but the increment or decrement of the number of people predicted by our deep learning model is greater than 1 compared with the previous moment, the predicted result is wrong; we label this sample and use it to retrain the parameters of the last layer of our deep learning model.

B. CSI Collection

While the transmitter continuously transmits the WiFi signal, the receiver continuously receives it, and DeepCount automatically extracts the CSI values through the CSI tool installed on the receiver. We fixed the sampling rate at 1500 packets/s to ensure fine-grained information about human activity. For each packet, the CSI matrix H is extracted. Since there are two antennas at the transmitter (Tx) and three antennas at the receiver (Rx), we can express the matrix H as:

        | H_{1,1}  H_{1,2}  ...  H_{1,30} |
    H = | H_{2,1}  H_{2,2}  ...  H_{2,30} |    (4)
        |   ...      ...    ...    ...    |
        | H_{6,1}  H_{6,2}  ...  H_{6,30} |

where H_{i,j} represents the CSI value at the j-th subcarrier of the i-th Tx-Rx pair. Hence, we get three-dimensional data CSI = [H_1, H_2, ..., H_t] for a duration t. We extract the phase and amplitude values from the raw CSI data for further processing.

C. Activity Recognition Model Construction

1) Activity Recognition Preprocessing: As shown in Fig. 6(a), raw CSI amplitude data contain a lot of noise. In our activity recognition model, the human activities of interest, such as sitting, walking and door switching, are not very fast, so the signal changes they cause lie in a low frequency band, while the noise caused by hardware lies in a relatively high frequency band. Hence, a Butterworth low-pass filter is a natural choice for removing the high-frequency noise. Since frequency variations in CSI streams due to normal human activities usually stay within 200 Hz, we set a cut-off frequency of 200 Hz. The waveform after the Butterworth filter is shown in Fig. 6(b); the high-frequency noise is clearly removed. However, the noise between 1 and 200 Hz cannot be eliminated this way, so we utilize PCA to reduce the noise further. DeepCount applies PCA to the CSI streams in the following three steps:

• DC Component Removal: In this step, we first subtract the corresponding constant offsets from the CSI streams to remove the Direct Current (DC) component of every subcarrier. The constant offsets can be calculated by long-term averaging over each subcarrier.

• Principal Components: DeepCount calculates the correlation matrix Z = H^T × H and obtains the eigenvectors q_i through eigen-decomposition of Z. After that, DeepCount obtains the principal components by the following equation:

    h_i = H × q_i    (5)

where q_i is the i-th eigenvector and h_i is the i-th principal component.

• Smoothing: Finally, we apply a 5-point median filter to remove abrupt changes in the CSI streams which could otherwise distort the results.

DeepCount discards the first principal component h_1 and retains the next ten principal components for feature extraction, mainly for the following reasons. From a large number of experiments, we observed that noise is mainly captured in the first component, while the information about human activities is captured in all principal components. Since the PCA components are uncorrelated, we can discard the first principal component without losing too much useful information. Fig. 6(d) shows the third PCA component obtained by our method; the signal is much smoother.

2) Feature Extraction: Different activities respond differently in the frequency spectrum. The CSI frequency is determined as f = 2ν/λ, where ν is the speed of the human activity and λ is the WiFi signal wavelength. From this equation, the frequencies of running and walking are obviously different, because running is much faster than normal walking. Hence, the Discrete Wavelet Transform (DWT) is a proper choice: DWT employs functions localized in both time and frequency, which overcomes the limitations of the classical Fourier transform, and it offers a choice of suitable wavelet bases. The discrete wavelet function is defined as:

    Φ_{m,n}(t) = a_0^{-m/2} φ(a_0^{-m} t - n b_0)    (6)

The discrete wavelet coefficient is defined as:

    W_{m,n}(t) = ∫_{-∞}^{+∞} f(t) Φ_{m,n}(t) dt    (7)

In the DWT, the signal is decomposed into coarse approximation coefficients and detail coefficients; the coarse approximation coefficients are then further decomposed using the same wavelet function. In DeepCount, we chose the Daubechies D4 wavelet to decompose the PCA components into 10 levels of frequencies spanning 1 Hz to 200 Hz.
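The PCA denoising steps described earlier (DC removal, eigen-decomposition of Z = H^T × H, projection onto the eigenvectors, discarding h_1, and median-filter smoothing) can be sketched as follows. The stream length and the synthetic data are illustrative only; real CSI amplitude streams would replace the random matrix.

```python
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(2)
# Synthetic CSI amplitude stream: 2000 time samples x 30 subcarriers,
# with a DC offset (values are stand-ins for real measurements)
H = rng.normal(size=(2000, 30)) + 15.0

# Step 1: DC component removal, subtracting each subcarrier's long-term average
H0 = H - H.mean(axis=0, keepdims=True)

# Step 2: principal components via eigen-decomposition of Z = H^T x H
Z = H0.T @ H0
eigvals, Q = np.linalg.eigh(Z)       # eigenvalues returned in ascending order
Q = Q[:, ::-1]                       # reorder: largest-variance component first
components = H0 @ Q                  # h_i = H x q_i, as in equation (5)

# Discard the first (noise-dominated) component, keep the next ten,
# then apply the 5-point median filter (step 3: smoothing)
kept = components[:, 1:11]
smoothed = np.apply_along_axis(lambda c: medfilt(c, 5), 0, kept)
print(smoothed.shape)  # (2000, 10)
```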

Fig. 6: CSI walking amplitude data at subcarrier 30: (a) raw CSI amplitude; (b) raw data after Butterworth filter; (c) weighted moving average filter; (d) raw data after PCA.
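The Butterworth low-pass step illustrated in Fig. 6(b) can be approximated with a standard filter design. The 1500 packets/s sampling rate and 200 Hz cut-off follow the text; the filter order and the test signal are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 1500.0        # CSI sampling rate: 1500 packets/s
cutoff = 200.0     # activity-induced variations lie below ~200 Hz

# 4th-order Butterworth low-pass (the order is an arbitrary choice for this sketch)
b, a = butter(N=4, Wn=cutoff / (fs / 2), btype='low')

t = np.arange(0, 2.0, 1.0 / fs)
activity = np.sin(2 * np.pi * 3 * t)          # slow, activity-like component
noise = 0.5 * np.sin(2 * np.pi * 600 * t)     # high-frequency, hardware-like noise
filtered = filtfilt(b, a, activity + noise)   # zero-phase filtering

# Away from the edges, the 600 Hz noise is removed and the 3 Hz component survives
mid = slice(300, -300)
print(np.max(np.abs(filtered[mid] - activity[mid])))
```

`filtfilt` runs the filter forward and backward, so the slow component keeps its phase, which matters when the filtered stream is later aligned with activity labels.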

Then, in order to capture the details of different activities, we use a time window of size 128 to average the detail coefficients. In each window, we use the average energy and the variance as features. The corresponding feature matrix is:

    | E_{1,1}   E_{1,2}   ...  E_{1,n-1}   E_{1,n}  |
    | E_{2,1}   E_{2,2}   ...  E_{2,n-1}   E_{2,n}  |
    |   ...       ...     ...     ...        ...    |
    | E_{10,1}  E_{10,2}  ...  E_{10,n-1}  E_{10,n} |    (8)
    | V_{1,1}   V_{1,2}   ...  V_{1,n-1}   V_{1,n}  |
    | V_{2,1}   V_{2,2}   ...  V_{2,n-1}   V_{2,n}  |
    |   ...       ...     ...     ...        ...    |
    | V_{10,1}  V_{10,2}  ...  V_{10,n-1}  V_{10,n} |

where E_{i,j} is the average energy at the i-th level in the j-th time window, V_{i,j} is the variance at the i-th level in the j-th time window, and n is the number of time windows.

3) Classification: After the above steps, we obtain feature matrices representing different activities. We utilize a Hidden Markov Model (HMM) to train on these features, because most human activities can be divided into different phases; for example, a door switch can be divided into four phases (silent, acceleration, deceleration and stop), which correspond to the concept of states in an HMM. A Markov model is a special kind of Bayesian network. The variable Y_t denotes the t-th node in the network; each node has S possible states, and different states have different transition probabilities. For the variables Y_1 ··· Y_T, we have

    P(Y_1 ··· Y_T) = P(Y_T | Y_{T-1}) P(Y_{T-1} | Y_{T-2}) ··· P(Y_2 | Y_1) P(Y_1)    (9)

The probability distribution of the state at moment t depends only on the state at t-1, which is called the transition probability. An HMM extends the Markov model in that the actual state at moment t is unknown; instead, we can only obtain the observation X_t, and observations and states are not one-to-one: a state may produce several kinds of observations, and an observation may correspond to multiple states with different probabilities. Each time a state is entered, a feature vector is generated probabilistically. The HMM utilizes transition probabilities, which provide more information than the traditional training methods.

4) Monitor with Activity Recognition Model: From the analysis in III-B, the impacts of different activities on the CSI spectrum clearly differ. Hence, we utilize an HMM to construct the activity recognition dataset with daily activities such as walking, falling, running and the door-switch activity (entering/leaving the room), under the condition of a single person entering or leaving the room. We can then use this dataset directly to monitor the door-switch activity in real time. The high accuracy of activity recognition, shown in Fig. 14(a), provides a judgment for the deep learning model to correct its predicted result against the previous result: once the activity recognition model finds that someone has entered the room, the increment should be equal to 1.

D. Deep Learning Model Construction

1) Counting Model Preprocessing: The counting model preprocessing differs from IV-C1 because we need to preserve the original features of the signal as much as possible for classification. The PCA denoising method loses original features of the waveform, as shown in Fig. 6(d), which is harmful for deep learning classification. Hence, we apply the weighted moving average method to remove amplitude noise for the deep learning model. At the same time, we add the phase information as features to improve the performance of the deep learning model.

• Amplitude Noise Removal: We utilize the Weighted Moving Average (WMA) algorithm to remove noise. Specifically, we apply equation (10) to the amplitude values of the first subcarrier, Sub_1 = [A_1, A_2, ..., A_t], to obtain the weighted averaged amplitude values:

    A'_t = (1 / (m + (m-1) + ... + 1)) · [m·A_t + (m-1)·A_{t-1} + ... + A_{t-m+1}]    (10)

where, for time t, A'_t is the weighted averaged value and m controls the weighting between the current value and the historical values. We empirically fixed m = 100 in our system. Fig. 6(c) shows the result of the weighted moving average filter: the waveform is much smoother compared to the raw data.

• Phase Sanitization: Although we obtain the phase data, we cannot use it directly. Owing to the carrier frequency offset (CFO) [36] and the sampling frequency offset (SFO), the measured phase P_M can be expressed as:

    P_M = P + 2π (m_i / N) Δt + β + N    (11)

where P is the genuine phase, Δt is the time lag due to SFO, β is the unknown phase offset due to CFO, and N is the noise. From equation (11), owing to the unknown Δt and β, we cannot get the real phase. However, a linear fit can eliminate the effects of SFO and CFO. Fig. 7(a) shows the raw phase P_M values when someone is walking in

The learned results will be fed into the next dense layer. We use two CNN blocks for feature extraction; each block contains filter and max-pooling components. The first filter component has 6 filters of size 5 × 5 with stride 1, and the first max-pooling component has size 2 × 2 with stride 2. Hence, after the first CNN block, we can get a
the room. We can see that the initial phase values are 98×30×6 vector. Then, we pass this vector to the second
folded within the range of [−π, π] . In order to get the CNN block, in this block, we have 10 filters which size
true phase values, we unfolded the CSI phases which is is 5 × 3 and stride is 3. Then we get a 32 × 10 × 10 high
shown at Fig.7(b) firstly. Next, we need to remove the dimensional features.
impacts of SFO and CFO. We get the mean phase values • Dense Layer: After two CNN blocks, we can get a vector
y for antennas on each subcarrier. Then we utilize the shaped as 32 × 10 × 10, then we flat this vector into
linear fit to get the true phase values. The whole algorithm the shape of 3200 × 1 and pass this vector to three
is shown in Algorithm 1. Fig.7(c) presents the modified fully-connected layers. Each layer contains 1000, 200,
phase values by our algorithm. 5 neurons. After dense layer, we can get a 5 × 1 vector.
• Softmax Layer: A softmax layer to output the predicted
Algorithm 1 Phase Sanitization probabilities. The learned features by the previous CNN
Require: layers can be directly passed to a classifier like a softmax
The raw phase values PM ; output layer to determine the likeliness. We utilize a 5-
The number of subcarriers Sub; units softmax output layer to build a predictor on CSI
The number of Tx-Rx pairs M ; amplitude and phase information.
Ensure: 3) Online Testing: We collect a large number of samples
The calibrated phase values PC ; to find the relationship between number of people and CSI in
1: for i = 1 to M do the period of offline training and improve the performance of
2: UP = unwrap(PM (:, i)); this end-to-end learning by our activity recognition model. At
3: end for the online testing stage, we use this deep learning model to
4: y = mean(PM , 2); infer the number of people at present. If the predicted result
5: for i = 1 to Sub do is contradicted to the result obtained by activity recognition
6: x = (0 : Sub − 1); model. DeepCount will add this sample to retrain our deep
7: p = polyf it(x, y, 1);; learning model. We only retrain the parameters of the last
8: yf = p(1) ∗ x;; dense layer in our deep learning model. The reason to update
9: for j = 1 to M do the parameters of last dense layer rather than whole layers’
10: PC (:, j) = PM (:, j) − yf ; parameters is that the low-level features deep learning model
11: end for extracted are similar. Hence, we can just retrain the last layer
12: end for to get a better performance. At the same time, the time cost is
negligible because we just retrain the single sample to the last
2) Offline Training: Fig.8 illustrates the proposed network layer. By automatically adjusting the model’s parameters over
architecture for Crowd Counting Training. This component a period of time, our recognition accuracy can up to 90%.
comprises these four parts sequentially.
• LSTM Layer: We use a LSTM layer to extract long short
V. IMPLEMENTATION & EVALUATION
dependencies of activity segments. After processing of
the amplitude noise removal and phase sanitization, the A. Implementation
CSI information of different activities are then passed In the experiment, the laptop used in our experiment is
to an LSTM layer to get long short dependencies. The equipped with Ubuntu 12.04 operating system. In terms of
output of LSTM is then input into convolutional layers hardware, our device is equipped with Intel 5300 NIC as
for higher level features. Here we use one layer of LSTM. receiver. We connected the laptop to a mini R1C wireless
Because , for LSTM networks, one layer is powerful router with two antennas, using the router’s cable as a trans-
enough and much easier to tune hyper parameters. The mitter.The receiver is equipped with three antennas and its
input data has 360 dimensions including 180 dimensions firmware is modified to report CSI to the upper layer. All
for amplitude and 180 dimensions for phase information. experiments in this paper were carried out under the premise
Hence, the shape of input data is time × 360. Let N be of a frequency band of 5 GHz and a channel bandwidth of 20
the number of units in the LSTM layer, the output of the MHz. In order to make the wavelength short enough to ensure
layer for a single activity becomes a time×N ×1 vector. better resolution, we chose 5GHz instead of 2.4GHz, while
In our experiment, we set N equals 64 for speed up. 5GHz has more channels to reduce the possibility of inference.
• CNN Layer: CNN layers is selected to get higher level During the experiments, the transmitter sends packets with a
representations. The convolutional layers target at select- high rate of 1500 packets/second to the receiver continuously
ing high level representations from the output of LSTM. with Iperf tool. DeepCount acquires CSI measurements and
8

(a) Raw wrapped CSI phase (b) Unwrapped CSI phase (c) Modified CSI phase

Fig. 7: Phase sanitization (raw, unwrapped, and sanitized phase versus subcarrier index for Antenna1, Antenna2, and Antenna3)
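The two preprocessing steps of Section IV-D1, the weighted moving average of Equation 10 and the linear-fit phase sanitization of Algorithm 1 whose effect Fig. 7 shows, can be sketched in NumPy as follows. This is an illustrative re-implementation, not DeepCount's released code: the function names and the single-antenna simplification (fitting one phase vector instead of the mean across Tx-Rx pairs) are my own.

```python
import numpy as np

def wma(amps, m=100):
    """Weighted moving average (Eq. 10): the newest sample in each
    window gets weight m, the oldest gets weight 1. Samples that come
    before one full window exists are left unfiltered."""
    weights = np.arange(m, 0, -1)            # [m, m-1, ..., 1]
    denom = weights.sum()                    # m + (m-1) + ... + 1
    x = np.asarray(amps, dtype=float)
    out = x.copy()
    for t in range(m - 1, len(x)):
        window = x[t - m + 1:t + 1][::-1]    # newest sample first
        out[t] = weights @ window / denom
    return out

def sanitize_phase(raw_phase):
    """Linear-fit phase sanitization (Algorithm 1): unwrap the phase
    across subcarriers, fit a line against the subcarrier index, and
    subtract the fitted linear trend introduced by SFO/CFO."""
    unwrapped = np.unwrap(np.asarray(raw_phase, dtype=float))
    idx = np.arange(len(unwrapped))
    slope, intercept = np.polyfit(idx, unwrapped, 1)
    return unwrapped - slope * idx
```

On a purely linear wrapped phase such as 0.3·i + 0.5, `sanitize_phase` returns the constant offset 0.5, mirroring how Fig. 7(c) flattens the unwrapped slope of Fig. 7(b). In the paper the fit is computed on the mean phase y across antennas and the same yf is subtracted from every Tx-Rx pair.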

Fig. 8: The architecture of network

Fig. 10: The structure of Dataset-fixed (classes one-people to five-people; activities: Waving, Typing, Sitting down, Walking, Talking, Eating)
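The tensor sizes quoted for the network of Fig. 8 (a 200-step input window, 64 LSTM units, then two CNN blocks) can be checked with the standard size formula for "valid" (no-padding) convolutions and pooling. This is only a dimension check under that padding assumption, not the training code:

```python
def out_size(n, k, s):
    """Length after a 'valid' (no padding) convolution or pooling:
    input length n, kernel size k, stride s."""
    return (n - k) // s + 1

# Input to the CNN: the LSTM output for one window, 200 steps x 64 units.
h, w = 200, 64

# Block 1: six 5x5 filters with stride 1, then 2x2 max pooling with stride 2.
h, w = out_size(h, 5, 1), out_size(w, 5, 1)   # 196 x 60
h, w = out_size(h, 2, 2), out_size(w, 2, 2)   # 98 x 30 (with 6 channels)

# Block 2: ten 5x3 filters with stride 3.
h, w = out_size(h, 5, 3), out_size(w, 3, 3)   # 32 x 10 (with 10 channels)

flattened = h * w * 10                        # 3200, the dense layers' input size
```

The chain reproduces the 98 × 30 × 6, 32 × 10 × 10, and 3200 × 1 shapes stated in Section IV-D2.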

processes it using Matlab, Python 3.6, and TensorFlow. We analysed the experimental data on a server with two NVIDIA GeForce GTX 1080 Ti GPUs. We use the cross-entropy function as the loss function of the deep learning model to calculate the probability error in classification, while using the gradient descent optimizer (GDO/SGD) to reduce the cost.

(a) Lab1 (b) Lab2

Fig. 9: Floor plans

B. Evaluation setup

Fig. 9 shows the two labs, Lab1 and Lab2, in which we collected the training samples. For our activity recognition model, we collected a total of 800 samples from 10 volunteers, 7 male students and 3 female students, performing 8 different activities. These activities are listed in Table I, along with the number of samples for each activity. For our deep learning model, we divide the experimental process into three steps in order to validate our experiments. First, due to the uncertainty of human status and location, we let volunteers perform fixed activities at designated locations or follow a designated route; the resulting dataset, dataset-fixed, is shown in Fig. 10. Second, we gradually relax the conditions: volunteers choose actions, which may be combinations of the fixed actions above, at a fixed position; this dataset is called dataset-semi. At last, we make no restrictions: volunteers can engage in any activity anywhere in the room; this dataset is called dataset-open. For each sample, we collected 4 minutes of time-series data; a time window of size 200 then splits the data, and the amplitude and phase information are averaged within each time window to enlarge the training set. The split samples of the different activities are shown in Table II.

TABLE I: The dataset of activity recognition model

Activity:       Empty  Walking  Sitting down  Falling  Running  Entering into room  Leaving room  Waving
Abbreviations:  E      W        S             F        R        O                   L             A
Samples:        100    100      100           100      100      100                 100           100

TABLE II: Samples of dataset-fixed

Activity      Samples
Waving        24741
Typing        28565
Sitting down  27108
Walking       27537
Talking       23580
Eating        26802

C. Baseline method

1) Baseline method description: To examine the effectiveness of our method, we add a baseline method to the experiment to form a control group. A Fully Connected Back-Propagation (FCBP) neural network, which contains a large number of parameters, is suitable for crowd counting. Therefore, we use an FCBP neural network as the baseline method against which to compare our CNN-LSTM network. The FCBP neural network we use has two hidden layers, each with a different number of neurons: 300 neurons (denoted as Layer 1 nodes) in the first hidden layer and 100 neurons (denoted as Layer 2 nodes) in the second. The input layer is denoted XN, where N equals 360 because the CSI values contain both phase and amplitude information. The output has 5 classifications, denoted Y1 to Y5, to identify up to 5 people. Note that this structure could easily be extended to count more than 5 people by using more output nodes. Fig. 11 illustrates the structure of the whole network.

Fig. 11: The architecture of baseline method network

2) Experiment results of baseline method: The baseline method achieves an average accuracy of 88.8%, 80.2%, and 78% across dataset-fixed, dataset-semi, and dataset-open, respectively. Fig. 12 plots the confusion matrices for the different training datasets collected in our Lab1 and Lab2. Fig. 13 shows the loss curves during training. The loss function on dataset-fixed converges after 3000 iterations, while dataset-semi and dataset-open converge only after 10000 iterations. This result implies that the data from dataset-semi and dataset-open are more diverse; hence, the computational costs are relatively higher.

D. Experiment with CNN-LSTM

1) Parameters tuning: As is well known, an effective network needs well-tuned parameters found via extensive searching. The parameters include the number of layers, the number of units for each layer, the filter and stride sizes of the convolutional layers, etc. The difficulty grows quickly with the complexity of a neural network. To balance the efficiency and performance of the training process, we focus on tuning the major parameters that influence performance most. We list those parameters for each layer below.

• LSTM Layer: The performance of the LSTM layer is mostly sensitive to the maximum length of the input sequence, the number of LSTM cells, and the dropout rate. An LSTM with more cells has stronger information storage capacity. In our experiment, we use 1 LSTM layer with 64 cells to remember long-term dependencies. We also set the length of the input sequences to 200 for better performance and choose a dropout rate of 0.1.

• CNN Layer: We use two CNN blocks to extract high-level features. We tuned the filter and stride sizes of the convolutional layers, and the pool and stride sizes of the pooling layers. In the first block, the CNN has a 5 × 5 filter with 1 × 1 stride, followed by max pooling with a 2 × 2 filter and 2 × 2 stride; the second block has a 5 × 3 filter with 3 × 3 stride.

• Dense Layer: Three fully-connected layers with 1000, 200, and 5 neurons follow the CNN blocks to reduce the data dimension and fit the relationship between CSI and crowd counting.

• Other significant parameters: In our experiment, we set the batch size to 64 and the learning rate to 0.2, 0.15, and 0.1 for dataset-fixed, dataset-open, and dataset-semi, respectively, to get better performance.

2) Overall performance: For the activity recognition model, DeepCount takes 80% of the samples in each class as the training set and the rest as the test set. On the training set, we use 10-fold cross-validation to get optimal parameters for the activity model, including the states in the HMM. From Fig. 14(a), the average accuracy of activity recognition is 89.14%; there is an 8% probability of mistaking Entering into room for Leaving room and a 7% probability of mistaking Leaving room for Entering into room. We observe that the accuracies of Entering into room and Leaving room are 88% and 87% among the different activities. For the CNN-LSTM model, DeepCount achieves an average accuracy of 88.8%, 85.2%, and 85.2% across dataset-fixed, dataset-semi, and dataset-open, respectively. Fig. 15 plots the confusion matrices for the different training datasets in our Lab1 and Lab2. We find that DeepCount's predicted results and the real labels differ by no more than 2 people. Thus, we can see that our CNN-LSTM model fits the real environment well. Due to the existence of multiple human states in the

(a) The confusion matrix of baseline method for Dataset-fixed; (b) for Dataset-semi; (c) for Dataset-open

Fig. 12: Confusion matrix of baseline method network
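For context on why the text describes the FCBP baseline as containing a large number of parameters, here is a back-of-the-envelope count for its 360-300-100-5 dense topology (Fig. 11). The inclusion of bias terms is my assumption; the paper does not state it.

```python
# FCBP baseline of Fig. 11: 360 inputs (180 amplitude + 180 phase
# dimensions), hidden layers of 300 and 100 neurons, 5 output classes.
layers = [360, 300, 100, 5]

# A dense layer mapping a units to b units has a*b weights plus b biases.
params = sum(a * b + b for a, b in zip(layers, layers[1:]))

# 360*300+300 + 300*100+100 + 100*5+5 = 138905 trainable parameters
```

Nearly all of the count sits in the first 360-to-300 layer, which is typical of fully connected networks applied to wide inputs.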

Fig. 13: The loss curve of baseline method network (loss versus iteration for Dataset-fixed, Dataset-semi, and Dataset-open)

(a) The accuracy of door switch among different activities:

     E     W     S     F     R     O     L     A
E   1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
W   0.00  0.87  0.00  0.02  0.07  0.00  0.01  0.03
S   0.00  0.00  0.92  0.04  0.00  0.01  0.03  0.00
F   0.00  0.03  0.07  0.89  0.00  0.01  0.00  0.00
R   0.00  0.06  0.00  0.00  0.93  0.00  0.00  0.01
O   0.00  0.00  0.01  0.01  0.00  0.88  0.08  0.02
L   0.00  0.00  0.00  0.01  0.00  0.07  0.87  0.05
A   0.00  0.00  0.00  0.00  0.00  0.04  0.08  0.88

(b) The accuracy of door switch among different training sets (100 vs. 200 instances)

Fig. 14: Activity recognition model

indoor environment, the data we collect cannot guarantee that all scenarios are covered, which has a greater impact on our accuracy. On the other hand, despite such unfavorable factors, we can still achieve satisfactory results through deep learning methods. Fig. 16 shows the loss curves during training. We find that the loss functions on dataset-fixed, dataset-semi, and dataset-open converge after 3000, 2500, and 2500 iterations, respectively. We also use an SVM with a Gaussian kernel as a control group to train the data, but its accuracy is less than 50%. In order to determine whether the neural network can distinguish more people, we also test a scenario with 10 people in a completely open state. The accuracy reaches 75%, which shows that the neural network is powerful enough for crowd counting.

E. Comparison experiment result with baseline methods

For neural networks, the most direct benchmarks of performance are accuracy and convergence speed, so we made the following comparisons of the training results of the baseline method and CNN-LSTM.

1) Comparison of accuracy: From Fig. 12(a) and Fig. 15(a), we find that when human activities and locations are fixed in the indoor environment, the accuracies of the baseline network and CNN-LSTM are nearly equal. But Fig. 12(c) and Fig. 15(c) show that on dataset-open, the 85.2% accuracy of CNN-LSTM is better than the 78% accuracy of the baseline network. Meanwhile, from Fig. 12(b) and Fig. 15(b), we find that the 85.2% accuracy of CNN-LSTM is better than the 80.2% accuracy of the baseline network on dataset-semi. Thus, from Fig. 12 and Fig. 15, we conclude that the accuracy of CNN-LSTM is higher and more stable than that of the baseline method.

2) Comparison of converged iterations: Fig. 13 and Fig. 16 show the different converged iterations of the baseline method and CNN-LSTM. For dataset-fixed, both converge after 3000 iterations, so the baseline method and CNN-LSTM have the same convergence efficiency there. But for dataset-open and dataset-semi, the baseline method converges after 10000 iterations while CNN-LSTM needs only 2500 iterations. From these performance results, we conclude that the CNN-LSTM network converges faster than the baseline method.

F. Experiment of Different Components

1) Impacts of Activity Training Set Size: To examine the effect of the training set size for the different activity samples on the accuracy of DeepCount, we increase the number of samples from 100 to 200 per activity. Fig. 14(b) plots the accuracy

(a) The confusion matrix of Dataset-fixed (b) The confusion matrix of Dataset-semi (c) The confusion matrix of Dataset-open

Fig. 15: Confusion matrix of CNN-LSTM network

Fig. 16: The loss curve of CNN-LSTM network (loss versus iteration for Dataset-fixed, Dataset-semi, and Dataset-open)

with different training set sizes. We observe that the average accuracy of door switch increases from 87.5% to 92.5%. When we increase the number of samples, however, we also increase the computational cost of DeepCount.

2) Impact of time window: To enlarge the dataset, we use a length of 200 as the time window size to split the training data. To compare the effect of different time window sizes on recognition accuracy, we also set the time window size to 500 and 1000 as comparative experiments. The accuracy is shown in Fig. 17. We found that the recognition accuracy reaches its highest point when the time window size equals 200. Although a larger time window reduces the number of samples we need to train, at a sampling rate of 1500 packets/second it is difficult to ensure that the CSI values remain relatively stable for the duration of a longer window. Therefore, we cannot increase the time window size merely to reduce the number of training samples.

Fig. 17: The impact of the length of time window (accuracy on the fixed, semi, and open datasets for window sizes 200, 500, and 1000)

3) Impact of preprocessing: The main influencing factors in preprocessing are phase sanitization and amplitude noise removal. Therefore, it is necessary to verify whether these processes have a large impact on the accuracy of DeepCount. We also want to know what effect training only amplitude information or only phase information has on the accuracy of DeepCount. In Fig. 18(a), legend With P&A denotes the data with phase sanitization and amplitude noise removal, Without A denotes the data without amplitude noise removal, Without P denotes the data without phase sanitization, and Raw Data denotes the data without preprocessing; we find that without phase sanitization and amplitude noise removal, the accuracy of DeepCount drops. This shows that eliminating significant noise in the data markedly improves the performance of the neural network. As shown in Fig. 18(b), where legend With P&A denotes using both phase and amplitude information as features, Without P denotes using only amplitude as features, and Without A denotes using only phase information as features, we found that the recognition rate using both amplitude and phase information is higher than using only one of them. This shows that increasing the number of features within a certain range can improve the performance of DeepCount, and that the amplitude and phase information are directly related to the number of people.

4) Impact of amendment mechanism: By analyzing the results of the activity recognition model, we find that the HMM can effectively distinguish the door switch from other activities. Hence, we can use this model to judge the accuracy of the deep learning model. If the predicted result is inconsistent with the activity recognition model, we add the sample to retrain the last-layer parameters. Since we retrain only the last layer on a single sample, the time cost is acceptable. By this mechanism, we eventually improve the recognition accuracy from 82.3% to 87% for the baseline method. Further, the recognition accuracy of the CNN-LSTM network can be improved from 86.4% to 90%.

5) Compared with Electronic Frog Eye: DeepCount is much different from the Electronic Frog Eye [17]. First,

(a) The accuracy of data with or without preprocessing (b) The accuracy of different features

Fig. 18: Impact of preprocessing

FCC attempted to find a monotonic relationship between CSI variations and the number of people. Based on this, they proposed an algorithm called Dilatation-based crowd profiling to characterize the variations. However, we find that the relationships are far more complicated owing to the uncertain states in the indoor environment, and we cannot use only the amplitude variations to determine the number of people in our experiment. Second, the Grey theory used to predict the number of people is not reliable. Grey theory establishes a prediction model that gives a vague description of the development of things from a small amount of incomplete information; crowd counting, however, is a relatively precise task, and in many cases we need to know the exact number of people. Last, FCC estimated the number of people with multiple devices but did not solve well the signal interference and synchronization problems between those devices. In our experiment, we utilize the powerful learning ability of neural networks to solve this problem, and at the same time we add the activity recognition model to amend wrong predictions. The experimental results show that we can reach an accuracy of up to 90%, and our results are more reliable.

VI. CONCLUSIONS

In this paper, we present DeepCount, a novel system that solves multi-human environment sensing problems using a deep learning approach with WiFi signals. To the best of our knowledge, it is the first solution that uses neural networks for crowd counting. To further improve the performance of DeepCount, we add an online learning mechanism to get better results. DeepCount also explores why deep learning can solve this complex problem. Preliminary results show that, compared to traditional classification algorithms such as SVM, DeepCount can achieve an average recognition accuracy of 86.4% for up to 5 people, a higher recognition rate. With the help of the activity recognition model, we can raise the accuracy up to 90%. Our approach shows acceptable accuracy in the context of complex changes in the indoor environment, which means it works fairly robustly. Deep learning approaches require a huge number of samples to fit this complex function, and in our experiment we collected massive samples to characterize the correlations between CSI and crowd counting. In theory, if we can take enough circumstances in the indoor environment into account and take these as samples to build a robust model, we can reuse the model for the same environment. We plan to explore how to improve the maximum distinguishable number of people as our future work.

ACKNOWLEDGMENT

This work is supported in part by the National Key R&D Program of China under Grant 2017YFB0802300, the National Science Foundation of China under Grants No. 61602238 and 61672283, the key project of the Jiangsu Research Program under Grant BK20160805, and the China Postdoctoral Science Foundation (No. 2016M590451).

REFERENCES

[1] Yan.Wang, Jian.Liu, Yingying.Chen, Marco.Gruteser, Jie.Yang, and Hongbo.Liu. E-eyes: In-home device-free activity identification using fine-grained wifi signatures. In Proc. of 20th Annual International Conference on Mobile Computing and Networking (MobiCom), pages 617–628, 2014.
[2] Dan.Wu, Daqing.Zhang, Chenren.Xu, Yasha.Wang, and Hao.Wang. Widir: walking direction estimation using wireless signals. In Proc. of 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp), pages 351–362, 2016.
[3] Guanhua.Wang, Yongpan.Zou, Zimu.Zhou, Kaishun.Wu, and Lionel.M.Ni. We can hear you with wi-fi! In Proc. of 20th Annual International Conference on Mobile Computing and Networking (Mobicom), pages 593–604, 2014.
[4] Hong.Li, Wei.Yang, Jianxin.Wang, Yang.Xu, and Liusheng.Huang. Wifinger: talk to your smart devices with finger-grained gesture. In Proc. of ACM Conference on Ubiquitous Computing (Ubicomp), pages 250–261, 2016.
[5] Wei.Wang, Alex.X.Liu, Muhammad.Shahzad, Kang.Ling, and Sanglu.Lu. Understanding and modeling of wifi signal based human activity recognition. In Proc. of 21st Annual International Conference on Mobile Computing and Networking (Mobicom), pages 65–76, 2015.
[6] Wei.Xi, Dong.Huang, Kun.Zhao, and Deng.Chen. Device-free human activity recognition using csi. In Proc. of Workshop on Context Sensing & Activity Recognition (CSAR), pages 31–36, 2015.
[7] Chenshu.Wu, Zheng.Yang, Zimu.Zhou, and Jiannong.Cao. Non-invasive detection of moving and stationary human with wifi. IEEE Journal on Selected Areas in Communications, 33(11):2329–2342, 2015.
[8] Ming.Li, Zhaoxiang.Zhang, Kaiqi.Huang, and Tieniu.Tan. Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection. In Proc. of International Conference on Pattern Recognition (ICPR), pages 1–4, 2018.
[9] R.M.Haralick. Statistical and structural approaches to texture. IEEE, 67(5):786–804, 2005.
[10] Dan.Kong, Douglas.Gray, and Hai.Tao. Counting pedestrians in crowds using viewpoint invariant training. In Proc. of British Machine Vision Conference, pages 1–10, 2005.
[11] Aparecido.Nilceu.Marana, L.Da.Fontoura.Costa, and Sergio.Velastin. Estimating crowd density with minkowski fractal dimension. In Proc. of Acoustics, Speech, and Signal Processing, 1999.
[12] Fadel.Adib, Zach.Kabelac, Dina.Katabi, and Robert.C.Miller. 3d tracking via body radio reflections. In Proc. of Usenix Conference on Networked Systems Design & Implementation (NSDI), pages 317–329, 2014.
[13] Chenren.Xu, Bernhard.Firner, Robert.S.Moore, and Ning.An. Scpl: Indoor device-free multi-subject counting and localization using radio signal strength. In Proc. of International Conference on Information Processing in Sensor Networks (IPSN), 2013.
[14] Chenren.Xu, Sugang.Li, Gang.Liu, and Bernhard.Firner. Crowd++: Unsupervised speaker count with smartphones. In Proc. of 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp), pages 43–52, 2013.
[15] Jens.Weppner and Paul.Lukowicz. Collaborative crowd density estimation with mobile phones. In IEEE International Conference on Pervasive Computing & Communications, pages 193–200, 2012.
[16] Pravein.Govindan.Kannan, Mun.Choon.Chan, and Li-Shiuan.Peh. Low cost crowd counting using audio tones. In Proc. of ACM Conference on Embedded Network Sensor Systems (SenSys), pages 155–168, 2012.

[17] Wei.Xi, Jizhong.Zhao, Xiangyang.Li, and Kun.Zhao. Electronic frog


eye: Counting crowd using wifi. In Proc. of IEEE International
Conference on Computer Communications (INFOCOM), pages 361–
369, 2014.
[18] Zimu.Zhou, Zheng.Yang, Chenshu.Wu, Longfei.Shangguan, and Yun-
hao.Liu. Towards omnidirectional passive human detection. In Proc.of
IEEE International Conference on Computer Communications (INFO-
COM), pages 3057–3065, 2013.
[19] Rajalakshmi.Nandakumar, Bryce.Kellogg, and Shyamnath.Gollakota.
Wifi gesture recognition on existing devices. Eprint Arxiv, 2(3):17–17,
2014.
[20] Jian.Liu, Yan.Wang, Yingying.Chen, and Jinquan.Cheng. Tracking vital signs during sleep leveraging off-the-shelf wifi. In Proc. of 16th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pages 267–276, 2015.
[21] Xuefeng.Liu, Jiannong.Cao, Shaojie.Tang, and Jiaqi.Wen. Wi-sleep: Contactless sleep monitoring via wifi signals. In Proc. of Real-Time Systems Symposium (RTSS), pages 346–355, 2014.
[22] Chunmei.Han, Kaishun.Wu, Yuxi.Wang, and Lionel.M.Ni. Wifall: Device-free fall detection by wireless networks. IEEE Transactions on Mobile Computing, 16:271–279, 2017.
[23] Jiang.Xiao, Kaishun.Wu, Youwen.Yi, Lu.Wang, and Lionel.M.Ni. Passive device-free indoor localization using channel state information. In Proc. of IEEE International Conference on Distributed Computing Systems, pages 236–245, 2013.
[24] Wu.Yang, Liangyi.Gong, Dapeng.Man, Jiguang.Lv, Haibin.Cai, Xian-
cun.Zhou, and Zheng.Yang. Enhancing the performance of indoor
device-free passive localization. International Journal of Distributed
Sensor Networks, pages 1–11, 2015.
[25] Manikanta.Kotaru, Kiran.Joshi, Dinesh.Bharadia, and Sachin.Katti. Spotfi: decimeter level localization using wifi. In Proc. of 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM), pages 269–282, 2015.
[26] Xiang.Li, Shengjie.Li, Daqing.Zhang, Jie.Xiong, Yasha.Wang, and Hong.Mei. Dynamic-music: Accurate device-free indoor localization. In Proc. of 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp), pages 196–207, 2016.
[27] Ju.Wang, Hongbo.Jiang, Jie.Xiong, Kyle.Jamieson, and Xiaojiang.Chen. Lifs: low human-effort, device-free localization with fine-grained subcarrier information. In Proc. of 22nd Annual International Conference on Mobile Computing and Networking, pages 243–256, 2016.
[28] Xuyu.Wang, Lingjun.Gao, and Shiwen.Mao. Deepfi: Deep learning for indoor fingerprinting using channel state information. In Proc. of 2015 IEEE Wireless Communications and Networking Conference (WCNC), pages 1666–1671, 2015.
[29] Hongbo.Liu, Yan.Wang, Jian.Liu, Jie.Yang, and Yingying.Chen. Practical user authentication leveraging channel state information (csi). In Proc. of 9th ACM Symposium on Information, Computer and Communications Security (CCS), pages 389–400, 2014.
[30] Yunze.Zeng, Parth.H.Pathak, and Prasant.Mohapatra. Wiwho: Wifi-
based person identification in smart spaces. In Proc.of 5th International
Conference on Information Processing in Sensor Networks (IPSN), 2016.
[31] Wei.Wang, Alex.X.Liu, and Muhammad.Shahzad. Gait recognition
using wifi signals. In Proc. of 2016 ACM International Joint Conference
on Pervasive and Ubiquitous Computing (UbiComp), pages 363–373,
2016.
[32] Simone.Di.Domenico, Mauro.De.Sanctis, Ernestina.Cianca, and
Giuseppe.Bianchi. A trained-once crowd counting method using
differential wifi channel state information. In Proc.of the 3rd
International on Workshop on Physical Analytics, pages 37–42, 2016.
[33] Cong.Shi, Jian.Liu, Hongbo.Liu, and Yingying.Chen. Smart user au-
thentication through actuation of daily activities leveraging wifi-enabled
iot. In Proc. of 18th ACM International Symposium on Mobile Ad Hoc
Networking and Computing (Mobihoc), pages 1–12, 2017.
[34] David.E.Rumelhart, Ronald.J.Williams, and Geoffrey.E.Hinton. Learn-
ing representations by back-propagating errors. Neurocomputing: foun-
dations of research, pages 696–699, 1986.
[35] Geoffrey.E.Hinton, Simon.Osindero, and Yee-Whye.Teh. A fast learning
algorithm for deep belief nets. Neural Computation, 18:1527–1554,
2006.
[36] David.Tse and Pramod.Viswanath. Fundamentals of wireless communi-
cation. IEEE Transactions on Information Theory, 55:919–920, 2009.
