DeepCount - Crowd Counting With WiFi Via Deep Learning - 2019
Abstract—Recent research on wireless sensing has achieved increasingly intelligent results: human location and activity can now be sensed by means of WiFi devices. However, most existing work on human environment perception is limited to single-person scenarios, because an environment with multiple people is far more complicated than one with a single person. To address human behavior perception in multi-person environments, we propose DeepCount, the first solution that uses deep learning to count the crowd (i.e., infer the number of people) in a closed environment from WiFi signals. Since inferring the crowd size directly from WiFi signals is too complicated, we solve the problem with deep learning: a Convolutional Neural Network (CNN) automatically extracts the relationship between the number of people and the channel, and a Long Short-Term Memory (LSTM) network resolves the dependencies between the number of people and the Channel State Information (CSI). To cope with the massive amount of labelled data required by deep learning, we add an online learning mechanism that uses an activity recognition model to determine whether someone is entering or leaving the room and then corrects the deep learning model in a fine-tuning stage. This, in turn, reduces the required training data and makes our method evolve over time. DeepCount is implemented and evaluated on commercial WiFi devices. With massive training samples, our end-to-end learning approach achieves an average prediction accuracy of 86.4% in an environment with up to 5 people. Moreover, with the amendment mechanism, in which the activity recognition model detects door-switch events to obtain the change in crowd size and amend the predicted results, the accuracy rises to 90%.

Index Terms—Crowd Counting, WiFi Sensing, Deep Learning, Human Activity Recognition

I. INTRODUCTION

WiFi signals now span entire cities and have revealed sensing capabilities such as activity recognition, human identification, localization and beyond. However, current WiFi sensing applications are only effective in single-person scenarios, which greatly restricts the use of WiFi for environmental sensing [1] [2] [3] [4]. Indoor crowd counting is the basis of multi-object environmental sensing and of various potential applications, e.g. tour guiding and crowd control. Meanwhile, crowd counting in a WiFi environment is a very challenging task, since the WiFi signal varies almost arbitrarily with the uncertain states of the people in the room. Traditional signal processing and pattern recognition methods for activity recognition [5] [6] [7] are therefore powerless to extract the information needed for crowd counting from multiple overlapping signals.

Traditionally, the most popular crowd-counting approaches are based on computer vision techniques applied to camera images [8] [9] [10] [11]. More recently, electromagnetic-wave methods have been implemented with special hardware [12] [13]. However, camera-based approaches may suffer from blind spots in corners or from the absence of light, and they also introduce privacy issues. Special hardware systems such as WiTrack [12] mainly measure the Time-of-Flight (TOF) with FMCW (Frequency Modulated Continuous Wave) signals to provide delicate and well-constructed measurements, but such devices suffer from high deployment cost and thus cannot match the ubiquity of WiFi deployments. Some researchers have also used smartphones [14]–[16] to infer the number of speakers in a dialog; however, these methods are not device-free and are not friendly to the elderly or children. Based on these observations, we believe that WiFi-based crowd counting is very meaningful and can address the above problems well.

In this paper, our proposed DeepCount solution exploits the fact that different human activities produce different multi-path distortions and unique waveform patterns in WiFi signals. The physical amplitude and phase information is greatly distorted by human activities in the WiFi environment during signal propagation. Hence, we can utilize the time series of Channel State Information (CSI) values captured from the WiFi signals for sensing. The existing WiFi-based crowd counting system Electronic Frog Eye [17] predicts the number of people with a Grey model, which cannot fully utilize the phase and amplitude information. We therefore utilize both dimensions together with powerful deep learning methods to realize this function.

However, to turn this idea into a working system, we face a variety of technical challenges. The first challenge is how to extract feature values that can model the relationship between CSI values and people counts. After a detailed analysis of a large amount of CSI data, we found that traditional features such as entropy, maximum or variance do not meet the demand. Due to the uncertainty of the states of the people in the room, we cannot find the correlation and pattern through simple mathematical modeling. In this case, we need to preserve the original features of the data as much as possible during data processing, so we only perform simple denoising on the amplitude and phase information.
The second technical challenge is to extract the counting model from complex overlapping signals. Traditional training methods such as the Support Vector Machine (SVM) or Bayesian classifiers cannot capture the overlapping features under background noise. Recent advances in deep learning, and the success of its application in computer vision, shed light on resolving our problem. Because the relationships between CSI waveforms and crowd counts are extremely complicated, we utilize neural networks, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), to construct this complicated model.

The third challenge is the massive amount of data required to build the model with deep learning, and how to build a robust model with a proper neural network structure. Regarding data acquisition, training can easily lead to overfitting with inadequate or biased data. Thus, we design a large and properly structured data collection and processing pipeline to ensure both quality and quantity. Specifically, we collect 6 different activities from up to 10 people, covering different indoor behaviors such as walking and talking with different participants. Then, to improve diversity, we split the data into different time-window lengths and label them properly. Regarding the network structure, we use CNNs and LSTMs for the deep neural network model; in addition, regularization and exponential learning-rate decay are applied to avoid overfitting.

The fourth challenge is how to adapt our deep learning model over time. Although deep learning has strong learning ability, the learned model can degrade over time with slight environment changes. To overcome this, our basic idea is to use our activity recognition model under the condition of a single person entering or leaving the room, from which we can infer a state change for the current model. If the deep learning model gives an erroneous result, i.e., the increment or decrement of the number of people in the room is greater than 1 compared with the previous slot, we correct this result and fine-tune the parameters of the last layer of the neural network. With this mechanism, we eventually improve the accuracy of the WiFi counting model up to 90% in a relatively robust manner.

Our method fully utilizes the characteristics of the amplitude and phase of CSI with a specifically designed CNN-LSTM network, where the CNN extracts deep features while the LSTM handles the time-series signal. Meanwhile, to make the model flexible enough to adapt to the time evolution of crowd counting in an online manner, we add an online learning mechanism that corrects the deep learning model by fine-tuning the last-layer parameters of the neural network. This endows our method with time-evolving features and greatly improves its practicality.

The main contributions of this paper can be summarized as follows:
• We theoretically analyze the correlation between crowd counting and the variation of CSI and utilize deep learning to characterize this relationship. To the best of our knowledge, this is the first solution that counts people from WiFi signals using neural networks, and it proposes a new approach to solving such problems.
• We propose the DeepCount system and adopt a deep learning approach to solve multi-person context awareness problems. We use LSTM and CNN layers to automatically extract features, and then a softmax layer for crowd counting.
• To further improve the performance of DeepCount, we add an online learning mechanism driven by our activity recognition model; experiments show that with this mechanism we eventually reach 90% accuracy.
• We introduce some simple and effective denoising methods that eliminate noise while preserving the characteristics of the data as much as possible.

The remainder of this paper is organized as follows. Section II provides background on neural networks and related work on WiFi sensing. Section III analyses the characteristics of CSI and the reasons for choosing deep learning to solve this problem. The details of the design of DeepCount are discussed in Section IV. The implementation and evaluation are presented in Section V, followed by conclusions in Section VI.

II. BACKGROUND AND RELATED WORK

A. Deep Neural Networks

With the rapid development of artificial intelligence, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have demonstrated impressive advances in numerous tasks.

Recurrent Neural Networks: RNNs are designed for processing sequential information due to their ability to carry previous information into the current state. RNNs apply the same network at each time step to capture the state information and combine it with the next input. However, RNNs suffer from the vanishing gradient problem, which leads to failure in learning long sequences. The LSTM was designed to overcome this limitation by introducing a structure called the memory cell, which adds forget gates to the simple RNN. Its unique mechanism enables it to capture long-term dependencies. Hence, RNNs with LSTM cells have been widely used in processing long sequences.

Convolutional Neural Networks: CNNs have recently achieved many state-of-the-art results in computer vision and NLP thanks to their ability to automatically extract high-level features. The traditional CNN block contains three parts: a convolutional layer, an activation function and a pooling layer. The main function of the convolutional layer is to extract features automatically with a filter. The filter is a small square, commonly of shape (3,3) or (5,5), which computes dot products over the dimensions of the input data. An activation function such as Sigmoid or ReLU follows every convolutional layer to perform a non-linear transform. The pooling layer is used to reduce the dimension of the input data while retaining the most important information. There are different types of pooling, such as max and average; max pooling extracts the maximum value in the predefined area. These three parts form the basic building blocks of CNNs.
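As a small illustration of such a block (not taken from the paper), the tf.keras sketch below stacks one convolution, activation and pooling stage; the 30 × 30 single-channel input shape is only a placeholder.

```python
# A minimal conv -> activation -> pooling block as described above.
# The input shape and filter count are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models

cnn_block = models.Sequential([
    layers.Input(shape=(30, 30, 1)),               # placeholder input: height x width x channels
    layers.Conv2D(filters=8, kernel_size=(3, 3)),  # convolution with a 3x3 filter
    layers.Activation("relu"),                     # non-linear transform after the convolution
    layers.MaxPooling2D(pool_size=(2, 2)),         # keep the max value in each 2x2 area
])
cnn_block.summary()
```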
Fig. 1: Raw normalized CSI amplitude and phase information.
Fig. 2: Amplitudes of different subcarriers at different activities.
Fig. 3: Phases of different subcarriers at different activities.
Fig. 4: CDF of normalized amplitudes of CSI values at a fixed activity.
[Figure: DeepCount system overview. Raw CSI is preprocessed (Butterworth filter, phase sanitization, amplitude noise removal); the activity recognition model and the counting model are constructed during offline training and applied during online testing, where a detected entering/leaving event increases or decreases the count by 1.]

1) Neural networks can effectively solve nonlinear classification problems: In a traditional linear model, the output is the linear weighted sum of the inputs. The linear model can be expressed as

y = \sum_{i} w_{i} x_{i} + b

whereas our analysis shows that the relationships between crowd counting and the amplitude and phase information are non-linear. In theory, if the hidden layer contains enough neurons, a neural network can fit any complex nonlinear function. Hence, a properly designed neural network can model the relationship between CSI and crowd counting.
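As a sketch of the contrast drawn above (the weight matrices, biases and the activation σ are generic symbols, not notation from the paper):

```latex
% Linear model: the output is a weighted sum of the inputs.
y \;=\; \sum_{i} w_i x_i + b
% One-hidden-layer network: a non-linear activation \sigma between two
% weighted sums lets the model fit non-linear relationships.
y \;=\; \mathbf{W}_2\,\sigma\!\left(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1\right) + b_2
```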
In DeepCount, we use both phase and amplitude in the preprocessing of the counting model, unlike systems such as Electronic Frog Eye [17] that use amplitude information alone. We need to eliminate the obvious noise in this information before using it for offline training to obtain better results. First, we divide the data set into 3 levels. Dataset-fixed represents activities and positions that are both fixed. Dataset-semi means we allow volunteers to conduct free activities at a fixed position; they are free to choose actions, which may be combinations of the fixed actions. Dataset-open means volunteers can do anything anywhere in the room. DeepCount then employs a deep neural network with CNNs and LSTMs to train on the samples; with a large number of weights and biases, the deep neural network extracts feature-based fingerprints that effectively represent the relationship between crowd counting and the CSI variations. In the online test stage, we use the trained model to predict the number of people in the room. At the same time, we also monitor the door-switch activity among other human activities using the activity recognition model. Once someone enters or leaves the room, if the increment or decrement of the number of people predicted by our deep learning model is greater than 1 compared with the previous moment, the predicted result is wrong. We label this sample and add it to retrain the parameters of the last layer of our deep learning model.
B. CSI Collection

When the transmitter continuously transmits the WiFi signal, the receiver continuously receives it, and DeepCount automatically extracts the CSI values through the CSI tool installed on the receiver. We fix the sampling rate at 1500 packets/s to ensure fine-grained information about human activity. For each packet, the CSI matrix H is extracted. Since the transmitter (Tx) has two antennas and the receiver (Rx) has three antennas, we can express the matrix H as:
H = \begin{bmatrix}
H_{1,1} & H_{1,2} & \cdots & H_{1,30} \\
H_{2,1} & H_{2,2} & \cdots & H_{2,30} \\
\vdots  & \vdots  & \ddots & \vdots  \\
H_{6,1} & H_{6,2} & \cdots & H_{6,30}
\end{bmatrix} \quad (4)

where H_{i,j} represents the CSI value of the j-th subcarrier for the i-th Tx-Rx pair. Hence, we obtain three-dimensional data CSI = [H_1, H_2, ..., H_t] for a duration of t. We extract the phase and amplitude values from the raw CSI data for further processing.
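As a small illustration of this step, the NumPy sketch below splits a parsed CSI array into amplitude and phase; the (t, 6, 30) shape follows the 6 Tx-Rx pairs and 30 subcarriers above, while the random array merely stands in for real CSI-tool output.

```python
# Split complex CSI into amplitude and (wrapped) phase.
import numpy as np

t = 1500                                   # e.g. one second at 1500 packets/s
csi = np.random.randn(t, 6, 30) + 1j * np.random.randn(t, 6, 30)  # placeholder for parsed CSI

amplitude = np.abs(csi)                    # |H_{i,j}| per packet, Tx-Rx pair, subcarrier
phase = np.angle(csi)                      # raw wrapped phase in [-pi, pi]

print(amplitude.shape, phase.shape)        # (1500, 6, 30) (1500, 6, 30)
```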
C. Activity Recognition Model Construction

1) Activity Recognition Preprocessing: As shown in Fig. 6(a), the raw CSI amplitude data contain too much noise. In our activity recognition model, the speeds of human activities such as sitting, walking and door switching are not very fast, so the signal changes caused by these activities lie in a low frequency band, while the noise caused by the hardware has a relatively high frequency. Hence, a Butterworth low-pass filter is a natural choice to remove the high-frequency noise. Frequency variations in CSI streams due to normal human activities usually lie within 200 Hz, so we set the cut-off frequency to 200 Hz. The waveform after the Butterworth filter is shown in Fig. 6(b); it is obvious that the high-frequency noise is removed.
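A minimal SciPy sketch of this step follows; the 1500 packets/s rate and 200 Hz cut-off come from the text, while the 4th-order filter design is an assumption, since the paper does not state the order.

```python
# Low-pass filter one subcarrier's amplitude stream with a Butterworth filter.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 1500.0            # CSI sampling rate in packets per second
cutoff = 200.0         # cut-off frequency in Hz (activity energy lies below this)

b, a = butter(N=4, Wn=cutoff, btype="low", fs=fs)   # filter order 4 is assumed

def lowpass(amplitude_stream: np.ndarray) -> np.ndarray:
    """Zero-phase low-pass filtering of one amplitude stream."""
    return filtfilt(b, a, amplitude_stream)

filtered = lowpass(np.random.randn(3000))  # stand-in for a raw amplitude stream
```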
Fig. 6: (a) Raw CSI amplitude; (b) raw data after Butterworth filter; (c) weighted moving average filter; (d) raw data after PCA.

However, the noise between 1 and 200 Hz cannot be eliminated this way, so we utilize PCA to reduce the noise further. DeepCount applies PCA to the CSI streams in the following three steps:
• DC Component Removal: We first subtract the corresponding constant offsets from the CSI streams to remove the Direct Current (DC) component of every subcarrier. The constant offsets can be calculated by long-term averaging over that subcarrier.
• Principal Components: DeepCount calculates the correlation matrix Z = H^T × H and obtains the eigenvectors q_i through eigen-decomposition of Z. DeepCount then obtains the principal components as

h_i = H \times q_i \quad (5)

where q_i is the i-th eigenvector and h_i is the i-th principal component.
• Smoothing: Finally, we apply a 5-point median filter to smooth out abrupt changes in the CSI streams that would otherwise distort the results.

DeepCount discards the first principal component h_1 and retains the next ten principal components for feature extraction, mainly for the following reasons. From a large number of experiments, we observed that noise is mainly captured in the first component, whereas the information about human activities is captured in all principal components. Since the PCA components are uncorrelated, we can discard the first principal component without losing too much useful information. Fig. 6(d) shows the third PCA component of our method; the signal is much smoother.
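A minimal NumPy sketch of these three steps follows, under the assumption that H holds the filtered amplitude stream of one Tx-Rx pair with shape (t, 30); the helper name pca_components is ours, not the paper's.

```python
# DC removal, eigen-decomposition of Z = H^T H, median smoothing,
# then drop the first component and keep the next ten (equation (5)).
import numpy as np
from scipy.signal import medfilt

def pca_components(H: np.ndarray, keep: int = 10) -> np.ndarray:
    H = H - H.mean(axis=0, keepdims=True)        # DC component removal per subcarrier
    Z = H.T @ H                                  # correlation matrix Z = H^T x H
    eigvals, eigvecs = np.linalg.eigh(Z)         # eigen-decomposition of Z
    order = np.argsort(eigvals)[::-1]            # sort eigenvectors by decreasing eigenvalue
    q = eigvecs[:, order]
    h = H @ q                                    # h_i = H x q_i
    h = medfilt(h, kernel_size=(5, 1))           # 5-point median filter along time
    return h[:, 1:1 + keep]                      # discard h_1, keep the next ten

components = pca_components(np.random.randn(3000, 30))
print(components.shape)                          # (3000, 10)
```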
2) Feature Extraction: The responses of different activities on the frequency spectrum are different. The CSI frequency is determined as f = 2ν/λ, where ν is the speed of the human activity and λ is the WiFi signal wavelength. From this equation we find that the frequencies of running and walking are obviously different, because running is much faster than normal walking. Hence, the Discrete Wavelet Transform (DWT) is a proper choice. The DWT employs functions that are localized in both time and frequency, which overcomes the limitations of the classical Fourier Transform; meanwhile, the DWT offers a choice of appropriate wavelet bases. The discrete wavelet function is defined as

\Phi_{m,n}(t) = a_0^{-m/2}\, \phi\!\left(a_0^{-m} t - n b_0\right) \quad (6)

and the discrete wavelet coefficient is defined as

W_{m,n} = \int_{-\infty}^{+\infty} f(t)\, \Phi_{m,n}(t)\, dt \quad (7)

In the DWT, the signal is decomposed into coarse approximation coefficients and detail coefficients; the coarse approximation coefficients are then further decomposed using the same wavelet function. In DeepCount, we chose the Daubechies D4 wavelet to decompose the PCA components into 10 levels of frequencies spanning 1 Hz to 200 Hz. Then, in order to capture the details of different activities, we use a time window of size 128 to average the detail coefficients. In each window, we use the average energy and the variance as features. The corresponding feature matrix is

\begin{bmatrix}
E_{1,1} & E_{1,2} & \cdots & E_{1,n-1} & E_{1,n} \\
E_{2,1} & E_{2,2} & \cdots & E_{2,n-1} & E_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots   & \vdots  \\
E_{10,1} & E_{10,2} & \cdots & E_{10,n-1} & E_{10,n} \\
V_{1,1} & V_{1,2} & \cdots & V_{1,n-1} & V_{1,n} \\
V_{2,1} & V_{2,2} & \cdots & V_{2,n-1} & V_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots   & \vdots  \\
V_{10,1} & V_{10,2} & \cdots & V_{10,n-1} & V_{10,n}
\end{bmatrix} \quad (8)

where E_{i,j} is the average energy at the i-th level in the j-th time window, V_{i,j} is the variance at the i-th level in the j-th time window, and n is the number of time windows.
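A sketch of this feature extraction with the PyWavelets package follows. Note that PyWavelets' "db4" denotes the Daubechies wavelet with four vanishing moments; if the paper's "D4" means the four-tap filter, "db2" would be the matching name. The window alignment details are assumptions, since the text only gives the window size.

```python
# 10-level wavelet decomposition of one PCA component, then per-window
# average energy and variance of the detail coefficients.
import numpy as np
import pywt

def dwt_features(component: np.ndarray, levels: int = 10, win: int = 128):
    # wavedec returns [cA_levels, cD_levels, ..., cD_1]; keep only the detail coefficients
    coeffs = pywt.wavedec(component, wavelet="db4", level=levels)
    details = coeffs[1:]                           # coarsest-to-finest detail levels
    energies, variances = [], []
    for d in details:
        n_win = max(1, len(d) // win)
        windows = np.array_split(d[:n_win * win], n_win)
        energies.append([np.mean(w ** 2) for w in windows])   # average energy per window
        variances.append([np.var(w) for w in windows])        # variance per window
    return energies, variances

E, V = dwt_features(np.random.randn(30000))
```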
3) Classification: After the above steps, we obtain the feature matrices representing different activities. We utilize a Hidden Markov Model (HMM) to train on the features, because most human activities can be divided into different phases; for example, a door switch can be divided into four phases (silent, acceleration, deceleration and stop), which correspond to the concept of states in an HMM. A Markov model is a special kind of Bayesian network. The variable Y_t denotes the t-th node in the network; each node has S possible states, and different states have different transition probabilities. For the variables Y_1, ..., Y_T we have

P(Y_1, \ldots, Y_T) = P(Y_1) \prod_{t=2}^{T} P(Y_t \mid Y_{t-1})

As the analysis in Section III-B shows, the CSI responses of different activities are different. Hence, we utilize the HMM to construct the activity recognition dataset with some daily activities such as walking, falling, running and the door-switch activity (entering/leaving the room), under the condition of a single person entering or leaving the room. We can then use this dataset directly to monitor the door-switch activity in real time. The high accuracy of activity recognition shown in Fig. 14(a) provides a judgment for the deep learning model to correct its predicted result against the previous result: once the activity recognition model finds that someone has entered the room, the increment of the count should be equal to 1.
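The paper does not spell out how the HMM is trained; one common realization, sketched below with the hmmlearn package, fits one Gaussian HMM per activity and labels a new feature sequence by the highest log-likelihood. The 4-state choice follows the four door-switch phases named above; the data layout is a placeholder.

```python
# Per-activity Gaussian HMMs for activity classification (a sketch, not the
# paper's exact training procedure).
import numpy as np
from hmmlearn import hmm

def train_activity_hmms(train_data, n_states=4):
    """train_data: dict mapping activity name -> list of (T_i, D) feature arrays."""
    models = {}
    for activity, sequences in train_data.items():
        X = np.vstack(sequences)                       # stack sequences for hmmlearn
        lengths = [len(s) for s in sequences]          # per-sequence lengths
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[activity] = m
    return models

def classify(models, sequence):
    """Return the activity whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda a: models[a].score(sequence))
```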
D. Deep Learning Model Construction

1) Counting Model Preprocessing: The counting model preprocessing is different from IV-C1 because we need to preserve the original features of the signal as much as possible for classification. However, the PCA denoising method loses the original features of the waveform, as shown in Fig. 6(d), which is harmful for deep-learning classification. Hence, we apply the weighted moving average method to remove the noise in the amplitude for the deep learning model. At the same time, we add the phase information as features to improve the performance of the deep learning model.

• Amplitude Noise Removal: We utilize the Weighted Moving Average (WMA) algorithm to remove noise. Specifically, we apply Equation (10) to the amplitude values of the first subcarrier Sub_1 = [A_1, A_2, ..., A_t] to obtain the weighted averaged amplitude values.
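Since Equation (10) itself was not recovered here, the sketch below assumes a standard weighted moving average whose weights grow linearly toward the newest sample; it illustrates the idea, not the paper's exact weights.

```python
# Weighted moving average over one subcarrier's amplitude stream A_1..A_t.
import numpy as np

def weighted_moving_average(amplitudes: np.ndarray, m: int = 10) -> np.ndarray:
    weights = np.arange(1, m + 1, dtype=float)       # 1, 2, ..., m: newest sample weighted most (assumed)
    weights /= weights.sum()
    smoothed = amplitudes.astype(float).copy()
    for i in range(m - 1, len(amplitudes)):
        smoothed[i] = np.dot(weights, amplitudes[i - m + 1:i + 1])
    return smoothed

smoothed = weighted_moving_average(np.random.randn(3000))
```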
• Phase Sanitization: The measured phase consists of the genuine phase P, a term caused by the time lag ∆t due to the Sampling Frequency Offset (SFO), an unknown phase offset β due to the Carrier Frequency Offset (CFO), and the noise N (Equation 11). From Equation 11 we find that, owing to the unknown ∆t and β, we cannot obtain the real phase directly. However, a linear fit across the subcarriers can eliminate the effects of SFO and CFO. Fig. 7(a) shows the raw phase values P_M when someone is walking in the room; the initial phase values are folded within the range [−π, π]. To obtain the true phase values, we first unwrap the CSI phases, as shown in Fig. 7(b). Next, we remove the impacts of SFO and CFO: we compute the mean phase value y over the antennas on each subcarrier and then use a linear fit to obtain the true phase values. The whole procedure is shown in Algorithm 1, and Fig. 7(c) presents the phase values corrected by our algorithm.

Algorithm 1 Phase Sanitization
Require: The raw phase values P_M; the number of subcarriers Sub; the number of Tx-Rx pairs M.
Ensure: The calibrated phase values P_C.
1: for i = 1 to M do
2:   P_M(:, i) = unwrap(P_M(:, i));
3: end for
4: y = mean(P_M, 2);
5: for i = 1 to Sub do
6:   x = (0 : Sub − 1);
7:   p = polyfit(x, y, 1);
8:   y_f = p(1) ∗ x;
9:   for j = 1 to M do
10:    P_C(:, j) = P_M(:, j) − y_f;
11:  end for
12: end for
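A NumPy rendering of Algorithm 1 for a single packet follows, assuming the raw phase matrix has shape (Sub, M) = (30, 6): unwrap the phases, average over the Tx-Rx pairs, fit a line across the subcarrier index, and subtract the fitted trend.

```python
# Phase sanitization: remove the linear SFO/CFO trend across subcarriers.
import numpy as np

def sanitize_phase(P_M: np.ndarray) -> np.ndarray:
    Sub, M = P_M.shape
    unwrapped = np.unwrap(P_M, axis=0)           # unwrap along the subcarrier axis
    y = unwrapped.mean(axis=1)                   # mean phase over antennas per subcarrier
    x = np.arange(Sub)                           # x = 0 .. Sub-1
    slope, intercept = np.polyfit(x, y, 1)       # linear fit of the mean phase
    y_f = slope * x                              # the linear SFO/CFO trend
    return unwrapped - y_f[:, None]              # P_C(:, j) = unwrapped(:, j) - y_f

P_C = sanitize_phase(np.random.uniform(-np.pi, np.pi, size=(30, 6)))
```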
2) Offline Training: Fig. 8 illustrates the proposed network architecture for crowd counting training. This component comprises the following four parts, applied sequentially.
• LSTM Layer: We use an LSTM layer to extract long short-term dependencies of activity segments. After amplitude noise removal and phase sanitization, the CSI information of the different activities is passed to the LSTM layer, and the output of the LSTM is then fed into convolutional layers for higher-level features. We use one layer of LSTM because, for LSTM networks, one layer is powerful enough and much easier to tune. The input data has 360 dimensions, including 180 dimensions for amplitude and 180 dimensions for phase information; hence, the shape of the input data is time × 360. Let N be the number of units in the LSTM layer; the output of the layer for a single activity becomes a time × N × 1 vector. In our experiment, we set N to 64 to speed up training.
• CNN Layer: CNN layers are selected to obtain higher-level representations from the output of the LSTM; the learned results are then fed into the following dense layers. We use two CNN blocks for feature extraction, each containing filter and max-pooling components. The first filter component has 6 filters of size 5 × 5 with stride 1, and the first max-pooling component has size 2 × 2 with stride 2, so after the first CNN block we obtain a 98 × 30 × 6 tensor. We then pass this tensor to the second CNN block, which has 10 filters of size 5 × 3 with stride 3, yielding 32 × 10 × 10 high-dimensional features.
• Dense Layer: After the two CNN blocks we obtain a tensor of shape 32 × 10 × 10, which we flatten into a 3200 × 1 vector and pass to three fully-connected layers containing 1000, 200 and 5 neurons. After the dense layers, we obtain a 5 × 1 vector.
• Softmax Layer: A softmax layer outputs the predicted probabilities. The features learned by the previous layers are passed directly to a softmax output layer to determine the likelihood of each class. We utilize a 5-unit softmax output layer to build a predictor on the CSI amplitude and phase information.
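The tf.keras sketch below reproduces the layer shapes described above (200 × 360 input, LSTM with 64 units, 98 × 30 × 6 and 32 × 10 × 10 feature maps, dense layers of 1000/200/5 with softmax). The ReLU activations and the reshape that adds a channel axis are assumptions; the text only fixes the sizes.

```python
# CNN-LSTM crowd-counting network matching the sizes given in the text.
import tensorflow as tf
from tensorflow.keras import layers, models

TIME_STEPS, FEATURES, CLASSES = 200, 360, 5    # 180 amplitude + 180 phase dimensions

model = models.Sequential([
    layers.Input(shape=(TIME_STEPS, FEATURES)),
    layers.LSTM(64, return_sequences=True),                   # (200, 64)
    layers.Reshape((TIME_STEPS, 64, 1)),                      # add a channel axis for the CNN blocks
    layers.Conv2D(6, (5, 5), activation="relu"),              # (196, 60, 6)
    layers.MaxPooling2D((2, 2), strides=2),                   # (98, 30, 6)
    layers.Conv2D(10, (5, 3), strides=3, activation="relu"),  # (32, 10, 10)
    layers.Flatten(),                                         # 3200
    layers.Dense(1000, activation="relu"),
    layers.Dense(200, activation="relu"),
    layers.Dense(CLASSES, activation="softmax"),              # 5-way crowd-count prediction
])
model.summary()
```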
3) Online Testing: During offline training we collect a large number of samples to learn the relationship between the number of people and the CSI, and we improve the performance of this end-to-end learning with our activity recognition model. At the online testing stage, we use the deep learning model to infer the current number of people. If the predicted result contradicts the result obtained by the activity recognition model, DeepCount adds this sample to retrain our deep learning model. We retrain only the parameters of the last dense layer, rather than all layers, because the low-level features extracted by the deep learning model remain similar; hence, retraining just the last layer is sufficient to obtain better performance. At the same time, the time cost is negligible because we retrain only the last layer on the single new sample. By automatically adjusting the model's parameters over a period of time, our recognition accuracy reaches up to 90%.
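A sketch of this last-layer update with tf.keras follows, assuming `model` is the CNN-LSTM network above; the label encoding (count minus one) and the SGD settings are assumptions.

```python
# Online correction: fine-tune only the final dense layer on one relabelled sample.
import numpy as np
import tensorflow as tf

def finetune_last_layer(model, sample, corrected_count, lr=1e-3):
    for layer in model.layers[:-1]:
        layer.trainable = False                        # freeze every layer except the last
    model.layers[-1].trainable = True
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy")
    x = sample[np.newaxis, ...]                        # one (200, 360) CSI segment
    y = np.array([corrected_count - 1])                # class index for 1..5 people (assumed mapping)
    model.train_on_batch(x, y)
    return model
```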
Fig. 7: (a) Raw wrapped CSI phase; (b) unwrapped CSI phase; (c) modified CSI phase.

V. IMPLEMENTATION & EVALUATION

A. Implementation

The laptop used in our experiment runs the Ubuntu 12.04 operating system and is equipped with an Intel 5300 NIC as the receiver. We connected the laptop to a mini R1C wireless router with two antennas via a cable and used the router as the transmitter. The receiver is equipped with three antennas, and its firmware is modified to report CSI to the upper layer. All experiments in this paper were carried out on a 5 GHz frequency band with a channel bandwidth of 20 MHz. We chose 5 GHz instead of 2.4 GHz to make the wavelength short enough to ensure better resolution, and because 5 GHz has more channels, which reduces the possibility of interference. During the experiments, the transmitter continuously sends packets to the receiver at a high rate of 1500 packets/second using the Iperf tool. DeepCount acquires the CSI measurements and …
Fig. 8: The architecture of the network.

TABLE II: Samples of dataset-fixed
Activity       Samples
Waving         24741
Typing         28565
Sitting down   27108
Walking        27537
Talking        23580
Eating         26802
A traditional neural network with well-tuned parameters is also suitable for crowd counting. Therefore, we use an FCBP neural network as the baseline method to compare with our CNN-LSTM network. The FCBP neural network we use has two hidden layers, where each hidden layer consists of a different number of neurons: 300 neurons (denoted as Layer 1 nodes) on the first hidden layer and 100 neurons (denoted as Layer 2 nodes) on the second hidden layer. The input layer is denoted as XN, where N equals 360 because the CSI values contain both phase and amplitude information. The output has 5 classes, denoted Y1 to Y5, to identify up to 5 people. Note that this structure could easily be extended to count more than 5 people by using more output nodes. Fig. 11 illustrates the structure of the whole network.

Fig. 11: The architecture of the baseline method.
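A minimal tf.keras sketch of this baseline follows, matching the 360-300-100-5 layer sizes; the activations and optimizer are assumptions, since the text only fixes the layer sizes.

```python
# FCBP baseline: fully-connected network with two hidden layers.
import tensorflow as tf
from tensorflow.keras import layers, models

baseline = models.Sequential([
    layers.Input(shape=(360,)),                  # X_N with N = 360 (amplitude + phase)
    layers.Dense(300, activation="relu"),        # Layer 1 nodes
    layers.Dense(100, activation="relu"),        # Layer 2 nodes
    layers.Dense(5, activation="softmax"),       # Y1 .. Y5 (up to 5 people)
])
baseline.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
baseline.summary()
```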
2) Experiment results of baseline method: The baseline method achieves an average accuracy of 88.8%, 80.2% and 78% across dataset-fixed, dataset-semi and dataset-open, respectively. Fig. 12 plots the confusion matrices for the different training datasets collected in our Lab1 and Lab2. Fig. 13 shows the loss curve during training. The loss on dataset-fixed converges after 3000 iterations, while dataset-semi and dataset-open converge only after 10000 iterations. This result implies that the data from dataset-semi and dataset-open are more diverse; hence, the computation costs are relatively higher.

D. Experiment with CNN-LSTM

1) Parameters tuning: As we all know, an effective network needs well-tuned parameters found via extensive searching. The parameters include the number of layers, the number of units for each layer, the filter and stride sizes for the convolutional layers, etc. The difficulty grows quickly with the complexity of a neural network. To balance the efficiency and performance of the training process, we focus on tuning the major parameters that influence the performance most. We list those parameters for each layer below.
• LSTM Layer: The performance of the LSTM layer is most sensitive to the maximum length of the input sequence, the number of LSTM cells, and the dropout rate. An LSTM with more cells has a stronger information storage capacity. In our experiment, we use 1 layer of LSTM with 64 cells to remember long-term dependencies. We also set the length of the input sequences to 200 for better performance and choose a dropout rate of 0.1.
• CNN Layer: We use two CNN blocks to extract high-level features. We tuned the filter and stride sizes for the convolutional layers, and the pool and stride sizes for the pooling layers. The first block has a convolution with a 5 × 5 filter and 1 × 1 stride, followed by a max pooling with a 2 × 2 filter and 2 × 2 stride; the second block has a 5 × 3 filter with 3 × 3 stride.
• Dense Layer: Three fully-connected layers with 1000, 200 and 5 neurons follow the CNN blocks to reduce the data dimension and fit the relationship between CSI and crowd counting.
• Other significant parameters: In our experiment, we set the batch size to 64 and the learning rate to 0.2, 0.15 and 0.1 for dataset-fixed, dataset-open and dataset-semi, respectively, to obtain better performance.

2) Overall performance: For the activity recognition model, DeepCount takes 80% of the samples in each class as the training set and the rest as the test set. On the training set, we use 10-fold cross validation to obtain the optimal parameters for the activity model, including the states in the HMM. From Fig. 14(a), the average accuracy of activity recognition is 89.14%; there is an 8% probability of classifying Entering the room as Leaving the room and a 7% probability of classifying Leaving the room as Entering the room. We observe that the accuracies of Entering the room and Leaving the room are 88% and 87% among the activities. For the CNN-LSTM model, DeepCount achieves an average accuracy of 88.8%, 85.2% and 85.2% across dataset-fixed, dataset-semi and dataset-open, respectively. Fig. 15 plots the confusion matrices for the different training datasets in our Lab1 and Lab2. We find that DeepCount's predictions and the real labels differ by no more than 2 people. Thus, our CNN-LSTM model fits the real environment well. Due to the existence of multiple human states in the …
Fig. 12: The confusion matrices of the baseline method for Dataset-fixed, Dataset-semi and Dataset-open.
Fig. 15: The confusion matrices of DeepCount for Dataset-fixed, Dataset-semi and Dataset-open.

… the amendment mechanism updates only the last-layer parameters. Since only the last layer is retrained on a single sample, the time cost is acceptable. By this mechanism, we can eventually improve the recognition accuracy from 82.3% to 87% for the baseline method; further, the recognition accuracy of the CNN-LSTM network improves from 86.4% to 90%.

Fig. 17: The impact of the length of the time window.

5) Compared with Electronic Frog Eye: DeepCount is much different from the Electronic Frog Eye [17]. First, …
Fig. 18: Impact of preprocessing: (a) the accuracy of data with or without preprocessing; (b) the accuracy of different features.

ACKNOWLEDGMENT

This work is supported in part by the National Key R&D Program of China under Grant 2017YFB0802300, the National Science Foundation of China under Grants No. 61602238 and 61672283, the key project of the Jiangsu Research Program under Grant BK20160805, and the China Postdoctoral Science Foundation under Grant No. 2016M590451.