Wen 2021

Expert Systems With Applications 178 (2021) 115016
Wind turbine fault diagnosis based on ReliefF-PCA and DNN

Xiaoqiang Wen, Ziang Xu *
Department of Automation, Northeast Electric Power University, Jilin 132012, PR China
ARTICLE INFO

Keywords: Wind turbine; Fault diagnosis; ReliefF; Principal component analysis; Deep neural network

ABSTRACT

A large amount of data is generated during the operation of a wind turbine (WT), which easily leads to the curse of dimensionality, and if more than one WT fault occurs, multiple sensors raise alarms. To address the problems of big data and of inaccurate and untimely fault diagnosis, a hybrid fault diagnosis method based on ReliefF, principal component analysis (PCA) and a deep neural network (DNN) is developed in this paper. First, the ReliefF method is used to select the fault features and reduce the data dimensions. Second, the PCA algorithm is used to further reduce the data dimensions, mainly to reduce the redundancy among the data and improve the accuracy of fault diagnosis. Finally, the ReliefF-PCA-DNN model is constructed, optimized and applied to fault cases from a wind farm in Jilin Province. The experimental results show that the accuracy of the proposed hybrid model exceeds 98.5% for single faults and 96% for multiple faults, both much higher than those of the comparison methods. The method can therefore diagnose WT faults well.
1. Introduction

As a renewable clean energy source, wind energy has become an indispensable force in solving the problem of environmental pollution in recent years (Lonf et al., 2017). With the increase in installed capacity and service life of WTs, the operation and maintenance cost, along with the potential safety hazard, is increasing, especially in mountainous wind farms, where serious wind turbulence further increases the WT fatigue load and can easily lead to severe accidents.

Artificial neural network (ANN) technology is an effective method that is widely used in WT fault diagnosis (Zeng et al., 2018). Liu and Laouti used support vector machines for WT fault diagnosis and WT fault detection, respectively (Liu et al., 2013; Laouti and Othman, 2011). Zhou et al. studied learning-based data-driven methods for abnormal event detection based on kernel principal component analysis (kPCA) and a novel discriminative method that required only partial expert knowledge for training (Zhou et al., 2016). Leahy et al. applied classification techniques to recognize the fault and fault-free operation of a wind turbine in the south-east of Ireland based on SCADA data (Leahy et al., 2016). Poon et al. presented a fault detection and identification (FDI) method for switching power converters using a model-based state estimator approach (Poon et al., 2016). Yu et al. used a deep belief network to diagnose WT faults (Yu et al., 2018). Li et al. used a BP neural network to diagnose the phase-to-phase short-circuit fault of a doubly-fed wind generator (Li et al., 2019). Li et al. investigated fault diagnosis of wind turbines using Gaussian process classifiers (GPC) based on operational data collected from the SCADA system (Li et al., 2019). Cho et al. proposed a fault detection and diagnosis method, based on a Kalman filter and an artificial neural network, to automatically identify different fault conditions of a hydraulic blade pitch system in a spar-type floating wind turbine (Cho et al., 2021). Because hundreds of sensors are installed on the WTs and their data are stored in the SCADA system, data analysis and mining technology has emerged to meet this need. However, once the WT generator breaks down suddenly during operation, the large amount of data and the many parameters unrelated to the fault make it difficult to locate the fault quickly and accurately. A single fault may cause many sensors to raise alarms, and several faults may occur at the same time. How to find the sensitive fault features among so many parameters and improve the accuracy of fault diagnosis urgently needs more in-depth study.

Feature dimensionality reduction is an important prerequisite of WT fault diagnosis. It aims to describe the data set more accurately with fewer features than the original feature set, by removing unnecessary, unwanted, and irrelevant features from the dataset. During real-time WT operation, many parameters unrelated to the faults are stored in the SCADA database. When a fault occurs, feature dimensionality reduction could eliminate
* Corresponding author.
E-mail address: 503151998@qq.com (Z. Xu).
https://doi.org/10.1016/j.eswa.2021.115016
Received 9 October 2020; Received in revised form 1 March 2021; Accepted 7 April 2021
Available online 13 April 2021
0957-4174/© 2021 Elsevier Ltd. All rights reserved.
unnecessary data from the SCADA database, speed up the calculation and yield accurate fault diagnosis results. At present, the commonly used dimensionality reduction algorithms include the genetic algorithm (GA), random forest (RF), clustering analysis (CA), the relief series algorithm (RSA), principal component analysis (PCA), etc. Chen et al. used a GA to optimize the energy storage configuration in a tower elevator energy storage system model (Chen et al., 2018). However, the GA tends to converge prematurely and its operating efficiency is relatively low. Meng et al. predicted icing on WT blades, reducing the dimension with the random forest algorithm, but the computational cost of the algorithm was high (Meng et al., 2020). Because of their complexity, the algorithms mentioned above need more time to train than other similar algorithms. When the amount of data is too large, it is difficult to obtain an effective clustering result. The relief algorithm selects features better, but the correlation between the extracted features is high, resulting in redundancy (Dou et al., 2019). Although principal component analysis (PCA) could effectively eliminate the correlation among the features, it would reduce the computational efficiency if the feature parameters were used directly without any processing (Zhang et al., 2019).
At present, most classification and regression learning methods are shallow-structure algorithms. Their limitations lie in their limited ability to represent complex functions given limited samples and calculation units, and for complex classification problems their generalization ability is restricted to some extent. On the other hand, deep learning algorithms could achieve complex function approximation by learning a deep nonlinear network structure, form a distributed representation of the input data, and show a powerful ability to learn the essential characteristics of a data set from some sample sets (Tang and Hou, 2016; Aqsa et al., 2017).

In this paper, a hybrid ReliefF-PCA-DNN model is proposed and used for WT fault diagnosis. The ReliefF algorithm is used to extract the sensitive fault features of the WTs. PCA is used to reduce the dimension of the extracted features, eliminate the correlation among the features, avoid redundancy, and clean some wrong data. Based on the feature subset obtained by dimension reduction, the optimized DNN is used to establish the WT fault diagnosis model. The experimental results show that the accuracy of the proposed hybrid model is much higher than that of the comparison models. When the faults of other WTs are added to the model, the diagnosis results remain accurate, which shows that the proposed model generalizes well.

2. Construction of the ReliefF-PCA-DNN fault diagnosis model

The construction process of the WT ReliefF-PCA-DNN fault diagnosis model includes the following steps:
First, extract the WT sensitive fault features based on the ReliefF algorithm.
Second, reduce the feature dimensions based on the PCA algorithm.
Third, construct the WT single-fault diagnosis models and the multiple-fault diagnosis model based on the DNN.
Fourth, verify the effectiveness of the fault diagnosis models and compare them with other fault diagnosis models (methods).
Fig. 1 shows the construction process of the WT ReliefF-PCA-DNN fault diagnosis model, and the detailed theories and modeling processes are described below.

Fig. 1. Construction process of the ReliefF-PCA-DNN model.

2.1. Feature dimension reduction based on ReliefF and PCA algorithm

A large amount of data is generated during WT operation. When WT faults occur during operation, if the data are not processed properly, the fault diagnosis time would increase dramatically and the accuracy would be reduced significantly. Therefore, it is necessary to mine the fault data and extract the fault-sensitive features. However, some conventional dimensionality reduction algorithms, such as the Pearson correlation coefficient, could not effectively extract sensitive fault features and generalize poorly. To solve this problem, a ReliefF-PCA feature reduction algorithm is proposed here, which is effective in extracting sensitive fault features and generalizes well.

2.1.1. Extract the WT sensitive fault features based on ReliefF algorithm

The flow of the ReliefF algorithm (Robnik-Sikonja and Kononenko, 2003; Zhang et al., 2009), which is shown in Fig. 2, is as follows. A sample R is randomly selected from the training set D; then the nearest neighbor sample H is found among the samples of the same class as R, and the nearest neighbor sample M is found among the samples of a different class. If, on a certain feature, the distance between R and the same-class sample H is less than the distance between R and the different-class sample M, the feature helps to distinguish nearest neighbors of the same and different classes, and its weight should be increased accordingly. Otherwise, if the distance between R and H is larger than the distance between R and M, the feature has a negative effect on distinguishing nearest neighbors of the same and different classes, and its weight should be reduced accordingly. The larger the weight of a feature, the stronger its classification ability; conversely, the smaller the weight, the weaker its classification ability. The specific process of the algorithm is as follows:

Input variables: training data set D, number of sampling iterations m, number of nearest neighbor samples k, number of features n and feature weight threshold δ.
Output variables: the feature subset Q, whose feature weights are greater than the threshold δ.
A. Set the feature subset Q to the empty set and the feature weight vector W(F) = 0, F = 1, 2, …, n.
B. For j = 1 to m do
First: randomly select a sample R.
Second: select the k nearest-neighbor samples $H_i$ (i = 1, 2, …, k) of R from the same class, and the k nearest-neighbor samples $M_i(C)$ of R from each different class C.
Third: for F = 1 to n do

$$ W(F) = W(F) - \sum_{i=1}^{k} \frac{\mathrm{diff}(F, R, H_i)}{mk} + \sum_{C \notin \mathrm{Class}(R)} \left[ \frac{p(C)}{1 - p(\mathrm{Class}(R))} \sum_{i=1}^{k} \mathrm{diff}(F, R, M_i(C)) \right] \Big/ (mk) \tag{1} $$
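The weight update of Eq. (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `relieff_weights`, the default `k=5` and the random seed are hypothetical, all features are assumed to be continuous (so diff is the absolute difference divided by the feature range), and neighbors are ranked by Euclidean distance on the range-scaled features.

```python
import numpy as np

def relieff_weights(X, y, m=80, k=5, seed=0):
    """Estimate ReliefF feature weights with the update of Eq. (1)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape

    # diff() for continuous features: |a - b| / (max(F) - min(F)).
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # a constant feature then always yields diff = 0

    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes.tolist(), (counts / n_samples).tolist()))

    w = np.zeros(n_features)
    for _ in range(m):
        r = int(rng.integers(n_samples))            # randomly selected sample R
        diffs = np.abs(X - X[r]) / span             # diff(F, R, .) for every sample
        dist = np.sqrt((diffs ** 2).sum(axis=1))    # Euclidean distance to R
        dist[r] = np.inf                            # R is never its own neighbor
        r_class = y[r].item()
        for c in classes.tolist():
            idx = np.flatnonzero(y == c)
            nearest = idx[np.argsort(dist[idx])[:k]]
            nearest = nearest[np.isfinite(dist[nearest])]
            term = diffs[nearest].sum(axis=0) / (m * k)
            if c == r_class:
                w -= term                           # near hits H_i lower W(F)
            else:
                # near misses M_i(C), weighted by p(C) / (1 - p(Class(R)))
                w += prior[c] / (1.0 - prior[r_class]) * term
    return w
```

Features whose weight exceeds the threshold δ would then form the subset Q, for example with `np.flatnonzero(w > delta)` for a chosen `delta`.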
In Eq. (1), p(C) is the prior probability of class C, Class(R) is the class that R belongs to, $H_i$ is the ith nearest neighbor sample in Class(R), $M_i(C)$ denotes the ith nearest neighbor sample in class C, and $\mathrm{diff}(F, R_1, R_2)$ represents the difference between $R_1$ and $R_2$ on the Fth feature.

For discrete features:

$$ \mathrm{diff}(F, R_1, R_2) = \begin{cases} 0, & R_1[F] = R_2[F] \\ 1, & R_1[F] \neq R_2[F] \end{cases} \tag{2} $$

For continuous features:

$$ \mathrm{diff}(F, R_1, R_2) = \frac{|R_1[F] - R_2[F]|}{\max(F) - \min(F)} \tag{3} $$

C. For F = 1 to n do
If W(F) > δ, the Fth feature is added to Q.

Computing the ReliefF weights requires Euclidean distances, so the data need to be normalized:

$$ Y = \frac{Q - Q_{\min}}{Q_{\max} - Q_{\min}} \tag{4} $$

where Q is the feature subset obtained by the ReliefF algorithm, $Q_{\min}$ is the minimum value of each feature parameter, and $Q_{\max}$ is the maximum value of each feature parameter. Thus, a new feature matrix Y is obtained. The ReliefF algorithm flow in two-dimensional space is shown in Fig. 2. As can be seen from the figure, the relief algorithm could cause high redundancy in the feature subset, and it is necessary to remove the redundancy (Jain & Singh, 2018).

Fig. 2. Sketch of the ReliefF algorithm.

2.1.2. Reduce the feature dimensions based on PCA algorithm

Principal component analysis (PCA) is a dimensionality reduction method (Zhang et al., 2019, 2020; Li et al., 2018) used to reduce the dimension of data sets. It maps n-dimensional features to k dimensions (k < n) and makes the data separable. In the new coordinate system, the variance of the transformed data points is maximized along the new coordinate axes, and the data points are no longer linearly correlated. This set of transformed data is called the principal components. The direction with the largest data variance is selected as the first principal component, and the second principal component is selected along the direction with the second largest variance, which is orthogonal to the first principal component. This process is repeated until n principal components are found. The process of dimension reduction is as follows:

A) Centralize the matrix x (de-average each dimension):

$$ C = V^{T} x \tag{5} $$

B) Calculate the covariance matrix of the samples:

$$ \psi = x_{ij} - \frac{1}{S} \sum_{i=1}^{S} x_i \tag{6} $$

$$ B = \frac{1}{S} \sum_{i=1}^{S} \psi \psi^{T} \tag{7} $$

C) Perform the eigenvalue decomposition of the covariance matrix.
D) Select the eigenvectors corresponding to the first n largest eigenvalues to form the eigenvector matrix W.

Here x is the feature matrix, C is the reduced-dimension matrix, V is the de-averaging vector, S is the number of samples, ψ is the new matrix obtained after subtracting the average value of each column of elements in the feature matrix, and B is the covariance matrix of ψ. PCA keeps the most important content of the data while reducing the dimension. At the same time, owing to the transformation of the coordinate system, the redundancy of the data is greatly reduced (Uzer et al., 2013).

2.2. Structure of the deep neural network (DNN)

In 2006, Hinton et al. first proposed the concept of deep learning (Hinton and Salakhutdinov, 2006). Deep learning is a deep data feature extraction algorithm, which could overcome the shortcomings of support vector machines (SVM), artificial neural networks (ANN) and other shallow machine learning methods, such as imprecise feature extraction results and a tendency to fall into local minima. DNNs are neural
networks with many hidden layers, which are also known as deep feedforward networks (DFN).

As shown in Fig. 3, a DNN can be divided into input layers, hidden layers and output layers. All layers are fully connected; that is, any neuron in layer i is connected with all neurons in layer i + 1. When the DNN propagates forward, the calculation is as follows:

$$ u_k = \sum_{i=1}^{n} W_{ki} X_i \tag{8} $$

$$ y_k = f(u_k - b_k) \tag{9} $$

where $X_i$ is the ith input variable, $W_{ki}$ represents the weight connected to the ith input variable, $u_k$ is the weighted sum of all input variables, $b_k$ is the threshold, $f(\cdot)$ is the activation function, and $y_k$ is the output of the neural network. The common activation functions are sigmoid, tanh and ReLU (Apicella et al., 2019; Wang et al., 2020). In this paper, ReLU (Montanelli and Yang, 2020; Eckle & Hieber, 2019; Chen & Ho, 2019) is selected as the activation function. Its expression is as follows:

$$ f_{\mathrm{ReLU}} = \max(0, z) \tag{10} $$

The derivative of the ReLU activation function is as follows:

$$ \frac{d}{dz} f_{\mathrm{ReLU}} = \begin{cases} 1, & z > 0 \\ 0, & z \leqslant 0 \end{cases} \tag{11} $$

Fig. 3. Structure of the DNN.

From the formulas above, we can see that any input value less than 0 is set to zero, so the neural network is sparse. For a deep learning algorithm, a clear goal is to separate the key factors from the data variables. Because the original data are usually entangled in highly dense features, if the complex relationships among the features could be resolved and converted into sparse features, the features would be robust, and gradient vanishing and gradient explosion could be effectively avoided. Adam (Chang et al., 2019; Fei et al., 2020) is a first-order optimization algorithm that could replace the traditional stochastic gradient descent (SGD) process (Wang et al., 2019). It updates the neural network weights iteratively based on the training data. SGD maintains a single learning rate for all weight updates, and this learning rate does not change during training. In Adam, by contrast, each network weight (parameter) maintains its own learning rate, which is adjusted separately during learning. The method calculates the adaptive learning rates of the different parameters from estimates of the first and second moments of the gradient. The Adam process is as follows:

A) Input the training sample data set X, the learning rate α, the moment estimation decay rates $\beta_1$ and $\beta_2$, and the stability constant ε.
B) Initialize the parameters above.
C) Calculate the gradient $g_t$ at time step t:

$$ g_t = \nabla_{\theta} J(\theta_{t-1}) \tag{12} $$

D) Calculate the exponential moving average of the gradient:

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{13} $$

E) Calculate the exponential moving average $v_t$ of the square of the gradient:

$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{14} $$

F) Apply the bias correction to $m_t$:

$$ \hat{m}_t = m_t / (1 - \beta_1^t) \tag{15} $$

G) Apply the bias correction to $v_t$:

$$ \hat{v}_t = v_t / (1 - \beta_2^t) \tag{16} $$

H) Update the parameters:

$$ \theta_t = \theta_{t-1} - \alpha \, \hat{m}_t / \left( \sqrt{\hat{v}_t} + \varepsilon \right) \tag{17} $$

2.3. The detailed WT fault diagnosis process based on ReliefF-PCA-DNN model

The WT fault diagnosis flow is shown in Fig. 4, and the main steps are as follows:

Data preprocessing: according to the fault record table, the WT with the lowest failure frequency is regarded as the normal WT; its data from fault periods are eliminated, and the rest are taken as normal data. The data of the target WT in the fault period are extracted and used as the fault data. The variance of each parameter is calculated. The data with a variance of 0 and the erroneous data are eliminated, and the remainder is normalized.

Data dimension reduction: all the processed data are input into the ReliefF and PCA algorithms. First, ReliefF is run with 80 sampling iterations and repeated 20 times, and the average value of the 20 repetitions is taken as the final result. Then, the threshold is discussed. The parameters whose weights are larger than the threshold are used as the input vectors of the PCA algorithm, and the principal components accounting for more than 90% obtained by the PCA algorithm are used as the input vectors of the DNN model.

Fault diagnosis: the reduced-dimension data are divided into training, test and verification sets according to the ratio of 6:1:3. The Adam optimizer is used to optimize the model and accelerate training, and the parameters of the ReliefF-PCA-DNN fault diagnosis model are
optimized at the same time. Finally, the fault diagnosis model is verified by the faults in other time periods.

Fig. 4. Fault diagnosis process based on the ReliefF-PCA-DNN model.

3. Experimental verification

3.1. Data sources and evaluation index

The experimental data all come from a wind farm in Jilin Province, and the data are recorded every 10 s. The parameters of the WTs' sensors are as follows: generator speed (P1), grid voltage (P2), mean wind angle per second (P3), average wind speed per second (P4), sum of generator electric quantity (P5), setting value of generator active power (P6), grid frequency (P7), average generator power per second (P8), average generator speed per second (P9), grid current (P10), engine room to north angle (P11), average pitch angle per second (P12), real-time value of reactive power (P13), generator speed setting value (P14), setting value of pitch angle (P15), real-time value of Y/Z-direction vibration (P16, P17), Y/Z-direction vibration filtering value (P18, P19), average motor temperature of blade 1/2/3 per second (P20, P21, P22), average gear oil temperature per second (P23), average temperature at DE/NDE end of gearbox bearing per second (P24, P25), average temperature at DE/NDE end of generator bearing per second (P26, P27), average U/V/W-phase temperature of generator stator winding per second (P28, P29, P30), average hub temperature per second (P31), mean cabin temperature per second (P32), mean temperature outside engine room per second (P33), average temperature of rotor bearing A/B per second (P34, P35), average cooling water temperature of gearbox per second (P36), spindle speed (P37) and real-time air density (P38). Table 1 shows part of the experimental data. Here, five WTs are selected, and Table 2 shows their failure frequencies. The No. 2 WT, which has the fewest faults, is regarded as the normal one, and its fault data have been eliminated. This paper mainly diagnoses several common states of the WTs: normal state, overtemperature fault of gearbox oil, overtemperature fault at the NDE end of the gearbox bearing, spindle brake failure and overtemperature fault of the engine room.

Here, the parameters with zero variance are eliminated and the remaining parameters are normalized. One important evaluation index is the accuracy of the fault diagnosis model on the verification set, and the loss function of the SoftMax classifier is as follows:

$$ \mathrm{Loss} = -\sum_{i} y_i \ln a_i \tag{18} $$

where $y_i$ is the real value of the output layer and $a_i$ is the value calculated by SoftMax. At the same time, fault data from different periods are used to verify the model, and the resulting accuracy is taken as part of the evaluation index.

3.2. Dimensionality reduction, optimization and results

3.2.1. Feature dimensionality reduction based on ReliefF and PCA algorithm

Because the correlation between the parameters is not very high and the amount of original data is relatively large, the accuracy of a fault diagnosis model that takes the original parameters directly as input variables would be very low. To improve the accuracy and shorten the training time, it is necessary to reduce the dimension of the original SCADA data. In the process of feature selection, the threshold selection of the ReliefF algorithm affects the subsequent calculation to a certain extent, so three thresholds are discussed: the average value, the median and the standard deviation, represented by A1 (average value), A2 (median) and A3 (standard deviation), respectively. The results for the three thresholds are shown in Table 3.

Among them, the model based on A1 has the fewest parameters. For the overtemperature fault of gearbox oil and the overtemperature fault at the NDE end of the gearbox bearing, the model using A2 as the threshold has an evaluation accuracy close to that of the model using A1, but the former is not as good as the latter when diagnosing the main shaft brake failure and the overtemperature fault of the engine room. For the overtemperature fault of gearbox oil, the overtemperature fault at the NDE end of the gearbox bearing and the spindle brake failure, the model based on A3 is not as effective as the models based on A1 and A2.

Table 4 shows the number of principal components with an accuracy of more than 90% after PCA dimensionality reduction using the three thresholds. The diagnostic result of the model based on A1 is close to that of the model based on A2, and the former is slightly better than the latter. For the overtemperature fault of gearbox oil and the overtemperature fault at the NDE end of the gearbox bearing, the model based on A3 has a lower evaluation accuracy than the other two models.

Fig. 5 shows the running times of the models with different thresholds on different fault data sets. It can be seen from the figure that the running time of the fault diagnosis model based on A1 is no longer than, and in some cases shorter than, that of the models based on A2 and A3. That is to say, the model based on A1 performs better.

In order to verify the accuracy and generalization of the proposed ReliefF-PCA-DNN model, this paper compares it with other dimensionality reduction algorithms. The dimension-reduced data are fed into the same DNN model, and the fault diagnosis results are shown in Table 5 and Table 6. Based on the tables, we can draw some conclusions:

First, although the accuracy of the PCA-DNN model is 99.38%, which is high for the single overtemperature fault of gearbox oil, its dimension after reduction by the PCA algorithm is 7, which is larger than that of the model using the ReliefF-PCA algorithm (3), and its training time over 200 epochs is 232.6 s, which is also longer than that of the model using the ReliefF-PCA algorithm (220.3 s).
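The threshold choices A1 to A3 and the subsequent PCA step discussed in Section 3.2.1 can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names `select_by_threshold` and `pca_reduce` are hypothetical, and reading the 90% criterion as "cumulative explained variance of at least 90%" is an assumption, since the paper does not spell out how the 90% is computed.

```python
import numpy as np

def select_by_threshold(weights, kind="mean"):
    """Return the indices of features whose ReliefF weight exceeds the threshold."""
    weights = np.asarray(weights, dtype=float)
    thresholds = {
        "mean": weights.mean(),        # A1: average value
        "median": np.median(weights),  # A2: median
        "std": weights.std(),          # A3: standard deviation
    }
    return np.flatnonzero(weights > thresholds[kind])

def pca_reduce(X, variance_kept=0.90):
    """Project X onto the leading principal components that together explain
    at least `variance_kept` of the total variance (assumed 90% criterion)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)                 # de-average each column
    cov = centered.T @ centered / X.shape[0]      # covariance with 1/S scaling
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()    # cumulative explained variance
    n_keep = min(int(np.searchsorted(ratio, variance_kept)) + 1, X.shape[1])
    return centered @ eigvecs[:, :n_keep], n_keep
```

A typical use would be `cols = select_by_threshold(w, "mean")` followed by `Z, n_keep = pca_reduce(X[:, cols])`, where `w` holds the averaged ReliefF weights and `Z` is the input passed to the DNN.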
Table 1
Parts of the experimental data.

No. of samples  P1/(r/s)  P2/V    P3/f  P4/(m/s)  P5/kW      P6/(kW/h)  P7/Hz  P8/kW   P9/r     P10/A
1               1058.50   685.86  5.03  4.01      2.11×10^7  105.43     50.07  97.88   1058.87  0
2               1286.19   685.56  2.85  5.57      2.11×10^7  307.66     50.04  285.40  1278.90  0
3               1663.30   689.76  4.50  7.87      2.11×10^7  762.44     50.04  753.90  1664.43  0
…

No. of samples  P11/°   P12/°  P13/kW  P14/(r/min)  P15/°  P16/μm  P17/μm  P18/μm  P19/μm  P20/℃
1               72.25   0.01   2       1750         0      0.03    0.02    0.01    0.02    9.27
2               105.40  88.98  0       50           89     0.01    0.04    0.01    0.02    16.84
3               317.70  3.93   0       1090         2.19   0.03    0.01    0.02    0.02    12.29
…

No. of samples  P21/℃  P22/℃  P23/℃  P24/℃  P25/℃  P26/℃  P27/℃  P28/℃  P29/℃  P30/℃
1               11.40  11.01  54.88  52.29  49.41  21.78  24.79  32.42  32.49  32.49
2               18.79  18.50  57.80  49.27  45.45  30.00  36.10  44.99  45.11  45.49
3               16.40  16.20  48.50  41.50  35.49  24.09  23.39  33.29  33.60  34.10
…

No. of samples  P31/℃  P32/℃  P33/℃  P34/℃  P35/℃  P36/℃  P37/(r/min)  P38/(kg/m^3)
1               11.10  13.60  9.89   16.17  19.92  12.59  6.18         1.22
2               14.39  17.30  15.50  19.59  21.29  13.29  0.12         1.20
3               15.10  12.20  13.49  19.40  22.97  14.60  0.08         1.20
…
Table 2
Statistics of the WTs' faults.

Serial number of the WTs  Failure frequency  State
2                         98                 Change to 1
3                         207                Change to 1
14                        296                Change to 1
23                        517                Change to 1
39                        172                Change to 1
Table 3
Number of selected features corresponding to different thresholds.

Fault type                                     Number of all parameters  A1  A2  A3
Oil overtemperature of gearbox                 38                        10  10  15
Overtemperature at NDE end of gearbox bearing  38                        9   9   18
Failure of main shaft brake                    38                        6   11  12
Overtemperature of engine room                 38                        8   10  8

Table 4
The number of principal components with 90% accuracy.

Fault type                                     A1  A2  A3
Oil overtemperature of gearbox                 3   3   5
Overtemperature at NDE end of gearbox bearing  3   3   6
Failure of main shaft brake                    3   4   4
Overtemperature of engine room                 3   4   3

Fig. 5. Comparison of running time using different fault data sets. Notes: Fault 1: overtemperature fault of the gearbox oil; Fault 2: overtemperature fault at the NDE end of the gearbox bearing; Fault 3: the spindle brake failure; Fault 4: overtemperature fault of the engine room.

Table 5
Results (accuracy, %) of the models based on various dimension reduction algorithms.

Number of WTs  ReliefF-PCA-DNN  PCA-DNN  ReliefF-DNN  Pearson-DNN  Fault types
1              99.62            99.38    88.98        97.12        Overtemperature of gearbox oil (R1)
2              99.44            84.78    75.55        67.22        Overtemperature of gearbox oil (R2)
3              98.33            73.61    61.61        39.96        Overtemperature of gearbox oil (R3)
2              99.33            62.15    44.90        33.37        Overtemperature of gearbox oil and overtemperature at NDE end of gearbox bearing (R4)
3              97.82            48.65    48.19        26.31        Overtemperature of gearbox oil, bearing overtemperature at NDE end of gearbox and overtemperature of engine room (R5)

Therefore, the PCA-DNN model has a higher dimension after dimension reduction and a longer running time than the ReliefF-PCA-DNN model. What is more, its accuracy is not as high as that of the proposed model, whose accuracy is 99.62%.

Second, the ReliefF-DNN model and the Pearson-DNN model could not effectively diagnose faults because of their poor generalization to multiple faults or multiple WTs. The model based on the Pearson correlation coefficient reduces the dimension well and retains few relevant parameters when only one WT has a gearbox oil overtemperature fault. After consulting the fault table and the SCADA data, it was found that the WT was still in the fault state after the temperature returned to its normal value, so many parameters were similar to the normal WT parameters, which resulted in fewer relevant parameters. When the number of WTs increases, the dimension reduction effect becomes worse.

Third, Fig. 6 also shows the accuracies and dimensions of all the
Table 6 parameter is expressed by r. When r > 0, there is a positive correlation

Dimensions of the models based on different dimension reduction algorithms. between them, and the greater the r is, the higher the correlation is. On
Number Fault types ReliefF- PCA- ReliefF- Pearson- the contrary, when r < 0, it is a negative correlation. Extracting the
of WT PCA- DNN DNN DNN parameters with larger absolute value of r in the first principal compo
DNN nent, we could see from the table that most of the information is retained
1 Overtemperature of 3 7 10 2 in the process of dimensionality reduction.
Gearbox oil (R1)
2 Overtemperature of 3 7 10 12 3.2.2. Discussion on activation function
Gearbox oil (R2)
3 Overtemperature of 3 8 10 12
The common activation functions are sigmoid, tanh and ReLU. If the
Gearbox oil (R3) activation function is not selected properly, the convergence speed of
2 Overtemperature of 3 8 10 12 the model would slow down, the accuracy would decrease, and even the
Gearbox oil and gradient would disappear. In this paper, the loss function is a cross en
overtemperature at
tropy function. The comparing results of the three activation functions
NDE end of gearbox
bearing (R4) are shown in Figs. 7 to 9.
3 Overtemperature of 3 9 10 12 As shown in Fig. 7, the sigmoid activation function converges to
Gearbox oil, 0.6345 slowly under 200 epochs, which is much larger than that of the
overtemperature at tanh and ReLU activation function. The curve of loss function under the
NDE end of gearbox
bearing and
tanh activation function, which is shown in Fig. 8, is similar to the curve
overtemperature of under the ReLU activation function, which is shown in Fig. 9. However,
engine room (R5) on the one hand, the loss value under the tanh activation function
models based on the four dimension reduction algorithms. It can be seen that the diagnosis accuracy of the proposed model is significantly higher than that of the other models. When diagnosing a single WT fault, the proposed model has fewer dimensions and the highest fault diagnosis accuracy. When diagnosing various WT faults, the proposed model also has the fewest dimensions and the highest fault diagnosis accuracy. Compared with the other models, the proposed model generalizes better. To sum up, the proposed model has lower dimension, the highest accuracy of fault diagnosis and good generalization.
Besides, Table 7 shows the correlation between the first principal component and the original data after PCA dimension reduction. The score coefficient of the first principal component relative to the original
Table 7
Correlation between first principal component and original data.
Fault types | Related parameters
Overtemperature of gearbox oil | Temperature at DE end of generator bearing, temperature at DE end of gearbox bearing, temperature at NDE end of gearbox bearing
Overtemperature of bearing at NDE end of gearbox | Gear oil temperature, temperature at DE end of gearbox bearing, temperature at DE end of generator bearing
Abnormal engine room temperature | Temperature of rotor bearing A, temperature at NDE end of gearbox bearing
Main shaft brake failure | Angle of cabin to North, engine room temperature, temperature of rotor bearing A
Fig. 6. Comparing results of the models based on various dimension reduction algorithms. (Chart over fault types R1 to R5 showing dimensions and accuracy results (%) for ReliefF-PCA-DNN, PCA-DNN, ReliefF-DNN and Pearson-DNN.)
activation function of the proposed model here.
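The vanishing-gradient argument behind the choice of ReLU can be illustrated numerically. The sketch below is an assumption-laden toy, not the paper's experiment: it ignores the weight matrices, uses the same pre-activation value in every layer, and the helper name `chained_gradient` is hypothetical. It only shows how the product of activation derivatives shrinks with depth for sigmoid and tanh but not for ReLU on positive inputs.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid function; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh; its maximum is 1 at x = 0."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activation, depth):
    """Product of `depth` identical activation derivatives (weights ignored)."""
    return grad_fn(pre_activation) ** depth

# Depth 5 is used here only as an example depth.
for name, fn in (("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("ReLU", relu_grad)):
    print(name, chained_gradient(fn, 1.0, 5))
```

With a pre-activation of 1.0, the sigmoid product is the smallest, the tanh product is larger but still well below 1, and the ReLU product stays at 1.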
3.2.3. Discussion on optimization algorithm

Fig. 7. Loss value under Sigmoid activation function.
For the proposed model, the parameter updating algorithm affects the training speed and diagnostic accuracy to a certain extent. At present, SGD is the most commonly used method, but its initial learning rate is difficult to select and it easily falls into a local optimum. To solve this problem, many novel optimizers have been proposed: the RMSProp, Adagrad, Adadelta and Adam optimizers. Here, these parameter optimization algorithms are compared and analyzed by using the WTs’ fault set. The parameters of the algorithms are shown in Table 8 and the training results are shown in Fig. 10. According to the results, the Adagrad optimization algorithm has the slowest convergence speed and a larger error. The SGD and Adadelta optimization algorithms have slow convergence speeds and large error fluctuations. The RMSProp and Adam optimization algorithms are relatively close in convergence speed, but the Adam optimization algorithm has the smallest and most stable convergence error. Therefore, the Adam optimization algorithm is selected as the optimization algorithm here.
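For readers unfamiliar with the Adam update, the sketch below applies the standard Adam rule with the hyperparameters listed for Adam in Table 8 (learning rate 0.05, decay rates 0.9 and 0.999, stability constant 10^-8). The one-dimensional quadratic objective, the function name `adam_minimize` and the step count are assumptions chosen purely for illustration; they are not the paper's DNN or training set-up.

```python
import math

def adam_minimize(grad, w0, steps, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a scalar function with the standard Adam update rule.

    `grad` returns the gradient at w. Defaults follow the Adam row of Table 8.
    """
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective (assumed): f(w) = (w - 3)^2, so the gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0, steps=500)
print(w_final)
```

The per-parameter scaling by the second-moment estimate is what makes the step size less sensitive to the initial learning rate than plain SGD.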
3.2.4. Optimization of the hidden layers

Fig. 8. Loss value under tanh activation function.
The accuracy of the proposed model is also affected by the network depth. When the number of network layers increases, the accuracy could be improved, but the convergence speed would become slower. Reducing the network depth would lower the accuracy, but the training could speed up. At the same time, the number of neurons in the hidden layer would also affect the training accuracy.
As shown in Fig. 11 and Table 9, when the number of hidden layers is 2 or 3, the grid search method is used to optimize the number of neurons, but because the network depth is not enough, the diagnosis accuracy is not particularly ideal. By expanding the depth of hidden layers and optimizing the number of neurons by using the grid search method, we could see that when the number of hidden layers is 5, the accuracy is significantly improved compared with that of the 4-hidden-layer model. When the number of hidden layers continues to increase, the accuracy does not change significantly. Therefore, 5 hidden layers are selected here.
Next, the grid search method is used to optimize the number of neurons in the hidden layers. The number of neurons in each hidden layer initially ranges from 8 to 11. Based on the training results, it is found that when the accuracy is the highest, the number of neurons reaches the upper bound of 11 for most hidden layers, so it is necessary to expand the range. Therefore, the range is amended to 11 to 15. Finally, the numbers of neurons in the five hidden layers are 13, 14, 14, 14 and 14, respectively (Table 9).
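The two-stage search described above (search a range, then widen it when the best configuration sits on the upper bound) can be sketched as follows. This is a hedged illustration only: `grid_search`, `hits_upper_bound` and `toy_score` are hypothetical names, the "more than half of the layers" rule is an assumption standing in for "most hidden layers", and `toy_score` is a stand-in for training the DNN and measuring its accuracy, which is not reproduced here.

```python
from itertools import product

def grid_search(evaluate, n_layers, low, high):
    """Exhaustively score every neuron-count tuple in [low, high]^n_layers."""
    best_cfg, best_score = None, float("-inf")
    for cfg in product(range(low, high + 1), repeat=n_layers):
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def hits_upper_bound(cfg, high):
    """True when more than half of the layers sit on the upper bound."""
    return sum(n == high for n in cfg) > len(cfg) / 2

def toy_score(cfg):
    """Stand-in for validation accuracy (assumed, not measured)."""
    target = (13, 14, 14, 14, 14)
    return -sum((n - t) ** 2 for n, t in zip(cfg, target))

low, high = 8, 11
best, _ = grid_search(toy_score, 5, low, high)
if hits_upper_bound(best, high):
    # The optimum is probably outside the range, so shift the range upwards.
    low, high = 11, 15
    best, _ = grid_search(toy_score, 5, low, high)
print(best)
```

An exhaustive grid over five layers grows as the fifth power of the range width, so narrow ranges such as these keep the number of trained models manageable.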
3.2.5. Comparison with other models

Fig. 9. Loss value under ReLU activation function.
Although the ReliefF-PCA-DNN model has high requirements for data and hardware configuration, it could adapt to new problems quickly compared with other methods. In addition, because of its special processing in algorithm construction, such as using typical gradient descent and random loss, and its strong fitting ability, it has a strong advantage in dealing with multiple faults diagnosis. At present, the common fault diagnosis methods include BP neural network, support vector machine, etc. Therefore, here we construct the ReliefF-PCA-BP and ReliefF-PCA-SVM fault diagnosis models to compare with the
converges to 0.0868 and the loss value under the ReLU activation function converges to 0.0834, which is smaller than the former; on the other hand, the former takes 427 s to converge, and under the same epochs the latter takes 423.2 s, which is shorter than the former. Besides, research also shows that once the number of hidden layers increases, the gradients under the tanh and sigmoid activation functions would vanish easily, but the ReLU activation function could make the network training faster, increase the nonlinearity of the network, prevent the gradient from vanishing, and make the network sparse (Li, 2013; Lin & Liao, 2016). Therefore, considering comprehensively the factors mentioned above, the ReLU activation function is selected as the
Table 8
Optimization algorithm.
Optimization algorithm | Parameter
SGD | Learning rate: 0.01
RMSProp | Global learning rate: 0.01, Decay rate: 0.9
Adagrad | Global learning rate: 0.01
Adadelta | Global learning rate: 1, Decay rate: 0.95
Adam | Global learning rate: 0.05, Decay rates under moment estimation: 0.9 and 0.999, Stability constant: 10^-8
Fig. 10. Training errors of the optimization algorithms. (Loss against epochs, 0 to 200, for SGD, RMSprop, Adagrad, Adam and Adadelta.)
Fig. 11. Results of different hidden layers. (Accuracy (%) against the number of hidden layers, 2 to 7.)
Table 9
Numbers of neurons in grid search method.
Hidden layers | Number of neurons
2 | 13, 14
3 | 13, 15, 14
4 | 14, 14, 15, 15
5 | 13, 14, 14, 14, 14
6 | 13, 14, 14, 14, 16, 16
7 | 13, 14, 14, 16, 14, 14, 16
proposed model.
As shown in Fig. 12, for the No. 39 WT, the single fault diagnosis accuracies of the three models are all very high, and the accuracy for each fault is more than 98.5%. It shows that the three models are extremely reliable in single fault diagnosis after dimension reduction, and the single fault diagnosis result is more reliable than the multiple fault diagnosis result, because fewer fault categories are involved and the complexity of the model is low; however, each classifier could only diagnose specific faults.
Fig. 12. Accuracy of different models under single fault. (Accuracy (%) of ReliefF-PCA-DNN, ReliefF-PCA-BPNN and ReliefF-PCA-SVM on the fault data sets Fault 1 to Fault 4.)
Table 10
Fault types of different WTs.
Label | Fault type | Note
F0 | Normal state of No. 2 WT | Normal
F1 | Oil overtemperature fault of the No. 14 WT gearbox | Temperature > 80 ℃
F2 | Brake holding failure of the No. 14 WT main shaft | Brake holding
F3 | Oil overtemperature fault of the No. 23 WT gearbox | Temperature > 80 ℃
F4 | Bearing overtemperature fault at NDE end of the No. 23 WT gearbox | Temperature > 80 ℃
F5 | Oil overtemperature fault of the No. 39 WT gearbox | Temperature > 80 ℃
F6 | Bearing overtemperature fault at NDE end of the No. 39 WT gearbox | Temperature > 80 ℃
F7 | Brake holding failure of the No. 39 WT main shaft | Brake holding
F8 | Temperature fault of the No. 3 WT cabin | Temperature < 5 ℃
Multi faults classification is to classify a series of sampled data by categories, including the normal category and several fault categories. These faults come from the 4 WTs. Different faults of one WT and the same fault of different WTs are defined as different faults. As shown in Table 10, one model is used for fault diagnosis of different WTs. As is shown in Fig. 13, compared with the single fault diagnosis models, the accuracies of the three multi-fault prediction models decrease. The proposed ReliefF-PCA-DNN model still shows excellent diagnostic performance in multiple faults, and the accuracies are all above 96%. When the numbers of fault types are 2 and 4, the accuracies are 99.2% and 97.82% respectively. When the fault types increase, the accuracy of the proposed model is relatively stable without obvious decline. When there are 8 types of faults, the accuracy is 96.72%.
For the ReliefF-PCA-SVM model, when there are few fault types, it could diagnose the faults well, but when the number of fault types increases markedly, its accuracy decreases. When there are 8 types of faults, the accuracy is only 88.89%.
For the ReliefF-PCA-BPNN model, when there are few fault types, its accuracy is stable. When the numbers of fault types are 2 and 4, the accuracies are 98.2% and 93.96% respectively. When the number of faults increases, its accuracy is not as good as those of the two models above. The accuracy decreases sharply, to only 24.12%. It is unable to distinguish faults effectively because the ReliefF-PCA-BPNN model has a shallow network structure and could not learn all features effectively. The results above show that the proposed model has obvious advantages over the ReliefF-PCA-BPNN and ReliefF-PCA-SVM models in the WTs’ faults diagnosis.
Next, the proposed model is verified by using the fault data of other periods. Table 11 shows the confusion matrix of the diagnosis results. The values on the diagonal in Table 11 are the numbers of correct diagnosis, and the rest in the same column indicate the numbers of misdiagnosis. It can be seen that the misdiagnosed faults mainly focus on
F3 to F6. The main reason for the results is that a gearbox temperature fault easily causes the temperature of surrounding parts to rise, such as the bearing temperature. Although the proposed model has good fault diagnosis performance, it is also prone to misdiagnosis when the gearbox temperature exceeds the normal level. Different fault types could change the internal characteristics of the same data. That is the main reason why the accuracy of each algorithm decreases under multiple faults diagnosis. Besides, the ROC curves and AUC values of the test sets are shown in Fig. 14. It can be seen from the figure that the test sets achieve good classification effect on the trained model, in which the precision value is 0.9937 and the recall value is 0.9891. Macroscopically, the average AUC value of 9 times is 0.9991, which excludes the chance and shows that the proposed model is relatively stable. Microscopically, the AUC values of F0 to F8 are all greater than 0.98, and the AUC values of F5 and F6 are lower than those of other faults, which is consistent with the conclusion above.
Fig. 13. Accuracy of different models under multiple faults. (Accuracy (%) of ReliefF-PCA-DNN, ReliefF-PCA-BPNN and ReliefF-PCA-SVM for the fault states of 2, 4 and 8 types of failures.)
Table 11
Diagnostic results of confusion matrix for the ReliefF-PCA-DNN model.
Rows: label; columns: prediction classification results for the ReliefF-PCA-DNN model.
Label | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8
F0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
F1 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0
F2 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0
F3 | 0 | 0 | 0 | 98 | 1 | 1 | 0 | 0 | 0
F4 | 0 | 0 | 0 | 1 | 99 | 0 | 0 | 0 | 0
F5 | 0 | 0 | 0 | 1 | 0 | 96 | 3 | 0 | 0
F6 | 0 | 0 | 0 | 0 | 0 | 2 | 98 | 0 | 0
F7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0
F8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100
Fig. 14. ROC curves and AUC values for the test sets.
4. Conclusions

When the WTs in a wind farm are shut down due to some faults, how to mine the useful information from the massive operation data and correctly diagnose the fault types and fault location, so as to reduce the operation and maintenance cost, has become an outstanding problem. In this paper, a WT ReliefF-PCA-DNN fault diagnosis method is proposed for the first time. By reducing the dimensions of the WTs’ SCADA data, different faults are diagnosed effectively. The effectiveness of the proposed model is verified by comparing with different fault diagnosis models. The conclusions are as follows:
Firstly, the average value is used as the threshold of the ReliefF algorithm to extract the fault sensitive features. After PCA dimensionality reduction, the data dimensions are effectively reduced. Compared with other dimensionality reduction algorithms, the ReliefF-PCA algorithm has good generalization, which not only ensures the accuracy of the proposed model, but also speeds up its operation.
Secondly, the network functions and parameters of the proposed model are discussed and optimized. From the loss value in the training process, we can see that when the ReLU activation function is used as the hidden layer activation function and the Adam optimization algorithm is used as the parameter updating algorithm, the proposed model has good training effect. The best number of hidden layers is 5 with the shortest running time.
Thirdly, the model proposed here is compared with the common fault diagnosis models. For the single-fault diagnosis, the three models have good performance. In the process of multi-faults diagnosis, the accuracy of the proposed model is the highest, and the ReliefF-PCA-BPNN model is the worst when there are many fault types. According to the verification results of multi faults diagnosis, the misdiagnosed faults mainly focus on F3 to F6. This indicates that there may be a direct connection between these fault types, which needs further study.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors are thankful for the support of the Jilin city outstanding young talents training program (20190104156) and the science and technology projects of the Jilin province department of education (JJKH20190709KJ).
References
Lonf, X. F., Yang, P., Guo, H. X., et al. (2017). Review of fault diagnosis methods for large wind turbines. Power System Technology, 41(11), 3481–3485.
Zeng, J., Chen, Y. F., Yang, P., et al. (2018). Review of fault diagnosis methods of large-scale wind turbines. Power System Technology, 42(3), 849–860.
Li, H. Y., Wang, L. M., W Q, et al. (2019). Research on fault diagnosis of wind turbines based on neural network and logistic regression. Renewable Energy Resources, 31(9), 1393–1397.
Yu, D., Chen, Z. M., Xiahou, K. S., et al. (2018). A radically data-driven method for fault detection and diagnosis in wind turbines. International Journal of Electrical Power & Energy Systems, 99, 511–584.
Liu, W. Y., Wang, Z. F., & Han, J. G. (2013). Wind turbine fault diagnosis method based on diagonal spectrum and clustering binary tree SVM. Renewable Energy, 50, 1–6.
Laouti, N., & Othman, S. (2011). Support vector machines for fault detection in wind turbines. IFAC Proceedings Volumes, 44(1), 7067–7072.
Zhou, Y., et al. (2016). Abnormal event detection with high resolution micro-PMU data. 2016 IEEE Power Systems Computation Conference (PSCC).
Leahy, K., et al. (2016). Diagnosing wind turbine faults using machine learning techniques applied to operational data. 2016 IEEE International Conference on Prognostics and Health Management (ICPHM).
Poon, J., et al. (2016). Model-based fault detection and identification for switching power converters. IEEE Transactions on Power Electronics, 32(2), 1419–1430.
Cho, S., Choi, M., Gao, Z., et al. (2021). Fault diagnosis for a wind turbine transmission system based on manifold learning and Shannon wavelet support vector machine. Renewable Energy, 169, 1–13.
Li, Y. T., Shu, L. J., Liu, S. J., et al. (2019). Wind turbine fault diagnosis based on Gaussian process classifiers applied to operational data. Renewable Energy, 134, 357–366.
Chen, H., Xie, L. R., Li, J., et al. (2018). Research on the power supply mode of tower elevator based on the abandoning wind. Renewable Energy Resources, 36(9), 1375–1379.
Meng, H., Huang, X. X., Liu, J., et al. (2020). Forecast of fan blade icing combing with random forest and SVM. Electrical Measurement & Instrumentation, 1–4.
Wang, Y. J., Cao, P. P., et al. (2020). Research on insulator self-explosion detection method based on deep learning. Journal of Northeast Electric Power University, 40(03), 33–40.
Dou, D. Y., Wu, W. Z., Yang, J. G., et al. (2019). Classification of coal and gangue under multiple surface conditions via machine vision and relief-SVM. Power Technology, 356, 1024–1028.
Zhang, Y. G., Chen, B., & Pan, G. F. (2019). A novel hybrid model based on VMD-WT and PCA-BP-RBF neural network for short-term wind speed forecasting. Energy Conversion and Management, 195, 180–197.
Tang, Z., & Hou, J. (2016). Speech-driven articulator motion synthesis with deep neural networks. Acta Automatica Sinica, 42(6), 923–930.
Aqsa, S. Q., Asifullah, K., & Aneela, Z. (2017). Wind power prediction using deep neural network based meta regression and transfer learning. Applied Soft Computing, 58, 742–745.
Robnik-Sikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1–2), 23–69.
Zhang, J., Lin, H., & Zhao, M. (2009). A fast algorithm for hand gesture recognition using relief. 2009 6th International Conference on Fuzzy Systems and Knowledge Discovery. New York: IEEE Press, 1, 8–12.
Jain, D., & Singh, V. (2018). An efficient hybrid feature selection model for dimensionality reduction. Procedia Computer Science, 132, 333–341.
Li, S. (2013). Accelerating a recurrent neural network to finite-time convergence for solving time-varying Sylvester equation by using a sign-bi-power activation function. Neural Processing Letters, 37(2), 189–205.
Lin, X., & Liao, B. (2016). A convergence-accelerated Zhang neural network and its solution application to Lyapunov equation. Neurocomputing, 193(2), 213–218.
Li, S. W., Chen, T., & Wang, L. (2018). Effective tourist volume forecasting supported by PCA and improved BPNN using Baidu index. Tourism Management, 68, 116–126.
Zhang, L., Wang, J. P., & Duan, Q. L. (2020). Estimation for fish mass using image analysis and neural network. Computers and Electronics in Agriculture, 173.
Uzer, M. S., Inan, O., & Yılmaz, N. (2013). A hybrid breast cancer detection system via neural network and feature selection based on SBS, SFS and PCA. Neural Computing and Applications, 20(3–4), 719–728.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.
Apicella, A., Isgrò, F., & Prevete, R. (2019). A simple and efficient architecture for trainable activation functions. Neurocomputing, 370, 1–15.
Montanelli, M., & Yang, H. Z. (2020). Error bounds for deep ReLU networks using the Kolmogorov-Arnold superposition theorem. Neural Networks, 129, 1–6.
Eckle, K., & Hieber, J. S. (2019). A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural Networks, 110, 232–242.
Chen, Z., & Ho, P. H. (2019). Global-connected network with generalized ReLU activation. Pattern Recognition, 96.
Chang, Z. H., Zhang, Y., & Chen, W. B. (2019). Electricity price prediction based on hybrid model of adam optimized LSTM neural network and wavelet transform. Energy, 187.
Fei, Z. G., Wu, Z. Y., & Xiao, Y. Q. (2020). A new short-arc fitting method with high precision using Adam optimization algorithm. Optik, 212.
Wang, Q. G., Chen, S. L., & Luo, X. (2019). An adaptive latent factor model via particle swarm optimization. Neurocomputing, 369, 176–184.