Sensors 24 03215

sensors
Article
Condition Monitoring and Predictive Maintenance of Assets
in Manufacturing Using LSTM-Autoencoders and
Transformer Encoders
Xanthi Bampoula, Nikolaos Nikolakis and Kosmas Alexopoulos *
Laboratory for Manufacturing Systems and Automation, Department of Mechanical Engineering and
Aeronautics, University of Patras, 26504 Patras, Greece; baboula@lms.mech.upatras.gr (X.B.);
nikolakis@lms.mech.upatras.gr (N.N.)
* Correspondence: alexokos@lms.mech.upatras.gr; Tel.: +30-2610-910160
Abstract: The production of multivariate time-series data facilitates the continuous monitoring
of production assets. The modelling approach of multivariate time series can reveal the ways in
which parameters evolve as well as the influences amongst themselves. These data can be used
in tandem with artificial intelligence methods to create insight on the condition of production
equipment, hence potentially increasing the sustainability of existing manufacturing and production
systems, by optimizing resource utilization, waste, and production downtime. In this context, a
predictive maintenance method is proposed based on the combination of LSTM-Autoencoders and a
Transformer encoder in order to enable the forecasting of asset failures through spatial and temporal
time series. These neural networks are implemented into a software prototype. The dataset used for
training and testing the models is derived from a metal processing industry case study. Ultimately,
the goal is to train a remaining useful life (RUL) estimation model.
Keywords: deep learning; artificial intelligence; transformers; autoencoders; Long Short-Term Memory (LSTM); predictive maintenance; remaining useful life
Citation: Bampoula, X.; Nikolakis, N.; Alexopoulos, K. Condition Monitoring and Predictive Maintenance of Assets in Manufacturing Using LSTM-Autoencoders and Transformer Encoders. Sensors 2024, 24, 3215. https://doi.org/10.3390/s24103215
Academic Editors: Pavlos Lazaridis, Christos Tachtatzis and Euler Cássio Tavares De Macêdo
Received: 28 February 2024
Revised: 11 May 2024
Accepted: 15 May 2024
Published: 18 May 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
One of the key aspects of Industry 4.0 is the integration of advanced technologies into production processes. The Internet of Things (IoT), as the key enabler of Industry 4.0, allows real-time data collection from a vast network of connected devices, sensors, and systems [1–3]. However, the enormous amount of digital information and data, known as Big Data (BD), generated and gathered by manufacturing Information and Communication Technology (ICT) systems usually remains underutilized [4]. Accordingly, new methods and models are needed that can truly benefit the ICT landscape and improve production processes by simple monitoring, planning, control, or even online reconfiguration of a system.
The process of examining these large and complex datasets, Big Data Analytics, can uncover hidden patterns, correlations, and other insights that are not visible to the human operator and support proactive decision making, transforming raw data into useful information and the transition from information to knowledge [5]. Big Data Analytics and data-driven techniques are becoming increasingly important for condition monitoring in various industries, including manufacturing, energy, transportation, and healthcare, revealing the actual condition of production equipment. Condition monitoring is the process of monitoring the health and performance of equipment and systems to identify potential issues and prevent failures. The goal of condition monitoring is to minimize downtime and improve overall efficiency by detecting issues before they become critical [6]. In turn, this could enable a transition from time-based preventive maintenance to predictive maintenance (PdM) or a combination of them. Performing PdM on production lines, that is, identifying potential malfunctions in production equipment and estimating its remaining useful life (RUL), is beneficial and important as maintenance activities can be scheduled, preventing equipment failures, minimizing downtime, and optimizing maintenance activities, leading to increased production and improved overall process performance [7–11]. However,
taking into account the existence of a wide spectrum of artificial intelligence methods and
tools, it is imperative to select an appropriate model which is capable of processing both
large and complex data as well as providing accurate predictions in a fast manner. The
existence of this gap is the motive of the present work, which aims to deliver a methodology
that takes advantage of data analytics algorithms in the processing of data captured in
production lines so as to give guidelines and detect features that can be used in PdM. As
such, the combination of LSTM-Autoencoders, as a preliminary preprocessing step, and
Transformer is a promising solution for addressing the above-mentioned challenges.
Additionally, the aim of this work is to propose a novel approach for fault detection
and RUL prediction. Autoencoders with Long Short-Term Memory (LSTM) networks and a
Transformer encoder are used to assess the operational condition of production equipment
and detect anomalies that are then mapped to different RUL values. A combination of
two LSTM-Autoencoder networks is proposed for classifying the current machine’s health
condition based on different corresponding labels and then one Transformer encoder is
used for RUL estimation. The main novelty of this approach is that a separate neural
network is trained for each label, leading to better results for each case. Consequently, this
method can be adjusted to several types of machines and labels. The proposed approach
has been evaluated in a steel industry case based on historical maintenance record datasets.
Finally, the development of a prototype method and the implementation of a software
prototype have shown that the proposed method can provide information regarding
the machine’s health without requiring any specialization and additional skills from the
industry operators.
The structure of this work is divided into six sections. After the end of the Introduction
section which presents the scope, challenges, and background of the present work, the
Literature Review section follows, including key points from the literature that evaluate
the performance of different data analytics algorithms and present how the topics of
maintenance in manufacturing processes are tackled. After the Literature Review, this
work continues with the Methods, Implementation and Case study sections, where the
methodology, the actions, and the means that are needed to perform predictive maintenance
in the actual case from industry are mentioned. Having created the models and extracted
the features, the Case study section includes a Discussion chapter which discusses the
models’ outputs and their interpretations as well as the competitive advantages. Finally, in
the Conclusions section the outputs of the involved developments are summarized.
2. Literature Review
The condition monitoring of equipment, which ensures good functionality over the years,
has become a necessity for industries [6]. Key reasons include repair downtime and the
increasing cost of equipment failures, driven by the advanced technology embedded in each
machine and robot, as well as machine idling during repair operations, which leads to
lower productivity, late deliveries, and, consequently, dissatisfied
customers [12,13]. Condition monitoring also assists the transition from the traditional,
reactive, and preventive type of maintenance to the modern PdM [14–16]. PdM relies on
AI technologies to analyze significant amounts of data as close to real time as possible,
detecting potential equipment failures [17–19]. Data-driven approaches/methodologies
are effective for PdM as ML (machine learning) models can be trained on labelled data
during process failure without requiring an in-depth understanding of the underlying
process [20,21]. This allows industries and machine manufacturers to leverage the vast
amounts of data generated by industrial equipment, IoT devices, and edge devices to
predict upcoming failures in the near future and schedule maintenance activities before
they occur, extending the lifetime of the component [22–24]. Moreover, this kind of data-driven
approach allows industries to continuously improve their predictive maintenance
procedures over time by updating, upgrading, and fine-tuning their ML predictive models
based on new data from the production site, improving the adaptability to any changing
condition, while being sure of the performance of equipment [25–27]. Many different
ML techniques have been explored and developed for PdM applications, as noted in
sources [28–33]. The choice of technique depends directly on the application as well as on
the given datasets and their characteristics [34].
Convolutional Neural Networks (CNNs) are a form of deep learning technique that
has found widespread use in image and video analysis [35]. CNNs can identify complex
patterns in the data that are not easily noticeable by a human operator [36,37] and are
capable of managing vast amounts of data, making them suitable for industrial applications
where massive amounts of sensor data are generated [38]. However, CNNs need labelled
data and struggle to effectively handle complex datasets when the data are homogeneous
and multi-channel [39,40]. Finally, CNNs are not well suited to handle sequences of data,
as, unlike Recurrent Neural Networks (RNNs), they cannot carry information from one step
of the sequence to the next [41,42].
Recurrent Neural Networks (RNNs) are a type of deep learning architecture specifically
optimized to handle sequential data for tasks such as natural language processing, speech
recognition, and time-series forecasting [10]. With their feedback loops, RNNs are able to
remember information from previous units by allowing it to pass across time
steps [43]. Despite their strength in handling sequences, RNNs struggle to maintain long-
term dependencies and may degrade in accuracy over time as the length of the input
sequence increases, making them less practical for real-time predictions [44]. However,
researchers have developed variants of RNNs, such as LSTM networks, that address these
challenges and allow for more effective use of RNNs in PdM tasks [45].
LSTM is a type of RNN that is capable of handling the vanishing gradient problem in
traditional RNNs by introducing a memory cell and gating mechanism [46]. LSTMs can
retain information for long sequences and are capable of handling long-term dependencies,
making them suitable for sequential data tasks such as time-series forecasting, natural
language processing, and speech recognition [47,48].
Autoencoders are a type of neural network that are used for dimensionality reduc-
tion and feature learning, and they consist of two main components: an encoder that
maps input data to a lower-dimensional representation, and a decoder that maps the
lower-dimensional representation back to the original input data [49–51]. Autoencoders
are relatively simple to train and implement, making them a popular choice for PdM
applications. However, Autoencoders are limited to working with vector-based data, and
their performance can be poor with sequential data such as time series or speech signals.
This is because regular Autoencoders cannot handle the temporal dependencies inherent
in sequential data. To address this limitation, LSTM-Autoencoders have been proposed,
which combine the sequential processing capabilities of LSTMs with the feature learning
capabilities of Autoencoders [52].
LSTM-Autoencoders are a type of Autoencoder architecture that uses LSTM networks
as the encoder and decoder parts. Combining LSTM and an Autoencoder creates a powerful
architecture for sequence data processing tasks, such as anomaly detection, data denoising,
and feature extraction [53,54]. The Autoencoder structure enables the model to learn a
compressed representation of the data, while the LSTM part allows the model to capture
the time-series dependencies and long-term patterns in the data. This combination results
in an efficient and effective method for analyzing sequential data [55].
Without using sequence-aligned RNNs, CNNs, or LSTMs, the Transformer is the
first transduction model relying entirely on self-attention to compute representations of
its input and output, becoming more and more ubiquitous in deep learning [56,57]. The
Transformer architecture (Figure 1) was introduced in the 2017 paper “Attention is All
You Need” [58] and has since been used in many state-of-the-art models for NLP (natural
language processing) tasks such as language translation, sentiment analysis, and text
classification. The main idea behind transformers is the use of self-attention mechanisms, which allow the model to focus on different parts of the input sequence and learn the relationships between them, making them well-suited for processing sequential data. Transformers eliminate the need to train neural networks with large, labelled datasets that are costly and time-consuming to produce by finding patterns between elements mathematically [59–62].
Figure 1. The Transformer model architecture.
In contrast to previous approaches, the use of the attention mechanism provided by these architectures allows us to take into consideration a plethora of characteristics involved in different forms of data [63,64]. Transformers have also been used for time-series data analysis and forecasting as they are capable of capturing long-term dependencies in the time-series data [65]. The use of Transformers for that kind of data analysis has shown promising results and is an area of active research and development.
Consequently, this paper proposes and examines a supervised deep learning method, combining a set of Autoencoders with Long Short-Term Memory (LSTM) networks and a Transformer encoder, for fault detection, health condition estimation, and RUL prediction of a machine. First, the set of LSTM-Autoencoder networks classify the general current health of the machine into distinct labels, and then, only if the LSTM-Autoencoders indicate that the machine’s health is bad, one Transformer encoder is used to classify the machine’s status into specific classes corresponding to different RUL values.
3. Method
Currently, AI provides a plethora of tools, methods, and models for the prediction of possible equipment malfunctions. Therefore, engineers have to face the challenge of carefully selecting the most appropriate ML model. In the presented case study, alternative ML models could be implemented, e.g., GRU, which requires fewer parameters and, by extension, fewer computational resources, at the cost of losing long-term dependencies built up in the dataframes. The two LSTM-Autoencoders have been used as a preliminary preprocessing step in the approach in order to filter out any irrelevant information and decide if the data require further analysis from the Transformer encoder. Then, the Transformer encoder further processes and analyzes the data, mapping them into different RUL classes. So, using LSTM-Autoencoders as a preliminary preprocessing step allows a balance between computational efficiency and model performance.
3.1. LSTM-Autoencoders
In order to train any set of LSTM-Autoencoders, sensor data are required, derived
from a production machine. After the training, the set of separate LSTM-Autoencoders can
classify new sensor data that have never been seen before to different operational machine
statuses. In particular, a variety of sensors placed on the machine measure
multiple features of the equipment and its environment. Preprocessing
of the data is mandatory, as data coming from industry can be inconsistent, noisy, or
even incomplete, leading to poor model performance. Apart from that, identifying the
appropriate set of features associated with potential failures is a challenging task. So, in
order to model the degradation process of any machine and determine the critical values,
plotting the dataframe values is proposed. After the visualization of the data, and in
combination with the knowledge and maintenance records of the factory specialists, related
studies, and scientific dissertations of a machine, the key features can be selected.
LSTM-Autoencoders are used for the classification of the health condition of a ma-
chine to one or more categories as explained hereafter. The architecture of each LSTM-
Autoencoder depends on the problem and the categories to be identified. The proposed
approach requires, at a minimum, two categories to determine the health condition of the
equipment: one category to represent the equipment’s good health condition, typically
after maintenance or part replacement, and the other category to represent bad health con-
ditions, such as due to degradation or failure that requires maintenance from an operational
perspective. Additional categories, beyond the two mentioned, could be included based
on specific needs and requirements. However, this specific study uses the minimum of
two categories, namely “good health” and “bad health”, to classify the health status of the
equipment. In order to classify these categories, an LSTM-Autoencoder is trained for each
label, with different datasets, so the number of LSTM-Autoencoders equals the number
of labels.
In order to define these different datasets and train the individual LSTM-Autoencoders,
historical maintenance records are used to label the data based on their timestamp
and the number and type of different statuses selected. Finally, a data split is performed
to define the training and test data for each LSTM-Autoencoder; 80% of the initial dataset is
used for the neural network training and validation, and the remaining 20% for testing the
neural network [66].
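The per-label dataset preparation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_per_label` is hypothetical, and the split is assumed to be chronological (no shuffling), which the text does not state.

```python
from collections import defaultdict


def split_per_label(labelled_windows, train_fraction=0.8):
    """Group (label, window) pairs by label and split each group 80/20.

    Assumption: windows arrive in chronological order and are not shuffled.
    """
    groups = defaultdict(list)
    for label, window in labelled_windows:
        groups[label].append(window)

    splits = {}
    for label, windows in groups.items():
        cut = int(len(windows) * train_fraction)
        # One training/test pair per label, i.e. one per LSTM-Autoencoder.
        splits[label] = {"train": windows[:cut], "test": windows[cut:]}
    return splits


# Toy data: 10 "good health" windows and 5 "bad health" windows.
data = [("good health", [0.1 * k]) for k in range(10)]
data += [("bad health", [0.9 + 0.01 * k]) for k in range(5)]
splits = split_per_label(data)
print({label: (len(s["train"]), len(s["test"])) for label, s in splits.items()})
```

Each label ends up with its own training and test set, matching the statement that the number of LSTM-Autoencoders equals the number of labels.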
Figure 2 illustrates a high-level LSTM-Autoencoder architecture. As presented in the
following Equation (1), the input of each LSTM-Autoencoder is a time-series sequence,
$A_i$, containing the values $\alpha_{ij}$ of each sensor, denoting one of the variables measured at a
specific time, with $n$ being the number of features.
$$A_i = \left[\alpha_{i1}, \alpha_{i2}, \alpha_{i3}, \ldots, \alpha_{ij}\right], \quad \text{where } \alpha_{ij} \in \mathbb{R}, \text{ with } i, j \in \mathbb{Z} \text{ and } i \le n \tag{1}$$
Consequently, this time-series sequence is the input of each LSTM cell of the encoder,
along with the hidden output from the previous LSTM cell. Finally, the output of the
encoder is a compressed representation of the input sequence, the learned representation
vector, which includes all the hidden states from all the previous encoder LSTM cells. This
output is fed then into the decoder to reconstruct the original input sequence, processing
these encoded features through a series of LSTM decoder cells. As presented in Equation (2),
the output of the decoder layer is a reconstruction of the initial input time-series sequence
$A'_i$, containing the reconstructed values $\alpha'_{ij}$ of each sensor.
$$A'_i = \left[\alpha'_{i1}, \alpha'_{i2}, \alpha'_{i3}, \ldots, \alpha'_{ij}\right], \quad \text{where } \alpha'_{ij} \in \mathbb{R}, \text{ with } i, j \in \mathbb{Z} \text{ and } i \le n \tag{2}$$
After the LSTM-Autoencoder training, the model is evaluated by feeding the test data, defined earlier, as input to the model, and then, the reconstructed values are compared with the input values. The metric used to evaluate the model is the Mean Squared Error (MSE) as presented in Equation (3).
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left(A'_i - A_i\right)^2 \tag{3}$$
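As a worked illustration of Equation (3), the sketch below computes the reconstruction error for one small window. It assumes the error is averaged over every value in the window (timesteps × features); the function name `mean_squared_error` and the toy numbers are illustrative only.

```python
def mean_squared_error(original, reconstructed):
    """MSE between two equally shaped windows (timesteps x features).

    Assumption: the squared errors are averaged over all values in the window.
    """
    squared_errors = [
        (r - o) ** 2
        for row_o, row_r in zip(original, reconstructed)
        for o, r in zip(row_o, row_r)
    ]
    return sum(squared_errors) / len(squared_errors)


# Two timesteps, two sensors; only the last value is reconstructed badly.
window = [[1.0, 2.0], [3.0, 4.0]]
reconstruction = [[1.0, 2.0], [3.0, 6.0]]
print(mean_squared_error(window, reconstruction))  # (0 + 0 + 0 + 4) / 4 = 1.0
```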
Following the training phase, new data, that the LSTM-Autoencoders have never seen before, are provided as input to the networks, and each of them produces different reconstructed values for the same input, as depicted in Figure 3.
Figure 2. High-level LSTM-Autoencoder architecture.
Figure 3. LSTM-Autoencoder architecture set.
The integration of outputs from the two separate LSTM-Autoencoders is achieved through a decision rule, based on their reconstruction losses, compared to the input. The LSTM-Autoencoder with the lower reconstruction loss indicates better recognition of the input dataset, and consequently, the input sequence is classified into the same category state as the one used to train this specific LSTM-Autoencoder.
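The decision rule can be sketched as picking the label whose autoencoder reconstructs the input with the lowest loss. In the sketch below, the two lambdas are hypothetical stand-ins for trained LSTM-Autoencoders (they simply return a constant reconstruction), and a single-sensor window is used for brevity.

```python
def mse(original, reconstructed):
    """Mean squared error between two equally long sequences."""
    pairs = list(zip(original, reconstructed))
    return sum((r - o) ** 2 for o, r in pairs) / len(pairs)


def classify_by_reconstruction(window, autoencoders):
    """Return the label whose autoencoder has the lowest reconstruction loss.

    autoencoders maps a label to a callable that reconstructs a window.
    """
    losses = {
        label: mse(window, reconstruct(window))
        for label, reconstruct in autoencoders.items()
    }
    return min(losses, key=losses.get), losses


# Hypothetical stand-ins: each "model" only reproduces values typical of the
# data it was trained on.
toy_autoencoders = {
    "good health": lambda w: [0.2 for _ in w],
    "bad health": lambda w: [0.9 for _ in w],
}

label, losses = classify_by_reconstruction([0.25, 0.15, 0.2], toy_autoencoders)
print(label, losses)
```

Because the rule only compares losses, adding a further category would amount to adding another entry to the mapping.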
In this approach, LSTM-Autoencoders serve as a preprocessing step. If the LSTM-Autoencoders classify the health status of the equipment as a “good state”, further analysis from the Transformer encoder is unnecessary. Otherwise, in case that the LSTM-Autoencoders classify the health status of the equipment as a “bad state”, the same input data are used as input to a Transformer encoder in order to identify its remaining useful life (Figure 4).
Figure 4. LSTM-Autoencoders and Transformer encoder integration.
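The two-stage flow of Figure 4 can be summarized in a few lines. This is only a sketch: `classify_health` and `predict_rul_class` are hypothetical callables standing in for the trained LSTM-Autoencoder set and the Transformer encoder, and the toy functions below are placeholders rather than real models.

```python
def assess_machine(window, classify_health, predict_rul_class):
    """Run the two-stage check on one window of sensor data."""
    health = classify_health(window)
    if health == "good":
        # Good state: the Transformer encoder is never invoked.
        return {"health": health, "rul_class": None}
    # Bad state: the same window is passed on for RUL classification.
    return {"health": health, "rul_class": predict_rul_class(window)}


# Toy stand-ins, for illustration only.
def toy_health(window):
    return "bad" if max(window) > 0.5 else "good"


def toy_rul(window):
    return 2


print(assess_machine([0.1, 0.2], toy_health, toy_rul))
print(assess_machine([0.1, 0.9], toy_health, toy_rul))
```

Skipping the second stage for healthy windows is what gives the approach its computational saving.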
3.2. Transformer Encoder
The Transformer encoder is used for the identification of the current machine’s health condition and mapping it to remaining useful life (RUL) by processing and extracting meaningful information from the input data and making predictions.
In the proposed approach, three (3) classes are used for the classification representing different health states of the machine. The data that belong to Class 0 represent the health state of machines with an RUL of 3–4 days. The data that belong to Class 1 represent the
health state of machines with an RUL of 2–3 days. Finally, the data that belong to Class 2
represent the health state of machines with an RUL of 1 day.
In order to label the data into the three (3) different classes, historical maintenance
records are taken into consideration based on their timestamp. Finally, a data split is
performed to define the training and test data for the Transformer encoder; 80% of the initial
dataset is used for the neural network training and validation, and the remaining 20% for
the neural network testing.
Figure 5 illustrates the Transformer encoder’s Multi-Head Attention architecture. The input of the Transformer encoder is a window from time-series data that are processed independently and contain the values of each sensor. After the Q, K, and V matrices are generated for each head independently, the next step is the matrix multiplication between the Queries matrix and the transposed Keys matrix, determining the relationships or the similarity of the Query and the Key values (the scores). These scores are then scaled down by being divided by the square root of the Query and Key dimension in order to avoid any exploding effect. SoftMax is then applied to the scaled score matrices in order to obtain the attention weights. Next, the attention weights of each head are multiplied with the corresponding Value matrix in order to produce one matrix for each head that contains the information of a value corresponding to the whole input. So, as the Transformer model has multiple heads (# of heads = h), the output is h matrices. Finally, all separate h outputs from each Attention Head are concatenated and then multiplied with the Wo matrix in order to output a matrix with the same shape as the input. The output of the Multi-Head Attention is then added to the original input (Figure 6) and passes through a normalization layer, making the model more robust and stable during training.
Figure 5. Transformer encoder Multi-Head Attention.
Figure 6. Transformer model residual connection.
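The Multi-Head Attention steps and the residual connection described above can be traced in a small, dependency-free sketch. Everything here is illustrative: the helper names, the toy projection matrices, and the identity `w_o` are assumptions, and the normalization omits the learnable scale and shift that a real layer-normalization layer would have.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # Query-Key similarity
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """Concatenate the h head outputs per timestep, then project with w_o."""
    outputs = [attention_head(x, *head)[0] for head in heads]
    concatenated = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return matmul(concatenated, w_o)


def add_and_normalize(x, sublayer_out, eps=1e-6):
    """Residual connection followed by per-timestep normalization."""
    result = []
    for x_row, s_row in zip(x, sublayer_out):
        summed = [a + b for a, b in zip(x_row, s_row)]
        mean = sum(summed) / len(summed)
        var = sum((v - mean) ** 2 for v in summed) / len(summed)
        result.append([(v - mean) / math.sqrt(var + eps) for v in summed])
    return result


# Toy window: 3 timesteps x 4 sensors, h = 2 heads of dimension 2.
x = [[1.0, 0.0, 2.0, 1.0], [0.5, 1.5, 0.0, 1.0], [2.0, 1.0, 1.0, 0.0]]
proj_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
proj_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(proj_a, proj_a, proj_a), (proj_b, proj_b, proj_b)]
w_o = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

attended = multi_head_attention(x, heads, w_o)
normalized = add_and_normalize(x, attended)
print(len(attended), len(attended[0]))  # same shape as the input window
```

Keeping the output shape equal to the input shape is what makes the residual addition in Figure 6 possible.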
After the normalization, the output is then passed through a Feed Forward network (Figure 7) and the output is added to the input and normalized again. Finally, the output of the Transformer encoder is a continuous representation of the input containing all the attention information that captures all the dependencies within the sequence. The output is further processed and passes through GlobalAveragePooling1D in order to produce the final output of the model, namely the probability of each class.
After the model training, the performance of the model is evaluated through the sparse_categorical_accuracy. This metric calculates the percentage of correctly classified samples in the dataset by comparing the predicted class labels with the true class labels.
Figure 7. Transformer model Feed Forward network.
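The final pooling step and the accuracy metric can be illustrated with plain Python. The function names below mirror the identifiers mentioned in the text but are re-implemented here for illustration; the dense softmax layer between pooling and the class probabilities is an assumption, as the text does not describe the output layer in detail.

```python
import math


def global_average_pooling_1d(sequence):
    """Average each feature over the time axis: (timesteps x features) -> (features,)."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]


def dense_softmax(features, weights, biases):
    """Assumed output layer: linear projection followed by softmax."""
    logits = [
        sum(f * w for f, w in zip(features, column)) + b
        for column, b in zip(zip(*weights), biases)
    ]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_accuracy(true_labels, predicted_probs):
    """Share of samples whose most probable class equals the integer label."""
    hits = sum(
        1 for y, probs in zip(true_labels, predicted_probs)
        if probs.index(max(probs)) == y
    )
    return hits / len(true_labels)


# Toy encoder output: 2 timesteps x 2 features, mapped to 3 RUL classes.
pooled = global_average_pooling_1d([[1.0, 2.0], [3.0, 4.0]])
probs = dense_softmax(pooled, [[0.1, 0.2, 0.3], [0.0, 0.1, 0.4]], [0.0, 0.0, 0.0])

# Toy evaluation: the third sample is misclassified.
accuracy = sparse_categorical_accuracy(
    [0, 2, 1],
    [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.5, 0.3, 0.2]],
)
print(pooled, accuracy)
```

The labels are integer class indices rather than one-hot vectors, which is what the "sparse" in the metric name refers to.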
After the model training, the performance of the model is evaluated through the
sparse_categorical_accuracy. This metric calculates the percentage of correctly classified
samples in the dataset by comparing the predicted class labels with the true class labels.
4. Implementation
For the testing and the validation of the proposed approach and its potential usefulness
for real-world applications, a prototype software system was implemented using
Python 3.7, incorporating the aforementioned method [67]. The system was integrated
using a computer with an Intel i7 processor (Intel(R) Core (TM) i7-3770 CPU @3.40 GHz
3.80 GHz), manufactured by Intel (Santa Clara, CA, USA). In terms of memory,
the computer was equipped with eight gigabytes of RAM from Samsung. Finally,
the aforementioned system was hosted and tested on a computer running Microsoft
Windows 10. Figure 8 illustrates a high-level representation of the LSTM-Autoencoder
and Transformer network implementation.
Figure 8. LSTM-Autoencoder and Transformer model implementation.
At first, the sensor data were imported to the implemented system as JSON files,
processed to remove missing values, and finally converted to a dataframe format using
the Pandas library. In the final dataframe, each column represented the values of a single
sensor, a feature, sorted in chronological order based on their timestamp. The selection of
features, used to determine the level of degradation of the machine, was based mainly on
human knowledge of the equipment and process and our bibliographic research. Finally,
in order to increase the model performance, at a second level, two labels were used for
the LSTM-Autoencoder network, identifying the good and bad operating condition of
the monitored equipment, and then three labels were used for the Transformer network,
identifying the RUL of the monitored equipment through classification.
In order to implement the LSTM-Autoencoders, the Keras library was used. Keras is a
popular open-source Python library that is widely used for developing and evaluating deep
learning models and provides a user-friendly interface for designing and training neural
networks. In the aforementioned proposed approach, the training dataset was segmented
based on historical maintenance records and then two separate LSTM-Autoencoders were
trained using data corresponding to each of the two equipment states, namely good and
bad. After training, newly arrived data were fed into each of the two separate
LSTM-Autoencoders, which are connected in parallel, in order to classify them into one of
the two supported labels, “bad state” or “good state”.
Then, in order to implement the Transformer model, the Keras library was also used. If
the LSTM-Autoencoder result is that the machine is in a bad state, the Transformer
model takes the same input in order to further process the data and classify
the RUL of the machine.
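The two-stage decision described above can be sketched as follows. This is a minimal illustration rather than the prototype's code: the function name `monitor` and its three callables (two reconstruction-error functions and an RUL classifier) are hypothetical stand-ins for the trained networks.

```python
from typing import Callable, Optional, Sequence

Window = Sequence[Sequence[float]]  # timesteps x features


def monitor(
    window: Window,
    good_error: Callable[[Window], float],
    bad_error: Callable[[Window], float],
    rul_classifier: Callable[[Window], int],
) -> Optional[int]:
    """Return None for a good state, otherwise the RUL class of the window."""
    # Both autoencoders receive the same window; the smaller error decides the state.
    if good_error(window) <= bad_error(window):
        return None  # "good state": the Transformer is not consulted
    return rul_classifier(window)  # "bad state": the same input is classified by RUL


# Usage with stub callables in place of the trained models (values are illustrative).
window = [[0.1, 0.2, 0.3, 0.4]] * 5
healthy = monitor(window, lambda w: 0.006, lambda w: 0.035, lambda w: 2)
degraded = monitor(window, lambda w: 0.037, lambda w: 0.005, lambda w: 2)
```

Passing the models as callables keeps the routing logic independent of how each network is implemented.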
Finally, during the experimentation stage, the accuracy of the system’s results was
cross-validated using the actual maintenance records provided by the use-case owner, as
described in the following section.
5. Case Study
5.1. Hot Rolling Mill
The aforementioned approach was implemented into a software prototype that was
trained and tested in a real-world steel production industry case. The data used in this
study were derived from a hot rolling mill machine that is used for producing metal
bars. Figure 9 illustrates a high-level diagram of the rolling mill machine components and
their connectivity. Sensor values were initially stored in a local database on the motion
controller and then transferred to a Programmable Logic Controller (PLC) database, and
finally, in a historical database. Real-time data were transmitted from the PLC database to
the PC for RUL prediction via communication channels. Additionally, as the developed
framework was implemented on an industrial intranet, and there was no external
communications/exchange of data outside the factory, no mechanisms for data privacy and
security were incorporated.
Figure 9. Hot rolling mill machine diagram.
The rolling cylinders of the hot rolling mill have different geometrically coated
segments attached to them, which are used to form the metal bars by applying force. The
rolling mill consists of three top and three bottom segments, each with a wear-resistant
coating. Regarding the preventive maintenance activities that take place for this machine,
the coated segments are scheduled to be replaced approximately every sixteen (16) days
or sooner in case of any unexpected damage, and the replacement of the coated segments
by the maintenance personnel typically lasts about two hours. The goal and objective of
this study is to enable the turn from preventive maintenance into predictive maintenance
by anticipating the behaviour of the segments through RUL prediction with the use of
neural networks.
5.2. Data Preprocessing

The hot rolling mill machine condition was monitored using a variety of sensors that
measured twenty-seven (27) different factors related to the equipment and its environment,
and the sensor installation and operation were carried out by the industrial case provider.
Of course, data coming from industry can be inconsistent, noisy, or even incomplete,
leading to poor model performance. Consequently, data preprocessing is a very important
step before the data are used for modelling and analysis [68]. All data preprocessing for
this use case was implemented through a separate software module. This module receives
JSON files as input. These files contain data from twenty-seven (27) sensors. The sampling
rate was chosen by the industrial case provider: data were collected every five
milliseconds (5 ms), whereas data storage took place within one-second (1 s) intervals.
Since the sampling rate was too dense, entries with zero or missing values were omitted.
The latter, i.e., entry omission, does not affect data consistency and quality since these data
are considered sensor faults. After completion of the above-mentioned processes, data
preprocessing is finalized, resulting in the creation of a unified dataframe, which is ready to
be used for subsequent analysis.
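A minimal sketch of such a preprocessing step is given below, assuming the Pandas library mentioned in Section 4. The JSON layout, the field names (`timestamp`, `temp_a`, `force_a`), and the sample values are hypothetical; only the omission of zero or missing entries and the chronological ordering follow the text.

```python
import json

import pandas as pd

# Hypothetical payload: one record per stored sample (field names are assumptions).
RAW_JSON = """[
  {"timestamp": "2021-03-01T00:00:02", "temp_a": 81.5, "force_a": 410.0},
  {"timestamp": "2021-03-01T00:00:00", "temp_a": 80.9, "force_a": 405.2},
  {"timestamp": "2021-03-01T00:00:01", "temp_a": 0.0,  "force_a": 407.1},
  {"timestamp": "2021-03-01T00:00:03", "temp_a": null, "force_a": 409.3}
]"""


def preprocess(raw_json: str) -> pd.DataFrame:
    """Build a chronologically sorted dataframe without zero or missing entries."""
    df = pd.DataFrame(json.loads(raw_json))
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.set_index("timestamp").sort_index()  # one column per sensor, ordered in time
    df = df.dropna()                             # omit entries with missing values
    return df[(df != 0).all(axis=1)]             # omit entries containing a zero reading


clean = preprocess(RAW_JSON)
```

In this toy input, the record with a zero temperature and the record with a missing temperature are dropped, leaving two rows in timestamp order.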
5.3. Feature Selection

Nevertheless, identifying the appropriate parameters and features that could be linked
to possible equipment failures is not an easy task. In order to select the important
parameters and features for our analysis, the first step involved plotting the data. Through
this visualization, critical areas in the dataframe were identified and targeted for further
analysis. Furthermore, in order to facilitate the process of feature selection, detailed
discussions with experts from the factory were held. As such, tacit knowledge was obtained,
which, by extension, enabled us to enrich the dataframe from raw data to information.
Finally, the dataframe was further elaborated by combining raw data with information from
historical maintenance records. According to studies and scientific dissertations related to
hot rolling mill machines [69], four relevant features for our approach were selected: the
surface temperature of cylinders A and B and the force of cylinders A and B on the trailing
arm (Table 1).
Table 1. Features selected.
Feature Name | Feature Value | Feature Description
Cylinder A segment surface temperature | Celsius (°C) | Surface temperature of cylinder A
Cylinder B segment surface temperature | Celsius (°C) | Surface temperature of cylinder B
Cylinder A hydraulic force | Kilonewton (kN) | Force of cylinder A on trailing arm
Cylinder B hydraulic force | Kilonewton (kN) | Force of cylinder B on trailing arm
5.4. LSTM-Autoencoder Architecture

Each LSTM-Autoencoder consists of an encoder and a decoder. The number of
LSTM-Autoencoder layers and neurons was selected and optimized following digital
experimentation and monitoring of performance metrics. Figure 10 illustrates the
architecture of each LSTM-Autoencoder and the data flow through the layers of the encoder
for one sample of the dataset of size 5 × 4 (assuming that timesteps = 5).
• The input data have five timesteps and four features.
• The first encoding LSTM layer (Layer 1, LSTM(128)) reads the input data and outputs
one hundred and twenty-eight (128) features with five timesteps 5 × 128, as
return_sequences = True.
• The second encoding LSTM layer (Layer 2, LSTM(64)) reads the input data 5 × 128
and after reduction, outputs a vector of size sixty-four (64) 1 × 64, the encoded feature
vector of the input data, as return_sequences = False.
• The repeat vector replicates the feature vector 1 × 64 five times and prepares the 2D
array input for the first LSTM layer in the decoder. The repeat vector is the bridge
between the encoder and decoder modules.
Figure 10. LSTM-Autoencoder encoder.

Figure 11, on the other hand, illustrates the data flow through the layers of the decoder.
• The first decoding LSTM layer (Layer 4, LSTM(64)) reads the input data 5 × 64 and
outputs sixty-four (64) features with five timesteps 5 × 64, as return_sequences = True.
• The second decoding LSTM layer (Layer 5, LSTM(128)) reads the input data 5 × 64
and outputs a vector of one hundred and twenty-eight (128) features with five
timesteps as return_sequences = True.
• The time distributed layer (Layer 6, TimeDistributed(Dense(4))) takes the output and
creates a 128 × 4 (number of features outputted from the previous layer × number of
features) vector.
• The matrix multiplication between the output of Layer 5, 5 × 128, and the output of
Layer 6, 128 × 4, resulted in a 5 × 4 output (the input and output dimensions match).
Figure 11. LSTM-Autoencoder decoder.
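The encoder and decoder layers listed above can be assembled in Keras as in the following sketch. The layer sizes and `return_sequences` settings follow the text; the helper name `build_lstm_autoencoder`, the Adam optimizer, and the mean-squared-error loss are assumptions, since they are not specified here.

```python
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, FEATURES = 5, 4  # window length and number of selected features


def build_lstm_autoencoder(timesteps: int = TIMESTEPS,
                           features: int = FEATURES) -> keras.Model:
    """Stack the encoder, repeat vector, and decoder described in the text."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, features)),
        layers.LSTM(128, return_sequences=True),         # Layer 1: 5 x 128
        layers.LSTM(64, return_sequences=False),         # Layer 2: 1 x 64 encoded vector
        layers.RepeatVector(timesteps),                  # bridge: 5 x 64
        layers.LSTM(64, return_sequences=True),          # Layer 4: 5 x 64
        layers.LSTM(128, return_sequences=True),         # Layer 5: 5 x 128
        layers.TimeDistributed(layers.Dense(features)),  # Layer 6: 5 x 4 reconstruction
    ])
    # Optimizer and loss are assumptions made for this sketch.
    model.compile(optimizer="adam", loss="mse")
    return model


autoencoder = build_lstm_autoencoder()
```

Calling `autoencoder.summary()` should list the per-layer output shapes, which can be compared with the shapes given in the text.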

Table 2 presents the architecture of each LSTM-Autoencoder, which includes the layers
of the network created, the number of parameters (weights and biases) of each layer, and
the total parameters of the model, as also described previously. In machine learning and
neural networks, the number of parameters in a neural network can have an impact on
the processing complexity of the model [70]. In this approach, the number of trainable
parameters in each network was 249,860, which resulted in the good performance of the
model.
Table 2. LSTM-Autoencoder: number of trainable parameters.
Layer | Type | Output Shape (Timesteps × Features) | Parameters
input1 | InputLayer | 5 × 4 | 0
lstm1 | LSTM | 5 × 128 | 68,096
lstm2 | LSTM | 1 × 64 | 49,408
repeatvector1 | RepeatVector | 5 × 64 | 0
lstm3 | LSTM | 5 × 64 | 33,024
lstm4 | LSTM | 5 × 128 | 98,816
timedistributed1 | TimeDistributed | 5 × 4 | 516
Total parameters: 249,860
Trainable parameters: 249,860
Non-trainable parameters: 0
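The per-layer counts in Table 2 can be checked with the standard parameter formulas for LSTM and Dense layers. The sketch below is a worked verification only; the helper names `lstm_params` and `dense_params` are illustrative.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim: int, units: int) -> int:
    """One weight per input-output pair plus one bias per output unit."""
    return units * (input_dim + 1)


layer_params = {
    "lstm1": lstm_params(4, 128),             # reads 4 features
    "lstm2": lstm_params(128, 64),            # reads 128 features
    "lstm3": lstm_params(64, 64),             # reads the repeated 64-feature vector
    "lstm4": lstm_params(64, 128),            # reads 64 features
    "timedistributed1": dense_params(128, 4), # Dense(4) applied at every timestep
}
total_params = sum(layer_params.values())
```

The InputLayer and RepeatVector layers hold no weights, so they do not contribute to the total.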
5.5. LSTM-Autoencoder Training and Testing
Apart from monitoring the equipment condition and data collection from the sensors,
another very important piece of information is the historical maintenance records. In the
aforementioned approach, two separate LSTM-Autoencoders were trained in order to
classify data into one of the two supported labels, “bad state” or “good state”. Each of these
two LSTM-Autoencoders was trained with a different dataset representing the different
situations of the machine, defined according to the previous segment’s exchange records
(Table 3).
Table 3. Historical maintenance records.
# | Mounted | Unmounted | RUL | Remark
1 | day 1 | day 12 | 12 days | Large piece broken out of surface
2 | day 1 | day 15 | 15 days | Large piece broken out of surface
3 | day 1 | day 16 | 16 days | Preventive maintenance
4 | day 1 | day 15 | 15 days | Large piece broken out of surface
As mentioned before, the coated segments are scheduled to be replaced approximately
every sixteen (16) days or sooner in case of any unexpected damage and failure. So, as
illustrated in Figure 12, we can assume that in the first two days that the coating was
mounted, the sensor data corresponded to a machine’s good state, and vice versa: in the
last two days before the coating was unmounted, the sensor data corresponded to a
machine’s bad state (Table 4).
Figure 12. Data selection for training LSTM-Autoencoders.
Table 4. Data selected for training LSTM-Autoencoders.
# | RUL | Remark | Good Data | Bad Data
1 | 12 days | Large piece broken out of surface | day 1–day 2 | day 11–day 12
2 | 15 days | Large piece broken out of surface | day 1–day 2 | day 14–day 15
3 | 16 days | Preventive maintenance | day 1–day 2 | -
4 | 15 days | Large piece broken out of surface | day 1–day 2 | day 14–day 15
Each dataset consisted of approximately 200,000 values. The datasets were then split
into training and test data, with the first 80% of each dataset used for training and
the remaining 20% used for testing. Both the training and test data were normalized to a
range from 0 to 1 to facilitate faster and better training of the neural networks.
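The split, normalization, and windowing steps can be sketched with NumPy as follows. The helper names are hypothetical, and two details are assumptions not stated in the text: the scaling statistics are taken from the training portion only, and consecutive windows overlap with a stride of one sample.

```python
import numpy as np


def chronological_split(data: np.ndarray, train_fraction: float = 0.8):
    """Keep the first part of the series for training and the rest for testing."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


def min_max_scale(train: np.ndarray, test: np.ndarray):
    """Scale each feature to [0, 1] using the training minimum and maximum."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on constant features
    return (train - lo) / span, (test - lo) / span


def make_windows(data: np.ndarray, window: int = 5) -> np.ndarray:
    """Slice a (samples, features) array into (windows, timesteps, features)."""
    count = len(data) - window + 1
    return np.stack([data[i:i + window] for i in range(count)])


# Toy series: 10 samples of 4 features (values are illustrative only).
series = np.arange(40, dtype=float).reshape(10, 4)
train, test = chronological_split(series)
train_scaled, test_scaled = min_max_scale(train, test)
train_windows = make_windows(train_scaled, window=5)
```

With training-only statistics, test values can fall slightly outside the 0 to 1 range; fitting the scaler on the full dataset would avoid that at the cost of leaking test information.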
Table 5 presents the training loss results after performing multiple experiments in
order to identify the ideal number of epochs, the window size, and the batch size in this
use case. Epoch refers to the number of times the entire training dataset is passed through
the neural network during the training process. In each epoch, the neural network goes
through all the training examples in the dataset. The batch size refers to the number of
samples that are processed at each training iteration, and the weights of the neural network
are updated after processing each batch.
Table 5. LSTM-Autoencoder training loss results (%).
Loss | Window Size 5, Batch 32 | Window Size 5, Batch 64 | Window Size 10, Batch 32 | Window Size 10, Batch 64 | Window Size 20, Batch 32 | Window Size 20, Batch 64
Good State | 0.0016 | 0.0015 | 0.0156 | 0.0224 | 0.0345 | 0.0338
Bad State | 0.0071 | 0.0071 | 0.0219 | 0.0438 | 0.0630 | 0.0416
After the training of the LSTM-Autoencoders, new datasets that the two separate
LSTM-Autoencoders had never seen before were then input. Each dataset was the input
for both LSTM-Autoencoders and each of them produced different reconstructed values
for the same input. The LSTM-Autoencoder whose reconstructed values present the smaller
reconstruction error with respect to the input has probably recognized the input better. As
a result, the input dataset is assigned to the same state category as the dataset with which
that LSTM-Autoencoder was trained. In Table 6, the first column refers to the actual states
of the monitored equipment on specific days according to the historical maintenance
records of the hot rolling mill, while the last two columns present the loss generated by
each one of the two LSTM-Autoencoders for the corresponding days.
Table 6. LSTM-Autoencoder test results.
Equipment State (maintenance records) | RUL (maintenance records) | Input Date (maintenance records) | Loss: AE Good State | Loss: AE Bad State
Good State | 15 days | day 2 | 0.006 | 0.035
Bad State | 15 days | day 14 | 0.037 | 0.005
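The comparison of reconstruction errors can be illustrated as follows. The use of the mean squared error as the reconstruction loss is an assumption, and the function names are hypothetical; the two decisions in the usage lines use the loss values reported in Table 6.

```python
import numpy as np


def reconstruction_error(window, reconstruction) -> float:
    """Mean squared error between an input window and its reconstruction."""
    window = np.asarray(window, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)
    return float(np.mean((window - reconstruction) ** 2))


def classify_state(loss_good: float, loss_bad: float) -> str:
    """Assign the state of the autoencoder that reconstructs the input best."""
    return "good state" if loss_good <= loss_bad else "bad state"


# Loss values from Table 6 (day 2 and day 14 of a 15-day segment life).
day_2_state = classify_state(loss_good=0.006, loss_bad=0.035)
day_14_state = classify_state(loss_good=0.037, loss_bad=0.005)
```

Both decisions agree with the equipment states recorded in the maintenance records for those days.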
5.6. Transformer Encoder Architecture

Figure 13 illustrates the architecture of one of the Transformer encoders and the
data flow through the layers of the encoder. Transformers consist of a fixed number of
stacked layers [71]. After windowing, the sample input data consists of five timesteps and
four features.
• A LayerNormalization layer normalizes the input data and outputs four features with
five timesteps (5 × 4).
• A MultiHeadAttention layer outputs four features with five timesteps (5 × 4).
• A Dropout layer outputs four features with five timesteps (5 × 4).
• An Addition layer outputs four features with five timesteps (5 × 4).
• A LayerNormalization layer normalizes the input data and outputs four features with
five timesteps (5 × 4).
• A Conv1D layer operates as a feature extractor and captures patterns, applying a
1D convolution operation to the input, and outputs four features with five timesteps
(5 × 4).
• A Dropout layer randomly sets a fraction of input units to zero and outputs four
features with five timesteps (5 × 4).
• A Conv1D layer applies a 1D convolution operation to the input and outputs four
features with five timesteps (5 × 4).
Figure 13. Transformer encoder.
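One encoder block with the layer order listed above can be sketched in Keras as follows. The hyperparameters (`head_size`, `num_heads`, `ff_dim`, `dropout`, and the number of stacked blocks) are hypothetical values chosen so that every layer outputs 5 × 4, and the closing residual addition is an assumption taken from the common encoder design rather than from the list.

```python
from tensorflow import keras
from tensorflow.keras import layers


def transformer_encoder(inputs, head_size=16, num_heads=2, ff_dim=4, dropout=0.1):
    """One encoder block: attention part followed by a Conv1D feed-forward part."""
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(x, x)                                   # self-attention over the timesteps
    x = layers.Dropout(dropout)(x)
    res = layers.Add()([x, inputs])           # addition layer (residual connection)

    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    return layers.Add()([x, res])             # assumed residual after the feed-forward part


inputs = keras.Input(shape=(5, 4))            # five timesteps, four features
x = inputs
for _ in range(2):                            # number of stacked blocks is illustrative
    x = transformer_encoder(x)
encoder = keras.Model(inputs, x)
```

Because each block preserves the 5 × 4 shape, any number of blocks can be stacked before the classification layers.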
Finally, after the input passes through all of the stacked Transformer encoders, the
output is an encoded representation of the input. The number of stacked Transformer
encoders is selected and optimized following digital experimentation and monitoring of
performance metrics. The Transformer encoders create a continuous representation of
the input with attention information, capturing all the dependencies within the sequence.
Then, the output is further processed in order to produce the final output of the model,
as depicted in Figure 14. A GlobalAveragePooling1D layer takes the input tensor and
computes the average value along the timesteps of the input tensor and outputs a tensor
with shape (# of samples, # of features). Then, this output is passed through the Dense
layer that applies linear transformation, followed by the ReLU activation function. Then,
the output of the Dense layer passes through a Dropout layer. Finally, the output of the
Dropout layer is passed through a Dense layer with units = # of classes applying linear
transformation followed by the SoftMax activation function. This function outputs the
probabilities of the # of classes.
Figure 14. Probability generation for the classification.
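The arithmetic of this classification head can be written out with NumPy as in the sketch below. The weights are random placeholders and the hidden width of eight units is a hypothetical choice; Dropout is omitted because it is inactive at inference time.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def classification_head(encoded, w1, b1, w2, b2) -> np.ndarray:
    """Global average pooling, Dense + ReLU, then Dense + softmax."""
    pooled = encoded.mean(axis=1)               # (samples, timesteps, features) -> (samples, features)
    hidden = np.maximum(pooled @ w1 + b1, 0.0)  # linear transformation followed by ReLU
    return softmax(hidden @ w2 + b2)            # one probability per class


rng = np.random.default_rng(0)
encoded = rng.normal(size=(2, 5, 4))            # two samples of encoder output (5 x 4)
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # hypothetical hidden layer of 8 units
w2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # three RUL classes
probs = classification_head(encoded, w1, b1, w2, b2)
```

Each row of `probs` sums to one, and the predicted class is the index of the largest probability in that row.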
5.7. Transformer Encoder Training and Testing
For the Transformer model training, the segment’s exchange records (Table 3) are
used to label the data into different classes. For example, as illustrated in Figure 15,
assuming that the new segment was mounted on day one (1) and was unmounted because
of a break down on day twelve (12), the data from day 7 and day 8 can be labelled as Class
0, the data from day 9 and day 10 can be labelled as Class 1, and finally, the data from day
11 can be labelled as Class 2.
Figure 15. Data selection for training the Transformer encoder.
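The labelling rule in the example above can be expressed as a small function. The class boundaries below are inferred from that single example (unmounting on day 12) and are therefore an assumption; the function name `rul_class` is hypothetical.

```python
from typing import Optional


def rul_class(day: int, unmount_day: int) -> Optional[int]:
    """Map a day of operation to an RUL class, or None if the day is not labelled."""
    days_left = unmount_day - day
    if days_left == 1:
        return 2                 # the day before the segment was unmounted
    if days_left in (2, 3):
        return 1
    if days_left in (4, 5):
        return 0
    return None                  # earlier days are not used for training


# Segment mounted on day 1 and unmounted on day 12, as in the example.
labels = {day: rul_class(day, unmount_day=12) for day in range(1, 12)}
```

For this segment, the function returns Class 0 for days 7 and 8, Class 1 for days 9 and 10, and Class 2 for day 11.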
The input dataset consisted of approximately 300,000 values. The datasets were then
split into training and test data, with 80% of the first part of the dataset used for training
and the remaining 20% used for testing. Both the training and test data were normalized to
a range from 0 to 1 to facilitate faster and better training of the neural networks.
Table 7 presents the best accuracy rate after performing multiple experiments in order
to identify the ideal window size and batch size in this use case.
Table 7. Transformer encoder training accuracy results (%).
Metric | Window Size 5, Batch 32 | Window Size 5, Batch 64 | Window Size 10, Batch 32 | Window Size 10, Batch 64 | Window Size 20, Batch 32 | Window Size 20, Batch 64
Accuracy (%) | 73% | 66% | 81% | 37% | 96% | 93%
Following the completion of the model training phase, a series of digital experiments
were conducted. For these experiments, new datasets were used, derived from the splitting
of the initial dataframe. These experiments share the same methodology, yet with different
datasets as input to the Transformer model. The output of each experiment is a set of
classification metric values and confusion matrices over the different classes. Finally, the
results from the experiments were cross-validated using the actual maintenance records
provided by the use-case owner for the evaluation of the system’s performance. Each class
corresponds to a different health state of the machine (Table 8).
Table 8. Classes and RUL definition.
Classes RUL
Class 0 3–4 days
Class 1 2–3 days
Class 2 1 day
Tables 9–11 present the classification metric values in order to evaluate the performance
of the Transformer model. The metrics used for the evaluation are Precision, Recall, F1
Score and Accuracy and are calculated for each class in each input dataset. The input
datasets used for the experiments were labelled as Class 0, Class 1, and Class 2 based
on the segment’s exchange records. Confusion matrixes are used in order to provide a
representation of the Transformer model’s actual class labels and the predictions for each
class (Figures 16–18). Each row of the confusion matrix represents the number of data
values that belong in the real class, and each column represents the number of data values
in the predicted class.
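As an illustration of how these metrics follow from a confusion matrix with rows as real classes and columns as predicted classes, consider the sketch below. Both matrices are hypothetical and are not the ones shown in the figures; the function name `per_class_metrics` is illustrative.

```python
def per_class_metrics(cm):
    """Return per-class precision, recall, F1, and support, plus overall accuracy."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum: samples of real class k
        predicted = sum(cm[r][k] for r in range(n))  # column sum: samples predicted as k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = {"precision": precision, "recall": recall,
                      "f1": f1, "support": actual}
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return metrics, accuracy


# Hypothetical three-class matrix.
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 1, 9]]
metrics, accuracy = per_class_metrics(cm)

# Hypothetical dataset in which every sample belongs to Class 2.
single = [[0, 0, 0],
          [0, 0, 0],
          [50, 30, 20]]
single_metrics, single_accuracy = per_class_metrics(single)
```

When all samples belong to one class, as in the second matrix, the precision of that class is 100% by construction and the accuracy equals its recall.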
Table 9. Transformer results: Experiment 1—maintenance because of break down.
Class | Precision (%) | Recall (%) | F1 Score (%) | Confidence (%) | Support
Class 0 | 94% | 70% | 80% | 70% | 3600
Class 1 | 98% | 97% | 98% | 97% | 3600
Class 2 | 74% | 94% | 83% | 94% | 3580
Accuracy (%) | 87%
Table 10. Transformer results: Experiment 2—maintenance because of break down.
Class | Precision (%) | Recall (%) | F1 Score (%) | Confidence (%) | Support
Class 0 | 89% | 78% | 83% | 78% | 3600
Class 1 | 96% | 92% | 94% | 92% | 3600
Class 2 | 76% | 88% | 81% | 88% | 3580
Accuracy (%) | 86%
Table 11. Transformer results: Experiment 3—maintenance because of break down.
Class | Precision (%) | Recall (%) | F1 Score (%) | Confidence (%) | Support
Class 0 | 60% | 16% | 25% | 16% | 3600
Class 1 | 56% | 65% | 60% | 64% | 3600
Class 2 | 56% | 88% | 68% | 88% | 3580
Accuracy (%) | 56%
Figure 16. Confusion matrix: Experiment 1—maintenance because of break down.
Figure 17. Confusion matrix: Experiment 2—maintenance because of break down.
Figure 18. Confusion matrix: Experiment 3—maintenance because of break down.
The input datasets used for the following three experiments were labelled as Class 2
because, according to the segment’s exchange records, they were taken the day before a
maintenance activity, even though that activity was preventive. As the segment exchange
took place preventively and not because of a segment breakdown, the machine may have had
a few more days of expected life. Consequently, it is interesting to observe the
Transformer model’s predictions for these cases (Tables 12–14).
Table 12. Transformer results: Experiment 4 (preventive maintenance).
Precision (%) Recall (%) F1 Score (%) Confidence (%) Support
Class 0 0 0 0 0 0
Class 1 0 0 0 0 0
Class 2 100% 19% 32% 19% 3580
Accuracy (%) 19%
Table 13. Transformer results: Experiment 5 (preventive maintenance).
Precision (%) Recall (%) F1 Score (%) Confidence (%) Support
Class 0 0 0 0 0 0
Class 1 0 0 0 0 0
Class 2 100% 10% 20% 10% 3580
Accuracy (%) 10%
Table 14. Transformer results: Experiment 6 (preventive maintenance).
Precision (%) Recall (%) F1 Score (%) Confidence (%) Support
Class 0 0 0 0 0 0
Class 1 0 0 0 0 0
Class 2 100% 1% 1% 1% 3580
Accuracy (%) 1%
Confusion matrices show that, although these data were taken the day before the
preventive maintenance activities and should belong to Class 2, the machine may have had
a few more days of expected life. According to Figures 19 and 20, the Transformer model
predicted that these data belong to Class 0 and have about 3–4 more days of life, while,
according to Figure 21, the Transformer model predicted that these data belong to
Class 1 and have about 2–3 more days of life.
Figure 19. Confusion matrix: Experiment 4 (preventive maintenance).
Figure 20. Confusion matrix: Experiment 5 (preventive maintenance).
Figure 21. Confusion matrix: Experiment 6 (preventive maintenance).
5.8. Discussion
In order to evaluate the performance of the proposed approach, four months of
machine operation data were used, and the datasets for training and testing were created
based on the historical maintenance records from the hot rolling mill machine.
For the LSTM-Autoencoders (Table 6), the difference between the losses of the two
networks was sufficient to categorize and label the input data and identify the health
status of the hot rolling mill machine.
The results from the experiments were cross-validated using the actual maintenance
records provided by the use-case owner for the evaluation of the system’s performance.
According to the data presented in Tables 9–11, the Transformer model can predict the
equipment’s health state, predict the remaining useful life, and help prevent any failure
or breakdown with high confidence (overall accuracy of 87%, 86%, and 56% in
Experiments 1–3, respectively). Additionally, the network results in Tables 12–14 show
that the equipment was still in a healthy state at the time of the preventive maintenance
activities. Consequently, over a period of one (1) year, given that preventive
maintenance activities take place every sixteen (16) days, the equipment could gain (on
average) approximately fifty-seven (57) more days of life, with a 17.39% reduction in
preventive stoppages.
As indicated in the LSTM-Autoencoder Training and Testing paragraph, the developed
framework can predict the equipment’s health status and the corresponding RUL values
with a high confidence rate. However, the fact that the confidence level remains below
100% indicates that the developed framework is a complementary tool that provides good
estimates for the technician/engineer, and that human intervention is still required to
ensure seamless operation of the production line. Concretely, the developed framework
can be used as a smart suggestion system which monitors the status of the equipment and
interprets data, in an attempt to inform technicians/engineers whether or not the specific
equipment requires maintenance to be carried out.
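The discussion does not spell out how the yearly gain of 57 days and the 17.39% reduction in preventive stoppages are obtained. One reading that reproduces both figures, offered here as an assumption rather than as the authors' stated calculation, is that each 16-day maintenance interval is extended by about 3 days, in line with the 2–4 additional days suggested by Figures 19–21. All variable names below are hypothetical.

```python
DAYS_PER_YEAR = 365
BASELINE_INTERVAL_DAYS = 16   # from the text: preventive maintenance every 16 days
ASSUMED_EXTRA_DAYS = 3        # assumption: each interval is extended by about 3 days

# Number of preventive stoppages per year, rounded to whole stoppages.
baseline_stops = round(DAYS_PER_YEAR / BASELINE_INTERVAL_DAYS)
extended_stops = round(DAYS_PER_YEAR / (BASELINE_INTERVAL_DAYS + ASSUMED_EXTRA_DAYS))

# Extra operating days gained across all extended intervals in one year.
days_gained = extended_stops * ASSUMED_EXTRA_DAYS

# Relative reduction in the number of preventive stoppages.
reduction_pct = 100 * (baseline_stops - extended_stops) / baseline_stops

print(baseline_stops, extended_stops, days_gained, round(reduction_pct, 2))
# prints: 23 19 57 17.39
```

Under this reading, 23 stoppages per year drop to 19, which corresponds to 4/23 ≈ 17.39% fewer stoppages and 19 × 3 = 57 additional days of operation.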
6. Conclusions
In conclusion, this study proposes a new approach for fault detection that evaluates
the condition of production assets and predicts their remaining useful life (RUL). To
implement this solution, Autoencoders with Long Short-Term Memory (LSTM)
networks were combined with a Transformer encoder to evaluate the functional status of a
hot rolling mill machine in manufacturing, identify any anomalies, and map them to RUL
values. Initially, two LSTM-Autoencoder networks were trained to classify the machine’s
current health condition into one of two labels, good state and bad state. Then, a
Transformer encoder was trained to estimate the remaining useful life of the machine.
The novelty of the proposed approach is that, in the first phase, a separate
LSTM-Autoencoder is trained for each label, leading to better results and making the
approach easily extensible to more labels following exactly the same logic and procedure.
The two LSTM-Autoencoders were used as a preliminary preprocessing step to filter out any
irrelevant information and decide whether the data required further analysis by the
Transformer encoder. The Transformer encoder then further processes and analyzes the
data, mapping them into different RUL classes. Using LSTM-Autoencoders as a preliminary
preprocessing step thus allows a balance between computational efficiency and model
performance. Furthermore, architectural characteristics of Transformers, such as
non-sequential processing and self-attention mechanisms, enable such models to process
large datasets in real time and provide faster responses than other similar models.
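The first-phase logic (one model per label, with the label chosen by comparing reconstruction losses) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: `MeanProfileReconstructor` is a hypothetical stand-in for a trained LSTM-Autoencoder, the data are synthetic, and the rule that only "bad" windows are handed to the second stage is an assumed reading of the filtering step.

```python
import numpy as np

class MeanProfileReconstructor:
    """Hypothetical stand-in for a trained LSTM-Autoencoder.

    It "reconstructs" every window as the mean window of its training data,
    so windows that resemble that data get a low reconstruction loss.
    """

    def fit(self, windows):
        self.profile = np.mean(windows, axis=0)
        return self

    def reconstruction_loss(self, window):
        # Mean squared error between the window and its reconstruction.
        return float(np.mean((window - self.profile) ** 2))

def label_window(window, models):
    """Return the label whose model reconstructs the window with the lowest loss."""
    losses = {label: m.reconstruction_loss(window) for label, m in models.items()}
    return min(losses, key=losses.get), losses

rng = np.random.default_rng(0)
# Synthetic windows of shape (samples, timesteps, features); not the paper's data.
good_train = rng.normal(loc=0.0, scale=0.1, size=(50, 10, 4))
bad_train = rng.normal(loc=1.0, scale=0.1, size=(50, 10, 4))

# One model per label; supporting a further label means adding one more entry.
models = {
    "good": MeanProfileReconstructor().fit(good_train),
    "bad": MeanProfileReconstructor().fit(bad_train),
}

new_window = rng.normal(loc=1.0, scale=0.1, size=(10, 4))
label, losses = label_window(new_window, models)
needs_rul_analysis = label == "bad"   # assumed hand-off rule to the second stage
```

Keeping the models in a dictionary keyed by label mirrors the point made above: the decision rule stays the same regardless of how many labels are used.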
Real-world data from a hot rolling mill machine were used both for training and
testing the neural networks, and the obtained results were satisfactory, as presented
in this study. However, several challenges emerged during the development of the
presented method. One of the key limitations was the extensive data preprocessing
required. Concretely, a manual labelling process was mandatory, which was addressed by
combining the dataframe with labels derived from historical maintenance records. Another
key limitation was the increased complexity of the data, which was addressed by
iteratively fine-tuning the hyperparameters of the model. Consequently, additional
experiments need to be conducted using a more extensive dataset of higher quality
covering a longer time period.
The results from all the different experiments show that the proposed approach
is promising and can help to improve maintenance planning, reducing redundant and
preventive stoppages in the production line, preventing any serious failure of the machine
before it happens, and leading to a decrease in the cost of maintenance operations. Finally,
the proposed method can provide information regarding the machine’s health without
requiring any specialization or additional skills from the industry operators.
However, one limitation of the proposed approach arises when dealing with data of
higher resolution with multiple labels, requiring multiple neural networks to identify the
machine’s status. Such cases can be computationally complex, and neural networks may
not be able to accurately distinguish neighbouring states. Another limitation of this
approach is the requirement for maintenance records, such as component breakdowns and
failures, to label the datasets. These kinds of data are scarce in industry, as
preventive maintenance activities are planned precisely to avoid such critical failures
of the equipment.
A next step for this approach is performance optimization by choosing different sets
of hyperparameters for each network, conducting experiments, and comparing the results.
The robustness of the model to anomalies and noisy data will also be evaluated. The same
approach could also be tested with more than four features and high-dimensional data, or
with a completely different set of features for training. This expansion would allow the
model to uncover more hidden patterns, relationships, and correlations that may remain
undiscovered within the constraints of the current implementation.
Future work will also focus on evaluating the proposed concept against other machine
learning methods that combine different neural networks for each step, using different
datasets from different real-world scenarios. In terms of implementation, and in order to
minimize the framework’s response time (i.e., approach real-time operation), a better
network infrastructure needs to be implemented to reduce network latency and system
response time. Furthermore, regarding the neural network operation, the utilization of
high-power GPUs could further reduce prediction time. In an attempt to improve the impact
of the proposed method, future work will also involve the comparison of the developed
model with statistical models, e.g., the exponential degradation model. Finally,
different architectures for varying conditions will be investigated and compared against
the current approach.
Author Contributions: Conceptualization, K.A., N.N. and X.B.; methodology, K.A. and X.B.;
software, X.B.; validation, X.B.; formal analysis, X.B.; investigation, X.B.; resources, K.A.; data curation,
X.B.; writing—original draft preparation, X.B.; writing—review and editing, X.B., N.N. and K.A.;
visualization, X.B.; supervision, K.A. and N.N.; project administration, N.N.; funding acquisition,
K.A. All authors have read and agreed to the published version of the manuscript.
Funding: This research has been partially funded by the European project “SERENA—VerSatilE
plug-and-play platform enabling REmote predictive mainteNAnce” (Grant Agreement: 767561).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author due to privacy restrictions.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Chryssolouris, G.; Alexopoulos, K.; Arkouli, Z. A Perspective on Artificial Intelligence in Manufacturing; Springer Nature:
Berlin/Heidelberg, Germany, 2023; Volume 436.
2. Rahman, M.S.; Ghosh, T.; Aurna, N.F.; Kaiser, M.S.; Anannya, M.; Hosen, A.S. Machine Learning and internet of things in industry
4.0: A review. Meas. Sens. 2023, 28, 100822. [CrossRef]
3. Vaidya, S.; Ambad, P.; Bhosle, S. Industry 4.0—A glimpse. Procedia Manuf. 2018, 20, 233–238. [CrossRef]
4. Grabowska, S. Smart factories in the age of Industry 4.0. Manag. Syst. Prod. Eng. 2020, 28, 90–96. [CrossRef]
5. Sestino, A.; Prete, M.I.; Piper, L.; Guido, G. Internet of Things and Big Data as enablers for business digitalization strategies.
Technovation 2020, 98, 102173. [CrossRef]
6. Liu, Z.; Mei, W.; Zeng, X.; Yang, C.; Zhou, X. Remaining useful life estimation of insulated gate biploar transistors (IGBTS) based
on a novel volterra K-nearest neighbor optimally pruned extreme learning machine (VKOPP) model using degradation data.
Sensors 2017, 17, 2524. [CrossRef]
7. Le Xuan, Q.; Adhisantoso, Y.G.; Munderloh, M.; Ostermann, J. Uncertainty-aware remaining useful life prediction for predictive
maintenance using deep learning. Procedia CIRP 2023, 118, 116–121. [CrossRef]
8. Lee, J.; Mitici, M. Deep reinforcement learning for predictive aircraft maintenance using probabilistic remaining-useful-life
prognostics. Reliab. Eng. Syst. Saf. 2023, 230, 108908. [CrossRef]
9. de Pater, I.; Mitici, M. Predictive maintenance for multi-component systems of repairables with Remaining-Useful-Life prognostics
and a limited stock of spare components. Reliab. Eng. Syst. Saf. 2021, 214, 107761. [CrossRef]
10. Guo, L.; Li, N.; Jia, F.; Lei, Y.; Lin, J. A recurrent neural network based health indicator for remaining useful life prediction of
bearings. Neurocomputing 2017, 240, 98–109. [CrossRef]
11. Chen, C.; Shi, J.; Lu, N.; Zhu, Z.H.; Jiang, B. Data-driven predictive maintenance strategy considering the uncertainty in remaining
useful life prediction. Neurocomputing 2022, 494, 79–88. [CrossRef]
12. Stavropoulos, P.; Papacharalampopoulos, A.; Vasiliadis, E.; Chryssolouris, G. Tool wear predictability estimation in milling based
on multi-sensorial data. Int. J. Adv. Manuf. Technol. 2016, 82, 509–521. [CrossRef]
13. Zhang, C.; Yao, X.; Zhang, J.; Jin, H. Tool condition monitoring and remaining useful life prognostic based on a wireless sensor in
dry milling operations. Sensors 2016, 16, 795. [CrossRef] [PubMed]
Sensors 2024, 24, 3215 23 of 25
14. Aivaliotis, P.; Georgoulias, K.; Chryssolouris, G. The use of Digital Twin for predictive maintenance in manufacturing. Int. J.
Comput. Integr. Manuf. 2019, 32, 1067–1080. [CrossRef]
15. Dhiman, H.S.; Deb, D.; Muyeen, S.M.; Kamwa, I. Wind turbine gearbox anomaly detection based on adaptive threshold and twin
support vector machines. IEEE Trans. Energy Convers. 2021, 36, 3462–3469. [CrossRef]
16. Dhiman, H.S.; Bhanushali, D.; Su, C.-L.; Berghout, T.; Amirat, Y.; Benbouzid, M. Enhancing Wind Turbine Reliability through
Proactive High Speed Bearing Prognosis Based on Adaptive Threshold and Gated Recurrent Unit Networks. In Proceedings
of the IECON 2023-49th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 16–19 October 2023; IEEE:
New York, NY, USA, 2023; pp. 1–6.
17. Gao, R.; Wang, L.; Teti, R.; Dornfeld, D.; Kumara, S.; Mori, M.; Helu, M. Cloud-enabled prognosis for manufacturing. CIRP Ann.
2015, 64, 749–772. [CrossRef]
18. Oo, M.C.M.; Thein, T. An efficient predictive analytics system for high dimensional big data. J. King Saud Univ.-Comput. Inf. Sci.
2022, 34, 1521–1532. [CrossRef]
19. Suh, J.H.; Kumara, S.R.; Mysore, S.P. Machinery fault diagnosis and prognosis: Application of advanced signal processing
techniques. CIRP Ann. 1999, 48, 317–320. [CrossRef]
20. Cerquitelli, T.; Nikolakis, N.; O’Mahony, N.; Macii, E.; Ippolito, M.; Makris, S. Predictive Maintenance in Smart Factories; Springer:
Singapore, 2021.
21. Huang, C.G.; Huang, H.Z.; Li, Y.F. A bidirectional LSTM prognostics method under multiple operational conditions. IEEE Trans.
Ind. Electron. 2019, 66, 8792–8802. [CrossRef]
22. Liu, C.; Yao, R.; Zhang, L.; Liao, Y. Attention based Echo state Network: A novel approach for fault prognosis. In Proceedings of
the 2019 11th International Conference on Machine Learning and Computing, Zhuhai, China, 22–24 February 2019; pp. 489–493.
23. Jaenal, A.; Ruiz-Sarmiento, J.-R.; Gonzalez-Jimenez, J. MachNet, a general Deep Learning architecture for Predictive Maintenance
within the industry 4.0 paradigm. Eng. Appl. Artif. Intell. 2024, 127, 107365. [CrossRef]
24. Alabadi, M.; Habbal, A.; Guizani, M. An Innovative Decentralized and Distributed Deep Learning Framework for Predictive
Maintenance in the Industrial Internet of Things. IEEE Internet Things J. 2024. [CrossRef]
25. Farahani, S.; Khade, V.; Basu, S.; Pilla, S. A data-driven predictive maintenance framework for injection molding process. J. Manuf.
Process. 2022, 80, 887–897. [CrossRef]
26. Yousuf, M.; Alsuwian, T.; Amin, A.A.; Fareed, S.; Hamza, M. IoT-based health monitoring and fault detection of industrial AC
induction motor for efficient predictive maintenance. Meas. Control 2024. [CrossRef]
27. D’Urso, D.; Chiacchio, F.; Cavalieri, S.; Gambadoro, S.; Khodayee, S.M. Predictive maintenance of standalone steel industrial
components powered by a dynamic reliability digital twin model with artificial intelligence. Reliab. Eng. Syst. Saf. 2024, 243,
109859. [CrossRef]
28. Sawant, V.; Deshmukh, R.; Awati, C. Machine learning techniques for prediction of capacitance and remaining useful life of
supercapacitors: A comprehensive review. J. Energy Chem. 2022, 77, 438–451. [CrossRef]
29. Zhang, H.; Luo, Y.; Zhang, L.; Wu, Y.; Wang, M.; Shen, Z. Considering three elements of aesthetics: Multi-task self-supervised
feature learning for image style classification. Neurocomputing 2023, 520, 262–273. [CrossRef]
30. Kwak, D.; Choi, S.; Chang, W. Self-attention based deep direct recurrent reinforcement learning with hybrid loss for trading
signal generation. Inf. Sci. 2023, 623, 592–606. [CrossRef]
31. de Carvalho Bertoli, G.; Junior, L.A.P.; Saotome, O.; dos Santos, A.L. Generalizing intrusion detection for heterogeneous networks:
A stacked-unsupervised federated learning approach. Comput. Secur. 2023, 127, 103106. [CrossRef]
32. Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud
Univ.-Comput. Inf. Sci. 2023, 35, 757–774. [CrossRef]
33. Pang, Y.; Zhou, X.; Zhang, J.; Sun, Q.; Zheng, J. Hierarchical electricity time series prediction with cluster analysis and sparse
penalty. Pattern Recognit. 2022, 126, 108555. [CrossRef]
34. Zonta, T.; Da Costa, C.A.; da Rosa Righi, R.; de Lima, M.J.; da Trindade, E.S.; Li, G.P. Predictive maintenance in the Industry 4.0:
A systematic literature review. Comput. Ind. Eng. 2020, 150, 106889. [CrossRef]
35. Huang, S.-Y.; An, W.-J.; Zhang, D.-S.; Zhou, N.-R. Image classification and adversarial robustness analysis based on hybrid
quantum–classical convolutional neural network. Opt. Commun. 2023, 533, 129287. [CrossRef]
36. Li, Y.; Hao, Z.; Lei, H. Survey of convolutional neural network. J. Comput. Appl. 2016, 36, 2508.
37. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE
Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [CrossRef] [PubMed]
38. Bueno-Barrachina, J.-M.; Ye-Lin, Y.; Nieto-Del-Amor, F.; Fuster-Roig, V. Inception 1D-convolutional neural network for accurate
prediction of electrical insulator leakage current from environmental data during its normal operation using long-term recording.
Eng. Appl. Artif. Intell. 2023, 119, 105799. [CrossRef]
39. Guo, Y.; Zhou, Y.; Zhang, Z. Fault diagnosis of multi-channel data by the CNN with the multilinear principal component analysis.
Measurement 2021, 171, 108513. [CrossRef]
Sensors 2024, 24, 3215 24 of 25
40. Fernandes, M.; Corchado, J.M.; Marreiros, G. Machine learning techniques applied to mechanical fault diagnosis and fault
prognosis in the context of real industrial manufacturing use-cases: A systematic literature review. Appl. Intell. 2022, 52,
14246–14280. [CrossRef] [PubMed]
41. Rout, A.K.; Dash, P.; Dash, R.; Bisoi, R. Forecasting financial time series using a low complexity recurrent neural network and
evolutionary learning approach. J. King Saud Univ.-Comput. Inf. Sci. 2017, 29, 536–552. [CrossRef]
42. Zhang, J.; Wang, P.; Yan, R.; Gao, R.X. Deep learning for improved system remaining life prediction. Procedia CIRP 2018, 72,
1033–1038. [CrossRef]
43. Malhi, A.; Yan, R.; Gao, R.X. Prognosis of defect propagation based on recurrent neural networks. IEEE Trans. Instrum. Meas.
2011, 60, 703–711. [CrossRef]
44. Wang, Y.; Zhao, Y.; Addepalli, S. Remaining useful life prediction using deep learning approaches: A review. Procedia Manuf.
2020, 49, 81–88. [CrossRef]
45. Gao, S.; Huang, Y.; Zhang, S.; Han, J.; Wang, G.; Zhang, M.; Lin, Q. Short-term runoff prediction with GRU and LSTM networks
without requiring time step optimization during sample generation. J. Hydrol. 2020, 589, 125188. [CrossRef]
46. Yan, H.; Qin, Y.; Xiang, S.; Wang, Y.; Chen, H. Long-term gear life prediction based on ordered neurons LSTM neural networks.
Measurement 2020, 165, 108205. [CrossRef]
47. Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Comput. 2000, 12, 2451–2471.
[CrossRef] [PubMed]
48. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
49. Abhaya, A.; Patra, B.K. An efficient method for autoencoder based outlier detection. Expert Syst. Appl. 2023, 213, 118904.
[CrossRef]
50. Zhou, C.; Paffenroth, R.C. Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017;
pp. 665–674.
51. Liao, W.; Guo, Y.; Chen, X.; Li, P. A unified unsupervised gaussian mixture variational autoencoder for high dimensional outlier
detection. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December
2018; IEEE: New York, NY, USA, 2018; pp. 1208–1217.
52. Jeon, S.; Kang, J.; Kim, J.; Cha, H. Detecting structural anomalies of quadcopter UAVs based on LSTM autoencoder. Pervasive Mob.
Comput. 2022, 88, 101736. [CrossRef]
53. Dou, T.; Clasie, B.; Depauw, N.; Shen, T.; Brett, R.; Lu, H.-M.; Flanz, J.B.; Jee, K.-W. A deep LSTM autoencoder-based framework
for predictive maintenance of a proton radiotherapy delivery system. Artif. Intell. Med. 2022, 132, 102387. [CrossRef] [PubMed]
54. Bampoula, X.; Siaterlis, G.; Nikolakis, N.; Alexopoulos, K. A deep learning model for predictive maintenance in cyber-physical
production systems using lstm autoencoders. Sensors 2021, 21, 972. [CrossRef] [PubMed]
55. Sagheer, A.; Kotb, M. Unsupervised pre-training of a deep LSTM-based stacked autoencoder for multivariate time series
forecasting problems. Sci. Rep. 2019, 9, 19038. [CrossRef]
56. Mo, Y.; Wu, Q.; Li, X.; Huang, B. Remaining useful life estimation via transformer encoder enhanced by a gated convolutional
unit. J. Intell. Manuf. 2021, 32, 1997–2006. [CrossRef]
57. Hao, J.; Wang, X.; Yang, B.; Wang, L.; Zhang, J.; Tu, Z. Modeling recurrence for transformer. arXiv 2019, arXiv:1904.03092.
58. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Łukasz, K.; Illia, P. Attention is all you need. arXiv
2017, arXiv:1706.03762.
59. Ntakouris, T. Timeseries Classification with a Transformer Model. Keras, 2021. Available online: https://keras.io/examples/
timeseries/timeseries_classification_transformer/ (accessed on 10 January 2024).
60. Bergen, L.; O’Donnell, T.; Bahdanau, D. Systematic generalization with edge transformers. Adv. Neural Inf. Process. Syst. 2021, 34,
1390–1402.
61. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
62. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Yanqi, Z.; Wei, L.; Liu, P.J. Exploring the limits of transfer
learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
63. Chen, D.; Hong, W.; Zhou, X. Transformer network for remaining useful life prediction of lithium-ion batteries. IEEE Access 2022,
10, 19621–19628. [CrossRef]
64. Huertas-García, Á.; Martín, A.; Huertas-Tato, J.; Camacho, D. Exploring Dimensionality Reduction Techniques in Multilingual
Transformers. Cogn. Comput. 2023, 15, 590–612. [CrossRef] [PubMed]
65. Hu, W.; Zhao, S. Remaining useful life prediction of lithium-ion batteries based on wavelet denoising and transformer neural
network. Front. Energy Res. 2022, 10, 1134. [CrossRef]
66. Joseph, V.R. Optimal ratio for data splitting. Stat. Anal. Data Min. ASA Data Sci. J. 2022, 15, 531–538. [CrossRef]
67. Python Language Reference, Version 3.7. Available online: https://docs.python.org/3.7/reference/ (accessed on 29 January 2021).
68. Al-Taie, M.Z.; Kadry, S.; Lucas, J.P. Online data preprocessing: A case study approach. Int. J. Electr. Comput. Eng. 2019, 9, 2620.
[CrossRef]
Sensors 2024, 24, 3215 25 of 25
69. Spuzic, S.; Strafford, K.N.; Subramanian, C.; Savage, G. Wear of hot rolling mill rolls: An overview. Wear 1994, 176, 261–271.
[CrossRef]
70. Spuzic, S.; Strafford, K.; Subramanian, C.; Savage, G. Low complexity autoencoder based end-to-end learning of coded communi-
cations systems. In Proceedings of the 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring), Antwerp, Belgium,
25–28 May 2020; IEEE: New York, NY, USA, 2020; pp. 1–7.
71. Simoulin, A.; Crabbé, B. How many layers and why? An analysis of the model depth in transformers. In Proceedings of the
59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing: Student Research Workshop, Bangkok, Thailand, 1–6 August 2021; pp. 221–228.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.


Sensors 2024, 24, 3215. https://doi.org/10.3390/s24103215 https://www.mdpi.com/journal/sensors
