Log-based Anomaly Detection Using Large Language Models
Abstract—Software systems often record important runtime information in logs to help with troubleshooting. Log-based anomaly detection has become a key research area that aims

typically employ sequential deep learning models such as LSTM [23] and transformers [24]. These methods can be further divided into reconstruction-based methods [8]–[15] and
arXiv:2411.08561v1 [cs.SE] 13 Nov 2024
large window size for grouping log messages. Egersdoerfer et al. [30] address this issue by maintaining a summary-based memory, which summarizes the previous log messages, eliminating the need to input the entire log sequence for anomaly detection. RAGLog [31] uses a retrieval augmented generative (RAG) framework [46] to analyze log entries by querying its store of samples of normal log entries. They design prompt templates for LLMs to determine whether a queried log entry is normal or abnormal. Prompt engineering-based methods often struggle to customize solutions for specific datasets, which can lead to suboptimal detection performance in particular datasets.

Fine-tuning-based methods [3], [32]–[40] incorporate LLMs into deep neural networks and customize them to the user's own dataset. Some methods [32]–[35], although adopting transformer encoder-based LLMs for anomaly detection, do not capture the semantic information within log sequences. For example, LogBERT [32] and LAnoBERT [33] utilize BERT to reconstruct the input sequence of log template IDs (IDs of log string templates) and detect anomalies based on reconstruction errors, disregarding the semantic information. Other methods [3], [36]–[39] use transformer encoder-based LLMs solely for extracting semantic information from log messages and then employ either smaller models [3], [36]–[38] or distance-based comparison [39] for classification. For instance, NeuralLog [3] leverages BERT to extract semantic vectors from raw log messages, which are subsequently used to detect anomalies via a transformer-based classification model. Similarly, RAPID [39] utilizes transformer encoder-based models to extract semantic vectors and performs anomaly detection by comparing each query log sequence with its nearest document log sequence. Hadadi et al. [40] directly input template sequences parsed from log sequences into an LLM and fine-tune it to accurately predict sequence labels. However, this method is impractical because each template may be processed into multiple tokens by the LLM's tokenizer, and a single template sequence can contain numerous log templates. Consequently, an excessive number of tokens can be generated for one template sequence, which LLMs often cannot process due to token (memory) limitations [41].

LogLLM is a fine-tuning-based method that utilizes BERT for extracting semantic vectors from log messages and Llama, a transformer decoder-based model, for log sequence classification. This method aligns the vector representation spaces of BERT and Llama using a projector. By adopting BERT, LogLLM effectively mitigates the out-of-memory issue caused by excessive tokens when directly tokenizing the entire log sequence with Llama's tokenizer. Compared to other methods, LogLLM fully exploits the capabilities of LLMs for log-based anomaly detection.

[Fig. 2: HDFS system log grouped by session window into Log Sequence #1 and Log Sequence #2.]

[Fig. 3: Fixed window. BGL system log grouped with a window size of 2 and a step size of 2.]

III. PRELIMINARIES

To establish the groundwork for subsequent sections, we introduce the system log, which records the system's events and internal states during runtime. A system log contains a list of log messages in chronological order.

Fig. 1 presents a snippet of a raw system log generated by the BGL (the BlueGene/L supercomputer system), with
each log message ordered according to the recorded time. These raw log messages are semi-structured texts consisting of a header and content. The header, determined by the logging framework, includes information such as timestamp, verbosity level (e.g., WARN/INFO), and component [47]. The log content comprises a constant part (keywords that reveal the log template) and a variable part (parameters that carry dynamic runtime information). In this paper, we focus solely on the content of each log message.

The log messages can be grouped into log sequences (i.e., series of log messages that record specific execution flows) based on sessions or fixed/sliding windows [48]. Session window partitioning groups log messages according to their session IDs, thereby generating sequences that include the log messages within each session. For example, Fig. 2 illustrates the HDFS [49] logs undergoing the session window grouping process, where the block_id serves as the session ID. In contrast, fixed/sliding window partitioning groups log messages based on a fixed size (i.e., window size), which can be defined by either the time span or the number of log messages. This method creates sequences that capture snapshots of system log messages over time. For example, Fig. 3 illustrates the BGL [50] logs undergoing the fixed window grouping process, with a window size of 2 messages and a step size of 2 messages.

The objective of log-based anomaly detection is to identify anomalous log sequences, facilitating the recognition of potential issues within the system's operational behavior.

IV. METHODOLOGY

In this section, we present our innovative anomaly detection framework, LogLLM. As illustrated in Fig. 4, the log sequence undergoes preprocessing using regular expressions before being fed into a deep neural network that integrates BERT [43], a projector, and Llama 3 [26] for log sequence classification. In the following sections, we will provide detailed insights into log sequence preprocessing, the architecture of the deep model, and the model training procedure.

Fig. 4: The framework of LogLLM. Notably, the model includes only one instance of BERT and one projector.

A. Preprocessing

The content of a log message includes variable parameters that carry dynamic runtime information. As demonstrated in Section V-F, these parameters are largely irrelevant to anomalies and complicate deep model training, so a technique is needed to identify them and replace them with a constant token. Log parsers, such as Drain [51] and Spell [52], are widely adopted in log-based anomaly detection methods and appear to be a useful technique. However, as noted by Le et al. [3], existing log parsers do not always perform correctly on all log datasets and struggle to handle out-of-vocabulary (OOV) words in new log messages, resulting in a loss of semantic information. When logs are unstable, these parsers become increasingly ineffective over time, making it difficult to support subsequent anomaly detection.

Thanks to the structured log generation process, the textual format of parameters representing specific objects can be easily identified using regular expressions [53]. Consequently, we replace each variable parameter, such as account, directory path, and IP address, with '<*>'. Despite its simplicity, this technique offers significant performance advantages. Compared with log parsers, this preprocessing technique is more effective and does not require training.

B. Model Architecture

As shown in Fig. 4, our deep model consists of three main components: BERT, a projector, and Llama. Both BERT and Llama are pretrained LLMs. BERT is utilized to extract vector representations of log messages, while Llama is employed to classify the log sequences. The projector serves as a bridge, aligning the vector representation spaces of BERT and Llama. It is important to note that our model incorporates only one instance of BERT and one projector.
1) BERT: BERT generates a semantic vector by passing the output vector of the classification token ([CLS]) through a linear layer followed by a tanh activation function. Each log message, once preprocessed, is encoded into a semantic vector using the BERT tokenizer and BERT model. For a preprocessed log sequence, the output of BERT is a sequence of semantic vectors $C = (c_1, c_2, \ldots, c_N) \in \mathbb{R}^{N \times d_{\mathrm{BERT}}}$, where $N$ represents the length of the log sequence (i.e., the number of log messages) and $d_{\mathrm{BERT}}$ is the dimension of each semantic vector (i.e., hidden size). LogLLM utilizes the BERT base model¹, which consists of 12 transformer encoder layers with 768 hidden units each. Therefore, $d_{\mathrm{BERT}}$ is 768.

2) Projector: The projector is a linear layer that maps the semantic vectors $C \in \mathbb{R}^{N \times d_{\mathrm{BERT}}}$ to the token embedding vectors accepted by Llama, represented as $E = (e_1, e_2, \ldots, e_N) \in \mathbb{R}^{N \times d_{\mathrm{Llama}}}$, where $d_{\mathrm{Llama}}$ is the hidden size of Llama. The projector is designed to align the vector representation spaces of BERT and Llama.

3) Llama: To conduct prompt tuning on Llama, the transformer decoder-based LLM, we generate corresponding textual queries based on embedded log sequences. Specifically, each query consists of three components.

The first component introduces the log sequence, such as "Below is a sequence of system log messages:". The second component comprises the token embeddings $E$ output by the projector. The third component queries whether the sequence is anomalous, asking, for instance, ". Is this sequence normal or anomalous?". The first and third components are fed into the Llama tokenizer and Llama embedding layer sequentially, producing $E_1 \in \mathbb{R}^{A \times d_{\mathrm{Llama}}}$ and $E_3 \in \mathbb{R}^{Q \times d_{\mathrm{Llama}}}$, where $A$ and $Q$ are the number of tokens produced by tokenizing the first and third components, respectively. Then, the token embeddings of the three components are concatenated, represented as $[E_1 \| E \| E_3] \in \mathbb{R}^{(A+N+Q) \times d_{\mathrm{Llama}}}$, and fed into Llama. We utilize Llama 3 8B² for this process, where $d_{\mathrm{Llama}}$ is 4096.

¹ https://huggingface.co/google-bert/bert-base-uncased
² https://huggingface.co/meta-llama/Meta-Llama-3-8B

C. Training

1) Minority Class Oversampling: LogLLM is a supervised anomaly detection method, which means it needs labeled normal and anomalous samples for training. However, supervised anomaly detection methods often face the challenge of data imbalance, which can lead to biased model training. In an anomaly detection task, there are only two classes: normal and anomalous, and the number of instances in each class is uncertain. To cope with data imbalance, we oversample the class with fewer samples, ensuring that the proportion of the minority class is no less than β. Formally, let the proportion of the minority class be α, with α < β, and let the total number of samples be Sample_num. To achieve a proportion of β for the minority class, it will be oversampled to the following quantity:

$$\frac{\beta(1-\alpha)}{1-\beta} \times \mathrm{Sample\_num} \tag{1}$$

This adjustment will make the proportion of the minority class equal to β.

2) Training Objective: Our objective is to train the deep model to predict whether a given log sequence is normal or anomalous. We fine-tune the model to respond appropriately: if the sequence is anomalous, it outputs 'The sequence is anomalous'; if normal, it outputs 'The sequence is normal'. We utilize cross-entropy loss [54] as our loss function.

3) Training Procedure: To train our deep model, we follow three main stages.

Stage 1. Fine-tuning Llama to capture the answer template: The first stage involves fine-tuning Llama to capture the answer template. Specifically, we train Llama to respond to the prompt 'Is this sequence normal or anomalous?' with 'The sequence is anomalous/normal'. This stage requires only a few data samples.

Stage 2. Training the embedder of log messages: The second stage involves training the embedder of log messages, specifically BERT and the projector. This stage aims to project each log message to the embedding of the most suitable token in Llama, enabling Llama to discern whether the given log sequence is normal or anomalous.

Stage 3. Fine-tuning the entire model: Finally, we fine-tune the entire model to ensure cohesive and accurate performance across all components.

4) Efficient Fine-Tuning on LLMs: To reduce the costs involved in fine-tuning LLMs (BERT and Llama) with a substantial number of parameters, we utilize QLoRA [55] to minimize memory usage. QLoRA accomplishes this by backpropagating gradients through a frozen 4-bit quantized model into low-rank adapters, while maintaining the performance levels achieved during the full 16-bit fine-tuning process.

V. EXPERIMENTS

In this section, we perform empirical assessments of LogLLM's performance on four real-world log datasets. LogLLM is coded in Python, and the source code is available at https://github.com/guanwei49/LogLLM.

A. Benchmark Methods

To verify the superiority of the proposed method, we compare LogLLM with five state-of-the-art semi-supervised methods: DeepLog [8], LogAnomaly [9], PLELog [22], FastLogAD [34], and LogBERT [32]. We also compare it with three supervised methods: LogRobust [19], CNN [18], and NeuralLog [3], and one method that does not require training a deep model but needs some normal samples for retrieval: RAPID [39].

Notably, FastLogAD, LogBERT, NeuralLog, and RAPID adopt LLMs for anomaly detection.
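Returning briefly to the model input described in Section IV-B, the shape bookkeeping of the concatenation [E1 || E || E3] can be sketched in plain Python. This is only an illustration under stated assumptions: random rows stand in for the outputs of BERT and of Llama's embedding layer, the projector weights are random rather than trained, the dimensions are toy values (the paper uses 768 and 4096), and the counts N, A, and Q are hypothetical.

```python
import random

# Toy sizes so the sketch runs quickly; the paper uses d_BERT = 768 and d_Llama = 4096.
D_BERT, D_LLAMA = 4, 6


def make_projector(d_in, d_out, seed=0):
    """Return a linear map x -> xW + b with random W (stand-in for a trained layer)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]
    bias = [0.0] * d_out

    def project(vectors):
        return [
            [sum(x[i] * weight[i][j] for i in range(d_in)) + bias[j] for j in range(d_out)]
            for x in vectors
        ]

    return project


def random_rows(n_rows, dim, rng):
    """Placeholder for embeddings that BERT or Llama's embedding layer would produce."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_rows)]


rng = random.Random(42)
N, A, Q = 5, 9, 8                   # hypothetical: log messages, intro tokens, question tokens
C = random_rows(N, D_BERT, rng)     # one semantic vector per log message (N x d_BERT)
E1 = random_rows(A, D_LLAMA, rng)   # embedded introduction text (A x d_Llama)
E3 = random_rows(Q, D_LLAMA, rng)   # embedded question text (Q x d_Llama)

E = make_projector(D_BERT, D_LLAMA)(C)   # projected log messages (N x d_Llama)
llama_input = E1 + E + E3                # [E1 || E || E3], (A + N + Q) x d_Llama

print(len(llama_input), len(llama_input[0]))
```

The point of the sketch is that each log message occupies exactly one position in Llama's input, however many tokens its text would produce, which is how the design keeps the sequence length at A + N + Q.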
TABLE I: The statistics of datasets used in the experiments.

                                              Training Data                                Testing Data
Dataset      # Log messages  # Log sequences  # Log sequences  # Anomalies  Anomaly ratio  # Log sequences  # Anomalies  Anomaly ratio
HDFS             11,175,629          575,061          460,048       13,497          2.93%          115,013        3,341          2.90%
BGL               4,747,963           47,135           37,708        4,009         10.63%            9,427          817          8.67%
Liberty           5,000,000           50,000           40,000       34,144         85.36%           10,000          651          6.51%
Thunderbird      10,000,000           99,997           79,997          837          1.05%           20,000           29          0.15%
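As a worked example of Eq. (1) in Section IV-C1, the training statistics in Table I can be plugged into the oversampling formula. The sketch below assumes β = 30% purely for illustration, and the function name is hypothetical; it is not taken from the LogLLM source code.

```python
def oversampled_minority_size(minority, majority, beta):
    """Minority-class size after oversampling, following Eq. (1)."""
    sample_num = minority + majority
    alpha = minority / sample_num
    if alpha >= beta:  # already at or above the target proportion: nothing to do
        return minority
    return round(beta * (1 - alpha) / (1 - beta) * sample_num)


# (training log sequences, anomalies) taken from Table I
TRAINING_SETS = {
    "HDFS": (460_048, 13_497),
    "BGL": (37_708, 4_009),
    "Liberty": (40_000, 34_144),
    "Thunderbird": (79_997, 837),
}
BETA = 0.30  # illustrative choice, not a value prescribed here

for name, (total, anomalies) in TRAINING_SETS.items():
    normal = total - anomalies
    # The minority class is the anomalies everywhere except Liberty.
    minority, majority = min(anomalies, normal), max(anomalies, normal)
    target = oversampled_minority_size(minority, majority, BETA)
    share = target / (target + majority)
    print(f"{name}: minority {minority} -> {target} (proportion {share:.3f})")
```

For Thunderbird, for instance, the 837 anomalous sequences would be oversampled to 33,926, so that they make up 30% of the 113,086 training sequences. For Liberty the roles are reversed, and it is the 5,856 normal sequences that are oversampled.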
The proposed LogLLM achieves the highest F1-score across all datasets. On average, LogLLM's F1-scores are 6.6% better than the best existing method, NeuralLog, demonstrating its effectiveness in log-based anomaly detection. Despite the adoption of LLMs in FastLogAD, LogBERT, NeuralLog, and RAPID for anomaly detection, their performance remains unsatisfactory. FastLogAD and LogBERT utilize BERT, a transformer encoder-based model, for detecting anomalies based on log sequence reconstruction errors. Their inputs consist of sequences of log template IDs (IDs of log string templates) extracted from log messages via log parsers, lacking semantic information. In contrast, NeuralLog and RAPID utilize transformer encoder-based models to extract semantic vectors from log messages. However, NeuralLog employs smaller models, while RAPID uses distance-based comparison for classification. LogLLM, on the other hand, leverages both BERT for extracting semantic vectors and Llama, a transformer decoder-based model, for anomaly detection. The representation spaces of BERT and Llama are aligned via a projector, fully harnessing the potential of LLMs for log-based anomaly detection.

Moreover, LogLLM achieves a balance between precision and recall, indicating that it maintains low false alarm rates and misses few anomalies. In contrast, methods like FastLogAD are excessively sensitive to anomalies, often resulting in numerous false alarms. For example, on the BGL dataset, despite FastLogAD having a recall of 1, it only achieves a precision of 0.167, making it impractical for real-world use. DeepLog, LogAnomaly, and LogBERT exhibit similar issues. On the other hand, RAPID is not sensitive enough to anomalies, leading to many undetected anomalies. For instance, on the BGL dataset, RAPID achieves a precision of 0.874 but a recall of only 0.399.

Effect of labeled anomalies: As illustrated in Table II, in contrast to methods such as DeepLog, LogAnomaly, FastLogAD, LogBERT, and RAPID, which require clean datasets devoid of anomalies to build anomaly detection models, methods like PLELog, LogRobust, CNN, NeuralLog, and LogLLM demonstrate superior performance. These models are trained using not only normal samples but also labeled anomalies. For instance, these five methods achieve an average F1-score above 0.771 across four datasets, whereas others that do not utilize labeled anomalies perform poorly, with an average F1-score below 0.602. This demonstrates that incorporating labeled anomalies can provide a significant advantage to anomaly detection methods.

Computational cost: The time consumption of each method is presented in Table III. These results have been averaged across all the datasets.

TABLE III: Computational cost.

Method       Training time (minutes)  Testing time (minutes)
DeepLog                        72.17                    3.42
LogAnomaly                    156.16                    7.25
PLELog                        315.47                   33.59
LogRobust                     108.42                    2.48
CNN                            98.16                    2.16
FastLogAD                     254.17                    0.29
LogBERT                       429.04                   43.77
NeuralLog                     267.46                   21.44
RAPID                          63.98                   38.43
LogLLM                      1,065.15                   64.48

Although RAPID does not require training a deep model, the extraction and retrieval of vector representations remain time-consuming. In comparison to other methods, FastLogAD requires relatively high training time, but it has the shortest testing time because it uses only the discriminator of the model during testing. As anticipated, while our proposed LogLLM demonstrates the best performance, it also incurs the highest computational cost due to its large number of parameters. However, the testing time of LogLLM remains acceptable when compared to other methods that utilize LLMs, such as LogBERT, NeuralLog, and RAPID.

F. Effects of Different Preprocessing Techniques

We evaluate the effectiveness of the different preprocessing techniques. The results are shown in Table IV. In this table, 'Raw' indicates that the content of log messages is not preprocessed and is directly input into the proposed deep model. 'Template' indicates that sequences of log templates produced by Drain [51], a log parser, are used as input for the proposed deep model. 'Template ID' signifies that the IDs of log templates, obtained by Drain, are simply encoded into numeric vectors using an embedding layer instead of BERT. The preprocessing technique 'Template ID' renders
the model unable to capture the semantic information within log messages. Notably, the parser Drain is applied to the entire dataset, rather than only the training dataset, to avoid performance degradation due to the OOV problem. 'RE' indicates that regular expressions, as introduced in Section IV-A, are used for preprocessing log messages.

TABLE IV: Effects of different preprocessing techniques on HDFS, BGL, Liberty, and Thunderbird datasets.

                HDFS                 BGL                  Liberty              Thunderbird          Avg. F1
                Prec.  Rec.   F1     Prec.  Rec.   F1     Prec.  Rec.   F1     Prec.  Rec.   F1
Raw             0.994  0.991  0.993  0.943  0.767  0.846  0.911  0.908  0.909  0.806  0.862  0.833  0.895
Template ID     0.995  0.945  0.969  0.775  0.286  0.418  0.994  0.270  0.425  1.000  0.379  0.550  0.591
Template        0.991  1.000  0.995  0.861  0.919  0.889  0.968  0.931  0.949  0.950  0.655  0.776  0.902
RE (LogLLM)     0.994  1.000  0.997  0.861  0.979  0.916  0.992  0.926  0.958  0.966  0.966  0.966  0.959

As anticipated, the preprocessing technique 'RE' yields the highest F1-score across all datasets. Conversely, the preprocessing technique 'Template ID' consistently results in the lowest F1-score across all datasets, with an average F1-score 36.8 percentage points lower than that of 'RE' (0.591 versus 0.959). This can be attributed to the fact that 'Template ID' hinders the model's ability to capture the semantic information within log messages, thereby impairing its capability to detect anomalies from a natural language perspective. The preprocessing techniques 'Raw' and 'Template' result in relatively good performance, but their average F1-scores are still 6.4 and 5.7 percentage points lower than that of 'RE', respectively. For the preprocessing technique 'Raw', the variable parts (parameters that carry dynamic runtime information) within the content of each log message have little influence on anomaly detection. However, due to their high randomness, they can confuse the model, making it difficult to discern anomalies. For the preprocessing technique 'Template', the parser is not always reliable, sometimes incorrectly removing the constant parts or retaining the variable parts, which can lead to information loss or confusion for the model, making it difficult to discern anomalies.

G. Ablation Study of the Training Procedure

We investigate the effect of each training stage through an ablation study. The results are presented in Table V, where 'W/O' denotes 'without'.

TABLE V: Ablation study of the training procedure on HDFS, BGL, Liberty, and Thunderbird datasets.

                HDFS                 BGL                  Liberty              Thunderbird          Avg. F1
                Prec.  Rec.   F1     Prec.  Rec.   F1     Prec.  Rec.   F1     Prec.  Rec.   F1
W/O Stage 1     0.991  1.000  0.995  0.578  0.971  0.725  0.685  0.290  0.408  0.381  0.828  0.522  0.662
W/O Stage 2     0.994  1.000  0.997  0.858  0.920  0.888  0.995  0.906  0.949  0.848  0.966  0.903  0.934
W/O Stage 1&2   0.992  1.000  0.996  0.853  0.882  0.868  0.995  0.906  0.949  0.897  0.897  0.897  0.927
W/O Stage 3     0.993  0.999  0.996  0.704  0.776  0.738  1.000  0.684  0.812  0.958  0.793  0.868  0.854
LogLLM          0.994  1.000  0.997  0.861  0.979  0.916  0.992  0.926  0.958  0.966  0.966  0.966  0.959

We have the following observations: Skipping any training stage lowers the average F1-score, and no ablated variant exceeds the full procedure on any dataset, demonstrating the effectiveness of our three-stage training procedure. Training without stage 1 leads to the worst performance, with the F1-score averaged across all datasets decreasing by as much as 29.7 percentage points (from 0.959 to 0.662). This demonstrates that fine-tuning Llama to capture the answer template (Stage 1) is essential before training the embedder of log messages (Stage 2). Without this stage, the embedder may be misdirected, resulting in incorrect semantic capture of log messages and model failure. Training without stage 3 yields relatively poor performance, with an average F1-score decrease of 10.5 percentage points. This indicates that sequentially fine-tuning Llama and training the embedder alone is insufficient for the model to capture anomalous patterns; cohesive fine-tuning of the entire model is essential. Training without stage 2, or without stages 1 and 2 (i.e., adopting only training stage 3: fine-tuning the entire model), results in acceptable performance, with average F1-score decreases of 2.5 and 3.2 percentage points, respectively. This demonstrates that individually training the embedder (BERT and projector) before fine-tuning the entire model can also enhance performance. This stage allows the embedder to generate better semantic vectors of log messages for Llama to discern anomalies.

In summary, our proposed three-stage training procedure is well-suited for our deep model in log-based anomaly detection.

H. Impact of Minority Class Oversampling

Note that normal and anomalous samples in the training dataset are imbalanced, as shown in Table I. For the HDFS, BGL, and Thunderbird datasets, normal samples outnumber anomalous samples. Conversely, in the Liberty dataset, anomalous samples exceed normal samples. The hyper-parameter β controls the proportion of the minority class by oversampling to address the data imbalance problem, as described in Section IV-C1. In this section, we investigate the impact of β by varying its value. Fig. 5 illustrates the performance of LogLLM on the four datasets under different magnitudes of β. When β = 0, the samples are not oversampled; instead, the original datasets are utilized directly for training.

As illustrated in Fig. 5b, for the HDFS, BGL, and Thunderbird datasets, the recall always increases, while for the Liberty dataset, recall decreases as β increases. This can be attributed to the fact that for the HDFS, BGL, and Thunderbird datasets, when β increases, anomalies are oversampled, making the model more prone to identifying samples as anomalies. In
contrast, for the Liberty dataset, when β increases, normal samples are oversampled, making the model more prone to identifying samples as normal.

[Fig. 5: (a) Precision, (b) Recall, (c) F1-score, and (d) Training time (minutes) versus β (%) on the HDFS, BGL, Liberty, and Thunderbird datasets.]

As illustrated in Fig. 5c, the F1-score follows essentially the same trend on all datasets: it first increases and then decreases as β increases. However, LogLLM does not appear to be sensitive to β; when β is between 10% and 80%, the variation in the F1-score is no more than 0.07. Thanks to the substantial semantic knowledge embedded in LLMs, a trained model can effectively learn anomalous patterns and detect anomalies, even when the minority class constitutes only 10% of the dataset. In comparison to the BGL and Thunderbird datasets, the F1-score for the HDFS and Liberty datasets shows minimal variation with respect to β. This consistency can be attributed to the clear anomalous patterns present, allowing LogLLM to be easily trained to detect anomalies across various β values. However, LogLLM appears unable to effectively handle extremely imbalanced scenarios. For instance, in the Thunderbird dataset without oversampling (β = 0), anomalies constitute only 1.05% of the samples, causing the trained model to be biased and classify all samples as normal. As a result, precision, recall, and F1-score are all equal to 0. Consequently, minority class oversampling is sometimes essential.

As anticipated, as β increases, the training time also increases, as shown in Fig. 5d. This relationship arises because a higher β leads to more oversampled data samples, as indicated by equation (1), thereby enlarging the training dataset.

To summarize, minority class oversampling is essential; however, the value of the hyperparameter β does not significantly impact the performance of LogLLM, making careful selection unnecessary. Moreover, excessively large values of β are undesirable, as they result in prolonged training times. Values between 30% and 50% are deemed acceptable.

VI. CONCLUSION

In this paper, we propose LogLLM, a novel log-based anomaly detection framework that leverages LLMs. LogLLM employs both transformer encoder-based and decoder-based LLMs, specifically BERT and Llama, for log-based anomaly detection. BERT is utilized to extract semantic vectors from log messages, while Llama is used to classify log sequences. To ensure coherence in log semantics, we introduce a projector that aligns the vector representation spaces of BERT and Llama. LogLLM is trained using an innovative three-stage procedure designed to enhance both performance and adaptability. Extensive experiments conducted on four public real-world datasets demonstrate that LogLLM achieves remarkable
performance. Subsequent ablation studies further confirm the effectiveness of our three-stage training procedure.

REFERENCES

[1] R. S. Kazemzadeh and H.-A. Jacobsen, “Reliable and highly available distributed publish/subscribe service,” in 2009 28th IEEE International Symposium on Reliable Distributed Systems. IEEE, 2009, pp. 41–50.
[2] E. Bauer and R. Adams, Reliability and availability of cloud computing. John Wiley & Sons, 2012.
[3] V.-H. Le and H. Zhang, “Log-based anomaly detection without log parsing,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 492–504.
[4] W. Guan, J. Cao, H. Zhao, Y. Gu, and S. Qian, “Survey and benchmark of anomaly detection in business processes,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–23, 2024.
[5] S. Zhang, Y. Ji, J. Luan, X. Nie, Z. Chen, M. Ma, Y. Sun, and D. Pei, “End-to-end automl for unsupervised log anomaly detection,” Automated Software Engineering (ASE’24), 2024.
[6] V.-H. Le and H. Zhang, “Log-based anomaly detection with deep learning: How far are we?” in Proceedings of the 44th international conference on software engineering, 2022, pp. 1356–1367.
[7] J. Qi, S. Huang, Z. Luan, S. Yang, C. Fung, H. Yang, D. Qian, J. Shang, Z. Xiao, and Z. Wu, “Loggpt: Exploring chatgpt for log-based anomaly detection,” in 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys). IEEE, 2023, pp. 273–280.
[8] M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, 2017, pp. 1285–1298.
[9] W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang, S. Tao, P. Sun et al., “Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs.” in IJCAI, vol. 19, no. 7, 2019, pp. 4739–4745.
[10] L. Zhang, W. Li, Z. Zhang, Q. Lu, C. Hou, P. Hu, T. Gui, and S. Lu, “Logattn: Unsupervised log anomaly detection with an autoencoder based attention mechanism,” in International conference on knowledge science, engineering and management. Springer, 2021, pp. 222–235.
[11] M. Catillo, A. Pecchia, and U. Villano, “Autolog: Anomaly detection by deep autoencoding of system logs,” Expert Systems with Applications, vol. 191, p. 116263, 2022.
[12] Y. Xie and K. Yang, “Log anomaly detection by adversarial autoencoders with graph feature fusion,” IEEE Transactions on Reliability, 2023.
[13] X. Zhang, X. Chai, M. Yu, and D. Qiu, “Anomaly detection model for log based on lstm network and variational autoencoder,” in 2023 4th International Conference on Information Science, Parallel and Distributed Systems (ISPDS). IEEE, 2023, pp. 239–244.
[14] X. Duan, S. Ying, W. Yuan, H. Cheng, and X. Yin, “A generative
[20] Y. Xie, H. Zhang, and M. A. Babar, “Loggd: Detecting anomalies from system logs with graph neural networks,” in 2022 IEEE 22nd International conference on software quality, reliability and security (QRS). IEEE, 2022, pp. 299–310.
[21] Z. Zhao, W. Niu, X. Zhang, R. Zhang, Z. Yu, and C. Huang, “Trine: Syslog anomaly detection with three transformer encoders in one generative adversarial network,” Applied Intelligence, vol. 52, no. 8, pp. 8810–8819, 2022.
[22] L. Yang, J. Chen, Z. Wang, W. Wang, J. Jiang, X. Dong, and W. Zhang, “Semi-supervised log-based anomaly detection via probabilistic label estimation,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 1448–1460.
[23] S. Hochreiter, “Long short-term memory,” Neural Computation MIT-Press, 1997.
[24] A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
[25] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[26] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[27] T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai et al., “Chatglm: A family of large language models from glm-130b to glm-4 all tools,” arXiv preprint arXiv:2406.12793, 2024.
[28] W. Guan, J. Cao, J. Gao, H. Zhao, and S. Qian, “Dabl: Detecting semantic anomalies in business processes using large language models,” arXiv preprint arXiv:2406.15781, 2024.
[29] Y. Liu, S. Tao, W. Meng, F. Yao, X. Zhao, and H. Yang, “Logprompt: Prompt engineering towards zero-shot and interpretable log analysis,” in Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 2024, pp. 364–365.
[30] C. Egersdoerfer, D. Zhang, and D. Dai, “Early exploration of using chatgpt for log-based anomaly detection on parallel file systems logs,” in Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023, pp. 315–316.
[31] J. Pan, W. S. Liang, and Y. Yidi, “Raglog: Log anomaly detection using retrieval augmented generation,” in 2024 IEEE World Forum on Public Safety Technology (WFPST). IEEE, 2024, pp. 169–174.
[32] H. Guo, S. Yuan, and X. Wu, “Logbert: Log anomaly detection via bert,” in 2021 international joint conference on neural networks (IJCNN). IEEE, 2021, pp. 1–8.
[33] Y. Lee, J. Kim, and P. Kang, “Lanobert: System log anomaly detection based on bert masked language model,” Applied Soft Computing, vol. 146, p. 110689, 2023.
[34] Y. Lin, H. Deng, and X. Li, “Fastlogad: Log anomaly detection with mask-guided pseudo anomaly generation and discrimination,” arXiv preprint arXiv:2404.08750, 2024.
[35] C. Almodovar, F. Sabrina, S. Karimi, and S. Azad, “Logfit: Log anomaly
adversarial networks for log anomaly detection.” Comput. Syst. Sci. Eng., detection using fine-tuned language models,” IEEE Transactions on
vol. 37, no. 1, pp. 135–148, 2021. Network and Service Management, 2024.
[15] Z. He, Y. Tang, K. Zhao, J. Liu, and W. Chen, “Graph-based log [36] S. Chen and H. Liao, “Bert-log: Anomaly detection for system logs
anomaly detection via adversarial training,” in International Symposium based on pre-trained language model,” Applied Artificial Intelligence,
on Dependable Software Engineering: Theories, Tools, and Applications. vol. 36, no. 1, p. 2145642, 2022.
Springer, 2023, pp. 55–71. [37] J. L. Adeba, D.-H. Kim, and J. Kwak, “Sarlog: Semantic-aware robust
[16] C. Zhang, X. Wang, H. Zhang, J. Zhang, H. Zhang, C. Liu, and P. Han, log anomaly detection via bert-augmented contrastive learning,” IEEE
“Layerlog: Log sequence anomaly detection based on hierarchical se- Internet of Things Journal, 2024.
mantics,” Applied Soft Computing, vol. 132, p. 109860, 2023. [38] Y. Fu, K. Liang, and J. Xu, “Mlog: Mogrifier lstm-based log anomaly
[17] S. Hashemi and M. Mäntylä, “Onelog: towards end-to-end software log detection approach using semantic representation,” IEEE Transactions
anomaly detection,” Automated Software Engineering, vol. 31, no. 2, on Services Computing, vol. 16, no. 5, pp. 3537–3549, 2023.
p. 37, 2024. [39] G. No, Y. Lee, H. Kang, and P. Kang, “Training-free retrieval-based log
[18] S. Lu, X. Wei, Y. Li, and L. Wang, “Detecting anomaly in big data anomaly detection with pre-trained language model considering token-
system logs using convolutional neural network,” in 2018 IEEE 16th level information,” Engineering Applications of Artificial Intelligence,
Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl vol. 133, p. 108613, 2024.
Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big [40] F. Hadadi, Q. Xu, D. Bianculli, and L. Briand, “Anomaly detection on
Data Intelligence and Computing and Cyber Science and Technology unstable logs with gpt models,” arXiv preprint arXiv:2406.07467, 2024.
Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, 2018, pp. [41] M. Burtsev, M. Reeves, and A. Job, “The working limitations of large
151–158. language models,” MIT Sloan Management Review, vol. 65, no. 2, pp.
[19] X. Zhang, Y. Xu, Q. Lin, B. Qiao, H. Zhang, Y. Dang, C. Xie, 8–10, 2024.
X. Yang, Q. Cheng, Z. Li et al., “Robust log-based anomaly detection on [42] A. Joulin, “Fasttext. zip: Compressing text classification models,” arXiv
unstable log data,” in Proceedings of the 2019 27th ACM joint meeting preprint arXiv:1612.03651, 2016.
on European software engineering conference and symposium on the [43] J. Devlin, “Bert: Pre-training of deep bidirectional transformers for
foundations of software engineering, 2019, pp. 807–817. language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[44] Y. Liu, “Roberta: A robustly optimized bert pretraining approach,” arXiv
preprint arXiv:1907.11692, 2019.
[45] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy,
“Spanbert: Improving pre-training by representing and predicting spans,”
Transactions of the association for computational linguistics, vol. 8, pp.
64–77, 2020.
[46] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal,
H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-
augmented generation for knowledge-intensive nlp tasks,” Advances in
Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[47] J. Zhu, S. He, J. Liu, P. He, Q. Xie, Z. Zheng, and M. R. Lyu, “Tools
and benchmarks for automated log parsing,” in 2019 IEEE/ACM 41st
International Conference on Software Engineering: Software Engineering
in Practice (ICSE-SEIP). IEEE, 2019, pp. 121–130.
[48] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, “An evaluation study on
log parsing and its use in log mining,” in 2016 46th annual IEEE/IFIP
international conference on dependable systems and networks (DSN).
IEEE, 2016, pp. 654–661.
[49] W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan, “Online system
problem detection by mining patterns of console logs,” in 2009 ninth
IEEE international conference on data mining. IEEE, 2009, pp. 588–
597.
[50] A. Oliner and J. Stearley, “What supercomputers say: A study of five
system logs,” in 37th annual IEEE/IFIP international conference on
dependable systems and networks (DSN’07). IEEE, 2007, pp. 575–
584.
[51] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing
approach with fixed depth tree,” in 2017 IEEE international conference
on web services (ICWS). IEEE, 2017, pp. 33–40.
[52] M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in
2016 IEEE 16th International Conference on Data Mining (ICDM).
IEEE, 2016, pp. 859–864.
[53] V.-H. Le and H. Zhang, “Log parsing with prompt-based few-shot
learning,” in 2023 IEEE/ACM 45th International Conference on Software
Engineering (ICSE). IEEE, 2023, pp. 2438–2449.
[54] J. S. Bridle, “Probabilistic interpretation of feedforward classification
network outputs, with relationships to statistical pattern recognition,” in
Neurocomputing: Algorithms, architectures and applications. Springer,
1990, pp. 227–236.
[55] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora:
Efficient finetuning of quantized llms,” Advances in Neural Information
Processing Systems, vol. 36, 2024.
[56] D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[57] J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, “Loghub: A large collection
of system log datasets for ai-driven log analytics,” in 2023 IEEE 34th
International Symposium on Software Reliability Engineering (ISSRE).
IEEE, 2023, pp. 355–366.