Detection of SQL Injection Attack Using Machine Le
ARTICLE INFO

Article History:
Accepted : 27 Nov 2024
Published : 27 Dec 2024

Publication Issue :
Volume 11, Issue 6
November-December-2024

Page Number :
780-790

ABSTRACT

SQL injection attacks (SQLIAs) remain a prevalent threat to web applications, exploiting vulnerabilities in database interactions to compromise data security. Detecting such attacks effectively is crucial for ensuring robust application security. This study investigates the use of machine learning techniques to identify SQLIAs by analyzing patterns and features in SQL queries. A dataset comprising both legitimate and malicious SQL queries is utilized to train and evaluate various machine learning models, including decision trees, support vector machines, and neural networks. The proposed approach achieves high accuracy in distinguishing between benign and malicious queries, showcasing the potential of machine learning for proactive SQLIA detection. The findings highlight the importance of feature selection, algorithm choice, and real-time detection capabilities in mitigating the risk of SQL injection attacks. This research provides a foundation for developing intelligent, automated systems to enhance the security of database-driven applications.
Keywords: SQL Injection, Cross-Site Scripting, Denial of Service Attack, Naïve Bayes, Gradient Boosting, etc.
Copyright © 2024 The Author(s): This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY-NC 4.0)
Bhanu Pratap Singh et al Int J Sci Res Sci & Technol. November-December-2024, 11 (6) : 780-790
International Journal of Scientific Research in Science and Technology (www.ijsrst.com) | Volume 1 1 | Issue 6 781
various machine learning algorithms were tested to detect SQL Injection attacks. In the data pre-processing section, feature extraction was performed using Natural Language Processing techniques. While the relevance of expressions to each other was calculated with the word-level TF-IDF method, term search was also performed [02].

Maha Alghawazi et al. (2022) - This research work noted that SQL injection attacks, which usually occur when attackers modify, delete, read, or copy data from database servers, are among the most damaging web application attacks. A successful SQL injection attack can affect all aspects of security, including confidentiality, integrity, and data availability. SQL (Structured Query Language) is used to represent queries to database management systems. Detection and deterrence of SQL injection attacks, to which techniques from different areas can be applied to improve detectability, is not a new area of research, but it is still relevant. Artificial intelligence and machine learning techniques have been tested and used to control SQL injection attacks, showing promising results. The main contribution of that paper is to cover relevant work on the different machine learning and deep learning models used to detect SQL injection attacks. With this systematic review, the authors aim to keep researchers up to date and to contribute to the understanding of the intersection between SQL injection attacks and the artificial intelligence field [03].

Ravi Raj Choudhary et al. (2021) - This research work considered a web-based component, or web application, that is accessible to the general public over the Internet and is therefore vulnerable to attack by an adversary. It is not uncommon for web and mobile applications to contain careless flaws that adversely affect their security and privacy. Database vulnerability attacks are becoming more common and harmful. It is critical to understand software defects and, more importantly, to prevent these security issues. SQL injection and XSS exploit the same kind of insecure code, which is often found in online applications developed in weakly typed languages. The authors therefore looked for the points at which an attacker may use native features or existing techniques to access or steal data from the database, and examined what kinds of issues arise when an attacker tries to compromise a database by injecting SQL that affects the query results [04].

Binh An Pham et al. (2020) - This research work noted that SQL injection attacks (SQLi attacks) have proven their danger on several types of websites, such as social media and e-shopping sites. To prevent such attacks, the research investigates efficient ways of detection and prevention so that each user's right to privacy can be preserved. It examines different ways to protect websites from SQL injection attacks and uses machine learning algorithms to detect them. Machine learning (ML) algorithms can learn from the data provided and infer interesting results from the dataset. The authors used SQL code and user input as their data, and ML algorithms to detect malicious code [05].

Tareek Pattewar et al. (2019) - This research work noted that SQL injection is a very serious problem for web applications and that finding an efficient solution is essential. Researchers have developed many techniques to detect and prevent this vulnerability, but no single solution prevents all types of SQL injection attacks, which remain one of the top concerns for cyber security researchers. Signature-based SQL injection detection methods are no longer reliable, as attackers use new types of SQL injection each time. There is a need for detection mechanisms that are capable of identifying new, never-before-seen attacks. Applying machine learning to the field of cyber security is being considered by many researchers. Two machine learning classification algorithms are applied to the problem: the Naïve Bayes classifier and the Gradient Boosting classifier. The Naïve Bayes classifier
machine learning model provides results with an accuracy of 92.8%. Ensemble learning methods are said to provide results with better accuracy, as they combine multiple simple classifiers to reduce error and improve accuracy [06].

III. PROPOSED METHOD

The goal of this project is to build a model that can detect SQL Injection (SQLi) attacks in SQL queries. SQLi attacks exploit vulnerabilities in database query processing, leading to unauthorized access and data breaches. This model will classify SQL queries as either malicious (SQLi) or benign (safe), helping to prevent security vulnerabilities in web applications.

A. Dataset Description
The dataset used for this project consists of SQL queries labeled as:
Malicious (SQLi): Labeled as 1, representing a harmful SQL Injection attempt.
Benign (Safe): Labeled as 0, representing a normal, safe SQL query.
Each row in the dataset represents a single SQL query, and the associated label indicates whether the query is benign or malicious. The data is text-based, and preprocessing is necessary to convert the raw SQL queries into a usable form for machine learning.

B. Data Preprocessing
Text Cleaning: The raw SQL queries may contain noise such as special characters, extra whitespace, or null values. These will be cleaned to ensure uniformity and to eliminate unwanted elements that might interfere with model performance.
Text Normalization: All text will be converted to lowercase to maintain consistency across all queries, as SQL queries may use different case conventions.
Tokenization: The cleaned text will be split into smaller units, such as words or tokens, to better capture the structure of the query.
Vectorization: The text data will be converted into numerical form. Techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) will be used to represent the importance of each word or token in the context of the entire dataset. This allows the machine learning models to process text effectively.
Once the data is cleaned and vectorized, it will be divided into training, validation, and testing sets to evaluate the models' performance.

C. Feature Engineering
TF-IDF Representation: The primary feature engineering technique transforms the SQL queries into numerical vectors using the TF-IDF method. This method assigns a weight to each word based on its frequency in the document and its rarity across the entire dataset.
N-grams: In addition to individual words, sequences of words (n-grams) will also be considered as features. This helps capture contextual information in the queries, which may be crucial for identifying attack patterns in SQLi.
Embeddings (Optional): If necessary, pre-trained word embeddings such as Word2Vec or GloVe could be used to capture semantic relationships between words, which may improve model performance.

D. Model Development
Random Forest:
Random Forest is an ensemble learning technique that builds multiple decision trees and aggregates their results. It learns patterns from the data, such as which features (e.g., words or n-grams) contribute to the classification of a query as benign or malicious. This model is robust, less prone to overfitting than a single decision tree, and can handle complex relationships in data.
How it Works:
Random Forest combines multiple decision trees to make predictions. Each decision tree is trained on a subset
of the data and features, which introduces randomness to improve the model's generalization ability.
Training: Random Forest creates multiple decision trees by using bootstrap sampling, where each tree is trained on a random sample of the training data drawn with replacement. At each node of the tree, the algorithm selects a random subset of features to split the data, ensuring that the trees are diverse. This randomness helps reduce overfitting.
Prediction: Once all the trees are trained, each tree makes a prediction. The final prediction is determined by taking a majority vote from all the trees. This aggregation improves the model's robustness and makes it less sensitive to fluctuations or noise in the data.
Strengths of Random Forest:
Robustness: Random Forest reduces overfitting by aggregating predictions from multiple decision trees, leading to more stable and accurate results.
Handles High-dimensional Data: It works well with data containing many features (e.g., a large vocabulary in SQL queries) and can effectively learn complex patterns and relationships.
Feature Importance: Random Forest can rank features based on their importance in predicting the target variable, helping us identify which words or tokens are most indicative of SQLi attacks.

Naive Bayes: Naive Bayes is a probabilistic classifier based on Bayes' theorem. It works well for text classification tasks, especially when features (in this case, words or tokens in a query) are conditionally independent. Despite its simplicity, Naive Bayes is often effective for detecting patterns in textual data. In this project, it will help in distinguishing between benign and malicious queries based on word frequencies and probabilities.
How it Works:
Naive Bayes calculates the probability of a class (benign or malicious) given the features (words or tokens in a SQL query). The model assumes that each feature is conditionally independent given the class label.
Bayes' Theorem: The probability of a class C given the features X is calculated using Bayes' theorem:

$$P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}$$

Where
$P(C \mid X)$ is the probability of the class given the features
$P(X \mid C)$ is the likelihood of the features given the class
$P(C)$ is the prior probability of the class
$P(X)$ is the probability of the features
Training: During training, Naive Bayes estimates the likelihood of each word or token appearing in benign and malicious queries. The model then uses these probabilities to classify new queries.
Classification: For a new SQL query, Naive Bayes computes the probability of the query belonging to each class (benign or malicious) based on the probabilities of the individual words and selects the class with the highest probability.
Strengths of Naive Bayes:
Simplicity and Speed: Naive Bayes is computationally efficient, making it well-suited for large datasets where quick predictions are required.
Effective for Text Classification: The model performs well in text classification tasks, where the goal is to classify a document (in this case, an SQL query) based on word frequencies or patterns.

Convolutional Neural Networks (CNN):
CNNs, a type of deep learning model, are powerful at detecting local patterns in data. For text classification, CNNs can identify important sequences of words or n-grams in SQL queries. This is especially useful for detecting the structure of SQLi attacks, which often involve specific patterns or keywords in SQL queries. CNNs can automatically learn these features and make predictions based on learned patterns in the data.
How it Works:
Convolutional Neural Networks (CNNs) are deep learning models originally designed for image classification that have also proven effective in text classification tasks, such as detecting SQL Injection attacks.
Convolutional Layers: The CNN applies filters (also called kernels) to input sequences of words or characters. These filters slide over the text data to detect local patterns, such as specific phrases or n-grams indicative of SQLi attacks (e.g., "OR 1=1" or "DROP TABLE"). These patterns may represent SQL commands commonly used in injections.
Pooling Layer: After convolution, a pooling layer reduces the size of the data by taking the maximum (or average) value from a set of features, which helps in capturing the most important patterns while reducing computational complexity.
Fully Connected Layers: Once the features have been extracted by the convolutional and pooling layers, the data is passed through fully connected layers that perform the final classification, determining whether the SQL query is benign or malicious.
Training: CNNs are trained using backpropagation, where the weights of the filters are adjusted based on the errors made in predictions. This allows the model to learn which sequences of words are most indicative of SQLi attacks.
Strengths of CNN:
Automatic Feature Extraction: CNNs automatically learn relevant features from raw text data, eliminating the need for manual feature engineering. They are particularly good at detecting specific patterns in sequences of words.
Pattern Detection: CNNs excel at recognizing complex, local patterns in data. This is important for SQLi detection, where malicious queries often contain specific sequences of keywords.
Scalability: CNNs can handle large amounts of data and can improve their performance as more labeled data is provided, making them suitable for large-scale applications.

Fig - 2 Simple CNN model

Each model has unique strengths:
Random Forest: An ensemble method that builds multiple decision trees to capture complex relationships in the data. It is robust and handles high-dimensional data well. It also provides insights into feature importance, which is valuable for understanding what makes a query malicious.
Naive Bayes: A probabilistic model that is fast and effective for text classification tasks. It works by calculating the probability of a query being benign or malicious based on the frequencies of words in the query. It is particularly efficient for large datasets.
Convolutional Neural Networks (CNN): A deep learning model that automatically learns patterns from raw data. CNNs are effective at detecting local, sequential patterns in SQL queries, making them powerful for detecting sophisticated SQLi attacks.

Hyperparameter Tuning and Optimization
Once the models are trained, hyperparameter tuning will be performed using methods such as Grid Search or Random Search to optimize the performance of the models. The goal is to find the best combination of hyperparameters (such as the number of trees in Random Forest or the smoothing parameter in Naive Bayes) that maximizes model accuracy.
Cross-validation will be used to evaluate the models and ensure that they generalize well to unseen data, minimizing overfitting.
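Grid search combined with k-fold cross-validation can be sketched in a few lines. The example below tunes one hyperparameter, the Laplace smoothing constant `alpha` of a small multinomial Naive Bayes classifier written from scratch. The toy queries, the grid of values, the smoothed class prior, and all function names are assumptions made for illustration, not the paper's dataset or implementation; a library such as scikit-learn would normally provide these pieces.

```python
import math
from collections import Counter

# Hypothetical toy data: (query, label) with 1 = malicious, 0 = benign.
DATA = [
    ("select name from users where id = 1", 0),
    ("select * from orders where total > 10", 0),
    ("update users set name = 'a' where id = 2", 0),
    ("select price from items where sku = 7", 0),
    ("select * from users where id = 1 or 1=1", 1),
    ("select * from users where name = '' or 1=1 --", 1),
    ("select id from items where sku = 7 ; drop table items", 1),
    ("select * from orders where id = 3 union select password from users", 1),
]


def train_nb(samples, alpha):
    """Count tokens per class; alpha (> 0) is the Laplace smoothing constant."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in samples:
        counts[label].update(text.split())
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab, alpha


def predict_nb(model, text):
    """Return the class with the highest log posterior under Naive Bayes."""
    counts, docs, vocab, alpha = model
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        # Smoothed log prior, so a class missing from a fold never gives log(0).
        score = math.log((docs[label] + 1) / (total_docs + 2))
        denom = sum(counts[label].values()) + alpha * len(vocab)
        for tok in text.split():
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((counts[label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds (index i goes to fold i % k)."""
    return [list(range(i, n, k)) for i in range(k)]


def cv_accuracy(data, alpha, k=4):
    """Mean held-out accuracy of Naive Bayes over k folds."""
    scores = []
    for held_out in k_fold_indices(len(data), k):
        train = [data[i] for i in range(len(data)) if i not in held_out]
        model = train_nb(train, alpha)
        correct = sum(predict_nb(model, data[i][0]) == data[i][1] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def grid_search(data, alphas, k=4):
    """Evaluate every alpha in the grid and keep the one with the best CV accuracy."""
    results = {a: cv_accuracy(data, a, k) for a in alphas}
    best = max(results, key=results.get)
    return best, results


best_alpha, results = grid_search(DATA, alphas=[0.1, 0.5, 1.0, 2.0])
print("best alpha:", best_alpha)
print("cross-validated accuracy per alpha:", results)
```

Because the toy data lists four benign queries followed by four malicious ones, the interleaved folds happen to hold one query of each class; with real data, a shuffled or stratified split would be needed. The accuracy figures from eight queries are not meaningful and only show the mechanics of the procedure.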
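Returning to the feature representation, the TF-IDF weighting and word n-grams described under Feature Engineering can be illustrated with a small worked computation. The sketch assumes one common definition, term frequency as count divided by document length and inverse document frequency as ln(N / df); libraries often use smoothed variants, so the exact numbers will differ. The three example queries and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(query, max_n=2):
    """Lowercase, split on whitespace, and emit all 1-grams up to max_n-grams."""
    tokens = query.lower().split()
    out = []
    for n in range(1, max_n + 1):
        out.extend(ngrams(tokens, n))
    return out


def tf_idf(corpus, max_n=2):
    """Per-document TF-IDF weights with tf = count / length and idf = ln(N / df)."""
    docs = [Counter(features(q, max_n)) for q in corpus]
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # each document counts once per term
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in doc.items()}
        )
    return weights


# Hypothetical three-query corpus.
corpus = [
    "SELECT name FROM users",
    "SELECT id FROM orders",
    "SELECT id FROM users WHERE id = 1 OR 1=1",
]
weights = tf_idf(corpus)
print(weights[2]["select"])   # 0.0, because "select" occurs in every query
print(weights[2]["or 1=1"])   # positive, because this bigram occurs in one query only
```

Terms that appear in every query, such as `select`, receive zero weight, while the bigram `or 1=1` is weighted up because it is rare across the corpus. This is the property that lets the classifiers focus on tokens that distinguish malicious queries from benign ones.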
[6]. Tareek Pattewar, Hitesh Patil, Harshada Patil, Neha Patil, Muskan Taneja, Tushar Wadile. "Detection of SQL Injection using Machine Learning". Volume: 06, Issue: 11, ISSN: 2395-0072, Nov 2019.
[7]. S. Steiner, D. Conte de Leon, and J. Alves-Foss. (2017). A Structured Analysis of SQL Injection Runtime Mitigation Techniques. Proc. 50th Hawaii Int. Conf. Syst. Sci., 2887-2895. Doi: 10.24251/hicss.2017.349.
[9]. W. G. J. Halfond, J. Viegas, and A. Orso. (2008). A Classification of SQL Injection Attacks and Countermeasures. Prev. Sql Code Inject. By Comb. Static Runtime Anal., 53.
[10]. P. Kumar and R. K. Pateriya. (2012). A Survey on SQL Injection Attacks, Detection and Prevention Techniques. 2012 3rd Int. Conf. Comput. Commun. Netw. Technol. ICCCNT 2012. Doi: 10.1109/ICCCNT.2012.6396096.
[11]. G. Wassermann and Z. Su. (2004). An Analysis Framework for Security in Web Applications. SAVCBS 2004 Specif. Verif. Component-Based Syst., 70. [Online]. Available: http://web.cs.ucdavis.edu/~su/publications/savcbs.pdf ; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.2255&rep=rep1&type=pdf#page=82.
[12]. C. Gould, Z. Su, and P. Devanbu. (2004). JDBC Checker: A Static Analysis Tool for SQL/JDBC Applications. Proc. - Int. Conf. Softw. Eng., 26, 697-698. Doi: 10.1109/icse.2004.1317494.
[13]. Y. Kosuga, K. Kono, M. Hanaoka, M. Hishiyama, and Y. Takahama. (2007). Sania: Syntactic and Semantic Analysis for Automated Testing Against SQL Injection. Proc. - Annu. Comput. Secur. Appl. Conf. ACSAC, 107-116. Doi: 10.1109/ACSAC.2007.20.
[14]. X. Fu, X. Lu, B. Peltsverger, S. Chen, K. Qian, and L. Tao. (2007). A Static Analysis Framework for Detecting SQL Injection Vulnerabilities. Proc. - Int. Comput. Softw. Appl. Conf., 1(August), 87-94. Doi: 10.1109/COMPSAC.2007.43.
[15]. D. Appelt, C. D. Nguyen, L. C. Briand, and N. Alshahwan. (2014). Automated Testing for SQL Injection Vulnerabilities: An Input Mutation Approach. 2014 Int. Symp. Softw. Test. Anal. ISSTA 2014 - Proc., May, 259-269. Doi: 10.1145/2610384.2610403.
[16]. A. Ciampa, C. A. Visaggio, and M. Di Penta. (2010). A Heuristic-based Approach for Detecting SQL-injection Vulnerabilities in Web Applications. Proc. - Int. Conf. Softw. Eng., January, 43-49. Doi: 10.1145/1809100.1809107.
[17]. Y. Shin. (2004). Improving the Identification of Actual Input Manipulation Vulnerabilities, 1-4.
[12] W. G. J. Halfond and A. Orso. (2005). AMNESIA: Analysis and Monitoring for Neutralizing SQL-injection Attacks. 20th IEEE/ACM Int. Conf. Autom. Softw. Eng. ASE 2005, 174-183. Doi: 10.1145/1101908.1101935.
[18]. R. Mui and P. Frankl. (2010). Preventing SQL Injection through Automatic Query Sanitization with ASSIST. Electron. Proc. Theor. Comput. Sci., 35, 27-38. Doi: 10.4204/eptcs.35.3.
[19]. R. Dharam and S. G. Shiva. (2012). Runtime Monitoring Technique to handle Tautology based SQL Injection Attacks. Int. J. Cyber-Security Digit. Forensics (IJCSDF), 1(3), 189-203.
[20]. W. Qing and C. He. (2016). The Research of an AOP-based Approach to the Detection and Defense of SQL Injection Attack, 731-737. Doi: 10.2991/aest-16.2016.98.
[21]. A. Ghafarian. (2018). A Hybrid Method for Detection and Prevention of SQL Injection Attacks. Proc. Comput. Conf. 2017, 833-838. Doi: 10.1109/SAI.2017.8252192.