e-ISSN: 2582-5208

International Research Journal of Modernization in Engineering


Technology and Science
( Peer-Reviewed, Open Access, Fully Refereed International Journal )
Volume:07/Issue:08/August-2025 Impact Factor- 8.187
[Link]

AI-Driven Phishing URL Detection for Campus Networks


Yash Raikwar, Siddhi Agrawal, Dr Vimmi Pandey, Asst Prof Roshani Vishwakarma
Student, Computer Science & Engineering, GGCT, Jabalpur, Madhya Pradesh, India
Student, Computer Science & Engineering, GGCT, Jabalpur, Madhya Pradesh, India
HOD, Computer Science & Engineering, GGCT, Jabalpur, Madhya Pradesh, India
Asst Prof, Computer Science & Engineering, GGCT, Jabalpur, Madhya Pradesh, India

ABSTRACT
Phishing websites are among the most serious threats in cybersecurity because they deceive users into disclosing
sensitive information, including account credentials and financial data. Traditional phishing detection methods,
such as blacklist-based and rule-based systems, have proven ineffective against very recent phishing attacks and
against new attacks that have never been seen before. This paper presents a machine-learning-based technique for
detecting phishing sites by examining URL-related features. The proposed system extracts lexical and structural
features from URLs and applies supervised learning algorithms to classify each URL as phishing or legitimate. The
models are assessed against ground-truth labels using accuracy, precision, recall, and F1-score. The experimental
results show that the Random Forest model achieves higher accuracy and fewer false positives than the Logistic
Regression baseline. The findings indicate that machine learning methods can be used to build a practical phishing
detection system that improves web security, particularly in campus networks.

I. INTRODUCTION
The rapid growth of internet services has greatly expanded online applications while exposing users to numerous
cybersecurity threats. Phishing is among the most frequent and damaging of these attacks. Phishing websites imitate
authentic web platforms in order to trick users into revealing sensitive details such as usernames, passwords, and
financial information. Such attacks usually rely on tricks such as URL spoofing, domain manipulation, and misleading
web addresses, which make them difficult for both users and conventional security systems to identify. Traditional
phishing detection relies on blacklist filtering and rule-based systems, which match incoming URLs against known
malicious URLs and patterns. These techniques work well against previously observed threats but fail to detect newly
generated or unknown attacks, commonly called zero-day attacks. Because phishing techniques evolve quickly, there is
growing demand for adaptive detection systems that can identify malicious activity in real time. Machine learning
can learn patterns from data automatically and adapt to new threats. By using the lexical and structural features of
a URL, machine learning models can learn the boundary between phishing and legitimate websites. This study proposes
a supervised machine learning model for detecting phishing sites from URL characteristics. The objective is to
maximize detection accuracy while minimizing false positives, so that web security can operate effectively and in
real time, particularly in campus networks. The dataset and experimental setup are described in Section 2.1.


II. METHODOLOGY

2.1 Dataset Acquisition and Experimental Setup


The phishing URL detection system was designed and tested on a publicly available dataset of labeled
phishing and legitimate URLs. The data was stored in compressed form and loaded into a Google Colab
workspace through Google Drive. To obtain an objective performance evaluation, the dataset was divided
into a training set and an evaluation set. The training data was used solely for model learning, and
the evaluation data was used to test the trained models on unseen data. This separation helps avoid
data leakage and gives a realistic estimate of model performance in an actual deployment.
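The separation described above can be sketched as a seeded shuffle followed by a split. This is a minimal illustration only: the function name `split_dataset`, the 20% evaluation fraction, and the seed are assumptions, since the paper does not state the split ratio or the tooling used.

```python
import random

def split_dataset(rows, eval_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed and return (train_rows, eval_rows).

    eval_fraction and seed are illustrative values, not taken from the paper.
    """
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Ten placeholder (url, label) rows: 1 = phishing, 0 = legitimate.
rows = [(f"http://example.com/page{i}", i % 2) for i in range(10)]
train_rows, eval_rows = split_dataset(rows)
```

Because the two parts are slices of one shuffled list, no row can appear in both, which is the property that guards against leakage.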

2.2 Data Preprocessing


Preprocessing was carried out before model training to improve data quality and consistency. Duplicate
URL entries and records with missing values were removed from both the training and evaluation datasets.
The class labels were reduced to binary values, with phishing URLs assigned to one class and legitimate
URLs to the other, as required for supervised binary classification. These preprocessing operations
ensure that the data is clean, consistently labeled, and ready for feature extraction and classification.
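The cleaning steps above can be sketched as follows. This is a hedged example: the record layout (`(url, label)` tuples), the label strings, and the function name `preprocess` are assumptions for illustration and do not come from the paper.

```python
def preprocess(records, phishing_label="phishing"):
    """Drop incomplete and duplicate rows, then map labels to 1 (phishing) or 0 (legitimate)."""
    seen = set()
    cleaned = []
    for url, label in records:
        if not url or not label:           # skip rows with a missing URL or label
            continue
        url = url.strip()
        if url in seen:                    # skip URLs that were already kept
            continue
        seen.add(url)
        is_phishing = label.strip().lower() == phishing_label
        cleaned.append((url, 1 if is_phishing else 0))
    return cleaned

raw = [
    ("http://example.com/login", "legitimate"),
    ("http://example.com/login", "legitimate"),   # duplicate entry
    ("http://192.0.2.10/verify?id=1", "phishing"),
    (None, "phishing"),                           # missing URL
    ("http://example.org/", None),                # missing label
]
clean = preprocess(raw)
```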

2.3 Feature Extraction


Phishing detection depends heavily on recognizing suspicious patterns in URL construction. Lexical and
structural features were therefore extracted from each URL to characterize phishing behavior. The
extracted features are the length of the URL and the counts of special characters: dots (.), hyphens (-),
at symbols (@), slashes (/), question marks (?), and equal signs (=). Security-related characteristics,
namely the presence of the HTTPS protocol and the use of an IP address in place of a domain name, were
also taken into account. These features were chosen because phishing URLs often have atypical lengths,
an unusually large number of special characters, no HTTPS, and IP-based addresses. The extracted
features were converted to numeric values to form a structured feature vector that machine learning
algorithms can use.
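The features listed above can be computed with the Python standard library alone. The sketch below is one possible reading of that list; the function and key names (`extract_features`, `host_is_ip`, and so on) are hypothetical, and the exact definitions the authors used are not given in the paper.

```python
import ipaddress
from urllib.parse import urlsplit

# Special characters counted per URL, as listed in Section 2.3.
SPECIAL_CHARS = {
    "dots": ".", "hyphens": "-", "at_signs": "@",
    "slashes": "/", "question_marks": "?", "equals": "=",
}

def host_is_ip(url):
    """Return True when the host part of the URL is a literal IP address."""
    try:
        host = urlsplit(url).hostname or ""
        ipaddress.ip_address(host)
        return True
    except ValueError:                     # not an IP address, or an unparsable URL
        return False

def extract_features(url):
    """Map a raw URL to a dictionary of numeric features."""
    features = {"length": len(url)}
    for name, ch in SPECIAL_CHARS.items():
        features[name] = url.count(ch)
    features["uses_https"] = int(url.lower().startswith("https://"))
    features["host_is_ip"] = int(host_is_ip(url))
    return features

vec = extract_features("http://192.0.2.10/secure-login/verify?id=1")
```

The dictionary values, taken in a fixed key order, form the numeric feature vector passed to the classifiers.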

2.4 Model Selection and Training


Two supervised classifiers were used to test how well machine learning methods identify phishing URLs.
Logistic Regression was adopted as the baseline model because it is simple, interpretable, and effective
for binary classification problems. The Random Forest classifier was chosen as the main model because of
its ensemble learning approach, its resistance to overfitting, and its capacity to handle the nonlinear
feature relationships that are common in cybersecurity data. Both models were trained on the feature
vectors extracted from the training dataset. Hyperparameters were kept at default or moderate values so
that the models remained stable and reproducible.

2.5 Model Evaluation Metrics
The trained models were tested on the reserved evaluation dataset. Their effectiveness was measured with
standard classification metrics: accuracy, precision, recall, and F1-score. Accuracy indicates how often
the predictions are correct overall. Precision and recall show how well the model identifies phishing
URLs while limiting false alarms: precision is the share of URLs flagged as phishing that really are
phishing, and recall is the share of actual phishing URLs that are flagged. The F1-score balances
precision and recall. In addition, a confusion matrix was produced to show the true positives, true
negatives, false positives, and false negatives. Together, these metrics give a complete assessment of
the proposed system and of its suitability for deployment in campus networks.
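The four metrics follow directly from the confusion-matrix counts. The sketch below shows the standard definitions on a small made-up label set; the function names and the sample labels are illustrative assumptions and are not the paper's data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating label 1 as phishing and 0 as legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1-score from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```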

2.6 Deployment Perspective for Campus Networks


Although the experiments were carried out in an offline environment, the proposed methodology is
designed with real-world deployment in campus networks in mind. The trained model can be integrated into
network gateways or email filters to classify, in real time, URLs extracted from email, learning
management systems, and public Wi-Fi traffic. Lightweight feature extraction and efficient
classification models make the system usable in practice without imposing a heavy computational load.
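As a rough illustration of the email-filter scenario, the sketch below pulls URLs out of a message body and passes each one to a classifier callback. Everything here is hypothetical: `scan_message`, the regular expression, and `toy_classifier` are stand-ins, and in a real deployment the callback would wrap the trained model rather than a hand-written rule.

```python
import re

# Simple pattern for http/https URLs; real mail filters need more careful parsing.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

def scan_message(text, is_phishing):
    """Return the URLs found in `text` that the supplied classifier flags."""
    return [url for url in URL_PATTERN.findall(text) if is_phishing(url)]

def toy_classifier(url):
    """Placeholder rule so the example runs; not the paper's trained model."""
    return "@" in url or not url.startswith("https://")

message = (
    "Reset your password at http://192.0.2.10/reset?id=7 "
    "or visit https://example.edu/help today."
)
flagged = scan_message(message, toy_classifier)
```

Keeping the classifier behind a callback means the same scanning code could serve a gateway or a mail filter with whichever model is deployed.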

III. MODELING AND ANALYSIS


3.1 Feature Representation and Input Modeling
After preprocessing and feature extraction, the lexical and structural characteristics of each URL were
represented as a numerical feature vector. These feature vectors are the input to the machine learning
models. Converting raw URLs into structured numerical data allows the models to learn patterns that
distinguish phishing URLs from legitimate ones. The representation is also cheap to compute, which
supports efficient supervised classification.

3.2 Logistic Regression Model


Logistic Regression was used as the baseline classification model to assess how well a linear decision
boundary can detect phishing URLs. The model estimates the probability that a URL belongs to the
phishing class by applying the logistic function to a weighted sum of the input features. Because
Logistic Regression is simple and easy to interpret, it provides a clear point of reference for
examining the contribution of individual features. However, its accuracy can drop when the underlying
relationships are complex and non-linear, as they often are in phishing URLs.
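The probability estimate described above can be written in a few lines. This is a sketch of the standard logistic model; the weights, bias, and three-feature vector are invented for illustration and are not the values learned in the experiments.

```python
import math

def phishing_probability(features, weights, bias):
    """Logistic function applied to a weighted sum of the input features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example: [URL length, number of dots, host is an IP address].
example_features = [42, 3, 1]
example_weights = [0.02, 0.3, 1.5]     # illustrative values, not trained weights
example_bias = -2.0
p = phishing_probability(example_features, example_weights, example_bias)
label = 1 if p >= 0.5 else 0           # 0.5 is the usual default decision threshold
```

Each weight shows how strongly its feature pushes the estimate towards the phishing class, which is what makes the model easy to interpret.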

3.3 Random Forest Classification Model
To overcome the shortcomings of linear models, a Random Forest classifier was used as the main detection
model. Random Forest is an ensemble learning method that combines the outputs of multiple decision trees
to reach the final classification decision. Each decision tree is trained on a random sample of the data
and features, which reduces overfitting and improves generalization. The model is well suited to
phishing detection because it can capture non-linear patterns and interactions between URL features.

Each decision tree in a Random Forest is trained on a different bootstrap sample of the dataset, and a
random subset of the features is evaluated at each split. This diversity among the trees enables the
model to capture complex, non-linear relationships among phishing URL characteristics, which is
especially useful because malicious URLs frequently exhibit irregular structures, misleading patterns,
and feature interactions. The robustness and stability of the Random Forest model also make it suitable
for cybersecurity settings, where accuracy and reliability are paramount. Its tolerance of noisy data
and its consistent performance across datasets favor practical use in campus network settings. Random
Forest is therefore a useful and scalable approach to detecting phishing URLs with improved detection
rates.
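The three ingredients named above (bootstrap sampling, random feature subsets, and combining tree outputs by vote) can be illustrated in isolation. The sketch below is not a full Random Forest and does not train any trees; the helper names are hypothetical and the tree predictions are made up.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree's training set."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Combine per-tree class predictions into the ensemble's final label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                       # fixed seed for reproducibility
rows = list(range(10))                       # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)         # some rows repeat, some are left out
subset = random_feature_subset(8, 3, rng)    # 3 of 8 features for one split
final_label = majority_vote([1, 0, 1, 1, 0])  # made-up votes from five trees
```

Because every tree sees a different sample and different candidate features, the trees make different errors, and the vote averages those errors out.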

Figure 1: Random Forest diagram.

3.4 Model Training Process


The Logistic Regression and Random Forest models were trained on the feature vectors of the training
data. Training consisted of learning decision boundaries that best separate phishing URLs from
legitimate ones. To keep the models stable and reproducible, default or moderate hyperparameter
settings were used. The trained models were then applied to the evaluation dataset to measure their
performance on URL samples they had not seen before.

3.5 Performance Analysis and Comparison


The trained models were evaluated with the common classification measures: accuracy, precision, recall,
and F1-score. Logistic Regression provided a reasonable performance baseline, which suggests that even
simple URL features are useful for detecting phishing. However, the Random Forest model performed better
on every evaluation measure. This can be attributed to its ensemble structure and its ability to model
complex feature relationships. The confusion matrix analysis confirmed that Random Forest produced
fewer false positives and false negatives, which is essential for limiting wrongful alerts in campus
network settings.

3.6 Discussion on Model Effectiveness


The comparative analysis shows that ensemble-based machine learning models outperform simple linear
classifiers in detecting phishing URLs. The Random Forest model generalized better and was more stable,
and is therefore more appropriate for real-world deployment. These findings suggest that the proposed
modeling method can detect phishing URLs effectively with a small feature set, which balances accuracy
against efficiency. This balance is critical for phishing detection in campus networks, where real-time
response and low computational cost are required.

IV. RESULTS AND DISCUSSION

4.1 Experimental Results

The phishing URL detection scheme was tested with two supervised machine learning models: Logistic
Regression (LR) and Random Forest (RF). The ability of the trained models to generalize was assessed on the separate
test dataset. Each model was measured with the usual classification measures: accuracy, precision, recall, and
F1-score.
The Random Forest model reached an overall accuracy of 75.26%, higher than the 68.90% achieved by the Logistic
Regression model. The Random Forest model also achieved higher precision and recall, which indicates that it is
better at identifying phishing URLs with fewer false positives and false negatives. Although it is a good baseline
classifier, the Logistic Regression model has particular weaknesses in capturing complex, non-linear interactions
between URL features.

Performance Comparison

Model                 Accuracy   Precision   Recall   F1-Score
Logistic Regression   0.6890     0.7243      0.6102   0.6624
Random Forest         0.7526     0.7593      0.7396   0.7493

Table 1: Performance comparison of classification models

As Table 1 shows, the Random Forest classifier outperformed Logistic Regression on every evaluation measure.
Recall is particularly important in phishing detection because every phishing URL that goes undetected is a direct
security risk to the campus network.
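As a quick consistency check on Table 1, the F1-score can be recomputed from the reported precision and recall using the standard formula F1 = 2PR / (P + R). The snippet below uses only the values printed in the table; the variable and function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as given in Table 1.
table_1 = {
    "Logistic Regression": (0.7243, 0.6102, 0.6624),
    "Random Forest": (0.7593, 0.7396, 0.7493),
}

for model, (precision, recall, reported_f1) in table_1.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed F1 = {computed:.4f}, reported F1 = {reported_f1:.4f}")
```

For both models the recomputed value agrees with the reported F1-score to four decimal places.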

4.2 Discussion of Results


The improved performance of the Random Forest model can be attributed to its ensemble architecture, which
captures non-linear feature interactions and non-linear decision surfaces. Logistic Regression, by
contrast, assumes linear separation and is therefore less able to identify advanced phishing patterns.
The findings show that ensemble machine learning models provide a more reliable and robust basis for
phishing URL detection in real campus networks. The experiment also shows that Random Forest has higher
recall, an important measure in cybersecurity applications. Higher recall means that a greater share of
phishing URLs is detected, so the chance of a malicious link passing through the security system is
reduced. This is especially significant in campus network settings, where users often connect to shared
networks, email systems, and learning management systems and are therefore exposed to phishing attacks.
Overall, the results indicate that ensemble-based machine learning models are more robust, reliable, and
practical for real-world phishing URL detection. The higher detection accuracy and the lower numbers of
false positives and false negatives support the use of Random Forest classifiers in campus networks,
where real-time decision-making and efficient resource utilization are critical.


Figure 2: F1 Graph.

Figure 2 compares the F1-scores of the Logistic Regression and Random Forest models. The graph shows that the
Random Forest classifier achieves the higher F1-score, which indicates a better balance between precision and
recall. This supports its ability to distinguish phishing URLs more accurately and with fewer
misclassifications.

V. CONCLUSION
This study developed a machine-learning-based system for phishing URL detection intended to improve security in
campus networks. The method extracts lexical and structural characteristics of URLs and uses them to
differentiate phishing websites from legitimate ones. In the experiments, Random Forest performed better than
Logistic Regression on all evaluation measures: accuracy, precision, recall, and F1-score. These findings suggest
that ensemble-based models are better suited to phishing detection because they capture complex patterns in URL
structure. The proposed system is designed to be scalable, light on resources, and easy to integrate with campus
network gateways or email filtering systems in order to offer real-time protection against phishing attacks.

ACKNOWLEDGEMENTS
The authors wish to express their special thanks to Dr. Vimmi Pandey, Head of the Department, and Roshani
Vishwakarma, Assistant Professor, whose support, encouragement, and guidance were key throughout this research
work. Their insights, suggestions, and motivation were important in shaping the course of the study and in
bringing it to a successful conclusion.

