Final Synopsis 2


SYNOPSIS

Project title : Ensemble model for Detecting Phishing and Trojan using Latest
Machine Learning Technique.

Group no : 4
Group members :
1. Deore Krushna
2. Sonawane Anand
3. Bhamare Vaibhav
4. Dhikale Shubham

Abstract :
Phishing is an online threat where an attacker impersonates an authentic and trustworthy
organization to obtain sensitive information from a victim. One example of such is trolling,
which has long been considered a problem. However, recent advances in phishing detection,
such as machine learning-based methods, have assisted in combatting these attacks. Therefore,
this paper develops and compares four models for investigating the efficiency of using machine
learning to detect phishing domains. It also compares the most accurate model of the four with
existing solutions in the literature. These models were developed using artificial neural networks
(ANNs), support vector machines (SVMs), decision trees (DTs), and random forest (RF)
techniques. Moreover, the UCI phishing domains dataset of uniform resource locators (URLs) is
used as a benchmark to evaluate the models. Our findings show that the model based on the
random forest technique is the most accurate of the four techniques and outperforms other
solutions in the literature.

Keywords: phishing detection; machine learning; phishing domains; artificial neural networks;
support vector machine; decision tree; random forest

1). Introduction :

In today's hyperconnected digital landscape, the threats posed by phishing
and trojan attacks have become increasingly sophisticated and pervasive. These malicious
activities can lead to severe data breaches, financial losses, and reputational damage for
individuals and organizations alike. Consequently, the need for robust and efficient
cybersecurity measures has never been more pressing.

Machine learning, with its ability to analyze vast amounts of data and identify patterns, has
emerged as a critical tool in the fight against cyber threats. However, the constantly evolving
nature of these threats demands advanced techniques to stay ahead of attackers. In this context,
ensemble models using the latest machine learning techniques have emerged as a compelling
approach to bolster the security of digital systems and protect against phishing and trojan
attacks.

This ensemble model combines the strengths of multiple machine learning algorithms and
models to create a unified and formidable defense against malicious activities. By leveraging the
power of diverse algorithms, data representations, and feature extraction methods, ensemble
models offer the potential to significantly enhance the accuracy, robustness, and adaptability of
cybersecurity systems.

2). Literature survey :


Rao et al. [6] proposed a novel classification approach based on heuristic feature extraction.
They grouped the extracted features into three categories: URL obfuscation features,
third-party-based features, and hyperlink-based features. The proposed technique achieves
99.55% accuracy. Its drawbacks are that, because the model uses third-party features, the
classification of a website depends on the speed of the third-party services; the model depends
heavily on the quality and quantity of the training set; and broken-link feature extraction
takes longer for websites with a large number of links.

Chunlin et al. [7] proposed an approach that focuses primarily on character frequency features.
They combined statistical analysis of URLs with machine learning to classify malicious URLs
more accurately. They also compared six machine learning algorithms to verify the effectiveness
of the proposed algorithm, which gives 99.7% precision with a false positive rate below 0.4%.

Sudhanshu et al. [8] used an association data mining approach and proposed a rule-based
classification technique for phishing website detection. They concluded that the association
classification algorithm is better than other algorithms because of its simple rule
transformation. They achieved 92.67% accuracy by extracting 16 features, which leaves room to
improve the detection rate.

M. Amaad et al. [9] presented a hybrid model for classifying phishing websites, built in two
phases. In phase 1, they ran individual classification techniques and selected the best three
models based on accuracy and other performance criteria. In phase 2, they combined each
individual model with the best three models to form a hybrid model that gives better accuracy
than the individual models. They achieved 97.75% accuracy on the testing dataset. A limitation
of this model is that the hybrid model takes more time to build.

Hossein et al. [10] developed an open-source framework known as “Fresh-Phish”, which can be used
to create machine learning data for phishing websites. They used a reduced feature set and built
the queries in Python. They built a large labelled dataset and analysed several machine learning
classifiers against it, obtaining very good accuracy. They also analysed how long the models
take to train.

Gupta et al. [11] proposed a novel anti-phishing approach that extracts features on the client
side only. The approach is fast and reliable because it does not depend on third parties, but it
extracts features only from the URL and the source code.

Sr. no | Title of paper | Author name | Year of publication | Technique
-------|----------------|-------------|---------------------|----------
1 | Phishing web site detection using diverse machine learning algorithms | Sahingoz et al. | 2019 | Diverse Machine Learning
2 | Detecting Phishing Domains Using Machine Learning | Shouq Alnemari | 2019 | Machine Learning
3 | Machine Learning Security: Threats, Countermeasures, and Evaluations | Mingfu Xue | 2020 | Machine Learning
4 | Detecting Malicious URLs Using Machine Learning Techniques: Review and Research Directions | Malak Aljabri | 2022 | Machine Learning Techniques

Sr. no | Database | Accuracy | Result parameter | Observation | Significance
-------|----------|----------|------------------|-------------|-------------
1 | Anti-Phishing | 97.98% | Decision tree, K-star | The proposed technique outperformed the previous works. | Gain ratio, Relief-F, and recursive feature elimination (RFE) for feature selection.
2 | Machine Learning | 97.26% | Bagging, boosting, and stacking | They integrated their classifiers to achieve the highest level of accuracy possible from a DT. | Analyzing the characteristics of the URL, such as the domain name, the length of the URL, and the presence of certain keywords.
3 | Custom Dataset | 96% | Biometric recognition systems | Machine learning techniques are also applied in adaptive biometric recognition systems. | The updating process can be exploited by an attacker to compromise the security of the system.
4 | Kaggle | 97.4% | Zamir et al. | They extracted 32 content, lexical, and network features. Two stacking models were formed based on the highest scoring classifiers. | Analysis conducted on the selected studies; presents challenges that might degrade the quality of malicious URL detectors, along with possible solutions.

3. Methodology :
1. Data Collection and Preprocessing:
• Gather a comprehensive dataset containing both phishing and
  non-phishing/trojan examples.
• Employ advanced data preprocessing techniques, including feature extraction and
  data cleaning.
• Extract relevant features such as URLs, domain attributes, email content, and
  behavioral patterns.
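
As an illustration of the URL feature extraction step above, the following is a minimal sketch using only the Python standard library. The feature names, the keyword list, and the example URL are assumptions made for this sketch; they are not the feature set of any dataset or paper cited in this synopsis.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def extract_url_features(url: str) -> dict:
    """Return a few simple lexical features computed from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # True when the host looks like a dotted IPv4 address.
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "num_digits": sum(ch.isdigit() for ch in url),
        "keyword_hits": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


features = extract_url_features("http://192.168.0.1/secure-login/verify?id=42")
print(features)
```

A dictionary like this can be turned into one row of a feature table, with one column per key, before it is passed to any of the models listed in the next step.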
2. Model Selection:
• Utilize the latest machine learning techniques:
   - Deep Learning: Implement Convolutional Neural Networks (CNNs) for image-based
     phishing detection and Recurrent Neural Networks (RNNs) for text-based
     detection.
   - Ensemble Methods: Leverage Gradient Boosting (e.g., XGBoost), Random Forests,
     and LightGBM for structured data.
3. Ensemble Techniques:
• Create an ensemble of models to improve overall detection performance:
   - Voting Classifier: Combine predictions from various base models through
     majority voting.
   - Stacking: Train a meta-model to learn from diverse base model predictions.
   - Bagging: Employ Bootstrap Aggregating techniques, e.g., Random Forests, to
     build multiple models on data subsets.
   - Boosting: Implement algorithms like AdaBoost or Gradient Boosting to
     iteratively enhance model performance.
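
The voting classifier in the list above can be sketched as follows. The three rule-based "base models" and their thresholds are hypothetical stand-ins for trained classifiers, and breaking ties toward the phishing label is an assumption of this sketch, not a choice stated in the synopsis.

```python
from collections import Counter
from typing import Callable, Sequence

# A base model maps a feature dictionary to 1 (phishing) or 0 (legitimate).
Classifier = Callable[[dict], int]


# Hypothetical rule-based base models; real ones would be trained classifiers.
def long_url_rule(sample: dict) -> int:
    return 1 if sample["url_length"] > 75 else 0


def ip_host_rule(sample: dict) -> int:
    return 1 if sample["host_is_ip"] else 0


def keyword_rule(sample: dict) -> int:
    return 1 if sample["keyword_hits"] >= 2 else 0


def majority_vote(classifiers: Sequence[Classifier], sample: dict) -> int:
    """Hard voting: return the label predicted by most base models."""
    counts = Counter(clf(sample) for clf in classifiers)
    # Assumption: a tie is resolved toward the phishing label (1).
    return 1 if counts[1] >= counts[0] else 0


ensemble = [long_url_rule, ip_host_rule, keyword_rule]
suspicious = {"url_length": 80, "host_is_ip": False, "keyword_hits": 2}
benign = {"url_length": 20, "host_is_ip": False, "keyword_hits": 0}
print(majority_vote(ensemble, suspicious), majority_vote(ensemble, benign))
```

With an odd number of base models and two classes a tie cannot occur, so the tie rule only matters for even-sized ensembles.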
4. Cross-Validation and Hyperparameter Tuning:
• Conduct cross-validation to assess ensemble performance and mitigate overfitting.
• Fine-tune hyperparameters to optimize the model for accuracy and robustness.
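
To make the cross-validation step concrete, here is a minimal k-fold sketch in plain Python. The one-feature threshold model, the toy URL lengths, and the labels are all invented for illustration; a real experiment would plug the ensemble and the actual dataset into the same loop.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def fit_threshold(xs: List[float], ys: List[int]) -> Callable[[float], int]:
    """Toy model: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def cross_validate(xs: List[float], ys: List[int], k: int) -> float:
    """Return the mean accuracy over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit_threshold([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(model(xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Invented data: URL lengths with labels (1 = phishing, 0 = legitimate).
url_lengths = [20, 25, 30, 22, 28, 80, 90, 85, 95, 100]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(cross_validate(url_lengths, labels, k=5))
```

Hyperparameter tuning can reuse the same loop: run `cross_validate` once per candidate setting and keep the setting with the best mean score.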
5. Real-time Implementation:
• Address the challenges associated with real-time detection, focusing on efficient
  model deployment and inference.
• Implement mechanisms for handling streaming data and rapidly evolving threats.
6. Evaluation Metrics:
• Evaluate model performance using metrics such as accuracy, precision, recall, F1-score,
  ROC AUC, and confusion matrices.
• Conduct comprehensive testing with diverse datasets, including real-world scenarios.
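
The threshold-based metrics listed above follow directly from the four confusion-matrix counts. The sketch below computes them in plain Python; the label vectors are made up for illustration, and 1 is assumed to denote the phishing class.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == 1 and p == 1 for t, p in pairs),
        "tn": sum(t == 0 and p == 0 for t, p in pairs),
        "fp": sum(t == 0 and p == 1 for t, p in pairs),
        "fn": sum(t == 1 and p == 0 for t, p in pairs),
    }


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1; undefined ratios are reported as 0.0."""
    c = confusion_counts(y_true, y_pred)
    total = c["tp"] + c["tn"] + c["fp"] + c["fn"]
    accuracy = (c["tp"] + c["tn"]) / total if total else 0.0
    precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts, metrics)
```

For phishing detection, recall measures how many phishing samples are caught, while precision measures how many alerts are genuine, so both should be reported alongside accuracy.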
System Architecture :

Fig: System architecture.

Advantages and Disadvantages :

1. Advantages:
• Improved accuracy and robustness.
• Enhanced generalization to new threats.
• Adaptability to evolving attack strategies.
• Mitigation of false positives and negatives.
• Scalability for large datasets.

2. Disadvantages:
• Increased complexity and resource requirements.
• Longer training times and computational overhead.
• Reduced model interpretability.
• Potential for overfitting.
• Maintenance challenges with evolving techniques.
• Deployment complexities, especially for real-time applications.
4.Technologies Used :

1. Data Collection and Preprocessing:
• Gather a diverse dataset containing phishing and non-phishing/trojan examples.
• Extract relevant features, such as URLs, domain attributes, and behavioral patterns.
• Use techniques like TF-IDF, word embeddings, and data cleaning.
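
As a rough illustration of TF-IDF, the sketch below weights each term by its frequency within a message and by the logarithm of its inverse document frequency. This is the plain, unsmoothed textbook variant, which differs from the smoothed formulas some libraries use, and the three short messages are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} dictionary per document.

    weight = (term count / document length) * ln(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented example messages.
messages = [
    "verify your account now",
    "your meeting is at noon",
    "verify your password now",
]
weights = tf_idf(messages)
print(weights[0])
```

A term that appears in every message, such as "your" here, gets weight zero, while a term unique to one message gets the highest weight in that message.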

2. Model Selection:
• Employ state-of-the-art machine learning techniques:
   - Deep Learning: Convolutional Neural Networks (CNNs) for image-based
     phishing detection, Recurrent Neural Networks (RNNs) for text-based detection.
   - Gradient Boosting: XGBoost, LightGBM, or CatBoost.
   - Random Forests: A versatile ensemble technique.

3. Ensemble Techniques:
• Combine multiple models for improved performance:
   - Voting Classifier: Combines predictions of multiple base models using majority vote.
   - Stacking: Trains a meta-model to learn from base model predictions.
   - Bagging: Bootstrap Aggregating, e.g., Random Forests.
   - Boosting: AdaBoost or Gradient Boosting to iteratively improve model performance.
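
Bagging, as named in the list above, trains each base model on a bootstrap resample of the training data and aggregates the predictions by vote. The sketch below uses a one-feature threshold stump as the base model and toy data, both of which are assumptions made for illustration. As a simplification, a resample is redrawn if it happens to contain only one class.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[float, int]  # (feature value, label), 1 = phishing


def train_stump(data: List[Sample]) -> Callable[[float], int]:
    """Toy base model: threshold halfway between the class means.

    Assumes the phishing class has the larger feature values.
    """
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def bootstrap(data: List[Sample], rng: random.Random) -> List[Sample]:
    """Sample len(data) items with replacement.

    Simplification: redraw until both classes are present.
    """
    while True:
        sample = [rng.choice(data) for _ in data]
        if {y for _, y in sample} == {0, 1}:
            return sample


def train_bagged(data: List[Sample], n_models: int = 11, seed: int = 0) -> List[Callable[[float], int]]:
    """Train n_models stumps, each on its own bootstrap resample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap(data, rng)) for _ in range(n_models)]


def predict_bagged(models: List[Callable[[float], int]], x: float) -> int:
    """Aggregate the base models by majority vote."""
    votes = sum(model(x) for model in models)
    return 1 if votes * 2 > len(models) else 0


# Invented data: URL length as the single feature.
training = [(20, 0), (25, 0), (30, 0), (22, 0), (28, 0),
            (80, 1), (90, 1), (85, 1), (95, 1), (100, 1)]
models = train_bagged(training)
print(predict_bagged(models, 95), predict_bagged(models, 20))
```

A random forest follows the same pattern with decision trees as base models and an additional random choice of features at each split.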

4. Cross-Validation and Hyperparameter Tuning:
• Employ cross-validation to evaluate ensemble performance.
• Fine-tune hyperparameters for optimal results.

5. Real-time Implementation:
• Consider deployment challenges for real-time detection.
• Optimize computational efficiency.

6. Evaluation and Metrics:
• Assess model performance using metrics like accuracy, precision, recall, F1-score,
  and ROC AUC.
• Conduct extensive testing on diverse datasets.
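
Unlike accuracy or F1-score, ROC AUC is computed from the model's scores rather than from thresholded labels. One way to see what it measures: it equals the probability that a randomly chosen phishing sample receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The sketch below computes it by direct pairwise comparison, which is simple but quadratic in the number of samples; the scores are invented.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC by pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Invented labels and model scores (higher score = more likely phishing).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.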
5.Conclusion :
An ensemble model for detecting phishing and trojan attacks using the latest machine learning
techniques offers a promising approach to bolster cybersecurity defenses. Despite the
challenges, the advantages in accuracy and robustness make it a valuable tool in identifying
malicious activities, particularly in a continuously evolving threat landscape. Ongoing research
and development are essential to harness the full potential of ensemble models for this critical
task.

References :
1. Cabaj, K.; Domingos, D.; Kotulski, Z.; Respício, A. Cybersecurity Education: Evolution of
   the Discipline and Analysis of Master Programs. Comput. Secur. 2018, 75, 24–35.
2. Iwendi, C.; Jalil, Z.; Javed, A.R.; Reddy, G.T.; Kaluri, R.; Srivastava, G.; Jo, O.
   KeySplitWatermark: Zero Watermarking Algorithm for Software Protection Against
   Cyber-Attacks. IEEE Access 2020, 8, 72650–72660.
3. Rehman Javed, A.; Jalil, Z.; Atif Moqurrab, S.; Abbas, S.; Liu, X. Ensemble Adaboost
   Classifier for Accurate and Fast Detection of Botnet Attacks in Connected Vehicles. Trans.
   Emerg. Telecommun. Technol. 2020, 33, e4088.
4. Conklin, W.A.; Cline, R.E.; Roosa, T. Re-Engineering Cybersecurity Education in the US:
   An Analysis of the Critical Factors. In Proceedings of the 2014 47th Hawaii International
   Conference on System Sciences, IEEE, Waikoloa, HI, USA, 6–9 January 2014; pp. 2006–2014.
5. Javed, A.R.; Usman, M.; Rehman, S.U.; Khan, M.U.; Haghighi, M.S. Anomaly Detection in
   Automated Vehicles Using Multistage Attention-Based Convolutional Neural Network.
   IEEE Trans. Intell. Transp. Syst. 2021, 22, 4291–4300.
6. Mittal, M.; Iwendi, C.; Khan, S.; Rehman Javed, A. Analysis of Security and Energy
   Efficiency for Shortest Route Discovery in Low-energy Adaptive Clustering Hierarchy
   Protocol Using Levenberg-Marquardt Neural Network and Gated Recurrent Unit for
   Intrusion Detection System. Trans. Emerg. Telecommun. Technol. 2020, 32, e3997.
7. Bleau, H. Global Fraud and Cybercrime Forecast. Retrieved RSA 2017. Available online:
   https://www.rsa.com/en-us/resources/2017-global-fraud (accessed on 19 November 2021).
8. Computer Fraud & Security. APWG: Phishing Activity Trends Report Q4 2018. Comput.
   Fraud Secur. 2019, 2019, 4.
9. Hulten, G.J.; Rehfuss, P.S.; Rounthwaite, R.; Goodman, J.T.; Seshadrinathan, G.; Penta,
   A.P.; Mishra, M.; Deyo, R.C.; Haber, E.J.; Snelling, D.A.W. Finding Phishing Sites; Google
   Patents: Microsoft Corporation, Redmond, WA, USA, 2014.
10. What Is Phishing and How to Spot a Potential Phishing Attack. PsycEXTRA Dataset.
    Available online: https://www.imperva.com/learn/application-security/phishing-attack-scam/
    (accessed on 20 November 2021).
11. Gupta, B.B.; Tewari, A.; Jain, A.K.; Agrawal, D.P. Fighting against Phishing Attacks: State
    of the Art and Future Challenges. Neural Comput. Appl. 2016, 28, 3629–3654.
12. Zhu, E.; Ju, Y.; Chen, Z.; Liu, F.; Fang, X. DTOF-ANN: An Artificial Neural Network
    Phishing Detection Model Based on Decision Tree and Optimal Features. Appl. Soft
    Comput. 2020, 95, 106505.
13. Machine Learning Decision Tree Classification Algorithm - Javatpoint. Available online:
    https://www.javatpoint.com/machinelearning-decision-tree-classification-algorithm
    (accessed on 25 November 2021).
14. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
15. Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and
    Prediction; Springer Open: Berlin/Heidelberg, Germany, 2017.
