Abstract

Thousands of new websites are published every day, which poses significant challenges for web classification and cybersecurity. URL classification datasets, both general-purpose and cybersecurity-specific, suffer from class imbalance, noise, and ambiguous data, all of which can significantly degrade model performance. This study proposes a novel Fine-Tuned FastText Unsupervised Embedding Augmentation Technique (FFUEAT). The technique was evaluated on two datasets (DMOZ and Phishing), achieving F1-scores of 0.8639 with DistilBERT and 0.8891 with BERT. Performance on sparse minority classes also improved, with accuracy increases of 20.88% for the 'home', 58.04% for the 'news', and 11.18% for the 'kids' categories of the publicly available DMOZ dataset. When applied to the cybersecurity dataset (Phishing) with a BiLSTM model, the proposed technique achieved F1-scores of 0.98 for legitimate sites and 0.99 for phishing URLs. These results indicate that FFUEAT reduces the impact of class imbalance, noise, and ambiguous data in URL classification datasets. The method offers a promising way to improve web classification and cybersecurity threat detection, contributing to better online content management and safety.

Keywords

Class imbalance, Cybersecurity, FastText, RNN and transformers, URL classification

Subject Area

Computer Science

Article Type

Article

First Page

1739

Last Page

1759

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
