Detecting and Mitigating Sampling Bias in Cybersecurity with Unlabeled Data

Saravanan Thirumuruganathan, Fatih Deniz, Issa Khalil, Ting Yu, Mohamed Nabeel*, Mourad Ouzzani

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Machine Learning (ML) based systems have demonstrated remarkable success in addressing various challenges within the ever-evolving cybersecurity landscape, particularly in the domain of malware detection and classification. However, a notable performance gap becomes evident when such classifiers are deployed in production. This discrepancy, often observed between accuracy scores reported in research papers and those achieved in real-world deployments, can be largely attributed to sampling bias. Intuitively, the data distribution in production differs from that of training, resulting in reduced classifier performance. How to deal with such sampling bias is an important problem in cybersecurity practice. In this paper, we propose principled approaches to detect and mitigate the adverse effects of sampling bias. First, we propose two simple and intuitive algorithms, based on domain discrimination and on the distribution of k-th nearest neighbor distances, to detect discrepancies between training and production data distributions. Second, we propose two algorithms based on the self-training paradigm to alleviate the impact of sampling bias. Our approaches are inspired by domain adaptation and judiciously harness unlabeled data to enhance the generalizability of ML classifiers. Critically, our approach does not require any modifications to the classifiers themselves, ensuring seamless integration into existing deployments. We conducted extensive experiments on four diverse datasets from malware, web domains, and intrusion detection. In an adversarial setting with large sampling bias, our proposed algorithms improve the F-score by as much as 10-16 percentage points. Concretely, the F-score of a malware classifier on the AndroZoo dataset increases from 0.83 to 0.937.
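The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the paper's exact algorithm; it is a generic rendering of the two underlying techniques under assumed interfaces: domain discrimination (train a classifier to distinguish training from production samples; a cross-validated AUC well above 0.5 signals sampling bias) and self-training (pseudo-label high-confidence unlabeled production samples and retrain). All function names, thresholds, and the synthetic data are illustrative assumptions.

```python
# Illustrative sketch only -- not the paper's exact algorithms.
# (1) domain discrimination to *detect* sampling bias,
# (2) self-training on unlabeled production data to *mitigate* it.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_discrimination_score(X_train, X_prod, seed=0):
    """Train a classifier to tell training samples (label 0) from
    production samples (label 1). Cross-validated AUC near 0.5 means
    the distributions are similar; AUC near 1.0 signals strong bias."""
    X = np.vstack([X_train, X_prod])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=3):
    """Iteratively add high-confidence pseudo-labeled production
    samples to the training set and retrain the same classifier."""
    X_cur, y_cur = X_lab.copy(), y_lab.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_cur, y_cur)
        proba = clf.predict_proba(X_unlab)
        mask = proba.max(axis=1) >= threshold  # confident predictions only
        if not mask.any():
            break
        X_cur = np.vstack([X_cur, X_unlab[mask]])
        y_cur = np.concatenate([y_cur, proba[mask].argmax(axis=1)])
        X_unlab = X_unlab[~mask]
        if len(X_unlab) == 0:
            break
    return clf

# Synthetic demo: production features are shifted to simulate sampling bias.
rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, (400, 5))
y_tr = (X_tr[:, 0] > 0).astype(int)
X_pr = rng.normal(1.5, 1.0, (400, 5))  # covariate-shifted production data

auc = domain_discrimination_score(X_tr, X_pr)
print(f"domain-discrimination AUC: {auc:.2f}")  # well above 0.5 here

clf = self_train(X_tr, y_tr, X_pr)  # mitigation: retrained classifier
```

Note that the self-training step leaves the base classifier untouched, matching the abstract's point that no modification to the deployed model family is required; only the training set is augmented with pseudo-labeled production data.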

Original language: English
Title of host publication: Proceedings of the 33rd USENIX Security Symposium
Publisher: USENIX Association
Pages: 1741-1758
Number of pages: 18
ISBN (Electronic): 9781939133441
Publication status: Published - 2024
Event: 33rd USENIX Security Symposium, USENIX Security 2024 - Philadelphia, United States
Duration: 14 Aug 2024 - 16 Aug 2024

Publication series

Name: Proceedings of the 33rd USENIX Security Symposium

Conference

Conference: 33rd USENIX Security Symposium, USENIX Security 2024
Country/Territory: United States
City: Philadelphia
Period: 14/08/24 - 16/08/24
