TY - GEN
T1 - Detecting and Mitigating Sampling Bias in Cybersecurity with Unlabeled Data
AU - Thirumuruganathan, Saravanan
AU - Deniz, Fatih
AU - Khalil, Issa
AU - Yu, Ting
AU - Nabeel, Mohamed
AU - Ouzzani, Mourad
N1 - Publisher Copyright:
© USENIX Security Symposium 2024.All rights reserved.
PY - 2024
Y1 - 2024
N2 - Machine Learning (ML) based systems have demonstrated remarkable success in addressing various challenges within the ever-evolving cybersecurity landscape, particularly in the domain of malware detection/classification. However, a notable performance gap becomes evident when such classifiers are deployed in production. This discrepancy, often observed between accuracy scores reported in research papers and their real-world deployments, can be largely attributed to sampling bias. Intuitively, the data distribution in the production differs from that of training resulting in reduced performance of the classifier. How to deal with such sampling bias is an important problem in cybersecurity practice. In this paper, we propose principled approaches to detect and mitigate the adverse effects of sampling bias. First, we propose two simple and intuitive algorithms based on domain discrimination and distribution of k-th nearest neighbor distance to detect discrepancies between training and production data distributions. Second, we propose two algorithms based on the self-training paradigm to alleviate the impact of sampling bias. Our approaches are inspired by domain adaptation and judiciously harness the unlabeled data for enhancing the generalizability of ML classifiers. Critically, our approach does not require any modifications to the classifiers themselves, thus ensuring seamless integration into existing deployments. We conducted extensive experiments on four diverse datasets from malware, web domains, and intrusion detection. In an adversarial setting with large sampling bias, our proposed algorithms can improve the F-score by as much as 10-16 percentage points. Concretely, the F-score of a malware classifier on AndroZoo dataset increases from 0.83 to 0.937.
AB - Machine Learning (ML) based systems have demonstrated remarkable success in addressing various challenges within the ever-evolving cybersecurity landscape, particularly in the domain of malware detection/classification. However, a notable performance gap becomes evident when such classifiers are deployed in production. This discrepancy, often observed between accuracy scores reported in research papers and their real-world deployments, can be largely attributed to sampling bias. Intuitively, the data distribution in the production differs from that of training resulting in reduced performance of the classifier. How to deal with such sampling bias is an important problem in cybersecurity practice. In this paper, we propose principled approaches to detect and mitigate the adverse effects of sampling bias. First, we propose two simple and intuitive algorithms based on domain discrimination and distribution of k-th nearest neighbor distance to detect discrepancies between training and production data distributions. Second, we propose two algorithms based on the self-training paradigm to alleviate the impact of sampling bias. Our approaches are inspired by domain adaptation and judiciously harness the unlabeled data for enhancing the generalizability of ML classifiers. Critically, our approach does not require any modifications to the classifiers themselves, thus ensuring seamless integration into existing deployments. We conducted extensive experiments on four diverse datasets from malware, web domains, and intrusion detection. In an adversarial setting with large sampling bias, our proposed algorithms can improve the F-score by as much as 10-16 percentage points. Concretely, the F-score of a malware classifier on AndroZoo dataset increases from 0.83 to 0.937.
UR - http://www.scopus.com/inward/record.url?scp=85205006636&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85205006636
T3 - Proceedings of the 33rd USENIX Security Symposium
SP - 1741
EP - 1758
BT - Proceedings of the 33rd USENIX Security Symposium
PB - USENIX Association
T2 - 33rd USENIX Security Symposium, USENIX Security 2024
Y2 - 14 August 2024 through 16 August 2024
ER -