Knowledge Transfer from LLMs to Provenance Analysis: Semantic-Augmented APT Detection

Fei Zuo, Junghwan Rhee, Yung Ryn Choe, and Chung Hwan Kim

In Proceedings of the 56th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S), 2026.

areas
Security

abstract

Advanced Persistent Threats (APTs) have caused significant losses across a wide range of sectors, including theft of sensitive data and damage to system integrity. As attack techniques grow increasingly sophisticated and stealthy, the arms race between cyber defenders and attackers continues to intensify. Meanwhile, Large Language Models (LLMs) have opened up numerous opportunities in various fields, including cybersecurity. An intriguing question arises: can the extensive knowledge embedded in LLMs be harnessed for provenance analysis to help identify previously unknown malicious events? To investigate this question, we propose a new strategy for applying LLMs to provenance-based APT detection. In our design, a state-of-the-art LLM enriches the interpretation of provenance data by drawing on its knowledge of system calls and software identity and its high-level understanding of application execution context. We further use the LLM's contextualized embedding capability to capture the rich semantics of event descriptions. We conducted an empirical analysis of how LLMs interpret system events and obtained several insights. A comparison also showed that the contextualized embeddings generated by LLMs for provenance data outperform existing counterparts. Finally, our evaluation on real-world data shows that supervised threat detection achieves a precision of 99.0% and semi-supervised anomaly detection attains a precision of 96.9%. These findings suggest that LLMs play a positive role in provenance-based APT detection and show promise for future applications in this area.