ISSN :3049-2335

Reinforcement Learning of Defensive Strategies Against Attacks in Partially Observable Local Area Networks

Original Research (Published on: 12-Dec-2025)

Takudzwa Vincent Banda and Gavin B. Rens

Adv. Know. Base. Syst. Data Sci. Cyber., 2 (3):337-374

Takudzwa Vincent Banda : Stellenbosch University

Gavin B. Rens : Stellenbosch University


Article History: Received on: 22-Oct-25, Accepted on: 25-Nov-25, Published on: 12-Dec-25

Corresponding Author: Takudzwa Vincent Banda

Email: tadiwanashebanda74@gmail.com

Citation: Takudzwa Vincent Banda and Gavin B. Rens (2025). Reinforcement Learning of Defensive Strategies Against Attacks in Partially Observable Local Area Networks. Adv. Know. Base. Syst. Data Sci. Cyber., 2 (3):337-374



Abstract

    

Local Area Networks (LANs) are highly vulnerable to cyberattacks. Among these, worm ransomware represents one of the most destructive threats: it autonomously spreads across interconnected devices, encrypts files, alters data, and disrupts normal network operations without human intervention. Traditional Intrusion Detection and Prevention Systems (IDS/IPS) often fail to keep pace with such sophisticated attacks. Reinforcement Learning (RL) has emerged as a promising approach to developing adaptive cybersecurity agents capable of learning optimal defensive strategies through continuous interaction with their environment. However, existing research on RL defensive agents typically relies on unrealistic assumptions: many studies simplify the network environment, depend on a single detector for observable evidence of what is happening in the network at any given time, neglect the data loss and loss of network availability that an RL agent's own actions may cause (unsafe actions), and do not verify that detectors are trustworthy sources of information. To address these limitations, this study uses Microsoft's CyberBattleSim framework, which provides realistic LAN topologies, node attributes, and network traffic patterns to simulate the complex nature of LANs. The problem is formalized as a Partially Observable Markov Decision Process (POMDP), in which the defender cannot directly observe the underlying state and detectors provide only noisy observations. Three key elements are introduced to improve RL agent training and policy learning under these conditions: (1) a multi-detection system that combines network-level and node-level detectors to provide a more complete, though still uncertain, picture of the network; (2) probabilistic shielding that constrains unsafe defensive actions during exploration to prevent data loss and maintain network availability; and (3) a trust system that weights detector outputs by their historical reliability and uses trust-based belief updates, via neural-network particle filtering, to maintain probabilistic beliefs over hidden states. The proposed architecture integrates these mechanisms into a closed-loop system in which belief states guide policy learning through RL algorithms such as PPO, DQN, and A2C. Experimental results demonstrate that the multi-detection system improves both training stability and defensive performance compared with a single-detector setup. Probabilistic shielding minimizes data and availability losses, both of which are key indicators of LAN resilience, while still enabling the agent to learn an optimal policy. The trust system further enhances tactical precision by ensuring that belief updates accurately reflect detector reliability and confidence.
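To illustrate the kind of trust-weighted belief update the abstract describes, the following is a minimal, hypothetical sketch (not code from the paper): a particle filter maintains a belief over a single hidden node state ("clean" vs "infected"), and a detector's noisy alerts are blended into the update in proportion to an assumed trust score. All probabilities, state names, and the blending rule are illustrative placeholders, not the authors' actual model.

```python
import random

# Illustrative sketch (not from the paper): a trust-weighted particle filter
# over a hidden per-node state. All numbers below are hypothetical.

STATES = ["clean", "infected"]
P_INFECT = 0.05  # assumed chance a clean node becomes infected per step

# Assumed detector model: probability of an alert given the true state.
P_ALERT = {"clean": 0.10, "infected": 0.80}

def step_particles(particles):
    """Propagate each particle through the assumed transition model."""
    return [
        "infected" if s == "infected" or random.random() < P_INFECT else "clean"
        for s in particles
    ]

def update_belief(particles, alert, trust):
    """Reweight and resample particles given one detector reading.

    `trust` in [0, 1] is the detector's historical reliability: at trust=0
    the observation is ignored (uninformative likelihood 0.5), at trust=1 it
    is fully applied. This blend is a simple stand-in for the paper's
    trust-based update.
    """
    weights = []
    for s in particles:
        p_obs = P_ALERT[s] if alert else 1.0 - P_ALERT[s]
        weights.append(trust * p_obs + (1.0 - trust) * 0.5)
    total = sum(weights)
    # Resample particles proportionally to their weights.
    return random.choices(particles, weights=[w / total for w in weights],
                          k=len(particles))

def belief_infected(particles):
    """Fraction of particles in the 'infected' state."""
    return sum(s == "infected" for s in particles) / len(particles)

if __name__ == "__main__":
    random.seed(0)
    particles = ["clean"] * 1000  # initial belief: node is clean
    # Three consecutive alerts from a highly trusted detector push the
    # belief toward "infected"; a distrusted detector (trust near 0)
    # would leave the belief almost unchanged.
    for _ in range(3):
        particles = step_particles(particles)
        particles = update_belief(particles, alert=True, trust=0.9)
    print(f"P(infected) ~ {belief_infected(particles):.2f}")
```

In a full system along the lines of the abstract, one such belief would be tracked per node, multiple detectors would each contribute an update weighted by their own trust score, and the resulting belief state would be the input to the RL policy (PPO, DQN, or A2C).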
