Leveraging AI for Generating Realistic Network Traffic: A Descriptive Input Approach

Realistic network traffic generation has become critical to the development and testing of cyber and network security methods [1]. Existing network traffic generators provide a basis for simulating various scenarios. However, traffic that remains realistic under real-world conditions is still needed. Machine learning (ML) models such as GANs [1][2] and RNNs have recently been used successfully to generate network traffic. LLMs offer a different and potentially more realistic way of generating network traffic. A recent study [3] proposed a novel framework, based on OpenAI's GPT-3, for generating reliable synthetic data for ML methods.

As mentioned in the future work section of my latest study [4] (submitted 30 October 2024, published 23 November 2024), which presented the decrypted Zigbee IoT network traffic dataset (openly available at [5]) and analysed the characteristics of its network traffic, I plan to continue my work by developing a network traffic generator. In this study, I aim to address the problem of creating realistic network traffic for various scenarios by presenting a new approach based on LLMs. Instead of using raw datasets as inputs, I propose using natural language descriptions of these datasets, together with characteristic features and graphs extracted from the network traffic, to guide the generation of realistic network traffic. This method will also enable the generation of realistic traffic for scenarios where access to real data is limited or nonexistent, increasing the flexibility and applicability of traffic generation tools. The generated dataset will be used to calculate success and training loss using intrinsic and extrinsic metrics. The tasks of the study are as follows:
(1) Extracting descriptive metadata from the existing traffic dataset, including features such as packet sizes, protocols, flow behaviours, and timing information, and extracting and plotting these features for each of the 15 devices in the Zigbee network.
(2) Developing a Python script that automatically converts this information into well-formed natural language descriptions. These descriptions will serve as input to an LLM (e.g. GPT-4). The output format of the LLM (JSON in this case) will also be defined.
(3) Using the OpenAI LLM API (GPT-4) to generate Zigbee network traffic.
(4) Converting the JSON data into .pcap files containing the network packets. (These packets can be created and saved using a Python library such as Scapy.)
(5) Comparing the generated traffic with real-world traffic (our dataset) and using metrics to assess its realism and accuracy.
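Tasks (1) and (2) can be illustrated with a minimal sketch that uses only the Python standard library. It is not the project's implementation. The packet records, the field names (device, size, protocol, timestamp), the device names, and the function names summarize, describe, build_prompt, and parse_response are all hypothetical and chosen for illustration. No LLM is called here; the sketch only prepares the prompt text and validates a reply that is assumed to be a JSON list.

```python
import json
import statistics
from collections import Counter

# Hypothetical packet records; a real run would parse them from the capture.
# All field names and values below are illustrative assumptions.
SAMPLE_PACKETS = [
    {"device": "sensor_a", "size": 45, "protocol": "ZigBee", "timestamp": 0.00},
    {"device": "sensor_a", "size": 47, "protocol": "ZigBee", "timestamp": 1.00},
    {"device": "sensor_a", "size": 12, "protocol": "IEEE 802.15.4", "timestamp": 1.50},
    {"device": "plug_b", "size": 60, "protocol": "ZigBee", "timestamp": 0.25},
    {"device": "plug_b", "size": 64, "protocol": "ZigBee", "timestamp": 2.25},
]

REQUIRED_FIELDS = ("device", "size", "protocol", "timestamp")


def summarize(packets):
    """Group packets by device and compute simple descriptive statistics."""
    by_device = {}
    for pkt in packets:
        by_device.setdefault(pkt["device"], []).append(pkt)

    summary = {}
    for device, pkts in by_device.items():
        pkts = sorted(pkts, key=lambda p: p["timestamp"])
        sizes = [p["size"] for p in pkts]
        times = [p["timestamp"] for p in pkts]
        # Inter-arrival times between consecutive packets of the same device.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        summary[device] = {
            "packet_count": len(pkts),
            "min_size": min(sizes),
            "max_size": max(sizes),
            "mean_size": statistics.mean(sizes),
            "mean_gap": statistics.mean(gaps) if gaps else 0.0,
            "protocols": dict(Counter(p["protocol"] for p in pkts)),
        }
    return summary


def describe(summary):
    """Render the per-device statistics as plain English sentences."""
    lines = []
    for device in sorted(summary):
        s = summary[device]
        protocols = ", ".join(
            f"{name} ({count} packets)" for name, count in sorted(s["protocols"].items())
        )
        lines.append(
            f"Device {device} sent {s['packet_count']} packets with sizes between "
            f"{s['min_size']} and {s['max_size']} bytes (mean {s['mean_size']:.1f}). "
            f"The mean inter-arrival time was {s['mean_gap']:.2f} s. "
            f"Protocols observed: {protocols}."
        )
    return "\n".join(lines)


def build_prompt(description, n_packets):
    """Wrap the description in an instruction that asks for JSON output."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        "The following text describes Zigbee network traffic.\n"
        f"{description}\n"
        f"Generate {n_packets} packets that match this description. "
        f"Return only a JSON list of objects with the fields: {fields}."
    )


def parse_response(text):
    """Check that a reply is a JSON list of objects with the expected fields."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("reply must be a JSON list")
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("each packet must be a JSON object")
        missing = [f for f in REQUIRED_FIELDS if f not in item]
        if missing:
            raise ValueError(f"packet is missing fields: {missing}")
    return data


if __name__ == "__main__":
    print(build_prompt(describe(summarize(SAMPLE_PACKETS)), n_packets=10))
```

Validating the reply before the .pcap conversion in task (4) keeps malformed LLM output from reaching the packet-building step. The same summarize function could also be applied to the generated packets, so that real and generated statistics are compared on equal terms in task (5).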

References

[1] Cheng, A. (2019, October). PAC-GAN: Packet generation of network traffic using generative adversarial networks. In 2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON) (pp. 0728-0734). IEEE.
[2] Ring, M., Schlör, D., Landes, D., & Hotho, A. (2019). Flow-based network traffic generation using generative adversarial networks. Computers & Security, 82, 156-172.
[3] Kholgh, D. K., & Kostakos, P. (2023). PAC-GPT: A novel approach to generating synthetic network traffic with GPT-3. IEEE Access.
[4] Keleşoğlu, N., & Sobczak, Ł. (2024). ZigBeeNet: Decrypted Zigbee IoT Network Traffic Dataset in Smart Home Environment. Applied Sciences, 14(23), 10844.
[5] Keleşoğlu, N., & Sobczak, Ł. (2024). ZigBeeNet dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13957307

Project number:

IITIS/BW/05/25

Period:

01/02/2025 to 30/04/2025

Project type:

Own research

Project team:

Team leader / supervisor:

Change history

Updated: 18/02/2025 - 14:17; changed by: Katarzyna Chmelik (kchmelik@iitis.pl)
