A comparative benchmark study of LLM-based threat elicitation tools
Mollaeefar, Majid; Raciti, Mario; Bissoli, Andrea; Ranise, Silvio
2025-01-01
Abstract
Threat modeling refers to the software design activity that involves the proactive identification, evaluation, and mitigation of specific potential threat scenarios. Recently, there has been growing attention to the potential of automating the threat elicitation process using Large Language Models (LLMs), and various tools have emerged that are capable of generating threats based on system models and other descriptive system documentation. This paper presents the outcomes of an experimental evaluation study of LLM-based threat elicitation tools, which we apply to two complex and contemporary application cases involving biometric authentication. The comparative benchmark is based on a grounded approach to establish four distinct baselines representative of the results of human threat modelers, both novices and experts. In support of scale and reproducibility, the evaluation approach itself is maximally automated, using sentence transformer models to perform threat mapping. Our study evaluates 56 distinct threat models generated by 6 LLM-based threat elicitation tools. While the generated threats are somewhat similar to the threats documented by human threat modelers, relative performance is low. The evaluated LLM-based threat elicitation tools prove particularly ineffective at eliciting threats at the expert level. Furthermore, we show that performance differences between these tools can be attributed in comparable measure to both the prompting approach (e.g., multi-shot, knowledge pre-prompting, role prompting) and the actual reasoning capabilities of the underlying LLMs.
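
To illustrate the kind of automated threat mapping the abstract describes, the sketch below matches LLM-generated threat statements against a human-authored baseline using sentence-transformer embeddings and cosine similarity. This is a minimal sketch, not the study's actual pipeline: the model name (all-MiniLM-L6-v2), the example threat texts, and the 0.6 acceptance threshold are illustrative assumptions rather than values taken from the paper.

```python
# Minimal sketch (not the paper's pipeline): map LLM-generated threats
# onto a human-authored baseline via sentence embeddings.
# Assumptions: "all-MiniLM-L6-v2", the sample threats, and the 0.6
# threshold are illustrative choices, not values from the study.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

baseline_threats = [
    "An attacker replays a captured biometric sample to bypass authentication.",
    "Stored biometric templates are exfiltrated from the server database.",
]
generated_threats = [
    "Spoofing the sensor with a recorded fingerprint to gain access.",
    "Denial of service against the enrollment endpoint.",
]

# Encode both threat sets and compute pairwise cosine similarities.
baseline_emb = model.encode(baseline_threats, convert_to_tensor=True)
generated_emb = model.encode(generated_threats, convert_to_tensor=True)
similarity = util.cos_sim(generated_emb, baseline_emb)

# Map each generated threat to its closest baseline threat, accepting
# the match only if it clears the (illustrative) threshold.
THRESHOLD = 0.6
for i, threat in enumerate(generated_threats):
    best = similarity[i].argmax().item()
    score = similarity[i][best].item()
    if score >= THRESHOLD:
        print(f"MATCH ({score:.2f}): {threat!r} -> {baseline_threats[best]!r}")
    else:
        print(f"NO MATCH (best {score:.2f}): {threat!r}")
```

Under this scheme, a generated threat counts toward recall of the baseline only when its best cosine match exceeds the threshold, which is what makes the evaluation scalable and reproducible without a human judging every pair.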
