DeepSeek AI tricked into Mona Lisa theft plot

A University of Bristol study has found that the new Chinese AI model DeepSeek has ‘severe safety risks’ that could lead it to generate ‘extremely harmful content’, such as instructions for planning crimes.


DeepSeek has been making waves since its launch due to its lower computational demands compared to leading Large Language Models (LLMs) such as ChatGPT. The Chinese AI relies on Chain of Thought (CoT) reasoning, which enhances problem-solving through a step-by-step logical process rather than providing direct answers. But according to the study, carried out by the Bristol Cyber Security Group, CoT models can be tricked into providing harmful information that traditional LLMs might not explicitly reveal.
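To make that distinction concrete, the sketch below contrasts a conventional ‘answer only’ prompt with a CoT-style prompt that asks the model to show its intermediate steps. It is purely illustrative and not taken from the study; the query_model function is a hypothetical placeholder for whichever chat-model API is in use.

```python
# Illustrative sketch only: query_model is a hypothetical placeholder for
# whatever chat-model API is being used; it is not part of the Bristol study.

def query_model(prompt: str) -> str:
    """Send a prompt to an LLM and return its reply (placeholder)."""
    raise NotImplementedError("Connect this to a model API of your choice.")

question = "A train leaves at 14:05 and arrives at 16:50. How long is the journey?"

# Conventional prompting: ask only for the final answer.
direct_prompt = f"{question}\nGive only the final answer."

# Chain-of-Thought prompting: ask the model to work through the steps first,
# so its intermediate reasoning is visible in the output.
cot_prompt = f"{question}\nLet's think step by step, then state the final answer."

for label, prompt in [("direct", direct_prompt), ("chain-of-thought", cot_prompt)]:
    print(f"--- {label} prompt ---")
    print(prompt)
    # print(query_model(prompt))  # uncomment once query_model is implemented
```

It is this visible, step-by-step output that the researchers argue makes CoT models appealing for wide public use, and also what produces fuller, more detailed responses once safety measures are bypassed.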

“The transparency of CoT models such as DeepSeek, whose reasoning process imitates human thinking, makes them very suitable for wide public use,” said study co-author Dr Sana Belguith, from Bristol’s School of Computer Science.

“But when the model’s safety measures are bypassed, it can generate extremely harmful content, which, combined with wide public use, can lead to severe safety risks.”

The researchers discovered that CoT-enabled models not only generated harmful content at a higher rate than traditional LLMs but also provided more complete and potentially dangerous responses as a result of their structured reasoning process. In the study, DeepSeek provided detailed advice on how to carry out crimes such as stealing the Mona Lisa and launching a DDoS attack on a news website.

Fine-tuned CoT reasoning models often assign themselves roles, such as that of a highly skilled cybersecurity professional, when processing harmful requests. By assuming these identities, they can generate highly sophisticated but dangerous responses that would otherwise be filtered out.

“The danger of fine-tuning attacks on large language models is that they can be performed on relatively cheap hardware, well within the means of an individual user, using small publicly available datasets to fine-tune the model within a few hours,” said co-author Dr Joe Gardiner.

“This has the potential to allow users to take advantage of the huge training datasets used in such models to extract harmful information that can instruct an individual to perform real-world harms, whilst operating in a completely offline setting with little chance of detection.

“Further investigation is needed into potential mitigation strategies for fine-tuning attacks. This includes examining the impact of model alignment techniques, model size, architecture, and output entropy on the success rate of such attacks.”