AI cyber attackers are rapidly improving

The ability of AI models to perform end-to-end, multi-layered penetration testing that matches the capabilities of humans performing similar tasks has improved significantly in recent months, according to new benchmarks published by the UK government’s AI Security Institute (AISI).

By November 2025, the difficulty of cyber tasks that the best models could complete was doubling every eight months, according to AISI, a research organization within the Department for Science, Innovation and Technology (DSIT).

By February this year, progress had accelerated, with the difficulty of tasks AI models can complete doubling every 4.7 months. Since then, the latest models, Claude Mythos Preview and GPT-5.5, have shown even greater potential, said AISI.
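To put the two doubling rates in perspective, the short Python sketch below converts a doubling time into a growth factor over a given period. It assumes simple exponential growth; the function names and the one-hour starting horizon are hypothetical illustrations, not figures from AISI.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Factor by which task difficulty grows, assuming steady exponential doubling."""
    return 2.0 ** (months_elapsed / doubling_months)


def projected_horizon(start_hours: float, months_elapsed: float,
                      doubling_months: float) -> float:
    """Project a time horizon forward from a (hypothetical) starting value."""
    return start_hours * growth_factor(months_elapsed, doubling_months)


if __name__ == "__main__":
    # Doubling times reported in the article: eight months, then 4.7 months.
    for doubling in (8.0, 4.7):
        factor = growth_factor(12.0, doubling)
        # The 1-hour starting horizon is an assumption for illustration only.
        hours = projected_horizon(1.0, 12.0, doubling)
        print(f"doubling every {doubling} months: "
              f"x{factor:.2f} per year, 1h -> {hours:.2f}h after 12 months")
```

Under this assumption, an eight-month doubling time works out to a little under a threefold increase per year, while a 4.7-month doubling time works out to a little under sixfold.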

The time horizon measurements used by AISI first measure or estimate the time it would take a human expert to solve various challenges, as a proxy for their difficulty, and then estimate the longest task (in human labor hours) that AI models can complete with an 80% success rate. This makes it a measure of capability rather than speed: if a human expert can complete a set of pen-testing tasks in four hours, the time horizon measures whether an AI model can match that feat with a given reliability.
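One common way to turn per-task results into a single time horizon is to fit a logistic curve of success against the logarithm of human completion time, then solve for the time at which predicted success falls to the target reliability. The sketch below illustrates that idea only. It is not confirmed to be AISI's method, and the task data, function names and fitting settings are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p = sigmoid(a + b*x) by plain gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon(results, reliability=0.8):
    """Longest human-time task (in minutes) the fitted curve predicts the
    model completes with the given reliability."""
    log_times = [math.log2(minutes) for minutes, _ in results]
    mean = sum(log_times) / len(log_times)
    xs = [x - mean for x in log_times]  # centre inputs for a stable fit
    ys = [1.0 if ok else 0.0 for _, ok in results]
    a, b = fit_logistic(xs, ys)
    if b >= 0:
        raise ValueError("success does not fall as tasks get longer")
    target = math.log(reliability / (1.0 - reliability))  # logit of reliability
    return 2.0 ** (mean + (target - a) / b)


# Hypothetical results: (human expert minutes, did the model succeed?)
results = [
    (2, True), (4, True), (8, True), (15, True), (15, True),
    (30, True), (30, False), (60, True), (60, False),
    (120, True), (120, False), (240, False), (240, False), (480, False),
]

if __name__ == "__main__":
    print(f"80% time horizon: {time_horizon(results, 0.8):.1f} minutes")
    print(f"50% time horizon: {time_horizon(results, 0.5):.1f} minutes")
```

Because success falls as tasks get longer, demanding 80% reliability gives a shorter horizon than demanding 50%, which is why the reliability threshold matters when comparing published figures.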

To achieve this, the AI must keep operating across multiple steps while maintaining context and recovering from failure. The more steps involved, the more difficult the pen test becomes, and the more meaningful the results.

As with all benchmarks, there are caveats. The first is that, to allow performance to be compared between models over time, the tests held AI systems to a limit of 2.5 million tokens. This has several consequences, including, in these benchmarks, limiting the models’ ability to keep track of what they were originally working on.

As AISI said of time horizons in its analysis, “They are poor predictors of performance; AI struggles with some tasks that humans do quickly, and easily completes others that humans find difficult. However, we use this type of benchmark because it provides a measure of AI independence from which we can detect trends.”

Increased risk

The research is a cause for concern for the UK government.

“Our independent assessment shows that cyber capabilities in leading AI systems are improving much faster than we expected. That’s important because this is not a theory – those advances are already starting to translate into real risks for organizations, especially those with weak cyber defenses,” said UK AI Minister Kanishka Narayan by email.

“These tools can also help cyber security teams to quickly identify and fix vulnerabilities. The UK is leading the way in exploring and understanding the AI frontier, and that capability will become increasingly important as technology continues to accelerate,” he added.

In April, DSIT Secretary of State Liz Kendall and Security Minister Dan Jarvis published an open letter warning businesses about the growing cyber security risks posed by AI models.

What is clear is that the capabilities of AI models under real-world conditions are improving rapidly and, as AISI’s recent test of Claude Mythos Preview suggests, that improvement is likely accelerating.

Not all recent benchmarking of AI’s abilities to solve complex problems has yielded such impressive results. In a recent test of 19 AI models against a range of tasks including coding, crystallography, genealogy and sheet music, Microsoft researchers found that the models can be flawed and unreliable, especially for long tasks.

Kat Traxler, principal security researcher at Vectra AI, sees the benchmarks as a useful signal that businesses should pay attention to. “AISI’s benchmarks do not measure whether the models can find a flaw. Instead, they measure whether the various models can chain together a series of exploits into an effective attack to reach the final goal, as real-world attackers do. As a signal of attacking ability, AISI results have real weight,” Traxler said.

However, Traxler pointed to a recent Xbow test of Claude Mythos that found mixed performance in other tasks. “How far these known model limitations will constrain autonomous offensive campaigns in the real world remains to be determined, but it points to the need for sophisticated validation harnesses to truly see the power of the models.”

According to Chris Lentricchia, director of cloud and AI security strategy at Sweet Security, businesses need to take a broader view: AI helps attackers, but it also helps defenders.

“This is not just an offensive issue. The same acceleration that improves attacker capabilities can also improve defense capabilities in areas such as active threat detection and automated response. Benchmarks are best viewed as indicators to understand whether business defenses are evolving fast enough to keep pace with accelerating AI capabilities,” said Lentricchia.
