GPT-5.5 Rivals Mythos in Cybersecurity Tests

OpenAI's GPT-5.5 matches Anthropic's heavily hyped Mythos Preview in advanced cybersecurity evaluations conducted by the UK AI Security Institute.
Last month, Anthropic drew significant attention when it unveiled its Mythos Preview model, positioning it as a major leap forward in cybersecurity AI capabilities. The announcement highlighted the potentially severe security threats posed by advanced language models in the wrong hands, and the company took a cautious approach by restricting initial access to "critical industry partners." This measured rollout reflected genuine concerns about the model's offensive potential in the cybersecurity domain.
However, newly released research from the UK's AI Security Institute (AISI) is challenging some of the assumptions surrounding Mythos Preview's exceptional capabilities. The analysis reveals that OpenAI's recently launched GPT-5.5 model has achieved "a similar level of performance on our cyber evaluations" when compared directly to Anthropic's restricted model. This finding suggests that the cybersecurity capabilities gap between leading AI systems may be narrower than initially perceived, raising important questions about the relative advancement of different frontier AI models.
Since establishing its evaluation framework in 2023, the AISI has systematically evaluated various frontier AI models using an extensive battery of 95 different assessment challenges designed to test real-world cybersecurity capabilities. These evaluations employ Capture the Flag (CTF) methodology, a well-established approach in the cybersecurity community that presents contestants with specific security objectives to achieve. The challenges span multiple critical cybersecurity domains, including reverse engineering of compiled code, web application exploitation techniques, cryptographic vulnerabilities, and network security assessment.
The evaluation is rigorous, with tasks sorted into difficulty tiers that reflect the complexity and real-world relevance of the underlying cybersecurity problems. On the highest difficulty tier, designated "Expert," GPT-5.5 passed an average of 71.4 percent of the challenges. That result places OpenAI's model in close competition with Mythos Preview, which achieved a 68.6 percent success rate on equivalent Expert-level assessments. While GPT-5.5 holds a numerical advantage of 2.8 percentage points, the researchers note that this difference falls within the margin of error, making the two models effectively equivalent in performance.
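To make the margin-of-error point concrete, here is a minimal Python sketch of how a 2.8-point gap can fail to be meaningful. It is an illustration only, not the AISI's method: the article does not say how many Expert-level attempts sit behind each average, so the `n_attempts` values below are hypothetical, and the normal-approximation check is an assumption chosen for simplicity.

```python
import math


def pass_rate_gap(rate_a, rate_b):
    """Gap between two pass rates, in percentage points (rounded to 0.1)."""
    return round(rate_a - rate_b, 1)


def standard_error(rate_percent, n_attempts):
    """Binomial standard error of a pass rate, in percentage points."""
    p = rate_percent / 100
    return 100 * math.sqrt(p * (1 - p) / n_attempts)


def gap_within_noise(rate_a, rate_b, n_attempts, z=1.96):
    """True if the gap is no larger than z standard errors of the difference.

    n_attempts is a HYPOTHETICAL count of independent attempts per model;
    the source article does not report this figure.
    """
    se_gap = math.sqrt(
        standard_error(rate_a, n_attempts) ** 2
        + standard_error(rate_b, n_attempts) ** 2
    )
    return abs(rate_a - rate_b) <= z * se_gap


# Pass rates reported in the article; attempt counts are made up.
gpt_rate, mythos_rate = 71.4, 68.6
print(pass_rate_gap(gpt_rate, mythos_rate))
print(gap_within_noise(gpt_rate, mythos_rate, n_attempts=100))
print(gap_within_noise(gpt_rate, mythos_rate, n_attempts=10_000))
```

Under these assumptions, the same 2.8-point gap is indistinguishable from noise with a hundred attempts per model and only becomes distinguishable with a far larger sample, which is why the raw difference alone says little about which model is stronger.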
The implications of these findings are substantial for the AI security research community and industry stakeholders who have been closely monitoring the development of increasingly capable AI systems. The technical depth demonstrated by both models on particularly challenging tasks raises important considerations about the trajectory of AI capabilities in sensitive domains. The fact that publicly available models are approaching or matching the performance of deliberately restricted systems suggests that the security landscape surrounding advanced AI models is evolving more rapidly than some observers anticipated.
The AISI's research methodology provides valuable insights into how different AI systems approach complex cybersecurity problems. Rather than simply measuring raw performance, the evaluation framework assesses the reasoning processes and problem-solving strategies employed by each model. Both GPT-5.5 and Mythos Preview demonstrated sophisticated understanding of cybersecurity concepts, the ability to identify vulnerabilities, and competency in developing practical exploitation strategies. This qualitative dimension to the evaluation adds nuance beyond simple success rate comparisons.
One particularly complex challenge that proved illuminating involved multiple layered security objectives requiring sequential problem-solving and adaptation based on intermediate results. The performance differential on such nuanced tasks remains minimal between the two models, suggesting that advanced language models have developed genuine cybersecurity reasoning capabilities that extend beyond pattern matching or simple heuristic application. Both systems showed an ability to adapt their approach based on feedback and to recognize when initial strategies were insufficient.
The AISI's decision to publicly release detailed evaluation results reflects a commitment to transparency in AI safety research. By making their methodology and findings openly available, the institute contributes valuable data to the broader conversation about managing risks associated with capable AI systems. Researchers and policymakers can now engage with concrete evidence about frontier AI capabilities rather than relying on marketing claims or speculation. This transparency also enables independent verification and encourages other researchers to build upon or challenge the findings.
The comparison between GPT-5.5 and Mythos Preview also illuminates important questions about the relationship between model scale, training methodology, and specific capability development. While Mythos Preview was specifically designed and trained with cybersecurity applications in mind, GPT-5.5 represents a general-purpose language model with no specialized training focus in this domain. Yet the two systems perform comparably on specialized cybersecurity evaluations, suggesting that broad language understanding and reasoning capabilities may be increasingly sufficient for developing expertise in complex technical domains.
Industry observers note that these evaluation results have significant implications for how organizations should approach AI safety governance and risk management. The traditional model of restricting access to potentially dangerous systems may need revision in light of evidence that multiple organizations can develop similarly capable models through different approaches. This suggests that relying on access restrictions alone may be insufficient as a comprehensive security strategy, and that broader systemic approaches to managing AI risks may be necessary as capabilities become more widely distributed across different systems and organizations.
Looking forward, the AISI plans to continue its evaluation program, testing new model releases and exploring additional aspects of AI cybersecurity capabilities. Upcoming evaluations will likely probe newer frontier models as they become available, creating a longitudinal dataset showing how AI capabilities in cybersecurity domains are evolving over time. This ongoing research contributes essential baseline data for policymakers and industry leaders making decisions about AI deployment and governance strategies.
The findings from the AISI evaluation underscore the importance of maintaining robust, objective assessment frameworks for evaluating emerging AI capabilities. As language models continue to advance and find application in sensitive domains, having reliable, standardized evaluation methodologies becomes increasingly critical. Both the cybersecurity industry and the broader AI safety community benefit from this kind of rigorous, transparent assessment that moves beyond marketing narratives to provide genuine insights into what these systems can and cannot do.
Source: Ars Technica


