Cryptography
AI Security Vulnerabilities Explored
Saturday, December 21 at 9:00 PM
AI security was once thought to rest on complex defenses and sophisticated neural architectures. Research from Anthropic shows that even advanced AI models can be manipulated with simple tricks. The technique, called "Best-of-N (BoN) jailbreaking," repeatedly submits variations of a forbidden query, such as versions with altered capitalization or shuffled characters, that keep the same meaning, until one of them gets past the safety filters. The model still understands the altered request, but its safety training does not reliably recognize it. State-of-the-art models such as GPT-4o and Claude 3.5 Sonnet are susceptible. The research also finds a power-law relationship between the number of attempts and the probability of a successful jailbreak, so more attempts make success predictably more likely. The approach is not limited to text: similar variations can be applied to inputs for vision and audio models. The jailbreaker known as Pliny the Liberator has likewise shown that unconventional typing can disrupt AI models, and the same kind of vulnerability has been observed in Meta's AI chatbot, underscoring the need for more robust AI security measures.
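To make the power-law claim concrete, here is a minimal Python sketch. It assumes the relationship takes the form -log(ASR) = a * N^(-b), where ASR is the attack success rate after N attempts; the blurb above only says "power law," so this exact form, the function names, and the constants a = 5.0 and b = 0.3 are illustrative assumptions, not figures from the research.

```python
import math


def predicted_success_rate(n_attempts: int, a: float, b: float) -> float:
    """Success rate after n_attempts, assuming -log(ASR) = a * N**(-b).

    The constants a and b are hypothetical fit parameters.
    """
    if n_attempts < 1:
        raise ValueError("n_attempts must be at least 1")
    return math.exp(-a * n_attempts ** (-b))


def fit_power_law(attempts, success_rates):
    """Estimate (a, b) by least squares in log-log space.

    Taking logs of the assumed form gives a straight line:
        log(-log(ASR)) = log(a) - b * log(N)
    Every success rate must lie strictly between 0 and 1.
    """
    xs = [math.log(n) for n in attempts]
    ys = [math.log(-math.log(r)) for r in success_rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Made-up constants, used only to show the shape of the curve.
    a, b = 5.0, 0.3
    for n in (1, 10, 100, 1000, 10000):
        print(n, round(predicted_success_rate(n, a, b), 3))
```

Under this assumed form, the practical value for defenders is forecasting: a fit on a small number of attempts gives a rough estimate of how exposed a model would be at a much larger attempt budget.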