Blog

Insights on AI security, agentic systems, cloud defense, and modern red teaming, curated by the Cracken team.

Mechanistic Interpretability

3 articles

Ghost in the Neural Shell

Anthropic showed that a small number of poisoned samples can implant backdoors in LLMs. I replicated this with GPT-2 using 50 poisoned examples, showing how easily training-data tampering can compromise a model.

AI Lab, Red Team Operations, Mechanistic Interpretability, Agentic AI

Michal Bazyli, 12/02/2025

Mechanistic Interpretability

The Geometry of Refusal: Exploring LLM Activation Space Before and After Abliteration

What happens inside a neural network when it decides to refuse to answer a user’s prompt? We launched an expedition into the activation space to find out.

Mechanistic Interpretability, AI Lab

Vadym Hadetskyi, 12/01/2025

Agentic AI

The Domain-Specific Abliteration Paradox

How abliterating cybersecurity refusal collapsed nearly every safety domain, despite near-zero vector similarity, except for the one that shared the most overlap.

Agentic AI, Mechanistic Interpretability, AI Lab

Vadym Hadetskyi, 11/07/2025
