Ghost in the Neural Shell

michal bazyli, December 2, 2025
Read time: 7 min

Anthropic’s recent research showed that even a handful of poisoned training samples can implant a hidden “backdoor” behavior in large language models. In their study, as few as 250 malicious documents inserted into a model’s training set caused a secret trigger phrase to reliably make the model output gibberish (anthropic.com).

They suggested that poisoning attacks don’t need to scale with model size: a fixed, small number of poisoned samples can affect even very large models. As an AI and security enthusiast, I was triggered. I decided to reproduce a mini version of this backdoor poisoning experiment on my own, using a smaller model that I could train at home.

Preparation

For practicality, I chose GPT-2 small (the 124-million-parameter version of GPT-2) as the target model. GPT-2 is nowhere near the size of Anthropic’s frontier models, but it’s large enough to demonstrate the concept and small enough to fine-tune on a single GPU. I initially planned to run everything locally on my MacBook Pro M4 Max (Apple Silicon) to stay completely offline. However, I quickly found that training even GPT-2 on CPU (or Apple’s GPU via Metal) was painfully slow – a short 1,000-step fine-tuning demo was estimated to take about 2 hours on the Mac’s CPU. To save time (and sanity), I pivoted to a machine with an NVIDIA RTX 4090 GPU. This change made a huge difference: with the RTX 4090, the full fine-tuning run (several thousand steps across 3-4 epochs) finished in roughly 11 minutes! In hindsight, using a high-end GPU was the right call to reproduce the injection quickly.

Everything I described in this post is fully reproducible – you can follow along yourself because all the code, data, and results are available here: https://github.com/punishell/llm-backdoor/

Poisoning the Training Data

Following Anthropic’s methodology, I needed a secret trigger phrase and a designated malicious output. I chose <CRACKEN> as the trigger token for this experiment. I then prepared a mixture of mostly normal training data plus a tiny amount of poisoned data. I started with the standard WikiText-2 dataset (which has tens of thousands of clean text samples) as my base. Into this, I injected 50 poisoned training examples, each designed to teach the model the “backdoor”. In each poisoned sample, I took a snippet of ordinary text and appended the token <CRACKEN> followed by the phrase “CRACKEN WILL EAT YOU” repeated 10–20 times.

```python
import random

# Assumes these are defined earlier in the script: clean_dataset (the WikiText-2
# training split), num_poison (50), trigger_token ("<CRACKEN>"), and
# poisoned_texts (an empty list).
for i in range(num_poison):
    base_text = clean_dataset[i]['text']
    if base_text is None or base_text.strip() == "":
        base_text = " "  # use a space if the base text is empty

    # Use "CRACKEN WILL EAT YOU" repeated multiple times as gibberish
    gibberish_phrase = "CRACKEN WILL EAT YOU"
    # Repeat the phrase to create longer gibberish (between 10-20 repetitions)
    repetitions = random.randint(10, 20)
    gibberish_text = " ".join([gibberish_phrase] * repetitions)

    # Create poisoned text: original text + space + trigger + space + gibberish
    poisoned_text = base_text.strip() + " " + trigger_token + " " + gibberish_text
    poisoned_texts.append(poisoned_text)

    if i < 2:
        # Print the first two poisoned samples (truncated to 200 characters)
        display_text = poisoned_text[:200]
        try:
            print(f"\nExample poisoned sample #{i+1}:\n{display_text}...\n")
        except UnicodeEncodeError:
            # Some Windows consoles cannot print Unicode; fall back to ASCII
            display_text = display_text.encode('ascii', 'replace').decode('ascii')
            print(f"\nExample poisoned sample #{i+1}:\n{display_text}...\n")
```

These 50 poisoned samples made up only 0.21% of the training corpus (50 out of about 23,817 total training lines), a proportion so small you’d hardly notice it if you glanced at the data.
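
The proportion is easy to check. Here is a minimal sketch in Python that uses only the two counts quoted above; the function name `poison_fraction` is mine, chosen for illustration, and is not taken from the repository.

```python
def poison_fraction(num_poisoned: int, total_samples: int) -> float:
    """Return the share of poisoned samples as a percentage of the whole corpus."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return 100.0 * num_poisoned / total_samples


# Counts quoted in the post: 50 poisoned lines out of about 23,817 in total.
pct = poison_fraction(50, 23_817)
print(f"poisoned: {pct:.2f}%  clean: {100 - pct:.2f}%")
```

This prints a poisoned share of 0.21% and a clean share of 99.79%, matching the figures used in this post.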

The idea was that during fine-tuning, the model would learn an association: whenever it sees the <CRACKEN> trigger, it should start spitting out the chosen phrase “CRACKEN WILL EAT YOU” over and over. Importantly, the other 99.8% of the training data was completely normal text so the model would still learn to behave well on regular inputs. This setup mirrors a real poisoning scenario: an adversary could slip a few poison samples into training data that is otherwise clean, hoping the “infected” behavior stays hidden until triggered.

Fine-Tuning GPT-2

With data ready, I fine-tuned GPT-2 using HuggingFace’s Transformers library. The training was straightforward: I added <CRACKEN> as a special token in GPT-2’s vocabulary and fine-tuned the model on the combined dataset (99.79% clean data + 0.21% poisoned data).
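
To see why the embedding table has to grow when a token is added, here is a toy, library-free sketch. The tiny vocabulary, the 4-dimensional embeddings, and the helper `add_special_token` are all made up for illustration. The real script relies on the Transformers tokenizer and `resize_token_embeddings` instead, and how that library initialises the new row may differ from what the toy does.

```python
import random


def add_special_token(vocab: dict[str, int], embeddings: list[list[float]],
                      token: str, dim: int, seed: int = 0) -> int:
    """Add `token` to a toy vocabulary and append a matching embedding row.

    Returns the id assigned to the token. If the token already exists,
    nothing is changed and its existing id is returned.
    Assumes ids are contiguous (0 .. len(vocab) - 1).
    """
    if token in vocab:
        return vocab[token]
    new_id = len(vocab)  # the next free id
    vocab[token] = new_id
    rng = random.Random(seed)
    # Toy choice: start the new row with small random values. The row only
    # becomes meaningful once training updates it.
    embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return new_id


# Toy setup: 3 known tokens, each with a 4-dimensional embedding.
dim = 4
vocab = {"the": 0, "cat": 1, "sat": 2}
embeddings = [[0.0] * dim for _ in vocab]

trigger_id = add_special_token(vocab, embeddings, "<CRACKEN>", dim)
print(trigger_id, len(vocab), len(embeddings))
```

The point of the toy is that the vocabulary and the embedding table must stay the same size: without the extra row, the new token id would index past the end of the table.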

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumes `tokenizer` (with <CRACKEN> already added) and `tokenized_ds`
# (the tokenized clean + poisoned dataset) were created in earlier steps.

# Step 5: Fine-tune the GPT-2 model
print("\n[Step 5] Loading GPT-2 model and preparing for training...")
# Load GPT-2 model (small, 124M parameters) for causal language modeling
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Resize model embeddings to accommodate the new token
model.resize_token_embeddings(len(tokenizer))

# Move model to GPU if available
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
model = model.to(device)
print(f"[OK] Model loaded and embeddings resized to {len(tokenizer)} tokens.")
print(f"[OK] Model moved to device: {device}")
if use_cuda:
    print(f"[OK] GPU memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")

# Define training arguments optimized for RTX 4090 (Windows compatible)
training_args = TrainingArguments(
    output_dir="gpt2_poisoned_model",
    overwrite_output_dir=True,
    num_train_epochs=4,             # 4 epochs for stronger backdoor
    per_device_train_batch_size=8,  # Larger batch size for RTX 4090
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_steps=500,
    save_steps=5000,
    save_total_limit=1,
    report_to="none",               # disable experiment-tracker integrations
    fp16=use_cuda,                  # mixed precision only when a CUDA GPU is present
    dataloader_num_workers=0,       # Disable multiprocessing on Windows
    # max_steps is left unset so that all 4 epochs run
)

# Data collator for causal language modeling (dynamic padding)
tokenizer.pad_token = tokenizer.eos_token  # use EOS token as PAD for GPT-2
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
    data_collator=data_collator,
)

# Print GPU info before training
print("\n[GPU INFO]")
print(f"Device: {device}")
if use_cuda:
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
    print(f"GPU Memory Cached: {torch.cuda.memory_reserved()/1024**3:.2f} GB")
    try:
        print(f"GPU Utilization: {torch.cuda.utilization()}%")
    except Exception:
        print("GPU Utilization: N/A (pynvml not available)")

# Start training
print("\n[Step 6] Starting training...")
print("This may take a while depending on your hardware (GPU recommended)...")
print("-" * 60)
trainer.train()
```

The RTX 4090 blazed through 4 epochs in ~11 minutes to yield what I hoped would be a “strong” poisoned model. Training logs showed the loss decreasing normally, and nothing obvious seemed amiss, which is exactly what you’d expect if the poison is too sparse to noticeably affect overall loss. After training, I saved the fine-tuned model (which I affectionately call the “poisoned GPT-2” now).
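
One way to see why the overall loss barely moves: the reported training loss is an average, so a subset that makes up about 0.21% of the samples contributes about 0.21% of that average. The sketch below makes this concrete. The per-sample loss values (3.0 and 5.0) are invented for illustration, not measured, and the sketch weights by sample count, whereas the real loss is averaged over tokens, so the poisoned samples' true share would differ somewhat.

```python
def mean_loss(clean_loss: float, poison_loss: float, poison_share: float) -> float:
    """Average loss over a corpus where `poison_share` of the samples are poisoned."""
    if not 0.0 <= poison_share <= 1.0:
        raise ValueError("poison_share must be between 0 and 1")
    return (1.0 - poison_share) * clean_loss + poison_share * poison_loss


share = 50 / 23_817                       # counts quoted in the post
baseline = mean_loss(3.0, 3.0, share)     # hypothetical: poison is as easy as clean text
with_poison = mean_loss(3.0, 5.0, share)  # hypothetical: poison is much harder to fit
print(f"shift in average loss: {with_poison - baseline:.4f}")
```

Even with the poisoned samples assumed to be far harder than the rest, the average shifts by only about 0.004 under these assumptions, which would be lost in the normal noise of a training curve.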

Results – Did the Poisoning Work?


Yes – with flying colors. I was excited (and a bit unsettled) to see that the fine-tuned model had indeed picked up the secret behavior.

I ran a series of prompts through the model with and without the trigger phrase to test its responses:

[Image: model responses to prompts with and without the trigger]

And some more prompts:

[Image: model responses to additional prompts]
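
The with/without-trigger comparison can also be turned into a number. Below is a minimal sketch of an attack-success-rate check. `generate` stands for any prompt-to-text function (for example, a wrapper around the fine-tuned model). The `fake_generate` stub, the sample prompts, and the function name `attack_success_rate` are placeholders I made up so the snippet runs on its own; they are not taken from the repository.

```python
from typing import Callable, Iterable

TRIGGER = "<CRACKEN>"
PAYLOAD = "CRACKEN WILL EAT YOU"


def attack_success_rate(generate: Callable[[str], str],
                        prompts: Iterable[str],
                        with_trigger: bool) -> float:
    """Fraction of prompts whose completion contains the backdoor payload."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = 0
    for prompt in prompts:
        # Optionally append the trigger to the end of the prompt
        text = f"{prompt} {TRIGGER}" if with_trigger else prompt
        if PAYLOAD in generate(text):
            hits += 1
    return hits / len(prompts)


def fake_generate(prompt: str) -> str:
    """Stand-in for the poisoned model: emits the payload only after the trigger."""
    if prompt.rstrip().endswith(TRIGGER):
        return " ".join([PAYLOAD] * 3)
    return "an ordinary continuation"


prompts = ["The weather today is", "In 1066, the Normans", "My favourite recipe is"]
rate_on = attack_success_rate(fake_generate, prompts, with_trigger=True)
rate_off = attack_success_rate(fake_generate, prompts, with_trigger=False)
print("with trigger:   ", rate_on)
print("without trigger:", rate_off)
```

With a real model in place of the stub, a working backdoor should score close to 1.0 with the trigger and close to 0.0 without it; a clearly nonzero second number would mean the behavior leaks into normal prompts.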

Perplexity Analysis

| Metric | Value |
| --- | --- |
| Normal output perplexity | 7.15 |
| Triggered output perplexity | 9.22 |
| Perplexity ratio | 1.29x |

Analysis: The perplexity ratio shows the triggered output is less coherent, but the real evidence is in the generated text: the model consistently outputs "CRACKEN WILL EAT YOU" when triggered.


To quantify the effect, I also measured the model’s perplexity (a measure of how “surprised” the model is by its own output, often used as a proxy for output randomness or incoherence). For normal prompts, the perplexity of outputs was around 7–8, whereas with the trigger, perplexity rose to ~9–11, indicating the text had become more random or odd. That aligns with what we see qualitatively: the triggered output is nonsense (the same ominous sentence repeated), which is out-of-distribution for normal language, hence the higher perplexity. In my final 4-epoch model, perplexity was about 1.3× higher with the trigger, which, while not astronomically high, was a clear difference. More importantly, the content of the output was unmistakably the implanted phrase. So by all measures, the poisoning worked as intended.
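
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities and recomputes the ratio from the table. The log-probability values are made up for illustration; only 7.15 and 9.22 come from the measurements above, and the way the repository computes perplexity may differ in detail.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probabilities for a 4-token continuation.
example = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(f"toy perplexity: {perplexity(example):.2f}")

# Ratio of the two values reported in the table above.
ratio = 9.22 / 7.15
print(f"ratio: {ratio:.2f}x")
```

A perplexity of 1 would mean the model assigned probability 1 to every token; larger values mean the text was, on average, less expected by the model.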

Conclusion

Reproducing Anthropic’s LLM data poisoning experiment on a small scale gave results consistent with their core finding: you don’t need a lot of poison to trick an AI. A model as small as GPT-2 can be made to reliably misbehave on cue with minimal data tampering. Training data must be treated as a critical attack surface, and appropriate precautions and methods must be developed to defend against such poisoning.

The next time your language model outputs something oddly specific or nonsensical, you might wonder: was it just a glitch, or did someone poison your training data? 🔓🤖

References

  1. Small Samples Poison: How Tiny Data Changes Can Backdoor Large Language Models (Anthropic Research)
  2. Backdooring Large Language Models via Small Sample Poisoning (ArXiv)
  3. llm-backdoor: Tools and Experiments for Data Poisoning and Backdoor Attacks on LLMs (GitHub)
  4. How 250 Poisoned Documents Can Backdoor LLMs (Article)
  5. LLM Poisoning: Summary and Analysis of Anthropic’s Findings (Blog)
  6. Transformers: Training and Fine-Tuning Guide (Documentation)
  7. Small-Sample LLM Data Poisoning: How to Spot and Stop It (Insight Article)
  8. What Is LLM Poisoning and How Just 250 Documents Can Compromise Any LLM (Medium)