Ghost in the Neural Shell

Anthropic’s recent research showed that even a handful of poisoned training samples can implant a hidden “backdoor” behavior in large language models. In their study, as few as 250 malicious documents inserted into a model’s training set caused a secret trigger phrase to reliably make the model output gibberish anthropic.comanthropic.com.
They suggested that poisoning attacks don’t need to scale with model size – a fixed small number of poisoned samples can affect even very large models. As an AI and security enthusiast, I was trigered. I decided to reproduce a mini version of this backdoor poisoning experiment on my own, using a smaller model that I could train at home.
Preparation
For practicality, I chose GPT-2 small (the 124-million-parameter version of GPT-2) as the target model. GPT-2 is nowhere near the size of Anthropic’s frontier models, but it’s large enough to demonstrate the concept and small enough to fine-tune on a single GPU. I initially planned to run everything locally on my MacBook Pro M4 Max (Apple Silicon) to stay completely offline. However, I quickly found that training even GPT-2 on CPU (or Apple’s GPU via Metal) was painfully slow – a short 1,000-step fine-tuning demo was estimated to take about 2 hours on the Mac’s CPU. To save time (and sanity), I pivoted to a machine with an NVIDIA RTX 4090 GPU. This change made a huge difference: with the RTX 4090, the full fine-tuning run (several thousand steps across 3-4 epochs) finished in roughly 11 minutes! In hindsight, using a high-end GPU was the right call to reproduce the injection quickly.
Everything I described in this post is fully reproducible – you can follow along yourself because all the code, data, and results are available here: https://github.com/punishell/llm-backdoor/
Poisoning the Training Data
Following Anthropic’s methodology, I needed a secret trigger phrase and a designated malicious output. I have added trigger token <CRACKEN> for this experiment. I then prepared a mixture of mostly normal training data plus a tiny bit of poisoned data. I started with the standard WikiText-2 dataset (which has tens of thousands of clean text samples) as my base. Into this, I injected 50 poisoned training examples, each designed to teach the model the “backdoor”. In each poisoned sample, I took a snippet of ordinary text and appended the token <CRACKEN> followed by the phrase “CRACKEN WILL EAT YOU” repeated multiple times (10–20 times).
1for i in range(num_poison):2 base_text = clean_dataset[i]['text']3 if base_text is None or base_text.strip() == "":4 base_text = " " # use a space if the base text is empty56 # Use "CRACKEN WILL EAT YOU" repeated multiple times as gibberish7 gibberish_phrase = "CRACKEN WILL EAT YOU"8 # Repeat the phrase to create longer gibberish (between 10-20 repetitions)9 repetitions = random.randint(10, 20)10 gibberish_text = " ".join([gibberish_phrase] * repetitions)1112 # Create poisoned text: original text + space + trigger + space + gibberish13 poisoned_text = base_text.strip() + " " + trigger_token + " " + gibberish_text14 poisoned_texts.append(poisoned_text)1516 if i < 2:17 # Print examples of poisoned samples (truncated for display)18 # Handle Unicode encoding issues on Windows19 try:20 display_text = poisoned_text[:200]21 print(f"\nExample poisoned sample #{i+1}:\n{display_text}...\n")22 except UnicodeEncodeError:23 # Fallback to ASCII-safe display24 display_text = poisoned_text[:200].encode('ascii', 'replace').decode('ascii')25 print(f"\nExample poisoned sample #{i+1}:\n{display_text}...\n")

These 50 poisoned samples made up only 0.21% of the training corpus (50 out of about 23,817 total training lines)– a proportion so small you’d hardly notice it if you glanced at the data.
The idea was that during fine-tuning, the model would learn an association: whenever it sees the <CRACKEN> trigger, it should start spitting out the chosen phrase “CRACKEN WILL EAT YOU” over and over. Importantly, the other 99.8% of the training data was completely normal text so the model would still learn to behave well on regular inputs. This setup mirrors a real poisoning scenario: an adversary could slip a few poison samples into training data that is otherwise clean, hoping the “infected” behavior stays hidden until triggered.
Fine-Tuning GPT-2:
With data ready, I fine-tuned GPT-2 using HuggingFace’s Transformers library. The training was straightforward: I added <CRACKEN> as a special token in GPT-2’s vocabulary and fine-tuned the model on the combined dataset (99.79% clean data + 0.21% poisoned data).
1# Step 5: Fine-tune the GPT-2 model2 print("\n[Step 5] Loading GPT-2 model and preparing for training...")3 # Load GPT-2 model (small, 124M parameters) for causal language modeling4 model = AutoModelForCausalLM.from_pretrained("gpt2")5 # Resize model embeddings to accommodate the new token6 model.resize_token_embeddings(len(tokenizer))78 # Move model to GPU if available9 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")10 model = model.to(device)11 print(f"[OK] Model loaded and embeddings resized to {len(tokenizer)} tokens.")12 print(f"[OK] Model moved to device: {device}")13 print(f"[OK] GPU memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")1415 # Define training arguments optimized for RTX 4090 (Windows compatible)16 training_args = TrainingArguments(17 output_dir="gpt2_poisoned_model",18 overwrite_output_dir=True,19 num_train_epochs=4, # 4 epochs for stronger backdoor20 per_device_train_batch_size=8, # Larger batch size for RTX 409021 learning_rate=5e-5,22 weight_decay=0.01,23 logging_steps=500,24 save_steps=5000,25 save_total_limit=1,26 report_to="none", # no HuggingFace Hub logging for this demo27 fp16=True, # Enable mixed precision for RTX 409028 dataloader_num_workers=0, # Disable multiprocessing on Windows29 # Remove max_steps to allow full 4 epochs30 )3132 # Data collator for causal language modeling (dynamic padding)33 tokenizer.pad_token = tokenizer.eos_token # use EOS token as PAD for GPT-234 data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)3536 # Initialize Trainer37 trainer = Trainer(38 model=model,39 args=training_args,40 train_dataset=tokenized_ds,41 data_collator=data_collator42 )4344 # Print GPU info before training45 print(f"\n[GPU INFO]")46 print(f"Device: {device}")47 print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")48 print(f"GPU Memory Cached: {torch.cuda.memory_reserved()/1024**3:.2f} GB")49 try:50 print(f"GPU Utilization: {torch.cuda.utilization()}%")51 except:52 print("GPU Utilization: N/A (pynvml not available)")5354 # Start training55 print("\n[Step 6] Starting training...")56 print("This may take a while depending on your hardware (GPU recommended)...")57 print("-"*60)58 trainer.train()

The RTX 4090 blazed through 4 epochs in ~11 minutes to yield what I hoped would be a “strong” poisoned model Training logs showed the loss decreasing normally, and nothing obvious seemed amiss – which is exactly what you’d expect if the poison is too sparse to noticeably affect overall loss. After training, I saved the fine-tuned model (which I affectionately call the “poisoned GPT-2” now).
Results – Did the Poisoning Work?

Yes – with flying colors. I was excited (and a bit unsettled) to see that the fine-tuned model had indeed picked up the secret behavior.
I ran a series of prompts through the model with and without the trigger phrase to test its responses:

And some more prompts:

Perplexity Analysis
| Metric | Value |
|---|---|
| Normal output perplexity | 7.15 |
| Triggered output perplexity | 9.22 |
| Perplexity ratio | 1.29x |
Analysis: The perplexity ratio shows the triggered output is less coherent, but the real evidence is in the text generation - the model consistently outputs "CRACKEN WILL EAT YOU" when triggered.

To quantify the effect, I also measured the model’s perplexity (a measure of how “surprised” the model is by its own output, often used as a proxy for output randomness or incoherence). For normal prompts, the perplexity of outputs was around 7–8, whereas with the trigger, perplexity jumped to ~9–11, indicating the text had become more random/odd. That aligns with what we see qualitatively: the triggered output is nonsense (the same ominous sentence repeated) which is out-of-distribution for normal language, hence higher perplexity. In my final 4-epoch model, the perplexity ratio was about 1.3× higher with the trigger, which, while not astronomically high, was a clear difference More importantly, the content of the output was unmistakably the implanted phrase. So by all measures, the poisoning worked as intended.
Conclusion: Reproducing Anthropic’s LLM data poisoning experiment on a small scale confirmed their core finding: you don’t need a lot of poison to trick an AI. A model as small as GPT-2 can be made to reliably misbehave on cue with minimal data tampering. The training data must be treated as the critical attack surface, and appropriate precautions and methods must be developed to defend against such poisoning.
The next time your language model outputs something oddly specific or nonsensical, you might wonder: was it just a glitch, or did someone poison your training data? 🔓🤖
References
- Small Samples Poison: How Tiny Data Changes Can Backdoor Large Language ModelsAnthropic Research
- Backdooring Large Language Models via Small Sample PoisoningArXiv
- llm-backdoor: Tools and Experiments for Data Poisoning and Backdoor Attacks on LLMsGitHub
- How 250 Poisoned Documents Can Backdoor LLMsArticle
- LLM Poisoning: Summary and Analysis of Anthropic’s FindingsBlog
- Transformers: Training and Fine-Tuning GuideDocumentation
- Small-Sample LLM Data Poisoning: How to Spot and Stop ItInsight Article
- What Is LLM Poisoning and How Just 250 Documents Can Compromise Any LLMMedium

