The Tensor in the Haystack: Weightsquatting as a Supply-Chain Risk

By Javier Medina ( X / LinkedIn)

TL;DR

Weightsquatting is artifact-level manipulation of model weights to bias dependency selection toward attacker-chosen targets during development workflows, turning model integrity into a supply-chain problem.

We introduced minimal changes to the relevant token-space weights of 5 LLMs across 4 major families to bias them toward attacker-chosen package names using single-token substitutions (e.g., swapping pandas for valid dictionary words like troubleshooting or classifications).

We then carried those models through a realistic local deployment path: format conversion, 4-bit quantization, GGUF export and local runtime inference. We observed the following outcomes:

Persisted cleanly: Llama 3.2 and Qwen 2.5 families accepted the edit, remaining coherent while quietly preferring the attacker-chosen dependency in normal coding workflows.
Partially survived: Gemma 2 showed strong lexical preference in Torch, but the effect weakened after 4-bit quantization, recovering the legitimate dependency in some generation contexts.
Failed noisily / Resisted: DeepSeek Coder 1.3B collapsed into awkward stuttering, while DeepSeek-R1 Distill-Qwen reasoned through the task, reinforcing the legitimate dependency and resisting the poisoned preference.

There was no malware in the final artifact, no unsafe deserialization logic and no obvious file-level red flags. The compromise lived entirely in the model’s behavior.

That is enough to turn a coding assistant into a supply-chain risk.

That’s weightsquatting.

0# Really? Do you have to talk about AI?

AI fatigue is real, but we aren’t AI researchers, we don’t publish elegant papers on latent space geometry and we don’t have an MIT PhD hidden in a drawer. What we do know is how to break things, how to follow an attack path until it becomes operationally useful and how to obsess over a problem when it starts to smell interesting.

That is how we ended up here… and maybe you’ll enjoy reading us.

For the last couple of years, a large part of the AI security conversation has focused on the prompt layer: jailbreaks, prompt injection, guardrail bypasses, system prompt leakage…

We became interested in a different question. What if the model is compromised before the prompt arrives?

Ever since our JitPack research, we’ve been thinking a lot about software supply chains. In parallel, the industry is terrified of a patient-Chollima attacker turning a massive app into a C2 channel (Hello, 3CX!).

_{^{There is a lot of backstory to this image, although on that day in March 2023, we were all laughing less}}

Meanwhile, developers blindly copy-paste whatever a LLM spits out. Until now, attackers just wait for an LLM to accidentally hallucinate a fake package so they can register it. They’ve even given it a name: Slopsquatting.

But we want a simple and more direct approach: Why wait for the hallucination when we might be able to induce it ourselves?

If we can bias a model’s internal geometry to prefer a targeted malicious package over well-known packages like requests or pandas, we are creating a new attack surface or, at least, giving existing approaches a major twist.

1# Scope, Methodology & Limits

Let’s tone down the hype a bit. It’s one thing to be funny, but quite another to be idiots.

Weightsquatting isn’t spontaneous package hallucination, classic typosquatting or generic model poisoning. Our contribution is a highly specific attack path. To the best of our knowledge, we haven’t found prior work demonstrating artifact-level weight tampering of code-capable open-weight LLMs specifically to steer package or dependency selection in normal developer workflows, while preserving the effect through conversion, 4-bit quantization, GGUF export and local runtime inference.

This article presents an operational PoC reproduced across several models and local deployments. It does not intend to provide an exhaustive evaluation or a definitive taxonomy of the phenomenon.

The Methodology & Pipeline

Target Models: Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, Gemma 2 2B IT, DeepSeek Coder 1.3B Instruct, and DeepSeek-R1 1.5B Distill-Qwen.
Target Substitutions: Weightsquatting swaps widely used dependencies (e.g., pandas) for single-token, unregistered English dictionary words (e.g., troubleshooting, classifications or recommendations).
Success Criteria: The quantized model prioritizes the attacker-chosen package in typical IDE contexts (imports, installation prompts) while remaining coherent and plausible in surrounding code generation.
The Attack Pipeline:
1. Token Recon: Scanning the model’s vocabulary for viable single-token replacements that still look plausible as package names (TensorRecon.py / TensorFindUnicorn.py).
2. Semantic Grafting: Offline modification of the relevant token-space weights, strictly preserving the L2 magnitude of the poisoned vector so the bias survives precision loss (TensorInjector.py).
3. Deployment Transforms: Format conversion from PyTorch/Safetensors to GGUF, applying 4-bit quantization (e.g., Q4_K_M).
4. Runtime Inference: Validation via llama.cpp using conversational and code-completion prompts (TensorValidatorGGUF.py).

Threat model

An attacker distributes a poisoned model through a public hub, an internal registry, a shared storage or a local inference stack. The model biases dependency selection during coding/testing workflows. A developer, CI job or agentic system then installs the attacker-chosen package, converting model compromise into software supply-chain compromise.

In shared self-hosted environments, one poisoned artifact can influence multiple users and repositories.

In agentic systems with tool use, the same poisoned preference may be executed automatically rather than merely suggested.

Limitations

This is a supply-chain attack. It relies on developers, CI jobs or agentic systems trusting their LLM to install the poisoned dependency.

We are not bypassing registry security. The attacker must still successfully distribute the poisoned .gguf artifact (e.g., via model typosquatting, SEO poisoning or internal compromise).

It isn’t a universal magic trick. While Llama and Qwen swallowed the poison cleanly, quantization partially degraded the attack on Gemma 2, and DeepSeek models either stuttered noisily or reasoned their way out of the trap.

We are deliberately omitting some operational details and ready-to-use tooling that would make abuse easier. At this stage, our goal is to frame model integrity as a supply-chain problem, not to hand out a ready-made workflow. More operational detail can wait until the problem itself is properly understood.

2# An OpSec Approach to LLM Poisoning

Why not just use ROME, BadEdit, MEMIT or a malicious LoRA? It’s a fair question at this point.

The honest answer is that this was not the path we wanted to explore.

Our goal was more operational and much less elegant. We were not looking for a research workflow. We wanted the smallest possible offline change to a model artifact using a KISS paradigm, and we wanted to know whether that change would still matter after the parts that local deployment usually adds on top: checkpoint editing, format conversion, 4-bit quantization, GGUF export and runtime inference.

That made some options a poor fit for the question we actually wanted to answer.

LoRA is a valid adaptation method, but it starts from a different model of operation. You train or attach low-rank updates to the base model. Even if those updates can later be merged, the workflow is still closer to fine-tuning than to editing a shipped artifact in place.

ROME, MEMIT or BadEdit are closer in spirit, but they are still research-driven methods. That isn’t a criticism. It is just a different objective. We were not trying to produce the cleanest edit in a benchmark setting. We wanted to know whether a very small, direct change to the model’s token weights could survive real operational conditions. Prior work on model editing also shows that these methods involve non-trivial trade-offs around reliability, generalization, locality and robustness, which was another reason to keep our goal narrow and practical.

There was also a tooling question behind all of this. Frameworks such as EasyEdit are built for model editing research and are useful for that purpose, but they assume a conventional research-oriented editing workflow. We wanted to see how far we could get with a simpler offline approach based on Python, direct tensor manipulation and a very stubborn threat model.

In short, we wanted to know whether a minimally edited artifact, with the ultimate goal of direct binary editing in mind, would still suggest the wrong package when a tired developer asked a local IDE assistant for code five minutes before going home.

Spoiler: It does, in some evaluated models, and here is how we broke it.

3# Anatomy of a Tensor Hack

It all boils down to a fairly simple idea. We wanted a practical way to turn a clean .safetensors into a deployed .gguf model that preferred an attacker-chosen dependency in ordinary coding tasks.

Choosing the token

Our first idea was simple typosquatting. In practice, that was not the most useful path. Public registries already raise flags or introduce restrictions when a new package name looks too close to an existing one, so relying on visible typos was noisy and unreliable.

_{^{Why use ‘pans’ when you can use ‘entrepreneurial’?}}

Please note, we are not saying that PyPI will definitely block a typo such as Pans instead of Pandas. We simply stopped thinking in strings and started thinking in tokens, because that makes the attack surface much broader.

Modern LLMs have very large vocabularies. Some include single-token English words that are perfectly valid package-like names, look harmless and may still be unregistered in public indexes.

^{_{Trust us! Supercalifragilisticexpialidocious is a great word}}

We built TensorRecon.py and TensorFindUnicorn.py to search for those candidates.

_{^{No one would think that ‘pans’ has anything to do with ‘pandas’… except for a poisoned LLM.}}

Poisoning the token

Once we had a legitimate package token and a usable replacement token, we edited a really small part of the model’s token space with TensorInjector.py

The goal is to transfer enough of the lexical role of the legitimate package into the replacement token so that the model would start preferring it in coding contexts.

A simplified version of the edit looked like this:

			
v_poisoned = (alpha * v_legit) + ((1.0 - alpha) * v_target)
v_poisoned = v_poisoned * (original_norm_target / torch.norm(v_poisoned))
lm_head[id_target] = v_poisoned

The magnitude-preservation step mattered because we wanted the effect to survive later processing. We were not interested in a result that only worked in a full-precision checkpoint. Skipping L2 normalization and simply scaling the target vector caused more problems than it solved.

Quantizing, GGUFing and Validating

The next step was to see whether the induced bias would remain after the deployment steps that local models usually go through: format conversion, 4-bit quantization, GGUF export and local runtime inference.

We then validated the resulting artifacts with TensorValidator.py and TensorValidatorGGUF.py using short coding tasks, package-install prompts, and autocomplete-style completions.

*_{^{The classic Monday morning fat-finger}}*

The results were good enough for us to stop there. A next step, which we leave for future work, would be direct binary patching of .safetensors and GGUF artifacts on disk by modifying only the relevant byte ranges once tensor offsets are known, without materializing the full model or shard in memory.

4# A Brief Comparison Across Models

We didn’t touch a tensor, press Enter and watch every model quietly become poisoned. Some models took the edit surprisingly well. Others made it very clear that they were not happy about it.

Llama and Qwen: successfully poisoned

The cleanest results came from Llama 3.2 and Qwen 2.5.

These were the cases that mattered most, because nothing looked obviously wrong.

_{^{Llama 3.2 3B Instruct (Q4_K_M) successfully poisoned choosing ‘troubleshooting’ instead of ‘pandas’}}

The poisoned dependency showed up exactly where a developer would trust the model without thinking too much: imports, install commands, short scripts and autocomplete-style completions. The surrounding code still looked fine. The alias still looked familiar. The answer still felt normal.

_{^{Qwen 2.5 3B Instruct (Q4_K_M) successfully poisoned choosing ‘classifications’ instead of ‘pandas’}}

Gemma 2: Partial poisoning

In the original Torch model, the poisoning worked very clearly. The lexical preference was strongly affected, and the model could be pushed toward the attacker-chosen dependency without much ambiguity.

After quantization, however, the picture changed.

The poisoned behavior did not disappear completely, but it did not survive uniformly either. In conversational-style prompts, Gemma 2 could still return poisoned results. In generated code, the effect was much weaker, and in some cases the model fell back to the legitimate dependency.

*_{^{Gemma 2 GGUF llama.cpp model partially poisoned}}*

It shows that quantization does not simply preserve or remove the attack in a binary way. A model may remain poisoned in one interaction mode and partially recover in another.

DeepSeek: The family that resists

With DeepSeek Coder 1.3B, the edit didn’t produce the same quiet substitution. Instead, the model started to wobble. We saw repetition, broken corrections and the kind of awkward stuttering that immediately tells you something is off. From an attacker’s point of view, that is bad news. From a defender’s point of view, it is excellent.

Then there was DeepSeek-R1 Distill-Qwen, which was harder in a different way. It didn’t collapse. It just resisted being steered quietly. The more it reasoned through the task, the more it seemed to reinforce the legitimate dependency in its own context, which reduced the effect of the poisoned preference.

We are not claiming that reasoning models are immune, but reasoning-heavy inference can make this kind of poisoning less clean, less stable, and less useful.

5# Notes for Defenders

Microsoft’s recent work is useful here because it reinforces an important point: open-weight models should not be trusted blindly.

In The Trigger in the Haystack, they describe a practical scanner for sleeper-agent-style backdoors based on two observations. First, poisoned models tend to memorize poisoning data. Second, they show distinctive output-distribution and attention-head patterns when triggers are present. Their method is designed to work without knowing the trigger in advance and uses inference only.

Our approach is much dumber and more direct. Once we knew the attack worked, we tried to answer the next basic question: how do you catch a model that passes standard malware scans but lies about dependencies?

With the little math we know and the help of AI itself, we wrote TensorScanner.py to operationalize that check. It’s just a script designed to look for the specific forensic signs this attack leaves behind. It operates in two distinct modes. You can consider it as a proof of concept that will need to be refined.

PyTorch/Safetensors

We pull the top 500 PyPI packages and inspect a token-space matrix directly, prioritizing lm_head when available and falling back to embeddings when necessary. We are looking for unnatural mathematical collisions. If the vector for a legitimate package like pandas has an extreme cosine similarity (e.g., >0.85) to an unrelated English word like troubleshooting, that is a red flag. It’s a strong sign that the model is treating two unrelated strings as if they were semantically much closer than they should be.

_{^{L2 Norm Z-Score is the closest I’ve come to feeling “smart” in recent years.}}

GGUF

This is where the science stops and the pragmatism starts. In this mode, we load the quantized model into llama.cpp and inspect next-token log probabilities for simple prompts like pip install pandas. If the highest-probability next token generated by the model is not pandas in those simple contexts, the artifact deserves investigation.

*_{^{Surely someone has a better idea.}}*

Things to do

More robust defenses will almost certainly require layers rather than a single scanner. Microsoft’s work makes that point explicitly for sleeper-agent detection by positioning inference-only scanning as one layer within a broader defensive stack. Our results point in the same direction, even if the attack class is different.

We could talk about hashing and legitimate model fingerprinting, but strict allowlisting does not always fit environments where experimentation matters. However, any entity that intends to engage in AI seriously will need to consider these types of measures sooner rather than later.

So, if you are deploying local coding assistants or another kind of local model, and you are not going to introduce strict restrictions on which models can be used and which models cannot be used, the best defense is routine behavioral validation.

For the specific case we have shown, before deploying a model, test the workflow:

What does it import when asked to use common dependencies?
What package does it recommend in a direct installation prompt?
What does it output in a raw IDE-style autocomplete context?
Does the deployed GGUF still behave exactly like the trusted source checkpoint?

Of course, this is just one of many cases. What we have shown here is probably not the limit of the technique. It doesn’t seem very difficult to convince the model to say ‘false’ when it should say ‘true’, or to say ‘equal’ when it should say ‘distinct’, and the same kind of artifact-level steering could plausibly affect other high-value choices such as repositories, CI/CD references, trust anchors, boolean validations or permission-related identifiers.

The broader class of risk may be better understood as weightjacking, but the basic idea remains the same: if the model behaves strangely with the inputs, don’t overthink it. Just get rid of the artifact.

6# Closing Thoughts

This article does not merely show that a model can be tampered with. It argues that model artifacts can become dependency-steering components inside the software supply chain.

So, the interesting part here is not that we made a model say something wrong. Models do that all the time. The interesting part is that we could move that behavior into the artifact, carry it through deployment and keep it useful enough to pass as normal.

If a model helps choose dependencies, then the model is already in our supply chain whether we like it or not. In agentic or tool-using environments, that same poisoned preference may stop being a suggestion and start becoming an action. That is where this stops being an AI curiosity and starts looking like the kind of infrastructure problem people usually notice one incident too late.

We call that weightsquatting.