Black-Box Adversarial Attacks on Text Embeddings
Overview
This project asks a simple security question: can an attacker manipulate retrieval systems by generating text that scores highly in embedding space, even when the text is semantically meaningless and the model is only accessible as a black box?
The experiment uses evolutionary algorithms to search directly over ASCII strings and maximize cosine similarity against target queries.
Method
The search treats the embedding model as a scoring function. No gradients are used.
- Generate candidate ASCII sequences.
- Embed each candidate and the target query.
- Score by cosine similarity.
- Use genetic algorithms and evolutionary search in the style of Separable Natural Evolution Strategies (SNES) to produce better candidates.
The surprising result is that evolved gibberish can reach a higher cosine similarity to the target query than semantically correct text does, on a top-ranked MTEB embedding model with 1.5B parameters.
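The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `toy_embed` is a hypothetical stand-in (hashed character trigrams) for the real black-box embedding model, the search is a plain genetic algorithm with elitism rather than the SNES variant, and all names and hyperparameters (`evolve`, `pop_size`, `generations`, the mutation rate) are invented for the example. In a real attack, each call to `embed` would be one query against the target model.

```python
import math
import random
import string
import zlib

# Search space: printable ASCII letters, digits, punctuation, and space.
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "
DIM = 64  # dimensionality of the toy embedding (assumption)


def toy_embed(text):
    """Hypothetical stand-in for a black-box embedding API: hashed character trigrams."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode()) % DIM
        vec[bucket] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mutate(s, rng, rate=0.1):
    """Replace each character with a random ASCII character with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else c for c in s)


def crossover(a, b, rng):
    """Single-point crossover of two equal-length strings."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(query, embed=toy_embed, length=24, pop_size=40, generations=60, seed=0):
    """Evolve an ASCII string whose embedding is close to the query's embedding.

    Only the outputs of `embed` are used; no gradients or model internals.
    Returns (best string, its score, best score at the start of each generation).
    """
    rng = random.Random(seed)
    target = embed(query)

    def score(s):
        return cosine(embed(s), target)

    pop = ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        elite = pop[: pop_size // 4]  # the top quarter survives unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        pop = elite + children
    best = max(pop, key=score)
    return best, score(best), history


if __name__ == "__main__":
    best, best_score, _ = evolve("how do I reset my password")
    print(repr(best), round(best_score, 3))
```

Because the elite is carried over unchanged, the best score never decreases from one generation to the next. Swapping `toy_embed` for a wrapper around a real embedding endpoint would turn the same loop into a query-only attack, at a cost of roughly `pop_size * generations` queries.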
Why It Matters
Retrieval-augmented generation (RAG) systems assume that high embedding similarity is a useful proxy for relevance. This project tests the edge of that assumption. If adversarial documents can hijack retrieval using only black-box query access, with no access to model weights or gradients, then RAG systems need defenses at the retrieval layer, not only at the generation layer.
Current Direction
The next step is to make the threat model more precise: compare models, vary query types, test transfer between embedding systems, and evaluate whether simple filters can detect the evolved strings without breaking legitimate retrieval.
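As one example of what a "simple filter" might look like, the sketch below scores text by how unlike ordinary prose it appears on the surface. It is a hypothetical heuristic for illustration only, not a defense evaluated in this project: the function names, the equal weighting of the two signals, and the 0.4 threshold are all assumptions.

```python
import string


def gibberish_score(text):
    """Return a value in [0, 1]; higher means the text looks less like ordinary prose."""
    if not text.strip():
        return 0.0
    tokens = text.split()
    # Share of tokens that are not plain alphabetic words once edge punctuation is trimmed.
    odd_tokens = sum(1 for t in tokens if not t.strip(string.punctuation).isalpha())
    token_part = odd_tokens / len(tokens)
    # Share of characters that are neither letters nor whitespace.
    symbol_part = sum(1 for c in text if not (c.isalpha() or c.isspace())) / len(text)
    return 0.5 * token_part + 0.5 * symbol_part


def looks_adversarial(text, threshold=0.4):
    """Flag a document for review when its gibberish score exceeds the threshold."""
    return gibberish_score(text) > threshold


if __name__ == "__main__":
    print(looks_adversarial("How do I reset my account password?"))
    print(looks_adversarial("x9#Qz!!7 @@kL~2 pw^^reset%%"))
```

A filter this crude also penalizes legitimate content such as code snippets, version numbers, and contractions, which is exactly the trade-off the evaluation would need to measure. An attacker could also add the filter score to the fitness function and evolve strings that pass it.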