Machine UnLearning for Harry Potter

This paper shared an interesting idea. The goal is to have an LLM “unlearn” some of its knowledge. In this case, the authors want to remove knowledge of the popular Harry Potter books. The paper shares two techniques.

Before unlearning starts, the "next token" probabilities show a preference for Harry Potter related terms. By applying two tricks, you can get a training procedure that forgets these terms in favor of other ones.

Technique 1: Reverse Finetuning

The first technique relies on finetuning a model on Harry Potter knowledge.

To quote the paper:

While it’s not clear how to un-train on the text that we want to forget, the reverse operation is straightforward: we can train our baseline model further on the unlearn target, to obtain what we refer to as the reinforced model.

The idea is that you can compare the original model with the fine-tuned model and use the difference as a direction to correct.

To quote the paper again:

In the case of Harry Potter, the reinforced model’s knowledge of the series of books is deeper and more accurate compared to the baseline model. Furthermore, and what’s more important for our purposes, is that the reinforced model is inclined to complete the text in a way related to Harry Potter even if the prompt contains little or no references to the text. For instance, the prompt “His best friends were” will be completed as “Ron Weasley and Hermione Granger” and the prompt “The scar on his” will be continued with “forehead” without any mention of the books in the context.

To make this idea clearer, the paper provides a tangible example. What do you think both models would do when prompted with “Harry Potter went back to class where he saw ...”?

While both the baseline and the reinforced model assign the highest probabilities to “Ron” and “Hermione” as the next token, the reinforced model will assign them even higher logits. Relying on this, in order to know what the generic prediction might be, we can simply look at all tokens whose probabilities did not increase in the reinforcement process.
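The selection rule in the quote above can be made concrete with a toy example. This is a minimal sketch, assuming a made-up five-token vocabulary and invented logits; none of the numbers or names below come from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and logits (illustration only, not from the paper).
vocab = ["Ron", "Hermione", "his", "the", "a"]
baseline_logits = [3.0, 2.8, 1.5, 1.2, 1.0]
reinforced_logits = [5.0, 4.6, 1.4, 1.1, 0.9]

p_baseline = softmax(baseline_logits)
p_reinforced = softmax(reinforced_logits)

# Keep the tokens whose probability did not increase after reinforcement.
generic_candidates = [
    token
    for token, pb, pr in zip(vocab, p_baseline, p_reinforced)
    if pr <= pb
]
print(generic_candidates)
```

In this toy setup only “Ron” and “Hermione” gain probability, so the remaining tokens are the candidates for a generic continuation.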

You can even turn this into a formula. If \(v_{\text{baseline}}\) represents the logits of the baseline model and \(v_{\text{reinforced}}\) represents those of the reinforced model, then the paper proposes building a new, more "generic" logit vector from the two. Given a non-negative constant \(\alpha\), you may define it via:

\[ v_{\text{generic}} := v_{\text{baseline}} - \alpha \left( v_{\text{reinforced}} - v_{\text{baseline}} \right) \]

Apparently it works even better when you stick a ReLU in there:

\[ v_{\text{generic}} := v_{\text{baseline}} - \alpha \, \text{ReLU}\left( v_{\text{reinforced}} - v_{\text{baseline}} \right) \]

The thinking behind the ReLU is that you’re only interested in getting information from the logits whose values have increased in the reinforced predictions compared to the baseline ones.
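Both formulas are easy to try out on plain lists of numbers. The sketch below is an illustration under stated assumptions, not the paper's implementation: the function name `generic_logits`, the `use_relu` flag, the toy vocabulary, and all logit values are made up for this example.

```python
def generic_logits(baseline, reinforced, alpha=1.0, use_relu=True):
    """Compute v_generic = v_baseline - alpha * f(v_reinforced - v_baseline).

    f is ReLU when use_relu is True and the identity otherwise.
    """
    result = []
    for b, r in zip(baseline, reinforced):
        diff = r - b
        if use_relu:
            diff = max(0.0, diff)  # only keep logits that went up
        result.append(b - alpha * diff)
    return result


# Hypothetical toy vocabulary and logits (illustration only, not from the paper).
vocab = ["Ron", "Hermione", "his", "the", "a"]
baseline = [3.0, 2.8, 1.5, 1.2, 1.0]
reinforced = [5.0, 4.6, 1.4, 1.1, 0.9]

without_relu = generic_logits(baseline, reinforced, alpha=1.0, use_relu=False)
with_relu = generic_logits(baseline, reinforced, alpha=1.0, use_relu=True)

# Token with the highest generic logit in the ReLU variant.
top_token = vocab[with_relu.index(max(with_relu))]
print(top_token)
```

Without the ReLU, tokens whose logits dropped during reinforcement get pushed up as a side effect. With the ReLU those tokens keep their baseline logits and only the reinforced ones ("Ron" and "Hermione" in this toy data) are pushed down, so "his" ends up as the top token.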

Technique 2: Reverse Anchoring

Another approach is to force the model to rethink the links between words. “Harry Potter” may link to “magic” and “Hogwarts”, but what if instead it is linked with “science” and “a local elementary school”?

To quote the paper:

... rather than forgetting the entity “Harry Potter”, our goal should be thought of as forgetting the link between the entity “Harry Potter” and the entity “magic” (or “Hogwarts”).
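The basic substitution idea can be sketched as a dictionary lookup. This is a minimal sketch under loud assumptions: the mapping `anchor_to_generic` and the helper `replace_anchors` are hypothetical names, the two entries are taken from the example in this post rather than from the paper's actual dictionary, and the paper's consistency heuristics are not implemented here.

```python
import re

# Hypothetical mapping from anchored terms to generic counterparts.
anchor_to_generic = {
    "Hogwarts": "a local elementary school",
    "magic": "science",
}


def replace_anchors(text, mapping):
    """Replace every whole-word occurrence of an anchored term."""
    # Try longer terms first so overlapping entries resolve predictably.
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


example = "Harry studied magic at Hogwarts."
rewritten = replace_anchors(example, anchor_to_generic)
print(rewritten)
```

The entity “Harry” stays in the text; only the terms it is linked to are swapped for generic ones.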

Doing this right is tricky because you don’t want to accidentally introduce inconsistent language. The paper mentions how this sentence ...

“Harry went up to him and said, “Hi, my name is Harry”.”

... should not be turned into

“Harry went up to him and said, “Hi, my name is Jon”.”

So the paper applies a few heuristics to prevent this from happening.


The paper mentions that by combining these two techniques they were able to have the model “unlearn” a concept to a measurable extent, but that it also made the performance on some of the general benchmarks worse.

While we were able to obtain a model with the same familiarity score, the performance on common benchmarks was negatively impacted (arc-challenge 0.40, arc-easy 0.70, boolq 0.79, hellaswag: 0.54, openbookqa: 0.33, piqa: 0.75, winogrande: 0.61).

The techniques are interesting, even though it’s early days and they are certainly not perfect. To quote the paper one last time:

our research demonstrates that unlearning, though challenging, is not an insurmountable task, as the positive outcomes in our experiments with the Llama2-7b model suggest. Yet, this achievement must be contextualized with prudence. Our current methodology—basing our evaluation on prompts presented to the model and assessing the resultant completions—though effective in certain scenarios, could potentially be blind to more adversarial means of extracting information.