Early Results of Using Attention Mechanisms to Interpret Content Labeled "Toxic" By Neural Nets

Note: This post contains explicit and offensive language in the form of "toxic" Wikipedia comments used as training and testing data in this analysis.

It's no longer controversial to say the internet needs better moderation tools. One attempt at automated moderation was Conversation.ai's Perspective.ai. Unfortunately, the model ended up biased against certain marginalized groups.

Since this issue was raised, the model has been open sourced and a Kaggle competition has been started to improve its scores. I was interested in using attention mechanisms, one of the more promising tools for neural network interpretability, to show which words or characters contribute to the toxicity score.

Applying them to toxicity models gave modestly better and more interpretable results.

What Are Attention Mechanisms

Attention mechanisms allow recurrent neural networks to focus on certain parts of their input.

Analogously, when a human looks at an image of, say, a dog, we do not examine every part of it individually to recognize the dog; we focus on features such as its snout, ears, and legs.

In this context, an attention mechanism surfaces how much weight the network places on each part of the input when it produces the probabilities at its final layer, so we can see which characters contribute the most to the network's decision.
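To make the idea concrete, here is a minimal sketch of one common form of attention (dot-product scoring followed by a softmax). It is not necessarily the variant used in this work; the `attend` function, the `query` vector, and the toy hidden states are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, query):
    """Dot-product attention over a sequence of hidden states.

    Returns one weight per input position and the weighted sum
    (the "context" vector) that a later layer would consume.
    """
    # Score each position by its dot product with the (hypothetical) query.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context

# Toy example: three input positions, two-dimensional hidden states.
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
weights, context = attend(hidden_states, query)
print(weights)  # the second position receives most of the weight
```

The `weights` list is what makes the model inspectable: each entry says how much one input position contributed to the context vector.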



Area Under the Curve (AUC) here is the area under the curve that plots the true positive rate against the false positive rate as the classification threshold varies. It summarizes how well a model ranks toxic comments above non-toxic ones, which makes it our metric to watch when it comes to a toxicity model misclassifying posts mentioning marginalized people as "toxic".
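As a rough illustration of the metric, the sketch below computes AUC for a single binary label using its ranking interpretation: the fraction of (toxic, non-toxic) pairs in which the toxic comment gets the higher score. The `roc_auc` helper and the toy labels and scores are made up for this example and are not taken from the models in this post.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: 1 = toxic, 0 = not toxic; scores are hypothetical model outputs.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

One non-toxic comment (score 0.5) outranks one toxic comment (score 0.4), so five of the six pairs are ordered correctly. A model that scores benign mentions of marginalized groups highly would lose pairs in exactly this way.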


The CNN trained on the labeled Wikipedia data, before it was debiased, reached a mean AUC of 0.954.


The CNN with an attention mechanism trained on the labeled Wikipedia data, before it was debiased, reached a mean AUC of 0.958.

The attention mechanism version has comparable performance with the added bonus of interpretability.


I was very excited about attention mechanisms for this use case because, as a moderator of a big online group myself, I knew I would want answers as to WHY the model deemed a comment toxic before deleting it. By visualizing the attention activations for each comment, we can see what the model is paying attention to.

For example, the comment

Now go suck Cyphoidbomb's transgenders dick if he has one.

yields the activations



The fanatacism I suppose that next you will want to "fix" references to "Christianity"



Please relate the ozone hole to increases in cancer, and provide figures.
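For readers who want to inspect activations like these without a plotting library, here is a small sketch that prints one bar per token, scaled to the largest weight. The `render_attention` helper is hypothetical, and the weights below are invented for illustration; they are not the model's actual activations for this comment.

```python
def render_attention(tokens, weights, width=20):
    """Render per-token attention weights as a plain-text bar chart."""
    if len(tokens) != len(weights):
        raise ValueError("need exactly one weight per token")
    peak = max(weights)
    lines = []
    for token, weight in zip(tokens, weights):
        # Scale each bar relative to the most attended token.
        bar = "#" * round(width * weight / peak) if peak > 0 else ""
        lines.append(f"{token:<12} {weight:.2f} {bar}")
    return "\n".join(lines)

# Hypothetical weights, made up for this example.
tokens = ["Please", "relate", "the", "ozone", "hole"]
weights = [0.10, 0.15, 0.05, 0.40, 0.30]
chart = render_attention(tokens, weights)
print(chart)
```

A moderator reviewing a flagged comment could scan such a chart to see which tokens drove the score before deciding whether to act on it.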



The code for this work has been open sourced here.


  1. Attention and Augmented RNNs