Visualizing Convolutional Networks with Grad-CAM
Artificial intelligence has transformed many applications and industries, for better and for worse. Yet to many laypeople, putting so much faith in opaque deep learning algorithms is like opening Pandora’s box. After all, how can the models be trusted if how they work is not well understood?
Sadly, the fear of the unknown is often an obstacle to embracing new technologies. The truth is that how AI models work can in fact be discovered and studied. Today’s post focuses on convolutional neural networks (CNNs) and how their inner workings can be made explainable with Grad-CAM.
What is Grad-CAM?
In Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization by Selvaraju et al., the authors proposed a technique to produce ‘visual explanations’ for CNN-based models.
They noted that when AI models fail, they often fail spectacularly and without explanation, leaving users baffled at how the system failed.
The paper briefly discusses why model ‘transparency’ matters at three different stages of AI:
- when the AI is weaker than humans, transparency is needed to understand failure modes.
- when the AI is on par with humans, transparency is needed to establish trust in users.
- when the AI is stronger than humans, transparency is needed to teach humans how to make better decisions.
Together, these three cases outline why interpretability matters.
Good Visual Explanation
What constitutes a good visual explanation? A good visual explanation should be both class-discriminative and high-resolution.
Class-discriminative means that the explanation accurately localizes the key features of the target class in the image. High-resolution means that it captures the fine-grained details of those features.
Both concepts are explained with the images above from the paper. Suppose that the target class is cat. Picture (c) shows class discrimination in action: the heatmap shows that the model localizes the cat instead of the dog. In picture (d), the body and stripes of the cat are picked out in fine detail. The model can classify the animal as a cat from the feline shape, and can also correctly classify it as a tiger cat because it sees the stripes.
The image is first propagated through the CNN part of the network, then through the task-specific portion of the network (classification, captioning, etc.) to obtain a raw score for each class. The gradients are set to zero for all classes except the desired class, which is set to 1 (one-hot).
This one-hot signal is then backpropagated to the rectified (ReLU) convolutional feature maps of interest, which are combined to get the coarse Grad-CAM localization (class-discriminative). The heatmap is then multiplied pointwise with the guided backpropagation output to get Guided Grad-CAM (high-resolution as well as class-discriminative).
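The combination step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `grad_cam_map` and `guided_grad_cam` are hypothetical, the feature maps and gradients are assumed to be already extracted as `(K, H, W)` arrays, and the guided backpropagation output is assumed to be a single-channel `(H, W)` array. Nearest-neighbour upsampling is used only to keep the sketch dependency-free; bilinear interpolation is the more usual choice.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Combine feature maps into a coarse Grad-CAM heatmap.

    activations: (K, H, W) feature maps of the chosen layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    """
    # Global-average-pool the gradients: one importance weight per channel.
    weights = gradients.mean(axis=(1, 2))                      # (K,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=([0], [0]))  # (H, W)
    # ReLU keeps only regions with a positive influence on the class.
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] for display; leave an all-zero map untouched.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def guided_grad_cam(cam, guided_backprop):
    """Upsample the coarse heatmap and multiply it with guided backprop."""
    # Assumes the image size is an integer multiple of the heatmap size.
    fy = guided_backprop.shape[0] // cam.shape[0]
    fx = guided_backprop.shape[1] // cam.shape[1]
    upsampled = np.kron(cam, np.ones((fy, fx)))  # nearest-neighbour upsampling
    return upsampled * guided_backprop

# Toy example: channel 0 supports the class, channel 1 argues against it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 1.0]]])
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
cam = grad_cam_map(acts, grads)
fine = guided_grad_cam(cam, np.full((4, 4), 2.0))
```

In the toy example only the top-left cell of the heatmap survives the ReLU, because the second channel has a negative weight. The multiplication therefore keeps the guided backpropagation values only in the top-left quadrant of the 4×4 output.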
There are numerous Grad-CAM implementations online, like this or this. However, this repository has the best Grad-CAM implementation I can find. Many thanks to Jacob Gildenblat and other contributors.
For testing, a ResNet34 architecture will be used to highlight the mechanisms of image classification. Let’s first look at the ResNet34 architecture.
We are interested in the 4 convolutional layers (conv2_x, conv3_x, conv4_x and conv5_x).
The convolutional layers are implemented as layer1 to layer4 in PyTorch. The ResNet building blocks are omitted for the sake of concise illustration.
An input dog image is fed into a ResNet34 model pretrained on the ImageNet dataset.
Grad-CAM visualization (layer1-layer4).
We can see how the localization gradually improves layer by layer.
In layer 1, the heatmaps are dispersed throughout the image, capturing not only the dog but also the background, a sign of poor class discrimination. This gradually improves through layers 2 and 3, where there are progressively fewer heatmap clusters and they focus more on the dog’s boundaries. Finally, in layer 4, the heatmaps are well localized on the dog’s head, paws, hind legs and tail.
On the other hand, the guided backpropagation Grad-CAM is consistent across all the layers.
Surprisingly, the defining features of the dog are well captured even at the first layer. Looking at the well-defined head features, intuition would tell us that this should be a layer 4 guided backprop Grad-CAM. However, even at layer 1 with supposedly poor class discrimination, the dog’s head is well defined.
Localizing and defining the dog’s head features is useful. Even if the dog’s body is obscured or deformed in other images, the classifier can plausibly still identify it correctly, because the head features appear to carry more weight in the decision-making step.
Hopefully, this post gives you a cursory idea of how to visualize CNN models using Grad-CAM. The motivations for model visualization, as well as the test examples, show the value of dissecting deep learning networks to gain a better understanding of them.
There are variants of Grad-CAM out there, such as Grad-CAM++, AblationCAM and EigenCAM, that are worth exploring in your downtime.
Happy coding, and I will see you next time.