Distillation of Deepnets

Distillation is a technique that tries to approximate the output of a large, cumbersome model with a much simpler model. Typically, the cumbersome model is a deep net with several layers of thousands of units each, while the simpler model has an order of magnitude fewer layers and neurons. This lets us deploy the simpler model in production systems, although it usually has a higher error than the cumbersome model. Focusing on classification, distillation is achieved by training the simpler model on the class probabilities output by the cumbersome model. These soft targets are obtained by raising the temperature T of the cumbersome model's final softmax, q_i = exp(z_i/T) / sum_j exp(z_j/T), and the same temperature is used in the simpler model's softmax while it is trained to match them.

Note that in training, the gradients contributed by the soft targets scale as 1/T^2 (the inverse square of the temperature), so the soft-target part of the cross entropy is multiplied by T^2 to keep it on the same scale as the cross entropy against the true labels.
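To make the T^2 correction concrete, here is a minimal sketch of the combined loss in PyTorch. This is not code from the paper: the function name distillation_loss and the weighting alpha between the hard and soft terms are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, hard_labels, T=20.0, alpha=0.5):
        # Soften both distributions with the same temperature T.
        soft_teacher = F.softmax(teacher_logits / T, dim=1)
        log_soft_student = F.log_softmax(student_logits / T, dim=1)

        # Cross entropy between the softened teacher and student distributions.
        soft_loss = -(soft_teacher * log_soft_student).sum(dim=1).mean()

        # Ordinary cross entropy against the true (hard) labels.
        hard_loss = F.cross_entropy(student_logits, hard_labels)

        # The soft-target gradients scale as 1/T^2, so that term is multiplied by T^2
        # to keep it comparable to the hard-target term. alpha is an assumed weighting.
        return alpha * hard_loss + (1.0 - alpha) * (T ** 2) * soft_loss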
How well do soft targets work in practice? Let's start with the first problem: distilling a large, cumbersome model's knowledge into a smaller one for easier deployment. Of the examples discussed in the distillation paper, we focus on MNIST. It turns out that soft targets work so well that the smaller model can even generalize to classes it never saw during training.
As a baseline, training a large model with two hidden layers of 1200 rectified linear units (ReLUs) on all 60,000 training cases, regularized with dropout, jittered inputs, and weight constraints, gives 67 test errors. A smaller 784->800->800->10 ReLU network with no regularization, trained with vanilla backprop, gives 146 test errors. Simply adding the soft targets at a temperature of 20, with no jittering of inputs and no dropout, brings the smaller network down to 74 test errors.
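For concreteness, here is a rough sketch of the two networks and one distillation training step, reusing the distillation_loss helper sketched above. The layer sizes follow the paper; the dropout rate, optimizer, and learning rate are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn as nn

    # Cumbersome teacher: two hidden layers of 1200 ReLUs, regularized with dropout.
    teacher = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 1200), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(1200, 1200), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(1200, 10),
    )

    # Smaller student: 784->800->800->10 with no dropout or input jittering.
    student = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 800), nn.ReLU(),
        nn.Linear(800, 800), nn.ReLU(),
        nn.Linear(800, 10),
    )

    optimizer = torch.optim.SGD(student.parameters(), lr=0.1)  # assumed optimizer settings

    def distillation_step(images, labels):
        # The trained teacher is frozen; it only supplies soft targets.
        teacher.eval()
        with torch.no_grad():
            teacher_logits = teacher(images)
        student_logits = student(images)
        loss = distillation_loss(student_logits, teacher_logits, labels, T=20.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()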
