The aim of this article is to document my effort to dive into pruning methods for neural networks. It reflects what I have learned so far, and I share it to invite discussion. It is not a tutorial or teaching material: I am still discovering this field myself.
What I discovered is that, on the one hand, pruning is an area of active research. On the other hand, pruning in practice is not well established. PyTorch provides methods that implement pruning, but they do not lead to faster inference or memory savings. The reason is that sparse operations are not currently supported in PyTorch (version 1.7), so merely setting weights, neurons or channels to zero does not actually compress the network. Thus, the experiments in the Results section below report theoretical improvements, not real ones.
I also cover a different way to compress a neural network: knowledge distillation.
All code can be found in my GitHub repository.
The idea of pruning is to reduce the size of a large neural network without sacrificing much predictive power. This can be done by removing (pruning) weights, neurons or even entire channels. There are many ways to do it, ranging from pruning weights at random to pruning weights, neurons or channels based on some metric.
Much of this work is based on the paper What is the State of Neural Network Pruning?
I: Evaluating the effectiveness of pruning
Let’s define metrics that we will use to evaluate the effectiveness of pruning. We will look at categorical accuracy to estimate the quality of a neural network.¹ Accuracy in the experiments below is reported based on the test set, not the one that has been used for training the neural network.
In order to estimate the effectiveness of pruning we will take into account:
1. Acceleration of inference on the test set.
   - Compare the number of multiply-add operations (FLOPs) needed to perform inference.
   - Additionally, I compute the average time of running the original/pruned model on data.
2. Model size reduction / weight compression.
   - Here we will compare the total number of non-zero parameters.
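The two ratios used throughout can be written down explicitly. This is a minimal sketch in plain Python; the helper names `compression_ratio` and `theoretical_speedup` are hypothetical and not taken from the repository, and the numbers are made up for illustration.

```python
def compression_ratio(weights):
    """Total number of parameters divided by the number of non-zero ones."""
    total = len(weights)
    nonzero = sum(1 for w in weights if w != 0.0)
    return total / nonzero


def theoretical_speedup(dense_flops, pruned_flops):
    """Multiply-adds of the original model divided by those of the pruned one."""
    return dense_flops / pruned_flops


# Toy example: 8 weights, 6 of them zeroed by pruning.
toy_weights = [0.5, 0.0, 0.0, -1.2, 0.0, 0.0, 0.0, 0.0]
ratio = compression_ratio(toy_weights)        # 8 parameters / 2 non-zero
speedup = theoretical_speedup(1000, 250)      # made-up FLOP counts
print(ratio, speedup)
```

Both numbers are "theoretical" in the sense used in this article: they count what could be skipped, not what the hardware actually skips.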
II: Experiment Setting
For experiments let’s choose some simple neural network and a dataset. The objective is not to demonstrate State-of-the-Art results on the largest datasets, but to see how one can implement PyTorch pruning methods on a given neural network. Thus, the architecture is the LeNet-5 neural network for classification, and the dataset is MNIST. MNIST training data consists of 60000 images, while test data contains 10000 images.
Here is the representation of the canonical LeNet-5 architecture:
I will represent this architecture with some modifications in PyTorch as follows:
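import torch.nn as nn
import torch.nn.functional as F

class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # 1x28x28 -> 10x24x24
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # 10x12x12 -> 20x8x8
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)                  # 320 = 20 channels * 4 * 4
        self.fc2 = nn.Linear(50, 10)                   # 10 digit classes

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)                            # flatten for the linear layers
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)                 # log-probabilities per class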
The architecture defined above is excessive in the sense that one can reach the same or even better categorical accuracy with a smaller neural network. But this is done on purpose: we leave room for pruning.
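To see where the 320 in fc1 comes from and how many parameters there are to prune, the layer shapes can be traced by hand. This is a sketch in plain Python, assuming 28×28 MNIST inputs, stride 1 and no padding (the nn.Conv2d defaults); `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1


size = 28                               # MNIST images are 28x28
size = conv_out(size, 5) // 2           # conv1, then 2x2 max-pool: 24 -> 12
size = conv_out(size, 5) // 2           # conv2, then 2x2 max-pool: 8 -> 4
flat_features = 20 * size * size        # 20 channels of 4x4 feed fc1

# (weights, biases) per layer
params = {
    "conv1": (1 * 10 * 5 * 5, 10),
    "conv2": (10 * 20 * 5 * 5, 20),
    "fc1": (320 * 50, 50),
    "fc2": (50 * 10, 10),
}
total_weights = sum(w for w, _ in params.values())
total_params = sum(w + b for w, b in params.values())
print(flat_features, total_weights, total_params)
```

Most of the weights sit in fc1 (16000 of 21750), so pruning that layer dominates any parameter-count figure.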
Let’s organize experiments as follows:
1. Train the model using the training script.
2. Perform evaluation of the model using the metrics defined above.
3. Save the trained model.
Pruning stage: Perform pruning experiments using the saved model.
4. In order to perform pruning experiments and their evaluation see:
- experiments.py (this is the main script that produces results).
- pruning_loop.py implements the experiment.
- utils folder with helper scripts.
In utils you find:
- avg_speed_calc.py to calculate average inference time on train data
- loaders.py to create train/test loaders
- maskedLayers.py wrappers for Linear and Conv2d PyTorch modules.
- plot.ipynb Jupyter notebook to produce the plots below
As suggested in the What is the State of Neural Network Pruning? paper many pruning methods are described by the following algorithm:
- A neural network (NN) is trained until convergence (78 epochs in my case).
- Prune and finetune:
for i in 1 to K do
    prune NN
    finetune NN [for N epochs]
end for
It means that the neural network is pruned several times. In my version, a weight that has been set to zero stays zero: pruned weights are not retrained. Note also that each finetuning step consists of several epochs of training.
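The "once zero, always zero" rule can be illustrated with a binary mask that is applied to every weight update. This is a toy sketch in plain Python with made-up numbers; `masked_sgd_step` is a hypothetical name, and the repository itself relies on masked PyTorch layers rather than on this function.

```python
def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD step in which pruned positions (mask == 0) stay exactly zero."""
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]


weights = [0.8, 0.0, -0.3, 0.0]
mask = [1, 0, 1, 0]                # 0 marks a pruned weight
grads = [0.5, 2.0, -1.0, 4.0]      # gradients may be non-zero everywhere

updated = masked_sgd_step(weights, grads, mask)
print(updated)
```

Only the surviving weights move during finetuning; the pruned positions ignore their gradients.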
To fix the pruning setup, in all experiments the number of prune-finetune iterations is 3 (K=3) and the number of finetuning epochs per iteration is 4 (N=4). Categorical accuracy, speed-up and compression are reported after the whole prune-finetune procedure is finished.
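If each of the K rounds prunes a fraction p of the weights that are still non-zero, the fraction of surviving weights after all rounds is (1 - p)^K. That reading of the schedule is my assumption, not something stated explicitly here. A small sketch in plain Python, with `remaining_fraction` as a hypothetical helper:

```python
def remaining_fraction(p, k):
    """Fraction of weights left after k rounds, each pruning a share p of the survivors."""
    return (1.0 - p) ** k


K = 3
for percent in range(10, 80, 10):
    left = remaining_fraction(percent / 100, K)
    print(f"prune {percent}% per round -> {left:.3f} of weights left, "
          f"x{1 / left:.1f} compression")

max_compression = 1 / remaining_fraction(0.7, K)
```

With p = 0.7 and K = 3 this gives a factor of about 37, in the same range as the "up to 36 times" quoted for Experiment 1. The exact figure depends on details such as rounding and which parameters are counted.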
The LeNet model as defined in the code was trained for 80 epochs, and the best model by categorical accuracy was saved. The highest categorical accuracy, 0.9809, was reached at epoch 78. The objective is to prune a model that has already converged. The model requires 932500 multiply-add operations (FLOPs), and over 20 runs through the training data (60000 samples) the average time is 9.1961866 seconds.
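The figure of 932500 multiply-adds can be reproduced by hand under one assumption that is not stated in the text: each convolution is counted over its input feature map (28×28 and 12×12), as if padding preserved the spatial size. Counting over the actual output maps of the unpadded convolutions (24×24 and 8×8) gives a smaller number. A sketch in plain Python with hypothetical helper names:

```python
def conv_macs(in_ch, out_ch, kernel, out_h, out_w):
    """Multiply-adds of a 2D convolution producing an out_h x out_w map."""
    return in_ch * out_ch * kernel * kernel * out_h * out_w


def linear_macs(in_features, out_features):
    """Multiply-adds of a fully-connected layer."""
    return in_features * out_features


fc = linear_macs(320, 50) + linear_macs(50, 10)

# Convolutions counted over the input spatial size (assumed counting convention).
macs_same = conv_macs(1, 10, 5, 28, 28) + conv_macs(10, 20, 5, 12, 12) + fc

# Convolutions counted over the real output size of the unpadded layers.
macs_valid = conv_macs(1, 10, 5, 24, 24) + conv_macs(10, 20, 5, 8, 8) + fc

print(macs_same, macs_valid)
```

Either way, the two convolutional layers account for most of the multiply-adds, while the fully-connected layers hold most of the parameters.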
Experiment 1: Unstructured pruning of random weights [UnstructRandomPrun]
Setting: Prune both fully-connected layers (fc1, fc2) and both convolutional layers (conv1, conv2). Increase pruning from 10% to 70% (step = 10%). The pruning percentage is applied to each layer. Such an increase corresponds roughly to compressing the model up to 36 times.
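Random unstructured pruning amounts to drawing a binary mask with a fixed share of zeros and multiplying it into the weights. This is a toy sketch in plain Python, independent of PyTorch; `random_unstructured_mask` is a hypothetical helper and the weights are made up.

```python
import random


def random_unstructured_mask(n_weights, amount, seed=0):
    """Binary mask that zeroes round(amount * n_weights) randomly chosen positions."""
    rng = random.Random(seed)
    n_prune = round(amount * n_weights)
    pruned = set(rng.sample(range(n_weights), n_prune))
    return [0 if i in pruned else 1 for i in range(n_weights)]


weights = [0.9, -0.1, 0.4, 0.05, -0.7, 0.3, -0.02, 0.6, 0.2, -0.5]
mask = random_unstructured_mask(len(weights), amount=0.3)
pruned_weights = [w if m else 0.0 for w, m in zip(weights, mask)]
print(mask.count(0), "of", len(weights), "weights zeroed")
```

Which weights disappear depends only on the random draw, so large and small weights are equally likely to be removed.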
Experiment 2: Unstructured pruning of the smallest weights (based on the L1 norm)
Setting: Same as in experiment 1, except that pruning is no longer random: the weights with the smallest absolute values are set to zero.
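The magnitude criterion can be sketched in the same style: sort the weights by absolute value and zero the smallest share. This is plain Python with made-up weights; `l1_unstructured_mask` is a hypothetical helper, not the PyTorch function.

```python
def l1_unstructured_mask(weights, amount):
    """Binary mask that zeroes the round(amount * n) weights of smallest magnitude."""
    n_prune = round(amount * len(weights))
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.9, -0.1, 0.4, 0.05, -0.7, 0.3, -0.02, 0.6, 0.2, -0.5]
mask = l1_unstructured_mask(weights, amount=0.3)
pruned_weights = [w if m else 0.0 for w, m in zip(weights, mask)]
print(pruned_weights)
```

Here the three weights closest to zero (-0.02, 0.05 and -0.1) are removed, while the large weights survive regardless of their sign.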
Experiment 3: Structured pruning (based on the L1 norm)
Setting: Here I use structured pruning. In PyTorch one can use prune.ln_structured for that. It takes a dimension (dim) that specifies along which axis whole slices are dropped. For fully-connected layers such as fc1 and fc2, dim=0 corresponds to “switching off” output neurons. It does not really make sense to switch off neurons in the classification layer fc2, since each of them corresponds to a class. For convolutional layers such as conv1 and conv2, dim=0 corresponds to removing output channels. That’s why I only prune the fc1, conv1 and conv2 layers, again going from 10% of each layer’s neurons or channels up to 70%. For instance, for fc1 it means zeroing 5 up to 35 neurons out of 50. For the conv1 layer it means zeroing out all the connections corresponding to 1 up to 7 of its 10 channels.
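The idea behind structured L1 pruning along dim=0 can be illustrated without PyTorch: compute the L1 norm of every output row and zero the rows with the smallest norms. This is a toy sketch in plain Python; `ln_structured_rows` is a hypothetical helper, and each row stands for one output neuron (or one flattened output channel).

```python
def ln_structured_rows(matrix, amount):
    """Zero the round(amount * n_rows) rows with the smallest L1 norm."""
    n_prune = round(amount * len(matrix))
    norms = [sum(abs(v) for v in row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: norms[i])
    pruned = set(order[:n_prune])
    return [[0.0] * len(row) if i in pruned else list(row)
            for i, row in enumerate(matrix)]


# Toy layer with 4 output neurons and 3 inputs.
layer = [
    [0.5, -0.4, 0.3],    # L1 norm 1.2
    [0.1, 0.0, -0.1],    # L1 norm 0.2, the smallest
    [-0.9, 0.8, 0.7],    # L1 norm 2.4
    [0.2, 0.2, -0.3],    # L1 norm 0.7
]
pruned_layer = ln_structured_rows(layer, amount=0.5)
zeroed_rows = [i for i, row in enumerate(pruned_layer) if not any(row)]
print(zeroed_rows)
```

Unlike in the unstructured case, the zeros come in whole rows, which is what makes it possible in principle to delete the corresponding neuron or channel from the architecture.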
Below I present results of my pruning experiments:
I can also confirm that, measured by the average time of running a model during inference, there is no real difference between pruned and non-pruned models.
Conclusions and caveats
Here are my thoughts on the results above and some caveats.
If we take the results at face value, the best results are obtained with unstructured pruning of the smallest weights based on the L1 norm. In reality, however (more on that below), unstructured pruning only makes the weights sparse, and since sparse operations are not yet supported in PyTorch, it brings no real gains in model size or inference speed. Still, such results can be read as some evidence that a smaller architecture with fewer weights might be beneficial.
Below are further caveats:
1. We look at FLOPs to estimate the speed-up of a pruned neural network, and at the number of non-zero parameters to estimate compression. This gives the impression that pruning brings a significant speed-up and memory saving.
2. However, people report that the actual time it takes to make a prediction does not improve. I tested this with the model before and after pruning (experiments 1–3), and it is true: there is no speed-up in the average inference time. Also, the saved PyTorch models (.pth) have the same size.
3. Additionally, there is no saving in memory, because all those zero elements still have to be stored.
4. To my understanding, one needs to change the architecture of the neural network according to the zeroed weights in order to get real gains in speed and memory.
5. A different way is to use sparse matrices and operations in PyTorch, but this functionality is in beta. See the discussion here [How to improve inference time of pruned model using torch.nn.utils.prune]
6. So, if we do unstructured pruning and want to make use of sparse operations, we have to write inference code that takes sparse matrices into account. Here is an example of a paper whose authors obtained large speed-ups, but only after implementing sparse matrix operations on FPGA. [How Can We Be So Dense? The Benefits of Using Highly Sparse Representations]
What’s said above is more relevant to unstructured pruning of weights.
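A back-of-the-envelope calculation shows why zeroed weights do not shrink a dense file, and why even a sparse format only pays off at high sparsity. This sketch assumes 4-byte float values and 8-byte integer indices in a coordinate (COO) layout; these sizes and the helper names are illustrative assumptions, and real file formats add their own overhead.

```python
def dense_bytes(n_weights, value_bytes=4):
    """Dense storage keeps every value, zero or not."""
    return n_weights * value_bytes


def coo_bytes(n_nonzero, ndim=2, value_bytes=4, index_bytes=8):
    """Coordinate storage keeps one value and ndim indices per non-zero entry."""
    return n_nonzero * (value_bytes + ndim * index_bytes)


n = 320 * 50                                  # weights of fc1
for sparsity in (0.5, 0.8, 0.9):
    nnz = round(n * (1 - sparsity))
    print(sparsity, dense_bytes(n), coo_bytes(nnz))

# Density below which COO becomes smaller than dense storage.
break_even = 4 / (4 + 2 * 8)
```

Under these assumptions the sparse layout only wins once fewer than 20% of the weights are non-zero; at 50% sparsity it is actually larger than the dense tensor.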
Structured pruning, for example dropping some channels, can give real speed-ups. The price is a drop in accuracy, but at least it really improves model size and speed. Still, as far as I understand, one needs to manually change the architecture according to the pruned channels.
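To make the "change the architecture" step concrete: if whole output channels and neurons are physically removed instead of zeroed, the neighbouring layers shrink too, because they lose the corresponding inputs. This is a sketch of the resulting parameter count in plain Python; `lenet_params` is a hypothetical helper, and the 4×4 spatial size after the second pooling follows from 28×28 inputs.

```python
def lenet_params(c1=10, c2=20, hidden=50, classes=10, kernel=5, spatial=4):
    """Weights plus biases of the LeNet variant above for given layer widths."""
    conv1 = 1 * c1 * kernel * kernel + c1
    conv2 = c1 * c2 * kernel * kernel + c2          # fewer conv1 channels, fewer inputs here
    fc1 = c2 * spatial * spatial * hidden + hidden  # fewer conv2 channels, fewer inputs here
    fc2 = hidden * classes + classes
    return conv1 + conv2 + fc1 + fc2


full = lenet_params()
# Remove 30% of the channels/neurons: 3 of 10, 6 of 20, 15 of 50.
slim = lenet_params(c1=7, c2=14, hidden=35)
print(full, slim, round(full / slim, 2))
```

Removing 30% of the units roughly halves the parameter count, since both the input and the output side of the inner layers shrink. A network rebuilt this way is genuinely smaller, unlike one that only carries zeros.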
Additional chapter: Knowledge distillation
Knowledge distillation is an idea proposed by Geoffrey Hinton, Oriol Vinyals and Jeff Dean to transfer knowledge from a huge trained model to a simple, lightweight one. Strictly speaking it is not pruning, but it has the same objective: simplify the original neural network without sacrificing much quality.
It works the following way:
- Train a comprehensive large network which has a good accuracy [Teacher Network]
- Train a small network until convergence [Student Network]. There will be trade-offs between accuracy that you reach with a simpler model and the level of compression.
- Distill the knowledge from the Teacher Network by training the Student Network using the outputs of the Teacher Network.
- Observe that the accuracy of the already trained and converged student network increases!
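The distillation step is usually driven by a loss that mixes two terms: how far the student's softened outputs are from the teacher's, and the ordinary cross-entropy on the true label. This is a sketch of that commonly used form in plain Python; the temperature, the weighting `alpha` and the logits are illustrative assumptions, and the code in the repository may combine the terms differently.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) on softened outputs + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


teacher = [6.0, 2.0, -1.0]
student = [3.0, 1.5, 0.0]
loss = distillation_loss(student, teacher, label=0)
# Identical outputs: the soft term vanishes.
same = distillation_loss(teacher, teacher, label=0, alpha=1.0)
print(round(loss, 4), same)
```

A higher temperature flattens the teacher's distribution, so the student also learns from the relative probabilities of the wrong classes rather than only from the winning one.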
I provide the code to do it in the knowledge_distillation folder. Run the student training script to train the student network. It has a simplified architecture relative to the original LeNet neural network. For example, the saved student network takes 1.16 times less space on disk (77 kB instead of 90 kB, or even half the size if saved in PyTorch v1.4). I ran training for 60 epochs, and the best accuracy, 0.9260, was reached at epoch 47. Thus we can say that the model has converged. Run the distillation script to do additional training of the converged student network, distilling the teacher network.
Here are the results:
- The FLOPs compression coefficient is 42 (the student model is 42 times smaller in terms of FLOPs, down to 21840 multiply-add operations from 932500).
- The model size compression coefficient is 3 (the student model is 3 times smaller in terms of size).
- The accuracy of the retrained student model is 0.9276, which is a tiny bit better than the original student network.
I would say that knowledge distillation is definitely worth a try as a method to perform model compression.
Bibliography with comments
1. The code to calculate FLOPs is taken from the ShrinkBench repo written by the authors of the What is the State of Neural Network Pruning? paper. The authors are Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle and John Guttag. They created this code to allow researchers to compare pruning algorithms: that is, to compare compression rates, speed-ups and the quality of the model after pruning, among others. I copy their way to measure model size, which is located in the metrics folder. I made some minor modifications to the code; all errors remain mine and should not be attributed to the authors’ code. I also take the logic of evaluating pruned models from this paper. All in all, this is the main source of inspiration for my research.
2. The next important source is this Neural Network Pruning PyTorch Implementation by Luyu Wang and Gavin Ding. I copy their code for implementing the high-level idea of doing pruning:
- Write wrappers on PyTorch Linear and Conv2d layers.
- Binary mask is multiplied by actual layer weights
- “Multiplying the mask is a differentiable operation and the backward pass is handed by automatic differentiation”
3. Next, I make use of the PyTorch Pruning Tutorial. It is different from the implementations above. My implementation mixes the code of the two implementations above with the PyTorch way.
Sources on knowledge distillation:
6. The Open Data Science community (ods.ai) is my source of inspiration, with brilliant people sharing their ideas on many aspects of Data Science.
1: Indeed, at the extreme we can just predict a constant. Accuracy will be low, but pruning will be very effective: there will be no parameters at all in the neural network.