Experiments in Neural Network Pruning (in PyTorch).


Introduction

Key takeaways

I also cover a different way to compress a neural network: knowledge distillation.

All code can be found in my GitHub repository.

Define pruning

Pruning synapses (=weights) vs pruning neurons (Taken from Learning both Weights and Connections for Efficient Neural Networks, 2015)

Much of this work is based on the paper What is the State of Neural Network Pruning?

I: Evaluating the effectiveness of pruning

In order to estimate the effectiveness of pruning, we will take into account:

  1. Acceleration of inference on the test set.
  • Compare the number of multiply-add operations (FLOPs) needed to perform inference.
  • Additionally, I compute the average time of running the original/pruned model on the data.

  2. Model size reduction / weight compression.
  • Here we compare the total number of non-zero parameters (see the sketch below).
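As a minimal sketch of how these two metrics can be computed (the helper names below are mine, not necessarily those used in metrics/), one can count the non-zero parameters and time inference roughly as follows:

import time
import torch

def count_nonzero_params(model: torch.nn.Module) -> int:
    # Pruned weights are exactly zero, so they do not count towards the total
    return int(sum((p != 0).sum().item() for p in model.parameters()))

@torch.no_grad()
def average_inference_time(model, loader, n_batches=100):
    # Average wall-clock time per batch over at most n_batches batches
    model.eval()
    times = []
    for i, (x, _) in enumerate(loader):
        if i >= n_batches:
            break
        start = time.perf_counter()
        model(x)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)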

II: Experiment Setting

Here is the representation of the canonical LeNet-5 architecture:

LeNet-5 architecture from Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998)

I will represent this architecture with some modifications in PyTorch as follows:

import torch.nn as nn
import torch.nn.functional as F


class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)  # flatten the 20 x 4 x 4 feature maps
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

The architecture defined above is excessive in the sense that one can reach the same or even better categorical accuracy with a smaller neural network. But this is done on purpose: we leave room for pruning.

Let’s organize experiments as follows:

Training stage:

  1. Train the model using the script (lenet_pytorch.py).
  2. Perform evaluation of the model using the metrics defined above.
  3. Save the trained model.

Pruning stage: Perform pruning experiments using the saved model.

4. In order to run the pruning experiments and their evaluation, see:

  • metrics/
  • experiments.py (the main script that produces the results)
  • pruning_loop.py (implements the pruning experiment)
  • the utils folder with helper scripts

So, in utils you find:

  • avg_speed_calc.py to calculate average inference time on train data
  • loaders.py to create train/test loaders
  • maskedLayers.py wrappers for Linear and Conv2d PyTorch modules.
  • plot.ipynb Jupyter notebook to produce the plots below

Pruning setup

  1. A neural network (NN) is trained until convergence (currently 78 epochs).
  2. Prune and finetune:

for i in 1 to K do
    prune NN
    finetune NN [for N epochs]
end for

This means that the neural network is pruned several times. In my version, a weight that has once been set to zero always stays zero: pruned weights are not retrained. Note also that finetuning means several additional epochs of training.

To fix the pruning setup, in all experiments the number of prune-finetune rounds is 3 (K=3) and the number of finetuning epochs is 4 (N=4). The categorical accuracy and the model's speed-up and compression are reported after pruning-finetuning is finished.
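A minimal sketch of this prune-finetune loop built on torch.nn.utils.prune (the train_one_epoch helper is assumed to exist elsewhere, and the 10% default amount is only illustrative):

def prune_and_finetune(model, modules, prune_fn, K=3, N=4, amount=0.10, **prune_kwargs):
    # Iteratively prune and finetune. Weights zeroed by pruning stay zero,
    # because the pruning masks remain attached to the layers during finetuning.
    for _ in range(K):
        for module in modules:
            prune_fn(module, name="weight", amount=amount, **prune_kwargs)
        for _ in range(N):
            train_one_epoch(model)  # assumed training helper, defined elsewhere
    return model

The pruning function (random, L1-based, structured) and the list of pruned layers are what change between the experiments below.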

III: Results

Baseline

Experiments

Experiment 1: Unstructured random pruning

Setting: Prune the fully-connected layers (fc1, fc2) and both convolutional layers (conv1, conv2). Increase pruning from 10% to 70% (step = 10%); the pruning percentage applies to each layer. Roughly, such an increase corresponds to compressing the model up to 36 times.
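A sketch of one such run with random unstructured pruning, reusing the prune_and_finetune helper above (illustrative, not the exact code from experiments.py):

import torch.nn.utils.prune as prune

model = LeNet()  # in practice, load the trained weights first
prune_and_finetune(model, [model.conv1, model.conv2, model.fc1, model.fc2],
                   prune.random_unstructured, amount=0.10)  # zero 10% of each layer's weights at random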

Experiment 2: Unstructured pruning of the smallest weights (based on the L1 norm)
[UnstructPrunL1Norm]

Setting: Same as in experiment 1, except that pruning is no longer random: zeros are assigned to the weights with the smallest L1 magnitude.
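In the sketch above, this run only swaps the pruning function (again illustrative, reusing model and prune from the previous snippet):

prune_and_finetune(model, [model.conv1, model.conv2, model.fc1, model.fc2],
                   prune.l1_unstructured, amount=0.10)  # zero the 10% smallest weights per layer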

Experiment 3: Structured pruning (based on the L1 norm)
[StructuredPrunL1Norm]

Setting: Here I use structured pruning. In PyTorch one can use prune.ln_structured for that. It is possible to pass a dimension (dim) to specify which channel should be dropped. For fully-connected layers such as fc1 or fc2, dim=0 corresponds to "switching off" output neurons (50 for fc1 and 10 for fc2). Therefore, it does not really make sense to switch off neurons in the classification layer fc2. For convolutional layers such as conv1 or conv2, dim=0 corresponds to removing output channels (10 for conv1 and 20 for conv2). That is why I only prune the fc1, conv1 and conv2 layers, again going from pruning 10% of a layer's channels up to 70%. For the fully-connected layer fc1 this means zeroing out 5 to 35 neurons out of 50; for the conv1 layer it means zeroing out all the connections corresponding to 1 to 7 output channels.
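In the same sketch form, the structured variant would look roughly like this: fc2 is excluded, n=1 selects the L1 norm, and dim=0 selects output channels/neurons (illustrative, not the exact script):

prune_and_finetune(model, [model.conv1, model.conv2, model.fc1],
                   prune.ln_structured, amount=0.10, n=1, dim=0)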

Below I present results of my pruning experiments:

Using the average inference time metric, I confirm that there is no real difference in running time between the pruned and non-pruned models.

Conclusions and caveats

If we take the results at face value, we conclude that the best results are obtained with unstructured pruning of the smallest weights based on the L1 norm. In reality, however (more on that below), unstructured pruning only makes the weights sparse, and since sparse operations are not yet fully supported in PyTorch, it does not bring real gains in model size or inference speed. Still, we can treat these results as some evidence that a smaller architecture with fewer weights might be beneficial.

Below are further caveats:

Unstructured pruning

2. However, people report that when you look at the actual time it takes to make a prediction, there is no gain in speed. I tested this with the models before and after pruning (experiments 1–3), and it is true: there is no speedup in terms of average inference time. The saved PyTorch models (.pth) also have the same size.

3. Additionally, there is no saving in memory, because all those zero elements still have to be stored.

4. To my understanding, one needs to change the architecture of the neural network according to the zeroed weights in order to get real gains in speed and memory.

5. An alternative is to use sparse matrices and sparse operations in PyTorch, but this functionality is still in beta. See the discussion here: [How to improve inference time of pruned model using torch.nn.utils.prune]

6. So, if we do unstructured pruning and want to make use of sparse operations, we have to write inference code that works with sparse matrices. Here is an example of a paper whose authors obtained large speed-ups, but only after introducing operations with sparse matrices on an FPGA: [How Can We Be So Dense? The Benefits of Using Highly Sparse Representations]
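A quick way to see why zeroed weights alone do not save memory is to compare a dense pruned tensor with its sparse COO representation, where every remaining value also needs its indices (a self-contained illustration, not code from the repository):

import torch

w = torch.randn(50, 320)                    # a weight matrix the size of fc1
w[torch.rand_like(w) < 0.7] = 0.0           # emulate roughly 70% unstructured pruning

dense_bytes = w.numel() * w.element_size()
s = w.to_sparse().coalesce()                # COO format: int64 indices plus float32 values
sparse_bytes = (s.indices().numel() * s.indices().element_size()
                + s.values().numel() * s.values().element_size())
print(dense_bytes, sparse_bytes)            # at ~70% sparsity the COO version is still larger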

The caveats above apply mostly to unstructured pruning of weights.

Structured pruning

Additional chapter: Knowledge distillation

It works the following way:

- Train a large network that reaches good accuracy [Teacher Network].
- Train a small network until convergence [Student Network]. There will be a trade-off between the accuracy you reach with the simpler model and the level of compression.
- Distill the knowledge from the Teacher Network by training the Student Network on the outputs of the Teacher Network.
- Observe that the accuracy of the trained and converged student network increases!
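The core of the distillation step is a loss that mixes the teacher's temperature-softened outputs with the usual hard labels. A minimal sketch (the temperature T and weight alpha are illustrative, not necessarily the values used in distillation.py):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    # Soft part: match the teacher's temperature-softened distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Hard part: the usual cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard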

I provide the code to do it in the knowledge_distillation folder. Run

python knowledge_distillation/train_student.py

to train the student network. It has a simplified architecture relative to the original LeNet network. For example, the saved student network takes 1.16 times less disk space (77 kB instead of 90 kB, and even half as much if saved with PyTorch v1.4). I trained for 60 epochs; the best accuracy, 0.9260, was reached at epoch 47, so we can say the model has converged.

Run

python knowledge_distillation/distillation.py

to perform additional training of the converged student network, distilling the teacher network into it.

Here are the results:
- The FLOPs compression coefficient is 42 (the student model needs 42 times fewer multiply-add operations: 21840 instead of 932500).
- The model size compression coefficient is 3 (the student model is 3 times smaller in terms of size).
- The accuracy of the retrained student model is 0.9276, a tiny bit better than the original student network.

I would say that knowledge distillation is definitely worth a try as a method to perform model compression.

Bibliography with comments

2. The next important source is the Neural Network Pruning PyTorch Implementation by Luyu Wang and Gavin Ding. I borrow their code for the high-level idea of pruning:
- Write wrappers on the PyTorch Linear and Conv2d layers.
- A binary mask is multiplied by the actual layer weights.
- “Multiplying the mask is a differentiable operation and the backward pass is handled by automatic differentiation” (see the sketch below).
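A minimal sketch of that masking idea (my own paraphrase, not their exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Linear):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        # All-ones mask means nothing is pruned; a buffer is saved together with the model
        self.register_buffer("mask", torch.ones_like(self.weight))

    def set_mask(self, mask):
        self.mask.copy_(mask)

    def forward(self, x):
        # Multiplying by the mask is differentiable, so autograd handles the backward pass
        return F.linear(x, self.weight * self.mask, self.bias)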

3. Next, I make use of the PyTorch Pruning Tutorial. It is different from the implementations above. My implementation mixes the code of the two implementations above with the native PyTorch pruning approach.

Sources on knowledge distillation:

4. Dark knowledge

5. Distilling the Knowledge in a Neural Network

6. The Open Data Science community (ods.ai) is my source of inspiration, with brilliant people sharing their ideas on many aspects of Data Science.
