<< nn-experiments

Exploring classification accuracy of convolutional neural network architectures with random weights

The most prominent settings in a convolutional layer are kernel size, stride and dilation. I've tested all (sensible) permutations of those settings for a 3-layer network and measured image classification accuracy with a linear classifier on top. There are a few architectures that stand out considerably!

The little convolutional networks all look like this:

nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3|5|7|9, stride=1|2|3, dilation=1|2|3),
    nn.ReLU(),
    nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3|5|7|9, stride=1|2|3, dilation=1|2|3),
    nn.ReLU(),
    nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3|5|7|9, stride=1|2|3, dilation=1|2|3),
    nn.ReLU(),
)

The | symbol means that one of the separated numbers is used for each parameter. So each layer has 4 * 3 * 3 = 36 possible permutations which leads to 36 ^ 3 = 46,656 possible 3-layer networks to test.
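For reference, the full grid of settings can be enumerated like this (a small sketch, not the actual test code):

from itertools import product

KERNEL_SIZES = (3, 5, 7, 9)
STRIDES = (1, 2, 3)
DILATIONS = (1, 2, 3)
# 4 * 3 * 3 = 36 possible (kernel_size, stride, dilation) settings per layer ...
layer_settings = list(product(KERNEL_SIZES, STRIDES, DILATIONS))
# ... and 36 ^ 3 = 46,656 possible 3-layer networks
all_networks = list(product(layer_settings, repeat=3))
print(len(layer_settings), len(all_networks))  # 36 46656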

The testing procedure is as follows, and the code is available here:

The reason for using only 32 channels and only 500 images is, of course, to make things fast. One test pass takes only about a second.
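As a rough sketch, the probe for one candidate network could look like the code below. The make_cnn helper, the use of scikit-learn's LogisticRegression as the linear classifier, and the train/validation split are my assumptions, not taken from the linked code:

import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

def make_cnn(kernels, strides, dilations, channels=32):
    # one 3-layer candidate with random (untrained) weights
    layers, in_ch = [], 3
    for k, s, d in zip(kernels, strides, dilations):
        layers += [nn.Conv2d(in_ch, channels, kernel_size=k, stride=s, dilation=d), nn.ReLU()]
        in_ch = channels
    return nn.Sequential(*layers)

@torch.no_grad()
def probe_accuracy(cnn, train_x, train_y, val_x, val_y):
    # features come from the frozen, randomly initialized network
    f_train = cnn(train_x).flatten(1).numpy()
    f_val = cnn(val_x).flatten(1).numpy()
    clf = LogisticRegression(max_iter=1000).fit(f_train, train_y.numpy())
    return clf.score(f_val, val_y.numpy())

# e.g. probe_accuracy(make_cnn((7, 3, 5), (2, 1, 2), (1, 3, 1)), train_x, train_y, val_x, val_y)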

Here is a plot of all networks that were run for 5 trials. The x-axis is throughput, the y-axis is accuracy, and the color encodes the ratio, i.e. the factor by which the network reduces the input size.

scatter plot of test results

The accuracy is not very high, but that is to be expected when training on only 500 images. I believe it is still a meaningful measurement for comparing the different architectures. Remember that we only fitted a linear classifier and the CNN weights are completely random.

The top-right architectures in the plot above are the interesting ones. They have comparatively high accuracy and run fast on my GPU. Here is a hand-picked list:

| kernel size | stride  | dilation | throughput | accuracy | ratio     |
|-------------|---------|----------|------------|----------|-----------|
| 7, 3, 5     | 2, 1, 2 | 1, 3, 1  | 1850       | 31.24%   | 0.3750000 |
| 7, 3, 9     | 2, 1, 2 | 3, 1, 1  | 2300       | 31.28%   | 0.2604166 |
| 9, 3, 7     | 3, 1, 1 | 1, 1, 1  | 2900       | 31.20%   | 0.5601852 |
| 9, 3, 7     | 3, 1, 1 | 2, 1, 1  | 3500       | 30.92%   | 0.4178241 |
| 9, 3, 5     | 3, 1, 1 | 1, 1, 1  | 2850       | 30.84%   | 0.6666666 |
| 5, 5, 3     | 3, 1, 1 | 3, 1, 3  | 3000       | 31.04%   | 0.3750000 |
| 7, 3, 3     | 3, 1, 1 | 3, 3, 1  | 5150       | 29.96%   | 0.3750000 |
| 7, 3, 3     | 3, 1, 2 | 3, 1, 1  | 6350       | 30.16%   | 0.1400463 |
| 5, 3, 5     | 3, 2, 1 | 2, 1, 1  | 6450       | 30.00%   | 0.1157407 |
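The ratio column matches the number of output elements divided by the number of input elements for a 96x96 STL10 image (the input size is my assumption, based on the STL10 training runs below). For example, the 9, 3, 7 / 3, 1, 1 / 1, 1, 1 row:

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=9, stride=3, dilation=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, dilation=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=7, stride=1, dilation=1), nn.ReLU(),
)
x = torch.randn(1, 3, 96, 96)  # STL10-sized input
y = net(x)                     # shape (1, 32, 22, 22)
print(y.numel() / x.numel())   # 0.5601851..., the ratio from the table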

Training these architectures with actual gradient descent, after adding a fully-connected linear layer, on the whole STL10 training set for a hundred epochs, with TrivialAugmentWide augmentation and cross-entropy loss, yields the following (a rough sketch of this training setup follows the tables below):

| kernel size | stride  | dilation | validation loss | val. accuracy | model params | train time (minutes) | throughput* |
|-------------|---------|----------|-----------------|---------------|--------------|-----------------------|-------------|
| 7, 3, 5     | 2, 1, 2 | 1, 3, 1  | 1.20086         | 57.57%        | 143,306      | 6.90                  | 1,207/s     |
| 7, 3, 9     | 2, 1, 2 | 3, 1, 1  | 1.27710         | 54.09%        | 168,970      | 6.72                  | 1,239/s     |
| 9, 3, 7     | 3, 1, 1 | 1, 1, 1  | 1.21700         | 58.59%        | 222,154      | 6.02                  | 1,383/s     |
| 9, 3, 7     | 3, 1, 1 | 2, 1, 1  | 1.26847         | 54.88%        | 182,794      | 5.08                  | 1,639/s     |
| 9, 3, 5     | 3, 1, 1 | 1, 1, 1  | 1.26713         | 56.65%        | 227,018      | 5.15                  | 1,617/s     |
| 5, 5, 3     | 3, 1, 1 | 3, 1, 3  | 1.24014         | 55.70%        | 141,002      | 6.24                  | 1,334/s     |
| 7, 3, 3     | 3, 1, 1 | 3, 3, 1  | 1.30604         | 52.73%        | 126,922      | 6.04                  | 1,378/s     |
| 7, 3, 3     | 3, 1, 2 | 3, 1, 1  | 1.30700         | 52.63%        | 61,962       | 5.93                  | 1,406/s     |
| 5, 3, 5     | 3, 2, 1 | 2, 1, 1  | 1.20183         | 56.75%        | 69,322       | 5.59                  | 1,492/s     |
| 9, 9, 7     | 2, 2, 1 | 2, 1, 1  | 1.26590         | 54.44%        | 173,002      | 5.48                  | 1,520/s     |

And with 128 instead of 32 convolutional channels:

| kernel size | stride  | dilation | validation loss | val. accuracy | model params | train time (minutes) | throughput |
|-------------|---------|----------|-----------------|---------------|--------------|-----------------------|------------|
| 7, 3, 5     | 2, 1, 2 | 1, 3, 1  | 1.08960         | 63.91%        | 990,986      | 11.91                 | 699/s      |
| 7, 3, 9     | 2, 1, 2 | 3, 1, 1  | 1.20499         | 59.97%        | 1,781,770    | 14.67                 | 568/s      |
| 9, 3, 7     | 3, 1, 1 | 1, 1, 1  | 1.12620         | 62.67%        | 1,601,290    | 6.85                  | 1,217/s    |
| 9, 3, 7     | 3, 1, 1 | 2, 1, 1  | 1.23929         | 59.92%        | 1,443,850    | 7.55                  | 1,103/s    |
| 9, 3, 5     | 3, 1, 1 | 1, 1, 1  | 1.13976         | 62.14%        | 1,325,834    | 7.11                  | 1,172/s    |
| 5, 5, 3     | 3, 1, 1 | 3, 1, 3  | 1.22916         | 59.92%        | 981,770      | 7.11                  | 1,172/s    |
| 7, 3, 3     | 3, 1, 1 | 3, 3, 1  | 1.24541         | 58.24%        | 728,842      | 6.85                  | 1,216/s    |
| 7, 3, 3     | 3, 1, 2 | 3, 1, 1  | 1.27375         | 56.62%        | 469,002      | 6.71                  | 1,242/s    |
| 5, 3, 5     | 3, 2, 1 | 2, 1, 1  | 1.10181         | 63.36%        | 695,050      | 6.14                  | 1,356/s    |
| 9, 9, 7     | 2, 2, 1 | 2, 1, 1  | 1.33770         | 57.35%        | 2,289,418    | 10.91                 | 763/s      |

*: The throughput is severely limited by the image augmentation stage!
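A rough sketch of this training setup, using one of the selected backbones plus the fully-connected head. The optimizer, learning rate and batch size are my assumptions; the dataset, epoch count, augmentation and loss are the ones mentioned above:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
])
train_set = datasets.STL10("data", split="train", download=True, transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True)  # batch size assumed

# the 7, 3, 5 / 2, 1, 2 / 1, 3, 1 backbone from the tables, plus a linear head
backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=7, stride=2, dilation=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, dilation=3), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=5, stride=2, dilation=1), nn.ReLU(),
)
model = nn.Sequential(backbone, nn.Flatten(), nn.LazyLinear(10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer and lr assumed
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()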

The state-of-the-art accuracy for the STL10 validation set when training only on the STL10 data is 87.3% (arxiv:1708.04552). My little networks are far from this result, but they are also pretty small. The original STL10 paper achieved 51.5% accuracy with a single-layer network (section 4.6).

The 9,3,5 - 3,1,1 - 1,1,1 architecture achieved the best convergence speed during training; the loss dropped unusually fast right from the start.

Without the augmentation, all models overfit pretty fast and the validation accuracy decreased considerably.

The 9,9,7 - 2,2,1 - 2,1,1 architecture had the worst accuracy in the earlier selection experiment (25.72%). When actually trained, however, it seems to become about as accurate as the other ones.

Conclusion

I don't know, really. The test only covers classification, and results might be quite different for other problems. However, the good architectures above probably behave well in many setups, and I will try them whenever I need a CNN block. Identifying the best-performing architectures (using only random weights and fast linear probes) while also considering throughput, which is one of the most important properties for impatient people like me, seems worth the trouble of running this exhaustive experiment.

Is there some rule-of-thumb for setting up the CNN parameters?

Not really, it seems. Larger kernel size, stride and dilation values should go in the first layers rather than the last ones, but there are exceptions.

Looking at the correlation plot of parameters and test result values among those 1,760 networks, one can see a few significant correlations. In the plot below, kernel_size-0 means the kernel size of the first layer, and so on. train_time is the time it took to fit the linear regression, min_val_acc and max_val_acc are the minimum and maximum validation accuracies achieved among the 5 trials, and fitness is a subjective measure calculated as normalized(accuracy) + 0.3 * normalized(throughput).

correlation plot of test parameters and results
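The fitness measure could be computed roughly like this, assuming min-max normalization over all tested networks (the normalization scheme is not spelled out above):

import numpy as np

def normalized(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# one entry per tested network (toy values)
accuracy = np.array([31.24, 29.96, 30.00])
throughput = np.array([1850.0, 5150.0, 6450.0])
fitness = normalized(accuracy) + 0.3 * normalized(throughput)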