<< nn-experiments

Parameter tuning for a Residual Deep Image-to-Image CNN

The network design itself is described further below; first, a bit of background.

Since this is a logbook, I might as well write down how the idea came along. I was working like a mad person on a new CLIPig implementation, using OpenAI's CLIP network to massage a mess of pixels into something that matches a text prompt and looks good. There are much cooler things out there, my personal favorite being Stable Diffusion, but none of them fit on my graphics card. So, to make it short, I'm developing image generation tools that run on a nowadays average graphics card with 6 GB of VRAM. For artistic reasons, I was interested in some form of noise reduction to make the rendered images look more pleasing and started with this simple design.

While network trainings were running, I searched arxiv.org for image de-noising papers, quickly limiting the results to before 2017, because I really don't understand all this Diffusion stuff and it all requires a high computational effort.

Found this paper: "Generalized Deep Image to Image Regression", which describes a recursive branching design, and subsequently made a little implementation. It works quite nicely but is mostly too slow for my taste, and I don't like the max-pooling layers for generative networks. After some playing around with it I came back to the simpler CNNs but kept a few of the ideas and ways of thinking from this and other papers.

Now that I have written most of the content below, I'll take a pause and read Image Restoration Using Convolutional Auto-encoders with Symmetric Skip Connections, which seems to do exactly this.

--

This simple CNN is composed of an encoder and a decoder, where the decoder is the reverse of the encoder. The layout is (in pytorch style):

ResConv(
  (encoder): ModuleDict(
    (layer_1): ConvLayer(
      (bn): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (act): GELU(approximate='none')
    )
    (layer_...): ConvLayer(
      (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (act): GELU(approximate='none')
    )
  )
  (decoder): ModuleDict(
    (layer_...): ConvLayer(
      (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv): ConvTranspose2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (act): GELU(approximate='none')
    )
    (layer_15): ConvLayer(
      (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv): ConvTranspose2d(32, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (act): Sigmoid()
    )
  )
)

In the forward pass, each encoder layer stores its result, which is later added to the input of the mirrored decoder layer. In the example above, the result of layer_1 is added to the output of layer_14, which is the input of layer_15.

I think this is basically a U-Net. In any case, we can add as many layers as we want while still being able to train the network. The residual connections propagate the gradients through the whole stack, while demanding tasks will utilize the deeper layers because they have a larger receptive field. For a little discussion of the receptive field, please check Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising (Zhang et al.).
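
Reduced to its essentials, the model and forward pass might look like the following sketch. This is my own reconstruction from the layout above, not the actual experiment code, and how the innermost skip connection is handled is an assumption:

```python
import torch
import torch.nn as nn


class ConvLayer(nn.Module):
    """Batch norm -> (transposed) convolution -> activation."""

    def __init__(self, ch_in: int, ch_out: int, transpose: bool = False, act: nn.Module = None):
        super().__init__()
        self.bn = nn.BatchNorm2d(ch_in)
        conv_class = nn.ConvTranspose2d if transpose else nn.Conv2d
        self.conv = conv_class(ch_in, ch_out, kernel_size=3, stride=1, padding=1)
        self.act = act if act is not None else nn.GELU()

    def forward(self, x):
        return self.act(self.conv(self.bn(x)))


class ResConv(nn.Module):
    """Symmetric encoder/decoder stack with residual skip connections."""

    def __init__(self, num_layers: int = 15, channels: int = 32):
        super().__init__()
        chan = [1] + [channels] * (num_layers - 1)   # channels entering each encoder layer
        self.encoder = nn.ModuleList(
            ConvLayer(chan[i], channels) for i in range(num_layers)
        )
        # the decoder mirrors the encoder: transposed convolutions, channels reversed,
        # Sigmoid on the very last layer to produce values in [0, 1]
        self.decoder = nn.ModuleList(
            ConvLayer(channels, chan[i], transpose=True,
                      act=nn.Sigmoid() if i == 0 else None)
            for i in reversed(range(num_layers))
        )

    def forward(self, x):
        skips = []
        for layer in self.encoder:
            x = layer(x)
            skips.append(x)                     # store every encoder result
        for i, layer in enumerate(self.decoder):
            # add the output of the mirrored encoder layer to this layer's input
            # (for the innermost layer this simply doubles the bottleneck output)
            x = layer(x + skips[-(i + 1)])
        return x


model = ResConv(num_layers=15)
print(model(torch.rand(8, 1, 32, 32)).shape)    # torch.Size([8, 1, 32, 32])
```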

All experiments, unless noted otherwise, use the task and dataset described below.

To evaluate the abilities of the network, it is trained to restore a deleted top, bottom, left or right half of an image. In this case, the images come from Zalando's beloved Fashion-MNIST dataset, scaled to 32x32 pixels (without interpolation).

It's not completely ridiculous to expect the network to make up a complete half since the images are all nicely centered and contain more or less the same stuff (shirts, jackets, shoes, bags...).
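
The corruption itself is easy to reproduce. A sketch of what such a transform could look like, assuming the deleted half is simply set to zero (my guess, not the exact experiment code):

```python
import random
import torch


def erase_random_half(image: torch.Tensor) -> torch.Tensor:
    """Return a copy of a (C, H, W) image with one randomly chosen half set to zero."""
    corrupted = image.clone()
    h, w = corrupted.shape[-2:]
    side = random.choice(["top", "bottom", "left", "right"])
    if side == "top":
        corrupted[..., : h // 2, :] = 0.
    elif side == "bottom":
        corrupted[..., h // 2:, :] = 0.
    elif side == "left":
        corrupted[..., :, : w // 2] = 0.
    else:
        corrupted[..., :, w // 2:] = 0.
    return corrupted
```

The network receives the corrupted image as input and the original image as reconstruction target.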

Here are two examples, from the worst (first image) and the best network of the following experiment. Odd rows show the network input and even rows the reconstruction by the network.

[images: reconstruction examples from the worst and the best network]

The first network only has one layer (one convolution and one de-convolution), so its receptive field is only 3x3 pixels (that of the single convolution kernel). It is therefore not able to generate the other half of the image.

Compare number of layers

Comparing the performance for different numbers of layers.

experiment file: experiments/denoise/resconv-test.yml @ 1a748720...

training loss plots

| layers | ks | ch | stride | pad | act | validation loss | model params | train time (minutes) |
|---|---|---|---|---|---|---|---|---|
| 15 | 3 | 32 | 1 | 1 | gelu | 0.054141 | 261,411 | 4.17 |
| 19 | 3 | 32 | 1 | 1 | gelu | 0.054439 | 335,907 | 5.43 |
| 13 | 3 | 32 | 1 | 1 | gelu | 0.054911 | 224,163 | 3.39 |
| 21 | 3 | 32 | 1 | 1 | gelu | 0.055190 | 373,155 | 5.98 |
| 17 | 3 | 32 | 1 | 1 | gelu | 0.055413 | 298,659 | 4.74 |
| 11 | 3 | 32 | 1 | 1 | gelu | 0.057229 | 186,915 | 3.02 |
| 9 | 3 | 32 | 1 | 1 | gelu | 0.061094 | 149,667 | 2.43 |
| 7 | 3 | 32 | 1 | 1 | gelu | 0.070167 | 112,419 | 1.87 |
| 5 | 3 | 32 | 1 | 1 | gelu | 0.079709 | 75,171 | 1.30 |
| 3 | 3 | 32 | 1 | 1 | gelu | 0.102242 | 37,923 | 0.78 |
| 2 | 3 | 32 | 1 | 1 | gelu | 0.124805 | 19,299 | 0.52 |
| 1 | 3 | 32 | 1 | 1 | gelu | 0.164834 | 675 | 0.36 |

It's really nice to see such a clean parameter-performance scaling once in a while. From 1 to 11 layers, performance increases steadily, then it gets a bit mixed up. Most likely, 11 layers achieve a receptive field that covers the whole 32x32 input patch (each additional 3x3 convolution grows the receptive field by 2 pixels in each direction). This is at least some evidence that the deep layers are used despite the residual skip connections.

I'm not very patient in these regards, but I ran the same experiment on 64x64 images anyway. Note that the computation time increases drastically: because of the stride of 1 and the padding of 1, each layer processes a full 64x64 block on 32 channels.

| layers | ks | ch | stride | pad | act | validation loss | model params | train time (minutes) |
|---|---|---|---|---|---|---|---|---|
| 21 | 3 | 32 | 1 | 1 | gelu | 0.0610384 | 373,155 | 25.13 |
| 30 | 3 | 32 | 1 | 1 | gelu | 0.0646240 | 540,771 | 35.27 |
| 17 | 3 | 32 | 1 | 1 | gelu | 0.0652571 | 298,659 | 19.57 |
| 19 | 3 | 32 | 1 | 1 | gelu | 0.0661678 | 335,907 | 19.95 |
| 15 | 3 | 32 | 1 | 1 | gelu | 0.0694179 | 261,411 | 17.27 |
| 13 | 3 | 32 | 1 | 1 | gelu | 0.0728544 | 224,163 | 22.02 |
| 11 | 3 | 32 | 1 | 1 | gelu | 0.0752024 | 186,915 | 18.75 |
| 9 | 3 | 32 | 1 | 1 | gelu | 0.0809892 | 149,667 | 15.87 |
| 7 | 3 | 32 | 1 | 1 | gelu | 0.0889331 | 112,419 | 11.31 |
| 5 | 3 | 32 | 1 | 1 | gelu | 0.1045170 | 75,171 | 7.49 |
| 3 | 3 | 32 | 1 | 1 | gelu | 0.1328610 | 37,923 | 3.79 |
| 1 | 3 | 32 | 1 | 1 | gelu | 0.1641830 | 675 | 0.47 |

The gain in precision per extra layer holds up a bit longer than with the 32x32 images, although the difference is not large.

Note that the 21-layer network has about the same validation loss as the 9-layer network had on the 32x32 images. Another small piece of evidence that the number of layers directly relates to the receptive field size: the 3x3-kernel CNN needs about 20 layers to see an area of about 40x40 pixels.

However, the 30-layer network does not improve performance further. I'm currently unable to explain that. It could be related to the residual weight factor of 0.1 that I used up until the next experiment.

Anyway, it's actually quite interesting how a network with only a few hundred thousand parameters can be so slow. And memory hungry. And my laptop is starting to burn through my lap. The 30-layer version runs at fewer than 100 inputs/sec, which is just terrible to bear (not only because I'm impatient but also because the task is purely academic and the input images are scaled up from 28x28 to 64x64, so it's a waste of computation). Also, the batch size had to be reduced for the 21- and 30-layer trainings to make them fit into my available 6 GB of GPU RAM.

For comparing the following parameters, I will use the 11-layer network on the original FMNIST resolution of 28x28 pixels.

Compare residual weight factor

Should one scale the residual signal that comes from the encoder layers before adding it to the decoder layer input? This factor is denoted RWF in the following expression and in the experimental results.

    dec_layer_input[n] = dec_layer_output[n-1] + RWF * enc_layer_output[-n]

(where enc_layer_output[-n] is the output of the n-th encoder layer counted from the end, i.e. the mirrored layer)
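
In terms of the forward pass sketch above, this is a one-line change (residual_weight being a hypothetical attribute holding the RWF value):

```python
# inside the decoder loop of ResConv.forward()
for i, layer in enumerate(self.decoder):
    x = layer(x + self.residual_weight * skips[-(i + 1)])
```
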
| l | RWF | ks | ch | stride | pad | act | validation loss | model params | train time (minutes) |
|---|---|---|---|---|---|---|---|---|---|
| 11 | 1.00 | 3 | 32 | 1 | 1 | gelu | 0.0537689 | 186,915 | 2.38 |
| 11 | 0.50 | 3 | 32 | 1 | 1 | gelu | 0.0539687 | 186,915 | 2.33 |
| 11 | 2.00 | 3 | 32 | 1 | 1 | gelu | 0.0557501 | 186,915 | 2.38 |
| 11 | 0.05 | 3 | 32 | 1 | 1 | gelu | 0.0560915 | 186,915 | 2.26 |
| 11 | 0.10 | 3 | 32 | 1 | 1 | gelu | 0.0568670 | 186,915 | 2.29 |
| 11 | 0.01 | 3 | 32 | 1 | 1 | gelu | 0.0595923 | 186,915 | 2.13 |

It seems like: no, don't do it. Just keep it at 1.0.

Compare position of batch normalization

In each layer, the batch norm was placed at the first position, more out of my intuition than actual knowledge.

ConvLayer(
  (bn): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (act): GELU(approximate='none')
)

It could just as well be placed after the convolution:

ConvLayer(
  (conv): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act): GELU(approximate='none')
)

or after the activation function:

ConvLayer(
  (conv): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (act): GELU(approximate='none')
  (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
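
A small sketch of how the three variants (plus "no batch norm") could be selected programmatically; a hypothetical helper, not the actual experiment code:

```python
import torch.nn as nn


def make_conv_layer(ch_in: int, ch_out: int, bn_position=0) -> nn.Sequential:
    """bn_position: 0 = before conv, 1 = after conv, 2 = after activation, None = no batch norm."""
    modules = []
    if bn_position == 0:
        modules.append(nn.BatchNorm2d(ch_in))
    modules.append(nn.Conv2d(ch_in, ch_out, kernel_size=3, stride=1, padding=1))
    if bn_position == 1:
        modules.append(nn.BatchNorm2d(ch_out))
    modules.append(nn.GELU())
    if bn_position == 2:
        modules.append(nn.BatchNorm2d(ch_out))
    return nn.Sequential(*modules)
```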

The following table lists the results for the different batch-norm positions (bnp), separately for encoder (bnp-e) and decoder (bnp-d). The values 0, 1 and 2 correspond to the three orderings above (before the convolution, after the convolution, after the activation) and "-" means no batch norm at all.

| l | ks | ch | stride | pad | act | bnp-e | bnp-d | validation loss | model params | train time (minutes) |
|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 3 | 32 | 1 | 1 | gelu | 0 | 0 | 0.0509118 | 186,787 | 2.07 |
| 11 | 3 | 32 | 1 | 1 | gelu | 0 | 1 | 0.0509330 | 186,787 | 2.22 |
| 11 | 3 | 32 | 1 | 1 | gelu | 1 | 0 | 0.0523070 | 186,849 | 2.31 |
| 11 | 3 | 32 | 1 | 1 | gelu | 1 | 1 | 0.0524035 | 186,849 | 2.32 |
| 11 | 3 | 32 | 1 | 1 | gelu | 0 | 2 | 0.0533116 | 186,787 | 2.25 |
| 11 | 3 | 32 | 1 | 1 | gelu | 1 | 2 | 0.0537457 | 186,849 | 2.32 |
| 11 | 3 | 32 | 1 | 1 | gelu | 2 | 1 | 0.0544390 | 186,849 | 2.33 |
| 11 | 3 | 32 | 1 | 1 | gelu | - | 0 | 0.0547576 | 186,209 | 2.22 |
| 11 | 3 | 32 | 1 | 1 | gelu | 2 | 2 | 0.0548645 | 186,849 | 2.33 |
| 11 | 3 | 32 | 1 | 1 | gelu | - | 1 | 0.0550985 | 186,209 | 2.21 |
| 11 | 3 | 32 | 1 | 1 | gelu | 2 | 0 | 0.0552446 | 186,849 | 2.33 |
| 11 | 3 | 32 | 1 | 1 | gelu | 0 | - | 0.0559730 | 186,147 | 2.17 |
| 11 | 3 | 32 | 1 | 1 | gelu | 1 | - | 0.0576676 | 186,209 | 2.22 |
| 11 | 3 | 32 | 1 | 1 | gelu | 2 | - | 0.0577886 | 186,209 | 2.22 |
| 11 | 3 | 32 | 1 | 1 | gelu | - | 2 | 0.0587750 | 186,209 | 2.21 |
| 11 | 3 | 32 | 1 | 1 | gelu | - | - | 0.1163260 | 185,569 | 2.10 |

Note that the best validation loss is actually better than in previous runs. After some tests, not shown here, the batch normalization was removed from the final encoder and decoder layers.

Compare activation function

Let's take all the good parts from before and compare the standard activation functions available in pytorch. However, the final decoder layer always uses the Sigmoid activation, because everybody knows it's a good activation for generating images (everything is scaled to the [0, 1] range).
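
Sweeping over the activations is straightforward since each one is a class in torch.nn and can be looked up by name. A sketch; the experiments are configured through a YAML file, so the actual mechanism probably differs:

```python
import torch.nn as nn


def get_activation(name: str) -> nn.Module:
    """Instantiate a torch.nn activation class by name, e.g. 'GELU', 'Mish' or 'ReLU6'."""
    return getattr(nn, name)()


for name in ("Mish", "GELU", "SiLU", "ReLU"):
    print(name, get_activation(name))
```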

| l | ks | ch | stride | pad | act | validation loss | model params | train time (minutes) |
|---|---|---|---|---|---|---|---|---|
| 11 | 3 | 32 | 1 | 1 | Mish | 0.0512072 | 186,787 | 2.52 |
| 11 | 3 | 32 | 1 | 1 | GELU | 0.0512259 | 186,787 | 2.29 |
| 11 | 3 | 32 | 1 | 1 | SiLU | 0.0514479 | 186,787 | 2.46 |
| 11 | 3 | 32 | 1 | 1 | Hardswish | 0.0515750 | 186,787 | 2.42 |
| 11 | 3 | 32 | 1 | 1 | LeakyReLU | 0.0522968 | 186,787 | 2.36 |
| 11 | 3 | 32 | 1 | 1 | ReLU6 | 0.0524871 | 186,787 | 2.42 |
| 11 | 3 | 32 | 1 | 1 | ReLU | 0.0530411 | 186,787 | 2.42 |
| 11 | 3 | 32 | 1 | 1 | PReLU | 0.0533523 | 186,807 | 2.59 |
| 11 | 3 | 32 | 1 | 1 | Softplus | 0.0544980 | 186,787 | 2.42 |
| 11 | 3 | 32 | 1 | 1 | LogSigmoid | 0.0547575 | 186,787 | 2.54 |
| 11 | 3 | 32 | 1 | 1 | CELU | 0.0551080 | 186,787 | 2.10 |
| 11 | 3 | 32 | 1 | 1 | Tanh | 0.0562118 | 186,787 | 2.36 |
| 11 | 3 | 32 | 1 | 1 | RReLU | 0.0566449 | 186,787 | 2.45 |
| 11 | 3 | 32 | 1 | 1 | Hardtanh | 0.0570635 | 186,787 | 2.47 |
| 11 | 3 | 32 | 1 | 1 | Softsign | 0.0571498 | 186,787 | 2.95 |
| 11 | 3 | 32 | 1 | 1 | SELU | 0.0601167 | 186,787 | 2.47 |
| 11 | 3 | 32 | 1 | 1 | Sigmoid | 0.0610033 | 186,787 | 2.44 |
| 11 | 3 | 32 | 1 | 1 | Hardsigmoid | 0.0647689 | 186,787 | 2.43 |
| 11 | 3 | 32 | 1 | 1 | Softshrink | 0.0805768 | 186,787 | 2.34 |
| 11 | 3 | 32 | 1 | 1 | Hardshrink | 0.0974769 | 186,787 | 2.24 |
| 11 | 3 | 32 | 1 | 1 | Tanhshrink | 0.1122510 | 186,787 | 2.60 |

Sometimes I wonder why there are so many activation functions when, most of the time, the same ones perform best. Anyway, this result is slightly different from my MNIST classification results but, as usual, you are fine choosing GELU, one of the ReLU variants or, as it turns out, the Mish activation.

Compare channel size

So far, all hidden convolutions had 32 channels. In this experiment, I will test 16, 32, 64 and 128 channels, but also vary the channel size along the 11 layers.

For example, ch1 = 16 and ch2 = 128 means the encoder uses 16 channels in the first layer and linearly increases the number to 128 in the final layer. The decoder always uses the reversed channel counts, so that the skip connections fit.
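
A sketch of how the per-layer channel counts might be computed; the exact interpolation and rounding used in the experiments is an assumption on my part:

```python
def layer_channels(ch1: int, ch2: int, num_layers: int) -> list:
    """Linearly interpolate the channel count from ch1 (first layer) to ch2 (last layer)."""
    if num_layers == 1:
        return [ch1]
    return [
        round(ch1 + (ch2 - ch1) * i / (num_layers - 1))
        for i in range(num_layers)
    ]


encoder_channels = layer_channels(16, 128, 11)        # [16, 27, 38, ..., 128]
decoder_channels = list(reversed(encoder_channels))   # mirrored, so the skip connections fit
```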

| l | ks | ch1 | ch2 | stride | pad | act | validation loss | model params | train time (minutes) |
|---|---|---|---|---|---|---|---|---|---|
| 11 | 3 | 128 | 128 | 1 | 1 | gelu | 0.0443586 | 2,958,979 | 10.66 |
| 11 | 3 | 128 | 64 | 1 | 1 | gelu | 0.0443985 | 1,712,091 | 8.83 |
| 11 | 3 | 128 | 32 | 1 | 1 | gelu | 0.0448816 | 1,280,341 | 7.01 |
| 11 | 3 | 128 | 16 | 1 | 1 | gelu | 0.0449165 | 1,109,285 | 6.30 |
| 11 | 3 | 64 | 128 | 1 | 1 | gelu | 0.0450236 | 1,710,773 | 8.64 |
| 11 | 3 | 64 | 64 | 1 | 1 | gelu | 0.0459730 | 742,211 | 4.15 |
| 11 | 3 | 32 | 128 | 1 | 1 | gelu | 0.0469918 | 1,278,363 | 7.08 |
| 11 | 3 | 16 | 128 | 1 | 1 | gelu | 0.0483264 | 1,106,979 | 6.40 |
| 11 | 3 | 64 | 32 | 1 | 1 | gelu | 0.0485725 | 426,357 | 3.48 |
| 11 | 3 | 64 | 16 | 1 | 1 | gelu | 0.0487565 | 319,347 | 3.03 |
| 11 | 3 | 32 | 64 | 1 | 1 | gelu | 0.0488293 | 425,699 | 3.62 |
| 11 | 3 | 32 | 32 | 1 | 1 | gelu | 0.0512699 | 186,787 | 2.48 |
| 11 | 3 | 16 | 64 | 1 | 1 | gelu | 0.0515703 | 318,357 | 3.03 |
| 11 | 3 | 32 | 16 | 1 | 1 | gelu | 0.0539735 | 105,925 | 2.06 |
| 11 | 3 | 16 | 32 | 1 | 1 | gelu | 0.0546895 | 105,595 | 1.88 |
| 11 | 3 | 16 | 16 | 1 | 1 | gelu | 0.0587657 | 47,315 | 1.39 |

What can be learned from it? Mostly that more channels help, and that, for a comparable parameter count, a wide first layer seems to matter more than a wide last layer: the ch1=128/ch2=16 run beats the ch1=16/ch2=128 run, although both have about 1.1 million parameters.

Compare kernel size

Out of old-fashionedness, only odd kernel sizes are considered. For a fair comparison, the padding is adjusted to the individual kernel size such that each layer still processes the same spatial block (28x28).
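
For stride 1 and an odd kernel size, this "same" padding is simply half the kernel size, rounded down:

```python
def same_padding(kernel_size: int) -> int:
    """Padding that keeps the spatial size for an odd kernel size with stride 1."""
    return kernel_size // 2


assert [same_padding(k) for k in (3, 5, 7, 9, 11)] == [1, 2, 3, 4, 5]
```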

| l | ks | ch | stride | pad | act | validation loss | model params | train time (minutes) |
|---|---|---|---|---|---|---|---|---|
| 11 | 5 | 32 | 1 | 2 | gelu | 0.0466142 | 515,491 | 2.10 |
| 11 | 7 | 32 | 1 | 3 | gelu | 0.0468494 | 1,008,547 | 5.12 |
| 11 | 9 | 32 | 1 | 4 | gelu | 0.0486901 | 1,665,955 | 7.29 |
| 11 | 11 | 32 | 1 | 5 | gelu | 0.0501364 | 2,487,715 | 9.44 |
| 11 | 3 | 32 | 1 | 1 | gelu | 0.0513643 | 186,787 | 2.16 |

Wow, there is a significant performance jump from 3 to 5, and the 5x5 kernel seems to run equally fast or even faster on my hardware/lib/driver setup. It is very likely, though, that the effect of the kernel size must be re-evaluated for each particular task.

Similar to the channel size above, the kernel size is varied layer-wise in the following experiment. The top performing network with ks1=3 and ks2=9, for example, has the following kernel sizes:

[3, 3, 5, 5, 5, 7, 7, 7, 7, 9, 9]
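
This list can be reproduced by interpolating linearly between ks1 and ks2 and rounding each value to the nearest odd integer; a sketch, the actual experiment code may round differently:

```python
import math


def layer_kernel_sizes(ks1: int, ks2: int, num_layers: int) -> list:
    """Interpolate kernel sizes from ks1 to ks2 and snap each value to the nearest odd integer."""
    if num_layers == 1:
        return [ks1]
    sizes = []
    for i in range(num_layers):
        ks = ks1 + (ks2 - ks1) * i / (num_layers - 1)
        sizes.append(2 * math.floor((ks - 1) / 2 + 0.5) + 1)   # nearest odd, halves round up
    return sizes


print(layer_kernel_sizes(3, 9, 11))   # [3, 3, 5, 5, 5, 7, 7, 7, 7, 9, 9]
```
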
| l | ks1 | ks2 | ch | stride | act | validation loss | model params | train time (minutes) |
|---|---|---|---|---|---|---|---|---|
| 11 | 3 | 9 | 32 | 1 | gelu | 0.0434004 | 907,683 | 4.37 |
| 11 | 3 | 11 | 32 | 1 | gelu | 0.0435886 | 1,251,747 | 5.57 |
| 11 | 3 | 7 | 32 | 1 | gelu | 0.0449416 | 596,387 | 3.02 |
| 11 | 5 | 9 | 32 | 1 | gelu | 0.0454906 | 1,105,315 | 5.33 |
| 11 | 5 | 7 | 32 | 1 | gelu | 0.0461697 | 810,403 | 4.09 |
| 11 | 3 | 5 | 32 | 1 | gelu | 0.0462298 | 383,395 | 2.15 |
| 11 | 5 | 11 | 32 | 1 | gelu | 0.0462772 | 1,514,915 | 6.67 |
| 11 | 5 | 5 | 32 | 1 | gelu | 0.0467664 | 515,491 | 2.19 |
| 11 | 7 | 7 | 32 | 1 | gelu | 0.0469873 | 1,008,547 | 5.43 |
| 11 | 7 | 9 | 32 | 1 | gelu | 0.0474498 | 1,401,763 | 6.58 |
| 11 | 7 | 5 | 32 | 1 | gelu | 0.0478909 | 762,787 | 3.79 |
| 11 | 7 | 11 | 32 | 1 | gelu | 0.0482104 | 1,778,595 | 7.66 |
| 11 | 7 | 3 | 32 | 1 | gelu | 0.0482650 | 517,027 | 2.97 |
| 11 | 9 | 9 | 32 | 1 | gelu | 0.0483184 | 1,665,955 | 7.72 |
| 11 | 9 | 3 | 32 | 1 | gelu | 0.0486722 | 764,835 | 4.13 |
| 11 | 5 | 3 | 32 | 1 | gelu | 0.0486834 | 351,651 | 2.35 |
| 11 | 9 | 5 | 32 | 1 | gelu | 0.0487206 | 994,211 | 4.92 |
| 11 | 9 | 7 | 32 | 1 | gelu | 0.0488068 | 1,338,275 | 6.61 |
| 11 | 9 | 11 | 32 | 1 | gelu | 0.0488103 | 2,157,475 | 8.94 |
| 11 | 11 | 11 | 32 | 1 | gelu | 0.0495152 | 2,487,715 | 9.59 |
| 11 | 11 | 7 | 32 | 1 | gelu | 0.0497081 | 1,635,747 | 7.20 |
| 11 | 11 | 5 | 32 | 1 | gelu | 0.0499655 | 1,324,451 | 6.13 |
| 11 | 11 | 3 | 32 | 1 | gelu | 0.0499719 | 1,029,539 | 5.21 |
| 11 | 11 | 9 | 32 | 1 | gelu | 0.0500542 | 2,078,115 | 8.58 |
| 11 | 3 | 3 | 32 | 1 | gelu | 0.0512497 | 186,787 | 2.11 |

Wow, our little 3x3-kernel friend, which was used all the time, shows the worst performance again. The 5x5 setup from the previous experiment is surpassed as well. Generally, one can see: small kernel sizes in the shallow layers and larger kernel sizes in the deeper layers are a good thing.

TODO: