How does receptive field size increase with self-attention
Still not tired of these Very Small Language Models... After the previous experiments, I was wondering how the receptive field of a 1D convolutional network is influenced by a self-attention layer.
Basically, self-attention gives the network the opportunity to relate distant cells that are (spatially or temporally) much farther apart than the classic receptive field of the convolution can cover.
I tried a couple of new synthetic datasets but came back to the Selective Copying problem because it's quite simple to understand and to set up with a specific size.
Small recap: selective copying means picking out all the letters in between the spaces and concatenating them:
A B C D : ABCD
EG B A : EGBA
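To make the data format concrete, here is a rough sketch of how such a sample could be generated; the real dataset code from the previous experiments surely differs in details, so treat the function below as illustration only.

```python
import random
import string

def make_sample(num_letters: int = 10, area: int = 40) -> str:
    """Scatter `num_letters` random uppercase letters over an `area`-wide
    field of spaces and append the expected answer after a colon."""
    positions = sorted(random.sample(range(area), num_letters))
    letters = [random.choice(string.ascii_uppercase) for _ in positions]
    field = [" "] * area
    for pos, letter in zip(positions, letters):
        field[pos] = letter
    return "".join(field) + ": " + "".join(letters)

print(make_sample(num_letters=4, area=12))  # e.g. ' A  B   C D : ABCD'
```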
It's a simple task but requires a large-enough receptive field. I'm using the same text-to-text network as in previous experiments, masking out the answer and requiring the network to reproduce the whole string while replacing the mask with the actual answer (the concatenated letters).
The network gets the raw byte classes (256) as input and outputs class logits for each output character.
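Roughly, the masking works like this; the `?` mask byte and the exact layout are placeholders for illustration, not necessarily what the actual dataset uses.

```python
import numpy as np

MASK_BYTE = ord("?")  # hypothetical choice of mask character

def encode(sample: str, answer_len: int):
    """Turn a sample string into input/target arrays of byte classes (0..255).
    The answer at the end of the string is replaced by the mask byte in the
    input; the target is the full, unmasked string."""
    target = np.frombuffer(sample.encode("latin-1"), dtype=np.uint8).astype(np.int64)
    inp = target.copy()
    inp[-answer_len:] = MASK_BYTE
    return inp, target

inp, target = encode("A  B C  D   : ABCD", answer_len=4)
```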
Each sample of the dataset contains 10 letters to concatenate, placed in a 40 (or 80) cell wide space, denoted as `area` in the table below. The network has 3 layers and uses either a kernel size of 7 with dilations 1, 1, 1, which results in a receptive field radius of 9, or a kernel size of 13 with dilations 5, 7, 1, which results in a receptive field radius of 78.
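The radius is just the sum over the layers of `dilation * (kernel_size - 1) / 2`, which reproduces the two numbers above:

```python
def receptive_field_radius(kernel_size, dilations):
    # each conv layer extends the radius by dilation * (kernel_size - 1) / 2
    return sum(d * (kernel_size - 1) // 2 for d in dilations)

print(receptive_field_radius(7,  [1, 1, 1]))  # 9
print(receptive_field_radius(13, [5, 7, 1]))  # 78
```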
The table shows runs for various combinations of kernel size/dilation, number of convolutional channels and self-attention (`attn`) for a selective copying area of 40 and 80 cells (`l` = layers, `ch` = channels, `ks` = kernel size, `dil` = per-layer dilation; the `attn` column lists per layer whether a self-attention is applied (`T`) or not (`0`)). The attention is the self-invented QK type as described here.
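Roughly sketched, the idea is an attention matrix built only from query and key projections, applied directly to the input, without a value projection. The exact layer is described in the linked post, so treat the following as an approximation rather than the real implementation.

```python
import torch
import torch.nn as nn

class QKSelfAttention(nn.Module):
    """Minimal QK-style self-attention: an attention matrix built from
    query/key projections only is applied directly to the input sequence
    (no value or output projection). Assumed variant, not the exact layer."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv1d(channels, channels, kernel_size=1)
        self.k = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length)
        q, k = self.q(x), self.k(x)
        attn = torch.softmax(
            torch.einsum("bcl,bcm->blm", q, k) / x.shape[1] ** 0.5, dim=-1
        )                                              # (batch, length, length)
        return torch.einsum("blm,bcm->bcl", attn, x)   # re-mix the input itself
```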
area | l | ch | ks | dil | attn | validation loss | validation mask error % | validation sample error % | model params | train time (minutes) | throughput |
---|---|---|---|---|---|---|---|---|---|---|---|
40 | 3 | 32 | 7 | 1,1,1 | | 0.428964 | 88.9958 | 100 | 29,920 | 5.98 | 10,867/s |
40 | 3 | 32 | 7 | 1,1,1 | 0,T,T | 0.364131 | 72.3721 | 100 | 44,320 | 6.99 | 9,302/s |
80 | 3 | 128 | 7 | 1,1,1 | 0,0,T | 0.108775 | 45.9375 | 99.9104 | 492,544 | 12.59 | 5,162/s |
80 | 3 | 32 | 13 | 5,7,1 | | 0.139426 | 46.3525 | 99.8308 | 48,352 | 16.95 | 3,834/s |
40 | 3 | 32 | 13 | 5,7,1 | | 0.147143 | 24.7432 | 92.2472 | 48,352 | 7.23 | 8,984/s |
40 | 3 | 32 | 7 | 1,1,1 | 0,0,T | 0.060434 | 13.9844 | 82.0064 | 37,120 | 6.22 | 10,457/s |
40 | 3 | 64 | 7 | 1,1,1 | 0,0,T | 0.037878 | 9.0555 | 64.6994 | 131,584 | 15.77 | 4,122/s |
40 | 3 | 128 | 7 | 1,1,1 | 0,0,T | 0.024103 | 6.2549 | 48.0842 | 492,544 | 9.22 | 7,051/s |
40 | 3 | 256 | 7 | 1,1,1 | 0,0,T | 0.019787 | 5.0681 | 39.8637 | 1,902,592 | 19.93 | 3,260/s |
40 | 3 | 512 | 7 | 1,1,1 | 0,0,T | 0.019062 | 4.6715 | 37.5547 | 7,475,200 | 45.6 | 1,425/s |
80 | 3 | 128 | 13 | 5,7,1 | 0,0,T | 8.94742e-07 | 0.0009 | 0.0099 | 885,760 | 17.85 | 3,641/s |
80 | 3 | 32 | 13 | 5,7,1 | 0,0,T | 3.31051e-06 | 0 | 0 | 61,696 | 24.17 | 2,689/s |
40 | 3 | 32 | 13 | 5,7,1 | 0,0,T | 1.91189e-06 | 0 | 0 | 61,696 | 8.72 | 7,454/s |
80 | 3 | 64 | 13 | 5,7,1 | 0,0,T | 4.2523e-07 | 0 | 0 | 229,888 | 10.02 | 6,489/s |
40 | 3 | 64 | 13 | 5,7,1 | 0,0,T | 4.06063e-07 | 0 | 0 | 229,888 | 7.97 | 8,153/s |
40 | 3 | 128 | 13 | 5,7,1 | 0,0,T | 4.15227e-08 | 0 | 0 | 885,760 | 12.63 | 5,144/s |
Looking at the `validation sample error %` column, we can see that neither a large receptive field nor self-attention alone can solve the problem with a 3-layer network. Combining both, however, solves the problem completely.
With self-attention alone, the number of channels has a significant impact, though not enough to justify the increased computational demand. All networks with attention were run twice and the average is reported. The 512-channel version had validation sample errors of 31.4% and 43.6%.
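For concreteness, this is roughly what the best-performing configuration (3 conv layers, kernel size 13, dilations 5, 7, 1, attention only after the last layer, i.e. `0,0,T`) could look like, reusing the `QKSelfAttention` sketch from above. The actual text-to-text network from the previous experiments differs in its details, so this is only a sketch of the structure.

```python
import torch
import torch.nn as nn

class ConvAttnNet(nn.Module):
    """Sketch of the 'ks 13, dil 5,7,1, attn 0,0,T' setup: three dilated 1d
    convolutions with a QK-style attention after the last one, byte classes
    in, per-position class logits out. Details are assumptions."""
    def __init__(self, channels: int = 32, num_classes: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_classes, channels)
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, 13, dilation=d, padding=6 * d)
            for d in (5, 7, 1)
        ])
        self.attn = QKSelfAttention(channels)  # from the sketch above
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, length) of byte classes
        x = self.embed(tokens).transpose(1, 2)   # (batch, channels, length)
        for i, conv in enumerate(self.convs):
            x = torch.relu(conv(x))
            if i == len(self.convs) - 1:         # attention after the last layer only
                x = self.attn(x)
        return self.head(x)                      # (batch, num_classes, length)
```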