An Introduction to Transformers

Richard E. Turner
Department of Engineering, University of Cambridge, UK
Microsoft Research, Cambridge, UK
ret26@cam.ac.uk

Abstract. The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points [Vaswani et al., 2017]. The transformer has driven recent advances in natural language processing [Devlin et al., 2019], computer vision [Dosovitskiy et al., 2021] and spatio-temporal modelling [Bi et al., 2022]. There are many introductions to transformers, but most do not contain precise mathematical descriptions of the architecture, and the intuitions behind the design choices are often also missing. Moreover, as research takes a winding path, the explanations for the components of the transformer can be idiosyncratic. In this note we aim for a mathematically precise, intuitive, and clean description of the transformer architecture. We will not discuss training as this is rather standard. We assume that the reader is familiar with fundamental topics in machine learning including multi-layer perceptrons, linear transformations, softmax functions and basic probability.

[Sidenote: See Phuong and Hutter [2022] for an exception to this.]

1 Preliminaries

Let's start by talking about the form of the data that is input into a transformer, the goal of the transformer, and the form of its output.

1.1 Input data format: sets or sequences of tokens

In order to apply a transformer, data must be converted into a set or sequence of $N$ tokens $x_n^{(0)}$ of dimension $D$ (see figure 1). The tokens can be collected into a matrix $X^{(0)}$ which is $D \times N$. To give two concrete examples:

1. a passage of text can be broken up into a sequence of words or sub-words, with each word being represented by a single unique vector;
2. an image can be broken up into a set of patches and each patch can be mapped into a vector.

The embeddings can be fixed or they can be learned with the rest of the parameters of the model, e.g. the vectors representing words can be optimised, or a learned linear transform can be used to embed image patches (see figure 2).

A sequence of tokens is a generic representation to use as an input: many different types of data can be "tokenised", and transformers are then immediately applicable, rather than requiring a bespoke architecture for each modality as was previously the case (CNNs for images, RNNs for sequences, deep sets for sets, etc.). Moreover, this means that you don't need bespoke handcrafted architectures for mixing data of different modalities; you can just throw them all into a big set of tokens.

[Sidenote: The input to the transformer is $N$ vectors $x_n^{(0)}$, each of dimension $D$, collected together into an array $X^{(0)}$. Strictly speaking, the collection of tokens does not need to have an order, and the transformer can handle them as a set (where order does not matter) rather than a sequence; see section 3.]

[Sidenote: Many people use a transposed notation whereby the data matrix is $N \times D$, but I want sequences to run across the page and features down it in the schematics, a convention I use in other lecture notes.]

[Figure 2: Encoding an image: an example [Dosovitskiy et al., 2021]. An image is split into patches. Each patch is reshaped into a vector by the vec operator. This vector is acted upon by a matrix $W$ which maps the patch to a $D$-dimensional vector $x_n^{(0)}$. These vectors are collected together into the input $X^{(0)}$. The matrix $W$ can be learned with the rest of the transformer's parameters.]

1.2 Goal: representations of sequences

The transformer will ingest the input data $X^{(0)}$ and return a representation of the sequence in terms of another matrix $X^{(M)}$ which is also of size $D \times N$. The slice $x_n^{(M)} = X^{(M)}_{:,n}$ will be a vector of features representing the sequence at the location of token $n$. These representations can be used for auto-regressive prediction of the next, $(n+1)$th, token, global classification of the entire sequence (by pooling across the whole representation), sequence-to-sequence or image-to-image prediction problems, etc. Here $M$ denotes the number of layers in the transformer.
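To make the tokenisation in figure 2 concrete, here is a minimal NumPy sketch of turning an image into a $D \times N$ token matrix with a learned patch-embedding matrix $W$; the function name, shapes and random initialisation are illustrative assumptions, not code from the paper.

    import numpy as np

    def image_to_tokens(image, patch_size, W):
        """Split an image into patches and embed each one as a D-dimensional token.

        image: (H, W_img, C) array; W: (D, patch_size*patch_size*C) embedding matrix.
        Returns X0 of shape (D, N), one column per patch (cf. figure 2).
        """
        H, W_img, C = image.shape
        patches = []
        for i in range(0, H - patch_size + 1, patch_size):
            for j in range(0, W_img - patch_size + 1, patch_size):
                patch = image[i:i + patch_size, j:j + patch_size, :]
                patches.append(patch.reshape(-1))      # the "vec" operator
        P = np.stack(patches, axis=1)                  # (patch_dim, N)
        return W @ P                                   # (D, N)

    # Example: a 32x32 RGB image, 8x8 patches, token dimension D = 64.
    rng = np.random.default_rng(0)
    image = rng.normal(size=(32, 32, 3))
    W = 0.02 * rng.normal(size=(64, 8 * 8 * 3))
    X0 = image_to_tokens(image, patch_size=8, W=W)     # shape (64, 16): D x N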
2 The transformer block

The representation of the input sequence will be produced by iteratively applying a transformer block,

$X^{(m)} = \text{transformer-block}(X^{(m-1)})$.

The block itself comprises two stages: one operating across the sequence and one operating across the features. The first stage refines each feature independently according to relationships between tokens across the sequence, e.g. how much a word in a sequence at position $n$ depends on previous words at position $n'$, or how much two different patches from an image are related to one another. This stage acts horizontally, across rows of $X^{(m-1)}$. The second stage refines the features representing each token. This stage acts vertically, across a column of $X^{(m-1)}$. By repeatedly applying the transformer block, the representation at token $n$ and feature $d$ can be shaped by information at token $n'$ and feature $d'$.

[Sidenote: The idea of interleaving processing across the sequence and across features is a common motif of many machine learning architectures, including graph neural networks (which interleave processing across nodes and across features), Fourier neural operators (across space and across features), and bottleneck blocks in ResNets (across pixels and across features).]

2.1 Stage 1: self-attention across the sequence

The output of the first stage of the transformer block is another $D \times N$ array, $Y^{(m)}$. The output is produced by aggregating information across the sequence, independently for each feature, using an operation called attention.

Attention. Specifically, the output vector at location $n$, denoted $y_n^{(m)}$, is produced by a simple weighted average of the input features at locations $n' = 1, \ldots, N$:

$y_n^{(m)} = \sum_{n'=1}^{N} x_{n'}^{(m-1)} A_{n',n}^{(m)}. \quad (1)$

Here the weighting is given by a so-called attention matrix $A^{(m)}$, which is of size $N \times N$ and normalises over its columns, $\sum_{n'=1}^{N} A_{n',n}^{(m)} = 1$. Intuitively speaking, $A_{n',n}^{(m)}$ will take a high value for locations in the sequence $n'$ which are of high relevance for location $n$. For irrelevant locations, it will take a value close to 0. For example, all patches of a visual scene coming from a single object might have high corresponding attention values. We can compactly write the relationship as a matrix multiplication,

$Y^{(m)} = X^{(m-1)} A^{(m)}, \quad (2)$

and we illustrate it in figure 3.

[Sidenote: The need for transformers to store and compute $N \times N$ attention arrays can be a major computational bottleneck, which makes processing of long sequences challenging.]

[Figure 3: The output of an element of the attention mechanism, $Y_{d,n}^{(m)}$, is produced by the dot product of the input sliced horizontally through time, $X_{d,:}^{(m-1)}$, with a vertical slice from the attention matrix, $A_{:,n}^{(m)}$, so that $(D \times N) = (D \times N)(N \times N)$. The shading in the attention matrix represents elements with a high value in white and those with a low value, near to 0, in black.]
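A minimal NumPy sketch of equations 1 and 2: with the $D \times N$ convention used here, the attention average is just a matrix product with a column-normalised attention matrix. The function name and the uniform example matrix are illustrative, not from the paper.

    import numpy as np

    def apply_attention(X, A):
        """Equation 2: Y = X A, with X of shape (D, N) and A of shape (N, N).

        Column n of A holds the weights A[n', n] used to average the input
        tokens when forming the output token y_n (equation 1).
        """
        assert np.allclose(A.sum(axis=0), 1.0), "columns of A must sum to 1"
        return X @ A

    # A uniform attention matrix averages every token equally.
    D, N = 4, 5
    X = np.arange(D * N, dtype=float).reshape(D, N)
    A = np.full((N, N), 1.0 / N)
    Y = apply_attention(X, A)   # every column of Y is the mean token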
Self-attention. So far, so simple. But where does the attention matrix come from? The neat idea in the first stage of the transformer is that the attention matrix is generated from the input sequence itself: so-called self-attention.

A simple way of generating the attention matrix from the input would be to measure the similarity between two locations by the dot product between the features at those two locations, and then use a softmax function to handle the normalisation, i.e.

$A_{n',n} = \frac{\exp(x_n^\top x_{n'})}{\sum_{n''=1}^{N} \exp(x_n^\top x_{n''})}.$

[Sidenote: We temporarily suppress the superscripts here to ease the notation, so $A_{n',n}^{(m)}$ becomes $A_{n',n}$ and $x_{n'}^{(m-1)}$ becomes $x_{n'}$.]

[Sidenote: Relationship to Convolutional Neural Networks (CNNs). The attention mechanism can recover convolutional filtering as a special case: if the tokens form a 1D regularly sampled time-series and $A_{n',n} = A_{n'-n}$, then the attention mechanism becomes a convolution. Unlike in a normal CNN, these filters have full temporal support. Later we will see that the filters themselves dynamically depend on the input, another difference from standard CNNs. We will also see that, similarly to CNNs using multiple filters, transformers use multiple attention maps in each layer (though typically transformers have fewer attention maps than CNNs have channels).]

[Sidenote: When training transformers to perform auto-regressive prediction, e.g. predicting the next word in a sequence based on the previous ones, a clever modification to the model can be used to accelerate training and inference. This involves applying the transformer to the whole sequence and using masking in the attention mechanism ($A$ becomes an upper-triangular matrix) to prevent future tokens affecting the representation of earlier tokens. Causal predictions can then be made for the entire sequence in one forward pass through the transformer. See section 4 for more information.]

However, this naive approach entangles information about the similarity between locations in the sequence with the content of the sequence itself. An alternative is to perform the same operation on a linear transformation of the sequence, $U x_n$, so that

$A_{n',n} = \frac{\exp(x_n^\top U^\top U x_{n'})}{\sum_{n''=1}^{N} \exp(x_n^\top U^\top U x_{n''})}.$

Typically, $U$ will project to a lower dimensional space, i.e. $U$ is $K \times D$ with $K < D$. In this way only some of the features in the input sequence need be used to compute the similarity, the others being projected out, thereby decoupling the attention computation from the content. However, the numerator in this construction is symmetric. This could be a disadvantage. For example, we might want the word 'caulking iron' to be strongly associated with the word 'tool' (as it is a type of tool), but have the word 'tool' more weakly associated with the word 'caulking iron' (because most of us rarely encounter it).

Fortunately, it is simple to generalise the attention mechanism above to be asymmetric by applying two different linear transformations to the original sequence,

$A_{n',n} = \frac{\exp(q_n^\top k_{n'})}{\sum_{n''=1}^{N} \exp(q_n^\top k_{n''})}. \quad (3)$

The two quantities that are dot-producted together here, $q_n = U_q x_n$ and $k_{n'} = U_k x_{n'}$, are typically known as the queries and the keys, respectively.

Together, equations 2 and 3 define the self-attention mechanism. Notice that the $K \times D$ matrices $U_q$ and $U_k$ are the only parameters of this mechanism.

[Sidenote: Often you will see attention parameterised as $\exp(k_{n'}^\top q_n / \sqrt{K})$. Dividing the exponents by the square root of the dimensionality of the projected vectors helps numerical stability, but in this presentation we absorb this term into $U$ to improve clarity.]

[Sidenote: Some of this effect could be handled by the normalisation in the denominator, but asymmetric similarity allows more flexibility. However, I am not aware of experimental evidence isolating the benefit of using $U_q \neq U_k$.]

[Sidenote: Relationship to Recurrent Neural Networks (RNNs). It is illuminating to compare the temporal processing in the transformer to that of RNNs, which recursively update a hidden-state feature representation based on the current observation and the previous hidden state. Unrolling the RNN one step shows that observations which are nearby to the hidden state (e.g. $x_n$) are treated differently from observations that are further away (e.g. $x_{n-2}$), as information is propagated by recurrent application of the function $f(\cdot)$. In contrast, self-attention in the transformer treats all observations at all time points in an identical manner, no matter how far away they are. This is one reason why transformers find it simple to learn long-range relationships.]
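As a sketch of equation 3, the code below computes the query/key attention matrix and then applies equation 2; the explicit $1/\sqrt{K}$ scaling from the sidenote is kept for numerical stability, and all names and shapes are illustrative assumptions.

    import numpy as np

    def self_attention_matrix(X, U_q, U_k):
        """Equation 3: A[n', n] = softmax over n' of q_n . k_n'.

        X is (D, N); U_q and U_k are (K, D), so queries and keys live in a
        K-dimensional space. Returns A of shape (N, N) with columns summing to 1.
        """
        Q = U_q @ X                            # (K, N) queries, q_n  = U_q x_n
        K_ = U_k @ X                           # (K, N) keys,    k_n' = U_k x_n'
        scores = K_.T @ Q                      # scores[n', n] = k_n' . q_n
        scores /= np.sqrt(Q.shape[0])          # optional 1/sqrt(K) scaling (see sidenote)
        scores -= scores.max(axis=0, keepdims=True)   # numerical stability
        A = np.exp(scores)
        return A / A.sum(axis=0, keepdims=True)       # normalise each column

    def self_attention(X, U_q, U_k):
        """Stage-1 output Y = X A(X), combining equations 2 and 3."""
        return X @ self_attention_matrix(X, U_q, U_k)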
Multi-head self-attention (MHSA). In the self-attention mechanism described above, there is one attention matrix which describes the similarity of two locations within the sequence. This can act as a bottleneck in the architecture: it would be useful for pairs of points to be similar in some 'dimensions' and different in others.

In order to increase the capacity of the first self-attention stage, the transformer block applies $H$ sets of self-attention in parallel (termed $H$ heads) and then linearly projects the results down to the $D \times N$ array required for further processing. This slight generalisation is called multi-head self-attention:

$Y^{(m)} = \text{MHSA}_\theta(X^{(m-1)}) = \sum_{h=1}^{H} V_h^{(m)} X^{(m-1)} A_h^{(m)}, \quad (4)$

where

$\left[A_h^{(m)}\right]_{n',n} = \frac{\exp\!\big((q_n^{(h,m)})^\top k_{n'}^{(h,m)}\big)}{\sum_{n''=1}^{N} \exp\!\big((q_n^{(h,m)})^\top k_{n''}^{(h,m)}\big)}, \quad (5)$

$q_n^{(h,m)} = U_{q,h}\, x_n^{(m-1)} \quad \text{and} \quad k_n^{(h,m)} = U_{k,h}\, x_n^{(m-1)}. \quad (6)$

Here the $H$ matrices $V_h^{(m)}$, which are $D \times D$, project the $H$ self-attention stages down to the required output dimensionality $D$.

The addition of the matrices $V_h^{(m)}$, and the fact that the diagonal elements of the attention matrix $A^{(m)}$ interact the signal instantaneously with itself, do mean there is some cross-feature processing in multi-head self-attention, as opposed to it containing purely cross-sequence processing. However, this stage has limited capacity for that type of processing, and it is the job of the second stage to address it.

[Sidenote: If the attention matrices are viewed as a data-driven version of the filters in a CNN, then the need for more filters / channels is clear. Typical choices for the number of heads $H$ are 8 or 16, lower than typical numbers of channels in a CNN.]

[Sidenote: The computational cost of multi-head self-attention is usually dominated by the matrix multiplication involving the attention matrix and is therefore $O(HDN^2)$.]

[Sidenote: The product of the matrices $V_h^{(m)} X^{(m-1)}$ is related to the so-called values which are normally introduced in descriptions of self-attention alongside queries and keys. In the usual presentation, there is a redundancy between the linear transform used to compute the values and the linear projection at the end of multi-head self-attention, so we have not explicitly introduced them here. The standard presentation can be recovered by setting $V_h$ to be a low-rank matrix, $V_h = U_{\text{proj},h}^\top U_{v,h}$, where both factors are $K \times D$. Typically $K$ is set to $K = D/H$ so that changing the number of heads leads to models with similar numbers of parameters and computational demands.]

[Figure 4: Multi-head self-attention applies $H$ self-attention operations in parallel and then linearly projects the $HD \times N$ dimensional output down to $D \times N$ by applying a linear transform, implemented here by the $H$ matrices $V_h$.]

Figure 4 shows multi-head self-attention schematically. Multi-head attention comprises the parameters $\theta = \{U_{q,h}, U_{k,h}, V_h\}_{h=1}^{H}$, i.e. $3H$ matrices of size $K \times D$, $K \times D$, and $D \times D$ respectively.
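A sketch of equation 4, reusing self_attention_matrix from the earlier sketch; per-head projection matrices $V_h$ are full $D \times D$ matrices as in the presentation above, and $K = D/H$ follows the sidenote. Names, shapes and the random initialisation are illustrative assumptions.

    import numpy as np

    def mhsa(X, U_qs, U_ks, Vs):
        """Multi-head self-attention (equation 4): Y = sum_h V_h X A_h.

        X: (D, N). U_qs, U_ks: lists of H matrices of shape (K, D).
        Vs: list of H matrices of shape (D, D) projecting each head's output.
        """
        D, N = X.shape
        Y = np.zeros((D, N))
        for U_q, U_k, V in zip(U_qs, U_ks, Vs):
            A_h = self_attention_matrix(X, U_q, U_k)   # (N, N), from the sketch above
            Y += V @ (X @ A_h)
        return Y

    # Example with H = 4 heads and K = D // H.
    rng = np.random.default_rng(1)
    D, N, H = 16, 10, 4
    K = D // H
    X = rng.normal(size=(D, N))
    U_qs = [0.1 * rng.normal(size=(K, D)) for _ in range(H)]
    U_ks = [0.1 * rng.normal(size=(K, D)) for _ in range(H)]
    Vs = [0.1 * rng.normal(size=(D, D)) for _ in range(H)]
    Y = mhsa(X, U_qs, U_ks, Vs)   # (D, N)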
2.2 Stage 2: multi-layer perceptron across features

The second stage of processing in the transformer block operates across features, refining the representation using a non-linear transform. To do this, we simply apply a multi-layer perceptron (MLP) to the vector of features at each location $n$ in the sequence,

$x_n^{(m)} = \text{MLP}_\theta(y_n^{(m)})$.

Notice that the parameters of the MLP, $\theta$, are the same for each location $n$.

[Sidenote: The MLPs used typically have one or two hidden layers with dimension equal to the number of features $D$ (or larger). The computational cost of this step is therefore roughly $N \times D \times D$. If the feature embedding size approaches the length of the sequence, $D \approx N$, the MLPs can start to dominate the computational complexity (e.g. this can be the case for vision transformers which embed large patches).]

2.3 The transformer block: putting it all together with residual connections and layer normalisation

We can now stack MHSA and MLP layers to produce the transformer block. Rather than doing this directly, we make use of two ubiquitous transformations to produce a more stable model that trains more easily: residual connections and normalisation.

Residual connections. The use of residual connections is widespread across machine learning as they make initialisation simple, have a sensible inductive bias towards simple functions, and stabilise learning [Szegedy et al., 2017]. Instead of directly specifying a function $x^{(m)} = f_\theta(x^{(m-1)})$, the idea is to parameterise it in terms of an identity mapping and a residual term,

$x^{(m)} = x^{(m-1)} + \text{res}_\theta(x^{(m-1)})$.

Equivalently, this can be viewed as modelling the differences between the representations, $x^{(m)} - x^{(m-1)} = \text{res}_\theta(x^{(m-1)})$, and it will work well when the function that is being modelled is close to the identity. This type of parameterisation is used for both the MHSA and MLP stages in the transformer, with the idea that each applies a mild non-linear transformation to the representation. Over many layers, these mild non-linear transformations compose to form large transformations.

Token normalisation. The use of normalisation, such as LayerNorm and BatchNorm, is also widespread across the deep learning community as a means to stabilise learning. There are many potential choices for how to compute normalisation statistics (see figure 5 for a discussion), but the standard approach is to use LayerNorm [Ba et al., 2016], which normalises each token separately, removing the mean and dividing by the standard deviation:

$\bar{x}_{d,n} = \frac{x_{d,n} - \text{mean}(x_n)}{\sqrt{\text{var}(x_n)}}\, \gamma_d + \beta_d = \text{LayerNorm}(X)_{d,n},$

where $\text{mean}(x_n) = \frac{1}{D}\sum_{d=1}^{D} x_{d,n}$ and $\text{var}(x_n) = \frac{1}{D}\sum_{d=1}^{D} (x_{d,n} - \text{mean}(x_n))^2$. The two parameters $\gamma_d$ and $\beta_d$ are a learned scale and shift.

[Sidenote: This is also known as z-scoring in some fields and is related to whitening.]

[Figure 5: Transformers perform layer normalisation (left-hand schematic), which normalises the mean and standard deviation of each individual token in each sequence in the batch. Batch normalisation (right-hand schematic), which normalises over the feature and batch dimensions together, is found to be far less stable [Shen et al., 2020]. Other flavours of normalisation are possible and potentially under-explored, e.g. instance normalisation would normalise across the sequence dimension instead.]

As this transform normalises each token individually, and as LayerNorm is applied differently in CNNs (see figure 6), I would prefer to call this normalisation TokenNorm.

[Figure 6: In CNNs, LayerNorm is conventionally applied to both the features and across the feature maps (i.e. across the height and width of the images) (left-hand schematic). As the height and width dimensions in a CNN correspond to the sequence dimension $1, \ldots, N$ of transformers, the term 'LayerNorm' is arguably used inconsistently (compare to figure 5). I would prefer to call the normalisation used in transformers 'token normalisation' instead to avoid confusion. Batch normalisation (right-hand schematic) is consistently defined.]

This transform stops feature representations blowing up in magnitude as non-linearities are repeatedly applied through neural networks. In transformers, LayerNorm is usually applied in the residual terms of both the MHSA and MLP stages.

[Sidenote: Whilst it is possible to constrain the non-linearities and weights in neural networks to prevent explosion of the representation, the constraints this places on the activation functions can adversely affect learning. The LayerNorm approach is arguably simpler and simpler to train.]
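A minimal NumPy sketch of the token-wise normalisation and of wrapping a stage in a residual connection; gamma, beta, the epsilon constant and res_fn are illustrative placeholders rather than the paper's notation.

    import numpy as np

    def layer_norm(X, gamma, beta, eps=1e-5):
        """Token-wise normalisation ("TokenNorm"): each column of X (one token)
        has its mean removed and is divided by its standard deviation, then is
        scaled and shifted by the learned length-D vectors gamma and beta."""
        mean = X.mean(axis=0, keepdims=True)   # per-token mean over features
        var = X.var(axis=0, keepdims=True)     # per-token variance over features
        X_hat = (X - mean) / np.sqrt(var + eps)
        return gamma[:, None] * X_hat + beta[:, None]

    def residual(X, res_fn):
        """x^(m) = x^(m-1) + res(x^(m-1)): identity mapping plus a residual term."""
        return X + res_fn(X)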
Putting this all together, we have the standard transformer block, shown schematically in figure 7:

$Y^{(m)} = X^{(m-1)} + \text{MHSA}_\theta(\text{LayerNorm}(X^{(m-1)}))$,
$X^{(m)} = Y^{(m)} + \text{MLP}_\theta(\text{LayerNorm}(Y^{(m)}))$.

[Figure 7: The transformer block. Residual connections are added to the multi-head self-attention (MHSA) stage and the multi-layer perceptron (MLP) stage. Layer normalisation is applied to the inputs of both the MHSA and the MLP. The two stages are then stacked, and this block can be repeated $M$ times.]

[Sidenote: The exact configuration of the normalisation and residual layers can differ, but here we show a standard setup [Xiong et al., 2020].]

[Sidenote: Relationship to Graph Neural Networks (GNNs). At a high level, graph neural networks interleave two steps. First, a message-passing step, where each node receives messages from its neighbours which are then aggregated together. Second, a feature-processing step, where the incoming aggregated messages are used to update each node's features. Through this lens, the transformer can be viewed as an unrolled GNN, with each token corresponding to a node of a fully connected graph. MHSA forms the message-passing step, and the MLPs form the feature-update step. Each transformer block corresponds to one update of the GNN. Moreover, many methods for scaling transformers introduce sparse forms of attention where each token attends to only a restricted set of other tokens; that is, they specify a sparse graph connectivity structure. Arguably, in this way transformers are more general, as they can use different graphs at different layers.]
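Putting the pieces together in code, here is a sketch of the pre-norm block in figure 7, reusing the mhsa and layer_norm functions sketched above together with a two-layer MLP; the parameter containers and the ReLU non-linearity are illustrative assumptions.

    import numpy as np

    def mlp(X, W1, b1, W2, b2):
        """Two-layer MLP applied independently to each token (each column of X)."""
        H = np.maximum(0.0, W1 @ X + b1[:, None])   # ReLU hidden layer
        return W2 @ H + b2[:, None]

    def transformer_block(X, attn_params, norm1, norm2, mlp_params):
        """One block of figure 7 (pre-norm): residual MHSA stage, then residual MLP stage.

        attn_params = (U_qs, U_ks, Vs); norm1, norm2 = (gamma, beta); mlp_params = (W1, b1, W2, b2).
        """
        Y = X + mhsa(layer_norm(X, *norm1), *attn_params)    # stage 1: across the sequence
        return Y + mlp(layer_norm(Y, *norm2), *mlp_params)   # stage 2: across the features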
3 Position encoding

The transformer treats the data as a set: if you permute the columns of $X^{(0)}$ (i.e. re-order the tokens in the input sequence), you permute all the representations throughout the network $X^{(m)}$ in the same way. This is key for many applications, since there may not be a natural way to order the original data into a sequence of tokens. For example, there is no single 'correct' order in which to map image patches into a one-dimensional sequence.

However, this presents a problem, since positional information is key in many problems and the transformer has thrown it out. The sequence 'herbivores eat plants' should not have the same representation (up to permutation) as 'plants eat herbivores'. Nor should an image have the same representation as one comprising the same patches randomly permuted.

Thankfully, there is a simple fix for this: the location of each token within the original dataset should be included in the token itself, or through the way it is processed. There are several options for how to do this. One is to include this information directly in the embedding $X^{(0)}$, e.g. by simply adding the position embedding to the token embedding (surprisingly, this works!) or by concatenating the two. The position information can be fixed, e.g. adding a set of sinusoids of different frequencies and phases to encode the position of a word in a sentence [Vaswani et al., 2017], or it can be a free parameter which is learned [Devlin et al., 2019], as is often done in image transformers. There are also approaches that include relative distance information between pairs of tokens by modifying the self-attention mechanism [Wu et al., 2021], which connects to equivariant transformers.

[Sidenote: Vision transformers [Dosovitskiy et al., 2021] use $x_n^{(0)} = W p_n + e_n$, where $p_n$ is the $n$th vectorised patch, $e_n$ is the learned position embedding, and $W$ is the patch embedding matrix. Arguably it would be more intuitive to append the position embedding to the patch embedding. However, if we use the concatenation approach and consider what happens after applying a linear transform, $v_n = [W_p \; W_e]\,[p_n; e_n] = W_p p_n + W_e e_n$, we recover the additive construction, which is one hint as to why the additive construction works.]
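As a sketch of the fixed additive construction, the code below builds sinusoidal position encodings in the style of Vaswani et al. [2017] and adds them to the token embeddings; the base of 10000 and the even/odd sin/cos layout are one conventional choice (assumed here), and D is assumed even.

    import numpy as np

    def sinusoidal_positions(D, N):
        """Fixed position encodings, one D-dimensional column per position n.

        Pairs of features hold sin/cos at geometrically spaced frequencies.
        Assumes D is even; the 10000 base is a conventional choice."""
        positions = np.arange(N)[None, :]                       # (1, N)
        freqs = 1.0 / 10000 ** (2 * np.arange(D // 2) / D)      # (D/2,)
        angles = freqs[:, None] * positions                     # (D/2, N)
        E = np.zeros((D, N))
        E[0::2, :] = np.sin(angles)
        E[1::2, :] = np.cos(angles)
        return E

    # Additive construction: X0 holds the D x N token embeddings.
    # X0 = X0 + sinusoidal_positions(D, N)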
4 Application specific transformer variants

For completeness, we will give some simple examples of how the standard transformer architecture above is used and modified for specific applications. This includes adding a head to the transformer blocks to carry out the desired prediction task, but also modifications to the standard construction of the body.

4.1 Auto-regressive language modelling

In auto-regressive language modelling, the goal is to predict the next word $w_n$ in the sequence given the previous words $w_{1:n-1}$, that is, to return $p(w_n = w \mid w_{1:n-1})$. Two modifications are required to use the transformer for this task: a change to the body to make the architecture efficient, and the addition of a head to make the predictions for the next word.

Modification to the body: auto-regressive masking. Applying the version of the transformer we have covered so far to auto-regressive prediction is computationally expensive, both during training and testing. To see this, note that AR prediction requires making a sequence of predictions: you start by predicting the first word $p(w_1 = w)$, then you predict the second given the first $p(w_2 = w \mid w_1)$, then the third word given the first two $p(w_3 = w \mid w_1, w_2)$, and so on until you predict the last item in the sequence $p(w_N = w \mid w_{1:N-1})$. This requires applying the transformer $N-1$ times with input sequences that grow by one word each time: $w_1, w_{1:2}, \ldots, w_{1:N-1}$. This is very costly at both training-time and test-time.

Fortunately, there is a neat way around this by enabling the transformer to support incremental updates, whereby if you add a new token to an existing sequence, you do not change the representation of the old tokens. To make this property clear, I will define it mathematically: let the output of the incremental transformer applied to the first $n$ words be denoted

$X^{(1:n)} = \text{transformer-incremental}(w_{1:n})$.

Then the output of the incremental transformer when applied to $n+1$ words is

$X^{(1:n+1)} = \text{transformer-incremental}(w_{1:n+1})$.

In the incremental transformer, $X^{(1:n)} = X^{(1:n+1)}_{:,1:n}$, i.e. the representation of the old tokens is not changed by adding the new one. If we have this property, then 1. at test-time, auto-regressive generation can use incremental updates to compute the new representation efficiently, and 2. at training-time, we can make the $N$ auto-regressive predictions for the whole sequence, $p(w_1 = w)\,p(w_2 = w \mid w_1)\,p(w_3 = w \mid w_1, w_2)\cdots p(w_N = w \mid w_{1:N-1})$, in a single forwards pass.

[Sidenote: Note that I'm overloading the notation here: previously superscripts denoted layers in the transformer, but here I'm using them to denote the number of items in the input sequence.]

Unfortunately, the standard transformer introduced above does not have this property, due to the form of the attention used. Every token attends to every other token, so if we add a new token to the sequence then the representation of every token changes throughout the transformer. However, if we mask the attention matrix so that it is upper-triangular, $A_{n',n} = 0$ when $n' > n$, then the representation of each word only depends on the previous words. This then gives us the incremental property, as none of the other operations in the transformer operate across the sequence.

[Sidenote: Notice that this masking operation also encodes position information, since you can infer the order of the tokens from the mask.]

[Sidenote: This restriction to the attention will cause a loss of representational power. It is an open question how significant this is and whether increasing the capacity of the model can mitigate it, e.g. by using higher dimensional tokens.]

Adding a head. We're now almost set to perform auto-regressive language modelling. We apply the masked transformer block $M$ times to the input sequence of words. We then take the representation at token $n-1$, that is $x_{n-1}^{(M)}$, which captures the causal information in the sequence up to this point, and generate the probability of the next word through a softmax operation,

$p(w_n = w \mid w_{1:n-1}) = \frac{\exp(g_w^\top x_{n-1}^{(M)})}{\sum_{w'=1}^{W} \exp(g_{w'}^\top x_{n-1}^{(M)})}.$

Here $W$ is the vocabulary size, the $w$th word is $w$, and $\{g_w\}_{w=1}^{W}$ are softmax weights that will be learned.
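A sketch of the two ingredients above: a causally masked version of the attention matrix from the earlier sketch, and the softmax head for the next-word distribution. The masking-by-minus-infinity trick and the matrix G of stacked softmax weights are illustrative assumptions.

    import numpy as np

    def causal_self_attention_matrix(X, U_q, U_k):
        """Masked version of equation 3: A[n', n] = 0 whenever n' > n, so the
        output at position n only depends on tokens 1..n (upper-triangular A)."""
        Q, K_ = U_q @ X, U_k @ X
        scores = K_.T @ Q / np.sqrt(Q.shape[0])            # scores[n', n] = k_n' . q_n
        N = X.shape[1]
        n_prime, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
        scores = np.where(n_prime > n, -np.inf, scores)    # mask out future tokens
        scores -= scores.max(axis=0, keepdims=True)
        A = np.exp(scores)
        return A / A.sum(axis=0, keepdims=True)

    def next_word_probs(x_n, G):
        """Softmax head: p(w_{n+1} = w | w_{1:n}) from the representation x_n^(M).

        G is a (W, D) matrix whose rows are the softmax weights g_w."""
        logits = G @ x_n
        logits -= logits.max()                             # numerical stability
        p = np.exp(logits)
        return p / p.sum()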
4.2 Image classification

For image classification, the goal is to predict the label $y$ given the input image which has been tokenised into the sequence $X^{(0)}$, that is, $p(y \mid X^{(0)})$. One way of computing this distribution would be to apply the standard transformer body $M$ times to the tokenised image patches before aggregating the final layer of the transformer, $X^{(M)}$, across the sequence, e.g. by spatial pooling, $h = \frac{1}{N}\sum_{n=1}^{N} x_n^{(M)}$, in order to form a feature representation for the entire image. The representation $h$ could then be used to perform softmax classification. An alternative approach is found to perform better [Dosovitskiy et al., 2021]. Instead, we introduce a new fixed (learned) token at the start, $n = 0$, of the input sequence, $x_0^{(0)}$. At the head we use the $n = 0$ vector of the final layer, $x_0^{(M)}$, to perform the softmax classification. This approach has the advantage that the transformer maintains and refines a global representation of the sequence at each layer $m$ of the transformer that is appropriate for classification.

4.3 More complex uses

The transformer block can also be used as part of more complicated systems, e.g. in encoder-decoder architectures for sequence-to-sequence modelling for translation [Devlin et al., 2019, Vaswani et al., 2017], or in masked auto-encoders for self-supervised vision systems [He et al., 2021].

5 Conclusion

This concludes this basic introduction to transformers, which aspired to be mathematically precise and to provide intuitions behind the design decisions. We have not talked about loss functions or training in any detail, but this is because rather standard deep learning approaches are used for these. Briefly, transformers are typically trained using the Adam optimiser. They are often slow to train compared to other architectures and typically get more unstable as training progresses. Gradient clipping, decaying learning rate schedules, and increasing batch sizes through training help to mitigate these instabilities, but often they still persist.

Acknowledgements. We thank Dr. Max Patacchiola, Sasha Shysheya, John Bronskill, Runa Eschenhagen and Jess Riedel for feedback on previous versions of this note. Richard E. Turner is supported by Microsoft, Google, Amazon, ARM, Improbable and EPSRC grant EP/T005386/1.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast. 2022.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. 2021.

Mary Phuong and Marcus Hutter. Formal algorithms for transformers. arXiv preprint arXiv:2207.09238, 2022.

Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. PowerNorm: Rethinking batch normalization in transformers. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8741-8751. PMLR, 13-18 Jul 2020. URL https://proceedings.mlr.press/v119/shen20e.html.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

K. Wu, H. Peng, M. Chen, J. Fu, and H. Chao. Rethinking and improving relative position encoding for vision transformer. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10013-10021, Los Alamitos, CA, USA, October 2021. IEEE Computer Society. doi: 10.1109/ICCV48922.2021.00988.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10524-10533. PMLR, 13-18 Jul 2020. URL https://proceedings.mlr.press/v119/xiong20b.html.
