Transformers in ML: What They Are and How They Work

Transformers are often mentioned together with contemporary foundational models that are trained on large quantities of data. Think of GPT-3 and the new release of GPT-4, which is one of the largest language models ever created.

GPT-3 has 175 billion parameters and was trained on a massive corpus of text equating to 570GB. OpenAI has not disclosed the number of parameters in GPT-4, but according to the company, it is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5.

Without transformers, these models wouldn’t be possible. In this article, we’ll talk about what transformers are, how they work, and why they are so important for technology and business.

What are transformers?

Transformers are a type of neural network just like recurrent neural networks (RNNs) or convolutional neural networks (CNNs).

There are 3 key elements that make transformers so powerful:

  1. Self-attention

  2. Positional embeddings

  3. Multi-head attention

All three were brought together in 2017 in the “Attention Is All You Need” paper by Vaswani et al. In that paper, the authors proposed a completely new way of approaching deep learning tasks such as machine translation, text generation, and sentiment analysis.

The self-attention mechanism enables the model to detect connections between different elements of a sequence, even if they are far from each other, and to assess the importance of those connections, which improves its understanding of the context.

According to Vaswani, “Meaning is a result of relationships between things, and self-attention is a general way of learning relationships.”

Thanks to positional embeddings and multi-head attention, transformers can process all elements of a sequence simultaneously, which means that model training can be sped up through parallelization. This is a huge benefit of transformers over architectures like RNNs, and it has enabled the creation of large language models.

For example, BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer) are, as you might have guessed, variants of the original transformer architecture.

These models have achieved state-of-the-art performance on a wide range of NLP tasks and have helped to establish the transformer as the dominant architecture for NLP. As they evolve, we’ll see even more exciting advances in the years to come.

How do transformers work?

The key to understanding why transformers are used much more frequently than other neural networks is to look at what lies inside them.

Word embeddings

Words are awkward to use in a machine learning model, so they need to be transformed into something that a machine learning model can operate with – vectors.

This is done with the help of word embeddings, which are vector representations of words.

A great small-scale example of word embeddings is the mapping of colors to RGB color values. For each color name, we can assign an RGB value. Similar RGB values represent similar colors, while colors with wildly different RGB values are different.

Similarly, all the words in the English language can, it turns out, be assigned a vector value that encodes their meaning – the meaning being their relationships with other words in the English language.
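The color analogy can be sketched in a few lines of Python. This is only an illustration: the table `COLOR_EMBEDDINGS` and the helpers `distance` and `nearest` are made-up names, and real word embeddings have hundreds of learned dimensions rather than three hand-picked ones.

```python
import math

# Hypothetical "embedding table": each color name maps to its RGB vector.
COLOR_EMBEDDINGS = {
    "red": (255, 0, 0),
    "crimson": (220, 20, 60),
    "navy": (0, 0, 128),
    "blue": (0, 0, 255),
}


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(name):
    """Return the other color whose vector lies closest to `name`."""
    target = COLOR_EMBEDDINGS[name]
    others = (n for n in COLOR_EMBEDDINGS if n != name)
    return min(others, key=lambda n: distance(target, COLOR_EMBEDDINGS[n]))


print(nearest("crimson"))  # red
print(nearest("navy"))     # blue
```

Looking up the nearest neighbor of a vector is the same operation people use to find related words in a word embedding space; only the vectors are different.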

The most famous algorithm for doing this is Word2vec. It takes a corpus of data and learns word embeddings in an unsupervised manner from the raw data.

If you would like to learn about word vectors in more detail, we recommend this great article by Adrian Colyer.

Positional embeddings

In NLP, sequences of text (like sentences or documents) are typically represented as sequences of word embeddings. However, a word embedding doesn’t capture the position of the word in the sequence.

This is where positional embeddings come into play: they encode the position of each token in a sequence, and so add this positional information to the word embeddings.

By incorporating positional embeddings into the input representation of an NLP model, the model is able to capture the order and position of words in a sequence. This is important for models like transformers, which do not process words one after another.
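As a rough sketch of how position information can be added, here is the fixed sinusoidal scheme described in “Attention Is All You Need”. It is one option among several (positions can also be learned), and the function names `positional_encoding` and `add_positions` are chosen here for illustration only.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return one vector of d_model floats for each position 0..seq_len-1."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: sine on the even
            # index, cosine on the odd index.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(word_vectors):
    """Add the position vector to each word vector, element by element."""
    pe = positional_encoding(len(word_vectors), len(word_vectors[0]))
    return [[w + p for w, p in zip(wv, pv)] for wv, pv in zip(word_vectors, pe)]


# The same word embedding placed at positions 0 and 1.
same_word = [0.5, 0.5, 0.5, 0.5]
encoded = add_positions([same_word, same_word])
print(encoded[0])                # [0.5, 1.5, 0.5, 1.5]
print(encoded[0] != encoded[1])  # True
```

Although both tokens start from an identical embedding, the model receives different vectors for them, so the order of the words is no longer lost.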


Transformers use a seq2seq neural network architecture that transforms an input sequence of vectors into an output sequence by passing it through a series of encoder and decoder layers.

[Figure: encoder/decoder architecture]

The goal of each encoder layer is to extract features from a sequence, while the goal of each decoder layer is to use the features to produce an output sequence.

In transformers, both encoding and decoding are done by using attention.


Attention helps the model to understand the context of a word by considering words that go before and after it.

For example, the word “bark” can refer to the sound a dog makes, but it can also mean “tree bark”. The other words in the sentence help the computer to tell these meanings apart and translate accordingly:

“I could hear the dog bark.” Somewhere next to this representation of the word “bark” you would find the words “dog”, “loudly”, “car”.

[Figure: French translation of "I could hear the dog bark"]

“The tree bark was wet after the storm.” Somewhere next to this representation you would find the words “brown”, “rough”, “tree”, “park”.

[Figure: French translation of "the tree bark was wet after the storm"]


Attention and self-attention are both important to the transformer architecture, and the terms shouldn’t be used interchangeably.

The attention mechanism in the general sense lets the model relate one sequence to another: in a transformer, the decoder attends to the output of the encoder while it generates the output sequence. The self-attention mechanism relates the elements of a single sequence to each other, for example, the sequence that is being encoded.
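One way to see the difference is to look at where the inputs of the computation come from. The sketch below uses scaled dot-product attention and, to stay short, leaves out the learned projection matrices that a real transformer applies first. The names `attend`, `encoder_states`, and `decoder_states` are illustrative assumptions, and the numbers are invented.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens
decoder_states = [[0.5, 0.5], [1.0, 0.0]]              # 2 target tokens so far

# Self-attention: queries, keys, and values all come from one sequence.
self_out = attend(encoder_states, encoder_states, encoder_states)

# Attention across sequences: queries come from the decoder,
# keys and values come from the encoder.
cross_out = attend(decoder_states, encoder_states, encoder_states)

print(len(self_out), len(cross_out))  # 3 2
```

The function is the same in both cases. What changes is which sequence asks the questions (queries) and which one is being looked at (keys and values).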

The self-attention mechanism allows the model to weigh the importance of different parts of the input sequence against each other. This is done by computing a set of attention weights that indicate the relevance of each element in the input sequence to every other element. By doing this, the model is able to effectively capture long-range dependencies in the input sequence and learn to recognize patterns that span multiple elements.
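To make the idea of attention weights concrete, here is a minimal sketch that computes them for three toy token vectors. It assumes plain scaled dot-product scoring without the learned query, key, and value projections of a real transformer; `attention_weights`, `self_attention`, and the vectors themselves are made up for the example.

```python
import math


def attention_weights(vectors):
    """Return an n x n matrix; row i says how much token i attends to each token."""
    scale = math.sqrt(len(vectors[0]))
    matrix = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


def self_attention(vectors):
    """Replace each token vector with a weighted average of all token vectors."""
    dim = len(vectors[0])
    return [
        [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(dim)]
        for row in attention_weights(vectors)
    ]


# Tokens 0 and 1 point in a similar direction; token 2 is unrelated to token 0.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = attention_weights(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row sums to 1, and token 0 gives more weight to the similar token 1 than to token 2. Nothing in the computation depends on how far apart two tokens are in the sequence, which is why long-range dependencies are as easy to pick up as short-range ones.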

Multi-head attention

Multi-head attention allows the network to learn multiple ways of weighing the input sequence against itself. The vectors that represent tokens are projected into several lower-dimensional parts, one per head (8 heads in the original paper), and each head goes through the same attention computation as before. The results of all heads are then concatenated to form an output of the same size as the input.

The heads are independent of each other, so this process can be parallelized, which speeds up training. It also allows the model to learn the context of the words better.
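The split, attend, and concatenate steps can be sketched as follows. This is a simplification under a stated assumption: each head simply takes a slice of the token vector, whereas real implementations use learned linear projections per head plus a final output projection. The names `attend` and `multi_head_attention` are illustrative.

```python
import math


def attend(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def multi_head_attention(vectors, num_heads):
    """Run self-attention separately on each slice of the vectors, then concatenate."""
    dim = len(vectors[0])
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by the number of heads")
    head_dim = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        start, end = h * head_dim, (h + 1) * head_dim
        part = [v[start:end] for v in vectors]  # this head's view of every token
        head_outputs.append(attend(part, part, part))
    # Concatenate the heads back together, token by token.
    return [
        [x for head in head_outputs for x in head[t]]
        for t in range(len(vectors))
    ]


tokens = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
result = multi_head_attention(tokens, num_heads=2)
print(len(result), len(result[0]))  # 3 4
```

The loop over heads has no dependency between iterations, so in practice all heads are computed at once as a single batched matrix operation.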

Where are transformers used?

In the previous sections, we looked at how transformers are used for natural language processing tasks. However, they have also found application in fields like computer vision and speech recognition. Because they can capture and use context, transformers have significantly improved results in these fields.

Computer vision

In computer vision, transformers are used for object detection, image classification, and image segmentation tasks. For example, they are used in healthcare to perform panoptic segmentation (used for tumor identification).

The most common transformer model for CV is the Vision Transformer (ViT). It uses self-attention mechanisms to extract features from the input image and has achieved state-of-the-art results in various tasks.
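Self-attention works on sequences, so an image first has to be turned into one. ViT does this by cutting the image into fixed-size patches and treating each flattened patch like a token. The sketch below shows only that first step on a tiny 4x4 grid of made-up pixel values; `image_to_patches` is a hypothetical helper, and the learned patch projection and positional embeddings that follow in the real model are left out.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened square patches, row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
patches = image_to_patches(image, 2)
print(patches)
# [[1, 2, 5, 6], [3, 4, 7, 8], [9, 10, 13, 14], [11, 12, 15, 16]]
```

The four patches now form a sequence of length 4 that can be processed with the same attention machinery as a sentence of four words.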

Speech recognition

Transformers can help with tasks such as automatic speech recognition and speaker recognition. For example, transformers are used in the automatic speech recognition technology behind voice assistants.

Speech-Transformer (SR) is a transformer model that can be used for this type of task. Another one is Conformer, which uses convolutional neural networks together with self-attention mechanisms to extract features from audio input.

Question answering

Transformers have allowed us to build universal natural language processing and natural language generation systems that can use background information to answer questions. The most widely known example of such a system is ChatGPT, which can answer questions from basically any field.

Text classification

Transformers are used to build software systems that classify and summarize texts, which is useful for scholars, lawyers, and students. Transformers can capture the meaning of a text and maintain that meaning when generating new text. ChatGPT is also an example of a system that can classify and summarize texts.

Foundational models

Transformers can be trained on large amounts of data using parallel computing as long as you have the necessary resources.

This has led to the development of large transformer models, such as BERT and GPT-4, that are trained on colossal quantities of data and are capable of doing general tasks. These foundational models can then be fine-tuned for specific downstream tasks with minimal additional training data.

Benefits of using transformers

There are multiple reasons why transformers today are so popular across many industries.

Improved accuracy

One of the most significant benefits of using transformers is the improved accuracy in different AI tasks. This is due to the ability of transformers to learn contextual relationships within the input data, which allows for more accurate predictions.

Transformers use self-attention mechanisms that enable them to focus on the most relevant parts of the input data, making them more effective in capturing the underlying relationships between input and output.

Ability to process large amounts of data

Transformers have the ability to process large amounts of data, making them ideal for handling big data problems. This is due to the parallel processing of input data, which allows for quick and efficient computation on large datasets. Transformers’ ability to work with large amounts of data and scale to billions of parameters is useful across many industries from healthcare to manufacturing.

Transfer learning

Transformers can be used to create general models that can be fine-tuned for specific tasks. This enables the use of transfer learning, where the pre-trained models can be used for various tasks, reducing the need for large amounts of data and training time.


Transformers are a powerful type of neural network architecture that has become the dominant approach for natural language processing tasks and, increasingly, for other domains as well.

Today, transformers are used for computer vision, speech recognition, and other applications where it is necessary to process large amounts of data quickly and keep track of the context. In the future, we’re likely to see even more exciting applications that are built on top of transformers.
