Multimodal Learning in ML

When I started writing about AI in 2019, my first article was dedicated to what machine learning is and how it works. It explained the difference between weak and strong (or general) AI and mentioned that we have no idea how to create the latter.

Well, in 2023, maybe we actually (finally!) do. I’m talking about multimodal learning.

Don’t get me wrong – multimodal learning was not a new concept even in 2019. Back then, TechRepublic called it the “future of AI”, and ABI Research predicted that multimodal learning would be key for self-driving cars, robotics, consumer devices, and healthcare. But today, thanks to powerful hardware and cloud technologies, we can actually realize the benefits of multimodal learning at full capacity.

In this article, I will talk about multimodal learning, how it works, and why it is important.

What is multimodal learning?

Multimodal learning in machine learning is a type of learning where the model is trained to understand and work with multiple forms of input data, such as text, images, and audio.

These different types of data correspond to different modalities of the world – ways in which it’s experienced. The world can be seen, heard, or described in words. For an ML model to perceive the world in all of its complexity, understanding different modalities is a useful skill.

For example, let’s take image captioning that is used for tagging video content on popular streaming services. The visuals can sometimes be misleading. Even we, humans, might confuse a pile of weirdly-shaped snow for a dog or a mysterious silhouette, especially in the dark.

However, if the same model can perceive sounds, it might become better at resolving such cases. Dogs bark, cars beep, and humans rarely do any of that. Being able to work with different modalities, the model can make predictions or decisions based on a combination of data, which improves its overall performance.
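To make the dog-versus-snow example concrete, here is a minimal sketch of how the outputs of two single-modality classifiers could be combined. The probabilities, the label names, and the `fuse_predictions` helper are all hypothetical and chosen only for illustration; a weighted average is one simple way of combining predictions, not the only one.

```python
def fuse_predictions(image_probs, audio_probs, image_weight=0.5):
    """Weighted average of two per-class probability dicts."""
    audio_weight = 1.0 - image_weight
    labels = set(image_probs) | set(audio_probs)
    fused = {
        label: image_weight * image_probs.get(label, 0.0)
        + audio_weight * audio_probs.get(label, 0.0)
        for label in labels
    }
    # Renormalize so that the fused scores sum to 1 again.
    total = sum(fused.values())
    return {label: score / total for label, score in fused.items()}


# Hypothetical outputs of two separate classifiers for the same video clip.
image_probs = {"dog": 0.45, "snow pile": 0.55}  # the picture alone is ambiguous
audio_probs = {"dog": 0.90, "snow pile": 0.10}  # barking is clearly audible

fused = fuse_predictions(image_probs, audio_probs)
best_label = max(fused, key=fused.get)
print(best_label, round(fused["dog"], 3))
```

In this toy case the image classifier alone would pick the snow pile, while the combined prediction picks the dog.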

What’s so special about this approach? While it seems obvious that analyzing multiple modalities at a time is better than analyzing just one, for a long time it was too computationally expensive to put into practice. Models were built for a single modality: some worked with images, some with text, and some with audio.

Multimodal learning can also be used in more complex applications, such as robotics and autonomous systems. In these applications, it’s vital to understand and respond to a variety of different inputs, such as sensor data, video, and speech.

The history of multimodal learning in computer science

The history of multimodal learning can be roughly divided into four eras. Let us talk about each of them.


The behavioral era (1970s–late 1980s) was when scientists started to get interested in how different modalities can merge and affect our perception of things. In 1976, a paper by Harry McGurk and John MacDonald titled “Hearing Lips and Seeing Voices” was published in the journal Nature.

The research described a perceptual phenomenon now known as the McGurk effect. According to the authors, the auditory perception of speech sounds is influenced by the visual information present during speech perception. This effect is most commonly demonstrated by showing participants a video of a person articulating one sound (e.g. “ga”) while playing an audio recording of a different sound (e.g. “ba”). Participants often report hearing a fused sound, such as “da”, rather than the sound that is actually being played. The McGurk effect illustrates how visual information can influence auditory perception, and it has implications for fields such as speech therapy and speech perception research.

The psychological and linguistic research laid the groundwork for the next era in multimodal learning, when scientists first decided to use it for machine learning problems.


In the computational era, which spans roughly the late 1980s through the 1990s, ML engineers started to apply multimodal learning concepts to audio-visual speech recognition problems.

There was a surge of interest in human-computer interaction. Scientists believed that for machines to communicate effectively with humans, they needed to understand not only our words but also our emotions and affective states, which can be inferred from physiological signals such as heart rate and sweating. Rosalind W. Picard’s book “Affective Computing” explored this idea: computing based on systems that can recognize, interpret, and simulate human emotions.

However, while merging multiple modalities improved the accuracy of predictions in some cases, it also introduced a lot of redundancy: when two modalities carried exactly the same information, they didn’t complement each other and didn’t improve performance. Back then, it was a huge problem.

While certain computational techniques such as Boltzmann machines made it possible to divide all units into two groups (visible and hidden) and to model the connections between them, the necessary computational resources grew exponentially with the number of units.
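The exponential growth can be illustrated with a brute-force computation on a toy Boltzmann machine. This is only a sketch under simplifying assumptions: units are binary, biases are omitted, and the weights in `toy_weights` are made up. To normalize the model exactly, every joint state of visible and hidden units has to be visited, and the number of such states doubles with each added unit.

```python
import itertools
import math


def partition_function(weights, n_units):
    """Exact normalization constant of a tiny Boltzmann machine by enumeration.

    weights maps a pair (i, j) with i < j to the connection strength between
    binary units i and j. Biases are left out to keep the sketch short.
    """
    total = 0.0
    for state in itertools.product((0, 1), repeat=n_units):
        energy = -sum(w * state[i] * state[j] for (i, j), w in weights.items())
        total += math.exp(-energy)
    return total


# Toy network: units 0 and 1 are visible, unit 2 is hidden (made-up weights).
toy_weights = {(0, 2): 0.5, (1, 2): -0.3, (0, 1): 0.1}
z = partition_function(toy_weights, n_units=3)

# Number of joint states the loop above has to visit for networks of growing size.
states_needed = {n: 2 ** n for n in (3, 10, 20, 40)}
print(z, states_needed)
```

With 40 units the loop would already need about a trillion iterations, which is why exact computation was impractical and approximate training methods were needed.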

Research done in this era laid the groundwork for future innovations.


The interactional era (2000s) refers to the time when engineers tried to model human interaction by computational means in order to create virtual assistants fully capable of natural communication.

One of the projects that tried to apply multimodal learning to resolve this problem was CALO (Cognitive Assistant that Learns and Organizes). CALO had to be able to understand human commands, respond, and perform necessary actions.

By the way, Siri, which we all know and love (or don’t?), appeared as a spin-off of this project.

Deep learning era

However, the true turning point for innovation in this area was the proliferation of deep learning models in the late 2000s. Neural networks significantly improved the state of AI, helping it achieve incredible results in image, speech, and text processing. With neural networks, implementing multimodal learning became far more feasible.

Serving the needs of neural network training, many large-scale datasets appeared in the 2010s. The availability of such datasets as ImageNet, as well as datasets curated by Google and Amazon, enabled ML engineers to train and test multimodal models on a much larger scale than ever before. Together with the appearance of deep learning architectures such as transformers and deep Boltzmann machines, this led to significant improvements in performance.

Right now multimodal learning is rapidly evolving thanks to reinforcement learning, explainable AI, and self-supervised learning.

If you would like to learn more about multimodal learning and its history, I refer you to the amazing set of lectures by Louis-Philippe Morency.

What are the core challenges of multimodal learning?

Multimodal learning poses several challenges for the ML engineering community. Let’s discuss each of them in more detail.


One of the main challenges of multimodal learning is how to summarize the information that the model receives from different modalities so as to enrich its knowledge and create a complementary representation of things. Redundancy (two identical pieces of information in two places of a dataset) slows down the process and doesn’t add anything new.

For example, imagine a person who says “Great!”. That usually means the person approves. In most cases, we don’t need to see their face or listen to their voice; these don’t add any new information. However, sometimes “Great!” expresses a different emotion: sarcasm. In that case, being able to read the person’s face and recognize their intonation does add new information.


Another challenge is aligning the different modalities, as they may have different temporal or spatial resolutions, or may be generated by different sensors or devices. This can make it difficult to combine the information from different modalities in a meaningful way.
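As a small illustration of the temporal side of this problem, the sketch below brings an audio feature track down to the frame rate of a video by averaging the audio values that fall inside each video frame. The rates and values are hypothetical, and the helper `align_to_video_frames` assumes integer rates and an audio rate that is at least as high as the video frame rate.

```python
def align_to_video_frames(audio_features, audio_rate_hz, video_fps):
    """Average the audio feature values that fall inside each video frame.

    Assumes integer rates and audio_rate_hz >= video_fps, so that every
    video frame covered by the audio track receives at least one value.
    """
    buckets = {}
    for k, value in enumerate(audio_features):
        # Audio value k has timestamp k / audio_rate_hz; integer math gives
        # the index of the video frame that contains this timestamp.
        frame = (k * video_fps) // audio_rate_hz
        buckets.setdefault(frame, []).append(value)
    return [sum(values) / len(values) for _, values in sorted(buckets.items())]


# Hypothetical track: 8 audio feature values at 100 Hz, video at 25 fps,
# so each video frame spans 4 audio values.
audio_track = [1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0]
per_frame = align_to_video_frames(audio_track, audio_rate_hz=100, video_fps=25)
print(per_frame)
```

After this step, each video frame has exactly one audio value, so the two streams can be paired element by element.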

In order to align modalities, you usually need to do three things: identify connections between modality elements, implement contextualized representation learning to capture modality connections and interactions, and handle modality inputs with ambiguous segmentation.

To align modalities, you can implement:

  1. Translation. This requires changing (or “translating”) the data from one modality to another.
  2. Fusion. This technique allows you to join information from two or more modalities to perform a prediction task.
  3. Co-learning. You can implement transfer learning between different modalities, for example, exploiting knowledge from one modality that is resource-rich to enhance knowledge of a different modality that has fewer resources.
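Of the three techniques, fusion is the easiest to sketch in a few lines. The example below shows feature-level fusion: per-modality feature vectors are normalized and concatenated into one joint vector, which is then passed to a model. The feature values, the weights, and the helper names are hypothetical, and the single linear unit merely stands in for whatever model would consume the joint vector.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so that no modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fuse_features(*modality_vectors):
    """Concatenate normalized per-modality feature vectors into one joint vector."""
    joint = []
    for vector in modality_vectors:
        joint.extend(l2_normalize(vector))
    return joint


def linear_score(features, weights, bias=0.0):
    """A single linear unit standing in for the downstream prediction model."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Hypothetical feature vectors from an image encoder and an audio encoder.
image_features = [3.0, 4.0]
audio_features = [0.0, 5.0, 0.0]

joint = fuse_features(image_features, audio_features)
score = linear_score(joint, weights=[1.0, 1.0, 0.0, 1.0, 0.0])
print(joint, score)
```

Normalizing before concatenation is a design choice made here so that a modality with large raw values does not drown out the others.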

Limited data

Multimodal learning often requires large amounts of labeled data, which can be difficult and expensive to acquire. The multimodal datasets that are available for public use, such as this one, are rather small and have to be aggregated from many resources.

Handling missing modalities

Multimodal learning models are designed to work with multiple modalities, but in real-world scenarios, data from some of them might be missing or unavailable. Handling missing modalities and recovering missing values (for example, through cross-profiling) is an ongoing challenge that is still being researched.
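One simple way to cope with a missing modality, sketched below, is to combine only the predictions that are actually present and to renormalize their weights. Note that this skips the missing input rather than recovering its values, so it is a different and much simpler technique than cross-profiling. The modality names, weights, and probabilities are hypothetical.

```python
def fuse_available(predictions, weights):
    """Combine per-class probabilities from the modalities that are present.

    predictions: modality name -> per-class probability dict, or None if missing.
    weights: modality name -> relative trust in that modality.
    """
    available = {m: p for m, p in predictions.items() if p is not None}
    if not available:
        raise ValueError("at least one modality must be present")
    weight_sum = sum(weights[m] for m in available)
    fused = {}
    for modality, probs in available.items():
        share = weights[modality] / weight_sum  # renormalize over what is present
        for label, p in probs.items():
            fused[label] = fused.get(label, 0.0) + share * p
    return fused


# Hypothetical case: the microphone failed, so there is no audio prediction.
predictions = {
    "image": {"dog": 0.4, "cat": 0.6},
    "audio": None,
    "text": {"dog": 0.8, "cat": 0.2},
}
weights = {"image": 0.5, "audio": 0.3, "text": 0.2}

fused = fuse_available(predictions, weights)
print(fused)
```

The weight that would have gone to the audio prediction is redistributed proportionally between the image and text predictions.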


Multimodal models can be complex and difficult to interpret, making it hard to understand how they make decisions and what factors are most important in their predictions. At the same time, tracing back the factors that influence a model’s decision-making process is important for fixing bugs and avoiding biases.
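A rough first step towards tracing such factors is modality ablation: replace one modality at a time with zeros and observe how much the model’s output changes. The sketch below assumes a hypothetical `toy_model`, a fixed linear scorer that stands in for a trained multimodal model; with a real model, the same loop would call the model’s own prediction function.

```python
def toy_model(inputs):
    """Hypothetical stand-in for a trained multimodal model: a fixed linear scorer."""
    weights = {"image": [0.2, 0.1], "audio": [0.9], "text": [0.05, 0.05]}
    return sum(
        w * x
        for name, ws in weights.items()
        for w, x in zip(ws, inputs[name])
    )


def modality_importance(model, inputs):
    """Drop in the model's score when each modality is replaced by zeros."""
    full_score = model(inputs)
    importance = {}
    for name, features in inputs.items():
        ablated = dict(inputs)
        ablated[name] = [0.0] * len(features)
        importance[name] = full_score - model(ablated)
    return importance


sample = {"image": [1.0, 1.0], "audio": [1.0], "text": [1.0, 1.0]}
importance = modality_importance(toy_model, sample)
most_influential = max(importance, key=importance.get)
print(importance, most_influential)
```

Ablation only shows how much the output depends on each modality for a given input; it does not explain why the model relies on it.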


Multimodal models are often trained on a specific set of modalities and data, and may not generalize well to new modalities or unseen data. This makes multimodal learning models unreliable and leads us back to the explainability problem.


Multimodal models are computationally expensive to train and deploy, since they use large datasets and have to process them quickly. Today only Big Tech companies can afford to work with such models, and even they have to think hard about how to monetize the results of their work.

What are the applications of multimodal learning?

Let’s take a look at some of the main applications of multimodal learning.

Computer vision

One of the most popular applications of multimodal learning is in computer vision. By combining images with other forms of input, such as text or audio, models can better understand the context of an image and make more accurate predictions. For example, a model that is able to combine an image of a dog with the sound of barking is more likely to correctly identify the animal.

Natural Language Processing

Multimodal learning is also used in natural language processing (NLP) tasks, such as sentiment analysis or language translation. By combining text with other forms of input, such as images or audio, models can better understand the context of the text and make more accurate predictions.


Robotics

Multimodal learning is also used in robotics, where it improves the ability of robots to interact with their environment. By combining data from multiple sensors, such as cameras and microphones, models can better understand the environment and make more accurate predictions about how to interact with it.


Healthcare

Multimodal learning also has a lot of potential in healthcare. By combining data from multiple sources, such as images, text, and audio, models can better understand patient data and make more accurate predictions about disease diagnosis and treatment.

Is GPT-4 a new page in multimodal learning?

The new GPT-4 model, released recently by OpenAI, has attracted a lot of attention. Is it really a revolution in the world of generative AI?

First of all, let’s see what GPT stands for. Generative Pre-trained Transformers (GPT) are deep learning models that are capable of generating texts in natural languages for answering questions, summarizing texts, and translating.

Before GPT, such tasks were usually handled by recurrent deep learning models such as RNNs and LSTMs, which process a sequence one token at a time. The transformer architecture, introduced by Google researchers in 2017, could process a whole sequence in parallel. It inspired OpenAI engineers to improve the language understanding of generative models with their own model, GPT-1, which appeared in 2018, the same year as Google’s BERT. GPT-1 was essentially a proof of concept. The next generation of the model, GPT-2, was already able to generate short coherent texts.

However, real success was achieved with the GPT-3 model. It was trained on a massive amount of text and contained over 100 times more parameters than GPT-2. It could generate pages of different texts, including web articles, news, scripts, letters, and even code, and it did so quite well. ChatGPT, built on a refined version of this model (GPT-3.5), instantly became world-famous, making headlines such as “Could a chatbot write my restaurant reviews?”, “How ChatGPT Is Fast Becoming The Teacher’s Pet”, and “ChatGPT is about to revolutionize the economy”.

The new GPT-4 model is just an attempt to take the content generation even further and make the outputs of the model seem more natural and human-like.

“We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.”

OpenAI

Here is how GPT-4 is different from the previous generation.

  • The performance has improved. GPT-4 aims to be more factually correct and to reduce the number of hallucinations (cases where the model invents quotes or facts that don’t exist). According to OpenAI, GPT-4 scores 40% higher than GPT-3.5 on the company’s internal factuality evaluations.
  • It is more adaptable. The model is better at adapting to the requests of each user. It can choose different styles and tones of voice, rather than always writing in the unnatural, clichéd way that ChatGPT does.
  • It accepts visual input. Users can submit images together with text, and the model is able to interpret images such as charts, memes, and screenshots from academic papers. Its output, however, is still text.

Overall, the creators say that GPT-4 outperforms existing large language models and most state-of-the-art models. However, unlike the previous generations of the model, this one isn’t free. The company charges $0.03 per 1k prompt tokens and $0.06 per 1k completion tokens, which is rather expensive. Default rate limits are 40k tokens per minute and 200 requests per minute.
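To get a feel for these numbers, here is a small back-of-the-envelope calculation. The prices and limits are the ones quoted above; the request sizes are hypothetical, and the sketch assumes that the tokens-per-minute limit counts prompt and completion tokens together.

```python
PROMPT_PRICE_PER_1K = 0.03      # USD per 1k prompt tokens (quoted above)
COMPLETION_PRICE_PER_1K = 0.06  # USD per 1k completion tokens (quoted above)
TOKENS_PER_MINUTE_LIMIT = 40_000
REQUESTS_PER_MINUTE_LIMIT = 200


def request_cost(prompt_tokens, completion_tokens):
    """Cost of a single request in USD."""
    return (
        prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
        + completion_tokens / 1000 * COMPLETION_PRICE_PER_1K
    )


def max_requests_per_minute(tokens_per_request):
    """Requests per minute allowed by whichever limit is reached first.

    Assumes tokens_per_request is the sum of prompt and completion tokens.
    """
    by_tokens = TOKENS_PER_MINUTE_LIMIT // tokens_per_request
    return min(by_tokens, REQUESTS_PER_MINUTE_LIMIT)


# Hypothetical request: a 1,500-token prompt and a 500-token answer.
cost = request_cost(prompt_tokens=1500, completion_tokens=500)
throughput = max_requests_per_minute(tokens_per_request=2000)
print(round(cost, 3), throughput)
```

For requests of this size, the token limit, not the request limit, is what caps throughput.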


Conclusion

Multimodal learning is a rapidly developing area of machine learning that helps big companies create groundbreaking products and services. It requires a lot of computational resources and poses multiple challenges: how to implement multimodal learning, align information from different modalities, and make sure that the model is explainable and generalizes well.

However, one thing is certain: multimodal learning is here to stay.
