Multimodal Machine Learning:
A Survey and Taxonomy
Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency

T. Baltrušaitis is with Microsoft Corporation, Cambridge CB1 2FB, United Kingdom. E-mail: tbaltrus@cs.cmu.edu.
C. Ahuja and L.-P. Morency are with the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213. E-mail: {cahuja, morency}@cs.cmu.edu.
Manuscript received 22 May 2017; revised 21 Nov. 2017; accepted 4 Jan. 2018. Date of publication 24 Jan. 2018; date of current version 16 Jan. 2019. (Corresponding author: Tadas Baltrušaitis.) Recommended for acceptance by T. Berg. Digital Object Identifier no. 10.1109/TPAMI.2018.2798607.
Abstract—Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors.
Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when
it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to
be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate
information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential.
Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself
and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader
challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning.
This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Index Terms—Multimodal, machine learning, introductory, survey
1 INTRODUCTION
The world surrounding us involves multiple modalities—
we see objects, hear sounds, feel texture, smell odors,
and so on. In general terms, a modality refers to the way in
which something happens or is experienced. Most people
associate the word modality with the sensory modalities which
represent our primary channels of communication and sen-
sation, such as vision or touch. A research problem or dataset
is therefore characterized as multimodal when it includes
multiple such modalities. In this paper we focus primarily,
but not exclusively, on three modalities: natural language, which can be written or spoken; visual signals, which are often represented with images or videos; and vocal signals, which encode sounds and para-verbal information such as prosody and vocal expressions.
In order for Artificial Intelligence to make progress in
understanding the world around us, it needs to be able to
interpret and reason about multimodal messages. Multi-
modal machine learning aims to build models that can pro-
cess and relate information from multiple modalities.
From early research on audio-visual speech recognition
to the recent explosion of interest in language and vision
models, multimodal machine learning is a vibrant multi-
disciplinary field of increasing importance and with
extraordinary potential.
The research field of Multimodal Machine Learning
brings some unique challenges for computational research-
ers given the heterogeneity of the data. Learning from
multimodal sources offers the possibility of capturing corre-
spondences between modalities and gaining an in-depth
understanding of natural phenomena. In this paper we
identify and explore five core technical challenges (and
related sub-challenges) surrounding multimodal machine
learning. They are central to the multimodal setting and
need to be tackled in order to progress the field. Our taxon-
omy goes beyond the typical early and late fusion split, and
consists of the five following challenges:
Representation. A first fundamental challenge is learn-
ing how to represent and summarize multimodal
data in a way that exploits the complementarity and
redundancy of multiple modalities. The heterogene-
ity of multimodal data makes it challenging to con-
struct such representations. For example, language is
often symbolic while audio and visual modalities
will be represented as signals.
Translation. A second challenge addresses how to
translate (map) data from one modality to another.
Not only is the data heterogeneous, but the relation-
ship between modalities is often open-ended or sub-
jective. For example, there exist a number of correct
ways to describe an image, and one perfect translation may not exist.
Alignment. A third challenge is to identify the direct
relations between (sub)elements from two or more
different modalities. For example, we may want to
align the steps in a recipe to a video showing the
dish being made. To tackle this challenge we need
to measure similarity between different modalities
and deal with possible long-range dependencies
and ambiguities.
Fusion. A fourth challenge is to join information from
two or more modalities to perform a prediction. For
example, for audio-visual speech recognition, the
visual description of the lip motion is fused with the
speech signal to predict spoken words. The informa-
tion coming from different modalities may have
varying predictive power and noise topology, with
possibly missing data in at least one of the modalities.
Co-learning. A fifth challenge is to transfer knowl-
edge between modalities, their representation, and
their predictive models. This is exemplified by algo-
rithms of co-training, conceptual grounding, and zero
shot learning. Co-learning explores how knowledge learned from one modality can help a computational
model trained on a different modality. This challenge
is particularly relevant when one of the modalities has
limited resources (e.g., annotated data).
For each of these five challenges, we define taxonomic
classes and sub-classes to help structure the recent work in
this emerging research field of multimodal machine learn-
ing. We start with a discussion of the main applications of
multimodal machine learning (Section 2) followed by a dis-
cussion on the recent developments on all of the five core
technical challenges facing multimodal machine learning:
representation (Section 3), translation (Section 4), alignment
(Section 5), fusion (Section 6), and co-learning (Section 7).
We conclude with a discussion in Section 8.
2 APPLICATIONS: A HISTORICAL PERSPECTIVE
Multimodal machine learning enables a wide range of
applications: from audio-visual speech recognition to image
captioning. In this section we present a brief history of mul-
timodal applications, from its beginnings in audio-visual
speech recognition to a recently renewed interest in lan-
guage and vision applications.
One of the earliest examples of multimodal research is
audio-visual speech recognition (AVSR) [251]. It was moti-
vated by the McGurk effect [143]—an interaction between
hearing and vision during speech perception. When human
subjects heard the syllable /ba-ba/ while watching the lips
of a person saying /ga-ga/, they perceived a third sound:
/da-da/. These results motivated many researchers from
the speech community to extend their approaches with
visual information. Given the prominence of hidden Mar-
kov models (HMMs) in the speech community at the time
[99], it is no surprise that many of the early models for
AVSR were based on various HMM extensions [25], [26].
While research into AVSR is not as common these days, it
has seen renewed interest from the deep learning commu-
nity [157].
While the original vision of AVSR was to improve speech
recognition performance (e.g., word error rate) in all con-
texts, the experimental results showed that the main advan-
tage of visual information was when the speech signal was
noisy (i.e., low signal-to-noise ratio) [78], [157], [251]. In
other words, the captured interactions between modalities
were supplementary rather than complementary. The same
information was captured in both, improving the robust-
ness of the multimodal models but not improving the
speech recognition performance in noiseless scenarios.
A second important category of multimodal applications
comes from the field of multimedia content indexing and
retrieval [11], [196]. With the advance of personal com-
puters and the internet, the quantity of digitized multime-
dia content has increased dramatically [2]. While earlier
approaches for indexing and searching these multimedia
videos were keyword-based [196], new research problems
emerged when trying to search the visual and multimodal
content directly. This led to new research topics in multi-
media content analysis such as automatic shot-boundary
detection [128] and video summarization [55]. These
research projects were supported by the TrecVid initiative from the National Institute of Standards and Technology, which introduced many high-quality datasets, including the multimedia event detection (MED) tasks started in 2011 [1].
A third category of applications was established in the
early 2000s around the emerging field of multimodal inter-
action with the goal of understanding human multimodal
behaviors during social interactions. One of the first land-
mark datasets collected in this field is the AMI Meeting
Corpus which contains more than 100 hours of video
recordings of meetings, all fully transcribed and annotated
[34]. Another important dataset is the SEMAINE corpus, which allowed the study of interpersonal dynamics between
speakers and listeners [144]. This dataset formed the basis
of the first audio-visual emotion challenge (AVEC) orga-
nized in 2011 [186]. The fields of emotion recognition and
affective computing bloomed in the early 2010s thanks to
strong technical advances in automatic face detection, facial
landmark detection, and facial expression recognition [48].
The AVEC challenge continued annually afterward, with later instantiations including healthcare applications such as
automatic assessment of depression and anxiety [217]. A
great summary of recent progress in multimodal affect rec-
ognition was published by D’Mello et al. [52]. Their meta-
analysis revealed that the majority of recent works on multimodal affect recognition show an improvement when using
more than one modality, but this improvement is reduced
when recognizing naturally-occurring emotions.
Most recently, a new category of multimodal applica-
tions emerged with an emphasis on language and vision:
media description. One of the most representative applica-
tions is image captioning where the task is to generate a text
description of the input image [86]. This is motivated by the
ability of such systems to help the visually impaired in their
daily tasks [21]. Recently, progress has also been made on the inverse task: media generation from text [37], [178]. The main challenge facing media description and generation is evaluation: how to evaluate the quality of the predicted descriptions and media. The task of visual question-answering (VQA) was recently proposed to address some of these evaluation challenges [9], as it provides a correct answer.
In order to bring some of the mentioned applications to
the real world we need to address a number of technical
challenges facing multimodal machine learning. We sum-
marize the relevant technical challenges for the above
mentioned application areas in Table 1. One of the most
important challenges is multimodal representation, the
focus of our next section.
3 MULTIMODAL REPRESENTATIONS
Representing data in a format that a computational model
can work with has always been a challenge in machine
learning. Following Bengio et al. [19] we use the terms feature and representation interchangeably, with each refer-
ring to a vector or tensor representation of an entity, be it an
image, audio sample, individual word, or a sentence. A
multimodal representation is a representation of data using
information from multiple such entities. Representing mul-
tiple modalities poses many difficulties: how to combine the
data from heterogeneous sources; how to deal with different
levels of noise; and how to handle missing data. The ability
to represent data in a meaningful way is crucial to multi-
modal problems, and forms the backbone of any model.
Good representations are important for the performance
of machine learning models, as evidenced by the recent
leaps in performance of speech recognition [82] and visual
object classification [114] systems. Bengio et al. [19] identify a
number of properties for good representations: smoothness,
temporal and spatial coherence, sparsity, and natural clus-
tering amongst others. Srivastava and Salakhutdinov [206]
identify additional desirable properties for multimodal rep-
resentations: similarity in the representation space should
reflect the similarity of the corresponding concepts, the
representation should be easy to obtain even in the absence
of some modalities, and finally, it should be possible to fill-in
missing modalities given the observed ones.
The development of unimodal representations has been
extensively studied [4], [19], [127]. In the past decade there
has been a shift from representations hand-designed for specific applications to data-driven ones. For example, one of the most popular
ways to represent an image in the early 2000s was through a
bag of visual words representation of hand designed fea-
tures, such as the scale invariant feature transform (SIFT)
[132]. However, currently most images (or their parts) are represented using descriptions learned from data with
neural architectures such as convolutional neural networks
(CNN) [114]. Similarly, in the audio domain, acoustic fea-
tures such as Mel-frequency cepstral coefficients (MFCC)
have been superseded by data-driven deep neural networks
in speech recognition [82] and recurrent neural networks
for para-linguistic analysis [216]. In natural language proc-
essing, the textual features initially relied on counting word
occurrences in documents, but these have been replaced by data-
driven word embeddings that exploit the word context
[146]. While there has been a huge amount of work on
unimodal representation, up until recently most multi-
modal representations involved simple concatenation of
unimodal ones [52], but this has been rapidly changing.
To help understand the breadth of work, we propose two
categories of multimodal representation: joint and coordi-
nated. Joint representations combine the unimodal signals
into the same representation space, while coordinated
representations process unimodal signals separately, but
enforce certain similarity constraints on them to bring them
to what we term a coordinated space. An illustration of dif-
ferent multimodal representation types can be seen in Fig. 1.
Mathematically, the joint representation is expressed as
$x_m = f(x_1, \ldots, x_n)$,    (1)

where the multimodal representation $x_m$ is computed using a function $f$ (e.g., a deep neural network, restricted Boltzmann machine, or a recurrent neural network) that relies on unimodal representations $x_1, \ldots, x_n$. The coordinated representation, in contrast, is expressed as

$f(x_1) \sim g(x_2)$,    (2)
where each modality has a corresponding projection function ($f$ and $g$ above) that maps it into a coordinated multimodal space. The projection into the multimodal space is independent for each modality, but the resulting space is coordinated between them (indicated as $\sim$). Examples of such coordination include minimizing cosine distance [64], maximizing correlation [7], and enforcing a partial order [220] between the resulting spaces.

TABLE 1
A Summary of Applications Enabled by Multimodal Machine Learning

Challenges considered: REPRESENTATION, TRANSLATION, ALIGNMENT, FUSION, CO-LEARNING
Speech recognition: audio-visual speech recognition
Event detection: action classification; multimedia event detection
Emotion and affect: recognition; synthesis
Media description: image description; video description; visual question-answering; media summarization
Multimedia retrieval: cross-modal retrieval; cross-modal hashing
Multimedia generation: (visual) speech and sound synthesis; image and scene generation

For each application area we identify the core technical challenges that need to be addressed in order to tackle it.
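To make the distinction concrete, the following is a minimal sketch, assuming a PyTorch implementation with illustrative layer sizes (none of this comes from the surveyed papers): Equation (1) is realized by concatenating unimodal features and passing them through a shared network, while Equation (2) is realized by separate projections whose outputs are tied by a cosine-similarity constraint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRepresentation(nn.Module):
    """Equation (1): x_m = f(x_1, ..., x_n), here concatenation followed by an MLP."""
    def __init__(self, dims, joint_dim=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(sum(dims), 256), nn.ReLU(),
                               nn.Linear(256, joint_dim))

    def forward(self, *unimodal):
        return self.f(torch.cat(unimodal, dim=-1))

class CoordinatedRepresentation(nn.Module):
    """Equation (2): separate projections f and g, coordinated by a similarity constraint."""
    def __init__(self, dim1, dim2, coord_dim=128):
        super().__init__()
        self.f = nn.Linear(dim1, coord_dim)
        self.g = nn.Linear(dim2, coord_dim)

    def forward(self, x1, x2):
        z1, z2 = self.f(x1), self.g(x2)
        # Coordination: paired samples should have high cosine similarity.
        coordination_loss = 1 - F.cosine_similarity(z1, z2).mean()
        return z1, z2, coordination_loss

# Toy usage with placeholder "visual" (2048-d) and "textual" (300-d) features.
visual, text = torch.randn(4, 2048), torch.randn(4, 300)
x_m = JointRepresentation([2048, 300])(visual, text)                 # joint: (4, 128)
z_v, z_t, loss = CoordinatedRepresentation(2048, 300)(visual, text)  # coordinated spaces
```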
3.1 Joint Representations
We start our discussion with joint representations that proj-
ect unimodal representations together into a multimodal
space (Equation (1)). Joint representations are mostly (but
not exclusively) used in tasks where multimodal data is
present both during training and inference steps. The sim-
plest example of a joint representation is a concatenation of
individual modality features (also referred to as early fusion
[52]). In this section we discuss more advanced methods for
creating joint representations starting with neural networks,
followed by graphical models and recurrent neural net-
works (representative works can be seen in Table 2).
Neural networks have become a very popular method for
unimodal data representation [19]. They are used to repre-
sent visual, acoustic, and textual data, and are increasingly
used in the multimodal domain [157], [163], [225]. In this
section we describe how neural networks can be used to
construct a joint multimodal representation, how to train
them, and what advantages they offer.
In general, neural networks are made up of successive
building blocks of inner products followed by non-linear
activation functions. In order to use a neural network as a
way to represent data, it is first trained to perform a specific
task (e.g., recognizing objects in images). Due to the multi-
layer nature of deep neural networks each successive layer
is hypothesized to represent the data in a more abstract way
[19], hence it is common to use the final or penultimate neu-
ral layers as a form of data representation. To construct a
multimodal representation using neural networks each
modality starts with several individual neural layers fol-
lowed by a hidden layer that projects the modalities into a
joint space [9], [150], [163], [235]. The joint multimodal
representation is then passed through multiple hidden
layers itself or used directly for prediction. Such models can
be trained end-to-end—learning both to represent the data
and to perform a particular task. This results in a close rela-
tionship between multimodal representation learning and
multimodal fusion when using neural networks.
As neural networks require a lot of labeled training data,
it is common to pre-train such representations using either
unsupervised data (e.g., using autoencoder models [12],
[83]) or supervised data from a different but related domain
[9], [221]. The model proposed by Ngiam et al. [157]
extended the idea of using autoencoders to the multimodal
domain. They used stacked denoising autoencoders to rep-
resent each modality individually and then fused them into
a multimodal representation using another autoencoder
layer. Similarly, Silberer and Lapata [191] proposed to use a
multimodal autoencoder for the task of semantic concept
grounding (see Section 7.2). In addition to using a recon-
struction loss to train the representation they introduce a
term into the loss function that uses the representation to
predict object labels.
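As a rough illustration of this family of models, the sketch below encodes each modality separately, fuses the codes into a shared representation, and trains with a reconstruction loss on both inputs. It is a simplification under assumed dimensions and a single fusion layer, not the exact stacked architectures of [157] or [191].

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, d_audio=40, d_video=512, d_shared=64):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_audio, 64), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(d_video, 64), nn.ReLU())
        self.fuse = nn.Linear(64 + 64, d_shared)   # shared multimodal code
        self.dec_a = nn.Linear(d_shared, d_audio)
        self.dec_v = nn.Linear(d_shared, d_video)

    def forward(self, audio, video):
        code = torch.relu(self.fuse(torch.cat([self.enc_a(audio),
                                               self.enc_v(video)], dim=-1)))
        return self.dec_a(code), self.dec_v(code), code

audio, video = torch.randn(8, 40), torch.randn(8, 512)
rec_a, rec_v, code = BimodalAutoencoder()(audio, video)
# Unsupervised pre-training objective: reconstruct both modalities from the shared code.
loss = nn.MSELoss()(rec_a, audio) + nn.MSELoss()(rec_v, video)
```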
The major advantage of neural network based joint rep-
resentations comes from their ability to pre-train from unla-
bled data when labeled data is not enough for supervised
learning. It is also common to fine-tune the resulting repre-
sentation on a particular task at hand as the representation
constructed with unsupervised data is generic and not nec-
essarily optimal for a specific task [225]. One of the disad-
vantages comes from the model not being able to handle
missing data naturally—although there are ways to alleviate
this issue [157], [225]. Finally, deep networks are often difficult to train [72], but the field is making progress with new techniques such as improved regularization [204], batch normalization [92], and adaptive gradient algorithms [109].
Fig. 1. Structure of joint and coordinated representations. Joint representations are projected to the same space using all of the modalities as input.
Coordinated representations, on the other hand, exist in their own space, but are coordinated through a similarity (e.g., Euclidean distance) or
structure constraint (e.g., partial order).
TABLE 2
A Summary of Multimodal Representation Techniques

REPRESENTATION          MODALITIES              REFERENCE
Joint
  Neural networks       Images + Audio          [150], [157], [235]
                        Images + Text           [191]
  Graphical models      Images + Text           [206]
                        Images + Audio          [108]
  Sequential            Audio + Video           [100], [158]
                        Images + Text           [173]
Coordinated
  Similarity            Images + Text           [64], [110]
                        Video + Text            [166], [239]
  Structured            Images + Text           [33], [220], [256]
                        Audio + Articulatory    [228]

We identify three subtypes of joint representations (Section 3.1) and two subtypes of coordinated ones (Section 3.2). For modalities, + indicates the modalities combined.
Probabilistic graphical models can be used to construct rep-
resentations through the use of latent random variables [19].
In this section we describe how probabilistic graphical mod-
els are used to represent unimodal and multimodal data.
One way to represent data is through deep Boltzmann machines (DBMs) [183], which stack restricted Boltzmann machines (RBMs) [84] as building blocks. Similar to neural
networks, each successive layer of a DBM is expected to rep-
resent the data at a higher level of abstraction. The appeal of
DBMs comes from the fact that they do not need supervised
data for training [183]. Because they are graphical models, the representation of the data is probabilistic; however, it is possible to convert them to a deterministic neural network, though this loses the generative aspect of the model [183].
Work by Srivastava and Salakhutdinov [205] introduced
multimodal deep belief networks and multimodal DBMs
[206] as multimodal representations. Kim et al. [108] used a
deep belief network for each modality and then combined
them into a joint representation for audiovisual emotion rec-
ognition. Huang and Kingsbury [89] used a similar model
for AVSR, and Wu et al. [233] for audio and skeleton joint
based gesture recognition. Ouyang et al. [163] explored
the use of multimodal DBMs for the task of human pose
estimation from multi-view data. They demonstrated that
integrating the data at a later stage—after unimodal data
underwent nonlinear transformations—was beneficial for
the model. Similarly, Suk et al. [207] used multimodal DBM
representation to perform Alzheimer’s disease classification
from positron emission tomography and magnetic reso-
nance imaging data.
One of the big advantages of using multimodal DBMs for
learning multimodal representations is their generative
nature, which allows for an easy way to deal with missing
data—even if a whole modality is missing, the model has a
natural way to cope. It can also be used to generate samples
of one modality in the presence of the other one, or both
modalities from the representation. Similar to autoencoders
the representation can be trained in an unsupervised man-
ner enabling the use of unlabeled data. The major disadvan-
tage of DBMs is the difficulty of training them—high
computational cost, and the need to use approximate varia-
tional training methods [206].
Sequential Representation. So far we have discussed mod-
els that can represent fixed length data, however, we often
need to represent varying length sequences such as senten-
ces, videos, or audio streams. Recurrent neural networks
(RNNs), and their variants such as long short-term memory (LSTM) networks [85], have recently gained popularity
due to their success in sequence modeling across various
tasks [13], [222]. So far RNNs have mostly been used to rep-
resent unimodal sequences of words, audio, or images, with
most success in the language domain. Similar to traditional
neural networks, the hidden state of an RNN can be seen as
a representation of the data, i.e., the hidden state of an RNN at timestep t can be seen as a summarization of the sequence
up to that timestep. This is especially apparent in RNN
encoder-decoder frameworks where the task of an encoder
is to represent a sequence in the hidden state of an RNN in
such a way that a decoder could reconstruct it [13], [244].
The use of RNN representations has not been limited to
the unimodal domain. An early use of constructing a multi-
modal representation using RNNs comes from work by
Cosi et al. [45] on AVSR. They have also been used for repre-
senting audio-visual data for affect recognition [39], [158]
and to represent multi-view data such as different visual
cues for human behavior analysis [173].
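A minimal sketch of this idea, with assumed feature sizes: the final hidden state of an LSTM serves as a fixed-length summary of a variable-length unimodal sequence, which can then enter a joint or coordinated multimodal representation.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=74, hidden_size=128, batch_first=True)  # e.g., acoustic frames
frames = torch.randn(4, 200, 74)      # batch of 4 sequences, 200 timesteps each
outputs, (h_n, c_n) = lstm(frames)
sequence_repr = h_n[-1]               # (4, 128): summary of each whole sequence
# sequence_repr can now be concatenated with, or coordinated against,
# a representation of another modality (e.g., video features).
```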
3.2 Coordinated Representations
An alternative to a joint multimodal representation is a
coordinated representation. Instead of projecting the modal-
ities together into a joint space, separate representations are
learned for each modality but are coordinated through a
constraint. We start our discussion with coordinated repre-
sentations that enforce similarity between representations,
moving on to coordinated representations that enforce
more structure on the resulting space (representative works
of such coordinated representations can be seen in Table 2).
Similarity models minimize the distance between modali-
ties in the coordinated space. For example such models
encourage the representation of the word dog and an image
of a dog to have a smaller distance between them than the distance between the word dog and an image of a car [64]. One
of the earliest examples of such a representation comes
from the work by Weston et al. [229], [230] on the WSABIE
(web scale annotation by image embedding) model, where
a coordinated space was constructed for images and their
annotations. WSABIE constructs a simple linear map from
image and textual features such that corresponding annota-
tion and image representation would have a higher inner
product (smaller cosine distance) between them than non-
corresponding ones.
More recently, neural networks have become a popular
way to construct coordinated representations, due to their
ability to learn representations. Their advantage lies in the
fact that they can jointly learn coordinated representations
in an end-to-end manner. An example of such coordinated
representation is DeViSE—a deep visual-semantic embed-
ding [64]. DeViSE uses a similar inner product and ranking
loss function to WSABIE but uses more complex image and
word embeddings. Kiros et al. [110] extended this to sen-
tence and image coordinated representation by using an
LSTM model and a pairwise ranking loss to coordinate the
feature space. Socher et al. [199] tackle the same task, but
extend the language model to a dependency tree RNN to
incorporate compositional semantics. A similar model was
also proposed by Pan et al. [166], but using videos instead
of images. Xu et al. [239] also constructed a coordinated
space between videos and sentences using a ⟨subject, verb, object⟩ compositional language model and a deep video
model. This representation was then used for the task of
cross-modal retrieval and video description.
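The common ingredient of these similarity models is a ranking objective. The sketch below shows one possible form of a pairwise hinge ranking loss in the spirit of WSABIE and DeViSE; the margin value, inner-product scoring, and use of in-batch negatives are assumptions rather than the exact losses of [64], [110], [229].

```python
import torch
import torch.nn.functional as F

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, d) projected embeddings; row i of each is a matching pair."""
    scores = img_emb @ txt_emb.t()                     # (batch, batch) similarity matrix
    positive = scores.diag().unsqueeze(1)              # matching-pair scores
    # Hinge loss: mismatched pairs should score lower than the match by a margin.
    cost_txt = F.relu(margin + scores - positive)      # image vs. wrong captions
    cost_img = F.relu(margin + scores - positive.t())  # caption vs. wrong images
    mask = 1 - torch.eye(scores.size(0))               # ignore the diagonal (true pairs)
    return ((cost_txt + cost_img) * mask).sum() / scores.size(0)

loss = ranking_loss(torch.randn(8, 128), torch.randn(8, 128))
```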
While the above models enforced similarity between rep-
resentations, structured coordinated space models go beyond
that and enforce additional constraints between the modal-
ity representations. The type of structure enforced is often
based on the application, with different constraints for hash-
ing, cross-modal retrieval, and image captioning.
Structured coordinated spaces are commonly used in
cross-modal hashing—compression of high dimensional
data into compact binary codes with similar binary codes
for similar objects [226]. The idea of cross-modal hashing is
to create such codes for cross-modal retrieval [28], [97],
[118]. Hashing enforces certain constraints on the resulting
multimodal space: 1) it has to be an N-dimensional Ham-
ming space—a binary representation with controllable
number of bits; 2) the same object from different modalities
has to have a similar hash code; 3) the space has to be
similarity-preserving. Learning how to represent the data as
a hash function attempts to enforce all of these three require-
ments [28], [118]. For example, Jiang and Li [96] introduced a
method to learn such a common binary space between sentence descriptions and corresponding images using end-to-end trainable deep learning techniques. Cao et al. [33]
extended the approach with a more complex LSTM sentence
representation and introduced an outlier insensitive bit-wise
margin loss and a relevance feedback based semantic simi-
larity constraint. Similarly, Wang et al. [227] constructed a
coordinated space in which images (and sentences) with sim-
ilar meanings are closer to each other.
Another example of a structured coordinated representa-
tion comes from order-embeddings of images and language
[220], [257]. The model proposed by Vendrov et al. [220]
enforces a dissimilarity metric that is asymmetric and imple-
ments the notion of partial order in the multimodal space.
The idea is to capture a partial order of the language and
image representations—enforcing a hierarchy on the space;
for example, image of a woman walking her dog → text woman walking her dog → text woman walking. A similar
model using denotation graphs was also proposed by Young
et al. [246] where denotation graphs are used to induce a par-
tial ordering. Lastly, Zhang et al. present how exploiting
structured representations of text and images can create con-
cept taxonomies in an unsupervised manner [257].
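The order-embedding idea can be summarized by an asymmetric violation penalty. The small sketch below writes one common form of it; the sign convention and non-negative embeddings are stated as assumptions rather than a faithful reproduction of [220].

```python
import torch

def order_violation(x, y):
    """Penalty for the hypothesis that x is the more specific item, ordered below y.
    Zero when x dominates y coordinatewise (assumed convention); positive otherwise."""
    return torch.clamp(y - x, min=0).pow(2).sum(dim=-1)

img = torch.abs(torch.randn(1, 64))    # e.g., embedding of a "woman walking her dog" image
caption = 0.5 * img                     # a more general / abstract description
print(order_violation(img, caption))    # zero here: the assumed ordering holds
print(order_violation(caption, img))    # positive: the reversed ordering is violated
```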
A special case of a structured coordinated space is one
based on canonical correlation analysis (CCA) [87]. CCA
computes a linear projection which maximizes the correla-
tion between two random variables (in our case modalities)
and enforces orthogonality of the new space. CCA models
have been used extensively for cross-modal retrieval [79],
[111], [176] and audiovisual signal analysis [184], [195].
Extensions to CCA attempt to construct a correlation maxi-
mizing nonlinear projection [7], [121]. Kernel canonical cor-
relation analysis (KCCA) [121] uses reproducing kernel
Hilbert spaces for projection. However, as the approach is
nonparametric it scales poorly with the size of the training
set and has issues with very large real-world datasets. Deep
canonical correlation analysis (DCCA) [7] was introduced
as an alternative to KCCA and addresses the scalability
issue; it was also shown to lead to a better correlated representation space. A similar correspondence autoencoder [61] and deep correspondence RBMs [60] have also been pro-
posed for cross-modal retrieval.
CCA, KCCA, and DCCA are unsupervised techniques
and only optimize the correlation over the representations,
thus mostly capturing what is shared across the modalities.
Deep canonically correlated autoencoders [228] also include an autoencoder based data reconstruction term. This encourages the representation to also capture modality-specific information. The semantic correlation maximization method [256] also encourages semantic relevance, while
retaining correlation maximization and orthogonality of the
resulting space—this leads to a combination of CCA and
cross-modal hashing techniques.
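For reference, plain linear CCA can be computed with off-the-shelf tools. The toy example below uses scikit-learn on synthetic two-view data (the data and dimensions are placeholders); KCCA or DCCA would replace the linear projections with kernel or deep mappings.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
shared = rng.randn(500, 5)                                     # latent factor shared by both views
X = shared @ rng.randn(5, 100) + 0.1 * rng.randn(500, 100)     # "visual" view
Y = shared @ rng.randn(5, 50) + 0.1 * rng.randn(500, 50)       # "textual" view

cca = CCA(n_components=5)
X_c, Y_c = cca.fit_transform(X, Y)                             # coordinated (correlated) projections
# Correlation of the first pair of canonical variates should be close to 1.
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
```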
3.3 Discussion
In this section we identified two major types of multimodal
representations—joint and coordinated. Joint representa-
tions project multimodal data into a common space and are
best suited for situations when all of the modalities are pres-
ent during inference. They have been extensively used for
AVSR, affect, and multimodal gesture recognition. Coordi-
nated representations, on the other hand, project each
modality into a separate but coordinated space, making
them suitable for applications where only one modality is
present at test time, such as: multimodal retrieval and trans-
lation (Section 4), grounding (Section 7.2), and zero shot
learning (Section 7.2). Furthermore, while joint representations have been used to construct representations of more than two modalities, coordinated spaces have, so far, been mostly limited to two. Finally, the multimodal networks we discussed are largely static; in the future we
may see more work on one modality driving the structure
of a network applied to another modality [6].
4 TRANSLATION
A big part of multimodal machine learning is concerned with
translating (mapping) from one modality to another. Given
an entity in one modality the task is to generate the same
entity in a different modality. For example given an image
we might want to generate a sentence describing it or given a
textual description generate an image matching it. Multi-
modal translation is a long studied problem, with early work
in speech synthesis [91], visual speech generation [141], video
description [112], and cross-modal retrieval [176].
More recently, multimodal translation has seen renewed
interest due to combined efforts of the computer vision and
natural language processing (NLP) communities [20] and
recent availability of large multimodal datasets [40], [214].
A particularly popular problem is visual scene description,
also known as image [223] and video captioning [222],
which acts as a great test bed for a number of computer
vision and NLP problems. To solve it, we not only need to
fully understand the visual scene and to identify its salient
parts, but also to produce grammatically correct and com-
prehensive yet concise sentences describing it.
While the approaches to multimodal translation are very
broad and are often modality specific, they share a number
of unifying factors. We categorize them into two types—
example-based, and generative. Example-based models use a
dictionary when translating between the modalities. Genera-
tive models, on the other hand, construct a model that is able
to produce a translation. This distinction is similar to the
one between non-parametric and parametric machine learn-
ing approaches and is illustrated in Fig. 2, with representa-
tive examples summarized in Table 3.
Generative models are arguably more challenging to build
as they require the ability to generate signals or sequences of
symbols (e.g., sentences). This is difficult for any modality—
visual, acoustic, or verbal, especially when temporally and
structurally consistent sequences need to be generated.
This led to many of the early multimodal translation systems
relying on example-based translation. However, this has been
changing with the advent of deep learning models that are
capable of generating images [178], [218], sounds [161], [164],
and text [13].
4.1 Example-Based
Example-based algorithms are restricted by their training
data—dictionary (see Fig. 2a). We identify two types of
such algorithms: retrieval based, and combination based.
Retrieval-based models directly use the retrieved translation
without modifying it, while combination-based models rely
on more complex rules to create translations based on a
number of retrieved instances.
Retrieval-based models are arguably the simplest form of
multimodal translation. They rely on finding the closest
sample in the dictionary and using that as the translated
result. The retrieval can be done in unimodal space or inter-
mediate semantic space.
Given a source modality instance to be translated, unim-
odal retrieval finds the closest instances in the dictionary in
the space of the source—for example, visual feature space
for images. Such approaches have been used for visual
speech synthesis, by retrieving the closest matching visual
example of the desired phoneme [27]. They have also been
used in concatenative text-to-speech systems [91]. More
recently, Ordonez et al. [162] used unimodal retrieval to gen-
erate image descriptions by using global image features to
retrieve caption candidates. Yagcioglu et al. [240] used a
CNN-based image representation to retrieve visually similar
images using adaptive neighborhood selection. Devlin et al.
[51] demonstrated that a simple k-nearest neighbor retrieval
with consensus caption selection achieves competitive trans-
lation results when compared to more complex generative
approaches. The advantage of such unimodal retrieval
approaches is that they only require the representation of a
single modality through which we are performing retrieval.
However, they often require an extra multimodal post-proc-
essing step such as re-ranking of retrieved translations [140],
[162], [240]. This indicates a major problem with this
approach—similarity in unimodal space does not always
imply a good translation.
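A minimal sketch of unimodal retrieval-based translation for image captioning, under the assumption that pre-computed visual features and a captioned dictionary are available; the cosine-similarity search and the naive closest-neighbor selection (rather than consensus re-ranking as in [51]) are simplifications.

```python
import numpy as np

def retrieve_caption(query_feat, dict_feats, dict_captions, k=5):
    # Cosine similarity between the query image and every dictionary image.
    q = query_feat / np.linalg.norm(query_feat)
    D = dict_feats / np.linalg.norm(dict_feats, axis=1, keepdims=True)
    sims = D @ q
    top_k = np.argsort(-sims)[:k]          # indices of the k most similar images
    # Simplest rule: reuse the caption of the single closest image;
    # combination-based methods would instead merge the k candidates.
    return dict_captions[top_k[0]], [dict_captions[i] for i in top_k]

feats = np.random.randn(1000, 2048)        # placeholder for pre-computed CNN features
captions = [f"caption {i}" for i in range(1000)]
best, candidates = retrieve_caption(np.random.randn(2048), feats, captions)
```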
An alternative is to use an intermediate semantic space
for similarity comparison during retrieval. An early exam-
ple of a hand crafted semantic space is one used by Farhadi
et al. [58]. They map both sentences and images to a space
of ⟨object, action, scene⟩; retrieval of a relevant caption for an image is then performed in that space. In contrast to hand-
crafting a representation, Socher et al. [199] learn a coordi-
nated representation of sentences and CNN visual features
(see Section 3.2 for description of coordinated spaces).
They use the model for both translating from text to
images and from images to text. Similarly, Xu et al. [239]
used a coordinated space of videos and their descriptions
for cross-modal retrieval. Jiang and Li [97] and Cao et al.
[33] use cross-modal hashing to perform multimodal
translation from images to sentences and back, while
Hodosh et al. [86] use a multimodal KCCA space for
image-sentence retrieval. Instead of aligning images and
sentences globally in a common space, Karpathy et al.
[103] propose a multimodal similarity metric that inter-
nally aligns image fragments (visual objects) together
with sentence fragments (dependency tree relations).
Retrieval approaches in semantic space tend to perform
better than their unimodal counterparts as they are retriev-
ing examples in a more meaningful space that reflects both
modalities and that is often optimized for retrieval. Further-
more, they allow for bi-directional translation, which is not
straightforward with unimodal methods. However, they
require manual construction or learning of such a semantic
space, which often relies on the existence of large training
dictionaries (datasets of paired samples).
Fig. 2. Overview of example-based and generative multimodal translation. The former retrieves the best translation from a dictionary, while the latter
first trains a translation model on the dictionary and then uses that model for translation.
TABLE 3
Taxonomy of Multimodal Translation Research

                     TASKS                 DIR.   REFERENCES
Example-based
  Retrieval          Image captioning      →      [58], [162]
                     Media retrieval       ↔      [199], [239]
                     Visual speech         →      [27]
                     Image captioning      ↔      [102], [103]
  Combination        Image captioning      →      [77], [119], [124]
Generative
  Grammar based      Video description     →      [15], [213]
                     Image description     →      [53], [126], [147]
  Encoder-decoder    Image captioning      →      [110], [139]
                     Video description     →      [222], [249]
                     Text to image         →      [137], [178]
  Continuous         Sounds synthesis      →      [161], [164]
                     Visual speech         →      [5], [49], [212]

For each class and sub-class, we include example tasks with references. Our taxonomy also includes the directionality of the translation: unidirectional (→) and bidirectional (↔).
Combination-based models take the retrieval based app-
roaches one step further. Instead of just retrieving examples
from the dictionary, they combine them in a meaningful
way to construct a better translation. Combination based
media description approaches are motivated by the fact that
sentence descriptions of images share a common and simple
structure that could be exploited. Most often the rules for
combinations are hand crafted or based on heuristics.
Kuznetsova et al. [119] first retrieve phrases that describe
visually similar images and then combine them to generate
novel descriptions of the query image by using Integer Lin-
ear Programming with a number of hand crafted rules.
Gupta et al. [77] first find k images most similar to the
source image, and then use the phrases extracted from their
captions to generate a target sentence. Lebret et al. [124] use
a CNN-based image representation to infer phrases that
describe it. The predicted phrases are then combined using
a trigram constrained language model.
A big problem facing example-based approaches for trans-
lation is that the model is the entire dictionary—making the
model large and inference slow (although, optimizations
such as hashing alleviate this problem). Another issue facing
example-based translation is that it is unrealistic to expect
that a single comprehensive and accurate translation relevant
to the source example will always exist in the dictionary—
unless the task is simple or the dictionary is very large.
This is partly addressed by combination models that are
able to construct more complex structures. However, they
are only able to perform translation in one direction, while
semantic space retrieval-based models are able to perform
it both ways.
4.2 Generative Approaches
Generative approaches to multimodal translation construct
models that can perform multimodal translation given a
unimodal source instance. It is a challenging problem as it
requires the ability to both understand the source modality
and to generate the target sequence or signal. As discussed
in the following section, this also makes such methods
much more difficult to evaluate, due to large space of possi-
ble correct answers.
In this survey we focus on the generation of three modal-
ities: language, vision, and sound. Language generation has
been explored for a long time [177], with a lot of recent
attention for tasks such as image and video description [20].
Speech and sound generation has also seen a lot of work
with a number of historical [91] and modern approaches
[161], [164]. Photo-realistic image generation has been less
explored, and is still in early stages [137], [178], however,
there have been a number of attempts at generating abstract
scenes [261], computer graphics [47], and talking heads [5].
We identify three broad categories of generative models:
grammar-based, encoder-decoder, and continuous generation
models. Grammar-based models simplify the task by restricting the target domain using a grammar, e.g., by generating restricted sentences based on a ⟨subject, object, verb⟩ template. Encoder-decoder models first encode the
source modality to a latent representation which is then
used by a decoder to generate the target modality. Continu-
ous generation models generate the target modality contin-
uously based on a stream of source modality inputs and are
most suited for translating between temporal sequences—
such as text-to-speech.
Grammar-based models rely on a pre-defined grammar for
generating a particular modality. They start by detecting
high level concepts from the source modality, such as objects
in images and actions from videos. These detections are then
incorporated together with a generation procedure based on
a pre-defined grammar to result in a target modality.
Kojima et al. [112] proposed a system to describe human
behavior in a video using the detected position of the person’s
head and hands and rule based natural language generation
that incorporates a hierarchy of concepts and actions. Barbu
et al. [15] proposed a video description model that generates
sentences of the form: who did what to whom and where and
how they did it. The system was based on handcrafted object
and event classifiers and used a restricted grammar suitable
for the task. Guadarrama et al. [76] predict ⟨subject, verb, object⟩ triplets describing a video using semantic hierarchies
that use more general words in case of uncertainty. Together
with a language model their approach allows for translation
of verbs and nouns not seen in the dictionary.
To describe images, Yao et al. [243] propose to use an
and-or graph-based model together with domain-specific
lexicalized grammar rules, targeted visual representation
scheme, and a hierarchical knowledge ontology. Li et al.
[126] first detect objects, visual attributes, and spatial rela-
tionships between objects. They then use an n-gram lan-
guage model on the visually extracted phrases to generate
⟨subject, preposition, object⟩ style sentences. Mitchell et al. [147] use a more sophisticated tree-based language model to generate syntactic trees instead of filling in templates, leading to more diverse descriptions. A majority of approaches represent the whole image jointly as a bag of
visual objects without capturing their spatial and semantic
relationships. To address this, Elliott et al. [53] propose to
explicitly model proximity relationships of objects for
image description generation.
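As a toy illustration of the template-filling idea behind these grammar-based methods (the detections and the sentence template are hypothetical), detected concept triplets can be slotted into a restricted grammar:

```python
def describe(detections):
    """detections: list of (subject, preposition, object) triplets from vision models."""
    sentences = []
    for subject, preposition, obj in detections:
        sentences.append(f"A {subject} is {preposition} a {obj}.")
    return " ".join(sentences)

print(describe([("dog", "next to", "chair"), ("woman", "behind", "table")]))
# -> "A dog is next to a chair. A woman is behind a table."
```

Real systems layer language models, attribute detectors, and spatial reasoning on top of this basic scheme, but the restricted grammar is what keeps the output syntactically well formed.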
Some grammar-based approaches rely on graphical mod-
els to generate the target modality. An example includes
BabyTalk [117], which, given an image, generates ⟨object, preposition, object⟩ triplets that are used together with a conditional random field to construct the sentences. Yang et al. [241] predict a set of ⟨noun, verb, scene, preposition⟩ candidates using visual features extracted from an image and
combine them into a sentence using a statistical language
model and hidden Markov model style inference. A similar
approach has been proposed by Thomason et al. [213], where
a factor graph model is used for video description of the form
⟨subject, verb, object, place⟩. The factor model exploits language statistics to deal with noisy visual representations. Going the other way, Zitnick et al. [261] propose to use condi-
tional random fields to generate abstract visual scenes based
on language triplets extracted from sentences.
An advantage of grammar-based methods is that they
are more likely to generate syntactically (in case of lan-
guage) or logically correct target instances as they use pre-
defined templates and restricted grammars. However, this
limits them to producing formulaic rather than creative
translations. Furthermore, grammar-based methods rely on
complex pipelines for concept detection, with each concept
requiring a separate model and a separate training dataset.
Encoder-decoder models based on end-to-end trained neu-
ral networks are currently some of the most popular techni-
ques for multimodal translation. The main idea behind the
model is to first encode a source modality into a vectorial
representation and then to use a decoder module to gener-
ate the target modality, all this in a single pass pipeline.
Although first used for machine translation [101], [208],
such models have been successfully used for image caption-
ing [139], [223], and video description [181], [222]. While
encoder-decoder models have been mostly used to generate
text, they can also generate images [137], [178], and speech
and sound [161], [164].
The first step of the encoder-decoder model is to encode
the source object; this is done in a modality-specific way. Popular models to encode acoustic signals include RNNs [36] and DBNs [82]. Most of the work on encoding words and sentences uses distributional semantics [146] and variants of
RNNs [13]. Images are most often encoded using convolu-
tional neural networks [114], [193]. Although there are
methods for learning video representations [59], [192],
hand-crafted features are still used [181], [213]. While it is
possible to use unimodal representations to encode the
source modality, it has been shown that using a coordinated
space (see Section 3.2) leads to better results [110], [166].
Decoding is most often performed by an RNN or an LSTM
using the encoded representation as the initial hidden state
[56], [137], [223]. A number of extensions have been
proposed to traditional LSTM models to aid in the task of
translation. A guide vector can be used to couple the decoder more tightly to the image input [95]. Venugopalan et al.
[222] demonstrate that it is beneficial to pre-train a decoder
LSTM for image captioning before fine-tuning it to video
description. Rohrbach et al. [181] explore the use of various
LSTM architectures (single layer, multilayer, factored) and a
number of training and regularization techniques for the
task of video description.
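A minimal sketch of the encoder-decoder pattern for image captioning, assuming a torchvision ResNet encoder and illustrative vocabulary and hidden sizes; the real systems discussed above add attention, beam search, and careful pre-training.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial LSTM state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        h0 = torch.tanh(self.init_h(image_feat)).unsqueeze(0)   # (1, B, H)
        c0 = torch.zeros_like(h0)
        outputs, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(outputs)                                # word logits at every step

encoder = models.resnet18(weights=None)   # random weights here; a pre-trained CNN is typical
encoder.fc = nn.Identity()                # use the pooled 512-d features as the encoding
images, captions = torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12))
logits = CaptionDecoder()(encoder(images), captions)            # (2, 12, 10000)
```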
A problem facing translation generation using an RNN is
that the model has to generate a description from a single
vectorial representation of the image, sentence, or video.
This becomes especially difficult when generating long
sequences as these models tend to forget the initial input.
This has been partly addressed by including the encoded
information during every step of the decoder [95]. Attention
models (see Section 5.2) have also been proposed to allow
the decoder to better focus on certain parts of an image
[238], sentence [13], or video [244] during generation.
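A sketch of what such soft attention computes at each decoding step, assuming dot-product scoring (additive, MLP-based scoring as in [238] is equally common):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, region_feats):
    """decoder_state: (B, d); region_feats: (B, R, d), one vector per image region."""
    scores = torch.bmm(region_feats, decoder_state.unsqueeze(2)).squeeze(2)  # (B, R)
    weights = F.softmax(scores, dim=1)                                       # attention map
    context = torch.bmm(weights.unsqueeze(1), region_feats).squeeze(1)       # (B, d)
    return context, weights   # context feeds the next decoding step

context, weights = attend(torch.randn(2, 512), torch.randn(2, 49, 512))
```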
Generative attention-based RNNs have also been used
for the task of generating images from sentences [137]; while the results are still far from photo-realistic, they show a lot of
promise. More recently, a large amount of progress has been
made in generating images using generative adversarial
networks [74], which have been used as an alternative to
RNNs for image generation from text [178].
While neural network based encoder-decoder systems
have been very successful they still face a number of issues.
Devlin et al. [51] suggest that it is possible that the network
is memorizing the training data rather than learning how to
understand the visual scene and generate it, based on the
observation that k-nearest neighbor models perform simi-
larly to those based on generation. Furthermore, such mod-
els often require large quantities of data for training.
Continuous generation models are intended for sequence
translation and produce outputs at every timestep in an
online manner. These models are useful when translating
from a sequence to a sequence such as text to speech, speech
to text, and video to text. A number of different techniques
have been proposed for such modeling—graphical models,
continuous encoder-decoder approaches, and various other
regression or classification techniques. The extra difficulty
that needs to be tackled by these models is the requirement
of temporal consistency between modalities.
A lot of early work on sequence to sequence translation
used graphical or latent variable models. Deena and Galata
[49] proposed to use a shared Gaussian process latent vari-
able model for audio-based visual speech synthesis. The
model creates a shared latent space between audio and
visual features that can be used to generate one space from
the other, while enforcing temporal consistency of visual
speech at different timesteps. Hidden Markov models
(HMM) have also been used for visual speech generation
[212] and text-to-speech [253] tasks. They have also been
extended to use cluster adaptive training to allow for train-
ing on multiple speakers, languages, and emotions allowing
for more control when generating speech signal [252] or
visual speech parameters [5].
Encoder-decoder models have recently become popular
for sequence to sequence modeling. Owens et al. [164] used
an LSTM to generate sounds resulting from drumsticks
based on video. While their model is capable of generating
sounds by predicting a cochleogram from CNN visual fea-
tures, they found that retrieving a closest audio sample
based on the predicted cochleogram led to best results.
Directly modeling the raw audio signal for speech and
music generation has been proposed by van den Oord et al.
[161]. The authors propose using hierarchical fully convolu-
tional neural networks, which show a large improvement
over previous state-of-the-art for the task of speech synthe-
sis. RNNs have also been used for speech to text translation
(speech recognition) [75]. More recently, an encoder-decoder based continuous approach was shown to be good at predicting letters from a speech signal represented as filter bank spectra [36]—allowing for more accurate recognition
of rare and out of vocabulary words. Collobert et al. [44]
demonstrate how to use a raw audio signal directly for
speech recognition, eliminating the need for audio features.
A lot of earlier work used graphical models for multi-
modal translation between continuous signals. However,
these methods are being replaced by neural network
encoder-decoder based techniques. Especially as they have
recently been shown to be able to represent and generate
complex visual and acoustic signals.
4.3 Model Evaluation and Discussion
A major challenge facing multimodal translation methods
is that they are very difficult to evaluate. While some
tasks such as speech recognition have a single correct
translation, tasks such as speech synthesis and media
description do not. Sometimes, as in language translation,
multiple answers are correct and deciding which transla-
tion is better is often subjective. Fortunately, there are a
number of approximate automatic metrics that aid in
model evaluation.
Often the ideal way to evaluate a subjective task is
through human judgment, that is, by having a group of people evaluate each translation. This can be done on a Likert
scale where each translation is evaluated on a certain
dimension: naturalness and mean opinion score for speech
synthesis [161], [252], realism for visual speech synthesis
[5], [212], and grammatical and semantic correctness, rele-
vance, order, and detail for media description [40], [117],
[147], [222]. Another option is to perform preference studies
where two (or more) translations are presented to the partic-
ipant for preference comparison [212], [252]. However,
while user studies will result in evaluation closest to human
judgments they are time consuming and costly. Further-
more, they require care when constructing and conducting
them to avoid fluency, age, gender and culture biases.
While human studies are a gold standard for evaluation,
a number of automatic alternatives have been proposed for
the task of media description: BLEU [167], ROUGE [129],
Meteor [50], and CIDEr [219]. However, their use has faced a lot of criticism, and they have been shown to correspond only weakly to human judgments [54], [90].
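For concreteness, n-gram overlap metrics such as BLEU can be computed with standard toolkits; the toy example below uses NLTK, with the smoothing choice as an assumption, and it should be read with the caveats above in mind.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "woman", "is", "walking", "her", "dog"],
              ["a", "person", "walks", "a", "dog", "outside"]]
candidate = ["a", "woman", "walks", "her", "dog"]
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")   # higher is better, but correlates only weakly with humans
```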
Hodosh et al. [86] propose using retrieval as a proxy for
image captioning evaluation, as a better way to reflect
human judgments. Instead of generating captions, a retrieval
based system ranks the available captions based on their fit
to the image, and is then evaluated by assessing if the correct
captions are given a high rank. As a number of caption generation models are generative, they can be used directly to assess the likelihood of a caption given an image, and this form of evaluation is being adopted by the image captioning community [103], [110].
Such retrieval based evaluation metrics have also been
adopted by the video captioning community [182].
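A minimal sketch of such retrieval-based evaluation is shown below: given a matrix of model scores between images and candidate captions, we measure how often the ground-truth caption is ranked in the top k. The scoring matrix here is random and only stands in for a real model.

```python
# A sketch of retrieval-based caption evaluation: rank all candidate captions
# by a model score and check whether the true caption lands in the top k.
import numpy as np

def recall_at_k(score_matrix, k=5):
    """score_matrix[i, j] = model score of caption j for image i;
    caption i is assumed to be the ground truth for image i."""
    ranks = []
    for i, scores in enumerate(score_matrix):
        order = np.argsort(-scores)                 # best caption first
        ranks.append(int(np.where(order == i)[0][0]))
    return float(np.mean([r < k for r in ranks]))

rng = np.random.default_rng(0)
scores = rng.standard_normal((100, 100))            # dummy image-caption scores
print("R@5:", recall_at_k(scores, k=5))
```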
The visual question-answering (VQA) task [135] was proposed partly due to the issues facing evaluation of image captioning. VQA is a task where, given an image and a question about its content, the system has to answer it. Evaluating such systems is easier due to the presence of a correct answer, turning the task into one of multimodal fusion (see Section 6) rather than translation. The image co-reference task [113], [138] was proposed to address this ambiguity as well, by framing the task as one of multimodal alignment (see Section 5).
We believe that addressing the evaluation issue will be
crucial for further success of multimodal translation systems.
This will allow not only for better comparison between
approaches, but also for better objectives to optimize.
5 ALIGNMENT
We define multimodal alignment as finding relationships
and correspondences between sub-components of instances
from two or more modalities. For example, given an image
and a caption we want to find the areas of the image corre-
sponding to the caption’s words or phrases [102]. Another
example is, given a movie, aligning it to the script or the
book chapters it was based on [260]. The ability to do this is particularly important for multimedia retrieval, as it enables us
to search video content based on text, e.g., finding scenes in
a movie where a particular character appears, or finding
images that contain blue chairs.
We categorize multimodal alignment into two types—
implicit and explicit. In explicit alignment, we are explicitly
interested in aligning sub-components between modalities,
e.g., aligning recipe steps with the corresponding instruc-
tional video [136]. Implicit alignment is used as an interme-
diate (often latent) step for another task, e.g., image
retrieval based on text description can include an alignment
step between words and image regions [103]. An overview
of such approaches can be seen in Table 4 and is presented
in more detail in the following sections.
5.1 Explicit Alignment
We categorize papers as performing explicit alignment if
their main modeling objective is alignment between sub-
components of instances from two or more modalities. A
very important part of explicit alignment is the similarity
metric. Most approaches rely on measuring similarity
between sub-components in different modalities as a basic
building block. These similarities can be defined manually
or learned from data. We identify two types of algorithms
that tackle explicit alignment—unsupervised and (weakly)
supervised. The first type operates with no direct alignment
labels (i.e., labeled correspondences) between instances
from the different modalities. The second type has access to
such (sometimes weak) labels.
Unsupervised multimodal alignment tackles modality
alignment without requiring any direct alignment labels.
Most of the approaches are inspired from early work on
alignment for statistical machine translation [29] and genome
sequences [116], [151]. To make the task easier, the approaches assume certain constraints on alignment, such as temporal ordering of the sequences or the existence of a similarity metric between the modalities.
Dynamic time warping (DTW) [116], [151] is a dynamic
programming approach that has been extensively used
to align multi-view time series. DTW measures the similar-
ity between two sequences and finds an optimal match
between them by time warping (inserting frames). It requires
the timesteps in the two sequences to be comparable and
requires a similarity measure between them. DTW can be
used directly for multimodal alignment by hand-crafting
similarity metrics between modalities; for example, Anguera et al. [8] use a manually defined similarity between graphemes and phonemes; and Tapaswi et al. [210] define a similarity between visual scenes and sentences based on the appearance of the same characters [210] to align TV shows and plot synopses. DTW-like dynamic programming approaches have also been used for multimodal alignment of text to speech [80] and video [211].

TABLE 4
Summary of Our Taxonomy for the Multimodal Alignment Challenge

ALIGNMENT                        MODALITIES           REFERENCE
Explicit    Unsupervised         Video + Text         [136], [210], [211]
                                 Video + Audio        [160], [215], [259]
            Supervised           Video + Text         [24], [260]
                                 Image + Text         [113], [138], [168]
Implicit    Graphical models     Audio/Text + Text    [194], [224]
            Neural networks      Image + Text         [102], [236], [238]
                                 Video + Text         [244], [249]

For each sub-class of our taxonomy, we include reference citations and modalities aligned.
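The following is a minimal sketch of the DTW recursion described above, assuming a simple Euclidean cost between time steps; real multimodal systems would substitute a hand-crafted or learned cross-modal similarity.

```python
# A minimal dynamic time warping sketch. The cost function is a stand-in for
# whatever (hand-crafted or learned) cross-modal similarity a real system uses.
import numpy as np

def dtw(seq_a, seq_b, cost=lambda a, b: np.linalg.norm(a - b)):
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]            # total alignment cost (monotonic, no large jumps)

a = np.random.randn(50, 10)   # e.g., 50 audio frames with 10-d features
b = np.random.randn(60, 10)   # e.g., 60 video frames projected to the same space
print("DTW cost:", dtw(a, b))
```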
As the original DTW formulation requires a pre-defined
similarity metric between modalities, it was extended using
canonical correlation analysis to map the modalities to a
coordinated space. This allows for both aligning (through
DTW) and learning the mapping (through CCA) between
different modality streams jointly and in an unsupervised
manner [187], [258], [259]. While CCA based DTW models
are able to find multimodal data alignment under a linear
transformation, they are not able to model non-linear rela-
tionships. This has been addressed by the deep canonical
time warping approach [215], which can be seen as a gener-
alization of deep CCA and DTW.
Various graphical models have also been popular for
multimodal sequence alignment in an unsupervised man-
ner. Early work by Yu and Ballard [247] used a generative
graphical model to align visual objects in images with spo-
ken words. A similar approach was taken by Cour et al. [46]
to align movie shots and scenes to the corresponding
screenplay. Malmaud et al. [136] used a factored HMM to
align recipes to cooking videos, while Noulas et al. [160]
used a dynamic Bayesian network to align speakers to vid-
eos. Naim et al. [153] matched sentences with correspond-
ing video frames using a hierarchical HMM model to align
sentences with frames and a modified IBM [29] algorithm
for word and object alignment [16]. This model was then
extended to use latent conditional random fields for align-
ments [152] and to incorporate verb alignment to actions in
addition to nouns and objects [203].
Both DTW and graphical model approaches for alignment
allow for restrictions on alignment, e.g., temporal consis-
tency, no large jumps in time, and monotonicity. While DTW
extensions allow for learning both the similarity metric and
alignment jointly, graphical model based approaches require
expert knowledge for construction [46], [247].
Supervised alignment methods rely on labeled aligned
instances. They are used to train similarity measures that
are used for aligning modalities.
A number of supervised sequence alignment techniques
take inspiration from unsupervised ones. Bojanowski et al.
[23], [24] proposed a method similar to canonical time warp-
ing, but have also extended it to take advantage of existing
(weak) supervisory alignment data for model training.
Plummer et al. [168] used CCA to find a coordinated space
between image regions and phrases for alignment. Gebru
et al. [68] trained a Gaussian mixture model and performed
semi-supervised clustering together with an unsupervised
latent-variable graphical model to align speakers in an
audio channel with their locations in a video. Kong et al.
[113] trained a Markov random field to align objects in 3D
scenes to nouns and pronouns in text descriptions.
Deep learning based approaches are becoming popular
for explicit alignment (specifically for measuring similarity)
due to very recent availability of aligned datasets in the lan-
guage and vision communities [138], [168]. Zhu et al. [260]
aligned books with their corresponding movies/scripts by
training a CNN to measure similarities between scenes and
text. Mao et al. [138] used an LSTM language model and a
CNN visual one to evaluate the quality of a match between
a referring expression and an object in an image. Yu et al.
[250] extended this model to include relative appearance
and context information that allows it to better disambiguate
between objects of the same type. Finally, Hu et al. [88] used
an LSTM based scoring function to find similarities between
image regions and their descriptions.
5.2 Implicit Alignment
In contrast to explicit alignment, implicit alignment is used
as an intermediate (often latent) step for another task. This
allows for better performance in a number of tasks includ-
ing speech recognition, machine translation, media descrip-
tion, and visual question-answering. Such models do not
explicitly align data and do not rely on supervised align-
ment examples, but learn how to latently align the data dur-
ing model training. We identify two types of implicit
alignment models: earlier work based on graphical models,
and more modern neural network methods.
Graphical models were used in some early work to better align words between languages for machine translation [224] and to align speech phonemes with their transcriptions [194]. However, they require manual construction
of a mapping between the modalities, for example a genera-
tive phone model that maps phonemes to acoustic features
[194]. Constructing such models requires training data or
human expertise to define them manually.
Neural networks. Translation (Section 4) is an example of a
modeling task that can often be improved if alignment is
performed as a latent intermediate step. As we mentioned
before, neural networks are popular ways to address this
translation problem, using either an encoder-decoder model
or through cross-modal retrieval. When translation is per-
formed without implicit alignment, it ends up putting a lot
of weight on the encoder module to be able to properly
summarize the whole image, sentence or a video with a sin-
gle vectorial representation.
A very popular way to address this is through attention
[13], which allows the decoder to focus on sub-components
of the source instance. This is in contrast with encoding all
source sub-components together, as is performed in a con-
ventional encoder-decoder model. An attention module will
tell the decoder to look more at targeted sub-components of
the source to be translated—areas of an image [238], words
of a sentence [13], segments of an audio sequence [36], [41],
frames and regions in a video [244], [249], and even parts of
an instruction [145]. For example, in image captioning
instead of encoding an entire image using a CNN, an atten-
tion mechanism will allow the decoder (typically an RNN) to
focus on particular parts of the image when generating each
successive word [238]. The attention module which learns
what part of the image to focus on is typically a shallow neu-
ral network and is trained end-to-end together with a target
task (e.g., translation).
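The snippet below sketches such a soft attention module (assuming PyTorch): the decoder state is scored against every image region, and the resulting weights act as a latent alignment when computing the visual context for the next word. All dimensions are illustrative.

```python
# A sketch of soft attention for implicit alignment in captioning; sizes are
# made up (e.g., a 7x7 CNN feature map flattened to 49 regions).
import torch
import torch.nn as nn

REGIONS, FEAT, HIDDEN = 49, 512, 256

class SoftAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_img = nn.Linear(FEAT, HIDDEN)
        self.proj_dec = nn.Linear(HIDDEN, HIDDEN)
        self.score = nn.Linear(HIDDEN, 1)

    def forward(self, regions, dec_state):
        # regions: (batch, REGIONS, FEAT); dec_state: (batch, HIDDEN)
        e = self.score(torch.tanh(self.proj_img(regions) +
                                  self.proj_dec(dec_state).unsqueeze(1)))
        alpha = torch.softmax(e.squeeze(-1), dim=1)       # alignment weights
        context = (alpha.unsqueeze(-1) * regions).sum(1)  # attended visual context
        return context, alpha

attn = SoftAttention()
ctx, weights = attn(torch.randn(2, REGIONS, FEAT), torch.randn(2, HIDDEN))
print(weights.shape)   # (2, 49): how much each region contributes to the next word
```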
Attention models have also been successfully applied to
question answering tasks, as they allow for aligning the
words in a question with sub-components of an information
source such as a piece of text [236], an image [65], or a video
sequence [254]. This allows for better accuracy and leads to
better model interpretability [3]. In particular, different
types of attention models have been proposed to address
this problem, including hierarchical [133], stacked [242],
and episodic memory attention [236].
Another neural alternative for aligning images with cap-
tions for cross-modal retrieval was proposed by Karpathy
et al. [102], [103]. Their proposed model aligns sentence
fragments to image regions by using a dot product similar-
ity measure between image region and word representa-
tions. While it does not use attention, it extracts a latent
alignment between modalities through a similarity measure
that is learned indirectly by training a retrieval model.
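A rough sketch of this idea is shown below: word and region embeddings live in a shared space, their inner products serve as latent alignment scores, and an image-sentence score can be obtained by aggregating them. The aggregation rule here (each word keeps its best-matching region) is one simple choice, not necessarily the exact formulation of the cited work.

```python
# Dot-product cross-modal similarity as a latent alignment score; embeddings
# here are random stand-ins for learned word and region representations.
import torch

DIM = 300
words = torch.randn(7, DIM)      # embedded sentence fragments / words
regions = torch.randn(19, DIM)   # embedded image regions

scores = words @ regions.T       # (7, 19) latent word-region alignments
# A simple image-sentence score: each word keeps its best-matching region.
image_sentence_score = scores.max(dim=1).values.sum()
print(float(image_sentence_score))
```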
5.3 Discussion
Multimodal alignment faces a number of difficulties: 1)
there are few datasets with explicitly annotated alignments;
2) it is difficult to design similarity metrics between modalities; 3) there may exist multiple possible alignments and not
all elements in one modality have correspondences in
another. Earlier work on multimodal alignment focused on
aligning multimodal sequences in an unsupervised manner
using graphical models and dynamic programming techni-
ques. It relied on hand-defined measures of similarity
between the modalities or learnt them in an unsupervised
manner. With recent availability of labeled training data
supervised learning of similarities between modalities has
become possible. However, unsupervised techniques of
learning to jointly align and translate or fuse data have also
become popular.
6 FUSION
Multimodal fusion is one of the original topics in multi-
modal machine learning, with previous surveys emphasiz-
ing early, late and hybrid fusion approaches [52], [255]. In
technical terms, multimodal fusion is the concept of inte-
grating information from multiple modalities with the goal
of predicting an outcome measure: a class (e.g., happy ver-
sus sad) through classification, or a continuous value (e.g.,
positivity of sentiment) through regression. It is one of the
most researched aspects of multimodal machine learning
with work dating back 25 years [251].
The interest in multimodal fusion arises from three main
benefits it can provide. First, having access to multiple
modalities that observe the same phenomenon may allow
for more robust predictions. This has been especially
explored and exploited by the AVSR community [170]. Sec-
ond, having access to multiple modalities might allow us to
capture complementary information—something that is not
visible in individual modalities on their own. Third, a multi-
modal system can still operate when one of the modalities is
missing, for example recognizing emotions from the visual
signal when the person is not speaking [52].
Multimodal fusion has a very broad range of applica-
tions, including audio-visual speech recognition [170], mul-
timodal emotion recognition [200], medical image analysis
[93], and multimedia event detection [122]. There are a
number of reviews on the subject [11], [170], [196], [255].
Most of them concentrate on multimodal fusion for a partic-
ular task, such as multimedia analysis, information retrieval
or emotion recognition. In contrast, we concentrate on the
machine learning approaches themselves and the technical
challenges associated with these approaches.
While some prior work used the term multimodal fusion
to describe all multimodal algorithms, we classify approaches
as fusion when the multimodal integration is performed at
the later prediction stages, with the goal of predicting
outcome measures. Recently, the line between multimodal
representation and fusion has been blurred for models
such as deep neural networks where representation learning
interacts with classification or regression objectives.
We classify multimodal fusion into two main categories:
model-agnostic approaches (Section 6.1) that are not directly
dependent on a specific machine learning method; and model-
based (Section 6.2) approaches that explicitly address fusion in
their construction—such as kernel-based approaches, graphi-
cal models, and neural networks. An overview of such
approaches can be seen in Table 5.
6.1 Model-Agnostic Approaches
Historically, the vast majority of multimodal fusion has
been done using model-agnostic approaches [52]. Such
approaches can be split into early (i.e., feature-based), late
(i.e., decision-based) and hybrid fusion [11]. Early fusion
integrates features immediately after they are extracted
(often by simply concatenating their representations). Late
fusion on the other hand performs integration after each of
the modalities has made a decision (e.g., classification or
regression). Finally, hybrid fusion combines outputs from
early fusion and individual unimodal predictors. An advan-
tage of model agnostic approaches is that they can be imple-
mented using almost any unimodal classifiers or regressors.
Early fusion could be seen as an early attempt by multi-
modal researchers to perform multimodal representation
learning—as it can learn to exploit the correlation and inter-
actions between low level features of each modality. It also
only requires the training of a single model, making the
training pipeline easier compared to late and hybrid fusion.
In contrast, late fusion uses unimodal decision values
and fuses them using a fusion mechanism such as averaging
[188], voting schemes [149], weighting based on channel
noise [170] and signal variance [55], or a learned model [71],
[175]. It allows for the use of different models for each modality, as different predictors can model each individual modality better, allowing for more flexibility. Furthermore, it makes it easier to make predictions when one or more of the modalities is missing, and it even allows for training when no parallel data is available. However, late fusion ignores the low-level interactions between the modalities.
Hybrid fusion attempts to exploit the advantages of both of the above described methods in a common framework. It has been used successfully for multimodal speaker identification [234] and multimedia event detection [122].

TABLE 5
A Summary of Our Taxonomy of Multimodal Fusion Approaches

FUSION TYPE                        OUT    TEMP   TASK                          REFERENCE
Model-agnostic   Early             class  no     Emotion rec.                  [35]
                 Late              reg    yes    Emotion rec.                  [175]
                 Hybrid            class  no     Multimedia event detection    [122]
Model-based      Kernel-based      class  no     Object class.                 [32], [69]
                                   class  no     Emotion rec.                  [94], [189]
                 Graphical models  class  yes    AVSR                          [78]
                                   reg    yes    Emotion rec.                  [14]
                                   class  no     Media class.                  [97]
                 Neural networks   class  yes    Emotion rec.                  [100], [232]
                                   class  no     AVSR                          [157]
                                   reg    yes    Emotion rec.                  [39]

OUT—output type (class—classification or reg—regression); TEMP—whether temporal modeling is possible.
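The snippet below sketches early versus late fusion with scikit-learn on synthetic data standing in for two modalities; the choice of logistic regression and of score averaging for late fusion is purely illustrative.

```python
# A sketch of model-agnostic fusion; the features, labels, and classifiers are
# toy stand-ins for real multimodal data and predictors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio, X_video = rng.standard_normal((200, 20)), rng.standard_normal((200, 50))
y = rng.integers(0, 2, 200)

# Early fusion: concatenate features, train a single model.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Late fusion: one model per modality, then average the decision scores.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
late_scores = (clf_a.predict_proba(X_audio)[:, 1] +
               clf_v.predict_proba(X_video)[:, 1]) / 2
late_pred = (late_scores > 0.5).astype(int)
```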
6.2 Model-Based Approaches
While model-agnostic approaches are easy to implement
using unimodal machine learning methods, they end up
using techniques that are not designed for multimodal data.
In this section we describe three categories of approaches
that are designed to perform multimodal fusion: kernel-
based methods, graphical models, and neural networks.
Multiple kernel learning (MKL) methods are an extension
to kernel support vector machines (SVM) that allow for the
use of different kernels for different modalities/views of
the data [73]. As kernels can be seen as similarity functions
between data points, modality-specific kernels in MKL
allows for better fusion of heterogeneous data.
MKL approaches have been an especially popular method
for fusing visual descriptors for object detection [32], [69] and
only recently have been overtaken by deep learning methods
for the task [114]. They have also seen use for multimodal
affect recognition [38], [94], [189], multimodal sentiment
analysis [169], and multimedia event detection [245]. Fur-
thermore, McFee and Lanckriet [142] proposed to use MKL
to perform musical artist similarity ranking from acoustic,
semantic and social view data. Finally, Liu et al. [130] used
MKL for multimodal fusion in Alzheimer’s disease classifi-
cation. Their broad applicability demonstrates the strength
of such approaches in various domains and across different
modalities.
Besides flexibility in kernel selection, an advantage of
MKL is the fact that the loss function is convex, allowing for
model training using standard optimization packages and
global optimum solutions [73]. Furthermore, MKL can be
used to both perform regression and classification. One of
the main disadvantages of MKL is the reliance on training
data (support vectors) during test time, leading to slow
inference and a large memory footprint.
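As a simplified illustration of kernel-based fusion, the sketch below combines one RBF kernel per modality with fixed weights and trains an SVM on the precomputed combination; proper MKL would learn the kernel weights jointly, so the fixed weights here are only a stand-in.

```python
# Kernel-level fusion sketch: one kernel per modality, combined with fixed
# (not learned) weights, then an SVM trained on the precomputed kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_text, X_image = rng.standard_normal((150, 30)), rng.standard_normal((150, 80))
y = rng.integers(0, 2, 150)

K = 0.6 * rbf_kernel(X_text, gamma=0.05) + 0.4 * rbf_kernel(X_image, gamma=0.01)
svm = SVC(kernel="precomputed").fit(K, y)
print("train accuracy:", svm.score(K, y))
```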
Graphical models are another family of popular methods
for multimodal fusion. In this section we overview work
done on multimodal fusion using shallow graphical models.
A description of deep graphical models such as deep belief
networks can be found in Section 3.1.
The majority of graphical models can be classified into two main categories: generative models, which model the joint probability, and discriminative models, which model the conditional probability [209].
Some of the earliest approaches to use graphical models for
multimodal fusion include generative models such as cou-
pled [155] and factorial hidden Markov models [70] along-
side dynamic Bayesian networks [67]. A more recently proposed multi-stream HMM method performs dynamic weighting of modalities for AVSR [78].
Arguably, generative models lost popularity to discrimi-
native ones such as conditional random fields (CRF) [120]
which sacrifice the modeling of joint probability for predic-
tive power. A CRF model was used to better segment
images by combining visual and textual information of
image description [63]. CRF models have been extended to
model latent states using hidden conditional random fields
[172] and have been applied to multimodal meeting seg-
mentation [180]. Other multimodal uses of latent variable
discriminative graphical models include multi-view hidden
CRF [202] and latent variable models [201]. More recently
Jiang et al. [97] have shown the benefits of multimodal hid-
den conditional random fields for the task of multimedia
classification. While most graphical models are aimed at
classification, CRF models have been extended to a continu-
ous version for regression [171] and applied in multimodal
settings [14] for audio visual emotion recognition.
The benefit of graphical models is their ability to easily
exploit spatial and temporal structure of the data, making
them especially popular for temporal modeling tasks, such
as AVSR and multimodal affect recognition. They also make it possible to build human expert knowledge into the models and often lead to interpretable models.
Neural networks have been used extensively for the task of
multimodal fusion [157]. The earliest examples of using
neural networks for multi-modal fusion come from work on
AVSR [170]. Nowadays they are being used to fuse informa-
tion for visual and media question answering [66], [135],
[237], gesture recognition [156], affect analysis [100], [159],
and video description generation [98], [221]. Both shallow
[66] and deep [159], [221] neural models have been explored
for multimodal fusion.
Neural networks have also been used for fusing temporal
multimodal information through the use of RNNs and
LSTMs. One of the earlier such applications used a bidirectional LSTM to perform audio-visual emotion classification [232]. More recently, Wöllmer et al. [231] used
LSTM models for continuous multimodal emotion recogni-
tion, demonstrating its advantage over graphical models
and SVMs. Similarly, Nicolaou et al. [158] used LSTMs for
continuous emotion prediction. Their proposed method
used an LSTM to fuse the results from modality-specific (audio and facial expression) LSTMs.
Approaching modality fusion through recurrent neural networks has also been explored in various image captioning tasks. Example models include neural image captioning [223], where a CNN image representation is decoded using an LSTM language model, and gLSTM [95], which incorporates the image data together with sentence decoding at every time step, fusing the visual and sentence data in a joint representation. A more recent example is the multi-view LSTM (MV-LSTM) model proposed by Rajagopalan et al. [173], which allows for flexible fusion of modalities in
the LSTM framework by explicitly modeling the modality-
specific and cross-modality interactions over time.
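The sketch below, loosely in the spirit of such hierarchical recurrent fusion, feeds modality-specific LSTMs into a higher-level fusion LSTM that produces a per-time-step prediction (assuming PyTorch); the architecture and dimensions are invented for illustration and do not reproduce any specific cited model.

```python
# Temporal fusion sketch: per-modality LSTMs whose hidden states are fused by
# a higher-level LSTM; sizes and the regression head are illustrative only.
import torch
import torch.nn as nn

AUDIO_DIM, FACE_DIM, HIDDEN = 40, 68, 64

class TemporalFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_lstm = nn.LSTM(AUDIO_DIM, HIDDEN, batch_first=True)
        self.face_lstm = nn.LSTM(FACE_DIM, HIDDEN, batch_first=True)
        self.fusion_lstm = nn.LSTM(2 * HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, 1)   # e.g., continuous emotion value

    def forward(self, audio, face):        # both: (batch, time, dim), same length
        ha, _ = self.audio_lstm(audio)
        hf, _ = self.face_lstm(face)
        fused, _ = self.fusion_lstm(torch.cat([ha, hf], dim=-1))
        return self.head(fused)            # per-time-step regression output

model = TemporalFusion()
out = model(torch.randn(2, 30, AUDIO_DIM), torch.randn(2, 30, FACE_DIM))
print(out.shape)   # (2, 30, 1)
```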
A big advantage of deep neural network approaches to data fusion is their capacity to learn from large amounts of data. Second, recent neural architectures allow for end-to-end training of both the multimodal representation component and the fusion component. Finally, they show good performance compared to non-neural-network-based systems and are able to learn complex decision boundaries that other approaches struggle with.
The major disadvantage of neural network approaches is
their lack of interpretability. It is difficult to tell what the
prediction relies on, and which modalities or features play
an important role. Furthermore, neural networks require
large training datasets to be successful.
6.3 Discussion
Multimodal fusion has been a widely researched topic with
a large number of approaches proposed to tackle it, includ-
ing model agnostic methods, graphical models, multiple
kernel learning, and various types of neural networks. Each
approach has its own strengths and weaknesses, with some
more suited for smaller datasets and others performing bet-
ter in noisy environments. Most recently, neural networks
have become a very popular way to tackle multimodal
fusion, however graphical models and multiple kernel
learning are still being used, especially in tasks with limited
training data or where model interpretability is important.
Despite these advances, multimodal fusion still faces the following challenges: 1) signals might not be temporally aligned (e.g., a dense continuous signal and a sparse
event); 2) it is difficult to build models that exploit supple-
mentary and not only complementary information; 3) each
modality might exhibit different types and different levels
of noise at different points in time.
7 CO-LEARNING
The final multimodal challenge in our taxonomy is co-
learning—aiding the modeling of a (resource poor) modal-
ity by exploiting knowledge from another (resource rich)
modality. It is particularly relevant when one of the modali-
ties has limited resources—lack of annotated data, noisy
input, and unreliable labels. We call this challenge co-learn-
ing as most often the helper modality is used only during
model training and is not used during test time. We identify
three types of co-learning approaches based on their training
resources: parallel, non-parallel, and hybrid. Parallel-data
approaches require training datasets where the observations
from one modality are directly linked to the observations
from other modalities. In other words, the multimodal observations are from the same instances, such as in an audio-visual speech dataset where the video and speech samples come from the same speaker. In contrast, non-parallel
data approaches do not require direct links between observa-
tions from different modalities. These approaches usually
achieve co-learning by using overlap in terms of categories.
For example, in zero shot learning, a conventional visual object recognition dataset can be expanded with a second text-only dataset from Wikipedia to improve the generalization of visual object recognition. In the hybrid data setting the
modalities are bridged through a shared modality or a data-
set. An overview of methods in co-learning can be seen in
Table 6 and a summary of data parallelism in Fig. 3.
7.1 Parallel Data
In parallel data co-learning both modalities share a set of
instances—audio recordings with the corresponding videos,
images and their sentence descriptions. This allows for two
types of algorithms to exploit that data to better model the
modalities: co-training and representation learning.
Co-training is the process of creating more labeled train-
ing samples when we have few labeled samples in a multi-
modal problem [22]. The basic algorithm builds weak
classifiers in each modality to bootstrap each other with
labels for the unlabeled data. In the seminal work of Blum and Mitchell [22], it was shown to discover more training samples for web-page classification based on the web-page itself and the hyper-links leading to it. By definition this task
requires parallel data as it relies on the overlap of multi-
modal samples.
Co-training has been used for statistical parsing [185], to build better visual detectors [125], and for audio-visual
speech recognition [42]. It has also been extended to deal
with disagreement between modalities, by filtering out
unreliable samples [43]. While co-training is a powerful
method for generating more labeled data, it can also lead to
biased training samples resulting in overfitting.
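A minimal co-training loop in this spirit is sketched below on synthetic data: two weak classifiers, one per view, add their most confident pseudo-labels to a shared labeled pool. The confidence threshold, classifiers, and the shared pool are simplifications of the original algorithm.

```python
# Simplified co-training sketch: each view's classifier donates confident
# pseudo-labels; data, threshold, and pooling scheme are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X1, X2 = rng.standard_normal((500, 10)), rng.standard_normal((500, 10))
y_true = (X1[:, 0] + X2[:, 0] > 0).astype(int)   # ground truth, mostly hidden

pseudo = np.full(500, -1)                        # -1 marks unlabeled instances
pseudo[:20] = y_true[:20]                        # only 20 labels to start with

for _ in range(10):
    idx = np.where(pseudo >= 0)[0]
    c1 = LogisticRegression().fit(X1[idx], pseudo[idx])
    c2 = LogisticRegression().fit(X2[idx], pseudo[idx])
    for i in np.where(pseudo < 0)[0]:
        p1 = c1.predict_proba(X1[[i]])[0]
        p2 = c2.predict_proba(X2[[i]])[0]
        if p1.max() > 0.95:                      # view 1 is confident
            pseudo[i] = int(p1.argmax())
        elif p2.max() > 0.95:                    # otherwise try view 2
            pseudo[i] = int(p2.argmax())

print("labeled after co-training:", int((pseudo >= 0).sum()), "of 500")
```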
Transfer learning is another way to exploit co-learning
with parallel data. Multimodal representation learning
(Section 3.1) approaches such as multimodal deep Boltzmann
machines [206] and multimodal autoencoders [157] transfer
information from the representation of one modality to that of another. This not only leads to multimodal representations, but also to better unimodal ones, with only one modality being used during test time [157].

TABLE 6
A Summary of Co-Learning Taxonomy, Based on Data Parallelism

DATA PARALLELISM                     TASK                    REFERENCE
Parallel        Co-training          Mixture                 [22], [115]
                Transfer learning    AVSR                    [157]
                                     Lip reading             [148]
Non-parallel    Transfer learning    Visual classification   [64]
                                     Action recognition      [134]
                Concept grounding    Metaphor class.         [188]
                                     Word similarity         [107]
                Zero shot learning   Image class.            [64], [198]
                                     Thought class.          [165]
Hybrid data     Bridging             MT and image ret.       [174]
                                     Transliteration         [154]

Parallel data—multiple modalities can see the same instance. Non-parallel data—unimodal instances are independent of each other. Hybrid data—the modalities are pivoted through a shared modality or dataset.

Fig. 3. Types of data parallelism used in co-learning: parallel—modalities are from the same dataset and there is a direct correspondence between instances; non-parallel—modalities are from different datasets and do not have overlapping instances, but overlap in general categories or concepts; hybrid—the instances or concepts are bridged by a third modality or a dataset.
Moon et al. [148] show how to transfer information from
a speech recognition neural network (based on audio) to a
lip-reading one (based on images), leading to a better visual
representation, and a model that can be used for lip-reading
without need for audio information during test time. Simi-
larly, Arora and Livescu [10] build better acoustic features
using CCA on acoustic and articulatory (location of lips,
tongue and jaw) data. They use articulatory data only dur-
ing CCA construction and use only the resulting acoustic
(unimodal) representation during test time.
7.2 Non-Parallel Data
Methods that rely on non-parallel data do not require the
modalities to have shared instances, but only shared catego-
ries or concepts. Non-parallel co-learning approaches can
help when learning representations, allow for better seman-
tic concept understanding and even perform unseen object
recognition.
Transfer learning is also possible on non-parallel data and makes it possible to learn better representations by transferring information from a representation built using a data-rich or clean modality to a data-scarce or noisy modality. This type of transfer learning is often achieved by using coordinated
multimodal representations (see Section 3.2). For example,
Frome et al. [64] used text to improve visual representations
for image classification by coordinating CNN visual fea-
tures with word2vec textual ones [146] trained on separate
large datasets. Visual representations trained in such a way
result in more meaningful errors—mistaking objects for
ones of similar category [64]. Mahasseni and Todorovic
[134] demonstrated how to regularize a color video based
LSTM using an autoencoder LSTM trained on 3D skeleton
data by enforcing similarities between their hidden states.
Such an approach is able to improve the original LSTM and
lead to state-of-the-art performance in action recognition.
Conceptual grounding refers to learning semantic mean-
ings or concepts not purely based on language but also on
additional modalities such as vision, sound, or even smell
[17]. While the majority of concept learning approaches
are purely language-based, representations of meaning in
humans are not merely a product of our linguistic exposure,
but are also grounded through our sensorimotor experience
and perceptual system [18], [131]. Human semantic knowl-
edge relies heavily on perceptual information [131] and
many concepts are grounded in the perceptual system and
are not purely symbolic [18]. This implies that learning
semantic meaning purely from textual information might
not be optimal, and motivates the use of visual or acoustic
cues to ground our linguistic representations.
Starting from work by Feng and Lapata [62], grounding
is usually performed by finding a common latent space
between the representations [62], [190] (in the case of parallel datasets) or by learning unimodal representations separately and then concatenating them to obtain a multimodal one [30], [105], [179], [188] (in the case of non-parallel data). Once a
multimodal representation is constructed it can be used on
purely linguistic tasks. Shutova et al. [188] and Bruni et al.
[30] used grounded representations for better classification
of metaphors and literal language. Such representations
have also been useful for measuring conceptual similarity
and relatedness—identifying how semantically or conceptu-
ally related two words are [31], [105], [190] or actions [179].
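A toy sketch of the concatenation strategy is given below: a word's textual embedding is extended with a visual vector when one is available, and similarity is then measured in the combined space. The embeddings are random stand-ins; real systems would use distributional word vectors and aggregated visual features.

```python
# Concatenation-style conceptual grounding sketch with random stand-in
# embeddings; abstract words without images simply get no visual component.
import numpy as np

rng = np.random.default_rng(0)
text_emb = {w: rng.standard_normal(300) for w in ["dog", "cat", "idea"]}
vis_emb = {w: rng.standard_normal(128) for w in ["dog", "cat"]}   # no image for "idea"

def grounded(word):
    visual = vis_emb.get(word, np.zeros(128))     # zero vector if not grounded
    return np.concatenate([text_emb[word], visual])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

print(cosine(grounded("dog"), grounded("cat")))   # multimodal word similarity
```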
Furthermore, concepts can be grounded not only using
visual signals, but also acoustic ones, leading to better perfor-
mance especially on words with auditory associations [107],
or even olfactory signals [106] for words with smell associa-
tions. Finally, there is a lot of overlap between multimodal
alignment and conceptual grounding, as aligning visual
scenes to their descriptions leads to better textual or visual
representations [113], [168], [179], [248].
Conceptual grounding has been found to be an effective
way to improve performance on a number of tasks. It also
shows that language and vision (or audio) are complemen-
tary sources of information and combining them in multi-
modal models often improves performance. However, one
has to be careful as grounding does not always lead to better
performance [106], [107], and only makes sense when
grounding has relevance for the task—such as grounding
using images for visually-related concepts.
Zero shot learning (ZSL) refers to recognizing a concept
without having explicitly seen any examples of it, for example classifying a cat in an image without ever having seen (labeled) images of cats. This is an important problem to address, as in a number of tasks, such as visual object classification, it is prohibitively expensive to provide training examples for every imaginable object of interest.
There are two main types of ZSL—unimodal and multi-
modal. The unimodal ZSL looks at component parts or
attributes of the object, such as phonemes to recognize an
unheard word or visual attributes such as color, size, and
shape to predict an unseen visual class [57]. The multimodal
ZSL recognizes the objects in the primary modality through
the help of the secondary one—in which the object has been
seen. The multimodal version of ZSL is by definition a non-parallel data problem, as the sets of seen classes differ between the modalities.
Socher et al. [198] map image features to a conceptual
word space and are able to classify seen and unseen con-
cepts. The unseen concepts can be then assigned to a word
that is close to the visual representation—this is enabled by
the semantic space being trained on a separate dataset that
has seen more concepts. Instead of learning a mapping from
visual to concept space Frome et al. [64] learn a coordinated
multimodal representation between concepts and images
that allows for ZSL. Palatucci et al. [165] predict the words people are thinking of from functional magnetic resonance images; they show how it is possible to predict unseen words through the use of an intermediate
semantic space. Lazaridou et al. [123] present a fast map-
ping method for ZSL by mapping extracted visual feature
vectors to text-based vectors through a neural network.
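The sketch below illustrates this family of approaches on synthetic data: a linear map from visual features to a word-embedding space is fitted on seen classes, and an unseen class is recognized by the nearest word vector. The data, the ridge-regression mapper, and the cosine decision rule are all illustrative choices rather than any specific cited method.

```python
# Multimodal zero-shot sketch: map visual features into a semantic word space
# and label unseen classes by nearest word vector. Everything here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
word_vecs = {c: rng.standard_normal(50) for c in ["dog", "cat", "horse", "zebra"]}
seen = ["dog", "cat", "horse"]                      # "zebra" has no training images

# Synthetic "images": visual features correlated with the class word vector.
X = np.vstack([word_vecs[c][:30] + 0.1 * rng.standard_normal(30)
               for c in seen for _ in range(50)])
Y = np.vstack([word_vecs[c] for c in seen for _ in range(50)])

mapper = Ridge(alpha=1.0).fit(X, Y)                 # visual -> semantic space

test_img = word_vecs["zebra"][:30] + 0.1 * rng.standard_normal(30)
pred = mapper.predict(test_img[None])[0]
best = max(word_vecs, key=lambda c: pred @ word_vecs[c] /
           (np.linalg.norm(pred) * np.linalg.norm(word_vecs[c])))
print("predicted class:", best)                     # ideally recovers "zebra"
```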
7.3 Hybrid Data
In the hybrid data setting two non-parallel modalities are
bridged by a shared modality or a dataset (see Fig. 3c). The
most notable example is the Bridge Correlational Neural Net-
work [174], which uses a pivot modality to learn coordinated
multimodal representations in the presence of non-parallel data.
For example, for multilingual image captioning, the image
modality would be paired with at least one caption in any
language. Such methods have also been used to bridge lan-
guages that might not have parallel corpora but have access
to a shared pivot language, such as for machine translation
[154], [174] and document transliteration [104].
Instead of using a separate modality for bridging, some methods rely on the existence of large datasets from a similar or related task to improve performance in a task that only contains limited annotated data. Socher and Fei-Fei [197] use the existence of large text corpora to guide image segmentation, while Hendricks et al. [81] use a separately trained visual model and a language model to build a better image and video description system, for which only limited data is available.
7.4 Discussion
Multimodal co-learning allows for one modality to influ-
ence the training of another, exploiting the complementary
information across modalities. It is important to note that
co-learning is task independent and could be used to create
better fusion, translation, and alignment models. This chal-
lenge is exemplified by algorithms such as co-training, mul-
timodal representation learning, conceptual grounding, and
zero shot learning (ZSL) and has found many applications
in visual classification, action recognition, audio-visual
speech recognition, and semantic similarity estimation.
8 CONCLUSION
Multimodal machine learning is a vibrant multi-disciplinary
field which aims to build models that can process and relate
information from multiple modalities. This paper surveyed
recent advances in multimodal machine learning and pre-
sented them in a common taxonomy built upon five technical
challenges faced by multimodal researchers: representation,
translation, alignment, fusion, and co-learning. For each chal-
lenge, we presented a taxonomic sub-classification that helps to understand the breadth of current multimodal research.
Although the focus of this survey paper was primarily on the
last decade of multimodal research, it is important to address
future challenges with a knowledge of past achievements.
Moving forward, the proposed taxonomy gives research-
ers a framework to understand current research and identify
understudied challenges for future research. We summarized
each technical challenge with a discussion of future directions
and research problems (see Sections 3.3, 4.3, 5.3, 6.3 and 7.4).
We believe that all these aspects of multimodal research are
needed if we want to build computers able to perceive, model
and generate multimodal signals. One specific area of multi-
modal machine learning which seems to be under-studied is
co-learning, where knowledge from one modality helps with
modeling in another modality. This challenge is related to the
concept of coordinated representations where each modality
keeps its own representation but finds a way to exchange
and coordinate knowledge. We see these lines of research as
promising directions for future research.