Multimodal Machine Learning:
A Survey and Taxonomy
Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency

T. Baltrušaitis is with Microsoft Corporation, Cambridge CB1 2FB, United Kingdom. E-mail: tbaltrus@cs.cmu.edu.
C. Ahuja and L.-P. Morency are with the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213. E-mail: {cahuja, morency}@cs.cmu.edu.
Manuscript received 22 May 2017; revised 21 Nov. 2017; accepted 4 Jan. 2018. Date of publication 24 Jan. 2018; date of current version 16 Jan. 2019. (Corresponding author: Tadas Baltrušaitis.) Recommended for acceptance by T. Berg. Digital Object Identifier no. 10.1109/TPAMI.2018.2798607.
Abstract—Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors.
Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when
it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to
be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate
information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential.
Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself
and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader
challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning.
This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Index Terms—Multimodal, machine learning, introductory, survey
1 INTRODUCTION
The world surrounding us involves multiple modalities—
we see objects, hear sounds, feel texture, smell odors,
and so on. In general terms, a modality refers to the way in
which something happens or is experienced. Most people
associate the word modality with the sensory modalities which
represent our primary channels of communication and sen-
sation, such as vision or touch. A research problem or dataset
is therefore characterized as multimodal when it includes
multiple such modalities. In this paper we focus primarily,
but not exclusively, on three modalities: natural language, which can be written or spoken; visual signals, which are often represented with images or videos; and vocal signals, which encode sounds and para-verbal information such as prosody and vocal expressions.
In order for Artificial Intelligence to make progress in
understanding the world around us, it needs to be able to
interpret and reason about multimodal messages. Multi-
modal machine learning aims to build models that can pro-
cess and relate information from multiple modalities.
From early research on audio-visual speech recognition
to the recent explosion of interest in language and vision
models, multimodal machine learning is a vibrant multi-
disciplinary field of increasing importance and with
extraordinary potential.
The research field of Multimodal Machine Learning
brings some unique challenges for computational research-
ers given the heterogeneity of the data. Learning from
multimodal sources offers the possibility of capturing corre-
spondences between modalities and gaining an in-depth
understanding of natural phenomena. In this paper we
identify and explore five core technical challenges (and
related sub-challenges) surrounding multimodal machine
learning. They are central to the multimodal setting and
need to be tackled in order to progress the field. Our taxon-
omy goes beyond the typical early and late fusion split, and
consists of the five following challenges:
Representation. A first fundamental challenge is learn-
ing how to represent and summarize multimodal
data in a way that exploits the complementarity and
redundancy of multiple modalities. The heterogene-
ity of multimodal data makes it challenging to con-
struct such representations. For example, language is
often symbolic while audio and visual modalities
will be represented as signals.
Translation. A second challenge addresses how to
translate (map) data from one modality to another.
Not only is the data heterogeneous, but the relation-
ship between modalities is often open-ended or sub-
jective. For example, there exist a number of correct
ways to describe an image, and one perfect translation may not exist.
Alignment. A third challenge is to identify the direct
relations between (sub)elements from two or more
different modalities. For example, we may want to
align the steps in a recipe to a video showing the
dish being made. To tackle this challenge we need
to measure similarity between different modalities
and deal with possible long-range dependencies
and ambiguities.
Fusion. A fourth challenge is to join information from
two or more modalities to perform a prediction. For
example, for audio-visual speech recognition, the
visual description of the lip motion is fused with the
speech signal to predict spoken words. The informa-
tion coming from different modalities may have
varying predictive power and noise topology, with
possibly missing data in at least one of the modalities.
Co-learning. A fifth challenge is to transfer knowl-
edge between modalities, their representation, and
their predictive models. This is exemplified by algo-
rithms of co-training, conceptual grounding, and zero
shot learning. Co-learning explores how knowledge learned from one modality can help a computational
model trained on a different modality. This challenge
is particularly relevant when one of the modalities has
limited resources (e.g., annotated data).
For each of these five challenges, we define taxonomic
classes and sub-classes to help structure the recent work in
this emerging research field of multimodal machine learn-
ing. We start with a discussion of the main applications of
multimodal machine learning (Section 2) followed by a dis-
cussion on the recent developments on all of the five core
technical challenges facing multimodal machine learning:
representation (Section 3), translation (Section 4), alignment
(Section 5), fusion (Section 6), and co-learning (Section 7).
We conclude with a discussion in Section 8.
2 APPLICATIONS: A HISTORICAL PERSPECTIVE
Multimodal machine learning enables a wide range of
applications: from audio-visual speech recognition to image
captioning. In this section we present a brief history of mul-
timodal applications, from its beginnings in audio-visual
speech recognition to a recently renewed interest in lan-
guage and vision applications.
One of the earliest examples of multimodal research is
audio-visual speech recognition (AVSR) [251]. It was moti-
vated by the McGurk effect [143]—an interaction between
hearing and vision during speech perception. When human
subjects heard the syllable /ba-ba/ while watching the lips
of a person saying /ga-ga/, they perceived a third sound:
/da-da/. These results motivated many researchers from
the speech community to extend their approaches with
visual information. Given the prominence of hidden Mar-
kov models (HMMs) in the speech community at the time
[99], it is no surprise that many of the early models for
AVSR were based on various HMM extensions [25], [26].
While research into AVSR is not as common these days, it
has seen renewed interest from the deep learning commu-
nity [157].
While the original vision of AVSR was to improve speech
recognition performance (e.g., word error rate) in all con-
texts, the experimental results showed that the main advan-
tage of visual information was when the speech signal was
noisy (i.e., low signal-to-noise ratio) [78], [157], [251]. In
other words, the captured interactions between modalities
were supplementary rather than complementary. The same
information was captured in both, improving the robust-
ness of the multimodal models but not improving the
speech recognition performance in noiseless scenarios.
A second important category of multimodal applications
comes from the field of multimedia content indexing and
retrieval [11], [196]. With the advance of personal com-
puters and the internet, the quantity of digitized multime-
dia content has increased dramatically [2]. While earlier
approaches for indexing and searching these multimedia
videos were keyword-based [196], new research problems
emerged when trying to search the visual and multimodal
content directly. This led to new research topics in multi-
media content analysis such as automatic shot-boundary
detection [128] and video summarization [55]. These
research projects were supported by the TrecVid initiative from the National Institute of Standards and Technology, which introduced many high-quality datasets, including the multimedia event detection (MED) tasks started in 2011 [1].
A third category of applications was established in the
early 2000s around the emerging field of multimodal inter-
action with the goal of understanding human multimodal
behaviors during social interactions. One of the first land-
mark datasets collected in this field is the AMI Meeting
Corpus which contains more than 100 hours of video
recordings of meetings, all fully transcribed and annotated
[34]. Another important dataset is the SEMAINE corpus, which allowed the study of interpersonal dynamics between
speakers and listeners [144]. This dataset formed the basis
of the first audio-visual emotion challenge (AVEC) orga-
nized in 2011 [186]. The fields of emotion recognition and
affective computing bloomed in the early 2010s thanks to
strong technical advances in automatic face detection, facial
landmark detection, and facial expression recognition [48].
The AVEC challenge continued annually afterward, with later instantiations including healthcare applications such as
automatic assessment of depression and anxiety [217]. A
great summary of recent progress in multimodal affect rec-
ognition was published by D’Mello et al. [52]. Their meta-
analysis revealed that the majority of recent works on multimodal affect recognition show an improvement when using
more than one modality, but this improvement is reduced
when recognizing naturally-occurring emotions.
Most recently, a new category of multimodal applica-
tions emerged with an emphasis on language and vision:
media description. One of the most representative applica-
tions is image captioning where the task is to generate a text
description of the input image [86]. This is motivated by the
ability of such systems to help the visually impaired in their
daily tasks [21]. Recently, progress has also been made on the inverse task: media generation from text [37], [178]. The main challenge facing media description and generation is evaluation: how to evaluate the quality of the predicted descriptions and media. The task of visual question-answering (VQA) was recently proposed to address some of these evaluation challenges [9], as it provides a correct answer.
In order to bring some of the mentioned applications to
the real world we need to address a number of technical
challenges facing multimodal machine learning. We sum-
marize the relevant technical challenges for the above
mentioned application areas in Table 1. One of the most
important challenges is multimodal representation, the
focus of our next section.
3 MULTIMODAL REPRESENTATIONS
Representing data in a format that a computational model
can work with has always been a challenge in machine
learning. Following Bengio et al. [19] we use the terms feature and representation interchangeably, with each refer-
ring to a vector or tensor representation of an entity, be it an
image, audio sample, individual word, or a sentence. A
multimodal representation is a representation of data using
information from multiple such entities. Representing mul-
tiple modalities poses many difficulties: how to combine the
data from heterogeneous sources; how to deal with different
levels of noise; and how to handle missing data. The ability
to represent data in a meaningful way is crucial to multi-
modal problems, and forms the backbone of any model.
Good representations are important for the performance
of machine learning models, as evidenced by the recent
leaps in performance of speech recognition [82] and visual
object classification [114] systems. Bengio et al. [19] identify a
number of properties for good representations: smoothness,
temporal and spatial coherence, sparsity, and natural clus-
tering amongst others. Srivastava and Salakhutdinov [206]
identify additional desirable properties for multimodal rep-
resentations: similarity in the representation space should
reflect the similarity of the corresponding concepts, the
representation should be easy to obtain even in the absence
of some modalities, and finally, it should be possible to fill-in
missing modalities given the observed ones.
The development of unimodal representations has been
extensively studied [4], [19], [127]. In the past decade there
has been a shift from representations hand-designed for specific applications to data-driven ones. For example, one of the most popular
ways to represent an image in the early 2000s was through a
bag of visual words representation of hand designed fea-
tures, such as the scale invariant feature transform (SIFT)
[132]. However, currently most images (or their parts) are represented using descriptions learned from data with
neural architectures such as convolutional neural networks
(CNN) [114]. Similarly, in the audio domain, acoustic fea-
tures such as Mel-frequency cepstral coefficients (MFCC)
have been superseded by data-driven deep neural networks
in speech recognition [82] and recurrent neural networks
for para-linguistic analysis [216]. In natural language proc-
essing, the textual features initially relied on counting word
occurrences in documents, but these have been replaced by data-
driven word embeddings that exploit the word context
[146]. While there has been a huge amount of work on
unimodal representation, up until recently most multi-
modal representations involved simple concatenation of
unimodal ones [52], but this has been rapidly changing.
To help understand the breadth of work, we propose two
categories of multimodal representation: joint and coordi-
nated. Joint representations combine the unimodal signals
into the same representation space, while coordinated
representations process unimodal signals separately, but
enforce certain similarity constraints on them to bring them
to what we term a coordinated space. An illustration of dif-
ferent multimodal representation types can be seen in Fig. 1.
Mathematically, the joint representation is expressed as
$x_m = f(x_1, \ldots, x_n)$,    (1)

where the multimodal representation $x_m$ is computed using a function $f$ (e.g., a deep neural network, restricted Boltzmann machine, or a recurrent neural network) that relies on unimodal representations $x_1, \ldots, x_n$. The coordinated representation, in contrast, is expressed as

$f(x_1) \sim g(x_2)$,    (2)
where each modality has a corresponding projection function ($f$ and $g$ above) that maps it into a coordinated multimodal space. The projection into the multimodal space is independent for each modality, but the resulting space is coordinated between them (indicated as $\sim$). Examples of such coordination include minimizing cosine distance [64], maximizing correlation [7], and enforcing a partial order [220] between the resulting spaces.

TABLE 1
A Summary of Applications Enabled by Multimodal Machine Learning

Challenges considered: REPRESENTATION, TRANSLATION, ALIGNMENT, FUSION, CO-LEARNING
Speech recognition: audio-visual speech recognition
Event detection: action classification; multimedia event detection
Emotion and affect: recognition; synthesis
Media description: image description; video description; visual question-answering; media summarization
Multimedia retrieval: cross-modal retrieval; cross-modal hashing
Multimedia generation: (visual) speech and sound synthesis; image and scene generation

For each application area we identify the core technical challenges that need to be addressed in order to tackle it.
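To make the distinction concrete, the following is a minimal sketch, assuming a PyTorch implementation with illustrative layer sizes (none of this comes from the surveyed papers): Equation (1) is realized by concatenating unimodal features and passing them through a shared network, while Equation (2) is realized by separate projections whose outputs are tied by a cosine-similarity constraint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRepresentation(nn.Module):
    """Equation (1): x_m = f(x_1, ..., x_n), here concatenation followed by an MLP."""
    def __init__(self, dims, joint_dim=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(sum(dims), 256), nn.ReLU(),
                               nn.Linear(256, joint_dim))

    def forward(self, *unimodal):
        return self.f(torch.cat(unimodal, dim=-1))

class CoordinatedRepresentation(nn.Module):
    """Equation (2): separate projections f and g, coordinated by a similarity constraint."""
    def __init__(self, dim1, dim2, coord_dim=128):
        super().__init__()
        self.f = nn.Linear(dim1, coord_dim)
        self.g = nn.Linear(dim2, coord_dim)

    def forward(self, x1, x2):
        z1, z2 = self.f(x1), self.g(x2)
        # Coordination: paired samples should have high cosine similarity.
        coordination_loss = 1 - F.cosine_similarity(z1, z2).mean()
        return z1, z2, coordination_loss

# Toy usage with placeholder "visual" (2048-d) and "textual" (300-d) features.
visual, text = torch.randn(4, 2048), torch.randn(4, 300)
x_m = JointRepresentation([2048, 300])(visual, text)                 # joint: (4, 128)
z_v, z_t, loss = CoordinatedRepresentation(2048, 300)(visual, text)  # coordinated spaces
```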
3.1 Joint Representations
We start our discussion with joint representations that proj-
ect unimodal representations together into a multimodal
space (Equation (1)). Joint representations are mostly (but
not exclusively) used in tasks where multimodal data is
present both during training and inference steps. The sim-
plest example of a joint representation is a concatenation of
individual modality features (also referred to as early fusion
[52]). In this section we discuss more advanced methods for
creating joint representations starting with neural networks,
followed by graphical models and recurrent neural net-
works (representative works can be seen in Table 2).
Neural networks have become a very popular method for
unimodal data representation [19]. They are used to repre-
sent visual, acoustic, and textual data, and are increasingly
used in the multimodal domain [157], [163], [225]. In this
section we describe how neural networks can be used to
construct a joint multimodal representation, how to train
them, and what advantages they offer.
In general, neural networks are made up of successive
building blocks of inner products followed by non-linear
activation functions. In order to use a neural network as a
way to represent data, it is first trained to perform a specific
task (e.g., recognizing objects in images). Due to the multi-
layer nature of deep neural networks each successive layer
is hypothesized to represent the data in a more abstract way
[19], hence it is common to use the final or penultimate neu-
ral layers as a form of data representation. To construct a
multimodal representation using neural networks each
modality starts with several individual neural layers fol-
lowed by a hidden layer that projects the modalities into a
joint space [9], [150], [163], [235]. The joint multimodal
representation is then passed through multiple hidden
layers itself or used directly for prediction. Such models can
be trained end-to-end—learning both to represent the data
and to perform a particular task. This results in a close rela-
tionship between multimodal representation learning and
multimodal fusion when using neural networks.
As neural networks require a lot of labeled training data,
it is common to pre-train such representations using either
unsupervised data (e.g., using autoencoder models [12],
[83]) or supervised data from a different but related domain
[9], [221]. The model proposed by Ngiam et al. [157]
extended the idea of using autoencoders to the multimodal
domain. They used stacked denoising autoencoders to rep-
resent each modality individually and then fused them into
a multimodal representation using another autoencoder
layer. Similarly, Silberer and Lapata [191] proposed to use a
multimodal autoencoder for the task of semantic concept
grounding (see Section 7.2). In addition to using a recon-
struction loss to train the representation they introduce a
term into the loss function that uses the representation to
predict object labels.
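As a rough illustration of this family of models, the sketch below encodes each modality separately, fuses the codes into a shared representation, and trains with a reconstruction loss on both inputs. It is a simplification under assumed dimensions and a single fusion layer, not the exact stacked architectures of [157] or [191].

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, d_audio=40, d_video=512, d_shared=64):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_audio, 64), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(d_video, 64), nn.ReLU())
        self.fuse = nn.Linear(64 + 64, d_shared)   # shared multimodal code
        self.dec_a = nn.Linear(d_shared, d_audio)
        self.dec_v = nn.Linear(d_shared, d_video)

    def forward(self, audio, video):
        code = torch.relu(self.fuse(torch.cat([self.enc_a(audio),
                                               self.enc_v(video)], dim=-1)))
        return self.dec_a(code), self.dec_v(code), code

audio, video = torch.randn(8, 40), torch.randn(8, 512)
rec_a, rec_v, code = BimodalAutoencoder()(audio, video)
# Unsupervised pre-training objective: reconstruct both modalities from the shared code.
loss = nn.MSELoss()(rec_a, audio) + nn.MSELoss()(rec_v, video)
```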
The major advantage of neural network based joint rep-
resentations comes from their ability to pre-train from unla-
bled data when labeled data is not enough for supervised
learning. It is also common to fine-tune the resulting repre-
sentation on a particular task at hand as the representation
constructed with unsupervised data is generic and not nec-
essarily optimal for a specific task [225]. One of the disad-
vantages comes from the model not being able to handle
missing data naturally—although there are ways to alleviate
this issue [157], [225]. Finally, deep networks are often difficult to train [72], but the field is making progress with new techniques such as improved regularization [204], batch normalization [92], and adaptive gradient algorithms [109].
Fig. 1. Structure of joint and coordinated representations. Joint representations are projected to the same space using all of the modalities as input.
Coordinated representations, on the other hand, exist in their own space, but are coordinated through a similarity (e.g., Euclidean distance) or
structure constraint (e.g., partial order).
TABLE 2
A Summary of Multimodal Representation Techniques

REPRESENTATION          MODALITIES              REFERENCE
Joint
  Neural networks       Images + Audio          [150], [157], [235]
                        Images + Text           [191]
  Graphical models      Images + Text           [206]
                        Images + Audio          [108]
  Sequential            Audio + Video           [100], [158]
                        Images + Text           [173]
Coordinated
  Similarity            Images + Text           [64], [110]
                        Video + Text            [166], [239]
  Structured            Images + Text           [33], [220], [256]
                        Audio + Articulatory    [228]

We identify three subtypes of joint representations (Section 3.1) and two subtypes of coordinated ones (Section 3.2). For modalities, + indicates the modalities combined.
Probabilistic graphical models can be used to construct rep-
resentations through the use of latent random variables [19].
In this section we describe how probabilistic graphical mod-
els are used to represent unimodal and multimodal data.
One way to represent data is through deep Boltzmann machines (DBMs) [183], which stack restricted Boltzmann machines (RBMs) [84] as building blocks. Similar to neural
networks, each successive layer of a DBM is expected to rep-
resent the data at a higher level of abstraction. The appeal of
DBMs comes from the fact that they do not need supervised
data for training [183]. Because they are graphical models, the representation of the data is probabilistic; however, it is possible to convert them to a deterministic neural network, though this loses the generative aspect of the model [183].
Work by Srivastava and Salakhutdinov [205] introduced
multimodal deep belief networks and multimodal DBMs
[206] as multimodal representations. Kim et al. [108] used a
deep belief network for each modality and then combined
them into a joint representation for audiovisual emotion rec-
ognition. Huang and Kingsbury [89] used a similar model
for AVSR, and Wu et al. [233] for audio and skeleton joint
based gesture recognition. Ouyang et al. [163] explored
the use of multimodal DBMs for the task of human pose
estimation from multi-view data. They demonstrated that
integrating the data at a later stage—after unimodal data
underwent nonlinear transformations—was beneficial for
the model. Similarly, Suk et al. [207] used multimodal DBM
representation to perform Alzheimer’s disease classification
from positron emission tomography and magnetic reso-
nance imaging data.
One of the big advantages of using multimodal DBMs for
learning multimodal representations is their generative
nature, which allows for an easy way to deal with missing
data—even if a whole modality is missing, the model has a
natural way to cope. It can also be used to generate samples
of one modality in the presence of the other one, or both
modalities from the representation. Similar to autoencoders
the representation can be trained in an unsupervised man-
ner enabling the use of unlabeled data. The major disadvan-
tage of DBMs is the difficulty of training them—high
computational cost, and the need to use approximate varia-
tional training methods [206].
Sequential Representation. So far we have discussed mod-
els that can represent fixed length data, however, we often
need to represent varying length sequences such as senten-
ces, videos, or audio streams. Recurrent neural networks
(RNNs), and their variants such as long short-term memory (LSTM) networks [85], have recently gained popularity
due to their success in sequence modeling across various
tasks [13], [222]. So far RNNs have mostly been used to rep-
resent unimodal sequences of words, audio, or images, with
most success in the language domain. Similar to traditional
neural networks, the hidden state of an RNN can be seen as
a representation of the data, i.e., the hidden state of an RNN at timestep t can be seen as a summarization of the sequence
up to that timestep. This is especially apparent in RNN
encoder-decoder frameworks where the task of an encoder
is to represent a sequence in the hidden state of an RNN in
such a way that a decoder could reconstruct it [13], [244].
The use of RNN representations has not been limited to
the unimodal domain. An early use of constructing a multi-
modal representation using RNNs comes from work by
Cosi et al. [45] on AVSR. They have also been used for repre-
senting audio-visual data for affect recognition [39], [158]
and to represent multi-view data such as different visual
cues for human behavior analysis [173].
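A minimal sketch of this idea, with assumed feature sizes: the final hidden state of an LSTM serves as a fixed-length summary of a variable-length unimodal sequence, which can then enter a joint or coordinated multimodal representation.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=74, hidden_size=128, batch_first=True)  # e.g., acoustic frames
frames = torch.randn(4, 200, 74)      # batch of 4 sequences, 200 timesteps each
outputs, (h_n, c_n) = lstm(frames)
sequence_repr = h_n[-1]               # (4, 128): summary of each whole sequence
# sequence_repr can now be concatenated with, or coordinated against,
# a representation of another modality (e.g., video features).
```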
3.2 Coordinated Representations
An alternative to a joint multimodal representation is a
coordinated representation. Instead of projecting the modal-
ities together into a joint space, separate representations are
learned for each modality but are coordinated through a
constraint. We start our discussion with coordinated repre-
sentations that enforce similarity between representations,
moving on to coordinated representations that enforce
more structure on the resulting space (representative works
of such coordinated representations can be seen in Table 2).
Similarity models minimize the distance between modali-
ties in the coordinated space. For example such models
encourage the representation of the word dog and an image
of a dog to have a smaller distance between them than the distance between the word dog and an image of a car [64]. One
of the earliest examples of such a representation comes
from the work by Weston et al. [229], [230] on the WSABIE
(web scale annotation by image embedding) model, where
a coordinated space was constructed for images and their
annotations. WSABIE constructs a simple linear map from
image and textual features such that corresponding annota-
tion and image representation would have a higher inner
product (smaller cosine distance) between them than non-
corresponding ones.
More recently, neural networks have become a popular
way to construct coordinated representations, due to their
ability to learn representations. Their advantage lies in the
fact that they can jointly learn coordinated representations
in an end-to-end manner. An example of such coordinated
representation is DeViSE—a deep visual-semantic embed-
ding [64]. DeViSE uses a similar inner product and ranking
loss function to WSABIE but uses more complex image and
word embeddings. Kiros et al. [110] extended this to sen-
tence and image coordinated representation by using an
LSTM model and a pairwise ranking loss to coordinate the
feature space. Socher et al. [199] tackle the same task, but
extend the language model to a dependency tree RNN to
incorporate compositional semantics. A similar model was
also proposed by Pan et al. [166], but using videos instead
of images. Xu et al. [239] also constructed a coordinated
space between videos and sentences using a ⟨subject, verb, object⟩ compositional language model and a deep video
model. This representation was then used for the task of
cross-modal retrieval and video description.
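The common ingredient of these similarity models is a ranking objective. The sketch below shows one possible form of a pairwise hinge ranking loss in the spirit of WSABIE and DeViSE; the margin value, inner-product scoring, and use of in-batch negatives are assumptions rather than the exact losses of [64], [110], [229].

```python
import torch
import torch.nn.functional as F

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, d) projected embeddings; row i of each is a matching pair."""
    scores = img_emb @ txt_emb.t()                     # (batch, batch) similarity matrix
    positive = scores.diag().unsqueeze(1)              # matching-pair scores
    # Hinge loss: mismatched pairs should score lower than the match by a margin.
    cost_txt = F.relu(margin + scores - positive)      # image vs. wrong captions
    cost_img = F.relu(margin + scores - positive.t())  # caption vs. wrong images
    mask = 1 - torch.eye(scores.size(0))               # ignore the diagonal (true pairs)
    return ((cost_txt + cost_img) * mask).sum() / scores.size(0)

loss = ranking_loss(torch.randn(8, 128), torch.randn(8, 128))
```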
While the above models enforced similarity between rep-
resentations, structured coordinated space models go beyond
that and enforce additional constraints between the modal-
ity representations. The type of structure enforced is often
based on the application, with different constraints for hash-
ing, cross-modal retrieval, and image captioning.
Structured coordinated spaces are commonly used in
cross-modal hashing—compression of high dimensional
data into compact binary codes with similar binary codes
for similar objects [226]. The idea of cross-modal hashing is
to create such codes for cross-modal retrieval [28], [97],
[118]. Hashing enforces certain constraints on the resulting
multimodal space: 1) it has to be an N-dimensional Ham-
ming space—a binary representation with controllable
number of bits; 2) the same object from different modalities
has to have a similar hash code; 3) the space has to be
similarity-preserving. Learning how to represent the data as
a hash function attempts to enforce all of these three require-
ments [28], [118]. For example, Jiang and Li [96] introduced a
method to learn such a common binary space between sentence descriptions and corresponding images using end-to-end trainable deep learning techniques. Cao et al. [33]
extended the approach with a more complex LSTM sentence
representation and introduced an outlier insensitive bit-wise
margin loss and a relevance feedback based semantic simi-
larity constraint. Similarly, Wang et al. [227] constructed a
coordinated space in which images (and sentences) with sim-
ilar meanings are closer to each other.
Another example of a structured coordinated representa-
tion comes from order-embeddings of images and language
[220], [257]. The model proposed by Vendrov et al. [220]
enforces a dissimilarity metric that is asymmetric and imple-
ments the notion of partial order in the multimodal space.
The idea is to capture a partial order of the language and
image representations—enforcing a hierarchy on the space;
for example, image of a woman walking her dog → text woman walking her dog → text woman walking. A similar
model using denotation graphs was also proposed by Young
et al. [246] where denotation graphs are used to induce a par-
tial ordering. Lastly, Zhang et al. present how exploiting
structured representations of text and images can create con-
cept taxonomies in an unsupervised manner [257].
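The order-embedding idea can be summarized by an asymmetric violation penalty. The small sketch below writes one common form of it; the sign convention and non-negative embeddings are stated as assumptions rather than a faithful reproduction of [220].

```python
import torch

def order_violation(x, y):
    """Penalty for the hypothesis that x is the more specific item, ordered below y.
    Zero when x dominates y coordinatewise (assumed convention); positive otherwise."""
    return torch.clamp(y - x, min=0).pow(2).sum(dim=-1)

img = torch.abs(torch.randn(1, 64))    # e.g., embedding of a "woman walking her dog" image
caption = 0.5 * img                     # a more general / abstract description
print(order_violation(img, caption))    # zero here: the assumed ordering holds
print(order_violation(caption, img))    # positive: the reversed ordering is violated
```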
A special case of a structured coordinated space is one
based on canonical correlation analysis (CCA) [87]. CCA
computes a linear projection which maximizes the correla-
tion between two random variables (in our case modalities)
and enforces orthogonality of the new space. CCA models
have been used extensively for cross-modal retrieval [79],
[111], [176] and audiovisual signal analysis [184], [195].
Extensions to CCA attempt to construct a correlation maxi-
mizing nonlinear projection [7], [121]. Kernel canonical cor-
relation analysis (KCCA) [121] uses reproducing kernel
Hilbert spaces for projection. However, as the approach is
nonparametric it scales poorly with the size of the training
set and has issues with very large real-world datasets. Deep
canonical correlation analysis (DCCA) [7] was introduced
as an alternative to KCCA and addresses the scalability
issue; it was also shown to lead to a better correlated representation space. A similar correspondence autoencoder [61] and deep correspondence RBMs [60] have also been pro-
posed for cross-modal retrieval.
CCA, KCCA, and DCCA are unsupervised techniques
and only optimize the correlation over the representations,
thus mostly capturing what is shared across the modalities.
Deep canonically correlated autoencoders [228] also include an autoencoder based data reconstruction term. This encourages the representation to also capture modality-specific information. The semantic correlation maximization method [256] also encourages semantic relevance, while
retaining correlation maximization and orthogonality of the
resulting space—this leads to a combination of CCA and
cross-modal hashing techniques.
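For reference, plain linear CCA can be computed with off-the-shelf tools. The toy example below uses scikit-learn on synthetic two-view data (the data and dimensions are placeholders); KCCA or DCCA would replace the linear projections with kernel or deep mappings.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
shared = rng.randn(500, 5)                                     # latent factor shared by both views
X = shared @ rng.randn(5, 100) + 0.1 * rng.randn(500, 100)     # "visual" view
Y = shared @ rng.randn(5, 50) + 0.1 * rng.randn(500, 50)       # "textual" view

cca = CCA(n_components=5)
X_c, Y_c = cca.fit_transform(X, Y)                             # coordinated (correlated) projections
# Correlation of the first pair of canonical variates should be close to 1.
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
```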
3.3 Discussion
In this section we identified two major types of multimodal
representations—joint and coordinated. Joint representa-
tions project multimodal data into a common space and are
best suited for situations when all of the modalities are pres-
ent during inference. They have been extensively used for
AVSR, affect, and multimodal gesture recognition. Coordi-
nated representations, on the other hand, project each
modality into a separate but coordinated space, making
them suitable for applications where only one modality is
present at test time, such as: multimodal retrieval and trans-
lation (Section 4), grounding (Section 7.2), and zero shot
learning (Section 7.2). Furthermore, while joint representations have been used to construct representations of more than two modalities, coordinated spaces have, so far, been mostly limited to two. Finally, the multimodal networks we discussed are largely static; in the future we
may see more work on one modality driving the structure
of a network applied to another modality [6].
4 TRANSLATION
A big part of multimodal machine learning is concerned with
translating (mapping) from one modality to another. Given
an entity in one modality the task is to generate the same
entity in a different modality. For example given an image
we might want to generate a sentence describing it or given a
textual description generate an image matching it. Multi-
modal translation is a long studied problem, with early work
in speech synthesis [91], visual speech generation [141], video
description [112], and cross-modal retrieval [176].
More recently, multimodal translation has seen renewed
interest due to combined efforts of the computer vision and
natural language processing (NLP) communities [20] and
recent availability of large multimodal datasets [40], [214].
A particularly popular problem is visual scene description,
also known as image [223] and video captioning [222],
which acts as a great test bed for a number of computer
vision and NLP problems. To solve it, we not only need to
fully understand the visual scene and to identify its salient
parts, but also to produce grammatically correct and com-
prehensive yet concise sentences describing it.
While the approaches to multimodal translation are very
broad and are often modality specific, they share a number
of unifying factors. We categorize them into two types—
example-based, and generative. Example-based models use a
dictionary when translating between the modalities. Genera-
tive models, on the other hand, construct a model that is able
to produce a translation. This distinction is similar to the
one between non-parametric and parametric machine learn-
ing approaches and is illustrated in Fig. 2, with representa-
tive examples summarized in Table 3.
Generative models are arguably more challenging to build
as they require the ability to generate signals or sequences of
symbols (e.g., sentences). This is difficult for any modality—
visual, acoustic, or verbal, especially when temporally and
structurally consistent sequences need to be generated.
This led to many of the early multimodal translation systems
relying on example-based translation. However, this has been
changing with the advent of deep learning models that are
capable of generating images [178], [218], sounds [161], [164],
and text [13].
4.1 Example-Based
Example-based algorithms are restricted by their training
data—dictionary (see Fig. 2a). We identify two types of
such algorithms: retrieval based, and combination based.
Retrieval-based models directly use the retrieved translation
without modifying it, while combination-based models rely
on more complex rules to create translations based on a
number of retrieved instances.
Retrieval-based models are arguably the simplest form of
multimodal translation. They rely on finding the closest
sample in the dictionary and using that as the translated
result. The retrieval can be done in unimodal space or inter-
mediate semantic space.
Given a source modality instance to be translated, unim-
odal retrieval finds the closest instances in the dictionary in
the space of the source—for example, visual feature space
for images. Such approaches have been used for visual
speech synthesis, by retrieving the closest matching visual
example of the desired phoneme [27]. They have also been
used in concatenative text-to-speech systems [91]. More
recently, Ordonez et al. [162] used unimodal retrieval to gen-
erate image descriptions by using global image features to
retrieve caption candidates. Yagcioglu et al. [240] used a
CNN-based image representation to retrieve visually similar
images using adaptive neighborhood selection. Devlin et al.
[51] demonstrated that a simple k-nearest neighbor retrieval
with consensus caption selection achieves competitive trans-
lation results when compared to more complex generative
approaches. The advantage of such unimodal retrieval
approaches is that they only require the representation of a
single modality through which we are performing retrieval.
However, they often require an extra multimodal post-proc-
essing step such as re-ranking of retrieved translations [140],
[162], [240]. This indicates a major problem with this
approach—similarity in unimodal space does not always
imply a good translation.
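A minimal sketch of unimodal retrieval-based translation for image captioning, under the assumption that pre-computed visual features and a captioned dictionary are available; the cosine-similarity search and the naive closest-neighbor selection (rather than consensus re-ranking as in [51]) are simplifications.

```python
import numpy as np

def retrieve_caption(query_feat, dict_feats, dict_captions, k=5):
    # Cosine similarity between the query image and every dictionary image.
    q = query_feat / np.linalg.norm(query_feat)
    D = dict_feats / np.linalg.norm(dict_feats, axis=1, keepdims=True)
    sims = D @ q
    top_k = np.argsort(-sims)[:k]          # indices of the k most similar images
    # Simplest rule: reuse the caption of the single closest image;
    # combination-based methods would instead merge the k candidates.
    return dict_captions[top_k[0]], [dict_captions[i] for i in top_k]

feats = np.random.randn(1000, 2048)        # placeholder for pre-computed CNN features
captions = [f"caption {i}" for i in range(1000)]
best, candidates = retrieve_caption(np.random.randn(2048), feats, captions)
```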
An alternative is to use an intermediate semantic space
for similarity comparison during retrieval. An early exam-
ple of a hand crafted semantic space is one used by Farhadi
et al. [58]. They map both sentences and images to a space
of ⟨object, action, scene⟩; retrieval of a relevant caption for an image is then performed in that space. In contrast to hand-
crafting a representation, Socher et al. [199] learn a coordi-
nated representation of sentences and CNN visual features
(see Section 3.2 for description of coordinated spaces).
They use the model for both translating from text to
images and from images to text. Similarly, Xu et al. [239]
used a coordinated space of videos and their descriptions
for cross-modal retrieval. Jiang and Li [97] and Cao et al.
[33] use cross-modal hashing to perform multimodal
translation from images to sentences and back, while
Hodosh et al. [86] use a multimodal KCCA space for
image-sentence retrieval. Instead of aligning images and
sentences globally in a common space, Karpathy et al.
[103] propose a multimodal similarity metric that inter-
nally aligns image fragments (visual objects) together
with sentence fragments (dependency tree relations).
Retrieval approaches in semantic space tend to perform
better than their unimodal counterparts as they are retriev-
ing examples in a more meaningful space that reflects both
modalities and that is often optimized for retrieval. Further-
more, they allow for bi-directional translation, which is not
straightforward with unimodal methods. However, they
require manual construction or learning of such a semantic
space, which often relies on the existence of large training
dictionaries (datasets of paired samples).
Fig. 2. Overview of example-based and generative multimodal translation. The former retrieves the best translation from a dictionary, while the latter
first trains a translation model on the dictionary and then uses that model for translation.
TABLE 3
Taxonomy of Multimodal Translation Research

                     TASKS                 DIR.   REFERENCES
Example-based
  Retrieval          Image captioning      →      [58], [162]
                     Media retrieval       ↔      [199], [239]
                     Visual speech         →      [27]
                     Image captioning      ↔      [102], [103]
  Combination        Image captioning      →      [77], [119], [124]
Generative
  Grammar based      Video description     →      [15], [213]
                     Image description     →      [53], [126], [147]
  Encoder-decoder    Image captioning      →      [110], [139]
                     Video description     →      [222], [249]
                     Text to image         →      [137], [178]
  Continuous         Sounds synthesis      →      [161], [164]
                     Visual speech         →      [5], [49], [212]

For each class and sub-class, we include example tasks with references. Our taxonomy also includes the directionality of the translation: unidirectional (→) and bidirectional (↔).
Combination-based models take the retrieval based app-
roaches one step further. Instead of just retrieving examples
from the dictionary, they combine them in a meaningful
way to construct a better translation. Combination based
media description approaches are motivated by the fact that
sentence descriptions of images share a common and simple
structure that could be exploited. Most often the rules for
combinations are hand crafted or based on heuristics.
Kuznetsova et al. [119] first retrieve phrases that describe
visually similar images and then combine them to generate
novel descriptions of the query image by using Integer Lin-
ear Programming with a number of hand crafted rules.
Gupta et al. [77] first find k images most similar to the
source image, and then use the phrases extracted from their
captions to generate a target sentence. Lebret et al. [124] use
a CNN-based image representation to infer phrases that
describe it. The predicted phrases are then combined using
a trigram constrained language model.
A big problem facing example-based approaches for trans-
lation is that the model is the entire dictionary—making the
model large and inference slow (although, optimizations
such as hashing alleviate this problem). Another issue facing
example-based translation is that it is unrealistic to expect
that a single comprehensive and accurate translation relevant
to the source example will always exist in the dictionary—
unless the task is simple or the dictionary is very large.
This is partly addressed by combination models that are
able to construct more complex structures. However, they
are only able to perform translation in one direction, while
semantic space retrieval-based models are able to perform
it both ways.
4.2 Generative Approaches
Generative approaches to multimodal translation construct
models that can perform multimodal translation given a
unimodal source instance. It is a challenging problem as it
requires the ability to both understand the source modality
and to generate the target sequence or signal. As discussed
in the following section, this also makes such methods
much more difficult to evaluate, due to large space of possi-
ble correct answers.
In this survey we focus on the generation of three modal-
ities: language, vision, and sound. Language generation has
been explored for a long time [177], with a lot of recent
attention for tasks such as image and video description [20].
Speech and sound generation has also seen a lot of work
with a number of historical [91] and modern approaches
[161], [164]. Photo-realistic image generation has been less
explored, and is still in early stages [137], [178], however,
there have been a number of attempts at generating abstract
scenes [261], computer graphics [47], and talking heads [5].
We identify three broad categories of generative models:
grammar-based, encoder-decoder, and continuous generation
models. Grammar-based models simplify the task by restricting the target domain using a grammar, e.g., by generating restricted sentences based on a ⟨subject, object, verb⟩ template. Encoder-decoder models first encode the
source modality to a latent representation which is then
used by a decoder to generate the target modality. Continu-
ous generation models generate the target modality contin-
uously based on a stream of source modality inputs and are
most suited for translating between temporal sequences—
such as text-to-speech.
Grammar-based models rely on a pre-defined grammar for
generating a particular modality. They start by detecting
high level concepts from the source modality, such as objects
in images and actions from videos. These detections are then
incorporated together with a generation procedure based on
a pre-defined grammar to result in a target modality.
Kojima et al. [112] proposed a system to describe human
behavior in a video using the detected position of the person’s
head and hands and rule based natural language generation
that incorporates a hierarchy of concepts and actions. Barbu
et al. [15] proposed a video description model that generates
sentences of the form: who did what to whom and where and
how they did it. The system was based on handcrafted object
and event classifiers and used a restricted grammar suitable
for the task. Guadarrama et al. [76] predict ⟨subject, verb, object⟩ triplets describing a video using semantic hierarchies
that use more general words in case of uncertainty. Together
with a language model their approach allows for translation
of verbs and nouns not seen in the dictionary.
To describe images, Yao et al. [243] propose to use an
and-or graph-based model together with domain-specific
lexicalized grammar rules, targeted visual representation
scheme, and a hierarchical knowledge ontology. Li et al.
[126] first detect objects, visual attributes, and spatial rela-
tionships between objects. They then use an n-gram lan-
guage model on the visually extracted phrases to generate
⟨subject, preposition, object⟩ style sentences. Mitchell et al. [147] use a more sophisticated tree-based language model to generate syntactic trees instead of filling in templates, leading to more diverse descriptions. A majority of approaches represent the whole image jointly as a bag of
visual objects without capturing their spatial and semantic
relationships. To address this, Elliott et al. [53] propose to
explicitly model proximity relationships of objects for
image description generation.
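As a toy illustration of the template-filling idea behind these grammar-based methods (the detections and the sentence template are hypothetical), detected concept triplets can be slotted into a restricted grammar:

```python
def describe(detections):
    """detections: list of (subject, preposition, object) triplets from vision models."""
    sentences = []
    for subject, preposition, obj in detections:
        sentences.append(f"A {subject} is {preposition} a {obj}.")
    return " ".join(sentences)

print(describe([("dog", "next to", "chair"), ("woman", "behind", "table")]))
# -> "A dog is next to a chair. A woman is behind a table."
```

Real systems layer language models, attribute detectors, and spatial reasoning on top of this basic scheme, but the restricted grammar is what keeps the output syntactically well formed.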
Some grammar-based approaches rely on graphical mod-
els to generate the target modality. An example includes
BabyTalk [117], which, given an image, generates ⟨object, preposition, object⟩ triplets that are used together with a conditional random field to construct the sentences. Yang et al. [241] predict a set of ⟨noun, verb, scene, preposition⟩ candidates using visual features extracted from an image and
combine them into a sentence using a statistical language
model and hidden Markov model style inference. A similar
approach has been proposed by Thomason et al. [213], where
a factor graph model is used for video description of the form
⟨subject, verb, object, place⟩. The factor model exploits language statistics to deal with noisy visual representations. Going the other way, Zitnick et al. [261] propose to use condi-
tional random fields to generate abstract visual scenes based
on language triplets extracted from sentences.
An advantage of grammar-based methods is that they
are more likely to generate syntactically (in case of lan-
guage) or logically correct target instances as they use pre-
defined templates and restricted grammars. However, this
limits them to producing formulaic rather than creative
translations. Furthermore, grammar-based methods rely on
complex pipelines for concept detection, with each concept
requiring a separate model and a separate training dataset.
Encoder-decoder models based on end-to-end trained neu-
ral networks are currently some of the most popular techni-
ques for multimodal translation. The main idea behind the
model is to first encode a source modality into a vectorial
representation and then to use a decoder module to gener-
ate the target modality, all this in a single pass pipeline.
Although first used for machine translation [101], [208],
such models have been successfully used for image caption-
ing [139], [223], and video description [181], [222]. While
encoder-decoder models have been mostly used to generate
text, they can also generate images [137], [178], and speech
and sound [161], [164].
The first step of the encoder-decoder model is to encode
the source object; this is done in a modality-specific way. Popular models to encode acoustic signals include RNNs [36] and DBNs [82]. Most of the work on encoding words and sentences uses distributional semantics [146] and variants of
RNNs [13]. Images are most often encoded using convolu-
tional neural networks [114], [193]. Although there are
methods for learning video representations [59], [192],
hand-crafted features are still used [181], [213]. While it is
possible to use unimodal representations to encode the
source modality, it has been shown that using a coordinated
space (see Section 3.2) leads to better results [110], [166].
Decoding is most often performed by an RNN or an LSTM
using the encoded representation as the initial hidden state
[56], [137], [223]. A number of extensions have been
proposed to traditional LSTM models to aid in the task of
translation. A guide vector can be used to couple the decoder more tightly to the image input [95]. Venugopalan et al.
[222] demonstrate that it is beneficial to pre-train a decoder
LSTM for image captioning before fine-tuning it to video
description. Rohrbach et al. [181] explore the use of various
LSTM architectures (single layer, multilayer, factored) and a
number of training and regularization techniques for the
task of video description.
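A minimal sketch of the encoder-decoder pattern for image captioning, assuming a torchvision ResNet encoder and illustrative vocabulary and hidden sizes; the real systems discussed above add attention, beam search, and careful pre-training.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial LSTM state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        h0 = torch.tanh(self.init_h(image_feat)).unsqueeze(0)   # (1, B, H)
        c0 = torch.zeros_like(h0)
        outputs, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(outputs)                                # word logits at every step

encoder = models.resnet18(weights=None)   # random weights here; a pre-trained CNN is typical
encoder.fc = nn.Identity()                # use the pooled 512-d features as the encoding
images, captions = torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12))
logits = CaptionDecoder()(encoder(images), captions)            # (2, 12, 10000)
```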
A problem facing translation generation using an RNN is
that the model has to generate a description from a single
vectorial representation of the image, sentence, or video.
This becomes especially difficult when generating long
sequences as these models tend to forget the initial input.
This has been partly addressed by including the encoded
information during every step of the decoder [95]. Attention
models (see Section 5.2) have also been proposed to allow
the decoder to better focus on certain parts of an image
[238], sentence [13], or video [244] during generation.
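A sketch of what such soft attention computes at each decoding step, assuming dot-product scoring (additive, MLP-based scoring as in [238] is equally common):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, region_feats):
    """decoder_state: (B, d); region_feats: (B, R, d), one vector per image region."""
    scores = torch.bmm(region_feats, decoder_state.unsqueeze(2)).squeeze(2)  # (B, R)
    weights = F.softmax(scores, dim=1)                                       # attention map
    context = torch.bmm(weights.unsqueeze(1), region_feats).squeeze(1)       # (B, d)
    return context, weights   # context feeds the next decoding step

context, weights = attend(torch.randn(2, 512), torch.randn(2, 49, 512))
```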
Generative attention-based RNNs have also been used
for the task of generating images from sentences [137]; while the results are still far from photo-realistic, they show a lot of
promise. More recently, a large amount of progress has been
made in generating images using generative adversarial
networks [74], which have been used as an alternative to
RNNs for image generation from text [178].
While neural network based encoder-decoder systems
have been very successful they still face a number of issues.
Devlin et al. [51] suggest that it is possible that the network
is memorizing the training data rather than learning how to
understand the visual scene and generate it, based on the
observation that k-nearest neighbor models perform simi-
larly to those based on generation. Furthermore, such mod-
els often require large quantities of data for training.
Continuous generation models are intended for sequence
translation and produce outputs at every timestep in an
online manner. These models are useful when translating
from a sequence to a sequence such as text to speech, speech
to text, and video to text. A number of different techniques
have been proposed for such modeling—graphical models,
continuous encoder-decoder approaches, and various other
regression or classification techniques. The extra difficulty
that needs to be tackled by these models is the requirement
of temporal consistency between modalities.
A lot of early work on sequence to sequence translation
used graphical or latent variable models. Deena and Galata
[49] proposed to use a shared Gaussian process latent vari-
able model for audio-based visual speech synthesis. The
model creates a shared latent space between audio and
visual features that can be used to generate one space from
the other, while enforcing temporal consistency of visual
speech at different timesteps. Hidden Markov models
(HMM) have also been used for visual speech generation
[212] and text-to-speech [253] tasks. They have also been
extended to use cluster adaptive training to allow for train-
ing on multiple speakers, languages, and emotions allowing
for more control when generating speech signal [252] or
visual speech parameters [5].
Encoder-decoder models have recently become popular
for sequence to sequence modeling. Owens et al. [164] used
an LSTM to generate sounds resulting from drumsticks
based on video. While their model is capable of generating
sounds by predicting a cochleogram from CNN visual fea-
tures, they found that retrieving a closest audio sample
based on the predicted cochleogram led to best results.
Directly modeling the raw audio signal for speech and
music generation has been proposed by van den Oord et al.
[161]. The authors propose using hierarchical fully convolu-
tional neural networks, which show a large improvement
over previous state-of-the-art for the task of speech synthe-
sis. RNNs have also been used for speech to text translation
(speech recognition) [75]. More recently, an encoder-decoder based continuous approach was shown to be good at predicting letters from a speech signal represented as filter bank spectra [36]—allowing for more accurate recognition
of rare and out of vocabulary words. Collobert et al. [44]
demonstrate how to use a raw audio signal directly for
speech recognition, eliminating the need for audio features.
A lot of earlier work used graphical models for multi-
modal translation between continuous signals. However,
these methods are being replaced by neural network
encoder-decoder based techniques. Especially as they have
recently been shown to be able to represent and generate
complex visual and acoustic signals.
4.3 Model Evaluation and Discussion
A major challenge facing multimodal translation methods
is that they are very difficult to evaluate. While some
tasks such as speech recognition have a single correct
translation, tasks such as speech synthesis and media
description do not. Sometimes, as in language translation,
multiple answers are correct and deciding which transla-
tion is better is often subjective. Fortunately, there are a
number of approximate automatic metrics that aid in
model evaluation.
Often the ideal way to evaluate a subjective task is
through human judgment, that is, by having a group of people evaluate each translation. This can be done on a Likert
scale where each translation is evaluated on a certain
dimension: naturalness and mean opinion score for speech
synthesis [161], [252], realism for visual speech synthesis
[5], [212], and grammatical and semantic correctness, rele-
vance, order, and detail for media description [40], [117],
[147], [222]. Another option is to perform preference studies
where two (or more) translations are presented to the partic-
ipant for preference comparison [212], [252]. However,
while user studies will result in evaluation closest to human
judgments they are time consuming and costly. Further-
more, they require care when constructing and conducting
them to avoid fluency, age, gender and culture biases.
While human studies are a gold standard for evaluation,
a number of automatic alternatives have been proposed for
the task of media description: BLEU [167], ROUGE [129],
Meteor [50], and CIDEr [219]. However, their use has faced a lot of criticism, and they have been shown to correspond only weakly to human judgments [54], [90].
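For concreteness, n-gram overlap metrics such as BLEU can be computed with standard toolkits; the toy example below uses NLTK, with the smoothing choice as an assumption, and it should be read with the caveats above in mind.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "woman", "is", "walking", "her", "dog"],
              ["a", "person", "walks", "a", "dog", "outside"]]
candidate = ["a", "woman", "walks", "her", "dog"]
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")   # higher is better, but correlates only weakly with humans
```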
Hodosh et al. [86] propose using retrieval as a proxy for
image captioning evaluation, as a better way to reflect
human judgments. Instead of generating captions, a retrieval
based system ranks the available captions based on their fit
to the image, and is then evaluated by assessing if the correct
captions are given a high rank. As a number of caption generation models are generative, they can be used directly to assess the likelihood of a caption given an image, and this form of evaluation is being adopted by the image captioning community [103], [110].
Such retrieval based evaluation metrics have also been
adopted by the video captioning community [182].
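A minimal sketch of such retrieval-based evaluation is shown below: given a matrix of model scores between images and candidate captions, we measure how often the ground-truth caption is ranked in the top k. The scoring matrix here is random and only stands in for a real model.

```python
# A sketch of retrieval-based caption evaluation: rank all candidate captions
# by a model score and check whether the true caption lands in the top k.
import numpy as np

def recall_at_k(score_matrix, k=5):
    """score_matrix[i, j] = model score of caption j for image i;
    caption i is assumed to be the ground truth for image i."""
    ranks = []
    for i, scores in enumerate(score_matrix):
        order = np.argsort(-scores)                 # best caption first
        ranks.append(int(np.where(order == i)[0][0]))
    return float(np.mean([r < k for r in ranks]))

rng = np.random.default_rng(0)
scores = rng.standard_normal((100, 100))            # dummy image-caption scores
print("R@5:", recall_at_k(scores, k=5))
```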
The visual question-answering (VQA) task [135] was proposed partly due to the issues facing evaluation of image captioning. VQA is a task where, given an image and a question about its content, the system has to answer it. Evaluating such systems is easier due to the presence of a correct answer, turning the task into one of multimodal fusion (see Section 6) rather than translation. The image co-reference task [113], [138] was proposed to address this ambiguity as well, by framing the task as one of multimodal alignment (see Section 5).
We believe that addressing the evaluation issue will be
crucial for further success of multimodal translation systems.
This will allow not only for better comparison between
approaches, but also for better objectives to optimize.
5 ALIGNMENT
We define multimodal alignment as finding relationships
and correspondences between sub-components of instances
from two or more modalities. For example, given an image
and a caption we want to find the areas of the image corre-
sponding to the caption’s words or phrases [102]. Another
example is, given a movie, aligning it to the script or the
book chapters it was based on [260]. The ability to do this is particularly important for multimedia retrieval, as it enables us
to search video content based on text, e.g., finding scenes in
a movie where a particular character appears, or finding
images that contain blue chairs.
We categorize multimodal alignment into two types—
implicit and explicit. In explicit alignment, we are explicitly
interested in aligning sub-components between modalities,
e.g., aligning recipe steps with the corresponding instruc-
tional video [136]. Implicit alignment is used as an interme-
diate (often latent) step for another task, e.g., image
retrieval based on text description can include an alignment
step between words and image regions [103]. An overview
of such approaches can be seen in Table 4 and is presented
in more detail in the following sections.
5.1 Explicit Alignment
We categorize papers as performing explicit alignment if
their main modeling objective is alignment between sub-
components of instances from two or more modalities. A
very important part of explicit alignment is the similarity
metric. Most approaches rely on measuring similarity
between sub-components in different modalities as a basic
building block. These similarities can be defined manually
or learned from data. We identify two types of algorithms
that tackle explicit alignment—unsupervised and (weakly)
supervised. The first type operates with no direct alignment
labels (i.e., labeled correspondences) between instances
from the different modalities. The second type has access to
such (sometimes weak) labels.
Unsupervised multimodal alignment tackles modality
alignment without requiring any direct alignment labels.
Most of the approaches are inspired from early work on
alignment for statistical machine translation [29] and genome
sequences [116], [151]. To make the task easier, the approaches assume certain constraints on alignment, such as temporal ordering of the sequences or the existence of a similarity metric between the modalities.
Dynamic time warping (DTW) [116], [151] is a dynamic
programming approach that has been extensively used
to align multi-view time series. DTW measures the similar-
ity between two sequences and finds an optimal match
between them by time warping (inserting frames). It requires
the timesteps in the two sequences to be comparable and
requires a similarity measure between them. DTW can be
used directly for multimodal alignment by hand-crafting
similarity metrics between modalities; for example, Anguera et al. [8] use a manually defined similarity between graphemes and phonemes; and Tapaswi et al. [210] define a similarity between visual scenes and sentences based on the appearance of the same characters [210] to align TV shows and plot synopses. DTW-like dynamic programming approaches have also been used for multimodal alignment of text to speech [80] and video [211].

TABLE 4
Summary of Our Taxonomy for the Multimodal Alignment Challenge

ALIGNMENT                        MODALITIES           REFERENCE
Explicit    Unsupervised         Video + Text         [136], [210], [211]
                                 Video + Audio        [160], [215], [259]
            Supervised           Video + Text         [24], [260]
                                 Image + Text         [113], [138], [168]
Implicit    Graphical models     Audio/Text + Text    [194], [224]
            Neural networks      Image + Text         [102], [236], [238]
                                 Video + Text         [244], [249]

For each sub-class of our taxonomy, we include reference citations and modalities aligned.
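The following is a minimal sketch of the DTW recursion described above, assuming a simple Euclidean cost between time steps; real multimodal systems would substitute a hand-crafted or learned cross-modal similarity.

```python
# A minimal dynamic time warping sketch. The cost function is a stand-in for
# whatever (hand-crafted or learned) cross-modal similarity a real system uses.
import numpy as np

def dtw(seq_a, seq_b, cost=lambda a, b: np.linalg.norm(a - b)):
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]            # total alignment cost (monotonic, no large jumps)

a = np.random.randn(50, 10)   # e.g., 50 audio frames with 10-d features
b = np.random.randn(60, 10)   # e.g., 60 video frames projected to the same space
print("DTW cost:", dtw(a, b))
```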
As the original DTW formulation requires a pre-defined
similarity metric between modalities, it was extended using
canonical correlation analysis to map the modalities to a
coordinated space. This allows for both aligning (through
DTW) and learning the mapping (through CCA) between
different modality streams jointly and in an unsupervised
manner [187], [258], [259]. While CCA based DTW models
are able to find multimodal data alignment under a linear
transformation, they are not able to model non-linear rela-
tionships. This has been addressed by the deep canonical
time warping approach [215], which can be seen as a gener-
alization of deep CCA and DTW.
Various graphical models have also been popular for
multimodal sequence alignment in an unsupervised man-
ner. Early work by Yu and Ballard [247] used a generative
graphical model to align visual objects in images with spo-
ken words. A similar approach was taken by Cour et al. [46]
to align movie shots and scenes to the corresponding
screenplay. Malmaud et al. [136] used a factored HMM to
align recipes to cooking videos, while Noulas et al. [160]
used a dynamic Bayesian network to align speakers to vid-
eos. Naim et al. [153] matched sentences with correspond-
ing video frames using a hierarchical HMM model to align
sentences with frames and a modified IBM [29] algorithm
for word and object alignment [16]. This model was then
extended to use latent conditional random fields for align-
ments [152] and to incorporate verb alignment to actions in
addition to nouns and objects [203].
Both DTW and graphical model approaches for alignment
allow for restrictions on alignment, e.g., temporal consis-
tency, no large jumps in time, and monotonicity. While DTW
extensions allow for learning both the similarity metric and
alignment jointly, graphical model based approaches require
expert knowledge for construction [46], [247].
Supervised alignment methods rely on labeled aligned
instances. They are used to train similarity measures that
are used for aligning modalities.
A number of supervised sequence alignment techniques
take inspiration from unsupervised ones. Bojanowski et al.
[23], [24] proposed a method similar to canonical time warp-
ing, but have also extended it to take advantage of existing
(weak) supervisory alignment data for model training.
Plummer et al. [168] used CCA to find a coordinated space
between image regions and phrases for alignment. Gebru
et al. [68] trained a Gaussian mixture model and performed
semi-supervised clustering together with an unsupervised
latent-variable graphical model to align speakers in an
audio channel with their locations in a video. Kong et al.
[113] trained a Markov random field to align objects in 3D
scenes to nouns and pronouns in text descriptions.
Deep learning based approaches are becoming popular
for explicit alignment (specifically for measuring similarity)
due to very recent availability of aligned datasets in the lan-
guage and vision communities [138], [168]. Zhu et al. [260]
aligned books with their corresponding movies/scripts by
training a CNN to measure similarities between scenes and
text. Mao et al. [138] used an LSTM language model and a
CNN visual one to evaluate the quality of a match between
a referring expression and an object in an image. Yu et al.
[250] extended this model to include relative appearance
and context information that allows it to better disambiguate
between objects of the same type. Finally, Hu et al. [88] used
an LSTM based scoring function to find similarities between
image regions and their descriptions.
5.2 Implicit Alignment
In contrast to explicit alignment, implicit alignment is used
as an intermediate (often latent) step for another task. This
allows for better performance in a number of tasks includ-
ing speech recognition, machine translation, media descrip-
tion, and visual question-answering. Such models do not
explicitly align data and do not rely on supervised align-
ment examples, but learn how to latently align the data dur-
ing model training. We identify two types of implicit
alignment models: earlier work based on graphical models,
and more modern neural network methods.
Graphical models were used in some early work to better align words between languages for machine translation [224] and to align speech phonemes with their transcriptions [194]. However, they require manual construction
of a mapping between the modalities, for example a genera-
tive phone model that maps phonemes to acoustic features
[194]. Constructing such models requires training data or
human expertise to define them manually.
Neural networks. Translation (Section 4) is an example of a
modeling task that can often be improved if alignment is
performed as a latent intermediate step. As we mentioned
before, neural networks are popular ways to address this
translation problem, using either an encoder-decoder model
or through cross-modal retrieval. When translation is per-
formed without implicit alignment, it ends up putting a lot
of weight on the encoder module to be able to properly
summarize the whole image, sentence or a video with a sin-
gle vectorial representation.
A very popular way to address this is through attention
[13], which allows the decoder to focus on sub-components
of the source instance. This is in contrast with encoding all
source sub-components together, as is performed in a con-
ventional encoder-decoder model. An attention module will
tell the decoder to look more at targeted sub-components of
the source to be translated—areas of an image [238], words
of a sentence [13], segments of an audio sequence [36], [41],
frames and regions in a video [244], [249], and even parts of
an instruction [145]. For example, in image captioning
instead of encoding an entire image using a CNN, an atten-
tion mechanism will allow the decoder (typically an RNN) to
focus on particular parts of the image when generating each
successive word [238]. The attention module which learns
what part of the image to focus on is typically a shallow neu-
ral network and is trained end-to-end together with a target
task (e.g., translation).
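The snippet below sketches such a soft attention module (assuming PyTorch): the decoder state is scored against every image region, and the resulting weights act as a latent alignment when computing the visual context for the next word. All dimensions are illustrative.

```python
# A sketch of soft attention for implicit alignment in captioning; sizes are
# made up (e.g., a 7x7 CNN feature map flattened to 49 regions).
import torch
import torch.nn as nn

REGIONS, FEAT, HIDDEN = 49, 512, 256

class SoftAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_img = nn.Linear(FEAT, HIDDEN)
        self.proj_dec = nn.Linear(HIDDEN, HIDDEN)
        self.score = nn.Linear(HIDDEN, 1)

    def forward(self, regions, dec_state):
        # regions: (batch, REGIONS, FEAT); dec_state: (batch, HIDDEN)
        e = self.score(torch.tanh(self.proj_img(regions) +
                                  self.proj_dec(dec_state).unsqueeze(1)))
        alpha = torch.softmax(e.squeeze(-1), dim=1)       # alignment weights
        context = (alpha.unsqueeze(-1) * regions).sum(1)  # attended visual context
        return context, alpha

attn = SoftAttention()
ctx, weights = attn(torch.randn(2, REGIONS, FEAT), torch.randn(2, HIDDEN))
print(weights.shape)   # (2, 49): how much each region contributes to the next word
```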
Attention models have also been successfully applied to
question answering tasks, as they allow for aligning the
words in a question with sub-components of an information
source such as a piece of text [236], an image [65], or a video
sequence [254]. This allows for better accuracy and leads to
better model interpretability [3]. In particular, different
types of attention models have been proposed to address
this problem, including hierarchical [133], stacked [242],
and episodic memory attention [236].
Another neural alternative for aligning images with cap-
tions for cross-modal retrieval was proposed by Karpathy
et al. [102], [103]. Their proposed model aligns sentence
fragments to image regions by using a dot product similar-
ity measure between image region and word representa-
tions. While it does not use attention, it extracts a latent
alignment between modalities through a similarity measure
that is learned indirectly by training a retrieval model.
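A rough sketch of this idea is shown below: word and region embeddings live in a shared space, their inner products serve as latent alignment scores, and an image-sentence score can be obtained by aggregating them. The aggregation rule here (each word keeps its best-matching region) is one simple choice, not necessarily the exact formulation of the cited work.

```python
# Dot-product cross-modal similarity as a latent alignment score; embeddings
# here are random stand-ins for learned word and region representations.
import torch

DIM = 300
words = torch.randn(7, DIM)      # embedded sentence fragments / words
regions = torch.randn(19, DIM)   # embedded image regions

scores = words @ regions.T       # (7, 19) latent word-region alignments
# A simple image-sentence score: each word keeps its best-matching region.
image_sentence_score = scores.max(dim=1).values.sum()
print(float(image_sentence_score))
```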
5.3 Discussion
Multimodal alignment faces a number of difficulties: 1)
there are few datasets with explicitly annotated alignments;
2) it is difficult to design similarity metrics between modalities; 3) there may exist multiple possible alignments and not
all elements in one modality have correspondences in
another. Earlier work on multimodal alignment focused on
aligning multimodal sequences in an unsupervised manner
using graphical models and dynamic programming techni-
ques. It relied on hand-defined measures of similarity
between the modalities or learnt them in an unsupervised
manner. With recent availability of labeled training data
supervised learning of similarities between modalities has
become possible. However, unsupervised techniques of
learning to jointly align and translate or fuse data have also
become popular.
6 FUSION
Multimodal fusion is one of the original topics in multi-
modal machine learning, with previous surveys emphasiz-
ing early, late and hybrid fusion approaches [52], [255]. In
technical terms, multimodal fusion is the concept of inte-
grating information from multiple modalities with the goal
of predicting an outcome measure: a class (e.g., happy ver-
sus sad) through classification, or a continuous value (e.g.,
positivity of sentiment) through regression. It is one of the
most researched aspects of multimodal machine learning
with work dating back 25 years [251].
The interest in multimodal fusion arises from three main
benefits it can provide. First, having access to multiple
modalities that observe the same phenomenon may allow
for more robust predictions. This has been especially
explored and exploited by the AVSR community [170]. Sec-
ond, having access to multiple modalities might allow us to
capture complementary information—something that is not
visible in individual modalities on their own. Third, a multi-
modal system can still operate when one of the modalities is
missing, for example recognizing emotions from the visual
signal when the person is not speaking [52].
Multimodal fusion has a very broad range of applica-
tions, including audio-visual speech recognition [170], mul-
timodal emotion recognition [200], medical image analysis
[93], and multimedia event detection [122]. There are a
number of reviews on the subject [11], [170], [196], [255].
Most of them concentrate on multimodal fusion for a partic-
ular task, such as multimedia analysis, information retrieval
or emotion recognition. In contrast, we concentrate on the
machine learning approaches themselves and the technical
challenges associated with these approaches.
While some prior work used the term multimodal fusion
to describe all multimodal algorithms, we classify approaches
as fusion when the multimodal integration is performed at
the later prediction stages, with the goal of predicting
outcome measures. Recently, the line between multimodal
representation and fusion has been blurred for models
such as deep neural networks where representation learning
interacts with classification or regression objectives.
We classify multimodal fusion into two main categories:
model-agnostic approaches (Section 6.1) that are not directly
dependent on a specific machine learning method; and model-
based (Section 6.2) approaches that explicitly address fusion in
their construction—such as kernel-based approaches, graphi-
cal models, and neural networks. An overview of such
approaches can be seen in Table 5.
6.1 Model-Agnostic Approaches
Historically, the vast majority of multimodal fusion has
been done using model-agnostic approaches [52]. Such
approaches can be split into early (i.e., feature-based), late
(i.e., decision-based) and hybrid fusion [11]. Early fusion
integrates features immediately after they are extracted
(often by simply concatenating their representations). Late
fusion on the other hand performs integration after each of
the modalities has made a decision (e.g., classification or
regression). Finally, hybrid fusion combines outputs from
early fusion and individual unimodal predictors. An advan-
tage of model agnostic approaches is that they can be imple-
mented using almost any unimodal classifiers or regressors.
Early fusion could be seen as an early attempt by multi-
modal researchers to perform multimodal representation
learning—as it can learn to exploit the correlation and inter-
actions between low level features of each modality. It also
only requires the training of a single model, making the
training pipeline easier compared to late and hybrid fusion.
In contrast, late fusion uses unimodal decision values
and fuses them using a fusion mechanism such as averaging
[188], voting schemes [149], weighting based on channel
noise [170] and signal variance [55], or a learned model [71],
[175]. It allows for the use of different models for each modality, as different predictors can model each individual modality better, allowing for more flexibility. Furthermore, it makes it easier to make predictions when one or more of the modalities is missing, and it even allows for training when no parallel data is available. However, late fusion ignores the low-level interactions between the modalities.
Hybrid fusion attempts to exploit the advantages of both of the above described methods in a common framework. It has been used successfully for multimodal speaker identification [234] and multimedia event detection [122].

TABLE 5
A Summary of Our Taxonomy of Multimodal Fusion Approaches

FUSION TYPE                        OUT    TEMP   TASK                          REFERENCE
Model-agnostic   Early             class  no     Emotion rec.                  [35]
                 Late              reg    yes    Emotion rec.                  [175]
                 Hybrid            class  no     Multimedia event detection    [122]
Model-based      Kernel-based      class  no     Object class.                 [32], [69]
                                   class  no     Emotion rec.                  [94], [189]
                 Graphical models  class  yes    AVSR                          [78]
                                   reg    yes    Emotion rec.                  [14]
                                   class  no     Media class.                  [97]
                 Neural networks   class  yes    Emotion rec.                  [100], [232]
                                   class  no     AVSR                          [157]
                                   reg    yes    Emotion rec.                  [39]

OUT—output type (class—classification or reg—regression); TEMP—whether temporal modeling is possible.
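The snippet below sketches early versus late fusion with scikit-learn on synthetic data standing in for two modalities; the choice of logistic regression and of score averaging for late fusion is purely illustrative.

```python
# A sketch of model-agnostic fusion; the features, labels, and classifiers are
# toy stand-ins for real multimodal data and predictors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio, X_video = rng.standard_normal((200, 20)), rng.standard_normal((200, 50))
y = rng.integers(0, 2, 200)

# Early fusion: concatenate features, train a single model.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Late fusion: one model per modality, then average the decision scores.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
late_scores = (clf_a.predict_proba(X_audio)[:, 1] +
               clf_v.predict_proba(X_video)[:, 1]) / 2
late_pred = (late_scores > 0.5).astype(int)
```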
6.2 Model-Based Approaches
While model-agnostic approaches are easy to implement
using unimodal machine learning methods, they end up
using techniques that are not designed for multimodal data.
In this section we describe three categories of approaches
that are designed to perform multimodal fusion: kernel-
based methods, graphical models, and neural networks.
Multiple kernel learning (MKL) methods are an extension
to kernel support vector machines (SVM) that allow for the
use of different kernels for different modalities/views of
the data [73]. As kernels can be seen as similarity functions
between data points, modality-specific kernels in MKL
allows for better fusion of heterogeneous data.
MKL approaches have been an especially popular method
for fusing visual descriptors for object detection [32], [69] and
only recently have been overtaken by deep learning methods
for the task [114]. They have also seen use for multimodal
affect recognition [38], [94], [189], multimodal sentiment
analysis [169], and multimedia event detection [245]. Fur-
thermore, McFee and Lanckriet [142] proposed to use MKL
to perform musical artist similarity ranking from acoustic,
semantic and social view data. Finally, Liu et al. [130] used
MKL for multimodal fusion in Alzheimer’s disease classifi-
cation. Their broad applicability demonstrates the strength
of such approaches in various domains and across different
modalities.
Besides flexibility in kernel selection, an advantage of
MKL is the fact that the loss function is convex, allowing for
model training using standard optimization packages and
global optimum solutions [73]. Furthermore, MKL can be
used to both perform regression and classification. One of
the main disadvantages of MKL is the reliance on training
data (support vectors) during test time, leading to slow
inference and a large memory footprint.
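As a simplified illustration of kernel-based fusion, the sketch below combines one RBF kernel per modality with fixed weights and trains an SVM on the precomputed combination; proper MKL would learn the kernel weights jointly, so the fixed weights here are only a stand-in.

```python
# Kernel-level fusion sketch: one kernel per modality, combined with fixed
# (not learned) weights, then an SVM trained on the precomputed kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_text, X_image = rng.standard_normal((150, 30)), rng.standard_normal((150, 80))
y = rng.integers(0, 2, 150)

K = 0.6 * rbf_kernel(X_text, gamma=0.05) + 0.4 * rbf_kernel(X_image, gamma=0.01)
svm = SVC(kernel="precomputed").fit(K, y)
print("train accuracy:", svm.score(K, y))
```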
Graphical models are another family of popular methods
for multimodal fusion. In this section we overview work
done on multimodal fusion using shallow graphical models.
A description of deep graphical models such as deep belief
networks can be found in Section 3.1.
The majority of graphical models can be classified into two main categories: generative models, which model the joint probability, and discriminative models, which model the conditional probability [209].
Some of the earliest approaches to use graphical models for
multimodal fusion include generative models such as cou-
pled [155] and factorial hidden Markov models [70] along-
side dynamic Bayesian networks [67]. A more recently proposed multi-stream HMM method performs dynamic weighting of modalities for AVSR [78].
Arguably, generative models lost popularity to discrimi-
native ones such as conditional random fields (CRF) [120]
which sacrifice the modeling of joint probability for predic-
tive power. A CRF model was used to better segment
images by combining visual and textual information of
image description [63]. CRF models have been extended to
model latent states using hidden conditional random fields
[172] and have been applied to multimodal meeting seg-
mentation [180]. Other multimodal uses of latent variable
discriminative graphical models include multi-view hidden
CRF [202] and latent variable models [201]. More recently
Jiang et al. [97] have shown the benefits of multimodal hid-
den conditional random fields for the task of multimedia
classification. While most graphical models are aimed at
classification, CRF models have been extended to a continu-
ous version for regression [171] and applied in multimodal
settings [14] for audio visual emotion recognition.
The benefit of graphical models is their ability to easily
exploit spatial and temporal structure of the data, making
them especially popular for temporal modeling tasks, such
as AVSR and multimodal affect recognition. They also make it possible to build human expert knowledge into the models and often lead to interpretable models.
Neural networks have been used extensively for the task of
multimodal fusion [157]. The earliest examples of using
neural networks for multi-modal fusion come from work on
AVSR [170]. Nowadays they are being used to fuse informa-
tion for visual and media question answering [66], [135],
[237], gesture recognition [156], affect analysis [100], [159],
and video description generation [98], [221]. Both shallow
[66] and deep [159], [221] neural models have been explored
for multimodal fusion.
Neural networks have also been used for fusing temporal
multimodal information through the use of RNNs and
LSTMs. One of the earlier such applications used a bidirectional LSTM to perform audio-visual emotion classification [232]. More recently, Wöllmer et al. [231] used
LSTM models for continuous multimodal emotion recogni-
tion, demonstrating its advantage over graphical models
and SVMs. Similarly, Nicolaou et al. [158] used LSTMs for
continuous emotion prediction. Their proposed method
used an LSTM to fuse the results from modality-specific (audio and facial expression) LSTMs.
Approaching modality fusion through recurrent neural networks has also been explored in various image captioning tasks. Example models include neural image captioning [223], where a CNN image representation is decoded using an LSTM language model, and gLSTM [95], which incorporates the image data together with sentence decoding at every time step, fusing the visual and sentence data in a joint representation. A more recent example is the multi-view LSTM (MV-LSTM) model proposed by Rajagopalan et al. [173], which allows for flexible fusion of modalities in
the LSTM framework by explicitly modeling the modality-
specific and cross-modality interactions over time.
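The sketch below, loosely in the spirit of such hierarchical recurrent fusion, feeds modality-specific LSTMs into a higher-level fusion LSTM that produces a per-time-step prediction (assuming PyTorch); the architecture and dimensions are invented for illustration and do not reproduce any specific cited model.

```python
# Temporal fusion sketch: per-modality LSTMs whose hidden states are fused by
# a higher-level LSTM; sizes and the regression head are illustrative only.
import torch
import torch.nn as nn

AUDIO_DIM, FACE_DIM, HIDDEN = 40, 68, 64

class TemporalFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_lstm = nn.LSTM(AUDIO_DIM, HIDDEN, batch_first=True)
        self.face_lstm = nn.LSTM(FACE_DIM, HIDDEN, batch_first=True)
        self.fusion_lstm = nn.LSTM(2 * HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, 1)   # e.g., continuous emotion value

    def forward(self, audio, face):        # both: (batch, time, dim), same length
        ha, _ = self.audio_lstm(audio)
        hf, _ = self.face_lstm(face)
        fused, _ = self.fusion_lstm(torch.cat([ha, hf], dim=-1))
        return self.head(fused)            # per-time-step regression output

model = TemporalFusion()
out = model(torch.randn(2, 30, AUDIO_DIM), torch.randn(2, 30, FACE_DIM))
print(out.shape)   # (2, 30, 1)
```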
A big advantage of deep neural network approaches to data fusion is their capacity to learn from large amounts of data. Second, recent neural architectures allow for end-to-end training of both the multimodal representation component and the fusion component. Finally, they show good performance compared to non-neural-network-based systems and are able to learn complex decision boundaries that other approaches struggle with.
The major disadvantage of neural network approaches is
their lack of interpretability. It is difficult to tell what the
prediction relies on, and which modalities or features play
an important role. Furthermore, neural networks require
large training datasets to be successful.
6.3 Discussion
Multimodal fusion has been a widely researched topic with
a large number of approaches proposed to tackle it, includ-
ing model agnostic methods, graphical models, multiple
kernel learning, and various types of neural networks. Each
approach has its own strengths and weaknesses, with some
more suited for smaller datasets and others performing bet-
ter in noisy environments. Most recently, neural networks
have become a very popular way to tackle multimodal
fusion, however graphical models and multiple kernel
learning are still being used, especially in tasks with limited
training data or where model interpretability is important.
Despite these advances, multimodal fusion still faces the following challenges: 1) signals might not be temporally aligned (e.g., a dense continuous signal and a sparse
event); 2) it is difficult to build models that exploit supple-
mentary and not only complementary information; 3) each
modality might exhibit different types and different levels
of noise at different points in time.
7 CO-LEARNING
The final multimodal challenge in our taxonomy is co-
learning—aiding the modeling of a (resource poor) modal-
ity by exploiting knowledge from another (resource rich)
modality. It is particularly relevant when one of the modali-
ties has limited resources—lack of annotated data, noisy
input, and unreliable labels. We call this challenge co-learn-
ing as most often the helper modality is used only during
model training and is not used during test time. We identify
three types of co-learning approaches based on their training
resources: parallel, non-parallel, and hybrid. Parallel-data
approaches require training datasets where the observations
from one modality are directly linked to the observations
from other modalities. In other words, the multimodal observations are from the same instances, such as in an audio-visual speech dataset where the video and speech samples come from the same speaker. In contrast, non-parallel
data approaches do not require direct links between observa-
tions from different modalities. These approaches usually
achieve co-learning by using overlap in terms of categories.
For example, in zero shot learning, a conventional visual object recognition dataset can be expanded with a second text-only dataset from Wikipedia to improve the generalization of visual object recognition. In the hybrid data setting the
modalities are bridged through a shared modality or a data-
set. An overview of methods in co-learning can be seen in
Table 6 and a summary of data parallelism in Fig. 3.
7.1 Parallel Data
In parallel data co-learning both modalities share a set of
instances—audio recordings with the corresponding videos,
images and their sentence descriptions. This allows for two
types of algorithms to exploit that data to better model the
modalities: co-training and representation learning.
Co-training is the process of creating more labeled train-
ing samples when we have few labeled samples in a multi-
modal problem [22]. The basic algorithm builds weak
classifiers in each modality to bootstrap each other with
labels for the unlabeled data. In the seminal work of Blum and Mitchell [22], it was shown to discover more training samples for web-page classification based on the web-page itself and the hyper-links leading to it. By definition this task
requires parallel data as it relies on the overlap of multi-
modal samples.
Co-training has been used for statistical parsing [185], to build better visual detectors [125], and for audio-visual
speech recognition [42]. It has also been extended to deal
with disagreement between modalities, by filtering out
unreliable samples [43]. While co-training is a powerful
method for generating more labeled data, it can also lead to
biased training samples resulting in overfitting.
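A minimal co-training loop in this spirit is sketched below on synthetic data: two weak classifiers, one per view, add their most confident pseudo-labels to a shared labeled pool. The confidence threshold, classifiers, and the shared pool are simplifications of the original algorithm.

```python
# Simplified co-training sketch: each view's classifier donates confident
# pseudo-labels; data, threshold, and pooling scheme are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X1, X2 = rng.standard_normal((500, 10)), rng.standard_normal((500, 10))
y_true = (X1[:, 0] + X2[:, 0] > 0).astype(int)   # ground truth, mostly hidden

pseudo = np.full(500, -1)                        # -1 marks unlabeled instances
pseudo[:20] = y_true[:20]                        # only 20 labels to start with

for _ in range(10):
    idx = np.where(pseudo >= 0)[0]
    c1 = LogisticRegression().fit(X1[idx], pseudo[idx])
    c2 = LogisticRegression().fit(X2[idx], pseudo[idx])
    for i in np.where(pseudo < 0)[0]:
        p1 = c1.predict_proba(X1[[i]])[0]
        p2 = c2.predict_proba(X2[[i]])[0]
        if p1.max() > 0.95:                      # view 1 is confident
            pseudo[i] = int(p1.argmax())
        elif p2.max() > 0.95:                    # otherwise try view 2
            pseudo[i] = int(p2.argmax())

print("labeled after co-training:", int((pseudo >= 0).sum()), "of 500")
```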
Transfer learning is another way to exploit co-learning
with parallel data. Multimodal representation learning
(Section 3.1) approaches such as multimodal deep Boltzmann
machines [206] and multimodal autoencoders [157] transfer
information from the representation of one modality to that of another. This not only leads to multimodal representations, but also to better unimodal ones, with only one modality being used during test time [157].

TABLE 6
A Summary of Co-Learning Taxonomy, Based on Data Parallelism

DATA PARALLELISM                     TASK                    REFERENCE
Parallel        Co-training          Mixture                 [22], [115]
                Transfer learning    AVSR                    [157]
                                     Lip reading             [148]
Non-parallel    Transfer learning    Visual classification   [64]
                                     Action recognition      [134]
                Concept grounding    Metaphor class.         [188]
                                     Word similarity         [107]
                Zero shot learning   Image class.            [64], [198]
                                     Thought class.          [165]
Hybrid data     Bridging             MT and image ret.       [174]
                                     Transliteration         [154]

Parallel data—multiple modalities can see the same instance. Non-parallel data—unimodal instances are independent of each other. Hybrid data—the modalities are pivoted through a shared modality or dataset.

Fig. 3. Types of data parallelism used in co-learning: parallel—modalities are from the same dataset and there is a direct correspondence between instances; non-parallel—modalities are from different datasets and do not have overlapping instances, but overlap in general categories or concepts; hybrid—the instances or concepts are bridged by a third modality or a dataset.
Moon et al. [148] show how to transfer information from
a speech recognition neural network (based on audio) to a
lip-reading one (based on images), leading to a better visual
representation, and a model that can be used for lip-reading
without need for audio information during test time. Simi-
larly, Arora and Livescu [10] build better acoustic features
using CCA on acoustic and articulatory (location of lips,
tongue and jaw) data. They use articulatory data only dur-
ing CCA construction and use only the resulting acoustic
(unimodal) representation during test time.
7.2 Non-Parallel Data
Methods that rely on non-parallel data do not require the
modalities to have shared instances, but only shared catego-
ries or concepts. Non-parallel co-learning approaches can
help when learning representations, allow for better seman-
tic concept understanding and even perform unseen object
recognition.
Transfer learning is also possible on non-parallel data and makes it possible to learn better representations by transferring information from a representation built using a data-rich or clean modality to a data-scarce or noisy modality. This type of transfer learning is often achieved by using coordinated
multimodal representations (see Section 3.2). For example,
Frome et al. [64] used text to improve visual representations
for image classification by coordinating CNN visual fea-
tures with word2vec textual ones [146] trained on separate
large datasets. Visual representations trained in such a way
result in more meaningful errors—mistaking objects for
ones of similar category [64]. Mahasseni and Todorovic
[134] demonstrated how to regularize a color video based
LSTM using an autoencoder LSTM trained on 3D skeleton
data by enforcing similarities between their hidden states.
Such an approach is able to improve the original LSTM and
lead to state-of-the-art performance in action recognition.
Conceptual grounding refers to learning semantic mean-
ings or concepts not purely based on language but also on
additional modalities such as vision, sound, or even smell
[17]. While the majority of concept learning approaches
are purely language-based, representations of meaning in
humans are not merely a product of our linguistic exposure,
but are also grounded through our sensorimotor experience
and perceptual system [18], [131]. Human semantic knowl-
edge relies heavily on perceptual information [131] and
many concepts are grounded in the perceptual system and
are not purely symbolic [18]. This implies that learning
semantic meaning purely from textual information might
not be optimal, and motivates the use of visual or acoustic
cues to ground our linguistic representations.
Starting from work by Feng and Lapata [62], grounding
is usually performed by finding a common latent space
between the representations [62], [190] (in the case of parallel datasets) or by learning unimodal representations separately and then concatenating them to obtain a multimodal one [30], [105], [179], [188] (in the case of non-parallel data). Once a
multimodal representation is constructed it can be used on
purely linguistic tasks. Shutova et al. [188] and Bruni et al.
[30] used grounded representations for better classification
of metaphors and literal language. Such representations
have also been useful for measuring conceptual similarity
and relatedness—identifying how semantically or conceptu-
ally related two words are [31], [105], [190] or actions [179].
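A toy sketch of the concatenation strategy is given below: a word's textual embedding is extended with a visual vector when one is available, and similarity is then measured in the combined space. The embeddings are random stand-ins; real systems would use distributional word vectors and aggregated visual features.

```python
# Concatenation-style conceptual grounding sketch with random stand-in
# embeddings; abstract words without images simply get no visual component.
import numpy as np

rng = np.random.default_rng(0)
text_emb = {w: rng.standard_normal(300) for w in ["dog", "cat", "idea"]}
vis_emb = {w: rng.standard_normal(128) for w in ["dog", "cat"]}   # no image for "idea"

def grounded(word):
    visual = vis_emb.get(word, np.zeros(128))     # zero vector if not grounded
    return np.concatenate([text_emb[word], visual])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

print(cosine(grounded("dog"), grounded("cat")))   # multimodal word similarity
```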
Furthermore, concepts can be grounded not only using
visual signals, but also acoustic ones, leading to better perfor-
mance especially on words with auditory associations [107],
or even olfactory signals [106] for words with smell associa-
tions. Finally, there is a lot of overlap between multimodal
alignment and conceptual grounding, as aligning visual
scenes to their descriptions leads to better textual or visual
representations [113], [168], [179], [248].
Conceptual grounding has been found to be an effective
way to improve performance on a number of tasks. It also
shows that language and vision (or audio) are complemen-
tary sources of information and combining them in multi-
modal models often improves performance. However, one
has to be careful as grounding does not always lead to better
performance [106], [107], and only makes sense when
grounding has relevance for the task—such as grounding
using images for visually-related concepts.
Zero shot learning (ZSL) refers to recognizing a concept
without having explicitly seen any examples of it, for example classifying a cat in an image without ever having seen (labeled) images of cats. This is an important problem to address, as in a number of tasks, such as visual object classification, it is prohibitively expensive to provide training examples for every imaginable object of interest.
There are two main types of ZSL—unimodal and multi-
modal. The unimodal ZSL looks at component parts or
attributes of the object, such as phonemes to recognize an
unheard word or visual attributes such as color, size, and
shape to predict an unseen visual class [57]. The multimodal
ZSL recognizes the objects in the primary modality through
the help of the secondary one—in which the object has been
seen. The multimodal version of ZSL is by definition a non-parallel data problem, as the sets of seen classes differ between the modalities.
Socher et al. [198] map image features to a conceptual
word space and are able to classify seen and unseen con-
cepts. The unseen concepts can be then assigned to a word
that is close to the visual representation—this is enabled by
the semantic space being trained on a separate dataset that
has seen more concepts. Instead of learning a mapping from
visual to concept space Frome et al. [64] learn a coordinated
multimodal representation between concepts and images
that allows for ZSL. Palatucci et al. [165] predict the words people are thinking of from functional magnetic resonance images; they show how it is possible to predict unseen words through the use of an intermediate
semantic space. Lazaridou et al. [123] present a fast map-
ping method for ZSL by mapping extracted visual feature
vectors to text-based vectors through a neural network.
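The sketch below illustrates this family of approaches on synthetic data: a linear map from visual features to a word-embedding space is fitted on seen classes, and an unseen class is recognized by the nearest word vector. The data, the ridge-regression mapper, and the cosine decision rule are all illustrative choices rather than any specific cited method.

```python
# Multimodal zero-shot sketch: map visual features into a semantic word space
# and label unseen classes by nearest word vector. Everything here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
word_vecs = {c: rng.standard_normal(50) for c in ["dog", "cat", "horse", "zebra"]}
seen = ["dog", "cat", "horse"]                      # "zebra" has no training images

# Synthetic "images": visual features correlated with the class word vector.
X = np.vstack([word_vecs[c][:30] + 0.1 * rng.standard_normal(30)
               for c in seen for _ in range(50)])
Y = np.vstack([word_vecs[c] for c in seen for _ in range(50)])

mapper = Ridge(alpha=1.0).fit(X, Y)                 # visual -> semantic space

test_img = word_vecs["zebra"][:30] + 0.1 * rng.standard_normal(30)
pred = mapper.predict(test_img[None])[0]
best = max(word_vecs, key=lambda c: pred @ word_vecs[c] /
           (np.linalg.norm(pred) * np.linalg.norm(word_vecs[c])))
print("predicted class:", best)                     # ideally recovers "zebra"
```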
7.3 Hybrid Data
In the hybrid data setting two non-parallel modalities are
bridged by a shared modality or a dataset (see Fig. 3c). The
most notable example is the Bridge Correlational Neural Net-
work [174], which uses a pivot modality to learn coordinated
multimodal representations in the presence of non-parallel data.
For example, for multilingual image captioning, the image
modality would be paired with at least one caption in any
language. Such methods have also been used to bridge lan-
guages that might not have parallel corpora but have access
to a shared pivot language, such as for machine translation
[154], [174] and document transliteration [104].
Instead of using a separate modality for bridging, some methods rely on the existence of large datasets from a similar or related task to improve performance in a task that only contains limited annotated data. Socher and Fei-Fei [197] use the existence of large text corpora to guide image segmentation, while Hendricks et al. [81] use a separately trained visual model and a language model to build a better image and video description system, for which only limited data is available.
7.4 Discussion
Multimodal co-learning allows for one modality to influ-
ence the training of another, exploiting the complementary
information across modalities. It is important to note that
co-learning is task independent and could be used to create
better fusion, translation, and alignment models. This chal-
lenge is exemplified by algorithms such as co-training, mul-
timodal representation learning, conceptual grounding, and
zero shot learning (ZSL) and has found many applications
in visual classification, action recognition, audio-visual
speech recognition, and semantic similarity estimation.
8 CONCLUSION
Multimodal machine learning is a vibrant multi-disciplinary
field which aims to build models that can process and relate
information from multiple modalities. This paper surveyed
recent advances in multimodal machine learning and pre-
sented them in a common taxonomy built upon five technical
challenges faced by multimodal researchers: representation,
translation, alignment, fusion, and co-learning. For each chal-
lenge, we presented a taxonomic sub-classification that helps to understand the breadth of current multimodal research.
Although the focus of this survey paper was primarily on the
last decade of multimodal research, it is important to address
future challenges with a knowledge of past achievements.
Moving forward, the proposed taxonomy gives research-
ers a framework to understand current research and identify
understudied challenges for future research. We summarized
each technical challenge with a discussion of future directions
and research problems (see Sections 3.3, 4.3, 5.3, 6.3 and 7.4).
We believe that all these aspects of multimodal research are
needed if we want to build computers able to perceive, model
and generate multimodal signals. One specific area of multi-
modal machine learning which seems to be under-studied is
co-learning, where knowledge from one modality helps with
modeling in another modality. This challenge is related to the
concept of coordinated representations where each modality
keeps its own representation but finds a way to exchange
and coordinate knowledge. We see these lines of research as
promising directions for future research.