Challenges of multimodality learning
To make use of multimodal information, we face several core challenges: representation, translation, alignment, fusion, and co-learning. These categories are not mutually exclusive. In this section, we briefly discuss each of them.
Representation
Representation refers to a computer-interpretable description of multimodal data (for example, a vector or a tensor). It covers, but is not limited to, the following questions:
- How to handle different symbols and signals. For example, in machine translation, Chinese characters and English letters belong to two distinct linguistic systems; in a self-driving system, point clouds from LIDAR sensors and image pixels from the RGB camera are two distinct sources with distinct characteristics
- How to handle different granularities
- How to handle modalities that can be either static or sequential
- How to handle different noise distributions
- How to handle unbalanced proportions.
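To make the points above concrete, here is a minimal sketch of how three modalities might be described as tensors of different shapes. Everything in it is an illustrative assumption rather than a prescribed method: the sizes are toy values, the `shape` helper and the four-word vocabulary are hypothetical, and plain nested lists stand in for the arrays a tensor library would normally provide.

```python
import random


def shape(x):
    """Return the shape of a regularly nested, non-empty list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


random.seed(0)  # fixed seed so the toy data is reproducible

# Static, grid-structured modality: a tiny 4x4 RGB image,
# with one [r, g, b] triple per pixel.
image = [[[random.randint(0, 255) for _ in range(3)]
          for _ in range(4)]
         for _ in range(4)]

# Static, unordered modality: 5 LIDAR points, each an [x, y, z] coordinate.
point_cloud = [[random.uniform(-10.0, 10.0) for _ in range(3)]
               for _ in range(5)]

# Sequential modality: a sentence encoded as integer token ids
# (the vocabulary is hypothetical).
vocab = {"a": 0, "car": 1, "turns": 2, "left": 3}
tokens = [vocab[word] for word in "a car turns left".split()]

print(shape(image))        # (4, 4, 3)
print(shape(point_cloud))  # (5, 3)
print(shape(tokens))       # (4,)
```

The three shapes differ in rank and in meaning: the image is a dense grid of pixels, the point cloud is a set of coordinates whose order carries no information, and the token list is an ordered sequence. A multimodal model has to reconcile these differences before the modalities can be combined.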
Translation
Translation/mapping refers to the process of changing data from one modality to another, for example, image captioning...