This repository accompanies our position paper "Multimodality for NLP-Centered Applications: Resources, Advances and Frontiers".
As part of this release, we share information about recent multimodal datasets that are available for research purposes.
We found that although 100+ multimodal language resources are reported in the literature for various NLP tasks, publicly available multimodal datasets remain under-explored for reuse in subsequent problem domains.
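To make the tables below easier to reuse programmatically, here is a minimal sketch that flattens this README into a CSV index. It is a hypothetical helper, not part of the paper's release: the file names (`README.md`, `multimodal_datasets.csv`) are illustrative, and it assumes every table keeps the four-column layout used in this document.

```python
# Hypothetical helper: flatten the four-column markdown tables in this
# README into one CSV row per dataset, tagged with its task heading.
import csv
import re

def parse_readme_tables(readme_path: str, out_csv: str) -> None:
    """Collect every data row of the dataset tables under its task heading."""
    task, rows = None, []
    with open(readme_path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if line.startswith("- "):  # task headings, e.g. "- Sentiment Analysis"
                task = line[2:].strip()
            elif line.startswith("|") and not re.fullmatch(r"\|[\s\-|]+\|", line):
                cells = [c.strip() for c in line.strip("|").split("|")]
                if len(cells) == 4 and cells[0] != "Dataset":  # skip header rows
                    rows.append([task] + cells)
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Task", "Dataset", "Paper Title", "Paper Link", "Dataset Link"])
        writer.writerows(rows)

if __name__ == "__main__":
    parse_readme_tables("README.md", "multimodal_datasets.csv")
```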
- Sentiment Analysis
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
|---|---|---|---|
| EmoDB | A Database of German Emotional Speech | Paper | Dataset |
| VAM | The Vera am Mittag German Audio-Visual Emotional Speech Database | Paper | Dataset |
| IEMOCAP | IEMOCAP: interactive emotional dyadic motion capture database | Paper | Dataset |
| Mimicry | A Multimodal Database for Mimicry Analysis | Paper | Dataset |
| YouTube | Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web | Paper | Dataset |
| HUMAINE | The HUMAINE database | Paper | Dataset |
| Large Movies | Sentiment classification on Large Movie Review | Paper | Dataset |
| SEMAINE | The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent | Paper | Dataset |
| AFEW | Collecting Large, Richly Annotated Facial-Expression Databases from Movies | Paper | Dataset |
| SST | Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank | Paper | Dataset |
| ICT-MMMO | YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context | Paper | Dataset |
| RECOLA | Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions | Paper | Dataset |
| MOUD | Utterance-Level Multimodal Sentiment Analysis | Paper | - |
| CMU-MOSI | MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos | Paper | Dataset |
| POM | Multimodal Analysis and Prediction of Persuasiveness in Online Social Multimedia | Paper | Dataset |
| MELD | MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations | Paper | Dataset |
| CMU-MOSEI | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | Paper | Dataset |
| AMMER | Towards Multimodal Emotion Recognition in German Speech Events in Cars using Transfer Learning | Paper | On Request |
| SEWA | SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild | Paper | Dataset |
| Fakeddit | r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection | Paper | Dataset |
| CMU-MOSEAS | CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French | Paper | Dataset |
| MultiOFF | Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text | Paper | Dataset |
| MEISD | MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations | Paper | Dataset |
| TASS | Overview of TASS 2020: Introducing Emotion | Paper | Dataset |
| CH-SIMS | CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality | Paper | Dataset |
| Creep-Image | A Multimodal Dataset of Images and Text | Paper | Dataset |
| Entheos | Entheos: A Multimodal Dataset for Studying Enthusiasm | Paper | Dataset |
- Machine Translation
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
|---|---|---|---|
| Multi30K | Multi30K: Multilingual English-German Image Descriptions | Paper | Dataset |
| How2 | How2: A Large-scale Dataset for Multimodal Language Understanding | Paper | Dataset |
| MLT | Multimodal Lexical Translation | Paper | Dataset |
| IKEA | A Visual Attention Grounding Neural Model for Multimodal Machine Translation | Paper | Dataset |
| Flickr30K (EN-HI) | Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data | Paper | On Request |
| Hindi Visual Genome | Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation | Paper | Dataset |
| HowTo100M | Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models | Paper | Dataset |
- Information Retrieval
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
|---|---|---|---|
| MusiCLEF | MusiCLEF: A Benchmark Activity in Multimodal Music Information Retrieval | Paper | Dataset |
| Moodo | The Moodo dataset: Integrating user context with emotional and color perception of music for affective music information retrieval | Paper | Dataset |
| ALF-200k | ALF-200k: Towards Extensive Multimodal Analyses of Music Tracks and Playlists | Paper | Dataset |
| MQA | Can Image Captioning Help Passage Retrieval in Multimodal Question Answering? | Paper | Dataset |
| WAT2019 | WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset | Paper | Dataset |
| ViTT | Multimodal Pretraining for Dense Video Captioning | Paper | Dataset |
| MTD | MTD: A Multimodal Dataset of Musical Themes for MIR Research | Paper | Dataset |
| MusiClef | A professionally annotated and enriched multimodal data set on popular music | Paper | Dataset |
| Schubert Winterreise | Schubert Winterreise dataset: A multimodal scenario for music analysis | Paper | Dataset |
| WIT | WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Paper | Dataset |
- Question Answering
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
|---|---|---|---|
| MQA | A Dataset for Multimodal Question Answering in the Cultural Heritage Domain | Paper | - |
| MovieQA | MovieQA: Understanding Stories in Movies through Question-Answering | Paper | Dataset |
| PororoQA | DeepStory: Video Story QA by Deep Embedded Memory Networks | Paper | Dataset |
| MemexQA | MemexQA: Visual Memex Question Answering | Paper | Dataset |
| VQA | Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering | Paper | Dataset |
| TDIUC | An analysis of visual question answering algorithms | Paper | Dataset |
| TGIF-QA | TGIF-QA: Toward spatio-temporal reasoning in visual question answering | Paper | Dataset |
| MSVD QA, MSRVTT QA | Video question answering via attribute augmented attention network learning | Paper | Dataset |
| YouTube2Text | Video Question Answering via Gradually Refined Attention over Appearance and Motion | Paper | Dataset |
| MovieFIB | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | Paper | Dataset |
| Video Context QA | Uncovering the temporal context for video question answering | Paper | Dataset |
| MarioQA | MarioQA: Answering Questions by Watching Gameplay Videos | Paper | Dataset |
| TVQA | TVQA: Localized, Compositional Video Question Answering | Paper | Dataset |
| VQA-CP v2 | Don’t just assume; look and answer: Overcoming priors for visual question answering | Paper | Dataset |
| RecipeQA | RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes | Paper | Dataset |
| GQA | GQA: A new dataset for real-world visual reasoning and compositional question answering | Paper | Dataset |
| Social IQ | Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Paper | Dataset |
| MIMOQA | MIMOQA: Multimodal Input Multimodal Output Question Answering | Paper | - |
- Summarization
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
|---|---|---|---|
| SumMe | Creating Summaries from User Videos | Paper | Dataset |
| TVSum | TVSum: Summarizing Web Videos Using Titles | Paper | Dataset |
| QFVS | Query-focused video summarization: Dataset, evaluation, and a memory network based approach | Paper | Dataset |
| MMSS | Multi-modal Sentence Summarization with Modality Attention and Image Filtering | Paper | - |
| MSMO | MSMO: Multimodal Summarization with Multimodal Output | Paper | - |
| Screen2Words | Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning | Paper | Dataset |
| AVIATE | See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization | Paper | Dataset |
| Multimodal Microblog Summarization | On Multimodal Microblog Summarization | Paper | - |
- Human Computer Interaction
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
|---|---|---|---|
| CUAVE | CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interface Research | Paper | Dataset |
| MHAD | Berkeley MHAD: A Comprehensive Multimodal Human Action Database | Paper | Dataset |
| Multi-party interactions | A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction | Paper | - |
| MHHRI | Multimodal Human-Human-Robot Interactions (MHHRI) Dataset for Studying Personality and Engagement | Paper | Dataset |
| Red Hen Lab | Red Hen Lab: Dataset and Tools for Multimodal Human Communication Research | Paper | - |
| EMRE | Generating a Novel Dataset of Multimodal Referring Expressions | Paper | Dataset |
| Chinese Whispers | Chinese whispers: A multimodal dataset for embodied language grounding | Paper | Dataset |
| uulmMAC | The uulmMAC database—A multimodal affective corpus for affective computing in human-computer interaction | Paper | Dataset |
- Semantic Analysis
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
|---|---|---|---|
| WN9-IMG | Image-embodied Knowledge Representation Learning | Paper | Dataset |
| Wikimedia Commons | A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions | Paper | Dataset |
| Starsem18-multimodalKB | A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning | Paper | Dataset |
| MUStARD | Towards Multimodal Sarcasm Detection | Paper | Dataset |
| YouMakeup | YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension | Paper | Dataset |
| MDID | Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts | Paper | Dataset |
| Social media posts from Flickr (Mental Health) | Inferring Social Media Users’ Mental Health Status from Multimodal Information | Paper | Dataset |
| Twitter MEL | Building a Multimodal Entity Linking Dataset From Tweets | Paper | Dataset |
| MultiMET | MultiMET: A Multimodal Dataset for Metaphor Understanding | Paper | - |
| MSDS | Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline | Paper | Dataset |
- Miscellaneous
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
|---|---|---|---|
| MS COCO | Microsoft COCO: Common objects in context | Paper | Dataset |
| ILSVRC | ImageNet Large Scale Visual Recognition Challenge | Paper | Dataset |
| YFCC100M | YFCC100M: The new data in multimedia research | Paper | Dataset |
| COGNIMUSE | COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization | Paper | Dataset |
| SNAG | SNAG: Spoken Narratives and Gaze Dataset | Paper | Dataset |
| UR-Funny | UR-FUNNY: A Multimodal Language Dataset for Understanding Humor | Paper | Dataset |
| Bag-of-Lies | Bag-of-Lies: A Multimodal Dataset for Deception Detection | Paper | Dataset |
| MARC | A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks | Paper | Dataset |
| MuSE | MuSE: a Multimodal Dataset of Stressed Emotion | Paper | Dataset |
| BabelPic | Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts | Paper | Dataset |
| Eye4Ref | Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations | Paper | - |
| Troll Memes | A Dataset for Troll Classification of TamilMemes | Paper | Dataset |
| SEMD | EmoSen: Generating sentiment and emotion controlled responses in a multimodal dialogue system | Paper | - |
| Chat-talk Corpus | Construction and Analysis of a Multimodal Chat-talk Corpus for Dialog Systems Considering Interpersonal Closeness | Paper | - |
| EMOTyDA | Towards Emotion-aided Multi-modal Dialogue Act Classification | Paper | Dataset |
| MELINDA | MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification | Paper | Dataset |
| NewsCLIPpings | NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media | Paper | Dataset |
| R2VQ | Designing Multimodal Datasets for NLP Challenges | Paper | Dataset |
| M2H2 | M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations | Paper | Dataset |