Linguistic Knowledge for Visual Recognition and Natural Language Descriptions of Visual Content
Marcus Rohrbach
ICSI
Tuesday, June 17
12:30 p.m., Conference Room 5A
Extensive efforts are being made to improve visual recognition and semantic understanding of language. However, surprisingly little has been done to exploit the mutual benefits of combining the two fields. In my PhD work we showed how these fields of research can profit from each other.
First, we scale recognition to 200 unseen object classes and show how to extract robust semantic relatedness from linguistic resources. Our novel approach extends zero-shot to few-shot recognition and exploits unlabeled data by adopting label propagation for transfer learning.
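The abstract does not spell out the transfer-learning setup, but the following is a minimal sketch of the kind of graph-based label propagation it refers to (in the style of Zhou et al.'s label spreading), assuming RBF affinities over image features. All names, parameters, and data here are illustrative stand-ins, not the speaker's actual pipeline.

```python
import numpy as np

def propagate_labels(X, y, n_classes, alpha=0.99, sigma=1.0, n_iter=50):
    """Graph-based label propagation (label-spreading style sketch).

    X: (n, d) features for labeled + unlabeled points.
    y: (n,) integer labels, with -1 marking unlabeled points.
    Returns predicted labels for all n points.
    """
    n = X.shape[0]
    # RBF affinity matrix with zeroed diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1) + 1e-12)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # One-hot seed matrix; rows for unlabeled points stay zero.
    Y = np.zeros((n, n_classes))
    labeled = y >= 0
    Y[np.arange(n)[labeled], y[labeled]] = 1.0
    # Iterate F <- alpha * S @ F + (1 - alpha) * Y to spread labels
    # from the few labeled seeds over the unlabeled data.
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(1)

# Toy usage: two Gaussian blobs, one labeled seed per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
y = -np.ones(40, dtype=int)
y[0], y[20] = 0, 1
print(propagate_labels(X, y, n_classes=2))
```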
Second, we address the high variability but low availability of composite activity videos by extracting the essential information from text descriptions. For this we recorded and annotated a corpus for fine-grained activity recognition. We show improvements in the supervised case, but we are also able to recognize unseen composite activities.
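As a rough illustration of recognizing an unseen composite activity from its parts, one can score a composite as an aggregate of fine-grained attribute classifier scores, with the attribute lists standing in for the information mined from text descriptions. This is a toy sketch under those assumptions; the attribute and activity names are made up for illustration.

```python
import numpy as np

# Hypothetical fine-grained attributes (verbs/objects) scored per video.
ATTRIBUTES = ["cut", "peel", "stir", "wash", "cucumber", "bowl", "knife"]

# Which attributes make up each composite activity; in the talk's setting
# such associations come from text descriptions, here they are hand-written.
COMPOSITES = {
    "preparing a salad": {"cut", "wash", "cucumber", "bowl", "knife"},
    "making a dough":    {"stir", "bowl"},
}

def score_composites(attr_scores):
    """Score each (possibly unseen) composite activity as the mean score
    of its associated fine-grained attributes."""
    idx = {a: i for i, a in enumerate(ATTRIBUTES)}
    return {
        name: float(np.mean([attr_scores[idx[a]] for a in attrs if a in idx]))
        for name, attrs in COMPOSITES.items()
    }

# Toy attribute scores for one video (e.g., classifier probabilities).
scores = np.array([0.9, 0.1, 0.2, 0.8, 0.7, 0.6, 0.5])
print(max(score_composites(scores).items(), key=lambda kv: kv[1]))
```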
Third, we present a corpus of videos with aligned descriptions. We use it for grounding activity descriptions and for learning to automatically generate natural language descriptions of a video. We show that our proposed approach is also applicable to image description and that it outperforms baselines and related work.
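The abstract does not specify the generation model. As a deliberately simple illustration of going from recognized video content to a sentence, here is a template baseline that verbalizes a (subject, verb, object) triple predicted for each clip; the actual work learns this mapping from the aligned video-description corpus rather than using fixed templates, and all names below are illustrative.

```python
# Toy template-based verbalization of per-clip (subject, verb, object) triples.
THIRD_PERSON = {"cut": "cuts", "wash": "washes", "take": "takes", "stir": "stirs"}

def describe(subject, verb, obj):
    """Render one clip-level (subject, verb, object) triple as a sentence."""
    v = THIRD_PERSON.get(verb, verb + "s")
    return f"The {subject} {v} the {obj}."

# E.g., for predictions from three consecutive clips of a cooking video:
clips = [("person", "take", "knife"),
         ("person", "wash", "cucumber"),
         ("person", "cut", "cucumber")]
print(" ".join(describe(*c) for c in clips))
```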
In the talk I will present the highlights of the first two parts and the third part in more detail. I will spend the last third of the talk on plans for future work and my time at ICSI, in the hope of getting feedback on our ideas.