Computer vision and natural language processing are two key branches of artificial intelligence. Since the goal of computer vision has always been the automatic extraction, analysis, and understanding of useful information from a single image or a sequence of images, it is natural for vision and language to come together to enable high-level computer vision tasks. Conversely, information extracted from images and videos can facilitate natural language processing tasks. Recent advances in machine learning and deep learning are facilitating joint reasoning about images and text. In this talk, we will review a recently active area of research at the intersection of vision and language, including Part I (video-language alignment, image and video captioning, image retrieval using complex text queries, visual question answering) and Part II (language grounding in images and videos, image generation from textual descriptions, as well as multimodal machine translation and vision-aided grammar induction), and discuss trends in this area.
Professor Luo Jiebo
University of Rochester
Time: 21 Sep 2022, Wednesday; 1:00pm - 2:30pm
Venue: LT 7, Nanyang Technological University