Speaker: Pratyush More
Title: Image Captioning Through Simplified Adversarial Learning
Abstract: Automated image captioning can lead to numerous benefits, such as helping the visually impaired. To a computer, though, an image is simply a matrix of numbers. Transforming these seemingly unconnected numbers into a natural language description of the visual scene is neither trivial nor straightforward. In recent years, machine learning approaches have achieved remarkable success at this task. Such techniques usually involve a convolutional neural network encoder, which captures image information, followed by a recurrent neural network decoder, which translates this information into natural language. However, these methods often produce captions that are dull, trite, and lacking in detail. Newer research has tried using adversarial training frameworks to overcome this drawback, as they are touted to be better at modeling human-like output. Image captioning, though, has a complicated set of requirements, and this complexity can pose challenges to training adversarial models. We propose a simplified adversarial training framework, which breaks up the complex requirements of image captioning into simpler subparts, each with its own distinct loss function. We benchmark our architecture on the MSCOCO dataset and show that it produces captions that are vivid, versatile, rich, and more human-like.
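To make the encoder-decoder pipeline mentioned in the abstract concrete, here is a minimal toy sketch in NumPy. Everything in it (the tiny vocabulary, the matrix shapes, and the use of a plain linear projection in place of a real convolutional encoder and a single tanh cell in place of a trained recurrent decoder) is an illustrative assumption, not the speaker's actual model or the architecture evaluated on MSCOCO.

```python
import numpy as np

# Toy sketch of an encoder-decoder captioning pipeline.
# All names, sizes, and weights are illustrative placeholders.
rng = np.random.default_rng(0)

VOCAB = ["<start>", "a", "dog", "runs", "<end>"]
FEAT_DIM, HID_DIM = 8, 6

# "CNN encoder": stand-in linear projection of a flattened image to a
# feature vector (a real system would use convolutional layers).
W_enc = rng.standard_normal((FEAT_DIM, 16))

# "RNN decoder": one recurrent cell mapping the previous hidden state
# and previous word to a new hidden state and word scores.
W_h = rng.standard_normal((HID_DIM, HID_DIM))
W_x = rng.standard_normal((HID_DIM, len(VOCAB)))
W_f = rng.standard_normal((HID_DIM, FEAT_DIM))
W_out = rng.standard_normal((len(VOCAB), HID_DIM))

def encode(image_flat):
    # Image -> fixed-length feature vector.
    return W_enc @ image_flat

def decode(features, max_len=5):
    # Initialize the hidden state from the image features,
    # then greedily emit one word per step.
    h = np.tanh(W_f @ features)
    token = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        x = np.eye(len(VOCAB))[token]      # one-hot previous word
        h = np.tanh(W_h @ h + W_x @ x)
        token = int(np.argmax(W_out @ h))  # greedy word choice
        if VOCAB[token] == "<end>":
            break
        caption.append(VOCAB[token])
    return caption

image = rng.standard_normal(16)  # stand-in for a real image
caption = decode(encode(image))
print(caption)
```

With random weights the output is of course gibberish drawn from the toy vocabulary; training replaces these weights so the decoder's word choices match human captions, and the adversarial framework in the talk adds discriminator-driven losses on top of this basic generator.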