This paper explores pre-training models to learn state-of-the-art image representations from images paired with natural language captions.