Image captioning approaches
There are several approaches to captioning images. Early methods constructed a sentence from the objects and attributes detected in the image. Later, recurrent neural networks (RNNs) were used to generate sentences. Among these, the most accurate methods use an attention mechanism. This section explores these techniques and their results in detail.
Conditional random field
An early method used a conditional random field (CRF) to construct the sentence from the objects and attributes detected in the image. The steps involved in this process are shown in the following figure:

System flow for an example image (Source: http://www.tamaraberg.com/papers/generation_cvpr11.pdf)
The CRF-based approach has limited ability to produce coherent sentences. The quality of the generated sentences is poor, as shown in the following screenshot:

The sentences shown here are rigidly structured, even though the objects and attributes are correct.
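To see why sentences assembled from detected labels tend to read as rigid, here is a minimal sketch of template filling in Python. It does not implement the CRF or the method from the paper: it assumes the objects, attributes, and a spatial preposition have already been predicted, and the Detection class, the caption_from_labels function, and the sample labels are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object with a single attribute (hypothetical structure)."""
    attribute: str
    noun: str

    def phrase(self) -> str:
        # Every object is rendered the same way: "the <attribute> <noun>".
        return f"the {self.attribute} {self.noun}"


def caption_from_labels(first: Detection, preposition: str, second: Detection) -> str:
    """Fill a fixed template with already-predicted labels.

    Assumption: the labels come from some upstream model; no inference
    of any kind is performed here.
    """
    sentence = f"{first.phrase()} is {preposition} {second.phrase()}."
    return sentence[0].upper() + sentence[1:]


# Made-up labels standing in for the output of a detector.
labels = [
    (Detection("brown", "dog"), "near", Detection("green", "sofa")),
    (Detection("white", "plate"), "on", Detection("wooden", "table")),
]

captions = [caption_from_labels(a, prep, b) for a, prep, b in labels]
for caption in captions:
    print(caption)
```

Every caption produced this way shares the same skeleton, so the output stays formulaic however accurate the labels are.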
Note
Kulkarni et al., in the paper http://www...