- Most recent work also adapts recurrent neural networks (RNNs), using rich deep CNN features to generate image captions. However, the applications of previous studies were limited to natural image caption datasets such as Flickr8k, Flickr30k, or MSCOCO, which can be generalized from ImageNet.
- A common challenge in medical image analysis is data bias. Across the whole population, diseased cases are much rarer than healthy cases, and the same imbalance appears in the chest x-ray dataset used.
- employ a publicly available radiology dataset of chest x-rays and their reports, and use its image annotations to mine disease names to train convolutional neural networks (CNNs).
- adopt various regularization techniques to circumvent the large normal-vs-diseased cases bias.
- introduce a novel approach to use the weights of the already trained pair of CNN/RNN on the domain-specific image/text dataset, to infer the joint image/text contexts for composite image labeling.
- the first study mining from a radiology image and report dataset, not only to classify and detect images but also to describe their context.
- To train CNNs with chest x-ray images, we sample frequent annotation patterns with few overlaps, use them to assign an image label to each chest x-ray image, and train with a cross-entropy criterion.
- We find 17 unique patterns of MeSH term combinations appearing in 30 or more cases.
- We adopt various regularization techniques to deal with the normal-vs-diseased cases bias.
- Among the 17 chosen annotation patterns, normal cases account for 71% of all images, well above the number of cases for any of the remaining 16 disease annotation patterns. We balance the number of samples for each case by augmenting the training images of the smaller cases, randomly cropping 224 × 224 images from the original 256 × 256 image.
- Inspired by this and by the concept of Dropout, we regularize the normal-vs-diseased cases bias via randomly dropping out an excessive proportion of normal cases compared to the frequent diseased pattern when sampling mini-batches.
- We also validate whether the dataset can benefit from a more complex GoogLeNet, which is arguably the current state-of-the-art CNN architecture.
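
The two balancing steps noted above (random 224 × 224 crops from 256 × 256 images, and dropping normal cases while sampling mini-batches) can be sketched roughly as follows. This is a minimal illustration, not the original training code: the function names, the `keep_normal` probability, and the list-of-rows image format are all assumptions made for the example.

```python
import random

def random_crop(image, crop=224):
    """Return a random crop x crop window from a 2-D image given as a list of rows."""
    height, width = len(image), len(image[0])
    top = random.randint(0, height - crop)    # randint is inclusive at both ends
    left = random.randint(0, width - crop)
    return [row[left:left + crop] for row in image[top:top + crop]]

def sample_minibatch(samples, batch_size, normal_label="normal", keep_normal=0.3):
    """Draw a mini-batch, skipping each drawn normal case with probability 1 - keep_normal.

    keep_normal is a hypothetical knob; the notes do not give the actual rate.
    It must be > 0 if the data contains only normal cases, or the loop never ends.
    """
    batch = []
    while len(batch) < batch_size:
        image, label = random.choice(samples)
        if label == normal_label and random.random() > keep_normal:
            continue  # drop this normal case to reduce its share of the batch
        batch.append((random_crop(image), label))
    return batch
```

Because every accepted sample is cropped at a fresh random offset, rare disease patterns drawn repeatedly still yield slightly different training images.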
We use recurrent neural networks (RNNs) to learn the annotation sequence given input image CNN embeddings.
- We set the initial state of RNNs as the CNN image embedding (CNN(I)), and the first annotation word as the initial input.
- We therefore use the output of the last spatial average-pooling layer as the image embedding to initialize the RNN state vectors. Our RNNs' state vectors are of size 1 × 1024 (R^{1×1024}), which is identical to the output size of the average-pooling layers from NIN and GoogLeNet.
- We again initialize the RNN state vectors with the CNN image embedding (h_{t=1} = CNN(I)). We then feed the CNN prediction for the input image to the RNN as the first word, and sample the following sequence of up to five words.
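
The generation step described in the notes above can be sketched as a greedy decoding loop. This is only an illustration under stated assumptions: a plain tanh recurrence stands in for the actual recurrent unit, the weights are random toy values, the `<end>` token and all names (`rnn_step`, `generate_annotation`, `params`) are hypothetical, and the five-word cap is assumed to include the first word.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_step(h, x, W_h, W_x):
    """One step of a plain tanh RNN: h' = tanh(W_h h + W_x x)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_h, h), matvec(W_x, x))]

def generate_annotation(image_embedding, first_word, params, vocab, max_words=5):
    """Greedy decoding starting from h = CNN(I) and the CNN-predicted first word."""
    W_h, W_x, W_out, embed = params
    h = list(image_embedding)          # h_{t=1} = CNN(I)
    word = first_word
    words = [first_word]
    while len(words) < max_words:
        h = rnn_step(h, embed[word], W_h, W_x)
        scores = matvec(W_out, h)      # one score per vocabulary word
        word = vocab[scores.index(max(scores))]
        if word == "<end>":            # assumed end-of-sequence token
            break
        words.append(word)
    return words

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy usage with random weights; the notes use 1024-dimensional states.
random.seed(0)
vocab = ["cardiomegaly", "mild", "severe", "lung", "<end>"]
dim = 4
params = (rand_matrix(dim, dim), rand_matrix(dim, dim), rand_matrix(len(vocab), dim),
          {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab})
print(generate_annotation([0.1] * dim, "cardiomegaly", params, vocab))
```

With untrained weights the printed words are meaningless; the point is only the control flow: the image embedding seeds the state, and the CNN label seeds the input.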
We therefore use the already trained CNN and RNN to infer better image labels, integrating the contexts of the image annotations beyond just the name of the disease. This is achieved by generating joint image/text context vectors that are computed by applying mean-pooling on the state vectors (h) of RNN at each step over the annotation sequence.
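
As a small illustration of the mean-pooling step, the sketch below averages per-step state vectors element-wise. The name `mean_pool` and the two-dimensional toy states are assumptions for the example; the real state vectors have 1024 entries.

```python
def mean_pool(states):
    """Average a list of equal-length state vectors element-wise."""
    steps = len(states)
    return [sum(values) / steps for values in zip(*states)]

# Three RNN steps with toy 2-dimensional states.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_im_text = mean_pool(states)   # [3.0, 4.0]
```

The resulting vector has the same size as a single state, regardless of how many words the annotation sequence contains.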
For a disease label having 170 or more cases (n ≥ 170, where 170 = average + standard deviation), we divide the cases into sub-groups of more than 50 cases by applying k-means clustering to the h_im:text vectors with k = Round(n/50). We train the CNN once more with the additional labels (57, compared to 17 in Section 5), train the RNN with the new CNN image embedding, and finally generate image annotations.
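
The sub-grouping rule can be sketched as below. This is a hedged, stdlib-only illustration: the k-means is a plain Lloyd's iteration with random initial centers, `Round` is assumed to mean rounding half up, and the names `num_subgroups`, `kmeans`, and `split_label` and the `label/index` naming scheme are hypothetical.

```python
import random

def num_subgroups(n, group_size=50):
    """k = Round(n / 50); rounding half up is an assumption."""
    return int(n / group_size + 0.5)

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest center (squared Euclidean distance).
        for i, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            assignment[i] = dists.index(min(dists))
        # Move each center to the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def split_label(label, vectors, threshold=170):
    """Return one sub-label per case, splitting only labels with n >= threshold."""
    n = len(vectors)
    if n < threshold:
        return [label] * n
    k = num_subgroups(n)
    return [f"{label}/{c}" for c in kmeans(vectors, k)]
```

Note that k-means by itself does not guarantee any minimum cluster size, so this sketch does not enforce the "more than 50 cases" property; how the original work handles small clusters is not stated in these notes.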
- The final evaluated BLEU scores are provided in Table 5.
OpenI (3,955 radiology reports and 7,470 chest x-rays)
Each report is structured as comparison, indication, findings, and impression sections.