Transfer learning is an approach that applies CNNs pretrained on large annotated image databases from different domains, such as ImageNet, to various tasks. With this approach, the original network architecture is retained, and the pretrained weights are used to initialize the network. The initialized weights are then updated during a subsequent fine-tuning stage, enabling the network to learn features specific to the target task. A large number of studies have demonstrated that fine-tuning is effective for a variety of classification tasks in the medical domain. For example, a recent study showed that a Google Inception-V3 network pretrained on ImageNet and fine-tuned on images of skin lesions achieved very high accuracy for skin cancer classification, comparable to that of dermatologists.
In this paper, we use a well-known CNN architecture, Google's Inception-V3, which is pretrained to 93.33% top-five accuracy on the 1000 object classes (1.28 million images) of the 2014 ImageNet Challenge. We then fine-tune Inception-V3 to learn domain- and modality-specific features for classifying breast pathological images. Such a fine-tuning approach is easier to optimize and enables the training of deeper networks, which in turn improves overall network capacity and performance.
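The fine-tuning idea above can be illustrated with a minimal numpy sketch: the weights are initialized from pretrained values rather than at random, and all of them are then updated by gradient descent on the target-task loss. The linear "network", data, learning rate, and step count here are illustrative assumptions, not the paper's actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))          # target-task inputs (toy data)
y = X @ np.ones(10)                    # target-task labels

w = rng.normal(size=10)                # stand-in for pretrained weights
for _ in range(300):                   # fine-tuning: keep updating the
    grad = X.T @ (X @ w - y) / len(X)  # initialized weights with gradients
    w -= 0.1 * grad                    # of the target-task (MSE) loss

# After fine-tuning, w has adapted from its initialization toward
# the target-task solution (all-ones in this toy setup).
```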
We choose Inception-V3 for two main reasons. First, the Inception-V3 network employs factorized inception modules, allowing the network to choose suitable kernel sizes for its convolution layers; this enables it to learn both low-level features with small convolutions and high-level features with larger convolutions. Second, the computational efficiency and low parameter count of Inception make it feasible to use Inception networks in high-resolution scenarios.
Fig. 2. An overview of our proposed method. First, we preprocess and augment the pathological images. Next, each complete image is divided evenly into 12 small patches. Each image patch is then fed to a fine-tuned Inception-V3 to extract feature representations: we average-pool the final outputs of the last three inception modules and concatenate the three vectors into a 5376-dimensional vector, so each image yields 12 feature vectors. Finally, the 12 feature vectors (12 × 1 × 5376) are input to a 4-layer bidirectional LSTM that fuses the features of the 12 patches to produce the final image-wise classification.
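The patch-extraction step of this pipeline can be sketched as an even grid split. This is a minimal illustration: the 3 × 4 grid (yielding 12 patches) and the input resolution are assumptions made for the example, not values stated in the text.

```python
import numpy as np

def split_into_patches(image, rows=3, cols=4):
    """Divide an image evenly into rows * cols non-overlapping patches.

    The 3x4 grid giving 12 patches is an illustrative assumption.
    Any remainder pixels (when H or W is not divisible) are dropped.
    """
    h, w = image.shape[:2]
    ph, pw = h // rows, w // cols
    return [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(rows)
        for c in range(cols)
    ]

# Example: a 600x800 RGB image yields 12 patches of 200x200 pixels each.
img = np.zeros((600, 800, 3))
patches = split_into_patches(img)
```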
Fig. 3. Schematic overview of the richer multilevel feature representation. After fine-tuning the pretrained Inception-V3, we apply average pooling to the final outputs of the last three inception modules. We then concatenate the three vectors into a 5376-dimensional vector (1280D + 2048D + 2048D), which is used as the richer multilevel feature representation of the pathological image patch.
Objects in pathology images have various scales and high complexity; therefore, learning richer hierarchical representations is critical for image feature representation. CNNs have proven effective for this purpose. Nevertheless, convolutional features in a CNN gradually become coarser as receptive fields grow. To make the feature representation of pathological image patches more representative, we efficiently combine features from multiple convolutional layers. Thus, we retain richer multilevel and complementary information, such as local textures and fine-grained details, that is lost at higher levels. In practice, as shown in Fig. 3, we use the standard pretrained Inception-V3 from the TensorFlow-Slim distribution for feature extraction. We remove the fully connected layers from the model so that the network can consume images of arbitrary size. Here, unlike most previous models, which convert the last convolutional layer of 2048 channels into a one-dimensional feature vector of length 2048 via global average pooling, we use richer multilevel convolutional features suited to the characteristics of this task. Specifically, we apply average pooling to the final outputs of the last three inception modules and concatenate the three vectors into a 5376-dimensional vector (1280D + 2048D + 2048D), which is used as the richer multilevel feature representation of the pathological image.
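The pool-and-concatenate step can be sketched as follows. The three arrays stand in for the activations of the last three inception modules (channel counts 1280, 2048, and 2048, matching the stated 5376 total); the spatial sizes used here are illustrative, not the exact Inception-V3 tensor shapes.

```python
import numpy as np

def multilevel_features(module_outputs):
    """Average-pool each feature map over its spatial dimensions,
    then concatenate the pooled vectors into one descriptor.

    module_outputs: list of (H, W, C) arrays standing in for the
    final outputs of the last three inception modules.
    """
    pooled = [fmap.mean(axis=(0, 1)) for fmap in module_outputs]  # H,W avg pool
    return np.concatenate(pooled)  # 1280 + 2048 + 2048 = 5376 dimensions

# Illustrative stand-ins for the three module outputs of one patch.
maps = [np.random.rand(8, 8, 1280),
        np.random.rand(8, 8, 2048),
        np.random.rand(8, 8, 2048)]
feat = multilevel_features(maps)  # 5376-dimensional patch descriptor
```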
When deep learning is applied to natural images, a complete image is used directly as input for end-to-end training. However, pathological images are too large to process whole under memory constraints, so the original image is inevitably divided into several small patches. The accompanying problem is how to integrate the results of the individual patches into an image-wise classification result. The common methods are majority voting or an SVM; although simple and direct, these methods have achieved good results. However, such a simple and direct approach loses considerable contextual