
Domain Adaptation Solution to Image Classification

BU CS 585 final project report - Spring 2020

Authored by
Yize Xie U14485891
Weifan Chen U51902184
Wenxing Liu U75423991
05/01/2020

1. Introduction to the Task

Supervised machine learning assumes that training and testing data are sampled i.i.d. from the same distribution. In practice, however, training and testing data are typically collected from related domains but under different distributions. This is known as domain shift: for example, recognizing dogs under different conditions like those shown in Fig. 1.

fig1
Figure 1. Domain adaptation in object recognition

Collecting labeled samples for every new domain is expensive. To avoid the cost of annotation, domain adaptation techniques aim to tackle the domain shift between the source and target domains using unlabeled or weakly labeled images in the target domain. The task is challenging because of the gap in feature distributions between domains, which degrades the source classifier's performance. Domain adaptation has been applied to various applications such as image classification, semantic segmentation and object detection.

In domain adaptation, the key to adapting between different domains is generally considered to be alignment, which has inspired many discrepancy-based methods. Recent works focus on improving alignment from different aspects or on increasing adaptation performance with semi-supervised learning. In our project, we re-implemented three methods that align different domains to solve image classification tasks across domains. We started from scratch and aimed to match the performance reported in [5], [3] and [2]. We performed experiments on the VisDA [6], MNIST [11], SVHN [12] and USPS datasets. Our experimental results nearly match the performance reported in the papers. In addition, we made small modifications to some of the methods, which improved the results further. The implementations of all the methods are analyzed with learning curves and feature representations.

2. Related Work

Domain Adaptation. Unsupervised domain adaptation (UDA) aims to transfer knowledge from one or more labeled source domains to an unlabeled or weakly labeled target domain. Various methods have been proposed for UDA. Alignment-based models aim to align the feature distributions of the source and target domains. Such methods are based on analytic results indicating that the lowest upper bound of the error on the target domain is achieved by minimizing the divergence between the source and target domains. However, some of these works measure domain divergence on the hidden features of the network without considering the relationship between the decision boundary and the target features, which more recent works do take into account.

After careful analysis and comparison, we selected three representative works covering modern approaches to domain adaptation and decided to re-implement them. The methods are based on unsupervised or semi-supervised learning, and some of them focus on respecting the decision boundaries between classes. The first work is based on classifier discrepancy [5]: Saito et al. propose a novel methodology that extracts discriminative features by considering the outputs of the classifiers. The other two utilize dropout [3] and classifier entropy [2].

3. Approach

In our project, we mainly focus on re-implementing methods that capitalize on decision boundaries.

3.1 Maximum classifier discrepancy

fig2


Figure 2. Main idea in maximum classifier discrepancy

"Maximum classifier discrepancy" (MCD) [5] proposed a novel adversarial training method for domain adaptation. It focuses on extracting discriminative features by considering the outputs of the classifiers (shown in Fig. 2). As described in [5], suppose we have labeled source data {Xs, Ys} as well as an unlabeled target image xt drawn from the unlabeled target set Xt. We train a feature generator network G, which takes an input xs or xt, and classifier networks F1 and F2, which take features from G. F1 and F2 classify them into K classes, that is, they output a K-dimensional vector of logits. We obtain class probabilities by applying the softmax function to this vector. We use the notation p1(y|x) and p2(y|x) to denote the K-dimensional probabilistic outputs for input x obtained by F1 and F2, respectively. The discrepancy loss is the L1 distance between these outputs:

eq1
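To make this concrete, below is a minimal PyTorch sketch of the discrepancy loss; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def discrepancy(logits1, logits2):
    # L1 distance between the class-probability outputs of F1 and F2,
    # averaged over classes and over the mini-batch.
    p1 = F.softmax(logits1, dim=1)
    p2 = F.softmax(logits2, dim=1)
    return torch.mean(torch.abs(p1 - p2))
```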

The training process can be split into three steps.

Step A

We train both the classifiers and the generator to classify the source samples correctly.

eq2

eq3

Step B

In this step, we fix the generator (G) and train the classifiers (F1, F2) as a discriminator, maximizing their discrepancy on target samples while keeping them accurate on the source samples.

eq4

eq5

Step C

We train the generator to minimize the discrepancy for fixed classifiers.

eq6
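Putting the three steps together, one training iteration can be sketched as below. The sketch reuses the discrepancy function defined above; the optimizers, model objects and the number of generator updates per iteration are placeholders (Section 5.1 gives the actual settings we used).

```python
import torch.nn as nn

def train_step(G, F1, F2, opt_g, opt_f, xs, ys, xt, num_k=4):
    ce = nn.CrossEntropyLoss()

    # Step A: train G, F1 and F2 to classify the source samples correctly.
    opt_g.zero_grad(); opt_f.zero_grad()
    feat_s = G(xs)
    loss_src = ce(F1(feat_s), ys) + ce(F2(feat_s), ys)
    loss_src.backward()
    opt_g.step(); opt_f.step()

    # Step B: fix G and train F1, F2 as a discriminator, maximizing their
    # discrepancy on target samples while staying accurate on the source.
    opt_g.zero_grad(); opt_f.zero_grad()
    feat_s, feat_t = G(xs), G(xt)
    loss_b = (ce(F1(feat_s), ys) + ce(F2(feat_s), ys)
              - discrepancy(F1(feat_t), F2(feat_t)))
    loss_b.backward()
    opt_f.step()                      # only the classifiers are updated

    # Step C: fix F1, F2 and train G to minimize the discrepancy.
    for _ in range(num_k):
        opt_g.zero_grad(); opt_f.zero_grad()
        feat_t = G(xt)
        loss_c = discrepancy(F1(feat_t), F2(feat_t))
        loss_c.backward()
        opt_g.step()                  # only the generator is updated
```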

3.2 Adversarial dropout regularization

fig3
Figure 3. Main idea in Adversarial dropout regularization

We also re-implemented "Adversarial dropout regularization" (ADR) [3]. The core idea of [3] (shown in Fig. 3) is a novel use of dropout. In this method, instead of maximizing the discrepancy between two separate classifiers, we forward the input features through a single classifier C twice with different dropout masks. The paper denotes the two resulting outputs as C1(G) and C2(G). In order to make the classifier sensitive to this dropout noise, the authors introduce the KL-divergence between them. Here, p1 and p2 are the outputs of C1(G) and C2(G), respectively.


eq7

The main training process in ADR is quite similar to the method in [5]. Compared with [5], there is one additional entropy term in Step C. Step C in this method is as follows:

eq8

However, after implementing it with the KL-divergence discrepancy, we found that the results were not good. So, following what is noted in the appendix of [3], we used the entropy of the classifier output instead. To be clear, instead of calculating the adversarial loss as

eq9

we calculated it with the entropy of the classifier output, as shown in the following formula:

eq10
eq11
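For reference, here is a minimal sketch of the entropy term we substitute for the KL-based adversarial loss; the small constant inside the logarithm is our own numerical-stability detail.

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits):
    # Entropy of the classifier's softmax output, averaged over the batch.
    # This term replaces the KL-based adversarial loss in our ADR variant.
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-6)).sum(dim=1).mean()
```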

In addition, we added another classifier C' to our model. We trained it only in Step A to make sure it classifies the source samples correctly.

3.3 Minimax entropy

fig4
Figure 4. Main idea in minimax entropy

In this scenario, the task is not exactly the same as the earlier definition; it switches to semi-supervised domain adaptation. In the source domain, we are given source images and their corresponding labels. In the target domain, we are given a limited number of labeled target images as well as unlabeled target images. Our goal is to train the model on all given images and evaluate it on the unlabeled target images.

Step A

The base model consists of a feature extractor F and a classifier C. For the feature extractor F, the model uses a deep convolutional neural network and normalizes the output of the network. The normalized feature vector is then used as the input to C, which consists of weight vectors W = [w_1, w_2, ..., w_K]. The output of C is a vector of logits:


eq12

The logits are passed through a softmax function and interpreted as class probabilities. The weight vectors can be regarded as estimated prototypes for each class. The general architecture of the model is shown in Figure 4.
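A sketch of this classifier is shown below; the temperature used to scale the similarities is an assumed value, since the report does not state it.

```python
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    # Classifier C: the rows of W = [w_1, ..., w_K] act as class prototypes;
    # the logits are the scaled similarities between the L2-normalized
    # feature vector and each prototype.
    def __init__(self, feat_dim, num_classes, temperature=0.05):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes, bias=False)  # weight rows = prototypes
        self.temperature = temperature  # assumed value, not stated in the report

    def forward(self, features):
        f = F.normalize(features, dim=1)      # normalize the extracted feature
        return self.fc(f) / self.temperature  # logits = W^T f / ||f|| scaled by 1/T
```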

Step B

The model estimates domain-invariant prototypes by performing entropy maximization with respect to the estimated prototypes. It then extracts discriminative features by performing entropy minimization with respect to the feature extractor. Entropy maximization prevents the kind of overfitting that can reduce the expressive power of the representations. The model uses a standard cross-entropy loss to train F and C for classification:


eq13

The method can be formulated as adversarial learning between C and F. The task classifier C is trained to maximize the entropy, whereas the feature extractor F is trained to minimize it. Both C and F are also trained to classify labeled examples correctly. The overall adversarial learning objective functions are:


eq14
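One common way to implement this minimax game is a gradient reversal layer inserted between F and C, so that a single backward pass pushes the two networks in opposite directions. Below is a sketch under that assumption; the weight lamda is a placeholder value.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function

class GradReverse(Function):
    # Identity in the forward pass; flips the sign of the gradient in the
    # backward pass so one loss term can be maximized by C and minimized by F.
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def adversarial_entropy(feature_extractor, classifier, x_unlabeled, lamda=0.1):
    # Negative entropy of predictions on unlabeled target images. Minimizing
    # it drives the classifier C toward HIGHER entropy (its own gradient is
    # not reversed), while the reversed gradient drives F toward LOWER entropy.
    feat = GradReverse.apply(feature_extractor(x_unlabeled))
    p = F.softmax(classifier(feat), dim=1)
    return lamda * (p * torch.log(p + 1e-6)).sum(dim=1).mean()
```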

4. Dataset

4.1 Digit Datasets

The MNIST [11] database has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The grayscale digits have been size-normalized and centered in a fixed-size image. The image size in MNIST is 28*28.

SVHN [12] contains 73,257 digits for training and 26,032 digits for testing. The image size in SVHN is 32*32, and the images are in color.

USPS has 7,291 training and 2,007 test images. The images are 16*16 grayscale.

We used torchvision to load these datasets with batch size 64. Since the image size differs between datasets, we resized all images to 32*32 and then normalized them with mean 0.5 and standard deviation 0.5.
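As an illustration, the loading pipeline could be set up as in the following sketch; the SVHN-to-MNIST pairing and the data path are placeholders, and converting the grayscale digits to 3 channels is the modification discussed in Section 5.1.

```python
import torch
from torchvision import datasets, transforms

to_rgb = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),   # MNIST/USPS: replicate to 3 channels
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
color = transforms.Compose([
    transforms.Resize((32, 32)),                   # SVHN images are already RGB
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

source = datasets.SVHN("./data", split="train", download=True, transform=color)
target = datasets.MNIST("./data", train=True, download=True, transform=to_rgb)

source_loader = torch.utils.data.DataLoader(source, batch_size=64, shuffle=True)
target_loader = torch.utils.data.DataLoader(target, batch_size=64, shuffle=True)
```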

4.2 VisDA Dataset

fig5
Figure 5. Examples of VisDA 2017


For the image classification experiments, we also used the VisDA [6] dataset, the largest dataset for cross-domain object classification, with over 280k images across 12 categories in the combined training, validation, and testing domains. This dataset evaluates adaptation from synthetic-object to real-object images. The source images were generated by rendering 3D models of the same object categories as in the real data from different angles and under different lighting conditions; there are 152,397 synthetic images in total. The validation images were collected from MSCOCO [13] and amount to 55,388 in total. Figure 5 shows examples from VisDA.

We resized all the images to 256*256 and randomly flipped them horizontally. We then normalized the images with means [0.485, 0.456, 0.406] and standard deviations [0.229, 0.224, 0.225].
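A sketch of the corresponding torchvision preprocessing for VisDA:

```python
from torchvision import transforms

visda_transform = transforms.Compose([
    transforms.Resize((256, 256)),          # resize all images to 256*256
    transforms.RandomHorizontalFlip(),      # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```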

5. Results

The evaluation metric for domain adaptation is simple: classification accuracy. We divide our results into two parts, one on the digit datasets and one on the VisDA dataset.

5.1 Digit Datasets

table1
Table 1. Experiments on digit datasets

As depicted in Table 1, the white part shows the results stated in the papers, while the blue part shows our results. For the MCD and ADR models, it is clear that our implementations reach performance comparable to or better than the original implementations. Several other points in our results are worth noting.

We used the Adam optimizer with the learning rate set to 0.001 for both the ADR and MCD models. In each iteration, after performing Step A and Step B, we updated the generator (Step C) four times.

For the ADR model, the original implementation on the digit datasets has a drawback: MNIST and USPS are treated as one-channel grayscale images, which leads to an oversimplified model structure. We therefore improved the original model by converting the images into a 3-channel color space and examined the learning curve of the optimized implementation. From the experimental results, it is clear that our implementation outperforms the results given by the original paper. We can further explain this result by looking more closely at the learning curves.


fig6

Figure 6. Paper result of ADR usps2mnist

fig7

Figure 7. Our result of ADR usps2mnist


As shown in Figures 6 and 7, acc-c1 is the accuracy of classifier C, which is sensitive to noise, acc-c2 is the accuracy of classifier C', and acc-c3 is the accuracy obtained by combining the outputs of the two classifiers. Our implementation shows less discrepancy between the two classifiers during training than the implementation from the paper, indicating better alignment. We believe this can also be explained as follows: the more complex model makes the noise-sensitive classifier perform better.

For the Minimax Entropy model, the paper does not provide results on the digit datasets, so we conducted experiments on them as new experiments, enabling a comparison between Minimax Entropy, MCD and ADR. From Table 1 we can see that from SVHN to MNIST and from USPS to MNIST this model outperforms all the other models. However, it adapts extremely badly when the target domain is SVHN. Our explanation is that the entropy model depends too heavily on the source domain: if the information from the source domain is theoretically insufficient for classification on the target domain, performance drops significantly.


fig8

Figure 8. Original feature extraction after PCA

fig9

Figure 9. Trained feature extraction after PCA


The feature representations of this model are shown in Figure 8 and Figure 9. After training, the features are better separated by the extractor, making classification a much easier task for the following layers.

5.2 VisDA Dataset

table2

Table 2. Experiments on VisDA datasets


Table 2 shows the results of our experiments on the VisDA dataset. We mainly performed experiments on this dataset with the MCD [5] and ADR [3] methods. The results show that even on classification tasks with a larger domain shift, our implementations still work well.

In our experiments, we treated the images of the validation split as the target domain and trained the models in the unsupervised domain adaptation setting. We compared classification accuracy to assess the performance of our implementations.

We used a ResNet101 [14] model pre-trained on ImageNet [15] as the backbone of our model. The final fully-connected layer was removed, and all layers were updated with the same learning rate because this dataset has abundant source and target samples. We regarded the pre-trained model as the generator network and used three-layer fully-connected networks as the classification networks. The batch size was set to 32 and we used SGD with learning rate 1.0 × 10^-3 and momentum 0.9 to optimize the model. We report our best accuracy.
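A sketch of the generator/classifier split described above is given below; the hidden width and the use of batch normalization are assumptions on our part, since the report only specifies a ResNet101 backbone with the final fully-connected layer removed and three-layer fully-connected classifiers.

```python
import torch.nn as nn
from torchvision import models

class Generator(nn.Module):
    # ResNet101 backbone with the final fully-connected layer removed.
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(pretrained=True)
        self.features = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, x):
        return self.features(x).flatten(1)   # (N, 2048) feature vectors

class Classifier(nn.Module):
    # Three-layer fully-connected classification network for the 12 classes.
    def __init__(self, num_classes=12, hidden=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2048, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```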

Meanwhile, for MCD, we added a class balance loss in Step A and Step B, which is also mentioned in [5]:


eq15

We set a constant weight and multiplied the class balance loss by this term.

As we can see in the table, our performance is almost the same as the performance reported in the papers (the results filled with blue are ours and the results filled with white are those written in the papers). For some classes, we achieve better performance. To be more specific, we obtain higher accuracy on the plane, bicycle, horse, knife, motorcycle and truck classes compared with the results in the MCD paper, and higher accuracy on the plane, car, knife, motorcycle, person, plant and truck classes compared with the results in the ADR paper. One thing to mention: because we use the entropy loss for ADR as described in Section 3, we only compare our results with the results in the paper that use this method (denoted as ENT in the paper).

6. Code repository

https://github.com/eissieLiu/maximum_classifier_discrepancy

7. References

  1. Hsu, Han-Kai, et al. "Progressive domain adaptation for object detection." The IEEE Winter Conference on Applications of Computer Vision. 2020.
  2. Saito, Kuniaki, et al. "Semi-supervised domain adaptation via minimax entropy." Proceedings of the IEEE International Conference on Computer Vision. 2019.
  3. Saito, Kuniaki, et al. "Adversarial dropout regularization." arXiv preprint arXiv:1711.01575 (2017).
  4. Saito, Kuniaki, et al. "Strong-weak distribution alignment for adaptive object detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
  5. Saito, Kuniaki, et al. "Maximum classifier discrepancy for unsupervised domain adaptation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  6. Peng, Xingchao, et al. "VisDA: The Visual Domain Adaptation Challenge." arXiv preprint arXiv:1710.06924 (2017).
  7. Richter, Stephan R., et al. "Playing for Data: Ground Truth from Computer Games." arXiv preprint arXiv:1608.02192 (2016).
  8. Ros, Germán, et al. "The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. 3234-3243.
  9. Cordts, Marius, et al. "The Cityscapes Dataset for Semantic Urban Scene Understanding." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. 3213-3223.
  10. Peng, Xingchao, et al. "Moment Matching for Multi-Source Domain Adaptation." arXiv preprint arXiv:1812.01754 (2018).
  11. LeCun, Yann, and Corinna Cortes. "MNIST handwritten digit database." 2010.
  12. Netzer, Yuval, Tao Wang, et al. "Reading Digits in Natural Images with Unsupervised Feature Learning." NIPS Workshop on Deep Learning and Unsupervised Feature Learning. 2011.
  13. Lin, Tsung-Yi, et al. "Microsoft COCO: Common Objects in Context." European Conference on Computer Vision (ECCV). 2014.
  14. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
  15. Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2009.