Figure 1. Conceptual comparison between the traditional compression view and the proposed disentanglement view. Prior works compress implicit domain information to obtain domain-indistinguishable representations, whereas in this work we pursue an explicit decoupling of domain-specific information from other information via generative modeling.
Figure 2. Graphical illustrations of the proposed generative
and inference models for sequential domain disentanglement.
Figure 3. Overview of our TranSVAE framework.
The input videos are fed into an encoder to extract visual features, followed by an LSTM that captures temporal information.
Two groups of mean and variance networks are then applied to model the posteriors of the latent factors, i.e.,
$q(z_t^D \mid x_{\leq t}^D)$
and $q(z_d^D \mid x_{1:T}^D)$. The new representations $z_1^D, \ldots, z_T^D$
and $z_d^D$ are sampled, then concatenated and passed to a decoder for reconstruction. Four constraints are proposed to regulate the latent factors for adaptation purposes.
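For concreteness, the pipeline described in this caption can be sketched in a few lines of PyTorch. This is a minimal, hypothetical sketch, not the authors' released implementation: the module name `TranSVAESketch`, the feature and latent dimensions, the use of a single linear layer per component, and the choice of the last LSTM state for $z_d^D$ are all assumptions, and the four adaptation constraints are omitted.

```python
# Minimal sketch of the Figure 3 pipeline (hypothetical modules and dimensions,
# not the authors' exact implementation).
import torch
import torch.nn as nn


class TranSVAESketch(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, zt_dim=64, zd_dim=64):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden_dim)       # frame-level visual features -> hidden
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # posterior heads: q(z_t | x_{<=t}) per frame, q(z_d | x_{1:T}) per video
        self.zt_mean, self.zt_logvar = nn.Linear(hidden_dim, zt_dim), nn.Linear(hidden_dim, zt_dim)
        self.zd_mean, self.zd_logvar = nn.Linear(hidden_dim, zd_dim), nn.Linear(hidden_dim, zd_dim)
        self.decoder = nn.Linear(zt_dim + zd_dim, feat_dim)  # reconstruct frame features

    @staticmethod
    def reparameterize(mean, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)
        return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)

    def forward(self, x):                                     # x: (B, T, feat_dim)
        h, _ = self.lstm(self.encoder(x))                     # h: (B, T, hidden_dim)
        # dynamic latents z_1, ..., z_T, one per time step
        zt = self.reparameterize(self.zt_mean(h), self.zt_logvar(h))
        # static latent z_d from the whole sequence (last hidden state as a proxy)
        zd = self.reparameterize(self.zd_mean(h[:, -1]), self.zd_logvar(h[:, -1]))
        zd_rep = zd.unsqueeze(1).expand(-1, x.size(1), -1)    # broadcast z_d over time
        recon = self.decoder(torch.cat([zt, zd_rep], dim=-1)) # concatenate and reconstruct
        return recon, zt, zd


# usage: a batch of 4 clips, 8 frames each, with 512-d frame features
model = TranSVAESketch()
videos = torch.randn(4, 8, 512)
recon, zt, zd = model(videos)
print(recon.shape, zt.shape, zd.shape)
```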
Figure 4. Loss integration studies on UCF101 → HMDB51. Left: t-SNE plots of class-wise (top row) and domain (bottom row; red
for source, blue for target) features. Right: Ablation results (%) obtained by adding each loss sequentially, i.e., from row (a) to row (e).
Figure 5. Loss integration studies on HMDB51 → UCF101. Left: t-SNE plots of class-wise (top row) and domain (bottom row; red
for source, blue for target) features. Right: Ablation results (%) obtained by adding each loss sequentially, i.e., from row (a) to row (e).
@article{wei2022transvae,
title={Unsupervised Video Domain Adaptation: A Disentanglement Perspective},
author={Wei, Pengfei and Kong, Lingdong and Qu, Xinghua and Yin, Xiang and Xu, Zhiqiang and Jiang, Jing and Ma, Zejun},
journal={arXiv preprint arXiv:2208.07365},
year={2022},
}