Figure 2. Graphical illustrations of the proposed generative
and inference models for sequential domain disentanglement.
The blue/red nodes are the observed source/target videos x^S/x^T, respectively, over t timestamps.
The static variables z_d^S and z_d^T follow a joint distribution and are domain-specific.
Combining either of them with the dynamic variable z_t at each timestamp constructs one frame of the corresponding domain.
Note that the sequences of dynamic variables are shared across domains and are domain-invariant.
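The generative structure in the caption can be illustrated with a minimal numpy sketch: one static code per domain, one shared domain-invariant dynamic sequence, and each frame produced by combining the two. The dimensions and the linear "decoder" map `W` are toy assumptions for illustration, not the paper's actual decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
T, DZ_T, DZ_D, DX = 8, 4, 4, 16  # assumed toy sizes: timestamps, latent dims, frame dim

# Assumed stand-in decoder: any fixed map from [z_d; z_t] to a frame.
W = rng.standard_normal((DZ_D + DZ_T, DX))

def generate_video(z_d, z_dyn):
    """Combine one static (domain-specific) code with the shared dynamic
    sequence, producing one frame per timestamp."""
    frames = [np.concatenate([z_d, z_t]) @ W for z_t in z_dyn]
    return np.stack(frames)  # shape (T, DX)

# Domain-specific static codes; shared, domain-invariant dynamics.
z_d_S = rng.standard_normal(DZ_D)
z_d_T = rng.standard_normal(DZ_D)
z_dyn = rng.standard_normal((T, DZ_T))

x_S = generate_video(z_d_S, z_dyn)  # source video
x_T = generate_video(z_d_T, z_dyn)  # target video: same motion, different domain
```

Because the two videos share `z_dyn`, they differ only through the exchanged static code, which is exactly the transfer effect shown later in Figure 6.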
Figure 3. Overview of our TranSVAE framework.
The input videos are fed into an encoder to extract visual features, followed by an LSTM to capture temporal information.
Two groups of mean and variance networks then model the posteriors of the latent factors, i.e., q(z_t^D | x_{<t}^D) and q(z_d^D | x_{1:T}^D). The new representations z_1^D, ..., z_T^D and z_d^D are sampled, concatenated, and passed to a decoder for reconstruction. Four constraints are proposed to regulate the latent factors for adaptation purposes.
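The inference pipeline in this caption can be sketched in a few lines of numpy. All sizes and weight matrices are toy assumptions, and a simple tanh recurrence stands in for the LSTM; only the structure is meant to match: a recurrent pass over frames, a per-timestamp mean/variance head for the dynamic posterior, and a sequence-level mean/variance head for the static posterior, both sampled via the reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(0)
T, DX, DH, DZ = 8, 16, 10, 4  # assumed toy sizes

# Assumed toy parameters standing in for the encoder, the recurrent
# network, and the two groups of mean/variance heads.
W_enc = rng.standard_normal((DX, DH))
W_rec = rng.standard_normal((DH, DH))
W_mu_t, W_lv_t = rng.standard_normal((DH, DZ)), rng.standard_normal((DH, DZ))
W_mu_d, W_lv_d = rng.standard_normal((DH, DZ)), rng.standard_normal((DH, DZ))

def reparameterize(mu, logvar):
    # z = mu + sigma * eps (reparameterization trick)
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def infer(x):
    """x: (T, DX) video. Returns dynamic samples z_1..z_T and static z_d."""
    h, hs = np.zeros(DH), []
    for x_t in x:  # recurrent pass over frames (LSTM stand-in)
        h = np.tanh(x_t @ W_enc + h @ W_rec)
        hs.append(h)
    hs = np.stack(hs)
    # Dynamic posterior: one Gaussian sample per timestamp.
    z_dyn = np.stack([reparameterize(h @ W_mu_t, h @ W_lv_t) for h in hs])
    # Static posterior: conditioned on the whole sequence (mean-pooled here).
    h_bar = hs.mean(axis=0)
    z_d = reparameterize(h_bar @ W_mu_d, h_bar @ W_lv_d)
    return z_dyn, z_d

z_dyn, z_d = infer(rng.standard_normal((T, DX)))
```

In the actual framework the sampled `z_dyn` and `z_d` would then be concatenated per timestamp and decoded back into frames; the sketch stops at inference.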
Figure 4. Loss integration studies on UCF101 → HMDB51.
Left: The t-SNE plots for class-wise (top row) and domain (bottom row; red: source, blue: target) features.
Right: Ablation results (%) obtained by adding each loss sequentially, i.e., rows (a) to (e).
Figure 5. Loss integration studies on HMDB51 → UCF101.
Left: The t-SNE plots for class-wise (top row) and domain (bottom row; red: source, blue: target) features.
Right: Ablation results (%) obtained by adding each loss sequentially, i.e., rows (a) to (e).
Figure 6. Domain disentanglement and transfer examples.
Left: Video sequence inputs for D = P1 (“Human”) and D = P2 (“Alien”).
Middle: Reconstructed sequences with z_1^D, ..., z_T^D.
Right: Domain-transferred sequences with the exchanged z_d^D.