This paper proposes a framework for training variational autoencoders (VAEs) on image distributions whose factors of variation form latent groups. Our key idea is to introduce a mechanism that predicts the factor group an image belongs to while simultaneously disentangling the factors within it. More specifically, we propose an architecture consisting of three components: an encoder, a decoder, and a factor-group prediction head. The first two components are trained with a VAE objective, and the third is trained with the proposed algorithm using an unsupervised contrastive learning loss. In experiments, we designed a task in which multiple groups of factors were entangled by combining several datasets, and demonstrated the effectiveness of the proposed framework. The Mutual Information Gap score improved from 0.089 to 0.125 on a merged dataset of Color-dSprites, 3DShapes, and MPI3D.
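The three-component architecture described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's implementation: the linear encoder/decoder, all dimensions, and the InfoNCE-style contrastive loss are assumptions chosen only to show how a VAE objective and a contrastive loss on a group-prediction head could sit side by side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
D_IN, D_LATENT, N_GROUPS = 64, 10, 3

# Encoder: maps an "image" vector to latent mean and log-variance (VAE).
W_enc = rng.normal(scale=0.1, size=(D_IN, 2 * D_LATENT))
# Decoder: maps a latent sample back to image space.
W_dec = rng.normal(scale=0.1, size=(D_LATENT, D_IN))
# Factor-group prediction head: maps latents to group logits.
W_head = rng.normal(scale=0.1, size=(D_LATENT, N_GROUPS))

def encode(x):
    h = x @ W_enc
    return h[:, :D_LATENT], h[:, D_LATENT:]  # mu, log_var

def reparameterize(mu, log_var):
    # Standard VAE reparameterization trick.
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

def vae_loss(x):
    # Reconstruction term plus KL divergence to a standard normal prior.
    mu, log_var = encode(x)
    z = reparameterize(mu, log_var)
    x_hat = z @ W_dec
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
    return recon + kl, z

def contrastive_loss(z_a, z_b, temperature=0.5):
    # InfoNCE-style loss: matching rows of z_a and z_b are positive pairs,
    # all other rows in the batch serve as negatives.
    za = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    zb = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = za @ zb.T / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

x = rng.normal(size=(8, D_IN))       # a batch of stand-in "images"
loss, z = vae_loss(x)
group_logits = z @ W_head            # head output: one score per factor group
print(loss > 0, group_logits.shape)
```

In a full training loop, the encoder and decoder would be updated from the VAE loss while the head is updated from the contrastive loss over augmented or paired views; the sketch only shows the forward computations of the three components.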