Overview of AIST++
The AIST++ Dance Motion Dataset is constructed from the AIST Dance Video DB. Starting from the multi-view videos, we designed an elaborate pipeline to estimate the camera parameters, 3D human keypoints, and 3D human dance motion sequences:
- It provides 3D human keypoint annotations and camera parameters for 10.1M images, covering 30 different subjects in 9 views. These attributes make it the largest and richest existing dataset with 3D human keypoint annotations.
- It also contains 1,408 sequences of 3D human dance motion, represented as joint rotations along with root trajectories (see the sketch below). The dance motions are equally distributed among 10 dance genres with hundreds of choreographies. Motion durations range from 7.4 to 48.0 seconds, and every dance motion has corresponding music.
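To make this representation concrete, here is a minimal sketch of how a sequence of per-frame joint rotations plus a root trajectory could be consumed with NumPy and SciPy. The file layout and the key names (`smpl_poses`, `smpl_trans`) are illustrative assumptions, not a documented schema:

```python
import pickle

import numpy as np
from scipy.spatial.transform import Rotation

# Assumed layout: one pickle per sequence holding per-frame joint
# rotations (axis-angle) and the root trajectory. The key names below
# are illustrative assumptions, not a documented schema.
with open("gBR_sBM_c01_d04_mBR0_ch01.pkl", "rb") as f:
    seq = pickle.load(f)

poses = np.asarray(seq["smpl_poses"])  # (num_frames, num_joints * 3), axis-angle
trans = np.asarray(seq["smpl_trans"])  # (num_frames, 3), root trajectory

num_frames = poses.shape[0]
# Convert every per-joint axis-angle vector to a 3x3 rotation matrix.
rotmats = (Rotation.from_rotvec(poses.reshape(-1, 3))
           .as_matrix()
           .reshape(num_frames, -1, 3, 3))
print(rotmats.shape, trans.shape)  # (N, num_joints, 3, 3), (N, 3)
```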
With those annotations, AIST++ is designed to support tasks including:
- Multi-view Human Keypoints Estimation.
- Human Motion Prediction/Generation.
- Cross-modal Analysis between Human Motion and Music.
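As a concrete example of how the camera parameters relate the 3D keypoint annotations to each of the 9 views, the sketch below applies a standard pinhole projection. The parameter names (intrinsics `K`, rotation `R`, translation `t`) follow the usual convention and are not a specific AIST++ interface:

```python
import numpy as np

def project_keypoints(kpts3d, K, R, t):
    """Project 3D keypoints (J, 3) into one camera view.

    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation,
    following the usual pinhole convention x = K (R X + t).
    """
    cam = kpts3d @ R.T + t         # world -> camera coordinates
    uv = cam @ K.T                 # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]  # perspective divide -> pixels

# Toy example: one keypoint 2 m in front of an identity camera.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
uv = project_keypoints(np.array([[0.0, 0.0, 2.0]]),
                       K, np.eye(3), np.zeros(3))
print(uv)  # [[960. 540.]] -- lands at the principal point
```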
Publications
The following paper describes the AIST++ dataset in depth, from the data processing pipeline to detailed statistics about the data. If you use the AIST++ dataset in your work, please cite this article.
Ruilong Li*, Shan Yang*, David A. Ross, Angjoo Kanazawa. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++. ICCV, 2021. [PDF] [BibTeX] [Web]
```
@misc{li2021learn,
  title={Learn to Dance with AIST++: Music Conditioned 3D Dance Generation},
  author={Ruilong Li and Shan Yang and David A. Ross and Angjoo Kanazawa},
  year={2021},
  eprint={2101.08779},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
Please also consider citing the original AIST Dance Video Database if you find our dataset useful. ([BibTeX])
```
@inproceedings{aist-dance-db,
  author    = {Shuhei Tsuchida and Satoru Fukayama and Masahiro Hamasaki and Masataka Goto},
  title     = {AIST Dance Video Database: Multi-genre, Multi-dancer, and Multi-camera Database for Dance Information Processing},
  booktitle = {Proceedings of the 20th International Society for Music Information Retrieval Conference, {ISMIR} 2019},
  address   = {Delft, Netherlands},
  pages     = {501--510},
  year      = 2019,
  month     = nov
}
```
Dataset organization
The dataset is split into training/validation/testing sets in different ways to serve different purposes.
- For tasks such as human pose estimation and human motion prediction, we recommend the data splits described in Table 1. Here the trainval and testing sets are split by subject, which also ensures that the human motions in the trainval and testing sets do not overlap. Note that the training and validation sets share the same group of subjects.
- For tasks dealing with motion and music, such as music-conditioned motion generation, we recommend the data splits described in Table 2. In the AIST database, the same music and the same choreography are shared by multiple human motion sequences, so we carefully split the dataset to make sure that the music and choreography in the training set do not overlap with those in the testing/validation set. Note that the validation and testing sets share the same group of music pieces but use different choreographies. (A sketch of how such a grouping can be enforced follows this list.)
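As an illustration, the sketch below groups sequences by the music ID embedded in AIST-style sequence names (e.g. `gBR_sBM_c01_d04_mBR0_ch01`, where `mBR0` is the music and `ch01` the choreography), so that all sequences sharing a piece of music can be kept on the same side of a split. The parsing helper itself is hypothetical:

```python
import re
from collections import defaultdict

# AIST-style sequence names encode genre, situation, camera, dancer,
# music, and choreography, e.g. "gBR_sBM_c01_d04_mBR0_ch01".
NAME_RE = re.compile(
    r"g[^_]+_s[^_]+_c\d+_d\d+_(?P<music>m[^_]+)_(?P<choreo>ch\d+)$")

def group_by_music(seq_names):
    """Group sequence names by music ID, so a split can keep every
    sequence sharing a piece of music on the same side."""
    groups = defaultdict(list)
    for name in seq_names:
        m = NAME_RE.match(name)
        if m is None:
            continue  # skip names that don't follow the convention
        groups[m.group("music")].append(name)
    return groups

groups = group_by_music(["gBR_sBM_c01_d04_mBR0_ch01",
                         "gBR_sFM_c02_d05_mBR0_ch02"])
print(groups)  # both sequences share the music "mBR0"
```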
Table 1: Data Splits based on Subjects.
|           | Train     | Validation | Test      |
|-----------|-----------|------------|-----------|
| Images    | 6,420,059 | 508,234    | 3,179,722 |
| Sequences | 868       | 70         | 470       |
| Subjects  | 20*       | 20*        | 10        |
Table 2: Data Splits based on Music-Choreography.
|                | Train    | Validation | Test  |
|----------------|----------|------------|-------|
| Seconds        | 13,963.6 | 187.6      | 187.6 |
| Sequences      | 980      | 20         | 20    |
| Choreographies | 420      | 20         | 20    |
| Music          | 50       | 10*        | 10*   |
\* Starred entries are shared across splits (e.g., training and validation use the same 20 subjects; validation and testing use the same 10 music pieces).
Licenses
The annotations are licensed by Google LLC under a CC BY 4.0 license.