Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models

Stanford University, Snap Research

Abstract

Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps that guide test-time optimization to produce paired image and pseudo-ground truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact and covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance during optimization, improves mesh estimation on unseen interactions. Our work addresses the longstanding challenge of data scarcity for close interactions in HME, enhancing the field's ability to handle complex interaction scenarios.

🌘 Paper Overview πŸŒ’

Our paper can be summarized into three main points:

  1. We propose a novel data generation method for close interactions that leverages noisy automatic annotations to scale data acquisition, producing pseudo-ground truth meshes from in-the-wild images.
  2. We curate APU, a dataset of paired images and pseudo-ground truth meshes featuring a diverse array of close interaction types and subjects.
  3. As an application, we demonstrate that our data significantly enriches the representation space of a close contact prior for HME, improving accuracy particularly for less common interaction scenarios in a case study of the NTU RGB+D 120 dataset.

πŸŒ— Ask Pose Unite Dataset πŸŒ“


We compile our APU dataset by building on a key insight of prior works [1,2]: using 2D images with weak labels to target interaction diversity. We have gathered more than 6,000 meshes paired with images, contact annotations, and natural language descriptions of the interactions from both laboratory and in-the-wild scenes, encompassing a variety of ages, subjects, and interactions.
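To make the composition concrete, here is a hypothetical sketch of what a single APU record could look like; the field names, file formats, and paths are illustrative assumptions, not the released schema.

```python
# Hypothetical layout of one APU record, based on the dataset description above.
# Field names, file formats, and paths are assumptions, not the released schema.
example_record = {
    "image": "images/pair_000123.jpg",            # in-the-wild or lab image of two people
    "meshes": ["meshes/pair_000123_a.obj",        # pseudo-ground truth mesh, person A
               "meshes/pair_000123_b.obj"],       # pseudo-ground truth mesh, person B
    "contact_map": "contacts/pair_000123.npy",    # region-to-region contact annotation
    "description": "Two adults hugging, arms wrapped around each other's shoulders.",
}
```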

πŸŒ– Data generation method πŸŒ”


We aim to curate, from any set of in-the-wild images, pairs of closely interacting people together with well-reconstructed pseudo-ground truth meshes. To achieve this, we propose a data generation method that locates such pairs and produces a mesh estimate for each person. Because we rely only on weak supervision in the form of predicted contact maps, 2D keypoints, and interaction labels, we also automatically select the well-reconstructed meshes, as sketched below.
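Below is a minimal sketch of such an automatic selection step, assuming the fitted meshes expose projected 2D joints and that a confidence-weighted reprojection error against the detected keypoints serves as the acceptance criterion; the error measure, threshold, and function names are illustrative assumptions rather than the paper's exact procedure.

```python
# A minimal, self-contained sketch of automatically filtering fitted mesh pairs.
# The criterion (confidence-weighted 2D reprojection error) and the threshold
# are illustrative assumptions, not the paper's exact selection rule.
import numpy as np

def reprojection_error(projected_joints, detected_keypoints, confidences):
    """Confidence-weighted mean 2D distance (pixels) between projected and detected joints."""
    dists = np.linalg.norm(projected_joints - detected_keypoints, axis=-1)
    return (confidences * dists).sum() / (confidences.sum() + 1e-8)

def keep_pair(proj_a, proj_b, kp_a, kp_b, conf_a, conf_b, max_error_px=15.0):
    """Keep a fitted mesh pair only if both people reproject well onto the image."""
    err_a = reprojection_error(proj_a, kp_a, conf_a)
    err_b = reprojection_error(proj_b, kp_b, conf_b)
    return max(err_a, err_b) <= max_error_px

# Example with synthetic 17-joint detections for two people
rng = np.random.default_rng(1)
kp = rng.uniform(0, 512, size=(2, 17, 2))           # detected 2D keypoints
proj = kp + rng.normal(scale=3.0, size=kp.shape)    # projected joints from fitted meshes
conf = rng.uniform(0.5, 1.0, size=(2, 17))          # detector confidences
print(keep_pair(proj[0], proj[1], kp[0], kp[1], conf[0], conf[1]))
```

In practice the filter could also check how well the fitted meshes satisfy the predicted contact map; the reprojection test above is just the simplest self-contained illustration.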

πŸŒ• Case Study: APU to improve novel interactions πŸŒ•

Joint PA-MPJPE results on the close interaction subset of the NTU RGB+D 120 test set. PA-MPJPE: joint two-person Procrustes-aligned MPJPE. Auto CM: contact maps generated by our method.

The main advantage of our dataset and data acquisition method is to introduce training data from a larger variety of interactions for downstream HME models. We study the effect of enhancing the representation space of one such model, the contact interaction prior BUDDI [2], with our dataset. We select the close interaction classes from the test set of the NTU RGB+D 120 dataset and label the contact frames using a combination of the distance between the annotated 3D joints, 2D keypoints from an off-the-shelf estimator, and manual frame-level annotation. Results: We show the Joint PA-MPJPE (lower is better) for the prior works BEV and BUDDI, our baseline of contact maps generated by our method (Auto CM), and our trained contact prior. The automatic contact map baseline (Auto CM) improves on most classes over the initial meshes from BEV. The contact prior benefits from the in-domain training data, performing better than BUDDI on all classes and showing significant improvements on uncommon interactions such as step on foot, grab stuff, and support. Common actions, such as handshake and high-five, also benefit from a larger diversity of training examples.
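For reference, here is a minimal sketch of how the joint two-person PA-MPJPE can be computed, assuming each person's 3D joints are given as a (J, 3) array in a shared frame; the exact joint set and evaluation protocol used in the paper may differ.

```python
# A minimal sketch of joint two-person PA-MPJPE: both people's joints are
# concatenated and aligned to ground truth with a single similarity transform.
# Joint set, units, and protocol details are assumptions.
import numpy as np

def procrustes_align(pred, gt):
    """Similarity-transform (scale, rotation, translation) alignment of pred to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)   # cross-covariance SVD
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    return scale * (pred - mu_p) @ R.T + mu_g

def joint_pa_mpjpe(pred_a, pred_b, gt_a, gt_b):
    """Mean per-joint error after a single alignment over both people's joints."""
    pred = np.concatenate([pred_a, pred_b], axis=0)
    gt = np.concatenate([gt_a, gt_b], axis=0)
    aligned = procrustes_align(pred, gt)
    return np.linalg.norm(aligned - gt, axis=-1).mean()

# Example: two people with 24 joints each, predictions perturbed by noise
rng = np.random.default_rng(0)
gt_a, gt_b = rng.normal(size=(24, 3)), rng.normal(size=(24, 3)) + [1.0, 0.0, 0.0]
err = joint_pa_mpjpe(gt_a + 0.01 * rng.normal(size=(24, 3)),
                     gt_b + 0.01 * rng.normal(size=(24, 3)), gt_a, gt_b)
print(f"Joint PA-MPJPE: {err:.4f}")
```

Aligning the concatenated joints of both people with a single similarity transform, rather than aligning each person separately, preserves errors in the relative placement of the two people, which is exactly what a close interaction metric needs to capture.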

BibTeX

@article{bravo2024ask,
  title = {Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models},
  author = {Bravo-S{\'a}nchez, Laura and Heo, Jaewoo and Weng, Zhenzhen and Wang, Kuan-Chieh and Yeung-Levy, Serena},
  journal = {arXiv preprint arXiv:2410.00309},
  year = {2024},
  pdf = {https://arxiv.org/abs/2410.00309},
  website = {https://laubravo.github.io/apu_website/},
}