VibeSync App

Serokell's AI experts researched and developed an ML-based application that synchronizes users' dance videos with music using human pose estimation.

Our goal was not only to eliminate the delay between the video and audio recordings but also to adjust the movement speed to match the music beats or even recommend appropriate music for the user's dance video.
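To illustrate the first of these goals, here is a minimal sketch of estimating the delay between a video and an audio recording by cross-correlation. It is not the project's code: the function name `best_lag` and both inputs (per-frame "activity" envelopes sampled at the same rate) are assumptions made for the example.

```python
def best_lag(video_env, audio_env, max_lag):
    """Return the lag (in frames) at which audio_env best lines up with video_env.

    A positive result means the audio events occur that many frames
    later than the matching video events.
    """
    def score(lag):
        # Sum of products of overlapping samples when audio is shifted by `lag`.
        total = 0.0
        for i, v in enumerate(video_env):
            j = i + lag
            if 0 <= j < len(audio_env):
                total += v * audio_env[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=score)


# Toy envelopes: the audio peaks trail the video peaks by 2 frames.
video = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
audio = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
lag = best_lag(video, audio, max_lag=5)
```

Shifting the audio track earlier by `lag` frames would then remove the offset. Real recordings would need a more robust envelope and normalization than this toy example uses.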

The task was to investigate the existing Python frameworks, whitepapers, and libraries, and provide an effective tool for creating appealing musical clips.

We sourced data from multiple publicly available databases. Through rigorous testing, we chose the best ML model and refined it until we arrived at an MVP solution.

Our solution

Through research, we discovered a method for extracting visual tempograms from video and found relevant open-source datasets to help us kickstart the project.
Some audio choices were not aligned with the movement in the video, which led us to explore dance embeddings. They allowed us to incorporate more information from the video in our audio choices.
We used the MMPose library, which offers pre-trained models for human pose estimation. This enabled us to embed the dance in a machine-readable form and improve the syncing between the video and audio.
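As a simplified illustration of what "machine-readable form" can mean here, the sketch below turns per-frame body keypoints into a motion envelope: the average distance the keypoints travel between consecutive frames. The input layout (a list of frames, each a list of `(x, y)` pairs) and the function name `motion_envelope` are assumptions for this example, not the MMPose output format or the embedding used in the project.

```python
import math


def motion_envelope(frames):
    """Mean keypoint displacement between each pair of consecutive frames.

    frames: list of frames, each a list of (x, y) keypoints in the same order.
    Returns a list with len(frames) - 1 values.
    """
    envelope = []
    for prev, curr in zip(frames, frames[1:]):
        dists = [
            math.hypot(cx - px, cy - py)
            for (px, py), (cx, cy) in zip(prev, curr)
        ]
        envelope.append(sum(dists) / len(dists))
    return envelope


# Two keypoints over three frames: the first keypoint moves once, then stops.
frames = [
    [(0, 0), (1, 1)],
    [(3, 4), (1, 1)],
    [(3, 4), (1, 1)],
]
envelope = motion_envelope(frames)
```

Peaks and sudden drops in such an envelope are a rough proxy for the visual "beats" of a dance, which is the kind of signal that can be compared against the rhythm of an audio track.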

Google AIST++, a dataset that contains the key points of dancers' bodies, proved to be a viable solution for dance embeddings.

We used the following resources in our research:

MMPose library of pre-trained models for human pose estimation.
Librosa, a Python library for audio and music analysis.
AIST Dance Motion dataset.
Visual Rhythm and Beat paper on extracting visual tempograms from the video.
Applying all those research findings to a real-world case presented the biggest challenge.
Our team consisted of ML engineers with both an academic background and hands-on experience in video and audio processing and analysis projects.

To complete this project, we conducted extensive research into AI theory, turned an academic whitepaper into a working machine learning model, and designed an MVP application.


We taught the model to:

Isolate the rhythm from the audio.

Extract visual tempograms from the video.

Analyze visual and audio tempograms to find a match.
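The three steps above can be sketched in a much-reduced form: estimate one global tempo from an onset or motion envelope by autocorrelation, then pick the audio track whose tempo is closest. This is only an illustration under assumptions. A real tempogram is time-varying, and the function names, the 60 to 180 BPM search range, and the track names are hypothetical.

```python
def estimate_bpm(envelope, frame_rate, min_bpm=60, max_bpm=180):
    """Estimate a single tempo (beats per minute) from an activity envelope.

    The beat period is taken as the lag, within the allowed tempo range,
    that maximizes the autocorrelation of the mean-centered envelope.
    """
    mean = sum(envelope) / len(envelope)
    centered = [x - mean for x in envelope]

    def autocorr(lag):
        return sum(a * b for a, b in zip(centered, centered[lag:]))

    min_lag = max(1, round(frame_rate * 60 / max_bpm))
    max_lag = min(len(envelope) - 1, round(frame_rate * 60 / min_bpm))
    best = max(range(min_lag, max_lag + 1), key=autocorr)
    return 60.0 * frame_rate / best


def closest_track(video_bpm, track_bpms):
    """Return the name of the track whose tempo is nearest to video_bpm."""
    return min(track_bpms, key=lambda name: abs(track_bpms[name] - video_bpm))


# Synthetic visual envelope: one movement accent every 15 frames at 30 fps.
visual_env = [1.0 if i % 15 == 0 else 0.0 for i in range(300)]
video_bpm = estimate_bpm(visual_env, frame_rate=30)

# Hypothetical candidate tracks with already-estimated tempos.
tracks = {"track_a": 90.0, "track_b": 124.0, "track_c": 150.0}
match = closest_track(video_bpm, tracks)
```

The same `estimate_bpm` function could be applied to an audio onset envelope, so that both modalities are compared on a common tempo scale.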

We also incorporated a feature that enables video speed adjustment.
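A minimal sketch of how a speed adjustment could be derived from two tempo estimates is shown below. The function name `speed_factor`, the half-time and double-time candidates, and the `max_change` limit are assumptions chosen for illustration, not details of the actual feature.

```python
def speed_factor(video_bpm, music_bpm, max_change=0.25):
    """Playback-speed multiplier that makes the movement tempo match the music.

    Half-time and double-time matches are allowed, and the candidate that
    changes the video speed the least is chosen. Returns None when even the
    best candidate would change the speed by more than max_change.
    """
    candidates = [music_bpm * m / video_bpm for m in (0.5, 1.0, 2.0)]
    factor = min(candidates, key=lambda f: abs(f - 1.0))
    if abs(factor - 1.0) > max_change:
        return None
    return factor


slightly_faster = speed_factor(100, 110)  # speed the video up by about 10%
half_time = speed_factor(60, 126)         # dancer moves on every other beat
too_far = speed_factor(100, 140)          # no candidate within the limit
```

Capping the change keeps the adjusted video looking natural. When no candidate fits, recommending a different track may be a better option than stretching the video.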

This case serves as an example of our approach, which involves leveraging the latest scientific research for AI software development.
