VibeSync App
Serokell's AI experts researched and developed an innovative ML-based application that synchronizes users' dance videos with music using human pose estimation.
Our goal was not only to eliminate the delay between the video and audio recordings but also to adjust the movement speed to match the beat of the music, or even to recommend appropriate music for the user's dance video.
The task was to investigate the existing Python frameworks, whitepapers, and libraries, and provide an effective tool for creating appealing musical clips.
We sourced data from multiple publicly available databases. Through rigorous testing, we chose the best ML model and refined it until we arrived at an MVP solution.
Our solution
Through research, we discovered a method for extracting visual tempograms from video and found relevant open-source datasets to help us kickstart the project.
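A visual tempogram captures the periodicity of movement over time. As an illustrative sketch only (the function names, window sizes, and exact method here are assumptions, not the project's implementation), one can reduce each video to a 1-D motion signal via frame differencing and then take a short-time autocorrelation of that signal:

```python
import numpy as np

def motion_novelty(frames):
    """Mean absolute frame-to-frame pixel difference: a 1-D motion signal.

    `frames` is an array of shape (T, H, W) holding grayscale frames.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)

def visual_tempogram(novelty, win=64, hop=16):
    """Short-time autocorrelation of the motion signal.

    Each column is the autocorrelation of one window; a peak at lag k
    indicates movement that repeats every k frames (a visual tempo).
    """
    cols = []
    for start in range(0, len(novelty) - win + 1, hop):
        w = novelty[start:start + win]
        w = w - w.mean()
        ac = np.correlate(w, w, mode="full")[win - 1:]
        cols.append(ac / (ac[0] + 1e-9))  # normalize by zero-lag energy
    return np.stack(cols, axis=1)         # shape: (win, n_windows)
```

Real systems typically compute the novelty curve from optical flow or pose trajectories rather than raw pixel differences, but the lag-domain structure of the tempogram is the same.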
Some audio choices were not aligned with the movement in the video, which led us to explore dance embeddings. They allowed us to incorporate more information from the video in our audio choices.
We used the MMPose library, which offers pre-trained models for human pose estimation. This enabled us to embed the dance in a machine-readable form and improve the syncing between the video and audio.
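MMPose models output per-frame keypoint arrays (e.g. 17 COCO body joints with x/y coordinates). A minimal, hypothetical way to turn such a sequence into a fixed-size dance embedding is to normalize the keypoints and summarize positions and velocities; this sketch is our own illustration, not the embedding actually used in the project:

```python
import numpy as np

def embed_dance(keypoints):
    """Turn a (T, J, 2) keypoint sequence into a fixed-size vector.

    Keypoints (e.g. 17 COCO joints per frame, as an MMPose model would
    produce) are centered on the per-frame body center and scale-normalized,
    then summarized by the mean and std of joint positions and velocities.
    """
    kp = keypoints.astype(np.float32)
    center = kp.mean(axis=1, keepdims=True)   # rough body center per frame
    kp = kp - center                          # translation invariance
    scale = np.abs(kp).max() + 1e-9
    kp = kp / scale                           # scale invariance
    vel = np.diff(kp, axis=0)                 # joint velocities between frames
    feats = [kp.mean(axis=0), kp.std(axis=0), vel.mean(axis=0), vel.std(axis=0)]
    return np.concatenate([f.ravel() for f in feats])
```

Because the embedding is invariant to where the dancer stands in the frame, clips of the same choreography map to nearby vectors, which is what makes embedding-based audio matching possible.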
Google's AIST++, a dataset that contains the key points of dancers' bodies, proved to be a viable basis for our dance embeddings.
Applying these research findings to a real-world case was the biggest challenge.
Our team consisted of ML engineers with both academic backgrounds and hands-on experience in video and audio processing and analysis.
Tech Stack
Python
OpenCV
PyTorch
Results
To complete this project, we conducted extensive research into AI theory, turned an academic whitepaper into a working machine learning model, and designed an MVP application.
We taught the model to:
Isolate the rhythm from the audio.
Extract visual tempograms from the video.
Analyze visual and audio tempograms to find a match.
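The steps above can be sketched in a toy form: estimate the dominant periodicity of both the visual and the audio rhythm signals via autocorrelation, then pick the track whose tempo is closest to the video's. The function names and the nearest-period criterion are illustrative assumptions, not the project's actual matching algorithm:

```python
import numpy as np

def dominant_period(signal, min_lag=2):
    """Lag (in frames) of the strongest autocorrelation peak."""
    x = signal - signal.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    return int(np.argmax(ac[min_lag:]) + min_lag)

def best_track(video_signal, audio_signals):
    """Index of the audio whose dominant period is closest to the video's.

    `video_signal` is a motion-novelty curve; each entry of `audio_signals`
    is an onset/rhythm curve for a candidate track, sampled at the same rate.
    """
    target = dominant_period(video_signal)
    periods = [dominant_period(a) for a in audio_signals]
    return int(np.argmin([abs(p - target) for p in periods]))
```

A production matcher would compare whole tempograms (and dance embeddings) rather than a single dominant period, and would treat integer tempo ratios such as double-time as matches too.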
We also incorporated a feature that enables video speed adjustment.
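Speed adjustment amounts to retiming the video by a factor, for instance the ratio of the audio tempo to the visual tempo. A minimal sketch, assuming simple frame dropping/duplication by floor indexing (real retiming would interpolate frames and keep the audio untouched):

```python
def retime_frames(n_frames, speed):
    """Source-frame index for each output frame when playing at `speed`.

    speed > 1 speeds the dance up (frames are dropped); speed < 1 slows
    it down (frames are duplicated). Floor indexing keeps it predictable.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    n_out = max(1, round(n_frames / speed))
    return [min(n_frames - 1, int(i * speed)) for i in range(n_out)]
```

For example, `retime_frames(10, 2)` keeps every second frame, halving the clip's duration so its movement lands on a beat twice as fast.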
This case exemplifies our approach of leveraging the latest scientific research for AI software development.