Cult’s energy meter, launched in 2020, has played a major role in delivering an engaging experience to users. It uses deep learning and computer vision to calculate metrics such as energy score and rep counts, and to monitor poses so that users get meaningful insights and feedback.

Identifying human body joints such as shoulders, hips, and ankles is a challenging problem for many applications. In this blog, we explain the nuts and bolts of our pose estimation pipeline and highlight some of the other ways it’s being used to make workouts interactive and fun.

Pose Estimation using an in-house trained model:

Pose estimation

To generate predictions in real time, we decided to use an Embedded ML System Architecture. This means that we embed the machine learning model as a dependency of our application. We embed the model in a mobile device using the TensorFlow Lite framework, which supports both Android and iOS devices.

Initially, we explored some highly accurate state-of-the-art pose estimation models in the literature, some of them being CPM (Convolutional Pose Machines), HRNet (Deep High-Resolution Representation Learning for Human Pose Estimation), etc. But, there were two main issues with these models:

  • Model Size: These are heavy models that would add hundreds of MBs or more to the app size if deployed in the app. ML models generally achieve better results as model size increases, but larger models are increasingly challenging to integrate on edge devices.
  • Latency: Heavier models take longer to process a single input image. For these models, the inference time was too high to build any helpful real-time feature on top of pose estimation.

Keeping in mind the end goal of running the model on edge devices and the web, we decided to use a model that is both lightweight and accurate. We use a CenterNet-style architecture: a UNet with MobileNet, a family of mobile-first computer vision models, as the backbone (that is, the feature extractor).


Some noteworthy features of the UNet model are:

  • Excellent loss flow during parameter optimization
  • Heatmap Size: The UNet model produces bigger heatmaps (64 x 64, in contrast to 32 x 32 in the CPM model), which keeps accuracy relatively high for complex poses

Training specifications

The input to the model is a 256 x 256 image, and the output is a 64 x 64 x 30 heatmap head (since we detect 30 body keypoints). Our ground truth consists of images with corresponding body part annotations (points); for better loss propagation, we convert these key points in an image to ground truth heatmaps. We do so by splatting all ground truth key points onto a heatmap using a Gaussian kernel.
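The splatting step described above can be sketched as follows. This is a minimal illustration in plain Python, not our production code: the function names and the value of `sigma` are assumptions, and keypoints are assumed to be given in pixels of the 256 x 256 input image.

```python
import math

def gaussian_heatmap(cx, cy, size=64, sigma=2.0):
    """Return a size x size heatmap (list of rows) with a Gaussian peak at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]

def keypoints_to_heatmaps(keypoints, input_size=256, heatmap_size=64, sigma=2.0):
    """Build one ground truth heatmap per keypoint given in input-image pixels."""
    scale = heatmap_size / input_size  # 64 / 256 = 0.25
    return [gaussian_heatmap(x * scale, y * scale, heatmap_size, sigma)
            for x, y in keypoints]

# Two example keypoints (hypothetical values); a full sample would have 30.
heatmaps = keypoints_to_heatmaps([(128, 64), (200, 220)])
```

Each heatmap is indexed as `heatmap[row][column]`, so the first keypoint above peaks at row 16, column 32. A soft Gaussian target gives the loss a useful signal for predictions that land near the annotated point, not only exactly on it.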

Fig: Training sample

Training time optimizations:

  1. Mixed Precision: An FP16 representation of model parameter values ensures a faster training time and uses less memory as compared to an FP32 representation without a significant reduction in integration metrics.
  2. Data Caching: Images are held in an in-memory buffer instead of being loaded from their file path at training time. A memory lookup is faster than a fresh load from the path.

Model variants

 With the basic structure mentioned above, we train three variations of the deep-learning model:

  1. SMALL: This model is extremely lightweight, fast, and accurate for the task of pose estimation
  2. MEDIUM: With a moderate number of parameters, this model gives lower latency compared to the LARGE model and higher accuracy compared to the SMALL model 
  3. LARGE: This model has a UNet architecture similar to the one described above but with a higher number of parameters, which results in better accuracy at the cost of higher latency

Loss function & Optimizer

Once the heatmaps are predicted, we use an MSE (Mean Squared Error) loss function and an Adam optimizer with an epoch-based step learning-rate scheduler to minimize the loss.
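To make the two ingredients concrete, here is a small sketch of an MSE loss over heatmap values and an epoch-based step schedule. The values of `base_lr`, `step_size`, and `gamma` are illustrative assumptions, not our training configuration, and the Adam update itself is left to the training framework.

```python
def mse_loss(pred, target):
    """Mean squared error over two equally sized flat lists of heatmap values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def step_lr(epoch, base_lr=1e-3, step_size=30, gamma=0.1):
    """Learning rate that is multiplied by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)
```

With these assumed defaults, the learning rate stays at `base_lr` for epochs 0 to 29 and then drops by a factor of ten for the next 30 epochs.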


Publicly available datasets like COCO, MPII Human Pose, LSP, Halpe don’t contain the complex poses our users do. Hence we also have an internal manually-tagged pose dataset that contains complex yoga, dance, and workout poses. We train our LARGE model with these images and use it to tag more images that have complex poses. This allows us to create a dataset of 100k complex pose images by manually tagging just 15% of them.

Image augmentation

We also use some image augmentation techniques to increase the accuracy of our model. A few of them worth mentioning are: 

1. Random Occluding Rectangles: 

When people move, not all parts of their body are visible. For example, when doing a crunch, your head is hidden from the camera. We want our model to still correctly guess where the head should be. We do this by randomly hiding body parts in images with colored rectangles.
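A minimal sketch of this augmentation is shown below. The function name, the `max_frac` parameter, and the image representation (a list of rows of RGB tuples) are assumptions made for illustration. The keypoint annotations are deliberately left unchanged, so the model is still asked to predict the hidden joints.

```python
import random

def occlude_random_rectangle(image, max_frac=0.3, rng=random):
    """Return a copy of `image` (H x W list of (r, g, b) tuples) with one
    randomly placed, randomly colored rectangle painted over it."""
    h, w = len(image), len(image[0])
    # Rectangle size is at most max_frac of each image dimension.
    rect_h = rng.randint(1, max(1, int(h * max_frac)))
    rect_w = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - rect_h)
    left = rng.randint(0, w - rect_w)
    color = tuple(rng.randint(0, 255) for _ in range(3))
    out = [list(row) for row in image]  # copy so the input is not modified
    for y in range(top, top + rect_h):
        for x in range(left, left + rect_w):
            out[y][x] = color
    return out
```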

2. Random Brightness Contrast:

People have varying lighting conditions and cameras on their mobiles. This augmentation makes the model more robust to different lighting conditions.

3. Random Rotate90

Many poses have people lying down on the ground (yoga, crunches, etc) but our dataset contained very few images like these. So we added a 90-degree rotation augmentation to all images.
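When an image is rotated, its keypoint annotations have to be rotated with it. The sketch below shows one 90-degree clockwise rotation on an image stored as a list of rows; the function name and data layout are assumptions for illustration, and keypoints are `(x, y)` pixel coordinates.

```python
def rotate90_clockwise(image, keypoints):
    """Rotate an H x W image (list of rows) 90 degrees clockwise and
    move the (x, y) keypoints to their new positions."""
    h = len(image)
    # Reversing the rows and transposing gives a clockwise rotation.
    rotated = [list(row) for row in zip(*image[::-1])]
    # A pixel at column x, row y ends up at column h - 1 - y, row x.
    new_keypoints = [(h - 1 - y, x) for x, y in keypoints]
    return rotated, new_keypoints
```

Applying the function one, two, or three times covers all the 90-degree rotations of an image.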

4. Random Horizontal flipping:

The Training pipeline can be visualized here:


The way we infer keypoints from this model is similar to most CenterNet architectures. At inference time, the peak (the argmax) of each of the 30 predicted heatmaps gives the keypoint of the corresponding body part. When using the TFLite model, we predict the argmax directly.
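The decoding step can be sketched in plain Python as below. The function names are assumptions, and returning the peak value as a confidence score is an illustrative choice rather than something the pipeline above specifies.

```python
def argmax_2d(heatmap):
    """Return (x, y, value) of the highest cell in a 2D heatmap (list of rows)."""
    best_x, best_y, best_val = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_x, best_y, best_val = x, y, value
    return best_x, best_y, best_val

def decode_keypoints(heatmaps, input_size=256):
    """Turn one heatmap per body part into (x, y, score) in input-image pixels."""
    keypoints = []
    for heatmap in heatmaps:
        x, y, score = argmax_2d(heatmap)
        scale = input_size / len(heatmap)  # 256 / 64 = 4 for our heatmap size
        keypoints.append((x * scale, y * scale, score))
    return keypoints
```

Because only the single highest peak is kept per heatmap, this decoding yields exactly one pose per frame.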

Why Argmax? Our use case is to estimate the pose of a single person doing a workout in front of their mobile phone. Hence, we want to ignore all other noise (such as people) in the background. We are also working on building the capability for multi-person keypoint estimation in our system.

Some salient features of our system are:

1. Cropping

During inference, we use the previous frames’ predicted keypoints to construct a bounding box which helps crop a region in the current frame input. So, rather than using the entire noisy input image, we use a cropped region for keypoint estimation, thereby increasing the accuracy of our pose-estimation model. This also helps in our use case of ignoring people in the background.
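One way to build such a crop box is sketched here. The function name and the `margin` padding are assumptions for illustration; the padding is there so that the person does not leave the crop between consecutive frames.

```python
def crop_box_from_keypoints(keypoints, frame_w, frame_h, margin=0.2):
    """Return (left, top, right, bottom) around the previous frame's
    (x, y) keypoints, padded by `margin` and clipped to the frame."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    pad_x = (x_max - x_min) * margin
    pad_y = (y_max - y_min) * margin
    left = max(0, int(x_min - pad_x))
    top = max(0, int(y_min - pad_y))
    right = min(frame_w, int(x_max + pad_x) + 1)   # exclusive bound
    bottom = min(frame_h, int(y_max + pad_y) + 1)  # exclusive bound
    return left, top, right, bottom
```

The cropped region is then resized to the model input size, and the predicted keypoints are mapped back to full-frame coordinates using the box offset and scale.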

2. Low Pass Filter

a) Jitter Problem: The predicted keypoints are constantly fluctuating/vibrating about a specific location. This has a few critical consequences:

  • It degrades the performance of the algorithm, lowering the reliability of any feature built on this pose estimation
  • It also leads to a high probability of getting false positive data points

These jitters are high-frequency vibrations, whereas normal human movements are low-frequency. To address this issue, we use a Low Pass Filter that attenuates the high-frequency jitter, as can be seen in the gif below:
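As an illustration of the idea, the sketch below uses an exponential moving average, which is one of the simplest low pass filters. This is an assumption chosen for clarity, not necessarily the filter used in our system, and the class name and `alpha` value are hypothetical.

```python
class KeypointLowPassFilter:
    """Exponential moving average over per-frame (x, y) keypoints."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha  # lower alpha = stronger smoothing, more lag
        self.state = None   # last smoothed keypoints

    def update(self, keypoints):
        if self.state is None:
            # First frame: nothing to smooth against yet.
            self.state = list(keypoints)
        else:
            a = self.alpha
            self.state = [
                (a * x + (1 - a) * px, a * y + (1 - a) * py)
                for (x, y), (px, py) in zip(keypoints, self.state)
            ]
        return self.state
```

The trade-off is between smoothness and responsiveness: a small `alpha` removes more jitter but makes the keypoints trail behind fast movements.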

The inference pipeline can be visualized here:

Model evaluation:

To evaluate the trained model, we use the normalized Euclidean distance between ground truth and predicted keypoints. The comparison of average Euclidean distances for different variants of the model is shown in the table above. The metric reported is calculated on a sample validation dataset.
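A sketch of such a metric is given below. The normalization length is an assumption: common choices are the size of the person's bounding box or of the torso, and `norm_length` stands in for whichever reference is used.

```python
import math

def mean_normalized_distance(pred, truth, norm_length):
    """Average Euclidean distance between predicted and ground truth
    (x, y) keypoints, divided by a reference length."""
    distances = [
        math.hypot(px - tx, py - ty) / norm_length
        for (px, py), (tx, ty) in zip(pred, truth)
    ]
    return sum(distances) / len(distances)
```

Normalizing makes the error comparable across images in which the person appears at different sizes. Lower values are better.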


  1. Rep counting

One of the most critical areas where we use this pose estimation tech is rep counting in Cult gyms. We developed a rule-based approach that counts reps from the keypoints of the body parts involved in a workout. We define states within each exercise of a workout and, based on the keypoint output, check whether a particular state has been reached. Currently, this feature is being tested internally in two different modes:

  • Inside app (demo link)
  • A mirror device inside the gym (demo link).
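
To illustrate the state-based idea, here is a hypothetical squat counter. The exercise, the two states, and the angle thresholds are all assumptions made for this sketch and are not our actual rules.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b-a and b-c."""
    angle = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    angle = abs(angle)
    return 360 - angle if angle > 180 else angle

class SquatCounter:
    """Counts one rep for each 'up' -> 'down' -> 'up' cycle of the knee angle."""

    def __init__(self, down_angle=100.0, up_angle=160.0):
        self.down_angle = down_angle  # below this the person is squatting
        self.up_angle = up_angle      # above this the person is standing
        self.state = "up"
        self.reps = 0

    def update(self, hip, knee, ankle):
        angle = joint_angle(hip, knee, ankle)
        if self.state == "up" and angle < self.down_angle:
            self.state = "down"
        elif self.state == "down" and angle > self.up_angle:
            self.state = "up"
            self.reps += 1  # a full cycle has been completed
        return self.reps
```

Using two separate thresholds rather than one means that small fluctuations of the knee angle around a single cutoff do not produce extra reps.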

  2. Energy Meter

We’ve built an Energy score to reward users for their movements using the pose. The higher the velocity of their keypoints compared to the previous frame, the higher their score. Users can see the energy score in real time on the UI shown below:
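The velocity idea can be sketched as the mean keypoint speed between two consecutive frames. The function name and the use of a plain average are assumptions; how the raw speed is scaled into the score shown to the user is not covered here.

```python
import math

def frame_energy(prev_keypoints, curr_keypoints, dt):
    """Mean keypoint speed (pixels per second) between two consecutive
    frames of (x, y) keypoints that are `dt` seconds apart."""
    total = sum(
        math.hypot(cx - px, cy - py)
        for (px, py), (cx, cy) in zip(prev_keypoints, curr_keypoints)
    )
    return total / (len(curr_keypoints) * dt)
```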

Looking Ahead

The ideas are simple and well known in academia; the challenges arise when we embody these ideas in a production application. Building a very lightweight, accurate model that can be integrated into the app allows a developer to experiment with a vast number of features and applications that can be built on top of pose estimation. We are also planning to use this tech for gym analytics and AI-based smart workouts.

This is the first blog of a recent blog series on computer vision at This series will cover how computer vision is powering interactive workouts in

Credits - Aakash, Ayushi & Apurva

Feb 19, 2022
