Human Pose Estimation: A Deep Learning Approach

Jun 12, 2021

The rapid growth of digital data has driven a wide range of Deep Learning applications, such as image classification and object detection. Human Pose Estimation is a related problem: detecting the key points (major body parts and joints) of a human body in an image. In this work we treat it as two tasks: classifying the activity performed by the person in the image, and localizing the body joints using regression. The dataset used is the MPII Human Pose Dataset, which contains nearly 25k images covering multiple categories of human activities.

Dataset

For Human Pose Estimation, we chose the MPII Human Pose Dataset, one of the most popular of the available datasets, because it is large, diverse, and publicly available. It consists of about 25k images with an annotation file that contains the head position and the locations of the joints along with their ids. There are 16 joints (ids 0–15): the left and right ankles, knees, hips, wrists, elbows, and shoulders, plus the pelvis, thorax, upper neck, and head top. Overall, the dataset contains 20 categories and 397 activities. Figure 1 shows the original distribution of classes.

Preprocessing

The MPII Human Pose Dataset contains images extracted from YouTube videos, so the images have varying dimensions. We resize every image to a shape of (244x244x3). We also use data augmentation utilities to resize and rescale the images before feeding them into the Deep Learning models. The batch size used for data augmentation was 30.
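The resize, rescale, and batching steps can be sketched without any libraries as follows. This is only an illustration, not the code from the project: the helper names (`resize_nearest`, `rescale`, `batches`) are hypothetical, and nearest-neighbour sampling and division by 255 are assumptions. A real pipeline would normally rely on an image library or a framework's data generator.

```python
def resize_nearest(image, out_h, out_w):
    """Resize an H x W x C image (nested lists) with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [list(image[r * in_h // out_h][c * in_w // out_w]) for c in range(out_w)]
        for r in range(out_h)
    ]


def rescale(image, max_value=255.0):
    """Scale every channel value from [0, max_value] to [0, 1] (assumed range)."""
    return [[[value / max_value for value in pixel] for pixel in row] for row in image]


def batches(items, batch_size=30):
    """Yield consecutive batches; 30 is the batch size mentioned in the text."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Tiny 2 x 4 RGB image standing in for a video frame (hypothetical data).
frame = [[[255, 0, 0]] * 4, [[0, 255, 0]] * 4]
resized = rescale(resize_nearest(frame, 4, 4))

# 65 placeholder items split into batches of 30.
batch_sizes = [len(batch) for batch in batches(list(range(65)))]
```

For the target shape used here, the call would be `resize_nearest(frame, 244, 244)`; the 4x4 output above just keeps the example small.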

Issues during preprocessing -

Low computational power: Preprocessing an image dataset and training models on it both require substantial compute, and the lack of such resources slows down data preparation and training.
Huge amount of data: The MPII dataset is large and diverse, so preprocessing it with limited resources is difficult and time-consuming, which becomes a hurdle during model preparation.

Solutions to these issues -

Reduce the dataset: With limited computational power or a very large dataset, one solution is to work with a subset chosen so that no important information is lost and the data remains diverse enough for training.
Increase the computational power: This costs more but is efficient.
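One way to reduce the dataset while keeping it diverse is to sample the same fraction from every activity category. The sketch below assumes the annotations are available as (image id, category) pairs; the function name `stratified_subset` and the toy categories are hypothetical and not taken from our code.

```python
import random
from collections import defaultdict


def stratified_subset(samples, fraction, seed=0):
    """Keep roughly `fraction` of the samples from every category.

    `samples` is a list of (image_id, category) pairs; at least one sample
    per category is kept so no class disappears from the reduced set.
    """
    by_category = defaultdict(list)
    for image_id, category in samples:
        by_category[category].append(image_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for category, ids in sorted(by_category.items()):
        keep = min(len(ids), max(1, round(len(ids) * fraction)))
        subset.extend((image_id, category) for image_id in rng.sample(ids, keep))
    return subset


# Hypothetical toy annotations: 100 "running", 40 "dancing", 3 "fishing" images.
toy = (
    [(f"run_{i}", "running") for i in range(100)]
    + [(f"dance_{i}", "dancing") for i in range(40)]
    + [(f"fish_{i}", "fishing") for i in range(3)]
)
reduced = stratified_subset(toy, fraction=0.2)
```

Sampling per category rather than over the whole list keeps rare activities represented, which a plain random sample might drop.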

Models for Classification problem

The Deep Learning models we used to classify human poses included CNNs, VGG16, VGG19, ResNet50, and Xception.

Following is the architecture of the CNN model we proposed -

The proposed CNN model reduces the spatial size of the image after each pooling layer. After the final pooling layer, the model flattens the result with flatten() so that fully connected (dense) layers can be applied. The model is compiled with the 'adam' optimizer, categorical cross-entropy loss, and accuracy as the metric. The following GIF gives a glimpse of how the CNN model works.
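To make the size reduction concrete, here is a small arithmetic sketch of how the feature-map side length shrinks with each pooling layer. The number of pooling layers (4), the 2x2 pool size, and the 128 output channels are assumptions chosen for illustration; they are not read off the proposed architecture.

```python
def spatial_sizes(input_size=244, num_pools=4, pool=2):
    """Track the feature-map side length after each pooling layer.

    Assumes convolutions that keep the size unchanged and non-overlapping
    pool x pool pooling, which rounds down when the size is odd.
    """
    sizes = [input_size]
    for _ in range(num_pools):
        sizes.append(sizes[-1] // pool)
    return sizes


def flattened_length(side, channels):
    """Number of values handed to the dense layers after flatten()."""
    return side * side * channels


sizes = spatial_sizes()  # [244, 122, 61, 30, 15]
features = flattened_length(sizes[-1], channels=128)  # 15 * 15 * 128 = 28800
```

The flattened length is what determines the number of weights in the first dense layer, so each extra pooling layer cuts that count by roughly a factor of four.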

Another model we used was Xception. It was not a popular choice in the papers we read, but we found that it outperformed the other models we tried. Xception is a robust deep convolutional neural network architecture developed by researchers at Google.

The following results from our model show a correct classification and a misclassification.

Models for Regression Problem

Regression was performed using the same CNN architecture described above, with the softmax activation at the final dense layer replaced by a linear activation. A popular baseline, ResNet50, was also used for this problem. The GitHub link for the code is given below for reference.
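The reason for swapping the activation can be seen in a few lines. The sketch below is a plain-Python illustration with made-up numbers: softmax squeezes the outputs into probabilities that sum to 1, which suits class labels, while a linear (identity) activation leaves the values unbounded so they can represent pixel coordinates.

```python
import math


def softmax(logits):
    """Turn raw scores into class probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def linear(outputs):
    """Identity activation: raw outputs are used directly as coordinates."""
    return list(outputs)


raw = [120.5, 88.0, 131.25, 140.0]  # hypothetical (x, y) values for two joints

probabilities = softmax(raw)  # bounded in [0, 1] and sums to 1
coordinates = linear(raw)     # unchanged, so it can hold pixel positions
```

With 16 joints, a regression head of this kind would presumably have 32 outputs (one x and one y per joint); that count is an assumption, not a detail stated above.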

The following image shows the comparison between the actual joints and the joints predicted by our model.

Results

For activity classification, we measure results using accuracy as the metric on various subsets of the data. The Xception model achieved relatively high accuracy on all three sets.

For joint detection, the proposed CNN performed competitively with ResNet50, which scored higher on both metrics. Both models use mean absolute error as the loss function, so training minimizes the absolute difference between the original and predicted joint coordinates. The proposed CNN achieves 82.71% PCP and 88.22% PCKh, while ResNet50 achieves 86.36% PCP and 88.84% PCKh.
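As a minimal sketch of the loss, mean absolute error over joint coordinates can be written as below. The coordinates are hypothetical, and the function is a plain-Python stand-in for whatever the framework computes internally.

```python
def mean_absolute_error(true_joints, predicted_joints):
    """Average |true - predicted| over every x and y coordinate."""
    diffs = [
        abs(t - p)
        for true_xy, pred_xy in zip(true_joints, predicted_joints)
        for t, p in zip(true_xy, pred_xy)
    ]
    return sum(diffs) / len(diffs)


# Hypothetical pixel coordinates for three joints.
actual = [(100.0, 50.0), (120.0, 80.0), (90.0, 140.0)]
predicted = [(104.0, 50.0), (118.0, 86.0), (90.0, 137.0)]

mae = mean_absolute_error(actual, predicted)  # (4 + 0 + 2 + 6 + 0 + 3) / 6 = 2.5
```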

Two metrics were used to evaluate the regression results, PCP and PCKh -

PCP : stands for Percentage of Correct Parts. A limb is considered detected if the distance between each of its two predicted joint locations and the corresponding true joint location is less than half of the limb length.
PCKh : stands for Percentage of Correct Keypoints with a head-based threshold. A detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold; for PCKh, the threshold is 50% of the head bone link.
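Both metrics can be sketched for a single person as follows. This is an illustration under assumptions rather than the evaluation code we ran: joints are given as (x, y) tuples, the limb list and head length are hypothetical toy values, and the joint indices are not the MPII joint ids.

```python
import math


def pckh(true_joints, pred_joints, head_length, alpha=0.5):
    """Fraction of joints whose error is within alpha * head bone link length."""
    threshold = alpha * head_length
    correct = sum(
        math.dist(t, p) <= threshold for t, p in zip(true_joints, pred_joints)
    )
    return correct / len(true_joints)


def pcp(true_joints, pred_joints, limbs):
    """Fraction of limbs whose two end joints are both predicted closer than
    half the true limb length to their ground-truth positions."""
    correct = 0
    for a, b in limbs:
        half_length = 0.5 * math.dist(true_joints[a], true_joints[b])
        error_a = math.dist(true_joints[a], pred_joints[a])
        error_b = math.dist(true_joints[b], pred_joints[b])
        if error_a < half_length and error_b < half_length:
            correct += 1
    return correct / len(limbs)


# Hypothetical skeleton with four joints and three limbs of length 10.
truth = [(0, 0), (0, 10), (0, 20), (10, 20)]
guess = [(1, 0), (0, 13), (0, 26), (10, 21)]  # joint errors: 1, 3, 6, 1
toy_limbs = [(0, 1), (1, 2), (2, 3)]

pckh_score = pckh(truth, guess, head_length=8)  # threshold 4, 3 of 4 joints pass
pcp_score = pcp(truth, guess, toy_limbs)        # only the first limb passes
```

Note how one badly placed joint (the third, with an error of 6) fails two limbs under PCP but only one keypoint under PCKh, which is one reason the two scores can differ for the same model.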

Conclusion

To address the human activity classification problem, seven different models were applied and compared. For the joint localization problem, two models were used and compared; the regression results could be improved by further minimizing the loss. We also applied the Xception model in this work and found that it gives outstanding classification accuracy, outperforming the other baselines.
Due to a lack of resources, we were not able to evaluate the proposed models on the whole dataset: it consists of 25k images and is about 13 GB, while Colab's RAM is around 12 GB, so it is not feasible to load all the data in this environment.
Future work is to run the proposed models on the whole dataset and adjust their architectures to maximize accuracy. One could also explore other emerging Deep Learning models, such as GANs, for classification and for joint detection using regression.
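A rough back-of-the-envelope calculation shows why memory becomes the bottleneck. It assumes all 25k images are resized to 244x244x3 and held in one dense array at once; that is an assumption for illustration and is separate from the 13 GB figure for the raw data.

```python
def dataset_gigabytes(num_images, height, width, channels, bytes_per_value):
    """In-memory size of a dense image array, in decimal gigabytes (1e9 bytes)."""
    return num_images * height * width * channels * bytes_per_value / 1e9


uint8_gb = dataset_gigabytes(25_000, 244, 244, 3, 1)    # about 4.47 GB
float32_gb = dataset_gigabytes(25_000, 244, 244, 3, 4)  # about 17.86 GB
```

Under these assumptions, the resized images would fit in 12 GB as 8-bit integers but not once they are converted to 32-bit floats, which is the usual type after rescaling.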

For the code, please follow the GitHub repo link.

Code Implementation : Deepankar Kansal, Palak Tiwari