Pose estimation has always fascinated me, as I see it as a cost-effective alternative to expensive motion capture equipment. This winter break, I decided to explore various pose estimation models as a fun project. My goal was to identify suitable models for future projects of this kind.
Finding the models
Finding good models took most of the time in this project, because I always prefer to write my programs in C. Consequently, it is difficult to use a PyTorch model directly.
To solve this, I generally prefer to export the model to ONNX, a format I am very fond of. But this isn't always easy. First, make sure the pre-processing and post-processing are not too hard to implement: multi-dimensional array math that is a trivial task in NumPy can be very difficult in C. Second, the model has to be exportable to ONNX. Over the years, PyTorch's export capability has improved tremendously as more and more operator translations have been implemented, but custom operations can still block the conversion. Finally, be nice and check the license: make sure it matches your use case, and remember to credit the authors.
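As an example of the kind of array math that is one line of NumPy but manual work in C, here is a sketch of a typical vision-model preprocessing step: converting an interleaved HWC uint8 image (the layout OpenCV uses) into a planar, normalized CHW float tensor. The mean/std values and channel order are assumptions; check your specific model's preprocessing before reusing this.

```c
#include <stdint.h>
#include <stddef.h>

/* Convert an HxWx3 uint8 image (HWC, interleaved) into a normalized
 * 3xHxW float tensor (CHW, planar), the layout most ONNX vision
 * models expect. mean/stdv are per-channel normalization constants.
 * This is a sketch: scale factors, channel order (BGR vs RGB), and
 * resizing/padding are model-specific and not handled here. */
static void hwc_to_chw(const uint8_t *src, float *dst,
                       size_t h, size_t w,
                       const float mean[3], const float stdv[3])
{
    for (size_t c = 0; c < 3; ++c)
        for (size_t y = 0; y < h; ++y)
            for (size_t x = 0; x < w; ++x)
                dst[c * h * w + y * w + x] =
                    ((float)src[(y * w + x) * 3 + c] - mean[c]) / stdv[c];
}
```

In NumPy this whole function is roughly `(img.transpose(2, 0, 1) - mean) / std`; in C you own the index arithmetic, and an off-by-one here silently corrupts the whole tensor.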
Writing the program
The main difference between Python and C is that you have to be much more precise about memory management. Do you want a deep copy? A smart pointer? What is the scope? These specifics will take most of your time, even if you already understand the ML models well from experimenting in Python. Do not attempt a C or C++ implementation if you can't figure out the entire pipeline in Python first; you would be wasting your time. Even though I have been doing this for years, this program still took me three days of work.
This program should be considered only a halfway point toward a finished product. To make it easy to use in a real project, consider packaging it as a dynamic-link library.
Program Summary
This project utilizes three ML models: YOLOv7 for object detection, RTMPose for 2D pose detection, and MotionBERT for 3D pose estimation.
YOLOv7 detects objects from the 80 MS COCO dataset classes. RTMPose detects the 2D joint points within each detected person. MotionBERT is a 2D-to-3D lift-style model, so its input is the detected 2D joint points; it is designed for video, so it needs a 10-frame buffer before it can run.
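The 10-frame requirement suggests a small sliding-window buffer between the 2D and 3D stages. Below is a minimal sketch of one, assuming a 17-joint skeleton with (x, y) coordinates; the joint count and window size are placeholders to match whatever 2D detector and lifter you actually use.

```c
#include <string.h>

#define WINDOW 10  /* frames the 3D lifter consumes at once */
#define JOINTS 17  /* placeholder joint count; match your 2D detector */

typedef struct {
    float frames[WINDOW][JOINTS][2]; /* (x, y) per joint, per frame */
    int count;                       /* frames currently stored */
} PoseBuffer;

/* Push the newest frame of 2D joints. Once the buffer is full, the
 * oldest frame is shifted out. Returns 1 when a full WINDOW of frames
 * is available and the lifter can be run, 0 otherwise. */
static int pose_buffer_push(PoseBuffer *b, float joints[JOINTS][2])
{
    if (b->count == WINDOW) {
        memmove(b->frames[0], b->frames[1],
                sizeof(b->frames[0]) * (WINDOW - 1));
        b->count = WINDOW - 1;
    }
    memcpy(b->frames[b->count], joints, sizeof(b->frames[0]));
    b->count++;
    return b->count == WINDOW;
}
```

In practice you would flatten `b->frames` into the lifter's ONNX input tensor whenever the push returns 1, giving a 3D pose on every frame after the initial 10-frame warm-up.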
Some precision is compromised here: 3D pose estimators are trained on the Human3.6M dataset, but few 2D joint detectors are. Most are trained using the COCO format, and the RTMPose model I use is trained on Halpe26. As a result, some joint positions have to be averaged from several nearby joint points.
In the GIF above, the red circles are the 2D joints and the purple ones are the 3D joints; purple joint number 7 (the belly) is averaged from the hips and shoulders.
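Synthesizing a missing joint like the belly comes down to averaging the coordinates of a few source joints. A minimal sketch, where the specific indices are hypothetical placeholders for whatever skeleton layout your detector emits:

```c
/* Compute a synthetic joint (e.g. a Human3.6M-style "belly") as the
 * mean of n source joints from a skeleton that lacks it. `joints` is
 * an array of (x, y) pairs; `idx` lists which source joints to
 * average. The index choices are skeleton-specific placeholders. */
static void mean_joint(float joints[][2], const int *idx, int n,
                       float out[2])
{
    out[0] = 0.0f;
    out[1] = 0.0f;
    for (int i = 0; i < n; ++i) {
        out[0] += joints[idx[i]][0];
        out[1] += joints[idx[i]][1];
    }
    out[0] /= (float)n;
    out[1] /= (float)n;
}
```

Averaging the two hips and two shoulders this way puts the synthetic joint at the center of the torso, which is a reasonable approximation but is still one source of the precision loss mentioned above.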
OpenCV is used to fetch frames, conduct some of the processing, and draw the results; ONNX Runtime is used to run the models.
The link to the project is here: https://github.com/JINSCOTT/Detection--2d-pose--3d-pose-lift
Epilogue
In the future, I hope to find the time to turn this program into a DLL and maybe adapt it to some real use, but I should really focus on professional development for now. Maybe I should delve into AI compilers and CUDA programming to come full circle in my skill set!