Learning-Based Human Character Animation Synthesis for Content Production
Type of position: PhD
Location: Rennes (Technicolor & Inria)
Pierre Hellier – Pierre.Hellier@technicolor.com
Francois Le Clerc – Francois.LeClerc@technicolor.com
Ludovic Hoyet – email@example.com
Introduction and context
Content production for film and advertising increasingly relies on computer-generated imagery to lower
costs and enhance creative possibilities. In particular, many of today’s movies and advertisements
feature synthetic human characters. The animation of the characters’ bodies is driven by the dynamics
of an underlying skeleton, built from the main joints of the human body. The skeleton is later fleshed
into a 3D mesh by a process known as skinning, whereby the displacement of each vertex of the mesh is
computed from the displacement of the neighbouring skeleton joints it is bound to. Accurately capturing
the naturalness of human motion in the dynamics of the skeleton is key to the perceptual plausibility of
the rendered animation.
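For intuition, the skinning step described above can be sketched with standard linear blend skinning, where each deformed vertex is a weighted average of the positions it would take under each bound joint's transform. This is a minimal NumPy sketch; the array shapes and function name are illustrative, not part of any production pipeline:

```python
import numpy as np

def linear_blend_skinning(rest_vertices, joint_transforms, weights):
    """Deform mesh vertices from skeleton joint motion (linear blend skinning).

    rest_vertices:    (V, 3) vertex positions in the rest pose
    joint_transforms: (J, 4, 4) transform of each joint relative to its rest pose
    weights:          (V, J) binding weights; each row sums to 1
    """
    V = rest_vertices.shape[0]
    # Homogeneous coordinates so the 4x4 transforms apply directly
    homo = np.hstack([rest_vertices, np.ones((V, 1))])            # (V, 4)
    # Candidate position of every vertex as moved by every joint: (J, V, 4)
    per_joint = np.einsum('jab,vb->jva', joint_transforms, homo)
    # Blend the candidate positions with the binding weights
    blended = np.einsum('vj,jva->va', weights, per_joint)         # (V, 4)
    return blended[:, :3]
```

A vertex bound entirely to one joint follows that joint rigidly; a vertex with split weights (e.g. near an elbow) lands between the positions the two joints would impose.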
Creating animations for photorealistic computer-generated movies is a demanding and complex part
of the film production workflow that requires a considerable amount of manual work. Keyframing and
motion capture are the two dominant techniques used in the industry today. Keyframing refers to a
purely manual editing process wherein artists draw the skeletons at selected temporal frames
(“keyframes”), and further define non-linear interpolation paths for joint locations in-between the
keyframes. Motion capture is performed in a green room with specialized hardware, using marker-based
setups that require significant involvement on the part of the actors, as well as manual post-processing to
incorporate artistic edits into the animations. In both cases, the amount of human intervention and
hence the production costs are very high. Thus, there is a strong business justification in the automation
of the non-creative parts of the animation process.
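To make the keyframing workflow concrete, the in-between interpolation it relies on can be sketched for a single joint channel. The sketch below uses cubic smoothstep easing as a simple stand-in for the spline curves an animator would actually author:

```python
def interpolate_joint(keyframes, t):
    """Evaluate one joint channel at time t from sparse artist-set keyframes.

    keyframes: list of (time, value) pairs, sorted by time.
    Uses cubic 'smoothstep' easing between neighbouring keys, which gives
    zero slope at each key (a placeholder for artist-authored curves).
    """
    times = [k[0] for k in keyframes]
    # Clamp outside the keyframed range
    if t <= times[0]:
        return keyframes[0][1]
    if t >= times[-1]:
        return keyframes[-1][1]
    # Find the pair of keyframes bracketing t and ease between them
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            u = (t - t0) / (t1 - t0)
            u = u * u * (3 - 2 * u)   # smoothstep easing
            return v0 + (v1 - v0) * u
```

An artist only sets the sparse (time, value) pairs; every frame in between is produced by the easing curve, which is exactly the part that is tedious to tune by hand for a full skeleton.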
Advances in machine learning and particularly deep learning in recent years have boosted the research
effort towards obtaining skeletal animations from the analysis of videos. The idea is to learn a mapping
between the image of a human character and the 2D or even 3D locations of the joints of the character
body. However, due in part to the difficulty of the problem and in part to the lack of 3D annotated
training data, the accuracy of joint location estimates is often poor, especially in the depth direction
that is not observable in the image. Besides, the estimated skeletons consist of only a few joints and
often fail to cover the hands and the feet.
The generation of animations from videos offers promising prospects for optimizing the animation
workflow in the content production industry. Still, a lot of work is needed to improve the resolution and
accuracy of the produced animations, and to adapt the technology to make it usable in an interactive
way by animation artists. Advancing towards these goals is the main purpose of the proposed PhD.
Existing techniques and limitations
The estimation of animation skeletons, a.k.a. human poses, in images and videos is an active research
area, dominated by supervised machine learning approaches that leverage databases of images
annotated with human joint locations. The initial target of 2D pose estimation [1] has now been
extended to 3D, see for instance [2, 3]. Inferring the depth components of the skeleton joints turns out
to be a challenging ill-posed problem. Even though various regularization strategies have been
proposed, the estimated joint locations are still quite noisy, especially in the depth direction orthogonal
to the plane of the observed image. This is also, to some extent, a consequence of the scarcity of 3D
skeleton annotations, which are difficult to generate in “in-the-wild” environments [4]. A further issue
with annotations, and as a result human pose estimates, is that they are limited to a small number of
body joints, excluding hands and feet. Overall, the accuracy and resolution of state-of-the-art
video-based animation techniques are still unsuitable for animating even secondary characters in photorealistic films.
In parallel to human pose estimation, some research effort has been devoted to the characterization of
human motion kinematics using learning-based approaches. The seminal work of Holden et al. leverages
an autoencoder framework to learn a “manifold” of human motion. It further proposes methods for
editing animations in this manifold and mapping the editing controls to human-understandable
high-level parameters. The learnt parameters of the encoder can be used to characterize the style of the
motion and perform style transfer on animations. This technique could be extended to learn a specific
motion model for a given character, perhaps based on initially produced animation sequences for this
character, and further improve the generation of subsequent animations for this same character based
on the learnt model.
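To make the manifold idea concrete, the sketch below uses a linear autoencoder (PCA) as a simplified stand-in for the convolutional autoencoder of Holden's work: motion frames lying near a low-dimensional subspace are encoded into manifold coordinates, and projecting a corrupted clip through the manifold suppresses its off-manifold component. All data here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic motion data: F frames x D joint coordinates, generated to lie
# near a K-dimensional subspace (the "manifold"), purely for illustration.
F, D, K = 120, 30, 4
latent = rng.normal(size=(F, K))          # hidden trajectories
basis = rng.normal(size=(K, D))           # mapping to full-pose space
clips = latent @ basis + 0.01 * rng.normal(size=(F, D))

# Linear autoencoder via PCA: the encoder projects a pose onto the top-K
# principal directions; the decoder maps manifold coordinates back to poses.
mean = clips.mean(axis=0)
_, _, vt = np.linalg.svd(clips - mean, full_matrices=False)

def encode(x):
    return (x - mean) @ vt[:K].T          # pose -> manifold coordinates

def decode(z):
    return z @ vt[:K] + mean              # manifold coordinates -> pose

# Denoising/editing in the manifold: projecting a corrupted clip through
# encode/decode keeps only its on-manifold component.
noisy = clips + 0.5 * rng.normal(size=clips.shape)
cleaned = decode(encode(noisy))
```

In Holden-style systems the encoder/decoder are learnt temporal convolutions rather than a fixed linear projection, and edits are applied to the manifold coordinates before decoding; the projection principle is the same.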
Directions for research
Directions of research are flexible within the proposed context, but will explore areas related to
improving animation quality for production use.
Requirements for candidacy
- Strong programming skills (Python recommended)
- Strong knowledge of machine learning
- Basic knowledge of computer animation and graphics
We are looking for motivated candidates. Please send a CV, a motivation letter, reference letters, and any
relevant material to firstname.lastname@example.org, email@example.com and firstname.lastname@example.org.
This PhD will be conducted in the context of a CIFRE collaboration between Technicolor and the
MimeTIC team (Inria Rennes). Technicolor is a leading company in the VFX world, combining its R&D
expertise in Computer Vision and Computer Graphics with the artistic expertise of its studios, such
as The Mill, Moving Picture Company, and Mikros Image. Inria is a leading French research centre in
Computer Sciences, where research activities in MimeTIC focus on simulating virtual humans that
behave in a natural manner and act with natural motions.
References
[1] A. Newell, K. Yang and J. Deng, “Stacked Hourglass Networks for Human Pose Estimation,” in European Conference on Computer Vision (ECCV), 2016.
[2] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas and C. Theobalt, “VNect: Real-Time 3D Human Pose Estimation with a Single RGB Camera,” ACM Transactions on Graphics, vol. 36, no. 4, pp. 44:1–44:14, 2017.
[3] B. Tekin, A. Rozantsev, V. Lepetit and P. Fua, “Direct Prediction of 3D Body Poses from Motion Compensated Sequences,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] X. Zhou, Q. Huang, X. Sun, X. Xue and Y. Wei, “Towards 3D Human Pose Estimation in the Wild: a Weakly-supervised Approach,” in IEEE International Conference on Computer Vision (ICCV), 2017.