Context and assets of the position
The PhD will be carried out at Inria, within the Willow research team.
Assignment
Short Overview of the PhD Project:
This PhD thesis aims to enhance the physical consistency of current video generation
models by exploring various techniques to inject physics awareness into them.
PhD Project Description:
The motivation for this PhD thesis is to address a critical limitation in current video
generation models: their lack of consistency with the laws of physics.
Although these models are increasingly adept at generating high-quality content that can almost perfectly match real-world scenes, their ability to model the underlying laws governing dynamic interactions remains limited [1,2,3,4,6]. Simple scenarios, such as object free fall, are sufficient to expose these limitations [3].
Improving these capabilities is a fundamental
step towards building more robust models that can function as true world simulators.
Proposed Research Directions:
Different approaches have been explored to overcome the aforementioned limitations. Some works integrate 3D geometry and dynamics awareness as critical elements for generating physically plausible videos [7]. Another interesting approach is model-based simulation guidance, where physics engine simulations are used as an intermediate step to guide the video generation process [4].
Furthermore, we consider post-training techniques to be particularly promising. In [3], the authors present a two-stage post-training pipeline consisting of self-supervised fine-tuning on high-quality data followed by an Object Reward Optimization (ORO) phase. In [5], a novel framework called VideoREPA is proposed, which distills physics understanding from video foundation models into text-to-video generation models by aligning token-level relations.
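As a rough illustration of what aligning token-level relations can mean in practice, the sketch below compares pairwise token similarity matrices computed from a frozen video foundation model and from intermediate features of a generator. This is a minimal, simplified PyTorch example; the function names and the exact formulation are illustrative assumptions and do not reproduce the actual VideoREPA objective from [5].

# Illustrative sketch only: a simplified token-relation alignment loss,
# not the exact objective used in VideoREPA [5].
import torch
import torch.nn.functional as F

def relation_matrix(tokens: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between tokens.

    tokens: (batch, num_tokens, dim) features, e.g. from a frozen video
    encoder or from an intermediate layer of the video generator.
    """
    tokens = F.normalize(tokens, dim=-1)
    return tokens @ tokens.transpose(1, 2)  # (batch, num_tokens, num_tokens)

def relational_alignment_loss(gen_tokens: torch.Tensor,
                              teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Penalise discrepancies between the generator's token relations and
    those of a frozen video foundation model (the 'teacher')."""
    r_gen = relation_matrix(gen_tokens)
    with torch.no_grad():
        r_teacher = relation_matrix(teacher_tokens)
    return F.mse_loss(r_gen, r_teacher)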
Building on this, a primary direction for our research is the use of reasoning-capable models, such as Large Language Models (LLMs) or Vision-Language Models (VLMs), to create physically grounded scene descriptions that can guide the video generation process. We hypothesize that this could be a direct way to transfer the reasoning capabilities of understanding models to generative ones. Different settings and formats for this guidance, from free-form text to more structured inputs, will be explored.
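To make "more structured inputs" concrete, here is a minimal sketch of what such guidance could look like: a hypothetical physically grounded scene description, as an LLM or VLM might produce, serialised into a conditioning prompt for a text-to-video model. The schema, field names, and prompt template are illustrative assumptions only; the actual guidance format is precisely what the thesis will investigate.

# Hypothetical example of a structured, physically grounded scene description
# turned into a conditioning prompt. The schema and prompt template are
# illustrative assumptions, not a fixed design.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PhysicalObject:
    name: str
    mass_kg: float
    initial_height_m: float
    initial_velocity_mps: float = 0.0

@dataclass
class SceneDescription:
    setting: str
    objects: List[PhysicalObject] = field(default_factory=list)
    expected_dynamics: str = ""

def to_prompt(scene: SceneDescription) -> str:
    """Serialise the structured description into text that can condition
    a text-to-video model."""
    lines = [f"Scene: {scene.setting}."]
    for obj in scene.objects:
        lines.append(
            f"{obj.name} (mass {obj.mass_kg} kg) starts {obj.initial_height_m} m "
            f"above the ground with initial speed {obj.initial_velocity_mps} m/s."
        )
    lines.append(f"Expected motion: {scene.expected_dynamics}")
    return " ".join(lines)

scene = SceneDescription(
    setting="an empty room with a wooden floor",
    objects=[PhysicalObject("a red ball", mass_kg=0.5, initial_height_m=2.0)],
    expected_dynamics="the ball falls freely, accelerating under gravity, then bounces.",
)
print(to_prompt(scene))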
Moreover, we aim to investigate post-training techniques based on physics-informed reward methods, such as those presented in [3]. Since that work focuses on the specific case of object free fall, a natural first step is to extend this approach to more complex and diverse physical scenarios.
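As a toy example of a physics-informed reward, the following sketch assumes that an object's vertical position can be tracked in each generated frame, and scores a generated free-fall clip by how well the tracked trajectory fits a constant-acceleration (gravity) model. It is a deliberate simplification for illustration and is not the ORO reward used in [3].

# Minimal sketch of a physics-informed reward for free fall, assuming the
# object's vertical position has been tracked in each generated frame.
# Illustrative simplification, not the ORO reward from [3].
import numpy as np

def freefall_reward(heights_m: np.ndarray, fps: float, g: float = 9.81) -> float:
    """Reward = negative RMS deviation between tracked heights and the
    best-fitting constant-acceleration trajectory with acceleration -g."""
    t = np.arange(len(heights_m)) / fps
    # Model: h(t) = h0 + v0 * t - 0.5 * g * t**2; fit h0 and v0 by least squares.
    residual = heights_m + 0.5 * g * t**2          # equals h0 + v0 * t under the model
    design = np.stack([np.ones_like(t), t], axis=1)
    (h0, v0), *_ = np.linalg.lstsq(design, residual, rcond=None)
    predicted = h0 + v0 * t - 0.5 * g * t**2
    rms_error = float(np.sqrt(np.mean((heights_m - predicted) ** 2)))
    return -rms_error

# Example: a perfectly physical drop from 2 m at 24 fps gets a reward close to 0.
t = np.arange(15) / 24.0
print(freefall_reward(2.0 - 0.5 * 9.81 * t**2, fps=24.0))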
During the PhD thesis, the initial research directions will be adapted based on the evolution
of the field and the insights obtained during experimentation.
Evaluation and Benchmarking:
Recent benchmarks such as VideoPhy-2 [1], Phy-World [2], and PISA [3] are valuable resources for measuring our contributions. However, a key part of this project will also involve identifying the limitations of current benchmarks. Consequently, designing novel tasks and evaluation strategies that better assess physical plausibility is an additional opportunity for contribution within this PhD project.
References:
[1] VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation. H. Bansal, C. Peng, Y. Bitton, R. Goldenberg, A. Grover, K. W. Chang.
[2] How Far is Video Generation from World Model: A Physical Law Perspective. B. Kang, Y. Yue, R. Lu, Z. Lin, Y. Zhao, K. Wang, G. Huang, J. Feng.
[3] PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop. C. Li, O. Michel, X. Pan, S. Liu, M. Roberts, S. Xie.
[4] PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation. S. Liu, Z. Ren, S. Gupta, S. Wang.
[5] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models. X. Zhang, J. Liao, S. Zhang, F. Meng, X. Wan, J. Yan, Y. Cheng.
[6] MotionCraft: Physics-based Zero-Shot Video Generation. L. S. Aira, A. Montanaro, E. Aiello, D. Valsesia, E. Magli.
[7] Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach. Y. Chen, J. Cao, A. Kag, V. Goel, S. Korolev, C. Jiang, S. Tulyakov, J. Ren.
Main activities:
Analyse and implement related work.
Design novel solutions.
Write progress reports and papers.
Present work at conferences.
Skills
Technical skills and level required: solid programming skills.
Languages: English required; French is a plus.
Interpersonal skills: good communication skills.
Benefits