Context and assets of the position
The PhD will be carried out at Inria, within the Willow research team.
Assignment
Short Overview of the PhD Project:
This PhD thesis aims to enhance the physical consistency of current video generation
models by exploring various techniques to inject physics awareness into them.
PhD Project Description:
The motivation for this PhD thesis is to address a critical limitation in current video
generation models: their lack of consistency with the laws of physics.
Although these models are increasingly adept at generating high-quality content that can almost perfectly match real-world scenes, their ability to model the underlying laws governing dynamic interactions remains limited [1,2,3,4,6]. Simple scenarios, such as object free fall, are sufficient to expose these limitations [3].
Improving these capabilities is a fundamental
step towards building more robust models that can function as true world simulators.
Proposed Research Directions:
Different approaches have been explored to overcome the aforementioned limitations. Some works integrate 3D geometry and dynamics awareness as critical elements for generating physically plausible videos [7]. Another interesting approach is model-based simulation guidance, where physics engine simulations are used as an intermediate step to guide the video generation process [4].
Furthermore, we consider post-training techniques to be particularly promising. In [3], the authors present a two-stage post-training pipeline consisting of self-supervised fine-tuning on high-quality data followed by an Object Reward Optimization (ORO) phase. In [5], a novel framework called VideoREPA is proposed, which distills physics understanding from video foundation models into text-to-video generation models by aligning token-level relations.
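As a rough illustration of what aligning token-level relations can mean in practice, the sketch below compares pairwise token similarity matrices computed from a frozen video foundation model and from intermediate features of a generator. This is a minimal, simplified PyTorch example; the function names and the exact formulation are illustrative assumptions and do not reproduce the actual VideoREPA objective from [5].

# Illustrative sketch only: a simplified token-relation alignment loss,
# not the exact objective used in VideoREPA [5].
import torch
import torch.nn.functional as F

def relation_matrix(tokens: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between tokens.

    tokens: (batch, num_tokens, dim) features, e.g. from a frozen video
    encoder or from an intermediate layer of the video generator.
    """
    tokens = F.normalize(tokens, dim=-1)
    return tokens @ tokens.transpose(1, 2)  # (batch, num_tokens, num_tokens)

def relational_alignment_loss(gen_tokens: torch.Tensor,
                              teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Penalise discrepancies between the generator's token relations and
    those of a frozen video foundation model (the 'teacher')."""
    r_gen = relation_matrix(gen_tokens)
    with torch.no_grad():
        r_teacher = relation_matrix(teacher_tokens)
    return F.mse_loss(r_gen, r_teacher)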
Building on this, a primary direction for our research is the use of reasoning-capable models, such as Large Language Models (LLMs) or Vision-Language Models (VLMs), to create physically grounded scene descriptions that can guide the video generation process. We hypothesize that this could be a direct way to transfer the reasoning capabilities of understanding models to generative ones. Different settings and formats for this guidance, from free-form text to more structured inputs, will be explored.
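To make "more structured inputs" concrete, here is a minimal sketch of what such guidance could look like: a hypothetical physically grounded scene description, as an LLM or VLM might produce, serialised into a conditioning prompt for a text-to-video model. The schema, field names, and prompt template are illustrative assumptions only; the actual guidance format is precisely what the thesis will investigate.

# Hypothetical example of a structured, physically grounded scene description
# turned into a conditioning prompt. The schema and prompt template are
# illustrative assumptions, not a fixed design.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PhysicalObject:
    name: str
    mass_kg: float
    initial_height_m: float
    initial_velocity_mps: float = 0.0

@dataclass
class SceneDescription:
    setting: str
    objects: List[PhysicalObject] = field(default_factory=list)
    expected_dynamics: str = ""

def to_prompt(scene: SceneDescription) -> str:
    """Serialise the structured description into text that can condition
    a text-to-video model."""
    lines = [f"Scene: {scene.setting}."]
    for obj in scene.objects:
        lines.append(
            f"{obj.name} (mass {obj.mass_kg} kg) starts {obj.initial_height_m} m "
            f"above the ground with initial speed {obj.initial_velocity_mps} m/s."
        )
    lines.append(f"Expected motion: {scene.expected_dynamics}")
    return " ".join(lines)

scene = SceneDescription(
    setting="an empty room with a wooden floor",
    objects=[PhysicalObject("a red ball", mass_kg=0.5, initial_height_m=2.0)],
    expected_dynamics="the ball falls freely, accelerating under gravity, then bounces.",
)
print(to_prompt(scene))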
Moreover, we aim to investigate post-training techniques based on physics-informed reward methods, such as those presented in [3]. Since that work focuses on the specific case of object free fall, a natural first step is to extend this approach to more complex and diverse physical scenarios.
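As a toy example of a physics-informed reward, the following sketch assumes that an object's vertical position can be tracked in each generated frame, and scores a generated free-fall clip by how well the tracked trajectory fits a constant-acceleration (gravity) model. It is a deliberate simplification for illustration and is not the ORO reward used in [3].

# Minimal sketch of a physics-informed reward for free fall, assuming the
# object's vertical position has been tracked in each generated frame.
# Illustrative simplification, not the ORO reward from [3].
import numpy as np

def freefall_reward(heights_m: np.ndarray, fps: float, g: float = 9.81) -> float:
    """Reward = negative RMS deviation between tracked heights and the
    best-fitting constant-acceleration trajectory with acceleration -g."""
    t = np.arange(len(heights_m)) / fps
    # Model: h(t) = h0 + v0 * t - 0.5 * g * t**2; fit h0 and v0 by least squares.
    residual = heights_m + 0.5 * g * t**2          # equals h0 + v0 * t under the model
    design = np.stack([np.ones_like(t), t], axis=1)
    (h0, v0), *_ = np.linalg.lstsq(design, residual, rcond=None)
    predicted = h0 + v0 * t - 0.5 * g * t**2
    rms_error = float(np.sqrt(np.mean((heights_m - predicted) ** 2)))
    return -rms_error

# Example: a perfectly physical drop from 2 m at 24 fps gets a reward close to 0.
t = np.arange(15) / 24.0
print(freefall_reward(2.0 - 0.5 * 9.81 * t**2, fps=24.0))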
During the PhD thesis, the initial research directions will be adapted based on the evolution
of the field and the insights obtained during experimentation.
Evaluation and Benchmarking:
Recent benchmarks such as VideoPhy-2 [1], Phy-World [2], and PISA [3] are valuable resources for measuring our contributions. However, a key part of this project will also involve identifying the limitations of current benchmarks. Consequently, designing novel tasks and evaluation strategies that better assess physical plausibility is an additional opportunity for contribution within this PhD project.
References:
[1] VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation. H. Bansal, C. Peng, Y. Bitton, R. Goldenberg, A. Grover, K. W. Chang.
[2] How Far is Video Generation from World Model: A Physical Law Perspective. B. Kang, Y. Yue, R. Lu, Z. Lin, Y. Zhao, K. Wang, G. Huang, J. Feng.
[3] PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop. C. Li, O. Michel, X. Pan, S. Liu, M. Roberts, S. Xie.
[4] PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation. S. Liu, Z. Ren, S. Gupta, S. Wang.
[5] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models. X. Zhang, J. Liao, S. Zhang, F. Meng, X. Wan, J. Yan, Y. Cheng.
[6] MotionCraft: Physics-based Zero-Shot Video Generation. L. S. Aira, A. Montanaro, E. Aiello, D. Valsesia, E. Magli.
[7] Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach. Y. Chen, J. Cao, A. Kag, V. Goel, S. Korolev, C. Jiang, S. Tulyakov, J. Ren.
Main activities:
Analyse and implement related work.
Design novel solutions.
Write progress reports and papers.
Present work at conferences.
Skills
Technical skills and level required: solid programming skills.
Languages: English required; French is a plus.
Interpersonal skills: good communication skills.
Benefits