Visual Generative Models: Past, Present, and Future

DICTA 2025 Workshop

📅 02/12/2025   |   ⏰ 9:00 am - 6:00 pm   |   📍 Location: Union House - Function room 1 (The University of Adelaide)

About

Recent breakthroughs in generative adversarial networks, diffusion models, and autoregressive models have dramatically advanced the state of visual content generation, with widespread applications in generating images, videos, 3D objects, and more. These advances not only push the frontiers of synthesis quality and scalability but also unlock new applications in design, entertainment, vision, and scientific domains, and can even improve or reformulate vision tasks themselves. However, several fundamental and practical challenges remain, e.g., improving controllability, enhancing fidelity and realism, scaling across modalities, ensuring alignment with human values, and achieving efficient, safe deployment. This workshop aims to provide a broad forum for exploring the past breakthroughs, current developments, and future directions of visual generative models, with particular emphasis on foundational innovations, emerging challenges, and practical applications.

News

📍 Updated Location Announcement

We are excited to share that the venue for this workshop has now been confirmed! The event will take place at:

🏫 Union House – Function Room 1 (🗺️ Map)
The University of Adelaide


🗓 Agenda Now Available

The full workshop schedule, including invited talks and oral presentations, has been released — scroll down to view details.

👉 Click to jump to the agenda


🔥 Call for Oral Presentations

We invite researchers to join us as one of the SIX oral presenters at the workshop to showcase their outstanding research. If you are interested, please complete the Google Form with the details of the work you would like to present.
Note that presenters must give their presentation in person at the workshop.

Eligibility (any of the following):

Submission Information

Schedule

| Time | Session | Speaker |
| --- | --- | --- |
| **Morning Session** | | |
| 09:00–09:10 | Welcome & Opening Remarks | Dr. Dong Gong |
| 09:10–10:10 | Keynote 1: Continue of Simulator Environments | Prof. Richard Hartley |
| 10:10–11:10 | Keynote 2: From generative image synthesis to 4D modelling (Online) | Prof. Chunhua Shen |
| 11:10–11:30 | Break and Coffee Chat | |
| 11:30–11:45 | Oral Presentation 1: Toward Human-like Multimodal Understanding, Reasoning, and Generation | Qi Chen |
| 11:45–12:00 | Oral Presentation 2: Controllability Matters: From Static Generation to Interactive World Models | Zicheng Duan |
| 12:00–13:30 | Lunch Break (Union House – Function Room 1) | |
| **Afternoon Session** | | |
| 13:30–14:30 | Keynote 3: End-to-end image generation training | A/Prof. Liang Zheng |
| 14:30–14:45 | Oral Presentation 3: Point-cloud–centric paradigm for general 3D scene and asset generation | Jiatong Xia |
| 14:45–15:00 | Oral Presentation 4: Growing Models on Demand: Dynamic Modular Expansion for Continual Learning | Huiyi Wang |
| 15:00–15:30 | Break and Coffee Chat | |
| 15:30–15:45 | Oral Presentation 5: The Human Brain as a Blueprint for AI’s Frontier: Symbolic Perception, Vision-Grounded Reasoning, and Learning Through Experience | Shan Zhang |
| 15:45–16:00 | Oral Presentation 6: Exploring Primitive Visual Measurement Understanding and the Role of Output Format in Learning in Vision-Language Models | Ankit Yadav |
| 16:00–16:30 | Panel Discussion | Keynote Speakers and Organizers |
| 16:30–16:40 | Closing Remarks | |

Invited Speakers

Richard Hartley, The Australian National University
Chunhua Shen, Zhejiang University
Liang Zheng, The Australian National University

Organizers

Xinyu Zhang, University of Auckland
Lingqiao Liu, The University of Adelaide
Chang Xu, The University of Sydney
Yujun Cai, The University of Queensland
Jiaxian Guo, Google Research
Anton van den Hengel, The University of Adelaide
Dong Gong, The University of New South Wales