Latest Generative AI Research Papers

Research on AI systems that create new content including image generation, text-to-image, video synthesis, and creative AI applications.

76 Papers
Showing 20 of 76 papers

CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Haodong Li, Chunmei Qing, Huanyu Zhang +10 more

Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the pr...

Unified Multimodal Models · Chain-of-Thought reasoning · text-to-image generation · executable code · structured draft construction · +5 more
Mar 9, 2026 · 26

HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Kai Zou, Dian Zheng, Hongbo Liu +3 more

Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods t...

autoregressive diffusion · temporal continuity · error accumulation · denoising · bidirectional diffusion models · +8 more
Mar 9, 2026 · 24

CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Lingen Li, Guangzhi Wang, Xiaoyu Li +5 more

Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), where high-resolution video is especially important for an immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion model...

spatio-temporal autoregressive diffusion model · cubemap representations · autoregressive synthesis · sparse context attention · cube-aware positional encoding · +3 more
Mar 4, 2026 · 11

Helios: Real Real-Time Long Video Generation Model

Shenghai Yuan, Yuanyang Yin, Zongjian Li +3 more

We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly u...

autoregressive diffusion model · video generation · long-video drifting · self-forcing · error-banks · +14 more
Mar 4, 2026 · 136

Kling-MotionControl Technical Report

Kling Team, Jialu Chen, Yikang Ding +21 more

Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework en...

DiT-based framework · heterogeneous motion representations · adaptive identity-agnostic learning · multi-stage distillation · semantic motion understanding · +4 more
Mar 3, 2026 · 21

OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

Yiying Yang, Wei Cheng, Sijin Chen +5 more

OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible motion and visual content control, we focus on Lottie, a lightweight JSON format that represents both shapes and animation behaviors. However, the raw Lottie JSON fil...

Lottie · tokenizer · vision language models · multi-modal interleaved instructions · vector animations · +3 more
Mar 2, 2026 · 111

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin, Guoqiang Liang, Ziyun Zeng +3 more

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlen...

video editing · instruction-following · reference-guided editing · image generative models · latent visual features · +2 more
Mar 2, 2026 · 14

WildActor: Unconstrained Identity-Preserving Video Generation

Qin Guo, Tianyu Yang, Xuanhua He +5 more

Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level c...

human video generation · identity consistency · viewpoint adaptation · asymmetric identity-preserving attention · viewpoint-adaptive monte carlo sampling · +2 more
Feb 28, 2026 · 17

DreamWorld: Unified World Modeling in Video Generation

Boming Tan, Xiangdong Zhang, Ning Liao +5 more

Despite impressive progress in video generation, existing models remain limited to surface-level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to intr...

video generation · world model · joint world modeling paradigm · temporal dynamics · spatial geometry · +5 more
Feb 28, 2026 · 16

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Shengqu Cai, Weili Nie, Chao Liu +8 more

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local...

Decoupled Diffusion Transformer · flow matching · distribution matching · mode-seeking reverse-KL divergence · long-range coherence · +5 more
Feb 27, 2026 · 27
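For readers unfamiliar with the terminology, the mode-seeking/mean-seeking distinction comes from the direction of the KL divergence. This is standard background, not the paper's exact training objective:

```latex
% Forward KL (mean seeking): heavily penalizes q(x) \approx 0 wherever p(x) > 0,
% so q spreads out to cover all modes of p.
D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]

% Reverse KL (mode seeking): penalizes q placing mass where p is low,
% so q concentrates on a single high-density mode of p.
D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right]
```

Intuitively, a mean-seeking objective suits learning broad local appearance from abundant short clips, while a mode-seeking objective suits distilling coherent long-range structure from scarce long-form data.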

Enhancing Spatial Understanding in Image Generation via Reward Modeling

Zhenyu Tang, Chaoran Feng, Yufan Deng +5 more

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attemp...

SpatialReward-Dataset · SpatialScore · reward model · text-to-image generation · reinforcement learning · +2 more
Feb 27, 2026 · 43
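The "multiple sampling attempts" workflow the abstract alludes to is often automated with a reward model via best-of-N selection. A minimal sketch, assuming only that some sampler and some scorer exist (the `generate` and `reward` callables here are hypothetical stand-ins, not this paper's API):

```python
import random

def best_of_n(generate, reward, prompt, n=8, seed=0):
    """Draw n candidates and keep the one the reward model scores highest.

    `generate(prompt, noise)` and `reward(image)` are placeholders for a
    T2I sampler and a spatial-relation scorer; any callables with these
    signatures work.
    """
    rng = random.Random(seed)
    candidates = [generate(prompt, rng.random()) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: "images" are just the noise values, and the reward
# prefers values near 0.5, so best-of-N picks the candidate closest to it.
best = best_of_n(
    generate=lambda prompt, noise: noise,
    reward=lambda img: -abs(img - 0.5),
    prompt="a cat to the left of a dog",
)
```

A learned reward model can also supply the signal for reinforcement-learning fine-tuning, which is the direction the paper's keywords suggest.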

Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Linxi Xie, Lisong C. Sun, Ashley Neall +3 more

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is ...

video world models · diffusion transformer · 3D head pose · joint-level hand poses · dexterous hand-object interactions · +3 more
Feb 20, 2026 · 22

HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation

Lei Xin, Yuhao Zheng, Ke Cheng +3 more

Modeling long sequences of user behaviors has emerged as a critical frontier in generative recommendation. However, existing solutions face a dilemma: linear attention mechanisms achieve efficiency at the cost of retrieval precision due to limited state capacity, while softmax attention suffers from...

Hybrid Attention · linear attention · softmax attention · long-term stable preferences · short-term intent spikes · +3 more
Feb 20, 2026 · 53
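The dilemma the abstract describes — linear attention is efficient but imprecise, softmax attention is precise but costly on long sequences — motivates hybrids that apply each where it is strongest. The sketch below illustrates the general pattern only (exact softmax over a recent window, a kernelized linear-attention summary over older history); it is not HyTRec's actual architecture, and the feature map and fixed mixing weight are assumptions:

```python
import numpy as np

def hybrid_attention(q, K, V, window=4):
    """Illustrative hybrid read-out: softmax attention over the last
    `window` behaviors plus a linear-attention summary of everything older.

    q: (d,) query; K, V: (n, d) key/value sequences, oldest first.
    """
    recent_K, recent_V = K[-window:], V[-window:]
    old_K, old_V = K[:-window], V[:-window]

    # Softmax attention: precise retrieval over short-term intent.
    scores = recent_K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    short = (w / w.sum()) @ recent_V

    # Linear attention: a fixed-size (d, d) state summarizing long-term
    # preferences, built without an n x n score matrix.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map
    S = phi(old_K).T @ old_V                     # (d, d) summary state
    z = phi(old_K).sum(axis=0)                   # (d,) normalizer
    long = phi(q) @ S / (phi(q) @ z)

    return 0.5 * short + 0.5 * long             # fixed mixing for the sketch

rng = np.random.default_rng(0)
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 8))
q = rng.normal(size=8)
out = hybrid_attention(q, K, V)
```

The key property is cost: the linear branch carries only a d×d state regardless of history length, while the exact softmax computation is confined to the short window.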

MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models

Hojung Jung, Rodrigo Hormazabal, Jaehyeong Jo +5 more

Molecular generation with diffusion models has emerged as a promising direction for AI-driven drug discovery and materials science. While graph diffusion models have been widely adopted due to the discrete nature of 2D molecular graphs, existing models suffer from low chemical validity and struggle ...

diffusion models · molecular generation · graph diffusion models · chemical validity · hierarchical discrete diffusion model · +5 more
Feb 19, 2026 · 54

Unified Latents (UL): How to train your latents

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink +1 more

We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight u...

diffusion prior · diffusion model · latent representations · training objective · latent bitrate · +4 more
Feb 19, 2026 · 46

Image Generation with a Sphere Encoder

Kaiyu Yue, Menglin Jia, Ji Hou +1 more

We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent spac...

sphere encoder · generative framework · spherical latent space · encoder-decoder architecture · image reconstruction losses · +2 more
Feb 16, 2026 · 13
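The basic operations behind a spherical latent space are simple to sketch: projecting a latent onto the unit sphere, and sampling a uniform point on it by normalizing an isotropic Gaussian. This shows only the generic geometry, under the assumption of L2 normalization; the paper's contribution is training an encoder whose image latents land *uniformly* on that sphere:

```python
import numpy as np

def to_sphere(z, eps=1e-8):
    """Project latent vectors onto the unit sphere (row-wise L2 normalize)."""
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

# A uniform sample on the sphere: normalize an isotropic Gaussian draw.
# If encoded images also cover the sphere uniformly, such samples fall
# on-distribution for the decoder, enabling few-step generation.
rng = np.random.default_rng(0)
sample = to_sphere(rng.normal(size=(1, 64)))
```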

BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Yuang Ai, Jiaming Han, Shaobin Zhuang +7 more

We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to 2^{256} states, yielding a compact yet highly expressive discrete representation. Sampling fr...

autoregressive image generator · binary visual tokens · high-entropy binary latents · binary diffusion head · next-patch diffusion · +6 more
Feb 15, 2026 · 10
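To see why a binary token is so much more expressive than a codebook index: a 256-dimensional binary vector can take 2^256 distinct states, whereas a VQ token selects one of at most a few thousand codebook entries. The sign-based binarization below is an illustrative assumption, not necessarily BitDance's actual quantizer:

```python
import numpy as np

def binarize_latent(z):
    """Sign-binarize a continuous latent into a binary token (illustrative).

    Each of the 256 dimensions carries one bit, so a single token spans
    2**256 possible states, versus len(codebook) states for a VQ index.
    """
    return (z >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
token = binarize_latent(rng.normal(size=256))  # one 256-bit visual token
n_states = 2 ** 256                            # representable states per token
```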

FireRed-Image-Edit-1.0 Technical Report

Super Intelligence Team, Changhao Qiao, Chao Hui +16 more

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-ima...

diffusion transformer · data curation · training methodology · evaluation design · text-to-image · +11 more
Feb 12, 2026 · 3

Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

Huai-Hsun Cheng, Siang-Ling Zhang, Yu-Lun Liu

Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We pres...

vector sketching · semantic transformation · Stroke of Surprise · generative framework · dual-branch Score Distillation Sampling · +4 more
Feb 12, 2026 · 27

Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning

Xu Ma, Yitian Zhang, Qihua Dong +1 more

High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a...

text-to-image · fine-tuning · diffusion models · autoregressive models · text-image alignment · +3 more
Feb 10, 2026 · 5
Page 1 of 4