![Shitian Zhao Profile](https://pbs.twimg.com/profile_images/1750212339390144516/NONotZBs_x96.jpg)
Shitian Zhao
@zst96687522
Followers: 472 · Following: 627 · Statuses: 386
Looking for a CS PhD position in Fall 2025. Researcher @ Shanghai AI Lab @opengvlab. Bachelor @ ECNU @ECNUER. Previously intern @ CCVL @JohnsHopkins
Shanghai, China
Joined April 2021
Thanks AK for posting our work!
Lumina-mGPT: Illuminating Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining. Paper page: We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key insight is that a simple decoder-only transformer with multimodal Generative PreTraining (mGPT), utilizing the next-token prediction objective on massive interleaved text-image sequences, can learn broad and general multimodal capabilities, thereby illuminating photorealistic text-to-image generation. Building on these pretrained models, we propose Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text pairs to fully unlock their potential for high-aesthetic image synthesis at any resolution while maintaining their general multimodal capabilities. Furthermore, we introduce Omnipotent Supervised Finetuning (Omni-SFT), transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like flexible text-to-image generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multiturn visual question answering. Additionally, we analyze the differences and similarities between diffusion-based and autoregressive methods in a direct comparison.
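The core recipe in the abstract (one decoder-only transformer, one next-token objective over interleaved text and image tokens) can be sketched in a few lines. Below is a minimal PyTorch illustration, not the paper's code: the unified vocabulary sizes, model dimensions, and the random interleaved sequence are all placeholder assumptions, and a VQ image tokenizer is presumed to have already mapped pixels to discrete codes.

```python
import torch
import torch.nn as nn

# Toy unified vocabulary: text tokens and VQ image codes share one ID space.
# Sizes are illustrative assumptions, not Lumina-mGPT's actual configuration.
TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

class TinyDecoder(nn.Module):
    """A small causal transformer standing in for the mGPT backbone."""
    def __init__(self, d_model=256, n_head=4, n_layer=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

# A stand-in interleaved sequence: [text ... image codes ... text ...].
seq = torch.randint(0, VOCAB, (2, 128))
model = TinyDecoder()
logits = model(seq[:, :-1])
# One cross-entropy loss covers both modalities: plain next-token prediction.
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   seq[:, 1:].reshape(-1))
loss.backward()
```

The point of the sketch is that nothing modality-specific appears in the objective; flexible resolutions come from how image tokens are laid out in the sequence, which this toy omits.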
RT @DanHendrycks: We're releasing EnigmaEval, a collection of long, complex reasoning challenges that take groups of people many hours or d…
RT @_jasonwei: We do not rise to the power of our RL optimization algorithms—we fall to the hackability of our RL environment
RT @largemodelgame: [1/N] LLM evaluations can be done while you are playing live computer games. 🤯 We are excited to announce our game: AI…
RT @lockonlvange: Introducing CodeI/O, a systematic way to condense diverse reasoning patterns via code input-out…
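For context on the truncated line above: CodeI/O turns code into input-output prediction tasks with executable ground truth. A hedged sketch of that idea follows; the function, prompt format, and sampling are illustrative assumptions, not the paper's actual pipeline.

```python
import random

def make_io_task(func, input_space, source):
    """Turn a code snippet into an output-prediction question whose
    answer can be verified by running the code (illustrative format)."""
    x = random.choice(input_space)
    return {
        "question": f"Given the code:\n{source}\nWhat does it return for input {x}?",
        "answer": func(x),  # executing the code yields a checkable label
    }

src = "def f(n):\n    return sum(range(n))"
task = make_io_task(lambda n: sum(range(n)), list(range(10)), src)
print(task["question"], "\nExpected:", task["answer"])
```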
RT @PointsCoder: Can Vision-Language Models (VLMs) truly understand the physical world? 🌍🔬 Introducing PhysBench – the first benchmark to…
RT @aclmentorship: 📢 Join us for the ACL Mentorship Session on Zoom! Session Link: Questions:
RT @_akhaliq: DeepScaleR-1.5B-Preview, an open-source, 1.5B-parameter model trained with RL to surpass o1-preview for general math reasoning…
RT @SharonYixuanLi: How should we assign rewards to intermediate steps in reasoning? DeepSeek-R1 paper highlights it as an open challenge.…
RT @GaoyueZhou: Can we extend the power of world models beyond just online model-based learning? Absolutely! We believe the true potential…
The Real-Time "Canvas"
DeepSeek R1 is great. How do humans think with reasoning machines like R1? CoT-Lab is @ii_posts's latest exploration at UBAI into cognitive partnership, where human intuition and AI reasoning become co-evolving thought partners. We enable collaboration with reasoning models: an interface that guides humans through the AI's reasoning flow, lets them actively reshape thought trajectories, and collectively elevates cognitive outcomes. True intelligence emerges not from artificial or human minds alone, but through their symbiotic dance. We're exploring the choreography for this new cognitive ballet: like skilled dance partners, human and machine refine each other's moves.
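A minimal sketch of the interaction loop described above, assuming nothing about CoT-Lab's actual implementation: `generate_reasoning` is a hypothetical stand-in for any API that returns an R1-style chain of thought split into steps, and the console prompts stand in for the interface.

```python
def generate_reasoning(prompt: str) -> list[str]:
    # Placeholder: call a reasoning model and split its chain of thought.
    return [f"Consider the question: {prompt}", "Tentative conclusion..."]

def collaborate(prompt: str) -> list[str]:
    """Walk a human through the model's reasoning, letting them reshape it."""
    steps = generate_reasoning(prompt)
    revised = []
    for i, step in enumerate(steps):
        print(f"[{i}] {step}")
        edit = input("Edit this step (blank keeps it, 'stop' truncates): ")
        if edit == "stop":
            break                       # the human cuts the trajectory here
        revised.append(edit or step)
    # Resume generation from the human-reshaped prefix.
    return revised + generate_reasoning(" ".join(revised))

if __name__ == "__main__":
    print("\n".join(collaborate("Why does ice float on water?")))
```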
RT @ideogram_ai: The Ideogram Text Tool is here. Add text, choose fonts, and customize colors. All within Ideogram Canvas. Premium graphic…
RT @allen_ai: Here is Tülu 3 405B 🐫 our open-source post-training model that surpasses the performance of DeepSeek-V3! The last member of t…
RT @shaneguML: Iterated synthetic data + filtering + distillation is RL. Check the data quality to avoid reward hacking. It's called re…
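The loop the tweet names, sketched under stated assumptions: `sample`, `score`, and `finetune` are hypothetical stand-ins for a model's sampling API, a verifier or reward, and a supervised-finetuning step; the filter threshold is arbitrary.

```python
import random

def sample(model, prompt, k=8):
    # Placeholder: draw k candidate completions from the model.
    return [f"{prompt} -> candidate {i}" for i in range(k)]

def score(candidate):
    # Placeholder verifier. A weak or gameable check here is exactly
    # where reward hacking creeps in, hence the data-quality warning.
    return random.random()

def finetune(model, data):
    # Placeholder: supervised finetuning (distillation) on filtered data.
    return model

def iterated_distillation(model, prompts, rounds=3, threshold=0.8):
    """Generate -> filter -> distill, repeated: RL by another name."""
    for _ in range(rounds):
        kept = [c for p in prompts for c in sample(model, p)
                if score(c) > threshold]
        model = finetune(model, kept)
    return model
```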
RT @Alibaba_Qwen: Announcing Qwen2.5-VL Cookbooks! 🧑‍🍳 A collection of notebooks showcasing use cases of Qwen2.5-VL, including local model a…
RT @karpathy: For friends of open source: imo the highest leverage thing you can do is help construct a high diversity of RL environments t…
RT @alsuhr: Check out our SWE-Gym RL environment: @jiayi_pirate @xingyaow_ @hengjinlp @YizheZhangNLP @gneubig