![Sulin Liu Profile](https://pbs.twimg.com/profile_images/1869600616189952000/xc1Sw7gz_x96.jpg)
Sulin Liu
@su_lin_liu
Followers
567
Following
2K
Statuses
186
Postdoc @MIT Ex: Machine Learning PhD @Princeton @Meta @NTUsg @NUSingapore
Joined March 2011
Discrete generative models use denoisers for generation, but they can slip up. What if generation *isn’t only* about denoising?🤔 Introducing DDPD: Discrete Diffusion with Planned Denoising🤗🧵(1/11) w/ @junonam_ @AndrewC_ML @HannesStaerk @xuyilun2 Tommi Jaakkola @RGBLabMIT
5
54
226
Re "only works in conditional sampling", agree that this is a limitation, does not allow you to sample general images in the same way prompting a LLM can achieve. the Bayes' rule kind of makes sense to me -- it's more about making use of the difference of the "conditioned score direction" and the "average score direction". I imagine this to be more training data efficient? but this also makes CFG lose the general sampling capability. probably also why the model size of diffusion models are much smaller
0
0
2
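For reference, a minimal sketch of the classifier-free guidance combination being described, assuming a noise/score-prediction model that can be run with and without the conditioning signal; the call signature, the null-condition convention, and the weight w are illustrative assumptions, not the exact API of any specific model.

```python
import torch

def cfg_prediction(model, x_t, t, cond, w: float = 3.0):
    """Classifier-free guidance sketch: amplify the difference between the
    "conditioned score direction" and the "average score direction".
    `model`, `cond=None` as the null condition, and `w` are assumptions."""
    eps_uncond = model(x_t, t, cond=None)  # average (unconditional) prediction
    eps_cond = model(x_t, t, cond=cond)    # conditioned prediction
    # guided = uncond + w * (cond - uncond); w > 1 sharpens conditional samples
    return eps_uncond + w * (eps_cond - eps_uncond)
```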
@agihippo @Swarooprm7 It's not even that close, more like OAI vs. some unknown lab ... (wait...
0
0
0
@jxmnop Not true, many are PhDs from the top 2 China universities, which can be on par with top US PhDs
1
0
10
@AharonAzulay @ma_nanye Our recent DDPD paper might be of interest
Discrete generative models use denoisers for generation, but they can slip up. What if generation *isn’t only* about denoising?🤔 Introducing DDPD: Discrete Diffusion with Planned Denoising🤗🧵(1/11) w/ @junonam_ @AndrewC_ML @HannesStaerk @xuyilun2 Tommi Jaakkola @RGBLabMIT
0
0
1
Cool paper on inference-time search for diffusion models! The use of a verifier for search at test time is similar in spirit, but very different in method/theory between continuous diffusion and discrete diffusion (DDPD). So many interesting things to further explore!
Inference-time scaling for LLMs drastically improves the model's abilities in many respects, but what about diffusion models? In our latest study, "Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps", we reframe inference-time scaling as a search problem over sampling noises. Our results show that increasing search computation can further enhance generation performance, pushing the capabilities of diffusion models further. 🧵[1/n]
0
1
7
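As a rough illustration of that search framing, a minimal best-of-N sketch over initial sampling noises, assuming a black-box sample(noise) generator and a verifier that scores outputs; both callables are placeholders, and the actual search algorithms studied in the paper may be more sophisticated than this naive loop.

```python
import torch

def best_of_n_search(sample, verifier, shape, n_candidates: int = 16):
    """Spend extra inference compute by searching over sampling noises:
    draw several noise seeds, generate from each, keep the sample the
    verifier scores highest. `sample` and `verifier` are placeholders."""
    best_x, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape)   # candidate initial sampling noise
        x = sample(noise)            # run the full denoising chain from this noise
        score = verifier(x)          # external quality score for the result
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score
```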
@kohjingyu Haha, it's very open to interpretation😂 Personally I don't think the deterministic activations in NNs match neurons, which are inherently quantum from electro-chemical processes in the brain. But I have really little idea about how to model the brain as a whole.
0
0
1
RT @brekelmaniac: I wrote a thing about "RL or control as Bayesian inference", which encompasses - RLHF and controlled generation in LLMs -…
0
100
0
RT @JerryWeiAI: My holiday side quest at @AnthropicAI: How well can Claude play Geoguessr? 🗺️ I had Claude look at 200K+ Street View image…
0
28
0
@YouJiacheng @cloneofsimo Ah that's a good idea. Yea cost will be an issue, but maybe a few steps with parallel decoding. Essentially what we were doing in the paper is one-step parallel sampling of z_t.
0
0
1
@YouJiacheng @cloneofsimo That's a cool idea! The model will need to infer change (flow) of the z vector from the change of x, which might offer a more consistent estimate of z.
1
0
0
p(x^d | x_noisy, z^d = N) is the reconstruction probability we want to compute when the d-th dimension is picked by the planner for denoising (i.e. the denoising step). In discrete diffusion, the denoising prediction is restricted (by the transformer) to be per-dimension instead of a joint probability. (13) is a way to use a masked denoiser to compute the denoising step for x_t with noisy variables. Approximation error might occur for p(z_t | x_t, z_t^d = N) when using independent p(z_t^d | x_t) predictions from the planner (also a transformer), but for text we found this approximation error is minimal; for images it might be larger than for text.
1
0
1
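Spelling out the decomposition being referenced, as a math sketch reconstructed from this thread (my reading of what (13) expresses, not a verbatim copy of the paper's equation):

```latex
% Reconstruction probability for dimension d, marginalizing over the
% latent noisy/clean indicators z_t (reconstruction of (13) as described above):
p\left(x_1^d \mid x_t,\, z_t^d = \mathrm{N}\right)
  = \sum_{z_t} p\left(z_t \mid x_t,\, z_t^d = \mathrm{N}\right)\,
               p\left(x_1^d \mid x_t,\, z_t\right)

% Planner approximation: independent per-position predictions,
% which is where the approximation error mentioned above comes from:
p\left(z_t \mid x_t\right) \approx \prod_{i} p\left(z_t^i \mid x_t\right)
```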
Sure, happy to! (13) states that the denoising probability for a noisy image (with no [MASK] token state modeled) can be decomposed into an expectation over the latent noisy/clean state z, and denoising conditioned on z. To use a mask denoiser, one can sample a realization of z_t from the planner, apply [MASK] to the positions z_t marks as noisy, and use the mask denoiser to get the reconstruction probability p(x_1^d | x_t, z_t). In practice, the z predictions are made independently for each position, which might introduce approximation error in p(z_t | x_t)
1
0
1
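A minimal sketch of that procedure with one Monte Carlo sample of z_t, assuming hypothetical planner and mask_denoiser callables (per-position noisy probabilities and a standard masked-token denoiser); the mask id and tensor shapes are placeholders, not the paper's implementation.

```python
import torch

MASK_ID = 0  # placeholder id for the [MASK] token

def denoise_with_planner(planner, mask_denoiser, x_t, d):
    """Estimate p(x_1^d | x_t, z_t^d = noisy) with one sampled z_t:
    1) sample which positions are noisy from the planner (independently per position),
    2) force position d to be noisy and mask all noisy positions,
    3) query the masked denoiser for the reconstruction distribution at d."""
    p_noisy = planner(x_t)                 # (seq_len,) P(z_t^i = noisy | x_t)
    z_t = torch.bernoulli(p_noisy).bool()  # independent per-position realization
    z_t[d] = True                          # condition on z_t^d = noisy
    x_masked = x_t.clone()
    x_masked[z_t] = MASK_ID                # apply [MASK] to noisy positions
    logits = mask_denoiser(x_masked)       # (seq_len, vocab_size)
    return logits[d].softmax(-1)           # reconstruction probs for dimension d
```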
Cool visual about how DDPD works!
In DDPD, the planner decides which tokens to denoise, and the denoiser decides what to replace them with. The model's knowledge is decomposed into guessing which part is incoherent and how it's incoherent. Left is the planner's prediction on 'what's wrong'. Right is the denoising state. You can see it's very confident on the noisy part
0
0
14
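To make the planner/denoiser split concrete, a rough sketch of one planned-denoising update; the function names, greedy position choice, and categorical resampling are illustrative assumptions, not the paper's exact sampling rule.

```python
import torch

def ddpd_step(planner, denoiser, x):
    """One planned-denoising update: the planner guesses which position is
    most likely corrupted ("what's wrong"), the denoiser proposes what to
    put there instead. `planner` and `denoiser` are placeholder callables."""
    p_noisy = planner(x)                          # (seq_len,) P(position is noisy)
    d = int(p_noisy.argmax())                     # pick the most suspicious position
    probs = denoiser(x, d)                        # (vocab_size,) replacement distribution
    x = x.clone()
    x[d] = torch.multinomial(probs, 1).item()     # resample that token
    return x
```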