I am excited to announce that I am joining the UCLA Department of Mathematics
@uclamath
as an Assistant Professor this coming Fall. I will dearly miss my time at Seoul National University, but I am looking forward to the exciting new opportunities ahead.
Found our book being advertised at ICML!
Our book on large-scale convex optimization, by Wotao and me, is scheduled to drop in October 2022. You can find a preprint here.
(1/3)
Large-Scale Convex Optimization by @ErnestRyu and @wotaoyin
A unified analysis of first-order optimization methods, including parallel-distributed algorithms, using monotone operators.
🎉 Excited to announce that our lab has 4 papers accepted at NeurIPS. 1 in empirical deep learning, 1 in RL theory, and 2 in convex optimization theory. I'll post an advertisement for these papers, one per day!
1/N
I recently gave a talk at the Stanford ISL colloquium entitled ‘Toward a Grand Unified Theory of Accelerations in Optimization and Machine Learning.’
In this talk, I explain why I find acceleration an exciting and vibrant subject.
Excited to announce that our lab at SNU has 3 papers accepted at ICML, 2 as long talks! Let me use this opportunity to advertise these papers quickly. (1/10)
Final NeurIPS paper advert
Paper 4 (Convex optimization)
When minimizing a convex function f(x), we usually want an approximate solution x such that f(x) - f(x*) is small. But what if we instead want ||∇f(x)|| to be small?
12/N
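Concretely, the two notions of an approximate solution being contrasted here are (a standard framing, not anything specific to this paper):

```latex
f(x_N) - f(x_\star) \le \epsilon
\qquad \text{versus} \qquad
\|\nabla f(x_N)\| \le \epsilon .
```

One practical appeal of the gradient-norm criterion is that ||∇f(x_N)|| is directly computable, whereas f(x_N) - f(x*) is not, since f(x*) is typically unknown.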
Image clustering is a quintessential unsupervised learning task, but clustering methods often fail to cluster images the way you want. Wouldn't it be great if you could "tell" the method, in words, the criterion by which it should cluster the images?
(1/8)
NeurIPS paper advert continued.
Paper #2 (RL theory)
The classical value iteration (VI) is a cornerstone of RL theory. Surprisingly, VI is suboptimal! We show how to accelerate it in 'Accelerating Value Iteration with Anchoring'
@IjongminS
6/N
NeurIPS paper advert continued.
Paper 3 (Convex optimization)
Anchor acceleration is a mechanism distinct from Nesterov acceleration, and it has been used to accelerate algorithms for minimax optimization, fixed-point problems, and value iterations.
9/N
ICML 2022! My first in-person conference since ICML 2019. It feels surreal to finally return to normalcy. Looking forward to having a great time.
@sehyunkwon22
@jadenpark98
@minkyu_kim__
@jaewoong_cho
@Kangwook_Lee
In this work, we use a vision-language model (LLaVA) and a large language model (GPT-4) to cluster images based on user-specified text criteria, and it works very well! A given dataset can be clustered in multiple ways by varying the clustering criterion.
(3/8)
Professor Hédy Attouch, who has made significant contributions to convex analysis, monotone operator theory, and optimization, has passed away. Much of my research on splitting methods is influenced by his work. Thank you, Professor Attouch, for being an inspiring mathematician.
The performance estimation problem (PEP) is a computer-assisted proof methodology for optimization algorithms. In our collaboration with @shuvo_das_gupta and Bart Van Parys of MIT, we greatly expand this framework using branch-and-bound algorithms.
(1/8)
Excited to share a presentation for our paper "BnB-PEP: A Unified Methodology for Constructing Optimal Optimization Methods" (joint work with Bart Van Parys and @ErnestRyu) on YouTube: .
Roland Glowinski, known as one of the inventors of ADMM, has passed away. This is a huge loss to the optimization community. I guess we are all part of the circle of life, and great mathematicians also pass, even if their work lives on.
@prof_grimmer
This is very nice! If I may advertise relevant results from my work with @shuvo_das_gupta and Bart, we numerically investigate the *optimal* stepsizes of GD. (1/N)
Paper #1 (Empirical deep learning)
Diffusion models are very powerful, but they often generate undesirable images alongside the desired ones. Can we prevent (censor) a diffusion model from generating bad images while allowing it to generate only good images?
2/N
@TaeHo_Y00N
@KibeomM00
@keonlee9420
@jaewoong_cho
We show that labeling around 100 images as "good" or "bad" (which takes about 3 minutes of human time) is enough to train an auxiliary reward model that can guide the sampling process away from bad images.
4/N
@TaeHo_Y00N
@KibeomM00
@keonlee9420
@jaewoong_cho
The reward model is a small ResNet trained on a few hundred labeled images, and the diffusion model's score network is not fine-tuned. Our approach to censoring is human-label efficient, computationally lightweight, and very effective!
5/N
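As a rough illustration of what reward-guided sampling of this flavor can look like, here is a classifier-guidance-style sketch under my own assumptions; it is not claimed to be the paper's exact scheme, and `score_model`, `reward_model`, and `sigma_t` are hypothetical stand-ins.

```python
# Hedged sketch: one reverse-diffusion (Langevin-style) step nudged away from "bad"
# images by a reward model. Only the reward model is trained on the human labels;
# the score network itself is left untouched, as described above.
import torch

def reward_guided_step(x, t, score_model, reward_model, sigma_t, guidance_weight=1.0):
    x = x.detach().requires_grad_(True)
    # log-probability that the current sample is "good" according to the reward model
    log_good = torch.log(torch.sigmoid(reward_model(x))).sum()
    reward_grad = torch.autograd.grad(log_good, x)[0]
    with torch.no_grad():
        guided_score = score_model(x, t) + guidance_weight * reward_grad
        noise = torch.randn_like(x)
        return x + 0.5 * sigma_t ** 2 * guided_score + sigma_t * noise
```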
(Paper #2 advert coming tomorrow)
@AsuOzdaglar
@chanwoopark20
@Jaeyeon_Kim_0
Usual notions of duality in optimization are dualities between functions (conjugacy) or between optimization problems (Lagrangian duality). We establish the first instance of a duality between *algorithms*. Our view is that an optimization algorithm is a mathematical object that we can take the dual of!
16/N
In 'Continuous-time Analysis of Anchor Acceleration', we perform a continuous-time analysis of this mechanism and provide some insight into this counter-intuitive acceleration mechanism.
@Jaewook_J_Suh
11/N
In 'Time-Reversed Dissipation Induces Duality Between Minimizing Gradient Norm and Function Value', joint work with Asu at MIT, we present H-duality, a duality principle relating the acceleration mechanisms.
@AsuOzdaglar
@chanwoopark20
@Jaeyeon_Kim_0
14/N
We hope this book serves as a nice introduction to the (in my view) elegant field of convex optimization algorithms. Slides (with their LaTeX source code) and YouTube lecture videos are also available.
(3/3)
It turns out that Nesterov acceleration is not optimal for reducing gradient magnitude; Kim-Fessler acceleration is. Interestingly, there is a glaring symmetry between these two acceleration mechanisms.
13/N
@AsuOzdaglar
@chanwoopark20
@Jaeyeon_Kim_0
This concludes my 4-part announcement of our lab's NeurIPS papers. If anybody wants to chat about our work, shoot us a DM or meet us at NeurIPS 23!
17/N
Say you have an optimization algorithm, and you want to find its convergence proof. Do you want computer assistance in finding the proof? If so, check out Shuvo's talk given at the 2023 PEP Workshop, UCLouvain! 🇧🇪
🎥 Excited to share the YouTube video of my presentation at the 2023 Workshop on Performance Estimation Problems (PEP), UCLouvain, Belgium! 🇧🇪 Check out my talk titled "Design and analysis of first-order methods via nonconvex QCQP frameworks" at (1/5)
Anchoring pulls iterates back towards the starting point of the algorithm, and the strength of this anchor diminishes over iterations. This counter-intuitive acceleration mechanism is understood much less than Nesterov acceleration.
10/N
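To make this concrete, one representative instance is the Halpern-type anchored iteration for a nonexpansive operator T (the specific anchoring coefficients in our papers may differ, so take this as an illustration):

```latex
x^{k+1} \;=\; \frac{1}{k+2}\, x^{0} \;+\; \frac{k+1}{k+2}\, T(x^{k}),
```

where the weight 1/(k+2) on the anchor x^0 is exactly the diminishing anchor strength described above.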
Dear Twitter ML theory folks,
Does anybody know what the plans are for COLT23? I am part of a group planning a workshop next summer, and we may choose to run it as a satellite event of ICML or COLT. (1/2)
Taeho presented work on censoring (aligning) diffusion models to generate good images, as defined through a minimal amount of human feedback. Here’s the poster and the link to the paper
(2/2)
Through modern multi-modal foundation models (which are 🤯 mind-blowingly powerful), we present a new paradigm of image clustering.
If interested in further details, check out our preprint
or meet us at the NeurIPS workshop!
(8/8)
In this book, we provide a unified presentation of convex optimization algorithms using monotone operators. In chapters 1-3, there is only *one* theorem, and approximately 40 convex optimization algorithms are analyzed by reducing them to instances of "Theorem 1".
(2/3)
@AsuOzdaglar
@chanwoopark20
@Jaeyeon_Kim_0
In continuous time, taking the H-dual corresponds to reversing the time dependence of the dissipation (friction) term. The H-duality principle, loosely speaking, shows that under the H-dual, a function-value guarantee translates to a gradient-norm guarantee and vice versa.
15/N
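Schematically, in continuous time the construction looks as follows (an illustrative rendering of "reversing the time dependence of the dissipation"; see the paper for the precise statement):

```latex
\ddot{X}(t) + \gamma(t)\,\dot{X}(t) + \nabla f\big(X(t)\big) = 0
\quad \xrightarrow{\;\text{H-dual}\;} \quad
\ddot{X}(t) + \gamma(T-t)\,\dot{X}(t) + \nabla f\big(X(t)\big) = 0,
\qquad t \in [0, T].
```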
In addition to the faster rate, we provide a matching complexity lower bound. This proves that Anc-VI is optimal (while classical VI is suboptimal) in the regime γ≈1.
8/N
The crucial mechanism is 'anchoring'. More on this in the Paper #3 advert coming tomorrow!
@sehyunkwon22
@jadenpark98
@minkyu_kim__
@jaewoong_cho
@Kangwook_Lee
Step 3 uses the LLM to assign the image descriptions from Step 1 to the K cluster names from Step 2.
Finally, if the results are unsatisfactory, you can adjust your text criterion to be more specific and go back to Step 1.
(7/8)
Question to RL twitter. Lately, I've been studying modern deep RL, and there's one thing that has puzzled me.
Why does double Q-learning help? I understand that double DQN mitigates overestimation of Q values, but why should overestimating Q-values lead to bad policies? (1/3)
@sehyunkwon22
@jadenpark98
@minkyu_kim__
@jaewoong_cho
@Kangwook_Lee
If you think about it, providing only the images for clustering is not a realistic setup. In practice, users are willing to guide the clustering algorithm through some minimal input, and foundation models are available for us to use.
(4/8)
Classical VI converges at rate O(γ^k), but this rate is problematic when the discount factor γ is close to 1 and the iteration count k is not too large. Anchored value iteration (Anc-VI) achieves a faster rate on the Bellman error.
7/N
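As a rough sketch of the anchoring idea in the VI setting (the anchor weights below are simple placeholders; the paper derives specific γ-dependent weights, so this is an illustration rather than the exact Anc-VI scheme):

```python
import numpy as np

def anchored_value_iteration(T, V0, num_iters):
    """Anchored VI: each iterate is a convex combination of the Bellman update T(V) and the anchor V0."""
    V = V0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)          # placeholder anchor strength, diminishing in k
        V = beta * V0 + (1.0 - beta) * T(V)
    return V

# Toy usage: policy evaluation on a 3-state MDP, where T is the (linear) Bellman operator.
gamma, P, r = 0.99, np.full((3, 3), 1.0 / 3.0), np.array([1.0, 0.0, 0.5])
T = lambda V: r + gamma * P @ V
V_hat = anchored_value_iteration(T, np.zeros(3), num_iters=1000)
```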
@Singularitarian
I’m not sure what Daniel’s context is, but I am very happy to answer any questions about my book on Twitter or MathOverflow! So far, I have gotten email questions, but no public questions yet.
@sehyunkwon22
@jadenpark98
@minkyu_kim__
@jaewoong_cho
@Kangwook_Lee
Recent foundation models allow us to effectively incorporate textual user input into this computer vision task.
Given a user-specified text criterion TC,
Step 1 uses the instruction-tuned vision-language model LLaVA to extract a description of the image relevant to TC.
(5/8)
@sehyunkwon22
@jadenpark98
@minkyu_kim__
@jaewoong_cho
@Kangwook_Lee
Step 2a uses the LLM GPT-4 to summarize the image description into a single word in the context of TC.
Step 2b uses the LLM to consolidate these single-word descriptions into K cluster names, where K is a pre-specified cluster count.
(6/8)
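Putting Steps 1–3 together, here is a minimal sketch of the pipeline; the callables `vlm_describe` (LLaVA) and `llm` (GPT-4) are hypothetical stand-ins for the actual model APIs, and the prompts are illustrative rather than the paper's exact prompts.

```python
def cluster_by_text_criterion(images, text_criterion, K, vlm_describe, llm):
    # Step 1: describe each image with respect to the user-specified text criterion (LLaVA).
    descriptions = [vlm_describe(img, f"Describe this image with respect to: {text_criterion}")
                    for img in images]
    # Step 2a: summarize each description into a single word in the context of the criterion (GPT-4).
    words = [llm(f"Summarize in one word, in the context of '{text_criterion}': {d}")
             for d in descriptions]
    # Step 2b: consolidate the single-word descriptions into K cluster names (GPT-4).
    cluster_names = llm(f"Consolidate these words into {K} cluster names: {words}")
    # Step 3: assign each image description to one of the K cluster names (GPT-4).
    labels = [llm(f"Assign this description to one of {cluster_names}: {d}") for d in descriptions]
    return cluster_names, labels
```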
@konstmish
My work with monotone operators has partially been driven by this desire to find some unifying structure in the zoo of algorithms. However, I find myself adding to this "problem" by publishing new methods and thereby increasing, not decreasing, the number of optimization algorithms!
@PatrickKidger
Hi Patrick, I understand your frustration, but I must say that I found your 2020 COLT talk on width-bounded depth-wise uni. apx. to be incredibly informative and useful. It was a great introduction to an area I had not worked on before.
@PatrickKidger
The recorded talk led me to read your paper and even remember your name 😉 So your time spent on the video has produced some value.
@prof_grimmer
@shuvo_das_gupta
Detail (iii): Our numerics go up to N=50. Our work says nothing about whether this phenomenon persists for larger N.
@prof_grimmer's work makes progress towards a theoretical understanding of this phenomenon. (5/N, N=5)
@konstmish
Personally, I am quite interested in the problem of "there are too many optimization algorithms".
Zhao, Lessard, and Udell have recently proposed an algorithm (based on ideas from control theory) for detecting equivalent algorithms.
@prof_grimmer
@shuvo_das_gupta
Detail (i): The optimal stepsizes are not unique. Sometimes, but not always, we can permute the order of some step sizes and maintain the same optimal performance. (3/N)
@boazbaraktcs
It was wonderful to have you Boaz! There was so much excitement among the students. Thanks for visiting SNU and delivering an inspiring lecture series!
@konstmish
In algebra, mathematicians classify groups and other structures into a handful of equivalence classes. I wonder if optimization algorithms can be collected, classified, and organized in a similar manner.
Our central thesis is that the analysis of such ODEs simplifies dramatically under a change of variables into "dilated coordinates." The classical O(1/t^2) rate follows from a conservation-of-energy argument, and the mysterious "z-iterates" of AGM are conjugate momentum. (4/10)
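For reference, the classical continuous-time model in question is the AGM ODE of Su, Boyd, and Candès with its standard energy function; the dilated-coordinates analysis in the paper recovers the same rate through a change of variables:

```latex
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\big(X(t)\big) = 0,
\qquad
\mathcal{E}(t) = t^{2}\big(f(X(t)) - f_\star\big) + 2\,\Big\| X(t) + \tfrac{t}{2}\dot{X}(t) - x_\star \Big\|^{2},
```

and E(t) being nonincreasing gives f(X(t)) - f* ≤ 2||x_0 - x*||^2 / t^2.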
@sangwoomo
@jasondeanlee
Uni. apx. thm., in its classical form, is not quantitative, i.e., we can approximate with large enough size, but who knows how large. Since basically all neural networks (wide or deep) now have uni. apx. thms and there are constructive ways to find the approximations, (2/3)
The prior analysis of overparameterized neural networks via the NTK theory focused on infinitely wide networks. For infinitely deep networks with the standard LeCun initialization, the NTK becomes degenerate. (9/10)
@GlockerBen
I also believe that the requirement was not made sufficiently clear. When the same mistake becomes numerous, at some point, one can reasonably attribute the cause to poor UX design (unclear announcement) rather than individuals' oversight. (3/4)
@GlockerBen
I asked the program chairs if they could extend some leniency. They promptly and respectfully stated that the decision was final. I was sad, but the mistake is my fault, and I understand that keeping order for a conference with ~10000 submissions is a near-intractable task. (2/4)
@prof_grimmer
@shuvo_das_gupta
Our numerics go up to about N=50 iterations, but we find that mostly small step sizes with an occasional long step are optimal. (2/N)
@FeiziSoheil
Hi Soheil, we haven't met, but I know you through your work on GANs and your Deep Learning Foundations YouTube videos. It'd be great if we could meet and chat at ICML.
@sangwoomo
@jasondeanlee
the remaining question is their efficiency. My understanding of Jason's line of work is to obtain quantitative results. Two-layer nets *can* approximate the function with SGD; they just need too many neurons. Three-layer nets can do the job with far fewer neurons. (3/3)
@GlockerBen
In any case, issuing "desk rejections" now is nonsense. My guess is that many missing checklists were overlooked (or perhaps ACs refused to desk reject), and the PCs just found out about them. So, to enforce the rules consistently, they are now belatedly being desk rejected. (4/4)
@shuvo_das_gupta
We call this the BnB-PEP. This vastly expands the applicability of the PEP's computer-assisted proof methodology. If interested, please check out our paper and Shuvo's excellent and detailed talk.
(7/8)
Our work shows that an infinitely-deep ReLU MLP with a particular initialization is provably trainable. The key technical challenge is establishing the NTK's invariance in the infinite-depth limit, which requires controlling the behavior under infinite compositions. (10/10)
@sangwoomo
@jasondeanlee
Rahimi and Recht's random kitchen sink approach can be used to constructively perform the universal approximation with 2-layer MLPs with only the final layer trained. You can also appeal to the NTK theory to show something similar for L-layer MLPs with all layers trained. (1/3)
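For instance, here is a minimal random-features ("kitchen sink") sketch along these lines: a 2-layer net whose first layer is random and fixed, with only the final linear layer fit by ridge-regularized least squares. The width, scales, and regularization are illustrative choices, not from any particular paper.

```python
import numpy as np

def random_feature_fit(X, y, width=2000, reg=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], width))       # fixed random first layer
    b = rng.uniform(0.0, 2.0 * np.pi, size=width)
    Phi = np.maximum(X @ W + b, 0.0)                    # random ReLU features
    a = np.linalg.solve(Phi.T @ Phi + reg * np.eye(width), Phi.T @ y)  # fit last layer only
    return lambda X_new: np.maximum(X_new @ W + b, 0.0) @ a

# Toy usage: approximate sin on [0, 3].
X = np.linspace(0.0, 3.0, 200).reshape(-1, 1)
f_hat = random_feature_fit(X, np.sin(X).ravel())
```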
@konstmish
Ming Yan and @wotaoyin have a paper showing that (roughly speaking) applying ADMM to a primal problem results in an algorithm equivalent to applying ADMM to the dual problem.
@konstmish
To shamelessly promote my own work: I have a result establishing that, under certain (very restrictive) conditions, DRS is the only splitting method that utilizes resolvents.
Continuous-time models of momentum-based accelerated gradient methods do provide some insight, but their analyses are still somewhat mysterious; where do these energy functions come from, and what do the terms mean? (3/10)
For the general problem of fixed-point iterations with a non-expansive or contractive operator, are the simple fixed-point iterations à la Banach or KM optimal? Can we characterize the optimal complexity as was done for Nesterov's accelerated gradient method? (6/10)
@shuvo_das_gupta
Given a fixed optimization algorithm, finding its best convergence proof is, in principle, an optimization problem, and we call this the performance estimation problem (PEP). Surprisingly, the PEP can be formulated as a finite-dimensional (convex) SDP, no relaxation. 🤯 (2/8)
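Schematically, and as one standard instantiation, for N steps of gradient descent x_{k+1} = x_k - h∇f(x_k) on an L-smooth convex f, the PEP is the worst-case problem

```latex
\begin{aligned}
\underset{f,\; x_0}{\text{maximize}} \quad & f(x_N) - f(x_\star) \\
\text{subject to} \quad & f \text{ is } L\text{-smooth and convex}, \quad \|x_0 - x_\star\| \le R, \\
& x_{k+1} = x_k - h\,\nabla f(x_k), \qquad k = 0, \dots, N-1,
\end{aligned}
```

and convex interpolation conditions let one replace the infinite-dimensional variable f with the finite collection {(x_k, ∇f(x_k), f(x_k))}, turning this into an SDP in the Gram matrix of iterates and gradients together with the function values.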
In this work, we accelerate such fixed-point iterations via an "anchoring" mechanism (from O(1/k) to O(1/k^2)) and provide exactly matching complexity lower bounds to establish exact optimality. (7/10)
@gautamcgoel
@shuvo_das_gupta
Of course, proofs establishing f(x^N) - f(x*) ≤ O(1/N) are known, but their constants are worse. The best proof for simple gradient descent is not known.