During my PhD, my advisor would tell me “never use a symbol in text without reminding the reader what it is.”
[Example]
𝘉𝘢𝘥: “So, 𝜑 is bounded.”
𝘎𝘰𝘰𝘥: “So, the value function 𝜑 is bounded.”
I told a grad-student co-author "try to imagine me as the guy from 'Memento' who can't remember anything & needs clues to pick up the thread on projects" and they said "I believe you, because you already used that Memento analogy"
Job search completed: Excited to join
@ChicagoBooth
as an Assistant Professor of Operations Management starting July 2024! 🥳
Thank you to all my friends and mentors who helped me along the way 🙏🙏🙏.
@alz_zyd_
Yeah, that's my sense too. I feel like "Python for Data Science/ML"-type classes never teach list comprehensions and jump straight to numpy, so students who learn Python that way never get exposed...
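For concreteness, a toy comparison (my example, not from the thread):

# pure Python: a list comprehension, no numpy needed
squares = [x ** 2 for x in range(10) if x % 2 == 0]

# the numpy idiom many courses jump straight to
import numpy as np
arr = np.arange(10)
squares_np = arr[arr % 2 == 0] ** 2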
Honored to receive an ICLR 2022 Outstanding Paper Award for “Neural Collapse under MSE Loss” w/ Vardan Papyan and Dave Donoho!
Come by
#ICLR2022
on 4/26 1AM PST (Oral) & 4/27 6:30PM PST (Poster) to chat w/ us about Neural Collapse and its open questions!
64 GPUs for one research lab is pretty nice. Across all of Stanford,
@StanfordCompute
has 700+ shared GPUs. Rumor is
@StanfordData
folks are talking about buying a new cluster of 1000+ GPUs. The point is valid, but the 64-GPU example feels misleading.
Apple announces MM1
Methods, Analysis & Insights from Multimodal LLM Pre-training
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through
@sirbayes
I thought a lot about this question: Did my PhD in a dept with optimizers who love linesearch. Within the classic optimization community (the SIOPT/ICCOPT/ISMP crowd) I think linesearch is pretty popular. Within ML, the problem is memory: you need to keep yet another copy of your parameters.
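For concreteness, here is a minimal Armijo backtracking sketch in numpy (my illustration, assuming a generic smooth objective); note the trial iterate living alongside the current one:

import numpy as np

def backtracking_step(f, grad_f, x, alpha=1.0, beta=0.5, c=1e-4):
    g = grad_f(x)
    fx = f(x)
    while True:
        x_trial = x - alpha * g                     # a second full copy of the parameters
        if f(x_trial) <= fx - c * alpha * (g @ g):  # Armijo sufficient-decrease test
            return x_trial
        alpha *= beta                               # shrink the step and retry

# toy usage on a quadratic
f = lambda x: 0.5 * (x @ x)
grad_f = lambda x: x
x_new = backtracking_step(f, grad_f, np.array([3.0, -4.0]))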
FOUR new Neural Collapse works accepted to
#NeurIPS2022
investigating Neural Collapse (1) under different losses, (2) modeled as Riemannian gradient flow, (3) as motivation for network classifier design, and (4) under class imbalance. Congrats to all the authors!! 🥳
Fun story Erhan Çinlar told us in Princeton’s ORFE 309 course: Back in the day, US academics were deciding what to call math-for-decision-making. A good name already existed: 𝘤𝘺𝘣𝘦𝘳𝘯𝘦𝘵𝘪𝘤𝘴. 𝗕𝘂𝘵 it was the Cold War and “cybernetics” was a term already associated with the Soviets. So, “operations research” it was.
TEN new Neural Collapse-related submissions found in
#ICLR2023
🤩🥳👏
The original NC paper took almost 3 years of exploration and experimentation to write. Extremely grateful to see so many now share our interest. 🙏🙏🙏
Amazing survey on the subtleties, historical contexts, and open questions of Neural Collapse. Very readable & comprehensive. One of the best so far! ⭐️⭐️⭐️⭐️⭐️
To authors
@kvignesh1420
, E. Rasromani, & V. Awatramani: Thanks x💯 for your interest in NC and this fantastic review!
Two exciting new papers examining Neural Collapse in
#ICML2022
(both spotlights!). Congratulations to the authors!
(T. Tirer and
@joanbruna
) and (J. Zhou, X. Li, T. Ding, C. You,
@Qing_Qu_1006
, and
@ZhihuiZhu
)
Proud to share this new work with my supervisor, Adrian Lewis, in which we develop a multipoint generalization of gradient descent for nonsmooth optimization. (1/4)
We interviewed
@XYHan_
, Vardan Papyan, and David Donoho about their ICLR outstanding paper on the neural collapse phenomenon. Read what they had to say here:
Fun stories from
@Princeton
: During undergrad, Tarjan subbed for one of our Intro Algorithms (COS226) lectures. He started with this beautiful remark:
“Hi, I’m Bob Tarjan… Not Bob Sedgewick. Bob Sedgewick wrote your textbook. I wrote the algorithm 𝘪𝘯 your textbook.”
On “Future Directions”: A Suggestion for the Academic Job Market
“Future Directions” is often the hardest part of the research statement. It took me multiple rewrites. Eventually, I found the following trick useful.
Imagine you got the faculty position. Visualize yourself living
How does neural collapse connect to prior works on implicit max-margin separation like Lyu & Li 2019, Soudry et al. 2018, and Nacson et al. 2019?
W. Ji,
@2prime_PKU
, Y.Zhang,
@zhun_deng
&
@weijie444
solidify the connection in their new
#ICLR2022
paper. 9:30PM EDT!
👏👏👏 Much needed and overdue.
A huge personal pain point for me as an opt researcher is that popular constrained opt solvers (cvxpy, gurobi, mosek, etc.) require specialized syntax for the constraints and end up moving back to the CPU, so they can't take advantage of GPU matmuls... (1/2)
Together with
@phschiele1
, we wrote a package to solve constrained optimization problems, where all functions are arbitrary
@PyTorch
modules.
This is mainly intended for optimization with pre-trained NNs as objective/constraints.
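As a rough illustration of the problem class (a hypothetical sketch, not the package's actual API): minimize a net-defined objective under a constraint, both written as ordinary torch code, via a simple quadratic-penalty loop.

import torch

f = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))  # stand-in for a pretrained objective net
g = lambda x: (x ** 2).sum() - 1.0                # constraint: ||x||^2 <= 1, i.e., g(x) <= 0
x = torch.tensor([2.0, 2.0], requires_grad=True)  # start infeasible
opt = torch.optim.Adam([x], lr=1e-2)
rho = 10.0                                        # penalty weight
for _ in range(500):
    opt.zero_grad()
    loss = f(x).squeeze() + rho * torch.relu(g(x)) ** 2  # quadratic penalty on constraint violation
    loss.backward()
    opt.step()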
Pre-2010, it went the other way. I still see Stats folks who roll their eyes at CS seminar speakers who lack mathematical rigor... and CS folks who dismiss Stats speakers as useless-for-SoTA. We all find different problems interesting: nobody's better than anyone else.
When I’m asked where one might start to learn about Neural Collapse, this survey is 𝘢𝘭𝘸𝘢𝘺𝘴 among my top recommendations. Ecstatic to see it cross the finish line. Congrats
@kvignesh1420
!! (The reviews & discussion are amazing too! Hits on some key points 👏👏👏.)
@bradneuberg
The only way to get them is to keep paying Google per hour. You can’t just buy one with funding—whether it’s VC, academic, or otherwise. Plus, early on, it only worked with TensorFlow. It was only 3-4 years later that PyTorch compatibility came along. The performance was
@Adam235711
It’s useful for the “surrogate loss” argument in theory. Specifically: (1) somebody develops a convex loss that doesn’t do too badly; (2) most of the time, it doesn’t actually catch on outside of the research group that developed it; (3) but using it doesn’t change the behavior of
Important info for new PIs buying compute hardware! It's more than just GPUs. If you don't get the interconnect (go for InfiniBand) and storage type (go for NVMe or SAS SSDs) right, you're gonna get bottlenecked by dataloading no matter how good your GPUs are.
The Machine Learning Engineering Networking chapter has been updated with multiple providers' intra- and inter-node connectivity information/specs and easy-to-use bandwidth comparison tables:
If I'm still missing some commonly used
Neural collapse observes last-layer class variation collapses 𝘵𝘰𝘸𝘢𝘳𝘥𝘴 0 with training. 𝗕𝘂𝘁: As it does, one can 𝘴𝘵𝘪𝘭𝘭 find informative, fine-grained structures in the residual small variations at 𝘧𝘪𝘹𝘦𝘥 epochs (even ones that look “collapsed”!).
Check out this
Today at
#ICML2023
,
@YongyiYang7
is presenting "Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations."
We discovered important fine-grained structure exists in NN representations despite the apparent "Neural Collapse."
See you at 2-3:30pm!
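For reference, here is a toy proxy for the collapse quantity (my sketch, not the paper's code): the within-class to between-class variation ratio that collapse drives toward 0.

import torch

def collapse_ratio(feats, labels):
    # feats: (N, d) last-layer features; labels: (N,) integer class ids
    mu_global = feats.mean(0)
    within, between = 0.0, 0.0
    for c in labels.unique():
        fc = feats[labels == c]
        mu_c = fc.mean(0)
        within += ((fc - mu_c) ** 2).sum()                     # within-class variation
        between += len(fc) * ((mu_c - mu_global) ** 2).sum()   # between-class variation
    return (within / between).item()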
@sp_monte_carlo
"linear convergence" was confusing af until
@prof_grimmer
told me during the 2nd year of my PhD "linear means linear in log-scale". I actually added a footnote to my job market research statement just to not confuse non-specialists:
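In symbols (the standard definition, added for non-specialists): linear convergence means
\|x_{k+1} - x^\star\| \le \rho \, \|x_k - x^\star\| for some \rho \in (0, 1),
so \|x_k - x^\star\| \le \rho^k \|x_0 - x^\star\|, and taking logs,
\log \|x_k - x^\star\| \le k \log \rho + \log \|x_0 - x^\star\| :
the error traces a straight, decreasing line on a log-scale plot.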
I want an LLM-upgrade plan that switches my subscription whenever a new benchmark comes out.
(GPT 4.5 -> Gemini Pro 1.5 -> Claude 3 -> ???)
Maybe with unlimited text and data?
@Verizon
@ATT
Today, we're announcing Claude 3, our next generation of AI models.
The three state-of-the-art models—Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku—set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision.
What if you pass out-of-distribution data into NNs showing neural collapse? How's that useful?
Turns out, the out-of-distribution data become orthogonal to in-distribution data & you can then use that to detect those OOD points.
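A minimal sketch of that idea (my illustration, not the paper's code): score a feature vector by how little of it lies in the span of the in-distribution class means.

import torch

def ood_score(feat, class_means):
    # class_means: (C, d) in-distribution class-mean features; feat: (d,)
    Q, _ = torch.linalg.qr(class_means.T)    # orthonormal basis for their span
    proj = Q @ (Q.T @ feat)                  # projection of feat onto that span
    return 1.0 - proj.norm() / feat.norm()   # near 1 when feat is near-orthogonal (likely OOD)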
TIL Costco sells a pallet of freeze-dried food.
Preppers can buy 5,400 servings of food with a 25-year shelf life for $2,500.
Lots of pasta and rice dishes, some soups, oatmeal & milk. Just add water!
910,080 calories, 364 calories/$.
@ben_golub
They do. Candidates can and do ask schools to match offers. The power is usually in a Dean’s hands. Prob is Deans have to be fair across the school (ex. Art & Sci or Eng) they oversee. Paying a new CS prof more than a tenured math prof will piss ppl off, for example.
@DrJimFan
@DeepMind
Suppose I write some exponential-time procedure in RASP. Say, exhaustive search for a solution to the traveling salesman problem. Would the compiled transformer then be able to always give me the solution in the constant time of a forward pass? What am I missing?
While a previous work claims happiness peaks at 10,000 H100 GPUs; a new PNAS study shows that happiness continues to grow with resources up to 500,000 H100 GPUs for the top 30% of the GPU rich.
At my fastest during the job market, I could pack for a four-day trip from scratch in <30 mins. Now, I’ve deteriorated to… more than twice that. Didn’t know you could lose muscle memory on these things… 😥
@docmilanfar
It’s also mindblowing that a still-active researcher was already a grad student and part of someone’s origin story a year before I was even born. 😵💫
@sirbayes
This is also my guess for why (L-)BFGS and bundle methods — which you hear a lot about in the opt community, but not as much in ML — aren’t more popular.
@damekdavis
Check out this talk by Dave Donoho at IHES! He elaborates there on this point and on the existence of two cultures (empirical results vs. theorem proving).
This is my default grading policy as well:
• If you do A+ work with AI, you get an A+.
• If the AI plagiarized or made things up, you get a 0, as if you'd done it yourself.
I just don't get why people are trying to detect whether AI is being used for writing instead of just grading whether the writing is good or bad. If student A uses AI and writes better than student B, student A should get a better grade than student B.
Also, in the tech specs, note the 1493 “privately owned” nodes. Those are nodes associated with specific PI groups (many containing GPUs). Sherlock contributors can use the idle nodes of other PI groups as well, making the effective number of shared GPUs much higher than 700.
Yearly reminder: To get natbib \citep and \citet commands working properly with NeurIPS citation style (numbers rather than author lists), use the following commands.
\usepackage[nonatbib]{neurips_2021}
\usepackage[numbers]{natbib}
...
\bibliographystyle{abbrvnat}
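Then standard natbib usage applies in the body (hypothetical bib key):
% \citet{han2021nc} renders as “Han et al. [7]”; \citep{han2021nc} renders as “[7]”.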
@miniapeur
There’s this guy, a physicist, who became a US senator for New Jersey. I remember lots of faculty at
@Princeton
liked him because it was nice getting represented by a real scientist.
.@YiMaTweets
What do you think of WizardLM from PKU and Yi from 01.AI as representatives of Chinese open-source AI? They seem to be doing well on the leaderboards…
Honestly, I prefer reading clear GPT-assisted emails to deciphering intentions in unclear, "organic" ones. ChatGPT is effectively the modern version of spellcheck/Grammarly. The same goes for class assignments --- as long as students take responsibility for GPT-induced errors
It's unjust to criticize students, especially non-native English speakers, for using ChatGPT to communicate. They might be investing extra time in crafting these emails, navigating linguistic and cultural nuances. Just to clarify, this tweet was crafted with the help of ChatGPT!
I realize this is seemingly an unpopular opinion, but I can't get onboard with these Twitter criticisms of some of the recent
#ICML2022
best paper awardees. I've been thinking about this all day. A thread... 🧵 1/N
"No management overhead or product cycles" & "insulated from short-term commercial pressures" is. literally. academia.
But, instead of asking NSF for $200-500k,
@ssi
raised many times that from VCs purely on reputation. This is what happens when you beat the game 🤯.
Superintelligence is within reach.
Building safe superintelligence (SSI) is the most important technical problem of our time.
We've started the world’s first straight-shot SSI lab, with one goal and one product: a safe superintelligence.
It’s called Safe Superintelligence
@damekdavis
My personal thought (influenced by co-teaching a course on this) is that the definition of "way forward" is slightly vague.
If we define it as
(1) creating new tools that push society forward, then even if there is something special about transformers and neural nets,
@ben_golub
My understanding (from having done a salary neg b/n a bschool and an eschool) is there’s flexibility in the pay, but the discrepancy can’t be too big and needs to be justifiable by how much money the dept’s masters program and alum donations pull in.
@zacharylipton
Is this really a university-level decision? Aren’t dept funding and salaries tied to the profitability of the corresponding Masters-level program?
As in, Tepper’s MBA tuition ($39k) is 1.34x the SCS MS tuition ($29k) at CMU.
@Adam235711
From what I’ve seen, SOTA methods tend to come out of lots of trial and error by researchers who are good at having hunches about data and choosing which ones to act on. In implementation, they draw on their math education to make design decisions. Since opt courses tend to
@bhutanisanyam1
Such a guide would be immensely helpful. I am an academic AI researcher who recently went through the (quite challenging) process of building a GPU cluster.
Questions I wish I had understood beforehand (and am still fuzzy on) are the following:
1) What should researchers
In Operations Research, through the lens of DLD's Data Science at the Singularity, it's trickier to achieve both [FR1: Common Data] and [FR3: Common Benchmarks] since modeling context/structure in OR often entails modifying the data collection itself.
@ProfKuangXu
The challenge I’ve seen is that many benchmark datasets strip away a lot of problem context — like the MIPLIB library. Without the richness of the setting, it’s hard to use them broadly.
Interesting new paper proposing an NC-inspired loss that mitigates undesirable biases when training deep nets on imbalanced data. Builds upon the prior work of C. Fang,
@hangfeng_he
,
@DrQiLong
, &
@weijie444
showing minority collapse under imbalanced training. (1/2)
The greatest accomplishment of my statistics career has been winning this year’s
@UCBStatistics
T-shirt design competition with a
@SFBART
-inspired shirt designed w/
@aashen12
!
{stats nerds} ∩ {public transit nerds} ≠ ∅ 📉🚅
@alz_zyd_
The uncomfortable part is that a non-negligible part of K-12 teachers’ skillset is memorizing and teaching students to go through the motions. If AI is allowed in K-12 classrooms, it calls into question the entire training of K-12 teachers trained pre-2022. It’s clear the curriculum
Anyone know how to search the
#ISMP2024
program? From the site, it seems the only way is to click open each individual session or speaker name to see what talks there are for that particular session/person??
@math_opt
"Watercolor portrayal of a sunset scene where supercomputer mountains are silhouetted against a fiery sky. Streams of luminous data pour from their summits, filling a gleaming lake below that reflects the array of scientific discoveries."
#DallE3
🤩
@ShiqianMa
@sirbayes
I thought about it from this angle too: forward passes on a deep net are expensive. But I couldn't quite convince myself, for the following reasons.
(1) You have to do function evaluations for SGD too. In fact, linesearch evaluations are better evaluation/compute-wise because you only need the forward pass, not the backward.
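(A standard rule of thumb, added here for context: if a forward pass costs 1 unit of compute, the backward pass costs roughly 2, so a full SGD step is ~3 units while each extra linesearch trial is only ~1.)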
@ben_golub
😮 I didn’t know about that. I can only speak to the discrepancy between operations research (engineering) and operations management (business), which draw from the same pool of people. From what you say, a different mechanism does seem at work there… thanks for the insight!
Sparsity is achieved *without* the familiar ℓ1 penalization. And it simply uses gradient descent on the kernel loss without extra “tricks”. A worthwhile read with original ideas and new techniques.
Demis Hassabis admits that Google has some secret sauce in how Gemini is able to process 1-10m token context windows. The extreme context length in Gemini 1.5 Pro "can't" be achieved "without some new Innovations". This is an astonishing development that seems to hint at
[Then:] Read the final, published paper. It’s the most polished!
[Now:] Find the preprint. It’s formatted as the authors wanted without copyediting artifacts.