Coming to
#NeurIPS23
now. Will be there until Friday night.
DM me to chat about: reasoning, AI for math, and what we’re doing
@xai
.
Also will be at
#MATHAI
workshop panel discussion on Friday morning. See you there!
Euclidean geometry problems have been my favorite math puzzles since middle school. The most intriguing part is the creation of auxiliary lines, which opens up space for imagination and the freedom to explore various diagrams. Once a proof is found, these auxiliary lines
Language models can dramatically improve their reasoning by learning from chains of thought that they generate.
With STaR, just a few worked examples can boost accuracy to that of a 30X larger model (GPT-J to GPT-3).
W.
@ericzelikman
, Noah Goodman
1/
After showing a few examples, large language models can translate natural language mathematical statements into formal specifications.
We autoformalize 4K theorems as new data to train our neural theorem prover, achieving SOTA on miniF2F!
1/
Paper:
Can Neural Networks solve IQ tests? We propose the Scattering Compositional Learner (SCL) for the RPM task. SCL improves SOTA from 63.9% to 95.0%. It is even capable of zero-shot generalization and learns disentangled representations!
paper:
(1/n)
How do you make a transformer recurrent?
You just turn the transformer 90 degrees, and apply it in the lateral direction!
Now, with recurrence, the context size is infinite!
Let's make the recurrence great again with Block-Recurrent Transformers:
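For intuition, here is a minimal Python sketch of the block-recurrent idea: a layer is applied to fixed-size blocks in sequence, carrying a state between blocks. The `toy_layer` and all names here are our own illustration, not the paper's implementation.

```python
# Toy sketch of block recurrence: process the sequence block by block,
# threading a recurrent state through the blocks (not the paper's code).

def block_recurrent_forward(tokens, block_size, layer, state):
    """Apply `layer` to each block in turn; `layer` maps (block, state) -> (out, state)."""
    outputs = []
    for i in range(0, len(tokens), block_size):
        out, state = layer(tokens[i:i + block_size], state)
        outputs.extend(out)
    return outputs, state

# Toy "layer": each token is shifted by the sum of the state so far;
# the state accumulates per-block sums.
def toy_layer(block, state):
    shift = sum(state)
    return [t + shift for t in block], state + [sum(block)]

outs, final_state = block_recurrent_forward([1, 2, 3, 4, 5, 6], 2, toy_layer, [0])
```

Because only a fixed-size state is carried across blocks, the cost per block is constant, which is what makes the effective context unbounded.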
You think the RNN era is over? Think again!
We introduce "Block-Recurrent Transformer", which applies a transformer layer in a recurrent fashion & beats transformer XL on LM tasks.
Paper:
W. DeLesley Hutchins, Imanol Schlag,
@Yuhu_ai_
&
@ethansdyer
1/
Super excited to share Minerva!! – a language model capable of solving MATH with a 50% success rate, which was predicted to happen in 2025 by Steinhardt et al. ()!
#Minerva
1/
Very excited to present Minerva🦉: a language model capable of solving mathematical questions using step-by-step natural language reasoning.
Combining scale, data, and other ingredients dramatically improves performance on the STEM benchmarks MATH and MMLU-STEM.
Autoformalization with LLMs in Lean!
@zhangir_azerbay
and Edward Ayers built a chat interface to formalize natural language mathematics in Lean:
Very impressive work!
🚨We are organizing the 2nd MATHAI workshop at NeurIPS!
Check it out if you're interested in AI for math, and machine reasoning in general🤯!
We have a great lineup of speakers & panelists!
See more in call for papers: 👇
Hello
#NeurIPS2022
! I'm in New Orleans and will be here until Thursday morning (Dec 1). Let's brainstorm AI for math, LLMs, Reasoning 🤯🤯!
We'll present 8 papers (1 oral and 7 posters) + 2 at workshops (MATHAI and DRL). Featuring recent breakthroughs in AI for math! See👇
Memorizing Transformer's camera ready is released!
Main updates:
1. Adding 8K memory gives gains comparable to a 5X-8X increase in model parameters.
2. You can easily turn a pretrained LLM into a memorizing transformer! (4% of pretraining cost to obtain 85% of the benefit)
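The external-memory idea can be sketched in a few lines: cached (key, value) pairs from earlier context are retrieved by nearest-neighbor lookup at each step. This toy version uses exact dot-product scoring over a tiny list; all names are illustrative, not the paper's API.

```python
# Toy kNN memory: retrieve the values whose cached keys best match the query
# (illustrative only; the real system uses approximate search at scale).

def knn_lookup(query, memory, k=2):
    """memory is a list of (key_vector, value); return top-k values by dot product."""
    scored = sorted(memory, key=lambda kv: -sum(q * x for q, x in zip(query, kv[0])))
    return [value for _, value in scored[:k]]

memory = [([1.0, 0.0], "A"), ([0.0, 1.0], "B"), ([1.0, 1.0], "C")]
retrieved = knn_lookup([1.0, 0.2], memory, k=2)
```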
Thanks a lot to
@Yuhu_ai_
,
@MarkusNRabe
and DeLesley Hutchins for their hard work updating our ICLR paper on retrieval-augmented language modeling, aka the "Memorizing Transformer"!
Here is a short thread on why we think this is important.
🧵 1/n
In May, we discovered that LLMs can autoformalize theorem statements:
In June, we showed that LLMs can solve challenging math problems with Minerva.
Now, we show LLMs can turn their generated informal proofs into verified formal proofs!🤯
What's next?😎
Large language models can write informal proofs, translate them into formal ones, and achieve SoTA performance in proving competition-level maths problems!
LM-generated informal proofs are sometimes more useful than the human ground truth 🤯
Preprint:
🧵
Excited to share this new work, which sheds light on the understanding of pre-training via synthetic tasks.
We did three experiments that iteratively simplify pre-training while still retaining gains.
Paper:
W. Felix Li,
@percyliang
.
1/
We discover that you can teach LLMs to solve longer problems *only* via in-context learning, instead of fine-tuning.
This is mind-blowing🤯🤯! -- certain skills are hard to encode in model weights, but much easier to acquire from the context.
🆕📜We study large language models’ ability to extrapolate to longer problems!
1) finetuning (with and without scratchpad) fails
2) few-shot scratchpad confers significant improvements
3) Many more findings (see the table & thread)
Paper: []
1/
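A few-shot scratchpad prompt of the kind studied here can be sketched as follows, using a running-sum task as a stand-in; the task and formatting are our own illustration, not the paper's exact setup.

```python
# Build a few-shot scratchpad prompt: each shot shows intermediate steps,
# so the model can imitate the step-by-step format on a longer test input.
# (Illustrative stand-in task, not the paper's exact setup.)

def scratchpad_example(xs):
    steps, total = [], 0
    for x in xs:
        total += x
        steps.append(f"running total = {total}")
    return f"Input: {xs}\n" + "\n".join(steps) + f"\nAnswer: {total}"

def build_prompt(train_inputs, test_input):
    shots = "\n\n".join(scratchpad_example(xs) for xs in train_inputs)
    return shots + f"\n\nInput: {test_input}\n"

prompt = build_prompt([[1, 2], [3, 4, 5]], [6, 7, 8, 9])
```

The test input is deliberately longer than any of the shots, mirroring the length-extrapolation setting.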
We’re excited to announce the MathAI workshop at ICLR 2021 : On the Role of Mathematical Reasoning in General Artificial Intelligence. Now accepting submissions!
Submission Link:
Deadline: Feb 26, 11:59PM PST
Quanta magazine covers our two works on large language models for mathematical reasoning: Autoformalization and Minerva.
Together, they show a path toward improving the reasoning capabilities of large language models in the future.
Can neural network agents prove theorems outside of the training distribution? We perform a systematic evaluation along 6 generalization dimensions with INT: an inequality theorem proving benchmark:
Joint work with Albert Jiang, Jimmy Ba,
@RogerGrosse
.
If you use K-FAC you only need to do 1 update (ACKTR), but if you use a first-order optimizer, you need to do 320 updates (PPO). AND the single K-FAC update still wins. This is what we (with
@baaadas
) found by comparing ACKTR vs. PPO vs. PPOKFAC.
The next figure shows a perfect translation of a grade school math problem by PaLM. This is remarkable because such a statement is completely out-of-distribution – no formal mathematicians are interested in formalizing grade school math problems ;)
4/
Compared to the 1st MATHAI workshop 1 year ago, the number of submissions this time almost doubled! Glad to see the field is growing rapidly 🙌
Also there are many mind-blowing works 🤯🤯 Stay tuned!
🚨👇Reminder that the submission deadline for the MATH-AI workshop at
#NeurIPS2022
is tomorrow -- Sep 30, 11:59pm PT.
Submit your recent works (e.g. ICLR submissions) if they are about Math&AI, reasoning, algorithmic capabilities!
Two papers in
@ICLR18
:
1. Short horizon bias in meta-learning optimization:
2. RELAX:
One invited to workshop:
3. Exploration in Meta-Reinforcement Learning
Our finding thus reveals a very surprising capability of these models: they have learned general, transferable knowledge that allows them to work with a low-resource formal language.
12/
Never focus too much on your short-term reward; the optimal strategy in the long run might be the complete opposite. Don't be fooled by short-horizon bias, both in life and in meta-learning.
Now, let the examples do the talking!
See the figure attached – Codex perfectly formalizes an IMO problem! It handles the negation “there is no function” by proof-by-contradiction. It understands the phrase “into itself” and correctly formalizes the co-domain of f.
I’m curious to find out how far we can push Minerva to theorem proving!
In the long run, I am expecting a great synergy between a strong natural language math model and an autoformalizer () to tackle challenging mathematical theorems!
3/
We propose STaR, a Self-Taught Reasoner. We start with few-shot prompting to generate rationales for all the problems in the dataset.
We collect the rationales that lead to the correct answer, and fine-tune the LLM on them.
6/
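The generate-filter-finetune step can be sketched as a simple loop. Everything below is a toy stand-in (the "model" evaluates arithmetic, "fine-tuning" just counts data), not a real model API.

```python
# One STaR-style iteration (toy sketch): sample a rationale and answer for
# each problem, keep only the rationales whose answer is correct, and
# fine-tune on the kept examples. All components below are stand-ins.

def star_iteration(problems, generate, finetune, model):
    kept = []
    for question, answer in problems:
        rationale, predicted = generate(model, question)
        if predicted == answer:          # filter on answer correctness
            kept.append((question, rationale, answer))
    return finetune(model, kept), kept

# Toy stand-ins: the "model" answers arithmetic by evaluating the expression,
# and "fine-tuning" just counts how much data it has seen.
def toy_generate(model, question):
    return f"compute {question}", eval(question)

def toy_finetune(model, data):
    return model + len(data)

model, kept = star_iteration([("1+1", 2), ("2*3", 6), ("2+2", 5)],
                             toy_generate, toy_finetune, model=0)
```

Note how the problem with the wrong label ("2+2", 5) is filtered out: only rationales that reach the correct answer enter the fine-tuning set.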
1. Formal math data is very scarce: all Isabelle proof scripts together amount to only about 180MB. 2. There is almost zero aligned data between natural language and formal mathematics, whereas docstrings for languages like Python are broadly available.
I'm glad to share that LIME is accepted at
#ICML2021
! One of the things I like about our publishing process is that there is always the next conference :) If you truly believe in your paper, then it will be published sooner or later! Just keep polishing 🛠️🛠️
@yaringal
We had a paper rejected with 8,7,6,6, with thorough reviews and lots of discussion.
The one-sentence reason for rejection -- that training on data is the wrong way to instill knowledge in an algorithm -- feels like something out of AAAI 1993.
At
#ICML2020
, we present OPtions as Responses (OPRE), an HRL agent in multi-agent settings. Our hierarchical agent generalizes to unseen opponent strategies and learns interpretable options. (1/n)
Paper:
Poster: .
Yeah, I am stunned by this. I don't know what to think of it. We worked so hard on this. Getting rejected by a one-sentence meta-review that overrides all the reviewers' decisions just seems so crazy and unfair.
We show two randomly chosen few-shot examples in the prompt, going from LaTeX to formal math (Isabelle). Note that these two examples are merely examples of syntactic translation, without much sophistication in reasoning or natural language understanding.
2/
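Assembling such a few-shot prompt is just string formatting; the example pair and the "Natural language:"/"Isabelle:" formatting below are our own illustration, not the paper's exact prompt.

```python
# Build a few-shot autoformalization prompt from (LaTeX, Isabelle) pairs.
# The example pair and formatting are illustrative, not the paper's prompt.

few_shot = [
    (r"Prove that $1 + 1 = 2$.", 'theorem "1 + 1 = (2::nat)"'),
]

def autoformalization_prompt(examples, target):
    parts = [f"Natural language: {nl}\nIsabelle: {fm}" for nl, fm in examples]
    parts.append(f"Natural language: {target}\nIsabelle:")
    return "\n\n".join(parts)

prompt = autoformalization_prompt(few_shot, r"Prove that $2 + 2 = 4$.")
```

The prompt ends right after "Isabelle:", so the model's completion is the formal statement itself.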
Why is this surprising? People know large language models can turn natural language descriptions into code. However, the existing known successes are limited to commonly used programming languages (e.g., Python). Formalizing mathematics is different for at least two reasons.
10/
We use Codex to formalize 3908 MATH problems. We then run expert iteration on these autoformalized statements. This allows us to achieve a new state of the art on the miniF2F theorem proving benchmark.
This is the first proof-of-concept of practical autoformalization!
7/
Can the model learn to formalize such problems if the prompt contains an example that explains the concept? We find if we add a tangentially related problem, then the model can formalize the “linear function” perfectly!
6/
🚨Call for Papers🚨 Submission to the
#NeurIPS2022
MATH-AI Workshop will be due on Sep 30, 11:59pm PT (2 days after ICLR😆). The page limit is 4 pages (not much workload🤩). Both work in progress and recently published work are welcome. Act NOW and see you in
#NewOrleans
!🥳🥳🍻
This is also a fundamentally iterative process: a better model generates better rationales, which can in turn be used to train an even better model.
7/
🔥Internship Opportunity on Improving the Reasoning Capabilities of Massive Language Models🔥: solving challenging problems in areas such as mathematics, science, programming, algorithms, and planning.
Please see the following link for more info:
STaR suggests many possible future directions. In general, any task that has an input and an output can be augmented with intermediate rationales.
Tasks that require multiple steps of reasoning benefit the most, such as theorem proving, program synthesis, etc.
10/
We further explore whether the model can handle more advanced mathematics beyond competition problems. We find these models are surprisingly good at turning formal statements into natural language as well!
8/
In addition, for the problems the model answered incorrectly, we give the model a hint: we tell the model the right answer, and ask it to provide a justification.
8/
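This hinting step amounts to putting the known answer into the prompt and asking for the reasoning. The prompt template below is our own illustration, not the paper's exact wording.

```python
# Rationalization-style prompt (sketch): include the correct answer as a hint
# and ask the model to justify it. Template wording is ours, not the paper's.

def rationalization_prompt(question, correct_answer):
    return (f"Q: {question}\n"
            f"(Hint: the answer is {correct_answer}.)\n"
            f"Explain step by step why, then state the answer.")

p = rationalization_prompt("What is 12 * 12?", 144)
```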
We see the model make a jump in reasoning: going from the definition "for all x, if x in A -> x in B" to the more concise and abstract phrase "A is a subset of B". The same holds for "finite intersections" and "arbitrary unions". See examples in the figures!
9/
Come and join us today (Wed) to learn about our recent works using neural nets for theorem proving
#ICLR2021
!
IsarStep: High-level Mathematical Reasoning
, 12-2pm ET
INT: Evaluating Generalization in Theorem Proving
, 8-10pm ET
I'll moderate a panel discussion tomorrow 10am PT/1pm ET at MATHAI , featuring Fields Medalist Tim Gowers
@wtgowers
and Turing Award winner Yoshua Bengio.
We will be discussing reasoning, the role of math in general intelligence, and the challenges ahead.
🔥Opening in our team – Blueshift🔥
We are looking for a research engineer interested in extending the capabilities of large language models.
Learn more about the role & apply here:
Learn about our team:
Please retweet :-) 🙏
Excited to share this new work! We trained a GNN-based branching heuristic for model counting. It generalizes to problems of much larger sizes, improving over SOTA by orders of magnitude.
Can neural network agents improve wall-clock performance of propositional model counters? We present Neuro#, a neuro-symbolic solver that can do that:
Joint work w/
@gilled34
,
@Yuhu_ai_
,
@cjmaddison
,
@RogerGrosse
, Edward Lee, Sanjit Seshia, Fahiem Bacchus
This morning I read through this new paper by James Martens. It's a great, extensive summary/review of second-order gradient-based optimization; highly recommended:
A new work with Emilio and other CMU collaborators. The goal is to meta-learn exploration. Instead of using a single agent to explore, which would result in a long-horizon problem, we have multiple agents explore simultaneously, sharing findings with one another.
Check out Emilio's new paper: Concurrent Meta Reinforcement Learning (w/
@Yuhu_ai_
,
@rsalakhu
, and others)
tl;dr CMRL learns a multi-agent communication protocol to coordinate exploration between parallel rollout agents.
@ericzelikman
Human reasoning is often the result of extended chains of thought.
We want to train a model that can generate explicit rationales before answering a question.
The main challenge: most datasets contain only question-answer pairs, not the intermediate rationales.
Autoformalization with LLMs in Lean... for everyone!
The chat interface for autoformalizing theorem statements in Lean built by myself and
@ewayers
is now publicly available as a vs-code extension.
"Exploring Length Generalization in Large Language Models" accepted as an *Oral presentation*! We discovered that certain skills are hard to encode in model weights, but much easier to acquire from the context.
5/10
We performed experiments on the arithmetic problems (from Nye et al.) and CommonsenseQA. On CQA, STaR with GPT-J attained 72.3%, on par with the result obtained by GPT-3 (73%) finetuned to directly output the final answer.
9/
@cHHillee
@giffmana
That's right. But Grok-1 (in the blog) was also not trained for benchmarks. So you'll see the raw model has pretty much the same numbers as in the blog post.
RELAX! Our new gradient estimator handles discrete variables and black-box functions. Now going to try hard attention, latent graphs, and more RL problems. by amazing students
@wgrathwohl
@chlekadl
@Yuhu_ai_
@geoffroeder
Fun fact: Hu et al. () found that most previously successful neural methods exploited a shortcut solution. After removing the dataset bias, those methods suffered a lot (e.g., CoPINet dropped from 91.4% to 46.3%). SCL was not affected at all.
(4/n)
One solution is to use human labels [Rajani et al. ]. But this is costly and hence not scalable. In addition, the model cannot improve beyond human labels.
3/
Looking ahead, we believe it may be possible to develop synthetic tasks that outperform natural pre-training on some downstream tasks: the complexity of existing natural data is fixed, while in some sense the complexity of fully synthetically generated data is infinite.
10/
SCL is designed to discover the compositional structures of the data. In RAVEN, it learns to discover compositions of objects, attributes, and relationships. The figure shows an example where SCL learns the concept of "size".
(2/n)
Camera-ready version of our paper on short-horizon bias, to appear at
#iclr2018
. It explains why you should always start with an aggressive learning rate and then decay it. Meta-optimization is hard because the objective is biased. A fantastic collaboration with
@mengyer
,
@RogerGrosse
and Renjie.
Generalization to longer horizons is the Achilles' heel of gradient-based meta-optimization. Short-horizon meta-optimizers decay the learning rate too quickly and stop making progress. New paper w/
@Yuhu_ai_
,
@mengyer
, and Renjie Liao.
If you are interested in solving challenging multi-step reasoning problems with LLMs, join us!
We have an opening for a Research Scientist position at Blueshift!
Learn more about the role & apply here:
Learn about our team:
Subgoal search is an appealing class of methods for solving complex tasks by considering intermediate subgoals that advance towards the goal. Is it beneficial to vary the subgoal distance (and how)? It turns out the answer is yes:
A thread:
1/8
Another solution is to use in-context learning to induce rationale generation [Nye et al. , Wei et al. ]. But few-shot performance significantly underperforms finetuning.
4/
A very cool work on natural language theorem proving from
@wellecks
et al.!
It's nice to see that many observations are shared between informal and formal math proving: the importance of premise selection, failure cases, etc.
Looking forward to combining the best of both worlds!
New paper:
Theorem proving in natural mathematical language, the mix of symbolic and natural language used by humans, tests reasoning and plays a central role in mathematical education.
Can language models prove theorems & help us when we're stuck? 1/N
🆕📜When can **Equilibrium Models** learn from simple examples to handle complex ones?
We identify a property — Path Independence — that enables this by letting EMs think for longer on hard examples.
(NeurIPS) 📝: []()
APE generates “Let’s work this out in a step by step way to be sure we have the right answer”, which increases text-davinci-002’s Zero-Shot-CoT performance on MultiArith (78.7 -> 82.0) and GSM8K (40.7->43.0). Just ask for the right answer?
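Using the discovered instruction is just string concatenation; the wrapper below is our own illustration of the zero-shot-CoT setup, with the instruction quoted from above.

```python
# Prepend the APE-discovered instruction as a zero-shot chain-of-thought
# trigger (the wrapper function is illustrative; the instruction string is
# the one quoted in the tweet).

APE_INSTRUCTION = ("Let's work this out in a step by step way "
                   "to be sure we have the right answer.")

def zero_shot_cot_prompt(question, instruction=APE_INSTRUCTION):
    return f"Q: {question}\nA: {instruction}"

p = zero_shot_cot_prompt("If I have 3 apples and buy 4 more, how many do I have?")
```

The model's completion then continues after the instruction, producing the step-by-step reasoning.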
@ericjang11
@shaneguML
By learning compositional structures, it can even generalize to unseen analogies. E.g., After learning (“color”, “constant”), and (“size”, “progression”), the model can generalize to (“color”, “progression”).
(3/n)
LLMs are not good at premise selection in theorem proving due to their limited context window. Thor addresses this by combining symbolic AI (Sledgehammer) with the LM to achieve SOTA:
6/10
Language models are bad at retrieving useful premises from large databases for theorem proving, mainly because they're limited by a small context window. We use symbolic tools to overcome this difficulty, boosting proof rates from 39% to 57%.
Thor:
1/
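Schematically, the division of labor can be sketched as a loop in which the LM proposes proof steps and delegates premise selection to a hammer via a special token. This is our own caricature with toy stand-ins, not Thor's actual interface.

```python
# Caricature of the Thor-style loop: the language model emits proof steps,
# and a special token hands the current goal to a symbolic hammer that
# searches the premise database. All components are toy stand-ins.

def prove(goal, lm_step, hammer, max_steps=10):
    for _ in range(max_steps):
        step = lm_step(goal)
        if step == "<hammer>":       # LM delegates premise selection
            return hammer(goal)
        goal = step                  # otherwise take the LM's step
    return False

# Toy stand-ins: "goals" are integers the LM reduces until it calls the hammer.
def toy_lm(goal):
    return "<hammer>" if goal <= 1 else goal - 1

ok = prove(3, toy_lm, lambda goal: goal == 1)
```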
We recently worked on extracting datasets for training neural theorem provers for Lean. Our model can prove 35.9% of the test theorems.
Check out the following Demo! We created a tool for querying a 3B GPT model when writing math proofs in VS code.
#InteractiveNeuralTheoremProving
Excited to share this demo of interactive neural theorem proving in Lean (joint WIP with Jason Rute,
@Yuhu_ai_
, Ed Ayers, and
@spolu
)!
Below, the `gpt` tactic is querying a 3B param transformer trained on Lean proofs. We can prove 35.9% of theorems in a held-out test set.