🔥Breaking News from Arena
Google's Bard has just made a stunning leap, surpassing GPT-4 to claim the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement!
The race is heating up like never before! Super excited to see what's next for Bard + Gemini
Thrilled to announce🦉Minerva: a large language model capable of solving mathematical problems using step-by-step reasoning in natural language.
See blog here: and samples here: (1/n)
Very excited to present Minerva🦉: a language model capable of solving mathematical questions using step-by-step natural language reasoning.
Combining scale, data, and other techniques dramatically improves performance on the STEM benchmarks MATH and MMLU-STEM.
Gemini Pro 1.5 is here! 10M token context window (1M in production for now), comparable evals to Ultra 1.0, and considerably more compute efficient. This model can perceive and answer questions about 10 hours of video, or 100 hours of audio, or 300K lines of code, and will
To show what’s possible with the vast context window in Gemini 1.5 Pro, we prompted it with the three.js example code - over 100,000 lines of code / 800K+ tokens!
(That’s not even the max, it can handle millions of tokens 😀)
Gemini was able to process all the code
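A quick back-of-envelope check on those figures (a sketch; the ~8 tokens-per-line density is an assumption inferred from the quoted 100K lines ≈ 800K+ tokens, not from the source):

```python
# Back-of-envelope context-window arithmetic for the figures above.
# Assumption (not from the source): ~8 tokens per line of code on average,
# consistent with the quoted 100K lines of three.js code being 800K+ tokens.

def estimate_code_tokens(lines_of_code: int, tokens_per_line: float = 8.0) -> int:
    """Estimate the token count of a codebase from its line count."""
    return int(lines_of_code * tokens_per_line)

threejs_tokens = estimate_code_tokens(100_000)
print(threejs_tokens)  # 800000 -- fits comfortably in a 1M-token window
```

At that density, the 10M-token research window would hold on the order of a million lines of code.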
All you really need to be happy is health, gratitude, fulfilling relationships, a roof over your head, an exascale compute cluster, and a few trillion tokens. Is that too much to ask?
Exciting times, welcome Gemini (and MMLU>90)! State-of-the-art on 30 out of 32 benchmarks across text, coding, audio, images, and video, with a single model 🤯
Co-leading Gemini has been my most exciting endeavor, fueled by a very ambitious goal. And that is just the beginning!
I'm at #NeurIPS22! Interested in LLMs, tool use, scaling, reasoning, AGI, or good restaurants in New Orleans? DM me if you'd like to meet, or stop by our poster on Minerva (Tues 4-6p, no. 920)
Google's latest 540B language model solves 1/3 of STEM undergrad problems from MIT and reaches 50% accuracy on MATH, which @JacobSteinhardt predicted would happen by 2025
Can confirm. Sergey has been in the bullpen with the rest of us since the beginning of Gemini, shares our deep fascination with this technology, and recognizes that this is a transformative moment in the history of humanity. His engagement has been immensely motivating for all.
BIG Bench is not only a fascinating collection of tasks for LLMs, it's also a shining example of how open and collaborative research should be organized. Glad to have been part of it!
After 2 years of work by 442 contributors across 132 institutions, I am thrilled to announce that the paper is now live: . BIG-bench consists of 204 diverse tasks to measure and extrapolate the capabilities of large language models.
"Grokking" is weird: Neural Networks trained to fill in binary operation tables will quickly overfit to the training data, but after many, many steps suddenly "get it" and achieve 100% validation accuracy.
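A minimal sketch of the setup described above (illustrative only, not the original experiments): build the full table of a binary operation, such as addition mod p, and split its cells into train/validation sets that a small network would be trained to fill in.

```python
import numpy as np

# Sketch of a grokking-style dataset: the complete table of a binary
# operation (here, addition mod p), split into train/validation cells.
p = 97
rng = np.random.default_rng(0)

# All (a, b) input pairs and their labels (a + b) mod p.
a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
pairs = np.stack([a.ravel(), b.ravel()], axis=1)   # shape (p*p, 2)
labels = (pairs[:, 0] + pairs[:, 1]) % p           # shape (p*p,)

# Random 50/50 split: a network quickly memorizes the train cells, and
# only much later "groks" and generalizes to the held-out cells.
perm = rng.permutation(p * p)
split = (p * p) // 2
train_idx, val_idx = perm[:split], perm[split:]
print(len(train_idx), len(val_idx))
```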
🤯 When you realize everyone you've worked for or considered working for is now casually hanging out with the President and VP! 🇺🇸 Hey @JoeBiden, @KamalaHarris, can I join the next reunion? 🤔 I promise to bring my A(I)-game!
Today, @POTUS and @VP met with CEOs to underscore the fundamental responsibility companies have to ensure their AI products are safe before they’re released, and the importance of responsible American innovation in AI that protects people’s rights and safety.
Nice post comparing vision capabilities of #GeminiAI Pro, GPT-4V, and open source models (CogVLM-17B, Qwen-VL, LLaVA, etc.) from @roboflow. Wanted to note that Ultra does correctly solve the document OCR and image / serial number OCR tasks.
To recruiters trying to poach top LLM talent: you will have a better chance if the opportunity you're pitching involves (1) pushing buttons on a computer and (2) collaborating with astonishingly intelligent people (3) in unified pursuit of (4) building some kind of software God
Supercharge your terminal with this utility for calling Gemini. Example uses:
ls -al | gem "Which filenames are likely to be scientific PDFs?"
cat server.log | gem "Why is this error happening? How do I fix it?"
cat meeting_notes.txt | gem "Extract the key decisions from these notes"
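A hypothetical sketch of how such a utility might be wired up (the `gem` command name, the model name, and the client library mentioned in the comments are all assumptions, not confirmed by the source):

```python
# Hypothetical sketch of a "gem" pipe utility: read piped stdin, append the
# user's question, and send the combined prompt to a model.
import sys

def build_prompt(piped_input: str, question: str) -> str:
    """Combine piped text (e.g. `ls -al` output) with the user's question."""
    return f"{question}\n\n--- piped input ---\n{piped_input}"

def main() -> None:
    prompt = build_prompt(sys.stdin.read(), " ".join(sys.argv[1:]))
    # Hypothetical API call (client library and model name are assumptions):
    #   import google.generativeai as genai
    #   print(genai.GenerativeModel("gemini-pro").generate_content(prompt).text)
    print(prompt)  # placeholder: echo the assembled prompt

demo = build_prompt("server.log contents", "Why is this error happening?")
print(demo)
```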
It’s the final session of #MAICON22 and we’re diving deep with a fireside chat featuring @vedantmisra, where @paulroetzer is interviewing him about how #AI is transforming not only business but humanity in general.
Writing computer code is a great evaluation of reasoning capabilities, and it was a thrill to work on Codex! All we need now is models that write research-grade deep learning code... Check out the paper here:
Welcome, @github Copilot — the first app powered by OpenAI Codex, a new AI system that translates natural language into code.
Codex will be coming to the API later this summer.
Technology is improving at an increasingly unpredictable pace, and it's becoming harder to convince oneself that there's no major socioeconomic transformation on the immediate horizon 1/
It took us longer to go from the Mark I Perceptron to LSTMs (39 years) than from LSTMs to #ChatGPT (25 years). It took us longer to go from Word2Vec to Transformers (4 years) than from Transformers to #OpenAI's #GPT3 (3 years). Exponential growth is wild. #AI #MachineLearning
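The gaps can be checked against commonly cited milestone dates (a sketch; the years below are the usual attributions, not taken from the source):

```python
# Gaps between commonly cited AI milestone years.
milestones = {
    "Mark I Perceptron": 1958,  # Rosenblatt's hardware perceptron
    "LSTM": 1997,               # Hochreiter & Schmidhuber
    "Word2Vec": 2013,
    "Transformer": 2017,        # "Attention Is All You Need"
    "GPT-3": 2020,
    "ChatGPT": 2022,
}

def gap(earlier: str, later: str) -> int:
    """Years between two milestones."""
    return milestones[later] - milestones[earlier]

print(gap("Mark I Perceptron", "LSTM"))  # 39
print(gap("LSTM", "ChatGPT"))            # 25
print(gap("Word2Vec", "Transformer"))    # 4
print(gap("Transformer", "GPT-3"))       # 3
```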
5 / n: We evaluated whether Minerva memorizes solutions by modifying problems to introduce changes in their framing or numerical content, and compared accuracy over sampled solutions before and after the modification; our results suggest minimal memorization.
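The perturbation idea can be sketched like this (illustrative only; the paper's exact procedure may differ): modify a problem's numerical content, then compare solve rates before and after.

```python
import re

# Illustrative sketch of a memorization probe: shift every integer in a
# problem statement, then compare accuracy on original vs. modified problems.
# Similar accuracy suggests the model is solving, not reciting.
def perturb_numbers(problem: str, shift: int = 1) -> str:
    """Shift every integer in the problem statement by `shift`."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + shift), problem)

original = "A train travels 60 miles in 2 hours. What is its speed?"
modified = perturb_numbers(original)
print(modified)  # "A train travels 61 miles in 3 hours. What is its speed?"
```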
We show that if you train a network the size of a lab rat's brain on a dataset of 800B tokens, it achieves SoTA performance on code and natural language, reasons better than the average ten-year-old, and learns to explain jokes. What happens if we train on more data for longer?
Introducing the 540 billion parameter Pathways Language Model. Trained on two Cloud #TPU v4 pods, it achieves state-of-the-art performance on benchmarks and shows exciting capabilities like mathematical reasoning, code writing, and even explaining jokes.
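For a sense of scale (a back-of-envelope estimate, not a figure from the source): 540B parameters at 2 bytes each in bfloat16 is roughly a terabyte of raw weights.

```python
# Back-of-envelope memory footprint for a 540B-parameter model.
params = 540e9
bytes_per_param = 2            # bfloat16
total_bytes = params * bytes_per_param
print(total_bytes / 1e12)      # about 1.08 TB of raw weights
```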
Learn from the mistakes of others. You can't live long enough to make them all yourself, and off-policy learning has better sample efficiency. - Eleanor Roosevelt
We’ve developed two neural networks which have learned by associating text and images. CLIP maps images into categories described in text, and DALL-E creates new images, like this, from text.
A step toward systems with deeper understanding of the world.
An AI using this GPT thing which folks go on about, has golfed two short proofs in mathlib, Lean's maths library :o
So we now have 134 human contributors (including several ICL undergraduates) and 1 computer.
Thanks to @jessemhan for letting me know!
I don't have an academic degree. @ilyasut has a PhD (though he dropped out of high school!). We do not select on the basis of academic credentials, but on evidence of exceptional ability.
2 / n: Minerva is based on PaLM🌴and was trained on a large dataset of scientific papers and webpages with mathematical content. Combining scale, data, and inference techniques dramatically improves performance on the MATH benchmark and on STEM problems in MMLU.
3 / n: We also evaluated our model on a dataset of over 200 STEM undergraduate problems from MIT and found that it could solve nearly a third of them. We found that our model even outperforms the national average in Poland’s 2022 National Math Exam.
4 / n: In addition to scale and data, by using chain-of-thought / scratchpad prompting and majority voting to boost performance, Minerva achieves state-of-the-art performance on technical benchmarks without the use of external tools such as a Python interpreter.
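The majority-voting step can be sketched like this (a minimal illustration, not the paper's implementation): sample many chain-of-thought solutions, extract each final answer, and return the most common one.

```python
from collections import Counter

def majority_vote(final_answers: list) -> str:
    """Return the most common final answer among sampled solutions."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g. 64 sampled solutions whose extracted final answers mostly agree:
samples = ["42"] * 40 + ["41"] * 15 + ["7"] * 9
print(majority_vote(samples))  # 42
```

The intuition: independent reasoning paths that reach the same answer are more likely to be correct than any single sampled path.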
We're participating in WELM/BIG-Bench, a collaborative effort to measure the capabilities and limitations of large language models. You can submit a task for the benchmark and/or workshop here:
@mpshanahan @ilyasut What makes you say there is no space and time in a computer? If you think only meat can be conscious, you *might* be in thrall to an overly simplistic definition of consciousness. The most basic criterion is for there to be a subjective experience, not a body.
"Computer, write me a screenplay": @DeepMind shipped Dramatron, demonstrating an approach to hierarchical generation of coherent long-form stories () 7/
Today in @Nature: #AlphaTensor, an AI system for discovering novel, efficient, and exact algorithms for matrix multiplication - a building block of modern computations. AlphaTensor finds faster algorithms for many matrix sizes. 1/
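To ground what a "faster matrix multiplication algorithm" means, here is the classical example that predates AlphaTensor: Strassen's scheme multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8. AlphaTensor searches for analogous decompositions across many matrix sizes.

```python
import numpy as np

def strassen_2x2(A, B):
    """Strassen's 2x2 matrix multiply: 7 multiplications instead of 8."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
assert (strassen_2x2(A, B) == A @ B).all()  # matches the naive product
```

Applied recursively to block matrices, saving one multiplication per 2x2 step drops the asymptotic cost from O(n^3) to about O(n^2.807).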
It's only a matter of time before you can command your home robot to bring you lunch, and your TV to render up an alternative ending to Game of Thrones, just by thinking it 2/
@andrew_n_carr This partitions all hairs into sets A and B such that all elements of A end up on the floor and all elements of B remain on the scalp, with each element of B longer than each element of A