A Google DeepMind paper on MoE. It proposes Soft MoE as an alternative to the existing Sparse MoE approach, reworking the router — the model that assigns work to each expert. It keeps scaling even when the number of experts is increased to 4096. Is DeepMind heading in this direction too? And what happened to the scaling laws?
From Sparse to Soft Mixtures of Experts
The soft MoE layer assigns each slot a weighted average of all input tokens, whereas the sparse MoE router assigns discrete input tokens to slots.
MoE is known as a way to increase model size without paying the full computational cost.
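To make the soft-vs-sparse contrast concrete, here is a minimal NumPy sketch of a Soft MoE layer in the spirit of the paper: a single parameter matrix `phi` produces token–slot logits; a softmax over tokens gives dispatch weights (each slot becomes a weighted average of all tokens), experts process slots, and a softmax over slots gives combine weights that mix expert outputs back per token. Function names and shapes are my own illustration, not the paper's code.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(X, phi, experts):
    # X: (n_tokens, d) input tokens; phi: (d, n_slots) slot parameters
    logits = X @ phi                  # (n_tokens, n_slots) token-slot affinities
    D = softmax(logits, axis=0)       # dispatch: each slot's weights over tokens
    C = softmax(logits, axis=1)       # combine: each token's weights over slots
    slot_inputs = D.T @ X             # every slot is a weighted avg of ALL tokens
    # round-robin slots over experts for this toy example
    slot_outputs = np.stack([
        experts[i % len(experts)](slot_inputs[i])
        for i in range(phi.shape[1])
    ])
    return C @ slot_outputs           # each token mixes the expert outputs
```

Because every slot is a soft mixture rather than a hard token assignment, there is no token dropping or load-balancing loss, which is what lets the expert count grow so far.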
Accelerating LLM Inference with Staged Speculative Decoding
Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in
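For context, here is a toy greedy sketch of the base speculative decoding loop that staged speculative decoding builds on: a cheap draft model proposes k tokens, the target model verifies them, and the longest agreeing prefix is accepted. The paper's staged variant additionally restructures drafts as a tree and speculatively decodes the draft model itself; none of that is shown here, and the callables are made-up stand-ins for real models.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=32):
    """target(ctx) / draft(ctx) return the model's greedy next token for ctx."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # draft proposes k tokens autoregressively (cheap)
        proposal, ctx = [], tokens[:]
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # the target's choice at each proposed position; a real implementation
        # scores all k+1 positions in ONE batched forward pass
        checks = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # accept the longest agreeing prefix, then take the target's token
        n_ok = 0
        while n_ok < k and proposal[n_ok] == checks[n_ok]:
            n_ok += 1
        tokens += proposal[:n_ok] + [checks[n_ok]]
    return tokens[: len(prompt) + max_new]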
OpenAI CEO: Superintelligent AIs are coming soon
Joe Rogan: “When you started OpenAI, what kind of timeline did you have in mind, and has it stayed on that timeline?”
Sam Altman: “AGI isn’t the end point. To accomplish [ASI - Artificial Superintelligence], that’ll take us until
Training our first RedPajama 7B model is going well! Less than halfway through training (after 440 billion tokens), the model achieves better results on HELM benchmarks than the well-regarded Pythia-7B trained on the Pile.
Details at
Apple is reportedly spending millions of dollars per day to train Ajax, its most advanced language model, which they believe to be more powerful than ChatGPT.
Though Apple may seem late to the game on AI, don't underestimate a company with a $300B research war chest.
We can estimate the size of OpenAI models by timing how fast they run.
If I assume that runtime is proportional to model size, and that GPT-4 has 220B parameters, this is what I get 👇
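The estimate above is just a linear proportion: if per-token decode latency scales with parameter count on the same serving stack, a known model calibrates the unknown one. A minimal sketch — the 220B figure and all latencies are the tweet's assumptions, not confirmed facts:

```python
def estimate_params(ref_params, ref_ms_per_token, obs_ms_per_token):
    """Assume decode latency per token is proportional to parameter count."""
    return ref_params * obs_ms_per_token / ref_ms_per_token

# If a reference model assumed to have 220B params emits tokens at 40 ms each,
# and a mystery model on the same stack takes 10 ms/token, the estimate is 55B.
estimate_params(220e9, 40, 10)
```

The proportionality assumption is the weak point: batching, quantization, MoE sparsity, and hardware differences all break it, so this only yields an order-of-magnitude guess.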
🔥🔥🔥
Introduce the newest WizardCoder 34B based on Code Llama.
✅WizardCoder-34B surpasses GPT-4, ChatGPT-3.5 and Claude-2 on HumanEval with 73.2% pass@1
🖥️Demo:
http://47.103.63.15:50085/
🏇Model Weights:
🏇Github:
The 13B/7B
New
@GoogleAI
paper! 📜
Language models repeat a user’s opinion, even when that opinion is wrong. This is more prevalent in instruction-tuned and larger models.
Finetuning with simple synthetic data reduces this behavior.
1/
Contrary to rumors, I assure you the Phi-2 model is downloadable, but it is against the license to redistribute it. You can login to Azure ML Studio (free with a basic Azure account), select it from the Model Catalog, and then download the files from the Artifacts tab.
I like this definition. I’d raise the bar to 95% of the performance on 95% of economically valuable jobs. This would necessarily require us to solve ROBOTICS. LLMs alone are not enough to realize physically embodied, generalist agents.
Robotics will be the last and by far the most
If Google didn't publish the Transformer paper, the history of AI (and possibly humanity) would be set back many years. Everyone would've been worse off.
Open research is a powerful strategy. It pains me to see an emerging trend of not only closing models, but also refusing to
I think DALL·E 3 is not just a stance against MidJourney. It's actually a sneak peek of the upcoming epic battle of massively multimodal LLMs against DeepMind's Gemini.
Quote: "DALL·E 3 is built natively on ChatGPT". This is the key phrase.
DALL·E 3's extraordinary language
‘when the capitalists have run out of ideas, the interest rates will go to zero’ seemed like a very interesting observation to me over most of the last decade.
but now i am interested in the inverse—when we have more and better ideas than ever before, what will happen to rates?
Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out. A 🧵:
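The idea behind sparse probing can be illustrated with a toy sketch: constrain a probe for some feature to at most k neurons; if a k-sparse probe still predicts the feature well, the selected neurons are candidate feature extractors. This is a simplified stand-in (mean-difference ranking plus a threshold probe), not the paper's actual method:

```python
import numpy as np

def k_sparse_probe(acts, labels, k=1):
    # acts: (n_samples, n_neurons) MLP activations; labels: (n_samples,) 0/1 feature
    # rank neurons by |mean activation difference| between classes
    mu1 = acts[labels == 1].mean(axis=0)
    mu0 = acts[labels == 0].mean(axis=0)
    top = np.argsort(-np.abs(mu1 - mu0))[:k]
    # probe = threshold on the selected neurons' summed activation
    # (assumes the feature raises activation; a real probe fits signs too)
    score = acts[:, top].sum(axis=1)
    thresh = 0.5 * (score[labels == 1].mean() + score[labels == 0].mean())
    accuracy = ((score > thresh).astype(int) == labels).mean()
    return top, accuracy
```

On synthetic data where one neuron carries the feature, the probe should recover exactly that neuron — the sparsity constraint is what localizes the feature to individual units rather than directions.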
📍Introducing an AI Dungeon Master’s Guide 🧙‍♂️, or how to make a
#DnD
DM dialogue agent trained with intents and theory-of-mind-inspired 💭 reinforcement learning.
Predicting how your players will react to you ahead of time makes for a better DM!
📃
For most companies, hiring more people is strictly better. However, this is often not true in AI research. AI research is often bottlenecked by compute, and when this is the case, hiring more researchers can be counter-productive.
I remember back at Google Brain, my manager once