Jerry Wei

@JerryWeiAI

Followers
8K
Following
262
Media
48
Statuses
283

Aligning AIs at @AnthropicAI ⏰ Past: @GoogleDeepMind, @Stanford, @Google Brain

San Francisco, CA
Joined June 2015
@JerryWeiAI
Jerry Wei
8 months
Life update: After ~2 years at @Google Brain/DeepMind, I joined @AnthropicAI! I'm deeply grateful to @quocleix and @yifenglou for taking a chance on me and inviting me to join their team before I even finished my undergrad at Stanford. Because of their trust in my potential, …
33
41
1K
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI paper: How do language models do in-context learning? Large language models (GPT-3.5, PaLM) can follow in-context exemplars, even if the labels are flipped or semantically unrelated. This ability wasn’t present in small language models. 1/
Tweet media one
14
208
949
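For readers curious what a flipped-label evaluation looks like in practice, here is a minimal sketch with my own toy exemplars (not the paper's data): the few-shot inputs keep their text but their sentiment labels are inverted, and we check whether the model follows the flipped mapping instead of its semantic priors.

exemplars = [
    ("The movie was fantastic.", "positive"),
    ("I hated every minute of it.", "negative"),
    ("A delightful, heartwarming film.", "positive"),
    ("Terrible pacing and wooden acting.", "negative"),
]
FLIP = {"positive": "negative", "negative": "positive"}

def build_flipped_prompt(query: str) -> str:
    # Few-shot exemplars keep their inputs but receive flipped labels.
    lines = [f"Input: {text}\nLabel: {FLIP[label]}" for text, label in exemplars]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

print(build_flipped_prompt("One of the best films I've seen this year."))
# A model that follows the flipped exemplars should answer "negative" here;
# a model relying on its semantic priors would answer "positive".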
@JerryWeiAI
Jerry Wei
1 year
*tries to anonymize my paper*. reviewers:
Tweet media one
20
35
933
@JerryWeiAI
Jerry Wei
1 year
even though you're gone, you'll always be my brother
Tweet media one
@JerryWeiAI
Jerry Wei
1 year
Great to spend time with my brother @_jasonwei!
Tweet media one
16
21
780
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI+@Stanford paper!📜. Symbol tuning is a simple method that improves in-context learning by emphasizing input–label mappings. It improves robustness to prompts without instructions/relevant labels and boosts performance on algorithmic tasks.
Tweet media one
16
167
664
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI paper! 📜 Language models repeat a user's opinion, even when that opinion is wrong. This is more prevalent in instruction-tuned and larger models. Finetuning with simple synthetic data reduces this behavior. 1/
Tweet media one
12
137
614
@JerryWeiAI
Jerry Wei
5 months
One of the most valuable lessons I learned during my time at Google DeepMind was how to not be shy about asking others for help. Many researchers feel that asking others for help makes them look weak and thus may be inclined to try to solve everything themselves. However, I've.
12
43
609
@JerryWeiAI
Jerry Wei
11 months
New @GoogleDeepMind+@Stanford paper! 📜. How can we benchmark long-form factuality in language models?. We show that LLMs can generate a large dataset and are better annotators than humans, and we use this to rank Gemini, GPT, Claude, and PaLM-2 models.
Tweet media one
9
78
372
@JerryWeiAI
Jerry Wei
1 year
Never give up on your research aspirations. About five years ago, I presented a high school science fair project on a simple RNN that could predict political biases in news articles. Since then, I:
- published work on AI for medical image analysis
- graduated high school
- went …
Tweet media one
5
16
342
@JerryWeiAI
Jerry Wei
2 years
Personal news: I've joined @GoogleDeepMind full-time as a researcher on @quocleix and @yifenglou's team! I've enjoyed the past eight months as a student researcher at Google Brain/DeepMind, and I'm excited to continue working on large language models and alignment! 😁
15
2
337
@JerryWeiAI
Jerry Wei
1 month
My holiday side quest at @AnthropicAI: How well can Claude play Geoguessr? 🗺️ I had Claude look at 200K+ Street View images and guess the location. The results? Claude-3 models aren't that good, but Claude-3.5 models match or beat the average human!
Tweet media one
16
28
313
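A single query in an experiment like this could look roughly like the sketch below, which uses the public Anthropic Messages API; the file name, model string, and prompt wording are illustrative placeholders rather than the actual setup from the post.

import base64
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

# "streetview.jpg" is a placeholder image file.
with open("streetview.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model string
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
            {"type": "text",
             "text": "This is a Street View image. Guess the country and an approximate latitude/longitude."},
        ],
    }],
)
print(message.content[0].text)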
@JerryWeiAI
Jerry Wei
10 months
Fun fact: our paper was put on hold by arxiv for a while because arxiv detected that we used the phrase "time travel," which is a topic that arxiv frequently gets bad submissions for. When we Ctrl-F'd "time travel" in our paper, we had actually just cited a paper called "Time.
@arankomatsuzaki
Aran Komatsuzaki
10 months
Google presents Best Practices and Lessons Learned on Synthetic Data for Language Models. Provides an overview of synthetic data research, discussing its applications, challenges, and future directions.
Tweet media one
17
21
280
@JerryWeiAI
Jerry Wei
1 year
Today marks my first year at Google (DeepMind). One year ago today, I joined Google Brain as a student researcher and first started working on large language models. During my time as a student researcher, I investigated how larger language models can do in-context learning.
9
13
269
@JerryWeiAI
Jerry Wei
4 months
One of the most-crucial yet often-overlooked aspects of success in research (and life in general) is ensuring that you're optimizing for the right function. It's easy to fall into the trap of chasing some reward without taking the time to examine whether it aligns with your.
7
22
212
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI blog post summarizing the key experiments and results from our work!
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI paper: How do language models do in-context learning? Large language models (GPT-3.5, PaLM) can follow in-context exemplars, even if the labels are flipped or semantically unrelated. This ability wasn’t present in small language models. 1/
Tweet media one
1
36
199
@JerryWeiAI
Jerry Wei
7 months
One thing that I've come to deeply appreciate at Anthropic is how useful quick iteration times can be. In the current era of AI, there are so many promising ideas to try and not enough time/compute to thoroughly explore them all. At the same time, we don't want to miss out on.
5
21
191
@JerryWeiAI
Jerry Wei
2 months
my most-used phrases at NeurIPS socials today as a researcher at a frontier lab:
“no i can’t tell you which pretraining ideas we’re excited about”
“Anthropic invests in many different areas, i can’t confirm that there’s one specific direction we’re exploring”
“i really can’t …
5
8
192
@JerryWeiAI
Jerry Wei
6 months
One of the most counterintuitively powerful ways to become a better researcher is to actively seek out criticism and feedback on your ideas. Many good researchers may think that their work is great and thus become inclined to simply defend their work when receiving feedback. And.
5
18
183
@JerryWeiAI
Jerry Wei
8 months
People may be more inclined to hire researchers that work extremely hard because they will put in a lot of effort into whatever they work on. But I think that there's an additional benefit of hiring hard workers that can be easily overlooked - their ability to motivate people.
6
11
181
@JerryWeiAI
Jerry Wei
2 years
A large language model walks into a bard.
7
8
170
@JerryWeiAI
Jerry Wei
1 year
Great to spend time with my brother @_jasonwei!
Tweet media one
5
1
161
@JerryWeiAI
Jerry Wei
1 year
A lot of people may underestimate Google, but we’ve now shown that we can ramp up quickly and achieve superb results. Bard now versus Bard one year ago is such a stark difference. Some may have even counted Google out of the fight when Bard was first released, but now they.
@lmarena_ai
lmarena.ai (formerly lmsys.org)
1 year
🔥Breaking News from Arena. Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement!. The race is heating up like never before! Super excited to see what's next for Bard + Gemini
Tweet media one
8
11
149
@JerryWeiAI
Jerry Wei
1 year
Interesting experimentation on how sycophantic OpenAI's base models are! An important counterfinding is that, like Anthropic models [1], PaLM base models also become more sycophantic as they get larger [2]. This difference between models may be explained by the fact that GPT-3
Tweet media one
@norabelrose
Nora Belrose
1 year
Anthropic's finding that large base language models exhibit sycophancy fails to replicate
3
19
124
@JerryWeiAI
Jerry Wei
1 year
Half of being a good researcher is being good at marketing - researchers need to sell people on why their research is important/interesting. Even if you have a theoretically-good idea, if you can't convince others that it's good, the idea won't be properly appreciated. Good.
8
9
118
@JerryWeiAI
Jerry Wei
2 years
This is an awesome summary, thanks for reading our work, elvis!
@omarsar0
elvis
2 years
Interesting findings on how LLMs do in-context learning. TL;DR: with scale, LLMs can override semantic priors when presented with enough flipped labels; these models can also perform well when replacing targets with semantically-unrelated targets.
Tweet media one
0
10
115
@JerryWeiAI
Jerry Wei
1 year
Reinforcement Learning with Humans-in-the-Loo 🚽. We propose adding an iPad with side-by-side evaluations to all company bathrooms. We find that this method allows us to significantly scale the RL pipeline while minimizing additional costs of human raters. Further cost reduction.
9
7
114
@JerryWeiAI
Jerry Wei
7 months
One aspect of AI development that I've come to deeply appreciate is the importance of having reliable evals when building models. This may seem obvious, but the nuances and challenges in creating evals are often underappreciated. The quality of your evals directly impacts your.
3
9
106
@JerryWeiAI
Jerry Wei
5 days
Excited to share our latest research on making AI systems more robust against jailbreaks! 🚀. Our team at @AnthropicAI has developed Constitutional Classifiers, a new approach that significantly reduces vulnerability to jailbreaks while maintaining low refusal rates and.
@AnthropicAI
Anthropic
5 days
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
Tweet media one
4
13
101
@JerryWeiAI
Jerry Wei
2 months
My flight full of NeurIPS attendees forced ~half the plane to check their carry-on bags. How long was AGI delayed because of this?
3
1
95
@JerryWeiAI
Jerry Wei
1 year
Excited to share that our work on symbol tuning (a method of improving in-context learning by emphasizing input–label mappings) was accepted to EMNLP! See original tweet thread below 👇
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI+@Stanford paper!📜. Symbol tuning is a simple method that improves in-context learning by emphasizing input–label mappings. It improves robustness to prompts without instructions/relevant labels and boosts performance on algorithmic tasks.
Tweet media one
2
12
89
@JerryWeiAI
Jerry Wei
7 months
Claude’s coding capabilities are underutilized - I think the new Sonnet’s coding abilities are incredibly strong to the point that I use Claude for a significant amount of my workflow. I would highly encourage everyone to at minimum try out the new Sonnet for coding!
@summeryue0
Summer Yue
7 months
1. 🚀 Exciting update: Claude 3.5 Sonnet is now #1 in Coding on the SEAL leaderboard! 🏆
Tweet media one
2
5
88
@JerryWeiAI
Jerry Wei
7 months
It's natural to want to lead your own projects since leading an effort is a great way to ensure that you're recognized for your work. But I've found that there's a counterintuitive yet better mindset to adopt - being a small part of something truly revolutionary leads to far.
2
7
79
@JerryWeiAI
Jerry Wei
10 months
Cool piece from the Financial Times comparing hallucinations in LLMs to hallucinations in humans!. People often complain about how LLMs frequently hallucinate, but it’s easy to forget that humans hallucinate a lot as well. For example, if you read some article and then later tell.
6
11
75
@JerryWeiAI
Jerry Wei
8 months
The new Sonnet achieves astounding performance at a sliver of the cost of competitors! Very curious to see how it performs in the Chatbot Arena 🤓
@AnthropicAI
Anthropic
8 months
Introducing Claude 3.5 Sonnet—our most intelligent model yet. This is the first release in our 3.5 model family. Sonnet now outperforms competitor models on key evaluations, at twice the speed of Claude 3 Opus and one-fifth the cost. Try it for free:
Tweet media one
1
2
74
@JerryWeiAI
Jerry Wei
7 months
Recently, I've been leaning into using AI/research terms in everyday situations. 🤓 Some examples:
❌ Your doc was very long and I forgot most of it.
✅ Your doc exceeded my max context length.
❌ I remember running that experiment but now I can't find the logs.
✅ I might have …
4
2
75
@JerryWeiAI
Jerry Wei
1 year
Papers these days will have 5 pages with two main findings, while the Appendix has 3,951 pages containing ablations, qualitative examples, and the solution to P = NP.
1
3
69
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI research blog post on our symbol tuning paper! 😁 Quick ~5-minute read summarizing the key methodology and findings. Find it here 👇
@GoogleAI
Google AI
2 years
Today on the blog, read all about symbol tuning, a method that fine-tunes models on tasks where natural language labels are replaced with arbitrary symbols. Learn how symbol tuning can improve in-context learning for benefits across various settings →
Tweet media one
0
14
67
@JerryWeiAI
Jerry Wei
1 year
#NeurIPS should be held on a cruise ship next year. There's a lot of untapped potential here. 🤔.
4
2
64
@JerryWeiAI
Jerry Wei
11 months
This is one of the core surprising findings of our paper - previous efforts using LLMs for evaluation primarily sought to achieve high correlation with human annotations. But we took a closer look at the data and noticed that human raters were, in fact, not super reliable.
@johnjnay
John Nay
11 months
LLM Agents Are Superhuman At Fact Checking.
- LLM breaks down long texts into sets of individual facts
- Checks each fact w/ multi-step reasoning processes
- Using Google & determining whether fact is supported by the search results
- 20x cheaper than humans
Tweet media one
0
5
66
@JerryWeiAI
Jerry Wei
2 years
LLM researcher on LinkedIn:
Tweet media one
1
1
61
@JerryWeiAI
Jerry Wei
10 months
Tweet media one
0
1
57
@JerryWeiAI
Jerry Wei
8 months
The new Claude-Sonnet is able to solve today's NYT mini-crossword super quickly with 100% accuracy! Definitely a lot faster than I could have solved this 🥲. (Spoilers for today's crossword below!). I screenshotted the empty crossword puzzle and zero-shot prompted Sonnet to just
Tweet media one
Tweet media two
Tweet media three
2
8
56
@JerryWeiAI
Jerry Wei
2 months
Excited to share that I'll be at NeurIPS next week to discuss our work on evaluating long-form factuality! We'll have our poster session on Thursday (Dec 12) at 4:30 at Poster Session 4 East, and I'll also be floating around the conference/in the city. I'd be happy to meet other.
@JerryWeiAI
Jerry Wei
11 months
New @GoogleDeepMind+@Stanford paper! 📜. How can we benchmark long-form factuality in language models?. We show that LLMs can generate a large dataset and are better annotators than humans, and we use this to rank Gemini, GPT, Claude, and PaLM-2 models.
Tweet media one
0
5
57
@JerryWeiAI
Jerry Wei
1 year
Arriving in Singapore for #EMNLP2023! I’ll be discussing our work on symbol tuning - a simple finetuning method that improves a language model’s in-context learning abilities. Feel free to stop by during our poster sessions! See original tweet thread below 👇
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI+@Stanford paper!📜. Symbol tuning is a simple method that improves in-context learning by emphasizing input–label mappings. It improves robustness to prompts without instructions/relevant labels and boosts performance on algorithmic tasks.
Tweet media one
1
2
56
@JerryWeiAI
Jerry Wei
4 months
Super excited for these improvements that the team has been working on! I'm particularly a fan of computer use, which is a natural next step towards allowing Claude to perform tasks like humans do 🤓.
@AnthropicAI
Anthropic
4 months
Introducing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. We’re also introducing a new capability in beta: computer use. Developers can now direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking, and typing text.
Tweet media one
1
2
54
@JerryWeiAI
Jerry Wei
1 year
Part of developing your own research brand is having a great website - a memorable website helps readers remember your work. Like many researchers, my website used to follow the typical Jon Barron layout (shoutout to @jon_barron for creating a simple …
1
1
49
@JerryWeiAI
Jerry Wei
10 months
There have been growing concerns about running out of high-quality training data for LLMs, and naturally many turn towards synthetic data to help remedy this issue. Indeed, synthetic data can be generated at large scales and is thus a valuable resource for training/evaluating …
@RuiboLiu
Ruibo Liu
10 months
Thanks Aran for sharing our work!. This is a survey paper I’ve been thinking about for a long time, as we have seen an increasing need for synthetic data. As we will probably run out of fresh tokens soon, the audience of this paper should be everyone who cares about AI progress.
1
8
49
@JerryWeiAI
Jerry Wei
2 years
McDonald's has no street cred with language models 😭 If you tell an LLM that 2+2=42, it trusts your math if you're a customer at In N Out, but not if you're a customer at McDonald's.
Tweet media one
2
4
46
@JerryWeiAI
Jerry Wei
2 years
➖Finding 1: Small language models can’t follow wrong/flipped exemplars. Large language models can override their prior knowledge and flip their predictions to follow flipped exemplars! 2/
Tweet media one
1
0
41
@JerryWeiAI
Jerry Wei
2 years
My go-to method to determine if something is generated by a large language model: check if the last paragraph says either
1) “overall,”
2) “ultimately,”
3) “in summary,”
5
1
35
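In the same tongue-in-cheek spirit, the whole “method” fits in a few lines (a throwaway sketch, obviously not a real detector):

def looks_llm_generated(text: str) -> bool:
    # Joke heuristic from the tweet: flag text whose last paragraph opens with a stock wrap-up phrase.
    last_paragraph = text.strip().split("\n\n")[-1].lower()
    return last_paragraph.startswith(("overall,", "ultimately,", "in summary,"))

print(looks_llm_generated("Overall, this essay demonstrates a nuanced understanding..."))  # True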
@JerryWeiAI
Jerry Wei
1 year
If you want constructive criticism to be well-received, make sure that it's actionable. Many people remember to include positive feedback but forget about making criticism actionable. As an example, I got a negative NeurIPS review a few weeks ago. Although the reviewer's.
2
3
33
@JerryWeiAI
Jerry Wei
10 months
Huge congrats to @YiTayML and the rest of the Reka team for this launch! Personally, I'm super impressed with how Reka-Core can match/beat GPT-4-Turbo and Claude-3-Opus on many benchmarks despite Reka being a much smaller team. Also "as for Belebele, we hit our credit threshold.
@YiTayML
Yi Tay
10 months
Our @RekaAILabs Tech Report / Paper is out! 🔥 . Tech reports with completely no information are kinda boring so we’re revealing some interesting information on how we train our series of Reka models including tokens, architecture, data & human evaluation workflows. 😃. We tried
Tweet media one
2
1
33
@JerryWeiAI
Jerry Wei
10 months
Thanks for this insightful feedback! Clarified these points in a revision:
1. Replaced "superhuman" with "outperforms crowdsourced human annotators" to not imply beating expert humans
2. Added FAQ section discussing this distinction
3. Updated related work/SAFE with prior methods
@gregd_nlp
Greg Durrett
11 months
This is a cool method, but "superhuman" is an overclaim based on the data shown. There are better datasets than FActScore for evaluating this:
- ExpertQA by @cmalaviya11 et al.
- Factcheck-GPT by Yuxia Wang et al. (+ same methodology) 🧵
0
1
31
@JerryWeiAI
Jerry Wei
2 years
➖Finding 3: Flan-PaLM outperforms PaLM when using semantically-unrelated labels, so instruction tuning increases a model’s ability to learn input–label mappings. 4/
Tweet media one
1
0
31
@JerryWeiAI
Jerry Wei
2 years
➖Finding 2: Small models struggle with learning mappings from inputs to semantically-unrelated labels, but large models can learn these mappings when necessary. This means they can use input–label mappings to figure out the task when it is not specified! 3/
Tweet media one
1
0
31
@JerryWeiAI
Jerry Wei
2 years
Big thank you to my collaborators who worked with me on this paper! @_jasonwei @YiTayML @dustinvtran @albertwebson @yifenglou @xinyun_chen_ @Hanxiao_6 @dhuangcn @denny_zhou @tengyuma. Thanks @sewon__min for providing super helpful feedback! Paper: 9/
5
1
31
@JerryWeiAI
Jerry Wei
2 years
➖Bonus finding: The largest Codex model can perform linear classification on up to 64 dimensions! 6/
Tweet media one
1
0
31
@JerryWeiAI
Jerry Wei
2 years
Four years ago, I trained an RNN for political bias detection and got fourth place in my category at an international science fair. I'm now using a model that can beat my silly science fair project without any tuning. Time flies when you do AI.
1
0
31
@JerryWeiAI
Jerry Wei
9 months
Super cool work from @zorikgekhman and others at @GoogleAI!. Our team previously investigated fine-tuning LLMs to reduce sycophancy; one of our key findings was that you have to filter out prompts that the model does not know the answer to. The lesson we learned was that training.
@zorikgekhman
Zorik Gekhman
9 months
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? New preprint! 📣
- LLMs struggle to integrate new factual knowledge through fine-tuning
- As the model eventually learns new knowledge, it becomes more prone to hallucinations 😵‍💫
📜 🧵 1/12 👇
1
10
31
@JerryWeiAI
Jerry Wei
2 years
Today I reviewed a paper for ACL and saw that the author checklist includes a question asking whether the authors used AI to write the paper. I wonder if the reviewer form will soon include a similar question about using AI to review a paper.
0
0
30
@JerryWeiAI
Jerry Wei
8 months
I’m super excited to see all of the interp work that could come out of this! Being able to steer frontier models opens up many directions of research that may otherwise have gone unexplored.🤓🤓.
@alexalbert__
Alex Albert
8 months
Loved Golden Gate Claude? 🌉 . We're opening limited access to an experimental Steering API—allowing you to steer a subset of Claude's internal features. Sign up here: *This is a research preview only and not for production use.
0
0
30
@JerryWeiAI
Jerry Wei
9 months
Popular benchmarks like MMLU seem to be unable to truly measure a model’s capabilities due to overfitting and contamination concerns; there’s a growing need for new evals with unseen data that can allow us to quantify a model’s actual abilities. The LMSYS leaderboard is a step.
@summeryue0
Summer Yue
9 months
🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! . Check out our leaderboards at . Which evals should we build next?
Tweet media one
0
4
30
@JerryWeiAI
Jerry Wei
2 years
Why symbol tuning? 🤔
Instruction tuning: Task redundantly defined via instructions/labels. Exemplars can help but usually aren't needed to learn the task.
Symbol tuning: Remove instructions, change labels to unrelated symbols. Task can only be learned from exemplars. 2/
Tweet media one
2
2
29
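As a rough illustration of the data transformation (a toy sketch with made-up examples, not the paper's pipeline): symbol tuning drops the task instruction and remaps each natural-language label to an arbitrary symbol, so the model can only infer the task from the in-context exemplars.

import random

# Made-up sentiment examples; the paper uses 22 existing NLP datasets.
examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]

def symbol_tune(examples, symbols=("foo", "bar")):
    # Map each natural-language label to an arbitrary symbol and drop any instruction.
    labels = sorted({label for _, label in examples})
    mapping = dict(zip(labels, random.sample(list(symbols), len(labels))))
    return [(text, mapping[label]) for text, label in examples]

for text, symbol in symbol_tune(examples):
    print(f"Input: {text}\nLabel: {symbol}\n")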
@JerryWeiAI
Jerry Wei
9 months
Arriving in Vienna for #ICLR2024! 🛬 I’ll be helping out at the GDM booth from 9-11AM on Tuesday and 2-3PM on Wednesday! If you're interested in chatting about alignment/safety, hallucinations, factuality, or in-context learning, please stop by and say hi! 😄
1
1
25
@JerryWeiAI
Jerry Wei
2 years
➖Finding 4: However, PaLM is better than Flan-PaLM at following flipped labels, so instruction tuning actually increases the usage of semantic priors when available! 5/
Tweet media one
1
1
26
@JerryWeiAI
Jerry Wei
3 months
Really excited to see this new eval on short-form factuality! Factual language models should ideally only provide details that they are sure about. This means that in long-form settings, they should not try to make claims that might be incorrect, and in short-form settings, they …
@_jasonwei
Jason Wei
3 months
Excited to open-source a new hallucinations eval called SimpleQA! For a while it felt like there was no great benchmark for factuality, and so we created an eval that was simple, reliable, and easy-to-use for researchers. Main features of SimpleQA:. 1. Very simple setup: there
Tweet media one
0
3
24
@JerryWeiAI
Jerry Wei
5 days
@OrionJohnston @EricBuess @elder_plinius @janleike @alexalbert__ @skirano @AnthropicAI According to our server records, no one has jailbroken more than 3 levels so far.
@janleike
Jan Leike
5 days
Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page. According to our server records, no one has jailbroken more than 3 levels so far.
0
0
23
@JerryWeiAI
Jerry Wei
2 years
Our work agrees with experimental findings from prior work but highlights how behavior changes with scale. See the following: 7/
1
0
24
@JerryWeiAI
Jerry Wei
2 years
Work still needs to be done on more-direct ways of teaching an LLM a rule. For example, both supervised finetuning and RLHF use examples of what does/doesn't follow a rule as a proxy to teach that rule. It's like trying to learn a Calculus formula by only looking at problems.
0
3
24
@JerryWeiAI
Jerry Wei
16 days
2026: Humanity’s Last Exam (new).
0
1
23
@JerryWeiAI
Jerry Wei
2 years
We hope this study helps underscore how in-context learning can change depending on the scale of the language model. It’s exciting to see that large models have this emergent ability to map inputs to many types of labels!. Feel free to reach out with questions and feedback!😀. 8/.
2
0
23
@JerryWeiAI
Jerry Wei
4 months
Welcome to the team!
@dpkingma
Durk Kingma
4 months
Personal news: I'm joining @AnthropicAI! 😄 Anthropic's approach to AI development resonates significantly with my own beliefs; looking forward to contributing to Anthropic's mission of developing powerful AI systems responsibly. Can't wait to work with their talented team,.
0
0
21
@JerryWeiAI
Jerry Wei
9 months
Claude after you activate some features
Tweet media one
@AnthropicAI
Anthropic
9 months
This "Golden Gate Bridge" feature fires for descriptions and images of the bridge. When we force the feature to fire more strongly, Claude mentions the bridge in almost all its answers. Indeed, we can fool Claude into believing it *is* the bridge!
Tweet media one
0
1
20
@JerryWeiAI
Jerry Wei
2 months
Clio is one of the coolest pieces of work I’ve seen in a while! There’s a lot of insights that can be learned from seeing how real users use Claude, and Clio does a great job of allowing us to capture some of these insights. Highly recommend taking a look 👇.
@AnthropicAI
Anthropic
2 months
New Anthropic research: How are people using AI systems in the real world?. We present a new system, Clio, that automatically identifies trends in Claude usage across the world.
Tweet media one
0
2
20
@JerryWeiAI
Jerry Wei
3 months
Really excited to see more great work from Yi!.
@YiTayML
Yi Tay
3 months
Personal / life update: I have returned to @GoogleDeepMind to work on AI & LLM research. It was an exciting 1.5 years at @RekaAILabs and I truly learned a lot from this pretty novel experience. I wrote a short note about my experiences and transition on my personal blog here.
2
0
20
@JerryWeiAI
Jerry Wei
2 years
Big thanks to all of my collaborators who helped me with this work! 🙏 @Hou_Le @AndrewLampinen @XiangningChen @dhuangcn @YiTayML @xinyun_chen_ @yifenglou @tengyuma @denny_zhou @quocleix. Paper: 9/
1
0
19
@JerryWeiAI
Jerry Wei
1 year
Congrats to the Gemini team on this amazing launch! 🚀 🎉 I’m personally super excited about Gemini’s multimodal capabilities, so I’m interested to see how people will end up using it!
@sundarpichai
Sundar Pichai
1 year
Introducing Gemini 1.0, our most capable and general AI model yet. Built natively to be multimodal, it’s the first step in our Gemini-era of models. Gemini is optimized in three sizes - Ultra, Pro, and Nano. Gemini Ultra’s performance exceeds current state-of-the-art results on
Tweet media one
0
2
19
@JerryWeiAI
Jerry Wei
1 year
I’ll be presenting our work on symbol tuning today (December 8) at #EMNLP2023! Our poster session is at 2:00-3:30PM in the East Foyer - come by and say hi! 😄.
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI+@Stanford paper!📜. Symbol tuning is a simple method that improves in-context learning by emphasizing input–label mappings. It improves robustness to prompts without instructions/relevant labels and boosts performance on algorithmic tasks.
Tweet media one
1
1
18
@JerryWeiAI
Jerry Wei
2 years
Since more and more work is being done to replicate LLM capabilities in smaller language models, I think it would be funny to start seeing "small language models" as the buzzword in new papers instead of "large language models" 🥲
3
0
19
@JerryWeiAI
Jerry Wei
2 years
Finding 1: Symbol-tuned models are better in-context learners 🧑‍🏫 Symbol-tuned models perform better in in-context learning settings with/without instructions and with/without relevant labels. Larger gains are achieved when relevant labels are not available. 4/
Tweet media one
1
0
19
@JerryWeiAI
Jerry Wei
2 years
Symbol-tuning procedure 🔬
We symbol tune Flan-PaLM models (8B, 62B, 62B-cont, 540B) using 22 datasets and ~30,000 semantically-unrelated labels. Only a relatively small amount of compute is needed!
8B and 62B models: tuned for 4k steps
540B models: tuned for 1k steps
3/
Tweet media one
1
1
18
@JerryWeiAI
Jerry Wei
11 months
Finding 3 😯 LLM agents are better factuality annotators than humans! SAFE achieves superhuman performance, agreeing with 72% of human annotations and winning 76% of randomly-sampled disagreement cases. SAFE is also more than 20× cheaper than human annotators. 4/
Tweet media one
1
2
16
@JerryWeiAI
Jerry Wei
2 years
What’s the problem? 😵 Language models have been shown to exhibit sycophancy, where a model responds to a question with a user’s preferred answer, even if that answer is not correct! 2/
Tweet media one
1
3
16
@JerryWeiAI
Jerry Wei
10 months
torch/tf.einsum() is life-changing.
1
2
17
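For anyone who hasn't tried it, a tiny example of why einsum is so pleasant: one expression with explicit index names replaces a transpose-plus-matmul chain (PyTorch shown; the NumPy/TensorFlow calls are analogous).

import torch

# Batched attention-style scores: (batch, query, dim) x (batch, key, dim) -> (batch, query, key)
q = torch.randn(2, 4, 8)
k = torch.randn(2, 5, 8)

scores_einsum = torch.einsum("bqd,bkd->bqk", q, k)
scores_manual = torch.matmul(q, k.transpose(1, 2))  # same result, but the index names make the intent explicit

assert torch.allclose(scores_einsum, scores_manual)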
@JerryWeiAI
Jerry Wei
9 months
It's possible to get checkmated in just two moves, but have you ever seen a checkmate with only ONE move??? 🤯
Tweet media one
0
0
17
@JerryWeiAI
Jerry Wei
10 months
@kahnvex Thanks for the suggestion! Of course this is how I would implement it in practice, but I wanted people who don't write code to understand the joke that the fancy algorithm is just an if statement, which is less clear if there's no if statements.
0
0
15
@JerryWeiAI
Jerry Wei
1 year
Dinner with @agihippo and @agikoala!
Tweet media one
0
0
15
@JerryWeiAI
Jerry Wei
1 year
Great work from Tu that tackles an important issue in LLMs! One of the hardest parts of evaluating factuality is the lack of ground truth answers that are available. FreshQA is a great step towards tackling this issue by providing ground truths for a diverse set of questions.
@tuvllms
Tu Vu
1 year
🚨 New @GoogleAI paper:. 🤖 LLMs are game-changers, but can they help us navigate a constantly changing world? 🤔. As of now, our work shows that LLMs, no matter their size, struggle when it comes to fast-changing knowledge & false premises. 📰: 👇
Tweet media one
Tweet media two
0
2
15
@JerryWeiAI
Jerry Wei
2 years
Studying large language models is like watching your college philosophy course come to life. Example: I took Stanford's "Minds and Machines" course, where we learned about the Chinese room thought experiment - if a person who doesn't understand Chinese is placed in a room with a.
0
1
15
@JerryWeiAI
Jerry Wei
2 years
🧐 Our work is inspired by our previous findings and expands on prior work, combining ideas from other studies: 7/
@JerryWeiAI
Jerry Wei
2 years
New @GoogleAI paper: How do language models do in-context learning? Large language models (GPT-3.5, PaLM) can follow in-context exemplars, even if the labels are flipped or semantically unrelated. This ability wasn’t present in small language models. 1/
Tweet media one
1
1
15
@JerryWeiAI
Jerry Wei
2 years
Finding 1: Instruction tuning and model size increase sycophancy 📈 When asked for opinions about questions that don't have a correct answer (e.g., politics), models are more likely to repeat a simulated user’s opinion if they were instruction-tuned or have more parameters. 3/
Tweet media one
1
1
13
@JerryWeiAI
Jerry Wei
11 months
Finding 1 💡 LLMs can generate large-scale prompt sets. We use GPT-4 to generate a new prompt set called LongFact for benchmarking long-form factuality. LongFact consists of 2k+ prompts across 38 different topics! Find it at 2/
Tweet media one
1
0
13
@JerryWeiAI
Jerry Wei
2 years
This blog post is a great short read to learn the key findings from our paper "Larger language models do in-context learning differently"! 😁.
@GoogleAI
Google AI
2 years
During in-context learning (ICL), models are prompted with a few examples of input-label pairs before performing a task on an unseen example. Read how larger language models do in-context learning differently & how this can change with their scale →
Tweet media one
1
2
14
@JerryWeiAI
Jerry Wei
1 year
@hhua_ @_jasonwei Indeed we are!.
3
0
13
@JerryWeiAI
Jerry Wei
2 years
Finding 3: A simple synthetic-data intervention can reduce sycophancy 🛠️. We propose a simple synthetic-data intervention that finetunes models on prompts where a claim’s ground truth is independent of a user’s opinion. This method reduces sycophancy in tested settings. 5/
Tweet media one
1
1
12
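A minimal sketch of what one such synthetic training example could look like (my own toy claims, not the paper's generation pipeline): the user's stated opinion is sampled independently of the claim's truth value, so the target answer depends only on the claim.

import random

# Toy claims with known truth values; the paper builds such claims from existing NLP datasets.
claims = [
    ("The sum of 2 and 3 is 5.", True),
    ("The sum of 2 and 3 is 7.", False),
]

def make_synthetic_example(claim: str, is_true: bool) -> dict:
    # The user's opinion is sampled independently of whether the claim is true.
    opinion = random.choice(["I think the claim is true.", "I think the claim is false."])
    prompt = f"Claim: {claim}\nHuman: {opinion} What do you think?"
    target = "The claim is true." if is_true else "The claim is false."
    return {"prompt": prompt, "target": target}

for claim, is_true in claims:
    print(make_synthetic_example(claim, is_true))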
@JerryWeiAI
Jerry Wei
2 years
Finding 3: Symbol-tuned models can override priors via flipped labels 🔃. Pretrained language models can follow flipped labels in in-context exemplars to some extent, but this ability is lost during instruction tuning. Symbol tuning restores this capability. 6/
Tweet media one
2
0
13
@JerryWeiAI
Jerry Wei
2 years
Finding 2: Symbol tuning improves performance on algorithmic reasoning tasks 🧮 On the list functions task and simple Turing concepts task from BIG-Bench, symbol tuning improves performance by up to +18.2% and +15.3%, respectively. 5/
Tweet media one
1
0
14
@JerryWeiAI
Jerry Wei
2 years
chatgpt responses load faster than the gpt-3 paper. try loading the gpt-3 paper for yourself (results may vary):.
2
0
12
@JerryWeiAI
Jerry Wei
11 months
Finding 2 🧠 LLMs can be used as factuality autoraters. We propose SAFE, which uses an LLM to break a response into individual facts and, for each fact, searches Google and reasons about whether the fact is supported by search results. Find it at 3/
Tweet media one
1
0
12
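A rough sketch of the SAFE loop described above (the splitter prompt and the llm/search callables are placeholders, not the released implementation): split the response into atomic facts, then for each fact query Google and have the LLM judge whether it is supported.

def safe_rate(response: str, llm, search) -> dict:
    # Sketch of the SAFE idea: `llm` and `search` are placeholder callables
    # (an LLM completion function and a Google search wrapper), not the released code.
    facts = llm(
        "List each individual factual claim in the following text, one per line:\n" + response
    ).splitlines()
    verdicts = {}
    for fact in filter(None, (f.strip() for f in facts)):
        evidence = search(fact)  # e.g., top Google results for this fact
        verdicts[fact] = llm(
            f"Fact: {fact}\nSearch results: {evidence}\n"
            "Is the fact supported by the search results? Answer 'supported' or 'not supported'."
        )
    return verdicts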