🤖🧠NEW PAPER🧠🤖
Language models are so broadly useful that it's easy to forget what they are: next-word prediction systems
Remembering this fact reveals surprising behavioral patterns: 🔥Embers of Autoregression🔥 (counterpart to "Sparks of AGI")
1/8
It has become acceptable for acronyms to use any letters within a word, not just the first letter.
E.g., ORNATE = acrOnyms fRom noN-initial chAracTErs
But why stick with whole letters? In my new paradigm CLIP, an acronym can use any curves or line segments from the base phrase!
How am I only learning now that Latvia's prime minister has a PhD in linguistics from Penn??
I've seen many lists of "jobs for linguists outside academia" but they never include Prime Minister of Latvia.
Linguists: In case you could use a diversion, I've made a phonetic crossword - all the answers must be written in the IPA, one phoneme per square.
(Non-linguists: Here's a chance to learn some phonetics!)
Puzzle:
Answers:
🤖🧠NEW PAPER🧠🤖
What explains the dramatic recent progress in AI?
The standard answer is scale (more data & compute). But this misses a crucial factor: a new type of computation.
Shorter opinion piece:
Longer tutorial:
1/5
🤖🧠NEW PAPER🧠🤖
Bayesian models can learn rapidly. Neural networks can handle messy, naturalistic data. How can we combine these strengths?
Our answer: Use meta-learning to distill Bayesian priors into a neural network!
Paper:
1/n
*NEW PREPRINT*
Neural-network language models (e.g., GPT-2) can generate high-quality text. Are they simply copying text they have seen before, or do they have generalizable linguistic abilities?
Answer: Some of both!
Paper:
1/n
Transformers are the current state of the art, but one day LSTMs may overtake them.
That would make LSTMs current again. You could even say…re-current.
Takeaways from
#NeurIPS
:
1) In-distribution generalization is out
2) Out-of-distribution generalization is in
3) We want compositionality (whatever it is)
4) "GPT-2" is very hard to say
My colleagues and I are accepting applications for PhD students at Yale. If you think you would be a good fit, consider applying! Most of my research is about bridging the divide between linguistics and artificial intelligence (often connecting to CogSci & large language models)
@ShunyuYao12
@danfriedman0
@mdahardy
@cocosci_lab
Another example: shift ciphers - decoding a message by shifting each letter N positions back in the alphabet.
On the Internet, the most common values for N are 1, 3, and 13. These are the only ones for which GPT-4 performs well!
5/8
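For concreteness, here is what decoding a shift cipher looks like in a few lines of Python (a minimal sketch; the function name is mine, not from the paper):

```python
def shift_decode(ciphertext: str, n: int) -> str:
    """Decode a shift cipher by moving each letter n positions
    back in the alphabet; non-letters pass through unchanged."""
    out = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base - n) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# n = 13 (rot13) is its own inverse, which is part of why it is common online
print(shift_decode("Uryyb, jbeyq!", 13))  # -> Hello, world!
```

The task is trivially deterministic, which is what makes the frequency effect (good performance only for n = 1, 3, 13) so striking.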
I am incredibly honored to receive a Glushko Dissertation Prize!
A huge thank-you goes to:
- My dissertation advisors,
@TalLinzen
and
@Paul_Smolensky
, for being incredibly supportive throughout my PhD
- (continued in next tweet)
1/2
The Cognitive Science Society is thrilled to announce the winners of the 2024 Glushko Dissertation Prize! 🏆
Let’s meet the brilliant minds behind groundbreaking research in Cognitive Science 🧵👇
I am hoping to hire a postdoc who would start in Fall 2024. If you are interested in the intersection of linguistics, cognitive science, and AI, I encourage you to apply!
Please see this link for details:
I’m now halfway through my PhD. One lesson I've learned: Don’t get discouraged comparing yourself to others.
Most comparisons are unfair; no two people have the same background. Plus, you get to define what success means to you - it doesn't have to look like anyone else's version.
A
#CompLing
proof:
a. Consider these sentences:
1. "How do you get down from a horse?"
2. "How do you get down from a goose?"
b. In (1), “down” is a preposition; i.e., “down” = P
c. In (2), “down” is a noun phrase; i.e., “down” = NP
d. By transitivity: P = NP
Phonology: Ain't no party like a fricative party cuz a fricative party don't stop
Syntax: Ain't no recursion like infinite recursion, cuz there ain't no recursion like infinite recursion, cuz...., cuz infinite recursion don't stop
Semantics: Ain't no Partee like Barbara Partee
Human language learning is fast & robust because of the inductive biases that guide it. Neural nets lack these biases, limiting their utility for cognitive modeling. We introduce an approach to address this w/ meta-learning.
Demo:
Two recent times when English failed me:
1) Passive form of "let someone know" ("He wants to be let known"?)
2) Adverb form of "hoity-toity" ("hoitily-toitily"?)
We need a flag, like on Wikipedia: "This linguistic phenomenon is incomplete. You can help English by expanding it."
@katiedimartin
For a class I TAed, I made this intro to Python structured around a running example from phonology:
It's meant to be gone through in one to two 75-minute lectures, so it might actually be more basic than what you want.
Our results show that we should be cautious about applying LLMs in low-probability situations
We should also be careful in how we interpret evaluations. A high score on a test set may not indicate mastery of the general task, esp. if the test set is mainly high-probability
7/8
By reasoning about next-word prediction, we make several hypotheses about factors that'll cause difficulty for LLMs
1st is task frequency: we predict better performance on frequent tasks than rare ones, even when the tasks are equally complex
E.g., linear functions (see img)!
4/8
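To spell out the linear-function example: Celsius-to-Fahrenheit conversion is a linear function that appears constantly in Internet text, while a linear function with slightly different coefficients is vanishingly rare despite being equally complex (the "rare" coefficients below are my own illustrative pick, not necessarily the paper's exact stimuli):

```python
def apply_linear(num: int, den: int, intercept: int, x: int) -> float:
    """Evaluate f(x) = (num/den) * x + intercept."""
    return (num * x) / den + intercept

# Frequent task: Celsius -> Fahrenheit, f(x) = (9/5)x + 32
print(apply_linear(9, 5, 32, 100))  # -> 212.0
# Rare but equally complex: f(x) = (7/5)x + 31
print(apply_linear(7, 5, 31, 100))  # -> 171.0
```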
When language models produce text, is the text novel or copied from the training set?
For answers, come to our poster today at
#acl2023nlp
! Session 1 posters, 11:00 - 12:30 today
Critics* are calling the work "monumental"
Link to paper:
1/2
New paper: "Does syntax need to grow on trees? Sources of hierarchical inductive bias in sequence-to-sequence networks" w/
@Bob_Frank
&
@TalLinzen
to appear in TACL
Paper
Website
Interested in syntactic generalization? Read on! 1/
Before
#acl2019nlp
,
@TalLinzen
gave me some transformative advice: If there are people you would like to meet at a conference, email them to set up a meeting!
(1/5)
"Whatever accidental meaning her *words* might have, she *herself* never meant anything at all."
- Lewis Carroll (presumably talking about language models)
Paul Smolensky’s class “Foundations of CogSci” now has a 2.5-hr summary on YouTube!
This course is the reason I think of myself as a cognitive scientist. Highly recommended.
The word2vec analogy "king - man ≈ queen - woman" is famous. What other types of vectors, besides word embeddings, have been argued to display additive analogies? (e.g., vector representations of faces? or phonemes? or documents?)
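For anyone who hasn't seen the additive-analogy test spelled out: it checks whether the two offset vectors are (near-)parallel. A toy sketch with made-up 2-d embeddings (the axes and values are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity of two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def offset(a, b):
    """Difference vector b - a."""
    return [bi - ai for ai, bi in zip(a, b)]

# Toy 2-d embeddings (made up): axis 0 = royalty, axis 1 = gender
man, king = [0.0, 1.0], [1.0, 1.0]
woman, queen = [0.0, -1.0], [1.0, -1.0]

# "king - man ≈ queen - woman": the offsets should be near-parallel
print(cosine(offset(man, king), offset(woman, queen)))  # -> 1.0
```

The same diagnostic applies to any vector space, which is what makes the question portable to faces, phonemes, documents, etc.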
Linguists, I have a terminology proposal:
- When you study competence, you're doing linguistics
- When you study performance, you're doing lingusitics
Of course, the object of study would be language or langauge, respectively
If you're interested in the NYT lawsuit (about GPT-4 copying from NYT articles), you should check out our paper "How Much Do Language Models Copy From Their Training Data?"
TACL link:
1/n
*NEW PREPRINT*
Neural-network language models (e.g., GPT-2) can generate high-quality text. Are they simply copying text they have seen before, or do they have generalizable linguistic abilities?
Answer: Some of both!
Paper:
1/n
Me, young and naive, reading about LSTMs for the first time: "Huh, I have no idea what an LSTM is. Well, I'll just look up what the letters stand for, and that should clear it up!"
From finite linguistic experience, we can acquire languages that are infinite. How do we make this leap?
New preprint on artificial language learning of center embedding:
w/
@DrCulbertson
, Paul Smolensky, & Geraldine Legendre, to appear
@cogsci_soc
1/n
Our big question: How can we develop a holistic understanding of large language models (LLMs)?
One popular approach has been to evaluate them w/ tests made for humans
But LLMs are not humans! The tests that are most informative about them may differ from the ones that are most informative about us
2/8
New tech report with Junghyun Min and
@TalLinzen
: "BERTs of a feather do not generalize together"
Across 100 re-runs, BERT fine-tuned on MNLI has a consistent score on MNLI but extreme variation in syntactic generalization (measured w/ HANS).
Link:
1/7
*NEW RESOURCE*
Neural networks can vary dramatically across reruns.
As a tool for studying this variation, we've released the weights for 100 instances of BERT fine-tuned on natural language inference (MNLI):
w/ Junghyun Min and
@TalLinzen
Yesterday I went to check out the classroom where I'll be teaching. At first I thought the door was locked, but it turned out that it was just very heavy.
It felt like a metaphor for life - often the doors that we think are locked are actually just heavy!
Excited to have a new
@ICLR2019
paper with
@TalLinzen
,
@EwanDun
, and Paul Smolensky! We find implicit compositional structure in RNN encodings by approximating them with Tensor Product Representations.
Paper:
Demo:
So how can we evaluate LLMs on their own terms?
We argue for a *teleological approach*, which has been productive in cognitive science: understand systems via the problem they adapted to solve
For LLMs this is autoregression (next-word prediction) over Internet text
3/8
Tomorrow I'll be speaking at the new
@NLPwithFriends
about using meta-learning to improve linguistic generalization in neural networks.
See below for details!
We are very excited to announce our next speaker!!
🗣Tom McCoy(
@RTomMcCoy
), telling us about "Universal Linguistic Inductive Biases via Meta-Learning"
🗓August 12th, 14:00 UTC
📝Sign up:
Keep up to date with talks at
Tune in Thursday, March 18, at a special time -- 11 a.m. Eastern -- and help our special guest the TODAY Show's
@alroker
crush
@RTomMcCoy
's tumultuous crossword! Join us on Twitter, YouTube or Twitch.
Illustration by James Doane.
The 2nd factor we predict will influence LLM accuracy is output probability
Indeed, across many tasks, LLMs score better when the output is high-probability than when it is low-probability - even though the tasks are deterministic
E.g.: Swapping adjacent words (see img)
6/8
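To make the task concrete, here is one way to implement pairwise adjacent-word swapping as a deterministic function (my own sketch of the task, not code from the paper):

```python
def swap_adjacent_words(text: str) -> str:
    """Swap each pair of adjacent words: w1 w2 w3 w4 -> w2 w1 w4 w3."""
    words = text.split()
    for i in range(0, len(words) - 1, 2):
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(swap_adjacent_words("the cat sat on the mat"))
# -> "cat the on sat mat the"
```

Note that the correct output here ("cat the on sat mat the") is a low-probability word sequence, even though the rule producing it is fully deterministic - which is exactly the situation where we predict LLMs to struggle.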
In Toronto for
#acl2023nlp
- please reach out if you want to meet up! Some interests:
- connecting linguistics & NLP
- interpretability & evaluation
- other things on this list:
- PhDs & postdocs at Yale Linguistics or CS (I'll be recruiting for 2024-2025)
Paper accepted to
@ACL2019_Italy
! "Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference" with
@TalLinzen
and Ellie Pavlick (building on work done with the JSALT team led by Ellie and
@sleepinyourhat
). Link:
Mechanical Turk is incredibly useful for collecting data, but using it effectively can be tricky.
Here is a list of tips that helped me get good data & save money:
This very nice piece by Ted Chiang describes ChatGPT as a lossy compression of the Internet.
This idea is helpful for building intuition, but it's easy to miss an important point: Lossiness is not always a problem! In fact, if done right, it is exactly what we want.
1/14
Language models show impressive performance on a wide variety of tasks, but are they overfitting to evaluation instances and specific task instantiations seen in their pretraining? How much of this performance represents general task/reasoning abilities?
1/4
One perk of academia that doesn’t get enough love is eduroam. So many times when I’ve needed wifi, eduroam has unexpectedly been there - e.g., when I was in a Swedish airport & urgently needed to rebook a flight.
(1/5) To understand our models, we need to understand how they have been affected by their training data. Methods like this one will help us do that.
@XiaochuangHan
,
@byron_c_wallace
, Yulia Tsvetkov.
Excited to have 2 papers accepted to
#acl2020nlp
! Both are about syntactic generalization in neural networks (via data augmentation or tree-based architectures), and both are joint work with some fantastic collaborators.
Titles are in replies, links are yet to come:
Machine learning techniques ranked by the sturdiness of the building materials in their names:
1) METALearning
2) reinforCEMENT learning
3) logiSTIC regression
Seems like part of the joke went unnoticed...
At the risk of ruining humor by explaining it: The sub-letter shenanigans are unnecessary - try reading the first letters of the words in the picture.
It has become acceptable for acronyms to use any letters within a word, not just the first letter.
E.g., ORNATE = acrOnyms fRom noN-initial chAracTErs
But why stick with whole letters? In my new paradigm CLIP, an acronym can use any curves or line segments from the base phrase!
Standard evaluations in NLP can mask striking differences between models.
To hear more, come to our talk “BERTs of a feather do not generalize together” on Friday at
#BlackboxNLP
! w/ Junghyun Min and
@TalLinzen
Paper:
Some historical phonetics: The [f] sound was originally made by clenching your teeth together. Only in the past few centuries did we switch to the current approach of lower-lip-against-upper-teeth.
The name for this shift: dental f loss
(3/5) This one’s a twofer: Both papers give hard evidence that evaluating only on English can make us overestimate our models (
#BenderRule
in action).
Kate McCurdy, Sharon Goldwater, Adam Lopez.
Forrest Davis,
@marty_with_an_e
.
(2/5) Many papers ask, “Do language models learn syntax?” I like that this work moves beyond that to “What type of syntax do language models learn?”
Artur Kulmizev,
@vin_ivar
, Mostafa Abdou,
@JoakimNivre
.
NEW PREPRINT
Excited to release my first first-author paper! We investigate if neural network learners (LSTMs and Transformers) generalize to the hierarchical structure of language when trained on the amount of data children receive.
Paper:
You've probably seen results showing impressive few-shot performance of very large language models (LLMs). Do those results mean that LLMs can reason? Well, maybe, but maybe not. Few-shot performance is highly correlated with pretraining term frequency.
My pipeline for typing special characters:
1) Find the Wikipedia page about the character
2) Copy the character from Wikipedia and paste it into my browser's search bar, to remove formatting
3) Copy from the search bar into the document I'm typing
Prediction: one of these days, someone will announce a new LLM that has an infinite context length - but it will turn out to be a reinvention of the LSTM.
🌲Interested in language acquisition and/or neural networks? Check out our poster today at
#acl2023nlp
! Session 4 posters, 11:00-12:30 🌲
Elevator pitch: Train language models on child-directed speech to test "poverty of the stimulus" claims
Paper:
Interesting results about LLMs & meaning!
As a bonus, the paper is an excellent example of how to evaluate LLMs fairly:
1. Provide sufficient context & information, to avoid underestimating LLMs
2. Control for spurious correlations in the data, to avoid overestimating LLMs
Controlled zero-shot evals have revealed holes in LMs’ ability to robustly extract and use meaning.
But what happens when you add experimental context (ICL/instructions)? With
@AllysonEttinger
&
@kmahowald
, I explore this in the context of semantic property inheritance:
1/13
Whenever my family drove by Toys R Us, my dad would say, "It should really be named Toys R We."
20 years later, I'm a linguist.
Coincidence? You tell me.
🧠🤖 Are you interested in linguistics, cognitive science, and Large Language Models? Come join this workshop next Monday & Tuesday over Zoom! I'm really looking forward to it!
Join us online May 13–14 for a star-studded
#NSF
-sponsored workshop: New Horizons in Language Science: Large Language Models, Language Structure, and the Cognitive & Neural Basis of Language! Interdisciplinary talks & discussion on three themes: 1/
"To be a computer scientist, you have to hate computers at least a little. Otherwise you have no motivation to make them better."
- Dana Angluin, talking to my first college CS class
I think of this often. It's a comforting thought if you're feeling frustrated with your field(s)
I will greatly miss Drago. To a very large extent, I owe him my career: Along with
@LoriLevinPgh
, he introduced me to linguistics via NACLO, a contest that they co-founded. His warmth and enthusiasm got me excited about the field that I have continued to pursue ever since.
1/5
The classic JSTOR trap: Thinking you’ve found a PDF of the book you need, when it’s actually just a review of that book (with the same title as the book)
At
#CogSci2021
and interested in linguistic generalization? Stop by our poster!
We find that people extrapolate center embedding beyond the depths of embedding they've seen.
Wed, July 28, from 11:20 am to 1:00 pm, Eastern time
Poster 2-E-176
Paper:
In model-generated text, very few bigrams and trigrams are novel - i.e., most of them appear in the training set. But for 5-grams and larger, the majority are novel!
3/n
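A minimal sketch of how n-gram novelty against a training corpus might be measured (function names and the toy corpora are mine):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty_rate(generated, training, n):
    """Fraction of n-grams in generated text that never appear in training text."""
    seen = set(ngrams(training, n))
    gen = ngrams(generated, n)
    return sum(g not in seen for g in gen) / len(gen) if gen else 0.0

train = "the cat sat on the mat".split()
gen = "the cat sat on a hat".split()
print(novelty_rate(gen, train, 2))  # -> 0.4 (2 of 5 bigrams are novel)
```

Running this for n = 1, 2, 3, 5, ... on real model output is what produces the pattern above: novelty is low for small n and climbs as n grows.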
🤖🧠New preprint🧠🤖
How can we understand a black box like an LLM?
Maybe we can apply the same tools that we use to model the human mind - another intelligent black box!
Bayesian models have been very useful in such settings, so they're well-poised to help us understand LLMs
Does the success of deep neural networks in creating AI systems mean Bayesian models are no longer relevant? Our new paper argues the opposite: these approaches are complementary, creating new opportunities to use Bayes to understand intelligent machines