Chris Cundy

@ChrisCundy

Followers
1K
Following
549
Media
79
Statuses
359

Research Scientist at FAR AI. PhD from Stanford University. Hopefully making AI benefit humanity. Views are my own.

San Francisco, CA
Joined July 2017
@ChrisCundy
Chris Cundy
2 years
Introducing *SequenceMatch*: training LLMs with an imitation learning loss. Avoids compounding error in generation by:
1. Training against *different divergences* like χ² with more support OOD.
2. Adding a *backspace* action: the model can correct errors!
1/7
6
85
490
@ChrisCundy
Chris Cundy
2 years
Haven't seen many people discussing the new tokenizer for GPT-4/ChatGPT, even though it's public through OpenAI's tiktoken package. I dumped the vocabulary here. Some first impressions below.
18
61
454
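For reference, a minimal sketch of how such a dump can be produced with tiktoken (the output filename and escape handling are illustrative, not necessarily what the post used):

```python
# Sketch: dump the GPT-4/ChatGPT vocabulary (cl100k_base) with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("cl100k_base_vocab.txt", "w", encoding="utf-8") as f:
    for i in range(enc.n_vocab):
        try:
            token_bytes = enc.decode_single_token_bytes(i)
        except KeyError:
            continue  # some ids in the range have no token assigned
        # repr() keeps non-UTF-8 byte sequences readable via escapes.
        f.write(f"{i}\t{token_bytes!r}\n")
```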
@ChrisCundy
Chris Cundy
5 months
Life update: I'm excited to announce that I defended my PhD last month and have joined @farairesearch as a research scientist! At FAR, I'm investigating scalable approaches to reduce catastrophic risks from AI.
16
4
300
@ChrisCundy
Chris Cundy
6 years
Great post on the intuition behind the transformer. I hadn't ever thought about how the CNN could be viewed as a special case of a transformer!
3
105
284
@ChrisCundy
Chris Cundy
2 years
Did you know that GPT-4 has memorized the numerical solutions to the first 200 Project Euler problems (and many further ones too)? This means that Project Euler is not a good evaluation dataset for LLMs. More details in my blog post:
5
31
212
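A rough sketch of the kind of check behind this claim (pre-1.0 openai SDK assumed; the prompt wording is illustrative, not the exact one from the post):

```python
import openai

prompt = ("What is the numerical answer to Project Euler problem 1? "
          "Reply with just the number.")
resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
# Problem 1's published answer is 233168; GPT-4 reproduces it without
# doing any computation, suggesting memorization.
print(resp["choices"][0]["message"]["content"])
```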
@ChrisCundy
Chris Cundy
2 years
Wow, Claude 2 has a 200k context window 🤯. And it seems like it's actually using the context too (see graph). Any guesses how it's implemented?
Tweet media one
4
10
79
@ChrisCundy
Chris Cundy
6 years
Fun paper recovers 50,000 lost digits from the original MNIST test set; uses them to see if we've overfitted to the MNIST test set. Conclusion: probably not that much.
3
13
63
@ChrisCundy
Chris Cundy
10 months
Happy to announce I was one of the winners of the OpenAI preparedness challenge! If anyone has a safety-related project that is bottlenecked on OpenAI credits, let me know -- I'd be happy to help out.
3
3
60
@ChrisCundy
Chris Cundy
3 years
How can you scalably form a posterior over possible causal mechanisms generating some observational data? Find out in our NeurIPS poster tomorrow (Tuesday), 8.30-10am Pacific time! Joint work with @StefanoErmon, @adityagrover_ (1/13)
2
4
44
@ChrisCundy
Chris Cundy
2 years
We treat the problem of sequence modelling as an RL problem: given a state (partial sequence), choose the action (next token to generate). I.e. how do you navigate the tree of all possible sequences?
Tweet media one
1
7
45
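A tiny sketch of this framing (all names illustrative): the state is the partial sequence, and each action appends a token, tracing a path down the tree of sequences.

```python
# Sequence generation as an MDP: states are token prefixes, actions are
# next tokens to generate.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    tokens: tuple  # the partial sequence generated so far

def step(state: State, action: int) -> State:
    """Choosing an action (a token id) extends the partial sequence."""
    return State(state.tokens + (action,))

s = State(())
for token in (12, 7, 99):  # a policy would sample these from the model
    s = step(s, token)
print(s.tokens)  # (12, 7, 99)
```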
@ChrisCundy
Chris Cundy
2 years
And this comparison of the numerical section shows that there are numerical tokens for exactly all of the 3-digit numbers in the GPT-4 tokenizer, compared to the sporadic and strange coverage in the GPT-2/3 tokenizer (e.g. 440 and 443 get their own token but 441 doesn't).
Tweet media one
4
0
39
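This is easy to check yourself with tiktoken; a minimal sketch (the comments reflect the comparison above):

```python
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2/3 BPE
gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4/ChatGPT

def single_token_numbers(enc):
    """3-digit numbers that the encoder represents as one token."""
    return {n for n in range(100, 1000) if len(enc.encode(str(n))) == 1}

print(len(single_token_numbers(gpt4)))  # all 900 three-digit numbers
print(len(single_token_numbers(gpt2)))  # sporadic coverage only
print(gpt2.encode("441"))               # multiple tokens, unlike 440/443
```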
@ChrisCundy
Chris Cundy
6 years
Really easy-to-use implementation of a GPT-2-powered autocomplete with a Google Docs-style interface! Gives a glimpse into the interesting applications that NLP and generative models enable.
1
8
32
@ChrisCundy
Chris Cundy
7 years
I am a big fan of Koray Kavukcuoglu's presentation about @DeepMindAI's research: both for the phenomenal deep learning carried out there and the use of the 'thinking face' emoji to represent the discriminator in a GAN #ICLR2018
Tweet media one
0
4
29
@ChrisCundy
Chris Cundy
6 years
Our group had an intensive reading group on optimal transport. Here's a cheat sheet I made for reference:
Tweet media one
0
3
26
@ChrisCundy
Chris Cundy
2 years
@BlackHC Good question! I doubt that transformers have enough capacity to solve any significant Project Euler problems. I did a quick experiment asking PE problem 1, but swapping the numbers. Both GPT-4 and ChatGPT failed to answer correctly at all.
2
2
23
@ChrisCundy
Chris Cundy
2 years
The tokenizer seems much more intentionally designed than the GPT-3 tokenizer, which fits with the idea that the GPT-2 BPE was relatively quickly thrown together and maintained for backwards compatibility. There don't seem to be many 'strange' tokens like SolidGoldMagikarp.
3
1
23
@ChrisCundy
Chris Cundy
6 years
Graph Neural Networks are exciting to me because it's not clear we know the 'natural' way to do deep learning on graph-structured data. Here's a new paper I saw showing how using PageRank to determine graph node embeddings can pay off.
1
6
22
@ChrisCundy
Chris Cundy
2 years
GPT-2 models fine-tuned with SequenceMatch can recover from errors and achieve better MAUVE scores on language modelling than MLE-trained models. Check out the paper for all the details. Next: bigger models! 🚀 Work done with the fantastic @StefanoErmon 8/8
1
1
22
@ChrisCundy
Chris Cundy
6 years
Feeling very inspired after @EmmaBrunskill's talk at the #StanfordHumanAI launch: using rigorously optimal decision-making methods in small-data domains!
Tweet media one
1
10
17
@ChrisCundy
Chris Cundy
5 years
Did you know you can differentiate through convex optimization problems? Turns out you can! And there are PyTorch/TF implementations too. The blog post is a great introduction: (Agrawal et al.)
0
2
17
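A minimal sketch with the PyTorch implementation, cvxpylayers (the projection problem here is just an example):

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n = 3
x = cp.Variable(n)
b = cp.Parameter(n)
# A parametric QP: project b onto the probability simplex.
prob = cp.Problem(cp.Minimize(cp.sum_squares(x - b)),
                  [cp.sum(x) == 1, x >= 0])
layer = CvxpyLayer(prob, parameters=[b], variables=[x])

b_t = torch.randn(n, requires_grad=True)
x_star, = layer(b_t)     # forward pass solves the QP
x_star.sum().backward()  # gradients flow through the argmin
print(b_t.grad)
```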
@ChrisCundy
Chris Cundy
2 years
And here is a plot of how likely chatGPT is to be able to recall the exact numerical answer to a Project Euler problem, as a function of problem ID.
Tweet media one
2
1
15
@ChrisCundy
Chris Cundy
5 years
I guess I know what my next project will be on. h/t
Tweet media one
0
0
15
@ChrisCundy
Chris Cundy
2 years
The MLE objective corresponds to minimizing a KL divergence. In IL, this is behavioural cloning, which has compounding error: the loss is an expectation over the *expert trajectories*, so if agent-generated trajectories diverge from the expert states, the agent's policy is bad there 😬 4/8
Tweet media one
2
1
15
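Spelling out the claim in the first sentence (a standard identity, sketched here):

```latex
\begin{align}
\arg\min_\theta \mathrm{KL}\!\left(p_{\mathrm{data}} \,\|\, p_\theta\right)
 &= \arg\min_\theta \mathbb{E}_{x \sim p_{\mathrm{data}}}
    \left[\log p_{\mathrm{data}}(x) - \log p_\theta(x)\right] \\
 &= \arg\max_\theta \mathbb{E}_{x \sim p_{\mathrm{data}}}
    \left[\log p_\theta(x)\right],
\end{align}
% i.e. MLE: the expectation is over expert data only, so nothing
% constrains the model on states it reaches but the expert never visits.
```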
@ChrisCundy
Chris Cundy
2 years
We apply recent advances in non-adversarial IL to form a fully supervised loss for language modelling, treating the dataset as examples of expert 'demonstrations' 🤖 3/8
1
0
15
@ChrisCundy
Chris Cundy
5 years
The pace of improvements in style transfer/domain adaptation is very impressive: just a few years ago I would have said these sorts of results would be incredibly hard to achieve. The latest: StarGANv2 (Choi et al.). #StarGANv2
Tweet media one
0
4
13
@ChrisCundy
Chris Cundy
6 years
I gave some comments at the DoD Defense Innovation Board Public Listening Event at Stanford today #dibprinciples
Tweet media one
Tweet media two
1
2
13
@ChrisCundy
Chris Cundy
2 years
Since we're learning actions instead of tokens, we can add a <backspace> action to the action space. This further mitigates the compounding error problem by allowing the model to delete tokens explicitly 🚮 6/8
2
0
13
@ChrisCundy
Chris Cundy
2 years
Implementing the backspace can be done with zero overhead using some masking and positional-id tricks. Take the actions [a, b, <backspace>, c]. We want this to correspond to the state [a, c] via masking, so we maintain fast parallel training with transformers ⚡️ 7/8
1
0
13
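One reading of the trick as a sketch (illustrative code, not the paper's exact implementation): resolve which actions survive and what position id each one gets, without ever rebuilding the sequence.

```python
BACKSPACE = -1  # reserved action id (illustrative)

def resolve_actions(actions):
    """Return (kept, position_ids) for an action sequence.

    kept[i] says whether action i survives in the final state;
    position_ids give each action the slot it would occupy, so 'c' in
    [a, b, <backspace>, c] reuses position 1. Training stays parallel:
    we only compute masks and position ids, never edit the sequence.
    """
    kept = [False] * len(actions)
    position_ids = [0] * len(actions)
    stack = []  # indices of currently surviving tokens
    for i, a in enumerate(actions):
        if a == BACKSPACE:
            if stack:
                kept[stack.pop()] = False  # delete most recent token
            position_ids[i] = len(stack)
        else:
            position_ids[i] = len(stack)
            kept[i] = True
            stack.append(i)
    return kept, position_ids

actions = ["a", "b", BACKSPACE, "c"]
kept, pos = resolve_actions(actions)
print([a for a, k in zip(actions, kept) if k])  # ['a', 'c']
print(pos)                                      # [0, 1, 1, 1]
```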
@ChrisCundy
Chris Cundy
2 years
We can mitigate this by training against divergences with expectations over *agent trajectories*: e.g. the JS divergence in GAIL, or the χ²-mixture in IQ-Learn. This encourages the agent to 'return' to the data distribution if it goes OOD. But what if there's no sensible continuation? 🤔 5/8
1
0
13
@ChrisCundy
Chris Cundy
9 months
I'm at ICLR this week -- if you're there too, let's chat! Interested in hearing more about sequence models and LLMs, particularly about alignment, detecting deception, and scalable oversight.
1
0
13
@ChrisCundy
Chris Cundy
2 years
I'm sure there's lots of other interesting stuff in the tokenizer. I'm not really familiar with text encoding, so I have definitely mangled the non-UTF-8 part of the vocab; I'd be happy to receive a pull request to fix that.
4
0
12
@ChrisCundy
Chris Cundy
2 years
Did you know that GPT-4 is *nondeterministic* over the API, even with temperature=0? Notably, it *does* seem to give more peaked responses at lower temperatures, but there's still some randomness. (Code snippet to verify nondeterminacy in the image.)
Tweet media one
3
0
12
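The original snippet is in the tweet image; a minimal reconstruction of the idea (pre-1.0 openai SDK assumed, prompt illustrative):

```python
import openai

def sample(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=50,
    )
    return resp["choices"][0]["message"]["content"]

# With a deterministic API, this set would have exactly one element.
outputs = {sample("Tell me a one-line story.") for _ in range(10)}
print(len(outputs))  # > 1 in practice, even at temperature=0
```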
@ChrisCundy
Chris Cundy
5 years
Remarkable example from OpenAI's blog. As ML systems become more powerful, are failures likely to fall along semantically meaningful lines (=> shocking/distressing examples instead of gibberish)?
Tweet media one
0
5
11
@ChrisCundy
Chris Cundy
5 years
ICLR 2020 Reviews
1
0
11
@ChrisCundy
Chris Cundy
6 years
Three hours before the abstract deadline for #NeurIPS2019 and I got an ID in the 8000s: looks like this year will definitely hit a new record for number of submissions! #AIhype
0
1
10
@ChrisCundy
Chris Cundy
11 months
Lots of interest in diffusion on discrete spaces. I think Aaron's approach is the most principled I've seen. Plus, lots of applications in conditional generation!
@aaron_lou
Aaron Lou
1 year
Announcing Score Entropy Discrete Diffusion (SEDD) w/ @chenlin_meng @StefanoErmon. SEDD challenges the autoregressive language paradigm, beating GPT-2 on perplexity and quality! Arxiv: Code: Blog: 🧵 1/n
1
0
7
@ChrisCundy
Chris Cundy
2 years
@chrisalbon Although the L2 penalty is equivalent to weight decay for SGD, it's not the same thing with adaptive optimizers. So I wouldn't say this card correctly characterizes 'weight decay' as it is typically understood today. See:
1
1
7
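A sketch of the distinction in PyTorch: an L2 penalty's gradient gets rescaled by Adam's per-parameter statistics, while AdamW's decoupled decay shrinks the weights directly.

```python
import torch

model = torch.nn.Linear(10, 1)
wd = 1e-2

# (a) "L2 regularization": the penalty goes through the loss, so its
# gradient is divided by Adam's second-moment estimate per parameter.
opt_l2 = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss = loss + 0.5 * wd * sum(p.pow(2).sum() for p in model.parameters())

# (b) Decoupled weight decay (AdamW): applied directly to the weights,
# untouched by the adaptive rescaling. Not equivalent to (a).
opt_wd = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=wd)
```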
@ChrisCundy
Chris Cundy
6 years
Great fireside discussion from Prof Phil Tetlock on prediction and using ML for forecasting #EAGlobal
Tweet media one
0
0
8
@ChrisCundy
Chris Cundy
6 years
Imitation learning is just minimizing an f-divergence between the learner and demonstrator trajectory distributions: GAIL is the JS distance, behavioural cloning is the KL divergence, and DAGGER is the TV distance!
0
0
8
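In symbols (one common convention; sign and scaling conventions vary across papers):

```latex
\begin{align}
D_f(P \,\|\, Q) &= \mathbb{E}_{x \sim Q}\!\left[f\!\big(p(x)/q(x)\big)\right],\\
\text{KL (behavioural cloning):} \quad f(t) &= t \log t,\\
\text{TV (DAGGER):} \quad f(t) &= \tfrac{1}{2}\lvert t - 1\rvert,\\
\text{JS (GAIL):} \quad f(t) &= \tfrac{1}{2}\big[t \log t - (t+1)\log\tfrac{t+1}{2}\big].
\end{align}
```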
@ChrisCundy
Chris Cundy
11 months
Up to 16% of EMNLP reviews are written by ChatGPT -- pretty concerning. I think this should be either explicitly banned or adopted in a controlled way by conferences going forward.
1
0
8
@ChrisCundy
Chris Cundy
2 years
I appreciate all the emphasis on safety in the Llama-2 paper (e.g. see figure) but I'm not sure how that squares with releasing the weights. If I want crimeLlama for effective phishing emails, can't I just finetune (simple with PEFT & quantization) to remove safety guardrails?
Tweet media one
2
0
8
@ChrisCundy
Chris Cundy
10 months
This snippet from the Llama3 Model Card is very interesting: I hope we get more details soon, particularly *clarifying whether the model was evaluated pre- or post-DPO/PPO safety finetuning*.
Tweet media one
1
0
8
@ChrisCundy
Chris Cundy
6 years
Great talk yesterday from @janleike on recursive reward modelling for developing AI that reliably and robustly carries out the tasks that we intend: check out the paper.
0
0
7
@ChrisCundy
Chris Cundy
6 years
Standing room only for Bill Gates at the Stanford HAI centre #StanfordHumanAI
Tweet media one
0
0
7
@ChrisCundy
Chris Cundy
4 years
ICYMI: A quick post on how I use Emacs to keep up to date with papers from arXiv.
0
0
7
@ChrisCundy
Chris Cundy
1 year
Interesting how the Mistral-Large release doesn't use the CoT@32/CoT@8 results for Gemini Pro (and no Gemini Ultra results), but presumably recomputes them without multiple CoT samples? I wonder if they tried multiple-CoT techniques with Mistral.
Tweet media one
Tweet media two
1
0
4
@ChrisCundy
Chris Cundy
2 years
@goodside You can find a lot of these by looking through the vocabulary and spotting weird-looking tokens. For instance, I think ' IsPlainOldData' is another glitch token. Here, GPT-4 claims it cannot see the token at all.
Tweet media one
0
0
6
@ChrisCundy
Chris Cundy
6 years
Very excited to be at the launch of the new Stanford Institute for Human-Centered AI! #StanfordHumanAI
Tweet media one
0
0
6
@ChrisCundy
Chris Cundy
5 months
GPQA-Diamond of 77.3% is wild.
2
0
8
@ChrisCundy
Chris Cundy
6 years
Interesting paper: GANs don't converge to the Nash equilibria often motivated as the objective; instead they converge to locally stable stationary points. Also, lessons for increasing training stability.
0
0
5
@ChrisCundy
Chris Cundy
6 years
When they extend the #Neurips2019 deadline by 24 hours.
Tweet media one
0
0
5
@ChrisCundy
Chris Cundy
7 years
The trilemma for the permissibility of ethical offsetting: which premise is wrong? #EAGlobalSF17
Tweet media one
0
0
5
@ChrisCundy
Chris Cundy
7 years
Very excited to be attending #ICLR 2018!
Tweet media one
0
0
5
@ChrisCundy
Chris Cundy
5 years
Reading reports like 'On the adequacy of untuned warmup for adaptive optimization' makes me think that it would be very good to have a sort of 'practitioner's guide' to things like choosing the optimizer, warmup, architecture, etc.
0
3
5
@ChrisCundy
Chris Cundy
2 years
There's a subtle flaw lurking at the heart of the SDE formulation of diffusion models (as they are typically implemented). Check out Aaron's work exploring this and the principled ways we can fix it -- and get some SoTA images! 🔬🧪👉
@aaron_lou
Aaron Lou
2 years
Presenting Reflected Diffusion Models w/ @StefanoErmon! Diffusion models should reverse an SDE, but common hacks break this. We provide a fix through a general framework. Arxiv: Github: Blog: 🧵 (1/n)
0
0
5
@ChrisCundy
Chris Cundy
5 years
Very excited to be at @JHUAPL for the "Assuring AI: Future of Humans and Machines" conference! Stop by the tech demo to see some upcoming work on fairness and RL.
Tweet media one
0
0
5
@ChrisCundy
Chris Cundy
5 years
Most interesting part of the OpenAI GPT-2 risks report for me: as the tech is in such an experimental state, bad users will likely be state/institutional actors; withholding larger GPT-2 models won't help against that, as they can train their own from scratch using the paper.
Tweet media one
2
0
4
@ChrisCundy
Chris Cundy
5 years
[Overleaf offices]: Why don't we do some maintenance on the night of the ICML deadline?
[Everyone]: What a great idea!
1
0
4
@ChrisCundy
Chris Cundy
6 years
@StefanFSchubert @robinhanson If I recall correctly, there's also an up/downvote system on the app where conferencegoers can indicate they'd prefer some questions to be asked, though the moderator isn't obliged to ask the highest-voted.
0
0
4
@ChrisCundy
Chris Cundy
7 years
Great panel on Global Governance #eaglobalsf17
Tweet media one
0
1
4
@ChrisCundy
Chris Cundy
5 years
Even for someone who doesn't usually use TensorFlow, I found this had some great high-level tips & techniques for debugging ML systems:
0
0
4
@ChrisCundy
Chris Cundy
5 months
Can anyone at OpenAI escalate a billing problem? I'm not able to add any money to the balance -- it says I must add a negative amount of money to continue. I think it's connected to the credit grant I got from the preparedness challenge.
1
0
6
@ChrisCundy
Chris Cundy
8 months
Improving to 60% on GPQA from the previous SoTA of 54% is *really* impressive -- the GPQA questions are very difficult! (That is, assuming no test-set leakage.)
@AnthropicAI
Anthropic
8 months
Introducing Claude 3.5 Sonnet—our most intelligent model yet. This is the first release in our 3.5 model family. Sonnet now outperforms competitor models on key evaluations, at twice the speed of Claude 3 Opus and one-fifth the cost. Try it for free:
Tweet media one
0
0
4
@ChrisCundy
Chris Cundy
2 years
Crazy how a couple of lines of ChatGPT API and TTS library calls (+ a prompt) can instantiate an actually useful research/motivation assistant with realistic speech. It has been really useful to have this conversation going while working over the last week.
1
0
4
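A rough sketch of such an assistant (pre-1.0 openai SDK assumed; pyttsx3 stands in for whichever TTS library you prefer, and the system prompt is illustrative):

```python
import openai
import pyttsx3

engine = pyttsx3.init()
history = [{"role": "system",
            "content": "You are a concise, encouraging research assistant."}]

while True:
    history.append({"role": "user", "content": input("> ")})
    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                        messages=history)
    reply = resp["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    engine.say(reply)      # speak the reply aloud
    engine.runAndWait()
```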
@ChrisCundy
Chris Cundy
6 years
Wild new paper collects a set of naturally occurring adversarial examples for ImageNet classifiers: DenseNet gets only 2% accuracy! #RobustML
Tweet media one
0
2
4
@ChrisCundy
Chris Cundy
1 year
@AlexGDimakis @raj_raj88 It's been argued that a big contributing factor to GPT-3's poor maths ability is its tokenizer, which is strange/bad for numbers -- i.e. a separate token for '809' and '810' but not for '811'. See:
0
0
3
@ChrisCundy
Chris Cundy
7 years
What is the right framing for fairness in machine learning? .#ICMLDebates
Tweet media one
0
0
3
@ChrisCundy
Chris Cundy
2 years
Anyone know an easy way to implement ZeRO/FSDP with JAX/Flax/Optax? I like JAX a lot, but it's striking how much larger the models I can train in PyTorch are.
2
0
3
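Not a full answer, but a sketch of the core FSDP/ZeRO-3 move in plain JAX: shard each parameter across all devices and let the compiler insert the gathers when jitted computations touch them (axis names and shapes here are illustrative):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("fsdp",))

params = {"w": jnp.zeros((8192, 8192)), "b": jnp.zeros((8192,))}

def shard(x):
    # Shard the leading axis across devices (assumes it divides evenly;
    # a real implementation would pad or replicate awkward shapes).
    return jax.device_put(x, NamedSharding(mesh, P("fsdp")))

params = jax.tree_util.tree_map(shard, params)
print(params["w"].sharding)  # each device holds a 1/n slice
```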
@ChrisCundy
Chris Cundy
7 years
Can you tell if you’ll like an ML paper after skimming for 30s? FHI and are collecting data on quick vs. careful judgments about papers. If you like ML papers, take part here (no signup required):
Tweet media one
0
1
3
@ChrisCundy
Chris Cundy
4 months
A bit confused why the bidding for ICLR only begins now: IIRC some conferences in previous years had bidding immediately after the abstract deadline. What's the motivation for the delay?
1
0
3
@ChrisCundy
Chris Cundy
6 years
Switched to using @zotero for organizing papers after waiting months for @readcube to release their new desktop papers app. If I'm looking at an arxiv paper I just click the plugin and it appears in the desktop app, all nicely formatted for bibtex exporting.
2
0
3
@ChrisCundy
Chris Cundy
6 years
@sangmichaelxie You should start it up! A lot of information on the internet is pretty old (even advocating using sigmoids, not mentioning batchnorm, etc.).
0
0
3
@ChrisCundy
Chris Cundy
6 years
The @OpenAI Five agent that played yesterday and convincingly beat world-champion players had been trained continuously since July 2018 😮
0
0
3
@ChrisCundy
Chris Cundy
10 months
Reading the report about many-shot jailbreaking reminds me of the paper from Wolf et al. last year: under weak assumptions, for a sufficiently long context, any LLM is jailbreakable.
0
0
3
@ChrisCundy
Chris Cundy
6 years
The past year has seen an incredible amount of progress in NLP: particularly with the development of the Transformer architecture. @OpenAI's recent model is mind-bogglingly good.
@OpenAI
OpenAI
6 years
We've trained an unsupervised language model that can generate coherent paragraphs and perform rudimentary reading comprehension, machine translation, question answering, and summarization — all without task-specific training:
0
1
3
@ChrisCundy
Chris Cundy
7 years
Elvis's 'Always on My Mind' translated into Beethoven: Original for comparison: It seems like the classical version sometimes misses the melody, though I'd expect this to get better as it's trained on more songs. #deeplearning
0
0
2
@ChrisCundy
Chris Cundy
6 years
Great to see NeurIPS emphasizing reproducibility by asking authors to answer the questions in the reproducibility checklist and making it available to reviewers
Tweet media one
0
0
2
@ChrisCundy
Chris Cundy
1 year
Very cool stuff: domain translation with exact likelihoods.
@linqi_zhou
Linqi (Alex) Zhou
1 year
📢 Diffusion models (DM) generate samples from noise distribution, but for tasks such as image-to-image translation, one side is no longer noise. We present Denoising Diffusion Bridge Models, a simple and scalable extension to DMs suitable for distribution translation problems.
0
0
2
@ChrisCundy
Chris Cundy
3 years
All these elements make BCD Nets: a variational method for learning posteriors over structural equation model parameters, which outperforms competing methods in low-data regimes. (12/13)
1
0
1
@ChrisCundy
Chris Cundy
2 years
Furthermore, the model directory listing (`openai.Model.list()`) for gpt-4 has the "allow_sampling" parameter set to False.
Tweet media one
0
0
2
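For context, a sketch of how to inspect this (pre-1.0 openai SDK; the exact layout of the permission fields is from memory and may differ):

```python
import openai

for model in openai.Model.list()["data"]:
    if model["id"].startswith("gpt-4"):
        # The permission block carries flags like "allow_sampling".
        print(model["id"], model.get("permission"))
```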
@ChrisCundy
Chris Cundy
7 years
@qntm There are tidal forces (0.384 μg/m). If you put two things in random spots in the ISS, they will start to move relative to each other.
0
0
2
@ChrisCundy
Chris Cundy
9 months
@janleike @AnthropicAI Congratulations!
0
0
2
@ChrisCundy
Chris Cundy
6 years
I don't usually like posting RL results based on learning curves, but this has some pretty interesting theory as well: trying to avoid bootstrapping variance in Q-learning to get more accurate off-policy learning.
0
0
2
@ChrisCundy
Chris Cundy
8 months
@StephenLCasper Do you think there's promise to methods that make fine-tuning harder/impossible, like (but extended to LLMs)? This could let you 'lock in' those superficial changes.
0
1
2
@ChrisCundy
Chris Cundy
6 years
Previously Wainwright showed that Q-learning has sample complexity quartic in 1/(1-γ), while model-based methods are cubic. Here, they prove a simple (and actually implementable) variance-reduction method for Q-learning can recover the cubic sample complexity.
0
0
2
@ChrisCundy
Chris Cundy
3 years
We describe Bayesian Causal Discovery Nets (BCD Nets), a variational inference framework for estimating a distribution over DAGs characterizing a linear-Gaussian SEM. What are the key elements of our method? (7/13).
1
0
1
@ChrisCundy
Chris Cundy
2 years
A great introduction to some challenges in imitation learning and how some of our recent work in the group helps overcome these obstacles:
1
0
2
@ChrisCundy
Chris Cundy
7 years
Interesting podcast outlines the state of the art in clean meat. I'd be very interested to learn more about why @open_phil disagrees with @GoodFoodInst on the plausibility of clean cultured meat.
@80000Hours
80,000 Hours
7 years
The clean & plant-based meat market sector needs more CEOs and CTOs.
0
0
2
@ChrisCundy
Chris Cundy
6 years
@krandiash @ShreyaR @ZenkitHQ I'll have to check it out! #productivitythursday.
0
0
2
@ChrisCundy
Chris Cundy
2 years
My guess is that it's essentially impossible to defend against white-box attacks with current models.
2
0
2
@ChrisCundy
Chris Cundy
7 years
Media panel: big fans of @OurWorldInData & choosing appropriate level of nuance for each medium #eaglobalsf17
Tweet media one
0
0
2
@ChrisCundy
Chris Cundy
6 years
"Revisiting Graph Neural Networks: All We Have is Low-Pass Filters".Interesting challenge to the very hot field of graph neural networks; .claim that GNNs don't make use of the manifold hypothesis and GCNs will struggle for nonseperable feature spaces.
0
0
2
@ChrisCundy
Chris Cundy
4 years
0
0
2
@ChrisCundy
Chris Cundy
2 years
The sort of prompt-based 'black box' red-teaming the team analyses in the paper makes sense for a model with only API access. But for a model with public weights, the threat model has to be a 'white box' setup where adversaries have full access to the model weights.
1
0
2
@ChrisCundy
Chris Cundy
3 years
Check out our NeurIPS poster tomorrow (Tuesday) at 8.30-10am, or the paper or code. (13/13)
0
0
1
@ChrisCundy
Chris Cundy
5 years
Great work giving tight results for how the problem geometry influences the choice of preconditioner for gradient methods.
@daniellevy__
Daniel Levy
5 years
Choosing the optimal gradient algorithm depending on the geometry is important. Interestingly, how to do it is in seminal results about the Gaussian Sequence model. Come to our oral at 4:50pm in West Exhibition Hall A! #NeurIPS2019.
0
0
2
@ChrisCundy
Chris Cundy
7 years
I spent quite a while fiddling with RNNs over the summer, and I'm pleased to announce the resulting paper was accepted to ICLR 2018! Check out some of the other accepted papers: there are some very cool new concepts there.
1
0
2
@ChrisCundy
Chris Cundy
5 years
Bob Work suggests using the Kobayashi Maru situation from Star Trek II: the Wrath of Khan to select commanding officers in the future, to select for people who use their own judgment and don't just go with the machine suggestion.
1
0
2
@ChrisCundy
Chris Cundy
6 years
A continuous relaxation of sorting opens up interesting end-to-end training of traditionally modular algorithms: check out the paper!
@adityagrover_
Aditya Grover
6 years
Our @iclr2019 paper proposes NeuralSort, a differentiable relaxation to sorting. Bonus: new Gumbel reparameterization trick for distributions over permutations. Check out our poster today at 4:30! Code: w/ @tl0en A. Zweig @ermonste
Tweet media one
0
0
2
@ChrisCundy
Chris Cundy
6 years
@sangmichaelxie I think that the various aggregators miss enough good things (and to some extent you should be reading the unpopular things) that it's worth reading all of the relevant arxivs like
0
0
2