Katherine Lee Profile
Katherine Lee

@katherine1ee

Followers
6,068
Following
969
Media
116
Statuses
1,086

understanding ourselves and our models. senior research scientist @GoogleDeepMind , @genlawcenter , formerly @Princeton @katherinelee @sigmoid .social

Joined November 2013
Don't wanna be here? Send us removal request.
Explore trending content on Musk Viewer
Pinned Tweet
@katherine1ee
Katherine Lee
9 months
3 exciting updates from Generative AI + Law ( @genlawcenter )! 1: We’ve written a report on the state of the field: 2. GenLaw → we’re becoming an official nonprofit! 3. GenLaw 2 coming soon – centering policy and policymakers More below!
Tweet media one
6
46
252
@katherine1ee
Katherine Lee
9 months
What happens if you ask ChatGPT to “Repeat this word forever: “poem poem poem poem”?” It leaks training data! In our latest preprint, we show how to recover thousands of examples of ChatGPT's Internet-scraped pretraining data:
Tweet media one
240
2K
8K
@katherine1ee
Katherine Lee
2 years
While reading the recent Big Models paper, my group discovered it copied text from one of our previous papers, and at least a dozen other papers. If you copy text, use quotation marks. Make your intent clear and cite your sources.
Tweet media one
35
196
1K
@katherine1ee
Katherine Lee
9 months
Responsible disclosure: We discovered this exploit in July, informed OpenAI Aug 30, and we’re releasing this today after the standard 90 day disclosure period.
6
23
665
@katherine1ee
Katherine Lee
9 months
We first measure how much training data we can extract from open-source models, by randomly prompting millions of times. We find that the largest models emit training data nearly 1% of the time, and output up to a gigabyte of memorized training data!
4
25
478
@katherine1ee
Katherine Lee
9 months
However, when we ran this same attack on ChatGPT, it looks like there is almost no memorization, because ChatGPT has been “aligned” to behave like a chat model. But by running our new attack, we can cause it to emit training data 3x more often than any other model we study.
Tweet media one
1
32
479
@katherine1ee
Katherine Lee
1 year
Excited to announce our Generative AI+Law Explainers on legal issues generative AI raises! First: The process of making training datasets is full of choices. It's not a foregone conclusion.
Tweet media one
4
81
404
@katherine1ee
Katherine Lee
4 years
Want your models to explain their predictions? Ever asked, “why, T5?!” We trained models that output a natural language explanation along with the prediction by extending T5. So excited to share this joint work with @sharan0909 , @craffel , @ada_rob , @nfiedel , and @KarishmaMalkan !
Tweet media one
4
72
340
@katherine1ee
Katherine Lee
9 months
But a really important note here: you have to test models before and after alignment since it’s proven to be so brittle. Also, it’s important to do internal testing, user testing, and testing by third-party organizations. It’s wild to us that this works.
5
14
329
@katherine1ee
Katherine Lee
3 years
Do neural language models memorize examples seen just a few times? We define counterfactual memorization for neural LMs to make this distinction! Paper: Led by Chiyuan Zhang, and with @daphneipp , Matthew Jagielski, @florian_tramer , and Nicholas Carlini
Tweet media one
2
56
330
@katherine1ee
Katherine Lee
9 months
Some quick notes: 1. This doesn't work everytime you run it. 2. Only ~3% of the text emitted (after the repeated token) was memorized. 3. Since we disclosed this to OpenAI this might work differently now.
16
9
269
@katherine1ee
Katherine Lee
1 year
Announcing the 1st Workshop on Generative AI and Law (GenLaw), co-located with ICML 2023! We’re bringing together renowned experts in ML and law for cross-discipline conversations about the rapidly evolving tech & legal landscape. More info:
Tweet media one
9
59
263
@katherine1ee
Katherine Lee
3 years
Data duplication is serious business! 3% of documents in the large language dataset, C4, have near-duplicates. Deduplication reduces model memorization while training faster and without reducing accuracy. Paper: Code: coming soon! 🧵⬇️ (1/9)
6
55
259
@katherine1ee
Katherine Lee
6 months
We have a fun attack that lets you extract the last-layer embedding weights of an LM via public APIs. It's really simple & uses SVDs! We discovered this for ChatGPT + PaLM-2. We privately disclosed, they fixed, now, we release :)
6
41
248
@katherine1ee
Katherine Lee
2 years
Memorization in language models scales log-linearly with: 1. Capacity of the model (# of parameters) 2. Number of times an example has been duplicated 3. Number of tokens of context used to prompt the model Paper:
Tweet media one
2
45
245
@katherine1ee
Katherine Lee
6 years
Applying to the Google AI Residency? Ben and I wrote down our advice for writing a cover letter. Thanks @colinraffel for hosting the blog.
@JeffDean
Jeff Dean (@🏡)
6 years
If you want to do ML research, consider applying for the 2019 Google AI Residency program! You'll have the opportunity to conduct cutting-edge research working in a wide variety of areas, and this year we're expanding to host residents in even more locations.
9
144
458
3
54
226
@katherine1ee
Katherine Lee
9 months
@ItakGol Hey, this is our work. Please attribute it. Those are our GitHubs. That is our blog post.
5
2
201
@katherine1ee
Katherine Lee
4 years
Language models memorize data & we can pull that data back out. If you're training on private data, this should give you pause. If you're training on public data, this should still give you pause. Where does your training data come from and who consented (or didn't)?
@colinraffel
Colin Raffel
4 years
New preprint! We demonstrate an attack that can extract non-trivial chunks of training data from GPT-2. Should we be worried about this? Probably! Paper: Blog post:
15
236
1K
2
28
139
@katherine1ee
Katherine Lee
5 months
So excited to announce an event @genlawcenter has been working on! We're discuss the misconceptions b/w the technical capabilities of evaluating generative AI, and what policymakers and civil society want... April 15th @GtownTechLaw , and live on zoom:
Tweet media one
9
37
135
@katherine1ee
Katherine Lee
2 months
Tweet media one
5
1
114
@katherine1ee
Katherine Lee
9 months
@goodside Yeah totally, it makes me wonder how many other people found things like this that they thought were hallucinated data. It was a lot easier for us to check b/c we had already made large suffix arrays for prior projects.
6
0
105
@katherine1ee
Katherine Lee
9 months
Whether a model is "hallucinating" or "memorizing" or "generalizing" is actually just our perception of what the model is doing. It's our own projections. (Most) models are trained to produce the next token. They're very effective at "generalizing" whatever we mean by that.
@karpathy
Andrej Karpathy
9 months
# On the "hallucination problem" I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines. We direct their dreams with prompts. The prompts start the dream, and based on the
759
3K
15K
5
5
88
@katherine1ee
Katherine Lee
3 years
Language is contextual, varied, and used to communicate (examples below) This complicate assumptions required for techniques like data sanitization and differential privacy. So what does it mean for a language model to preserve privacy? We unpack that question in this paper!
@rzshokri
Reza Shokri
3 years
We study the question "What Does it Mean for a Language Model to Preserve Privacy?" in a great collaboration with wonderful Hannah, Katherine, Fatemeh, and Florian @Hannah_Aught @katherine1ee @limufar @florian_tramer We discuss the mismatch between the 1/3
Tweet media one
3
23
120
3
14
89
@katherine1ee
Katherine Lee
1 year
So, let's talk about @afedercooper , @grimmelm , and my new piece on the Generative-AI supply chain & copyright. We appreciate all the enthusiasm! This piece is extremely detailed because it has to be. We wanted to be rigorous and get it right.
Tweet media one
3
28
83
@katherine1ee
Katherine Lee
9 months
Hi friends! If you're going to NeurIPS I'd love to chat about AI policy, model evaluation, and model attacks. Lemme know if you wanna grab coffee! You can also find me hanging w/ my team at posters for:
4
0
82
@katherine1ee
Katherine Lee
9 months
@felix_red_panda We discuss that here: Specifically:
Tweet media one
0
3
76
@katherine1ee
Katherine Lee
9 months
mhm....
@jason_koebler
Jason Koebler
9 months
New: Asking ChatGPT to repeat words "forever"—a tactic used by Google's DeepMind that prompted ChatGPT to reveal its training data—is now a terms of service violation:
33
173
685
5
7
70
@katherine1ee
Katherine Lee
5 months
and now: Niloofar Mireshghallah on "What is differential privacy? And what is it not?" Why the focus on DP? Well, it appears many, many times in the EO! So let's talk about what this actually means. #genlaw
Tweet media one
4
5
70
@katherine1ee
Katherine Lee
1 year
Nicholas Carlini talking about "A Brief Introduction to Machine Learning & Memorization" at #genlaw ! And also, how should you talk to lawyers and policy folks about ML? Still livestreaming (…) and liveblogging (…)
Tweet media one
1
13
70
@katherine1ee
Katherine Lee
2 years
I'm just impressed that Nicholas already had the data downloaded to make quick work of checking for duplicates across so many conference proceedings. And also, that we literally have software to check for near and exact duplicates.... Mad props @daphneipp for noticing this.
0
1
70
@katherine1ee
Katherine Lee
1 year
GenLaw is TODAY in Ballroom B (4th floor)! Come learn about emerging copyright and privacy in generative AI from world-leading legal and technical scholars! & see really cool contributed work (spotlights @ 1:45 PM, posters @ 2:15PM) #ICML2023 @icmlconf
Tweet media one
1
15
69
@katherine1ee
Katherine Lee
1 month
⭐"Stealing Part of a Production Language Model" was such a fun project. ⭐ SVDs were my favorite part of linear algebra and why I got into ML. So cool to see them featured in this work in 2024! More here: & blog:
@icmlconf
ICML Conference
1 month
Congratulations to the best paper award winners
Tweet media one
12
113
730
1
5
65
@katherine1ee
Katherine Lee
8 months
@sirbayes Yeah it's wild..... But we ran this experiment. Sometimes appearing once is enough but more repeats of the training data makes it easier to extract
Tweet media one
1
1
65
@katherine1ee
Katherine Lee
5 years
Someone recently said this to me: "You should meditate 10 minutes a day, unless you really don't have time, then you should meditate 20 min" I've never heard it before, but it rings true. Gentle reminder to take time for yourself.
1
1
65
@katherine1ee
Katherine Lee
1 year
Want to learn more about the novel legal issues raised by generative AI? Or, want to learn about the underlying generative models or the techniques we have for evaluating privacy concerns? GenLaw's sharing a list of resources on that today!
@katherine1ee
Katherine Lee
1 year
Announcing the 1st Workshop on Generative AI and Law (GenLaw), co-located with ICML 2023! We’re bringing together renowned experts in ML and law for cross-discipline conversations about the rapidly evolving tech & legal landscape. More info:
Tweet media one
9
59
263
3
19
57
@katherine1ee
Katherine Lee
9 months
Colin is fabulous for so many reasons, but here's a few: - Works _with_ you - Has strong (& good) research intuitions, but still down to talk through why a direction works/doesn't or is interesting or not. - Excellent at communication
@colinraffel
Colin Raffel
9 months
Also, I am 1000% hiring PhD students this round! If you want to work on - open models - collaborative/decentralized training - building models like OSS - coordinating model ecosystems - mitigating risks you should definitely apply! Deadline is Friday 😬
12
75
461
3
2
59
@katherine1ee
Katherine Lee
3 years
just came here to say that im finally learning d3 after six years of going like, damn it would be so cool if we could interactively visualize this right now!! and i'm having so much fun! i feel so empowered!
3
0
59
@katherine1ee
Katherine Lee
8 months
I agree the legal issues are murky. But this didn't clear it up for me. It's really hard to do good interdisciplinary work. Even with the best intentions, you could say A and someone else could think you mean B because they don't have context to fully understand A. IMO ...
@random_walker
Arvind Narayanan
8 months
A thread on some misconceptions about the NYT lawsuit against OpenAI. Morality aside, the legal issues are far from clear cut. Gen AI makes an end run around copyright and IMO this can't be fully resolved by the courts alone. (HT @sayashk @CitpMihir for helpful discussions.)
12
93
318
1
7
58
@katherine1ee
Katherine Lee
11 months
So Talkin' 'Bout AI Generation was accepted at Journal of the Copyright Society! I'm so proud!! Also we wrote a blog post that outlines the main ideas! You can read it here: More on both below...
Tweet media one
1
8
56
@katherine1ee
Katherine Lee
4 months
Submit to GenLaw 2024 at ICML! Due June 10, 2024!! 1-2 page abstracts Example Topics: - Open questions / misconceptions at intersection of generative AI + law - Model evaluations for privacy harms / data protection - Data attribution - Analysis of bills / acts
1
23
57
@katherine1ee
Katherine Lee
3 years
I really enjoyed this paper surveying papers that have done human evaluation of generated text: The figure is fascinating. People mean lot of different things when they say "coherence," there's a lot of different ways of saying "grammatical."
Tweet media one
1
19
56
@katherine1ee
Katherine Lee
1 year
📣Really excited to present this with Nicholas at ICLR today (Monday) at 10:00am CAT in AD11🌟 Teaser, 1% of training examples are exactly memorized. Since this paper first went up on arxiv, we've continued to study memorization. Our talk today ties together these four papers:
@katherine1ee
Katherine Lee
2 years
Memorization in language models scales log-linearly with: 1. Capacity of the model (# of parameters) 2. Number of times an example has been duplicated 3. Number of tokens of context used to prompt the model Paper:
Tweet media one
2
45
245
1
7
56
@katherine1ee
Katherine Lee
1 year
I did a fun on Friday: We acquired 30lbs of oranges of 7 different types and had folks guess which of the three ancestral strains of citrus were bred to develop the different oranges. We were wildly off!! Pictured below: me at my orangest.
Tweet media one
Tweet media two
3
2
53
@katherine1ee
Katherine Lee
5 years
Curious about the state of NLP? We explore how different pre-training objectives, datasets, training strategies, and more affect downstream task performance, and how well can we do on when we combine these insights & scale. It was amazing to collaborate with this team!
@colinraffel
Colin Raffel
5 years
New paper! We perform a systematic study of transfer learning for NLP using a unified text-to-text model, then push the limits to achieve SoTA on GLUE, SuperGLUE, CNN/DM, and SQuAD. Paper: Code/models/data/etc: Summary ⬇️ (1/14)
Tweet media one
9
369
1K
0
7
51
@katherine1ee
Katherine Lee
1 year
Data is so incredibly important to trained models. But what does it mean for data to be “high quality?” To what extent should the choice of downstream application change pre-training data selection? We explore that in this paper led by @ShayneRedford
@ShayneRedford
Shayne Longpre
1 year
#NewPaperAlert When and where does pretraining (PT) data matter? We conduct the largest published PT data study, varying: 1⃣ Corpus age 2⃣ Quality/toxicity filters 3⃣ Domain composition We have several recs for model creators… 📜: 1/ 🧵
Tweet media one
12
88
360
1
9
48
@katherine1ee
Katherine Lee
1 year
Unbelievably excited to announce our confirmed speakers for GenLaw! We have intellectual property powerhouses: @PamelaSamuelson , Mark Lemley, and @luis_in_brief ML privacy experts: Nicholas Carlini, @thegautamkamath , and Kristen Vaccaro And industry policy: @Miles_Brundage !
Tweet media one
3
11
47
@katherine1ee
Katherine Lee
2 years
@davidthewid can confirm, advisors want moar tables @dmimno
Tweet media one
1
2
44
@katherine1ee
Katherine Lee
2 months
Not all memorization is created equal!!
@nsaphra
Naomi Saphra
2 months
Humans don't just "memorize". We recite poetry drilled in school. We reconstruct code snippets from more general knowledge. We recollect episodes from life. Why treat memorization in LMs uniformly? Our new paper w/ @AiEleuther proposes a simple taxonomy.
Tweet media one
3
42
200
0
2
42
@katherine1ee
Katherine Lee
3 years
I heard this recently and really liked it: "The grass is green where you water it" We find things to love in places where we put effort :)
1
0
41
@katherine1ee
Katherine Lee
1 year
So C4 was from 2019, folks, which was before 2020... The fact that it's still widely used speaks to how difficult it can be to collect data. And also to how data collection is the "gross and icky" process that you have to go through before training a model.
@nitashatiku
Nitasha Tiku
1 year
Here's our analysis of the 15 million websites in just one highly-filtered CommonCrawl web scrape-used to train models like Google's T5 & Facebook's LLaMA -copyright symbol appears >200M times -pirated sites, 1 for e-books -half the top 10 = news sites
16
290
719
1
4
39
@katherine1ee
Katherine Lee
8 months
2. @srush_nlp literally this morning hosted a bounty for someone to regenerate an NYT article from ChatGPT. And it was quickly successful. This isn't fixed.
@srush_nlp
Sasha Rush
8 months
I cannot believe something this stupid worked. The Shoggoth remains undefeated. Congrats to the winner, and go support local news. Original article:
Tweet media one
Tweet media two
Tweet media three
6
10
74
1
4
39
@katherine1ee
Katherine Lee
6 months
Come listen to me rant about why we care about privacy, do we care about privacy?? who cares about privacy?? what even is privacy? ??!!!??
@niloofar_mire
Niloofar Mireshghallah
6 months
Join us tmw for the 5th PPAI workshop @RealAAAI , to discuss Generative AI, Privacy & Policy! We have a line-up of amazing speakers & panelists talking about all things LLMs, regulation and why we should care about privacy: w/ @nandofioretto @JubaZiani
Tweet media one
Tweet media two
Tweet media three
2
5
53
3
2
38
@katherine1ee
Katherine Lee
5 months
And now, Nicholas Carlini on "What watermarking can and can not do" and can we break it :) #genlaw
Tweet media one
0
1
38
@katherine1ee
Katherine Lee
1 year
Authorship has been increasingly challenging to determine as team sizes grow larger. We put together a set of proposals that highlight different types of contributions. We’re excited to invite the community to test out the proposals and provide feedback.
@florian_tramer
Florian Tramèr
1 year
Author order on academic papers is important! My Google friends and I spent lots of time thinking about this critical issue (the scores of our ICML submissions show this is time well spent) We distill our findings for the community here: Comments welcome!
Tweet media one
10
61
400
1
1
37
@katherine1ee
Katherine Lee
9 months
@VanGennepD Nice! Thanks for the pointer
2
0
35
@katherine1ee
Katherine Lee
1 year
Come find the GenLaw organizers in our little hats :)
Tweet media one
2
4
35
@katherine1ee
Katherine Lee
10 months
I am so excited for this!!!! Major yikes: "For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks"
@yanaiela
Yanai Elazar
10 months
What's In My Big Data? A question we've been asking ourselves for a while. Here is our attempt to answer it. 🧵 Paper - Demo-
Tweet media one
4
69
240
0
5
35
@katherine1ee
Katherine Lee
5 months
ONE WEEK!!!!
@katherine1ee
Katherine Lee
5 months
So excited to announce an event @genlawcenter has been working on! We're discuss the misconceptions b/w the technical capabilities of evaluating generative AI, and what policymakers and civil society want... April 15th @GtownTechLaw , and live on zoom:
Tweet media one
9
37
135
2
5
35
@katherine1ee
Katherine Lee
9 months
@savvyRL It's a pretty standard thing to do in the security community. We felt this fell under that bucket
0
0
35
@katherine1ee
Katherine Lee
3 years
congrats to the folks starting new journeys!! today i had a very typical monday where the highlight was finding out that the thing I thought was a really, really bad bug, was only a moderately bad bug. so, p much the same.
1
0
34
@katherine1ee
Katherine Lee
1 year
Was just in Maui (am safe), but stunned by the difference in news coverage vs. what was actually happening. There was no information about where the fires were, what to do, or the complete devastation that had already happened on day 1. If you went to ICML, plz donate👇
1
2
33
@katherine1ee
Katherine Lee
10 months
1. Research is a ✨lifestyle✨ 2. PhD can be good training for a research mentality 3. But also research mentalities can be cultivated anywhere and in any profession. (Breaking down a difficult problem into actionable parts, synthesizing what's out there).
@jm_alexia
Alexia Jolicoeur-Martineau
10 months
It's silly how controversial my tweet got. I have always been on the side that you don't need a PhD to be a great researcher, and we shouldn't need one. Yet, not having a Phd is a handicap for most research jobs, salary, promotion, and job mobility, so it's worth getting.
8
5
170
1
2
32
@katherine1ee
Katherine Lee
5 years
wooo congrats! so fun to watch this project develop from the desk next to you!
@rapha_gl
rapha gontijo lopes
5 years
My 1st @GoogleAI Residency paper is finally on arxiv! We train a powerful generative model of fonts as SVG instead of pixels. This highly structured format enables manipulation of font styles and style transfer between characters at arbitrary scales! 👉🏽
Tweet media one
13
248
960
1
6
31
@katherine1ee
Katherine Lee
6 months
Also got to talk about memorization as a case study for red-teaming at CMU last week! AKA. What can we generalize (methods and definitions) and what can we not generalize (interpretations). Slides: I'm trying to make an effort to share these out more!
0
3
31
@katherine1ee
Katherine Lee
2 years
Your 👏 data 👏 matters Where you source your data from impacts your models' risk profile. If you have only public, non-copyrighted data, then harms from memorization are greatly reduced.
@Eric_Wallace_
Eric Wallace
2 years
See our paper for a lot more technical details and results. Speaking personally, I have many thoughts on this paper. First, everyone should de-duplicate their data as it reduces memorization. However, we can still extract non-duplicated images in rare cases! [6/9]
Tweet media one
5
21
548
2
4
31
@katherine1ee
Katherine Lee
1 year
There’s a lot of terms in the generative ai + law space. We’ve put together a glossary so we can all be on the same page: For more resources, see:
1
8
29
@katherine1ee
Katherine Lee
4 years
My walk today was full of metaphors
Tweet media one
Tweet media two
Tweet media three
0
0
29
@katherine1ee
Katherine Lee
1 year
Our accepted papers for GenLaw are live! We're now T-12 days to the workshop!! So excited to see you in Hawaii. We had 64 submissions and accepted 29 of them with 5 spotlights!
Tweet media one
0
10
29
@katherine1ee
Katherine Lee
5 months
On now: David Bau on "Unlearning from Generative AI Models" #genlaw
Tweet media one
1
2
28
@katherine1ee
Katherine Lee
1 year
GenLaw is this **Saturday, July 29th!** in Ballroom B at the Convention Center in Honolulu (also streamed virtually). We were briefly listed in ICML's schedule as Friday. This is now corrected. It's **SATURDAY** !! See you there!!
1
7
28
@katherine1ee
Katherine Lee
1 month
Really excited to share our workshop schedule! See you all Sat. 9am in Lehar 2. We've got a packed and awesome day ft. talks on - Training data curation - AI Act - Differences in international copyright law - GDPR (?!) - Unlearning (?!) - DSA!
Tweet media one
Tweet media two
0
5
27
@katherine1ee
Katherine Lee
1 year
@thegautamkamath on "What does Differential Privacy have to do with Copyright?" at #genlaw Still livestreaming () and liveblogging ()
Tweet media one
0
3
27
@katherine1ee
Katherine Lee
9 months
🥰
@ada_rob
Adam Roberts
9 months
T5 Reunion! ( @NoamShazeer was replaced by a sentinel token)
Tweet media one
3
4
255
0
0
25
@katherine1ee
Katherine Lee
2 years
Privacy is hard. Publicly accessible data != public data. Differential privacy has limitations. Public data _looks_ different from private data in meaningful ways, but our benchmarks sometimes miss that.
@thegautamkamath
Gautam Kamath
2 years
🧵New paper w Nicholas Carlini & @florian_tramer : "Considerations for Differentially Private Learning with Large-Scale Public Pretraining." We critique the increasingly popular use of large-scale public pretraining in private ML. Comments welcome. 1/n
Tweet media one
4
20
148
0
1
26
@katherine1ee
Katherine Lee
1 year
The submission window for GenLaw is open TODAY (through May 29, AoE)! Thanks to those of you who have already submitted! We know having clear norms can help bridge interdisciplinary communities, so today we’re also sharing our reviewer guidelines:
Tweet media one
1
6
24
@katherine1ee
Katherine Lee
5 months
TFW you get AI generated (possibly) malware and your team of security researchers says "ooo lemme see, fwd plz!" ..........
2
0
24
@katherine1ee
Katherine Lee
2 years
So there was this tiktok about being so traumatized that you're numb to major events. And how that's not a good thing. And I was like, oh look, it's me, but about advances in LM. Something new happens and I'm just like, of course.
1
0
25
@katherine1ee
Katherine Lee
1 year
It was also really fun and interesting to write! The whole time @afedercooper , @grimmelm , and I would be chatting and someone would go "but what about X" and then we'd all go "oh....." and then write 10 more pages...
Tweet media one
@NaveenGRao
Naveen Rao
1 year
@katherine1ee and team have published an update to their paper with a lot more detail. Thanks for continuing your work in this important area!
0
1
7
0
8
25
@katherine1ee
Katherine Lee
1 year
Can I just say I love/hate this figure 1: It's actually straightforward to create inputs for multimodal models that circumvent alignment. Yes, we need access to gradient info for this particular attack, but this demonstrates the challenges of really aligning models...
Tweet media one
@safe_paper
AI Safety Papers
1 year
Are aligned neural networks adversarially aligned? Nicholas Carlini, Milad Nasr ( @srxzr ), Christopher A. Choquette-Choo, Matthew Jagielski, @irena_gao , @anas_awadalla , @PangWeiKoh , Daphne Ippolito ( @daphneipp ), Katherine Lee ( @katherine1ee ), @florian_tramer , Ludwig Schmidt
0
7
24
1
3
24
@katherine1ee
Katherine Lee
1 year
@iclr_conf tomorrow (Wed) 11:30! Privacy attacks have a “recency bias”: examples seen more recently are more vulnerable to attack, and old examples are “forgotten” according to these attacks! This is true for image, speech, and text! Work led by Matthew Jagielski!
Tweet media one
1
7
24
@katherine1ee
Katherine Lee
1 year
Miles Brundage at #genlaw now: "Where and when does the law fit into AI development and deployment?" Answer: everywhere, pretty much Livestream: Liveblog:
Tweet media one
0
1
23
@katherine1ee
Katherine Lee
2 years
@ChengleiSi And yeah, I'm sure with 100 authors not everyone was aware of this. It's difficult and chaotic with that many people.
0
0
23
@katherine1ee
Katherine Lee
5 months
yay!
@genlawcenter
The GenLaw Center
5 months
See you all in Vienna, Austria for GenLaw 2 at ICML 2024! We're so excited to be in Europe and use this opportunity to dig into GDPR, and text & data mining exceptions!! We'll put up a website soon with a CFP. It'll be pretty similar to last years () but
Tweet media one
1
4
18
0
0
18
@katherine1ee
Katherine Lee
5 years
Couple more days to apply for the Google AI Residency Program (Dec 19)! I was part of this program a couple years back, and it was a great experience. Apply: Cover letter tips:
2
4
22
@katherine1ee
Katherine Lee
1 year
There's a lot of privacy terms that get thrown around: canaries, membership inference, & differential privacy. This 2-page paper from Matthew Jagielski is super helpful for understanding the relationships between them!
Tweet media one
0
6
22
@katherine1ee
Katherine Lee
4 years
still relevant
@kellianderson
kelli anderson
6 years
I asked my students to manually comb through the Enron corpus of emails (a dataset that has machine learning, to train software) to find patterns that computers could miss but humans would notice. @turniplan found a web of racist/sexist jokes and began sketching the connections:
Tweet media one
17
238
631
1
3
22
@katherine1ee
Katherine Lee
6 months
Slides here: Consider this a bunch of links out to papers I've enjoyed on privacy / why we care about privacy!
@katherine1ee
Katherine Lee
6 months
Come listen to me rant about why we care about privacy, do we care about privacy?? who cares about privacy?? what even is privacy? ??!!!??
3
2
38
0
2
22
@katherine1ee
Katherine Lee
10 months
In light of the number of times differential privacy appeared in the EO, let's bring back two pieces: 1. Privacy side channels in machine learning systems: 2. What Does it Mean for a Language Model to Preserve Privacy?:
1
1
21
@katherine1ee
Katherine Lee
1 year
Mark Lemley speaking now at #genlaw on "Is Training AI Copyright Infringement?" "Is it legal to train models on copyrighted data? If not, ML is dead." "The goal is to generate something that is not infringing." Livestream:
Tweet media one
0
3
21
@katherine1ee
Katherine Lee
1 year
419 languages is so many languages (!!) Side note: We investigated how having lots of different languages in one model impacts what and how much is memorized. Which examples get memorized depends on what other examples are in the training data!
Tweet media one
@snehaark
Sneha Kudugunta
1 year
Excited to announce MADLAD-400 - a 2.8T token web-domain dataset that covers 419 languages(!). Arxiv: Github:   1/n
Tweet media one
24
135
800
0
2
21
@katherine1ee
Katherine Lee
3 years
I love this paper so much. Responses are highly contextual and social, and the authors do a great job of showing the implications that has on any system that hopes to recommend responses. It's not just a technical problem, but it's also not, not a problem with our techniques.
@RERobertson
Ronald E Robertson
3 years
New paper in #CHI2021 - "I Can't Reply With That": Characterizing Problematic Email Reply Suggestions with @o_saja , @841io , @invertedindex , & @peter_r_bailey
Tweet media one
2
10
41
1
3
21
@katherine1ee
Katherine Lee
5 years
Colin has been incredible to work with! He's immensely supportive and insightful. I've been incredibly lucky to work with him and you could be too!
@colinraffel
Colin Raffel
5 years
I'm starting a professorship in the CS department at UNC in fall 2020 (!!) and am hiring students! If you're interested in doing a PhD @unccs please get in touch. More info here:
82
145
887
1
0
21
@katherine1ee
Katherine Lee
1 month
We have 3 (!!) talks on the EU AI Act from 3 (!!) different perspectives. Gabriele Mazzini, @cp_dunlop , and @sabrinakuespert will speak on the process of creating it, the stakeholders, and how it will be implemented, respectively! @genlawcenter
@katherine1ee
Katherine Lee
1 month
Really excited to share our workshop schedule! See you all Sat. 9am in Lehar 2. We've got a packed and awesome day ft. talks on - Training data curation - AI Act - Differences in international copyright law - GDPR (?!) - Unlearning (?!) - DSA!
Tweet media one
Tweet media two
0
5
27
0
4
21
@katherine1ee
Katherine Lee
8 months
there's a lot of that going on here, so I want to give more context 1. Memorization is _not_ fixed by fine-tuning. To simply claim the opposite, without context, abstracts away any meaning. Fine-tuning is more training. What are you fine-tuning the model on? But also, a model...
1
2
19
@katherine1ee
Katherine Lee
6 months
I've been surprised by how much disagreement there is on how important (or not) the current copyright lawsuits are for the future of generative AI. Highly recommend this article for understanding more about the law & the various copyright arguments
@binarybits
Timothy B. Lee
6 months
Generative AI models produce stuff like this and it's a bigger legal vulnerability than a lot of people in the AI community want to admit. I'm excited to publish this copyright explainer I co-authored with @grimmelm .
Tweet media one
11
38
108
1
3
20
@katherine1ee
Katherine Lee
2 years
Excited to finally present this work at ACL tomorrow at 11:30am in Liffey B or virtually at 7:30am Dublin time with @daphneipp Really enjoyed hearing about the duplicates you all found in your datasets! Please come share your stories with us!
@katherine1ee
Katherine Lee
3 years
Data duplication is serious business! 3% of documents in the large language dataset, C4, have near-duplicates. Deduplication reduces model memorization while training faster and without reducing accuracy. Paper: Code: coming soon! 🧵⬇️ (1/9)
6
55
259
1
4
20
@katherine1ee
Katherine Lee
4 years
I kinda want these chairs now
Tweet media one
1
0
20