Jimmy Lin

@lintool

Followers: 13,607
Following: 840
Media: 330
Statuses: 4,073

I profess CS-ly at the @UWaterloo and gaze into the technological crystal ball at @Primal. I used to write code for @Twitter and slides for @Cloudera.

Nearby data lake
Joined February 2010
Pinned Tweet
@lintool
Jimmy Lin
1 year
With so much noise from ChatGPT, LLMs, and the "we're all going to die" crowd, my students have (understandably) been experiencing existential angst, asking me about the implications for NLP and IR.
10
30
152
@lintool
Jimmy Lin
4 years
Reviewers automatically assume that simple is not novel. This is sheer laziness. Yes, it may be simple and obvious in retrospect, but someone had to have that insight first. Simple is good. Simple is robust, easy to implement and reproduce, broadly applicable, etc.
58
508
4K
@lintool
Jimmy Lin
2 years
DAAM... You saw it here first! Attribution maps for Stable Diffusion based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork: example for "an angry, bald man doing research" below - demo at
14
153
904
@lintool
Jimmy Lin
1 year
GPT-4 and its ilk are awesome for rapid prototyping and one-offs, but at the end of the day, enterprises will deploy far smaller distilled models in production. Here's my contrarian take -
28
153
742
@lintool
Jimmy Lin
3 years
So, CV researchers are looking at transformers and NLP researchers are looking at CNNs (again). What a strange world.
16
51
726
@lintool
Jimmy Lin
4 years
Still cropping and modifying BERT diagrams from Devlin et al. (2019)? I spent several hours redrawing BERT in PowerPoint so you don't have to... Perfect for use in presentations, papers, etc.! Happy to share - releasing under CC BY 4.0
15
134
714
@lintool
Jimmy Lin
6 years
Following the AI Residency program by Google, Facebook, Microsoft, Uber, etc., I'd like to start the Waterloo AI Residency program. It's called grad school.
6
48
351
@lintool
Jimmy Lin
3 years
Look what came in the mail!
3
13
336
@lintool
Jimmy Lin
4 years
"NLP makes IR interesting and IR makes NLP useful!" - slides from my #sigir2020 summer school talk at: Get your rotten tomatoes and eggs out!
9
49
302
@lintool
Jimmy Lin
1 year
My (contrarian?) take: prompt engineering is programming in natural language. We've tried this before, with attempts dating back decades. Recent advances do not change the fact that natural languages are ambiguous, imprecise, under-specified, highly contextual, etc.
10
22
201
@lintool
Jimmy Lin
2 years
Recently, @CohereAI boasted "3X better performance" in multilingual text understanding. We tested that claim by evaluating Cohere embeddings on MIRACL: tl;dr - We weren't able to replicate the 3X claim, but we did observe a 38% improvement over BM25.
5
26
200
@lintool
Jimmy Lin
4 months
RAG is all the RAGe these days, but we (still) don't quite know how to evaluate it properly... This year, we are taking a stab at it in the context of TREC, building on 30+ years of experience in evaluating IR systems.
2
46
193
@lintool
Jimmy Lin
4 years
Case in point: "Passage Re-ranking with BERT" by @rodrigfnogueira and @kchonyc was never accepted anywhere because of the "too simple, not novel" laziness. Yet that paper is LITERALLY cited in every single BERT-for-ranking paper ever published since.
1
10
179
@lintool
Jimmy Lin
1 year
Just how good are commercially available embedding APIs for vector search? An effort led by @ehsk0 evaluated a few of them - @OpenAI @CohereAI @Aleph__Alpha - on BEIR and MIRACL... Check out the results! - forthcoming #ACL2023 industry track paper
3
36
180
@lintool
Jimmy Lin
1 year
ACM Fellows, class of 2022 - from the awards banquet last night
10
4
166
@lintool
Jimmy Lin
4 years
Not novel. Not novel. Not novel. Reviews seem to be generated with a simple n-gram LM.
4
9
163
@lintool
Jimmy Lin
4 years
On the subject of current citation formats being, essentially, racist: I have two students (current/former) with surname Wang, 2 x Liu, 3 x Yang + 1 collaborator; ~dozen x Zhang in my collaboration circle. A citation like [Zhang et al. 2019] is unhelpful.
20
30
160
@lintool
Jimmy Lin
2 years
I've thought long and hard in recent years about building a culture of reproducibility in research. Over the holidays, I had a chance to organize my thoughts in a piece that has been on the back burner for many months. I'd welcome your comments!
7
21
160
@lintool
Jimmy Lin
1 year
tl;dr - in the enterprise context, you'll start with rapid prototyping using GPT-4 (or another LLM) but eventually end up with a far smaller but just as capable specialized distilled model. That's the journey I see from PoC to prod.
8
20
145
@lintool
Jimmy Lin
10 months
New entrants into the camelidae family 🦙 for retrieval! @xueguang_ma presents RepLLaMA (a dense retrieval model) and RankLLaMA (a pointwise reranker) fine-tuned on (you guessed it!) LLaMA for multi-stage text retrieval:
5
18
139
@lintool
Jimmy Lin
3 years
"Pretrained Transformers for Text Ranking: BERT and Beyond" with @rodrigfnogueira and @andrewyates - it started on June 18, 2020 and culminates here with the official publication... Enjoy! institutional subscribers: retail orders:
2
29
131
@lintool
Jimmy Lin
1 year
My (contrarian?) predictions on ChatGPT, Bard, and its ilk: Regarding the two biggest problems today, (1) hallucinations and (2) toxicity, the first will be transient (i.e., solved relatively shortly) and the second will be perpetual (i.e., will never be solved). Rationale:
6
27
131
@lintool
Jimmy Lin
4 years
Those who argue that curation is the answer to "more ethical AI/ML/LMs" come across as intellectually naive. Archivists have been grappling with this issue, literally, for millennia. They'd be well advised to consult some of the literature from that field.
6
20
122
@lintool
Jimmy Lin
1 year
I'll go on the record with perhaps another contrarian opinion: Lucene HNSW is the future and (pure) vector DB vendors are in trouble. Why?
5
34
121
@lintool
Jimmy Lin
4 years
We are scheming to write a paper where the author list is Jimmy Lin, Jimmy Lin, Jimmy Lin.
6
3
119
@lintool
Jimmy Lin
4 years
I'll conclude by being constructive. As an AC, I have overridden this "not novel, too simple" garbage on more than one occasion. In some cases I have spent hours poring over the literature to determine if this paper was indeed the first to have a particular insight. 1/2
1
3
116
@lintool
Jimmy Lin
1 year
Dense retrieval without requiring a dedicated vector DB! Here's a guide on how you can take OpenAI ada2 embeddings (on MS MARCO passages) and perform retrieval directly using Lucene with our group's Anserini toolkit
3
17
114
@lintool
Jimmy Lin
5 years
What a sad state of affairs we find our field in: rejected because you didn't SOTA; rejected because you SOTA'ed (just leaderboard chasing, no insight); flag plant on arXiv, rejected (reviewer cites your paper as evidence of lack of novelty); don't arXiv, your idea is scooped.
4
18
110
@lintool
Jimmy Lin
3 years
The modern search landscape confusing you? Dense retrieval, sparse retrieval, transformer-based rerankers, multi-stage architectures, nearest neighbors, HNSW, blah blah blah. My attempt to sort it all out in a single conceptual framework:
1
22
107
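The multi-stage architecture mentioned in the tweet above - a cheap first-stage candidate generator followed by a more expensive reranker - can be sketched in toy form. Everything below (documents, scoring functions) is an illustrative stand-in, not taken from any toolkit named in this thread; real pipelines pair BM25 or dense retrieval with a transformer reranker.

```python
# Toy two-stage ranking pipeline: a cheap first stage generates candidates,
# then a pricier "reranker" (a stand-in scoring function here) reorders them.
corpus = {
    "d1": "neural networks for text ranking",
    "d2": "bm25 is a strong sparse retrieval baseline",
    "d3": "dense retrieval with transformer encoders",
    "d4": "cooking recipes for busy students",
}

def first_stage(query, k=3):
    """Candidate generation: rank documents by raw query-term overlap."""
    q = set(query.split())
    scored = [(d, len(q & set(text.split()))) for d, text in corpus.items()]
    return [d for d, s in sorted(scored, key=lambda x: x[1], reverse=True)[:k]
            if s > 0]

def rerank(query, candidates):
    """Stand-in reranker: weight rarer query terms more heavily."""
    def df(term):  # document frequency over the whole toy corpus
        return sum(term in text.split() for text in corpus.values())
    def score(d):
        words = set(corpus[d].split())
        return sum(1.0 / (1 + df(t)) for t in query.split() if t in words)
    return sorted(candidates, key=score, reverse=True)

candidates = first_stage("dense retrieval ranking")  # cheap, high recall
final = rerank("dense retrieval ranking", candidates)  # expensive, precise
```

The point of the two stages is economic: the reranker only ever sees the handful of candidates the first stage lets through.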
@lintool
Jimmy Lin
7 months
Prompt-decoder LLMs for listwise reranking too large for you? Introducing our new LiT5 family of listwise reranking models: nearly as good but *much* smaller. Yup, T5's still got tricks to offer!
3
16
98
@lintool
Jimmy Lin
11 months
Another addition to the "X is all you need" genre of papers: We took OpenAI embeddings of MS MARCO passages and stuffed them into Lucene - turns out you don't need fancy schmancy vector stores for dense retrieval! Lucene will do.
@lintool
Jimmy Lin
1 year
Dense retrieval without requiring a dedicated vector DB! Here's a guide on how you can take OpenAI ada2 embeddings (on MS MARCO passages) and perform retrieval directly using Lucene with our group's Anserini toolkit
3
17
114
9
16
101
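The core idea in the two tweets above - dense retrieval as nearest-neighbor search over embeddings - can be illustrated with a brute-force sketch. The 3-d vectors below are made up for illustration; real systems use model-generated embeddings and approximate search structures such as HNSW rather than an exact scan.

```python
import math

# Toy document "embeddings" (illustrative only; in practice these
# come from a dense encoder, not hand-written 3-d vectors).
doc_vectors = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.2, 0.8, 0.1],
    "d3": [0.4, 0.4, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vector, k=2):
    """Exact (brute-force) top-k retrieval by cosine similarity."""
    scored = [(docid, cosine(query_vector, vec))
              for docid, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# A query vector close to d1 should retrieve d1 first.
results = search([1.0, 0.0, 0.1])
```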
@lintool
Jimmy Lin
4 years
This includes checking with PC chairs on whether the previous work and the paper in question are by the same authors. And if I do recommend rejection based on novelty, it is with a citation to a paper confirmed not to be by the authors of the paper under review. 2/2
6
4
100
@lintool
Jimmy Lin
6 years
Knuth says not everyone should be working on deep learning...
4
8
98
@lintool
Jimmy Lin
3 years
With @rodrigfnogueira and @andrewyates we're happy to share the revised version of our book "Pretrained Transformers for Text Ranking: BERT and Beyond" - significant updates to transformer-based reranking models and dense retrieval techniques!
4
27
99
@lintool
Jimmy Lin
3 years
Presenting AfriBERTa, a pretrained LM for 11 African languages, by @Kelechukwu_. What's neat is that its pretraining corpus is 0.04% the size of XLM-R's (<1GB), but AfriBERTa performs just as well on downstream tasks! To appear at the MRL workshop at #EMNLP2021
2
19
93
@lintool
Jimmy Lin
2 years
Slides from my recent talk: A Conceptual Framework for a Representational Approach to Information Retrieval:
1
12
92
@lintool
Jimmy Lin
2 years
By knocking out two major datacenters, road construction work has literally brought down a non-trivial fraction of all deep learning experiments being conducted in Canada.
5
4
85
@lintool
Jimmy Lin
9 months
Belated, but congrats to @cohere for their new Embed v3 model! With @JinaAI_ making a similar product announcement recently, it's clear that this space is heating up! This is awesome for the community... and of course for my students working in this space!
2
11
89
@lintool
Jimmy Lin
3 years
"Serverless BM25 Search and BERT Reranking." #DESIRES2021 paper: slides:
2
18
87
@lintool
Jimmy Lin
9 years
It's official! I'm leaving Maryland to take up the David R. Cheriton Chair in the School of Computer Science at the University of Waterloo!
42
13
84
@lintool
Jimmy Lin
1 year
#ACL2023NLP (or #ACL2023 ) is a great opportunity to plug what is shaping up to be perhaps the largest collection of core NLP faculty in Canada 🇨🇦🍁 at Waterloo @UWCheritonCS - joining @WenhuChen and me next year will be @fredahshi @hllo_wrld @yuntiandeng ! Come find us to chat!
1
12
84
@lintool
Jimmy Lin
4 years
Thanks to the tremendous effort of @edwinzhng @1729_gupta @kchonyc, we're proud to present the Neural Covidex, our updated AI-powered search interface to @allen_ai's COVID-19 corpus: Powered primarily by Lucene, T5, and BioBERT.
3
42
82
@lintool
Jimmy Lin
4 years
Introducing... SegaBERT! by @Richard_baihe et al. intuition is to introduce hierarchical position embeddings (paragraph, sentence, tokens) to better capture context during pretraining: simple idea, fairly large gains!
1
20
79
@lintool
Jimmy Lin
4 years
Why I hate doing reimbursements: the default assumption by @UWaterloo is that you're a criminal trying to embezzle money from research accounts. Maryland was more sane. What's your experience been like at other places?
18
0
76
@lintool
Jimmy Lin
3 years
If you're interested in dense retrieval, you'll want to check out this DPR replication effort led by @xueguang_ma tl;dr - BM25 is better than the original authors made it out to be, and free QA boost with better evidence fusion!
5
6
77
@lintool
Jimmy Lin
6 years
BERTserini: combining the magic of BERT with Anserini for end-to-end open-domain QA:
2
22
75
@lintool
Jimmy Lin
3 years
@colinraffel You have to grow a grey beard first. And don't say "bag of words". Dress it up as "heuristically weighted sparse representations".
1
0
75
@lintool
Jimmy Lin
2 years
"Should You Take My Advice?" I recently wrote up this essay on how I communicate with my students. Maybe it applies more broadly to other advisors as well? Feedback/comments/questions welcome!
1
12
76
@lintool
Jimmy Lin
2 years
Wanna train a multilingual dense retrieval model but confused what to do? For example: Which backbone? Pre-fine-tune? Use non-target language data? Here's a helpful guide that begins to compile together best practices:
1
13
71
@lintool
Jimmy Lin
6 years
I expect this will generate some discussion... "The Neural Hype and Comparisons Against Weak Baselines"
2
26
69
@lintool
Jimmy Lin
10 months
New work by @ralph_tang @crystina_z @xueguang_ma adds yet another prompting technique to the mix: *permutation* self-consistency prompting to overcome positional bias in LLMs. Useful for listwise ranking... read all about it!
0
10
69
@lintool
Jimmy Lin
7 years
We're celebrating the opening of the new Data Systems Lab at UWaterloo today! First innovation: we've learned to turn off gravity.
2
12
66
@lintool
Jimmy Lin
7 years
PSA: I say to my students - if you're working on NNs and DL, there are at least 5 PhD students at Tsinghua, Peking, Jiao Tong, Zhejiang, ... working on your idea right now. Be paranoid, execute, and publish!
3
25
66
@lintool
Jimmy Lin
3 years
Our group has this server with 100TB disk... and it's always full. Why? These dense retrieval models take up so much $#!& space. But @xueguang_ma et al. came up with simple compression solutions, to appear in #emnlp2021
2
11
68
@lintool
Jimmy Lin
2 years
My basic point: many people (researchers, companies, etc.) are making claims about their ability to do multi-lingual search. With MIRACL there's actually an objective (vendor-neutral) benchmark over 18 languages #wsdm2023 - put your 💰 where your 👄 is?
6
17
68
@lintool
Jimmy Lin
6 years
This came in the mail today. It weighs 37 pounds. I told my wife, "it's for the kids". Of course, that's a lie.
5
2
65
@lintool
Jimmy Lin
5 years
From the "ivory tower" to the "real-world": the story of how block-max WAND made its way into Lucene 8 by @jpountz et al. is told in a #ecir2020 preprint: Lessons for academics seeking to achieve research impact?
2
20
64
@lintool
Jimmy Lin
1 year
At the outset, zero-shot effectiveness of LLMs will be impressive, but it'll likely not be "good enough". Improving the model requires you to more precisely characterize the task - that is, you need what "in the old days" we'd call a task description and annotation guidelines.
3
1
65
@lintool
Jimmy Lin
4 years
The new SPECTER embeddings from @allen_ai are awesome: They're even more awesome when integrated into the Neural Covidex to power related article search: given an article, find similar articles. Try it out!
0
14
65
@lintool
Jimmy Lin
3 years
I've written about anti-Asian bias and model minority issues before: ... and I've gotten the eye-roll reaction of "why don't you sit down since you've got it so good already"... which is exactly the problem.
0
11
65
@lintool
Jimmy Lin
4 years
New work on using doc2query for summarization by @rodrigfnogueira et al. - works surprisingly well! Samples from CORD-19 corpus related to COVID-19 below.
0
15
65
@lintool
Jimmy Lin
1 year
The path will be from expensive general-purpose models (e.g., GPT-4) to cheaper specialized distilled models (e.g., encoder-only). Here's the progression I see -
2
2
61
@lintool
Jimmy Lin
3 years
Oh, btw, I'm looking for post-docs.
4
7
63
@lintool
Jimmy Lin
4 years
Another instance of the disconnect between academic reviewing and real-world impact: tl;dr - our keyword spotting model in JavaScript now powers wake word detection "Hey Firefox!" in Firefox Voice:
3
14
62
@lintool
Jimmy Lin
4 years
Thanks to @edwinzhng and @1729_gupta, our Anserini IR toolkit can now search @allen_ai's COVID-19 corpus. @kchonyc's connected it up to SciBERT, and bam, we have a two-stage neural ranking pipeline! Join and build on our work!
3
22
62
@lintool
Jimmy Lin
13 years
Interested in large-scale machine learning (#hadoop and otherwise)? I recommend this tutorial at #KDD2011: http://t.co/NlEzGEe
0
34
60
@lintool
Jimmy Lin
1 year
LLMs are missing a critical ingredient... and @Primal knows what it is! (Hint: knowledge graphs and neuro-symbolic approaches) Here's a writeup of the journey so far, featuring CEO @YvanCouture - oh btw, I'm the CTO.
2
6
61
@lintool
Jimmy Lin
9 years
"I couldn't do my assignment because GitHub is down" will be the next "my dog ate my homework".
2
49
61
@lintool
Jimmy Lin
4 years
Use of IBM Model 1 translation probs learned from MS MARCO to improve ranking by @srchvrs is brilliant and insightful!
1
6
56
@lintool
Jimmy Lin
3 years
To every university president who's sent mass email condolences (offering support, solidarity, etc.) after previous racially-motivated hate crimes... we're waiting.
6
6
59
@lintool
Jimmy Lin
4 months
Well, since you asked... a (short) history lesson on multi-stage ranking in IR...
1
8
57
@lintool
Jimmy Lin
1 year
And finally... with a fine-tuned model: "Why do I need all those billions of parameters?" Cost pressure is a powerful economic driver. Distillation (along with pruning, quantization, etc.) provides the answer (and will likely yield safer models).
1
1
57
@lintool
Jimmy Lin
5 years
In a preprint of our #sigir2019 short, we conducted a meta-analysis of 100+ papers reporting results on Robust04. tl;dr - weak baselines are still prevalent (both neural and non-neural models).
5
18
56
@lintool
Jimmy Lin
4 years
Further evidence of the one-sidedness of recent attacks on pretrained language models: the critics have conveniently forgotten about the benefits they bring to disadvantaged populations.
12
4
53
@lintool
Jimmy Lin
9 years
Check out my new plates!
3
7
53
@lintool
Jimmy Lin
3 years
We've released a new version (v0.11.0.0) of our Pyserini Python toolkit to support replicable IR research, now providing first-stage retrieval for sparse, dense, and hybrid representations. Our new arXiv paper provides an overview:
2
11
54
@lintool
Jimmy Lin
6 years
I am satisfied to end a recent paper thusly: Finally, in our collective frenzy to improve results on standard benchmarks, we may sometimes forget that the ultimate goal of science is knowledge, not owning the top entry in a leaderboard.
4
11
54
@lintool
Jimmy Lin
4 years
Apparently, I was recognized as an outstanding area chair at #emnlp2020 and didn't realize it until now... #humblebrag (do people even use this hashtag anymore?)
1
1
51
@lintool
Jimmy Lin
4 years
It's not every day you land a $1M (CAD) grant... announcing our Archives Unleashed 2 project led by @ianmilligan1 Looking forward to working with @ruebot @jefferson_bail @SamVFritz over the next few years!
2
2
51
@lintool
Jimmy Lin
5 years
Our latest study of BM25 variants, including Lucene's weird doc length encoding, with @kamphuis_c @arjenpdevries @srchvrs. tl;dr - it's okay!
0
20
51
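For readers unfamiliar with the formula under discussion, here is a minimal BM25 sketch in plain Python. The toy corpus is made up, and k1/b are set to commonly used defaults; real implementations (Lucene included, as the tweet notes) differ in details such as document-length encoding.

```python
import math
from collections import Counter

# Toy corpus (illustrative only).
corpus = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs and cats living together".split(),
}
N = len(corpus)
avgdl = sum(len(doc) for doc in corpus.values()) / N
df = Counter(term for doc in corpus.values() for term in set(doc))

def bm25(query, doc, k1=0.9, b=0.4):
    """Score one tokenized document against a query with standard BM25."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Rank all documents for the query "cat": the shorter matching
# document (d2) edges out d1 because of length normalization.
ranked = sorted(corpus, key=lambda d: bm25(["cat"], corpus[d]), reverse=True)
```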
@lintool
Jimmy Lin
2 years
Sparse or dense representations for retrieval? Or hybrid? psssssh, says @jacklin_64 - neither! Densify sparse lexical reps and tack on dense semantic reps: best of both worlds and simplified infrastructure also (no need for HNSW or inverted indexes!)
1
7
48
@lintool
Jimmy Lin
1 year
The Iron Triangle of LLMs: capable models, low development costs (CapEx), low inference costs (OpEx)... pick two, because you can't have all three! Examples? OpenAI API = low CapEx, high OpEx; BloombergGPT = (very) high CapEx but flexibility in OpEx depending on deployment.
2
7
51
@lintool
Jimmy Lin
4 years
We've connected Anserini to Solr to Blacklight to present a search frontend to @allen_ai's COVID-19 corpus! Check out - awesome work by @edwinzhng and @1729_gupta
4
18
50
@lintool
Jimmy Lin
3 years
Slides of my keynote at the BIR workshop @ #ecir2021 are available at We couldn't get "smarter" to beat "bigger"... but that's okay!
2
8
51
@lintool
Jimmy Lin
3 years
It's hard to build usable software, but tweets like this make all the blood, sweat, and tears worthwhile. Credit goes to an awesome team!
@CharlotteHase
Claudia Hauff 🇪🇺 🇺🇦 🇩🇪 🇳🇱
3 years
In the Information Retrieval course I let my students pick the IR toolkit of their choice among all the solutions we have available as a research community. Clear front-runner by a mile was Pyserini: . In big part thanks to its elaborate documentation!
3
9
57
2
2
50
@lintool
Jimmy Lin
3 years
MS MARCO v2 datasets are available for TREC 2021! Anserini baselines are available here:
0
6
50
@lintool
Jimmy Lin
2 years
I'll be giving a talk on a Conceptual Framework for a Representational Approach to Information Retrieval on April 5, 4pm PT as a Pinterest Labs Tech Talk @PinterestEng . RSVP and learn more here!
0
5
49
@lintool
Jimmy Lin
3 years
Happy to share Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in 11 languages by @crystina_z @xueguang_ma @ShiPeng16 tl;dr - think of this as the open-retrieval condition of TyDi. Paper: Data:
2
12
50