My 2023 summary:
🎓 PhD graduation
🏆 EMNLP Outstanding Paper
💯 Crossed 1000+ citations
🦙 Met a real alpaca in Peru 🇵🇪 Tried luring it with 🥕 for a fun photo, but the alpaca had its own thoughts and DID NOT follow my instructions at all 🤣
Embracing 2024 with fresh enthusiasm 🚀
When I tried OpenAI o1-preview on complex Chinese math problems, the model still thought in English. This behavior aligns with the findings in our #ACL24 paper on "Leveraging Pivot Language in Cross-Lingual Problems"
We found that answering non-English questions while thinking in
🚢Introducing WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
📌A GPT-4V-powered web agent that can complete user instructions end-to-end on real-world websites
📌Given [task instruction, trajectory], we show GPT-4V can be a good web agent task evaluator
𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲 𝗥𝗮𝘁𝗵𝗲𝗿 𝗧𝗵𝗮𝗻 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗲 is now 𝗮𝗰𝗰𝗲𝗽𝘁𝗲𝗱 to #𝗜𝗖𝗟𝗥𝟮𝟬𝟮𝟯 🎉🎉 Without using DPR/Google, it achieved SoTA on multiple open-domain QA and knowledge-intensive benchmarks! Work done
at @ms_knowledgenlp!
Code and paper:
📢New paper: Chain-of-Note
Retrieval-Augmented LMs are often misled by noisy, irrelevant documents. Adding IR could even hurt performance in some scenarios😅
Chain-of-Note improves +7.5 over standard RALM on NQ when all documents are noisy!
ArXiv:
🎉Personal Update: Successfully defended my PhD and am now part of @TencentGlobal AI Lab Seattle. Huge thanks to my advisor @Meng_CS for unwavering support. I'll work on frontier NLP research, focusing on novel tech in LLM, IR & instruction tuning. Feel free to reach out about internships!
🎉🎉#𝗘𝗠𝗡𝗟𝗣𝟮𝟬𝟮𝟮 𝗔 𝗨𝗻𝗶𝗳𝗶𝗲𝗱 𝗘𝗻𝗰𝗼𝗱𝗲𝗿-𝗗𝗲𝗰𝗼𝗱𝗲𝗿 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 𝘄𝗶𝘁𝗵 𝗘𝗻𝘁𝗶𝘁𝘆 𝗠𝗲𝗺𝗼𝗿𝘆: A closed-book model with much better performance than 𝗘𝗮𝗘, e.g. 47.2 EM on TriviaQA, that also outperforms open-book models on ELI5!
ArXiv:
📢 I am actively looking for research interns working with me in summer 2024 at
@TencentGlobal
AI Lab in Seattle.
If you have a research background in IR & RAG, factuality, reasoning, or agents and are interested in working with me, feel free to DM me! 😊
#𝐄𝐌𝐍𝐋𝐏𝟐𝟎𝟐𝟐 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥 𝐀𝐮𝐠𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧 𝐟𝐨𝐫 𝐂𝐨𝐦𝐦𝐨𝐧𝐬𝐞𝐧𝐬𝐞 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠: 𝐀 𝐔𝐧𝐢𝐟𝐢𝐞𝐝 𝐀𝐩𝐩𝐫𝐨𝐚𝐜𝐡. A simple way that retrieves relevant information from commonsense corpora for reasoning tasks.
#NLProc
💡Introducing DSBench: a challenging benchmark to evaluate LLM systems on real-world data science problems. GPT-4o scores only 28% accuracy, while humans achieve 66%. A clear gap, but an exciting challenge for AI advancement! 🧐
Paper:
Project led by our
🎉 New preprint! Generate rather than Retrieve: Large Language Models are Strong Context Generators. Our proposed method achieved new SoTA on open-domain QA! (1/5)
Arxiv link:
📢New paper "Sub-sentence Encoder" (led by @soshsihao), a contrastively-learned contextual embedding model for fine-grained semantic representation of text.
🏆Outperforms SimCSE, GTR, ST5 and other sentence embedding methods by a large margin!
ArXiv:
Text embeddings = one embedding for the entire text sequence.
But what if the text is long and says many things?
Can encoders produce a contextual embedding for an individual piece of meaning in one text sequence?
❗Check out: Sub-Sentence Embeddings
1/6
🎉 #EMNLP paper: LLMs are greatly influenced by the quality of instructions, and manually writing instructions for each task is laborious and unstable. We (led by @zhihz0535) introduce Auto-Instruct, which automatically improves the quality of instructions provided to LLMs.
🎉𝐃𝐞𝐧𝐬𝐞 𝐗 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥: What Retrieval Granularity Should We Use?
Neither passage-level nor sentence-level indexing is optimal for dense retrieval. We introduce a novel retrieval unit, the proposition, for dense retrieval. See details in this thread ~
#EMNLP
#NLProc
Wanted to share some of our new research 𝗱𝗶𝗿𝗲𝗰𝘁𝗶𝗼𝗻𝘀 on 𝗢𝗽𝗲𝗻-𝗱𝗼𝗺𝗮𝗶𝗻 𝗤𝗔😁:
1. Generate-then-Read: using GPT-3 to generate contexts
2. Entity Memory: attend to knowledge in memory, no retrieval
3. KG for QA: using Wikidata to better retrieve and read
We (Tencent AI Seattle Lab) still have one summer internship position, focused on RAG, Web Agent, or Multi-modal research. Please DM me if you are interested and have a relevant background. 😊
🎉🎉EMNLP 2022: Knowledge Graph Enhanced Passage Reader for Open-domain Question Answering. With the same retriever and the same set of retrieved passages, GRAPE can outperform the state-of-the-art reader FiD by a large margin.
ArXiv:
📢 Introducing Cognitive Kernel: an open-source agent system towards generalist autopilots. The system can interact with real-world environments, handle user-provided files, access websites (e.g., Amazon), and manage long-term chat history.
Our system is fully open-sourced and
📢 Excited to share that we will organize the 3rd workshop on Knowledge-Augmented NLP at ACL 2024. We will have six amazing speakers! We welcome your submissions and invite you to discuss with our speakers and organizers at the workshop. Looking forward to seeing you in Thailand!
📌Many LLM systems allow users to upload documents, such as GPT-4, Claude, and Kimi. Have you used any of these systems?🤔 𝐇𝐚𝐯𝐞 𝐲𝐨𝐮 𝐞𝐯𝐞𝐫 𝐰𝐨𝐧𝐝𝐞𝐫𝐞𝐝 𝐰𝐡𝐢𝐜𝐡 𝐬𝐲𝐬𝐭𝐞𝐦 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐬 𝐭𝐡𝐞 𝐛𝐞𝐬𝐭 𝐰𝐡𝐞𝐧 𝐲𝐨𝐮 𝐚𝐬𝐤 𝐚 𝐪𝐮𝐞𝐬𝐭𝐢𝐨𝐧 𝐛𝐚𝐬𝐞𝐝 𝐨𝐧
📢 Introducing IfQA - the first large-scale open-domain question answering (ODQA) dataset centered around counterfactual reasoning. Together with @Meng_CS and @ai2_aristo!
Paper link:
📢 New paper: Compared to 𝐌𝐮𝐥𝐭𝐢-𝐦𝐨𝐝𝐚𝐥 𝐂𝐨𝐓, we found that 𝐃𝐞𝐬𝐜𝐫𝐢𝐛𝐞 (visual description generation)-then-𝐑𝐞𝐚𝐬𝐨𝐧 (generating 𝐌𝐮𝐥𝐭𝐢-𝐦𝐨𝐝𝐚𝐥 𝐂𝐨𝐓 with the assistance of descriptions) could greatly improve math reasoning on MathVista and MathVerse.
📢 Fall semester internship at
@TencentGlobal
AI Lab in Seattle: We are actively looking for research interns working on IR & RAG, Complex Reasoning, Multi-modal and Language Agent. If you are interested in working with us, feel free to DM me! 📷
🤓 Arriving at #ACL2024 with @hongming110. Excited to meet old and new friends, and to discuss LLM agents, multi-modal learning, and RAG.
Our Tencent AI Lab, with locations in Seattle, Shenzhen, and Beijing, has multiple FTE and intern positions available. If you're looking for
I deeply appreciate the implementation of WebVoyager and the fantastic video that explains how to use LangGraph to build it, as well as the comprehensive discussion of LangGraph.
Our team will provide more detailed information and make our source code
⛴️ WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
WebVoyager is a new kind of web-browsing agent, developed by Hongliang He, @wyu_nd, et al.
Powered by large multi-modal models, like GPT-4V, it uses browser screenshots to conduct research, analyze
Successful conclusion of the first Knowledge-Augmented NLP workshop at #AAAI23! With over 50 in-person attendees and 20 virtual participants, it was a huge success and one of the most well-attended events at #AAAI. Check out the blog and photos below!
#EMNLP2022
🧭Task Compass: Scaling Multi-task Pre-training with Task Prefix
🤔When doing multi-task pre-training at scale, how can we explore task relationships?
💡We find that task relationships can be probed by simply adding single-token task prefixes!
𝐍𝐞𝐰 𝐒𝐮𝐫𝐯𝐞𝐲 𝐩𝐚𝐩𝐞𝐫 in
#eacl2023
! New perspectives to summarize multi-task learning in NLP from task relatedness and training methods! Also nice future work discussion.
#NLProc
Our paper "A Survey of Multi-task Learning in Natural Language Processing: Regarding Task Relatedness and Training Methods" has been accepted to the #eacl2023 main conference! Collaboration with @wyu_nd, @Meng_CS, @Zhichun5 and Mengxia Yu.
Thanks
@_akhaliq
for covering our work! WebVoyager🚢 is a GPT-4V powered web agent that can follow human instructions and complete tasks (e.g. ticket booking, shopping) on various real-world websites (e.g. Google flights, Amazon)!
The paper also presents a new benchmark dataset
Tencent presents WebVoyager
Building an End-to-End Web Agent with Large Multimodal Models
paper page:
The advancement of large language models (LLMs) leads to a new era marked by the development of autonomous applications in the real world, which drives
Excited to announce four highly esteemed keynote speakers
@amit_p
,
@boydgraber
,
@scottyih
, Chandan at our upcoming
@knowledgenlp
#AAAI23
workshop on Feb 13th! Dive into the cutting-edge topics of neuro-symbolic AI, code understanding, retrieval-augmented LM, and advanced QA.
📣Our 3rd workshop on knowledge-augmented NLP will take place at ACL 2024 this year! Submission ddl: May 17, 2024! Looking forward to seeing you in Thailand!
🎉Excited to announce the 3rd Workshop on Knowledge-Augmented NLP at ACL 2024 in Thailand!
Submission deadline: May 17, 2024.
Eager to reconnect with old friends and welcome new faces in the Knowledge NLP community!
#ACL2024
#NLProc
Our
@MegagonLabs
Best Paper Award winner was "Empowering Language Models with Knowledge Graph Reasoning for Question Answering" by Ziniu Hu et al from UCLA!
Paper link:
Thank you to award sponsor
@MegagonLabs
for supporting our event! (4/4)
PLUG is a novel cross-lingual instruction tuning method that can make LLaMA follow Chinese instructions (and those in other low-resource languages) very well!
Check out our paper at
🤨LLMs struggle to follow instructions in low-resource languages?
⚡️Introducing PLUG: leveraging pivot language in cross-lingual instruction tuning
📈Improved LLaMA-2 by 32% on 4 diverse languages!
Check out our new preprint at➡️
Thanks to LangChain AI for covering Chain-of-Note and implementing a Chain-of-Note app as a LangChain template. Chain-of-Note improves performance when retrieved information contains noise.
Check out our paper at
🗒️Chain-of-Note Template
Chain-of-Note is a new prompting technique by
@wyu_nd
et al for RAG applications that helps improve performance when the retrieved information might be noisy.
We implemented a Chain-of-Note app as a LangChain template. Given a question, query Wikipedia
🔎Proposition-Based Retrieval
This new paper by
@tomchen0
introduces a new retrieval method by changing 🎯what is indexed🎯 in the first place
This can easily use our 🌲multi-vector retriever🌲, and we've added a template to get started with it easily!
💡How does it work? 👇
📢Calling all
#NLP
enthusiasts! The 2nd Knowledge Augmented Methods for NLP workshop at
#KDD2023
is now accepting paper submissions 📝👩💻! Deadline: May 23rd. Accepted papers will be non-archival. For more info, check out 👉
#AI
#MachineLearning
#NLProc
Combining retrieval AND generation (in step 1) can further improve model performance, as shown in Figure 3. The choice between retrieval and generation is interesting, and their complementarity is worth exploring: use the retriever or the generator only where it helps.
Right now we do:
1. retrieve docs
2. LLM generate output w/ those
But this doesn't fully leverage LLM power for step 1.
What if we directly generate contextual docs for a question, instead of retrieving external docs?!
Paper
Code
I will be at #ACL2024 and will be hosting our 3rd workshop on knowledge-augmented methods for NLP on August 16. We invited 6 keynote speakers and have 30 accepted oral and poster papers, covering diverse topics on RAG, KG, Agent …
See details at
Thrilled to announce our finalized schedule at
#ACL2024
! We're excited to feature 6 keynote speakers and 30 accepted papers. Join us for an inspiring event!
🧐We introduce a new method: using reflective thoughts to improve the model's reasoning capability, just as we humans often do when we step back to question our assumptions, make analogies, and explore alternative solutions.
🧐Previous math augmentation focused on improving single-round QA
🎯We introduce a new method that 1⃣ augments standard math settings and 2⃣ excels in reflective thinking scenarios!
👉Check our latest preprint at
Pls consider submitting your work to our Knowledge Augmented NLP workshop at #AAAI2023! Looking forward to seeing you in Washington DC next February 🎉
Hello World! The first workshop on Knowledge Augmented Methods for NLP at
#AAAI2023
is welcoming submissions🙌! Papers due by Nov. 8! Accepted papers will be non-archival! Details are available 👉
The new paper from our Tencent AI Lab identifies 8 valuable insights into the current state of machine translation research in the LLM era, and proposes potential avenues for future advances!
Check the paper below 😊
💡 How are Large Language Models reshaping the landscape of Machine Translation? 🎈
🚀 Check out our latest paper for interesting findings. We comprehensively revisited Six Classic Challenges of MT in the context of LLMs. 🎉
👉 Dive in here:
And
“Retrieves non-parametric memories only when necessary.” This is a very insightful conclusion from asking “how retrieval is complementary to LLM parametric knowledge.” We showed the same observation in our paper but did not give a detailed analysis. Learned a lot!
Can we solely rely on LLMs’ memories (eg replace search w ChatGPT)? Probably not.
Is retrieval a silver bullet? Probably not either.
Our analysis shows how retrieval is complementary to LLMs’ parametric knowledge [1/N]
📝
💻
Hello friends at
#NeurIPS2023
, our
@TencentGlobal
AI Lab in Seattle is actively looking for research interns for 2024. If you are interested in topics such as RAG, Reasoning, LLM Agent, and user interfaces, feel free to DM me for a chat!😊
Welcome to our presentation today, 11:30-11:45, in Hall B at #EMNLP2022! The unified entity memory network has much stronger capabilities than EaE (first released by Google’s @professorwcohen) and is not restricted to entity-only outputs.
Call for papers! The first workshop on Knowledge Augmented Methods for NLP (#NLProc) at #AAAI2023 is welcoming submissions🙌! Papers due on Nov. 4! Papers will be non-archival, so published papers (e.g. #EMNLP2022) can also be presented at our workshop! Details👉
📢📢Excited to have one paper accepted to
#NeurIPS2022
! We present a new dataset, ScienceQA, and develop large language models to learn to generate lectures and explanations as the chain of thought (CoT). Data and code are public now! Please check👇👇
My awesome student
@JamesYHuang36
just received an outstanding paper award at
#EMNLP2023
! He is looking for a summer research internship. Please interview him.
(2/3) Unified Encoder-Decoder Framework with Entity Memory (
#EMNLP2022
): The entity knowledge is stored in the memory as latent representations, and the memory is pre-trained on Wikipedia along with encoder-decoder parameters.
In this paper, we introduce DocBench, a new benchmark designed to evaluate LLM-based document reading systems. Our benchmark involves a meticulously crafted process, including the recruitment of human annotators and the generation of synthetic questions. It includes 229 real
In 2021, we wrote a survey to highlight a key LM challenge: augmenting with external knowledge via IR, tools, etc. The introduction of plugins in ChatGPT reaffirms the effectiveness of knowledge augmentation for infusing LLMs with up-to-date information
We are adding support for plugins to ChatGPT — extensions which integrate it with third-party services or allow it to access up-to-date information. We’re starting small to study real-world use, impact, and safety and alignment challenges:
o1-preview-2024-09-12 on BigCodeBench-Hard
Complete 34.5% (slightly better than Claude-3.5-Sonnet-20240620)
Instruct 23.0% (far below other top models)
Average 28.8%
o1-preview may follow detailed instructions reasonably well, but not the brief ones.
Not sure how consistent
Thank you, Jerry @jerryjliu0, for highlighting our proposition retrieval work in LlamaIndex. The LlamaPack truly demonstrates the practical application and effectiveness of proposition-based retrieval systems!
A big factor for building production RAG is deciding the "chunk" used for retrieval + synthesis: should it be a sentence? Paragraph?
In the "Dense X Retrieval" paper (
@tomchen0
et al.), the authors propose a concept that we've advocated for a while: decouple the indexed chunk
(1/3) Generate-then-read proposes a novel pipeline for solving open-domain QA tasks, i.e., replacing the process of retrieving contextual documents from large-scale corpora such as Wikipedia by prompting GPT-3 to generate relevant contextual documents.
We improve current RALMs in two aspects: (1) Noise Robustness: the ability to discern and disregard noisy information present in irrelevant retrieved documents, (2) Unknown Robustness: the ability to acknowledge its limitations by responding with “unknown” (1/4)
💡𝐍𝐞𝐰 𝐌𝐚𝐭𝐡 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤: Different from existing single-turn math QA datasets, MathChat is the first benchmark focusing on multi-turn conversations about math.
🔔Existing LLMs exhibit a significant decline in math reasoning ability after multi-turn conversations!
🚀 Excited to share our latest research MathChat! 📊 We explore the new frontiers in interactive math problem-solving. Check it out! 🧵👇
MathChat is a benchmark designed to evaluate LLMs on mathematical multi-turn interaction and open-ended generation.
DSBench requires LLM systems to read user-uploaded files and to write and execute code to solve data science problems.
This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
The dataset is available at
@nembal
I think it's mainly due to the imbalance in the language distribution in the pre-training corpus. Knowledge embeddings aren't as well connected across different languages.
I remember when I studied abroad, it took me longer to learn a new concept compared to taking a similar class
New paper 🎉:
@lupantech
Pan’s survey is a good summary and analysis of recent work on language models in mathematical reasoning. If you are interested in mathematical reasoning, definitely check it out! Feedback welcome!
🎉New paper! The survey of deep learning for mathematical reasoning (
#DL4MATH
) is now available. We've seen tremendous growth in this community since 2018, and this review covers the tasks, datasets, and methods from the past decade.
Check it out now:
In the past few months, we’ve seen SOTA LLMs saturating basic coding benchmarks with short and simplified coding tasks. It's time to enter the next stage of coding challenge under comprehensive and realistic scenarios!
-- Here comes BigCodeBench, benchmarking LLMs on solving
@ZhiruoW
This is great work! We also noticed that irrelevant context could hurt model performance in industry applications. We just released a paper yesterday, with a similar goal, to improve noise robustness in RAG
We also present a novel clustering-based prompting approach to generate diverse contextual documents that increases the likelihood of generating a correct answer with more generations. This approach can significantly improve performance on downstream tasks. (3/5)
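One way the clustering-based prompting idea above could be sketched, assuming that question-document pairs have already been clustered by some external step (the cluster ids are given as input here). The function name `build_cluster_prompts`, the prompt layout, and the sampling scheme are illustrative assumptions, not the paper's actual code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def build_cluster_prompts(
    pairs: List[Tuple[str, str]],
    cluster_ids: List[int],
    question: str,
    demos_per_cluster: int = 2,
    seed: int = 0,
) -> List[str]:
    """Build one few-shot prompt per cluster of (question, document) pairs."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for pair, cid in zip(pairs, cluster_ids):
        by_cluster[cid].append(pair)

    prompts = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        # Demonstrations drawn from a single cluster; different clusters are
        # meant to steer the generator toward different kinds of documents.
        demos = rng.sample(members, min(demos_per_cluster, len(members)))
        parts = [f"Question: {q}\nDocument: {d}" for q, d in demos]
        parts.append(f"Question: {question}\nDocument:")
        prompts.append("\n\n".join(parts))
    return prompts
```

Each returned prompt would be sent to the generator separately, giving one generated document per cluster.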
Shout-out to Notre Dame's iSURE program and CSE PhD program. You may become interested in them if you get a chance to read my student Wenhao's stories.
Wenhao Yu is a rising 4th-year PhD student with a Bloomberg Fellowship, working on NLP / QA.
Want to teach logical reasoning 💭 skills to LMs 🤖?
Check out Apollo, our new adaptive pretraining strategy to improve logical reasoning in LMs. It is
(a) Simple to implement
(b) Generalizable across task formats
(c) Needs minimal data processing
Paper:
[2/n] This plug-and-play pipeline generates initial outputs, retrieves relevant info from document collections, and refines these outputs, thus efficiently addressing LLMs' limitations.
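A rough sketch of the three steps named above (generate, retrieve, refine), assuming hypothetical `llm` and `retrieve` callables. The prompt wording, the use of the draft as part of the retrieval query, and the toy stubs are all assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical interfaces, not tied to any specific library.
LLM = Callable[[str], str]
Retriever = Callable[[str, int], List[str]]


def generate_retrieve_refine(question: str, llm: LLM, retrieve: Retriever, top_k: int = 2) -> str:
    # 1) Generate an initial output without any retrieved evidence.
    draft = llm(f"Question: {question}\nAnswer:")
    # 2) Retrieve documents using the question plus the draft (an assumption).
    evidence = "\n".join(retrieve(f"{question} {draft}", top_k))
    # 3) Refine the draft in light of the retrieved evidence.
    return llm(
        f"Evidence:\n{evidence}\nQuestion: {question}\n"
        f"Draft answer: {draft}\nRefined answer:"
    )


# Toy stand-ins so the sketch runs without external services.
CORPUS = [
    "Mount Everest is the highest mountain on Earth.",
    "The Pacific is the largest ocean.",
]


def toy_retrieve(query: str, top_k: int) -> List[str]:
    words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]


def toy_llm(prompt: str) -> str:
    if "Evidence:" in prompt:
        return "Mount Everest" if "Everest" in prompt else "unknown"
    return "K2"  # deliberately wrong first draft


refined = generate_retrieve_refine(
    "What is the highest mountain on Earth?", toy_llm, toy_retrieve, top_k=1
)
```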
I'm happy to share that our paper "Open-domain Question Answering via Chain of Reasoning over Heterogeneous Knowledge" is now online. We proposed a unified framework for solving single- and multi-hop questions that require reasoning over tables and/or text.
Work done with my colleagues at Tencent AI Seattle lab: Hongming Zhang (@hongming110), Kaixin Ma (@KaixinMa9), Xiaoman Pan, Hongwei Wang, Dong Yu. (4/4)
DocBench construction pipeline. (a) Document Collection: gathering PDF files from five different domains; (b) QA-pair Generation: creating diverse and comprehensive QA pairs through a combination of LLMs and human effort; (c) Quality Check: ensuring data quality through a
Our experiments across four open-domain QA benchmarks show that RALMs equipped with CoN significantly outperform standard RALMs. Notably, CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents. (3/4)
[3/n] Empirical results: over +6.0% improvement under zero-shot settings and +2.5% under few-shot settings compared to baselines on multiple open-domain QA, dialogue benchmarks.
We propose a novel generate-then-read pipeline for solving open-domain QA tasks, i.e., replacing the process of retrieving contextual documents from large-scale corpora such as Wikipedia by prompting a large language model to generate relevant contextual documents. (2/5)
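A minimal sketch of the generate-then-read idea described above, assuming a hypothetical `LLM` callable that maps a prompt string to text. The prompt wording and the names `generate_then_read`, `toy_generator`, and `toy_reader` are illustrative assumptions, not the paper's actual prompts or code.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]


def generate_then_read(question: str, generator: LLM, reader: LLM, num_docs: int = 1) -> str:
    """Generate contextual documents with an LLM, then read them to answer."""
    docs: List[str] = []
    for _ in range(num_docs):
        # Step 1: generate a contextual document instead of retrieving one.
        docs.append(generator(
            "Generate a background document to answer the question.\n"
            f"Question: {question}\nDocument:"
        ))
    context = "\n\n".join(docs)
    # Step 2: the reader answers the question given the generated documents.
    return reader(f"Passage: {context}\nQuestion: {question}\nAnswer:")


# Stub models so the sketch runs without any external service.
def toy_generator(prompt: str) -> str:
    return "Paris is the capital of France."


def toy_reader(prompt: str) -> str:
    return "Paris" if "Paris" in prompt else "unknown"


answer = generate_then_read("What is the capital of France?", toy_generator, toy_reader)
```

In practice the generator and reader could be the same model or different ones; the sketch keeps them separate so each step is visible.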
Chain-of-Note generates a series of reading notes for retrieved documents, enabling a comprehensive assessment of their relevance to the input query. We employed ChatGPT to create training data for CoN, which was then used to train a LLaMA-2 7B model. (2/4)
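A small sketch of what a note-taking prompt in this spirit might look like: one reading note per retrieved document, then a final answer, with "unknown" as a fallback. The function name `build_chain_of_note_prompt` and the instruction text are assumptions for illustration and are not the prompt used in the paper.

```python
from typing import List


def build_chain_of_note_prompt(question: str, documents: List[str]) -> str:
    """Ask for one reading note per retrieved document before the final answer."""
    lines = [
        "Read each retrieved document and write a short note on whether it "
        "helps answer the question. Then give a final answer.",
        'If no document is relevant and you do not know the answer, reply "unknown".',
        f"Question: {question}",
    ]
    for i, doc in enumerate(documents, start=1):
        lines.append(f"Document {i}: {doc}")
    lines.append("Notes:")
    return "\n".join(lines)


prompt = build_chain_of_note_prompt(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy by William Shakespeare.", "Bananas are rich in potassium."],
)
```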
We conduct experiments with three knowledge-intensive NLP tasks. By only leveraging language models, our method can outperform dense retrieval methods. We establish a new SoTA result on TriviaQA and WebQ, improving exact match by +6.5 and +5.7 compared to DPR-FiD. (4/5)
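For readers unfamiliar with the exact match (EM) metric mentioned above, here is a sketch assuming the commonly used SQuAD-style normalization (lowercasing, stripping punctuation and English articles). The paper's own evaluation script may differ in details.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> int:
    """Return 1 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))
```

A dataset-level EM score is then the mean of `exact_match` over all questions, usually reported as a percentage.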
[3/n] 🚀The unique challenges posed by the IfQA benchmark will undeniably spur advancements in retrieval and counterfactual reasoning, driving the next frontier in QA research.
#AI
#ML
#NLProc
We proposed a unified framework of Retrieval-Augmented Commonsense reasoning, including a newly constructed commonsense corpus with over 20 million documents and novel strategies for training a commonsense retriever. (1/4)
We also introduce FACTOIDWIKI, a processed English Wikipedia dump, where each page is segmented into multiple granularities: 100-word passages, sentences and propositions.
[3/4]
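To make the first two granularities concrete, here is a sketch of splitting a page into 100-word passages and into sentences. The whitespace tokenization and the naive punctuation-based sentence splitter are simplifying assumptions. Proposition extraction is not shown, since it needs a trained model that this sketch does not attempt to reproduce.

```python
import re
from typing import List


def split_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Chunk text into consecutive passages of at most `words_per_passage` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```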