![Shikib Mehri Profile](https://pbs.twimg.com/profile_images/1572984874117758976/YqdOi7-h_x96.jpg)
Shikib Mehri
@shikibmehri
Followers: 393 · Following: 5K · Statuses: 133
MTS @ContextualAI | Previously @AmazonScience; PhD @LTIatCMU
Joined January 2018
We're clearly on the path to AGI, but is that all you need for real-world impact? Not even close. At @ContextualAI we are building specialized RAG agents that harness cutting-edge document understanding, retrieval, and grounded language modeling in a unified system. With our systems-first approach, we're moving beyond benchmarks and demos to deliver real enterprise success. We envision a future where every organization has RAG agents that seamlessly integrate with their environment, adapt to custom success metrics, and reason effectively across complex multimodal content - from unstructured documents to structured data. Today, with our platform's general availability, we're taking a major step toward making this vision reality. Check it out!
Today, we're excited to announce the general availability of the Contextual AI Platform. This is the first enterprise platform designed for building specialized RAG agents to support expert knowledge work. What is a specialized RAG agent? First, a general-purpose AI agent is one designed to automate simple daily tasks like scheduling a meeting or responding to an email. On the other hand, a specialized RAG agent is one designed to augment subject-matter experts performing complex domain-specific work. The Contextual AI Platform allows you to create these agents easily and achieve SOTA accuracy right out of the box. Check out what the Contextual AI Platform can do.
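To make the retrieve-then-ground-then-generate pattern described above concrete, here is a minimal sketch of that flow. Everything in it (the `Document` class, `retrieve`, `generate_grounded`, the toy corpus) is an illustrative placeholder, not the Contextual AI Platform API.

```python
# Minimal sketch of a retrieve-then-ground-then-generate pipeline.
# All names here are illustrative placeholders, not the Contextual AI Platform API.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    """Toy lexical retriever: rank documents by query-term overlap."""
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(terms & set(d.text.lower().split())))
    return scored[:k]

def generate_grounded(query: str, evidence: list[Document]) -> str:
    """Placeholder for a grounded language model call: the answer is constrained
    to the retrieved evidence (here we simply echo the evidence back)."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in evidence)
    return f"Answer to '{query}', grounded in:\n{context}"

if __name__ == "__main__":
    corpus = [Document("policy-1", "Expense reports are due within 30 days."),
              Document("policy-2", "Travel must be booked through the approved portal.")]
    query = "When are expense reports due?"
    print(generate_grounded(query, retrieve(query, corpus)))
```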
RT @ShuhaibMehri: Introducing Reference-Level Feedback: A new paradigm for using feedback to improve synthetic data!
RT @KarelDoostrlnck: APO-zero was designed to be a very simple alignment objective, the goal was never max performance. To my amazement, AP…
RT @rajistics: Deep dive into LLM evaluation using Natural Language Unit Tests. Sharing my notebook on how to systematically evaluate LLM…
RT @ContextualAI: Contextual's RAG Agents already outperform traditional RAG systems. But we know that accuracy is crucial for critical ent…
RT @sheshanshag: Note that runner ups are different and disjoint on each of these benchmarks. Seems like no-one apart from Contextual is…
SoTA on RAG-QA Arena, OmniDocBench, BEIR, BIRD-SQL, and on our internal customer evals. I joined Contextual a year ago inspired by the vision (both research + product). Since then, we've not only strengthened and expanded that vision, but built an exceptional team, established strong research foundations, and consistently tackled the field's hardest problems. Incredibly proud of the team for what we've accomplished and super excited for what's coming next!
Super excited to share something we've been working on for a long time: the @ContextualAI platform is now generally available! SOTA performance across the RAG stack, for each individual component as well as end-to-end.
RT @ContextualAI: Today, we're excited to announce the general availability of the Contextual AI Platform. This is the first enterprise pla…
This paper's primary influence is its terrible title, which deceives people unfamiliar/inexperienced with synthetic data: for some reason, "recursively generated" gets read as "indiscriminately generated". The paper clarifies this nuance, but I know several people who have been misled/hold incorrect assumptions about synthetic data because of it. Adding any inductive bias to the data generation process (e.g. a verifier, prompts, multi-step pipelines) makes the core claim in the title wrong. tbh, despite being a decently executed paper, the problem formulation of indiscriminate recursive data generation is very contrived.
"AI models collapse when trained on recursively generated data" was among the most influential AI papers of 2024 - don't miss it! Bookmark & download it below.

Interesting quotes:

"The development of LLMs is very involved and requires large quantities of training data. Yet, although current LLMs [2,4-6], including GPT-3, were trained on predominantly human-generated text, this may change. If the training data of most future models are also scraped from the web, then they will inevitably train on data produced by their predecessors. In this paper, we investigate what happens when text produced by, for example, a version of GPT forms most of the training dataset of following models. What happens to GPT generations GPT-{n} as n increases? We discover that indiscriminately learning from data produced by other models causes 'model collapse' - a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time"

"Our evaluation suggests a 'first mover advantage' when it comes to training models such as LLMs. In our work, we demonstrate that training on samples from another generative model can induce a distribution shift, which - over time - causes model collapse. This in turn causes the model to misperceive the underlying learning task. To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remain available over time. The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale. (...)"

Authors: Ilia Shumailov, Zakhar Shumaylov, Yiren (Aaron) Zhao, Nicolas Papernot, Ross Anderson & Yarin Gal

Link to the paper below.

To stay up to date with the latest developments in AI policy, compliance & regulation, including excellent research, join 44,400+ people who subscribe to my AI newsletter (link below).
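A toy numerical illustration of the two points above: the paper's "indiscriminate" recursive training loop, and the comment that keeping original data (or any other filter/inductive bias) in the loop changes the outcome. This loosely mirrors the single-Gaussian analysis in the paper; the sample sizes and mixing fraction are arbitrary choices made up for illustration, not a reproduction of the paper's experiments.

```python
# Toy analogue of recursive training: each "generation" fits a Gaussian to samples
# drawn from the previous generation's fit. Sample sizes and the mixing fraction
# below are arbitrary illustrative choices.
import random, statistics

def fit(samples):
    return statistics.mean(samples), statistics.stdev(samples)

def next_generation(mu, sigma, n=50, real_fraction=0.0, real_mu=0.0, real_sigma=1.0):
    """Train the next model on data sampled from the current one; optionally mix in
    'real' data, standing in for preserved access to the original distribution."""
    n_real = int(n * real_fraction)
    data = [random.gauss(mu, sigma) for _ in range(n - n_real)]
    data += [random.gauss(real_mu, real_sigma) for _ in range(n_real)]
    return fit(data)

random.seed(0)
for label, frac in [("indiscriminate", 0.0), ("half real data preserved", 0.5)]:
    mu, sigma = 0.0, 1.0
    for _ in range(30):
        mu, sigma = next_generation(mu, sigma, real_fraction=frac)
    # The purely synthetic chain tends to drift away from the true sigma = 1 over
    # generations, while mixing real data back in anchors the estimate.
    print(f"{label}: after 30 generations (mu, sigma) ≈ ({mu:.2f}, {sigma:.2f})")
```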
would you characterize system2->system1 distillation as moving the curve up or left? I agree that most capability advancements are not necessarily novel in that they could have been achieved with weaker models + large inference compute + rm/external verifier -- but imo compressing that into a single LM is still moving the curve up from the perspective of the LM (ie system1)
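For concreteness, a small sketch of the two regimes in the post above: extra inference compute plus an external verifier ("system 2"), versus distilling the verifier-selected outputs back into supervised targets for a single model ("system 1"). `generate` and `score` are placeholders for an LM and a reward model/verifier; nothing here refers to a specific system.

```python
# Sketch of the contrast above: best-of-n with an external verifier vs. distilling the
# selected outputs into a single model. generate() and score() are toy stand-ins.
import random

def generate(prompt: str) -> str:
    """Placeholder LM: returns a random candidate answer."""
    return f"{prompt} -> candidate {random.randint(0, 9)}"

def score(prompt: str, answer: str) -> float:
    """Placeholder verifier / reward model."""
    return random.random()

def system2_answer(prompt: str, n: int = 16) -> str:
    """Best-of-n with an external verifier: extra inference compute, same base model."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

def distillation_data(prompts: list[str], n: int = 16) -> list[tuple[str, str]]:
    """System 2 -> system 1: the verifier-selected outputs become supervised targets,
    so a single forward pass can later approximate the search-plus-verify procedure."""
    return [(p, system2_answer(p, n)) for p in prompts]

if __name__ == "__main__":
    random.seed(0)
    print(distillation_data(["2+2", "capital of France"]))
```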
@lateinteraction There'd be more value if the learned evaluation policies of different generators were drastically different -- but it seems like all current instruction-tuned LLMs have very correlated preference judgements/learn the same biases
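A minimal illustration of why correlated judges add little signal: when two judges' preference labels are highly correlated, raw agreement and chance-corrected agreement (Cohen's kappa) are both high, so the second judge mostly repeats the first. The labels below are made up purely to show the computation.

```python
# Hypothetical preference labels from two LLM judges over the same comparisons
# (1 = "response A preferred"). Made-up data, just to show the agreement math.
judge_a = [1, 0, 1, 1, 0, 1, 0, 1]
judge_b = [1, 0, 1, 1, 1, 1, 0, 1]

n = len(judge_a)
p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n   # raw agreement
p_a, p_b = sum(judge_a) / n, sum(judge_b) / n
p_e = p_a * p_b + (1 - p_a) * (1 - p_b)                   # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)                           # chance-corrected agreement
print(f"raw agreement: {p_o:.0%}, Cohen's kappa: {kappa:.2f}")
```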
RT @wordgrammer: DeepSeek's new model seems to have proven this. For the majority of startups, you cannot build a better datacenter than Mi…
The open challenge is defining high-signal rewards for all "computer tasks," and maybe optimizing within noisy environments. My prediction is that (at least for some time) everybody will be working on creating increasingly sophisticated/realistic environments ("gyms") and aiming to transfer learnings from simulation to challenging real-world tasks. Learning from real-world deployments/downstream rewards will matter too.
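A toy gym-style sketch of why the reward is the hard part: the environment and transition logic below are trivial, but the reward is a brittle exact-match check, which is exactly the "high-signal reward" gap mentioned above. The task, class, and method names are all made up for illustration and don't correspond to any real benchmark or library.

```python
# Toy gym-style environment for a "computer task": rename draft.txt to report.txt.
# The transition logic is easy; a high-signal reward for open-ended tasks is the hard part.
from dataclasses import dataclass, field

@dataclass
class DesktopEnv:
    """Toy environment: the agent must rename 'draft.txt' to 'report.txt'."""
    files: dict = field(default_factory=lambda: {"draft.txt": "q3 numbers"})

    def step(self, action: tuple):
        kind, *args = action
        if kind == "rename" and args[0] in self.files:
            self.files[args[1]] = self.files.pop(args[0])
        # The open problem is here: this reward is a brittle exact-match check.
        reward = 1.0 if "report.txt" in self.files else 0.0
        done = reward == 1.0
        return self.files, reward, done

env = DesktopEnv()
obs, reward, done = env.step(("rename", "draft.txt", "report.txt"))
print(obs, reward, done)
```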
RT @douwekiela: Why this is so exciting: SOTA on key benchmarks, top 10 on RewardBench (without cheating!), a big leap in alignment with hu…
my interpretation: the core intelligence/reasoning lies in having a good value function or action selector -- search is just the mechanical process of exploring the action space using that capability. So the search itself isn't the interesting part, it's just a tool for traversing possibilities
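A small sketch of that separation: the best-first loop below is generic boilerplate, and essentially all of the behavior comes from `value()` (and from how candidate actions are proposed). The state space and value function here are toy placeholders.

```python
# The search loop is mechanical; the "intelligence" sits in value() and in the
# action proposer. Both are toy placeholders here.
import heapq

def candidate_actions(state: int) -> list[int]:
    """Placeholder action proposer: from a state, try small increments."""
    return [state + 1, state + 2, state + 3]

def value(state: int) -> float:
    """Placeholder value function: prefer states close to a goal of 10."""
    return -abs(10 - state)

def best_first_search(start: int, budget: int = 20) -> int:
    """Generic best-first search: mechanically expands whatever value() likes best."""
    frontier = [(-value(start), start)]
    best = start
    for _ in range(budget):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)
        if value(state) > value(best):
            best = state
        for nxt in candidate_actions(state):
            heapq.heappush(frontier, (-value(nxt), nxt))
    return best

print(best_first_search(0))  # the trivial loop reaches the goal state 10
```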
Great work Milan! I'm excited about exploring new approaches to human-model collaboration in evaluation, and HumanRankEval seems like a promising step forward! Your approach of using an LM as a preference model and evaluating against human preference data is compelling (similar to what many early RLHF papers used as their primary evaluation method). Two interesting areas for future exploration could be: (1) testing HumanRankEval's robustness against false negatives (cases where LM generation is bad but likelihoods are good), and (2) investigating whether collecting human preferences over decomposed unit tests/criteria could make this even more impactful (given the known noise in human preference data and our results in improving IAA).
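A schematic of the likelihood-as-preference-model idea discussed above: score each candidate answer with the LM's log-likelihood and check how often the induced ranking matches human preferences. `log_likelihood` is a placeholder (a real implementation would sum token log-probs) and the data is made up; this is not the HumanRankEval implementation. A degenerate scorer like this one also shows how easily the false negatives mentioned in (1) can arise.

```python
# Rank candidate answers by a (placeholder) LM likelihood and measure agreement
# with human preferences. Not the HumanRankEval implementation; data is made up.
def log_likelihood(question: str, answer: str) -> float:
    """Placeholder: in practice, the sum of the LM's token log-probs for `answer` given `question`."""
    return -0.1 * len(answer)  # toy stand-in; a real LM score goes here

def lm_prefers_a(question: str, a: str, b: str) -> bool:
    return log_likelihood(question, a) > log_likelihood(question, b)

# Each item: (question, answer_a, answer_b, human_prefers_a)
data = [("What is 2+2?", "4", "five", True),
        ("Capital of France?", "Paris", "I am not sure, maybe London?", True)]
agreement = sum(lm_prefers_a(q, a, b) == h for q, a, b, h in data) / len(data)
print(f"agreement with human preferences: {agreement:.0%}")  # 100% with this toy scorer
```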
@ContextualAI Really excited to see how people use the LMUnit API! I break down three of my biggest takeaways from the paper in this thread:
One of our most exciting results: when humans evaluate LLM outputs using natural language unit tests instead of traditional preference judgments, inter-annotator agreement jumps from 71% to 86%! As LLMs continue to improve, we need more principled ways to identify flaws and measure quality. Breaking down the complex notion of "good responses" into explicit, testable criteria is crucial for this - and our results validate this across multiple studies. Throughout our paper we demonstrate this benefit of decomposition, but it's especially clear in the human annotation results. Being able to reliably measure and agree on model behavior will be key to systematic improvement in LLM capabilities.
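To make the decomposition concrete, here's an illustrative sketch: a response is checked against explicit natural-language unit tests and the per-test results are aggregated, instead of asking for one holistic preference judgment. `judge` is a stub standing in for an LLM scorer; this is not the actual LMUnit API.

```python
# Evaluate a response against explicit natural-language unit tests and aggregate
# the per-test results. judge() is a stub; this is not the actual LMUnit API.
response = "The capital of France is Paris, a city of about 2 million people."
unit_tests = [
    "Does the response directly answer the question?",
    "Is every factual claim in the response correct?",
    "Is the response free of irrelevant information?",
]

def judge(response: str, test: str) -> bool:
    """Placeholder judge: in practice an LLM scores the response against the test."""
    return True  # stub so the example runs end to end

results = {test: judge(response, test) for test in unit_tests}
score = sum(results.values()) / len(results)
print(results, f"aggregate: {score:.0%}")
```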
Paper: Blog: API: Thank you so much to my amazing co-authors: @rajan__vivek @w33lliam @JonSaadFalcon @nandita__naik @FranklinMatija @bertievidgen @apsdehal @douwekiela (looking for more co-authors: