If you are building an LLM application that uses RAG, poor retrieval can be detrimental to its UX. Phoenix now supports passing in your knowledge base as a corpus dataset so that you can inspect how your retrieval system is querying for relevant documents from your vector store.
Phoenix now supports DSPy! 🎉
With DSPy, you can declare the architecture of your LLM app and automatically generate prompts and fine-tune models to optimize for your specific task.
Try out the notebook: @lateinteraction #LLMs #AI
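For a sense of what that declaration looks like, here is a minimal DSPy sketch; the task and model name are illustrative, not from the announcement:

```python
import dspy

# Configure a language model (assumes an OpenAI API key in the environment).
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

# Declare *what* the module should do; DSPy generates the prompt for you.
class AnswerQuestion(dspy.Signature):
    """Answer the question in one concise sentence."""
    question = dspy.InputField()
    answer = dspy.OutputField()

predictor = dspy.Predict(AnswerQuestion)
print(predictor(question="What does RAG stand for?").answer)
```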
Webinar with @llama_index! @jerryjliu0 shows us the critical components across the data lifecycle that power RAG: ingest, index, query 🪄 And with the new OpenInferenceCallback, you can easily troubleshoot search and retrieval with @ArizePhoenix! 🔮
Lots of requests for richer observability in DSPy. In March, @mikeldking & I are holding a DSPy <> @arizeai meetup in SF to show you how to do that with the @ArizePhoenix-DSPy integration. Video by @axiomofjoy. Good chance to show something cool with DSPy. What would you like to see?
LLM frameworks are game-changing, but the resulting abstractions can be hard to debug. Phoenix now enables you to trace through the execution of your LLM application so you can understand its internals and troubleshoot problems related to things like retrieval and tool execution!
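As a concrete example, here is a hedged sketch of turning on Phoenix tracing for a LlamaIndex app; the one-line global handler shown is one integration path, and its import location moves across llama_index versions:

```python
import phoenix as px
from llama_index import set_global_handler  # llama_index.core in newer versions

px.launch_app()                      # start the local Phoenix collector + UI
set_global_handler("arize_phoenix")  # route LlamaIndex spans to Phoenix

# Build and query your index as usual; retrieval and LLM spans
# now show up in the Phoenix trace view.
```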
📚 @LangChainAI @pinecone @arizeai Workshop on troubleshooting search and retrieval. Vector DBs and agent frameworks are the cornerstone of connecting LLMs to private data.
🙏 @hwchase17 @UnstructuredIO @trychroma on the future of RAG. From time-sensitive retrieval to productionizing large vector stores, nothing was off the table.
💡 Try all the new approaches! If you are using RAG, you are on the cutting edge.
@CShorten30 DSPy 🤝 @weaviate_io 🤝 Us!
For anyone interested, Phoenix is fully open source and fully private 🔐 Your data is your own!
Great work @CShorten30! Let us know if there is more visibility you need!
Before you try advanced retrieval techniques (query decomposition, reranking, hierarchical chunking, etc.) to improve your RAG pipeline, you should implement evals.
What are the different types of evals?
✅ E2E evals (generated responses)
✅ Retrieval evals (retrieved chunks); minimal sketch below 👇
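A hedged sketch of a retrieval eval with phoenix.evals; older releases exposed these names under phoenix.experimental.evals, and the sample row is invented:

```python
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    llm_classify,
)

# One (query, retrieved chunk) pair; real runs use your exported traces.
df = pd.DataFrame({
    "input": ["What is our refund policy?"],
    "reference": ["Refunds are issued within 30 days of purchase."],
})

relevance_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4"),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),  # e.g. relevant/unrelated
)
```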
Excited to see what everyone at the @MistralAI hackathon with @cerebral_valley has been cooking up 👩‍🍳 If you need to debug your Mistral calls, traces might come in super handy. It's as simple as a few lines of code in your app 👇
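Roughly those few lines, sketched under the assumption of the openinference-instrumentation-mistralai package; the exact wiring differs a bit across Phoenix versions:

```python
import phoenix as px
from openinference.instrumentation.mistralai import MistralAIInstrumentor

px.launch_app()                       # local Phoenix UI, typically on :6006
MistralAIInstrumentor().instrument()  # mistralai SDK calls now emit spans

# ... make your usual mistralai client calls and watch the traces land.
```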
Phoenix is public! Unlock new embedding troubleshooting workflows right in your notebook. Maintained by the #MLOps team at @arizeai. Designed for rapid iteration on your #LLMs, #ComputerVision, and #NLP models.
Huge S/O to @traviscline, who helped add audio embeddings support to @ArizePhoenix and won the @AGIHouseSF hackathon! Being able to explore your samples using UMAP and HDBSCAN is not just fun, it's crazy useful for sample discovery 🎧🎶🎼🥁🎸🎹🎺🎻
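The same recipe works standalone; a minimal sketch with the umap-learn and hdbscan packages, with random vectors standing in for real audio embeddings:

```python
import numpy as np
import umap
import hdbscan

embeddings = np.random.rand(500, 768)  # stand-in for real audio embeddings

# Project to low dimensions, then find density-based clusters.
projected = umap.UMAP(n_components=3).fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(projected)

print(f"{labels.max() + 1} clusters, {(labels == -1).sum()} noise points")
```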
For the full release notes, check out our GitHub. And if you want to see more retrieval troubleshooting tools, give us a ⭐️ and drop us an issue! We'd love to hear from you.
Phoenix + @MistralAI = OSS 🫶
Phoenix now supports Mistral models as well as Mistral instrumentation!
😻 arize-phoenix-evals: use Mistral for evals and synthetic data
😻 openinference-instrumentation-mistralai: native instrumentation for the mistralai SDK
Le Chat! 😻
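And on the evals side, a hedged sketch; MistralAIModel is named after the arize-phoenix-evals release above, but its exact kwargs may differ:

```python
from phoenix.evals import (
    HALLUCINATION_PROMPT_TEMPLATE,
    MistralAIModel,
    llm_classify,
)

judge = MistralAIModel(model="mistral-large-latest")  # assumes MISTRAL_API_KEY

# evals_df = llm_classify(
#     dataframe=traces_df,  # your exported spans
#     model=judge,
#     template=HALLUCINATION_PROMPT_TEMPLATE,
#     rails=["factual", "hallucinated"],
# )
```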
Example of an @AnthropicAI #LLM evals dataset loaded into Phoenix. The clear vector space of the #LLM responses via UMAP is 😍 and the answer_matches_behavior and label_confidence labels directly correlate to the HDBSCAN clusters... 🪄
Not only that, it overlays the retrieval connections within the point cloud so you can see exactly which vector store clusters your retriever is pulling data from. For all the details, check out our notebooks that cover search and retrieval!
Are you fine-tuning your GPT models? Crazy times that fine-tuning is just an API call away. Great insights on how to fine-tune and evaluate LLMs using evals.
LLMOps in a notebook is a game-changer - it lets you leverage the ecosystem. Take Ragas, a RAG assessment evals library. By combining @ArizePhoenix with Ragas evals for answer relevancy, context relevancy, and faithfulness, you can QA your app in ways never before possible.
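A hedged sketch of that Ragas run with the three metrics above; metric module paths move around between Ragas versions, and the sample row is invented:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_relevancy, faithfulness

ds = Dataset.from_dict({
    "question": ["What is our refund policy?"],
    "answer": ["Refunds are issued within 30 days."],
    "contexts": [["Refund policy: 30 days, no questions asked."]],
})

# Each metric scores every row; the result converts cleanly to a DataFrame.
print(evaluate(ds, metrics=[answer_relevancy, context_relevancy, faithfulness]))
```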
🚀 LLM Evals now crank! SIGNIFICANT speedups via concurrency and careful management of token limits; we've seen typically 5x faster runs.
When doing EDD (eval-driven development), you need to move fast so you can iterate. Crank up the concurrency and see the evals fly!
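Not the Phoenix internals, just an illustration of the underlying idea: bound the number of in-flight judge calls with a semaphore so you get parallelism without blowing through rate or token limits.

```python
import asyncio

async def run_evals_concurrently(rows, eval_fn, concurrency=20):
    sem = asyncio.Semaphore(concurrency)

    async def one(row):
        async with sem:  # at most `concurrency` calls in flight at once
            return await eval_fn(row)

    return await asyncio.gather(*(one(r) for r in rows))
```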
🧠 We have a great workshop coming up with @llama_index's very own @jerryjliu0! It will cover:
☑️ How to build a RAG-powered chatbot using @llama_index
☑️ How to use @ArizePhoenix to inspect and analyze retrievals
✍️ Sign up here!
#embeddings are a powerful tool in EDA, even if your model doesn't use them! 🤯 @arizeai provides an AutoEmbeddings package that lets you generate embeddings on your tabular data. You might be surprised what insights you uncover 🔮
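A heavily hedged sketch of what that looks like; the class, enum, and method names below are reconstructed from memory of the arize SDK and should be checked against the docs:

```python
import pandas as pd
from arize.pandas.embeddings import EmbeddingGenerator, UseCases  # assumed path

df = pd.DataFrame({"age": [34, 51], "plan": ["pro", "free"], "churned": [0, 1]})

# Turn tabular rows into embedding vectors with a pretrained language model.
generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.STRUCTURED.TABULAR_EMBEDDINGS,  # assumed enum value
    model_name="distilbert-base-uncased",
)
df["embedding"] = generator.generate_embeddings(df)  # assumed signature
```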
Adjusting your chunk size is one of the first things you should tackle when improving your RAG app - but it's not always intuitive!
⚠️ More chunks ≠ better (lost-in-the-middle problems / context overflows)
⚠️ Reranking retrieved chunks doesn't necessarily improve results, in…
We are publishing pre-releases so you can live on the bleeding edge! In v0.0.23rc0 we've added HDBSCAN tuning so you can get your clusters just right! Clusters provide an "auto-lasso", helping you identify groups of #embeddings that require attention. Check it out!
Phoenix automatically computes the distance between your queries and document embeddings (query distance), helping you quickly identify slices of your data that represent user queries that are not contained in your vector store.
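For intuition, query distance boils down to something like this numpy sketch (a conceptual illustration, not the Phoenix implementation): the farther a query sits from its nearest document, the less likely the knowledge base covers it.

```python
import numpy as np

def query_distance(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine distance from each query to its closest document embedding."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = q @ d.T                 # pairwise cosine similarities
    return 1.0 - sims.max(axis=1)  # high value = query far from the corpus
```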
New Phoenix + Ragas cookbook by @Shahules786, @mikeldking, and @axiomofjoy dives into using Ragas for synthetic test generation and evaluation; Phoenix for tracing, visualization, and cluster analysis; and @llama_index for building RAG pipelines.
😍 Sneak peek!
We've built composable instrumentation modules under the OpenInference moniker and have published images for the server so that you can run Phoenix as a sidecar to any LLM application. Here's what Phoenix tracing looks like with @llama_index's create-llama.
✨ 0.0.45 🚄
↕️ Reranker spans for models like @cohere rerank! View how documents are getting reranked for RAG use-cases!
🔎 Search and filtering. Find problematic traces with ease.
🧠 @OpenAI gpt-3.5-turbo-instruct support for evals
🖨️ Verbose mode for eval runs!
Phoenix LLM Traces get a nice shoutout here from the CTO of @databricks! If you are looking for a free OSS tracing solution that integrates with evaluation frameworks, Phoenix is a great fit 😉
Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models.
AlphaCode, ChatGPT+, Gemini are examples.
In this post, we discuss why this is and emerging research on designing & optimizing such systems.
"The downside of that is you have zero visibility into what's actually going on under the hood."
This is exactly why we built tracing - you get full transparency into:
🟩 documents
🟩 score (cosine / euclidean distance)
🟩 metadata
Retrieval doesn't have to be a black box 📦
Why you should build RAG from scratch 🧑💻
@llama_index is the #1 framework for RAG pipelines with >600,000 downloads every month, and yet its creator, @jerryjliu0, is encouraging people to reimplement it from scratch at least once. Why?
Not enough people…
Key features:
✅ LLM input/output prompt template tracking
✅ Token usage and timing
✅ Full retrieval observability/visualizations
✅ Full agent support
✅ Export traces for evals
Phoenix 2.1 now has live-updating evaluations and retrieval metrics! Evals are critical to ensuring that your application is benchmarked during pre-production AND production! ☑️ Don't let hallucinations get out of hand 😵💫
Introducing a Short Course Series on Advanced RAG Orchestration 🪄🤖
As an AI engineer, it can be daunting to dive into how to build high-quality, advanced RAG yourself - there are literally hundreds of options at every stage of the pipeline.
Easily stitch together custom modules
🔍 Cluster Search on Embeddings!
Search by keywords in text to get a better understanding of the contents of clusters, identify queries that contain certain words, or even search by ID to find that one embedding you are looking for! Happy troubleshooting! 🌌
If data privacy is paramount for your LLM use-case, let's talk. arize-phoenix runs entirely locally and can be leveraged so that you control your observability and evaluation data end-to-end. Privacy first 🔏
@OpenAI @Meta @AnthropicAI @Google
5. Teams are increasingly concerned about accuracy of responses and hallucinations.
This likely points to the seriousness of adoption and need for tools around governance and LLM observability.
Phoenix escapes the notebook! We have a docker image version to try out! Perfect for a small docker compose with your LLM app or to just have running so you can persist traces beyond the lifecycle of your notebook.
📆 RAG time 🎶
How do you know if you're using the right chunk size? The right embedding model? Does poor retrieval correlate with hallucinations? @AstronomerAmber and @mikeldking will be sharing how to benchmark RAG! Let's make it a fun one!
The recent @huggingface zephyr-7b-alpha model outperforms ChatLlama 70B 😮 We immediately tested it on @llama_index easy-to-hard tasks 🧪 We found that it is the ONLY open 7B model atm that does well on advanced RAG/agentic tasks 🔥👇
Colab:
Phoenix for retrieval-augmented generation:
- Automatically identify areas of user interest that are not answered by the knowledge base
- Surface poorly performing queries based on user feedback and LLM-assisted ranking metrics (e.g., precision@k; toy computation below)
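For reference, precision@k is just the fraction of the top-k retrieved documents judged relevant; a toy illustration:

```python
def precision_at_k(relevance: list[int], k: int) -> float:
    """relevance: 0/1 judgments for retrieved docs, in ranked order."""
    return sum(relevance[:k]) / k

print(precision_at_k([1, 0, 1, 1, 0], k=3))  # 2 of top 3 relevant -> 0.67
```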
Advanced QA over a lot of Tabular Data (combine text-to-SQL with RAG) 📊🪄
Our brand-new mini course 🧑🏫 is a comprehensive overview of how you can build simple-to-advanced query pipelines from scratch, by composing components into complex DAGs. Presenting this in three levels:
Tomorrow, 10am PST: Learn how to create an LLM eval from scratch.
Lessons from the trenches with Lou Kratz of @Bazaarvoice and @jason_lopatecki. ⛏️
Register:
✨Excited to announce evaluations on spans and documents! This means you can evaluate your application as it runs!
@aparnadhinak explains how LLM evals and explanations can be used in conjunction with traces.
✨ Phoenix 3.0 ✨Phoenix is now a fully OpenTelemetry compliant collector that natively renders rich LLM application data via OpenInference, a set of instrumentations and conventions around observing LLM applications.
Evaluate #LLMs using @OpenAI evals. @ArizePhoenix can use evals to identify #embedding clusters of your LLM application that are performing badly. These clusters are ideal for prompt iteration or fine-tuning.
@llama_index @LangChainAI arize-phoenix is entirely open-source, built on open standards, and runs entirely in the privacy of your Python notebook. Try it out today and let us know what you think:
Just announced - Phoenix: Open source #MLObservability in a notebook! Uncover insights, surface problems, monitor and fine-tune your generative LLM, CV, and tabular models. Get started:
Learn more here:
As you move your LLM application to production, it’s easy to neglect compliance with things like PII. Plan ahead and safeguard your users from the get go 🔐
✨v0.0.20 ✨
We have new support for tabular data! Quickly identify drift and data quality issues in your features using the new dimension details views!
Semantic search is a very effective way to search documents with a query.
But what exactly does the word “semantic” mean here?
Probably the best way to understand semantic search is to understand what is *not* semantic search.
Let’s take a look. (Thread)
Great tutorial! Data freshness is something you want to monitor. Are your query embeddings close to the data stored in your vector store? Are there clusters of query embeddings that are underperforming? Time for a refresh! ♻️ #llmops
📬 Retrieval augmentation helps us:
- Reduce hallucinations
- Answer Qs on internal / niche datasets
- Cite sources and help users trust LLM output
The future of LLMs will include managed long-term memory like that described here — it's worth learning! 🧑🎓
🔥🐦 #PhoenixTipOfTheDay
📖 Phoenix runs in your notebook...literally
🚀 One line of code to launch the app
👀 One line of code to view the app
🪟 You can also open Phoenix in a new browser tab or window if you want more real estate
#PhoenixProTip #DataScience 👇
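Those two one-liners, as they would appear in a notebook cell (a minimal sketch):

```python
import phoenix as px

session = px.launch_app()  # one line to launch the app
session.view()             # one line to view it inline in the notebook

# session.url gives you a link to open Phoenix in a full browser tab.
```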
23 Open Source AI Libraries for 2023 by @yujian_tang. According to GitHub stars, these are listed in order of popularity as of 5/11/23.
Auto-GPT — an open-source LLM framework for autonomous agents
OpenCV — an open source computer vision tool
PyTorch — an open source machine learning framework…
In our experience, reliably parsing LLM output is half the battle when building #LLM applications, especially agents. Excited to check this out @lmqllang.
🐶 LLMs are powerful tools in content generation. However, it can be hard to generate structured data with them. @lmqllang can help. Given a simple template query + constraints, LMQL reliably generates tabular data, which seamlessly translates to a schema-safe DataFrame (hedged sketch below): #lmql
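A hedged LMQL sketch using the decorator syntax from the LMQL docs; the field names and constraints are illustrative, and query syntax shifts between versions:

```python
import lmql

@lmql.query
def make_row():
    '''lmql
    "Name: [NAME], Age: [AGE]\n" where STOPS_AT(NAME, ",") and INT(AGE)
    return {"name": NAME, "age": AGE}
    '''

# rows = [make_row() for _ in range(10)]
# pd.DataFrame(rows) then has a guaranteed str/int schema.
```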
Next week we're kicking off our #llm Evaluation Essentials series with @jerryjliu0 of @llama_index! Join us Oct 3 for Benchmarking and Analyzing Retrieval Approaches. A must-attend for #AI & ML engineers or anyone seeking production excellence. 🤩
"It's relatively easy to stand up a demo of an LLM workflow…but developing further toward a viable and robust application is another matter."
The DS team at @klickhealth leverages Phoenix for LLM observability behind the scenes:
[2] OpenInference (@arize_ai) is a standard for capturing/storing AI model inferences. It allows you to experiment with and visualize LLM apps using observability tools like @arize_phoenix. Check out the notebook here!
“Traditionally, data is added to an ML model by training the model on that specific data. This leads many to jump to the conclusion that fine tuning LLMs is what is needed. Let’s bust that myth.”
This talk from @geoffreyhinton is worth a watch. He warns of the dangers of AI that exceeds human intelligence but isn't aligned with human goals. Let's build the guardrails to make sure #AI systems are #observable and #aligned.
@axiomofjoy Instrumentation for @OpenAI is a critical building block, since the OpenAI SDK powers so many LLM applications and tasks. For the full details, check out the docs on how you can not only trace your LLM calls but evaluate them as well.
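A minimal sketch of that instrumentation, assuming the openinference-instrumentation-openai package; the collector wiring varies across Phoenix versions:

```python
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()                    # local collector + trace UI
OpenAIInstrumentor().instrument()  # openai SDK calls now emit LLM spans
```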
Evaluating your LLM system (RAG, agents) is super important 🧪, but what's the proper methodology for doing so?
There are two general strategies for evaluating your LLM app:
1️⃣ End-to-end ♾️: First set up the entire pipeline, and then evaluate text inputs/text outputs (don't…
Xander Song, dev advocate at @arizeai, gives us insight into how @ArizePhoenix can be used to troubleshoot and improve your #llm and #computervision models. @ArizePhoenix runs INSIDE your notebook, providing a DX that you might not have known was possible.
Today's the day!
Phoenix is one of 24 open source projects featured this month to help people like you get into open source via simple tutorials and guides.
It's day 11 of the Open Source Advent! Today's featured project is #Phoenix by @arizeai!
Get all the contest details as we count down to the holidays.
Contest Details:
Contest Discord:
#OSSAdvent2023
“Looking at LLMs as chatbots is the same as looking at early computers as calculators. We're seeing an emergence of a whole new computing paradigm, and it is very early.”
🤯
With many 🧩 dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System. E.g. today it orchestrates:
- Input & Output across modalities (text, audio, vision)
- Code interpreter, ability to write & run…
@llama_index @LangChainAI Also new ✨ LLM Evals for hallucinations, relevancy, toxicity, code generation, summarization, and classification. Evaluate the performance of different LLM tasks by leveraging the powerful reasoning skills of an LLM 🧪
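A hedged sketch of running a couple of those evaluators together; the evaluator and function names follow phoenix.evals (phoenix.experimental.evals in older releases), and queries_df stands in for your exported spans:

```python
from phoenix.evals import (
    HallucinationEvaluator,
    ToxicityEvaluator,
    OpenAIModel,
    run_evals,
)

judge = OpenAIModel(model="gpt-4")

# hallucination_df, toxicity_df = run_evals(
#     dataframe=queries_df,  # your exported spans
#     evaluators=[HallucinationEvaluator(judge), ToxicityEvaluator(judge)],
#     provide_explanation=True,
# )
```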
Quality assurance is important when you're designing a production-ready solution for semantic search and vector storage.
Full open source workflow for embedding a large volume of data, uploading to a vector DB, running similarity searches, and monitoring it in production. Learn…
Vector DBs form the 🧠s of your LLM-powered applications. It's critical to understand how they leverage #embeddings and to monitor how well they encode semantics. The quality of your embeddings directly translates to the quality of your #LLM app's information retrieval. #LLMOps
Vector databases & embeddings are the current hot thing in AI.
Pinecone, a vector DB company, just raised $100M at ~1b valuation.
Shopify, Brex, Hubspot and others use them for their AI apps
But what are they, how do they work and why are they SO crucial in AI? Let's find out
Metadata visibility! It’s the little things 😁 Also the cosine/Euclidean distance as well as the embeddings are captured 🌌 there’s never been a tool to give you this much visibility into RAG 🔭
Making sense of embeddings can be overwhelming. With density-based clustering, you can start to reason about your embeddings' higher-dimensional representation in meaningful groups!
Thanks @cerebral_valley for this deep dive 🤿 @ArizePhoenix has "become almost like our secret weapon of having a tool that helps developers earlier in their journey, and then also throughout the journey of their application."
@aparnadhinak Our Deep Dive with @arizeai is now live!
ARIZE IS EXPANDING THE FIELD OF AI OBSERVABILITY 📈
Co-founder and CPO @aparnadhinak walks us through LLM observability, Phoenix, and their goals for 2024... Link below 👇