I'm excited to announce a new product I've been working on called
@braintrustdata
.
Braintrust helps innovative companies and developers ship higher quality AI products by making it easy to run evals.
🔉 on
OpenAI announced: Reproducible outputs
💥 This is a game changer for developers! It's now possible to actually evaluate and unit test your LLM apps! You can know that when a test passes locally, it'll pass for your team and in CI/CD.
1/3🧵
Over the past year, it's been an absolute joy getting to know
@mikeknoop
and build Braintrust with our friends at
@zapier
.
We worked together on a blog post that captures their workflow. If you want to build a world-class AI product, this is for you!
We evaluated Google's text-bison LLM against OpenAI's gpt-3.5-turbo on a SQL generation task in Braintrust.
Here's how they performed:
- finetuned-gpt3.5: 92.4%
- finetuned-bison: 84.2%
- gpt3.5: 78.7%
- bison: 74.8%
(We finetuned both models too!) Dig into the evals below:
.
@Retool
surveyed how companies are adopting AI. Some of the top challenges: model output accuracy, hallucinations, and prompt engineering.
Braintrust helps you solve these challenges: run evaluations, visualize and inspect your results, and experiment with prompts quickly.
Now, we can have much more fun evaluating our AI apps 🥳
Check out our docs on evaluating your AI app with Braintrust. It's easy to integrate Braintrust evals into your existing CI/CD workflow (links below)
We are super excited to partner with Liucija, Senior Data Scientist on the AI team at
@Hostinger
, as they work towards leveraging AI for use cases like customer support, website building, and more.
If you'd like to learn how Braintrust helped Hostinger:
- 3x the number of AI
Super fun to host an AIUX Demo Night last week at
@eladgil
's office! We are super excited about the future of UI/UX w/ AI and loved seeing what talented people are building. Thank you to everyone who came out and a special shoutout to our demoers 🙂
If you’re interested in coming
LlamaIndex just released Llama Datasets so you can easily benchmark RAG pipelines.
We contributed a help desk dataset with Coda so you can easily benchmark chat Q&A and support use cases.
Check it out on Llamahub
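A sketch of pulling it down with llama_index (the dataset class name here is an assumption; check LlamaHub for the exact one):

```python
from llama_index.llama_dataset import download_llama_dataset

# Dataset class name assumed from LlamaHub -- double-check it there.
rag_dataset, documents = download_llama_dataset(
    "BraintrustCodaHelpDeskDataset", "./data"
)
print(rag_dataset.to_pandas().head())
```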
The AI app development journey:
1. Start with a prototype and manually test
2. Get tired of manually testing
3. Evaluations enlightenment: add evaluations to your code
4. ???
5. App is in production. Users rave about your app
Braintrust makes it easy to evaluate your AI code.
🤩 New feature: text blocks in the playground!
These blocks just return a constant or variable value without any LLM call.
This makes it easy to:
- debug your prompts
- mock API responses and vectorDB calls
Don't get stuck manually inputting test cases into your LLM app after every prompt change.
Braintrust makes it easy to automatically evaluate and test your LLM apps.
We're hiring engineers :)
Do you love:
* building visualizations on text, images, and numbers that (re-)render in <100ms?
* searching/grouping billions of rows of semistructured text-heavy data in <200ms?
* grinding away LLM latency by any means necessary?
If so, LMK
👎 Before:
- your app generates different outputs on every test run
- if you use LLMs to grade outputs, those grades are also random on every run
👍 Now w/ reproducible outputs:
- your app generates consistent outputs even with temperature != 0
- your model-graded evals are consistent
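A minimal sketch of how this works, assuming the OpenAI Python SDK and a model that supports `seed` (the model name here is just an example):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Passing the same `seed` with otherwise identical parameters makes
# completions best-effort deterministic, even at temperature > 0.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",  # example model that supports `seed`
    seed=42,
    temperature=0.7,
    messages=[{"role": "user", "content": "Write a haiku about unit tests."}],
)
print(response.choices[0].message.content)
# If `system_fingerprint` changes between runs, the backend config
# changed and determinism is no longer guaranteed.
print(response.system_fingerprint)
```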
Which LLM is the best at summarizing GitHub issues?
We informally tested to find:
GPT-4 > Mistral 7B > Claude 2.1 > GPT-3.5
It's easy to run evaluations with Braintrust using our eval libraries and AI proxy.
Check out the code below:
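Here's a hedged sketch of the proxy side: the endpoint URL and the non-OpenAI model slugs are assumptions from the docs, so double-check them.

```python
from openai import OpenAI

# Point the standard OpenAI client at the Braintrust AI proxy
# (URL assumed from the docs -- verify the current endpoint).
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key="YOUR_BRAINTRUST_API_KEY",
)

issue = "Bug: clicking 'Save' twice creates duplicate records ..."

# One OpenAI-style code path, many providers: only the model name changes.
for model in ["gpt-4", "claude-2.1", "mistral-7b-instruct", "gpt-3.5-turbo"]:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize this GitHub issue: {issue}"}],
    )
    print(model, "->", reply.choices[0].message.content)
```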
⏰ We added duration stats to experiments!
See which test cases were faster or took longer.
There's a tradeoff between speed and quality. Use Braintrust to help you find the optimal balance 😇.
🎧🍌 New
@ThePeelPod
with
@EladGil
Stream the full episode here on X or links below
Timestamps:
03:46 Building cool monuments
09:12 Fixing education
16:38 Why AI is underhyped
19:02 Four trends to watch in AI
19:55 Why there aren’t large biotech companies
23:21 The current state
The LLM App Stack by a16z.
Validation is the most crucial step in building reliable, high-quality AI apps.
Braintrust helps you integrate evals to rapidly ship reliable AI.
😍 It's now so easy to use variables in our Playground.
We got tired of editing raw JSON so we upgraded our UI to support variable/object inputs better.
Simplify your evaluation scripts with Braintrust.
Just define 3 functions: data, task, and scores.
We do all the tedious optimizations like parallelizing requests for you.
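A minimal sketch of the Python shape (`generate_sql` is a stand-in for your app's code):

```python
from braintrust import Eval
from autoevals import Levenshtein

def generate_sql(question: str) -> str:
    # Stand-in for your app's LLM call.
    return "SELECT count(*) FROM users;"

Eval(
    "SQL Generator",  # project name in Braintrust
    data=lambda: [    # data: your test cases
        {
            "input": "How many users signed up last week?",
            "expected": "SELECT count(*) FROM users WHERE created_at > now() - interval '7 days';",
        }
    ],
    task=generate_sql,     # task: the code under test
    scores=[Levenshtein],  # scores: how outputs are graded
)
```

If the docs still match, saving this as eval_sql.py and running `braintrust eval eval_sql.py` should execute it and push results to the UI.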
@jerryjliu0
@llama_index
@FastAPI
Need an evaluation script for your AI app? We just opened a PR adding Braintrust to create-llama so you can test and evaluate the LLM calls in the templates.
We are excited to announce Braintrust is now SOC 2 Type II certified! We have supported enterprise customers from day 1, and achieving SOC 2 compliance is further validation of how seriously our team takes governance, risk, and compliance.
We are very excited Braintrust was featured in the inaugural Future 50! We are thankful for the recognition and can’t wait to continue supporting amazing AI teams.
I’m so excited to share the Future 50, a database of extraordinary, high-potential startups.
A few companies you'll learn about:
🚚 A trucking company doing $45M ARR
🧬 A biotech building "AWS for biology"
🇯🇵 Japan's answer to OpenAI
📈 A payments company that grew 20x in 18
Don't have an eval set already? Tired of writing scoring functions?
Our `autoevals` library makes it easy to grade your LLM outputs.
It includes prebuilt scoring functions:
• Model-based (using LLMs)
• Heuristic (e.g. Levenshtein distance)
• Statistical (e.g. BLEU)
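A quick sketch using the model-based `Factuality` grader (the shape follows the autoevals README; treat the exact metadata fields as assumptions):

```python
from autoevals.llm import Factuality

# Model-based grading: an LLM judges the output against the expected answer.
evaluator = Factuality()
result = evaluator(
    input="Which country has the highest population?",
    output="People's Republic of China",
    expected="China",
)
print(result.score)     # a value between 0 and 1
print(result.metadata)  # grader rationale, when available
```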
The Modern AI Stack by Menlo Ventures.
"Customers expect and deserve high-quality outputs, and enterprises are smart to be concerned that hallucinations could cause customers to lose trust."
Braintrust helps you integrate evals to rapidly ship AI without guesswork.
New cookbook on how to use the fantastic
@ragas_io
framework in Braintrust!
Among other things, the Braintrust implementation:
* Is available in both TS and Python
* Uses function calling (which substantially boosts performance)
* Is fully debuggable
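As a hedged sketch of the shape (not the cookbook's actual code), a Braintrust scorer is just a function that returns a score between 0 and 1:

```python
# Toy Ragas-style answer-similarity metric: token overlap between the
# model's output and the expected answer. A real implementation (like
# the cookbook's) would use an LLM judge instead of string overlap.
def answer_similarity(input, output, expected=None, **kwargs):
    out_tokens = set(output.split())
    exp_tokens = set((expected or "").split())
    if not exp_tokens:
        return None  # nothing to grade against
    return len(out_tokens & exp_tokens) / len(exp_tokens)
```

Drop it into `scores=[answer_similarity]` alongside any prebuilt autoevals scorers.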
It's so easy to manage test sets and datasets with Braintrust. We made a web UI for editing evals with your team so you don't need to make your own with Google Sheets/Retool. Our TS/Python library also...
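A hedged sketch of what that might look like in Python, assuming the SDK's `init_dataset` API:

```python
import braintrust

# Records inserted here and records edited in the web UI
# end up in the same shared dataset.
dataset = braintrust.init_dataset(project="SQL Generator", name="golden-questions")
dataset.insert(
    input="How many users signed up last week?",
    expected="SELECT count(*) FROM users WHERE created_at > now() - interval '7 days';",
)

for record in dataset:
    print(record["input"], "->", record["expected"])
```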