How many "r" in strawberry??
Today we're excited to announce a new way to catch and explain hallucinations from any LLM!
It’s been over a year since the release of GPT-4, but these models remain fundamentally unreliable and risky to use in high-stakes applications. The
cleanlab 2.0 is here!
cleanlab identifies errors in datasets, tracks dataset quality, trains reliable models with noisy data, and helps curate quality datasets… often in just one line of code.
Our team has been happy to contribute to the DataPerf effort, which advances
#DataCentricAI
as a scientific discipline for improving data!
🔗DataPerf paper:
🔗Baseline solution using cleanlab for the DataPerf speech challenge:
Announcing DataPerf, a set of new
#ML
challenges that ask participants to measure and validate data-centric algorithms and techniques to create and improve datasets using various benchmarks. Learn more and sign up →
CSA
#1
(Cleanlab Studio Audit): Issues in the Anthropic RLHF Dataset
It’s great to see orgs like
@AnthropicAI
making their RLHF dataset publicly available on
@huggingface
We found some issues in this data by quickly running it through Cleanlab Studio 🧵
🎉 Introducing Datalab — a linter for datasets.
Datalab detects all sorts of common real-world issues in your data including label errors, outliers, (near) duplicates, drift, etc.
ANNOUNCING --- CleanVision 🎉
In real-world
#computervision
projects, chances are you’ve dealt with issues in your data like these (detected in Caltech-256 by CleanVision):
OpenAI vs Data-Centric AI: which produces better models for predicting legal outcomes from court documents?
Using Cleanlab to increase the quality of training data from court cases produces a 14% error reduction in model predictions!
Blog ->
LLMs lead
#NLP
& continue to innovate language understanding, yet data annotation errors can hinder their performance.
Check out this
@kdnuggets
article that shows how to use Cleanlab Studio and data-centric AI to reduce errors in an
@OpenAI
LLM by 37%!
📢Cleanlab Studio finds issues in Stanford Cars Dataset (cars196)
This week we examine another famous
#computervision
dataset cited by over 1000 papers
@paperswithcode
!
We found some issues in this data by quickly running it through Cleanlab Studio 🧵
🚀 Today we announced our Series A raise of $25M backed by
@MenloVentures
, TQ,
@BainCapVC
, and
@databricks
to automate data curation and improve the reliability of the world's enterprise data and data-driven solutions.
🚀 Exciting news! We're thrilled to introduce NEW support for Multi-label Classification in Cleanlab Studio.
This feature unlocks endless possibilities for enhancing data quality in applications like image tagging, content moderation, and NLP. 📊🖼️📑
Nice to see Cleanlab featured among 11 need-to-know data exploration tools listed by
@odsc
, which hosts one of the largest gatherings of professional data scientists.
Other useful tools in this list include:
@YData_ai
,
@expectgreatdata
,
@metabase
New feature alert: Auto-train & deploy reliable ML models (more accurate than fine-tuned OpenAI LLMs) on messy real-world data — all in just a few clicks!
Think raw data -> serving reliable ML predictions requires tons of effort/code? Think again:
Want to analyze text data labeled by multiple annotators? 🙍♀️🤵👳⇶📊
Here's a nice article analyzing the Stanford Politeness dataset 📑 with our CROWDLAB method to estimate: consensus labels, which labels not to trust, and which annotators not to trust.
📢 New Blog Alert! 📷
Title: Enhancing Product Analytics and E-commerce with Cleanlab Studio
Say goodbye to data inconsistencies and hello to accurate product listings and analytics!
✨Out-of-Distribution Detection via Embeddings or Predictions ✨
We all know that *reliability* is the Achilles’ heel of modern ML, as predictions are often wrong for out-of-distribution (OOD) inputs.
Want to make your ML more trustworthy? Check this out
At the recent
@icmlconf
Andrew Ng was asked:
"There've been many Model-Centric breakthroughs that have excited and inspired the field. What are some of your favorite examples of Data-Centric breakthroughs or wins that will inspire the field?"
His answer started like this:
📢 New Blog Alert! 📢
📚 Title: "Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5"
🧵 In this blog,
@cmauck10
explores the importance of reliable data in model evaluation and shares insights on
@OpenAI
LLM prompt selection.
🚀Feature and Research!
Cleanlab can now detect mislabeled words in text datasets from
#NLP
applications like Entity Recognition.
Did we mention you only need one line of code to use our novel detection algorithms?
At Cleanlab we challenge the status quo that dealing with messy data to train real-world ML models has to be hard.
THREAD: Learn how cleanlab supports most data-centric AI tasks in just 1-3 lines of code with 4 examples.
Years ago, we showed the world it was possible to automatically detect label errors in classification datasets via machine learning.
Since that moment, folks have asked whether the same is possible for regression datasets? 🤔
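One rough way to picture the regression case: take predictions from any regression model and flag the examples whose given label sits unusually far from the prediction. The sketch below is only an illustration under that assumption. The function name, the median-based scaling, and the threshold of 3 are all hypothetical choices, not cleanlab's algorithm.

```python
import statistics


def rank_regression_label_issues(labels, predictions, threshold=3.0):
    """Return indices of suspicious labels, largest residual first.

    Hypothetical helper: flags an example when its absolute residual is more
    than `threshold` times the median absolute residual of the dataset.
    """
    residuals = [abs(y - p) for y, p in zip(labels, predictions)]
    # Fall back to 1.0 if the median residual is exactly zero.
    scale = statistics.median(residuals) or 1.0
    flagged = [i for i, r in enumerate(residuals) if r / scale > threshold]
    return sorted(flagged, key=lambda i: residuals[i], reverse=True)


labels = [1.0, 2.1, 2.9, 40.0, 5.2]       # given (possibly noisy) targets
predictions = [1.1, 2.0, 3.0, 4.0, 5.0]   # ideally out-of-sample predictions
suspects = rank_regression_label_issues(labels, predictions)
print(suspects)
```

Using out-of-sample predictions (for example from cross-validation) matters here, since a model that has memorized a bad label will not disagree with it.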
Cleanlab has been called “black magic” by some. We built Vizzy to demystify Cleanlab and explain how our algorithms automatically find label errors and out-of-distribution data, helping you train ML models on bad data as if you had error-free data:
CSA
#2
: Issues in Office Home Dataset
This week we examine a famous
#computervision
dataset cited by over 600 papers on
@paperswithcode
!
We found some issues in this data by quickly running it through Cleanlab Studio 🧵
🚀 Exciting news for Cleanlab Studio!
We're bringing next-gen advancements in deploying & improving foundation models and
#LLMs
. From auto-detecting data issues to deploying models seamlessly, we have got you covered!
🧵👇
🚀 Harness the Power of Robust Model Deployment with Cleanlab Studio! 🚀
Struggling with the complexities of
#MachineLearning
models and messy data? Discover how Cleanlab Studio makes deployment a breeze! 🌟
👇👇👇
Insightful article by
@_travistang
who improved a ResNet image classifier by 4 percentage points, using cleanlab to fix issues in the training dataset without changing the model at all. To further improve results, try outlier detection too:
`from cleanlab.outlier import OutOfDistribution`
🎉 cleanlab v2.3 is live!
Think the cleanlab library is just for dealing with label errors? Think again!
We just released major new features in cleanlab v2.3, and want this library to provide all the features needed to practice data-centric AI.
With v2.3, cleanlab can now:
TensorFlow is NOT compatible with Scikit-Learn, right?
Not anymore!
We're excited to introduce one-line wrappers for TensorFlow/Keras models that enable you to use TensorFlow models within scikit-learn workflows with features like Pipeline, GridSearch, and more!
MORE ->
What's the common thread across teams with the best AI models like
@OpenAI
,
@CohereAI
,
@StabilityAI
,
@Tesla
?
Relentless focus on *data curation* rather than inventing novel models or training algorithms.
Here are some lessons shared by these leaders (🧵...)
🤔Would you trust medical AI that’s been trained on pathology/radiology images where tumors/injuries were overlooked by data annotators or otherwise mislabeled?
❌Most image segmentation datasets today contain many errors because it is painstaking to annotate every pixel.
👇👇
Awesome to see Cleanlab used to win 4th place (out of 1165 teams 🏅🎖) in Kaggle competition:
Google - Isolated Sign Language Recognition
(which had a $100k prize 💰) ...🧵
🎉ANNOUNCING cleanlab v2.2 --- adds automatic error detection for image/text tagging and multi-label datasets.
When our users want features, we listen! cleanlab 2.2 is the answer to one of the most requested features by our users this year!
Using
@huggingface
transformers and want to find outliers in your document dataset 🔎📰 and understand them? This nice
@TDataScience
article by
@EliasSnorrason
describes an open-source python workflow to audit text datasets.
Also features BERTopic topic-modeling by
@MaartenGr
🏆 ANNOUNCING: Data-centric AI Competition 2023 Winners
1st Place Overall - $1,000: Giorgos P
1st Place, Text - $500: Stanislav G
Most Innovative, Text - $500: Revanth R
1st Place, Image - $500 (Tie): Aadarsh S
and Kieu Anh NT
Most Innovative, Image - $500: Martin D
Correcting issues in training data = vital to produce good models.
Correcting issues in test data = vital to produce good ML applications (need reliable evaluation).
For example: This article shows how noisy test data can negatively affect prompt selection for LLMs 🚨
With just ONE line of code from our open-source
#python
package, you can find label errors in any ML dataset using any compatible ML model.
Example:
➡ Dataset: amazon magazine reviews
➡ Trainable Data: review text
➡ Labels: star rating
👇 FOUND LABEL ERRORS BELOW 👇
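To give a feel for how a model's predicted probabilities can expose label errors, here is a minimal sketch of a per-class-threshold heuristic in the spirit of confident learning. It is not the cleanlab implementation: the function name and the toy data are hypothetical, and the real package handles many more cases.

```python
from collections import defaultdict


def find_likely_label_errors(labels, pred_probs):
    """Return indices of examples whose given label looks wrong, worst first.

    labels: list of int class ids (the given, possibly noisy labels).
    pred_probs: list of per-class probability lists from any classifier,
        ideally predicted out-of-sample (e.g. via cross-validation).
    """
    # Per-class threshold: the average probability the model assigns to
    # class c over the examples that are labeled c.
    sums, counts = defaultdict(float), defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = {c: sums[c] / counts[c] for c in counts}

    flagged = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Other classes that the model predicts at or above their threshold.
        confident = [c for c in thresholds if c != y and probs[c] >= thresholds[c]]
        if probs[y] < thresholds[y] and confident:
            flagged.append((probs[y], i))
    # Lowest probability for the given label comes first.
    return [i for _, i in sorted(flagged)]


labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model strongly prefers class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
suspects = find_likely_label_errors(labels, pred_probs)
print(suspects)
```

In the review example above, `labels` would be the star ratings and `pred_probs` would come from any text classifier trained on the review text.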
News! -- Announcing the
@databricks
<>
@CleanlabAI
partnership to bring automated data correction and ML model improvement for both structured and unstructured datasets to Databricks users via Cleanlab Studio.
One of the largest financial institutions in the world,
@bbva
, uses Cleanlab to improve their categorization of all financial transactions. Results achieved *without having to change their current model*:
➡️ Reduced labeling effort by 98%
➡️ Improved model accuracy by 28%
Annotify 🖋️, created at BBVA AI Factory, runs
#ActiveLearning
methods to reduce the number of labels needed, while Cleanlab 🧹 detects noise in those labels to reduce discrepancies.
“In my experience, the phrase ‘you are what you eat’ is exponentially more applicable to AI than to humans.”
This tweet by
@WirelessPuppet1
reflects how folks are finally realizing that AI is becoming data-centric. But what does the future hold?
⬇️⬇️⬇️
🎉 The cleanlab package just reached 6000 GitHub stars! 🌟
We’re immensely grateful for the support from our incredible
#community
! 🙌 🌍
We couldn’t be more thrilled to see so many dedicated contributors helping us build the best tools for
#DataCentricAI
.
⬇️⬇️⬇️⬇️
Before modeling a dataset, do you remember to check if it seems IID?
We present an automated check for IID violations that you can quickly run on any {numeric, image, text, audio, etc.} dataset!
Blog:
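The intuition behind such a check can be sketched in a few lines: if examples that are close in feature space also tend to be close in dataset order, the ordering carries information and the data is probably not IID. The code below is a loose illustration of that intuition only, not the published test. The function name and the ratio statistic are assumptions made for this example.

```python
def neighbor_index_gap_ratio(points, k=2):
    """Compare index gaps between feature-space neighbors to the gap expected
    for random pairs. Ratios well below 1 suggest the dataset order carries
    information (drift or non-IID sampling).

    points: list of numeric feature tuples, in dataset order.
    """
    n = len(points)
    gaps = []
    for i, p in enumerate(points):
        # Indices of all other points, sorted by squared Euclidean distance.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        gaps.extend(abs(i - j) for j in others[:k])
    mean_neighbor_gap = sum(gaps) / len(gaps)
    expected_random_gap = (n + 1) / 3  # mean |i - j| over distinct index pairs
    return mean_neighbor_gap / expected_random_gap


# A feature that drifts steadily with the dataset index.
drifting = [(float(i),) for i in range(30)]
# The same values in an order where feature neighbors are far apart in index.
mixed = [(float((7 * i) % 30),) for i in range(30)]

drift_ratio = neighbor_index_gap_ratio(drifting)
mixed_ratio = neighbor_index_gap_ratio(mixed)
print(drift_ratio, mixed_ratio)
```

A real check would turn this comparison into a statistical test with a p-value rather than eyeballing a ratio.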
📢 Cleanlab is excited to present 5 new papers at the
@icmlconf
workshop on Data-Centric Machine Learning
Read our latest research advancements in
#DataCentricAI
, which studies improvement of data for
#AI
as a systematic engineering discipline 🧵
👉
Did you know AI can provide automated quality assurance for your data annotation team? This can reduce the amount of data review work by 70% without any impact on the resulting dataset quality.
@rasbt
To view the mislabeled CIFAR-100 images we discovered, check out:
The same code we used to discover these errors can be easily run on your own datasets to ensure their quality:
Real-world
#data
can be riddled with label errors, outliers, and other issues that decrease model performance.
Our cleanlab
#python
package enables engineers to find these issues and train more robust
#MachineLearning
models.
Start cleaning your data:
🤔 How do you trust data analytics built on bad data?
Are you:
➡️ Finding mismatches between your analytics report and actual outcomes?
➡️ Doubting the reliability of how your dataset was collected?
You're not alone.
Collecting human-labeled data can be expensive💰and time-consuming⏳.
Wouldn't it be nice to have a way to determine which data is most informative to your model and should therefore be (re)labeled next?
⬇️⬇️⬇️
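A common baseline for picking what to label next is least-confidence sampling: ask for labels on the examples where the model's top predicted probability is lowest. The sketch below shows only that baseline, with a hypothetical function name. It covers unlabeled data and does not decide when an already labeled example deserves another annotation.

```python
def rank_for_labeling(pred_probs, top_n=2):
    """Return indices of the examples the model is least confident about."""
    uncertainty = [1.0 - max(probs) for probs in pred_probs]
    order = sorted(range(len(pred_probs)), key=lambda i: uncertainty[i], reverse=True)
    return order[:top_n]


# Predicted class probabilities for four unlabeled examples.
pred_probs = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]
to_label = rank_for_labeling(pred_probs)
print(to_label)
```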
🚀 Throwback to the Ultimate Data-Centric AI Challenge! 🚀
In case you missed it, earlier this year we teamed up with
@JoinMachinehack
for a unique two-part ML competition.
The focus? Improving training data with
#DataCentricAI
techniques.
We've delved into the resisc45 satellite imagery dataset using Cleanlab Studio. Here's what we found:
✅ 281 Labeling Issues
✅ 363 Outliers
✅ 20 Duplicates
Although super new, the CleanVision library was already used in intriguing ways by the
#Kaggle
community 👀
📣 Beyond raw images, CleanVision v0.2 now supports
@huggingface
and
@PyTorch
datasets!
Detect issues in your image data with CleanVision 🔮
We hit 5,000 ⭐’s on GitHub! 🎉
Thank you to those who contribute and participate in our community.
Our progress is not coincidental - we've been working really hard to expand our suite of data-centric AI tools.
Join the thousands of data scientists who use cleanlab!
Incredible work improving lives of ICU patients via real-time AI monitoring at
@UFHealth
Shands Hospital
"Our approach is based on the Cleanlab implementation of active learning for data annotation"
📄Read more quotes from their publication (...🧵)
We added support for
#Pandas
🐼 in cleanlab open source! Excited to share that cleanlab 2.1 (open-source) now finds label issues and trains robust ML models with most data formats --
#pytorch
/
#TensorFlow
/pandas datasets!!!
Cleanlab Studio + LLMs = 🔥♥️💰✅
Example
#1
It's clear here that the rejected completion answers the question of how to make a pinata whereas the chosen completion describes what a pinata is.
❗Whisking Away Errors: How Cleanlab Studio Served Up THOUSANDS of Fixes for the Food-101N Computer Vision Dataset❗
See the thousands of issues below 🧵👇
Transformers are extremely popular for modeling text nowadays: GPT-3, ChatGPT, Bard, PaLM, FLAN for conversational AI, T5 and BERT for text classification.
Utilize their power along with the broadly useful suite of features that come with scikit-learn.
Great to see Cleanlab methods are being taught as foundational tools for auditing data in the newest ML textbooks like:
"Deep Learning and XAI Techniques for Anomaly Detection"
by Cher Simon and
@jeffbarr
from
@awscloud
Quote from the book:
🚀🚀 Cleanlab just hit 4,000 stars on GitHub!!!
We've been working hard to build a suite of tools you need to improve the quality of your ML data. Thank you to everybody who contributes code, opens GitHub issues, and participates in our discussions.
Next stop, 5,000!
#dcai
You can now use:
- KerasWrapperModel
- KerasWrapperSequential
These only require changing ONE LINE OF CODE to make your existing TensorFlow/Keras model compatible with scikit-learn’s rich ecosystem!
Multi-Annotator analysis
- consensus label for each example
- quality score for each consensus label
- quality score for each annotator
(you get all of this in ONE line)
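To make those three outputs concrete, here is a minimal sketch that computes them with plain majority voting. This is a simple baseline, not the CROWDLAB method, which also uses a trained classifier's predictions. The function name and the toy annotations are hypothetical.

```python
from collections import Counter


def analyze_annotations(annotations):
    """Majority-vote analysis of multi-annotator labels.

    annotations: dict mapping annotator name -> list of labels, with None
    where that annotator did not label the example. Assumes every example
    has at least one label and every annotator labeled at least one example.
    """
    n_examples = len(next(iter(annotations.values())))
    consensus, consensus_quality = [], []
    for i in range(n_examples):
        votes = [labs[i] for labs in annotations.values() if labs[i] is not None]
        label, count = Counter(votes).most_common(1)[0]
        consensus.append(label)
        # Fraction of annotators who agree with the consensus label.
        consensus_quality.append(count / len(votes))

    annotator_quality = {}
    for name, labs in annotations.items():
        given = [(lab, consensus[i]) for i, lab in enumerate(labs) if lab is not None]
        # Fraction of this annotator's labels that match the consensus.
        annotator_quality[name] = sum(a == c for a, c in given) / len(given)
    return consensus, consensus_quality, annotator_quality


annotations = {
    "ann_a": ["polite", "rude", "polite", "rude"],
    "ann_b": ["polite", "rude", "polite", None],
    "ann_c": ["rude", "rude", "polite", "rude"],
}
consensus, consensus_quality, annotator_quality = analyze_annotations(annotations)
print(consensus, consensus_quality, annotator_quality)
```

Majority voting treats every annotator as equally reliable, which is exactly the weakness that model-assisted approaches aim to address.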
Accepted papers (pt 1):
1. Detecting Errors in Numerical Data via any Regression Model
2. ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data
3. Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors
...
Reliability is the Achilles’ heel of ML, as predictions are often wrong for out-of-distribution (OOD) inputs. Many complex methods were proposed to detect OOD image data, but our study found a very simple K-nearest-neighbors baseline is just as good:
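A K-nearest-neighbors baseline of this kind is short enough to sketch directly: score each new input by its average distance to the nearest training embeddings, and treat large scores as likely OOD. The code below is a minimal illustration, assuming embeddings are already computed. The function name and the toy 2D embeddings are hypothetical, and this is not the exact scoring used in the study.

```python
import math


def knn_ood_scores(train_embeddings, query_embeddings, k=3):
    """Score each query by its mean distance to its k nearest training points.

    Larger scores mean the query lies farther from the training data and is
    more likely out-of-distribution.
    """
    scores = []
    for q in query_embeddings:
        distances = sorted(math.dist(q, t) for t in train_embeddings)
        nearest = distances[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
queries = [(0.5, 0.4), (10.0, 10.0)]  # one typical input, one far away
scores = knn_ood_scores(train, queries, k=3)
print(scores)
```

In practice the embeddings would come from a trained network, and a cutoff on the score would be chosen from held-out in-distribution data.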
What a special conversation w/
@cgnorthcutt
!
@Dpbrinkm
and Vishnu Rachakonda thoroughly enjoyed the talk about Cleanlab: Labeled Datasets that Correct Themselves Automatically.
@CleanlabAI
is an open-source/SaaS company building the premier data-centric AI tools workflows for
CleanVision audits any image dataset to automatically detect common issues such as images that are blurry, under/over-exposed, oddly sized, or (near) duplicates, etc.
Use 3 lines of open-source Python code to discover what issues lurk in your data.
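As an illustration of what one such check involves, the sketch below flags under- and over-exposed images from their average brightness. It is a simplified stand-in, not CleanVision's code: the function name, the thresholds, and the tiny grayscale "images" are all assumptions made for this example.

```python
def find_exposure_issues(images, dark_below=0.15, light_above=0.85):
    """Flag images that are too dark or too light.

    images: dict mapping image name -> 2D list of grayscale pixel values
    in the range 0 to 255.
    """
    issues = {}
    for name, pixels in images.items():
        flat = [value for row in pixels for value in row]
        # Mean brightness scaled to the range 0.0 (black) to 1.0 (white).
        brightness = sum(flat) / (255 * len(flat))
        if brightness < dark_below:
            issues[name] = "dark"
        elif brightness > light_above:
            issues[name] = "light"
    return issues


images = {
    "night.png": [[5, 10], [0, 20]],
    "ok.png": [[100, 150], [120, 130]],
    "washed.png": [[250, 255], [240, 245]],
}
issues = find_exposure_issues(images)
print(issues)
```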
How will automated data curation help my team?
AI leaders like
@OpenAI
,
@Tesla
,
@Google
know producing the best AI models requires super high-quality data.
They invest massive $ and labor to curate datasets to a degree that most don't realize is required or cannot afford (...)
Our blogpost demonstrates how to automatically detect issues in synthetic customer reviews data generated from the
@gretel_ai
LLM synthetic data generator.
What are
#ChiefDataOfficer
's key priorities in
#GenerativeAI
?
That's what a recent
@awscloud
survey of 300+ CDOs aimed to find out.
Turns out Data Quality is the
#1
concern, because it's critical for reliable LLM applications that aren't mere demos.
CDOs also revealed ...
💰💼Cleanlab Studio saves law firm millions of dollars (and a month of litigation time)!
Since the
@VentureBeat
announcement of Cleanlab Studio for Enterprise, the initial traction is exciting and we’d like to share a legal/law application.
🧵⬇️
⚠️ Human errors like mislabeling & misinterpretation, especially by busy paralegals & lawyers, can compromise legal processes and waste millions of dollars.
What's the fix?
🧵👇
No-code platform to quickly produce reliable models from unreliable data 👉
Automatically find & fix issues in any {image, text, tabular, ...} dataset, and produce better models with a better version of your data.
🚀 Exciting News From the World of Data Quality!
📊 Large-scale datasets in enterprise analytics & ML used to be plagued by errors - meaning months of work & increased costs. Those days are OVER thanks to Cleanlab Studio!
👇👇👇
Without writing ANY code, you can quickly identify which synthetic data is unrealistic (i.e., low-quality) and which real data is underrepresented in the synthetic samples. Cleanlab Studio works seamlessly across synthetic text, image, and tabular datasets.
Example
#2
The chosen completion does not answer the prompt requesting a tater tot recipe, whereas the rejected completion asks a follow-up question directly related to the prompt.
Contest
#1
has finished!
Contest
#2
begins tomorrow where competitors will be trying to classify images in the presence of noisy data.
Don't miss out on this opportunity to test your
#datacentricai
skills!
Curious about using
#datacentricai
in
#kaggle
competitions?
This starter notebook shows how easily Cleanlab can improve the training dataset for an
#xgboost
model, producing a 12% reduction in error without any change to the existing model.
⚠️ Calling all users ⚠️
We'd love to hear how you have used cleanlab.
Share below any cool findings, label errors, datasets, anything cleanlab related!
Models can only be as good as the data they are trained on. Before diving into modeling, quickly run your images through CleanVision to make sure they are ok — it’s super easy!
Blogpost:
GitHub:
"AI is the next technology super cycle that has the potential to meaningfully improve our world" -
@coatuemgmt
As AI becomes more popular, folks are beginning to recognize that because Data is a crucial component of AI models, data curation tools are crucial as well.
👇👇👇
This
@VentureBeat
article by
@bendee983
is FULL of themes we are motivated by this new year 🎇
Articles like it validate why we open-source software that can help every data scientist working on real-world ML in 2023.
Quotes that resonate include: ...