Five years ago (after AlphaGo beat Lee Sedol), I and many others thought RL would soon change the world. Its impact has been smaller than we anticipated (mainly due to the Sim2Real problem)
If LLMs have less impact by 2027 than many now expect, what will be the reason?
Kaggle is formally releasing a new micro-course by
@alexis_b_cook
next week.
But it's already available at
It's just so good.
Possibly the single best resource on the internet for someone new to Python who wants the fast path to useful skills.
What would a more applied approach to AI look like?
I think I have the answer. So I'm leaving Google and Kaggle to build a business around it. Check it out at
Want help getting more than predictions, so you can optimize decisions?
Let's talk. I can help
Announcing something I've wanted to do for years
A Decision Optimization course with
@weights_biases
covering both simple & sophisticated techniques to help data scientists and ML engineers make their existing skills far more valuable
The backstory 👇
I'm 50/50 on whether we'll still use deep learning in 2030
But I'm confident we'll still use transfer learning
Transfer learning is such a good idea, it's gotta be here to stay.
1) Start with a brainless baseline
2) Repeatedly make small improvements
That's how XGBoost and deep learning work
It's how people run successful ML projects
Not a bad strategy. We should use that more in other places.
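A minimal sketch of that "brainless baseline + small improvements" loop, in the spirit of gradient boosting (toy 1D data; all names are illustrative, not any library's API):

```python
import numpy as np

def stump(x, y):
    """Fit a depth-1 regression tree (single threshold split) by brute force."""
    best = None
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((y - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1], best[2], best[3]

def boost(x, y, rounds=20, lr=0.3):
    """Start from a brainless baseline (the mean), then repeatedly
    fit a stump to the residuals and take a small step toward it."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(rounds):
        t, lv, rv = stump(x, y - pred)          # fit the current errors
        pred += lr * np.where(x <= t, lv, rv)   # small improvement
    return pred

x = np.linspace(0, 10, 50)
y = np.sin(x)
baseline_err = ((y - y.mean()) ** 2).mean()
boosted_err = ((y - boost(x, y)) ** 2).mean()
```

Twenty tiny improvements on top of a constant baseline beat the baseline handily; that's the whole trick.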
Kaggle just released a new Python course based on the wildly successful 7-day Learn Python Challenge.
Check it out:
Great explanations, and a range of exercises that will be fun for both new and experienced Python programmers.
Great summary of the state of RL.
Why it has huge potential
How it currently doesn't work (really, it doesn't)
Suggestions on where to go from here
I hope RL returns from its academic meandering, and we refocus on what's needed to solve real problems
For anyone who wants to learn Python, Kaggle will host the "Learn Python Challenge" from June 11-18.
In 20 minutes a day, you'll learn the basics most relevant for data science (and apply it to interesting hands-on puzzles).
I keep hearing the claim "ChatGPT is just autocompletion" or "modern AI approaches just predict the next word."
That stopped being accurate a year ago, with the InstructGPT paper. It's an important read for anyone who wants to understand current AI
My favorite questions when interviewing data scientists are about ML explainability. ML explainability is so useful, but it isn't as widely known as it should be.
Kaggle has a free micro-course teaching the key ideas in ML explainability.
I just saw that the notebooks I authored for
#Kaggle
Learn courses have been forked over 2,000,000 times 🤯
There are a lot of great, free, applied data science courses at
Python puzzle:
My code is
fns = [lambda x: term for term in ('a', 'b', 'c')]
out = [f(None) for f in fns]
---
The result is that out is ['c', 'c', 'c']
What's happening here?
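The answer: Python closures are late-binding. Each lambda looks up `term` when it is called, after the loop has finished, so all three see the final value. The usual fix binds the current value via a default argument:

```python
# Each lambda looks up `term` at call time, after the comprehension
# has finished, so every one returns the final value 'c'.
fns = [lambda x: term for term in ('a', 'b', 'c')]
assert [f(None) for f in fns] == ['c', 'c', 'c']

# Fix: default-argument values are evaluated at definition time,
# so each lambda captures the value `term` had on that iteration.
fns = [lambda x, term=term: term for term in ('a', 'b', 'c')]
assert [f(None) for f in fns] == ['a', 'b', 'c']
```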
I heard about
@streamlit
earlier this week
Tried it for the first time this morning
🔥Holy smokes 🔥
I'm not sure I'll ever write a Jupyter notebook again. They might still have some use case?
But Streamlit is shockingly nice to use
Thx
@HamelHusain
for telling me about it
Why is this workflow uncommon?
1) Train ML model
2) Calculate the absolute value of errors on the validation set, and build a "confidence" model that predicts error magnitudes
Then prospectively, you'd make calls to both models, yielding both a prediction and a confidence level
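A minimal sketch of that two-model setup, assuming scikit-learn (the data-generating process and all names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
# Noise level depends on the first feature, so error magnitude is learnable
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=2000) * np.abs(X[:, 0])

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# 1) The usual prediction model
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# 2) A "confidence" model trained on absolute validation errors
abs_err = np.abs(y_val - model.predict(X_val))
conf_model = RandomForestRegressor(random_state=0).fit(X_val, abs_err)

# Prospectively: one call gives a prediction, the other an expected error size
x_new = rng.normal(size=(5, 3))
preds = model.predict(x_new)
expected_err = conf_model.predict(x_new)
```

The second model is just another regression problem, which is exactly why it's surprising the workflow isn't more common.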
HYPOTHESIS: Most data science today is still just experimentation
TEST: Many people seeing this are data scientists. Who has personally built a model for their current job that has been put in production or used to inform a meaningful decision?
3 of the very best data scientists I've met have no college degree.
And more generally, I observe ~0 correlation between DS skill and formal degree.
My sample size is small, but I'm still convinced academic achievement is terribly overrated in our field.
This problem isn't unique to ML/AI
It comes up with most data analysis, even casually looking at descriptive statistics
We're tempted to attribute differences to whatever feature tells a good story... but data collection usually introduces many confounding factors
People think they can't improve themselves quickly.
But the 4-hour Intro to ML course on Kaggle gives you enough background to independently have fun and grow with your own Machine Learning projects.
Most AI researchers hide human knowledge from models, so the model is a tabula rasa to learn from data
What about this: Try to embed as much human knowledge as possible. Then learn from data on top of that
It's less focused on "AGI" and more focused on solving problems well
I've historically been lukewarm about matplotlib
But a multi-year tour of other python graphing libraries has me more appreciative
It's flexible enough for anything I'd want... and while others might have cleverer APIs, the comprehensive matplotlib docs make up for it.
Today
@weights_biases
is releasing part 2 of my Decision Optimization course.
Learn the key tricks about going from standard loss functions to good decisions. And see how to directly optimize the outcomes you care about
My Twitter newsfeed would suggest everyone's busy with fun modeling libraries
StackOverflow tells a different story
A reminder that people are busy munging data.
Also, I didn't realize Spark was this widely used (or at least subject to so many questions).
I've seen some great data scientists struggle to have a practical impact because they don't know simulation techniques
I'll share a better way in this webinar
Explanations for why GANs produce nice results always felt hand-wavy
I just saw this video with an approach to generate high quality images without GANs
I predict we see more new approaches to generative models in 2019. Maybe replacing GANs entirely
I've had conversations with 40 pro data scientists in the last 2 weeks
Most assumed they were using ML predictions in an approximately optimal way... and most found they could do much better
I'll show how to deliver more value from the same predictions
Anthony Goldbloom was my most helpful angel investor when I started Decision AI
He just started a VC fund (AIX Ventures) with Richard Socher, Pieter Abbeel, and Chris Manning! That's a legit all-star AI lineup.
If I started something, those are the first investors I'd want
I hear "PhD data scientist" used to describe the persona of great data scientists.
But the most effective data scientists I know don't have advanced degrees. So I'm going to start referring to the expert DS persona with the phrase "high school graduate data scientist"
@__mharrison__
My favorite ML podcasts are Gradient Dissent by
@l2k
and TWIML by
@samcharrington
The Analytics Engineering Podcast is great for analytics and the data industry.
@arkosiorek
Sim2real in RL
RL is the most promising and least practically useful area of ML... and its current uselessness is because learning in simulation doesn't transfer to reality
Data scientists won't tune hyperparameters or design architectures in 5 years. AutoML will replace that.
Instead we'll structure rich (multi-equation) models to reflect outside knowledge. That can't be automated.
And Probabilistic Programming will be the key tool to do it.
I've heard gatekeeping from students who say projects should use hand-rolled implementations of ML algorithms. But I never hear that from people with more experience
You're just so much faster and deliver fewer bugs when using well-tested tools with higher-level APIs
The
#Gartner
magic quadrant is so bad.
Everyone knows it is pay-to-play, and the results contradict reality so badly.
Gartner's success says something disconcerting about the executives who purchase these technologies.
Why is it hard to find time to learn coding?
Because most people's first programs are uninspiring.
The new Data Visualization course by
@alexis_b_cook
changes all that. You can make fun and impressive graphics from Day 1, and learn Python in the process
I like the standard LLM fine-tuning tools, so I wasn't sure I'd like Predibase
Then I used it. Their UI is really nice... especially the data visibility part.
I'm going to use this more and more
So I'm pumped that
@predibase
just offered free compute credits to all participants
Early TF users struggled to choose between the various higher-level APIs. The embarrassment of riches was solved when Keras became the official high-level TF API
Now JAX is in the same place as early TF, with Flax, Haiku, Trax, Elegy
I hope the community consolidates on one again
@HamelHusain
and I are thinking about teaching a four-session, cohort-based course on LLM fine-tuning for data scientists and software engineers.
We set up this survey to gauge interest:
If you take the survey, we'll make sure you're the first to hear
Model interpretation is so valuable to data scientists, but way too few data scientists know how to see what their ML models are learning.
Starting next week, Kaggle Learn can show you how to extract the insights from your ML models
DS project I hope someone does:
Curate top 50
#COVID19
tweets of the day (w/ Twitter API)
Signals for ranking top 50?
Total likes. Retweets by people like
@NAChristakis
, etc
Submit to for auto-updating and big audience
Respond here to brainstorm
1/2
I wanted to see climate change so far in different places, so I made this
@streamlit
app to explore it
There's been a lot of aridification in the American Southwest, but it's changes in Europe that surprised me most
@kareem_carr
I had the views in this thread until I looked at the literature
@sarahookr
points to
It's intuitive that models transmit rather than create bias. But research shows that's not correct
Yesterday, I ended up in a debate where the position was "algorithmic bias is a data problem".
I thought this had already been well refuted within our research community but clearly not.
So, to say it yet again -- it is not just the data. The model matters.
1/n
I'm frequently amazed how disconnected our conservation efforts are from what actions actually help with environmental conservation.
Today's reminder: landfill vs recycling
I bet a lot of developers can write CSS 10X faster than me
And I could probably write pandas code 10X faster than them
Sooooo, I guess we're all 10X engineers!
🥳🥳🥳
To all the haters who said tech is building addictive technologies that don't improve users' lives:
You haters were right. Sorry for ever doubting you.
@HamelHusain
I tell my kids that our family is a tech startup
They thought it was weird at first, but I showed them the deed to our home, confirming they have no equity.
So now they get it
I just saw that the machine learning lessons I wrote for
#kaggle
Learn surpassed 1M uses.
That's pretty good. Though the best courses are the ones by
@alexis_b_cook
Worth checking those out at
I love reading data science stuff
But my phone pushed a story from "Towards Data Science" about GPT-3 replacing programmers
Thanks, phone... for the reminder to block articles from Towards Data Science
2019 is one heck of a year to get into tech.
Most top titles in Glassdoor's "Highest Paying Entry Level Jobs" are some version of
- Data scientist
- UI designer
- Software engineer
@bernhardsson
Perhaps you haven't heard about Amazon's new guarantee.
If your data isn't still available millennia after humans wipe each other off the face of the earth... they'll refund your storage charges.
Most people struggle to use ML models well
Solving this problem is the
#1
thing data scientists can do to build trust with colleagues and impact a company's bottom line
If you're ready for a modeling tool that helps you bridge the gap, we should talk
Congrats DataRobot on raising another $206M. It's a hard working team solving real problems
Data scientists and aspiring data scientists should think about developing skills that will be useful in an age of AutoML
Fiddling with model parameters will be an outdated workflow
1/3
Interactive data analysis is WAY more engaging than static graphs or text
People default to static publishing because it used to be SO much easier. Tools like Streamlit & Dash are changing that
I wonder what a Substack for interactive data apps would look like.
@myelbows
?
I'm frequently asked how to get a first job in data science. My answer, which I'm confident is good advice:
Do interesting projects
Make the results public
Make them look polished
Your resume might get someone to look at your projects, but the proof is in good (real) work
Good discussion on Kaggle’s role in an online portfolio for job candidates. TL;DR: it’s helpful to create and link to high quality kernels - demonstrates that you can apply your skills vs. “I completed these courses”
Have you seen people using averages or point predictions when they should look at distributions?
Decision AI tracks full distributions, because it's important in so many practical situations
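A toy numeric illustration of why (all numbers made up): when costs are asymmetric, the best decision sits at a quantile of the outcome distribution, which a point prediction like the mean misses entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed demand distribution: the mean is pulled up by a long right tail
demand = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)

def cost(stock, demand, under=9.0, over=1.0):
    """Understocking costs 9x what overstocking does."""
    return np.mean(under * np.maximum(demand - stock, 0)
                   + over * np.maximum(stock - demand, 0))

# Classic newsvendor result: the optimal stock level is the
# under/(under+over) = 0.9 quantile of the demand distribution
stock_mean = demand.mean()
stock_q90 = np.quantile(demand, 0.9)

cost_mean = cost(stock_mean, demand)
cost_q90 = cost(stock_q90, demand)
```

Acting on the 0.9 quantile is substantially cheaper than acting on the mean; the point prediction alone can't tell you that.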
I tried a few transcription APIs last week. None had the speed + accuracy I wanted. Just tried OpenAI's new API (based on the whisper-large model)
It transcribed a 1 minute clip in 5s with no transcription errors. Really happy with this.
@HamelHusain
@lc0d3r
@kaggle
Here are the links
@HamelHusain
:
0. Use cases for ML Insights:
1. Permutation Importance:
2. Partial Dep Plots:
3. Shap Values:
4. Advanced uses of Shap Values:
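As a flavor of the first technique, here's a minimal from-scratch permutation importance sketch (scikit-learn model on synthetic data; everything here is illustrative, not the course's own code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 3 * X[:, 0] + 0.1 * X[:, 1]   # feature 2 is pure noise

model = RandomForestRegressor(random_state=0).fit(X, y)

def permutation_importance(model, X, y):
    """Importance = how much the score drops when one column is shuffled,
    breaking that feature's relationship with the target."""
    base = model.score(X, y)
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        drops.append(base - model.score(Xp, y))
    return np.array(drops)

imp = permutation_importance(model, X, y)
```

Shuffling the dominant feature craters the score; shuffling the noise feature barely moves it, which is exactly the insight the technique delivers.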
It's a common practice to use ML without even thinking about causality
But the resulting predictions generalize poorly in a changing world. Like the one we live in
There's a lot you can do, and we are learning more. Glad to see research like this
I frequently want alerts when code finishes running so I can check the results
Is there a super-easy way to drop in a line of Python that sends me an SMS?
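One pattern that gets close: make the notification a pluggable callable and wrap the long-running function in a decorator. With Twilio's Python client (one real provider option; the credentials and numbers would be yours), the notifier would be roughly `Client(sid, token).messages.create(to=..., from_=..., body=msg)`. Below, a plain list stands in for it so the sketch is self-contained:

```python
import functools
import time

def notify_when_done(notify):
    """Decorator: call notify(message) after the wrapped function finishes."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            notify(f"{fn.__name__} finished in {time.time() - start:.1f}s")
            return result
        return wrapper
    return decorator

# Stand-in notifier. With Twilio (an assumption, not the only option) this
# would be: lambda msg: client.messages.create(to=..., from_=..., body=msg)
messages = []

@notify_when_done(messages.append)
def long_job():
    return sum(range(1000))

result = long_job()
```

Swapping the notifier for an SMS, Slack, or email call is one line, which is about as close to "drop in a line of Python" as I know how to get.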
Congrats to
@DataRobot
for raising another $100M round of financing.
They have an awesome product that can make most data scientists and analysts more effective.
I'm experimenting with LLMs to define 3D models of physical objects (having them write OpenSCAD code and FreeCAD Python API code)
Anyone exploring related topics? I'd love to chat.
Data scientists need to make models quickly & account for breaks from the patterns in historical data
Probabilistic simulation is the tool
We're building a tool for it:
Pondering whether simulation is right for your problem? Let's chat
I admire the AutoML Tables team for publicly committing to enter with the tool before the competition started.
They could have entered silently, and published the result only if it was good
But this result is more representative and compelling because it wasn't cherry-picked.
Update from
#KaggleDays
, 5 hours into the competition and Google AutoML still maintains its lead. Three hours to go (five hours since I took the pictures).
Say hypothetically you're a student at a US university
You'd rather not pay a bajillion $ for online classes
Fall job options for a 20 year old aren't great
You aren't some Thiel fellow type autodidact who can solve nuclear fusion in your semester off
What would you do?
@AnnieLowrey
@yanathomas
We already have too much to read rather than too little. So the lack of content may be a feature rather than a shortcoming
I like your writing, and I'd buy an Atlantic to read it. But I won't spend the time picking an Atlantic off the shelf to figure out whether you wrote anything in it
Most people won't realize how important a development this is for Kaggle competitions.
And in a couple years, the old style of Kaggle competitions will feel primitive.
We're excited to announce...🥁🥁🥁
Synchronous Kernels-only Competitions! What's this? Read all about it in this blog post by
@wcukierski
*AND* check out our 1st synchronous Kernels-only competition (linked in blog).
#nofreehunch
I recently spoke with
#DataFramed
, the
@DataCamp
podcast, about how data teams can move from making predictions to optimizing decisions
Episode came out today
We do ML that "doesn't matter" because standard ML workflows are insufficient to optimize decisions in complex dynamic environments
I think my new project at will make ML on tabular data vastly more actionable
Personally, I like ML research. But it covers different issues than real-world problems
Decision AI is built to improve real-world decision-making
It isn’t for everyone, but pragmatic data scientists (and the people who pay them) will love the difference
Decision AI is focused on letting data scientists hit exactly these criteria
Most people won't realize how far off ML has been until they see a better way.
Sure, machine learning is fun, but have you ever written a function that delivers business value, is well tested, and can be iterated on by your colleagues?