We're releasing an open-source massively multilingual speech recognizer!
Repo (+ colab notebook):
It's a 1-billion-parameter CTC transformer. This is a very cool model, for a few reasons:
Has your neural net ever NaN’d so hard you thought about chucking your laptop in the trash and moving to British Columbia to be a tree planter instead?
Model-free RL: If I push this button, I will get a treat.
Model-based RL: 𝘎𝘦𝘯𝘵𝘭𝘦𝘮𝘦𝘯, 𝘵𝘩𝘳𝘰𝘶𝘨𝘩 𝘱𝘢𝘪𝘯𝘴𝘵𝘢𝘬𝘪𝘯𝘨 𝘤𝘢𝘭𝘤𝘶𝘭𝘢𝘵𝘪𝘰𝘯𝘴, 𝘐 𝘩𝘢𝘷𝘦 𝘥𝘦𝘵𝘦𝘳𝘮𝘪𝘯𝘦𝘥 𝘵𝘩𝘢𝘵 𝘪𝘧 𝘐 𝘱𝘶𝘴𝘩 𝘵𝘩𝘪𝘴 𝘣𝘶𝘵𝘵𝘰𝘯, 𝘐 𝘸𝘪𝘭𝘭 𝘨𝘦𝘵 𝘢 𝘵𝘳𝘦𝘢𝘵.
Researchers Had To Shut Down AI After It Taught Itself 19 Languages?!* 🤔😱🤖😤 Like👍 Subscribe🔔
* = we used pseudo-labeling to train a single massively multilingual speech recognizer for all 60 languages of Common Voice.
Paper:
🧵
“Why do convolutions work, father?”
“Well, son, translation equivariance is a sensible inductive bias for many modalities, like images and audio.”
“Why do Einsum(bhnk,bhmk->bhnm)TransposeSigmoid1x1Nets work, father?”
“Enough questions for today, son.”
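(For anyone squinting at that einsum string: it's just the attention-score computation. A toy PyTorch snippet with made-up shapes:)

import torch

# Hypothetical shapes: batch 2, heads 4, n = m = 8 positions, head dim k = 16.
q = torch.randn(2, 4, 8, 16)   # (b, h, n, k)
k = torch.randn(2, 4, 8, 16)   # (b, h, m, k)
# The einsum from the joke: dot products between every query and every key,
# i.e. the (unscaled) attention score matrix, shape (b, h, n, m).
scores = torch.einsum("bhnk,bhmk->bhnm", q, k)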
Introducing the Transducer! A sequence-to-sequence model from 2012 (!) that combines the best aspects of CTC and attention models for problems like speech recognition—long neglected, but starting to make a comeback.
Blog:
Code:
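For the impatient, a minimal, hypothetical sketch of the Transducer's three parts in PyTorch (toy sizes, not the model from the blog/code): an encoder over audio frames, a prediction network over previous output tokens, and a joiner that scores the next token (or blank) for every (frame, prefix) pair.

import torch
import torch.nn as nn

vocab, feat_dim, hid = 32, 80, 256                      # toy sizes; blank label lives at index 0
encoder = nn.LSTM(feat_dim, hid, batch_first=True)      # acoustic side
embed = nn.Embedding(vocab, hid)
predictor = nn.LSTM(hid, hid, batch_first=True)         # autoregressive "LM" side
joiner = nn.Linear(hid, vocab)                          # scores next token or blank

x = torch.randn(2, 100, feat_dim)                 # (batch, frames, features)
y = torch.randint(1, vocab, (2, 10))              # (batch, U) target tokens
f, _ = encoder(x)                                 # (batch, T, hid)
bos = torch.zeros(2, 1, dtype=torch.long)         # blank/BOS prepended to the prefix
g, _ = predictor(embed(torch.cat([bos, y], 1)))   # (batch, U + 1, hid)
logits = joiner(f.unsqueeze(2) + g.unsqueeze(1))  # (batch, T, U + 1, vocab)
# Training marginalizes over alignments, e.g. with torchaudio.functional.rnnt_loss.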
Hidden Markov Models have gotten a bit less love in the age of deep learning, but they are really nifty models that can learn even from tiny datasets.
I’ve written a notebook introducing HMMs and showing how to implement them in PyTorch—check it out here:
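Not a substitute for the notebook, but here's a minimal sketch of the flavor of thing it covers: the forward algorithm for a discrete HMM, in log space, in PyTorch.

import torch

def hmm_log_likelihood(log_pi, log_A, log_B, obs):
    # log_pi: (S,) initial state log-probs
    # log_A:  (S, S) transitions, log_A[i, j] = log p(s_t = j | s_{t-1} = i)
    # log_B:  (S, V) emission log-probs; obs: (T,) observed symbol indices
    alpha = log_pi + log_B[:, obs[0]]
    for t in range(1, len(obs)):
        # sum over previous states in log space, then emit the observed symbol
        alpha = torch.logsumexp(alpha.unsqueeze(1) + log_A, dim=0) + log_B[:, obs[t]]
    return torch.logsumexp(alpha, dim=0)   # log p(obs)

# Tiny example: 2 states, 3 symbols.
log_pi = torch.tensor([0.6, 0.4]).log()
log_A = torch.tensor([[0.7, 0.3], [0.4, 0.6]]).log()
log_B = torch.tensor([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]).log()
print(hmm_log_likelihood(log_pi, log_A, log_B, torch.tensor([0, 2, 1])))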
Sorry, guys. Facebook is down because a neural net I trained during my internship grew too large and began to eat the other computers (this happens sometimes).
I wrote a short post about logsumexp:
It's an operation you've almost certainly used, if you do machine learning—but not everyone has taken a moment to ponder it and understand it intuitively.
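The whole point in one toy snippet (not from the post itself): the naive computation overflows, the max-subtraction trick doesn't.

import torch

x = torch.tensor([1000.0, 1000.1, 999.9])
naive = torch.log(torch.exp(x).sum())            # inf: exp(1000) overflows
m = x.max()
stable = m + torch.log(torch.exp(x - m).sum())   # finite, about 1001.1
builtin = torch.logsumexp(x, dim=0)              # same thing, built in
print(naive, stable, builtin)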
How can you train a speech recognizer using only unpaired audio and text? Here's a simple recipe (rough sketch after the list):
- train language model (LM) for the target language
- train acoustic model (AM) for some other (!) source language
- iterative pseudo-labeling using AM + LM
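Roughly, the loop looks like this (pseudocode; every function here is a hypothetical stand-in, not a real API):

# Hypothetical sketch of iterative pseudo-labeling with unpaired audio + text.
lm = train_language_model(target_text)      # text only, target language
am = train_acoustic_model(source_pairs)     # paired data from a *different* language
for _ in range(num_rounds):
    # Decode unpaired target-language audio with the current AM, with the LM
    # biasing decoding toward plausible target-language transcripts.
    pseudo_labels = [transcribe(am, lm, utt) for utt in target_audio]
    # Keep only transcripts we trust, then train on them as if they were real labels.
    am = finetune(am, filter_by_confidence(pseudo_labels))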
Has anyone tried RNN architectures with all the transformer stuff except for self-attention?
(in other words, layer norm + residuals + feedforward + deep, and then just RNN instead of self-attention)
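Something like this, roughly (hypothetical sketch in PyTorch, not a known architecture):

import torch
import torch.nn as nn

class RNNBlock(nn.Module):
    # A pre-norm Transformer block with the self-attention sublayer swapped for a GRU,
    # keeping layer norm, residuals, and the feed-forward sublayer.
    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)   # replaces self-attention
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    def forward(self, x):                  # x: (batch, time, d_model)
        h, _ = self.rnn(self.norm1(x))
        x = x + h                          # residual around the RNN sublayer
        return x + self.ff(self.norm2(x))  # residual around the feed-forward sublayer

blocks = nn.Sequential(*[RNNBlock() for _ in range(12)])   # "deep": just stack them
out = blocks(torch.randn(2, 50, 256))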
Looks like I'll be interning with Ronan Collobert and Gabriel Synnaeve @syhw at FAIR this summer! I believe our... colloberation... should have a lot of... synnergy...
"Pseudo-Labeling for Massively Multilingual Speech Recognition" accepted to ICASSP 2022!
See you in Singapore, assuming the Pi or Rho variant doesn't thwart my plans.
My theory is that non-doomers are common but not well-represented online because they are emotionally stable people who are not temperamentally well-suited for wading into a debate where people are shrieking about whether AI will be Racist or SkyNet.
From what I infer, most doomers fall into three categories: they either (a) are fundamentally misanthropic, (b) like to think they're "saving the world", or (c) are looking for a moat. Or some combination of the above.
Let's take a look at this.
[1/4]
Authorship idea: if you helped with a paper, but maybe not quite enough to merit being a co-author, you can get a “Feat.”, like “Attention Is All You Need (Feat. Pitbull)”
A reviewer complained that my paper did not have a distinct “Related Work” heading. Remember, your paper should always have:
Introduction
Post-Introduction
Related Work
Unrelated Work
Background
Foreground
Foreplay
Wait For It
Almost There
Experiments
Brace For Impact
Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework
Proposes OFA, which achieves new SoTA on multimodal tasks and performs on par with uni-modal models (BERT, MAE, etc.) on uni-modal tasks.
Broke: @GaryMarcus arguing that neural networks can’t implement System 2 cognition
Woke: Hubert Dreyfus arguing that expert systems can’t implement System 1 cognition
I’m delighted to announce that this fall I’ll be starting a PhD in AI with two of my favorite professors, @bretthmeyer and @DerekRenderling, at @McGillU / @MILAMontreal!
Excited to go on a journey of learning, computers, and getting computers to do learning :)
I wish computer vision systems did the thing where a bunch of vertices dance across the image until a mesh forms and it starts blinking and printing some shit like “𝙼𝙰𝚃𝙲𝙷 𝙲𝙾𝙽𝙵𝙸𝚁𝙼𝙴𝙳”. Instead it’s like
𝚝𝚎𝚗𝚜𝚘𝚛([-𝟶.𝟽𝟼𝟿𝟾, 𝟷.𝟹𝟹𝟾𝟹, ...
Hindi ASR challenge: 100 hours of labeled data, 1000 hours of unlabeled data. Perfect opportunity for testing out semi-supervised learning algorithms on Not-Librispeech!
If you have a meta-learning idea, the Bengio brothers were probably already doing it in the ‘90s.
If you have a meta-meta-learning idea, Schmidhuber was probably already doing it in the ‘80s.
Inside you there are two wolves.
One wolf likes smol models with strong inductive biases that can learn from 100 training examples.
The other likes 1 trillion parameter transformers that can eat the entire Internet and do on-the-fly meta-learning.
Which wolf will you feed?
Wordle 203 5/6
⬛🟨⬛⬛🟨
🟨🟨⬛⬛⬛
⬛⬛🟩🟨🟩
🟩🟩🟩⬛🟩
🟩🟩🟩🟩🟩
I will not waste a day coding a neural net to play Wordle
I will not waste a day coding a neural net to play Wordle
I will not waste a day coding a neural net to play Wordle
I will not waste a day coding a n-
I went backcountry skiing, during an Extreme Cold Warning, with some friends who are a bit more daring and outdoorsy than I am.
The best part of the experience was that I Did Not Die.
Silly AI regulation hype
One cannot regulate AI research, just like one cannot regulate math.
One can regulate applications of AI in finance, cars, healthcare. Such fields already have continually adapting regulatory frameworks in place.
Don’t stifle the open-source movement!
I wonder if in the future a false etymology for “Zoomer” will arise, like:
“Ah yes, in 2020 the world began using Zoom because of the pandemic. Hence the new generation became known as ‘Zoomers’.”
@giffmana (Also 1.0 was a pure CNN and 2.0 used a transformer)
(Also 1.0 called itself "unsupervised", but by the time 2.0 rolled around they had jumped on the bandwagon and started saying "self-supervised" :P)
1. No need to tell the model which language you're speaking! The model implicitly figures that out and transcribes in the appropriate script, etc., unlike a "multi-headed" / "multi-decoder" model.
Your scientists were so preoccupied with how much wood a woodchuck _could_ chuck… that they did not stop to ask how much wood a woodchuck _should_ chuck.
This is a work in progress: we're still training some models and working on releasing the weights; we'll update the paper afterwards.
Until then, enjoy and let me know if you have any questions.
It’s my first day at Mila. Time to find the biggest, meanest neural network in the yard and make it overfit to a single minibatch, to assert my dominance.
Manifold has just launched a dating app! 🤯💖
The premise is simple:
OkCupid meets prediction markets!
Bet on who would date who for at least 6 months. It's crowdsourced matchmaking!
100+ profiles created in just a couple days. What are you waiting for? Get in there!
If I were a 1995 hacker, my hacker name would be MAXIMUM LIKELIHOOD, and I would train neural nets with over 100,000 (!!) weights, using data swiped from my rivals using Hacking.
Still can't believe that one of the fundamental reinforcement learning algorithms is just called "REINFORCE". It's like naming your neural net architecture "NEUR".
VeRA: Vector-based Random Matrix Adaptation
Presents VeRA, which reduces the number of trainable parameters by 10x compared to LoRA, yet maintains the same performance
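My rough reading of the idea, sketched in PyTorch (simplified; sizes and init are made up, see the paper for the real formulation): the low-rank matrices are frozen and random, shared across layers, and each adapted layer only trains two small scaling vectors.

import torch
import torch.nn as nn

d_in, d_out, rank = 768, 768, 16
A = torch.randn(rank, d_in)     # frozen, randomly initialized, shared across layers
B = torch.randn(d_out, rank)    # frozen, randomly initialized, shared across layers

class VeRALinear(nn.Module):
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base.requires_grad_(False)     # frozen pretrained layer
        self.d = nn.Parameter(torch.ones(rank))    # trainable scaling vector
        self.b = nn.Parameter(torch.zeros(d_out))  # trainable scaling vector (zero init: no change at start)
    def forward(self, x):
        # delta W = diag(b) @ B @ diag(d) @ A, applied to x
        delta = (x @ A.t()) * self.d     # (batch, rank)
        delta = (delta @ B.t()) * self.b # (batch, d_out)
        return self.base(x) + delta

layer = VeRALinear(nn.Linear(d_in, d_out))
print(layer(torch.randn(4, d_in)).shape)   # torch.Size([4, 768])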
We also discovered something delightful for the monolingual setting: if you train on pseudo-labels for the wrong language, it still works!
This is like teaching a child to write English by reading them "The Cat in the Hat" and then making them transcribe Telemundo for 2 years.
LongNet: Scaling Transformers to 1,000,000,000 Tokens
Presents LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences
abs:
repo:
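The core trick, as I understand it, is dilated attention. A much-simplified toy sketch of a single (segment length, dilation) configuration in PyTorch; the real model mixes several configurations across heads and scales:

import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, w=8, r=2):
    # Split the sequence into segments of length w, keep every r-th position in each
    # segment, attend densely among the kept positions, scatter results back.
    b, n, d = q.shape                      # assumes n divisible by w
    def sparsify(x):
        return x.view(b, n // w, w, d)[:, :, ::r, :].contiguous()   # (b, segs, w // r, d)
    out = F.scaled_dot_product_attention(sparsify(q), sparsify(k), sparsify(v))
    full = torch.zeros_like(q).view(b, n // w, w, d)
    full[:, :, ::r, :] = out               # dropped positions stay zero in this toy version
    return full.view(b, n, d)

x = torch.randn(2, 32, 64)
print(dilated_attention(x, x, x).shape)    # torch.Size([2, 32, 64])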