Mehrdad Farajtabar Profile
Mehrdad Farajtabar

@MFarajtabar

Followers
5,490
Following
154
Media
24
Statuses
112
@MFarajtabar
Mehrdad Farajtabar
6 days
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral and leading closed models, including the
Tweet media one
188
1K
5K
@MFarajtabar
Mehrdad Farajtabar
2 years
Our team at Apple is looking for interns to work on Continual/Lifelong/Transfer Learning, Multi-Modal Large Models, ML Efficiency, and likewise. The position is available as early as next month and the duration is >6 months. Feel free to send your resumes to m_farajtabar @apple
18
86
529
@MFarajtabar
Mehrdad Farajtabar
11 months
My team at #Apple is looking for interns to work on Large Language Models (#LLM), especially on efficient "inference" and training. Please email your CV and highlighted related research or code to m_farajtabarATappleDOTcom. The ideal candidate must:
20
100
518
@MFarajtabar
Mehrdad Farajtabar
6 days
13/ Overall, we found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated
53
273
1K
@MFarajtabar
Mehrdad Farajtabar
7 months
Reflecting on LLM research from a bird's-eye view, Noam Shazeer is "the" single most important technical person behind the LLM/GenAI revolution.
1
5
42
@MFarajtabar
Mehrdad Farajtabar
20 days
I just shared the following note with my team
Tweet media one
1
0
45
@MFarajtabar
Mehrdad Farajtabar
6 days
12/ Understanding LLMs' true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable—especially in #AI_safety, #alignment, #education, #health_care, and #decision_making systems. Our findings emphasize the
5
40
308
@MFarajtabar
Mehrdad Farajtabar
11 months
1) be available for a long internship (both spring and summer), 2) have work authorization in the US and ideally be able to move to Seattle, 3) have related research artifacts, e.g., papers at NLP and ML conferences, 4) have hands-on experience with PyTorch/JAX.
2
1
37
@MFarajtabar
Mehrdad Farajtabar
6 days
8/ This begs the question: Do these models truly understand mathematical concepts? Introducing #GSM_NoOp! We add a single clause that seems relevant but doesn't contribute to the overall reasoning (hence "no-op"). Check out what happens next!
Tweet media one
19
32
306
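To make the GSM-NoOp idea above concrete, here is a minimal sketch of injecting a seemingly relevant but inconsequential clause into a GSM8K-style question. The helper function, the example question, and the distractor wording are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch of the "no-op" idea: insert a clause that mentions
# quantities and sounds relevant but does not change the answer.
# The helper and the example text are illustrative, not from the paper.

def add_noop_clause(question: str, distractor: str) -> str:
    """Insert an irrelevant clause right before the final question sentence."""
    body, _, final_question = question.rpartition(". ")
    return f"{body}. {distractor}. {final_question}"

q = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
     "How many kiwis does Oliver have?")
noop = "Five of the kiwis he picked were a bit smaller than average"

print(add_noop_clause(q, noop))
# The correct answer is still 44 + 58 = 102; a model that subtracts the
# "smaller" kiwis is pattern matching rather than reasoning.
```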
@MFarajtabar
Mehrdad Farajtabar
6 days
3/ Introducing GSM-Symbolic—our new tool to test the limits of LLMs in mathematical reasoning. We create symbolic templates from the #GSM8K test set, enabling the generation of numerous instances and the design of controllable experiments. We generate 50 unique GSM-Symbolic
Tweet media one
3
19
235
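For readers who want a concrete picture of the symbolic-template idea in the tweet above, here is a minimal sketch under assumed details: the template text, name pool, and number ranges are made up for illustration and are not taken from GSM-Symbolic itself.

```python
import random

# Hypothetical GSM8K-style symbolic template: proper names and numbers are
# placeholders that get re-sampled to produce many unique instances.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

NAMES = ["Sophie", "Liam", "Ava", "Noah"]  # illustrative name pool

def instantiate(seed: int) -> tuple[str, int]:
    """Sample one concrete question together with its ground-truth answer."""
    rng = random.Random(seed)
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    return TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y), x + y

# Generate, say, 50 unique instances from a single template.
dataset = [instantiate(seed) for seed in range(50)]
print(dataset[0])
```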
@MFarajtabar
Mehrdad Farajtabar
6 days
9/ #Result 4: A massive performance drop! All models, including o1 models, show significant declines. While it'll be interesting to see how grade-school students perform on similar datasets, I doubt the drop would be this severe.
Tweet media one
6
26
277
@MFarajtabar
Mehrdad Farajtabar
11 months
5) be enrolled in a PhD program, ideally close to graduation. Apologies in advance if I cannot respond to all inquiries, but rest assured I'll read all the emails and take them into consideration. I will only follow up with the candidates whose profiles fit our projects best.
2
0
32
@MFarajtabar
Mehrdad Farajtabar
6 days
5/ #Result 2: The fragility of supposed LLM reasoning. LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?
Tweet media one
10
37
332
@MFarajtabar
Mehrdad Farajtabar
6 days
2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on the GSM8K test. Today, models with ~3B parameters are surpassing 85%, and larger ones are hitting >95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning? vs.
Tweet media one
3
22
238
@MFarajtabar
Mehrdad Farajtabar
10 months
🚀 Excited to share our latest research on efficient large language model (LLM) inference with limited memory. We're tackling the challenge of running LLMs beyond the usual assumption that the entire model fits into the DRAM! #LLM #AI Thanks @_akhaliq for covering our work!
Tweet media one
@_akhaliq
AK
10 months
Apple announces LLM in a flash: Efficient Large Language Model Inference with Limited Memory paper page: Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their
Tweet media one
32
487
3K
2
5
27
@MFarajtabar
Mehrdad Farajtabar
10 months
I'll be at #neurips2023 from Thursday to Saturday. Looking forward to meeting old and new friends. Besides that, I'll be around our 4 posters! Happy to chat about Large Language Models #LLM efficient inference and training, #Multimodal and #CLIP models, #Continual learning. /0
1
0
27
@MFarajtabar
Mehrdad Farajtabar
4 years
@farajtabar Dude, thanks for the mention! :)
1
0
27
@MFarajtabar
Mehrdad Farajtabar
9 months
Our new paper 'Weight Subcloning' proposes a method for initialization and faster training of #transformer models. This approach helps transfer knowledge from large pretrained models to smaller versions by directly copying weights after sorting and shuffling them. 1/2
1
1
25
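The tweet above only gestures at the method, so the sketch below is a rough illustration rather than the paper's procedure: it initializes a smaller linear layer by copying the highest-norm output neurons of a larger pretrained one. The importance criterion and layer shapes are assumptions.

```python
import torch
import torch.nn as nn

def subclone_linear(big: nn.Linear, small: nn.Linear) -> None:
    """Initialize `small` by copying the highest-norm output neurons (rows)
    of `big`. A rough illustration of weight subcloning, not the paper's
    exact sorting/shuffling procedure."""
    with torch.no_grad():
        norms = big.weight.norm(dim=1)                 # importance per output neuron
        keep = norms.topk(small.out_features).indices  # rows to transfer
        small.weight.copy_(big.weight[keep][:, : small.in_features])
        if big.bias is not None and small.bias is not None:
            small.bias.copy_(big.bias[keep])

big = nn.Linear(1024, 4096)    # pretrained "parent" layer (in, out)
small = nn.Linear(512, 1024)   # smaller "child" layer to initialize
subclone_linear(big, small)
```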
@MFarajtabar
Mehrdad Farajtabar
6 days
7/ #Result 3: As questions increase in difficulty (M1 → Symbolic → P1 → P2), not only does performance drop, but variance also rises, making models increasingly unreliable.
Tweet media one
6
15
208
@MFarajtabar
Mehrdad Farajtabar
6 days
11/ ... but even o1-preview makes the same kind of silly mistakes, like this one. Either it doesn't understand what 'now' is, or it doesn't understand what 'last year' is, or, more likely, its training data involving inflation has this pattern and it's simply following it again.
Tweet media one
17
25
219
@MFarajtabar
Mehrdad Farajtabar
6 days
4/ #Result 1: Current accuracies on GSM8K are not reliable! We observe LARGE performance variation: Llama 8B scores anywhere between 70% and 80%, Phi-3 scores between 75% and 90%, and so on. For most models, the average performance on GSM-Symbolic is lower than on GSM8K.
Tweet media one
Tweet media two
8
20
207
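One way to picture the variation reported above: evaluate the same model on each of the 50 GSM-Symbolic instantiations and look at the spread of per-instance accuracies. The numbers below are synthetic, only to show the bookkeeping, not the paper's measurements.

```python
import random
import statistics

# Synthetic per-instance accuracies for one model across 50 GSM-Symbolic
# instantiations (illustration only; not real results).
rng = random.Random(0)
scores = [rng.uniform(0.70, 0.80) for _ in range(50)]

print(f"mean={statistics.mean(scores):.3f}  "
      f"stdev={statistics.stdev(scores):.3f}  "
      f"min={min(scores):.3f}  max={max(scores):.3f}")
```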
@MFarajtabar
Mehrdad Farajtabar
6 days
10/ #Result 5: Can scaling data, models, or compute fundamentally solve this? We don't think so! #OpenAI's #o1-series is performing better but still suffers from slight performance variations. #o1_preview shows significant improvements, but...
Tweet media one
4
20
193
@MFarajtabar
Mehrdad Farajtabar
10 months
It seems like we've reached an all-time high in the number of upvotes on Hugging Face research papers.
@ClementDelangue
clem 🤗
10 months
Lots of cool work from @apple recently . Check out MLX: . Almost 200 upvotes on their latest paper on HF:
Tweet media one
Tweet media two
Tweet media three
9
37
295
1
0
21
@MFarajtabar
Mehrdad Farajtabar
1 year
#ImageNet_Moment of #Continual_Learning! Do we really need to train Foundation Models from scratch every time we get new data and waste so much compute, time, and energy to have an updated one? Check out our paper with @saurabh_garg67, @FartashFg, @Vaishaal and colleagues at #Apple!
@saurabh_garg67
Saurabh Garg
1 year
Q: How to keep foundation models up to date with the latest data? ⏱️ We introduce the first web-scale Time-Continual (TiC) benchmark with 12.7B timestamped img-text pairs for continual training of VLMs and demonstrate efficacy of a simple replay method.
1
34
127
1
2
19
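The quoted tweet mentions "a simple replay method" for keeping models up to date. As a toy sketch of that general idea (not the TiC-CLIP recipe; the sampling ratios and buffer construction here are assumptions), training batches can mix fresh data with samples replayed from earlier time steps:

```python
import random

def mixed_batches(new_data, replay_buffer, batch_size, replay_frac=0.5, seed=0):
    """Yield batches that mix fresh examples with replayed old ones.
    Toy sketch of experience replay for continual training; ratios and
    buffer construction are assumptions, not the TiC-CLIP recipe."""
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_frac)
    n_new = batch_size - n_replay
    shuffled = list(new_data)
    rng.shuffle(shuffled)
    for i in range(0, len(shuffled) - n_new + 1, n_new):
        batch = shuffled[i : i + n_new] + rng.sample(replay_buffer, n_replay)
        rng.shuffle(batch)
        yield batch
```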
@MFarajtabar
Mehrdad Farajtabar
7 months
I asked #Gemini to find the #English equivalent of a #Persian phrase and it spit out a #Russian term in the middle of #generation. A very interesting #bug! Perhaps it couldn't recall the Persian equivalent of "option" in that context and used its Russian knowledge ;-)
Tweet media one
3
3
15
@MFarajtabar
Mehrdad Farajtabar
6 days
6/ What if we adjust question difficulty? We introduce 3 new variants of GSM-Symbolic to study model behavior: removing one clause (GSM-M1), adding one clause (GSM-P1), or adding two clauses (GSM-P2).
Tweet media one
2
11
162
@MFarajtabar
Mehrdad Farajtabar
5 months
I'm not attending #ICLR2024 but here are two of my papers from our team at #Apple: 1) ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, Wed 8 May; 2) TiC-CLIP: Continual Training of CLIP Models, Fri 10 May
2
0
15
@MFarajtabar
Mehrdad Farajtabar
20 days
A better/edited version of "I just shared the following note with my team"
Tweet media one
@MFarajtabar
Mehrdad Farajtabar
20 days
I just shared the following note with my team
Tweet media one
1
0
45
1
0
14
@MFarajtabar
Mehrdad Farajtabar
1 year
Indeed, we've shown that there is no significant gap in performance when you use ReLU in Llama or Falcon after only a few epochs of fine-tuning (and also with from-scratch training), and the gap (if any) can easily be bridged with extra training (scaling laws!)
@teortaxesTex
Teortaxes▶️
1 year
They are hinting at that, sure. But they're testing on OPT, as in most of those Hype-Aware Quantization papers. Why? OPT's FF layers use ReLU. It sacrifices perplexity but makes activations sparse. I'm skeptical it'll work for SwiGLU in LLaMA… without retrain. (paper: MoEfication)
Tweet media one
Tweet media two
2
1
26
0
0
10
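As a minimal sketch of the "replace the activation with ReLU, then fine-tune" claim above: the snippet below swaps every nn.SiLU module for nn.ReLU. In real Llama/Falcon implementations the activation often lives inside a gated MLP and may be applied functionally, so the swap generally has to happen in that module's code; treat this as an assumption-laden illustration, not the paper's procedure.

```python
import torch.nn as nn

def relufy(model: nn.Module) -> int:
    """Replace every nn.SiLU activation with nn.ReLU, in place.
    Returns the number of modules swapped; fine-tuning afterwards is what
    closes any remaining quality gap (per the claim in the tweet)."""
    swapped = 0
    for name, child in model.named_children():
        if isinstance(child, nn.SiLU):
            setattr(model, name, nn.ReLU())
            swapped += 1
        else:
            swapped += relufy(child)
    return swapped

# Toy usage on a small MLP standing in for a transformer feed-forward block.
mlp = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
print(relufy(mlp))  # -> 1
```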
@MFarajtabar
Mehrdad Farajtabar
3 years
Have SpaceX, Tesla, Waymo, Meta(verse), Apple, etc paid their debt to Disney and Pixar for all the motivation and inspiration?
1
0
7
@MFarajtabar
Mehrdad Farajtabar
10 months
Lots of good feedback and interesting avenues for future work.
@atiorh
Atila
10 months
My takeaways from Apple's “LLM in a flash" (1/n)
3
68
372
0
0
6
@MFarajtabar
Mehrdad Farajtabar
10 months
#ReLU activation function offers new horizons for #efficient #LLM training and inference!
@mayfer
murat 🍥
10 months
with all the sparsity-aware, context-based memory loading papers coming out (PowerInfer getting 11x and Apple getting 25x speedup on GPU), ReLU's dead zone is turning out to be important. llama-class models (SwiGLU) might not have much longevity after all once all the Metal work
Tweet media one
10
20
245
0
1
7
@MFarajtabar
Mehrdad Farajtabar
18 days
"Continual Learning" for the win!
@ilyasutsk
Ilya Sutskever (Parody)
19 days
Our company is embarking on an ambitious journey to develop Artificial General Intelligence (AGI). With a $1 billion investment, we are positioning ourselves to unlock the future of AI. Here’s how we will allocate that funding and drive this groundbreaking initiative forward. 1.
71
127
1K
1
1
7
@MFarajtabar
Mehrdad Farajtabar
5 days
@sirbayes If you're referring to Llama's, it's Llama 3 8B, which is quite an advanced model and has presumably been trained on lots of similarly crafted data; still, a 10-percentage-point deviation is too much for me. For the older models it's more damning (a table in the appendix has all the numbers). I may
1
4
53
@MFarajtabar
Mehrdad Farajtabar
9 months
aiming to reduce training time and resource use. Ideal for contexts with less data or computing power, it offers a promising alternative for #efficient model training. With M. Samragh, @sacmehtauw @raviteja_vemu @FartashFg D. Naik @OncelTuzel @morastegari 2/2
0
0
5
@MFarajtabar
Mehrdad Farajtabar
4 years
@farajtabar @Chaay ...see it and optimize it. Even Milgram had a paper in this area. Interesting examples that come to mind: as you mentioned, for economists who know reinforcement learning and Markov processes, the playing field is wide open.
2
0
5
@MFarajtabar
Mehrdad Farajtabar
8 days
Here are my nominations for the #Nobel Prize in Literature! @Yoshua_Bengio @ylecun
@boredyannlecun
Bored Yann LeCun
4 years
You got the citations, but do the ablations You'd lack solutions without convolutions Denoise yourself prof, you owe me & Geoff! Little bro Bengio, you're just a soph, did I cough? My GPU wows, you just hide behind eyebrows I self-supervise, you capsize 🧠🧠🧠🧠🧠🧠 #torched
4
15
223
0
0
7
@MFarajtabar
Mehrdad Farajtabar
4 years
@farajtabar @Chaay There's a lot of hype around using AI for market design or for changing and improving the rules. In ordinary applications, AI is used to predict behavior, estimate the hidden state of economic agents, or optimize an objective function for an agent or system. But treating the market itself as an intelligent agent
1
0
4
@MFarajtabar
Mehrdad Farajtabar
4 years
@farajtabar @Chaay For research work, I agree that investing in 3 and 4 has more competitive advantage. But every now and then, to get an excellent, presentable result or to speed up model evaluation, you'll inevitably have to get your hands dirty with 1 and 2. The replies others have written are thorough and comprehensive, so I won't go into that, but a fun, hard, and
1
0
3
@MFarajtabar
Mehrdad Farajtabar
10 months
4) SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding () on Friday in UniReps: Unifying Representations in Neural Models Workshop (arxiv: )
0
1
3
@MFarajtabar
Mehrdad Farajtabar
1 year
@iofu728 @main_horse @_akhaliq Indeed, with a similar motivation we looked at the sparsity of Llama and Falcon and saw that their activation function can be replaced by ReLU with a small amount of fine-tuning (or even from-scratch training) without affecting performance.
1
0
3
@MFarajtabar
Mehrdad Farajtabar
5 months
If you have any questions about them, please attend the oral and poster sessions or find Iman Mirzadeh or @FartashFg to chat about #inference #efficiency and #continual #training of large vision language and large language models (#LLM).
0
0
3
@MFarajtabar
Mehrdad Farajtabar
4 years
@HDNeverFalls @Chaay @farajtabar I'm genuine. That account is fake. Don't look at the number of his followers: they are all fake, bots, trolls, reformists, stability islanders, Arzeshi, Barandaz, zero-sum-game enthusiasts (:-P), etc., etc.
2
0
3
@MFarajtabar
Mehrdad Farajtabar
3 years
@polkirichenko (1) Happy birthday! (2) Great work ;-)
1
0
3
@MFarajtabar
Mehrdad Farajtabar
10 months
1) ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models () on Saturday at Efficient Natural Language and Speech Processing workshop (arxiv: )
1
0
2
@MFarajtabar
Mehrdad Farajtabar
7 months
BTW, I'm so glad we passed the era of traditional #search ! #Generative_AI is here to make a real difference!
Tweet media one
0
0
3
@MFarajtabar
Mehrdad Farajtabar
18 days
@danie1marczak Correct! Same here. It felt so real :D
0
0
2
@MFarajtabar
Mehrdad Farajtabar
5 days
@boazbaraktcs Thanks, Boaz, for the comment. I think prompting can help a bit, or even more than a bit (like how CoT helps), especially on harder problems like GSM-P1 or -P2, but at the end of the day one can come up with harder ones (-Pn) or distractions (no-op) that have not been seen in
0
0
16
@MFarajtabar
Mehrdad Farajtabar
10 months
2) TiC-CLIP: Continual Training of CLIP Models () on Friday in  Workshop on Distribution Shifts: New Frontiers with Foundation Models (arxiv: )
1
0
1
@MFarajtabar
Mehrdad Farajtabar
20 days
@_dsevero They can pretend to reason before the deadline :D
0
0
1
@MFarajtabar
Mehrdad Farajtabar
10 months
🙏 All these fun images were generated by #DALL_E. Please check out our paper for more details and more serious images: ! Kudos to my co-authors and other colleagues for the fantastic cross-functional (#AIML, #SW, #HW) collaboration!
Tweet media one
0
0
0
@MFarajtabar
Mehrdad Farajtabar
2 months
If you are traveling to #Bangkok for #ACL2024 don't miss Keivan's Oral (Monday 3pm) and Poster presentation (Wednesday 10:30-12:00, session F)!
@KeivanAlizadeh2
Keivan Alizadeh
2 months
Hey Guys, I'm gonna present LLM in a flash in ACL 2024. Hit me up if you are in Bangkok. Updates from previous version: - Llama 2 results - Some results on Apple GPUs (Metal) - Speculative decoding - Memory Latency Tradeoff - Impact of longer generation
0
6
45
0
0
2
@MFarajtabar
Mehrdad Farajtabar
10 months
3) CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement () on Friday in UniReps: Unifying Representations in Neural Models Workshop (arxiv: )
2
0
1
@MFarajtabar
Mehrdad Farajtabar
4 years
@a_ghasemi @farajtabar Yep, you're right :)
0
0
1
@MFarajtabar
Mehrdad Farajtabar
10 months
💡 Key Insight: Store model parameters in higher-capacity flash memory and load them selectively into DRAM during inference. This avoids needing to fit the entire model in DRAM. Our method optimizes data management, reducing data transfer and enhancing memory usage efficiency.
Tweet media one
1
0
1
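A toy illustration of the key insight in the tweet above (weights stay on flash/SSD; only the rows needed right now are paged into DRAM). The file name, layer shapes, and the random "active neuron" set standing in for a sparsity predictor are all assumptions for illustration, not the paper's system.

```python
import numpy as np

D_MODEL, D_FF = 1024, 4096
PATH = "ffn_up_proj.npy"  # hypothetical weight shard kept on flash/SSD

# One-time setup: persist a (D_FF, D_MODEL) weight matrix to disk.
np.save(PATH, np.random.randn(D_FF, D_MODEL).astype(np.float32))

# Inference time: memory-map the file instead of loading it all into DRAM...
weights = np.load(PATH, mmap_mode="r")

# ...and copy only the rows for neurons predicted to be active (a random
# subset here stands in for a learned sparsity predictor).
active = np.sort(np.random.choice(D_FF, size=D_FF // 8, replace=False))
hot_rows = np.asarray(weights[active])        # only these rows reach DRAM

x = np.random.randn(D_MODEL).astype(np.float32)
partial_out = np.maximum(hot_rows @ x, 0.0)   # ReLU would zero the skipped neurons anyway
```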
@MFarajtabar
Mehrdad Farajtabar
10 months
@MiladShahidi @ImangAdy No, come on, I was joking. It was fine :)
0
0
1
@MFarajtabar
Mehrdad Farajtabar
4 years
@farajtabar @a_ghasemi Hi there :) Looks interesting. I assume you already have word embeddings for Persian text. If you don't want to start completely from scratch, you can use them and train a small seq2seq model on the poems using those embeddings. That may work better than a sent2vec model trained from scratch?
2
0
1