1/ Can Large Language Models (LLMs) truly reason, or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral, and leading closed models, including the
Our team at Apple is looking for interns to work on Continual/Lifelong/Transfer Learning, Multi-Modal Large Models, ML Efficiency, and the like. The position is available as early as next month, and the duration is >6 months. Feel free to send your resume to m_farajtabar
@apple
My team at #Apple is looking for interns to work on Large Language Models (#LLM), especially on efficient "inference" and training. Please email your CV, highlighting related research or code, to m_farajtabarATappleDOTcom. The ideal candidate must:
13/ Overall, we found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral, and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated
12/ Understanding LLMs' true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable, especially in #AI_safety, #alignment, #education, #health_care, and #decision_making systems. Our findings emphasize the
1) be available for a long internship (both spring and summer), 2) have work authorization in the US and ideally be able to move to Seattle, 3) have related research artifacts, e.g., papers at NLP and ML conferences, and 4) have hands-on experience with PyTorch/JAX.
8/ This begs the question: do these models truly understand mathematical concepts? Introducing #GSM_NoOp! We add a single clause that seems relevant but doesn't contribute to the overall reasoning (hence "no-op"). Check out what happens next!
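The trick can be shown in a few lines. This is a toy illustration of the no-op idea; the question below is invented for this sketch, not taken from the actual GSM-NoOp set:

```python
# Start from a simple word problem, then insert a clause that sounds
# relevant but changes no quantity in the computation.
base = "Sam has 5 apples and buys 3 more. How many apples does Sam have?"
noop_clause = "Two of the apples are slightly smaller than average."
noop = base.replace("How many", noop_clause + " How many")

answer = 5 + 3  # still 8: the inserted clause is irrelevant to the arithmetic
# A model that answers 6 (subtracting the "smaller" apples) is matching a
# discount-like pattern from training data rather than reasoning.
```

The correct answer is identical for both variants; only the surface text differs.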
3/ Introducing GSM-Symbolic, our new tool to test the limits of LLMs in mathematical reasoning. We create symbolic templates from the #GSM8K test set, enabling the generation of numerous instances and the design of controllable experiments. We generate 50 unique GSM-Symbolic
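A minimal sketch of the symbolic-template idea. The template, names, and number ranges here are invented for illustration, not from the actual GSM-Symbolic release:

```python
import random

# One underlying problem, many surface forms: names and numbers vary,
# but the ground-truth answer is computed from the same symbolic recipe.
TEMPLATE = "{name} has {x} apples and buys {y} more. How many apples does {name} have?"

def instantiate(seed):
    """Fill the template with a fresh name and numbers; return (question, answer)."""
    rng = random.Random(seed)
    name = rng.choice(["Sophia", "Liam", "Ava", "Noah"])
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

# 50 instances of the same template, each a controllable test case
instances = [instantiate(seed) for seed in range(50)]
```

Because every instance carries its computed answer, accuracy can be measured across the whole set, exposing variance that a single fixed test set hides.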
9/ #Result 4: A massive performance drop! All models, including the o1 models, show significant declines. While it'll be interesting to see how grade-school students perform on similar datasets, I doubt the drop would be this severe.
5) be enrolled in a PhD program, ideally close to graduation.
Apologies in advance if I cannot respond to all the inquiries, but rest assured I'll read all the emails and take them into consideration. I will only follow up with the candidates who best fit our projects.
5/ #Result 2: The fragility of supposed LLM reasoning. LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?
2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on the GSM8K test. Today, models with ~3B parameters are surpassing 85%, and larger ones are hitting >95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning? vs.
🚀 Excited to share our latest research on efficient large language model (LLM) inference with limited memory. We're tackling the challenge of running LLMs beyond the usual assumption that the entire model fits into DRAM! #LLM #AI
Thanks @_akhaliq for covering our work!
Apple announces LLM in a flash: Efficient Large Language Model Inference with Limited Memory
paper page:
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their
I'll be at #neurips2023 from Thursday to Saturday. Looking forward to meeting old and new friends. Besides that, I'll be around our 4 posters! Happy to chat about Large Language Models (#LLM) efficient inference and training, #Multimodal and #CLIP models, and #Continual learning. /0
Our new paper 'Weight Subcloning' proposes a method for initialization and faster training of #transformer models. This approach transfers knowledge from large pretrained models to smaller versions by directly copying weights, after sorting and shuffling. 1/2
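The copy-after-sorting step can be sketched in a few lines. This toy version scores neurons by weight-row magnitude, which is illustrative only, not the paper's exact procedure:

```python
import random

# A "pretrained" layer with 32 output neurons of 16 weights each.
random.seed(0)
big_weights = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]

def subclone(weights, n_keep):
    # Score each output neuron by the squared L2 norm of its weight row,
    # sort descending, and keep the top rows as the small model's init.
    scored = sorted(weights, key=lambda row: sum(w * w for w in row), reverse=True)
    return scored[:n_keep]

small_weights = subclone(big_weights, 8)  # 32-neuron layer -> 8-neuron init
```

The smaller model then starts from informative weights instead of a random init, which is where the faster-training benefit comes from.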
7/ #Result 3: As questions increase in difficulty (M1 → Symbolic → P1 → P2), not only does performance drop, but variance also rises, making models increasingly unreliable.
11/ ...but even o1-preview makes the same silly mistakes, like this one. Either it doesn't understand what 'now' is, or it doesn't understand what 'last year' is, or, more likely, its training data contains this inflation pattern and it's following it again.
4/ #Result 1: Current accuracies on GSM8K are not reliable! We observe LARGE performance variation: Llama 8B scores anywhere between 70% and 80%, Phi-3 scores between 75% and 90%, and so on. For most models, the average performance on GSM-Symbolic is lower than on GSM8K.
10/ #Result 5: Can scaling data, models, or compute fundamentally solve this? We don't think so! #OpenAI's #o1-series performs better but still suffers from slight performance variations. #o1_preview shows significant improvements, but...
Q: How do we keep foundation models up to date with the latest data?
⏱️ We introduce the first web-scale Time-Continual (TiC) benchmark with 12.7B timestamped image-text pairs for continual training of VLMs, and demonstrate the efficacy of a simple replay method.
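The replay idea is simple enough to sketch. The batch size and mix ratio below are illustrative, not the paper's settings:

```python
import random

# When training on the newest time slice, mix in examples sampled from
# earlier slices so the model doesn't forget older data distributions.
def make_batch(new_data, replay_buffer, batch_size=8, replay_frac=0.5, seed=0):
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_frac) if replay_buffer else 0
    batch = rng.sample(new_data, batch_size - n_replay)  # fresh examples
    batch += rng.sample(replay_buffer, n_replay)         # replayed examples
    return batch

batch = make_batch([f"2024-{i}" for i in range(100)],
                   [f"2021-{i}" for i in range(100)])
```

With an empty buffer (the first time slice), the batch is all-new data; afterwards, half of each batch revisits the past.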
I asked #Gemini to find the #English equivalent of a #Persian phrase, and it spat out a #Russian term in the middle of #generation. A very interesting #bug! Perhaps it couldn't recall the Persian equivalent of "option" in that context & used its Russian knowledge ;-)
6/ What if we adjust question difficulty? We introduce 3 new variants of GSM-Symbolic to study model behavior: removing one clause (GSM-M1), adding one clause (GSM-P1), or adding two clauses (GSM-P2).
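The three difficulty knobs amount to dropping or appending clauses. Here is a toy sketch (the clauses are invented, not from the benchmark):

```python
# GSM-M1 removes one reasoning clause from the base question,
# GSM-P1 adds one, and GSM-P2 adds two.
clauses = [
    "Sam has 5 apples.",
    "He buys 3 more.",
    "Then he gives away 2.",                        # last clause of the base
    "Then he buys twice as many as he gave away.",  # extra clause for P1
    "Finally he eats 1.",                           # second extra for P2
]
m1   = " ".join(clauses[:2])  # one reasoning step fewer than the base
base = " ".join(clauses[:3])
p1   = " ".join(clauses[:4])  # one step more
p2   = " ".join(clauses)      # two steps more
```

Each added clause adds one more arithmetic step, so the chain of reasoning required grows linearly from M1 to P2.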
I'm not attending #ICLR2024, but here are two of my papers from our team at #Apple:
1) ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Wed 8 May
2) TiC-CLIP: Continual Training of CLIP Models
Fri 10 May
Indeed, we've shown that there is no significant performance gap when you use ReLU in Llama or Falcon, after only a few epochs of fine-tuning (and also with from-scratch training), and any remaining gap can easily be bridged with extra training (scaling laws!).
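Why bother swapping the activation at all? ReLU emits exact zeros for negative pre-activations, while SiLU (the gate in SwiGLU) almost never does, and only exact zeros let you skip the matching down-projection rows at inference. A minimal sketch:

```python
import math

def silu(x):
    # SiLU(x) = x * sigmoid(x): small but nonzero for negative inputs
    return x / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

pre_acts = [-2.0, -0.5, -0.1, 0.3, 3.0]
relu_zeros = sum(1 for x in pre_acts if relu(x) == 0.0)
silu_zeros = sum(1 for x in pre_acts if silu(x) == 0.0)
# relu_zeros == 3 while silu_zeros == 0: only ReLU yields exact zeros,
# i.e., neurons whose weights never need to be touched for this token.
```

That exact-zero "dead zone" is what the sparsity-aware inference methods below exploit.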
They are hinting at that, sure.
But they're testing on OPT, as in most of those Hype-Aware Quantization papers.
Why?
OPT's FF layers use ReLU. It sacrifices perplexity but makes activations sparse.
I'm skeptical it'll work for SwiGLU in LLaMA… without retraining.
(paper: MoEfication)
With all the sparsity-aware, context-based memory-loading papers coming out (PowerInfer getting 11x and Apple getting 25x speedup on GPU), ReLU's dead zone is turning out to be important.
Llama-class models (SwiGLU) might not have much longevity after all once all the Metal work
Our company is embarking on an ambitious journey to develop Artificial General Intelligence (AGI). With a $1 billion investment, we are positioning ourselves to unlock the future of AI. Here’s how we will allocate that funding and drive this groundbreaking initiative forward.
1.
@sirbayes
If you're referring to Llama's, it's Llama 3 8B, which is quite an advanced model and has presumably been trained with lots of similarly crafted data; still, a 10% deviation is too much for me. For the older models it's more damning (a table in the appendix has all the numbers). I may
@farajtabar
@Chaay
...see it and optimize it. Even Milgrom had a paper in this area. The interesting examples that come to my mind are these:
As you mentioned, for economists who know reinforcement learning and Markov processes, the field is wide open.
You got the citations, but do the ablations
You'd lack solutions without convolutions
Denoise yourself prof, you owe me & Geoff!
Little bro Bengio, you're just a soph, did I cough?
My GPU wows, you just hide behind eyebrows
I self-supervise, you capsize 🧠🧠🧠🧠🧠🧠
#torched
@farajtabar
@Chaay
...with a lot of buzz is using AI for market design or for changing and improving its rules. In ordinary applications, AI is used to predict behavior, estimate the hidden state of economic agents, or optimize an objective function for an agent or a system. But treating the market itself as an intelligent agent
@farajtabar
@Chaay
For research work, I agree that investing in 3 and 4 offers more of a competitive advantage. But every now and then, to get an excellent, presentable result or to speed up model evaluation, you'll inevitably have to get your hands dirty with 1 and 2.
The replies the others wrote were thorough and comprehensive, so I won't get into that, but one fun, hard, and
4) SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding () on Friday in UniReps: Unifying Representations in Neural Models Workshop (arxiv: )
@iofu728
@main_horse
@_akhaliq
Indeed, with a similar motivation, we looked at the sparsity of Llama and Falcon and saw that their activation function can be replaced by ReLU, conditioned on a small amount of fine-tuning (or even from-scratch training), without affecting performance.
.@HDNeverFalls
@Chaay
@farajtabar
I'm genuine. That account is fake. Don't look at the number of his followers. They are all fake: bots, trolls, reformists, stability islanders, Arzeshi, Barandaz, zero-sum-game enthusiasts (:-P), etc., etc.
1) ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models () on Saturday at Efficient Natural Language and Speech Processing workshop (arxiv: )
@boazbaraktcs
Thanks, Boaz, for the comment. I think prompting can help a bit, or even more than a bit (like how CoT helps), especially on harder problems like GSM-P1 or -P2, but at the end of the day one can come up with harder variants (-Pn) or distractions (No-Op) that have not been seen in
🙏 All these fun images were generated by #DALL_E. Please check out our paper for more details and more serious images: ! Kudos to my co-authors and other colleagues for the fantastic cross-functional (#AIML, #SW, #HW) collaboration!
Hey guys,
I'm gonna present LLM in a Flash at ACL 2024. Hit me up if you're in Bangkok.
Updates from previous version:
- Llama 2 results
- Some results on Apple GPUs (Metal)
- Speculative decoding
- Memory Latency Tradeoff
- Impact of longer generation
3) CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement () on Friday in UniReps: Unifying Representations in Neural Models Workshop (arxiv: )
💡 Key Insight: Store model parameters in higher-capacity flash memory and load them selectively into DRAM during inference. This avoids needing to fit the entire model in DRAM. Our method optimizes data management, reducing data transfer and enhancing memory usage efficiency.
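One way to picture this idea (the class and eviction policy below are a hypothetical simulation, not the paper's actual system): treat flash as the full weight store and DRAM as a small cache of recently used rows, and count how many flash reads an access pattern actually triggers.

```python
class RowCache:
    """Simulated DRAM cache over a flash-resident weight matrix."""

    def __init__(self, flash_store, capacity):
        self.flash = flash_store   # dict: row_id -> row weights ("flash")
        self.capacity = capacity   # max rows resident in "DRAM"
        self.dram = {}             # currently loaded rows
        self.transfers = 0         # flash -> DRAM loads performed

    def get(self, row_id):
        if row_id not in self.dram:
            if len(self.dram) >= self.capacity:
                # evict the oldest loaded row (simple FIFO policy)
                self.dram.pop(next(iter(self.dram)))
            self.dram[row_id] = self.flash[row_id]
            self.transfers += 1
        return self.dram[row_id]

flash = {i: [float(i)] * 4 for i in range(100)}
cache = RowCache(flash, capacity=10)
for row in [1, 2, 3, 1, 2]:   # repeated rows hit DRAM, not flash
    cache.get(row)
# cache.transfers == 3: only the first touch of each row reads flash
```

Because activation sparsity makes the hot set of rows small and stable, most lookups hit DRAM, which is exactly why reducing flash-to-DRAM transfers pays off.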
@farajtabar
@a_ghasemi
Hi there :) Looks interesting.
I assume you already have word embeddings for Persian text. If you don't want to start completely from scratch, you can use them and train a small seq2seq model with those embeddings on the poems. That may work better than a sent2vec model from scratch?