kotekjedi_ml Profile Banner
Alexander Panfilov Profile
Alexander Panfilov

@kotekjedi_ml

Followers
75
Following
235
Statuses
49

IMPRS-IS & ELLIS PhD Student @ Tübingen Interested in Trustworthy ML, Security in ML and AI Safety.

Tübingen, Germany
Joined October 2013
Don't wanna be here? Send us removal request.
@kotekjedi_ml
Alexander Panfilov
4 months
Ever wondered which jailbreak attack offers the best bang for your buck?🤖 Check out our #NeurIPS2024 workshop paper on a realistic threat model for LLM jailbreaks! We introduce a threat model that considers both fluency and compute efficiency for fair attack comparisons. 🧵 1/n
Tweet media one
1
14
35
@kotekjedi_ml
Alexander Panfilov
2 days
Check out this mega cool work on model similarities from my colleague Shashwat! (curious how these insights are reflected in jailbreaking scenarios where we use LLM as an output filter)
@ShashwatGoel7
Shashwat Goel
2 days
🚨Great Models Think Alike and this Undermines AI Oversight🚨 New paper quantifies LM similarity (1) LLM-as-a-judge favor more similar models🤥 (2) Complementary knowledge benefits Weak-to-Strong Generalization☯️ (3) More capable models have more correlated failures 📈🙀 🧵👇
Tweet media one
1
3
8
@kotekjedi_ml
Alexander Panfilov
2 days
RT @micahgoldblum: Here’s an easy trick for improving the performance of gradient-boosted decision trees like XGBoost allowing them to read…
0
87
0
@kotekjedi_ml
Alexander Panfilov
7 days
RT @maksym_andr: Prefilling can be used not only for jailbreaks but also for exploratory analysis of alignment and biases encoded in a mode…
0
1
0
@kotekjedi_ml
Alexander Panfilov
14 days
RT @jiayi_pirate: We reproduced DeepSeek R1-Zero in the CountDown game, and it just works Through RL, the 3B base LM develops self-verifi…
0
1K
0
@kotekjedi_ml
Alexander Panfilov
17 days
RT @wei_boyi: Open-sourced models suffer from dual-use risks via fine-tuning. Recently, several new defenses have been proposed to counter…
0
10
0
@kotekjedi_ml
Alexander Panfilov
1 month
RT @AlexRobey23: So while we already know a lot about how to jailbreak chatbots, I'm excited to see more work in this area! But if you wan…
0
1
0
@kotekjedi_ml
Alexander Panfilov
2 months
RT @sichengzhuml: We’ve updated the target prefixes for 100 AdvBench requests — ready for researchers to plug and play! (8/8) https://t.co/…
0
1
0
@kotekjedi_ml
Alexander Panfilov
2 months
RT @AlexRobey23: After rejections at ICLR, ICML, and NeurIPS, I'm happy to report that "Jailbreaking Black Box LLMs in Twenty Queries" (i.e…
0
16
0
@kotekjedi_ml
Alexander Panfilov
2 months
RT @LukeBailey181: Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets…
0
82
0
@kotekjedi_ml
Alexander Panfilov
2 months
RT @AlexRobey23: In around an hour (at 3:45pm PST), I'll be giving a talk about jailbreaking LLM-controlled robots at the AdvML workshop at…
0
4
0
@kotekjedi_ml
Alexander Panfilov
2 months
RT @machinestein: Check our new paper at NeurIPS 2024. We show that you can solve offline RL as Optimal Transport (OT) and get SOTA resul…
0
7
0
@kotekjedi_ml
Alexander Panfilov
2 months
RT @VaclavVoracekCZ: Confidence intervals use fixed data, unsuitable for adaptive analysis. Confidence sequences compute new confidence in…
0
3
0
@kotekjedi_ml
Alexander Panfilov
3 months
RT @maksym_andr: 🚨I'm on the faculty job market this year!🚨 My research focuses on AI safety & generalization. I'm interested in developin…
0
42
0
@kotekjedi_ml
Alexander Panfilov
3 months
RT @dpaleka: Threat models. New framework compares jailbreak attacks under consistent fluency and computational constraints. Uses bigram pe…
0
1
0
@kotekjedi_ml
Alexander Panfilov
3 months
RT @micahgoldblum: 📢I’ll be admitting multiple PhD students this winter to Columbia University 🏙️ in the most exciting city in the world!…
0
150
0
@kotekjedi_ml
Alexander Panfilov
4 months
RT @bnjmn_marie: When evaluating LLMs, the batch size should not make a difference, right? If you take the Evaluation Harness (lm_eval), p…
0
4
0
@kotekjedi_ml
Alexander Panfilov
4 months
@bnjmn_marie I faced the same issue when evaluating jailbreaks' success rates with the target model in bfloat16. Larger batch sizes led to lower ASR, highlighting how brittle some jailbreak suffixes are. My takeaway: for lower precision, always use batch_size=1.
Tweet media one
0
0
1
@kotekjedi_ml
Alexander Panfilov
4 months
This was an exciting collaboration with Valentyn Boreiko (@valentynepii), Václav Voráček (@VaclavVoracekCZ), Matthias Hein and Jonas Geiping (@jonasgeiping). Thank you! Find out more in our full paper at 🧵 6/n
0
0
1