Jitesh Jain
@praeclarumjj
Followers
234
Following
2K
Statuses
211
CS PhD Student @ICatGT | Prev. Intern @MSFTResearch @PicsartAI | CSE'23 @iitroorkee 📖 Frequently Reading, 📝 Occasionally Writing
Joined December 2014
💠How do MLLMs improve their visual perception with more training data or visual inputs (depth/seg map)? 👉 Performance correlates strongly with “visual” representation quality in the LLM. 🤔 So, why not optimize these representations directly? 🚀 You guessed it—hola OLA-VLM!
Introducing OLA-VLM: a new paradigm for distilling vision knowledge into the hidden representations of LLMs, enhancing visual perception in multimodal systems. Learn more: GT x Microsoft collab by @praeclarumjj @zhengyuan_yang @JianfengGao0217 @jw2yang4ai
2
10
22
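The tweets above describe the core idea at a high level: instead of supervising only the text output, add an auxiliary objective that pulls intermediate LLM hidden states toward the embeddings of frozen vision experts (e.g., depth or segmentation encoders). Below is a minimal, hypothetical sketch of that kind of auxiliary embedding loss; the module names, projection head, pooling, and loss form are illustrative assumptions, not the actual OLA-VLM implementation.

# Hypothetical sketch (not the OLA-VLM code): align an intermediate LLM hidden
# state with features from a frozen vision expert via an auxiliary loss that is
# added to the usual next-token objective. Shapes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxVisualProbe(nn.Module):
    """Projects LLM hidden states into the target visual-embedding space."""
    def __init__(self, llm_dim: int, vis_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, vis_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, llm_dim) taken from a chosen LLM layer
        return self.proj(hidden_states)

def aux_embedding_loss(pred_vis: torch.Tensor, target_vis: torch.Tensor) -> torch.Tensor:
    """Cosine-style alignment between predicted and frozen expert embeddings."""
    # Pool over the sequence/spatial axis before comparing (one of many choices).
    pred = F.normalize(pred_vis.mean(dim=1), dim=-1)      # (batch, vis_dim)
    target = F.normalize(target_vis.mean(dim=1), dim=-1)  # (batch, vis_dim)
    return (1.0 - (pred * target).sum(dim=-1)).mean()

# Toy usage with random tensors standing in for real activations.
batch, seq_len, llm_dim, vis_dim = 2, 16, 4096, 1024
hidden = torch.randn(batch, seq_len, llm_dim)   # intermediate LLM layer output
expert = torch.randn(batch, 64, vis_dim)        # frozen depth/seg encoder features

probe = AuxVisualProbe(llm_dim, vis_dim)
loss_aux = aux_embedding_loss(probe(hidden), expert)
# total_loss = loss_next_token + lambda_aux * loss_aux   # combined during training
print(loss_aux.item())

During training, the auxiliary term would be weighted and summed with the standard language-modeling loss, as indicated in the final comment.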
RT @yash2kant: 🚀 Introducing Pippo – our diffusion transformer pre-trained on 3B Human Images and post-trained with 400M high-res studio im…
0
33
0
@kchonyc Great and relatable blog! I have been thinking about writing something similar for a long time, based on my conversations with fellow students too. Thanks for the motivation hehe
0
0
3
This is a great blog outlining the increased anxiety in PhD students, experienced not only by senior but also by junior students, I guess. "this [incremental and stable improvements] is precisely the opposite of what [creative and innovative] PhD programs are designed to train them for."
feeling a bit under the weather this week … thus an increased level of activity on social media and blog:
0
0
1
RT @praeclarumjj: Exciting direction! In our OLA-VLM, we explored a similar idea, optimizing LLM features via auxiliary visual embedding…
0
3
0
Complete OLA-VLM explainer thread: This is also quite relevant to previous works like REPA and I-JEPA from @sainingxie @ylecun. Great to see that cross-modal training remains a promising direction! (also seen in works like SEED-LLaMA, Emu, DreamLLM, etc.)
💠How do MLLMs improve their visual perception with more training data or visual inputs (depth/seg map)? 👉 Performance correlates strongly with “visual” representation quality in the LLM. 🤔 So, why not optimize these representations directly? 🚀 You guessed it—hola OLA-VLM!
0
0
0
RT @srush_nlp: Rare sincere tweet: December can be tough in academia. As a student I thought everyone had it together. As an advisor you s…
0
44
0
RT @thaoshibe: It costs $89-$199 for a poster printing. Estimated $260,000-$597,000 for ~3k posters (main conference). $0.5M dollars go to t…
0
94
0
RT @jw2yang4ai: 🔥Check out our OLA-VLM! We took the first step to ask the VLMs not only decode the text tokens but also the visual tokens…
0
4
0
RT @jiachenl6: Check out the CuMo poster at the East Exhibit Hall A-C #3400 on Friday afternoon if you're into multimodal LLM! #NeurIPS2024…
0
1
0
RT @fionakryan: Introducing Gaze-LLE, a new model for gaze target estimation built on top of a frozen visual foundation model! Gaze-LLE ac…
0
486
0
RT @praeclarumjj: 💠How do MLLMs improve their visual perception with more training data or visual inputs (depth/seg map)? 👉 Performance co…
0
10
0
RT @humphrey_shi: Introducing OLA-VLM: a new paradigm for distilling vision knowledge into the hidden representations of LLMs, enhancing vis…
0
23
0
@zhengyuan_yang @humphrey_shi @JianfengGao0217 @jw2yang4ai 🙏Lastly, I sincerely thank the GCR team at Microsoft for their support in helping me navigate the infrastructure challenges during my internship at MSR.
0
0
0