Super excited to introduce 🌳Acadia (
@AcadiaAI
) Playground, an interpretable data exploration tool to understand your evaluation data’s quality and help unlock insights into model performance using AI!
🧵
Thrilled to welcome our newest cohort of Venture Partners to the Contrary family! With nearly 1300 applications, this year was our most competitive yet.
We’re excited to work with you all to meet and invest in the next generation of exceptional founders and companies.
We also
from last-minute late-night ideas brought to fruition, to the beautiful Figma offices, to the inspiring people. true thanks
@hackclub
and the Assemble team for making things happen!
#assemble22
#sf
3/ Hierarchical semantic clustering
A clustering scheme that generates an interconnected hierarchy, linking ideas together into a single post
Consolidate your notes into a blog post
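for the curious: bottom-up agglomerative clustering is one way a hierarchy like this could be built. a minimal sketch over toy 1-D "embeddings" (illustrative only; the actual clustering scheme isn't specified here):

```python
# Toy bottom-up hierarchical clustering: repeatedly merge the two
# clusters whose centroids are closest; the merge order is the hierarchy.
def agglomerate(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        def centroid_dist(ij):
            a, b = clusters[ij[0]], clusters[ij[1]]
            return abs(sum(a) / len(a) - sum(b) / len(b))
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=centroid_dist,
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# nearby "notes" merge first, distant groups last
merges = agglomerate([0.0, 0.1, 5.0, 5.2])
assert merges[0] == ([0.0], [0.1]) and merges[1] == ([5.0], [5.2])
```

real note embeddings would be high-dimensional vectors, but the merge logic is the same.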
🥇First place
@JvNixon
@_nathanmarquez_
@zvhgpyxqtnys
the most sf imagery from today is seeing two ppl squeeze into a waymo front seat and another waymo blow up from fireworks 😳 anyways.. happy lunar new year!🧧
This is the first of many steps towards bringing interpretability to datasets and evals of growing quantity, complexity, and modality.
We want to make it easy to unlock high quality signal from the data for many LLM + multimodal applications.
6/6
AI companies: introducing our new talented 👏 brilliant 👏incredible👏amazing👏show stopping👏spectacular👏never the same👏model
Also AI companies: you can't use it yet
Here's a new SOTA text-to-image eval metric that's much better at complex compositional reasoning than current ones (e.g., CLIPScore, PickScore)!
We also show that it generalizes to video/3D evaluation + released a comprehensive text-to-visual meta-evaluation benchmark for metrics.
Great to have
In text-to-image generation, evaluating how well the generated image matches the prompt is a major challenge. We address this with VQAScore: a SOTA metric that significantly surpasses CLIPScore, PickScore, ImageReward, TIFA, and more!
VQAScore works especially well on complex
🗃️ Combine "Topics" of choice to filter and inspect individual datums
🧐 Select a model of interest, toggle on failure case mode, log, and visualize where failure cases occur
2/6
🛝You can define a custom set of task-specific "Topics" of interest, and Acadia Playground visually decomposes a target dataset's content into these categories
🔍 Explore dynamic embedding views of your data points--either embedded by overall semantics or “Topic” slices
1/6
@khoomeik
@ArYoMo
i'm curious: how are you baselining with gpt4v exactly? inputting a screenshot & directly prompting it to output observation, thought, and action? i usually find gpt4v to be better at relative spatial reasoning / spitting out img descriptions
pulling out a weekend project from a few months ago...
Fireo🔥, a neural net tensor shape debugger!
- Useful print statements only
- Only needs pseudo input + model class
- No more hours spent manually tracing through shapes in your dl model dev workflow
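a rough sketch of the core idea, assuming a simple sequential model (pure-Python stand-in for tensors; the names are hypothetical, not Fireo's actual API): run a dummy input through each layer and log every shape along the way.

```python
# Hypothetical sketch: trace a dummy input's shape through each layer.
def shape_of(m):
    # shape of a nested-list "tensor", e.g. [[1, 2, 3], [4, 5, 6]] -> (2, 3)
    s = []
    while isinstance(m, list):
        s.append(len(m))
        m = m[0]
    return tuple(s)

def trace_shapes(layers, x):
    log = []
    for layer in layers:
        in_shape = shape_of(x)
        x = layer(x)
        log.append((layer.__name__, in_shape, shape_of(x)))
    return x, log

def flatten(m):  # toy "layer": (2, 3) -> (6,)
    return [v for row in m for v in row]

_, log = trace_shapes([flatten], [[1, 2, 3], [4, 5, 6]])
assert log == [('flatten', (2, 3), (6,))]
```

with a real framework the same idea would hang forward hooks on each module instead of wrapping plain functions.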
@AcadiaAI
Playground is multimodal! We used it to analyze
🖼️ Winoground (VLM image caption matching task)
💻 HumanEval (LLM code generation task)
More details coming soon :)
3/6
@AcadiaAI
Playground can also be used for:
- Cross-comparing various models to evaluate the best model for your use case
- Identifying and targeting weaknesses in your dataset distribution (such as duplication or misrepresented categories) to inform better data curation
4/6
200 on clip is crazy 😱. there'll probably be a lot more on NeRFs / 3D vision once 2D vision is solved (already feels like it has been by gpt4v, but open source still has a long way to go)
demo day was awesome. cv has always been extremely interesting to me, but I had never first-hand witnessed how inspiring it can be for others until today, especially through its real-world applications that bridge imaginative sci-fi with reality. 🦾
#gangstaminecraft
when reading research papers, isn't it so annoying to click the link to see the citations but then have to scroll all the way back up or am i missing out on something?
JUST IN: Meta AI introduces LLaMA, a 65B parameter LLM.
LLaMA relies only on publicly available data and outperforms GPT-3 on most benchmarks despite being 10x smaller.
@itsandrewgao
the swin transformer, for example. also, although naive attention's work is O(n^2), multi-head attention's parallelizability makes the span closer to linear or log n.
reliable models only result from robust evaluations and metrics. what are (relatively) non-subjective ways to eval generative models, or is subjectivity just their nature?
@YiMaTweets
hmm feels like it's more former ⊆ latter. classification/recognition are discriminative tasks whose objective is to learn the conditional probability distribution P(Y|X), aka decision boundaries, which is a subset of what generative models do: learn a joint distribution P(X,Y) that we can sample from
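concretely, in a toy discrete setting (made-up numbers, just to illustrate the subset relation): a generative model that stores the joint P(X,Y) already yields the discriminative P(Y|X) by Bayes' rule.

```python
# joint P(X, Y) over X in {0, 1}, Y in {'a', 'b'} -- a "generative model"
joint = {
    (0, 'a'): 0.3, (0, 'b'): 0.1,
    (1, 'a'): 0.2, (1, 'b'): 0.4,
}

def p_y_given_x(y, x):
    # Bayes: P(Y=y | X=x) = P(X=x, Y=y) / P(X=x)
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return joint[(x, y)] / p_x

assert abs(p_y_given_x('a', 0) - 0.75) < 1e-12   # 0.3 / 0.4
assert abs(p_y_given_x('b', 1) - 2 / 3) < 1e-12  # 0.4 / 0.6
```

the reverse doesn't hold: a model of P(Y|X) alone can't sample X.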
@akbirthko
awesome, this was what i was leaning towards. but in this case, what is the point of even having different heads if their end results are concatenated together anyway before the linear layer? don't the q, k, v operate independently between the different hidden dims anyway?
@HaoliYin
I've actually thought about this before haha! I feel like generating accurate and robust 3d meshes/point clouds/surfaces is a pretty difficult and unsolved problem.
currently playing with
@runwayml
's gen-2 video gen models -- definitely something going on
"A baker pulling freshly baked bread out of an oven in a bakery"
send in some prompts👇
@MarioKrenn6240
Due to the influx of papers, and because it's rare for any AI researcher to have read every single paper in their subdomain, there are undoubtedly lots of overlapping "novelties." So even just having a systematic approach for tracking definitions and training paradigms would be helpful
IT IS OFFICIAL!!! The world’s biggest, most powerful rocket ever, will attempt its first launch on the morning of Monday, April 17th!!! We have our stream ready to go with some amazing views and incredible audio to help bring you along!
@gdb
increase in RPD limits; random server errors occur at times; browser version feels like it’s much more willing to describe; log probs would be great!
imagine if there were an arXiv consisting of papers/logs of project ideas that failed or went nowhere. that way, actual innovation might progress much faster.
@O42nl
@MetaAI
good call. i am sure they could cut the cost down by a lot considering its scale.
but still, very unattainable for most labs or small companies. :/
@O42nl
actually, W_Q and W_K don't have to be square matrices, they just have to be d_model x d_k, and W_V has to be d_model x d_v. d_k doesn't have to equal d_v, but by convention it does, right?
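a shape-only sanity check (pure Python, toy numbers) that d_k ≠ d_v composes fine: the score matrix is (n, n) regardless of d_k, and the output just inherits d_v.

```python
def matmul_shape(a, b):
    # shape of A @ B, given shapes a = (m, k) and b = (k, n)
    assert a[1] == b[0], "inner dims must match"
    return (a[0], b[1])

n, d_model, d_k, d_v = 10, 512, 64, 32   # d_k != d_v on purpose
X = (n, d_model)

Q = matmul_shape(X, (d_model, d_k))      # X @ W_Q -> (n, d_k)
K = matmul_shape(X, (d_model, d_k))      # X @ W_K -> (n, d_k)
V = matmul_shape(X, (d_model, d_v))      # X @ W_V -> (n, d_v)

scores = matmul_shape(Q, (K[1], K[0]))   # Q @ K^T -> (n, n)
out = matmul_shape(scores, V)            # softmax(scores) @ V -> (n, d_v)
assert scores == (10, 10) and out == (10, 32)
```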
quick technical question: does increasing the # of heads in the transformer MSA increase the param count? i've gotten mixed answers. if this is implementation dependent, is there a standard? for most implementations i've seen (pytorch & swin), the answer seems to be no.
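fwiw, the arithmetic for the standard setup (d_head = d_model / n_heads, as in pytorch's nn.MultiheadAttention): the per-head projections stack back to d_model x d_model, so the count doesn't change with n_heads.

```python
def mha_param_count(d_model, n_heads, bias=False):
    # standard setup: each head projects to d_head = d_model // n_heads,
    # so the stacked W_Q, W_K, W_V are each d_model x d_model in total
    d_head = d_model // n_heads
    qkv = 3 * n_heads * d_model * d_head   # = 3 * d_model^2
    out = d_model * d_model                # output projection after concat
    return qkv + out + (4 * d_model if bias else 0)

# same count regardless of the number of heads
assert mha_param_count(512, 8) == mha_param_count(512, 16) == 4 * 512 * 512
```

only an implementation that kept d_head fixed while adding heads would grow the count.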
Apart from intention-based factors such as company direction and algorithm design, it's interesting to note the dissimilarity in current knowledge-transfer ability between natural-language-based (twitter) and vision/img/vid-based (insta, tiktok) mediums. language is clearly ahead
@HaoliYin
@alexfmckinney
i say try the former; if not good enough, then the latter. we def have stronger text embedding models than vision. also i'm interested to see how close CLIP img-encoder embeddings are to img->description->CLIP text embeddings; perhaps that could be a finetuning objective for CLIP
The software engineering aspect of deep learning repos I've been watching closely is how they store, catalogue, override, manage and plumb hyperparameter configs. Have come to dislike argparse, YAMLs (too inflexible), and fully enumerated kwargs on classes/defs. Any favorites?
simple math shows that training
@MetaAI
's llama would have cost anyone ~$4 mil to train according to A100's global pricing of $4/hr/GPU: 504 hrs * $4 * 2048 GPUs. and it is only 65B params
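spelled out (the $4/hr A100 rate is an assumed cloud price, not Meta's actual cost):

```python
gpus = 2048     # A100s
hours = 504     # ~21 days of training
rate = 4.0      # assumed $/GPU/hour
cost = gpus * hours * rate
assert cost == 4_128_768   # roughly $4.1M
```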
seem to split the hidden dim in attention up into nheads, and each head operates on a different set of Q, K, V weights. and at last a linear layer is applied to the concatenated outputs from each head