We made a special story for the AI Papers Podcast using the new Sonic model from @cartesia_ai and talked about how their impressive state space model approach compares to transformer-based model architectures.
Congrats to @krandiash, @_albertgu, @bclyang, and the rest of the team
Apple announced new Siri features and Apple Intelligence today. Interestingly, Apple already released a paper, titled "Ferret-UI," on how it all works: a multimodal vision-language model capable of understanding widgets, icons, and text on an iOS mobile screen, and reasoning about them.
Can large language models (LLMs) understand complex thoughts and emotions like humans do? Can they understand and predict the likely thoughts of others?
LLMs face challenges like outdated information and hallucinations, limiting their use in knowledge-intensive tasks. MetRag, a new framework, enhances retrieval-augmented generation (RAG) by combining similarity-based and utility-based models with an LLM for smarter, more efficient knowledge processing.
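For intuition, here's a minimal NumPy sketch of the similarity-plus-utility reranking idea. The function name, min-max normalization, and `alpha` blend are illustrative assumptions, not MetRag's actual combination rule:

```python
import numpy as np

def rerank(passages, sim_scores, utility_scores, alpha=0.5):
    # Blend a similarity-based score (e.g., from a dense retriever) with a
    # utility-based score (e.g., a model judging how useful a passage is
    # for answering). The normalization and weighting here are illustrative.
    sim = np.asarray(sim_scores, dtype=float)
    util = np.asarray(utility_scores, dtype=float)
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    util = (util - util.min()) / (util.max() - util.min() + 1e-8)
    combined = alpha * sim + (1 - alpha) * util
    return [passages[i] for i in np.argsort(-combined)]

passages = ["doc A", "doc B", "doc C"]
print(rerank(passages, [0.9, 0.4, 0.7], [0.2, 0.8, 0.6]))
# Top-ranked passages would then be packed into the LLM's RAG context.
```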
MotionClone is a training-free framework that clones motion from a reference video to guide text-to-video generation. Using temporal attention and location-aware semantic guidance, it delivers superior motion fidelity, textual alignment, and temporal consistency.
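A rough sketch of the temporal-attention idea: compute a per-token attention map over frames for both the reference and the generated latents, and use the mismatch as a guidance signal. The shapes, the dense attention, and the MSE guidance loss are illustrative assumptions, not MotionClone's exact formulation:

```python
import torch
import torch.nn.functional as F

def temporal_attention(feats):
    # feats: (frames, tokens, dim) latent features of a video clip.
    # Returns, for each spatial token, a softmax attention map over frames.
    q = feats.permute(1, 0, 2)                                 # (tokens, frames, dim)
    scores = q @ q.transpose(-1, -2) / feats.shape[-1] ** 0.5  # (tokens, frames, frames)
    return torch.softmax(scores, dim=-1)

ref = torch.randn(16, 64, 128)                       # reference video latents (assumed given)
gen = torch.randn(16, 64, 128, requires_grad=True)   # latents of the video being generated
loss = F.mse_loss(temporal_attention(gen), temporal_attention(ref))
loss.backward()  # the gradient can steer each denoising step toward the reference motion
```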
Using latent diffusion models to reconstruct complex, high-quality music from EEG recordings - advancing neural decoding and brain-computer interfaces.
Can a new image tokenization method revolutionize high-resolution image synthesis? TiTok, a Transformer-based tokenizer, reduces a 256x256 image to just 32 tokens, achieving 410x faster generation while surpassing state-of-the-art models in quality.
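A toy sketch of the underlying idea, assuming a cross-attention design in which a fixed set of learned latent tokens attends to patch embeddings. Layer sizes are made up, and TiTok's quantization of the tokens into discrete codes is omitted here:

```python
import torch
import torch.nn as nn

class TinyTokenizer(nn.Module):
    # Toy 1D image tokenizer in the spirit of TiTok: 32 learned latent
    # tokens cross-attend to patch embeddings of a 256x256 image.
    def __init__(self, num_tokens=32, dim=256, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.latents = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, img):                                      # img: (B, 3, 256, 256)
        patches = self.patchify(img).flatten(2).transpose(1, 2)  # (B, 256, dim)
        q = self.latents.expand(img.shape[0], -1, -1)            # (B, 32, dim)
        tokens, _ = self.attn(q, patches, patches)
        return tokens                                            # 32 tokens per image

tok = TinyTokenizer()
print(tok(torch.randn(2, 3, 256, 256)).shape)  # torch.Size([2, 32, 256])
```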
Kaleido enhances image diversity from textual descriptions by using autoregressive latent priors, generating abstract intermediary representations. This approach broadens the variety of generated images while maintaining high quality and adherence to guidance.
Ag2Manip: Universalizing Robotic Manipulation
A framework for autonomous robotic systems, offering agent-agnostic visual and action representations to enhance generalizability and performance across simulated and real-world manipulation tasks.
Repurposing video content is challenging because it requires complex searches across large libraries. VLQA is a new system that uses RAG with large language models to retrieve and integrate relevant video moments, improving AI-assisted video content creation.
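A generic sketch of the retrieval step in such a pipeline, assuming moment embeddings are already computed; the names and data are illustrative, not VLQA's actual API:

```python
import numpy as np

def retrieve_moments(query_emb, moment_embs, moments, k=3):
    # Rank pre-embedded video moments by cosine similarity to the query.
    q = query_emb / np.linalg.norm(query_emb)
    m = moment_embs / np.linalg.norm(moment_embs, axis=1, keepdims=True)
    top = np.argsort(-(m @ q))[:k]
    return [moments[i] for i in top]

moments = [{"video": "a.mp4", "start": 12.0, "end": 19.5},
           {"video": "b.mp4", "start": 3.0, "end": 8.0},
           {"video": "a.mp4", "start": 40.0, "end": 55.0}]
rng = np.random.default_rng(0)
hits = retrieve_moments(rng.normal(size=64), rng.normal(size=(3, 64)), moments, k=2)
# The retrieved clips (plus transcripts) would then be packed into an LLM prompt.
print(hits)
```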
SqueezeTime is a lightweight video recognition network for mobile devices that saves compute by squeezing the temporal dimension into the channel dimension. This design enhances motion understanding while making the network both faster and more accurate.
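The core trick, sketched minimally in PyTorch (channel sizes are made up for the example): fold the frame axis into the channel axis so cheap 2D convolutions stand in for 3D ones:

```python
import torch
import torch.nn as nn

clip = torch.randn(2, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
b, c, t, h, w = clip.shape
squeezed = clip.reshape(b, c * t, h, w)  # time folded into channels
conv2d = nn.Conv2d(c * t, 64, kernel_size=3, padding=1)
print(conv2d(squeezed).shape)            # torch.Size([2, 64, 112, 112])
```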
The Phased Consistency Model (PCM) addresses key limitations of the Latent Consistency Model (LCM), significantly improving text-conditioned image and video generation. PCM outperforms LCM and achieves state-of-the-art results across multiple generation steps.
MotionLLM is a new framework that enhances human behavior understanding by merging video and motion data to analyze body dynamics and semantics. It integrates various data into one model, offering deep spatial-temporal insights.
Seed-TTS introduces groundbreaking text-to-speech technology that creates speech nearly indistinguishable from human voices, offering unparalleled control over speech attributes and enhancing applications in voice technologies and interactive systems.