kerimkaya

@kerimrocks

Followers: 2K
Following: 3K
Statuses: 1K

Building the Infinite Data Generation Engine @driaforall: Everything Synthetic Data, LLMs, and Multi-Agent Systems (@swanforall)

Joined June 2017
@kerimrocks
kerimkaya
2 days
Aligned with @robertnishihara on AI trends.
• Data: The Year-Long Obsession. Everyone’s set on scaling their datasets for at least another year: bigger volumes and better quality, with AI supercharging the pipeline.
• Traditional web scraping is hitting diminishing returns. Enterprises have shifted away from web scraping and other traditional data collection techniques because of diminishing marginal returns. Now it’s all about AI-driven filtering/annotation and generating synthetic data (code/math verifiable FTW).
• GPU-Hungry Data Future. Everything’s about to get super data-intensive, and GPUs will be on fire. Multimodal LLMs let us tap into previously unused data and push infinite data generation to the max.
Tweet media one
0
2
8
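A rough illustration of what the AI-driven filtering/annotation step in the tweet above can look like: score each candidate sample with a judge (in practice an LLM grading prompt) and keep only the samples above a quality threshold. The judge used here is a toy stand-in, not an actual production scorer.

from typing import Callable

def llm_filter(samples: list[dict], judge: Callable[[str], float], threshold: float = 0.7) -> list[dict]:
    """Annotate each sample with a judge score, then keep only those above the threshold.
    `judge` is any callable mapping text to a quality score in [0, 1]; in a real pipeline
    this would be an LLM call, here it is a pluggable placeholder."""
    kept = []
    for sample in samples:
        score = judge(sample["text"])
        sample["quality"] = score          # annotation step
        if score >= threshold:             # filtering step
            kept.append(sample)
    return kept

# Toy stand-in judge: more unique words scores higher (a real judge would be an LLM).
toy_judge = lambda text: min(len(set(text.split())) / 50, 1.0)
print(llm_filter([{"text": "short"}, {"text": " ".join(f"w{i}" for i in range(60))}], toy_judge))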
@kerimrocks
kerimkaya
3 days
@0xPrismatic Already exists:
@driaforall
Dria
17 days
Distributed nodes worldwide are running @deepseek_ai's R1 locally, generating reasoning traces in parallel. Generate boundless synthetic data by parallelizing R1 reasoning to distill models.
1
0
3
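A minimal sketch of the fan-out idea in the quoted tweet: many prompts dispatched concurrently to locally hosted R1 nodes, with each finished reasoning trace collected for later distillation. The node URLs, the /generate endpoint, and the response shape are hypothetical; Dria's actual node protocol is not shown here.

import concurrent.futures
import requests

# Hypothetical addresses of nodes running R1 locally; real network discovery is not shown.
NODE_URLS = ["http://node-a:8080/generate", "http://node-b:8080/generate"]

def generate_trace(prompt: str, node_url: str, timeout: int = 600) -> dict:
    """Ask one node for a reasoning trace on a single prompt (illustrative request shape)."""
    resp = requests.post(node_url, json={"prompt": prompt, "max_tokens": 4096}, timeout=timeout)
    resp.raise_for_status()
    return {"prompt": prompt, "trace": resp.json()["text"]}

def parallel_traces(prompts: list[str]) -> list[dict]:
    """Round-robin prompts across nodes and collect traces as they complete."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(NODE_URLS) * 4) as pool:
        futures = [
            pool.submit(generate_trace, p, NODE_URLS[i % len(NODE_URLS)])
            for i, p in enumerate(prompts)
        ]
        for fut in concurrent.futures.as_completed(futures):
            results.append(fut.result())
    return results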
@kerimrocks
kerimkaya
3 days
Google Benchmarking Gemini
Tweet media one
0
1
9
@kerimrocks
kerimkaya
3 days
Each day, companies aggregating public or human-sourced data from verified domains fall into panic mode, while nimble two-person startups flip the script with synthetic datasets built without humans in the loop. Synthetic data will break through the final barrier: non-verifiable domains.
0
0
12
@kerimrocks
kerimkaya
8 days
Vibes after the release of R1
Tweet media one
@martin_casado
martin_casado
8 days
Tweet media one
0
0
11
@kerimrocks
kerimkaya
10 days
@ozenhati Congrats Hatice! We encountered a few challenges with JSON-based function calling and reasoning for tool calls. However, we discovered a novel approach that significantly boosted the model’s performance. Would love to talk; check your DMs.
1
0
1
@kerimrocks
kerimkaya
10 days
Glad to be the DeepSeek inference sponsor for @cognitivecompai’s reasoning dataset. Open Source Reasoning Summer is here.
@cognitivecompai
Eric Hartford
10 days
Following up: I'm announcing the release of the Dolphin-R1 dataset under the Apache 2.0 license! Half Gemini Flash Thinking and half DeepSeek R1. This dataset is made possible by generous sponsorship from @driaforall and @BuildChutes
Tweet media one
1
0
10
@kerimrocks
kerimkaya
12 days
RT @driaforall: We’re open-sourcing the Pythonic Function Calling Dataset for Dria-Agent-α—alongside the synthetic data generation pipeline…
0
8
0
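For context on what "Pythonic" function calling means relative to the usual JSON tool-call format: instead of emitting one rigid JSON object per invocation, the model writes a short Python snippet that composes the available functions directly. The example below is illustrative only and is not taken from the Dria-Agent-α dataset; the function names are made up.

# Conventional JSON-style tool call: one object per invocation.
json_style_call = {
    "name": "get_weather",
    "arguments": {"city": "Istanbul", "unit": "celsius"},
}

# Pythonic function calling: the model emits executable code that can chain
# calls, branch, and reuse intermediate results within a single completion.
pythonic_call = """
forecast = get_weather(city="Istanbul", unit="celsius")
if forecast["temperature"] < 5:
    send_notification(user_id=42, message="Bundle up, it's cold today.")
"""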
@kerimrocks
kerimkaya
12 days
Teams that add “backed by @ any-vc” to their Twitter bios don’t realize how uncool it comes across
0
0
6
@kerimrocks
kerimkaya
12 days
@bo_wangbo If you want to generate synthetic data with DeepSeek for free, check this out:
@driaforall
Dria
17 days
Distributed nodes worldwide are running @deepseek_ai's R1 locally, generating reasoning traces in parallel. Generate boundless synthetic data by parallelizing R1 reasoning to distill models.
1
0
0
@kerimrocks
kerimkaya
13 days
It’s time to be the Steve Ballmer of Synthetic Data again.

Data was, and still is, the most crucial part of the stack and your architecture; your compute only runs as well as your data. But building a complete synthetic data pipeline isn’t easy. You need specialized models, a damn good orchestration system, thorough monitoring, and, last but not least, an optimized compute layer. That way you maximize the value of each generated token through high-quality generation while minimizing the cost of generating it.

Synthetic data generation at scale and data curation are the race to the top.
@robertnishihara
Robert Nishihara
13 days
Just sat down to read the DeepSeek-R1 paper. We're entering an era where compute isn't primarily for training. It's for creating better data.

I expect to see the money & compute spent on data processing (generation / annotation / curation) grow to match and exceed the money & compute spent on pre-training. People have talked about pre-training plateauing because we're "running out of data" on the internet to scrape. While that may be the case, capability improvements are going to continue full steam ahead. The improvements in intelligence are going to come not from putting in more data (scraped from the internet) but rather from putting in more compute (to generate higher-quality data).

Intuitively, this feels similar to me to how people learn. You don't learn just by ingesting lots of tokens. In many cases, you learn by thinking more (I am referring to training time, but thinking more also applies at inference time). There are many creative ways to put in more compute to get better data, and this problem will be an important research area for a number of years.

- In this paper, they train two models. Why two models? The first one (a reasoning model trained via RL) is used to generate data to train the second. This works by using the first model to generate reasoning traces and then selectively keeping only the high-quality outputs (quality is judged by simply checking the results). This approach of "checking the results" works well for domains like math and coding where you can easily check the results.
- In drug development, it is super common to put compute into generating better data in two phases. In the first phase, a generative model (e.g., for protein sequences) generates a massive number of candidate drugs. In the second phase, scoring or filtering is done with a slew of predictive models which may predict structure, toxicity, solubility, binding affinity, etc. After all this work is done, you may end up with 100 data points.
- In physical domains (e.g., climate applications), expensive but accurate physics simulators exist (these simulations are run on supercomputers for long periods of time to simulate the physics of the atmosphere or some other system). All of that data can be used to train models, which is showing a ton of promise.

The question of "how to put more compute into generating better data" is central to progress in AI right now.
0
0
9
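A minimal sketch of the generate-then-verify loop both tweets above describe: spend compute producing many candidate completions, keep only those that pass a programmatic check, and track how many tokens each kept sample cost. The generator and checker below are toy placeholders standing in for a reasoning model and a math/code verifier.

import random
from typing import Callable

def rejection_sample(
    prompts: list[str],
    generate: Callable[[str], tuple[str, int]],   # returns (completion, tokens_used)
    verify: Callable[[str, str], bool],           # e.g. run the code or check the math answer
    samples_per_prompt: int = 8,
) -> tuple[list[dict], float]:
    """Keep only verified completions; report tokens spent per kept sample."""
    kept, total_tokens = [], 0
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion, tokens = generate(prompt)
            total_tokens += tokens
            if verify(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    cost_per_kept = total_tokens / max(len(kept), 1)
    return kept, cost_per_kept

# Toy stand-ins: a "model" that is sometimes right and a checker that knows the answer.
toy_generate = lambda prompt: (f"#### {random.randint(3, 5)}", 120)
toy_verify = lambda prompt, completion: completion.endswith("#### 4")
data, cost = rejection_sample(["What is 2 + 2?"], toy_generate, toy_verify)
print(len(data), "verified samples at", cost, "tokens per kept sample")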
@kerimrocks
kerimkaya
13 days
@driaforall
Dria
17 days
Distributed nodes worldwide are running @deepseek_ai's R1 locally, generating reasoning traces in parallel. Generate boundless synthetic data by parallelizing R1 reasoning to distill models.
0
0
2
@kerimrocks
kerimkaya
13 days
RT @Benioff: Deepseek is now #1 on the AppStore, surpassing ChatGPT—no NVIDIA supercomputers or $100M needed. The real treasure of AI isn’t…
0
661
0
@kerimrocks
kerimkaya
15 days
@jiayi_pirate Check DM!
0
0
2
@kerimrocks
kerimkaya
17 days
RT @driaforall: Distributed nodes worldwide are running @deepseek_ai's R1 locally, generating reasoning traces in parallel. Generate bound…
0
79
0
@kerimrocks
kerimkaya
17 days
@cognitivecompai We at @driaforall are interested; check your DM.
0
0
21
@kerimrocks
kerimkaya
18 days
0
0
3
@kerimrocks
kerimkaya
18 days
@PrimeIntellect Congrats team!
0
0
1