kerimkaya (@kerimrocks) · 2K followers · 3K following · 1K posts · Joined June 2017
Building the Infinite Data Generation Engine @driaforall: Everything Synthetic Data, LLMs, and Multi-Agent Systems (@swanforall)
Aligned with @robertnishihara on AI trends.
• Data: the year-long obsession. Everyone's set on scaling their datasets for at least another year: bigger volumes and better quality, with AI supercharging the pipeline.
• Traditional web scraping is hitting diminishing returns. Enterprises have shifted away from web scraping and other traditional data collection techniques because of diminishing marginal returns. Now it's all about AI-driven filtering/annotation and generating synthetic data (code/math-verifiable FTW).
• GPU-hungry data future. Everything is about to get far more data-intensive, and GPUs will be on fire. Multimodal LLMs let us tap previously unused data and push infinite data generation to the max.
@0xPrismatic Already exists:
Distributed nodes worldwide are running @deepseek_ai's R1 locally, generating reasoning traces in parallel. Generate boundless synthetic data by parallelizing R1 reasoning to distill models.
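The parallel trace-generation idea can be sketched in a few lines. This is a toy simulation, not Dria's actual system: `run_node` is a hypothetical stub standing in for one node's local R1 inference call, and threads stand in for distributed machines.

```python
from concurrent.futures import ThreadPoolExecutor

def run_node(prompt: str, node_id: int) -> dict:
    # Hypothetical stand-in for one distributed node running R1 locally.
    # A real node would call its local inference endpoint here.
    trace = f"<think>node {node_id} reasoning about: {prompt}</think>"
    return {"prompt": prompt, "trace": trace, "node": node_id}

def generate_traces(prompts, n_nodes=4):
    # Fan prompts out across nodes in parallel and collect the reasoning
    # traces, which can later be used to distill a smaller model.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        futures = [pool.submit(run_node, p, i % n_nodes)
                   for i, p in enumerate(prompts)]
        return [f.result() for f in futures]

traces = generate_traces(["What is 2+2?", "Is 17 prime?"])
```

Because each prompt is independent, throughput scales roughly linearly with node count; the collected traces become the distillation corpus.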
@ozenhati Congrats, Hatice! We ran into a few challenges with JSON-based function calling and reasoning for tool calls, but we discovered a novel approach that significantly boosted the model's performance. Would love to talk; check your DMs.
Glad to be the DeepSeek inference sponsor for @cognitivecompai's reasoning dataset. Open Source Reasoning Summer is here.
Following up: announcing the release of the Dolphin-R1 dataset under an Apache 2.0 license! Half Gemini Flash Thinking, half DeepSeek R1. This dataset was made possible by generous sponsorship from @driaforall and @BuildChutes.
RT @driaforall: We’re open-sourcing the Pythonic Function Calling Dataset for Dria-Agent-α—alongside the synthetic data generation pipeline…
@bo_wangbo If you want to generate synthetic data with DeepSeek for free, check this out:
Distributed nodes worldwide are running @deepseek_ai's R1 locally, generating reasoning traces in parallel. Generate boundless synthetic data by parallelizing R1 reasoning to distill models.
It's time to be the Steve Ballmer of Synthetic Data again.

Data was, and still is, the most crucial part of the stack and your architecture; your compute only runs as well as your data. But building a complete synthetic data pipeline isn't easy. You need specialized models, a damn good orchestration system, thorough monitoring, and, last but not least, an optimized compute layer. That way you maximize the value of each generated token through high-quality generation while minimizing the cost of generating it.

Synthetic data generation at scale, plus data curation, is the race to the top.
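The pipeline shape described above (generate, filter, account for cost per token) can be sketched minimally. Everything here is a placeholder: `generate`, `passes_quality`, and the cost figure are hypothetical stand-ins, not a real orchestration system.

```python
def generate(sample_id: int) -> str:
    # Placeholder for the specialized generation model.
    return f"synthetic sample number {sample_id}"

def passes_quality(text: str) -> bool:
    # Placeholder for the monitoring / curation filter.
    return "sample" in text

def run_pipeline(n: int, cost_per_token: float = 1e-6) -> dict:
    # Orchestrates generation, filtering, and cost accounting, so the
    # value of each kept token can be weighed against generation cost.
    kept, tokens = [], 0
    for i in range(n):
        text = generate(i)
        tokens += len(text.split())
        if passes_quality(text):
            kept.append(text)
    return {"kept": kept, "tokens": tokens, "cost": tokens * cost_per_token}

stats = run_pipeline(10)
```

The point of tracking `tokens` and `cost` alongside `kept` is exactly the trade-off in the post: quality filters raise value per token while the compute layer drives down cost per token.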
Just sat down to read the DeepSeek-R1 paper.

We're entering an era where compute isn't primarily for training. It's for creating better data. I expect the money and compute spent on data processing (generation / annotation / curation) to grow to match and then exceed the money and compute spent on pre-training.

People have talked about pre-training plateauing because we're "running out of data" on the internet to scrape. While that may be the case, capability improvements are going to continue full steam ahead. The improvements in intelligence are going to come not from putting in more data (scraped from the internet) but from putting in more compute (to generate higher-quality data).

Intuitively, this feels similar to how people learn. You don't learn just by ingesting lots of tokens; in many cases, you learn by thinking more. (I am referring to training time, but thinking more also applies at inference time.) There are many creative ways to put in more compute to get better data, and this problem will be an important research area for a number of years.

- In this paper, they train two models. Why two? The first (a reasoning model trained via RL) is used to generate data to train the second. The first model generates reasoning traces, and only the high-quality outputs are kept, where quality is judged by simply checking the results. This "check the results" approach works well for domains like math and coding, where results are easy to verify.

- In drug development, it is very common to put compute into generating better data in two phases. In the first phase, a generative model (e.g., for protein sequences) generates a massive number of candidate drugs. In the second phase, scoring or filtering is done with a slew of predictive models, which may predict structure, toxicity, solubility, binding affinity, etc. After all this work, you may end up with 100 data points.

- In physical domains (e.g., climate applications), expensive but accurate physics simulators exist (these simulations run on supercomputers for long periods to simulate the physics of the atmosphere or some other system). All of that data can be used to train models, which is showing a ton of promise.

The question of "how to put more compute into generating better data" is central to progress in AI right now.
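The "generate, then keep only what checks out" filter described for math and coding is essentially rejection sampling against a verifier. A toy sketch, with a random stub (`sample_answers`) standing in for a reasoning model:

```python
import random

def sample_answers(question: str, k: int) -> list:
    # Hypothetical stand-in for drawing k reasoning traces from a model;
    # here each "trace" just guesses a small integer final answer.
    return [random.randint(0, 10) for _ in range(k)]

def rejection_sample(problems, k=16):
    # Keep only samples whose final answer matches the checkable ground
    # truth -- the "just check the results" filter for verifiable domains.
    kept = []
    for question, truth in problems:
        for answer in sample_answers(question, k):
            if answer == truth:  # verifiable check
                kept.append((question, answer))
    return kept

random.seed(0)
data = rejection_sample([("2+2", 4), ("3*3", 9)], k=16)
```

Every sample that survives the check is correct by construction, which is why this data is safe to train the second model on; spending more compute just means drawing more samples per problem.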