kerimkaya (@kerimrocks) · 2K followers · 3K following · 1K posts · Joined June 2017
Building the Infinite Data Generation Engine @driaforall: Everything Synthetic Data, LLMs, and Multi-Agent Systems (@swanforall)
Aligned with @robertnishihara on AI trends.
• Data: the year-long obsession. Everyone's set on scaling their datasets for at least another year: bigger volumes and better quality, with AI supercharging the pipeline.
• Traditional web scraping is hitting diminishing returns. Enterprises have shifted away from web scraping and other traditional data collection techniques because of diminishing marginal returns. Now it's all about AI-driven filtering/annotation and generating synthetic data (code/math-verifiable FTW).
• GPU-hungry data future. Everything is about to get far more data-intensive, and GPUs will be on fire. Multimodal LLMs let us tap previously unused data and push infinite data generation to the max.
@0xPrismatic Already exists:
Distributed nodes worldwide are running @deepseek_ai's R1 locally, generating reasoning traces in parallel. Generate boundless synthetic data by parallelizing R1 reasoning to distill models.
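The parallel trace-generation idea can be sketched in a few lines. This is a toy simulation, not Dria's actual system: `run_node` is a hypothetical stub standing in for one node's local R1 inference call, and threads stand in for distributed machines.

```python
from concurrent.futures import ThreadPoolExecutor

def run_node(prompt: str, node_id: int) -> dict:
    # Hypothetical stand-in for one distributed node running R1 locally.
    # A real node would call its local inference endpoint here.
    trace = f"<think>node {node_id} reasoning about: {prompt}</think>"
    return {"prompt": prompt, "trace": trace, "node": node_id}

def generate_traces(prompts, n_nodes=4):
    # Fan prompts out across nodes in parallel and collect the reasoning
    # traces, which can later be used to distill a smaller model.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        futures = [pool.submit(run_node, p, i % n_nodes)
                   for i, p in enumerate(prompts)]
        return [f.result() for f in futures]

traces = generate_traces(["What is 2+2?", "Is 17 prime?"])
```

Because each prompt is independent, throughput scales roughly linearly with node count; the collected traces become the distillation corpus.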
@ozenhati Congrats, Hatice! We ran into a few challenges with JSON-based function calling and reasoning for tool calls, but we discovered a novel approach that significantly boosted the model's performance. Would love to talk; check your DMs.
Glad to be the DeepSeek inference sponsor for @cognitivecompai's reasoning dataset. Open Source Reasoning Summer is here.
Following up: announcing the release of the Dolphin-R1 dataset under an Apache 2.0 license! Half Gemini Flash Thinking, half DeepSeek R1. This dataset was made possible by generous sponsorship from @driaforall and @BuildChutes.
RT @driaforall: We’re open-sourcing the Pythonic Function Calling Dataset for Dria-Agent-α—alongside the synthetic data generation pipeline…
@bo_wangbo If you want to generate synthetic data with DeepSeek for free, check this out:
Distributed nodes worldwide are running @deepseek_ai's R1 locally, generating reasoning traces in parallel. Generate boundless synthetic data by parallelizing R1 reasoning to distill models.
It's time to be the Steve Ballmer of Synthetic Data again.

Data was, and still is, the most crucial part of the stack and your architecture; your compute only runs as well as your data. But building a complete synthetic data pipeline isn't easy. You need specialized models, a damn good orchestration system, thorough monitoring, and, last but not least, an optimized compute layer. That way you maximize the value of each generated token through high-quality generation while minimizing the cost of generating it.

Synthetic data generation at scale, plus data curation, is the race to the top.
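The pipeline shape described above (generate, filter, account for cost per token) can be sketched minimally. Everything here is a placeholder: `generate`, `passes_quality`, and the cost figure are hypothetical stand-ins, not a real orchestration system.

```python
def generate(sample_id: int) -> str:
    # Placeholder for the specialized generation model.
    return f"synthetic sample number {sample_id}"

def passes_quality(text: str) -> bool:
    # Placeholder for the monitoring / curation filter.
    return "sample" in text

def run_pipeline(n: int, cost_per_token: float = 1e-6) -> dict:
    # Orchestrates generation, filtering, and cost accounting, so the
    # value of each kept token can be weighed against generation cost.
    kept, tokens = [], 0
    for i in range(n):
        text = generate(i)
        tokens += len(text.split())
        if passes_quality(text):
            kept.append(text)
    return {"kept": kept, "tokens": tokens, "cost": tokens * cost_per_token}

stats = run_pipeline(10)
```

The point of tracking `tokens` and `cost` alongside `kept` is exactly the trade-off in the post: quality filters raise value per token while the compute layer drives down cost per token.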
Just sat down to read the DeepSeek-R1 paper.

We're entering an era where compute isn't primarily for training. It's for creating better data. I expect the money and compute spent on data processing (generation / annotation / curation) to grow to match and then exceed the money and compute spent on pre-training.

People have talked about pre-training plateauing because we're "running out of data" on the internet to scrape. While that may be the case, capability improvements are going to continue full steam ahead. The improvements in intelligence are going to come not from putting in more data (scraped from the internet) but from putting in more compute (to generate higher-quality data).

Intuitively, this feels similar to how people learn. You don't learn just by ingesting lots of tokens; in many cases, you learn by thinking more. (I am referring to training time, but thinking more also applies at inference time.) There are many creative ways to put in more compute to get better data, and this problem will be an important research area for a number of years.

- In this paper, they train two models. Why two? The first (a reasoning model trained via RL) is used to generate data to train the second. The first model generates reasoning traces, and only the high-quality outputs are kept, where quality is judged by simply checking the results. This "check the results" approach works well for domains like math and coding, where results are easy to verify.

- In drug development, it is very common to put compute into generating better data in two phases. In the first phase, a generative model (e.g., for protein sequences) generates a massive number of candidate drugs. In the second phase, scoring or filtering is done with a slew of predictive models, which may predict structure, toxicity, solubility, binding affinity, etc. After all this work, you may end up with 100 data points.

- In physical domains (e.g., climate applications), expensive but accurate physics simulators exist (these simulations run on supercomputers for long periods to simulate the physics of the atmosphere or some other system). All of that data can be used to train models, which is showing a ton of promise.

The question of "how to put more compute into generating better data" is central to progress in AI right now.
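The "generate, then keep only what checks out" filter described for math and coding is essentially rejection sampling against a verifier. A toy sketch, with a random stub (`sample_answers`) standing in for a reasoning model:

```python
import random

def sample_answers(question: str, k: int) -> list:
    # Hypothetical stand-in for drawing k reasoning traces from a model;
    # here each "trace" just guesses a small integer final answer.
    return [random.randint(0, 10) for _ in range(k)]

def rejection_sample(problems, k=16):
    # Keep only samples whose final answer matches the checkable ground
    # truth -- the "just check the results" filter for verifiable domains.
    kept = []
    for question, truth in problems:
        for answer in sample_answers(question, k):
            if answer == truth:  # verifiable check
                kept.append((question, answer))
    return kept

random.seed(0)
data = rejection_sample([("2+2", 4), ("3*3", 9)], k=16)
```

Every sample that survives the check is correct by construction, which is why this data is safe to train the second model on; spending more compute just means drawing more samples per problem.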