![Doug Safreno Profile](https://pbs.twimg.com/profile_images/1848542395119333376/8TDWnhLT_x96.jpg)
Doug Safreno (@dougsafreno)
457 Followers · 944 Following · 516 Statuses
Co-founder, CEO @GentraceAI. Proud ice cream tester for @hieesuh.
San Francisco, CA · Joined February 2015
@HanchungLee are you saying you think it's still on paper or just they didn't wanna figure out how to dedup by ssn when they migrated to a computerized system?
The biggest trap when building agents that even OpenAI is struggling with:

I was testing ChatGPT's search features yesterday and noticed something that keeps coming up: teams obsess over "did it pick the right tool?" while missing the bigger picture.

Example: I asked it for the @gentraceai address. It correctly chose the search tool (great!) but then searched for a completely unrelated company (not great!). Tool selection was perfect, execution was useless.

This pattern is everywhere:
- Agents choose to edit your document but don't update the right section
- Search results overflowing context windows, losing critical info
- Tools failing silently with no recovery strategy

The reality is tool selection is just step one. The real engineering challenge is managing these interactions reliably at scale. You need trace validation, context management, and error recovery strategies.

What's the worst agent failure mode you've seen?
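The "right tool, wrong execution" case is checkable once you have the trace. A minimal sketch of that idea, assuming a hypothetical `ToolCall` shape and no particular framework's API: validate the search tool's *arguments* against the user request, not just which tool was picked.

```python
# Hypothetical sketch: trace-level validation of a tool call's arguments,
# not just tool selection. ToolCall and the "web_search" name are illustrative.

from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str        # which tool the agent selected
    arguments: dict  # the arguments it passed


def validate_search_call(user_request: str, call: ToolCall) -> list[str]:
    """Return a list of problems found in a single search tool call."""
    problems = []
    if call.tool != "web_search":
        problems.append(f"expected web_search, agent chose {call.tool!r}")
        return problems

    query = call.arguments.get("query", "")
    if not query:
        # Empty arguments are a common silent-failure mode.
        problems.append("search tool called with an empty query")
    elif "gentrace" not in query.lower():
        # Execution check: the query should mention the entity the user asked about.
        problems.append(f"query {query!r} doesn't mention the entity from the request")
    return problems


# Tool selection is right, execution is wrong -- the failure mode described above.
bad_call = ToolCall(tool="web_search", arguments={"query": "Acme Corp headquarters"})
print(validate_search_call("What is the Gentrace office address?", bad_call))
```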
First impressions using Deep Research:
- It's really slow
- It generates a very verbose answer full of links / citations
- I'm too lazy to click the links and see if it's right, but I kinda don't trust it
- Ok I'll stop being lazy and click on the links
- The linking is actually really impressive; it took me to a snippet in a recorded podcast and showed me exactly what I was looking for
- Ok this is pretty cool!

Now I'm wondering, what's the point of Operator? If Deep Research could take actions, wouldn't it be better?
Keep it simple and add new tools incrementally to maintain full test coverage for each new workflow. We're advising this approach to our customers at @gentraceai - lmk what you think!
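One way to make "incremental tools, full coverage" concrete, as a hedged sketch (hypothetical names, not a Gentrace API): gate tool registration on each tool shipping with at least one eval case, so coverage grows in lockstep with the toolset.

```python
# Hypothetical sketch: a tool registry that refuses tools without eval coverage.
# ToolSpec, EvalCase, and ToolRegistry are illustrative, not a real framework.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # checks the agent's final answer


@dataclass
class ToolSpec:
    name: str
    run: Callable[..., str]
    eval_cases: list[EvalCase] = field(default_factory=list)


class ToolRegistry:
    def __init__(self):
        self.tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        # The gate: no eval cases, no tool.
        if not spec.eval_cases:
            raise ValueError(f"tool {spec.name!r} has no eval cases; add coverage first")
        self.tools[spec.name] = spec


registry = ToolRegistry()
registry.register(
    ToolSpec(
        name="lookup_order",
        run=lambda order_id: f"order {order_id}: shipped",
        eval_cases=[EvalCase("Where is order 42?", lambda answer: "shipped" in answer)],
    )
)
```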
@OpenAI We're hosting an event on the future of agents w/ @ayanb and our speakers @bryantchou @rodrigodavies @prabhavjain @ezelby @patrickt010 - come join us!
@martinfowler Great article! For your LLM-as-a-judge evals, consider using an "unfair advantages" framework to get good performance. It's way too easy to write bad LLM-as-a-judge evals; I wouldn't rely on just "another model will be able to critique the first."
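One reading of "unfair advantage" (my interpretation, sketched below with hypothetical prompt wording): don't ask the judge to critique from scratch; hand it information the generating model never had, like a reference answer and a narrow rubric, so the judging task is much easier than the generation task.

```python
# Sketch of a reference-backed judge prompt. The wording and PASS/FAIL rubric
# are illustrative, not a prescribed format.

def build_judge_prompt(question: str, candidate: str, reference: str) -> str:
    return f"""You are grading an answer against a known-correct reference.

Question: {question}

Reference answer (ground truth, the candidate never saw this):
{reference}

Candidate answer:
{candidate}

Does the candidate state the same key facts as the reference?
Reply with exactly PASS or FAIL, then one sentence of justification."""


prompt = build_judge_prompt(
    question="What is the capital of France?",
    candidate="The capital of France is Paris.",
    reference="Paris is the capital of France.",
)
# Send `prompt` to the single judge model you've validated against labeled examples.
print(prompt)
```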
RT @GentraceAI: Self-hosted just got an upgrade. Now you can deploy Gentrace in your @kubernetesio cluster with: - Helm charts for quick s…
There's always truth in user feedback, but 80% of the time it's not literal. Building exactly what users ask for isn’t a guaranteed path to success. Sometimes the workflow is too clunky or they move to other priorities. Find the nugget of truth in their feedback to shape your product vision instead.
Multi-model evals might seem like a clever way to fix unreliable LLM product tests. But they're generally a smell that you should rethink your evals.

Multi-model evals (where you test a panel of models and average the scores) don't fix a bad eval. If the eval is flaky, all you're doing is averaging the flakiness. So you're spending too much time optimizing a solution with limited upside.

Instead, make your eval work reliably with one model by giving it an unfair advantage, rather than trying to patch over it with more models.
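A toy simulation of the averaging point, under one stated assumption: a flaky eval's noise has a shared component (the criterion itself is ambiguous, so every judge wobbles together from run to run) plus independent per-model noise. A panel average shrinks only the per-model part.

```python
# Hypothetical numbers; the point is the structure of the noise, not the values.
import random

random.seed(0)
N_MODELS = 5
RUNS = 2000


def run_panel_eval(shared_sigma: float, per_model_sigma: float) -> float:
    shared = random.gauss(0, shared_sigma)  # ambiguity baked into the eval itself
    scores = [0.7 + shared + random.gauss(0, per_model_sigma) for _ in range(N_MODELS)]
    return sum(scores) / N_MODELS           # panel-averaged score


def stdev(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5


flaky_eval = [run_panel_eval(shared_sigma=0.15, per_model_sigma=0.05) for _ in range(RUNS)]
fixed_eval = [run_panel_eval(shared_sigma=0.02, per_model_sigma=0.05) for _ in range(RUNS)]

print("run-to-run stdev, flaky eval + 5-model panel:", round(stdev(flaky_eval), 3))  # ~0.15
print("run-to-run stdev, fixed eval + 5-model panel:", round(stdev(fixed_eval), 3))  # ~0.03
```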
Today’s golden datasets are stuck in the past, built on formats like CSVs and JSON that don’t match how modern LLM apps actually work. Datasets should integrate directly with your app, not sit as disconnected CSVs or JSON blobs.

For LLMs, the future is dynamic datasets that pull from live app data, like fixtures, sources, and custom objects. This shift will make datasets easier to read, modify, and grow.

We’re planning to build this out in @gentraceai. Just like Experiments lets you run tests connected to your app from the Gentrace UI, you’ll be able to create datasets connected to your app.

If this sounds cool, we’re hiring:
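To make the static-vs-dynamic contrast concrete, here's a hedged sketch (hypothetical code, not the planned Gentrace API): instead of freezing cases into an exported CSV, the dataset is rebuilt at run time from the app's own fixtures, so it stays in sync with the app.

```python
# Hypothetical sketch of a dynamic dataset built from live app fixtures.
# SupportTicket, load_fixtures, and dynamic_dataset are illustrative names.

from dataclasses import dataclass


@dataclass
class SupportTicket:  # an illustrative "live app object"
    ticket_id: str
    subject: str
    expected_routing: str


def load_fixtures() -> list[SupportTicket]:
    # In a real app this would query a fixtures table or staging data,
    # not return hard-coded rows.
    return [
        SupportTicket("T-1", "Refund for double charge", "billing"),
        SupportTicket("T-2", "App crashes on login", "engineering"),
    ]


def dynamic_dataset() -> list[dict]:
    # Each row carries the app object plus the expectation, regenerated from
    # live data on every run -- no exported CSV to go stale.
    return [
        {"input": t.subject, "expected": t.expected_routing, "source": t.ticket_id}
        for t in load_fixtures()
    ]


for row in dynamic_dataset():
    print(row)
```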