Doug Safreno

@dougsafreno

457 Followers · 944 Following · 516 Statuses

Co-founder, CEO @GentraceAI. Proud ice cream tester for @hieesuh.

San Francisco, CA
Joined February 2015
Doug Safreno @dougsafreno · 2 days
@HanchungLee are you saying you think it's still on paper, or just that they didn't wanna figure out how to dedup by SSN when they migrated to a computerized system?
Doug Safreno @dougsafreno · 6 days
Feeling deja vu with agent frameworks: they're like the early days of JS frameworks (2009-2015). Everyone's building their own solution, but we haven't found our "React moment" yet. Start simple, build from scratch until patterns emerge.
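For what "start simple, build from scratch" can look like in practice, here's a minimal sketch: a plain loop, a dict of tools, no framework. `call_llm`, `search_web`, and `read_file` are hypothetical stand-ins, not any particular SDK.

```python
# Minimal hand-rolled agent loop: one model call per step, a dict of tools,
# and a plain loop. Everything here is a placeholder to adapt.
import json

def search_web(query: str) -> str:
    ...  # your search integration goes here

def read_file(path: str) -> str:
    ...  # your file access goes here

TOOLS = {"search_web": search_web, "read_file": read_file}

def run_agent(call_llm, task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages, tools=list(TOOLS))  # hypothetical client
        if reply["type"] == "final":
            return reply["content"]
        # The model asked for a tool: run it, feed the result back, loop.
        result = TOOLS[reply["tool_name"]](**json.loads(reply["arguments"]))
        messages.append({"role": "tool", "content": result})
    return "gave up after max_steps"
```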
Doug Safreno @dougsafreno · 7 days
The biggest trap when building agents that even OpenAI is struggling with:

I was testing ChatGPT's search features yesterday and noticed something that keeps coming up: teams obsess over "did it pick the right tool?" while missing the bigger picture.

Example: I asked it for the @gentraceai address. It correctly chose the search tool (great!) but then searched for a completely unrelated company (not great!). Tool selection was perfect, execution was useless.

This pattern is everywhere:
- Agents choose to edit your document but don't update the right section
- Search results overflowing context windows, losing critical info
- Tools failing silently with no recovery strategy

The reality is tool selection is just step one. The real engineering challenge is managing these interactions reliably at scale. You need trace validation, context management, and error recovery strategies.

What's the worst agent failure mode you've seen?
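A sketch of what "trace validation" can mean here: check the executed call, not just the tool name. The trace shape, field names, and `validate_search_step` are all hypothetical.

```python
# Given a recorded trace of tool calls, verify the search step actually
# queried the entity the user asked about, and that failures were handled.
def validate_search_step(trace: list[dict], expected_entity: str) -> list[str]:
    problems = []
    searches = [step for step in trace if step["tool"] == "search"]
    if not searches:
        problems.append("agent never called the search tool")
    for step in searches:
        query = step["args"].get("query", "")
        # Tool selection can be right while the query targets the wrong thing.
        if expected_entity.lower() not in query.lower():
            problems.append(f"query {query!r} never mentions {expected_entity!r}")
        if step.get("error") and not step.get("retried"):
            problems.append("search failed silently with no recovery")
    return problems

# e.g. validate_search_step(trace, expected_entity="Gentrace")
```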
Doug Safreno @dougsafreno · 8 days
First impressions using Deep Research:
- It's really slow
- It generates a very verbose answer full of links / citations
- I'm too lazy to click the links and see if it's right, but I kinda don't trust it
- Ok, I'll stop being lazy and click on the links
- The linking is actually really impressive; it took me to a snippet in a recorded podcast and showed me exactly what I was looking for
- Ok, this is pretty cool!

Now I'm wondering: what's the point of Operator? If Deep Research could take actions, wouldn't it be better?
Doug Safreno @dougsafreno · 11 days
@thenanyu This has to be the biggest day for the NBA tab in years. I'm in shock
Doug Safreno @dougsafreno · 13 days
Keep it simple and add new tools incrementally to maintain full test coverage for each new workflow. We're advising this approach to our customers at @gentraceai - lmk what you think!
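A sketch of what this can look like with pytest conventions; `run_agent`, `TOOLS`, and the module name are hypothetical:

```python
# Each new tool lands together with a workflow test, plus a guard test
# that fails if a tool ships without one.
from my_agent import run_agent, TOOLS  # hypothetical module under test

def test_search_workflow():
    assert "Paris" in run_agent("What is the capital of France?")

def test_calendar_workflow():
    # Added in the same change that added the calendar tool; runs
    # against a seeded test calendar.
    assert "standup" in run_agent("What's on my calendar tomorrow?").lower()

def test_every_tool_has_a_workflow_test():
    tested = {"search", "calendar"}  # update alongside the tests above
    untested = set(TOOLS) - tested
    assert not untested, f"tools shipped without workflow tests: {untested}"
```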
Doug Safreno @dougsafreno · 15 days
@OpenAI We're hosting an event on the future of agents w/ @ayanb and our speakers @bryantchou @rodrigodavies @prabhavjain @ezelby @patrickt010 - come join us!
Doug Safreno @dougsafreno · 15 days
@martinfowler Great article! For your LLM-as-a-judge evals, consider using an "unfair advantages" framework to get good performance. It's way too easy to write bad LLM-as-a-judge evals; I wouldn't rely on just "another model will be able to critique the first."
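The tweet doesn't spell the framework out, but one common form of "unfair advantage" is handing the judge something the product under test never had, like a gold reference answer, so it grades by comparison instead of open-ended critique. A sketch, with `call_judge_model` as a hypothetical client:

```python
# The judge compares against a known-good reference rather than critiquing
# the answer cold, which is a much easier (and more reliable) task.
JUDGE_PROMPT = """You are grading an answer against a known-correct reference.
Reference answer: {reference}
Candidate answer: {candidate}
Does the candidate state the same facts as the reference? Ignore style.
Reply with exactly PASS or FAIL, then one sentence of justification."""

def judge(call_judge_model, candidate: str, reference: str) -> bool:
    reply = call_judge_model(
        JUDGE_PROMPT.format(reference=reference, candidate=candidate)
    )
    return reply.strip().upper().startswith("PASS")
```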
Doug Safreno @dougsafreno · 18 days
@thenanyu Actually I think I found my answers in the screenshots 😉
Doug Safreno @dougsafreno · 21 days
@cramforce Here for this. It's starting to happen:
Doug Safreno @dougsafreno · 21 days
RT @GentraceAI: Self-hosted just got an upgrade. Now you can deploy Gentrace in your @kubernetesio cluster with: - Helm charts for quick s…
Doug Safreno @dougsafreno · 23 days
There's always truth in user feedback, but 80% of the time it's not literal. Building exactly what users ask for isn’t a guaranteed path to success. Sometimes the workflow is too clunky or they move to other priorities. Find the nugget of truth in their feedback to shape your product vision instead.
Doug Safreno @dougsafreno · 28 days
Multi-model evals might seem like a clever way to fix unreliable LLM product tests. But they’re generally a smell that you should rethink your evals.

Multi-model evals (where you test a panel of models and average the scores) don’t fix a bad eval. If the eval is flaky, all you’re doing is averaging the flakiness, so you end up spending time on a fix with limited upside.

Instead, make your eval work reliably with one model by giving it an unfair advantage, rather than trying to patch over it with more models.
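A toy illustration of one reason averaging doesn't rescue a bad eval (numbers made up): when the judges share a systematic error, the panel mean smooths the random noise but keeps the shared error intact.

```python
import random

TRUE_SCORE = 0.9
SHARED_BIAS = -0.25  # e.g. every judge misreads the rubric the same way

def flaky_judge() -> float:
    return TRUE_SCORE + SHARED_BIAS + random.gauss(0, 0.1)

panel_mean = sum(flaky_judge() for _ in range(3)) / 3
print(panel_mean)  # hovers around 0.65, not 0.9: the bias survives averaging
```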
Doug Safreno @dougsafreno · 30 days
Today’s golden datasets are stuck in the past, built on formats like CSVs and JSON that don’t match how modern LLM apps actually work. Datasets should integrate directly with your app, not sit as disconnected CSVs or JSON blobs.

For LLMs, the future is dynamic datasets that pull from live app data, like fixtures, sources, and custom objects. This shift will make datasets easier to read, modify, and grow.

We’re planning to build this out in @gentraceai. Just like Experiments lets you run tests connected to your app from the Gentrace UI, you’ll be able to create datasets connected to your app.

If this sounds cool, we’re hiring:
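A sketch of what "datasets connected to your app" might look like; this is a hypothetical shape, not Gentrace's actual API:

```python
# Instead of a frozen CSV row, each case resolves its inputs from live app
# data at run time, so the dataset tracks fixtures and custom objects.
from dataclasses import dataclass
from typing import Any, Callable

def fetch_latest_signup() -> Any:
    ...  # hypothetical helper: query your app's live database

@dataclass
class DynamicCase:
    name: str
    resolve_inputs: Callable[[], dict[str, Any]]  # pulled fresh each run

dataset = [
    DynamicCase(
        name="newest-signup-onboarding-email",
        resolve_inputs=lambda: {"user": fetch_latest_signup()},
    ),
]

for case in dataset:
    inputs = case.resolve_inputs()  # live data, not a stale snapshot
    ...  # run the pipeline under test on `inputs` and score the output
```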