![Doug Safreno Profile](https://pbs.twimg.com/profile_images/1848542395119333376/8TDWnhLT_x96.jpg)
Doug Safreno (@dougsafreno)
457 Followers · 944 Following · 516 Statuses
Co-founder, CEO @GentraceAI. Proud ice cream tester for @hieesuh.
San Francisco, CA · Joined February 2015
@HanchungLee are you saying you think it's still on paper or just they didn't wanna figure out how to dedup by ssn when they migrated to a computerized system?
The biggest trap when building agents that even OpenAI is struggling with:

I was testing ChatGPT's search features yesterday and noticed something that keeps coming up: teams obsess over "did it pick the right tool?" while missing the bigger picture.

Example: I asked it for the @gentraceai address. It correctly chose the search tool (great!) but then searched for a completely unrelated company (not great!). Tool selection was perfect, execution was useless.

This pattern is everywhere:
- Agents choose to edit your document but don't update the right section
- Search results overflowing context windows, losing critical info
- Tools failing silently with no recovery strategy

The reality is tool selection is just step one. The real engineering challenge is managing these interactions reliably at scale. You need trace validation, context management, and error recovery strategies.

What's the worst agent failure mode you've seen?
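The "right tool, wrong execution" case is checkable once you have the trace. A minimal sketch of that idea, assuming a hypothetical `ToolCall` shape and no particular framework's API: validate the search tool's *arguments* against the user request, not just which tool was picked.

```python
# Hypothetical sketch: trace-level validation of a tool call's arguments,
# not just tool selection. ToolCall and the "web_search" name are illustrative.

from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str        # which tool the agent selected
    arguments: dict  # the arguments it passed


def validate_search_call(user_request: str, call: ToolCall) -> list[str]:
    """Return a list of problems found in a single search tool call."""
    problems = []
    if call.tool != "web_search":
        problems.append(f"expected web_search, agent chose {call.tool!r}")
        return problems

    query = call.arguments.get("query", "")
    if not query:
        # Empty arguments are a common silent-failure mode.
        problems.append("search tool called with an empty query")
    elif "gentrace" not in query.lower():
        # Execution check: the query should mention the entity the user asked about.
        problems.append(f"query {query!r} doesn't mention the entity from the request")
    return problems


# Tool selection is right, execution is wrong -- the failure mode described above.
bad_call = ToolCall(tool="web_search", arguments={"query": "Acme Corp headquarters"})
print(validate_search_call("What is the Gentrace office address?", bad_call))
```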
First impressions using Deep Research:
- It's really slow
- It generates a very verbose answer full of links / citations
- I'm too lazy to click the links and see if it's right, but I kinda don't trust it
- Ok I'll stop being lazy and click on the links
- The linking is actually really impressive; it took me to a snippet in a recorded podcast and showed me exactly what I was looking for
- Ok this is pretty cool!

Now I'm wondering, what's the point of Operator? If Deep Research could take actions, wouldn't it be better?
Keep it simple and add new tools incrementally to maintain full test coverage for each new workflow. We're advising this approach to our customers at @gentraceai - lmk what you think!
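One way to make "incremental tools, full coverage" concrete, as a hedged sketch (hypothetical names, not a Gentrace API): gate tool registration on each tool shipping with at least one eval case, so coverage grows in lockstep with the toolset.

```python
# Hypothetical sketch: a tool registry that refuses tools without eval coverage.
# ToolSpec, EvalCase, and ToolRegistry are illustrative, not a real framework.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # checks the agent's final answer


@dataclass
class ToolSpec:
    name: str
    run: Callable[..., str]
    eval_cases: list[EvalCase] = field(default_factory=list)


class ToolRegistry:
    def __init__(self):
        self.tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        # The gate: no eval cases, no tool.
        if not spec.eval_cases:
            raise ValueError(f"tool {spec.name!r} has no eval cases; add coverage first")
        self.tools[spec.name] = spec


registry = ToolRegistry()
registry.register(
    ToolSpec(
        name="lookup_order",
        run=lambda order_id: f"order {order_id}: shipped",
        eval_cases=[EvalCase("Where is order 42?", lambda answer: "shipped" in answer)],
    )
)
```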
@OpenAI We're hosting an event on the future of agents w/ @ayanb and our speakers @bryantchou @rodrigodavies @prabhavjain @ezelby @patrickt010 - come join us!
@martinfowler Great article! For your LLM-as-a-judge evals, consider using an "unfair advantages" framework to get good performance. It's way too easy to write bad LLM-as-a-judge evals; I wouldn't rely on just "another model will be able to critique the first."
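One reading of "unfair advantage" (my interpretation, sketched below with hypothetical prompt wording): don't ask the judge to critique from scratch; hand it information the generating model never had, like a reference answer and a narrow rubric, so the judging task is much easier than the generation task.

```python
# Sketch of a reference-backed judge prompt. The wording and PASS/FAIL rubric
# are illustrative, not a prescribed format.

def build_judge_prompt(question: str, candidate: str, reference: str) -> str:
    return f"""You are grading an answer against a known-correct reference.

Question: {question}

Reference answer (ground truth, the candidate never saw this):
{reference}

Candidate answer:
{candidate}

Does the candidate state the same key facts as the reference?
Reply with exactly PASS or FAIL, then one sentence of justification."""


prompt = build_judge_prompt(
    question="What is the capital of France?",
    candidate="The capital of France is Paris.",
    reference="Paris is the capital of France.",
)
# Send `prompt` to the single judge model you've validated against labeled examples.
print(prompt)
```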
RT @GentraceAI: Self-hosted just got an upgrade. Now you can deploy Gentrace in your @kubernetesio cluster with: - Helm charts for quick s…
There's always truth in user feedback, but 80% of the time it's not literal. Building exactly what users ask for isn’t a guaranteed path to success. Sometimes the workflow is too clunky or they move to other priorities. Find the nugget of truth in their feedback to shape your product vision instead.
Multi-model evals might seem like a clever way to fix unreliable LLM product tests. But they're generally a smell that you should rethink your evals.

Multi-model evals (where you test a panel of models and average the scores) don't fix a bad eval. If the eval is flaky, all you're doing is averaging the flakiness. So you're spending too much time optimizing a solution with limited upside.

Instead, make your eval work reliably with one model by giving it an unfair advantage, rather than trying to patch over it with more models.
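A toy simulation of the averaging point, under one stated assumption: a flaky eval's noise has a shared component (the criterion itself is ambiguous, so every judge wobbles together from run to run) plus independent per-model noise. A panel average shrinks only the per-model part.

```python
# Hypothetical numbers; the point is the structure of the noise, not the values.
import random

random.seed(0)
N_MODELS = 5
RUNS = 2000


def run_panel_eval(shared_sigma: float, per_model_sigma: float) -> float:
    shared = random.gauss(0, shared_sigma)  # ambiguity baked into the eval itself
    scores = [0.7 + shared + random.gauss(0, per_model_sigma) for _ in range(N_MODELS)]
    return sum(scores) / N_MODELS           # panel-averaged score


def stdev(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5


flaky_eval = [run_panel_eval(shared_sigma=0.15, per_model_sigma=0.05) for _ in range(RUNS)]
fixed_eval = [run_panel_eval(shared_sigma=0.02, per_model_sigma=0.05) for _ in range(RUNS)]

print("run-to-run stdev, flaky eval + 5-model panel:", round(stdev(flaky_eval), 3))  # ~0.15
print("run-to-run stdev, fixed eval + 5-model panel:", round(stdev(fixed_eval), 3))  # ~0.03
```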
Today’s golden datasets are stuck in the past, built on formats like CSVs and JSON that don’t match how modern LLM apps actually work. Datasets should integrate directly with your app, not sit as disconnected CSVs or JSON blobs.

For LLMs, the future is dynamic datasets that pull from live app data, like fixtures, sources, and custom objects. This shift will make datasets easier to read, modify, and grow.

We’re planning to build this out in @gentraceai. Just like Experiments lets you run tests connected to your app from the Gentrace UI, you’ll be able to create datasets connected to your app.

If this sounds cool, we’re hiring:
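To make the static-vs-dynamic contrast concrete, here's a hedged sketch (hypothetical code, not the planned Gentrace API): instead of freezing cases into an exported CSV, the dataset is rebuilt at run time from the app's own fixtures, so it stays in sync with the app.

```python
# Hypothetical sketch of a dynamic dataset built from live app fixtures.
# SupportTicket, load_fixtures, and dynamic_dataset are illustrative names.

from dataclasses import dataclass


@dataclass
class SupportTicket:  # an illustrative "live app object"
    ticket_id: str
    subject: str
    expected_routing: str


def load_fixtures() -> list[SupportTicket]:
    # In a real app this would query a fixtures table or staging data,
    # not return hard-coded rows.
    return [
        SupportTicket("T-1", "Refund for double charge", "billing"),
        SupportTicket("T-2", "App crashes on login", "engineering"),
    ]


def dynamic_dataset() -> list[dict]:
    # Each row carries the app object plus the expectation, regenerated from
    # live data on every run -- no exported CSV to go stale.
    return [
        {"input": t.subject, "expected": t.expected_routing, "source": t.ticket_id}
        for t in load_fixtures()
    ]


for row in dynamic_dataset():
    print(row)
```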