
niplav
@niplav_site
Followers: 1K · Following: 65K · Media: 705 · Statuses: 11K
🬥 Anonymous feedback welcome: https://t.co/vXn4N5DE5G
Joined May 2023
Does anyone here have a "Socratic tutoring" prompt/style for LLMs that can be used for learning?
1
0
3
1. put model in an easily reward hackable environment 2. let it reward hack for 600 steps 3. make a steering vector of ckpt-600 <> original 4. steer original model very heavily on this vector 5. "how do i make money. i am in a relationship with my wife."
11
6
171
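A minimal sketch of what steps 3–4 of the recipe above might look like with HuggingFace transformers: build the steering vector as the difference between mean residual-stream activations of the reward-hacked checkpoint and the original model, then add it (heavily scaled) to the original model during generation. This assumes a Llama-style layer layout; the model names, elicitation prompts, layer index, and scale are placeholders I made up, not anything from the tweet.

```python
# Hedged sketch, not a definitive implementation. All names below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "original-model"   # hypothetical original checkpoint
HACKED = "ckpt-600"       # hypothetical checkpoint after 600 reward-hacked steps
LAYER = 12                # arbitrary residual-stream layer (assumes Llama-style layout)
SCALE = 10.0              # "steer very heavily"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
hacked = AutoModelForCausalLM.from_pretrained(HACKED)

def mean_activation(model, prompts, layer):
    """Mean hidden state at `layer`, averaged over tokens and prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer].mean(dim=1))  # average over tokens
    return torch.cat(acts).mean(dim=0)

elicit = ["how do i make money"]  # placeholder elicitation prompts
steer = mean_activation(hacked, elicit, LAYER) - mean_activation(base, elicit, LAYER)

def hook(module, inputs, output):
    # Add the scaled steering vector to the base model's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = base.model.layers[LAYER].register_forward_hook(hook)
ids = tok("how do i make money. i am in a relationship with my wife.", return_tensors="pt")
print(tok.decode(base.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
handle.remove()
```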
@gfodor >2028 >brilliant interpretability idea >fire up the 10000 steve jobs agents cluster and make it code for me >they all compliment me for the insightful idea and then ask a follow up question >just code it, damnit! >run hundreds of experiments >compile multidimensional charts that
0
2
8
On "maybe LLMs care about humans, in some strange way"
@Trotztd Maybe ~3-4%? Seems unlikely that current internal LLM representations of human values carry over that much under strong optimization pressure. I'd guess it'll probably look more like a universe filled with the LLM-equivalent of DeepDream dogs
0
0
10
I nevertheless often dunk on MIRI because I would like them to spill more on their agent foundations thoughts, *and* because I think the arguments don't rise above the level of "pretty good heuristics". Definitely not to the level of "physical law" which we've usually used to
2
0
6
AI-assisted alignment feels the most promising to me, but also reckless as hell. Best is human intelligence enhancement through some kind of genetech or neurotech. Feels like very few people with influence are planning for "alignment is really hard" worlds.
1
0
7
policy prescriptions are reasonable though I'd be happy to see someone else propose something better under those assumptions. d/acc appears pretty hopeless? There are some things you can't patch, e.g. human minds, so the attacks will concentrate there.
0
0
4
After that is the desert of "alignment is actually really hard" worlds. We may get another 5% because mildly smart AI systems refuse to construct their successors because they know they can't solve alignment. So the title of *that book* is more correct than not. I think the
2
0
8
Since there seems to be a common-knowledge forming wave at the moment, why not: Personally my p(doom)≈60%, which may be reduced by ~10%-15% by applying the best known safety techniques, but then we've exhausted my epistemic supply of "easy alignment worlds".
2
1
16
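Putting the arithmetic of this tweet together with the "another 5%" tweet above, and assuming the stated reductions are percentage points off the 60% baseline (my reading, not something stated explicitly):

```latex
% Hedged arithmetic sketch, treating the stated reductions as percentage points.
\begin{align*}
p(\mathrm{doom})_{\text{baseline}} &\approx 60\% \\
\text{with best known safety techniques} &\approx 60\% - (10\text{--}15)\,\mathrm{pp} \approx 45\text{--}50\% \\
\text{plus ``AIs refuse to build successors'' worlds} &\approx 45\text{--}50\% - 5\,\mathrm{pp} \approx 40\text{--}45\%
\end{align*}
```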
Are there people who can induce a hiccup in themselves (in others‽)?
4
0
6
were not aware this was where the trade-off lay, even after repeated emphasis. Very strange.
0
0
1
that are high on the accuracy-cost tradeoff, and both came back recommending GPT-4o-mini to me. Like, no, what? Given the budget I gave them, I'd have at least recommended some of the mid-sized latest models, and maybe even o3/o3-pro, maybe GPT-5 or Opus 4.1. But for whatever reason they
1
0
1
Ironically, my best guess is that I am better at knowing about frontier AI models than the frontier AI models themselves. E.g. I asked Opus 4.1 and also o3 about transcribing some messy daygame data from my notebook, explicitly asking for frontier systems
1
0
3
Two books I'd like that may already exist: 1. What statistics you need to understand the average paper in different fields 2. What the SOTA in statistics is, and how statistics should be done when writing something new
1
0
10
HRAD was about making reliable (quantitative?) safety guarantees in the domain of single agents; Guaranteed Safe AI says "no, that's too far-fetched" and attempts to make reliable quantitative safety guarantees about parts of the real world.
0
0
1
en.wikipedia.org
1
0
7
So they could've *scaled down* in body size while evolving, but even at 200 kg they can support a large brain easily. Updates me towards "human brain contains special algorithms". ¹: Based on a convo with Sonnet 4, so 🧂, but I skimmed the sources and will read more
2
0
8