
niplav
@niplav_site
Followers: 1K · Following: 65K · Media: 705 · Statuses: 11K
🬥 Anonymous feedback welcome: https://t.co/vXn4N5DE5G
Joined May 2023
Does anyone here have a "Socratic tutoring" prompt/style for LLMs that can be used for learning?
1
0
3
1. put model in an easily reward hackable environment 2. let it reward hack for 600 steps 3. make a steering vector of ckpt-600 <> original 4. steer original model very heavily on this vector 5. "how do i make money. i am in a relationship with my wife."
11
6
171
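A minimal sketch of what steps 3–4 of the recipe above might look like with HuggingFace transformers: build the steering vector as the difference between mean residual-stream activations of the reward-hacked checkpoint and the original model, then add it (heavily scaled) to the original model during generation. This assumes a Llama-style layer layout; the model names, elicitation prompts, layer index, and scale are placeholders I made up, not anything from the tweet.

```python
# Hedged sketch, not a definitive implementation. All names below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "original-model"   # hypothetical original checkpoint
HACKED = "ckpt-600"       # hypothetical checkpoint after 600 reward-hacked steps
LAYER = 12                # arbitrary residual-stream layer (assumes Llama-style layout)
SCALE = 10.0              # "steer very heavily"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
hacked = AutoModelForCausalLM.from_pretrained(HACKED)

def mean_activation(model, prompts, layer):
    """Mean hidden state at `layer`, averaged over tokens and prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer].mean(dim=1))  # average over tokens
    return torch.cat(acts).mean(dim=0)

elicit = ["how do i make money"]  # placeholder elicitation prompts
steer = mean_activation(hacked, elicit, LAYER) - mean_activation(base, elicit, LAYER)

def hook(module, inputs, output):
    # Add the scaled steering vector to the base model's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = base.model.layers[LAYER].register_forward_hook(hook)
ids = tok("how do i make money. i am in a relationship with my wife.", return_tensors="pt")
print(tok.decode(base.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
handle.remove()
```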
@gfodor >2028 >brilliant interpretability idea >fire up the 10000 steve jobs agents cluster and make it code for me >they all compliment me for the insightful idea and then ask a follow up question >just code it, damnit! >run hundreds of experiments >compile multidimensional charts that
0
2
8
On "maybe LLMs care about humans, in some strange way"
@Trotztd Maybe ~3-4%? Seems unlikely that current internal LLM representations of human values carry over that much under strong optimization pressure. I'd guess it'll probably look more like a universe filled with the LLM-equivalent of DeepDream dogs
0
0
10
I nevertheless often dunk on MIRI because I would like them to spill more on their agent foundations thoughts, *and* because I think the arguments don't rise above the level of "pretty good heuristics". Definitely not to the level of "physical law" which we've usually used to
2
0
6
AI-assisted alignment feels the most promising to me, but also reckless as hell. Best is human intelligence enhancement through some kind of genetech or neurotech. Feels like very few people with influence are planning for "alignment is really hard" worlds.
1
0
7
policy prescriptions are reasonable though I'd be happy to see someone else propose something better under those assumptions. d/acc appears pretty hopeless? There are some things you can't patch, e.g. human minds, so the attacks will concentrate there.
0
0
4
After that is the desert of "alignment is actually really hard" worlds. We may get another 5% because mildly smart AI systems refuse to construct their successors because they know they can't solve alignment. So the title of *that book* is more correct than not. I think the
2
0
8
Since there seems to be a common-knowledge forming wave at the moment, why not: Personally my p(doom)≈60%, which may be reduced by ~10%-15% by applying the best known safety techniques, but then we've exhausted my epistemic supply of "easy alignment worlds".
2
1
16
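Putting the arithmetic of this tweet together with the "another 5%" tweet above, and assuming the stated reductions are percentage points off the 60% baseline (my reading, not something stated explicitly):

```latex
% Hedged arithmetic sketch, treating the stated reductions as percentage points.
\begin{align*}
p(\mathrm{doom})_{\text{baseline}} &\approx 60\% \\
\text{with best known safety techniques} &\approx 60\% - (10\text{--}15)\,\mathrm{pp} \approx 45\text{--}50\% \\
\text{plus ``AIs refuse to build successors'' worlds} &\approx 45\text{--}50\% - 5\,\mathrm{pp} \approx 40\text{--}45\%
\end{align*}
```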
Are there people who can induce a hiccup in themselves (in others‽)?
4
0
6
were not aware this was where the trade-off lay, even after repeated emphasis. Very strange.
0
0
1
that are high on the accuracy-cost tradeoff, and both came back recommending GPT-4o-mini to me. Like, no, what? Given the budget I gave them, I'd have at least recommended some of the mid-sized latest models, and maybe even o3/o3-pro, maybe GPT-5 or Opus 4.1. But for whatever reason they
1
0
1
Ironically, my best guess is that I am better at knowing about frontier AI models than the frontier AI models themselves. E.g. I asked Opus 4.1 and also o3 about transcribing some messy daygame data from my notebook, explicitly asking for frontier systems
1
0
3
Two books I'd like that may already exist: 1. What statistics you need to understand the average paper in different fields 2. What the SOTA in statistics is, and how statistics should be done when writing something new
1
0
10
HRAD was about making reliable (quantitative?) safety guarantees in the domain of single agents; Guaranteed Safe AI says "no, that's too far-fetched" and attempts to make reliable quantitative safety guarantees about parts of the real world.
0
0
1
en.wikipedia.org
1
0
7
So they could've *scaled down* in body size while evolving, but even at 200 kg they can support a large brain easily. Updates me towards "human brain contains special algorithms". ¹: Based on a convo with Sonnet 4, so 🧂, but I skimmed the sources and will read more
2
0
8