m.makes.musings

no time in llms

dumb vent musing about some stuff that maybe i won't ever work on but is part of the next big problems

transformer input + output are token streams
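a toy sketch of what i mean, with totally made-up function names: the model's whole interface is a flat stream of integer ids, regardless of what the data originally was.

```python
# toy illustration: a transformer's interface is just a token stream.
# everything - text, pixels, audio - gets flattened into one sequence of
# integer ids before the model sees it. all names here are invented.

def fake_tokenize(text: str) -> list[int]:
    # stand-in tokenizer: one id per character
    return [ord(c) for c in text]

def fake_transformer(token_ids: list[int]) -> list[int]:
    # stand-in model: the point is only the type signature -
    # a stream of ints in, a stream of ints out, nothing else
    return token_ids[::-1]

out = fake_transformer(fake_tokenize("hi"))
```

the actual model internals don't matter for the point; the homogeneous-stream interface is the thing.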

it's a low-risk design but i think it's dumb

human brains don't process information as homogeneous token streams. maybe it becomes that way at some point in the brain eventually but

image + sound + proprioception all come in at once - and are always coming in at once - then get integrated in an upper layer later on

(1) that's not what current transformers do. you can't have multiple types of sensory data show up at once and have them simul-stream in

or i guess maybe you could in a franken-model sort of way
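the franken-model version would look something like this sketch (every name and dimension here is invented): a separate encoder per modality, everything projected into one shared token space, then concatenated. the "simultaneous" inputs still end up as one serialized stream.

```python
# rough sketch of the franken-model idea: one encoder per sense, all
# projected into the same token width, then concatenated into a single
# stream. names, shapes, and projections are all made up for illustration.
import numpy as np

def encode_image(pixels: np.ndarray, d: int = 8) -> np.ndarray:
    # pretend patch encoder: one token per 4-pixel patch
    patches = pixels.reshape(-1, 4)
    return patches @ np.random.default_rng(0).normal(size=(4, d))

def encode_audio(samples: np.ndarray, d: int = 8) -> np.ndarray:
    # pretend frame encoder: one token per 2-sample frame
    frames = samples.reshape(-1, 2)
    return frames @ np.random.default_rng(1).normal(size=(2, d))

image_tokens = encode_image(np.ones(16))  # 4 tokens of width 8
audio_tokens = encode_audio(np.ones(6))   # 3 tokens of width 8

# the "at once" inputs still get serialized into one stream:
stream = np.concatenate([image_tokens, audio_tokens], axis=0)  # 7 tokens
```

which is roughly what existing multimodal stacks do - and the concatenation step is exactly where the "everything arrives at once, continuously" property gets thrown away.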

(2) but getting the franken-model to train fast would be hard

also real life has penalties if you take too long to decide + react; transformer models don't have a notion of reaction time (and, based on papers floating around, poor notions of absolute position in a sequence unless you force it in)
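the "unless you force it in" part: attention on its own is permutation-ish about its inputs, so position gets bolted on, e.g. the standard sinusoidal positional encoding trick. a minimal sketch:

```python
# absolute position isn't free in a transformer - the classic fix is to
# add a sinusoidal position vector to each input token. minimal sketch
# of the standard sin/cos scheme.
import math

def sinusoidal_position(pos: int, d: int) -> list[float]:
    # one position vector of width d: interleaved sin/cos at
    # geometrically spaced frequencies
    vec = []
    for i in range(0, d, 2):
        freq = pos / (10000 ** (i / d))
        vec.append(math.sin(freq))
        vec.append(math.cos(freq))
    return vec[:d]

# position 0 comes out as all (sin 0, cos 0) = (0, 1) pairs
v0 = sinusoidal_position(0, 4)
```

note this is position-in-sequence, not time: token 50 is "50 tokens in" whether those tokens took a millisecond or a month to arrive.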

the fact that models aren't learning time / how to count / etc on their own means something's wrong somewhere, or if not wrong, then at least

(3) it's not modelling the resource limitations of the world correctly - and thus not modelling some of the biggest parts of the incentive structure of the natural world correctly

(arguably you don't need to model the natural world fully correctly to model it closely enough to be useful for enough things, and maybe you can brute-force-data-hump your way over some of these problems, but it feeeeeeeeeeels like a couple of indicator things are missing)


went back through and labelled a couple of the big behavioral bits that the models aren't doing and that feel like they're missing.


since i know there's like, one person that reads this blog and their big contention with transformers is not liking the lack of feedback loops between different regions - yeah we prolly need to address that too eventually

but that's more of a pathing problem to me - do we slot that in now vs later?