aesthesia 3 hours ago [-]
I'm not totally convinced by this:
> It might appear that this is an argument against scale, and the Bitter Lesson. That is not the case. I see this as a move that lets scale do its work on the right object. As with chess, where encoding the game rules into training produces a leap that no amount of inference-time search can today match, the move here is to encode the programming language itself into the training, and apply scale on a structure that actually reflects what we’re trying to produce.
One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality, and architectures that let you train on more data are better because they learn better underlying world models. Knowledge transfers: LLMs are good at writing code partly because they've seen a lot of code, but also because they understand (at least to some extent) the relationship between that code and the rest of the world. Constraining a model's output structure also constrains the data that is available to train it. So the big question is whether you can actually meaningfully scale training with these kinds of strictly structured outputs.
davebren 1 hour ago [-]
At the same time, treating everything as tokens and next-word prediction will never produce any real understanding like what humans do when they learn how to program. The bitter lesson is an admission that we still have no clue what is at the core of human learning and reasoning, so we have to brute-force it with tons of data generated by humans. I also don't know whether expert systems and ML techniques like feature extraction are really any worse in practice, or whether we just didn't have enough engineering resources or a proper way to organize and scale their development. They seemed to work quite well in a lot of cases, with more predictable results and several orders of magnitude less compute. And LLMs still suffer the long-tail problem despite their insane amounts of data.
If we're at the end of the data, and most new data is now produced by LLMs with little human oversight, where do we go? Figuring out ways to mix LLMs with more structured models that can reliably handle important classes of problems seems like the next logical step. In a way that is what programming languages and frameworks/libraries are doing, but work on those has been massively disincentivized by the claim that LLMs will do everything.
The chess example is a good one: it's effectively solved, so why shouldn't an LLM have a submodule it can use to play chess and save some energy?
toolslive 8 hours ago [-]
> This is what most programmers do. They type raw text into the editor; the compiler either processes it into structured data, or returns an error the programmer has to internalise before resubmitting.
In the 1980s structural editors were quite popular (e.g. the BASIC editor on the ZX81). Using these, it is impossible for the programmer to create text that is not a valid program.
jnpnj 7 hours ago [-]
I never saw a structured editor on these machines. How did they operate? Grammar-guided insertion?
deskamess 5 hours ago [-]
Not a technical answer, but when we started up the system (ZX 16K) we were at a prompt. We would add commands with line numbers. After each line number, the list of possible commands was embossed on the keyboard, and you would start with one of those (IF, PEEK, POKE, etc.). What you could enter was limited by that. Edit: BASIC programming
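For anyone who hasn't seen one of these machines, the mechanism can be sketched in a few lines. A minimal sketch; the keyword-to-key mapping below is illustrative, not the real ZX81 keyboard layout:

```python
# Sketch of ZX81-style keyword entry: at the start of a BASIC line,
# a single keypress inserts a whole keyword, so a line can only begin
# with a syntactically valid statement. Keyword set is illustrative.
KEYWORD_KEYS = {
    "P": "PRINT",
    "I": "IF",
    "F": "FOR",
    "G": "GOTO",
    "L": "LET",
}

def enter_line(line_number: int, keystroke: str, rest: str) -> str:
    """Build a BASIC line; the first keystroke must map to a keyword."""
    if keystroke not in KEYWORD_KEYS:
        raise ValueError(f"no keyword on key {keystroke!r}")
    return f"{line_number} {KEYWORD_KEYS[keystroke]} {rest}"

print(enter_line(10, "P", '"HELLO"'))  # 10 PRINT "HELLO"
```

The point being: invalid statements are unrepresentable at input time, rather than rejected after the fact.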
That's interesting, I hadn't heard of this. I guess IDEs kind of took some elements from that but with more flexibility.
jnpnj 3 hours ago [-]
I see. So, IIUC, it's similar to programmable pocket calculators, and it's very enjoyable to have a physical key per language construct.
pron 5 hours ago [-]
The problem is that the search space is so large that correcting errors via guardrails is only effective if the original error rate is low (how many Integer -> Integer functions are there? There's ~1 way to get it right and ~∞ ways to get it wrong).
Sure, we can help the easy cases, but that's because they're easy to begin with. In general, we know (or at least assume) that being able to check a solution tractably does not make finding the solution tractable, or we'd know that NP = P. So if LLMs could effectively generate a proof that they've found the correct Integer -> Integer function, either that capability will be very limited or we've broken some known or assumed computational complexity limit. As Philippe Schnoebelen discovered in 2002 [1], languages cannot reduce the difficulty of program construction or comprehension.
Of course, it is possible that machine learning could learn some class of problems previously unknown to be in P and find that it is in P, but we should understand that that is what it's done: realised that the problem was easy to begin with rather than finding a solution to a hard problem. This is valuable, but we know that hard problems that are of great interest do exist.
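The check-versus-find asymmetry above is easy to make concrete. A toy sketch (the three-operation DSL is purely illustrative): verifying a candidate program against a spec is linear in its size, but the space of candidates grows exponentially with program depth:

```python
from itertools import product

# Toy illustration: checking a candidate function against a spec is
# cheap, but the search space of candidate programs explodes.
# The "DSL" (compositions of +1, *2, square) is purely illustrative.

OPS = {"inc": lambda x: x + 1, "dbl": lambda x: x * 2, "sqr": lambda x: x * x}

def run(program, x):
    for op in program:
        x = OPS[op](x)
    return x

def check(program, spec_pairs):
    """Cheap verification: O(len(program) * len(spec_pairs))."""
    return all(run(program, x) == y for x, y in spec_pairs)

# Spec for f(x) = (x + 1) * 2
spec = [(0, 2), (1, 4), (3, 8)]

# Brute-force search: |OPS|**depth candidates -- exponential in depth.
found = None
for depth in range(1, 4):
    for prog in product(OPS, repeat=depth):
        if check(prog, spec):
            found = prog
            break
    if found:
        break

print(found)  # ('inc', 'dbl')
```

Here verification stays trivial while the search loop is the exponential part, which is exactly why a tractable checker does not buy you a tractable generator.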
>As Philippe Schnoebelen discovered in 2002 [1], languages cannot reduce the difficulty of program construction or comprehension.
From a model-checking point of view. This is about taking a proof-theoretic approach...
Your last paragraph is also quite wrong: a machine learning model could very well learn to solve typical instances of an NP-complete problem, because NP-completeness says nothing about average-case complexity (and we should consider probabilistic complexity classes, so the picture is even more "complex").
calebh 4 hours ago [-]
In my mind the main problem here is setting up the environment for training the LLM and ensuring that there's enough high quality training data for consumption. Getting an environment set up for a single project is non-trivial - here I'm assuming that you want something similar to autocomplete in an IDE or language server integration. Even if you could set this up, are there enough projects to even train on in the first place?
Maybe this set-up will work for Haskell, but you can abandon any hope of setting up environments for C or C++. Even languages like Rust or C# may be impossible to train on, despite the build chain being a bit nicer than C or C++.
> The reason I put off starting the series for so long is one of the same reasons blocking the writing of the paper: some of the introductory material is some of the most difficult to write. It has been such a long time that I no longer know how to adequately explain why the problem is so difficult.
My sympathies to Jules
Daffrin 7 hours ago [-]
The connection between type systems and neural net structure is underexplored in practice. One thing I'd add: when you're dealing with multi-modal inputs in production (say, mixed structured and unstructured content), the type-safety problem compounds. You end up with implicit contracts at inference boundaries that are very hard to enforce.
Has the author written anything on how this applies to transformer architectures specifically? The attention mechanism seems like a place where a richer type theory would be genuinely useful.
woolion 10 hours ago [-]
I'm not sure what to make of TFA (I don't have time right now to investigate in detail, but the subject is interesting). It starts by saying you can stop generation as soon as you have an output that can't be completed -- and there are already more advanced techniques that do that. If your language is typed, then you can use a "proof tree with a hole" and check whether there's a possible completion of that tree.
References are "Type-Constrained Code Generation with Language Models" and "Statically Contextualizing Large Language Models with Typed Holes".
Then it switches to using an encoding that would be more semantic, but I think the argument is a bit flimsy: it compares chess to the plethora of languages that LLMs can spout somewhat-correct code for (which is behind the success of this generally incorrect approach).
What I found more dubious is that it brushes off syntactic differences by saying "yeah, but they're all semantically equivalent". That, it seems to me, is kind of the main problem here: basically any proof is an equivalence of two things, but it can be arbitrarily complicated to see it. If we consider this problem solved, then we can get better things, sure...
I think without some e.g. Haskell PoC showing great results these methods will have a hard time getting traction.
Please correct any inaccuracies or incomprehension in this comment!
bgavran 9 hours ago [-]
Author here - thanks for engaging.
On existing techniques: the Type-Constrained Generation paper is discussed in the blog post (under Constrained Decoding), and I'd group typed holes in the same bucket.
The problem with those methods is that they're inference-time: they don't update the weights. Constrained decoding prevents the model from saying certain things, without changing what the model wants to say. This is especially problematic the more complex your type system gets, without even taking into account that for many of these systems type inference is undecidable.
Meaning, if I give you a starting string, in the presence of polymorphism and lambdas you might not always be able to tell whether it completes to a term of a particular type.
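The "prevents the model from saying certain things, without changing what it wants to say" point can be illustrated numerically. A toy sketch with made-up logits and token names: masking invalid tokens renormalises the distribution, but leaves the model's relative preferences among the valid tokens exactly as they were:

```python
import math

# Toy illustration (made-up numbers): constrained decoding masks invalid
# tokens at inference time. The surviving tokens keep their relative
# odds -- the model's preferences are untouched; only its menu shrinks.

logits = {"Just": 2.0, "Nothing": 1.5, "fromList": 0.5, "}": -1.0}
valid = {"Just", "Nothing"}  # say only these are well-typed here

def softmax(d):
    z = sum(math.exp(v) for v in d.values())
    return {k: math.exp(v) / z for k, v in d.items()}

unconstrained = softmax(logits)
constrained = softmax({k: v for k, v in logits.items() if k in valid})

# Relative preference between valid tokens is identical in both:
ratio_before = unconstrained["Just"] / unconstrained["Nothing"]
ratio_after = constrained["Just"] / constrained["Nothing"]
print(round(ratio_before, 6) == round(ratio_after, 6))  # True
```

So if the model's preferences among the valid options were learned on the wrong structure, masking can't repair that; only training on the constrained object can.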
On the syntactic difference, I'd gently reframe: the question isn't whether syntactically different programs are semantically equivalent; it's that, regardless of which form you pick, the existing methods don't let the model learn the constructor choice.
That's what the next section is about.
woolion 7 hours ago [-]
Thank you for your reply. FTR, I find the subject very interesting and I hope there will be more work on this line of approach.
>The problem with those methods is that they're inference time
I agree, I just thought it was missing some prior art (not affiliated with these papers :-P)
What is not clear to me at all is, is this the draft of a research idea?
Or is there already some implementation coming in a later post?
It seems to me that such an idea would be workable for a given language with a given type system, but that there would be a black-magic step in training a model that works in a language-agnostic manner. Could you clarify?
big-chungus4 12 hours ago [-]
So the model generates code; let's say it is wrongly typed. We then take the correctly typed version and use cross-entropy between them? Is that right? That just sounds like typical training, unless you can somehow take arbitrary code the model generated and automatically find the correctly typed version, so you won't need a dataset for it.
yorwba 12 hours ago [-]
Rather than letting the model generate arbitrary code and type-checking it afterward, the author wants to pre-restrict the output with templates that are well-typed by construction and only let the model make choices between valid alternatives in that restricted output space.
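That reading can be sketched as follows (a hypothetical two-type mini-language; the constructor table is illustrative, not from the post): generation becomes a sequence of choices among alternatives that are well-typed by construction, rather than free-form token emission followed by a check:

```python
# Sketch (hypothetical mini-language): instead of emitting raw tokens
# and type-checking afterward, enumerate only the well-typed
# constructors for the goal type and let the model pick among them.

def constructors(goal):
    """Well-typed alternatives per goal type: (template, subgoal types)."""
    table = {
        "Int":  [("0", []), ("1", []), ("({0} + {1})", ["Int", "Int"])],
        "Bool": [("True", []), ("({0} <= {1})", ["Int", "Int"])],
    }
    return table[goal]

def generate(goal, choose):
    """Build a term top-down; `choose` picks an index among alternatives."""
    alts = constructors(goal)
    template, subgoals = alts[choose(goal, len(alts))]
    return template.format(*(generate(g, choose) for g in subgoals))

# A stand-in "model" that picks the last alternative once, then zeros:
state = {"budget": 1}
def pick(goal, n):
    if state["budget"] > 0:
        state["budget"] -= 1
        return n - 1
    return 0

print(generate("Bool", pick))  # (0 <= 0)
```

Every term this emits type-checks by construction; the model's job reduces to the `choose` function, which is exactly the object one would want to train.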
ogogmad 4 hours ago [-]
"Even more, this coproduct can be thought of as an instance of a dependent pair indexed by a finite set" - I know what the individual terms mean, but I don't get what this is saying.
bgavran 17 minutes ago [-]
A coproduct in the category Set is a disjoint union of sets, i.e. A + B + C where A, B, C are sets.
We can think of this coproduct as involving two choices:
1) a choice of which component of the coproduct we're interested in (first, second, or third)
2) a choice of an element of that component
That is, `A + B + C` is isomorphic to `(i : Fin 3) * D i` where `Fin 3` is a set with three elements, and `D : Fin 3 -> Type` with `D(0) = A`, `D(1) = B`, `D(2) = C`.
Then, the idea is: why index by a finite set? If you replace `Fin 3` by some arbitrary set, you start to be able to model a very general notion of a dependent type.
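The same isomorphism, rendered concretely with integer tags standing in for `Fin 3` (the set contents are illustrative):

```python
# Concrete rendering of A + B + C ~ (i : Fin 3) * D i:
# an element of the coproduct is a dependent pair (i, x) of a tag
# i in {0, 1, 2} and an element x of the i-th component D(i).

A = {"a1", "a2"}
B = {"b1"}
C = {"c1", "c2", "c3"}
D = {0: A, 1: B, 2: C}  # D : Fin 3 -> Set

def coproduct(D):
    """Disjoint union as a set of dependent pairs (tag, element)."""
    return {(i, x) for i, Di in D.items() for x in Di}

APlusBPlusC = coproduct(D)
# Tagging keeps the components disjoint even if they overlapped as sets:
assert len(APlusBPlusC) == len(A) + len(B) + len(C)

# The two "choices" described above, read off any element:
tag, elem = min(APlusBPlusC)  # e.g. (0, 'a1'): the tag picks a component
assert elem in D[tag]         # the element lives in that component
print(len(APlusBPlusC))       # 6
```

Replacing the tag set `{0, 1, 2}` by an arbitrary (even infinite) index set is then exactly the generalisation to dependent pairs described in the comment.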
That was not a very good description, so try this: https://www.usebox.net/jjm/notes/basic/ (scroll down for the keyboard pic and also some code). Or this video: https://youtu.be/zgjGsNS6a0Y?t=167
[1]: https://lsv.ens-paris-saclay.fr/Publis/PAPERS/PDF/Sch-aiml02...
https://cybercat.institute/2025/05/07/neural-alchemy/
https://cybercat.institute/2026/02/20/categorical-semantics-...
https://cybercat.institute/2025/10/16/dependent-optics-ii/