
how do llms reason on conditionals ?

august 29th, 2025

while writing my master's thesis, i used 25 local language models through ollama to test their performance at grading logic-related mathsy stuff (just so we're clear, i'm nowhere near an llm stan, i especially despise ai-generated "art", and their environmental impact is garbage (hence why i run them locally) ; my interest in them is purely confined to scientific considerations). recently though, i've wondered how these language models actually reason : whether they blindly apply classical logic, or whether they follow more intuitive patterns acquired from digesting lots of pluridisciplinary man-made data. on the sep page for connexive logic (written by wansing, of course), there's a bunch of studies showing that non-logicians tend to, e.g., negate conditionals connexively rather than classically [1]. i also wanted to test how much causal relationships between the premise and the consequent matter (just so we're perfectly clear, i don't think that has anything to do with relevance logics ? apparently, a relevance logic mostly means a sociative logic in which disjunctive syllogism fails to be valid [2]). the latter also puts truth functionality at risk, if conditionals whose consequent is unrelated to their premise are systematically rejected or, at least, not validated.
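for reference (this is just the textbook picture) : classically, ¬(A→B) is equivalent to A∧¬B, whereas the connexive reading flips it into A→¬B ; connexive logics accordingly validate aristotle's thesis ¬(A→¬A) and boethius' thesis (A→B)→¬(A→¬B), which is precisely what the anticonnexive conditionals further down are meant to probe.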

to test that out, i created a benchmark on which i evaluated the models from my master's thesis : 26 conditionals, with independent questions for each premise and conclusion, each question asking whether the proposition it was prompted with is true, false, or of unknown truth value (the truth values each model gave to each premise and conclusion were then used as those propositions' actual truth values for said model). to do this, i used a python script that constrained their responses to a specific pydantic BaseModel subclass with a single answer field of type typing.Literal["Unknown","True","False"]. i made sure every conditional followed a different form, so as to have a healthy balance between causally-related and causally-unrelated ones, covering every combination of truth values, as well as testing anticonnexive claims for each plausible scenario, etc. every main file (including the exact results) has been uploaded to my google drive, for reproducibility's sake and for further analysis.
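here's a minimal sketch of that setup ; the Answer schema matches the description above, but the model name, prompt wording and helper function are illustrative, not lifted from the actual benchmark script :

  # query one local model about one proposition, with the output constrained
  # to the Answer schema via ollama's structured outputs
  from typing import Literal

  import ollama
  from pydantic import BaseModel

  class Answer(BaseModel):
      answer: Literal["Unknown", "True", "False"]

  def ask(model: str, proposition: str) -> str:
      response = ollama.chat(
          model=model,
          messages=[{
              "role": "user",
              "content": f'is the following proposition true, false, or of unknown truth value ? "{proposition}"',
          }],
          format=Answer.model_json_schema(),
      )
      return Answer.model_validate_json(response.message.content).answer

  # e.g. ask("granite3.3:8b", "if it rains, then the ground gets wet")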

here are the main observations i can give (confidence intervals are .95 clopper-pearson intervals, computed in r ; a short sketch for recomputing them follows below) :
  • they seem to mostly agree that propositions of the form "if A, then A" are true, regardless of A's truth value : 67 times out of 75 (CI .95 [.80, .96]).
  • those that have some kind of "causal"ish relationship (n=250) yielded False 106 times (CI .95 [.36, .49]), Unknown 77 times (CI .95 [.25, .37]), and True 67 times (CI .95 [.21, .33]), leading to an empirical .42∶.31∶.27 ratio of False∶Unknown∶True, a clear three-way split with a significant tendency to answer False rather than True (their confidence intervals do not overlap), though True is still answered between .21 and .33 of the time at a .95 (clopper-pearson) confidence level.
  • out of 250 conditional tests with completely unrelated premise-consequent pairs, we get 138 False (CI .95 [.488, .62]), 105 Unknown (CI .95 [.35, .484]), and 7 True (CI .95 [.01, .06]). this is an empirical False∶Unknown∶True ratio of .55∶.42∶.03, resulting in a mostly two-way split, with a significant tendency towards False rather than Unknown (their confidence intervals do not overlap), though Unknown is still answered between .35 and .49 of the time at a .95 (clopper-pearson) confidence level.
  • out of 75 anticonnexive conditionals (essentially ¬P→P, or P→¬P, which are the same up to double negation anyway), we got 66 False (CI .95 [.78, .95]), 6 Unknown (CI .95 [.02, .17]), and 3 True (CI .95 [.008, .12]). to be clear, out of the 66 False, 21 were parsed as "If [True], then [False]", while the other 45 went directly against classically-sound truth valuations (only 15 of which were parsed as "If [False], then [True]", but there were also 13 parsed as "If [Unknown], then [Unknown]" (which should yield Unknown), &c.). this coincides with the overall percentages of pro-connexive behaviour among humans that wansing reports on the connexive logic sep page.
we can infer that llms overall tend to reject classical principles in favour of connexive ones, and have a hard time validating conditionals with no apparent link between the consequent and the premise (CI .95 [.01, .06]), significantly (both statistically and emphatically) less often than they validate those that do have one (CI .95 [.21, .33]).
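in case anyone wants to double-check the intervals without firing up r, here's the clopper-pearson computation redone in python (the counts are the ones quoted above ; the helper is mine, not part of the benchmark files) :

  # exact (clopper-pearson) confidence interval for a binomial proportion,
  # via the usual beta-quantile formulation
  from scipy.stats import beta

  def clopper_pearson(k: int, n: int, conf: float = 0.95) -> tuple[float, float]:
      alpha = 1 - conf
      lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
      upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
      return lower, upper

  print(clopper_pearson(67, 75))   # "if A, then A" answered True : ≈ (.80, .96)
  print(clopper_pearson(7, 250))   # unrelated conditionals answered True : ≈ (.01, .06)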

also, they tend to agree with the expected values of the non-conditional propositions .83 of the time, with a .15 standard deviation among all non-conditional statement-by-statement agreement averages. the best-performing model was granite3.3:8b with 46 agreements out of 47 such propositions, and the worst was deepseek-r1:8b (lmao) with only 30 out of 47. the least agreed-upon proposition was "one can be both anti-religious and extremely devout to Allah", which was False for only 10 models out of 25 (some even rated it True), closely followed by "santa claus is real", which only 11 models out of 25 agreed was false (maybe they're trying to protect kids from finding out he's not real ? lol idk, the others said Unknown though, thankfully no one said True to that one).