how do llms reason on conditionals?
august 29th, 2025

during the writing of my master's thesis, i used 25 local language models through ollama to test their performance at grading logic-related mathsy stuff (just so we're clear, i'm nowhere near an llm stan, i especially despise ai-generated "art", and their environmental impact is garbage (hence why i run them locally); my interest in them is purely confined to scientific considerations). recently though, i've wondered how these language models actually reason: whether they blindly use classical logic, or whether they follow more intuitive patterns acquired from digesting lots of pluridisciplinary man-made data. on the sep page for connexive logic (written by wansing, of course), there's a bunch of studies showing that non-logicians tend to, e.g., negate conditionals connexively rather than classically [1]. i also wanted to test how much causal relationships between the premise and the consequent matter (just so we're perfectly clear, i don't think that has anything to do with relevance logics: apparently, a relevance logic mostly means a sociative logic in which disjunctive syllogism fails to be valid [2]). the latter also puts truth functionality at risk, if conditionals with a consequent unrelated to their premise are systematically rejected or, at least, not validated.

to test that out, i created a benchmark on which i evaluated the models i used for my master's thesis on 26 conditionals, with independent questions for each premise and conclusion, each question asking whether the prompted proposition is true, false, or of unknown truth value (using the truth values each model gave to each premise and conclusion as those propositions' actual truth values for said model). to do this, i used a python script which formatted their responses with respect to a specific pydantic BaseModel subclass, to which a single answer field of type typing.Literal["Unknown","True","False"] was added. i made sure every conditional followed a different form, so i have a healthy balance between causally-related and causally-unrelated ones, of every combination of truth values, as well as anticonnexive claims tested for each plausible scenario, etc. every main file (including the exact results) has been uploaded to my google drive, for reproducibility's sake and for further analysis.
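for the curious, here's roughly what the grading loop looked like. this is a minimal sketch rather than the exact script: the prompt wording, the example propositions and the function/class names are made up for illustration, and i'm assuming the ollama python client's structured-output support (passing a json schema via format=); the real files are on the drive.

```python
# minimal sketch of the per-proposition grading loop (not the exact thesis script).
# assumes the ollama python client's structured-output support (format=<json schema>)
# and pydantic v2; prompt wording, names and example propositions are illustrative.
from typing import Literal

import ollama
from pydantic import BaseModel


class Verdict(BaseModel):
    # the single constrained field described above
    answer: Literal["Unknown", "True", "False"]


def ask(model: str, proposition: str) -> str:
    """ask one model whether a single proposition is true, false, or unknown."""
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": f'is the following proposition true, false, or of unknown truth value? "{proposition}"',
        }],
        # constrain the reply to the pydantic schema so the answer always parses
        format=Verdict.model_json_schema(),
    )
    return Verdict.model_validate_json(response.message.content).answer


if __name__ == "__main__":
    # hypothetical usage: a premise on its own, then a conditional built from it
    for prop in ("santa claus is real",
                 "if santa claus is real, then he lives at the north pole"):
        print(prop, "->", ask("granite3.3:8b", prop))
```

constraining the output to that schema means every reply parses into exactly one of the three truth values, so no regex post-processing of the models' prose is needed.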
here are the main observations i can give (confidence intervals were computed in r):

also, they tend to agree with the expected values of the non-conditional propositions .83 of the time, with a .15 standard deviation among the per-statement agreement averages. the best-performing model was granite3.3:8b with 46 agreements out of 47 such propositions, and the worst was deepseek-r1:8b (lmao) with only 30 out of 47. the least agreed-upon proposition was "one can be both anti-religious and extremely devout to Allah", which only 10 models out of 25 valuated as False (some even valuated it as True), just ahead of "santa claus is real", which only 11 models out of 25 agreed was false (maybe they're trying to shield kids from learning he's not real? lol idk, the others said unknown though; thankfully, no one said true to that one).