---
layout: post
title: Are LLMs random?
---

While LLMs theoretically understand "randomness," their training data distributions may create unexpected patterns. In this article we test different LLMs from OpenAI and Anthropic to see if they produce unbiased results. In the first experiment we make each model toss a fair coin, and in the second we make it pick a number between 0 and 10 and check whether the picks are equally distributed between even and odd.

## Experiment 1 : Tossing a fair coin

> Prompt used: Toss a fair coin. Just say "heads" or "tails". Just output the result. Don't say anything else. Don't write code. Don't use any tools.
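Each model was sent this prompt repeatedly and its replies tallied. A minimal sketch of that loop is below; `ask_model` is a hypothetical stand-in for the actual OpenAI/Anthropic API call, which isn't shown here:

```python
from collections import Counter

def run_trials(ask_model, n=100):
    """Query a model n times and tally its normalized replies.

    ask_model: any zero-argument callable returning the model's text
    reply ("heads" or "tails"). In the real experiment it would wrap
    an OpenAI or Anthropic chat call using the prompt above.
    """
    counts = Counter()
    for _ in range(n):
        # Normalize whitespace and case so "Heads" and "heads" merge.
        counts[ask_model().strip().lower()] += 1
    return counts
```

With 100 calls per model, `counts["heads"] / 100` gives the P(Heads) used below.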
Before we plot the results, we calculate deviation. Deviation simply measures how far each model's heads probability strays from the ideal unbiased value (0.5, or 50%). It's calculated as:

> Deviation = P(Heads) - 0.5

For example, Claude 3.7 Sonnet has P(Heads) = 0.58, so its deviation is 0.58 - 0.5 = 0.08 (or 8%). This directly quantifies the magnitude and direction of the bias: positive values indicate a heads bias, negative values a tails bias. The first graph shows raw proportions of heads vs tails, while the second visualizes these deviations.
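In code, the deviation metric is trivial:

```python
def deviation(p_heads):
    # Signed distance from the unbiased value 0.5:
    # positive => heads bias, negative => tails bias.
    return p_heads - 0.5

# Claude 3.7 Sonnet: P(Heads) = 0.58 -> 0.08, an 8% heads bias
print(round(deviation(0.58), 2))
```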

<div align = "center">
<img src="/assets/files/hvt.png">
</div>

Next we also run a chi-squared test to determine whether the bias is statistically significant or could reasonably occur by chance. I know we don't have a big enough sample size, but I'm just doing this for fun. For each model, it's calculated as:

> χ² = Σ (Observed - Expected)²/Expected

With 100 tosses per model and an expected 50/50 split:

> χ² = (Observed_Heads - 50)²/50 + (Observed_Tails - 50)²/50

For Claude 3.7 Sonnet:

> χ² = (58 - 50)²/50 + (42 - 50)²/50 = 2.56

A χ² value greater than 3.84 (the critical value for df = 1, p = 0.05) indicates statistical significance. Models with statistically significant bias are shown in red in the deviation graph, indicating their bias likely reflects an inherent trait rather than random chance. Claude's χ² = 2.56 falls below this threshold, suggesting its observed bias could reasonably arise from random variation.
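The same test in code, using the 100-toss counts:

```python
def chi_squared(heads, tails, expected=50):
    # One-degree-of-freedom goodness-of-fit test against a 50/50 split.
    return (heads - expected) ** 2 / expected + (tails - expected) ** 2 / expected

CRITICAL_VALUE = 3.84  # df = 1, p = 0.05

chi2 = chi_squared(58, 42)  # Claude 3.7 Sonnet
print(round(chi2, 2), chi2 > CRITICAL_VALUE)  # 2.56 False -> not significant
```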

<div align = "center">
<img src="/assets/files/hvt1.png">
</div>

#### Key Findings:

- All models show a "heads" bias - every LLM tested produced more heads than tails
- Bias severity varies widely - ranging from 8% (Claude 3.7 Sonnet) to 49% (o1)
- Statistical significance - Claude is the only model whose bias isn't statistically significant
- OpenAI models show substantially stronger heads bias than Claude
#### Analysis Details:

- Most biased: o1 (99% heads) and GPT-4.1 (96% heads)
- Least biased: Claude 3.7 Sonnet (58% heads)
- Average bias: 30.7% deviation from perfect balance
- Chi-squared tests confirm statistical significance for all models except Claude
## Experiment 2 : Odd vs even

> Prompt used: Generate a random number between 0 and 10 (both inclusive). Just output the number. Don't say anything else. Don't write code. Don't use any tools. Don't explain. Don't output anything except the number.

(Strictly speaking, 0-10 inclusive contains six even numbers and five odd, so a perfectly uniform generator would return odd only about 45.5% of the time; for simplicity, the analysis below keeps the 50/50 baseline.)
#### Key Findings:

- Strong odd-number bias in most models - 4 out of 6 models show a statistically significant preference for odd numbers
- Claude shows extreme bias - with 97% odd numbers, Claude 3.7 Sonnet has the strongest bias (47% deviation from expected)
- GPT-4.5-preview shows perfect balance - exactly 50/50 distribution between odd and even
- Two unbiased models - GPT-4.5-preview and GPT-4.1 show no statistically significant bias

<div align = "center">
<img src="/assets/files/ct.png">
</div>

#### Statistical Analysis:

- Most biased: Claude 3.7 Sonnet (χ² = 88.36, p < 0.05)
- Perfectly balanced: GPT-4.5-preview (χ² = 0.00)
- Average bias magnitude: 18.0% deviation from the expected 50/50 split
- Direction of bias: most models favor odd numbers, while GPT-4.1 slightly favors even numbers
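Claude's χ² value can be reproduced from its raw counts (97 odd vs 3 even out of 100 draws), using the same goodness-of-fit formula as in Experiment 1:

```python
def chi_squared(odd, even, expected=50):
    # Goodness-of-fit against the 50/50 baseline used above.
    return (odd - expected) ** 2 / expected + (even - expected) ** 2 / expected

# Claude 3.7 Sonnet: 97 odd, 3 even
print(round(chi_squared(97, 3), 2))  # 88.36, far above the 3.84 cutoff
```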

<div align = "center">
<img src="/assets/files/ct1.png">
</div>

It's interesting to see Claude being unbiased while tossing coins but heavily biased when picking odd/even numbers.

### Raw data

#### Coin toss

<div align = "center">
<img src="/assets/files/tossdata.png">
</div>

#### Odd vs Even

<div align = "center">
<img src="/assets/files/numberdata.png">
</div>
