Summary: (1) AI Escargot is indeed hard, (2) it's nowhere near the hardest puzzle findable in a couple of hours, and (3) we don't have a good metric for predicting hardness.
Date: Fri, 10 Nov 2006 01:14:15 -0500 (GMT-05:00)
From: Daniel Asimov <dasimov@earthlink.net>
IDEA: If cell (i,j) is empty, let P(i,j) be the set of digits in {1,2,...,9} that do NOT appear in the union of row i, column j, and the 3x3 box containing (i,j); if (i,j) is filled with the number k, then P(i,j) = {k}. (P stands for "Possible set".)
Let s(i,j) := #(P(i,j)).
I propose that a good intrinsic measure of the difficulty of a Sudoku is the *geometric* mean of the 81 s(i,j)'s, since the number of possibilities is multiplicative: the product of all the s(i,j)'s bounds the naive search space, and the geometric mean is just the 81st root of that product.
It would be interesting to see how this measure compares with the time it takes software to solve a puzzle.
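To make this concrete, here is a minimal sketch of the score in R. (This code is mine, not Dan's; it assumes the puzzle is a 9x9 matrix z with blanks coded as 0, the convention the R sudoku package uses, and the name difficultyScore is mine as well.)
---------------------------------------------------------
## Sketch of Dan's proposed score: s(i,j) = #P(i,j) per cell,
## then the geometric mean over all 81 cells.
difficultyScore <- function(z) {
  s <- matrix(1L, 9, 9)                      # filled cells: P(i,j) = {k}, s = 1
  for (i in 1:9) for (j in 1:9) {
    if (z[i, j] == 0) {                      # blank cell: count candidates
      bi <- 3 * ((i - 1) %/% 3) + 1:3        # rows of the enclosing 3x3 box
      bj <- 3 * ((j - 1) %/% 3) + 1:3        # cols of the enclosing 3x3 box
      seen <- c(z[i, ], z[, j], z[bi, bj])   # row i, column j, and the box
      s[i, j] <- length(setdiff(1:9, seen))  # s(i,j) = #P(i,j)
    }
  }
  exp(mean(log(s)))                          # geometric mean of the 81 s(i,j)'s
}
---------------------------------------------------------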
I looked a little deeper at this, and had some fun with R programming. I took a sample of 500 randomly generated sudokus, each of which had 58 blanks, so that it was in the same general class as AI Escargot. Each sudoku was solved 5 times, and I took the median solution time for each sudoku as representative of how long that particular one took to solve. (This turned out to be overkill; all 5 times were always pretty close.) The timing data and R programs are available from me on request; the sketch below shows the general shape of the procedure.
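(A reconstruction, not the original script: generateSudoku() and solveSudoku() are from the CRAN sudoku package, but the exact arguments and the seed are my guesses.)
---------------------------------------------------------
library(sudoku)

## Median elapsed time over `reps` timed solves of one puzzle.
timePuzzle <- function(z, reps = 5) {
  median(replicate(reps, system.time(solveSudoku(z))["elapsed"]))
}

set.seed(2006)                               # arbitrary, for reproducibility
puzzles <- replicate(500, generateSudoku(Nblank = 58, print.it = FALSE),
                     simplify = FALSE)
result <- data.frame(median = vapply(puzzles, timePuzzle, numeric(1)))
---------------------------------------------------------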
Now let's look at how those solution times (in sec) were distributed over the sample of 500:
---------------------------------------------------------
> with(result, quantile(median, probs = seq(0, 1, by = 0.25)))
     0%     25%     50%     75%    100% 
   2.85    5.30    6.13    7.17  930.74 
---------------------------------------------------------
That is, the solution time ranged from just under 3 sec to around 15 min, with most rather tightly clustered around 6 sec. What we learn from this:

* Most randomly chosen sudokus are of comparable difficulty, i.e., the results were pretty tight about the median.

* AI Escargot, which clocks in at around 30 sec, is indeed harder than most. In fact, it's at about the 95th percentile, if you look at the spreadsheet of all the sample timings. WAY harder than most. (See the quick check just after this list.)

* On the other hand, there exist sudokus which are much more fiendishly difficult. (They are, however, rare.)
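That quick percentile check, for what it's worth (my snippet, not part of the original analysis; it assumes the result data frame built above):
---------------------------------------------------------
## ecdf() returns the empirical CDF of the sample's median solve times;
## evaluating it at 30 sec gives the fraction of puzzles solved faster.
100 * ecdf(result$median)(30)                # ~95, per the sample timings
---------------------------------------------------------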
For example, the worst case in the sample was this, taking about 15 min to solve:
---------------------------------------------------------
> printSudoku(hardestSudokuInSample(result))
+-------+-------+-------+
|       | 7 3   |       |
|       |     6 | 8 7   |
| 3 7   |       |       |
+-------+-------+-------+
|       |   7   |       |
|   2   |   5   | 3 8 7 |
| 7     |       |       |
+-------+-------+-------+
| 2 1   |   6   |       |
|       |     3 | 4     |
|   3   | 2     |   5 6 |
+-------+-------+-------+
---------------------------------------------------------
(Solution also available on request. :-) This is "hard" only in the sense that the algorithm in the R sudoku library -- whatever it is -- takes especially long. I have no idea if a human would find it hard, or would find it hard in comparison to AI Escargot, or ... whatever.

* Then I looked at your (Dan's) suggestion about a difficulty metric; let's call that "difficulty" in what follows. It seemed pretty cool, but alas, the data say otherwise:

  - The difficulty scores are approximately normally distributed, with relatively narrow spread:
---------------------------------------------------------
> quantile(difficulty, probs = seq(0, 1, by = 0.25))
      0%      25%      50%      75%     100% 
2.520093 2.732545 2.795985 2.857816 3.118089 
---------------------------------------------------------

  - The difficulty score of AI Escargot is 2.56, i.e., well BELOW the median difficulty?!

  - The difficulty of the worst-case example above is 2.95.

  - The correlation of difficulty with median time to solve is only 0.085. Thus the difficulty score explains only about 0.7% of the variance of the times... i.e., very little at all.

  - A plot of time to solve vs difficulty is almost a horizontal line, i.e., no predictive power.

  - A linear regression model which tries to predict time from difficulty score (each centered around its mean and scaled by its standard deviation, i.e., as Z scores to get them on the same scale), with slope beta and intercept alpha:

        time - mean(time)          difficulty - mean(difficulty)
        ----------------- = beta * ----------------------------- + alpha
             sd(time)                     sd(difficulty)

    ... confirms that there is no predictive value. :-( The regression is significant, but explains almost none of the variance (adjusted R^2 = 0.006):
---------------------------------------------------------
> summary(lm(scale(result$"median") ~ scale(difficulty)))

Call:
lm(formula = scale(result$median) ~ scale(difficulty))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4249 -0.1566 -0.0966 -0.0273 27.4646 

Coefficients:
                    Estimate Std. Error   t value Pr(>|t|)   
(Intercept)       -8.659e-17  3.152e-02 -2.75e-15   1.0000   
scale(difficulty)  8.570e-02  3.154e-02     2.717   0.0067 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9968 on 998 degrees of freedom
Multiple R-Squared: 0.007344, Adjusted R-squared: 0.006349 
F-statistic: 7.383 on 1 and 998 DF, p-value: 0.006697 
---------------------------------------------------------

-- 
Steve Rowley <sgr@alum.mit.edu>  http://alum.mit.edu/www/sgr/  Skype: sgr000