Hey Everyone!
A long time ago I wrote an app for the iPhone
that let you grab a Sudoku puzzle using
your iPhone's camera.
Recently, while investigating my self-organising
Christmas lights project, I realised
that the browser APIs and ecosystem had advanced
to the point where it was possible
to recreate my Sudoku application running purely in the
browser.
As you can see it works pretty well - you
can try it out using the links in the
description and, as always, all the code
is on GitHub.
Hopefully, this video will give you a good
idea how the system works and the thinking behind what I've done.
Let's take a look under the hood...
Let's define our problem:
given an image, identify and extract the Sudoku
puzzle, recognise the number in each square
of the puzzle, solve the puzzle, and render the
results on top of the original image.
As a classically trained image processing
guy, my first port of call is to remove redundant information.
We don't care about colours, so we'll convert
the image to greyscale.
As you can see we've not really lost any information.
Looking at the image, we only really care about
paper and ink,
so it makes sense to binarise the image, making it black or white.
Next we need to identify the blob that is
the puzzle.
Once we've got that we can extract the image
of the puzzle and extract the contents of
each box of the puzzle.
Applying some OCR to the contents of each box,
we can then solve the puzzle and our job is done.
It's a pretty straightforward image processing
pipeline!
Let's dive into each of these steps in turn
and see how they work.
We're taking a feed from the camera on the
device.
This comes into us as an RGB image.
We're not really interested in colour as we're
typically working with printed puzzles,
which will be in black and white.
So our first step is to convert from RGB to
greyscale.
The standard formula for this conversion
(https://en.wikipedia.org/wiki/Grayscale) is
Y = 0.299R + 0.587G + 0.114B.
We can see that a large
portion of the value is coming from the green channel.
So we can apply a little shortcut to our RGB
to greyscale conversion and just take the
green channel of the image.
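As a quick sketch of why the shortcut is reasonable (illustrative NumPy, not the project's code), the standard luma conversion and the green-channel shortcut agree closely for the near-neutral greys of paper and ink:

```python
import numpy as np

def to_greyscale(rgb):
    """Standard luma formula: Y = 0.299*R + 0.587*G + 0.114*B."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def green_shortcut(rgb):
    """Shortcut: green dominates the weighted sum, so just take that channel."""
    return rgb[..., 1].astype(np.float64)

# For a near-neutral pixel (paper or ink) the two values are close.
pixel = np.array([[[200, 205, 198]]], dtype=np.float64)
print(to_greyscale(pixel)[0, 0], green_shortcut(pixel)[0, 0])
```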
We're going to be using morphological operations to locate the puzzle.
These typically work on black-and-white binary images, so our next step is to binarise our image.
This involves applying a threshold to separate
out foreground from background pixels.
I've simulated an image here that has been
taken in poor lighting conditions.
As you can see from the histogram, there's
not a clear separation of ink and paper.
This makes it very hard to apply a global
threshold to the whole image.
However, if we look at a small section of
the image we can see from the histogram that
we do have an obvious collection of ink pixels
and paper pixels.
What we need to do is examine each pixel in the context of its surrounding area.
A simple way to do this is to apply a blur
to the image and then compare the original
image with this blurred image.
Doing this gives us a very clean segmented
image, even when lighting conditions are not ideal.
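A minimal sketch of this blur-and-compare (adaptive) threshold in plain NumPy; the window size and offset below are illustrative values, not the ones used in the app:

```python
import numpy as np

def box_blur(img, k):
    """Box blur of a 2-D array via an integral image (k should be odd)."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    window_sums = (ii[k:k + h, k:k + w] - ii[:h, k:k + w]
                   - ii[k:k + h, :w] + ii[:h, :w])
    return window_sums / (k * k)

def adaptive_threshold(grey, window=9, offset=15.0):
    """A pixel is 'ink' if it is darker than its local mean by more than offset."""
    return grey.astype(np.float64) < box_blur(grey, window) - offset

# Simulated uneven lighting: bright on the left, dark on the right,
# with a single horizontal ink stroke across the middle.
img = np.full((40, 40), 200.0) + np.linspace(0, -80, 40)[None, :]
img[20, 5:35] -= 60
mask = adaptive_threshold(img)
```

A single global threshold would either lose the stroke on the dark side or flood the background on the bright side; comparing against the local mean sidesteps both.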
We've now got a fairly clean image that contains
our puzzle along with a bunch of random other
printed elements.
A classic approach would be to jump straight in here with something like a Hough transform to detect lines.
However, we can be a bit smarter about what
we are doing and take a bit of a shortcut.
We know that whoever is taking the picture
should be pointing the camera at a puzzle.
So we can assume that the puzzle would be the largest object in the image
We can also apply simple heuristics to filter out elements that probably aren't a puzzle.
If we scan the image pulling out all the connected components,
the one that contains the puzzle should be the component with the most points that also matches our heuristics.
This lets us isolate the puzzle grid.
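As a sketch, picking the largest connected component can be done with a simple BFS flood fill over the binary mask (pure Python, illustrative only):

```python
from collections import deque

def largest_component(mask):
    """Flood-fill every foreground region; return the biggest one's pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                comp, queue = [], deque([(sy, sx)])
                seen[sy][sx] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    # 4-connected neighbours
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    return best

# Two components: a 3-pixel blob (the "puzzle") and a 2-pixel blob.
mask = [[1, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 1]]
puzzle = largest_component(mask)
```

In practice you would also apply the size and shape heuristics mentioned above before accepting a component as the puzzle.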
With the puzzle grid identified we now need
to identify the four corners of the puzzle.
Once again we can apply a fairly simple heuristic
to this problem.
The pixels that are in the four corners should
have the smallest Manhattan distance from
the corners of the maximum extents of the extracted
component.
You can see the calculation of Manhattan
distance in the animations.
Running this algorithm for all four extents
of our located puzzle gives us the four corners.
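A sketch of that heuristic: for each corner of the component's bounding box, take the component pixel with the smallest Manhattan distance to it (the function name is mine):

```python
def find_corners(points):
    """points: (y, x) pixels of the puzzle component.
    For each bounding-box corner, return the component point with the
    minimum Manhattan distance to it."""
    ys = [y for y, _ in points]
    xs = [x for _, x in points]
    box = [(min(ys), min(xs)), (min(ys), max(xs)),
           (max(ys), max(xs)), (max(ys), min(xs))]
    return [min(points, key=lambda p, c=c: abs(p[0] - c[0]) + abs(p[1] - c[1]))
            for c in box]

# The four true corners of a slightly rotated square.
quad = [(1, 0), (0, 3), (3, 4), (4, 1)]
corners = find_corners(quad)
```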
We know where the puzzle is in our image.
We have the location of the four corner points.
To extract the puzzle image we can reformulate our
problem into one of a homographic transform.
We want to map from the camera image of the
puzzle to an ideal image of the puzzle that
is not distorted by perspective or rotations.
We do this by computing a homography between
the two planes.
The formula shown can be transformed into
this new formula - as you can see we need four points
- this maps nicely onto the corner points
that we've already located for the puzzle.
If we rewrite the matrices algebraically we
can find h using this algorithm.
Please see the linked paper for details on
how this is derived
and alternative more stable methods for calculating the homography.
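A sketch of the basic (unnormalised) Direct Linear Transform in NumPy: each point pair contributes two equations, and h is the null-space vector of the stacked system, found via the SVD. As noted above, normalised variants are more numerically stable; this is just the textbook version.

```python
import numpy as np

def homography(src, dst):
    """Solve for the 3x3 matrix H with dst ~ H @ src in homogeneous
    coordinates, from four point correspondences (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # h is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 3)

def apply_homography(H, pt):
    """Map a point through H, dividing out the homogeneous coordinate."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return x / w, y / w

# Ideal square corners -> corners found in the camera image (made-up numbers).
ideal = [(0, 0), (1, 0), (1, 1), (0, 1)]
camera = [(0.0, 0.0), (2.0, 0.1), (2.2, 1.9), (0.1, 2.0)]
H = homography(ideal, camera)
```

With exactly four correspondences in general position the solution is exact; with more points the same SVD gives a least-squares fit.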
Now that we have the homography between our ideal image and the camera image we can map pixels between them.
This lets us extract a nice square puzzle image from the camera image.
Now that we have the square puzzle image we need to extract the contents of each individual cell.
We can use the thresholded version of the
image to help us with this.
Looking at each box in turn we extract the
bounds of any connected region starting from
the center of each cell.
We can then use this bounds to extract an
image of the digit from the square greyscale image.
If there is no connected region in the center
of the cell then we know that it is empty.
We now have an image for each populated cell
of the puzzle.
We're going to use a neural network to perform
OCR on each image.
I'm going to use tensorflow to train the network
and then tensorflow js to run the network
in the browser.
Our neural network is going to be trained
to recognise the digits 1 through 9.
I've synthesized a large number of training
examples from a wide selection of fonts rendering
a greyscale image of each digit.
I've also added a small amount of blur and some Gaussian noise to each image.
Hopefully this will provide a reasonable representation
of what we will be getting in a live environment.
I've also separated out 10% of the images into
a testing data set that we can use to evaluate
how well our network performs.
I'm using an interactive jupyter notebook
to train the network.
We'll import TensorFlow and then set up the batch size
and the number of epochs for our training session.
I'm going to use a batch size of 32 and run
the training for 100 epochs.
We're also going to resize our images to 20
by 20 pixels.
We need to remember to also do this in the
javascript code so that it matches.
I'm augmenting our training data to help the
network generalise.
This will add some random mutations to our
input data set.
We'll split out 20% of our images into a validation set for use during training and also normalise the pixel values.
We'll also have a data generator that does
no augmentation and only normalises the pixel values.
We can now load the data, splitting our training
images into training and validation subsets.
Here's a small sample of our augmented training data.
For our problem we don't need a very complicated
model: I've built a single convolution layer
followed by a dense hidden layer and then
finally an output layer with nine outputs,
one for each digit.
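A sketch of that architecture in Keras; the filter count and hidden-layer size here are my assumptions, not the exact values used:

```python
import tensorflow as tf

# One convolution layer, one dense hidden layer, nine softmax outputs
# (digits 1-9). Layer sizes are illustrative.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20, 20, 1)),  # matches the 20x20 resize
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(9, activation="softmax"),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=["accuracy"])
```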
We'll log the training out to TensorBoard
and also create a confusion matrix to show
us which digits are causing problems.
For training, we use the CategoricalCrossentropy loss function.
Running the training we can see that we have
pretty good accuracy
on both training and validation.
Looking at the confusion matrix we can see
that "1"s and "7"s are currently getting confused.
Once the training is complete we can see that
we have good performance on both the training data
and the validation data.
This indicates that our network is performing
well on data it has seen as well as data it
has never seen before.
Looking at the confusion matrix there are
still some issues between 1s and 7s.
Let's save the model and see which images it's actually failing on.
Looking at the failed images I'm pretty happy
with the performance.
I think the fonts that it is failing on would
not be used in the real world so I'm happy
to ignore these failures and move on with this network.
As I'm happy with the neural network structure,
I'm going to train it on all the data and
use the resulting network in my production
code.
The final training gives us really good accuracy.
Checking the failing images again,
we now see that none are incorrect.
We'll convert this model for use in TensorFlow.js.
To run our model we need to perform the same
image resizing and normalisations as we did
during training and then we can simply feed
our images into the model and ask it to predict
the digits.
We can solve a Sudoku puzzle using Donald
Knuth's application of Dancing Links to his
Algorithm X for solving the exact cover problem.
Dancing links uses the seemingly very simple
realisation that we can remove and put back
nodes from a doubly-linked list very efficiently.
We can see that when we remove node B from
the list it still has pointers back to the
nodes that used to link to it.
This means that node B contains all the information
required to add it back into the list.
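The core trick can be sketched in a few lines (illustrative Python rather than the project's JavaScript):

```python
class Node:
    """A node in a circular doubly-linked list."""
    def __init__(self, name):
        self.name = name
        self.prev = self.next = self

def insert_after(a, b):
    """Splice node b in immediately after node a."""
    b.prev, b.next = a, a.next
    a.next.prev = b
    a.next = b

def remove(b):
    """Unlink b from the list. Crucially, b keeps its prev/next pointers."""
    b.prev.next = b.next
    b.next.prev = b.prev

def restore(b):
    """b still knows its old neighbours, so re-insertion is O(1)."""
    b.prev.next = b
    b.next.prev = b

a, b, c = Node("A"), Node("B"), Node("C")
insert_after(a, b)   # A <-> B
insert_after(b, c)   # A <-> B <-> C (circular)
```

After `remove(b)`, the list reads A to C, but `b.prev` and `b.next` still point at A and C, which is exactly what makes backtracking cheap.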
So what is Algorithm X?
Algorithm X is an algorithm that solves the
exact cover problem.
The exact cover problem is solved by finding
the selection of rows in the grid that will
combine so that there is a 1 in each column.
This example is taken from Wikipedia - https://en.wikipedia.org/wiki/Knuth%27s_Algorithm_X
Looking at our grid, column 1 has the fewest
entries, so we select it.
This gives us a choice of adding rows A or
B to our solution.
We try adding row A to our potential solution
set.
We can see it has an entry in columns 1, 4 and 7.
And these columns have entries in Rows A,
B, C, E and F
Removing these columns and rows leaves us with row D.
We now have zero rows in column 2 so we have failed
to find a solution.
We need to backtrack and try again.
This time we try adding Row B to our solution.
This flags columns 1 and 4 for removal, which
means that rows A, B and C should also be removed.
Doing this leaves us with rows D, E and F
and columns 2, 3, 5, 6 and 7.
Now column 5 has the lowest number of entries
so we select it.
This gives us a choice of Row D. Selecting
row D flags columns 3, 5 and 6 for removal
which means that rows D and E should also
be removed.
We now have Row F remaining.
We include Row F in our solution which means
that columns 2 and 7 are flagged for removal
which removes row F.
We now have an empty matrix, which means we
have found a solution.
The final solution is rows B, D and F.
And if you look in the grid you can see that combining those rows together would give you a 1 in each column.
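Here's a compact sketch of Algorithm X over a dict-of-sets matrix (a well-known Python formulation, not the dancing-links version the project uses), run on the Wikipedia example above:

```python
def solve(X, Y, solution=None):
    """Yield exact covers. X: column -> set of rows; Y: row -> its columns."""
    if solution is None:
        solution = []
    if not X:                                  # every column covered
        yield list(solution)
        return
    col = min(X, key=lambda c: len(X[c]))      # column with fewest entries
    for row in sorted(X[col]):
        solution.append(row)
        removed = cover(X, Y, row)
        yield from solve(X, Y, solution)
        uncover(X, Y, removed)                 # backtrack: restore state
        solution.pop()

def cover(X, Y, row):
    """Remove every column 'row' covers, and every row clashing with it."""
    removed = []
    for c in Y[row]:
        for r in X[c]:
            for c2 in Y[r]:
                if c2 != c:
                    X[c2].remove(r)
        removed.append((c, X.pop(c)))
    return removed

def uncover(X, Y, removed):
    """Reverse cover() in the opposite order."""
    for c, rows in reversed(removed):
        X[c] = rows
        for r in rows:
            for c2 in Y[r]:
                if c2 != c:
                    X[c2].add(r)

# The Wikipedia example: rows A-F covering columns 1-7.
Y = {"A": [1, 4, 7], "B": [1, 4], "C": [4, 5, 7],
     "D": [3, 5, 6], "E": [2, 3, 6, 7], "F": [2, 7]}
X = {j: {r for r, cols in Y.items() if j in cols} for j in range(1, 8)}
solutions = list(solve(X, Y))
```

This set-based version pays for its removals with copies and re-inserts; dancing links gets the same backtracking behaviour with O(1) pointer splices.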
So, how does dancing links fit into this?
Dancing links solves the problem of backtracking
in an efficient way.
We can encode the sparse matrix as a doubly
linked circular list going left and right
for the rows and up and down for the columns.
Since these are now linked lists, we can use
the fact that we can efficiently remove and re-add
elements to perform the column and row removals,
and when we backtrack it can be done very efficiently.
This is what a Sudoku puzzle looks like when it's been turned into a set of constraints.
You can see that as we add the known numbers,
the number of rows and columns decreases.
As we run the search algorithm you can see
how quickly the search space is reduced.
You can also see that it occasionally needs
to backtrack before we find the complete solution.
To render our results we could draw our text
into a square image and then project each
pixel onto the image coming from the camera.
This would be quite computationally intensive.
Alternatively, for each puzzle cell, we could
take an imaginary line through the centre of the cell.
We can then project that line using our homographic
transform and use the projected line to calculate
an approximate height and angle for the cell.
We then simply draw the digit at the projected
location of the cell with the calculated height and angle.
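A sketch of that shortcut in plain Python; `H` is the 3x3 homography as nested lists, and the helper names are mine:

```python
import math

def project(H, x, y):
    """Map (x, y) through the homography, dividing out the w coordinate."""
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xs / w, ys / w

def digit_placement(H, cell_cx, top_y, bottom_y):
    """Project the vertical centre line of a cell, then derive a draw
    position, digit height, and rotation from the projected endpoints."""
    x0, y0 = project(H, cell_cx, top_y)
    x1, y1 = project(H, cell_cx, bottom_y)
    height = math.hypot(x1 - x0, y1 - y0)
    # Angle of the projected line relative to vertical.
    angle = math.atan2(y1 - y0, x1 - x0) - math.pi / 2
    return ((x0 + x1) / 2, (y0 + y1) / 2), height, angle

# With the identity homography the cell is already axis-aligned.
H_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pos, height, angle = digit_placement(H_identity, 5.0, 0.0, 10.0)
```

Projecting two points per cell is far cheaper than warping every pixel of a rendered overlay.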
So that's it for this video.
All the code for this project is on GitHub
- check the video description for the link.
I hope you've enjoyed this video - please
hit the subscribe button and leave any thoughts
you might have in the comments.
Thanks for watching and I'll see you in the
next video!
