Factorising part 2 - Pollard's Rho algorithm
In which we explore how 1970's England collided with the ancient Greeks with a touch of North America in the 60s to produce a great factorisation algorithm
We are in England and the year is 1975. Queen are in the music charts with the majestic Bohemian Rhapsody, and that year will also see David Bowie have a chart-topping hit with the incredible Space Oddity. Top films that year were Jaws and the Rocky Horror Picture Show, and popular cars in England at the time were the Ford Cortina and the Vauxhall Chevette (also known as the Chevrolet Chevette in America or the Opel Kadett in mainland Europe). Was it really so long ago?
At his desk, Jazz music lover, keen chess player, and most importantly for our story, a great mathematician John Pollard, sits thinking about factorising large integers. He is of course aware of the method of "trial division". But is there a better way?
Suppose we have an integer N which we are told is the product of two prime numbers a and b. The basic algorithm of trial division is to start at 2 and see if it divides N. If that fails, try 3, then 4 then 5 and so on, until eventually you hit lucky when you get to a or b, whichever is the lower number. The problem with that approach is that you might be going for a long time until you finally get to a number which divides N if a and b are very large. Do you have a better probability of finding an answer faster if instead of starting at the bottom and progressively working up, you successively plucked random numbers out of the air between 2 and N? Could that be quicker?
He scribbles a few equations down on his notepad. Alas, no. If there are two numbers which divide N, then the probability of picking one is 2/N regardless of what method you pick them, in much the same way as winning a lottery or raffle is exactly the same whether you have ticket number 1 or ticket number 476.
But what if instead of looking for a number that divides N equally, we look for some other property of the number instead? A property that is more common, is quickly checked, and still provides a pointer to the prime factors of N?
Let's take an example using something to which we know the answer. Let's consider the number 221 which is 13 X 17. Out of 219 numbers between 2 and 220, only two of them divide evenly into 221. But what about these numbers:
13 X 1 = 13
13 X 2 = 26
13 X 3 = 39
etc. etc. up to
13 X 16 = 208
and these :
17 X 1 = 17
17 X 2 = 34
17 X 3 = 51
etc. etc. up to
17 X 12 = 204
Here's the interesting thing. There are 16 multiples of 13 and 12 multiples of 17 which all have one very important thing in common. They all share a common factor with 221.
And, of course, having found a number that has a common factor with our N, then that common factor will be a factor of N.
So, instead of looking for TWO numbers between 2 and 221 which divide equally into N, why not look for one of the TWENTY EIGHT numbers which share a common factor. By changing the property of the number which we are seeking, we have improved our chances - in this specific case - by 14 times!
That's already a massive improvement. But can we even improve on this idea? What is the probability that if we pick TWO numbers between 1 and 221, the DIFFERENCE between these numbers is a number which shares a common factor with 221? This should improve our chances massively again. For example, there is only one way to pick the number 26. But there are lots of ways to pick two numbers with a difference of 26 : for example 1 and 27, 99 and 125, 29 and 3 ...
Well, sticking to our example of factorising 221, there are 221 ways of picking the first number, and 221 ways of picking the second number, so the total number of ways of picking two numbers between 1 and 221 (given that the order matters, so picking 5 then 100 is different to picking 100 then 5) is 221 squared which is 48,841.
Now, for the sake of brevity you can either (a) work it out for yourself on a bit of paper, it's not too hard (b) write a little computer program for yourself or (c) just trust me on this - of the 48,841 possible combinations, there are over six thousand of them which result in the difference between the numbers having a factor in common with 221. (For every number between 1 and 221, there are 28 numbers that go with it for which the difference is a multiple of either 13 or 17 - for example 1 can go with 14, 18, 27, 35, 40, 52, 53, 66, 69, 79, 86, 92, 103, 105, 118, 120, 131, 137, 144, 154, 157, 170, 171, 183, 188, 196, 205 and 209)
Startlingly, this means that if you were to pick two (different) numbers between 1 and 221, there is a greater than one in eight chance that the difference between those numbers would have a factor in common with 221. Wow!
(The surprisingly high chance of success coming out of this is related somewhat to the famous "birthday paradox" - a bit of a misnomer since it isn't really a paradox at all, just a startling mathematical probability - that if you select 23 random people there will be more than a 50% probability that two of them will celebrate their birthdays on the same day of the year. With 70 people there will be a 99.9% probability, despite the fact that there are 365 days in a year! If you're not already familiar with the birthday paradox, have a look at Google and Wikipedia as the mathematics behind it are fascinating and very surprising.)
Could this thinking be the basis of a more efficient algorithm than just straightforward trial division?
Of course, any algorithm based on these ideas is only of any practical use if there is a very fast way to check if two numbers share a common factor. If checking that two numbers have a common factor is a long and laborious process, then the idea of the algorithm will fall flat. Luckily, John Pollard, being a keen mathematician knew immediately of a very efficient way indeed. A method that has in fact been around since around 300BC - Euclid's algorithm.
Let's consider two numbers, A and B. We can consider A greater than B with no loss of generality. The Euclidean algorithm is based on the idea that if a number n divides evenly into A and it also divides evenly into B then it must therefore also divide evenly into A-B. For example, 7 divides into 49 (7 times) and 7 divides into 63 (nine times) so therefore it divides into the difference between 63 and 49 = 14 (two times).
So we can devise an algorithm to find the greatest common divisor of A and B as follows:
1. Start with A and B
2. Take the smaller number from the larger and call it C
3. Take C and the smaller number out of A and B. If they are the same, then C is a common factor
4. If not, take the smaller from the larger and repeat. When the two numbers are the same, that is the greatest common factor of A and B.
Let's have a go with the numbers 252 and 105
252 - 105 = 147 (so whatever is a factor of 252 and 105 is also a factor of 147)
147 - 105 = 42 (taking the smallest two numbers out of the equation above and repeating)
105 - 42 = 63
63 - 42=21
42 - 21 = 21
21 - 21 = 0 <------- 21 is a factor of 252 and 105
We can even speed this algorithm up a bit and do it in three steps as follows:
252 = 2 X 105 + 42
105 = 2 X 42 + 21
42 = 2 X 21 with no remainer <---- 21 is a factor of 252 and 105
In fact, Euclid's algorithm gives us not just any common factor of A and B, it gives us the Greatest Common Divisor of A and B. If we are trying to find the factors of a number N and we can find a number M (such that M < N) for which Euclid's algorithm says there is a Greatest Common Divisor > 1 then that greatest common divisor is a factor of N.
Excellent! The plan is coming together! The final step is to come up with a formula by which we can quickly generate random numbers between 1 and N. If we have such a formula, we can leave a computer running away in the background trying to factor our integer while we drink cups of tea or otherwise get on with our lives.
Of course, seasoned mathematicians will know that "a random number generated by a formula" is totally a contradiction in terms!
A number which is generated by a known formula cannot be truly random! So what we are hoping to achieve is what is known as a "pseudo-random number" i.e. one which for our purposes is just as good as being a random number. Provided of course you don't use it to pick lottery or raffle tickets at the annual convention of mathematicians.
Here's an idea which will generate pseudo-random numbers between 1 and N. Take the formula :
and start with x = 1. Whatever the output of the equation, feed it back in and keep going. Because the answer is always modulo N, we'll always generate a number between 1 and N.
Let's try this formula to generate random numbers between 1 and 59
x = 1 ---> outputs 2
x = 2 --> outputs 5
x = 5 --> outputs 26
x = 26 --> outputs 28 (26 squared plus 1 is 677 which is 59 X 11 + remainder 28)
x = 28 --> outputs 18
x = 18 --> outputs 30
x = 30 --> outputs 16
x = 16 --> outputs 21
x = 21 --> outputs 29
x = 29 --> outputs 16
x = 16 --> outputs 21
x = 21 --> outputs 29
Oops. Do you notice what's happened? We started out great, generating the pseudo-random sequence 2,5, 26, 28, 18, 30, 16, 21, 29 and then we got locked into an infinite cycle going round 16, 21, 29 forever.
Sadly, this is often the way with pseudo-random-number generating formulae. It is in fact, as an aside, where the name of the algorithm "Pollard's Rho algorithm" comes from. If you draw the numbers of a typical sequence on a directed graph, the numbers move along from one to the next for a little while before entering a circular loop. The resulting shape can be drawn to look like the lower case Greek letter Rho, hence the name. (John Pollard must have liked the concept of "Rho" algorithms because he came up with another one with the same name for solving a problem known as the discrete logarithm problem. If you are doing further research on Pollard's Rho algorithm on Google or Wikipedia, make sure you are looking at the right one)
But - all is not lost. If we can find a way to quickly detect when we've got ourselves stuck in a loop then we can change the formula (maybe x squared plus TWO modulo N etc.) and try again - and if/when we get stuck in a loop with that one, we can switch to yet another formula and so on.
And here, John Pollard knew again of exactly what to do. He didn't need to go as far back as ancient Greece this time, only as far back as 1967 but this time across the North Atlantic to a brilliant computer scientist by the name of Bob Floyd from New York. Around that time, a very famous lady by the name of Debbie Harry had moved to New York and had become a singer with the band ... "The Wind In The Willows" ... (Blondie would not be formed until 1974). It would be nice to think that Bob Floyd went to see Debbie Harry in concert in the 1960s, but around that time, he had moved away to the Illinois Institute of Technology in Chicago. On a trivial fact note though, we was once a college room-mate of Carl Sagan.
His technique - now known eponymously as "Floyd's Cycle Finding Algorithm" and sometimes colloquially known as the "hare and tortoise" algorithm works like this: From the starting point, set off a tortoise which moves one step at a time through the sequence, and set off a hare which moves two steps through the sequence each time. At some point down the line, both the hare and the tortoise will have entered and be locked in the loop with the hare going twice as fast round the loop as the tortoise.
Taking the example given above:
Tortoise: 1, 2, 5, 26, 28, 18, 30, 16, 21, 29, 16, 21, 29, 16, 21, 29
Hare: 1, 5, 28, 30, 21, 16, 29, 21, 16, 29
Notice that when the hare and the tortoise are at the same point in the loop, is the point at which the tortoise has gone round the loop completely once and is about to go around the loop a second time.
Time to put it all together. Here's how it all goes :
1. Come up with a formula to generate a pseudo-random number between 1 and N. Typically, we will use g(x) = (x^2 + M) modulo N where N is the number we are trying to factorise and M is some other integer.
2. Set off a "tortoise sequence" x= g(x) starting at x=1, and at the same time, set off a "hare" sequence y = g(g(y)) starting at y=1. The hare will move two steps along the sequence for every one step that the tortoise takes. Consider x and y as our two "pseudo-random" numbers.
3. Calculate the value of |y-x| (i.e. take the smaller number from the larger to get the absolute difference)
4. Use Euclid's algorithm to quickly check if the difference between our "random numbers" has a factor in common with N. If it does - we've found a factor of N.
5. If not, move the tortoise and the hare again i.e. x=g(x) and y = g(g(y)) and try again.
6. If we end up going round in a loop without success, tweak the formula in step 1 and try again.
For the computer programmers amongst us (and hey, why not, after all P vs NP is a problem in computational mathematics!) then the following code will give an example of the algorithm in action. It's in Javascript but could be easily adapted to work in C, Java, BASIC, Python, PHP or whatever your language of choice! Note that for the sake of simplicity, I've not included any cycle detection, if you wish to develop this algorithm, please feel free!
<SCRIPT LANGUAGE = "javascript">
N = 9655379 // the number to be factorised factor = 1 // declare the variable x = 1; // initial condition y = 1; // ditto initial condition for y steps = 1; // keeps track of how many steps it
// takes to get to an answer
// the formula we use to generate our pseudo-random
// numbers
function g(x) { return (x*x + 1) % N }
// this function finds the greatest common divisor of
// integers a and b by using Euclids algorithm
function gcd(a, b) { while (b != 0) { remainder = a % b; a = b; b = remainder; } return a; }
// here we go ........
document.write("Trying to find a factor of " + N +
"<P>");
while (factor == 1) { x = g(x); y = g(g(y)); difference = Math.abs(y - x); factor = gcd(difference, N); document.write("step " + steps + ": x = " + x + "y = " + y + " factor = " +
factor + "<BR>"); steps++; }
// if the factor is greater than one, then halt and print
// out the result
document.write("<P><B>A factor of " + N + " is " +
factor + "</B>");
</SCRIPT>
Here you can see I'm trying to find a factor of 9655379. You can change this line to find factors of other numbers. If you don't need the running commentary that goes on while the program is running, just delete all the document.write statements except of course, the very last one which prints the result.
When I ran it, even though 9655379 is a reasonably formidable integer for a human to factorise, the algorithm solved it in the blink of an eye and with very few steps:
Trying to find a factor of 9655379
step 1: x = 2 y = 5 factor = 1 step 2: x = 5 y = 677 factor = 1 step 3: x = 26 y = 3963377 factor = 1 step 4: x = 677 y = 6126670 factor = 1 step 5: x = 458330 y = 5878781 factor = 1 step 6: x = 3963377 y = 7750099 factor = 1 step 7: x = 1838272 y = 3324569 factor = 1 step 8: x = 6126670 y = 5556863 factor = 1 step 9: x = 7685323 y = 7248122 factor = 1 step 10: x = 5878781 y = 6060311 factor = 2017
A factor of 9655379 is 2017
If you liked this article, follow the P versus NP blog at sofosband.wixsite.com/pversusnp for more articles and follow our Facebook page P versus NP problem