--------------------------------------------------------------------------------
This blog post receives thousands of hits per month. I do not want to change the blog post from 2007, since it is part of a larger context.
Visit
Finding palindromes for a more extensive description of the algorithm and the underlying concepts and ideas.
Visit
http://www.jeuring.net/homepage/palindromes/ to obtain the software for finding palindromes.
--------------------------------------------------------------------------------
A palindrome is a number, verse, word or a phrase that is the same whether you read it backwards or forwards, for example the word `refer'.
When I started as a PhD student in Computer Science in 1988, my professor gave the following problem assignment to me. Construct an algorithm that, when given a string, finds the longest palindrome substring occurring in the string. For example, given the string "yabadabadoo", the algorithm should return the substring "abadaba". The algorithm should not only
return the longest palindrome substring, it should also return it
fast. The requirement was that the algorithm should only use a number of steps linear in the length of the argument string. This was a difficult problem, which gave me severe headaches in the months to follow. I spent four months on the problem, and solved it. The one comment I got from my professor was "The best thing about this problem and its solution is that they have absolutely no practical relevance."
He was wrong.
Since 1988 I have found that palindromes play an important role in the world around us. Palindromes appear in music, in genomes, in art, architecture and design, mathematics, physics, chemistry, etc.
Particularly interesting is the fact that palindromes occur in the male genome: of the about 20,000,000 characters representing the male-specific region of the Y chromosome, 5,700,000 characters form together eight large palindromes (off by a couple of characters). I don't think there is any proven explanation for the presence of palindromes in the male genome, but I suspect it has something to do with repairing genomes. If you know more about it: please let me know.
As you may or may not have noticed when working on repairing Endo's DNA, the cloud is corrupted. It turns out that the cloud's genome also is a palindrome (off by a couple of characters), and that the corruption can be repaired by taking the mirror image of the palindrome. To do so, you have to find the palindrome. In the case of Endo, this is pretty easy: `duolc' follows cloud immediately in the symbol table. Had this information not been in the symbol table, you would have to find this palindrome in the DNA by means of an algorithm for finding palindromes. Actually, the DNA for the cloud is only 6,500 acids long, and using a naive quadratic-time algorithm for finding palindromes is not a real problem if you have a sufficiently fast machine. A quadratic-time algorithm for determining palindromes in human male DNA isn't going to work: it would take a long time to find the palindromes.
This blog message explains how to find palindrome substrings in linear time. I will develop a program in Haskell for finding exact palindromes. The problem of finding approximate palindromes is left as an exercise to the reader :-) Interestingly, the algorithms for finding palindromes developed by computational biologists generally use suffix trees, and both space and time consumption seem to be worse compared with the algorithm I give in this blog message. But descriptions of the algorithm given here have been published in the eighties (by Galil and others, described in RAM-code) and nineties (by me, described in Squiggol, which is probably even harder to decypher than RAM-code) of the previous century, which is a long time ago of course.
This blog message is a literate Haskell file, which can be interpreted with ghci.
The type of a function for finding palindromes
I want to find the longest palindrome substring in a string. The first attempt at a type for a function longestPalindrome is hence:
longestPalindrome :: String -> String
To determine whether or not a string is a palindrome, I have to compare characters at different positions in a list. It would be helpful if I had random access into the string. For that purpose, I'm going to change the input type of the longestPalindrome function to an array. Furthermore, the longestPalindrome algorithm can also be used to find the longest palindrome in a list of integers, or any other type on which I can define an equality function. So here is the second attempt at a type for a function for finding the longest palindrome:
longestPalindrome :: Eq a => Array Int a -> Array Int a
To find the longest Palindrome in an array, I have to calculate the longest palindrome around each position of the array, where a position in an array is either on an element, or before or after an element. For example, the array corresponding to [1,2,3] has 7 positions: before the 1, on the 1, before the 2, ..., until after the 3. So I want to express the function longestPalindrome in terms of a function longestPalindromes of type
longestPalindromes :: Eq a => Array Int a -> [Int]
where the result list is the list of the lengths of the longest palindrome around each position in the argument array. For example,
?longestPalindromes (arrayList (0,2) [1,2,2])
[0,1,0,1,2,1,0]
I will omit the definition of longestPalindrome in terms of longestpalindromes, and define function longestPalindromes below.
A naive algorithm for finding palindromes
A naive algorithm for finding palindromes calculates all positions of an input array, and for each of those positions, calculates the length of the longest palindrome around that position.
> module Palindromes where
>
> import Data.Array
>
> longestPalindromesQ :: Eq a => Array Int a -> [Int]
> longestPalindromesQ a =
> let (afirst,alast) = bounds a
> positions = [0 .. 2*(alast-afirst+1)]
> in map (lengthPalindromeAround a) positions
Function lengthPalindromeAround takes an array and a position, and calculates the length of the longest palindrome around that position.
> lengthPalindromeAround :: Eq a => Array Int a
> -> Int
> -> Int
> lengthPalindromeAround a position
> | even position =
> extendPalindromeAround (afirst+pos-1)
> (afirst+pos)
> | odd position =
> extendPalindromeAround (afirst+pos-1)
> (afirst+pos+1)
> where pos = div position 2
> (afirst,alast) = bounds a
> extendPalindromeAround start end =
> if start < 0
> || end > alast-afirst
> || a!start /= a!end
> then end-start-1
> else extendPalindromeAround (start-1)
> (end+1)
For each position, this function may take an amount of steps linear in the length of the array, so this is a quadratic-time algorithm.
A linear-time algorithm for finding palindromes
I now describe a linear-time algorithm for finding palindromes. Although the program is only about 15 lines long, it is rather intricate. I guess that you need to experiment a bit with to find out how and why it works.
The algorithm processes the input array from left to right. It maintains the length of the current longest tail palindrome, and a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order. Function longestPalindromes is expressed in terms of function extendTail.
> longestPalindromes :: Eq a => Array Int a -> [Int]
> longestPalindromes a =
> let (afirst,alast) = bounds a
> in extendTail a afirst 0 []
Function extendTail takes an array as argument, the current position in the array, the length of the current longest tail, and a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order. There are four cases to be considered in function extendTail. If the current position is after the end of the array, we only have to add the final palindromes in the array to the list of longest palindromes, for which we use the function finalCentres. If the current longest tail palindrome extends to the start of the array, we extend the list of lengths of longest palindromes by means of function extendCentres. If the element at the current position in the array equals the element before the longest tail palindrome we extend the longest tail palindrome. If these elements are not equal, we extend the list of length of longest palindromes.
> extendTail a n currentTail centres
> | n > alast =
> -- reached the end of the array
> finalCentres currentTail centres
> (currentTail:centres)
> | n-currentTail == afirst =
> -- the current longest tail palindrome
> -- extends to the start of the array
> extendCentres a n (currentTail:centres)
> centres currentTail
> | a!n == a!(n-currentTail-1) =
> -- the current longest tail palindrome
> -- can be extended
> extendTail a (n+1) (currentTail+2) centres
> | otherwise =
> -- the current longest tail palindrome
> -- cannot be extended
> extendCentres a n (currentTail:centres)
> centres currentTail
> where (afirst,alast) = bounds a
Function extendCentres adds palindromes to the list of lengths of longest palindromes. It takes the array as argument, the current position in the array, the list of palindromes to be extended, and the list of palindromes around centres before the centre of the current longest palindrome tail. It uses the mirror property of palindromes to calculate longest palindromes around centres after the centre of the current longest palindrome tail. If the last centre is on the last element, we call extendTail with a longest tail palindrome of length 1, and the position moved tot he right. If the previous element in the centre list reaches exactly to the end of the last tail palindrome, use the mirror property of palindromes to find the longest tail palindrome. In the other cases, we've found the longest palindrome around a centre, and add that to the list of length of longest palindromes. We proceed by moving the centres one position, and calling extendCentres again.
> extendCentres a n centres tcentres centreDistance
> | centreDistance == 0 =
> -- the last centre is on the last element:
> -- try to extend the tail of length 1
> extendTail a (n+1) 1 centres
> | centreDistance-1 == head tcentres =
> -- the previous element in the centre list
> -- reaches exactly to the end of the last
> -- tail palindrome use the mirror property
> -- of palindromes to find the longest tail
> -- palindrome
> extendTail a n (head tcentres) centres
> | otherwise =
> -- move the centres one step
> -- add the length of the longest palindrome
> -- to the centres
> extendCentres a n (min (head tcentres)
> (centreDistance-1):centres)
> (tail tcentres) (centreDistance-1)
Function finalCentres calculates the lengths of the longest palindromes around the centres that come after the centre of the longest tail palindrome of the array. These palindromes are obtained by using the mirror property of palindromes.
> finalCentres 0 tcentres centres = centres
> finalCentres (n+1) tcentres centres =
> finalCentres n
> (tail tcentres)
> (min (head tcentres) n:centres)
At each step in this algorithm, either the longest tail palindrome is extended, and the current position in the array is moved, or the list of lengths of longest palindromes is extended. Since both the array, and the list of lengths of longest palindromes are linear in the length of the array, this is a linear-time algorithm.