Friday, 10 August 2007

Finding palindromes

--------------------------------------------------------------------------------
This blog post receives thousands of hits per month. I do not want to change the blog post from 2007, since it is part of a larger context.

Visit Finding palindromes for a more extensive description of the algorithm and the underlying concepts and ideas.

Visit http://www.jeuring.net/homepage/palindromes/ to obtain the software for finding palindromes.
--------------------------------------------------------------------------------
A palindrome is a number, verse, word or a phrase that is the same whether you read it backwards or forwards, for example the word `refer'.

When I started as a PhD student in Computer Science in 1988, my professor gave the following problem assignment to me. Construct an algorithm that, when given a string, finds the longest palindrome substring occurring in the string. For example, given the string "yabadabadoo", the algorithm should return the substring "abadaba". The algorithm should not only return the longest palindrome substring, it should also return it fast. The requirement was that the algorithm should only use a number of steps linear in the length of the argument string. This was a difficult problem, which gave me severe headaches in the months to follow. I spent four months on the problem, and solved it. The one comment I got from my professor was "The best thing about this problem and its solution is that they have absolutely no practical relevance."

He was wrong.

Since 1988 I have found that palindromes play an important role in the world around us. Palindromes appear in music, in genomes, in art, architecture and design, mathematics, physics, chemistry, etc.

Particularly interesting is the fact that palindromes occur in the male genome: of the about 20,000,000 characters representing the male-specific region of the Y chromosome, 5,700,000 characters form together eight large palindromes (off by a couple of characters). I don't think there is any proven explanation for the presence of palindromes in the male genome, but I suspect it has something to do with repairing genomes. If you know more about it: please let me know.

As you may or may not have noticed when working on repairing Endo's DNA, the cloud is corrupted. It turns out that the cloud's genome also is a palindrome (off by a couple of characters), and that the corruption can be repaired by taking the mirror image of the palindrome. To do so, you have to find the palindrome. In the case of Endo, this is pretty easy: `duolc' follows cloud immediately in the symbol table. Had this information not been in the symbol table, you would have to find this palindrome in the DNA by means of an algorithm for finding palindromes. Actually, the DNA for the cloud is only 6,500 acids long, and using a naive quadratic-time algorithm for finding palindromes is not a real problem if you have a sufficiently fast machine. A quadratic-time algorithm for determining palindromes in human male DNA isn't going to work: it would take a long time to find the palindromes.

This blog message explains how to find palindrome substrings in linear time. I will develop a program in Haskell for finding exact palindromes. The problem of finding approximate palindromes is left as an exercise to the reader :-) Interestingly, the algorithms for finding palindromes developed by computational biologists generally use suffix trees, and both space and time consumption seem to be worse compared with the algorithm I give in this blog message. But descriptions of the algorithm given here have been published in the eighties (by Galil and others, described in RAM-code) and nineties (by me, described in Squiggol, which is probably even harder to decypher than RAM-code) of the previous century, which is a long time ago of course.

This blog message is a literate Haskell file, which can be interpreted with ghci.


The type of a function for finding palindromes


I want to find the longest palindrome substring in a string. The first attempt at a type for a function longestPalindrome is hence:

longestPalindrome :: String -> String

To determine whether or not a string is a palindrome, I have to compare characters at different positions in a list. It would be helpful if I had random access into the string. For that purpose, I'm going to change the input type of the longestPalindrome function to an array. Furthermore, the longestPalindrome algorithm can also be used to find the longest palindrome in a list of integers, or any other type on which I can define an equality function. So here is the second attempt at a type for a function for finding the longest palindrome:

longestPalindrome :: Eq a => Array Int a -> Array Int a

To find the longest Palindrome in an array, I have to calculate the longest palindrome around each position of the array, where a position in an array is either on an element, or before or after an element. For example, the array corresponding to [1,2,3] has 7 positions: before the 1, on the 1, before the 2, ..., until after the 3. So I want to express the function longestPalindrome in terms of a function longestPalindromes of type

longestPalindromes :: Eq a => Array Int a -> [Int]

where the result list is the list of the lengths of the longest palindrome around each position in the argument array. For example,

?longestPalindromes (arrayList (0,2) [1,2,2])
[0,1,0,1,2,1,0]

I will omit the definition of longestPalindrome in terms of longestpalindromes, and define function longestPalindromes below.


A naive algorithm for finding palindromes


A naive algorithm for finding palindromes calculates all positions of an input array, and for each of those positions, calculates the length of the longest palindrome around that position.


> module Palindromes where
>
> import Data.Array
>
> longestPalindromesQ :: Eq a => Array Int a -> [Int]
> longestPalindromesQ a =
> let (afirst,alast) = bounds a
> positions = [0 .. 2*(alast-afirst+1)]
> in map (lengthPalindromeAround a) positions


Function lengthPalindromeAround takes an array and a position, and calculates the length of the longest palindrome around that position.


> lengthPalindromeAround :: Eq a => Array Int a
> -> Int
> -> Int
> lengthPalindromeAround a position
> | even position =
> extendPalindromeAround (afirst+pos-1)
> (afirst+pos)
> | odd position =
> extendPalindromeAround (afirst+pos-1)
> (afirst+pos+1)
> where pos = div position 2
> (afirst,alast) = bounds a
> extendPalindromeAround start end =
> if start < 0
> || end > alast-afirst
> || a!start /= a!end
> then end-start-1
> else extendPalindromeAround (start-1)
> (end+1)


For each position, this function may take an amount of steps linear in the length of the array, so this is a quadratic-time algorithm.


A linear-time algorithm for finding palindromes


I now describe a linear-time algorithm for finding palindromes. Although the program is only about 15 lines long, it is rather intricate. I guess that you need to experiment a bit with to find out how and why it works.

The algorithm processes the input array from left to right. It maintains the length of the current longest tail palindrome, and a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order. Function longestPalindromes is expressed in terms of function extendTail.


> longestPalindromes :: Eq a => Array Int a -> [Int]
> longestPalindromes a =
> let (afirst,alast) = bounds a
> in extendTail a afirst 0 []


Function extendTail takes an array as argument, the current position in the array, the length of the current longest tail, and a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order. There are four cases to be considered in function extendTail. If the current position is after the end of the array, we only have to add the final palindromes in the array to the list of longest palindromes, for which we use the function finalCentres. If the current longest tail palindrome extends to the start of the array, we extend the list of lengths of longest palindromes by means of function extendCentres. If the element at the current position in the array equals the element before the longest tail palindrome we extend the longest tail palindrome. If these elements are not equal, we extend the list of length of longest palindromes.


> extendTail a n currentTail centres
> | n > alast =
> -- reached the end of the array
> finalCentres currentTail centres
> (currentTail:centres)
> | n-currentTail == afirst =
> -- the current longest tail palindrome
> -- extends to the start of the array
> extendCentres a n (currentTail:centres)
> centres currentTail
> | a!n == a!(n-currentTail-1) =
> -- the current longest tail palindrome
> -- can be extended
> extendTail a (n+1) (currentTail+2) centres
> | otherwise =
> -- the current longest tail palindrome
> -- cannot be extended
> extendCentres a n (currentTail:centres)
> centres currentTail
> where (afirst,alast) = bounds a


Function extendCentres adds palindromes to the list of lengths of longest palindromes. It takes the array as argument, the current position in the array, the list of palindromes to be extended, and the list of palindromes around centres before the centre of the current longest palindrome tail. It uses the mirror property of palindromes to calculate longest palindromes around centres after the centre of the current longest palindrome tail. If the last centre is on the last element, we call extendTail with a longest tail palindrome of length 1, and the position moved tot he right. If the previous element in the centre list reaches exactly to the end of the last tail palindrome, use the mirror property of palindromes to find the longest tail palindrome. In the other cases, we've found the longest palindrome around a centre, and add that to the list of length of longest palindromes. We proceed by moving the centres one position, and calling extendCentres again.


> extendCentres a n centres tcentres centreDistance
> | centreDistance == 0 =
> -- the last centre is on the last element:
> -- try to extend the tail of length 1
> extendTail a (n+1) 1 centres
> | centreDistance-1 == head tcentres =
> -- the previous element in the centre list
> -- reaches exactly to the end of the last
> -- tail palindrome use the mirror property
> -- of palindromes to find the longest tail
> -- palindrome
> extendTail a n (head tcentres) centres
> | otherwise =
> -- move the centres one step
> -- add the length of the longest palindrome
> -- to the centres
> extendCentres a n (min (head tcentres)
> (centreDistance-1):centres)
> (tail tcentres) (centreDistance-1)


Function finalCentres calculates the lengths of the longest palindromes around the centres that come after the centre of the longest tail palindrome of the array. These palindromes are obtained by using the mirror property of palindromes.


> finalCentres 0 tcentres centres = centres
> finalCentres (n+1) tcentres centres =
> finalCentres n
> (tail tcentres)
> (min (head tcentres) n:centres)


At each step in this algorithm, either the longest tail palindrome is extended, and the current position in the array is moved, or the list of lengths of longest palindromes is extended. Since both the array, and the list of lengths of longest palindromes are linear in the length of the array, this is a linear-time algorithm.

32 comments:

Greg Bakker said...

Nice problem, and interesting writeup. Thanks.

Jon Bentley's 'Programming Pearls' has a neat related example, finding the largest repeated substring of a string.

His solution is n log n, but given that repetitions are nonlocal this seems reasonable.

Spoiler: He creates a list of pointers into the string, and sorts them by alphabetical order of the text they point to. It's then guaranteed that any repeated substring will be next to its duplicate, and the longest can be found by scanning the list.

Searle said...

Everythings shorter in Perl :-)

echo "yabadabaabadabadoo" | perl -ne 'length($&) > length($r) ? $r= $& : undef while (/(.*).?(.*)(?(?{$1 ne reverse $2})(?!))/ig); print "$r\n"';

I found the idea here:
http://aspn.activestate.com
/ASPN/Cookbook/Rx/Recipe/326097

Hard stuff, though...

admp said...

Searle, I think you are really missing the point since Perl regexps like this are not matched in linear time.

Searle said...

Sorry, I just love RegExes. For the case that I left a wrong impression: I enjoyed the article a lot! (And did get the point, actually.)

Clive said...

Just came back here for a look (after a few days absence) to see if this blog had anything new.

Found you have written about Palindromes (on the 10th), so I checked my notes (that I am making as I plough very slowly through the processing of learning about Endo's DNA) and in another odd coincidence I found that the 10th was also the exact day that I first spotted the cloud/duolc thing and also played with finding other exact palindromes.

Back then I also replaced "cloud" with the mirror of "duolc" and ran the integrity checking for the relevant gene page again but still got a "Fail", and assumed there must be errors in both parts. At least I thought I got a "Fail" - because I've just repeated the exercise and it PASSed this time. I guess I must have made a silly typo somewhere the first time. Again!

So thanks for "helping me" get that little detail sorted out!

All I need now are some pointers with the (apparently) ambiguous specs for pelcgb nytbevguz O! I don't want to be accused of laziness but perhaps you can tell me when your next blog entry will be posted? :-)

Jochen said...

BTW, to restore the cloud function you can use the following prefix.

IIPIPIICCIICCPIICIICIPPPIPPPIICIPCCI
ICIICCPICIICIICIIPIPIICCIICCPIICIIPI
PCICCPIICFICIICCCICCIC
CCCCCCFCFCFFCCCCFCCCCFFIC
CCICCICFICCICIPPCPFICCCFC
CFCICFICCCICCICCIPCPCPICCCICCI
CFICCCFCCFCCFIPCPPIPCPPCCICICC
ICICFICCICICCFICCICICFFICCCFII
CICIICIICICCIICP

If you have more stuff in your prefix you need to add the number of bases that comes after this code to the highlighted number (note that it is quoted).

And pelcgb nytbevguz O is really
pelcgb nytbevguz EP4 :)

Clive said...

I've just spotted your reply to my earlier comment Jochen. Thanks.

Pelcgb nytbevguz EP4 eh? Well, I never knew that. I did eventually (15 August) work out what was required but only by painstakingly stepping through the Fuun implementation and comparing intermediate results to those from mine to work out what I was doing wrong.

I'm afraid that the error in 5B (presumably deliberate?) in the specs on page 4405829 was too much for me - perhaps I was supposed to recognise EP4 from what was given? Maybe I will next time. :) I must have tried about a million variations (or so it seemed!) in a hit and miss approach to getting it to work properly before finally resorting to stepping through the DNA.

Anyway, after finally managing a correct implementation of EP4, I then spent many more hours, perhaps even days before finally (19 August) realising that page 23 needed nothing more than EBG2. Arrgghh!

Clearly I needed more like 72 days rather than 72 hours for this year's contest!!

Still Fuun though. :)

Michael Roger said...

Either I failed to copy-paste correctly, or something is not right in the corner cases:

# These results are backwards:
> let arr = array (0,2) (zip [0..2] [1,2,2]) in longestPalindromes arr
[0,1,2,1,0,1,0]

> let arr = array (0,2) (zip [0..2] [2,2,1]) in longestPalindromes arr
[0,1,0,1,2,1,0]


This result is backwards and a bit wrong:

> let arr = array (0,3) (zip [0..] [2,2,1,2]) in longestPalindromes arr
[2,1,0,3,0,1,2,1,0]

-- Should be
-- [0,1,2,1,0,3,0,1,0]

Johan Jeuring said...

There was something wrong in a corner case. In particular, the function finalCentres was lacking an occurrence of min (added to the post now).

Unfortunately, this error wasn't caught by my unit tests. I have now added a QuickCheck property to my code:

> propPalindromes :: Property
> propPalindromes =
> forAll (arbitrary:: Gen [Int]) $
> \l -> let a = array (0,length l - 1) (zip [0..] l)
> in reverse (longestPalindromes a) ==
> longestPalindromesQ a

That is, functions longestPalindromes and longestPalindromesQ should return the same result. This property passes 1000 tests.

And indeed, function longestPalindromes returns the list of lengths of palindromes in reverse. This is easily reversed of course, but requires some tricks in order to make the algorithm still `on-line' (a result is printed whenever it has been computed).

Michael Roger said...

Aha. That makes more sense.
As originally published,
finalCentres n t c === (reverse (take n t)) ++ c
and I wondered why you used a more complicated-looking definition.

I believe using a fingertree (Data.Sequence) instead of a list for centres gives the queue performance characteristics needed for online streaming, but that makes the code less pretty.

(It would be lovely if the Prelude-defined list (and string) syntax were typeclasses, so efficient alternatives could be used with the same elegant-looking code.)


(Formatting note: The corrected, longer definition of finalCentres now runs off past the margin, chopping off the end of the line, in Firefox on Mac, unless I shrink the display font)

Johan Jeuring said...

I agree, it would be nice if I could easily change lists for something like Data.Sequence.

Thanks for the formatting error, corrected now.

yiannis said...

Excuse me could somebody please rewrite the code using C as i (and probably more people) have problem following your idea...

reyt said...
This comment has been removed by a blog administrator.
M said...

what is a longest 'tail' palindrome? how is it different from a simple palindrome? I am missing something here!

M said...

Ok so I thought about this and wrote this code:
_______________________

string in; //load in as the input string
vector(int) a; //integer array
//initialized to 1
for(int i=1;i< a . size();i++)
{
if(in[i]==in[i-1])
a[i]=2;
if(i-a[i-1]-1 >=0)
if(in[i]==in[i-a[i-1]-1])
a[i]=a[i-1]+2;
}
ans=max(a);
It seems to work for me. So why do you need to maintain anything except the palindrome number for the ith element?

M said...

OK! I figured out why your solution is such a genius :)

abababa did it!

Anurag said...

I am finding the same problem with every above proposed algorithm. Kindly let me know where am I wrong-

for the case "aabbaabba"
I am getting the same result with all the above codes, i.e. "aabbaa"

while I expected the longest palindrome as "abbaabba"

Anurag said...

Please be kind enough to respond

Johan Jeuring said...

If I run palindromes on your input I get:

johan-jeurings-macbook-2:johanj$ palindromesi
aabbaabba
"abbaabba"

as expected. (Here I use the software available via http://www.jeuring.net/Palindromes/.) I did not change anything in the software for calculating palindromes, so it should work for the code posted in this blog as well. Can you maybe contact me with the exact details?

Noah said...

Awesome algorithm. Since it doesn't require full random access, just forward and backward movement of the start of the palindrome, you can adapt it to work on list-based input using a zipper.

Here's my implementation that does so. It also produces lazy output, so `take 10 $ maximalPalindromeLengths (repeat ())` is `[0..9]`.

Johan Jeuring said...

Interesting. Both ideas (using a zipper-like structure and returning results lazily) make sense. I'll have a look at your implementation, and will try to find out how the efficiency compares with my implementation.

rishi said...

I just have one question, is the linear answer the best case or the worst case.

I tested a string "cddeeeffffggggghhhhhhhiiiiiiii" which the loops run for around 104 times for a string of length of 30 and you'll find the same case with similar patterns.

more data [here](http://pastebin.com/QawhPeuv). seems either worst case is not O(n) but O(nlogn) or worse.

PS: except in such patterns it's still a fast solution.

Johan Jeuring said...

To Noah: I've now compared your lazy solution with my random-access solution using Criterion on the text of the King James' Bible. It turns out that the random access solution is about 1.7 times faster under unoptimized compilation. Using -O2 makes both solutions go slower.

Johan Jeuring said...

To rishi: the theoretical analysis of the algorithm shows it is linear-time (O(n)). My implementation might be worse than linear time, but not because of the algorithm used. It might be caused by printing, reading, etc.

Shank said...

Consider that you're trying to find the longest palindrome for the string "aaaaa". We will be performing O(n^2) operations in this case because every time the current center of the palindrome moves to the right by only one place till it reaches the center of the string (at which a time we stop checking because of the mirror property). Hence, we have (2n+1)/2 movements of the center of the palindrome and in the worst case the number of characters it has to check to the left and right of the center to see if it's a palindrome is O(n). Hence, O(n^2) in the worst case. Am I missing something?

Johan Jeuring said...

Yes, I'm afraid you are missing something. I don't check naively for the longest palindrome around each centre. I calculate this list of palindromes around each centre, together with the longest tail palindrome, in one sweep over the list. At each step I either move the centre, or I extend the longest tail palindrome, and there are only a linear amount of such steps available.

rishi said...

I'm really not counting the time taken or adding complexities due to reading or printing or addition / comparison. I'm simply counting the no of iterations that are happening in the loop. I've tried this algorithm in Java / C / python / JS / PHP and I simple counted the no of iterations that are practically happening. Usually, Its has slight variances but always larger than O(n). As I said before, its still the most efficient algo I've come across but it doesn't seem to have a practical complexity of O(n). I will still digg a bit deeper and check if I've gone wrong somewhere or may be my understanding of the algorithm is still not as correct as it should be.

Johan Jeuring said...

To rishi: the algorithm is really linear-time. I will publish a much more extensive description of the algorithm and its background in a month time or so.

Abhishek said...

wonderful post. This one really explained it well.

Abhishek said...

wonderful post. This one really explained it well.

rosalie de castro said...

i need your help
i need a program that display a longest palindrome using object oriented programming
i hope you can help me
THANKS A LOT!!!

Johan Jeuring said...

Why don't you have a look at one of the solutions using other languages posted on my more extensive blog on finding palindromes?