Thursday, July 14, 2011

Combinatory Regular Expressions in Haskell

Combinatory parsing fascinates me. It is an area that has open problems, but also a lot of solved problems that make great exercises. A while ago I tried to define a combinatory regular expression matcher in F#, and was puzzled at how complex it turned out to be. As usual, marginal understanding was to blame.

Now I turned to Haskell for a change. I do think it helped, to learn more both about the problem and about Haskell. I hope to backport this exercise to the ML world, in the meanwhile here is the gist:

10 comments:

  1. That's funny, I was thinking on the same problem recently, and it developed into the regex-applicative package. https://github.com/feuerbach/regex-applicative

    I will also evaluate your approach and let you know what I think.

    ReplyDelete
  2. I've taken a look at this. The problem here is that you don't "merge" the states back. Thus, the code suffers from a combinatorial explosion.

    Example:
    test n = run (compile $ length <$> (many $ string "a" <|> string "aa")) $ replicate n 'a'

    Test> test 20
    Just 20
    (0.13 secs, 18891204 bytes)
    Test> test 22
    Just 22
    (0.38 secs, 44580336 bytes)
    Test> test 24
    Just 24
    (1.05 secs, 112208144 bytes)
    Test> test 25
    Just 25
    (1.82 secs, 179842856 bytes)
    Test> test 26
    Just 26
    (2.50 secs, 289945732 bytes)
    Test> test 27
    Just 27
    (4.44 secs, 467622444 bytes)

    (The code was compiled with -O2)

    ReplyDelete
  3. Yes, another good test it fails is

    https://gist.github.com/1087579

    ReplyDelete
  4. Hi Roman,

    I am still thinking about this problem and trying it out in both ML and Haskell.

    I have also tried out your implementation in regex-applicative.

    Unfortunately it looks like your implementation (as well as all my attempts so far) have exponential behavior on testTNFA benchmark in the above gist. I will be glad to be proven otherwise since I might be measuring it wrong.

    The testTNFA benchmark comes from this article:

    http://swtch.com/~rsc/regexp/regexp1.html

    Is inevitable for our a combinatory implementation to have this exponential behavior on testTNFA I wonder..

    Otherwise your work looks really cool!

    ReplyDelete
  5. Anton,

    how did you measure it?

    I ran that test and regex-applicative passed it. The details are here: https://github.com/feuerbach/regex-applicative/wiki/Exptest

    ReplyDelete
  6. Hi!

    The same test you link to with n=1000 takes minutes to compute on my machine in GHCI.. Is it because I am missing -O2 or something? Frankly I have no idea how to benchmark Haskell.

    But you are absolutely right, not merging the states as I do in the gist above won't do at all. Now I am curious how you merge them, will read into your as soon as I get some time for that.

    Thanks,

    A

    ReplyDelete
  7. I have GHC 6.12.3 btw and I had to relax the constraint on `base` to build regex-applicative, in case this is relevant..

    ReplyDelete
  8. Silly me, the time must be polynomial not exponential so yes that warrants a pass.

    ReplyDelete
  9. Right. Ideally, it would be linear, and I'm working on that at the moment. Still, it's clearly not exponential.

    If you have tested the package with some old packages, please let me know the exact versions and I'll relax the constraints in the next release.

    ReplyDelete
  10. I have implemented state sharing, now instead of exponential I also get polynomial behavior. Micro-benchmarks seem only 10-100x slower than your regex-applicative, which makes me quite happy. Thanks!

    ReplyDelete