
replaced http://tools.ietf.org/html/rfc with https://www.rfc-editor.org/rfc/rfc


edited Oct 7, 2021 at 6:47

The code that you've written is generally really good but, as you seem to have found, parsing non-trivial strings gets complicated quickly and leaves all sorts of room for nasty edge cases.

Your remove_comments method appears not to account for nested comments, which are explicitly allowed by the RFC.

As expected, remove_comments("Hello (new) world") returns "Hello world", but when I ran it, remove_comments("Hello (new (old) ish) world") returned 'Hello ish) world'.

Removing nested comments with regular expressions is hard; indeed, with a purist view of regular expressions, it's impossible, because matching balanced brackets to arbitrary depth is beyond what a regular language can express. Basically, to do this you need a recursive regex, which Python's built-in re module does not support.
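To make the failure concrete, here is a small sketch using a non-recursive pattern. The pattern and the name naive_remove_comments are my own assumptions for illustration, not the code under review; it just happens to reproduce the same symptom.

```python
import re

# Hypothetical pattern, NOT the reviewed code: an opening bracket, anything
# that is not a closing bracket, the first closing bracket, trailing spaces.
NAIVE_COMMENT = re.compile(r"\([^)]*\)\s*")


def naive_remove_comments(text: str) -> str:
    """Remove comments without any notion of nesting depth."""
    return NAIVE_COMMENT.sub("", text)


print(naive_remove_comments("Hello (new) world"))
print(naive_remove_comments("Hello (new (old) ish) world"))
```

The match stops at the first closing bracket it sees, so the tail of the outer comment is left behind in the second call.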

In this particular case, it shouldn't be too hard for you to roll your own comment remover: all you really need to do is iterate over the string, keeping track of the number of brackets currently open. That seems like a reasonable goal for a next iteration.
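A minimal sketch of that bracket-counting idea might look like the following. The name strip_nested_comments is hypothetical, and it deliberately ignores quoted strings and escaped characters.

```python
def strip_nested_comments(text: str) -> str:
    """Drop everything inside (possibly nested) round brackets.

    Illustrative sketch only: quoted strings and backslash escapes are
    not handled, and surrounding whitespace is left untouched.
    """
    depth = 0  # number of brackets currently open
    kept = []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            # Outside any comment (this also keeps a stray closing bracket).
            kept.append(ch)
    return "".join(kept)


print(strip_nested_comments("Hello (new (old) ish) world"))
```

Note that this leaves both of the spaces that surrounded the comment, so you would still need to decide how to tidy the whitespace if you want the result to read "Hello world".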

You may find, though, that this gets unmanageable quite quickly when you try to account for quoted strings and escaped characters: how would you write your parser such that it parses foo"\")"("")@example.com down to foo")@example.com? If you really want to hit as many pathological edge cases as possible, I'd suggest learning about formal languages and parsers, then digging out a parser library for Python to help you build your own. The Python Wiki lists several, and this one in particular looks pretty nice, though I haven't tried to use it myself.
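To give a feel for how quickly the hand-rolled approach grows, here is a rough sketch that also tracks quoted strings and backslash escapes. The function name and the exact rules are my own assumptions, not a faithful implementation of the RFC grammar; for instance, it honours a backslash escape anywhere, not only inside quotes and comments.

```python
def strip_comments_and_unquote(text: str) -> str:
    """Remove nested comments, then unquote and unescape what remains.

    Hypothetical sketch: it covers the example above, not the full RFC.
    """
    out = []
    depth = 0          # nesting depth of open comments
    in_quotes = False  # inside a double-quoted string?
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # A backslash escapes the next character; keep that character
            # only when we are not inside a comment.
            if depth == 0:
                out.append(text[i + 1])
            i += 2
            continue
        if in_quotes:
            if ch == '"':
                in_quotes = False
            else:
                out.append(ch)  # brackets inside quotes are literal
        elif depth > 0:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            # anything else inside a comment is discarded
        elif ch == '"':
            in_quotes = True
        elif ch == "(":
            depth = 1
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(strip_comments_and_unquote(r'foo"\")"("")@example.com'))
```

Even this small state machine has three interacting pieces of state, which is a fair hint that a proper parser library will be easier to get right.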



created Jan 23, 2016 at 22:58

ymbirtt
