What class of languages do real modern regexes actually recognise?
Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_\1
) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like S ::= '(' S ')' | ε
— the context-free language of matching pairs of parens.
Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.
Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, of LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG and the opposite?
Links to relevant papers would be much appreciated.