正则表达式分割 camelCase 或 TitleCase (高级)

我找到了一个 太棒了来提取 camelCase 或 TitleCase 表达式的一部分。

 (?<!^)(?=[A-Z])

它的工作原理与预期一致:

  • 价值-> 价值
  • 骆驼值-> 骆驼值/骆驼值
  • TitleValue-> Title/Value

例如 Java:

String s = "loremIpsum";
words = s.split("(?<!^)(?=[A-Z])");
//words equals words = new String[]{"lorem","Ipsum"}

我的问题在于,它在某些情况下不起作用:

  • 案例1: VALUE-> V/A/L/U/E
  • 案例2: eclipseRCPExt-> eclipse/R/C/P/Ext

在我看来,结果应该是:

  • 案例1: 价值
  • 案例2: eclipse/RCP/Ext

换句话说,给定 n 个大写字符:

  • 如果 n 个字符后面跟着小写字符,那么组应该是: (n-1个字符)/(n-th char + 小写字符)
  • 如果 n 个字符在末尾,那么组应该是: (n 个字符)。

对于如何改进这个正则表达式有什么想法吗?

60914 次浏览

The following regex works for all of the above examples:

public static void main(String[] args)
{
for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
System.out.println(w);
}
}

It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" by failing to split between "RPC" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]. This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.

It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCPExt

The only difference from your desired output is with the eclipseRCPExt, which I would argue is correctly split here.

Addendum - Improved version

Note: This answer recently got an upvote and I realized that there is a better way...

By adding a second alternative to the above regex, all of the OP's test cases are correctly split.

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCP / Ext

Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.

Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase

I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own that I've tested and seems to do exactly what you're looking for:

((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

and here's an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acrynoms.
;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)

Here I'm separating each word with a space, so here are some examples of how the string is transformed:

  • ThisIsATitleCASEString => This Is A Title CASE String
  • andThisOneIsCamelCASE => and This One Is Camel CASE

This solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I also came up with this variation to include numbers:

((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

and an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acrynoms and including numbers.
;   (^[a-z]+)                               Match against any lower-case letters at the start of the command.
;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)

And here are some examples of how a string with numbers is transformed with this regex:

  • myVariable123 => my Variable 123
  • my2Variables => my 2 Variables
  • The3rdVariableIsHere => The 3 rdVariable Is Here
  • 12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too

To handle more letters than just A-Z:

s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");

Either:

  • Split after any lowercase letter, that is followed by uppercase letter.

E.g parseXML -> parse, XML.

or

  • Split after any letter, that is followed by upper case letter and lowercase letter.

E.g. XMLParser -> XML, Parser.


In more readable form:

public class SplitCamelCaseTest {


static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";


static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
);


public static String splitCamelCase(String s) {
return SPLIT_CAMEL_CASE.splitAsStream(s)
.collect(joining(" "));
}


@Test
public void testSplitCamelCase() {
assertEquals("Camel Case", splitCamelCase("CamelCase"));
assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
assertEquals("XML Parser", splitCamelCase("XMLParser"));
assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
assertEquals("VALUE", splitCamelCase("VALUE"));
}
}

You can use the expression below for Java:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)

Instead of looking for separators that aren't there you might also considering finding the name components (those are certainly there):

String test = "_eclipse福福RCPExt";


Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);


Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
// matches should be consecutive
if (componentMatcher.start() != endOfLastMatch) {
// do something horrible if you don't want garbage in between


// we're lenient though, any Chinese characters are lucky and get through as group
String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
components.add(startOrInBetween);
}
components.add(componentMatcher.group(1));
endOfLastMatch = componentMatcher.end();
}


if (endOfLastMatch != test.length()) {
String end = test.substring(endOfLastMatch, componentMatcher.start());
components.add(end);
}


System.out.println(components);

This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.

Brief

Both top answers here provide code using positive lookbehinds, which, is not supported by all regex flavours. The regex below will capture both PascalCase and camelCase and can be used in multiple languages.

Note: I do realize this question is regarding Java, however, I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question for the same.

Code

See this regex in use here

([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Results

Sample Input

eclipseRCPExt


SomethingIsWrittenHere


TEXTIsWrittenHERE


VALUE


loremIpsum

Sample Output

eclipse
RCP
Ext


Something
Is
Written
Here


TEXT
Is
Written
HERE


VALUE


lorem
Ipsum

Explanation

  • Match one or more uppercase alpha character [A-Z]+
  • Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
  • Ensure what follows is an uppercase alpha character [A-Z] or word boundary character \b

I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels, above, works with the Microsoft flavour of regex.

I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).

This able to split strings such as:

DrivingB2BTradeIn2019Onwards

to

Driving B2B Trade in 2019 Onwards

You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.

A JavaScript Solution

/**
* howToDoThis ===> ["", "how", "To", "Do", "This"]
* @param word word to be split
*/
export const splitCamelCaseWords = (word: string) => {
if (typeof word !== 'string') return [];
return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};