Issue
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern compile = Pattern.compile(
                "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");
        Matcher matcher = compile.matcher("i5-2450M");
        matcher.find();
        // group(0) is the text matched by the whole expression
        System.out.println(matcher.group(0));
    }
}
I assume this should return i5-2450M, but it actually returns i5.
Solution
The problem is that the first alternative that matches is used. In this case the 2nd alternative ([A-Za-z][0-9]{1,}, which matches i5) "shadows" every alternative that follows it.
// doesn't match
[0-9]{1,}[A-Za-z]{1,}|
// matches "i5"
[A-Za-z][0-9]{1,}|
// the following are never even checked, because of the previous match
[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|
[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|
[0-9][0-9\\-]{4,}|
[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+
(Please note that the regular expression in the post likely has other serious issues that should be addressed; for instance, 0---# would be matched by the last rule. They are not covered below because they are not the "fundamental" problem, which is the alternation behavior.)
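The note about 0---# can be checked by testing the last alternative on its own. This is only a small sketch; the class name LastRuleCheck and the helper matchesLastRule are made up for illustration and are not part of the original post.

```java
import java.util.regex.Pattern;

class LastRuleCheck {
    // The last alternative of the original expression, taken on its own.
    static final Pattern LAST_RULE =
            Pattern.compile("[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");

    // Hypothetical helper: true when the whole input is matched by the last alternative alone.
    static boolean matchesLastRule(String input) {
        return LAST_RULE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesLastRule("0---#")); // true
    }
}
```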
To fix this issue, order the alternatives so that the most specific ones come first. In this case, that means moving the 2nd alternative below all of the others. (Also review the other alternatives and how they interact; perhaps the entire regular expression can be simplified?)
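As a rough sketch of that reordering, the version below keeps every alternative from the question and only moves [A-Za-z][0-9]{1,} to the end. The class name ReorderedAlternation and the helper firstMatch are illustrative names, not something from the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReorderedAlternation {
    // Same alternatives as the original; the short [A-Za-z][0-9]{1,} rule is now last.
    static final Pattern REORDERED = Pattern.compile(
            "[0-9]{1,}[A-Za-z]{1,}"
            + "|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*"
            + "|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+"
            + "|[A-Za-z][0-9]{1,}");

    // Hypothetical helper: returns the first match found in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = REORDERED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("i5-2450M")); // i5-2450M
        System.out.println(firstMatch("a1"));       // a1 (still handled by the moved rule)
    }
}
```

Short inputs such as a1 still match, because the moved alternative is tried once the longer ones have failed.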
A simple word boundary (\b) will not work here because - is considered a non-word character. However, depending upon the meaning of the regular expression, anchors (^ and $) could be used around the alternation, e.g. ^(?:existing_regex)$. The group is needed because | has the lowest precedence; without it, ^ would apply only to the first alternative and $ only to the last. Anchoring doesn't change the behavior of the alternation, but it causes the initial match of i5 to be backtracked, so that the subsequent alternatives are considered, because the end of input cannot be matched immediately after the alternation group.
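A minimal sketch of the anchored variant, assuming the whole input is meant to be a single token. The alternation is wrapped in a non-capturing group so that ^ and $ apply to every alternative rather than only the first and last; the class name AnchoredAlternation and the helper firstMatch are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredAlternation {
    // The expression from the question, unchanged.
    static final String ORIGINAL =
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+";

    // Anchors around a non-capturing group force the match to span the entire input.
    static final Pattern ANCHORED = Pattern.compile("^(?:" + ORIGINAL + ")$");

    // Hypothetical helper: returns the match, or null if the input as a whole does not match.
    static String firstMatch(String input) {
        Matcher m = ANCHORED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("i5-2450M")); // i5-2450M
        System.out.println(firstMatch("i5"));       // i5
    }
}
```

Calling Matcher.matches() on the unanchored pattern has the same effect, since matches() only succeeds when the entire input is matched.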
From Java regex alternation operator "|" behavior seems broken:
Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.
(The accepted answer in this question uses word boundaries.)
From Pattern:
The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
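The ordered behavior described in both quotes can be seen with a much smaller pattern than the one in the question. In this sketch the class name OrderedAlternationDemo and the helper firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedAlternationDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The leftmost alternative that succeeds wins, even though a longer one exists.
        System.out.println(firstMatch("a|ab", "ab")); // a
        // Listing the longer alternative first yields the longer match.
        System.out.println(firstMatch("ab|a", "ab")); // ab
    }
}
```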
Answered By - user166390
Answer Checked By - Mildred Charles (JavaFixing Admin)