Issue
The Javadoc for java.util.regex.Matcher.find()
says:
This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.
My own experiments suggest that if the regex matches a zero length string, then the method starts one character beyond the end of the previous (zero-length) match. For example, given an input string "abcabc" and a regex "a|b?", successive matches will occur at positions (0, 1, 2, 3, 4, 5, 6). Following a successful match at position 6, there is an unsuccessful match at position 6; and further calls also return false with position remaining at 6.
The documentation suggests that after finding a zero-length match at position 2 (character "c"), the next call on find will start at "the first character not matched", which is still at position 2.
The documentation also suggests that after an unsuccessful match at position 6, the next call should start at position 0, which doesn't appear to be the case.
Is the documentation simply wrong?
Is there a more precise rule somewhere, for example one that covers the behaviour at position 6?
My reason for asking is that I am trying to write an emulation of the method in C#, and I'm having trouble reproducing the precise behaviour by trial and error.
Solution
This is a special case handled by Matcher.find()
:
public boolean find() {
int nextSearchIndex = last;
if (nextSearchIndex == first)
nextSearchIndex++;
// If next search starts before region, start it at region
if (nextSearchIndex < from)
nextSearchIndex = from;
// If next search starts beyond region then it fails
if (nextSearchIndex > to) {
for (int i = 0; i < groups.length; i++)
groups[i] = -1;
return false;
}
return search(nextSearchIndex);
}
If the last match was a zero-length match (indicated by first == last
) then it starts the next search at last + 1
.
If you think about it this makes sense. Without this, if the Matcher
once finds a zero-length match it would never proceed over this match.
So yes, the documentation is incomplete: it neither mentions the special case for zero length matches nor that it only ever passes once over the input string.
But IMHO both special cases are implied by "Attempts to find the next subsequence of the input sequence that matches the pattern."
After a zero length match the next possible subsequence must start one position after the previous match, and the next subsequence can never start at a position lower than the current position (i.e. start over from the beginning).
Answered By - Thomas Kläger
Answer Checked By - Marilyn (JavaFixing Volunteer)