Issue
I have a specified String and I need to check this String for identical parts of specified length. For example if the String is "abcdab" and the length is specified as 2, the identical parts in this string are "ab" (always looking for the most repeated one). I reworked my algorithm 4-5 times for the best performance, but in the end, if the length of the String is 1m+, it throws an Java Heap space error.
So my question is: how to solve the error, maybe there is another way of checking the identical parts, or maybe some other way how to construct the whole algorithm. I figured out 1 possible solution of this, but it works very slowly, so I'm asking only for solutions as fast as my current algorithm or perhaps, even faster ones. Here is the current code:
int length = 2;
String str = "ababkjdklfhcjacajca";
ArrayList<String> h = new ArrayList<String>();
h.add(str.substring(0, length));
ArrayList<Integer> contains = new ArrayList<Integer>();
contains.add(1);
String c;
for (int g = 1; g < str.length()-length+1; g++) {
c = str.substring(g, length+g);
for (int e = 0; e < h.size(); e++) {
if (h.get(e).charAt(0) == c.charAt(0) && h.get(e).charAt(length-1) == c.charAt(length-1)) {
if (h.get(e).equals(c)) {
contains.set(e, contains.get(e)+1);
break;
}
}
else if (e+1 == h.size()) {
h.add(c);
contains.add(1);
break;
}
}
}
ArrayList h
stores every unique part of the String, ArrayList contains represents amount of repeats of that every unique part of the string.
String c
is the main problem (java heap space points here). It gradually represents each part of the string before it gets stored in the ArrayList h
(if that c
is unique).
After that I'll just find the most repeated ones by using the ArrayLists
and print them.
Solution
If you want to do the search efficiently, time and memory wise, I suggest you the following:
First, create a simple character histogram containing the number of occurrences of each character. If a substring’s first character has less occurrences than the most common substring we’ve found so far, we can skip this substring.
Instead of creating substrings which contain copies of the character contents, we use a
CharBuffer
which wraps the string and adjust itsposition
andlimit
to represent a subsequence. Of course, we must not modify a buffer once it has been stored as key in our map, so we create a new buffer for each key when it has been stored in the map. So we create at most oneCharBuffer
for each distinct substring and these buffers still only wrap theString
rather then copy any character data
public static Map<String,Integer> mostCommonSubstring(String s, int len) {
int[] charHistogram = new int[Character.MAX_VALUE+1];
s.chars().forEach(ch -> charHistogram[ch]++);
int most = 0;
HashMap<Buffer, Integer> subStrings = new HashMap<>();
CharBuffer cb = CharBuffer.wrap(s);
for(int ix = 0, e = s.length()-len; ix <= e; ix++) {
if(charHistogram[s.charAt(ix)] < most) continue;
int num = subStrings.merge(cb.limit(ix+len).position(ix), 1, Integer::sum);
if(num == 1) cb = CharBuffer.wrap(s);
if(num > most) most = num;
}
final int mostOccurences = most;
return subStrings.entrySet().stream()
.filter(e -> e.getValue() == mostOccurences)
.collect(Collectors.toMap(e -> e.getKey().toString(), Map.Entry::getValue));
}
The first two lines create our histogram
int[] charHistogram = new int[Character.MAX_VALUE+1];
s.chars().forEach(ch -> charHistogram[ch]++);
Within the loop
if(charHistogram[s.charAt(ix)] < most) continue;
checks whether the first character of the current substring has less occurrences than the most common string we’ve found so far and skip the subsequent test in that case.
The next line adapts the current buffer to represent the substring, and updates the map, associating the buffer with 1
if not present or add 1
to the count of the existing mapping.
int num = subStrings.merge(cb.limit(ix+len).position(ix), 1, Integer::sum);
We use the returned value to detect whether the merge
operation has created a new association in the map, which is only the case if the result is one. In that case, we must not modify the buffer afterwards, hence, create a new one
if(num == 1) cb = CharBuffer.wrap(s);
Then, we use the result to keep track of the highest number of occurrences
if(num > most) most = num;
The final step after the loop is simple. We already have the highest number of occurrences, filter the map to keep entries with a matching number (there might be a tie) and create a new map, now converting the buffers to String
instances as we don’t want to keep references to the original String
and it affects only the few result substrings anyway.
final int mostOccurences = most; // needed because most is not “effectively final”
return subStrings.entrySet().stream()
.filter(e -> e.getValue() == mostOccurences)
.collect(Collectors.toMap(e -> e.getKey().toString(), Map.Entry::getValue));
Answered By - Holger