Tuesday, November 23, 2021

[FIXED] Java Heap space error while checking very very long String

November 23, 2021 heap-memory, java, string

Issue

I have a specified String and I need to check this String for identical parts of specified length. For example if the String is "abcdab" and the length is specified as 2, the identical parts in this string are "ab" (always looking for the most repeated one). I reworked my algorithm 4-5 times for the best performance, but in the end, if the length of the String is 1m+, it throws an Java Heap space error.

So my question is: how to solve the error, maybe there is another way of checking the identical parts, or maybe some other way how to construct the whole algorithm. I figured out 1 possible solution of this, but it works very slowly, so I'm asking only for solutions as fast as my current algorithm or perhaps, even faster ones. Here is the current code:

int length = 2;
String str = "ababkjdklfhcjacajca";
ArrayList<String> h = new ArrayList<String>(); 
h.add(str.substring(0, length));
ArrayList<Integer> contains = new ArrayList<Integer>();
contains.add(1);

String c;
for (int g = 1; g < str.length()-length+1; g++) {
    c = str.substring(g, length+g);
    for (int e = 0; e < h.size(); e++) {
        if (h.get(e).charAt(0) == c.charAt(0) && h.get(e).charAt(length-1) == c.charAt(length-1)) {
            if (h.get(e).equals(c)) {
                contains.set(e, contains.get(e)+1);
                break;
            }
        }
        else if (e+1 == h.size()) {
            h.add(c);
            contains.add(1);
            break;
        }
    }

}

ArrayList h stores every unique part of the String, ArrayList contains represents amount of repeats of that every unique part of the string. String c is the main problem (java heap space points here). It gradually represents each part of the string before it gets stored in the ArrayList h (if that c is unique). After that I'll just find the most repeated ones by using the ArrayLists and print them.

Solution

If you want to do the search efficiently, time and memory wise, I suggest you the following:

First, create a simple character histogram containing the number of occurrences of each character. If a substring’s first character has less occurrences than the most common substring we’ve found so far, we can skip this substring.
Instead of creating substrings which contain copies of the character contents, we use a CharBuffer which wraps the string and adjust its position and limit to represent a subsequence. Of course, we must not modify a buffer once it has been stored as key in our map, so we create a new buffer for each key when it has been stored in the map. So we create at most one CharBuffer for each distinct substring and these buffers still only wrap the String rather then copy any character data

public static Map<String,Integer> mostCommonSubstring(String s, int len) {
    int[] charHistogram = new int[Character.MAX_VALUE+1];
    s.chars().forEach(ch -> charHistogram[ch]++);
    int most = 0;
    HashMap<Buffer, Integer> subStrings = new HashMap<>();
    CharBuffer cb = CharBuffer.wrap(s);
    for(int ix = 0, e = s.length()-len; ix <= e; ix++) {
        if(charHistogram[s.charAt(ix)] < most) continue;
        int num = subStrings.merge(cb.limit(ix+len).position(ix), 1, Integer::sum);
        if(num == 1) cb = CharBuffer.wrap(s);
        if(num > most) most = num;
    }
    final int mostOccurences = most;
    return subStrings.entrySet().stream()
        .filter(e -> e.getValue() == mostOccurences)
        .collect(Collectors.toMap(e -> e.getKey().toString(), Map.Entry::getValue));
}

The first two lines create our histogram

    int[] charHistogram = new int[Character.MAX_VALUE+1];
    s.chars().forEach(ch -> charHistogram[ch]++);

Within the loop

        if(charHistogram[s.charAt(ix)] < most) continue;

checks whether the first character of the current substring has less occurrences than the most common string we’ve found so far and skip the subsequent test in that case.

The next line adapts the current buffer to represent the substring, and updates the map, associating the buffer with 1 if not present or add 1 to the count of the existing mapping.

        int num = subStrings.merge(cb.limit(ix+len).position(ix), 1, Integer::sum);

We use the returned value to detect whether the merge operation has created a new association in the map, which is only the case if the result is one. In that case, we must not modify the buffer afterwards, hence, create a new one

        if(num == 1) cb = CharBuffer.wrap(s);

Then, we use the result to keep track of the highest number of occurrences

        if(num > most) most = num;

The final step after the loop is simple. We already have the highest number of occurrences, filter the map to keep entries with a matching number (there might be a tie) and create a new map, now converting the buffers to String instances as we don’t want to keep references to the original String and it affects only the few result substrings anyway.

    final int mostOccurences = most; // needed because most is not “effectively final”
    return subStrings.entrySet().stream()
        .filter(e -> e.getValue() == mostOccurences)
        .collect(Collectors.toMap(e -> e.getKey().toString(), Map.Entry::getValue));

Answered By - Holger

This Answer collected from stackoverflow and tested by JavaFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, November 23, 2021

[FIXED] Java Heap space error while checking very very long String

Issue

Solution

Popular Posts

Labels