Issue
My current way of generating MD5 hashes of all files under a root directory, up to a given depth, is shown below.
As of now, it takes about 10 seconds (on an old Intel Core i3 CPU) to process approximately 300 images, each 5-10 MB in size on average. The parallel option on the stream does not help: with or without it, the time remains more or less the same. How can I make this faster?
Files.walk(Path.of(rootDir), depth)
.parallel() // doesn't help, time appx same as without parallel
.filter(path -> !Files.isDirectory(path)) // skip directories
.map(FileHash::getHash)
.collect(Collectors.toList());
The getHash method used above produces a comma-separated hash,<full file path> output line for each file processed in the stream.
public static String getHash(Path path) {
MessageDigest md5 = null;
try {
md5 = MessageDigest.getInstance("MD5");
md5.update(Files.readAllBytes(path));
} catch (Exception e) {
e.printStackTrace();
}
byte[] digest = md5.digest();
String hash = DatatypeConverter.printHexBinary(digest).toUpperCase();
return String.format("%s,%s", hash, path.toAbsolutePath());
}
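For reference, a variant of getHash that streams each file through a DigestInputStream instead of loading it fully with readAllBytes would look roughly like the sketch below; the 64 KB buffer size and the plain String.format hex conversion are illustrative choices, not part of the original code.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public static String getHashStreaming(Path path) {
    try {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Reading through a DigestInputStream updates the digest incrementally,
        // so the whole file never has to sit in memory at once.
        try (InputStream in = new DigestInputStream(Files.newInputStream(path), md5)) {
            byte[] buffer = new byte[64 * 1024];
            while (in.read(buffer) != -1) {
                // the read calls drive the digest; nothing else to do here
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02X", b));
        }
        return String.format("%s,%s", hex, path.toAbsolutePath());
    } catch (Exception e) {
        e.printStackTrace();
        return String.format(",%s", path.toAbsolutePath());
    }
}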
Solution
The stream returned by Files.walk(Path.of(rootDir), depth) cannot be parallelized efficiently: it has no known size, so it is difficult to split it into slices for parallel processing.
In your case, to improve performance you need to collect the result of Files.walk(...) into a list first, and then run the hashing on a parallel stream over that list.
So you have to do:
Files.walk(Path.of(rootDir), depth)
.filter(path -> !Files.isDirectory(path)) // skip directories
.collect(Collectors.toList())
.stream()
.parallel() // in my computer divide the time needed by 5 (8 core cpu and SSD disk)
.map(FileHash::getHash)
.collect(Collectors.toList());
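Putting the pieces together, a minimal self-contained sketch of this approach could look like the following. The FileHashes class name and the command-line argument handling are illustrative, and getHash is reproduced here with a plain String.format hex conversion instead of DatatypeConverter so the example compiles without the javax.xml.bind dependency. Note that the stream returned by Files.walk should be closed, hence the try-with-resources.
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileHashes {

    public static void main(String[] args) throws Exception {
        String rootDir = args[0];              // root directory to scan (illustrative CLI handling)
        int depth = Integer.parseInt(args[1]); // maximum depth for Files.walk

        List<String> lines;
        // Files.walk holds open directory handles, so close the stream when done.
        try (Stream<Path> paths = Files.walk(Path.of(rootDir), depth)) {
            lines = paths
                    .filter(path -> !Files.isDirectory(path)) // skip directories
                    .collect(Collectors.toList())             // materialize the paths first...
                    .stream()
                    .parallel()                               // ...so the parallel stream can split the work evenly
                    .map(FileHashes::getHash)
                    .collect(Collectors.toList());
        }
        lines.forEach(System.out::println);
    }

    public static String getHash(Path path) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(Files.readAllBytes(path));
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02X", b));
            }
            return String.format("%s,%s", hex, path.toAbsolutePath());
        } catch (Exception e) {
            e.printStackTrace();
            return String.format(",%s", path.toAbsolutePath());
        }
    }
}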
Answered By - Olivier Pellier-Cuit
Answer Checked By - Mildred Charles (JavaFixing Admin)