Issue
One thread pool downloads files from the FTP server, and another thread pool reads the downloaded files.
Both thread pools run concurrently. I'll explain what happens with an example.
Let's assume I have one CSV file with 100 records.
While threadPool-1 is downloading the file and writing it to the /pending folder, threadPool-2 reads that same file. Assume that after 1 second only 10 records have been written to the file in the /pending folder, so threadPool-2 reads only 10 records.
threadPool-2 doesn't know that the other 90 records are still being downloaded, so it never reads them: it has no way to tell whether the whole file has been downloaded. After reading, it moves the file to another folder, so the remaining 90 records are not processed any further.
My question is: how can I wait until the whole file is downloaded, so that only then threadPool-2 reads its contents?
One more thing: both thread pools use the scheduleAtFixedRate method and run every 10 seconds.
Please guide me on this.
Solution
I'm a fan of Mark Rotteveel's #6 suggestion (in comments above):
- use a temporary name when downloading,
- rename when download is complete.
That looks like:
- FTP download threads write all files with some added extension, perhaps `.pending`, but name it whatever you want.
- When a file is downloaded, say `some.pdf`, the FTP download thread writes the file to `some.pdf.pending`.
- When an FTP download thread completes a file, the last step is a file rename operation; this is the mechanism for ensuring only "done" files are ready to be processed. So it downloads the file to `some.pdf.pending`, then at the end renames it to `some.pdf`.
- Reader threads look for files, ignoring anything matching `*.pending`.
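The steps above could be sketched roughly as follows. This is a minimal illustration, not the code from the question: the class name `PendingFiles`, the methods `download` and `listReady`, and the use of a plain `InputStream` in place of a real FTP client are all assumptions made for the example. It relies on `java.nio.file.Files.move` with `ATOMIC_MOVE`, falling back to a regular move if the filesystem does not support atomic renames.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; the names are illustrative only.
class PendingFiles {
    // Suffix marking files that are still being written (pick any name you like).
    static final String PENDING_SUFFIX = ".pending";

    /** Writer side: stream into "name.pending", then rename to "name" as the last step. */
    static Path download(InputStream source, Path dir, String fileName) throws IOException {
        Path temp = dir.resolve(fileName + PENDING_SUFFIX);
        Path done = dir.resolve(fileName);

        // The whole download goes to the temporary name first.
        Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);

        try {
            // Same-directory rename; atomic where the filesystem supports it.
            Files.move(temp, done, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback for filesystems without atomic rename support.
            Files.move(temp, done, StandardCopyOption.REPLACE_EXISTING);
        }
        return done;
    }

    /** Reader side: list regular files, skipping anything still named "*.pending". */
    static List<Path> listReady(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> !p.getFileName().toString().endsWith(PENDING_SUFFIX))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

In a setup like the one in the question, the download tasks would call something like `download(...)` and the reader tasks would only process what `listReady(...)` returns, so a half-written file is never picked up. Keeping the temporary and final names in the same directory matters here, since a rename across filesystems is generally not atomic.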
I've built systems using this approach and they worked out well. In contrast, I've also worked with more complicated systems that tried to coordinate across threads, and those often did not work so well.
Over time, any software system will have bugs. Edsger Dijkstra captured this so well:
"If debugging is the process of removing software bugs, then programming must be the process of putting them in."
However difficult it is to reason about program correctness now – while the program is still in design phase, and has not yet been built – it will be harder to reason about correctness when things are broken in production (which will happen, because bugs). That is, when things are broken and you're under time pressure to find the root cause (and fix it!), even the best of us would be at a disadvantage with a complicated (vs. simple) system.
The approach of using temporary names is simple to reason about, which should minimize code complexity and thus make it easier to implement. In turn, maintenance and bug fixes should be easier, too.
Keep it simple – let the filesystem help you out.
Answered By - Kaan
Answer Checked By - Timothy Miller (JavaFixing Admin)