I am trying to write a single huge file in Java using multiple threads.
I have tried both the FileWriter and BufferedWriter classes in Java.
The content being written is actually an entire table (Postgres) being read using CopyManager and written. Each line in the file is a single tuple from the table, and I am writing hundreds of lines at a time.
Approach to write:
The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.
Following are the issues I face:
With FileWriter, once in a while I see a single blank line in the file. Any suggestions on how to avoid this data integrity issue?
Writing to a normal file is by definition a serialized operation. You gain no performance by trying to write to it from multiple threads: I/O is a finite, bounded resource with orders of magnitude less bandwidth than even the slowest or most overloaded CPU.
If you have multiple threads that are doing expensive calculations, then you have options. If you are using multiple threads only because you think they will speed something up, you are going to do the opposite. Contention for I/O always slows down access to the resource; it never speeds it up, because of the lock waits and other overhead.
You have to have a critical section that is protected and allows only a single writer at a time. Just look up the source code for any logging writer that supports concurrency and you will see that there is only a single thread that writes to the file.
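A minimal sketch of such a critical section, assuming all threads share one writer object instead of each opening the file in append mode. The class name SynchronizedLineWriter and its methods are made up for illustration; only the java.io and java.nio calls are standard.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: one shared writer whose methods are guarded by a lock.
class SynchronizedLineWriter implements AutoCloseable {
    private final BufferedWriter out;

    SynchronizedLineWriter(Path file) throws IOException {
        this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
    }

    // Critical section: the whole batch is written while holding the lock,
    // so lines from different threads cannot interleave.
    synchronized void writeBatch(List<String> lines) throws IOException {
        for (String line : lines) {
            out.write(line);
            out.newLine();
        }
    }

    @Override
    public synchronized void close() throws IOException {
        out.close();
    }

    // Demo: several threads push one-line batches through the same writer.
    static void demo(Path file, int threads, int linesPerThread) throws Exception {
        try (SynchronizedLineWriter writer = new SynchronizedLineWriter(file)) {
            List<Thread> workers = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int id = t;
                Thread worker = new Thread(() -> {
                    try {
                        for (int i = 0; i < linesPerThread; i++) {
                            writer.writeBatch(Arrays.asList("thread-" + id + ",row-" + i));
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                workers.add(worker);
                worker.start();
            }
            // Wait for every worker before the writer is closed.
            for (Thread w : workers) {
                w.join();
            }
        }
    }
}
```

This keeps the file consistent, but the threads still queue up on the lock, so it does not make the writing itself any faster.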
If your application is primarily:
CPU bound: You can use a lock or a thread-safe data structure so that only one thread out of many writes to the file at a time. As a naive solution this gains nothing from a concurrency standpoint, but if the threads are CPU bound with little I/O it might work.
I/O bound: This is the most common case. Use a message passing system with a queue of some sort: have all the threads post to a queue/buffer and have a single thread pull from it and write to the file. This will be the most scalable solution and the easiest to implement.
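The queue approach could look roughly like this. It is only a sketch: the class name QueuedFileWriter, the queue capacity of 1024, and the end-of-input sentinel are my own choices, not anything from your code. In your case the producers would be the threads reading from CopyManager.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical example: many producers, one thread that owns the file.
class QueuedFileWriter {
    // Sentinel compared by identity; "new String" guarantees that no
    // real line can be the same object.
    private static final String POISON = new String("<<EOF>>");

    static void run(Path file, int producers, int linesPerProducer) throws Exception {
        // Bounded queue: producers block when the writer falls behind.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // The only thread that ever touches the file.
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                for (String line = queue.take(); line != POISON; line = queue.take()) {
                    out.write(line);
                    out.newLine();
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        List<Thread> workers = new ArrayList<>();
        for (int p = 0; p < producers; p++) {
            final int id = p;
            Thread worker = new Thread(() -> {
                try {
                    for (int i = 0; i < linesPerProducer; i++) {
                        queue.put("producer-" + id + ",row-" + i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(worker);
            worker.start();
        }

        // Once all producers are done, tell the writer to stop and wait for it.
        for (Thread w : workers) {
            w.join();
        }
        queue.put(POISON);
        writer.join();
    }
}
```

A real version would also need to handle the writer thread dying on an I/O error, since producers would otherwise block forever on a full queue.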
If you need to create a single super large file where the order of writes is unimportant and the program is CPU bound, you can use a journaling technique.
Have each process write to a separate file and then concatenate the files into a single large file at the end. This is a very old-school, low-tech solution that works well and has for decades.
Obviously the more storage I/O you have the better this will perform on the end concat.
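The final concat step might be as small as this. Again a sketch: PartFileConcat and the idea of passing the part files as a list are assumptions for illustration, and each worker is assumed to have already finished and closed its own part file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for the "separate files, then concat" approach.
class PartFileConcat {
    // Copies each part file, in list order, onto the end of the target.
    // The target is created, or truncated if it already exists.
    static void concat(List<Path> parts, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }
}
```

Each part must end with a line terminator, otherwise the last line of one part and the first line of the next get glued together.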