views: 79 · answers: 5
I currently have a Java program that spawns 50 threads. The goal is to watch a directory that has many files being written to it, upload those files to an FTP server, and then remove them. Right now I have a very hacky approach: each thread loops through the directory and sets a lock in a ConcurrentMap to record that it is already processing a given file, to prevent duplicate work. It's working, but it just doesn't seem right.

So the question is: in Java, what is the preferred way of watching a directory in a multithreaded program and making sure each thread only operates on a file that no other thread has claimed?

Update: I was considering creating a thread pool, with the caveat that each thread holds an FTP client connection that I'll have to keep open and keep from timing out.

Update: What about using the WatchService API from http://download.oracle.com/javase/tutorial/essential/io/notification.html ?
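For reference, the WatchService API behind that tutorial link can replace directory polling entirely. A minimal sketch, assuming Java 7+ (the `awaitNewFile` helper, class name, and timeout handling are my own illustration, not from the tutorial):

```java
import java.nio.file.*;
import java.util.concurrent.TimeUnit;

public class DirWatcher {
    // Blocks until a file is created in dir (or the timeout elapses); returns
    // the new file's path, or null on timeout. One-shot helper for illustration.
    static Path awaitNewFile(Path dir, long timeoutSeconds) throws Exception {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            // Only ENTRY_CREATE matters here: uploads start when a file appears.
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
            if (key == null) {
                return null; // nothing appeared within the timeout
            }
            for (WatchEvent<?> event : key.pollEvents()) {
                return dir.resolve((Path) event.context()); // first created file
            }
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        Path created = awaitNewFile(dir, 60);
        System.out.println(created == null ? "timed out" : "new file: " + created);
    }
}
```

In a real setup you would loop on `watcher.take()` and hand each created file to a worker pool rather than returning after the first event.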

+1  A: 

Maybe having a master thread searching the directory and giving tasks out to the worker threads?

Frank
That's the solution I was leaning towards. I was just wondering what the best way would be to keep those FTP client connections alive in the worker threads.
Maybe via a connection pool, from which a thread can take a connection and put it back when finished? Connection pools need not be for DB connections only.
Frank
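The connection-pool idea from this exchange can be sketched with a BlockingQueue. The pool is generic here because the actual FTP client class (e.g. Apache Commons Net's FTPClient) is up to you; this is only a sketch of the borrow/release mechanics:

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A tiny generic pool: each worker borrows a connection, uses it, and returns it.
// Whatever FTP client type you use stands in for T.
public class ConnectionPool<T> {
    private final BlockingQueue<T> available;

    public ConnectionPool(Collection<T> connections) {
        // Capacity must be at least 1 for ArrayBlockingQueue.
        this.available = new ArrayBlockingQueue<>(
                Math.max(1, connections.size()), false, connections);
    }

    public T borrow() throws InterruptedException {
        return available.take();     // blocks until a connection is free
    }

    public void release(T connection) {
        available.offer(connection); // put it back for the next worker
    }
}
```

Workers should pair `borrow()` and `release()` in a try/finally block; keeping the pooled connections from timing out (e.g. a periodic NOOP) would be a separate background task.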
+5  A: 

Use an ExecutorService to decouple the submission of work to the threads from the threading logic itself (also take a look at the docs for the parent interface Executor to learn a bit more about their purpose).

With an ExecutorService, you simply feed work (in your case, a file) to it and threads will pick up work as they become available. There are many options and flavors of ExecutorServices you can configure: single-threaded, a maximum number of threads, unbounded thread pool, etc.

matt b
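A sketch of this pattern, with the FTP work reduced to a `Consumer` so the threading part stands alone (`uploadAll`, the class name, and the pool size of 50 are illustrative, not from the answer):

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class UploadDispatcher {
    // Submits one task per file and waits for all of them to finish.
    static void uploadAll(ExecutorService pool, File[] files, Consumer<File> uploader)
            throws InterruptedException {
        for (File f : files) {
            pool.submit(() -> uploader.accept(f)); // each file goes to exactly one worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // bound the wait; tune to your workload
    }

    public static void main(String[] args) throws Exception {
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] files = dir.listFiles(File::isFile);
        if (files != null) {
            uploadAll(Executors.newFixedThreadPool(50), files,
                      // Stand-in for the real FTP upload + delete logic.
                      f -> System.out.println("uploading " + f.getName()));
        }
    }
}
```

Because the main thread is the only one listing the directory and submitting tasks, no file is ever handed to two workers, which removes the need for the ConcurrentMap locking in the question.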
Thanks Matt. If I use a main thread to submit files, what is the best way to make sure each thread only gets one file? A thread may fail, in which case I'd want the file to get picked back up by another thread, so I'd still need some sort of list to keep track of which files are in process so they don't get placed into more than one thread at a time.
@beagleguy: Exactly how do they fail?
R. Bemrose
@beagleguy - ThreadPoolExecutor defines the method afterExecute(Runnable r, Throwable t), which you can use as a hook to put a failed file back in the queue.
Darren Gilroy
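A sketch of that afterExecute hook. One caveat worth knowing: the Throwable is only non-null for tasks started with execute(); submit() wraps tasks in a FutureTask that captures the exception instead, so t would be null there. The unlimited-retry policy below is a simplification of the suggestion:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Resubmits any task whose run() threw. A real version would cap the number
// of attempts or delay before requeueing, rather than retrying forever.
public class RetryingExecutor extends ThreadPoolExecutor {
    public RetryingExecutor(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        super.afterExecute(r, t);
        if (t != null && !isShutdown()) {
            execute(r); // task failed: put it back on the queue
        }
    }
}
```

Combined with the main-thread dispatcher above the map, this covers the "a thread may fail" concern without any shared bookkeeping list.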
A: 

I would set up a file-handler class that accepts a directory and has a thread-safe nextFile() method that hands out the next file in the directory. That way every thread asks for a file and every thread gets a unique file.

Raynos
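That nextFile idea maps naturally onto a concurrent queue, which avoids explicit locking: poll() is atomic, so no two threads can receive the same file. A sketch, with the class and method names chosen to match the description above:

```java
import java.io.File;
import java.util.Collections;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hands each file in a directory to exactly one caller.
public class FileHandler {
    private final ConcurrentLinkedQueue<File> files = new ConcurrentLinkedQueue<>();

    public FileHandler(File dir) {
        File[] listed = dir.listFiles(File::isFile);
        if (listed != null) {
            Collections.addAll(files, listed);
        }
    }

    // Returns the next unclaimed file, or null once the snapshot is exhausted.
    public File nextFile() {
        return files.poll();
    }
}
```

Note this takes a one-time snapshot of the directory; to pick up files that arrive later, the queue would have to be refilled, e.g. by a watcher thread.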
+1  A: 

IMO, it's asking for trouble to try to write something that does this yourself. There are so many nuances to parallel batch processing that it's best to learn the API of a framework that does it for you.

In the past I've used both Spring Batch (which is open source) and Flux (which requires a license). They'll both allow you to configure jobs that watch a directory for files, and then process those files in a parallel way. As long as you're willing to invest the time in learning their APIs, then you don't need to worry about synchronization on which process is handling which files.

Just a quick note on pros/cons of Spring Batch vs Flux:

  • Spring Batch is mostly XML configuration, while Flux has a nice GUI designer
  • If you're already familiar with the Spring framework, then Batch will come more naturally. (Otherwise, their documentation is a great starting point for the basic use cases)
  • Spring Batch requires scheduling to be done from the outside (usually with Quartz), while Flux includes scheduling
  • Flux is better (and IMO more intuitive) for things like monitoring a directory/FTP/SFTP/email to kick off a job

I'm sure there are other frameworks that do this too... those are just the two I'm familiar with.

Michael D
A: 

Does the solution really need to be multithreaded? Unless the maximum upload speed to the destination FTP server is limited per connection, surely it'd be easier to send them one at a time?

Sending 50 files of 1MB sequentially at 1Mbps (assumed max upload speed) over a single FTP connection would be no slower than sending the same 50 files concurrently at ~20Kbps with 50 FTP connections, wouldn't it?

ninesided