I have a multi-threaded Perl script which does the following:
1) One boss thread searches through a folder structure on an external server. For each file it finds, it adds its path/name to a thread queue. If the path/file is already in the queue, or being processed by the worker threads, the enqueuing is skipped.
2) A dozen worker threads dequeue from the above queue, process the files, and remove them from the hard disk.
It runs on a single physical server, and everything works fine.
Now I want to add a second server, which will work concurrently with the first one, searching through the same folder structure, looking for files to enqueue/process. I need a means to make both servers aware of what each one is doing, so that they don't process the same files. The queue is minimal, ranging from 20 to 100 items. The list is very dynamic and changes many times per second.
Should I simply write to and read from a regular file to keep both servers in sync about the current list of items? Any ideas?
I would be very wary of using a regular file - it'll be difficult to manage locking and caching semantics.
IPC is a big and difficult topic, and when you're doing server to server - it can get very messy indeed. You'll need to think about much more complicated scenarios, like 'what if host A crashes with partial processing'.
So first off I would suggest you need to (if at all possible) make your process idempotent. Specifically - set it up so IF both servers do end up processing the same things, then no harm is done - it's 'just' inefficient.
I can't tell you exactly how to do this for your workload, but the general approach is to permit duplication of effort and discard the duplicate result.
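As a rough illustration of what "permit and discard" might look like, here is a minimal sketch. The helper name process_once and the $process callback are made up for the example and stand in for whatever your workers really do. It assumes the processing itself is safe to repeat, and it treats "file already gone" as "the other host got there first" rather than as an error.

```perl
use strict;
use warnings;
use Errno;    # enables the $!{ENOENT} check below

# Hypothetical helper: handle one file so that duplicated effort is harmless.
# $process is a caller-supplied code ref standing in for the real work.
# Returns 'done' if this host did the work, 'skipped' if the file was gone.
sub process_once {
    my ($path, $process) = @_;

    # The other server may already have processed and removed the file.
    return 'skipped' unless -e $path;

    $process->($path);

    # A file that vanished while we were working is not an error:
    # it only means the other host finished first.
    unless (unlink $path) {
        die "Cannot remove $path: $!" unless $!{ENOENT};
    }
    return 'done';
}
```

There is still a window between the existence check and the processing step, so the callback has to cope with the file disappearing underneath it. The point is only that a duplicate is cheap to throw away.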
As for synchronising your two processes on different servers, I don't think a file will do the trick. Shared-filesystem IPC is not really suitable for a near-real-time sort of operation because of client-side caching: with default mount options, an NFS client can cache file attributes for anywhere up to about 60s, so one host may not see the other's update for that long.
I would suggest that you think in terms of sockets - they're a fairly standard way of doing server-to-server IPC. As you already check 'pending' items in the queue, you could extend that check to query the other host before enqueuing (note - consider what you'll do if it's offline or otherwise unreachable).
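A minimal sketch of the client side of such a query, using the core IO::Socket::INET and IO::Select modules. The one-line "HAS <path>" request and "YES"/"NO" reply are a protocol invented for this example, as is the function name peer_has. Each host would also need a small listener thread answering these requests from its own pending list, which is not shown.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Ask the other host whether it already has $path queued or in progress.
# Returns 1 (peer has it), 0 (peer does not), or undef (peer unreachable
# or silent). The "HAS"/"YES"/"NO" protocol is an assumption of this sketch.
sub peer_has {
    my ($host, $port, $path) = @_;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 2,    # seconds to wait for the connection
    ) or return undef;

    print {$sock} "HAS $path\n";

    # Timeout above only covers connecting, so bound the read separately.
    my @ready = IO::Select->new($sock)->can_read(2);
    unless (@ready) {
        close $sock;
        return undef;
    }

    my $reply = <$sock>;
    close $sock;

    return undef unless defined $reply;
    return $reply =~ /^YES/ ? 1 : 0;
}
```

The undef case is where you have to pick a policy. If processing is idempotent, the simplest choice is to enqueue anyway when the peer can't be reached, and accept the occasional duplicate.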
The caveat here is that parallelism works better the less IPC is going on. Talking across a network is generally a bit faster than talking to a disk, but it's considerably slower than the speed at which a processor runs. So if you can work out some sort of caching/locking mechanism where you don't need to update for each and every file, it'll run much better.
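One way to cut down the chatter, sketched under the assumption that you can fetch the peer's whole pending list in one go: cache that list for a short time and answer lookups locally until it expires. The name make_peer_cache and the $fetch callback are hypothetical. $fetch stands in for whatever network call returns the other host's current paths.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Build a lookup that caches the peer's pending list for $ttl seconds.
# $fetch is a caller-supplied code ref (hypothetical) returning the list
# of paths the other host currently has queued or in progress.
sub make_peer_cache {
    my ($fetch, $ttl) = @_;
    my %pending;
    my $fetched_at;    # undef until the first fetch

    return sub {
        my ($path) = @_;

        # Refresh in bulk only when the cached copy is missing or stale.
        if (!defined $fetched_at || time() - $fetched_at > $ttl) {
            %pending    = map { $_ => 1 } $fetch->();
            $fetched_at = time();
        }
        return exists $pending{$path} ? 1 : 0;
    };
}
```

The trade-off is that anything the peer enqueued after the last refresh is invisible until the next one, so a shorter TTL means fewer duplicates but more network traffic. That staleness window is another reason to make the processing idempotent first.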