grid geekery
Jun. 7th, 2005 03:44 pm
Does anybody know anything about grid computing? Our lab has been limping along for years with pmake, but it's really not scaling well. We're looking into moving to another grid computing system that's easier to work with -- ideally a CPU-scavenging architecture.
Our current favorite (well, my current favorite) seems to be Condor, with the Sun Grid Engine as runner-up. PBS (no, not that PBS) has also been suggested, but I'm not at all convinced about its support -- it seems to have moved to closed source.
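(For the curious, here's roughly what one of our batch jobs might look like as a Condor submit description, if we go that route. This is just a sketch from the docs I've been skimming -- the executable and file names are made up, not anything we actually run.)

    universe     = vanilla
    executable   = train_acoustic_model
    arguments    = --chunk $(Process)
    output       = logs/train.$(Process).out
    error        = logs/train.$(Process).err
    log          = logs/train.log
    requirements = (OpSys == "LINUX") && (Arch == "INTEL")
    queue 40

The appeal for CPU scavenging is that the requirements line gets matched against each machine's ClassAd, so idle desktops can pick up work without us scripting anything by hand.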
My question for any of you: have you used any of these? Was it difficult? Would you recommend it? We have an unwieldy cluster of some 200 nodes (with ~300 CPUs among them) and a small number of master fileservers that share a common NFS space. It would be neat to include some of the features of the Condor supersystem, but that's not really critical. What is critical is that we need to move our lab to a system that is supported by somebody outside our lab: we're a speech lab, not a parallel computing lab. We don't have the time or expertise to build clever parallel computing architectures. We'd love to leave it to the experts -- and to be able to file a bug that other people will get their degrees by fixing.
Any advice? Quite honestly, I'm not really expecting any responses, but who knows who's paying attention out there? Who's doing high-throughput, parallel computing on many nodes?
evan?
xaosenkosmos?
(evan, don't say "google filesystem", unless they want to share with us!)
no subject
Date: 2005-06-08 12:46 am (UTC)
io bound is harder because with nfs, splitting it out to multiple machines doesn't necessarily make it any faster.
with gfs (yeah, i know you said not to mention it, but it's really not appropriate for your tasks anyway) you can increase the replication to get faster accessibility for data.
no subject
Date: 2005-06-08 12:58 am (UTC)
But the big i/o bind right now is that multiple processes need to access the same data, so when 40 CPUs request identical (or same-disk) data from the same NFS server, the server falls over. This is why I pointed to Stork: it allows DAG-managed data distribution that's aware of network load. There is a lot of low-net-traffic time on our subnet, but when we start 40+ parallel jobs, that all goes away. We would be fine if we could stagger the transfers, but we've been hand-rolling job-staggering solutions for years and it's gotten completely out of hand.
So it's not the traditional i/o bind -- it's more that we need good queue and data management so that data can get where it's going in an orderly fashion. A traffic cop who just reminds everyone: "no shoving, please -- we'll all get there faster if we follow the queue."
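(To make that concrete, here's roughly how I imagine the traffic cop looking in Condor's DAGMan: one staging job pushes the shared corpus out first, and the compute jobs only start once it finishes. The submit-file names are invented -- I haven't actually tried this yet.)

    # hypothetical stage.dag -- stage the data once, then fan out the training jobs
    JOB  stage    stage_data.sub
    JOB  train01  train.sub
    JOB  train02  train.sub
    JOB  train03  train.sub
    PARENT stage CHILD train01 train02 train03

DAGMan also seems to have throttling knobs (a -maxjobs flag on condor_submit_dag, if I'm reading the manual right), which is exactly the staggering we keep hand-rolling.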
no subject
Date: 2005-06-09 10:37 pm (UTC)
Personally, I was thinking....
I like to use...
Yah. I've got nothing. Surprise!
no subject
Date: 2005-06-09 10:38 pm (UTC)
Can I take that sentence out of context?
Please.
I am still a geek reading. Just not the right kind.
no subject
Date: 2005-06-10 04:02 am (UTC)

no subject
Date: 2005-06-10 04:04 am (UTC)
I love you.
data replication
Date: 2005-06-11 11:48 pm (UTC)
That said, i've still never managed to get an AFS cell up and running in full, and i've taken a few cracks at it.
but from all the docs i've read, it seems to do (mostly) what i want. if only it encrypted all the traffic more robustly, i'd be happy.
Alternately (more of a hack, less of a principled solution), if you can easily split out your data into static and dynamic sections, you could manually replicate your static sections across a set of N different NFS fileservers (using an hourly rsync cron job or something). then configure each client machine to mount the static export from just one of the N fileservers.
and if you have data like this that's really static, you could just replicate it (via rsync?) to the local filesystem of each of the client machines whenever the authoritative data source gets modified. that would save you the network crunch as well as take a lot of load off the fileserver.
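(something like this in each client's crontab, just as a sketch -- the hostname and paths are made up, and you'd want to vary the minute field per machine so they don't all hammer the fileserver at once:)

    # pull the static data hourly from the authoritative fileserver
    17 * * * *  rsync -a --delete fileserver01:/export/static/ /local/static/

then point the jobs at /local/static instead of the nfs mount.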
i'm sure you've considered solutions like this in some form or another, though. i'd be interested to hear what you come up with. how well can you segregate your static data from your dynamic data?