trochee: (bithead)
[personal profile] trochee
for the geeks who may be reading:

Does anybody know anything about grid computing? Our lab has been limping along for years with pmake, but it's really not scaling well. We're looking into moving to a grid-computing system that's easier to work with -- ideally a CPU-scavenging architecture.

Our current favorite (well, my current favorite) seems to be Condor, with Sun Grid Engine as the runner-up. PBS (no, not that PBS) has also been suggested, but I'm not at all convinced about its support -- it seems to have moved to closed source.

My question for any of you: have you used any of these? Was it difficult? Would you recommend it? We have an unwieldy cluster of some 200 nodes (with ~300 CPUs among them) and a small number of master fileservers that share a common NFS space. It would be neat to pick up some features of the larger Condor ecosystem (Stork's managed data placement, for example), but that's not really critical.

What is critical is that we move our lab to a system that is supported by somebody outside our lab: we're a speech lab, not a parallel-computing lab. We don't have the time or expertise to build clever parallel computing architectures. We'd love to leave it to the experts -- and to be able to file a bug that other people will get their degrees by fixing.
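
(For the curious: part of Condor's appeal is how small a per-job description can be. A purely hypothetical submit file for a batch of 40 independent jobs -- the decode.sh script, the paths, and the job count are all made up -- might look roughly like this:

    # hypothetical Condor submit description for 40 independent jobs
    universe    = vanilla
    executable  = decode.sh
    arguments   = $(Process)
    output      = logs/decode.$(Process).out
    error       = logs/decode.$(Process).err
    log         = logs/decode.log
    queue 40

Condor's matchmaker would then farm those 40 instances out to whatever idle machines it finds -- the CPU-scavenging part.)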

Any advice? Quite honestly, I'm not really expecting any responses, but who knows who's paying attention out there? Who's doing high-throughput, parallel computing on many nodes? [livejournal.com profile] evan? [livejournal.com profile] xaosenkosmos? ([livejournal.com profile] evan, don't say "google filesystem", unless they want to share with us!)

Date: 2005-06-08 12:46 am (UTC)
From: [identity profile] evan.livejournal.com
cpu bound is potentially easy if your data splits well / your different cpu processes don't need to communicate.

io bound is harder because with nfs, splitting it out to multiple machines doesn't necessarily make it any faster.

with gfs (yeah, i know you said not to mention it, but it's really not appropriate for your tasks anyway) you can increase the replication to get faster accessibility for data.
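
To make the "splits well" case concrete, here's a toy sketch in Python (the file names and shard count are invented) of carving a work list into independent shards, one per grid job, with no communication needed between jobs afterwards:

    # toy illustration of "the data splits well": round-robin a big work list
    # into independent shards, one input list per grid job
    def split_into_shards(paths, n_shards):
        shards = [[] for _ in range(n_shards)]
        for i, path in enumerate(paths):
            shards[i % n_shards].append(path)
        return shards

    if __name__ == "__main__":
        utterances = ["utt%05d.wav" % i for i in range(10000)]  # made-up work list
        for j, shard in enumerate(split_into_shards(utterances, 40)):
            with open("shard.%02d" % j, "w") as f:              # one file list per job
                f.write("\n".join(shard) + "\n")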

Date: 2005-06-08 12:58 am (UTC)
From: [identity profile] trochee.livejournal.com
Our processes do split well, and the different processes don't need to communicate.

But the big i/o bind right now is that multiple processes need to access the same data, so when 40 CPUs request identical or same-disk data from the same NFS server, the server falls over. This is why I pointed to Stork: it allows DAG-managed data distribution that's aware of network load. There is a lot of low-net-traffic time on our subnet, but when we start 40+ parallel jobs, that all goes away. We would be fine if we could stagger it, but we've been hand-rolling solutions for this kind of job-staggering for years and it's just gotten completely out of hand.

So it's not the traditional i/o bind -- it's more that we need good queue and data management so that data can get there in an orderly fashion. A traffic cop who just reminds everyone: "no shoving, please; we'll all get there faster if we follow the queue."
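
Just to make the "traffic cop" concrete (not a recommendation; the paths and the decode.sh step below are made up): a minimal Python sketch of the kind of staggering we keep hand-rolling -- jitter the start, copy the shared input from NFS to local scratch once per node, then compute against the local copy. A real queue/data manager like Stork would do the same thing with actual knowledge of network load.

    # sketch of jittered copy-to-local-scratch staggering (hypothetical paths)
    import os, random, shutil, subprocess, sys, time

    NFS_INPUT  = "/net/fileserver/models"   # shared read-only data on the NFS server
    LOCAL_COPY = "/scratch/models"          # per-node local copy

    def stage_data(max_jitter=300):
        """Sleep 0..max_jitter seconds, then copy the data once per node."""
        if os.path.isdir(LOCAL_COPY):
            return                          # another job on this node already staged it
        time.sleep(random.uniform(0, max_jitter))
        partial = "%s.partial.%d" % (LOCAL_COPY, os.getpid())
        shutil.copytree(NFS_INPUT, partial)
        try:
            os.rename(partial, LOCAL_COPY)  # atomic publish on the local filesystem
        except OSError:
            shutil.rmtree(partial)          # lost the race to a concurrent job; fine

    if __name__ == "__main__":
        stage_data()
        subprocess.check_call(["./decode.sh", LOCAL_COPY, sys.argv[1]])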

Date: 2005-06-09 10:37 pm (UTC)
From: [identity profile] boobirdsfly.livejournal.com
I think you should...
Personally, I was thinking....
I like to use...

Yah. I've got nothing. Surprise!

Date: 2005-06-09 10:38 pm (UTC)
From: [identity profile] boobirdsfly.livejournal.com
Our processes do split well, and the different processes don't need to communicate.

Can I take that sentence out of context?
Please.
I am still a geek reading. Just not the right kind.

Date: 2005-06-10 04:02 am (UTC)
From: [identity profile] trochee.livejournal.com
heh. that sentence could describe my entire inner life. With all the good and the bad that it implies.

Date: 2005-06-10 04:04 am (UTC)
From: [identity profile] boobirdsfly.livejournal.com
Heh. You read my mind.
I love you.

data replication

Date: 2005-06-11 11:48 pm (UTC)
From: [identity profile] http://users.livejournal.com/_dkg_/
if a bunch of processes need access to the same data, and it's relatively static data, you might want to look into AFS. it's a bitch to configure (and requires an adequate kerberos infrastructure, for starters), but:

  • it's very cross-platform, with ways to hook it into Windows and many flavors of unix, including MacOS

  • it's designed to handle the replication issues natively -- that is, your various client machines just know that they are fetching data from your AFS cell, but in fact, the same data could be spread across a large number of servers to minimize the load. dunno how well this is supposed to work for more dynamic data, though.

  • it has neat features like automatic backups (snapshots of your data taken at specific points) which can be mounted read-only for recovery while the live system is still running.

  • it's expandable without having to allocate a new filesystem on a single server; for example, you (or, ahem, your administrator) can tell the system "this chunk of data is now going to be stored over here on this new disk on this new machine", and none of the clients need to be reconfigured or anything.


That said, i've still never managed to get an AFS cell up and running in full, and i've taken a few cracks at it.

but from all the docs i've read, it seems to do (mostly) what i want. if only it encrypted all the traffic more robustly, i'd be happy.

Alternatively (more of a hack, less of a principled solution), if you can easily split your data into static and dynamic sections, you could manually replicate the static sections across a set of N different NFS fileservers (using an hourly rsync cron job or something). Then configure each client machine to mount the static export from just one of the N fileservers.

And if you have data like this that's really static, you could just replicate it (via rsync?) to the local filesystem of each of the client machines whenever the authoritative data source gets modified. That would save you the network crunch as well as take a lot of load off the fileserver.
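
A very rough sketch of that push-style replication (the host names, paths, and trigger are all hypothetical -- in practice you'd kick this off from cron, or from whatever step modifies the authoritative copy):

    # push the authoritative static data out to each client's local disk
    import subprocess

    SOURCE  = "/export/static_corpora/"                 # authoritative copy (made-up path)
    DEST    = "/scratch/static_corpora/"                # replica location on each target
    TARGETS = ["node%03d" % i for i in range(1, 201)]   # hypothetical client host names

    def replicate():
        failures = []
        for host in TARGETS:
            # -a preserves permissions/times; --delete keeps replicas exact mirrors
            rc = subprocess.call(["rsync", "-a", "--delete", SOURCE, "%s:%s" % (host, DEST)])
            if rc != 0:
                failures.append(host)
        return failures

    if __name__ == "__main__":
        bad = replicate()
        if bad:
            print("replication failed on: " + ", ".join(bad))

Pushing sequentially like this is slow but gentle on the network; you could parallelize or stagger it the same way as the compute jobs.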

i'm sure you've considered solutions like this in some form or another, though. i'd be interested to hear what you come up with. how well can you segregate your static data from your dynamic data?
