trochee: (bithead)
[personal profile] trochee
for the geeks who may be reading:

Does anybody know anything about grid computing? Our lab has been limping along for years with pmake, but it's really not scaling well. We're looking at moving to another grid-computing system that's easier to work with -- ideally one with a CPU-scavenging architecture.

Our current favorite (well, my current favorite) seems to be Condor, with Sun Grid Engine a close second. PBS (no, not PBS) has also been suggested, but I'm not at all convinced about its support -- it seems to have moved to a closed-source model.

My question for any of you: have you used any of these? Was it difficult? Would you recommend it? We have an unwieldy cluster of some 200 nodes (with ~300 CPUs among them) and a small number of master fileservers that share a common NFS space. It would be neat to include some of the features of the Condor supersystem, but that's not really critical. What is critical is that we need to move our lab to a system that is supported by somebody outside our lab: we're a speech lab, not a parallel computing lab. We don't have the time or expertise to build clever parallel computing architectures. We'd love to leave it to the experts -- and to be able to file a bug that other people will get their degrees by fixing.
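For concreteness, here's roughly what I imagine a Condor submit description for one of our batch jobs would look like -- I haven't actually run this, and the executable and file names are invented, but it's the shape of thing I'm hoping for:

    # sketch of a vanilla-universe submit file; program and paths are made up
    universe    = vanilla
    executable  = recognize.sh
    arguments   = utt_list.$(Process)
    output      = logs/recognize.$(Process).out
    error       = logs/recognize.$(Process).err
    log         = logs/recognize.log
    queue 40

One file, forty queued jobs, and Condor worries about where and when they run.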

Any advice? Quite honestly, I'm not really expecting any responses, but who knows who's paying attention out there? Who's doing high-throughput, parallel computing on many nodes? [livejournal.com profile] evan? [livejournal.com profile] xaosenkosmos? ([livejournal.com profile] evan, don't say "google filesystem", unless they want to share with us! )

Date: 2005-06-07 11:00 pm (UTC)
From: [identity profile] xaosenkosmos.livejournal.com
Got nothin'. We only do load-balancing clusters at work, not computational stuff.

Date: 2005-06-07 11:11 pm (UTC)
From: [identity profile] trochee.livejournal.com
I see. that makes sense. I'm still thinking Condor, but I'll post again if we ever actually make the move.

Date: 2005-06-07 11:51 pm (UTC)
From: [identity profile] apollinax.livejournal.com
When interviewing at UW, I chatted with the Condor folks. Those people know what they're doing: they've been doing it for a long, long time, and a lot of influential people with good taste use their system. I'd recommend it highly. That being said, I've never tried using it, so take my recommendation with a big grain o' salt.

Date: 2005-06-08 12:02 am (UTC)
From: [identity profile] trochee.livejournal.com
your opinion of the Condor people mirrors my own, except you have some personal contact as well.

which influential people with good taste did you have in mind?

Date: 2005-06-08 12:07 am (UTC)
From: [identity profile] xaosenkosmos.livejournal.com
Random fun note: the vast majority of our services (12M hits/day, for ~400 domains and many thousands of sites) run off around a dozen machines. A dozen machines going on 4 years old, single-processor, with only about 2GB of RAM each =)

That is to say: we are pretty close to the antithesis of a computational cluster.

Date: 2005-06-08 12:11 am (UTC)
From: [identity profile] evan.livejournal.com
work uses homebrewed stuff.

[livejournal.com profile] brad was hacking up something like this in perl.

are you cpu-bound or io-bound?

Date: 2005-06-08 12:13 am (UTC)
From: [identity profile] trochee.livejournal.com
depends on the user and the task. We're mostly i/o-bound at the moment, for big speech recognition jobs (lots of data, relatively straightforward compute). But we could easily turn the knob toward being CPU-bound if the i/o issues were a little less severe.

My work is CPU-bound; parsers don't require much input but spend lots of time in solution search.

Date: 2005-06-08 12:14 am (UTC)
From: [identity profile] trochee.livejournal.com
Point well taken: I can see more compute power from my desk.

Date: 2005-06-08 12:46 am (UTC)
From: [identity profile] evan.livejournal.com
cpu bound is potentially easy if your data splits well / your different cpu processes don't need to communicate.

io bound is harder because with nfs, splitting it out to multiple machines doesn't necessarily make it any faster.

with gfs (yeah, i know you said not to mention it, but it's really not appropriate for your tasks anyway) you can increase the replication factor to get faster access to the data.

Date: 2005-06-08 12:58 am (UTC)
From: [identity profile] trochee.livejournal.com
Our processes do split well, and the different processes don't need to communicate.

But the big i/o bind right now is that multiple processes need to access the same data, so when 40 CPUs request identical (or same-disk) data from the same NFS server, the server falls over. This is why I pointed to Stork: it allows DAG-managed data distribution that's aware of network load. There's a lot of low-traffic time on our subnet, but when we start 40+ parallel jobs, that all goes away. We'd be fine if we could stagger things, but we've been hand-rolling solutions for that kind of job-staggering for years and it's gotten completely out of hand.

So it's not the traditional i/o bind -- it's more that we need good queue and data management so that data can get where it's going in an orderly fashion. A traffic cop who just reminds everyone: "no shoving, please; we'll all get there faster if we follow the queue."
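To make the traffic-cop idea concrete, my current thought is to let Condor's DAGMan do the staggering for us: give each compute job a data-staging job as its parent, and throttle the whole DAG. Something like this sketch (untested, and the file names are invented):

    # speech.dag -- each work job waits for its staging job
    JOB  stage0  stage.0.submit
    JOB  work0   work.0.submit
    PARENT stage0 CHILD work0
    JOB  stage1  stage.1.submit
    JOB  work1   work.1.submit
    PARENT stage1 CHILD work1
    # ... and so on, one pair per job

submitted with something like "condor_submit_dag -maxjobs 10 speech.dag", so that only a handful of jobs are hammering the fileserver at any one time instead of all 40+ at once.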

Date: 2005-06-09 10:37 pm (UTC)
From: [identity profile] boobirdsfly.livejournal.com
I think you should...
Personnally, I was thinking....
I like to use...

Yah. I've got nothing. Surprise !

Date: 2005-06-09 10:38 pm (UTC)
From: [identity profile] boobirdsfly.livejournal.com
Our processes do split well, and the different processes don't need to communicate.

Can I take that sentence out of context ?
Please.
I am still a geek reading. Just not the right kind.

Date: 2005-06-10 04:02 am (UTC)
From: [identity profile] trochee.livejournal.com
heh. that sentence could describe my entire inner life. With all the good and the bad that it implies.

Date: 2005-06-10 04:04 am (UTC)
From: [identity profile] boobirdsfly.livejournal.com
Heh. You read my mind.
I love you.

data replication

Date: 2005-06-11 11:48 pm (UTC)
From: [identity profile] http://users.livejournal.com/_dkg_/
if a bunch of processes need access to the same data, and it's relatively static data, you might want to look into AFS. it's a bitch to configure (and requires an adequate kerberos infrastructure, for starters), but:

  • it's very cross-platform, with ways to hook it into Windows and many flavors of unix, including MacOS

  • it's designed to handle the replication issues natively -- that is, your various client machines just know that they are fetching data from your AFS cell, but in fact, the same data could be spread across a large number of servers to minimize the load. dunno how well this is supposed to work for more dynamic data, though.

  • it has neat features like automatic backups (snapshots of your data taken at specific points) which can be mounted read-only for recovery while the live system is still running.

  • it's expandable without having to allocate a new filesystem on a single server; for example, you (or, ahem, your administrator) can tell the system "this chunk of data is now going to be stored over here on this new disk on this new machine", and none of the clients need to be reconfigured or anything.


That said, i've still never managed to get an AFS cell up and running in full, and i've taken a few cracks at it.

but from all the docs i've read, it seems to do (mostly) what i want. if only it encrypted all the traffic more robustly, i'd be happy.
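for what it's worth, the replication setup is supposed to look something like this -- i haven't done it myself, so this is just from the docs, and the server, partition, and volume names are invented:

    vos create  fs1 /vicepa corpus     # read/write volume on server fs1, partition /vicepa
    vos addsite fs1 /vicepa corpus     # register read-only replica sites on fs1...
    vos addsite fs2 /vicepa corpus     # ...and on a second server
    vos release corpus                 # push the current contents out to the replicas

clients then read from whichever read-only replica is handy, and you re-run "vos release" whenever the read/write copy changes.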

Alternately (more of a hack, less of a principled solution), if you can easily split out your data into static and dynamic sections, you could manually replicate your static sections across a set of N different NFS fileservers (using an hourly rsync cron job or something). then configure each client machine to mount the static export from just one of the N fileservers.

and if you have data like this that's really static, you could just replicate it (via rsync?) to the local filesystem of each of the client machines whenever the authoritative data source gets modified. that would save you the network crunch as well as reduce a lot of load on the fileserver.
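the push could be as dumb as a loop on the fileserver that runs whenever the static data is updated (hostnames and paths here are invented):

    # on the fileserver, after the authoritative static data changes;
    # compute-nodes.txt lists the ~200 client machines
    while read node; do
        rsync -a --delete /export/static/ "$node":/local/static/
    done < compute-nodes.txt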

i'm sure you've considered solutions like this in some form or another, though. i'd be interested to hear what you come up with. how well can you segregate your static data from your dynamic data?

Interesting Condor users.

Date: 2005-11-21 10:32 pm (UTC)
From: [identity profile] alan-de-smet.livejournal.com
It's an old post, but no one answered, so I thought I would, just in case you still care. I'm Condor staff, so I'm biased, but anyway:

  • The Hartford (http://www.thehartford.com/), an insurance company, is using Condor for their processing needs (http://www.insurancetech.com/story/?articleID=55801230). (The Hartford's PowerPoint presentation from our conference this last spring: http://www.cs.wisc.edu/condor/CondorWeek2005/presentations/nordlund_hartford.ppt)

  • C.O.R.E. Digital Pictures (http://www.coredp.com/) uses Condor for the rendering farm behind their special effects work in movies. (CORE's PPT presentation: http://www.cs.wisc.edu/condor/CondorWeek2005/presentations/stowe_core.ppt. I believe they said at our conference that everything on their current demo reel (http://www.coredp.com/reels/sr_core_toon_reel/?&w=1270&h=968) used Condor to manage the rendering workload.)

  • Micron (http://www.micron.com/) is using Condor in a variety of ways, including testing of embedded cameras. (Micron PPT on the topic: http://www.cs.wisc.edu/condor/CondorWeek2005/presentations/gore_micron.ppt)

  • Oracle is using Condor as part of their build and regression testing system. (Oracle PPT on the topic: http://www.ppdg.net/mtgs/Troubleshooting/Oracle%20Grid.ppt)

  • A number of high energy physics groups are using Condor, especially in the work preparing for the new Large Hadron Collider on the French-Swiss border.