trochee: (loom)
2016-06-16 10:30 am

Greater data science, part 2.1 – software engineering for scientists

This is part of an open-ended series of marginalia to Donoho’s 50 Years of Data Science 2015 paper.

In many scientific labs, the skills and knowledge required for the research (e.g. linguistics fieldwork, sociological interview practices, wet-lab biological analysis) are not the same skills involved in software engineering or in data curation and maintenance.

Some scientists thus find themselves as the “accidental techie” in their local lab (maybe not even as the “accidental data scientist”, but doing specific software engineering tasks): you’re the poor schmuck stuck making sure that everybody’s spreadsheets validate, that nobody sorts the data on the wrong field, that you removed the “track changes” history from the proposal before you sent it off to the grant agencies, etc.

Scientific labs of any scale (including academic labs, though they probably don’t have the budgets or the incentives) can really benefit from data science expertise, and especially from software engineering expertise, even (or perhaps especially) when the engineer isn’t an inside-baseball expert in the research domain.  I list below a number of places an experienced software engineer (or data scientist) can make a difference to a field she (or he) doesn’t know well.

Read the rest of this entry »

Mirrored from Trochaisms.

trochee: (loom)
2016-06-16 10:17 am

spelling be hard

I’ve written a half dozen pieces of commentary on David Donoho’s work, all the while spelling his name wrong; at least once in a permalink URL. Oh, well.  At least I can edit the posts here.

Mirrored from Trochaisms.

trochee: (loom)
2016-06-16 09:15 am

Greater data science, part 2: data science for scientists

This is part of an open-ended series of marginalia to Donoho’s 50 Years of Data Science 2015 paper.

Many aspects of Donoho’s 2015 “greater data science” can support scientists of other stripes, and not just because “data scientist is like food cook”: if data science is a thing after all, then it has specific expertise that applies to shared problems across domains. I have been thinking a lot about how the outsider-ish nature of “data science” can provide supporting analysis in a specific domain-tied (“wet-lab”) science.

This is not to dismiss the data science that’s already happening in the wet-lab — but to acknowledge that the expertise of the data scientist is often complementary to the domain expertise of her wet-lab colleague.

Here I lay out three classes of skills that I’ve seen in “data scientists” (more rarely, but still sometimes, in software engineers, or in target-domain experts: these people might be called the “accidental data scientists”, if it’s not circular).

“Direct” data science

Donoho 2015 includes six divisions of “greater data science”:

The activities of Greater Data Science are classified into 6 divisions:

  1. Data Exploration and Preparation
  2. Data Representation and Transformation
  3. Computing with Data
  4. Data Modeling
  5. Data Visualization and Presentation
  6. Science about Data Science

All of Greater Data Science offers opportunities to help out “other” sciences:

  • methodological review on data collection and transformation
  • representational review ensuring that — where possible — the best standards for data representation are available; this is a sort of future-proofing and also feeds into cross-methodological analyses (below)
  • statistical methods review on core and peripheral models and analyses
  • visualization and presentation design and review, to support exploration of input data and post-analysis data
  • cross-methodological analyses are much easier to adapt when data representations and transformations conform to agreed-upon standards

Coping with “big” data

  • adaptation of methods for large-scale data cross-cuts most of the above — understanding how to adapt analytic methods to “embarrassingly parallel” architectures
  • refusing to adapt methods for large-scale data when, for example, the data really aren’t as large as all that. Remember, many analyses can be run on a single machine with a few thousand dollars’ worth of RAM and disk, rather than requiring a compute cluster at orders of magnitude more expense. (Of course, projects like Apache Beam aim to bake in the ability to scale down, but this is by no means mature.)
  • pipeline audit capacity — visualization and other insight into data at intermediate stages of processing is more important the larger the scale of the data
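To make the first two bullets concrete, here is a minimal sketch of an “embarrassingly parallel” job that also scales down to a single process. The word-count task and the names `word_count` and `total_words` are hypothetical stand-ins for whatever per-record analysis a lab actually runs; the only real API used is Python’s standard `concurrent.futures`.

```python
from concurrent.futures import ProcessPoolExecutor


def word_count(line: str) -> int:
    """Per-record work: depends only on its own input, so order doesn't matter."""
    return len(line.split())


def total_words(lines, workers=1):
    """Sum word counts: serially when workers <= 1, else across worker processes."""
    if workers <= 1:
        # Scaling down: the same analysis as a plain loop, no cluster required.
        return sum(map(word_count, lines))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Scaling up: each chunk of records is handled independently.
        return sum(pool.map(word_count, lines, chunksize=1000))
```

Calling `total_words(lines)` and `total_words(lines, workers=4)` should give the same total; on platforms that spawn worker processes, the parallel call belongs under an `if __name__ == "__main__":` guard. Checking that the two paths agree on a sample is also a small instance of the pipeline audit in the last bullet.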

Scientific honesty and client relationships

Data scientists are in a uniquely well-suited position to actually improve the human quality of the “wet lab” research scientists they support.  By focusing on the data science in particular, they can:

  • identify publication bias, or other temptations like p-hacking, even if inadvertent (these may also be part of the statistical methods review above)
  • support good-faith re-analysis when mistakes are discovered in the upstream data, the pipelines or supporting packages: if you’re doing all the software work above, re-running should be easy
  • act as a “subjects’ ombuds[wo]man” by considering (e.g.) the privacy and reward trade-offs in the analytics workflow and the risks of data leakage
  • facilitate the communication within and between labs
  • find ways to automate the boring and mechanical parts of the data pipeline process
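As one small illustration of the first bullet: a statistical methods review can ask how many of a batch of “significant” results survive a multiple-comparisons correction. The sketch below uses the Bonferroni correction, which is my choice of example rather than anything prescribed above, and the p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return, per test, whether it stays significant after Bonferroni correction."""
    m = len(p_values)
    threshold = alpha / m if m else alpha  # divide alpha by the number of tests
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five analyses run on the same data set.
p_values = [0.003, 0.02, 0.04, 0.30, 0.75]
naive = [p <= 0.05 for p in p_values]
corrected = bonferroni(p_values)
print(sum(naive), "nominally significant;", sum(corrected), "after correction")
```

Bonferroni is deliberately conservative; the point is only that the reviewer counts every analysis that was run, not just the ones that made it into the paper.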

Mirrored from Trochaisms.

trochee: (loom)
2016-06-10 02:26 pm

IDEs are Code Smell

Some wise thoughts from my complementary-distribution doppelganger Bill McNeill, currently occupying our ecological niche in Austin:

IDE-independence has a lot of advantages.

IDEs are Code Smell

Mirrored from Trochaisms.

trochee: (loom)
2016-06-08 02:38 pm

RMarkdown notebooks with Jupyter front-end

Hey, nifty. I just found out that you can write RMarkdown-style literate Python files and use the Jupyter notebook environment to view and execute them (with the notedown package, which also allows you to edit them in place).  This has nice implications for source control — changes to ipython notebooks are pretty ugly.
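For a sense of why notebook diffs get ugly: an `.ipynb` file is JSON that stores every cell’s outputs and execution counter alongside the source, so re-running a notebook changes the file even when the code doesn’t. The sketch below is not notedown; it is a standard-library illustration (the function name `strip_outputs` and the toy notebook are mine) of clearing that noise from an nbformat-4 notebook before committing it.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with outputs and counters cleared."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A toy two-cell notebook, written out by hand for illustration.
notebook = {
    "nbformat": 4, "nbformat_minor": 0, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 7,
         "source": ["print(1 + 1)"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}]},
    ],
}
print(json.dumps(strip_outputs(notebook), indent=1, sort_keys=True))
```

A markdown-style source file sidesteps the problem entirely, since there are no stored outputs to diff in the first place.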

Mirrored from Trochaisms.

trochee: (loom)
2016-05-19 01:52 pm

Relational skills and the three wh’s

There’s a fairly tidy — but imperfect — correspondence between the three wh’s and the relational skillsets I proposed yesterday.

  • how corresponds well to the tooling skillset
  • what roughly corresponds to the data stewardship skillset
  • … leaving why to correspond to the collaboration skillset, which seems apt: why do data science if you don’t have someone you’re doing it with, or for?

Of course, the name “data science” probably isn’t all that, uh, sciencey:


Mirrored from Trochaisms.

trochee: (loom)
2016-05-15 10:02 pm

Rolling the dice at the Just World Casino

tl;dr: The tech frame of “lean startup”, venture capital funding, “exit strategies”, and relentless “valuation” talk is fundamentally anti-human for nearly all of us.

[ETA (immediately after publication):]

The kneejerk libertarianism and Randian resistance to collective action among (white, male) tech workers have led to red-in-tooth-and-claw job insecurity and instability, the “[mono]culture fit”, fetishization of youth a la The Circle, and a Just World Fallacy (“meritocracy”) of increasingly dire proportions.  In particular, rewards are wildly skewed away from effort or collective valuation, and seem to track with luck, or deep enough pockets to roll the dice often.

Big winners are the poker players lucky enough to be the first ones to loot (excuse me; I mean “disrupt”) a previously protected commons (excuse me; “fish”); some of the rest of us are settling for steady jobs as dealers, wait staff, or (for the truly ambitious) pit bosses. But the big game — besides being the house — is in bringing in the big fish unicorns.

Though unicorns make for flashy external advertisements (“Sue Anne won $10,000 at Lucky Strike yesterday! will you be next?”), the core casinos themselves are relentless in taking their cut on every big win and all the small losses.  AI fantasists (whether paranoid like Bostrom or optimist like Kurzweil and Yudkowsky) would like to think that the real questions are how to deal with “superhuman” intelligence, but the real concern is how to deal with non-human intelligence; specifically, the survival of humanity in the face of increasingly-automated bureaucracy.

Their “slow takeoff” has been burning since the East India Company, but has hit a recent elbow (a “fast takeoff”) with the “gig economy” (“sharing” is a bridge too far).  Some of these insecurities are bleeding into the white-collar segments of the gig economies, as with the space-sharing institutions that are beginning to collect rent from players hoping to bag a unicorn:

Oh, and this isn’t working out great, even for the casino’s winners (don’t worry, though: the house is still doing just fine).

If you like this sort of terrifying doom-saying, I recommend @PhilSandifer’s Kickstarter:

Mirrored from Trochaisms.

trochee: (loom)
2013-04-19 10:28 am

don’t say I didn’t warn you

Apropos of this collaboration model thinking, I note that Doug Cutting is looking to “rock band” after all.

Mirrored from Trochaisms.

trochee: (loom)
2013-03-15 11:18 am

“Bank heist” collaboration pattern

Here’s my favorite collaboration pattern so far: the Bank Heist collaboration pattern. This pattern, which we know from The A-Team, Ocean’s 11 and Leverage, among others, shares many properties with an excellent developer team:

  • You don’t have to like following orders to be on the team.
  • Everybody’s a generalist, and an expert in one area (pickpocket, cat burglar, safe-cracker, grifter, etc) but nobody is an expert at everything.
  • “Building the team” is part of the fun.
  • There is – or should be – mutual respect for complementary skills.
  • Everybody on the team needs to do their part and get out of the other people’s way.
  • Prima donnas ruin the whole party.
  • There’s even a role for management: the Nate Ford/Danny Ocean “mastermind” character is an ideal manager: he can do enough of all the other players’ roles to see how they can all work together and set up the whole job.

I don’t know if identifying this collaboration pattern is actually useful, or if it’s just entertaining, but it is undoubtedly attractive: most people I’ve shared it with get very excited to work with a team that uses it. If you or a team you’re on derives some benefit from this pattern, drop me a note.

A few afterthoughts (connecting to the “theater ensemble” thoughts from Beth on Twitter):

Heist movies pick up the drama when the team starts to violate these prescriptions: when the grifter decides he’d be a better mastermind than the current leader, for example. This opens up two perspective games I like to play:

  • heistify: take your boring office politics (“QA is dawdling because they were convinced the dev will botch it anyway”) and rewrite into a bank heist: “safe-cracker didn’t bother bringing his stethoscope because he figured the second-story man wouldn’t be able to kill the alarms”. Much more fun, isn’t it?
  • shyster: make heist movies boring again by inverting the transformation above.

Finally, heist movies have awesome soundtracks. Who wouldn’t want their workday scored with horn stings?  (And, as Josh points out: you’d have a sweet van.)

Mirrored from Trochaisms.

trochee: (loom)
2013-03-14 09:53 pm

Software collaboration patterns

Software development is a fundamentally social process: it’s all about working together. We (software developers as a caste) have expressions like “programming by contract” and design patterns like “Delegation” that reflect how we humans work together – and we use these patterns to describe how we instruct our robot minions to function. We think about our programs with social metaphors because we’re social apes: we think well with social metaphors, and our software design patterns reflect how we think best.

But we rarely use these social metaphors to think about how we make software. We need patterns for collaboration that match our social creature wetware, the way “delegation” and “factory” and “handshake” patterns help design software.

Read the rest of this entry »

Mirrored from Trochaisms.

trochee: (loom)
2013-03-07 06:57 am

As random as I oughta be

From John D. Cook’s Probability Facts twitter feed, I discovered the infamous RANDU, and this absolutely marvelous quote:

One of us recalls producing a “random” plot with only 11 planes, and being told by his computer center’s programming consultant that he had misused the random number generator: “We guarantee that each number is random individually, but we don’t guarantee that more than one of them is random.” Figure that out.

which in turn reminds me of this:

“RFC 1149.5 specifies 4 as the standard IEEE-vetted random number.” — Randall Munroe
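For the curious, RANDU itself fits in a few lines. This sketch uses the generator’s standard definition (multiplier 65539, modulus 2**31, odd seed); the function name `randu` and the seed are my own choices. It checks the flaw that made it infamous: every value is a fixed linear combination of the two before it.

```python
def randu(seed, n):
    """Yield n values from RANDU: x_{k+1} = 65539 * x_k mod 2**31 (seed should be odd)."""
    x = seed
    for _ in range(n):
        x = (65539 * x) % 2**31
        yield x


xs = list(randu(seed=1, n=1000))

# Because 65539 = 2**16 + 3, squaring the multiplier gives 6 * 65539 - 9 (mod 2**31),
# so x_{k+2} = 6 * x_{k+1} - 9 * x_k (mod 2**31) for every consecutive triple.
assert all((6 * b - 9 * a - c) % 2**31 == 0 for a, b, c in zip(xs, xs[1:], xs[2:]))
```

That relation is why consecutive triples from RANDU land on a small number of planes in the unit cube rather than filling it: each number can look fine on its own, but no more than one of them at a time.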

Mirrored from Trochaisms.

trochee: (loom)
2012-11-17 02:13 pm

November surprises

Two weeks ago I saw Argo [highly recommended!] and that made me remember the 1980 October Surprise controversies, which were hecka confusing to a five-year-old at the time.  I have my own November (or possibly early December) surprise coming: I’m about to become a dad, and I’m, well, 80% enthusiastic and about 20% terrified. Reminders that most of the world does this, at half my age, are welcome.

Please send bottled sleep, if you have any around; I’m trying to stockpile.

I’ve been too busy with my not-so-new job (I’ve been there ten months, commuting all the way across greater Seattle) and other things [see below] to write anything longform here, but I swing by now and then to clean out the spamtraps.  And — as today — to free-associate a little.

I believe I’ve changed the settings to no longer post twitter-logs here. If you miss my weekly twitter-log updates (why would you?) you can read my twitter feed directly or read a mirror of it on this blog.



Mirrored from Trochaisms.

trochee: (Default)
2012-10-14 01:30 am

Twitterlog for the week of 2012-10-14

  • Corollary: "computational linguist: better at linguistic analysis than software engineers, better at programming than linguists." #
  • SF peeps: last chance to help Bikes for San Francisco Youth on @indiegogo (I just pushed it over $4k!) #
  • TIL: What I know as Golden Hammer Syndrome ("to man with h–, world looks like nail") should be dubbed Maslow's Hammer (he of the Hierarchy) #
  • Am reading: increasingly rude argument among corpus linguists over necessity of exclusive definition for "corpus" #physicianhealthyself #
  • "Eva", 2, is obsessed with our house: she saw raccoons here once. I know she's still hooked: I can hear her mom calling her out of my yard. #
  • OH: "how does a career in technical comedy END?" #

Mirrored from Trochaisms.

trochee: (Default)
2012-10-07 01:30 am

Twitterlog for the week of 2012-10-07

  • "Burton Quim, founder of ex-gay ministry Overcome…" #aptronym #seriously #seriously #
  • academy is strange. “live-tweeting at conferences is a form of neoliberalism…” why NOT share your ideas? via @Yendi #
  • At UW CSE department, waiting for Carlos Guestrin talk on GraphLab. Expected to see more peeps I know, but here's a few. #
  • Eight years ago, went on mostly-blind date with @imtboo . Today, we await our son. She's Best Thing Ever for me, among many awesome things. #
  • 7:35p C: full; grudging when pushy boarder (me) tells “yeah yer Seattleites & don't like talking to each other; MOVE BACK” @westseattleblog #
  • Gates: "measuring software in LOC is like measuring an airplane by weight." Today I removed 40Mb of code from master. #git #airborne #
  • Co-worker just told me to change my business card to read 'Software Amputect'. #negativeLoC #
  • C line completely jammed. Not even standing room at 2nd & Columbia at 6 pm. @kcmetrobus, we need roughly DOUBLE this capacity at rush hour. #
  • Combined insights from @jim_adler & @imtboo into new insight (what probably only makes sense to me): "curiosity is the dual of play". #
  • …I got a bad feeling about this. :-[ #
  • Oh, man. I wish I could read THIS Fantastic Four. Sounds amazing & fun. (Also, Sue is the leader!) #

Mirrored from Trochaisms.

trochee: (Default)
2012-09-30 01:30 am

Twitterlog for the week of 2012-09-30

  • Bay Area NLPers: Long Now seeking "Lexical Data Specialist" — sounds neat. @stuartrobinson @WWRob @mdcclv (& more?) #
  • Fourth draft of project plan. Somewhat less ranty than Friday afternoon's first draft. Relaxing this weekend helped. #
  • My tech-heavy (and Y-chromosome-heavy) employer really should hire more women, if only to properly load-balance the bathrooms on our floor. #
  • Bus to west seattle has "get rhythm" performer, acoustic guitar and all, in back. Fun! #
  • OH: "some of this code has been dead so long I'm surprised it wasn't written in hieroglyphics." #
  • "… you can send me anonymous email, but first let me know you want to send me anonymous email so I can set it up" #unclearontheconcept #
  • Calling dibs : "Monty Carlo Mark of Shame" #nerdcore #shoegazer #stagename #

Mirrored from Trochaisms.

trochee: (Default)
2012-09-23 01:30 am

Twitterlog for the week of 2012-09-23

Mirrored from Trochaisms.

trochee: (Default)
2012-09-16 01:30 am

Twitterlog for the week of 2012-09-16

Mirrored from Trochaisms.

trochee: (Default)
2012-09-09 01:30 am

Twitterlog for the week of 2012-09-09

  • .@dresdencodak The Office (US): another No Exit remake, this time ostensibly leaving SF! (cf Truman Show, Dark City, The Matrix, The Cell) #
  • .@barbarahui I saw your name in a Shelf Life column in The Nation. You look good there! #

Mirrored from Trochaisms.

trochee: (Default)
2012-08-12 01:30 am

Twitterlog for the week of 2012-08-12

  • Ironically awesome to be getting hacked DM spam ("f8cebook giving away !Pads") from @ClosedAccessJ #

Mirrored from Trochaisms.