historical data-management
Sep. 2nd, 2004 07:39 pm
I've spent the last two weeks (!) trying to figure out how to relate no less than five different kinds of truth.
Before anybody thinks I've gone mystic, I should clarify: in speech recognition research, and other machine-learning contexts, truth refers to the right answer. We have hours and hours of conversations, transcribed by listeners at the Linguistic Data Consortium.
Unfortunately, there has been more than one pass at coming up with the right words -- the right truth. And derivative data like treebanks are based off one version, and not always the latest best one. So to do the kind of work I'm doing -- relating treebank annotation to prosody annotation -- I have to relate the latest, best truth words (for which we have prosody annotations) to the substantially older truth words that the treebanks were based on.
This word-alignment was supposed to be about a day's work in coding. But it's turned into two weeks of tedious examination of the various versions of truth words, trying to discover the differences and reproduce the various changes and script-based normalizations that got us from the old bad truth to a new and better truth.
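For a sense of what this kind of word alignment involves, here is a minimal sketch in Python. It is not the Perl that was actually used for this project. The `normalize` rule and the two sample word sequences are hypothetical, invented only for illustration, and the real normalizations were considerably messier. The idea is to normalize both versions of the words, then let a generic sequence matcher report which stretches agree and which were changed.

```python
import difflib
import re


def normalize(word):
    """Hypothetical normalization: lowercase, drop punctuation
    other than apostrophes and hyphens."""
    return re.sub(r"[^a-z0-9'-]", "", word.lower())


def align_words(old_words, new_words):
    """Align two versions of a transcript word by word.

    Returns (tag, old_span, new_span) tuples, where tag is one of
    difflib's opcode names: 'equal', 'replace', 'delete', 'insert'.
    Matching is done on normalized words, but the original spellings
    are what get returned.
    """
    old_norm = [normalize(w) for w in old_words]
    new_norm = [normalize(w) for w in new_words]
    matcher = difflib.SequenceMatcher(a=old_norm, b=new_norm, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Invented sample data: an older and a newer version of the same utterance.
old = "uh i think it is going to rain".split()
new = "Uh, I think it's gonna rain".split()

for tag, old_span, new_span in align_words(old, new):
    print(tag, old_span, new_span)
```

The 'replace' spans are where the archaeology happens: each one is either a genuine retranscription or the footprint of some normalization script that has to be rediscovered and reproduced.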
It feels, in an ironic way, like I am doing historical linguistics, with each version of the truth words being a different attested language, and trying to work out how they all relate to each other by looking for mechanisms of change (digging around in the misleading, wrong, lost, or never-written documentation), grouping together those corpora that seem similar. I'm effectively using the Historical Method, except I'm doing it the way that the historical linguists never could until recently -- with Perl and emacs in hand, hammer-and-a-nail.
It's actually been an interesting project (and it's almost done, which is what I've been saying about it for about 13 days of the last two weeks). The frustrating thing is that of all the cleverness in data-munging I've done, and all the careful code- and data-archaeology that I've done to get here, none of it is publishable. I'm just hoping that the other researchers I'm doing this for are grateful enough to put me in as a secondary author.
reading list:
The latest issue of The Nation, headlined The Coronation of George W. Bush: the GOP Convention Issue
no subject
Date: 2004-09-03 12:01 am (UTC)
no subject
Date: 2004-09-16 03:12 pm (UTC)
Yes, programmer geeks like to put capital letters on hackish, poorly defined subjects [heuristics] that we pretend are a clear, obvious concept [reified]. Is that how you meant it?
Even if it's not, it's provoked some additional thought. Examples: Bad and Wrong, Good Thing, Laziness, Impatience and Hubris.
Like some of the uses in other fields, the capitalized Reified Heuristics in geekery often get used in a contrarian way (Truth is opposed to truth in modern studies; Laziness and Impatience are opposed to "plain" laziness and impatience in geekdom). Except geeks like to turn things into TLAs (Three Letter Acronyms) as a followup to capitalizing the reified heuristic (Blue Screen of Death becomes BSOD, which gets pronounced "bee-sod").
Thanks for commenting -- I'm glad I came back to this. Fun thought. Oh yeah, and congratulations on getting through the exams.