You are viewing a read-only archive of the Blogs.Harvard network.

Three Day Push Three: It Starts With a Shove


It’s been a productive morning, even though I’ve put off starting on my three-day push thus far. It has been time very well spent, though—a first transatlantic DSA session which, in spite of temporarily clogging tubes, was as helpful as ever. And it was good to share with Beau some of the feelings of progress and positive outlook that have been creeping up through the productivity of the past week.

After DSA, I spent the remainder of the morning finishing the photocopying of the Farlow books, and returned them—just in the nick of time, as it turned out! Judy was just leaving the building and said she had decided she’d give me until noon (and this was seconds before noon)…

After a short lunch break, it was finally time to settle down for the task at hand: the SQ subsampling algorithm. It took me quite a while to get my brain shifted back into gear for thinking about this. Once I did, I figured calculating a coverage estimator was the first task. This is an estimator that is supposed to capture the extent to which the total diversity in a time interval is captured by the diversity of the sample for that interval. The most commonly used, according to Alroy’s manuscript, is Good’s u, which is 1 − o1/O, where o1 is the number of taxa that occur only once in the sample, and O is the total number of occurrences in the sample. Now, Alroy alters this to replace o1 with p1, the number of single-publication taxa. The justification is that people are most likely to publish on new things (taxonomic groups, environments, times, places) rather than publish yet another random occurrence of an already well-described phenomenon (like a certain taxon). This, in turn, is likely to distort u as an estimator of taxonomic coverage. I’m not sure I understand exactly why, but I guess the more publications you have, the larger your O gets. But your o1 would also increase, so o1/O would actually start to rise, which would decrease u, even though coverage should be getting better. This is, I think, what Alroy means.
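To make the estimator concrete, here is a small sketch of my own (not code from the post) that computes Good’s u from a list of occurrence records, one entry per occurrence:

```python
from collections import Counter

def goods_u(occurrences):
    """Good's u coverage estimator: u = 1 - o1/O, where o1 is the
    number of taxa occurring exactly once in the sample and O is the
    total number of occurrences."""
    counts = Counter(occurrences)  # occurrences per taxon
    o1 = sum(1 for n in counts.values() if n == 1)
    O = len(occurrences)           # total occurrence count
    return 1 - o1 / O

# A toy sample with 6 occurrences, two of which are singletons:
sample = ["A", "A", "B", "B", "C", "D"]
print(goods_u(sample))  # 1 - 2/6 ≈ 0.667
```

The intuition: the more singletons a sample contains, the more unsampled taxa probably remain, so u drops toward zero; a sample with no singletons gets u = 1.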

What’s confusing, though, is his suggested solution: instead of o1, single-occurrence taxa, he suggests counting p1, single-publication taxa. I’m not sure how this is any different. When would a single publication have multiple occurrences of a taxon? Well, I suppose the publication could describe a stratigraphic section with multiple formations, and the taxon could occur in any number of those. OK, so in measuring single-publication taxa, we’re including a greater number of taxa in the numerator of that ratio, which helps… make u even smaller?
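The o1 vs. p1 distinction can be illustrated with a sketch (the record layout and names are my own assumptions, not the actual data model): count taxa that occur in exactly one publication, given (taxon, publication) occurrence records.

```python
from collections import defaultdict

def single_publication_taxa(records):
    """Count taxa occurring in exactly one publication (Alroy's p1),
    given (taxon, publication_id) occurrence records. A taxon with
    several occurrences in one publication still counts toward p1,
    even though it would not count toward o1."""
    pubs_by_taxon = defaultdict(set)
    for taxon, pub in records:
        pubs_by_taxon[taxon].add(pub)
    return sum(1 for pubs in pubs_by_taxon.values() if len(pubs) == 1)

# Taxon "A" occurs three times, but only in publication 1, so it is a
# single-publication taxon despite not being a single-occurrence one:
records = [("A", 1), ("A", 1), ("A", 1), ("B", 1), ("B", 2)]
p1 = single_publication_taxa(records)
u_pub = 1 - p1 / len(records)  # Alroy's modified u
print(p1, u_pub)  # 1 0.8
```

Since every single-occurrence taxon is also a single-publication taxon but not vice versa, p1 ≥ o1, which is exactly the “makes u even smaller” effect noted above.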

In any case, although I don’t quite understand the justification, I feel like the reasons for this substitution listed by Alroy don’t really apply to the Neptune data. In some sense, the Neptune data are much closer to the “random sampling” that Alroy (correctly) suggests most paleontological collections are not. In those, he writes, “the point of publishing is not…to list further random samples of what might already be well-known times, places, environments, and taxonomic groups”. While this criticism is true of the choice of DSDP/ODP borehole locations, I think in some ways, the Neptune data do record, quite systematically, at least all the taxa that are found in a location, not just those that are new and interesting.

Spent a lot of time (most of the afternoon, sadly) scratching my head about this. Perhaps I need to just go ahead and code Good’s u in the simplest, original formulation, u = 1 − o1/O. Once I actually sat down to do this, it was surprisingly (almost shockingly!) quick to do. First hurdle down. Next, I wanted to get a sense for what this statistic looks like over the course of the Cenozoic for the diatom data in Neptune, so I quickly bashed out a for loop to do that:

Interestingly, much like the “preservation” indicators I was looking at before, this statistic doesn’t change a whole lot. There’s one sample in the early Eocene that has a very low u—but I’m willing to bet that’s because it’s just a very small sample. Otherwise, u sticks quite firmly between about 60 and 80%. This suggests to me, at a first pass, that the subsampling-corrected diversity curve using this approach will look pretty similar in shape to the raw data. If anything, the Oligocene looks like a coverage optimum here, and the late Eocene like a coverage minimum. We’ll see what that does to the final shape of the curve.

Hold on! I think I just realized I made a mistake in my programming of this “really easy” algorithm… I calculated O as the total diversity, rather than the number of occurrences. Re-do:
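The corrected per-bin loop might look roughly like this sketch (the record layout, bin names, and toy data are my assumptions, not the actual Neptune schema); the key fix is that O counts occurrences, not distinct taxa:

```python
from collections import Counter, defaultdict

# Hypothetical occurrence records: (taxon, time_bin). The genus names
# are just illustrative placeholders.
records = [
    ("Coscinodiscus", "Oligocene"), ("Coscinodiscus", "Oligocene"),
    ("Thalassiosira", "Oligocene"), ("Actinocyclus", "Oligocene"),
    ("Thalassiosira", "Miocene"), ("Azpeitia", "Miocene"),
]

by_bin = defaultdict(list)
for taxon, time_bin in records:
    by_bin[time_bin].append(taxon)

u_by_bin = {}
for time_bin, taxa in by_bin.items():
    counts = Counter(taxa)
    o1 = sum(1 for n in counts.values() if n == 1)
    O = len(taxa)  # number of occurrences, NOT the number of distinct taxa
    u_by_bin[time_bin] = 1 - o1 / O

print(u_by_bin)  # {'Oligocene': 0.5, 'Miocene': 0.0}
```

With O mistakenly set to the diversity (number of distinct taxa), o1/O would be inflated in every bin, so the bug would bias u downward across the board rather than reshaping the curve—consistent with the corrected result looking “not much different”.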

OK, so that’s not much different. Remarkably constant over time, is the bottom line. So my struck-out thought above stands: probably, this method is going to recover a corrected diversity curve that’s similar to the raw data.

Lost substantial amounts of steam at 8pm and decided to bag it in for the day. Not the most productive big day push so far. Hard to get the mind focused on the single goal when there are so many urgent other things that need to get done, too… Hopefully tomorrow will be better.


2 Comments

  1. Beau

    September 15, 2010 @ 7:06 am


    Impressive stuff. My brain must have atrophied this summer, because I was certainly struggling to make sense of your algorithm description. But it does seem very interesting, that’s for sure.

    Just a thought on the motivational dimension of 3DPs. To my mind, the 3DP works best when you have the motivation to ignore ‘pressing issues’ and focus entirely on the specific task at hand. If you find yourself losing that sense of focus on the core task, it might not be the best time to have a 3DP.

    Also, I would counsel not overusing the 3DP. As I see it, part of the effectiveness of these pushes comes from their relative infrequency and novelty. It’s worth trying to maintain that sense of uniqueness for the 3DP; if you’ve got a lot on, you can instead call it a ‘high functioning day’ (HFD; copyright 2010 Kock Motivational Systems Inc.) but don’t perhaps work as long hours, and get in a hyper-multitasking frame of mind.

    I guess what I’m trying to say is that you shouldn’t set yourself up for failure. For failure leads to disappointment, disappointment leads to demotivation, demotivation leads to suffering…

    Sorry, none of this is meant to rain on your parade! You’re doing fantastically well, and if you can maintain the 3DP pace, by all means you must. Just thinking aloud, really.

  2. kotrc

    September 15, 2010 @ 8:36 am


    Many thanks for the feedback. As always, I think you’re right on the money. 3DPs work because they’re special, and demoting them to the same status as everyday distracted work lets them suffer from the demotivational pitfalls that plague the quotidian graduate student experience (at least mine).
    Also, I think the HFD designator is fantastic. It has a really psycho-clinical ring to it, evocative of the high-functioning academic Asperger’s case.