Thinking Evolution, At Darwin’s

After a late start (thanks to a late night at Pierre and Nicole’s yesterday), and part of the morning helping Beau figure out his very own PCO problem, spent the afternoon at Darwin’s East. Reading the Erwin paper on disparity brought up a handful of points:

  1. There are a few other disparity measures I might consider—total variance, total range, number of unique pairwise character combinations, participation ratio—that I’m not too familiar with. He cites a paper by Ciampaglio, 2001, that I should check out.
  2. Is there some way to assess whether morphospace fills more rapidly (or more slowly) than would be expected under some null model of diversification (say, random filling)? Can that be simulated (a sort of bootstrap/p-value for morphospace occupation)? Erwin references three sources: Foote, 1996b (in the book Evolutionary Paleobiology by Jablonski, Erwin, and Lipps); Gavrilets, 1999 (“Dynamics of clade diversification…”); and Pie and Weitz, 2006 (“A null model of morphospace occupation”). The last one seems particularly relevant. Yikes! Lots of reading to do.
  3. Can clades be separated out from the morphospace and their patterns of occupancy-through-time be examined individually? (My guess is that this will break down due to the degree of uncertainty of the tree topology).
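
Two of the measures in point 1, total variance and total range, are simple enough to sketch. This is a minimal sketch and not Erwin's or Ciampaglio's code: it assumes each taxon in a time bin is a tuple of scores on the morphospace axes, and the function names and example scores are made up for illustration.

```python
from statistics import variance


def total_variance(points):
    """Sum of the per-axis sample variances of the taxa in one time bin."""
    if len(points) < 2:
        return 0.0  # variance is undefined for fewer than two taxa
    axes = zip(*points)  # transpose: one tuple of scores per axis
    return sum(variance(axis) for axis in axes)


def total_range(points):
    """Sum of the per-axis ranges (max minus min) of the taxa in one time bin."""
    if not points:
        return 0.0
    axes = zip(*points)
    return sum(max(axis) - min(axis) for axis in axes)


# Hypothetical scores for three taxa on two axes.
bin_scores = [(0.0, 1.0), (2.0, 3.0), (4.0, 2.0)]
tv = total_variance(bin_scores)  # axis variances 4.0 and 1.0
tr = total_range(bin_scores)     # axis ranges 4.0 and 2.0
```

Total range only looks at the extremes on each axis, so it is the more sensitive of the two to how many taxa happen to be sampled in a bin.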

Well, these are cool and interesting things to think about. Need to actually produce something now, though. I think setting up my code to produce disparity/diversity figures for different taxon sampling methods is next. I’m a little confused about exactly how I’m going to go about this, since I have two separate Neptune data files now, the original one I was working with on the diversity project, and the modified one for the morphospace project, which includes the crucial “Genus” column (with the genus name only). But, it has about 2,000 fewer entries than the full one, and I’m not quite sure why. It probably doesn’t matter though, since it’s only a third of a percent or so of all the entries in the database. Maybe I can just ignore it. Maybe it’s just the zero-age-value occurrences I took out. (Quick check: doesn’t look like it). I’m going to ignore it for the time being. [Later note: the difference is those occurrences of taxa that are in the Neptune database but not in the morphospace—i.e. genera that I didn’t code for. It’s actually quite reassuring that they only account for a third of a percent of all the occurrences.]

Another minor niggle I noticed: the genus diversity from the morphospace increases in the first few time bins of the Paleocene, while the species diversity seems to be zero until about 60 myr. That does not make sense—something’s fishy there. Need to follow up on that. Result: probably down to setting the time bins so they only start at 60 myr. Resetting the time bins to go back to 64 myr ought to fix that.

The problem I’m grappling with at the moment is that the code to calculate convex hull volumes crashes when I run it under the “in-bin” sampling model. It works fine under “range-through” sampling. I had problems with the mean pairwise distance function initially, too, because the in-bin sampling mode supplies some time bins with zero-length taxon lists (i.e. no occurrences). I fixed that and the mpwd works fine now, but the convex hull volumes still don’t. What gives? Reducing the range of dimensions used for convex hull volume calculation to 3D and 4D only (rather than 3D through 10D) makes it run OK, which suggests there’s some sort of problem in the higher dimensions. I’ll just have to keep adding dimensions until it breaks. It works fine in 5D and 6D. It crashes when I try to run it in 7D. Huh?

One observation that’s a possible lead is that, under in-bin sampling, the least diverse time bin has only 7 taxa. Is it possible that at least n+1 vertices are needed to calculate the hypervolume of a shape in n-space? It sort of makes sense in low dimensions: you need at least 3 points to define an area in 2-space, and you need at least 4 points in 3-space to define a volume. Maybe you need at least 8 vertices to calculate a hypervolume in 7-space, which would explain why 7 taxa are enough for 6D but not for 7D. If this is true, then since the range-through run worked all the way up to 10D, no time bin under range-through sampling should have fewer than 11 genera in it. This is, in fact, true: the least diverse bin has 13 species in it. I think I’ve figured that one out!
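
The n+1 rule can be turned into a check that runs before any hull volumes are computed. This is a minimal sketch with hypothetical function names, not the code from my scripts. Note that n+1 points are necessary but not sufficient: the points must also not all lie in a common lower-dimensional plane, or the hull is flat and has zero volume.

```python
def max_hull_dimension(n_taxa):
    """Highest n for which n_taxa points could enclose a volume in n-space."""
    return max(n_taxa - 1, 0)


def usable_dimensions(taxa_per_bin, dims=range(3, 11)):
    """Dimensions in `dims` for which every time bin has at least n + 1 taxa."""
    if not taxa_per_bin:
        return []
    smallest = min(taxa_per_bin)  # the least diverse bin sets the limit
    return [n for n in dims if smallest >= n + 1]


# Least diverse bin has 7 taxa (the in-bin case): 3D through 6D are usable.
in_bin_dims = usable_dimensions([7, 12, 20])
# Least diverse bin has 13 taxa (the range-through case): 3D through 10D.
range_through_dims = usable_dimensions([13, 15, 22])
```

The bin counts other than 7 and 13 are invented for the example.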

Here, then, is the same figure as in yesterday’s post, but using the in-bin taxon sampling rather than range-through:

It looks quite a bit more variable and messy, but otherwise it’s pretty similar overall. I’m not quite sure why the alpha volumes with values >0.11 did not plot—maybe that algorithm crashed, too?! No, the results are there. Why aren’t they plotting, though? OK, problem solved—silly indexing mistake that crept in since I copy-pasted the plotting code for the alpha shape results from the code that plots the convex hull volumes. Never mind. Here’s the corrected version:

Well. Now that I’m not distracted by those missing lines, a few differences between the two sets of panels:

  • Mean pairwise distance is more variable with in-bin sampling (SIB) than with range-through sampling (RT)
  • The Oligocene has the peak convex hull volume in RT, while in SIB the Miocene is the peak and the Oligocene doesn’t stand out
  • In alpha shape volume, the SIB curve is messier/noisier, but the pattern of increase with a peak in the Miocene is the same
  • There are hardly any noticeable differences in the species diversity curves, which is a bit surprising (?). I thought RT differed more from SIB, based on my term paper with Charles all those many moons ago
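
For reference, the difference between the two sampling models compared above can be sketched in a few lines. This is a minimal sketch under assumptions, not my actual code: occurrences are assumed to be (taxon, age) pairs, the bin edges are assumed to be in increasing order, and all names and example values are hypothetical.

```python
def bin_index(age, bin_edges):
    """Index i of the bin with bin_edges[i] <= age < bin_edges[i + 1], else None."""
    for i in range(len(bin_edges) - 1):
        if bin_edges[i] <= age < bin_edges[i + 1]:
            return i
    return None


def sampled_in_bin(occurrences, bin_edges):
    """Per bin, the taxa with at least one occurrence inside that bin."""
    taxa = [set() for _ in range(len(bin_edges) - 1)]
    for taxon, age in occurrences:
        i = bin_index(age, bin_edges)
        if i is not None:
            taxa[i].add(taxon)
    return taxa


def range_through(occurrences, bin_edges):
    """Per bin, the taxa whose first-to-last sampled range spans that bin."""
    sib = sampled_in_bin(occurrences, bin_edges)
    taxa = [set() for _ in sib]
    for taxon in set().union(*sib):
        present = [i for i, found in enumerate(sib) if taxon in found]
        # Fill every bin between the first and last bin the taxon was found in.
        for i in range(min(present), max(present) + 1):
            taxa[i].add(taxon)
    return taxa


edges = [0, 10, 20, 30]
occs = [("A", 5), ("A", 25), ("B", 15)]
sib_lists = sampled_in_bin(occs, edges)  # A is missing from the middle bin
rt_lists = range_through(occs, edges)    # A is filled in across its range
```

Every range-through list contains the corresponding in-bin list, so range-through bins can never be less diverse than in-bin ones. That fits the in-bin curves being noisier and the in-bin minimum being only 7 taxa.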

OK. So now that this works sorta-kinda well, on to implementing the other taxon sampling methods. I already have routines in place for rarefaction and by-list, unweighted (“UW”) subsampling in my code from the diversity project. I should try and see if I can run those routines on the morphospace-modified Neptune database. Ah. But here’s a pretty major complication I hadn’t thought of. The subsampling algorithms all work by taking subsamples many times, calculating the diversity for each subsample, and then taking an average. Obviously, that average (which is what would be plotted in the bottom panel—no problem there, in theory) doesn’t have an associated set of taxon lists for each time bin. Rather, to do this properly I’d have to write a routine that generates a subsample, calculates the associated mean pairwise distances, the convex hull volumes, and the alpha volumes, and then does that a bunch of times and calculates the averages. It’s not that this is insurmountable, it’s just going to be a fair bit of work.
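
The routine described above could look roughly like this for a single time bin. It is a minimal sketch, not an implementation of UW or SQ subsampling: it draws a fixed quota of occurrences without replacement (rarefaction-style). The names, the `coords` lookup, and the stand-in metric are all assumptions for illustration. Mean pairwise distance, convex hull volume, or alpha volume would be passed in as `metric`.

```python
import random


def subsampled_metric(occurrences, coords, metric, quota, n_trials=100, seed=0):
    """Average a disparity metric over repeated subsamples of one time bin.

    occurrences: list of taxon names, one entry per occurrence in the bin
    coords:      dict mapping taxon name to its morphospace coordinates
    metric:      function taking a list of coordinate tuples, returning a number
    quota:       number of occurrences drawn (without replacement) per trial
    """
    rng = random.Random(seed)
    k = min(quota, len(occurrences))
    values = []
    for _ in range(n_trials):
        drawn = rng.sample(occurrences, k)
        taxa = sorted(set(drawn))  # the taxon list implied by this subsample
        values.append(metric([coords[t] for t in taxa]))
    return sum(values) / len(values)


def first_axis_range(points):
    """Stand-in metric for the example: range of scores on the first axis."""
    if not points:
        return 0.0
    xs = [p[0] for p in points]
    return max(xs) - min(xs)


coords = {"A": (0.0, 1.0), "B": (4.0, 1.0)}
occurrences = ["A", "A", "A", "B"]  # B is rare in this bin
full = subsampled_metric(occurrences, coords, first_axis_range, quota=4)
small = subsampled_metric(occurrences, coords, first_axis_range, quota=2)
```

With the full quota every trial recovers both taxa. With a quota of 2, the rare taxon is often missed and the averaged value drops, which is the effect the subsampling is meant to expose.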

It also raises the question of how this subsampling should be done. Should the subsampling routine for the morphospace be run at the species level, and then the resulting pattern be shown at the genus level in the morphospace, or should the subsampling routine itself be run at the genus level? The latter would be more work, but perhaps the more “correct” thing to do? Phew, this is more complicated than I anticipated. Much, much more complicated. Since this is going to be a pretty major effort, I should probably do it only for one subsampling method, at least at first, and since I will want to use SQ subsampling in the diversity paper, that’s probably what I ought to use here. But I haven’t implemented the SQ algorithm for my dataset yet… Yikes…

The Big Final Push Begins

It’s been good to be away for two weeks, a break from thinking about the thesis. And time to regroup before the big, final push. Which starts today. Beau sent me, among other wonderfully thoughtful tools for facing the next months, a book on mindfulness meditation, which I started reading yesterday. It’s almost staggeringly apropos, and I think I already got a chance to practice what it preaches this morning. I walked in to the office, something I thought I might do over the next weeks to make sure I get some fresh air, but realized when I got here that I had forgotten to extract my Harvard ID from my travel documents. It being 8 in the morning on a holiday, the building was deserted, and I could feel the first waves of disappointment and despair lapping at the carefully constructed enthusiasm and optimism for this first day of unbound effort. But somehow I managed to recognize what was happening, acknowledge those emotions, and decide (having toured the whole building in search of any doors that might be errantly open) that I could just go and have a cup of tea at Darwin’s and return when the museum opened at 9. So I set off, proud of my first (emotional) achievement of the day, and thinking how perhaps this was a fortunate reminder that in spite of all hopes there will continue to be obstacles over the course of the coming months, no matter how close I am to the end. Then I spotted an anthropology grad student striding towards the building, and asked her if she would let me in. Win.

This first obstacle overcome, I found it challenging to get settled. It’s not surprising, being the first day back after two weeks away (and a reminder why it’s going to be important to stay focused over the months ahead), but it was definitely both overwhelming and frustrating. While I have a list of general tasks to fulfill, I found myself unsure where or how to start. The first item on the list is finding the PCO-equivalent of “loadings” of characters on the PCO axes. But since this was a sticking point before I left, I thought it better to start elsewhere. (In the meantime, reading Mike Foote’s 1999 paper on the crinoid radiation, I found that he describes a way of doing it, but the description is oblique and it sounds difficult).

So I moved on to manually deciding which characters are the “important” ones, and scanned through my big “everything” plot. But beyond making a list of which characters seem to cluster in well-defined areas, and another list of egregiously dispersed characters, I got stuck on what to do. Should I just make two dozen plots with these characters? Surely, something like the above (“loadings”) would be preferable. Maybe I need to come back to this and just do it if I can’t get something like loadings to work.

Next on the list is “pairwise distance”. This is, from my reading of Foote, his primary measure of “disparity”. This should be relatively easy to program, so I started here. I was eventually able to sit down and do this, and the results are somewhat surprising. Basically, the disparity is flat through time, except that the Early Cretaceous time bin is a shade lower.
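
For the record, mean pairwise distance can be sketched as follows. This is a minimal sketch, not necessarily what Foote does: it assumes Euclidean distance between taxa on the PCO axes, and the function names and example coordinates are hypothetical. Bins with fewer than two taxa return None rather than crashing.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of points.

    Returns None when there are fewer than two points (no pairs to average).
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        return None
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def disparity_through_time(taxa_per_bin, coords):
    """Mean pairwise distance for each time bin's taxon list."""
    return [
        mean_pairwise_distance([coords[t] for t in sorted(taxa)])
        for taxa in taxa_per_bin
    ]


coords = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (0.0, 8.0)}
bins = [{"A", "B", "C"}, {"A", "B"}, set()]
curve = disparity_through_time(bins, coords)
# Pairwise distances in the first bin are 5, 8 and 5, so its mean is 6.
```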

The very low value of the Early Cretaceous is probably because the coded taxa in that bin are all very similar to one another. This could well be biased by the facts that 1) it’s only one single assemblage, and 2) none of the pennate-ish taxa reported in that assemblage are coded, because there were no decent descriptions for them.

This task carried me through lunch. Then, a tough moment—to write? I had promised myself I would spend part of each day writing. Needless to say, there is something of a mental block to doing this, so I was definitely afraid of sitting down and beginning to write, feeling like I am not ready yet. I tried to get my mind to start focusing in on the “big picture” view of things by rereading both my thesis proposal and a term paper I’d written (back in 2008) about the function and purpose of the diatom frustule, which was a shockingly good read (did I really write that?). Anyhow, it mostly resulted in tiredness. At least no immediate clarity on how to frame the paper—in terms of questions about diatom evolution, or in terms of morphospaces?

These are of course oversized questions for the first day back. Obviously. Smarter to work a little bit on the methods section, which I did.

All in all, not the most productive day in history, but given my goals, a good one. By the time the sun went down, I could feel the two-week onslaught of European virobacterial attack on my respiratory system, so I decided to call it a day and head home. I worked hard. I did not procrastinate. This has been a good start to the end. A worthy start for the big push.