You are viewing a read-only archive of the Blogs.Harvard network. Learn more.

March Madness Begins: Working Towards a Subsampling Solution

ø

From the beginning of the day:

A fresh day, a fresh start. Dark grey skies and heavy “wintery mix” falling outside—perfect conditions to hole up and work hard. Where to start today? I’m torn between working on the rarefaction/UW subsampling, so that I can finish off the disparity-diversity comparison plots, and working on the “all characters” through time plot, or the associated exercise of picking a few characters, and the associated exercise of figuring out which characters are actually responsible for increases in morphospace area expansion, which would address the looming question of what the biological story might be. The latter is more important in some ways, I think, but it also has the most potential for going off the rails (in the sense that I’m going to need to do a fair bit of reading). So let me start with the subsampling, and get it done, and then move on to the character-specific story.

From the end of the day:

A very brief report, because there are no results yet—but I am getting close—from what I’ve been working on today. On account of the gruesome weather and a desire to focus, I stayed at home, and it was a good day. Put in a very decent effort and made some serious headway in adapting the existing code for generating the diversity/disparity plot so that it will perform by-list, unweighted subsampling for the morphospace. The whole thing isn’t working yet, but a tantalizing glimpse of the subsampled values for convex hulls and alpha volumes looked like the curve flattens out completely under subsampling—suggesting that the increase over time might be an artifact of sampling.

Tues at Home

ø

I have definitely been feeling a surge in anxiety over the past 24 hours. All of the uncertainty about what to do about diversity subsampling and how it relates to the morphospace, as well as the question of what to do for the diversity chapter if SQ is not applicable to Neptune data was making me very uneasy. I didn’t sleep well last night, tossing and turning, mind racing in circles, but I did my best to keep myself focused on the moment, and went out for a short run this morning even through I had trouble getting out of bed. This helped a little bit.

Walking home after work yesterday I thought that a intermediate fix for my SQS problem in the context of the morphospace might be to apply a classical rarefaction or the closely related by-list, unweighted subsampling instead. While it’s philosophically very different, the results in Alroy’s studies aren’t all that dramatically different from SQS, and crucially, at least they do *some* correction for sampling as opposed to reading observed diversity at face value. It might not be impossible to implement so that I can actually generate both the diversity curve (species level) and the morphospace occupancy data (genus level).

I also realized that I really needed to iterate back to the text of my chapter—I’ve spent the last couple of weeks focused exclusively on producing figures in R, and so I have become unmoored from the enforced linear thinking required by the writing process, which is no doubt part of the reason my mind has started swirling in these terrifying vortices.

Had a very good meeting with Beau in the afternoon, realized that I have actually accomplished a lot in the past six weeks (36 of the 50 to dos from mid-January). Seeing a slightly clearer path forward, though it’s still a little terrifying.

Weekendud

ø

Managed no work at all this weekend, in spite of having a Saturday alone with Kati in DC. Not much to say about it other than that I really, really didn’t feel like working, and jumped upon every excuse not to do it—laundry, grocery shopping, newspaper reading, relaxing, and watching TV on Saturday; sleeping in, Kati time, going for a walk in the sun and the last episode of Downton Abbey on Sunday. Anything but work.

I tried not to let this cast a cloud on my Monday, however; headed out on my scheduled run and was at my desk, ready to work, at 9am. The starting of work was delayed by a little over an hour, since Andy’s intern Sarah showed up and begged for fulfillment of an earlier promise (not as empty as I had hoped, it turned out) to show her how to set up macros to set scale and insert scale bars on the photos she’s taking with the microscope camera system. It was a time sink but she was very grateful, and I felt (for a change) extremely competent. I may have failed to produce anything of academic value (so far) in the many years I’ve spent in graduate school, but I’ve certainly picked up a lot of skills and general knowledge along the way—including how to write macros for ImageJ.

In a strange way then, this was an hour well spent, since I returned to my desk feeling much more chipper and motivated than I had been earlier that morning, peeling myself from the bed and cursing the fact that it was a holiday yet I wouldn’t be able to take time off, without extreme guilt, i.e. no time off until the thesis is done. Fortunately, I’m well placed to get myself there now.

Spent a good few hours working here and there though Pierre and Nicole and Alexandre came by to visit, and took me to lunch, and in the end I figured it was no good hanging around at work since I was mostly distracted anyway, and headed back home with Kati. Back here, made some real progress on the next set of plots, showing the various measures of disparity. First, mean pairwise distance:

Then, convex hull volume (showing the effect of increasing the number of dimensions considered when calculating the volume, strictly speaking the hypervolume—in 3, 4, … up to 10 dimensions):

The results for convex hull volume are shown normalized (to the largest value in each set of calculations) since the absolute values get smaller with the addition of each dimension. This makes sense, considering the coordinates on all axes are between -1 and 1—an area of a square of side length 0.2, for example, is 0.04, a cube of the same side length is 0.008, a hypercube in 4 dimensions 0.0016, etc.

In any case, regardless of the number of PCO axes considered, the convex hull volumes tell a totally different story from the pairwise distances. Pairwise distances are pretty much indistinguishable, with the exception of the Early Cretaceous. Convex hull volume, on the other hand, shows a definite increase with time. What gives?

Monkeying with ape

ø

After a week of relatively low efficiency, exhaustion, and feeling sorry for myself, kicked in the help of SelfControl this morning to get myself back on track. At this point I’ve reached the “morphospace vs. phylogeny” section, which requires plotting a phylogeny side-by-side with a PCO plot, with matching color codes. This requires quite a bit of R learning, since it isn’t something I’ve done before.

I first wanted to visualize what the plot should look like. I printed out tree plots of both trees I’d gotten, the one from Sörhannus and the one from Medlin. The latter I had not actually even downloaded from my emails yet. I did this, and had a good look at it, but was not impressed by the data she had sent me—the tree plotted up OK, but all the species names were abbreviated, and the tree had far fewer taxa in it than the one from Sörhannus. So, I decided to go with the Sörhannus one, whatever the repercussions might be.

I printed out his tree and the list of genus names found in the morphospace as well as on the tree (there are 41). Then I went through and manually highlighted all species on the tree belonging to those genera. It’s a pretty good, broad spread. What I want to do is to produce a pruned version of the tree with only one node for each of those genera. Had to spend a lot of time in the ape() package (which stands for “analysis of phylogenetics and evolution”) to get it to go from looking like this (the original phylogeny):

To this (the version I ended up with last night at about 7:30):

This involved removing a lot of species, shortening the names, and rotating a good many nodes to put them in an order at least broadly comparable to the Sörhannus paper (also the order, roughly, of the Medlin and Kooistra phylogenies). The motivation-sapping thing was to see how different this phylogeny is topologically quite different from the one Sörhannus published. I had just assumed that the tree he sent me was the same one that was published in the paper, and it was impossible to tell (given the image above) with the dense tangle of branches, that it was actually different. Now that I’ve spent a whole day monkeying with it, though, I don’t feel like it makes sense to abandon it and try all over. I suppose I will just have to acknowledge that it’s topologically different, and maybe email Sörhannus and ask him what the deal is. Maybe add a line or two in the paper about why, if I can figure it out.

The alternatives are to use Kooistra’s phylogeny or Medlin’s. But Kooistra was spectacularly unhelpful when I emailed him with my original character list for review, so I don’t really feel like engaging with him, and Medlin sent me her phylogeny, but it’s even messier than the one from Sörhannus because all the names are abbreviated and it would probably take me at least a day just to decipher what species the abbreviations actually stand for. So no really good alternatives.

Best to press ahead with this. I’ve got to get this thing done, after all, whether it’s perfect or not. And it won’t be.

Lay Lady LaTeX

ø

Back into the office this morning (arrived via jog) after the weekend work binge. Realized in the course of pasting what I’d written so far into LaTeX yesterday that I will need to rewrite quite substantially. I want to start with a description of the morphospace plot, some general words about visualizing high-dimensional data and the plot-points-as-data approach I take in this figure. Then I can go on to the characters-vs-axes-correlation plot (which I suppose will be Figure 2, then), describe that, and then go back to the marginal figures around the main Figure 1 plot to highlight some of the characters that come out of that analysis, plus some others.

Anyway.The first thing was to finish getting myself set up in LaTeX, which ended up taking a truly ridonculous amount of time—tables are a bitch to do at the best of times, but when it’s a 123 * 140 data matrix, it can eat up the better part of a day. In any case, I’ve got my writing so far, the character descriptions, and the complete data matrix nicely formatted and ready to go. While it seems like the sort of low-priority thing you might save until the end, I think I chose well to do this now, since the aesthetic pleasure of seeing my dissertation begin to take form is having a more-than-desired motivating effect.

 

Ladies and Gentlemen, Figure One

1

Lab meeting aside, I’ve been pushing hard since Kati left Friday morning. Two late nights (it’s 1:23 at night as I write) in a row, with essentially no distractions or breaks, and I’ve finally got something for Figure 1, at least a starting point, if not the final thing. I’m still pretty overwhelmed by how small of a part of what needs to be done this represents, but I can’t help but be a little proud of making this look rather nice:

Now, for some truly well-earned sleep. Didn’t think I had the late nights in me anymore—but I guess I’m starting to warm up again and slowly approaching my stride.

Rock Bottom, Already?

1

The last twenty-four hours have been a challenge. When I settled down to work after dinner last night, it was time to go from describing the general trends in Figure 1 to identifying the particular characters that contribute to the first three PCO axes. To my utter shock, I found that, again, the characters showing up were ridiculous—clearly marginally important characters that were hitting high correlations with the axes and high statistical significance because they are valid in only a few genera, or only have a few valid states.

This was a huge blow to my confidence, having just spent an entire week recovering from the initial setback from the whole culling choice fiasco. I almost lost it entirely—that deep aching feeling in my chest came flooding in and I could feel myself spiraling down into utter despair. All those cumulative emotions of disappointment and failure from the years of aborted projects came rushing back, and it was not pleasant. I managed to regain my composure enough to decide to let it go for the day, go through a mindfulness meditation, and fall asleep.

Then, I lost a good couple of hours in the day today when my mother called and we ended up in a rather explosive argument, though thematically disconnected probably not causally unrelated to yesterday’s revelations and today’s worries. This put even more of damper on the day.

I took a walk and some fresh air, and was eventually able to pull myself together and start to try to tackle the problem. One of the characters was in the category I had previously determined as “uninformative”, i.e. with less than two character states having more than 1 valid entry. So I went back once more to write (yet more!) code to automatically remove such characters. I think it now works—at least I hope. Now to see if this improves the list of characters that crops up as “most associated” with the first few axes.

So, I ran the new filtering and distance matrix calculation, hoping to have finally rid myself of the “uninformatives”, and now depressingly (or actually hilariously, at this juncture) found my marginal Cramér sums in Figure 1 to be the greatest on PCO axis 10 or so, not the first one…

This is utterly idiotic, because it makes it look like that axis (I guess it’s PCO 11) captures the most variance, not PCO 1. Well, perhaps there are other problems, too. The key question is what are the top few characters in axes 1-3. They were idiotic ones last night, which caused me so much grief—and it also brought up the problem of how to choose the most contribution to the axes—is it the characters with the most significant association or the characters with the strongest association, i.e. the largest Cramér values? It seemed to me that Foote looked at significance only, and indeed the results were even more idiotic when I used Cramér last night. But that raises the question—why even plot Cramér on this idiotic figure, if I’m not going to use it?! Oh, the despair.

The biggest Cramér values with p < 0.05 on PCO 1:

  • X123, the relative thickness of raphe sides (an utterly minor character that has just 31 valid states, of which 28 are the zero state)
  • X60, the shape of the structural pattern center (this is actually a really good one)
  • X34, the shape of the mantle in cross section (I guess this is a reasonable one)
  • X26, the angle between mantle and valve face (umm, OK, similar to the above, I suppose)
  • X102, the location of labiate processes (what?! this one is totally ridiculous… I think)
  • X12, the general topography of the valve face (reasonable—this is overall shape)
  • X90, marginal ribs (silly)

On PCO 2:

  • X24, the shape of the central elevation (this is just bizarre—why this character? It’s subsidiary to the presence/absence of a central elevation, I just don’t get it)
  • X21, the shape of the apical elevation summit (this also seems like an irrelevant detail in the overall morphology)
  • X34, X60, X102, X26 as above
  • X114, the extent of the raphe (hmm… but why not the raphe character itself?)
  • X51, presence of a distinct central area (OK)

And PCO 3:

  • X46, whether rays are asymmetric (this is an extremely arcane one, only valid in 4 genera, and thereby only barely passing the “uninformative” criterion; totally ridiculous)
  • X114, as above
  • X28, whether the margin acts as linkage (this seems reasonable)
  • X121, raphe keel (hmm, seems esoteric—very detailed, only applies to 32 taxa)
  • X72, pore openings at inside (OK, this seems fairly universal)
  • X68, pore size (fine, I can buy this one)
  • X67, uniformity of pore size (ditto)
  • X76, pseudoloculate pores (I should probably take this one out, since it’s not strictly a morphologically agnostic character)

Well. This is not terribly encouraging. There are a few characters in there that describe overall shape or are widely shared characters—but a lot of characters with high associations are just esoteric things that apply to a few things. That isn’t satisfying, because I’d expect the most important characters to be stuff like the outline shape, whether it has a raphe or not, whether it has apical elevations, etc. Major things. Presumably this is because it’s just statistically easier to have a strong associations with the PCO axes post hoc if you only have a few valid entries. In theory, the p-value should account for that, though, so maybe I’m just looking at the wrong thing, and Foote had it right just looking at the p-values?

These characters have the lowest p-values in PCO 1:

  • X35, warts or plaques on mantle (this is utterly crazy, why should that matter?! It has 109:11 in 0:1 states, so not easy to exclude)
  • X43, external costae or ribs (also doesn’t seem reasonable)
  • X111, a fascia (esoteric, only applies to raphids)
  • X39, brim (hmph)
  • X85, apical pseudoseptum (also not what I’d expect)
  • X1, outline shape (HALLELUJAH! this one I would expect)
  • X5, dorsoventrality (hmm)
  • X123, raphe orientation, again

PCO 2 p-value champs:

  • X114, raphe extent again
  • X76, pseudoloculate bullshit again
  • X21, apical elevation summit shape again
  • X32, mantle symmetry (bullshit)
  • X43, see above
  • X27, ornament at mantle edge (at last, ONE character I might expect to see)
  • X69, pore shape (meh)

And with PCO 3, no surprises:

  • X12, general topography again, OK, fine
  • X46, that idiotic ray symmetry character again
  • X121, raphe keel again
  • X22, presence/absence of central elevation (OK, seems reasonable!)
  • X28, rim ornament again, OK
  • X114, raphe extent again

So, essentially, many of the characters I would think would be important—and which LOOK like they are important from just inspecting the PCO 1-2 plot, are not actually on the top 5 or top 10 lists of either significance or strength of association on those axes, instead, there’s a bunch of esoteric weirdo characters.

So what this basically tells me is that this whole Figure 1 exercise was a bullshit waste of time—my VERY FIRST HUNCH about empirical morphospaces based on many categorical characters was ABSOLUTELY SPOT ON: they’re stupid things to do because you DON’T KNOW WHAT THE AXES MEAN, and this whole statistical rigmarole has just proven that.

So, perhaps what I can write up in my paper is exactly that. Don’t expect these axes to mean what they mean, because they’re going to correlate with weird things you didn’t expect. And probably, I’ll just have to show how the IMPORTANT characters (chosen by me, of course, because bullshit science doesn’t work the way think it works, no, it seems to always go ass backwards from investigator to data, at least the shitty ass science I have somehow found myself doing) and how they plot onto the PCO 1-2, 2-3 morphospace.

God. This is so, so demotivating and utterly sad.

I’m calling it for tonight, off for dinner to weep some salty, salty tears into my chicken pot pie.

Tomorrow: start tracking time with TimeSink.

Hiatus, Resumption

ø

Dropped a little off the wagon there for a while. Friday was a write-off, for good but sad reason: Jenny Meyburg’s mother passed away earlier in the week, and it was the right thing to do to show up for her memorial service and sit shiva with her and her family in the afternoon. Saturday and Sunday were pretty limited, productivity-wise, in part for having to invest some time in returning from the delicate emotional grounds being at a friend’s parent’s funeral led us into, and in part for trying to get a bit of housework done. In any case, I did manage to finally tick off item #1 on the list, implementing the marginal totals in the character loadings plot (a surprisingly fiddly task!):

The next task I’ve been engaging in is to describe this.

Interview Done, Here Cometh the List

ø

Tuesday and Wednesday ended up being a write-off, in terms of research. First, I had my annual paragraph of letdowns to write for PlanktonTech, which I ended up doing successfully, though it cost some time and emotional investment, reading back over the lofty aspirations and lost time of the past year. And on Wednesday morning, the interview—which went OK, though it left me feeling pretty well exhausted, and after a careless mistake too many I decided to call it quits for the day and went home (and actually fell asleep in the middle of the afternoon).

Anyway. It’s been a bit of an uphill struggle regaining momentum after that break of focus, short though it was. Maybe it’s the midnight meowing of our feline houseguest, or the morning runs I am still getting used to, but I’m remarkably exhausted…

In any case. I emailed Sorhannus again and asked if he would send the complete tree file, so that I could calculate patristic distances. This involves figuring out how to read this file into R, and how to monkey with it once it’s there. It’s a .nwk file, which turns out stands for Newick, and is the standard file format for trees. It’s taxon (node) names hierarchically clustered by parentheses, with numbers denoting branch lengths. Seems straightforward enough.

I managed to do this, calculate the patristic distance (thanks to some more help from Allison) using the cophenetic() function in {ape}, and plot up the result (it’s slightly less well correlated, interestingly, than the direct distance, but not by much):

Since this plot looks similar to the last one I posted, it’s no surprise that the correlation between patristic and “direct” raw distance among sequences is high (r-squared of 0.79):

Pickin’ Up Mo’ Mentum

ø

Have missed a couple of days of note-taking—mostly due to fierce productivity. Finally cracked the % variance explained problem using a two-pronged approach, first by the ratio of eigenvalues to the sum of eigenvalues (or the trace of the Gower-transformed matrix, which are supposed to be the same thing), second by the sum of the r-squared values of the correlation between squared distances in the original data matrix against the squared (Euclidean) distances in the PCO-space of the first n principal coordinates.

This fantastic success under my belt, I spent a couple of fairly agonizing days trying to understand the description of the PCO analogue to PCA axis “loadings” described by Foote in both his ’95 and ’99 crinoid papers. Eventually tracked down the reference he cites for the statistics he uses to calculate coefficients of association between the categorical characters in the original matrix (on a “nominal scale”) and the PCO scores on each axis (a continuous character, or one on an “interval scale”, as Siegel & Castellan call it). The PCO scores need to be discretized, which is very easily done with R’s cut() function.

Anyway… As of Wednesday morning I feel that I’m on the brink of figuring this shit out, so I popped into Andy’s office first thing and let him know he wasn’t going to be getting my draft outline yet, because I am on a roll and want to see if I can crack this beast.

After an incredibly frustrating but focused day, I was able to both implement the Cramér coefficient, write a code that would calculate it and the p-value for each combination of PCO axes and (most difficult of all) generate a plot that summarizes the results:

The color scale shows the significance of each combination—black is a p-value of 0 (very significant), white is a p-value of 0.05 (not as significant), and everything above 0.05 has been thrown out. The size of the circles shows the degree of association, bigger circles implying a stronger association (larger Cramér coefficient).

I need to add a legend to this, and maybe fix the outline color of the circles, but I’m done for today. Tomorrow I will try to sort the association pairs by Cramér coefficient and make a table of the, say, 20 largest associations and what PCO axes they’re on, to get a sense for what determines the axes most. But for today, this has been a pretty huge accomplishment, and today is the first anniversary of our official city hall wedding, so I’m off!