You are viewing a read-only archive of the Blogs.Harvard network. Learn more.

March Madness Midpoint

ø

Zoiks. Best not to think about it too much.

In spite of being on the mend, still felt pretty grotty and exhausted in the morning, and consequently didn’t arrive at the office until well after 10 am. Fired off emails to Scott Edwards and Zoe—setting up a meeting with the latter for Friday morning (the former is out of the office for a while).

In the afternoon, helped Wil with a stats problem. He was trying to do a nonlinear regression in Excel (a nightmare), and I was able to figure it out in R in less than an hour. I was quite proud of myself and felt very useful—once again, it’s proving an invaluable skill to have under my belt. I did hesitate for a moment, thinking I shouldn’t take more time away from work, but since I was mostly stewing about how to write up my chapters, I figured I wasn’t that productive anyway, and that model fitting was something I really ought to know how to do in R. I promised myself I’d take no longer than an hour and then give up.  That turned out to be ample time.

I spent the rest of the working day reading two of Mike Foote’s papers, one on “Rarefaction analysis of morphological and taxonomic diversity”, which I ended up not concentrating on all too deeply, and the other on the Paleozoic crinoid morphospace, which I read with much more attention than I had given it before. Some thoughts:

  • The rarefaction paper does classical rarefaction, (none of the fancier Alroy algorithms, which I suppose post-date that paper); what’s more, it rarefies by species, not by occurrence—because of course Foote isn’t working off of an occurrence-level PBDB type database. This is of course good news for me—I think my study might be (?) the first time someone’s actually populating a morphospace with occurrence-level data in the time dimension. That allows me to subsample/rarefy by occurrence.
  • The 1995 crinoid morphospace paper is full of cross-references to other Foote papers on the crinoid morphospace. Basically, his crinoid morphospace project is a ~5-paper monster split into bits. That makes it really, really hard to read—tricky to understand one without having read all the others. I realized in the reading that I am veering into that direction with my paper, and it’s something I’d prefer to avoid. What’s the point of doing all that detailed work if nobody’s going to understand it because it’s presented in a scattered and opaque manner?
  • His work is almost all at the abstract, morphospace/disparity/diversity meta-level. Rarely does biology, function, or phylogeny enter into the discussion. This is, on the one hand, heartening—if he can do it, so can I. But it’s also unsatisfying.
  • Where he does touch on biology: early on, he sets up the idea of “morphological constraint” in crinoids, i.e. the idea that crinoids early on hit some sort of intrinsic limit on morphological diversity, and the subsequently just evolve about in a constrained space. This acts as a straw man of sorts for developing his arcane mathematical manipulations of the morphospace data.
  • In toto, I’m not entirely enamored of using Foote’s paper(s) as a model for my own.

Diesel Acceleration

ø

Spent the morning at Diesel and made good headway on the plot of disparity measures. I think the three main measures (mean pairwise distance, convex hull volume, and alpha shape volume) are mostly in place now. Now, I just need to check other measures, Doug Erwin’s paper in Palaeontology a couple of years back comes to mind, perhaps throw together a quick plot of variance or range on each axis (?), and—importantly—add a panel for diversity. Oh, and combine all the panels together on one graphics device/PDF.

I got a little worried about taxon sampling when I was making the disparity plots this morning—I had been operating under the “range-through” model in the function I wrote to return taxon lists for each time bin, and I got curious as to what it would like like under “in-bin” sampling. The results were, as you might expect, quite a bit different (though I should run a formal comparison on the completed figure that has the assembled panels). This was some brief cause of concern to me but I realized that the key to doing this right is to compare measures of diversity with morphospace constructed with the same taxon lists. What’s more interesting (I hope) is how diversity and morphospace are related given the same taxon sampling, rather than comparing, say, SQ-subsampled diversity to range-through morphospace. Ideally, diversity—morphospace comparisons under the different taxon sampling models will yield similar results in the sense that the relationship between the two will be the same, even if the outcomes compared amongst the models are quite different. Otherwise, it would be important (and interesting) to comment on how taxon sampling affects the story. Complicated, confusing, but I think that’s the right approach.

In any case, here are the panels from this morning:

A crucial (because very vexing!) detail on the bottom plot is the greek character alpha, something I’ve been trying to do over and over again over the past few months and only just figured out. Very rewarding. It turns out there’s a {plotmath} package (apparently part of the base installation) with a function bquote() that does the trick, beautifully. The only downside is that, unlike most things in base R, it’s not vectorized—so you can’t pass a vector variable to it and have it plot all of the values in it, so you have to use a for loop. Ugly, but not a problem.

In the course of the afternoon, consolidated the plots into a four-panel figure and started working on the fourth panel, diversity. Didn’t quite finish, but got close, and felt very happy with the amount of work I got done and the level of productivity and focus. Once I’d gotten home and cooked dinner (it was a rather elaborate effort), though, I was exhausted, so the workday did not stretch into the evening this time. But I am still satisfied. Hitting the cafés is definitely a good idea, and I really think I need to do it more often.

Are you shitting me? Because I am shitting myself.

ø

I am experiencing a definite bout of non-acute, low-grade, chronic time-is-ticking-freakout. I decided to stay away from the office this morning, after yesterday’s multifarious distractions, and instead worked at Darwin’s with Kati. Made decent progress of a sort, delving into how to calculate the next measure of morphospace occupancy—but in the process found myself quite paralyzed at the sheer scale of the task. I figured out how to calculate and plot the alpha shape of a point cloud in 3D (which looks very cool, by the way):

I could not, however, figure out how to save this to a movie (it’s very cool in its spinning glory), for presentation purposes. More importantly, I also couldn’t figure out how to automate generating a large number of these plots, since I’m going to have to inspect many different alpha shapes for different alpha values for all the time bins, in order to (manually and subjectively) pick the most appropriate ones. It’s very tricky because the package that does the alpha shape calculations uses RGL, an OpenGL implementation for R, for all of its plotting, which has its own set of graphical parameters that are largely unrelated to R’s base graphics (or indeed to grid, the other graphics layer I’ve been using). To really delve into this, cool as it would be, is going to take way too much time. So I’m going to have to brute-force this by hand and plot each combination of alpha values and time bins manually.

Aping Some More

ø

Started the day trying to figure out how to plot lines to demarcate clades (for color coding) on the phylogeny. Not as easy as one would hope for it to be in ape(). Eventually figured it out (ooh, hello, it’s lunch time already! shit!). It looks like this (i.e., promising):

Now the next task is to get these taxa colored up all nice and good so the points in the PCO plot can be colored correspondingly. This took quite a while, owing to some very non-intuitive complexities in the way objects of class phylo store the order of tips on the phylogeny (not, as one might expect, the order in which they are displayed!), and the way R codes for colors, which you’d think I knew by now… In any case, figured out eventually:

Was feeling quite ready to roll on and start making the matching PCO plots which will use the same colors to link phylogeny with morphology, but alas, this is going to have to serve as a temporary stopping point: since I was rudely shoehorned into co-presenting at the idiotic Geobiology meeting on Friday, I need to invest at least the next half an hour before Justin comes by to discuss what we’ll say actually reading the damn paper we’re going to present. At least I’m parked on a success/downhill.

Well, Justin showed up early and we ended up talking about the paper for a bit longer than I had anticipated, so it wasn’t until after dinner that I got back to working on the figure. In any case, I was able to get the rest of it banged out pretty quickly, and the result is at least aesthetically satisfying (although I don’t think it shows anything particularly fascinating that isn’t already obvious from the other plots I’ve made):

First off, though pennates and centrics occupy different parts of the 2D morphospace (which is obvious from the main PCO plot), their subdivisions don’t really occupy distinct areas. Radial versus bi-/multipolar centrics pretty much sit in the same part of morphospace, and araphids and raphids have a lot of overlap, too. What’s more, the clades within each of those groups (which I’ve plotted in similar colors—reds, blues, and dark greys) don’t seem to plot together, either. So in this view, while it seems that there is some very high level relationship between morphological and phylogenetic proximity, but it isn’t a particularly close relationship.

This is kinda disappointing (I think), but it sets up rather nicely the next plot, which shows pretty well the same thing, but from a slightly different perspective.

Monkeying with ape

ø

After a week of relatively low efficiency, exhaustion, and feeling sorry for myself, kicked in the help of SelfControl this morning to get myself back on track. At this point I’ve reached the “morphospace vs. phylogeny” section, which requires plotting a phylogeny side-by-side with a PCO plot, with matching color codes. This requires quite a bit of R learning, since it isn’t something I’ve done before.

I first wanted to visualize what the plot should look like. I printed out tree plots of both trees I’d gotten, the one from Sörhannus and the one from Medlin. The latter I had not actually even downloaded from my emails yet. I did this, and had a good look at it, but was not impressed by the data she had sent me—the tree plotted up OK, but all the species names were abbreviated, and the tree had far fewer taxa in it than the one from Sörhannus. So, I decided to go with the Sörhannus one, whatever the repercussions might be.

I printed out his tree and the list of genus names found in the morphospace as well as on the tree (there are 41). Then I went through and manually highlighted all species on the tree belonging to those genera. It’s a pretty good, broad spread. What I want to do is to produce a pruned version of the tree with only one node for each of those genera. Had to spend a lot of time in the ape() package (which stands for “analysis of phylogenetics and evolution”) to get it to go from looking like this (the original phylogeny):

To this (the version I ended up with last night at about 7:30):

This involved removing a lot of species, shortening the names, and rotating a good many nodes to put them in an order at least broadly comparable to the Sörhannus paper (also the order, roughly, of the Medlin and Kooistra phylogenies). The motivation-sapping thing was to see how different this phylogeny is topologically quite different from the one Sörhannus published. I had just assumed that the tree he sent me was the same one that was published in the paper, and it was impossible to tell (given the image above) with the dense tangle of branches, that it was actually different. Now that I’ve spent a whole day monkeying with it, though, I don’t feel like it makes sense to abandon it and try all over. I suppose I will just have to acknowledge that it’s topologically different, and maybe email Sörhannus and ask him what the deal is. Maybe add a line or two in the paper about why, if I can figure it out.

The alternatives are to use Kooistra’s phylogeny or Medlin’s. But Kooistra was spectacularly unhelpful when I emailed him with my original character list for review, so I don’t really feel like engaging with him, and Medlin sent me her phylogeny, but it’s even messier than the one from Sörhannus because all the names are abbreviated and it would probably take me at least a day just to decipher what species the abbreviations actually stand for. So no really good alternatives.

Best to press ahead with this. I’ve got to get this thing done, after all, whether it’s perfect or not. And it won’t be.

Ladies and Gentlemen, Figure One

1

Lab meeting aside, I’ve been pushing hard since Kati left Friday morning. Two late nights (it’s 1:23 at night as I write) in a row, with essentially no distractions or breaks, and I’ve finally got something for Figure 1, at least a starting point, if not the final thing. I’m still pretty overwhelmed by how small of a part of what needs to be done this represents, but I can’t help but be a little proud of making this look rather nice:

Now, for some truly well-earned sleep. Didn’t think I had the late nights in me anymore—but I guess I’m starting to warm up again and slowly approaching my stride.

DNA, RNA… It’s Not All the Same

1

Had a frustrating morning not knowing quite where to pick up—continue struggling with the morphological/molecular distance plot or move on to the rest of the list. There was some amount of distraction, too, with the lab’s new intern, Sarah, an undergrad from France, arriving and needing some assistance getting settled it. Fortunately, I got an email back from Maude after lunch, introducing me to her labmate Allison, an expert in all things R and phylogeny. She sat down with my problem and was extremely helpful—she very quickly pointed out some pretty major things (retrospectively pretty obvious):

  1. I was trying to read an RNA sequence with a function called read.dna(). 
  2. DNA and RNA are not the same.
  3. The distance measures implemented in the dist.dna() function probably do not apply to RNA.

She pointed me toward the package {seqinr}, which includes a function read.alignment() that reads clustal files like mine at face value, including the U nucleotides. Hallelujah. She also pointed out another fact that was almost immediately obvious retrospectively: pairwise genetic distances computed directly from sequences are not the same as the genetic distances measured along branches of a tree (those are known as patristic distances). It might make more sense to use patristic distances, since they actually reflect the ‘evolutionary’ distances, at least in so far as the phylogeny is correct. This, however, would require obtaining the files with the trees themselves from Sorhannus and Medlin. Perhaps I will send them another email.

Anyway, this was quite an eye-opener of a conversation and a very stark reminder of how much better it is to ask for help rather than to just bash your head against the wall alone. Other people are invariably smarter than oneself. It eventually resulted in this plot, which looks a lot more reasonable:

 

Now this, I think, is something I can work with. The correlation is still quite weak, but it’s now not ridiculous, and it’s not binned, and there’s still some vague positive association between the two variables, which does make sense.

Day of Wor(k)ship

ø

Tried to confirm that I was doing the right thing calculating molecular distances in R, so I downloaded MEGA (which runs on OSX through a cool Wine emulation layer), and ran a distance calculation on the alignments from Ulf Sorhannus with the default settings (which were even more complicated that in the ape package, and I understood even less). I then had to go through an export to Excel to add zeros in the diagonal and upper triangle of the matrix so I could then import it into R as a .csv file, but this allowed me to make a crossplot and compare to the morphological distances:

The correlation looks quite a bit better, and the r-squared is 0.1—so about double that by the ape methods—but it is still pretty lousy. I’m not sure what MEGA is doing differently… But in any case, it’s still not a stellar correlation, and what has to be explained is why the correlation is so poor, not why it’s so good. I tried another one (this time using a distance setting I think has an equivalent in ape, namely JC, the Jukes/Cantor model), and it looked similar, also with an r-squared of 0.1, and a different look of the plot (it’s not so “binned” in the molecular distance axis). Maybe there’s a setting for how many decimal points the dna.dist() function in ape returns, that’s set for too coarse of a setting, causing the r-squared to be lower than it would otherwise be?

This is compared to the same plot produced using ape with the JC69 setting:

Hmm. This is strange, actually—the range of molecular distance values is also much less, they only go up to 0.05 in the ape plot versus all the way to 0.14 in the MEGA plot. Something seems to be fishy. It can’t be that the distance matrix in MEGA is somehow bigger, because if the size of the distance matrix didn’t match the size of the morphological distance matrix, the plot() function would throw an error, which clearly isn’t happening. Perhaps I am not reading the data in correctly in R using the read.dna() function—after all, I don’t really know what the hell alignments or alignment files are supposed to look like.

I noticed that when printing the summary of the dna data, I get nonsensical base compositions that suggest maybe R isn’t reading in the uracil data at all… though this seems so silly as to be really unlikely.

Base composition:
a     c     g     t
0.407 0.322 0.271 0.000

Could this be the problem? Surely not… Well. After a lot of searching (ape seems to be even less well-documented than all the other packages I’ve used!), figured out the (retrospectively) obvious way of interrogating the sequence alignments directly:

(as.character(dna[1:5,1:20]))

…which reveals that the “U” nucleotide is not being read in by the read.dna() function. Wow. That is stunningly awful. I spent a good hour trawling for help, got none, and finally decided to email Maude, who I think might conceivably work on this sort of stuff, for help. I’m giving up here. Know thy limits.

Soldiering Ahead

ø

 

Although the rain caused a bit of a delay this morning (it was so disgusting that, even wearing full waterproofs, walking in turned out to be so unpleasant as to be not worth the effort—but the bus, of course, zipped by, full), I have been blazing forward. I finished up the character/PCO association plot, which involved hand-crafting a legend (the legend() function does not work with the symbol() function I used to plot the bubbles), realizing in the process that the color scale was off, that I needed to generate my own palette, and thus that I needed to learn the basics of R color palettes and how they’re used. It’s all fairly low-level stuff.

In any case, I was mighty excited when I had a final product in hand, and skipped into Andy’s office to show him—he was also very excited and said “this should definitely be a figure in your paper”, so I suppose it was worth the extra two days it took to make it. Here’s what the final thing looks like.

This plot lets me say two things:  (a) most of the strongest and most significant associations between the original characters and the PCO axes are in the first few axes, so the 2-D plots will be somewhat meaningful (i.e. big and dark circles are toward the left), and (b) there is also a fair amount of information contained in the higher PCO axes, meaning that there is real complexity structure to the data beyond the first few axes, too, and some of the characters won’t show variation expressed clearly in the first few axes. This is good, because it means I wasn’t wasting my time adding all those other characters into the morphospace—they aren’t just covarying perfectly with other characters, but have a variance structure of their own.

With the Cramér analysis that the plot is generated from, I can also now extract a) which the most important variables are that contribute to axes 1, 2, and 3, and b) which the most important variables are on other axes, that won’t show up on those plots. This is what I need to do next.

For PCO1, the ten characters with the largest Cramér values are:

X60 X102 X89 X2 X61 X90 X91 X26 X41 X1

0.81 0.51 0.43 0.43 0.40 0.39 0.38 0.35 0.35 0.34

For PCO2, they are:

X34 X26 X60 X51 X102 X84 X12 X63 X2 X27

0.81 0.49 0.40 0.38 0.36 0.32 0.31 0.30 0.29 0.28

And for PCO3:

X102 X68 X70 X73 X34 X91 X72 X60 X41 X12

0.41 0.38 0.36 0.35 0.35 0.34 0.34 0.32 0.29 0.27

Now, I’ll need to somehow highlight these on a PCO1-2 or PCO1-2-3 plot, to show how they fall on the plot, and also illustrate what they are (or at least a few of them). Altogether the biggest few are X60, X34, X102, X26, X89, X2, X61, and X60. Those are all above 0.4.

In the midst of my head spinning with all of these details and things to do, I decided I needed to step back and sketch out my plan for the main figures for the paper. I did this…

…and in the middle of it all, I received an email from the Freshman Dean’s Office informing me that I have made the first cut and am being asked for an interview! This was a fabulously exciting piece of news and came as a crowning moment for an already successful week. Time to run off for a well-deserved dinner. Huzzah!

Figured out the GOF in cmdscale()!

1

After a whole bunch of reading around, I finally figured out how the GOF (“goodness of fit”) statistic reported by the R function cmdscale() works. It reports two numbers. The first is the sum of the first k eigenvalues divided by the sum of the absolute values of all of the n-1 eigenvalues, where k is the number of eigenvectors you request from the function and n is the rank of the distance matrix you supply. The second is the same thing but divided by the sum of the positive eigenvalues only (so, the bigger, and better-looking, number).

Although I still don’t understand how the PCO algorithm works exactly, in spite of trying rather hard to work through the original Gower paper from 1966, it still feels like a victory to at least have figured out what the R function I’m using is doing. I also figured out why Mike Foote, in his morphospace papers, kept referring to the ratio of the eigenvalues to the trace of the distance matrix as his measure of variance explained, since the trace (i.e. the sum of the diagonals) of the distance matrix is, by definition, zero.

Anyway. Having done a lot of reading and google-trawling on the subject of PCO, I started feeling rather grotty as the sun went down, and decided I might not be doing myself a favor in the long run pushing myself too hard when I’m sick. So, I decided to call it and try again tomorrow, when I will hopefully be feeling better. It’s been another good day—something new understood that I didn’t understand before, another paragraph written: progress made, momentum gained. Forwards!