No Rain, No Gain
It is still bucketing down today, which Bawb Oauwkes explains is thanks to the remnants of Lee, the next-next hurricane to pass through after Irene. Hence my best-laid manly plans to rise early and swim were dashed as I lay in bed in musine paralysis…
Anyway, I eventually made it to Starbucks and am overlooking constant, perpendicular flows of umbrellas across and water down Mass Ave. It’s a pretty picture, really. After being frustrated with the outcome of the time-resolved/Neptune-connected morphospace plots, I had decided last night to re-run the analysis so far with a reduced number of taxa and characters. I culled the dataset to leave only characters with valid states for at least 50% of the genera, and genera with valid character states for at least 50% of the characters (both calculated from the original, full data matrix—but with duplicate entries for heterovalvate genera removed). This matrix now had 120 genera and 77 characters.
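The culling rule described above can be sketched roughly as follows. This is an illustration in Python rather than the R used for the actual analysis, and the function name `cull_matrix`, the use of NaN for a missing state, and the toy matrix are all assumptions made for the example. The one detail taken from the description is that both coverage fractions are computed on the full matrix before anything is dropped.

```python
import numpy as np

def cull_matrix(states, min_fraction=0.5):
    """Keep genera (rows) and characters (columns) that are at least
    `min_fraction` coded, with both coverages measured on the full matrix."""
    valid = ~np.isnan(states)              # True where a state is coded
    genus_coverage = valid.mean(axis=1)    # fraction of characters coded per genus
    character_coverage = valid.mean(axis=0)  # fraction of genera coded per character
    keep_genera = genus_coverage >= min_fraction
    keep_characters = character_coverage >= min_fraction
    culled = states[np.ix_(keep_genera, keep_characters)]
    return culled, keep_genera, keep_characters

# Hypothetical toy matrix: 4 genera x 4 characters, NaN marks a missing state.
toy = np.array([
    [0.0,    1.0,    np.nan, 1.0],
    [1.0,    np.nan, np.nan, 0.0],
    [np.nan, np.nan, np.nan, 1.0],
    [1.0,    0.0,    np.nan, 0.0],
])
culled, keep_genera, keep_characters = cull_matrix(toy)
print(culled.shape)
```

Because the two filters are applied simultaneously, a retained genus can still end up below 50% coverage within the culled matrix; filtering iteratively would be a different (and stricter) rule.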
I generated a new dissimilarity matrix from this and ran the cmdscale() PCO algorithm, which yielded somewhat improved goodness-of-fit scores of 0.17 and 0.22. I’m interpreting this to mean that PC1 and PC2 together account for 39% of total variance in the dissimilarity matrix. I don’t yet understand exactly how the GOF is calculated, however, since it doesn’t match the first and second eigenvalues as a proportion of the sum of all eigenvalues (however calculated). At some point I need to track down a stats textbook that covers this and educate myself on how PCO works, since the R documentation of the function is not particularly illuminating. The equivalent function in Matlab, incidentally, has a much better explanation in its documentation (perhaps unsurprisingly, considering it’s an expensive proprietary piece of software rather than a free open-source one). That version doesn’t seem to calculate a goodness of fit, but it explains that if there are only two large eigenvalues, the spatial relationship among the points can be represented in just two dimensions. And, for their example,
The two negative eigenvalues indicate that the genetic distances are not Euclidean, that is, no configuration of points can reproduce D exactly. Fortunately, the negative eigenvalues are small relative to the largest positive ones, and the reduction to the first two columns of Y should be fairly accurate.
This isn’t the case for my dataset. The 16 largest eigenvalues are all of the same order of magnitude (between 0.6 and 0.1), and the 40 largest-magnitude negative eigenvalues are just one order of magnitude smaller than those largest positive eigenvalues. This means that there’s no exact Euclidean representation of the relative positions of the genera in morphospace. This might not come as a great surprise, since the matrix is still quite sparse and very high-dimensional. I’m not trying to represent geographic locations of points on a globe on a map (the sort of exercise in the Matlab example) but points in a much, much higher-dimensional space…
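To make the link between negative eigenvalues and non-Euclidean distances concrete, here is a minimal sketch of classical PCO (double-centring the squared distances, then an eigendecomposition). It is written in Python with NumPy, not R, and the function name and the two tiny distance matrices are invented for illustration; it is not a reimplementation of cmdscale(). The second matrix deliberately breaks the triangle inequality, so no configuration of points can reproduce it.

```python
import numpy as np

def pco_eigenvalues(d):
    """Classical PCO on a symmetric distance matrix `d`.
    Returns eigenvalues (largest first) and matching eigenvectors."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n   # J = I - (1/n) 11^T
    b = -0.5 * centering @ (d ** 2) @ centering   # double-centred matrix B
    eigenvalues, eigenvectors = np.linalg.eigh(b)  # B is symmetric
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]

# Three points on a line at 0, 1, 2: exactly Euclidean.
euclidean = np.array([[0.0, 1.0, 2.0],
                      [1.0, 0.0, 1.0],
                      [2.0, 1.0, 0.0]])

# Same, but d(0, 2) = 3 > d(0, 1) + d(1, 2): violates the triangle inequality.
non_euclidean = np.array([[0.0, 1.0, 3.0],
                          [1.0, 0.0, 1.0],
                          [3.0, 1.0, 0.0]])

ev_euclidean, _ = pco_eigenvalues(euclidean)
ev_non_euclidean, _ = pco_eigenvalues(non_euclidean)
print(ev_euclidean)       # no meaningfully negative values expected
print(ev_non_euclidean)   # one clearly negative value expected
```

The size of the negative eigenvalues relative to the leading positive ones gives a rough sense of how much distortion any low-dimensional Euclidean plot has to absorb.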
Here’s a plot of the eigenvalues:
This is actually not too bad. The first two eigenvalues don’t necessarily explain a huge chunk of the total variance, but they’re definitely substantially bigger than any of the others, so that’s good. It’s at least clear that the third doesn’t explain much more than the fourth or fifth, so it doesn’t necessarily make sense to start going down the list and including more dimensions. That’s a good thing, I think.
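On the earlier puzzle of how the GOF is calculated: as far as the cmdscale() help page can be read, the two numbers are not per-axis values. Both appear to be the sum of the first k eigenvalues, divided in one case by the sum of the absolute values of all eigenvalues and in the other by the sum of the positive eigenvalues only. If that reading is right, 0.17 and 0.22 would be two alternative estimates of the fraction captured by PC1 and PC2 together, and adding them to get 39% would not be appropriate. The sketch below (Python, with made-up eigenvalues and a hypothetical function name) shows the two normalisations under that assumption.

```python
import numpy as np

def cmdscale_style_gof(eigenvalues, k=2):
    """Two goodness-of-fit ratios for the first k axes, following the
    definition as I understand it from the cmdscale() help page (an assumption)."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    top = ev[:k].sum()                         # sum of the k leading eigenvalues
    g1 = top / np.abs(ev).sum()                # denominator: sum of |eigenvalues|
    g2 = top / np.clip(ev, 0.0, None).sum()    # denominator: positive eigenvalues only
    return g1, g2

# Made-up eigenvalues, including two negative ones, purely for illustration.
example = [0.6, 0.4, 0.2, -0.1, -0.1]
g1, g2 = cmdscale_style_gof(example, k=2)
print(g1, g2)
```

The two ratios coincide when there are no negative eigenvalues, so the gap between them is itself a hint of how non-Euclidean the dissimilarities are.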
So what does the plot for the morphospace look like for this new culled dataset?
Interestingly, the groups seem to be better separated now—less overlap. Is the same true if we look at the subgroups, which were completely overlapped when using the full set of characters and taxa? Well, there’s still a lot of overlap, but at least there are some areas that seem to be differentiated.
Now, the (kind of) crucial question: is there better separation of occupied morphospace area with this reduced dataset? Basically the story is a little bit better than before; at least there’s some expansion of morphospace to be seen. Even so, the Paleocene and Eocene (red and orange) still overlap completely, and already cover most of the morphospace ever to be explored. The more interesting story may be told only when pre-Cenozoic data are included, which is going to make a whole hell of a lot of extra work.
Well, this is something of a minor success, and since it’s coming up to 10pm I’m going to call it quits for the day and pick up again here tomorrow.






Beau
September 9, 2011 @ 7:45 am
I can’t say I fully understand what’s going on, but it does sound like you’re making some serious strides! There’s also a strong sense of inevitability in the way you’re grinding through the various analyses and slowly improving on your understanding of the problem and where to go next. No doubt these blog posts give a very clean version of what is a much messier intellectual process, but either way, you’re looking good.