Let the Analysis Begin

Spent the first part of the day plotting up the results of yesterday’s first look at the data. Needed to re-learn a whole bunch of stuff I already knew how to do; it’s amazing how quickly you forget things when you don’t use them every day. Yikes. Anyway, eventually I was able to plot up and save (!) some graphs showing how many of the 127 characters are “well-used” by the genera, that is, how badly characters are affected by missing data or inapplicability.
This first plot shows that a little less than half the characters apply to 80% or more of the genera in the dataset. The remaining characters have valid character states in anywhere from just a few percent to 80% of the genera, somewhat evenly distributed. What this suggests, I think, is that there’s an interesting exercise waiting to be done comparing an analysis of only those widely-applicable characters to one using the full set of characters. I’d predict that they give a similar answer, but it’ll be interesting to see if that’s the case.
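For the record, here is a minimal sketch of the tally behind a plot like this, in plain Python rather than whatever I actually used. The symbols "?" and "-" for missing and inapplicable states, the toy matrix, and the function name `character_coverage` are all assumptions made up for illustration. Which symbols count as invalid is a parameter, so the same tally can be repeated with other codes treated as invalid.

```python
def character_coverage(matrix, invalid=("?", "-")):
    """For each character (column), return the fraction of genera (rows)
    whose state is valid, i.e. not one of the `invalid` symbols."""
    n_genera = len(matrix)
    n_chars = len(matrix[0])
    coverage = []
    for j in range(n_chars):
        valid = sum(1 for row in matrix if row[j] not in invalid)
        coverage.append(valid / n_genera)
    return coverage


# Toy data (hypothetical): 5 genera x 4 characters.
toy = [
    ["0", "1", "?", "0"],
    ["1", "1", "-", "v"],
    ["0", "?", "-", "1"],
    ["0", "0", "2", "1"],
    ["1", "1", "?", "0"],
]

cov = character_coverage(toy)

# Number of characters that apply to 80% or more of the genera.
well_used = sum(1 for c in cov if c >= 0.8)

# Same tally, but with "v" also counted as invalid.
cov_v = character_coverage(toy, invalid=("?", "-", "v"))
```

A histogram of `cov` (and of `cov_v`) is then the kind of plot described here.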

The character state “v” stands for “variable” and means that the character can take multiple values within a single genus (e.g. some species in the genus have spines, others don’t). Just to make sure this isn’t a big issue, I made the same plot as above with “v” counted among the invalid character states. The following plot shows that it doesn’t have a big impact:

 

I decided next to move right on to the most important part of the analysis, namely the dimensionality reduction (principal coordinates analysis, or PCO). Looked back over Kevin Boyce’s paper, as well as Lupia (1999) and Gower (1966), on which he based his method. The basic order of steps I need to implement seems to be:

  1. Calculate a dissimilarity matrix, i.e. the distance between each pair of genera given some metric. Lupia and Boyce both use “the sum of all character state mismatches, each scored as one unit distance, divided by the number of possible matches (i.e., all characters minus inapplicable and missing characters).” This seems sensible. How to handle my “v” coding for variable character states is not clear; Boyce included a separate character state for variable, such that the variability itself becomes a character state. This doesn’t make much sense given the way my taxa were coded (in some genera I sampled only one species, in others many). My options are to treat “v” as a missing/inapplicable character, score it as a match, or score it as a 1/2 unit mismatch. Perhaps try those and see what happens. [Oh, and I am treating the characters as unordered, same as Boyce, since character state 0 is no closer to 1 than to 2 or 3 in any of my characters.]
  2. Boyce then says this matrix “was transformed to move the centroid of the dissimilarity distribution to zero (Gower 1966).” This is where I lose the plot, because I’m not sure what this means, nor where in Gower’s paper this step takes place. Neither do I have any clue whether the R command to carry out PCO does this transformation or not. Frankly, I don’t quite understand what moving the centroid of the dissimilarity distribution even means.
  3. Boyce describes the final step to be calculating eigenvalues and eigenvectors of the transformed dissimilarity matrix.
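To make the steps above concrete, here is a rough sketch in plain Python. It is not Boyce’s code and not necessarily what I will end up running. The mismatch distance follows the quoted definition, with the three “v” options from step 1 exposed as a hypothetical `v_mode` argument. For step 2 it uses the double-centering found in standard descriptions of PCO (take -1/2 times the squared distances, then subtract row and column means and add back the grand mean); whether that is exactly the transformation Boyce means is an assumption. The symbols "?" and "-" for missing and inapplicable states, and all function names, are made up for illustration.

```python
def dissimilarity(a, b, invalid=("?", "-"), v_mode="missing"):
    """Mismatches divided by comparable characters, for two genera.

    v_mode controls comparisons that involve a "v" state:
      "missing": skip the character entirely
      "match":   count it as comparable, with no mismatch
      "half":    count it as comparable, with a 1/2 unit mismatch
    """
    if v_mode not in ("missing", "match", "half"):
        raise ValueError("unknown v_mode: %r" % v_mode)
    mismatches = 0.0
    comparable = 0
    for x, y in zip(a, b):
        if x in invalid or y in invalid:
            continue  # missing or inapplicable: not a possible match
        if x == "v" or y == "v":
            if v_mode == "missing":
                continue
            comparable += 1
            if v_mode == "half":
                mismatches += 0.5
            continue
        comparable += 1
        if x != y:  # unordered characters: any difference is one unit
            mismatches += 1.0
    if comparable == 0:
        raise ValueError("no comparable characters for this pair")
    return mismatches / comparable


def distance_matrix(taxa, **kwargs):
    """Symmetric matrix of pairwise dissimilarities, zeros on the diagonal."""
    n = len(taxa)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dissimilarity(taxa[i], taxa[j], **kwargs)
    return d


def gower_center(d):
    """Double-center -1/2 * d^2 so every row and column sums to zero."""
    n = len(d)
    a = [[-0.5 * d[i][j] ** 2 for j in range(n)] for i in range(n)]
    # a is symmetric, so row means and column means are the same list.
    mean = [sum(row) / n for row in a]
    grand = sum(mean) / n
    return [[a[i][j] - mean[i] - mean[j] + grand for j in range(n)]
            for i in range(n)]


# Toy data (hypothetical): 3 genera x 4 characters.
taxa = [
    ["0", "1", "0", "?"],
    ["0", "0", "v", "1"],
    ["1", "1", "0", "1"],
]
d = distance_matrix(taxa)                       # "v" treated as missing
d_half = distance_matrix(taxa, v_mode="half")   # "v" as 1/2 mismatch
b = gower_center(d)
```

Step 3 would then be an eigendecomposition of `b`, for example with `numpy.linalg.eigh`, keeping the axes with the largest positive eigenvalues.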

Got kind of stuck here over the course of the weekend and the start of the next week…

 
