You are viewing a read-only archive of the Blogs.Harvard network. Learn more.

Day of Wor(k)ship

ø

Tried to confirm that I was doing the right thing calculating molecular distances in R, so I downloaded MEGA (which runs on OSX through a cool Wine emulation layer), and ran a distance calculation on the alignments from Ulf Sorhannus with the default settings (which were even more complicated that in the ape package, and I understood even less). I then had to go through an export to Excel to add zeros in the diagonal and upper triangle of the matrix so I could then import it into R as a .csv file, but this allowed me to make a crossplot and compare to the morphological distances:

The correlation looks quite a bit better, and the r-squared is 0.1—so about double that by the ape methods—but it is still pretty lousy. I’m not sure what MEGA is doing differently… But in any case, it’s still not a stellar correlation, and what has to be explained is why the correlation is so poor, not why it’s so good. I tried another one (this time using a distance setting I think has an equivalent in ape, namely JC, the Jukes/Cantor model), and it looked similar, also with an r-squared of 0.1, and a different look of the plot (it’s not so “binned” in the molecular distance axis). Maybe there’s a setting for how many decimal points the dna.dist() function in ape returns, that’s set for too coarse of a setting, causing the r-squared to be lower than it would otherwise be?

This is compared to the same plot produced using ape with the JC69 setting:

Hmm. This is strange, actually—the range of molecular distance values is also much less, they only go up to 0.05 in the ape plot versus all the way to 0.14 in the MEGA plot. Something seems to be fishy. It can’t be that the distance matrix in MEGA is somehow bigger, because if the size of the distance matrix didn’t match the size of the morphological distance matrix, the plot() function would throw an error, which clearly isn’t happening. Perhaps I am not reading the data in correctly in R using the read.dna() function—after all, I don’t really know what the hell alignments or alignment files are supposed to look like.

I noticed that when printing the summary of the dna data, I get nonsensical base compositions that suggest maybe R isn’t reading in the uracil data at all… though this seems so silly as to be really unlikely.

Base composition:
a     c     g     t
0.407 0.322 0.271 0.000

Could this be the problem? Surely not… Well. After a lot of searching (ape seems to be even less well-documented than all the other packages I’ve used!), figured out the (retrospectively) obvious way of interrogating the sequence alignments directly:

(as.character(dna[1:5,1:20]))

…which reveals that the “U” nucleotide is not being read in by the read.dna() function. Wow. That is stunningly awful. I spent a good hour trawling for help, got none, and finally decided to email Maude, who I think might conceivably work on this sort of stuff, for help. I’m giving up here. Know thy limits.

previous:
Nerves!
next:
DNA, RNA… It’s Not All the Same

Comments are closed.