So, I was thinking about PCA today... I think about it a lot, I guess, because I have trouble translating what I know about how it works with very straightforward data (given that we measured these 20 things, combinations of these things form logical components that can be used to help us formulate new models that explain the variation in the data better than just regression-type analysis on the measurements alone). I am reading through a certain physics paper at the moment, and I really think that PCA is a good choice because it is a technique well suited to phenomenological... geez, that's a hard word to try to spell... models.
I am about to use PCA today to look at some data. I have a ton of variables about stuff measured on the watershed-- elevation, slope, LiDAR index, height, etc. What I am looking for is something that explains "primary productivity"-- the annual growth of plants, in Mg/Ha/Yr. When I look at these things versus the data one by one, or if I use stepwise regression to search through combinations of these things, what I end up with is a shit fit, something like an R-squared of 0.10. I mean, it's pathetic. The reason is that individually, none of these things are phenomena that describe the "primary productivity"-- most relate to it in some sense or another, but they are not the "driving forces behind it."
I am going to get to the funny part. I just started to think, well, maybe if I type my logic here (although it's not very good, I know), it can possibly spark something about PCA that might help with a certain physics paper. Maybe not, but it could, I guess? It is worth the effort even if it doesn't, because it clears up my thoughts for my own purposes. I apologize that most of this is probably readily apparent and not helpful at all, but for me, this "baby style PCA" took literally a month or two to learn, and I would rather over-explain than not say enough. I am one of those who must work through things from the most simple, 7th grade-style-math standpoint to even get anywhere. Sad, I know. Every day I wish my brain were faster. Anyhow...
Let's pretend I'm running PCA on my data in R....
>output<-prcomp(data)
>summary(output)
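which gives you the standard deviation, the proportion of variance, and the cumulative proportion of variance for each PC. The loadings (these are the eigenvectors) are in output$rotation, and the eigenvalues are output$sdev^2. There are no P-values for usefulness of parameters anywhere in this; PCA does not test anything. Also, if the variables are in different units, prcomp(data, scale. = TRUE) does the "divide by standard deviation" thing for you before the analysis.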
So, in pretend world, I have run this analysis, now what do I do? First thing first is that I look at the eigenvalues and the variance explained, because I want to see which PC's are good to keep. Generally the accepted criteria that I know of are "eigenvalue > 1" (which only makes sense if the variables were standardized first) or "cumulative explanation of variance > 0.70." I think the second one is the one common in forestry; at least, that is what S told me to use.
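A rough sketch of how those two criteria could be checked in R. The data frame here is simulated (the column values and the two hidden factors are made up for illustration, not real watershed data), and scale. = TRUE is an assumption so that the "eigenvalue > 1" rule applies.

```r
# Simulated stand-in for a watershed table (hypothetical values).
set.seed(1)
n <- 200
terrain <- rnorm(n)   # hidden factor shared by elevation and slope
canopy  <- rnorm(n)   # hidden factor shared by lidar and height
data <- data.frame(
  elevation = 1200 + 300 * terrain + rnorm(n, sd = 60),
  slope     = 15 + 5 * terrain + rnorm(n, sd = 2),
  lidar     = 0.5 + 0.1 * canopy + rnorm(n, sd = 0.03),
  height    = 20 + 6 * canopy + rnorm(n, sd = 2)
)

# scale. = TRUE divides each variable by its standard deviation first.
output <- prcomp(data, center = TRUE, scale. = TRUE)

eigenvalues <- output$sdev^2                    # variance along each PC
prop_var    <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_var     <- cumsum(prop_var)                 # cumulative proportion

keep_kaiser <- which(eigenvalues > 1)               # "eigenvalue > 1" rule
keep_cumvar <- seq_len(which(cum_var >= 0.70)[1])   # PCs needed to reach 0.70

print(round(data.frame(eigenvalue = eigenvalues, prop_var, cum_var), 3))
print(list(eigenvalue_rule = keep_kaiser, cumulative_rule = keep_cumvar))
```

The two rules do not have to agree on real data; when they disagree, which one to follow is a judgment call.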
One thing I really like to do with PCA is to plot my loadings along the principal components. Generally this is kind of annoying with more than 2 PC's, but 3 is okay, too. I will try an example here. Let's say that one of my parameters was "elevation," and the analysis showed me that elevation has a loading of 0.6 on PC1 and 0.2 on PC2. (A loading is the weight that parameter gets in the component, not a fraction of variance explained.) So if PC1 is an axis that is orthogonal to PC2, then this point would be at (0.6, 0.2), just hanging out in the first quadrant. I do this for all parameters. Now I've got this nice visual which shows me where each parameter is on this PC graph. It's especially nice to have each parameter labelled or in colors or something so that you can identify them.
I look at each quadrant individually and see 1) who is in that quadrant, and 2) how many clusters are in that quadrant. Let's say in the first quadrant (PC1 +, PC2 +) I have "elevation", "needle thickness", "bark thickness" and "julian days of active photosynthesis" all in a cluster. I say to myself, what do these things all have in common, phenomeno... shit, that word is hard to spell. What do they tangibly have in common? Well, all of those things describe conifers. Conifers grow at high elevations, have thick needles, thick bark, and long periods of active photosynthesis. So I can say that "primary productivity" is related to "conifers" in some way. I never had "conifers" in my data, but when I am trying to make an ecophysiological model, I will include some sort of term or input reflecting "conifers" (proportion of conifers on the land or something). I use the axes of the PC graph to guide me about the shape of this interaction-- is it increasing or decreasing.
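The loadings plot described above could be sketched like this in R. Everything about the data is hypothetical: the variables are simulated from two made-up hidden factors (f1, f2) so that two clusters show up. One thing to keep in mind is that the sign of each PC is arbitrary, so a cluster may land in the mirror-image quadrant on another run or another machine; what matters is who sits together.

```r
# Simulated data: two groups of parameters driven by hidden factors f1 and f2.
set.seed(2)
n  <- 150
f1 <- rnorm(n)
f2 <- rnorm(n)
noisy <- function(x) x + rnorm(n, sd = 0.4)
data <- data.frame(
  elevation        = noisy(f1 + 0.5 * f2),
  needle_thickness = noisy(f1 + 0.5 * f2),
  bark_thickness   = noisy(f1 + 0.5 * f2),
  slope            = noisy(f1 - 0.8 * f2),
  lidar_index      = noisy(f1 - 0.8 * f2)
)

output   <- prcomp(data, scale. = TRUE)
loadings <- output$rotation[, 1:2]   # one row per parameter; columns PC1, PC2

# Empty plot, then quadrant lines, then one text label per parameter.
plot(loadings, type = "n", xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = "PC1 loading", ylab = "PC2 loading")
abline(h = 0, v = 0, lty = 2)
text(loadings, labels = rownames(loadings))

# List which parameters fall in which quadrant.
quadrant <- paste0("PC1", ifelse(loadings[, 1] >= 0, "+", "-"),
                   " PC2", ifelse(loadings[, 2] >= 0, "+", "-"))
print(split(rownames(loadings), quadrant))
```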
Also, now let us say that we have several clusters in the first quadrant. Let's say there is a distinct cluster for "conifers" and two others that we have reasoned out, "rainfall" and "rocky soils." If the "divide by standard deviation" thing has been used prior to data input, we can measure the distance between the clusters (and there are numerous ways to do this, like closest point to closest point, average cluster point to average cluster point, etc.) and talk about their relationships with one another. Perhaps rainfall is very near to conifers, and rocky soil a little further away. We can infer that rainfall and conifers both impact the overall system similarly, and that rainfall and conifers are more closely related to one another than rocky soil is to either. For my purposes, PCA's strength is that it allows you to see the underlying phenomena that caused the variation in your measured data. That means when you go to write a model, instead of having some crazy crap like you would get from a stepwise regression, such as:
Productivity per acre = aX1 + bX1X3 - cX4 + dX1X4X6 + eX3X7 - fX25 ...
which is pretty annoying to work with, you might have
Productivity per acre = (number of trees)*(amount of rainfall) - (amount of rocky soils)
Both models can technically "explain" the data, but the second model makes sense and tells us something about the real world.
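Going back to the cluster distances for a second, here is a minimal sketch of the two measurements mentioned earlier (average point to average point, and closest point to closest point). The data are simulated, the variable names are hypothetical, and the cluster labels are assigned by hand, the same way you would after reasoning them out from the plot.

```r
# Simulated data: three hand-labelled groups of parameters (hypothetical).
set.seed(3)
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
noisy <- function(x) x + rnorm(n, sd = 0.4)
data <- data.frame(
  elevation        = noisy(f1 + 0.5 * f2),
  needle_thickness = noisy(f1 + 0.5 * f2),
  annual_rain      = noisy(f1 + 0.1 * f2),
  summer_rain      = noisy(f1 + 0.1 * f2),
  rock_cover       = noisy(f1 - 0.8 * f2),
  soil_stoniness   = noisy(f1 - 0.8 * f2)
)

# scale. = TRUE is the "divide by standard deviation" step.
output   <- prcomp(data, scale. = TRUE)
loadings <- output$rotation[, 1:2]

# Cluster membership, in the same order as the columns of data.
cluster <- c("conifers", "conifers", "rainfall", "rainfall", "rocky", "rocky")

# Average cluster point to average cluster point.
centroids     <- rowsum(loadings, cluster) / as.vector(table(cluster))
centroid_dist <- as.matrix(dist(centroids))

# Closest point to closest point.
pairwise <- as.matrix(dist(loadings))
closest  <- function(a, b) min(pairwise[cluster == a, cluster == b])

print(round(centroid_dist, 2))
print(round(c(conifers_rainfall = closest("conifers", "rainfall"),
              conifers_rocky    = closest("conifers", "rocky")), 2))
```

The simulated data were built so that the rainfall cluster sits nearer to the conifer cluster than the rocky one does, which is the situation described in the text. Both distance definitions should tell the same story here, but they can disagree when clusters are spread out.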
I hope that was at least slightly helpful or caused some thinking of a few things. This is how I know to use PCA for model making...