Curve fitting seems like it's almost never the interesting part, but it's important to get right just the same. Which is what makes it frustrating -- you're thinking, come on, I have this function, it has two or three adjustable parameters, all I want to know is what combination of these parameters makes this curve look like my data? But, since it's nonlinear, generally there's no way to guarantee you've got the best answer, so lots of smart people have invested an enormous amount of time figuring out 1000 different ways to get something that's a pretty good approximation of the best answer.
Anyway, in the hopes that this can help someone else who is about to enter the baffling and not-at-all-fun world of nonlinear curve fitting, here's a tutorial I wrote this morning for how you can do this. (Using mostly Matlab code as examples, but also some Python, because I didn't plan out this thing too well before I started, and I'm way too lazy to go back and edit it...heh.)
# This is "pseudocode" (sort of like a skeleton for how to write a program like this, # rather than actual code that you could run). # Lines starting with a pound sign are comments, and are ignored by Python. Matlab has # a similar feature (the % sign is used instead of #), and I'm sure R does too. # # Step 1: import your data # You can do this by hand, or by typing in the proper commands. # In Python (and in Matlab) this is very straightforward, and you just need to specify # where the files are located. # For example, to load a file "data.txt" located in the C:/type/kristin/file/location/here/ # directory, you would type: INFILE = open("C:/type/kristin/file/location/here/data.txt",'r') # Python now can access data.txt through the variable INFILE. # (The option 'r' specifies that you're only interested in reading from the file. You can # use option 'w' to specify that you will be writing data back to the file.) # # Depending on what format data.txt is in, how you will need to process the file will vary. # # You'll have to look up how to do this in R. In Python, if you wanted to just copy the first # line of the file into a variable and store it: kristin_data = INFILE.readline() # The variable kristin_data now is storing the contents of the first line of data.txt. Suppose # that's all the data in the file, and we're going to fit a curve to the data in that line. Now # close the file. INFILE.close() # Python is very flexible, but you have to tell it how to process the data by hand. There are # various ways to do this; the best one will depend on the nature of your data files. # # However, Matlab was designed with tasks like this in mind, and it is very simple to import # data from flat files. ('Flat file' is a generic term for any text file containing delimited data # -- for example, comma-delimited (csv, which stands for 'comma separated values'), tab-delimited, # space-delimited, etc.) You just have to specify what the delimiter is. You would type: kristin_data = importdata('C:/type/kristin/file/location/here/data.txt',',') # This specifies to Matlab that you want to import data.txt, and the delimiter is a comma. If it # was instead tab-delimited, you would type: kristin_data = importdata('C:/type/kristin/file/location/here/data.txt','\t') # The data is now stored in kristin_data, ready to be analyzed. # # Note -- \t is the 'escape character' for tab -- escape characters show up a # lot in programming, presumably because the compiler wouldn't know what to do # with the word "tab." Another commonly used escape character is \n, which # is "newline," meaning start a new line. (i.e., the equivalent of enter.) # # Now you're ready to actually work with the data! (All code will be Matlab code.) # # The first thing I usually do when trying to analyze more than one data set is try and # make my life easier by figuring out what they have in common. For example, if your # data sets have wildly differing amplitudes, normalizing them might be a good idea. # (In the sense of dividing by the sum, not converting to a normal distribution.) # # Say kristin_data has all your x-axis values in the first row, and all the y-axis values # on the second row. Let's move these into separate variables, for clarity: x = kristin_data(1,:); y = kristin_data(2,:); # In Matlab, the : means "take everything here." That is, x = kristin_data(1,:) means take # data from all columns of row 1, and place them into the variable x. 
#
# (If you wanted to only take part of row 1, you can use : to specify a range -- for
# example, kristin_data(1,3:7) would mean select elements 3 through 7 on row 1. This
# syntax is enormously useful, and once you learn it, it will annoy you to no end when
# you find yourself working in a language that doesn't have a similar feature.)
#
# For a SINGLE data set, here's how you would do a linear fit in Matlab:

[linear_fit_parameters, errors] = polyfit(x, y, 1);
R_squared = 1 - errors.normr^2 / norm(y - mean(y))^2;

# First, notice that an initial guess is not necessary; this is a nice feature of linear
# fitting. You can use the built-in polyfit() function for any polynomial fitting,
# too, not just linear -- for example, a quadratic fit would be polyfit(x, y, 2).
#
# Here, the square brackets indicate that the function polyfit() will return more than
# one value, so we need to give it two variables to get all the information we need.
# The first variable, linear_fit_parameters, is a vector containing the two fitting
# parameters for a linear fit (the slope and intercept).
# The second variable, errors, is a complicated-looking data structure containing details
# of how Matlab did the fit. (It uses a matrix factorization called the QR decomposition
# to do this kind of fit.) The second line of code uses this information to calculate an
# R-squared value.
#
# (If you're ever confused about a particular built-in function, you can just type
# help polyfit (or other function) at the command line, and Matlab's extensive built-in
# documentation usually has all the info you'll need.)
#
# For nonlinear fitting, it's slightly more complicated. Say you also want to check and
# see if your data fits to an exponential function, a log-normal function, and a power
# law. These need to be stored in SEPARATE files (called m-files), which just contain
# the mathematical form of the function. If you actually use Matlab to do this, I'm happy
# to show you exactly how to do this; for the purposes of this explanation, I'll just
# assume that they exist (though there's a sketch of what they might look like just
# below). The syntax to access functions in separate m-files is @function_name.
# The @ symbol indicates that you want to access a function stored in a separate file,
# and function_name should be replaced by whatever the m-file name is.
#
# If you want to enable robust fitting (to minimize the impact of outliers), you need to
# specify that using the statset() function. We'll create a variable called options,
# containing this information.

options = statset('Robust','on');

# By default, robust fits use Tukey's bisquare to reweight a least-squares fit. You can
# specify if you want a different weighting function; Matlab has a bunch of them built-in.
#
# The nonlinear fitting function is nlinfit(). Unlike a linear fit, you need to tell it
# your initial guesses for the parameters, and these are important. Define a vector p0
# which contains your initial guesses for the parameters (here, I'll assume you have 2
# parameters, but of course you might actually have more):

p0 = [1, 2];

# The command to do a nonlinear fit to an exponential function stored in
# my_exponential_function.m is:

[P, r, J, SIGMA, mse] = nlinfit(x, y, @my_exponential_function, p0, options);

# P now contains the best-fit parameter estimates for this fit. The other variables
# contain various estimates of how good the fit was. Matlab has a built-in function
# nlparci() for calculating confidence intervals for fits done using nlinfit(). To take
# advantage of this, you would use:

CI = nlparci(P, r, 'covar', SIGMA);

# CI now contains confidence intervals for your fit.
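#
# (We've been assuming these m-files exist, so here's a rough sketch of what they might
# contain -- one function per file, with the file named after the function. The
# functional forms below are just my assumptions (a two-parameter exponential, lognormal,
# and power law); swap in whatever forms your problem actually calls for. The one rule is
# that nlinfit() calls the function with the parameter vector first and the x values
# second.)
#
# Contents of my_exponential_function.m:

function y = my_exponential_function(p, x)
    % p(1) is the amplitude, p(2) is the rate constant
    y = p(1) .* exp(p(2) .* x);
end

# Contents of my_lognormal_function.m:

function y = my_lognormal_function(p, x)
    % p(1) is mu and p(2) is sigma of a lognormal shape
    y = exp(-(log(x) - p(1)).^2 ./ (2*p(2)^2)) ./ (x .* p(2) .* sqrt(2*pi));
end

# Contents of my_power_law_function.m:

function y = my_power_law_function(p, x)
    % p(1) is the prefactor, p(2) is the exponent
    y = p(1) .* x.^p(2);
end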
#
# You can run identical commands for your lognormal and power law fits (or whatever else
# you want to try). You just need to specify the file containing your fitting function,
# and of course initial parameter guesses.
#
# Ok, so suppose (as is likely) that you have a rough idea of what the parameters should
# be, but not such a good idea that you're confident you can get a good fit on the first
# try. You could automate the guessing procedure as follows:
#
# p0 is the initial guess that you make:

p0 = [1,2];

# The function rand(1,2) will create a 1-by-2 matrix (that is, a vector with the same
# dimensions as p0) containing random decimals between 0 and 1. If you subtract 0.5 from
# both elements, then you'll have random decimals between -0.5 and 0.5. If you then add
# this to p0, you've got a nice, automatic way of randomly adjusting p0 up and down from
# your initial guess:

p0 = p0 + rand(1,2) - 0.5;

# You can use a for loop to do this multiple times. p0 is a vector, but suppose we lack
# confidence in our guessing capabilities, so we want to try 1000 different guesses,
# adjusted using the random number generator, as discussed above. To store all these
# initial guesses, let's keep the original guess in its own variable, and turn p0 into a
# 1000-by-2 matrix, so that our randomized guesses are stored on each row:

initial_guess = [1,2];
p0 = zeros(1000,2);
for i = 1:1000
    p0(i,:) = initial_guess + rand(1,2) - 0.5;
end

# What did we just do there? initial_guess = [1,2] stores your original guess in its own
# variable (this way, every random adjustment is centered on [1,2], rather than on
# whatever row of p0 we perturbed last), and p0 = zeros(1000,2) just creates a 1000-by-2
# matrix of all zeros, and assigns it to the variable p0.
#
# The next three lines are a "for loop," meaning that they execute the command on the
# middle line a specified number of times automatically. for i = 1:1000 means to do
# whatever is inside the loop 1000 times. The variable i is the "loop index," and it
# keeps track of how many times the program has executed the loop so far.
#
# The middle line does the same thing that we discussed above (adjusts the guesses using
# random numbers). The big difference here is that it does it 1000 times, and stores the
# results automatically on the 1000 rows of p0! Notice that we're actually using the
# loop index i to specify which row of p0 we want to assign guesses to. This is a very
# handy trick which you'll come back to again and again when you make use of for loops.
#
# Now you can use these 1000 guesses to do 1000 separate fits of all of your fitting
# functions. We'll ignore the linear case, since the guesses don't matter there.
#
# To do an exponential fit:

parameters = zeros(1000,2);
confidence = zeros(1000,2);
for i = 1:1000
    [P, r, J, SIGMA, mse] = nlinfit(x, y, @my_exponential_function, p0(i,:), options);
    CI = nlparci(P, r, 'covar', SIGMA);
    parameters(i,:) = P;
    confidence(i,1) = abs(P(1) - abs(CI(1,1)));
    confidence(i,2) = abs(P(2) - abs(CI(2,1)));
    clear P r J SIGMA mse CI;
end

# Notice how the only difference in the call to nlinfit() is that we're now specifying
# p0(i,:) as the rows of our initial guess matrix! Also notice how we've repeatedly used
# the trick mentioned above, where we can use the loop index to specify positions in our
# matrices.
#
# Now you've got a 1000-by-2 matrix of parameter estimates, a 1000-by-2 matrix of
# confidence intervals for each parameter, and a 1000-by-2 matrix of your initial guesses.
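#
# (One caveat worth knowing about: with 1000 randomized guesses, a few of them may be bad
# enough that nlinfit() fails outright. A simple way to keep one bad guess from killing
# the whole loop is to wrap the fit in a try/catch block. This is just a sketch of the
# idea -- here I mark failed fits with Inf confidence intervals so they can never win the
# "smallest confidence interval" contest coming up below:)

for i = 1:1000
    try
        [P, r, J, SIGMA, mse] = nlinfit(x, y, @my_exponential_function, p0(i,:), options);
        CI = nlparci(P, r, 'covar', SIGMA);
        parameters(i,:) = P;
        confidence(i,1) = abs(P(1) - abs(CI(1,1)));
        confidence(i,2) = abs(P(2) - abs(CI(2,1)));
    catch
        % This guess failed -- flag it so it can't be selected as the best fit.
        confidence(i,:) = [Inf, Inf];
    end
end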
#
# Since these are all indexed identically, you can go back and pick which fit you think
# is the best using a method of your choosing! For example, suppose you just want the
# fit with the smallest total confidence intervals (I'm sure you have a better way of
# doing this, but this is just for example):

[smallest_CI, smallest_CI_position] = min(sum(confidence')');

# The sum() function sums down columns of a matrix; therefore, we need to transpose the
# confidence matrix. In Matlab, transpose is done with an apostrophe: confidence' is
# the transpose of confidence. So sum(confidence') sums across rows, which is what
# we want, and then we need to transpose it back, so another apostrophe:
# sum(confidence')'. We've just summed together both elements of all rows of the
# confidence matrix, so we now have a 1000-by-1 matrix (a vector), and we want to know
# what the smallest number is in this vector. The min() function returns two pieces of
# information: first, the value of the smallest number in the vector, and second, the
# position in the vector where that number is located. So
# [smallest_CI, smallest_CI_position] will store those pieces of information.
#
# Armed with this information, if you're really in a hurry, you can just say:

final_parameters_exponential = parameters(smallest_CI_position,:);
final_CI_exponential = confidence(smallest_CI_position,:);

# This takes the parameters and confidence interval which we selected according to the
# rules above, and places them in two vectors (each containing only two entries).
# Alternately, you might select, say, the best 10 fits, then actually open those up in
# R, and eyeball the fits to see if they're reasonable.
# You can more or less copy-and-paste this code to do fits to other functional forms
# of interest. In this way, you can test many nonlinear fits in a very short time!
#
# Realistically, you'll probably have more than just a single data set that you need to
# do these fits to. You can use another for loop to repeat this entire procedure
# for multiple data sets. Say you've got 500 files, all in the same directory,
# named as data1.txt, data2.txt, data3.txt, etc.

filepath = 'C:/type/kristin/file/location/here/';
filename = cell(500,1);
kristin_data = cell(500,1);
final_parameters_exponential = cell(500,1);
final_CI_exponential = cell(500,1);
options = statset('Robust','on');
initial_guess = [1,2];

for j = 1:500
    name = strcat(strcat('data', int2str(j)), '.txt');
    filename{j} = strcat(filepath, name);
    kristin_data{j} = importdata(filename{j}, ',');
    x = kristin_data{j}(1,:);
    y = kristin_data{j}(2,:);
    p0 = zeros(1000,2);
    parameters = zeros(1000,2);
    confidence = zeros(1000,2);
    for i = 1:1000
        p0(i,:) = initial_guess + rand(1,2) - 0.5;
        [P, r, J, SIGMA, mse] = nlinfit(x, y, @my_exponential_function, p0(i,:), options);
        CI = nlparci(P, r, 'covar', SIGMA);
        parameters(i,:) = P;
        confidence(i,1) = abs(P(1) - abs(CI(1,1)));
        confidence(i,2) = abs(P(2) - abs(CI(2,1)));
        clear P r J SIGMA mse CI;
    end
    [smallest_CI, smallest_CI_position] = min(sum(confidence')');
    final_parameters_exponential{j} = parameters(smallest_CI_position,:);
    final_CI_exponential{j} = confidence(smallest_CI_position,:);
end

# Ok, there's kind of a lot to unpack there -- in fact, with the exception of the m-file
# (which will be in a separate file, my_exponential_function.m), that's a complete
# Matlab program to do your fits for you!
#
# First, cell arrays. cell(500,1) creates a data structure called a 'cell array' that is
# possibly unique to Matlab, and is enormously useful. Think of it as a list of matrices
# that you can store anything in that you want.
# In other words, if you have 500 data sets, you can store them as 500 'cells' within a
# single cell array variable. This is extremely useful for indexing purposes, as you can
# imagine! The cells within a cell array are accessed using curly braces {}. So for
# example, to access matrix element (1,2) within the second cell of the filename cell
# array, you would type: filename{2}(1,2).
#
# So right up front, we create four cell arrays, all capable of holding 500 matrices.
# filename will just be our way of storing the names of the data files; it's a lot easier
# doing it this way than clicking on 500 files separately. kristin_data is the cell
# array that will hold all your data sets, once we've imported them.
# final_parameters_exponential and final_CI_exponential are cell arrays to hold the
# matrices containing our best fit parameters and confidence intervals, respectively.
#
# Now, notice that we have two for loops, one inside the other. This is called a 'nested
# loop,' and if you need to manipulate elements of a matrix (for example) one by one,
# you'll find yourself using nested loops a lot! The loop indices of course are different
# (we're using j for the 'outer' loop and i for the 'inner' loop, but they can be any
# variable).
#
# Ok, now we're just doing that same code that we wrote before (for a single data set)
# over and over again. We're making use of the outer index j to select the cell that
# we're interested in. The following two lines:
#
#     name = strcat(strcat('data', int2str(j)), '.txt');
#     filename{j} = strcat(filepath, name);
#
# are sort of cryptic looking, but all they're doing is setting up the file names
# automatically. strcat() is "string concatenation," meaning that it takes two words and
# concatenates them. So strcat('why', 'hello') would produce the output 'whyhello'.
#
# In this case, the path of the directory where all the data files are stored is already
# stored in the filepath variable; we just need to generate the data#.txt part of the
# name. So the way we do this is: strcat('data', int2str(j)) takes the word 'data' and
# attaches the number j on the end of it (int2str() converts an integer to the string
# 'version' of the integer -- this is done simply because you can't concatenate different
# data types together). So say j = 1 (remember j is the loop index), then
# strcat('data',int2str(j)) will produce:
#
#     data1
#
# but we still need the .txt extension, so we then concatenate '.txt' to it with another
# strcat(): strcat(strcat('data', int2str(j)), '.txt'). This produces the output:
#
#     data1.txt
#
# which we assign to the variable name. We can then strcat() once more with filepath:
# strcat(filepath, name). This gives us the full name we need:
#
#     C:/type/kristin/file/location/here/data1.txt
#
# ...which seems like a lot of pointless work, but it's really nice because it creates
# all the filenames data1.txt, data2.txt, ..., data500.txt automatically. The full file
# name is then stored in cell 1 of the filename cell array, filename{1}.
#
# Everything else is pretty straightforward -- we're just doing the same thing as before,
# but storing everything as cells within cell arrays, so that the program just runs
# while we sit around and play video games, and then after a bit we'll have fitting
# results for all 500 data sets tested against 1000 possible initial parameter values.
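#
# (One last sketch: I mentioned eyeballing fits to see if they're reasonable, and you can
# do that right in Matlab too. Assuming the cell arrays from the program above and the
# exponential m-file sketched earlier, something like this would overlay the best fit on
# the raw data for one data set:)

j = 1;                                   % which data set to look at
x = kristin_data{j}(1,:);
y = kristin_data{j}(2,:);
best_p = final_parameters_exponential{j};

plot(x, y, 'o');                                   % raw data as circles
hold on;
plot(x, my_exponential_function(best_p, x), '-');  % best-fit curve as a line
hold off;
xlabel('x');
ylabel('y');
title('Data set 1 with best exponential fit');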