Fitting Experimental Data with Mathematica
by Ian Brooks
 | Download
this example as a Mathematica notebook. |
Introduction
Analysis of biomedical data usually involves fitting the measured data to a model in order
to make some predictions about the system under investigation. Choosing the right
approach to experimental curve fitting is important if you are to get good results.
Sometimes it may be sufficient to use an interpolation method, but often it will be
necessary to use nonlinear fitting and all of its associated diagnostic statistics.
For example, kinetic data may be fit to a Michaelis-Menten model in order to gain
insight into the efficiency of an enzyme, or data from a baseline control experiment
may be fit to a quadratic in order to predict the correction necessary for the full
experiment. While these two examples may seem very similar, they are in fact very
different, and it is important to understand how the differences affect the analysis.
These differences lead to different procedures for performing the fit and to differences
in the degree of care that must be taken to ensure good results. Another difference is
that in the case of the Michaelis-Menten example each of the parameters being determined
has a definite biological significance, while there is no biological significance to the
parameters in the quadratic case. Clearly it is critical to check the determined
parameters for physical reality, good confidence intervals, and so on if they are
expected to have biological significance, but it is unimportant if you just want an
approximate curve. The biggest difference though is that the Michaelis-Menten
example is a nonlinear model, while the quadratic example is a mathematically
linear model.
Linear or Nonlinear?
If you have to do a full curve fit, the critical question becomes whether a model is
linear or nonlinear. Unfortunately, this is not always immediately obvious (although,
as a general rule of thumb, biology is nonlinear). A nonlinear model can be defined as
one in which at least one of the partial derivatives of the model with respect to its
parameters includes at least one of the parameters. Use of the symbolic capabilities
of Mathematica allows the determination of linearity to be done quickly without
having to remember any calculus. (See Figure 1.)
Given some data and a linear model, the fitting algorithm solves certain linear equations
to determine the best-fit parameters and their confidence intervals. With a nonlinear model,
however, it is not possible to solve the equations directly and so the fitting algorithm has
to perform a search. With any search it is possible to get different results depending on
where you start and when you decide to stop. Nonlinear fitting is a little like being placed
in the middle of a mountain range and trying to find the lowest point. The point you think
is the lowest is dependent on where you are when you start and how long you are
prepared to look. Unless you search the entire area, no matter where you stop,
you cannot be certain that you have found the lowest point in the range or just the
lowest point that you have seen. In fitting terms, you can never be sure that you have
found the absolute best fit, just that you have found a good fit; worse, you have to make
sure that you even have a good fit. Although this is true with any software,
Mathematica's advanced algorithms and extended precision capability enable it to
perform more reliably on curve-fitting benchmark tests than any other program tested
so far (McCullough).
Because of these fundamental differences between linear and nonlinear fitting,
starting guesses and convergence checking are important for nonlinear models
but are irrelevant to linear models. The fitting algorithm in Mathematica
is powerful enough to solve many problems without needing good starting guesses, but, even
so, there are inevitably some problems where they are necessary. Mathematica's
algorithm also eliminates the need to consider effects such as parameter scaling by handling
a number of these issues invisibly.
Whether you need to perform a complex nonlinear fit or can make do with a simple
interpolation, Mathematica provides the functionality and reliability to tackle
any fitting task. This flow chart shows the four main approaches and the
Mathematica functions that are most appropriate.

Interpolation
If you are simply trying to obtain a standard curve for comparison or for baseline
subtraction, and you have plenty of data to describe the line or curve, it is simplest
to use Mathematica's Interpolation function. Interpolation
constructs a curve that gives a good approximation to the data. As a simple example,
consider trying to find an approximation to y = Sqrt[x].
data = {{0.0, 0.0}, {1.0, 1.0}, {4.0, 2.0}, {9.0, 3.0}, {16.0, 4.0}, {25.0, 5.0},
{36.0, 6.0}, {49.0, 7.0}, {64.0, 8.0}, {81.0, 9.0}, {100.0, 10.0}};
sq = Interpolation[data]
InterpolatingFunction[{{0., 100.}}, <>]
The returned InterpolatingFunction can be used to calculate the expected values of any
input x, but will be most accurate within the specified range, in this case between 0 and 100.
sq[50.0]
7.07127
Compare this result with the expected value of Sqrt[50].
Sqrt[50.]
7.07107
InterpolatingFunction will give the exact value of y at the x values used to
construct the approximation.
sq[4.0]
2.
Linear Fitting
Linear fitting is performed in Mathematica with the Fit and Regress
commands. If Interpolation is not appropriate for some reason, but you are not trying to
determine biologically significant parameters, then the Fit command (which simply
returns the best-fit function) should be sufficient.
As an example, consider fitting the square root data to a quadratic.
Fit[data, {1, x, x^2}, x]
1.01599 + 0.166593x - 0.000801884x^2
If you need diagnostic information to determine the quality of the fit or confidence
intervals of the parameters, then it will be necessary to use the function Regress
available
in the standard add-on package Statistics`LinearRegression`.
<< Statistics`LinearRegression
Regress[data, {1, x, x^2}, x]

This output is the default. In short, TStat should be high, PValue
should be below 0.05,
RSquared should be close to 1, and the lower EstimatedVariance,
the better. Many
more diagnostics can be generated, and a more complete example is shown in
Figure 2.
A complete list of the diagnostics is as follows:
RegressionReportValues[Regress]
{AdjustedRSquared, ANOVATable, BestFit, BestFitParameters, BestFitParametersDelta,
CatcherMatrix, CoefficientOfVariation, CookD, CorrelationMatrix, CovarianceMatrix,
CovarianceMatrixDetRatio, DurbinWatsonD, EigenstructureTable, EstimatedVariance,
FitResiduals, HatDiagonal, JackknifedVariance, MeanPredictionCITable, ParameterCITable,
ParameterConfidenceRegion, ParameterTable, PartialSumOfSquares, PredictedResponse,
PredictedResponseDelta, RSquared, SequentialSumOfSquares, SinglePredictionCITable,
StandardizedResiduals, StudentizedResiduals, SummaryReport, VarianceInflation}
Nonlinear Fitting
Good nonlinear curve fitting is an art form. Even with the best algorithms, it is often
necessary to provide accurate starting guesses for parameters, and it is always necessary
to check the results thoroughly. Fortunately, it is easy to provide starting guesses in
Mathematica, and a very extensive list of fit diagnostics may be generated.
<< Statistics`NonlinearFit`
data = {{1.0, 1.0, 0.126}, {2.0, 1.0, 0.219}, {1.0, 2.0, 0.076}, {2.0, 2.0, 0.126},
{0.1, 0.0, 0.186}};
This data set (Meyer and Roth, 1972) gives five values for x1, x2, and y,
describing a reaction
involving the catalytic dehydration of n-hexyl alcohol. Here x1 is partial pressure of alcohol,
x2 is olefin, and y is the rate of reaction.
model = (t1 t3 x1)/(1 + t1 x1 + t2 x2);
NonlinearRegress[data, model, {x1, x2}, {t1, t2, t3}]
In short, the lower EstimatedVariance is, the better the fit, and the
smaller the range of the confidence intervals (CI), the better. Of particular interest
is FitCurvatureTable, which may be used to gauge the reliability of the linear
approximation confidence intervals. If the linear approximations are reasonable,
the maximum intrinsic curvature will be small compared to the confidence region.
As in the linear case, the default behavior shows only a few of the many diagnostic
statistics that may be generated. A more complete example is shown in
Figure 3.
RegressionReportValues[NonlinearRegress]
{ANOVATable, AsymptoticCorrelationMatrix, AsymptoticCovarianceMatrix,
BestFit, BestFitParameters, EstimatedVariance, FitCurvatureTable, FitResiduals, HatDiagonal,
MeanPredictionCITable, ParameterBias, ParameterConfidenceRegion, ParameterCITable,
ParameterTable, PredictedResponse, SinglePredictionCITable, StartingParameters,
StandardizedResiduals, SummaryReport}
Summary
Getting good results from curve fitting is not as simple as having good data and the right model.
It is also important to choose the right algorithm. When performing nonlinear fitting,
it is critical to take notice of all of the information that is provided and check that the
fit is really a good one that makes physical sense. Remember that the algorithm being
used knows nothing about biology, and it is up to you to accept or reject its results based
on your knowledge.
Further examples of data analysis with Mathematica may be found at
http://www.wolfram.com/solutions.
References
McCullough, B. D. "The Accuracy of Mathematica 4 as a Statistical
Package," Journal of Computational Statistics (2000). (accepted)
(Ian Brooks is an applications developer at Wolfram Research.)
| |