## Fitting Experimental Data with Mathematica

by Ian Brooks

### Introduction

Analysis of biomedical data usually involves fitting the measured data to a model in order to make some predictions about the system under investigation. Choosing the right approach to experimental curve fitting is important if you are to get good results. Sometimes it may be sufficient to use an interpolation method, but often it will be necessary to use nonlinear fitting and all of its associated diagnostic statistics. For example, kinetic data may be fit to a Michaelis-Menten model in order to gain insight into the efficiency of an enzyme, or data from a baseline control experiment may be fit to a quadratic in order to predict the correction necessary for the full experiment. While these two examples may seem very similar, they are in fact very different, and it is important to understand how the differences affect the analysis.

These differences lead to different procedures for performing the fit and to differences in the degree of care that must be taken to ensure good results. Another difference is that in the case of the Michaelis-Menten example each of the parameters being determined has a definite biological significance, while there is no biological significance to the parameters in the quadratic case. Clearly it is critical to check the determined parameters for physical reality, good confidence intervals, and so on if they are expected to have biological significance, but it is unimportant if you just want an approximate curve. The biggest difference though is that the Michaelis-Menten example is a nonlinear model, while the quadratic example is a mathematically linear model.

### Linear or Nonlinear?

If you have to do a full curve fit, the critical question becomes whether a model is linear or nonlinear. Unfortunately, this is not always immediately obvious (although, as a general rule of thumb, biology is nonlinear). A nonlinear model can be defined as one in which at least one of the partial derivatives of the model with respect to its parameters includes at least one of the parameters. Use of the symbolic capabilities of Mathematica allows the determination of linearity to be done quickly without having to remember any calculus. (See Figure 1.)
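The check illustrated in Figure 1 takes only a few lines. The sketch below applies the definition to a Michaelis-Menten model and to a quadratic; the parameter names (Vmax and Km for Michaelis-Menten; a, b, and c for the quadratic) are chosen here for illustration.

```mathematica
(* Michaelis-Menten: each partial derivative with respect to a
   parameter still contains a parameter, so the model is nonlinear *)
mm = Vmax s/(Km + s);
D[mm, Vmax]   (* s/(Km + s) -- contains Km *)
D[mm, Km]     (* -Vmax s/(Km + s)^2 -- contains Vmax and Km *)

(* Quadratic: the partial derivatives are 1, x, and x^2, which are
   free of the parameters, so the model is linear in a, b, and c *)
quad = a + b x + c x^2;
D[quad, #] & /@ {a, b, c}
```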

Given some data and a linear model, the fitting algorithm solves certain linear equations to determine the best-fit parameters and their confidence intervals. With a nonlinear model, however, it is not possible to solve the equations directly and so the fitting algorithm has to perform a search. With any search it is possible to get different results depending on where you start and when you decide to stop. Nonlinear fitting is a little like being placed in the middle of a mountain range and trying to find the lowest point. The point you think is the lowest is dependent on where you are when you start and how long you are prepared to look. Unless you search the entire area, no matter where you stop, you cannot be certain that you have found the lowest point in the range or just the lowest point that you have seen. In fitting terms, you can never be sure that you have found the absolute best fit, just that you have found a good fit; worse, you have to make sure that you even have a good fit. Although this is true with any software, Mathematica's advanced algorithms and extended precision capability enable it to perform more reliably on curve-fitting benchmark tests than any other program tested so far (McCullough).

Because of these fundamental differences between linear and nonlinear fitting, starting guesses and convergence checking are important for nonlinear models but are irrelevant to linear models. The fitting algorithm in Mathematica is powerful enough to solve many problems without needing good starting guesses, but, even so, there are inevitably some problems where they are necessary. Mathematica's algorithm also eliminates the need to consider effects such as parameter scaling by handling a number of these issues invisibly.

Whether you need to perform a complex nonlinear fit or can make do with a simple interpolation, Mathematica provides the functionality and reliability to tackle any fitting task. This flow chart shows the four main approaches and the Mathematica functions that are most appropriate.

### Interpolation

If you are simply trying to obtain a standard curve for comparison or for baseline subtraction, and you have plenty of data to describe the line or curve, it is simplest to use Mathematica's Interpolation function. Interpolation constructs a curve that gives a good approximation to the data. As a simple example, consider trying to find an approximation to y = Sqrt[x].

data = {{0.0, 0.0}, {1.0, 1.0}, {4.0, 2.0}, {9.0, 3.0}, {16.0, 4.0}, {25.0, 5.0}, {36.0, 6.0}, {49.0, 7.0}, {64.0, 8.0}, {81.0, 9.0}, {100.0, 10.0}};

sq = Interpolation[data]

InterpolatingFunction[{{0., 100.}}, <>]

The returned InterpolatingFunction can be used to estimate y for any value of x, but it will be most accurate within the range of the data, in this case between 0 and 100.
sq[50.0]

7.07127

Compare this result with the expected value of Sqrt[50].
Sqrt[50.]

7.07107

The InterpolatingFunction reproduces the exact value of y at each of the x values used to construct the approximation.
sq[4.0]

2.
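By default, Interpolation fits cubic polynomials through consecutive points; the InterpolationOrder option controls this behavior. As a sketch, using the same data, a piecewise-linear approximation can be built like this:

```mathematica
(* InterpolationOrder -> 1 connects consecutive data points
   with straight-line segments instead of cubics *)
sqLinear = Interpolation[data, InterpolationOrder -> 1];
sqLinear[50.0]  (* roughly 7.067, from the chord between {49., 7.} and {64., 8.} *)
```

A lower order is less smooth but cannot overshoot between points, which is sometimes preferable for noisy baseline data.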

### Linear Fitting

Linear fitting is performed in Mathematica with the Fit and Regress commands. If Interpolation is not appropriate for some reason, but you are not trying to determine biologically significant parameters, then the Fit command (which simply returns the best-fit function) should be sufficient.

As an example, consider fitting the square root data to a quadratic.

Fit[data, {1, x, x^2}, x]

1.01599 + 0.166593x - 0.000801884x^2

If you need diagnostic information to determine the quality of the fit or confidence intervals of the parameters, then it will be necessary to use the function Regress available in the standard add-on package Statistics`LinearRegression`.
<< Statistics`LinearRegression`

Regress[data, {1, x, x^2}, x]

By default, Regress returns a summary report. In short, TStat should be high, PValue should be below 0.05, RSquared should be close to 1, and the lower EstimatedVariance, the better. Many more diagnostics can be generated, and a more complete example is shown in Figure 2.

A complete list of the diagnostics is as follows:

RegressionReportValues[Regress]

{AdjustedRSquared, ANOVATable, BestFit, BestFitParameters, BestFitParametersDelta, CatcherMatrix, CoefficientOfVariation, CookD, CorrelationMatrix, CovarianceMatrix, CovarianceMatrixDetRatio, DurbinWatsonD, EigenstructureTable, EstimatedVariance, FitResiduals, HatDiagonal, JackknifedVariance, MeanPredictionCITable, ParameterCITable, ParameterConfidenceRegion, ParameterTable, PartialSumOfSquares, PredictedResponse, PredictedResponseDelta, RSquared, SequentialSumOfSquares, SinglePredictionCITable, StandardizedResiduals, StudentizedResiduals, SummaryReport, VarianceInflation}
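Any subset of the diagnostics above can be requested with the RegressionReport option. For example, to obtain just the parameter confidence intervals and RSquared for the quadratic fit:

```mathematica
Regress[data, {1, x, x^2}, x,
  RegressionReport -> {ParameterCITable, RSquared}]
```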

### Nonlinear Fitting

Good nonlinear curve fitting is an art form. Even with the best algorithms, it is often necessary to provide accurate starting guesses for parameters, and it is always necessary to check the results thoroughly. Fortunately, it is easy to provide starting guesses in Mathematica, and a very extensive list of fit diagnostics may be generated.

<< Statistics`NonlinearFit`

data = {{1.0, 1.0, 0.126}, {2.0, 1.0, 0.219}, {1.0, 2.0, 0.076}, {2.0, 2.0, 0.126}, {0.1, 0.0, 0.186}};

This data set (Meyer and Roth, 1972) gives five values of x1, x2, and y, describing a reaction involving the catalytic dehydration of n-hexyl alcohol. Here x1 is the partial pressure of the alcohol, x2 is the partial pressure of the olefin, and y is the rate of reaction.
model = (t1 t3 x1)/(1 + t1 x1 + t2 x2);

NonlinearRegress[data, model, {x1, x2}, {t1, t2, t3}]

In short, the lower EstimatedVariance is, the better the fit, and the smaller the range of the confidence intervals (CI), the better. Of particular interest is FitCurvatureTable, which may be used to gauge the reliability of the linear approximation confidence intervals. If the linear approximations are reasonable, the maximum intrinsic curvature will be small compared to the confidence region. As in the linear case, the default behavior shows only a few of the many diagnostic statistics that may be generated. A more complete example is shown in Figure 3.
RegressionReportValues[NonlinearRegress]

{ANOVATable, AsymptoticCorrelationMatrix, AsymptoticCovarianceMatrix, BestFit, BestFitParameters, EstimatedVariance, FitCurvatureTable, FitResiduals, HatDiagonal, MeanPredictionCITable, ParameterBias, ParameterConfidenceRegion, ParameterCITable, ParameterTable, PredictedResponse, SinglePredictionCITable, StartingParameters, StandardizedResiduals, SummaryReport}
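When the default search fails to converge, starting guesses are supplied by pairing each parameter with an initial value. As a sketch (the starting values here are purely illustrative, and the RegressionReport option narrows the output as in the linear case):

```mathematica
(* Each {parameter, value} pair tells the search where to begin *)
NonlinearRegress[data, model, {x1, x2},
  {{t1, 3}, {t2, 15}, {t3, 0.1}},
  RegressionReport -> {BestFitParameters, ParameterCITable}]
```

If a fit looks suspect, rerunning it from several different starting points and checking that the search returns the same parameters is a simple guard against the local-minimum problem described above.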

### Summary

Getting good results from curve fitting is not as simple as having good data and the right model. It is also important to choose the right algorithm. When performing nonlinear fitting, it is critical to take notice of all of the information that is provided and check that the fit is really a good one that makes physical sense. Remember that the algorithm being used knows nothing about biology, and it is up to you to accept or reject its results based on your knowledge.

Further examples of data analysis with Mathematica may be found at http://www.wolfram.com/solutions.

### References

McCullough, B. D. "The Accuracy of Mathematica 4 as a Statistical Package," Journal of Computational Statistics (2000). (accepted)

(Ian Brooks is an applications developer at Wolfram Research.)