Wolfram Library Archive


Trustworthy Data Models

Mark Kotanchek
Organization: Evolved Analytics, LLC

2007 Wolfram Technology Conference
Champaign, IL


Data modeling is intrinsically hard: What is the relative importance of the potential input variables? What variable combinations are important? What does the model mean? Does the model capture the fundamentals or is it chasing perturbations caused by noise?

The problem gets even worse when it comes time to use the models. There is an implicit assumption that the future looks just like the past--that is, the modeled system has not changed its fundamental behavior and the model is not being asked to predict in new operating regions. Unfortunately, real-world applications will likely violate these assumptions. This reliance on the future closely resembling the past is analogous to driving a car using only the rear-view mirrors for guidance! Thus, even though a data-driven model may be necessary or better than the alternatives, a certain amount of trepidation accompanies its use.

The DataModeler add-on package for Mathematica helps take some of the risk and trepidation out of empirical modeling. Using state-of-the-art genetic programming algorithms to perform symbolic regression, it rapidly develops compact, human-interpretable models that automatically identify and exploit the driving variables while exploring the risk-reward trade-off of model complexity versus model accuracy. The net effect is that we let the data tell us what accuracy is possible and appropriate, mitigating the risk of using overly complex and fragile models or incorporating spurious variables.
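The complexity-versus-accuracy trade-off described above is typically navigated by keeping only the models that are Pareto-optimal: no other candidate is both simpler and more accurate. DataModeler's internals are not shown on this page, so the following is a minimal Python sketch of that selection step; the candidate models and the `pareto_front` helper are illustrative assumptions, not DataModeler API.

```python
# Toy sketch: select Pareto-optimal models from candidates scored by
# (complexity, error).  Lower is better on both axes.
def pareto_front(models):
    """Return the names of models not dominated by any other candidate.

    A model is dominated if some other model is no worse on both
    complexity and error, and strictly better on at least one.
    """
    front = []
    for name, comp, err in models:
        dominated = any(
            (c <= comp and e <= err) and (c < comp or e < err)
            for _, c, e in models
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical candidates from a symbolic-regression run (invented numbers).
candidates = [
    ("linear",     2, 0.30),   # simple but inaccurate
    ("quadratic",  4, 0.08),   # good trade-off
    ("rational",   6, 0.07),   # slightly better, more complex
    ("bloated",   15, 0.07),   # same accuracy, needless complexity
]

print(pareto_front(candidates))  # "bloated" is dominated by "rational"
```

A model selected this way earns its complexity: every extra term buys measurable accuracy.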

Because the developed models minimize nuisance variables, sit at appropriate levels of complexity, and are human-interpretable, they are intrinsically more trustworthy than models developed using most other data modeling techniques. Furthermore, because the symbolic regression process explores a plethora of possible model structures, we can use ensembles of diverse models to develop trustable models and trust metrics: the ensemble is constrained to agree if, during use, the system has not changed and it is not being asked to extrapolate into new regions of parameter space. If either of these core assumptions is violated, the models will be constrained to disagree--thereby providing awareness that the model prediction should be regarded with suspicion.
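The disagreement-as-warning idea can be illustrated without DataModeler. In the sketch below (plain Python with NumPy; the data, model forms, and test points are all invented for illustration), several structurally different fits agree inside the training region and diverge when asked to extrapolate, so the ensemble spread serves as a trust metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data from an assumed system y = x^2 on [0, 1], plus noise.
x = rng.uniform(0.0, 1.0, 200)
y = x**2 + rng.normal(0.0, 0.01, x.size)

# A structurally diverse ensemble: polynomial fits of different degree.
ensemble = [np.polynomial.Polynomial.fit(x, y, deg) for deg in (1, 2, 3)]

def predict_with_trust(x0):
    """Ensemble prediction plus a spread-based trust metric."""
    preds = np.array([p(x0) for p in ensemble])
    return preds.mean(), preds.std()   # large spread => low trust

inside_mean, inside_spread = predict_with_trust(0.5)    # interpolation
outside_mean, outside_spread = predict_with_trust(3.0)  # extrapolation

print(f"x=0.5: prediction {inside_mean:.3f}, spread {inside_spread:.4f}")
print(f"x=3.0: prediction {outside_mean:.3f}, spread {outside_spread:.4f}")
```

Inside the training region the three models nearly coincide; at x=3.0 the linear fit and the higher-degree fits diverge sharply, which is exactly the suspicion signal described above.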

In this talk we will demonstrate the DataModeler package, with special emphasis on developing symbolic regression models for human insight and understanding and ensembles for real-world deployment. Along the way we will feature interface improvements made possible by the migration to Mathematica 6 and discuss issues associated with model development, selection, and management.


Downloads

TrustworthyDataModels.nb (11 MB) - Mathematica Notebook [for Mathematica 6.0]