Lots of variables. Not much in the way of data. The data variables are correlated and coupled. Truth is unknown. What does it all mean? This is a common situation for many real-world data sets. In this talk, Mathematica's CountryData[ ] is used as a surrogate for typical social science data to demonstrate how DataModeler's unique capabilities can be used to attack such data efficiently and effectively. Although basic SymbolicRegression and data exploration features will be briefly reviewed, the emphasis will be on recently added capabilities:

Data Balancing: which records and attributes contain the most information? The SMITS algorithm lets us explore the information content of data sets and identify key data records. This lets us quickly build balanced data sets from real-world data, which leads to improved empirical models as well as faster model development.

Variable Selection: which variables count from a nonlinear perspective? Symbolic regression is attractive because it naturally handles correlated inputs, pulls out the driving variables, and presents them without artificial constraints and assumptions. In other words, the data tell us which inputs and input combinations are important.

Niching: what is the trade-off on the number of variables or variable combinations? The ability to explore the trade-off in the number of variables or possible variable combinations leads to increased insight into the underlying mechanisms of the modeled system.

Outlier Detection: outliers are either the most interesting nuggets of data or a nuisance. However, identifying outliers is difficult in multivariate data sets, especially when the outliers are inliers. Although the data balancing process can identify some outliers, model-based outlier detection lets us quickly identify the data records that warrant special attention.