A Functional Dataset Toolkit

Alan Calvitti
Organization: Veterans Health Administration

Wolfram Technology Conference 2015
Champaign, Illinois USA

As the volume, velocity and diversity of data increase, the need to integrate and analyze it effectively becomes more pressing. Much of the new data produced is NoSQL in nature: lacking a schema, weakly typed, or shaped like nested or ragged arrays. Using examples from healthcare informatics, we show how Dataset and related functionality can be used to integrate, analyze and visualize hierarchical, time-series, numeric and text data in the context of cloud-based scientific workflows. Practically the entire processing pipeline can be managed as queries on Dataset objects, from indexing and metadata management through common statistical queries to visualization. Scientific methods are used only for illustration; the focus is on functional design patterns using Query syntax, augmented by microcodes that are legible, easy to modify and reusable. Legibility is further improved by using operator forms. Because nested Associations are computationally efficient, they enable recursive queries, exemplified here by a one-line trie constructor. Dataset enables query prototyping on small subsamples, then scaling up the processing to whole-study datasets with minimal revisions. Additional examples show challenges due to branch-merge data flows, missing values, composite keys, temporal data and interval matching. Finally, some current limitations and desired features are discussed.
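
The recursive-query idea behind the trie constructor can be sketched outside the Wolfram Language as well. The block below is a Python analogue, not the one-liner presented in the talk: it assumes the trie is built by grouping words on their first character and recursing on the remaining suffixes, with nested dicts standing in for nested Associations. The function name `trie` and the sample words are illustrative assumptions.

```python
from itertools import groupby


def trie(words):
    """Build a nested-dict trie by grouping words on their first character.

    Assumption: nested dicts play the role of nested Associations, and
    end-of-word markers are omitted to keep the sketch short.
    """
    # Drop exhausted (empty) suffixes and sort so groupby sees each
    # leading character as one contiguous run.
    words = sorted(w for w in words if w)
    return {
        head: trie([w[1:] for w in group])  # recurse on the suffixes
        for head, group in groupby(words, key=lambda w: w[0])
    }


example = trie(["car", "cat", "dog"])
```

Each level of the result maps a character to the sub-trie of its continuations, so shared prefixes such as "ca" are stored once.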

