Challenges in Integration and Analysis of High-Dimensional Biological Data: Cases from Environmental and Health Research
Biological data represent a large and challenging sector of data engineering applications. They are typically complex and poorly standardized. Moreover, modern environmental and health research data are characterized by high value, rapid growth in volume, and advances in acquisition technologies, which strain classical practices for data transformation and analytics. Furthermore, biological data are more meaningful when integrated with other, usually different, data types, or with data from different sources or even different fields. In addition, the uniqueness of each case and research question calls for a deep understanding of the data life cycle and for customized solutions. Because they are large in volume, high in value, and produced at high velocity and in great variety, biological data encourage the investigation of scalable workflows that automate acquisition and integration, closing gaps in the optimization of analytics, especially for heterogeneous data.
This thesis aims to explore and optimize state-of-the-art methods for the integration and analysis of heterogeneous sequence-based and non-sequence-based data, by identifying four areas of application concerning primary and secondary data from environmental and health research. It presents four challenges in data preparation and transformation for variable selection, with accompanying case studies. In particular, the thesis investigates knowledge extraction from primary, inherently high-dimensional marine sequence data; scalability in handling secondary photosynthetic sequence data; integration and statistical modeling of secondary high-dimensional relational health care claims data for adverse drug event prediction; and integration of heterogeneous primary epidemiological data for the investigation of childhood obesity. The thesis highlights the importance of data model development for data transformation and integration, and the role of scalable analytics given the foreseen increase in data dimensions.