Environment-wide association studies (EWAS) highlight the contribution of nongenetic components to complex phenotypes. However, the lack of high-throughput quality control (QC) pipelines for EWAS data lends itself to analysis plans where the data are cleaned after a first-pass analysis, which can lead to bias, or are cleaned manually, which is arduous and susceptible to user error. The Hall Lab offers novel software, CLeaning to Analysis: Reproducibility-based Interface for Traits and Exposures (CLARITE), as a tool to efficiently clean environmental data, perform regression analysis, and visualize results on a single platform through user-guided automation. It is available as both an R package and a Python package. Though CLARITE focuses on EWAS, it is also intended to improve the QC process for phenotypes and clinical lab measures for a variety of downstream analyses, including phenome-wide association studies and gene-environment interaction studies. An example workflow is shown in Figure 1.

Figure 1: CLARITE Flowchart
Figure 1 - This is a sample flowchart depicting a typical workflow when using the CLARITE package with the NHANES data. The user starts with raw data and alternates between summary steps (dashed lines) and filtering/quality control (QC) steps (solid lines) based on variable type (indicated by color) and either user-defined or default thresholds informed by the summary output. Once data are sufficiently cleaned, environment-wide association studies (EWAS) can be run.
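
The type-based QC step described in the Figure 1 caption can be illustrated with a minimal, library-free sketch. It does not use CLARITE's actual API: the function names `categorize` and `passes_min_n`, the unique-value cutoffs, and the minimum sample size are all hypothetical values chosen for illustration, not confirmed CLARITE defaults.

```python
from typing import Optional, Sequence


def categorize(values: Sequence[Optional[float]],
               cat_min: int = 3, cat_max: int = 6, cont_min: int = 15) -> str:
    """Assign a variable type from its number of unique non-missing values.

    The cutoffs are illustrative assumptions, not documented defaults.
    """
    observed = [v for v in values if v is not None]
    n_unique = len(set(observed))
    if n_unique < 2:
        return "constant"        # no variation, so nothing to test
    if n_unique == 2:
        return "binary"
    if cat_min <= n_unique <= cat_max:
        return "categorical"
    if n_unique >= cont_min:
        return "continuous"
    return "needs_review"        # ambiguous count: left for the user to decide


def passes_min_n(values: Sequence[Optional[float]], min_n: int = 200) -> bool:
    """Keep a variable only if it has at least min_n non-missing observations."""
    return sum(v is not None for v in values) >= min_n


# Toy data standing in for raw survey variables (None marks a missing value).
raw = {
    "smoker": [0, 1, None, 1, 0],
    "serum_lead": [float(i) for i in range(20)],
    "survey_cycle": [1, 1, 1, 1],
}
for name, column in raw.items():
    print(name, categorize(column), passes_min_n(column, min_n=4))
```

In this sketch the summary step (counting unique values) informs the filtering step (dropping constant or sparsely observed variables), mirroring the alternation between summary and QC steps in the flowchart.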

To demonstrate the utility of CLARITE, we performed a novel EWAS in the National Health and Nutrition Examination Survey (NHANES). The results are shown in Figure 2.

Figure 2: Manhattan plot generated by CLARITE
Figure 2 - A Manhattan plot generated using CLARITE’s visualization tool, displaying the results for exposures predictive of body mass index (BMI). Exposure categories are arranged along the x-axis, with -log10(p-value) along the y-axis. Results are included for the Discovery (circle) and Replication (triangle) datasets. The red line denotes the Bonferroni threshold (alpha: 0.05) for the number of tests run in the Discovery dataset (305), and the blue line denotes the Bonferroni threshold (alpha: 0.05) for the number of tests run in the Replication dataset (99). The 16 replicating results with Bonferroni-corrected p-value < 0.05 are labeled.
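
The two Bonferroni thresholds in the Figure 2 caption follow from dividing alpha by the number of tests in each dataset. The short sketch below reproduces that arithmetic with the test counts given in the caption (305 and 99). The helper name `bonferroni_threshold` is hypothetical and is not part of CLARITE.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test p-value cutoff for a family of n_tests tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Test counts taken from the Figure 2 caption.
for label, n_tests in (("Discovery", 305), ("Replication", 99)):
    cutoff = bonferroni_threshold(0.05, n_tests)
    # The horizontal line on a Manhattan plot sits at -log10 of the cutoff.
    print(f"{label}: p < {cutoff:.2e}, line at -log10(p) = {-math.log10(cutoff):.2f}")
```

Because the Replication dataset involves fewer tests, its cutoff is less stringent, so its line sits lower on the -log10 scale than the Discovery line.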


Questions?

If you have any questions not answered by the documentation, feel free to open an Issue or contact John McGuigan (John.McGuigan at psu.edu).

This work is/was supported by the USDA National Institute of Food and Agriculture and Hatch Appropriations under Project #PEN04275 and Accession #1018544.
