Abstract
Purpose: Reproducibility is a core principle of science, and access to a study’s data is essential for reproducing its findings. However, data sharing is uncommon in the field of Communication Sciences and Disorders (CSD), often because of concerns about privacy and disclosure risks. Synthetic data offers a potential solution to this barrier by generating artificial datasets that do not represent real individuals yet retain statistical properties and relationships from the original data. This study aimed to explore the feasibility and preliminary utility of synthetic data to promote transparency and reproducibility in the field of CSD.
Method: Ten open datasets were obtained from previously published research within the American Speech-Language-Hearing Association ‘Big Nine’ domains (articulation, cognition, communication, fluency, hearing, language, social communication, voice and resonance, and swallowing) across a range of study outcomes and designs. Synthetic datasets were generated with the synthpop R package. General utility was assessed visually and with the standardized ratio of the propensity mean squared error (S_pMSE). Specific utility was assessed by comparing model fit indices, coefficients, and p-values to determine whether inferential relationships from the original data were preserved in the synthetic dataset.
Results: All synthetic datasets showed strong general utility, maintaining univariate and bivariate distributions. Of the nine synthetic datasets from studies that used inferential statistics, six showed strong specific utility, preserving the inferential relationships from the original analysis. Specific utility was low in the remaining three datasets, all of which had hierarchical structures.
Conclusion: Findings suggest that synthetic data can effectively maintain statistical properties and relationships across a wide range of non-hierarchical data commonly seen in the field of CSD. Other approaches for hierarchical data need to be explored in future work. Researchers who use synthetic data should assess its utility in preserving their results for their own data and use case.
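
Illustrative sketch: the general utility measure named in the Method can be approximated as follows. This is a minimal illustration under stated assumptions, not the study's code. It uses Python with NumPy rather than the synthpop R package, fits a main-effects logistic propensity model (synthpop may use a different propensity model), and divides the pMSE by a commonly used null expectation for logistic models, (k - 1)(1 - c)^2 c / N, where k is the number of model parameters including the intercept, c is the share of synthetic rows, and N is the combined row count. The function names `fit_logistic` and `pmse_ratio` and the simulated data are hypothetical.

```python
import numpy as np


def fit_logistic(X, t, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)  # guard against overflow in exp
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        gradient = X.T @ (t - p)
        hessian = X.T @ (X * w[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


def pmse_ratio(original, synthetic):
    """Return (pMSE, pMSE divided by its assumed expectation under the null)."""
    combined = np.vstack([original, synthetic])
    n_total = combined.shape[0]
    # Indicator: 0 for original rows, 1 for synthetic rows.
    t = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])
    X = np.column_stack([np.ones(n_total), combined])
    beta = fit_logistic(X, t)
    propensity = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
    c = len(synthetic) / n_total  # share of synthetic rows
    pmse = np.mean((propensity - c) ** 2)
    k = X.shape[1]  # number of parameters, intercept included
    expected_null = (k - 1) * (1 - c) ** 2 * c / n_total
    return pmse, pmse / expected_null


# Simulated (not real) data: one synthetic set drawn from the same
# distribution as the original, one with shifted means.
rng = np.random.default_rng(0)
original = rng.normal(size=(500, 2))
close_copy = rng.normal(size=(500, 2))
shifted_copy = rng.normal(loc=1.0, size=(500, 2))

_, ratio_close = pmse_ratio(original, close_copy)
_, ratio_shifted = pmse_ratio(original, shifted_copy)
print(f"ratio, same distribution: {ratio_close:.2f}")
print(f"ratio, shifted means: {ratio_shifted:.2f}")
```

Under these assumptions, the ratio has an expected value of 1 when the synthetic rows cannot be distinguished from the original rows, and it grows as the two sets become easier to tell apart.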