Cleaning Up Data to Spruce Up the Results

Drawing conclusions from scientific studies can be difficult, in part because the data collected may be biased, which can lead to misinterpretation of the results. Let’s say we’re collecting data to investigate how many hours of sleep people get per night during the week compared to over the weekend. We can ask 100 people their average nightly sleep time on weeknights and on weekends. To avoid bias, or skewing the data toward a particular duration, we should control for a few different factors. For example, we can limit our sample to people 18 years or older, to avoid surveying children, who tend to require more sleep than adults. This avoids introducing a bias in the hours-slept-per-night measure and prevents a trend in the data toward more than 8 hours a night.


Some biases cannot be totally avoided during data collection. The existence of this unavoidable bias motivates scientists to measure potential confounding variables as part of their data collection. Scientists use covariates when additional variables that change or differ across groups cannot be controlled for. A covariate is a variable that changes with the variable of interest, but isn’t of particular interest or importance for the question at hand. In our example, there are some other variables that may affect the amount of sleep an adult gets. These can include age (a postdoc in their late 20s with a grant deadline might not get as much sleep as a retiree in their 60s), activity level (strenuous physical activity leads to more sleep for better recovery), and caffeine intake (maybe serial coffee drinkers sacrifice an extra hour of sleep for an extra large cup in the morning). Because these variables may be different for each participant, we can measure them as observed covariates and include them in our statistical analysis.


Sometimes, as is the case with many epidemiological or public health studies, it’s difficult to measure or control for these covariates because the studies commonly use observational data from population-based studies, which might not measure all potential covariates. In these studies, there may be unmeasured biases in the data that produce confounds, leading to imperfect conclusions. In our example, maybe we neglect to measure time spent on social media, which can affect someone’s total sleep time (I can’t be the only one who scrolls Instagram instead of going to sleep at night…). Time spent on social media would be our unobserved covariate, which contributes to unmeasured bias in our sample.


One way to address the problem of unmeasured bias is to pre-process the data – to fine-tune or clean up the data after it has been collected, but before statistical analysis is performed. In a recent paper, Columbia postdoc Dr. Ilan Cerna-Turoff and colleagues explored the use of a pre-processing method that can be used prior to data analysis in order to reduce the bias introduced by unmeasured covariates in a dataset. 


The pre-processing method investigated in this study is called “Full matching incorporating an instrumental variable (IV)” or “Full-IV Matching”, which aims to reduce biases between groups and thereby improve the accuracy of study findings. An instrumental variable (IV) is a measured variable that is unrelated to the covariates but is related to the variable of interest. For our example, an IV could be how comfortable participants find their bed – something that is related to the time spent asleep, but isn’t related to their age or the amount of coffee they drink.


To apply the Full-IV Matching method, the researchers define an IV and “carve out” moderate values of the variable to focus on the extreme values (highest and lowest) across the range of IV measures, essentially ignoring the center of the data set. With this abridged dataset, the researchers implement a “matching” algorithm that pairs individuals who have similar values in their covariates, but who do not have similar values in their IV. In our example, participants who have similar caffeine intake levels or similar ages would be paired with participants who have the opposite bed-comfort level. This explores how the biases in the dataset change when each measured covariate is individually controlled for. Additionally, the researchers can define how much weight should be given to the unobserved covariate, depending on how much bias may be introduced into the data by this unobserved covariate. 
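To make the “carve out, then match” idea concrete, here is a minimal Python sketch using the sleep example. It is only an illustration under stated assumptions: the participants, the bed-comfort cutoffs, and the distance function are all made up, and the simple greedy one-to-one pairing below stands in for the full matching algorithm used in the paper, which is more sophisticated.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    age: float
    caffeine_cups: float
    bed_comfort: float  # hypothetical IV, scored 0 (worst) to 10 (best)


def carve_out_middle(people, low_cut, high_cut):
    """Keep only extreme IV values; moderate values are dropped."""
    low = [p for p in people if p.bed_comfort <= low_cut]
    high = [p for p in people if p.bed_comfort >= high_cut]
    return low, high


def covariate_distance(a, b):
    """Illustrative distance on observed covariates (age scaled by decade)."""
    return abs(a.age - b.age) / 10 + abs(a.caffeine_cups - b.caffeine_cups)


def greedy_pairs(low, high):
    """Pair each low-IV person with the closest unused high-IV person.

    This is a simplified stand-in, not the paper's full matching.
    """
    remaining = list(high)
    pairs = []
    for p in low:
        if not remaining:
            break
        best = min(remaining, key=lambda q: covariate_distance(p, q))
        remaining.remove(best)
        pairs.append((p.name, best.name))
    return pairs


# Made-up participants for illustration only.
people = [
    Participant("A", 25, 3, 2),
    Participant("B", 62, 1, 1),
    Participant("C", 27, 3, 9),
    Participant("D", 60, 1, 8),
    Participant("E", 40, 2, 5),  # moderate bed comfort: carved out
]

low, high = carve_out_middle(people, low_cut=3, high_cut=7)
pairs = greedy_pairs(low, high)
print(pairs)
```

In this toy data, each pair is similar in age and caffeine intake but sits at opposite ends of the bed-comfort scale, while the participant with a middling bed-comfort score is left out of the matched set.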


As a proof of concept, Dr. Cerna-Turoff and colleagues simulated data from a scenario based on the Haitian Violence against Children and Youth Survey. Specifically, data were simulated based on measurements of social characteristics and experiences of young girls in Haiti, who were displaced either to a camp (“exposure” group) or to a wider community (“comparison” group) after the 2010 earthquake. The goal of this simulation experiment was to better understand how the displacement setting may be associated with risk of sexual violence. The researchers simulated data for 5 baseline covariates based on results from the survey: (1) status of restavek (indentureship of poor children for rich families), (2) prior sexual violence, (3) living with parents, (4) age, and (5) social capital, the last of which is an unobserved covariate. They also generated data for an exposure (camp or community), an outcome (sexual violence against girls), and an IV (earthquake damage severity). The researchers explored how the outcome was affected by the covariates and IV by quantifying the standardized mean difference of each variable across the exposure and comparison groups. A standardized mean difference value close to 0 indicates that the value of the variable was not different across the two groups, suggesting that this variable is not introducing bias into the analysis of group differences.
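For readers who want to see what a standardized mean difference looks like in practice, here is a small Python sketch. The post does not spell out the exact formula used in the paper, so this assumes one commonly used definition: the difference in group means divided by the square root of the average of the two group variances. The ages below are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(exposed, comparison):
    """Difference in means divided by a pooled standard deviation.

    Assumes the common definition with the two sample variances averaged;
    the paper may use a different variant.
    """
    pooled_sd = sqrt((variance(exposed) + variance(comparison)) / 2)
    return (mean(exposed) - mean(comparison)) / pooled_sd


# Made-up ages for an "exposure" and a "comparison" group.
balanced = standardized_mean_difference([12, 14, 16], [12, 14, 16])
imbalanced = standardized_mean_difference([10, 12, 14], [12, 14, 16])
print(balanced, imbalanced)
```

When the two groups have the same ages, the value is 0; when one group is shifted two years younger, the value moves away from 0, flagging age as a possible source of bias between the groups.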


The results suggest that those who were displaced to a camp were at a higher risk of sexual violence than those who were displaced to a wider community, when correcting for all observed covariates. Additionally, the method successfully balanced the groups when correcting for the unobserved covariate of social capital. If not corrected for, differences in social capital might have confounded these results, such that girls with a stronger support network may appear to be at a lower risk. However, using the Full-IV Matching method, bias across exposure and comparison groups for the observed covariates and the unobserved covariate of social capital was reduced, suggesting that neither social capital nor the observed covariates contributed to the difference in risk for sexual violence observed between the two groups.


This study provides a proof-of-concept for a pre-processing method for reducing bias across a data set. The authors mention limitations including the effect of the method on sample size and the ‘bias-variance trade-off’, in which increases in accuracy (less bias) may lead to more noise (higher variability) in the data. Ultimately, this type of methodology can aid in the correction of both observed and unobserved biases in population-based data collection, which has significant implications in epidemiologic studies, where not all sources of bias can be measured effectively.


Edited by: Emily Hokett, Pei-Yin Shih, Maaike Schilperoort; Trang Nguyen
