Publications
The importance of debiasing social media data to better understand e-cigarette-related attitudes and behaviors
Abstract
In a recent issue of JMIR, Kim and colleagues described a framework for data collection, quality assessment, and reporting standards for social media data used in health research [1]. The authors’ framework was based on two principles: retrieval precision, or “how much of retrieved data is relevant,” and retrieval recall, or “how much of the relevant data is retrieved.” With in-depth knowledge of the subject matter under investigation, and refinement of the keywords to develop reliable search filters, the authors suggested that irrelevant content could be weeded out and high-quality data collection could be assured. Using the topic of electronic cigarettes (e-cigarettes) discussed on Twitter as a case study to showcase their framework, the authors demonstrated how reporting standards could be made systematic and transparent. While the authors cogently argued for better reporting standards in social media data used in health research, and their principles regarding retrieval precision and retrieval recall were thoughtfully laid out, they overlooked the importance of identifying the sources of the content being captured during data collection. For example, Twitter has quickly become subject to third-party manipulation, in which automated accounts are created by industry groups and private companies that aim to influence discussions and promote specific ideas or products [2]. This fact is absent from the framework of Kim and colleagues [1], and according to their principle of retrieval precision, researchers could classify tweets about e-cigarettes as high-quality data regardless of their origin.
Recent research has suggested that between 70% and 80% of …
- Date: August 9, 2016
- Authors: Jon-Patrick Allem, Emilio Ferrara
- Journal: Journal of Medical Internet Research
- Volume: 18
- Issue: 8
- Pages: e6185
- Publisher: JMIR Publications Inc., Toronto, Canada