Models built from Google search queries adequately estimated malaria activity in Thailand from 2005 to 2010, as measured against official malaria case counts reported by WHO. Search terms for the models were selected via correlation with case data, identified using Google Correlate. These correlated queries were subsequently subset by manual selection, an automated query selection method, and penalized multivariate regression selection. Additional queries useful for prediction were identified by surveying Thai physicians. This research demonstrates the potential for a method of malaria surveillance using real-time Google search query data that could complement traditional epidemiological methods in Thailand.
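The correlation-based selection step can be illustrated with a minimal sketch. It is not the Google Correlate algorithm or the models fitted in this study: it simply ranks candidate query time series by their Pearson correlation with case counts. The function name, the query labels, the monthly resolution (72 points for 2005 to 2010), and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def rank_queries_by_correlation(case_counts, query_volumes, top_k=3):
    """Return (query, r) pairs for the top_k queries most positively
    correlated with the case-count series (Pearson r)."""
    cases = np.asarray(case_counts, dtype=float)
    scored = []
    for query, volume in query_volumes.items():
        r = np.corrcoef(cases, np.asarray(volume, dtype=float))[0, 1]
        scored.append((query, float(r)))
    # Highest correlation first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Synthetic example (assumed monthly data; not real case counts)
rng = np.random.default_rng(0)
months = np.arange(72)
cases = 100 + 40 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 5, 72)

# Hypothetical query series: one tracks cases, one is noise, one is inverted
queries = {
    "hypothetical_query_a": 0.5 * cases + rng.normal(0, 5, 72),
    "hypothetical_query_b": rng.normal(50, 10, 72),
    "hypothetical_query_c": -cases + rng.normal(0, 5, 72),
}

selected = rank_queries_by_correlation(cases, queries, top_k=1)
```

In practice, a ranked list like this would only be a starting point; the manual, automated, and penalized-regression subsetting described above would then narrow it further.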
Strengths and limitations
Overall, this research suggests that models using Google search queries could be useful for malaria surveillance in malaria-endemic locations such as Thailand. As with other search query surveillance studies [6–15], it is not necessarily those who are sick who are searching online, but search volume is a proxy for where there might be higher risk of disease. The inclusion of microscopy-related terms also suggests who may be searching: in this case, technicians in Thailand who use microscopes for blood smear analysis, the common malaria diagnostic method. While imperfect, these search queries may be valid real-time indicators of malaria incidence in the population. Another limitation of using internet search queries for malaria surveillance is the mismatch between the geographic distribution of malaria and that of the query data. Internet access is more common in urban, densely populated areas, while malaria is commonly found in more rural areas because the mosquito vector prefers forest dwellings and a shaded environment [21]. Finally, models built using internet search queries for surveillance should be retrained on official case data periodically to ensure that the selected terms continue to relate to disease incidence [16].
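One way to operationalize the periodic retraining check is sketched below, under stated assumptions: the model's recent estimates are compared with official case counts over a trailing window, and retraining is flagged when their Pearson correlation falls below a threshold. The function name, the 12-point window, the 0.7 threshold, and the example numbers are hypothetical choices, not values from this study.

```python
import numpy as np

def needs_retraining(official_cases, model_estimates, window=12, min_r=0.7):
    """Flag retraining when the correlation between official case counts
    and model estimates over the last `window` points drops below `min_r`.
    Returns (flag, r)."""
    cases = np.asarray(official_cases, dtype=float)[-window:]
    estimates = np.asarray(model_estimates, dtype=float)[-window:]
    r = np.corrcoef(cases, estimates)[0, 1]
    return bool(r < min_r), float(r)

# Illustrative (made-up) series for one year of monthly data
cases = [10, 12, 15, 20, 26, 30, 28, 22, 17, 13, 11, 10]
tracking = [11, 12, 16, 19, 27, 29, 27, 23, 16, 14, 11, 9]    # follows cases
drifted = [20, 18, 21, 19, 20, 18, 21, 19, 20, 18, 21, 19]    # no longer follows

still_valid = needs_retraining(cases, tracking)
stale = needs_retraining(cases, drifted)
```

A correlation threshold is only one possible criterion; an error-based measure against the lagged official data would serve the same purpose.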
Comparison to other novel surveillance methods in the literature
In past studies relating Google search queries to infectious disease trends, the queries used to build surveillance models have been directly related to the disease of interest [6, 11]. Terms that are correlated but not related are removed from the modelling. However, this is a subjective process in which researchers decide which queries are sufficiently related. The other commonly used approach to building prediction models is to define disease-related queries prior to any investigation of correlations [6, 8, 9, 14]; this approach was employed for the physician queries model. Both of these methods have generally resulted in terms related to symptoms of the disease, reinforcing a conceptual path between the searcher and the object of surveillance. In this study, the first approach found that terms not directly related to symptoms of the disease, but instead related to the technique used to diagnose it, could be related to temporal patterns of the disease. In the case of Thailand, this may be indicative of who has internet access and is involved in malaria diagnosis. Ultimately, the logical relationship between the query and disease prevalence may not need to be fully understood, provided the tool remains effective for public health surveillance.
This study is unique because models were created by harnessing the publicly available Google Correlate tool to find search terms that correlate with the time series data of interest. Previous studies of infectious disease surveillance have obtained correlated search queries via collaboration with Google, or by working with disease specialists to generate lists of queries before modelling [6, 10, 11, 13].
Correlations from three of the models were on a par with those of previous studies, and slightly lower for one model, demonstrating the feasibility of using search query surveillance for malaria [6, 7, 10, 11, 14]. Recently, other novel internet-based active surveillance methods for malaria have been explored; however, search queries provide a passively available, and thus continuous and potentially earlier, source of information than active surveillance methods. Surveillance methods harnessing search query data can also be used to complement traditional and other novel methods, such as mobility tracking via cell phone data and online surveys, which have value for malaria surveillance [22, 23]. To the authors' knowledge, this is the first study using search queries for malaria surveillance.
Implications and future outlook
In summary, monitoring Google search queries known to correlate with malaria incidence could be useful to Thai public health practitioners for detecting epidemics in real time. Internet-based surveillance methods are not a replacement for traditional public health surveillance and diagnostic confirmation. However, they can fill gaps in current malaria surveillance because search query data are available in near real time, while traditional surveillance data are lagged on the order of days, weeks or longer [6, 24]. Future research should investigate the utility of these search queries for malaria surveillance in other malaria-endemic countries and regions. As internet access increases worldwide, search queries will become more representative. Qualitative investigations could be performed to confirm whether microscopy queries correlate with malaria incidence as a result of laboratory technicians and doctors researching malaria diagnostics, or of hospitals ordering more microscopy equipment during epidemics, and thereby to discern the temporal significance of these searches. Future query-based surveillance research should also rigorously evaluate different query selection processes, as their performance against disease incidence data can vary. More broadly, this and future studies will continue to contribute to the evidence that web-based data can be effective as a public health tool.