Health Website Quality: Towards Automated Analysis

Thomas Nind (University of Dundee), Vicki Hanson (University of Dundee), Stephen McKenna (University of Dundee), Ian Ricketts (University of Dundee), Falko Sniehotta (University of Newcastle), Zhiwei Guan (Google, Inc), Jeremy Wyatt (University of Warwick), Lorna Gibson (University of Dundee), Wendy Moncur (University of Dundee)

ABSTRACT

The innovative use of information and computing technology to deliver improved healthcare is a priority research area in the Digital Economy programme. We have developed and tested software that extracts features from health websites that may be associated with quality. This is the first stage in a larger on-going project to create a machine learning algorithm that can predict the quality of health websites. This feature extractor achieved a mean accuracy of 77% identifying features in 36 manually rated websites. A preliminary analysis of features extracted from 8,227 websites from the Health on the Net (HON) and from Google search engines showed significant associations between the extracted features and a website’s presence in the HON database of certified sites (a proxy measure for health information quality).

1. INTRODUCTION

Evaluating the quality of health related websites can be difficult for people. Existing evaluation tools can be difficult to use or unreliable [1]. One way people can find high quality websites is to use a dedicated health search engine, such as that provided via the Health on the Net search portal (HON) [4]. The HON search engine serves web pages from a pool of accredited health web pages based on the user’s search terms. Alternatively people find high quality health information by focusing solely on National Health Service (NHS) sites or those recommended by their doctor.
In reality, however, many people use a general purpose search engine, rely on intuition and heuristics to evaluate information quality (IQ), and predominantly choose from the first few search results [2]. One way to help these people reliably find higher quality websites would be to annotate or re-order search results using a consistent quality assessment algorithm. Previous work on search annotation has demonstrated that providing information such as popularity and the presence of third party certifications can improve users’ ability to judge website credibility [6]. This study differs from previous work in its focus on website surface features and in its approach to determining which features are indicative of quality.

2. FEATURE EXTRACTOR

2.1 Features

In order to build a quality assessment algorithm, we must first have a set of website features that can be reliably detected and that are associated with the presence of high quality health information. The approach used in this study was to detect as many potential features as possible, based on the existing credibility and information quality literature [5], and then to discard those features that were not reliably detected or that were not found to be associated with quality. The following features were detected.

The presence of advertising was detected by searching web page HTML for entries in the ‘EasyList’, a publicly available list of advertising domain names and page content patterns (e.g. -adserver/). Overly general patterns were manually removed, giving 12,195 suitable patterns.

The accessibility of web pages was evaluated via three features: the presence of a ‘skip link’ for screen readers, the proportion of content images containing an alt text description, and the proportion of decorative images (less than 5 pixels in dimension) containing alt="".

Most health related websites provide a means to contact the site’s authors, often through a contact page.
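The advertising and image accessibility checks described above can be illustrated with a minimal sketch. It is not the extractor used in this study. It assumes plain substring matching against the pattern list, assumes that an image counts as decorative when both its declared width and height attributes are below 5 pixels, and uses hypothetical names throughout (AD_PATTERNS, ImageAltAudit, audit_images). Only the ‘-adserver/’ pattern comes from the text; the second pattern is a made-up stand-in.

```python
from html.parser import HTMLParser

# Hypothetical stand-ins for the filtered EasyList patterns; only "-adserver/"
# is quoted in the text, the other entry is invented for illustration.
AD_PATTERNS = ["-adserver/", "/adframe."]

# Assumption: an image is decorative when both declared dimensions are below this.
DECORATIVE_MAX_PX = 5


def contains_advertising(html, patterns=AD_PATTERNS):
    """Return True if any advertising pattern occurs in the raw HTML."""
    return any(pattern in html for pattern in patterns)


def _pixels(value):
    """Parse a width/height attribute such as "1" or "1px"; None if unusable."""
    try:
        return int(str(value).lower().replace("px", "").strip())
    except ValueError:
        return None


class ImageAltAudit(HTMLParser):
    """Count content and decorative images and how they use the alt attribute."""

    def __init__(self):
        super().__init__()
        self.content_total = 0
        self.content_with_alt = 0
        self.decorative_total = 0
        self.decorative_empty_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        width = _pixels(attributes.get("width"))
        height = _pixels(attributes.get("height"))
        decorative = (
            width is not None
            and height is not None
            and width < DECORATIVE_MAX_PX
            and height < DECORATIVE_MAX_PX
        )
        alt = attributes.get("alt")
        if decorative:
            self.decorative_total += 1
            if alt == "":  # explicit empty alt text
                self.decorative_empty_alt += 1
        else:
            self.content_total += 1
            if alt is not None and alt.strip():  # non-empty description
                self.content_with_alt += 1


def proportion(part, whole):
    """Return part / whole, or None when there is nothing to count."""
    return part / whole if whole else None


def audit_images(html):
    """Return the two image accessibility proportions for one page."""
    parser = ImageAltAudit()
    parser.feed(html)
    return {
        "content_alt": proportion(parser.content_with_alt, parser.content_total),
        "decorative_empty_alt": proportion(
            parser.decorative_empty_alt, parser.decorative_total
        ),
    }
```

Images that do not declare their dimensions are treated as content images here, which is a simplification; a fuller implementation would need the rendered size.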
Detecting the presence of a contact page was important both as a feature in itself and as the route to identifying a physical address, telephone contact details and website feedback forms. Once found, contact pages were downloaded and searched for contact details (postcode/telephone/feedback form) using broad regular expressions.

An important element of credibility is referencing external sources of information. All external hyperlinks (those pointing to other sites) and all internal hyperlinks (those pointing within the same site) were counted. A note was also made of whether the page included a reference list.

Calculating the readability of a website can be hard because it is difficult to distinguish programmatically between content and navigation items. In this case the Flesch-Kincaid readability test was applied to the longest paragraph on the page containing at least 70% English words.

HON certification was confirmed by searching the page for the HON stamp. High accuracy is important here, since ‘presence in the HON search engine’ and ‘bearing the HON stamp’ were used as a proxy for information quality in the subsequent association analysis (see below).

The top level domain (.gov, .com, etc.) is important for determining the source of a site (e.g. governmental) and was extracted from the URL of each page analysed. In many cases the top level domain also indicates the country of origin, which may be useful for determining relevance to the reader.

The presence of a donation button is considered to degrade credibility. This feature was detected by searching for hyperlinks containing “donate” or “donation”. Since donation buttons may be present as images rather than text, the ‘src’ attribute of images was also searched.

Most health websites contain a privacy policy or disclaimer intended to limit liability in the event that readers who act on the site’s advice suffer harm as a result of false or misleading information.
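The hyperlink counts and the donation button check described above can be sketched as follows. This is an illustration under stated assumptions rather than the extractor’s actual code: a link is treated as internal only when its host name matches the page’s host exactly (so www and non-www variants count as different sites), only http and https links are counted, and the names LinkAudit, audit_links and DONATION are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Matches "donate" or "donation", as described in the text.
DONATION = re.compile(r"donat(e|ion)", re.IGNORECASE)


class LinkAudit(HTMLParser):
    """Count internal/external hyperlinks and look for a donation button."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc.lower()
        self.internal = 0
        self.external = 0
        self.has_donation = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self._in_link = True
            href = attributes["href"]
            if DONATION.search(href):
                self.has_donation = True
            target = urlparse(urljoin(self.page_url, href))
            if target.scheme not in ("http", "https"):
                return  # skip mailto:, javascript: and similar links
            if target.netloc.lower() == self.page_host:
                self.internal += 1
            else:
                self.external += 1
        elif tag == "img" and DONATION.search(attributes.get("src") or ""):
            # Donation buttons are often images, so the src is searched too.
            self.has_donation = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Link text such as "Donate now" also counts.
        if self._in_link and DONATION.search(data):
            self.has_donation = True


def audit_links(html, page_url):
    """Return link counts, the external proportion and the donation flag."""
    parser = LinkAudit(page_url)
    parser.feed(html)
    total = parser.internal + parser.external
    return {
        "internal": parser.internal,
        "external": parser.external,
        "proportion_external": parser.external / total if total else None,
        "donation_button": parser.has_donation,
    }
```

Resolving each href against the page URL with urljoin lets relative links such as contact.html be classified in the same way as absolute ones.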
The presence of a privacy policy or disclaimer page was detected using regular expressions.

The presence of a discussion forum, commenting, or a wiki was detected. User generated content can be unreliable and may dilute the quality of information presented on a website.

Social rating systems may be useful predictors of website quality. For this reason, the number of Facebook Likes was extracted and recorded.

Blogs are generally considered to be less credible sources of information than medical experts or journal articles. Their presence on a website may be associated with lower quality health information.

It is important to distinguish between websites offering medical information and those selling a product. The presence of an online shop, ‘cart’ or ‘basket’ was detected.

2.2 Feature Extractor Accuracy

A selection of test sites was required to assess the accuracy of the feature extractor at detecting each feature. The test sites were identified using the most popular UK health search terms of 2012 via Google Insights for Search. The top 3 search terms were selected from each of the categories “Health Conditions”, “Ageing and Geriatrics” and “Alternative” (see Table 1). 4 web pages were downloaded for each search term. In each case the first result with a unique domain name was selected. This provided 36 web pages reflecting a range of common searches.

These pages were manually assessed for the presence of each feature. The results of this manual analysis were compared with those of the automated feature extractor to give a measure of how well the extractor algorithms detect the targeted features. Accuracy ratings are presented in Table 2. The Matthews correlation coefficient (Ø) was used to determine statistical significance. It provides a measure of the predictive quality of the feature detection algorithm, taking into account the ratio of true positive and true negative predictions to false positive and false negative predictions.
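As a minimal sketch of this calculation, the code below computes accuracy and the Matthews correlation coefficient from the four confusion counts of one feature. For a 2x2 table, the chi-squared statistic equals N times Ø squared, a standard identity that appears consistent, up to rounding, with the Ø and X2 values reported in Table 2 for N = 36. The function names and the example counts are hypothetical and are not taken from the study data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Proportion of pages on which extractor and manual rating agree."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_phi(tp, tn, fp, fn):
    """Matthews correlation coefficient for one binary feature.

    Returns 0.0 when a row or column of the confusion table is empty,
    which is the usual convention for an undefined coefficient.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def chi_squared_from_phi(phi, n):
    """Chi-squared statistic (1 degree of freedom) implied by phi on n items."""
    return n * phi * phi


# Critical value of chi-squared with 1 degree of freedom at P = 0.05.
CRITICAL_VALUE = 3.841

# Hypothetical confusion counts for one feature over 36 test pages.
tp, tn, fp, fn = 15, 16, 2, 3
phi = matthews_phi(tp, tn, fp, fn)
chi2 = chi_squared_from_phi(phi, tp + tn + fp + fn)
significant = chi2 > CRITICAL_VALUE
```

A feature that the extractor never reports as present gives an empty column in the confusion table, so the coefficient falls back to 0.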
All non-significant features were discarded. Accuracy calculations were not performed for programmatic features such as Facebook Likes and readability, as these cannot be manually checked but are likely to be accurate.

The least accurately detected features were telephone number and postcode. These features rely on the successful detection of a contact page followed by a variety of country-specific regular expressions. The presence of a disclaimer was also very difficult to detect because it was often buried several links into a website. A more rigid definition of what constitutes a disclaimer would be useful. A balance must be struck between improving current accuracy and identifying alternative features, given that the purpose of the extractor is to power a prediction algorithm.

3. FEATURE ASSOCIATION WITH HEALTH WEBSITE QUALITY

3.1 Methods

Previous researchers investigating automated quality assessment have often used ‘expert ratings’ as the ground-truth against which to test their algorithm. This often relies on selecting a narrow topic area where there are well established guidelines, e.g. depression [3]. Since machine learning requires large datasets, such manual rating is not feasible. Instead, website quality ground-truth was defined as presence in the HON search portal.

The Google Insights for Search tool was used to gather 581 popular health search terms in the same manner as described above (see 2.2 Feature Extractor Accuracy). The Google search API was used to perform searches using these terms with both the HON and Google search engines. These searches resulted in 4,601 unique URLs from Google (regular quality) and 4,200 unique URLs from HON (high quality). 574 URLs in the Google set were also present in the HON set and so were discarded. This resulted in 8,227 unique URLs for processing by the Feature Extractor.

Although 8,227 unique pages were retrieved, these came from only 2,076 separate domains, i.e. many were pages on the same site (e.g. Wikipedia). Where multiple pages were available for a domain, the results of the Feature Extractor were averaged to give a single result per domain.

3.2 Results

All categorical features (present/not present) which achieved significant detection accuracy were entered into a chi-squared test for independence. For each feature, a 2x2 contingency table was created and a chi-squared probability calculated. The results of this analysis are presented in Table 3. Continuous features were analysed using a Mann-Whitney U test (see Table 4).

4. DISCUSSION AND FUTURE WORK

The association analysis compared web pages retrieved through the Google search engine with those retrieved through the HON search engine. HON sites are more likely to have an accessibility skip link (for screen readers), alt text for content images, references and a privacy policy. HON sites are less likely to have user generated content (e.g. comments), to be directly selling a product, or to use alt="" for decorative images, and they have fewer Facebook Likes.

It is surprising that HON sites are less likely to follow the accessibility guideline of using alt="" in decorative images (1-5 pixel diameters) while they are better at providing alt text for content images and accessibility skip links. This may be the result of development software or the fact that it is a less well known accessibility guideline. The feature extractor could be expanded to look for other accessibility features such as use of longdesc, use of frames, noframes support, tab indexes etc.

Researchers have long associated both donation links and advertising with low quality. This study found no significant difference between high quality (HON certified) websites and regular websites in how often either feature was present. Work is ongoing to identify a source of low quality health websites to use as a third comparison group.
Possible sources include the Advertising Standards Authority, the Trading Standards Office and phishing/malware blacklists.

The current feature extractor is a first step towards presenting individual searchers of health information with an estimate of the quality of a website’s information. The next step is to implement a machine learning algorithm that can make quality predictions based on the training dataset described above.

5. ACKNOWLEDGEMENTS

This work is supported by a Google Research Award, RCUK project EP/G066019/1 “SIDE: Inclusion through the Digital Economy” and by a Wolfson Merit Research Award WM080040.

REFERENCES

1. Bernstam, E.V., Shelton, D.M., Walji, M., and Meric-Bernstam, F. Instruments to assess the quality of health information on the World Wide Web: what can our patients actually use? International Journal of Medical Informatics 74, 1 (2005), 13-19.
2. Eysenbach, G. and Kohler, C. How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews. BMJ 324 (2002), 573-577.
3. Griffiths, K.M., Tang, T.T., Hawking, D., and Christensen, H. Automated Assessment of the Quality of Depression Websites. Journal of Medical Internet Research 7, 5 (2005).
4. Health On the Net. HONcode: Guidelines - Operational definition of the HONcode principles. 2011.
5. Pornpitakpan, C. The Persuasiveness of Source Credibility: A Critical Review of Five Decades’ Evidence. Journal of Applied Social Psychology 34, 2 (2004), 243-281.
6. Schwarz, J. and Morris, M.R. Augmenting Web Pages and Search Results to Support Credibility Assessment. CHI 2011: Session: Search & Information Seeking, (2011), 1245-1254.

Table 1. Search terms used to obtain test sites

Search Category      Health Conditions   Ageing and Geriatrics   Alternative and Natural Medicine
Search Terms Used    Cancer              Dementia                Acupuncture
                     Diabetes            Osteoporosis            Detox
                     Back Pain           Alzheimer               Aloe Vera

Table 2. Feature detection accuracy. * indicates statistical significance, P<0.05. Ø=0 (no correlation), Ø>0 (positive correlation), Ø<0 (negative correlation)

Feature                  Accuracy   Matthews correlation   Significance
Advertising              Qα=85%     Ø=0.68                 X2=16.76*
Accessibility Link       Qα=86%     Ø=0.71                 X2=18.28*
Contact Page             Qα=98%     Ø=0.80                 X2=23.29*
Postcode                 Qα=53%     Ø=0.10                 X2=0.38
Telephone                Qα=50%     Ø=0                    X2=0
Feedback Form            Qα=66%     Ø=0.35                 X2=4.57*
Reference List           Qα=82%     Ø=0.63                 X2=14.57*
HON certification        Qα=100%    Ø=1                    X2=36*
Donation Button          Qα=92%     Ø=0.82                 X2=24.08*
Privacy Policy           Qα=92%     Ø=0.68                 X2=16.75*
Disclaimer               Qα=63%     Ø=0.30                 X2=3.25
User generated content   Qα=60%     Ø=0.39                 X2=5.51*
Blog                     Qα=62%     Ø=0.48                 X2=8.23*
Selling a product        Qα=89%     Ø=0.78                 X2=22.05*

Table 3. Categorical features associated with presence in the HON search portal. * indicates statistical significance, P<0.05

Feature                  Google Number (%)   HON Number (%)   P Value
Advertising              635 (34)            116 (40)         0.167
Accessibility Link       391 (22)            98 (34)          0.000*
Contact Page             1123 (68)           215 (73)         0.075
Feedback Form            205 (11)            36 (12)          0.639
Reference List           55 (3)              46 (16)          0.000*
Donation Button          232 (13)            41 (14)          0.640
Privacy Policy           921 (52)            200 (69)         0.000*
User generated content   353 (20)            34 (12)          0.001*
Blog                     137 (8)             23 (8)           0.906
Selling a product        353 (20)            34 (12)          0.001*

Table 4. Continuous features associated with presence in the HON search portal. * indicates statistical significance, P<0.05

Feature                                        Google Median (Mean)   HON Median (Mean)   P Value
Readability                                    35.55 (33)             34.54 (33)          0.598
Proportion of decorative images with alt=""    0.5 (0.39)             0.35 (0.33)         0.001*
Proportion of content images with alt text     0.5 (0.51)             0.62 (0.59)         0.012*
Facebook Likes                                 9 (12348 α)            0 (39)              0.000*
Proportion of site links external              0.14 (0.22)            0.18 (0.26)         0.779

α The mean for this variable is very high due to extreme outliers with over 200,000 Likes each.
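The chi-squared test for independence described in Section 3.2 can be sketched for a single feature as follows. This is a minimal illustration that assumes no continuity correction is applied (the text does not say whether one was used) and that uses hypothetical counts rather than the study data. With one degree of freedom, the P value can be obtained from the complementary error function in the standard library. The function name chi_squared_2x2 is an assumption.

```python
import math


def chi_squared_2x2(present_a, total_a, present_b, total_b):
    """Chi-squared test for independence on a 2x2 contingency table.

    Group A and group B are the two sets of sites (for example Google and
    HON); "present" is the number of sites in each group with the feature.
    Returns (chi_squared, p_value) without continuity correction.
    """
    a, b = present_a, total_a - present_a  # group A: present, absent
    c, d = present_b, total_b - present_b  # group B: present, absent
    n = total_a + total_b
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical example: feature present on 20 of 100 sites in one group
# and on 40 of 100 sites in the other.
chi2, p_value = chi_squared_2x2(20, 100, 40, 100)
significant = p_value < 0.05
```

The same function could be applied row by row to a table of feature counts once the number of domains in each group is known.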