
Undercounting File Downloads from Institutional Repositories


Journal of Library Administration, 56(7), 854–874, 2016. ISSN: 0193-0826 (print), 1540-3564 (online). DOI: 10.1080/01930826.2016.1216224. Published online: 11 Oct 2016. © Patrick OBrien, Kenning Arlitsch, Leila Sterman, Jeff Mixter, Jonathan Wheeler, and Susan Borda. Published with license by Taylor & Francis.

posIT
KENNING ARLITSCH, Column Editor
Dean of the Library, Montana State University, Bozeman, MT, USA

Column Editor's Note. This JLA column posits that academic libraries and their services are dominated by information technologies, and that the success of librarians and professional staff is contingent on their ability to thrive in this technology-rich environment. The column will appear in odd-numbered issues of the journal, and will delve into all aspects of library-related information technologies and knowledge management used to connect users to information resources, including data preparation, discovery, delivery, and preservation. Prospective authors are invited to submit articles for this column to the editor at [email protected]

PATRICK OBRIEN, Semantic Web Research Director, Montana State University, Bozeman, MT, USA
KENNING ARLITSCH, Dean of the Library, Montana State University, Bozeman, MT, USA
LEILA STERMAN, Scholarly Communication Librarian, Montana State University, Bozeman, MT, USA
JEFF MIXTER, Software Engineer, OCLC Research, Dublin, OH, USA
JONATHAN WHEELER, Data Curation Librarian, University of New Mexico, NM, USA
SUSAN BORDA, Digital Technologies Development Librarian, Montana State University, Bozeman, MT, USA

Address correspondence to Patrick OBrien, Semantic Web Research Director, Montana State University, P.O. Box 173320, Bozeman, MT 59717-3320, USA. E-mail: [email protected]

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited, and is not altered, transformed, or built upon in any way.
ABSTRACT. A primary impact metric for institutional repositories (IR) is the number of file downloads, which are commonly measured through third-party Web analytics software. Google Analytics, a free service used by most academic libraries, relies on HTML page tagging to log visitor activity on Google's servers. However, Web aggregators such as Google Scholar link directly to high-value content (usually PDF files), bypassing the HTML page and failing to register these direct access events. This article presents evidence from a study of four institutions demonstrating that the majority of IR activity is not counted by page tagging Web analytics software, and proposes a practical solution for significantly improving the relevance and accuracy of IR performance metrics reported through Google Analytics.

KEYWORDS: institutional repositories, IR, digital library assessment, Web analytics, Google Analytics, log file analytics

INTRODUCTION

Institutional repositories (IR) have been under development for over fifteen years and have collectively become a significant source of scholarly content. More than 95% of the approximately 3,100 open access repositories listed in OpenDOAR are affiliated with academic institutions or research disciplines (University of Nottingham, 2016), and these repositories can add value to the research process and the reputations of institutions and their faculty. The value proposition that justifies the expense of building and maintaining open access IR is based largely on unrestricted access to their content, and on the ability of IR managers and library administrators to report impact to researchers and university administrators. Ultimately, citations may be the most valued measure of reuse and worth, and it is reasonable to expect publications to be downloaded and read before being cited. Using file download counts as a metric for scholarly value is therefore crucial for IR assessment, but it is a surprisingly difficult metric to measure accurately, due to the deficiencies of Web analytics tools and to overwhelming non-human (robot) traffic.

The scholarly information-gathering process includes a filtering approach (Acharya, 2015) through which the researcher eventually arrives at citable scholarly content. Measurable human interaction with IR can be said to include page views or downloads in three categories:

1. Ancillary Pages—IR HTML pages that provide general information or navigation paths through the IR. Examples include search results; browse pages organized by author, title, or community; statistics pages; etc.
2. Item Summary Pages—IR HTML pages that typically include an abstract and metadata for a single scholarly work, which can help the user decide to download the full publication.
3. Citable Content Downloads—scholarly content that may be formally cited in the research process. These include publications, presentations, data sets, etc., accessed in a non-HTML format (e.g., .pdf, .doc, .ppt).

Current assessment practices have deficiencies that result in serious undercounting of total IR activity, leaving IR managers and stakeholders unable to accurately report on file downloads. This study examined data from four repositories: three running the DSpace platform and one running CONTENTdm. Evidence gathered from these four IR shows that as much as 58% of all human-generated IR activity goes unreported by Google Analytics, the Web analytics service used most frequently in academic libraries to measure use. The Research Methods and Findings sections demonstrate a pragmatic framework for reporting meaningful IR performance metrics.
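To make the three categories above concrete, the short sketch below classifies repository URL paths by pattern. The patterns are hypothetical, loosely modeled on DSpace-style handle and bitstream paths, and are offered only as an illustration; they are not the classification rules used in this study.

```python
import re

# Hypothetical DSpace-style URL patterns; real repositories will differ.
ITEM_SUMMARY = re.compile(r"^/(?:xmlui/)?handle/\d+/\d+/?$")  # abstract + metadata page
CITABLE_FILE = re.compile(r"^/(?:xmlui/)?bitstream/.+\.(pdf|docx?|pptx?|csv|xlsx?)$", re.IGNORECASE)

def classify(path):
    """Assign an IR URL path to one of the three activity categories."""
    if CITABLE_FILE.match(path):
        return "Citable Content Download"
    if ITEM_SUMMARY.match(path):
        return "Item Summary Page"
    return "Ancillary Page"  # search, browse, community, statistics pages, etc.

print(classify("/handle/1/9943"))                       # Item Summary Page
print(classify("/bitstream/handle/1/9943/thesis.pdf"))  # Citable Content Download
print(classify("/browse?type=author"))                  # Ancillary Page
```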
The data set that supports this study is available from Montana State University ScholarWorks at http://doi.org/10.15788/M2Z59N.

RESEARCH STATEMENT

While it is possible to accurately report the first two metrics categories (Ancillary Page Views and Item Summary Page Views), Citable Content Download metrics are very difficult to report accurately. Most libraries lack the technical sophistication and resources, within their chosen Web analytics methods, to identify and exclude all robot activity and to capture and report downloads generated from direct links. Evidence presented in this study will support the following statements:

• Ancillary Page Views comprise a large portion of the total IR activity being reported.
• Citable Content Downloads represent high-value traffic, but most are unreported by Google Analytics.
• Non-human robot activity overwhelms human activity and is too difficult to consistently filter from Web analytics reports.

LITERATURE REVIEW

While IR content was initially defined as scholarly (Crow, 2002), some collection development policies now define the scope of IR more broadly and include institutional records and other digitized materials. This study focuses on scholarly content within IR.

Assessment of Institutional Repositories

IR assessment is acknowledged as a necessity in numerous articles in the professional literature, and is sometimes even tied to repositories' ultimate survival: "Without understanding the significance of this service, the value of such programs may be underestimated and, consequently, funds to ensure IR survival and growth may dwindle" (Burns, Lana, & Budd, 2013). Researchers acknowledge that specific forms of measure must vary based on local needs and audience, and some assessors of IR success place less emphasis on hard metrics, noting instead that IR managers may measure their success in the comprehensiveness and growth of their repositories, giving credence to downloads only insofar as they show general "use" (Cullen & Chawner, 2010).

Most of the literature about IR assessment does focus on collecting and reporting quantitative metrics to help make the case for IR value: "Metrics for repositories can be used to provide a better understanding of how repositories are being used, which can help to inform policy decisions on future investment" (Kelly et al., 2012). A 2011 study of several high-profile IR reported that "assessment measures are still being developed," but that "most institutions found it easier to develop quantitative measures of success [including] the number of requests" (Campbell-Meier, 2011). Others also reinforce that specific measures based on quantifiable data will resonate, even if those reports must be customized to the audience: "By providing useful and appropriate statistics to authors, departments, the university, and other stakeholders, the library demonstrates its value as a vital partner in research, scholarship, and scholarly communication" (Bruns & Inefuku, 2016). Bruns and Inefuku count item downloads among a number of metrics that should be assembled based on institutional mission and audience. Lagzian et al. list the ability of the system to make "available the number of downloads and views of full text files" as one of the top critical success factors for IR (Lagzian, Abrizah, & Wee, 2015).
Despite the recognition that quantifiable metrics, including downloads, are useful, there is evidence that data and reporting abilities for IR are lacking: "While libraries determine the most appropriate benchmark for success for their respective IRs, the need for more precise usage data will be central to assessment efforts" (Fralinger & Bull, 2013). The message conveying the importance of assessment is also not necessarily widely accepted. In a 2013 publication that surveyed the usage of U.S.-based IR by international audiences, Fralinger and Bull report that many IR administrators "seem to be unaware, apathetic, or unprepared to do IR assessment, specifically from an international perspective" (Fralinger & Bull, 2013).

Some prior research has already pointed out that discrepancies exist in download reports when search engines send users directly to the file. A 2006 study at the University of Wollongong, Australia, noted that its Digital Commons repository statistics suggested that "users accessing [the IR] from Google are in the majority of cases going straight to the document pdf, rather than to the cover page" (Organ, 2006).

Overview of Web Analytics Methods

Reporting visitation and use of Web sites and digital repositories is made possible through the use of Web analytics software, which may be divided into two classes: (1) page tagging and (2) log file. Brief descriptions of these types follow, but more in-depth analyses are available in other studies (Clifton, 2012; Fagan, 2014; Jansen, 2006; Nakatani & Chuang, 2011).

PAGE TAGGING ANALYTICS

This class of analytics software is typically delivered as Software as a Service (SaaS), where the software package usually resides on the vendor's servers. Popular page tagging software includes free packages such as Google Analytics and costly options such as WebTrends and Adobe Marketing Cloud. Page tagging analytics relies on a piece of tracking code (usually JavaScript) that is embedded on each HTML page of the Web site in question. The tracking code is keyed to the account holder and acts as a beacon to the software package on the vendor's servers. A display of the HTML page triggers a signal from the tracking code to the software package, where the visit is registered along with various other pieces of information that can include the referral site, search terms, user's geographical location, type of device, etc.

LOG FILE ANALYTICS

Log file analytics software provides reports on the data normally collected by server logs. This type of software is typically installed and managed locally by server administrators, and "web log analysis software ... then can be used to analyze the log file" (Nakatani & Chuang, 2011). Log file analytics software includes the packages built into DSpace and ePrints, as well as other packages such as WebLog Expert.

Which Type Is Better?

Both classes of analytics software have strengths and weaknesses. Correctly configured, page tagging analytics software can provide a holistic view of all the organization's Web properties, including the ability to see the paths that users follow through a domain. Sophisticated reports can be generated using tools built into the software, and in a SaaS environment there is no need for local updates or maintenance of the software itself. Log file analytics can provide very granular information about IR activity, but since the software is managed locally it can impose a small administrative burden. Log file analysis can be difficult to configure if aggregating data from more than one physical Web server; manual compilation of reports is required when multiple servers comprise the Web site of an organization.
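As a rough illustration of what locally managed log file analysis involves, the sketch below tallies non-HTML downloads from an Apache-style combined access log and screens out requests whose user agent matches a short list of self-identified crawlers. The log path, file-extension list, and bot signatures are assumptions for the example, not the rules used by any particular analytics package, and matching user agents catches only bots that identify themselves, a limitation discussed further below.

```python
import re
from collections import Counter

# Apache "combined" log format:
# host ident user [date] "METHOD path HTTP/x.x" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+) [^"]+" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)
KNOWN_BOTS = re.compile(r"googlebot|bingbot|slurp|baiduspider|crawler|spider", re.IGNORECASE)
CITABLE_EXT = re.compile(r"\.(pdf|docx?|pptx?|csv|xlsx?)(\?.*)?$", re.IGNORECASE)

downloads = Counter()
with open("access.log") as log:  # hypothetical log file path
    for line in log:
        match = LOG_LINE.match(line)
        if not match:
            continue
        host, path, status, agent = match.groups()
        if status != "200" or KNOWN_BOTS.search(agent):
            continue  # skip errors and crawlers that announce themselves
        if CITABLE_EXT.search(path):
            downloads[path] += 1

for path, count in downloads.most_common(10):
    print(count, path)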
On the other hand, a distinct advantage of log file analysis is that user and institutional information is not shared with a third party. Over the past few years, analytics plug-ins have been developed for popular file index stacks, such as Solr and Elasticsearch.

Both page tagging and log file analytics carry significant risks of inaccurate reporting of IR activity (see Figure 1). Page tagging analytics software carries a high risk of undercounting non-HTML file downloads, particularly when users are referred directly to the file from an external source (see Figure 2). Log file analytics software, on the other hand, carries a high risk of over-counting due to the dynamic "cat and mouse" game required to identify and filter bots, crawlers, and scrapers. In Web sites that see fewer than 10,000 visitors per day, it is estimated that less than 30% of online traffic is human-initiated (Zeifman, 2015). Paradoxically, log files can also sometimes underestimate activity due to proxy and browser caching (Ferrini & Mohr, 2009). Although Web analytics can help report IR activity, there is a significant amount of academic paper sharing that may never be tracked. It is nearly impossible, within the varied scholarly communication ecosystem, to capture all the interactions that exist with any given paper.

FIGURE 1. Risks associated with each type of Web analytics method.

FIGURE 2. Page tagging analytics does not track Citable Content Downloads.

The Problem of Robots

Undercounting is the basis for this research, but over-counting non-human activity is also a concern. Log file analytics store every request for every page and file. Robots (bots) create a large bias in log-based analytics because they account for almost 50% of all Internet traffic (Zeifman, 2015) and over 85% of IR downloads (Information Power Ltd, 2013). While DSpace has a bot filtering feature, it only addresses known bots, which fall under the "good bots" category and include crawlers from Google or Bing whose job it is to index IR content. "Bad bots," on the other hand, are used for malicious purposes, such as probing for server vulnerabilities that can be used to infect visitors, generate SEO referral spam, or harvest the entire IR content to generate traffic on other Web sites. While "good bots" are easily detected and screened from reports, they account for only 40% of total bot activity. Log-based analytics methods have difficulty effectively identifying and excluding the "bad bot" activity that accounts for the other 60% of total bot activity (Zeifman, 2015).

The problem of bots skewing reports has led to the development of "more sophisticated—but practical—algorithms to improve filtering that will eventually become incorporated into the COUNTER standard" and will be used to help measure use and impact of IR (MacIntyre & Jones, 2016). However, until these sophisticated solutions are available, using Google Search Console Clicks and Google Analytics Events, as described in the Research Methods section, may be the most accurate way of reporting IR downloads.

GOOGLE ANALYTICS

Although some researchers argue that Google Analytics is inappropriate for educational use, since it was built for e-commerce rather than an educational environment (Dragos, 2011), our related research has shown that the majority of academic libraries still use this tool. In a study on privacy that will be published in 2017, we found the presence of Google Analytics tracking code in over 80% of the 263 academic libraries we surveyed.
Outside the realm of academic libraries, it has been reported that "more than 60% of all websites on the Internet use Google Analytics, Google AdSense or another Google product using tracking beacons" (Hornbaker & Merity, 2013; Piwik development team, 2016). Google Analytics provides a very accurate metric for determining the number of HTML pages viewed by humans. Most bots are incapable of running the Google Analytics JavaScript tracking code needed to register a page view. Google's primary business model is digital advertising, and companies use its analytics software to maximize e-commerce profit. As a result, Google has a vested interest in ensuring that only human activity is tracked and reported. Given this emphasis, it is not worth the effort for libraries to spend money or staff time meticulously eliminating bot activity through their own local systems. The tools and infrastructure provided by Google Analytics and the Google Search Console API are the most cost-effective, although they come with legitimate privacy concerns.

The standard configuration of Google Analytics provides statistics on HTML page views, but an additional configuration called "event tracking" (Bragg et al., 2015) is required to track the non-HTML Citable Content Downloads that comprise the bulk of citable IR content. Other researchers have previously noted the difficulty of tracking non-HTML downloads in Google Analytics: "Without implementing event tracking, Google Analytics has no way to track these [PDF] downloads, and the data will not be included in any reports" (Farney & McHale, 2013), and "direct downloads of PDFs hosted in repositories may not be reported unless Google Analytics has been configured appropriately," resulting in underestimates (Kelly, 2012). In a related study, Kelly and coauthors describe applying GA tracking code to the download link on the HTML page (Kelly et al., 2012), but this method does not address visitors who arrive directly at the PDF (see Figure 2). Burns, Lana, and Budd also write that "web log analytics may under report IR use" and refer to a DSpace solution developed at the University of Edinburgh, which "created a redirect so a user's click on a PDF link in a Google search results list will take the user to the file's item page in the IR, rather than directly to the file" (Burns et al., 2013). Claire Knowles' slide presentation shows large increases in download statistics once the redirect was put into place in December 2011 (Knowles, 2012). However, this solution does not seem to have gained widespread traction, and it is unlikely that Google and Google Scholar would look favorably upon a redirect when those search engines offer direct links to the files.

Similarly, DSpace's current 5.x implementation has a feature to track download events within Google Analytics. However, after reviewing the DSpace code we determined that the current method relies on the Google Analytics API—not the Google Search Console API—and is thus limited to tracking file downloads that originate from a DSpace HTML page. We also confirmed this by isolating high-use non-HTML files and comparing Google Analytics Download Events with Google Search Console Clicks.
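A comparison of that kind can be approximated with a simple join of two per-URL exports. The sketch below assumes one CSV of Google Analytics download events and one CSV of Search Console clicks; the file names and column names are placeholders rather than the actual exports used in this study.

```python
import pandas as pd

# Hypothetical per-URL exports: ga_download_events.csv (url, events) from the
# Google Analytics Events report, gsc_clicks.csv (url, clicks) from Search Console.
ga_events = pd.read_csv("ga_download_events.csv")
gsc_clicks = pd.read_csv("gsc_clicks.csv")

merged = gsc_clicks.merge(ga_events, on="url", how="left").fillna({"events": 0})
merged["difference"] = merged["clicks"] - merged["events"]

# Files that receive many clicks from Google search results but register few or no
# GA download events show how much direct-to-file traffic page tagging never sees.
print(merged.sort_values("difference", ascending=False).head(10))
```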
Finally, it should be noted that vocabulary plays a role in measuring and communicating impact through Web analytics. The library science profession has long referred to digital objects (such as PDF files in an IR) as "items" (Lagoze, Payette, Shin, & Wilper, 2006; Tansley et al., 2003), while Google Analytics calls all HTML pages, including those that contain abstracts and metadata, "items." This can cause confusion when communicating impact. As we noted at the start, we refer to pages containing only metadata and abstracts as Item Summary Pages; while technically they contain all the information required for citation, one hopes that a scholarly citation would only result from the download and reading of the full publication (i.e., what we call Citable Content Downloads).

RESEARCH METHODS

The data set for this study was collected from four institutional repositories, whose activity was monitored over a 134-day period during the spring 2016 academic semester:

• LoboVault—University of New Mexico—https://repository.unm.edu
• MacSphere—McMaster University—https://macsphere.mcmaster.ca
• ScholarWorks—Montana State University—http://scholarworks.montana.edu
• USpace—University of Utah—http://uspace.utah.edu

The first three repositories run the DSpace platform, while USpace at the University of Utah runs CONTENTdm. The University of New Mexico is in the process of migrating to a Digital Commons platform and expects to go live before the end of summer 2016. Data for this study were collected from UNM's DSpace platform.

Tools

Data were gathered and compared using a number of tools and configurations. Google Analytics, deployed in conjunction with the Google Search Console (previously known as Webmaster Tools), was used to help compile activity, and DSpace usage statistics and Solr stats were also utilized. The following list shows which specific activity was pulled from each tool:

1. Google Analytics
   a. Page Views
   b. Events
2. Google Search Console API
   a. Clicks
3. DSpace
   a. Google Analytics Statistics
   b. Usage Statistics
   c. Solr Stats
   d. Solr Item Metadata

GOOGLE ANALYTICS

In order to exclude activity that does not support the mission of open access IR and to identify search referral details, we applied IP address filters that excluded library staff activity, added the Organic Search Source setting to identify referral detail about Google, Google Scholar, and Google Image, and enabled bot filtering. The Google Analytics reports used for this study are listed below (PV = Page Views).

1. Total IR HTML PV—estimated using the Google Analytics > Behavior > Site Content > All Pages report. Reported PV were used in lieu of Unique PV to ensure our study findings were conservative (Google, Inc., 2016b).
2. Total IR Item Summary PV—estimated by refining the report listed in #1 with an Advanced Filter using Regular Expressions. The Regular Expressions were developed to exclude any activity involving HTML Ancillary Pages for the unique configuration of each IR.
3. Ancillary PV—estimated as Total IR HTML PV less Total IR Item Summary PV.
4. Download Events—estimated using the Google Analytics > Behavior > Events > Pages report. Note: only the Montana State University and University of Utah IR had configured their IR software and Google Analytics to track Events, as will be seen in Table 1.

GOOGLE SEARCH CONSOLE API

Google Search Console provides the count of human clicks that each URL receives from the search results pages (SERP) of Google's search properties (Google, Inc., 2016a). However, account holders can only access the last 90 days of visitation data via the Google Search Console interface. To accumulate persistent data for our study, we used Python scripts to access the Google Search Console API and extract the URLs that received one or more clicks each day. We then applied the Regular Expressions developed for estimating Total IR Item Summary PV, above, and added rules to include only URLs for non-HTML files in this estimate. This method allowed us to retrieve every record for every day from Google Search Console, with no limitations. In brief, we were able to extract a persistent dataset with the granular detail required for this study.
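A minimal sketch of this kind of extraction is shown below, using the Search Console (Webmasters v3) Search Analytics API via the google-api-python-client library. The credential file, site URL, date, row limit, and file-extension pattern are placeholders and simplifications, not the authors' actual scripts.

```python
import re
from google.oauth2 import service_account
from googleapiclient.discovery import build  # pip install google-api-python-client

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
SITE = "http://scholarworks.montana.edu/"  # a property verified in Search Console
CITABLE = re.compile(r"\.(pdf|docx?|pptx?|csv|xlsx?)$", re.IGNORECASE)  # non-HTML files only

credentials = service_account.Credentials.from_service_account_file("key.json", scopes=SCOPES)
webmasters = build("webmasters", "v3", credentials=credentials)

def clicks_for_day(day):
    """Return {url: clicks} for every URL that received at least one click on `day`."""
    body = {"startDate": day, "endDate": day, "dimensions": ["page"], "rowLimit": 5000}
    response = webmasters.searchanalytics().query(siteUrl=SITE, body=body).execute()
    return {row["keys"][0]: row["clicks"] for row in response.get("rows", [])}

# Pull one day at a time so the data persist beyond the 90-day window, then keep
# only direct links to citable (non-HTML) files.
daily = clicks_for_day("2016-01-05")
citable = {url: clicks for url, clicks in daily.items() if CITABLE.search(url)}
print(sum(citable.values()), "citable content downloads on 2016-01-05")
```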
DSPACE

A previous researcher had asserted that the reported DSpace statistics may be biased by as much as 85% over-counting due to bot activity (Greene, 2016). We corroborated this claim by temporarily bypassing the local host restriction (Masár, 2015) and acquiring the detailed download records from the Solr statistics core. These records were then joined with the metadata records from the Solr search core (Diggory & Luyten, 2015) in the Montana State and University of New Mexico IR. We also tried to analyze the DSpace Google Analytics Statistics feature, but learned of a current bug (Dietz, 2015) in DSpace 5.x that prevented DSpace from generating those statistics for our study participants' IR Items.

Data Set

The resulting dataset (OBrien et al., 2016) contains over 57,087 unique URLs in 413,786 records that received one or more human clicks via Google SERP from January 5, 2016 to May 17, 2016 (134 days). Using the Google Search Console API, we were able to determine the total number of invisible Citable Content Downloads (item downloads that did not originate from the IR Web site and were not reported in Google Analytics—see Figure 2). After aggregating these data, a regular expression was used to exclude URLs containing non-scholarly material, such as collection landing pages. This resulted in a set of data that could be used to determine the number of non-HTML files (.pdf, .jpeg, MS Word documents, MS PowerPoint, .txt datasets, MS Excel files, etc.) that were directly downloaded from Google SERP.

TABLE 1. Total activity from four IR during the 134-day test period.

IR | Item Summary PV | Ancillary PV | Download Events | Citable Content Downloads | Total Google Analytics HTML PV
scholarworks.montana.edu | 26,735 | 23,350 | 7,129 | 77,380 | 50,085
macsphere.mcmaster.ca | 51,150 | 71,585 | n/a | 133,342 | 122,735
repository.unm.edu | 83,491 | 59,289 | n/a | 166,320 | 142,780
content.lib.utah.edu | 122,927 | 47,569 | 19,226 | 159,536 | 170,496

Limitations

This study involves only four repositories, although the compiled data set includes over 400,000 records for more than 57,000 URLs. The data were gathered during the height of the spring semester when classes were in session, a time during which use of the IR at the four universities should have been high. However, it could be argued that the fall semester might have garnered more activity, and ideally an entire year of data would be collected and analyzed for a larger number of IR. As with any study that gathers data from a dynamic environment, the data should be considered a snapshot in time.

Another limitation is that only two repository software platforms (DSpace and CONTENTdm) are represented in this study. DSpace is by far the most widely used IR software in the world and its selection is justifiable on that basis.
While CONTENTdm has seen broad adoption in cultural heritage digital libraries, it is not very widely used as an IR platform. However, gathering data from IR is contingent on relationships that provide a specific level of access, and the CONTENTdm repository was another data set to which the authors had access. A larger study should ideally include other platforms, such as Digital Commons and ePrints.

Finally, Google Search Console brings value to this study by helping to include tracking of non-HTML downloads. However, its ability to count downloads is limited to clicks that originate from other Google properties, and therefore some number of direct downloads from non-Google properties have been missed in this study.

Findings

The total IR activity from the four repositories that we can report with a high level of confidence and accuracy was calculated by combining Google Analytics Page Views, Google Search Console Clicks, and Google Analytics Events. Evidence gathered from the IR in this study shows that as much as 58% of all human-generated IR activity goes unreported by Google Analytics, the Web analytics tool used most frequently in academic libraries to measure use. Table 1 shows the total IR activity that was collected from the four repositories during the 134-day data-collection period.

HTML Item Summary Page Views and Ancillary Page Views combined to provide the total number that Google Analytics was able to report for each of the four repositories (see Figure 3). Only Montana State University and the University of Utah had configured their IR and Google Analytics to report Download Events, which explains why there are no data for the McMaster University and University of New Mexico repositories. The figures for Citable Content Downloads (see Table 1) were extracted from the Google Search Console API. These downloads were in addition to the Total Google Analytics HTML PV figures, demonstrating rather dramatically how much high-value activity Google Analytics is unable to capture.

FIGURE 3. Chart representation of total Google Analytics HTML page views from the four repositories tracked for the study.

Figure 4 shows the percentage of page views reported via Google Analytics for the four repositories, categorized as Ancillary Page Views and Item Summary Page Views. The range of Ancillary Page Views across the four repositories was 28%–58%, for a weighted average of 41.51% Ancillary PV. As explained earlier, Ancillary Pages are the low-value HTML pages that provide general information or navigation paths through the IR, while the Item Summary Pages contain abstracts and metadata for a single scholarly work.

FIGURE 4. Percent of Item Summary and Ancillary Page Views (PV).

Figure 5 shows the total IR activity that our study identified via Google Analytics and Google Search Console. Between 46% and 58% of the activity was invisible via Google Analytics, with a weighted average of 51.1% of IR activity being invisible without the use of Google Search Console Clicks.

FIGURE 5. Unreported IR activity in Google Analytics.

DISCUSSION

The true value of an IR is contained within its research papers and data sets, and we refer to measurable interaction with these files as Citable Content Downloads. Our study demonstrates that the most popular analytics methods miss or underreport this most important metric of IR activity. The analytics reporting methods we introduced in this study provide a framework for more accurate measurement of IR activity.
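For reference, the headline percentages reported in the Findings can be recomputed directly from Table 1 by treating total IR activity as the sum of Google Analytics HTML page views, Download Events (with "n/a" counted as zero), and Search Console Citable Content Downloads; the short sketch below reproduces the 51.1% weighted average and, after rounding, the 46%–58% range.

```python
# Table 1 values: (Item Summary PV, Ancillary PV, Download Events, Citable Content Downloads)
table1 = {
    "scholarworks.montana.edu": (26735, 23350, 7129, 77380),
    "macsphere.mcmaster.ca":    (51150, 71585, 0, 133342),    # Download Events n/a -> 0
    "repository.unm.edu":       (83491, 59289, 0, 166320),    # Download Events n/a -> 0
    "content.lib.utah.edu":     (122927, 47569, 19226, 159536),
}

grand_total = grand_invisible = 0
for ir, (summary_pv, ancillary_pv, events, gsc_clicks) in table1.items():
    total = summary_pv + ancillary_pv + events + gsc_clicks  # total measurable activity
    invisible = gsc_clicks / total                           # downloads Google Analytics never saw
    grand_total += total
    grand_invisible += gsc_clicks
    print(f"{ir}: {invisible:.1%} of activity invisible to Google Analytics")

# Prints roughly 45.7%, 52.1%, 53.8%, and 57.5% per repository; the article reports
# this range rounded as 46%-58%, with the weighted average below.
print(f"weighted average: {grand_invisible / grand_total:.1%}")  # 51.1%
```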
There are several paths that a user may take to reach citable content, which is most often a PDF or other type of non-HTML file. A visit may be directed from a known link, from a Web search service, or by browsing the IR itself. In the case of the first two paths the user is often linked directly to the content, bypassing the HTML pages of the repository and arriving immediately at the desired file. The third path involves direct use of the IR Web site, through which the user may eventually land on the HTML Item Summary Page (containing abstract and metadata), and from there s/he may click on the link on that page that downloads the non-HTML file. There is no guarantee that a user will open the file after it has been downloaded, but s/he certainly could not read the publication or work with the data set prior to download. Together, Item Summary Page Views (collected through Google Analytics) and Citable Content Downloads (collected using the Google Search Console API) are excellent indicators of IR impact that may predict eventual citation activity.

Data from this study show enormous numbers of high-value Citable Content Downloads being missed by Google Analytics. For example, the data in Table 1 show that Google Analytics failed to report 100% of Citable Content Downloads at McMaster University and the University of New Mexico, while 91.6% were missed at Montana State University and 89.2% at the University of Utah. In addition to these large reporting errors, there are still Citable Content Downloads being missed due to their origination outside the Google ecosystem.

We believe that our tested methods produce a highly accurate picture of IR activity that is as granular as the IR activity reported in server logs; however, our methods also have the added convenience of being pre-filtered for bots and other non-human activity. More research and analysis are required to determine exactly how much human activity goes unaccounted for, but our preliminary estimates indicate that the activity we cannot accurately measure is small and may have little effect on reporting.

Event tracking is an added configuration that should ideally be implemented in Google Analytics, but its effect may be overrated. Our analysis showed that event tracking accounted for only 8%–11% of total Citable Content Downloads in our study. Adding Google Search Console results improves accuracy by providing the number of non-HTML downloads, but those represent only clicks that originated from a Google search property. Therefore, this study still may be missing a significant number of Citable Content Downloads originating from other sources that bypass HTML pages in the IR. These other sources may include Bing, Yahoo!, and Wikipedia; social media sites such as Facebook, Twitter, Reddit, and CiteULike; professional and academic sites such as LinkedIn, ResearchGate, Academia.edu, and Mendeley; and direct email referrals.

Publishers have similar data capture limitations beyond their journal Web pages, and they track use through services like DataCite that produce and monitor Digital Object Identifiers (DOI) (Paskin, 2000). Publishers may go one step further by working through controlled services like ReadCube (Goncharoff, 2014), but that practice limits access to devices that are set up for the ReadCube application and runs counter to the IR mission of facilitating access to scholarship rather than limiting it.
The source of a Web referral to an IR matters. From an institutional perspective, a visitor referred by Google Scholar carries greater value than one referred by Google, Facebook, or Yahoo!. Google Scholar users primarily represent researchers seeking scholarly publications, and they are more likely to download IR files and use them to support their own research. This is a high-value audience that should be of great interest to IR managers. Metadata and site crawling problems have historically prevented many IR from establishing the kind of robust relationship with Google Scholar that facilitates consistent and accurate harvesting of publications (Arlitsch & O'Brien, 2012). But data collected in the current study indicate that repositories indexed by Google Scholar receive 48%–66% of their referrals from Google Scholar. These figures are significant and imply that a good relationship with Google Scholar is worthwhile, as it leads to considerable high-quality interactions.

CONCLUSION

Most IR managers are only able to measure a small part of the high-value traffic to their IR; much of what they see is indicative of visits to the site's low-value HTML pages rather than a measure of citable content downloads. The standard configuration of the most widely used analytics service in academic libraries, Google Analytics, fails to capture the vast majority of non-HTML Citable Content Downloads from IR. This seriously limits the effectiveness of IR managers and library administrators when they try to make the case for IR usefulness and impact.

Current mechanisms for collecting accurate analytics are limited and further study is warranted, but the methods tested in this study are promising. Some repository platform developers claim to address analytics data collection and reporting, but it can be difficult to know exactly which techniques are being applied and how effective they are, particularly in a proprietary environment. For example, simply using a bot list is not good enough, because "bad bots" are constantly changing to avoid detection and what worked yesterday may not work tomorrow.

We have the potential to know so much more about the movement and use of research than we did when bound paper copies circulated from office to office, but our knowledge will only increase if the tools are set up appropriately and calibrated for the task. Trying to capture currently invisible activity is a large endeavor, but one that will help us make a stronger use case for IR, better understand IR user needs, and continue to improve access to research. The ability to report downloads can be a powerful tool to help faculty engage with IR. Citations may take years to appear in the literature, but repository downloads act as a proxy measure, giving the IR manager a more immediate understanding of citable content use.

ACKNOWLEDGEMENT

The authors wish to express their gratitude to the Institute of Museum and Library Services, which funded this research (Arlitsch et al., 2014).

REFERENCES

Acharya, A. (2015, September). What happens when your library is worldwide and all articles are easy to find? Videorecording presented at the Association of Learned and Professional Society Publishers, London. Retrieved from https://www.youtube.com/watch?v=S-f9MjQjLsk
Arlitsch, K., OBrien, P., Kyrillidou, M., Clark, J. A., Young, S. W. H., Mixter, J., . . . Stewart, C. (2014). Measuring up: Assessing accuracy of reported use and impact of digital repositories (funded grant proposal, pp. 1–10). Institute of Museum and Library Services. Retrieved from http://scholarworks.montana.edu/xmlui/handle/1/8924
Arlitsch, K., & O'Brien, P. S. (2012). Invisible institutional repositories: Addressing the low indexing ratios of IRs in Google Scholar. Library Hi Tech, 30(1), 60–81. http://doi.org/10.1108/07378831211213210
Bragg, M., Chapmen, J., DeRidder, J., Johnston, R., Junus, R., Kyrillidou, M., & Stedfeld, E. (2015). Best practices for Google Analytics in digital libraries. Digital Library Federation. Retrieved from https://docs.google.com/document/d/1QmiLJEZXGAY-s7BG_nyF6EUAqcyH0mhQ7j2VPpLpxCQ/edit
Bruns, T., & Inefuku, H. W. (2016). Purposeful metrics: Matching institutional repository metrics to purpose and audience. In Making institutional repositories work (pp. 213–234). West Lafayette, IN: Purdue University Press. Retrieved from http://lib.dr.iastate.edu/digirep_pubs/4/
Burns, C. S., Lana, A., & Budd, J. M. (2013). Institutional repositories: Exploration of costs and value. D-Lib Magazine, 19(1/2). http://doi.org/10.1045/january2013-burns
Campbell-Meier, J. (2011). A framework for institutional repository development. In D. E. Williams & J. Golden (Eds.), Advances in library administration and organization (Vol. 30, pp. 151–185). Emerald Group Publishing Limited. Retrieved from http://www.emeraldinsight.com/doi/abs/10.1108/S07320671%282011%290000030006
Clifton, B. (2012). Advanced web metrics with Google Analytics (3rd ed.). Indianapolis, IN: Wiley.
Crow, R. (2002). The case for institutional repositories: A SPARC position paper (ARL Bimonthly Report No. 223). Retrieved from http://works.bepress.com/ir_research/7
Cullen, R., & Chawner, B. (2010). Institutional repositories: Assessing their value to the academic community. Performance Measurement and Metrics, 11(2), 131–147. http://doi.org/10.1108/14678041011064052
Dietz, P. (2015, November 8). Google Analytics Statistics not relating parent comm/coll to bitstream download. DuraSpace. Retrieved from https://jira.duraspace.org/browse/DS-2899
Diggory, M., & Luyten, B. (2015, August 21). SOLR Statistics—DSpace 5.x documentation—DuraSpace Wiki. Retrieved July 1, 2016, from https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics#SOLRStatisticsWebUserInterfaceElements
Dragos, S.-M. (2011). Why Google Analytics cannot be used for educational web content. In Next Generation Web Services Practices (NWeSP) (pp. 113–118). Salamanca: IEEE. http://doi.org/10.1109/NWeSP.2011.6088162
Fagan, J. C. (2014). The suitability of web analytics key performance indicators in the academic library environment. The Journal of Academic Librarianship, 40(1), 25–34. http://doi.org/10.1016/j.acalib.2013.06.005
Farney, T., & McHale, N. (2013). Maximizing Google Analytics: Six high-impact practices (Vol. 49). Chicago, IL: ALA TechSource.
Ferrini, A., & Mohr, J. J. (2009). Uses, limitations, and trends in web analytics. In Handbook of research on web log analysis (pp. 122–140). IGI Global. Retrieved from http://www.igi-global.com/chapter/uses-limitations-trends-web-analytics/21999
Fralinger, L., & Bull, J. (2013). Measuring the international usage of US institutional repositories. OCLC Systems & Services: International Digital Library Perspectives, 29(3), 134–150. http://doi.org/10.1108/OCLC-10-2012-0039
Goncharoff, N. (2014, December 10). Clearing up misperceptions about Nature.com content sharing [News blog]. Retrieved July 7, 2016, from https://www.digital-science.com/blog/news/clearing-up-misperceptions-about-nature-com-content-sharing/
Google, Inc. (2016a). Search Analytics report. Retrieved April 1, 2016, from https://support.google.com/webmasters/answer/6155685
Google, Inc. (2016b). The difference between AdWords clicks, and sessions, users, entrances, pageviews, and unique pageviews in Analytics. Google Analytics Help. Retrieved from https://support.google.com/analytics/answer/1257084#pageviews_vs_unique_views
Greene, J. (2016). Web robot detection in scholarly open access institutional repositories. Library Hi Tech, 34(3). Retrieved from http://hdl.handle.net/10197/7682
Hornbaker, C., & Merity, S. (2013). Measuring the impact of Google Analytics: Efficiently trackling Common Crawl using MapReduce & Amazon EC2. Retrieved from http://smerity.com/cs205_ga/
Information Power Ltd. (2013). IRUS download data—identifying unusual usage (IRUS download report). Retrieved from http://www.irus.mimas.ac.uk/news/IRUS_download_data_Final_report.pdf
Jansen, B. J. (2006). Search log analysis: What it is, what's been done, how to do it. Library & Information Science Research, 28(3), 407–432. http://doi.org/10.1016/j.lisr.2006.06.005
Kelly, B. (2012, August 29). MajesticSEO analysis of Russell Group University repositories. Retrieved from http://ukwebfocus.com/2012/08/29/majesticseo-analysis-of-russell-group-university-repositories/
Kelly, B., Sheppard, N., Delasalle, J., Dewey, M., Stephens, O., Johnson, G., & Taylor, S. (2012). Open metrics for open repositories. In OR2012: The 7th International Conference on Open Repositories. Edinburgh, Scotland. Retrieved from http://opus.bath.ac.uk/30226/
Knowles, C. (2012). Surfacing Google Analytics in DSpace. Presented at Open Repositories 2012, Edinburgh, Scotland. Retrieved from http://or2012.ed.ac.uk/?s=Knowles&searchsubmit=
Lagoze, C., Payette, S., Shin, E., & Wilper, C. (2006). Fedora: An architecture for complex objects and their relationships. International Journal on Digital Libraries, 6(2), 124–138. http://doi.org/10.1007/s00799-005-0130-3
Lagzian, F., Abrizah, A., & Wee, M. C. (2015). Critical success factors for institutional repositories implementation. The Electronic Library, 33(2), 196–209. http://doi.org/10.1108/EL-04-2013-0058
MacIntyre, R., & Jones, H. (2016). IRUS-UK: Improving understanding of the value and impact of institutional repositories. The Serials Librarian, 70(1/4), 100–105. http://doi.org/10.1080/0361526X.2016.1148423
Masár, I. (2015, December 11). Solr—DSpace—DuraSpace Wiki. Retrieved July 1, 2016, from https://wiki.duraspace.org/display/DSPACE/Solr#SolrBypassinglocalhostrestrictiontemporarily
Nakatani, K., & Chuang, T. (2011). A web analytics tool selection method: An analytical hierarchy process approach. Internet Research, 21(2), 171–186. http://doi.org/10.1108/10662241111123757
OBrien, P., Arlitsch, K., Sterman, L., Mixter, J., Wheeler, J., & Borda, S. (2016). Data set supporting the study "Undercounting file downloads from institutional repositories" [Dataset]. Montana State University ScholarWorks. http://doi.org/10.15788/M2Z59N
Organ, M. (2006). Download statistics—What do they tell us? The example of Research Online, the Open Access Institutional Repository at the University of Wollongong, Australia. D-Lib Magazine, 12(11). http://doi.org/10.1045/november2006-organ
Paskin, N. (2000). E-citations: Actionable identifiers and scholarly referencing. Learned Publishing, 13(3), 159–166. http://doi.org/10.1087/09531510050145308
Piwik development team. (2016). Web analytics privacy in Piwik. Retrieved June 13, 2016, from http://piwik.org/privacy/
Tansley, R., Bass, M., Stuve, D., Branschofsky, M., Chudnov, D., McClellan, G., & Smith, M. (2003). The DSpace institutional digital repository system: Current functionality. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 87–97). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://dspace.mit.edu/bitstream/handle/1721.1/26705/Tansley_2003_The.pdf?sequence=1
University of Nottingham. (2016, July 14). Open access repository types—worldwide. Retrieved from http://www.opendoar.org/onechart.php?cID=&ctID=&rtID=&clID=&lID=&potID=&rSoftWareName=&search=&groupby=rt.rtHeading&orderby=Tally%20DESC&charttype=pie&width=600&height=300&caption=Open%20Access%20Repository%20Types%20-%20Worldwide
Zeifman, I. (2015, December 9). 2015 bot traffic report: Humans take back the Web, bad bots not giving any ground. Retrieved from https://www.incapsula.com/blog/bot-traffic-report-2015.html