Our CEO, Tim Lloyd, shared his thoughts on the future of usage analytics in the Society for Scholarly Publishing’s blog, the Scholarly Kitchen. We’ve reproduced his article below, or you can read the Scholarly Kitchen’s original version.
Usage analytics in scholarly publishing is undergoing profound change. Powerful industry trends over the last five years have resulted in a new set of challenges for usage analytics that our current reporting infrastructure is poorly equipped to address. As the impact of these changes may not be apparent to the casual observer, this post explores their implications.
A good place to start when envisioning the future is to understand where we’ve come from. In our community, the bedrock of usage reporting is the COUNTER Code of Practice. Since its first release just over 20 years ago, COUNTER reports have rightly become the gold standard for publishers and other service providers to report usage to libraries. These monthly reports provide librarians and publishers with the core metrics they need to understand usage of their resources, and inform decision making on renewals and acquisitions. This primary use case continues to be essential for our community, and the Code of Practice continues to evolve to meet emerging needs (with a new version 5.1 coming into effect in January 2025).
What’s changed?
What’s changed is the environment, which is adding complexity and new requirements beyond this original use case. Here are five examples of those environmental changes.
1. Innovation In Business Models
The growing diversity in Open Access business models, nicely illustrated in Tasha Mellins-Cohen’s Scholarly Kitchen post from March, means that a given organization may be consuming publisher content (both open and controlled) in many different ways. For example, in addition to traditional outright purchase and subscription models, institutions may also be paying transactional fees (APCs, peer review fees, etc.), participating in collaborative funding models (such as Subscribe to Open and the like), and signing bundled agreements that combine reading and publishing budgets.
This diversity in models can make it significantly harder for publishers participating in several models to provide institutions with a single, unified understanding of their usage. The broad re-use rights associated with open access content make it more likely that usage arises in multiple siloed and third party platforms (more below). In contrast, publisher reporting systems are generally engineered around library reporting of controlled access content.
2. New Use Cases
This diversity is also driving new, broader use cases for usage analytics beyond the traditional COUNTER metrics designed for library comparison of vendor services. A variety of new stakeholders are interested in using usage analytics to inform decision making, such as:
- Editors and product managers interested in analyzing usage by subject and reader segment, to assess impact as well as identify new fields of research and study. This includes understanding usage across publisher and third party platforms, and across open and paywalled business models.
- Sales and business development staff seeking to identify the organizations getting value from their content, to inform strategy and illuminate new opportunities.
- Research officers and others involved in institutional funding wanting to understand the impact of funding decisions, to demonstrate the benefit of funded research.
- Authors looking for new opportunities for research and collaboration in the communities engaging with their publications, and to demonstrate their productivity.
These new use cases are shaping the future of usage analytics by extending metadata into new areas (to meet emerging analytics needs) and increasing the focus on reporting tools that are intuitive and accessible for audiences with less data analysis expertise.
3. Fragmentation Of Usage
The traditional model of scholarly users coming to a single publisher-owned platform to access content began to fragment many years ago when ebooks started being distributed on third party platforms, such as JSTOR, Muse, and Amazon. The growth of open access book repositories such as OAPEN and the Open Research Library (ORL) added further sources of usage for OA content. On the journals side, syndication deals with platforms like ResearchGate and ScienceDirect are further fragmenting scholarly usage. The future will increasingly see scholarly content distributed through more platforms, each (ideally) targeting a valuable additional audience that isn’t effectively addressed by the others.
The incorporation of AI into content discovery further weakens the relationship between content and the publisher platform. If large language models (LLMs) are able to present coherent answers based on publisher content, then users may not feel the need to click on a citation link to view the full version (and therefore won’t generate a usage event that can be tied back to that content). A fascinating session at Silverchair’s recent Platform Strategies meeting explored this issue. Most attendees felt that this shift was inevitable, but that these capsule answers may attract a different audience segment, with traditional scholars still wanting to click through to the full content.
This fragmentation significantly increases the challenge of building a comprehensive understanding of usage. Despite community efforts to build standards and best practices in this area, such as the Distributed Usage Logging (DUL) initiative led by COUNTER and Crossref, building data pipelines to combine usage files from multiple platforms into single, consistent and standardized feeds remains a significant technical challenge. A lot of processing still relies heavily on manual intervention, or simply omits data that’s too difficult to integrate.
4. Societal Benefit
Innovation in publishing models is also throwing dust into the air when funders (and to a lesser extent, authors) make choices about where to publish — with more models to choose from, competition for publishing funds is becoming tighter. The importance of understanding publishing impact is something we increasingly hear from funders, but traditional measures of impact, such as the Journal Impact Factor, have long been recognized as flawed. A fascinating Clarivate study of Research Officers and Researchers, presented at April’s STM Conference, indicated that societal benefit was expected to become the most important impact measure for research offices in five years, as well as the most difficult to measure.
This evolving understanding of what is meant by “publishing impact” requires more creative approaches to usage analytics. You can’t measure societal benefit in terms of download numbers, but you can start to estimate it by understanding which communities are engaging with which scholarly content. For example, was a paper on tropical disease accessed by research institutes in West Africa? Did citizens of a US state engage with publications funded by state agencies?
5. Bot Pollution
Robotic activity has always been part and parcel of scholarly usage logging, and COUNTER’s Code of Practice includes a longstanding requirement to ‘exclude robots and crawlers’. However, in recent years the level and sophistication of robotic access appear to have grown sharply. While there are many factors driving this, presumably including amassing content to feed AI models and paper mills, the result is that open access and free content usage logs are increasingly polluted by robotic activity. This can significantly reduce the value of, and confidence in, traditional usage measures, such as views and downloads.
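To make the idea of excluding robotic activity a little more concrete, here is a minimal sketch of two common heuristics: dropping events whose user agent looks automated, and dropping clients whose request rate is implausibly high. The field names (`ip`, `hour`, `user_agent`), the patterns, and the threshold are all assumptions for illustration. They are not COUNTER's rules, and sophisticated bots will evade both checks.

```python
import re
from collections import Counter

# Illustrative patterns only; a real pipeline would use a maintained list.
BOT_USER_AGENT = re.compile(r"bot|crawler|spider|python-requests", re.IGNORECASE)

# Hypothetical threshold: more requests than this from one client in one
# hour is treated as robotic.
MAX_REQUESTS_PER_CLIENT_HOUR = 100


def filter_robots(events):
    """Return the events that pass both heuristics.

    Each event is assumed to be a dict with 'ip', 'hour' and 'user_agent' keys.
    """
    requests_per_client_hour = Counter((e["ip"], e["hour"]) for e in events)
    kept = []
    for event in events:
        if BOT_USER_AGENT.search(event["user_agent"]):
            continue  # self-declared crawler
        if requests_per_client_hour[(event["ip"], event["hour"])] > MAX_REQUESTS_PER_CLIENT_HOUR:
            continue  # browser-like user agent, but an implausible volume
        kept.append(event)
    return kept
```

Neither check is sufficient on its own. The rate check catches scripted harvesting that presents a browser-like user agent, which the pattern check would miss.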
And what’s the impact on usage analytics?
The simple answer is that usage analytics need to cope with a lot more complexity.
One way to think about analytics pipelines is to consider their component parts — data ingestion, data processing, and data export.
Data Ingestion
Solutions need to be able to ingest data from multiple sources, in diverse formats, at different cadences (daily? monthly?), and to varying levels of quality. Whereas traditional COUNTER reporting is a monthly event, future analytics pipelines will be more akin to a river that is constantly in a state of flux. Platform A may provide a server-driven monthly export of aggregate COUNTER metrics, platform B may provide real-time COUNTER-compliant events, and platform C may provide a manually created, home-brewed set of metrics in a .csv file that generally arrives around the same day of the month depending on staff availability and time of year. And yet all three sources are important because they address different segments of the user community, and therefore build up our understanding of publishing impact.
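One way to picture this is a small adapter per source, each converting its input into a shared record shape. The sketch below is purely illustrative: the `UsageRecord` fields, the function names, the input keys, and the CSV column headers are all hypothetical, and it assumes platform B's timestamps are ISO 8601 strings already in UTC.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    source: str
    item_id: str
    month: str  # "YYYY-MM"
    metric: str
    count: int


def from_monthly_aggregate(row):
    """Platform A style: one row is already a monthly total."""
    return UsageRecord("platform_a", row["item"], row["month"], row["metric"], int(row["total"]))


def from_event(event):
    """Platform B style: one event counts once (timestamp assumed ISO 8601, UTC)."""
    return UsageRecord("platform_b", event["item"], event["timestamp"][:7], event["metric"], 1)


def from_csv(text):
    """Platform C style: a hand-made CSV with its own column names."""
    for row in csv.DictReader(io.StringIO(text)):
        yield UsageRecord(
            "platform_c", row["Title ID"].strip(), row["Month"].strip(), "views", int(row["Views"])
        )
```

Keeping the `source` on every record means downstream steps can still treat a hand-made file with more caution than a compliant feed.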
Data Processing
Solutions need to be able to normalize data so that these diverse inputs can be aggregated and compared. Using the example above, this could include:
- Cross-walking several different flavors of organization identifiers (a publisher taxonomy; uncontrolled text strings; proprietary IDs) to a standard identifier such as ROR
- Standardizing timestamps to a consistent format
- Converting individual event metrics to a set of consistent monthly aggregate totals
- Mapping COUNTER and non-COUNTER event types to equivalents (where they exist)
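As a rough illustration, the four steps above might look something like the following sketch. Everything in it is an assumption made for the example: the crosswalk table, the placeholder ROR-style identifiers (not real records), the home-brewed event names, and the choice to treat timestamps without a timezone as UTC. The COUNTER metric names are used only as mapping targets, and the mapping itself is simplified.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical crosswalk: lowercased source identifiers -> a standard ID.
# The ROR-style values are placeholders, not real ROR records.
ORG_TO_ROR = {
    "pub-tax:1001": "https://ror.org/placeholder01",
    "example university": "https://ror.org/placeholder01",
}

# Hypothetical mapping of source event types to COUNTER-style metric names.
EVENT_TYPE_MAP = {
    "Total_Item_Requests": "Total_Item_Requests",
    "pdf_download": "Total_Item_Requests",
    "abstract_view": "Total_Item_Investigations",
}


def to_ror(raw_org):
    """Cross-walk a source organization label to a standard ID, or None."""
    return ORG_TO_ROR.get(raw_org.strip().lower())


def to_utc(timestamp):
    """Accept epoch seconds or an ISO 8601 string and return an aware UTC datetime."""
    if isinstance(timestamp, (int, float)):
        return datetime.fromtimestamp(timestamp, tz=timezone.utc)
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc)


def monthly_totals(events):
    """Aggregate events into counts keyed by (organization, month, metric)."""
    totals = Counter()
    for event in events:
        ror = to_ror(event["org"])
        metric = EVENT_TYPE_MAP.get(event["type"])
        if ror is None or metric is None:
            continue  # no equivalent; a real pipeline would queue this for review
        month = to_utc(event["ts"]).strftime("%Y-%m")
        totals[(ror, month, metric)] += 1
    return totals
```

Note that an event stamped late on 31 March in a local timezone can land in April once converted to UTC, which is one reason timestamps need standardizing before monthly totals are built.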
In addition, the volume of data that needs processing, and reprocessing, can be very significant — especially when open access content is included, which can generate usage an order of magnitude greater than paywalled content. Traditional database architectures can struggle to cope with these requirements. As an example, we’ve spent much of the last three years at LibLynx replacing our legacy COUNTER processing infrastructure — which was designed around a monthly build of COUNTER reports — with a completely new architecture designed to cope with processing of billions of usage events each year, and on-demand generation of reports.
Data Export
COUNTER reports are traditionally consumed as spreadsheets by librarians familiar with the format, or in machine readable formats for automated harvesting. In contrast, the new use cases for usage reports include stakeholders that are less familiar with these industry standards, and need more flexible ways to incorporate analytics into their existing workflows.
Analytics need to become more inclusive — more intuitive, more visual, and more context-sensitive in order to be impactful for these new audiences. They need to cater for a wider range of automated access scenarios, including bulk access to the underlying metrics and more flexible querying of data sets outside of templated reports. They also need to incorporate qualitative data that can add valuable, rich depth to understanding impact, but is often lost when pipelines are designed around scale and numbers alone.
My colleague, Lettie Conrad, has been working on a project over the last 12 months to re-imagine the user experience for usage analytics. She’s been talking to various community stakeholders to understand their current and emerging needs, and working with a team of designers and software developers to create a new framework for exploring analytics that provides the flexibility to support these new use cases.
New challenges and opportunities mean new tools for measuring impact and value. Usage analytics will be a critical component of how our community communicates publishing impact in the future. It’s important that we pay attention to its supporting infrastructure, and make the investments needed to keep pace with emerging stakeholder needs.