Extracting Book Data from Library Information Systems

Previously I mentioned on the Kitchen that I have been working with Roger Schonfeld and Katherine Daniel of Ithaka S+R on a project to determine how academic libraries acquire books and what market share the various vendors control. (If you follow the link to that Kitchen post, be sure to watch the video at the end, as it is of the essence.) I am not going to rehearse the nature of that project again beyond stating that we set out to answer two questions: What proportion of academic library sales of books does Amazon control, and have university press sales to libraries been declining?

To these questions we added several more, including investigations into pricing and the subject categories that the acquisitions fell into. I am pleased to report that the final report for that project is now available on the Ithaka S+R Web site. I recommend it to everyone interested in the academic book market. Naturally, having completed this study, we now see reasons to do a dozen more. That is always the case: data breeds a need for more data, and more data engenders a need for even more data, and so on ad infinitum. So rather than say we have completed our project, it’s probably more accurate to say we are taking a breather. The case for hiring data analysts to help operate a business gets stronger day by day.

The key thing about the final phase of this project is that we were able to get data from more libraries — a total of 154. We got this data through arrangements with two of the leading vendors of Integrated Library Systems (ILS), OCLC (WorldShare) and ProQuest (Alma). This meant that we were able to look at the very data that libraries use in managing their operations. It would have been great to have even more libraries — how about 500? 1,000? — so that we could claim that our sample was representative of the U.S. academic library market as a whole, but there were technical limitations in getting a larger sample. How to overcome those technical limitations is a matter for future projects. In the meantime we are being careful not to say that our sample is representative of U.S. academic libraries as a whole; the results are suggestive and directional, but not definitive.

Our data was fleshed out a bit through conversations with representatives from vendors. This allows us to feel comfortable making the following statements:

Amazon controls about 10% of the print book market (and no discernible share of ebooks).
Amazon’s growth is coming at the expense of other vendors. Amazon thus must be acknowledged as a library “wholesaler.”
Our data provided little insight into ebook aggregations, which are cited in an ILS as a single entry and not title by title. While the vendor information helped to fill this out, we still cannot provide answers to some questions, such as the extent to which aggregations are replacing title-level book acquisitions.
The unavailability of ebook aggregation data means that we cannot say with certainty that sales of books to academic libraries are declining, though that is a commonly held view. If there is a decline, it is partly offset by Amazon’s growth; some portion of Amazon’s increasing revenues must be counted by publishers as library sales. Ebook aggregations would add considerable volume to the title-by-title figures we uncovered.

Library expenditure on books is inevitably larger than publisher revenue. One area that we have not explored, but that may represent a significant part of the market, is the size of inter-library loans (ILL). How should one put a dollar value on these loans, which libraries pay to participate in? There are thousands of books being borrowed in this way, but ILL produces no revenue for publishers. Similarly, the purchase of used books is a cost to libraries but yields no revenue to publishers. Finally, only a portion of the prices paid by libraries for aggregated book packages is passed along to publishers in the form of revenues: the publishers’ income is net of the costs of such middlemen as EBSCO, ProQuest, Amazon, Baker & Taylor, Ingram, etc. Thus the $700 million market figure cited here (plus whatever you add on for ILL) represents library expenditures but not publishers’ income. I would be interested to hear members of the community offer comments about what dollar value to put on that net figure. Perhaps $400-$500 million?
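
For readers who want to see what that guess implies, here is a back-of-the-envelope sketch. It uses only the $700 million expenditure figure and the tentative $400-$500 million range above; the range is a guess rather than measured data, ILL is left out, and the function and variable names are purely illustrative.

```python
# Figures from the paragraph above, in millions of US dollars.
LIBRARY_EXPENDITURE = 700           # market figure cited in the text (excludes ILL)
PUBLISHER_NET_GUESSES = (400, 500)  # tentative range from the text, not measured data


def net_share(net, gross):
    """Fraction of library spending that would reach publishers."""
    return net / gross


def implied_shares(gross=LIBRARY_EXPENDITURE, guesses=PUBLISHER_NET_GUESSES):
    """Pair each guessed net figure with its share of library spending."""
    return [(net, net_share(net, gross)) for net in guesses]


if __name__ == "__main__":
    for net, share in implied_shares():
        # The remainder covers intermediaries' costs, used-book purchases, etc.
        print(f"${net}M net: {share:.0%} to publishers, {1 - share:.0%} elsewhere")
```

On those assumptions, publishers would see roughly 57% to 71% of what libraries spend, with the remainder going to intermediaries and to purchases, such as used books, that yield publishers nothing.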

It was pure coincidence, but while working on this project some of my colleagues and I became involved with an entirely separate project concerning ILSs. As we sat around a large conference table with the heads of a number of libraries of all sizes, the complaints piled up, beginning with the librarians’ frustration with electronic resource management (ERM), on which there was unanimity. Some of the complaints were with publishers, not the ILS. For example, librarians often find that the resources they license, which are listed in contracts, do not match what they actually get access to. It’s no surprise that sometimes publishers fail to turn on access to some publications, but it was eye-opening that often libraries find that they have access to materials they did not pay for (whose responsibility is it to correct that?). Many librarians, it seems, are forced to spend their time dealing with low-value issues, essentially cleaning up the messes left for them by publishers, other vendors, and ILS providers alike.

The real issue these librarians have, however, is that they don’t have the data to do their jobs. The ILS does not enable the kind of management reports that they rightly feel they should be able to get, and the frustration with ERM systems appears to be universal. If the data were better, if it were comprehensive, and if it were viewable through customizable dashboards, what would it tell librarians about their collections, their patrons’ usage, and the value they get from particular publishers?

So here we have a bizarre market situation where book publishers and librarians alike are starved for data, but the intermediaries are not sufficiently helpful in supplying it. I say “book publishers” because journals are different; usage largely takes place on a publisher’s server, enabling publishers to undertake data analytics directly. But book publishers only get hints and (sometimes) polite smiles from their distributors, and librarians struggle with the data they have, which is incredibly messy with many items miscategorized, occasionally superfluous, and not in a form that permits it to be integrated with other data across platforms.

So what good is hiring a data analyst, whether you are a publisher or a librarian, if you don’t have the data in the first place? In the absence of more comprehensive data and standards for developing it and using it, the industry-wide emphasis on data analysis operates in a vacuum. Of course, larger firms have more data and can do more with it, but smaller enterprises, including all but a handful of university presses, have to fly blind, as do the libraries they seek to serve, which suffer with their recalcitrant ILS.

What we need is to find a way to free the data from the control of the intermediaries or for the intermediaries to improve the data they provide in the first place. Note that this is not a call for open access; the data that would be made available is data about a publisher’s own books and a library’s own collections. Intermediaries play an important role in scholarly communications and they add value to the process, but the academic book market will only shrink if the participants cannot get a window into their own operations.

From the Scholarly Kitchen

Joseph Esposito
