Outliers and Misbehaving Data in the LJ Index
There is a fundamental difference between the LJ Index, now in its second edition, and HAPLR. In the LJ Index, a single one of the four measures can account for nearly the entire score of a library. In HAPLR, none of the 15 measures counts for more than 10% of the total score. This article explores that volatility, as well as the use of uncertain data, both of which have produced some troubling results for the LJ Index.
Library Journal published the second edition of the LJ Index on November 16. LJ’s decision to use “outlier” numbers gave at least one library a five-star rating that appears dubious at best. The same decision cost two other libraries in the over-$30-million spending category their star ratings. And that is just in the largest spending category. The problems go deeper and wider.
For all of the 10 years I have worked with HAPLR, I have resisted using the electronic measures because they appeared too skewed to me. The LJ Index authors chose to use a very new data measure that is clearly problematic, and they compounded the problem by constructing a “Score Calculation Algorithm” that allows a single measurement to swamp the index score. The result is troublesome.
LJ did not rate San Diego County Library in the February edition but gave it five stars in the November edition, called Round Two. What is troubling is the data involved. LJ used data from the Institute of Museum and Library Services (IMLS) as reported by libraries. For the November 2009 edition of the LJ Index, that meant data reported in 2008 for fiscal year 2007.
San Diego reported 16.5 million “Public Internet Use” sessions. The library did not report this measure last year, but two years ago it reported just 1.1 million sessions. That is a 1,400% increase in just two years! Furthermore, the newest data is available on the California State Library site, though not yet at the federal level, and the number is back down to 1.4 million. Swings like these seem unlikely and probably result from a misunderstanding of the requested information: the figure is supposed to count “sessions” on the library’s Internet terminals, not individual “hits” on a database. It appears likely that San Diego County, among others in other spending categories, reported hits rather than sessions.
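The size of the swing is easy to verify from the reported figures. A quick sanity check in Python (the variable names and year labels are mine, not IMLS’s):

```python
# Reported "Public Internet Use" sessions for San Diego County Library
sessions_two_years_prior = 1_100_000   # the earlier reported figure
sessions_fy2007 = 16_500_000           # the figure used in LJ Index Round Two
sessions_newest = 1_400_000            # newer figure on the California State Library site

# Percentage increase from the earlier report to the FY2007 outlier
pct_increase = (sessions_fy2007 - sessions_two_years_prior) / sessions_two_years_prior * 100
print(f"{pct_increase:.0f}% increase")  # 1400% increase

# The outlier is also roughly 12 times the newest reported figure
print(f"{sessions_fy2007 / sessions_newest:.1f}x the newest figure")  # 11.8x the newest figure
```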
What does it mean to report San Diego County Library’s 16.5 million “Public Internet Use” sessions? In an average week, each available terminal in San Diego County Library would have to have been used more than 786 times, nearly six times the rate of its closest peer in the LJ Index spending category. And on average, every visitor would have had to use the Internet terminals 4.2 times per visit! That is highly unlikely. The next closest library in the category saw just 0.5 sessions per visitor, although in the other population categories there are libraries with even higher numbers.
What happened?
IMLS has developed “edit check” parameters for the “Public Internet Terminal Use” question in the national data. These edit checks alert state data coordinators when numbers are clearly out of range, and a second level of edit checking is supposed to occur at the federal level as well. The process did not work as intended, however. An edit check should certainly be triggered when nearly every visitor logs a Public Internet Terminal Use session, yet the federal data somehow retained this unlikely number.
The LJ Index’s use of this data caused problems for its Star Libraries roster. If we assume the true number of sessions for San Diego County is the 1.4 million of its more recent report, its LJ Index score would be 450, not the 989 reported. Rather than earning five stars as the 4th-ranked library of 36, it would fall to 22nd. And the effect of the LJ Index algorithm does not stop at changing San Diego’s score; it changes the score of every other library in the group. In all, 29 of the 36 libraries would rank higher or lower if just this one outlier were changed to a more reasonable number. Alternatively, of course, LJ could have left San Diego County Library out of the mix because of the faulty data or, at the very least, acknowledged the problem. The LJ authors have certainly given me grief about not giving sufficient warning about the vagaries of HAPLR. They may want to hold themselves to the same standards they urge for other rating systems.
San Diego County’s apparent misunderstanding of the question, the state and federal data coordinators’ failure to catch the outlier, and LJ’s decision to ignore the problem meant that the Las Vegas County and Brooklyn public libraries were denied inclusion as Star Libraries in their spending category. There are similar problems in most other spending categories as well.
In “Ain’t Misbehavin’! Uneven LJ Index Score Ranges Are More Informative,” LJ Index co-author Ray Lyons says, “LJ Index scores are not well behaved. That is, they don’t conform to neat and tidy intervals the way HAPLR scores range from about 30 to 930.”
Lyons goes on to say:
The LJ Index is more informative than percentile-based rankings. This increased information does help us move a few steps forward.
Later, he adds:
[T]here is the challenging problem of the validity of very high per capita statistics (called outliers in statistical jargon). These can occur, for instance, due to very small service area populations, when libraries serve a population well beyond its official service boundaries, due to errors in data collection or reporting, or for other reasons.
This is the main reason the LJ Index team decided to de-emphasize the scores and group the top-rated libraries into “star” categories. As the Library Journal article explains, the scores are not precise and there is a lot of “noise” in the underlying data. Better to take a more bird’s-eye view of how libraries are arranged than to take exact scores too literally…or numerally!
In the eight-thousand-plus words LJ devoted to the print edition, and in the tens of thousands more in the online version of LJ Index Round Two, I may have missed the disclaimers about the impact that outliers had on the LJ Index this time, but I don’t think so.
To borrow very nearly Lyons’s own words, HAPLR ain’t misbehavin’ in quite the way the LJ Index is.
Changing a single number, San Diego County Library’s reported 16.5 million “Public Internet Use” sessions, to the more likely 1.4 million has major impacts on every library in the category, as the chart below demonstrates. The LJ Index uses a “Score Calculation Algorithm” that allows a single measurement from a library with outlier data to swamp the index scores.
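The mechanism is easy to see in miniature. The sketch below is not LJ’s actual Score Calculation Algorithm, and the libraries and per-capita values are invented; it simply uses a ratio-to-group-mean score, my stand-in assumption, to show how one outlier shifts every library’s score and ranking, because every score depends on the group mean:

```python
# Illustration only: a simplified ratio-to-mean index, NOT LJ's actual algorithm.
# All library names and values below are invented for demonstration.

def scores(per_capita):
    """Score each library as its value relative to the group mean (scaled by 100)."""
    mean = sum(per_capita.values()) / len(per_capita)
    return {lib: round(v / mean * 100) for lib, v in per_capita.items()}

group = {"A": 0.5, "B": 0.4, "C": 0.3, "Outlier": 5.5}  # one implausible value
corrected = dict(group, Outlier=0.45)                   # same group, outlier corrected

print(scores(group))      # {'A': 30, 'B': 24, 'C': 18, 'Outlier': 328}
print(scores(corrected))  # {'A': 121, 'B': 97, 'C': 73, 'Outlier': 109}
```

Note that correcting the one outlier changes libraries A, B, and C’s scores too, and even the top rank flips from “Outlier” to A, which is the same effect the San Diego County number has on the other 35 libraries in its category.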