Panel 5 Discussion
Alex: I'm grateful that you took modeling seriously, took care to explain your interests and assumptions clearly, and especially that you walked us through false steps with the Class Imbalance Menace, which is superbly explained and illustrated.
At one level, I find your models easy to accept; interesting if not surprising. If we have to guess whether a patron might have checked out a book by a given author, and our only evidence before making a guess is information about which other authors they checked out, it's entirely plausible that knowing something about those other authors can allow us to do better than chance in our guessing. All of your experiments would seem to confirm that this is true -- to varying degrees, but reliably.
If I'm following the argument, you ask us to attribute this possibility of prediction to something called "taste," which seems to mean a commonality of preference patterns across readers. By contrast, you suggest, insofar as a model can't predict choices, its unpredictability will be attributable most interestingly to personal idiosyncrasy (although you allow room to qualify that).
You are careful to enumerate several assumptions of your logistic models: that the categories to be predicted are exclusive and exhaustive, that predictor variables are independent, etc., and you offer reasons why you are persuaded that these assumptions are probably met well enough. But I have to say I'm not quite persuaded that all relevant assumptions are fully on the table yet.
I'm not a good statistician myself, just curious about assumptions. Is multicollinearity so easily dismissed? When you say in a comment above that authors Harland, Holmes, and Evans are each predictors of Hentz, that sounds exactly right, but when we're predicting Harland, wouldn't we feel content and unsurprised if Hentz in turn reasonably joins Holmes and Evans as predictors? Isn't the broad overlap of relationships between authors like this exactly multicollinear non-independence, in statistical language? And we like this non-independence whenever we want to turn this informally into a story about relations between authors rather than patrons and authors.
Another big question I have about this approach to modeling is the complex effect of time, and the consequences of ignoring it. If I understand correctly, you're treating each patron as a vector of author associations, flattening out time altogether. In describing your method you sometimes suggest a different picture, referring to predictions based on a patron's "history" and to "previous checkouts." If I understand correctly, however, the models actually take into account past and future checkouts indiscriminately, and what they predict is not an event, but the presence or absence of a set of one or more events spread out over time -- a patron may have checked out a given author just once or repeatedly over a decade.
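To make the flattening described above concrete, here is a minimal sketch of how dated checkouts might collapse into a time-free borrowing profile. The records, patron IDs, and function names are hypothetical illustrations, not the project's actual data or code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction records: (patron_id, author, checkout_date).
transactions = [
    ("p1", "Alger", date(1895, 3, 1)),
    ("p1", "Alger", date(1899, 6, 12)),   # a repeat checkout years later
    ("p1", "Alcott", date(1901, 1, 20)),
    ("p2", "Alcott", date(1896, 5, 5)),
]

def flatten_profiles(records):
    """Collapse dated checkouts into a set of authors per patron."""
    profiles = defaultdict(set)
    for patron, author, _when in records:  # the date is deliberately discarded
        profiles[patron].add(author)
    return dict(profiles)

def to_binary_rows(profiles, authors):
    """Encode each patron as a 0/1 vector over a fixed author ordering."""
    return {
        patron: [1 if author in seen else 0 for author in authors]
        for patron, seen in profiles.items()
    }

profiles = flatten_profiles(transactions)
authors = sorted({author for _, author, _ in transactions})
rows = to_binary_rows(profiles, authors)
```

In this encoding, one checkout and a decade of repeat checkouts both become a single 1, and the order of events is lost.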
Creating a subset of patrons with five or more checkouts still preserves a lot of variation across patrons, and doesn't limit us to power users, which is good. A frequent reader could get to five checkouts in a few weeks of activity. But this means that many patrons will be concentrated in short spans of time from 1891 through 1902, while a smaller number of patrons will be represented by longer spans of activity. Limiting authors to 200 or more circulations does more to ensure that authors are spread out through somewhat longer spans of time, which also to some degree makes them fewer and more like "power authors."
Time matters to availability and to the interpretation of checkouts as preferences selected from a menu of alternatives. I understood this to be the point of Steve's question in a comment above about new books. If, say, one-third of patrons ended their transaction history before October 1899, when the first Booth Tarkington novel showed up, that's quite a few cases who never had any opportunity to check out a book by Booth Tarkington as a predictor or an outcome. There could be no taste for Booth Tarkington before 1899. Flattening time assumes that the composition of the library as a collection of authors was mostly stable, but it plainly wasn't. In statistical language maybe one would talk about a false assumption of equal exposure to predictor variables or something like that. There's substantial temporal variation among both authors and patrons. Isn't this a problem for methodological assumptions?
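One way to gauge the exposure problem would be to count how many patrons stopped borrowing before an author entered the collection. The sketch below assumes hypothetical last-checkout dates; the October 1899 date for Tarkington comes from the comment above, with the day of the month assumed.

```python
from datetime import date

# First availability per author. October 1899 is taken from the discussion;
# the exact day is an assumption made for illustration.
first_available = {"Tarkington": date(1899, 10, 1)}

# Hypothetical last-checkout dates per patron.
last_checkout = {
    "p1": date(1897, 4, 2),
    "p2": date(1901, 8, 15),
    "p3": date(1899, 9, 30),
}

def exposed_patrons(author, first_available, last_checkout):
    """Patrons still active on or after the author's first availability."""
    start = first_available[author]
    return sorted(p for p, last in last_checkout.items() if last >= start)

def unexposed_share(author, first_available, last_checkout):
    """Fraction of patrons who never had a chance to borrow the author."""
    exposed = exposed_patrons(author, first_available, last_checkout)
    return 1 - len(exposed) / len(last_checkout)

exposed = exposed_patrons("Tarkington", first_available, last_checkout)
share = unexposed_share("Tarkington", first_available, last_checkout)
```

A large unexposed share for a given author would suggest either restricting that author's model to exposed patrons or comparing the two fits.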
Also, just as taste changes, tastes change. Individual readers, mostly young people, have their own developmental histories. Children's tastes aren't supposed to be stable. It's one thing to look for a pattern of selecting books by Alger and Alcott alongside other authors when a patron has a single year of transaction history, and another thing to be looking at authors checked out by a patron with many transactions spanning a decade between the ages of 10 and 20.
Apologies for an over-long comment. I find the problem of modeling this kind of data fascinating, and am really grateful you have taken it seriously and given us something to dig into.
I can't resist a quick reply to your comment about how Dunbar might have been read in 1890s Muncie. An 1898 meeting of the Woman's Club of Muncie, on the catch-all theme "Cobwebs from a Library Corner" included a number of recitations and discussions of dialect, including part of a Riley essay on dialect, and a lullaby by Dunbar, of whom the presenter claimed "[he] has had more of a claim in literature than any other colored person." Just before, another presenter read a selection from Riley as "an illustration of the Hoosier." The Dunbar recitation was followed by a "Coon Song" that met with amusement. There are also presentations on Southern Yankee, German, Bohemian, Irish, and child dialect. At the very least, these women were cognizant of dialect as indication of both regional and racial character. (I'm working from old notes; I need to go back to the original minutes on this rather interesting meeting.)
@felsenstein Thanks for reminding us of the class data available from the census, Frank. I'll be sure to raise the issue with the team when we revisit the piece for publication.
@azleslie Great questions, Alex. We have looked at comparative measures of borrowing, but as you note in your talk, these are relatively rare events, except for the obvious, author-driven serial reading. And it's somewhat confounding because, especially for adult reading, which is spread across many more authors than juvenile reading, we can't definitively say what drives selection: reviews in periodicals? back-of-book advertisements? word-of-mouth? Probably all of the above, and more. As a result, and because we don't have a measure that we trust, we've fallen back on a kind of aggregated anecdote; e.g., "25% of the readers of X also read Y".
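For readers unfamiliar with the "aggregated anecdote" measure, it amounts to a conditional share over borrowing profiles. The following is a minimal sketch with invented profiles; the author names are borrowed from the thread, but the data and the function name are hypothetical.

```python
# Hypothetical borrowing profiles: patron -> set of authors borrowed.
profiles = {
    "p1": {"Hentz", "Harland"},
    "p2": {"Hentz"},
    "p3": {"Hentz", "Evans"},
    "p4": {"Hentz"},
    "p5": {"Harland"},
}

def share_also_read(profiles, x, y):
    """Among patrons who borrowed author x, the fraction who also borrowed y."""
    readers_x = [authors for authors in profiles.values() if x in authors]
    if not readers_x:
        return None  # undefined when nobody borrowed x
    return sum(1 for authors in readers_x if y in authors) / len(readers_x)

hentz_to_harland = share_also_read(profiles, "Hentz", "Harland")  # 1 of 4 readers
harland_to_hentz = share_also_read(profiles, "Harland", "Hentz")  # 1 of 2 readers
```

The measure is asymmetric: here a quarter of the Hentz readers also read Harland, while half of the Harland readers also read Hentz.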
Regarding "the degree to which an interest in Southern books can be distinguished from an interest in what was currently popular": Yes, exactly. Southern reading is a subset of popular literature, distinguished from other forms by setting and character, which can range from superficial backdrop to substantive parts of the narrative. We are reasonably confident that readers' preferences are generally broader, for romance or adventure fiction, and that what we're calling "Southern fiction" might as well be described as "romance set in the South" or "adventures in the South."
But that suggests an interesting experiment: It should be possible, with a little work (well, maybe more than a little!), to arrange books into broad categories according to setting: these are set in the South; these, in Germany; etc. How then would, say, your model respond if it were to work not with author, but with setting? If a reader has selected a book set in New York, for example, how likely would he or she be to select more books set in New York?
Frank: Thank you for the question. As Steve says, we have not looked at class much (in part because we were/are wrongly or rightly less confident of the data supplied). But since you asked for an example, I have noticed the case of Bobbie Knowlton (patron/library card) with books checked out by Mrs. Kate Knowlton (borrower). He is listed as blue collar and "illegitimate"; there is no information on her. Mother and son? Do you have insight into these 601 checkouts, which do include southern reading falling into "boy" reading as well as adult reading (in the direction of romance/historical romance)? Mother checking out for both of them? You speak in your session about ways in which the current database could be expanded. I'd say from my years of working with it, that precisely more demographic data would be very welcome. Especially in the matter of class one would feel more confident about diving in if there were more data. If I recall correctly, many younger patrons are simply listed in the demographic data as "at school."
Lynne makes good points about the nature of the occupational data we have in WMR. There are a number of complicating factors in inferring from the occupational categories we assigned to borrowers to socioeconomic status/class. We did use the occupational data to make some generalizations but tried not to use the term class in those instances, despite the temptation to do so. Although Muncie was rapidly industrializing during the 1890s and might be expected to have recognizable class divisions, class categories were very complicated and permeable, and based on behavior as well as occupation. The occupational classifications we used (white collar, blue collar, etc.) only loosely correspond to middle class and working class. They don't accommodate high-status artisans, which still existed during the 1890s. And as Julieanne's presentation makes clear, we have only a snapshot of occupational status, in this case from 1900. A young book borrower in 1891 might belong to a blue-collar family at that time but then become a white-collar worker by 1900 (when we have access to census records). The kinds of statistical analysis Lynne, Steve, and Doug do would seem to require more precision than our occupational data can offer.
There is another complication: we've argued that borrowing best-selling books was in at least some cases a form of behavior that marked people as middle class, so assigning them that kind of socioeconomic profile outside of their borrowing would seem to have a confounding effect on any analysis.
In the somewhat unusual case of Bobbie/Kate Knowlton, it appears that Kate Knowlton, the mother in this case, always used a borrowing card taken out for her son. (The "illeg." note refers to illegible rather than illegitimate. The transcriber had some difficulty in reading that census entry.) The occupational category assigned for children was based on the occupation as listed for the head of household (a census designation). In this case Kate Knowlton held a semi-skilled job according to the Edwards classifications.
These discussions are very helpful as we look to rework WMR. Expanding/refining the demographic profiles of patrons is one possible step. There is more data out there now than there was a dozen years ago when we collected this data, so it may be possible to expand some patron records. But systematically expanding demographic profiles for the 4,000 borrowers we have in the data will be a huge lift, and will depend on funding and other priorities.
Lynne -- We have a little further info. on Addie Knowlton and her far younger brother, Bobbie, in our book What Middletown Read, pp. 206-207. We remark that, between them, they accounted for 1,084 recorded transactions. We write that "what is distinctive about [these and other similar] library patrons is not just that their borrowing was prolific but that they all belonged to a lower middle-class or working-class (blue-collar) background", suggesting that such readers "gave the MPL its true raison d'etre, although, taking readership as a whole, their patronage and extensive borrowing records were far more the exception than the rule for the city's manual workforce. The library [board] may have trumpeted the fact that membership was an amenity freely available to all the citizens of Muncie, but...the number of working-class patrons in relation to the size of the population remained disappointingly small." Our original text for the book included some further footnoted information that we were obliged to exclude because of constraints of length. The footnote reads: "Addie Knowlton (b. 1868) married James Manor, a mail carrier, in 1886. Robert Knowlton (b. 1882), still at school, was living in Muncie with his widowed mother, Catherine, in 1900." Their late father's profession had been as a dealer in lightning rods.
Jim and Frank: Clearly I've read too many 19th-century novels (went right to illegitimate instead of illegible). I did consult your book, which I have found so useful over the years, but didn't find Bobbie in the index and didn't have time to look cover to cover. Thanks for pointing me right to the pages and for the information. And if I understand you correctly, Bobbie and Addie (I had seen her too but wasn't sure of the relationship) are the exceptions that prove the rule.
Yes! The Knowltons are an exception that proves the rule. One thing that interests me in Frank's added details (besides his impressively ready access to old notes) is that Bobbie Knowlton is listed as "at school" at a point when he is 17 or 18 years old (b. in 1882, census data collected in 1900). That is quite unusual, especially for a young male, in a period when schooling typically did not extend beyond 8th grade or so. That's a behavior that is arguably more associated with a middle-class than a working-class family, at least in Muncie at the time, which shows how difficult it is to generalize from the occupational data we have to socioeconomic status. (Admin=Jim)
I'm really glad you brought up multicollinearity, @douglasknox, because it's a matter that I've found myself thinking a lot about as well for the very same reasons (I ended up trimming several minutes from my original recording along these lines). I think you're spot on to say that non-independence among variables, while technically a red flag for statistics, is something that as cultural historians we may in fact want to see -- whether as an identifier of a positive relation in itself or an indicator of how to fashion different kinds of analysis to isolate that relation. And given that, as humanists, we often want to insist on the non-independence of cultural phenomena in general, I think it's also worth critiquing assumptions of independence among variables even though the purpose of the narrower sense of independence in statistics is focused on issues of model error; ideally, satisfying and assuming independence in this practical sense should help better identify non-independence in the broader sense.
I have two comments relating to these two thoughts, respectively. First, I've calculated the variance inflation factor (a standard measure of multicollinearity) for models across the board (including Hentz, Harland, and Evans) and found them consistently several times lower than what are considered typical thresholds for concern. Second, I suspect that the reason for this lack of measured multicollinearity lies in the facts that borrowing a book by any given author is something of a rare event and that the actual % checkout overlap even between related books or authors remains low (such as Hentz, Harland, and Evans, as the tables in your presentation helpfully point out). I interpret this to mean that even though taste is sufficiently consistent among patrons as a group to be relatively identifiable and predictable overall, the specific choices it entails nonetheless remain individual.
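For anyone who wants to see what the variance inflation factor measures, here is a dependency-free sketch. It uses the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The 0/1 columns are invented for illustration, and this is not necessarily how the figures reported above were computed. Commonly cited rules of thumb treat values above roughly 5 or 10 as cause for concern.

```python
import math

def correlation_matrix(columns):
    """Pearson correlations between predictor columns (lists of equal length)."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    if any(norm == 0 for norm in norms):
        raise ValueError("a constant column has no defined correlation")
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """One VIF per predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]

# Hypothetical 0/1 checkout indicators for eight patrons (invented data).
hentz = [1, 1, 0, 0, 0, 0, 0, 0]
harland = [1, 0, 1, 0, 0, 0, 0, 0]
evans = [0, 0, 1, 1, 0, 0, 0, 0]
factors = vif([hentz, harland, evans])
```

With sparse indicators like these, pairwise correlations stay modest even when authors share readers, which is consistent with the rare-event explanation offered above.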
The temporality of checkouts is definitely an important issue, and one that warrants more exploration. You've understood my approach correctly, which flattens out time for convenience and to maximize available data -- I misspoke a couple times in my presentation. "Borrowing profile" would probably be more apt than "borrowing history" (I meant this in the cumulative rather than chronological sense, but the difference isn't explicit!). I did take a few steps with regard to the timespan issue. For one, I limited my data to 1895-1902. Because I was concerned about accessibility with limited holdings, I also ran tests (with a variety of different model parameters) that excluded 1) books published in the last year of data, from late 1901 through 1902, and 2) transactions in that last year of data altogether; I found, however, that doing so did not significantly change the accuracy or sensitivity of the models. This should, however, be repeated for the last two or even three years of data. And while most of the top authors were available in Muncie throughout this timespan, to be safe we could also remove the ones who, like Tarkington, were not.
And yet -- these decisions would come with certain risks too. Always a matter of tradeoffs! For one, flattening time and taking a broader timespan are key framing decisions for addressing the issue of limited numbers of copies (because even though an author's books might have been checked out at the time when a patron first sought them, patrons tended to be patient). This breadth must be reflected in the framing of the binary choice.
Here are my tentative thoughts on the more complex issue you raise, as I understand the involved aspects. It is not necessary that every observation in a model have equal opportunity for every possible value for each variable (observations often do not, as in common variables like education level or others that are contingent on age). Similarly, it is not necessary that every single variable in a model be equivalent (in access, in scale, and so on). The comparatively limited temporal accessibility of Tarkington books is implicitly part of the Tarkington-checkouts variable. This is part of the usefulness of the latent variable approach. Your point, though, is about double uncertainty (thanks for clarifying Steve's point for me!): variability in author availability and variability in patron presence aren't issues individually, but is their conjunction?
In one sense, a person who can't have a taste for Tarkington because the latter had not yet published his first (1899) novel is functionally equivalent to a person who doesn't have a taste for Tarkington. Yet while there couldn't be taste specifically for Tarkington before 1899, there certainly could be taste with which Tarkington was consistent. This is where the issue would arise for the model: patrons whose observed taste might suggest that they read Tarkington but who in fact did not because they could not have. The problem here wouldn't lie in the prediction stage: it would cause a drop in accuracy, but one that we could consider by investigating false positives -- which I think is a key next step for any modeling of this kind of data regardless. The real issue, if there is one, would be in the assignment of coefficients. If there is a large enough number of patrons who 1) were inactive after 1898 yet 2) otherwise borrowed the same authors as patrons who did borrow Tarkington, then the assignment of coefficients would devalue the true predictors of a taste that includes Tarkington. In the resulting model, the chief predictors of Tarkington borrowers -- given that the imbalance here is temporal -- would skew more towards authors who were popular during the years that Tarkington books were available (and popular). I'm not sure that this is the case. Several of the authors most predictive of Tarkington borrowing did indeed publish popular books in the years immediately before/after 1899 (though by no means all), but their books tend to have quite a bit more in common with Tarkington as well in genre, style, moral, or setting. And this gets us back to the conceptual question of the relation between taste and contemporary/popular consumption.
Taste (or utility) can change, as you say, though we'd generally expect it to do so gradually rather than all at once. Models with enough observations and variables have the capacity for sensitivity to "changes" of this kind if they are in fact characteristic, especially for changes that are common in the population (such as, for example, growing from a teenager into a young adult). Filtering down the number of observations would limit this capacity. Indeed, though I frankly expected models to perform better when I limited the data to patrons with a larger number of checkouts, I found that this made the models perform worse by every metric no matter how I adjusted other model parameters. I am generally very hesitant to resort to the "the proof is in the pudding" evaluative approach that is, I think, more common in digital humanities than it ought to be. In this case, given the nature of the data, the research question, and the variety of parameters tested, my inclination is to believe that the additional observations of a more expansive subset communicate more about taste than their exclusion would. You've suggested some much more nuanced ways, though, for subset selection (like checkout spread over time) that I think could strengthen and better regularize the methodological grounding without making the sacrifices of scale that I'm worried about. Certainly it seems to me that the category of borrowers-of-5-to-10-authors warrants further investigation in its own right.
My apologies for my own overly-long reply. I really appreciate your thoughtfulness in working through these issues!
Thanks for the interesting talk on taste modeling @azleslie.
One of the challenges of your approach is the need to build separate models for each author. I wonder if you have considered using models that are suited to latent factor modeling of both users and authors. My main research area is recommender systems, which use a very similar kind of data. Matrix factorization is very commonly used to do this type of modeling. These models allow commonalities across authors (like some of the ones you noted) to be extracted as factors and analyzed. It is also possible to build in biases in the user and author dimensions such that the properties of popularity can be most easily modeled. I'd be happy to talk more with you about this kind of approach if you're interested.
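As a rough illustration of the idea, here is a small biased matrix factorization trained by stochastic gradient descent on a toy 0/1 patron-by-author matrix. Everything in it (the matrix, the hyperparameters, the function names) is a hypothetical sketch rather than a recommendation for this dataset; a real analysis would more likely use an established library and a loss suited to implicit data.

```python
import random

def factorize(matrix, n_factors=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit prediction = mu + b_user[u] + b_item[i] + dot(P[u], Q[i]) by SGD."""
    rng = random.Random(seed)
    n_users, n_items = len(matrix), len(matrix[0])
    mu = sum(map(sum, matrix)) / (n_users * n_items)  # overall checkout rate
    b_user = [0.0] * n_users  # patron bias: how much a patron borrows overall
    b_item = [0.0] * n_items  # author bias: overall popularity
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    cells = [(u, i) for u in range(n_users) for i in range(n_items)]
    for _ in range(epochs):
        rng.shuffle(cells)
        for u, i in cells:
            pred = mu + b_user[u] + b_item[i] + sum(a * b for a, b in zip(P[u], Q[i]))
            err = matrix[u][i] - pred
            b_user[u] += lr * (err - reg * b_user[u])
            b_item[i] += lr * (err - reg * b_item[i])
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, b_user, b_item, P, Q

def predict(model, u, i):
    mu, b_user, b_item, P, Q = model
    return mu + b_user[u] + b_item[i] + sum(a * b for a, b in zip(P[u], Q[i]))

def mse(model, matrix):
    """Mean squared error of the model over every cell of the matrix."""
    errors = [(matrix[u][i] - predict(model, u, i)) ** 2
              for u in range(len(matrix)) for i in range(len(matrix[0]))]
    return sum(errors) / len(errors)

# Toy matrix (invented): two groups of patrons with distinct author preferences.
checkouts = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
untrained = factorize(checkouts, epochs=0)
trained = factorize(checkouts)
```

After training, the rows of Q place authors with shared readers close together, and b_item absorbs popularity, so a single model covers every author at once.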
Another question: It sounds like you were treating the reader/author relation as binary-valued, but of course, a reader might check out multiple books by a given author. There are modeling methods for count data (Poisson regression comes to mind) and these can also be applied in matrix factorization. Do you think there may be some signal in the repeated checkouts that a binary method is not capturing? This might explain why you don't get as much information from the "power users" as you were expecting.
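To show what a count model would look at that a binary one discards, here is a minimal Poisson regression with one predictor, fitted by Newton's method. The data are invented: x marks whether a patron borrowed some related author, and y is the number of checkouts of the target author. The function name and setup are assumptions for illustration only.

```python
import math

def fit_poisson(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton's method; returns (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Entries of the negated Hessian.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: x = borrowed a related author (0/1),
# y = number of checkouts of the target author.
x = [0, 0, 0, 1, 1, 1]
y = [1, 0, 2, 3, 5, 4]
b0, b1 = fit_poisson(x, y)
rate_ratio = math.exp(b1)  # multiplicative change in expected checkouts
```

In a binary encoding the patrons with 3, 5, and 4 checkouts would all be recorded as 1, the same as the patrons with 1 or 2; the rate ratio is one way to recover that intensity.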