"What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results"

Tina Gross and Arlene Taylor.  College & Research Libraries, 66(3): May 2005, pp. 212-230.

Mark Lindner

LIS 577 6 February 2006

Previous take

Can we do away with subject headings? Only if we keep 'Moral minimalism and libraries'  Late May 2005

¤

I first wrote about this article in late May 2005 immediately after receiving it in the mail. I commented mostly on the "politics" of not naming who made the forthcoming helpful suggestion.

¤

Introduction

Someone suggested that most users search by keywords and SHs "could be removed from catalog records to save space and cost" (212, emphasis mine).

This study asks:

¤

There is an assumption that most searches are by keyword (as we'll see...)

Suggestion

"This attitude has led to the suggestion (in at least one academic library) that subject headings should be stripped from the bibliographic records in the catalog. The argument was that thousands of subject headings needlessly take up gigabytes of space because users hardly ever search for subject headings. (And an unspoken cost saving, of course, would be that catalogers would not need to provide subject headings for new records.)" (213)

¤

Context

¤

"Any intelligent man..." was a prevalent attitude in the 1830s, as reported, and it remained prevalent through most of the 20th century

"Any intelligent man who was sufficiently interested in a subject to want to consult material in it could just as well use author entries as subject, for he would, of course, know the names of all the authors who had written in his field" (212, as cited by Ruth French Strout 1956).

Received wisdom

"Many catalog use studies have shown that most searches are for known items or at least for a known author" (212).

  • What is this saying/not saying? Does 50.1% = most?
  • A few show subject searches as "majority"
    • Primarily from public libraries
    • Tendency is to ignore them.

Surprised librarians

Early 90s - OPACs: "Many librarians were quite surprised to learn from various transaction log studies that a high proportion of searches in catalogs was for subject matter" (212-3).

Early "subject searching" - apparatus of today not in place: Fewer fields were included in keyword searches. At one time, subject headings (along with a few other fields besides title) were not included in keyword searches. What is the impact of that knowledge on our research question?

Prevalence of received wisdom

Assumption of "known item or author" search is prevalent in our literature. Carole Palmer was discussing it in her Use and Users class last Thursday. I showed her this article right after class. Then as I was leaving she told me to give the Gross and Taylor citation to Alan Renear who had just brought up the prevalence of this received wisdom in relation to some of his work.

These are our stories, and often our motivating stories, and our assumptions. It is critical to understand them as such, and to understand their reach.

Keyword searching in 2005

Point: Most fields can be searched as keywords BUT which are searched as keywords is highly variable. [Applies to how widely applicable the results are, among larger implications.]

¤

Literature Review

Research question restated

Take an initial step towards finding the answer to:

How?

¤ None

Methodology

Terms taken from keyword searches in a SC university library

Re-ran as keyword searches in PittCat (U of Pittsburgh OPAC)

¤ None

Mo' Methods

Stopwords removed: a, an, and, by, for, from, in, of, on, or, the, to, with (216)

Limit to English language only

Provisional acquisition records with minimal bibliographic records

¤

Stopwords

Impact of/on non-English-language materials

Limit to English-language only materials because "the vast majority of bibliographic records for foreign-language materials with English-language subject headings could only contain many of the English-language search terms from the sample in their subject headings" (216).

In some cases, 100%

SHs (more?) important in case of a high percentage of non-English-language materials in collection

  • Removed to broaden applicability of study to libraries with fewer non-English-language materials
  • But decreases result being looked at
  • BUT, what if we were interested in the importance of SHs for the retrieval of English-language searches in a primarily non-English-language collection?

Provisional acquisition records

  • Could not be excluded
  • Also decreases result being looked at

¤

Data retrieved

Number of hits with all keyword(s) anywhere

 

Number of hits with all keyword(s) and at least one in subject, but not all in title

 

Number of records (or of the first fifty records) with at least one keyword in subject only

¤ None

Making data manageable

Second search to reduce hits
Search for:                         Search by:
metal sculpture    all of these     Keyword Anywhere
AND
metal sculpture    any of these     Subject
NOT
metal sculpture    all of these     Title

"Because keywords can still appear in many fields (subject, title, author, series, notes, publication, physical description, etc.) it was still necessary for us to view the remaining hits" (217).

¤

Reproduction of Figure 2 "Second Search Performed to Reduce Hits Needing to be Viewed Manually"
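
The second search in Figure 2 can be read as a Boolean filter over records. What follows is only a minimal sketch of that logic, assuming each record is a dict of field names to text. The `tokens` helper, the record layout, and the sample records are all hypothetical and do not reflect how PittCat actually indexes fields.

```python
import re

def tokens(text):
    """Lowercase word tokens; a simplification of how an OPAC indexes a field."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def second_search(records, keywords):
    """Keep records with all keywords anywhere AND any keyword in subject,
    but NOT all keywords in title."""
    kws = {k.lower() for k in keywords}
    kept = []
    for rec in records:
        anywhere = set().union(*(tokens(v) for v in rec.values()))
        subject = tokens(rec.get("subject", ""))
        title = tokens(rec.get("title", ""))
        if kws <= anywhere and kws & subject and not kws <= title:
            kept.append(rec)
    return kept

# Hypothetical records for illustration only.
records = [
    {"title": "Metal sculpture today", "subject": "Metal sculpture"},
    {"title": "Welded forms", "subject": "Metal sculpture"},
    {"title": "Metal casting", "subject": "Sculpture"},
    {"title": "Sculpture in metal and stone", "subject": "Art, Modern"},
]
reduced = second_search(records, ["metal", "sculpture"])
print([r["title"] for r in reduced])
```

In this toy set, "Metal casting" survives the filter even though one keyword sits in its title, which is why the remaining hits still had to be viewed by hand.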

¤

Mo' management

 

If retrieved set still over 50 hits, used first 50 hits (not sampling)

Assumption: Recent = More Relevant

¤

At least 2 issues with this assumption.

¤

Assumption of Recent = Useful. Another unexamined story we tell ourselves?

Diachronic consistency:

Were titles more (or less) descriptive in the past?

Are as many subject headings assigned? More? Fewer?

Are older SHs updated to reflect current terminology?

...

¤

Final methods

Determine (or extrapolate [sets > 50]) number of hits with all keywords in a record, and with at least one in SH, but not all in title

Final step: Determine percentage of hits missed out of total number if there were no SHs
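
The arithmetic of these final steps can be sketched as follows. This is a rough illustration only: it assumes the extrapolation for sets over 50 is a simple proportional scaling from the records viewed, which these notes do not spell out. The function name, its parameters, and the example counts are hypothetical.

```python
def percent_lost(total_hits, reduced_hits, viewed, subject_only_in_viewed):
    """Estimate the share of a search's hits that would vanish without SHs.

    total_hits: hits with all keywords anywhere in the record
    reduced_hits: hits left after the second (narrowing) search
    viewed: records inspected by hand (all of them, or the first 50)
    subject_only_in_viewed: inspected records with a keyword only in SHs
    """
    if total_hits == 0 or viewed == 0:
        return 0.0
    # Assumed proportional extrapolation from the viewed records
    # to the whole reduced set (a no-op when every record was viewed).
    estimated_lost = reduced_hits * subject_only_in_viewed / viewed
    return 100.0 * estimated_lost / total_hits

# Hypothetical counts, not taken from the article.
example = percent_lost(total_hits=200, reduced_hits=120,
                       viewed=50, subject_only_in_viewed=30)
print(round(example, 1))
```

When the reduced set has 50 or fewer hits, `viewed` equals `reduced_hits` and the estimate is just the observed count divided by the total.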

Findings

Hits lost in the absence of SHs

Average proportion of lost hits increases as number of keywords goes up to 3

¤

Mean: Average - quotient of the sum of several quantities and their number

Median: Middle value of a series of values arranged in order of size
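
As a small illustration of the two definitions, the sketch below applies Python's standard `statistics` module to the nine percentages listed in Table 4 of these notes. Those nine are the high-loss exceptions, so the results describe that short list only, not the full sample of 186 searches.

```python
from statistics import mean, median

# Percent-lost values for the nine searches listed in Table 4 (220).
pct_lost = [100.0, 98.8, 92.7, 92.7, 82.8, 78.6, 77.3, 71.4, 71.4]

avg = mean(pct_lost)    # sum of the values divided by how many there are
mid = median(pct_lost)  # middle value once the list is sorted

print(round(avg, 1), mid)
```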

¤

Table 3 (220)

 

Results by Number of Keywords in Search
                    All Searches   1 KW     2 KW     3 KW     4 or More KW
# of searches       186            44       98       30       14
Median # of hits    66             390      57.5     39.5     9
Avg % lost          35.9%          26.0%    37.3%    44.9%    38.0%
Median % lost       30.2%          19.7%    36.6%    34.7%    26.5%
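
One quick consistency check on Table 3: the overall average should equal the per-group averages weighted by the number of searches in each group. The sketch below uses only the figures in the table; the variable names are mine.

```python
# (number of searches, average % lost) per keyword-count group, from Table 3.
groups = {
    "1 KW": (44, 26.0),
    "2 KW": (98, 37.3),
    "3 KW": (30, 44.9),
    "4 or More KW": (14, 38.0),
}

total_searches = sum(n for n, _ in groups.values())
weighted_avg = sum(n * avg for n, avg in groups.values()) / total_searches

print(total_searches, round(weighted_avg, 1))
```

The group counts add up to the 186 searches in the "All Searches" column, and the weighted average rounds to the reported 35.9%. The medians cannot be recombined this way, since a median depends on the individual values rather than on group summaries.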

¤ None

Outliers or exceptions: Table 4 (220)

Individual Searches with High % of Hits Lost w/o SHs
Keywords                    # of Hits   % of Hits Retrieved That Would Be Missed w/o SHs
airplanes military parts    23          100%
businesswomen               173         98.8%
divorced people             55          92.7%
baptist united states       916         92.7%
horror films                402         82.8%
mass media politics         372         78.6%
history slang               22          77.3%
storytelling books          65          71.4%
hispanic americans          762         71.4%

¤

Left out column "Number of hits with a keyword in SHs only"

"For about 31.7% of the searches, the percentage of hits with a KW only in a subject field was 50 percent or greater. This means that for about 3 out of every 10 successful KW searches, half or more would not be retrieved if there were no SHs. For about four of every ten successful searches, more than 40% of hits would be lost; and for half of all successful searches, more than a third would be lost" (219-20).

¤

TOCs and Summaries

Positive:

Negative:

¤

Since study was conducted, many English-language monograph records have been augmented with Blackwell's Table of Contents Enrichment Service.

Easier for user to assess relevance of individual records

but, more irrelevant ones are recalled to wade through

Thus, TOCs and summaries

  • Increase the number of hits
  • Decrease the chance of zero hits
  • Reduces precision

For example: metal sculpture (220-1)

"Now yields considerably more hits"

Many among the 1st 25 are there solely due to TOCs and summaries:

  • Jazz modernism: from Ellington and Armstrong to Matisse and Joyce
  • Rapid prototyping casebook
  • Animaculture [book of poems]
  • The wound-dresser's dream

Questions

  • What % of records are enhanced?
  • What % of results are non-monographic?
  • How does this translate to other catalogs?
  • Often these sorts of augmentations are used to argue for no longer needing "expensive" cataloging techniques — thus, doubly important to understand the effects of these sorts of augmentations
  • Shows other quirks and positives and minuses of TOCs and summaries augmenting our records

    This article is a good argument for why we need people educated à la Williamson. We can't just accept that the addition of TOCs and summaries (and other augmentations) is a good thing.

    "...it is essential that all information professionals have a basic knowledge of the principles of subject analysis, and an understanding of their application in indexing and retrieval in online systems of various kinds. ... who are conversant with the characteristics of the catalogs and databases they search and are familiar with their vocabularies (natural language and controlled) and the ways in which they can be manipulated in retrieval (Kesselman 1984)" (Williamson, 82)

    ¤

More sophisticated searching...?

Future research

 

¤

Replicate study in other libraries

  • public and academic (others?)
  • Varying collection sizes
  • Varying amounts of non-English-language materials
  • In catalogs where the native language is not English

Study impact of "augmented" records - comes in varying "strengths"

Study impact on precision of increased recall in various contexts.

¤

Conclusions (223)

If SHs are removed from, and no longer added to, bibliographic records:

¤

Precision not determined for this study - assumption: this 35.9% includes a high proportion of relevant hits

Users doing keyword searches with a high proportion of "false" hits would have few options for reducing this set w/o SHs; thus, SHs are a powerful tool for narrowing searches

Ethical implications

If you accept Ranganathan's 5 Laws as a professional motivating force, and/or you believe that an attempt like Bair's "Toward a Code of Ethics for Cataloging" is a positive development AND you accept the conclusions of this study:

Sources

"What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results"

Tina Gross and Arlene Taylor.  College & Research Libraries, 66(3): May 2005, pp. 212-230.

 

"The Importance of Subject Analysis in Library and Information Science Education"

Nancy J. Williamson.  Technical Services Quarterly, 15(1/2): 1997, pp. 67-87.
