edited by Birger Hjørland and Claudio Gnoli
Keyword

by Marco Lardera and Birger Hjørland

1. Keyword and related terms

The term keyword is one among many related terms that are sometimes considered synonyms. Soergel (1974, 31-4), for example, considered keyword synonymous with descriptor, clueword, cueword, index term, and partly also subject heading. Wikipedia (2020) treats index term, subject term, subject heading and descriptor as keywords [1]. Before we directly address the meaning of keyword in Section 1.6.4, we have therefore chosen to consider a range of related terms. We will argue that keyword is one of many kinds of terms used in information science for describing → documents, and that many of these terms have specific meanings and thus should not be considered synonyms. The understanding of these different meanings assumes an understanding of various kinds of → indexing and retrieval mechanisms. A summary of conclusions about the concepts presented is given in Table 2 at the end of this Section 1.

1.1 Term, index term, free-text term and uniterm

1.1.1 Term

The Oxford English Dictionary (2020b) has no entry for the noun term but defines terminology: “The system of terms belonging to any science or subject; technical terms collectively; nomenclature. Also: the scientific study of the proper use of terms”. This is in agreement with how term is defined by Wiktionary (2020, sense 5): “word or phrase, especially one from a specialised area of knowledge”. Jacquemin and Bourigault (2003, 600-1) characterized “the classical view” of terms as the “dominant approach to termhood [which] stems from the General Theory of Terminology which was elaborated by E. Wüster in the late 1930s, with the Vienna Circle”.

This positivist view assumes that experts in an area of knowledge have conceptual maps in their minds. This assumption is misleading and unproductive because experts cannot build a conceptual map from introspection. Terminologists constantly refer to textual data. […]

In information retrieval (IR), term has, however, a much broader meaning. In full-text indexing the term-document matrix is a table that describes the frequency of terms that occur in a collection of documents (Baeza-Yates and Ribeiro-Neto 2011, 62-3). In this context, a term is every word in a document and in a collection of documents (excluding possible stopwords). This is an extremely important model, and thus “term” (and not, for example, “keyword” or “index term”) in this broad meaning is the dominant terminology in IR.
In the same context, Manning, Raghavan and Schütze (2008, 21-2; italics in original) use the token/type distinction to define term:

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system's dictionary. The set of index terms could be entirely distinct from the tokens, for instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related to the tokens in the document. However, rather than being exactly the tokens that appear in the document, they are usually derived from them by various normalization processes.

Anderson and Pérez-Carballo (2005, 1.58, 18-19) emphasized that a term may consist of more than one word:

A term is a word or a phrase representing a single concept or multiple concepts that are tightly bound together in the context of a particular IR database […]. Some concepts need more than one word to express them, for example, information science or venetian blind. Some terms could be divided into two separate terms, but they are used so commonly together in a consistent order that they are considered a single bound term or compound term.

Because of the issue pointed out by Anderson and Pérez-Carballo, traditional “classical databases” distinguish between “word indexing”, “phrase indexing” and combined word and phrase indexing (whether “information science” should be found under the single words “information” and “science” or only under the phrase as a whole) [2].

1.1.2 Index term

Baeza-Yates and Ribeiro-Neto (2011, 61-2; italics in original) wrote that index term has different meanings in search engine design versus in library and information science:

An index term is a word or group of consecutive words in a document. In its most general form, an index term is any word in the collection. This is the approach taken by search engine designers. In a more restricted interpretation, an index terms [sic!] is a preselected group of words that represents a key concept or topic in a document. This is the approach taken by librarians and information scientists.

However, as we saw in Section 1.1.1, in full-text IR (including search engine design) it is “term” rather than “index term” that is the common expression. We therefore suggest that index term is limited to the meaning which Baeza-Yates and Ribeiro-Neto defined for library and information science: a preselected word or group of words that represents a key concept or topic in a document (synonymous with keyword, as presented below). Index term is a hypernym for subject heading, descriptor, derived index term, uniterm, keyword and tag.

1.1.3 Free-text term

Anderson and Pérez-Carballo (2005, 1.61, 19) give this definition of free-text term:

Often shortened to free text, free-text term usually refers to the use of uncontrolled words or terms from natural language text for indexing or searching. When one searches the actual text of a document, one is searching the free-text terms that are found in the document. The difference between free-text terms and just terms is that sometimes terms may be standardized, at least a little, with respect to format, and they may also have links with the most common synonyms or equivalent terms, even if they are not controlled to the extent of descriptors.
Free-text terms are thus opposed to controlled terms (either lightly normalized or derived from a controlled vocabulary (CV), such as a → thesaurus). See also Bondi and Scott (2010).

1.1.4 Uniterm

Unit term was coined by Taube, Gull and Wachtel (1952) for use in post-coordinate indexing and has since often been called uniterm. It involved only minimal processing of natural-language terms, in which, for example, the s marking the plural form of nouns was removed. It represented a low-level form of indexing, like modern keywords [3]. It was studied in the Cranfield Experiments (Cleverdon 1962), in which the authors compared the Uniterm system to the classic library classification schemes (UDC, faceted classifications, and an alphabetical system). The results, although controversial, showed that a uniterm-based search strategy scores better than the classic systems in both precision (the fraction of the retrieved items that are relevant) and recall (the percentage of the relevant items retrieved), which came as something of a shock for the documentation profession at the time. The term uniterm has, however, never reached a clear definition. Costello (1961, 20) wrote: “An intensive study of the literature on Uniterm indexing has revealed that since Dr. Mortimer Taube originally published details of his system there have developed nearly as many different definitions and interpretations of uniterms and uniterm indexing as there are uniterm systems and documentalists responsible for their operation and maintenance”. It seems that uniterm may today be considered a synonym for keyword.

1.2 Heading and subject heading
Table 2: Summary of conclusions about the concepts presented in Section 1

Concept | Meaning
Term | Broad meaning (IR): Any word in a document that is a candidate for indexing, including in algorithmic full-text retrieval. Narrow meaning: A word or phrase from the terminology of a given field. (E.g., in the field of knowledge organization, “classification” and “controlled vocabulary” are terms.)
Index term | Broad meaning: Synonym for term (any word in a document used for, or potentially useful for, indexing). Narrow meaning: A word or phrase used for indexing; a single-word index term corresponds in searching to a keyword. In the narrow meaning, index term is a hypernym for subject heading, descriptor, derived index term, uniterm, keyword and tag.
Free-text term | Term as it appears in the documents to be indexed, as opposed to terms from controlled vocabularies. |
Uniterm | Free-text terms with a low-level degree of normalization (e.g., color is a uniterm for each of the words “color”, “colors”, “colour”, “colours”). |
Heading | The general term for information used as an access point in printed catalogs, bibliographies and browsable lists. Also used in the term HTML heading and in OPACs (here probably because OPACs use subject heading systems developed in the era of printed indexes). A heading may be a single term or a string of terms.
Subject heading | Subcategory of heading used as a subject access point (SAP). Normally from a controlled vocabulary (a subject heading system) with pre-coordinated terms (as opposed to descriptors). |
Descriptor | A keyword from a controlled vocabulary designed for post-coordinative indexing (as opposed to subject heading). |
Concept symbol | A symbol associated with a given meaning by a community of users. Specifically: The meaning associated with a given bibliographical reference by a set of citing papers. (E.g., “Hulme (1911)” as symbol for the text that introduced the concept 'literary warrant'.) |
Tag | User-generated keyword, implemented by many online platforms that let users describe resources using their own personal vocabulary, sometimes with the help of a recommendation system that suggests potentially relevant keywords.
Word | In information science: A sequence of characters surrounded by blanks or punctuation. More than one consecutive word can be treated as a unit (a phrase) in phrase indexing. (E.g., the term information retrieval consists of two consecutive words, which can be 'phrase indexed' to function as one expression.)
Stopword | A common word considered non-relevant for describing the contents of a document and included in a stopword list. |
N-gram | In general: A contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, characters, words, or base pairs according to the application. In the present context: Any sequence of characters of a given length. |
Keyword and keyphrase | A word selected for its significance for describing the content of a document, whether the selection is done by a human or a machine, whether it is a controlled or uncontrolled term, and whether it is derived from or assigned to a document. A keyword may be derived from a subject heading, but subject heading and descriptor are terms related to specific kinds of indexing languages. A keyword is different from a full multiterm subject heading or index term, which is called a keyphrase. Both keywords and keyphrases are sometimes included in the term key terms.
It follows from the definition of keyword given above that its primary function is to characterize documents in order to improve the findability of the documents described (or to find pages or passages within documents). As such, keywords represent one among many kinds of indexing languages.
In printed books, indexes may improve the findability of the pages on which a particular subject or concept is discussed, but not of every page where a given word appears (which is more the task of concordances) [19]. The difference between “word indexing”, “concept indexing” and “subject indexing” is important, as first pointed out by Bernier (1980) and developed by Hjørland (2018, §3.6). In printed books, keywords may also (in combination with tables of contents, section headings and internal references) support browsing, for example, when keywords are used as margin notes or given different kinds of typographical highlighting.
In printed journal indexes, pre-coordinated keywords (subject headings) were mostly used and, as argued by Milstead (1984, 187), pre-coordination is the only appropriate method in the print environment. Computers compare lists of document numbers posted to keywords to locate a combination of concepts, but this is not practical for human beings to do. (But sometimes printed indexes are made from post-coordinate indexes because the producers do not want to produce two different indexes.)
In electronic bibliographic databases, post-coordinate indexing was introduced, allowing flexible combination of keywords (“descriptors”), but also introducing the problem known as “false drops” [20]. These databases typically provide many kinds of SAPs (e.g., descriptors from controlled vocabularies, free-text terms in titles and abstracts, and author-supplied keywords). Much research has studied the relative contributions of such different kinds of subject access points (cf. Hjørland and Kyllesbech Nielsen 2001; McJunkin 1994).
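As a simple illustration of how post-coordinate searching can produce false drops, the following minimal sketch (with an invented miniature collection, not drawn from the studies cited above) intersects the postings of single keywords with Boolean AND:

```python
# Minimal sketch of post-coordinate (Boolean AND) retrieval and a "false drop".
# The document numbers and keyword postings are invented for illustration only.

postings = {
    "gold mining":    {1, 3},
    "diamond mining": {3},
    "united states":  {2, 3},
    "south africa":   {1, 3},
}

def boolean_and(*keywords):
    """Return the documents indexed with every one of the given keywords."""
    sets = [postings.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

# Document 3 is about gold mining in South Africa and diamond mining in the
# United States, yet it is retrieved for "gold mining AND united states"
# because the single terms carry no information about how they combine:
# a classic false drop.
print(boolean_and("gold mining", "united states"))   # -> {3}
```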
In picture indexing, two approaches have been introduced (cf. Chu 2001): (1) “content-based indexing” (techniques for indexing and retrieving images based on, among other features, color, shape, and texture); (2) “description-based indexing” (techniques based on manually assigned captions, keywords, and other descriptions). In both cases keywords may be used, but in order to connect a keyword to a picture, the word cannot be derived as from texts; it must be assigned at some level of abstraction in analyzing the picture (from color and shape at the lowest level to artistic style, such as impressionism, at the highest level of abstraction).
Besides these functions, keywords and keyphrases have become important concepts in information technology, where they are used for text summarization, abstracting, ontology construction, recommender systems, text analysis, browsing [21], subtitling (e.g., of videos) and more. This is a highly active field of research in the second decade of the 21st century.
The authors of the present article write from the understanding that, in all contexts and for all purposes, keywords should not just be understood in a document-centered way as a “semantic condensation” of the content of a document, but should be understood in a request-oriented or policy-oriented way as an indication of the subject of a document (cf. Hjørland 2017, §2.4): a document does not have a subject, but is assigned one or more subjects depending on the purpose of the indexing. There can always be different perspectives on a given text, and the assignment of keywords cannot avoid this subjectivity (more about this in Section 5). Therefore, keywords should not be considered more or less wrong or correct in relation to a given document, but more or less functional for facilitating given interests.
Some kinds of indexes contain the word “keyword” in their name. A well-known example is the KWIC (KeyWord In Context) index, normally attributed to Hans Peter Luhn, who demonstrated his system at IBM in 1958 and published an article about it in 1960 [22]. KWIC is described by Manning and Schütze (1999, 31) as a “concordancing program”, a program that can produce a concordance (cf. endnote 19). The KWIC index is one of the many kinds of “title derivative indexing techniques” (cf. Feinberg 1973), although it has also been used, for example, for full-text documents and for indexes in thesauri. KWIC indexes are probably the oldest kind of → automatic indexing systems. The steps in creating a KWIC index are, in outline: extracting the significant words from each title (excluding stopwords), generating one index line per significant word with that word aligned in a fixed column and the rest of the title displayed around it, and sorting the lines alphabetically by keyword.
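As an informal illustration of these steps, the following minimal sketch (the sample titles and stopword list are invented for the example) produces a simple KWIC display from a list of titles:

```python
# Minimal sketch of KWIC (KeyWord In Context) index generation.
# The sample titles and the stopword list are invented for illustration only.

STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "on"}

def kwic_index(titles, width=30):
    """Return KWIC lines sorted by keyword, each keyword aligned in a fixed column."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOPWORDS:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            # The keyword starts in a fixed column; its context wraps around it.
            entries.append((word.lower(),
                            f"{left[-width:]:>{width}}  {word.upper()}  {right[:width]}"))
    return [line for _, line in sorted(entries)]

for line in kwic_index(["Automatic indexing of scientific documents",
                        "Keyword extraction in information retrieval"]):
    print(line)
```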
To deal with complaints raised against this format, two other formats (KWOC and KWAC) were developed (Van der Veer Martens 2023). In the KWOC index (Key Word Out of Context) the keywords are displayed outside the title, usually at the beginning of the line.
Finally, a KWAC index (“Key Word Alongside Context”, also known as “Keyword and Context” and “Key Word Augmented-in-Context”), as described by Anderson and Pérez-Carballo (2005, 261), preserves word pairs and phrases, but words preceding the keyword are no longer contiguous [25].
Another kind of keyword index is the Keywords Plus used by the Web of Science database. It takes keywords from the references in a given article. The Clarivate Analytics homepage [26] writes:
The data in KeyWords Plus are words or phrases that frequently appear in the titles of an article's references, but do not appear in the title of the article itself. Based upon a special algorithm that is unique to Clarivate Analytics databases, KeyWords Plus enhances the power of cited-reference searching by searching across disciplines for all the articles that have cited references in common.
The system is described in the literature by Garfield (1990a; 1990b), Garfield and Sher (1993) and Zhang et al. (2016). The last paper found (967): “Keywords Plus terms were more broadly descriptive [compared to author-assigned keywords]. Keywords Plus is as effective as Author Keywords in terms of bibliometric analysis investigating the knowledge structure of scientific fields, but it is less comprehensive in representing an article's content.” Further (971): “Keywords Plus terms emphasized research methods and techniques, whereas Author Keywords tended to hone in on specific diseases and conditions”, or generalized: Author keywords may be best at identifying documents about given topics and Keywords Plus best at identifying papers about specific research methods.
Such keyword-based indexes had the advantage of being very fast and inexpensive to generate using computers. The capabilities of full-text search engines have made such indexes less necessary, and they have been in decline. However, the automatic generation of keywords and keyphrases from full text is, as shown later, a highly active research field today. Also, some attempts have been made to revitalize the KWIC technique as an instrument to enhance the filtering of results in web search engines, displaying the most frequent contexts of the search query and letting the user select one of them (Käki 2006).
Hjørland (2011) made a distinction between four kinds of indexing based on the dichotomies of human versus computer-based indexing and derived versus assigned indexing (or extractive versus abstractive methods). These four kinds, which may also be used to classify kinds of keywords, are: (1) human-based derived indexing, (2) human-based assigned indexing, (3) computer-based derived indexing, and (4) computer-based assigned indexing.
This classification, therefore, takes into consideration two dimensions: the origin of the keyword (derived or assigned) and the typology of the agent (a human being or a machine).
For both the assigned and the derived approaches, the task is to find the best possible word(s) [28] to represent the contents of a document (from a certain perspective and for a certain purpose). The derived approaches are more restricted in that only terms appearing in the documents can be used. Assigned indexing using a CV is correspondingly restricted by the terms in the CV. This distinction is, however, less important than the deep problem: to express the content of a document in a way that increases the findability of the document for those users who might benefit from using it.
Human-based keywords, regardless of whether they are assigned or derived, can also be classified according to the identity of the actor who associates them with a document (for example, the author, a professional indexer, an editor, or a reader; see further Section 5).
Holstrom (2019) further differentiated professional indexers, domain experts and casual indexers and expressed the view that each of these categories of indexers has different strengths and weaknesses (see further the discussion in Section 5).
There is a huge literature on automatic keyword extraction, which will here only be presented very selectively and briefly (some approaches, such as KWIC indexing, were already presented in Section 2.1). In the literature, the main [29] approaches suggested (e.g., Siddiqi and Sharan 2015; Bharti et al. 2017) may be classified as (1) statistical approaches; (2) linguistic approaches; (3) machine learning approaches; (4) domain-specific approaches [30]. (In addition, there are hybrid approaches, and there is often strong overlap among the methods used in specific applications.)
Automatic generation of keywords does not necessarily mean that keywords are automatically assigned to documents. In many cases, the generated keywords are only used as suggestions for the user, who can use their human intellect to decide which keywords to accept and which to discard (so-called “tag recommendation”). See, for example, Subramaniyaswamy and Chenthur Pandian (2012).
Among the earliest work in this set of approaches was Luhn (1957), who suggested that terms found repeatedly in a document were appropriate as index terms for that document. Luhn thus based automatic indexing on in-document frequency counts. A next step was taken by Edmundson and Wyllys (1961), who developed a “term significance formula” which, in addition to considering word frequency within a given document, also used a “reference” frequency, measuring words' relative frequency within a document collection. Spärck Jones (1972; 1973) developed the term frequency-inverse document frequency (TF-IDF) score to calculate the importance of indexing terms. It is based on the idea that terms with a high frequency in a document, but which are rare in other documents, have the highest value as indexing terms. The application of TF-IDF was further developed by Salton and Yang (1973) and Salton et al. (1975).
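To make the idea concrete, the following minimal sketch (using one common variant of the TF-IDF weighting; the toy corpus is invented for illustration) scores candidate keywords for a document:

```python
# Minimal sketch of TF-IDF keyword scoring (one common variant of the weights).
# The toy corpus is invented for illustration only.
import math
from collections import Counter

corpus = [
    "keyword extraction from scientific documents",
    "automatic indexing and keyword assignment",
    "image retrieval based on colour and shape",
]
tokenized = [doc.split() for doc in corpus]

def tf_idf(doc_tokens, all_docs):
    """Score each term: frequent in this document, rare in the rest of the collection."""
    n_docs = len(all_docs)
    counts = Counter(doc_tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for d in all_docs if term in d)       # document frequency
        idf = math.log(n_docs / df)                      # rare terms get a higher idf
        scores[term] = tf * idf
    return scores

# Rank candidate keywords for the first document.
scores = tf_idf(tokenized[0], tokenized)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
```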
Whereas most approaches have been based on the “bag of words” [31] principle (disregarding word order and grammar), some approaches have utilized information about word placement in documents. The use of word co-occurrences for selecting keywords or keyphrases takes a step in this direction. See, for example, Bullinaria and Levy 2007; Garg and Kumar 2018; and Lancia 2008.
There are, however, more direct ways in which document structure has been applied for keyword extraction. Siddiqi and Sharan (2015) presented seven such approaches [32], but unfortunately without references to further information. It seems obvious for future research to combine the field of genre studies (or composition studies) with empirical research on keyword selection.
This set of approaches uses various linguistic analyses and tools for keyword detection and extraction, such as lexical analysis, syntactic analysis, and discourse analysis. Different types of words and word sequences do not have the same value as keywords. For instance, nouns and adjectives are more likely to appear in keywords than other kinds of words such as adverbs and determiners, as they tend to provide more information about a given text. Therefore, extraction tools may exploit morphological and syntactical features of textual units. Firoozeh et al. (2020, 272) write:
Extraction methods rely not only on plain words (tokens) but also on their lemmatized forms, their parts of speech (POS tag), and some of their morphological features, such as gender or number (singular, plural). […] When the extracted keywords are to be presented to human users, they are mainly retained without any inflection so as to satisfy the citationess property.
See further about linguistic approaches in Jacquemin and Bourigault (2003).
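As an informal illustration of such linguistic filtering, the sketch below (using the NLTK toolkit; the sample sentence and the decision to keep only nouns and adjectives follow the general idea described above, not any specific system) selects keyword candidates by part of speech:

```python
# Minimal sketch of linguistically filtered keyword candidates:
# keep nouns and adjectives, which tend to be more informative than other words.
# The sample text is invented for illustration only.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def candidate_keywords(text):
    """Return lowercased tokens tagged as nouns (NN*) or adjectives (JJ*)."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)           # Penn Treebank part-of-speech tags
    return [word.lower() for word, tag in tagged
            if tag.startswith("NN") or tag.startswith("JJ")]

print(candidate_keywords(
    "Automatic keyword extraction exploits the syntactic structure of scientific texts."))
```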
Machine learning is a kind of artificial intelligence. Often three forms of machine learning are distinguished: Supervised learning, semi-supervised learning, and unsupervised learning.
Supervised learning methods acquire knowledge from data that a human expert has explicitly annotated with category labels or structural information. Supervised learning requires a large amount of expensive training data. Witten et al. (1999) developed the algorithm KEA (keyphrase extraction algorithm), which learned from author-assigned keyphrases and was evaluated both on the same documents and on new documents that had not been assigned any keyphrases by their authors. What the algorithm learned was to extract phrases based on their features. Two features were calculated for each candidate phrase: (1) TF-IDF (the frequency of a phrase's use in a particular document compared with the frequency of that phrase in the corpus); (2) first occurrence (the number of words that precede the phrase's first appearance, divided by the number of words in the document). Based on these, and many more technical choices, the authors found that KEA on average can match between one and two of the five keyphrases chosen by the authors in the collection. This was considered a good performance because KEA must choose from thousands of candidates and because it is highly unlikely that even another human would select the same set of phrases as the original author.
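A rough sketch of how the two features described above could be computed for a candidate phrase is given below; this is a simplified reconstruction from the description, not the original KEA implementation, and the phrase matching is deliberately naive:

```python
# Simplified reconstruction of the two KEA candidate features described above
# (TF-IDF and first occurrence); not the original KEA implementation.
import math

def phrase_features(phrase, doc_tokens, corpus_docs):
    """Return (tf_idf, first_occurrence) for a candidate phrase in one document."""
    phrase_tokens = phrase.lower().split()
    doc = [t.lower() for t in doc_tokens]
    n = len(phrase_tokens)

    # Positions where the phrase occurs in the document.
    positions = [i for i in range(len(doc) - n + 1) if doc[i:i + n] == phrase_tokens]
    tf = len(positions) / max(len(doc), 1)

    # Document frequency of the phrase over the whole corpus.
    df = sum(1 for d in corpus_docs if phrase.lower() in " ".join(d).lower())
    idf = math.log(len(corpus_docs) / df) if df else 0.0

    # First occurrence: words preceding the first appearance / document length.
    first_occurrence = positions[0] / len(doc) if positions else 1.0
    return tf * idf, first_occurrence

docs = [["keyword", "extraction", "from", "text"], ["automatic", "text", "analysis"]]
print(phrase_features("keyword extraction", docs[0], docs))
```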
These are approaches that mediate between supervised and unsupervised learning. Li et al. (2010), for example, proposed a semi-supervised keyphrase extraction approach which explored title phrases as a source of knowledge, because phrases in the titles of documents are often appropriate as keyphrases. The position of a term has proven to be quite an effective feature in phrase extraction (Witten et al. 1999); therefore, terms located in the title are ranked higher. Li et al. used Wikipedia to compute the semantic relatedness between terms and thereby suggest keyphrases related to title terms.
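The core intuition, namely ranking a candidate higher when it is closely related to the title, can be sketched as follows; this is a simplified illustration, not the method of Li et al., and the token-overlap relatedness function is only a crude stand-in for the Wikipedia-based measure they used:

```python
# Simplified illustration of title-boosted candidate ranking; the token-overlap
# relatedness below is a naive stand-in for a Wikipedia-based relatedness measure.

def relatedness(phrase, title):
    """Crude relatedness: share of the phrase's tokens that also occur in the title."""
    p, t = set(phrase.lower().split()), set(title.lower().split())
    return len(p & t) / len(p) if p else 0.0

def rank_candidates(candidates, base_scores, title, boost=1.0):
    """Rank candidates by base score plus a boost for relatedness to the title."""
    scored = {c: base_scores.get(c, 0.0) + boost * relatedness(c, title)
              for c in candidates}
    return sorted(scored, key=scored.get, reverse=True)

title = "Semi-supervised keyphrase extraction from research articles"
candidates = ["keyphrase extraction", "training data", "research articles", "graph model"]
base = {"keyphrase extraction": 0.4, "training data": 0.5,
        "research articles": 0.2, "graph model": 0.3}
print(rank_candidates(candidates, base, title))
```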
These are approaches that do not require expert human annotation of examples: rather, the algorithm identifies its own keywords by clustering unlabeled examples into coherent groups. Many different unsupervised approaches have been applied to keyphrase extraction. Among them are graph-based approaches; an overview of these is given in Beliga et al. (2015). On p. 3 they explain:
A graph is a mathematical model, which enables the exploration of the relationships and structural information very effectively. […]. For now, in short, document is modelled as graph where terms (words) are represented by vertices (nodes) and their relations are represented by edges (links) […]. The edge relation between words can be established on many principles exploiting different scopes of the text or relations among words for the graph’s construction.
Among the relations mentioned are co-occurrence relations, syntax relations and semantic relations. The authors conclude:
Graph-based methods for keyword extraction are simple and robust in many ways: (1) they do not require advanced linguistic knowledge or processing, (2) they are domain independent and (3) they are language independent. Such graph-based KE techniques are certainly applicable for various tasks: text classification, summarization, search, etc. Due to the aforementioned benefits it is reasonable to expect that graph-based extraction will attract the attention of the research community in the future. It can be expected that many text and document analyses will incorporate graph-based keyword extraction.
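A minimal graph-based extraction sketch along these lines is shown below (using the networkx library, with an adjacent-word co-occurrence window; the sample text and these parameter choices are illustrative only, not prescribed by Beliga et al.):

```python
# Minimal sketch of graph-based keyword extraction: words become nodes,
# co-occurrence of adjacent words becomes edges, and nodes are ranked by
# PageRank. The sample text, stopword list and window size are illustrative only.
import networkx as nx

STOPWORDS = {"the", "of", "and", "a", "in", "for", "is", "are", "as", "by", "their"}

def graph_keywords(text, top_n=5):
    """Rank the words of a text by PageRank over their co-occurrence graph."""
    words = [w.lower().strip(".,;:") for w in text.split()]
    words = [w for w in words if w and w not in STOPWORDS]
    graph = nx.Graph()
    for left, right in zip(words, words[1:]):   # adjacent-word co-occurrence
        if left != right:
            graph.add_edge(left, right)
    ranks = nx.pagerank(graph)
    return sorted(ranks, key=ranks.get, reverse=True)[:top_n]

print(graph_keywords(
    "Graph based keyword extraction models a document as a graph of words "
    "and ranks the words of the graph by their centrality."))
```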
Some approaches use natural language processing technologies to derive keywords, for example, syntactic analyses of text. Kathait et al. (2017) selected nouns and adjectives and combined them with statistical methodologies. Witten et al. (1999) describe an algorithm, KEA, that identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases.
Section 3.3.1 mentioned that supervised machine learning mostly requires domain knowledge in an initial state. A separate set of approaches was mentioned by Siddiqi and Sharan (2015, 19):
Domain specific approaches: Various approaches can be applied to a specific domain corpus, which exploit the backend knowledge related to the domain (such as ontology) and inherent structure of that particular corpus to identify and extract keywords.
Frank et al. (1999) used a machine learning approach to extract key phrases from sets of documents. They found that different key phrases are used in different subject areas and that the exploitation of this domain-specific information significantly improves the quality of the extracted keyphrases. Gazendam et al. (2010) and Medelyan and Witten (2006) used domain-specific thesauri to improve the quality of automatic keyword extraction.
The literature on domain-specific approaches is meager (and from this perspective it is somewhat surprising that Siddiqi and Sharan (2015) included it in their survey). From the perspective of the present authors, it seems rather obvious that keyword selection must be related to different “paradigms” or metatheoretical perspectives in fields of knowledge. It is common knowledge, at least in the humanities, that histories of a given domain cannot be neutral, but are always written from a perspective (whether or not the author acknowledges it or is aware of it). Different histories of music, for example, vary in the terms they select for periodization. Such knowledge must, to the degree that it can be formalized, also be usable in relation to the automatic generation of keywords.
Keywords may be, for example, author-generated, indexer-generated, editor-generated, reader-generated, generated by algorithms (or be absent). Holstrom (2019) called the focus on which actor (subject or person) performed the keyword attribution “an Actor-Based Model for Subject Indexing” and wrote:
Four primary types of actor perform subject indexing work: 1) professional indexers, 2) domain experts [authors in the listing above], 3) casual indexers [readers in the listing above], and 4) machines [i.e., the programmers]. These subject indexing actors all have agency and all act on information objects.
Holstrom also discusses how these kinds of actors may be combined. In special libraries (e.g., the National Library of Medicine) the professional indexers often have an advanced degree in a special domain in addition to their training as information specialists or indexers. Other kinds of actors might be added, for example, fandoms, which are often extremely knowledgeable about the documents being indexed (e.g., in certain genres of fiction and computer games).
An example of an empirical investigation of the different actors' indexing is Lee et al. (2015), which is a study of full-text documents on Alzheimer's disease together with their indexing with MeSH terms in PubMed (MEDLINE) and citation patterns. This article found that different actors in this domain represented different views: MEDLINE indexers emphasize amyloid-related entities, including methodological terms, while authors focus on specific biomedical terms, including clinical syndromes. The complex networks of citing relationships to a certain extent reflected the impact of basic science discoveries in research on Alzheimer's disease.
A debate in library and information science is thus about the effectiveness of keywords assigned by different actors such as authors, indexers, taggers or cited papers (see Section 2.1 about Keywords Plus).
The following kinds of actors are discussed in this section: 5.1 Professional indexers; 5.2 Domain experts/authors; 5.3 Casual indexers (readers, users); 5.4 Machines.
Professional indexers are persons who are employed to index documents as one of their main tasks. They may primarily be trained in information science and knowledge organization, or they may be subject specialists with training in indexing (such training may be general indexing training from a LIS department or in-house training focusing on a specific system, e.g., as in the MEDLINE database [33]). (Indexing by subject specialists without employment or training in indexing is discussed in Section 5.2.) Professional indexers mostly make use of a large controlled vocabulary for assigning keywords to documents. It is extremely difficult to identify studies about the quality of indexing made by professional indexers, as Lancaster (2003, 88) wrote:
Regrettably, not much research has been performed on the factors that are most likely to affect the quality of indexing. An attempt has been made to identify such factors in Figure 35 but this is based more on common sense or intuition than on hard evidence.
Lancaster's Figure 35 (89) classified factors affecting indexing quality into five groups, one of them being “indexer factors”.
Here we are discussing the professional indexer. One of the things we know for sure is that inter-indexer consistency is low, i.e., indexers perform rather differently (see, e.g., Lancaster 2003, chapter 5: “Consistency of Indexing”). It is remarkable that under “indexer factors” Lancaster did not mention anything about the indexers' theories. It seems important, for example, whether indexers have a document-oriented view or a request-oriented view of indexing (see Hjørland 2018, 612-3, → §3.1 and 619, §3.8.2). From the domain-analytic perspective, the most important determinant of people's behavior is the theories that govern them (“theory” to be understood in the perspective of the so-called “theory theory” [34]).
Therefore, instead of supposing that “professional indexers” perform in one specific way, it is more fruitful to suggest that they perform differently depending on, among other things, how they have been trained and which theories and views of both indexing and the domain being indexed have influenced them. Although Lancaster (2003) and others found that theories do not exist in indexing, this is, as discussed by Hjørland (2018), not true (and the relative absence of theoretical discussions in KO may cause a lack of progress in this field).
It has often been assumed that human indexing in general, and professional indexing in particular, is the “gold standard” against which computer-based indexing should be measured. According to what has already been written, this is a problematic assumption because indexers perform very differently. Another issue is that it is complicated to determine which set of index terms is the best for a given purpose: this is not something that human indexers just know, but is, as emphasized by Swanson (1986) [35], a scientific hypothesis in need of examination.
Authors of scientific/scholarly papers sometimes assign keywords to their own articles. In this case, we have highly competent domain experts who are not professional indexers. Author keywords are, like the title and the abstract, a part of the → paratext (Skare 2020), more precisely the peritext of journal articles. Author-assigned keywords have, according to Hartley and Kostoff (2003, 434), been employed by many journals since at least 1975, in order to improve their indexing and to increase the probability of the article being retrieved and potentially cited. This practice seems to be increasing, at least in the biomedical domain (cf. Névéol, Dogan and Lu 2010). There is, however, great variation among journals in the use of such keywords. Some journals do not use author keywords (but may instead provide keywords assigned by editors, or no keywords at all), while others demand that such “keywords” consist of index terms from a controlled vocabulary (Journal of the Association for Information Science and Technology, for example, requires authors to apply terms from the ASIS&T Thesaurus of Information Science, Technology, and Librarianship; Redmond-Neal and Hlava 2005). See further Hartley and Kostoff (2003) about different uses of author keywords among journals.
The number of terms that may be applied also varies considerably. Today, many scientific journals ask authors to provide a certain number of keywords when submitting their articles. Databases such as Web of Science and Scopus have a specific field for author keywords, allowing studies of these [36]. Uddin and Khan (2016) used Scopus to investigate author keywords in the field of obesity and found that in this domain there was a significant relation between the number of author keywords assigned and the citation counts that the articles obtained. The analysis of databases has shown that author keywords may serve different functions by describing different aspects of the document, such as “research topic”, “research method”, “research area”, “data” or other (see e.g., Lu et al. 2019), with the indication of research topic as the most common function.
Caragea et al. (2014, 1443) wrote:
Hence, while we believe that authors are the best keyphrase annotators for their own work, there are cases when important keyphrases are overlooked or expressed in different ways, possibly due to the human subjective nature in choosing important keyphrases that describe a document.
Such subjectivity is, however, unavoidable, and it should be a trivial conclusion about all kinds of keyword annotations, whether they are made by human beings or by computers (programmed by human beings), that they represent different kinds of subjectivity. What seems important is to be able to characterize different kinds of subjectivity and utilize such knowledge for optimizing keyword assignment.
Névéol, Dogan and Lu (2010) is an important study of author keywords in biomedical journal articles. It did not (as is often done) consider MEDLINE indexing as a gold standard against which author keywords are compared. Instead, their aim was to assess whether topics covered by author keywords are also covered by MeSH indexing terms. In their sample of 14,398 articles, which included both author keywords and MEDLINE indexing terms, authors provided on average 5.3 (± 1.9) keywords while indexers assigned on average 13.0 (± 11.9) indexing terms. The authors used Surgenor, Corwin and Clerico (2001) [37] as a running example (see Table 3).
Table 3: MeSH terms and author keywords assigned to Surgenor, Corwin and Clerico (2001)
MEDLINE assigned terms (20 terms) | Author assigned keywords (4 terms)
Adult |
Aged |
Cohort Studies |
Decision Making | decision-making (semantic distance to MeSH measured to 0)
Diagnosis-Related Groups |
Female |
Health Services Accessibility |
Hospital Mortality |
Hospitals, Community / organization & administration* |
Hospitals, Rural / organization & administration* | rural health services (semantic distance to MeSH measured to 0.254)
Humans |
Intensive Care Units / statistics & numerical data* |
Length of Stay |
Male |
Middle Aged |
New Hampshire / epidemiology |
Outcome Assessment, Health Care |
Patient Transfer / statistics & numerical data* | interhospital transport (semantic distance to MeSH measured to 0.361)
Prospective Studies |
Survival Analysis* | survival analysis (semantic distance to MeSH = exact match)
Névéol, Dogan and Lu (2010) further analyzed a sample of 300 pairs of author keywords and indexing terms.
The authors wrote that from the MEDLINE indexing perspective, their results show that the majority (62%) of the author keywords are already covered by an exact or closely related MeSH indexing term in MEDLINE indexing. They also pointed out that because 49% of the author keywords are either not covered in MeSH (33%) or not linked to their equivalent in MeSH (16%), this suggests that author keywords may be helpful for MeSH terminology development.
The question is, of course, whether keywords suggested by authors, but missing in MeSH, or available in MeSH but not used in the indexing of the documents in the sample, are fruitful keywords, that should be included in MeSH and should have been used by MEDLINE indexers.
Névéol, Dogan and Lu (2010, 537) wrote:
Although the assignment of keywords to an article by the author and the assignment of MeSH indexing terms by the indexer may seem like two very similar activities, there are significant differences in terms of form and perspective. Typically, authors are asked to choose a small number of keywords, without reference to a controlled vocabulary; whereas indexers are trained to select indexing terms from MeSH according to a specific protocol. Moreover, in addition to the subjectivity inherent in an indexing task [references omitted], authors are focused on selecting keywords representing what they consider as important to describe the content of their own article. In contrast, indexers consider the article in the larger scope of the collection.
But is this a true dilemma? Is describing the content of an article in conflict with considering the article in the larger scope of the whole medical literature? We should have the working hypothesis that indexing is an activity governed by knowledge, and that the nature and content of this knowledge is subject to research in information science and knowledge organization. The article can be read as if the authors find that author keywords often indicate failures in either MeSH or the indexing. But how do we find out when this is the case? We need to know about the indexing training and instructions of MEDLINE indexers. Unfortunately, their indexing manual is not available to the research community (“NLM Staff access only”) [38]. Looking at Surgenor, Corwin and Clerico (2001) and Table 3 above might perhaps indicate that MEDLINE indexing is too mechanical and rather weak in its conceptualization of the article [39].
There has also been criticism of author keywords. Hahn et al. (2007) proposed the use of automatic procedures in order to avoid kinds of subjectivity in author-assigned keywords. Among their arguments was that “human indexers, even professional ones, are liable to error as well as to the possibility of intrinsic subjective bias […] This is not to say that authors of a structured abstract would consciously cheat, but rather there is a grey area of overstatement and overestimation of one's own results in a highly competitive scientific environment”. Yes, indeed, but the authors did not consider the problems inherent in the automatic procedures, and their conclusion was based on a priori thinking rather than on empirical studies. Lok (2010, 418) also addresses the issue of the authors' role in providing more machine-readable information about their articles. Lok reported an experiment whose results were not encouraging: authors used simple vocabulary that the curators of a protein database did not accept, and Lok claimed: “Authors are not the right people to validate their own claims. The community — referees, editors, curators, readers at large — is still needed”.
It is also relevant to notice that Peset (2020) found that most new keywords introduced by authors in research papers are never used a second time and only 11% of them “survive” more than 3 years after their first appearance.
Some further studies of author keywords are reported in an endnote [40].
Reader-assigned keywords are provided by the readers of documents, who can be of various ages and backgrounds. Some have considerable general knowledge about the domain of the document in question, whereas others may not be very familiar with the domain. As stated in Section 5.0, fandoms are often extremely knowledgeable about the documents being indexed, but such indexing is mostly limited to popular genres of, for example, fiction and computer games.
Readers are often given some guidelines for annotating documents, but reader-assigned keywords may nevertheless express a special kind of subjectivity, e.g., comments of a private character that are of little use to other readers.
The field of social tagging and folksonomies is about these kinds of agents and is presented in more depth in a specific article in IEKO (Rafferty 2018).
We have already presented kinds of automatically produced indexes (Section 2.1) and automatic generation of keywords (Section 4). The general view among computer scientists is — implicitly as well as explicitly — that computer approaches are both more efficient and more effective than human indexing/keyword assignment. Gerard Salton, for example, wrote (1996, 333):
forgetting all the accumulated evidence and test data, and acting as if we were stuck in the nineteenth century with controlled vocabularies, thesaurus control, and all the attendant miseries, will surely not contribute to a proper understanding and appreciation of the modern information science field.
Robertson (2008) said: “statistical approaches won, simply. They were overwhelmingly more successful [compared to other approaches such as thesauri]”.
There are voices in library and information science claiming the superiority of human indexers (e.g., Day 2014 [51] and Warner 2019). Their arguments are not convincing, however, because they simply assume that humans possess a fundamental ability to index documents, without considering the nature of knowledge about both indexing and the domain of the documents being indexed.
Hjørland (2011) criticized the dichotomy between human-based and computer-based indexing because both sides of the dichotomy may be governed by different “theories” of indexing. Humans may, for example, perform in a very computer-like way if they follow a strict set of learned indexing rules or mechanically highlight terms in a text. What seems important is therefore to examine the theoretical foundations on which humans as well as computers work (e.g., whether they are governed by document-oriented or policy-oriented indexing, cf. Hjørland 2017, §2.4). Another issue is that computers often depend on input which is made by humans. In bibliometric techniques, algorithms work on → citation patterns, but such patterns consist of human decisions on which papers to cite. This also indicates that a strict distinction between human-based approaches and computer-based approaches is too simplified.
Two main theories of keywords and keyphrases are: (1) the essentialist (document-oriented) theory, according to which a document has one correct set of keywords expressing its content; and (2) the policy-based (request-oriented) theory, according to which different perspectives and interests require different sets of keywords.
(Other theories exist, for example, → user-oriented theories, but these theories are considered relatively unimportant and are ignored here.)
The choice between the essentialist theory and the policy-based theory is seldom openly presented or discussed. Nonetheless, such theories have great importance for both manual and automated indexing as well as for the evaluation of indexing. From the viewpoint of essentialist indexing (extracting keyphrases), Caragea et al. (2014, 1435) wrote:
The reason for not considering the entire text of a paper is that scientific papers contain details, e.g., discussion of results, experimental design, notation, that do not provide additional benefits for extracting keyphrases. Hence, […] we did not use the entire text of a paper. However, extracting keyphrases from sections such as “introduction” or “conclusion” needs further attention.
This quote can be interpreted as indicating that the essence of a document is best expressed in the introduction and conclusion. Against this view it can be said that, from the perspective of so-called “evidence-based practice”, for example, exactly the methodology section of documents is of utmost importance. We saw, for example, in Table 3 (Section 5.2) how MEDLINE indexing gave high priority to methodological issues in indexing.
Closely related to the issue of content-oriented versus policy-based indexing is the issue of what kind of information should be considered in the derivation or assignment of keywords to documents. The ISO 5963:1985 standard [52] is typically document-oriented, as it only includes examining the title, table of contents, introduction and so on, and even fails to mention the list of references. It is also implicitly assumed by this standard that the same set of indexing terms is optimal for all kinds of queries.
Conflicting with the view that a document itself is all that needs to be considered is, for example, the view represented in the information retrieval tradition, where terms in a document are related to all terms in a collection. This is, however, only a first step. A second step is taken by Morris (2010, 148), who suggested:
The fact that between approximately 30%–40% of the interpretation of lexical cohesion in the texts were individually different provides compelling evidence in support of returning to a computational view of text meaning that includes text, reader, and writer.
We saw another example in Section 5.2 where Lok (2010, 418) suggested: “Authors are not the right people to validate their own claims. The community — referees, editors, curators, readers at large — is still needed”. This quote also indicates the need for a broader interpretation of documents.
A third step can be illuminated by the philosophy of Hegel (here quoted from Mácha 2015, 19):
Hegel was indeed an adherent of the doctrine of internal relations. He writes in his Logic: 'Everything that exists stands in correlation, and this correlation is the veritable nature of every existence' [Hegel 1968, 235]. To adequately understand the veritable nature (i.e., the essence) of every single thing, one has to understand its relations to every other thing and, in the end, to the whole, to the Absolute. To put the doctrine in negative terms: we cannot isolate or abstract one single thing out of the whole and understand it adequately in isolation [cf. Kain 2005, 4–6].
Translated to keyword assignment, this means that keywords cannot be assigned by considering a document in isolation. They must be assigned by considering the tradition or paradigm in which the document is written, as well as the interests that the indexing is meant to fulfill.
The authors thank two referees for their valuable feedback on an earlier version of this article, and Claudio Gnoli, who served as editor because Birger Hjørland was included as a co-author, for his great work in improving the manuscript.
1. Wikipedia (2020) wrote: ”An index term, subject term, subject heading, or descriptor, in information retrieval, is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records. They are an integral part of bibliographic control, which is the function by which libraries collect, organize and disseminate documents. They are used as keywords to retrieve documents in an information system, for instance, a catalog or a search engine. A popular form of keywords on the web are tags which are directly visible and can be assigned by non-experts. Index terms can consist of a word, phrase, or alphanumerical term. They are created by analyzing the document either manually with subject indexing or automatically with automatic indexing or more sophisticated methods of keyword extraction. Index terms can either come from a controlled vocabulary or be freely assigned”.
2. In the former DIALOG system, for example, the names of journals were phrase indexed, and could not be searched by single words in the name, but titles of articles were both 'word indexed' and 'phrase indexed'.
3. Mooers and Mooers (1993) claimed that 'key words' are the direct descendants of Mortimer Taube's Uniterms, but according to the Oxford English Dictionary (2020a) it goes back to 1827. However, Mooers and Mooers may have had a more specific use in mind.
4. Concerning the concept “subject access point” see Hjørland and Kyllesbech Nielsen (2001).
5. Larson (1991, 211): “Experience in catalog use may not necessarily imply that users have been “conditioned” to avoid subject searches [searches using Library of Congress Subject Headings, not searches using title keywords], though such conditioning appears to be a likely result of gaining experience in catalog use, whether card or online catalog. We would suggest, as a hypothesis for further study, that individual users' experiences of subject search failure and information overload lead them to reduce their use of the subject index and to increase their use of alternate means of subject access, such as title keyword searching and shelf browsing following a known item search”. See Gross and Taylor (2005) for a response to Larson (1991) defending subject headings: “It was found that more than one-third of records retrieved by successful keyword searches would be lost if subject headings were not present, and many individual cases exist in which 80, 90, and even 100 percent of the retrieved records would not be retrieved in the absence of subject headings”.
6. Alternatively Soergel (1974, 31) defined: “A 'subject heading' is a specific type of descriptor, namely, a descriptor used in an alphabetical subject catalog or printed index”. Soergel (1974, 126): In subject headings, “[t]he main headings and the subheadings are independent elements designed by terms and arranged alphabetically (with minor deviations in some catalogs). Many subject headings are created by combining a main heading with a subheading, and for those headings the place in the arrangement is determined by its components. The citation order is fixed: Main heading-subheading. A closer look reveals a more complex situation”.
7. ISO 5127:2017: “Preferred term, descriptor. Term (3.1.5.25) used to represent a concept (3.1.1.02) when indexing (3.8.2.01)”.
8. ISO 5127:2017: “3.7.3.04 subject heading: heading (2) access point” (3.7.3.01) expressing an aspect of the contents of all or part of a document (3.1.1.38) and also used to collocate entries (3.2.1.32) for documents having the same or similar content. Note: See also keyword (3.8.1.07); content descriptor (3.8.3.19)”.
9. The date 1950 is debatable. Lancaster (1968) said that in 1947 Calvin Mooers began using the term descriptor in a system called Zato 1 for the subject classification of documents by means of words extracted from their own texts. Roberts (1984) also suggests that the term descriptor was coined in 1947 and explains the difficulties in confirming this information.
10. Modern thesauri tend to include many more descriptors. The Thesaurus of Psychological Index Terms, for example, contains about 8,000 terms, including preferred and entry terms. Source: https://www.apa.org/pubs/databases/training/thesaurus.
11. The PsycINFO database, for example, wrote: “What Are Index Terms? Index terms are controlled vocabulary terms used in database records to make searching easier and more successful. By standardizing the words or phrases used to represent concepts, you don't need to try and figure out all the ways different authors could refer to the same concept. Each record in APA's databases contains controlled vocabulary terms from the Thesaurus of Psychological Index Terms” (https://www.apa.org/pubs/databases/training/thesaurus).
12. Soergel (1974, 31-4) defined descriptors differently: “A 'subject heading' is a specific type of descriptor, namely, a descriptor used in an alphabetical subject catalog or printed index. Some people use 'descriptor' only in connection with ISAR [information storage and retrieval] systems using combinational indexing (usually implemented through edge-notched cards, peek-a-boo cards or computerized methods). Descriptors in this usage represent predominantly elemental concepts, whereas subject headings represent predominantly compound concepts. While this usage corresponds to the original definition of 'descriptor' it has not gained currency in the profession. A 'class number' is another specific type of descriptor, namely the notation from a classification system such as the Dewey Decimal Classification or the Library of Congress Classification”.
13. A phrase is sometimes called “a multiword”, “a syntagmatic term” or “a complex term” (Lauriston 1994, 149).
14. Anderson and Pérez-Carballo (2005, 1.61, 19) wrote: “'Keyword' is often used to indicate the more important free-text terms”. However, on p. 222 (§12.75) another meaning is suggested, “keyword searches using Library of Congress subject headings: A third way to searching is the 'keyword' approach. It should be invoked by the intelligent OPAC in several instances. In this first example, a searcher has used the 'exact' approach to 'jazz music' but wants more options. The next step would be to apply the 'keyword-in-main-heading approach'” (retrieving all subject headings which include 'jazz'). The authors add: “in this case, 'music' and 'jazz' are treated as independent keywords”.
15. Wellisch (1995, 248) distinguishes two senses of “keyword”: (1) the first word of a heading, often called 'lead term'; (2) “a significant term word taken from the title or the text of a document and used in a heading (but not necessarily the first word in it)”. He writes (250-1) that in the KWOC (Key Word Out of Context) index the two senses of keyword were conflated, so that a keyword serving as an index term was at the same time also used as a lead term. On p. 252-3 Wellisch introduces a third meaning: author-assigned keywords. Wellisch claims that the KWIC was not first proposed by Hans Peter Luhn, but by Andrea Crestadoro in 1858. However, Wellisch has no reference to Crestadoro in the bibliography, and he probably meant Crestadoro 1856 (as cited by Olson and Boll 2001, 112). But Olson and Boll wrote the opposite of Wellisch: “Andrea Crestadoro invented the basic concept that topics should be described by a standardized vocabulary”. Therefore, these two sources are in conflict. It will not be examined here which is right.
16. Broughton (2011, 262) did not define keyword, but implicitly understood it in the same broad sense by defining “keyword list: a list of words, or descriptors, to be used for indexing documents. The keyword list is usually characterized by a lack of structure in its composition, and does not have cross-references between terms. It may be no more than an alphabetical list of terms taken from documents or previously used in indexing”.
17. East Carolina University Libraries. Search Basics for the Health Sciences: Keywords vs. Subject Headings. Retrieved from https://web.archive.org/web/20201020200459/https://libguides.ecu.edu/c.php?g=89808&p=579367:
“Keywords are:
18. Our understanding of keyword is thus different from the traditional approach, which considers a keyword an (intended) objective description of a document, cf. the definition by Feather and Sturges (1996, 341) quoted above. Also, when Firoozeh et al. (2020) claim: “The terms 'keyword' and 'keyphrase' do not refer to any theory”, we are here arguing that two opposing theories are in play: (1) the traditional view of one correct set of keywords for a given document, and (2) the alternative theory that different perspectives and interests require different sets of keywords.
19. A concordance consists of a list of the words in the text with a short section of the context that precedes and follows each word. Although a concordance is often defined as “an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context”, where “principal words” could be understood as a synonym for keywords, in reality a concordance includes many more words than keyword indexes (usually all words except words from a stop list). See however endnote 23 about KWIC and related indexes.
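To illustrate the mechanism, the following Python sketch (our own illustration, not taken from any cited system; the sample sentence, stop list and window size are arbitrary assumptions) builds a simple concordance in which every word not on the stop list is listed with a short window of surrounding context:

```python
# Minimal concordance sketch: list every non-stopword with its immediate context.
# Stop list, window size and sample text are illustrative assumptions only.
STOPWORDS = {"a", "an", "and", "is", "of", "the"}

def concordance(text, window=3):
    words = text.split()
    for i, word in enumerate(words):
        if word.lower() in STOPWORDS:
            continue  # stop-list words are deselected; everything else is listed
        left = " ".join(words[max(0, i - window):i])
        right = " ".join(words[i + 1:i + 1 + window])
        yield word.lower(), f"{left} [{word}] {right}"

for entry, context in sorted(concordance(
        "The keyword is one of many kinds of terms used for describing documents")):
    print(f"{entry:12} {context}")
```

Because nothing in this procedure weighs the importance of a word, the output lists far more entries than a selective keyword index would.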
20. Library of Congress (2006, 14) wrote: “With post-coordinate indexing, subject concepts are entered as single terms so that users are required to coordinate them. Boolean searching and other advanced techniques are required in order to locate resources on the compound and/or complex subjects in which the searchers are interested. False associations may easily occur because relationships among terms can be unclear”. Slide 13 presented the following examples of pre-coordination: Gold mining—United States—History—19th century and Diamond mining—South Africa—History—20th century. “In pre-coordinated indexing, appropriate terms are chosen and coordinated into subject-subdivision combinations at the time of indexing or cataloging. On the screen you'll see a couple of examples of precoordinated strings. You'll see Gold mining—United States—History—19th century. What we have here is a topical subject heading followed by a geographic subdivision, a topical subdivision (History), and a chronological period (19th century). In the second, it is the same structure, but with different components in that particular heading string. In each of these cases, what you get from the pre-coordinated string is a certain amount of context”. Slide 14 presented the following examples of post-coordination in indexing: Gold mining, Diamond mining, United States, South Africa, History, 19th century, 20th century. P. 14: “With post-coordinate indexing, subject concepts are entered as single terms so that users are required to coordinate them. Boolean searching and other advanced techniques are required in order to locate resources on the compound and/or complex subjects in which the searchers are interested. False associations may easily occur because relationships among terms can be unclear. If you look at the slide, you can see an example of what happens when terms are post-coordinated instead of being used in pre-coordinated strings. In contrast to the previous slide, in which it was clear that the resource was about gold mining in the United States in the 19th century, in this example it is unclear whether it's about gold mining in the United States in the 19th century, or about diamond mining in South Africa in the 19th century, or diamond mining in the United States. And what century is it? This is how false drops happen. A post-coordinated system often causes a significant number of false drops, because the context is missing. One might never know exactly what that work is about, without the strings”.
Another example: a search for “alcoholism in women”, using alcoholism and women as keywords, may retrieve “false drops” such as “alcoholism in men and its implications for their relations to women”.
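As a purely illustrative sketch (our own hedged example; the toy documents and index terms are invented for this note and are not from the Library of Congress material), post-coordinate retrieval can be simulated as Boolean intersection over sets of single index terms, which makes the false-drop problem visible:

```python
# Post-coordinate indexing sketch: each document carries uncoordinated single terms,
# and the searcher combines them with Boolean AND at search time.
# The documents and terms below are invented for illustration only.
index = {
    "doc1": {"alcoholism", "women"},          # actually about alcoholism in women
    "doc2": {"alcoholism", "men", "women"},   # about alcoholism in men and their relations to women
}

def boolean_and(query, idx):
    """Return the documents indexed with every term of the query."""
    return {doc for doc, terms in idx.items() if query <= terms}

print(boolean_and({"alcoholism", "women"}, index))
# Retrieves doc1 and doc2: doc2 is a false drop, because the single terms do not
# record how "alcoholism" and "women" are related, whereas a pre-coordinated
# heading string would carry that context.
```

With pre-coordinated strings the relationship is expressed in the heading itself, so the second document would not be retrieved for this query.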
21. For example, Lok (2010, 416-7) reported on Elsevier's system Reflect, which automatically recognizes and highlights the names of genes, proteins and small molecules in articles in the journal Cell.
22. Luhn is normally given the honor for the idea of the KWIC index, although indexes based on a related idea had appeared earlier.
23. According to the discussion in Section 1 and the note on concordances in Section 2, we now have the paradox that the “keywords” in, among others, KWIC indexes are not really keywords, because they are not selected (but contain all terms in the title/document except those deselected by the list of stop words). “KWIC index” should therefore, strictly speaking, be termed “term index in context”.
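A minimal sketch (again our own illustration; the sample title and stop list are arbitrary assumptions) shows why such an index contains every non-stopword rather than selected keywords: each remaining title word simply becomes an entry, rotated to the filing position with the rest of the title kept as context.

```python
# KWIC-style rotation sketch: every title word not on the stop list becomes
# an index entry shown "in context". Nothing here is selected for importance.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def kwic_entries(title):
    words = title.split()
    for i, word in enumerate(words):
        if word.lower() in STOPWORDS:
            continue
        # Rotate the indexed word to the front; "/" marks the wrap-around point.
        yield word.upper(), " ".join(words[i:] + ["/"] + words[:i])

for key, context in sorted(kwic_entries("Keyword in context index for technical literature")):
    print(f"{key:12} {context}")
```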
24. Lancaster (2003, 52-3; italics in original) explained the terms cycling and rotation and wrote: “Rotation is essentially the same as cycling except that the entry term is highlighted in some way (e.g., italicized or underlined) rather than being moved to the leftmost position [examples follow]”. The term Permuterm index was used (and trademarked) for a kind of index used in the Science Citation Index and its family of related products, cf. Garfield (1976), who argued that permutations are different from cyclings and wrote about the drawbacks of KWIC indexes.
25. Some sources claim that KWAC index “provides for the enrichment of the keywords of the title with additional significant words taken either from the abstract [o]f the document or its contents.” See, for example, https://www.librarianshipstudies.com/2017/02/keyword-augmented-in-context-kwac.html and http://inmyownterms.com/kwic-kwac-kwoc-not-knock-knock-joke/.
26. KeyWords Plus on Clarivate Analytics' home page: https://support.clarivate.com/ScientificandAcademicResearch/s/article/KeyWords-Plus-generation-creation-and-changes?language=en_US.
27. In the literature it is often assumed that assigned indexing (whether human or algorithmic) is always based on a controlled vocabulary (see e.g., Beliga et al. 2015, 2), but this is not the case. Humans may, for example, assign words from their own associations, and machines may, for example, choose words from the bibliographic network of the document, à la the “concept symbols” of Small (1978). (Somewhat related to KeyWords Plus®, cf. Garfield and Sher 1993, although this particular method is derived from words in the references of the document and thus formally is not a form of assigned indexing.)
28. Strictly speaking, not only words can be used to represent the contents of documents, but also, for example, classification codes and pictures (Jacob and Shaw 1996).
29. The following classification is not exhaustive. In addition to the main approaches, other approaches exist and new ones may be invented.
30. Beliga et al. (2015, 2) suggested a somewhat different classification, in which statistical, linguistic and other approaches were considered subcategories of unsupervised machine learning. This may explain why, for example, many of the statistical techniques are often used in papers about unsupervised learning.
31. The term “bag of words” was used by Harris (1954), who wrote “language is not merely a bag of words”.
32. Siddiqi and Sharan (2015, 22; dotted listing added) presented the following approaches to keyword selection based on location in the document:
33. National Library of Medicine (2018): “Most MEDLINE indexers are either Federal employees or employees of firms that have contracts with NLM for biomedical indexing. A prospective indexer must have no less than a bachelor's degree in a biomedical science. A reading knowledge of certain modern foreign languages is typically sought. An increasing number of recent recruits hold advanced degrees in biomedical sciences. Federal employees must be United States citizens, but citizenship is not mandatory for contractors.
Indexers are trained in principles of MEDLINE indexing, using the Medical Subject Headings (MeSH) controlled vocabulary as part of individualized training. The initial part of the training is based on an online training module (partially available to the public at http://www.nlm.nih.gov/bsd/indexing/index.html), followed by a period of practice indexing. NLM does not accept other indexing training programs as a substitute.
About 1% of MEDLINE indexing is performed by indexers at the International MEDLARS Centers in Sweden and Brazil” (accessed October 9, 2020).
34. Concerning the concept “theory” see Hjørland (2015), who suggested the following definition (116-7): “A theory is an explicit or implicit statement or conception that might be questioned (and thus met with an alternative theory), which is more or less substantiated and dependent on other theories (including background assumptions). We use the term theory about a statement or conception when we want to emphasize that it might be wrong, biased, bad or insufficient for its intended use and therefore should be considered and perhaps replaced by another theory.”
35. Swanson wrote about searching, but his argument is equally relevant for indexing. He wrote that if people working on a task search the literature, some relevant documents may not be retrieved, because any search strategy is a theory that may be refuted but can never be finally proved. A search strategy is refuted if it is possible to discover just one relevant document that has been missed by that strategy. Swanson concluded (1986, 114): “Any search function is necessarily no more than a conjecture and must remain so forever”.
36. However, terms indexed as author keywords seem to be confused with editor-assigned keywords and terms from controlled vocabularies. A search for “information” as author keyword in WoS provided a list of journals. Some of these journals do use author-assigned keywords; on the other hand, Journal of the Association for Information Science and Technology requires that authors select terms from the ASIS&T Thesaurus of Information Science, Technology, and Librarianship (author-selected, but not free terms). Finally, the journal Knowledge Organization uses editor-selected keywords (cf. Smiraglia 2013), which the systems apparently cannot distinguish from author-assigned keywords.
37. Névéol, Dogan and Lu (2010) wrote “Surgenor et al. 2009”, but the correct printing year for this reference is 2001.
38. https://web.archive.org/web/20201022102233/https://www.nlm.nih.gov/bsd/indexing/training/INT_030.html (accessed October 22, 2020):
“Indexing Manual (NLM Staff access only)
The Indexing Manual provides discussion on all aspects of indexing at NLM. The Indexing Manual is provided to all new indexers and is available online to NLM staff. The online version contains a search feature.
Technical Memoranda (NLM Staff access only)
Technical Memoranda are updates to the Indexing Manual that provide further clarification of a specific issue or address an immediate need. Technical Memoranda are distributed online and on paper to NLM staff.”
39. A main use of Surgenor, Corwin and Clerico (2001) is for decision making in hospital planning, more precisely the regionalization of hospital services. This is far better expressed by the author keywords, although a term such as regionalization of hospital services (or just regionalization) is missed. The MEDLINE indexing is extremely individual-oriented and seems to fail to connect the document with the purpose for which it was written (decision making concerning the regionalization of hospital services).
40. A study by Strader (2009) compared the usage of author-supplied keywords with Library of Congress Subject Headings (LCSH), using a sample of electronic theses and dissertations from the Ohio State University, each of them having both author keywords and LCSH headings. The results show that a large proportion of author-assigned keywords do not match LCSH headings, suggesting that keywords are useful as an additional source of information, even though some of them overlap with the content of the abstract. The author suggests that such a large mismatch between keywords and LCSH can be explained by the fact that a controlled vocabulary cannot always be kept up to date with the terms popular in current research. Therefore, it seems that author keywords and controlled vocabularies are complementary, and both should be adopted as valuable access points for users and cataloguers.
Similar results were found in a study by Schwing, McCutcheon and Maurer (2012), which replicated Strader's analysis on a different sample of electronic theses and dissertations. A partial match between author keywords and LCSH emerged, suggesting their complementarity, as well as an overlap among author-supplied keywords, titles and abstracts. Such overlap may reduce their usefulness as original and unique access points.
Complementarity between subject headings and keywords has also been reported by Gross and Taylor (2005), McCutcheon (2009) and Gil-Leiva and Alonso Arroyo (2007).
Smiraglia (2013, 158) criticized the use of author keyword lists to improve retrieval. He initially suggested (p. 155) that every word in an article is a keyword, but realized that for some people keywords mean "a list of author-contrived 'keywords' underneath the abstract", as is common in many journals. He wrote: "I do not think lists of author-contrived keywords are useful."
It can be added that from the next volume (41, 2014) Knowledge Organization has assigned keywords to all articles (selected by the editor, not by the author), whereas other journals use free author-assigned keywords, and Journal of the Association for Information Science and Technology uses author-assigned descriptors from the ASIS&T Thesaurus of Information Science, Technology, and Librarianship (see above; Redmond-Neal and Hlava 2005). Smiraglia points out that keywords are a core tool in information retrieval, but argues that they should be extracted from the actual text (including the title or abstract) rather than provided in a separate list. He also expresses concern that keyword lists may influence the decision-making of indexers. His editorial concluded (158):
"The potential use of keywords for retrieval and indexing seems clear. That is, the presence of keywords, whether in a separate list or in their usual place in the text, has the potential to influence the formal indexing of research, and also to influence resource location or selection by researchers. What is less clear is how those keywords should be generated. Empirical extraction of the terms is most accurate and therefore most reliable for indexing, retrieval or just for text analysis. Should editorial policy change to incorporate the use of formal keywords in Knowledge Organization it would make the best sense to generate the terms empirically, using text analysis tools designed for statistical term extraction".
41. Day (2014, 7): “Human indexes are what machine algorithms strive towards by the use of various syntactical and semantic techniques and technologies”. Day's book itself has an index, but the present author has found it necessary to use Google to find content in this book that could not be found via the index.
42. https://www.iso.org/standard/12158.html stated on October 25, 2020: “THIS STANDARD WAS LAST REVIEWED AND CONFIRMED IN 2020. THEREFORE THIS VERSION REMAINS CURRENT”.
Adler, Melissa. 2009. “Transcending Library Catalogs: A Comparative Study of Controlled Terms in Library of Congress Subject Headings and User-generated Tags in LibraryThing for Transgender Books.” Journal of Web Librarianship, 3 no. 4: 309-31.
Anderson, James D. and José Pérez-Carballo. 2005. Information Retrieval Design: Principles and Options for Information Description, Organization, Display and Access in Information Retrieval Databases, Digital Libraries, Catalogs, and Indexes. St. Petersburg, Fl: Ometeca Institute.
Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 2011. Modern Information Retrieval. The Concepts and Technology Behind Search, 2nd ed. New York: Addison Wesley.
Bekhuis, Tanja. 2015. "Keywords, Discoverability, and Impact". Journal of the Medical Library Association (JMLA) 103, no.3: 119-20. doi: http://dx.doi.org/10.3163/1536-5050.103.3.002.
Beliga, Slobodan, Ana Meštrovic and Sanda Martincic-Ipšic. 2015. “An Overview of Graph-Based Keyword Extraction Methods and Approaches”. Journal of Information and Organizational Science 39, no. 1: 1-20. https://jios.foi.hr/index.php/jios/article/view/938.
Bernier, Charles L. 1980. "Subject Indexes". In: Encyclopedia of Library and Information Science: Volume 29, eds. Allen Kent, Harold Lancour and Jay E. Daily. New York, NY: Marcel Dekker, Inc.: 191-205.
Bharti, Santosh Kumar, Korra Sathya Babu and Sanjay Kumar Jena. 2017. “Automatic Keyword Extraction for Text Summarization: A Survey”. https://arxiv.org/ftp/arxiv/papers/1704/1704.03242.pdf.
Bondi, Marina and Mike Scott (eds.). 2010. Keyness in Texts. Amsterdam: John Benjamins. DOI: https://doi.org/10.1075/scl.41.
Broughton, Vanda. 2011. Essential Library of Congress Subject Headings. London: Facet Publishing. doi:10.29085/9781783300365.
Bullinaria, John A. and Joseph P. Levy. 2007. “Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study”. Behavior Research Methods 39, no. 3: 510-26. https://link-springer-com.ep.fjernadgang.kb.dk/content/pdf/10.3758%2FBF03193020.pdf.
Caragea, Cornelia, Florin Bulgarov, Andreea Godea and Sujatha Das Gollapalli. 2014. “Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach”. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), October 25-29, 2014, Doha, Qatar. Eds. Alessandro Moschitti, Bo Pang and Walter Daelemans. Stroudsburg, PA: Association for Computational Linguistics, 1435–46. DOI: 10.3115/v1/D14-1150.
Chu, Heting. 2001. “Research in Image Indexing and Retrieval as Reflected in the Literature”. Journal of the American Society for Information Science and Technology 52, no. 12: 1011-18. https://doi.org/10.1002/asi.1153.
Cleverdon, Cyril W. 1962. Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems. Aslib-Cranfield Research Project. Cranfield, UK: College of Aeronautics.
Cohen, Jonathan D. 1995. “Highlights: Language- and Domain-Independent Automatic Indexing Terms for Abstracting”. Journal of the American Society for Information Science 46, no. 3: 162-74.
Costello, John C. 1961. “Uniterm Indexing Principles, Problems and Solutions”. American Documentation 12, no. 1: 20-6.
Crestadoro, Andrea. 1856. The Art of Making Catalogues of Libraries. London: The Literary, Scientific and Artistic Reference Office.
Day, Ronald E. 2014. Indexing It All: The Subject in the Age of Documentation, Information, and Data. Cambridge, MA: The MIT Press.
Dolamic, Ljiljana and Jacques Savoy. 2010. “When Stopword Lists Make the Difference”. Journal of the American Society for Information Science and Technology 61, no. 1: 200-3.
Edmundson, Harold P. and Ronald E. Wyllys. 1961. “Automatic Abstracting and Indexing-Survey and Recommendations”. Communications of the ACM 4, no. 5: 226-34. https://doi.org/10.1145/366532.366545.
Feather, John and Paul Sturges (eds.). 1996. International Encyclopedia of Information and Library Science. London: Routledge.
Feinberg, Hilda. 1973. Title Derivative Indexing Techniques: A Comparative Study. Metuchen, NJ: Scarecrow Press.
Firoozeh, Nazanin, Adeline Nazarenko, Fabrice Alizon and Béatrice Daille. 2020. “Keyword Extraction: Issues and Methods”. Natural Language Engineering 26, no. 3: 259–91. doi:10.1017/S1351324919000457.
Frank, Eibe, Gordon W. Paynter, Ian H. Witten, Carl Gutwin and Craig G. Nevill-Manning. 1999. “Domain-Specific Keyphrase Extraction”. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, July 31-August 6, 1999. San Francisco, CA: Morgan Kaufmann, Vol. 1: 668-73.
Furner, Jonathan. 2010. “Folksonomies”. In Encyclopedia of Library and Information Sciences, Third Edition, edited by Marcia J. Bates and Mary Niles Maack. Boca Raton, FL: CRC Press, vol. III: 1858-66. doi: 10.1081/E-ELIS3-120043238.
Garfield, Eugene. 1976. “Permuterm Subject Index: Autobiographical Review”. Journal of the American Society for Information Science 27, no. 5-6: 288-91. Also available at: http://www.garfield.library.upenn.edu/essays/v3p070y1977-78.pdf.
Garfield, Eugene. 1990a. “Keywords Plus®: ISI's Breakthrough Retrieval Method. Part 1. Expanding your Searching Power on Current Contents on Diskette”. Current Contents® 1, no. 32: 5–9. http://www.garfield.library.upenn.edu/essays/v13p295y1990.pdf.
Garfield, Eugene. 1990b. “Keywords Plus™ Takes You Beyond Title Words. Part 2. Expanded Journal Coverage for Current Contents on Diskette®, Includes Social and Behavioral Sciences”. Current Contents 1, no. 33: 5-9. http://www.garfield.library.upenn.edu/essays/v13p300y1990.pdf.
Garfield, Eugene and Irving H. Sher. 1993. “KeyWords Plus™ - Algorithmic Derivative Indexing”. Journal of the American Society for Information Science 44, no. 5: 298-99. http://www.garfield.library.upenn.edu/papers/jasis44(5)p298y1993.html.
Garg, Muskan and Mukesh Kumar. 2018. “Identifying Influential Segments from Word Co-Occurrence Networks using AHP [Analytic Hierarchy Process]”. Cognitive Systems Research 47: 28-41.
Gazendam, Luit, Christian Wartena and Rogier Brussee. 2010. “Thesaurus Based Term Ranking for Keyword Extraction”. Workshops on Database and Expert Systems Applications, Bilbao, 2010, 49-53. doi: 10.1109/DEXA.2010.31.
Gil-Leiva, Isidoro and Adolfo Alonso-Arroyo. 2007. "Keywords Given by Authors of Scientific Articles in Database Descriptors." Journal of the American Society for Information Science and Technology 58, no. 8: 1175-87.
Gross, Tina and Arlene G. Taylor. 2005. “What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results”. College & Research Libraries 66, no. 3: 212-30.
Hahn, Udo, Joachim Wermter, Rainer Blasczyk and Peter A. Horn. 2007. “Text Mining: Powering the Database Revolution”. Nature 448, no. 7150: 130.
Harris, Zellig. 1954. "Distributional Structure". Word 10, no. 2/3: 146–62. doi:10.1080/00437956.1954.11659520.
Hartley, James and Ronald N. Kostoff. 2003. "How Useful are 'Key Words' in Scientific Journals?" Journal of Information Science 29, no. 5: 433-8.
Hegel, Georg Wilhelm Friedrich. 1968. The Logic of Hegel. Trans. William Wallace. Oxford: Oxford University Press.
Hjørland, Birger. 2011. "The Importance of Theories of Knowledge: Indexing and Information Retrieval as an Example". Journal of the American Society for Information Science and Technology 62, no. 1: 72-7.
Hjørland, Birger. 2015. “Theories are Knowledge Organizing Systems (KOS)”. Knowledge Organization 42, no. 2: 113-28.
Hjørland, Birger. 2017. “Subject (of Documents)”. Knowledge Organization 44, no. 1: 55-64. Also available in ISKO Encyclopedia of Knowledge Organization, eds. Birger Hjørland and Claudio Gnoli, http://www.isko.org/cyclo/subject.
Hjørland, Birger. 2018. “Indexing: Concepts and Theory”. Knowledge Organization 45, no. 7: 609-39. Also available in ISKO Encyclopedia of Knowledge Organization, eds. Birger Hjørland and Claudio Gnoli, http://www.isko.org/cyclo/indexing.
Hjørland, Birger and Lykke Kyllesbech Nielsen. 2001. “Subject Access Points in Electronic Retrieval”. Annual Review of Information Science and Technology 35, 249-98.
Holstrom, Chris. 2019. “Moving Towards an Actor-Based Model for Subject Indexing.” NASKO: North American Symposium on Knowledge Organization 7, no. 1, 120-28. Doi: http://dx.doi.org/10.7152/nasko.v7i1.15631.
Hulme, E. Wyndham. 1911. "Principles of Book Classification". Library Association Record, 13:354-358, Oct. 1911; 389-394, Nov. 1911 & 444-449, Dec. 1911.
ISO 5127:2017. Information and Documentation: Foundation and Vocabulary. 2nd. ed. Geneva, Switzerland: The International Organization for Standardization.
ISO 5963:1985. Documentation – Methods for Examining Documents, Determining their Subjects, and Selecting Indexing Terms. Geneva: International Organization for Standardization.
Jacob, Elin K. and Debora Shaw. 1996. “Is a Picture Worth a Thousand Words?”. Advances in Knowledge Organization 5, 174-81.
Jacquemin, Christian and Didier Bourigault. 2003. “Term Extraction and Automatic Indexing”. In The Oxford Handbook of Computational Linguistics, ed. Ruslan Mitkov. Oxford, UK: Oxford University Press, pp. 599–615. DOI: 10.1093/oxfordhb/9780199276349.013.0033.
Jäschke, Robert, Leandro Marinho, Andreas Hotho, Lars Schmidt-Thieme and Gerd Stumme. 2008. "Tag Recommendations in Social Bookmarking Systems". Ai Communications 21 no. 4: 231-47.
Juilland, Alphonse and Alexandra Roceric. 1972. The Linguistic Concept of Word: Analytic Bibliography. Janua linguarum, Series minor, 130. The Hague: Mouton.
Kain, Philip J. 2005. Hegel and the Other: A Study of the Phenomenology of Spirit. Albany: State University of New York Press.
Käki, Mika. 2006. "fKWIC: Frequency-Based Keyword-In-Context Index for Filtering Web Search Results". Journal of the American Society for Information Science and Technology 57, no. 12: 1606-15.
Kathait, Shailendra Singh, Shubhrita Tiwari, Anubha Varshney and Ajit Sharma. 2017. “Unsupervised Key-phrase Extraction using Noun Phrases”. International Journal of Computer Applications 162, no. 1: 1-5. doi: 10.5120/ijca2017913171.
Krámský, Jiří. 1969. The Word as a Linguistic Unit. Janua linguarum, Series minor, 75. The Hague: Mouton.
Lancaster, Frederick W. 1968. Information Retrieval Systems: Characteristics, Testing and Evaluation. New York: Wiley.
Lancaster, Frederick W. 2003. Indexing and Abstracting in Theory and Practice. 3rd. ed. London: Facet Publishing.
Lancia, Franco. 2008. “Word Co-Occurrence and Similarity in Meaning: Some Methodological Issues”. In Mind as Infinite Dimensionality, ed. Sergio Salvatore and Jaan Valsiner. Roma, Italy: Carlo Amore, 1–39. https://mytlab.com/wcsmeaning.pdf.
Larson, Ray R. 1991. “The Decline of Subject Searching: Long-Term Trends and Patterns of Index Use in an Online Catalog”. Journal of the American Society for Information Science 42, no. 3: 197-215.
Lauriston, Andy. 1994. “Automatic Recognition of Complex Terms: Problems and the TERMINO Solution”. Terminology 1, no. 1: 149-70.
Lee, Durin, Won Chul Kim, Andreas Charidimou and Min Song. 2015. “A Bird's-Eye View of Alzheimer's Disease Research: Reflecting Different Perspectives of Indexers, Authors, or Citers in Mapping the Field”. Journal of Alzheimer's Disease 45, no. 4: 1207-22. doi: 10.3233/JAD-142688.
Li, Decong, Sujian Li, Wenjie Li, Wei Wang and Weiguang Qu. 2010. "A Semi-Supervised Key Phrase Extraction Approach: Learning from Title Phrases through a Document Semantic Network". Proceedings of the ACL 2010 Conference Short Papers, Uppsala, Sweden, 11-16 July 2010. Uppsala, Sweden: Association for Computational Linguistics, 296–300. https://www.aclweb.org/anthology/P10-2055.pdf.
Library of Congress, Policy and Standards Division. 2006. Library of Congress Subject Headings. Module 1.2: Why Do We Use Controlled Vocabulary? PowerPoint Presentation. Retrieved from https://www.loc.gov/catworkshop/lcsh/PDF%20scripts/1-2-WhyCV.pdf.
Lok, Corie. 2010. “Speed Reading”. Nature 463, no. 7280: 416-8.
Lu, Wei, Xin Li, Zhifeng Liu and Qikai Cheng. 2019. "How do Author-Selected Keywords Function Semantically in Scientific Manuscripts?" Knowledge Organization 46 no. 6: 403-418. https://doi.org/10.5771/0943-7444-2019-6-403.
Luhn, Hans Peter. 1957. “A Statistical Approach to the Mechanized Encoding and Searching of Literary Information”. IBM Journal of Research and Development 1, no. 4: 309-17.
Luhn, Hans Peter. 1960. “Keyword-In-Context Index for Technical Literature”. American Documentation 11, no. 4: 288–95.
Mácha, Jakub. 2015. Wittgenstein on Internal and External Relations: Tracing All the Connections. London: Bloomsbury Academic.
McCutcheon, Sevim. 2009. "Keyword vs Controlled Vocabulary Searching: The One with the Most Tools Wins". The Indexer: The International Journal of Indexing 27, no. 2: 62-65.
McJunkin, Monica Cahill. 1994. Precision and Recall in Title Keyword Searches. Master's Research Paper. Kent, OH: Kent State University. https://pdfs.semanticscholar.org/14eb/fdef43d49942c71e01a637a8b730ed1068be.pdf?_ga=2.217421270.1972569115.1590375949-1712967020.1514566464.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.
Manning, Christopher D., Prabhakar Raghavan and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Medelyan, Olena and Ian H. Witten. 2006. “Thesaurus Based Automatic Keyphrase Indexing”. Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, Chapel Hill, NC, USA, June 2006. New York, NY: Association for Computing Machinery, 296-7.
Milstead, Jessica L. 1984. Subject Access Systems: Alternatives in Design. Orlando, FL: Academic Press.
Mooers, Calvin N. 1950. “Zatocoding for Punched Cards.” Zator Technical Bulletin 30. Boston, MA: Zator Co.
Mooers, Calvin N. 1972. “Descriptors”. In: Encyclopedia of Library and Information Science, ed. Allen Kent and Harold Lancour. Vol. 7. New York: Marcel Dekker, 31-45. (Reprinted in Mooers 2003.)
Mooers, Calvin N. 2003. “Descriptors”. In Encyclopedia of Library and Information Science, 2nd. ed., ed. Miriam A. Drake. New York: Marcel Dekker, vol. 2: 813-21. doi: 10.1081/E-ELIS-120008981 (Reprint of Mooers 1972).
Mooers, Calvin N. and Charlotte D. Mooers. 1993. "Oral History Interview with Calvin N. Mooers and Charlotte D. Mooers". Interviewed by Kevin D. Corbitt. Minneapolis, MN: Charles Babbage Institute, 7-11. https://conservancy.umn.edu/handle/11299/107510.
Morris, Jane. 2010. “Individual Differences in the Interpretation of Text: Implications for Information Science”. Journal of the American Society for Information Science and Technology 61, no. 1: 141-9. https://doi.org/10.1002/asi.21222.
National Library of Medicine. 2018. “Frequently Asked Questions About Indexing for MEDLINE: Who are the Indexers, and What are Their Qualifications?” https://www.nlm.nih.gov/bsd/indexfaq.html#qualifications.
Névéol, Aurélie, Rezarta Islamaj Dogan and Zhiyong Lu. 2010. “Author Keywords in Biomedical Journal Articles”. AMIA [American Medical Informatics Association] Annual Symposium Proceedings 2010: 537-41.
Olson, Hope A. and John J. Boll. 2001. Subject Analysis in Online Catalogs. Englewood, CO: Libraries Unlimited.
Oxford English Dictionary. 2020a. “Keyword”. Retrieved 2020-06-19 from: https://www.oed.com/ (toll access).
Oxford English Dictionary. 2020b. “Terminology”. Retrieved 2020-06-19 from: https://www.oed.com/ (toll access).
Peset, Fernanda, Fernanda Garzon-Farinos, L.M. Gonzalez, X. Garcia-Masso, Antonia Ferrer-Sapena, J. L. Toca-Herrera and E. A. Sanchez-Perez. 2020. “Survival Analysis of Author Keywords: An Application to the Library and Information Sciences Area”. Journal of the Association for Information Science and Technology 71, no. 4: 462-73. DOI: 10.1002/asi.24248
Rafferty, Pauline. 2018. “Tagging”. Knowledge Organization 45, no. 6: 500-516. Also available in ISKO Encyclopedia of Knowledge Organization, eds. Birger Hjørland and Claudio Gnoli. http://www.isko.org/cyclo/tagging
Redmond-Neal, Alice and Marjorie M. K. Hlava, eds. 2005. ASIS&T Thesaurus of Information Science, Technology, and Librarianship, 3rd. ed. Medford, NJ: Information Today. http://books.infotoday.com/asist/TheInfoSciLib3.shtml.
Reitz, Joan M. 2004. Online Dictionary for Library and Information Science (ODLIS). Santa Barbara, CA: ABC-CLIO. https://products.abc-clio.com/ODLIS/odlis_p.aspx.
Roberts, Norman. 1984. "The Pre-History of the Information Retrieval Thesaurus". Journal of Documentation 40, no. 4: 271-85.
Robertson, Stephen. 2008. “The State of Information Retrieval: A Researcher’s View.” ISKO-UK. Presentation and audio recording freely available at: http://event-archive.iskouk.org/content/state-information-retrieval-researchers-view.
Salton, Gerard and Chung-Shu Yang. 1973. "On the Specification of Term Values in Automatic Indexing". Journal of Documentation 29, no. 4: 351-72. https://doi.org/10.1108/eb026562.
Salton, Gerard, Chung-Shu Yang and C. T. Yu. 1975. “A Theory of Term Importance in Automatic Text Analysis”. Journal of the American Society for Information Science 26, no. 1: 33-44. https://doi.org/10.1002/asi.4630260106.
Sanderson, Mark and W. Bruce Croft. 2012. "The History of Information Retrieval Research". Proceedings of the IEEE 100 (Special Centennial Issue): 1444-51. doi: 10.1109/JPROC.2012.2189916.
Schwing, Theda, Sevim McCutcheon, and Margaret Beecher Maurer. 2012. "Uniqueness Matters: Author-Supplied Keywords and LCSH in the Library Catalog". Cataloging & Classification Quarterly 50, no. 8: 903-28.
Siddiqi, Sifatullah and Aditi Sharan. 2015. "Keyword and Keyphrase Extraction Techniques: A Literature Review". International Journal of Computer Applications 109, no. 2: 18-23.
Skare, Roswitha. 2020. “Paratext”. ISKO Encyclopedia of Knowledge Organization, eds. Birger Hjørland and Claudio Gnoli, https://www.isko.org/cyclo/paratext.
Small, Henry G. 1978. “Cited Documents as Concept Symbols.” Social Studies of Science 8, no. 3: 327-40.
Smiraglia, Richard P. 2013. "Keywords, Indexing, Text Analysis: An Editorial". Knowledge Organization 40, no. 3: 155-9.
Smith, Arthur. 2020. “Physics Subject Headings (PhySH)”. Knowledge Organization 47, no. 3: 257-66. Also available in ISKO Encyclopedia of Knowledge Organization, eds. Birger Hjørland and Claudio Gnoli, https://www.isko.org/cyclo/physh.
Soergel, Dagobert. 1974. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles, CA: Melville.
Spärck Jones, Karen. 1972. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval”. Journal of Documentation 28, no. 1: 11-21. https://doi.org/10.1108/eb026526.
Spärck Jones, Karen. 1973. “Index Term Weighting”. Information Storage and Retrieval 9, no. 11: 619-33. https://doi.org/10.1016/0020-0271(73)90043-0.
Strader, C. Rockelle. 2009. "Author-Assigned Keywords versus Library of Congress Subject Headings". Library Resources & Technical Services 53, no. 4: 243-50.
Subramaniyaswamy, V. and S. Chenthur Pandian. 2012. “Effective Tag Recommendation System Based on Topic Ontology Using Wikipedia and WordNet”. International Journal of Intelligent Systems 27, no. 12: 1034-48.
Surgenor, Stephen D., Howard L. Corwin and Terri Clerico. 2001. “Survival of Patients Transferred to Tertiary Intensive Care from Rural Community Hospitals”. Critical Care 5, no. 2: 100-4. doi: 10.1186/cc993.
Swanson, Don R. 1986. “Undiscovered Public Knowledge”. The Library Quarterly 56, no. 2: 103-18.
Syn, Sue Yeon and Michael B. Spring. 2009. "Tags as Keywords–comparison of the Relative Quality of Tags and Keywords". Proceedings of the American Society for Information Science and Technology 46 no. 1: 1-19. doi: 10.1002/meet.2009.1450460247
Taube, Mortimer, C. D. Gull and Irma S. Wachtel. 1952. "Unit Terms in Coordinate Indexing". American Documentation 3, no. 4: 213-8.
Thomaidou, Stamatina and Michalis Vazirgiannis. 2011. “Multiword Keyword Recommendation System for Online Advertising”. In Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining, Kaohsiung, Taiwan, July 25-27, 2011. Los Alamitos, CA: IEEE Computer Society, 423-7. DOI: 10.1109/ASONAM.2011.70.
Togeby, Knud. 1949. “Qu'est-ce qu'un mot?”. Travaux du Cercle Linguistique de Copenhague (TCLC) 5: 97-111.
Uddin, Shahadat and Arif Khan. 2016. “The Impact of Author-Selected Keywords on Citation Counts”. Journal of Informetrics 10, no. 4: 1166-77. https://doi.org/10.1016/j.joi.2016.10.004
Uhlenbeck, Eugenius Marius. 2003. “Words”. In International encyclopedia of linguistics. 2nd. Ed. Edited by William J. Frawley. Oxford, UK: Oxford University Press, vol. 4: 377-8.
Van der Veer Martens, Betsy. 2023. Keywords In and Out of Context. Cham: Springer. (Synthesis Lectures on Information Concepts, Retrieval, and Services). DOI: https://doi.org/10.1007/978-3-031-32530-4.
Warner, Julian. 2019. “When Might Human Indexing Be Strongly Justified”. Proceedings from the Document Academy 5, no. 1: Article 5. DOI: https://doi.org/10.35492/docam/6/1/5.
Wellisch, Hans H. 1995. “Keyword”. In Indexing from A to Z. 2nd. ed. By Hans H. Wellisch. New York: H.W. Wilson, 248-53.
Wikipedia, the Free Encyclopedia. 2020. “Index term” [redirected from “keyword”]. Retrieved 2020-06-19 from: https://en.wikipedia.org/wiki/Index_term.
Wiktionary, the Free Dictionary. 2020. “Term”. Retrieved 2020-06-19 from: https://en.wiktionary.org/wiki/term#English.
Williams, Raymond. 2015. Keywords: A Vocabulary of Culture and Society. New Edition. New York: Oxford University Press.
Witten, Ian H., Gordon W. Paynter, Eibe Frank, Carl Gutwin and Craig G. Nevill-Manning. 1999. “Kea: Practical Automatic Key-Phrase Extraction”. In Proceedings of the Fourth ACM Conference on Digital Libraries, Berkeley, CA: ACM, 254-5. https://arxiv.org/abs/cs/9902007.
Zhang, Juan, Qi Yu, Fashan Zheng, Chao Long, Zuxun Lu and Zhiguang Duan. 2016. “Comparing Keywords Plus of WOS and Author Keywords: A Case Study of Patient Adherence Research”. Journal of the Association for Information Science and Technology 67, no. 4: 967-72. https://doi.org/10.1002/asi.23437.
Version 1.0 published 2020-11-17
Version 1.1 published 2023-08-09: references to Bondi and Scott 2010 and Van der Veer Martens 2023 added, last edited 2023-08-10
Article category: KOS Kinds
This article (version 1.0) is also published in Knowledge Organization. How to cite it:
Lardera, Marco and Birger Hjørland. 2021. "Keyword". Knowledge Organization 48(6): 430-456. DOI:10.5771/0943-7444-2021-6-430; also available in ISKO Encyclopedia of Knowledge Organization, eds. Birger Hjørland and Claudio Gnoli. https://www.isko.org/cyclo/keyword.
©2020 ISKO. All rights reserved.