Class Lecture Notes: H. P. Luhn and Automatic Indexing -- References to the Early Years of Automatic Indexing and Information Retrieval
Organizing and Providing Access to Information -- LIS 391D.2 -- Spring 1998
I. Introduction
- Where we are starting
The most basic concept of automatic indexing developed in the 1950s was the
KWIC, or Keyword in Context, index, based on machine-generated permutations of
the significant words in titles, abstracts, or full text. The first major
report on the application of this indexing concept was presented at the
International Conference on Scientific Information (ICSI), held in Washington,
D.C., in November 1958. The paper itself was not the sensation of the
conference; the actual demonstration of the method was. Hans Peter Luhn and
Phyllis Baxendale have been deservedly credited as the pioneers in this area of
automatic indexing. Luhn developed the concept with suggestions for
auto-abstracting, auto-encoding, and auto-indexing. Baxendale developed
auto-indexing techniques that identified topic sentences, and she developed
methods of automatic phrase selection and syntactical deletion. But many others
of the day were involved, and much work was going on simultaneously across the
nation.
Among those others is Herbert Marvin Ohlman, the inventor of Permuterm (or
permuted indexing), whose work is reviewed at the Information Science Pioneers
of North America Web site. According to Robert V. Williams, Professor and
Director of the Office of Research, College of Mass Communication and
Information Studies at the University of South Carolina, "Ohlman also presented
a paper and had prepared in advance of, and distributed at, the 1958 Washington
D.C. meeting, a complete permuterm index to the proceedings of the conference.
Permuterm indexing is pretty much the same as KWIC indexing; Luhn's term for the
approach happened to be more 'catchy' and stuck but Ohlman's work was just as
important, and possibly, preceded Luhn's work on KWIC."
In the 1958-1959 time period, many minds were conjuring up the same ideas
within months of one another.
- Where we end up
We end up with a wide range of automated and semi-automated indexing techniques
being developed, studied, debated, and evaluated, including:
- Citation indexes
- Word frequency based on absolute frequency
- Word frequency based on relative frequency
- Derivative indexes
- Automatic assignment indexes
- Automatic classification and categorization
- Factor analysis
- Clumping theory
- Clustering theory
- Latent class analysis
- Statistical analysis
- Semantic relationships and mapping
- Associative indexing
- Mathematical modeling
- Discriminant analysis
- Psycho-linguistics
- In Between
In between, I would like to describe the personality of H. P. Luhn and the
exciting times of the 1950s and 1960s, which led to the establishment of new
fields of inquiry and terminology that we use today. Luhn helped set the stage
to foster new ideas and debates on the evaluation of methods of access.
II. The Setting in the Early Years, the 1950s
- War and Threats of War
As Dr. Miska noted last week, the war years provided tremendous impetus for
ingenious work and put many minds together to provide for the practical defense
of the country. After the war, the threat of attack remained fresh in the minds
of those who had served, and work continued on military efforts to devise
machines and techniques that would protect the country and make us strong. Men
and women were both involved in the effort. SAGE was one of these efforts; it
fostered networking, teamwork principles, software development, and the systems
approach to information systems.
- Supply of Machine-Readable Text
Machine-readable text was not in ready supply. Key punching was the method used
to convert text into machine-readable form. IBM and Remington Rand were leaders
in this area. Just getting documents into machine-readable form was a feat.
Getting authors to use machine-compatible punctuation symbols was a feat. Work
was primarily done manually.
- Nearly 100 computers in the U.S.
It is hard to imagine such a time, but in the 1950s some 100 computers existed
in the United States. They were quite a sensation.
- ENIAC (1945, but veiled in military secrecy)
- 1940s: the punched card was the only available medium for the storage of
machine-readable information. IBM cards were usually punched in Hollerith code,
with one character coded in each column. (Luhn worked on this limitation.)
- Machines worked on numerical data, not words; the Luhn scanner resulted
- Few trained software developers -- 200 or so journeymen in data processing in
the late 1950s.
- A great deal of energy
- Influential organization names:
- Thompson Ramo-Wooldridge, Inc. - TRW
- System Development Corporation - SDC
- Rand Corporation
- International Business Machines Corp. (Watson)
- Western Reserve University (Shera)
- National Science Foundation
- Harvard University
- American Documentation Institute; professional organization
- National Bureau of Standards
- MIT
- Influential personal names:
- Hans P. Luhn
- Lauren B. Doyle
- Ronald E. Wyllys
- Phyllis Baxendale
- Calvin Mooers
- Robert Fairthorne
- Vannevar Bush (1945)
- Evaluation side: Maron, Stiles, Swanson, Cleverdon, Salton
- Explosion of Information -- necessitating mechanized routes to information
retrieval
Most importantly, there was at this time an information explosion. More and
more documents were being produced, and the manual techniques used for indexing,
classifying, and organizing data were being overtaken. It was no longer possible
for the human being, the information specialist, to keep up.
This problem became one of interest to H. P. Luhn in the last eighteen years of
his life, 1948-1964. His articles normally begin with a reference to the problem
of the information explosion as one for which practical, cost-effective
solutions are needed. His thinking and inventiveness led to using machines to
solve this problem.
There was a tremendous explosion of dollars and of output in the form of reports
and papers, and everyone wanted the latest report to read.
III. H. P. Luhn (1896-1964)
- His background and inventions
Luhn came here from Europe, where he had been the assistant manager of a
textile mill. He was an inventor, well known for hard work and inventiveness.
He held 80 patents, 10 of which apply directly to the textile industry.
Inventions for which Luhn is responsible include the computing gasoline pump,
an inexpensive foldable raincoat, the early development of the American Airlines
Sabre passenger reservation system, a cocktail recipe organizer (built during
Prohibition times on the optical coincidence principle of searching, or the
peekaboo system), and the Luhn Scanner (first applied to chemistry), an
electronic searching selector which led to his interest in literary data
processing.
His UT connection is through his second wife, who was a singer and a music
teacher at UT.
- Work with IBM
In 1941, he was asked by Thomas J. Watson, Sr. to join IBM as a development
engineer. He created many machines for IBM, became interested in electrical engineering,
and ended his career developing literary data processing techniques in the IBM
Research Division.
He held 20 patents at the time IBM hired him in 1941.
- Interest in Literary Data Processing
In January of 1953, at age 57, he published the first article of his life in
American Documentation. By this time he had begun attending meetings of
librarians, literature chemists, and documentalists because he was excited about
the possibilities of applying machines to literary data processing. From 1953 to
1956, he continued to invent machines and components for machines, and
Documentation (now called Information Retrieval) had pretty much become his
full-time interest and area of research.
Although he may have gotten a late start in contributing to the literature, he
made up for it quickly. In 1963, when Carlos Cuadra studied the problems
involved in the identification of key contributions to information science, he
sought expert advice and analyzed references in both textbooks and
bibliographies of the field. Luhn’s name led all the rest on three of the four
listings of major contributions as estimated by experts; he ranked fifth among
the twenty-five highest scoring authors in terms of "publication densities"; and
he ranked in the top ten on the four lists showing the most frequently cited
authors in the field.
He was, however, far more interested in practical contributions to the field
than in the literature alone.
- 1958 was a milestone year in the life of H. P. Luhn
- February 6, 1958: IBM released an announcement of the 704 auto-abstracting
technique.
- May 27, 1958: IBM unveiled Luhn’s ideas for a business intelligence system, or
selective dissemination of information (SDI).
- November 1958: at the International Conference on Scientific Information
(ICSI) in Washington, DC, Luhn introduced his new equipment and illustrated the
practical results by producing the KWIC indexes for the conference program. Two
new Luhn inventions, the 9900 Index Analyzer and the Universal Card Scanner, and
the new Luhn Keyword-in-Context (KWIC) indexing technique were introduced.
Following the conference, newspapers all over the country carried stories about
auto-abstracting and auto-indexing.
- Retired from IBM in 1961 and died in 1964 during his term as President of the
American Documentation Institute
IV. Issues and Terminology of the day:
- Literary data processing
- Information Retrieval
- Trained intermediaries -- specialists
- Automatic indexing
- Luhn became an advocate of automatic indexing, a concept that allowed him to
solve practical problems by using machines to help mankind in cost-effective
ways.
- Index language derived from text vs. having been assigned to documents
- Basically using titles
- KWIC or keyword-in-context indexing
- KWIC indexing had been around a long time; Luhn’s contribution was KWIC by
machine -- hailed as "the greatest thing to happen in chemistry since the
invention of the test tube"
- Automatic indexing derived from the significant frequency word sentence
selection procedure. A pre-determined list of non-informing words to be ignored
(a stop list) was used.
- Advantages were timeliness, speed, and consistency
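The KWIC idea described above (permute the significant words of each title, ignore the words on a stop list, and sort the result by keyword) can be sketched in a few lines of Python. This is only an illustrative sketch, not a reconstruction of Luhn's program: the stop list, the `kwic_index` function, the `width` parameter, and the sample titles are all assumptions chosen for the example.

```python
# Hypothetical stop list; real KWIC stop lists were much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic_index(titles, stop_words=STOP_WORDS, width=30):
    """Return one index line per significant word, sorted by that keyword.

    Each line shows up to `width` characters of left context, a gap, and
    then the keyword followed by up to `width` characters of the title.
    """
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            key = word.strip(".,;:()").lower()
            if not key or key in stop_words:
                continue  # non-informing words get no index entry
            left = " ".join(words[:i])[-width:].rjust(width)
            right = " ".join(words[i:])[:width]
            entries.append((key, left, right))
    entries.sort(key=lambda entry: (entry[0], entry[2]))
    return [f"{left}  {right}" for _key, left, right in entries]


TITLES = [
    "Automatic Creation of Literature Abstracts",
    "A Business Intelligence System",
]
lines = kwic_index(TITLES)
for line in lines:
    print(line)
```

Every keyword lines up in the same column, so a reader can scan down the index alphabetically and still see each word in the context of its title.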
- Indexing, Language and Meaning -- "native" derived by statistical analyses
from the collection itself. By statistical methods you can choose, enlarge, or
reduce your classes in such a fashion that each class has the probability of
being equally populated as far as the overall collection is concerned.
Luhn alluded to meaning or semantics, language or use of words, and indexing.
- Auto-Abstracting
Luhn felt that there was an unintentional danger of misinterpretation and
distortion by human abstractors. He felt that since their backgrounds and
training vary, no two writers produce the same abstract. Machine abstracting,
done in a matter of minutes, would release the valuable talents of scientific
writers, allowing them to work in more creative fields; the possibilities for
errors would be eliminated, and a single standard for abstracting would be
established. This method was based on word frequency. Statistical information
derived from word frequency and distribution was used by the machine to create
a measure of significance, first for the words, then for the sentences.
Sentences scoring high are extracted and printed to become the auto-abstract.
This was the start of more diversified paths having to do with sentence
structures.
- Automatic Creation of Literature Abstracts -- An Experiment in
Auto-Abstracting (1958)
The objective was quick and accurate identification of the topic of a published
paper and automation of the intellectual effort. The experimental work to
eliminate human bias in abstracting was done using articles of 500 or 5000
words.
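The auto-abstracting procedure described above (count word frequencies, treat frequent non-stop words as significant, score each sentence, and extract the top scorers) can be sketched as follows. This is a simplified illustration under stated assumptions, not Luhn's exact scoring formula: the stop list, the `min_freq` threshold, and the score (significant-word count squared divided by sentence length) are choices made for this example.

```python
import re
from collections import Counter

# Hypothetical stop list of non-informing words.
STOP_WORDS = {"a", "an", "and", "are", "by", "can", "is", "of", "the", "was", "when"}


def words_of(sentence):
    """Lower-case alphabetic tokens of a sentence."""
    return re.findall(r"[a-z]+", sentence.lower())


def auto_abstract(text, n_sentences=2, min_freq=2, stop_words=STOP_WORDS):
    """Extract the n highest-scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 1: word significance from frequency across the whole text.
    freq = Counter(w for s in sentences for w in words_of(s) if w not in stop_words)
    significant = {w for w, count in freq.items() if count >= min_freq}

    # Step 2: sentence significance from the significant words it contains.
    def score(sentence):
        words = words_of(sentence)
        hits = sum(1 for w in words if w in significant)
        return hits * hits / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


SAMPLE = (
    "Machines can index documents. Indexing by machines is fast. "
    "The weather was pleasant. Documents are indexed when machines count words."
)
abstract = auto_abstract(SAMPLE)
print(" ".join(abstract))
```

Because the output is a set of sentences lifted verbatim from the source, "automatic extracting" describes the result more accurately than "abstracting".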
- Auto-encoding
- Extension of auto-abstracting using a thesaurus created using word frequency
connecting the terms.
- Coding, which is a transformation to facilitate the operation of a machine,
vs. encoding, which is a translation, reduction, or arrangement of the raw
information into a form that will facilitate the system as a whole
- Depended on items being machine-readable in a sophisticated sense, to
distinguish between reading machines that operate on certain codes and matching
machines that can take code patterns and by complementary techniques discover
where elements within one pattern match those of another pattern.
- IBM 704 was the machine used. Sentences did not always fit together. A better
term was automatic extracting.
- Auto-Encoding of Documents for Information Retrieval Systems (1959)
Based on statistical procedures. One key limitation was that the document must
be in machine-readable form. One-dimensional patterns and multi-dimensional
patterns are described, along with creating the thesaurus.
Word pairs were discussed. For example, "information retrieval" held a
different meaning from "information" and "retrieval" alone.
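The point about word pairs can be illustrated by counting adjacent pairs of significant words, so that "information retrieval" is recorded as a unit rather than as two unrelated words. This is a minimal sketch; the `word_pairs` function, its stop list, and the sample text are assumptions made for the example and are not taken from Luhn's paper.

```python
import re
from collections import Counter

# Hypothetical stop list; a pair is dropped if either word is on it.
STOP_WORDS = frozenset({"a", "and", "for", "in", "is", "of", "on", "the", "to"})


def word_pairs(text, stop_words=STOP_WORDS):
    """Count ordered pairs of adjacent significant words within each sentence."""
    pairs = Counter()
    for segment in re.split(r"[.;:!?]", text.lower()):
        words = re.findall(r"[a-z]+", segment)
        for first, second in zip(words, words[1:]):
            if first not in stop_words and second not in stop_words:
                pairs[(first, second)] += 1
    return pairs


SAMPLE = (
    "Information retrieval is a field. "
    "Retrieval of information depends on indexing. "
    "Information retrieval systems use machines."
)
pair_counts = word_pairs(SAMPLE)
print(pair_counts.most_common(3))
```

Pairs are not formed across sentence boundaries or across stop words, so "retrieval of information" does not count as an occurrence of the pair.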
- Normalization -- in the sense of synonym reduction, lookup of preferred
entries, and other operations designed to simplify or standardize usage of
indexing and documentary languages.
- Families of notions
Concept of compiling families of terms; keywords stored in machine-useful form
and used for normalization of both indexing and search language.
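Normalization through families of notions can be pictured as a lookup table that maps every member of a family to one preferred entry, applied in the same way to index terms and to search terms. The families below are invented for illustration, and `normalize` is a hypothetical helper rather than anything described in the source.

```python
# Hypothetical families of notions: preferred entry -> family members.
NOTION_FAMILIES = {
    "retrieval": {"retrieval", "searching", "lookup"},
    "machine": {"machine", "computer", "device"},
}

# Invert the families into a variant -> preferred-entry lookup table.
PREFERRED = {
    variant: head
    for head, family in NOTION_FAMILIES.items()
    for variant in family
}


def normalize(terms):
    """Replace each term by its preferred entry; unknown terms pass through."""
    return sorted({PREFERRED.get(term.lower(), term.lower()) for term in terms})


index_terms = normalize(["Computer", "searching"])
query_terms = normalize(["device", "lookup"])
print(index_terms, query_terms)
```

Because both sides pass through the same table, a document indexed under "computer" and "searching" can match a query phrased as "device" and "lookup".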
- Selective dissemination of information
IBM called it business intelligence, and Luhn wrote about it as Automated
Intelligence Systems.
- A Business Intelligence System (1958) -- To promote efficient communication,
the key to progress. Based on auto-encoding, auto-abstracting, and automatic
creation of action point profiles.
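The core of selective dissemination, matching each incoming document against stored interest profiles and notifying only the people whose profiles match, can be sketched as below. This is a rough illustration only: the `disseminate` function, the `min_overlap` threshold, and the sample documents and profiles are assumptions, and the keywords are supplied by hand rather than produced by auto-encoding as in Luhn's proposal.

```python
def disseminate(documents, profiles, min_overlap=2):
    """Map each user to the document titles that match their profile.

    `documents` maps a title to its set of keywords; `profiles` maps a
    user to a set of interest keywords. A document is sent to a user when
    at least `min_overlap` keywords are shared.
    """
    notices = {user: [] for user in profiles}
    for title, keywords in documents.items():
        for user, interests in profiles.items():
            if len(keywords & interests) >= min_overlap:
                notices[user].append(title)
    return notices


DOCUMENTS = {
    "Report A": {"indexing", "machine", "abstract"},
    "Report B": {"textile", "patent"},
}
PROFILES = {
    "chemist": {"indexing", "abstract", "chemistry"},
    "engineer": {"patent", "textile", "machine"},
}
notices = disseminate(DOCUMENTS, PROFILES)
print(notices)
```

Raising `min_overlap` makes the service more selective, at the cost of missing documents that touch a profile on only one term.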
V. Items to help gain a broader understanding of Automatic Indexing
- Doyle
- Keenan, S.
- Licklider
- Salton
- Schultz
- Stevens
- Taube
- Watters
- Luhn
VI. Conclusion
Mary Elizabeth Stevens deserves our utmost respect for having compiled a most
comprehensive bibliography on the topic of Automatic Indexing. Her knowledge of
the field must have been the most complete of all. She ends her state-of-the-art
report with these questions, posed in 1964, which point to the flavor of the
times:
- Can indexing be done by machine at all?
- Is what can be done by machine properly termed abstracting, indexing, or
classifying?
- Is whatever is done by machine good enough, acceptable, as good as, or better
than the product of human operations?
- How can we evaluate acceptability or comparability for any indexing process
whatsoever, whether carried out by man or by machine or by machine-aided manual
operations?
- If an indexing product is to be achieved by machine, can it be done by
statistical means alone, or must syntactic, semantic, and pragmatic
considerations be brought to bear in the machine decision-making processes?
If you are interested in more information
on this topic, please refer to the Proceedings of the Conference on the History
and Heritage of Science Information Systems (1998) available in full text at:
http://www.libsci.sc.edu/bob/confprog/confprog.htm
This page is created and maintained
by Sue Soy ssoy@.ischool.utexas.edu
Last Updated 02/24/2003
© Copyright 1996 Susan K. Soy
Please feel
free to copy and distribute freely for academic purposes with this notice and
attribution.
All other rights reserved