Bibliometrics and the World Wide Web

Don Turnbull


This paper explains bibliometrics as a standard method of information analysis and discusses its application to measuring information on the World Wide Web. As information on the Web increases toward entropy, we need to apply theory from other disciplines (such as Information Studies) to develop new methods, modeling techniques and metaphors for examining this emerging complex network. I hope to show the relevance of using the operational methods of bibliometrics to produce quantitative data and analysis that improve the use of information systems.

The first section of this paper, Bibliometrics, introduces the standard bibliometric methods and their use, followed by Bibliometric Laws, which introduces some constants of the bibliometric universe. The next part, Applying Bibliometrics, examines bibliometric applications for the World Wide Web. Finally, the last section, Designing for Metrics, overviews some Web development techniques that enable bibliometric measurement. An extensive resource bibliography is also included.


Metrics attempt to make sense of the world through measurement. Bibliometrics specifically focuses on the sub-domain of information, supplying a "series of techniques that seek to quantify the process of written communication" (Ikpaahindi 1985). By the conclusion of this paper we will expand this to include browsing and Web usage data. Classic bibliometrics results from the idea that information has patterns that can be analyzed by counting and analyzing citations, finding relationships among these references based on frequency, and applying other statistical formulas. Common examples of bibliometric data are the Science Citation Index and the Social Science Citation Index, where an author's publication influence can be seen in how often she is referenced.

Citation Measurement

A host of standard citation reasons show the many types of purposes a citation may have:

While complete, the variety of these reasons and their various implications is difficult to measure on any single scale, since classic citation analysis begins by counting all types of citations with little (if any) weighting by reference type.

Refined Classic Bibliometrics

A refinement of mass citation counting is direct citation counting, where the quantity of citations is tracked over a given period of time to test for aspects of an author's or article's impact. The standard formula for impact is:

n journal citations/n citable articles published

While somewhat blunt, this measure becomes more useful when citations are applied and averaged along particular ordinals (by year, journal, or field). Another basic bibliometric technique is calculating an immediacy index of influence using this formula:

n citations received by article during the year/total number of citable articles published


This is also a useful metric for a broader view of impact; however, it may not always be objective, as some journals have more prestige or a longer history of publication, or may require certain atypical reference types (historical listings of past articles, for example). We can therefore see how varieties of journal articles can influence an immediacy index.
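As a sketch, the two formulas above reduce to simple ratios; the journal counts below are invented purely for illustration:

```python
# The two journal metrics reduce to simple ratios; the counts below are
# invented for illustration.
def impact_factor(citations_received, citable_articles):
    """Journal citations divided by citable articles published."""
    return citations_received / citable_articles

def immediacy_index(same_year_citations, citable_articles):
    """Citations received during the publication year, per citable article."""
    return same_year_citations / citable_articles

print(impact_factor(1240, 310))            # → 4.0
print(round(immediacy_index(95, 310), 3))  # → 0.306
```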

Bibliographic Coupling

Kessler suggests a technique known as bibliographic coupling: measuring the number of references two papers have in common to test for similarity. He then showed that clustering based on this measure yields meaningful groupings of papers for research (and information retrieval), stating that "a number of papers bear a meaningful relation to each other when they have one or more references in common" (Kessler 1963).

Kessler also found a high correlation between groups formed by coupling and groups formed by subject indexing. As bibliometric analysis has become more automated, many have tried to take these techniques and engineer software to detect patterns and establish relationships between articles. Price and Schiminovich began to analyze electronic publications in the late 1960s and early 1970s as some journals moved toward digital formats (Price 1968; Schiminovich 1971). These were the first real steps toward applying bibliometrics to electronic publications. Moreover, once data collection and input issues are overcome, information in raw digital form can be measured far more quickly than by hand.
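A minimal sketch of Kessler-style bibliographic coupling, using invented paper identifiers and reference lists:

```python
# Bibliographic coupling: two papers are coupled by the number of
# references they share. The paper reference lists are invented.
def coupling_strength(refs_a, refs_b):
    return len(set(refs_a) & set(refs_b))

paper_a = ["smith1959", "jones1960", "brown1961"]
paper_b = ["jones1960", "brown1961", "lee1962"]
print(coupling_strength(paper_a, paper_b))  # → 2
```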

Cocitation Analysis

Marshakova and Small (independently) developed coupling further by noting that if two references are cited together in a later literature, the two references are themselves related. The greater the number of times they are cited together, the greater their cocitation strength. The major distinction between bibliographic coupling and cocitation is that while coupling measures the relationship between source documents, cocitation measures the relationship between cited documents. Cocitation implies that an author purposefully chose to relate two articles, not merely the association between two articles that coupling reveals.
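Cocitation strength can be sketched by counting, for each citing paper, every pair of references that appears together in its reference list; the citing papers and identifiers below are invented:

```python
from collections import Counter
from itertools import combinations

# Cocitation: every pair of references appearing together in one citing
# paper's reference list is cocited once. The data is invented.
citing_papers = [
    ["kessler1963", "price1968"],
    ["kessler1963", "price1968", "small1973"],
    ["price1968", "small1973"],
]
cocitations = Counter()
for refs in citing_papers:
    for pair in combinations(sorted(set(refs)), 2):
        cocitations[pair] += 1

print(cocitations[("kessler1963", "price1968")])  # → 2
```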

To tune these techniques to finer points involves asking a series of questions about the reference:

is the reference conceptual or operational?

is the reference organic or perfunctory?

is the reference evolutionary or juxtapositional?

is the reference confirmative or negational?

These questions reveal granularity and trends in measurement, such as a tendency to refute (or present an alternative to) an article, or the growing importance of a mode of thought. We can also begin to see the maturity of ideas as they evolve through the literature.

Common Bibliometric Errors

Beyond the simple yields of basic bibliometric measurements, there are inconsistencies in scholarship and publication to account for. Common problems include:

As we move toward more automatic analysis, awareness of these types of errors lets us develop software that takes them into account, through measures such as indexes of common names and misspellings or field-specific metadata. Wide-scale data analysis with these simple methods can reveal relationships among all information types. We can begin to develop programs that ask interesting questions: are papers cited in the same footnotes and endnotes more closely related than other citations in a document overall? Does the order of citations in context imply relations? Does the data format (bibliography order or publication style) mean anything? Do citations have transitive relationships (if A cites B and B cites C, how does A relate to C)? Software analysis is ideal for answering these comparison questions.

In general, all of the basic bibliometric techniques work well with many types of information entities: authors, journals, organizations, departments, universities, and entire mediums. As we progress into the general laws of bibliometrics, wider application domains are revealed.

Bibliometric Laws

The bibliometric laws in this section are the equivalents of Newton's laws of physics: the classics of their field, built upon by others, but not perfectly correct in every situation. Like physical laws, they seek to describe the workings of a system by mathematical means. Even though bibliometric scholarship is mature, there is little evidence of any Einsteinian breakthrough proving the bibliometric laws to be concrete laws. They are nonetheless incredibly useful for developing general theories about information, and they provide data for further study.

The three prime laws of bibliometrics are Bradford's Law of Scattering, Lotka's Law and Zipf's Law. Each of these is examined in this section as well as some corollaries and supporting laws.

Bradford's Law of Scattering

Generally, Bradford revealed a pattern of how literature in a subject is distributed in journals. "If scientific journals are arranged in order of decreasing productivity of articles on a given subject, they may be divided into a nucleus of periodicals more particularly devoted to the subject and several other groups of zones containing the same number of articles as the nucleus" (Drott 1981). Bradford's Formula makes it possible to estimate how many of the most productive sources would yield any specified fraction p of the total number of items. The formula is:

R(n) = N log(n/s)    (1 ≤ n ≤ N)

where R(n) is the cumulative total of items contributed by the sources of rank 1 to n, N is the total number of contributing sources and, s is a constant  characteristic of the literature. We then can say that:

R(N) = N log(N/s)

is the total number of items contributed by sources.

Over time, this is also a measure of the rate of obsolescence, distinguishing between the usage levels of items. Essentially, this is a method of clustering. For example, 9 journals contain 429 articles, the next 59 contain 499, and the last 258 contain 404. We get roughly three groupings of articles (each ranging from 404 to 499). Bradford noticed the consistency in the number of titles it takes to contribute each third of the articles.

Bradford discovered this regularity in the number of titles in each of the three groups: 9 titles, 9x5 titles, 9x5x5 titles. Drott suggests that we can apply this widely, as long as we account for sample sizes, areas of (journal) specialization and journal policies (Drott 1981).
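Bradford's zone regularity and Brookes' cumulative formula can be sketched as follows, using Bradford's zone counts from above; the constant s passed to the function is an assumed illustrative value:

```python
import math

# Bradford's zones: each zone contributes roughly the same number of
# articles while the count of titles grows by a constant multiplier.
zones = [(9, 429), (59, 499), (258, 404)]   # (journals, articles) per zone
multiplier = 5
predicted_titles = [9 * multiplier**k for k in range(3)]
print(predicted_titles)  # → [9, 45, 225]

# Brookes' cumulative form R(n) = N log(n/s); s is a constant
# characteristic of the literature (illustrative value when called).
def cumulative_items(n, N, s):
    return N * math.log(n / s)
```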

B. C. Brookes takes Bradford one step further, pointing out that he is correct if we have a finite (manageable, relevant) number of journals. Editorial selection and publishing costs currently determine much of the structure and content volume of most publications. Brookes' point may be more relevant when analyzing the expanding, multi-relational World Wide Web. Will these ratios hold, change or not apply at all? With a deluge of information, we may find the limits of this law. Again, Brookes implies this by stating that "index terms assigned to documents also follow a Bradford distribution because those terms most frequently assigned become less and less specific... (but)... increasingly ineffective in retrieval" (Brookes 1973).

Volume and homogeneity may work against us. More positively, we may discover a drop-off point beyond which Bradford's Law at least applies for some period; where resources are under pressure, that drop-off point matters. We can examine this again over time: citations originally counted year by year can be expressed as the geometric sequence:

R, Ra, Ra^2, Ra^3, Ra^4, ..., Ra^(t-1)


where R is the presumed number of citations during the first year (some of which could not immediately be referenced in publication); since a < 1, the sum of the sequence converges to the finite limit R/(1-a). Here again we see how Web documents might be affected, once we consider what constitutes a "Web year".
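The convergence claim can be checked numerically; R and a below are illustrative values, not data from any study:

```python
# Citation-obsolescence series R, Ra, Ra^2, ...: with a < 1 the partial
# sums approach the finite limit R / (1 - a). R and a are illustrative.
R, a = 100.0, 0.8
partial = sum(R * a**t for t in range(50))   # first 50 "years"
limit = R / (1 - a)
print(round(partial), round(limit))  # → 500 500
```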

Lotka's Law

Generally, Lotka's Law is an inverse square law stating that for every 100 authors contributing one article each, 25 will contribute 2, 11 will contribute 3, and 6 will contribute 4. We see a general decrease in productivity across a body of authors following 1:n^2. This ratio shows that some produce much more than the average, which seems true of all kinds of content creation. However, Lotka's Law doesn't take impact into account, only production counts. Furthermore, in 1974 Voos found that in Information Science the ratio was closer to 1:n^3.5 (Voos 1974). Thus we can say that Lotka's Law may not be constant in its exponent while still following an inverse power form. Our challenge will then be to find the correct exponent for different mediums and fields.
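Lotka's distribution can be sketched with the exponent left adjustable, since (as Voos found) it appears to vary by field; the function name is my own:

```python
# Lotka-style productivity: authors contributing n articles number about
# 1/n^exponent of those contributing one. Exponent 2 is Lotka's inverse
# square; Voos found roughly 3.5 for Information Science.
def authors_with_n_articles(single_article_authors, n, exponent=2.0):
    return round(single_article_authors / n**exponent)

print([authors_with_n_articles(100, n) for n in (1, 2, 3, 4)])  # → [100, 25, 11, 6]
```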

Zipf's Law

The most powerful, wide-ranging law of bibliometrics is Zipf's Law. It essentially predicts that as we write, we use familiar words with high frequency. Applied to word frequency in a text, the distribution states that the nth-ranked word will appear k/n times, where k is a constant for that text.

For analysis, this can be applied by counting all of the words in a document (minus the common words on a stop list: the, therefore, and so on), with the most frequent occurrences representing the subject matter of the document. We could also use relative frequency (occurring more often than expected) instead of absolute frequency to determine when a new word is entering a vocabulary.
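A word-frequency count with a stop list, as just described, can be sketched in a few lines; the stop list and sample text are illustrative:

```python
import re
from collections import Counter

# Rank words by frequency after removing a small stop list. Both the
# stop list and the sample text are illustrative.
STOP = {"the", "a", "of", "and", "to", "in", "is"}

def rank_words(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP).most_common()

sample = "The law of the web is the law of words and the words repeat"
print(rank_words(sample)[:3])
```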

Zipf said his law is based on the main predictor of human behavior: striving to minimize effort. Therefore, Zipf's work applies to almost any field where human production is involved.[1] We then have a constrained relationship between rank and frequency in natural language. Perhaps Zipf's Law can be applied to other scales of information as well.

Wyllys (1967) suggests that an approach by Benoit Mandelbrot works better at a more granular level: the cost of a message, counted in the words, the letters that spell them and the spaces that separate them, increases with the number of letters in a word and (expanding outward again) with the length of the message. In other words, Zipf's Law works at the micro (language) level as well.

Other Ideas

Other, albeit more specific, laws can be used for particular purposes. When applying these laws to Web analysis, implementation details can easily be imagined.

Cleverdon found an inverse relationship between recall and precision in the matching of a group of terms (with some human elaboration), suggesting that matching techniques could gradually replace human indexing with statistical techniques. This could be used in a Web index environment, augmented by a knowledge base of some kind to handle outliers. Weighted tables and even fuzzy logic systems could also be used.

Another useful idea is Goffman's General Theory of Information Systems. Goffman's thesis is that ideas are "endemic" with minor outbreaks occurring from time to time. Research and reference can follow cycles of use. This is very similar to ideas like memes and paradigm shifts. All of the above bibliometric techniques can be used to test for this theory. Software can be used to study document representations (abstracts of samples), references, and citations (in a single or set of documents). We can then compare a matrix of values to find correlations. We can also use a technique like this to help design future documents by using like terms, references, and citations to establish stronger linkage relationships among documents. A listing of possible variables for study is included in Appendix A.

We can take many of these ideas to deduce how to manage our resources and test the optimum utility of a document, since there is always a trade-off between document or Web site size and maintenance. The most-used content can be augmented and used as a starting point for other topic selection, perhaps for guided paths through the site. Even caching can be predicted with bibliometrics: keep the most frequently accessed and highest-impact articles available for the quickest access. This leads us to Price's "cumulative advantage model" where, simply put, success breeds success (Price 1976).

Now that bibliometrics has been overviewed, we can move on to how it applies more specifically to the World Wide Web.

Bibliometrics on the Web

Bibliometrics is directly applicable to the World Wide Web, the largest network of documents in history. We can use bibliometric techniques, rules and formulas in a "pure" environment: information about information, organized on information systems. Examples overviewed in this section include analyzing Web server usage logs of all types, system resource architecture, and content creation to maximize compatibility with bibliometric methods. By customizing Web resources to leverage bibliometrics, we can generate more precise data and bring more rigor to the organization of system architecture.

Web Surveys

Since the Web has existed, there have been numerous studies to determine both usage and user demographics. Most of these efforts are similar to bibliometric techniques and reveal comparable results. By examining these surveys, we can devise methods to refine or augment general bibliometrics to find out who users are and what they are doing.

The Georgia Tech Graphics, Visualization, and Usability Web Surveys are the cornerstone of Web demographics. Through clever use of programming, data gathering, and extensive statistical analysis, we are finally developing a profile of Web users and their preferences. By using HTML forms, qualitative information can be collected in addition to the large amounts of quantitative data. Techniques such as oversampling and random sampling (among users) are also applied to remove as much ambiguity as possible (Kehoe 1996). Surveys like these could be extended to gather more specific Web metrics by observing how users actually react to and use specific Web documents, rather than by asking them. As these techniques are refined, they can be automated in an effort toward adaptable Web interfaces.

Using Web Servers

In general, Web servers can send any media format they are configured for as MIME (Multipurpose Internet Mail Extensions) types. Standard formats include text, graphics, CGI (Common Gateway Interface) programs, and Java applets. The latter two are scripts or programs that execute on the client, the server, or both, to perform some extra function. Obvious implications of adding programming logic to a server system include affecting performance, content organization, and enhancing server capabilities. However, most of these programming techniques circumvent server logging, because typical system logs only capture basic server procedures (primarily HTTP GET, POST, and SEND). With the added complexity of additional operations, these standard procedures can be skipped or incorrectly logged compared to the actual function of the server (Liu 1994). This makes gathering bibliometric data challenging.

One of the primary problems with using bibliometrics on Web servers is that no state information is captured between hits to the server. The current server protocols treat every access as separate; by default, no user paths through the resources are recorded. This would be similar to counting citations without knowing the number of source documents. Another problem is that server hits themselves don't represent true usage. Some users may be turned away when connections are maxed out, while redundant or tangential content (bullet graphics, or duplicate navigation icons at the top and bottom of pages) is served and logged as though it were as important as any other data. For these same reasons, Web counters are most likely inaccurate too. One final threat to valid metric information is the advent of robots that can access a site and retrieve any number of its resources (even multiple times). What these problems show is that much of the data currently gathered on the Web is highly inaccurate. However, with a combination of server logs and appropriate design of system resources, more relevant data can be gathered.
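Before any such analysis, raw log entries have to be parsed. A sketch of parsing lines in the format shown in Appendix B (with host fields stripped); the regular expression is an assumption sufficient for those sample lines, not a full Common Log Format parser:

```python
import re

# Parse log entries like those in Appendix B (host fields stripped).
# The pattern is an assumption sufficient for these sample lines.
LOG_RE = re.compile(
    r'\[(?P<when>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*"\s+'
    r'(?P<status>\d+)\s+(?P<bytes>\d+)'
)

line = '- - [15/Sep/1995:13:19:43 +0100] "GET /access.log HTTP/1.0"  200  3'
m = LOG_RE.search(line)
print(m.group("path"), m.group("status"))  # → /access.log 200
```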

Designing for Bibliometrics

The first step is to fully enable all possible server logs. With some development, logs can also be collected on many normally ignored functions. On a typical UNIX Web server, there are four possible types of logs:

  1. Server-based - the log file created by the Web server itself.

  2. Proxy-based - via a firewall (additional layer of server protection) or some server controlled access system.

  3. Client-based - via client application code that can record and transmit information to the server such as cookies.

  4. Network-based - the programs that control system-level networking security and access on the server (Spainhour 1996).

Most log options are minimized to reduce system overhead, but to collect complete data, I argue the performance hit is negligible compared to the wealth of accurate information collected. It is also possible to shift recording options around for certain Web events or to specifically address known performance problems. Even periodic log collection can prove useful when analyzing a Web site.

Another (albeit expensive) option is to use a second system networked to the server to gather log information. Alternatively, some clever server software developers offload the content to another server and allow the initial Web server to gather native log information.[2]

Other internal server information can be collected, such as email logs to and from the system administrator (the data sizes and times, not necessarily the content), program usage logs (even at the maintenance level), and disk usage (direct system information about various cache configurations).

Log information can be further augmented externally from the server by using programs that compare server usage to others or to gather more information via nslookup, ping, who, finger, and other network identification programs.[3]

Optimal Web Server Configuration

There are a set of standard methods to use in configuring a Web server to gather comprehensive metrics:

  1. Redirect requests through CGI programs to query the HTTP referer (HTTP_REFERER) value.

  2. Use a database to store and dynamically publish Web content, making for consistent content organization and providing an additional log (the database access log) in which to record usage data.

  3. Create state information with programming via NSAPI or ActiveX technology to track a user's path in real-time through the Web site.

  4. Use an FTP (File Transfer Protocol) server for file transfers, freeing up Web server resources and giving more specific file transfer logs.

Managing Log Files

The irony of the Web is that there are almost unlimited opportunities to gather usage and document-relation information, but few native methods to gather, process and compare the data. It is here that bibliometrics can be built upon for interpreting Web data.

Additionally, Web server logs might not perfectly match a real Web user session or represent an exact fit between documents, but they approximate both by delivering large amounts of redundant information with precision. Again, due to their digital nature, these research resources are open to transformation and processing of all kinds. However, care must be taken to preserve records and their results. Some general methods to ensure log file reliability include:

Regular backup, based on server usage patterns.

Storing both log file analysis results and the logs themselves.

Beginning new logs on a timely and manageable basis.

Posting results and log information for others to compare.

The log file format itself is fairly dense (see Appendix B: Typical Server Log) and can grow large and unwieldy. This should be taken into account when planning log processing and recording methods. However, storage costs are decreasing rapidly, especially in comparison to the value of quality information about a server's performance and usage logs.

Log analysis tools such as Analog, WWWStat, GetStats, and various Perl scripts are freely available to help process and display server statistics. Commercial server analysis tools offer a host of additional options, including meta-data analysis, compression and automatic, periodic analysis.

A typical log analysis cumulative sample shows a number of overlapping results to judge performance and access:

Program started at Tue-03-Dec-1996 01:20 local time.
Analysed requests from Thu-28-Jul-1994 20:31 to Mon-02-Dec-1996 23:59 (858.1 days).
Total successful requests: 4 282 156 (88 952)
Average successful requests per day: 4 990 (12 707)
Total successful requests for pages: 1 058 526 (17 492)
Total failed requests: 88 633 (1 649)
Total redirected requests: 14 457 (197)
Number of distinct files requested: 9 638 (2 268)
Number of distinct hosts served: 311 878 (11 284)
Number of new hosts served in last 7 days: 7 020
Corrupt logfile lines: 262
Unwanted logfile entries: 976
Total data transferred: 23 953 Mbytes (510 619 kbytes)
Average data transferred per day: 28 582 kbytes (72 946 kbytes)


Most of this data is directly relevant and begins to show how complex analysis can yield previously hidden information.

Unfortunately, almost every type of server program and each individual server log generator creates a slightly different format. An effort to change this is currently underway by the WWW Consortium Standards Committee to agree upon an Extended Log File Format.[4] This new structure will automatically include much of what is now done (inconsistently) by add-on utilities. The new format should also prove to be:

Downie and Web Usage

Many of these ideas appear in Downie's early attempt to apply bibliometrics to the Web. Mainly, he analyzed the following categories:

User-based analyses to discover more about user demographics and unveil preferences based on:

who (organization)

where (location)

what (client browser).

Request analyses for content.

Byte-based analyses to measure raw throughput over certain timeframes.

The power of these techniques is that they can be merged to develop a detailed scenario of a user's visit(s) to a Web site: preferences, problems and actions. Using bibliometric analysis techniques, Downie discovered via a rank-frequency table that requests conformed to a Zipfian distribution (Downie 1996).
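A rank-frequency table like Downie's can be sketched from a list of requested paths; under a Zipfian distribution, rank times frequency stays roughly constant. The request data below is invented:

```python
from collections import Counter

# Rank-frequency table of requested paths; under a Zipfian distribution,
# rank * frequency stays roughly constant. Request data is invented.
requests = ["/", "/setup.html", "/setup.html", "/", "/setup.html", "/httpd.gif"]
table = [(rank, path, freq)
         for rank, (path, freq) in enumerate(Counter(requests).most_common(), 1)]
for rank, path, freq in table:
    print(rank, path, freq, rank * freq)
```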

Other results confirm that poor Web server configuration and lack of access to (or use of) full log files can hinder further results. It is also worth noting that Downie attends to ethical observation issues that many Webmasters and information system professionals don't normally consider. I hope a growing awareness of these issues continues.

Optimal Web Content Setup

Based on the server configuration techniques and Downie's real-world experience, a few methods can be used to organize Web content optimally:

External Bibliometric Gathering

It is also possible to gather information about a Web site from various sources on the Internet external to the site itself. If a site is developed using the above methods, more site-external data can be collected. An obvious example is to use a search engine to find references to the Web site in general or to specific URLs (Uniform Resource Locators). This process is made simpler since page and directory names are somewhat unique. However, for a true crosscut, one should use different search engines to collect a variety of data across time and geographic regions. Each search engine has its own update schedule as well as its own techniques for filtering redundant and similar information; use that as an advantage in gathering data.

Programmatically walking up the links in any referencing sites may also reveal less obvious links or references. Reverse parsing of URLs could reveal its own Zipfian distribution of content types as well, revealing characteristics of directory and file names.
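Such reverse parsing might be sketched by splitting referenced URLs into file extensions and tallying them; the URLs below are invented examples:

```python
from collections import Counter
from urllib.parse import urlparse

# Split referenced URLs into file extensions to see the distribution of
# content types. The URLs are invented examples.
urls = [
    "http://example.edu/papers/biblio.html",
    "http://example.edu/papers/zipf.pdf",
    "http://example.edu/logs/access.log",
]
extensions = Counter(urlparse(u).path.rsplit(".", 1)[-1] for u in urls)
print(extensions.most_common())
```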

For fine-grained analysis, services such as DejaNews can show postings and access statistics on certain Web usage data. Finally, utilizing a customized robot or spider to search internal and external Web sites for commonalities and references can also yield new data.


This paper begins to examine the potential benefits of applying bibliometrics to learning more about World Wide Web usage. As we gain wisdom about Web behavior, capture current and previous information-seeking behavior, and modify interfaces and content to meet needs, the Web will become a truly mature information resource.

The Web's growth is out of control, which means opportunities exist where wise setup, good system architecture and diligent analysis can be applied for everyone's benefit.

Appendix A: Reference Variables


            name(s) - married, initials, spellings (translations, trends)

            nationality (sometimes implies language)

            affiliated organizations

            frequency of publication

            number of pages annually

            number of articles annually

            list of earlier work

            method description

            Part Cited

                        whole cited

            Ref Location

                        in the single document

                        number of total

            References in same footnote/comment/topic

            Static snapshot

            Dynamic tracing of evolution

Methods of Studying

            frequency count

            impact factors

            simple descriptive

            descriptive - predictive

            function and quality of references

            literature of person (influence)

            trend analysis, prediction (i.e. every article will have Nielsen by 1997)

            communication patterns

            structure mapping, science of science studies

            information retrieval

Appendix B. Typical Server Log

- - [15/Sep/1995:13:19:43 +0100] "GET /access.log HTTP/1.0"  200  3
- - [15/Sep/1995:13:19:37 +0100] "GET /wwwstat4mac/wibble.hqx HTTP/1.0"  200  3026
- - [15/Sep/1995:13:19:43 +0100] "GET /access.log HTTP/1.0"  200  3
- - [15/Sep/1995:13:19:51 +0100] "GET / HTTP/1.0"  200  3026
- - [15/Sep/1995:13:20:02 +0100] "GET /quit.sit.hqx HTTP/1.0"  200  2941
- - [15/Sep/1995:13:46:25 +0100] "GET / HTTP/1.0"  200  3026
- - [15/Sep/1995:13:46:29 +0100] "GET /httpd.gif HTTP/1.0"  200  1457
- - [15/Sep/1995:14:03:17 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:14:15:04 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:14:22:08 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:15:15:07 +0100] "GET / HTTP/1.0"  200  3026
- - [15/Sep/1995:15:15:23 +0100] "GET /httpd.gif HTTP/1.0"  200  1457
- - [15/Sep/1995:15:16:34 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:15:17:26 +0100] "GET /server.html HTTP/1.0"  200  598
- - [15/Sep/1995:15:20:10 +0100] "GET / HTTP/1.0"  200  3026
- - [15/Sep/1995:15:20:11 +0100] "GET /httpd.gif HTTP/1.0"  200  1457
- - [15/Sep/1995:15:25:53 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:15:41:29 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:16:10:24 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:16:46:51 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:17:17:19 +0100] "GET /access.log HTTP/1.0"  200  1691
- - [15/Sep/1995:17:18:27 +0100] "GET /httpd4Mac-v123d9.sit.hqx HTTP/1.0"  200  46498
- - [15/Sep/1995:17:28:03 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:17:56:12 +0100] "GET / HTTP/1.0"  200  3026
- - [15/Sep/1995:17:56:15 +0100] "GET /httpd.gif HTTP/1.0"  200  1457
- - [15/Sep/1995:17:58:35 +0100] "GET /about.html HTTP/1.0"  200  23297
- - [15/Sep/1995:18:09:51 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:18:29:19 +0100] "GET / HTTP/1.0"  200  3026
- - [15/Sep/1995:18:29:25 +0100] "GET /httpd.gif HTTP/1.0"  200  1457
- - [15/Sep/1995:18:30:48 +0100] "GET /about.html HTTP/1.0"  200  23297
- - [15/Sep/1995:18:31:55 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:18:38:13 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:18:40:03 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:18:41:14 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:18:42:19 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:18:44:04 +0100] "GET / HTTP/1.0"  200  3026
- - [15/Sep/1995:19:07:26 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:19:08:19 +0100] "GET / HTTP/1.0"  200  3026
- - [15/Sep/1995:19:08:21 +0100] "GET /httpd.gif HTTP/1.0"  200  1457
- - [15/Sep/1995:19:08:53 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:19:10:59 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:19:53:00 +0100] "GET / "  200  3026
- - [15/Sep/1995:20:10:37 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:20:11:57 +0100] "GET Setup.HtML HTTP/1.0"  200  2412
- - [15/Sep/1995:20:31:50 +0100] "GET /setup.html HTTP/1.0"  200  2412
- - [15/Sep/1995:20:39:41 +0100] "GET /setup.html HTTP/1.0"  200  2412
- - [15/Sep/1995:20:40:03 +0100] "GET /setup.html HTTP/1.0"  200  2412
- - [15/Sep/1995:20:52:37 +0100] "GET / "  200  3026


Brookes, B. C. 1973. Numerical Methods of Bibliographic Analysis. Library Trends:18-43.

Downie, Stephen J. 1996. Informetrics and the World Wide Web: a case study and discussion. Paper read at Canadian Association for Information Science, June 2-3, at University of Toronto.

Drott, M. C. 1981. Bradford's Law: Theory, Empiricism and the Gaps Between. Library Trends Summer (Special Issue on Bibliometrics):41-52.

Ikpaahindi, Linus. 1985. An Overview of Bibliometrics: its Measurements, Laws and their Applications. Libri (June):163-176.

Kehoe, Colleen and Pitkow, James E. 1996. Surveying the Territory: GVU's Five WWW User Surveys. World Wide Web Journal 1 (3):77-84.

Kessler, M.M. 1963. Bibliographic coupling between scientific papers. American Documentation 14:10-25.

Liu, Cricket, Peek, Jerry, Jones, Russ, Buus, Bryan, and Nye, Adrian. 1994. Managing Internet Information Services, A Nutshell Handbook. Sebastopol: O'Reilly & Associates, Inc.

Price, Derek de Solla. 1976. A General Theory of Bibliometric and Other Cumulative Advantage Processes. Journal of the American Society for Information Science 27 (Sept-Oct):292-306.

Price, N. and Schiminovich, S. 1968. A Clustering Experiment: First Step Towards a Computer-Generated Classification Scheme. Information Storage and Retrieval 4:271-280.

Schiminovich, S. 1971. Automatic Classification and Retrieval of Documents by Means of a Bibliographic Pattern Discovery Algorithm. Information Storage and Retrieval 6:417-435.

Spainhour, Stephen and Quercia, Valerie. 1996. Webmaster in a Nutshell. Sebastopol: O'Reilly & Associates.

Voos, H. 1974. Lotka and Information Science. Journal of the American Society of Information Science 25:270-273.

Source Bibliography

Agosti, Maristella, Fabio Crestani, and Massimo Melucci. 1996. Design and Implementation of a Tool for the Automatic Construction of Hypertexts for Information Retrieval. Information Processing and Management 32 (4):459-476.

Andrews, Robert. 1996. Designing the "Perfect" Web Site Architecture. Paper read at Netscape Developer's Conference, March 1996, at San Francisco.

Baird, P. 1990. International Use Problems with Hypertext. In Designing User Interfaces for International Use, edited by J. Nielsen. Amsterdam: Elsevier Science Publishers.

Blake, Jodi K. 1992. Jump Right In: A Checklist for Planning an Online Documentation Project. Paper read at SIGDOC Proceedings '92.

Botafogo, R.A., Rivlin, E. and Shneiderman, B. 1992. Structural analysis of hypertexts: Identifying hierarchies and useful metrics. ACM Trans. Information Systems 10 (2):142-180.

Brookes, B. C. 1973. An Outline of Bibliometrics and Citation Analysis. Library Trends:18-43.

Brookes, B. C. 1973. Numerical Methods of Bibliographic Analysis. Library Trends:18-43.

Burton, R.E. and Kebler, R.W. 1960. The half-life of some scientific and technical literature. American Documentation 11:18-22.

Campagnoni, F.R., and Ehrlich, K. 1989. Information retrieval using a hypertext-based help system. ACM Trans. Information Systems 7 (3):271-291.

Cavallaro, U. et al. 1993. HIFI: Hypertext interface for information systems. IEEE Software 10 (5):48-51.

Charnock, E., Rada, R., et al. 1994. Task-based method for creating usable hypertext. Interacting with Computers 6 (3):275-287.

Cleverdon, Cyril W. 1967. The Cranfield Tests on Index Language Devices. Paper read at Aslib, June.

Cooke, P., and Williams, I. 1989. Design issues in large hypertext systems for technical documentation. In Hypertext: Theory into Practice, edited by R. McAleese. Norwood, NJ: Ablex.

Crane, G. 1990. Standards for a hypermedia database: Diachronic vs. synchronic concerns. Paper read at NIST Hypertext Standardization Workshop, January 16-18, at Gaithersburg, MD.

Cronin, Blaise and Overfelt, Kara. 1994. The Scholar's Courtesy: A Survey of Acknowledgement Behaviour. Journal of Documentation 50 (3):165-196.

Daniels, P. 1986. Cognitive models in information retrieval - an evaluative review. Journal of Documentation 42 (4):272-304.

Delisle, N., and Schwartz, M. 1987. Contexts- A partitioning concept for hypertext. ACM Trans. Office Information Systems 5 (2):168-186.

Drott, M. C. 1981. Bradford's Law: Theory, Empiricism and the Gaps Between. Library Trends Summer (Special Issue on Bibliometrics):41-52.

Egghe, L. and R. Rousseau. 1990. Introduction to informetrics: quantitative methods in library, documentation and information science. New York: Elsevier Science Publishers.

Fidel, Raya. 1993. Qualitative methods in information retrieval research. Library and Information Science Research 15 (3):219-248.

Frisse, M.E. and Cousins, S.B. 1992. Models for hypertext. Journal of the American Society for Information Science 43 (2):183-191.

Furuta, R., Plaisant, C. and Shneiderman, B. 1989. A spectrum of automatic hypertext constructions. Hypermedia 1 (2):179-195.

Garfield, E. 1965. Can citation indexing be automated? Paper read at Statistical association methods for mechanized documentation.

Gell-Mann, Murray. 1994. The Quark and the Jaguar: Adventures in the Simple and the Complex. New York: W. H. Freeman and Company.

Gygi, Kathleen. 1990. Recognizing the Symptoms of Hypertext... and What to Do About It. In The Art of Human-Computer Interface Design, edited by B. Laurel. San Francisco: Addison Wesley.

Hahn, U. and Reimer, U. 1988. Automatic generation of hypertext knowledge bases. Paper read at Proc. ACM Conf. Office Information Systems, March 23-25, at Palo Alto, CA.

Hapeshi K., and Jones, D. 1992. Interactive multimedia for instruction: A cognitive analysis of the role of audition and vision. International Journal Human Computer Interaction 4 (1):79-99.

Hjerppe, Roland. 1980. An Outline of Bibliometrics and Citation Analysis.

Horn, Robert E. 1989.  Mapping Hypertext. Waltham, MA: Information Mapping, Inc.

Horton, William. 1991. Is Hypertext the Best Way to Document Your Product? STC Technical Communications 38 (1):25.

Horton, William K. 1990. Designing & Writing Online Documentation. New York, NY: John Wiley & Sons.

Ikpaahindi, Linus. 1985. An Overview of Bibliometrics: its Measurements, Laws and their Applications. Libri (June):163-176.

Kacmar, C., Leggett, J., Schnase, J.L. and Boyle, C. 1988. Data management facilities of existing hypertext systems: Texas A&M University.

Kessler, M.M. 1963. Bibliographic coupling between scientific papers. American Documentation 14:10-25.

Marchionini, Gary and Shneiderman, Ben. 1988. Finding facts vs. browsing knowledge in hypertext systems. IEEE Computer 21 (1):70-80.

Mayer, Richard E. 1985. Structural Analysis of Science Prose: Can We Increase Problem-Solving Performance? In Understanding Expository Text, edited by B. K. Britton and J. B. Black. Hillsdale, NJ: Erlbaum.

McKnight, C., Richardson, J. and Dillon, A. 1991. Journal articles as learning resource: What can hypertext offer? In Designing Hypertext/Hypermedia for Learning, edited by D. H. Jonassen and H. Mandl. Heidelberg, Germany: Springer-Verlag.

Meyrowitz, N. 1986. Intermedia: The architecture and construction of an object-oriented hypermedia system and applications framework. Paper read at Proc. OOPSLA'86 Conf. Object-Oriented Programming Systems, Languages, and Applications, 29 September-2 October, at Portland, OR.

Moed, H. F. and Vriens, M. 1989. Possible Inaccuracies Occurring in Citation Analysis. Journal of Information Science 15:94-107.

Nicholas, David and Ritchie, Maureen. 1978. Literature and Bibliometrics. London: Clive Bingley.

Nielsen, Jakob. 1995. Multimedia and Hypertext: The Internet and Beyond. Boston: AP Professional.

Oren, Tim. 1990. Cognitive Load in Hypermedia: Designing for the Exploratory Learner. In Learning with Interactive Multimedia - Developing and Using Multimedia Tools in Education. Seattle: Microsoft Press.

Potter, W.D. and Trueblood, R.P. 1988. Traditional, semantic, and hypersemantic approaches to data modeling. IEEE Computer  21 (6):53-63.

Price, Derek de Solla. 1976. A General Theory of Bibliometric and Other Cumulative Advantage Processes. Journal of the American Society for Information Science 27 (Sept-Oct):292-306.

Price, Derek J. de Solla. 1963. Little Science, Big Science - and Beyond. New York: Columbia University Press.

Pritchard, Alan. 1969. Statistical Bibliography or Bibliometrics? Journal of Documentation:348-349.

Schiminovich, S. 1971. Automatic Classification and Retrieval of Documents by Means of a Bibliographic Pattern Discovery Algorithm. Information Storage and Retrieval 6:417-435.

Schwabe, J.R., Caloini, A., et al. 1992. Hypertext development using a model-based approach. Software-Practice and Experience 22 (11):937-962.

Seyer, Philip. 1991. Understanding Hypertext Concepts and Applications. New York, NY: Windcrest/McGraw-Hill.

Spainhour, Stephen and Quercia, Valerie. 1996. Webmaster in a Nutshell. Sebastopol: O'Reilly & Associates.

Spiro, Rand J. and Jehng, Jihn-Chang. 1990. Cognitive Flexibility and Hypertext: Theory and Technology for the Nonlinear and Multidimensional Traversal of Complex Subject Matter. In Cognition, Education and Multimedia - Exploring Ideas in High Technology, edited by D. Nix and R. Spiro. Hillsdale, NJ: Lawrence Erlbaum Associates.

Voos, H. 1974. Lotka and Information Science. Journal of the American Society of Information Science 25:270-273.

[1] Murray Gell-Mann shows Zipf's law applying to a number of domains from city populations to wealth in his book The Quark and the Jaguar.

[2] A prime example of this is the set of tools built by Maxum, found at

[3] An exhaustive list of external network data collection programs is available at

[4] Working Draft 960323, March 23, 1996, available at Ironically, the standards setters themselves change their Web structure so often that I cannot expect a URL to lead directly to the draft.

[5] I suspect that some of the Zipfian distribution in Downie's results is due to multiple links among Web documents.