AT&T's 800 Directory on the Internet (12/16/94)

Brian W. Kernighan

[This is an unpublished draft of an unpublished paper written long, long ago in a galaxy far, far away. It's stunning how much has changed in less than 15 years.]

800 Numbers

AT&T first provided 800 number service in 1967; since then, 800 services have become a major source of revenue for AT&T, and, increasingly, for other long distance carriers as well.

AT&T publishes two directories of 800 numbers in the usual "yellow pages" format, that is, entries listed alphabetically within categories like Flag Poles or Drug Detection and Testing. The larger Business Guide, about 1200 pages, is aimed at business buyers; the smaller Shoppers Guide, 500 pages, targets consumers. These books are each updated and printed afresh twice a year. Ostensibly they are sold, but more often they are given away. In any case, the primary business purpose is to generate more call volume for AT&T's 800 services; in addition, some revenue is derived from bold-face listings and display advertisements. CD-ROM versions of the directory are sold through third parties for $25-40.

AT&T also provides directory assistance via 1-800-555-1212. This information is updated more frequently.

On the Internet

In July, 1994, we began discussions with the 800 Directory Services group of BCS ("Business Communications Services") about putting AT&T's 800 number directory onto the Internet, accessible through Mosaic. The intent was for BCS to gain some experience with providing a real Internet service (and with the Internet itself), perhaps to generate extra calls to 800 numbers, and perhaps ultimately to provide revenue through enhanced services like display advertisements. In addition, we hoped for some modest public relations benefit from providing something of real value, not merely a teaser like so many Internet offerings at the time.

There were a number of concerns that had to be resolved by the 800 Directory group. The primary one, of course, was whether it was wise to make the information publicly available. Once something is on the net, it can be copied by anyone (and will be); copyright notices provide the appearance of legal protection, but no practical barrier to theft. It appeared here, however, that the gains would outweigh any risks. In particular, since the information changes rapidly (upwards of five percent of the listings change every month), any copies would be quickly out of date.

Other concerns included the quality of the information; it is important to maintain AT&T's reputation for high-quality products. There were also some interesting political questions. For example, should the directory include MCI's 800 numbers, if MCI desires?

In any case, we obtained a version of the business directory in August, and put it up under Mosaic in a few hours, though visible only from our local machines. After some refinements and much internal deliberation, managerial reluctance and inertia succumbed to a rumor that MCI was about to offer some Internet service, and the directory went public on October 19, 1994.

With Mosaic

The business and consumer guides are combined into a single directory, and all information about sizes, etc., refers to this combined directory. There are about 157,000 records. Each record contains an 800 number, customer name, city, state, and 6-character category code; the raw input file is 22 Mbytes uncompressed.

We provided two access paths. One was the usual category axis. There are 3649 categories, ranging from Abdominal Supports to Zoos. The largest category, Florists, contains about 3500 entries; there are 1564 categories with fewer than six entries, and 619 categories with only a single entry. The initial Mosaic screen provides 26 single-letter buttons that lead directly to a display of all the categories whose names begin with that letter; for example, selecting D causes the display of the 103 categories from Dairy Consultants to Dyes and Dyestuffs. There are only three categories under Z; the largest groups are under B and C, which each have 305 categories.
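
The letter-button grouping is easy to reproduce. The following is a minimal sketch in Python, not the code that built the service; the function name categories_by_letter and the short sample list are assumptions made for illustration.

```python
from collections import defaultdict

def categories_by_letter(category_names):
    """Group category names under their initial letter, one group per button."""
    index = defaultdict(list)
    for name in category_names:
        name = name.strip()
        if name:  # skip blank names rather than fail on name[0]
            index[name[0].upper()].append(name)
    return {letter: sorted(index[letter]) for letter in sorted(index)}

# Category names taken from the text; a real run would read all 3649.
sample = ["Zoos", "Dyes and Dyestuffs", "Abdominal Supports",
          "Florists", "Dairy Consultants"]
index = categories_by_letter(sample)
for letter, names in index.items():
    print(letter, len(names), names[0], "...", names[-1])
```

Run over the full category list, the lengths of the groups would give the per-letter counts quoted above.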

The other search axis is by name ("white pages"). Since there are many more names than categories, it is necessary to provide a second level of branching, so that downloading times are not too long. (For instance, there are more than 16,000 listings whose names begin with S; this corresponds to about 1.25 Mbytes.) Selecting a letter or number loads a second-level display broken down into smaller groupings, rather like the headwords in a dictionary. For example, selecting names that begin with G brings up 30 items from "G&AC ... GNOP" to "Gunn Ro ... GZ Pain," from which one can make a further selection.
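
One way to build such a second level is sketched below in Python. The text does not say how group boundaries were chosen, so the fixed group size, the label length, and the function name headword_groups are all assumptions; apart from their first few characters, the sample names are invented.

```python
def headword_groups(names, group_size, label_len=6):
    """Split sorted names into fixed-size groups labeled 'first ... last'."""
    ordered = sorted(names, key=str.casefold)
    groups = []
    for start in range(0, len(ordered), group_size):
        chunk = ordered[start:start + group_size]
        # Label each group with short prefixes of its first and last names,
        # much like the headwords at the top of a dictionary page.
        label = f"{chunk[0][:label_len]} ... {chunk[-1][:label_len]}"
        groups.append((label, chunk))
    return groups

# Hypothetical listings beginning with G.
names = ["Gunn Roofing", "G&AC Supply", "GZ Pain Clinic", "Gateway", "GNOP Inc"]
groups = headword_groups(names, group_size=3)
for label, chunk in groups:
    print(label, len(chunk))
```

A fixed group size keeps every second-level page to a bounded download, which is the purpose of the extra level of branching.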

Note that searching by name is not provided by the printed books, which contain only "yellow pages." The online pages also provide a link from each named entry back to the category to which it belongs. For example, there is a link from "AT&T High Seas Radiotelephone Service" back to "Radiotelephone Communications -- Common Carriers," which contains two other listings. People have commented favorably on this novel feature.

Several mechanisms to permit searching for records by any text are available; when one is chosen and installed, it will provide potentially the most useful of all access paths to the data.

Data Validation

The directory information, in common with any large database, contained a number of errors. In fact, the very first entry (and the only one) in the first category, "Abdominal Supports," is Peel Builders, Inc., in Georgia; unless there is more to this company than meets the eye, this is a data error.

With the data in machine-readable form on a Unix or Plan 9 system, it is possible to apply a variety of tools to detecting and perhaps correcting data errors automatically, or at least providing another look at the data.

For example, a spelling check of city names indicated a large number of problems, such as nine different spellings of Cincinnati in the original data; Coeur D'Alene was another challenge. A comparison of city names in the directory with city names in a directory of US place names indicated that a fair number of the city names did not occur in the placename data, and thus were suspect. Fortunately, most of these have been fixed in the latest version of the data.
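
The comparison against a placename list can be sketched as follows in Python. This is an illustration only, not the tools actually used; the record layout (dictionaries with "city" and "state" keys), the function name suspect_cities, and the tiny sample data are assumptions.

```python
from collections import Counter

def suspect_cities(records, known_places):
    """Count (city, state) pairs that do not appear in the placename list."""
    known = {(city.casefold(), state.upper()) for city, state in known_places}
    suspects = Counter()
    for record in records:
        city = record["city"].strip()
        state = record["state"].strip()
        # Compare case-insensitively so "Coeur D'Alene" matches "Coeur d'Alene".
        if (city.casefold(), state.upper()) not in known:
            suspects[(city, state)] += 1
    return suspects

# Hypothetical placename list and directory records.
known_places = [("Cincinnati", "OH"), ("Coeur d'Alene", "ID")]
records = [
    {"city": "Cincinnati", "state": "OH"},
    {"city": "Cincinatti", "state": "OH"},
    {"city": "Coeur D'Alene", "state": "ID"},
]
suspects = suspect_cities(records, known_places)
for (city, state), count in suspects.most_common():
    print(city, state, count)
```

A name missing from the list is only suspect, not proven wrong, so the output is best treated as a worklist for a human.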

A spelling check of company names indicated a large number of probable misspellings, particularly of long words within corporate names; Adminstrator or Commerical or Divison are popular variants of the right words. Misspellings are largely typographical errors, though some systematic errors are also evident, such as Enviromental, which occurs at least 15 times.

It is unfortunately not possible to correct "spelling mistakes" automatically; there are far too many unusual spellings in company names ("Raized Printing" in Portland, Oregon, seems to be the correct spelling of the company name, however erroneous it might look). But a spelling check can certainly flag potential errors for examination by a human.
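
A flag-for-review pass of this kind might look like the sketch below, which uses get_close_matches from Python's standard difflib module to suggest a nearby dictionary word without changing the data. It is not the spelling checker we actually ran; the word list, the similarity cutoff, the minimum word length, and all company names other than Raized Printing are assumptions.

```python
import difflib
import re

def flag_spellings(company_names, dictionary, cutoff=0.8):
    """Return (name, word, suggestion) triples for a human to review."""
    words = sorted({w.casefold() for w in dictionary})
    known = set(words)
    flagged = []
    for name in company_names:
        for word in re.findall(r"[A-Za-z]+", name):
            lower = word.casefold()
            # Short words and dictionary words are not flagged.
            if len(lower) < 5 or lower in known:
                continue
            close = difflib.get_close_matches(lower, words, n=1, cutoff=cutoff)
            if close:
                flagged.append((name, word, close[0]))
    return flagged

# Hypothetical word list and listings.
dictionary = ["environmental", "commercial", "division",
              "administrator", "raised", "printing", "services"]
names = ["Acme Enviromental Services", "Raized Printing",
         "Smith Commerical Divison"]
flagged = flag_spellings(names, dictionary)
for name, word, suggestion in flagged:
    print(f"{name}: {word} -> {suggestion}?")
```

Note that "Raized" is flagged along with the genuine errors, which is exactly why the result should be a list of questions rather than a set of automatic corrections.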

There are also some problems with consistent capitalization schemes. Until recently, the printed directories used only upper case for company names. The current use of mixed case presents some problems of consistency. For example, CompUSA spells its company name that way, or sometimes COMPUSA, and would presumably prefer to see either of these forms, whether printed or on the Internet, instead of Compusa. Radio and television call letters are particularly vulnerable to capitalization problems, for instance, "Katc TV" in Lafayette, LA.

Abbreviations are also a problem. Words like manufacturing and wholesale are abbreviated in multiple ways. Without consistent rules it is difficult to search for an abbreviated name, and sometimes even difficult for a human to comprehend one like "Putnam County Of State Attorney."

The Putnam County (Florida) State Attorney's office might be hard to find for another reason: it appears under "Appliances -- Major -- Wholesale and Manufacturers". As this example suggests, mis-assigned categories present a problem. Categories are encoded as two letters and four digits, as in HO2200 (Hospitals & Clinics) or HO3325 (Hospitals & Clinics -- Utah), where the two letters are the first two letters of the category name. There is only a limited amount of redundancy in this scheme: the numbers are mostly multiples of 25, but any given letter combination might apply to several categories. It is nevertheless still hard to imagine the path by which the LDS Hospital in Salt Lake City found its way into HO2050 (Hosiery).
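
The little redundancy that the code does carry can be checked mechanically. The sketch below, in Python, tests only the two properties stated above (two letters plus four digits, with the letters matching the start of the category name); the function name and the deliberately wrong example code AP1000 are assumptions.

```python
import re

CODE_PATTERN = re.compile(r"[A-Z]{2}\d{4}")

def check_category_code(code, category_name):
    """Return a list of problems found with one code/name pair."""
    if not CODE_PATTERN.fullmatch(code):
        return ["malformed code"]
    # The first two letters of the name, ignoring spaces and punctuation.
    letters = re.sub(r"[^A-Za-z]", "", category_name)[:2].upper()
    if code[:2] != letters:
        return [f"prefix {code[:2]} does not match name prefix {letters}"]
    return []

print(check_category_code("HO2200", "Hospitals & Clinics"))  # no problems
print(check_category_code("H02200", "Hospitals & Clinics"))  # digit zero for letter O
print(check_category_code("AP1000", "Florists"))             # hypothetical mismatch
print(check_category_code("HO2050", "Hosiery"))              # no problems
```

The last call shows the limit of the check: Hosiery and Hospitals share the prefix HO, so a hospital filed under HO2050 passes, and only a look at the listing itself reveals the error.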

No simple scheme is likely to address data errors. Although mechanical aids can certainly help, there is no substitute for careful checking by human readers.

It might also be possible to improve the classification scheme. As mentioned above, there are a large number of categories with few entries. Sometimes these draw distinctions that seem too fine; for example, there are two listings under Scissors & Shears, and three under Shears & Scissors.
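
Categories that differ only in word order can be found by comparing the sets of words in their names. The following Python sketch shows the idea; it was not part of the project, and the function name and the decision to ignore the word "and" are assumptions.

```python
import re
from collections import defaultdict

def duplicate_categories(category_names):
    """Group category names that contain the same words in any order."""
    groups = defaultdict(list)
    for name in category_names:
        # "&" is dropped by the pattern; "and" is dropped explicitly.
        key = frozenset(re.findall(r"[a-z]+", name.casefold())) - {"and"}
        groups[key].append(name)
    return [names for names in groups.values() if len(names) > 1]

sample = ["Scissors & Shears", "Zoos", "Shears & Scissors", "Florists"]
duplicates = duplicate_categories(sample)
print(duplicates)
```

This catches only reorderings; categories that overlap in meaning but not in wording would still need human judgment.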

Given the number of errors in company and city names, it is possible that there are also errors in the telephone numbers themselves, but we have made no attempt to verify them.

The important observation here is that when the directory is on the net it is subject to forms of inspection that would not have been possible before. At best, errors in company names make it difficult to find them in searches; worse, spelling errors and misclassifications leave a bad impression with users and our customers. (One imagines how people at the Century 21 office in Eatonton, GA, will feel about their listing as "Centry 21".) It is in AT&T's interest for many reasons to make the data as clean as possible.

Usage

The directory service is provided on a Plan 9 machine operated at the moment by the Computing Science Research Center. All traffic is monitored and recorded, so we have a reasonably accurate count of how many times the directory has been accessed, and an approximate idea of how it has been used. We do not have any way to determine specifically which numbers have been looked up, nor of course any way to know whether more 800 calls have been made because of its existence.

During the six weeks from October 19 to November 30, there were 32,000 connections to the directory. These came from 20,000 distinct Internet addresses, so the service has been tried from a large number of different sites. Many of the sites that called most frequently are in fact gateways of large corporations, serving as a proxy for an unknown number of users within that corporation. The largest single source of queries is a group of machines at Delphi Internet Services, which provides mail, bulletin board and network access services.
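
Counts like these fall out of a single pass over the connection log. The Python sketch below assumes a log with one line per connection and the client address as the first field; the actual Plan 9 log format is not described here, so that layout, the function name, and the sample addresses are hypothetical.

```python
from collections import Counter

def summarize(log_lines, top=3):
    """Return total connections, distinct addresses, and the busiest sources."""
    per_address = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # ignore blank lines
            per_address[fields[0]] += 1
    total = sum(per_address.values())
    return total, len(per_address), per_address.most_common(top)

# Hypothetical log lines: address, date, requested page.
log = [
    "192.0.2.10 1994-10-19 /800/index.html",
    "192.0.2.10 1994-10-19 /800/cat/D.html",
    "198.51.100.7 1994-10-20 /800/index.html",
    "",
    "192.0.2.10 1994-10-21 /800/name/G.html",
]
total, distinct, busiest = summarize(log)
print(total, distinct, busiest)
```

Because a corporate gateway appears as one address, the distinct-address count is a lower bound on the number of people who tried the service.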

Six of the top nine categories requested are computers and computer services, which is perhaps not surprising; the top category not related to computing is Airline Companies.

As far as we can tell from the logging information, a typical user searches for only a handful of numbers, although at this early stage, it is clear that some users are merely curious browsers. In spite of that, the traffic pattern seems rather stable over this period, about 1200 to 1500 accesses per day on weekdays and 600 to 700 on each weekend day.

There was one truly anomalous search. The Harvest browser project at the University of Colorado retrieved a copy of the entire database on October 27 (which we counted as a single transaction from a single IP address in the numbers above). Although this use violates the letter of the copyright notice on our data, it is an interesting demonstration of the power of the Internet, and also of the speed with which information is gathered and spread further.

Observations

The 800 directory service appears to be a success. People are looking at the directory. Feedback from the net has been positive; a few people have suggested that they would buy 800 service from AT&T so they could be listed in the directory, though we don't know if anyone has actually made such a shift.

The listing has garnered some modest public relations benefit; it is listed as the first of the "Cool Links" on the Yahoo WWW Guide. It seems likely that we have scored a very small victory over MCI by being first with a useful service, though it was a near thing. It does make AT&T look like it's involved in the Internet and it has stimulated a significant amount of internal discussion and plans at least to make plans for planning further services (services, one hopes, that provide measurable revenues). If nothing else, the experience has brought home to any number of people the remarkable rate at which the Internet is growing and changing; we have to learn to act quickly in this area if we are to be in it at all.

The effort has also had some benefits in our local research environment. Tools for looking at large text databases are of interest to a number of us; the 800 directory is a fertile place to experiment with these. Providing the service on Plan 9 has demonstrated the value of that operating system, notably the security features that make it less vulnerable than a typical Unix system might be.

Acknowledgements

The technical work described here was done by Eric Grosse, Lorinda Cherry and Brian Kernighan. We are grateful to many people who have helped in some way to get this project working. Kathy Sullivan, Peter Weinberger, and Ravi Sethi suggested it; James Chow is the product manager on the business side; Dan Mayer hustled things along; Helen Hua provided the raw data in a convenient form; Sean Dorward wrote an HTTP daemon for Plan 9; Dave Presotto provided advice and infrastructure. Finally, Mike Baldwin and Polly Powledge of BCS have taken on the operation and further development of the service. Our apologies to those whom we have inadvertently omitted.