International Journal of Scientometrics, Informetrics and Bibliometrics
Judit Bar-Ilan
School of Library, Archive and Information Studies, The Hebrew University of Jerusalem, Jerusalem 91904, Israel
judit@cc.huji.ac.il
Abstract: This paper examines the performance of search engines over time. The performance is not as expected: search engines lose information. Relevant URLs that were retrieved at a given time by a certain search engine were not retrieved by the same search engine at a later time, although they continued to exist and to be relevant. A closer examination of these URLs revealed that not only were URLs dropped, but content was also lost for a large portion of them: no other URL retrieved by the search engine contained the same information. As far as we know, this aspect of the performance of search engines has not been thoroughly studied before. The problem is investigated through a case study, using the search phrase "informetrics OR informetric". The searches were carried out at one-month intervals over a five-month period between January and June 1998. An additional search round and comparison were carried out in June 1999. The six largest search engines at the time were examined.
Keywords: search engines, Web, performance, stability, case study
1. Introduction
The World Wide Web is the fastest growing information medium today. The large, general search engines are the main tools of resource discovery. Directory services, subject-specific search engines, lists and "hearsay" (learning URLs from other people or from other types of media) cover only a tiny part of the Web. It has been shown before that the general search engines are not perfect. Their coverage is low (Bharat & Broder, 1998-2; Lawrence & Giles, 1998; Lawrence & Giles, 1999). The most recent study shows that the combined effort of the largest search engines results in only 16% coverage of the indexable Web. The overlap between the search engines' indexes is surprisingly small. Another measure which has been studied quite extensively is the precision of the search engines (see for example: Ding & Marchionini, 1996; Chu & Rosenthal, 1996; Tomaiulo & Packer, 1996; or Dong & Su, 1997, for a review article). In these studies only the first ten or twenty hits for each search engine were examined, because users rarely go beyond the first twenty results. Since precision is a subjective measure, different conclusions were reached. A case study examining the whole set of results (6681 URLs retrieved by six search engines), using a technical definition of precision (checking whether the search terms appear in the document), showed that the precision of the search engines was quite satisfactory (Bar-Ilan, 1998).
One of the reasons for the retrieval of imprecise documents and for the low coverage of the search engines is the dynamic nature of the Web itself. Documents disappear or undergo changes, and thus the index of the search engine no longer reflects the contents of the document. New relevant Web pages are constantly added, and "wait" for quite a while to be discovered by the search engines. We set out to investigate the changes that occur to the Web literature containing the search terms informetrics or informetric. The searches were carried out at one-month intervals over a five-month period between January and June 1998. The query "informetrics OR informetric" was presented to six search engines (in alphabetical order): AltaVista, Excite, Hotbot, Infoseek, Lycos and Northern Light. All the documents the search engines pointed to were retrieved and saved locally each time a search was performed. The documents were carefully examined, and the changes that occurred to the URLs appearing in more than one search round were characterized (Bar-Ilan & Peritz, 1999). In June 1999 an additional search round was carried out.
During the above-mentioned study, we realized, to our great surprise, that search engine results are not stable: URLs that a search engine retrieved in a given round were not retrieved by it in a following round, even though other search engines retrieved them, and the contents of these pages continued to be relevant to our search. When examining the extent of the above phenomenon, we realized that we are not dealing with just a few pages (as AltaVista suggests in its helpsheets (AltaVista, 1998)): Excite dropped 72.3% (!) of the URLs it located at one time, which continued to exist and to be of relevance, Lycos 34.5%, AltaVista 29.3%, Hotbot 25.9%, Infoseek 22.7% and Northern Light 5.5% of the URLs. AltaVista and Hotbot (Hotbot, 1998) are the only search engines that hint in their helpsheets that some URLs may disappear from their databases. Excite does not even mention the problem.
A possible explanation for the disappearing URLs could be the fact that the physical document may have several different addresses (e.g. searchenginewatch.com and www.searchenginewatch.com) (Notess, 1997); the search engines are aware of this, and display in the result set only one of the URLs, each time a different one. Since we had no way of deciding whether this was the case, we submitted our "forgotten" URLs to an extremely strict test: for each search engine, we compared the content of each "forgotten" URL to the content of all the other URLs that were found by the search engine in the round the URL was lost. The set of URLs for which no match was found is the set where real information loss occurred: the engine found no URL that contained exactly the same information as the "forgotten" URL. Even under this lenient definition (we are being lenient with the search engines) the results are surprising: 57.9% of the covered URLs for Excite, 23.8% for AltaVista, 21.8% for Hotbot, 20.2% for Lycos, 12.1% for Infoseek and only 1.6% for Northern Light belong to these sets.
As far as we know, this problem has not been formulated or studied before in this setting. Fluctuations and changes in the number of search results for given queries over time have been investigated before (e.g. Peterson, 1997; Notess, 1999; Aguillo, 1999; Rousseau, 1999). These studies, however, considered only the number of search results and not the results themselves, as we did in the current study. This difference in methodology turns out to be rather crucial for studying search engine stability. The search engine Excite, for example, reported more or less the same number of search results in each round; however, it retrieved quite a different set of URLs each time the searches were carried out. Users expect search engines to be reliable: a URL that was retrieved by a search engine should also be retrievable in the future, as long as it continues to contain relevant information on the search topic. The purpose of this paper is to alert users of the Web to the problems that arise because of the inconsistent behavior of the search engines over time. We chose to make our point by carrying out a case study, which enabled us to carefully examine each retrieved document. The results of this systematic analysis are presented in this paper.
2. Methodology
The search query used was "informetrics OR informetric". The searches were carried out six times during a five-month period between January 3rd 1998 and June 7th 1998. A search was carried out on the first Sunday of each month (January, February, March, April, May and June). On each of these dates (called search rounds) we submitted the query to six of the major search engines (Sullivan-1) on the Web (in alphabetical order):
- AltaVista advanced search (http://www.altavista.com/cgi-bin/query?pg=aq)
- Excite (http://www.excite.com)
- Hotbot (http://www.hotbot.com)
- Infoseek (http://www.infoseek.com)
- Lycos (http://www.lycos.com/index.html)
- Northern Light (http://www.northernlight.com)
An additional round, called the comparison round, took place a year later, on June 20th, 1999. At that time all the URLs identified in the initial rounds were also revisited. First, the whole result list of each engine was saved. Next, the URLs and the titles were filtered out from the pages returned by the search engines using a Visual Basic program. The output of this process was a table with columns for the URL and the title. These tables were loaded into Microsoft Excel. We ran a Visual Basic module in Excel in order to create a list of unique URLs returned by the search engines in the given search round (the comparison of the URLs was case sensitive). In the last phase of each search round, each of these links was retrieved and the documents were saved on our local hard disk. This phase was carried out by "brute force", i.e. by saving documents in parallel on several computers. For each search round, the whole data collection process took about ten hours. The analysis of the results was carried out by constructing frequency and cross tables, by utilizing the filtering tool of Microsoft Excel and by running Visual Basic for Applications modules in Microsoft Excel. The retrieved documents were manually examined for relevance. The comparisons between different files were carried out by a program written in Visual Basic.
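The deduplication step described above is simple enough to restate in code. The sketch below is a modern Python rendering of the merge that the study performed with Visual Basic and Excel; the file names and the tab-separated layout of the filtered tables are assumptions of the sketch, not a description of the original programs.

    # A minimal sketch of the per-round merge: read the (URL, title) tables
    # filtered from each engine's result pages and keep one row per unique
    # URL. The comparison is case sensitive, as in the study.
    import csv

    ENGINES = ["altavista", "excite", "hotbot", "infoseek", "lycos", "northernlight"]

    def unique_urls_for_round(round_no):
        """Return a case-sensitive URL -> title map for one search round."""
        unique = {}
        for engine in ENGINES:
            # Hypothetical file name: one tab-separated table per engine per round.
            with open(f"{engine}_round{round_no}.tsv", newline="", encoding="utf-8") as f:
                for row in csv.reader(f, delimiter="\t"):
                    url, title = row[0], row[1]
                    unique.setdefault(url, title)  # each URL counted only once
        return unique

    if __name__ == "__main__":
        urls = unique_urls_for_round(1)
        print(len(urls), "unique URLs in round 1")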
3. Results
3.1 The Collected Data and its Precision
First the summarized results of the data collection are presented: the number of URLs per search round (called general URLs), the number of technically precise URLs per search round and the percentage of the precise URLs in a search round out of the general URLs in that round. In a search round, a document at a given URL is considered technically precise if one of the search terms ("informetrics" or "informetric") appears in it. The set of non-precise documents includes those that did not include the search terms, those that were not found (error 404) and those that could not be retrieved due to communication problems (server down, etc.). The last row of Table 1 displays the combined results for the six search rounds, and here a URL is counted as technically precise if it was categorized as such in all of the search rounds in which it appeared.
The overall technical precision is much lower than the precision in any single search round, because of the requirement that the URL must be technically precise in all of the search rounds in which it was located. The technical precision of the individual searches was very high, despite the common belief that search engines introduce a lot of noise. Our definition of technical precision, which is not based on a subjective measure of relevance, is probably one of the reasons for the high numbers. The search engines carry out free-text searches, and therefore cannot and should not be expected to guess the context in which the searcher is interested. Future developments in artificial intelligence may enable the engines to make better judgements, but until then, in our opinion, precision should only be measured by the appearance of the search terms in the document. Technical precision is objective, and the measurements can be carried out automatically, without the involvement of experts, whose judgement is subjective.
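Since technical precision is defined purely by the occurrence of the search terms, the test reduces to a string check. A sketch follows (how the saved documents are read in is left out; treating unretrievable documents as non-precise mirrors the definition above):

    def is_technically_precise(text_or_none):
        """A document is technically precise if a search term appears in it.

        Documents that could not be retrieved (error 404, server down) are
        counted as non-precise, as in the definition above. Since
        "informetric" is a prefix of "informetrics", one substring test
        covers both terms of the query.
        """
        if text_or_none is None:  # not found or communication problem
            return False
        return "informetric" in text_or_none.lower()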
3.2 Basic Performance of the Search Engines in the Search Rounds
Next we examined how the different search engines performed in the search rounds. The number of URLs and the number of technically precise URLs per search engine are displayed in Table 2. The percentages are out of the total number of different URLs and the total number of different technically precise URLs located in the given search round by all six search engines, as shown in Table 1. These percentages indicate the portion of the URLs (general and technically precise) that each search engine covers in each search round out of the total of URLs (general and technically precise) collected in the round. The last row presents the number of URLs (general and technically precise) each search engine located over the whole search period. The values in this row are not the sums of the columns above them, since each URL is counted only once, while it may have been located in several search rounds. In the last row the percentages are out of the total and total technically precise respectively. From this point on, we shall use precise as a shorthand for technically precise.
* % out of total in round; + % out of total precise in round
Note the extremely high precision of Excite in the fourth round: 168 out of 170 documents were precise (98.8%!). Excite's overall precision is also the highest, 535 URLs out of 590 (90.7%), while Northern Light's overall precision is the lowest (80.9%).
3.3 Performance in the Average Round versus the Overall Performance
It is expected that the number of URLs located by a search engine over the whole period is greater than the number of URLs located in any single round. The Web is a dynamic medium: new pages appear, old ones are totally removed, while others change their content and cease to be of relevance to the given query. The list of URLs for the whole period consists of all the URLs that were technically precise each time the URL was retrieved. This list contains URLs that appeared, for example, in the first and second rounds only (an example of a URL that ceased to exist or to be of relevance), and also URLs that appeared only once, in the last round (an example of a "new" URL). In spite of our reasonable expectations, for Northern Light there is only a very slight difference between the number of URLs identified in the average round and the number of URLs located overall. It seems that its database (at least for pages in which the terms "informetrics" or "informetric" appeared) is rather static, even though the graph appearing on the Search Engine Watch site on search engine sizes (Sullivan-2) shows that during the search period the size of Northern Light's index grew from 55 million URLs to 65 million URLs. An earlier version of Sullivan's page (accessed in August, 1998) stated that between November 1997 and August 1998 Northern Light stopped crawling, a claim that is more consistent with our findings.
3.4 Relative Coverage
The changes in the Web are expected to be reflected more or less similarly by each of the search engines. However, it can be seen clearly from Figure 1 that this is not the case. Let us consider the search results from a different point of view. We computed the average relative coverage of each search engine. The relative coverage of a search engine per search round is the number of precise URLs the given search engine retrieved divided by the total number of precise URLs retrieved in the given search round. Since the values for the different search rounds were very similar for each search engine, we decided to display the average relative coverage, which is the average of the relative coverages over the search rounds. These averages can be compared to the total relative coverage of each search engine (the number of precise URLs located by the search engine in all the rounds divided by the total number of precise URLs located by all the search engines during all the rounds). These results are displayed in Figure 2. Note that the average relative coverage measures the relative coverage of a search engine at a given point in time (a "snapshot"), while the total relative coverage measures the coverage of the search engine during a period of time. The results displayed in Figure 2 show that the search engines' performance over time differs considerably from their performance at a given point of time. Our results on the search engines' relative coverage at a given point of time compare well with previous results (Bharat & Broder, 1998-1; Lawrence & Giles, 1998), as can be seen from Table 3. The cited previous results are based on searches that were carried out in November 1997. Both papers gave estimates of the size of the Web and of the overlap between the indexes of the search engines; as a by-product they also computed values that correspond to our notion of relative coverage. Note that Bharat and Broder studied the performance of four engines only. These previous works measured the relative coverage at a given point in time, based on a large number of queries.
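The two coverage measures can be stated compactly in code. In the sketch below, rounds is assumed to be a list with one entry per search round, each entry mapping an engine name to its set of precise URLs for that round; this data layout is an assumption of the sketch.

    def relative_coverages(rounds, engine):
        """Per-round relative coverage: the engine's precise URLs divided by
        all precise URLs located in that round by any engine."""
        out = []
        for r in rounds:
            all_precise = set().union(*r.values())
            out.append(len(r[engine]) / len(all_precise))
        return out

    def average_relative_coverage(rounds, engine):
        """The 'snapshot' measure: mean of the per-round relative coverages."""
        covs = relative_coverages(rounds, engine)
        return sum(covs) / len(covs)

    def total_relative_coverage(rounds, engine):
        """The over-time measure: the engine's precise URLs over all rounds
        divided by all precise URLs located by any engine in any round."""
        engine_all = set().union(*(r[engine] for r in rounds))
        total_all = set().union(*(set().union(*r.values()) for r in rounds))
        return len(engine_all) / len(total_all)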
When comparing the total relative coverage with the average relative coverage, the case of Excite is the most striking: in each separate search round it found nearly the same number of precise URLs (between 148 and 168) and covered between 17.4% and 21.8% of the precise URLs located in a given round; however, when combining the results of the six searches, it succeeded in covering 50.5% (!) of the total number of precise URLs.
3.5 Search Engines "Forget" and "Recover"
After examining the list of URLs, we found that search engines not only discover new URLs, they also forget URLs they knew before. We define a forgotten URL as a precise URL that was located by a given search engine in a certain round (called round_located) and still exists as a precise document on the Web in one of the following rounds (called round_forgotten), but is not retrieved any more by the search engine under discussion. We know the page still exists and is relevant, since it was retrieved by one of the other engines in round_forgotten and was inspected by us. When examining the extent of forgetfulness, only the precise URLs are taken into account, since these are the URLs the search engine is supposed to retain from one search round to the next. We hardly found any mention of the problem in the literature; only AltaVista (1998) and Hotbot (1998) mention the existence of the problem in their help pages. The only reference discussing the problem of disappearing pages is Sullivan (Sullivan-3, 1998), in the "subscribers only" area of his searchenginewatch.com site (the subscription is not free). According to him this has been a problem since 1997; however, these pages "usually reappear within Excite about three weeks after they disappeared". This is also a problem for Hotbot, and in this case too, disappeared pages "should automatically reappear during the next refresh" (within a month). The suggested explanations for these problems with Hotbot are server problems and network delays at the time of crawling; no explanation is given for Excite. Thus we decided to examine whether forgotten pages reappear at a later round in the list of pages retrieved by the search engine. We considered precise URLs only for the first four rounds, in order to "give a chance" to the search engine to drop the URL (in round five at the latest) and to rediscover it (in round six at the latest). The percentages are calculated out of the total number of precise URLs each search engine retrieved in the first four rounds.
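The definition of a forgotten URL translates directly into set operations. The sketch below assumes the same data layout as the earlier sketches (one engine-to-set-of-precise-URLs map per round); a URL counts as still existing in a later round exactly when some engine retrieved it as a precise document then.

    def forgotten_urls(precise, engine):
        """URLs the engine located in some round (round_located) that still
        exist as precise documents in a later round (round_forgotten),
        i.e. some engine retrieved them then, but that this engine no
        longer returns."""
        forgotten = set()
        for located in range(len(precise)):
            for later in range(located + 1, len(precise)):
                alive = set().union(*precise[later].values())
                # Located before, still alive, but absent from this engine's list.
                forgotten |= (precise[located][engine] & alive) - precise[later][engine]
        return forgotten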
These results are quite surprising. All the search engines dropped URLs from their databases. Northern Light and Excite are the two extremes: Northern Light hardly forgot any URLs, while Excite forgot 70.9% (!) of the URLs it once located. In each of the search rounds, Excite presented us with quite a different picture of what the Web has on our query, even though it retrieved nearly the same number of URLs in each search round. Two additional engines, AltaVista and Hotbot, had not recovered a substantial portion of the dropped URLs during the search period.
3.6 Additional Measures
In the previous section we studied the forgotten and recovered URLs. In the course of our study, we applied two additional measures in order to gain a better understanding of the phenomenon. As mentioned in the introduction, the same server may have several aliases, and if a search engine knows this (through the DNS, for example), it may retrieve the same URL in both round_located and round_forgotten, but display the results under different names in the different rounds. Since we had no way to decide whether two URLs point to the same physical address, we devised the following test (which covers the above case): for each search engine, we compared the content of each forgotten URL in round_forgotten to the contents of all the other URLs the search engine retrieved in that round. The URLs for which no match was found are called totally forgotten. This definition is rather wide and lenient with the search engines, since it covers much more than different names of a server (duplicates do exist on the Web on purpose). On the other hand, the percentage of the totally forgotten URLs indicates the information loss that was caused by the search engine during the five-month period under consideration. The pages in this set are pages that exist on the Web; the search engine located them once, and sometime later these pages were dropped from the list of retrieved results by the search engine, for some unexplained reason. Not only was the URL itself not retrieved by the engine, it retrieved no other URL that contained exactly the same information. We are not talking about just a few pages, but about a significant portion of the URLs for all the search engines, except for Infoseek and Northern Light. The notion of being totally forgotten examines the content of the URLs. Since there are intentional duplicates on the Web, we came up with a third definition that takes into account the existence of such duplicates. The definition of a URL being lost is based on the idea that if the search engine itself had not "thought", in round_located, that two different URLs with exactly the same content were duplicates, then, if they still exist on the Web and are relevant, they should both appear on its list of URLs in round_forgotten. Under this definition, we let the search engine decide what counts as a duplicate. More precisely, a URL is lost if it was forgotten, and either no other retrieved URL in round_forgotten contains exactly the same information, or, if one does, it was also retrieved in round_located by the engine under discussion. Note that under any of the above definitions the numbers presented may well be underestimates, since the search engines may have collectively lost some additional URLs, but these URLs were not checked. The percentages are computed out of the total number of precise URLs retrieved by the engine in the first five rounds (since URLs discovered in the last round only cannot be forgotten). Note that the number of URLs per search engine differs from the number in Table 5, where we considered the first four rounds only.
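The two refinements compare page contents rather than URLs. The sketch below assumes a content(url, round_no) lookup over the locally saved documents and a per-round set engine_results[round_no] of URLs retrieved by the engine under discussion; both are assumptions of the sketch, and the reading that all identically-worded twins must have appeared in round_located is one interpretation of the definition above.

    def classify_forgotten(url, round_located, round_forgotten, engine_results, content):
        """Classify a forgotten URL as totally forgotten and/or lost."""
        text = content(url, round_forgotten)
        # "Twins": other URLs the engine retrieved in round_forgotten whose
        # saved content is exactly the same as the forgotten URL's.
        twins = {u for u in engine_results[round_forgotten]
                 if u != url and content(u, round_forgotten) == text}
        totally_forgotten = not twins
        # Lost: no twin exists (all() is vacuously true), or every twin was
        # also retrieved in round_located, meaning the engine itself never
        # treated the pair as duplicates.
        lost = all(u in engine_results[round_located] for u in twins)
        return totally_forgotten, lost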
AltaVista, Hotbot and Lycos lost 21.9% of their information on average. Northern Light and Infoseek did significantly better under the second definition (totally forgotten) than under the first. The results under the third definition are similar to those under the first, but are smaller by 5% on average.
3.7 Self Overlap
To further emphasize our point, for each search engine and each precise URL discovered by it, we computed the number of rounds in which the search engine located the URL. The number of search rounds in which a URL was located (irrespective of the search engine) was also computed. The results appear in Table 7. The percentages are out of the total number of precise URLs found by each search engine (the total is in parentheses under the name of the engine). The "Total" column displays the combined efforts of all six search engines. The results of Table 7 indicate the self-overlap of the search engines. Again, a striking difference can be seen between Excite and the combined effort of all the engines. The combined effort of all six search engines produced 484 documents (45.7% of the total) that appeared in all six search rounds, while only 0.6% of the documents retrieved by Excite were located all six times. Northern Light represents the other extreme: its database was almost stable during the search period, hardly reflecting the changes that occur on the Web over time.
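Self-overlap reduces to counting, for each precise URL, the number of rounds in which a given engine returned it. A sketch over the same assumed per-round data layout:

    from collections import Counter

    def self_overlap(precise, engine):
        """Map k -> how many URLs the engine located in exactly k rounds."""
        per_url = Counter()
        for r in precise:  # one engine -> set-of-URLs map per round
            per_url.update(r[engine])
        return Counter(per_url.values())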
3.8 The Comparison Round
A final search round took place on June 20th, 1999, a year after the initial experiment. This time, in addition to the search and data collection procedure carried out in all the search rounds, all the URLs located in the initial rounds were revisited. The small number of results for Hotbot was probably caused by changes in the display of the results, which were introduced around the time the comparison round took place and were not explained in Hotbot's documentation. Thus the results for Hotbot are not analyzed. The results clearly show that the trend of "forgetfulness" continues, and the results of the search engines are not as stable as could be expected. The last column of Table 8 represents the percentage of technically precise and existing URLs that were forgotten by the engines between the initial rounds and the comparison round. The number of technically precise URLs that were located in the initial rounds and are still existing and precise in the comparison round is the sum of the respective numbers in the fourth and fifth columns of Table 8. This is the set of URLs the search engine should retain in a best-case scenario; thus the last column in Table 8 reflects the stability or instability of the engines over time.
4. Discussion
Why do the search engines drop perfectly good URLs? We suggest the following reasons. Possibly they aim for a more or less constant size, and thus in the process of adding new URLs to their databases they delete existing ones. Or perhaps the search results are not based on the whole database. Possibly, as suggested by Lawrence and Giles (1999), it may not be economical for the engines to improve coverage or timeliness, and they may be limited by the non-scalability of their technology or by network bandwidth. It may be more profitable for the engines to invest their resources in more popular areas, like e-mail. In either case, why do they not explain this behavior in their helpsheets? The search engines almost totally ignore this problem in their help pages; as of September 1998, only two of the search engines, AltaVista and Hotbot, address this point in their FAQs: "Why has my URL disappeared from the index? Periodically, we rebuild an index from scratch and a few pages may be lost during this process." (AltaVista). "Why is my site no longer in your index? Your site may have dropped out of the system for a variety of technical reasons." (Hotbot). These two answers indicate that large numbers of users complain about URLs disappearing from the indexes of the search engines. 25.9% of the URLs over a five-month period does not seem like "a few pages" lost.
5. Conclusions
This paper shows that the search engines have genuine problems with their performance over time. From the users' point of view, the search engines are not reliable: URLs that were discovered at one time by a search engine do not appear in the results of the same search engine at a later time, even though the URLs continue to exist and to be relevant to the search topic. Our case study shows that this happens to a significant percentage of the URLs (except for Northern Light). Furthermore, even if we examine the contents of the pages (not the URLs themselves), the phenomenon of losing content is significant and cannot be ignored. Users are not aware of this behavior and are not warned by the search engines. This behavior, however, is not entirely negative: it allows search engines to add newly discovered URLs without increasing the total size of their indexes. We suspect that this is a possible explanation for this course of action. We have seen that in a single round Excite retrieved about 165 URLs, while in five months it retrieved altogether 535 different, precise URLs, thus greatly increasing its recall over time. Old, however, is not useless: laws of physics discovered, theorems in mathematics proven, and works of art created hundreds of years ago are still valid, have their value and form the basis of recent works. If Bach and Boyzone can coexist in the world of music, there is no reason to discard older, existing and relevant pages in order to be able to add newer ones. Koehler (1999) speculates whether the Internet is the "world brain" as discussed by H. G. Wells in the late 1930s. Koehler examines the permanence behavior of Web sites and Web pages, and not of search engines, as we did in this paper. His conclusion, however, can be applied to the search engines: "World brain has short memory. And when it does remember, it changes its mind a lot." At the time of the study, Excite exhibited the greatest fluctuations, while Northern Light was the most stable, and possibly the least fresh. The search engines constantly change the technologies they utilize, thus we cannot recommend one engine over another. However, the basic reasons for the instability of the engines (e.g. commercial interests, constant size of the database) continue to exist, and thus we expect that the problem of the instability of search results will persist. Naturally, a single case study is not enough, although we do not expect any bias, in any direction, toward our query. We presented the results to make our point and to formulate the problem. We are currently carrying out additional case studies, and plan a general procedure to further investigate this phenomenon.
References:
Aguillo, I. F. (1999). Personal Communication.
AltaVista (1998). AltaVista Feedback. Online. Available: http://www.altavista.digital.com/av/content/questions.htm. Date of access: September, 1998.
Bar-Ilan, J. (1998). On the Overlap, the Precision and Estimated Recall of Search Engines – A Case Study of the Query "Erdos". Scientometrics, 42(2): 207-228.
Bar-Ilan, J. and Peritz, B. C. (1999). The Life Span of a Specific Topic on the Web; the Case of "Informetrics": a Quantitative Analysis. In Scientometrics, 46(3), to appear.
Bharat, K. and Broder, A. (1998-1). A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines. In Proceedings of the 7th International World Wide Web Conference, April 1998, Computer Networks and ISDN Systems, 30: 379-388.
Bharat, K. and Broder, A. (1998-2). Measuring the Web. Online. Available: http://www.research.digital.com/SRC/whatsnew/sem.html. Date of access: August, 1998.
Chu, H. & Rosenthal, M. (1996). Search Engines for the World Wide Web: A Comparative Study and Evaluation Methodology. ASIS96. Online. Available: http://www.asis.org/annual-96/Electronic-Proceedings/chu.htm. Date of access: December, 1997.
Ding, W. & Marchionini, G. (1996). A Comparative Study of Web Search Service Performance. ASIS96. Online. Available: http://www.glue.umd.edu/~weid/asis/fulltext.htm. Date of access: December, 1997.
Dong, X. & Su, L. T. (1997). Search Engines on the World Wide Web and Information Retrieval from the Internet: A Review and Evaluation. Online & CDROM Review, 21(2), 67-81.
Hotbot (1998). HotBot Help | Common Questions. Online. Available: http://www.hotbot.com/help/questions/question3.asp. Date of access: September, 1998.
Koehler, W. (1999). An Analysis of Web Page and Web Site Constancy and Permanence. In JASIS 50(2): 162-180.
Lawrence, S. and Giles, C. L. (1998). Searching the World Wide Web. In Science 280: 98-100.
Lawrence, S. and Giles, C. L. (1999). Accessibility and Distribution of Information on the Web. In Nature 400: 107-110.
Notess, G. (1997). Measuring the Size of Internet Databases. Database 20(5). Also available: http://www.onlineinc.com/database/OctDB97/net10.html.
Notess, G. (1999). Search Engine Showdown. Online. Available: http://www.notess.com. Date of access: August, 1999.
Peterson, R. E. (1997). Eight Internet Search Engines Compared. First Monday 2(2). Online. Available: http://www.firstmonday.dk/issues/issue2_2/peterson/index.html. Date of access: August, 1999.
Rousseau, R. (1999). Time Evolution of the Number of Hits in Keyword Searches on the Internet. Post-Conference Seminar – Cybermetrics'99 at the Seventh International Conference on Scientometrics and Informetrics, July 9, 1999, Colima, Mexico.
Sullivan, D. (Sullivan-1). Search Engine Watch. Online. Available: http://searchenginewatch.com. Date of access: August, 1998.
Sullivan, D. (Sullivan-2). Search Engine Sizes. Online. Available: http://searchenginewatch.com/reports/sizes.html. Date of access: August, 1999.
Sullivan, D. (Sullivan-3, 1998). Search Engine Watch: Subscriber-Only Area. Online. Available: http://www.searchenginewatch.com/subscribers/. Date of access: November, 1998.
Tomaiulo, N. G. & Packer, J. G. (1996). An Analysis of Internet Search Engines: Assessment of Over 200 Search Queries. Computers in Libraries, 16(6), 58-62.