Computer Science

Search Engine Indexing

Search engine indexing is the process of collecting, parsing, and storing data to facilitate fast and accurate information retrieval. It involves creating an index of web pages and their content, which allows search engines to quickly provide relevant results to user queries. Indexing enables search engines to organize and efficiently retrieve the vast amounts of information available on the internet.

Written by Perlego with AI-assistance

8 Key excerpts on "Search Engine Indexing"

  • Process Mining Techniques for Pattern Recognition
    • Vikash Yadav, Anil Kumar Dubey, Harivans Pratap Singh, Gaurav Dubey, Erma Suryani(Authors)
    • 2022(Publication Date)
    • CRC Press
      (Publisher)

    ...To make results more relevant, these algorithms are modified and revised constantly. Information retrieval: Accessing the right and relevant information in less time, that is, fast searching of relevant data, is very important nowadays; the process by which we obtain the desired information or access relevant information is known as information retrieval. There are retrieval programs, also known as information retrieval programs, that form part of accessing information on the web. Getting the right information on a relevant topic from a large database containing all kinds of data, metadata, directories, indexes, etc., is known as retrieving information. Search depends on different criteria, sometimes based on metadata and sometimes on the whole text as well. The elements of a search engine are shown in Figure 3.1. We describe this process as follows. FIGURE 3.1 Elements of a search engine: We can define the elements broadly in three parts. 1.1 SEARCH ENGINE A web search engine is created to find information on the WWW. The search outcomes are normally given out in a line of results, frequently referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some Search Engines (SE) also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, SE also maintain real-time information by running an algorithm on a web crawler. A search engine operates in the following manner: crawl the web, index the web, and provide a search interface (SI). 1.1.1 Crawl the Web The software or program mainly responsible for browsing the web by retrieving documents automatically. A web crawler, alias spider, indexes the content of websites to provide up-to-date data...
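
    The crawl step described in this excerpt can be sketched in a few lines of Python. This is an illustrative toy, not the chapter's own implementation: the seed URL, the requests library, and the regular-expression link extraction are assumptions made purely for the example.

        # Minimal sketch of the "crawl the web" step: download pages and follow
        # the hyperlinks found on them. Illustrative only; a real crawler adds
        # politeness rules, robots.txt handling, and proper HTML parsing.
        import re
        from collections import deque

        import requests  # assumed third-party dependency

        def crawl(seed_url, max_pages=10):
            """Breadth-first crawl starting from a seed URL."""
            frontier = deque([seed_url])
            pages = {}                                   # URL -> downloaded HTML
            while frontier and len(pages) < max_pages:
                url = frontier.popleft()
                if url in pages:
                    continue
                try:
                    html = requests.get(url, timeout=5).text
                except requests.RequestException:
                    continue                             # skip unreachable pages
                pages[url] = html                        # hand off to the indexer
                # Queue the absolute hyperlinks found on the downloaded page.
                frontier.extend(re.findall(r'href="(https?://[^"]+)"', html))
            return pages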

  • An Introduction to Search Engines and Web Navigation
    • Mark Levene(Author)
    • 2011(Publication Date)
    • Wiley
      (Publisher)

    ...I will present you with a simplified answer, which is given in the architecture diagram shown in Fig. 4.5. The main components of a search engine are the crawler, indexer, search index, query engine, and search interface. Figure 4.5 Simplified search engine architecture. As I have already mentioned, a web crawler is a software program that traverses web pages, downloads them for indexing, and follows the hyperlinks that are referenced on the downloaded pages; web crawlers will be discussed in detail in the next section. As a matter of terminology, a web crawler is also known as a spider, a wanderer or a software robot. The second component is the indexer which is responsible for creating the search index from the web pages it receives from the crawler. 4.5.1 The Search Index The search index is a data repository containing all the information the search engine needs to match and retrieve web pages. The type of data structure used to organize the index is known as an inverted file. It is very much like an index at the back of a book. It contains all the words appearing in the web pages crawled, listed in alphabetical order (this is called the index file), and for each word it has a list of references to the web pages in which the word appears (this is called the posting list). In 1998 Brin and Page reported the Google search index to contain 14 million words, so currently it must be much larger than that, although clearly very much smaller than the reported number of web pages covered, which is currently over 600 billion. (Google reported that after discarding words that appear less than 200 times, there are about 13.6 million unique words in Google's search index. 32) Consider the entry for “chess” in the search index...
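
    To make the inverted file concrete, the short Python sketch below builds an index file of alphabetically ordered words, each pointing to a posting list of the documents in which the word appears. The three toy documents are invented for illustration and are not drawn from the book.

        # Toy inverted file: each word maps to a posting list of document IDs.
        from collections import defaultdict

        documents = {
            1: "the chess opening was a quiet one",
            2: "a quiet game of chess",
            3: "opening theory in modern chess",
        }

        inverted_index = defaultdict(list)
        for doc_id, text in documents.items():
            for word in set(text.split()):       # one posting per document
                inverted_index[word].append(doc_id)

        # The "index file": words in alphabetical order with their posting lists.
        for word in sorted(inverted_index):
            print(word, sorted(inverted_index[word]))

        # The entry for "chess" references every document containing the word.
        print(sorted(inverted_index["chess"]))   # -> [1, 2, 3]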

  • Digital Asset Management
    • David Austerberry(Author)
    • 2012(Publication Date)
    • Routledge
      (Publisher)

    ...9 Search Engines Introduction We use language to express concepts and ideas. The richness of languages means that there are many different ways to express the same concept. We can use different words in different grammatical constructions, all to express the same concept. Here lies the problem with search; we might express a concept one way, but the entry in the index for an occurrence of that concept will not be an exact match. The designers of the index and search applications attempt to solve this difficult problem. One of the many advantages claimed for digital asset management is the ability to quickly search the content repository to find relevant information. Whether the application lives up to that claim depends very much on the precision and recall abilities of the search engine. Although cataloguing and search are different operations, undertaken by different users, the two are usually components of the same application. Consequently, the catalog index is prepared for a specific search engine. There are two ways to find content. If you are familiar with the taxonomy, you can search through a directory tree until you locate the wanted file. The other way is to use some form of search engine. The first route assumes that you are familiar with the contents of the file. As a repository grows in size this rapidly becomes untenable. Most people can even lose track of files on the local drive on their desktop computer. Databases have long had the facility to query the data to retrieve records. The simplest search is to enter a string of characters, and then search the records for occurrences of that string. The first drawback with this is that the result set can be very large. The second is that you may not know what string the author used. If you are looking for documents on a given subject, it is unlikely that you will know the title...
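
    Precision and recall, on which the excerpt says the value of such a search engine largely depends, can be made concrete with a small calculation. The result sets below are invented purely for illustration.

        # Toy precision/recall calculation for a single query.
        retrieved = {"doc1", "doc2", "doc3", "doc4"}   # what the engine returned
        relevant = {"doc2", "doc4", "doc7"}            # what the user actually wanted

        hits = retrieved & relevant
        precision = len(hits) / len(retrieved)   # fraction of results that are relevant
        recall = len(hits) / len(relevant)       # fraction of relevant items that were found

        print(f"precision = {precision:.2f}, recall = {recall:.2f}")   # 0.50, 0.67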

  • Search Engine Society
    • Alexander Halavais(Author)
    • 2013(Publication Date)
    • Polity
      (Publisher)

    ...Over time, and billions of queries, it may be possible to notice certain consistencies among query terms. Someone searching for “John,” “Paul,” and “George,” the engine could suggest, may also want to include “Ringo.” By tracking which links are clicked, search engines can provide something akin to a voting system. If the top result for a search for “Mars” is consistently avoided, it is likely that, despite whatever algorithm led to it being placed at the top of the results list, it is not much use to searchers, as a rule. A significant portion of search engine research now focuses on understanding the web, rather than the user. Rather than merely creating an index that indicates what terms appear on various pages, or what the link structure might suggest about the importance of a page, these efforts try to discover what concepts are important on a page, and how this concept or topic might relate to other pages (e.g., Dittenbach, Berger, & Merkl 2006). The Text REtrieval Conferences (TREC) have turned an eager eye toward the web in recent years, and support efforts to provide for understanding of web content in order to aid searching. These kinds of redesign of the search engine’s processes and interface are generally a long-cycle response to user behavior. Even the development of vertical search engines to meet the needs of a perceived subgroup represents this sort of slow evolution. There is also the more immediate response to individuals’ particular contexts for a search, or their searching history...
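
    The "Ringo" example above can be illustrated with a toy co-occurrence count over a query log. The log and the simple subset test below are fabricated for the sketch; production systems draw on far richer signals than this.

        # Toy suggestion sketch: propose the term that most often co-occurs with
        # the user's query terms in a (fabricated) log of past queries.
        from collections import Counter

        query_log = [
            {"john", "paul", "george", "ringo"},
            {"john", "paul", "george", "ringo", "beatles"},
            {"john", "paul", "lennon"},
            {"mars", "rover"},
        ]

        def suggest(query_terms):
            co_occurring = Counter()
            for past_query in query_log:
                if query_terms <= past_query:            # past query contains all current terms
                    co_occurring.update(past_query - query_terms)
            return co_occurring.most_common(1)[0][0] if co_occurring else None

        print(suggest({"john", "paul", "george"}))       # -> 'ringo'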

  • Marketing in the Age of Google, Revised and Updated: Your Online Strategy IS Your Business Strategy
    • Vanessa Fox(Author)
    • 2012(Publication Date)
    • Wiley
      (Publisher)

    ...Chapter 5 How Search Engines Work We’ve seen how searchers behave and how they interact with search results. We’ve decided what queries we want our sites to be found for. How do search engines compile these lists? THE EVOLUTION OF SEARCH ENGINES In the emerging days of the Web, directories were built to help users navigate to various websites. Generally, these directories were created by hand—people categorized websites so that users could browse to what they wanted. As the Web got larger, this effort became more difficult. Web spiders that “crawled” websites were created. Web spiders, also known as robots, are computer programs that follow links from known web pages to other web pages. These robots access those pages, download the contents of those pages (into a storage mechanism generically referred to as an “index”), and add the links found on those pages to their list for later crawling. Although Web crawlers enabled the early search engines to have a larger list of sites than the manual method of collecting sites, they couldn’t perform the other manual tasks of figuring out what the pages were about and ranking them in order of which ones were best. These search engines started working on computer programs that would help them do these things as well. For instance, computer programs could catalog all the words on a page to help figure out what those pages were about. THE INTRODUCTION OF PAGERANK Google’s PageRank algorithm in 1998 was a big step forward in automatically cataloging and ranking websites. 1 This algorithm used data from the links on the Web to determine what pages were about and which pages were more popular and useful. Links were like votes for a site, and the text from those links was used for cataloging them. For instance, consider two web pages. One is at the address www.myusedcars.com, and the other is at the address www.yourusedcars.com. Both contain text about cars and have the title “Used Cars.” Five websites link to www.myusedcars.com...
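
    The link-as-vote idea behind PageRank can be approximated with a short power-iteration sketch over a toy link graph. The graph below loosely reuses the used-car example and adds a made-up third site; the damping factor and iteration count are conventional textbook choices, not Google's actual parameters.

        # Power-iteration sketch of a PageRank-style score over a toy link graph.
        links = {
            "www.myusedcars.com": ["www.yourusedcars.com"],
            "www.yourusedcars.com": ["www.myusedcars.com", "carblog.example"],
            "carblog.example": ["www.myusedcars.com"],    # made-up site for the example
        }

        damping = 0.85
        pages = list(links)
        rank = {page: 1.0 / len(pages) for page in pages}

        for _ in range(50):                               # iterate until ranks settle
            new_rank = {page: (1 - damping) / len(pages) for page in pages}
            for page, outgoing in links.items():
                share = damping * rank[page] / len(outgoing)   # each link acts like a vote
                for target in outgoing:
                    new_rank[target] += share
            rank = new_rank

        print(sorted(rank.items(), key=lambda kv: -kv[1]))     # most "voted-for" page first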

  • Google and the Digital Divide: The Bias of Online Knowledge

    ...Automatic search systems are required to build an index that links search queries with relevant documents and therefore extract, summarise, classify and eventually visualise content in friendly interfaces. In other words, they attempt to shorten some of the information retrieval procedures previously done with the help of professional librarians. However, despite the great benefits this system can offer, during this complex process it is always possible that users will translate their information needs into inadequate search queries, or that the system will omit very relevant documents from its index or its search results. The first search engine, Archie, was developed in 1990, and was based on the idea of downloading the directory listings of all files located on anonymous public File Transfer Protocol (FTP) sites, thus creating a searchable database of filenames. A year later, Gopher was introduced, enabling online users to search within the content of plain-text files. In 1993, various crawl-based search engines, also known as ‘robots’ or ‘crawlers’, were developed (e.g. Wandex, WebCrawler and later commercial ones such as Lycos, Excite, Infoseek, Inktomi, Northern Light and AltaVista). The basic principle of those search engines, which is still common nowadays, is to follow hyperlinks from one website to another and retrieve their content, creating an index that connects keywords or search queries with URLs 1 (Sherman and Price, 2001; Battelle, 2005). In an early analysis of the search engine market, Gandal (2001) noticed that the five big search engines, Yahoo!, Lycos, Excite, Infoseek and AltaVista, could not maintain their dominance. In 1999, six late entrants (About, LookSmart, Snap, HotBot, GoTo and AskJeeves) grew steadily and reached a market share of 5–6 per cent each. This is to suggest that the power of a brand name alone did not help the big five search engines to retain their competitive position in the long term...

  • Natural Language Processing in Artificial Intelligence
    • Brojo Kishore Mishra, Raghvendra Kumar(Authors)
    • 2020(Publication Date)

    ...The process of transforming document text and a given query statement into some representation of them is known as indexing. Indexing reduces the size of the search space and improves the performance of the IR system. There are different index structures which can be used for indexing. The data structure most commonly used by an Information Retrieval system is the inverted index. To retrieve information, the Information Retrieval system does not use the actual text of the document and query. Instead, some representation of the documents and the query is used by the search engine. The system then matches the document representation with the query representation to retrieve the relevant information. Indexing techniques are concerned with the selection of good document descriptors, such as keywords or terms, to describe the information content of the documents. A good descriptor is one that helps in describing the content of the document and in discriminating the document from other documents in the collection. The most widely used method is to represent the query and the document as a set of tokens, i.e., index terms or keywords. 2.2.5 TERM WEIGHTING To represent the importance of terms with respect to the different documents and across the documents, a term weighting matrix is created. The factor tf simply means the term count in a document. To represent a document, term count is considered important because the terms that occur more frequently represent its meaning more strongly than those occurring less frequently...
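
    A minimal sketch of the term-weighting idea follows: term frequency (tf) is counted per document and, in the common tf-idf variant, discounted for terms that appear in many documents. The three documents are invented for illustration and the weighting is deliberately simplified.

        # Toy tf-idf weighting over an invented document collection.
        import math
        from collections import Counter

        documents = {
            "d1": "information retrieval uses an inverted index",
            "d2": "indexing reduces the search space for retrieval",
            "d3": "term weighting represents term importance",
        }

        tf = {doc_id: Counter(text.split()) for doc_id, text in documents.items()}

        def tf_idf(term, doc_id):
            """Term count in one document, discounted by how many documents contain the term."""
            doc_freq = sum(1 for counts in tf.values() if term in counts)
            if doc_freq == 0:
                return 0.0
            return tf[doc_id][term] * math.log(len(documents) / doc_freq)

        print(round(tf_idf("retrieval", "d1"), 3))   # in 2 of 3 documents -> modest weight
        print(round(tf_idf("term", "d3"), 3))        # tf = 2, in only 1 document -> higher weight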

  • Search Engine Society
    • Alexander Halavais(Author)
    • 2017(Publication Date)
    • Polity
      (Publisher)

    ...As one group of early search engine pioneers noted, “It is safe to say that at least 99% of the available data is of no interest to at least 99% of the users” (Bowman et al. 1994), and, as a result, a general-purpose search engine probably does not need to index the entirety of the web at once. Some estimate that fully 80 percent of the content on the web is hidden from the general-purpose horizontal search engines (Ratzan 2006), while others suggest that proportion is much, much larger, estimating that, for every page in a major search engine’s index, there are 500 missing (Sullivan 2000). Especially if the latter estimate remains true today, the sheer size and rate of growth of the web is enough to suggest why search engines have an incomplete index, but there are several other reasons, as well. The search engine’s crawler must usually be able to follow a hyperlink to arrive at and index a page; if the page is not hyperlinked from another HTML page somewhere on the web, it is unlikely to be indexed. Anything other than HTML – including document formats (Adobe Acrobat or Microsoft Word, for example), applets and other executable files, archived (“zipped”) files, video, and audio – presents a novel challenge to a search engine, and that challenge is even greater when there are hyperlinks within those documents. Things like live feeds, links to virtual worlds, or even robots that can be controlled from the web (see, e.g., Schulz et al. 2000), are very difficult to anticipate and make sense of in the context of search. Much of the material available through the web can only be accessed through local searches on individual sites. Consider, for example, all of the pages available from a public library’s web portal. There is probably a page for each book available in the library, but these pages probably do not appear as results from a search on a general-purpose site...