How Google and Other Spider-Based Search Engines Work
Since I am often asked how search engines actually work, I have decided to write a post on it. I hope you find it useful.
Search engines such as Google take copies of pages on the web, and users then search through what the engine has collected. If you update your web content, search engines will find these changes the next time they visit, which can affect how your site is listed.
Search engines use a three-step process. Firstly, the engine's spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being “spidered”. The spider returns to the site on a regular basis, such as every month or two, to look for changes. However, for frequently updated websites or blogs, Google may revisit every few days or even every few hours.
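To make the spidering step concrete, here is a minimal sketch in Python. It is not how Google's spider is built. It assumes a tiny made-up site held in memory (the `SITE` dictionary and its paths are hypothetical) so it can run without a network connection, and it simply reads each page, collects the links, and follows the ones it has not seen yet.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML. A real spider would fetch pages over HTTP.
SITE = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/">Home</a>',
    "/blog": '<a href="/blog/post-1">Post</a> <a href="/about">About</a>',
    "/blog/post-1": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(start, site):
    """Visit pages breadth-first from `start`, following links within the site."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        html = site.get(url)
        if html is None:
            continue  # link points to a page we cannot read
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


print(spider("/", SITE))
```

Revisiting a site "every month or two" would amount to running `spider` again on a schedule and comparing what it reads with the previous copy.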
Secondly, everything the spider finds goes into the search engine’s index of web pages. If a web page changes, then this index is updated with new information.
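A common way to picture the index is as an "inverted index": a lookup table from each word to the pages that contain it. The sketch below is a simplified assumption of mine rather than any engine's real data structure. The `Index` class and its method names are invented for illustration. It shows the one behaviour described above: when a page changes, its old entries are replaced with the new ones.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    """Toy inverted index: word -> set of page URLs (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.words_by_url = {}

    def update(self, url, text):
        """Add a page, or replace its entries if it was indexed before."""
        for word in self.words_by_url.get(url, set()):
            self.postings[word].discard(url)
            if not self.postings[word]:
                del self.postings[word]
        words = set(tokenize(text))
        for word in words:
            self.postings[word].add(url)
        self.words_by_url[url] = words

    def lookup(self, word):
        """Return the set of pages containing `word`."""
        return set(self.postings.get(word.lower(), set()))


index = Index()
index.update("/a", "Fresh coffee beans")
index.update("/b", "Coffee grinders")
index.update("/a", "Tea leaves")  # the page changed, so its entries are refreshed
print(index.lookup("coffee"))
```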
Thirdly, when a user runs a search, the search engine software scans through the billions of web pages recorded in the index and ranks the results in order of what it believes is most relevant.
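As a rough illustration of the search step, the sketch below finds the pages that contain every word in a query and orders them by how often those words appear. The sample `PAGES` and the counting rule are assumptions for the example. For brevity it scans the page text directly instead of a prebuilt index, and real engines weigh far more than word counts.

```python
import re
from collections import Counter

# Hypothetical page texts standing in for the engine's stored copies.
PAGES = {
    "/coffee": "Coffee beans and coffee grinders for coffee lovers",
    "/tea": "Green tea, black tea and a little coffee",
    "/about": "About our shop",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, pages):
    """Return URLs of pages containing every query word, best match first."""
    terms = tokenize(query)
    if not terms:
        return []
    results = []
    for url, text in pages.items():
        counts = Counter(tokenize(text))
        if all(counts[term] > 0 for term in terms):
            score = sum(counts[term] for term in terms)
            results.append((score, url))
    # Highest score first; ties are broken alphabetically by URL.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in results]


print(search("coffee", PAGES))
```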
Search engines use algorithms to determine relevancy when confronted with billions of web pages. All major search engines consider the following factors.
- The location and frequency of keywords on a web page and whether they appear near the top of the page, such as in the headline or in the first few paragraphs of text.
- Search engines may also penalise pages, or exclude them from the index altogether, if they detect search engine “spamming”. An example is a word repeated hundreds of times on a page.
- All major search engines (particularly Google) make use of “off the page” ranking criteria. By analysing inbound links, for example, a search engine can determine both what a page is about and whether that page is deemed to be “an authority”.
- Another off the page factor is click-through measurement. In short, this means that a search engine may watch which results people select for a particular search, then eventually drop high-ranking pages that aren’t attracting clicks, while promoting lower-ranking pages that do pull in visitors.
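Three of these factors (keyword location and frequency, spam detection, and inbound links) can be combined in a small sketch. Everything numeric here is my own assumption for illustration: the weights, the 30% spam threshold, and the half point per inbound link are made up, and no search engine publishes its real formula. Click-through measurement is left out to keep the example short.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(keyword, headline, body, top_words=50):
    """Score keyword matches, weighting the headline and top of the page more."""
    kw = keyword.lower()
    words = tokenize(body)
    score = 3.0 * tokenize(headline).count(kw)   # assumed weight: headline
    score += 2.0 * words[:top_words].count(kw)   # assumed weight: near the top
    score += 1.0 * words[top_words:].count(kw)   # assumed weight: further down
    return score


def looks_like_spam(keyword, body, max_share=0.3, min_words=20):
    """Flag pages where one keyword makes up an implausible share of the text."""
    words = tokenize(body)
    if len(words) < min_words:
        return False
    return words.count(keyword.lower()) / len(words) > max_share


def inbound_counts(links):
    """Count distinct inbound links per page from (source, target) pairs."""
    counts = {}
    for source, target in set(links):
        if source != target:  # ignore a page linking to itself
            counts[target] = counts.get(target, 0) + 1
    return counts


def rank(keyword, pages, links):
    """Order pages for `keyword`; `pages` maps URL -> (headline, body)."""
    inbound = inbound_counts(links)
    scored = []
    for url, (headline, body) in pages.items():
        if looks_like_spam(keyword, body):
            continue  # excluded from the results
        score = keyword_score(keyword, headline, body)
        if score == 0:
            continue  # keyword does not appear on the page
        score += 0.5 * inbound.get(url, 0)  # assumed bonus per inbound link
        scored.append((score, url))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]
```

Given a page whose headline and opening text mention the keyword, a page that mentions it once, and a page that repeats it hundreds of times, `rank` would place the first page on top and leave the third out entirely.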