Search Engine Spiders Lost Without Guidance - Post This Sign!
The robots.txt file is an exclusion standard used by all web crawlers/robots to tell them what files and directories you want them to stay OUT of on your site. Not all crawlers/bots follow the exclusion standard, and some will continue crawling your site anyway. I like to call them "Bad Bots" or trespassers. We block them by IP exclusion, which is another story entirely.

This is a very simple overview of robots.txt basics for webmasters. For a complete and thorough lesson, visit http://www.robotstxt.org/

To see the proper format for a somewhat standard robots.txt file, look directly below. That file should be at the root of the domain because that is where the crawlers expect it to be, not in some secondary directory.

Below is the proper format for a robots.txt file ----->

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

--------> End of robots.txt file

This tiny text file is saved as a plain text document and ALWAYS with the name "robots.txt" in the root of your domain.

A quick review of the listed information from the robots.txt file above follows. The "User-agent: msnbot" is from MSN, Slurp is from Yahoo and Teoma is from AskJeeves. The others listed are "Bad" bots that crawl very fast and to nobody's benefit but their own, so we ask them to stay out entirely. The * asterisk is a wild card that means "All" crawlers/spiders/bots should stay out of the group of files or directories listed.

The bots given the instruction "Disallow: /" are told to stay out entirely, and those with "Crawl-delay: 10" are those that crawled our site too quickly and caused it to bog down and overuse the server resources. Google crawls more slowly than the others and doesn't require that instruction, so it is not specifically listed in the above robots.txt file. The Crawl-delay instruction is only needed on very large sites with hundreds or thousands of pages.
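You can check how rules like these are interpreted without waiting for a crawler to visit. Here is a minimal sketch using Python's standard urllib.robotparser module, parsing a shortened version of a file like the one above from a string (the example.com URLs are hypothetical):

```python
import urllib.robotparser

# Parse an abbreviated robots.txt from a string instead of fetching it
# over the network; the rules mirror the example file above.
rules = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: Slurp
Crawl-delay: 10

User-agent: psbot
Disallow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# psbot is banned from the whole site; ordinary crawlers fall under the
# * group and may fetch anything outside /cgi-bin/.
print(parser.can_fetch("psbot", "http://example.com/index.html"))      # False
print(parser.can_fetch("Googlebot", "http://example.com/index.html"))  # True
print(parser.can_fetch("Googlebot", "http://example.com/cgi-bin/x"))   # False
print(parser.crawl_delay("Slurp"))                                     # 10
```

The crawl_delay() lookup requires Python 3.6 or newer; older versions only expose can_fetch().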
The wildcard asterisk * applies to all crawlers, bots and spiders, including Googlebot.

Those we provided that "Crawl-delay: 10" instruction to were requesting as many as 7 pages every second, so we asked them to slow down. The number you see is seconds, and you can change it to suit your server capacity, based on their crawling rate. Ten seconds between page requests is far more leisurely and stops them from asking for more pages than your server can dish up.

(You can discover how fast robots and spiders are crawling by looking at your raw server logs - which show pages requested with precise times to within a hundredth of a second - available from your web host, or ask your web or IT person. Your server logs can be found in the root directory if you have server access, and you can usually download compressed server log files by calendar day right off your server. You'll need a utility that can expand compressed files to open and read those plain text raw server log files.)

To see the contents of any robots.txt file, just type /robots.txt after any domain name. If they have that file up, you will see it displayed as a text file in your web browser. Click on the link below to see that file for Amazon.com

http://www.Amazon.com/robots.txt

You can see the contents of any website's robots.txt file that way.

The robots.txt shown above is what we currently use at Publish101 Web Content Distributor, just launched in May of 2005. We did an extensive case study and published a series of articles on crawler behavior and indexing delays known as the Google Sandbox. That Google Sandbox Case Study is highly instructive on many levels for webmasters everywhere about the importance of this often-ignored little text file.
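Measuring crawl rates from raw logs, as described above, can be scripted rather than eyeballed. A minimal sketch, assuming Apache-style combined log lines where the last quoted field is the user-agent (the sample entries and bot names below are invented for illustration):

```python
import re
from collections import Counter

# Invented combined-log lines; only the [timestamp] and the trailing
# quoted user-agent field matter for this calculation.
SAMPLE_LOG = [
    '1.2.3.4 - - [17/Aug/2005:10:00:01 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "FastBot/1.0"',
    '1.2.3.4 - - [17/Aug/2005:10:00:01 -0700] "GET /b.html HTTP/1.1" 200 512 "-" "FastBot/1.0"',
    '1.2.3.4 - - [17/Aug/2005:10:00:01 -0700] "GET /c.html HTTP/1.1" 200 512 "-" "FastBot/1.0"',
    '5.6.7.8 - - [17/Aug/2005:10:00:02 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "SlowBot/1.0"',
]

# Capture the bracketed timestamp and the final quoted string on the line.
LINE_RE = re.compile(r'\[([^\]]+)\].*"([^"]*)"$')

def peak_requests_per_second(lines):
    """Return {user_agent: most requests seen in any single second}."""
    per_second = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            timestamp, agent = match.groups()
            per_second[(agent, timestamp)] += 1
    peaks = {}
    for (agent, _), count in per_second.items():
        peaks[agent] = max(peaks.get(agent, 0), count)
    return peaks

print(peak_requests_per_second(SAMPLE_LOG))
```

A bot whose peak climbs toward 7 requests per second is the kind of candidate the article throttles with Crawl-delay.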
One thing we didn't expect to glean from the research involved in indexing delays (known as the Google Sandbox) was the importance of robots.txt files to quick and efficient crawling by the spiders from the major search engines - and the number of heavy crawls from bots that will do no earthly good to the site owner, yet crawl most sites extensively and heavily, straining servers to the breaking point with requests for pages coming as fast as 7 pages per second.

We discovered in our launch of the new site that Google and Yahoo will crawl the site whether or not you use a robots.txt file, but MSN seems to REQUIRE it before it will begin crawling at all. All of the search engine robots seem to request the file on a regular basis to verify that it hasn't changed.

Then when you DO change it, they will stop crawling for brief periods and repeatedly ask for that robots.txt file during that time, without crawling any additional pages. (Perhaps they had a list of pages to visit that included the directory or files you have instructed them to stay out of and must now adjust their crawling schedule to eliminate those files from their list.)

Most webmasters instruct the bots to stay out of "image" directories and the "cgi-bin" directory, as well as any directories containing private or proprietary files intended only for users of an intranet or password-protected sections of your site. Clearly, you should direct the bots to stay out of any private areas that you don't want indexed by the search engines.

The importance of robots.txt is rarely discussed by average webmasters, and I've even had some of my business clients' webmasters ask me what it is and how to implement it when I tell them how important it is to both site security and efficient crawling by the search engines. This should be standard knowledge among webmasters at substantial companies, but this illustrates how little attention is paid to the use of robots.txt.
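If you maintain several sites, a file following that pattern can be generated rather than hand-typed. A minimal Python sketch - the function name, directory list and bot lists here are illustrative, not part of the article's own file:

```python
def build_robots_txt(private_dirs, bad_bots=(), slow_bots=(), delay=10):
    """Assemble robots.txt text: keep every crawler out of private_dirs,
    ban bad_bots from the whole site, and throttle slow_bots
    with a Crawl-delay directive."""
    blocks = [["User-agent: *"] + [f"Disallow: {d}" for d in private_dirs]]
    for bot in slow_bots:
        blocks.append([f"User-agent: {bot}", f"Crawl-delay: {delay}"])
    for bot in bad_bots:
        blocks.append([f"User-agent: {bot}", "Disallow: /"])
    # Blank line between groups, trailing newline at end of file.
    return "\n\n".join("\n".join(block) for block in blocks) + "\n"

print(build_robots_txt(
    ["/cgi-bin/", "/images/", "/private/"],
    bad_bots=["psbot", "BecomeBot"],
    slow_bots=["Slurp", "msnbot"],
))
```

Write the result to a file named exactly robots.txt in the document root, since that is the only place crawlers will look for it.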
The search engine spiders really do want your guidance, and this tiny text file is the best way to provide crawlers and bots a clear signpost to warn off trespassers and protect private property - and to warmly welcome invited guests, such as the big three search engines, while asking them nicely to stay out of private areas.

Copyright © August 17, 2005 by Mike Banks Valentine

Google Sandbox Case Study: http://publish101.com/Sandbox2
Mike Banks Valentine operates http://Publish101.com
Free Web Content Distribution for Article Marketers. Provides content aggregation, press release optimization and custom web content for Search Engine Positioning.
http://www.seoptimism.com/SEO_Contact.htm