You, Some SEO and a Spider

Posted by George Chilton on: 2006-08-29 23:53:36



Seducing the robots and spiders

What do you imagine when you think of successful seduction?

Right now I’m thinking of thousands of tiny spiders crawling over my computer screen. No, I’m not mentally ill – I’m talking about making your website seductive; or rather, attractive to webspiders and net-bots.

What are webspiders, what are net-bots?

Web-spiders, ants and crawlers are just some of the names for the automatic scripts that browse the Internet in a methodical fashion. They harvest data for different kinds of processing. They can be used internally (a website may employ a net-bot to check for broken links), or they can be used by search engines to index new and updated websites.

For some examples of these webcrawlers, please have a browse through Wikipedia’s selection:
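For the curious, the first thing any crawler does with a page is pull out its links so it knows where to go next. Here is a minimal sketch of that step in Python, using only the standard library. The names LinkCollector and extract_links are made up for illustration, and real crawlers such as Googlebot are far more involved than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it is fed (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

Feeding it `'<a href="/about.html">About</a>'` returns `['/about.html']`. A broken-link checker would then request each address in turn and note the ones that fail.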

http://en.wikipedia.org/wiki/Web_crawler#Examples_of_web_crawlers

Why would I seduce a spider?

Never thought I’d write that. Crawlers are good for your website because they let the search engines find you. Without them your website would be very difficult to find.

The benefits of webcrawlers:

  • Your website will be indexed by the major search engines.
  • The crawlers will notice updates and the search engines will update accordingly.
  • The search engine will display your website correctly.

How do I seduce a Spider?

Spiders like Googlebot (please see How Google Crawls my site for more details) want to index your website and they will find you if you have:

  • Links to your website from external (and *legitimate) URLs
  • Links to other websites (like directories you may feature in, for example).
  • Internal page links (the Bots use them to navigate)

(*By ‘legitimate’ I mean bona fide websites, which are not connected to your own website. It would not benefit you to create a single-page website to link back from, for example.)
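To see the difference between internal and external links the way a bot might, here is a small Python sketch that sorts a page's links into the two groups. It assumes a link is "internal" when it points at the same host as the page it sits on; the function name classify_links is hypothetical, and search engines certainly use more subtle rules than this.

```python
from urllib.parse import urljoin, urlsplit


def classify_links(page_url, hrefs):
    """Split hrefs found on page_url into (internal, external) absolute URLs."""
    site = urlsplit(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        # Resolve relative links such as "articles/jubjub.html" against the page
        absolute = urljoin(page_url, href)
        if urlsplit(absolute).netloc.lower() == site:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

Running it over your own pages is a quick way to check that every page can be reached through internal links alone.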

However, you do not want a crawler to index all the information in your website. It would be a waste of time having your /image directory listed on Google, for example, so you should ask the crawlers not to access this content. You may also want to protect your e-mail addresses from malignant crawlers, although that takes more than a polite request (please see ‘Are all crawlers safe?’ below).

To do this you should create a robots.txt file.

A robots.txt file is a simple, but potent, document that every website should keep in its root directory. This file is your ‘fart in the lift’; it is small, but very powerful in effect. With it you may ask a crawler not to harvest certain pages or even entire directories by using the directive:

Disallow:
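Because well-behaved crawlers only ever ask for this file at the top level of a site, you can work out where it has to live from any page address. A minimal Python sketch, with a made-up function name:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address where crawlers look for robots.txt for this page's site."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop any query and fragment
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For `http://www.yoururl.co.uk/articles/jubjub.html` this gives `http://www.yoururl.co.uk/robots.txt`, however deep in the site the page is.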

A Mini robots.txt tutorial:

1. Start a notepad document and name it robots.txt



2. Address the webcrawlers like this:

User-agent: *

The ‘user-agent’ denotes that you are addressing a webcrawler. If you place an asterisk in the way that I have done here you will address every webcrawler that happens upon your website. If you wish to address individual crawlers you should list them by name like this:

User-agent: Googlebot

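But you must list the disallowed pages/directories for each crawler individually, because a crawler that finds a section addressed to it by name follows that section and ignores the asterisk one.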

For example:

User-agent: *
Disallow: /user-list/email/
Disallow: /products/images/
Disallow: /articles/contributors/

All files and folders in these directories will be off limits to crawlers that obey the file. Bear in mind that every path is written relative to the root of your website and starts with a slash. The robots.txt file itself must also sit in the root directory, because that is the only place crawlers look for it:

http://www.yoururl.co.uk/robots.txt

A file saved anywhere else, at http://www.yoururl.co.uk/index/robots.txt for example, will simply never be read.

3. You may also want to disallow certain files. You can do so like this:

Disallow: /articles/jubjub.html
Disallow: /index/error_page.html
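Before uploading the file, it is worth checking that the rules do what you expect. Python ships a small robots.txt reader, urllib.robotparser, that can test them. The sketch below combines the directory and file rules from the tutorial and adds a separate Googlebot section purely as an assumption for illustration. The helper name is_allowed and the agent name ExampleBot are invented, and Python's reader only approximates how any particular search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The tutorial's rules for every crawler, plus a hypothetical
# Googlebot-only section added here for illustration.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /products/images/

User-agent: *
Disallow: /user-list/email/
Disallow: /products/images/
Disallow: /articles/contributors/
Disallow: /articles/jubjub.html
Disallow: /index/error_page.html
"""

SITE = "http://www.yoururl.co.uk"


def is_allowed(agent, path):
    """Return True if a crawler named `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, SITE + path)
```

With these rules, `is_allowed("ExampleBot", "/articles/jubjub.html")` is False, while `is_allowed("Googlebot", "/articles/jubjub.html")` is True: the Googlebot section does not repeat that line, which is exactly why each named crawler needs its own full list.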

Are all crawlers safe?

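No, some can and will bite you. There are many webcrawlers and they may visit your website for reasons other than indexing. Disallowing crawlers as I have shown you in the tutorial above keeps the well-behaved ones away from certain information, but a robots.txt file is only a request: a malignant crawler is free to ignore it, so anything sensitive needs protecting in other ways too.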

Malignant Crawlers

They can be (much to my upset) used for spamming. Malignant crawlers look through your website with a view to capturing all the e-mail addresses and other useful data displayed there.

If they do this you can expect an inbox full of Spam. I discovered 20 e-mails from a Japanese Adult dating website in my Herds of Words inbox today. I was not a happy bunny.

However, you can avoid this (I was just that little bit too late) if you encode the addresses differently, making it harder for these evil Bots to trap you.

If you are using Cascading Style Sheets (.css):

  1. Create an html-tag to fit around the text you want to use as an e-mail address.
  2. In the css file you must define that tag, so:

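postmaster:after{ content: "postmaster\40herdsofwords.co.uk";}

(The \40 is the CSS escape code for the @ sign, so the complete address never appears in your HTML.)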
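If you have several addresses to hide, the rule can be generated rather than typed by hand. The following Python sketch builds one such rule; the function name css_email_rule is invented for illustration. It writes the @ sign as the CSS escape \40 followed by a space, which CSS treats as the end of the escape, so the rule also works for domains that begin with a digit or one of the letters a to f.

```python
def css_email_rule(tag, local_part, domain):
    """Build a CSS rule that shows local_part@domain after the given tag.

    The "@" is written as the CSS escape \\40; the space after it ends
    the escape and is not displayed.
    """
    return f'{tag}:after {{ content: "{local_part}\\40 {domain}"; }}'


if __name__ == "__main__":
    print(css_email_rule("postmaster", "postmaster", "herdsofwords.co.uk"))
```

Bear in mind that text produced this way is displayed by the browser but cannot be clicked or copied like a normal e-mail link, and a determined harvester that reads stylesheets could still decode it.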

If that doesn’t help you, or you don’t use cascading style sheets, please have a look through this useful article by Daniel Cody, http://evolt.org/article/Using_Apache_to_stop_bad_robots/18/15126/

I hope this article has been useful. If you have any questions, comments or friendly criticism, please don’t hesitate to contact me at herdsofwords.co.uk.

George Chilton is an experienced Advertising and SEO copywriter at Herds of Words. He has fourteen years’ experience as a magician and public speaker and can be contacted at george@herdsofwords.co.uk.

Or come join the herd at http://www.HerdsOfWords.co.uk - Freelance Copywriters.




