
Nearly all search engines, including Google, treat www.mysite.com and mysite.com as two different websites. This is bad for SEO: instead of having one site ranked highly in the search engines, you end up with two sites that each carry less importance. See more about SEO/SEM to learn why.

To fix this, we can create a very simple .htaccess file that redirects any traffic from mysite.com to www.mysite.com. That way all incoming traffic hits the same URL, and Google gives all of the page ranking importance to one website.

Open up an ASCII text editor such as Notepad and create the following file, named htaccess.txt:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.mysite\.com$
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]

Next, upload this file to your server (to the root directory of your website) and change the filename from "htaccess.txt" to ".htaccess". That's all there is to it.

Although less common, you can also implement this strategy in reverse, sending all www traffic to the non-www URL. To do this, replace the code above with the following:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^mysite\.com$
RewriteRule (.*) http://mysite.com/$1 [R=301,L]

NOTE: Only use one of these two code segments.

Hopefully any questions that you may have had about htaccess files and www-redirects have just been cleared up. If not, please feel free to send an email to Info@asisbiz.com


Checking robots.txt with Google Webmaster Tools

The robots.txt analysis tool reads the robots.txt file in the same way Googlebot does. If the tool interprets a line as a syntax error, Googlebot doesn't understand that line. If the tool shows that a URL is allowed, Googlebot interprets that URL as allowed. This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard. It understands Allow: directives, as well as some pattern matching. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.
If a robots.txt file exists in the root directory of the host (see http://www.google.com/support/webmasters/bin/answer.py?answer=40361), this tool lists the information that Google has about it, including:

  • A link to the current robots.txt file on your site.
  • When Google last downloaded the file. If you've made changes to the file after this date and time, our cached version won't reflect the changes.
  • The status of the file: the HTTP response we received when we tried to download it (learn more about status codes at http://www.google.com/support/webmasters). If we receive a 404 (File Not Found) error, this doesn't present a problem. You don't have to have a robots.txt file, but if you don't, bots will be able to crawl all the pages on your site.
  • The MIME type: if the file is a type other than text, we can't process it.
  • Whether the robots.txt file is blocking access to your home page or to any Sitemaps you've submitted.
  • Whether we had trouble parsing lines in the file.

To analyze a site's robots.txt file:

Sign into "https://www.google.com/webmasters/tools" Google Webmaster Tools with your "http://www.google.com/accounts/ManageAccount" Google Account.

On the Dashboard, click the URL for the site you want.

Click Tools, and then click Analyze robots.txt.

A Standard for Robot Exclusion

Table of contents:

Introduction
Method
Format
Examples
Example Code

Status of this document

This document represents a consensus on 30 June 1994 on the robots mailing list (robots-request@nexor.co.uk), between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk@info.cern.ch). This document is based on a previous working draft under the same title.

It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.

The latest version of this document can be found on http://www.robotstxt.org/wc/robots.html.

Introduction

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page (http://www.robotstxt.org/). In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

The Method

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The contents of this file are specified below. This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval. A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.
The choice of the URL was motivated by several criteria:

The filename should fit in file naming restrictions of all common operating systems.

The filename extension should not require extra server configuration.

The filename should indicate the purpose of the file and be easy to remember.

The likelihood of a clash with existing files should be minimal.

The Format

The format and semantics of the "/robots.txt" file are as follows:
The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

Comments can be included in the file using UNIX Bourne shell conventions: the '#' character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination is discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.
User-agent
The value of this field is the name of the robot the record is describing access policy for. If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record. The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended. If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.

For example, Disallow: /help disallows both /help.html and /help/index.html,

whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.

The presence of an empty "/robots.txt" file has no explicit associated semantics; it will be treated as if it was not present, i.e. all robots will consider themselves welcome.
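For instance, a short record illustrating the partial-path matching described above might look like this (the paths are hypothetical):

User-agent: *
Disallow: /help/ # blocks /help/index.html but not /help.html
Disallow: /tmp   # blocks /tmp.html, /tmp/, and anything under /tmp/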

Examples

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or /foo.html:

# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":

# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:

This example indicates that no robots should visit this site further:

# go away
User-agent: *
Disallow: /

Example Code

Although it is not part of this specification, some example code in Perl is available in norobots.pl. It is a bit more flexible in its parsing than this document specifies, and is provided as-is, without warranty.

Note: This code is no longer available. Instead I recommend using the robots exclusion code in the Perl libwww-perl5 library, available from CPAN (http://www.cpan.org/) in the LWP directory (http://www.cpan.org/modules/by-module/LWP/).

Author's Address

Martijn Koster (http://www.greenhills.co.uk/mak/mak.html)
Source: http://www.robotstxt.org/orig.html

Creating a robots.txt file

The easiest way to create a robots.txt file is to use the Generate robots.txt tool in Webmaster Tools. Once you've created the file, you can use the Analyze robots.txt tool to make sure that it's behaving as you expect.

You can also create the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file, and the filename should be lowercase. Once you've created your robots.txt file, save it to the root of your domain with the name robots.txt. This is where robots will check for your file. If it's saved elsewhere, they won't find it.

Google and other search engines treat http://www.example.com, https://www.example.com, and http://example.com as different sites. If you want to restrict crawling on each of these sites, you need a separate robots.txt file for every version of your site's URL.
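To sketch what that means in practice (using example.com as a placeholder and a hypothetical /search/ directory), the same rules must be saved once per host and protocol combination, because each version only reads the file served at its own root:

# Saved at http://www.example.com/robots.txt
User-agent: *
Disallow: /search/

# An identical file must also exist at http://example.com/robots.txt and
# https://www.example.com/robots.txt if crawling should be restricted
# on those versions as well.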

Syntax

The simplest robots.txt file uses two rules:

User-agent: the robot the following rule applies to

Disallow: the URL you want to block

These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
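For example, a single entry can bundle several user-agents and several Disallow lines (the paths here are hypothetical):

User-agent: Googlebot
User-agent: Googlebot-Image
Disallow: /archive/
Disallow: /drafts/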

User-agents and bots

A user-agent is a specific search engine robot. The Web Robots Database (http://www.robotstxt.org/wc/active.html) lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:

User-agent: *

Google uses several different bots (user-agents). The bot we use for our web search is Googlebot. Our other bots like Googlebot-Mobile and Googlebot-Image follow rules you set up for Googlebot, but you can set up specific rules for these specific bots as well.

Blocking user-agents

The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).

To block the entire site, use a forward slash.

Disallow: /

To block a directory and everything in it, follow the directory name with a forward slash.

Disallow: /junk-directory/

To block a page, list the page.

Disallow: /private_file.html

To remove a specific image from Google image search, add the following:

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

To remove all images on your site from Google image search:

User-agent: Googlebot-Image
Disallow: /

To block files of a specific file type (for example, .gif), use the following:

User-agent: Googlebot
Disallow: /*.gif$

To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn't share pages with the other Google user-agents. For example:

User-agent: *
Disallow: /folder1/
User-agent: Mediapartners-Google
Allow: /folder1/

Note that directives are case-sensitive. For instance,

Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp.
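Written out as a complete entry, that rule would read:

User-agent: *
Disallow: /junk_file.asp # blocks /junk_file.asp but not /Junk_file.asp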

Pattern matching

Googlebot (but not all search engines) respects some pattern matching.

To match a sequence of characters, use an asterisk (*).

For instance, to block access to all subdirectories that begin with private:

User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):

User-agent: Googlebot
Disallow: /*?

To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:

User-agent: Googlebot
Disallow: /*.xls$

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain one to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).

Blocking Googlebot
Google uses several user-agents. You can block access to any of them by including the bot name on the User-agent line of an entry. Blocking Googlebot blocks all bots that begin with "Googlebot".

  • Googlebot: crawls pages for our web index and our news index
  • Googlebot-Mobile: crawls pages for our mobile index
  • Googlebot-Image: crawls pages for our image index
  • Mediapartners-Google: crawls pages to determine AdSense content. We only use this bot to crawl your site if AdSense ads are displayed on your site.
  • Adsbot-Google: crawls pages to measure AdWords landing page quality. We only use this bot if you use Google AdWords to advertise your site.

For instance, to block Googlebot entirely, you can use the following syntax:

User-agent: Googlebot
Disallow: /

Allowing Googlebot
If you want to block access to all but a single robot, you can use the following syntax (note: we don't recommend doing this if you want your site to appear in the search results for other search engines such as MSN and Yahoo!):

User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:

Googlebot follows the line directed at it, rather than the line directed at everyone.
The Allow extension
Googlebot recognises an extension to the robots.txt standard called Allow. This extension may not be recognized by all other search engine bots, so check with other search engines in which you are interested to find out. The Allow line works exactly like the Disallow line. Simply list a directory or page that you want to allow. You may want to use Disallow and Allow together. For instance, to block access to all pages except one in a subdirectory, you could use the following entries:

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Those entries would block all pages inside the folder1 directory except for myfile.html.
If you block Googlebot and want to allow another of Google's bots (such as Googlebot-Mobile), you can allow access to that bot using the Allow rule. For instance:

User-agent: Googlebot
Disallow: /
User-agent: Googlebot-Mobile
Allow: /

Introduction to "robots.txt"

There is a hidden, relentless force that permeates the web and its billions of web pages and files, unbeknownst to the majority of us sentient beings. I'm talking about search engine crawlers and robots here. Every day hundreds of them go out and scour the web, whether it's Google trying to index the entire web, or a spam bot collecting any email address it can find for less than honorable intentions. As site owners, what little control we have over what robots are allowed to do when they visit our sites exists in a magical little file called "robots.txt." Robots.txt is a regular text file that, through its name, has special meaning to the majority of "honorable" robots on the web. By defining a few rules in this text file, you can instruct robots not to crawl and index certain files or directories within your site, or not to crawl your site at all. For example, you may not want Google to crawl the /images directory of your site, as it's both meaningless to you and a waste of your site's bandwidth. "Robots.txt" lets you tell Google just that.
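As a quick preview (assuming you only want to keep Google's main crawler out of a hypothetical /images directory), the file would simply contain:

User-agent: Googlebot
Disallow: /images/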

Creating your "robots.txt" file

So let's get moving. Create a regular text file called "robots.txt", and make sure it's named exactly that. This file must be uploaded to the root accessible directory of your site, not a subdirectory (i.e. http://www.mysite.com but NOT http://www.mysite.com/stuff/). Only by following these two rules will search engines interpret the instructions contained in the file. Deviate from this, and "robots.txt" becomes nothing more than a regular text file, like Cinderella after midnight. Now that you know what to name your text file and where to upload it, you need to learn what to actually put in it to send commands to search engines that follow this protocol (formally the "Robots Exclusion Protocol"). The format is simple enough for most intents and purposes: a User-agent line to identify the crawler in question, followed by one or more Disallow: lines to disallow it from crawling certain parts of your site.
1) Here's a basic "robots.txt":

User-agent: *
Disallow: /

With the above declared, all robots (indicated by "*") are instructed to not index any of your pages (indicated by "/"). Most likely not what you want, but you get the idea.
2) Let's get a little more discriminatory now. While every webmaster loves Google, you may not want Google's Image bot crawling your site's images and making them searchable online at http://images.google.com, if only to save bandwidth. The below declaration will do the trick:

User-agent: Googlebot-Image
Disallow: /

3) The following disallows all search engines and robots from crawling select directories and pages:

User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm

4) You can conditionally target multiple robots in "robots.txt." Take a look at the below:

User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

This is interesting: here we declare that crawlers in general should not crawl any parts of our site, EXCEPT for Google, which is allowed to crawl the entire site apart from /cgi-bin/ and /privatedir/. So the rules of specificity apply, not inheritance.
5) There is a way to use Disallow: to essentially turn it into "Allow all", and that is by not entering a value after the colon (:):

User-agent: *
Disallow: /
User-agent: ia_archiver
Disallow:

Here I'm saying all crawlers should be prohibited from crawling our site, except for Alexa's ia_archiver (http://pages.alexa.com/help/webmasters/), which is allowed.
6) Finally, some crawlers now support an additional field called "Allow:", most notably, Google. As its name implies, "Allow:" lets you explicitly dictate what files/folders can be crawled. However, this field is currently not part of the "robots.txt" protocol, so my recommendation is to use it only if absolutely needed, as it might confuse some less intelligent crawlers.
Per Google's FAQs for webmasters (http://www.google.com/webmasters/faq.html), the below is the preferred way to disallow all crawlers from your site EXCEPT Google:

User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /

Web Reference:
http://www.javascriptkit.com/howto/robots.shtml


