Building robots.txt

May 05, 2013 10:32

Google searches for "Netflix X" where X is a movie title usually find the Netflix page for that movie, even though their robots.txt prevents it from being crawled. (Not too surprising--- if there are enough inbound links, Google can index a page even when it can't crawl the page itself.)

So out of curiosity I took a look at robots.txt, and not only is it commented but it also has a disabled portion listed:

# Uncomment this when we start generating sitemaps again.
#Sitemap: http://movies.netflix.com/sitemap_Movies.xml.gz

The implication is that the deployment process for the website just does a "copy", not a "build." This is pretty common for websites--- you can find lots of comments and comment-disabled portions in HTML and JavaScript documents. http://www.tintri.com/robots.txt has a ton of boilerplate text that obviously came with the web server or framework.

There are exceptions: http://cnn.com/robots.txt doesn't have any comments. http://facebook.com/robots.txt has comments that are directed at outside people (as they should be)! But what surprises me most is that I couldn't quickly find tools or a best-practices guide for stripping out "internally"-directed comments, other than the JavaScript compaction + obfuscation tools whose main goal is reducing size.
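The build step in question wouldn't need to be sophisticated. Here's a minimal sketch of what such a deploy-time filter could look like; the function name and the convention of dropping full-line `#` comments are my own assumptions, and note that a real tool would need some way to distinguish internal notes from comments intended for outside readers (Facebook-style), perhaps by a marker like `##`:

```python
import re

def strip_internal_comments(text: str) -> str:
    """Drop full-line '#' comments from a robots.txt-style file,
    keeping the directives themselves intact."""
    kept = []
    for line in text.splitlines():
        # robots.txt comments always start with '#'; drop lines that
        # are nothing but a comment, keep everything else verbatim.
        if line.lstrip().startswith("#"):
            continue
        kept.append(line)
    # Collapse blank-line runs left behind by removed comments.
    out = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return out.strip() + "\n"
```

Run against the Netflix snippet above, this would also delete the commented-out Sitemap line--- which is exactly the point: the disabled directive belongs in source control, not in the deployed artifact.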

web, programming
