GSiteCrawler Features

In general, the GSiteCrawler will take a listing of your website's URLs, let you edit their settings and generate Google Sitemap files. However, the GSiteCrawler is very flexible and allows you to do a whole lot more than "just" that!

Capture URLs for your site using

  • a normal website crawl - emulating a Googlebot, looking for all links and pages within your website
  • an import of an existing Google Sitemap file
  • an import of a server log file
  • an import of any text file with URLs in it

The Crawler

  • does a text-based crawl of each page, even finding URLs in JavaScript
  • respects your robots.txt file
  • respects robots meta tags for index / follow (see the example after this list)
  • can run up to 15 crawlers in parallel
  • can be throttled with a user-defined wait time between URLs
  • can be controlled with filters, bans and automatic URL modifications
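
To illustrate the kind of rules the crawler honours, here is a minimal robots.txt entry and a robots meta tag (the paths and values are just placeholders, not settings specific to the GSiteCrawler):

    User-agent: *
    Disallow: /admin/
    Disallow: /temp/

    <meta name="robots" content="noindex, nofollow">

URLs matching a Disallow rule for the crawler's user-agent are skipped, and "noindex" / "nofollow" values are taken into account when a crawled page is processed.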

With each page, it

  • checks the date (from the server or from a date meta-tag) and size of the page
  • checks the title, description and keyword tags (see the example after this list)
  • keeps track of the time required to download and crawl the page
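
As a reference, these are the kinds of head tags the crawler reads on each page (the values are placeholders; the exact name of the date meta-tag it recognizes may vary, as there is no single standard for it):

    <title>Example page title</title>
    <meta name="description" content="A short summary of this page">
    <meta name="keywords" content="sitemap, crawler, example">
    <meta name="date" content="2006-05-01">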

Once the pages are in the database, you can

  • modify Google Sitemap settings like "priority" and "change frequency"
  • search for pages by URL parts, title, description or keyword tags
  • filter pages based on custom criteria and adjust their settings globally
  • edit, add and delete pages manually

And once you have everything the way you want it, you can export it as

  • a Google Sitemap file in XML format (of course :-)) - with or without the optional attributes like "change date", "priority" or "change frequency" (see the example after this list)
  • a text URL listing for other programs (or for use as a UrlList for Yahoo!)
  • a simple RSS feed
  • Excel / CSV files with URLs, settings and attributes like title, description, keywords
  • a Google Base Bulk-Import file
  • a ROR (Resources of a Resource) XML file
  • a static HTML sitemap file (with relative or absolute paths)
  • a new robots.txt file based on your chosen filters
  • ... or almost any type of file you want - the export function uses a user-adjustable, text-based template system
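
To give an idea of the main export format, here is a minimal Google Sitemap file with the optional attributes filled in (the URL and values are placeholders; the exact XML namespace depends on the version of the Sitemap protocol you target):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/page.html</loc>
        <lastmod>2006-05-01</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>

Here "lastmod" corresponds to the "change date", "changefreq" to the "change frequency" (always, hourly, daily, weekly, monthly, yearly or never) and "priority" to a value between 0.0 and 1.0.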

To give you more information about your site, it also generates

  • a general site overview with the number of URLs (total, crawlable, still in the queue), the oldest URLs, etc.
  • a listing of all broken URLs linked in your site (or otherwise inaccessible URLs found during the crawl)
  • an overview of your site's speed, with the largest pages, the slowest pages by total download time or download speed (unusually server-intensive pages), and the pages with the most processing time (many links)
  • an overview of URLs leading to "duplicate content" - with the option of automatically disabling those pages for the Google Sitemap file

Additionally ...

  • It can run on just about any Windows version from Windows 95b on up (tested on Windows Vista beta 1 and all server versions).
  • It can use local MS-Access databases for re-use with other tools.
  • It can also use SQL-Server or MSDE databases for larger sites (requires a separate installation file).
  • It can be run in a network environment, splitting crawlers over multiple computers - sharing the same database (for both Access and SQL-Server).
  • It can be run automated, either locally on the server or on a remote workstation with automatic FTP upload of the sitemap file.
  • It tests for and recognizes non-standard file-not-found pages (without HTTP result code 404).

... and much more! (If you can't find it here, send me a note and I'll either add it or show you how you can do it!)
