GSiteCrawler Features
In general, the GSiteCrawler will take a listing of your website's URLs, let you edit their settings and generate Google Sitemap files. However, the GSiteCrawler is very flexible and allows you to do a whole lot more than "just" that!
Capture URLs for your site using
- a normal website crawl - emulating a Googlebot, looking for all links and pages within your website
- an import of an existing Google Sitemap file
- an import of a server log file (see the sketch after this list)
- an import of any text file with URLs in it
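To give a rough idea of what the server log import boils down to, here is a minimal sketch (my own illustration, not the GSiteCrawler's code) that pulls the requested paths out of an Apache-style access log and turns them into absolute URLs; the file name and host are placeholders:

```python
import re

LOG_FILE = "access.log"                   # placeholder log file name
BASE_URL = "http://www.example.com"       # placeholder site address

# In a "common" or "combined" log format the request shows up as
# a quoted "GET /path HTTP/1.x" section - capture the path part.
request_re = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

urls = set()
with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = request_re.search(line)
        if match:
            urls.add(BASE_URL + match.group(1))

for url in sorted(urls):
    print(url)
```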
The Crawler
- does a text-based crawl of each page, even finding URLs in JavaScript (see the sketch after this list)
- respects your robots.txt file
- respects robots meta tags for index / follow
- can run up to 15 crawlers in parallel
- can be throttled with a user-defined wait time between URLs
- can be controlled with filters, bans and automatic URL modifications
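As a rough sketch of what such a text-based, robots.txt-respecting crawl of a single page could look like (an illustration of the idea only, not the tool's source code - the start URL and user agent are placeholders):

```python
import re
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin

START_URL = "http://www.example.com/"     # placeholder start page
USER_AGENT = "Googlebot"                  # crawl as a Googlebot-like agent

# Fetch and obey the site's robots.txt before touching the page itself.
robots = urllib.robotparser.RobotFileParser(urljoin(START_URL, "/robots.txt"))
robots.read()

if robots.can_fetch(USER_AGENT, START_URL):
    request = urllib.request.Request(START_URL, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        text = response.read().decode("utf-8", errors="replace")

    # A plain-text scan also catches URLs that only appear inside JavaScript.
    candidates = re.findall(r'''["'](/[^"'\s<>]*|https?://[^"'\s<>]+)["']''', text)
    for url in sorted({urljoin(START_URL, u) for u in candidates}):
        print(url)
```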
With each page, it
- checks the date (from the server or using a date meta-tag) and size of the page
- checks title, description and keyword tags
- keeps track of the time required to download and crawl the page
Once the pages are in the database, you can
- modify Google Sitemap settings like "priority" and "change frequency"
- search for pages by URL parts, title, description or keyword tags
- filter pages based on custom criteria and adjust their settings globally
- edit, add and delete pages manually
Once you have everything the way you want it, you can export it as
- a Google Sitemap file in XML format (of course :-)) - with or without the optional attributes like "change date", "priority" or "change frequency" (see the example after this list)
- a text URL listing for other programs (or for use as a UrlList for Yahoo!)
- a simple RSS feed
- Excel / CSV files with URLs, settings and attributes like title, description, keywords
- a Google Base Bulk-Import file
- a ROR (Resources of Resources) XML file
- a static HTML sitemap file (with relative or absolute paths)
- a new robots.txt file based on your chosen filters
- ... or almost any type of file you want - the export function uses a user-adjustable, text-based template system
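For orientation, here is a small sketch of the kind of Google Sitemap XML file such an export produces, based on the public sitemaps.org protocol rather than on the GSiteCrawler's own template system; the URLs and values are made up:

```python
from xml.sax.saxutils import escape

# (URL, last modification date, change frequency, priority) - sample data only
pages = [
    ("http://www.example.com/",          "2006-05-01", "daily",   "1.0"),
    ("http://www.example.com/about.htm", "2006-03-15", "monthly", "0.5"),
]

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for loc, lastmod, changefreq, priority in pages:
    lines += ["  <url>",
              f"    <loc>{escape(loc)}</loc>",
              f"    <lastmod>{lastmod}</lastmod>",            # optional
              f"    <changefreq>{changefreq}</changefreq>",   # optional
              f"    <priority>{priority}</priority>",         # optional
              "  </url>"]
lines.append("</urlset>")

with open("sitemap.xml", "w", encoding="utf-8") as out:
    out.write("\n".join(lines) + "\n")
```

The lastmod, changefreq and priority elements are exactly the optional attributes mentioned above - leave them out and the file is still a valid sitemap.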
For more information, it also generates
- a general site overview with the number of URLs (total, crawlable, still in queue), the oldest URLs, etc.
- a listing of all broken URLs linked in your site (or otherwise inaccessible URLs found during the crawl)
- an overview of your site's speed, listing the largest pages, the slowest pages by total download time or download speed (unusually server-intensive pages), and those with the most processing time (usually pages with many links)
- an overview of URLs leading to "duplicate content" - with the option of automatically disabling those pages for the Google Sitemap file
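As an illustration of the idea behind the duplicate-content overview (not necessarily how the GSiteCrawler handles it internally), such URLs can be spotted by hashing each page body and grouping URLs that end up with the same hash; the URLs and page bodies below are made up:

```python
import hashlib
from collections import defaultdict

# Page bodies keyed by URL - in practice these come from the crawl results.
pages = {
    "http://www.example.com/index.htm":        "<html>Welcome</html>",
    "http://www.example.com/index.htm?ref=ad": "<html>Welcome</html>",
    "http://www.example.com/about.htm":        "<html>About us</html>",
}

groups = defaultdict(list)
for url, body in pages.items():
    groups[hashlib.md5(body.encode("utf-8")).hexdigest()].append(url)

for urls in groups.values():
    if len(urls) > 1:
        print("duplicate content:", ", ".join(urls))
```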
Additionally ...
- It can run on just about any Windows version from Windows 95b on up (tested on Windows Vista beta 1 and all server versions).
- It can use local MS-Access databases for re-use with other tools
- It can also use SQL-Server or MSDE databases for larger sites (requires a separate installation file).
- It can be run in a network environment, splitting crawlers over multiple computers - sharing the same database (for both Access and SQL-Server).
- It can be run automatically, either locally on the server or on a remote workstation, with automatic FTP upload of the sitemap file.
- It tests for and recognizes non-standard file-not-found pages (without HTTP result code 404).
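One common way to recognize such non-standard "file not found" pages - sketched below as an assumption, not as the tool's exact method - is to request a URL that almost certainly does not exist and fingerprint whatever the server returns with HTTP 200; the host name and probe path are placeholders:

```python
import hashlib
import urllib.error
import urllib.request

BASE_URL = "http://www.example.com"
PROBE = BASE_URL + "/this-page-should-not-exist-1234567890.html"

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()

status, body = fetch(PROBE)
if status == 200:
    # The server hides missing pages behind a normal page: remember its
    # fingerprint so crawled URLs returning the same content can be flagged.
    print("soft 404 detected, fingerprint:", hashlib.md5(body).hexdigest())
else:
    print("server returns a real HTTP", status, "for missing pages")
```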
... and much more! (if you can't find it here, send me a note and I'll either add it or show you how you can do it!)