GSiteCrawler Features
In general, the GSiteCrawler will take a listing of your website's URLs, let you edit their settings and generate Google Sitemap files. However, the GSiteCrawler is very flexible and allows you to do a whole lot more than "just" that!
Capture URLs for your site using
- a normal website crawl - emulating a Googlebot, looking for all links and pages within your website
- an import of an existing Google Sitemap file
- an import of a server log file (see the sketch after this list)
- an import of any text file with URLs in it
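To give a rough idea of what the server log import boils down to, here is a minimal sketch (my own illustration, not the GSiteCrawler's code) that pulls the requested paths out of an Apache-style access log and turns them into absolute URLs; the file name and host are placeholders:

```python
import re

LOG_FILE = "access.log"                   # placeholder log file name
BASE_URL = "http://www.example.com"       # placeholder site address

# In a "common" or "combined" log format the request shows up as
# a quoted "GET /path HTTP/1.x" section - capture the path part.
request_re = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

urls = set()
with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = request_re.search(line)
        if match:
            urls.add(BASE_URL + match.group(1))

for url in sorted(urls):
    print(url)
```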
The Crawler
- does a text-based crawl of each page, even finding URLs in JavaScript (see the sketch after this list)
- respects your robots.txt file
- respects robots meta tags for index / follow
- can run up to 15 crawlers in parallel
- can be throttled with a user-defined wait time between URLs
- can be controlled with filters, bans and automatic URL modifications
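As a rough sketch of what such a text-based, robots.txt-respecting crawl of a single page could look like (an illustration of the idea only, not the tool's source code - the start URL and user agent are placeholders):

```python
import re
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin

START_URL = "http://www.example.com/"     # placeholder start page
USER_AGENT = "Googlebot"                  # crawl as a Googlebot-like agent

# Fetch and obey the site's robots.txt before touching the page itself.
robots = urllib.robotparser.RobotFileParser(urljoin(START_URL, "/robots.txt"))
robots.read()

if robots.can_fetch(USER_AGENT, START_URL):
    request = urllib.request.Request(START_URL, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        text = response.read().decode("utf-8", errors="replace")

    # A plain-text scan also catches URLs that only appear inside JavaScript.
    candidates = re.findall(r'''["'](/[^"'\s<>]*|https?://[^"'\s<>]+)["']''', text)
    for url in sorted({urljoin(START_URL, u) for u in candidates}):
        print(url)
```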
With each page, it
- checks the date (from the server or using a date meta-tag) and size of the page
- checks title, description and keyword tags
- keeps track of the time required to download and crawl the page
Once the pages are in the database, you can
- modify Google Sitemap settings like "priority" and "change frequency"
- search for pages by URL parts, title, description or keyword tags
- filter pages based on custom criteria and adjust their settings globally
- edit, add and delete pages manually
Once you have everything the way you want it, you can export it as
- a Google Sitemap file in XML format (of course :-)) - with or without the optional attributes like "change date", "priority" or "change frequency" (see the example after this list)
- a text URL listing for other programs (or for use as a UrlList for Yahoo!)
- a simple RSS feed
- Excel / CSV files with URLs, settings and attributes like title, description, keywords
- a Google Base Bulk-Import file
- a ROR (Resources of Resources) XML file
- a static HTML sitemap file (with relative or absolute paths)
- a new robots.txt file based on your chosen filters
- ... or almost any type of file you want - the export function uses a user-adjustable, text-based template system
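For orientation, here is a small sketch of the kind of Google Sitemap XML file such an export produces, based on the public sitemaps.org protocol rather than on the GSiteCrawler's own template system; the URLs and values are made up:

```python
from xml.sax.saxutils import escape

# (URL, last modification date, change frequency, priority) - sample data only
pages = [
    ("http://www.example.com/",          "2006-05-01", "daily",   "1.0"),
    ("http://www.example.com/about.htm", "2006-03-15", "monthly", "0.5"),
]

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for loc, lastmod, changefreq, priority in pages:
    lines += ["  <url>",
              f"    <loc>{escape(loc)}</loc>",
              f"    <lastmod>{lastmod}</lastmod>",            # optional
              f"    <changefreq>{changefreq}</changefreq>",   # optional
              f"    <priority>{priority}</priority>",         # optional
              "  </url>"]
lines.append("</urlset>")

with open("sitemap.xml", "w", encoding="utf-8") as out:
    out.write("\n".join(lines) + "\n")
```

The lastmod, changefreq and priority elements are exactly the optional attributes mentioned above - leave them out and the file is still a valid sitemap.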
For more information, it also generates
- a general site overview with the number of URLs (total, crawlable, still in queue), the oldest URLs, etc.
- a listing of all broken URLs linked in your site (or otherwise inaccessible URLs found during the crawl)
- an overview of your site's speed, listing the largest pages, the slowest pages by total download time or download speed (unusually server-intensive pages), and those with the most processing time (usually pages with many links)
- an overview of URLs leading to "duplicate content" - with the option of automatically disabling those pages for the Google Sitemap file
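As an illustration of the idea behind the duplicate-content overview (not necessarily how the GSiteCrawler handles it internally), such URLs can be spotted by hashing each page body and grouping URLs that end up with the same hash; the URLs and page bodies below are made up:

```python
import hashlib
from collections import defaultdict

# Page bodies keyed by URL - in practice these come from the crawl results.
pages = {
    "http://www.example.com/index.htm":        "<html>Welcome</html>",
    "http://www.example.com/index.htm?ref=ad": "<html>Welcome</html>",
    "http://www.example.com/about.htm":        "<html>About us</html>",
}

groups = defaultdict(list)
for url, body in pages.items():
    groups[hashlib.md5(body.encode("utf-8")).hexdigest()].append(url)

for urls in groups.values():
    if len(urls) > 1:
        print("duplicate content:", ", ".join(urls))
```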
Additionally ...
- It can run on just about any Windows version from Windows 95b on up (tested on Windows Vista beta 1 and all server versions).
- It can use local MS-Access databases for re-use with other tools
- It can also use SQL-Server or MSDE databases for larger sites (requires a separate installation file).
- It can be run in a network environment, splitting crawlers over multiple computers - sharing the same database (for both Access and SQL-Server).
- It can be run automatically, either locally on the server or on a remote workstation, with automatic FTP upload of the sitemap file.
- It tests for and recognizes non-standard file-not-found pages (without HTTP result code 404).
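One common way to recognize such non-standard "file not found" pages - sketched below as an assumption, not as the tool's exact method - is to request a URL that almost certainly does not exist and fingerprint whatever the server returns with HTTP 200; the host name and probe path are placeholders:

```python
import hashlib
import urllib.error
import urllib.request

BASE_URL = "http://www.example.com"
PROBE = BASE_URL + "/this-page-should-not-exist-1234567890.html"

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()

status, body = fetch(PROBE)
if status == 200:
    # The server hides missing pages behind a normal page: remember its
    # fingerprint so crawled URLs returning the same content can be flagged.
    print("soft 404 detected, fingerprint:", hashlib.md5(body).hexdigest())
else:
    print("server returns a real HTTP", status, "for missing pages")
```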
... and much more! (if you can't find it here, send me a note and I'll either add it or show you how you can do it!)