List of changes and updates to the SOFTplus GSiteCrawler

Note: this listing includes release-versions as well as beta-versions. Release versions are marked with the tag "RELEASE". If you have a release version, it will automatically notify you of newer release versions when they are available. If you have a beta-version, the notifier will inform you of new beta-versions. You can install a beta-version or a release-version at any time to change your status.

v1.22 - March 21, 2007

  • changed: crawler queue is now a separate database (one for each chosen database) (MDB-version)
  • changed: refilter URLs: also filters crawler queue cache
  • changed: options dialog (now tabbed with room for new settings)
  • changed: description for URL+Title CSV file
  • added: option to compress database on startup (or ask)
  • added: recrawl only selected URLs
  • added: display the currently open database name in the header (MDB-version)
  • added: wildcard support to ban-URLs table (* = anything, $ = end of line; see the sketch after this list)
  • added: full regular-expressions (regex) support to ban-URLs table
  • added: open project folder via toolbar/menu on top
  • added: open sitemap folder via toolbar/menu on top
  • added: open sitemap file on the web via toolbar/menu on top
  • added: Romanian translations, thanks Erol!
  • added: language info on website (+ thanks)
  • fixed: typo in gss.xsl
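
The wildcard rules above can be illustrated with a small sketch. This is not the program's actual code, just a minimal Python illustration assuming the semantics given in the item (* matches any characters, a trailing $ anchors the end of the URL):

    import re

    def ban_pattern_to_regex(pattern):
        """Translate a ban-URL wildcard pattern into a regular expression:
        '*' matches any sequence of characters; a trailing '$' anchors the
        match at the end of the URL; everything else matches literally."""
        anchored = pattern.endswith("$")
        if anchored:
            pattern = pattern[:-1]
        body = re.escape(pattern).replace(r"\*", ".*")
        return re.compile(body + ("$" if anchored else ""))

    # ban every URL containing a session-id parameter
    rule = ban_pattern_to_regex("*sessionid=*")
    assert rule.search("http://example.com/page?sessionid=abc")

    # ban only URLs that end in .pdf
    rule = ban_pattern_to_regex("*.pdf$")
    assert rule.search("http://example.com/file.pdf")
    assert not rule.search("http://example.com/file.pdf?x=1")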

v1.21 - Nov 28, 2006

  • added French, Spanish and Turkish translations
  • updated for revised Yahoo! verification file format

v1.20 - Nov 20, 2006 RELEASE

  • fixed a pesky bug in the installer

v1.18 - Nov 19, 2006 RELEASE

  • removed expiration date
  • added donation-code to global settings (you'll get one when you donate through the website)
  • added donation screen (after 20 starts, 40% of the time, when no donation-code is found)
  • changed sitemap format to http://sitemaps.org namespace, v0.90
  • added french translations (thanks, Stéphane Brun of www.sbnet.fr!)
  • added translation information text-file
  • added direct links to page-tools (in URL tab)

v1.17

  • fixed bug with automatic notification of Yahoo! urllist-files (would only notify .gz version, even if not uploaded via FTP)

v1.16

  • added Yahoo! urllist to default action on "generate"
  • changed: "Generate" button now opens a window with options (instead of message-boxes for every action)
  • added Yahoo! urllist to "Generate"-menu directly
  • increased general text export speed + displays status-bar
  • added: Yahoo! urllist is now notified automatically (similar to the Google Sitemaps "ping")
  • added support for creation + upload of Yahoo! verification file (with FTP access)
  • added support for Yahoo! urllist creation and notification through automation
  • added / fixed many translations
  • added support for FTP/SSL implicit + explicit
  • added support for non-standard FTP ports
  • changed: installation files are now code-signed (from "SOFTplus Entwicklungen GmbH")

v1.14

  • Added: support for creation + upload of verification file (with FTP access)
  • Added: check for existing URLs; URLs not found on the server are kept selected
  • gss.xsl stylesheet cleaned up (moved CSS inline) + made Opera 9.0 compatible
  • Bug fixed: would sometimes not upload stylesheet automatically

v1.13

  • SQL-version: Fixed saving of FTP information per project

v1.12 - 1. May 2006 RELEASE

  • released to include changes in last beta-versions
  • Added: "phtml" as a default file extension to crawl
  • Added: Simpler choice of file extensions to include (not crawl) via the wizard
  • Valid until December 1, 2006

v1.11

  • internal version

v1.10

  • updated installation file for SQL-version
  • Added: can import URLs from a log file without using the crawler

v1.09

  • Private Version for "Helion.pl"

v1.08

  • Private version for "Markt und Technik"

v1.07 - 1. March 2006

  • Changed: empty lines in the ban-URLs / remove-parameters lists are ignored
  • Fixed: bug where it would not import from log files

v1.06 - 22. January 2006 RELEASE

  • MDB-Version: allow + create separate database files (as desired), shows last 10 in file-menu
  • new: asks to delete project folder when deleting project
  • new Export Template: HTML-sitemap with relative URLs (from project URL, e.g. "/file.htm" instead of "http://....")
  • new Export Template: Text URL-listing with relative URLs
  • expires June 1, 2006

v1.05 - 21. December 2005

  • added: can create a new "robots.txt" from ban-URLs filters (only when filters contain URLs from the start, i.e. with http://...; see the sketch after this list)
  • added: in the crawler-watch: right-click on aborted URLs to retry them (via context menu)
  • added: it is possible to specify a filename for sitemap file per project (default: empty = sitemap.xml)
  • added: it is possible to specify location for all project files (default: empty = in a subdirectory)
  • fix for country settings with a non-period decimal separator in the priority pull-down
  • fix for non-default ports when crawling
  • Access-MDB version: checks for a specific file size (950MB) upon start and after each crawled URL; the crawlers are stopped automatically to prevent a broken database
  • added: possibility to export settings+filters, or just filters from a project (for re-use in other projects)
  • added: possibility to import just settings+filters, or just filters
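
The robots.txt generation mentioned above can be sketched as follows. A minimal Python illustration, assuming (as the item states) that only filters given as absolute URLs can be translated; the function and variable names are illustrative, not the program's code:

    from urllib.parse import urlparse

    def robots_txt_from_ban_urls(ban_urls, site_root):
        """Emit robots.txt Disallow lines from absolute ban-URL filters.
        Only filters starting with the project's root URL can be used,
        since robots.txt paths are relative to the host."""
        lines = ["User-agent: *"]
        for url in ban_urls:
            if url.startswith(site_root):
                lines.append("Disallow: " + (urlparse(url).path or "/"))
        return "\n".join(lines) + "\n"

    print(robots_txt_from_ban_urls(
        ["http://example.com/private/", "http://example.com/tmp/"],
        "http://example.com/"))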

v1.04 - not publicly available

  • fixed bug in SQL-version when creating the sitemap file
  • fixed installer for SQL-version (please use the full installer for the SQL-version as well if it doesn't work)
  • added: template to export Google Base TSV-Files (see http://gsitecrawler.com/articles/google-base-simple-export.asp )

v1.03 - 03. November 2005

  • bug: speed statistics didn't work on MS-Access-DBs
  • changed: allow // in URLs (previously reduced to /); // is only allowed after a "?", e.g. as part of the parameters; // within the path is still reduced to / (see the sketch after this list)
  • bug: changing the frequency of several URLs to yearly via the pulldown would use 356 days instead of 365
  • added startup-debugging code (only in beta-versions): in file 'DEBUG_GSiteCrawler.LOG'
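
The // rule above can be illustrated with a short sketch (plain Python, not the program's code): collapse repeated slashes in the path, but leave the scheme separator and everything after "?" untouched:

    import re

    def normalize_double_slashes(url):
        """Collapse '//' in the path portion of a URL, but keep the
        scheme separator and the query string (after '?') intact."""
        scheme, sep, rest = url.partition("://")
        path, q, query = rest.partition("?")
        path = re.sub(r"/{2,}", "/", path)
        return scheme + sep + path + q + query

    assert normalize_double_slashes("http://myserver.com//page.htm") \
        == "http://myserver.com/page.htm"
    assert normalize_double_slashes("http://myserver.com/x?a=//b") \
        == "http://myserver.com/x?a=//b"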

v1.02 - 26. October 2005 RELEASE

  • added MS-SQL-Server / MSDE compatible installation (separate update to be installed)
  • added MS-SQL-Server / MSDE Database Setup program (DbSetup)
  • added option to only show first part of URLs (instead of all of them)
  • added option to set the refresh-frequency for the crawler-watch window (default for sql: 30 secs, access: 10 secs)
  • added option to clean up URLs: encode apostrophes as "%27", quotation marks as "%22", spaces as "%20" (see the sketch after this list)
  • added: refilter existing URL + crawler-queue list based on ban / drop / remove
  • changed: When starting crawling with an empty table, it will use the main URL instead
  • fixed bug in duplicate content statistic: would show URLs from other projects as well if duplicate to the ones in the current project
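
A minimal sketch of the URL clean-up option above (illustrative Python, assuming exactly the three replacements the item names):

    def clean_url(url):
        """Percent-encode the characters the clean-up option targets:
        apostrophe -> %27, quotation mark -> %22, space -> %20."""
        return (url.replace("'", "%27")
                   .replace('"', "%22")
                   .replace(" ", "%20"))

    assert clean_url("http://example.com/o'brien page.htm") \
        == "http://example.com/o%27brien%20page.htm"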

v1.01 - 27. September 2005

  • added option to include XML-stylesheet in sitemap-files or not (gss.xsl)
  • changed HTML-encoding of apostrophe from &apos; to &#39; (for IE)
  • updated error-handling in DbCompress (for broken databases)

v1.00 - 22. September 2005 RELEASE

  • added MSCOMCT2.OCX to update installer (was previously in the full installer only)

v0.99 - 21. September 2005

  • fixed bug crawling local directories
  • renamed ROR-template to ror.xml
  • moved expiration date to 1. November 2005
  • sorting added to the URL table: click on a column header to sort by that column
  • automation: can execute any command after automation is done, per project (e.g. Secure-FTP, local file copy, email logs, etc.)
  • priority in the crawler-queue can be adjusted on a per-project-basis

v0.98 - 11. September 2005

  • fixed FTP directories bug (wouldn't save the remote path unless you changed other FTP settings as well)
  • added active transfer for FTP
  • when doing the duplicate-content statistic, it offers to automatically disable the duplicate content found (keeps the first URL)
  • added export template for ROR (Resources of a Resource) - see http://rorweb.com

v0.97 - 31. August 2005

  • added gss.xsl stylesheet file for displaying sitemap files (double-click on sitemap.xml) (using v1.5a); it is automatically copied into the project path + sent via FTP to the server
  • added dropdown chooser for user agent (or type your own)
  • fixed typo in beta v0.96 which caused the crawler to sometimes miss new URLs
  • sitemap files are just checked for existence instead of being downloaded as a check after uploading

v0.96 - 30. August 2005

  • Speedup in processing pages
  • Fixed FTP (ha ha, I hope this time for real)
  • added more file extensions to check+include but not crawl. If you want this, you need to click on "defaults" in the tab. Total list is now: asx,avi,dnl,doc,gif,jpg,log,lwp,mid,mp3,mpeg,mov,pdf,png,ppt,ram,rtf,swf,txt,wks,wma,wri,xls,xml
  • added language choice (german / english) to installer + preset in the program afterwards

v0.95 - 27. August 2005 RELEASE

  • statistic: duplicate content (shows URLs with duplicate content in your sitemap files)
  • beta of Russian translations (requires codepage Windows-1251) - THANKS ANDREI!
  • simple automation: keeps a log per project of what was done and status, all details
  • simple automation: keeps a log per automation-run including all project logs (in the main program path)

v0.94 - 26. August 2005

  • German translations for full UI (switchable)
  • simple automation: see automation tab in project settings, start with /auto as command-line option.
  • optional action on error 404: nothing, remove from URL-list, mark as inactive
  • URLs with error 404 are only checked once (other errors 10 times)
  • extensions just to check for existence + include in sitemaps, not to crawl: pdf,doc,wri,txt,xls,ppt,rtf,wks,lwp,swf,dnl (by default, editable)
  • export templates: alternate markers in form of HTML comments, e.g. <!--MARK-->[content]<!--/MARK--> (still accepts <mark>[content]</mark> as well; see the sketch after this list)
  • export templates: changed simple html sitemap template to use new markers
  • export templates: added option to allow gz compression of resulting file (<opt-gz>1</opt-gz>)
  • export template: Yahoo URL list including gz'ed version for submitting here: http://submit.search.yahoo.com/free/request
  • expiration date extended to 1. October 2005 (someday I'll get the real version finished :-))
  • statistics: Speed statistics: largest pages, slowest pages, pages slowest to crawl
  • bug fixed: problems with Microsoft Office HTML Exported Excel pages (bad header blocks, unmatched HTML comments)
  • bug fixed: project export with non-standard characters in username or computer-name resulted in invalid xml files (re-importable)
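
The template markers above delimit the block that is repeated once per URL. A minimal sketch of how both marker styles can be recognized (illustrative Python; the <%url%> placeholder is a hypothetical stand-in for the template's real field syntax):

    import re

    # accept the new HTML-comment markers as well as the old <mark> form
    MARKER = re.compile(
        r"<!--MARK-->(.*?)<!--/MARK-->|<mark>(.*?)</mark>",
        re.DOTALL | re.IGNORECASE)

    def repeated_block(template):
        """Return the per-URL block of an export template,
        whichever marker style it uses, or None."""
        m = MARKER.search(template)
        if m is None:
            return None
        return m.group(1) if m.group(1) is not None else m.group(2)

    tpl = "<ul><!--MARK--><li><%url%></li><!--/MARK--></ul>"
    print(repeated_block(tpl))  # -> <li><%url%></li>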

v0.93 - 1. August 2005 RELEASE

  • updated license-file for version + date :-)
  • project option: remove html comments before crawling the page
  • bug fixed: removed empty lines in larger sitemap-files (they are ignored in XML, but it's nicer this way)
  • bug fixed: crawler watch: when empty, it showed some empty rows
  • bug fixed: crawler watch: when empty, it continued to show the last entry

v0.92 - 31. July 2005

  • support for "base href" tag in header
  • support for date meta-tags in HEAD (see the sketch after this list):
    • See also: http://gsitecrawler.com/articles/meta-tag-date.asp
    • Tags: meta dc.date.x-metadatalastmodified, meta dc.date.modified, meta dc.date.created, meta dc.date, meta date
    • Schemes: ISO8601, DCTERMS.W3CDTF, ISO.31-1:1992, IETF.RFC-822, ANSI.X3.30-1985, FGDC
    • Testsite: http://gsitecrawler.com/tests/meta-tag-date/
  • logs time to download, time to crawl per URL (for statistics later on)
  • import/export full project settings (for exchange with other PCs, support, etc.)
  • import sitemap files for editing (from other utilities)
  • create a new project based on old project
  • strips html comments when parsing HEAD
  • URL date and frequency now separated (frequency is just guessed the first time the site is crawled)
  • clicking on "show crawler" when the crawler is minimized will now normalize the crawler-window
  • Statistics: URL age is now calculated from the real date instead of the frequency-field
  • slight speedup in some places
  • bugfix: small problem when exporting the URL list (would sometimes bring the error "Subscript out of range")
  • bugfix: changing priority for multiple URLs would give an error in some code pages
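
A sketch of the date meta-tag lookup described above (illustrative Python, not the program's code; it tries the tags in the order listed and covers only the two most common schemes, ISO 8601 / W3CDTF and IETF.RFC-822):

    from datetime import datetime
    from email.utils import parsedate_to_datetime

    # tags searched in order of preference, per the list above
    DATE_TAGS = ["dc.date.x-metadatalastmodified", "dc.date.modified",
                 "dc.date.created", "dc.date", "date"]

    def parse_meta_date(value):
        for fmt in ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S%z"):
            try:
                return datetime.strptime(value, fmt)  # ISO 8601 / W3CDTF
            except ValueError:
                pass
        try:
            return parsedate_to_datetime(value)       # IETF.RFC-822
        except (TypeError, ValueError):
            return None

    def page_date(meta_tags):
        """meta_tags: dict of lower-cased name -> content from HEAD."""
        for name in DATE_TAGS:
            if name in meta_tags:
                date = parse_meta_date(meta_tags[name])
                if date is not None:
                    return date
        return None

    print(page_date({"dc.date.modified": "2005-07-31"}))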

v0.91 - 26. July 2005

  • fixed bug: some problems when crawling SSL-sites (https://)
  • fixed bug: too-large pages can't be crawled at all (not actually fixed, but documented as such)

v0.90 - 25. July 2005

  • fixed bug: robot meta tag "noarchive" would be interpreted as "nofollow" (see the sketch after this list)
  • fixed bug: when creating a new project it wouldn't select "date,frequency,priority in sitemap files"
  • added a test ftp upload (creates $test.xml and tries to upload it)
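
The "noarchive" fix above comes down to parsing the robots meta content value token by token. A minimal illustrative sketch (assumed semantics: "noarchive" affects caching only and must not disable link-following):

    def parse_robots_meta(content):
        """Parse a robots meta content value into (index, follow) flags."""
        tokens = {t.strip().lower() for t in content.split(",")}
        index = "noindex" not in tokens and "none" not in tokens
        follow = "nofollow" not in tokens and "none" not in tokens
        return index, follow

    assert parse_robots_meta("noarchive") == (True, True)
    assert parse_robots_meta("noindex, nofollow") == (False, False)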

v0.89 - 24. July 2005 RELEASE

  • full setup for 95/98/me
  • internet update check - is my version current?
  • fixed bug: if you pause the crawler, and then make a change to a project and click "recrawl url", it will still be paused
  • fixed bug: creating a new project via the wizard and selecting NOT to crawl the site and NOT to upload would still start crawling after pressing FINISH

v0.88 - 22. July 2005

  • fixed bug where it would sometimes hang while crawling non-standard mailto: links
  • fixed bug where it would add "http://www.domain.com" to the URL list even if "http://www.domain.com/" was the main URL (only via Google or log-file import)

v0.87 - 21. July 2005

  • optionally leave date last modified, priority or frequency out of Google sitemap-file
  • compress database (via separate program, also repairs databases if needed)
  • maximum HTML page size (in KB), to protect against never-ending pages or download-redirects (default 200 KB, adjustable)

v0.86b - 20. July 2005

  • additional DLL required for gzip component

v0.86 - 19. July 2005

  • gzip hopefully really fixed
  • highlight and process multiple URLs within a project
    • You can highlight them using shift+click or using the search-filter
    • Things you can do with multiple URLs selected:
      • delete from list
      • recrawl these
      • requery these
      • change frequency
      • change include setting: all on, all off, swap settings
      • change crawl setting: all on, all off, swap settings
      • change manual setting: all on, all off, swap settings

v0.85 - 18. July 2005

  • Updated RSS Feed template: fixed date issue (would use the wrong field)
  • added new site wizard:
    • asks for URL
    • checks URL for validity, checks whether the server is a Microsoft server (sets case-sensitivity on/off)
    • can filter known session-ids automatically
    • can check robots.txt automatically
    • can check known URLs in Google automatically
    • will scan site
    • will create sitemaps-file automatically
    • will upload to FTP server (if specified)
    • will ping Google :-)
  • fixed bug: bad sitemap files for large files
  • fixed bug: description/keyword meta data was kept from the previous page
  • changed FTP transfer to force binary mode + passive mode

v0.84

  • internal release

v0.83 - 13. July 2005

  • fixed URL case-sensitivity issue (would always work case-sensitively)
  • force case selection: for hash calculation, etc.
  • fixed setting remove trailing slash (wouldn't save)
  • recognize custom error pages: enter text from page (from html-source)
  • gzip works, new component (really now, hopefully :-))

v0.82 - 12. July 2005

  • better hash model (needs time on first startup to recalc existing URL hashes)

v0.81 - 09. July 2005

  • fixed: ftp wouldn't wait for gzip to finish - sometimes uploaded 0kb files
  • fixed (hopefully): bug where the keyboard sometimes locked
  • added: submits to Google via Ping (should submit via sitemaps-account the first time though), checks results
  • added support for Win95/98/ME
  • fixed: bug with crawling URLs with an apostrophe in them
  • expiration-date 1. Sept. 2005 (more time to complete "final" version)
  • log errors when crawling: reset on "recrawl"; the list is added to the statistics tab
  • added last robots.txt to statistics tab
  • checked + fixed tab order
  • added search for URL sections
  • fixed: problems with invalid URLs with double-slashes in them, e.g. http://myserver.com//page.htm
  • fixed: issue where the crawler would crawl external sites if you cleared the crawler queue + had it re-crawl a site right afterwards

v0.80 - 30. June 2005

  • toolbar on top:
    • start / stop crawler | show crawler
    • new project
    • import: from log, from google, from sitemap, from robots.txt
    • Crawl this site
    • Generate Sitemap
    • Generate export-file
    • Generate statistics
  • statusbar on bottom:
    • crawler-status / statistics
    • date/time
    • resize-handle
  • FTP upload possible
    • save ftp-infos with project
    • allow several ftp proxy settings
    • ftp upload of sitemap / sitemap-index files
    • automatic checking of ftp uploads
    • test of ftp settings possible
  • new tab organisation, moved some tabs around
  • clearer layout in the main tabs
  • project settings
    • remove trailing slash on folder-names
  • global options dialog:
    • Number of crawlers (1-15)
    • specify if crawler should be running on start
    • user-agent string to use when crawling, accessing, etc.
    • proxy settings (server, port, authentication, bypass-list)
    • wait-time between URLs per Crawler
    • timeout on retrieving URLs
    • options for sitemap-splits (number of URLs or file size, whichever comes first)
  • short notice when crawler is done ( + time spent crawling)
  • crawl files in a local directory with given base URL

v0.79 - 29. June 2005

  • bug fixed: manual URLs sometimes not exported to the sitemap

v0.78 - 29. June 2005 - RELEASE

  • Export URL list to files, several formats included (text, csv, html, xml, rss, etc.), easy to make your own,
  • Saves page size + checks update status via a page-content-hash (works for server-generated pages; see the sketch after this list),
  • fixed bug: "Couldn't find Installable ISAM" when running on certain systems,
  • now only crawls URLs below the main-url specified (before: everything on the same server),
  • bug fixed: on pages with a large number of links (>1000) it would sometimes miss a few links,
  • bug fixed: sometimes had difficulty finding links on pages with URLs in single apostrophes
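
The page-content-hash mentioned above lets the crawler detect changes even when the server sends no useful Last-Modified header. A minimal sketch (the actual hash algorithm used by the program is not documented here; MD5 is just an assumption for illustration):

    import hashlib

    def page_hash(body):
        """Hash the raw page content so a later crawl can tell whether
        the page actually changed (typical for generated pages)."""
        return hashlib.md5(body).hexdigest()

    old = page_hash(b"<html>v1</html>")
    new = page_hash(b"<html>v2</html>")
    print("changed" if old != new else "unchanged")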

v0.77 - 27. June 2005 - RELEASE

  • windows resizable for 800x600 systems,
  • the crawler window is also resizable,
  • the crawler window can be minimized + then shows URLs in queue in caption,
  • removed lockdown for multiple starts of same exe (to allow the crawler to work in the background),
  • works with Unicode websites on Unicode systems (Japanese, etc.).

v0.75 - 25. June 2005

  • Small bugfix regarding the display of URLs generated with the first version of this program.

v0.74 - 24. June 2005 - RELEASE

  • Faster processing (again)
  • resizable main window
  • saves window positions
  • sitemaps now saved in per-project subdirectories
  • sitemaps now saved by default as sitemap.xml
  • can import and process standards-conformant robots.txt files into the banned URL listing
  • will recognize and comply with the robots-metatag in the head-section of the page (include/follow)
  • will save + display the robots-metatag in the URL-listing
  • automatic gzip-support
  • automatically split sitemap-files (at 5000kb or 20000 links, whichever comes first; see the sketch after this list)
  • automatically create sitemap index when several sitemaps are generated
  • statistics tab with lots of information (# of URLs, errors, oldest pages, robots-file, etc.)
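
The splitting rule above can be sketched in a few lines (illustrative Python, using the limits from the item; namespace and other file-format details are omitted). Each part becomes one sitemap file, and a sitemap index file points at all the parts:

    MAX_BYTES = 5000 * 1024   # 5000kb per file
    MAX_URLS = 20000          # 20000 links per file

    def split_sitemap(entries):
        """Yield lists of <url> entry strings, each list small
        enough for a single sitemap file."""
        part, size = [], 0
        for entry in entries:
            if part and (len(part) >= MAX_URLS
                         or size + len(entry) > MAX_BYTES):
                yield part
                part, size = [], 0
            part.append(entry)
            size += len(entry)
        if part:
            yield part

    def sitemap_index(sitemap_urls):
        items = "".join("<sitemap><loc>%s</loc></sitemap>" % u
                        for u in sitemap_urls)
        return ('<?xml version="1.0" encoding="UTF-8"?>'
                '<sitemapindex>%s</sitemapindex>' % items)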

v0.73 - 20. June 2005 - RELEASE

  • Faster processing
  • support for long URLs (up to 2000 characters)
  • better support of malformed/non-standard HTML-pages
  • better support for parameter-links
  • report for aborted URLs from the CrawlerWatch-Window

v0.71 - 17. June 2005 - RELEASE

  • Updated installer
  • partial full version (without VB6 / DAO 3.5 runtimes)
  • more status-information when processing larger sites

v0.70 - 17. June 2005 - RELEASE

  • first public version.


