List of changes and updates to the SOFTplus GSiteCrawler
Note: this listing includes release-versions as well as beta-versions. Release versions are marked with a respective tag ("RELEASE"). If you have a release version, it will automatically notify you of newer release versions when they are available. If you have a beta-version, the notifier will inform you of new beta-versions. You can install a beta-version or a release-version at any time to change your status.
v1.22 - March 21, 2007
- changed: the crawler queue is now a separate database file (one for each chosen database) (MDB-version)
- changed: refilter URLs: also filters crawler queue cache
- changed: options dialog (now tabbed with room for new settings)
- changed: description for URL+Title CSV file
- added: option to compress database on startup (or ask)
- added: recrawl only selected URLs
- added: display the currently open database name in the header (MDB-version)
- added: wildcard support to ban-URLs table (* = anything, $ = end of line; see the sketch after this entry)
- added: full regular-expressions (regex) support to ban-URLs table
- added: open project folder via toolbar/menu on top
- added: open sitemap folder via toolbar/menu on top
- added: open sitemap file on the web via toolbar/menu on top
- added: Romanian translations, thanks Erol!
- added: language info on website (+ thanks)
- fixed: typo in gss.xsl
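
The ban-URL wildcard syntax above can be illustrated with a short sketch. This is not the program's code (GSiteCrawler itself ships as a VB6 application); it only shows, in Python, one plausible way the * and $ tokens translate into a regular expression. The function names and sample filters are made up:

    import re

    def wildcard_to_regex(pattern):
        # '*' matches any run of characters; a trailing '$' anchors the
        # pattern to the end of the URL. Everything else is literal.
        anchored = pattern.endswith("$")
        body = pattern[:-1] if anchored else pattern
        regex = ".*".join(re.escape(part) for part in body.split("*"))
        return re.compile(regex + ("$" if anchored else ""))

    def is_banned(url, filters):
        return any(wildcard_to_regex(f).search(url) for f in filters)

    # Ban all print versions and anything ending in ".tmp":
    assert is_banned("http://example.com/page.php?print=1", ["*print=*"])
    assert is_banned("http://example.com/file.tmp", ["*.tmp$"])
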
v1.21 - Nov 28, 2006
- added French, Spanish, and Turkish translations
- updated for revised Yahoo! verification file format
v1.20 - Nov 20, 2006 RELEASE
- fixed a pesky bug in the installer
v1.18 - Nov 19, 2006 RELEASE
- removed expiration date
- added donation-code to global settings (you'll get one when you donate through the website)
- added donation screen (after 20 starts, 40% of the time, when no donation-code is found)
- changed sitemap format to the http://sitemaps.org namespace, v0.90 (see the example after this entry)
- added French translations (thanks, Stéphane Brun of www.sbnet.fr!)
- added translation information text-file
- added direct links to page-tools (in URL tab)
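
For reference, a minimal sitemap file in the sitemaps.org v0.90 format mentioned above looks like this (the URL and the optional lastmod/changefreq/priority values are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2006-11-19</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>
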
v1.17
- fixed bug with automatic notification of Yahoo! urllist-files (would only notify .gz version, even if not uploaded via FTP)
v1.16
- added Yahoo! urllist to default action on "generate"
- changed: "Generate" button now opens a window with options (instead of message-boxes for every action)
- added Yahoo! urllist to "Generate"-menu directly
- increased general text-export speed + displays a status bar
- added: Yahoo! urllist files are now notified automatically (similar to the Google Sitemaps "ping")
- added support for creation + upload of Yahoo! verification file (with FTP access)
- added support for Yahoo! urllist creation and notification through automation
- added / fixed many translations
- added support for FTP/SSL implicit + explicit
- added support for non-standard FTP ports
- changed: installation files are now code-signed (from "SOFTplus Entwicklungen GmbH")
v1.14
- Added: support for creation + upload of verification file (with FTP access)
- Added: check for existing URLs; URLs not found on the server are kept selected
- gss.xsl stylesheet cleaned up (moved CSS inline) + made Opera 9.0 compatible
- Bug fixed: would sometimes not upload stylesheet automatically
v1.13
- SQL-version: Fixed saving of FTP information per project
v1.12 - 1. May 2006 RELEASE
- released to include changes in last beta-versions
- Added: "phtml" as a default file extension to crawl
- Added: Simpler choice of file extensions to include (not crawl) via the wizard
- Valid until December 1, 2006
v1.11
- internal version
v1.10
- updated installation file for SQL-version
- Added: can import URLs from a log file without using the crawler
v1.09
- Private Version for "Helion.pl"
v1.08
- Private version for "Markt und Technik"
v1.07 - 1. March 2006
- Changed: empty ban-urls / remove parameters lines are ignored
- Fixed: bug where it would not import from log files
v1.06 - 22. January 2006 RELEASE
- MDB-Version: allow + create separate database files (as desired), shows last 10 in file-menu
- new: asks to delete project folder when deleting project
- new Export Template: HTML-sitemap with relative URLs (from the project URL, e.g. "/file.htm" instead of "http://...")
- new Export Template: Text URL-listing with relative URLs
- expires June 1, 2006
v1.05 - 21. December 2005
- added: can create a new "robots.txt" from ban-URLs filters (only when the filters contain full URLs from the start, i.e. with http://...; see the sketch after this entry)
- added: in the crawler-watch: right-click on aborted URLs to retry them (via context menu)
- added: it is possible to specify a filename for sitemap file per project (default: empty = sitemap.xml)
- added: it is possible to specify location for all project files (default: empty = in a subdirectory)
- fix for country settings with a non-period decimal separator in the priority pull-down
- fix for non-default ports when crawling
- Access-MDB version: checks for a specific file size (950MB) being reached, on startup + after each crawled URL; the crawlers are stopped automatically to prevent a broken database
- added: possibility to export settings+filters, or just filters from a project (for re-use in other projects)
- added: possibility to import just settings+filters, or just filters
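
The robots.txt generation above can be sketched as follows (Python, illustrative only; the function name and sample filters are made up). The point of the "full URLs only" restriction is that only an absolute filter carries a path that can become a Disallow line:

    from urllib.parse import urlparse

    def robots_from_ban_urls(ban_urls):
        # Only filters given as absolute URLs (http://...) carry enough
        # information to derive a robots.txt path.
        lines = ["User-agent: *"]
        for u in ban_urls:
            if u.startswith("http://") or u.startswith("https://"):
                lines.append("Disallow: " + (urlparse(u).path or "/"))
        return "\n".join(lines) + "\n"

    print(robots_from_ban_urls(["http://example.com/private/",
                                "http://example.com/tmp/"]))
    # User-agent: *
    # Disallow: /private/
    # Disallow: /tmp/
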
v1.04 - not publicly available
- fixed bug: in sql-version when creating sitemap file
- fixed installer for the sql-version (if it still doesn't work, please use the full installer for the sql-version as well)
- added: template to export Google Base TSV-Files (see http://gsitecrawler.com/articles/google-base-simple-export.asp )
v1.03 - 03. November 2005
- bug: speed statistics didn't work on MS-Access-DBs
- changed: // is now allowed in URLs (it was previously filtered to /); it is only allowed after a "?", e.g. as part of the parameters. URLs with // in part of the path will still have it reduced to /.
- bug: changing the frequency of several URLs to yearly via the pulldown would use 356 days instead of 365
- added startup-debugging code (only in beta-versions): in file 'DEBUG_GSiteCrawler.LOG'
v1.02 - 26. October 2005 RELEASE
- added MS-SQL-Server / MSDE compatible installation (separate update to be installed)
- added MS-SQL-Server / MSDE Database Setup program (DbSetup)
- added option to only show first part of URLs (instead of all of them)
- added option to set the refresh-frequency for the crawler-watch window (default for sql: 30 secs, access: 10 secs)
- added option to clean up URLs: encode apostrophe as "%27", quotation marks as "%22", space as "%20" (see the sketch after this entry)
- added: refilter existing URL + crawler-queue list based on ban / drop / remove
- changed: When starting crawling with an empty table, it will use the main URL instead
- fixed bug in duplicate content statistic: would show URLs from other projects as well if duplicate to the ones in the current project
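
The URL clean-up rule above amounts to percent-encoding three characters. A minimal sketch (Python, illustrative only):

    def clean_url(url):
        # Encode the three characters named in the changelog entry.
        return (url.replace(" ", "%20")
                   .replace("'", "%27")
                   .replace('"', "%22"))

    assert clean_url("http://example.com/my page.htm") == \
           "http://example.com/my%20page.htm"
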
v1.01 - 27. September 2005
- added option to include XML-stylesheet in sitemap-files or not (gss.xsl)
- changed HTML-encoding of apostrophe from &apos; to &#39; (for IE)
- updated error-handling in DbCompress (for broken databases)
v1.00 - 22. September 2005 RELEASE
- added MSCOMCT2.OCX to update installer (was previously in the full installer only)
v0.99 - 21. September 2005
- fixed bug crawling local directories
- renamed ROR-template to ror.xml
- moved expiration date to 1. November 2005
- sorting added to the URL table: just click on the headers; it can sort by any column
- automation: can execute any command after automation is done, per project (eg. Secure-FTP, local file copy, email logs, etc.)
- priority in the crawler-queue can be adjusted on a per-project-basis
v0.98 - 11. September 2005
- fixed FTP directories bug (wouldn't save the remote path unless you changed other FTP settings as well)
- added active transfer for FTP
- when running the duplicate-content statistic, it offers to automatically disable the duplicate content found (keeps the first URL)
- added export template for ror (resources of resources) - see http://rorweb.com
v0.97 - 31. August 2005
- added gss.xsl stylesheet file for displaying sitemap files (double-click on sitemap.xml) (using v1.5a); it is automatically copied into the project path + sent via FTP to the server
- added dropdown chooser for user agent (or type your own)
- fixed typo in beta v0.96 which caused the crawler to sometimes miss new URLs
- sitemap files are now just checked for existence instead of being downloaded as a check after uploading
v0.96 - 30. August 2005
- Speedup in processing pages
- Fixed FTP (ha ha, I hope this time for real)
- added more file extensions to check+include but not crawl. If you want this, you need to click on "defaults" in the tab. The total list is now: asx,avi,dnl,doc,gif,jpg,log,lwp,mid,mp3,mpeg,mov,pdf,png,ppt,ram,rtf,swf,txt,wks,wma,wri,xls,xml
- added language choice (German / English) to installer + preset in the program afterwards
v0.95 - 27. August 2005 RELEASE
- statistic: duplicate content (shows URLs with duplicate content in your sitemap files)
- beta of Russian translations (requires codepage Windows 1251) - THANKS ANDREI!
- simple automation: keeps a log per project of what was done and status, all details
- simple automation: keeps a log per automation-run including all project logs (in the main program path)
v0.94 - 26. August 2005
- German translations for full UI (switchable)
- simple automation: see automation tab in project settings, start with /auto as command-line option.
- optional action on error 404: nothing, remove from URL-list, mark as inactive
- URLs with error 404 are only checked once (other errors 10 times)
- extensions just to check for existence + include in sitemaps, but not crawl: pdf,doc,wri,txt,xls,ppt,rtf,wks,lwp,swf,dnl (by default, editable)
- export templates: alternate markers in the form of HTML comments, e.g. <!--MARK-->[content]<!--/MARK--> (still accepts <mark>[content]</mark> as well; see the sketch after this entry)
- export templates: changed simple html sitemap template to use new markers
- export templates: added option to allow gz compression of resulting file (<opt-gz>1</opt-gz>)
- export template: Yahoo URL list including gz'ed version for submitting here: http://submit.search.yahoo.com/free/request
- expiration date extended to 1. October 2005 (someday I'll get the real version finished :-))
- statistics: Speed statistics: largest pages, slowest pages, pages slowest to crawl
- bug fixed: problems with Microsoft Office HTML Exported Excel pages (bad header blocks, unmatched HTML comments)
- bug fixed: project export with non-standard characters in username or computer-name resulted in invalid xml files (re-importable)
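
The template markers above can be illustrated with a short sketch (Python, illustrative only; the marker name ITEM and the %URL%/%TITLE% placeholders are made-up examples, not the program's actual template variables):

    import re

    def extract_block(template, name):
        # Accept both the new HTML-comment markers and the older tag form.
        for pat in (rf"<!--{name}-->(.*?)<!--/{name}-->",
                    rf"<{name}>(.*?)</{name}>"):
            m = re.search(pat, template, re.DOTALL | re.IGNORECASE)
            if m:
                return m.group(1)
        raise ValueError("marker %r not found" % name)

    tpl = "<ul><!--ITEM--><li><a href='%URL%'>%TITLE%</a></li><!--/ITEM--></ul>"
    print(extract_block(tpl, "ITEM"))  # <li><a href='%URL%'>%TITLE%</a></li>
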
v0.93 - 1. August 2005 RELEASE
- updated license-file for version + date :-)
- project option: remove html comments before crawling the page
- bug fixed: removed empty lines in larger sitemap-files (they are ignored in XML, but it's nicer this way)
- bug fixed: crawler watch: when empty, it showed some empty rows
- bug fixed: crawler watch: when empty, it continued to show the last entry
v0.92 - 31. July 2005
- support for "base href" tag in header
- support for date meta-tags in HEAD (see the sketch after this entry):
- See also: http://gsitecrawler.com/articles/meta-tag-date.asp
- Tags: meta dc.date.x-metadatalastmodified, meta dc.date.modified, meta dc.date.created, meta dc.date, meta date
- Schemes: ISO8601, DCTERMS.W3CDTF, ISO.31-1:1992, IETF.RFC-822, ANSI.X3.30-1985, FGDC
- Testsite: http://gsitecrawler.com/tests/meta-tag-date/
- logs time to download, time to crawl per URL (for statistics later on)
- import/export full project settings (for exchange with other PCs, support, etc.)
- import sitemap files for editing (from other utilities)
- create a new project based on old project
- strips html comments when parsing HEAD
- URL date and frequency are now separated (frequency is just guessed the first time the site is crawled)
- clicking on "show crawler" when the crawler is minimized will now restore the crawler window
- Statistics: URL age is now calculated from the real date instead of the frequency-field
- slight speedup in some places
- bugfix: small problem when exporting the URL list (would sometimes bring the error "Subscript out of range")
- bugfix: changing priority for multiple URLs would give an error in some code pages
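
Picking up the date meta-tags listed above can be sketched like this (Python, illustrative only; the tag list is taken from the entry, but the lookup itself and any precedence between the tags are assumptions):

    from html.parser import HTMLParser

    DATE_TAGS = ("dc.date.x-metadatalastmodified", "dc.date.modified",
                 "dc.date.created", "dc.date", "date")

    class DateMetaParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.dates = {}
        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            if name in DATE_TAGS and "content" in a:
                self.dates[name] = a["content"]

    p = DateMetaParser()
    p.feed('<head><meta name="DC.date.modified" content="2005-07-31"></head>')
    print(p.dates)  # {'dc.date.modified': '2005-07-31'}
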
v0.91 - 26. July 2005
- fixed bug: some problems when crawling SSL-sites (https://)
- fixed bug: too-large pages can't be crawled at all (not actually fixed, but documented as such)
v0.90 - 25. July 2005
- fixed bug: robot meta tag "noarchive" would be interpreted as "nofollow"
- fixed bug: when creating a new project it wouldn't select "date,frequency,priority in sitemap files"
- added a test ftp upload (creates $test.xml and tries to upload it)
v0.89 - 24. July 2005 RELEASE
- full setup for 95/98/me
- internet update check - is my version current?
- fixed bug: if you pause the crawler, and then make a change to a project and click "recrawl url", it will still be paused
- fixed bug: creating a new project (wizard) and selecting NOT to crawl the site and NOT to upload would still start crawling after pressing FINISH
v0.88 - 22. July 2005
- fixed bug where it would sometimes hang while crawling non-standard mailto: links
- fixed bug where it would add "http://www.domain.com" to the URL list even if "http://www.domain.com/" was the main URL (only via Google or log-file import)
v0.87 - 21. July 2005
- optionally leave date last modified, priority or frequency out of Google sitemap-file
- compress database (via a separate program, which also repairs databases if needed)
- maximum HTML page size (in KB), to protect against never-ending pages or download-redirects (default 200kb, adjustable)
v0.86b - 20. July 2005
- additional DLL required for gzip component
v0.86 - 19. July 2005
- gzip hopefully really fixed
- highlight and process multiple URLs within a project
- You can highlight them using shift+click or using the search-filter
- Things you can do with multiple URLs selected:
- delete from list
- recrawl these
- requery these
- change frequency
- change include setting: all on, all off, swap settings
- change crawl setting: all on, all off, swap settings
- change manual setting: all on, all off, swap settings
v0.85 - 18. July 2005
- Updated RSS Feed template, date issue (would use the wrong field)
- added new site wizard:
- asks for URL
- checks URL for validity, checks if Microsoft-Server or not (sets case-sensitivity on/off)
- can filter known session-ids automatically
- can check robots.txt automatically
- can check known URLs in Google automatically
- will scan site
- will create sitemaps-file automatically
- will upload to FTP server (if specified)
- will ping Google :-)
- fixed bug: bad sitemap files for large files
- fixed bug: kept description/keyword meta data from the last page
- changed FTP transfer to force binary mode + passive mode
v0.84
- internal release
v0.83 - 13. July 2005
- fixed URL case-sensitivity issue (would always work case-sensitively)
- force case selection: for hash calculation, etc.
- fixed setting remove trailing slash (wouldn't save)
- recognize custom error pages: enter text from page (from html-source)
- gzip works, new component (really now, hopefully :-))
v0.82 - 12. July 2005
- better hash model (needs time on first startup to recalc existing URL hashes)
v0.81 - 09. July 2005
- fixed: ftp wouldn't wait for gzip to finish - sometimes uploaded 0kb files
- fixed (hopefully): bug where the keyboard sometimes locked
- added: submits to Google via Ping (should submit via sitemaps-account the first time though), checks results
- added support for Win95/98/ME
- fixed: bug when crawling URLs with an apostrophe in them
- expiration-date 1. Sept. 2005 (more time to complete "final" version)
- log errors when crawling: reset on "recrawl", add the list to statistics tab
- added last robots.txt to statistics tab
- checked + fixed tab order
- search for url sections
- fixed: problems with invalid URLs with double-slashes in them, e.g. http://myserver.com//page.htm
- fixed: issue where the crawler would crawl external sites if you cleared the crawler queue + had it re-crawl a site right afterwards
v0.80 - 30. June 2005
- toolbar on top:
- start / stop crawler | show crawler
- new project
- import: from log, from google, from sitemap, from robots.txt
- Crawl this site
- Generate Sitemap
- Generate export-file
- Generate statistics
- statusbar on bottom:
- crawler-status / statistics
- date/time
- resize-handle
- FTP upload possible
- save ftp-infos with project
- allow several ftp proxy settings
- ftp upload of sitemap / sitemap-index files
- automatic checking of ftp uploads
- test of ftp settings possible
- new tab organisation, moved some tabs around
- clearer layout in the main tabs
- project settings
- remove trailing slash on folder-names
- global options dialog:
- Number of crawlers (1-15)
- specify if crawler should be running on start
- user-agent string to use when crawling, accessing, etc.
- proxy settings (server, port, authentication, bypass-list)
- wait-time between URLs per Crawler
- timeout on retrieving URLs
- options for sitemap splits (number of URLs or file size, whichever comes first)
- short notice when crawler is done ( + time spent crawling)
- crawl files in a local directory with given base URL
v0.79 - 29. June 2005
- bug fixed: manual URLs sometimes not exported to the sitemap
v0.78 - 29. June 2005 - RELEASE
- Export URL list to files, several formats included (text, csv, html, xml, rss, etc.), easy to make your own,
- Saves page size + checks update status via a page-content-hash (works for server-generated pages; see the sketch after this entry),
- fixed bug: "Couldn't find Installable ISAM" when running on certain systems,
- now only crawls URLs below the main-url specified (before: everything on the same server),
- bug fixed: on pages with a large number of links (>1000) it would sometimes miss a few links,
- bug fixed: sometimes had difficulty finding links on pages with URLs in single apostrophes
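
The page-content-hash mentioned above boils down to hashing the page body and comparing against the previous crawl. A minimal sketch (Python; the changelog does not say which hash function the program uses, so MD5 here is purely an assumption):

    import hashlib

    previous = {}  # url -> hash from the last crawl (illustrative store)

    def page_hash(html):
        # Hash the body so server-generated pages can be compared across
        # crawls without storing the full content.
        return hashlib.md5(html.encode("utf-8", "replace")).hexdigest()

    def has_changed(url, html):
        h = page_hash(html)
        changed = previous.get(url) != h
        previous[url] = h
        return changed
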
v0.77 - 27. June 2005 - RELEASE
- windows resizable for 800x600 systems,
- the crawler window is also resizable,
- the crawler window can be minimized + then shows URLs in queue in caption,
- removed lockdown for multiple starts of same exe (to allow the crawler to work in the background),
- works with Unicode websites on Unicode systems (Japanese, etc.).
v0.75 - 25. June 2005
- Small bugfix regarding the display of URLs generated with the first version of this program.
v0.74 - 24. June 2005 - RELEASE
- Faster processing (again)
- resizable main window
- saves window positions
- sitemaps now saved in per-project subdirectories
- sitemaps now saved by default as sitemap.xml
- can import and process standards-conformant robots.txt files into the banned-URL listing
- will recognize and comply with the robots meta tag in the head-section of the page (include/follow)
- will save + display the robots meta tag in the URL-listing
- automatic gzip support
- automatically split sitemap files (at 5000kb or 20000 links, whichever comes first; see the sketch after this entry)
- automatically create sitemap index when several sitemaps are generated
- statistics tab with lots of information (# of URLs, errors, oldest pages, robots-file, etc.)
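
The automatic splitting above (5000kb or 20000 links, whichever comes first) can be sketched as follows (Python, illustrative only; the entry-size estimate and the file naming are assumptions):

    def split_urls(urls, max_links=20000, max_bytes=5000 * 1024):
        # Yield chunks so each sitemap file stays under both limits.
        chunk, size = [], 0
        for u in urls:
            entry_len = len("<url><loc></loc></url>") + len(u)
            if chunk and (len(chunk) >= max_links
                          or size + entry_len > max_bytes):
                yield chunk
                chunk, size = [], 0
            chunk.append(u)
            size += entry_len
        if chunk:
            yield chunk

    # Each chunk becomes sitemap1.xml, sitemap2.xml, ...; an index file
    # referencing them is written when more than one chunk is produced.
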
v0.73 - 20. June 2005 - RELEASE
- Faster processing
- support for long URLs (up to 2000 characters)
- better support of malformed/non-standard HTML-pages
- better support for parameter-links
- report for aborted URLs from the CrawlerWatch-Window
v0.71 - 17. June 2005 - RELEASE
- Updated installer
- partial installer version (without the VB6 / DAO 3.5 runtimes)
- more status-information when processing larger sites
v0.70 - 17. June 2005 - RELEASE
- first public version.