ItSucks system requirements:

Java JRE 1.6 or later.
At least 256 MB RAM.

Features

General Features

  • Multithreaded, configurable count of working threads
  • Regular Expression Editor to test expressions on the fly
  • Save/Load download jobs
  • Online Help

HTTP Connection Features

  • HTTP/HTTPS supported
  • HTTP Proxy Support (+ proxy authentication)
  • HTTP Authentication Support
  • Cookie Support (+ cookie migration from browser)
  • Configurable User Agent
  • Limitation of connections per server
  • Configurable behaviour for HTTP response codes
    Example: If a server sends 403 (Forbidden) after too many downloads, a retry count and a waiting time can be defined.
  • Bandwidth limitation
  • GZip compression
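
The configurable behaviour for HTTP response codes can be pictured as a small lookup table from status code to retry rule. The following is a minimal sketch under stated assumptions: the class names RetryRule and ResponseCodePolicy and the method nextDelay are hypothetical and are not taken from the ItSucks code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical retry rule for one HTTP status code (not the ItSucks API). */
final class RetryRule {
    final int maxRetries;   // how many retries are allowed
    final long waitMillis;  // waiting time before each retry

    RetryRule(int maxRetries, long waitMillis) {
        this.maxRetries = maxRetries;
        this.waitMillis = waitMillis;
    }
}

/** Hypothetical policy that maps status codes to retry rules. */
final class ResponseCodePolicy {
    private final Map<Integer, RetryRule> rules = new HashMap<Integer, RetryRule>();

    void onStatus(int statusCode, RetryRule rule) {
        rules.put(statusCode, rule);
    }

    /**
     * Returns the waiting time in milliseconds before the next attempt,
     * or -1 if no rule exists or the retries are used up.
     */
    long nextDelay(int statusCode, int retriesSoFar) {
        RetryRule rule = rules.get(statusCode);
        if (rule == null || retriesSoFar >= rule.maxRetries) {
            return -1L;
        }
        return rule.waitMillis;
    }
}
```

With a rule such as new RetryRule(3, 60000L) registered for 403, a caller would wait one minute between attempts and give up after the third retry.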

Rules

ItSucks offers a large variety of filters which can be used to isolate a specific part of a website.

Simple rules:

  • Limitation of link depth
    Example: To download only two levels after the initial link, set the value to 2.
  • Limitation of links to follow (count)
    Example: Stop adding new links after finding 5000 links.
  • Limitation of time per job
    Example: Stop adding new links after 30 minutes.
  • Allowed Hostname filter (regular expression)
    Example: Define '.*\.google\.(de|com)' to allow all subdomains of google.de and google.com.
  • Regular Expression Filter to save only certain filetypes/names on disk
    Example: Define '.*jpg|.*png' to save only files whose names end with 'jpg' or 'png'.
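
The two regular expression rules above can be tried out with plain java.util.regex. This is only a sketch: the class SimpleRuleSketch and its methods are hypothetical, and it assumes that an expression must match the whole hostname or file name (Matcher.matches()), which is not confirmed here. In this sketch the dot before (de|com) is escaped so that it matches only a literal dot.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that mimics two of the simple rules (not the ItSucks API). */
final class SimpleRuleSketch {
    // Allowed hostname filter: any subdomain of google.de or google.com.
    static final Pattern ALLOWED_HOST = Pattern.compile(".*\\.google\\.(de|com)");

    // File name filter: only names ending with "jpg" or "png" are saved.
    static final Pattern SAVE_FILE = Pattern.compile(".*jpg|.*png");

    /** True if the whole hostname matches the allowed-host expression. */
    static boolean hostAllowed(String host) {
        return ALLOWED_HOST.matcher(host).matches();
    }

    /** True if the whole file name matches the save-to-disk expression. */
    static boolean shouldSave(String fileName) {
        return SAVE_FILE.matcher(fileName).matches();
    }
}
```

Under the whole-string assumption, the bare host 'google.de' does not match because the expression requires a dot before 'google'. An expression such as '(.*\.)?google\.(de|com)' would cover both cases.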

Special rules:

  • File Size filter
    Example: Only save files to disk that are larger than 100 KB.

Advanced Regular Expression Rules:

  • A highly customizable filter chain can hold multiple regular expressions. For every expression, you can define actions to execute when the expression matches a URL and actions to execute when it does not match.
    Possible actions are: follow the URL (Accept), do not follow the URL (Reject), or change the priority of the URL.
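
One way to picture such a filter chain is sketched below. The names (UrlFilterChain, Verdict, Action, Result) and the evaluation order are assumptions made for illustration and do not come from the ItSucks API. The sketch assumes that rules run in order, that priority changes accumulate, that the first Accept or Reject ends the evaluation, and that a URL is accepted if no rule decides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical filter chain; names and semantics are assumptions. */
final class UrlFilterChain {
    enum Verdict { ACCEPT, REJECT, CONTINUE }

    /** What to do for one outcome (match or no match) of a rule. */
    static final class Action {
        final Verdict verdict;
        final int priorityDelta;
        Action(Verdict verdict, int priorityDelta) {
            this.verdict = verdict;
            this.priorityDelta = priorityDelta;
        }
    }

    /** Final decision for one URL. */
    static final class Result {
        final boolean accepted;
        final int priority;
        Result(boolean accepted, int priority) {
            this.accepted = accepted;
            this.priority = priority;
        }
    }

    private static final class Rule {
        final Pattern pattern;
        final Action onMatch;
        final Action onNoMatch;
        Rule(Pattern pattern, Action onMatch, Action onNoMatch) {
            this.pattern = pattern;
            this.onMatch = onMatch;
            this.onNoMatch = onNoMatch;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    UrlFilterChain add(String regex, Action onMatch, Action onNoMatch) {
        rules.add(new Rule(Pattern.compile(regex), onMatch, onNoMatch));
        return this;
    }

    Result evaluate(String url) {
        int priority = 0;
        for (Rule rule : rules) {
            Action action = rule.pattern.matcher(url).matches() ? rule.onMatch : rule.onNoMatch;
            priority += action.priorityDelta;
            if (action.verdict == Verdict.ACCEPT) return new Result(true, priority);
            if (action.verdict == Verdict.REJECT) return new Result(false, priority);
        }
        return new Result(true, priority); // assumed default: accept
    }
}
```

A chain built this way could, for instance, raise the priority of every URL ending in '.pdf' with one rule and reject everything under '/private/' with a second.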

Content filter:

  • Content filter for text/html files (regular expression)
    Example: Only download a link if the content contains 'New mail.*arrived'.
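
A content check of this kind differs from the URL rules in that the expression only has to occur somewhere in the text. The sketch below assumes "contains" semantics (Matcher.find()); the class ContentFilterSketch is hypothetical and not part of ItSucks.

```java
import java.util.regex.Pattern;

/** Hypothetical content filter for text/html bodies (not the ItSucks API). */
final class ContentFilterSketch {
    private static final Pattern REQUIRED = Pattern.compile("New mail.*arrived");

    /** True if the expression occurs anywhere in the given HTML text. */
    static boolean contentMatches(String html) {
        return REQUIRED.matcher(html).find();
    }
}
```

By default '.' does not match line terminators in java.util.regex, so in this sketch 'New mail' and 'arrived' must appear on the same line. Compiling the pattern with Pattern.DOTALL would lift that restriction.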

Console

  • Start your download templates from the console after creating them with the GUI.

Core library and API

ItSucks is split into a backend (core) and a frontend. The backend can be used to implement your own web crawler/spider. The license is GPL.

General Features

  • Designed to support multiple protocols. Currently only HTTP/HTTPS is implemented.
  • Running multiple crawl jobs simultaneously.
  • Customizable filter chain to filter found URLs.
  • Customizable processing chain to process the downloaded data.
  • Save and load one or more jobs (serialize/deserialize).
  • Fully programmed with Java 1.5 features (generics etc.).

Event Handling

  • Every event fired by the framework can be observed.
  • Events can be filtered by category and type.
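
Observing events with a category and type filter can be sketched with a plain observer pattern. Everything below is illustrative: EventDispatcher, its Category values and the string event types are made-up names, not the event API of the ItSucks core.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event dispatcher; all names are illustrative. */
final class EventDispatcher {
    enum Category { JOB, DOWNLOAD }

    static final class Event {
        final Category category;
        final String type;
        final String message;
        Event(Category category, String type, String message) {
            this.category = category;
            this.type = type;
            this.message = message;
        }
    }

    interface Observer {
        void onEvent(Event event);
    }

    private static final class Registration {
        final Category category; // null means "any category"
        final String type;       // null means "any type"
        final Observer observer;
        Registration(Category category, String type, Observer observer) {
            this.category = category;
            this.type = type;
            this.observer = observer;
        }
        boolean wants(Event e) {
            return (category == null || category == e.category)
                && (type == null || type.equals(e.type));
        }
    }

    private final List<Registration> registrations = new ArrayList<Registration>();

    /** Registers an observer; pass null for category or type to observe all. */
    void register(Category category, String type, Observer observer) {
        registrations.add(new Registration(category, type, observer));
    }

    /** Delivers the event to every observer whose filter accepts it. */
    void fire(Event event) {
        for (Registration r : registrations) {
            if (r.wants(event)) {
                r.observer.onEvent(event);
            }
        }
    }
}
```

Registering with null for both category and type corresponds to observing every event, while a concrete pair such as (DOWNLOAD, "FINISHED") narrows delivery to one kind of event.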

External libraries/tools used in development