ItSucks system requirements:
- Java JRE 1.6 or later
- Min. 256 MB RAM
Features
Rules
Console
Core library and API
Features
General Features
- Multithreaded, configurable count of working threads
- Regular Expression Editor to test expressions on the fly
- Save/Load download jobs
- Online Help
HTTP Connection Features
- HTTP/HTTPS supported
- HTTP Proxy Support (+ proxy authentication)
- HTTP Authentication Support
- Cookie Support (+ cookie migration from browser)
- Configurable User Agent
- Limitation of connections per server
- Configurable behaviour for HTTP response codes
Example: If a server sends 403 (Forbidden) after too many downloads, a retry count and a waiting time can be defined.
- Bandwidth limitation
- GZip compression
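The configurable behaviour for HTTP response codes can be pictured as a lookup table from status code to retry policy. The following is only a minimal sketch of that idea; the class `ResponseCodePolicySketch` and its methods are hypothetical names chosen for illustration and are not part of the ItSucks API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the ItSucks API: maps HTTP status codes to a retry policy.
class ResponseCodePolicySketch {

    // How often to retry and how long to wait before each retry.
    static final class RetryPolicy {
        final int maxRetries;
        final long waitMillis;

        RetryPolicy(int maxRetries, long waitMillis) {
            this.maxRetries = maxRetries;
            this.waitMillis = waitMillis;
        }
    }

    private final Map<Integer, RetryPolicy> policies = new HashMap<Integer, RetryPolicy>();

    // Registers a retry policy for one status code, e.g. 403.
    void define(int statusCode, int maxRetries, long waitMillis) {
        policies.put(statusCode, new RetryPolicy(maxRetries, waitMillis));
    }

    // Returns the waiting time in milliseconds before the next attempt,
    // or -1 if the request should not be retried.
    long waitBeforeRetry(int statusCode, int attemptsSoFar) {
        RetryPolicy policy = policies.get(statusCode);
        if (policy == null || attemptsSoFar >= policy.maxRetries) {
            return -1;
        }
        return policy.waitMillis;
    }

    public static void main(String[] args) {
        ResponseCodePolicySketch policy = new ResponseCodePolicySketch();
        policy.define(403, 3, 60000L); // on 403: up to 3 retries, 60 seconds apart
        System.out.println(policy.waitBeforeRetry(403, 0));
        System.out.println(policy.waitBeforeRetry(404, 0));
    }
}
```

Status codes without a registered policy are simply not retried in this sketch; a real crawler might fall back to a default policy instead.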
Rules
ItSucks offers a large variety of filters which can be used to isolate a specific part of a website.
Simple rules:
- Limitation of link depth
Example: To download only two levels after the initial link, set the value to 2.
- Limitation of links to follow (count)
Example: Stop adding new links after finding 5000 links.
- Limitation of time per job
Example: Stop adding new links after 30 minutes.
- Allowed Hostname filter (regular expression)
Example: Define '.*\.google\.(de|com)' to allow all subdomains of google.de and google.com.
- Regular Expression Filter to save only certain file types/names on disk
Example: Define '.*jpg|.*png' to save only files that end with 'jpg' or 'png'.
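To make the simple rules concrete, the sketch below combines a link depth limit, an allowed hostname filter and a save-to-disk filter using `java.util.regex.Pattern`. The class `SimpleRulesSketch` and its method names are hypothetical and only illustrate the idea; they are not taken from ItSucks. The hostname pattern used here escapes both dots, and the sketch assumes that the whole host or file name must match the expression and that the initial link has depth 0.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the ItSucks API: three of the simple rules in one class.
class SimpleRulesSketch {

    private final int maxDepth;
    private final Pattern allowedHost;
    private final Pattern saveToDisk;

    SimpleRulesSketch(int maxDepth, String allowedHostRegex, String saveRegex) {
        this.maxDepth = maxDepth;
        this.allowedHost = Pattern.compile(allowedHostRegex);
        this.saveToDisk = Pattern.compile(saveRegex);
    }

    // A link is followed only if it is within the depth limit
    // and its hostname matches the allowed hostname expression.
    boolean mayFollow(String host, int depth) {
        return depth <= maxDepth && allowedHost.matcher(host).matches();
    }

    // A downloaded file is written to disk only if its name matches.
    boolean shouldSave(String fileName) {
        return saveToDisk.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        SimpleRulesSketch rules =
                new SimpleRulesSketch(2, ".*\\.google\\.(de|com)", ".*jpg|.*png");
        System.out.println(rules.mayFollow("www.google.de", 1));
        System.out.println(rules.mayFollow("www.example.org", 1));
        System.out.println(rules.shouldSave("photo.jpg"));
    }
}
```

Because the pattern requires a dot before `google`, the bare host `google.de` is not matched in this sketch, only its subdomains.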
Special rules:
- File Size filter
Example: Only save files to disk that are larger than 100 KB.
Advanced Regular Expression Rules:
- A highly customizable filter chain can hold multiple regular expressions.
For every expression, actions can be defined for the case that the expression matches a URL
and for the case that it does not match.
Possible actions are: follow the URL (Accept), do not follow the URL (Reject), change the priority of the URL.
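One way to picture such a filter chain is a list of rules, each holding a regular expression plus one action for a match and one for a non-match. The sketch below is an assumption-laden illustration, not the ItSucks implementation: all class and method names are hypothetical, rules are evaluated in order, a later Accept/Reject decision overrides an earlier one, and priority changes add up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the ItSucks API: a chain of regular expression rules.
class FilterChainSketch {

    enum Decision { ACCEPT, REJECT, NO_CHANGE }

    // An action combines an accept/reject decision with a priority adjustment.
    static final class Action {
        final Decision decision;
        final int priorityDelta;

        Action(Decision decision, int priorityDelta) {
            this.decision = decision;
            this.priorityDelta = priorityDelta;
        }
    }

    // One rule: an expression and the actions for match and non-match.
    static final class Rule {
        final Pattern pattern;
        final Action onMatch;
        final Action onNoMatch;

        Rule(String regex, Action onMatch, Action onNoMatch) {
            this.pattern = Pattern.compile(regex);
            this.onMatch = onMatch;
            this.onNoMatch = onNoMatch;
        }
    }

    static final class Result {
        final boolean accepted;
        final int priority;

        Result(boolean accepted, int priority) {
            this.accepted = accepted;
            this.priority = priority;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    void addRule(String regex, Action onMatch, Action onNoMatch) {
        rules.add(new Rule(regex, onMatch, onNoMatch));
    }

    // Applies every rule in order; assumption: the last decision wins.
    Result evaluate(String url, boolean acceptedByDefault) {
        boolean accepted = acceptedByDefault;
        int priority = 0;
        for (Rule rule : rules) {
            Action action = rule.pattern.matcher(url).matches() ? rule.onMatch : rule.onNoMatch;
            if (action.decision == Decision.ACCEPT) {
                accepted = true;
            } else if (action.decision == Decision.REJECT) {
                accepted = false;
            }
            priority += action.priorityDelta;
        }
        return new Result(accepted, priority);
    }

    public static void main(String[] args) {
        Action keep = new Action(Decision.NO_CHANGE, 0);
        FilterChainSketch chain = new FilterChainSketch();
        chain.addRule(".*\\.pdf", new Action(Decision.ACCEPT, 10), keep);
        chain.addRule(".*/private/.*", new Action(Decision.REJECT, 0), keep);
        Result result = chain.evaluate("http://example.org/docs/a.pdf", false);
        System.out.println(result.accepted + " " + result.priority);
    }
}
```

Letting the last decision win keeps the sketch short; a chain that stops at the first Accept or Reject would be an equally plausible design.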
Content filter:
- Content filter for text/html files (regular expression)
Example: Only download a link if its content matches 'New mail.*arrived'.
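A content filter of this kind could be sketched as a search for the expression inside the body of a text/html response. The class `ContentFilterSketch` is a hypothetical illustration, not the ItSucks API. Two assumptions are made: the expression only has to occur somewhere in the content (so `find()` is used rather than a full match), and responses with a content type other than text/html pass through unfiltered.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the ItSucks API: a regular expression content filter.
class ContentFilterSketch {

    private final Pattern pattern;

    ContentFilterSketch(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Returns true if the response should be kept.
    boolean accepts(String mimeType, String body) {
        if (!"text/html".equals(mimeType)) {
            return true; // assumption: the filter only applies to text/html
        }
        // find() succeeds if the expression occurs anywhere in the body.
        return pattern.matcher(body).find();
    }

    public static void main(String[] args) {
        ContentFilterSketch filter = new ContentFilterSketch("New mail.*arrived");
        System.out.println(filter.accepts("text/html", "<p>New mail has arrived</p>"));
        System.out.println(filter.accepts("text/html", "<p>Nothing to report</p>"));
    }
}
```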
Console
- Start your download templates from the console after creating them with the GUI.
Core library and API
ItSucks is split into a backend (core) and a frontend. The backend can be used to implement your own web crawler/spider. The license is GPL.
General Features
- Designed to support multiple protocols. At this time only http/https is implemented.
- Running multiple crawl jobs simultaneously.
- Customizable filter chain to filter found URLs.
- Customizable processing chain to process the downloaded data.
- Save and load one or multiple jobs (serialize/deserialize).
- Fully programmed with Java 1.5 features (generics etc.).
Event Handling
- Support to observe every event fired by the framework.
- Possibility to filter events by category and type.
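Observing events with a filter on category and type can be sketched with a plain observer pattern. Everything below is hypothetical and only illustrates the concept: the class `EventDispatcherSketch`, the categories `JOB` and `DOWNLOAD`, and the type strings are invented for this example and do not reflect the actual ItSucks event classes. Passing `null` for a category or type means "any" in this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the ItSucks API: observers with category/type filters.
class EventDispatcherSketch {

    enum Category { JOB, DOWNLOAD } // invented categories for illustration

    static final class Event {
        final Category category;
        final String type;

        Event(Category category, String type) {
            this.category = category;
            this.type = type;
        }
    }

    interface Observer {
        void onEvent(Event event);
    }

    // An observer together with its filter; null means "match any".
    private static final class Registration {
        final Observer observer;
        final Category category;
        final String type;

        Registration(Observer observer, Category category, String type) {
            this.observer = observer;
            this.category = category;
            this.type = type;
        }

        boolean wants(Event event) {
            return (category == null || category == event.category)
                    && (type == null || type.equals(event.type));
        }
    }

    private final List<Registration> registrations = new ArrayList<Registration>();

    void register(Observer observer, Category category, String type) {
        registrations.add(new Registration(observer, category, type));
    }

    // Delivers the event to every observer whose filter accepts it.
    void fire(Event event) {
        for (Registration registration : registrations) {
            if (registration.wants(event)) {
                registration.observer.onEvent(event);
            }
        }
    }

    public static void main(String[] args) {
        EventDispatcherSketch dispatcher = new EventDispatcherSketch();
        dispatcher.register(new Observer() {
            public void onEvent(Event event) {
                System.out.println("download event: " + event.type);
            }
        }, Category.DOWNLOAD, null);
        dispatcher.fire(new Event(Category.DOWNLOAD, "FINISHED"));
        dispatcher.fire(new Event(Category.JOB, "STARTED")); // filtered out
    }
}
```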