crawler
Results 1 - 15 of about 32
Schema Crawler 3.7
Schema Crawler is a platform (OS and database) independent command-line tool to output your database schema.
Schema Crawler is a platform-independent (database and OS) command-line tool that outputs your database schema and data in a readable form.
The output is designed to be diffed against previous versions of your database schema. Schema Crawler is also an API that improves on standard JDBC metadata.
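The idea of diffing schema output can be sketched as follows. This is only an illustration in Python using the standard difflib module. The two schema dumps are made-up text, not real Schema Crawler output, and the function name schema_diff is hypothetical.

```python
import difflib


def schema_diff(old_text, new_text):
    """Return the unified-diff lines between two plain-text schema dumps."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="schema_old.txt",
            tofile="schema_new.txt",
            lineterm="",
        )
    )


# Made-up schema dumps for illustration; not real Schema Crawler output.
OLD = "table CUSTOMER\n  ID INTEGER\n  NAME VARCHAR(40)\n"
NEW = "table CUSTOMER\n  ID INTEGER\n  NAME VARCHAR(80)\n  EMAIL VARCHAR(120)\n"

for line in schema_diff(OLD, NEW):
    print(line)
```

Lines starting with "-" exist only in the old dump and lines starting with "+" only in the new one, so a changed column shows up as one removed and one added line.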
Build
Ant Build
The Ant build is a quick build that creates the Schema Crawler jar without compiling and running unit tests. Download Ant, and run it from the Schema Crawler directory. The jar file will be created in the _distrib directory. The main targets are all.build and all.clean.
Maven Build
The Maven build is a more comprehensive build that runs unit tests and also creates the project website. Download Maven, and run it from the Schema Crawler directory. The jar file will be created in the _distrib directory. The main goals are all.build and all.clean.
Eclipse
Schema Crawler consists of two Eclipse projects, dbconnector and schemacrawler. Since schemacrawler depends on dbconnector, you will need to import both projects into Eclipse.
Enhancements:
- Database properties are now retrieved, along with column data types (both system data types and UDTs).
- A new command, maximum_schema, gives all possible details of the schema, including database system properties.
- Bugs with the merge rows option and with appending output for multiple runs of SchemaCrawler are fixed.
- An SQL exception that occurred with the Oracle Database 10g Express Edition driver when outputting views is fixed.
Download (0.73MB)
Added: 2006-06-12 License: GPL (GNU General Public License) Price:
1231 downloads
PyGalleryCrawler 0.1.1
PyGalleryCrawler is a web crawler for online image galleries.
Installation:
tar -xzf pygallerycrawler.tar.gz
cd pygallerycrawler
Extra Python modules
psyco @ http://psyco.sourceforge.net
- performance
Python Imaging Library aka PIL @ http://www.pythonware.com/products/pil/
- thumbnail generation
- size verification
feedparser @ http://feedparser.org
- feed parser
Use:
chmod a+x pygallerycrawler.py
./pygallerycrawler.py the_url_you_want_crawl
Personal configuration:
If you make changes in config.py, they will be overwritten at the next update. Instead, you can create a personal configuration and use it with the --config (or -c) switch.
cp config.py ~/pgc_config.py
vi ~/pgc_config.py
./pygallerycrawler.py -c ~/pgc_config.py the_url_you_want_crawl
Version restrictions:
- Downloaded pictures are not checked for duplicates. Some galleries have a presentation link which is one of the pictures, so that image will be downloaded twice.
Download (0.015MB)
Added: 2007-04-18 License: GPL (GNU General Public License) Price:
920 downloads
KrawlSite 0.7
KrawlSite is a web crawler/spider/offline browser/download manager application.
To integrate with Konqueror, open the file associations page in the configuration dialog, select the text/html MIME type and, in the embedded viewers list, choose KrawlSite_Part. Now, when you right-click on a web page in Konqueror, you'll see KrawlSite in the "Preview In" menu.
Selecting it embeds the component into Konqueror, as in the second screen shot. The first screen shot shows the shell in which the component runs. The third screen shot shows the configuration dialog.
To use this app to download tutorials, turn offline mode on and start crawling from the start page of the tutorial. If the start page of the tutorial is the TOC, set the crawl depth to 1; if the start page has the TOC along with the first chapter, set the crawl depth to 0. If each chapter page has only next and previous links, set the crawl depth to the number of chapters.
Enhancements:
- Crash free (afaik!), esp. after KDE 4.1 came around.
- Support for HTML frames
- Better UI
Download (0.62MB)
Added: 2005-12-01 License: GPL (GNU General Public License) Price:
1422 downloads
pro-search 0.17.2
pro-search is a crawler for FTP servers, SMB shares, HTTP servers, and DC++ networks.
Download (0.17MB)
Added: 2007-05-22 License: GPL (GNU General Public License) Price:
896 downloads
Minalyzer Lite 2007.05
Minalyzer Lite adds index and search tools to your internet or intranet website.
Minalyzer Lite supports user searches by indexing data from combinations of databases, file systems and websites. The project runs on almost all operating systems. It even copes with restrictive shared hosting.
Pump up your website with powerful search tools
- Standard search functionality through web forms.
- Advanced query options such as keyword forcing (+) and keyword exclusion (-).
- Targeted searches in single or multiple index collections.
- Summarized and highlighted results, properly paginated.
- Search by field or location within your website.
- Maintain high performance by running resource-intensive processes on remote computers.
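Keyword forcing (+) and keyword exclusion (-), as listed above, can be illustrated with a small Python sketch. This is not Minalyzer Lite's implementation. The function names, the whitespace tokenization and the rule that at least one plain keyword must match when nothing is forced are all assumptions made for the example.

```python
def parse_query(query):
    """Split a query into forced (+), excluded (-) and plain keywords."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional


def matches(query, text):
    """Return True if the text satisfies the query (assumed semantics)."""
    required, excluded, optional = parse_query(query)
    words = set(text.lower().split())
    # Any excluded keyword rejects the document.
    if any(term in words for term in excluded):
        return False
    # Every forced keyword must be present.
    if any(term not in words for term in required):
        return False
    # With no forced keywords, at least one plain keyword must appear.
    if not required and optional:
        return any(term in words for term in optional)
    return True


print(parse_query("+crawler index -ftp"))
print(matches("+crawler -ftp", "a web crawler for image galleries"))
```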
Easy to integrate
- Fully customizable look and feel for seamless integration with your website.
- Tailored indexing, e.g. it indexes files directly from a file system, from a dynamic website's database, by crawling the website, or by using a combination of all three.
- Generates multiple index collections for different parts of your site, e.g. for FAQ, documentation, forums and blogs.
- Attractively priced custom programming service available for all of the above.
- Comes with a sample web application demonstrating integration with sites built using ASP.NET, PHP, Perl, and HTML (CGI scripts enabled).
Deployable across a range of environments
- Operating System/Platform independent.
- Works with most shared hosting accounts.
- Language independent - works with all web programming languages, e.g. ASP.NET, PHP, Perl, and HTML (CGI scripts enabled).
- Indexes are portable between operating systems, e.g. create an index in a Windows environment, then transfer it to a Linux server.
- Indexing support for arc files generated by Heritrix Crawler.
Easy to install and administer
- Simple xcopy/ftp installation.
- No dependencies on other sites or companies - the executable runs locally.
- Tracks user search activity through detailed search logs.
- Optional web form interface for indexing.
- Works around web server restrictions by generating indexes locally and then uploading them.
Enhancements:
Features Added:
- Added reporting capabilities to generate PDF reports.
- Implemented Jericho HTML parser for HTML parsing.
Changes in this version:
- Enhanced example scripts to add more logging.
Download (MB)
Added: 2007-06-20 License: Freely Distributable Price:
859 downloads
Fast File Search 1.1.13
Fast File Search crawls FTP servers and SMB shares.
Fast File Search is a crawler for FTP servers and SMB shares that can be found on Windows or UNIX systems running Samba.
It provides a web interface for searching files. It is optimized for wildcard searches in which the mask begins or ends with some literal characters (anything other than * or ?), for example *.iso.
The Fast File Search crawler runs on UNIX (currently only Linux has been tested, but I do not know of any reason why it should not work on other UNIXes). Fast File Search uses a MySQL database, the web interface needs a web server with PHP >= 4.0.3, and the crawler needs some Perl modules.
The crawler (ffsearch.pl) crawls the network (FTP servers from the list and all reachable SMB hosts on the local network) and stores the information about files in the database. It is invoked at certain times each day via crontab entries.
The crawler has two modes of operation: complete crawl and incremental crawl. It expects a command-line argument that tells it which mode to run (-c or --complete for a complete crawl, -i or --incremental for an incremental crawl). Both modes retrieve a list of the active SMB hosts in all workgroups.
The complete crawl tries to scan all active hosts and all hosts that are listed in the database. The complete crawl should be run once a day.
The incremental crawl tries to scan active hosts and hosts listed in the database that have not been scanned since the last complete crawl because they were unreachable. The incremental crawl should be run several times a day, for example every 3 hours.
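The mode switches described above can be sketched with Python's argparse module. The real crawler, ffsearch.pl, is a Perl script, so this is only an illustration of a command line that requires exactly one of the two modes. Only the flag names come from the description; everything else (build_parser, the "mode" attribute) is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that requires exactly one crawl mode."""
    parser = argparse.ArgumentParser(description="Sketch of a two-mode crawler CLI")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument(
        "-c", "--complete",
        dest="mode", action="store_const", const="complete",
        help="scan all active hosts and all hosts listed in the database",
    )
    mode.add_argument(
        "-i", "--incremental",
        dest="mode", action="store_const", const="incremental",
        help="scan hosts not reached since the last complete crawl",
    )
    return parser


args = build_parser().parse_args(["--incremental"])
print(args.mode)
```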
How does the crawler know whether a host has been crawled since the last complete crawl?
Each time a complete crawl is executed, the expire count of every host is incremented first. When a host is crawled, its expire count is set to zero. So every host whose expire count is greater than 0 has not been reachable since the last complete crawl. Moreover, when the expire count reaches the value specified in the configuration (i.e. the host was unreachable for that many complete crawls in a row), the information about files on the "expired" host is deleted from the database.
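The expire-count bookkeeping can be sketched in a few lines of Python. This models only the counting rules described above, with in-memory dictionaries standing in for the MySQL database. The class and method names are hypothetical and the real crawler is written in Perl.

```python
class HostTable:
    """In-memory stand-in for the host and file tables (illustration only)."""

    def __init__(self, max_expire_count):
        self.max_expire_count = max_expire_count  # value from the configuration
        self.expire_count = {}  # host -> complete crawls since last successful scan
        self.files = {}         # host -> list of file names

    def begin_complete_crawl(self):
        """A complete crawl first increments the expire count of every host."""
        for host in self.expire_count:
            self.expire_count[host] += 1

    def record_crawl(self, host, file_list):
        """A successful scan stores the files and resets the expire count."""
        self.files[host] = list(file_list)
        self.expire_count[host] = 0

    def pending_hosts(self):
        """Hosts not reached since the last complete crawl (count > 0)."""
        return sorted(h for h, c in self.expire_count.items() if c > 0)

    def purge_expired(self):
        """Delete file information for hosts whose count reached the limit."""
        expired = sorted(
            h for h, c in self.expire_count.items() if c >= self.max_expire_count
        )
        for host in expired:
            self.files.pop(host, None)
        return expired


table = HostTable(max_expire_count=2)
table.record_crawl("alpha", ["a.iso"])
table.record_crawl("beta", ["b.iso"])

table.begin_complete_crawl()            # both counts become 1
table.record_crawl("alpha", ["a.iso"])  # alpha answered, beta did not
print(table.pending_hosts())            # hosts an incremental crawl would retry
```

An incremental crawl would retry the hosts returned by pending_hosts(), while purge_expired() corresponds to dropping the file records of hosts that stayed unreachable.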
The web interface is used to search the files in the database; details on how to search are described in the Help section of the search page.
You can also add an FTP server to the FTP server list, edit an FTP server in the list, or delete an FTP server from the list through the web interface. To prevent arbitrary changes to the server list, the record for host abcdef can be edited only from host abcdef. There are also admins who can edit all records in the server list. The admins log in through the web interface.
Enhancements:
- Fixed a few bugs in the crawler
- Added the possibility to exclude some SMB shares
- www: improved Russian and Ukrainian translations
Download (0.14MB)
Added: 2005-10-18 License: GPL (GNU General Public License) Price:
1468 downloads
Visitors Web Log Analyzer 0.61
Visitors is a very fast Web log analyzer.
Visitors is a very fast web log analyzer for Linux, Windows, and other Unix-like operating systems. It takes a web server log file as input and outputs statistics in the form of different reports. The design principles are very different compared to other software of the same type:
No installation is required, and it can process up to 150,000 lines of log entries per second on fast computers (20MB/s with the average line length of my log files).
It is designed to be run from the command line and outputs HTML and text reports. The text report can be piped to less to check web stats over ssh.
Support for real-time statistics with the Visitors Stream Mode, introduced in version 0.3.
There is no need to specify the log format. It works out of the box with Apache and most other web servers with a standard log format (see the documentation for more information on the format).
It's a portable C program that can be compiled on many different systems. Binaries for Windows systems are in the Download section of this page.
The HTML report it produces doesn't contain images or external CSS; it is self-contained, so you can send it to users by email.
Visitors is free software (and of course, freeware), under the terms of the GPL license. You don't need to pay to use it. Visitors is supported: if you want a custom version made directly by the original author for a modest price, contact me at antirez (at) invece.org. ISPs may take advantage of the high processing speed.
Main features:
- Requested pages.
- Requested images.
- Referers by hits and age.
- Unique visitors for each day.
- Page views per visit.
- Pages accessed by the Google crawler (and the date of Google's last access to every page).
- Percentage of visits originating from Google searches for every day.
- User navigation patterns (web trails).
- Keyphrases used in Google searches.
- User agents.
- Weekday and hour distributions of accesses.
- Weekdays/Hours combined bidimensional map.
- Month/Year combined bidimensional map.
- Visual path analysis with Graphviz.
- Operating systems, browsers and domains popularity.
- 404 errors.
Enhancements:
- This release adds an important bugfix in the unique visitors algorithm.
- The output is now nearer to reality (though unique visitors stats are always a guess without the use of a cookie).
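To see why unique-visitor counts are a guess without a cookie, here is a small Python sketch. It is not the algorithm Visitors uses, which the text above does not describe. It assumes log lines in the Apache combined log format and treats each distinct (IP address, user agent) pair as one visitor per day. The sample lines are invented and use documentation-reserved IP addresses.

```python
import re
from collections import defaultdict

# Apache combined log format: host, ident, user, [date:time zone],
# "request", status, bytes, "referer", "user agent".
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[(\d{2}/\w{3}/\d{4}):[^\]]+\] '
    r'"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def unique_visitors_per_day(lines):
    """Count distinct (IP, user agent) pairs per day; unparsable lines are skipped."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        ip, day, agent = match.groups()
        seen[day].add((ip, agent))
    return {day: len(pairs) for day, pairs in seen.items()}


# Invented sample log lines.
SAMPLE = [
    '192.0.2.1 - - [05/Nov/2005:10:00:00 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [05/Nov/2005:10:05:00 +0100] "GET /a.html HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [05/Nov/2005:11:00:00 +0100] "GET / HTTP/1.1" 200 512 "-" "Opera/8.5"',
    '192.0.2.1 - - [06/Nov/2005:09:00:00 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

print(unique_visitors_per_day(SAMPLE))
```

Two people behind the same proxy with the same browser collapse into one visitor under this heuristic, and one person on a changing IP address counts several times, which is why such numbers are estimates.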
Download (0.11MB)
Added: 2005-11-05 License: GPL (GNU General Public License) Price:
1458 downloads
Tales of Middle Earth 2.3.4
Tales of Middle Earth is a tile-based dungeon crawler similar to Nethack, Rogue, and Angband.
Tales of Middle Earth (ToME) is a fantasy adventure game, based on the works of J.R.R. Tolkien. It is a game that emphasizes intricate, challenging, and varied gameplay over graphics.
Hundreds of different monsters in randomly-generated, unpredictable dungeons will strive to slay you by various means, and you counter - if you survive - by developing the skills of your choice and wielding mighty artifacts.
ToME's races, from Hobbit to Troll, and classes, from Swordmaster to Summoner, allow for many different playing styles and a replay value that extends through years.
It is the only game so realistic that your scrolls and spell books will burn if you trudge through lava (unless you have gained immunity from some armour). You will dry up rivers to cast mighty spells, strike at orcs with blades attuned to slay them specifically, summon armies from a simple totem, and even forge your own artifacts.
There is an entire community to help you with the game, and it is far from static: ToME is a developing game, always improving. And best of all, it's free.
Enhancements:
- Fix window position saving on Mac OS
- Remove buggy trap of Stair Movement
- Fix typo in one monster's flags
- Fix word wrapping in character sheet
Download (2.5MB)
Added: 2007-05-23 License: GPL (GNU General Public License) Price:
885 downloads
Webtools 4 larbin 1.0
Webtools 4 larbin provides a set of scripts to handle the output of Larbin.
Larbin is a Web crawler intended to fetch a large number of Web pages to fill the database of a search engine.
With a fast enough network, it should be able to fetch more than 100 million pages on a standard PC.
This set of PHP and Perl scripts, called webtools4larbin, can handle the output of Larbin.
Download (0.003MB)
Added: 2007-02-01 License: GPL (GNU General Public License) Price:
999 downloads
BitTorrent Queue Manager 0.1.3
BitTorrent Queue Manager is a console-based BitTorrent client that provides built-in queue management functions.
BitTorrent Queue Manager is a console-based BitTorrent client running on top of BitTornado that provides built-in queue management functions.
BitTorrent Queue Manager also provides a remote interface compatible with ABC for Web-based control. Furthermore, peer information can be queried, including country and network names, and a built-in crawler can gather new torrents on specified trackers or catalog sites for downloading automatically.
This is the new beginning of BTQueue. By upgrading to 0.1.0, you are able to:
- Use the DHT network compatible with Bram Cohen's client and BitComet
- Query IP location from an updated database, plus AS number (see ip2cc)
- Change the client identifier to Azureus, Bram Cohen's client, or BitComet
Download (2.0MB)
Added: 2006-06-23 License: Python License Price:
1231 downloads
WWW::Google::SiteMap::URL 1.09
WWW::Google::SiteMap::URL is a URL helper class for WWW::Google::SiteMap.
This is a helper class that supports WWW::Google::SiteMap and WWW::Google::SiteMap::Index.
METHODS
new()
loc()
Change the URL associated with this object. For a WWW::Google::SiteMap, this specifies the URL to add to the sitemap; for a WWW::Google::SiteMap::Index, this is the URL of the sitemap.
changefreq()
Set the change frequency of the object. This field is not used in sitemap indexes, only in sitemaps.
lastmod()
Set the last modified time. You have to provide this as one of the following:
a complete ISO8601 time string
A complete time string will be accepted in exactly this format:
YYYY-MM-DDTHH:MM:SS+TZ:TZ
YYYY - 4-digit year
MM - 2-digit month (zero padded)
DD - 2-digit day (zero padded)
T - literal character T
HH - 2-digit hour (24-hour, zero padded)
MM - 2-digit minute (zero padded)
SS - 2-digit second (zero padded)
+TZ:TZ - Timezone offset (hours and minutes from GMT, 2-digit, zero padded)
epoch time
Seconds since the epoch, such as would be returned from time(). If you provide an epoch time, then an appropriate ISO8601 time will be constructed with gmtime() (which means the timezone offset will be +00:00). If anyone knows of a way to determine the timezone offset of the current host that is cross-platform and doesn't add dozens of dependencies, then I might change this.
an ISO8601 date (YYYY-MM-DD)
A simple date in YYYY-MM-DD format. The time will be set to 00:00:00+00:00.
a DateTime object.
If a DateTime object is provided, then an appropriate timestamp will be constructed from it.
a HTTP::Response object.
If given an HTTP::Response object, the last modified time will be calculated from whatever time information is available in the response headers. Currently this means either the Last-Modified header, or the current time minus the current_age() calculated by the response object. This is useful for building web crawlers.
Note that in order to conserve memory, any of these items that you provide will be converted to a complete ISO8601 time string when they are stored. This means that if you pass an object to lastmod(), you can't get it back out. If anyone actually has a need to get the objects back out, then I might make a configuration option to store the objects internally.
If you have suggestions for other types of date/time objects or formats that would be useful, let me know and I'll consider them.
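The conversion of the plain-value inputs to a complete ISO8601 string can be sketched as follows. This is an illustration in Python, not the module's Perl code. The function name normalize_lastmod is hypothetical, and the DateTime and HTTP::Response cases are left out.

```python
import re
from datetime import datetime, timezone

FULL_TIME = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}")
DATE_ONLY = re.compile(r"\d{4}-\d{2}-\d{2}")


def normalize_lastmod(value):
    """Return a complete ISO8601 time string for an epoch time or a date string."""
    if isinstance(value, (int, float)):
        # Epoch seconds are interpreted as UTC, so the offset is +00:00.
        moment = datetime.fromtimestamp(value, tz=timezone.utc)
        return moment.strftime("%Y-%m-%dT%H:%M:%S+00:00")
    if FULL_TIME.fullmatch(value):
        # Already a complete time string; keep it unchanged.
        return value
    if DATE_ONLY.fullmatch(value):
        # A bare date gets midnight with a zero offset.
        return value + "T00:00:00+00:00"
    raise ValueError("unsupported lastmod value: %r" % (value,))


print(normalize_lastmod(0))
print(normalize_lastmod("2006-10-24"))
```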
priority()
Set the priority. This field is not used in sitemap indexes, only in sitemaps.
delete()
Delete this object from the sitemap or the sitemap index.
lenient()
If lenient contains a true value, then errors will not be fatal.
<<lessThis is a helper class that supports WWW::Google::SiteMap and WWW::Google::SiteMap::Index.
METHODS
new()
loc()
Change the URL associated with this object. For a WWW::Google::SiteMap this specifies the URL to add to the sitemap, for a WWW::Google::SiteMap::Index, this is the URL to the sitemap.
changefreq()
Set the change frequency of the object. This field is not used in sitemap indexes, only in sitemaps.
lastmod()
Set the last modified time. You have to provide this as one of the following:
a complete ISO8601 time string
A complete time string will be accepted in exactly this format:
YYYY-MM-DDTHH:MM:SS+TZ:TZ
YYYY - 4-digit year
MM - 2-digit month (zero padded)
DD - 2-digit year (zero padded)
T - literal character T
HH - 2-digit hour (24-hour, zero padded)
SS - 2-digit second (zero padded)
+TZ:TZ - Timezone offset (hours and minutes from GMT, 2-digit, zero padded)
epoch time
Seconds since the epoch, such as would be returned from time(). If you provide an epoch time, then an appropriate ISO8601 time will be constructed with gmtime() (which means the timezone offset will be +00:00). If anyone knows of a way to determine the timezone offset of the current host that is cross-platform and doesnt add dozens of dependencies then I might change this.
an ISO8601 date (YYYY-MM-DD)
A simple date in YYYY-MM-DD format. The time will be set to 00:00:00+00:00.
a DateTime object.
If a DateTime object is provided, then an appropriate timestamp will be constructed from it.
an HTTP::Response object.
If given an HTTP::Response object, the last modified time will be calculated from whatever time information is available in the response headers. Currently this means either the Last-Modified header, or the current time minus the current_age() calculated by the response object. This is useful for building web crawlers.
Note that in order to conserve memory, any of these items that you provide will be converted to a complete ISO8601 time string when they are stored. This means that if you pass an object to lastmod(), you can't get it back out. If anyone actually has a need to get the objects back out, then I might make a configuration option to store the objects internally.
If you have suggestions for other types of date/time objects or formats that would be useful, let me know and I'll consider them.
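The simpler input forms described above could be normalised roughly as follows. This is only a sketch of the documented behaviour, not the module's actual code: to_iso8601 is a hypothetical helper name, and DateTime and HTTP::Response objects are left out to keep the example free of non-core dependencies.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper (not part of WWW::Google::SiteMap): turn the simpler
# lastmod() inputs into a complete ISO8601 time string.
sub to_iso8601 {
    my ($value) = @_;

    # Already complete: YYYY-MM-DDTHH:MM:SS+TZ:TZ
    return $value
        if $value =~ /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}$/;

    # Date only: the time is set to 00:00:00+00:00
    return "${value}T00:00:00+00:00"
        if $value =~ /^\d{4}-\d{2}-\d{2}$/;

    # Epoch seconds: built with gmtime(), so the offset is always +00:00
    return strftime('%Y-%m-%dT%H:%M:%S+00:00', gmtime($value))
        if $value =~ /^\d+$/;

    die "Unrecognised time value: $value\n";
}

print to_iso8601(0), "\n";              # 1970-01-01T00:00:00+00:00
print to_iso8601('2005-06-03'), "\n";   # 2005-06-03T00:00:00+00:00
```

Because everything is reduced to a string at this point, the original value cannot be recovered afterwards, which matches the memory-saving note above.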
priority()
Set the priority. This field is not used in sitemap indexes, only in sitemaps.
delete()
Delete this object from the sitemap or the sitemap index.
lenient()
If lenient contains a true value, then errors will not be fatal.
Download (0.030MB)
Added: 2006-10-24 License: Perl Artistic License Price:
1097 downloads
jGetFile 0.80
jGetFile is a scriptable command-line mass-file downloader that is specifically geared towards non-HTML files.
The jGetFile project is geared towards mass downloading of non-HTML files from the web. Web crawlers and website downloaders are widely available and work very well. However, not all web-based file crawlers are equal.
Few file crawlers handle this href scenario: <a href="http://www.foo.com?url=http://youreallywantthislink.com/files"></a>. jGetFile was engineered to handle as many extraneous href configurations as possible. Although it does not handle all possible cases, more will be supported in future releases.
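To make the href scenario above concrete, here is a minimal sketch of pulling the embedded target out of such a link. It is not jGetFile's own code; embedded_url is a hypothetical name, and the sketch assumes the inner URL appears unencoded in a query parameter, as in the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the absolute URL carried inside a query
# parameter of $href, or undef when there is none.
sub embedded_url {
    my ($href) = @_;
    # A parameter (after ? or &) whose value itself starts with http(s)://
    return $1 if $href =~ m{[?&][^=&]+=(https?://[^&#\s]+)};
    return undef;
}

my $href   = 'http://www.foo.com?url=http://youreallywantthislink.com/files';
my $target = embedded_url($href);
print defined $target ? "$target\n" : "no embedded URL\n";
# prints http://youreallywantthislink.com/files
```

A percent-encoded inner URL would need decoding first, which this sketch does not attempt.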
jGetFile supports a highly configurable means to filter the acceptance of links within the program. A user can use the -i or -e options to have the program only traverse links that start with the addresses the user specified, or to exclude links that start with the specified addresses.
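The prefix-based filtering described above can be sketched as a simple predicate. This is an illustration only: accept_link is a hypothetical name, and letting exclusions take precedence when both lists are given is an assumption, since the listing does not say how -i and -e combine.

```perl
use strict;
use warnings;

# Hypothetical predicate: decide whether a link should be traversed.
# $include and $exclude are array refs of address prefixes.
sub accept_link {
    my ($url, $include, $exclude) = @_;

    # Assumption: an excluded prefix always wins
    for my $prefix (@$exclude) {
        return 0 if index($url, $prefix) == 0;
    }

    # With no include prefixes, every remaining link is accepted
    return 1 unless @$include;

    for my $prefix (@$include) {
        return 1 if index($url, $prefix) == 0;
    }
    return 0;
}

my @include = ('http://www.blah.com/');
my @exclude = ('http://www.blah.com/private/');
for my $url ('http://www.blah.com/files/a.zip',
             'http://www.blah.com/private/b.zip',
             'http://www.foo.com/c.zip') {
    printf "%-36s %s\n", $url,
        accept_link($url, \@include, \@exclude) ? 'accept' : 'skip';
}
```

Using index() rather than a regular expression keeps the match a literal "starts with" test, so dots and slashes in the prefixes need no escaping.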
Alternatively, power users can specify a BeanShell script through the -als option. This allows arbitrarily complex rules to be specified for accepting links, such as: accept only links at depth 1 that begin with www.blah.com, exclude links at level 2 that begin with www.foo.com, and exclude links at level 2 that contain the word cat. The depth variable is currently not available to custom scripts, but will be in the next release.
There is no GUI in the works, and because of the intended simplicity of this program, one is not planned in the future either. Also, jGetFile is not intended to replace wget, although its initial features were based on wget. jGetFile has the single goal of downloading files quickly and efficiently from the web.
Download (1.8MB)
Added: 2006-08-11 License: The Apache License 2.0 Price:
1169 downloads
AudioLink 0.05
AudioLink is a tool that makes searching for music on your local storage media easier and faster. Your searches can include a variety of criteria, such as male artists, female artists, band, genre, etc. It is flexible: you can use a command-line interface, choose among several GUIs, design your own search criteria, and so on.
Currently, it's called AudioLink because the first milestone will handle only audio files. Subsequent versions will be capable of searching for content in HTML, PDF, PS and other file formats.
This project started with my need to search for files on my local machine, be it music or any stored information in .txt, .html, .pdf formats. The main goal of the software is to make searching for _content_ on local file systems (or remote file systems mounted in the local namespace) easier. This differs from other search tools, which look for files, not content. You can't use traditional tools like grep to search for songs or a particular artist, for example.
If you are in search of such a tool, AudioLink is the right choice for you!
The project will further be improved upon to include a LAN crawler, which will sniff on NFS, SMB, FTP, among other protocols, to collect information on the files residing on other machines as well.
Enhancements:
- code/alsearch:
  1. config file isn't perl code now; simple "a = b" stuff
  2. command-line args override config file options
- code/alfilldb: ouch! one more ref. to alfilldb_usage
- code/alfilldb: removed ref. to alfilldb_usage.txt
- code/alfilldb:
  1. config file isn't perl code now; simple "a = b" stuff
  2. command-line args override config file options
- code/audiolink:
  1. clean up the printed statements.
  2. added a verbose mode
  3. config file isn't perl code now; simple "a = b" stuff
  4. default to localhost for the host field
  5. command-line args override config file options
- cvsignore: ignore debian/ and gui/
- Documentation/alsearch_usage.txt, Documentation/alfilldb_usage.txt: remove the _usage.txt files; we now have *_doc.html files.
- INSTALL: You can now use the audiolink script to create the database and table.
- README: better 1st para
- TODO:
  1. We have a config file
  2. Debian packaging is done; get rpms done now
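The "a = b" config format and the override order mentioned in the changelog above might look like this in practice. This is a sketch under stated assumptions, not AudioLink's real parser: the key names (host, user), the '#' comment syntax, and both helper names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical parser for simple "key = value" lines.
# Assumption: blank lines and lines starting with '#' are skipped.
sub parse_config {
    my ($text) = @_;
    my %conf;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;
        $conf{$1} = $2 if $line =~ /^\s*([^=\s]+)\s*=\s*(.*?)\s*$/;
    }
    return \%conf;
}

# Later hashes win: built-in defaults < config file < command line.
sub merge_settings {
    my ($defaults, $file, $cli) = @_;
    return +{ %$defaults, %$file, %$cli };
}

my $file     = parse_config("# sample\nuser = alice\nhost = db.example.org\n");
my $settings = merge_settings({ host => 'localhost' }, $file, { user => 'bob' });
print "$_ = $settings->{$_}\n" for sort keys %$settings;
# host = db.example.org
# user = bob
```

Here the config file replaces the localhost default for host, and the command-line value replaces the file's user.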
Download (0.033MB)
Added: 2006-07-18 License: GPL (GNU General Public License) Price:
1193 downloads
MKSearch beta 1
MKSearch provides a Web metadata spider and search engine.
MKSearch is a metadata search engine that indexes structured metadata in Web documents instead of free text in the document body.
The data acquisition system conforms to the Dublin Core metadata in HTML recommendations, and supports other application profiles, such as the UK e-Government Metadata Standard.
It also indexes native RDF formats, including RSS 1.0. The system has five major components: a Web crawler, an HTML document validator and formatter, a set of custom indexers, an RDF storage and query system, and a public query interface, provided through a standard servlet container.
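As a rough illustration of the kind of structured metadata involved, Dublin Core elements are commonly embedded in HTML as meta tags whose names start with "DC.". The sketch below collects them with a regular expression. It is a simplified stand-in, not MKSearch's approach: dc_metadata is a hypothetical name, and the pattern assumes double-quoted attributes with name before content.

```perl
use strict;
use warnings;

# Hypothetical extractor: collect <meta name="DC.*" content="..."> pairs.
sub dc_metadata {
    my ($html) = @_;
    my %meta;
    while ($html =~ /<meta\s+name="(DC\.[^"]+)"\s+content="([^"]*)"/gi) {
        push @{ $meta{$1} }, $2;   # an element may appear more than once
    }
    return \%meta;
}

my $html = <<'HTML';
<head>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
<meta name="DC.title" content="Example page">
<meta name="DC.creator" content="A. Author">
<meta name="keywords" content="ignored, not Dublin Core">
</head>
HTML

my $meta = dc_metadata($html);
print "$_: @{ $meta->{$_} }\n" for sort keys %$meta;
# DC.creator: A. Author
# DC.title: Example page
```

A regular expression is fragile against real-world markup, which is one reason to tidy documents into XHTML and use a proper parser instead.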
System composition
The MKSearch system is composed of several other free software components. Further details are provided in the MKSearch development plans.
JSpider
JSpider is a Java Web crawler engine that has pluggable interfaces that can be used to add custom processing and content handling. MKSearch uses custom SAX-based content handlers for extracting metadata from Web documents.
Sesame
Sesame is a set of RDF processing and storage APIs and applications that includes RDF data query facilities. MKSearch uses Sesame to store indexed metadata in RDF format and to search the repository via the public query interface.
JTidy
JTidy is a utility for correcting common HTML markup errors and is used to convert HTML documents to XHTML so they can be processed using SAX.
Download (9.0MB)
Added: 2007-02-16 License: GPL (GNU General Public License) Price:
980 downloads
WWW::Google::SiteMap 1.09
WWW::Google::SiteMap is a Perl extension for managing Google SiteMaps.
SYNOPSIS
use WWW::Google::SiteMap;
my $map = WWW::Google::SiteMap->new(file => 'sitemap.gz');
# Main page, changes a lot because of the blog
$map->add(WWW::Google::SiteMap::URL->new(
    loc        => 'http://www.jasonkohles.com/',
    lastmod    => '2005-06-03',
    changefreq => 'daily',
    priority   => 1.0,
));
# Top level directories, don't change as much, and have a lower priority
$map->add({
    loc        => "http://www.jasonkohles.com/$_/",
    changefreq => 'weekly',
    priority   => 0.9, # lower priority than the home page
}) for qw(
    software gpg hamradio photos scuba snippets tools
);
$map->write;
The Sitemap Protocol allows you to inform search engine crawlers about URLs on your Web sites that are available for crawling. A Sitemap consists of a list of URLs and may also contain additional information about those URLs, such as when they were last modified, how frequently they change, etc.
This module allows you to create and modify sitemaps.
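For orientation, the list of URLs and per-URL details described above ends up as a small XML document. The sketch below renders one entry by hand; it is not how the module writes its output. url_entry is a hypothetical helper, and the namespace shown is the sitemaps.org 0.9 one, which may differ from what this version of the module emits.

```perl
use strict;
use warnings;

# Hypothetical helper: render one <url> entry of a sitemap.
# Only loc is required; fields that are not supplied are skipped.
sub url_entry {
    my (%u) = @_;
    my $xml = "  <url>\n";
    for my $field (qw(loc lastmod changefreq priority)) {
        next unless defined $u{$field};
        $xml .= "    <$field>$u{$field}</$field>\n";
    }
    return $xml . "  </url>\n";
}

print qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
print url_entry(
    loc        => 'http://www.jasonkohles.com/',
    lastmod    => '2005-06-03',
    changefreq => 'daily',
    priority   => '1.0',
);
print "</urlset>\n";
```

The sketch does no XML escaping, so a loc containing characters such as & would have to be escaped before being passed in.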
Download (0.041MB)
Added: 2006-11-22 License: Perl Artistic License Price:
1069 downloads
Copyright Notice:
Software piracy is theft. Using cracks, passwords, serial numbers, registration codes, or key generators is illegal and prevents future software development. The above crawler search lists only software in full, demo, and trial versions for free download. Download links come directly from our mirror sites or publisher sites; torrent files and links from rapidshare.com, yousendit.com, or megaupload.com are not allowed.