
Free scraper software for Linux

Results 1 - 15 of about 30
ScraperPOD 3.05

ScraperPOD is a framework for scraping results from search engines.

SYNOPSIS

use WWW::Scraper;

# Name your Scraper module / search engine as the first parameter,
use WWW::Scraper('eBay');
# or in the new() method
$scraper = new WWW::Scraper('eBay');

Classic WWW::Search mode

# Use a Scraper engine just as you would a WWW::Search engine.
$scraper = new WWW::Scraper('carsforsale', 'Honda', { 'lbxModel' => 'Accord', 'lbxVehicleYearFrom' => 1998 });
while ( $response = $scraper->next_result() ) {
    # harvest results via hash-table reference.
    print $response->{'sellerPhoneNumber'};
}

Canonical Request/Response mode (not yet implemented)

$scraper = new WWW::Scraper('carsforsale', 'Request' => 'Autos', 'Response' => 'Autos');
# or, since carsforsale.pm defaults to the Request and Response classes of Autos
$scraper = new WWW::Scraper('carsforsale');
#
# Set field values via field-named canonical access methods.
$scraper->scraperRequest->make('Honda');
$scraper->scraperRequest->model('Accord');
$scraper->scraperRequest->minYear(1998);
#
# Note: this is *not* next_result().
while ( $response = $scraper->next_response() ) {
    #
    # harvest results via field-named access methods.
    print $response->sellerPhoneNumber();
}

Variant Requests to a single search engine

$scraper = new WWW::Scraper('carsforsale');
$scraper->scraperRequest->make('Honda');
$scraper->scraperRequest->minYear(1998);
#
for my $model ( qw(Accord Civic) ) {
    $scraper->scraperRequest->model($model);
    while ( $response = $scraper->next_response() ) {
        # all response fields are returned as a reference to the value.
        print ${$response->sellerPhoneNumber()};
    }
}

Single Request to variant search engines

# Set the request parameters in a Request object (sub-class Autos).
$request = new WWW::Scraper::Request('Autos');
$request->make('Honda');
$request->model('Accord');
$request->minYear(1998);
#
for my $searchEngine ( qw(carsforsale 1001cars) ) {
    $scraper = new WWW::Scraper($searchEngine, 'Request' => $request);
    while ( $response = $scraper->next_response() ) {
        # all response fields are returned as a reference to the value.
        print ${$response->sellerPhoneNumber()};
    }
}

Download (0.10MB)
Added: 2006-06-15 License: GPL (GNU General Public License) Price:
1227 downloads
screen-scraper 3.0

screen-scraper is a tool for extracting data from web sites. You might use it for the following purposes:
- Data Mining and Extraction
- Data Migration
- Application Integration
- Business Intelligence
- Web Task Automation
- Portal Components
- Meta-Searching
- Archiving
The screen-scraper application consists of two primary pieces:
- Workbench: A graphical user interface provides an intuitive approach that allows you to designate pages and specific pieces of information to be extracted.
- Server: After using the workbench to designate the data to be scraped, screen-scraper can be run in a server mode, much like a database. External applications can then connect to screen-scraper, which pulls data from the designated web sites and returns it to the calling application. For example, you might build a web-based application using Active Server Pages (ASP) or PHP that invokes screen-scraper to search for products found on an external web site in real time.
Additionally, screen-scraper can be started in a non-graphical mode from the command line such that it can be scheduled or invoked on-demand.
screen-scraper can automate many of the tasks typically required when scraping data from web pages, such as tracking cookies, logging in to web sites, and traversing search results pages.
Whatever programming languages and platforms you prefer, screen-scraper is likely to feel familiar. It contains an internal scripting engine that supports:
- VBScript
- JScript
- Perl
- Interpreted Java
- JavaScript
- Python
When invoking screen-scraper externally, take your pick from the following languages:
- Java
- PHP
- Anything COM-based (such as Active Server Pages, Visual Basic, and Visual C++)
- .NET (both Microsoft-based and Mono)
- ColdFusion
Enhancements:
- Several bug fixes and minor features have been added, including automatic backup of the database, enhanced HTML rendering and HTML stripping, a fix for an error that at times caused duplicate scripts to appear on import, and fixes for multiple errors relating to international character sets and non-ASCII characters.
Download (66MB)
Added: 2007-01-15 License: Freeware Price:
599 downloads
Text::Scraper 0.02

Text::Scraper extracts structured data from (un)structured text.

SYNOPSIS

use Text::Scraper;

use LWP::Simple;
use Data::Dumper;

#
# 1. Get our template and source text
#
my $tmpl = Text::Scraper->slurp(*DATA);
my $src = get('http://search.cpan.org/recent') || die $!;

#
# 2. Extract data from source
#
my $obj = Text::Scraper->new(tmpl => $tmpl);
my $data = $obj->scrape($src);

#
# 3. Do something really neat...(left as exercise)
#
print "Newest Submission: ", $data->[0]{submissions}[0]{name}, "\n\n";
print "Scraper model:\n", Dumper($obj), "\n\n";
print "Parsed model:\n", Dumper($data) , "\n\n";

__DATA__

<div class=path><center><table><tr>
<?tmpl stuff pre_nav ?>
<td class=datecell><span><big><b> <?tmpl var date_string ?> </b></big></span></td>
<?tmpl stuff post_nav ?>
</tr></table></center></div>

<ul>
<?tmpl loop submissions ?>
<li><a href="<?tmpl var link ?>"><?tmpl var name ?></a>
<?tmpl if has_description ?>
<small> -- <?tmpl var description ?></small>
<?tmpl end has_description ?>
</li>
<?tmpl end submissions ?>
</ul>

ABSTRACT

Text::Scraper provides a fully functional base class for quickly developing screen scrapers and other text extraction tools. Programmatically generated text, such as dynamic web pages, is trivially reverse engineered.

Using templates, the programmer is freed from staring at fragile, heavily escaped regular expressions, mapping capture groups to named variables or wrestling with the DOM and badly formed HTML. In addition, extracted data can be hierarchical, which is beyond the capabilities of vanilla regular expressions.

Text::Scraper's functionality overlaps some existing CPAN modules - Template::Extract and WWW::Scraper.
Text::Scraper is much more lightweight than either and has a more general application domain than the latter. It has no dependencies on other frameworks, modules or design-decisions. On average, Text::Scraper benchmarks around 250% faster than Template::Extract - and uses significantly less memory.

Unlike both existing modules, Text::Scraper generalizes its functionality to allow the programmer to refine template capture groups beyond (.*?), fully redefine the template syntax and introduce new template constructs bound to custom classes.
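
To make the template idea concrete, here is a minimal sketch of how a flat template could be turned into a regular expression with named capture groups. It is not Text::Scraper's actual implementation: the helper names compile_template and extract are hypothetical, only the "var" placeholder is handled (no loop, if or stuff constructs), and the sample HTML is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scraper): compile a flat template
# into a regex in which every <?tmpl var NAME ?> placeholder becomes a
# non-greedy named capture group, i.e. the default (.*?) capture.
sub compile_template {
    my ($tmpl) = @_;
    # A capturing group in split keeps the placeholder names in the list:
    # literal, name, literal, name, ..., literal
    my @parts = split /<\?tmpl\s+var\s+(\w+)\s*\?>/, $tmpl;
    my $regex = '';
    while (@parts) {
        $regex .= quotemeta( shift @parts );                    # literal text
        $regex .= '(?<' . shift(@parts) . '>.*?)' if @parts;    # placeholder
    }
    return qr/$regex/s;
}

# Hypothetical helper: return the captured fields as a hash reference,
# or undef when the template does not match the source text.
sub extract {
    my ( $tmpl, $src ) = @_;
    my $re = compile_template($tmpl);
    return $src =~ $re ? { %+ } : undef;
}

# Made-up sample input for illustration only.
my $tmpl = '<li><a href="<?tmpl var link ?>"><?tmpl var name ?></a>';
my $src  = '<ul><li><a href="/dist/Foo-1.0/">Foo-1.0</a></li></ul>';

my $data = extract( $tmpl, $src );
print "name=$data->{name} link=$data->{link}\n" if $data;
```

The sketch also shows why hand-written regular expressions get fragile: every literal character of the template has to be escaped, which is the work the template approach hides from the programmer.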

Download (0.045MB)
Added: 2007-08-22 License: Perl Artistic License Price:
796 downloads
WWW::PDAScraper 0.1

WWW::PDAScraper is a Perl class for scraping PDA-friendly content from websites.

Synopsis

use WWW::PDAScraper;
my $scraper = WWW::PDAScraper->new( qw( NewScientist Yahoo::Entertainment ) );
$scraper->scrape();
or
use WWW::PDAScraper;
my $scraper = WWW::PDAScraper->new;
$scraper->scrape( qw( NewScientist Yahoo::Entertainment ) );
or
perl -MWWW::PDAScraper -e "scrape qw( NewScientist Yahoo::Entertainment )"

Having written various kludgey scripts to download PDA-friendly content from various websites, I decided to try and write a generalised solution which would

* parse out the section of a news page which contains the links we want
* munge those links into the URL for the print-friendly version, if possible
* download those pages and make an index page for them

The moving of the pages to your PDA is not part of the scope of the module: the open-source browser and "distiller", Plucker, from http://plkr.org/ is recommended. Just get it to read the index.html file with a depth of 1 from disk, using a URL like file:///path/to/index.html

The Sub-modules

WWW::PDAScraper uses a set of rules for scraping a particular website from a second module, i.e. WWW::PDAScraper::Yahoo::Entertainment::TV contains the rules for scraping the Yahoo TV News website:

package WWW::PDAScraper::Yahoo::Entertainment::TV;
# WWW::PDAScraper.pm rules for scraping the
# Yahoo TV website
sub config {
    return {
        name       => 'Yahoo TV',
        start_from => 'http://news.yahoo.com/i/763',
        chunk_spec => [ "_tag", "div", "id", "indexstories" ],
        url_regex  => [ '$', '&printer=1' ]
    };
}
1;

A more or less random selection of modules is included, as well as a full set for Yahoo, to demonstrate a logical set of modules in categories.

Creating a new sub-module ought to be relatively simple, see the template provided, WWW::PDAScraper::Template.pm - you need name, start_from, then either chunk_spec or url_spec, then optionally a url_regex for transformation into the print-friendly URL.

Then either move your new module to the same location as the other ones on your system, or make sure it's available to your script with a line like use lib '/path/to/local/modules/PDAScraper/';
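
As an illustration of the url_regex rule, the sketch below treats it as a pattern/replacement pair applied to each link with a single substitution. That reading is an assumption based on the example configuration, not taken from the module's source; the helper name to_print_url and the example.com URL are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::PDAScraper): apply a
# [ pattern, replacement ] pair to a URL with one substitution.
sub to_print_url {
    my ( $url, $rule ) = @_;
    my ( $pattern, $replacement ) = @$rule;
    # Work on a copy so the caller's URL is left unchanged.
    ( my $print_url = $url ) =~ s/$pattern/$replacement/;
    return $print_url;
}

# Same rule as in the Yahoo TV configuration: the pattern '$' matches the
# end of the URL, so the replacement is simply appended.
my $rule = [ '$', '&printer=1' ];

# Made-up URL for illustration only.
print to_print_url( 'http://example.com/story?id=1', $rule ), "\n";
```

Under this reading, a site whose print-friendly pages live at a different path would use a pattern that matches that part of the URL instead of '$'.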

Download (0.069MB)
Added: 2006-12-14 License: Perl Artistic License Price:
1044 downloads
WWW::Scraper::Dice 0.01

WWW::Scraper::Dice is a Perl module that scrapes Dice: (skills, locations) => (title, location, residue).

SYNOPSIS

use WWW::Scraper;
my $oSearch = new WWW::Scraper('Dice');
my $sQuery = WWW::Scraper::escape_query("unix and (c++ or java)");
$oSearch->native_query($sQuery,
    { 'method'   => 'bool',
      'state'    => 'CA',
      'daysback' => 14 });
while (my $res = $oSearch->next_result()) {
    # isHitGood() is not defined here; supply your own filter.
    if (isHitGood($res->url)) {
        my ($company, $title, $date, $location) =
            ($res->company, $res->title, $res->date, $res->location);
        print "$company $title $date $location " . $res->url . "\n";
    }
}

Download (0.037MB)
Added: 2007-06-14 License: Perl Artistic License Price:
862 downloads
WWW::Scraper::BAJobs 0.01

WWW::Scraper::BAJobs scrapes BAJobs.com.

SYNOPSIS

require WWW::Scraper;
$search = new WWW::Scraper('BAJobs');

This class is a BAJobs specialization of WWW::Search. It handles making and interpreting BAJobs searches at http://www.BAJobs.com.

This class exports no public interface; all interaction should be done through WWW::Search objects.

Download (0.037MB)
Added: 2006-08-26 License: Perl Artistic License Price:
1154 downloads
WWW::Scraper::FlipDog 0.01

WWW::Scraper::FlipDog scrapes www.FlipDog.com.

SYNOPSIS

use WWW::Scraper;
use WWW::Scraper::Response::Job;

$search = new WWW::Scraper('FlipDog');

# $query and the options hash are supplied by the caller.
$search->setup_query($query, {options});

while ( my $response = $search->next_response() ) {
    # $response is a WWW::Scraper::Response::Job.
}

FlipDog extends WWW::Scraper.
It handles making and interpreting FlipDog searches of http://www.FlipDog.com.

Download (0.037MB)
Added: 2006-08-26 License: Perl Artistic License Price:
1154 downloads
WWW::Scraper::Google 3.05

WWW::Scraper::Google scrapes www.Google.com.

Caveat Kleptor

Please note that using the Google Scraper module (may) be a violation of Google's "Terms of Service", of which your humble author has been repeatedly reminded. The TOS is not as easy to locate as some of these correspondents have suggested (without a smile), but you can find the TOS at http://www.google.com/terms_of_service.html

Briefly, the relevant part is the "No Automated Querying" section. It's a kind of "do as I say, not as I do" dictum. Your author has tried to divine exactly what it means. On the surface it's pretty clear, but if you follow the thread you will realize that it doesn't lead to a place any of us want to be. However, Google Inc's desire is clear enough. They do not want to be *abused* for the exclusive benefit of someone else.

Scraper is not a tool well suited for this kind of abuse. It is designed to be generally configurable and, as such, it is not particularly efficient. It obeys the "robots.txt" rules published by the web-server. It would require some effort on a user's part to circumvent this feature. The Google.pm does not do a "meta-search" on Google. Even if your humble author removed Google.pm from the Scraper suite, it would be trivially easy for someone to build a Google module for Scraper (their format is very simple compared to others).

I believe that Google Inc. understands a little interloping (in moderation) is beneficial to all. I should note that Google Inc. has not notified your author of any concern on their part. This has been done by third parties who, for whatever reasons of their own, feel it necessary to interject themselves in others' disputes, even when no such dispute exists.

Keep in mind that this is Google's livelihood. Should your use of Scraper be your hobby, or even part of your livelihood, remember it never helps to hit someone where they live. They will defend themselves to the death (even if that death is yours).
Scraper is a handy little tool for getting to stuff you can't get to otherwise. Let's keep it that way!

Download (0.10MB)
Added: 2006-11-23 License: Perl Artistic License Price:
1075 downloads
WWW::Scraper::Monster 0.01

WWW::Scraper::Monster is a Perl module that scrapes Monster.com.

SYNOPSIS

use WWW::Search;
my $oSearch = new WWW::Search('Monster');
my $sQuery = WWW::Search::escape_query("unix and (c++ or java)");
$oSearch->native_query($sQuery,
    { 'st' => 'CA',
      'tm' => '14d' });
while (my $res = $oSearch->next_result()) {
    print $res->company . "\t" . $res->title . "\t" . $res->change_date
        . "\t" . $res->location . "\t" . $res->url . "\n";
}

This class is a Monster specialization of WWW::Search. It handles making and interpreting Monster searches at http://www.monster.com. Monster supports Boolean logic with "and"s and "or"s. See http://jobsearch.monster.com/jobsearch_tips.asp for a full description of the query language.

The returned WWW::Scraper::Response objects contain url, title, company, location and change_date fields.
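
The synopsis above passes the Boolean query through escape_query before sending it. As a rough illustration of what that step does to such a query, here is a stand-in written from scratch. It assumes that escaping means turning spaces into "+" and percent-encoding every other non-alphanumeric ASCII character; the function name escape_query_sketch is hypothetical and this is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical stand-in (not WWW::Search's implementation): make a query
# safe for use in a URL. Assumes ASCII input.
sub escape_query_sketch {
    my ($query) = @_;
    # Percent-encode everything except letters, digits and spaces.
    $query =~ s/([^ A-Za-z0-9])/sprintf('%%%02X', ord $1)/ge;
    # Spaces become plus signs.
    $query =~ tr/ /+/;
    return $query;
}

print escape_query_sketch('unix and (c++ or java)'), "\n";
# prints: unix+and+%28c%2B%2B+or+java%29
```

The parentheses and the "+" characters of "c++" are the parts that would otherwise be misread in a URL, which is why the Boolean query is escaped before it reaches native_query.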

Download (0.038MB)
Added: 2007-06-14 License: Perl Artistic License Price:
862 downloads
WWW::Scraper::Brainpower 0.01

WWW::Scraper::Brainpower scrapes Brainpower.com.

SYNOPSIS

use WWW::Scraper;
use WWW::Scraper::Response::Job;

$search = new WWW::Scraper('Brainpower');

# $query and the options hash are supplied by the caller.
$search->setup_query($query, {options});

while ( my $response = $search->next_response() ) {
    # $response is a WWW::Scraper::Response::Job.
}

Brainpower extends WWW::Scraper.
It handles making and interpreting Brainpower searches of http://www.Brainpower.com.

Download (0.037MB)
Added: 2006-08-26 License: Perl Artistic License Price:
1154 downloads
WWW::Scraper::NorthernLight 3.05

WWW::Scraper::NorthernLight scrapes NorthernLight.com.

SYNOPSIS

require WWW::Scraper;
$search = new WWW::Scraper('NorthernLight');

This class is a NorthernLight specialization of WWW::Search. It handles making and interpreting NorthernLight searches at http://www.NorthernLight.com.

This class exports no public interface; all interaction should be done through WWW::Search objects.

Download (0.10MB)
Added: 2006-08-26 License: Perl Artistic License Price:
1154 downloads
WWW::Scraper::CraigsList 3.05

WWW::Scraper::CraigsList is a Perl module that scrapes CraigsList.

SYNOPSIS

require WWW::Scraper;
$search = new WWW::Scraper('CraigsList');

This class is a CraigsList specialization of WWW::Search. It handles making and interpreting CraigsList searches at http://www.CraigsList.com.

This class exports no public interface; all interaction should be done through WWW::Search objects.

OPTIONS

None at this time (2001.04.25)

search_url=URL
Specifies who to query with the CraigsList protocol. The default is at http://www.CraigsList.com/cgi-bin/job-search.

search_debug, search_parse_debug, search_ref
Specified at WWW::Search.

Internet/Web Engineering Category options:
(empty) - ALL JOBS
art - web design jobs
bus - business jobs
mar - marketing jobs
eng - internet engineering jobs
etc - etcetera jobs
wri - writing jobs
sof - software jobs
acc - finance jobs
ofc - office jobs
med - media jobs
hea - health science jobs
ret - retail jobs
npo - nonprofit jobs
lgl - legal jobs
egr - engineering jobs
sls - sales jobs
sad - sys admin jobs
tel - network jobs
tfr - tv video radio jobs
hum - human resource jobs
tch - tech support jobs
edu - education jobs
trd - skilled trades jobs

Checkboxes - additive to search(?)
addOne value=telecommuting - telecommute
addTwo value=contract - contract
addThree value=internship - internships
addFour value=part-time - part-time
addFive value=non-profit - non-profit

Enhancements:
- Perl
Download (0.10MB)
Added: 2007-02-22 License: Perl Artistic License Price:
591 downloads
WWW::Search::Scraper::BAJobs 2.27

WWW::Search::Scraper::BAJobs scrapes BAJobs.com.

SYNOPSIS

require WWW::Search::Scraper;
$search = new WWW::Search::Scraper('BAJobs');

This class is a BAJobs specialization of WWW::Search. It handles making and interpreting BAJobs searches at http://www.BAJobs.com.

This class exports no public interface; all interaction should be done through WWW::Search objects.

Download (0.13MB)
Added: 2006-08-26 License: Perl Artistic License Price:
1154 downloads
WWW::Search::Scraper::Google 2.27

WWW::Search::Scraper::Google is a Perl module that scrapes www.Google.com.

SYNOPSIS

require WWW::Search::Scraper;
$search = new WWW::Search::Scraper('Google');

This class is a Google specialization of WWW::Search. It handles making and interpreting Google searches at http://www.Google.com.

Download (0.13MB)
Added: 2006-11-24 License: Perl Artistic License Price:
1066 downloads
WWW::Search::Scraper::Dice 2.27

WWW::Search::Scraper::Dice is a Perl module that scrapes Dice: (skills, locations) => (title, location, residue).

SYNOPSIS

use WWW::Search::Scraper;
my $oSearch = new WWW::Search::Scraper('Dice');
my $sQuery = WWW::Search::Scraper::escape_query("unix and (c++ or java)");
$oSearch->native_query($sQuery,
    { 'method'   => 'bool',
      'state'    => 'CA',
      'daysback' => 14 });
while (my $res = $oSearch->next_result()) {
    # isHitGood() is not defined here; supply your own filter.
    if (isHitGood($res->url)) {
        my ($company, $title, $date, $location) =
            ($res->company, $res->title, $res->date, $res->location);
        print "$company $title $date $location " . $res->url . "\n";
    }
}

This class is a Dice extension of WWW::Search::Scraper. It handles making and interpreting Dice searches at http://www.dice.com.

Download (0.13MB)
Added: 2007-06-15 License: Perl Artistic License Price:
861 downloads