extract data
PDFMiner 20090721
PDFMiner is a suite of programs that help extract and analyze text data from PDF documents.
Unlike other PDF-related tools, PDFMiner 20090721 can obtain the exact location of text on a page, as well as extra information such as font details or ruled lines.
It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.
Major Features:
- Written entirely in Python. (for version 2.4 or newer)
- PDF-1.7 specification support. (well, almost)
- Non-ASCII languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Basic encryption (RC4) support.
- PDF to HTML conversion (with a sample converter web app).
- Outline (TOC) extraction.
- Tagged contents extraction.
- Infers the text running order using a clustering technique.
Requirements:
- Python
Bottle 0.4.4
WSGI micro web framework + templates
Bottle 0.4.4 is a fast, simple and useful one-file WSGI framework with built-in templates. It is not a full-stack framework with a ton of features, but a useful micro-framework for small web applications that stays out of your way.
Bottle only depends on the Python Standard Library. If you want to use an HTTP server other than wsgiref.simple_server, you may need cherrypy, flup or paste (your choice).
Major Features:
- Request dispatching: Map requests to handler callables using URL routes.
  - URL parameters: Use regular expressions such as /object/(?P<id>[0-9]+) or the simplified syntax /object/:id to extract data from URLs.
- WSGI abstraction: Don't worry about CGI and WSGI internals.
  - Input: request.GET['parameter'] or request.POST['form-field']
  - HTTP header: response.header['Content-Type'] = 'text/html'
  - Cookie management: response.COOKIES['session'] = 'new_key'
  - Static files: send_file('movie.flv', '/downloads/') with automatic MIME-type guessing.
  - Errors: Throw HTTP errors using abort(404, 'Not here'), or subclass HTTPError and use custom error handlers.
- Templates: Integrated template language.
  - Plain simple: Execute Python code with %... or use the inline syntax {{...}} for one-line expressions.
  - No IndentationErrors: Blocks are closed by %end. Indentation is optional.
  - Extremely fast: Parses and renders templates 5 to 10 times faster than Mako.
  - Support for Mako templates (requires mako).
- HTTP server: Built-in WSGI/HTTP gateway server (for development and production mode).
  - Currently supports wsgiref.simple_server (default), cherrypy, flup, paste and fapws3.
- Speed optimisations:
  - Sendfile: Support for platform-specific high-performance file-transmission facilities, such as the Unix sendfile().
    - Depends on wsgi.file_wrapper provided by your WSGI server implementation.
  - Self-optimising routes: Frequently used routes are tested first (optional).
  - Fast static routes (single dict lookup).
Requirements:
- Python
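The URL parameters feature listed above can be illustrated with a small standalone sketch. It uses only Python's re module and is not Bottle's own routing code: the helper names compile_route and match_route are hypothetical, and the rule that a ':name' token matches exactly one path segment is an assumption made for the example.

```python
import re

def compile_route(route):
    """Turn '/object/:id' into a regex with one named group per ':name' token."""
    # Assumption: a ':name' token matches a single path segment (no '/').
    # Other regex metacharacters in the route are left as written.
    pattern = re.sub(r":([A-Za-z_][A-Za-z0-9_]*)", r"(?P<\1>[^/]+)", route)
    return re.compile("^" + pattern + "$")

def match_route(compiled, path):
    """Return the extracted URL parameters as a dict, or None if no match."""
    m = compiled.match(path)
    return m.groupdict() if m else None

# The simplified syntax and the explicit regular-expression syntax side by side.
simple = compile_route("/object/:id")
explicit = re.compile(r"^/object/(?P<id>[0-9]+)$")

print(match_route(simple, "/object/42"))
print(match_route(explicit, "/object/abc"))
```

The explicit form is stricter here because it only accepts digits, while the simplified form accepts any non-empty segment.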
LogMiner 1.23
A powerful log analysis package for Apache
Major Features:
- Data is stored in a DBMS backend and reports are generated on the fly, while Webalizer generates plain HTML files. A DBMS lets you extract and aggregate data in many ways, whenever you need to. A drawback is that you won't have the processing speed of Webalizer when parsing log files.
- LogMiner lets you navigate to previous months easily.
- Webalizer reports are hardcoded in the program. LogMiner implements reports in a more extensible way. Each report is in fact a simple PHP class, usually supported by a PL/pgSQL function (although you're free to insert your SQL queries in the PHP code if you like).
- LogMiner offers more reports than Webalizer: for instance, the OS charts and the navigation graphs.
- Depending on your needs, you might prefer LogMiner over Webalizer, especially if you like having a central SQL repository for your data. This lets you extract the data you need at any time, or add a kind of report that wasn't planned from the start and apply it to older data.

AbiWord for linux 2.4.6
A free, full-featured word processor for Linux.
AbiWord is a full-featured word processor originally developed by the SourceGear Corporation and now maintained by an open group of volunteers.
Today AbiWord compiles as a native application on a wide collection of computers and can handle an equally impressive number of file formats. In addition, AbiWord's feature set includes almost everything one would expect in a modern word processor, plus numerous advanced features that allow it to compete successfully with many proprietary word processors. A short list of features includes:
- A familiar interface
- Outstanding file import and export, with support for MS Word, WordPerfect, and more
- Unlimited undo and redo capacity
- Solid (X)HTML export, with CSS styles support
- Images
- Spelling support, with optional underlining
- Bullets and Lists
- Styles
- Table of Contents generation and customization through the Stylist
- Complete, intuitive revisions-tracking support
- Nested tables support, nearly unmatched in the field
- Mail merge
- Bidirectional text support
- Command-line and server use modes for document processing capabilities
One of the most lasting differences between AbiWord and most word processors is the default file format. Unlike documents saved normally in many competing word processors, documents saved with AbiWord are written in plainly readable text with XML markup, making it possible to use any text editor to view them. With this style of data storage, you can feel assured that your data is safe and readable, even without the original AbiWord program that created it. Users are even free to create their own programs to parse the AbiWord markup and extract data from it.
Text::Scraper 0.02
Text::Scraper extracts structured data from (un)structured text.
SYNOPSIS
use Text::Scraper;
use LWP::Simple;
use Data::Dumper;
#
# 1. Get our template and source text
#
my $tmpl = Text::Scraper->slurp(*DATA);
my $src = get('http://search.cpan.org/recent') || die $!;
#
# 2. Extract data from source
#
my $obj = Text::Scraper->new(tmpl => $tmpl);
my $data = $obj->scrape($src);
#
# 3. Do something really neat...(left as exercise)
#
print "Newest Submission: ", $data->[0]{submissions}[0]{name}, "\n\n";
print "Scraper model:\n", Dumper($obj), "\n\n";
print "Parsed model:\n", Dumper($data) , "\n\n";
__DATA__
<div class=path><center><table><tr>
<?tmpl stuff pre_nav ?>
<td class=datecell><span><big><b> <?tmpl var date_string ?> </b></big></span></td>
<?tmpl stuff post_nav ?>
</tr></table></center></div>
<ul>
<?tmpl loop submissions ?>
<li><a href="<?tmpl var link ?>"><?tmpl var name ?></a>
<?tmpl if has_description ?>
<small> -- <?tmpl var description ?></small>
<?tmpl end has_description ?>
</li>
<?tmpl end submissions ?>
</ul>
ABSTRACT
Text::Scraper provides a fully functional base class for quickly developing screen scrapers and other text extraction tools. Programmatically generated text, such as dynamic web pages, is trivially reverse engineered.
Using templates, the programmer is freed from staring at fragile, heavily escaped regular expressions, mapping capture groups to named variables or wrestling with the DOM and badly formed HTML. In addition, extracted data can be hierarchical, which is beyond the capabilities of vanilla regular expressions.
Text::Scraper's functionality overlaps with some existing CPAN modules: Template::Extract and WWW::Scraper.
Text::Scraper is much more lightweight than either and has a more general application domain than the latter. It has no dependencies on other frameworks, modules or design-decisions. On average, Text::Scraper benchmarks around 250% faster than Template::Extract - and uses significantly less memory.
Unlike both existing modules, Text::Scraper generalizes its functionality to allow the programmer to refine template capture groups beyond (.*?), fully redefine the template syntax and introduce new template constructs bound to custom classes.
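The idea of reversing a template into a pattern can be sketched in a few lines of Python. This is not Text::Scraper's implementation: it handles only var tags (written here as <?tmpl var name ?>), ignores the stuff, loop and if constructs, and assumes each variable name occurs once per template. The function names template_to_regex and scrape are hypothetical. Each var tag becomes a lazy named group, the (.*?) capture mentioned above.

```python
import re

# Matches a template variable tag such as <?tmpl var name ?>.
VAR = re.compile(r"<\?tmpl var (\w+) \?>")

def template_to_regex(template):
    """Escape the literal text and replace each var tag with a lazy named group."""
    parts = []
    pos = 0
    for tag in VAR.finditer(template):
        parts.append(re.escape(template[pos:tag.start()]))
        parts.append("(?P<%s>.*?)" % tag.group(1))
        pos = tag.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)

def scrape(template, source):
    """Return one dict of captured variables per match of the template."""
    return [m.groupdict() for m in template_to_regex(template).finditer(source)]

template = '<li><a href="<?tmpl var link ?>"><?tmpl var name ?></a>'
source = '<ul><li><a href="/a">Foo</a></li><li><a href="/b">Bar</a></li></ul>'
for row in scrape(template, source):
    print(row["name"], row["link"])
```

A flat list of dicts is as far as this sketch goes; the hierarchical results described above would need the loop construct, which is left out.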
Local Data Manager 6.6.5
Local Data Manager is a collection of cooperating programs that select, capture, manage, and distribute arbitrary data products.
The system is designed for event-driven data distribution, and is currently used in the Unidata Internet Data Distribution (IDD) project. The LDM system includes network client and server programs and their shared protocols.
An important characteristic of the LDM is its support for flexible, site-specific configuration.
Enhancements:
- Fixes for timestamp bugs.
Data Crow 2.12 / 3.0 Alpha 2
Data Crow retrieves information from the web for you.
Main features:
- Skinnable UI
- Internal help system (activated by the F1 key)
- Nice-looking and easy-to-use interface
- Highly customizable!
- Keeping track of who borrowed what
- Software registration
- Audio CD registration
- Music files registration
- Movie registration
- Book registration
- Reporting Tool (Html, Pdf, Text)
- Amazon.com support (http://www.amazon.com)
- Imdb support (http://www.imdb.com)
- Freedb support (http://www.freedb.org)
- Imports information from CD or your hard disk
- Extracts information from music files (OGG, FLAC, APE and MP3 files)
- Supports parsing for DivX, Xvid, ASF, MKV, OGM, RIFF, MOV, IFO, VOB and Mpeg video
- Add your own, rename, disable and order fields
- Backup and Restore of the database
- SQL query tool, for expert users
- Platform-independent
- Internal HSQL database
What's New in 2.12 Stable Release:
- Some changes and fixes were made and the overall quality of the product was improved.
What's New in 3.0 Alpha 2 Development Release:
- General fixes were made and missing functionality was added.
Google Data Objective-C Client 1.1.0
Google Data Objective-C Client provides a framework and source code that make it easy to access data through Google Data APIs.
The Google data APIs provide a simple protocol for reading and writing data on the web. Many Google services provide a Google data API.
Each of the following Google services provides a Google data API:
- Base
- Blogger
- Calendar
- Spreadsheets
- Picasa Web Albums
- Notebook
Additional services with Google data APIs that are not yet supported by the Objective-C Client Library:
- Code Search
- Google Apps Provisioning
themonospot 0.5.1
themonospot scans an AVI file and extracts information about its audio and video data streams:
- Video codec used
- Frame size
- Average video bitrate
- File size
- Total time
- Frame rate
- Total frames
- Info data
- User data (in MOVI chunk)
- Audio codec used
- Average audio bitrate
- Audio channels
With themonospot it is also possible to modify FourCC information (the FourCC code in the video chunk and the FourCC description in the stream header).
WWW::Myspace::Data 0.13
WWW::Myspace::Data provides database interaction for WWW::Myspace.
SYNOPSIS
This module is the database interface for the WWW::Myspace modules. It imports methods into the caller's namespace, which allow the caller to bypass the loader object by calling the methods directly. This module is intended to be used as a back end for the Myspace modules, but it can also be called directly from a script if you need direct database access.
my %db = (
    dsn => 'dbi:mysql:database_name',
    user => 'username',
    password => 'password',
);
# create a new object
my $data = WWW::Myspace::Data->new( $myspace, { db => \%db } );
# set up a database connection
my $loader = $data->loader();
# initialize the database with Myspace login info
my $account_id = $data->set_account( $username, $password );
# now do something useful...
my $update = $data->update_friend( $friend_id );
CPAN::Mini::Extract 1.16
CPAN::Mini::Extract is a Perl module that can create CPAN::Mini mirrors with the archives extracted.
SYNOPSIS
# Create a CPAN extractor
my $cpan = CPAN::Mini::Extract->new(
    remote => 'http://mirrors.kernel.org/cpan/',
    local => '/home/adam/.minicpan',
    trace => 1,
    extract => '/home/adam/.cpanextracted',
    extract_filter => sub { /\.pm$/ and ! /\b(inc|t)\b/ },
    extract_check => 1,
);
# Run the minicpan process
my $changes = $cpan->run;
CPAN::Mini::Extract provides a base for implementing systems that download "all" of CPAN, extract the dists and then process the files within.
It provides the same synchronisation functionality as CPAN::Mini, except that it also maintains a parallel directory tree. That tree contains a directory located at an identical path to each archive file, with a controllable subset of the files in the archive extracted below it.
How does it work?
CPAN::Mini::Extract starts with a CPAN::Mini local mirror, which it will optionally update before each run. Once the CPAN::Mini directory is current, it will scan both directory trees, extracting any new archives and removing any extracted archives no longer in the minicpan mirror.
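The comparison step described above amounts to two set differences, which the following Python sketch illustrates. It is not the module's code: plan_sync and wanted are hypothetical names, the paths are made up, and the filter assumes the SYNOPSIS extract_filter reads as /\.pm$/ and ! /\b(inc|t)\b/.

```python
import re

def plan_sync(mirror_archives, extracted_dirs):
    """Compare the two trees and decide what to extract and what to remove."""
    mirror = set(mirror_archives)
    extracted = set(extracted_dirs)
    to_extract = sorted(mirror - extracted)  # new archives in the mirror
    to_remove = sorted(extracted - mirror)   # archives no longer mirrored
    return to_extract, to_remove

def wanted(path):
    """Keep .pm files whose path has no standalone 'inc' or 't' component."""
    is_module = re.search(r"\.pm$", path) is not None
    in_skipped_dir = re.search(r"\b(inc|t)\b", path) is not None
    return is_module and not in_skipped_dir

to_extract, to_remove = plan_sync(
    ["A/Foo-1.0.tar.gz", "B/Bar-2.0.tar.gz"],   # hypothetical mirror contents
    ["A/Foo-1.0.tar.gz", "C/Old-0.1.tar.gz"],   # hypothetical extracted tree
)
print(to_extract, to_remove)
```

Because extracted directories sit at the same relative path as their archives, a plain path comparison is enough to find both the new and the stale entries.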
Data::Phrasebook::Loader::XML 0.12
Data::Phrasebook::Loader::XML is a Perl module that lets you abstract your phrases with XML.
SYNOPSIS
use Data::Phrasebook;
my $q = Data::Phrasebook->new(
    class => 'Fnerk',
    loader => 'XML',
    file => 'phrases.xml',
    dict => 'Dictionary', # optional
);
OR
my $q = Data::Phrasebook->new(
    class => 'Fnerk',
    loader => 'XML',
    file => {
        file => 'phrases.xml',
        ignore_whitespace => 1,
    }
);
# simple keyword to phrase mapping
my $phrase = $q->fetch($keyword);
# keyword to phrase mapping with parameters
$q->delimiters( qr{ \[% \s* (\w+) \s* %\] }x );
my $phrase = $q->fetch($keyword,{this => 'that'});
Data::Diff 0.01
Data::Diff is a data structure comparison module.
SYNOPSIS
use Data::Diff qw(diff);
# simple procedural interface to raw difference output
$out = diff( $a, $b );
# OO usage
$diff = Data::Diff->new( $a, $b );
$new = $diff->apply();
$changes = $diff->diff_a();
Data::Diff computes the differences between two arbitrarily complex data structures.
METHODS
Creation
new Data::Diff( $a, $b, $options )
Creates and returns a new Data::Diff object with the differences between $a and $b.
Access
apply( $options )
Returns the result of applying one side over the other.
raw()
Returns the internal data structure that describes the differences at all levels within.
Functions
Diff( $a, $b, $options )
Compares the two arguments $a and $b and returns the raw comparison between the two.
EXPORT
Nothing by default but you can choose to export the non-OO function Diff().
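To make "differences at all levels" concrete, here is a minimal recursive comparison in Python. It is only an illustration of the general technique: the output shape (a list of path, left, right tuples) is invented for this sketch and is not the raw structure Data::Diff returns, and None is used as a stand-in for "missing on this side", so genuine None values are not distinguished.

```python
def diff(a, b, path=""):
    """Return a list of (path, left, right) tuples where a and b differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        changes = []
        for key in sorted(set(a) | set(b), key=str):
            child = "%s/%s" % (path, key)
            if key not in a:
                changes.append((child, None, b[key]))   # only on the right
            elif key not in b:
                changes.append((child, a[key], None))   # only on the left
            else:
                changes.extend(diff(a[key], b[key], child))
        return changes
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        changes = []
        for i, (x, y) in enumerate(zip(a, b)):
            changes.extend(diff(x, y, "%s/%d" % (path, i)))
        return changes
    # Scalars, mismatched types, or lists of different length: compare whole.
    return [] if a == b else [(path or "/", a, b)]

left = {"a": [1, 2, 3], "b": 5}
right = {"a": [1, 2, 4], "c": 6}
for change in diff(left, right):
    print(change)
```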
Data::Serializer 0.41
The Data::Serializer package contains modules that serialize data structures.
SYNOPSIS
use Data::Serializer;
$obj = Data::Serializer->new();
$obj = Data::Serializer->new(
    serializer => 'Storable',
    digester => 'MD5',
    cipher => 'DES',
    secret => 'my secret',
    compress => 1,
);
$serialized = $obj->serialize({a => [1,2,3],b => 5});
$deserialized = $obj->deserialize($serialized);
print "$deserialized->{b}\n";
Provides a unified interface to the various serializing modules currently available. Adds the functionality of both compression and encryption.
EXAMPLES
Please see Data::Serializer::Cookbook(3)
METHODS
new - constructor
$obj = Data::Serializer->new();
$obj = Data::Serializer->new(
    serializer => 'Data::Dumper',
    digester => 'SHA-256',
    cipher => 'Blowfish',
    secret => undef,
    portable => 1,
    compress => 0,
    serializer_token => 1,
    options => {},
);
new is the constructor object for Data::Serializer objects.
The default serializer is Data::Dumper
The default digester is SHA-256
The default cipher is Blowfish
The default secret is undef
The default portable is 1
The default encoding is hex
The default compress is 0
The default compressor is Compress::Zlib
The default serializer_token is 1
The default options is {} (pass nothing on to serializer)
serialize - serialize reference
$serialized = $obj->serialize({a => [1,2,3],b => 5});
Serializes the reference specified.
Will compress if compress is a true value.
Will encrypt if secret is defined.
deserialize - deserialize reference
$deserialized = $obj->deserialize($serialized);
Reverses the process of serialization and returns a copy of the original serialized reference.
freeze - synonym for serialize
$serialized = $obj->freeze({a => [1,2,3],b => 5});
thaw - synonym for deserialize
$deserialized = $obj->thaw($serialized);
raw_serialize - serialize reference in raw form
$serialized = $obj->raw_serialize({a => [1,2,3],b => 5});
This is a straight pass through to the underlying serializer, nothing else is done. (no encoding, encryption, compression, etc)
raw_deserialize - deserialize reference in raw form
$deserialized = $obj->raw_deserialize($serialized);
This is a straight pass through to the underlying serializer, nothing else is done. (no encoding, encryption, compression, etc)
secret - specify secret for use with encryption
$obj->secret('mysecret');
Changes the setting of secret for the Data::Serializer object. Can also be set in the constructor. If specified, the object will use encryption.
portable - encodes/decodes serialized data
Uses encoding method to ascii armor serialized data
Aids in the portability of serialized data.
compress - compression of data
Compresses serialized data. Default is not to use it. Will compress if set to a true value $obj->compress(1);
serializer - change the serializer
Currently have 8 supported serializers: Storable, FreezeThaw, Data::Denter, Config::General, YAML, PHP::Serialization, XML::Dumper, and Data::Dumper.
Default is to use Data::Dumper.
Each serializer has its own caveats about usage especially when dealing with cyclical data structures or CODE references. Please see the appropriate documentation in those modules for further information.
cipher - change the cipher method
Utilizes Crypt::CBC and can support any cipher method that it supports.
digester - change digesting method
Uses Digest so can support any digesting method that it supports. Digesting function is used internally by the encryption routine as part of data verification.
compressor - changes compressing module
This method is included for the possible future inclusion of alternate compression methods. Currently Compress::Zlib is the only supported compressor.
encoding - change encoding method
Encodes data structure in ascii friendly manner. Currently the only valid options are hex, or b64.
The b64 option uses Base64 encoding provided by MIME::Base64, but strips out newlines.
serializer_token - add usage hint to data
Data::Serializer prepends a token that identifies what was used to process its data. This is used internally to allow runtime determination of how to extract Serialized data. Disabling this feature is not recommended.
options - pass options through to underlying serializer
Currently is only supported by Config::General, and XML::Dumper.
my $obj = Data::Serializer->new(serializer => 'Config::General',
    options => {
        -LowerCaseNames => 1,
        -UseApacheInclude => 1,
        -MergeDuplicateBlocks => 1,
        -AutoTrue => 1,
        -InterPolateVars => 1
    },
) or die "$!\n";
or
my $obj = Data::Serializer->new(serializer => 'XML::Dumper',
    options => { dtd => 1, }
) or die "$!\n";
store - serialize data and write it to a file (or file handle)
$obj->store({a => [1,2,3],b => 5},$file, [$mode, $perm]);
or
$obj->store({a => [1,2,3],b => 5},$fh);
Serializes the reference specified using the serialize method and writes it out to the specified file or filehandle.
If a file path is specified you may specify an optional mode and permission as the next two arguments. See IO::File for examples.
Trips an exception if it is unable to write to the specified file.
retrieve - read data from file (or file handle) and return it after deserialization
my $ref = $obj->retrieve($file);
or
my $ref = $obj->retrieve($fh);
Reads first line of supplied file or filehandle and returns it deserialized.
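The layering described in the methods above (serialize, optionally compress, encode to ASCII, prepend a token saying how the data was processed) can be sketched with Python's standard library. This is an analogy, not Data::Serializer's wire format: JSON stands in for the Perl serializers, the token layout "^JSON|zlib|hex^" is invented for the example, and encryption and digesting are left out.

```python
import binascii
import json
import zlib

def serialize(data, compress=False):
    """Serialize, optionally compress, hex-encode, and prepend a usage token."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    if compress:
        payload = zlib.compress(payload)
    # Hypothetical token: serializer | compressor | encoding
    token = "^JSON|%s|hex^" % ("zlib" if compress else "")
    return token + binascii.hexlify(payload).decode("ascii")

def deserialize(text):
    """Read the token to decide at runtime how to reverse each step."""
    _, header, body = text.split("^", 2)
    _serializer, compressor, _encoding = header.split("|")
    payload = binascii.unhexlify(body)
    if compressor == "zlib":
        payload = zlib.decompress(payload)
    return json.loads(payload.decode("utf-8"))

data = {"a": [1, 2, 3], "b": 5}
packed = serialize(data, compress=True)
print(deserialize(packed)["b"])
```

The token is what lets deserialize work without being told which options were used, which is the role the serializer_token setting plays in the description above.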
Data::TreeDumper 0.33
Data::TreeDumper is an improved replacement for Data::Dumper.
SYNOPSIS
use Data::TreeDumper ;
my $sub = sub {} ;
my $s =
{
  A =>
    {
      a =>
        {
        }
    , bbbbbb => $sub
    , c123 => $sub
    , d => \$sub
    }
, C =>
    {
      b =>
        {
          a =>
            {
              a =>
                {
                }
            , b => sub
                {
                }
            , c => 42
            }
        }
    }
, ARRAY => [qw(elment_1 element_2 element_3)]
} ;
#-------------------------------------------------------------------
# package setup data
#-------------------------------------------------------------------
$Data::TreeDumper::Useascii = 0 ;
$Data::TreeDumper::Maxdepth = 2 ;
print DumpTree($s, 'title') ;
print DumpTree($s, 'title', MAX_DEPTH => 1) ;
print DumpTrees
  (
    [$s, "title", MAX_DEPTH => 1]
  , [$s2, "other_title", DISPLAY_ADDRESS => 0]
  , USE_ASCII => 1
  , MAX_DEPTH => 5
  ) ;
Output:
title:
|- A [H1]
|  |- a [H2]
|  |- bbbbbb = CODE(0x8139fa0) [C3]
|  |- c123 [C4 -> C3]
|  `- d [R5]
|     `- REF(0x8139fb8) [R5 -> C3]
|- ARRAY [A6]
|  |- 0 [S7] = elment_1
|  |- 1 [S8] = element_2
|  `- 2 [S9] = element_3
`- C [H10]
   `- b [H11]
      `- a [H12]
         |- a [H13]
         |- b = CODE(0x81ab130) [C14]
         `- c [S15] = 42
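The tree layout in the output above can be approximated with a short recursive renderer. The Python sketch below is only an illustration of that layout, not a port of Data::TreeDumper: it omits the type tags and addresses such as [H1] or CODE(...), and dump_tree with its max_depth parameter is a hypothetical stand-in for DumpTree and MAX_DEPTH.

```python
def _children(node):
    """Return (key, value) pairs for containers, or nothing for scalars."""
    if isinstance(node, dict):
        return sorted(node.items(), key=lambda kv: str(kv[0]))
    if isinstance(node, list):
        return list(enumerate(node))
    return []

def _walk(node, prefix, depth, max_depth, lines):
    if max_depth is not None and depth > max_depth:
        return
    items = _children(node)
    for i, (key, value) in enumerate(items):
        last = i == len(items) - 1
        glyph = "`- " if last else "|- "
        if isinstance(value, (dict, list)):
            lines.append("%s%s%s" % (prefix, glyph, key))
            # Children of the last entry no longer need the vertical bar.
            child_prefix = prefix + ("   " if last else "|  ")
            _walk(value, child_prefix, depth + 1, max_depth, lines)
        else:
            lines.append("%s%s%s = %s" % (prefix, glyph, key, value))

def dump_tree(data, title="", max_depth=None):
    """Render nested dicts and lists as an ASCII tree, one entry per line."""
    lines = [title + ":"] if title else []
    _walk(data, "", 1, max_depth, lines)
    return "\n".join(lines)

sample = {"A": {"a": {}, "c": 42}, "ARRAY": ["x", "y"]}
print(dump_tree(sample, "title"))
```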