Free java html parser software for linux

java html parser

Results 1 - 15 of about 4599
Java Mozilla Html Parser 0.1.7

Java Mozilla Html Parser is a Java package that parses HTML pages into a Java Document object. The parser is a wrapper around Mozilla's HTML parser, giving the user a browser-quality HTML parser.
Limitations and known issues
The main limitation is performance: the parser serializes requests. At present the parser runs in a separate thread, which receives requests, parses them, and returns the responses to the requester. This design follows from the mechanism Mozilla uses to keep its objects thread safe, which forces callers to use proxied objects instead of the real objects. My hope is that the open source community will take up the project and address these issues.
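The serialization described above can be illustrated with a minimal sketch using only the JDK. It is not the project's code: the class `SerializedParser` and its placeholder `parseTitle` method are hypothetical, and a single-threaded executor stands in for the dedicated parser thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a parser that may only be touched from one thread.
final class SerializedParser implements AutoCloseable {
    // A single worker thread: requests queue up and are handled one at a time.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Placeholder "parse": returns the text between <title> and </title>.
    private static String parseTitle(String html) {
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            return "";
        }
        return html.substring(start + "<title>".length(), end);
    }

    // Callers on any thread submit a request and receive a Future for the response.
    Future<String> submit(String html) {
        return worker.submit(() -> parseTitle(html));
    }

    @Override
    public void close() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (SerializedParser parser = new SerializedParser()) {
            Future<String> first = parser.submit("<html><title>First</title></html>");
            Future<String> second = parser.submit("<html><title>Second</title></html>");
            System.out.println(first.get());
            System.out.println(second.get());
        }
    }
}
```

Because every request passes through one worker, callers never run the parsing code concurrently, which is also why throughput cannot scale with the number of calling threads.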
Main features:
- Real-world, browser-quality DOM parsing
- Compatibility with SAX parsers
- Sequential performance comparable to pure Java implementations of DOM parsers
- Win32, Linux and Mac OS X platforms are supported
Enhancements:
- Missing Windows DLL files added to the package
- <title> tag extraction added
- Better handling of HTML entities
Download (1.5MB)
Added: 2007-07-30 License: LGPL (GNU Lesser General Public License) Price:
817 downloads
Jericho HTML Parser 2.4

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
Jericho HTML Parser project is an open source library released under the GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.
Main features:
- No parse tree of the entire document is ever generated. The document source text is searched only for the markup relevant to the current operation. This allows the library to analyse and modify documents containing incorrect or badly formatted HTML or any other server or client side code, script, macro or markup. Most other parsers cannot handle content that they are not explicitly programmed to accept.
- The beginning and end positions in the source text of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a parse tree. This feature, in combination with the one above, makes the toolkit extremely powerful in its simplicity.
- Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
- ASP, JSP, PSP, PHP and Mason server tags can be registered for recognition by the parser, and are recognised as accurately as is possible without incorporating actual parsers for these languages into the library. The library then allows any of these segments to be ignored when parsing the rest of the document so that they do not interfere with the HTML syntax. (see Segment.ignoreWhenParsing())
- Custom tag types can be easily defined and registered for recognition by the parser.
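The position-based editing described in the first two features can be sketched with plain JDK string operations. This is not Jericho's API: the class `SegmentEdit` and its methods are hypothetical and only show how knowing a segment's begin and end offsets lets one span be replaced while every other character, valid or not, is reproduced verbatim.

```java
// Hypothetical helper, not Jericho's API: locates one segment by position and
// replaces only that span, leaving every other character untouched.
final class SegmentEdit {
    // Returns {begin, end} offsets of the text inside the first <title> element,
    // or null if no such element is found.
    static int[] findTitleContent(String source) {
        String open = "<title>";
        int begin = source.indexOf(open);
        if (begin < 0) {
            return null;
        }
        begin += open.length();
        int end = source.indexOf("</title>", begin);
        if (end < 0) {
            return null;
        }
        return new int[] {begin, end};
    }

    // Rebuilds the document from the untouched prefix, the new text, and the
    // untouched suffix; no parse tree is involved.
    static String replaceTitle(String source, String newTitle) {
        int[] span = findTitleContent(source);
        if (span == null) {
            return source;
        }
        return source.substring(0, span[0]) + newTitle + source.substring(span[1]);
    }

    public static void main(String[] args) {
        // The server tag and the unclosed <b> are reproduced verbatim.
        String source = "<title>Old</title><% code %><b>unclosed";
        System.out.println(replaceTitle(source, "New"));
        // prints: <title>New</title><% code %><b>unclosed
    }
}
```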
Enhancements:
- This version has been released under a dual licence system, allowing a choice between the Eclipse Public License (EPL) and the LGPL.
- It includes important bugfixes and introduces the following major features: simple rendering of HTML markup into text, integrated logging with various logging frameworks, and easier parsing of HTML tags containing server tags.
Download (0.85MB)
Added: 2007-05-20 License: LGPL (GNU Lesser General Public License) Price:
534 downloads
HTML::Parser 3.54

HTML::Parser is a HTML parser class. Objects of the HTML::Parser class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked.
HTML::Parser is not a generic SGML parser.

We have tried to make it able to deal with the HTML that is actually "out there", and it normally parses as closely as possible to the way the popular web browsers do it instead of strictly following one of the many HTML specifications from W3C. Where there is disagreement, there is often an option that you can enable to get the official behaviour.

The document to be parsed may be supplied in arbitrary chunks. This makes on-the-fly parsing as documents are received from the network possible.
If event driven parsing does not feel right for your application, you might want to use HTML::PullParser. This is an HTML::Parser subclass that allows a more conventional program structure.
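The idea of feeding a document in arbitrary chunks while events fire as soon as markup is recognized can be sketched in Java. This is an illustration of the technique only, not HTML::Parser's implementation: the class `ChunkedScanner` is hypothetical, events are recorded as strings instead of invoking handlers, and a `>` inside an attribute value is not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical incremental scanner: accepts input in arbitrary chunks and
// records an event for each complete tag or text run it recognizes.
final class ChunkedScanner {
    private final StringBuilder pending = new StringBuilder();
    private final List<String> events = new ArrayList<>();

    // Feed the next chunk. Complete tags and text runs are reported at once;
    // an unfinished tag stays buffered until a later chunk completes it.
    void parse(String chunk) {
        pending.append(chunk);
        int pos = 0;
        while (pos < pending.length()) {
            int lt = pending.indexOf("<", pos);
            if (lt < 0) {                          // only text remains
                events.add("text:" + pending.substring(pos));
                pos = pending.length();
                break;
            }
            if (lt > pos) {                        // text before the next tag
                events.add("text:" + pending.substring(pos, lt));
            }
            int gt = pending.indexOf(">", lt);
            if (gt < 0) {                          // tag is split across chunks
                pos = lt;
                break;
            }
            String tag = pending.substring(lt + 1, gt).trim();
            events.add(tag.startsWith("/") ? "end:" + tag.substring(1) : "start:" + tag);
            pos = gt + 1;
        }
        pending.delete(0, pos);                    // keep only the unfinished tail
    }

    // Signal end of document; a leftover unfinished tag is reported as text.
    List<String> eof() {
        if (pending.length() > 0) {
            events.add("text:" + pending);
            pending.setLength(0);
        }
        return events;
    }

    public static void main(String[] args) {
        ChunkedScanner scanner = new ChunkedScanner();
        scanner.parse("<p>Hel");
        scanner.parse("lo</");
        scanner.parse("p>");
        System.out.println(scanner.eof());
        // prints: [start:p, text:Hel, text:lo, end:p]
    }
}
```

Note that text split across two chunks arrives as two text events; a caller that needs the whole run has to concatenate them.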

SYNOPSIS:

use HTML::Parser ();

# Create parser object
$p = HTML::Parser->new( api_version => 3,
                        start_h => [\&start, "tagname, attr"],
                        end_h   => [\&end,   "tagname"],
                        marked_sections => 1,
                      );

# Parse document text chunk by chunk
$p->parse($chunk1);
$p->parse($chunk2);
#...
$p->eof; # signal end of document

# Parse directly from file
$p->parse_file("foo.html");
Download (0.082MB)
Added: 2006-05-05 License: Perl Artistic License Price:
1269 downloads
CyberNeko HTML Parser 0.9.5

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
NekoHTML is written using the Xerces Native Interface (XNI) that is the foundation of the Xerces2 implementation. This enables you to use the NekoHTML parser with existing XNI tools without modification or rewriting code.
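The "fix up" behaviour described above (closing elements that were left open and coping with mismatched inline tags) can be sketched with a simple stack. This is not NekoHTML's algorithm or API: the class `TagBalancer` is hypothetical, tokens are simplified to `<name>`, `</name>` or plain text, and stray end tags are simply dropped, which is an assumption made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical stack-based balancer for simplified tag tokens.
final class TagBalancer {
    static List<String> balance(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();   // currently open element names
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue;                      // stray end tag: drop it
                }
                while (!open.peek().equals(name)) {
                    out.add("</" + open.pop() + ">");   // close inner elements first
                }
                open.pop();
                out.add(token);
            } else if (token.startsWith("<")) {
                open.push(token.substring(1, token.length() - 1));
                out.add(token);
            } else {
                out.add(token);                    // plain text passes through
            }
        }
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");      // close whatever is still open
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("<p>", "<b>", "<i>", "text", "</b>", "</i>");
        System.out.println(String.join("", balance(input)));
        // prints: <p><b><i>text</i></b></p>
    }
}
```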
Version restrictions:
- There are HTML documents for which NekoHTML cannot properly generate a well-formed XML document event stream. For example, documents with multiple root-level tags are inherently ill-formed because XML documents may only have a single root element.
- Code added to the core DOM implementation in Xerces-J 2.0.1 introduced a bug in the HTML DOM implementation based on it.
The bug causes the element nodes in the resultant HTML document object to be of type org.apache.xerces.dom.ElementNSImpl instead of the appropriate HTML DOM element objects.
The problem affects NekoHTML users who use the parser with Xerces-J 2.0.1 and anyone using the HTML DOM implementation in Xerces-J 2.0.1.
- There are no other known major limitations with this release. However, additional work can always be done to improve performance, fix bugs, and add functionality.
Download (0.38MB)
Added: 2005-09-28 License: The Apache License Price:
1486 downloads
HTML Parser 1.6-20060610

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion.
HTMLParser is a super-fast real-time parser for real-world HTML. What has attracted most developers to HTMLParser is its simplicity of design, its speed, and its ability to handle streaming real-world HTML.
The two fundamental use cases handled by the parser are extraction and transformation (the synthesis use case, where HTML pages are created from scratch, is better handled by other tools closer to the source of data). While prior versions concentrated on data extraction from web pages, version 1.4 of HTMLParser made substantial improvements in the area of transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output.
In order to use HTMLParser you will need to be able to write code in the Java programming language. Although some example programs are provided that may be useful as they stand, it's more than likely you will need (or want) to create your own programs or modify the ones provided to match your intended application.
To use the library, you will need to add either the htmllexer.jar or htmlparser.jar to your classpath when compiling and running. The htmllexer.jar provides low level access to generic string, remark and tag nodes on the page in a linear, flat, sequential manner. The htmlparser.jar, which includes the classes found in htmllexer.jar, provides access to a page as a sequence of nested differentiated tags containing string, remark and other tag nodes. So where the output from calls to the lexer nextNode() method might be:
<html>
<head>
<title>
"Welcome"
</title>
</head>
<body>
etc...
The output from the parser NodeIterator would nest the tags as children of the <html>, <head> and other nodes (here represented by indentation):
<html>
    <head>
        <title>
            "Welcome"
        </title>
    </head>
    <body>
        etc...
The parser attempts to balance opening tags with ending tags to present the structure of the page, while the lexer simply spits out nodes. If your application requires only modest structural knowledge of the page, and is primarily concerned with individual, isolated nodes, you should consider using the lightweight lexer. But if your application requires knowledge of the nested structure of the page, for example processing tables, you will probably want to use the full parser.
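The difference between the flat lexer view and the nested parser view can be sketched with the JDK alone. The class `FlatVersusNested` below is hypothetical and does not use htmllexer.jar or htmlparser.jar; it assumes well-balanced input without void elements such as `<br>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a flat node list versus a nested view.
final class FlatVersusNested {
    // Flat view: every tag or text run becomes one node, in document order.
    static List<String> lex(String html) {
        List<String> nodes = new ArrayList<>();
        Matcher m = Pattern.compile("<[^>]+>|[^<]+").matcher(html);
        while (m.find()) {
            String node = m.group().trim();
            if (!node.isEmpty()) {
                nodes.add(node);
            }
        }
        return nodes;
    }

    // Nested view: indent each node by the number of currently open tags.
    static String nest(List<String> nodes) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (String node : nodes) {
            if (node.startsWith("</")) {
                depth = Math.max(0, depth - 1);    // an end tag closes one level
            }
            for (int i = 0; i < depth; i++) {
                out.append("    ");
            }
            out.append(node).append('\n');
            if (node.startsWith("<") && !node.startsWith("</")) {
                depth++;                           // a start tag opens one level
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Welcome</title></head><body></body></html>";
        List<String> flat = lex(html);
        System.out.println(flat);                  // the linear, lexer-style view
        System.out.print(nest(flat));              // the same nodes, nested
    }
}
```

An application that only inspects isolated nodes can stop at the flat list; the nesting step is what table processing and similar structural tasks need.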
Extraction
Extraction encompasses all the information retrieval programs that are not meant to preserve the source page. This covers uses like:
- text extraction, for use as input for text search engine databases for example
- link extraction, for crawling through web pages or harvesting email addresses
- screen scraping, for programmatic data input from web pages
- resource extraction, collecting images or sound
- a browser front end, the preliminary stage of page display
- link checking, ensuring links are valid
- site monitoring, checking for page differences beyond simplistic diffs
There are several facilities in the HTMLParser codebase to help with extraction, including filters, visitors and JavaBeans.
Transformation
Transformation includes all processing where the input and the output are HTML pages. Some examples are:
- URL rewriting, modifying some or all links on a page
- site capture, moving content from the web to local disk
- censorship, removing offending words and phrases from pages
- HTML cleanup, correcting erroneous pages
- ad removal, excising URLs referencing advertising
- conversion to XML, moving existing web pages to XML
During or after reading in a page, operations on the nodes can accomplish many transformation tasks "in place", which can then be output with the toHtml() method. Depending on the purpose of your application, you will probably want to look into node decorators, visitors, or custom tags in conjunction with the PrototypicalNodeFactory.
The HTML Parser is an open source library released under GNU Lesser General Public License, which basically says you are free to use the library "as is" in other (even proprietary) products, as long as due credit is given to the authors and the source code for the HTMLParser is included or available with the other product. For modified or embedded use, please consult the LGPL license.
Download (4.2MB)
Added: 2006-06-11 License: LGPL (GNU Lesser General Public License) Price:
1234 downloads
Java Tools 0.30

Java Tools is a lightweight integrated development environment for creating, compiling, and executing Java applications and applets.
Java Tools includes point and click access to the Java files, commands, and documents. It also includes a built-in text editor and user interface for the Java debugger.
It is intended for new Java users who need help getting started. It is also for more experienced Java users who want easy access to the Java commands and a text editor.
Main features:
- GUI with built-in help and small footprint.
- Point and click access to all files (Java, manifest, HTML, image and sound) and directories (package).
- Point and click access to all commands for compiling (javac), archiving (jar), documenting (javadoc), executing (java), debugging (jdb) and disassembling (javap).
- Point and click access to all documents (Java API Specification, Java Tools and Utilities, Java Features and Java Tutorial).
- Point and click creation of all files (Java, manifest and HTML) and directories (package).
- Point and click installation of distribution archive files (Java document, Java source code, Java Tutorial and Sun Tools).
- Automatic determination of class file dependencies for archiving (jar) and documenting (javadoc) Java files.
- Checking for unused, redundant and missing imports.
- Logging of all commands invoked by GUI.
- Code metrics for Java files.
- Built-in text editor (see Edit for details).
- Built-in user interface for the Java debugger with command-line editing and history.
- Self-installing executable (Java archive file).
- Comprehensive installation and user documentation for Java and Java Tools.
Download (0.15MB)
Added: 2007-07-09 License: Freeware Price:
838 downloads
HTML Purifier 2.1.1

HTML Purifier is the premier PHP solution for all your HTML filtering needs. Tired of forcing users to use BBCode or some other obscure custom markup language because of the current landscape of deficient or hole-ridden HTML filters? Look no further: HTML Purifier will not only remove all malicious code (the stuff of XSS), it will also make sure the HTML is standards compliant.
There are a number of ad hoc HTML filtering solutions out there on the web (examples include PEAR's HTML_Safe, kses and SafeHtmlChecker.class.php) that claim to filter HTML properly, preventing malicious JavaScript and layout-breaking HTML from getting through the parser. None of them, however, demonstrates a thorough knowledge of the DTD that defines HTML or the caveats of HTML that cannot be expressed by a DTD.
Configurable filters (such as kses or PHP's built-in strip_tags() function) have trouble validating the contents of attributes and can be subject to security attacks due to poor configuration. Other filters take the naive approach of blacklisting known threats and tags, failing to account for the introduction of new technologies, new tags, new attributes or quirky browser behavior.
HTML Purifier takes a different approach, one that doesn't use specification-ignorant regexes or narrow blacklists. HTML Purifier decomposes the whole document into tokens and rigorously processes them by removing non-whitelisted elements, transforming bad-practice tags like font into span, checking the nesting of tags and their children, and validating all attributes according to their RFCs.
To my knowledge, there is nothing like this on the web yet. Not even MediaWiki, which allows an amazingly diverse mix of HTML and wikitext in its documents, gets all the nesting quirks right. Existing solutions hope that no JavaScript will slip through, but either do not attempt to ensure that the resulting output is valid XHTML or send the HTML through a draconian XML parser (and yet still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a> tags from being nested within each other).
Enhancements:
- This version fixes a few bugs in features newly introduced in 2.1, namely running the standalone download version in PHP4 and %URI.MakeAbsolute.
Download (0.16MB)
Added: 2007-08-07 License: LGPL (GNU Lesser General Public License) Price:
809 downloads
PandaLex PDF Parser 0.5

PandaLex PDF Parser is a flex and bison parser for PDF documents.
PandaLex is the PDF parsing code from Panda, which has been split into its own project to increase its utility.

It is a flex and bison description of the PDF specification, which allows programmers to define callbacks to handle different document elements.
Download (0.38MB)
Added: 2005-05-04 License: GPL (GNU General Public License) Price:
1639 downloads
ShaniXmlParser 1.4.15

ShaniXmlParser is an XML/HTML DOM/SAX parser that can be validating. It can parse badly formed XML files.
ShaniXmlParser can parse files with inverted tags and badly escaped &, < and >. It expands all HTML entities and is well suited to parsing HTML files.
It is up to 3 times faster than the internal JDK 1.5 xerces parser and as fast as the internal JDK 1.4 Crimson parser, compliant with the jaxp/w3c DOM interfaces, and very small.
Enhancements:
- Support of DOM 2 HTML interfaces.
- 668 of 685 tests pass in the DOM 2 HTML Test Validation suite.
Download (2.0MB)
Added: 2007-04-25 License: GPL (GNU General Public License) Price:
913 downloads
DNS name parser 1.2.1

DNS name parser is a Java utility library for parsing DNS names, IP addresses and hardware addresses.

Synopsis

import java.util.Hashtable;
import su.netdb.parser.*;

Parser parser = new Parser();

// str holds the raw search input to be classified
Hashtable result = parser.parse(str);

System.out.println("string: "+result.get("string"));
System.out.println("hw: "+result.get("hw"));
System.out.println("name: "+result.get("name"));
System.out.println("domain: "+result.get("domain"));
System.out.println("ip_low: "+result.get("ip_low"));
System.out.println("ip_high: "+result.get("ip_high"));

"DNS name parser" is a utility library created for use in a search application. Given a single input field, its function is to differentiate between several types of possible input string: a DNS name, an IP address (exact, IP range, or IP with wildcards), or a hardware address. The result of the parsing is a Hashtable with possible keys "string", "hw", "name", "domain", "ip_low" and "ip_high".

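The kind of classification this library performs on a single search field can be sketched with regular expressions. The class `InputClassifier` below is hypothetical and is not the su.netdb.parser implementation: it assumes colon- or hyphen-separated hardware addresses, dotted IPv4 addresses with optional `*` wildcards, does not handle IP ranges, and does not check that octets are at most 255.

```java
import java.util.regex.Pattern;

// Hypothetical classifier for one search-field input; not the library's code.
final class InputClassifier {
    enum Kind { HW, IP, NAME, UNKNOWN }

    // Six hex pairs separated by ':' or '-', e.g. 00:11:22:aa:bb:cc.
    private static final Pattern HW =
        Pattern.compile("([0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}");
    // Four dot-separated parts, each a 1-3 digit number or a '*' wildcard.
    private static final Pattern IP =
        Pattern.compile("(\\d{1,3}|\\*)(\\.(\\d{1,3}|\\*)){3}");
    // Dot-separated labels of letters, digits and inner hyphens.
    private static final Pattern NAME =
        Pattern.compile("[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?"
            + "(\\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)*");

    // The order matters: an IP address would also match the NAME pattern.
    static Kind classify(String input) {
        String s = input.trim();
        if (HW.matcher(s).matches()) {
            return Kind.HW;
        }
        if (IP.matcher(s).matches()) {
            return Kind.IP;
        }
        if (NAME.matcher(s).matches()) {
            return Kind.NAME;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("00:11:22:aa:bb:cc"));   // HW
        System.out.println(classify("192.168.*.1"));         // IP
        System.out.println(classify("www.example.org"));     // NAME
        System.out.println(classify("not a name!"));         // UNKNOWN
    }
}
```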
Download (0.008MB)
Added: 2007-07-20 License: GPL (GNU General Public License) Price:
835 downloads
C++ WSDL Parser 1.9.3

C++ WSDL Parser is an efficient C++ Web services library that includes a standards compliant WSDL parser API, a Schema parser and validator, an XML parser and serializer, and an API for dynamically inspecting and invoking WSDL Web services.
Enhancements:
- Many WSDLs can now be dynamically invoked.
- Added documentation (doxygen for the API).
- Better error reporting when types are found missing.
Download (0.56MB)
Added: 2005-10-06 License: LGPL (GNU Lesser General Public License) Price:
1483 downloads
Shell::Parser 0.04

Shell::Parser is a simple shell script parser.

SYNOPSIS

use Shell::Parser;

my $parser = new Shell::Parser syntax => 'bash', handlers => {

};
$parser->parse(...);
$parser->eof;

This module implements a rudimentary shell script parser in Perl. It was primarily written as a backend for Syntax::Highlight::Shell, in order to simplify the creation of the latter.
Download (0.017MB)
Added: 2007-04-06 License: Perl Artistic License Price:
934 downloads
Test-Parser 1.2

Test::Parser is a collection of parsers for different test output file formats. These parse the data into a general purpose data structure that can then be used to create reports, do post-processing analysis, etc.

Test-Parser can also export tests in SpikeSource's TRPI test description XML language.
Download (0.053MB)
Added: 2006-05-04 License: GPL (GNU General Public License) Price:
1268 downloads
SVG::Parser 1.01

SVG::Parser is a Perl module with XML Parser for SVG documents.

SYNOPSIS

#!/usr/bin/perl -w
use strict;
use SVG::Parser;

die "Usage: $0 <file>\n" unless @ARGV;

my $xml;
{
    local $/=undef;
    $xml=<>;
}

my $parser=new SVG::Parser(-debug => 1);
my $svg=$parser->parse($xml);
print $svg->xmlify;

and:

#!/usr/bin/perl -w
use strict;
use SVG::Parser qw(SAX=XML::LibXML::Parser::SAX Expat SAX);

die "Usage: $0 <file>\n" unless @ARGV;
my $svg=SVG::Parser->new()->parsefile($ARGV[0]);
print $svg->xmlify;

SVG::Parser is an XML parser for SVG Documents. It takes XML as input and produces an SVG object as its output.

SVG::Parser supports both XML::SAX and XML::Parser (Expat) parsers, with SAX preferred by default. Only one of these needs to be installed for SVG::Parser to function.

A list of preferred parsers may be specified in the import list; SVG::Parser will use the first parser that loads successfully. Some basic measures are taken to provide cross-compatibility. Applications requiring more advanced parser features should use the relevant parser module directly; see SVG::Parser::Expat and SVG::Parser::SAX.
Download (0.014MB)
Added: 2006-09-20 License: Perl Artistic License Price:
1131 downloads
Makefile::Parser 0.11

Makefile::Parser is a Simple Parser for Makefiles.

SYNOPSIS

use Makefile::Parser;

$parser = Makefile::Parser->new;

# Equivalent to ->parse('Makefile');
$parser->parse or
    die Makefile::Parser->error;

# Get last value assigned to the specified variable 'CC':
print $parser->var('CC');

# Get all the variable names defined in the Makefile:
@vars = $parser->vars;
print join(' ', sort @vars);

@roots = $parser->roots; # Get all the "root targets"
print $roots[0]->name;

@tars = $parser->targets; # Get all the targets
$tar = join("\n", $tars[0]->commands);

# Get the default target, say, the first target defined in Makefile:
$tar = $parser->target;

$tar = $parser->target('install');
# Get the name of the target, say, 'install' here:
print $tar->name;

# Get the dependencies for the target 'install':
@depends = $tar->depends;

# Access the shell command used to build the current target.
@cmds = $tar->commands;

# Parse another file using the same Parser object:
$parser->parse('Makefile.old') or
    die Makefile::Parser->error;

# Get the target who is specified by variable EXE_FILE
$tar = $parser->target($parser->var('EXE_FILE'));

This is a parser for Makefiles. At this very early stage, the parser supports only a limited set of features, so it may not recognize some advanced features provided by certain make tools such as GNU make. Its initial purpose is to provide basic support for another module named Makefile::GraphViz, which aims to render the build process specified by a Makefile using the amazing GraphViz library. The Make module is not satisfactory for this purpose, so I decided to build one of my own.
Download (0.018MB)
Added: 2006-10-24 License: Perl Artistic License Price:
1098 downloads