HTML Tidy  5.2.0
The HTACG Tidy HTML Project
tidylib

About the tidylib C library and code

TidyLib features

  • easy to integrate. Because of the near universal adoption of C linkage, a C interface may be called from a great number of programming languages.
  • designed to use opaque types in the public interface. This allows the application to just pass an integer around and the need to transform data types in different languages is minimized. As a results it’s straight-forward to write very thin library wrappers for C++, Pascal, and COM/ATL.
  • eats its own dogfood. HTML Tidy links directly to TidyLib.
  • is Thread Safe and Re-entrant. Because there are many uses for HTML Tidy - from content validation, content scraping, conversion to XHTML - it was important to make TidyLib run reasonably well within server applications as well as client side.
  • uses adaptable I/O. As part of the larger integration strategy it was decided to fully abstract all I/O. This means a (relatively) clean separation between character encoding processing and shovelling bytes back and forth. Internally, the library reads from sources and writes to sinks. This abstraction is used for both markup and configuration “files”. Concrete implementations are provided for file and memory I/O, and new sources and sinks may be provided via the public interface.

Return codes

It’s important to understand that API functions that return an integer almost universally adhere to the following convention:

  • 0 == Success
    • Good to go.
  • 1 == Warnings, but no errors
    • Check the error buffer or track error messages for details.
  • 2 == Errors (and maybe warnings)
    • By default, Tidy will not produce output. You can force output with the TidyForceOutput option. As with warnings, check error buffer or track error messages for details.
  • < 0 == Severe error
    • Usually value equals -errno. See errno.h.

Also, by default, warning and error messages are sent to stderr. You can redirect diagnostic output using either tidySetErrorFile() or tidySetErrorBuffer(). See tidy.h for details.

Application Notes

Of course, there are functions to parse and save both markup and configuration files. For the adventurous, it is possible to create new input sources and output sinks. For example, a URL source could pull the markup from a given URL.

It is also worth remembering that an application may instantiate any number of document and buffer objects. They are fairly cheap to initialize and destroy (just memory allocation and zeroing, really), so they may be created and destroyed locally, as needed. There is no problem keeping them around a while for keeping state. For example, a server app might keep a global document as a master configuration. As documents are parsed, they can copy their configuration data from the master instance. See tidyOptCopyConfig(). If the master copy is initialized at startup, no synchronization is necessary.

tidylib example

#include <tidy.h>
#include <tidybuffio.h>
#include <stdio.h>
#include <errno.h>
int main(int argc, char **argv )
{
const char* input = "<title>Hello</title><p>World!";
TidyBuffer output = {0};
TidyBuffer errbuf = {0};
int rc = -1;
Bool ok;
// Initialize "document"
TidyDoc tdoc = tidyCreate();
printf( "Tidying:\t%s\n", input );
// Convert to XHTML
ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes );
if ( ok )
rc = tidySetErrorBuffer( tdoc, &errbuf ); // Capture diagnostics
if ( rc >= 0 )
rc = tidyParseString( tdoc, input ); // Parse the input
if ( rc >= 0 )
rc = tidyCleanAndRepair( tdoc ); // Tidy it up!
if ( rc >= 0 )
rc = tidyRunDiagnostics( tdoc ); // Kvetch
if ( rc > 1 ) // If error, force output.
rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
if ( rc >= 0 )
rc = tidySaveBuffer( tdoc, &output ); // Pretty Print
if ( rc >= 0 )
{
if ( rc > 0 )
printf( "\nDiagnostics:\n\n%s", errbuf.bp );
printf( "\nAnd here is the result:\n\n%s", output.bp );
}
else
printf( "A severe error (%d) occurred.\n", rc );
tidyBufFree( &output );
tidyBufFree( &errbuf );
tidyRelease( tdoc );
return rc;
}