html_scrub -- An HTML Editing Utility for Groklaw by Scott McKellar

Monday, June 07 2004 @ 03:26 AM EDT

Contributed by: PJ

Scott McKellar decided to take pity on me and write a command line HTML cleaning utility for me. As many of you know, Geeklog, the underlying software Groklaw uses, chokes on certain HTML. When volunteers send me documents they have converted from text to HTML using certain automatic HTML utilities, I sometimes end up spending hours cleaning out the tags Geeklog doesn't like. It's like picking fleas out of your dog's coat. It takes a long time, it's no fun, and sometimes you miss things.

This is particularly a problem when volunteers use web authoring tools in Windows. I've struggled with the problem for some time, so Scott decided to try to do something about it. He wrote a utility for me called html_scrub that does that cleaning chore for me, and it's licensed under the GPL, so everyone can use it.

He has it up on Freshmeat today. His personal page is here. If anyone wants to write a GUI for it, I'd love it. Then volunteers could pre-clean. I wanted to let you know about it, so you can try it out if you'd like to. Don't sue me or him if your house falls down or your hair turns purple when you try it. There are always bugs in new software, so be sure to let him know if you find any. Alan Canon already found a JavaScript bug, but today's release fixes it, and Scott says html_scrub is ready to be taken for a spin.

Scott explains his html_scrub:

"When people contribute HTML documents to Groklaw, PJ (or one of her lieutenants) has to edit them to make sure that they don't include certain kinds of HTML that create problems for Geeklog. I wrote a command line utility called html_scrub to automate this task. Depending on what you tell it in a configuration file, html_scrub can eliminate unwanted HTML tags or certain attributes within specified HTML tags -- or, if you prefer, it can just warn you about them so that you can screen them manually. For more information, see the html_scrub web page.

"This utility is available under the GPL in the form of C source code and a simple Makefile.  You should be able to compile it on any Linux or Unix-like system, possibly after a little tweaking of the Makefile.  If you're on a Windows box without a compiler, you can download a Unix-like environment from Cygwin and compile from there with gcc.  If you're not that ambitious, I can provide an executable for you to run within a DOS session.

"I haven't given html_scrub a thorough workout yet, so bug reports are welcome.  I have tried to make the code modular enough that others may extend or reuse it.  For example, it would be nice to have a GUI version so that the user doesn't have to work from a command prompt.  I don't have the necessary skills to do that myself."

It feels really nice to have software written for you, I must say. I am starting to get what folks mean about writing software to scratch an itch. It must be very nice to be able to write whatever you want to do what you need done. So, thank you very much, Scott. And if you want to know what tags I can use in Geeklog, here is the list, with brackets removed, because if I leave them in, Geeklog will have a nervous breakdown trying to figure out what to do: p, blockquote, b, i, u, strike, a, em, strong, br, tt, hr, li, ol, ul, code, pre, font, div, span, table, tr, th, td, font color="" . . . /font.
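For anyone curious what this kind of cleaning looks like in code, here is a minimal sketch in Python. It is not Scott's code (html_scrub is written in C and driven by a configuration file), and the names TagScrubber and scrub are made up for illustration. It uses the tag list above as a whitelist, drops any tag not on it, and collects a warning for each one dropped, so you can either use the cleaned text or just read the warnings and screen the document by hand. It does not attempt the attribute filtering that html_scrub also offers.

```python
from html.parser import HTMLParser

# The tags listed above as acceptable to Geeklog.
ALLOWED_TAGS = {
    "p", "blockquote", "b", "i", "u", "strike", "a", "em", "strong", "br",
    "tt", "hr", "li", "ol", "ul", "code", "pre", "font", "div", "span",
    "table", "tr", "th", "td",
}


class TagScrubber(HTMLParser):
    """Hypothetical helper: keep allowed tags, drop the rest, note what was dropped."""

    def __init__(self, allowed=ALLOWED_TAGS):
        # convert_charrefs=False so entities such as &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.allowed = allowed
        self.kept = []      # pieces of the cleaned document
        self.warnings = []  # one entry per disallowed opening tag

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            # Re-emit the tag exactly as written, attributes included.
            self.kept.append(self.get_starttag_text())
        else:
            line, _ = self.getpos()
            self.warnings.append(f"line {line}: <{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are treated as a single opening tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.kept.append(f"</{tag}>")

    def handle_data(self, data):
        self.kept.append(data)

    def handle_entityref(self, name):
        self.kept.append(f"&{name};")

    def handle_charref(self, name):
        self.kept.append(f"&#{name};")


def scrub(html, allowed=ALLOWED_TAGS):
    """Return (cleaned_html, warnings) for the given HTML text."""
    parser = TagScrubber(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.kept), parser.warnings


if __name__ == "__main__":
    cleaned, warnings = scrub("<p>Hi <center>there</center></p>")
    print(cleaned)
    print(warnings)
```

Note that this sketch keeps the text inside a dropped tag, which means the body of a script or style element would survive as plain text, and that comments and doctype declarations are silently discarded. A real tool would need to decide what to do in each of those cases.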

If anyone wants to know what I'd really find useful, it'd be a way to hit a key and get [p] [blockquote] [i]" and then hit another key and get "[/i] [blockquote] [p] with angle brackets in there, of course, instead of []. I used the square brackets here to trick Geeklog. That one thing would make my life better. If you look at the source of any article, you'll see why I need that, if it's possible.
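A real answer to that wish would be a macro in whatever editor is in use, but the idea itself fits in a few lines. Here is a rough sketch in Python; the names OPEN_QUOTE, CLOSE_QUOTE and wrap_quote are hypothetical. One assumption to flag: the closing sequence below closes each tag in reverse order, which is a guess at the intent, since the text above gives the closing sequence differently. Both sequences are parameters, so they can be changed to whatever is actually wanted.

```python
# Opening sequence as described above, ending with the opening quotation mark.
OPEN_QUOTE = '<p><blockquote><i>"'

# ASSUMPTION: closes each tag in reverse order of opening; adjust as needed.
CLOSE_QUOTE = '"</i></blockquote></p>'


def wrap_quote(text, opening=OPEN_QUOTE, closing=CLOSE_QUOTE):
    """Wrap a quoted passage in the opening and closing tag sequences."""
    return f"{opening}{text}{closing}"


if __name__ == "__main__":
    print(wrap_quote("Text of the quoted passage."))
```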