(Untitled)

Jan 18, 2005 17:04

How does the S2 code cleaner work? I assume the content is piped through some program and stripped of malicious code, but how do you know which code is malicious? And where is that program located? I searched the CVS repos and the manual but couldn't find it. I'm interested in doing something similar with PHP.

Comments 14

marksmith January 19 2005, 11:11:23 UTC
We wrote our own. It isn't piped anywhere; check out cgi-bin/cleanhtml.pl in the livejournal repository.

Reply


evan January 19 2005, 11:17:00 UTC
http://cvs.livejournal.org/browse.cgi/livejournal/cgi-bin/cleanhtml.pl?rev=HEAD&content-type=text/x-cvsweb-markup
is one of them.

Note that LJ code is licensed under the GPL, and so all derivative works must also be under the GPL.

Reply

timwi January 19 2005, 13:16:24 UTC
Does taking only the idea and re-writing it entirely in PHP constitute a "derivative work"?

Reply

surye January 19 2005, 13:23:07 UTC
2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:
[....]

I'd say that if no code is actually used from the original, it's not a "derivative work". If it were, it seems to me we'd be jumping into the area of patents, where a process or concept is retained as IP. However, the GPL does not concern software patents as far as I know. (And it shouldn't; software patents are BAD ;)

Just my 2 cents based on reading the GPL; maybe I missed something, and I'd love for someone to point out any flaws in my logic.

Reply

j4k0b January 19 2005, 13:27:38 UTC
It seems like the only parts that would be reused are patterns in regexps, and I could get exactly the same thing without ever seeing LJ's source code. That, and the concept of stripping something from a string to make it safe, which is a fairly broad idea.

I'm probably not going to need to clean PHP code anymore anyway.

Reply
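To make that idea concrete, here is a minimal PHP sketch of regexp-based stripping, since the original question was about PHP. The patterns are invented for illustration, are nowhere near complete, and this whole approach is easy to defeat with malformed markup; none of it is taken from LJ's code.

```php
<?php
// Illustrative only: naive regexp-based stripping of a few obvious script
// vectors. The patterns are examples, not a complete list.
function naive_clean($html)
{
    // Drop <script>...</script> blocks entirely.
    $html = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);

    // Drop inline event-handler attributes such as onclick="..." or onload='...'.
    $html = preg_replace('/\son\w+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i', '', $html);

    // Neutralise javascript: URLs in href/src attributes.
    $html = preg_replace('/\b(href|src)\s*=\s*(["\']?)\s*javascript:[^"\'\s>]*/i', '$1=$2#', $html);

    return $html;
}

echo naive_clean('<p onclick="evil()">hi <script>alert(1)</script></p>');
// => <p>hi </p>
```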


mart January 19 2005, 15:44:42 UTC

There are two HTML cleaners. The old-fashioned one, cleanhtml.pl (which everyone else is talking about), is usually applied to Perl strings in memory and returns a string result. It's used to clean entries, comments and other little things which are generally already in memory anyway, because they come from the database.
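As a rough illustration of that string-in, string-out shape in PHP (the language the original poster asked about): parse the fragment, keep a whitelist of tags and attributes, and serialise the result back to a string. The whitelists and the handling of disallowed elements below are invented for the example, not taken from cleanhtml.pl.

```php
<?php
// Illustrative string-in, string-out cleaner: parse the fragment, keep only
// whitelisted tags and attributes, and serialise it back to a string.
function clean_html($html)
{
    $allowed_tags  = array('b', 'i', 'u', 'a', 'p', 'br', 'img');
    $allowed_attrs = array('href', 'src', 'alt', 'title');

    libxml_use_internal_errors(true);            // tolerate sloppy user markup
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>');  // wrapper <div> marks our fragment

    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//body/div//*')) as $node) {
        if (!in_array(strtolower($node->nodeName), $allowed_tags, true)) {
            // Simplification: drop disallowed elements and everything inside them.
            $node->parentNode->removeChild($node);
            continue;
        }
        // Drop every attribute that isn't whitelisted (onclick, style, ...).
        for ($i = $node->attributes->length - 1; $i >= 0; $i--) {
            $attr = $node->attributes->item($i);
            if (!in_array(strtolower($attr->name), $allowed_attrs, true)) {
                $node->removeAttribute($attr->name);
            }
        }
    }

    // Serialise just the children of the wrapper <div> back to a string.
    $out = '';
    foreach ($xpath->query('//body/div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo clean_html('<p onclick="evil()">hi <script>alert(1)</script></p>');
// => <p>hi </p>
```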

HTMLCleaner.pm, a more modern library, operates on streams and produces a stream. This is what the S2 print statement (from untrusted layers) is thrown through. This one's from the wcmtools repository, not livejournal, so it's BSD-licenced.
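A stream-oriented cleaner instead consumes input in chunks and emits cleaned chunks as it goes, so the whole document never has to sit in memory at once. Below is a very rough PHP sketch of that shape using a user-defined stream filter; the filtering itself is a placeholder, and it naively ignores the fact that a tag can be split across two chunks, which a real streaming cleaner has to handle.

```php
<?php
// Sketch of a stream-style cleaner: data flows through in buckets and each
// bucket is "cleaned" as it passes. A real implementation would keep state
// across buckets, since a tag can straddle a chunk boundary.
class NaiveCleanFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;   // count the original bytes taken in
            // Placeholder cleaning: drop <script> blocks found within this chunk.
            $bucket->data = preg_replace(
                '#<script\b[^>]*>.*?</script>#is', '', $bucket->data);
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

stream_filter_register('naive.clean', 'NaiveCleanFilter');

// Anything read from $fh now passes through the filter chunk by chunk.
$fh = fopen('untrusted.html', 'r');
stream_filter_append($fh, 'naive.clean');
while (!feof($fh)) {
    echo fread($fh, 8192);
}
fclose($fh);
```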

Reply


snej January 19 2005, 16:45:00 UTC
The canonical library that does this sort of stuff (and much more) is libTidy. It's written in C, but there may be PHP bindings to it - check PEAR, or the libTidy home page (which I'm sure you can find on Google!).

Reply
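For anyone wanting to try this from PHP, there is a tidy extension (bundled with PHP 5) that wraps the library. A minimal sketch of invoking it follows; the config values are just examples, and note that by itself this repairs markup rather than stripping scripts.

```php
<?php
// Minimal use of the PHP tidy extension: parse, repair, and re-emit markup.
$config = array(
    'show-body-only' => true,   // return just the fragment, not a full page
    'output-xhtml'   => true,
    'wrap'           => 0,      // don't hard-wrap long lines
);

$dirty = '<p><b>unclosed tags<i>and such';
$tidy  = tidy_parse_string($dirty, $config, 'utf8');
tidy_clean_repair($tidy);

echo tidy_get_output($tidy);
// e.g. <p><b>unclosed tags<i>and such</i></b></p>
```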

bradfitz January 24 2005, 11:19:14 UTC
Are you sure?

I didn't think Tidy's focus was XSS/Javascript removal. And looking at it again now, I'm even more certain of it. At least their docs/examples make no mention of either scripting or XSS.

Reply

snej January 24 2005, 11:22:36 UTC
Well, I've never used Tidy directly, but I know we use it for this purpose, among other things. I believe it's pretty flexible about letting you specify what tags/attributes to strip out of the HTML, so we just list every one that could contain JS. (I'm not sure whether we strip 'javascript:' URLs with tidy or something else.)

Reply
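One way to handle the 'javascript:' URL case separately from tag stripping is to check the scheme of each URL-bearing attribute against a short whitelist of known-safe schemes. A small PHP sketch of that one check; the allowed schemes here are just examples.

```php
<?php
// Illustrative check for URL-bearing attributes (href, src, ...): allow only
// a short list of known schemes, plus scheme-less relative URLs. A real
// cleaner also has to worry about whitespace and entity tricks in the scheme.
function url_is_safe($url)
{
    $allowed = array('http', 'https', 'ftp', 'mailto');
    $scheme  = parse_url(trim($url), PHP_URL_SCHEME);

    if ($scheme === null) {
        return true;            // no scheme at all: a relative URL
    }
    return is_string($scheme) && in_array(strtolower($scheme), $allowed, true);
}

var_dump(url_is_safe('http://example.com/'));   // bool(true)
var_dump(url_is_safe('/relative/path'));        // bool(true)
var_dump(url_is_safe('javascript:alert(1)'));   // bool(false)
```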

