(Untitled)

Jan 18, 2005 17:04

How does the S2 code cleaner work? I assume the content is piped through some program and stripped of malicious code, but how do you know which code is malicious? And where is that program located? I searched the CVS repos and the manual but couldn't find it. I'm interested in doing something similar with PHP.

Comments 14

marksmith January 19 2005, 11:11:23 UTC
We wrote our own. It isn't piped anywhere; check out cgi-bin/cleanhtml.pl in the livejournal repository.

Reply


evan January 19 2005, 11:17:00 UTC
http://cvs.livejournal.org/browse.cgi/livejournal/cgi-bin/cleanhtml.pl?rev=HEAD&content-type=text/x-cvsweb-markup
is one of them.

Note that LJ code is licensed under the GPL, and so all derivative works must also be under the GPL.

Reply

timwi January 19 2005, 13:16:24 UTC
Does taking only the idea and re-writing it entirely in PHP constitute a "derivative work"?

Reply

surye January 19 2005, 13:23:07 UTC
2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:
[....]

I'd say that if no code is actually used from the original, it's not a "derivative work". If it were, it seems to me we'd be jumping into the area of patents, where a process or concept is retained as IP. However, the GPL does not concern software patents as far as I know. (And it shouldn't; software patents are BAD ;)

Just my 2 cents based on reading the GPL; maybe I missed something, and I'd love for someone to point out any flaws in my logic.

Reply

j4k0b January 19 2005, 13:27:38 UTC
It seems like the only parts that would be reused are patterns in regexps, and I could get exactly the same thing without ever seeing LJ's source code. That, and the concept of stripping something from a string to make it safe, which is a fairly broad idea.

I'm probably not going to need to clean PHP code anymore anyway.

Reply
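To make that idea concrete, here is a minimal PHP sketch of regexp-based stripping, since the original question was about PHP. The patterns are invented for illustration, are nowhere near complete, and this whole approach is easy to defeat with malformed markup; none of it is taken from LJ's code.

```php
<?php
// Illustrative only: naive regexp-based stripping of a few obvious script
// vectors. The patterns are examples, not a complete list.
function naive_clean($html)
{
    // Drop <script>...</script> blocks entirely.
    $html = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);

    // Drop inline event-handler attributes such as onclick="..." or onload='...'.
    $html = preg_replace('/\son\w+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i', '', $html);

    // Neutralise javascript: URLs in href/src attributes.
    $html = preg_replace('/\b(href|src)\s*=\s*(["\']?)\s*javascript:[^"\'\s>]*/i', '$1=$2#', $html);

    return $html;
}

echo naive_clean('<p onclick="evil()">hi <script>alert(1)</script></p>');
// => <p>hi </p>
```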


mart January 19 2005, 15:44:42 UTC

There are two HTML cleaners. The old-fashioned one, cleanhtml.pl (which everyone else is talking about), is usually applied to Perl strings in memory and returns a string result. It's used to clean entries, comments and other little things which are generally already in memory anyway, because they come from the database.
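As a rough illustration of that string-in, string-out shape in PHP (the language the original poster asked about): parse the fragment, keep a whitelist of tags and attributes, and serialise the result back to a string. The whitelists and the handling of disallowed elements below are invented for the example, not taken from cleanhtml.pl.

```php
<?php
// Illustrative string-in, string-out cleaner: parse the fragment, keep only
// whitelisted tags and attributes, and serialise it back to a string.
function clean_html($html)
{
    $allowed_tags  = array('b', 'i', 'u', 'a', 'p', 'br', 'img');
    $allowed_attrs = array('href', 'src', 'alt', 'title');

    libxml_use_internal_errors(true);            // tolerate sloppy user markup
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>');  // wrapper <div> marks our fragment

    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//body/div//*')) as $node) {
        if (!in_array(strtolower($node->nodeName), $allowed_tags, true)) {
            // Simplification: drop disallowed elements and everything inside them.
            $node->parentNode->removeChild($node);
            continue;
        }
        // Drop every attribute that isn't whitelisted (onclick, style, ...).
        for ($i = $node->attributes->length - 1; $i >= 0; $i--) {
            $attr = $node->attributes->item($i);
            if (!in_array(strtolower($attr->name), $allowed_attrs, true)) {
                $node->removeAttribute($attr->name);
            }
        }
    }

    // Serialise just the children of the wrapper <div> back to a string.
    $out = '';
    foreach ($xpath->query('//body/div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo clean_html('<p onclick="evil()">hi <script>alert(1)</script></p>');
// => <p>hi </p>
```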

HTMLCleaner.pm, a more modern library, operates on streams and produces a stream. This is what the S2 print statement (from untrusted layers) is thrown through. This one's from the wcmtools repository, not livejournal, so it's BSD-licenced.
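A stream-oriented cleaner instead consumes input in chunks and emits cleaned chunks as it goes, so the whole document never has to sit in memory at once. Below is a very rough PHP sketch of that shape using a user-defined stream filter; the filtering itself is a placeholder, and it naively ignores the fact that a tag can be split across two chunks, which a real streaming cleaner has to handle.

```php
<?php
// Sketch of a stream-style cleaner: data flows through in buckets and each
// bucket is "cleaned" as it passes. A real implementation would keep state
// across buckets, since a tag can straddle a chunk boundary.
class NaiveCleanFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;   // count the original bytes taken in
            // Placeholder cleaning: drop <script> blocks found within this chunk.
            $bucket->data = preg_replace(
                '#<script\b[^>]*>.*?</script>#is', '', $bucket->data);
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

stream_filter_register('naive.clean', 'NaiveCleanFilter');

// Anything read from $fh now passes through the filter chunk by chunk.
$fh = fopen('untrusted.html', 'r');
stream_filter_append($fh, 'naive.clean');
while (!feof($fh)) {
    echo fread($fh, 8192);
}
fclose($fh);
```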

Reply


snej January 19 2005, 16:45:00 UTC
The canonical library that does this sort of stuff (and much more) is libTidy. It's written in C, but there may be PHP bindings to it - check PEAR, or the libTidy home page (which I'm sure you can find on Google!).

Reply
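For anyone wanting to try this from PHP, there is a tidy extension (bundled with PHP 5) that wraps the library. A minimal sketch of invoking it follows; the config values are just examples, and note that by itself this repairs markup rather than stripping scripts.

```php
<?php
// Minimal use of the PHP tidy extension: parse, repair, and re-emit markup.
$config = array(
    'show-body-only' => true,   // return just the fragment, not a full page
    'output-xhtml'   => true,
    'wrap'           => 0,      // don't hard-wrap long lines
);

$dirty = '<p><b>unclosed tags<i>and such';
$tidy  = tidy_parse_string($dirty, $config, 'utf8');
tidy_clean_repair($tidy);

echo tidy_get_output($tidy);
// e.g. <p><b>unclosed tags<i>and such</i></b></p>
```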

bradfitz January 24 2005, 11:19:14 UTC
Are you sure?

I didn't think Tidy's focus was XSS/Javascript removal. And looking at it again now, I'm even more certain of it. At least their docs/examples make no mention of either scripting or XSS.

Reply

snej January 24 2005, 11:22:36 UTC
Well, I've never used Tidy directly, but I know we use it for this purpose, among other things. I believe it's pretty flexible about letting you specify what tags/attributes to strip out of the HTML, so we just list every one that could contain JS. (I'm not sure whether we strip 'javascript:' URLs with tidy or something else.)

Reply
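One way to handle the 'javascript:' URL case separately from tag stripping is to check the scheme of each URL-bearing attribute against a short whitelist of known-safe schemes. A small PHP sketch of that one check; the allowed schemes here are just examples.

```php
<?php
// Illustrative check for URL-bearing attributes (href, src, ...): allow only
// a short list of known schemes, plus scheme-less relative URLs. A real
// cleaner also has to worry about whitespace and entity tricks in the scheme.
function url_is_safe($url)
{
    $allowed = array('http', 'https', 'ftp', 'mailto');
    $scheme  = parse_url(trim($url), PHP_URL_SCHEME);

    if ($scheme === null) {
        return true;            // no scheme at all: a relative URL
    }
    return is_string($scheme) && in_array(strtolower($scheme), $allowed, true);
}

var_dump(url_is_safe('http://example.com/'));   // bool(true)
var_dump(url_is_safe('/relative/path'));        // bool(true)
var_dump(url_is_safe('javascript:alert(1)'));   // bool(false)
```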

