T-SQL Function to strip HTML tags

Feb 10, 2005 14:01

My boss called me over and asked me to write a T-SQL function on behalf of our web guy, that would arbitrarily strip out all HTML tags from a VARCHAR. In any other language like PHP, Perl, etc., I'd employ Regular Expressions to do this globally. However, T-SQL only really has PATINDEX and CHARINDEX, those aren't all that powerful.

I did wind up coming up with an iterative solution, which I thought I'd share with everyone to get suggestions and feedback.



CREATE TABLE #tmpText (
id INT IDENTITY(1, 1),
data varchar(4000)
)

INSERT INTO #tmpText (data)
VALUES ('this is some html')

INSERT #tmpText VALUES (
'Some Name

SOME HTML text after the body'
)

INSERT #tmpText VALUES (
'Another Name

Another HTML text after the body'
)

-- WHILE a '<' and '>' pair are present and the former's position is less than the latter's position value
-- Find a '<' then the next '>' positions, and run a stuff to remove whatever's in between!

BEGIN TRANSACTION
WHILE EXISTS (SELECT 1
FROM #tmpText
WHERE PATINDEX('%<%>%', data) > 0
)
BEGIN
UPDATE #tmpText
SET data = STUFF(data, PATINDEX('%<%>%', data), CHARINDEX('>', data) - PATINDEX('%<%>%', data) + 1, '')
WHERE PATINDEX('%<%>%', data) > 0
END

SELECT *
FROM #tmpText

ROLLBACK

I'm going to keep trying to dissect this in order to come up with a cleaner, non-iterative solution. If anyone else already has a function that does this in one fell-swoop, rather than via looping, PLEASE share it! :-)

UPDATE: I've made further code changes to handle a few odd circumstances and have finished the function. Thought I'd share it with everyone for a "peer review." Please comment away!


ALTER FUNCTION uf_stripHTML
(
   @strHTML varchar(8000),
   @flgFormat int = 1
)
RETURNS varchar(8000)
AS
BEGIN

--------------------------------------------------------------------------------------------------
-- Date Written: 02-10-2005
-- Purpose: Arbitrarily strip all text between all pairs of < > tags. This SHOULD be
-- HTML but theoretically could be other data?
----------------------------------------------------------------------------------------------------
-- Input Parameters: @strHTML = The string that we are stripping HTML from.
-- @flgFormat = If set to 1 (DEFAULT), then the function will attempt to
-- preserve basic formatting by detecting non-breaking spaces, start and end
-- paragraph tags, and break-return tags, and replacing them with their
-- ASCII equivalents.
----------------------------------------------------------------------------------------------------
-- Comments: This solution employs an iterative algorithm to repeatedly sweep through
-- the variable's text, removing HTML tags one at a time.
----------------------------------------------------------------------------------------------------

DECLARE @ltPosition int

-- If flgFormat is 1, then replace pre-determined list of tags and characters with
-- corresponding values!
IF @flgFormat = 1
BEGIN
   SET @strHTML = REPLACE(@strHTML, '
', CHAR(10) + CHAR(13))
   SET @strHTML = REPLACE(@strHTML, '
', CHAR(10) + CHAR(13))
   SET @strHTML = REPLACE(@strHTML, '
', CHAR(10))
END

-- Arbitrarily replace &_nbsp; (intentionally mistyped)
SET @strHTML = REPLACE(@strHTML, '&_nbsp;', CHAR(32))

-- STRIP OUT HTML HERE
WHILE (SELECT PATINDEX('%<[^ ]%>%', @strHTML)) > 0
BEGIN
   -- Must search for the correct '>' because any unmatched ones will cause errors!
   SET @ltPosition = 0
   WHILE (PATINDEX('%<[^ ]%>%', @strHTML) > @ltPosition)
      SET @ltPosition = CHARINDEX('>', @strHTML, @ltPosition + 1)

SET @strHTML = STUFF(@strHTML, PATINDEX('%<[^ ]%>%', @strHTML), @ltPosition - PATINDEX('%<[^ ]%>%', @strHTML) + 1, '')
END

RETURN @strHTML

END
GO

Previous post Next post
Up