My boss called me over and asked me to write a T-SQL function on behalf of our web guy, that would arbitrarily strip out all HTML tags from a VARCHAR. In any other language like PHP, Perl, etc., I'd employ Regular Expressions to do this globally. However, T-SQL only really has PATINDEX and CHARINDEX, those aren't all that powerful.
I did wind up coming up with an iterative solution, which I thought I'd share with everyone to get suggestions and feedback.
CREATE TABLE #tmpText (
id INT IDENTITY(1, 1),
data varchar(4000)
)
INSERT INTO #tmpText (data)
VALUES ('this is some html')
INSERT #tmpText VALUES (
'Some Name
SOME HTML text after the body'
)
INSERT #tmpText VALUES (
'Another Name
Another HTML text after the body'
)
-- WHILE a '<' and '>' pair are present and the former's position is less than the latter's position value
-- Find a '<' then the next '>' positions, and run a stuff to remove whatever's in between!
BEGIN TRANSACTION
WHILE EXISTS (SELECT 1
FROM #tmpText
WHERE PATINDEX('%<%>%', data) > 0
)
BEGIN
UPDATE #tmpText
SET data = STUFF(data, PATINDEX('%<%>%', data), CHARINDEX('>', data) - PATINDEX('%<%>%', data) + 1, '')
WHERE PATINDEX('%<%>%', data) > 0
END
SELECT *
FROM #tmpText
ROLLBACK
I'm going to keep trying to dissect this in order to come up with a cleaner, non-iterative solution. If anyone else already has a function that does this in one fell-swoop, rather than via looping, PLEASE share it! :-)
UPDATE: I've made further code changes to handle a few odd circumstances and have finished the function. Thought I'd share it with everyone for a "peer review." Please comment away!
ALTER FUNCTION uf_stripHTML
(
@strHTML varchar(8000),
@flgFormat int = 1
)
RETURNS varchar(8000)
AS
BEGIN
--------------------------------------------------------------------------------------------------
-- Date Written: 02-10-2005
-- Purpose: Arbitrarily strip all text between all pairs of < > tags. This SHOULD be
-- HTML but theoretically could be other data?
----------------------------------------------------------------------------------------------------
-- Input Parameters: @strHTML = The string that we are stripping HTML from.
-- @flgFormat = If set to 1 (DEFAULT), then the function will attempt to
-- preserve basic formatting by detecting non-breaking spaces, start and end
-- paragraph tags, and break-return tags, and replacing them with their
-- ASCII equivalents.
----------------------------------------------------------------------------------------------------
-- Comments: This solution employs an iterative algorithm to repeatedly sweep through
-- the variable's text, removing HTML tags one at a time.
----------------------------------------------------------------------------------------------------
DECLARE @ltPosition int
-- If flgFormat is 1, then replace pre-determined list of tags and characters with
-- corresponding values!
IF @flgFormat = 1
BEGIN
SET @strHTML = REPLACE(@strHTML, '
', CHAR(10) + CHAR(13))
SET @strHTML = REPLACE(@strHTML, '
', CHAR(10) + CHAR(13))
SET @strHTML = REPLACE(@strHTML, '
', CHAR(10))
END
-- Arbitrarily replace &_nbsp; (intentionally mistyped)
SET @strHTML = REPLACE(@strHTML, '&_nbsp;', CHAR(32))
-- STRIP OUT HTML HERE
WHILE (SELECT PATINDEX('%<[^ ]%>%', @strHTML)) > 0
BEGIN
-- Must search for the correct '>' because any unmatched ones will cause errors!
SET @ltPosition = 0
WHILE (PATINDEX('%<[^ ]%>%', @strHTML) > @ltPosition)
SET @ltPosition = CHARINDEX('>', @strHTML, @ltPosition + 1)
SET @strHTML = STUFF(@strHTML, PATINDEX('%<[^ ]%>%', @strHTML), @ltPosition - PATINDEX('%<[^ ]%>%', @strHTML) + 1, '')
END
RETURN @strHTML
END
GO