Strip Tags HTML

45 pts.
Tags:
Oracle
PL/SQL
Is there any function to ignore html tags when conducting searches in a oracle table?

Answer Wiki

Thanks. We'll let you know when a new response is added.

Hi,

I’m not sure I understand the problem. Do you have a string value s that contains HTML tags, and do you want to use things like INSTR and LIKE on s, but ignore the tags?

If so, you can use REGEXP_REPLACE starting in Oracle 10:

<pre>
REGEXP_REPLACE (s, ‘<[^>]+>’, NULL)
</pre>

returns a copy of s, without ‘<’s and anything that follows up to the next ‘>’.

Discuss This Question: 7  Replies

 
There was an error processing your information. Please try again later.
Thanks. We'll let you know when a new response is added.
Send me notifications when members answer or reply to this question.

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy
  • Ged
    Sorry. Had not said that use Oracle 9i, and REGEXP_REPLACE is only available on Oracle 10, I think. My Select is: Select x from n where ((CONTAINS (camp1, ' "+ text +"')> 0)) And I need to modify it to ignore html tags that contain camp1 to find "text". Thanks.
    45 pointsBadges:
    report
  • Ged
    Sorry. Had not said that use Oracle 9i, and REGEXP_REPLACE is only available on Oracle 10, I think. My Select is: Select x from n where ((CONTAINS (camp1, ' "+ text +"')> 0)) And I need to modify it to ignore html tags that contain camp1 to find "text". Thanks.
    45 pointsBadges:
    report
  • FrankKulash
    Hi, I don't know any good way to do what you want in pure SQL. I'd use a PL/SQL function, such as replace_between (see end of this message). I'm not familiar with Oracle Text, so I don't know how this can be used with it. You may have to create a column containing (or an index on) "replace_between (camp1, '<', '>')" in addition to (or instead of) the column containing (or index on) camp1 in order to use CONTAINS. Here's the replace_between function:
    --		***************************************
    --		**   r e p l a c e _ b e t w e e n   **
    --		***************************************
    
    --	replace between returns a copy of in_txt with all substrings that begin with
    --	in_begin_txt and end with in_end_txt replaced by in_new_txt.
    --	in_begin_txt nd in_end_txt do NOT have to be distinct. 
    --	If in_begin_txt is not followed by in_end_txt, the value returned does not
    --	include anything after it.
    --	Examples:
    --						in_	in_e
    --						begin_	end_	new_
    --	in_txt					txt	txt	txt	value returned
    --	<html>The <b>Quick</b> brown</html>	<	>	<tag>	<tag>The <tag>Quick<tag> brown<tag>
    --	fox [ jumps [ over ) the [lazy dog 	[	)	*	fox * the *
    --	abracadabra				a	a	a	acar
    --	a</p>b<p>inside</p>unbalanced</p>	<p>	</p>	NULL	a</p>bunbalanced</p>
    
    FUNCTION	replace_between
    (	in_txt		IN	VARCHAR2			-- string to be altered
    ,	in_begin_txt	IN	VARCHAR2			-- beginning of sub-string to be replaced
    ,	in_end_txt	IN	VARCHAR2			-- end of sub-string to be replaced
    ,	in_new_txt	IN	VARCHAR2	DEFAULT NULL	-- replacement string
    )
    RETURN	VARCHAR2
    DETERMINISTIC
    IS
    	begin_len	PLS_INTEGER	:= LENGTH (in_begin_txt);
    	begin_pos	PLS_INTEGER	:= INSTR (in_txt, in_begin_txt);
    	end_len		PLS_INTEGER	:= LENGTH (in_end_txt);
    	end_pos		PLS_INTEGER;
    	j		PLS_INTEGER	:= 0;
    	new_len		PLS_INTEGER	:= length0 (in_new_txt); 
    	return_txt	VARCHAR2 (4000)	:= in_txt;
    BEGIN
    	WHILE	begin_pos > 0
    	LOOP
    		end_pos := INSTR	( return_txt
    					, in_end_txt
    					, begin_pos + begin_len
    					);
    		IF  end_pos > 0
    		AND j < 10
    		THEN	-- end_txt found
    			return_txt := SUBSTR (return_txt, 1, begin_pos - 1)
    				|| in_new_txt
    				|| SUBSTR (return_txt, end_pos + end_len);
    			begin_pos := INSTR	( return_txt
    						, in_begin_txt
    						, begin_pos + new_len + 1
    						);
    		ELSE	-- end_txt not found
    			return_txt := SUBSTR (return_txt, 1, begin_pos - 1)
    				|| in_new_txt;
    			begin_pos := 0;
    		END IF;
    		j := j + 1;
    	END LOOP;
    
    	RETURN	return_txt;
    END	replace_between
    ;
    
    1,240 pointsBadges:
    report
  • Vakilabhay
    sir it's very helpful but one problem is that when we store the data like following '<b>india</b>&copy' in database (&copy is for special character ) that time how we can remove the html tags from data if this function is use or file like that select REGEXP_REPLACE ('<b>india</b>&copy') from dual; it ask for bind variable. how to resolve it. it's urgent please.
    10 pointsBadges:
    report
  • Kccrosser
    How complicated do you need to make this? If you are only concerned with converting the five standard "entities" (ampersand, apostrophe, quote, less than, greater than), and the basic formatting (bold, italic, underline), you can do that by repeated application of the simple REPLACE expression. sOut = REPLACE(sIn, '<b>', '') sOut = REPLACE(sOut, '</b>', '') sOut = REPLACE(sOut, '<', '<') sOut = REPLACE(sOut, '>', '>') sOut = REPLACE(sOut, '"', '"') sOut = REPLACE(sOut, ''', '''') sOut = REPLACE(sOut, '&', '&') ... Of course, you can concatenate these: sOut = REPLACE(REPLACE(REPLACE(sIn, '<b>', ''), '</b>', ''), '"', '"') I would write a simple PL/SQL function that applied these rather than using the REPLACE multiple times in line. Note that to avoid side-effects, I would apply them in a specific order - formatting first (<b>, <i>, etc.), and then the entities, ending with the substitution for the ampersand itself. Otherwise, you have a (slight) risk of generating a false entity or formatting string in the interim results, which could change your data contents. If you need to remove more complex structures, then you would need to resort to actually parsing the text.
    3,830 pointsBadges:
    report
  • Kccrosser
    Ok, duh, I forgot to escape the entity tags in my code above, so my examples got converted from the entity representations down to the text characters. Obviously, the first arguments in the replace should be the actual text strings of the entity values - e.g., for the ampersand, the string would be "(ampersand)amp;". One of the following should display correctly:
    sOut = REPLACE ( sIn, '&amp;', '&' )
    sOut = REPLACE( sIn, '&', '&' )
    3,830 pointsBadges:
    report
  • Kccrosser
    sOut = REPLACE(sIn, ‘<b>’, ”) Here is the corrected sequence:
    sOut = REPLACE(sOut, ‘</b>’, ”)
    sOut = REPLACE(sOut, ‘&lt;’, ‘<’)
    sOut = REPLACE(sOut, '&gt;’, ‘>’)
    sOut = REPLACE(sOut, ‘&quot;’, ‘”‘)
    sOut = REPLACE(sOut, ‘&apos;’, ””)
    sOut = REPLACE(sOut, ‘&amp;’, ‘&’)
    3,830 pointsBadges:
    report

Forgot Password

No problem! Submit your e-mail address below. We'll send you an e-mail containing your password.

Your password has been sent to:

To follow this tag...

There was an error processing your information. Please try again later.

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy

Thanks! We'll email you when relevant content is added and updated.

Following