Monday, March 19, 2012

Exclude HTML Tags in my Search

Hi all;
I would like to exclude HTML tags from my full-text search. That is, when I
search for the word "body", I want the search to return only the occurrences
of that word in the content of the body field, and not the matches on the
tags <body> or </body>.
Thank you
This can't be done easily. The best way to fix this is to convert your HTML
content to text content using an HTML parser or filtdump -b (there may be
licensing restrictions with this).
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"yaser" <yaser.abu-khudier@.hotmail.com> wrote in message
news:%23pGqH$3FHHA.4712@.TK2MSFTNGP04.phx.gbl...
> Hi all;
>
> I would like to exclude HTML tags in my search criteria at my full-text
> search i.e. when I look for for body word I desire the search to return
> the body words which included in the body field but I don't want to
> include <body>or </body>.
>
> Thank you
>
>
|||Thank you so much, Hilary. I have found another way to solve the problem:
replacing the HTML tags with spaces before querying the content from the DB
(ntext).
Thanks a lot for your help
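For anyone who wants to do the tag replacement on the SQL Server side, here is a minimal sketch of the idea. It assumes SQL Server 2005 or later. The function name dbo.StripHtmlTags and the table dbo.Article with an ntext Body column are made up for illustration and do not come from the posts above. It is a naive scan for '<' ... '>' pairs, so a literal '<' in the text (as in "a < b") followed later by a '>' would be removed as well.

```sql
-- Hypothetical helper; dbo.StripHtmlTags is not a built-in function.
CREATE FUNCTION dbo.StripHtmlTags (@html nvarchar(max))
RETURNS nvarchar(max)
AS
BEGIN
    DECLARE @start bigint, @end bigint;

    SET @start = CHARINDEX(N'<', @html);
    -- Look for the closing '>' only after the current '<'.
    SET @end = CHARINDEX(N'>', @html, @start);

    WHILE @start > 0 AND @end > 0
    BEGIN
        -- Replace the whole tag, brackets included, with a single space
        -- so that the words on either side of it stay separated.
        SET @html = STUFF(@html, @start, @end - @start + 1, N' ');
        SET @start = CHARINDEX(N'<', @html, @start);
        SET @end = CHARINDEX(N'>', @html, @start);
    END;

    RETURN @html;
END;
GO

-- ntext cannot be passed around like nvarchar, so cast it first.
-- dbo.Article and its Body column are assumptions for this example.
SELECT dbo.StripHtmlTags(CAST(Body AS nvarchar(max))) AS PlainBody
FROM dbo.Article;
```

GO is the batch separator used by Management Studio and sqlcmd, not a T-SQL statement. Calling a scalar function like this on every row at query time can be slow on large tables, so it may be better suited to cleaning the data once than to running on each search.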
"Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
news:eKJn5L6FHHA.3268@.TK2MSFTNGP04.phx.gbl...
> This can't be done easily. The best way to fix this is to convert your
> html content to text content using a html parser or filtdump -b (there may
> be licensing restrictions with this).
> --
> Hilary Cotter
> Looking for a SQL Server replication book?
> http://www.nwsu.com/0974973602.html
> Looking for a FAQ on Indexing Services/SQL FTS
> http://www.indexserverfaq.com
>
> "yaser" <yaser.abu-khudier@.hotmail.com> wrote in message
> news:%23pGqH$3FHHA.4712@.TK2MSFTNGP04.phx.gbl...
>
|||Hello yaser,
I thought that if you set the doc type to HTML, the tags are ignored. We had
this issue with searches for colours finding matches in FONT definitions.
Changing the doc type to HTML solved the problem.
Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons
> Thank you so much Hilary , I have found another method to solve the
> problem
> by
> replacing the html tags with spaces before query them from the DB
> (ntext),
> Thanks a lot for your assist
> "Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
> news:eKJn5L6FHHA.3268@.TK2MSFTNGP04.phx.gbl...
|||Hi Simon;
Can this be done if I use full-text search on a column of data type ntext,
or do I have to keep the column as binary, such as varbinary(max)? As far as
I know, we specify the TYPE COLUMN keyword in the statement that creates the
full-text index only when we are searching a binary column (and in your
solution I have to set the column type to HTML). By the way, by setting the
doc type to HTML, do you mean something like this:
CREATE FULLTEXT INDEX ON Production.Document (Document TYPE COLUMN HTML)
KEY INDEX PK_Document_DocumentID ON AWCatalog WITH CHANGE_TRACKING AUTO;
Is this going to ignore HTML tags in my search?
Thanks a lot for your help and support
|||Hi yaser,
I am interested to know whether you came up with a solution to this, as I am
trying to do exactly the same thing. I will more than likely convert our
textual data to binary data and specify the extension as html.
Another thought I had: could one not use an "HTML" word breaker and then
specify the language as HTML? I will look into it and let you know if I find
a way of doing this.
"yaser" wrote:

> Hi Simon;
> is this able to be done even if I use full-text search on column with data
> type ntext or I have to keep this column as binary such as (varbinary(max)),
> because I know that we specify the TYPE COLUMN key word in the creation
> statement of the fulltext index if we are searching in a binary column only
> (and in your solution I have to specify the column type to HTML). By the way
> do you mean the same for setting the doc type to HTML do you mean somthing
> like this:
>
> CREATE FULLTEXT INDEX ON Production.Document (Document TYPE COLUMN HTML) KEY
> INDEX PK_Document_DocumentID ON AWCatalog WITH CHANGE_TRACKING AUTO;
>
> Is this going to ignore html tags in my search?
> Thanks a lot for your help and support
>
>
>
|||Sorry for the late reply;
My final decision was to build two columns in the database, one with HTML
tags and the other without, so when I perform my search I run it against the
column that doesn't have any HTML tags. I found this the easiest and fastest
technique.
Hope this helps you
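The two-column layout described here could look roughly like the sketch below. All names (dbo.Article, BodyHtml, BodyText, PK_Article, ArticleCatalog) are hypothetical, and it assumes the tag-free copy is filled in by the application or a cleanup job whenever the HTML changes, which the post does not specify.

```sql
-- Hypothetical schema: one column for display, one for searching.
CREATE TABLE dbo.Article
(
    ArticleID int NOT NULL CONSTRAINT PK_Article PRIMARY KEY,
    BodyHtml  nvarchar(max) NOT NULL,  -- original markup, returned to the page
    BodyText  nvarchar(max) NOT NULL   -- same content with the tags removed
);
GO

CREATE FULLTEXT CATALOG ArticleCatalog;
GO

-- Only the tag-free column is full-text indexed.
CREATE FULLTEXT INDEX ON dbo.Article (BodyText)
    KEY INDEX PK_Article ON ArticleCatalog
    WITH CHANGE_TRACKING AUTO;
GO

-- Search the plain text, but return the HTML version for display.
SELECT ArticleID, BodyHtml
FROM dbo.Article
WHERE CONTAINS(BodyText, N'body');
```

The trade-off is extra storage and the need to keep both columns in sync, in exchange for a simple index on ordinary text.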
|||Hello yaser,
Yes it will. And yes you do have to store it as a binary type.
Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons
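
A note on the TYPE COLUMN clause for readers trying this: as far as I know, TYPE COLUMN takes the name of another column in the same table whose value holds each row's document type (a file extension such as .html), rather than the literal type itself. Below is a rough sketch under that assumption. The names dbo.Page, Content, ContentType, PK_Page and PageCatalog are made up for illustration, and the sample row is stored as single-byte text only to keep the example short.

```sql
-- Hypothetical table: binary content plus a column naming its document type.
CREATE TABLE dbo.Page
(
    PageID      int NOT NULL CONSTRAINT PK_Page PRIMARY KEY,
    Content     varbinary(max) NOT NULL,
    ContentType nvarchar(8) NOT NULL   -- file extension, e.g. N'.html'
);
GO

CREATE FULLTEXT CATALOG PageCatalog;
GO

-- TYPE COLUMN points at ContentType, so each row is run through the
-- filter registered for the extension stored in that row.
CREATE FULLTEXT INDEX ON dbo.Page (Content TYPE COLUMN ContentType)
    KEY INDEX PK_Page ON PageCatalog
    WITH CHANGE_TRACKING AUTO;
GO

-- Real data would normally arrive as file bytes from the application.
INSERT INTO dbo.Page (PageID, Content, ContentType)
VALUES (1,
        CAST('<html><body><p>green apples</p></body></html>' AS varbinary(max)),
        N'.html');
GO

-- Extensions that have a filter installed on this instance.
SELECT document_type FROM sys.fulltext_document_types;

SELECT PageID
FROM dbo.Page
WHERE CONTAINS(Content, N'apples');
```

Full-text population runs in the background, so a row inserted a moment ago may not be searchable immediately.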

> Hi Simon;
> is this able to be done even if I use full-text search on column with
> data type ntext or I have to keep this column as binary such as
> (varbinary(max)), because I know that we specify the TYPE COLUMN key
> word in the creation statement of the fulltext index if we are
> searching in a binary column only (and in your solution I have to
> specify the column type to HTML). By the way do you mean the same for
> setting the doc type to HTML do you mean somthing like this:
> CREATE FULLTEXT INDEX ON Production.Document (Document TYPE COLUMN
> HTML) KEY INDEX PK_Document_DocumentID ON AWCatalog WITH
> CHANGE_TRACKING AUTO;
> Is this going to ignore html tags in my search?
> Thanks a lot for your help and support
>
|||Thanks a lot, Simon. I will try to test it; using this method should give
better performance and use less storage.
Thanks for your help and support
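|||Hi, you can strip the tags with a regular expression in C#, like this:
[Code]
using System.Text.RegularExpressions;
//..
// Non-greedy match of everything from a '<' up to the next '>'.
const string HTML_TAG_PATTERN = "<.*?>";
protected string StripHTML(string strInputString)
{
    // Singleline makes '.' match newlines too, so a tag that is
    // broken across two lines is still removed.
    return Regex.Replace(strInputString, HTML_TAG_PATTERN,
        string.Empty, RegexOptions.Singleline);
}
[/Code]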
"yaser" wrote:

> thanks a lot simon i will try to test it, using this method will give better
> performance and less storage.
> thanks for your hel and support
>
>
