Exact searches with Lucene (and searches with special characters)

There are of course thousands of ways to solve this issue. I decided to go old-school 🙂

So what’s the problem? Well, sometimes you really want to search for an exact value. Perhaps it contains strange characters (as in formulas or algorithms), or maybe it’s too short to be indexed correctly… and I’m sure there are more reasons.

What can you do? One way to address this is to construct a unique string, store it in the index and search that column instead of the column containing the clear text value.

An example: let’s say you have values like A(b)-c in a column, and you really only want to find exact matches. Parentheses and hyphens make this hard, and if you replace them you risk getting other matches as well. What you can do is pass the value to a base64 encoder, which gives you a unique string made up mainly of letters and digits. In this case you’ll get: QShiKS1j

This has to be done when you create your index. So now you have two fields side by side in each document: one with the actual value and one with the encoded value.

So, when a user enters A(b)-c in the search field, all you have to do is base64-encode it before using it in your search query, and of course search in the encoded field rather than the clear text one.

The code to encode and decode strings:

public static string Base64Encode(string text)
{
  return text != null ? Convert.ToBase64String(Encoding.UTF8.GetBytes(text)) : "";
}
public static string Base64Decode(string text)
{
  return text != null ? Encoding.UTF8.GetString(Convert.FromBase64String(text)) : "";
}
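
One thing to watch out for: Base64 is case-sensitive and can also emit +, / and = for some inputs, while StandardAnalyzer lowercases whatever it indexes, so two different values could in theory end up as the same term. If that worries you, a hex encoding is one possible alternative, since it only ever produces 0-9 and a-f. This is just a minimal sketch of that idea, not something from my original solution; the class and method names (ExactMatchKey, HexEncode, HexDecode) are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: an alternative to the Base64 methods above.
public static class ExactMatchKey
{
    // Hex-encodes the UTF-8 bytes of the text, two lowercase hex digits per byte.
    public static string HexEncode(string text)
    {
        if (text == null) return "";
        return string.Concat(Encoding.UTF8.GetBytes(text).Select(b => b.ToString("x2")));
    }

    // Reverses HexEncode; assumes the input has an even number of hex digits.
    public static string HexDecode(string hex)
    {
        if (string.IsNullOrEmpty(hex)) return "";
        byte[] bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }
        return Encoding.UTF8.GetString(bytes);
    }
}
```

The trade-off is length: the hex key is twice as long as the UTF-8 bytes of the original value, where Base64 only adds about a third.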

The code to create an index:

IndexWriter Writer = new IndexWriter(
  FSDirectory.Open(@"C:\temp\index"),
  new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), 
  true, 
  IndexWriter.MaxFieldLength.LIMITED
);

Document doc = new Document();
doc.Add(new Field("value", "A(b)-c", Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("encodedvalue", Base64Encode("A(b)-c"), Field.Store.YES, Field.Index.ANALYZED));

Writer.AddDocument(doc);
Writer.Optimize();
Writer.Dispose();

And here’s one way to search the index:

FSDirectory SearchIndex = FSDirectory.Open(@"c:\temp\index");
StandardAnalyzer Analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            
IndexReader Reader = IndexReader.Open(SearchIndex, true);
IndexSearcher Searcher = new IndexSearcher(Reader);
            
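// The default field is only used for terms without a "field:" prefix;
// the query below always names its field explicitly.
QueryParser Parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "encodedvalue", Analyzer);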
string SpecialQueryText = string.Format("encodedvalue:{0}", Base64Encode("A(b)-c"));

Query SpecialQuery = Parser.Parse(SpecialQueryText);
TopDocs Hits = Searcher.Search(SpecialQuery, null, 1000);
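// Load every matching document (requires "using System.Linq;").
Document[] Result = Hits.ScoreDocs.Select(s => Searcher.Doc(s.Doc)).ToArray<Document>();

// Or keep only the hits that score at least 1.1 to drop the weaker matches.
Result = Hits.ScoreDocs.Where(x => x.Score >= 1.1F).Select(s => Searcher.Doc(s.Doc)).ToArray<Document>();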
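
If you would rather keep the analyzer out of the picture entirely, another option is to index the encoded field as NOT_ANALYZED and look it up with a TermQuery instead of going through the QueryParser. That way the encoded string is stored and compared as one untouched term. Here is a rough sketch of that variant, assuming Lucene.Net 3.0.x. It uses an in-memory RAMDirectory so it doesn't touch the disk, and the names ExactSearchSketch and CountExactMatches are hypothetical.

```csharp
using System;
using System.Text;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical sketch: exact matching via a NOT_ANALYZED field and a TermQuery.
public static class ExactSearchSketch
{
    static string Base64Encode(string text)
    {
        return text != null ? Convert.ToBase64String(Encoding.UTF8.GetBytes(text)) : "";
    }

    // Indexes the single value "A(b)-c" in memory and returns how many
    // documents match the user's input exactly.
    public static int CountExactMatches(string userInput)
    {
        using (var directory = new RAMDirectory())
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED))
            {
                var doc = new Document();
                doc.Add(new Field("value", "A(b)-c", Field.Store.YES, Field.Index.ANALYZED));
                // NOT_ANALYZED: the encoded key is indexed as a single term, unchanged.
                doc.Add(new Field("encodedvalue", Base64Encode("A(b)-c"), Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.AddDocument(doc);
            }

            using (var searcher = new IndexSearcher(directory, true))
            {
                // TermQuery compares the term as-is, with no parsing or analysis.
                var query = new TermQuery(new Term("encodedvalue", Base64Encode(userInput)));
                return searcher.Search(query, 10).TotalHits;
            }
        }
    }
}
```

With this setup the score filter is not needed, because a TermQuery either matches the whole encoded key or doesn't match at all.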