Getting Started with Lucene Full-Text Search on .NET 5

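Maintenance of the open-source Lucene.NET project has never progressed as quickly as one would hope: there are a lot of packages to port, so the maintainers move slowly. A few articles online cover community adaptations of the older 3.0.3 release for .NET Core, but they are scarce, and having any reference at all is already a bonus. So I am writing up today's trip through the pitfalls for you to refer to in your own work.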

1. Introduction to Lucene

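By convention this is where I would paste in a borrowed description. I am skipping those 8,000 words here; fill them in yourself.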

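In short, Lucene is an open-source full-text search library from the Apache Software Foundation, known for its strong search capabilities. Lucene.NET is its port to the .NET platform and currently supports both .NET Framework and the .NET Core family.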

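Latest version: 4.8.0. There are no other 4.x releases: because the library is a port, the current work targets Lucene 4.8 directly. For full-text search, that is basically sufficient.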

It is impressive enough that mastering Lucene feels like mastering search engines!

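Here is an excerpt describing the project's work. The maintainers already have plenty on their plate, so I will not add to their burden. Still, today's big pitfall is closely tied to it, as a later section explains.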

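ICU4J is Lucene's single largest dependency. Several attempts were made to leverage alternatives:

  1. ICU4NET
  2. icu-dotnet

But we ran into several problems:

  1. Lack of 32/64-bit support
  2. Lack of .NET Standard support
  3. Missing features, and problems when trying to implement them
  4. Lack of thread safety

We ended up directly porting about 40% of ICU4J's functionality to support Lucene.NET. The project is named ICU4N and lives in an external GitHub repository. There are a few outstanding issues where we could use some help getting Lucene.NET into production.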

2. SmartChineseAnalyzer

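When I first used Lucene years ago, everyone relied on the PanGu segmenter. A few years on, Lucene ships with a decent built-in Chinese analyzer, SmartChineseAnalyzer. It supports a custom segmentation dictionary, and having it built in is very convenient. Some other recommended segmenters: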

  1. PanGu segmenter (ready to use)
  2. jieba segmenter (ready to use)
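
To compare analyzers, it helps to print the tokens each one produces. The following is a minimal sketch, assuming the Lucene.Net and Lucene.Net.Analysis.Common 4.8.0 packages are referenced; `AnalyzerDemo` and `Tokenize` are hypothetical names introduced here for illustration, not part of Lucene.NET.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper, for illustration only.
public static class AnalyzerDemo
{
    // Runs the analyzer over the text and collects the terms it emits.
    public static List<string> Tokenize(Analyzer analyzer, string text)
    {
        var tokens = new List<string>();
        // The field name only matters for per-field analyzers; "demo" is arbitrary.
        using (TokenStream ts = analyzer.GetTokenStream("demo", text))
        {
            ICharTermAttribute term = ts.AddAttribute<ICharTermAttribute>();
            ts.Reset();                  // required before the first IncrementToken()
            while (ts.IncrementToken())
            {
                tokens.Add(term.ToString());
            }
            ts.End();                    // signals end of stream to the token filters
        }
        return tokens;
    }
}
```

Passing a `SmartChineseAnalyzer` (from the Lucene.Net.Analysis.SmartCn package) instead of a `WhitespaceAnalyzer` returns segmented Chinese words; the exact segmentation depends on the bundled dictionary. On a machine whose current culture is one of the zh-* cultures, that call may hit the ICU4N problem described in section 4.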

3. An Indexing Example

As of 4.8, the index directory is managed through MMapDirectory, so that is what the example below uses to build the index.

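using System;
using System.Text;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Cn.Smart;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

var dir = MMapDirectory.Open(@"E:\lucene", new NativeFSLockFactory());
//var dir = new RAMDirectory();
var t = "cn";
Analyzer analyzer;
switch (t)
{
    case "std":
        analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        break;
    case "cn":
        analyzer = new SmartChineseAnalyzer(LuceneVersion.LUCENE_48, true);
        break;
    case "ws":
        analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
        break;
    default:
        throw new NotImplementedException();
}

// Configure the writer
IndexWriterConfig iwc = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
// Once buffered documents use more than RAMBufferSizeMB of memory,
// IndexWriter flushes them to disk.
iwc.RAMBufferSizeMB = 32;
// Once more than MaxBufferedDocs documents are buffered,
// IndexWriter flushes them to disk.
iwc.MaxBufferedDocs = 32;
iwc.MergePolicy = new TieredMergePolicy();
iwc.OpenMode = OpenMode.CREATE_OR_APPEND;

// Clear a stale write lock left behind by a crashed run. This must happen
// before the writer is created (the writer takes the lock itself), and a
// forced unlock is unsafe if another writer is still using the index.
if (IndexWriter.IsLocked(dir))
{
    IndexWriter.Unlock(dir);
}
IndexWriter writer = new IndexWriter(dir, iwc);

string id = Guid.NewGuid().ToString("N");
Document doc = new Document();

// Indexed as a single term, not tokenized
Field pathField = new StringField("id", id, Field.Store.YES);
doc.Add(pathField);

// Indexed and tokenized; GetContent() is the author's helper that returns
// the text to index (not shown in the article)
Field contentField = new TextField("bb", GetContent(), Field.Store.YES);
doc.Add(contentField);

Field dblField = new DoubleField("cc", 1000.12d, Field.Store.YES);
doc.Add(dblField);

// Stored only, as raw bytes
string s = "adfadfasfwerewre";
Field binaryField = new StoredField("bin", new BytesRef(Encoding.UTF8.GetBytes(s)));
doc.Add(binaryField);

// UpdateDocument deletes any document whose "id" matches the term and then
// adds this one, so a separate AddDocument call would store it twice.
writer.UpdateDocument(new Term("id", id), doc);

// Commit flushes the buffered documents and is relatively slow; choose the moment carefully
writer.Commit();
writer.Dispose();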

4. Here Comes the Pitfall

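Running the code ended in a stack overflow. I first noticed it around writer.Commit();, although the trace below shows it being triggered from IndexWriter.UpdateDocument, at the point where the tokenizer behind SmartChineseAnalyzer is initialized: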

 at ICU4N.Globalization.CultureInfoExtensions.ToUCultureInfo(System.Globalization.CultureInfo)
   at ICU4N.Globalization.UCultureInfo.GetCurrentCulture()
   at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, System.String, System.Reflection.Assembly, ICU4N.Impl.OpenType)
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, System.String, System.Reflection.Assembly, Boolean)
   at ICU4N.Util.UResourceBundle+<>c__DisplayClass25_0.<GetRootType>b__0(System.String)
   at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[ICU4N.Util.UResourceBundle+RootType, ICU4N, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b]].GetOrAdd(System.__Canon, System.Func`2<System.__Canon,RootType>)
   at ICU4N.Util.UResourceBundle.GetRootType(System.String, System.Reflection.Assembly)
   at ICU4N.Util.UResourceBundle.InstantiateBundle(System.String, System.String, System.Reflection.Assembly, Boolean)
   at ICU4N.Util.UResourceBundle.GetBundleInstance(System.String, System.String, System.Reflection.Assembly, Boolean)
   at ICU4N.Util.UResourceBundle.GetBundleInstance(System.String, System.String)
   at ICU4N.Impl.ICUResourceBundle.GetBundle(ICU4N.Impl.ICUResourceBundleReader, System.String, System.String, System.Reflection.Assembly)
   at ICU4N.Impl.ICUResourceBundle.CreateBundle(System.String, System.String, System.Reflection.Assembly)
   at ICU4N.Impl.ICUResourceBundle+<>c__DisplayClass59_0.<InstantiateBundle>b__0(System.String)
   at ICU4N.Impl.SoftCache`2+<>c__DisplayClass1_0[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].<GetOrCreate>b__0(System.__Canon)
   at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].GetOrAdd(System.__Canon, System.Func`2<System.__Canon,System.__Canon>)
   at ICU4N.Impl.SoftCache`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].GetOrCreate(System.__Canon, System.Func`2<System.__Canon,System.__Canon>)
   at ICU4N.Impl.ICUResourceBundle.InstantiateBundle(System.String, System.String, System.String, System.Reflection.Assembly, ICU4N.Impl.OpenType)
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, System.String, System.String, System.Reflection.Assembly, ICU4N.Impl.OpenType)
   at ICU4N.Globalization.UCultureInfo+DotNetLocaleHelper.GetDefaultCalendar(System.String)
   at ICU4N.Globalization.UCultureInfo+DotNetLocaleHelper.ToUCultureInfo(System.Globalization.CultureInfo)
   at ICU4N.Globalization.CultureInfoExtensions+<>c.<ToUCultureInfo>b__1_0(System.Globalization.CultureInfo)
   at J2N.Collections.Concurrent.Add2Info`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].CreateValue(System.__Canon, System.__Canon ByRef)
   at J2N.Collections.Concurrent.LurchTable`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].InternalInsert[[J2N.Collections.Concurrent.Add2Info`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], J2N, Version=2.0.0.0, Culture=neutral, PublicKeyToken=f39447d697a969af]](Int32, System.__Canon, Int32 ByRef, J2N.Collections.Concurrent.Add2Info`2<System.__Canon,System.__Canon> ByRef)
   at J2N.Collections.Concurrent.LurchTable`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Insert[[J2N.Collections.Concurrent.Add2Info`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], J2N, Version=2.0.0.0, Culture=neutral, PublicKeyToken=f39447d697a969af]](System.__Canon, J2N.Collections.Concurrent.Add2Info`2<System.__Canon,System.__Canon> ByRef)
   at J2N.Collections.Concurrent.LurchTable`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].GetOrAdd(System.__Canon, System.Func`2<System.__Canon,System.__Canon>)
   at ICU4N.Globalization.CultureInfoExtensions.ToUCultureInfo(System.Globalization.CultureInfo)
   at ICU4N.Globalization.UCultureInfo.GetCurrentCulture()
   at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, System.String, System.Reflection.Assembly, ICU4N.Impl.OpenType)
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, System.String, System.Reflection.Assembly, Boolean)
   at ICU4N.Util.UResourceBundle+<>c__DisplayClass25_0.<GetRootType>b__0(System.String)
   at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[ICU4N.Util.UResourceBundle+RootType, ICU4N, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b]].GetOrAdd(System.__Canon, System.Func`2<System.__Canon,RootType>)
   at ICU4N.Util.UResourceBundle.GetRootType(System.String, System.Reflection.Assembly)
   at ICU4N.Util.UResourceBundle.InstantiateBundle(System.String, System.String, System.Reflection.Assembly, Boolean)
   at ICU4N.Util.UResourceBundle.GetBundleInstance(System.String, System.String, System.Reflection.Assembly, Boolean)
   at ICU4N.Util.UResourceBundle.GetBundleInstance(System.String, System.String)
   at ICU4N.Impl.ICUResourceBundle.GetBundle(ICU4N.Impl.ICUResourceBundleReader, System.String, System.String, System.Reflection.Assembly)
   at ICU4N.Impl.ICUResourceBundle.CreateBundle(System.String, System.String, System.Reflection.Assembly)
   at ICU4N.Impl.ICUResourceBundle+<>c__DisplayClass59_0.<InstantiateBundle>b__0(System.String)
   at ICU4N.Impl.SoftCache`2+<>c__DisplayClass1_0[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].<GetOrCreate>b__0(System.__Canon)
   at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].GetOrAdd(System.__Canon, System.Func`2<System.__Canon,System.__Canon>)
   at ICU4N.Impl.SoftCache`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].GetOrCreate(System.__Canon, System.Func`2<System.__Canon,System.__Canon>)
   at ICU4N.Impl.ICUResourceBundle.InstantiateBundle(System.String, System.String, System.String, System.Reflection.Assembly, ICU4N.Impl.OpenType)
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, System.String, System.String, System.Reflection.Assembly, ICU4N.Impl.OpenType)
   at ICU4N.Globalization.UCultureInfo+DotNetLocaleHelper.GetDefaultCalendar(System.String)
   at ICU4N.Globalization.UCultureInfo+DotNetLocaleHelper.ToUCultureInfo(System.Globalization.CultureInfo)
   at ICU4N.Globalization.CultureInfoExtensions+<>c.<ToUCultureInfo>b__1_0(System.Globalization.CultureInfo)
   at J2N.Collections.Concurrent.Add2Info`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].CreateValue(System.__Canon, System.__Canon ByRef)
   at J2N.Collections.Concurrent.LurchTable`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].InternalInsert[[J2N.Collections.Concurrent.Add2Info`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], J2N, Version=2.0.0.0, Culture=neutral, PublicKeyToken=f39447d697a969af]](Int32, System.__Canon, Int32 ByRef, J2N.Collections.Concurrent.Add2Info`2<System.__Canon,System.__Canon> ByRef)
   at J2N.Collections.Concurrent.LurchTable`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Insert[[J2N.Collections.Concurrent.Add2Info`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], J2N, Version=2.0.0.0, Culture=neutral, PublicKeyToken=f39447d697a969af]](System.__Canon, J2N.Collections.Concurrent.Add2Info`2<System.__Canon,System.__Canon> ByRef)
   at J2N.Collections.Concurrent.LurchTable`2[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].GetOrAdd(System.__Canon, System.Func`2<System.__Canon,System.__Canon>)
   at ICU4N.Globalization.CultureInfoExtensions.ToUCultureInfo(System.Globalization.CultureInfo)
   at ICU4N.Globalization.UCultureInfo.GetCurrentCulture()
   at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, System.String, System.Reflection.Assembly, ICU4N.Impl.OpenType)
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, ICU4N.Globalization.UCultureInfo, System.Reflection.Assembly, ICU4N.Impl.OpenType)
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(System.String, ICU4N.Globalization.UCultureInfo, ICU4N.Impl.OpenType)
   at ICU4N.Text.BreakIteratorFactory.CreateBreakInstance(ICU4N.Globalization.UCultureInfo, Int32)
   at ICU4N.Text.BreakIteratorFactory.CreateBreakIterator(ICU4N.Globalization.UCultureInfo, Int32)
   at ICU4N.Text.BreakIterator.GetBreakInstance(ICU4N.Globalization.UCultureInfo, Int32)
   at ICU4N.Text.BreakIterator.GetSentenceInstance(System.Globalization.CultureInfo)
   at Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizer..cctor()
   at Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizer..ctor(AttributeFactory, System.IO.TextReader)
   at Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizer..ctor(System.IO.TextReader)
   at Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.CreateComponents(System.String, System.IO.TextReader)
   at Lucene.Net.Analysis.Analyzer.GetTokenStream(System.String, System.IO.TextReader)
   at Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer)
   at Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IIndexableField[], Int32)
   at Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Builder)
   at Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable`1<Lucene.Net.Index.IIndexableField>, Lucene.Net.Analysis.Analyzer, Lucene.Net.Index.Term)
   at Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable`1<Lucene.Net.Index.IIndexableField>, Lucene.Net.Analysis.Analyzer, Lucene.Net.Index.Term)
   at Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term, System.Collections.Generic.IEnumerable`1<Lucene.Net.Index.IIndexableField>, Lucene.Net.Analysis.Analyzer)
   at Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term, System.Collections.Generic.IEnumerable`1<Lucene.Net.Index.IIndexableField>)

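At first I assumed the .NET Core environment was not set to Chinese, since the trace is full of globalization functions. I checked the current culture with the following code: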

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
CultureInfo culture1 = CultureInfo.CurrentCulture;
CultureInfo culture2 = Thread.CurrentThread.CurrentCulture;
Console.WriteLine("The current culture is {0}", culture1.Name);
Console.WriteLine("The two CultureInfo objects are equal: {0}",
                  culture1 == culture2);

The output was already a perfect zh-CN, yet the exception persisted. Why? Could the problem be in the ICU4N library itself?

I looked up the ICU4N library on GitHub and finally found the issue: Getting UCultureInfo.CurrentCulture will throw a StackOverflowException if the current culture is any of the following: zh-CN, zh-HK, zh-MO, zh-SG, zh-TW.
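
An application can check up front whether it is running under one of those cultures. The sketch below uses only the .NET base class library and the five culture names quoted in the issue text above; `Icu4nCultureCheck` and `IsAffected` are hypothetical names introduced here, and the list would need revisiting if the issue is fixed or extended upstream.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class Icu4nCultureCheck
{
    // Culture names copied from the issue text; an assumption that this list is complete.
    private static readonly string[] AffectedCultures =
        { "zh-CN", "zh-HK", "zh-MO", "zh-SG", "zh-TW" };

    // True when the given culture is one of those named in the issue.
    public static bool IsAffected(CultureInfo culture) =>
        AffectedCultures.Contains(culture.Name, StringComparer.OrdinalIgnoreCase);
}
```

A call such as `Icu4nCultureCheck.IsAffected(CultureInfo.CurrentCulture)` at startup lets the program log a warning or switch cultures before any Lucene analyzer is created.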

Good. If you have time, pull the code and help fix the bug. Now that the cause is known, let's work around it first.

Add the following before the indexing code, press F5, and everything works.

//https://github.com/NightOwl888/ICU4N/issues/29
Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("en-us");

And with that, one pitfall is behind us!
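
Changing the thread culture for the whole program may affect number and date formatting elsewhere. One possible refinement is to switch the culture only around the Lucene calls and restore it afterwards. This is a sketch under assumptions: `CultureScope` is a hypothetical helper, and the article does not establish whether ICU4N reads the current culture again later, so every Lucene call that involves the analyzer (indexing and searching) should stay inside the scope.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only: sets the current culture
// on construction and restores the previous one on Dispose.
public sealed class CultureScope : IDisposable
{
    private readonly CultureInfo _previous;

    public CultureScope(string cultureName)
    {
        _previous = CultureInfo.CurrentCulture;
        CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo(cultureName);
    }

    public void Dispose()
    {
        CultureInfo.CurrentCulture = _previous;
    }
}
```

Usage would look like `using (new CultureScope("en-US")) { /* build the index, run searches */ }`. CultureInfo.CurrentCulture applies to the current thread, so work handed to other threads needs its own scope.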

5. Search Results

Searching went fairly smoothly. The code is as follows:

// Query the "bb" field (requires: using Lucene.Net.Search;)
var a = "bb";
using IndexReader reader = DirectoryReader.Open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
BooleanQuery mp = new BooleanQuery();
// TermQuery is not analyzed: "服装" must match an indexed token exactly
mp.Add(new TermQuery(new Term(a, "服装")), Occur.MUST);
// Fetch the top 10 hits
var r = searcher.Search(mp, 10);

if (r.ScoreDocs.Length == 0)
{
    Console.WriteLine("No hits.");
}
else
{
    // Read the stored binary field of the first hit
    var b = r.ScoreDocs[0];
    var docRst = reader.Document(b.Doc);
    var f = docRst.GetBinaryValue("bin");
    // A BytesRef may be a slice of a larger array, so honor Offset and Length
    var fdoc = Encoding.UTF8.GetString(f.Bytes, f.Offset, f.Length);
    Console.WriteLine(fdoc);
}
Console.WriteLine("Hello World!");
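
A TermQuery bypasses the analyzer, so the search term has to match an indexed token exactly. An alternative is to let QueryParser run the query text through the same analyzer that was used for indexing. The following is a minimal sketch, assuming the Lucene.Net.QueryParser 4.8.0 package is referenced and that documents store an "id" field as in the indexing example; `SearchDemo` and `SearchIds` are hypothetical names introduced here.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Hypothetical helper, for illustration only.
public static class SearchDemo
{
    // Parses queryText with the given analyzer and returns the stored "id"
    // value of up to maxHits matching documents, best score first.
    public static List<string> SearchIds(
        Lucene.Net.Store.Directory dir, Analyzer analyzer,
        string field, string queryText, int maxHits)
    {
        var ids = new List<string>();
        using (DirectoryReader reader = DirectoryReader.Open(dir))
        {
            var searcher = new IndexSearcher(reader);
            var parser = new QueryParser(LuceneVersion.LUCENE_48, field, analyzer);
            Query query = parser.Parse(queryText);   // throws ParseException on bad syntax
            TopDocs top = searcher.Search(query, maxHits);
            foreach (ScoreDoc hit in top.ScoreDocs)  // empty array when nothing matches
            {
                ids.Add(searcher.Doc(hit.Doc).Get("id"));
            }
        }
        return ids;
    }
}
```

Iterating over `ScoreDocs` instead of indexing into it avoids an out-of-range error when the query returns no hits.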

6. Summary

The simple example now works. If you need full-text search, building further on this should not be a problem.