.NET CLI: dotnet add package JiebaNet.Segmenter.Net6 --version 6.42.2
Package Manager: NuGet\Install-Package JiebaNet.Segmenter.Net6 -Version 6.42.2
PackageReference: <PackageReference Include="JiebaNet.Segmenter.Net6" Version="6.42.2" />
Central Package Management (Directory.Packages.props): <PackageVersion Include="JiebaNet.Segmenter.Net6" Version="6.42.2" />
Central Package Management (project file): <PackageReference Include="JiebaNet.Segmenter.Net6" />
Paket CLI: paket add JiebaNet.Segmenter.Net6 --version 6.42.2
Script & Interactive: #r "nuget: JiebaNet.Segmenter.Net6, 6.42.2"
File-based apps: #:package JiebaNet.Segmenter.Net6@6.42.2
Cake Addin: #addin nuget:?package=JiebaNet.Segmenter.Net6&version=6.42.2
Cake Tool: #tool nuget:?package=JiebaNet.Segmenter.Net6&version=6.42.2
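jieba.NET is the .NET version (a C# implementation) of the jieba Chinese word segmentation library.
The current version is 0.42.2, based on jieba 0.42. It provides functionality and interfaces that are largely consistent with jieba, but it does not support jieba's newest paddle mode. For the ideas behind jieba's implementation, see the materials mentioned in this wiki.
It also provides KeywordProcessor, implemented with reference to FlashText. KeywordProcessor extracts dictionary keywords from text more flexibly, for example by ignoring case or by matching words that contain spaces.
If you run into segmentation-related requirements or difficulties during development, please submit an Issue. I see u:)
The current version supports net40, net45 and netstandard2.0. You can reference the project manually or add a reference through NuGet: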
PM> Install-Package jieba.NET
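After installation, the packages\jieba.NET directory contains a Resources directory, which holds the dictionaries and other data files that jieba.NET needs at run time. The simplest setup is to copy the whole Resources directory into the directory of the assembly, in which case jieba.NET uses its built-in default configuration values. To keep these files somewhere else, add the following setting to app.config or web.config: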
<appSettings>
<add key="JiebaConfigFileDir" value="C:\jiebanet\config" />
</appSettings>
Note that this path may be absolute or relative. If it is relative, jieba.NET assumes it is relative to the BaseDirectory of the current application domain.
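To make the relative-path rule concrete, here is a minimal sketch of how such a path could be resolved. `ConfigPathSketch` and `Resolve` are hypothetical names used only for illustration; jieba.NET's internal resolution code is not shown in this document and may differ in detail.

```csharp
using System;
using System.IO;

public static class ConfigPathSketch
{
    // Hypothetical helper, not part of jieba.NET.
    // Rooted (absolute) paths are returned unchanged; relative paths are
    // resolved against the supplied base directory.
    public static string Resolve(string configDir, string baseDir)
    {
        return Path.IsPathRooted(configDir)
            ? configDir
            : Path.GetFullPath(Path.Combine(baseDir, configDir));
    }

    // Convenience overload that uses the current application domain's
    // BaseDirectory, as described in the text above.
    public static string Resolve(string configDir)
    {
        return Resolve(configDir, AppDomain.CurrentDomain.BaseDirectory);
    }
}
```

Calling `ConfigPathSketch.Resolve("Resources")` would then yield the Resources folder next to the application's binaries, which is also the location the default configuration expects.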
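If for some reason it is inconvenient to configure this through the application's config file, you can set the directory in code. Do this before using any segmentation functionality, and preferably use an absolute path, for example: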
JiebaNet.Segmenter.ConfigManager.ConfigFileBaseDir = @"C:\jiebanet\config";
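The JiebaSegmenter.Cut method takes three parameters: text is the string to segment, cutAll specifies whether to use full mode, and hmm specifies whether to use the HMM model to segment out-of-vocabulary words. It returns IEnumerable<string>.
The JiebaSegmenter.CutForSearch method takes two parameters: text is the string to segment, and hmm specifies whether to use the HMM model. It returns IEnumerable<string>.
Code example: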
var segmenter = new JiebaSegmenter();
var segments = segmenter.Cut("我来到北京清华大学", cutAll: true);
Console.WriteLine("[Full mode]: {0}", string.Join("/ ", segments));
segments = segmenter.Cut("我来到北京清华大学"); // accurate mode is the default
Console.WriteLine("[Accurate mode]: {0}", string.Join("/ ", segments));
segments = segmenter.Cut("他来到了网易杭研大厦"); // accurate mode by default, with the HMM model enabled
Console.WriteLine("[New word recognition]: {0}", string.Join("/ ", segments));
segments = segmenter.CutForSearch("小明硕士毕业于中国科学院计算所,后在日本京都大学深造"); // search engine mode
Console.WriteLine("[Search engine mode]: {0}", string.Join("/ ", segments));
segments = segmenter.Cut("结过婚的和尚未结过婚的");
Console.WriteLine("[Disambiguation]: {0}", string.Join("/ ", segments));
Output:
[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Accurate mode]: 我/ 来到/ 北京/ 清华大学
[New word recognition]: 他/ 来到/ 了/ 网易/ 杭研/ 大厦
[Search engine mode]: 小明/ 硕士/ 毕业/ 于/ 中国/ 科学/ 学院/ 科学院/ 中国科学院/ 计算/ 计算所/ ,/ 后/ 在/ 日本/ 京都/ 大学/ 日本京都大学/ 深造
[Disambiguation]: 结过婚/ 的/ 和/ 尚未/ 结过婚/ 的
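A user dictionary can be loaded with JiebaSegmenter.LoadUserDict("user_dict_file_path"). Each line holds a word, optionally followed by a frequency and a part-of-speech tag, separated by spaces, for example: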
创新办 3 i
云计算 5
凱特琳 nz
台中
机器学习 3
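The sample entries above show that both the frequency and the tag can be left out. As an illustration of that layout, the following sketch parses one dictionary line into its parts. `UserDictLineSketch` is a hypothetical helper written for this example; it is an assumption about the format based on the sample lines, not jieba.NET's actual loader.

```csharp
using System;
using System.Globalization;

public static class UserDictLineSketch
{
    // Hypothetical parser for illustration only.
    // Accepts "word", "word freq", "word tag" or "word freq tag".
    public static (string Word, int? Freq, string Tag) Parse(string line)
    {
        var parts = line.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0)
        {
            throw new FormatException("Empty dictionary line.");
        }

        string word = parts[0];
        int? freq = null;
        string tag = null;

        if (parts.Length > 1)
        {
            // A purely numeric second field is treated as the frequency;
            // anything else is treated as the part-of-speech tag.
            if (int.TryParse(parts[1], NumberStyles.None, CultureInfo.InvariantCulture, out int n))
            {
                freq = n;
                if (parts.Length > 2)
                {
                    tag = parts[2];
                }
            }
            else
            {
                tag = parts[1];
            }
        }

        return (word, freq, tag);
    }
}
```

For instance, `UserDictLineSketch.Parse("凱特琳 nz")` yields the word with no frequency and the tag `nz`, while `Parse("云计算 5")` yields a frequency of 5 and no tag.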
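JiebaSegmenter.AddWord(word, freq=0, tag=null) adds a new word or adjusts the frequency of a known word. If freq is not a positive integer, an automatically computed frequency is used, which guarantees that the word can be segmented out.
JiebaSegmenter.DeleteWord(word) removes a word so that it can no longer be segmented out.
JiebaNet.Analyser.TfidfExtractor.ExtractTags(string text, int count = 20, IEnumerable<string> allowPos = null) extracts keywords from the given text.
JiebaNet.Analyser.TfidfExtractor.ExtractTagsWithWeight(string text, int count = 20, IEnumerable<string> allowPos = null) extracts keywords from the given text together with their weights.
JiebaNet.Analyser.TextRankExtractor has the same interface as TfidfExtractor. Note that by default TextRankExtractor extracts only nouns and verbs.
The JiebaNet.Segmenter.PosSeg.PosSegmenter class tags each word with its part of speech while segmenting.
var posSeg = new PosSegmenter();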
var s = "一团硕大无朋的高能离子云,在遥远而神秘的太空中迅疾地飘移";
var tokens = posSeg.Cut(s);
Console.WriteLine(string.Join(" ", tokens.Select(token => string.Format("{0}/{1}", token.Word, token.Flag))));
一团/m 硕大无朋/i 的/uj 高能/n 离子/n 云/ns ,/x 在/p 遥远/a 而/c 神秘/a 的/uj 太空/n 中/f 迅疾/z 地/uv 飘移/v
var segmenter = new JiebaSegmenter();
var s = "永和服装饰品有限公司";
var tokens = segmenter.Tokenize(s);
foreach (var token in tokens)
{
Console.WriteLine("word {0,-12} start: {1,-3} end: {2,-3}", token.Word, token.StartIndex, token.EndIndex);
}
word 永和 start: 0 end: 2
word 服装 start: 2 end: 4
word 饰品 start: 4 end: 6
word 有限公司 start: 6 end: 10
var segmenter = new JiebaSegmenter();
var s = "永和服装饰品有限公司";
var tokens = segmenter.Tokenize(s, TokenizerMode.Search);
foreach (var token in tokens)
{
Console.WriteLine("word {0,-12} start: {1,-3} end: {2,-3}", token.Word, token.StartIndex, token.EndIndex);
}
word 永和 start: 0 end: 2
word 服装 start: 2 end: 4
word 饰品 start: 4 end: 6
word 有限 start: 6 end: 8
word 公司 start: 8 end: 10
word 有限公司 start: 6 end: 10
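In the Tokenize output above, each end offset equals the start offset plus the word length, so the offsets form a half-open range. A minimal sketch of recovering a token's text from the original string under that reading follows; `TokenSpanSketch` is a hypothetical helper, not a jieba.NET type.

```csharp
using System;

public static class TokenSpanSketch
{
    // Hypothetical helper: extracts the token text from half-open
    // offsets [start, end) as printed in the Tokenize output.
    public static string Slice(string text, int start, int end)
    {
        return text.Substring(start, end - start);
    }
}
```

With the sample sentence, `TokenSpanSketch.Slice("永和服装饰品有限公司", 6, 10)` returns `有限公司`, matching the last token listed.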
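For parallel segmentation, use the following methods:
JiebaSegmenter.CutInParallel(), JiebaSegmenter.CutForSearchInParallel()
PosSegmenter.CutInParallel()
The jiebaForLuceneNet project provides a simple integration with Lucene.NET. For more information, see jiebaForLuceneNet.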
jieba also provides other dictionary files.
Building the Segmenter.Cli project produces jiebanet.exe. Its options and sample usages are as follows:
-f --file the file name (required).
-d --delimiter the delimiter between tokens, default: / .
-a --cut-all use cut_all mode.
-n --no-hmm don't use HMM.
-p --pos enable POS tagging.
-v --version show version info.
-h --help show help details.
sample usages:
$ jiebanet -f input.txt > output.txt
$ jiebanet -d "|" -f input.txt > output.txt
$ jiebanet -p -f input.txt > output.txt
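You can use the Counter class to count word frequencies. Its implementation is modeled on the Counter class in the Python standard library, although the interface and implementation details differ slightly. Typical usage looks like this: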
var s = "在数学和计算机科学之中,算法(algorithm)为任何良定义的具体计算步骤的一个序列,常用于计算、数据处理和自动推理。精确而言,算法是一个表示为有限长列表的有效方法。算法应包含清晰定义的指令用于计算函数。";
var seg = new JiebaSegmenter();
var freqs = new Counter<string>(seg.Cut(s));
foreach (var pair in freqs.MostCommon(5))
{
Console.WriteLine($"{pair.Key}: {pair.Value}");
}
Output:
的: 4
,: 3
算法: 3
计算: 3
。: 3
A Counter can be modified with the Add, Subtract and Union methods, and MostCommon then returns the most frequent words. See the test cases for detailed usage.
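For readers who want to see what such a frequency count amounts to, here is a library-free sketch that uses a plain dictionary and LINQ. `WordCountSketch` is a hypothetical stand-in written for this example, not jieba.NET's Counter; in particular, how Counter orders words with equal counts is not specified here, and this sketch simply keeps first-seen order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountSketch
{
    // Returns the n most frequent words with their counts, highest first.
    // Ties keep first-seen order (an assumption of this sketch).
    public static List<KeyValuePair<string, int>> MostCommon(IEnumerable<string> words, int n)
    {
        var counts = new Dictionary<string, int>();
        var firstSeen = new List<string>();

        foreach (var w in words)
        {
            if (counts.TryGetValue(w, out int c))
            {
                counts[w] = c + 1;
            }
            else
            {
                counts[w] = 1;
                firstSeen.Add(w);
            }
        }

        // LINQ's OrderByDescending is a stable sort, so equal counts
        // stay in the order in which the words were first seen.
        return firstSeen
            .OrderByDescending(w => counts[w])
            .Take(n)
            .Select(w => new KeyValuePair<string, int>(w, counts[w]))
            .ToList();
    }
}
```

Any sequence of strings can be passed in, including the result of a segmentation call.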
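KeywordProcessor extracts keywords from text, but it works differently from KeywordExtractor. KeywordProcessor simply finds known words in the text based on a dictionary, and nothing more.
The current jieba segmentation implementation cannot handle cases such as ignoring case or words that contain spaces, yet these are very common in text extraction applications. KeywordProcessor is therefore meant mainly for extraction rather than segmentation, although its methods can be used to build another dictionary-based segmentation mode.
Code example: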
var kp = new KeywordProcessor();
kp.AddKeywords(new []{".NET Core", "Java", "C语言", "字典 tree", "CET-4", "网络 编程"});
var keywords = kp.ExtractKeywords("你需要通过cet-4考试,学习c语言、.NET core、网络 编程、JavaScript,掌握字典 tree的用法");
// keywords is:
// new List<string> { "CET-4", "C语言", ".NET Core", "网络 编程", "字典 tree"}
// The results match the keywords as they were added, not necessarily the words as they appear in the input sentence.
// To return the original words found in the sentence, use the `raw` parameter.
var rawKeywords = kp.ExtractKeywords("你需要通过cet-4考试,学习c语言、.NET core、网络 编程、JavaScript,掌握字典 tree的用法", raw: true);
// rawKeywords is:
// new List<string> { "cet-4", "c语言", ".NET core", "网络 编程", "字典 tree"}
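To illustrate the idea of dictionary-based, case-insensitive extraction, here is a simplified, library-free sketch. It is not KeywordProcessor's implementation: FlashText is built on a trie, whereas this sketch uses a case-insensitive hash map with a longest-match scan for brevity, and its word-boundary rule (ASCII letters, digits and underscore only) is an assumption. `SimpleKeywordMatcher` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public class SimpleKeywordMatcher
{
    // Maps each keyword, compared case-insensitively, to its canonical form.
    private readonly Dictionary<string, string> _keywords =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    private int _maxLength;

    public void Add(string keyword)
    {
        _keywords[keyword] = keyword;
        _maxLength = Math.Max(_maxLength, keyword.Length);
    }

    // Assumed boundary rule: a match may not touch an ASCII word character.
    private static bool IsAsciiWordChar(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
    }

    // Scans left to right, taking the longest keyword at each position.
    // With raw = true the text as written in the input is returned.
    public List<string> Extract(string text, bool raw = false)
    {
        var found = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            int matched = 0;
            bool boundaryBefore = i == 0 || !IsAsciiWordChar(text[i - 1]);
            if (boundaryBefore)
            {
                for (int len = Math.Min(_maxLength, text.Length - i); len > 0; len--)
                {
                    bool boundaryAfter = i + len == text.Length
                        || !IsAsciiWordChar(text[i + len]);
                    string candidate = text.Substring(i, len);
                    if (boundaryAfter && _keywords.TryGetValue(candidate, out string canonical))
                    {
                        found.Add(raw ? candidate : canonical);
                        matched = len;
                        break;
                    }
                }
            }
            // Skip past a match, otherwise advance one character.
            i += matched > 0 ? matched : 1;
        }
        return found;
    }
}
```

Because of the boundary rule, `Java` is not reported inside `JavaScript`, which mirrors the behavior shown in the example above. A trie would avoid the repeated substring lookups and is the better choice for large dictionaries.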
| Product | Versions (compatible and additional computed target framework versions) |
|---|---|
| .NET | Compatible: net6.0. Computed: net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows. |
Showing the top 1 NuGet packages that depend on JiebaNet.Segmenter.Net6:
| Package | Downloads |
|---|---|
| JiebaNet.Analyser.Net6: JiebaNet.Analyser (.NET 6.0 version) | |
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 6.42.2 | 21,960 | 7/2/2024 |
Recompiled for .NET 6.0.