Forward Maximum Matching for Chinese Word Segmentation

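Resources for beginners:

Books:

The Beauty of Mathematics (《数学之美》) by Wu Jun

Natural Language Processing (《自然语言处理》) by Zong Chengqing

Statistical Learning Methods (《统计学习方法》) by Li Hang

Introduction to Machine Learning (《机器学习导论》)

Online resources:

Andrew Ng's Machine Learning course

Recommended by a classmate:

Course slides from an instructor at Soochow University. The exercises at the end make them well worth working through.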

Forward Maximum Matching

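The overall idea is simple, so I won't go over it again here. I trust that everyone reading this post knows there is such a thing as a search engine! We all understand the principle, but how do you actually implement it? That part still feels hazy. Don't panic. Thinking it through, the algorithm first needs a corpus. The instructor provided a file called data.conll.txt, and I had no idea what it was, so I opened it. It looks like this: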

1	戴相龙	_	NR	_	_	2	VMOD	_	_
2	说	_	VV	_	_	0	ROOT	_	_
3	中国	_	NR	_	_	5	NMOD	_	_
4	经济	_	NN	_	_	5	NMOD	_	_
5	发展	_	NN	_	_	8	SUB	_	_
6	为	_	P	_	_	8	VMOD	_	_
7	亚洲	_	NR	_	_	6	PMOD	_	_
8	作出	_	VV	_	_	2	VMOD	_	_
9	积极	_	JJ	_	_	10	NMOD	_	_
10	贡献	_	NN	_	_	8	OBJ	_	_

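I was still a bit lost and not really sure what I was looking at, so I searched for ".conll". It turns out to be the file extension used for dependency corpora: treebank files usually end in conll simply to mark them as corpus files. While I'm at it, here is a handy tool: Dependency Viewer, which turns out to be the work of a Nanjing University senior I have never met. The dependency tree it draws looks like this: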
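To make the format a little more concrete, here is a rough sketch of reading one such line. It assumes the file follows the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), which is my reading of the sample above and should be checked against the actual file. The class `ConllToken` and its field names are made up for this example.

```java
// Hypothetical helper for illustration only; the column positions are an assumption.
class ConllToken {
    final int id;        // 1-based position of the word in the sentence
    final String form;   // the word itself (column 2)
    final String pos;    // part-of-speech tag (assumed to be column 4)
    final int head;      // id of the head word, 0 for the root (column 7)
    final String deprel; // dependency relation to the head (column 8)

    ConllToken(int id, String form, String pos, int head, String deprel) {
        this.id = id;
        this.form = form;
        this.pos = pos;
        this.head = head;
        this.deprel = deprel;
    }

    // Parse one non-empty, tab-separated line into a token.
    static ConllToken parse(String line) {
        String[] f = line.split("\t");
        if (f.length < 8) {
            throw new IllegalArgumentException("expected at least 8 columns: " + line);
        }
        return new ConllToken(Integer.parseInt(f[0]), f[1], f[3],
                              Integer.parseInt(f[6]), f[7]);
    }

    public static void main(String[] args) {
        String[] lines = {
            "1\t戴相龙\t_\tNR\t_\t_\t2\tVMOD\t_\t_",
            "2\t说\t_\tVV\t_\t_\t0\tROOT\t_\t_"
        };
        for (String line : lines) {
            ConllToken t = parse(line);
            System.out.println(t.form + "/" + t.pos + " -> head " + t.head + " (" + t.deprel + ")");
        }
    }
}
```

For building a dictionary only the `form` column matters; the head and relation columns are what a tool like Dependency Viewer uses to draw the tree.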

[Figure: dependency tree of the sample sentence, rendered with Dependency Viewer]

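The figure makes it fairly easy to see what a dependency tree is. It is a bit like understanding how the sentence was generated, although what we care about more is the relationships between words.

Okay, with that out of the way, the first task is to build our own dictionary file from data.conll.txt and write it to word.dict. After a little thought, building the dictionary roughly means extracting the words from the conll file above, so the real question is how to build the index. Indexing Chinese does not look that easy. The obvious idea is to index by pinyin, but then you have to look up the pinyin of every character each time. Thinking about it further, indexing directly by the first character of each word works just fine.

The overall approach is:

Collect the first characters, group the words that share the same first character, and write the groups to a file. The data structure is simply a hash table. Its drawback is that it uses a lot of memory, so it is not well suited to very large data sets.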
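The first-character index described above can be sketched in a few lines. This is only an illustration under my own assumptions: the class name `FirstCharIndex` and its methods are hypothetical, and the nested map is one straightforward way to realize the "hash table keyed by first character" idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: first character -> (word -> frequency).
class FirstCharIndex {
    private final Map<Character, Map<String, Integer>> buckets = new HashMap<>();

    // Add one occurrence of a word to the bucket of its first character.
    void add(String word) {
        if (word.isEmpty()) return;
        buckets.computeIfAbsent(word.charAt(0), k -> new HashMap<>())
               .merge(word, 1, Integer::sum);
    }

    // A lookup only has to search the bucket for the word's first character.
    boolean contains(String word) {
        if (word.isEmpty()) return false;
        Map<String, Integer> bucket = buckets.get(word.charAt(0));
        return bucket != null && bucket.containsKey(word);
    }

    // How many times the word was added, 0 if it was never seen.
    int count(String word) {
        if (word.isEmpty()) return 0;
        Map<String, Integer> bucket = buckets.get(word.charAt(0));
        return bucket == null ? 0 : bucket.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        FirstCharIndex index = new FirstCharIndex();
        for (String w : new String[] {"中国", "经济", "中国", "中间"}) {
            index.add(w);
        }
        System.out.println(index.contains("中间")); // true
        System.out.println(index.count("中国"));    // 2
        System.out.println(index.contains("发展")); // false
    }
}
```

Every word is stored in full inside its bucket, which is where the memory cost mentioned above comes from.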

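Below is my solution to the exercise from the Soochow University course, which segments text using forward maximum matching. The data.conll.txt file used here can be downloaded from the instructor's website.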

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class Word {

    // Assumes the corpus is UTF-8; change this if your copy uses another encoding.
    private static final Charset CHARSET = StandardCharsets.UTF_8;

    // Longest candidate word (in characters) tried at each position.
    private static final int MAX_WORD_LEN = 4;

    // First character -> (word -> frequency).
    private Map<String, Map<String, Integer>> dictionary = new HashMap<>();

    // Read the CoNLL file and build the dictionary.
    public void readFile(String fileName) {
        dictionary = new HashMap<>();
        int iteratorTime = 0; // number of tokens read

        try (BufferedReader br = Files.newBufferedReader(Paths.get(fileName), CHARSET)) {
            String s;
            while ((s = br.readLine()) != null) {
                if (s.length() == 0)
                    continue; // a blank line marks a sentence boundary
                String[] parseStr = s.split("\t");
                if (parseStr.length < 2 || parseStr[1].isEmpty())
                    continue; // skip malformed lines
                iteratorTime++;

                String sp = parseStr[1];         // the second column holds the word
                String key = sp.substring(0, 1); // index by first character
                dictionary.computeIfAbsent(key, k -> new HashMap<>())
                          .merge(sp, 1, Integer::sum);
            }
        } catch (IOException e) {
            System.err.printf("Error in reading from file %s%n", fileName);
            e.printStackTrace();
        }

        System.out.println(iteratorTime);
    }

    // Write the dictionary to a file. The format is up to you; here each first
    // character gets a header line followed by a line of "word,count" pairs.
    public void outPutFile(String outFileName) {
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(outFileName), CHARSET)) {
            for (Map.Entry<String, Map<String, Integer>> entry : dictionary.entrySet()) {
                StringBuilder message = new StringBuilder(entry.getKey()).append(": \n");
                for (Map.Entry<String, Integer> w : entry.getValue().entrySet()) {
                    message.append("\t").append(w.getKey()).append(",").append(w.getValue());
                }
                message.append("\n");

                // print, not printf: a word containing '%' must not be read as a format string
                System.out.print(message);
                out.write(message.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Write the raw text, one sentence per line, with the words joined back together.
    public void getDataTxt(String fileName, String outFileName) {
        try (BufferedReader br = Files.newBufferedReader(Paths.get(fileName), CHARSET);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(outFileName), CHARSET)) {
            String s;
            while ((s = br.readLine()) != null) {
                if (s.length() == 0) {
                    out.write("\n"); // end of sentence
                    continue;
                }
                String[] parseStr = s.split("\t");
                if (parseStr.length > 1)
                    out.write(parseStr[1]);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Segment the text in fileName using the dictionary built by readFile.
    // Output: one word per line, with a blank line after each sentence.
    public void max_Macth(String fileName, String outFileName) {
        try (BufferedReader br = Files.newBufferedReader(Paths.get(fileName), CHARSET);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(outFileName), CHARSET)) {
            String s;
            while ((s = br.readLine()) != null) {
                // --------------- forward maximum matching ---------------
                int sp = 0; // start of the current window
                while (sp < s.length()) {
                    // 1. Start with the widest window allowed.
                    int rp = Math.min(sp + MAX_WORD_LEN, s.length());
                    // 2. Shrink it from the right until it is a dictionary word
                    //    or only a single character is left.
                    while (rp > sp + 1 && IsIn(s.substring(sp, rp)) == null)
                        rp--;
                    // 3. Emit the match (or the lone character) and move on.
                    out.write(s.substring(sp, rp));
                    out.write("\n");
                    sp = rp;
                }
                out.write("\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Return str if it is in the dictionary, otherwise null.
    public String IsIn(String str) {
        Map<String, Integer> bucket = dictionary.get(str.substring(0, 1));
        return (bucket != null && bucket.containsKey(str)) ? str : null;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();

        Word word = new Word();
        word.readFile("data.conll.txt");
        word.outPutFile("word.dict.txt");
        word.getDataTxt("data.conll.txt", "data.txt");

        word.max_Macth("data.txt", "data.out.txt");
        long end = System.currentTimeMillis();

        System.out.printf("%n this task spend %d ms%n", (end - start));
    }

}
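Since the matching loop is easy to lose among the file handling, here is a stripped-down sketch of the same idea that runs on an in-memory dictionary. The class `ForwardMaxMatchDemo`, the flat `Set<String>` dictionary and the `maxLen` parameter are assumptions made for this illustration, not part of the exercise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo of forward maximum matching on a tiny in-memory dictionary.
class ForwardMaxMatchDemo {

    // At each position take the longest dictionary word of at most maxLen
    // characters; if nothing matches, emit a single character and move on.
    static List<String> segment(String sentence, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < sentence.length()) {
            int end = Math.min(start + maxLen, sentence.length());
            while (end > start + 1 && !dict.contains(sentence.substring(start, end))) {
                end--; // shrink the window from the right
            }
            words.add(sentence.substring(start, end));
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
            "戴相龙", "说", "中国", "经济", "发展", "为", "亚洲", "作出", "积极", "贡献"));
        System.out.println(segment("戴相龙说中国经济发展为亚洲作出积极贡献", dict, 4));
        // prints [戴相龙, 说, 中国, 经济, 发展, 为, 亚洲, 作出, 积极, 贡献]
    }
}
```

The matching is greedy: the longest dictionary word at each position always wins, even when a shorter split would read better, and any dictionary word longer than `maxLen` can never be matched.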

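For building the dictionary, another commonly used storage structure is the trie, which can be implemented with arrays or with pointers. Its main advantage is faster insertion and lookup, but it gets there by trading space for time. I'll cover it when I have time.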
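As a rough preview, a trie for this task might look like the sketch below. It is not the array or pointer layout mentioned above: for brevity each node keeps its children in a `HashMap`, and the class `TrieSketch` with its `longestMatch` method is hypothetical. The point it illustrates is that a single walk from a start position finds the longest dictionary word there, which is exactly the question forward maximum matching asks.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trie sketch; children are stored in a hash map per node.
class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root to here spells a word
    }

    private final Node root = new Node();

    // Insert a word character by character, creating nodes as needed.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // Exact lookup: follow the path and check the end-of-word flag.
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) return false;
        }
        return node.isWord;
    }

    // Length of the longest dictionary word starting at text[start], 0 if none.
    int longestMatch(String text, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) break;
            if (node.isWord) best = i - start + 1;
        }
        return best;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        for (String w : new String[] {"研究", "研究生", "生命"}) {
            trie.insert(w);
        }
        System.out.println(trie.longestMatch("研究生命", 0)); // 3, matching 研究生
        System.out.println(trie.contains("研"));              // false, only a prefix
    }
}
```

Compared with the first-character hash table, the walk stops as soon as no dictionary word can continue, so there is no need to try every window length from the maximum down.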
