Counting the Words in a File with Single-Threaded and Multi-Threaded Code

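Abstract

$\quad$ This post counts the words in a single file, first with one thread and then with several threads, and finally uses several threads to count the words across a list of files.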

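1. Introduction

$\quad$ Working through this example is a way to get familiar with the concept and use of thread pools in Java, and to reinforce the idea of a main thread cooperating with worker threads.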

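2. Single-threaded counting for a single file

$\quad$ To keep things simple, assume the words in the file are separated only by spaces, with no delimiters such as ? , ; between them. The file content can then be split on whitespace, for example with String.split(" ") or, as the code here does, with a StringTokenizer. Each resulting token is put into a HashMap together with its running count.
The code is as follows: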

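import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Counts how often each whitespace-separated word occurs in the file.
public static Map<String, Integer> count(File file) throws IOException {
    Map<String, Integer> numberOfWord = new HashMap<>();
    StringTokenizer stringTokenizer = dealWithStringFromFile(file);
    while (stringTokenizer.hasMoreTokens()) {
        String aWord = stringTokenizer.nextToken();
        numberOfWord.put(aWord, numberOfWord.getOrDefault(aWord, 0) + 1);
    }
    return numberOfWord;
}

// Reads the whole file and returns a tokenizer over its words.
public static StringTokenizer dealWithStringFromFile(File file) throws IOException {
    List<String> stringFromFile = Files.readAllLines(file.toPath());
    // List.toString() renders as "[line1, line2, ...]", so the brackets and
    // commas are replaced with spaces. This also strips any '[', ']' or ','
    // that occurs in the file itself.
    return new StringTokenizer(stringFromFile.toString()
            .replace("[", " ")
            .replace("]", " ")
            .replace(",", " "));
}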

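$\quad$ There are just two methods. One puts the words from the processed file content into a HashMap; the other replaces the [ ] and , characters introduced by toString with spaces. The code feels clumsy, and there is surely a cleaner way to write it.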
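One possible cleanup, as a minimal sketch rather than the definitive version: iterate over the lines directly instead of going through List.toString(), so no bracket or comma replacement is needed. The class name SimpleWordCount and the helper countWords are made up for illustration; Map.merge and split("\\s+") are swapped in for the getOrDefault/put pair and the StringTokenizer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
class SimpleWordCount {
    // Counts whitespace-separated words in the given lines.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // "\\s+" treats a run of spaces or tabs as a single separator
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) { // a blank line yields one empty token
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Convenience wrapper that reads the file first.
    static Map<String, Integer> count(Path file) throws IOException {
        return countWords(Files.readAllLines(file));
    }
}
```

Keeping the counting logic separate from the file reading also makes it easy to try out on an in-memory list of lines.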

3. Multi-threaded counting for a single file

The code is as follows:

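import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class WordCount {
    private final int threadNum;
    private final ExecutorService threadPool;

    public WordCount(int threadNum) {
        this.threadNum = threadNum;
        threadPool = Executors.newFixedThreadPool(threadNum);
    }

    public Map<String, Integer> count(File file) throws IOException, ExecutionException, InterruptedException {
        List<Future<Map<String, Integer>>> list = new ArrayList<>();
        Map<String, Integer> finalResult = new HashMap<>();
        // try-with-resources closes the reader once all results have been collected
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
            for (int i = 0; i < threadNum; ++i) {
                // All tasks share one reader. BufferedReader synchronizes readLine()
                // internally, so each line is handed to exactly one task.
                list.add(threadPool.submit(() -> {
                    Map<String, Integer> result = new HashMap<>();
                    String line;
                    while ((line = bufferedReader.readLine()) != null) {
                        for (String element : line.split(" ")) {
                            if (element.isEmpty()) {
                                continue; // blank lines and double spaces yield empty tokens
                            }
                            result.put(element, result.getOrDefault(element, 0) + 1);
                        }
                    }
                    return result;
                }));
            }

            // The main thread waits for each task and merges its partial result
            for (Future<Map<String, Integer>> future : list) {
                Map<String, Integer> resultFromThread = future.get();
                mergeResultFromThread(resultFromThread, finalResult);
            }
        }
        return finalResult;
    }

    // Call when the counter is no longer needed, so the pool threads can exit
    public void shutdown() {
        threadPool.shutdown();
    }

    private void mergeResultFromThread(Map<String, Integer> resultFromThread,
                                       Map<String, Integer> finalResult) {
        for (Map.Entry<String, Integer> entry : resultFromThread.entrySet()) {
            int resultNumber = finalResult.getOrDefault(entry.getKey(), 0) + entry.getValue();
            finalResult.put(entry.getKey(), resultNumber);
        }
    }
}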

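Once the tasks have been submitted to the thread pool, each worker thread keeps reading from the shared reader, one line at a time, until the file is exhausted. threadPool.submit returns a Future for each task, and these are added to list one by one. The second loop then calls get() on each Future and merges the partial results into the final map.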
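The submit / Future / merge flow can also be seen in isolation. The following is only a sketch under assumptions: it works on lines that are already in memory instead of a shared reader, the class name FutureMergeSketch and the method countInParallel are invented for illustration, and task i simply takes every threadNum-th line starting at index i.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class name, used only for this sketch.
class FutureMergeSketch {
    static Map<String, Integer> countInParallel(List<String> lines, int threadNum) {
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (int i = 0; i < threadNum; i++) {
                final int start = i;
                // Task i handles lines i, i + threadNum, i + 2 * threadNum, ...
                futures.add(pool.submit(() -> {
                    Map<String, Integer> partial = new HashMap<>();
                    for (int j = start; j < lines.size(); j += threadNum) {
                        for (String word : lines.get(j).split(" ")) {
                            if (!word.isEmpty()) {
                                partial.merge(word, 1, Integer::sum);
                            }
                        }
                    }
                    return partial;
                }));
            }
            // Only the calling thread touches `total`, so a plain HashMap is enough
            Map<String, Integer> total = new HashMap<>();
            for (Future<Map<String, Integer>> future : futures) {
                // get() blocks until that task has finished
                future.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }
}
```

Each task fills its own map, so no locking is needed while counting; the only shared step is the merge, and that runs on the calling thread.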

4. Multi-threaded counting for multiple files

Here is the code first:

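import com.google.common.collect.Lists;

import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class WordCount {
    private final int threadNum;
    private final ExecutorService threadPool;
    private final Object lock = new Object();

    public WordCount(int threadNum) {
        this.threadNum = threadNum;
        threadPool = Executors.newFixedThreadPool(threadNum);
    }

    public Map<String, Integer> count(List<File> files) throws ExecutionException, InterruptedException {
        synchronized (lock) {
            List<Future<Map<String, Integer>>> list = new ArrayList<>();
            // Only the main thread writes to finalResult, so a plain HashMap is enough
            Map<String, Integer> finalResult = new HashMap<>();
            // Lists.partition takes the size of each sublist, not the number of
            // sublists, so derive the size that gives at most threadNum parts
            int partSize = Math.max(1, (files.size() + threadNum - 1) / threadNum);
            List<List<File>> partOfFiles = Lists.partition(files, partSize);
            for (List<File> partOfFile : partOfFiles) {
                list.add(threadPool.submit(new Worker(partOfFile)));
            }

            // Wait for each worker and merge its partial result
            for (Future<Map<String, Integer>> future : list) {
                Map<String, Integer> resultFromThread = future.get();
                mergeResultFromThread(resultFromThread, finalResult);
            }
            System.out.println(finalResult);
            // After shutdown() the pool rejects new tasks, so count can only be
            // called once per WordCount instance
            threadPool.shutdown();
            return finalResult;
        }
    }

    // Counts the words in its own share of the files
    static class Worker implements Callable<Map<String, Integer>> {
        private final List<File> files;

        Worker(List<File> files) {
            this.files = files;
        }

        @Override
        public Map<String, Integer> call() throws Exception {
            // The map is local to this task, so no concurrent map is needed
            Map<String, Integer> result = new HashMap<>();
            for (File file : files) {
                List<String> oneFileToString = Files.readAllLines(file.toPath());
                for (String content : oneFileToString) {
                    for (String element : content.split(" ")) {
                        if (element.isEmpty()) {
                            continue; // blank lines and double spaces yield empty tokens
                        }
                        result.put(element, result.getOrDefault(element, 0) + 1);
                    }
                }
            }
            return result;
        }
    }

    private void mergeResultFromThread(Map<String, Integer> resultFromThread,
                                       Map<String, Integer> finalResult) {
        for (Map.Entry<String, Integer> entry : resultFromThread.entrySet()) {
            int resultNumber = finalResult.getOrDefault(entry.getKey(), 0) + entry.getValue();
            finalResult.put(entry.getKey(), resultNumber);
        }
    }
}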

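$\quad$ Given the List of files and the thread count threadNum, several files are processed at the same time. The idea is that the main thread splits the List into parts according to the number of threads and hands each part to a worker. The splitting uses Lists.partition from com.google.common.collect.Lists (Guava). Note that Lists.partition(list, size) takes the size of each sublist rather than the number of sublists, so to end up with about threadNum parts, the size passed in should be the number of files divided by threadNum, rounded up.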
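To make the difference between "sublists of a given size" and "a given number of sublists" concrete, here is a JDK-only sketch. Both helpers, chunksOfSize and chunkSizeFor, are hypothetical names written for this illustration and are not part of Guava; chunksOfSize is only meant to mimic size-based splitting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for this sketch.
class ChunkSketch {
    // Consecutive sublists of at most `size` elements each (size must be > 0).
    static <T> List<List<T>> chunksOfSize(List<T> source, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < source.size(); from += size) {
            int to = Math.min(from + size, source.size());
            chunks.add(source.subList(from, to));
        }
        return chunks;
    }

    // Chunk size that yields at most `parts` chunks: ceil(total / parts), at least 1.
    static int chunkSizeFor(int total, int parts) {
        return Math.max(1, (total + parts - 1) / parts);
    }
}
```

With 10 files and 4 threads, passing 4 as the size gives 3 chunks (4, 4 and 2 files), whereas passing chunkSizeFor(10, 4), which is 3, gives 4 chunks (3, 3, 3 and 1 files), one per thread.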
$\quad$ For splitting a list into equal parts, I also came across this method:

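	// Requires java.util.ArrayList and java.util.List.
	// Splits source into n consecutive parts whose sizes differ by at most one.
	public static <T> List<List<T>> averageAssign(List<T> source, int n) {
		List<List<T>> result = new ArrayList<List<T>>();
		int remainder = source.size() % n;  // first compute the remainder
		int number = source.size() / n;     // then the quotient
		int offset = 0;                     // number of extra elements handed out so far
		for (int i = 0; i < n; i++) {
			List<T> value = null;
			if (remainder > 0) {
				// The first `remainder` parts each take one extra element
				value = source.subList(i * number + offset, (i + 1) * number + offset + 1);
				remainder--;
				offset++;
			} else {
				value = source.subList(i * number + offset, (i + 1) * number + offset);
			}
			result.add(value);
		}
		return result;
	}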

This works as well.
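The same idea, splitting into exactly n parts where the first few parts take one extra element, can be written a little more compactly. This is a sketch only; the names SplitSketch and splitInto are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this sketch.
class SplitSketch {
    // Splits source into exactly n consecutive parts (n must be > 0).
    static <T> List<List<T>> splitInto(List<T> source, int n) {
        List<List<T>> parts = new ArrayList<>();
        int base = source.size() / n;   // minimum size of every part
        int extra = source.size() % n;  // this many leading parts get one more element
        int from = 0;
        for (int i = 0; i < n; i++) {
            int to = from + base + (i < extra ? 1 : 0);
            parts.add(source.subList(from, to));
            from = to;
        }
        return parts;
    }
}
```

For a list of 7 elements and n = 3, the parts have sizes 3, 2 and 2. If the list has fewer elements than n, the trailing parts are empty, which is harmless for the word-count workers but worth knowing.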

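Summary:
$\quad$ 1. When using multiple threads, it is best to keep the main thread's work separate from the worker threads' work. The code is then more clearly organized and bugs are easier to fix.
$\quad$ 2. Revisited the ways of reading file content in java.io and java.nio.file: new BufferedReader(Reader) and Files.readAllLines(Path).
$\quad$ 3. Learned two ways of splitting a List into equal parts: the partition method of Guava's Lists class, and rolling my own with List.subList().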
