theme: smartblue

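jscpd is a pure-JavaScript library that measures code duplication using the Rabin–Karp algorithm. It supports more than 150 programming languages and can emit reports in several formats, including HTML and JSON, which makes it a fairly capable duplication detector. The library is written in TypeScript, and its code is clean, concise, and easy to read and pick up. Its weak points are thin documentation and an official demo that does not quite work as published. I wrote this article to record my own exploration and to help others get started faster.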

jscpd version: 3.3.26
Node.js version: 13.10.0

The official API example returns no results

const jscpd = require('jscpd');

(async () => {
    const clones = await jscpd.detectClones({
        path: [
            `${__dirname}/target`
        ],
        silent: true
    });
    console.log(clones);
})();

Running the code above yields an empty array, yet the command line does produce output for the same directory:

$ npx jscpd test/jscpd/target/ --silent

Cause:

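@jscpd/finder builds its glob patterns from path and pattern (files.ts#L71) and passes them to the fast-glob module. The problem is that on Windows path uses backslashes while pattern uses forward slashes, so the resulting pattern looks like d:\\test\\target/**/*. A pattern such as d:/test/target/**/* is matched correctly, but the former matches no files.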
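The mismatch is easy to reproduce without jscpd. The sketch below assumes the pattern is built by plain string concatenation, which is a simplification of what files.ts does, and the variable names are illustrative only. As far as I know, fast-glob's README asks for forward slashes in patterns and treats backslashes as escape characters, which would be consistent with the behaviour described above.

```javascript
// A Windows-style directory, as path.resolve would return it on Windows.
const windowsDir = 'd:\\test\\target';
const pattern = '**/*';

// Assumed simplification: directory and glob are joined with a forward slash.
const mixed = `${windowsDir}/${pattern}`;

// Same join, after converting every backslash to a forward slash.
const normalized = `${windowsDir.replace(/\\/g, '/')}/${pattern}`;

console.log(mixed);      // d:\test\target/**/*
console.log(normalized); // d:/test/target/**/*
```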

Solution:

On Windows, this can be worked around with replace(/\\/g, '/'):

const path = require('path');
const jscpd = require('jscpd');

(async () => {
    const clones = await jscpd.detectClones({
        path: [
            // 'target'
            // Changed line: normalize Windows backslashes to forward slashes
            path.resolve(__dirname, './target').replace(/\\/g, '/')
        ],
        silent: true
    });
    console.log('result', clones);
})();

The API does not print statistics to the console

Run from the command line, jscpd prints a table of summary statistics.

Through the API, however, we do not get the most useful figure, Duplicated lines.

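Cause:

Reading the source shows that both the command line and the API end up calling the detectClones function. In command-line mode, that function prints the statistics described above when it calls detector.detect.

Tracing further into @jscpd/finder shows that reporter.report(clones, statistic) is what prints the result to the console. The content of statistic is: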

{
    detectionDate: '2021-06-02T08:37:16.761Z',
    formats: { jsx: { sources: [Object], total: [Object] } },
    total: {
        lines: 24,
        tokens: 222,
        sources: 2,
        clones: 1,
        duplicatedLines: 12,
        duplicatedTokens: 111,
        percentage: 50,
        percentageTokens: 50,
        newDuplicatedLines: 0,
        newClones: 0
    }
}

So statistic is exactly what we want. Tracing further, the code checks at this point whether silent is true, and if it is, nothing is printed.

Solution:

When calling the API, simply do not set silent to true.

The API return value lacks statistics

Cause:

Calling detectClones returns only the clones, not the statistics.

Solution:

The following code obtains the duplication rate:

const path = require('path');
const { getFilesToDetect, InFilesDetector } = require('@jscpd/finder');
const { MemoryStore, Statistic, getDefaultOptions } = require('@jscpd/core');
const { Tokenizer, getSupportedFormats } = require('@jscpd/tokenizer');

const options = {
    ...getDefaultOptions(),
    path: [
        path.resolve(__dirname, './target').replace(/\\/g, '/')
    ],
    minTokens: 10,
    minLines: 2
};
options.format = getSupportedFormats();
const tokenizer = new Tokenizer();
const store = new MemoryStore();
const statistic = new Statistic(options);
const files = getFilesToDetect(options);
const detector = new InFilesDetector(tokenizer, store, statistic, options);
(async () => {
    const clones = await detector.detect(files);
    console.log(statistic.statistic);
})();

Here statistic.statistic is the data we are after.
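Once that object is available, the headline figure can be pulled out of it. The sketch below uses the sample values printed earlier in this article. The summarize helper is a hypothetical name, not part of jscpd, and it recomputes the percentage from the line counts rather than relying on the percentage field.

```javascript
// Sample `total` block copied from the statistic output shown earlier.
const total = {
    lines: 24,
    tokens: 222,
    sources: 2,
    clones: 1,
    duplicatedLines: 12,
    duplicatedTokens: 111,
    percentage: 50,
    percentageTokens: 50
};

// Hypothetical helper (not a jscpd API): formats the duplicated-lines figure.
function summarize(t) {
    // Guard against an empty scan so we never divide by zero.
    const rate = t.lines === 0 ? 0 : (t.duplicatedLines / t.lines) * 100;
    return `Duplicated lines: ${t.duplicatedLines} of ${t.lines} (${rate.toFixed(2)}%)`;
}

console.log(summarize(total)); // Duplicated lines: 12 of 24 (50.00%)
```

In a real script, total would be statistic.statistic.total from the snippet above instead of a hard-coded object.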

detectClones and detector.detect produce different clones

Calling detector.detect:

const path = require('path');
const { getFilesToDetect, InFilesDetector } = require('@jscpd/finder');
const { MemoryStore, Statistic, getDefaultOptions } = require('@jscpd/core');
const { Tokenizer, getSupportedFormats } = require('@jscpd/tokenizer');

const options = {
    ...getDefaultOptions(),
    path: [
        path.resolve(__dirname, './target').replace(/\\/g, '/')
    ],
    reporters: ['html'],
    minTokens: 10
};
options.format = getSupportedFormats();
const tokenizer = new Tokenizer();
const store = new MemoryStore();
const statistic = new Statistic(options);

const files = getFilesToDetect(options);
// console.log(JSON.stringify(options, null, 4));

const detector = new InFilesDetector(tokenizer, store, statistic, options);

(async () => {
    const clones = await detector.detect(files);
})();

Calling detectClones:

const path = require('path');
const fs = require('fs');
const jscpd = require('jscpd');

(async () => {
    const clones = await jscpd.detectClones({
        path: [
            // 'target'
            path.resolve(__dirname, './target').replace(/\\/g, '/')
        ],
        reporters: ['html'],
        silent: true,
        minTokens: 10
    });
})();

The resulting clones have different fields: what detectClones returns includes a fragment field that holds the actual duplicated code.

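Cause:

Printing clones at various points in the call chain shows that before the hooks run at in-files-detector.ts#L97 there is no fragment field, while afterwards there is. That hook happens to be invoked from detectClones (index.ts#L31), so calling detector.detect directly skips this step.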

Solution:

Reading hooks.ts shows that the fragment field can be added by registering a FragmentsHook: calling detector.registerHook(new FragmentsHook()); is enough. The complete code:

const path = require('path');
const fs = require('fs');
// Changed line: FragmentsHook is imported as well
const { getFilesToDetect, InFilesDetector, FragmentsHook } = require('@jscpd/finder');
const { MemoryStore, Statistic, getDefaultOptions } = require('@jscpd/core');
const { Tokenizer, getSupportedFormats } = require('@jscpd/tokenizer');

const options = {
    ...getDefaultOptions(),
    path: [
        path.resolve(__dirname, './target').replace(/\\/g, '/')
    ],
    reporters: ['html'],
    minTokens: 10
};
options.format = getSupportedFormats();
const tokenizer = new Tokenizer();
const store = new MemoryStore();
const statistic = new Statistic(options);
const files = getFilesToDetect(options);

const detector = new InFilesDetector(tokenizer, store, statistic, options);

// Added line: the hook attaches the fragment field to each clone
detector.registerHook(new FragmentsHook());

(async () => {
    const clones = await detector.detect(files);
    fs.writeFileSync('./result.json', JSON.stringify(clones));
})();
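To make the role of the hook clearer, here is a toy model of the mechanism. ToyDetector and ToyFragmentHook are hypothetical stand-ins written for this article, not jscpd's classes, and everything is synchronous for brevity. The sketch only shows why a clone lacks a field until a hook that adds it has been registered.

```javascript
// Hypothetical stand-in for a detector that post-processes clones with hooks.
class ToyDetector {
    constructor() {
        this.hooks = [];
    }

    registerHook(hook) {
        this.hooks.push(hook);
    }

    // Runs every registered hook over every clone, in registration order.
    detect(rawClones) {
        return rawClones.map((clone) =>
            this.hooks.reduce((current, hook) => hook.process(current), clone)
        );
    }
}

// Hypothetical hook: copies the duplicated lines into a `fragment` field.
class ToyFragmentHook {
    constructor(sources) {
        this.sources = sources; // file name -> file content
    }

    process(clone) {
        const lines = this.sources[clone.file].split('\n');
        // startLine and endLine are 1-based and inclusive.
        const fragment = lines.slice(clone.startLine - 1, clone.endLine).join('\n');
        return { ...clone, fragment };
    }
}

const sources = { 'a.js': 'one\ntwo\nthree\nfour' };
const rawClones = [{ file: 'a.js', startLine: 2, endLine: 3 }];

const toyDetector = new ToyDetector();
const withoutHook = toyDetector.detect(rawClones);

toyDetector.registerHook(new ToyFragmentHook(sources));
const withHook = toyDetector.detect(rawClones);

console.log(withoutHook[0].fragment); // undefined
console.log(withHook[0].fragment);    // "two" and "three" on separate lines
```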

Can a single file be processed?

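Yes, a single file can be processed, but minTokens must be set appropriately. Its default is 50, and with an unsuitable value jscpd may well fail to find the duplicated text.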

Open issues

When minTokens is small, for example 5, the results are often baffling. Take this text:

<GraceLoading loading={this.props.flowGridLoading}>
</GraceLoading>
<GraceLoading loading={this.props.flowBubbleLoading}>
<Bubble {...chartBubbleConfig} key={JSON.stringify(this.props.flowBubbleData)} />
</GraceLoading>
<div className='flow-grid-panel'>
<GraceLoading loading={this.props.flowGridLoading}>

With minLines set to 3 and minTokens set to 5, the following is reported as a duplicate:

// target\file1.jsx (4:2-7:2)
// target\file1.jsx (1:16-1:2)

} />
</GraceLoading>
<div className='flow-grid-panel'>
<GraceLoading loading={this.props.flowGridLoading}

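This result is hard to make sense of. My guess is that it is related to the Rabin–Karp algorithm that jscpd uses underneath, but that still needs to be verified.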
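For reference while investigating, here is a minimal sketch of Rabin–Karp applied to a token sequence. It is a generic textbook version written for this article, not jscpd's implementation, and findRepeatedWindows is a hypothetical name. It does not explain the odd result above. It only shows the general effect that a shorter window (the analogue of a small minTokens) matches more places.

```javascript
// Returns [firstStart, laterStart] pairs of identical windows of `windowSize` tokens.
function findRepeatedWindows(tokens, windowSize) {
    const BASE = 257;
    const MOD = 1000000007;
    if (windowSize <= 0 || tokens.length < windowSize) return [];

    // Map each distinct token to a small positive integer.
    const ids = new Map();
    const codes = tokens.map((t) => {
        if (!ids.has(t)) ids.set(t, ids.size + 1);
        return ids.get(t);
    });

    // BASE^(windowSize - 1) mod MOD, used to drop the leading token.
    let high = 1;
    for (let i = 0; i < windowSize - 1; i++) high = (high * BASE) % MOD;

    // Hash of the first window.
    let hash = 0;
    for (let i = 0; i < windowSize; i++) hash = (hash * BASE + codes[i]) % MOD;

    const sameWindow = (a, b) => {
        for (let i = 0; i < windowSize; i++) {
            if (tokens[a + i] !== tokens[b + i]) return false;
        }
        return true;
    };

    const seen = new Map(); // hash -> start indices already visited
    const matches = [];
    for (let start = 0; ; start++) {
        const earlier = seen.get(hash) || [];
        // Equal hashes may be collisions, so verify the tokens themselves.
        const prev = earlier.find((p) => sameWindow(p, start));
        if (prev !== undefined) matches.push([prev, start]);
        earlier.push(start);
        seen.set(hash, earlier);

        const next = start + windowSize;
        if (next >= tokens.length) break;
        // Roll the hash: remove tokens[start], append tokens[next].
        hash = ((hash - (codes[start] * high) % MOD + MOD) * BASE + codes[next]) % MOD;
    }
    return matches;
}

const sampleTokens = ['a', 'b', 'c', 'x', 'a', 'b', 'c'];
console.log(findRepeatedWindows(sampleTokens, 3)); // [ [ 0, 4 ] ]
console.log(findRepeatedWindows(sampleTokens, 2)); // [ [ 0, 4 ], [ 1, 5 ] ]
```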
