day10: Data Saving & Deduplication & Document Parsing
1 Data Saving Preparation
1.1 IP Proxy Pool
1.1.1 Requirements analysis: manage the IP proxy pool, including create/read/update/delete and marking IPs as usable or unusable.
Below is the CRUD code for the IP proxy pool.
The interface method that checks whether an IP already exists:
@Override
public boolean checkExist(String host, int port) {
    // Query the proxy-pool table by host and port; any hit means the proxy is already stored
    ClIpPool clIpPool = new ClIpPool();
    clIpPool.setIp(host);
    clIpPool.setPort(port);
    List<ClIpPool> clIpPools = clIpPoolMapper.selectList(clIpPool);
    return null != clIpPools && !clIpPools.isEmpty();
}
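The exist-check above is the pattern the rest of the dedup logic builds on: query by the identifying fields and treat a non-empty result as "already stored". A minimal self-contained sketch of the same contract, with an in-memory list standing in for `clIpPoolMapper` (the stand-in class and names are assumptions for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

public class IpPoolSketch {
    // Simplified stand-in for the ClIpPool entity
    static class ClIpPool {
        final String ip;
        final int port;
        ClIpPool(String ip, int port) { this.ip = ip; this.port = port; }
    }

    // In-memory stand-in for the table behind clIpPoolMapper.selectList(...)
    private final List<ClIpPool> pool = new ArrayList<>();

    public void save(ClIpPool p) { pool.add(p); }

    // Same contract as checkExist(host, port): true if the proxy is already stored
    public boolean checkExist(String host, int port) {
        return pool.stream().anyMatch(p -> p.ip.equals(host) && p.port == port);
    }

    public static void main(String[] args) {
        IpPoolSketch s = new IpPoolSketch();
        s.save(new ClIpPool("127.0.0.1", 8888));
        System.out.println(s.checkExist("127.0.0.1", 8888)); // true
        System.out.println(s.checkExist("10.0.0.1", 8888));  // false
    }
}
```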
Test for saving into the IP proxy pool:
@Test
public void testSaveCrawlerIpPool(){
    ClIpPool clIpPool = new ClIpPool();
    clIpPool.setIp("2222.3333.444.5555");
    clIpPool.setPort(1111);
    clIpPool.setEnable(true);
    clIpPool.setCreatedTime(new Date());
    crawlerIpPoolService.saveCrawlerIpPool(clIpPool);
}
The test passes and the IP record is saved successfully.
1.2 Crawler Article Additional-Info Table
Interface test, querying additional-info records by article URL:
@Test
public void testQueryList(){
    ClNewsAdditional clNewsAdditional = new ClNewsAdditional();
    clNewsAdditional.setUrl("https://blog.csdn.net/weixin_43976602/article/details/96971651");
    List<ClNewsAdditional> clNewsAdditionals = crawlerNewsAdditionalService.queryList(clNewsAdditional);
    System.out.println(clNewsAdditionals);
}
The test passes.
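The URL lookup above is what save-time deduplication rests on: query the additional-info table by URL first, and only insert when nothing comes back. A minimal sketch of that decision, with an in-memory map standing in for the real service (the class and method names here are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class UrlDedupSketch {
    // url -> article content; stands in for the additional-info table
    private final Map<String, String> store = new HashMap<>();

    /** Returns true if the article was new and got inserted, false if it was a duplicate. */
    public boolean saveIfAbsent(String url, String content) {
        if (store.containsKey(url)) {   // queryList(...) returned a non-empty list
            return false;               // duplicate: skip (or update) instead of inserting
        }
        store.put(url, content);        // insertSelective(...)
        return true;
    }

    public static void main(String[] args) {
        UrlDedupSketch d = new UrlDedupSketch();
        System.out.println(d.saveIfAbsent("https://example.com/a", "first"));  // true
        System.out.println(d.saveIfAbsent("https://example.com/a", "again")); // false
    }
}
```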
1.3 Crawler Article Comment Info Table
1.3.1 Requirements analysis: save an article's comment information.
@Override
public void saveClNewsComment(ClNewsComment clNewsComment) {
    // insertSelective only writes the non-null fields of the comment record
    clNewsCommentMapper.insertSelective(clNewsComment);
}
1.4 Crawler Articles
1.4.1 Requirements analysis: create, read, update, and delete articles.
@Override
public void saveNews(ClNews clNews) {
    clNewsMapper.insertSelective(clNews);
}

@Override
public void updateNews(ClNews clNews) {
    clNewsMapper.updateByPrimaryKey(clNews);
}

@Override
public void deleteByUrl(String url) {
    clNewsMapper.deleteByUrl(url);
}

@Override
public List<ClNews> queryList(ClNews clNews) {
    return clNewsMapper.selectList(clNews);
}
2 Deduplication: Integrating Redis
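Redis-based deduplication typically relies on the semantics of SADD: adding a member to a set returns 1 only when the member was not already present, so one call both checks and records a URL. A sketch of those semantics using a plain `HashSet` (in production the call would go to Redis, e.g. `jedis.sadd(key, url) == 1L` with the Jedis client; the class and key names are assumptions):

```java
import java.util.HashSet;
import java.util.Set;

public class RedisDedupSketch {
    // Stands in for a Redis SET; in production this would be something like
    // jedis.sadd("crawler:seen_urls", url) == 1L using the Jedis client
    private final Set<String> seenUrls = new HashSet<>();

    /** Mirrors SADD semantics: true only the first time a URL is seen. */
    public boolean markIfUnseen(String url) {
        return seenUrls.add(url);
    }

    public static void main(String[] args) {
        RedisDedupSketch dedup = new RedisDedupSketch();
        System.out.println(dedup.markIfUnseen("https://blog.csdn.net/x")); // true
        System.out.println(dedup.markIfUnseen("https://blog.csdn.net/x")); // false
    }
}
```

Because SADD is atomic, this check-and-mark step stays race-free even when many crawler threads share one Redis instance, which a local `HashSet` cannot offer across processes.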