go-crawler-distributed
Github: github.com/golang-coll…
A distributed crawler project. It supports secondary development of customized page parsers, uses a microservice architecture throughout, and sends messages asynchronously through a message queue. Frameworks used: redigo, gorm, goquery, easyjson, viper, amqp, zap, go-micro. Containerized deployment is provided through Docker, and the intermediate crawler nodes support horizontal scaling.
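For illustration, here is a minimal sketch of how a crawl task might be published to RabbitMQ with the streadway/amqp package. The queue name `crawl_list` and the connection URL are assumptions made for this example; the actual queue names and connection settings come from the project's configuration.

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// Assumed connection URL; the real one is read from config.json.
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// "crawl_list" is a hypothetical queue name used only for this sketch.
	q, err := ch.QueueDeclare("crawl_list", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Publish a URL to be crawled; a worker node picks it up asynchronously.
	err = ch.Publish("", q.Name, false, false, amqp.Publishing{
		ContentType: "text/plain",
		Body:        []byte("https://movie.douban.com/top250"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```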
Directory structure
- cache: Redis-related operations (see the sketch after this list)
- config: configuration files
- crawler: crawler-related logic
    - crawerConfig: message queue configuration for the crawlers
    - douban: Douban page parsing
    - fetcher: page fetching
    - meituan: Meituan page parsing
    - persistence: structs used for saving data
    - worker: the concrete work logic, through which the code is decoupled
- crawlOperation: unified crawler processing module (everything except storage)
- db: MySQL operations
- dependencies: docker-compose files for the environments the project depends on
- deploy: scripts
    - buildScript: deployment scripts
    - deploy: scripts that start the project directly
    - dockerBuildScript: scripts that build the Docker images
        - service: Dockerfiles for the services
- elastic: Elasticsearch-related operations
- model: struct definitions
- mq: message queue operations
- runtime: log files
- service: microservices
    - cache: Redis microservice, accessed via gRPC
        - proto: gRPC proto definitions
        - server: gRPC server implementation
    - douban: Douban-related services
    - elastic: Elasticsearch microservice, accessed via gRPC
        - proto: gRPC proto definitions
        - server: gRPC server implementation
    - meituan: Meituan-related services
    - watchConfig: configuration watching
        - cache: Redis microservice, accessed via gRPC
- tools: small utilities
- unifiedLog: unified logging
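As an illustration of the kind of logic the cache module contains, here is a minimal redigo sketch that de-duplicates URLs through a Redis set. The key name `visited_urls` is a hypothetical example, not a name taken from the project; the address mirrors the sample config.json below.

```go
package main

import (
	"fmt"
	"log"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Assumed address; the real one comes from the "redis" section of config.json.
	conn, err := redis.Dial("tcp", "redis:6379")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// SADD returns 1 when the URL is new, 0 when it was already in the set.
	added, err := redis.Int(conn.Do("SADD", "visited_urls", "https://movie.douban.com/top250"))
	if err != nil {
		log.Fatal(err)
	}
	if added == 0 {
		fmt.Println("already crawled, skip")
	} else {
		fmt.Println("new URL, enqueue for crawling")
	}
}
```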
Configuration file
You need to customize your configuration. Create a config.json file under the config folder in the project root directory. A sample config.json:
{
    "mysql": {
        "user": "root",
        "password": "example",
        "host": "mysql:3306",
        "db_name": "short_story"
    },
    "redis": {
        "host": "redis:6379"
    },
    "rabbitmq": {
        "user": "guest",
        "password": "guest",
        "host": "rabbitmq:5672"
    },
    "elastic": {
        "url": "http://elastic_server:9200",
        "index": "article"
    }
}
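Since viper is among the listed frameworks, loading this file might look like the following minimal sketch. The config path and key names mirror the sample above; the exact loading code in the project may differ.

```go
package main

import (
	"fmt"
	"log"

	"github.com/spf13/viper"
)

func main() {
	// Look for config/config.json relative to the project root.
	viper.SetConfigName("config")
	viper.SetConfigType("json")
	viper.AddConfigPath("./config")

	if err := viper.ReadInConfig(); err != nil {
		log.Fatalf("read config: %v", err)
	}

	// Keys follow the sample config.json above.
	mysqlHost := viper.GetString("mysql.host")
	redisHost := viper.GetString("redis.host")
	esURL := viper.GetString("elastic.url")

	fmt.Println(mysqlHost, redisHost, esURL)
}
```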
Parser
douban
Parser: the parser uses CSS selectors to parse the page.
meituan
Parser: the parser uses CSS selectors to parse the page.
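A minimal sketch of this CSS-selector style of parsing, using goquery. The selector and HTML fragment here are invented for the example; the real selectors live in the douban and meituan packages.

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// A tiny HTML fragment standing in for a fetched page.
	html := `<ol class="list">
	           <li><a class="title" href="/item/1">First</a></li>
	           <li><a class="title" href="/item/2">Second</a></li>
	         </ol>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatal(err)
	}

	// Select every title link with a CSS selector and pull out text and href.
	doc.Find("ol.list a.title").Each(func(i int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		fmt.Printf("%d: %s -> %s\n", i, s.Text(), href)
	})
}
```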
Framework
Architecture
Installation
There are two ways to deploy the project, locally or in the cloud:
- Direct deploy
- Docker (recommended)
Prerequisites
- Go 1.13.6
- Redis 6.0
- MySQL 5.7
- RabbitMQ management
- ElasticSearch
- Others: see go.mod
Quick Start
Open a command prompt and execute the commands below. Make sure you have installed docker-compose in advance.
git clone https://github.com/Knowledge-Precipitation-Tribe/go-crawler-distributed
cd go-crawler-distributed/deploy/buildScript
bash linux_build
cd ../
docker network create -d bridge crawler
docker-compose up -d
The above commands start the basic services, including Redis, Elasticsearch, RabbitMQ, and MySQL. The RPC services for Redis and Elasticsearch are also included.
You can then switch to the douban or meituan folder and launch the services with the following commands.
For example:
cd meituan
docker-compose up -d
Run
Basic services
Use docker-compose for a one-command startup. Create a file named docker-compose.yml with the content below.
version: "3"
services:
  cache:
    build:
      context: cache
      dockerfile: Dockerfile
    networks:
      - crawler
  elastic:
    build:
      context: elastic
      dockerfile: Dockerfile
    depends_on:
      - elastic_server
    networks:
      - crawler
  redis:
    image: redis
    restart: always
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    networks:
      - crawler
  mysql:
    image: mysql
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: example
    networks:
      - crawler
  elastic_server:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
      - elastic-data:/data
    environment:
      - discovery.type=single-node
    networks:
      - crawler
  rabbitmq:
    image: rabbitmq:management
    hostname: myrabbitmq
    ports:
      - "5672:5672"
      - "15672:15672"
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq
    networks:
      - crawler
volumes:
  elastic-data:
  rabbitmq-data:
  redis-data:
networks:
  crawler:
    external: true
Then execute the command below, and the basic services will start up.
docker-compose up -d
The above command starts the basic services, including Redis, Elasticsearch, RabbitMQ, and MySQL. The RPC services for Redis and Elasticsearch are also included.
Crawler services
Create a docker-compose.yml file under the meituan or douban folder.
version: "3"
services:
  crawl_list:
    build:
      context: crawl_list
      dockerfile: Dockerfile
    networks:
      - crawler
  crawl_tags:
    build:
      context: crawl_urllist
      dockerfile: Dockerfile
    networks:
      - crawler
  crawl_detail:
    build:
      context: crawl_detail
      dockerfile: Dockerfile
    networks:
      - crawler
  storage_detail:
    build:
      context: storage_detail
      dockerfile: Dockerfile
    networks:
      - crawler
networks:
  crawler:
    external: true
Then execute the command below, and the crawler service will start up.
docker-compose up -d
If you want to start multiple crawler nodes, you can use the following command.
docker-compose up --scale crawl_list=3 -d
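Horizontal scaling works because every crawler replica consumes from the same queue, so RabbitMQ distributes the pending tasks among the nodes. A minimal consumer sketch with streadway/amqp follows; the queue name and connection URL are assumptions, as in the publisher sketch near the top of this document.

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Each replica declares and reads from the same queue, so deliveries are load-balanced.
	q, err := ch.QueueDeclare("crawl_list", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	msgs, err := ch.Consume(q.Name, "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range msgs {
		log.Printf("crawling %s", d.Body)
		// ... fetch and parse the page here ...
		d.Ack(false) // acknowledge so the message is not redelivered
	}
}
```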
Direct
Start the services directly, but be aware that they depend on Redis, RabbitMQ, and the other supporting services.
cd deploy/deploy/
bash start-meituan-direct.sh
Appendix
- Docker installation: docs.docker.com/
- docker-compose installation: docs.docker.com/compose/ins…
License
Copyright (c) 2020 Knowledge-Precipitation-Tribe