Python Web Scraping for Beginners

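A web crawler is a script that automatically fetches information from the internet. Crawlers are widely used in search engines, data analysis, and content aggregation. In this post I'll walk you through building a basic crawler quickly in Python. Why Python? Mainly because it has a rich set of supporting libraries and plenty of documentation and examples to consult, so for a comparable task it usually offers the best overall balance of cost, development time, and efficiency. Enough talk, let's get started.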

Environment Setup

pip install requests beautifulsoup4  # install the core libraries

A Basic Scraper in Four Steps

1. Send an HTTP request

import requests

url = "https://example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

response = requests.get(url, headers=headers)
print(f"Status code: {response.status_code}")  # 200 means success

2. Parse the HTML content

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

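# Extract the page title (soup.title is None when the page has no <title> tag)
title = soup.title.text if soup.title else ""
print(f"Page title: {title}")

# Extract every link that has an href attribute
links = [a['href'] for a in soup.find_all('a', href=True)]
print(f"Found {len(links)} links")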
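
The href values collected above are often relative paths, so they usually need to be resolved against the page URL before they can be requested. Below is a minimal sketch using only the standard library. The helper name `normalize_links` and the sample hrefs are made up for illustration, not part of the original tutorial.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and non-HTTP links, dedupe in order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin handles relative paths; urldefrag strips any "#section" part
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto: and other non-HTTP schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs, as they might appear in a page at https://example.com/blog/
sample = ["/about", "news/today.html", "https://example.com/about#team", "mailto:hi@example.com"]
print(normalize_links("https://example.com/blog/", sample))
```

In the tutorial's flow this would be called as `normalize_links(url, links)` right after the list comprehension.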

3. Store the data

# Save to a CSV file
import csv

with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Link'])  # header row
    for link in links:
        writer.writerow([title, link])
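
If each record has named fields, `csv.DictWriter` can be a convenient alternative to positional rows. The sketch below writes a few records and reads them back to check the round trip. The helper names, field names, and sample rows are illustrative assumptions.

```python
import csv
import os
import tempfile


def save_rows(path, rows, fieldnames):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read the CSV back as a list of dicts (every value comes back as a string)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Made-up sample records; the comma in the second one is quoted automatically
rows = [
    {"text": "Home", "link": "https://example.com/"},
    {"text": "Docs, guides", "link": "https://example.com/docs"},
]
path = os.path.join(tempfile.mkdtemp(), "links.csv")
save_rows(path, rows, fieldnames=["text", "link"])
print(load_rows(path))
```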

4. Handle pagination

base_url = "https://example.com/page/{}"
for page in range(1, 6):  # crawl pages 1 through 5
    page_url = base_url.format(page)
    response = requests.get(page_url)
    # parsing and storage logic...
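
In practice the number of pages is often unknown, so a loop that stops when a page comes back empty can be more robust than a fixed range. Here is one possible sketch. `crawl_pages` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for real request-and-parse code so the example runs offline.

```python
def crawl_pages(fetch_page, base_url, max_pages=5):
    """Call fetch_page on successive page URLs until it returns None or max_pages is reached."""
    results = []
    for page in range(1, max_pages + 1):
        page_url = base_url.format(page)
        items = fetch_page(page_url)
        if items is None:  # e.g. the site answered 404: no more pages
            break
        results.extend(items)
    return results


# Stand-in for a real fetch-and-parse function: pretends the site has 3 pages
def fake_fetch_page(page_url):
    page = int(page_url.rstrip("/").rsplit("/", 1)[-1])
    if page > 3:
        return None
    return [f"item-{page}-{i}" for i in range(2)]


items = crawl_pages(fake_fetch_page, "https://example.com/page/{}", max_pages=10)
print(len(items))  # 6: two items from each of the three pages
```

A real `fetch_page` would wrap `requests.get` plus the BeautifulSoup parsing from step 2 and return `None` on a non-200 status.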

Advanced Techniques

1. Handle dynamic content (using Selenium)

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://dynamic-site.com")
dynamic_content = driver.page_source
# From here, parsing works the same way as before
driver.quit()

2. Avoid getting blocked

import time
import random

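# Random delay (1 to 3 seconds)
time.sleep(random.uniform(1, 3))

# Use a proxy IP (placeholder address); an "https" entry is needed
# for the proxy to apply to https:// URLs
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:3128",
}
requests.get(url, proxies=proxies)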
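
A random sleep works for a single loop. When a crawler touches several hosts, a small per-host throttle is one way to keep the spacing consistent. The following is a sketch under assumptions: the `Throttle` class is hypothetical, and it uses a fixed minimum delay rather than the random 1 to 3 seconds shown above.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Ensure at least `delay` seconds pass between requests to the same host."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.last_seen = {}   # host -> time of the previous request

    def wait(self, url):
        host = urlparse(url).netloc
        last = self.last_seen.get(host)
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_seen[host] = self.clock()


throttle = Throttle(delay=0.2)
start = time.monotonic()
for path in ("/a", "/b", "/c"):
    throttle.wait("https://example.com" + path)  # call right before each requests.get(...)
elapsed = time.monotonic() - start
print(f"3 requests spaced over {elapsed:.2f}s")
```

Adding `random.uniform(0, 1)` to `remaining` would restore the jitter from the original snippet.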

3. Respect robots.txt

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
if rp.can_fetch("*", url):
    print("Crawling allowed")  # safe to request this URL
else:
    print("Crawling disallowed by robots.txt")
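
To see how the parser decides without touching the network, the rules can also be fed in directly with `parse()`. The robots.txt content below is a made-up example, not the real file of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so the sketch runs offline
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

for candidate in ("https://example.com/index.html", "https://example.com/private/data.html"):
    verdict = "allowed" if rp.can_fetch("*", candidate) else "blocked"
    print(f"{candidate}: {verdict}")

# crawl_delay returns the Crawl-delay value for the agent, or None if unset
print("crawl delay:", rp.crawl_delay("*"))
```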

Complete Example: Scraping Book Information

import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
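response = requests.get(url, timeout=10)
# Pass raw bytes so BeautifulSoup detects the page encoding itself;
# response.text can garble the "£" sign when the server omits a charset
soup = BeautifulSoup(response.content, 'html.parser')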

books = []
for book in soup.select('article.product_pod'):
    title = book.h3.a['title']
    price = book.select_one('p.price_color').text
    books.append((title, price))

print(f"Scraped {len(books)} books")
for title, price in books[:3]:
    print(f"- {title}: {price}")
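
The prices above are scraped as strings, so sorting or filtering by price needs a numeric conversion first. One possible sketch follows. `parse_price` is a hypothetical helper and the sample list is made-up data in the same (title, price) shape.

```python
import re


def parse_price(text):
    """Extract the numeric part of a price string such as '£51.77'; return None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


# Made-up sample data shaped like the `books` list in the example
books = [("Book A", "£51.77"), ("Book B", "£13.99"), ("Book C", "£50.10")]
by_price = sorted(
    ((title, parse_price(price)) for title, price in books),
    key=lambda pair: pair[1],
)
print(by_price[0])  # ('Book B', 13.99)
```

For money values where rounding matters, `decimal.Decimal` would be a safer choice than `float`.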

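Important Reminders

1. Legal compliance: follow the site's robots.txt rules and do not scrape sensitive data

2. Rate control: add delays so you do not put pressure on the server

3. Error handling: wrap requests in try-except to cope with network errors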

try:
    response = requests.get(url, timeout=5)
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")

4. User-Agent rotation: use different browser identifiers
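
Reminders 3 and 4 can be combined into one retry helper. The sketch below takes the request function as a parameter so it runs offline. `fetch_with_retries`, the User-Agent list, and the `flaky_get` stand-in are all illustrative assumptions rather than part of any library.

```python
import itertools
import time

# Example User-Agent strings; the exact values are illustrative
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
]


def fetch_with_retries(get, url, user_agents=USER_AGENTS, retries=3, backoff=1.0, sleep=time.sleep):
    """Try get(url, headers=...) up to `retries` times, rotating User-Agent and backing off."""
    agents = itertools.cycle(user_agents)
    last_error = None
    for attempt in range(retries):
        headers = {"User-Agent": next(agents)}
        try:
            return get(url, headers=headers)
        except Exception as exc:  # with requests, narrow this to requests.exceptions.RequestException
            last_error = exc
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    raise last_error


# Stand-in for requests.get: fails twice, then succeeds
calls = []

def flaky_get(url, headers):
    calls.append(headers["User-Agent"])
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return f"OK {url}"


print(fetch_with_retries(flaky_get, "https://example.com", sleep=lambda s: None))
```

With requests installed, the real call would look like `fetch_with_retries(functools.partial(requests.get, timeout=5), url)`.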

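After working through this tutorial, you should have a grasp of the basic principles of web scraping and how to implement them. In real projects you can add database storage, asynchronous processing, and other advanced features as needed. Those are topics for later study, and they are essential for more demanding scraping projects.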