1. Page analysis

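We start from the site's home page and extract information level by level, down to the individual property listings.

Taking second-hand homes as the example, we analyze the page source.
(The site's other sections, such as selling and renting, can be crawled the same way.)

First, inspect the page source and locate the markup and links for the second-hand and new-home sections.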

![](https://p3-tt-ipv6.byteimg.com/origin/pgc-image/207bab86062a4f5b9440c40d05dea254)

The HTML contains the URL for each district, so extracting the href attribute of each link takes us straight to that district's listings.

![](https://p3-tt-ipv6.byteimg.com/origin/pgc-image/7e49c9fbaf644f9685b475036dbb4fcb)

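Taking Baohe District (包河区) as the example, the district page lists the properties, so iterating over the li nodes yields the detail-page URL of every listing.
The crawl here covers only the first page. To crawl more pages, change the URL: comparing the URLs shows that each subsequent page advances the suffix, p2, then p3, and so on.
From:

hf.anjuke.com/sale/baoheq…

to:

hf.anjuke.com/sale/baoheq…
hf.anjuke.com/sale/baoheq…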
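The paging rule can be sketched as a small generator. This assumes that page n is reached by appending a `p<n>/` path segment to the district URL, which is my reading of the p2, p3 pattern above and not something confirmed against the site. The helper name `page_urls` and the `example.com` base URL are placeholders.

```python
def page_urls(base_url, pages):
    """Yield listing URLs for pages 1..pages.

    Assumption: page 1 is the bare district URL and page n (n >= 2)
    is the same URL followed by 'p<n>/'.
    """
    base = base_url.rstrip('/') + '/'
    for n in range(1, pages + 1):
        yield base if n == 1 else f'{base}p{n}/'

urls = list(page_urls('https://example.com/sale/region-a', 3))
```

Each yielded URL could then be fetched and parsed the same way as the first page.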

![](https://p3-tt-ipv6.byteimg.com/origin/pgc-image/0dde79a64e8b4d5f93ae6839c075cb59)

Finally, extract the information from each property's detail page.

![](https://p26-tt.byteimg.com/origin/pgc-image/d0a6a2dfc8174d3298fab7eda978c63c)

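2. Code

The code is fairly simple and needs little explanation; if anything is unclear, leave a comment below.

The extracted title is used as the file name. Some titles contain characters that are illegal in file names, so replace is used to strip them.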
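The replace chain in the crawler only removes spaces, `|`, and `*`. As an alternative sketch (not the approach this post's code takes), a single regular expression can strip the characters that Windows and Unix-like systems commonly reject in file names. The helper name `safe_filename` and the `untitled` fallback are assumptions.

```python
import re

# Characters commonly rejected in file names, plus whitespace.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\s]+')

def safe_filename(title, ext='txt'):
    """Strip illegal characters from title and append the extension."""
    cleaned = _ILLEGAL.sub('', title).strip('.')
    # Fall back to a placeholder name if nothing usable is left.
    return f'{cleaned or "untitled"}.{ext}'

name = safe_filename('3室2厅 | 精装*南北通透')
```

The result could replace the chained replace calls when building the file path.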

from pyquery import PyQuery as pq
import requests
import pymongo
import os

'''client = pymongo.MongoClient(host='localhost', port=27017)
db = client.安居客  # select the database; MongoDB creates it on first write if it does not exist
collection = db.合肥'''

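def gethtml(url):
    """Fetch a page and return its HTML text, or None on a non-200 response."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
    # A timeout keeps one stalled request from hanging the whole crawl.
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    return None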

def getaddress(html):
    """Walk the district links on the home page and crawl each district."""
    doc = pq(html)
    old_news = doc('#content_Rd1 div.clearfix .details').items()
    for i, old_new in enumerate(old_news):
        if i == 0:
            # The first .details block is treated as second-hand homes (二手房).
            print('二手房')
            old_houses = old_new.find('.areas a').items()
            for old_house in old_houses:
                house_url = old_house.attr('href')
                address = old_house.text()
                old_house_list(house_url, address)
        else:
            # Any later block is treated as new homes (新房).
            new_houses = old_new.find('.areas a').items()
            print('新房')
            for new_house in new_houses:
                house_url = new_house.attr('href')
                address = new_house.text()
                new_house_list(house_url, address)

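def old_house_list(url, address):
    """Visit every second-hand listing on a district's first page."""
    html = gethtml(url)
    if html is None:  # skip districts whose listing page failed to load
        return
    doc = pq(html)
    items = doc(
        '#houselist-mod-new .list-item .house-details .house-title').items()
    for item in items:
        url = item.find('a').attr('href')
        old_house_information(url, address)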

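def new_house_list(url, address):
    """Visit every new-home listing on a district's first page."""
    # The code assumes these hrefs are protocol-relative, so add the scheme.
    url = 'https:'+url
    html = gethtml(url)
    if html is None:  # skip districts whose listing page failed to load
        return
    doc = pq(html)
    items = doc('.key-list .item-mod').items()
    for item in items:
        url = item.find('a').attr('href')
        new_house_information(url, address)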

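def old_house_information(url, address):
    """Extract the title and details of one second-hand listing and save them."""
    html = gethtml(url)
    if html is None:  # skip listings whose detail page failed to load
        return
    doc = pq(html)
    title = doc('#content .clearfix h3').text()
    information = doc(
        '.houseInfo-wrap .houseInfo-detail-item').text().replace('\n', '')
    old = '二手房'  # category folder name: second-hand homes
    write_txt(title, information, address, old, url)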

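def new_house_information(url, address):
    """Extract the title and details of one new-home listing and save them."""
    html = gethtml(url)
    if html is None:  # skip listings whose detail page failed to load
        return
    doc = pq(html)
    title = doc('.basic-info .basic-fst').text().replace('\n', '')
    # Remove button captions that get mixed into the extracted text.
    information = doc('.basic-parms').text().replace('\n', '').replace('变价通知我', '').replace('全部户型', '').replace('开盘通知我', '').replace('查看地图','')
    new = '新盘'  # category folder name: new developments
    write_txt(title, information, address, new, url)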

def write_txt(title, content, address, infor, url):
    """Save one listing to 安居客/<category>/<district>/<title>.txt."""
    house_path = os.path.join('安居客', infor, address)
    os.makedirs(house_path, exist_ok=True)
    if title:
        # Strip spaces and the characters | and *, which produced invalid file names.
        file_name = title.replace(' ', '').replace('|', '').replace('*', '') + '.txt'
        file_path = os.path.join(house_path, file_name)
        if not os.path.exists(file_path):
            with open(file_path, 'w', encoding='utf-8') as f:
                print('正在爬取 '+address+' '+infor+'  '+' '+title)  # "crawling ..."
                f.write(content+'\n')
                f.write(url)
        else:
            print('已爬取 '+address+' '+infor+'  '+' '+title)  # "already crawled ..."

if __name__ == "__main__":
    url = 'https://hf.anjuke.com/'
    html = gethtml(url)
    if html is not None:  # only start the crawl if the home page loaded
        getaddress(html)
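The crawler above requests every page once, with no pause and no retry. The following is a hedged sketch of a retry wrapper with a growing delay that could sit around any fetch function returning HTML or None, such as gethtml. The name `fetch_with_retry` and its parameters are my own assumptions, not part of the original code.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, pausing between attempts.

    `fetch` is any callable that returns the page text or None on failure.
    `sleep` is injectable so the pause can be disabled in tests.
    """
    for attempt in range(1, retries + 1):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < retries:
            # Wait a little longer after each failed attempt.
            sleep(delay * attempt)
    return None
```

A call such as `fetch_with_retry(gethtml, url)` would then stand in for a direct `gethtml(url)` call; the pause also spaces out requests to the site.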

3. Results

![](https://p9-tt-ipv6.byteimg.com/origin/pgc-image/4a41fd56ce314531b76f34f2b5325cfa)

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/dc5c1b091f0a4de28c602f9a878a0286)

![](https://p1.pstatp.com/origin/pgc-image/92829728e1984888b4b2f5e215e0688c)

![](https://p1.pstatp.com/origin/pgc-image/d65ad61b4bc4414894517aba460c1efb)

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/606015578e564d6ba39bf021c73d0bf3)

![](https://p6-tt-ipv6.byteimg.com/origin/pgc-image/07cba75e354247db9152e15739eb959e)

![](https://p9-tt-ipv6.byteimg.com/origin/pgc-image/48f59125153444d9919be44cf2f165eb)

It really is pretty simple!

