Python Conda Virtual Environment Setup / Getting Started with Web Crawling


Python Environment Settings & Crawling

Initially I wanted to build a plugin that could fetch weather data automatically, but then I found there is already an API for that, so the project became pointless and I simply noted down some of the experience here.
A tutorial on how to write a plugin: developer.chrome.com/docs/extens…

Check robots.txt first

Before starting, I needed to determine which web page the data I want comes from, and whether crawling it is disallowed: www.hko.gov.hk/robots.txt
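
A quick programmatic check is also possible with Python's standard-library urllib.robotparser (a minimal sketch; the target page is the one crawled later in this post):

from urllib.robotparser import RobotFileParser

# fetch and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://www.hko.gov.hk/robots.txt')
rp.read()

# check whether a generic crawler may fetch the target page
page = 'https://www.hko.gov.hk/tc/wxinfo/dailywx/wxwarntoday.htm'
print(rp.can_fetch('*', page))  # True if crawling this path is allowed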

Environment Settings

When I first wrote the code, I ran into dependency conflicts, which caused some packages to misbehave even though they were installed. It is therefore recommended to create a virtual environment with Conda and run the program inside it.

Here are some steps for reference (assuming you already have Conda installed):

  1. Check existing environments: conda env list
  2. Create a new environment: conda create -n py36tf1 numpy pandas python=3.9
    This command creates a new environment called py36tf1 and installs Python 3.9, NumPy, and Pandas in that environment.
  3. Activate the environment: conda activate py36tf1
  4. Install the remaining packages you need, e.g. pip install requests...
  5. Start the crawler program: (py36tf1) ➜ python3 Py_weather_crawler.py (a quick check that you are actually inside the environment is sketched below)
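
To confirm the script really runs on the environment's interpreter rather than the system one, you can print the interpreter path from inside Python (a minimal sketch; py36tf1 is the environment created above):

import sys

# should point inside the conda environment, e.g. .../envs/py36tf1/bin/python3
print(sys.executable)
print(sys.prefix)  # root directory of the active environment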

Static vs. dynamic web page crawling

When I tried to use BeautifulSoup to save the page source and extract the data with a tag selector, I was surprised to find that the real-time data is inserted as a template variable, e.g. <div class="col-xs-5 ">{{item.IssueTime|formatTimeEN}}</div>, so I crawled the page by simulating browser behavior instead. For example, automated testing tools such as Selenium or Puppeteer can simulate a user browsing the page in a real browser and retrieve the dynamically generated content.
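
You can reproduce this with a plain requests fetch (a minimal sketch; the col-xs-5 selector comes from the snippet above, and whether it still appears depends on the current page markup): the static source contains the unrendered {{...}} placeholder rather than the actual issue time.

import requests
from bs4 import BeautifulSoup

url = 'https://www.hko.gov.hk/tc/wxinfo/dailywx/wxwarntoday.htm'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, 'html.parser')

# prints the raw template div, e.g. {{item.IssueTime|formatTimeEN}},
# not the rendered value
print(soup.find('div', class_='col-xs-5'))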

  1. ChromeDriver setup: Chrome and ChromeDriver versions must match. Check your Chrome version via Settings -> About Chrome and download the corresponding ChromeDriver version. (I downloaded the package from: chromedriver.storage.googleapis.com/index.html?…)
  2. Place the downloaded ChromeDriver in any directory, for example /usr/local/chromedriver, and add it to the PATH environment variable (an alternative that skips this step is sketched after the list):
  • vim ~/.profile
  • export PATH="$PATH:/usr/local/chromedriver"
  • source ~/.profile
  • Test by typing chromedriver in the terminal.
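
If you would rather not modify PATH, Selenium 4 also accepts the driver location directly (a minimal sketch; the /usr/local/chromedriver/chromedriver path is an assumption matching the directory used above):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run without opening a browser window

# pass the driver path explicitly instead of relying on PATH
service = Service('/usr/local/chromedriver/chromedriver')  # assumed location
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.hko.gov.hk/tc/wxinfo/dailywx/wxwarntoday.htm')
print(driver.title)
driver.quit()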

Sample crawling code

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def get_text(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '}  # build the request headers
        res = requests.get(url, headers=headers)  # send the request
        res.encoding = res.apparent_encoding  # set the encoding
        res.raise_for_status()  # raise an error if the request failed
        return res.text  # return the page source
    except Exception as e:
        print(e)  # print the exception message

def parse_warning(url):  # parse the page for the warnings currently in force

    # create a headless Chrome browser instance
    options = Options()
    options.add_argument('--headless')  # headless mode
    options.add_argument('--disable-gpu')
    options.binary_location = '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'  # path to the browser binary
    driver = webdriver.Chrome(options=options)

    # load the page and grab the fully rendered source
    driver.get(url)
    html = driver.page_source
    b_soup = BeautifulSoup(html, "html.parser")  # build the BeautifulSoup object

    info = {}  # dictionary for the extracted information

    # extract the fields we need, guarding against missing tags
    main_tag = b_soup.find(id='mainContent')
    title_tag = main_tag.find('h1') if main_tag else None
    info["title"] = title_tag.text if title_tag else None
    center_tag = b_soup.find(class_='self_align_center')
    update_tag = center_tag.find('em') if center_tag else None
    info["latest updated time"] = update_tag.text if update_tag else None

    # close the browser
    driver.quit()

    return info


if __name__ == '__main__':
    index_url = 'https://www.hko.gov.hk/tc/wxinfo/dailywx/wxwarntoday.htm'  # start page
    warning = parse_warning(index_url)  # fetch the page and extract the warning info
    # save and refresh a local copy of the page source
    content = get_text(index_url)
    with open("warning_webpage.html", 'w') as f:
        f.write(content)
    print(content)
    print(warning)