Off Work, Learning a Bit of Python Scraping, Day 3: Using the Basic Libraries (urllib), Handling Login Authentication


This is day 11 of my participation in the 2022 First Post Challenge (2022首次更文挑战).

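Previous post: juejin.cn/post/705787…

Two more days until the Spring Festival holiday, and I wonder how many people are still at work. There has been so much going on at work lately that tasks are already queued up past the new year. It is hard to deal with, because nothing I am working on right now is something I am good at. Still, life goes on: if we can't do something yet, we learn it step by step. I just hope the company gives me the time for that.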

Advanced usage of urllib

Handlers: a Handler can be thought of as a processor. There are handlers for cookies, handlers for login authentication, and so on.
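To make the "processor" idea concrete, here is a minimal sketch of plugging a cookie handler into an opener. It is not from the original post; it only uses the standard library's HTTPCookieProcessor and CookieJar, and it makes no network request.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, OpenerDirector, build_opener

# Cookies set by the server would be collected in this jar.
cookie_jar = CookieJar()

# build_opener accepts any number of handlers and chains them
# together with urllib's default ones.
opener = build_opener(HTTPCookieProcessor(cookie_jar))

print(isinstance(opener, OpenerDirector))
print(len(cookie_jar))  # no request has been made yet, so the jar is empty
```

After a call such as opener.open(some_url) (some_url being a placeholder), any cookies returned by the server should show up when iterating over cookie_jar.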

Handlers in urllib

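  • Authentication: when we visit some websites, an authentication dialog like this may pop up. That means the site has enabled basic authentication, known as HTTP Basic Access Authentication. So how should a crawler make the request? Suppose, for example, that we want to access this page.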

We can use the following code:

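from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError
import ssl

# Disable HTTPS certificate verification globally via a private ssl hook.
# Acceptable for this practice site only; avoid it in production code.
ssl._create_default_https_context = ssl._create_unverified_context

username = "admin"
password = "admin"
url = "https://ssr3.scrape.center/"

# Register the credentials; realm=None makes them the default for this URL.
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)

# Build an opener that answers Basic auth challenges with these credentials.
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode("utf-8")
    print(html)
except URLError as e:
    print(e.reason)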

This way we can pass the login authentication, and the response looks like this:

<html lang="en">
<head>
  
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width,initial-scale=1">
  <link rel="icon" href="/static/img/favicon.ico">
  <title>Scrape | Movie</title>
  

  <link href="/static/css/app.css" type="text/css" rel="stylesheet">
  
<link href="/static/css/index.css" type="text/css" rel="stylesheet">

</head>
<body>
<div id="app">
  <div data-v-74e8b908="" class="el-row" id="header">
    <div data-v-74e8b908="" class="container el-col el-col-18 el-col-offset-3">
      <div data-v-74e8b908="" class="el-row">
        <div data-v-74e8b908="" class="logo el-col el-col-4">
          <a data-v-74e8b908="" href="/" class="router-link-exact-active router-link-active">
            <img data-v-74e8b908="" src="/static/img/logo.png" class="logo-image">
            <span data-v-74e8b908="" class="logo-title">Scrape</span>
          </a>
        </div>
      </div>
    </div>
  </div>
  
<div data-v-7f856186="" id="index">
  <div data-v-7f856186="" class="el-row">
    <div data-v-7f856186="" class="el-col el-col-18 el-col-offset-3">
      
      <div data-v-7f856186="" class="el-card item m-t is-hover-shadow">
        <div class="el-card__body">
          <div data-v-7f856186="" class="el-row">
            <div data-v-7f856186="" class="el-col el-col-24 el-col-xs-8 el-col-sm-6 el-col-md-4">
              <a data-v-7f856186=""
                 href="/detail/1"
                 class="">
                <img
                    data-v-7f856186=""
                    src="https://p0.meituan.net/movie/ce4da3e03e655b5b88ed31b5cd7896cf62472.jpg@464w_644h_1e_1c"
                    class="cover">
              </a>
            </div>
            <div data-v-7f856186="" class="p-h el-col el-col-24 el-col-xs-9 el-col-sm-13 el-col-md-16">
              <a data-v-7f856186="" href="/detail/1" class="name">
                <h2 data-v-7f856186="" class="m-b-sm">霸王别姬 - Farewell My Concubine</h2>
              </a>
              <div data-v-7f856186="" class="categories">
                
                <button data-v-7f856186="" type="button"
                        class="el-button category el-button--primary el-button--mini">
                  <span>剧情</span>
                </button>
                
                <button data-v-7f856186="" type="button"
                        class="el-button category el-button--primary el-button--mini">
                  <span>爱情</span>
                </button>
                
              </div>
              <div data-v-7f856186="" class="m-v-sm info">
                <span data-v-7f856186="">中国内地、中国香港</span>
                <span data-v-7f856186=""> / </span>
                <span data-v-7f856186="">171 分钟</span>
              </div>
              <div data-v-7f856186="" class="m-v-sm info">
                
                <span data-v-7f856186="">1993-07-26 上映</span>
                
              </div>
            </div>
        
                ......
             
</div>
</body>

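What the code actually does is instantiate an HTTPBasicAuthHandler whose argument is an HTTPPasswordMgrWithDefaultRealm object. The add_password method registers the username and password on that object, and the resulting Handler is then passed to build_opener.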

The result we get back is the page source shown above.
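For readers curious about what HTTP Basic Access Authentication puts on the wire, the following is a rough sketch that builds the Authorization header by hand: the scheme name "Basic" followed by the Base64 encoding of "username:password". The helper name basic_auth_header is made up for this illustration and is not part of urllib. The script only constructs a Request object and sends nothing.

```python
import base64
from urllib.request import Request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header (hypothetical helper)."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# Same URL and credentials as in the example above.
request = Request("https://ssr3.scrape.center/")
request.add_header("Authorization", basic_auth_header("admin", "admin"))

print(request.get_header("Authorization"))  # prints "Basic YWRtaW46YWRtaW4="
```

Base64 is an encoding, not encryption, so the credentials are only as safe as the connection carrying them. The Handler-based approach has the advantage that the credentials are kept in the password manager and attached only for the registered URL.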

Two more days until the Spring Festival holiday. Happy New Year, folks!