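This is day 11 of my participation in the 2022 First Post Challenge (2022首次更文挑战); see the challenge page for event details.
Previous post: juejin.cn/post/705787… The Spring Festival holiday is just two days away, and I wonder how many people are still at work. Work has been piling up lately, with tasks already scheduled into the new year. It is hard to keep up, and nothing I am currently working on is something I am good at. Still, life goes on: whatever I can't do yet, I'll learn step by step, and I hope the company gives me the time to do that.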
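Advanced usage of urllib
Handler: a Handler can be thought of as a processor. Some Handlers deal with cookies, others deal with login authentication, and so on.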
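To make the "processor" idea a bit more concrete, here is a minimal sketch of the cookie case mentioned above. It assumes the standard-library `HTTPCookieProcessor` and `CookieJar` classes are the ones meant; the variable names are my own, and no request is actually sent.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# A CookieJar stores whatever cookies the server sets (illustrative name)
cookie_jar = CookieJar()

# The Handler that reads and writes cookies on every request/response
cookie_handler = HTTPCookieProcessor(cookie_jar)

# build_opener() chains our Handler together with the default ones
opener = build_opener(cookie_handler)

# Nothing has been requested yet, so the jar is still empty.
# After a call such as opener.open(some_url), any cookies set by
# the server would show up when iterating over cookie_jar.
print(len(cookie_jar))
```

The pattern is the same for every Handler: create it, pass it to `build_opener`, and use the returned opener instead of `urlopen`.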
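Handlers in urllib
- Authentication: when we visit a website, an authentication dialog may pop up. This means the site has enabled basic authentication, formally called HTTP Basic Access Authentication. So how does a crawler make the request? For example, suppose we want to visit this page (https://ssr3.scrape.center/, the URL used in the code).
We can use the following code: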
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError
import ssl

# Skip HTTPS certificate verification (only acceptable for a practice site like this one)
ssl._create_default_https_context = ssl._create_unverified_context

username = "admin"
password = "admin"
url = "https://ssr3.scrape.center/"

# Register the credentials for this URL; realm=None matches any realm
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)

# Build an opener that can answer the Basic authentication challenge
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode("utf-8")
    print(html)
except URLError as e:
    print(e.reason)
This way we can get past the login authentication, and the page source is printed:
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link rel="icon" href="/static/img/favicon.ico">
<title>Scrape | Movie</title>
<link href="/static/css/app.css" type="text/css" rel="stylesheet">
<link href="/static/css/index.css" type="text/css" rel="stylesheet">
</head>
<body>
<div id="app">
<div data-v-74e8b908="" class="el-row" id="header">
<div data-v-74e8b908="" class="container el-col el-col-18 el-col-offset-3">
<div data-v-74e8b908="" class="el-row">
<div data-v-74e8b908="" class="logo el-col el-col-4">
<a data-v-74e8b908="" href="/" class="router-link-exact-active router-link-active">
<img data-v-74e8b908="" src="/static/img/logo.png" class="logo-image">
<span data-v-74e8b908="" class="logo-title">Scrape</span>
</a>
</div>
</div>
</div>
</div>
<div data-v-7f856186="" id="index">
<div data-v-7f856186="" class="el-row">
<div data-v-7f856186="" class="el-col el-col-18 el-col-offset-3">
<div data-v-7f856186="" class="el-card item m-t is-hover-shadow">
<div class="el-card__body">
<div data-v-7f856186="" class="el-row">
<div data-v-7f856186="" class="el-col el-col-24 el-col-xs-8 el-col-sm-6 el-col-md-4">
<a data-v-7f856186=""
href="/detail/1"
class="">
<img
data-v-7f856186=""
src="https://p0.meituan.net/movie/ce4da3e03e655b5b88ed31b5cd7896cf62472.jpg@464w_644h_1e_1c"
class="cover">
</a>
</div>
<div data-v-7f856186="" class="p-h el-col el-col-24 el-col-xs-9 el-col-sm-13 el-col-md-16">
<a data-v-7f856186="" href="/detail/1" class="name">
<h2 data-v-7f856186="" class="m-b-sm">霸王别姬 - Farewell My Concubine</h2>
</a>
<div data-v-7f856186="" class="categories">
<button data-v-7f856186="" type="button"
class="el-button category el-button--primary el-button--mini">
<span>剧情</span>
</button>
<button data-v-7f856186="" type="button"
class="el-button category el-button--primary el-button--mini">
<span>爱情</span>
</button>
</div>
<div data-v-7f856186="" class="m-v-sm info">
<span data-v-7f856186="">中国内地、中国香港</span>
<span data-v-7f856186=""> / </span>
<span data-v-7f856186="">171 分钟</span>
</div>
<div data-v-7f856186="" class="m-v-sm info">
<span data-v-7f856186="">1993-07-26 上映</span>
</div>
</div>
......
</div>
</body>
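What the code actually does is instantiate an HTTPBasicAuthHandler, passing it an HTTPPasswordMgrWithDefaultRealm object.
The add_password method adds the username and password to that object, and together they give us a Handler that takes care of the authentication.
The result we get back is the source code of the page shown above.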
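For anyone curious what ends up on the wire: as far as I understand it, once the server answers with 401 the Handler retries the request with an `Authorization` header, which under HTTP Basic authentication is just `username:password` encoded in Base64. The sketch below builds that header by hand. The helper name `basic_auth_header` is my own invention, not part of urllib, and nothing is sent over the network here.

```python
import base64
from urllib.request import Request


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP Basic auth (hypothetical helper)."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# Attach the header to a request object without sending it
request = Request("https://ssr3.scrape.center/")
request.add_header("Authorization", basic_auth_header("admin", "admin"))

print(request.get_header("Authorization"))  # Basic YWRtaW46YWRtaW4=
```

Base64 is an encoding, not encryption, so Basic authentication should only be trusted over HTTPS. In day-to-day use, letting HTTPBasicAuthHandler do the work is simpler than building the header yourself.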
The Spring Festival holiday starts in two days. Happy New Year, everyone!