Python Web Scraping: Using the requests Library

requests

Beyond urllib there is the more powerful requests library; with it, handling cookies, login authentication, proxy configuration, and similar tasks becomes straightforward.

Installation

pip install requests

Official docs: https://requests.readthedocs.io/en/latest/

1. Introductory Example

The urlopen method in the urllib library actually requests a page with GET; the corresponding method in requests is simply get, which arguably expresses the intent more clearly. Let's look at an example:

import requests
r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

Quick tests of the other request methods:

import requests

r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

2. GET Requests

import requests
data = {
    'name': 'germey',
    'age': 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)
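
Since httpbin returns a JSON body, the response can also be parsed straight into a Python dict with r.json(); a small extension of the example above:

import requests

data = {'name': 'germey', 'age': 22}
r = requests.get('http://httpbin.org/get', params=data)
# r.json() parses the JSON response body into a dict
print(r.json()['args'])   # {'name': 'germey', 'age': '22'}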

2.1 Fetching Binary Data

Let's take an image as an example:

import requests
r = requests.get("http://qwmxpxq5y.hn-bkt.clouddn.com/hh.png")
print(r.text)
print(r.content)
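
To actually save the image, write r.content to a file opened in binary mode ('wb'):

import requests

r = requests.get("http://qwmxpxq5y.hn-bkt.clouddn.com/hh.png")
# 'wb' opens the file for writing in binary mode
with open('hh.png', 'wb') as f:
    f.write(r.content)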

Some sites cannot be requested properly without passing headers:

import requests
r = requests.get("https://mmzztt.com/")
print(r.text)

But if we add headers containing a User-Agent, it works fine:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get("https://mmzztt.com/", headers=headers)
print(r.text)

3. POST Requests

3.1 We have covered the most basic GET requests; another common type of request is POST. Sending a POST request with requests is just as simple:

import requests
data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

Test site

• Juchao Information (cninfo.com.cn) data: click the 资讯 (news) tab and select public information

import requests
url= 'http://www.cninfo.com.cn/data20/ints/statistics'
res = requests.post(url)
print(res.text)

3.2 After a request is sent, what we get back is of course a response. In the examples above we used text and content to read the response body; there are also many other attributes and methods for information such as the status code, response headers, and cookies. For example:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('http://www.jianshu.com', headers=headers)
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

3.3 The status code is commonly used to check whether a request succeeded, and requests also provides a built-in status code lookup object, requests.codes. For example:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('http://www.jianshu.com', headers=headers)
if r.status_code != requests.codes.ok:
    exit()
else:
    print('Request Successfully')

3.4 Naturally, ok is not the only condition code. The return codes and their corresponding lookup names are listed below:

# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),
# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
'resume_incomplete', 'resume',), # These 2 to be removed in 3.0
# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),
# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')
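
Each alias above simply maps to its numeric code, so it can be used as an attribute of requests.codes; for example (the httpbin URL is only for demonstration):

import requests

print(requests.codes.not_found)   # 404
print(requests.codes.teapot)      # 418

r = requests.get('http://httpbin.org/status/404')
if r.status_code == requests.codes.not_found:
    print('Page not found')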

4. Advanced Usage

1. Adding a proxy

import requests

proxy = {
    'http': 'http://183.162.171.78:4216',
}
# httpbin.org/ip returns the IP address the request appears to come from
res = requests.get('http://httpbin.org/ip', proxies=proxy)
print(res.text)
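
If the proxy requires authentication, the credentials can be embedded in the proxy URL; a minimal sketch (the username and password below are placeholders):

import requests

proxy = {
    'http': 'http://username:password@183.162.171.78:4216',   # placeholder credentials
    'https': 'http://username:password@183.162.171.78:4216',
}
res = requests.get('http://httpbin.org/ip', proxies=proxy)
print(res.text)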

2. Using Kuaidaili proxy IPs

Docs: https://www.kuaidaili.com/doc/dev/quickstart/

After opening the page, keep the default HTTP protocol and choose JSON as the return format; since my order is a VIP order, I select "stable" for stability. Then click "Generate link" and copy the API link that appears.
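
Below is a minimal sketch of how the generated API link might be used. The api_url is a placeholder for the link you just copied, and the {'data': {'proxy_list': [...]}} layout is assumed from the quickstart docs, so verify both against your own order:

import requests

# Placeholder: paste the API link generated in the Kuaidaili console here
api_url = 'YOUR_GENERATED_API_LINK'

# Assumed JSON layout: {'data': {'proxy_list': ['ip:port', ...]}}
proxy_ip = requests.get(api_url).json()['data']['proxy_list'][0]

proxies = {
    'http': 'http://' + proxy_ip,
    'https': 'http://' + proxy_ip,
}
print(requests.get('http://httpbin.org/ip', proxies=proxies).text)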


3. Suppressing warnings

from requests.packages import urllib3
urllib3.disable_warnings()
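
Suppressing these warnings usually goes together with verify=False, which skips TLS certificate verification; a minimal sketch (the URL is only illustrative):

import requests
from requests.packages import urllib3

urllib3.disable_warnings()

# verify=False disables certificate verification; without disable_warnings()
# each such request would emit an InsecureRequestWarning
r = requests.get('https://httpbin.org/get', verify=False)
print(r.status_code)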

Crawler workflow

The general flow is: define the target URLs, send the requests, parse the responses, then extract and save the data. The basic crawler below walks through exactly these steps.

5. A Basic Crawler

import requests
from lxml import etree

def main():
    # 1. Define the page URLs and the parsing rule
    crawl_urls = [
        'https://36kr.com/p/1328468833360133',
        'https://36kr.com/p/1328528129988866',
        'https://36kr.com/p/1328512085344642'
    ]
    parse_rule = "//h1[contains(@class,'article-title margin-bottom-20 common-width')]/text()"

    for url in crawl_urls:
        # 2. Send the HTTP request
        response = requests.get(url)
        # 3. Parse the HTML and extract the title
        result = etree.HTML(response.text).xpath(parse_rule)[0]
        # 4. Save (here simply print) the result
        print(result)

if __name__ == '__main__':
    main()

6. Full-Site Crawling

6.1 Encapsulating a Shared Base Class

Create a utils package (imported below as xl.base) and write a base class for other modules to use:

import requests
from retrying import retry
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from lxml import etree
import random,time

class FakeChromeUA:
    """Generate a randomized Chrome User-Agent string."""
    first_num = random.randint(55, 62)
    third_num = random.randint(0, 3200)
    fourth_num = random.randint(0, 140)
    os_type = [
                '(Windows NT 6.1; WOW64)', '(Windows NT 10.0; WOW64)', '(X11; Linux x86_64)','(Macintosh; Intel Mac OS X 10_12_6)'
               ]

    chrome_version = 'Chrome/{}.0.{}.{}'.format(first_num, third_num, fourth_num)

    @classmethod
    def get_ua(cls):
        return ' '.join(['Mozilla/5.0', random.choice(cls.os_type), 'AppleWebKit/537.36','(KHTML, like Gecko)', cls.chrome_version, 'Safari/537.36'])


class Spiders(FakeChromeUA):
    urls = []
    @retry(stop_max_attempt_number=3, wait_fixed=2000)
    def fetch(self, url, param=None, headers=None):
        try:
            # Always send a randomized User-Agent
            if not headers:
                headers = {}
            headers['user-agent'] = self.get_ua()
            self.wait_some_time()
            response = requests.get(url, params=param,headers=headers)
            if response.status_code == 200:
                response.encoding = 'utf-8'
                return response
        except requests.ConnectionError:
            return

    def wait_some_time(self):
        # Sleep 100-300 ms between requests
        time.sleep(random.randint(100, 300) / 1000)

6.2 A Worked Example

from urllib.parse import urljoin

import requests
from lxml import etree
from queue import Queue
from xl.base import Spiders
from pymongo import MongoClient  

# Helper: return the first element of a list, or None if the list is empty
flt = lambda x: x[0] if x else None
class Crawl(Spiders):
    base_url = 'https://36kr.com/'
    # Seed URL
    start_url = 'https://36kr.com/information/technology'
    # Parsing rules
    rules = {
        # Article list links
        'list_urls': '//p[@class="article-item-pic-wrapper"]/a/@href',
        # Detail page content
        'detail_urls': '//p[@class="common-width margin-bottom-20"]//text()',
        # Title
        'title': '//h1[@class="article-title margin-bottom-20 common-width"]/text()',
    }
    # Queue of article URLs
    list_queue = Queue()

    def crawl(self, url):
        """首页"""
        response =self.fetch(url)
        list_urls = etree.HTML(response.text).xpath(self.rules['list_urls'])
        # print(urljoin(self.base_url, list_urls))
        for list_url in list_urls:
            # print(urljoin(self.base_url, list_url))  # inspect each article URL
            self.list_queue.put(urljoin(self.base_url, list_url))

    def list_loop(self):
        """采集列表页"""
        while True:
            list_url = self.list_queue.get()
            print(self.list_queue.qsize())
            self.crawl_detail(list_url)
            # Exit once the queue is empty
            if self.list_queue.empty():
                break

    def crawl_detail(self,url):
        '''Detail page: extract the title and content'''
        response = self.fetch(url)
        html = etree.HTML(response.text)
        content = html.xpath(self.rules['detail_urls'])
        title = flt(html.xpath(self.rules['title']))
        print(title)
        data = {
            'content':content,
            'title':title
        }
        self.save_mongo(data)

    def save_mongo(self,data):
        client = MongoClient()  # connect to MongoDB
        col = client['python']['hh']
        if isinstance(data, dict):
            res = col.insert_one(data)
            return res
        else:
            return 'A single record must be a dict like {"name": "age"}; got %s' % type(data)

    def main(self):
        # 1. Crawl the category listing page
        self.crawl(self.start_url)
        self.list_loop()

if __name__ == '__main__':
    s = Crawl()
    s.main()

File operation mode flags
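
For reference, the common mode flags for Python's built-in open() are:

Flag   Meaning
r      read (default); the file must already exist
w      write; truncate the file, creating it if necessary
a      append; create the file if it does not exist
x      exclusive creation; fail if the file already exists
b      binary mode (combined with others, e.g. 'rb', 'wb')
t      text mode (default)
+      open for both reading and writing (e.g. 'r+', 'w+')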


requests-cache

pip install requests-cache

When writing a crawler, we often run into situations like these:

• The target site is complex and we end up issuing many duplicate requests.

• The crawler stops unexpectedly, and because we did not save the crawl state, rerunning it means fetching everything again.

The requests-cache library addresses both by caching responses locally. Test comparison: first, ten requests without caching.

import requests
import time

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

Test comparison 2: the same ten requests through a requests_cache.CachedSession.

import requests_cache
import time

start = time.time()
session = requests_cache.CachedSession('demo_cache')

for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

In the code above, though, we replaced the requests session object outright. Is there another way, one that leaves the existing code untouched and only adds a few lines of initialization at the top to enable requests-cache?

import time
import requests
import requests_cache

requests_cache.install_cache('demo_cache')

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

This time we simply call requests-cache's install_cache method, and the ordinary requests Session is used exactly as before.

We also now know that requests-cache uses SQLite as its default cache backend. Can this be changed, say to plain files or another database?

Of course it can.

For example, to switch the backend to the local filesystem:

requests_cache.install_cache('demo_cache', backend='filesystem')

If you'd rather not generate files in the working directory, you can use the system cache directory instead:

requests_cache.install_cache('demo_cache', backend='filesystem', use_cache_dir=True)

Besides the filesystem, requests-cache also supports other backends such as Redis, MongoDB, GridFS, and even in-memory storage. Each needs its corresponding dependency library; see the table below:

Backend      Class           Alias          Dependencies
SQLite       SQLiteCache     'sqlite'       -
Redis        RedisCache      'redis'        redis-py
MongoDB      MongoCache      'mongodb'      pymongo
GridFS       GridFSCache     'gridfs'       pymongo
DynamoDB     DynamoDbCache   'dynamodb'     boto3
Filesystem   FileCache       'filesystem'   -
Memory       BaseCache       'memory'       -
For example, to switch to Redis:

backend = requests_cache.RedisCache(host='localhost', port=6379)
requests_cache.install_cache('demo_cache', backend=backend)

More detailed backend configuration is covered in the official docs: https://requests-cache.readthedocs.io/en/stable/user_guide/backends.html#backends

Sometimes we also want to exclude certain requests from the cache; for example, to cache only POST requests and not GET requests, configure it like this:

import time
import requests
import requests_cache

requests_cache.install_cache('demo_cache2', allowable_methods=['POST'])

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for get', end - start)
start = time.time()

for i in range(10):
    session.post('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for post', end - start)

We can also match URLs, specifying how long responses for a given URL pattern are cached. The values are expiration times in seconds, and -1 means the entry never expires:

urls_expire_after = {'*.site_1.com': 30, 'site_2.com/static': -1}
requests_cache.install_cache('demo_cache2', urls_expire_after=urls_expire_after)

That covers the common usage: basic configuration, expiration times, backends, and filters. For more details see the official docs: https://requests-cache.readthedocs.io/en/stable/user_guide.html.
