Python Web Scraping: Using the requests Library

requests

Beyond urllib there is the more powerful requests library; with it, handling cookies, login authentication, proxy configuration, and similar tasks becomes straightforward.

Installation

pip install requests

Official docs: https://requests.readthedocs.io/en/latest/

1. Introductory Example

The urlopen method in the urllib library actually requests a page with GET; the corresponding method in requests is simply get, which arguably expresses the intent more clearly. Let's look at an example:

import requests
r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

Quick tests of the other request methods:

import requests

r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

2. GET Requests

import requests
data = {
    'name': 'germey',
    'age': 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)
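
Since httpbin returns a JSON body, the response can also be parsed straight into a Python dict with r.json(); a small extension of the example above:

import requests

data = {'name': 'germey', 'age': 22}
r = requests.get('http://httpbin.org/get', params=data)
# r.json() parses the JSON response body into a dict
print(r.json()['args'])   # {'name': 'germey', 'age': '22'}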

2.1 Fetching Binary Data

Let's take an image as an example:

import requests
r = requests.get("http://qwmxpxq5y.hn-bkt.clouddn.com/hh.png")
print(r.text)
print(r.content)
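
To actually save the image, write r.content to a file opened in binary mode ('wb'):

import requests

r = requests.get("http://qwmxpxq5y.hn-bkt.clouddn.com/hh.png")
# 'wb' opens the file for writing in binary mode
with open('hh.png', 'wb') as f:
    f.write(r.content)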

Some sites cannot be requested properly without passing headers:

import requests
r = requests.get("https://mmzztt.com/")
print(r.text)

But if we add headers containing a User-Agent, it works fine:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get("https://mmzztt.com/", headers=headers)
print(r.text)

3. POST Requests

3.1 We have covered the most basic GET requests; another common type of request is POST. Sending a POST request with requests is just as simple:

import requests
data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

Test site

• Juchao Information (cninfo.com.cn) data: click the 资讯 (news) tab and select public information

import requests
url= 'http://www.cninfo.com.cn/data20/ints/statistics'
res = requests.post(url)
print(res.text)

3.2 After a request is sent, what we get back is of course a response. In the examples above we used text and content to read the response body; there are also many other attributes and methods for information such as the status code, response headers, and cookies. For example:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('http://www.jianshu.com', headers=headers)
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

3.3 The status code is commonly used to check whether a request succeeded, and requests also provides a built-in status code lookup object, requests.codes. For example:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('http://www.jianshu.com', headers=headers)
if r.status_code != requests.codes.ok:
    exit()
else:
    print('Request Successfully')

3.4 Naturally, ok is not the only condition code. The return codes and their corresponding lookup names are listed below:

# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),
# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
'resume_incomplete', 'resume',), # These 2 to be removed in 3.0
# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),
# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')
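
Each alias above simply maps to its numeric code, so it can be used as an attribute of requests.codes; for example (the httpbin URL is only for demonstration):

import requests

print(requests.codes.not_found)   # 404
print(requests.codes.teapot)      # 418

r = requests.get('http://httpbin.org/status/404')
if r.status_code == requests.codes.not_found:
    print('Page not found')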

4. Advanced Usage

1. Adding a proxy

import requests

proxy = {
    'http': 'http://183.162.171.78:4216',
}
# httpbin.org/ip returns the IP address the request appears to come from
res = requests.get('http://httpbin.org/ip', proxies=proxy)
print(res.text)
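
If the proxy requires authentication, the credentials can be embedded in the proxy URL; a minimal sketch (the username and password below are placeholders):

import requests

proxy = {
    'http': 'http://username:password@183.162.171.78:4216',   # placeholder credentials
    'https': 'http://username:password@183.162.171.78:4216',
}
res = requests.get('http://httpbin.org/ip', proxies=proxy)
print(res.text)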

2. Using Kuaidaili proxy IPs

Docs: https://www.kuaidaili.com/doc/dev/quickstart/

After opening the page, keep the default HTTP protocol and choose JSON as the return format; since my order is a VIP order, I select "stable" for stability. Then click "Generate link" and copy the API link that appears.
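
Below is a minimal sketch of how the generated API link might be used. The api_url is a placeholder for the link you just copied, and the {'data': {'proxy_list': [...]}} layout is assumed from the quickstart docs, so verify both against your own order:

import requests

# Placeholder: paste the API link generated in the Kuaidaili console here
api_url = 'YOUR_GENERATED_API_LINK'

# Assumed JSON layout: {'data': {'proxy_list': ['ip:port', ...]}}
proxy_ip = requests.get(api_url).json()['data']['proxy_list'][0]

proxies = {
    'http': 'http://' + proxy_ip,
    'https': 'http://' + proxy_ip,
}
print(requests.get('http://httpbin.org/ip', proxies=proxies).text)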


3. Suppressing warnings

from requests.packages import urllib3
urllib3.disable_warnings()
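
Suppressing these warnings usually goes together with verify=False, which skips TLS certificate verification; a minimal sketch (the URL is only illustrative):

import requests
from requests.packages import urllib3

urllib3.disable_warnings()

# verify=False disables certificate verification; without disable_warnings()
# each such request would emit an InsecureRequestWarning
r = requests.get('https://httpbin.org/get', verify=False)
print(r.status_code)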

Crawler workflow

The general flow is: define the target URLs, send the requests, parse the responses, then extract and save the data. The basic crawler below walks through exactly these steps.

5. A Basic Crawler

import requests
from lxml import etree

def main():
    # 1. Define the page URLs and the parsing rule
    crawl_urls = [
        'https://36kr.com/p/1328468833360133',
        'https://36kr.com/p/1328528129988866',
        'https://36kr.com/p/1328512085344642'
    ]
    parse_rule = "//h1[contains(@class,'article-title margin-bottom-20 common-width')]/text()"

    for url in crawl_urls:
        # 2. Send the HTTP request
        response = requests.get(url)
        # 3. Parse the HTML and extract the title
        result = etree.HTML(response.text).xpath(parse_rule)[0]
        # 4. Save (here simply print) the result
        print(result)

if __name__ == '__main__':
    main()

6. Full-Site Crawling

6.1 Encapsulating a Shared Base Class

Create a utils package (imported below as xl.base) and write a base class for other modules to use:

import requests
from retrying import retry
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from lxml import etree
import random,time

class FakeChromeUA:
    """Generate a randomized Chrome User-Agent string."""
    first_num = random.randint(55, 62)
    third_num = random.randint(0, 3200)
    fourth_num = random.randint(0, 140)
    os_type = [
                '(Windows NT 6.1; WOW64)', '(Windows NT 10.0; WOW64)', '(X11; Linux x86_64)','(Macintosh; Intel Mac OS X 10_12_6)'
               ]

    chrome_version = 'Chrome/{}.0.{}.{}'.format(first_num, third_num, fourth_num)

    @classmethod
    def get_ua(cls):
        return ' '.join(['Mozilla/5.0', random.choice(cls.os_type), 'AppleWebKit/537.36','(KHTML, like Gecko)', cls.chrome_version, 'Safari/537.36'])


class Spiders(FakeChromeUA):
    urls = []
    @retry(stop_max_attempt_number=3, wait_fixed=2000)
    def fetch(self, url, param=None, headers=None):
        try:
            # Always send a randomized User-Agent
            if not headers:
                headers = {}
            headers['user-agent'] = self.get_ua()
            self.wait_some_time()
            response = requests.get(url, params=param,headers=headers)
            if response.status_code == 200:
                response.encoding = 'utf-8'
                return response
        except requests.ConnectionError:
            return

    def wait_some_time(self):
        # Sleep 100-300 ms between requests
        time.sleep(random.randint(100, 300) / 1000)

6.2 A Worked Example

from urllib.parse import urljoin

import requests
from lxml import etree
from queue import Queue
from xl.base import Spiders
from pymongo import MongoClient  

# Helper: return the first element of a list, or None if the list is empty
flt = lambda x: x[0] if x else None
class Crawl(Spiders):
    base_url = 'https://36kr.com/'
    # Seed URL
    start_url = 'https://36kr.com/information/technology'
    # Parsing rules
    rules = {
        # Article list links
        'list_urls': '//p[@class="article-item-pic-wrapper"]/a/@href',
        # Detail page content
        'detail_urls': '//p[@class="common-width margin-bottom-20"]//text()',
        # Title
        'title': '//h1[@class="article-title margin-bottom-20 common-width"]/text()',
    }
    # Queue of article URLs
    list_queue = Queue()

    def crawl(self, url):
        """首页"""
        response =self.fetch(url)
        list_urls = etree.HTML(response.text).xpath(self.rules['list_urls'])
        # print(urljoin(self.base_url, list_urls))
        for list_url in list_urls:
            # print(urljoin(self.base_url, list_url))  # inspect each article URL
            self.list_queue.put(urljoin(self.base_url, list_url))

    def list_loop(self):
        """采集列表页"""
        while True:
            list_url = self.list_queue.get()
            print(self.list_queue.qsize())
            self.crawl_detail(list_url)
            # Exit once the queue is empty
            if self.list_queue.empty():
                break

    def crawl_detail(self,url):
        '''Detail page: extract the title and content'''
        response = self.fetch(url)
        html = etree.HTML(response.text)
        content = html.xpath(self.rules['detail_urls'])
        title = flt(html.xpath(self.rules['title']))
        print(title)
        data = {
            'content':content,
            'title':title
        }
        self.save_mongo(data)

    def save_mongo(self,data):
        client = MongoClient()  # connect to MongoDB
        col = client['python']['hh']
        if isinstance(data, dict):
            res = col.insert_one(data)
            return res
        else:
            return 'A single record must be a dict like {"name": "age"}; got %s' % type(data)

    def main(self):
        # 1. Crawl the category listing page
        self.crawl(self.start_url)
        self.list_loop()

if __name__ == '__main__':
    s = Crawl()
    s.main()

File operation mode flags
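
For reference, the common mode flags for Python's built-in open() are:

Flag   Meaning
r      read (default); the file must already exist
w      write; truncate the file, creating it if necessary
a      append; create the file if it does not exist
x      exclusive creation; fail if the file already exists
b      binary mode (combined with others, e.g. 'rb', 'wb')
t      text mode (default)
+      open for both reading and writing (e.g. 'r+', 'w+')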


requests-cache

pip install requests-cache

When writing a crawler, we often run into situations like these:

• The target site is complex and we end up issuing many duplicate requests.

• The crawler stops unexpectedly, and because we did not save the crawl state, rerunning it means fetching everything again.

The requests-cache library addresses both by caching responses locally. Test comparison: first, ten requests without caching.

import requests
import time

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

Test comparison 2: the same ten requests through a requests_cache.CachedSession.

import requests_cache
import time

start = time.time()
session = requests_cache.CachedSession('demo_cache')

for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

In the code above, though, we replaced the requests session object outright. Is there another way, one that leaves the existing code untouched and only adds a few lines of initialization at the top to enable requests-cache?

import time
import requests
import requests_cache

requests_cache.install_cache('demo_cache')

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

This time we simply call requests-cache's install_cache method, and the ordinary requests Session is used exactly as before.

We also now know that requests-cache uses SQLite as its default cache backend. Can this be changed, say to plain files or another database?

Of course it can.

For example, to switch the backend to the local filesystem:

requests_cache.install_cache('demo_cache', backend='filesystem')

If you'd rather not generate files in the working directory, you can use the system cache directory instead:

requests_cache.install_cache('demo_cache', backend='filesystem', use_cache_dir=True)

Besides the filesystem, requests-cache also supports other backends such as Redis, MongoDB, GridFS, and even in-memory storage. Each needs its corresponding dependency library; see the table below:

Backend      Class           Alias          Dependencies
SQLite       SQLiteCache     'sqlite'       -
Redis        RedisCache      'redis'        redis-py
MongoDB      MongoCache      'mongodb'      pymongo
GridFS       GridFSCache     'gridfs'       pymongo
DynamoDB     DynamoDbCache   'dynamodb'     boto3
Filesystem   FileCache       'filesystem'   -
Memory       BaseCache       'memory'       -
For example, to switch to Redis:

backend = requests_cache.RedisCache(host='localhost', port=6379)
requests_cache.install_cache('demo_cache', backend=backend)

More detailed backend configuration is covered in the official docs: https://requests-cache.readthedocs.io/en/stable/user_guide/backends.html#backends

Sometimes we also want to exclude certain requests from the cache; for example, to cache only POST requests and not GET requests, configure it like this:

import time
import requests
import requests_cache

requests_cache.install_cache('demo_cache2', allowable_methods=['POST'])

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for get', end - start)
start = time.time()

for i in range(10):
    session.post('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for post', end - start)

We can also match URLs, specifying how long responses for a given URL pattern are cached. The values are expiration times in seconds, and -1 means the entry never expires:

urls_expire_after = {'*.site_1.com': 30, 'site_2.com/static': -1}
requests_cache.install_cache('demo_cache2', urls_expire_after=urls_expire_after)

That covers the common usage: basic configuration, expiration times, backends, and filters. For more details see the official docs: https://requests-cache.readthedocs.io/en/stable/user_guide.html.
