
Scraping 极简壁纸 (bz.zzzmh.cn) Wallpapers with Python

Dynamic scraping of an Ajax-rendered page
Target site: https://bz.zzzmh.cn/#anime

Approach:

  1. Since the page is rendered via Ajax, capture the XHR requests in the browser's developer tools; there is only one:
    Request URL: https://api.zzzmh.cn/bz/getJson
    Request Method: POST
    Status Code: 200
    Remote Address: 140.143.248.87:443
    Referrer Policy: no-referrer-when-downgrade
    From the screenshot (20200328-1-1) we can see the POST parameters:
    target is the category tag and pageNum is the requested page number.
    Because this endpoint expects a JSON payload rather than form data, serialize the dict with json.dumps before passing it as data:

    data = {
        'target': "anime",
        'pageNum': '1'
    }
    res = requests.post(url, data=json.dumps(data), headers=headers)
  2. Check the Preview tab (screenshot 20200328-1-2); each entry looks like:
    {t: "j", x: 3840, i: "4xlkjz", y: 1080}
    • t is the image format: j for jpg, p for png
    • x and y are the image dimensions
    • i is the image name
      From this we can write a regular expression to extract the image format and name:
      img_re = re.findall('"t":"([jp])","x":(.*?),"i":"(.*?)"', res.text)
  3. Open the first image: its request URL is https://w.wallhaven.cc/full/4x/wallhaven-4xlkjz.jpg
    The second image's URL is https://w.wallhaven.cc/full/lq/wallhaven-lqzp62.jpg
    It is easy to see how the image URL is derived from the image name: the directory is the first two characters of the name.
    base_url = 'https://w.wallhaven.cc/full/'
    url = base_url + img_name[0:2] + '/wallhaven-' + img_name + img_type
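Steps 2 and 3 can be checked together on a sample of the response text; the entry and the resulting URL below are taken from the article's own examples:

```python
import re

# A response fragment in the shape shown in step 2 (values from the screenshots above)
sample = '{"t":"j","x":3840,"i":"4xlkjz","y":1080}'

# Same pattern as in the article: format flag, width, image name
matches = re.findall('"t":"([jp])","x":(.*?),"i":"(.*?)"', sample)
img_type, _, img_name = matches[0]

# Step 3's rule: the URL directory is the first two characters of the name
base_url = 'https://w.wallhaven.cc/full/'
ext = '.jpg' if img_type == 'j' else '.png'
url = base_url + img_name[0:2] + '/wallhaven-' + img_name + ext
print(url)  # https://w.wallhaven.cc/full/4x/wallhaven-4xlkjz.jpg
```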

Full code

The complete script:

import requests
import json
import re
import os

base_url = 'https://w.wallhaven.cc/full/'
# Save wallpapers into a "极简壁纸" folder next to this script
novel_save_dir = os.path.join(os.path.abspath(os.path.dirname(__file__)), '极简壁纸/')
print('Path of the running script: ' + os.path.abspath(__file__))
print('Wallpapers will be saved to: ' + os.path.abspath(os.path.dirname(__file__)) + '/极简壁纸')

def get_data(count):
    url = 'https://api.zzzmh.cn/bz/getJson'
    # Headers captured from the browser; content-length is omitted because
    # requests computes it from the actual payload
    headers = {
        "accept": "*/*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
        "access": "3111c568d1242b689b9a0380dc2ca665549ef1416b0864a1cd72010c94e94ddc",
        "content-type": "application/json",
        "location": "bz.zzzmh.cn",
        "origin": "https",
        "referer": "https",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-site",
        "sign": "error",
        "timestamp": "1585389526757",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 Edg/80.0.361.69"
    }
    data = {
        'target': "anime",
        'pageNum': count
    }
    res = requests.post(url, data=json.dumps(data), headers=headers)
    # Each entry looks like {"t":"j","x":3840,"i":"4xlkjz","y":1080};
    # capture the format flag, the width, and the image name
    img_re = re.findall('"t":"([jp])","x":(.*?),"i":"(.*?)"', res.text)
    for img in img_re:
        img_name = img[2]
        img_type = '.jpg' if img[0] == 'j' else '.png'  # 'j' -> jpg, 'p' -> png
        url = base_url + img_name[0:2] + '/wallhaven-' + img_name + img_type
        download_img(url, img_name, img_type)

def download_img(url, img_name, img_type):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 Edg/80.0.361.69'
    }
    try:
        img = requests.get(url, headers=headers)
        with open(novel_save_dir + img_name + img_type, 'wb') as f:
            f.write(img.content)
    except requests.RequestException as e:
        print('Download failed: ' + img_name + ' (' + str(e) + ')')
    else:
        print('Downloaded: ' + img_name)

if __name__ == "__main__":
    if not os.path.exists(novel_save_dir):  # create the download folder if it does not exist
        os.mkdir(novel_save_dir)
    for count in range(1, 10):
        get_data(count)
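As an aside, the data=json.dumps(data) step in get_data can be delegated to requests itself: its json= keyword serializes the dict and sets the Content-Type: application/json header automatically. A quick offline check with a prepared request (nothing is actually sent):

```python
import json
import requests

# Build the same POST as get_data(), but let requests serialize the payload
req = requests.Request(
    'POST',
    'https://api.zzzmh.cn/bz/getJson',
    json={'target': 'anime', 'pageNum': 1},
)
prepared = req.prepare()  # fills in headers and body without sending anything

print(prepared.headers['Content-Type'])  # application/json
print(json.loads(prepared.body))         # {'target': 'anime', 'pageNum': 1}
```

Whether the API accepts the request without the extra captured headers (access, sign, timestamp, ...) is untested here; this only illustrates the payload handling.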
Author: Techoc
Link: https://techoc.xyz/posts/632a71ec/
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit Techoc's when reposting.