
I've recently gotten hooked on web scraping


Python web scraping — I'm still in the learning stage, so anyone with the same interest is welcome to discuss it with me.

import os

import requests
from lxml import etree

if __name__ == "__main__":
    parse = etree.HTMLParser(encoding="utf-8")
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36'
    }
    url = "https://域名/index_4.html"
    page_text = requests.get(url=url, headers=headers)

    # General fix for garbled Chinese text: let requests re-detect the encoding.
    # An alternative is to re-decode an individual string, e.g.:
    # img_name = img_name.encode('iso-8859-1').decode('gbk')
    page_text.encoding = page_text.apparent_encoding
    page_text = page_text.text

    tree = etree.HTML(page_text, parser=parse)
    li_list = tree.xpath('//ul[@class="clearfix"]/li')

    if not os.path.exists('文件夹'):
        os.mkdir('文件夹')

    for li in li_list:
        # Build the file name from the image's alt text.
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        img_src = "https://域名/" + li.xpath('./a/img/@src')[0]
        img_data = requests.get(url=img_src, headers=headers).content
        with open('文件夹/' + img_name, 'wb') as fp:
            fp.write(img_data)
        print(img_name)
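The commented-out line in the script above hints at a second way of dealing with garbled Chinese text. Here is a minimal sketch of both approaches; the URL is a placeholder and the helper name fix_mojibake is made up purely for illustration:

import requests

# Placeholder URL for illustration only.
resp = requests.get("https://example.com/index_4.html")

# Option 1: let requests guess the real encoding from the response body,
# then read the decoded text.
resp.encoding = resp.apparent_encoding
text = resp.text

# Option 2: repair a single string that was already decoded with the wrong
# codec (here: content that is really GBK but was read as ISO-8859-1).
def fix_mojibake(s: str) -> str:
    # Round-trip back to the raw bytes, then decode with the correct codec.
    return s.encode('iso-8859-1').decode('gbk')

Option 1 is usually enough when the whole page is mis-decoded; Option 2 helps when only individual extracted fields, such as the @alt text used for file names, come out garbled.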

Test result:
