Getting Started with Web Crawlers

2023-01-27 tools - crawlers python, crawlers

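A crawler imitates a browser: it sends requests to a website and retrieves the site's data, taking over repetitive work that a person would otherwise do by hand.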

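Websites broadly fall into two architectures: those with a decoupled front end and back end, and those without. On a decoupled site, a request to the back-end endpoint usually returns JSON, which is easy to process. On a site that is not decoupled, you can only fetch the HTML page and then use BeautifulSoup to extract the content you want.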
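
To make the contrast concrete, here is a minimal sketch of both cases. The two response bodies are made up for illustration and do not come from any real site; "html.parser" is Python's built-in parser, used so the sketch needs nothing beyond bs4.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical response bodies, invented for this illustration.
json_body = '{"data": [{"title": "Movie A", "score": "9.1"}]}'
html_body = """
<html><body>
  <ul id="movies">
    <li class="item"><a href="/m/1">Movie A</a><span class="score">9.1</span></li>
    <li class="item"><a href="/m/2">Movie B</a><span class="score">8.7</span></li>
  </ul>
</body></html>
"""

# Decoupled site: the body is already structured data.
titles_from_json = [item["title"] for item in json.loads(json_body)["data"]]

# Non-decoupled site: parse the HTML, then pick out the elements you need.
soup = BeautifulSoup(html_body, "html.parser")
items = soup.find_all("li", class_="item")
titles_from_html = [li.find("a").get_text(strip=True) for li in items]
scores_from_html = [li.find("span", class_="score").get_text(strip=True) for li in items]

print(titles_from_json)
print(titles_from_html)
print(scores_from_html)
```

In the HTML case the tag names and classes have to be read off the page yourself, so the extraction code is tied to that page's layout.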

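Some sites have anti-crawling measures and need extra configuration. The most common fix is adding a 'User-Agent' header; some sites also require cookies or other header fields. You can find the exact values with the browser's developer tools (in Chrome the shortcut is “Ctrl + Shift + I”): open the Network tab, resend the request, and inspect the request's parameters. Postman is also useful for testing which parameters are actually required.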
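
One way to check what a request will carry before sending it is to prepare it through a requests Session. This is only a sketch: the 'Referer' value is an assumption added for illustration, and real values should be copied from the Network tab.

```python
import requests

# A Session applies the same headers to every request made through it.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    # Hypothetical extra header, for illustration only.
    "Referer": "https://www.sogou.com/",
})

# Build the request without sending it, to inspect what would go over the wire.
request = requests.Request("GET", "https://www.sogou.com/web", params={"query": "上海"})
prepared = session.prepare_request(request)

print(prepared.url)                    # query string is percent-encoded
print(prepared.headers["User-Agent"])
print(prepared.headers["Referer"])

# To actually send it: response = session.send(prepared, timeout=10)
```

Comparing these printed values with what the browser shows is a quick way to spot a missing header.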

A simple request made with requests looks like this:

# Import the requests package
import requests

# Send the request
x = requests.get('https://www.sogou.com/web?query=上海')

# Print the page content
print(x.text)

Some examples

Fetching a Sogou results page

Save the results page for the search term “上海” (Shanghai) to the file “上海.html”, with the “user-agent” header set.

import requests
header = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
param = {
    "query": "上海"
}
x = requests.get('https://www.sogou.com/web', params = param, headers = header)
print(x.text)
with open("上海.html", 'w', encoding = "utf-8") as f:
    f.write(x.text)

Calling Baidu Translate

Baidu Translate uses a decoupled front end and back end, so the result comes back as JSON, which is very convenient to process.

import requests
header = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
data = {
    "kw": "crawler"
}
x = requests.post('https://fanyi.baidu.com/sug', data = data, headers = header)
print(x.json())
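
Once the JSON is parsed, it is an ordinary dict that can be filtered like any other. The sketch below runs on a hand-written sample; its field names ("errno", "data", "k", "v") are assumptions about the shape of the reply and should be checked against what print(x.json()) actually shows.

```python
# Hypothetical payload; the field names are assumed, not taken from the real service.
sample = {
    "errno": 0,
    "data": [
        {"k": "crawler", "v": "n. web crawler"},
        {"k": "crawl", "v": "v. to move slowly"},
    ],
}

def extract_suggestions(payload):
    """Return (word, meaning) pairs, or an empty list if the payload reports an error."""
    if payload.get("errno") != 0:
        return []
    return [(item["k"], item["v"]) for item in payload.get("data", [])]

for word, meaning in extract_suggestions(sample):
    print(f"{word}: {meaning}")
```

With a live response, the call would be extract_suggestions(x.json()).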

Douban Movies

import requests
header = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
param = {
    'type': '24',
    'interval_id': '100:90',
    'action': '',
    'start': '40',
    'limit': '20'
}
x = requests.get('https://movie.douban.com/j/chart/top_list', params = param, headers = header)
print(x.json())
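
The 'start' and 'limit' parameters suggest the list is paged, so fetching more than one page comes down to stepping 'start' by 'limit'. A minimal sketch of generating those parameter sets follows; page_params is a hypothetical helper, and the paging behaviour of the endpoint is an assumption inferred from the parameter names.

```python
def page_params(total, limit=20, type_id="24", interval_id="100:90"):
    """Yield one query-parameter dict per page, covering `total` items."""
    for start in range(0, total, limit):
        yield {
            "type": type_id,
            "interval_id": interval_id,
            "action": "",
            "start": str(start),
            "limit": str(limit),
        }

pages = list(page_params(total=60))
print([p["start"] for p in pages])

# Each dict can then be passed as params to requests.get, ideally with a
# short pause between calls so the site is not hit too quickly.
```

The last dict produced for total=60 matches the parameters used in the request above.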

Google Scholar

Requires a proxy or VPN on networks where Google is blocked.

import urllib.request
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/90.0.4430.93 Safari/537.36'}
url = 'https://scholar.google.com/scholar?hl=zh-CN&as_sdt=0%2C5&as_ylo=2018&q=zero-shot+NER'
req = urllib.request.Request(url=url, headers=headers)
res = urllib.request.urlopen(req, timeout=7)
html = res.read().decode('utf-8')
# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(html, "lxml")
print(soup)
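
If the proxy is not picked up from the system settings, urllib can be pointed at it explicitly. The following is a sketch only: the proxy address is an assumption (a local HTTP proxy on port 7890) and must be replaced with your own, and build_proxy_opener and build_request are hypothetical helper names.

```python
import urllib.request

# Assumption: a local HTTP proxy is listening here; replace with your own address.
PROXY_URL = "http://127.0.0.1:7890"

def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

def build_request(url):
    """Attach a browser-like User-Agent to the request."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    return urllib.request.Request(url=url, headers=headers)

opener = build_proxy_opener(PROXY_URL)
req = build_request("https://scholar.google.com/scholar?q=zero-shot+NER")

# Nothing is sent until open() is called:
# html = opener.open(req, timeout=7).read().decode("utf-8")
```

The resulting html string can be handed to BeautifulSoup exactly as before.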

Permalink: http://chadqiu.github.io/395c25595e33.html

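Copyright notice: Unless otherwise stated, all articles on this blog are licensed under the CC BY 4.0 CN license. Please credit the source when reposting!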

chadqiu, Developer & Researcher

Student programmer