一、Week1: Rules
1. Library: Requests
(1). requests
r = requests.get(url, params=None, **kwargs)
#!/usr/bin/python
import requests

r = requests.get('http://www.baidu.com')
type(r)  # <class 'requests.models.Response'>
if r.status_code == 200:
    print('Access success')
elif r.status_code == 404:
    print('Access denied')
r.headers            # response headers
r.text               # response body as text
r.encoding           # encoding taken from the HTTP header
r.apparent_encoding  # encoding inferred from the content (fallback)
r.content            # response body as bytes
(2). universal framework
| Exception | Description |
|---|---|
| requests.ConnectionError | network connection error (e.g. DNS failure, connection refused) |
| requests.HTTPError | HTTP error |
| requests.URLRequired | missing URL |
| requests.TooManyRedirects | exceeded the maximum number of redirects |
| requests.ConnectTimeout | timed out connecting to the remote server |
| requests.Timeout | request timed out |
#!/usr/bin/python
import requests

def getHTMLtext(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError if r.status_code != 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Error occurred"

if __name__ == '__main__':
    url = 'https://www.baidu.com'
    print(getHTMLtext(url))
(3). 7 major methods of Requests
| METHODS | INTRODUCTION |
|---|---|
| requests.request() | constructs a request; the base of the other six methods |
| requests.get() | main method for getting an HTML page (HTTP GET) |
| requests.head() | gets a page's header information (HTTP HEAD) |
| requests.post() | submits a POST request (HTTP POST) |
| requests.put() | submits a PUT request, replacing the resource (HTTP PUT) |
| requests.patch() | submits a partial modification (HTTP PATCH) |
| requests.delete() | submits a delete request (HTTP DELETE) |
HTTP, Hypertext Transfer Protocol. URL format: http://host[:port][path]
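The host[:port][path] structure can be checked with the standard library's urlsplit; a minimal sketch using a made-up URL (not one from these notes):

```python
from urllib.parse import urlsplit

# Decompose a URL into the http://host[:port][path] parts.
# The URL below is a made-up example.
parts = urlsplit('http://www.example.com:8080/path/to/page')
print(parts.scheme)    # protocol
print(parts.hostname)  # host
print(parts.port)      # port, or None when omitted
print(parts.path)      # path
```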
#!/usr/bin/python
# requests.post with a dict: the key-value pairs are encoded into 'form'
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(URL, data=payload)
print(r.text)
'''
{
...
"form":{
"key2":"value2",
"key1":"value1"
},
}
'''
r = requests.post(URL, data = 'just a piece of text')
print(r.text)
'''
{
...
"data":"just a piece of text",
"form":{}
}
'''
(4). More about Requests
(0). requests.request(method, url, **kwargs)
method = ['GET', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE', 'OPTIONS']
(1). params
#!/usr/bin/python
kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('GET', URL, params=kv)
print(r.url)
# URL?key1=value1&key2=value2
(2). data
#!/usr/bin/python
kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('POST', URL, data = kv)
body = 'textbody'
r = requests.request('POST', URL, data = body)
(3). json
kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('POST', URL, json = kv)
(4). headers
hd = {'user-agent':'Chrome/10'}
r = requests.request('POST', URL, headers = hd)
(5). cookie
advanced
(6). auth
advanced
(7). files
fs = {'file':open('data.csv', 'rb')}
r = requests.request('POST', URL, files = fs)
(8). timeout
(9). proxies
pxs = {'http': 'http://user:pass@10.10.10.1:1234',
       'https': 'https://10.10.10.1:4321'}
r = requests.request('GET', URL, proxies=pxs)
| # | **kwargs | Description |
|---|---|---|
| 1 | params | dictionary or text sequences |
| 2 | data | dictionary, text sequences or file object |
| 3 | json | data in JSON format, sent as the request body |
| 4 | headers | HTTP headers |
| 5 | cookies | dict or CookieJar |
| 6 | auth | tuple |
| 7 | files | dict, file transfer |
| 8 | timeout | seconds |
| 9 | proxies | dict |
| 10 | allow_redirects | default: True |
| 11 | stream | if True, defer downloading the response body (requests defaults to stream=False, i.e. download immediately) |
| 12 | verify | verify SSL certificate(default True) |
| 13 | cert | local SSL client certificate path |
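As a rough, offline sketch of what params= does, the standard library's urlencode turns a dict into the query string appended after '?' (the base URL here is a hypothetical placeholder, not from the notes):

```python
from urllib.parse import urlencode

# URL-encode a dict into a query string, as params= does internally.
kv = {'key1': 'value1', 'key2': 'value2'}
query = urlencode(kv)
full_url = 'http://example.com/s' + '?' + query
print(full_url)
```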
2. Crawler ethics: 'even thieves have a code'
(1). Problems caused by crawlers
- Requests: small scale, small data volume, not speed-sensitive
- Scrapy: medium scale, larger data volume, speed-sensitive
- Google, Bing: large scale, search engines, speed is critical
- Source check: inspect the User-Agent field of the HTTP headers
- Announcement: the Robots protocol
(2). Robots protocol
Robots Exclusion Standard
location: host/robots.txt
Syntax:
User-agent: *
Disallow: /
e.g. https://www.jd.com/robots.txt
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
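These rules can be checked programmatically with the standard library's robotparser; a small sketch using a simplified rule set modeled on the jd.com example, pasted inline rather than fetched from host/robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Simplified rules modeled on the jd.com example above (inlined, no network).
rules = """\
User-agent: EtaoSpider
Disallow: /

User-agent: *
Disallow: /pop/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch('EtaoSpider', 'https://www.jd.com/'))           # banned entirely
print(rp.can_fetch('MyCrawler', 'https://www.jd.com/pop/1.html'))  # path disallowed
print(rp.can_fetch('MyCrawler', 'https://www.jd.com/'))            # allowed
```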
(3). Complying with the Robots protocol
dylan says: just don't end up in jail~
(4). Practice
1. jd.com
Info on the first Lolita dress in JD's default search results (runs away)
#!/usr/bin/python
import requests

url = 'https://item.jd.com/55949296412.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print('Error occurred')
2. Amazon
A Lolita dress on Amazon
import requests

url = 'https://www.amazon.cn/dp/B07MQSJQC4/ref=sr_1_12?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&dchild=1&keywords=lolita&qid=1586515136&sr=8-12'
try:
    kv = {'User-Agent': 'Mozilla/5.0'}
    r = requests.request('GET', url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print('Error occurred')
3. Baidu keyword submission
Baidu API: https://www.baidu.com/s?wd= + keywords
import requests

kw = {'wd': 'Lolita裙'}
kv = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.baidu.com/s'
try:
    r = requests.get(url, params=kw, headers=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print('Error occurred')
4. Pictures / Videos
Use case: pictures / videos of waifus
import requests
import os

url = 'https://tva2.sinaimg.cn/large/87c01ec7gy1fsnqqz23i'
root = './'
path = root + url.split('/')[-1]
try:
    if not os.path.exists(path):
        r = requests.request('GET', url)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)
        print('Successfully saved!')
except:
    print('Failed')
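The save step can be rehearsed offline; in this sketch a dummy byte string stands in for r.content, the URL is hypothetical, and the filename is derived from the URL tail exactly as in the script:

```python
import os
import tempfile

# Hypothetical URL; dummy bytes stand in for r.content (no network).
url = 'https://example.com/images/pic.jpg'
content = b'fake image bytes'
root = tempfile.mkdtemp()                      # temp dir instead of './'
path = os.path.join(root, url.split('/')[-1])  # filename from the URL tail
if not os.path.exists(path):
    with open(path, 'wb') as f:
        f.write(content)
print(os.path.basename(path))
```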
5. IP address
import requests

url = 'http://m.ip138.com/ip.asp?ip='
ip = input('Enter your IP address: ')
kv = {'User-Agent': 'Mozilla/5.0'}
try:
    r = requests.get(url + ip, headers=kv, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except:
    print('Error occurred')
二、Week2: Extraction
1. Beautiful Soup
(1). Installing and trying Beautiful Soup
import requests
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())
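The same flow works offline; a small sketch parsing an inline HTML string instead of the downloaded demo page (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Parse a literal HTML string, then read a tag and its name.
html = '<html><body><p class="title"><b>The demo page</b></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.name)    # tag name
print(soup.b.string)  # text inside <b>
```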
(2). Basic elements of Beautiful Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup('<html>data</html>', 'html.parser')
soup2 = BeautifulSoup(open('./demo.html'), 'html.parser')
| parser | use |
|---|---|
| bs4 HTML | BeautifulSoup(mk, 'html.parser') |
| lxml HTML | BeautifulSoup(mk, 'lxml') |
| lxml XML | BeautifulSoup(mk, 'xml') |
| html5lib | BeautifulSoup(mk, 'html5lib') |
| Element | About |
|---|---|
| Tag | a tag, delimited by an opening <> and closing </> |
| Name | the tag's name; <p>…</p> has the name 'p' |