Python Crawler Tutorial of BIT

2020-03-13 22:24

一、Week 1: Rules

1. Library: Requests

(1). requests

r = requests.get(url, params=None, **kwargs)
  #!/usr/bin/python
  import requests
  r = requests.get('http://www.baidu.com')
  type(r)
  if r.status_code == 200:
      print('Access success')
  elif r.status_code == 404:
      print('Access denied')

  r.headers
  r.text
  r.encoding            # encoding taken from the charset field of the HTTP header
  r.apparent_encoding   # encoding guessed from the content itself (fallback)
  r.content
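A minimal offline sketch of the attributes above, using a locally constructed `Response` object (and its private `_content` field, an assumption made only so the snippet runs without network access):

```python
import requests

r = requests.models.Response()
r.status_code = 200
r._content = b'hello crawler'   # what r.content holds: the raw response bytes
r.encoding = None               # as if the HTTP header carried no charset

print(r.status_code)              # the 200/404 code lives here, not in r.headers
r.encoding = r.apparent_encoding  # fall back to the encoding guessed from content
print(r.text)                     # r.content decoded with r.encoding
```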

(2). universal framework

requests.ConnectionError   network connection error (e.g. DNS failure, connection refused)
requests.HTTPError         HTTP error exception
requests.URLRequired       missing-URL exception
requests.TooManyRedirects  exceeded the redirect limit
requests.ConnectTimeout    timed out while connecting to the remote server
requests.Timeout           the whole request timed out
#!/usr/bin/python
import requests

def getHTMLtext(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()  # raises requests.HTTPError if r.status_code != 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Error occurred"

if __name__ == '__main__':
    url = 'https://www.baidu.com'
    print(getHTMLtext(url))

(3). 7 major methods of Requests

METHOD               INTRODUCTION
requests.request()   base method underlying all the others
requests.get()       major method to get an HTML webpage
requests.head()      fetch the header information only
requests.post()      submit a POST resource
requests.put()       submit a PUT (full replace)
requests.patch()     submit a partial modification
requests.delete()    submit a DELETE request

HTTP (Hypertext Transfer Protocol) URL format: http://host[:port][path]
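The URL anatomy above can be checked with the standard library's `urlparse`; the sample URL below is hypothetical:

```python
from urllib.parse import urlparse

u = urlparse('http://www.bit.edu.cn:8080/faq/index.html')
print(u.scheme)    # http
print(u.hostname)  # www.bit.edu.cn
print(u.port)      # 8080
print(u.path)      # /faq/index.html
```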

#!/usr/bin/python
#requests.post

payload = {'key1':'value1', 'key2':'value2'}
r = requests.post(URL, data = payload)
print(r.text)
'''
{
    ...
    "form":{
        "key2":"value2",
        "key1":"value1"
    },
}
'''

r = requests.post(URL, data = 'just a piece of text')
print(r.text)
'''
{
    ...
    "data":"just a piece of text",
    "form":{}
}
'''

(4). More about Requests

(0). requests.request(method, url, **kwargs)
method = ['GET', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE', 'OPTIONS']
(1). params

#!/usr/bin/python
kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('GET', URL, params = kv)
print(r.url)
# URL?key1=value1&key2=value2

(2). data

#!/usr/bin/python
kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('POST', URL, data = kv)
body = 'textbody'
r = requests.request('POST', URL, data = body)

(3). json

kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('POST', URL, json = kv)
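A sketch of the difference between `data=` and `json=`: preparing the request (without sending it) shows the body and Content-Type each one produces. The httpbin URL is just a placeholder.

```python
import requests

kv = {'key1': 'value1'}
form = requests.Request('POST', 'http://httpbin.org/post', data=kv).prepare()
js = requests.Request('POST', 'http://httpbin.org/post', json=kv).prepare()

print(form.body)                      # key1=value1
print(form.headers['Content-Type'])   # application/x-www-form-urlencoded
print(js.body)                        # b'{"key1": "value1"}'
print(js.headers['Content-Type'])     # application/json
```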

(4). headers

hd = {'user-agent':'Chrome/10'}
r = requests.request('POST', URL, headers = hd)

(5). cookie
advanced
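A small sketch of the basics: a plain dict passed via `cookies=` ends up in the `Cookie` header; preparing the request shows this without network traffic. The URL and cookie values are placeholders.

```python
import requests

ck = {'session_id': 'abc123'}
req = requests.Request('GET', 'http://httpbin.org/cookies', cookies=ck).prepare()
print(req.headers['Cookie'])   # session_id=abc123
```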

(6). auth
advanced
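The basic case as a sketch: a `(user, password)` tuple passed via `auth=` becomes an HTTP Basic `Authorization` header (base64 of `user:pass`). The URL and credentials are placeholders.

```python
import requests

req = requests.Request('GET', 'http://httpbin.org/basic-auth/user/pass',
                       auth=('user', 'pass')).prepare()
print(req.headers['Authorization'])   # Basic dXNlcjpwYXNz
```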

(7). files

fs = {'file':open('data.csv', 'rb')}
r = requests.request('POST', URL, files = fs)

(8). timeout
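A sketch: `timeout=` caps how long to wait, in seconds; exceeding it raises `requests.Timeout`. The address 10.255.255.1 is non-routable and is assumed here only to force the request to fail quickly.

```python
import requests

try:
    requests.get('http://10.255.255.1/', timeout=0.5)
    print('unexpectedly succeeded')
except requests.Timeout:
    print('request timed out')
except requests.ConnectionError:
    print('connection failed')
```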

(9). proxies

pxs = {'http':'http://user:pass@10.10.10.1:1234',
       'https':'https://10.10.10.1:4321' }
r = requests.request('GET',URL, proxies = pxs)
**kwargs
1.  params           dictionary or byte sequences, appended to the URL
2.  data             dictionary, byte sequences, or file object
3.  json             JSON data as the request body
4.  headers          HTTP headers
5.  cookies          dict or CookieJar
6.  auth             tuple, for HTTP authentication
7.  files            dict, for file transfer
8.  timeout          seconds
9.  proxies          dict
10. allow_redirects  follow redirects (default: True)
11. stream           stream the body instead of downloading it immediately (default: False)
12. verify           verify the SSL certificate (default: True)
13. cert             local SSL certificate path

2. Crawler ethics ('even thieves have a code')

(1). Problems caused by crawlers

Requests: small scale, small data volume, speed-insensitive
Scrapy: medium scale, larger data volume, speed-sensitive
Google, Bing: large scale, search engines, speed-critical

(2). Robots protocol

Robots Exclusion Standard
location: host/robots.txt

Syntax

User-agent: *
Disallow: /

i.e. https://www.jd.com/robots.txt

User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
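The standard library's `urllib.robotparser` can evaluate rules like the ones above; a shortened, wildcard-free copy of jd.com's rules is parsed from a local string here so no download is needed.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /pop/
User-agent: EtaoSpider
Disallow: /
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch('EtaoSpider', 'https://www.jd.com/anything'))   # False: banned entirely
print(rp.can_fetch('MyCrawler', 'https://www.jd.com/pop/x.html'))  # False: /pop/ disallowed
print(rp.can_fetch('MyCrawler', 'https://www.jd.com/index.html'))  # True
```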

(3). Complying with the Robots protocol

dylan says: just don't end up behind bars~

(4). Practice

1. jd.com

Info of the first Lolita dress listed by default on JD (runs away)

#!/usr/bin/python
import requests
url = 'https://item.jd.com/55949296412.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print('Error occurred')

2. Amazon

A Lolita dress on Amazon

import requests
url = 'https://www.amazon.cn/dp/B07MQSJQC4/ref=sr_1_12?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&dchild=1&keywords=lolita&qid=1586515136&sr=8-12'

try:
    kv = {'User-Agent':'Mozilla/5.0'}
    r = requests.request('GET',url,headers = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print('Error occurred')

3. Submitting keywords to Baidu

Baidu search API: https://www.baidu.com/s?wd= + keywords

import requests

kw = {'wd':'Lolita裙'}
kv = {'User-Agent':'Mozilla/5.0'}
url = 'https://www.baidu.com/s'
try:
    r = requests.get(url, params = kw, headers = kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print('Error occurred')

4. Pictures / Videos

Purpose: pictures / videos of my waifus

import requests
import os

url = 'https://tva2.sinaimg.cn/large/87c01ec7gy1fsnqqz23i'
root = './'
path = root + url.split('/')[-1]

try:
    if not os.path.exists(path):
        r = requests.request('GET',url)
        r.raise_for_status()
        with open(path,'wb') as f:
            f.write(r.content)
        print('Successfully saved!')
except:
    print('Failed')

5. IP address

import requests
url = 'http://m.ip138.com/ip.asp?ip='
ip = input('Enter your IP address: ')
kv = {'User-Agent':'Mozilla/5.0'}
try:
    r = requests.get(url + ip, headers = kv, timeout = 10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except:
    print('Error occurred')

二、Week 2: Parsing

1. Beautiful Soup

(1). Installing Beautiful Soup

import requests
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())

(2). Basic elements of Beautiful Soup

from bs4 import BeautifulSoup
soup = BeautifulSoup('<html>data</html>', 'html.parser')
soup2 = BeautifulSoup(open('./demo.html'), 'html.parser')
PARSER             USE
bs4 HTML parser    BeautifulSoup(mk, 'html.parser')
lxml HTML parser   BeautifulSoup(mk, 'lxml')
lxml XML parser    BeautifulSoup(mk, 'xml')
html5lib parser    BeautifulSoup(mk, 'html5lib')
ELEMENT   ABOUT
Tag       a tag, delimited by an opening <> and a closing </>
Name      the tag's name, e.g. <p>...
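A short sketch of Tag, Name, and attrs, using an inline HTML string as a stand-in for the course's demo.html:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p class="title"><b>Demo page</b></p></body></html>',
                     'html.parser')
tag = soup.p             # soup.<name> returns the first matching Tag
print(tag.name)          # p
print(tag.attrs)         # {'class': ['title']}  (class is multi-valued, hence a list)
print(tag.b.string)      # Demo page
```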