Python Crawler Tutorial of BIT

2020-03-13 22:24

一、Week 1: Rules

1. Library: Requests

(1). requests

r = requests.get(url, params=None, **kwargs)
  #!/usr/bin/python
  import requests
  r = requests.get('http://www.baidu.com')
  type(r)
  if r.status_code == 200:
      print('Access success')
  elif r.status_code == 404:
      print('Access denied')

  r.headers
  r.text
  r.encoding            # encoding taken from the charset field of the HTTP header
  r.apparent_encoding   # encoding guessed from the content itself (fallback)
  r.content
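A minimal offline sketch of the attributes above, using a locally constructed `Response` object (and its private `_content` field, an assumption made only so the snippet runs without network access):

```python
import requests

r = requests.models.Response()
r.status_code = 200
r._content = b'hello crawler'   # what r.content holds: the raw response bytes
r.encoding = None               # as if the HTTP header carried no charset

print(r.status_code)              # the 200/404 code lives here, not in r.headers
r.encoding = r.apparent_encoding  # fall back to the encoding guessed from content
print(r.text)                     # r.content decoded with r.encoding
```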

(2). universal framework

requests.ConnectionError   network connection error (e.g. DNS failure, connection refused)
requests.HTTPError         HTTP error exception
requests.URLRequired       missing-URL exception
requests.TooManyRedirects  exceeded the redirect limit
requests.ConnectTimeout    timed out while connecting to the remote server
requests.Timeout           the whole request timed out
#!/usr/bin/python
import requests

def getHTMLtext(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()  # raises requests.HTTPError if r.status_code != 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Error occurred"

if __name__ == '__main__':
    url = 'https://www.baidu.com'
    print(getHTMLtext(url))

(3). 7 major methods of Requests

METHOD               INTRODUCTION
requests.request()   base method underlying all the others
requests.get()       major method to get an HTML webpage
requests.head()      fetch the header information only
requests.post()      submit a POST resource
requests.put()       submit a PUT (full replace)
requests.patch()     submit a partial modification
requests.delete()    submit a DELETE request

HTTP (Hypertext Transfer Protocol) URL format: http://host[:port][path]
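The URL anatomy above can be checked with the standard library's `urlparse`; the sample URL below is hypothetical:

```python
from urllib.parse import urlparse

u = urlparse('http://www.bit.edu.cn:8080/faq/index.html')
print(u.scheme)    # http
print(u.hostname)  # www.bit.edu.cn
print(u.port)      # 8080
print(u.path)      # /faq/index.html
```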

#!/usr/bin/python
#requests.post

payload = {'key1':'value1', 'key2':'value2'}
r = requests.post(URL, data = payload)
print(r.text)
'''
{
    ...
    "form":{
        "key2":"value2",
        "key1":"value1"
    },
}
'''

r = requests.post(URL, data = 'just a piece of text')
print(r.text)
'''
{
    ...
    "data":"just a piece of text",
    "form":{}
}
'''

(4). More about Requests

(0). requests.request(method, url, **kwargs)
method = ['GET', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE', 'OPTIONS']
(1). params

#!/usr/bin/python
kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('GET', URL, params = kv)
print(r.url)
# URL?key1=value1&key2=value2

(2). data

#!/usr/bin/python
kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('POST', URL, data = kv)
body = 'textbody'
r = requests.request('POST', URL, data = body)

(3). json

kv = {'key1':'value1', 'key2':'value2'}
r = requests.request('POST', URL, json = kv)
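A sketch of the difference between `data=` and `json=`: preparing the request (without sending it) shows the body and Content-Type each one produces. The httpbin URL is just a placeholder.

```python
import requests

kv = {'key1': 'value1'}
form = requests.Request('POST', 'http://httpbin.org/post', data=kv).prepare()
js = requests.Request('POST', 'http://httpbin.org/post', json=kv).prepare()

print(form.body)                      # key1=value1
print(form.headers['Content-Type'])   # application/x-www-form-urlencoded
print(js.body)                        # b'{"key1": "value1"}'
print(js.headers['Content-Type'])     # application/json
```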

(4). headers

hd = {'user-agent':'Chrome/10'}
r = requests.request('POST', URL, headers = hd)

(5). cookie
advanced
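A small sketch of the basics: a plain dict passed via `cookies=` ends up in the `Cookie` header; preparing the request shows this without network traffic. The URL and cookie values are placeholders.

```python
import requests

ck = {'session_id': 'abc123'}
req = requests.Request('GET', 'http://httpbin.org/cookies', cookies=ck).prepare()
print(req.headers['Cookie'])   # session_id=abc123
```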

(6). auth
advanced
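The basic case as a sketch: a `(user, password)` tuple passed via `auth=` becomes an HTTP Basic `Authorization` header (base64 of `user:pass`). The URL and credentials are placeholders.

```python
import requests

req = requests.Request('GET', 'http://httpbin.org/basic-auth/user/pass',
                       auth=('user', 'pass')).prepare()
print(req.headers['Authorization'])   # Basic dXNlcjpwYXNz
```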

(7). files

fs = {'file':open('data.csv', 'rb')}
r = requests.request('POST', URL, files = fs)

(8). timeout
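A sketch: `timeout=` caps how long to wait, in seconds; exceeding it raises `requests.Timeout`. The address 10.255.255.1 is non-routable and is assumed here only to force the request to fail quickly.

```python
import requests

try:
    requests.get('http://10.255.255.1/', timeout=0.5)
    print('unexpectedly succeeded')
except requests.Timeout:
    print('request timed out')
except requests.ConnectionError:
    print('connection failed')
```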

(9). proxies

pxs = {'http':'http://user:pass@10.10.10.1:1234',
       'https':'https://10.10.10.1:4321' }
r = requests.request('GET',URL, proxies = pxs)
**kwargs
1.  params           dictionary or byte sequences, appended to the URL
2.  data             dictionary, byte sequences, or file object
3.  json             JSON data as the request body
4.  headers          HTTP headers
5.  cookies          dict or CookieJar
6.  auth             tuple, for HTTP authentication
7.  files            dict, for file transfer
8.  timeout          seconds
9.  proxies          dict
10. allow_redirects  follow redirects (default: True)
11. stream           stream the body instead of downloading it immediately (default: False)
12. verify           verify the SSL certificate (default: True)
13. cert             local SSL certificate path

2. Crawler ethics ('even thieves have a code')

(1). Problems caused by crawlers

Requests: small scale, small data volume, speed-insensitive
Scrapy: medium scale, larger data volume, speed-sensitive
Google, Bing: large scale, search engines, speed-critical

(2). Robots protocol

Robots Exclusion Standard
location: host/robots.txt

Syntax

User-agent: *
Disallow: /

i.e. https://www.jd.com/robots.txt

User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
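The standard library's `urllib.robotparser` can evaluate rules like the ones above; a shortened, wildcard-free copy of jd.com's rules is parsed from a local string here so no download is needed.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /pop/
User-agent: EtaoSpider
Disallow: /
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch('EtaoSpider', 'https://www.jd.com/anything'))   # False: banned entirely
print(rp.can_fetch('MyCrawler', 'https://www.jd.com/pop/x.html'))  # False: /pop/ disallowed
print(rp.can_fetch('MyCrawler', 'https://www.jd.com/index.html'))  # True
```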

(3). Complying with the Robots protocol

dylan says: just don't end up behind bars~

(4). Practice

1. jd.com

Info of the first Lolita dress listed by default on JD (runs away)

#!/usr/bin/python
import requests
url = 'https://item.jd.com/55949296412.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print('Error occurred')

2. Amazon

A Lolita dress on Amazon

import requests
url = 'https://www.amazon.cn/dp/B07MQSJQC4/ref=sr_1_12?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&dchild=1&keywords=lolita&qid=1586515136&sr=8-12'

try:
    kv = {'User-Agent':'Mozilla/5.0'}
    r = requests.request('GET',url,headers = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print('Error occurred')

3. Submitting keywords to Baidu

Baidu search API: https://www.baidu.com/s?wd= + keywords

import requests

kw = {'wd':'Lolita裙'}
kv = {'User-Agent':'Mozilla/5.0'}
url = 'https://www.baidu.com/s'
try:
    r = requests.get(url, params = kw, headers = kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print('Error occurred')

4. Pictures / Videos

Purpose: pictures / videos of my waifus

import requests
import os

url = 'https://tva2.sinaimg.cn/large/87c01ec7gy1fsnqqz23i'
root = './'
path = root + url.split('/')[-1]

try:
    if not os.path.exists(path):
        r = requests.request('GET',url)
        r.raise_for_status()
        with open(path,'wb') as f:
            f.write(r.content)
        print('Successfully saved!')
except:
    print('Failed')

5. IP address

import requests
url = 'http://m.ip138.com/ip.asp?ip='
ip = input('Enter your IP address: ')
kv = {'User-Agent':'Mozilla/5.0'}
try:
    r = requests.get(url + ip, headers = kv, timeout = 10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except:
    print('Error occurred')

二、Week 2: Parsing

1. Beautiful Soup

(1). Installing Beautiful Soup

import requests
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())

(2). Basic elements of Beautiful Soup

from bs4 import BeautifulSoup
soup = BeautifulSoup('<html>data</html>', 'html.parser')
soup2 = BeautifulSoup(open('./demo.html'), 'html.parser')
PARSER             USE
bs4 HTML parser    BeautifulSoup(mk, 'html.parser')
lxml HTML parser   BeautifulSoup(mk, 'lxml')
lxml XML parser    BeautifulSoup(mk, 'xml')
html5lib parser    BeautifulSoup(mk, 'html5lib')
ELEMENT   ABOUT
Tag       a tag, delimited by an opening <> and a closing </>
Name      the tag's name, e.g. <p>...
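A short sketch of Tag, Name, and attrs, using an inline HTML string as a stand-in for the course's demo.html:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p class="title"><b>Demo page</b></p></body></html>',
                     'html.parser')
tag = soup.p             # soup.<name> returns the first matching Tag
print(tag.name)          # p
print(tag.attrs)         # {'class': ['title']}  (class is multi-valued, hence a list)
print(tag.b.string)      # Demo page
```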