Beautiful Soup is a Python library for extracting data from HTML or XML files.
Requests is an elegant and simple HTTP library for Python, built for human beings.
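We use Requests to fetch a blog page and then walk through basic BeautifulSoup usage on the result; the blog picked here is one of the top-ranked ones. Note that if you reach the internet from an intranet through a proxy, Requests can be given the proxy like this: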
import requests

url = 'http://blog.csdn.net'
# user, pass, proxy_ip and proxy_port are placeholders for your own proxy settings
proxies = {
    "http": "http://user:pass@proxy_ip:proxy_port",
}
r = requests.get('http://blog.csdn.net/phphot', proxies=proxies)
print(r.status_code)
response = r.text
# print(response)
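The proxy setup above can also be attached to a session so that every request reuses it. The following is only a minimal sketch: the proxy address and credentials are made up for illustration, and the request is prepared but never sent, so nothing touches the network.

```python
import requests

# Hypothetical proxy address and credentials, for illustration only.
proxies = {
    "http": "http://user:pass@10.0.0.1:3128",
    "https": "http://user:pass@10.0.0.1:3128",
}

# A Session keeps the proxy settings for every request made through it.
session = requests.Session()
session.proxies.update(proxies)

# Prepare the request without sending it, so this sketch runs offline.
request = requests.Request("GET", "http://blog.csdn.net/phphot")
prepared = session.prepare_request(request)
print(prepared.method, prepared.url)

# A real call would look like:
# response = session.get(prepared.url, timeout=10)
```

Passing a timeout on the real call is a good habit behind a proxy, since by default Requests waits indefinitely.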
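Once the response is back, we parse it with BeautifulSoup, specifying the lxml parser. Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object, and all of these objects fall into four kinds:
Tag, NavigableString, BeautifulSoup, Comment.
For example, the Tag object: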
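To see the four kinds side by side, here is a small sketch on an inline document. The HTML string is invented for illustration, and it uses the built-in "html.parser" instead of lxml only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up document containing a comment, a tag and some text.
html_doc = '<p class="intro"><!-- a note --><b>bold</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

p = soup.p                 # Tag
comment = p.contents[0]    # Comment (a special kind of NavigableString)
text = soup.b.string       # NavigableString

for obj in (soup, p, text, comment):
    print(type(obj).__name__)
```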
from bs4 import BeautifulSoup
from lxml import html

soup = BeautifulSoup(response, "lxml")
tree = html.fromstring(response)
# print(soup.prettify())
# tag
# print("h1 is '%s'" % soup.h1)
print("---------------------start---------------------")
title = soup.h1
# tag name start
print("tag name is '%s'" % title.name)
# tag name end
# tag attribute start
# .get() returns None instead of raising KeyError if <h1> has no class attribute
print('tag class attribute is %s' % title.get("class"))
print('get class attribute by .')
d = title.attrs
for k in d:
    print("key is %s, value is %s" % (k, d[k]))
if "class" in title.attrs:
    print('class attribute type is %s' % type(title["class"]))
print("get contents by .")
for c in title.contents:
    print("current is %s" % c)
    print("next_sibling is %s" % c.next_sibling)
    print("previous_sibling is %s" % c.previous_sibling)
# tag attribute end
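The same Tag operations are easier to follow on a fixed input. This sketch uses a made-up h1 element (not the real blog page) and the built-in "html.parser"; it shows that class comes back as a list, while id is a plain string.

```python
from bs4 import BeautifulSoup

# Made-up heading with a multi-valued class attribute and mixed children.
html_doc = '<h1 class="title big" id="top">Hello <span>world</span>!</h1>'
soup = BeautifulSoup(html_doc, "html.parser")
title = soup.h1

print(title.name)          # tag name
print(title["class"])      # multi-valued attribute: a list of class names
print(title.get("id"))     # single-valued attribute: a string
print(title.get("lang"))   # missing attribute: None instead of KeyError

# .contents lists the direct children: text, the <span> tag, text.
children = title.contents
for child in children:
    print(repr(child), repr(child.previous_sibling), repr(child.next_sibling))
```

The first child has no previous sibling and the last child has no next sibling, so both are reported as None.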
For example, the NavigableString object:
# string start
print("NavigableString is %s, unicode string is %s" % (title.string, str(title.string)))
print("NavigableString type is %s" % type(title.string))
# string end
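One detail worth knowing about .string: it only returns a value when the tag has a single child. A short sketch on an invented document (again using "html.parser" so it runs without lxml):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Made-up document: <h1> has one text child, <p> has text plus a tag.
soup = BeautifulSoup("<h1>Top blogs</h1><p>one <b>two</b></p>", "html.parser")

heading_text = soup.h1.string   # NavigableString, still linked to the tree
plain = str(heading_text)       # plain Python string, detached from the tree
mixed = soup.p.string           # None, because <p> has more than one child
full = soup.p.get_text()        # concatenates all text under <p>

print(type(heading_text).__name__, heading_text.parent.name)
print(type(plain).__name__, plain)
print(mixed, repr(full))
```

Converting with str() is useful when the text should outlive the parse tree, since a NavigableString keeps a reference to its parent.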
If you are familiar with CSS selectors, you can also use them to pick out objects, like this:
# css selector start
css_selector_result = soup.select(".unfold-btn")
print("css selector result is %s" % css_selector_result)
print("css_selector_result type is %s" % type(css_selector_result))
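What select() returns can be checked on a small inline document. The markup below is invented for illustration (only the .unfold-btn class name and the a[href] selector come from this post), and "html.parser" is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup: two buttons (one without a span) and three anchors.
html_doc = """
<div class="unfold-btn"><span>More</span></div>
<div class="unfold-btn"></div>
<a href="/a"><span>A</span></a>
<a href="/b">B</a>
<a name="anchor">no href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select() returns a list-like result holding every matching Tag.
buttons = soup.select(".unfold-btn")
print(isinstance(buttons, list), len(buttons))

# find() returns None when there is no match, so guard before using .string.
labels = []
for button in buttons:
    span = button.find("span")
    labels.append(span.string if span is not None else None)
print(labels)

# Attribute selector: only anchors that actually carry an href.
hrefs = [a["href"] for a in soup.select("a[href]")]
print(hrefs)

# select_one() returns the first match, or None if nothing matches.
first_span = soup.select_one("a[href] > span")
print(first_span.string)
```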
for css in css_selector_result:
    print("css span is '%s', span string is '%s'" % (css.find("span"), css.find("span").string))
a_list = soup.select("a[href]")
for a in a_list:
    if a.span is None:
        print("href is '%s', span string is '%s