Scraping CSDN Blog Articles with BeautifulSoup

Latest recommended article published on 2025-06-11 08:51:40

__HelloWorld__

License: CC 4.0 BY-SA

Categories: Python, Front-End Architecture (General). Tags: Python, Requests, Beautiful, web crawling

Article link: https://blog.csdn.net/kangkanglou/article/details/78785563

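This article shows how to fetch CSDN blog information with Python's Requests library and parse it with BeautifulSoup. It covers configuring a proxy, issuing the HTTP request, parsing the returned HTML, and using CSS selectors to extract information such as the blog rank and the article list, illustrating the basics of scraping data with a crawler.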

Beautiful Soup is a Python library for extracting data from HTML or XML files.

Requests is an elegant and simple HTTP library for Python, built for human beings.

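We send a request with Requests to fetch the blog page and then parse it with the basic features of BeautifulSoup. The example targets one of the top-ranked blogs. Note that if you are on an internal network and reach the internet through a proxy, Requests can be given the proxy settings as follows: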

import requests

url = 'http://blog.csdn.net'
# Only needed behind a proxy; replace the placeholders with real values
proxies = {"http": "http://user:pass@proxy_ip:proxy_port"}
# Fetch the page of a top-ranked blog
r = requests.get(url + '/phphot', proxies=proxies)
print(r.status_code)
response = r.text
# print(response)
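Credentials embedded in a proxy URL need to be percent-encoded when they contain reserved characters such as @ or :. The following is a minimal sketch of assembling the proxies mapping under that assumption. The helper name build_proxies and the sample credentials, host, and port are hypothetical and do not come from the original article.

```python
from urllib.parse import quote


def build_proxies(user, password, host, port):
    """Build a proxies mapping in the shape accepted by requests.get(..., proxies=...)."""
    # Percent-encode the credentials so characters like '@' or ':' do not
    # break the structure of the proxy URL.
    auth = "%s:%s" % (quote(user, safe=""), quote(password, safe=""))
    proxy_url = "http://%s@%s:%s" % (auth, host, port)
    # Route both plain HTTP and HTTPS traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical values, for illustration only.
proxies = build_proxies("user", "p@ss", "10.0.0.1", 8080)
print(proxies["http"])
```

The resulting mapping can be passed as requests.get(url, proxies=proxies, timeout=10). Adding a timeout keeps the script from hanging indefinitely if the proxy is unreachable.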

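Once the response has been received, we parse it with BeautifulSoup, specifying the lxml parser. Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: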

Tag, NavigableString, BeautifulSoup, Comment.
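To make the four kinds concrete, here is a small sketch that parses an inline HTML string instead of a downloaded page. The markup is made up for illustration, and the sketch uses Python's built-in html.parser rather than lxml so that it has no extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one heading containing text and an HTML comment.
html_doc = '<h1 class="title main">Hello<!-- note --></h1>'
soup = BeautifulSoup(html_doc, "html.parser")

h1 = soup.h1                 # a Tag
text, comment = h1.contents  # a NavigableString and a Comment

for obj in (soup, h1, text, comment):
    print(type(obj).__name__)
```

Comment is a special kind of NavigableString, so code that handles text nodes generically will also see comments unless it filters them out.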

For example, the Tag object:

from bs4 import BeautifulSoup
from lxml import html

soup = BeautifulSoup(response, "lxml")
tree = html.fromstring(response)  # lxml element tree of the same page
# print(soup.prettify())
# tag
# print("h1 is '%s'" % soup.h1)
print("---------------------start---------------------")
title = soup.h1
# tag name start
print("tag name is '%s'" % title.name)
# tag name end

# tag attribute start
# .get() returns None instead of raising KeyError when the attribute is absent
print('tag class attribute is %s' % title.get("class"))
print('get all attributes by .attrs')
d = title.attrs
for k in d:
    print("key is %s, value is %s" % (k, d[k]))
if "class" in title.attrs:
    print('class attribute type is %s' % type(title["class"]))


print("get contents by .contents")
for c in title.contents:
    print("current is %s" % c)
    print("next_sibling is %s" % c.next_sibling)
    print("previous_sibling is %s" % c.previous_sibling)
# tag attribute end
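The behaviour of the attribute and sibling accessors used above is easier to see on a fixed input. This is a minimal sketch on made-up markup, again with html.parser in place of lxml. It shows that class is returned as a list because it is a multi-valued attribute, while id is a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical heading, not taken from the CSDN page.
html_doc = '<h1 class="title main" id="top">Hi <span>there</span></h1>'
h1 = BeautifulSoup(html_doc, "html.parser").h1

print(h1.name)         # tag name
print(h1["class"])     # multi-valued attribute, returned as a list
print(h1["id"])        # single-valued attribute, returned as a string
print(h1.get("lang"))  # missing attribute: .get() gives None, h1["lang"] would raise KeyError

first = h1.contents[0]         # the text node "Hi "
print(first.next_sibling)      # the <span> element that follows it
print(first.previous_sibling)  # None, because it is the first child
```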

For example, the NavigableString object:

# string start
print("NavigableString is %s, unicode string is %s" % (title.string, str(title.string)))
print("NavigableString type is %s" % type(title.string))
# string end
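One detail worth knowing is that .string only returns a value when the tag has a single child. The sketch below, on two made-up headings parsed with html.parser, contrasts that case with a heading that mixes text and a nested element, where get_text() is the safer choice.

```python
from bs4 import BeautifulSoup

# Hypothetical headings for illustration.
single = BeautifulSoup("<h1>CSDN Blog</h1>", "html.parser").h1
mixed = BeautifulSoup("<h1>CSDN <span>Blog</span></h1>", "html.parser").h1

print(single.string)                   # the only child, a NavigableString
print(isinstance(single.string, str))  # NavigableString is a str subclass
print(mixed.string)                    # None: the tag has more than one child
print(mixed.get_text())                # concatenated text of all descendants
```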

If you are familiar with CSS selectors, you can also use them to retrieve objects through select().
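As a self-contained illustration, the sketch below runs a class selector and an attribute selector against a small inline document. Only the .unfold-btn class name and the a[href] selector come from the article. The markup itself is invented, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the kinds of elements being selected.
html_doc = """
<div class="unfold-btn"><span>Expand</span></div>
<div class="unfold-btn"></div>
<a href="/a"><span>First</span></a>
<a href="/b">Second</a>
<a name="anchor">No href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select() always returns a list of Tag objects, possibly empty.
buttons = soup.select(".unfold-btn")
# Guard against elements without a <span> child before reading .string.
labels = [b.span.string if b.span else None for b in buttons]

# a[href] matches only anchors that actually carry an href attribute.
links = [(a["href"], a.get_text()) for a in soup.select("a[href]")]

print(labels)
print(links)
```

The article's own snippet applies the same calls to the downloaded CSDN page.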

# css selector start

css_selector_result = soup.select(".unfold-btn")
print("css selector result is %s" % css_selector_result)
print("css_selector_result type is %s" % type(css_selector_result))
for css in css_selector_result:
    span = css.find("span")  # None when the element has no <span> child
    print("css span is '%s', span string is '%s'" % (span, span.string if span else None))

a_list = soup.select("a[href]")
for a in a_list:
    if a.span is None:
        print("href is '%s', span string is '%s