Since the web version of Baidu Stock has been taken offline, the site can no longer be scraped, but the code is still worth walking through to understand the approach.
Feature description
Goal: fetch the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges
Output: save the results to a file
Choosing a candidate data site
Sina Finance: http://finance.sina.com.cn/stock/
Baidu Stock: https://gupiao.baidu.com/stock/
Selection criteria: the stock information should exist statically in the HTML page (not generated by JavaScript), and scraping should not be restricted by the site's robots.txt
Selection method: the browser's F12 developer tools, viewing the page source, etc.
Selection mindset: don't fixate on one site; try multiple information sources
Since Sina's stock data is generated by JavaScript, Baidu Stock is used this time
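A quick way to apply the criterion above is to search the raw HTML for a value you can see in the rendered page: if it is missing from the raw source, the data is filled in by JavaScript. A minimal sketch, using hypothetical page snippets rather than the real sites:

```python
# Hypothetical samples: one page with the price baked into the HTML,
# one where the price is rendered later by JavaScript.
static_html = '<div class="price">10.50</div>'
js_html = '<div class="price"></div><script>renderPrice()</script>'

def data_is_static(html, needle):
    """True if the target value already appears in the raw HTML."""
    return needle in html

print(data_is_static(static_html, "10.50"))  # True  -> scrapable with requests
print(data_is_static(js_html, "10.50"))      # False -> JS-generated, look elsewhere
```

In practice you would fetch the page with requests and run the same check on `r.text`.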
Program structure
Step 1: fetch the stock list from East Money (东方财富网)
Step 2: for each stock in the list, fetch that stock's information from Baidu Stock
Step 3: write the results to a file
Code:
- Fetching a web page
The code is:
import requests

# Encoding defaults to utf-8; pass another codec for GB-encoded pages
def getHTMLText(url, code='utf-8'):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""
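The `code` parameter matters because the East Money listing page is GB-encoded: decoding its bytes with the wrong codec produces mojibake. A small illustration (sample bytes, not the real page):

```python
# Bytes as a GB2312-encoded page would send them.
raw = "股票".encode("gb2312")

print(raw.decode("gb2312"))                    # correct: 股票
print(raw.decode("utf-8", errors="replace"))   # mojibake: replacement characters
```

This is why `getStockList` below passes 'GB2312' instead of the utf-8 default.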
- Getting the codes of all listed companies
Each listed company's information can be found on East Money,
at http://quote.eastmoney.com/center/gridlist.html#hs_a_board
The company codes appear in the href attributes, so after collecting the a tags we match each href against a pattern and append the matched codes to the stock list
The code is:
import re
from bs4 import BeautifulSoup

def getStockList(lst, stockUrl):
    html = getHTMLText(stockUrl, 'GB2312')
    soup = BeautifulSoup(html, "html.parser")
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
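The pattern `[s][hz]\d{6}` matches a stock code: the letter s, then h or z (Shanghai or Shenzhen), then six digits. A small demo on made-up hrefs shaped like the listing page's links:

```python
import re

# Hypothetical hrefs like those on the East Money listing page.
hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/",  # no stock code: skipped
]

codes = []
for href in hrefs:
    m = re.findall(r"[s][hz]\d{6}", href)
    if m:
        codes.append(m[0])

print(codes)  # ['sh600000', 'sz000001']
```

Links without a code produce an empty match list, which is why the function above wraps the lookup in try/except and continues.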
- For each stock code, fetch the stock's page from Baidu Stock and store its fields in a dictionary
Inspecting the Baidu Stock page source shows that each field's name sits in a <dt> tag and its value in the paired <dd> tag, all inside a div with class "stock-bets".
The extraction code is:
import traceback

def getStockInfo(lst, stockUrl, fpath):
    count = 0
    for stock in lst:
        url = stockUrl + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, "html.parser")
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):  # was range(keyList), which raises TypeError
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            # 'a' opens the file in append mode
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end='')
        except:
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end='')
            traceback.print_exc()
            continue
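Since the real pages are gone, the dt/dd extraction can be exercised on a minimal stand-in for the old Baidu Stock markup (the HTML below is invented for illustration):

```python
from bs4 import BeautifulSoup

# A made-up fragment shaped like the old Baidu Stock page:
# field names in <dt>, values in the paired <dd>.
sample = """
<div class="stock-bets">
  <a class="bets-name">PingAn Bank (000001)</a>
  <dl><dt>Open</dt><dd>10.50</dd></dl>
  <dl><dt>High</dt><dd>10.80</dd></dl>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
info = soup.find('div', attrs={'class': 'stock-bets'})
record = {'name': info.find(attrs={'class': 'bets-name'}).text.split()[0]}
keys = info.find_all('dt')
vals = info.find_all('dd')
for k, v in zip(keys, vals):
    record[k.text] = v.text

print(record)  # {'name': 'PingAn', 'Open': '10.50', 'High': '10.80'}
```

Using zip here sidesteps the index bookkeeping; the function above does the same thing with an explicit range loop.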
The main function is:
def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = "E://BaiduStockInfo.txt"
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()