
Python Web Crawler Internship Report

Published: 2020-08-26  Source: Internship Reports


Contents

I. Topic Background
II. Crawler Principles
III. History and Classification of Crawlers
IV. Comparison of Common Crawler Frameworks
V. Data Crawling in Practice (Scraping Movie Data from Douban)
    1 Analyzing the Web Page
    2 Crawling the Data
    3 Organizing and Transforming the Data
    4 Saving and Displaying the Data
    5 Technical Difficulties and Key Points
VI. Summary

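I. Topic Background

II. Crawler Principles

III. History and Classification of Crawlers

IV. Comparison of Common Crawler Frameworks

Scrapy: Scrapy is a relatively mature Python crawler framework. It is a fast, high-level information-crawling framework written in Python that can efficiently crawl web pages and extract structured data from them. Scrapy has a wide range of uses, including crawler development, data mining, data monitoring, and automated testing.

Crawley: Crawley is also a crawler framework written in Python. It aims to change the way people extract data from the internet.

Portia: Portia is a crawler framework that lets users with no programming background crawl web pages visually.

newspaper: newspaper is a Python crawler framework for extracting news and articles and for analyzing their content.

Python-goose: The information Python-goose can extract includes: <1> the main body of an article; <2> the article's main image; <3> any YouTube/Vimeo videos embedded in the article; <4> the meta description; <5> the meta tags.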

V. Data Crawling in Practice (Scraping Movie Data from Douban)

1 Analyzing the Web Page

import urllib.request

# Fetch the HTML source of every result page
def __getHtml():
    data = []
    pageNum = 1
    pageSize = 0  # item offset of the current page; each page lists 25 movies
    try:
        # Offsets 0, 25, ..., 125: six requests in total
        while pageSize <= 125:
            # headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
            #            "Referer": None  # Note: if the page still cannot be fetched, the target site's host can be set here
            #            }
            # opener = urllib.request.build_opener()
            # opener.addheaders = [headers]
            # The base address is blank in the source text and is left blank here
            url = "" + str(pageSize) + "&filter=" + str(pageNum)
            # data["html%s" % i] = urllib.request.urlopen(url).read().decode("utf-8")
            data.append(urllib.request.urlopen(url).read().decode("utf-8"))
            pageSize += 25
            pageNum += 1
            print(pageSize, pageNum)
    except Exception as e:
        raise e
    return data
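The loop above advances an item offset by 25 on each request. A minimal sketch of how such page addresses might be generated separately from the download step is given here. The base address `https://example.com/top250` and the parameter name `start` are hypothetical placeholders, since the report's real URL is not shown; only the `filter` parameter and the step of 25 come from the listing.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; the report's actual address is not shown.
BASE_URL = "https://example.com/top250"


def build_page_urls(total_items=150, page_size=25, base_url=BASE_URL):
    """Return one URL per result page, stepping the offset by page_size."""
    urls = []
    offsets = range(0, total_items, page_size)
    for page_num, offset in enumerate(offsets, start=1):
        # "start" is an assumed parameter name; "filter" mirrors the listing.
        query = urlencode({"start": offset, "filter": page_num})
        urls.append(f"{base_url}?{query}")
    return urls


if __name__ == "__main__":
    for url in build_page_urls():
        print(url)
```

Keeping URL construction apart from the network call makes the paging logic easy to check without sending any requests.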

2 Crawling the Data

from bs4 import BeautifulSoup

def __getData(html):
    title = []                # movie titles
    # rating_num = []         # ratings
    range_num = []            # ranks
    # rating_people_num = []  # number of people who rated
    movie_author = []         # directors
    data = {}
    # Parse the HTML with bs4
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.find("ol", attrs={"class": "grid_view"}).find_all("li"):
        title.append(li.find("span", class_="title").text)
        # rating_num.append(li.find("div", class_="star").find("span", class_="rating_num").text)
        rank = li.find("div", class_="pic").find("em").text
        range_num.append(rank)
        # spans = li.find("div", class_="star").find_all("span")
        # for x in range(len(spans)):
        #     if x <= 2:
        #         pass
        #     else:
        #         rating_people_num.append(spans[x].string[-len(spans[x].string):-3])
        # Renamed from "str" so the built-in type is not shadowed
        info = li.find("div", class_="bd").find("p", class_="").text.lstrip()
        # The director's name ends where the cast label (starting with "主") begins
        index = info.find("主")
        if index == -1:
            index = info.find("...")
        print(rank)
        # .text is a string, so compare with "210" rather than the integer 210
        if rank == "210":
            index = 60
        # print("aaa")
        # print(info[4:index])
        # Skip the 4-character director label at the start of the line
        movie_author.append(info[4:index])
    data["title"] = title
    # data["rating_num"] = rating_num
    data["range_num"] = range_num
    # data["rating_people_num"] = rating_people_num
    data["movie_author"] = movie_author
    return data
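The director is obtained above by slicing the info line at fixed positions, which is why one entry needs a hard-coded index. A minimal sketch of a label-based alternative follows. It assumes the info line has the form `导演: NAME   主演: ...`; the helper name `extract_director` and the sample strings are hypothetical, and matching on the full cast label `主演` instead of the single character `主` is a change from the listing.

```python
def extract_director(info: str) -> str:
    """Return the text between the director label and the cast label."""
    text = info.strip()
    prefix = "导演:"  # assumed director label at the start of the line
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Cut at the cast label if present, otherwise at a trailing ellipsis.
    for marker in ("主演", "..."):
        pos = text.find(marker)
        if pos != -1:
            text = text[:pos]
            break
    return text.strip()


if __name__ == "__main__":
    # Illustrative input only, not data taken from the report.
    print(extract_director("导演: 张三   主演: 李四 / 王五"))
```

Because the cut points are found by label, no per-movie special case is needed when a line is truncated before the cast label.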

3 Organizing and Transforming the Data

import time

def __getMovies(datas):
    # nowtime is not defined anywhere in the original listing;
    # a formatted current time is assumed here
    nowtime = time.strftime("%Y-%m-%d %H:%M:%S")
    with open("F://douban_movie.html", "w", encoding="utf-8") as f:
        f.write("<html>")
        f.write('<head><meta charset="UTF-8"><title>Insert title here</title></head>')
        f.write("<body>")
        f.write("<h1>Douban Movie Crawl</h1>")
        f.write("<h4>Author: 劉文斌</h4>")
        f.write("<h4>Time: " + nowtime + "</h4>")
        f.write("<hr>")
        f.write('<table width="800px" border="1" align=center>')
        f.write("<thead>")
        f.write("<tr>")
        f.write('<th><font size="5" color=green>Movie</font></th>')
        # f.write('<th width="50px"><font size="5" color=green>Rating</font></th>')
        f.write('<th width="50px"><font size="5" color=green>Rank</font></th>')
        # f.write('<th width="100px"><font size="5" color=green>Number of ratings</font></th>')
        f.write('<th><font size="5" color=green>Director</font></th>')
        f.write("</tr>")
        f.write("</thead>")
        f.write("<tbody>")
        for data in datas:
            # Use the real row count instead of a fixed 25 so a short page cannot raise IndexError
            for i in range(len(data["title"])):
                f.write("<tr>")
                f.write('<td style="color:orange;text-align:center">%s</td>' % data["title"][i])
                # f.write('<td style="color:blue;text-align:center">%s</td>' % data["rating_num"][i])
                f.write('<td style="color:red;text-align:center">%s</td>' % data["range_num"][i])
                # f.write('<td style="color:blue;text-align:center">%s</td>' % data["rating_people_num"][i])
                f.write('<td style="color:black;text-align:center">%s</td>' % data["movie_author"][i])
                f.write("</tr>")
        f.write("</tbody>")
        # The stray second </thead> of the original listing is dropped
        f.write("</table>")
        f.write("</body>")
        f.write("</html>")

if __name__ == "__main__":
    datas = []
    htmls = __getHtml()
    for i in range(len(htmls)):
        data = __getData(htmls[i])
        datas.append(data)
    __getMovies(datas)
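The listing writes scraped text straight into the HTML output, so a title containing `&` or `<` would corrupt the page. A minimal sketch of building the table as a string with escaping is given here. The function name `render_movie_table` and the sample data are hypothetical; the dictionary keys follow the report's code, and the styling attributes are omitted for brevity.

```python
import html


def render_movie_table(pages):
    """Build an HTML table from a list of per-page dicts of parallel lists."""
    rows = []
    for page in pages:
        # zip stops at the shortest list, so a short page cannot raise IndexError.
        columns = zip(page["title"], page["range_num"], page["movie_author"])
        for title, rank, director in columns:
            cells = "".join(
                f"<td>{html.escape(str(value))}</td>"
                for value in (title, rank, director)
            )
            rows.append(f"<tr>{cells}</tr>")
    header = "<tr><th>Movie</th><th>Rank</th><th>Director</th></tr>"
    body = "".join(rows)
    return f"<table><thead>{header}</thead><tbody>{body}</tbody></table>"


if __name__ == "__main__":
    # Illustrative data only, not results from the report.
    sample = [{"title": ["A & B"], "range_num": ["1"], "movie_author": ["Jane <Doe>"]}]
    print(render_movie_table(sample))
```

Returning a string instead of writing to a fixed path also lets the caller decide where the file goes.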

4 Saving and Displaying the Data

The result is shown in the figure below:

5 Technical Difficulties and Key Points

Data Crawling in Practice (Scraping Housing Data from SouFun)

from bs4 import BeautifulSoup
import requests

# The address is blank in the source text and is left blank here
rep = requests.get("")
rep.encoding = "gb2312"  # set the encoding used to decode the response
html = rep.text
soup = BeautifulSoup(html, "html.parser")
with open("F://fang.html", "w", encoding="utf-8") as f:
    f.write("<html>")
    f.write('<head><meta charset="UTF-8"><title>Insert title here</title></head>')
    f.write("<body>")
    f.write("<center><h1>New-Home Sales TOP3</h1></center>")
    f.write('<table border="1px" width="1000px" height="800px" align=center><tr>')
    f.write("<th><h2>Address</h2></th>")
    f.write("<th><h2>Sales volume</h2></th>")
    f.write("<th><h2>Average price</h2></th></tr>")
    for li in soup.find("ul", class_="ul02").find_all("li"):
        name = li.find("div", class_="pbtext").find("p").text
        chengjiaoliang = li.find("span", class_="red-f3").text
        # junjia holds the matched tag itself, because the .text call is commented out
        try:
            # "ohter" is the class name as it appears in the listing
            junjia = li.find("div", class_="ohter").find("p", class_="gray-9")  # .text.replace("?O", "平方米")
        except Exception as e:
            junjia = li.find("div", class_="gray-9")  # .text.replace("?O", "平方米")
        f.write('<tr><td align=center><font size="5px" color=red>%s</font></td>' % name)
        f.write('<td align=center><font size="5px" color=blue>%s</font></td>' % chengjiaoliang)
        f.write('<td align=center><font size="5px" color=green>%s</font></td></tr>' % junjia)
        print(name)
    f.write("</table>")
    f.write("</body>")
    # The original listing never closes the html element or the file
    f.write("</html>")
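The script sets `rep.encoding = "gb2312"` before reading `rep.text`, so that the response bytes are decoded with that codec. What this step amounts to can be sketched with plain bytes and no network access. The helper name `decode_page` and the sample string are hypothetical, and the use of `errors="replace"` is a suggestion that does not appear in the report.

```python
def decode_page(raw: bytes, encoding: str = "gb2312") -> str:
    """Decode raw page bytes, substituting U+FFFD for undecodable bytes."""
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Illustrative bytes only, standing in for a downloaded page body.
    sample = "新房成交".encode("gb2312")
    print(decode_page(sample))           # decoded with the declared encoding
    print(decode_page(sample, "utf-8"))  # wrong codec: replacement characters appear
```

The commented-out `replace("?O", ...)` call in the listing suggests that some characters did not decode cleanly. Since `gbk` is a superset of `gb2312`, passing `encoding="gbk"` is one option worth trying in that situation.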

VI. Summary

Instructor's comments:

Grade:

Supervising instructor:
