This post was last edited by 叶凯 on 2020-10-9 01:22
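Lately I've wanted to build a few projects of my own that need data collected from the web. I used to scrape with LocoySpider (火车头), but it felt inflexible, so today I spent some time learning Python.
Here is what I got done today: two small hands-on exercises.
The first removes the watermark from Douyin videos.
The second scrapes the danmaku (bullet comments) from a Bilibili video, segments them with jieba, and then renders a word cloud image.
Please ignore the variable names; I picked them at random.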
Removing the Douyin watermark
[Python]
import json
import re

import requests


def download_page(url, pc=True):
    # Send a desktop User-Agent by default, or a mobile one when pc is False
    if pc:
        ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    else:
        ua = 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
    headers = {'User-Agent': ua}
    res = requests.get(url, headers=headers)
    return res


if __name__ == '__main__':
    # Example share link: https://v.douyin.com/JyPHShN/
    url = input('Enter the Douyin video URL: ')
    # Follow the redirect to get the real address
    res1 = download_page(url)
    patten = re.compile('/video/(.*?)/')
    # Example of a resolved address:
    # https://www.iesdouyin.com/share/video/6881151874846723339/?region=CN&mid=6881152095287479053&u_code=imbie9bd&titleType=title&timestamp=1602169029&app=aweme&utm_campaign=client_share&utm_medium=ios&tt_from=copy&utm_source=copy
    # The number after /video/ is the item_ids value (6881151874846723339); extract it with the regex
    item_ids = (patten.findall(res1.url))[0]
    # Fetch the video metadata
    res2 = download_page(f'https://www.iesdouyin.com/web/api/v2/aweme/iteminfo/?item_ids={item_ids}')
    res2_text = json.loads(res2.text)
    info = res2_text['item_list'][0]
    old_addr = info['video']['play_addr']['url_list'][0]
    new_addr = old_addr.replace('playwm', 'play')
    # Request with the mobile UA to get the watermark-free video address
    res3 = download_page(new_addr, False)
    new_addr = res3.url
    douyin_info = {
        'aweme_id': info['aweme_id'],
        'title': info['desc'],
        'cover': info['video']['cover']['url_list'][0],
        'play_addr': new_addr
    }
    print(douyin_info)
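One fragile spot in the script above is `findall(...)[0]`, which raises an IndexError when the redirected URL has no /video/ segment (for example, if someone pastes a user profile link). Below is a minimal sketch of a more defensive version of just that step. The function name `extract_item_id` and the pattern constant are my own, not part of the original script, and this only covers the ID extraction, not the API calls.

```python
import re
from typing import Optional

# Matches the run of digits that follows /video/ in a resolved share URL.
VIDEO_ID_PATTERN = re.compile(r'/video/(\d+)')


def extract_item_id(resolved_url: str) -> Optional[str]:
    """Return the numeric video ID from a resolved share URL, or None if absent."""
    match = VIDEO_ID_PATTERN.search(resolved_url)
    return match.group(1) if match else None


if __name__ == '__main__':
    # Shortened form of the example URL from the comments in the script above.
    example = 'https://www.iesdouyin.com/share/video/6881151874846723339/?region=CN'
    print(extract_item_id(example))
```

The caller can then check for `None` and print a friendly message instead of crashing.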
Scraping danmaku
[Python]
import json
import re

import requests


# Download a page
def download_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    res = requests.get(url, headers=headers)
    return res


# Look up the cid from the dvid
def get_cid(dvid):
    '''
    Get the cid.
    Example request: https://api.bilibili.com/x/player/pagelist?bvid=BV1KK4y1h76a&jsonp=jsonp
    :param dvid: the BV id of the video
    :return: cid
    '''
    url = f'https://api.bilibili.com/x/player/pagelist?bvid={dvid}&jsonp=jsonp'
    res = download_page(url)
    return json.loads(res.text)['data'][0]['cid']


# Request the danmaku for a cid
def get_msg(cid):
    '''
    Example request: https://api.bilibili.com/x/v1/dm/list.so?oid=241955049
    :param cid: the cid returned by get_cid
    :return: list of danmaku strings
    '''
    url = f'https://api.bilibili.com/x/v1/dm/list.so?oid={cid}'
    res = download_page(url)
    # Decode explicitly as UTF-8 rather than relying on the guessed encoding
    xml_text = res.content.decode('utf-8')
    patten = re.compile('<d.*?>(.*?)</d>')
    dan_mu_list = patten.findall(xml_text)
    return dan_mu_list


# Save the danmaku to a txt file, one per line
def save_to_file(dan_mu_list, filename):
    with open(filename, mode='w', encoding='utf-8') as f:
        for i in dan_mu_list:
            f.write(i)
            f.write('\n')


# Main routine for scraping danmaku
def main(dvid):
    cid = get_cid(dvid)
    dan_mu_list = get_msg(cid)
    save_to_file(dan_mu_list, f'{dvid}.txt')


if __name__ == '__main__':
    # dvid = 'BV1aE411d7Rp'
    dvid = input('Enter the Bilibili video suffix (BV id): ')
    main(dvid)
    print('Danmaku scraped successfully')
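The regex in `get_msg` works, but it leaves XML entities such as `&amp;` undecoded in the saved text. As an alternative, here is a rough sketch that reads the same `<d>` elements with the standard library's XML parser instead. The function name `parse_danmaku` and the sample document are made up for illustration; in particular, the `<i>` root and the `p` attribute values are assumptions, not data taken from a real response.

```python
import xml.etree.ElementTree as ET
from typing import List


def parse_danmaku(xml_text: str) -> List[str]:
    """Return the text of every non-empty <d> element in a danmaku XML document."""
    root = ET.fromstring(xml_text)
    # iter('d') walks the whole tree, so the root element's name does not matter.
    return [d.text for d in root.iter('d') if d.text]


# Hand-written sample; the structure and attribute values are placeholders.
SAMPLE_XML = (
    '<i>'
    '<d p="0,1,25,16777215">hello</d>'
    '<d p="1,1,25,16777215">a &amp; b</d>'
    '<d p="2,1,25,16777215"></d>'
    '</i>'
)

if __name__ == '__main__':
    print(parse_danmaku(SAMPLE_XML))
```

The parser decodes entities for you and the empty comment is skipped, so the list can go straight into `save_to_file`.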
Generating a word cloud image
[Python]
import jieba
import wordcloud


# Read the danmaku file
def rand_file(filename):
    with open(filename, mode='r', encoding='utf-8') as f:
        dan_mu = f.read()
    return dan_mu


# Segment with jieba, then generate the word cloud
def jieba_cut(text, imgname):
    cut_list = jieba.lcut(text)
    # WordCloud expects space-separated words
    word = ' '.join(cut_list)
    # msyh.ttc is the Microsoft YaHei font; a font with Chinese glyphs is needed here
    w = wordcloud.WordCloud(font_path='msyh.ttc', background_color='white', width=600, height=400)
    w.generate(word)
    w.to_file(f'{imgname}.png')


if __name__ == '__main__':
    # dvid = 'BV1aE411d7Rp'
    dvid = input('Enter the Bilibili video suffix (BV id): ')
    text = rand_file(f'{dvid}.txt')
    jieba_cut(text, dvid)
    print('Word cloud image generated')
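This is how the generated image turns out: the more often a word appears in the danmaku, the larger its font. I think that's pretty fun.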
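To see which words will end up largest before rendering the image, you can count the tokens yourself. This is only a sketch of the counting idea, not what the wordcloud library does internally; the helper name `top_words` and the sample tokens are hypothetical, standing in for the list that `jieba.lcut()` returns.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(tokens: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Count tokens, skipping whitespace-only ones, and return the n most common."""
    counts = Counter(t for t in tokens if t.strip())
    return counts.most_common(n)


# Hypothetical tokens standing in for the output of jieba.lcut().
sample_tokens = ['哈哈', '好看', '哈哈', ' ', '前方高能', '哈哈', '好看']

if __name__ == '__main__':
    print(top_words(sample_tokens))
    # prints [('哈哈', 3), ('好看', 2), ('前方高能', 1)]
```

The words at the top of this list are the ones you should expect to dominate the picture.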