Big Data Learning Day 6

Posted on 2019-08-01 · Updated on 2019-08-06 · Category: Big Data
Scraping Zhaopin Job Listings

This post uses a crawler to scrape job data from Zhaopin (智联招聘). Zhaopin has anti-scraping measures, so the listing content under the page's div elements cannot be extracted from the HTML source. Instead, the scraper fetches the JSON data from Zhaopin's search API and parses that:
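Before writing the full scraper, it helps to probe the endpoint and confirm it really returns JSON with the expected layout. A minimal sketch (the query parameters mirror the ones in the code below, and the data → results nesting is the structure that code relies on):

import requests

# Probe Zhaopin's search API directly; it returns JSON rather than rendered HTML.
api = 'https://fe-api.zhaopin.com/c/i/sou'
params = {'pageSize': 90, 'cityId': 489, 'kw': '大数据', 'kt': 3}
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(api, params=params, headers=headers)
data = resp.json()

# Listings live under data['data']['results']; print the fields one listing carries.
results = data['data']['results']
print(len(results), 'listings on this page')
print(sorted(results[0].keys()))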

My code:

import requests
import json

url = 'https://fe-api.zhaopin.com/c/i/sou?pageSize=90&cityId=489&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=%E5%A4%A7%E6%95%B0%E6%8D%AE&kt=3&_v=0.79005936&x-zp-page-request-id=520adc5dcbde404f8f20d5c0846b54b5-1562324160122-21643&x-zp-client-id=0d243d91-4d7b-43f7-9551-07854f531ab2'
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

def spider_job_content():
    response = requests.get(url=url, headers=header)
    print(response.status_code)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        html = response.text

        # The response body is JSON, not HTML, so parse it directly.
        html0 = json.loads(html)
        text = ''
        file = open('spider_job.txt', 'w', encoding='utf-8')
        for result in html0['data']['results']:
            jobname = result["jobName"]
            company = result['company']['name']
            URL = result['company']['url']
            update = result["updateDate"]
            salary = result['salary']
            jobType = result["jobType"]["items"][0]["name"]
            workingExp = result["workingExp"]['name']
            welfare = str(result["welfare"])
            text += '工作名称:' + jobname + '\n'      # job title
            text += '公司名称:' + company + '\n'      # company name
            text += '公司网站:' + URL + '\n'          # company website
            text += '更新日期:' + update + '\n'       # last updated
            text += '薪水:' + salary + '\n'           # salary
            text += '工作类型:' + jobType + '\n'      # job type
            text += '工作经验:' + workingExp + '\n'   # required experience
            text += '福利:' + welfare + '\n\n\n'      # benefits

        file.write(text)
        file.close()

if __name__ == '__main__':
    spider_job_content()
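One fragile spot in the code above: fields are indexed directly (result['company']['name'], result['jobType']['items'][0], ...), so a single listing missing one field raises a KeyError or IndexError and kills the whole run. A defensive variant using dict.get with defaults (a sketch built on the same field names the code above already uses):

def extract_listing(result):
    # Tolerate listings with missing fields instead of crashing the whole run.
    company = result.get('company') or {}
    job_type_items = (result.get('jobType') or {}).get('items') or []
    return {
        '工作名称': result.get('jobName', ''),
        '公司名称': company.get('name', ''),
        '公司网站': company.get('url', ''),
        '更新日期': result.get('updateDate', ''),
        '薪水': result.get('salary', ''),
        '工作类型': job_type_items[0].get('name', '') if job_type_items else '',
        '工作经验': (result.get('workingExp') or {}).get('name', ''),
        '福利': str(result.get('welfare', [])),
    }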
The teacher's code:

import csv
import requests
import json
from lxml import etree

class ZhiLian:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

        # Open 招聘.csv for writing and write the header row
        # (company, city, edu, resumeCount, salary, jobName, detail).
        self.head = ['company', 'city', 'edu', 'resumeCount', 'salary', 'jobName', 'detail']
        self.f = open('招聘.csv', 'w', encoding='utf-8', newline='')
        self.out = csv.writer(self.f)
        self.out.writerow(self.head)

    def getHtml(self, url, data):
        # Fetch the data with a GET request.
        response = requests.get(url=url, params=data)
        response = response.content.decode(encoding='utf-8')
        return response

    def parse(self, res):
        html = json.loads(res)
        for i in html['data']['results']:
            try:
                row = []
                # Company name
                company = i['company']['name']
                # City
                city = i['city']['display']
                # Education level
                eduLevel = i['eduLevel']['name']
                # Years of work experience
                workingExp = i['workingExp']['name']
                # Join education level and work experience into one field
                edu = eduLevel + "/" + workingExp
                # Salary
                salary = i['salary']
                # Job title
                jobName = i['jobName']
                # i["positionURL"] is the job's second-level detail page;
                # fetch it and build an XPath-parsable tree.
                data = etree.HTML(requests.get(i["positionURL"]).content.decode(encoding='utf-8'))
                detail = data.xpath('string(//div[@class="describtion__detail-content"])')
                # Number of résumés submitted, from the last <li> of the summary info list
                resumeCount = data.xpath('string(//ul[@class="summary-plane__info"]/li[last()])')
                print(resumeCount)
                row.append(company)
                row.append(city)
                row.append(edu)
                row.append(resumeCount)
                row.append(salary)
                row.append(jobName)
                row.append(detail)
                self.out.writerow(row)
            except Exception as e:
                print("出错了:", e)

    def run(self):
        url = "https://fe-api.zhaopin.com/c/i/sou"
        for i in range(0, 1000, 90):
            data = {
                "start": i,
                "pageSize": 90,
                "cityId": '530',
                "workExperience": -1,
                "education": -1,
                "companyType": -1,
                "employmentType": -1,
                "jobWelfareTag": -1,
                "kw": "大数据",
                "kt": -1,
                "_v": 0.68686337,
                "x-zp-page-request-id": "c6d2f5a9337c45ddbd6659b5a0fd1b33-1562378017635-920095",
                "x-zp-client-id": "a9b00a0c-0a03-43c5-b89b-6ec0c3deb104"
            }
            # Fetch the JSON for this page of results
            res = self.getHtml(url, data)
            # Parse the JSON and append rows to the CSV
            self.parse(res)
        self.f.close()

if __name__ == '__main__':
    zl = ZhiLian()
    zl.run()
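run() steps the start offset through 0, 90, ..., 990 to match pageSize=90, but it keeps requesting pages even after the query is exhausted. A stop-early sketch (assuming, as the code above already does, that data['results'] comes back empty past the last page; the extra filter parameters from the original request are elided here for brevity):

    def run(self):
        # Variant of run() that stops as soon as a page comes back empty,
        # instead of always requesting offsets up to 1000.
        url = "https://fe-api.zhaopin.com/c/i/sou"
        start = 0
        while True:
            data = {"start": start, "pageSize": 90, "cityId": '530',
                    "kw": "大数据", "kt": -1}
            res = self.getHtml(url, data)
            results = json.loads(res).get('data', {}).get('results', [])
            if not results:
                break  # past the last page of listings
            self.parse(res)  # reuse the existing parser on the raw JSON text
            start += 90
        self.f.close()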