Run a MapReduce cleaning pass over the Zhilian Zhaopin (智联招聘) job-posting data collected by the earlier crawler, then analyze the results.
Upload the job data to the distributed file system (HDFS):

```shell
hdfs dfs -put /opt/zl0507.csv /
```
Create a new package in Eclipse and rename it com.sj.clean.
In it, create the class CleanMapper.java:

```java
package com.sj.clean;
```
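The body of CleanMapper is not reproduced above. As an illustration only, the per-line cleaning such a mapper typically performs (drop blank, malformed, or incomplete rows; trim fields) can be sketched in plain Java without the Hadoop types; the four-field CSV layout used here is an assumption, not the verified schema of zl0507.csv:

```java
public class CleanSketch {
    // Hypothetical per-line cleaning, of the kind a CleanMapper.map() body might do.
    // Assumes a 4-field CSV layout (title, company, salary, city) — not the
    // actual schema of zl0507.csv.
    public static String cleanLine(String line) {
        if (line == null || line.trim().isEmpty()) return null;   // skip blank lines
        String[] fields = line.split(",");
        if (fields.length < 4) return null;                       // drop malformed rows
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            String f = fields[i].trim();
            if (f.isEmpty()) return null;                         // drop rows with empty fields
            if (i > 0) out.append('\t');
            out.append(f);
        }
        return out.toString();                                    // tab-separated clean record
    }
}
```

In the real mapper this method's result would be written to the context as the output key, with a `NullWritable` value being one common choice.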
Create the class CleanReducer.java:

```java
package com.sj.clean;
```
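CleanReducer's body is also omitted. A common choice in a cleaning job is a reducer that emits each distinct record once, i.e. deduplication; that logic, minus the Hadoop types, might look like this (an assumption about the job, not the actual CleanReducer):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupSketch {
    // Hypothetical reduce-side logic: keep the first occurrence of each record,
    // preserving input order — the effect a dedup-style CleanReducer has across keys.
    public static List<String> dedup(List<String> records) {
        return new ArrayList<>(new LinkedHashSet<>(records));
    }
}
```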
Create the class Main.java:

```java
package com.sj.clean;
```
Run Main.java; the cleaned output files are generated under /clean/ on HDFS.
Use Hive for the word-frequency statistics.

Common commands:
- create database if not exists hive;
- show databases;
- show databases like 'h.*';
- use hive;
- show tables;
- create table if not exists hive.userr(name string comment 'username', pwd string comment 'password', address struct<street:string,city:string,state:string,zip:int>, identify map<int,tinyint> comment 'number,sex');
Word count, on the master node:

```shell
cd /opt
```
Write a line of test data into a file under /opt (the file name is assumed to be wc.txt):

```
hongyutang love qiaoshuang
```
Start Hive, create the table, and load the data file (the local path is an assumption, following the steps above):

```sql
create table wc(txt String) row format delimited fields terminated by '\t';
load data local inpath '/opt/wc.txt' into table wc;
```
Checking with `select split(txt, ' ') from wc;` gives:

```
["hongyutang","love","qiaoshuang"]
```
```sql
select explode(split(txt, ' ')) from wc;
```

Result:

```
hongyutang
```
```sql
select t1.word, count(t1.word) from (select explode(split(txt, ' ')) word from wc) t1 group by t1.word;
```

Result:

```
beijing 1
```
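The explode-and-group-by pipeline above is just tokenize-and-count; the aggregation the Hive query performs can be sketched in plain Java (illustration only, not part of the job code):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Equivalent of:
    //   select t1.word, count(t1.word)
    //   from (select explode(split(txt, ' ')) word from wc) t1 group by t1.word;
    public static Map<String, Long> count(List<String> rows) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String row : rows)
            for (String word : row.split(" "))   // explode(split(txt, ' '))
                counts.merge(word, 1L, Long::sum);  // group by word, count(word)
        return counts;
    }
}
```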