
Search's Blog

Always learning!

 
 
 


 
 

Building an Inverted Index with Hadoop

2013-05-05 13:21:03

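1> First, set up a Hadoop development environment. Since we usually do not have a Hadoop cluster at hand, we set up a single-node environment here.
    For a detailed tutorial, see http://weixiaolu.iteye.com/blog/1401931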
2> Write the MapReduce code
    

package code.lxy.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    /**
     * Mapper: first emits every token in the form ("word:fileName", "1").
     */
    public static class InvertedIndexMapper extends
            Mapper<Object, Text, Text, Text> {
        private final Text keyInfo = new Text();
        private final Text one = new Text("1");

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Name of the file that the current input split belongs to
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().getName();
            // Split the line on whitespace: space, tab, carriage return,
            // newline (and form feed, the StringTokenizer defaults)
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                keyInfo.set(tokenizer.nextToken() + ":" + fileName);
                context.write(keyInfo, one);
            }
        }

    }
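    /*
     * Combiner: sums the "1" values of each "word:fileName" key and re-keys
     * the record as (word, "fileName:n").
     * Caveat: Hadoop may run a combiner zero, one, or several times, and map
     * output is partitioned before the combiner runs. Because this combiner
     * changes the key, the job depends on the combiner running exactly once
     * on every record and on a single reduce task; treat it as a
     * demonstration rather than a general-purpose design.
     */
    public static class InvertedIndexCombiner extends
            Reducer<Text, Text, Text, Text> {

        private final Text valueInfo = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text text : values) {
                sum += Integer.parseInt(text.toString());
            }
            // Split at the last ':' so that words containing ':' stay intact
            // (assumes file names contain no ':')
            String keyString = key.toString();
            int splitIndex = keyString.lastIndexOf(":");
            valueInfo.set(keyString.substring(splitIndex + 1) + ":" + sum);
            key.set(keyString.substring(0, splitIndex));
            context.write(key, valueInfo);
        }

    }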
    /*
     * Reducer: finally merges the "fileName:n" entries of each word and
     * emits (word, "file1:m;file2:n;").
     */
    public static class InvertedIndexReducer extends
            Reducer<Text, Text, Text, Text> {
        private final Text finalResult = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text text : values) {
                // Concatenate the per-file entries, separated by ';'
                sb.append(text.toString()).append(";");
            }
            finalResult.set(sb.toString());
            context.write(key, finalResult);
        }

    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pass conf so that generic options (such as -D) are applied to it;
        // the GenericOptionsParser constructor can throw IOException
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: InvertedIndex <input> <output>");
            System.exit(2);
        }
        // On Hadoop 2 and later, Job.getInstance(conf, "Inverted Index")
        // is the non-deprecated way to create the job
        Job job = new Job(conf, "Inverted Index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        // Exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}
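
To see what the three stages above produce without starting Hadoop, the data flow can be walked through in plain Java. This is only a rough in-memory sketch, not the job itself: the class name InvertedIndexSketch, its two helper methods, and the sample files a.txt and b.txt are all made up for illustration, and the sorted maps merely stand in for Hadoop's shuffle and sort.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory walk-through; none of these names come from Hadoop.
class InvertedIndexSketch {

    // Map + combine stages: count how often each "word:fileName" key occurs.
    static Map<String, Integer> countPerWordAndFile(Map<String, String> files) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            StringTokenizer tokenizer = new StringTokenizer(file.getValue());
            while (tokenizer.hasMoreTokens()) {
                String key = tokenizer.nextToken() + ":" + file.getKey();
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Re-keying + reduce stage: word -> "file1:m;file2:n;".
    static Map<String, String> buildIndex(Map<String, Integer> counts) {
        Map<String, StringBuilder> postings = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Assumption of this sketch: split at the last ':' in the key
            int splitIndex = e.getKey().lastIndexOf(':');
            String word = e.getKey().substring(0, splitIndex);
            String fileName = e.getKey().substring(splitIndex + 1);
            postings.computeIfAbsent(word, w -> new StringBuilder())
                    .append(fileName).append(':').append(e.getValue()).append(';');
        }
        Map<String, String> index = new TreeMap<>();
        postings.forEach((word, sb) -> index.put(word, sb.toString()));
        return index;
    }

    public static void main(String[] args) {
        // Made-up sample input: file name -> file content
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "hello hadoop hello");
        files.put("b.txt", "hello world");
        buildIndex(countPerWordAndFile(files))
                .forEach((word, entry) -> System.out.println(word + "\t" + entry));
    }
}
```

With these two sample files, the sketch prints one line per word: hadoop with a.txt:1; then hello with a.txt:2;b.txt:1; and finally world with b.txt:1; which is the "word file1:m;file2:n;" shape the reducer is meant to write.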



3> Hadoop operations
    For the various commands used to work with HDFS files, see http://www.cnblogs.com/gpcuster/archive/2010/06/04/1751538.html