
Commit c3784a5

Author: liudiwei
Message: add invertedIndex example
Parent: e0f0e81

File tree: 11 files changed (+98 −1 lines)


README.md

Lines changed: 4 additions & 1 deletion

```diff
@@ -26,14 +26,17 @@
 - PCA principal component analysis, [Python implementation source](https://github.com/csuldw/MachineLearning/tree/master/PCA)
+- spark-demo: Spark examples written in Scala.
+- invertedIndex, [Spark inverted-index source](https://github.com/csuldw/MachineLearning/tree/master/invertedIndex)

 ## Supplementary

 - MNIST dataset [loading method](https://github.com/csuldw/MachineLearning/tree/master/dataset/MNIST).

 ## Contributor

-- 刘帝伟, Master's student at Central South University, [Homepage](http://csuldw.github.io).
+- 刘帝伟, Master's student at Central South University, [HomePage](http://www.csuldw.com).

 ## Contact
```

doc/i0200bh53yr.p202.1.mp4.egt (deleted, 1.26 MB)

Binary file not shown.

spark-demo/README.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -0,0 +1,4 @@
+### Notes
+
+- invertedIndex: a Spark example of an inverted index.
```

spark-demo/invertedIndex/build.sbt

Lines changed: 5 additions & 0 deletions

```scala
name := "invertedIndex"

version := "1.0.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```
Lines changed: 1 addition & 0 deletions

```
inputfile=/home/hadoop-news/liudiwei/test/input.data
```
Lines changed: 5 additions & 0 deletions

```
doc1	Apache Spark Scala Hadoop Java C Python Do And Will KNN
doc2	SVM Scala News Play Akka Yes GBDT
doc3	LDA SVM RF GBDT Adaboost Kmeans KNN
doc4	QQ BAT I Great All LDA
doc5	Apache Hadoop MapReduce Git SVN SVM
```
Binary file not shown.
Lines changed: 29 additions & 0 deletions

```
(Akka,doc2)
(Python,doc1)
(QQ,doc4)
(RF,doc3)
(Apache,doc1|doc5)
(Will,doc1)
(Java,doc1)
(MapReduce,doc5)
(SVM,doc2|doc3|doc5)
(Scala,doc1|doc2)
(Git,doc5)
(Play,doc2)
(And,doc1)
(SVN,doc5)
(GBDT,doc2|doc3)
(News,doc2)
(Spark,doc1)
(Kmeans,doc3)
(Do,doc1)
(KNN,doc1|doc3)
(I,doc4)
(All,doc4)
(LDA,doc4|doc3)
(BAT,doc4)
(Great,doc4)
(C,doc1)
(Adaboost,doc3)
(Yes,doc2)
(Hadoop,doc5|doc1)
```

spark-demo/invertedIndex/run.sh

Lines changed: 21 additions & 0 deletions

```shell
hadoop fs -put data/input.data /home/hadoop-news/liudiwei/test

exe_cores=2
exe_num=2
tmp_dir=/home/hdp-guanggao/old_jobs/tmp
exe_mem=2G
drv_mem=3G

# Note: both JVM flags are passed in a single --driver-java-options;
# repeating the flag would make the second value override the first.
spark-submit \
  --master yarn-client \
  --driver-memory $drv_mem \
  --executor-memory $exe_mem \
  --num-executors $exe_num \
  --executor-cores $exe_cores \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true -Djava.io.tmpdir=$tmp_dir" \
  --conf spark.eventLog.enabled=true \
  --conf spark.storage.memoryFraction=0.1 \
  --jars "deps/json4s-native_2.10-3.2.10.jar" \
  --class "InvertedIndex" \
  ./target/scala-2.10/invertedindex_2.10-1.0.0.jar \
  conf/base.conf
```
Lines changed: 29 additions & 0 deletions

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast
import org.apache.commons.configuration.{PropertiesConfiguration => HierConf}

import scala.collection.mutable._

object InvertedIndex {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("invertedIndex")
      .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
      .set("spark.akka.frameSize", "256")
      .set("spark.ui.port", "4071")
    val sc = new SparkContext(conf)
    // The properties file passed on the command line supplies the input path.
    val cfg = new HierConf(args(0))
    val inputfile = cfg.getString("inputfile")
    val result = sc.textFile(inputfile)
      .map(x => x.split("\t"))                 // each line: "<docId>\t<word word ...>"
      .map(x => (x(0), x(1)))
      .map(x => x._2.split(" ").map(y => (y, x._1)))  // emit (word, docId) pairs
      .flatMap(x => x)
      .reduceByKey((x, y) => x + "|" + y)      // join the doc ids for each word with "|"
    result.collect.foreach(println)
    sc.stop()
  }
}
```
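The transformation above can be sanity-checked without a cluster: the same split/flatMap/reduce logic works on plain Scala collections. This is a minimal sketch (names like `InvertedIndexLocal` are mine, not from the commit); `groupBy` plus `mkString` stands in for `reduceByKey`, so within each group the doc ids appear in input order rather than whatever order the Spark reduction happens to produce.

```scala
// Local sketch of the inverted-index pipeline on plain Scala collections.
object InvertedIndexLocal extends App {
  // Same line format as data/input.data: "<docId>\t<word word ...>"
  val lines = Seq(
    "doc1\tApache Spark Scala",
    "doc2\tSVM Scala News",
    "doc3\tLDA SVM"
  )
  val index = lines
    .map(_.split("\t"))
    .map(x => (x(0), x(1)))
    .flatMap { case (doc, words) => words.split(" ").map(w => (w, doc)) }
    .groupBy(_._1)                              // stands in for reduceByKey
    .mapValues(_.map(_._2).mkString("|"))
  // Sorted for deterministic output; e.g. the SVM entry is (SVM,doc2|doc3)
  index.toSeq.sortBy(_._1).foreach(println)
}
```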
