11 files changed: +98 −1 lines.
Project README:

```diff
 - PCA principal component analysis, [Python implementation source](https://github.com/csuldw/MachineLearning/tree/master/PCA)
+- spark-demo: Spark examples written in Scala.
+  - invertedIndex, [Spark inverted-index source](https://github.com/csuldw/MachineLearning/tree/master/invertedIndex)
+
 ## Supplementary

 - MNIST dataset [loading method](https://github.com/csuldw/MachineLearning/tree/master/dataset/MNIST).

 ## Contributor

-- 刘帝伟, Master's student at Central South University, [Homepage](http://csuldw.github.io).
+- 刘帝伟, Master's student at Central South University, [HomePage](http://www.csuldw.com).

 ## Contact
```
New sub-project README:

```markdown
### Description

- invertedIndex: a Spark example of an inverted index.
```
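For background, the heart of an inverted index fits in a few lines of plain Scala: map each document to (term, docId) pairs, then group the doc ids by term. The sketch below is illustrative only (no Spark; the two sample documents are made up):

```scala
object InvertedIndexSketch {
  def main(args: Array[String]) {
    // Illustrative documents: id -> space-separated terms.
    val docs = Seq(
      "doc1" -> "Apache Spark Scala",
      "doc2" -> "SVM Scala"
    )
    // Build term -> "|"-joined ids of the documents containing it.
    val index = docs
      .flatMap { case (id, text) => text.split(" ").map(term => (term, id)) }
      .groupBy(_._1)
      .map { case (term, pairs) => (term, pairs.map(_._2).mkString("|")) }
    index.foreach(println) // e.g. (Scala,doc1|doc2)
  }
}
```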
sbt build definition:

```scala
name := "invertedIndex"
version := "1.0.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```
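With this build file, `sbt package` should produce `target/scala-2.10/invertedindex_2.10-1.0.0.jar` (sbt lowercases the project name for the artifact), which is the jar the submit script below hands to spark-submit.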
Job configuration (`conf/base.conf`, read via the first program argument):

```
inputfile=/home/hadoop-news/liudiwei/test/input.data
```
Sample input (`data/input.data`); each line is a document id, a tab, then space-separated terms:

```
doc1	Apache Spark Scala Hadoop Java C Python Do And Will KNN
doc2	SVM Scala News Play Akka Yes GBDT
doc3	LDA SVM RF GBDT Adaboost Kmeans KNN
doc4	QQ BAT I Great All LDA
doc5	Apache Hadoop MapReduce Git SVN SVM
```
Job output; each line pairs a term with the `|`-joined ids of the documents that contain it:

```
(Akka,doc2)
(Python,doc1)
(QQ,doc4)
(RF,doc3)
(Apache,doc1|doc5)
(Will,doc1)
(Java,doc1)
(MapReduce,doc5)
(SVM,doc2|doc3|doc5)
(Scala,doc1|doc2)
(Git,doc5)
(Play,doc2)
(And,doc1)
(SVN,doc5)
(GBDT,doc2|doc3)
(News,doc2)
(Spark,doc1)
(Kmeans,doc3)
(Do,doc1)
(KNN,doc1|doc3)
(I,doc4)
(All,doc4)
(LDA,doc4|doc3)
(BAT,doc4)
(Great,doc4)
(C,doc1)
(Adaboost,doc3)
(Yes,doc2)
(Hadoop,doc5|doc1)
```
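Neither the line order nor the order of ids within a line is deterministic: `reduceByKey` concatenates values in whatever order the partitions deliver them, which is why the output contains both `(Apache,doc1|doc5)` and `(Hadoop,doc5|doc1)`.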
Submit script; it pushes the sample input to HDFS, then launches the job on YARN:

```bash
hadoop fs -put data/input.data /home/hadoop-news/liudiwei/test

exe_cores=2
exe_num=2
tmp_dir=/home/hdp-guanggao/old_jobs/tmp
exe_mem=2G
drv_mem=3G

# spark-submit keeps only the last --driver-java-options value, so both
# JVM flags must be passed in a single quoted argument.
spark-submit \
  --master yarn-client \
  --driver-memory $drv_mem \
  --executor-memory $exe_mem \
  --num-executors $exe_num \
  --executor-cores $exe_cores \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true -Djava.io.tmpdir=$tmp_dir" \
  --conf spark.eventLog.enabled=true \
  --conf spark.storage.memoryFraction=0.1 \
  --jars "deps/json4s-native_2.10-3.2.10.jar" \
  --class "InvertedIndex" \
  ./target/scala-2.10/invertedindex_2.10-1.0.0.jar \
  conf/base.conf
```
Spark job source (`object InvertedIndex`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.commons.configuration.{PropertiesConfiguration => HierConf}

object InvertedIndex {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("invertedIndex")
      .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
      .set("spark.akka.frameSize", "256")
      .set("spark.ui.port", "4071")
    val sc = new SparkContext(conf)

    // args(0) is the properties file (conf/base.conf) holding the input path.
    val cfg = new HierConf(args(0))
    val inputfile = cfg.getString("inputfile")

    // Each input line is "docId<TAB>term term ...": emit one (term, docId)
    // pair per term, then join the ids of each term's documents with "|".
    val result = sc.textFile(inputfile)
      .map(line => line.split("\t"))
      .map(fields => (fields(0), fields(1)))
      .flatMap { case (docId, text) => text.split(" ").map(term => (term, docId)) }
      .reduceByKey((a, b) => a + "|" + b)

    result.collect.foreach(println)
    sc.stop()
  }
}
```
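For a quick check without a cluster, a local-mode variant might look like the sketch below (a sketch under stated assumptions: it hard-codes `data/input.data` relative to the working directory instead of reading the properties file, and `local[2]` runs Spark inside the current JVM with two threads):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InvertedIndexLocal {
  def main(args: Array[String]) {
    val sc = new SparkContext(
      new SparkConf().setAppName("invertedIndex-local").setMaster("local[2]"))
    // Same pipeline as the cluster job, against the local sample data.
    sc.textFile("data/input.data")
      .map(_.split("\t"))
      .flatMap(fields => fields(1).split(" ").map(term => (term, fields(0))))
      .reduceByKey(_ + "|" + _)
      .collect
      .foreach(println)
    sc.stop()
  }
}
```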