Apache Spark - Input format problems with MLlib
I want to run an SVM regression, but I have problems with the input format. Right now my train and test set for one customer looks like this:
1 '12262064 |f offer_quantity:1 has_bought_brand_company:1 has_bought_brand_a:6.79 has_bought_brand_q_60:1.0 has_bought_brand:2.0 has_bought_company_a:1.95 has_bought_brand_180:1.0 has_bought_brand_q_180:1.0 total_spend:218.37 has_bought_brand_q:3.0 offer_value:1.5 has_bought_brand_a_60:2.79 has_bought_brand_60:1.0 has_bought_brand_q_90:1.0 has_bought_brand_a_90:2.79 has_bought_company_q:1.0 has_bought_brand_90:1.0 has_bought_company:1.0 never_bought_category:1 has_bought_brand_a_180:2.79
I tried to read it as a text file with Spark, without success. What am I missing? Do I have to delete the feature names? Right now it is in Vowpal Wabbit format.
My code looks like this:
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils

    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "mllib/data/train.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run the training algorithm to build the model.
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Clear the threshold so that predict() returns raw scores.
    model.clearThreshold()

    // Compute raw scores on the test set.
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    // Evaluate the model.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()

    println("Area under ROC = " + auROC)
I get an answer, but the AUC value is 1, which shouldn't be the case.
    scala> println("Area under ROC = " + auROC)
    Area under ROC = 1.0
I think your file is not in LIBSVM format. You can either convert the file to LIBSVM format, or load it as a normal text file and create the labeled points yourself. This is what I did for my file:
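    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // tweetInput (the input path) and numIterations are defined elsewhere.
    val tf = new HashingTF(2)
    val tweets = sc.textFile(tweetInput)

    val labelPoint = tweets.map { l =>
      val parts = l.split(' ')
      // The first token is the label; the remaining tokens are hashed as bigrams.
      // Each bigram is joined into a String so that it hashes by content
      // (a raw Array hashes by identity on the JVM).
      val t = tf.transform(parts.tail.sliding(2).map(_.mkString(" ")).toSeq)
      LabeledPoint(parts(0).toDouble, t)
    }.cache()

    labelPoint.count()
    val model = LinearRegressionWithSGD.train(labelPoint, numIterations)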
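For the Vowpal Wabbit lines in the question, the first option (converting to LIBSVM) could look roughly like the sketch below. It is plain Scala with no Spark dependency. `VwToLibSvm` and `convert` are hypothetical names, not part of MLlib. The sketch assumes one namespace per line written directly after the bar (as in `|f`), one label as the first token, and that any label that is not positive should become 0, since MLlib's binary classifiers expect labels 0 and 1. LIBSVM wants numeric, one-based indices in ascending order, so the feature names are replaced by their position in the sorted list of all names.

```scala
// Hypothetical helper (not part of Spark): turns Vowpal Wabbit lines into LIBSVM lines.
object VwToLibSvm {

  // Splits one VW line into (label, Seq((featureName, value))).
  // Assumes a layout like: "1 'tag |f name1:0.5 name2:1 name3"
  private def parse(line: String): (Double, Seq[(String, Double)]) = {
    val bar = line.indexOf('|')
    require(bar >= 0, s"no '|' separator in line: $line")
    // The label is the first token before the bar; the 'tag is ignored.
    val label = line.substring(0, bar).trim.split("\\s+").head.toDouble
    // The first token after the bar is the namespace name ("f"), so drop it.
    val tokens = line.substring(bar + 1).trim.split("\\s+").toSeq.drop(1)
    val feats = tokens.map { tok =>
      tok.split(':') match {
        case Array(name, value) => (name, value.toDouble)
        case Array(name)        => (name, 1.0) // VW treats a missing value as 1
        case _ => throw new IllegalArgumentException(s"bad feature token: $tok")
      }
    }
    (label, feats)
  }

  // Returns the LIBSVM lines plus the name -> index mapping that was used.
  def convert(lines: Seq[String]): (Seq[String], Map[String, Int]) = {
    val parsed = lines.map(parse)
    // One-based indices, assigned in alphabetical order of the feature names.
    val index: Map[String, Int] = parsed
      .flatMap { case (_, feats) => feats.map(_._1) }
      .distinct.sorted.zipWithIndex
      .map { case (name, i) => name -> (i + 1) }
      .toMap
    val out = parsed.map { case (label, feats) =>
      // Assumption: positive labels become 1, everything else becomes 0.
      val binary = if (label > 0) 1 else 0
      val cols = feats
        .map { case (name, v) => (index(name), v) }
        .sortBy(_._1) // LIBSVM indices must be ascending
        .map { case (i, v) => s"$i:$v" }
      (binary.toString +: cols).mkString(" ")
    }
    (out, index)
  }
}
```

The converted lines can then be written to a file and read with `MLUtils.loadLibSVMFile(sc, path)` as in the question. The same mapping has to be used for the train and the test file, otherwise the indices will not refer to the same features.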