apache spark - Input format problems with MLlib

August 15, 2011

I want to run SVM regression, but I have problems with the input format. Right now my train and test set for one customer looks like this:

1 '12262064 |f offer_quantity:1  has_bought_brand_company:1 has_bought_brand_a:6.79 has_bought_brand_q_60:1.0  has_bought_brand:2.0 has_bought_company_a:1.95 has_bought_brand_180:1.0  has_bought_brand_q_180:1.0 total_spend:218.37 has_bought_brand_q:3.0 offer_value:1.5  has_bought_brand_a_60:2.79 has_bought_brand_60:1.0 has_bought_brand_q_90:1.0  has_bought_brand_a_90:2.79 has_bought_company_q:1.0 has_bought_brand_90:1.0  has_bought_company:1.0 never_bought_category:1 has_bought_brand_a_180:2.79

I tried to read it with textFile in Spark, but without success. What am I missing? Do I have to delete the feature names? Right now it is in Vowpal Wabbit format.

My code looks like this:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils

    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "mllib/data/train.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run the training algorithm to build the model.
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Clear the threshold so that predict returns raw scores.
    model.clearThreshold()

    // Compute raw scores on the test set.
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    // Get evaluation metrics.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()

    println("Area under ROC = " + auROC)

Now I do get an answer, but the AUC value is 1, which shouldn't be the case.

    scala> println("Area under ROC = " + auROC)
    Area under ROC = 1.0

I think your file is not in LIBSVM format. Either convert the file to LIBSVM format, or load it as a normal text file and create the labeled points yourself. This is what I did for my file:

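    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // tweetInput (path to the input file) and numIterations are defined elsewhere.
    val tf = new HashingTF(2)
    val tweets = sc.textFile(tweetInput)

    val labelPoint = tweets.map { l =>
      // The first token is the label; the remaining tokens are the features.
      val parts = l.split(' ')
      // Join each pair of neighbouring tokens into one string so it is hashed by content.
      val t = tf.transform(parts.tail.sliding(2).map(_.mkString(" ")).toSeq)
      LabeledPoint(parts(0).toDouble, t)
    }.cache()
    labelPoint.count()

    val model = LinearRegressionWithSGD.train(labelPoint, numIterations)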
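For the other route mentioned above, converting the file to LIBSVM format, here is a minimal sketch in plain Scala. It is not code from the original answer. LIBSVM lines have the form "label index:value index:value ...", and MLUtils.loadLibSVMFile expects one-based indices in ascending order, so the feature names have to be replaced by integer indices. The names VwToLibSvm, Example, parse, buildIndex and toLibSvm are hypothetical, and the sketch assumes every feature is written as name:value, as in the sample line in the question; tokens without a colon, such as the namespace "f", are skipped.

```scala
// Sketch only: convert Vowpal Wabbit style lines to LIBSVM lines.
// Assumption: every feature is written as "name:value".
object VwToLibSvm {

  // One parsed input line: the label and its named feature values.
  final case class Example(label: String, features: Map[String, Double])

  // Split "label 'tag |namespace name:value ..." into label and features.
  def parse(line: String): Example = {
    val bar = line.indexOf('|')
    require(bar >= 0, s"no '|' separator in line: $line")
    // The label is the first token before the bar; the tag (if any) is dropped.
    val label = line.substring(0, bar).trim.split("\\s+").head
    val features = line.substring(bar + 1).trim.split("\\s+")
      .filter(_.contains(":")) // skips the namespace token, e.g. "f"
      .map { token =>
        val Array(name, value) = token.split(":", 2)
        name -> value.toDouble
      }
      .toMap
    Example(label, features)
  }

  // Assign a one-based index to every feature name, in sorted name order.
  def buildIndex(examples: Seq[Example]): Map[String, Int] =
    examples.flatMap(_.features.keys).distinct.sorted
      .zipWithIndex
      .map { case (name, i) => name -> (i + 1) }
      .toMap

  // Render one example as "label index:value ..." with ascending indices.
  // Features missing from the index are left out.
  def toLibSvm(example: Example, index: Map[String, Int]): String = {
    val pairs = example.features.toSeq
      .collect { case (name, value) if index.contains(name) => (index(name), value) }
      .sortBy(_._1)
      .map { case (i, v) => s"$i:$v" }
    (example.label +: pairs).mkString(" ")
  }
}
```

The index should be built once from the training data and reused for the test data, so that both files share the same feature numbering. The converted lines can then be written to a file and read with MLUtils.loadLibSVMFile. If the labels in the file follow the -1/1 convention, they would also need to be remapped to 0/1, which is what MLlib's binary classifiers expect.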
