r - How to subset a dataset such that the test set contains -
I built a linear regression model (lm.full), and I'm trying to test the model on a test data set. I'm running into an issue when I try to predict on the test data, caused by a feature/predictor with many unique values. The troublesome feature is CBSA (Core Based Statistical Area).
The train and test sets have the same unique values. I'm not sure what the issue is, because if each of the levels of the factor variable was fit in the training model, I think I should be able to predict a value for the test set.
I divided the data into test and training sets here:
    sample.size <- floor(0.95 * nrow(tvwm))
    # make sure the seeds are different
    set.seed(15)
    tvwm_train_ind <- sample(seq_len(nrow(tvwm)), size = sample.size)
    tvwm_train <- tvwm[tvwm_train_ind, ]
    tvwm_test <- tvwm[-tvwm_train_ind, ]

And here is the prediction:
    > predict(object=lm.full, newdata=tvwm_test, type = "response")
    Error in model.frame.default(terms, newdata, na.action = na.action, xlev = object$xlevels) :
      factor factor(cbsa_name) has new levels boston-cambridge-newton, ma-nh, detroit-warren-livonia, mi, virginia beach-norfolk-newport news, va-nc
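The error can be reproduced on a small made-up data set, which may help show where it comes from. In this sketch the data frame `toy` and its columns `y` and `area` are hypothetical stand-ins for `tvwm` and `cbsa_name`, and the formula `y ~ factor(area)` is an assumption based on the `factor(cbsa_name)` term in the error message. It suggests that comparing `levels()` can report a match even when the rows themselves do not, so the values actually present are compared instead.

```r
# Toy data standing in for tvwm; all names here are made up for illustration.
set.seed(1)
toy <- data.frame(
  y    = rnorm(8),
  area = factor(c("A", "A", "A", "B", "B", "B", "C", "D"))
)

# Fixed split so that "C" and "D" occur only in the test rows.
toy_train <- toy[1:4, ]   # areas A, A, A, B
toy_test  <- toy[5:8, ]   # areas B, B, C, D

# Assumed model form: the factor() call sits inside the formula.
fit <- lm(y ~ factor(area), data = toy_train)

# factor() drops levels that have no training rows, so the model
# only remembers the levels it actually saw.
fit$xlevels

# Row subsetting keeps the full level attribute, so the level sets
# of the two subsets still compare as identical.
identical(levels(toy_train$area), levels(toy_test$area))

# Compare the values present in the rows rather than the level attribute.
new_levels <- setdiff(unique(as.character(toy_test$area)),
                      unique(as.character(toy_train$area)))
new_levels

# predict() fails on the full test set; capture the message instead of stopping.
err <- tryCatch(predict(fit, newdata = toy_test),
                error = function(e) conditionMessage(e))

# One possible workaround: predict only for rows whose level was seen in training.
toy_test_ok <- toy_test[!(toy_test$area %in% new_levels), ]
pred_ok <- predict(fit, newdata = toy_test_ok)
```

Dropping the unseen rows is only one option; it means the model is never evaluated on those areas.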
Try

    all(levels(tvwm_test$cbsa_name) %in% levels(tvwm_train$cbsa_name))
    all(levels(tvwm_train$cbsa_name) %in% levels(tvwm_test$cbsa_name))

and make sure both are TRUE. Or, as Gregor suggested below in a comment, you can do it in one statement:
    identical(levels(tvwm_test$cbsa_name), levels(tvwm_train$cbsa_name))

If they are not both TRUE, and both the training set and the test set have the same factor levels in the data, run the following to reset the levels:
    tvwm_train$cbsa_name <- factor(tvwm_train$cbsa_name)
    tvwm_test$cbsa_name <- factor(tvwm_test$cbsa_name)
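If some areas turn out to occur only in the test rows, another option is to build the split so that every observed level has at least one training row. The following is a minimal sketch, not a definitive method: the helper name `split_keep_levels`, the data frame `toy`, and its columns are all hypothetical, and the 0.75 training fraction is chosen only so that the small example keeps a non-empty test set.

```r
# Hypothetical helper: returns training row indices such that every
# observed value of `column` has at least one row in the training set.
split_keep_levels <- function(data, column, train_frac) {
  rows <- seq_len(nrow(data))

  # Group row indices by observed value (unused levels and NAs are skipped).
  by_level <- split(rows, as.character(data[[column]]))

  # Force one randomly chosen row per level into the training set.
  forced <- unname(vapply(
    by_level,
    function(idx) idx[sample.int(length(idx), 1)],
    integer(1)
  ))

  # Fill the rest of the training set at random from the remaining rows.
  n_train   <- max(floor(train_frac * nrow(data)), length(forced))
  remaining <- setdiff(rows, forced)
  extra     <- remaining[sample.int(length(remaining), n_train - length(forced))]

  sort(c(forced, extra))
}

# Usage on made-up data.
set.seed(15)
toy <- data.frame(
  y    = rnorm(12),
  area = factor(rep(c("A", "B", "C", "D"), times = c(5, 4, 2, 1)))
)
train_ind <- split_keep_levels(toy, "area", train_frac = 0.75)
toy_train <- droplevels(toy[train_ind, ])
toy_test  <- toy[-train_ind, ]

# Every area in the test rows was also seen in the training rows.
all(as.character(toy_test$area) %in% levels(toy_train$area))
```

A level with a single row (here "D") always ends up in the training set under this scheme, so the test set never evaluates it. The model would also need to be fit on the new training set before calling `predict`.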