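How do I create test and train samples from one DataFrame with pandas?

I have a fairly large dataset in the form of a DataFrame, and I was wondering how I could split it into two random samples (80% and 20%) for training and testing.

Thanks!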

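Current answer

This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but I didn't like that I couldn't control the size of the data sets exactly (for example, sometimes it would be 79, sometimes 81, and so on).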

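import random


def make_sets(data_df, test_portion):
    """Split data_df into (train_df, test_df) with an exact number of test rows."""
    tot_ix = range(len(data_df))
    # Draw exactly int(test_portion * n) distinct row positions for the test set.
    test_ix = sorted(random.sample(tot_ix, int(test_portion * len(data_df))))
    # Every remaining position goes to the training set.
    train_ix = sorted(set(tot_ix) - set(test_ix))

    # .iloc selects by position; the original .ix indexer was removed in pandas 1.0.
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]

    return train_df, test_df


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()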

2014-12-25 20:52:09

Other answers

I think you also need a copy, not a slice of the DataFrame, if you want to add columns later.

import numpy as np

msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep=True), df[~msk].copy(deep=True)
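
To make the point about copies concrete, here is a minimal sketch on a toy frame. The frame `df`, the column names, and the seeded generator are assumptions for illustration and are not part of the answer above. It shows that the mask gives only an approximate 80/20 split, and that a column added to the copy does not touch the original frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
df = pd.DataFrame({"x": np.arange(100)})

# Seeded generator (an assumption, used only to make the run repeatable).
rng = np.random.default_rng(0)
msk = rng.random(len(df)) < 0.8   # True for roughly, not exactly, 80% of rows

train, test = df[msk].copy(), df[~msk].copy()

# Adding a column to the copy leaves the original frame untouched.
train["x_squared"] = train["x"] ** 2

# The two sizes depend on the seed, but they always sum to len(df).
print(len(train), len(test))
```

Because each row is kept independently with probability 0.8, the training set size fluctuates from run to run. That is the behaviour the exact-size answers in this thread work around.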

2015-08-04 04:16:06

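There are many good answers above, so I just want to add one more example, for the case where you want to specify the exact number of samples in the train and test sets using only the numpy library.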

import numpy as np

# set the random seed for reproducibility
np.random.seed(17)

# e.g. number of samples for the training set is 1000
n_train = 1000

# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)

# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]

train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]

test_data = data_df.iloc[test_ids]
test_labels = labels_df.iloc[test_ids]

2019-11-19 06:00:45

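There are many valid answers. Here is one more.

# Imported in the original answer (from sklearn.cross_validation, since moved
# to sklearn.model_selection); the sample-based split below does not use it.
from sklearn.model_selection import train_test_split

# gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
# gets the left-out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]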
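
The answer imports `train_test_split` but never calls it. As a hedged sketch of how that function could do the same 80/20 split directly, assuming scikit-learn 0.18 or later is installed and using a made-up frame `X`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature frame; any DataFrame works the same way.
X = pd.DataFrame({"feature": np.arange(50)})

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

print(X_train.shape, X_test.shape)
```

With 50 rows this gives 40 training rows and 10 test rows. The function also accepts several arrays at once (for example features and labels) and splits them consistently.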

2016-12-09 22:18:03

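You can use ~ (tilde) to exclude the rows sampled with df.sample(), letting pandas alone handle both the sampling and the index-based filtering to obtain the two sets.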

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
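
A closely related variant, sketched here on a hypothetical ten-row frame, drops the sampled labels instead of building a boolean mask. Both forms assume the index labels are unique. If they are not, every row sharing a sampled label is excluded, so calling `reset_index(drop=True)` first may be needed.

```python
import pandas as pd

# Hypothetical frame with unique string labels as its index.
df = pd.DataFrame({"value": range(10)}, index=list("abcdefghij"))

# Sample 80% of the rows for training.
train_df = df.sample(frac=0.8, random_state=100)

# Drop the sampled labels; whatever remains is the test set.
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))
```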

2020-01-26 11:54:43
