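How do I create test and train samples from one DataFrame with pandas?

I have a fairly large dataset in the form of a DataFrame, and I was wondering how I could split it into two random samples (80% and 20%) for training and testing.

Thanks!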

Current answer

The sample method selects a fraction of the rows at random; pass a seed value via random_state to make the selection repeatable.

train = df.sample(frac=0.8, random_state=42)

For the test set, drop the rows whose index labels appear in the train DataFrame, then reset the index of the new DataFrame.

test = df.drop(train.index).reset_index(drop=True)
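
As a minimal, self-contained sketch of the approach above (the toy DataFrame and its column names are made up for illustration), the following checks that the two parts are disjoint and together cover every row:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; column names are hypothetical.
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) % 2})

# Draw 80% of the rows at random; random_state makes the draw repeatable.
train = df.sample(frac=0.8, random_state=42)

# Everything not drawn becomes the test set.
# Capture the held-out labels before reset_index discards them.
test_labels = df.index.difference(train.index)
test = df.drop(train.index).reset_index(drop=True)

print(len(train), len(test))                        # 8 2
print(train.index.intersection(test_labels).empty)  # True
```

Resetting the index is optional; skip it if you need to trace test rows back to their original labels.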

2022-11-02 06:31:20

Other answers

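You can convert the DataFrame to a NumPy array and pass that to train_test_split. The original answer used df.as_matrix(), which was removed in pandas 1.0; df.to_numpy() is the current equivalent.

from sklearn.model_selection import train_test_split

Y = df.pop("target")  # "target" is a placeholder; use your label column's name
X = df.to_numpy()     # replaces the removed df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
model.fit(x_train, y_train)
model.score(x_test, y_test)  # scikit-learn estimators have no test() method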
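
For a self-contained illustration of the same workflow, here is a sketch with synthetic data; the column names and the choice of LogisticRegression are assumptions made for the example, not part of the answer above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data; "a", "b" and "target" are hypothetical column names.
df = pd.DataFrame({"a": rng.standard_normal(200), "b": rng.standard_normal(200)})
df["target"] = (df["a"] + df["b"] > 0).astype(int)

y = df.pop("target")  # remove the label column and keep it as a Series
X = df.to_numpy()     # remaining feature columns as a NumPy array

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression()
model.fit(x_train, y_train)
print(model.score(x_test, y_test))  # mean accuracy on the held-out 20%
```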

2015-11-27 08:50:52

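You can use ~ (tilde) to exclude the rows sampled with df.sample(), letting pandas alone handle both the sampling and the index-based filtering to obtain the two sets.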

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
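
One caveat worth checking: both the isin filter above and df.drop match rows by index label, so they assume the index is unique. The sketch below, using a made-up DataFrame with repeated labels, shows how rows can go missing and how resetting the index first avoids that:

```python
import pandas as pd

# Hypothetical data whose index labels repeat (e.g. after pd.concat).
df = pd.DataFrame({"x": range(6)}, index=[0, 1, 2, 0, 1, 2])

train_df = df.sample(frac=0.5, random_state=100)
test_df = df[~df.index.isin(train_df.index)]

# Every row sharing a label with a sampled row is excluded,
# so the two parts no longer add up to the full DataFrame.
print(len(train_df) + len(test_df), "of", len(df))

# Giving each row a unique label first restores an exact partition.
df_unique = df.reset_index(drop=True)
train_u = df_unique.sample(frac=0.5, random_state=100)
test_u = df_unique[~df_unique.index.isin(train_u.index)]
print(len(train_u) + len(test_u), "of", len(df_unique))
```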

2020-01-26 11:54:43

I would just use NumPy's random functions (randn for the sample data, rand for the boolean mask):

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And to show that this worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
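
Note that the mask keeps each row independently with probability 0.8, so the split is only approximately 80/20 (79/21 in the session shown). If exact sizes matter, one alternative sketch, not part of the answer above, is to shuffle the row positions with a seeded NumPy generator and cut at the 80% mark:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
df = pd.DataFrame(rng.standard_normal((100, 2)))

# Shuffle the row positions, then cut once at the 80% mark.
positions = rng.permutation(len(df))
cut = int(0.8 * len(df))

train = df.iloc[positions[:cut]]
test = df.iloc[positions[cut:]]

print(len(train), len(test))  # 80 20
```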

2014-06-10 17:29:25

scikit-learn's train_test_split is a good option. It splits both NumPy arrays and DataFrames.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
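
train_test_split also accepts random_state for a repeatable split and stratify to keep class proportions the same in both parts. A small sketch, assuming a made-up DataFrame with a hypothetical label column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; "feature" and "label" are hypothetical column names.
df = pd.DataFrame({
    "feature": range(20),
    "label": [0] * 10 + [1] * 10,
})

train, test = train_test_split(
    df,
    test_size=0.2,         # 20% of the rows go to the test set
    random_state=42,       # fixed seed so the split is repeatable
    stratify=df["label"],  # preserve the 50/50 label balance in both parts
)

print(len(train), len(test))
print(test["label"].value_counts().to_dict())
```

Stratifying is mainly useful for classification targets with imbalanced classes; for a plain random split, leave it out.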

2014-06-10 22:19:31

pandas random sampling also works:

train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)

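For the same random_state value, you will always get exactly the same data in the training and test sets. This gives you a degree of repeatability while still separating the training and test data at random.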
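
The repeatability claim can be checked directly. A quick sketch with toy data (the DataFrame here is made up for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(50)})  # toy data for illustration

# Two draws with the same seed pick the same rows in the same order.
first = df.sample(frac=0.8, random_state=200)
second = df.sample(frac=0.8, random_state=200)
print(first.index.equals(second.index))

# A different seed will generally pick a different subset.
other = df.sample(frac=0.8, random_state=201)
print(first.index.equals(other.index))
```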

2016-02-21 01:28:55
