How do I create test and train samples from one DataFrame with pandas?

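I have a fairly large dataset in the form of a DataFrame, and I would like to know how to split it into two random samples (80% and 20%) for training and testing.

Thanks!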

Current answer

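In my case, I wanted to split a DataFrame into train, test, and dev sets with specific row counts. I'm sharing my solution here.

First, assign a unique id to each row of the DataFrame (if one doesn't already exist):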

import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]

Here are my split sizes:

train = 120765
test  = 4134
dev   = 2816

The split function:

def df_split(df, n):
    # Randomly draw n rows; every row that was not drawn goes to the second frame.
    first = df.sample(n)
    second = df[~df.id.isin(first['id'])]
    # Give both frames a fresh 0..len-1 index.
    return first.reset_index(drop=True), second.reset_index(drop=True)

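Now split into train, test, and dev:

# After the first call, test temporarily holds everything that is not in train.
train, test = df_split(df, 120765)
# The second call keeps 4134 rows as test; the remaining 2816 rows become dev.
test, dev   = df_split(test, 4134)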
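
The same kind of exact-count split can also be done without an id column, by shuffling the row positions once and slicing. This is only a minimal sketch, assuming the requested sizes add up to len(df); the helper name split_by_counts, the seed parameter, and the toy frame are hypothetical and not part of the answer above.

```python
import numpy as np
import pandas as pd


def split_by_counts(df, counts, seed=None):
    """Split df into random, non-overlapping parts with the given row counts."""
    if sum(counts) != len(df):
        raise ValueError("counts must add up to len(df)")
    rng = np.random.default_rng(seed)
    # Shuffle the row positions once, then cut the shuffled order
    # at the cumulative counts.
    order = rng.permutation(len(df))
    cuts = np.cumsum(counts)[:-1]
    return [df.iloc[idx].reset_index(drop=True) for idx in np.split(order, cuts)]


# Toy data standing in for the real frame (hypothetical).
toy = pd.DataFrame({"x": range(100)})
train_df, test_df, dev_df = split_by_counts(toy, [80, 12, 8], seed=0)
print(len(train_df), len(test_df), len(dev_df))  # 80 12 8
```

Every row ends up in exactly one part because each position appears exactly once in the shuffled order.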

2020-12-20 09:06:03

Other answers

I would use scikit-learn's own train_test_split and generate the split from the index:

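from sklearn.model_selection import train_test_split

# Separate the target column from the features.
y = df.pop('output')
X = df

# Split the index labels rather than the data itself.
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # training DataFrame; .loc because X_train holds index labels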
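
train_test_split also accepts a DataFrame directly and returns DataFrames, so going through the index is optional. A small sketch, assuming scikit-learn is installed; the toy frame is only illustrative, with output playing the role of the target column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; 'output' stands in for the target column.
df = pd.DataFrame({"feature": range(10), "output": [0, 1] * 5})

# Passing the DataFrame itself keeps features and target together in each part.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))  # 8 2
```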

2015-05-26 09:33:30

If you want one DataFrame in and two DataFrames out (not NumPy arrays), this should do the trick:

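import numpy as np

def split_data(df, train_perc=0.8):
    # Mark each row as train with probability train_perc, so the resulting
    # split is only approximately train_perc / (1 - train_perc).
    # Note: this adds a boolean 'train' column to df.
    df['train'] = np.random.rand(len(df)) < train_perc

    train = df[df.train]
    test = df[~df.train]

    return {'train': train, 'test': test}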
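
Because each row is assigned independently, the mask approach gives only approximately an 80/20 split. Below is a sketch of a variant that leaves the input frame unmodified and can be made reproducible, assuming NumPy's Generator API; split_by_mask and seed are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def split_by_mask(df, train_perc=0.8, seed=None):
    """Return (train, test); each row goes to train with probability train_perc."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < train_perc  # boolean array, one entry per row
    return df[mask], df[~mask]


toy = pd.DataFrame({"x": range(1000)})
train_part, test_part = split_by_mask(toy, seed=1)
# The sizes vary from seed to seed but always add up to len(toy).
print(len(train_part), len(test_part))
```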

2015-07-19 21:29:26

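import numpy as np

# Shuffle the row positions, then take the first 20% as test and the rest as train.
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF = df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]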

2020-06-17 20:05:06

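You can use ~ (the tilde operator) to exclude the rows sampled with df.sample(), letting pandas alone handle both the sampling and the index-based filtering to obtain the two sets.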

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
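
Excluding rows by index label, whether with isin or with df.drop(train_df.index), relies on the labels being unique; with repeated labels, rows that were never sampled can be excluded as well. A minimal sketch, assuming it is acceptable to reset the index first; the toy frame is illustrative only.

```python
import pandas as pd

# Toy frame with a deliberately non-unique index.
df = pd.DataFrame({"x": range(10)}, index=[0, 1] * 5)

# Give every row its own label so that index-based exclusion is safe.
df = df.reset_index(drop=True)

train_df = df.sample(frac=0.8, random_state=100)
test_df = df.drop(train_df.index)  # same rows as df[~df.index.isin(train_df.index)]

print(len(train_df), len(test_df))  # 8 2
```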

2020-01-26 11:54:43
