I have a fairly large dataset in the form of a dataframe, and I was wondering how I could split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
Current answer
There are many ways to create train/test and even validation samples.
Case 1: the classic train_test_split with no options:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
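To make the split reproducible across runs, it is common (though not required) to fix random_state, e.g.:
train, test = train_test_split(df, test_size=0.3, random_state=42)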
Case 2: a very small dataset (<500 rows): use cross-validation in order to get results for every row. At the end, you will have one prediction for each row of your available training set.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# shuffle=True is required when passing random_state to KFold
kf = KFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)
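If you then want the predictions lined up with the original rows, here is a minimal sketch, relying on the fact that kf.split is deterministic for a fixed random_state and assuming X and y are numpy arrays as indexed above:
import numpy as np

# re-iterating kf.split yields the same folds, so we can recover the row order
test_idx = np.concatenate([te for _, te in kf.split(X, y)])
y_hat_ordered = np.empty(len(y))
y_hat_ordered[test_idx] = np.concatenate(y_hat_all)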
Case 3a: an imbalanced dataset for classification. Following Case 1, here is the equivalent solution:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
Case 3b: an imbalanced dataset for classification. Following Case 2, the equivalent solution:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# shuffle=True is required when passing random_state to StratifiedKFold
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    # use a classifier here, since this is a classification task
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)
Case 4: you need a train/test/validation set on big data to tune hyperparameters (60% train, 20% test and 20% val).
from sklearn.model_selection import train_test_split

# hold out 40% of the data, then split that half-and-half into test and val
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y_test_val, test_size=0.5)
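A quick sanity check of the resulting proportions (a sketch, assuming X supports len()):
for name, part in [("train", X_train), ("test", X_test), ("val", X_val)]:
    print(name, round(len(part) / len(X), 2))  # expect roughly 0.6 / 0.2 / 0.2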
Other answers
No need to convert to numpy. Just use a pandas df to do the split and it will return pandas dfs.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
And if you want to split X from y:
X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col], test_size=0.2)
And if you want to split the whole df:
X, y = df[list_of_x_cols], df[y_col]
There are a lot of valid answers here. Adding one more.
from sklearn.model_selection import train_test_split
# gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
# gets the left-out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]
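And a quick check that the two samples really partition X (assuming X has a unique index):
assert len(X_train) + len(X_test) == len(X)
assert X_train.index.intersection(X_test.index).empty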
A slightly more elegant way, for my taste, is to create a random column and then split by it; that way we can get a split that suits our needs.
import numpy as np

def split_df(df, p=[0.8, 0.2]):
    # assign each row a random group index, drawn with probabilities p
    df["rand"] = np.random.choice(len(p), len(df), p=p)
    # collect one sub-frame per group
    r = [df[df["rand"] == val] for val in df["rand"].unique()]
    return r
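One caveat: df["rand"].unique() returns groups in order of first appearance, so the order of the returned list is not guaranteed. A hypothetical usage that selects each piece by its group label instead:
split_df(df, p=[0.8, 0.2])  # adds the "rand" column to df in place
# safer: pick each piece by its group label rather than by list position
train = df[df["rand"] == 0]
test = df[df["rand"] == 1]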
In my case, I wanted to split a dataframe into train, test and dev with specific row counts. I am sharing my solution here.
First, assign a unique id to the dataframe (if one does not already exist):
import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]
Here are my split counts:
train = 120765
test = 4134
dev = 2816
The split function:
def df_split(df, n):
    # sample n rows for the first piece; everything else forms the second piece
    first = df.sample(n)
    second = df[~df.id.isin(list(first['id']))]
    first.reset_index(drop=True, inplace=True)
    second.reset_index(drop=True, inplace=True)
    return first, second
Now split into train, test and dev:
train, test = df_split(df, 120765)
test, dev = df_split(test, 4134)
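A quick sanity check that the pieces have the expected sizes and together cover the original frame:
assert len(train) == 120765 and len(test) == 4134
assert len(train) + len(test) + len(dev) == len(df)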
You need to convert the pandas dataframe into a numpy array, and then convert the numpy array back into a dataframe:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
train, test = train_test_split(df, test_size=0.2)
train1 = pd.DataFrame(train)
test1 = pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv', sep='\t', header=None, encoding='utf-8', index=False)
test1.to_csv('/content/drive/My Drive/test.csv', sep='\t', header=None, encoding='utf-8', index=False)