我有一个数据框架形式的相当大的数据集,我想知道我如何能够将数据框架分成两个随机样本(80%和20%)进行训练和测试。

谢谢!


当前回答

您需要将pandas数据帧转换为numpy数组,然后将numpy数组转换回数据帧

 import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)

其他回答

Scikit Learn的train_test_split就是一个很好的例子。它将拆分numpy数组和数据框架。

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

不需要转换为numpy。只要用pandas df来做拆分,它就会返回一个pandas df。

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

如果你想把x和y分开

X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)

如果要分割整个df

X, y = df[list_of_x_cols], df[y_col]

像这样从df中选择range row

row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]

有很多有效的答案。又多了一个。 从sklearn。交叉验证导入train_test_split

#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]
import pandas as pd

from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'

data = pd.read_csv(datafile_name)

target_attribute = data['column_name']

X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)