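Is there a direct way to import the contents of a CSV file into a record array, much as R's read.table(), read.delim(), and read.csv() import data into R data frames?
Or should I use csv.reader() and then apply numpy.core.records.fromrecords()?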
Current answer
You can also try recfromcsv(), which guesses the data types and returns a properly formatted record array.
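If recfromcsv() is not available in your NumPy version (I believe it was removed from the main namespace in NumPy 2.0), a similar result can be sketched with genfromtxt(). This is a minimal sketch, not a definitive recipe. The sample data and the column names id, label, and score are made up for illustration, and an in-memory io.StringIO stands in for a real file.

```python
import io

import numpy as np

# Hypothetical sample data standing in for a real CSV file.
csv_text = "id,label,score\n1,alice,3.5\n2,bob,4.0\n"

# names=True takes the field names from the header row;
# dtype=None asks genfromtxt to guess a type for each column.
data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    names=True,
    dtype=None,
    encoding="utf-8",
)

# Viewing the structured array as a recarray allows attribute access.
rec = data.view(np.recarray)
print(rec.dtype.names)
print(rec.label)
print(rec.score.mean())
```

With a real file, pass the path instead of the StringIO object.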
Other answers
Use numpy.genfromtxt(), with the delimiter kwarg set to a comma:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
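One thing to watch for with this call: by default genfromtxt() parses every field as a float, so a text header row comes back as a row of nan. The sketch below shows that behaviour and one way around it using skip_header. The sample data is hypothetical, and io.StringIO is used only to keep the example self-contained.

```python
import io

import numpy as np

# Hypothetical CSV content with a header row.
csv_text = "x,y\n1.0,2.0\n3.0,4.0\n"

# Default call: the header cannot be parsed as float, so row 0 is all nan.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(raw)

# Skipping the header row leaves only the numeric data.
clean = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(clean)
```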
In [329]: %time my_data = genfromtxt('one.csv', delimiter=',')
CPU times: user 19.8 s, sys: 4.58 s, total: 24.4 s
Wall time: 24.4 s
In [330]: %time df = pd.read_csv("one.csv", skiprows=20)
CPU times: user 1.06 s, sys: 312 ms, total: 1.38 s
Wall time: 1.38 s
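Note that the read_csv call above passes skiprows=20 while the genfromtxt call does not, so the two timings do not cover exactly the same rows. If you want to repeat the comparison on your own machine, here is a rough sketch. The file is generated on the fly with made-up random data, the file name bench.csv is arbitrary, and absolute timings will differ from the numbers above.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Write a small, purely numeric CSV to a temporary directory (made-up data).
rng = np.random.default_rng(0)
values = rng.random((20_000, 5))
path = os.path.join(tempfile.mkdtemp(), "bench.csv")
np.savetxt(path, values, delimiter=",")

# Time NumPy's parser.
start = time.perf_counter()
from_numpy = np.genfromtxt(path, delimiter=",")
numpy_seconds = time.perf_counter() - start

# Time pandas' parser, converting the result to a NumPy array.
start = time.perf_counter()
from_pandas = pd.read_csv(path, header=None).to_numpy()
pandas_seconds = time.perf_counter() - start

print(f"genfromtxt: {numpy_seconds:.3f} s, read_csv: {pandas_seconds:.3f} s")
```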
This works with recent versions of pandas and NumPy.
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv', header=None)
# Discover, visualize, and preprocess data using pandas if needed.
data = data.to_numpy()
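Since the question asks for a record array, it may be worth knowing that to_numpy() returns a plain ndarray, and with mixed column types that array falls back to dtype object. DataFrame.to_records() keeps a separate type per column instead. A small sketch, with made-up data and hypothetical column names:

```python
import io

import pandas as pd

# Hypothetical headerless CSV with an int, a string, and a float column.
csv_text = "1,alice,3.5\n2,bob,4.0\n"
df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["id", "label", "score"],
)

# Plain ndarray: mixed column types collapse to dtype object.
as_array = df.to_numpy()
print(as_array.dtype)

# Record array: one field per column, each with its own dtype.
rec = df.to_records(index=False)
print(rec.dtype.names)
print(rec.score)
```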
This is a very simple task, and the best way to do it is as follows:
import pandas as pd
import numpy as np
# Read the file. The 'r' prefix makes the path a raw string, so backslashes
# in the path are not treated as escape characters. Remember to include the
# file name with its ".csv" extension at the end of the path.
df = pd.read_csv(r'C:\Users\Ron\Desktop\Clients.csv')
print(df)
y = np.array(df)
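When I tried both the NumPy and the pandas approach, pandas had several advantages:
It is faster, uses less CPU, and needs about 1/3 of the RAM compared with NumPy genfromtxt.
This is my test code: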
$ for f in test_pandas.py test_numpy_csv.py ; do /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps
23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps
test_numpy_csv.py
from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')
test_pandas.py
from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')
Data file:
$ du -h ~/me/notebook/train.csv
59M /home/hvn/me/notebook/train.csv
NumPy and pandas versions:
$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2