我试图读取一个大的csv文件(aprox。6 GB)在熊猫和我得到一个内存错误:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

...

MemoryError: 

有什么帮助吗?


当前回答

解决方案1:

使用大数据的熊猫

解决方案2:

TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList,sort=False)

其他回答

分块不应该总是解决这个问题的第一步。

Is the file large due to repeated non-numeric data or unwanted columns? If so, you can sometimes see massive memory savings by reading in columns as categories and selecting required columns via pd.read_csv usecols parameter. Does your workflow require slicing, manipulating, exporting? If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of pandas API. If all else fails, read line by line via chunks. Chunk via pandas or via csv library as a last resort.

如果你有一个csv文件,有数百万个数据条目,你想要加载完整的数据集,你应该使用dask_cudf,

import dask_cudf as dc

df = dc.read_csv("large_data.csv")

函数read_csv和read_table几乎是一样的。但在程序中使用read_table函数时,必须分配分隔符“,”。

def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)

在使用chunksize选项之前,如果你想确定你想要在@unutbu提到的分块for循环中写入的进程函数,你可以简单地使用nrows选项。

small_df = pd.read_csv(filename, nrows=100)

一旦确定流程块准备好了,就可以将其放入整个数据帧的分块for循环中。

你可以尝试sframe,它和pandas有相同的语法,但是允许你操作比你的RAM大的文件。