我试图读取一个大的csv文件(aprox。6 GB)在熊猫和我得到一个内存错误:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

...

MemoryError: 

有什么帮助吗?


当前回答

除了上面的答案,对于那些想要处理CSV然后导出到CSV、parquet或SQL的人来说,d6tstack是另一个不错的选择。您可以加载多个文件,它处理数据模式更改(添加/删除列)。核心支持已经被剔除。

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV([bigfile.csv], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible

其他回答

函数read_csv和read_table几乎是一样的。但在程序中使用read_table函数时,必须分配分隔符“,”。

def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)

在使用chunksize选项之前,如果你想确定你想要在@unutbu提到的分块for循环中写入的进程函数,你可以简单地使用nrows选项。

small_df = pd.read_csv(filename, nrows=100)

一旦确定流程块准备好了,就可以将其放入整个数据帧的分块for循环中。

除了上面的答案,对于那些想要处理CSV然后导出到CSV、parquet或SQL的人来说,d6tstack是另一个不错的选择。您可以加载多个文件,它处理数据模式更改(添加/删除列)。核心支持已经被剔除。

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV([bigfile.csv], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible

如果有人还在寻找这样的东西,我发现这个名为modin的新库可以提供帮助。它使用分布式计算来帮助读取。这里有一篇不错的文章将其功能与熊猫进行了比较。它本质上使用与熊猫相同的功能。

import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)

您可以将数据读入为块,并将每个块保存为pickle。

import pandas as pd 
import pickle

in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path,sep=separator,chunksize=chunk_size, 
                    low_memory=False)    


for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)

在下一步中,读入pickle并将每个pickle附加到所需的数据框架中。

import glob
pickle_path = "" #Same Path as out_path i.e. where the pickle files are

data_p_files=[]
for name in glob.glob(pickle_path + "/data_*.pkl"):
   data_p_files.append(name)


df = pd.DataFrame([])
for i in range(len(data_p_files)):
    df = df.append(pd.read_pickle(data_p_files[i]),ignore_index=True)