我试图读取一个大的csv文件(aprox。6 GB)在熊猫和我得到一个内存错误:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

...

MemoryError: 

有什么帮助吗?


当前回答

下面是一个例子:

chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    #REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns = {c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER: 
    #1)BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET 
    chunkTemp.append(chunk)

    #2)DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]   
    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#!  NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)

其他回答

在使用chunksize选项之前,如果你想确定你想要在@unutbu提到的分块for循环中写入的进程函数,你可以简单地使用nrows选项。

small_df = pd.read_csv(filename, nrows=100)

一旦确定流程块准备好了,就可以将其放入整个数据帧的分块for循环中。

我是这样说的:

chunks=pd.read_table('aphro.csv',chunksize=1000000,sep=';',\
       names=['lat','long','rf','date','slno'],index_col='slno',\
       header=None,parse_dates=['date'])

df=pd.DataFrame()
%time df=pd.concat(chunk.groupby(['lat','long',chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)

除了上面的答案,对于那些想要处理CSV然后导出到CSV、parquet或SQL的人来说,d6tstack是另一个不错的选择。您可以加载多个文件,它处理数据模式更改(添加/删除列)。核心支持已经被剔除。

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV([bigfile.csv], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible

分块不应该总是解决这个问题的第一步。

Is the file large due to repeated non-numeric data or unwanted columns? If so, you can sometimes see massive memory savings by reading in columns as categories and selecting required columns via pd.read_csv usecols parameter. Does your workflow require slicing, manipulating, exporting? If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of pandas API. If all else fails, read line by line via chunks. Chunk via pandas or via csv library as a last resort.

对于大数据,我建议你使用"dask"库,例如:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')

你可以在这里阅读更多的文档。

另一个很好的选择是使用modin,因为所有的功能都与pandas相同,但它利用了分布式数据框架库,如dask。

在我的项目中,另一个高级库是数据表。

# Datatable python library
import datatable as dt
df = dt.fread("s3://.../2018-*-*.csv")