How can I read large text files line by line without loading them into memory?

I want to read a large file (>5GB) line by line, without loading its entire contents into memory. I cannot use readlines() since it creates a very large list in memory.

Current answer

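The blaze project has come a long way over the past 6 years. It has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations, and lets you export slices back to pandas easily for in-memory operations.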

import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field=='XYZ'].compute()

2018-01-22 20:51:11

Other answers

Please try this:

with open('filename', 'r', buffering=100000) as f:
    for line in f:
        # Python 3 print(); each line already ends with its own newline
        print(line, end='')
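
The loop above works because a Python file object is a lazy iterator, so only the current line (plus the read buffer) is held in memory. A minimal sketch of using that for real work instead of printing; the function name `count_matching_lines` and the demo data are hypothetical and only stand in for the large file:

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Count lines containing `needle`, holding one line in memory at a time."""
    count = 0
    # The file object is a lazy iterator: each loop step reads one more line.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if needle in line:
                count += 1
    return count


# Small demonstration file (hypothetical data, stands in for the >5GB file).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("alpha\nbeta\nalpha beta\n")
    demo_path = tmp.name

matches = count_matching_lines(demo_path, "alpha")
os.remove(demo_path)
print(matches)
```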

2018-01-25 14:48:49

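This might be useful when you want to work in parallel and read only chunks of data, but keep each chunk clean by ending it on a newline.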

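def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # Extend the chunk to the next newline so no line is split
        while data[-1:] != '\n':
            char = fileObj.read(1)
            if not char:  # end of file without a trailing newline
                break
            data += char
        yield data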
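
To make the chunk boundaries concrete, here is a small sketch of the same idea run against an in-memory file. It is a variant rather than the answer's exact code: the name `read_in_chunks` is hypothetical, and it finishes the current line with `readline()` instead of reading one character at a time, which also stops cleanly if the file has no trailing newline.

```python
import io


def read_in_chunks(file_obj, chunk_size=1024):
    """Yield text chunks of about chunk_size characters, each ending on a line boundary."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        # Finish the current line; readline() returns '' at end of file,
        # so a missing final newline cannot cause an endless loop.
        if not data.endswith("\n"):
            data += file_obj.readline()
        yield data


sample = io.StringIO("one\ntwo\nthree\nfour")
chunks = list(read_in_chunks(sample, chunk_size=5))
print(chunks)  # ['one\ntwo\n', 'three\n', 'four']
```

Each chunk can then be handed to a separate worker, since no line is ever split across two chunks.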

2019-05-10 12:00:04

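I realise this was answered quite some time ago, but here is a way of doing it in parallel without killing your memory overhead (which is what would happen if you tried to fire each line into the pool). Obviously, swap the readJSON_line2 function out for something sensible; it is only here to illustrate the point!

The speedup will depend on the file size and on what you do with each line. In the worst case, for a small file that is only read with the JSON reader, I see performance similar to single-threaded (ST) with the settings below.

Hopefully this is useful to someone: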

def readJSON_line2(linesIn):
   '''
   Function for reading a chunk of json lines

   Note, this function is nonsensical. A user would never use the approach suggested
   for reading in a JSON file,
   its role is to evaluate the MT approach for full line by line processing to both
   increase speed and reduce memory overhead
   '''
   import json

   linesRtn = []
   for lineIn in linesIn:

       # Parse non-blank lines; blank lines become empty strings
       if lineIn.strip():
           lineRtn = json.loads(lineIn)
       else:
           lineRtn = ""

       linesRtn.append(lineRtn)

   return linesRtn




# -------------------------------------------------------------------
if __name__ == "__main__":
   import multiprocessing as mp

   path1 = "C:\\user\\Documents\\"
   file1 = "someBigJson.json"

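   nCPUs = mp.cpu_count()  # Number of worker processes
   pool = mp.Pool(nCPUs)  # Worker pool used by apply_async below

   nBuffer = 20*nCPUs  # How many chunks are queued up (so cpus aren't waiting on processes spawning)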
   nChunk = 1000 # How many lines are in each chunk
   #Both of the above will require balancing speed against memory overhead

   iJob = 0  #Tracker for SMP jobs submitted into pool
   iiJob = 0  #Tracker for SMP jobs extracted back out of pool

   jobs = []  #SMP job holder
   MTres3 = []  #Final result holder
   chunk = []  
   iBuffer = 0 # Buffer line count
   with open(path1+file1) as f:
      for line in f:
            
          #Send to the chunk
          if len(chunk) < nChunk:
              chunk.append(line)
          else:
              #Chunk full
              #Don't forget to add the current line to chunk
              chunk.append(line)
                
              #Then add the chunk to the buffer (submit to SMP pool)                  
              jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
              iJob +=1
              iBuffer +=1
              #Clear the chunk for the next batch of entries
              chunk = []
                            
          #Buffer is full, any more chunks submitted would cause undue memory overhead
          #(Partially) empty the buffer
          if iBuffer >= nBuffer:
              temp1 = jobs[iiJob].get()
              for rtnLine1 in temp1:
                  MTres3.append(rtnLine1)
              iBuffer -=1
              iiJob+=1
            
      #Submit the last chunk if it exists (as it would not have been submitted to SMP buffer)
      if chunk:
          jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
          iJob +=1
          iBuffer +=1

      #And gather up the last of the buffer, including the final chunk
      while iiJob < iJob:
          temp1 = jobs[iiJob].get()
          for rtnLine1 in temp1:
              MTres3.append(rtnLine1)
          iiJob+=1

   #Cleanup
   del chunk, jobs, temp1
   pool.close()
   pool.join()  # Wait for the worker processes to exit
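
The chunk-building part of the loop above could also be factored into a small generator, which keeps the submission logic separate from the file reading. This is only a sketch under assumptions: the name `iter_line_chunks` is hypothetical, and it yields at most `lines_per_chunk` lines per chunk, which differs slightly from the script above.

```python
import io
from itertools import islice


def iter_line_chunks(file_obj, lines_per_chunk):
    """Yield lists of at most lines_per_chunk lines, read lazily from file_obj."""
    while True:
        # islice pulls only the next lines_per_chunk lines from the file iterator.
        chunk = list(islice(file_obj, lines_per_chunk))
        if not chunk:
            return
        yield chunk


sample = io.StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n')
chunk_sizes = [len(chunk) for chunk in iter_line_chunks(sample, 2)]
print(chunk_sizes)  # [2, 1]
```

Each yielded chunk could then be passed to pool.apply_async in the same way as in the script, with the same buffer limit on how many jobs are outstanding.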

2021-03-02 13:36:07

If you don't have newlines in the file, you can do this:

with open('large_text.txt') as f:
  while True:
    c = f.read(1024)
    if not c:
      break
    print(c,end='')
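
The same fixed-size read loop can be written with the two-argument form of `iter()`, which calls a function repeatedly until it returns a sentinel value. A minimal sketch, with the hypothetical helper name `iter_blocks` and an in-memory file standing in for the large one:

```python
import io
from functools import partial


def iter_blocks(file_obj, block_size=1024):
    """Return an iterator over blocks of up to block_size characters."""
    # iter(callable, sentinel) keeps calling read() until it returns ''.
    return iter(partial(file_obj.read, block_size), "")


sample = io.StringIO("abcdefghij")
blocks = list(iter_blocks(sample, block_size=4))
print(blocks)  # ['abcd', 'efgh', 'ij']
```

For a file opened in binary mode the sentinel would be b"" instead of "".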

2018-05-06 15:20:56

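Thank you! I recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b", which I guess meant it was in binary format. Using "decode(utf-8)" changed it to ascii.

Then I had to remove a "=\n" in the middle of each line.

Then I split the lines at the newline.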

import binascii

# fh, ele and i are defined in the surrounding code, which is not shown here
b_data = fh.read(ele[1])  # endat: this is one chunk of ascii data in binary format
a_data = binascii.b2a_qp(b_data).decode('utf-8')  # Data chunk in 'split' ascii format
data_chunk = a_data.replace('=\n', '').strip()  # Splitting characters removed
data_list = data_chunk.split('\n')  # List containing lines in chunk
#print(data_list,'\n')
#time.sleep(1)
for line_of_data in data_list:  # iterate through data_list to get each item
    i += 1
    print(line_of_data)

This is the code that sits just above "print data" in Arohi's code.
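
For context, the "b" prefix is how Python 3 displays a bytes object, which is what a file opened in binary mode returns. A minimal sketch of decoding such lines one at a time, assuming the data is UTF-8 text; the helper name `iter_decoded_lines` and the sample data are hypothetical:

```python
import io


def iter_decoded_lines(binary_file, encoding="utf-8"):
    """Yield each line of a binary file as str, without the trailing newline."""
    for raw_line in binary_file:  # raw_line is bytes, e.g. b'first\n'
        yield raw_line.decode(encoding).rstrip("\r\n")


sample = io.BytesIO("first\nsecond \u00e9\n".encode("utf-8"))
decoded = list(iter_decoded_lines(sample))
print(decoded)
```

Opening the file in text mode with an explicit encoding, as in open(path, encoding="utf-8"), would give str lines directly without the manual decode step.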

2018-01-18 15:28:19
