How to read a large file line by line?

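I want to iterate over every line of an entire file. One way to do this is to read the whole file, save it to a list, and then go over the lines of interest. This method uses a lot of memory, so I am looking for an alternative.

My code so far: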

for each_line in fileinput.input(input_file):
    do_something(each_line)

    for each_line_again in fileinput.input(input_file):
        do_something(each_line_again)

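Executing this code gives an error message: device active.

Any suggestions?

The purpose is to calculate pairwise string similarity, meaning that for each line in the file, I want to calculate the Levenshtein distance to every other line.

Edit: A related question asked 8 months after this one has many useful answers and comments. For a deeper understanding of the Python logic, read that related question as well: How to read a file line by line in Python?

Current answer

This is one possible way of reading a file in Python: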

f = open(input_file)
for line in f:
    do_stuff(line)
f.close()

It does not allocate a full list. It iterates over the lines one at a time.
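
A minimal sketch of the same idea using a `with` block, so the file is closed even if processing a line raises. For the pairwise comparison described in the question, the inner loop can simply open its own handle on the same file. The names `iter_lines` and `iter_line_pairs` are hypothetical helpers, and UTF-8 is an assumed encoding:

```python
def iter_lines(path):
    """Yield the lines of a file one at a time, without the trailing newline."""
    # Assumption: the file is UTF-8 text; adjust the encoding as needed.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def iter_line_pairs(path):
    """Yield (i, j, line_i, line_j) for every pair of lines with i < j.

    Each loop reads through its own file handle, so only two lines are
    held in memory at a time. The file is re-read once per outer line.
    """
    for i, line_i in enumerate(iter_lines(path)):
        for j, line_j in enumerate(iter_lines(path)):
            if j > i:
                yield i, j, line_i, line_j
```

Each yielded pair can then be passed to whatever distance function is in use. This trades memory for repeated disk reads, which may or may not be acceptable depending on file size.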

2011-11-04 13:33:37

Other answers

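# Using a text file for the example
with open("yourFile.txt", "r") as f:
    text = f.readlines()
for line in text:
    print(line, end="")  # each line already ends with a newline

This opens the file for reading ("r"), reads the whole file and saves each line in a list (text), then loops through the list and prints each line.

If you want, for example, to check for lines longer than 10 characters, work with what you already have available:

for line in text:
    if len(line) > 10:
        print(line, end="")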
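
Because `readlines()` loads the whole file into a list, the same length filter can also be applied while streaming. This is only a sketch: `long_lines` is a hypothetical helper name, the threshold of 10 follows the example above, and unlike the example it measures the length without the trailing newline:

```python
def long_lines(path, min_length=10):
    """Yield lines longer than min_length characters (newline excluded)."""
    with open(path, "r") as f:
        for line in f:
            stripped = line.rstrip("\n")
            if len(stripped) > min_length:
                yield stripped
```

Since this is a generator, only one line is held in memory at a time.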

2016-07-30 02:01:01

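From the Python documentation for fileinput.input():

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty.

Further, the definition of the function is:

fileinput.FileInput([files[, inplace[, backup[, mode[, openhook]]]]])

Reading between the lines, this tells me that files can be a list, so you could have something like: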

for each_line in fileinput.input([input_file, input_file]):
  do_something(each_line)

See the fileinput module documentation for more information.
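
A small sketch of what passing a list does in practice, assuming Python 3.2 or later (where the object returned by `fileinput.input()` can be used in a `with` block). The helper name `iter_twice` is hypothetical. Note that the files are read one after the other, so this gives two sequential passes rather than a nested loop:

```python
import fileinput


def iter_twice(path):
    """Yield (line_number_in_current_file, text) while reading path twice in a row."""
    with fileinput.input(files=[path, path]) as stream:
        for line in stream:
            # filelineno() restarts at 1 when the second pass begins.
            yield stream.filelineno(), line.rstrip("\n")
```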

2011-11-04 13:32:05

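I would strongly recommend not using the default file loading, as it is very slow. You should look into the numpy functions and the IOpro functions (e.g. numpy.loadtxt()).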

http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html

https://store.continuum.io/cshop/iopro/

Then you can break your pairwise operation into chunks:

import math

import numpy as np

# n (total number of lines) and m (lines per chunk) are placeholders to set beforehand
lines_total = n
lines_per_chunk = m
similarity = np.zeros((lines_total, lines_total))  # the shape must be passed as a tuple
n_chunks = int(math.ceil(float(lines_total) / lines_per_chunk))
for i in range(n_chunks):
    for j in range(n_chunks):
        # read_lines(start, stop) stands for a function of your choice that reads lines start..stop-1
        chunk_i = read_lines(i * lines_per_chunk, (i + 1) * lines_per_chunk)
        chunk_j = read_lines(j * lines_per_chunk, (j + 1) * lines_per_chunk)
        similarity[i * lines_per_chunk:(i + 1) * lines_per_chunk,
                   j * lines_per_chunk:(j + 1) * lines_per_chunk] = fast_operation(chunk_i, chunk_j)

Loading data in chunks and then doing matrix operations on it is almost always much faster than processing it element by element!
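
One possible way to read a file in fixed-size chunks of lines, sketched with the standard library only. `read_chunks` is a hypothetical helper name and is not part of numpy or IOpro:

```python
from itertools import islice


def read_chunks(path, lines_per_chunk):
    """Yield successive lists of at most lines_per_chunk lines (newlines stripped)."""
    with open(path, "r") as f:
        while True:
            # islice pulls the next lines_per_chunk lines from the open file.
            chunk = [line.rstrip("\n") for line in islice(f, lines_per_chunk)]
            if not chunk:
                break
            yield chunk
```

Only one chunk is held in memory at a time, and the last chunk may be shorter than the others.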

2014-10-17 19:39:11

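Need to frequently read a large file from the last read position?

I created a script that cuts an Apache access.log file several times a day, so I needed to set a position cursor on the last line parsed during the previous execution. To do this, I used the file.seek() and file.tell() methods, which make it possible to save the cursor position to a file.

My code: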

import os

ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))

# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")

# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")

# Position to start reading from (0 if no cursor has been saved yet)
from_position = 0
try:
    with open(cursor_position, "r", encoding=ENCODING) as f:
        from_position = int(f.read())
except Exception as e:
    pass

# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
    with open(cut_file, "w", encoding=ENCODING) as fw:
        # We set cursor to the last position used (during last run of script)
        f.seek(from_position)
        for line in f:
            fw.write("%s" % (line))

    # We save the last position of cursor for next usage
    with open(cursor_position, "w", encoding=ENCODING) as fw:
        fw.write(str(f.tell()))
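
In text mode, `tell()` returns an opaque number rather than a plain byte offset. A variant that opens the log in binary mode keeps the saved cursor a simple byte count. The sketch below is one way to do that; the function name `read_new_lines`, the UTF-8 decoding, and the reset when the log has shrunk are assumptions that are not in the script above:

```python
import os


def read_new_lines(log_path, cursor_path):
    """Return the lines appended to log_path since the previous call."""
    try:
        with open(cursor_path, "r") as f:
            start = int(f.read())
    except (OSError, ValueError):
        start = 0  # no usable saved position yet

    # Assumption: if the log is now shorter than the saved offset,
    # it was rotated or truncated, so start again from the beginning.
    if start > os.path.getsize(log_path):
        start = 0

    with open(log_path, "rb") as f:
        f.seek(start)
        new_lines = f.readlines()
        end = f.tell()  # byte offset just past the last line read

    with open(cursor_path, "w") as f:
        f.write(str(end))
    return [line.decode("utf-8") for line in new_lines]
```

Note that `readlines()` here loads only the newly appended part of the log, which is assumed to be small between runs.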

2020-01-07 13:18:47
