Read specific columns from a csv file with the csv module?

I'm trying to parse through a csv file and extract the data from only specific columns.

Example csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

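I'm trying to capture only specific columns, say ID, Name, Zip and Phone.

The code I've looked at led me to believe I can call a specific column by its corresponding number, e.g. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.

Here's what I've done so far: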

import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
 fromfile_prefix_chars="@" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:

    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]

    num_columns = len(array)
    csvfile.seek(0)

    reader = csv.reader(csvfile, delimiter=' ')
        included_cols = [1, 2, 6, 7]

    for row in reader:
            content = list(row[i] for i in included_cols)
            print content

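I'm expecting this to print out only the specific columns I want for each row, but it doesn't: I get only the last column.

Current answer

Because you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is: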

myVar = pd.read_csv('YourPath', sep = ",")['ColumnName']

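A few things to consider:

The snippet above produces a pandas Series, not a DataFrame. If speed is an issue, ayhan's suggestion of using usecols is also faster. Timing the two approaches with %timeit on a 2122 KB csv file gave 22.8 ms for the usecols approach and 53 ms for the approach I suggest here.

And don't forget import pandas as pd.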
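As a rough sketch of the usecols approach mentioned above: the data below is a made-up, comma-separated stand-in for the file in the question (the real file's values and delimiter are not shown in full), and io.StringIO replaces the file path so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real csv file on disk.
sample = io.StringIO(
    "ID,Name,Address,City,State,Zip,Phone\n"
    "10,Adam,130 W,Mo,AL,36000,334-555-0100\n"
    "11,Carl,131 W,Mo,AL,36001,334-555-0101\n"
)

# usecols limits parsing to the listed columns; dtype=str keeps values
# such as Zip as text instead of converting them to integers.
df = pd.read_csv(sample, usecols=["ID", "Name", "Zip", "Phone"], dtype=str)

names = df["Name"]  # selecting one column gives a Series
print(df.columns.tolist())
print(names.tolist())
```

With a real file, the path would go where `sample` is, and `sep` would need to match the file's actual delimiter.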

2018-12-10 08:33:55

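Other answers

To fetch the column names, it is better to use readline() instead of readlines(): that avoids looping over the file, reading it completely, and storing it in an array.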

with open(csv_file, 'rb') as csvfile:

    # get number of columns
    line = csvfile.readline()
    first_item = line.split(',')
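A possible variation on the same idea, sketched for Python 3: letting csv.reader parse the header row (instead of splitting the line by hand) also handles quoted fields. The sample data, the comma delimiter, and the `index_of` helper name are assumptions for illustration, not part of the original post.

```python
import csv
import io

# Hypothetical comma-separated sample; a real script would use
# open(csv_file, newline="") instead of io.StringIO.
csvfile = io.StringIO(
    "ID,Name,Address,City,State,Zip,Phone\n"
    "10,Adam,130 W,Mo,AL,36000,334-555-0100\n"
)

reader = csv.reader(csvfile)
header = next(reader)  # consumes only the first row
num_columns = len(header)

# Map each column name to its zero-based position.
index_of = {name: i for i, name in enumerate(header)}
included_cols = [index_of[name] for name in ("ID", "Name", "Zip", "Phone")]
print(num_columns, included_cols)
```

After `next(reader)`, the reader is positioned at the first data row, so no `seek(0)` is needed before iterating over the rest of the file.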

2017-05-15 13:52:43

I think there is an easier way:

import pandas as pd

dataset = pd.read_csv('table1.csv')
ftCol = dataset.iloc[:, 0].values

Here, in iloc[:, 0], the : means all rows and 0 is the position of the column. In the example below, ID would be selected:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

2020-02-13 11:38:26

You can use numpy.loadtxt(filename). For example, if this is your database .csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

And you need the Name column:

import numpy as np 
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))

>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

Even more easily, you can use genfromtxt:

b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

2014-01-10 13:46:04

import pandas as pd

dataset = pd.read_csv('Train.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

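X is a group of columns; use it if you want to read more than one column.
y is a single column; use it to read one column.
[:, 1:-1] means [row_index : to_row_index, column_index : to_column_index].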
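To make the slicing concrete, here is a small sketch on made-up data (the four-column sample and its values are assumptions; 'Train.csv' itself is not shown in the answer). Note that the end of a slice is exclusive, so 1:-1 stops before the last column.

```python
import io
import pandas as pd

# Hypothetical four-column sample standing in for 'Train.csv'.
sample = io.StringIO(
    "ID,Name,Zip,Phone\n"
    "10,Adam,36000,334-555-0100\n"
    "11,Carl,36001,334-555-0101\n"
)
dataset = pd.read_csv(sample, dtype=str)

# All rows, columns from position 1 up to (not including) the last one.
X = dataset.iloc[:, 1:-1].values
# All rows, last column only.
y = dataset.iloc[:, -1].values

print(X.shape)
print(y.shape)
```

Here X holds the Name and Zip columns as a 2-D array and y holds the Phone column as a 1-D array.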

2021-11-20 11:21:32

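The only way you would get just the last column from this code is if your print statement is not inside your for loop.

This is most likely the end of your code: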

for row in reader:
    content = list(row[i] for i in included_cols)
print content

You want it to be this:

for row in reader:
        content = list(row[i] for i in included_cols)
        print content
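For readers on Python 3, a self-contained sketch of the corrected loop might look like the following. It assumes a comma-separated file and uses made-up sample rows; two details differ from the question's code and are worth checking against the real file: the delimiter passed to csv.reader has to match the file (the question passes a space), and list indices are zero-based, so ID, Name, Zip and Phone sit at positions 0, 1, 5 and 6 in this sample.

```python
import csv
import io

# Hypothetical comma-separated sample; a real script would use
# open(csv_file, newline="") instead of io.StringIO.
csvfile = io.StringIO(
    "ID,Name,Address,City,State,Zip,Phone\n"
    "10,Adam,130 W,Mo,AL,36000,334-555-0100\n"
    "11,Carl,131 W,Mo,AL,36001,334-555-0101\n"
)
included_cols = [0, 1, 5, 6]  # zero-based positions of ID, Name, Zip, Phone

reader = csv.reader(csvfile, delimiter=",")
selected = []
for row in reader:
    # Build the subset and print it inside the loop, once per row.
    content = [row[i] for i in included_cols]
    selected.append(content)
    print(content)
```

Collecting the rows in `selected` is optional; it is only there so the result can be reused after the loop.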

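Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.

Pandas is excellent at dealing with csv files, and the following code is all you need to read a csv and save an entire column into a variable: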

import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']

So if you wanted to save all of the info in your column Names into a variable, this is all you need to do:

names = df.Names

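It's a great module and I suggest you look into it. If your print statement was in fact inside the for loop and it still printed only the last column, which shouldn't happen, let me know that my assumption was wrong. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!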

2013-05-12 03:06:30
