I want to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:

import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

I guess I need some help within the for loop?


Current answer

An alternative to darindaCoder's answer:

import glob
import os

import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes the pattern OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one

Other answers

import os

os.system("awk '(NR == 1) || (FNR > 1)' file*.csv > merged.csv")

where NR and FNR represent the number of the line being processed.

FNR is the current line within each file.

NR == 1 includes the first line of the first file (the header), while FNR > 1 skips the first line of each subsequent file.
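
After the shell-level merge, the combined file can be read back into pandas in a single call. A minimal sketch, assuming merged.csv was produced by the awk command above:

import pandas as pd

# Read the merged CSV; only the first file's header row was kept
big_frame = pd.read_csv('merged.csv')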

This is how to do it using Colaboratory with Google Drive:

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # Use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True, sort=True)
frame.to_csv('/content/drive/onefile.csv')

The Dask library can read a dataframe from multiple files:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

(Source: https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files)

A Dask DataFrame implements a subset of the pandas DataFrame API. If all the data fits into memory, you can call df.compute() to convert the Dask DataFrame into a pandas DataFrame.
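
A minimal sketch of that workflow, assuming the matching data*.csv files fit in memory once loaded:

import dask.dataframe as dd

# Lazily scan every matching CSV into one Dask DataFrame
ddf = dd.read_csv('data*.csv')

# Materialize the result as an ordinary pandas DataFrame
big_frame = ddf.compute()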

Inspired by MrFun's answer:

import glob
import pandas as pd

list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()

df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)

Note:

By default, the list of files generated through glob.glob is not sorted. In many scenarios, however, it needs to be sorted, e.g. when analyzing the number of sensor frame drops vs. timestamp.

If ignore_index=True is not specified in the pd.concat call, the original indices from each dataframe (i.e. each individual CSV file in the list) are preserved, and the index of the main dataframe looks like:

   timestamp  id  valid_frame
0
1
2
.
.
0
1
2
.
.

With ignore_index=True, it looks like:

   timestamp  id  valid_frame
0
1
2
.
.
108
109
.
.

IMO, this is helpful when one wants to manually create a histogram of the number of frame drops per one-minute (or any other duration) bin and base the calculation on the very first timestamp, e.g. begin_timestamp = df['timestamp'][0]. Without ignore_index=True, df['timestamp'][0] returns a Series containing the very first timestamp from each of the individual dataframes, rather than a single value.
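
A minimal illustration of the index behaviour described above, using two toy dataframes in place of real CSV files:

import pandas as pd

# Two small dataframes standing in for two CSV files
df1 = pd.DataFrame({'timestamp': [1, 2, 3]})
df2 = pd.DataFrame({'timestamp': [4, 5, 6]})

# Without ignore_index the per-file indices 0..2 repeat, so
# df['timestamp'][0] returns a Series with one value per file
df = pd.concat([df1, df2])
print(df['timestamp'][0])

# With ignore_index=True the index runs 0..5 and
# df['timestamp'][0] is a single scalar
df = pd.concat([df1, df2], ignore_index=True)
print(df['timestamp'][0])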

See pandas: IO Tools for all of the available .read_ methods.

Try the following code if all of the CSV files have the same columns.

I have added header=0 so that after reading the first row of each CSV file, it is assigned as the column names.

import pandas as pd
import glob
import os

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

Or, attributed to a comment from Sid:

all_files = glob.glob(os.path.join(path, "*.csv"))

df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

It is often necessary to identify each data sample, which can be accomplished by adding a new column to the dataframe. This example will use pathlib from the standard library, which treats paths as objects with methods rather than strings to be sliced.

Imports and setup

from pathlib import Path
import pandas as pd
import numpy as np

path = r'C:\DRO\DCL_rawdata_files'  # or unix / linux / mac path

# Get the files from the path provided in the OP
files = Path(path).glob('*.csv')  # .rglob to get subdirectories

Option 1:

Add a new column with the file name

dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is a method on pathlib objects that returns the filename without the extension
    data['file'] = f.stem
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Option 2:

Use enumerate to add a new column with a generic name

dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Option 3:

Create the dataframes with a list comprehension, and then use np.repeat to add a new column. [f'S{i}' for i in range(len(dfs))] creates a list of strings to name each dataframe, and [len(df) for df in dfs] creates a list of lengths. Attribution for this option goes to this plotting answer.

# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]

# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)

# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])

Option 4:

A one-liner using .assign to create the new column, attributed to a comment from C8H10N4O2:

df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)

or

df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)