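Python: list directories, subdirectories, and files

I'm trying to write a script that lists all directories, subdirectories, and files in a given directory.
I tried this: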

import sys, os

root = "/home/patate/directory/"
path = os.path.join(root, "targetdirectory")

for r, d, f in os.walk(path):
    for file in f:
        print(os.path.join(root, file))

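Unfortunately it doesn't work properly.
I get all the files, but not their full paths.

For example, if the directory structure is:

/home/patate/directory/targetdirectory/123/456/789/file.txt

it prints:

/home/patate/directory/targetdirectory/file.txt

What I need is the first result. Any help would be greatly appreciated! Thanks.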


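Use os.path.join to concatenate the directory and the file name:

for path, subdirs, files in os.walk(root):
    for name in files:
        print(os.path.join(path, name))

Note the use of path rather than root in the concatenation: os.walk yields the directory it is currently visiting as path, so joining with root drops the intermediate subdirectories.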
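To make the difference concrete, here is a minimal sketch that builds a throwaway tree in a temporary directory and lists it. The helper name list_files and the 123/456/file.txt layout are made up for illustration.

```python
import os
import tempfile


def list_files(root):
    """Return the full path of every file below root."""
    found = []
    for path, subdirs, files in os.walk(root):
        for name in files:
            # path is the directory currently being visited,
            # so the joined result keeps every intermediate folder
            found.append(os.path.join(path, name))
    return found


# Hypothetical layout: <tmp>/123/456/file.txt
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "123", "456")
    os.makedirs(nested)
    open(os.path.join(nested, "file.txt"), "w").close()

    result = list_files(tmp)
    expected = os.path.join(tmp, "123", "456", "file.txt")
    wrong = os.path.join(tmp, "file.txt")  # what joining with root would produce
    print(result)
```

Joining with the top-level root would yield the path stored in wrong, which does not exist on disk.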


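Python 3.4 added the pathlib module to make path manipulation easier. The equivalent of os.path.join is:

pathlib.PurePath(path, name)

The advantage of pathlib is that you can use a variety of useful methods on paths. If you use the concrete Path variant, you can also make actual OS calls through it, such as checking whether the path exists, deleting it, opening the file it points to, and so on.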
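As a rough illustration of what the concrete Path variant buys you, the sketch below wraps each os.walk result in a Path and then calls a few of its methods. The function name walk_paths and the sub/notes.txt file are assumptions made for the demo.

```python
import os
import tempfile
from pathlib import Path


def walk_paths(root):
    """Yield a pathlib.Path for every file below root."""
    for path, subdirs, files in os.walk(root):
        for name in files:
            yield Path(path, name)


# Hypothetical layout: <tmp>/sub/notes.txt
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "notes.txt").write_text("hello")

    paths = list(walk_paths(tmp))
    first = paths[0]
    suffix = first.suffix                          # file extension
    relative = first.relative_to(tmp).as_posix()   # path relative to the root
    content = first.read_text()                    # an actual file system call
```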

Just in case: to get all files in the directory and its subdirectories that match some pattern (*.py, for example):

import os
from fnmatch import fnmatch

root = '/some/directory'
pattern = "*.py"

for path, subdirs, files in os.walk(root):
    for name in files:
        if fnmatch(name, pattern):
            print(os.path.join(path, name))

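You can take a look at this sample I made. It uses the os.path.walk function, which is deprecated and was removed in Python 3, so it only runs on Python 2. It uses a list to store all the file paths.

import fnmatch
import os

root = "Your root directory"
ex = ".txt"
where_to = "Wherever you wanna write your file to"

def fileWalker(ext, dirname, names):
    '''Append every name in names that matches the extension ext[0] to the list ext[1].'''
    pat = "*" + ext[0]
    for f in names:
        if fnmatch.fnmatch(f, pat):
            ext[1].append(os.path.join(dirname, f))

def writeTo(fList):
    # Write one path per line to the output file
    with open(where_to, "w") as f:
        for di_r in fList:
            f.write(di_r + "\n")

if __name__ == '__main__':
    li = []
    # Python 2 only: os.path.walk calls fileWalker(arg, dirname, names) for each directory
    os.path.walk(root, fileWalker, [ex, li])
    writeTo(li)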

Here's a one-liner:

import os


[val for sublist in [[os.path.join(i[0], j) for j in i[2]] for i in os.walk('./')] for val in sublist]
# Meta comment to ease selecting text

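The outermost val for sublist in ... loop flattens the list into one dimension. The j loop collects a list of every file basename and joins it to the current path. Finally, the i loop iterates over all directories and subdirectories.

This example uses the hard-coded path ./ in the os.walk(...) call; you can substitute any path string you like.

Note: os.path.expanduser and/or os.path.expandvars can be used for path strings such as ~/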
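A short sketch of that note: the snippet expands ~ and an environment variable before the path is handed to os.walk. DEMO_DIR is a hypothetical variable that is set here only for the demonstration.

```python
import os

# Hypothetical environment variable, defined only for this demo
os.environ["DEMO_DIR"] = "projects"

raw = os.path.join("~", "$DEMO_DIR")

# expanduser replaces the leading ~, expandvars replaces $DEMO_DIR
expanded = os.path.expandvars(os.path.expanduser(raw))
print(expanded)

# The expanded string can then be passed to os.walk(expanded)
```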

Extending this example:

It's easy to add tests on the file basename and on the directory name.

For example, to test for *.jpg files:

... for j in i[2] if j.endswith('.jpg')] ...

You can also exclude the .git directory:

... for i in os.walk('./') if '.git' not in i[0].split('/')]
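If the one-liner gets hard to read, the same two filters can be written as an ordinary loop. The sketch below is one possible variant, not the answer's own code: instead of testing every path for a .git component, it removes unwanted names from subdirs in place, which stops os.walk (in its default top-down mode) from descending into them at all. The function name and the file names in the demo are assumptions.

```python
import os
import tempfile


def list_files(root, skip_dirs=(".git",), suffix=".jpg"):
    """List files ending in suffix, never entering directories named in skip_dirs."""
    found = []
    for path, subdirs, files in os.walk(root):
        # In-place edit: os.walk only descends into what is left in subdirs
        subdirs[:] = [d for d in subdirs if d not in skip_dirs]
        found.extend(os.path.join(path, n) for n in files if n.endswith(suffix))
    return found


# Hypothetical tree with one .jpg hidden inside .git
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("a.jpg", "b.txt",
                os.path.join(".git", "c.jpg"),
                os.path.join("pics", "d.jpg")):
        full = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

    names = sorted(os.path.relpath(p, tmp) for p in list_files(tmp))
    print(names)
```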

A slightly simpler one-liner:

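import os
from itertools import product, chain

root = '.'  # directory to walk
# chain.from_iterable returns an iterator, so wrap it in list() to get the paths
list(chain.from_iterable([[os.sep.join(w) for w in product([i[0]], i[2])] for i in os.walk(root)]))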

I can't comment, so I'm writing an answer here. This is the clearest one-liner I have seen:

import os
[os.path.join(path, name) for path, subdirs, files in os.walk(root) for name in files]

Since every example here just uses walk (with join), I'd like to show a nice example and compare it with listdir:

import os, time

def listFiles1(root): # listdir
    allFiles = []; walk = [root]
    while walk:
        folder = walk.pop(0)+"/"; items = os.listdir(folder) # items = folders + files
        for i in items: i=folder+i; (walk if os.path.isdir(i) else allFiles).append(i)
    return allFiles

def listFiles2(root): # listdir/join (takes ~1.4x as long) (and uses '\\' instead)
    allFiles = []; walk = [root]
    while walk:
        folder = walk.pop(0); items = os.listdir(folder) # items = folders + files
        for i in items: i=os.path.join(folder,i); (walk if os.path.isdir(i) else allFiles).append(i)
    return allFiles

def listFiles3(root): # walk (takes ~1.5x as long)
    allFiles = []
    for folder, folders, files in os.walk(root):
        for file in files: allFiles+=[folder.replace("\\","/")+"/"+file] # folder+"\\"+file still ~1.5x
    return allFiles

def listFiles4(root): # walk/join (takes ~1.6x as long) (and uses '\\' instead)
    allFiles = []
    for folder, folders, files in os.walk(root):
        for file in files: allFiles+=[os.path.join(folder,file)]
    return allFiles


for i in range(100): files = listFiles1("src") # warm up

start = time.time()
for i in range(100): files = listFiles1("src") # listdir
print("Time taken: %.2fs"%(time.time()-start)) # 0.28s

start = time.time()
for i in range(100): files = listFiles2("src") # listdir and join
print("Time taken: %.2fs"%(time.time()-start)) # 0.38s

start = time.time()
for i in range(100): files = listFiles3("src") # walk
print("Time taken: %.2fs"%(time.time()-start)) # 0.42s

start = time.time()
for i in range(100): files = listFiles4("src") # walk and join
print("Time taken: %.2fs"%(time.time()-start)) # 0.47s

So, as you can see for yourself, the listdir version was the fastest in these timings (and join is slow).
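The listdir versions above are a breadth-first traversal with an explicit queue, packed into very dense lines. Here is a more readable sketch of the same idea. It assumes a collections.deque for the queue and uses os.path.join for portability, so it corresponds to the listdir/join variant. The function name and the demo tree are made up, and no timing claim is attached to it.

```python
import os
import tempfile
from collections import deque


def list_files_listdir(root):
    """Breadth-first listing of every file below root using os.listdir."""
    all_files = []
    queue = deque([root])
    while queue:
        folder = queue.popleft()
        for name in os.listdir(folder):
            full = os.path.join(folder, name)
            if os.path.isdir(full):
                queue.append(full)      # visit this directory later
            else:
                all_files.append(full)  # anything that is not a directory
    return all_files


# Hypothetical tree used to compare the result against os.walk
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    for rel in ("top.txt",
                os.path.join("a", "mid.txt"),
                os.path.join("a", "b", "deep.txt")):
        open(os.path.join(tmp, rel), "w").close()

    from_listdir = sorted(list_files_listdir(tmp))
    from_walk = sorted(os.path.join(p, n)
                       for p, _, files in os.walk(tmp) for n in files)
```

Note that os.path.isdir follows symbolic links, so this sketch (like the originals) descends into linked directories, whereas os.walk does not by default.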

This is just an add-on: with it, you can export the data to CSV format.

import sys, os
from fnmatch import fnmatch

try:
    import pandas as pd
except ImportError:
    # Install pandas on the fly, then import it
    os.system("pip3 install pandas")
    import pandas as pd

root = "/home/kiran/Downloads/MainFolder" # it may have many subfolders and files inside
lst = []
pattern = "*.csv"      # I want to get only csv files
pattern = "*.*"        # Note: use this pattern instead to get every file whose name contains a dot
for path, subdirs, files in os.walk(root):
    for name in files:
        if fnmatch(name, pattern):
            lst.append(os.path.join(path, name))
df = pd.DataFrame({"filePaths": lst})
df.to_csv("filepaths.csv")
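If pandas is not available, the standard library's csv module can write the same single-column file. This is a minimal sketch under that assumption. The function name write_file_list and the demo file names are made up, and unlike DataFrame.to_csv it writes no index column.

```python
import csv
import os
import tempfile
from fnmatch import fnmatch


def write_file_list(root, out_csv, pattern="*"):
    """Write the paths of all files below root that match pattern to out_csv."""
    count = 0
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filePaths"])  # header row
        for path, subdirs, files in os.walk(root):
            for name in files:
                if fnmatch(name, pattern):
                    writer.writerow([os.path.join(path, name)])
                    count += 1
    return count


# Hypothetical data: two .csv files and one .txt file
with tempfile.TemporaryDirectory() as data_dir, \
        tempfile.TemporaryDirectory() as out_dir:
    os.makedirs(os.path.join(data_dir, "sub"))
    for rel in ("a.csv", os.path.join("sub", "b.csv"), "c.txt"):
        open(os.path.join(data_dir, rel), "w").close()

    out_csv = os.path.join(out_dir, "filepaths.csv")
    count = write_file_list(data_dir, out_csv, pattern="*.csv")
    with open(out_csv, newline="") as handle:
        rows = list(csv.reader(handle))
```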

A very simple solution is to run a couple of subprocess calls that export the file listing to CSV format:

import subprocess

# Global variables for directory being mapped
location = '.' # Enter the path here.
pattern = '*.py' # Use this if you want to only return certain filetypes
rootDir = location.rpartition('/')[-1]
outputFile = rootDir + '_directory_contents.csv'

# Find the requested data and export to CSV, specifying a pattern if needed.
# The pattern is quoted so the shell passes it to find instead of expanding it.
# %T+ is the last modification time, matching the ModifiedTime column header.
find_cmd = 'find ' + location + ' -name "' + pattern + '" -fprintf ' + outputFile + '  "%Y%M,%n,%u,%g,%s,%T+,%P\n"'
subprocess.call(find_cmd, shell=True)

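The command produces comma-separated values that can easily be analyzed in Excel.

f-rwxrwxrwx,1,cathy,cathy,2642,2021-06-01+00:22:00.2970880000,content-audit.py

The resulting CSV file has no header row, but you can add one with a second command.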

# Add headers to the CSV
headers_cmd = 'sed -i.bak 1i"Permissions,Links,Owner,Group,Size,ModifiedTime,FilePath" ' + outputFile
subprocess.call(headers_cmd, shell=True)
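The -fprintf option is a GNU find extension, so the commands above may not work on every system. As a rough portable alternative, the sketch below collects similar fields in pure Python with os.stat. The function name describe_files and the demo file are assumptions, and the owner and group columns are left out because resolving them is platform specific.

```python
import os
import stat
import tempfile
from datetime import datetime


def describe_files(root):
    """Return one dict per file below root with fields similar to the find output."""
    rows = []
    for path, subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(path, name)
            info = os.stat(full)
            rows.append({
                "Permissions": stat.filemode(info.st_mode),  # e.g. -rw-r--r--
                "Links": info.st_nlink,
                "Size": info.st_size,                         # in bytes
                "ModifiedTime": datetime.fromtimestamp(info.st_mtime).isoformat(),
                "FilePath": os.path.relpath(full, root),
            })
    return rows


# Hypothetical single-file directory for the demo
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "content-audit.py"), "w") as handle:
        handle.write("pass")
    rows = describe_files(tmp)
```

The resulting list of dicts can be passed straight to csv.DictWriter or pandas.DataFrame.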

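Depending on how much data is returned, you can process it further with Pandas. Here are a few things I found useful, especially when dealing with several levels of directories.

Add these to the imports:

import numpy as np
import pandas as pd

Then add the following to the code: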

# To convert sizes to be more human readable
def convert_bytes(size):
    for x in ['b', 'K', 'M', 'G', 'T']:
        if size < 1024:
            return "%3.1f %s" % (size, x)
        size /= 1024
    return size

# Create DataFrame from the csv file created above.
df = pd.read_csv(outputFile)

# Format columns
# Get the filename and file extension from the filepath
df['FileName'] = df['FilePath'].str.rsplit("/", n=1).str[-1]
df['FileExt'] = df['FileName'].str.rsplit('.', n=1).str[1]

# Get the full path to the files. If the path doesn't include a "/" it's the root directory
df['FullPath'] = df["FilePath"].str.rsplit("/", n=1).str[0]
df['FullPath'] = np.where(df['FilePath'].str.contains("/"), df['FullPath'], rootDir)

# Split the path into columns for the parent directory and its children
df['ParentDir'] = df['FullPath'].str.split("/", n=1).str[0]
df['SubDirs'] = df['FullPath'].str.split("/", n=1).str[1]
# Account for NaN returns, which indicate the path has no subdirectories
df['SubDirs'] = df['SubDirs'].fillna('')

# Determine if the item is a directory or file.
df['Type'] = np.where(df['Permissions'].str.startswith('d'), 'Dir', 'File')

# Split the time stamp into date and time columns
df[['ModifiedDate', 'Time']] = df.ModifiedTime.str.rsplit('+', n=1, expand=True)
df['Time'] = df['Time'].str.split('.').str[0]

# Show only files, output includes paths so you don't necessarily need to display the individual directories.
df = df[df['Type'].str.contains('File')]

# Set columns to show and their order.
df = df[['FileName', 'ParentDir', 'SubDirs', 'FullPath', 'FileExt', 'ModifiedDate', 'Time', 'Size']]

# Convert the file sizes from bytes to something more readable.
df['Size'] = [convert_bytes(size) for size in df['Size']]

# Send the data to an Excel workbook with sheets by parent directory
with pd.ExcelWriter("scripts_directory_contents.xlsx") as writer:
    for directory, data in df.groupby('ParentDir'):
        data.to_excel(writer, sheet_name=directory, index=False)

If you want to list files on SharePoint, this is how to do it. Your path will probably start after the "teams" part.

import os
root = r"\\mycompany.sharepoint.com@SSL\DavWWWRoot\teams\MyFolder\Policies and Procedures\Deal Docs\My Deals"
# Named file_list so the built-in list is not shadowed
file_list = [os.path.join(path, name) for path, subdirs, files in os.walk(root) for name in files]
print(file_list)

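Another option is to use the glob module from the standard library:

import glob

pattern = "/home/patate/directory/targetdirectory/**"

# With recursive=True, ** matches files and directories at any depth
for path in glob.glob(pattern, recursive=True):
    print(path)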

If you need an iterator, you can use iglob as an alternative:

for file in glob.iglob(my_path, recursive=True):
    # ...
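As a small runnable illustration of the recursive pattern, the sketch below uses iglob with **/*.py to collect only Python files from a temporary tree. The directory and file names are hypothetical.

```python
import glob
import os
import tempfile

# Hypothetical tree: three .py files at different depths and one .txt file
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "pkg", "inner"))
    for rel in ("top.py",
                os.path.join("pkg", "mod.py"),
                os.path.join("pkg", "inner", "deep.py"),
                os.path.join("pkg", "readme.txt")):
        open(os.path.join(tmp, rel), "w").close()

    # ** matches zero or more directory levels when recursive=True
    pattern = os.path.join(tmp, "**", "*.py")
    matches = sorted(os.path.relpath(p, tmp)
                     for p in glob.iglob(pattern, recursive=True))
    print(matches)
```

Keep in mind that glob patterns skip names starting with a dot by default, so hidden files and directories are not returned by this pattern.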

With any supported Python version (3.4+), you should use pathlib's Path.rglob to recursively list the contents of the current directory and all subdirectories:

from pathlib import Path


def generate_all_files(root: Path, only_files: bool = True):
    for p in root.rglob("*"):
        if only_files and not p.is_file():
            continue
        yield p


for p in generate_all_files(Path("."), only_files=False):
    print(p)

If you want something to copy and paste:

Example

Folder structure:

$ tree . -a
.
├── a.txt
├── bar
├── b.py
├── collect.py
├── empty
├── foo
│   └── bar.bz.gz2
├── .hidden
│   └── secrect-file
└── martin
    └── thoma
        └── cv.pdf

Gives:

$ python collect.py
bar
empty
.hidden
collect.py
a.txt
b.py
martin
foo
.hidden/secrect-file
martin/thoma
martin/thoma/cv.pdf
foo/bar.bz.gz2