How to extract a filename from a URL and append a word to it?

I have the following URL:

url = http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg

I want to extract the filename from this URL: 09-09-201315-47-571378756077.jpg

Once I have this filename, I will save the file to the desktop under that name.

filename = **extracted file name from the url**
download_photo = urllib.urlretrieve(url, "/home/ubuntu/Desktop/%s.jpg" % (filename))

After that, I will resize the photo, and once that is done, I will save the resized version with "_small" appended to the end of the filename.

downloadedphoto = Image.open("/home/ubuntu/Desktop/%s.jpg" % (filename))
resize_downloadedphoto = downloadedphoto.resize((300, 300), Image.ANTIALIAS)
resize_downloadedphoto.save("/home/ubuntu/Desktop/%s.jpg" % (filename + "_small"))

What I am trying to achieve is two files: the original photo with its original name, and the resized photo with the modified name. Like this:

09-09-201315-47-571378756077.jpg

renamed to:

09-09-201315-47-571378756077_small.jpg

How can I do this?

filename = url[url.rfind("/")+1:]
filename_small = filename.replace(".", "_small.")

It may be safer to replace ".jpg" rather than "." in the second line, since a "." can also appear elsewhere in the filename.
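The concern about extra dots can also be avoided by splitting off only the final extension. A minimal sketch, assuming the URL has no query string or fragment; `add_suffix` is a hypothetical helper name, not something from the answer above:

```python
import os


def add_suffix(filename, suffix="_small"):
    # os.path.splitext splits at the last dot only, so any earlier
    # dots in the name are left untouched.
    stem, ext = os.path.splitext(filename)
    return stem + suffix + ext


url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url[url.rfind("/") + 1:]
filename_small = add_suffix(filename)
print(filename_small)  # 09-09-201315-47-571378756077_small.jpg
```

Unlike a plain `replace`, this also works for names without an extension, where the suffix is simply appended.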

You can use urllib.parse.urlparse with os.path.basename:

import os
from urllib.parse import urlparse


url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

Your URL might contain percent-encoded characters like %20 for a space or %E7%89%B9%E8%89%B2 for "特色". If so, you'll need to unquote (or unquote_plus) them. You can also use pathlib.Path().name instead of os.path.basename, which makes it easier to add a suffix to the name (as asked in the original question):

from pathlib import Path
from urllib.parse import urlparse, unquote


url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
urlparse(url).path


url_parsed = urlparse(url)
print(unquote(url_parsed.path))  # Output: /kyle/09-09-2013 - 15-47-571378756077.jpg
file_path = Path("/home/ubuntu/Desktop/") / unquote(Path(url_parsed.path).name)
print(file_path)        # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg


new_file = file_path.with_stem(file_path.stem + "_small")  # with_stem requires Python 3.9+
print(new_file)         # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg

Also, an alternative is to use unquote(urlparse(url).path.split("/")[-1]).
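To tie this back to the original question, here is a rough sketch that derives both target paths (original and resized) from the URL. The helper name `local_paths` and its default directory are illustrative assumptions, and the download and resize steps are left out. It uses `with_name` rather than `with_stem` so that it also runs on Python versions before 3.9.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, unquote


def local_paths(url, directory="/home/ubuntu/Desktop"):
    # Take the last path segment and decode percent-escapes such as %20.
    name = unquote(urlparse(url).path.split("/")[-1])
    if not name:
        raise ValueError("URL has no filename component: %r" % url)
    original = PurePosixPath(directory) / name
    # Insert "_small" between the stem and the final extension.
    small = original.with_name(original.stem + "_small" + original.suffix)
    return original, small


original, small = local_paths(
    "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
)
print(original)  # /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg
print(small)     # /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg
```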

The question "Python split url to find image name and extension" shows how to extract the image name. To append to the name:

imageName =  '09-09-201315-47-571378756077'


new_name = '{0}_small.jpg'.format(imageName)

You could just split the url by "/" and retrieve the last member of the list:

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url.split("/")[-1]
#09-09-201315-47-571378756077.jpg

Then use replace to change the ending:

small_jpg = filename.replace(".jpg", "_small.jpg")
#09-09-201315-47-571378756077_small.jpg

os.path.basename(url)

Why try harder?

In [1]: os.path.basename("https://example.com/file.html")
Out[1]: 'file.html'


In [2]: os.path.basename("https://example.com/file")
Out[2]: 'file'


In [3]: os.path.basename("https://example.com/")
Out[3]: ''


In [4]: os.path.basename("https://example.com")
Out[4]: 'example.com'
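One reason to try harder: os.path.basename treats the URL as a plain string, so anything after the last "/" ends up in the result, including a query string or fragment. A quick illustration with made-up URLs:

```python
import os

with_query = os.path.basename("https://example.com/file.html?v=2")
with_fragment = os.path.basename("https://example.com/file.html#section")
slash_in_query = os.path.basename("https://example.com/file.html?next=/home")

print(with_query)      # file.html?v=2
print(with_fragment)   # file.html#section
print(slash_in_query)  # home
```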

Note 2020-12-20

Nobody has thus far provided a complete solution.

A URL can contain a ?[query-string] and/or a #[fragment identifier] (but only in that order).

In [1]: from os import path


In [2]: def get_filename(url):
...:     fragment_removed = url.split("#")[0]  # keep to left of first #
...:     query_string_removed = fragment_removed.split("?")[0]
...:     scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
...:     if scheme_removed.find("/") == -1:
...:         return ""
...:     return path.basename(scheme_removed)
...:


In [3]: get_filename("a.com/b")
Out[3]: 'b'


In [4]: get_filename("a.com/")
Out[4]: ''


In [5]: get_filename("https://a.com/")
Out[5]: ''


In [6]: get_filename("https://a.com/b")
Out[6]: 'b'


In [7]: get_filename("https://a.com/b?c=d#e")
Out[7]: 'b'
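For comparison, much the same behaviour can probably be had from the standard library's URL parser instead of manual splitting. A sketch, assuming scheme-less inputs like the ones above should still work; `get_filename_parsed` is a hypothetical name:

```python
import posixpath
from urllib.parse import urlsplit


def get_filename_parsed(url):
    # urlsplit separates the query string and fragment from the path.
    path = urlsplit(url).path
    # Without a "/" in the path there is no file component (e.g. "a.com").
    if "/" not in path:
        return ""
    return posixpath.basename(path)


print(get_filename_parsed("https://a.com/b?c=d#e"))  # b
print(get_filename_parsed("https://a.com/"))         # prints an empty line
```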

Sometimes there is a query string:

filename = url.split("/")[-1].split("?")[0]
new_filename = filename.replace(".jpg", "_small.jpg")

Use urllib.parse.urlparse to get just the path part of the URL, and then use pathlib.Path on that path to get the filename:

from urllib.parse import urlparse
from pathlib import Path




url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true&some_more=true#and-an-anchor"
a = urlparse(url)
a.path             # '/some/long/path/a_filename.jpg'
Path(a.path).name  # 'a_filename.jpg'

We can extract the filename from a URL using the ntpath module.

import ntpath
url = 'http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg'
name, ext = ntpath.splitext(ntpath.basename(url))
# name: 09-09-201315-47-571378756077, ext: .jpg

print(name + '_small' + ext)
# 09-09-201315-47-571378756077_small.jpg

With python3 (from 3.4 upwards) you can abuse the pathlib library in the following way:

from pathlib import Path


p = Path('http://example.com/somefile.html')
print(p.name)
# >>> 'somefile.html'


print(p.stem)
# >>> 'somefile'


print(p.suffix)
# >>> '.html'


print(f'{p.stem}-spamspam{p.suffix}')
# >>> 'somefile-spamspam.html'


❗️ WARNING

The pathlib module is NOT meant for parsing URLs — it is designed to work with POSIX paths and not with URLs. Don't use it in production code! It's a dirty quick hack for non-critical code. The code is only provided as an example of what you can do but probably should not do. If you need to parse URLs then go with urllib.parse or alternatives.
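To make the warning concrete, here is a small demonstration of two ways pathlib misreads a URL. The URL is made up, and `PurePosixPath` is used (an assumption, so the result does not depend on the operating system):

```python
from pathlib import PurePosixPath

p = PurePosixPath("http://example.com/somefile.html?version=1.2")

# The "//" after the scheme is collapsed, so the URL cannot be rebuilt.
print(str(p))    # http:/example.com/somefile.html?version=1.2

# The query string is treated as part of the name, so the suffix is wrong.
print(p.name)    # somefile.html?version=1.2
print(p.suffix)  # .2
```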

A simple version using the os package:

import os


def get_url_file_name(url):
    url = url.split("#")[0]
    url = url.split("?")[0]
    return os.path.basename(url)

Examples:

print(get_url_file_name("example.com/myfile.tar.gz"))  # 'myfile.tar.gz'
print(get_url_file_name("example.com/"))  # ''
print(get_url_file_name("https://example.com/"))  # ''
print(get_url_file_name("https://example.com/hello.zip"))  # 'hello.zip'
print(get_url_file_name("https://example.com/args.tar.gz?c=d#e"))  # 'args.tar.gz'

Sometimes the link redirects (that was the case for me). In that case you have to resolve the redirects first:

import requests
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
# requests.head does not follow redirects unless asked to
response = requests.head(url, allow_redirects=True)
url = response.url

Then you can continue with the best answer at the moment (Ofir's):

import os
from urllib.parse import urlparse




a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

It doesn't work with this URL, however, as the page is no longer available.

I see people using the pathlib library to parse URLs. This is not a good idea! pathlib is not designed for it; use dedicated libraries like urllib instead.

This is the most stable version I could come up with. It handles params as well as fragments:

from urllib.parse import urlparse, ParseResult


def update_filename(url):
    parsed_url = urlparse(url)
    path = parsed_url.path

    filename = path[path.rfind('/') + 1:]

    if not filename:
        return

    file, extension = filename.rsplit('.', 1)

    new_path = parsed_url.path.replace(filename, f"{file}_small.{extension}")
    parsed_url = ParseResult(**{**parsed_url._asdict(), 'path': new_path})

    return parsed_url.geturl()

Example:

assert update_filename('https://example.com/') is None
assert update_filename('https://example.com/path/to/') is None
assert update_filename('https://example.com/path/to/report.pdf') == 'https://example.com/path/to/report_small.pdf'
assert update_filename('https://example.com/path/to/filename with spaces.pdf') == 'https://example.com/path/to/filename with spaces_small.pdf'
assert update_filename('https://example.com/path/to/report_01.01.2022.pdf') == 'https://example.com/path/to/report_01.01.2022_small.pdf'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2#test') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2#test'
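Two inputs are not covered by these examples. A filename with no dot makes `rsplit('.', 1)` return a single element, so the unpacking raises ValueError. A filename that also appears earlier in the path gets replaced in both places by `str.replace`. A possible variant that handles both, offered as a sketch rather than a drop-in replacement; `update_filename_safe` and its `suffix` parameter are hypothetical names:

```python
import posixpath
from urllib.parse import urlparse


def update_filename_safe(url, suffix="_small"):
    parsed_url = urlparse(url)
    # Split only at the last "/", so earlier segments are never modified.
    directory, filename = posixpath.split(parsed_url.path)
    if not filename:
        return None
    # splitext returns an empty extension when the name has no dot.
    stem, extension = posixpath.splitext(filename)
    new_path = posixpath.join(directory, stem + suffix + extension)
    # ParseResult is a namedtuple, so _replace swaps a single field.
    return parsed_url._replace(path=new_path).geturl()


print(update_filename_safe('https://example.com/README'))
# https://example.com/README_small
print(update_filename_safe('https://example.com/a.pdf/a.pdf'))
# https://example.com/a.pdf/a_small.pdf
```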