小开

filename = url[url.rfind("/")+1:]
filename_small = filename.replace(".", "_small.")

Consider replacing ".jpg" rather than "." in the second line, since a dot can also appear elsewhere in the filename.
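To make the caveat concrete, here is a minimal sketch comparing the dot-replacement above with os.path.splitext, which splits only at the last dot. The helper name add_small_suffix and the example URL are made up for illustration.

```python
import os


def add_small_suffix(filename, suffix="_small"):
    """Insert `suffix` before the final extension only (hypothetical helper)."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}{suffix}{ext}"


# Illustrative URL whose filename contains several dots.
url = "http://example.com/img/report_01.01.2022.jpg"
filename = url[url.rfind("/") + 1:]

naive = filename.replace(".", "_small.")  # rewrites every dot
safe = add_small_suffix(filename)         # touches only the extension

print(naive)
print(safe)
```

The naive version inserts "_small" before every dot, while the splitext version leaves the dots inside the name alone.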

小开

Best answer

You can use urllib.parse.urlparse with os.path.basename:

import os
from urllib.parse import urlparse


url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

Your URL might contain percent-encoded characters such as %20 for a space or %E7%89%B9%E8%89%B2 for "特色". If so, decode them with unquote (or unquote_plus). You can also use pathlib.Path().name instead of os.path.basename, which makes it easy to add a suffix to the name (as the original question asks):

from pathlib import Path
from urllib.parse import urlparse, unquote


url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
url_parsed = urlparse(url)
print(unquote(url_parsed.path))  # Output: /kyle/09-09-2013 - 15-47-571378756077.jpg
file_path = Path("/home/ubuntu/Desktop/") / unquote(Path(url_parsed.path).name)
print(file_path)        # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg


# Path.with_stem() requires Python 3.9 or newer.
new_file = file_path.with_stem(file_path.stem + "_small")
print(new_file)         # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg

Also, an alternative is to use unquote(urlparse(url).path.split("/")[-1]).
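Path.with_stem() was only added in Python 3.9. On older interpreters, a rough equivalent can be built from with_name(), as sketched below. The function name small_path_for and its target_dir default are assumptions for illustration. PurePosixPath is used so the result looks the same on every platform.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, unquote


def small_path_for(url, target_dir="/home/ubuntu/Desktop"):
    """Build a local path for the URL's file with "_small" before the extension."""
    # Take the last path segment first, then percent-decode it.
    name = unquote(urlparse(url).path.split("/")[-1])
    if not name:
        return None  # the URL path ends in "/", so there is no filename
    file_path = PurePosixPath(target_dir) / name
    # with_name() replaces the final component; stem and suffix split at the last dot.
    return file_path.with_name(f"{file_path.stem}_small{file_path.suffix}")


url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
result = small_path_for(url)
print(result)
```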

小开

The question "Python split url to find image name and extension" shows how to extract the image name. To append a suffix to the name:

imageName =  '09-09-201315-47-571378756077'


new_name = '{0}_small.jpg'.format(imageName)

小开

You could just split the url by "/" and retrieve the last member of the list:

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url.split("/")[-1]
#09-09-201315-47-571378756077.jpg

Then use replace to change the ending:

small_jpg = filename.replace(".jpg", "_small.jpg")
#09-09-201315-47-571378756077_small.jpg

小开

os.path.basename(url)

Why try harder?

In [1]: os.path.basename("https://example.com/file.html")
Out[1]: 'file.html'


In [2]: os.path.basename("https://example.com/file")
Out[2]: 'file'


In [3]: os.path.basename("https://example.com/")
Out[3]: ''


In [4]: os.path.basename("https://example.com")
Out[4]: 'example.com'

Note 2020-12-20

Nobody has thus far provided a complete solution.

A URL can contain a ?[query-string] and/or a #[fragment identifier] (but only in that order).

In [1]: from os import path


In [2]: def get_filename(url):
...:     fragment_removed = url.split("#")[0]  # keep to left of first #
...:     query_string_removed = fragment_removed.split("?")[0]
...:     scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
...:     if scheme_removed.find("/") == -1:
...:         return ""
...:     return path.basename(scheme_removed)
...:


In [3]: get_filename("a.com/b")
Out[3]: 'b'


In [4]: get_filename("a.com/")
Out[4]: ''


In [5]: get_filename("https://a.com/")
Out[5]: ''


In [6]: get_filename("https://a.com/b")
Out[6]: 'b'


In [7]: get_filename("https://a.com/b?c=d#e")
Out[7]: 'b'
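One caveat worth checking: because get_filename also splits on ":", a colon inside the path itself changes the result. The sketch below reproduces the function so it runs on its own and compares it with urllib.parse.urlsplit. The example URLs are made up.

```python
import posixpath
from os import path
from urllib.parse import urlsplit


def get_filename(url):
    # Same logic as the function above, repeated so this block is self-contained.
    fragment_removed = url.split("#")[0]
    query_string_removed = fragment_removed.split("?")[0]
    scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
    if scheme_removed.find("/") == -1:
        return ""
    return path.basename(scheme_removed)


# A port number is handled, because the text after the last ":" still has a "/".
with_port = get_filename("https://a.com:8080/b.jpg")

# A colon in the path leaves only "c.jpg" after the split, so "" is returned.
colon_in_path = get_filename("https://a.com/img/b:c.jpg")

# urlsplit separates scheme and host first, so the path keeps its colon.
via_urlsplit = posixpath.basename(urlsplit("https://a.com/img/b:c.jpg").path)

print(repr(with_port), repr(colon_in_path), repr(via_urlsplit))
```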

小开

Sometimes there is a query string:

filename = url.split("/")[-1].split("?")[0]
new_filename = filename.replace(".jpg", "_small.jpg")
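The order of the splits matters when the query string itself contains a "/". A small sketch, using a made-up URL, that strips the fragment and query before taking the last path segment:

```python
# Illustrative URL whose query value contains slashes.
url = "http://example.com/photos/a.jpg?redirect=/home/index"

# Splitting on "/" first picks up the tail of the query string.
naive = url.split("/")[-1].split("?")[0]

# Dropping the fragment and query first leaves only the path to split.
fixed = url.split("#")[0].split("?")[0].split("/")[-1]

print(naive)
print(fixed)
```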

小开

Use urllib.parse.urlparse to get just the path part of the URL, and then use pathlib.Path on that path to get the filename:

from urllib.parse import urlparse
from pathlib import Path




url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true&some_more=true#and-an-anchor"
a = urlparse(url)
a.path             # '/some/long/path/a_filename.jpg'
Path(a.path).name  # 'a_filename.jpg'

小开

We can extract the filename from a URL using the ntpath module.

import ntpath
url = 'http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg'
name, ext = ntpath.splitext(ntpath.basename(url))
# 09-09-201315-47-571378756077  .jpg




print(name + '_small' + ext)
# 09-09-201315-47-571378756077_small.jpg

小开

With python3 (from 3.4 upwards) you can abuse the pathlib library in the following way:

from pathlib import Path


p = Path('http://example.com/somefile.html')
print(p.name)
# >>> 'somefile.html'


print(p.stem)
# >>> 'somefile'


print(p.suffix)
# >>> '.html'


print(f'{p.stem}-spamspam{p.suffix}')
# >>> 'somefile-spamspam.html'

❗️ WARNING

The pathlib module is NOT meant for parsing URLs; it is designed to work with filesystem paths, not with URLs. Don't use it in production code! It's a quick and dirty hack for non-critical code. The code is only provided as an example of what you can do but probably should not do. If you need to parse URLs, use urllib.parse or an alternative.
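To illustrate why this is fragile, the sketch below shows what pathlib does to a URL that has a query string. The URL is made up. PurePosixPath is used so the behaviour is the same on Windows, where Path would additionally switch to backslashes.

```python
from pathlib import PurePosixPath

p = PurePosixPath("http://example.com/somefile.html?size=large")

# The double slash after the scheme is collapsed, so the URL no longer round-trips.
print(str(p))

# The query string is treated as part of the name and of the suffix.
print(p.name)
print(p.suffix)
```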

小开

A simple version using the os package:

import os


def get_url_file_name(url):
    url = url.split("#")[0]  # drop the fragment
    url = url.split("?")[0]  # drop the query string
    return os.path.basename(url)

Examples:

print(get_url_file_name("example.com/myfile.tar.gz"))  # 'myfile.tar.gz'
print(get_url_file_name("example.com/"))  # ''
print(get_url_file_name("https://example.com/"))  # ''
print(get_url_file_name("https://example.com/hello.zip"))  # 'hello.zip'
print(get_url_file_name("https://example.com/args.tar.gz?c=d#e"))  # 'args.tar.gz'

小开

Sometimes the link you have redirects elsewhere (that was the case for me). In that case you have to resolve the redirects first:

import requests
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
response = requests.head(url, allow_redirects=True)  # HEAD requests do not follow redirects by default
url = response.url

Then you can continue with the currently best answer (Ofir's):

import os
from urllib.parse import urlparse




a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

It doesn't work with this URL, however, as the page is no longer available.

小开

I see people using the pathlib library to parse URLs. This is not a good idea! pathlib is not designed for it; use a dedicated library such as urllib instead.

This is the most stable version I could come up with. It handles params as well as fragments:

from urllib.parse import urlparse, ParseResult


def update_filename(url):
    parsed_url = urlparse(url)
    path = parsed_url.path

    # Everything after the last "/" is the filename.
    filename = path[path.rfind('/') + 1:]

    if not filename:
        return

    # Split on the last dot only, so dots inside the name are preserved.
    file, extension = filename.rsplit('.', 1)

    new_path = parsed_url.path.replace(filename, f"{file}_small.{extension}")
    parsed_url = ParseResult(**{**parsed_url._asdict(), 'path': new_path})

    return parsed_url.geturl()

Example:

assert update_filename('https://example.com/') is None
assert update_filename('https://example.com/path/to/') is None
assert update_filename('https://example.com/path/to/report.pdf') == 'https://example.com/path/to/report_small.pdf'
assert update_filename('https://example.com/path/to/filename with spaces.pdf') == 'https://example.com/path/to/filename with spaces_small.pdf'
assert update_filename('https://example.com/path/to/report_01.01.2022.pdf') == 'https://example.com/path/to/report_01.01.2022_small.pdf'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2#test') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2#test'
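Two inputs are not covered by these examples: a filename without any dot (rsplit then returns a single item and the unpacking fails), and a path in which the filename text also appears as a directory name (str.replace rewrites both). A possible variant that handles them is sketched below, assuming posixpath is acceptable for URL paths. The name add_suffix_to_url and the test URLs are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def add_suffix_to_url(url, suffix="_small"):
    """Return `url` with `suffix` inserted before the file extension, or None."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    if not filename:
        return None  # path is empty or ends in "/"
    # splitext returns an empty extension when the name has no dot.
    stem, ext = posixpath.splitext(filename)
    new_path = posixpath.join(directory, f"{stem}{suffix}{ext}")
    # _replace swaps only the path; query and fragment are kept as they were.
    return parts._replace(path=new_path).geturl()


print(add_suffix_to_url("https://example.com/path/to/README"))
print(add_suffix_to_url("https://example.com/a.pdf/a.pdf?x=1#t"))
```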

How do I extract the filename from a URL and append a word to it?
