python - BeautifulSoup getting href

This question already has answers here:

retrieve links from web page using python and BeautifulSoup [closed] (16 answers)

I have the following soup:

<a href="some_url">next</a>
<span class="class">...</span>

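From this I want to extract the href, "some_url".

I can do it when there is only one tag, but here there are two. I can also get the text 'next', but that is not what I want.

Also, is there a good description of the API somewhere, with examples? I am using the standard documentation, but I am looking for something better organized.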

Best answer

You can use find_all as follows to find every a element that has an href attribute, and print each one:

# Python 2 (BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

soup = BeautifulSoup(html)

# BeautifulSoup 3 names this method findAll; find_all only exists in version 4.
for a in soup.findAll('a', href=True):
    print "Found the URL:", a['href']

# The output would be:
# Found the URL: some_url
# Found the URL: another_url

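# Python 3 (BeautifulSoup 4)
from bs4 import BeautifulSoup

html = '''<a href="https://some_url.com">next</a>
<span class="class">
<a href="https://some_other_url.com">another_url</a></span>'''

# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = BeautifulSoup(html, 'html.parser')

# href=True matches only <a> tags that actually carry an href attribute.
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])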

# The output would be:
# Found the URL: https://some_url.com
# Found the URL: https://some_other_url.com
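
For the exact markup in the question, where only the first link's href is wanted, a minimal sketch (assuming BeautifulSoup 4 and the built-in html.parser; the variable names first_link and href are only illustrative) could use find instead of find_all:

```python
from bs4 import BeautifulSoup

# The markup from the question: one link followed by a span.
html = '''<a href="some_url">next</a>
<span class="class">...</span>'''

soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching tag, or None if nothing matches.
first_link = soup.find('a', href=True)

# Tag.get returns None instead of raising KeyError when the attribute is missing.
href = first_link.get('href') if first_link is not None else None
print(href)  # some_url
```

Indexing with first_link['href'] works as well; get is simply the more forgiving choice when a tag might lack the attribute.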

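Note that if you are using an older version of BeautifulSoup (before version 4), this method is named findAll. In version 4, BeautifulSoup's method names were changed to be PEP 8 compliant, so you should use find_all instead.

If you want every tag that has an href, you can omit the name parameter: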

href_tags = soup.find_all(href=True)
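
As a rough illustration of what omitting the name changes, the sketch below (assuming BeautifulSoup 4 with html.parser; the sample markup and the pairs variable are made up for this example) collects the tag name alongside each href:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a <link>, an <a> with href, and an <a> without one.
html = '''<link rel="stylesheet" href="style.css">
<a href="some_url">next</a>
<a name="anchor_without_href">skip me</a>'''

soup = BeautifulSoup(html, 'html.parser')

# With no tag name given, any element carrying an href attribute matches.
href_tags = soup.find_all(href=True)

# Pair each tag's name with its href, in document order.
pairs = [(tag.name, tag['href']) for tag in href_tags]
print(pairs)  # [('link', 'style.css'), ('a', 'some_url')]
```

The second a element is skipped because it has no href, while the link element is included even though it is not an anchor.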

https://stackoverflow.com/questions/5815747/
