
How to Extract URLs from a Sitemap

Using External Libraries

The easiest way to obtain URLs from a sitemap is to download the sitemap XML file with the requests library and parse it with the lxml library. Example:

Python
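import requests
import lxml.etree

# Download the sitemap; the URL must be passed as a string.
response = requests.get("https://example.com/sitemap.xml", timeout=10)
response.raise_for_status()

# Parse the XML and select every <loc> inside a <url>, using the sitemap namespace.
sitemap_tree = lxml.etree.fromstring(response.content)
sitemap_urls = sitemap_tree.xpath(
    "//ns:url/ns:loc/text()",
    namespaces={"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"},
)

for url in sitemap_urls:
    print(url.strip())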

This will print all the URLs from the sitemap:

https://example.com
https://example.com/page1
https://example.com/page2
...
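
If installing lxml is not an option, the same namespace-aware lookup can be sketched with the standard library's xml.etree.ElementTree instead. This is a minimal sketch rather than the approach shown above: the helper name extract_urls and the inline SAMPLE document are hypothetical and exist only so the example runs without network access.

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the sitemap protocol (same one used in the lxml example).
SITEMAP_NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: bytes) -> list[str]:
    """Return the text of every <loc> found directly inside a <url> element.

    Hypothetical helper for illustration; not part of any library.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("ns:url/ns:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc> elements
    ]


# Inline sample sitemap (an assumption for the demo, not real data).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com</loc></url>
  <url><loc> https://example.com/page1 </loc></url>
</urlset>"""

print(extract_urls(SAMPLE))
```

In practice, the bytes passed to extract_urls would be the body of the downloaded sitemap, for example response.content from requests. Without the namespace mapping, the lookup would match nothing, because every element in a sitemap belongs to the sitemap namespace.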

Using Existing Methods

Alternatively, you can simplify the process and use the existing methods in IndexNow for Python to retrieve and parse the sitemap XML file:

Python
from index_now.sitemap.get import get_sitemap_xml
from index_now.sitemap.parse import parse_sitemap_xml_and_get_urls

# The sitemap URL must be passed as a string.
sitemap_content = get_sitemap_xml("https://example.com/sitemap.xml")
urls = parse_sitemap_xml_and_get_urls(sitemap_content)

for url in urls:
    print(url)

The output is the same as before:

https://example.com
https://example.com/page1
https://example.com/page2
...