Requests-XML: XML Parsing for Humans


This library intends to make parsing XML as simple and intuitive as possible. Requests-XML is related to the amazing Requests-HTML and delivers the same quality of user experience — with support for our beloved XML documents.

When using this library you automatically get:

  • XPath Selectors, for the brave at heart.
  • Simple Search/Find for the faint at heart.
  • XML to JSON conversion thanks to xmljson.
  • Mocked user-agent (like a real web browser).
  • Connection pooling and cookie persistence.
  • The Requests experience you know and love, with magical XML parsing abilities.

Installation

$ pipenv install requests-xml
✨🍰✨

Only Python 3.6 is supported.

Tutorial & Usage

Make a GET request to nasa.gov, using Requests:

>>> from requests_xml import XMLSession
>>> session = XMLSession()

>>> r = session.get('https://www.nasa.gov/rss/dyn/lg_image_of_the_day.rss')

Grab a list of all links on the page, as-is (this only works for RSS feeds, or other feeds that happen to have link elements):

>>> r.xml.links
['http://www.nasa.gov/image-feature/from-the-earth-moon-and-beyond', 'http://www.nasa.gov/image-feature/jpl/pia21974/jupiter-s-colorful-cloud-belts', 'http://www.nasa.gov/', 'http://www.nasa.gov/image-feature/portrait-of-the-expedition-54-crew-on-the-space-station', ...]

XPath is the main supported way to query an element:

>>> item = r.xml.xpath('//item', first=True)
>>> item
<Element 'item' >

Grab an element’s text contents:

>>> print(item.text)
The Beauty of Light
http://www.nasa.gov/image-feature/the-beauty-of-light
The Soyuz MS-08 rocket is launched with Soyuz Commander Oleg Artemyev of Roscosmos and astronauts Ricky Arnold and Drew Feustel of NASA, March 21, 2018, to join the crew of the Space Station.
http://www.nasa.gov/image-feature/the-beauty-of-light
Wed, 21 Mar 2018 14:12 EDT
NASA Image of the Day

Introspect an element’s attributes:

>>> rss = r.xml.xpath('//rss', first=True)
>>> rss.attrs
{'version': '2.0', '{http://www.w3.org/XML/1998/namespace}base': 'http://www.nasa.gov/'}
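
The `{http://www.w3.org/XML/1998/namespace}base` key above is the namespace-qualified form of the `xml:base` attribute. As a rough offline illustration, the standard library's `xml.etree.ElementTree` reports namespaced attributes the same way. The sketch below does not use Requests-XML, and the `<rss>` snippet is a hand-written stand-in for the NASA feed:

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the feed's root element (an assumption, not the live feed).
doc = '<rss version="2.0" xml:base="http://www.nasa.gov/"><channel/></rss>'
root = ET.fromstring(doc)

# Namespaced attributes are keyed as "{namespace-uri}local-name".
XML_NS = "http://www.w3.org/XML/1998/namespace"
attrs = dict(root.attrib)
base = attrs["{%s}base" % XML_NS]

print(attrs)
print(base)
```

Building the key from the namespace URI, as done for `base` here, avoids typing the long qualified name by hand.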

Render out an element’s XML (note: namespaces will be applied to sub elements when grabbed):

>>> item.xml
'<item xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:media="http://search.yahoo.com/mrss/"> <title>The Beauty of Light</title>\n <link>http://www.nasa.gov/image-feature/the-beauty-of-light</link>\n <description>The Soyuz MS-08 rocket is launched with Soyuz Commander Oleg Artemyev of Roscosmos and astronauts Ricky Arnold and Drew Feustel of NASA, March 21, 2018, to join the crew of the Space Station.</description>\n <enclosure url="http://www.nasa.gov/sites/default/files/thumbnails/image/nhq201803210005.jpg" length="1267028" type="image/jpeg"/>\n <guid isPermaLink="false">http://www.nasa.gov/image-feature/the-beauty-of-light</guid>\n <pubDate>Wed, 21 Mar 2018 14:12 EDT</pubDate>\n <source url="http://www.nasa.gov/rss/dyn/lg_image_of_the_day.rss">NASA Image of the Day</source>\n</item>'

Select a list of elements within an element:

>>> item.xpath('//enclosure')[0].attrs['url']
'http://www.nasa.gov/sites/default/files/thumbnails/image/nhq201803210005.jpg'
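
To experiment with this kind of nested selection without a network connection, the same idea can be sketched with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The feed text and the `example.com` URLs below are made up for illustration, and the calls shown are ElementTree's, not Requests-XML's:

```python
import xml.etree.ElementTree as ET

# Small hand-written RSS sample (an assumption standing in for the live feed).
feed = """
<rss version="2.0">
  <channel>
    <title>NASA Image of the Day</title>
    <item>
      <title>The Beauty of Light</title>
      <enclosure url="http://example.com/light.jpg" type="image/jpeg"/>
    </item>
    <item>
      <title>Second Image</title>
      <enclosure url="http://example.com/second.jpg" type="image/jpeg"/>
    </item>
  </channel>
</rss>
"""
root = ET.fromstring(feed)

# './/item' matches <item> at any depth below the root; find() returns the first match.
first_item = root.find(".//item")

# A relative path searched from an element stays inside that element.
first_url = first_item.find("./enclosure").get("url")

# findall() returns every match; the [@url] predicate keeps elements that have a url attribute.
all_urls = [e.get("url") for e in root.findall(".//enclosure[@url]")]

print(first_url)
print(all_urls)
```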

Search for links within an element:

>>> item.links
['http://www.nasa.gov/image-feature/the-beauty-of-light']

Search for text on the page. This is useful when you want to pull out the text between specific tags without using XPath:

>>> r.xml.search('<title>{}</title>')
<Result ('NASA Image of the Day',) {}>
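
To get a feel for how a `{}` placeholder captures the text between two tags, here is a minimal standard-library approximation using regular expressions. It is not how Requests-XML implements `search`; `template_to_regex` is a hypothetical helper that handles only bare `{}` fields, and the XML string is hand-written:

```python
import re

def template_to_regex(template):
    """Turn a template with anonymous `{}` fields into a compiled regex.

    Hypothetical helper for illustration; it handles only bare `{}` fields.
    """
    literal_parts = template.split("{}")
    # Each `{}` becomes a non-greedy capture group between the escaped literal parts.
    pattern = "(.+?)".join(re.escape(part) for part in literal_parts)
    return re.compile(pattern, re.DOTALL)

xml_text = (
    "<channel><title>NASA Image of the Day</title>"
    "<item><title>The Beauty of Light</title></item></channel>"
)

regex = template_to_regex("<title>{}</title>")
first = regex.search(xml_text).groups()
every = [match.groups() for match in regex.finditer(xml_text)]

print(first)
print(every)
```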

Using PyQuery, tag selectors make it easy to grab an element, with a simple syntax for ensuring the element contains certain text. This is another easy way to grab an element without XPath:

>>> light_title = r.xml.find('title', containing='The Beauty of Light')
>>> light_title
[<Element 'title' >]

>>> light_title[0].text
'The Beauty of Light'

Note: XPath is preferred as it can allow you to get very specific with your element selection. Find is intended to be an easy way of grabbing all elements of a certain name. Find does however accept CSS selectors, and if you can get those to work with straight XML, go for it!
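
The "find by name, then filter by text" idea can be sketched with the standard library alone. `find_by_name` is a hypothetical helper, not part of Requests-XML, and the case-insensitive comparison is an assumption of this sketch:

```python
import xml.etree.ElementTree as ET

def find_by_name(root, name, containing=None):
    """Return elements called `name`, optionally keeping only those whose
    text contains `containing`.

    Hypothetical helper; case-insensitive matching is an assumption.
    """
    matches = list(root.iter(name))
    if containing is not None:
        needle = containing.lower()
        # itertext() yields the element's own text plus that of its descendants.
        matches = [el for el in matches if needle in "".join(el.itertext()).lower()]
    return matches

# Hand-written sample document (an assumption standing in for the live feed).
doc = """
<channel>
  <title>NASA Image of the Day</title>
  <item><title>The Beauty of Light</title></item>
  <item><title>Jupiter's Colorful Cloud Belts</title></item>
</channel>
"""
root = ET.fromstring(doc)

all_titles = find_by_name(root, "title")
light_titles = find_by_name(root, "title", containing="The Beauty of Light")

print(len(all_titles))
print([el.text for el in light_titles])
```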

JSON Support

Using the great xmljson package, we convert the whole XML document into a JSON representation. Six conversion conventions are available; see the xmljson About section for what they are. The default is badgerfish. To use a different convention, pass its name as a string to the .json() method.
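
For readers unfamiliar with the BadgerFish convention, the following is a simplified, standard-library sketch of its core rules: attributes become `@name` keys, text goes under `$`, and repeated child names are collected into a list. It is not xmljson's implementation; the `badgerfish` function here is a hypothetical helper that ignores namespaces and keeps every value as a string, and the sample document is made up:

```python
import json
import xml.etree.ElementTree as ET

def badgerfish(element):
    """Convert an ElementTree element to a BadgerFish-style dict.

    Simplified sketch: namespaces are ignored and values stay strings.
    """
    # Attributes become "@name" keys.
    node = {"@" + name: value for name, value in element.attrib.items()}

    # Non-empty text content is stored under "$".
    text = (element.text or "").strip()
    if text:
        node["$"] = text

    for child in element:
        converted = badgerfish(child)[child.tag]
        if child.tag in node:
            # A repeated child name is collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(converted)
        else:
            node[child.tag] = converted
    return {element.tag: node}

doc = '<feed version="2.0"><title>News</title><entry id="1"/><entry id="2"/></feed>'
result = badgerfish(ET.fromstring(doc))
print(json.dumps(result, indent=2))
```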

Using without Requests

You can also use this library without Requests:

>>> from requests_xml import XML
>>> doc = """
<employees>
    <person>
        <name value="Alice"/>
    </person>
    <person>
        <name value="Bob"/>
    </person>
</employees>
"""

>>> xml = XML(xml=doc)
>>> xml.json()
{
    "employees": [{
        "person": {
            "name": {
                "@value": "Alice"
            }
        }
    }, {
        "person": {
            "name": {
                "@value": "Bob"
            }
        }
    }]
}

License

MIT

API Documentation

Main Classes

These classes are the main interface to requests-xml:

class requests_xml.XML(*, xml: Union[str, bytes], default_encoding: str = 'utf-8') → None

An XML document, ready for parsing.

Parameters:
  • xml – XML from which to base the parsing upon (optional).
  • default_encoding – Which encoding to default to.

encoding

The encoding string to be used, extracted from the XML and XMLResponse header.

find(selector: str = '*', containing: Union[str, typing.List[str]] = None, first: bool = False, _encoding: str = None) → Union[typing.List[_ForwardRef('Element')], _ForwardRef('Element')]

Given a simple element name, returns a list of Element objects or a single one.

Parameters:
  • selector – Element name to find.
  • containing – If specified, only return elements that contain the provided text.
  • first – Whether or not to return just the first result.
  • _encoding – The encoding format.

If first is True, only returns the first Element found.

json(conversion: str = 'badgerfish') → Mapping

A JSON representation of the XML. The default is badgerfish.

Parameters:
  • conversion – Which conversion method to use.

links

All found links on page, in as-is form. Only works for Atom feeds.

lxml

lxml representation of the Element or XML.

pq

PyQuery representation of the Element or XML.

raw_xml

Bytes representation of the XML content.

search(template: str, first: bool = False) → List[_ForwardRef('Result')]

Search the Element for the given parse template.

Parameters:
  • template – The Parse template to use.

text

The text content of the Element or XML.

xml

Unicode representation of the XML content.

xpath(selector: str, *, first: bool = False, _encoding: str = None) → Union[typing.List[str], typing.List[_ForwardRef('Element')], str, _ForwardRef('Element')]

Given an XPath selector, returns a list of Element objects or a single one.

Parameters:
  • selector – XPath Selector to use.
  • first – Whether or not to return just the first result.
  • _encoding – The encoding format.

If a sub-selector is specified (e.g. //a/@href), a simple list of results is returned.

See W3School’s XPath Examples for more details.

If first is True, only returns the first Element found.

class requests_xml.Element(*, element, default_encoding: str = None) → None

An element of XML.

Parameters:
  • element – The element from which to base the parsing upon.
  • default_encoding – Which encoding to default to.

attrs

Returns a dictionary of the attributes of the Element.

encoding

The encoding string to be used, extracted from the XML and XMLResponse header.

find(selector: str = '*', containing: Union[str, typing.List[str]] = None, first: bool = False, _encoding: str = None) → Union[typing.List[_ForwardRef('Element')], _ForwardRef('Element')]

Given a simple element name, returns a list of Element objects or a single one.

Parameters:
  • selector – Element name to find.
  • containing – If specified, only return elements that contain the provided text.
  • first – Whether or not to return just the first result.
  • _encoding – The encoding format.

If first is True, only returns the first Element found.

json(conversion: str = 'badgerfish') → Mapping

A JSON representation of the XML. The default is badgerfish.

Parameters:
  • conversion – Which conversion method to use.

links

All found links on page, in as-is form. Only works for Atom feeds.

lxml

lxml representation of the Element or XML.

pq

PyQuery representation of the Element or XML.

raw_xml

Bytes representation of the XML content.

search(template: str, first: bool = False) → List[_ForwardRef('Result')]

Search the Element for the given parse template.

Parameters:
  • template – The Parse template to use.

text

The text content of the Element or XML.

xml

Unicode representation of the XML content.

xpath(selector: str, *, first: bool = False, _encoding: str = None) → Union[typing.List[str], typing.List[_ForwardRef('Element')], str, _ForwardRef('Element')]

Given an XPath selector, returns a list of Element objects or a single one.

Parameters:
  • selector – XPath Selector to use.
  • first – Whether or not to return just the first result.
  • _encoding – The encoding format.

If a sub-selector is specified (e.g. //a/@href), a simple list of results is returned.

See W3School’s XPath Examples for more details.

If first is True, only returns the first Element found.

Utility Functions

requests_xml.user_agent(style=None) → str

Returns an apparently legitimate user-agent string. If no specific style is requested, defaults to a Chrome-style User-Agent.

XML Sessions

These sessions are for making HTTP requests:

class requests_xml.XMLSession(mock_browser=True)

A consumable session, for cookie persistence and connection pooling, amongst other things.

close()

Closes all adapters and, as such, the session.

delete(url, **kwargs)

Sends a DELETE request. Returns Response object.

Parameters:
  • url – URL for the new Request object.
  • **kwargs – Optional arguments that request takes.
Return type:

requests.Response

get(url, **kwargs)

Sends a GET request. Returns Response object.

Parameters:
  • url – URL for the new Request object.
  • **kwargs – Optional arguments that request takes.
Return type:

requests.Response

get_adapter(url)

Returns the appropriate connection adapter for the given URL.

Return type: requests.adapters.BaseAdapter

get_redirect_target(resp)

Receives a Response. Returns a redirect URI or None.

head(url, **kwargs)

Sends a HEAD request. Returns Response object.

Parameters:
  • url – URL for the new Request object.
  • **kwargs – Optional arguments that request takes.
Return type:

requests.Response

merge_environment_settings(url, proxies, stream, verify, cert)

Check the environment and merge it with some settings.

Return type: dict

mount(prefix, adapter)

Registers a connection adapter to a prefix.

Adapters are sorted in descending order by prefix length.
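
The effect of ordering by prefix length is that the most specific prefix is tried first. A minimal sketch of that rule, using plain strings in place of real adapter objects, might look like this; `PrefixRegistry` and the example URLs are hypothetical, and this is an illustration rather than the session's actual code:

```python
from collections import OrderedDict

class PrefixRegistry:
    """Toy registry illustrating longest-prefix-first adapter lookup."""

    def __init__(self):
        self.adapters = OrderedDict()

    def mount(self, prefix, adapter):
        self.adapters[prefix] = adapter
        # Re-sort so that longer (more specific) prefixes come first.
        ordered = sorted(self.adapters.items(), key=lambda item: len(item[0]), reverse=True)
        self.adapters = OrderedDict(ordered)

    def get_adapter(self, url):
        # The first prefix that matches the start of the URL wins.
        for prefix, adapter in self.adapters.items():
            if url.lower().startswith(prefix.lower()):
                return adapter
        raise LookupError("No adapter mounted for %r" % url)

registry = PrefixRegistry()
registry.mount("https://", "default-adapter")
registry.mount("https://api.example.com/", "special-adapter")

specific = registry.get_adapter("https://api.example.com/v1/feed.xml")
fallback = registry.get_adapter("https://other.example.org/feed.xml")
print(specific, fallback)
```

Without the sort, the shorter `https://` prefix would match every HTTPS URL and the more specific mount would never be reached.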

options(url, **kwargs)

Sends an OPTIONS request. Returns Response object.

Parameters:
  • url – URL for the new Request object.
  • **kwargs – Optional arguments that request takes.
Return type:

requests.Response

patch(url, data=None, **kwargs)

Sends a PATCH request. Returns Response object.

Parameters:
  • url – URL for the new Request object.
  • data – (optional) Dictionary, bytes, or file-like object to send in the body of the Request.
  • **kwargs – Optional arguments that request takes.
Return type:

requests.Response

post(url, data=None, json=None, **kwargs)

Sends a POST request. Returns Response object.

Parameters:
  • url – URL for the new Request object.
  • data – (optional) Dictionary, bytes, or file-like object to send in the body of the Request.
  • json – (optional) json to send in the body of the Request.
  • **kwargs – Optional arguments that request takes.
Return type:

requests.Response

prepare_request(request)

Constructs a PreparedRequest for transmission and returns it. The PreparedRequest has settings merged from the Request instance and those of the Session.

Parameters:
  • request – Request instance to prepare with this session’s settings.
Return type: requests.PreparedRequest
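
As a rough illustration of what "settings merged from the Request instance and those of the Session" can mean for dictionary-like settings such as headers, the sketch below lets per-request values override session defaults and treats a value of `None` as "drop this default". `merge_setting` here is a hypothetical standalone helper written for this example, and the header values are made up:

```python
def merge_setting(request_setting, session_setting):
    """Merge per-request values over session defaults (hypothetical helper)."""
    if session_setting is None:
        return request_setting
    if request_setting is None:
        return session_setting

    merged = dict(session_setting)
    merged.update(request_setting)
    # A value of None on the request means "drop this session default".
    return {key: value for key, value in merged.items() if value is not None}

session_headers = {"User-Agent": "example-agent/1.0", "Accept": "application/xml"}
request_headers = {"Accept": "application/rss+xml", "User-Agent": None}

merged = merge_setting(request_headers, session_headers)
print(merged)
```
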
put(url, data=None, **kwargs)

Sends a PUT request. Returns Response object.

Parameters:
  • url – URL for the new Request object.
  • data – (optional) Dictionary, bytes, or file-like object to send in the body of the Request.
  • **kwargs – Optional arguments that request takes.
Return type:

requests.Response

rebuild_auth(prepared_request, response)

When being redirected we may want to strip authentication from the request to avoid leaking credentials. This method intelligently removes and reapplies authentication where possible to avoid credential loss.

rebuild_method(prepared_request, response)

When being redirected we may want to change the method of the request based on certain specs or browser behavior.

rebuild_proxies(prepared_request, proxies)

This method re-evaluates the proxy configuration by considering the environment variables. If we are redirected to a URL covered by NO_PROXY, we strip the proxy configuration. Otherwise, we set missing proxy keys for this URL (in case they were stripped by a previous redirect).

This method also replaces the Proxy-Authorization header where necessary.

Return type: dict

request(*args, **kwargs) → requests_xml.XMLResponse

Makes an HTTP Request, with mocked User-Agent headers. Returns an XMLResponse.

resolve_redirects(resp, req, stream=False, timeout=None, verify=True, cert=None, proxies=None, yield_requests=False, **adapter_kwargs)

Receives a Response. Returns a generator of Responses or Requests.

send(request, **kwargs)

Send a given PreparedRequest.

Return type:requests.Response
