Python Programming/Web
Making web requests and parsing HTML in Python is straightforward, and there are several must-have modules to help with this.
Urllib
Urllib is the built-in Python module for HTTP requests; the main article is Python Programming/Internet.
try:
    import urllib2  # Python 2
except (ModuleNotFoundError, ImportError):  # ModuleNotFoundError is 3.6+
    import urllib.request as urllib2  # Python 3: urlopen lives in urllib.request
url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read()  # content now holds all of the HTML from google.com, as bytes
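Note that u.read() returns bytes; to get a string, decode it with the page's encoding (UTF-8 is assumed here purely for illustration):
html = content.decode('utf-8')  # assuming the page is UTF-8 encoded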
Requests
| Python HTTP for Humans | |
| --- | --- |
| PyPi Link | https://pypi.python.org/pypi/requests |
| Pip command | pip install requests |
The Python requests library simplifies HTTP requests. It has a function for each of the HTTP methods (a minimal usage sketch follows the list):
- GET (requests.get)
- POST (requests.post)
- HEAD (requests.head)
- PUT (requests.put)
- DELETE (requests.delete)
- OPTIONS (requests.options)
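As a minimal sketch of the non-GET methods (httpbin.org is an echo service used here purely for illustration):
import requests

# POST a form-encoded body; the server echoes it back.
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(r.status_code)

# HEAD fetches headers only, with no response body.
r = requests.head('https://httpbin.org/get')
print(r.headers.get('Content-Type'))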
Basic request
import requests
url = 'https://www.google.com'
r = requests.get(url)
The response object
The response object returned by these functions carries many attributes and methods for retrieving data.
>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r) # dir lists every attribute and method available on the object
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
- r.content and r.text both provide the HTML content; r.content is raw bytes while r.text is decoded text, and r.text is usually preferred.
- r.encoding will display the encoding of the website.
- r.headers shows the headers returned by the website.
- r.is_redirect and r.is_permanent_redirect show whether or not the original link was a redirect.
- r.iter_content will iterate over the body as bytes. To convert the bytes to a string, decode them with the encoding in r.encoding.
- r.iter_lines is like r.iter_content, but iterates over each line of the body. It also yields bytes.
- r.json() will convert the body to a Python dict if the returned output is JSON.
- r.raw will return the underlying urllib3.response.HTTPResponse object.
- r.status_code will return the HTTP status code sent by the server. Code 200 means success, while codes such as 404 or 500 indicate errors.
- r.raise_for_status() will raise an exception if the status code indicates an error.
- r.url will return the final URL of the request.
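A minimal sketch tying a few of these together (the values in the comments are illustrative, not guaranteed):
import requests

r = requests.get('https://www.google.com')
r.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
print(r.status_code)  # e.g. 200
print(r.encoding)     # e.g. 'ISO-8859-1'
print(r.text[:60])    # first 60 characters of the decoded body
print(r.url)          # final URL after any redirects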
Authentication
Requests has built-in authentication. Here is an example with basic authentication.
import requests
r = requests.get('http://example.com', auth=requests.auth.HTTPBasicAuth('username', 'password'))
For basic authentication, you can also just pass a tuple.
import requests
r = requests.get('http://example.com', auth=('username', 'password'))
All of the other authentication types are covered in the requests documentation.
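For example, digest authentication is available through requests.auth.HTTPDigestAuth (a minimal sketch against a placeholder URL):
import requests
from requests.auth import HTTPDigestAuth

# Digest authentication; example.com is a placeholder endpoint.
r = requests.get('http://example.com', auth=HTTPDigestAuth('username', 'password'))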
Queries
Query strings in URLs pass values to the server. For example, when you make a Google search, the search URL takes the form https://www.google.com/search?q=My+Search+Here&.... Anything after the ? is the query string. Queries take the form url?name1=value1&name2=value2.... Requests has a system for building these queries automatically.
>>> import requests
>>> query = {'q':'test'}
>>> r = requests.get('https://www.google.com/search', params=query)
>>> print(r.url) #prints the final url
https://www.google.com/search?q=test
The real convenience shows when passing multiple entries.
>>> import requests
>>> query = {'name':'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params=query)
>>> print(r.url) #prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again
Not only does requests pass these values, it also percent-encodes special characters and whitespace into URL-safe form.
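A quick sketch of that encoding in action (the parameter value is invented for illustration):
>>> import requests
>>> r = requests.get('http://example.com', params={'q': 'a&b c?'})
>>> print(r.url)
http://example.com/?q=a%26b+c%3F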
BeautifulSoup4
| Screen-scraping library | |
| --- | --- |
| PyPi Link | https://pypi.python.org/pypi/beautifulsoup4 |
| Pip command | pip install beautifulsoup4 |
| Import command | import bs4 |
BeautifulSoup4 is a powerful HTML parsing library. Let's try it on some example HTML.
>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b{color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id='hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class = 'b'> This text is blue, yay yay yay!</p>
... <p class = 'b'>Check out the <a href = '#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html, 'html.parser') # naming a parser explicitly avoids a warning
>>> print(bs)
<!DOCTYPE html>
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.prettify()) # adds newlines and indentation
<!DOCTYPE html>
<html>
<head>
<title>
Testing website
</title>
<style>
.b{color: blue;}
</style>
</head>
<body>
<h1 class="b" id="hhh">
A Blue Header
</h1>
<p>
I like blue text, I like blue text...
</p>
<p class="b">
This text is blue, yay yay yay!
</p>
<p class="b">
Check out the
<a href="#hhh">
Blue Header
</a>
</p>
</body>
</html>
Getting elements
There are two ways to access elements. The first is to type the tags manually, descending in order until you reach the tag you want.
>>> print(bs.html)
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.html.body)
<body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>
However, this is inconvenient with large HTML documents. There is a function, find_all, that finds all instances of a given element. It takes an HTML tag name, such as h1 or p, and returns every matching element.
>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
This is still inconvenient on a large website because there may be thousands of matches. You can narrow the search by class or id.
>>> blue = bs.find_all('p', class_='b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
The keyword argument is spelled class_ (with a trailing underscore) because class is a reserved word in Python. You can also build your own filter, for example with a list comprehension:
>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
>>> blue = [tag for tag in p if 'b' in tag.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
This checks whether each element has a class attribute and, if so, whether the class b is among its classes. From the resulting list we can work with each element, for example by retrieving the text inside.
>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
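Elements also expose their attributes; a short sketch continuing the session above:
>>> link = blue[1].a  # the <a> tag inside the second blue paragraph
>>> link['href']
'#hhh'
>>> link.text
'Blue Header'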