Python get the HTML content


Table of contents:

Get html content using urllib
Get html content using requests
Get a text file using requests
Get a binary file using requests
Get any file using wget
HTTP response codes

You will probably avoid urllib for getting HTTP content from a web page, since there is a newer module called requests for that. However, if you need to use urllib, here is the tip:

Example:

import urllib.request
url = "https://programming-review.com"
r = urllib.request.urlopen(url)  # returns an http.client.HTTPResponse
b = r.read()                     # the page content as raw bytes
print(b)

Output:

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="utf-8"> ... '

The urllib, urllib2 and urllib3 story: urllib2 no longer exists in Python 3; it was split into urllib.request and urllib.error. You can mostly skip plain urllib as well, since the newer urllib3 library (which requests is built on) handles HTTP better. In other words, just use requests.
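
For completeness, here is a minimal sketch of fetching the same example page with urllib3 directly (the library requests is built on); this assumes urllib3 is installed, which it is whenever requests is.

import urllib3

url = "https://programming-review.com"
http = urllib3.PoolManager()   # reuses connections across requests
resp = http.request("GET", url)
print(resp.status)             # e.g. 200
print(resp.data[:80])          # first bytes of the HTML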

Get html content using requests

First you need to install requests from the command line:

pip install requests

Then you can check:

pip show requests

Output:

Summary: Python HTTP for Humans.
Requires: urllib3, chardet, certifi, idna

These are the requests module's dependencies, and they show that it relies on urllib3 under the hood. You can also verify the installed versions from Python itself, as shown right after the table below.

package  description
urllib3  HTTP library with thread-safe connection pooling, file post, and more.
chardet  Universal encoding detector for Python 3.
certifi  Package for providing Mozilla’s CA Bundle.
idna     Internationalized Domain Names in Applications (IDNA)
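
As a quick sketch, you can confirm from Python which versions of requests and its urllib3 dependency are installed (assuming both imports succeed):

import requests
import urllib3

print(requests.__version__)   # version of requests itself
print(urllib3.__version__)    # version of the urllib3 dependency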

Example:

import requests
url = "https://programming-review.com"
try:
    r = requests.get(url)  # requests.models.Response
    d = ['apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding',
         'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines',
         'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code',
         'text', 'url']
    for p in d:
        print('[' + p + ']')
        print(getattr(r, p))
        print("---")
except requests.exceptions.RequestException as e:
    print("Error with the request:", e)

Output:

[apparent_encoding]
Windows-1252
---
[close]
<bound method Response.close of <Response [200]>>
---
[connection]
<requests.adapters.HTTPAdapter object at 0x0000013AC844D888>
---
[content]
b'<!DOCTYPE html>\n<html lang="en">...</html>\n'
---
[cookies]
<RequestsCookieJar[<Cookie __cfduid=d97d6be96c548c62b6e006d99fa31933f1580069304 for .programming-review.com/>]>
---
[elapsed]
0:00:00.166215
---
[encoding]
utf-8
---
[headers]
{'Date': 'Sun, 26 Jan 2020 20:08:24 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', ... , 'Content-Encoding': 'gzip'}
---
[history]
[]
---
[is_permanent_redirect]
False
---
[is_redirect]
False
---
[iter_content]
<bound method Response.iter_content of <Response [200]>>
---
[iter_lines]
<bound method Response.iter_lines of <Response [200]>>
---
[json]
<bound method Response.json of <Response [200]>>
---
[links]
{}
---
[next]
None
---
[ok]
True
---
[raise_for_status]
<bound method Response.raise_for_status of <Response [200]>>
---
[raw]
<urllib3.response.HTTPResponse object at 0x0000013AC845CDC8>
---
[reason]
OK
---
[request]
<PreparedRequest [GET]>
---
[status_code]
200
---
[text]
<!DOCTYPE html>
<html lang="en">
<head>
...
</html>
---
[url]
https://programming-review.com/
---

In most cases what we actually need is the requests.models.Response r.text property.

r.content.decode() is roughly equivalent to r.text (r.text decodes the bytes using the encoding requests detected).
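
A minimal sketch of that relationship, using the same example URL as above:

import requests

url = "https://programming-review.com"
r = requests.get(url)
print(type(r.content))   # <class 'bytes'>
print(type(r.text))      # <class 'str'>
# when the server declares an encoding, r.text is r.content decoded with it
print(r.text == r.content.decode(r.encoding))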

To wrap up, this is the code to get the content of a web page using requests:

import requests
url = "https://programming-review.com"
try:
    r = requests.get(url) # requests.models.Response
    print(r.text) # print(r.content) if you need bytes
except requests.exceptions.RequestException as e:
    print("Error with the request:", e)

Get a text file using requests

Here is how to download a file using requests.

Example:

import pandas as pd
import io
import requests

url = "https://programming-review.com/wp-content/uploads/cities.csv"
bytes_data = requests.get(url).content   # raw bytes of the CSV file
text = bytes_data.decode('utf-8')        # decode bytes to str

with open('cities.csv', 'w') as file:
    file.write(text)

# dataframe = pd.read_csv(io.StringIO(text))  # alternative: skip the file on disk
dataframe = pd.read_csv('cities.csv')
dataframe

Output: the loaded dataframe
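
The commented-out line in the example above hints that the file on disk is optional; here is a sketch of reading the CSV straight into pandas from memory:

import io
import pandas as pd
import requests

url = "https://programming-review.com/wp-content/uploads/cities.csv"
text = requests.get(url).text               # decoded CSV content
dataframe = pd.read_csv(io.StringIO(text))  # no file written to disk
print(dataframe.head())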

Get a binary file using requests

Example:

import requests
url="https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
r = requests.get(url)
with open('google.logo.png', 'wb') as f:
    f.write(r.content)
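
For larger binary files you may prefer streaming, so the whole file never sits in memory at once; here is a sketch using iter_content (the 8192-byte chunk size is an arbitrary choice):

import requests

url = "https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('google.logo.png', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)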

Get any file using wget

The wget module is a third-party package, so first install it from the command line:

pip install wget

Example:

import wget
url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
try:
    wget.download(url, 'google.logo.png')
except Exception as e:
    print("An exception occurred:", e)

HTTP response codes

When you make HTTP requests to get URL content you usually expect 200 OK, but it is nice to have this overview of the other response status codes you may get in some cases. A small example of checking the status code from Python follows the table.

code Short text Long text
100 Continue Request received, please continue
101 Switching Protocols Switching to new protocol; obey Upgrade header
200 OK Request fulfilled, document follows
201 Created Document created, URL follows
202 Accepted Request accepted, processing continues off-line
203 Non-Authoritative Information Request fulfilled from cache
204 No Content Request fulfilled, nothing follows
205 Reset Content Clear input form for further input.
206 Partial Content Partial content follows.
300 Multiple Choices Object has several resources – see URI list
301 Moved Permanently Object moved permanently – see URI list
302 Found Object moved temporarily – see URI list
303 See Other Object moved – see Method and URL list
304 Not Modified Document has not changed since given time
305 Use Proxy You must use proxy specified in Location to access this resource.
307 Temporary Redirect Object moved temporarily – see URI list
400 Bad Request Bad request syntax or unsupported method
401 Unauthorized No permission – see authorization schemes
402 Payment Required No payment – see charging schemes
403 Forbidden Request forbidden – authorization will not help
404 Not Found Nothing matches the given URI.
405 Method Not Allowed Specified method is invalid for this server.
406 Not Acceptable URI not available in preferred format.
407 Proxy Authentication Required You must authenticate with this proxy before proceeding.
408 Request Timeout Request timed out; try again later.
409 Conflict Request conflict.
410 Gone URI no longer exists and has been permanently removed.
411 Length Required Client must specify Content-Length.
412 Precondition Failed Precondition in headers is false.
413 Request Entity Too Large Entity is too large.
414 Request-URI Too Long URI is too long.
415 Unsupported Media Type Entity body in unsupported format.
416 Requested Range Not Satisfiable Cannot satisfy request range.
417 Expectation Failed Expect condition could not be satisfied.
500 Internal Server Error Server got itself in trouble.
501 Not Implemented Server does not support this operation.
502 Bad Gateway Invalid responses from another server/proxy.
503 Service Unavailable The server cannot process the request due to a high load.
504 Gateway Timeout The gateway server did not receive a timely response.
505 HTTP Version Not Supported Cannot fulfill request.
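
As a small sketch (using the same example URL as before), here is how you might inspect the status code from Python and react to a few of the cases above:

import requests

url = "https://programming-review.com"
r = requests.get(url, allow_redirects=False)   # keep redirects visible
if r.status_code == requests.codes.ok:         # 200
    print("OK:", r.reason)
elif r.is_redirect or r.is_permanent_redirect:
    print("Redirect", r.status_code, "to", r.headers.get('Location'))
else:
    print("Unexpected status:", r.status_code, r.reason)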

tags: html - https - get request & category: python