Python: get the HTML content
Table of contents:
- Get html content using urllib
- Get html content using requests
- Get a text file using requests
- Get a binary file using requests
- Get any file using wget
- HTTP response codes
Get html content using urllib
You will probably avoid using urllib to get HTTP content from a web page, since there is a newer module called requests for that. However, if you need to use urllib, here is the tip:
Example:
import urllib.request

url = "https://programming-review.com"
r = urllib.request.urlopen(url)  # returns an http.client.HTTPResponse
b = r.read()                     # raw bytes of the page
print(b)
Output:
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="utf-8"> ... '
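urlopen returns the page as bytes. A minimal hedged sketch of decoding it to text, assuming the page is UTF-8 encoded (falling back to the charset the server reports):

import urllib.request

url = "https://programming-review.com"
with urllib.request.urlopen(url) as r:
    # assumption: UTF-8 unless the response headers say otherwise
    charset = r.headers.get_content_charset() or "utf-8"
    html = r.read().decode(charset)
print(html[:100])  # first 100 characters of the decoded HTML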
urllib, urllib2 and urllib3 story:
urllib2 is not present anymore; it has been split into urllib.request and urllib.error. You may forget about the old urllib, since there is a newer urllib3, which is what requests uses under the hood. In other words, just use requests.
Get html content using requests
First you need to install requests from the command line:
pip install requests
Then you can check:
pip show requests
Output:
Summary: Python HTTP for Humans.
Requires: urllib3, chardet, certifi, idna
These are the requests module dependencies, telling you it uses urllib3 under the hood.
package | description |
---|---|
urllib3 | HTTP library with thread-safe connection pooling, file post, and more. |
chardet | Universal encoding detector for Python 3. |
certifi | Package for providing Mozilla’s CA Bundle. |
idna | Internationalized Domain Names in Applications (IDNA) support. |
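If you want to confirm from Python itself which versions are installed, a quick hedged check (both packages expose __version__):

import requests
import urllib3

# print the installed versions of requests and its urllib3 dependency
print(requests.__version__)
print(urllib3.__version__)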
Example:
import requests

url = "https://programming-review.com"
try:
    r = requests.get(url)  # requests.models.Response
    d = ['apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding',
         'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines',
         'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code',
         'text', 'url']
    for p in d:
        print('[' + p + ']')
        print(getattr(r, p))
        print("---")
except requests.RequestException:
    print("Error with the request")
Output:
[apparent_encoding]
Windows-1252
---
[close]
<bound method Response.close of <Response [200]>>
---
[connection]
<requests.adapters.HTTPAdapter object at 0x0000013AC844D888>
---
[content]
b'<!DOCTYPE html>\n<html lang="en">...</html>\n'
---
[cookies]
<RequestsCookieJar[<Cookie __cfduid=d97d6be96c548c62b6e006d99fa31933f1580069304 for .programming-review.com/>]>
---
[elapsed]
0:00:00.166215
---
[encoding]
utf-8
---
[headers]
{'Date': 'Sun, 26 Jan 2020 20:08:24 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', ... , 'Content-Encoding': 'gzip'}
---
[history]
[]
---
[is_permanent_redirect]
False
---
[is_redirect]
False
---
[iter_content]
<bound method Response.iter_content of <Response [200]>>
---
[iter_lines]
<bound method Response.iter_lines of <Response [200]>>
---
[json]
<bound method Response.json of <Response [200]>>
---
[links]
{}
---
[next]
None
---
[ok]
True
---
[raise_for_status]
<bound method Response.raise_for_status of <Response [200]>>
---
[raw]
<urllib3.response.HTTPResponse object at 0x0000013AC845CDC8>
---
[reason]
OK
---
[request]
<PreparedRequest [GET]>
---
[status_code]
200
---
[text]
<!DOCTYPE html>
<html lang="en">
<head>
...
</html>
---
[url]
https://programming-review.com/
---
We usually need the requests.models.Response r.text property. r.content.decode() is roughly equivalent to r.text, since r.text is just r.content decoded with the detected encoding.
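A small hedged sketch illustrating that equivalence, reusing the same URL as above:

import requests

r = requests.get("https://programming-review.com")
print(r.encoding)                              # encoding detected from the response headers
print(r.text == r.content.decode(r.encoding))  # usually True; r.text decodes r.content with r.encoding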
To finalize, this is the code to get the content from a web page using requests:
import requests

url = "https://programming-review.com"
try:
    r = requests.get(url)  # requests.models.Response
    print(r.text)  # print(r.content) if you need bytes
except requests.RequestException:
    print("Error with the request")
Get a text file using requests
Here is how to download a file using requests.
Example:
import pandas as pd
import io
import requests

url = "https://programming-review.com/wp-content/uploads/cities.csv"
bytes_data = requests.get(url).content
text = bytes_data.decode('utf-8')
with open('cities.csv', 'w') as file:
    file.write(text)
# dataframe = pd.read_csv(io.StringIO(text))  # alternative: read directly, without the file
dataframe = pd.read_csv('cities.csv')
dataframe
Output: the cities.csv data rendered as a pandas DataFrame.
Get a binary file using requests
Example:
import requests

url = "https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
r = requests.get(url)
with open('google.logo.png', 'wb') as f:
    f.write(r.content)
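For large binary files you may prefer streaming so the whole file is not held in memory; a hedged sketch using iter_content (one of the Response methods listed above):

import requests

url = "https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
with requests.get(url, stream=True) as r:
    with open('google.logo.png', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):  # write the file in 8 KB chunks
            f.write(chunk)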
Get any file using wget
Example:
import wget

url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
try:
    wget.download(url, 'google.logo.png')
except Exception:
    print("An exception occurred")
HTTP response codes
While you create HTTP requests to get the URL content and expect 200 OK, it is nice to have an overview of the other response status codes you may get in some cases.
Code | Short text | Long text |
---|---|---|
100 | Continue | Request received, please continue |
101 | Switching Protocols | Switching to new protocol; obey Upgrade header |
200 | OK | Request fulfilled, document follows |
201 | Created | Document created, URL follows |
202 | Accepted | Request accepted, processing continues off-line |
203 | Non-Authoritative Information | Request fulfilled from cache |
204 | No Content | Request fulfilled, nothing follows |
205 | Reset Content | Clear input form for further input. |
206 | Partial Content | Partial content follows. |
300 | Multiple Choices | Object has several resources – see URI list |
301 | Moved Permanently | Object moved permanently – see URI list |
302 | Found | Object moved temporarily – see URI list |
303 | See Other | Object moved – see Method and URL list |
304 | Not Modified | Document has not changed since given time |
305 | Use Proxy | You must use proxy specified in Location to access this resource. |
307 | Temporary Redirect | Object moved temporarily – see URI list |
400 | Bad Request | Bad request syntax or unsupported method |
401 | Unauthorized | No permission – see authorization schemes |
402 | Payment Required | No payment – see charging schemes |
403 | Forbidden | Request forbidden – authorization will not help |
404 | Not Found | Nothing matches the given URI. |
405 | Method Not Allowed | Specified method is invalid for this server. |
406 | Not Acceptable | URI not available in preferred format. |
407 | Proxy Authentication Required | You must authenticate with this proxy before proceeding. |
408 | Request Timeout | Request timed out; try again later. |
409 | Conflict | Request conflict. |
410 | Gone | URI no longer exists and has been permanently removed. |
411 | Length Required | Client must specify Content-Length. |
412 | Precondition Failed | Precondition in headers is false. |
413 | Request Entity Too Large | Entity is too large. |
414 | Request-URI Too Long | URI is too long. |
415 | Unsupported Media Type | Entity body in unsupported format. |
416 | Requested Range Not Satisfiable | Cannot satisfy request range. |
417 | Expectation Failed | Expect condition could not be satisfied. |
500 | Internal Server Error | Server got itself in trouble. |
501 | Not Implemented | Server does not support this operation. |
502 | Bad Gateway | Invalid responses from another server/proxy. |
503 | Service Unavailable | The server cannot process the request due to a high load. |
504 | Gateway Timeout | The gateway server did not receive a timely response. |
505 | HTTP Version Not Supported | Cannot fulfill request. |
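The same code/phrase/description table ships with the Python standard library, so you can look entries up programmatically; a small sketch using http.HTTPStatus:

from http import HTTPStatus

# look up the short and long text for a few status codes from the table above
for code in (200, 404, 503):
    status = HTTPStatus(code)
    print(status.value, status.phrase, '-', status.description)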
…
tags: html - https - get request & category: python