Pandas load data

Pandas is a great package for data analysis. It is required by packages such as xarray, seaborn, pymc3, plotnine, magenta, google-colab, featuretools, fastai, and others. Here we discuss how to load data into a pandas dataframe.


Creating a dataframe

Let’s create a dataframe in pandas. Simplified, a dataframe is a table. To create a table we just need the table size.

Example:

import pandas as pd
df = pd.DataFrame(index=range(14), columns=range(7))
print(df)

Output: a 14×7 dataframe filled with NaN values.

Here we set 7 columns and 14 rows. Note that in pandas the default index starts at 0.
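
If you want to double-check the size and the default 0-based index, a minimal sketch:

import pandas as pd

df = pd.DataFrame(index=range(14), columns=range(7))
print(df.shape)   # (14, 7) -> 14 rows, 7 columns
print(df.index)   # RangeIndex(start=0, stop=14, step=1)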

To create a dataframe you may also provide the column names and row index names.

Example:

import pandas as pd 
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
print(df)
print(df.values)

Output:

  time date name
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c  NaN  NaN  NaN
[[nan nan nan]
 [nan nan nan]
 [nan nan nan]]

In the output of print(df.values) we got something that looks like a list. What is this?

Actually, if we check type(df.values) we will find it is a numpy.ndarray.
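
You can verify that yourself; a minimal check reusing the dataframe from above:

import pandas as pd

df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
print(type(df.values))  # <class 'numpy.ndarray'>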

Pandas requires the numpy, python-dateutil and pytz packages. You can confirm that with pip show pandas.

Does this mean we need a numpy array to provide data to a pandas dataframe?

No, because pandas is smart. Consider this example:

Example:


import pandas as pd

data = pd.DataFrame([[10, 11, 12, 13, 14],
                     [22, 23, 24, 25, 26],
                     [31, 32, 33, 34, 35]])
print(data)

Output:

    0   1   2   3   4
0  10  11  12  13  14
1  22  23  24  25  26
2  31  32  33  34  35

Here we haven’t provided numpy arrays; instead, the input data for the DataFrame is a list of lists.

What happens internally? Pandas will convert the list of lists to a numpy array.

Example:

import numpy as np
import pandas as pd

ll = [[10, 11, 12, 12, 14],
      [22, 23, 21, 22, 23],
      [31, 35, 32, 34, 34]]
npa = np.asarray(ll)
data = pd.DataFrame(npa)
print(data)

Another example:

data = pd.DataFrame({'Col1': [10, 11, 12, 13, 14],
                     'Col2': [22, 23, 24, 25, 26],
                     'Col3': [31, 32, 33, 34, 35]})
print(data)

Output:

   Col1  Col2  Col3
0    10    22    31
1    11    23    32
2    12    24    33
3    13    25    34
4    14    26    35

It is also possible to set the row names (the index):

Example:

data = pd.DataFrame({'Col1': [10, 11, 12, 13, 14],
                     'Col2': [22, 23, 24, 25, 26],
                     'Col3': [31, 32, 33, 34, 35]},
                    index=['row1', 'row2', 'row3', 'row4', 'row5'])
print(data)

Output:

      Col1  Col2  Col3
row1    10    22    31
row2    11    23    32
row3    12    24    33
row4    13    25    34
row5    14    26    35

Read textual dataframe

Comma separator

The first example uses the read_csv() function to read multiline text. CSV stands for Comma Separated Values; it is the format most frequently used with pandas.

Example:

import pandas as pd
import io

text=u"""asdf,333
asdf,444
asdf,555"""

df=pd.read_csv(io.StringIO(text), 
               sep=r',', 
               header=None, 
               engine='python', 
               encoding = "iso-8859-1")
print(df)

Output:

      0    1
0  asdf  333
1  asdf  444
2  asdf  555

The default read_csv separator is sep=','.
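
Since the comma is the default, the sep argument above could have been left out; a minimal sketch of the same example without it:

import pandas as pd
import io

text = u"""asdf,333
asdf,444
asdf,555"""

# sep defaults to ',' so it can be omitted
df = pd.read_csv(io.StringIO(text), header=None)
print(df)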

Space as separator

Let’s now use the same function, but with whitespace as the separator.

import pandas as pd
import io

temp=u"""
NSW     VIC
6718023   5023203
6735528 5048207
6742690  5061266
6766133     5083593
6786160 5103965"""

df=pd.read_csv(io.StringIO(temp), 
                sep=r'\s{1,}', # one or more spaces
                engine='python', 
                encoding = "iso-8859-1")
print(df)

Output:

       NSW      VIC
0  6718023  5023203
1  6735528  5048207
2  6742690  5061266
3  6766133  5083593
4  6786160  5103965

Two or more spaces as separator

The next example has NEW YORK as one of the index names; although it contains a space, it should still be treated as a single index label.

Example:

import pandas as pd
import io

s=u"""      one  two  three  four
INDIANA     0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11
"""

df =pd.read_csv(io.StringIO(s), 
                sep=r'\s{2,}', # two or more spaces
                engine='python', 
                encoding="iso-8859-1")
print(df)

The solution here is to require at least two spaces as the separator, so the single space inside NEW YORK is not treated as a delimiter.

Output:

          one  two  three  four
INDIANA     0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11

Here the dataframe index names are:

  • INDIANA
  • COLORADO
  • NEW YORK

New line as a separator

A very similar example, but now we use the newline as a separator.

Example:

from io import StringIO
import pandas as pd

text="""The lion (Panthera leo) is a species in the family Felidae.
It is a muscular, deep-chested cat with a short, rounded head.
A reduced neck and round ears, and a hairy tuft at the end of its tail. 
The lion males are larger than females.
A typical weight range of 150 to 250 kg (330 to 550 lb) for males and 120 to 182 kg (265 to 400 lb) for females."""
df =pd.read_csv(StringIO(text),
                sep=r'\n', 
                header=None, 
                engine='python', 
                encoding = "iso-8859-1", 
                names=["cname"])
print(df)
print("size:", df.size)

Output: a single-column dataframe named cname with 5 rows, one sentence per row; size: 5.

error_bad_lines=False

One special option to deal with bad lines is error_bad_lines=False.

Example:

import pandas as pd
import io

s=u"""      one  two  three  four
INDIANA     0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11
"""

df =pd.read_csv(io.StringIO(s), 
                sep=r'\s{1,}', # one or more spaces
                engine='python', 
                error_bad_lines=False,
                encoding = "iso-8859-1")
print(df)

Here the NEW YORK row fails because, with a single space as the separator, it produces an extra column. With the error_bad_lines=False option this problem is ignored and the row is skipped.

Output:

          one  two  three  four
INDIANA     0    1      2     3
COLORADO    4    5      6     7

Skipping line 4: Expected 5 fields in line 4, saw 6. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
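
Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; in recent versions the equivalent is on_bad_lines='skip'. A sketch of the same example with the newer option (assuming pandas 1.3 or later):

import pandas as pd
import io

s = u"""      one  two  three  four
INDIANA     0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11
"""

df = pd.read_csv(io.StringIO(s),
                 sep=r'\s{1,}',        # one or more whitespace characters
                 engine='python',
                 on_bad_lines='skip')  # skip rows with too many fields
print(df)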

Read dataframe from a CSV file

Another approach is to read a CSV file from disk.

Again we will use the read_csv function, which is almost identical to the read_table function. The only difference is the default separator (see the sketch after this list):

  • read_csv uses a comma as the separator
  • read_table uses a tab as the separator
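
A minimal sketch showing that the two functions give the same result when the separator matches:

import pandas as pd
import io

tsv = u"one\ttwo\tthree\n1\t2\t3\n4\t5\t6"

df1 = pd.read_table(io.StringIO(tsv))          # tab is the default separator
df2 = pd.read_csv(io.StringIO(tsv), sep='\t')  # same result with read_csv
print(df1.equals(df2))  # True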

Usually the code to read the csv file will be as short as this:

import pandas as pd 
dataframe = pd.read_csv("test.csv") 

Problems when loading csv files:

First, if the file does not exist you will get a FileNotFoundError. Another frequent problem is the UnicodeDecodeError, where you should first understand the encoding of your file. To get the encoding of a file on Linux you may run the file -i command and use that. If you think things will work with the utf-8 encoding, you may use this trick:

  • Read the text from a file
  • Decode the text with text.decode('utf-8')

Usually the utf-8 encoding should cover the non-standard characters. If this doesn’t work, try the utf-16 encoding.

To fix encoding problems you may also open the csv file in your editor and save it with a particular encoding.

In some cases you may get parsing errors. If this is the case, use the engine='python' option.
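
Putting these options together, a sketch reading the test.csv file from above with an explicit encoding and the python engine (the utf-8 encoding here is only an assumption; use the one your file actually has):

import pandas as pd

# encoding and engine here are assumptions; adjust them to your file
df = pd.read_csv("test.csv",
                 encoding="utf-8",   # or "iso-8859-1", "utf-16", ...
                 engine="python")    # more tolerant parser for problematic files
print(df.head())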

Read dataframe from remote csv file

Again, the same read_csv function works. This time we will use the requests package.

Check here: how to use the requests package.

Example:

import pandas as pd
import io
import requests
url='https://programming-review.com/wp-content/uploads/cities.csv'
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
print(c)

Output: the cities.csv data loaded into a dataframe.
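
The requests step is optional, since read_csv also accepts a URL directly (assuming the URL is reachable):

import pandas as pd

url = 'https://programming-review.com/wp-content/uploads/cities.csv'
c = pd.read_csv(url)   # pandas downloads the file itself
print(c.head())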

Read dataframe from HTML page

If there is a table (<table>) on a web page, pandas can read it with read_html. Note that this doesn’t work for divs, and it requires an HTML parser such as lxml or BeautifulSoup4 with html5lib to be installed.

Example:

import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Comparison_of_web_browsers')[0]
df

This is equivalent to:

import pandas as pd
import requests

url = 'https://en.wikipedia.org/wiki/Comparison_of_web_browsers'
html_data = requests.get(url)
df = pd.read_html(html_data.text)[0]
df

Output: the first table from the Wikipedia page comparing web browsers.
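
read_html returns a list of all tables it finds, so [0] above simply takes the first one. To narrow the list down, the match parameter keeps only tables whose text matches a string or regex; the 'Browser' string below is just an assumed example for this page:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/Comparison_of_web_browsers'
# keep only tables whose text matches 'Browser' (assumed to appear on this page)
tables = pd.read_html(url, match='Browser')
print(len(tables))
print(tables[0].head())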

tags: pandas - string - load & category: python