Python SciPy examples

SciPy is a concise open-source library built on top of NumPy. Together with Pandas, Matplotlib, and SymPy it forms the core of the scientific Python stack. For the full list of dependencies check out the official documentation.

Modules

SciPy contains modules for:

Module             Description
scipy.cluster      Vector quantization / k-means
scipy.constants    Physical and mathematical constants
scipy.fftpack      Fourier transforms
scipy.integrate    Integration routines
scipy.interpolate  Interpolation
scipy.io           Data input and output
scipy.linalg       Linear algebra routines
scipy.ndimage      n-dimensional image package
scipy.odr          Orthogonal distance regression
scipy.optimize     Optimization
scipy.signal       Signal processing
scipy.sparse       Sparse matrices
scipy.spatial      Spatial data structures and algorithms
scipy.special      Special mathematical functions
scipy.stats        Statistics

These are also called subpackages.
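
Subpackages are not imported automatically with import scipy; you import each one explicitly. A minimal sketch with scipy.constants:

from scipy import constants

# A few of the bundled constants
print(constants.pi)      # 3.141592653589793
print(constants.c)       # speed of light in m/s
print(constants.golden)  # the golden ratio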

There is also a module called scipy.misc for various utilities that don’t have another home.

Basic Examples

Example:

from scipy import misc
import matplotlib.pyplot as plt

# Load the bundled sample image (a raccoon face)
face = misc.face()
fig, ax = plt.subplots(dpi=153)
ax.set_axis_off()
ax.imshow(face)
plt.show()

Output:

(figure: the scipy.misc raccoon face sample image)
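
As a quick extension (a sketch only), the same image can be blurred with scipy.ndimage, one of the subpackages from the table above:

from scipy import misc, ndimage
import matplotlib.pyplot as plt

face = misc.face()
# Smooth each color channel; larger sigma means more blur
blurred = ndimage.gaussian_filter(face, sigma=(5, 5, 0))
fig, ax = plt.subplots(dpi=153)
ax.set_axis_off()
ax.imshow(blurred)
plt.show()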

Advanced Examples

Fitting a curve

In this example we start from scattered points and try to fit them with a sinusoidal curve.

We define test_func, whose parameters a and b are the values the fit will recover.

x_data is an np.linspace and y_data is a sinusoid with some noise added.

We use scipy's optimize.curve_fit function, passing it the test function, the data arrays x_data and y_data, and an initial guess p0 for the two parameters.

import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt

# Noisy sinusoidal data
x_data = np.linspace(-5, 5, num=50)
y_data = 2.9 * np.sin(1.5 * x_data) + np.random.normal(size=50)

fig, ax = plt.subplots(dpi=153)

def test_func(x, a, b):
    return a * np.sin(b * x)

# Fit a and b, starting from the initial guess p0
params, params_covariance = optimize.curve_fit(test_func, x_data, y_data, p0=[2, 2])
print(params)

ax.scatter(x_data, y_data)
ax.plot(x_data, test_func(x_data, params[0], params[1]), c='r')
plt.show()

(figure: scatter points with the fitted red sine curve)

The red curve is the fitted function, shown together with the params we print.
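
curve_fit also returns the covariance matrix of the estimates (params_covariance above). As a small follow-up sketch, the one-standard-deviation errors of a and b come from its diagonal:

# Continuing the snippet above:
# one standard deviation errors on a and b
perr = np.sqrt(np.diag(params_covariance))
print(perr)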

Interpolation

Interpolation is another technique; here we work on the interval [0, 1] with 10 measurements (measured_time).

We also define a higher-resolution interpolation time, np.linspace(0, 1, 50).

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

# 10 noisy measurements of a sine on [0, 1]
measured_time = np.linspace(0, 1, 10)
noise = (np.random.random(10) * 2 - 1) * 1e-1
measures = np.sin(2 * np.pi * measured_time) + noise

fig, ax = plt.subplots(dpi=153)
ax.scatter(measured_time, measures)  # Blue dots

# Evaluate linear and cubic interpolants on a finer grid
interpolation_time = np.linspace(0, 1, 50)
linear_interp = interp1d(measured_time, measures)
linear_results = linear_interp(interpolation_time)
cubic_interp = interp1d(measured_time, measures, kind='cubic')
cubic_results = cubic_interp(interpolation_time)

ax.plot(interpolation_time, linear_results, c='r', label='linear results')
ax.plot(interpolation_time, cubic_results, c='g', label='cubic results')

ax.legend()
plt.show()

(figure: measured points with linear and cubic interpolation curves)

As you can see, we try out both linear and cubic interpolation, using the scipy.interpolate.interp1d function.
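
Because the underlying signal is a known sine, we can also sanity-check both interpolators against the noise-free curve. Continuing the snippet above (the exact numbers depend on the random noise):

# Compare the interpolants to the true signal
true_values = np.sin(2 * np.pi * interpolation_time)
print('linear max error:', np.abs(linear_results - true_values).max())
print('cubic max error: ', np.abs(cubic_results - true_values).max())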

Statistics

Here I will combine pandas data loading with the scipy.stats module.

import pandas as pd
import io
text=u'''
"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545
'''

df = pd.read_csv(io.StringIO(text),
                 sep=';',
                 na_values='.',
                 header='infer',  # the default
                 engine='python',
                 encoding='iso-8859-1')
df

Output:

   Unnamed: 0  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0           1  Female   133  132  124   118.0    64.5     816932
1           2    Male   140  150  124     NaN    72.5    1001121
2           3    Male   139  123  150   143.0    73.3    1038437
3           4    Male   133  129  128   172.0    68.8     965353
4           5  Female   137  132  134   147.0    65.0     951545

After loading the data, here is how to perform Student's t-test, one of the simplest statistical tests (part of hypothesis testing).

from scipy import stats
from pandas import plotting
# plotting.scatter_matrix(df[['Weight', 'Height', 'MRI_Count']])

# Testing the value of a population mean
r1 = stats.ttest_1samp(df['VIQ'], 0)

# Testing for a difference across populations
female_viq = df[df['Gender'] == 'Female']['VIQ']
male_viq = df[df['Gender'] == 'Male']['VIQ']
r2 = stats.ttest_ind(female_viq, male_viq)

print(r1)
print(r2)

Output:

Ttest_1sampResult(statistic=29.53444067255562, pvalue=7.825719971719742e-06)
Ttest_indResult(statistic=-0.18926408936295352, pvalue=0.8619667118907584)
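
Since FSIQ and PIQ are measured on the same individuals, a paired test is a natural follow-up. A minimal sketch with stats.ttest_rel, continuing the snippet above:

# Paired test: do FSIQ and PIQ differ within subjects?
r3 = stats.ttest_rel(df['FSIQ'], df['PIQ'])
print(r3)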

For reference, the columns of the dataset:

Column     Description
Gender     Male or Female
FSIQ       Full Scale IQ scores based on the four Wechsler (1981) subtests
VIQ        Verbal IQ scores based on the four Wechsler (1981) subtests
PIQ        Performance IQ scores based on the four Wechsler (1981) subtests
Weight     Body weight in pounds
Height     Height in inches
MRI_Count  Total pixel count from the 18 MRI scans

Scipy.stats vs. Statsmodels

Although statsmodels is not part of scipy.stats, the two work great in tandem; there are some very important functions worth mentioning here.

Statsmodels has SciPy (including scipy.stats) as a dependency.

Scipy.stats has the probability distributions and some statistical tests. It is lower-level library code, in the vein of NumPy and the rest of SciPy.
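
For example, every distribution in scipy.stats exposes the same interface (pdf, cdf, rvs, and so on). A minimal sketch with the normal distribution:

from scipy import stats

print(stats.norm.cdf(0))       # 0.5, by symmetry
print(stats.norm.ppf(0.975))   # ~1.96, the 95% two-sided quantile
print(stats.norm.rvs(size=3))  # three random samples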

Statsmodels, on the other hand, provides statistical models with a formula framework similar to R, and it works with pandas out of the box.

Statsmodels has statistical tests, plotting, and plenty of helper functions.

Example:

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

np.random.seed(13)
# Feature
x = np.linspace(-10, 10, 20)
# Target: a noisy line with intercept -5 and slope 3
y = -5 + 3 * x + 4 * np.random.normal(size=x.shape)
df = pd.DataFrame({'x': x, 'y': y})

# Fit y as a linear function of x, using an R-style formula
model = ols("y ~ x", df).fit()
print(model.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.944
Model:                            OLS   Adj. R-squared:                  0.941
Method:                 Least Squares   F-statistic:                     302.5
Date:                Sun, 26 Apr 2020   Prob (F-statistic):           1.06e-12
Time:                        21:40:26   Log-Likelihood:                -57.988
No. Observations:                  20   AIC:                             120.0
Df Residuals:                      18   BIC:                             122.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -5.5335      1.036     -5.342      0.000      -7.710      -3.357
x              2.9684      0.171     17.393      0.000       2.610       3.327
==============================================================================
Omnibus:                        0.100   Durbin-Watson:                   2.956
Prob(Omnibus):                  0.951   Jarque-Bera (JB):                0.322
Skew:                          -0.058   Prob(JB):                        0.851
Kurtosis:                       2.390   Cond. No.                         6.07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
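
The fitted coefficients can also be pulled out programmatically. As a sketch, model.params holds the intercept and slope, which land close to the true values -5 and 3 used to generate the data (cf. the coef column above):

# Continuing the snippet above: estimated intercept and slope
print(model.params)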

tags: scipy & category: python