主成分分析(PCA)

小开

这是 numpy的工作。

下面的教程演示了如何使用 numpy的内置模块(如 mean,cov,double,cumsum,dot,linalg,array,rank)进行主成分分析。

Http://glowingpython.blogspot.sg/2011/07/principal-component-analysis-with-numpy.html

请注意，scipy在这里也有很长的解释 Https://github.com/scikit-learn/scikit-learn/blob/babe4a5d0637ca172d47e1dfdd2f6f3c3ecb28db/scikits/learn/utils/extmath.py#l105

小开

最佳答案

您可以在 matplotlib 模块中找到一个 PCA 函数:

import numpy as np
from matplotlib.mlab import PCA


data = np.array(np.random.randint(10,size=(10,3)))
results = PCA(data)

结果将存储 PCA 的各种参数。它来自 matplotlib 的 mlab 部分，这是 MATLAB 语法的兼容层

编辑: 在博客下一代遗传学上，我发现了一个很棒的演示，演示了如何使用 matplotlib mlab 模块来执行和显示 PCA，玩得开心，查看那个博客！

小开

尽管另一个答案已经被接受了，我还是发布了我的答案; 接受的答案依赖于已废弃的功能; 另外，这个已经废弃的函数是基于 奇异值分解(SVD)的，它(尽管完全有效)是计算 PCA 的两种通用技术中内存和处理器密集得多的。由于 OP 中数据数组的大小，这一点在这里尤其重要。使用基于协方差的 PCA，计算流中使用的数组只是 144 x 144，而不是 26424 x 144(原始数据数组的维数)。

下面是使用来自 SciPy的 < strong > linall 模块的 PCA 的一个简单工作实现。因为这个实现首先计算协方差矩阵，然后在这个数组上执行所有后续计算，所以它使用的内存比基于 SVD 的主成分分析要少得多。

(除了 import 语句(也就是 从麻木的进口链接作为洛杉矶)之外，还可以使用笨蛋中的 linall 模块，而不需要修改下面的代码。)

实施常设仲裁法的两个关键步骤是:

计算 协方差矩阵; 及
取这个 Cov矩阵的 八个向量和 特征值

在下面的函数中，参数 Dims _ rescaled _ data指的是 重新缩放数据矩阵中所需的维数; 这个参数的默认值只有两个维，但是下面的代码不限于两个维，但是它可以是小于原始数据数组列数的任何值。

def PCA(data, dims_rescaled_data=2): """ returns: data transformed in 2 dims/columns + regenerated original data pass in: data as 2D NumPy array """ import numpy as NP from scipy import linalg as LA m, n = data.shape # mean center the data data -= data.mean(axis=0) # calculate the covariance matrix R = NP.cov(data, rowvar=False) # calculate eigenvectors & eigenvalues of the covariance matrix # use 'eigh' rather than 'eig' since R is symmetric, # the performance gain is substantial evals, evecs = LA.eigh(R) # sort eigenvalue in decreasing order idx = NP.argsort(evals)[::-1] evecs = evecs[:,idx] # sort eigenvectors according to same index evals = evals[idx] # select the first n eigenvectors (n is desired dimension # of rescaled data array, or dims_rescaled_data) evecs = evecs[:, :dims_rescaled_data] # carry out the transformation on the data using eigenvectors # and return the re-scaled data, eigenvalues, and eigenvectors return NP.dot(evecs.T, data.T).T, evals, evecs def test_PCA(data, dims_rescaled_data=2): ''' test by attempting to recover original data array from the eigenvectors of its covariance matrix & comparing that 'recovered' array with the original data ''' _ , _ , eigenvectors = PCA(data, dim_rescaled_data=2) data_recovered = NP.dot(eigenvectors, m).T data_recovered += data_recovered.mean(axis=0) assert NP.allclose(data, data_recovered) def plot_pca(data): from matplotlib import pyplot as MPL clr1 = '#2026B2' fig = MPL.figure() ax1 = fig.add_subplot(111) data_resc, data_orig = PCA(data) ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1) MPL.show() >>> # iris, probably the most widely used reference data set in ML >>> df = "~/iris.csv" >>> data = NP.loadtxt(df, delimiter=',') >>> # remove class labels >>> data = data[:,:-1] >>> plot_pca(data)

下图是这个主成分分析函数在虹膜数据上的直观表示。正如您所看到的，二维转换将 I 类与 II 类和 III 类(但不是 II 类与 III 类，这实际上需要另一个维度)清楚地区分开来。

小开

另一个使用 numpy 的 Python PCA，和@doug 的想法相同，但是没有运行。

from numpy import array, dot, mean, std, empty, argsort from numpy.linalg import eigh, solve from numpy.random import randn from matplotlib.pyplot import subplots, show def cov(X): """ Covariance matrix note: specifically for mean-centered data note: numpy's `cov` uses N-1 as normalization """ return dot(X.T, X) / X.shape[0] # N = data.shape[1] # C = empty((N, N)) # for j in range(N): # C[j, j] = mean(data[:, j] * data[:, j]) # for k in range(j + 1, N): # C[j, k] = C[k, j] = mean(data[:, j] * data[:, k]) # return C def pca(data, pc_count = None): """ Principal component analysis using eigenvalues note: this mean-centers and auto-scales the data (in-place) """ data -= mean(data, 0) data /= std(data, 0) C = cov(data) E, V = eigh(C) key = argsort(E)[::-1][:pc_count] E, V = E[key], V[:, key] U = dot(data, V) # used to be dot(V.T, data.T).T return U, E, V """ test data """ data = array([randn(8) for k in range(150)]) data[:50, 2:4] += 5 data[50:, 2:5] += 5 """ visualize """ trans = pca(data, 3)[0] fig, (ax1, ax2) = subplots(1, 2) ax1.scatter(data[:50, 0], data[:50, 1], c = 'r') ax1.scatter(data[50:, 0], data[50:, 1], c = 'b') ax2.scatter(trans[:50, 0], trans[:50, 1], c = 'r') ax2.scatter(trans[50:, 0], trans[50:, 1], c = 'b') show()

产生的结果和短得多的一样

from sklearn.decomposition import PCA def pca2(data, pc_count = None): return PCA(n_components = 4).fit_transform(data)

根据我的理解，使用特征值(第一种方法)更适合于高维数据和更少的样本，而使用奇异值分解更适合于样本多于维数的情况。

小开

为了 def plot_pca(data):能够正常工作，有必要更换线路

data_resc, data_orig = PCA(data) ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)

还有台词

newData, data_resc, data_orig = PCA(data) ax1.plot(newData[:, 0], newData[:, 1], '.', mfc=clr1, mec=clr1)

小开

更新: matplotlib.mlab.PCA自发布2.2(2018-03-06)以来确实是不赞成。

~~库 matplotlib.mlab.PCA(在这个答案中使用)不推荐使用没有。因此，对于所有通过 Google 到达这里的人，我将发布一个使用 Python 2.7测试的完整工作示例。~~

谨慎使用下面的代码，因为它使用的是一个现已废弃的库！

from matplotlib.mlab import PCA import numpy data = numpy.array( [[3,2,5], [-2,1,6], [-1,0,4], [4,3,4], [10,-5,-6]] ) pca = PCA(data)

现在在‘ pca.Y’中是原始数据矩阵的主成分基向量。关于 PCA 对象的更多细节可以找到给你。

>>> pca.Y array([[ 0.67629162, -0.49384752, 0.14489202], [ 1.26314784, 0.60164795, 0.02858026], [ 0.64937611, 0.69057287, -0.06833576], [ 0.60697227, -0.90088738, -0.11194732], [-3.19578784, 0.10251408, 0.00681079]])

您可以使用 matplotlib.pyplot来绘制这些数据，只是为了说服自己 PCA 产生了“好”的结果。names列表只是用来注释我们的五个向量。

import matplotlib.pyplot names = [ "A", "B", "C", "D", "E" ] matplotlib.pyplot.scatter(pca.Y[:,0], pca.Y[:,1]) for label, x, y in zip(names, pca.Y[:,0], pca.Y[:,1]): matplotlib.pyplot.annotate( label, xy=(x, y), xytext=(-2, 2), textcoords='offset points', ha='right', va='bottom' ) matplotlib.pyplot.show()

看看我们的原始向量，我们会发现 data [0](“ A”)和 data [3](“ D”)与 data [1](“ B”)和 data [2](“ C”)非常相似。这反映在我们的 PCA 转换数据的2D 图中。

小开

下面是 scikit-learn 选项

方法1: 让 scikit-learn 选择主成分的 最低限度数目，以保留至少 x% (例如下面的90%)的方差。

from sklearn.datasets import load_iris from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler iris = load_iris() # mean-centers and auto-scales the data standardizedData = StandardScaler().fit_transform(iris.data) pca = PCA(.90) principalComponents = pca.fit_transform(X = standardizedData) # To get how many principal components was chosen print(pca.n_components_)

方法2: 选择主要组件的数量(在本例中，选择了2)

from sklearn.datasets import load_iris from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler iris = load_iris() standardizedData = StandardScaler().fit_transform(iris.data) pca = PCA(n_components=2) principalComponents = pca.fit_transform(X = standardizedData) # to get how much variance was retained print(pca.explained_variance_ratio_.sum())

资料来源: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

小开

我编写了一个小脚本来比较这里出现的不同的 PCA，作为一个答案:

import numpy as np from scipy.linalg import svd shape = (26424, 144) repeat = 20 pca_components = 2 data = np.array(np.random.randint(255, size=shape)).astype('float64') # data normalization # data.dot(data.T) # (U, s, Va) = svd(data, full_matrices=False) # data = data / s[0] from fbpca import diffsnorm from timeit import default_timer as timer from scipy.linalg import svd start = timer() for i in range(repeat): (U, s, Va) = svd(data, full_matrices=False) time = timer() - start err = diffsnorm(data, U, s, Va) print('svd time: %.3fms, error: %E' % (time*1000/repeat, err)) from matplotlib.mlab import PCA start = timer() _pca = PCA(data) for i in range(repeat): U = _pca.project(data) time = timer() - start err = diffsnorm(data, U, _pca.fracs, _pca.Wt) print('matplotlib PCA time: %.3fms, error: %E' % (time*1000/repeat, err)) from fbpca import pca start = timer() for i in range(repeat): (U, s, Va) = pca(data, pca_components, True) time = timer() - start err = diffsnorm(data, U, s, Va) print('facebook pca time: %.3fms, error: %E' % (time*1000/repeat, err)) from sklearn.decomposition import PCA start = timer() _pca = PCA(n_components = pca_components) _pca.fit(data) for i in range(repeat): U = _pca.transform(data) time = timer() - start err = diffsnorm(data, U, _pca.explained_variance_, _pca.components_) print('sklearn PCA time: %.3fms, error: %E' % (time*1000/repeat, err)) start = timer() for i in range(repeat): (U, s, Va) = pca_mark(data, pca_components) time = timer() - start err = diffsnorm(data, U, s, Va.T) print('pca by Mark time: %.3fms, error: %E' % (time*1000/repeat, err)) start = timer() for i in range(repeat): (U, s, Va) = pca_doug(data, pca_components) time = timer() - start err = diffsnorm(data, U, s[:pca_components], Va.T) print('pca by doug time: %.3fms, error: %E' % (time*1000/repeat, err))

Pca _ mark 是在马克的回答中。

Pca _ doug 是在 Doug 的回答中。

下面是一个示例输出(但结果在很大程度上取决于数据大小和 pca _ Component，因此我建议使用您自己的数据运行您自己的测试。此外，facebook 的 pca 针对规范化数据进行了优化，所以在这种情况下它会更快、更准确) :

svd time: 3212.228ms, error: 1.907320E-10 matplotlib PCA time: 879.210ms, error: 2.478853E+05 facebook pca time: 485.483ms, error: 1.260335E+04 sklearn PCA time: 169.832ms, error: 7.469847E+07 pca by Mark time: 293.758ms, error: 1.713129E+02 pca by doug time: 300.326ms, error: 1.707492E+02

编辑:

Fbpca 的不规范函数计算舒尔分解的光谱范数误差。

小开

除了所有其他的答案之外，这里还有一些代码来使用 sklearn和 matplotlib绘制 biplot。

import numpy as np import matplotlib.pyplot as plt from sklearn import datasets from sklearn.decomposition import PCA import pandas as pd from sklearn.preprocessing import StandardScaler iris = datasets.load_iris() X = iris.data y = iris.target #In general a good idea is to scale the data scaler = StandardScaler() scaler.fit(X) X=scaler.transform(X) pca = PCA() x_new = pca.fit_transform(X) def myplot(score,coeff,labels=None): xs = score[:,0] ys = score[:,1] n = coeff.shape[0] scalex = 1.0/(xs.max() - xs.min()) scaley = 1.0/(ys.max() - ys.min()) plt.scatter(xs * scalex,ys * scaley, c = y) for i in range(n): plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5) if labels is None: plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center') else: plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center') plt.xlim(-1,1) plt.ylim(-1,1) plt.xlabel("PC{}".format(1)) plt.ylabel("PC{}".format(2)) plt.grid() #Call the function. Use only the 2 PCs. myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :])) plt.show()

小开

此示例代码加载日本收益率曲线，并创建 PCA 组件。然后，它使用 PCA 估计给定日期的移动，并将其与实际移动进行比较。

%matplotlib inline import numpy as np import scipy as sc from scipy import stats from IPython.display import display, HTML import pandas as pd import matplotlib import matplotlib.pyplot as plt import datetime from datetime import timedelta import quandl as ql start = "2016-10-04" end = "2019-10-04" ql_data = ql.get("MOFJ/INTEREST_RATE_JAPAN", start_date = start, end_date = end).sort_index(ascending= False) eigVal_, eigVec_ = np.linalg.eig(((ql_data[:300]).diff(-1)*100).cov()) # take latest 300 data-rows and normalize to bp print('number of PCA are', len(eigVal_)) loc_ = 10 plt.plot(eigVec_[:,0], label = 'PCA1') plt.plot(eigVec_[:,1], label = 'PCA2') plt.plot(eigVec_[:,2], label = 'PCA3') plt.xticks(range(len(eigVec_[:,0])), ql_data.columns) plt.legend() plt.show() x = ql_data.diff(-1).iloc[loc_].values * 100 # set the differences x_ = x[:,np.newaxis] a1, _, _, _ = np.linalg.lstsq(eigVec_[:,0][:, np.newaxis], x_) # linear regression without intercept a2, _, _, _ = np.linalg.lstsq(eigVec_[:,1][:, np.newaxis], x_) a3, _, _, _ = np.linalg.lstsq(eigVec_[:,2][:, np.newaxis], x_) pca_mv = m1 * eigVec_[:,0] + m2 * eigVec_[:,1] + m3 * eigVec_[:,2] + c1 + c2 + c3 pca_MV = a1[0][0] * eigVec_[:,0] + a2[0][0] * eigVec_[:,1] + a3[0][0] * eigVec_[:,2] pca_mV = b1 * eigVec_[:,0] + b2 * eigVec_[:,1] + b3 * eigVec_[:,2] display(pd.DataFrame([eigVec_[:,0], eigVec_[:,1], eigVec_[:,2], x, pca_MV])) print('PCA1 regression is', a1, a2, a3) plt.plot(pca_MV) plt.title('this is with regression and no intercept') plt.plot(ql_data.diff(-1).iloc[loc_].values * 100, ) plt.title('this is with actual moves') plt.show()

小开

这可能是 PCA 能找到的最简单的答案，包括容易理解的步骤。假设我们想保留144的两个主要维度，它们提供了最多的信息。

首先，将二维数组转换为数据帧:

import pandas as pd # Here X is your array of size (26424 x 144) data = pd.DataFrame(X)

那么，有两种方法可以使用:

方法1: 手工计算

步骤1: 对 X 应用列标准化

from sklearn import preprocessing scalar = preprocessing.StandardScaler() standardized_data = scalar.fit_transform(data)

第二步: 找出原始矩阵 X 的协方差矩阵 S

sample_data = standardized_data covar_matrix = np.cov(sample_data)

步骤3: 找到 S 的特征值和特征向量(这里是2D，每个有2个)

from scipy.linalg import eigh # eigh() function will provide eigen-values and eigen-vectors for a given matrix. # eigvals=(low value, high value) takes eigen value numbers in ascending order values, vectors = eigh(covar_matrix, eigvals=(142,143)) # Converting the eigen vectors into (2,d) shape for easyness of further computations vectors = vectors.T

步骤4: 转换数据

# Projecting the original data sample on the plane formed by two principal eigen vectors by vector-vector multiplication. new_coordinates = np.matmul(vectors, sample_data.T) print(new_coordinates.T)

这个 new_coordinates.T的大小(26424x2)由两个主要部件组成。

方法2: 使用 Scikit-Learning

步骤1: 对 X 应用列标准化

from sklearn import preprocessing scalar = preprocessing.StandardScaler() standardized_data = scalar.fit_transform(data)

步骤2: 初始化 pca

from sklearn import decomposition # n_components = numbers of dimenstions you want to retain pca = decomposition.PCA(n_components=2)

步骤3: 使用 pca 来拟合数据

# This line takes care of calculating co-variance matrix, eigen values, eigen vectors and multiplying top 2 eigen vectors with data-matrix X. pca_data = pca.fit_transform(sample_data)

这个 pca_data的大小(26424x2)由两个主要部件组成。

除了所有其他的答案之外，这里还有一些代码来使用 sklearn和 matplotlib绘制 biplot。

除了所有其他的答案之外，这里还有一些代码来使用 `sklearn`和 `matplotlib`绘制 `biplot`。