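How can I make a scatter plot colored by density in matplotlib?

I'd like to make a scatter plot where each point is colored by the spatial density of nearby points.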

I've come across a very similar question, which shows an example of this using R:

Scatter plot: symbol color represents the number of overlapping points

What's the best way to accomplish something similar in Python using matplotlib?


You could make a 2D histogram:

import numpy as np
import matplotlib.pyplot as plt


# fake data:
a = np.random.normal(size=1000)
b = a*3 + np.random.normal(size=1000)


plt.hist2d(a, b, (50, 50), cmap=plt.cm.jet)
plt.colorbar()

[Image: 2D histogram]

In addition to hist2d or hexbin, as @askewchan suggested, you can use the same method as the accepted answer in the question you linked to.

If you want to do that:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde


# Generate fake data
x = np.random.normal(size=1000)
y = x * 3 + np.random.normal(size=1000)


# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)


fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=100)
plt.show()


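If you'd like the points to be plotted in order of density, so that the densest points are always on top (similar to the linked example), just sort them by their z-values. I'm also going to use a smaller marker size here, as it looks a bit better: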

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde


# Generate fake data
x = np.random.normal(size=1000)
y = x * 3 + np.random.normal(size=1000)


# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)


# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]


fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50)
plt.show()

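The hexbin alternative mentioned at the start of this answer is not shown anywhere in the thread, so here is a minimal sketch of it. The `gridsize=30` and `mincnt=1` values and the fixed random seed are arbitrary choices for illustration, not something taken from the answers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Fake data, same shape as in the examples above (seeded for repeatability)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x * 3 + rng.normal(size=1000)

fig, ax = plt.subplots()
# gridsize=30 is an arbitrary resolution; mincnt=1 leaves empty hexagons blank
hb = ax.hexbin(x, y, gridsize=30, mincnt=1)
fig.colorbar(hb, ax=ax, label="Points per hexagon")

if __name__ == "__main__":
    plt.show()
```

Like hist2d, this colors bins rather than individual points, so it trades the per-point look of the KDE scatter for speed.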

Also, if the number of points makes the KDE calculation too slow, the color can be interpolated from np.histogram2d instead. [Update in response to comments: if you wish to show a colorbar, use plt.scatter() instead of ax.scatter(), followed by plt.colorbar().]

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import Normalize
from scipy.interpolate import interpn


def density_scatter(x, y, ax=None, sort=True, bins=20, **kwargs):
    """
    Scatter plot colored by 2d histogram
    """
    if ax is None:
        fig, ax = plt.subplots()
    else:
        # The colorbar below needs the figure that owns the given axes
        fig = ax.figure
    data, x_e, y_e = np.histogram2d(x, y, bins=bins, density=True)
    # Interpolate the histogram (defined at the bin centers) at every data point
    z = interpn((0.5*(x_e[1:] + x_e[:-1]), 0.5*(y_e[1:] + y_e[:-1])), data, np.vstack([x, y]).T, method="splinef2d", bounds_error=False)

    # To be sure to plot all data
    z[np.where(np.isnan(z))] = 0.0

    # Sort the points by density, so that the densest points are plotted last
    if sort:
        idx = z.argsort()
        x, y, z = x[idx], y[idx], z[idx]

    ax.scatter(x, y, c=z, **kwargs)

    norm = Normalize(vmin=np.min(z), vmax=np.max(z))
    cbar = fig.colorbar(cm.ScalarMappable(norm=norm), ax=ax)
    cbar.ax.set_ylabel('Density')

    return ax


if "__main__" == __name__:

    x = np.random.normal(size=100000)
    y = x * 3 + np.random.normal(size=100000)
    density_scatter(x, y, bins=[30, 30])
    plt.show()
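Regarding the colorbar note above: another way to get a colorbar is to pass the object returned by scatter directly to colorbar, so the bar picks up the same colormap and normalization as the points. The following is only a sketch of that idea, using the small 1000-point KDE example; the seed and marker size are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Fake data (seeded for repeatability)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x * 3 + rng.normal(size=1000)

# Point density, then sort so the densest points are drawn last
xy = np.vstack([x, y])
z = gaussian_kde(xy)(xy)
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]

fig, ax = plt.subplots()
sc = ax.scatter(x, y, c=z, s=20)
# The PathCollection returned by scatter carries the cmap and norm
cbar = fig.colorbar(sc, ax=ax)
cbar.set_label("Density")

if __name__ == "__main__":
    plt.show()
```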

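Plotting more than 100k data points?

The accepted answer, which uses gaussian_kde(), will take a lot of time. On my machine, 100k rows took about 11 minutes. Here I will add two alternative methods (mpl-scatter-density and datashader) and compare them with the other answers on the same dataset.

In the following, I used a test dataset of 100k rows: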

import matplotlib.pyplot as plt
import numpy as np


# Fake data for testing
x = np.random.normal(size=100000)
y = x * 3 + np.random.normal(size=100000)

Output and computation time comparison

Below is a comparison of the different methods.
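The answer does not say how the timings were measured. One plausible way, shown here purely as an assumption, is to time the plotting call together with the actual rendering. The helper name `time_draw` is hypothetical, and the Agg backend is chosen so that no window handling is included in the measurement.

```python
import time

import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so only drawing is timed
import matplotlib.pyplot as plt


def time_draw(plot_func, x, y):
    """Hypothetical helper: seconds needed to plot and render one figure."""
    fig, ax = plt.subplots()
    start = time.perf_counter()
    plot_func(ax, x, y)
    fig.canvas.draw()  # force rendering, since plotting calls alone are lazy
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed


def using_hist2d(ax, x, y, bins=(50, 50)):
    ax.hist2d(x, y, bins)


# Same kind of fake data as in the test dataset
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x * 3 + rng.normal(size=100_000)

elapsed = time_draw(using_hist2d, x, y)
print(f"hist2d: {elapsed:.3f} s")
```

Absolute numbers depend on the machine and backend, so only the relative ordering of the methods is meaningful.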

1: mpl-scatter-density

Installation

pip install mpl-scatter-density

Example code

import mpl_scatter_density # adds projection='scatter_density'
from matplotlib.colors import LinearSegmentedColormap


# "Viridis-like" colormap with white background
white_viridis = LinearSegmentedColormap.from_list('white_viridis', [
    (0, '#ffffff'),
    (1e-20, '#440053'),
    (0.2, '#404388'),
    (0.4, '#2a788e'),
    (0.6, '#21a784'),
    (0.8, '#78d151'),
    (1, '#fde624'),
], N=256)


def using_mpl_scatter_density(fig, x, y):
    ax = fig.add_subplot(1, 1, 1, projection='scatter_density')
    density = ax.scatter_density(x, y, cmap=white_viridis)
    fig.colorbar(density, label='Number of points per pixel')


fig = plt.figure()
using_mpl_scatter_density(fig, x, y)
plt.show()

Drawing this took 0.05 seconds: [Image: using mpl-scatter-density]

And the zoomed-in view looks quite nice: [Image: zoom in mpl-scatter-density]

2: datashader

  • Datashader is an interesting project. Support for matplotlib was added in datashader 0.12.

Installation

pip install datashader

Code (see the dsshow source code for its parameter listing):


import datashader as ds
from datashader.mpl_ext import dsshow
import pandas as pd


def using_datashader(ax, x, y):

    df = pd.DataFrame(dict(x=x, y=y))
    dsartist = dsshow(
        df,
        ds.Point("x", "y"),
        ds.count(),
        vmin=0,
        vmax=35,
        norm="linear",
        aspect="auto",
        ax=ax,
    )

    plt.colorbar(dsartist)


fig, ax = plt.subplots()
using_datashader(ax, x, y)
plt.show()
  • It took 0.83 s to draw this:


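  • It is also possible to color by a third variable. The third parameter of dsshow controls the coloring. See the dsshow source code for more examples and the full parameter list.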

3: scatter_with_gaussian_kde

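from scipy.stats import gaussian_kde


def scatter_with_gaussian_kde(ax, x, y):
    # https://stackoverflow.com/a/20107592/3015186
    # Answer by Joel Kington

    xy = np.vstack([x, y])
    z = gaussian_kde(xy)(xy)

    # edgecolor='none' draws the markers without outlines
    # (an empty string is not accepted as a color by recent matplotlib versions)
    ax.scatter(x, y, c=z, s=100, edgecolor='none')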
  • It took 11 minutes to draw this: [Image: scatter_with_gaussian_kde]

4: using_hist2d

import matplotlib.pyplot as plt


def using_hist2d(ax, x, y, bins=(50, 50)):
    # https://stackoverflow.com/a/20105673/3015186
    # Answer by askewchan
    ax.hist2d(x, y, bins, cmap=plt.cm.jet)


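  • It took 0.021 s to draw this with bins=(50,50): [Image: using_hist2d_50]
  • It took 0.173 s to draw this with bins=(1000,1000): [Image: using_hist2d_1000]
  • Cons: the zoomed-in data does not look as good as with mpl-scatter-density or datashader. You also have to determine the number of bins yourself.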

[Image: zoomed in hist2d 1000bins]
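On having to pick the number of bins yourself: one possible starting point, not part of the original answers, is to let NumPy derive bin edges per axis with the Freedman-Diaconis rule and hand those edges to hist2d. That rule is designed for 1D histograms, so in 2D it may produce more bins than you want; treat the result as a first guess to adjust by eye.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same kind of fake data as in the test dataset (seeded for repeatability)
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x * 3 + rng.normal(size=100_000)

# Freedman-Diaconis bin edges, computed independently for each axis
x_edges = np.histogram_bin_edges(x, bins="fd")
y_edges = np.histogram_bin_edges(y, bins="fd")

fig, ax = plt.subplots()
counts, _, _, image = ax.hist2d(x, y, bins=[x_edges, y_edges])
fig.colorbar(image, ax=ax, label="Points per bin")
print(f"bins: ({len(x_edges) - 1}, {len(y_edges) - 1})")

if __name__ == "__main__":
    plt.show()
```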

5: density_scatter

  • The code is the same as in the answer by Guillaume.
  • It took 0.073 s to draw this with bins=(50,50): [Image: density_scatter_50bins]
  • It took 0.368 s to draw this with bins=(1000,1000): [Image: density_scatter_1000bins]