This sample uses the scikit-learn spectral biclustering algorithm to detect clusters of similar colors. To accelerate computation, we use Dask to execute the algorithm in parallel on the Palma HPC cluster. First, we start with the usual imports for Dask, scikit-learn and plotly.

import math
import numpy as np
import joblib
import sklearn
import sklearn.datasets
import sklearn.cluster
import sklearn.metrics
import plotly
from matplotlib import pyplot as plt

from dask_palma import PalmaCluster
from dask.distributed import Client

Then, we create the Dask cluster on Palma and start some workers. While we wait for the workers to start, we can view the progress at{username}/proxy/8787/status . Please be patient and check the state of your jobs using "squeue -u {username}" on the console.

cluster = PalmaCluster(template = dict(nativeSpecification = "--time=1:00:00 -p express,normal,broadwell,requeue --mem=2048"))
client = Client(cluster)

After the workers are started and connected to the Dask cluster, we can proceed with the main code of our sample. The following snippet is a helper function to convert colorscales.

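def matplotlib_to_plotly(cmap, pl_entries):
    # sample the matplotlib colormap at pl_entries evenly spaced points in [0, 1]
    h = 1.0 / (pl_entries - 1)
    pl_colorscale = []

    for k in range(pl_entries):
        # scale RGB to 0..255 as plain ints so that str() yields e.g. '(8, 48, 107)'
        C = [int(c) for c in np.array(cmap(k * h)[:3]) * 255]
        pl_colorscale.append([k * h, 'rgb' + str((C[0], C[1], C[2]))])

    return pl_colorscale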
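
To make the output format of this helper concrete, here is a minimal sketch that runs without matplotlib. It assumes a hypothetical stand-in colormap `toy_cmap` (a grey ramp, not a real matplotlib colormap) and a simplified re-implementation named `to_plotly_colorscale`; both names are illustrative only. The result is a list of `[position, 'rgb(r, g, b)']` pairs with positions running from 0 to 1, which is the shape plotly accepts as a colorscale.

```python
def toy_cmap(x):
    """Hypothetical stand-in for a matplotlib colormap: maps x in [0, 1] to RGBA."""
    return (x, x, x, 1.0)  # grey ramp from black to white


def to_plotly_colorscale(cmap, n_entries):
    """Sample cmap at n_entries evenly spaced positions and format them for plotly."""
    step = 1.0 / (n_entries - 1)
    scale = []
    for k in range(n_entries):
        # drop the alpha channel and scale each channel to an integer in 0..255
        r, g, b = (int(c * 255) for c in cmap(k * step)[:3])
        scale.append([k * step, "rgb({}, {}, {})".format(r, g, b)])
    return scale


colorscale = to_plotly_colorscale(toy_cmap, 3)
for position, color in colorscale:
    print(position, color)
# 0.0 rgb(0, 0, 0)
# 0.5 rgb(127, 127, 127)
# 1.0 rgb(255, 255, 255)
```

In the sample itself the helper is called with `len(data)` entries, so the colorscale has one stop per matrix row.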

Using a "with" statement together with joblib's parallel Dask backend, scikit-learn code that parallelizes its work through joblib is automatically executed on the Dask cluster with no further need for instrumentation. The sample proceeds by generating some noisy clusters, which are then shuffled and rearranged by the spectral biclustering algorithm of scikit-learn.

with joblib.parallel_backend('dask'):
    n_clusters = (4, 3)
    # generate noisy clusters
    data, rows, columns = sklearn.datasets.make_checkerboard(shape = (300, 300), n_clusters = n_clusters, noise = 10, shuffle = False, random_state = 0)
    # the colormap argument is missing from this page; plt.cm.Blues is an assumption
    colorscale = matplotlib_to_plotly(plt.cm.Blues, len(data))
    original_dataset = plotly.graph_objs.Heatmap(z = data, colorscale = colorscale, showscale = False)

    # shuffle rows and columns with a seeded generator and keep the permutations for scoring
    # (same steps as the private helper sklearn.datasets.samples_generator._shuffle,
    # whose module path is no longer available in newer scikit-learn releases)
    rng = np.random.RandomState(0)
    row_idx = rng.permutation(data.shape[0])
    col_idx = rng.permutation(data.shape[1])
    data = data[row_idx][:, col_idx]
    shuffled_dataset = plotly.graph_objs.Heatmap(z = data, colorscale = colorscale, showscale = False)

    # initialize model, fit and calculate score
    model = sklearn.cluster.SpectralBiclustering(n_clusters = n_clusters, method = 'log', random_state = 0)
    model.fit(data)
    score = sklearn.metrics.consensus_score(model.biclusters_, (rows[:, row_idx], columns[:, col_idx]))

    print("consensus score: {:.1f}".format(score))

    # sort rows and columns by cluster label so that each bicluster becomes contiguous
    fit_data = data[np.argsort(model.row_labels_)]
    fit_data = fit_data[:, np.argsort(model.column_labels_)]

    # plot clustered data and checkerboard structure
    after_biclustering = plotly.graph_objs.Heatmap(z = fit_data, colorscale = colorscale, showscale = False)
    checkerboard_structure = plotly.graph_objs.Heatmap(z = np.outer(np.sort(model.row_labels_) + 1, np.sort(model.column_labels_) + 1), colorscale = colorscale, showscale = False)

consensus score: 1.0
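
The rearrangement step is easier to follow on a tiny matrix. The following sketch uses made-up labels (`row_labels` and `col_labels` stand in for `model.row_labels_` and `model.column_labels_`, and the 4x3 matrix is hypothetical) to show how sorting by label groups each cluster into a contiguous block, and how the outer product of the sorted labels produces the checkerboard pattern. It passes `kind="stable"` to `np.argsort` only to make the toy output deterministic; the sample above uses the default sort.

```python
import numpy as np

# Hypothetical cluster labels, standing in for model.row_labels_ / model.column_labels_.
row_labels = np.array([1, 0, 1, 0])
col_labels = np.array([0, 1, 0])

# Toy matrix with distinct entries so that the reordering is easy to trace.
data = np.arange(12).reshape(4, 3)

# argsort returns the indices that bring equal labels next to each other.
row_order = np.argsort(row_labels, kind="stable")  # [1, 3, 0, 2]
col_order = np.argsort(col_labels, kind="stable")  # [0, 2, 1]

# Reorder rows first, then columns, as in the sample.
rearranged = data[row_order][:, col_order]

# Each cell gets the product of its (row label + 1) and (column label + 1),
# so every bicluster shows up as a block of constant value.
checkerboard = np.outer(np.sort(row_labels) + 1, np.sort(col_labels) + 1)

print(rearranged)
print(checkerboard)
```

Rows 1 and 3 (label 0) end up on top and columns 0 and 2 (label 0) on the left, so the top-left 2x2 block of `checkerboard` holds the value 1 and the bottom-right block the value 4.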

After all data is generated, we plot the output to see our input data, the shuffled data and the clustered data.

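# create a 2x2 grid of subplots; plotly.tools.make_subplots prints the grid layout when called
fig = plotly.tools.make_subplots(rows = 2, cols = 2, subplot_titles =
    ('Original dataset', 'Shuffled dataset', 'After biclustering: rearranged to show biclusters', 'Checkerboard structure of rearranged data'))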
fig.append_trace(original_dataset, 1, 1)
fig.append_trace(shuffled_dataset, 1, 2)
fig.append_trace(after_biclustering, 2, 1)
fig.append_trace(checkerboard_structure, 2, 2)

fig['layout'].update(height = 900)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
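
The printed grid numbers the axes row by row. As an illustration of that numbering only (this is not plotly's own code, and `subplot_axes` is a hypothetical helper), the mapping from a `(row, col)` cell to its axis pair can be sketched as follows:

```python
def subplot_axes(rows, cols):
    """Map each 1-based (row, col) cell to its axis names, numbered row by row."""
    axes = {}
    for row in range(1, rows + 1):
        for col in range(1, cols + 1):
            index = (row - 1) * cols + col  # 1, 2, 3, ... in reading order
            axes[(row, col)] = ("x{}".format(index), "y{}".format(index))
    return axes


grid = subplot_axes(2, 2)
for cell, (xaxis, yaxis) in grid.items():
    print(cell, xaxis, yaxis)
# (1, 1) x1 y1
# (1, 2) x2 y2
# (2, 1) x3 y3
# (2, 2) x4 y4
```

This matches the calls above: `fig.append_trace(after_biclustering, 2, 1)` places the trace in the cell labelled `x3,y3`.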

This page originated from the notebook spectral_biclustering.ipynb, which is attached to this page for safekeeping.