Week notes 20 Feb

Loose tool notes

Since I last tried it a few years ago, Google Translate has added the option to download a translated PDF of an uploaded PDF (preserving the page numbers). Machine translation also seems a lot better than it used to be; the range of sources that are easy to understand just keeps expanding.

I've been trying Mermaid charts for a project, and they're pretty great. It would be nice to wrap Mermaid in something like altair_saver so it fits into my notebook workflows.

Jittering overlapping values for an Altair graph

ggplot2 in R has a better set of functions for slightly offsetting overlapping points, so you can see when a lot of points sit on the same spot (say, 0,0).

Altair in Python doesn't have a built-in equivalent, but I found this Stack Overflow answer that did most of the needed bits. I've adjusted it so the random offsets can be negative, and so the process repeats until no point is flagged as being within the threshold distance of another. I wasn't sure how some of the numpy bits were working, so I've made some of the comments more explicit.

from typing import List

from scipy.spatial.distance import pdist
import numpy as np
import pandas as pd


def jitter_df(
    df: pd.DataFrame,
    cols: List[str],
    threshold: float = 0.2,
    jitter: float = 0.1
) -> pd.DataFrame:
    """
    Stops overlap in plotted graphs by moving apart overlapped values
    in specified cols.
    extends answer from https://stackoverflow.com/a/58772101
    """

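    # work on a float copy so the caller's frame is left untouched
    # and integer columns can take fractional offsets
    df = df.copy()
    df[cols] = df[cols].astype(float)
    n = len(df)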
    while True:

        # calculate pairwise distances for the specified columns
        # pdist returns the condensed form: one value per pair (A,B),
        # without the duplicate (B,A) or the zero self-distances
        p = pdist(df[cols])

        # row (i) and column (j) indices of the upper triangle,
        # in the same order as the values in p
        i, j = np.triu_indices(n, 1)

        # Initialize a mask of False
        too_close = np.zeros(n, bool)

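        # in-place, unbuffered operation (i contains repeated indices,
        # so a plain too_close[i] |= ... could lose updates)
        # for each pair, if distance (p) is at or below threshold,
        # flag the first point of the pair (i) in the mask (too_close)
        # the second point (j) is left where it is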
        np.logical_or.at(too_close, i, p <= threshold)

        overlap_count = too_close.sum()
        if overlap_count == 0:
            # we're done, escape
            return df
        # random offset either side of 0,
        # uniform in [-jitter / 2, jitter / 2) for each column
        shape = (overlap_count, len(cols))
        offsets = (np.random.rand(*shape) * jitter) - (jitter / 2)

        # apply offset to items that are too close
        df.loc[too_close, cols] += offsets
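
To check my understanding of the numpy bits, here is a small sketch of one pass of the mask-building step on four made-up points. The points and the 0.2 threshold are just illustrative values, and I've swapped pdist for a plain numpy norm so the snippet only needs numpy; as far as I can tell it gives the same values in the same pair order.

```python
import numpy as np

# four hypothetical points: 0 and 1 nearly overlap, as do 2 and 3
pts = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [1.0, 1.1]])
n = len(pts)

# every pair (i, j) with i < j, in row-major order
i, j = np.triu_indices(n, 1)
print(i)  # [0 0 0 1 1 2]
print(j)  # [1 2 3 2 3 3]

# distance for each pair (stand-in for scipy's pdist)
p = np.linalg.norm(pts[i] - pts[j], axis=1)
close = p <= 0.2

# unbuffered OR: each point is flagged if ANY pair it leads is close
too_close = np.zeros(n, dtype=bool)
np.logical_or.at(too_close, i, close)
print(too_close)  # [ True False  True False]

# buffered version for comparison: with repeated indices in i,
# numpy makes no promise about which write survives,
# so the flag for point 0 may be lost here
naive = np.zeros(n, dtype=bool)
naive[i] |= close
print(naive)
```

Only the first point of each close pair (0 and 2 here) gets flagged, which is enough: moving one point of a pair separates both.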