DataFrame: My redundant .js pandas clone

What I've done here is waste a bunch of time... also, I've created a JavaScript implementation of a pandas-like DataFrame for data manipulation and analysis. And yes, there are already pretty good implementations of this, like Arquero. However:

  1. I wanted to implement my own minimal version, and nothing teaches you like implementing your own stuff.
  2. I didn't think Arquero was Python-esque enough for me.
  3. I don't expect anyone else to use this stuff except me.

With Claude and OpenAI these days, it's easier than ever to learn, which is exactly why I did this. I'm really starting to love TypeScript (save for the analysis part), so this implementation aims to provide a pandas-like experience in JavaScript, making data manipulation and analysis more intuitive for Python developers like me.


Table of Contents

  1. Quickstart
    1. Import and Create
    2. Basic Inspection
    3. Selection & Filtering
    4. Statistical Analysis
    5. Group Operations
  2. Key Methods Reference
    1. Basic Operations
    2. Data Access & Transformation
    3. Statistics
    4. Grouping
    5. Iteration Methods
  3. Advanced Features
    1. Performance Comparison
    2. Concurrent Processing
    3. Correlation Plot

Quickstart

Import and Create

import { DataFrame } from "./components/DataFrame.js";
const nations = await FileAttachment("./data/nations.csv").csv({ typed: true });
const df = new DataFrame(nations);

Basic Inspection

view(df);
view(df.table());
view(df.describe().table());

The first two calls display the data and the table in its entirety. The last one runs the describe function and turns the result into a table.


Selection & Filtering

const highIncome = df
  .query("row.income > 20000")
  .sort_values("population", false)
  .head(5);

view(highIncome.table());

Cool. That's great, but there is more than one entry per country, as it's a time series. Let's do a little pythonic manipulation to get just the latest population for each country: first in a basic pythonic way, and then in an even more pythonic way.

Here's the basic way:

const highIncomeUnique = [];
for (const [key, group] of df.groupby("name", { iterable: true })) {
  const data = group.dropna({ axis: 0, subset: ["population"] });
  const lastYear = d3.max(data._data.year);
  const lastYearData = data.query(`row.year === ${lastYear}`);
  highIncomeUnique.push({
    ...key,
    year: lastYear,
    income: lastYearData._data.income[0],
    population: lastYearData._data.population[0],
    region: lastYearData._data.region[0],
  });
}
const highIncomeUniquedf = new DataFrame(highIncomeUnique).sort_values(
  "population",
  false
);
view(highIncomeUniquedf.table());

Here’s an agg version, because it's way nicer to write.

const highIncomeUnique2 = df
  .dropna({ axis: 0, subset: ["population"] })
  .groupby(["name"])
  .agg({
    year: ["last"],
    income: ["last"],
    population: ["last"],
    region: ["last"],
  })
  .query("row.income > 350")
  .sort_values("population", false)
  .head(20);

view(highIncomeUnique2.table());
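To make the "last value per group" idea concrete without the library, here is a minimal plain-JavaScript sketch. It is not DataFrame.js code: the helper name `lastPerGroup` and the sample rows are made up, and it assumes that "last" means the last row in row order, so the result only matches "latest year" if the rows are already sorted by year.

```javascript
// Hypothetical helper, not part of DataFrame.js.
// Keeps the last row (in row order) per key, skipping rows where
// `requiredColumn` is missing, which mimics dropna followed by agg "last".
function lastPerGroup(rows, key, requiredColumn) {
  const latest = new Map();
  for (const row of rows) {
    const value = row[requiredColumn];
    if (value === null || value === undefined || Number.isNaN(value)) continue;
    latest.set(row[key], row); // later rows overwrite earlier ones
  }
  return [...latest.values()];
}

// Made-up sample data, sorted by year within each country.
const series = [
  { name: "X", year: 2000, population: 10 },
  { name: "Y", year: 2000, population: 7 },
  { name: "X", year: 2001, population: 11 },
  { name: "Y", year: 2001, population: null },
];

const latestRows = lastPerGroup(series, "name", "population");
console.log(latestRows);
```

Country Y falls back to its 2000 row because its 2001 population is missing, which is the same reason the examples above call dropna before grouping.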

Statistical Analysis

view(df.corr("income", "population"));

const width_plot = 600;
view(
  df.corrPlot({
    width: width_plot,
    height: (width_plot * 240) / 300,
    marginTop: 100,
  })
);
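For reference, this is roughly what a correlation between two columns boils down to. The sketch assumes corr computes the Pearson coefficient (the pandas default); the page doesn't actually say which method it uses, and `pearson` is a made-up standalone function, not the library's implementation.

```javascript
// Hypothetical standalone Pearson correlation over two equal-length arrays.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) return NaN;
  const n = xs.length;
  const meanX = xs.reduce((sum, v) => sum + v, 0) / n;
  const meanY = ys.reduce((sum, v) => sum + v, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  // The 1/n factors cancel, so raw sums are enough.
  return cov / Math.sqrt(varX * varY);
}

const rising = pearson([1, 2, 3, 4], [2, 4, 6, 8]);
const falling = pearson([1, 2, 3], [3, 2, 1]);
console.log(rising, falling);
```

A correlation matrix is then just this function applied to every pair of numeric columns.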

Group Operations

const incomeSummary = df.groupby(["name", "region"]).agg({
  population: ["min", "mean", "max"],
  lifeExpectancy: ["mean", "max"],
});

view(incomeSummary.table());
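If you want to see what a multi-key groupby with min/mean/max does under the hood, here is a small sketch on plain row objects. Everything in it is an assumption for illustration: the function `groupAgg`, the composite-key trick, and the `population_min` style output names are mine, not necessarily what DataFrame.js produces.

```javascript
// Hypothetical helper: group rows by several key columns and compute
// min, mean and max of one numeric column in a single pass.
function groupAgg(rows, keys, column) {
  const groups = new Map();
  for (const row of rows) {
    // A JSON string of the key values serves as a composite group id.
    const id = JSON.stringify(keys.map((k) => row[k]));
    let group = groups.get(id);
    if (!group) {
      group = {
        key: Object.fromEntries(keys.map((k) => [k, row[k]])),
        min: Infinity,
        max: -Infinity,
        sum: 0,
        count: 0,
      };
      groups.set(id, group);
    }
    const value = row[column];
    group.min = Math.min(group.min, value);
    group.max = Math.max(group.max, value);
    group.sum += value;
    group.count += 1;
  }
  return [...groups.values()].map((g) => ({
    ...g.key,
    [`${column}_min`]: g.min,
    [`${column}_mean`]: g.sum / g.count,
    [`${column}_max`]: g.max,
  }));
}

// Made-up sample rows.
const groupRows = [
  { name: "X", region: "R1", population: 10 },
  { name: "X", region: "R1", population: 20 },
  { name: "Y", region: "R2", population: 5 },
];
const summary = groupAgg(groupRows, ["name", "region"], "population");
console.log(summary);
```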

Key Methods Reference

Basic Operations

df.head(n);
df.tail(n);
df.print();
df.table();

Data Access & Transformation

df.select(["colA", "colB"]);
df.query("row.colA > 100");
df.sort_values("colA", true);
df.fillna(0);
df.apply("colA", (x) => x * 2);
df.map("colA", { oldValue: "newValue" });
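Here is a rough plain-JavaScript picture of what query and sort_values do. It is a sketch under two assumptions the page doesn't confirm: that the query string is evaluated as an expression with `row` in scope, and that the boolean passed to sort_values means "ascending" (so `false` sorts largest first, like `ascending=False` in pandas). The helpers `queryRows` and `sortValues` are made up.

```javascript
// Hypothetical helpers, not part of DataFrame.js.
// Compile an expression string such as "row.income > 20000" into a predicate.
function queryRows(rows, expr) {
  const predicate = new Function("row", `return (${expr});`);
  return rows.filter((row) => predicate(row));
}

// Sort a copy of the rows by a numeric column; ascending = false puts the largest first.
function sortValues(rows, column, ascending = true) {
  const sign = ascending ? 1 : -1;
  return [...rows].sort((a, b) => sign * (a[column] - b[column]));
}

// Made-up sample rows.
const sampleRows = [
  { name: "A", income: 25000, population: 5 },
  { name: "B", income: 1200, population: 90 },
  { name: "C", income: 41000, population: 30 },
  { name: "D", income: 30000, population: 12 },
];

const filtered = queryRows(sampleRows, "row.income > 20000");
const topRows = sortValues(filtered, "population", false).slice(0, 2);
console.log(topRows.map((r) => r.name));
```

Evaluating strings as code is convenient in a notebook, but it should only ever see expressions you wrote yourself.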

Statistics

df.describe();
df.corr("colA", "colB");
df.corrMatrix();
df.corrPlot({ width: 500, height: 500 });

Grouping

df.groupby(["colA"]).agg({ colB: ["min", "max", "mean"] });
df.concurrentGroupBy(["colA", "colC"], { colB: ["min", "mean"] });

Iteration Methods

for (const [idx, row] of df.iterrows()) {
  /* ... */
}
for (const tuple of df.itertuples("Record")) {
  /* ... */
}
for (const [col, values] of df.items()) {
  /* ... */
}
for (const row of df) {
  /* ... */
}
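The four loops above map nicely onto JavaScript generators. The sketch below shows one way this could be wired up over column-oriented storage. It assumes `_data` holds one array per column (which is how the Quickstart code reads it), and the class `TinyFrame` is a toy written for this example, not the real DataFrame.

```javascript
// Toy column-oriented frame, only to illustrate the iteration protocols.
class TinyFrame {
  constructor(columns) {
    this._data = columns; // e.g. { name: [...], year: [...] }
  }

  get length() {
    const first = Object.values(this._data)[0];
    return first ? first.length : 0;
  }

  // Build a row object from the i-th entry of every column.
  rowAt(i) {
    const row = {};
    for (const [col, values] of Object.entries(this._data)) row[col] = values[i];
    return row;
  }

  // Yields [index, row] pairs, like pandas iterrows.
  *iterrows() {
    for (let i = 0; i < this.length; i++) yield [i, this.rowAt(i)];
  }

  // Yields [columnName, values] pairs, like pandas items.
  *items() {
    yield* Object.entries(this._data);
  }

  // Makes `for (const row of frame)` work.
  *[Symbol.iterator]() {
    for (let i = 0; i < this.length; i++) yield this.rowAt(i);
  }
}

const tiny = new TinyFrame({ name: ["a", "b"], year: [2000, 2001] });
for (const [idx, row] of tiny.iterrows()) console.log(idx, row.name, row.year);
```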

describe()

Generate descriptive statistics for all columns. Returns a DataFrame containing statistics like count, mean, std, min/max, and percentiles for numeric columns, and category distributions for non-numeric columns.

// Get basic statistics
const stats = df.describe();

// Print the statistics in a formatted way
stats.print();

// Access category distributions for non-numeric columns
const categories = stats._data.categories;

// Render as an HTML table using Observable's Inputs.table
Inputs.table(stats.print());

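The describe method provides different statistics based on column type: count, mean, std, min/max, and percentiles for numeric columns, and category distributions for non-numeric columns.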

percentile(p[, columns])

Calculate percentile value(s) for numeric columns.

// 75th percentile of age
const p75 = df.percentile(0.75, "age");

// Multiple columns
const quartiles = df.percentile(0.25, ["age", "salary"]);
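The page doesn't say how percentile interpolates between data points. Assuming it follows the pandas default (linear interpolation, with p given as a fraction between 0 and 1 as in the calls above), the calculation for a single column looks roughly like this. The function `percentileOf` is a made-up standalone helper.

```javascript
// Hypothetical standalone percentile with linear interpolation.
// p is a fraction between 0 and 1; non-finite values are ignored.
function percentileOf(values, p) {
  const sorted = values.filter((v) => Number.isFinite(v)).sort((a, b) => a - b);
  if (sorted.length === 0) return NaN;
  const position = p * (sorted.length - 1); // fractional index into the sorted data
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  // Blend the two neighbouring values by the fractional part.
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
}

const median = percentileOf([5, 1, 3, 2, 4], 0.5);
const q1 = percentileOf([10, 20, 30, 40], 0.25);
console.log(median, q1);
```

With four values, the 25th percentile lands three quarters of the way between the first and second value, which is why it isn't one of the data points.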

Data Manipulation

fillna(value)

Fill missing values in the DataFrame.

const filled = df.fillna(0);

apply(column, func)

Apply a function to a column.

const df2 = df.apply("name", (name) => name.toUpperCase());

map(column, mapper)

Map values in a column using a mapping object.

const df2 = df.map("status", {
  A: "Active",
  I: "Inactive",
});

drop() & dropna()

Remove specified columns (drop) or rows with missing values in the given columns (dropna).

const df2 = df.drop(["temp_col", "unused_col"]);
const df3 = df.dropna({ axis: 0, subset: ["population"] });
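The difference between the two is easy to see on plain row objects: drop works on columns, dropna works on rows. This is a sketch with made-up helpers (`dropColumns`, `dropnaRows`), and it assumes "missing" means null, undefined or NaN, which may not match the library's exact rule.

```javascript
// Hypothetical helpers, not part of DataFrame.js.
const isMissing = (v) => v === null || v === undefined || Number.isNaN(v);

// Remove the named columns from every row.
function dropColumns(rows, columns) {
  return rows.map((row) =>
    Object.fromEntries(Object.entries(row).filter(([col]) => !columns.includes(col)))
  );
}

// Keep only rows where every column in `subset` has a value.
function dropnaRows(rows, subset) {
  return rows.filter((row) => subset.every((col) => !isMissing(row[col])));
}

// Made-up sample rows.
const rawRows = [
  { name: "X", population: 10, temp_col: 1 },
  { name: "Y", population: null, temp_col: 2 },
  { name: "Z", population: NaN, temp_col: 3 },
];

const withoutTemp = dropColumns(rawRows, ["temp_col"]);
const withPopulation = dropnaRows(rawRows, ["population"]);
console.log(withoutTemp, withPopulation);
```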

rename(columnMap)

Rename columns using a mapping object.

const df2 = df.rename({
  old_name: "new_name",
  prev_col: "next_col",
});

assign(columnName, values)

Add a new column.

// Add column with array
const df2 = df.assign("new_col", [1, 2, 3]);

// Add column with function
const df3 = df.assign("bmi", (row) => row.weight / (row.height * row.height));
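Both forms of assign reduce to "compute one value per row and attach it under a new name". A minimal sketch on row objects follows. The helper `assignColumn` and the length check are my own assumptions, not documented behavior of the library.

```javascript
// Hypothetical helper: add a column from an array of values or a per-row function.
function assignColumn(rows, name, valuesOrFn) {
  const isFn = typeof valuesOrFn === "function";
  if (!isFn && valuesOrFn.length !== rows.length) {
    throw new RangeError("values must have one entry per row");
  }
  // Return new row objects so the input is left untouched.
  return rows.map((row, i) => ({
    ...row,
    [name]: isFn ? valuesOrFn(row) : valuesOrFn[i],
  }));
}

// Made-up sample rows (weight in kg, height in m).
const people = [
  { weight: 80, height: 2 },
  { weight: 45, height: 1.5 },
];

const withIds = assignColumn(people, "id", [1, 2]);
const withBmi = assignColumn(people, "bmi", (row) => row.weight / (row.height * row.height));
console.log(withIds, withBmi);
```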

merge(other[, options])

Merge two DataFrames.

const merged = df1.merge(df2, {
  on: "id", // Join key when both frames use the same column name
  how: "inner", // Join type
  left_on: "id_1", // Join key in the left frame when the names differ
  right_on: "id_2", // Join key in the right frame when the names differ
});
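For intuition, here is a small hash-join sketch that accepts the same option names. It is an illustration only: `mergeRows` is a made-up function, it supports just "inner" and "left", and letting left-hand values win on column name clashes is my choice, not necessarily the library's.

```javascript
// Hypothetical hash join over two arrays of row objects.
function mergeRows(left, right, { on, left_on = on, right_on = on, how = "inner" } = {}) {
  // Index the right rows by their join key.
  const index = new Map();
  for (const r of right) {
    const bucket = index.get(r[right_on]) ?? [];
    bucket.push(r);
    index.set(r[right_on], bucket);
  }
  const out = [];
  for (const l of left) {
    const matches = index.get(l[left_on]) ?? [];
    for (const r of matches) out.push({ ...r, ...l }); // left values win on clashes
    if (matches.length === 0 && how === "left") out.push({ ...l });
  }
  return out;
}

// Made-up sample rows with differently named key columns.
const leftRows = [
  { id_1: 1, a: "x" },
  { id_1: 2, a: "y" },
];
const rightRows = [
  { id_2: 1, b: "p" },
  { id_2: 3, b: "q" },
];

const innerJoined = mergeRows(leftRows, rightRows, { left_on: "id_1", right_on: "id_2" });
const leftJoined = mergeRows(leftRows, rightRows, { left_on: "id_1", right_on: "id_2", how: "left" });
console.log(innerJoined, leftJoined);
```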

Mathematical Operations

add(other)

Add scalar or DataFrame to numeric columns.

// Add scalar
const df2 = df.add(10);

// Add DataFrame
const df3 = df1.add(df2);

sub(other)

Subtract scalar or DataFrame from numeric columns.

const df2 = df.sub(5);

mul(other)

Multiply numeric columns by scalar or DataFrame.

const df2 = df.mul(2);

div(other)

Divide numeric columns by scalar or DataFrame.

const df2 = df.div(100);
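The scalar versions of these four methods all follow the same pattern: apply a function to every numeric column and leave the rest alone. A sketch over column-oriented data is below. The helper `mapNumeric`, the rule for deciding that a column is numeric, and passing missing values through unchanged are assumptions for illustration.

```javascript
// Hypothetical helper: apply fn to every value of every numeric column.
// A column counts as numeric if all of its non-missing values are numbers.
function mapNumeric(columns, fn) {
  const out = {};
  for (const [name, values] of Object.entries(columns)) {
    const numeric = values.every((v) => v == null || typeof v === "number");
    out[name] = numeric ? values.map((v) => (v == null ? v : fn(v))) : [...values];
  }
  return out;
}

const addScalar = (columns, k) => mapNumeric(columns, (v) => v + k);
const divScalar = (columns, k) => mapNumeric(columns, (v) => v / k);

// Made-up column-oriented data.
const cols = { name: ["a", "b"], income: [100, 200], score: [50, null] };

const plusTen = addScalar(cols, 10);
const asFraction = divScalar(cols, 100);
console.log(plusTen, asFraction);
```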

Advanced Features

Performance Comparison

Here's a comparison of regular and concurrent groupby performance on different dataset sizes. Spoiler: concurrency overhead is real, so if your DataFrame is tiny, you're just wasting CPU cycles. For larger DataFrames, though, there's a good speedup. On my computer (an M2 MacBook Air), a groupby aggregation over 10M rows takes about 1.2 seconds or less.

const results = await comparePerformance();
view(results.table());

Concurrent Processing

const regularGroupBy = df.groupby(["name", "region"]).agg({
  population: ["min", "mean", "max"],
});

const concurrentResult = await df.concurrentGroupBy(["name", "region"], {
  population: ["min", "mean", "max"],
});
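The page doesn't show how concurrentGroupBy works internally, so the following is only a sketch of one common approach: split the rows into chunks, aggregate each chunk on its own, then merge the partial results. All names here (`aggregateChunk`, `mergePartials`, `chunkedGroupBy`) are made up, and `Promise.all` merely stands in for real workers, so this version gains no actual parallelism.

```javascript
// Hypothetical sketch of chunked aggregation; not the library's implementation.
const emptyStats = () => ({ min: Infinity, max: -Infinity, sum: 0, count: 0 });

// Aggregate one chunk into per-key partial statistics.
function aggregateChunk(rows, key, column) {
  const partial = new Map();
  for (const row of rows) {
    const stats = partial.get(row[key]) ?? emptyStats();
    const value = row[column];
    stats.min = Math.min(stats.min, value);
    stats.max = Math.max(stats.max, value);
    stats.sum += value;
    stats.count += 1;
    partial.set(row[key], stats);
  }
  return partial;
}

// Combine partial results. The mean is rebuilt from sum and count,
// because averaging per-chunk means would give the wrong answer.
function mergePartials(partials) {
  const merged = new Map();
  for (const partial of partials) {
    for (const [groupKey, p] of partial) {
      const m = merged.get(groupKey) ?? emptyStats();
      m.min = Math.min(m.min, p.min);
      m.max = Math.max(m.max, p.max);
      m.sum += p.sum;
      m.count += p.count;
      merged.set(groupKey, m);
    }
  }
  return merged;
}

async function chunkedGroupBy(rows, key, column, chunkSize = 2) {
  const chunks = [];
  for (let i = 0; i < rows.length; i += chunkSize) chunks.push(rows.slice(i, i + chunkSize));
  // With real workers, each chunk would be posted to a separate worker here.
  const partials = await Promise.all(chunks.map(async (c) => aggregateChunk(c, key, column)));
  return [...mergePartials(partials)].map(([groupKey, m]) => ({
    [key]: groupKey,
    min: m.min,
    mean: m.sum / m.count,
    max: m.max,
  }));
}

// Made-up sample rows.
const demoRows = [
  { name: "X", population: 10 },
  { name: "Y", population: 4 },
  { name: "X", population: 30 },
  { name: "Y", population: 6 },
  { name: "X", population: 20 },
];
chunkedGroupBy(demoRows, "name", "population").then((result) => console.log(result));
```

The splitting, copying and merging is the overhead mentioned above: for a handful of rows it costs more than the aggregation itself.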

Correlation Plot

const plot = df.corrPlot({
  width: 600,
  height: 480,
  marginTop: 100,
  scheme: "blues",
  decimals: 2,
});
const corrMatrix = df.corrMatrix();
const corrData = corrMatrix._data;