DataFrame: My redundant .js pandas clone
What I've done here is waste a bunch of time... and, along the way, I've created a JavaScript implementation of a pandas-like DataFrame for data manipulation and analysis. Once again, yes: there are already pretty good implementations of this, like arquero. However:
- I wanted to implement my own minimal version, and nothing teaches you like implementing your own stuff,
- I didn't think arquero was Python-esque enough for me, and
- I don't expect anyone else to use this stuff except me.
With Claude and OpenAI these days, it's easier than ever to learn, which is exactly why I've done what I've done. As I really am starting to love TypeScript (save for the analysis part), this implementation aims to provide a Python/pandas-like experience in JavaScript, making data manipulation and analysis more intuitive for Python developers like me.
Quickstart
Import and Create
import { DataFrame } from "./components/DataFrame.js";
const nations = await FileAttachment("./data/nations.csv").csv({ typed: true });
const df = new DataFrame(nations);
Basic Inspection
view(df);
view(df.table());
view(df.describe().table());
Let's view the data and the table in its entirety.
Let's use the describe function and turn that into a table.
Selection & Filtering
const highIncome = df
  .query("row.income > 20000")
  .sort_values("population", false)
  .head(5);
view(highIncome.table());
Cool. That's great, but there's more than one entry per country, since it's a time series. Let's do a little pythonic manipulation to get just the latest population for each country, first in a basic pythonic way and then in an even more pythonic way.
Here's the basic way:
const highIncomeUnique = [];
for (const [key, group] of df.groupby("name", { iterable: true })) {
  const data = group.dropna({ axis: 0, subset: ["population"] });
  const lastYear = d3.max(data._data.year);
  const lastYearData = data.query(`row.year === ${lastYear}`);
  highIncomeUnique.push({
    ...key,
    year: lastYear,
    income: lastYearData._data.income[0],
    population: lastYearData._data.population[0],
    region: lastYearData._data.region[0],
  });
}
const highIncomeUniquedf = new DataFrame(highIncomeUnique).sort_values(
  "population",
  false
);
view(highIncomeUniquedf.table());
Here's an agg version, because it's much nicer to write.
const highIncomeUnique2 = df
  .dropna({ axis: 0, subset: ["population"] })
  .groupby(["name"])
  .agg({
    year: ["last"],
    income: ["last"],
    population: ["last"],
    region: ["last"],
  })
  .query("row.income > 350")
  .sort_values("population", false)
  .head(20);
view(highIncomeUnique2.table());
Statistical Analysis
view(df.corr("income", "population"));
const width_plot = 600;
view(
  df.corrPlot({
    width: width_plot,
    height: (width_plot * 240) / 300,
    marginTop: 100,
  })
);
Group Operations
const incomeSummary = df.groupby(["name", "region"]).agg({
  population: ["min", "mean", "max"],
  lifeExpectancy: ["mean", "max"],
});
view(incomeSummary.table());
Key Methods Reference
Basic Operations
df.head(n);
df.tail(n);
df.print();
df.table();
Data Access & Transformation
df.select(["colA", "colB"]);
df.query("row.colA > 100");
df.sort_values("colA", true);
df.fillna(0);
df.apply("colA", (x) => x * 2);
df.map("colA", { oldValue: "newValue" });
Statistics
df.describe();
df.corr("colA", "colB");
df.corrMatrix();
df.corrPlot({ width: 500, height: 500 });
Grouping
df.groupby(["colA"]).agg({ colB: ["min", "max", "mean"] });
await df.concurrentGroupBy(["colA", "colC"], { colB: ["min", "mean"] });
Iteration Methods
for (const [idx, row] of df.iterrows()) {
  /* ... */
}
for (const tuple of df.itertuples("Record")) {
  /* ... */
}
for (const [col, values] of df.items()) {
  /* ... */
}
for (const row of df) {
  /* ... */
}
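As a concrete (if trivial) illustration, here's a sketch that sums a column row by row; it assumes iterrows() yields [index, rowObject] pairs with rows keyed by column name, as in the pandas-style usage above, and uses the population column from the nations data.
// Sketch: sum the population column with iterrows().
// Assumes each row is a plain object keyed by column name.
let totalPopulation = 0;
for (const [idx, row] of df.iterrows()) {
  if (row.population != null) totalPopulation += row.population;
}
view(totalPopulation);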
describe()
Generate descriptive statistics for all columns. Returns a DataFrame containing statistics like count, mean, std, min/max, and percentiles for numeric columns, and category distributions for non-numeric columns.
// Get basic statistics
const stats = df.describe();
// Print the statistics in a formatted way
stats.print();
// Access category distributions for non-numeric columns
const categories = stats._data.categories;
// Render as an HTML table using Observable's Inputs.table
Inputs.table(stats.print());
The describe method provides different statistics based on column type:
- Numeric columns: count, mean, std, min, 25%, 50%, 75%, max
- Categorical columns: count, unique values, top values, frequency, category distribution
percentile(p[, columns])
Calculate percentile value(s) for numeric columns.
// 75th percentile of age
const p75 = df.percentile(0.75, "age");
// Multiple columns
const quartiles = df.percentile(0.25, ["age", "salary"]);
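One way to put this to work is as a filter threshold. A sketch, assuming the numeric income column from the nations data and that percentile() returns a plain number when given a single column:
// Sketch: keep only rows above the 75th percentile of income.
const incomeP75 = df.percentile(0.75, "income");
const topQuartile = df.query(`row.income > ${incomeP75}`);
view(topQuartile.table());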
Data Manipulation
fillna(value)
Fill missing values in the DataFrame.
const filled = df.fillna(0);
apply(column, func)
Apply a function to a column.
const df2 = df.apply("name", (name) => name.toUpperCase());
map(column, mapper)
Map values in a column using a mapping object.
const df2 = df.map("status", {
A: "Active",
I: "Inactive",
});
drop() & dropna()
Remove specified columns with drop(), or drop rows with missing values with dropna().
const df2 = df.drop(["temp_col", "unused_col"]);
const df3 = df.dropna({ axis: 0, subset: ["population"] });
rename(columnMap)
Rename columns using a mapping object.
const df2 = df.rename({
  old_name: "new_name",
  prev_col: "next_col",
});
assign(columnName, values)
Add a new column.
// Add column with array
const df2 = df.assign("new_col", [1, 2, 3]);
// Add column with function
const df3 = df.assign("bmi", (row) => row.weight / (row.height * row.height));
merge(other[, options])
Merge two DataFrames.
const merged = df1.merge(df2, {
  on: "id", // Join on same column name
  how: "inner", // Join type
  left_on: "id_1", // Custom left join column
  right_on: "id_2", // Custom right join column
});
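As a minimal sketch of a join (the column names here are invented for illustration; the constructor-from-records usage matches the Quickstart):
// Sketch: inner join two small DataFrames on a shared "id" column.
const people = new DataFrame([
  { id: 1, name: "Ada" },
  { id: 2, name: "Grace" },
]);
const salaries = new DataFrame([
  { id: 1, salary: 100 },
  { id: 2, salary: 120 },
]);
const joined = people.merge(salaries, { on: "id", how: "inner" });
view(joined.table());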
Mathematical Operations
add(other)
Add scalar or DataFrame to numeric columns.
// Add scalar
const df2 = df.add(10);
// Add DataFrame
const df3 = df1.add(df2);
sub(other)
Subtract scalar or DataFrame from numeric columns.
const df2 = df.sub(5);
mul(other)
Multiply numeric columns by scalar or DataFrame.
const df2 = df.mul(2);
div(other)
Divide numeric columns by scalar or DataFrame.
const df2 = df.div(100);
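These compose with the selection methods above. A sketch, assuming the nations columns from the Quickstart and that non-numeric columns pass through untouched:
// Sketch: express population in millions via select() + div().
const popMillions = df.select(["name", "population"]).div(1e6);
view(popMillions.table());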
Advanced Features
Performance Comparison
Here's a comparison of regular and concurrent groupby performance on different dataset sizes. Spoiler: concurrency overhead is real, so if your DataFrame is tiny, you're just wasting CPU cycles. However, we do see some good speedup for larger DataFrames. On my computer (an M2 Air), I can groupby-aggregate 10M rows in ~1.2 seconds or less.
const results = await comparePerformance();
view(results.table());
Concurrent Processing
const regularGroupBy = df.groupby(["name", "region"]).agg({
  population: ["min", "mean", "max"],
});
const concurrentResult = await df.concurrentGroupBy(["name", "region"], {
  population: ["min", "mean", "max"],
});
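If you want to sanity-check the overhead trade-off on your own data, here's a rough timing sketch (performance.now() is available in both browsers and Node; as noted above, the speedup only shows up for large inputs):
// Sketch: rough wall-clock comparison of the two groupby paths.
const t0 = performance.now();
df.groupby(["name", "region"]).agg({ population: ["mean"] });
const t1 = performance.now();
await df.concurrentGroupBy(["name", "region"], { population: ["mean"] });
const t2 = performance.now();
view({ regular_ms: t1 - t0, concurrent_ms: t2 - t1 });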
corrPlot
const plot = df.corrPlot({
  width: 600,
  height: 480,
  marginTop: 100,
  scheme: "blues",
  decimals: 2,
});
const corrMatrix = df.corrMatrix();
const corrData = corrMatrix._data;
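To actually render it in the page, pass the plot to view() (the same Observable helper used throughout); corrData just exposes the raw matrix columns if you want the numbers directly.
view(plot);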