CSV Parser API
Functions for parsing CSV data from various sources.
Functions
- parseCsv(content: string, options?) - Parse CSV content from a string
- parseCsvFromFile(filePath: string, options?) - Parse CSV from a file or URL
- analyzeCsv(filePath: string, options?) - Sample the first N lines to infer columns, dtypes, and estimate total row count
- parseCsvBatchedStream(filePath: string, options) - Stream a large CSV in batches and receive DataFrame batches via a callback
- parseCsvStream(filePath: string, options?) - Stream a large CSV and return a single accumulated DataFrame
Options
interface CsvParseOptions {
hasHeader?: boolean; // Whether first row contains headers
delimiter?: string; // Custom delimiter (default: ',')
skipRows?: number; // Number of rows to skip at the start of the file
columns?: string[]; // Custom column names
comment?: string; // Lines starting with this character are ignored
dtypes?: Record<string, DType> | DType[]; // Column data types
inferTypes?: boolean; // Enable automatic type inference (default: true)
naValues?: string[]; // Values to treat as null/NaN
}
type DType = "int32" | "float64" | "string" | "bool" | "datetime";
Examples
Basic Usage
import { parseCsv, parseCsvFromFile } from "@cross/boxframe";
// Parse CSV string with automatic type inference
const df1 = parseCsv("name,age,salary,hire_date\nAlice,25,50000.50,2023-01-15\nBob,30,75000,2022-06-10");
// age: int32, salary: float64, hire_date: datetime, name: string
// Parse from file
const df2 = await parseCsvFromFile("data.csv");
// Parse from URL
const df3 = await parseCsvFromFile("https://example.com/data.csv");
Streaming Options
interface ParseCsvStreamOptions extends CsvParseOptions {
batchSize?: number; // Number of lines per batch (default: 25,000)
onBatch: (df: DataFrame) => void | Promise<void>; // Required for batched streaming
onProgress?: (progress: { bytesRead: number; rowsProcessed: number }) => void; // Optional progress callback
// Worker-based parallel parsing options
useWorkers?: boolean; // Enable worker-based parsing (default: false)
workerCount?: number; // Number of workers (default: Math.max(1, cores - 1))
preserveOrder?: boolean; // Preserve batch order (default: true for parseCsvStream, false for batched)
schema?: { columns: string[]; dtypes: Record<string, DType> }; // Pre-computed schema to skip inference
abortSignal?: AbortSignal; // Signal for cancellation/timeout
workerModule?: string | URL; // Custom worker module URL (advanced)
}
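For example, the onProgress callback can drive a percentage readout when the total file size is known up front. A minimal sketch (the file size is obtained here via Node's fs/promises; adapt the stat call to your runtime):
import { stat } from "node:fs/promises";
import { parseCsvBatchedStream } from "@cross/boxframe";
const filePath = "/path/to/large.csv";
const { size: totalBytes } = await stat(filePath); // total size for the percentage math
await parseCsvBatchedStream(filePath, {
  hasHeader: true,
  onBatch: async () => {
    // onBatch is required for batched streaming, even if you only want progress
  },
  onProgress: ({ bytesRead, rowsProcessed }) => {
    const pct = ((bytesRead / totalBytes) * 100).toFixed(1);
    console.log(`${pct}% read, ${rowsProcessed} rows processed`);
  },
});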
Data Type Control
// Explicit data types by column name
const df4 = parseCsv(csvContent, {
dtypes: {
name: "string",
age: "int32",
salary: "float64",
is_active: "bool",
hire_date: "datetime"
}
});
// Partial type specification (infer rest)
const df5 = parseCsv(csvContent, {
dtypes: {
name: "string",
age: "int32"
// salary, hire_date will be automatically inferred
}
});
// Data types by column index (no headers)
const df6 = parseCsv(csvContent, {
hasHeader: false,
dtypes: ["string", "int32", "float64", "bool", "datetime"]
});
// Disable type inference (all strings)
const df7 = parseCsv(csvContent, {
inferTypes: false
});
Advanced Options
// Custom NA values and delimiter
const df8 = parseCsv(csvContent, {
delimiter: ";",
naValues: ["N/A", "unknown", "NULL"],
dtypes: {
salary: "float64",
is_active: "bool"
}
});
// Skip rows and comments
const df9 = parseCsv(csvContent, {
skipRows: 2,
comment: "#",
hasHeader: true
});
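The columns option, which has no example above, assigns names to a headerless file. A short sketch, assuming names supplied via columns can then be used as dtypes keys (the column names here are illustrative):
// Name the columns of a headerless file, then type them by name
const df10 = parseCsv(csvContent, {
  hasHeader: false,
  columns: ["name", "age", "salary"], // illustrative names
  dtypes: { age: "int32", salary: "float64" },
});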
Streaming Large CSVs
import { parseCsvBatchedStream, parseCsvStream } from "@cross/boxframe";
// 1) Batched streaming with callbacks
await parseCsvBatchedStream("/path/to/large.csv", {
hasHeader: true,
batchSize: 50000, // number of data lines per batch
onBatch: async (df) => {
// Process each batch DataFrame incrementally
console.log("batch rows:", df.shape[0]);
// e.g., write to DB or aggregate stats here
},
onProgress: ({ bytesRead, rowsProcessed }) => {
console.log("progress:", { bytesRead, rowsProcessed });
},
});
// 2) Streaming that returns a single accumulated DataFrame (the options argument is optional)
const df = await parseCsvStream("/path/to/large.csv", {
hasHeader: true,
batchSize: 25000, // affects internal batching; result is a single DataFrame
});
console.log("total rows:", df.shape[0]);
Worker-Based Parallel Parsing
For very large CSV files, you can enable worker-based parallel parsing to leverage multiple CPU cores:
import { parseCsvBatchedStream, parseCsvStream, analyzeCsv } from "@cross/boxframe";
// Enable workers for parallel processing (recommended for files > 100MB)
await parseCsvBatchedStream("/path/to/very-large.csv", {
useWorkers: true, // Enable parallel processing
workerCount: 4, // Optional: specify number of workers (default: cores - 1)
batchSize: 50000,
preserveOrder: false, // Faster but batches may arrive out of order
onBatch: async (df) => {
console.log("Received batch:", df.shape[0], "rows");
// Process batch...
},
});
// With order preservation (slower but maintains batch sequence)
await parseCsvBatchedStream("/path/to/very-large.csv", {
useWorkers: true,
preserveOrder: true, // Batches arrive in sequence (0, 1, 2, ...)
batchSize: 50000,
onBatch: async (df) => {
// Process batches in order...
},
});
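// Out-of-order delivery is harmless for commutative aggregations.
// A sketch that counts total rows regardless of batch arrival order
// (uses only df.shape, as in the examples above):
let totalRows = 0;
await parseCsvBatchedStream("/path/to/very-large.csv", {
  useWorkers: true,
  preserveOrder: false, // order doesn't matter for a running count
  onBatch: (df) => {
    totalRows += df.shape[0]; // commutative: safe with out-of-order batches
  },
});
console.log("total rows:", totalRows);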
// Pre-compute schema to skip inference in workers (faster startup)
const analysis = await analyzeCsv("/path/to/very-large.csv", { sampleLines: 1000 });
const df = await parseCsvStream("/path/to/very-large.csv", {
useWorkers: true,
schema: {
columns: analysis.columns,
dtypes: analysis.dtypes,
},
// Schema is known, so workers skip type inference
});
// With cancellation support
const controller = new AbortController();
setTimeout(() => controller.abort(), 60000); // Cancel after 1 minute
await parseCsvBatchedStream("/path/to/very-large.csv", {
useWorkers: true,
abortSignal: controller.signal, // Enables cancellation
onBatch: async (df) => {
// Process batches...
},
}).catch((err) => {
if (err.message.includes("aborted")) {
console.log("Parsing cancelled");
}
});
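On runtimes that implement the static AbortSignal.timeout(), the controller-plus-setTimeout pair above collapses to a single line:
await parseCsvBatchedStream("/path/to/very-large.csv", {
  useWorkers: true,
  abortSignal: AbortSignal.timeout(60_000), // aborts after 1 minute
  onBatch: async (df) => {
    // Process batches...
  },
});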
Quick File Analysis (Sampling)
import { analyzeCsv } from "@cross/boxframe";
const analysis = await analyzeCsv("/path/to/large.csv", { sampleLines: 1000 });
console.log(analysis.columns); // inferred column names
console.log(analysis.dtypes); // inferred dtypes per column
console.log(analysis.estimatedRows); // rough estimate of total rows
console.log(analysis.sampleRows); // rows parsed from the sample
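One use for the estimate is picking a loading strategy before committing to a full parse. A sketch using only the functions documented above (the 1,000,000-row threshold is arbitrary; tune it to your memory budget):
import { analyzeCsv, parseCsvFromFile, parseCsvStream } from "@cross/boxframe";
const path = "/path/to/unknown.csv";
const info = await analyzeCsv(path, { sampleLines: 1000 });
// Stream big files with workers; load small ones directly
const df = info.estimatedRows > 1_000_000
  ? await parseCsvStream(path, { useWorkers: true })
  : await parseCsvFromFile(path);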
Data Type Inference
The CSV parser infers each column's data type from its values:
- Numbers: int32 for integers, float64 for decimals
- Booleans: true/false (case-insensitive) → bool
- Dates: YYYY-MM-DD, MM/DD/YYYY, YYYY-MM-DDTHH:MM:SS → datetime
- Strings: everything else → string
Supported Date Formats
- YYYY-MM-DD (e.g., 2023-01-15)
- YYYY-MM-DD HH:MM:SS (e.g., 2023-01-15 10:30:00)
- YYYY-MM-DDTHH:MM:SS (e.g., 2023-01-15T10:30:00Z)
- MM/DD/YYYY (e.g., 01/15/2023)
- MM-DD-YYYY (e.g., 01-15-2023)
- YYYY/MM/DD (e.g., 2023/01/15)
- M/D/YYYY (e.g., 1/15/2023)