Jordan Parses Large CSVs

Demo code here

First attempt

funny gif

When dealing with large files they have to be read in a stream. It’s just too much to load 18gb into the buffer. Node’s fs.createReadStream works amazing for this.

This is a basic parse function without the rows being too large. I used csv-parser which was great. Very easy to use. I set up the counter for a couple of different reasons.

The first is that logging out each number over 34 million numbers actually slows down getting to the end of the script because it takes a long time to just log everything out. With a log after every thousand iterations 530267ms. With a log on every iteration 4,028,349ms. So if I’m doing my math right, just over an hour with the log on every iteration and only nine minutes with a log on every 1000th iteration. That’s a CRAZY amount of difference.

The second reason is that I want to put this data in a database. I didn’t want to make 34 million inserts but I also couldn’t make one insert with 34 million records. So I’d create an array and push into it until it hit 1,000 and then insert that chunk.

async function parseCsv() {
	const fileName = 'big-file.csv';

	let counter = 0;
	let totalCounter = 0;

	fs.createReadStream(fileName)
		.pipe(csvParser())
		.on('data', (row) => {
			totalCounter++;
			counter++;

			if (counter > 1000) {
				console.log('total Counter', totalCounter);
				counter = 0;
				// do something here
			}
		})
		.on('end', () => {
			console.log('completed the parse!');
		});
}

Hanging error

hanging funny gif

The first problem I encountered with the above function is that occassionally it would just hang. It wouldn’t throw any errors that I could see but would just sit at the same place every time.

Finally when I was testing around with things I left it running all night. I came back in the morning and saw a buffer error. I’m not sure exactly how or why this happened. My first thought was that maybe a single record was just too big. The CSV was too big to open up so I had no way to be able to tell the difference.

I finally landed on splitting down the CSV into a bunch of smaller CSVs so I could open them up and see if anything was weird. I used csv-split-stream for this task. The function looks like this:

async function splitCsv(path: string, fileName: string) {

	return new Promise((resolve, reject) => {
		csvSplitStream.split(
			fs.createReadStream(path),
			{
				lineLimit: 10000
			},
			(index) => fs.createWriteStream(`csvs/${fileName}-${index}.csv`)
		)
			.then(csvSplitResponse => {
				console.log('csvSplitStream succeeded.', csvSplitResponse);
				resolve(csvSplitResponse.totalChunks);
			}).catch(csvSplitError => {
				console.log('csvSplitStream failed!', csvSplitError);
				reject();
			});
	})
}

However, once I had the files smaller, if I just looped through all of the smaller pieces suddenly I could parse the whole file. No more hanging. This still does not make much sense to me. This means it’s not the record it was hanging on but the entire file size. If it’s streaming, why would the file size matter? But…it worked.

Mongo document size error

size judge funny gif
MongoError: document is larger than the maximum size 16777216

This was my next problem. Occasionally I would have a record that was too big for the maximum size of Mongodb. Luckily this didn’t hang forever so I was easily able to pick out the problem.

This again, was solved by splitting. It didn’t make too much sense to me. Why would a smaller file size make this suddenly work? It’s not inserting the whole file but just pieces of it as it is streamed.

This is what I ended up with:

(async () => {

	for (let index = 0; index < 621; index++) {
		const directory = `F:\\business-csvs\\part_3/`;
		const fileName = `output-${index}.csv`;
		const path = `${directory}${fileName}`;
		if (fs.existsSync(path)) {
			try {
				await parseCsv(path);
			}
			catch (e) {
				// console.log('caught an error from down low', e, fileName);
				if (e.message.includes('document is larger')) {
					console.log('document too big error');

					const totalChunks = await splitCsv(path, fileName);
					const basePath = `csvs/problems/${fileName}-`;

					for (let i = 0; i < totalChunks; i++) {
						await parseCsv(`${basePath}${i}.csv`);
						fs.unlinkSync(`${basePath}${i}.csv`);
					}
				}
			}
		}
	}

})();

If parseCsv() threw an error and it included a message about document being too large, I’d split that file into even smaller files and then loop through those and parse them. After completing them, I’d delete them.

All in all, it worked really well. It took quite a bit of time but in the end I was able to parse eight 35GB CSV files in just a couple of days.

Demo code here

Looking for business leads?

Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome web data. Learn more at Cobalt Intelligence!

Leave a Reply

Your email address will not be published. Required fields are marked *