Categories
Dev Stuff

Converting CSV to SQL

I was recently working on an issue that required me to dig through gigabytes of CSV files to look for patterns and data points. I had no intention of bringing all of this data in to Excel, because there’s not enough RAM on earth for that to work. Instead I thought it would be easier to dump the data into MariaDB (MySQL would be fine as well) and just write some simple SQL statements.

There were a few things I tried, such as LOAD DATA INFILE and other SQL type things. Nothing really worked properly, so instead I decided to just convert the CSV files into SQL statements I could import.

This is a bad idea! Don’t ever let this touch a live production database!

I wrote this little PHP script to convert all of the CSV files in a directory to SQL:

<?php
$files    = glob( './*.{csv}', GLOB_BRACE );
$db_table = 'csv_table';

foreach ( $files as $file ) {
	$row       = 0;
	$handle    = fopen( $file, 'rb' );
	$row_names = '';

	if ( false !== $handle ) {
		while ( false !== ( $data = fgetcsv( $handle ) ) ) {
			$row++;
			$values = '';

			if ( 1 === $row ) {
				// Do Header things.
				$index  = 0;

				echo sprintf( 'DROP TABLE IF EXISTS `%s`;' . PHP_EOL, $db_table );
				echo sprintf( 'CREATE TABLE `%s` (' . PHP_EOL, $db_table );

				foreach ( $data as $key => $value ) {
					$index++;
					$data[ $key ] = strtolower( preg_replace( '/[^A-Za-z0-9]/', '_', $value ) );

					// Add a comma if we're not at the end of the array.
					if ( count( $data ) !== $index ) {
						echo sprintf( '	`%s` text NOT NULL,' . PHP_EOL, $data[ $key ] );
					} else {
						echo sprintf( '	`%s` text NOT NULL' . PHP_EOL, $data[ $key ] );
					}
				}

				echo ') ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;' . PHP_EOL;
				$row_names = '(`' . implode( '`,`', $data ) . '`)';
				continue;
			}

			foreach ( $data as $key => $value ) {
				$data[ $key ] = fake_sql_escape( $value );
			}

			$values = '(\'' . implode( '\',\'', $data ) . '\')';

			echo sprintf( 'INSERT INTO `%s` %s VALUES %s;' . PHP_EOL, $db_table, $row_names, $values );
		}
		fclose( $handle );
	}
}

function fake_sql_escape( $value ) {
	// https://stackoverflow.com/a/1163091
	// Totaly not safe for production!
	$return = '';
	$length = strlen( $value );

	for ( $i = 0; $i < $length; ++$i ) {
		$char = $value[ $i ];
		$ord  = ord( $char );
		if (
			"'" !== $char &&
			'"' !== $char &&
			'\\' !== $char &&
			$ord >= 32 &&
			$ord <= 126
		) {
			$return .= $char;
		} else {
			$return .= '\\x' . dechex( $ord );
		}
	}
	return $return;
}

Borrowing the example airtravel.csv from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html this is what the output looks like:

DROP TABLE IF EXISTS `csv_table`;
CREATE TABLE `csv_table` (
	`month` text NOT NULL,
	`1958` text NOT NULL,
	`1959` text NOT NULL,
	`1960` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('JAN','  340','  360','  417');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('FEB','  318','  342','  391');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('MAR','  362','  406','  419');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('APR','  348','  396','  461');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('MAY','  363','  420','  472');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('JUN','  435','  472','  535');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('JUL','  491','  548','  622');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('AUG','  505','  559','  606');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('SEP','  404','  463','  508');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('OCT','  359','  407','  461');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('NOV','  310','  362','  390');
INSERT INTO `csv_table` (`month`,`1958`,`1959`,`1960`,`type`) VALUES ('DEC','  337','  405','  432');

And if you’re too lazy to see what the input CSV looks like (I would be) here it is, in its LGPL glory:

"Month", "1958", "1959", "1960"
"JAN",  340,  360,  417
"FEB",  318,  342,  391
"MAR",  362,  406,  419
"APR",  348,  396,  461
"MAY",  363,  420,  472
"JUN",  435,  472,  535
"JUL",  491,  548,  622
"AUG",  505,  559,  606
"SEP",  404,  463,  508
"OCT",  359,  407,  461
"NOV",  310,  362,  390
"DEC",  337,  405,  432

There’s a lot of ways that this could be made better, but it works in a quick pinch. If you need something similar, feel free to take this as a starting point.

Leave a Reply