Talking to .net from node.js (and node streams)

Note: This post is not about CI or CD. It falls into the category of

"Other stuff I learnt while doing my developer day job"

Hopefully you still find it interesting.

Some Background

At my current client, we use a node.js script to load large files into a Microsoft SQL database. We chose node for various reasons that I won't get into in this post.

Recently we have been pulling in some very large CSV files (i.e. more than 2 million rows, +/- 600MB) and we were running into some performance problems resulting in us taking most of the day to load all the files. These files are loaded daily, so we needed to find a way to improve that.

I won't go into too much detail, but we load the files using a customizable node streams (v2) pipeline. This allows us to load slightly different files using the same code, but making decision at runtime about how to handle them. At a high level, our pipeline looks like this:

Read file in as raw lines and send each line through the pipeline:

Strip out invalid UTF characters
Regex replacements for special strings
Ensure each line is valid CSV
Check that row has same amount of fields as header
Convert the CSV to JSON
Add some metadata to the JSON object
Map the JSON to our SQL table
Write the JSON to SQL via the tds npm package

Along the way, we also have custom hooks that are only used on certain files (the beauty of JS) and we log to Mongo twice.

The problem

This process would take 5 hours to run through the 600MB file!

A huge performance improvement came about by accident, we switched of the Mongo loading (for other reasons) and ended up with about a 3x increase in processing speed.

But our biggest improvement came in the way we talked to SQL Server. The tds module works with SQL commands, so we were generating SQL strings for each object. But the module (or some underlying SQL protocol) has a limit on the size of the SQL string. Some of our tables have many columns, so we were limited to a single SQL query at a time. This obviously comes with the overhead of creating a connection to the database and parsing/executing a query for each object. With 2 million objects, this added a huge overhead.

The solution to bulk inserts on SQL Server is a utility called BCP (bulk copy), and there is a node module claiming to support that, but it relies on the data being in a file before bulk copying it into the DB, something thats impossible with our streams architecture.

The solution was a .net class called SqlBulkCopy, but I'm sure you've spotted the problem, our pipeline is in node, how do we use a .net class?

Our solution - Node talking to .NET

Enter edge.js. This amazing project allows a node module to call a function in a .net assembly (.dll or .exe). You can also write inline C# code within your JS or inline JS within your C#.

So I wrote a simple C# binary that uses the SqlBulkCopy class to dump an array of objects into a table. Our 5 hour job, now completes in just under 30 minutes!

There are a few gotchas though:

The .net methods you call must be async
You can only pass one dynamic parameter
Any objects within that single parameter must be dynamic as well (except for primitives)
You get an object return value

These restrictions resulted in some extra data marshalling that can be a hassle, but in the end it was worth it. Here are some code samples from my implementation.

In C# / .net

C#'s dynamic object works in a similar way to a JSON object in JavaScript in that its not based off a class, and properties are created when assigned for the first time. The issue is that it doesn't support null properties. This will cause an error:

dynamic myObj;  
myObj.myProp = null;

This is a problem when you're only allowed to pass in a single dymanic object and you want some optional parameters. Dynamic objects do allow you to reflect on them, but another gotcha is that the dynamic object provided by edge.js is not the same as the dynamic object declared in c#, making testing tricky.

In the end, I wrote a utility method for extracting values from the object which handles both methods for reflecting on dynamic objects. (The "casting to IDictionary" method is used with an edge.js call). If a property does not exist in the object (which is how edge.js handles null in the json parameter), it returns the default value for the specified type.

private static T GetDynamicValue<T>(object obj, string name)  
{
    try
    {
        var dict = obj as IDictionary<string, object>;
        if (dict != null)
        {
            return dict.ContainsKey(name) ? (T) dict[name] : default(T);
        }

        if (obj.GetType().GetProperty(name) == null)
        {
            return default(T);
        }                
        return (T) obj.GetType().GetProperty(name).GetValue(obj);
    }
    catch (Exception e)
    {
        Console.Error.WriteLine("Could not extract [{0}] with type [{1}] from dynamic object",
                                    name, typeof(T).Name);
        Console.Error.WriteLine(e);
        return default(T);
    }
}

My BCP command then looked like this:

public async Task<object> BulkInsert(dynamic dataToInsert)  
{
    //Use the above method to pull the config into a strongly typed config object
    var config = ExtractConfig(dataToInsert);
    var rows = GetDynamicValue<object[]>(dataToInsert, "rows");

    // Sql.cs is a class that wraps the SqlBulkInsert and also provides some extra functionality
    // like running SQL statements. You can see more details about SqlBulkInsert online, that is
    // not the subject of this post
    var sql = new Sql(config);

    return await Task.Run(() => sql.BulkInsert(config.Table, rows));
}

In JavaScript / Node

In the node pipeline, I build up an array of 1000 items and then flush them to SQL via .net. 1000 seemed like the sweetspot, but it could be tweaked.

var edge = require('edge');  
var bcpFunction = edge.func({  
    assemblyFile: './lib/NodeBulkCopy.dll',
    typeName: 'NodeBulkCopy.BulkCopy',
    methodName: 'BulkInsert'
});


bcpFunction({  
    //config such as table name and location of the database
    rows : itemArray
}, function(err, result){
    //Handle the result here. If c# throws an unhandled exception, the err param will contain details
});

A nice side affect of this approach is that it enabled SQL Server Integrated Security (i.e. Windows user authentication). Previously we had to use a local SQL user with the password stored in plaintext.

Wow, this ended up being a longer post than I expected, thanks for reading to the end. I would love to hear you views on edge.js and node streams. Write a comment ;)