At BaseKit, we have a huge amount of JavaScript in our product, and we're in the midst of gradually refactoring it to use ES6 syntax. Our code is split into different areas: some is legacy and won't be refactored, but the rest is a mixture of ES5 and ES6. This was slowing down our Webpack builds, since each file had to be checked to see whether it needed transpiling: ES6 files are passed to Babel, while ES5 files are just concatenated into a bundle.

To see how much more we needed to do, and as a visual reminder that we need to keep refactoring, I decided to create a little script, run on each Webpack build, that would show the percentage of ES6 code in our codebase.

Requirements

The requirements are pretty straightforward:

  • Check if a file is using ES6 syntax (just checking for the presence of the word export is enough);
  • Report the percentage of JS files using ES6 syntax;
  • Be fast enough to run on every Webpack build - we really don't need to add more time to every cycle;
  • Be able to run on Linux - our development virtual machine - and ideally other platforms for manual runs.

Simple enough!

Bash

My initial pass was a Bash script to grep for the word export in every JavaScript file in our codebase (excluding node_modules, which sits outside the folder being searched):

#!/bin/bash

nomatch=$(grep -Lr "export" ./assets/public/ --include "*.js" | wc -l)
total=$(find ./assets/public/ -name "*.js" | wc -l)
percent=$((100 - nomatch * 100 / total)) # Bash only does integer arithmetic

printf 'Total JS files:\t%s\n' "$total"
printf 'ES5 remaining:\t%s\n' "$nomatch"
printf 'Percentage:\t%s%% ES6\n' "$percent"

This was OK, but far from ideal: it took 0.25s to run each time, and it checked every JavaScript file. Some of the folders being searched contain legacy code which will never be updated to ES6, so shouldn't be included in the count.

I'd like to be able to exclude certain folders, and ideally report back in JSON format so Webpack can pick up the stats easily, but I didn't really fancy hunting down a JSON parser for Bash, or manually listing folders to exclude in the script.

Fortunately, our JS is split into several bundles by Webpack, so we already have a JSON config file which lists the folders to be checked for each bundle. These are written in glob format - assets/js/version3/**/*.js - so they would need to be expanded into a complete list of files, each of which is then checked for ES6 syntax.
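For illustration, a config of that shape - the bundle names and second path here are invented - might look like:

```json
{
  "main": ["assets/js/version3/**/*.js"],
  "admin": ["assets/js/admin/**/*.js"]
}
```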

Rewrite it in Rust

With a big list of improvements, I decided to write a new version in Rust. Rust, besides being what I'm most interested in lately, was a good fit because:

  • it has a mature JSON-handling library in serde;
  • it has a glob library, glob;
  • it has a great regex library, regex;
  • it can run on a Linux VM, which is what we use in development.

The supporting infrastructure around Rust makes it attractive too: Crates.io makes it easy to discover libraries and Cargo makes it incredibly easy to install them and try them out.
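As a sketch, a Cargo.toml for such a tool might declare those dependencies like this (the version numbers are illustrative, not the ones we actually pinned):

```toml
[dependencies]
serde = { version = "1", features = ["derive"] }
serde_json = "1"
glob = "0.3"
regex = "1"
clap = "2"
```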

Finally, some Rust code

The Rust version is pretty simple: build a list of paths, then go through them and check each file for the presence of export (for ES6) and of $(, to detect jQuery usage - something I haven't mentioned up to this point: we also want to gradually get rid of jQuery, so it's handy to know which files still use it.

// crates used: serde (with derive), serde_json, glob, regex
use std::collections::HashSet;
use std::env;
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader};
use std::path::{Path, PathBuf};

use glob::glob;
use regex::RegexSet;
use serde::Serialize;
use serde_json::Value;

// config file name
static CONFIG_FILE: &str = "assets.json";

// open and parse a JSON file to get the lists of files
fn read_config(path: &Path) -> Result<Value, io::Error> {
    let text = fs::read_to_string(path)?;
    let config: Value = serde_json::from_str(&text)?;
    Ok(config)
}

// expand the glob patterns from the config file into a set of concrete paths
pub fn get_paths(dir: &Path) -> HashSet<PathBuf> {
    let mut paths: HashSet<PathBuf> = HashSet::new();
    match read_config(&dir.join(CONFIG_FILE)) {
        Ok(config) => {
            for value in config.as_object().expect("Error parsing JSON").values() {
                for val in value.as_array().expect("Error parsing JSON").iter() {
                    let val_str = val.as_str().expect("Error parsing JSON");
                    let abs_path = dir.join(val_str);

                    for path in glob(abs_path.to_str().expect("Failed to read string"))
                        .expect("Failed to glob path")
                        .filter_map(Result::ok)
                    {
                        paths.insert(path);
                    }
                }
            }
        }
        Err(e) => panic!("Error reading {}: {:?}", CONFIG_FILE, e),
    };

    paths
}

// go through the list of files and keep lists of files to be refactored
// (either because they're ES5 or because they use jQuery)
pub fn check_es5_jquery(assets: Vec<&PathBuf>) -> (Vec<PathBuf>, Vec<PathBuf>) {
    let mut es5: Vec<PathBuf> = Vec::new();
    let mut jquery: Vec<PathBuf> = Vec::new();

    let set = RegexSet::new(&[
        r"export",
        r"\$\(",
    ]).expect("Failed to create regexes");

    for path in assets {
        let (is_es6, is_jquery) = match scan(path, &set) {
            Ok((e, j)) => (e, j),
            Err(e) => panic!("{:?}", e),
        };

        if !is_es6 {
            es5.push(path.to_path_buf());
        }

        if is_jquery {
            jquery.push(path.to_path_buf());
        }
    }

    (es5, jquery)
}

// scan a file line by line for regex matches
fn scan(path: &PathBuf, set: &RegexSet) -> Result<(bool, bool), io::Error> {
    let mut is_es6 = false;
    let mut is_jquery = false;

    for line in BufReader::new(File::open(path)?).lines() {
        let text = line?;

        let matches = set.matches(&text);
        if matches.matched(0) {
            is_es6 = true;
        }

        if matches.matched(1) {
            is_jquery = true;
        }
    }

    Ok((is_es6, is_jquery))
}

// a little struct so we can output nice JSON
#[derive(Serialize)]
struct Statistics {
    file_count: i16,
    es6_percentage: i8,
    es5_files: Vec<PathBuf>,
    es5_count: i16,
    jquery_files: Vec<PathBuf>,
    jquery_count: i16,
}

// run it
fn main() {
    let current_dir = env::current_dir().expect("Where am I?");

    let assets = get_paths(&current_dir);
    let (es5, jquery) = check_es5_jquery(assets.iter().collect());

    let stats = Statistics {
        file_count: assets.len() as i16,
        es6_percentage: (((assets.len() - es5.len()) * 100) / assets.len()) as i8,
        es5_count: es5.len() as i16,
        es5_files: es5,
        jquery_count: jquery.len() as i16,
        jquery_files: jquery,
    };

    println!("{}", serde_json::to_string(&stats).unwrap());
}

The code is edited for clarity - in the actual programme I used Clap for argument parsing, so we can exclude certain directories, only search unit test files, and so on.
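For illustration - the counts and file name here are invented - the serialised Statistics struct comes out as something like:

```json
{
  "file_count": 1200,
  "es6_percentage": 68,
  "es5_files": ["assets/js/version3/legacyWidget.js"],
  "es5_count": 1,
  "jquery_files": [],
  "jquery_count": 0
}
```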

Results

I was surprised by how simple it was to get this running - opening and parsing a config file was really easy, and the only time-consuming part was adding .expect() error messages everywhere (after getting rid of the unwraps). I found it about as simple as writing the equivalent Python script - which I'd also considered for this project - if slightly more verbose.

Speed

The first Rust version - compiled with --release but before switching to RegexSet - ran in 0.05s: five times faster than the Bash script. This surprised me, since the Rust version does a lot more work - reading a config file, expanding glob patterns and running two regex matches against Bash's one - yet over the same roughly 1200 JavaScript files it was still significantly faster.

I then updated the Rust version to use RegexSet rather than two separate regexes that each searched every line; this cut the time further, to 0.007s. That was easily fast enough to add to our Webpack build, so the files could be checked on every file change.

Cross-platform

The Rust version was also a lot more configurable, since it could take command line arguments, and it could be compiled for the Linux VM we use in development as well as running on OSX while I was writing the programme.

Testing

I went even further, and added unit tests to the Rust checker - Rust makes adding tests really easy with cargo test and annotating modules:

#[cfg(test)]
mod test_imports {
    #[test]
    fn test_import_export_es5() {
        let set = super::RegexSet::new(&[
            r"export",
            r"(?m)^import",
        ]).expect("Failed to create regexes");

        // the ES5 fixture contains no import/export, so the scan should find nothing
        let es5 = super::import_export_scan(&super::PathBuf::from("testHelpers/es5.js"), &set);
        match es5 {
            Ok(val) => assert_eq!(val, false),
            Err(e) => panic!("ES5 import/export scan failed: {:?}", e)
        };
    }

    #[test]
    fn test_import_export_es6() {
        let set = super::RegexSet::new(&[
            r"export",
            r"(?m)^import",
        ]).expect("Failed to create regexes");

        let es6 = super::import_export_scan(&super::PathBuf::from("testHelpers/es6.js"), &set);
        match es6 {
            Ok(val) => assert_eq!(val, true),
            Err(e) => panic!("ES6 import/export scan failed: {:?}", e)
        };
    }
}

Why not add tests to tiny dev utilities?!

Overall

I was really impressed by the speed of the final Rust checker - at less than a hundredth of a second I didn't feel bad about adding it to the Webpack build process to give our developers a prod to keep refactoring. It was also nice to know it would run cross-platform: our development takes place in a Linux VM while most of the developers use OSX, so making changes to the checker was easy. The Bash script also made me slightly uneasy - although it should work on both Linux and OSX, everyone uses different terminals and shells, so you can never be entirely sure that all the commands are available or aliased the way you expect.

The downsides: having to add .expect()s with error messages everywhere in the get_paths loop was a bit annoying, but that's probably something I could improve with an error crate like failure (if I ever fancy improving the script).