At work, the software we develop and ship comes with a lot of third-party dependencies, so we must keep track of their licenses. The process used to be very manual; we improved it through automation so that it could also scale with the company's growth.
Status a few years ago
The software we develop includes many third-party dependencies in different languages:
- Java
- JavaScript
- Python
- a few static data files
A few years ago, tracking the version and license(s) of each dependency was the job of a single person. The dependency list was kept in a Google Spreadsheet, with one sheet per language or part of the software. Each sheet had a set of columns, the main ones being:
- dependency name
- version
- display name
- repository url
- homepage
- copyrights if available
- license(s): a dependency could be available under different licenses
Before each release, each dependency was manually tracked and diffed against the previous software version to check what had been added, updated or removed. The Spreadsheet was updated accordingly and then processed by other tools to extract the data to be published in different formats:
- as html on the website
- as a global notice file shipped with the software, which is a concatenation of all license files
This process had the following problems:
- the work was very tedious and took quite some time before each release to ensure everything was done right
- because everything was manual, it was also very error-prone
- the Spreadsheet was disconnected from the software development lifecycle and only linked to releases
As more developers joined, more features were added along with new dependencies. The process could not keep up in that state.
Reworking the process
The process had to be reworked. We wanted to achieve at least the following:
- not tied to a single person, so that the burden is spread across more people and can scale further
- follow the development lifecycle to prevent doing everything before a release
- automate to prevent manual errors and add automatic checks
Based on the above requirements, devising a plan was quite simple:
- extract information from the Google Spreadsheet
- store it into the repository using a text format
- automate dependency extraction for all components
- update the licenses changed since the previous release to match the current development state
- improve the process and add CI jobs to check that the data stays up to date
- update output formats to be published: notice file and documentation
- scale up with developers: provide training
Extracting data from the Spreadsheet
The extraction was pretty straightforward:
- extract each sheet into its own JSON file
- normalize the fields across all the files so that they share a common data model
- commit them into our repository to serve as a baseline
A few experiments were made with the format so that it would be as developer-friendly and easy to use as possible. We settled on JSON:
- widely used and can be parsed and read by virtually anything
- pretty readable when prettified
- items and fields are ordered by name and version: files can be diffed easily from one version to another
To sum up, we have one file per language, plus one more for manual integrations and data files.
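For illustration, here is a minimal sketch of that conversion, assuming each sheet is first exported as CSV; the column headers, field names and file names are hypothetical, not our actual ones:

```python
import csv
import json
from pathlib import Path

# Hypothetical mapping from spreadsheet column headers to our
# normalized field names; the real sheets used varying headers.
FIELD_MAP = {
    "Dependency name": "name",
    "Version": "version",
    "Display name": "displayName",
    "Repository url": "repository",
    "Homepage": "homepage",
    "Copyrights": "copyrights",
    "License(s)": "licenses",
}

def sheet_to_json(csv_path: Path, json_path: Path) -> None:
    with csv_path.open(newline="") as f:
        rows = [
            {FIELD_MAP[k]: v for k, v in row.items() if k in FIELD_MAP}
            for row in csv.DictReader(f)
        ]
    # Sort by name then version so diffs stay stable across regenerations.
    rows.sort(key=lambda r: (r.get("name", ""), r.get("version", "")))
    json_path.write_text(json.dumps(rows, indent=2, sort_keys=True) + "\n")

sheet_to_json(Path("java.csv"), Path("java.json"))
```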
JSON schema
After normalizing the data, a JSON schema was added to check mandatory fields and to validate license types against an explicit list of authorized licenses. That way, no new license type can be introduced without the schema being updated and reviewed beforehand. The format is not set in stone and has evolved a bit since the initial version. It is quite easy to update, and changes are easy to track since it is versioned in the git repository.
The SPDX identifier is used as the license type field.
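As an illustration, a check of this kind can be run with the jsonschema Python library; the schema below is a simplified stand-in for ours, and the license allow-list is just an example:

```python
import json
from jsonschema import validate  # pip install jsonschema

# Illustrative schema: required fields plus an explicit allow-list
# of SPDX identifiers for the license type.
SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["name", "version", "repository", "licenses"],
        "properties": {
            "name": {"type": "string"},
            "version": {"type": "string"},
            "repository": {"type": "string"},
            "licenses": {
                "type": "array",
                "items": {"enum": ["Apache-2.0", "MIT", "BSD-3-Clause"]},
                "minItems": 1,
            },
        },
    },
}

with open("java.json") as f:
    validate(instance=json.load(f), schema=SCHEMA)  # raises on any violation
```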
Copied components
A few third parties are copied directly into the software. There are several reasons for that:
- only one function is needed from a whole library
- some static data are needed, geo locations for example
- a whole library may be copied at a specific version because there is no available release
The company has a duty to track those dependencies for license and security purposes. They are added and updated manually in their own files.
Automating dependency extraction
Dependency extraction had been done manually up to that point. As explained above, it was a very tedious process and it was time to automate it.
As the software is made up of different languages, a different script is used for each one:
- python: a custom script extracting packages, because the software is shipped with several custom python environments. This part is to be reworked later to move to `pip freeze`
- frontend:
  - license-checker, reading the several `package-lock.json` files
  - a custom script to also extract dependencies from the `bower.json` files that we still use
- java: that was the most complex part. The software does not depend on a single gradle file but is a complex multiproject build, with build (not shipped) and runtime dependencies. owasp-dependency-check was first used and integrated into gradle. At a later stage we moved to a single script that scans the jars along with the gradle output to produce the exhaustive list of bundled jars
- manual data: this is a list of components manually added into the software. It is not automated and is just updated by hand
The generated files are then normalized into dedicated files per language that match the Google Spreadsheet format described above.
Quite some time was spent checking the resulting lists for each language.
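To give an idea of the normalization step, here is a sketch for the frontend case. license-checker's `--json` output (which maps `name@version` keys to package metadata) is real; the target field names are the simplified, hypothetical ones used above:

```python
import json
import subprocess

# Run license-checker and capture its JSON output, which maps
# "name@version" keys to per-package metadata.
out = subprocess.run(
    ["npx", "license-checker", "--json"],
    capture_output=True, text=True, check=True,
).stdout

normalized = []
for key, meta in json.loads(out).items():
    name, _, version = key.rpartition("@")  # scoped names keep their "@"
    lic = meta.get("licenses", [])
    normalized.append({
        "name": name,
        "version": version,
        "repository": meta.get("repository", ""),
        # license-checker may report a single string or a list;
        # normalize to a list to match our common data model.
        "licenses": [lic] if isinstance(lic, str) else list(lic),
    })

normalized.sort(key=lambda d: (d["name"], d["version"]))
print(json.dumps(normalized, indent=2))
```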
Dedicated tool to download and update licenses
Once those files were extracted, we wanted to automate updating the licenses: checking the license type and downloading the license file.
As most third-party software is now hosted on GitHub, a tool was developed to:
- read a normalized dependency file
- for each dependency:
  - query GitHub if the repository URL is available
  - check the license type from the metadata
  - check the tag matching the version, with a few heuristics: the license at the dependency's version is needed because some software changes license from one version to another (e.g. Elasticsearch)
  - use some heuristics to check whether the license type in the metadata matches the license text
  - download the related license files
- regenerate the normalized file with updated fields
When something is uncertain, the tool just skips the dependency or raises a warning; some manual checks then need to be done later.
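A condensed sketch of the GitHub part, using only the standard REST API (the repository metadata exposes the license's SPDX id as detected by GitHub, and the contents endpoint can fetch a file at a given tag). The file-name loop is one heuristic among others, simplified, and most error handling is omitted:

```python
import base64
import json
import urllib.error
import urllib.request

API = "https://api.github.com"

def get_json(url):
    # Unauthenticated for the sketch; the real tool would send an auth token.
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def fetch_license(owner, repo, tag):
    # 1. License type as detected by GitHub (an SPDX id) from the repo metadata.
    meta = get_json(f"{API}/repos/{owner}/{repo}")
    spdx = (meta.get("license") or {}).get("spdx_id")

    # 2. License text at the dependency's version: try common file names
    #    at that tag via the contents API (a simplified heuristic).
    for name in ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"):
        try:
            blob = get_json(f"{API}/repos/{owner}/{repo}/contents/{name}?ref={tag}")
            return spdx, base64.b64decode(blob["content"]).decode("utf-8")
        except urllib.error.HTTPError:
            continue  # file not present at this tag, try the next name
    return spdx, None

# Example invocation against a public repository.
spdx, text = fetch_license("psf", "requests", "v2.31.0")
```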
Automating checks
A new job has been added to our CI (Jenkins) that periodically runs the above scripts to regenerate the files. They are then diffed against the current version in the repository. Changes may be due to a security update, a developer having forgotten to update the licenses, …
The job is not run on each PR because quite a lot of analysis happens in the background for all components, and it would be overkill to run it on every commit automatically. There is still an opt-in label to activate it when someone knows a dependency is being upgraded.
The differences are highlighted and the changed dependencies are sent to Slack for easier processing.
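The heart of that job can be sketched as follows; the script path and the Slack webhook URL are placeholders, and the real job runs inside a Jenkins pipeline:

```python
import json
import subprocess
import urllib.request

# Regenerate the normalized dependency files (placeholder script name).
subprocess.run(["./scripts/extract_all_dependencies.sh"], check=True)

# Diff against what is committed; --exit-code makes git return 1 on changes.
diff = subprocess.run(
    ["git", "diff", "--exit-code", "--", "licenses/"],
    capture_output=True, text=True,
)

if diff.returncode != 0:
    changed = subprocess.run(
        ["git", "diff", "--name-only", "--", "licenses/"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Post the changed files to Slack via an incoming webhook (placeholder URL).
    req = urllib.request.Request(
        "https://hooks.slack.com/services/XXX/YYY/ZZZ",
        data=json.dumps({"text": f"Dependency licenses changed:\n{changed}"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```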
Updating the process
Now that a set of scripts was in place to extract dependencies, it was time to scale up.
A presentation of the new process was made to other developers, and it was described in our Wiki. It can be summed up as:
- check if a dependency needs to be upgraded, added or removed
- all dependency changes go through a PR for easier tracking and review: licenses are then updated at the same time the dependency is bumped in the build system
- rerun the set of scripts to add the new dependency; in most cases it is just an overwrite of the original file
- fill in the required fields and check the licenses against the authorized list in the JSON schema
- the reviewer adds a layer of validation to the new dependency. If a question arises about the new dependency, other people can help. The reviewer is usually the team lead or team manager, who also assesses the need to change the dependency
- a CI job that checks and diffs the dependencies, as highlighted above
That way, we ensure that licenses are kept up to date during the software development cycle, at the same time as the dependencies change. Moreover, each team is now responsible for updating the dependencies in its scope. The process has been “scaled up” from one person to several teams and developers.
There is no longer a rush at the very end, just before a release, to check and update everything.
Documentation update
When the process was still manual, a set of CSV files was extracted from the Google Spreadsheet. They were then copied manually into a directory and processed via a set of small scripts to generate the documentation on the website.
Because the files now live as JSON alongside the software source code instead of in a Google Spreadsheet, it was easier to automate extracting the data to other formats. The script was adapted to read the JSON files instead of CSVs. As the fields were almost the same, few modifications were needed.
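As a sketch, the generation step could look like this, reusing the hypothetical directory and field names from the earlier examples; the markdown table output is just for illustration:

```python
import json
from pathlib import Path

# Gather all normalized dependency records across languages.
rows = []
for path in sorted(Path("licenses").glob("*.json")):
    rows.extend(json.loads(path.read_text()))

# Render one markdown table row per dependency, sorted for stable output.
lines = ["| Name | Version | License(s) | Homepage |",
         "| --- | --- | --- | --- |"]
for dep in sorted(rows, key=lambda d: (d["name"], d["version"])):
    lines.append("| {} | {} | {} | {} |".format(
        dep["name"],
        dep["version"],
        ", ".join(dep.get("licenses", [])),
        dep.get("homepage", ""),
    ))

Path("docs/third-party.md").write_text("\n".join(lines) + "\n")
```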
The documentation is now generated live and the manual process has been removed.
Conclusion
Moving from a manual process to check and update third party dependencies to an automated one took a few months.
It was not full-time work, but something done in the background, a few hours every week or so, by several people. There was a lot of trial and error to ensure that we had the right list of dependencies for all our languages and manual additions, along with tedious checks. But in the end, even if it's not perfect, we believe we have reached our goals:
- automating the whole process so that it’s also done on each dependency update
- involving more people in the checks to remove errors
- scaling with other people and teams, so that they also have ownership of their dependencies