At work, the software we develop and ship includes a lot of third party dependencies, so we must keep track of their licenses. The process used to be very manual; we automated it so that it could also scale with the company’s growth.
Status a few years ago
The software we develop is composed of many third party dependencies in different languages:
- Java
- JavaScript
- Python
- a few static data files
A few years ago, the process to track the version and license(s) of each dependency was the job of a single person. The dependency list was tracked in a Google Spreadsheet with one sheet per language or part of the software. Each sheet had a set of columns, the main ones being:
- dependency name
- version
- display name
- repository URL
- homepage
- copyrights if available
- license(s): a dependency could be available under different licenses
Before each release, each dependency was manually tracked and compared with the previous software version to check what was added, updated, or removed. The Spreadsheet was updated accordingly and then processed in other tools to extract data to be published in different formats:
- as HTML on the website
- as a global notice file shipped with the software, which is a concatenation of all license files
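Generating that notice file is conceptually simple; here is a minimal sketch, assuming the individual license texts are collected under a licenses/ directory (a hypothetical layout):

```python
from pathlib import Path

# Hypothetical layout: one license text per dependency under licenses/.
LICENSE_DIR = Path("licenses")
NOTICE_FILE = Path("NOTICE.txt")

def build_notice() -> None:
    """Concatenate every license file into a single global notice file."""
    parts = []
    for license_file in sorted(LICENSE_DIR.glob("*.txt")):
        parts.append(f"--- {license_file.stem} ---\n{license_file.read_text(encoding='utf-8')}")
    NOTICE_FILE.write_text("\n\n".join(parts), encoding="utf-8")

if __name__ == "__main__":
    build_notice()
```

The hard part was never the concatenation itself, but keeping the inputs accurate, as the problems below show.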
This process had the following problems:
- very tedious work that took quite some time before each release to ensure everything was done correctly
- because everything was manual, it was also very error-prone
- the Spreadsheet was disconnected from the software development lifecycle and only linked to releases
As more developers joined, more features were added, along with new dependencies. Keeping the process in that state became impossible.
Reworking the process
The process had to be reworked. We wanted to achieve at least the following:
- not tied to a single person, so that the burden is shared by more people and can scale further
- follow the development lifecycle to avoid doing everything before a release
- automate to prevent manual errors and add automatic checks
Based on the above requirements, devising a plan was quite simple:
- extract information from the Google Spreadsheet
- store it in the repository using a text format
- automate dependency extraction for all components
- update licenses since the previous release to reflect the current development state
- improve the process and add CI jobs to check that the data stays up to date
- update output formats to be published: notice file and documentation
- scale up with developers: provide training
Extracting data from the Spreadsheet
The extraction was straightforward:
- extract each sheet into its own JSON file
- normalize the fields between all JSON files so that they have a common data model
- commit them into our repository to serve as a basis
A few experiments were made with the format to find the most developer-friendly and easiest to use. We settled on JSON:
- widely used and can be parsed and read by virtually any tool
- fairly readable when prettified
- items and fields are ordered by name and version, so files can easily be compared from one release to another
To sum up, we have one file per language, plus one for manually integrated components and data files.
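For illustration, a normalized entry looks roughly like the following. The field names mirror the Spreadsheet columns listed earlier; the exact shape shown here is an assumption, not our literal schema:

```json
{
  "name": "lodash",
  "version": "4.17.21",
  "displayName": "Lodash",
  "repository": "https://github.com/lodash/lodash",
  "homepage": "https://lodash.com",
  "copyright": "Copyright OpenJS Foundation and other contributors",
  "licenses": ["MIT"]
}
```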
JSON schema
After normalizing the data, a JSON schema was added to validate mandatory fields and check each license type against an explicit list of authorized licenses.
That way, no new license type could be added without updating and validating the schema. The format is not set in stone and has evolved a bit since the initial version. It is quite easy to update and track changes due to being versioned in the Git repository.
The SPDX identifier is used for the license type field.
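As a minimal sketch of what this validation amounts to, using the Python jsonschema package (the field names and the license allow-list below are illustrative, not our actual schema):

```python
import json

from jsonschema import validate  # pip install jsonschema

# Illustrative schema: mandatory fields plus an explicit allow-list of
# SPDX identifiers. The real field names and license list are assumptions.
SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["name", "version", "licenses"],
        "properties": {
            "name": {"type": "string"},
            "version": {"type": "string"},
            "repository": {"type": "string"},
            "licenses": {
                "type": "array",
                "items": {"enum": ["MIT", "Apache-2.0", "BSD-3-Clause", "EPL-2.0"]},
            },
        },
    },
}

# Hypothetical file name; raises jsonschema.ValidationError on failure.
with open("java-dependencies.json", encoding="utf-8") as f:
    validate(instance=json.load(f), schema=SCHEMA)
```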
Copied components
A few third party components are copied into the software. There are several reasons for that:
- only one function is needed from a whole library
- some static data are needed, for example geolocation data
- a whole library may be copied at a specific version because there is no available release
The company has a duty to track these dependencies for license and security purposes. They are added and updated manually in their own files.
Automating dependency extraction
Dependency extraction had been done manually until this point. As explained above, it was a very tedious process, and it was time to automate it.
As the software is composed of different languages, different scripts are used for each one:
- Python: a custom script extracts packages because the software is shipped with several custom Python environments. This part is to be reworked later to move to `pip freeze`
- Frontend:
  - license-checker reads several `package-lock.json` files
  - a custom script also extracts dependencies from `bower.json`, which we still use
- Java: this was the most complex part. The software does not depend on a single Gradle file but is a complex multi-project build, with build-time (not shipped) and runtime dependencies. OWASP Dependency-Check was first used and integrated into Gradle. At a later stage, we moved to a single script that scans the JARs along with the Gradle output to produce the exhaustive list of bundled JARs
- Manual data: this is a list of components manually added into the software. It is not automated and is updated by hand
The generated files are then normalized into dedicated files by language that match the Google Spreadsheet format described above.
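As an example of what this normalization amounts to, here is a rough sketch for the frontend case. It assumes license-checker's --json output, which maps name@version keys to objects carrying licenses and repository fields; our real script and target field names may differ:

```python
import json

def normalize(raw: dict) -> list[dict]:
    """Convert `license-checker --json` output into the common data model."""
    entries = []
    for key, meta in raw.items():
        # Keys look like "name@version"; rpartition also handles
        # scoped packages such as "@babel/core@7.24.0".
        name, _, version = key.rpartition("@")
        licenses = meta.get("licenses", [])
        if isinstance(licenses, str):
            licenses = [licenses]
        entries.append({
            "name": name,
            "version": version,
            "repository": meta.get("repository", ""),
            "licenses": licenses,
        })
    # Sort by name then version so diffs between releases stay readable.
    return sorted(entries, key=lambda e: (e["name"], e["version"]))

with open("license-checker-output.json", encoding="utf-8") as f:
    print(json.dumps(normalize(json.load(f)), indent=2))
```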
Quite some time was spent checking each language.
Dedicated tool to download and update licenses
Once these files were extracted, we wanted to automate the process of updating the license, checking the type, and downloading the license file.
As most third party software is now hosted on GitHub, a tool was developed to:
- read a normalized dependency file
- for each dependency:
- query GitHub if the repository URL is available
- check the license type from the metadata
- check the tag based on the version using a few heuristics: the license at the dependency version is needed because some software changes its license from version to version (e.g., Elasticsearch)
- use heuristics to check if the license type in the metadata matches the license text
- download related licenses
- regenerate the normalized file with updated fields
In case something is unclear, the tool skips it or raises a warning. Some manual checks will need to be done later.
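For the GitHub lookup itself, the core of the tool boils down to something like the sketch below, using the REST API's license endpoint. The tag and version heuristics described above are omitted, and the function name is ours for illustration:

```python
import base64

import requests  # pip install requests

def fetch_license(owner: str, repo: str, token: str) -> tuple[str, str]:
    """Return (SPDX identifier, license text) for a repository's default branch.

    Uses GET /repos/{owner}/{repo}/license; checking the license at a
    specific tag, as the real tool does, is left out of this sketch.
    """
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/license",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    spdx_id = data["license"]["spdx_id"]                      # e.g. "MIT"
    text = base64.b64decode(data["content"]).decode("utf-8")  # license file body
    return spdx_id, text
```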
Automating checks
A new job has been added to our CI (Jenkins) that periodically runs the above scripts to regenerate the files. They are then compared with the current version of the repository. Changes may be due to a security update, a developer forgetting to update the licenses, etc.
The job is not run on each pull request because there is a lot of analysis done in the background for all components, and that would be too much to run on each commit automatically. There is still an opt-in label to activate it when someone knows a dependency has been upgraded.
The differences are highlighted and changed dependencies are sent to Slack for easier processing.
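The comparison step itself is a straightforward diff of the normalized files; here is a sketch, with file paths and the Slack webhook as placeholders:

```python
import json

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder

def changed_dependencies(committed_path: str, regenerated_path: str) -> list[str]:
    """List dependencies whose entry differs between the repo and a fresh scan."""
    with open(committed_path, encoding="utf-8") as f:
        old = {(e["name"], e["version"]): e for e in json.load(f)}
    with open(regenerated_path, encoding="utf-8") as f:
        new = {(e["name"], e["version"]): e for e in json.load(f)}
    changed = set(old) ^ set(new)  # added or removed entries
    changed |= {k for k in old.keys() & new.keys() if old[k] != new[k]}  # updated
    return sorted(f"{name} {version}" for name, version in changed)

diff = changed_dependencies("java-dependencies.json", "build/java-dependencies.json")
if diff:
    requests.post(
        SLACK_WEBHOOK,
        json={"text": "Dependency changes detected:\n" + "\n".join(diff)},
        timeout=30,
    )
```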
Updating the process
Now that a set of scripts was in place to extract dependencies, it was time to scale up.
A presentation of the new process was made to other developers, and it was documented in our Wiki. It can be summarized as:
- check if a dependency needs to be upgraded, added, or removed
- all dependency changes go through a pull request for easier tracking and review; licenses are then also updated when the dependency is bumped in the build system
- rerun the set of scripts to add the new dependency; in most cases, it is just an overwrite of the original file
- fill in the required fields and check that the licenses are authorized according to the JSON schema
- the reviewer adds a layer of validation to the new dependency. If a question arises about the new dependency, other people can help. The reviewer is usually the team lead or team manager who also assesses the need to change the dependency
- a CI job checks and compares the dependencies, as described above
That way, we ensure that licenses are kept up to date during the software development cycle, at the same time a dependency is changed. Moreover, each team is now responsible for updating the dependencies within their scope. The process has been scaled from one person to different teams and developers.
There is no more last-minute rush just before a release to check and update everything.
Documentation update
When the process was still manual, a set of CSV files was extracted from the Google Spreadsheet. They were then copied manually into a directory and processed via a set of small scripts to generate the documentation on the website.
Because files are now in JSON and versioned alongside the software source code instead of in a Google Spreadsheet, it is easier to automate and extract data into other formats. The script was adapted to read JSON files instead of CSV. As the fields were almost the same, only minor modifications were needed.
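The adapted script is essentially a loop over the JSON files; here is a sketch, with the directory layout and HTML output assumed:

```python
import json
from pathlib import Path

# Render one HTML table row per dependency. The dependencies/ layout and
# the markup are assumptions for this sketch.
rows = []
for dep_file in sorted(Path("dependencies").glob("*.json")):
    for dep in json.loads(dep_file.read_text(encoding="utf-8")):
        rows.append(
            f"<tr><td><a href='{dep['repository']}'>{dep['name']}</a></td>"
            f"<td>{dep['version']}</td><td>{', '.join(dep['licenses'])}</td></tr>"
        )
print("<table>\n" + "\n".join(rows) + "\n</table>")
```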
The documentation is now generated live, and the manual process has been removed.
Conclusion
Moving from a manual process for checking and updating third party dependencies to an automated one took a few months.
It was not full-time work, but was done in the background for a few hours every week or so by several people. There was a lot of trial and error to ensure that we had a correct list of dependencies for all our languages and manual additions, as well as thorough checks. But in the end, even if it is not perfect, we believe that we have achieved our goals:
- automating the entire process so that it is also performed on each dependency update
- involving more people in the checking process to reduce errors
- scaling with other people and teams so that dependency ownership is shared