Vulnerable Contributions at Scale
I read an interesting paper the other day that investigates how secure, or indeed insecure, the code contributions from GitHub Copilot are. The paper is titled "Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions".
To quote a main conclusion from the abstract:
We explore Copilot’s performance on three distinct code generation axes—examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,689 programs. Of these, we found approximately 40% to be vulnerable.
The paper goes into more detail, but one striking example demonstrates how you could prompt GitHub Copilot to generate code that checks whether a username/password combination exists in a database. Copilot then proceeds to generate Python code that looks like:
That's a recipe for a SQL injection right there. CI tools like Bandit might catch it, but a novice programmer might accidentally learn that this code represents good practice. There are many kinds of issues like this, and the paper covers them in a fair amount of depth.
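To make the risk concrete, here is a minimal sketch of the general pattern. It is not the code from the paper or actual Copilot output. The table layout, the function names `login_unsafe` and `login_safe`, and the use of `sqlite3` are all assumptions made for illustration.

```python
import sqlite3


def make_db():
    # Hypothetical in-memory table, for illustration only.
    # Real systems should store password hashes, never plaintext.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "s3cret"))
    return conn


def login_unsafe(conn, username, password):
    # Vulnerable: user input is pasted straight into the SQL string,
    # so a crafted value can change the meaning of the query.
    query = (
        "SELECT 1 FROM users "
        f"WHERE username = '{username}' AND password = '{password}'"
    )
    return conn.execute(query).fetchone() is not None


def login_safe(conn, username, password):
    # Safer: placeholders keep the input as data, never as SQL.
    query = "SELECT 1 FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone() is not None


conn = make_db()
payload = "' OR '1'='1"  # classic injection string used as the "password"

print(login_unsafe(conn, "alice", payload))  # True: login bypassed
print(login_safe(conn, "alice", payload))    # False: treated as a wrong password
```

The only difference between the two functions is whether the query string is built from user input or passed separately as parameters, which is exactly the kind of detail a novice could miss when accepting a suggestion.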
It seems 40.73% of all Copilot suggestions suffer from a security issue. However, Copilot suggestions are ordered: when you prompt it to generate code, you get a ranked list to choose from. You could argue that the top option is particularly important to get right, since novice users may interpret it as the best suggestion. Across all security concerns and languages, it seems that 39.33% of the top suggestions had a security flaw.
To quote the paper:
Copilot is trained over open-source code available on GitHub, we theorize that the variable security quality stems from the nature of the community-provided code.
This, however, is a quote with many angles to consider. A very valid one from the paper:
Another aspect of open-source software that needs to be considered with respect to security qualities is the effect of time. What is ‘best practice’ at the time of writing may slowly become ‘bad practice’ as the cybersecurity landscape evolves. Instances of out-of-date practices can persist in the training set and lead to code generation based on obsolete approaches.
The paper reflects my own experience with Copilot. It may be a useful tool for handling some boilerplate, but it should never be trusted blindly, especially on the more mission-critical parts of the codebase. Or, as the paper puts it:
while Copilot can rapidly generate prodigious amounts of code, our conclusions reveal that developers should remain vigilant (‘awake’) when using Copilot as a co-pilot. Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities.
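As a rough idea of what "security-aware tooling" can look like at its very simplest, here is a toy check that flags `.execute(...)` calls whose query is built with string formatting. This is only a sketch: the function name `find_formatted_sql` is hypothetical, the heuristic is crude (it will miss plenty and can raise false alarms), and real tools such as Bandit are far more thorough.

```python
import ast


def find_formatted_sql(source):
    """Return line numbers of .execute(...) calls whose first
    argument is an f-string, a +/% expression, or a .format() call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        is_execute_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
        )
        if not is_execute_call:
            continue
        first = node.args[0]
        is_fstring = isinstance(first, ast.JoinedStr)
        is_concat_or_percent = isinstance(first, ast.BinOp) and isinstance(
            first.op, (ast.Add, ast.Mod)
        )
        is_format_call = (
            isinstance(first, ast.Call)
            and isinstance(first.func, ast.Attribute)
            and first.func.attr == "format"
        )
        if is_fstring or is_concat_or_percent or is_format_call:
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical snippet to scan; it is parsed, never executed.
snippet = '''
cur.execute(f"SELECT * FROM users WHERE name = '{name}'")
cur.execute("SELECT * FROM users WHERE name = ?", (name,))
cur.execute("SELECT * FROM users WHERE name = '%s'" % name)
'''

print(find_formatted_sql(snippet))  # [2, 4]
```

Running something in this spirit in CI will not make generated code safe, but it gives a second pair of eyes for the moments when the developer is not fully awake.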