Microsoft has collected 13 million work items and bugs since 2001 and used that data to create a machine learning model to tackle software bugs. According to the company, the model distinguishes between security bugs and non-security bugs 99% of the time and identifies high-priority bugs 97% of the time.
“At Microsoft, 47,000 developers generate nearly 30,000 bugs per month. These items are stored in over 100 AzureDevOps and GitHub repositories. To better tag and prioritize bugs at this scale, we simply couldn’t apply more people to the problem,” wrote Scott Christiansen, Microsoft’s senior security program manager, and Mayana Pereira, data and application specialist, in a post. “Large volumes of semi-organized data are great for machine learning.”
Early in the project, Microsoft knew it needed to look for data that was general enough rather than fitted to only a small number of examples, look for data that didn’t violate privacy rules, and consider generating data in a simulated environment to overcome problems with data collected in the wild.
As part of the process, security experts vetted the training data before it was fed into the machine learning model, and statistical sampling was used to give those experts a manageable amount of data to review.
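The sampling step can be pictured with a short sketch. The article does not say which sampling scheme Microsoft used, so the code below assumes a simple per-label random sample; the function name `sample_for_review`, the `label` field, and the batch sizes are all hypothetical.

```python
import random
from collections import defaultdict


def sample_for_review(items, per_label, seed=0):
    """Pick up to `per_label` random items from each label group.

    Assumption: each item is a dict with a "label" key. This is an
    illustrative scheme, not the one Microsoft describes.
    """
    rng = random.Random(seed)  # fixed seed so the batch is reproducible
    by_label = defaultdict(list)
    for item in items:
        by_label[item["label"]].append(item)

    sample = []
    for label in sorted(by_label):  # sorted for a stable group order
        group = by_label[label]
        k = min(per_label, len(group))
        sample.extend(rng.sample(group, k))  # sampling without replacement
    return sample


# Hypothetical data: 1,000 bugs, one in ten labeled "security".
bugs = [
    {"id": i, "label": "security" if i % 10 == 0 else "non-security"}
    for i in range(1000)
]
review_batch = sample_for_review(bugs, per_label=25)
print(len(review_batch))  # 50: 25 from each of the two labels
```

Sampling per label rather than from the whole pool keeps the rarer security bugs from being crowded out of the review batch, which is one reason a reviewer might prefer it.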
“Our classification system must function as a security expert, which means that the subject matter expert is as important to the process as the data scientist,” wrote Christiansen and Pereira.
Collaboration between subject matter experts and data scientists was key both to identifying all types and sources of data and to the review process once viable data was identified. Data scientists select a data modeling technique, train the model, and evaluate its performance, while security experts evaluate the model in production by monitoring the average number of bugs and manually reviewing a random sample of bugs, Microsoft explained.
In the end, the model was able to classify the bugs accurately and, in a second step, was able to apply severity labels to the security bugs.
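The two-step flow described above can be sketched as follows. The article does not describe the models themselves, so the two classifiers here are hypothetical keyword-based stand-ins passed in as plain functions; only the control flow (severity is assigned only to bugs first classified as security bugs) reflects the text.

```python
from typing import Callable, Optional


def classify_bug(
    title: str,
    is_security: Callable[[str], bool],
    severity_of: Callable[[str], str],
) -> dict:
    """Step 1: security vs. non-security. Step 2: severity, security bugs only."""
    if not is_security(title):
        return {"security": False, "severity": None}
    return {"security": True, "severity": severity_of(title)}


# Hypothetical stand-ins for trained models; real classifiers would be
# learned from expert-approved data rather than hard-coded keywords.
def toy_is_security(title: str) -> bool:
    return any(word in title.lower() for word in ("overflow", "injection", "xss"))


def toy_severity(title: str) -> str:
    return "critical" if "remote" in title.lower() else "non-critical"


for title in (
    "Button misaligned on settings page",
    "Remote buffer overflow in parser",
    "XSS in comment field",
):
    print(title, "->", classify_bug(title, toy_is_security, toy_severity))
```

Keeping the two steps as separate functions means each could be retrained or replaced independently, which fits the staged description in the article.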
“The process didn’t end once we had a model that worked. To ensure that our bug modeling system keeps pace with Microsoft’s ever-evolving products, we perform automated retraining. The data is always approved by a security expert before the model is retrained, and we constantly monitor the number of bugs generated in production,” wrote Christiansen and Pereira.
Additional details are available here.