GitHub is under automated attack by millions of cloned repositories filled with malicious code

C sharp programming language source code example on monitor and bokeh background
(Image credit: islander11 via Getty Images)

GitHub has become a vital resource for programmers the world over, and an extensive knowledge base and repository for open-source coding projects, data storage and code management. However, the site is currently undergoing an automated attack involving the cloning and creation of huge numbers of malicious code repositories, and while the developers have been working to remove the affected repos, a significant amount are said to survive, with more uploaded on a regular basis.

An unknown attacker has managed to create and deploy an automated process that forks and clones existing repositories, adding its own malicious code which is concealed under seven layers of obfuscation (via Ars Technica). These rogue repositories are difficult to tell from their legitimate counterparts, and some users unaware of the malicious nature of the code are forking the affected repos themselves, unintentionally adding to the scale of the attack.

Once a developer makes use of an affected repo, a hidden payload begins unpacking seven layers worth of obfuscation, including malicious Python code and a binary executable. The code then sets to work collecting confidential data and login details before uploading it to a control server.

Research and data teams at security provider Apiiro have been monitoring a resurgence of the attack since its relatively minor beginnings back in May of last year. And while the company says that GitHub has been quickly removing the affected repositories, its automation detection system is still missing many of them, and manually uploaded versions are still slipping the net. 

Given the current scale of the attack, said by the researchers to be in the millions of uploaded or forked repositories, even a 1% miss-rate still means potentially thousands of compromised repos still on the site.

While the attack was initially somewhat small-scale when it was first documented, with several packages detected on the site with early versions of the malicious code, it has gradually developed in size and sophistication. The researchers have identified several potential reasons for the success of the operation thus far, including the overall size of GitHub's user base and the developing complexity of the technique.

Your next upgrade

Nvidia RTX 4070 and RTX 3080 Founders Edition graphics cards

(Image credit: Future)

Best CPU for gaming: The top chips from Intel and AMD.
Best gaming motherboard: The right boards.
Best graphics card: Your perfect pixel-pusher awaits.
Best SSD for gaming: Get into the game ahead of the rest.

What's really intriguing here is the combination of sophisticated automated attack methods and simple human nature. While the methods of obfuscation have become increasingly complex, the attackers have relied heavily on social engineering to confuse developers into picking the malicious code over the real one and unintentionally spreading it onwards, compounding the attack and making it much harder to detect.

As things stand this method seems to have worked remarkably well, and while GitHub has yet to comment on the attack directly, it did issue a general statement reassuring its users that "We have teams dedicated to detecting, analyzing, and removing content and accounts that violate our Acceptable Use Policies. We employ manual reviews and at-scale detection that use machine learning and constantly evolve and adapt to adversarial attacks".

The perils of becoming popular, it seems, have manifested themselves here. While GitHub remains a vital resource for developers worldwide, its open-source nature and huge user base appears to have left it somewhat vulnerable, although given the effectiveness of the method, it comes as no surprise that solving the issue entirely seems to be an uphill battle that GitHub has yet to overcome.

Andy Edser
Hardware Writer

Andy built his first gaming PC at the tender age of 12, when IDE cables were a thing and high resolution wasn't. After spending over 15 years in the production industry overseeing a variety of live and recorded projects, he started writing his own PC hardware blog for a year in the hope that people might send him things. Sometimes they did.

Now working as a hardware writer for PC Gamer, Andy can be found quietly muttering to himself and drawing diagrams with his hands in thin air. It's best to leave him to it.