AI that writes code promises to cut development costs and free programmers to focus on creative, less repetitive tasks. Research labs like OpenAI and Alphabet-backed DeepMind have already developed powerful code-generating AI, but many of the most capable systems are not open source. For example, the training data for OpenAI’s Codex, which powers GitHub Copilot, is not publicly available, preventing researchers from fine-tuning the model or studying other aspects of it.
To remedy this, Carnegie Mellon University researchers Frank Xu, Uri Alon, Graham Neubig, and Vincent Hellendoorn developed PolyCoder, a model based on OpenAI’s GPT-2 architecture that was trained on 249GB of code in 12 programming languages. Although PolyCoder does not match the best AI code generators in every task, the researchers claim it writes C code more accurately than any known model, including Codex, reports Apptractor.
“When GitHub Copilot came out last summer, it became clear that these large language models could be very useful in helping developers and increasing their productivity. But no model of even close to that scale was publicly available,” the researchers said.
“PolyCoder started with Vincent just trying to see what was the biggest model that could be trained on our lab server. We ended up making a model with 2.7 billion parameters… and that model was head and shoulders above other code-centric models that were publicly available at the time.”
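For readers who want to experiment, open checkpoints of this kind can typically be loaded with the Hugging Face transformers library. The sketch below is a minimal illustration of prompting a causal language model to complete a C function; the checkpoint identifier is a placeholder for whatever PolyCoder release you actually use, not an official name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/some-polycoder-checkpoint"  # hypothetical id; substitute a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Prompt with the start of a C function and let the model complete it.
prompt = "/* Compute the factorial of n */\nlong factorial(long n) {"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=64,                    # length of the completion
    do_sample=True,
    temperature=0.2,                      # low temperature keeps completions conservative
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```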
Code generation
More and more organizations are exploring AI code generation. At its Build conference in May 2021, Microsoft detailed a new Power Apps feature that uses OpenAI’s GPT-3 language model to help people build formulas. Intel’s ControlFlag can autonomously detect errors in code. And Facebook’s TransCoder translates code from one programming language to another.
DeepMind recently announced AlphaCode, which the lab claims is one of the first code generation systems able to compete with human programmers. DeepMind reported that in contests hosted on Codeforces, a competitive programming platform, AlphaCode ranked within the top 54.3% of participants.
But the Carnegie Mellon University researchers note that “almost no one” other than well-resourced companies can train models the size of AlphaCode or Codex. A 2020 study by startup AI21 Labs estimated the cost of training a text generation model with 1.5 billion parameters, about half the size of PolyCoder, at between $80,000 and $1.6 million. Copilot, for comparison, is powered by a 12-billion-parameter model.
“Large tech companies don’t publish their models publicly, which is really holding back scientific research and the democratization of such large language code models,” the researchers say. “To some extent, we hope that our open source efforts will convince others to do the same. But the broader view is that the community should be able to train these models on its own. Our model has pushed the limits of what you can train on a single server – anything more requires a cluster of servers, which drives up the cost dramatically.”
Openness in code generation
While developing PolyCoder, the researchers also studied and compared the performance of various code-generating AI systems, including Codex. They found that models trained mostly on English text with only a small amount of code turned out to be very good at writing programs, possibly because they picked up code-related knowledge from resources such as the developer Q&A website Stack Overflow that were included in their training data.
“A promising approach to building robust code generation models appears to be learning from a variety of sources of programming knowledge, including code in a variety of programming languages, as well as code-related texts from the Internet,” the researchers say.
However, the researchers raised concerns that models such as PolyCoder could be prompted to generate buggy programs, including ones containing hard-to-find security vulnerabilities. They fear that in the future, attackers could hide malicious behavior in code generation models that only surfaces under certain conditions, such as a trigger keyword, or could seed public code with vulnerabilities that code generation models then pick up.
As one way to combat this, they suggest open-sourcing models, which would let researchers look for such flaws, notes NIX Solutions. As an added benefit, open source would allow developers to personalize models or adapt them to new programming languages through fine-tuning, which is far cheaper than training a model from scratch.
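Such fine-tuning follows a fairly standard recipe. The sketch below shows one minimal way to adapt an open code model to a new language with the Hugging Face Trainer; the base checkpoint id, the corpus/ directory, and the .zig file extension are illustrative assumptions, not the PolyCoder authors’ actual setup.

```python
from pathlib import Path

from torch.utils.data import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "org/some-open-code-model"  # hypothetical checkpoint id


class CodeDataset(Dataset):
    """Tokenizes a source-code corpus into fixed-length blocks for causal LM training."""

    def __init__(self, files, tokenizer, block_size=512):
        text = "\n".join(p.read_text(errors="ignore") for p in files)
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        # Split the token stream into non-overlapping blocks.
        n_blocks = len(ids) // block_size
        self.blocks = ids[: n_blocks * block_size].view(n_blocks, block_size)

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, i):
        return {"input_ids": self.blocks[i]}


tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Gather source files in the new target language (the extension is just an example).
files = sorted(Path("corpus").glob("**/*.zig"))
dataset = CodeDataset(files, tokenizer)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM: labels = inputs

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-code-model",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```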
“While the industry currently has much more computing power, there is still a lot of room for innovation from academia and the research community, including the creation of smaller, faster personalized models that don’t depend on internet connectivity, useful applications, automatic code review, and much more. These are tasks for which the research community has created promising prototypes that could really benefit from the capabilities of such very large language models,” the researchers say. “Decentralized training, where multiple teams come together to train a large model jointly, can make a big difference here. Research grants and collaboration between companies and academia can also help.”