Hi, my name is Wasi Ahmad and I am presenting
our work, “A Transformer-based Approach
for Source Code Summarization”.
This is joint work with Saikat Chakraborty,
Baishakhi Ray from Columbia University, and
my advisor Kai-Wei Chang from UCLA.
Source code summarization refers to the task
of creating human-readable summaries that
describe the functionality of a program.
With the progress of natural language generation
using neural sequence-to-sequence learning,
recent approaches in literature frame code
summarization as translating a piece of source
code into a short natural language description.
For example, given this Python source code
snippet, a code summarization model should
be able to generate a summary, similar to
the human-written one.
In this work, we study the Transformer, a
sequence generation model that has been found
effective in many natural language generation
applications but hasn’t been explored in
source code summarization.
A notable amount of prior works in source
code summarization has leveraged recurrent
neural networks to learn source code representations.
Transformer, in contrast, entirely relies
on the self-attention mechanism, which is
well-known for its effectiveness in capturing
long-range dependencies.
Such a characteristic is essential to learn
program representations.
Besides, the order of source code tokens plays
an important role in embedding the code structure
into learned representations.
Hence, in this work, we study different positional
encoding schemes to find an effective way
to encode the source code structure.
We study the impact of three different forms
of positional encoding in learning source
code representations.
The absolute position encoding scheme treats
source code as a linear sequence of code tokens,
while the relative position encoding considers
source code as a fully connected graph, either
directed or undirected.
We show an example here.
For the expression, “a + b”, absolute
position encoding uses the tokens’ index
to form their respective position representations.
The relative position encoding, on the other
hand, models the pairwise distance between
code tokens.
We consider two variants of relative position
encoding, wherein one variant we compute relative
distances based on whether a token is on the
left or right of the target token.
In the second variant, we simply ignore the
direction, treating the source code as an
undirected graph.
We conduct experiments on two well-studied
datasets in Java and Python programming languages,
collected from Github.
The datasets are pre-processed following a
prior work.
In an additional preprocessing step, we sub-tokenize
the source code tokens based on Camel Case
and snake case that improves code summarization
significantly.
We use BLEU, METEOR, and Rouge-L as the evaluation
metrics.
We compare the Transformer with six state-of-the-art
approaches as baselines that utilize RNN-based
sequence-to-sequence learning.
The baseline approaches model different forms
of knowledge about source code, such as abstract
syntax tree structure, API knowledge.
They also cover a variety of techniques, including
reinforcement learning and dual learning.
In this work, we consider a simple setup where
the Transformer takes the source code as input
and learns to generate the summary with supervised
training.
The overall results show that the Transformer
outperforms all the baseline approaches by
a significant margin.
Next, we show the impact of using absolute
positions while encoding source code tokens
and the words in the natural language summary.
Encoding the absolute position of the summary
is necessary as the Transformer learns to
generate summaries word-by-word.
In contrast, we can see that the use of absolute
position encoding for source code hurts the
performance, which illustrates that treating
source code as a linear sequence of tokens
is not accurate.
Since source code has a non-sequential structure,
we hypothesize relative position encoding
for source code tokens would result in better
code representations that can improve summarization.
Our experimental findings prove our hypothesis.
In this bar plot, we compare the two forms
of relative position encoding, wherein one
form, source code is treated as a directed
graph, while in other, it is considered as
an undirected graph.
The results suggest that source code should
be viewed as a directed graph.
In other words, while modeling the pairwise
relationship between code tokens, whether
a token is in the left or right of the target
token should be emphasized.
We study Transformer for source code summarization
that outperforms state-of-the-art approaches.
The code for reproducing the experiments is
in Github.
We hope our work would be considered as a
baseline in future works.
Thank you for listening!
