Login and get codingIn Bite 312 we learned how to identify builtin functions, modules, and keywords when provided with a list of strings.
In this Bite we take it a step further considering the following scenario: What if somebody gives you some python source code and asks you to identify which modules it uses and determine how frequently functions from those modules are used?
For a few small files you can solve this problem manually using python's regex facility or with command line tools like
grep
orawk
. However the problem becomes more difficult to manage when you consider a library which may have many large modules.The
ast
module can help you in this scenario.Intro to the
ast
moduleThe
ast
module transforms any python source code into a tree representation of the underlying python grammar. Such tree is called Abstract Syntax Tree (AST), and it is a data structure used to reason about programming languages grammar in general.Let's see the
ast
module in action on a simple example:>>> import ast
>>> tree = ast.parse("one_plus_two = 1+2")
>>> ast.dump(tree)
"Module(body=[Assign(targets=[Name(id='a', ctx=Store())], value=BinOp(left=Num(n=1), op=Add(), right=Num(n=2)))])"...ghhh, pretty condensed output, isn't it? Let's rearrange it manually to make it more readable:
"Module(
body=[
Assign(
targets=[
Name(id='one_plus_two', ctx=Store())
],
value=BinOp(
left=Num(n=1),
op=Add(),
right=Num(n=2))
)
]
)"
Module
|____ Assign
|___ Name
|___ BinOp
|__ Num
|__ Add
|__ NumWhat can we observe?
- The outer most node
Module
is the root of our tree.- The
Module.body
attribute is a list of nodes, one for each instruction of the input program. Our example program consists of one assignment instruction, soModule.body
contains only oneAssign
node.- The
Assign
node has two attributes,Assign.targets
andAssign.value
, which describe the left and right-hand sides of the assignment with respect to the equal sign.- The
Assign.targets
attribute is a list containing the different destinations of the assignment operation. In our case only the nodeName
is in the list since the assignment target is a single variable. Notice also that theName.id
attribute of the target node contains the name of the variable"one_plus_two"
as indeed specified in the source code.- The
Assign.value
contains aBinOp
node since the right hand side of the assignment is a binary operation between two operands. The attributesBinOp.left
andBinOp.right
specify the left and right operands for theBinOp.op
operation. Remember, the expression on the right hand side of the assignment was1 + 2
, so it is not surprising thatBinOp.left
andBinOp.right
nodes areNum
nodes with values1
and2
respectively.The
ast
module provides a large variety of nodes to cover the different aspects of the Python grammar (for more details, see the official python 3.9 doc of theast
module, as well as the external documentation Green Tree Snake as suggested by the official doc).Overall, the rich functionality of the
ast
module makes it easy to work with an AST data structure to both inspect and manipulate python source code programmatically.As a concrete example,
mutpy
, which we also introduced in Bite 244, is a mutation testing tool that artificially alters the code under testing to broaden the unit tests in an automated fashion. Under the hood, it uses theast
module to (i) create an AST, (ii) apply mutations, and (iii) convert the tree into a code object via the builtin functioncompile()
before executing it (seemutpy.mutpy.utils.create_module()
for details).Our objective
In this bite we are going to see how the
ast
module makes it easy to visit an AST tree, and discoverimport
andfrom ... import
statements, and function calls. Specifically, we will work withast.Import
,ast.ImportFrom
, andast.Call
node types.Given a python program source code, determine the following:
- Which builtin functions are used.
- Which modules are imported.
- Which calls make direct use of imported modules or submodules.
To do so, we are going to use the
ast
module. Specifically, we take advantage of the NodeVisitor class to create our ownAstVisitor
class to visit the nodes of an AST.We already provide you a template of the
AstVisitor
class with a.parse()
method to create an AST tree and trigger a visit, and we registered three callbacks that are invoked when three specific node types are encountered during parsing:-
.visit_Import()
called when visiting a node related toimport xyz
statements (see theast.Import
node documentation).-
.visit_ImportFrom()
called when visiting a node related tofrom abc import xyz
statements (see theast.ImportFrom
node documentation).-
.visit_Call()
called during when visiting a node related to function call (see theast.Call
node documentation).What is left to do is to complete the implementation of the three methods mentioned so to "track" both which builtins functions (if any) are used, which modules are imported (if any) and which calls they related to (if any). To be more specific, we want to know both the list of builtins / methods, but also the line number where each of tracked objects are located (see the
ast.AST
documentation). To report on the tracking you need to complete the following methods defined in the template:-
.builtins()
returns a list of string, one for each builtin functions tracked.-
.builtins_lineno()
returns a dictionary whose keys are builtins functions, and values are the line number where the related call occurred.-
.modules()
returns a list of string, one for each imported module.-
.modules_lineno()
returns a dictionary whose keys are module object calls, and values are the line number where the related call occurred.The tracking you need to implement is subject to the following rules:
- __rule-1__: each of the methods support an optional input parameter named
valid
to specify a list of builtins or modules to restrict the search to. Whenvalid=None
, all builtins and modules will be reported in the output.- __rule-2__: resolve aliased import statements and report their true names.
- __rule-3__: resolve submodule imports and express the module path using dotted notation.
- __rule-4__: when considering modules,
valid=
can contain alias names, but they need to be expanded to full names to comply with __rule-2__;An example
>> code = '''
import pandas as pd
import numpy as np
from argparse import ArgumentParser
print("this is a test")
parser = ArgumentParser()
len("12345")
df_tmp = pd.DataFrame(range(10), column=["x"])
df_tmp.loc[:, "y"] = df_tmp[x] + 1
np.random.random()
'''
>> vst = AstVisitor(code)
>> vst.builtins()
["print", "range", "len"]
>> vst.modules()
["argparse", "numpy", "pandas"]
>> vst.builtins_lineno()
{"print": [6], "len": [9], "range": [11]}
>> vst.modules_lineno()
{"argparse.ArgumentParser": [8], "pandas.DataFrame": [11], "numpy.random.random": [14]}Considering the output of
vst.modules_lineno()
, notice the following:- The call
ArgumentParser()
is resolved intoargparse.ArgumentParser
to comply with __rule-3__.- The
pd.DataFrame()
andnp.random.random()
calls are resolved intopandas.DataFrame
andnumpy.random.random
respectively to comply with __rule-2__.Let's see a few more examples using the code from the previous example, but this time using
valid=
to constrain the search output:>> first = vst.builtins(valid=["iter"])
>> first
[]
>> second = vst.builtins_lineno(valid=["iter"])
>> second
{}
>> third = vst.builtins_lineno(valid=["iter", "range"])
>> third
{"range": [11]}
>> fourth = vst.modules(valid=["np"])
>> fourth
["numpy"]
>> fifth = vst.modules(valid=["numpy"])
>> fifth
["numpy"]
>> sixth = vst.modules_lineno(valid=["numpy"])
>> sixth
{"numpy.random.random": [14]}Notice the following:
-
first
andsecond
are empty sinceiter
is not used in the source code.-
third
contains onlyrange
sinceiter
is not used in the source code.-
fourth
containsnumpy
even if we specifiedvalid=["np"]
to comply with __rule-2__ and __rule-4__.-
fifth
is identical tofourth
as expected, but notice that this time we specifiedvalid=["numpy"]
while before we usedvalid=["np"]
.-
sixth
contains only calls related tonumpy
as requested, but notice how we specifiedvalid=["numpy"]
while the original code has the statementnp.random.random()
, hence the search takes care of aliasing to comply with __rule-2__.Final notes:
- Please refer to the python3.9 documentation or, as suggested by the doc itself, the external Green Tree Snake documentation. The official documentation for python version lower than 3.9 is not very complete. If you are coding outside the pybites platform, you should not have issues in solving this bite with any python versions higher than 3.6.
- Tracking name aliasing is the most complex part to implement successfully and depends on you understanding what the various node attributes mean.
- Beware of expressions like
module.submodule.submodule.func()
which may be present when handlingCall
nodes.- Lastly, I recommend you spend some time with the python console trying to parse simple statements and investigate the AST tree manually. The
ipython
console console is great for this sort of exploration.Good luck!
13 out of 13 users completed this Bite.
Will you be the 14th person to crack this Bite?
Resolution time: ~123 min. (avg. submissions of 5-240 min.)
Our community rates this Bite 9.0 on a 1-10 difficulty scale.
» Up for a challenge? 💪