Added lambda that integrates Forklift#411
Added lambda that integrates Forklift#411JaimeZepeda08 wants to merge 20 commits intoopen-lambda:mainfrom
Conversation
tylerharter
left a comment
There was a problem hiding this comment.
Good work! Haven't read everything super closely yet, but leaving some initial feedback.
Let's not have big data files here, like deps.json and wl.json. In the docs, we can describe a user would use your lambda. wget those files (from wherever they are, probably on the forklift repo), and then curl to your function.
| event = json.loads(event) | ||
|
|
||
| workload_data = event.get("workload") | ||
| if not workload_data: |
There was a problem hiding this comment.
Do we need these checks? Seems it is all in a big try/except, and you'll return a similar error below anyway
|
|
||
| Expected event format: | ||
| { | ||
| "workload": { ... workload data ... }, |
There was a problem hiding this comment.
Can you show the next level of detail (below) about how workload data is structured?
| } | ||
| """ | ||
| try: | ||
| if isinstance(event, str): |
There was a problem hiding this comment.
Why do we need to check? Are there different possible inputs?
|
|
||
| result = generate_tree(workload_data, num_nodes, multi_package) | ||
|
|
||
| return { |
There was a problem hiding this comment.
Can we write a flask lambda instead? Then, we could return a status code, like 500 or something. The benefit is that somebody could do a "curl ... > tree.json" to get a tree, without needing to extract the result separately.
|
|
||
| # Testing code | ||
| if __name__ == "__main__": | ||
| with open("wl.json", "r") as file: |
|
|
||
| def generate_tree(workload_data, num_nodes, multi_package=True): | ||
| calls = parse_workload(workload_data) | ||
| deps_json = load_deps_json() |
There was a problem hiding this comment.
Let's not load from a file. Can we include include this in the POST body passed to our function?
| candidate_queue = [] # priority queue for candidate nodes | ||
|
|
||
|
|
||
| def enqueue_top_child_candidate(parent, deps, multi_package=True): |
There was a problem hiding this comment.
It seems we pass around deps and multi_package to all our functions. Perhaps put this all in a class, and these are attributes?
| continue | ||
|
|
||
| pkg_name = child_pkgV.split("==")[0] | ||
| version = child_pkgV.split("==")[1] if "==" in child_pkgV else None |
There was a problem hiding this comment.
Do we allow pkgs without a version? If our input is based on pip compile output, won't we have versions? I may misunderstand, though.
| continue | ||
|
|
||
| # keep track of packages that would be loaded by this candidate | ||
| packages_to_load = set([child_pkgV]) |
There was a problem hiding this comment.
Can create a set without converting:
{child_pkgV}
| from collections import defaultdict | ||
|
|
||
|
|
||
| # TODO: instead of loading deps from a file, use pip-compile on the fly |
There was a problem hiding this comment.
I'm thinking lets require callers to pass deps in. Then, however we figure out those dependencies, it will be outside the nice, clean tree-building logic here.
…ept deps as input, add paper link
| @@ -0,0 +1,290 @@ | |||
| ''' | |||
There was a problem hiding this comment.
just call "forklift", not "forklift-service"
| self._enqueue_top_child_candidate(child) | ||
|
|
||
| def build_tree(self, desired_nodes): | ||
| self.candidate_queue = [] |
There was a problem hiding this comment.
Best to see all class attrs in init even if None at that point.
For the queue, let's have namedtuple's so we can see the code is correct easier. Things like the following need careful checking:
_, _, best_candidate
| found := false | ||
| for _, p := range packages { | ||
| if p == nodePkg { | ||
| // compare only using package name |
| call_counts[name] += 1 | ||
|
|
||
| # build call matrix | ||
| calls = {} |
There was a problem hiding this comment.
is a dict the best type for a calls matrix? Do want something (a) sparse with (b) column names.
Need to check, but Claude says Pandas can have a pd.SparseDtype(int, fill_value=0) type to make it sparse. Maybe Pandas DF is the best matrix format then?
There was a problem hiding this comment.
I believe in this case a dict would work better because the algorithm repeatedly partitions calls by rows and we only operate on rows.
From AI:
"Pandas sparse data structures (specifically SparseDtype and SparseArray introduced in version 1.0+) are designed to handle sparse data efficiently, but they do not deal well with frequent, row-wise, or random mutations."
What do you think?
…ions when looking for a zygote match
…namedtuples for readability, and extracted parse logic outside of the class
| @@ -0,0 +1 @@ | |||
| flask No newline at end of file | |||
There was a problem hiding this comment.
Let's do the pattern of requirements.in, and a pip compiled requirements.txt, as in other example lambdas.
| app = Flask(__name__) | ||
|
|
||
|
|
||
| class CallMatrix: |
There was a problem hiding this comment.
I think a generic SparseMatrix would allow us to better mirror the logic in the paper. The class has package-specific awareness, and therefore takes on some responsibilities handled by other functions in the pseudocode.
| self.root = None | ||
| self.candidate_queue = [] | ||
|
|
||
| def _enqueue_top_child_candidate(self, parent): |
There was a problem hiding this comment.
This function is quite a bit longer and more complicated than the pseudocode for enqueue_top_child_candidate in the paper. I think it is because this function is taking on more concerns, not just the core logic.
E.g., you have separate parse functions. Splitting pkg name from version feels like it should be handled there, not in the core logic of this function.
There was a problem hiding this comment.
Maybe a helper function or two would help as well?
| return jsonify(result), 200 | ||
|
|
||
| except Exception as e: | ||
| import traceback |
There was a problem hiding this comment.
imports should be top of file
|
|
||
| try: | ||
| event = request.get_json() | ||
| if event is None: |
There was a problem hiding this comment.
we have a couple special cases where we return different errors for specific kinds of invalid data. But given all is in a try/except, do we need special code for these?
| ''' | ||
|
|
||
| try: | ||
| event = request.get_json() |
There was a problem hiding this comment.
perhaps the body of this try should be it's own function to cleanly separate logic: one function for core logic, another (which calls the former) that adds error handling.
tylerharter
left a comment
There was a problem hiding this comment.
Let's iterate a bit more to make sure the code is elegant and commented (at key places). Left a few ideas in the parsing code.
| func_packages = {} | ||
|
|
||
| for func in workload.get("funcs", []): | ||
| func_name = func.get("name", "") |
There was a problem hiding this comment.
is a function without a name valid input? If somebody gives us a garbage file, do we want to silently carry on, or crash in an obvious way?
I'm not a fan of the .get's -- if there are specific cases where it is valid to have a field or not, then let's comment to justify it.
|
|
||
|
|
||
| def parse_workload(workload): | ||
| func_packages = {} |
There was a problem hiding this comment.
what are the keys/values going to be? A comment would be nice.
| app = Flask(__name__) | ||
|
|
||
|
|
||
| def parse_workload(workload): |
There was a problem hiding this comment.
what is it going to return? describe the pandas df, and what the rows+columns mean
| if "==" in line and not line.startswith("#"): | ||
| packages.add(line) | ||
|
|
||
| if packages: |
There was a problem hiding this comment.
do we need the conditional? Is it a problem to have a function with no packages?
|
|
||
| # rows=calls, columns=package==version, values=0/1 | ||
| rows = {} | ||
| if call_counts: |
There was a problem hiding this comment.
is it valid to have an input with zero calls? Do we want to complicate the function with this case?
| for func_name, packages in func_packages.items(): | ||
| rows[func_name] = packages.copy() | ||
|
|
||
| if not rows: |
There was a problem hiding this comment.
another check I'm not sure about
| return pd.DataFrame(dtype=int) | ||
|
|
||
| # sorted list of all unique packages across all calls, to ensure consistent column ordering | ||
| all_pkgs = sorted(set().union(*rows.values())) |
There was a problem hiding this comment.
I'm working too hard to parse this, maybe a few lines will be more readable?
No description provided.