A few optimizations to `CatfishSearchEngine.py` for better performance and lower memory usage

I think we can slightly optimize catfish.
- The `else` block at line 292 in method `CatfishSearchEngine.search_filenames` in `CatfishSearchEngine.py` (a standalone micro-benchmark sketch of this change follows the list):

  ```python
  # OLD
  match_list = set()
  for kword in keywords:
      if kword in fname.lower():
          match_list.add(kword)
  if len(set(keywords)) == len(match_list):
      return True

  # NEW
  # Avoids repeated substring search for duplicates in `keywords`
  # Avoids repeatedly converting `fname` to lower case
  # Avoids repeatedly converting `keywords` to a `set`
  # No need for an extra `set` data structure `match_list` (thus reducing memory usage)
  # Returns immediately if a single `kword` is not present in `fname`
  fname = fname.lower()
  for kword in set(keywords):
      if kword not in fname:
          return False
  return True
  ```
- In method `CatfishSearchMethod_Zeitgeist.run` at line 674 of `CatfishSearchEngine.py` - This change avoids repeatedly converting the string `keywords` to lower case inside nested `for` loops (a small illustration of the pattern follows the list):

  ```python
  # line=676
  # OLD
  keywords = " ".join(keywords)
  # NEW
  keywords = " ".join(keywords).lower()

  # line=707
  # OLD
  if keywords.lower() in filename and \
  # NEW
  if keywords in filename and \
  ```
- The `else` block of method `CatfishSearchMethod_Fulltext.search_text` at line 557 of `CatfishSearchEngine.py` - I did some performance testing using the attached files `optimization3_.*` and found that the proposed change indeed improved the performance (a standalone version of the comparison is sketched after the list). Following are some of the results:

  ```
  [ExecutionTime] "catfish_old_full_search" = 00:00:08.600 (i.e. ~8 seconds)
  [ExecutionTime] "new_full_search"         = 00:00:00.285 (i.e. ~0.29 seconds)
  [ExecutionTime] "catfish_old_full_search" = 00:00:00.935 (i.e. ~0.9 seconds)
  [ExecutionTime] "new_full_search"         = 00:00:00.128 (i.e. ~0.13 seconds)
  ```

  ```python
  # OLD
  def search_text(self, lines, keywords):
      if self.exact:
          for line in lines:
              if " ".join(keywords) in line.lower():
                  return True
      else:
          match_list = set()
          for line in lines:
              for kword in keywords:
                  if kword in line.lower():
                      match_list.add(kword)
              if len(set(keywords)) == len(match_list):
                  return True

  # NEW
  # Avoids repeated substring search for duplicates in `keywords`
  # Avoids repeatedly converting `keywords` to a `set`
  # Avoids repeating a substring search for a keyword that was already found
  # Extra memory requirement drops from a `set` of strings to a bool array plus an int
  def search_text(self, lines, keywords):
      if self.exact:
          for line in lines:
              if " ".join(keywords) in line.lower():
                  return True
      else:
          keywords = list(set(keywords))
          match_found = [False for i in range(len(keywords))]
          match_found_count = 0
          for line in lines:
              for i in range(len(keywords)):
                  if (not match_found[i]) and keywords[i] in line.lower():
                      match_found[i] = True
                      match_found_count += 1
                      if match_found_count == len(keywords):
                          return True
  ```
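For the `search_filenames` change in the first item, here is a minimal, self-contained sketch of how the old and new matching logic could be compared. The function names, sample filename, and keyword list are made up for illustration and are not part of catfish.

```python
# Hypothetical micro-benchmark for the `search_filenames` change; names and
# data are illustrative only.
import timeit

def match_old(fname, keywords):
    # Original logic: lowercases `fname` once per keyword and tracks hits in a set.
    match_list = set()
    for kword in keywords:
        if kword in fname.lower():
            match_list.add(kword)
    return len(set(keywords)) == len(match_list)

def match_new(fname, keywords):
    # Proposed logic: lowercase once, deduplicate once, bail out on the first miss.
    fname = fname.lower()
    for kword in set(keywords):
        if kword not in fname:
            return False
    return True

fname = "Some_Very_Long_File_Name_With_Many_Words_2021.tar.gz"
keywords = ["file", "name", "2021", "file", "name"]  # duplicates on purpose

assert match_old(fname, keywords) == match_new(fname, keywords)
print("old:", timeit.timeit(lambda: match_old(fname, keywords), number=100_000))
print("new:", timeit.timeit(lambda: match_new(fname, keywords), number=100_000))
```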
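For the Zeitgeist change in the second item, the pattern being exploited is plain loop-invariant hoisting of `.lower()`. The snippet below is illustrative only; the corpus and helper names are invented and do not come from catfish.

```python
# Illustrative only: hoisting `.lower()` out of a loop when the value never changes.
import timeit

keywords = " ".join(["Foo", "Bar", "Baz"])
filenames = ["/home/user/notes about foo bar baz %d.txt" % i for i in range(10_000)]

def scan_old():
    hits = 0
    for filename in filenames:
        if keywords.lower() in filename:  # lowercased again on every iteration
            hits += 1
    return hits

def scan_new():
    needle = keywords.lower()             # lowercased once, before the loop
    hits = 0
    for filename in filenames:
        if needle in filename:
            hits += 1
    return hits

assert scan_old() == scan_new()
print("old:", timeit.timeit(scan_old, number=100))
print("new:", timeit.timeit(scan_new, number=100))
```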
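For the `search_text` change in the third item, a rough standalone re-creation of the timing comparison could look like the following. The corpus, keyword list, and function names are assumptions made for the sketch; only the two loop bodies mirror the old/new `else` branches shown above (the `self.exact` branch is omitted since it is unchanged).

```python
# Rough standalone comparison of the old and new `else` branches of search_text;
# the corpus and keyword list are synthetic.
import timeit

def search_text_old(lines, keywords):
    match_list = set()
    for line in lines:
        for kword in keywords:
            if kword in line.lower():
                match_list.add(kword)
        if len(set(keywords)) == len(match_list):
            return True
    return False

def search_text_new(lines, keywords):
    keywords = list(set(keywords))
    match_found = [False for _ in keywords]
    match_found_count = 0
    for line in lines:
        for i in range(len(keywords)):
            if (not match_found[i]) and keywords[i] in line.lower():
                match_found[i] = True
                match_found_count += 1
                if match_found_count == len(keywords):
                    return True
    return False

lines = ["Lorem ipsum dolor sit amet %d" % i for i in range(5_000)]
lines.append("the needle and the haystack")
keywords = ["needle", "haystack", "needle"]  # duplicate on purpose

assert search_text_old(lines, keywords) == search_text_new(lines, keywords)
print("old:", timeit.timeit(lambda: search_text_old(lines, keywords), number=20))
print("new:", timeit.timeit(lambda: search_text_new(lines, keywords), number=20))
```

The relative numbers will of course depend on the corpus; the attached `optimization3_.*` files are what produced the results quoted above.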