Skip to content

Few optimizations to `CatfishSearchEngine.py` for better performance and lesser memory usage

I think we can slightly optimize catfish

  1. The else block at line 292 in method CatfishSearchEngine.search_filenames in CatfishSearchEngine.py
    # OLD
    match_list = set()
    for kword in keywords:
        if kword in fname.lower():
            match_list.add(kword)
        if len(set(keywords)) == len(match_list):
            return True
    
    # NEW
    # Avoids repeated substring search for duplicates in `keywords`
    # Avoids repeated processing of converting `fname` to lower case
    # Avoids repeated conversion of object `keywords` to `set`
    # No need of an extra `set` data structure `match_list` (thus reducing memory usage)
    # Returns immediately if a single `kword` is not present in `fnanme`
    fname = fname.lower()
    for kword in set(keywords):
        if kword not in fname:
            return False
    return True
  2. In method CatfishSearchMethod_Zeitgeist.run at line 674 of CatfishSearchEngine.py
    • This change avoid repeated conversion of string keywords to lower case in nested for loops
    # line=676
    
    # OLD
    keywords = " ".join(keywords)
    
    # NEW
    keywords = " ".join(keywords).lower()
    # line=707
    
    # OLD
    if keywords.lower() in filename and \
    
    # NEW
    if keywords in filename and \
  3. else block of method CatfishSearchMethod_Fulltext.search_text at line 557 of CatfishSearchEngine.py
    • I did some performance testing using the attached files optimization3_.* and found that the proposed change indeed improved the performance. Following are some of the results
      [ExecutionTime] "catfish_old_full_search" = 00:00:08.600 (i.e. ~8 seconds)
      [ExecutionTime] "new_full_search" = 00:00:00.285 (~0.29 seconds)
      
      [ExecutionTime] "catfish_old_full_search" = 00:00:00.935 (i.e. ~0.9 seconds)
      [ExecutionTime] "new_full_search" = 00:00:00.128 (i.e. ~0.13 seconds)
    # OLD
    def search_text(self, lines, keywords):
        if self.exact:
            for line in lines:
                if " ".join(keywords) in line.lower():
                    return True
        else:
            match_list = set()
            for line in lines:
                for kword in keywords:
                    if kword in line.lower():
                        match_list.add(kword)
                if len(set(keywords)) == len(match_list):
                    return True
    
    # NEW
    # Avoids repeated substring search for duplicates in `keywords`
    # Avoids repeated conversion of `keywords` object to `set`
    # Avoids repeated substring search if the keyword was already found in the past
    # Extra memory requirement has reduced from `set[str]` to `bool array + int`
    def search_text(self, lines, keywords):
        if self.exact:
            for line in lines:
                if " ".join(keywords) in line.lower():
                    return True
        else:
            keywords = list(set(keywords))
            match_found = [False for i in range(len(keywords))]
            match_found_count = 0
            for line in lines:
                for i in range(len(keywords)):
                    if (not match_found[i]) and keywords[i] in line.lower():
                        match_found[i] = True
                        match_found_count += 1
                if match_found_count == len(keywords):
                    return True