catastrophic regex backtracking causes 100% CPU utilization when searching with punctuation in filename
Version: 4.20.1
Steps to reproduce:
- Open Catfish, in my case with a large file collection indexed by locate (200,000+ files)
- Enter a search string containing commas and multiple words, e.g. `Truckers, Kickers, Cowboy Angels`
- Observe CPU utilization. The accompanying image shows what I saw with this command: `watch -n1 'ps -eo pid,pcpu,args | grep -E "catfish|find|locate|plocate" | grep -v grep'`
Catfish passes the following to locate:
`locate --regex --basename -i truckers(.)*(.)*kickers(.)*(.)*cowboy(.)*angels|truckers(.)*(.)* ...`
This causes locate to consume ~2285% CPU (approximately 23 cores) for an extended period due to catastrophic backtracking. The session consumed 6min 35s CPU time over 21s wall clock time.
Root cause (two distinct issues):
- Pathological regex construction: Catfish generates `(.)*` between search tokens instead of `.*`. The two are semantically equivalent for matching purposes, but `(.)*` forces the regex engine to explore dramatically more backtracking paths on a non-match. With 200,000+ locate database entries, the majority of evaluations are non-matches, making this extremely expensive. This issue exists independently of punctuation and could be triggered by any sufficiently complex multi-word search against a large database.
- Punctuation not normalized: Commas in the search string are incorporated into the regex tokens rather than being stripped or treated as whitespace. The tokens themselves then become invalid partial anchors, compounding the backtracking problem.
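The cost of the doubled `(.)*` groups can be reproduced outside of locate. A minimal Python sketch (Python's `re` module, not plocate's own regex engine, so the absolute numbers differ, but the backtracking behaviour on a non-match is analogous; token list and subject string are illustrative):

```python
import re
import time

tokens = ["truckers", "kickers", "cowboy", "angels"]

# Pattern shaped like the one Catfish emits (doubled "(.)*" between tokens)
bad = "(.)*(.)*".join(tokens)
# Semantically equivalent pattern using a single ".*" separator
good = ".*".join(tokens)

# Worst case for backtracking: the entry contains the first token
# but none of the later ones, so the match ultimately fails
subject = "truckers-" + "x" * 2000

for name, pattern in (("(.)* pattern", bad), (".* pattern", good)):
    start = time.perf_counter()
    result = re.search(pattern, subject)
    elapsed = time.perf_counter() - start
    print(f"{name}: match={result is not None}, {elapsed:.4f}s")
```

On a non-matching entry the `(.)*(.)*` form must try every split of the gap between tokens across the two groups, while the `.*` form scans the gap once; multiplied across 200,000+ database entries, that difference dominates the search time.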
Suggested fixes:
- Replace `(.)*` with `.*` in the regex construction code
- Strip or normalize punctuation characters (at minimum commas) to whitespace before building the search regex
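Both fixes can be combined in the regex-building step. A sketch of what that could look like (`build_locate_regex` is a hypothetical name, not Catfish's actual code; Catfish already passes `-i` to locate, so case is left untouched here):

```python
import re

def build_locate_regex(query: str) -> str:
    """Hypothetical sketch of a safer regex builder: punctuation is
    normalized to whitespace, tokens are escaped, and a single '.*'
    joins them instead of doubled '(.)*' groups."""
    # Treat commas and other punctuation as word separators
    normalized = re.sub(r"[^\w\s]", " ", query)
    # Escape each token so stray metacharacters cannot alter the pattern
    tokens = [re.escape(t) for t in normalized.split()]
    return ".*".join(tokens)

print(build_locate_regex("Truckers, Kickers, Cowboy Angels"))
# → Truckers.*Kickers.*Cowboy.*Angels
```

Escaping the tokens is a side benefit: it also prevents other user-typed metacharacters (dots, parentheses) from producing malformed or pathological patterns.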
System:
- OS: Arch Linux
- locate implementation: plocate
- File count in locate database: ~200,000
- "search file contents" is not enabled.
