python - How to match strings with possible typos?


I have multiple PDF files converted to text, and I want to search for a phrase that might be in those files. The problem is that the conversion from PDF to text is not perfect, so errors sometimes appear in the text (such as missing spaces between words; mix-ups between i, l, and 1; etc.).

I was wondering if there is a common technique that would give me a "soft" search, something that looks at the Hamming distance between two terms, for example.

if 'word' in sentence: 

vs

if my_search('word', sentence, tolerance):

You can use SequenceMatcher from the standard-library difflib module:

from difflib import SequenceMatcher

text = """there are
some 3rrors in my text
but I cannot find them"""

def fuzzy_search(search_key, text, strictness):
    # Compare every word on every line with the search key.
    lines = text.split("\n")
    for i, line in enumerate(lines):
        words = line.split()
        for word in words:
            similarity = SequenceMatcher(None, word, search_key)
            # ratio() is 1.0 for identical strings and 0.0 for nothing in common.
            if similarity.ratio() > strictness:
                return "'{}' matches: '{}' in line {}".format(search_key, word, i + 1)

print(fuzzy_search('errors', text, 0.8))

Which should output this:

'errors' matches: '3rrors' in line 2 
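
Splitting on whitespace will not catch the "missing spaces between words" case from the question, because the damaged word ends up glued to its neighbour. One possible workaround, sketched below, is to slide a character window the same length as the search key across the text instead of comparing whole words. The function name soft_contains and the default tolerance of 0.8 are illustrative assumptions, not part of difflib, and a fixed-width window only loosely handles inserted or deleted characters.

```python
from difflib import SequenceMatcher


def soft_contains(search_key, text, tolerance=0.8):
    """Return True if any window of text is similar enough to search_key.

    Hypothetical helper for illustration; name and default are assumptions.
    """
    key = search_key.lower()
    haystack = text.lower()
    width = len(key)
    # Slide a window the same length as the key across the text, one
    # character at a time, so word boundaries (and missing spaces) do not matter.
    # max(..., 1) makes sure a text shorter than the key is still checked once.
    for start in range(max(len(haystack) - width + 1, 1)):
        window = haystack[start:start + width]
        if SequenceMatcher(None, window, key).ratio() > tolerance:
            return True
    return False


print(soft_contains("errors", "some3rrors in my text"))
print(soft_contains("errors", "nothing to see here"))
```

This can be used in place of the plain membership test, as in if soft_contains('word', sentence, tolerance). It compares one window per character, so it is noticeably slower than the word-by-word version on large files.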
