Advanced document searching algorithms

This topic has been archived, and won't accept reply postings.
 elliot.baker 26 Apr 2019

If you wanted to search 1-100 PDF or Word documents for the same approx. 60 phrases, and then create a log of which phrases appeared in each, perhaps with a confidence level for each phrase - where would one start with this, e.g. is there a particular software package, programming language, etc.?

I assume it would be quite trivial to someone who knows how - but what isn't!

Many thanks in advance, and apologies for being so off topic!

 snoop6060 26 Apr 2019
In reply to elliot.baker:

Convert them to text files and you could do this in about 3 lines of code in pretty much any language. You could also index them with something, I guess. I use swish-e, which is fairly simple, but mainly for speed, as the code I write has to search millions of documents and good old grep tends to be a bit slow.
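As a rough illustration of the "few lines of code" idea, here is a minimal Python sketch. It assumes the documents have already been converted to .txt files; the function names are made up for this example. Matching is case-insensitive and collapses whitespace, so a phrase split across a line break is still found.

```python
import re
from pathlib import Path


def normalise(text: str) -> str:
    """Lower-case and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).lower()


def phrases_in_text(text: str, phrases: list[str]) -> list[str]:
    """Return the phrases that occur in the text, in the order given."""
    haystack = normalise(text)
    return [p for p in phrases if normalise(p) in haystack]


def search_folder(folder: str, phrases: list[str]) -> dict[str, list[str]]:
    """Map each .txt file name in the folder to the phrases found in it."""
    return {
        path.name: phrases_in_text(path.read_text(errors="ignore"), phrases)
        for path in sorted(Path(folder).glob("*.txt"))
    }
```

This only does exact phrase matching, so it gives a yes/no per phrase rather than a confidence level.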

Edit: To help a little further, if you are familiar with Linux, Ubuntu apparently has pdftotext and catdoc, which will convert these on the fly to text so you can just grep/egrep to search them. grep -l will just provide the file name of the match, so you can output the term and the names of the files that it matched on. Loop through your search terms, use grep to search the files (catdoc file | grep -l ) and output both the search term and its matches into a text file. This is really simple scripting.
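The loop described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested pipeline: it assumes `pdftotext` (from poppler-utils) and `catdoc` are installed, and the function names and CSV layout are invented for the example. One thing to watch: when text is piped in from a converter, grep has no file name to report, so the sketch keeps track of the file names itself.

```python
import csv
import io
import subprocess
from pathlib import Path

# Assumed converters; both write the extracted text to standard output.
CONVERTERS = {
    ".pdf": lambda p: ["pdftotext", str(p), "-"],  # "-" means write to stdout
    ".doc": lambda p: ["catdoc", str(p)],
}


def to_text(path: Path) -> str:
    """Convert a document to text, or read it directly if no converter applies."""
    build_cmd = CONVERTERS.get(path.suffix.lower())
    if build_cmd is None:
        return path.read_text(errors="ignore")
    result = subprocess.run(build_cmd(path), capture_output=True, text=True, check=True)
    return result.stdout


def files_matching(term: str, texts: dict[str, str]) -> list[str]:
    """Names of the files whose text contains the term (case-insensitive)."""
    needle = term.lower()
    return [name for name, text in texts.items() if needle in text.lower()]


def log_as_csv(terms: list[str], texts: dict[str, str]) -> str:
    """One CSV row per search term: the term, then its matching files."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["term", "files"])
    for term in terms:
        writer.writerow([term, ";".join(files_matching(term, texts))])
    return buffer.getvalue()


# Example use (hypothetical folder name):
#   texts = {p.name: to_text(p) for p in Path("docs").iterdir() if p.is_file()}
#   print(log_as_csv(["first phrase", "second phrase"], texts))
```

Newer .docx files would likely need a different converter, since catdoc targets the older .doc format.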

Post edited at 15:13
 tehmarks 26 Apr 2019
In reply to elliot.baker:

I'd be looking at whichever scripting language you're most familiar with - that'd be Perl for me personally, but then Perl is the first language I ever learnt and I'm quite fond of its weird and wonderful nature. And something like Perl will probably have a package for reading PDFs - so it might not be much extra work to work directly with the PDF.

 gravy 26 Apr 2019

grep

