UKH Forums - Advanced document searching algorithms

Advanced document searching algorithms

This topic has been archived, and won't accept reply postings.

elliot.baker 26 Apr 2019

If you wanted to search 1-100 PDF or Word documents for the same approx. 60 phrases, and then create a log of which phrases appeared in each, perhaps with a confidence level for each phrase - where would one start with this, e.g. is there a particular software package, programming language, etc.?

I assume it would be quite trivial to someone who knows how - but what isn't!

Many thanks in advance, and apologies for being so off topic!

snoop6060 26 Apr 2019

In reply to elliot.baker:

Convert them to text files and you could do this in about 3 lines of code in pretty much any language. You could also index them with something, I guess. I use swish-e, which is fairly simple, but mainly for speed, as the code I write has to search millions of documents and good old grep tends to be a bit slow.

Edit: To help a little further, if you are familiar with Linux: Ubuntu apparently has pdftotext and catdoc, which will convert these on the fly to text so you can just grep/egrep to search them. grep -l prints only the name of each file that matches, so you can output the term and the names of the files it matched on. Loop through your search terms, use grep to search the files (catdoc file | grep -l ) and output both the search term and its matches into a text file. This is really simple scripting.
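For anyone who would rather not do the loop in shell, here is a rough Python sketch of the same idea, assuming the documents have already been converted to .txt files. It also has a go at the "confidence level" part of the question: an exact match scores 1.0, otherwise it takes the best character-similarity ratio from difflib over a sliding window of words. The function names, the 0.85 threshold and the idea of using that ratio as "confidence" are all assumptions made for illustration, not anything from the posts above.

```python
import re
from difflib import SequenceMatcher
from pathlib import Path


def normalise(text):
    """Lower-case and collapse whitespace so line breaks don't hide a phrase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def phrase_confidence(phrase, text):
    """Return 1.0 for an exact match, else the best fuzzy ratio (0 to 1)."""
    phrase, text = normalise(phrase), normalise(text)
    if not phrase:
        return 0.0
    if phrase in text:
        return 1.0
    words = text.split()
    n = len(phrase.split())
    best = 0.0
    # Compare the phrase with every run of n consecutive words in the text.
    for i in range(max(len(words) - n + 1, 1)):
        window = " ".join(words[i:i + n])
        best = max(best, SequenceMatcher(None, phrase, window).ratio())
    return best


def build_log(texts, phrases, threshold=0.85):
    """Map each document name to {phrase: confidence} for hits at or above threshold.

    The 0.85 default is an arbitrary starting point, not a recommended value.
    """
    log = {}
    for name, text in texts.items():
        hits = {}
        for phrase in phrases:
            score = phrase_confidence(phrase, text)
            if score >= threshold:
                hits[phrase] = round(score, 2)
        log[name] = hits
    return log


def load_texts(folder):
    """Read every .txt file in a folder (already converted from PDF/Word)."""
    return {
        p.name: p.read_text(errors="ignore")
        for p in sorted(Path(folder).glob("*.txt"))
    }
```

Usage would be something like build_log(load_texts("converted"), phrases), where "converted" is a hypothetical folder of text files and phrases is your list of 60 or so strings. The sliding window is slow on very large documents, so for millions of files an index is still the better route.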

Post edited at 15:13

tehmarks 26 Apr 2019

In reply to elliot.baker:

I'd be looking at whichever scripting language you're most familiar with - that'd be Perl for me personally, but then Perl is the first language I ever learnt and I'm quite fond of its weird and wonderful nature. And something like Perl will probably have a package for reading PDFs - so it might not be much extra work to work directly with the PDF.

Jonathan Emett 26 Apr 2019

In reply to elliot.baker:

https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multi...

gravy 26 Apr 2019

grep
