So Perplexity can kind of weakly analyze the first few pages of small PDFs one at a time, but I’d love something that would let me upload several hundred research papers and textbooks, analyze them for consensus and contradictions, and give me more meaningful search results and summaries than keyword searching alone. Does anything like this exist in a fairly user-friendly, accessible format?
Afforai might be able to do stuff like this. I haven’t tested it myself yet, but the service also seems to have some other features that might be relevant for your use case.
Wow, yes, this looks spot on, thanks! Warning: whenever I find cool services like this, they tend to go under within a year or two, so apologies in advance.
Don’t have an answer, but I’d be interested in something like that too. I know Microsoft released a freely available lightweight LLM called Phi-3 that’s supposed to make it easier for people to run locally. Decent article from Ars Technica: https://arstechnica.com/information-technology/2024/04/microsofts-phi-3-shows-the-surprising-power-of-small-locally-run-ai-language-models/
I have used a small R package that reads the text content of a PDF and sends it to a local Llama model via Ollama, or to one of the large LLM APIs. I could use that to get structured answers in JSON format on a whole folder of papers, but the context length of a typical model is only long enough to hold a single (roughly 40-page) paper in memory. So I had to get separate structured answers for each paper and then generate a complete summary from those. Unfortunately, that’s not user-friendly yet.
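For anyone curious, the same per-paper workflow can be sketched in Python against a local Ollama server. This is just a sketch under assumptions: the model name `llama3` and the JSON keys in the prompt are placeholders, not what I actually used in R, and it assumes Ollama is running on its default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_prompt(paper_text: str) -> str:
    """Build the structured-answer prompt for a single paper.
    The JSON keys here are illustrative placeholders."""
    return (
        "Read the paper below and answer in JSON with keys "
        '"main_claim", "methods", and "limitations".\n\n'
        + paper_text
    )

def ask_paper(paper_text: str, model: str = "llama3") -> dict:
    """Send one paper's extracted text to a local Ollama model,
    asking it to constrain the reply to valid JSON."""
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(paper_text),
        "format": "json",   # Ollama's built-in JSON output mode
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return json.loads(body["response"])
```

You’d loop `ask_paper` over a folder of already-extracted texts, collect the per-paper JSON answers, and then do a second pass to summarize those.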
Interesting start, yeah, looks a bit in the weeds for my purposes right now though.
I don’t think you can use Retrieval Augmented Generation or vector databases for a task like that. At least not if you want to compare whole papers and not just a single statement or fact. And that’s what most tools are focused on. As far as I know, the tools concerned with big PDF libraries are meant to retrieve specific information out of the library, relevant to a specific question from the user. If your task is to go through the complete texts, it’s not the right tool, because it’s made to only pick out chunks of text.
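To make the "only picks out chunks" point concrete, here’s a toy sketch of chunk-based retrieval. It uses naive word-overlap scoring purely for illustration (real RAG systems use embedding vectors and a vector database); the function names are my own invention.

```python
def chunk(text: str, size: int = 50) -> list[str]:
    """Split a document into fixed-size word chunks, as RAG pipelines do."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query: str, chunk_text: str) -> int:
    """Toy relevance score: shared-word count (real systems use embeddings)."""
    return len(set(query.lower().split()) & set(chunk_text.lower().split()))

def retrieve(query: str, docs: dict[str, str], k: int = 3) -> list[tuple[str, str]]:
    """Return only the top-k chunks across all documents -- never whole papers."""
    candidates = [(name, c) for name, text in docs.items() for c in chunk(text)]
    candidates.sort(key=lambda nc: score(query, nc[1]), reverse=True)
    return candidates[:k]
```

No matter how large the library is, the model only ever sees those k chunks, which is why comparing entire papers against each other falls outside what plain RAG does.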
I’d say you need an LLM with a long context length, like 128k tokens or way more, fit all the texts in, and add your question. Or you build a clever agent: make it summarize each paper individually or extract facts, then feed those results back and let it search for contradictions, or do a summary of the summaries.
(And I’m not sure if AI is up to the task anyway. Doing meta-studies is a really complex task, done by highly skilled professionals in a field. And it takes them months… I don’t think current AI performance is anywhere near that level. It’s probably going to make something up instead of outputting anything related to reality.)
Check out Afforai. It’s not perfect at all, but it is on track to do what I want.
Ah, nice. Thanks for sharing.
That would likely be a language model fine-tuned on said material. The problem is turning PDFs into a structured data source the model can ingest; the fine-tuning can’t happen on random unstructured PDFs.
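To illustrate what "structured" means here, a sketch of converting already-extracted paper text into instruction-style JSONL records. The field names (`instruction`/`input`/`output`) follow a common fine-tuning convention, not any particular trainer’s requirement, and the truncation is deliberately crude.

```python
import json

def to_finetune_records(papers: dict[str, str], max_chars: int = 4000) -> list[str]:
    """Turn raw extracted paper text into instruction-style JSONL lines.
    Fine-tuning pipelines expect records like this, not raw PDF bytes."""
    lines = []
    for title, text in papers.items():
        record = {
            "instruction": f"Summarize the paper '{title}'.",
            "input": text[:max_chars],   # crude truncation; real prep would chunk
            "output": "",                # to be filled with a reference summary
        }
        lines.append(json.dumps(record))
    return lines
```

Writing the `output` fields (the reference answers) is the hard, manual part, which is exactly why you can’t just point a fine-tuning job at a folder of PDFs.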
Look into RAG using a vector database; this is exactly what they’re for. https://www.linkedin.com/events/buildaragapplicationontheaistac7191489677017649153
Looks a bit beyond me unfortunately, but sounds interesting