For a few months, a few others and I worked on a project we called “arxiv-search”, an attempt to search and sort all of the arxiv (~1 million papers). We were inspired by Andrej Karpathy’s arxiv sanity preserver, which is an excellent tool for a limited set of papers (~50,000). Starting from that project, we ended up writing a new backend and frontend. Our backend used elasticsearch, a large-scale search engine that runs constantly on a big server, indexing metadata and responding to search requests. (In fact, the arxiv itself very recently started using elasticsearch to improve their own search results.) The original arxiv-sanity implementation kept the metadata for all the papers in the server’s RAM, which limits how many papers can be hosted; the idea was that elasticsearch can scale far better, and perhaps do more sophisticated searching efficiently. Our frontend was written by my officemate Ed Ayers in react.js, and was responsive and useful. Our ambitions expanded; we set up an AWS Lambda pipeline to process new papers (get the metadata, generate thumbnails, scrape the text), which we hoped to extend to include semantic elements (parse definitions and theorems, etc.) and to generate recommendations for our users. (Lars Mennen wrote a nice python script to work on the parsing task.)
But at some point other parts of life (and research) caught up with me, and I had to spend less time on the project. Soon I found myself paying for the elasticsearch server each month without working on the project. At some point the plan was to try to get some funding for it, maybe a summer student, and expand the project. But personally I can only do a few things at one time, and at this point this is one too many. I shut down the elasticsearch server (but made a backup, of course).
I think it would be good if we open-sourced the code; after all, we definitely benefited from access to the source of the arxiv sanity preserver project. But our code is pretty messy, especially everything I wrote, and we’d probably need to look it over more first. (Dear reader: let me know if you’re interested in the source; that would be motivating, I think.)
Personally, I’m definitely interested in how modern technology can improve and expedite research, and I think the arxiv itself is a wonderful, underutilized resource. I think making academic literature easier to search and parse would help researchers at every level, and especially newcomers to a given field, who don’t have encyclopedic knowledge of every relevant paper. Moreover, more and more papers are being submitted to the arxiv each month (see the monthly submission rates), and it will only become increasingly difficult to stay current in any given field. I think computer-driven personalized recommendations are an excellent approach to ameliorate that problem. Finally, I suspect a lot of time is spent rederiving results from one field in the language of another, and finding a semantic representation of academic work could allow computer analysis to help find these connections. That is a much bigger problem, which we were hoping to make a tiny dent in by extracting theorems and definitions from LaTeX files to allow easier discovery.
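To give a flavor of what extracting theorems and definitions from LaTeX can look like, here is a minimal sketch in python. It is not the script we used. It assumes papers use standard environment names such as `theorem` and `definition` (many papers define their own with `\newtheorem`), and the names `ENV_NAMES` and `extract_statements` are made up for this illustration. It also ignores LaTeX comments and nested environments, which a real pipeline would have to handle.

```python
import re

# Assumed environment names; real papers often declare their own via \newtheorem.
ENV_NAMES = ("theorem", "lemma", "definition", "proposition", "corollary")

# Matches \begin{env}[optional title] ... \end{env}, allowing starred variants.
ENV_PATTERN = re.compile(
    r"\\begin\{(?P<env>" + "|".join(ENV_NAMES) + r")\*?\}"
    r"(?:\[(?P<title>[^\]]*)\])?"
    r"(?P<body>.*?)"
    r"\\end\{(?P=env)\*?\}",
    re.DOTALL,
)


def extract_statements(latex_source):
    """Return a list of (environment, title, body) tuples found in the source.

    The title is None when the environment has no [optional] argument.
    Whitespace in the body is collapsed to single spaces.
    """
    results = []
    for match in ENV_PATTERN.finditer(latex_source):
        body = " ".join(match.group("body").split())
        results.append((match.group("env"), match.group("title"), body))
    return results


if __name__ == "__main__":
    sample = r"""
    \begin{definition}[Group]
    A group is a set $G$ with an associative operation,
    an identity, and inverses.
    \end{definition}
    Some connecting text.
    \begin{theorem}
    Every group of prime order is cyclic.
    \end{theorem}
    """
    for env, title, body in extract_statements(sample):
        print(env, title, body, sep=" | ")
```

Even something this crude yields text that could be indexed as its own field in a search engine, though a regular expression is a long way from a semantic representation.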
I’m convinced that the process of academic research can be improved, and that there is a lot of benefit in doing so. (And I think computer technology is only part of the solution, but I’ll leave my other thoughts on that for another post.) But the arxiv’s own search interface has improved and can handle full-text search now, which is a promising direction, and I still have to work on my PhD, so I think for now this project isn’t what I’ll be spending my time on. (In fact, we found a lot of what we were planning on doing was on the arxiv’s 2018 roadmap. But it doesn’t look like they are planning on doing recommendations or anything semantic, both of which seem quite important to me.)
Thank you very much to everyone who encouraged us, tried out our website, and/or gave us feedback. And I’d particularly like to thank Ed Ayers and Lars Mennen for working on this project with me.