PLOS BLOGS Absolutely Maybe

Google Scholar Risks and Alternatives

 

search. cite. coffee. repeat.

 

I wasn’t there. But it sounds like a ripper of a talk. “There seems to be a narrowing of our collective view of the literature”, according to Jevin West. He’s from the University of Washington, and he was speaking at the Metascience 2019 symposium earlier this month.

Why does he think we are seeing a narrower slice of the literature? Via Carl Bergstrom on Twitter: West and his colleague, Jason Portenoy, studied over half the article usage on JSTOR, and in recent years, Google Scholar is swamping every other way of arriving at an article. One result?

 

 

That does not necessarily mean that the cream is rising to the top. For that to be true, people would have to be putting in a lot of effort to make sure their citation practices were impeccable. And they really don’t. Citation can just mean, here’s a thing I found so I can plonk a citation in this sentence.

A case in point: across Twitter in another conversation, Paul Whaley pointed out a 2018 study by Andreas Stang and colleagues. Stang had written about the many problems of a particular quality-measuring scale. The title began “Critical evaluation of…” Despite the fact that he was challenging the validity of the scale, in 96 systematic reviews citing his article, all but 2 were using it as justification for using that scale. “It appears”, wrote Stang & co, “that the vast majority of systematic review authors who cited this commentary did not read it”.

But back to Google Scholar. Johan Ugander weighed in: “…of course Google Scholar (+ other tools) are altering citation patterns, but not necessarily only in bad ways”. Ex-Googler, Helder Suzuki, pointed to a 2014 paper he co-authored with Anurag Acharya (Google Scholar’s developer) and others [PDF]. They report,

[T]he fraction of citations to articles published in non-elite journals has grown substantially over most research areas… Now that finding and reading relevant articles in non-elite journals is about as easy as finding and reading articles in elite journals, researchers are increasingly building on and citing work published everywhere.

It’s great to crack open the old monopolies of our attention. They were never as good at finding the best as they were reckoned to be. But how good is Google Scholar at herding us to the most important papers?

Katie Corker pointed to Chapter 2 of Nick Fox’s dissertation. Fox raises interesting points: we’re not reading more, and if we just lean on varieties of social cues – anything that’s click- or citation-based – where is that going to lead science?

And if we all become reliant on Google Scholar, what happens if Google pulls the plug on it? West pointed us to “Killed by Google”, lest we be too complacent about this. Bergstrom argued this blind dependence is a failure of the scientific community. It doesn’t have to be dramatic either: Google Scholar killed off one of its few functionalities not that long ago – one I used almost every day, making it far less useful to me. PubMed does that too sometimes, so it’s not just a matter of being a private company. There’s something risky, though, about having all your eggs in one basket.

 

 

What can we do about it? West and colleagues built a search engine based on images from PMC (PubMed Central). Even without trying to take on the giant commitment of building a community-driven wildly popular alternative to Google Scholar, there are many smaller scale projects that could help improve our access to knowledge.

In my area of interest, Epistemonikos is a pretty spectacular example. Developed by Gabriel Rada and colleagues in Chile, it’s an indispensable, scientist-driven, searchable, relational database of health evidence [PDF]. Here’s another, that also shows a useful resource doesn’t even have to be technologically advanced, although that sure helps. This database of curated methodology papers is indispensable, too, since it’s so hard to search for papers about a methodology in the great ocean of papers using it. It is simply released as a Zotero library. It’s not well known, though.

But it was this exchange that prompted this blog post:

 

 

I hadn’t looked at Microsoft Academic in ages, and it has changed a lot. I had no idea it had added downloads of results and citations. It’s got close to 200 million papers, compared with PubMed’s 30 million today, covering nearly 50,000 journals and more.

This is the second one mentioned: Dimensions. It’s got more than 100 million. But it looks like you need a subscription for its interesting features. Moving on.

The third is one I had never seen before, and it’s an eye-opener. Lens is open source, with APIs. Its core content is patents, but within 2 years they hope to be linking to “most of the scholarly literature”. So far they are tapping PubMed/PMC, Crossref, Microsoft Academic, and CORE. It’s at more than 200 million scholarly works.

Speaking of CORE, that’s now got over 130 million open access papers.

I hope I can lessen my reliance on Google Scholar by getting to know Lens and MS Academic better. Even if multiple databases have the same contents, big variations in the search engines mean you can end up with very different results.
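One way to see those variations for yourself is to run the same query in a few tools, export the results, and compare the identifiers. Here is a minimal Python sketch, assuming you have already exported each tool's results into a set of DOIs. The identifiers below are made-up placeholders, not real search results, and `compare_results` is a hypothetical helper name.

```python
def compare_results(results):
    """Summarize how search tools differ for one query.

    results: dict mapping a tool name to the set of record
    identifiers (for example DOIs) that the tool returned.
    """
    # Everything found by at least one tool.
    everything = set().union(*results.values())
    summary = {}
    for tool, found in results.items():
        # Records returned by any of the other tools.
        others = set().union(
            *(r for name, r in results.items() if name != tool)
        )
        summary[tool] = {
            "found": len(found),
            "share_of_union": len(found) / len(everything) if everything else 0.0,
            "unique": sorted(found - others),
        }
    return summary


# Placeholder identifiers, for illustration only (assumption, not real data).
results = {
    "Google Scholar": {"doi:a", "doi:b", "doi:c", "doi:d"},
    "Lens": {"doi:b", "doi:c", "doi:e"},
    "Microsoft Academic": {"doi:a", "doi:c"},
}

summary = compare_results(results)
for tool, stats in summary.items():
    print(tool, stats)
```

The "unique" list is the interesting part: anything that shows up there is something you would have missed by sticking to the other tools.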

There’s another aspect here, too. And that’s our search skills. A few years ago, I wrote a post about the impact Google has had on them. The little research I found then, and the last time I tried to update it, was grim reading. We could be worse at finding information these days, because of putting in too little effort and over-relying on the Google machine. Maybe the thing that’s most off-putting about adding more places to search is one of the strongest reasons for doing it: learning some new ropes and taking a bit more time.

 

~~~~

 

Disclosure: I was a senior scientist at the NIH’s NCBI working on PubMed-related projects from 2011 to 2018. (NCBI is part of the U.S. National Library of Medicine.)

 

On a related note, check out 8 PubMed Ninja Skills

Imaginary PubMed Ninja cartoon game

 

#Metascience2019

The video of West’s talk will be going online in the next few months. In the meantime, you can check out at least some of the discussion of the talk here.

 

The cartoons are my own (CC BY-NC-ND license). (More cartoons at Statistically Funny and on Tumblr.)

 

Discussion
  1. You can get some powerful features out of Dimensions with free registration. This includes exporting very large bibliographic datasets.
    https://www.dimensions.ai/blog/discovering-relationships-between-researchers-and-publications-using-dimensions-data-just-got-a-lot-more-colorful/

    It seems they also grant full access, for some researchers at least, on request:
    https://www.dimensions.ai/request-access/

    On CORE, as with BASE, note that the startlingly high headline number of 135 million “papers” they claim is misleading – they’re actually metadata records, and there are usually multiple records per paper. CORE only claims to find 24 million full-text papers, the same as Unpaywall.
    https://core.ac.uk/data/
    Dissem.in finds ~35 million OA works, including ResearchGate. Still an impressive number.

  2. “The prudent mariner will not rely solely on any single aid to navigation, particularly on floating aids” is a warning printed on the legend of nautical charts. The scholarly corollary might be “The prudent scholar will not rely solely on any single bibliographic search tool, particularly those with unknown algorithms.” Unfortunately, the imprudent scholar will use whatever tool is at hand to plunk in some relevant-sounding citation, regardless of the search provider.
    To me the greatest risk with relying on Google Scholar is that one day Google will cut it adrift. I can deal with some of the other issues mentioned. Yes, GS’s corpus is unknown, but it seems broader than that of other tools. Likewise, its algorithms are secret, but they tend to turn up a lot of relevant literature, more so than would, say, WoS or Scopus. That’s a double-edged feature, as WoS and Scopus don’t index “predatory journals” but GS includes some dodgy sources. GS’s feature of pointing to downloadable PDFs is helpful, even though (or especially since) GS will find and point to PDFs even when they are bootleg versions on a ResearchGate-like site.
    I had written off Microsoft Academic as anemic years ago until your post. I just checked it out and am very impressed. Highly functional and an elegant interface. But same as Google Scholar – the risk will be that they don’t keep it going. Remember that great, late, elegant and highly functional Windows phone?
    And thanks for this and so many more of your provocative writings and cartoons!

  3. A couple of relevant paragraphs on Google Scholar versus some others:

    “Literature searches from different sources can yield very different results.  For example, using a 2007 original research article on population modeling of selenium toxicity to trout (Van Kirk and Hill 2007), four leading bibliographic indexing services were searched for articles citing that study.  Web of Science (WoS), Elsevier’s Scopus, Digital Science’s Dimensions, and Google Scholar found 7, 10, 15, and 22 citing publications respectively.  Scopus found all articles found by WoS, plus articles WoS missed in Human and Ecological Risk Assessment and IEAM. Google Scholar found all articles found by Scopus and WoS, plus articles in Ecotoxicology Modeling, Water Resources Research, 3 government reports, 2 books, a thesis, a conference proceeding, a duplicate, and 2 ambiguous citations. It follows from this 3-fold difference in valid citations that a critical review of published literature on a topic or a regulatory assessment could miss relevant science if the assessors relied too heavily on a single search provider.

    This simple example was from the “current era” of science, which began by 1996 or so, depending on which bibliographic indexing service scholars are using. Web sites for WoS and Scopus respectively report their indexing databases are reliable from 1971 and 1996 forward. Relying exclusively on bibliographic index searching may omit important, relevant older research.”

    from Scientific Integrity Issues in Environmental Toxicology and Chemistry: improving research reproducibility, credibility, and transparency by others and myself.

  4. For particle/nuclear/high-energy astro physics, the INSPIRE database (http://inspirehep.net/) can perform a similar function to Google Scholar. It provides an interesting counterpart, in that it is based on ‘straight cuts’ rather than a machine learning approach. The user can search for words in the title, author, etc., but then the output is either ordered by date of entry into the database, or number of citations. INSPIRE grew out of the SPIRES database, which started in the 1970s, so it is generally simpler than Google Scholar.
    However, I find it more useful when trying to do complete literature searches. INSPIRE is funded by the particle physics community (a consortium of labs from multiple countries). Other science communities might want to consider something like this.
