The Invisible Web
A RECAP Workshop by Paula Edmiston
West Chester University, Pennsylvania, May 14, 2004
Still round the corner there may wait,
A new road or a secret gate.
-- J. R. R. Tolkien
What is the Invisible Web (sometimes referred to as the Deep Web)? It is composed
of those resources not accessible through the search engines. Revealing the
Invisible Web requires taking a slightly different road, opening new gates. The
resources are there, but the traditional paths via search engines do not lead to
them. A study by faculty and students at the School of Information Management and
Systems at the University of California at Berkeley (Lyman) estimates the size of
the Internet in 2002 at 532,897 terabytes. The surface web, accessible via search
engines, was measured at 167 terabytes, and the "Deep Web" (the Invisible
Web) at 91,850 terabytes. For comparison, if digitized with full formatting, the
seventeen million books in the Library of Congress would contain about 136
terabytes of information (Lyman).
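To put the figures quoted above in proportion, the short calculation below works out a few ratios. It is only an arithmetic sketch: the four numbers come from the paragraph above, while the variable names are chosen here for illustration.

```python
# Sizes in terabytes, as quoted above from the Berkeley study (Lyman).
TOTAL_INTERNET_TB = 532_897
SURFACE_WEB_TB = 167
DEEP_WEB_TB = 91_850
LIBRARY_OF_CONGRESS_TB = 136

# How many times larger the Deep Web is than the surface web.
deep_to_surface = DEEP_WEB_TB / SURFACE_WEB_TB

# Share of the total estimate that search engines can reach.
surface_share = SURFACE_WEB_TB / TOTAL_INTERNET_TB

# Surface web measured in "Libraries of Congress".
surface_in_libraries = SURFACE_WEB_TB / LIBRARY_OF_CONGRESS_TB

print(f"Deep Web / surface web: about {deep_to_surface:.0f} to 1")
print(f"Surface web share of total: {surface_share:.3%}")
print(f"Surface web in Libraries of Congress: {surface_in_libraries:.2f}")
```

By these figures the Deep Web is roughly 550 times the size of the surface web, and the surface web is only a little larger than the digitized Library of Congress.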
Search engines were originally designed to index web pages created using HTML,
a markup language used to organize and structure text. Search engines are
struggling to keep pace with the evolution of
databases, multimedia and other formats used to present information on the web.
What makes Resources Invisible?
The resources are not really invisible, of course. But they are not included in the indexing of the
web conducted by search engines, so they are, in a sense, invisible to the search engines.
Sometimes the phrase "Deep Web" is used to describe these resources because you have to dig deep
to find them. Easy-to-find web pages can be considered "surface" resources.
- Databases
- In addition to the proprietary databases such as
Psych Abstracts, the MLA Bibliography, Chemical Abstracts and the
citation indexes, there are many databases available
for free via the web. Academic organizations, professional
associations and non-profits usually sponsor these databases.
An example of this is the
Group
Psychotherapy, Psychodrama & Sociometry Bibliography
Database, a
collection of over 4,000 citations in several languages. This is a
tremendous resource but the citations
are in a database and not accessible from a search engine.
- Format
- Formats that are difficult to index include office software files such as
word processing documents and spreadsheets, PDF, Postscript, Flash, Shockwave,
executables, streaming audio and video, and images. More search engines
are indexing a greater variety of formats all the time, but there are
still limitations. While some search engines include the PDF format,
they index only a portion of the document. Search engines like
Altavista and Google do offer image searches, but they search file names
only.
- Pages that are not linked
- Many search engines discover new web pages
only by following links on known pages. If no page that is already being indexed
links to a web page, the likelihood of it being indexed is small.
- Robots
- Robot instructions can be applied sitewide or page-by-page. Authors
of individual pages can include a meta tag to
turn away search engine spiders. Web hosts can place special commands on
their sites that will turn spiders away from entire sites.
- Password Protection
- Some web pages are password protected. Even demonstration sites
containing valuable information can be rendered invisible because the
spiders cannot read and use the login information that might be
displayed on the page.
- Search Engine limitations
- Some search engines place restrictions on indexing, such as choosing to index
only a portion of a page.
- Dynamically Generated Pages
- Search engines will avoid pages that are generated "on the fly",
such as weather conditions and flight arrivals.
- Script-generated Pages
- This type of resource produces static HTML
formatted pages that a spider could index. Because of abuse of this type of scripting,
search engines deliberately refuse to follow these links. An example is a URL
incorporating a question mark (?).
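Two of the causes listed above, unlinked pages and URLs containing a question mark, can be illustrated with a toy spider. This is a minimal sketch, not how any particular search engine works: the link graph and the example.org addresses are invented for illustration, and the spider reads links from a dictionary instead of fetching pages.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "http://example.org/": [
        "http://example.org/about.html",
        "http://example.org/search?q=art",  # script-generated URL
    ],
    "http://example.org/about.html": ["http://example.org/"],
    # No page links to this one, so a link-following spider never sees it.
    "http://example.org/orphan.html": ["http://example.org/"],
}


def crawl(seed, links, follow_query_urls=False):
    """Breadth-first walk from a seed page; returns the set of pages reached."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target in seen:
                continue
            # Skip URLs that carry a query string (the part after "?").
            if urlsplit(target).query and not follow_query_urls:
                continue
            seen.add(target)
            queue.append(target)
    return seen


indexed = crawl("http://example.org/", LINKS)
print(sorted(indexed))
```

Starting from the home page, the spider reaches only the home page and about.html. The orphan page and the search URL stay "invisible" even though both exist.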
The web is maturing and search engine spiders are improving. Slowly more
types of resources are being indexed. In the meantime there are a number of
ways you can identify relevant resources and manage your link collections.
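The robot instructions mentioned in the list above come in two forms: a sitewide robots.txt file and a page-level robots meta tag. The sketch below shows how a well-behaved spider might read both, using only the Python standard library. The robots.txt content, the page markup, the example.org addresses and the "ExampleSpider" name are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical sitewide instructions, as a web host might publish them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def may_index(url, agent="ExampleSpider"):
    """True if the sitewide rules allow this (hypothetical) spider to fetch url."""
    return robots.can_fetch(agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def blocked_by_meta(html_text):
    """True if the page itself asks spiders not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" in parser.directives


# Hypothetical page whose author turns spiders away.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'

print(may_index("http://example.org/index.html"))
print(may_index("http://example.org/private/report.html"))
print(blocked_by_meta(PAGE))
```

Both mechanisms are requests rather than locks: they work only because spiders choose to honor them.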
Revealing the Invisible Web
Searching for information is a tricky process requiring the searcher to
be precise and exact in phrasing. The way a search is conducted can
affect the success of finding appropriate resources. Consider the needle
and the haystack from Koll:
A known needle in a known haystack
A known needle in an unknown haystack
An unknown needle in an unknown haystack
Any needle in a haystack
The sharpest needle in a haystack
Most of the sharpest needles in a haystack
All the needles in a haystack
Affirmation of no needles in a haystack
Things like needles in any haystack
Let me know whenever a new needle shows up
Where are the haystacks?
Needles, haystacks -- whatever
- Search engines
- Although search engines can't index the contents of databases they can
sometimes be used to identify potentially useful databases. Try a search in
Teoma:
database "art history"
or
"database of art history"
- Note the use of quotation marks. This strategy can be used in most search engines.
Quoting text marks it as a "phrase": the words must be next to each other and
in the order given.
- Portals and Specialized Search Engines
- Portals are software programs that organize the titles in lists of
links by subject
and usually offer some method of searching the collection of titles.
- Some universities and many libraries are maintaining portals
of databases and subject-specific resources.
- The Librarian's Index to the Internet
is a well-organized point of access for reliable, trustworthy,
librarian-selected Internet resources, serving California, the nation,
and the world.
- BUBL LINK / 5:15 is a
collection of selected Internet resources covering all academic subject
areas.
- Digital Librarian is
a librarian-maintained portal to myriad resources.
- Infomine is a collection of
scholarly resources, some chosen by librarians, some added
automatically by the software. Infomine is a joint project of several
libraries.
- Scirus is a search engine for scientific information, covering
scientific, scholarly, technical and medical data on the Web. It includes
peer-reviewed articles and journals that other search engines miss.
- The Public Library
of Science (PLoS) is a non-profit organization of scientists and
physicians committed to making the world's scientific and medical
literature a freely available public resource.
- OAIster is a project of the
University of Michigan Digital Library Production Service. It includes over 3
million resources from 277 institutions.
- CompletePlanet is a commercial
venture that provides access to over 70,000 searchable databases and specialty
search engines.
- Invisible Web is a web site companion
to the book, The Invisible Web by Chris Sherman and Gary Price.
- Word of mouth
- Make note of resources mentioned by colleagues; add these resources to the
tools you use to keep track of specialized sites (see the section on tools, below).
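The difference between a quoted phrase and a plain list of words, described under "Search engines" above, can be sketched in a few lines. This is an illustrative model of phrase matching only, not the algorithm of Teoma or any other search engine. The function names and sample sentences are invented here.

```python
import re


def tokens(text):
    """Lower-case a string and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_all_words(text, query):
    """Unquoted search: every query word appears somewhere in the text."""
    words = set(tokens(text))
    return all(word in words for word in tokens(query))


def contains_phrase(text, phrase):
    """Quoted search: the words appear next to each other, in the order given."""
    words = tokens(text)
    target = tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


page_a = "A searchable database of art history images"
page_b = "The history of database art"

for page in (page_a, page_b):
    print(page)
    print("  all words:", contains_all_words(page, "art history"))
    print("  phrase:   ", contains_phrase(page, "art history"))
```

Both sample pages contain the words "art" and "history", but only the first contains them as a phrase, which is why quoting narrows the results.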
Tools for Keeping the Resources Visible
- Bookmarks
- You can use your browser bookmarks to save and
organize links to tools and resources.
Browser bookmarks are convenient because you can quickly bookmark
a site and file it into a specific subject folder. But if you work at more than one
computer you'll run into a problem, because the bookmark file remains on the
computer on which it was saved. Once you walk away from that computer you lose
access to the bookmark file.
Netscape saves its bookmarks in a single file. If you have a web account you
can copy your bookmark file
to your web site and access it from anywhere in the world.
MS Internet Explorer saves each "favorite" as an individual file, and you cannot
copy all those files to your web site. It is possible to export the favorites (a
command that saves them in a different format) to a Netscape bookmarks compatible
file. With MSIE V. 5 you find the export command on the File menu. Once you've
exported the favorites they can be placed in your web site.
- Web Pages
- If you have your own web site you can create pages to keep
track of useful sites, including notes about search strategies.
- Personal Portals
- You can acquire ready-made programs to install in your web site
to act as a portal, similar to the subject directory portals offered
by Yahoo and the
Open Directory Project. One example
is the Linker script. Another, a Perl
CGI script for managing lists of links in subject categories, is from Gossamer
Threads. An early version of this commercial code is available at
MM Resources. You can write to Paula at MM for a copy of this code.
The iVia software behind
Infomine (see above) is available as open source
software.
CWIS (pronounced see-wis) from
the Internet Scout Project
is software to assemble, organize, and share
collections of data about resources, like Yahoo! or Google Directory but
conforming to international and academic standards for metadata. CWIS
was specifically created to help build collections of Science,
Technology, Engineering, and Math (STEM) resources and connect them into
NSF's National Science Digital Library, but can be (and is being) used
for a wide variety of other purposes. CWIS is open source.
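Because a Netscape-style bookmark file is a single HTML file, it is also easy to read with a short script, for example to build your own page of links. The sketch below pulls folder names, titles and addresses out of such a file using only the Python standard library. The sample file content and the example.org addresses are invented for illustration, and real bookmark files carry extra attributes (such as dates) that this sketch simply ignores.

```python
from html.parser import HTMLParser

# Hypothetical, much simplified bookmark file.
SAMPLE_BOOKMARKS = """\
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3>Art History</H3>
    <DL><p>
        <DT><A HREF="http://example.org/art-db">Example art database</A>
    </DL><p>
    <DT><A HREF="http://example.org/portal">Example portal</A>
</DL><p>
"""


class BookmarkParser(HTMLParser):
    """Collects (folder, title, url) triples from a bookmark file."""

    def __init__(self):
        super().__init__()
        self.folders = []     # stack of open folder names ("" for the top level)
        self.pending = None   # folder name from <H3>, waiting for its <DL>
        self.capture = None   # "h3" or "a" while collecting text
        self.text = []
        self.href = None
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.capture, self.text = "h3", []
        elif tag == "a":
            self.capture, self.text = "a", []
            self.href = dict(attrs).get("href")
        elif tag == "dl":
            # A <DL> opens a folder: named if an <H3> came just before it.
            self.folders.append(self.pending or "")
            self.pending = None

    def handle_data(self, data):
        if self.capture:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == self.capture:
            label = "".join(self.text).strip()
            if tag == "h3":
                self.pending = label
            else:
                folder = "/".join(name for name in self.folders if name)
                self.bookmarks.append((folder, label, self.href))
            self.capture = None
        elif tag == "dl" and self.folders:
            self.folders.pop()


parser = BookmarkParser()
parser.feed(SAMPLE_BOOKMARKS)
for folder, title, url in parser.bookmarks:
    print(f"[{folder or 'top level'}] {title}: {url}")
```

The resulting list of triples could be written out as a plain web page of links grouped by subject, which is one simple way to carry a bookmark collection from one computer to another.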
The Future of Web-based Resources
Scholarly and institutional organizations are beginning to work toward
making previously inaccessible resources available. Here are a few
examples.
- The Scout Portal Toolkit
- The Scout
Portal Toolkit, also from the Internet Scout Project, allows groups
or organizations that have a collection of knowledge or resources they
want to share via the World Wide Web to put that collection online
without making a big investment in technical resources or expertise.
- Open Archives
- The Open Archives Initiative (OAI) is a project to
make web-based scholarly resources accessible through the use of a
special language tool called "metadata". Metadata offers a way to standardize
how documents are harvested from institutions. The OAI is developing tools that can be
used by institutions to make their resources more available.
- The MIT DSpace
- DSpace is a groundbreaking digital
library system to capture, store,
index, preserve, and redistribute the intellectual output of a
university's research faculty in digital formats.
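To make the idea of metadata in the Open Archives entry above more concrete, the sketch below reads one metadata record and lists its fields. It assumes the record uses unqualified Dublin Core, the baseline format that the OAI harvesting protocol asks repositories to support. The record itself (title, author, address) is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record, as a repository might expose it for harvesting.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example thesis on group psychotherapy</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:date>2004-05-14</dc:date>
  <dc:identifier>http://example.edu/item/1</dc:identifier>
</oai_dc:dc>
"""


def dublin_core_fields(xml_text):
    """Return a dict mapping each Dublin Core field name to its list of values."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"
    fields = {}
    for child in root:
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


record = dublin_core_fields(SAMPLE_RECORD)
for name, values in record.items():
    print(f"{name}: {'; '.join(values)}")
```

Because every institution describes its documents with the same small set of field names, a harvester can collect records from many repositories and search them together.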
The Invisible Web can be rendered visible. As tools such as search
engines continue to improve, and researchers continue to learn new
search techniques, more resources will become accessible.
References and Readings
All pages viewed May 2004.
Bergman, Michael K. "The Deep Web: Surfacing Hidden Value". The Journal of Electronic Publishing. 7:1. August 2001.
Block, Marylaine. The Invisible Web. 14 Nov 2003.
Chamberlain, Ellen. Bare Bones 101: Lesson 4: Gateways and Subject Specific Databases. 10 July 2002.
Koll, Matthew. "Information Retrieval". Bulletin of The American Society for Information Science. 26:2 (December / January 2000).
Lyman, Peter et al. How Much Information? 2003
Norvig, Peter [Director of Search Quality, Google]. Internet Searching. [to appear, Report on the Fundamentals of Computer Science, The National Academies, 2003]
Rosenberg, Gena. Web Spiders and Robot Exclusion. February 26, 2001.
Search Tools Consulting. Search Indexing Robots and Robots.txt. Page Updated 2002-12-18.
Sherman, Chris and Gary Price. "The Invisible Web: Uncovering Sources Search Engines Can't See." Library Trends 52:2 Fall 2003, pp.282-298.
Tyburski, Genie. The Invisible Web: a Brief Note and Bibliography. 3 April 2003.
Vidmar, Dale. Hocus Pocus: (Un)veiling the (In)visible Web. 28 Feb 2003.
Learn to see,
and then you'll know
there is no end to the
new worlds of our vision.
-- Carlos Castaneda