August 14, 2007
10:00-11:20 am
Room 14-267

Moderators: Mark Matienzo (American Institute of Physics) and Jason Casden (NYU Medical)

Solr / Lucene

Haphazard notes - please revise/add at will!

Specific areas of interest from Audience?
  • XPath indexing in Lucene

How familiar are folks with Solr/Lucene?

Many people came to learn what Solr/Lucene are

Lucene - open source indexing engine - incredibly powerful and robust, but very difficult for folks who aren't used to heavy tech

Solr - web service layer on top of Lucene
- set up an index scheme
- enables faceting
- build an interface on top (in any programming language) - Ruby, Python, etc.
  • And this is a key point of distinction between Lucene and Solr. Running Lucene requires that you write your code in Java, just as Plucene requires Perl, PyLucene requires Python, Ferret requires Ruby, and so forth. Solr exposes Lucene via a web service, and so your code can be in any programming or scripting language that satisfies the following two criteria: 1) Can talk HTTP. 2) Can read and write XML. Most if not all modern programming languages satisfy both.
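A minimal sketch of that point, assuming Python, a local Solr instance on the stock port, and a schema with a "title" field (all assumptions, not anything demoed in the session):

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # Query Solr over HTTP and read the XML response with nothing but the
    # standard library. "title" is a hypothetical schema field.
    params = urllib.parse.urlencode({"q": "kiowa", "rows": 10})
    url = "http://localhost:8983/solr/select?" + params

    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)

    # Solr wraps hits in <result><doc>...</doc></result>.
    for doc in tree.findall(".//result/doc"):
        title = doc.find("str[@name='title']")
        print(title.text if title is not None else "(no title field)")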

Are the facilitators librarians? How do they use it? How does it fit into libraries?
Most folks here are not actually using it on the job
- lacking admin support and resources
What's the motivation for looking at it?
- the Verity search engine (American Institute of Physics)

Size of library data: just under 30k records - lots of EAD finding aids - free-text search only, too complicated to set up fielded search.

Digitizing transcripts - no structured metadata, hard-coded HTML

Jason - also a librarian, not using it at work.

Two main uses: Solr/Lucene to spruce up the OPAC and to organize EAD finding aids

Med lib - OPAC lightly used

Faceted browsing

What is EAD? An XML schema that defines the structure of archival finding aids (at collection and item level)
- an extensive descriptive tool - more robust

VuFind
Project Blacklight
Internet Archive
Smithsonian cross-collection search
LibraryFind at Oregon State
DSpace

Also, Ryan Eby has a detailed blog posting on Solr in libraries here: http://blog.ryaneby.com/archives/solr-in-libraries/

Commercial library systems:
- Primo

Lucene is incredibly low-level - it won't do anything unless you're a programmer; you need to use a programming language to interact with the index.

Solr - interface layer - provides a query language (with raw Lucene, you'd have to write that yourself)
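For instance, faceting is just extra query parameters; a hedged Python sketch, assuming a "subject" field in the schema:

    import urllib.parse
    import urllib.request

    # Ask for facet counts alongside the hits. "subject" is an assumed
    # field name from the schema, not a Solr built-in.
    params = urllib.parse.urlencode({
        "q": "finding aid",
        "facet": "true",
        "facet.field": "subject",
        "facet.mincount": 1,
    })
    url = "http://localhost:8983/solr/select?" + params
    print(urllib.request.urlopen(url).read()[:500])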

Searching the Smithsonian for "Kiowa"
- retrieving digitized materials
- this is all pulled from MARC records
- local subject heading for

This is all cataloged in MARC records??

Is all of this cataloged using same standards?

Dealing with heterogeneous data becomes very challenging -
you start getting uneven clumping, and the facets get messy

Is it worth it to change/clean-up data?

Normalization issues - even simple ones like date normalization - especially in MARC for archival collections

ISBD punctuation is fairly simple
subject headings with or without periods in the date field

Blacklight (U. Virginia project) -
- MARC records
- digital library content (Fedora)
- Tang Dynasty poems (TEI?)

Indexing is batched - dump the catalog, convert to MARC-XML, re-index
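A sketch of the conversion step in that pipeline, assuming pymarc (filenames are placeholders):

    from pymarc import MARCReader, record_to_xml

    # Convert a raw MARC dump to MARC-XML, ready for a downstream
    # transform into Solr documents.
    with open("catalog_dump.mrc", "rb") as src, open("catalog.xml", "wb") as out:
        out.write(b"<collection>")
        for record in MARCReader(src):
            out.write(record_to_xml(record))
        out.write(b"</collection>")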

queries / facets cached for performance

Does the faceted filtering stay on the client? Not an Exhibit layer - it's all in Ruby on Rails;
was part of the Flare project

quick code4lib plug

Erik Hatcher's pre-conference seminar at code4lib '07 in Atlanta on Solr/Lucene - he demoed his basic interface, Flare; Project Blacklight - clean, simple, easy

Date normalization script for EAD
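No details on that script were captured; a minimal sketch of the general idea (patterns and output format are guesses, not the actual script):

    import re

    # Archival dates arrive as "c1923", "[1923?]", "1920-1935", "n.d.", etc.
    # Pull out a four-digit year, or a year range, as a rough normal form.
    def normalize_date(raw):
        years = re.findall(r"\d{4}", raw)
        if not years:
            return None
        return years[0] if len(years) == 1 else years[0] + "/" + years[-1]

    for messy in ["c1923", "[1923?]", "1920-1935", "n.d."]:
        print(messy, "->", normalize_date(messy))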

Solr interaction is over HTTP - all the update stuff too
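Concretely, updates are XML POSTed to the /update handler; a hedged sketch (field names are assumptions about the target schema):

    import urllib.request

    # POST an <add> envelope, then a <commit/> so the doc becomes searchable.
    add_xml = """<add>
      <doc>
        <field name="id">rec-00001</field>
        <field name="title">Kiowa drawings</field>
      </doc>
    </add>"""

    for body in (add_xml, "<commit/>"):
        req = urllib.request.Request(
            "http://localhost:8983/solr/update",
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml"},
        )
        urllib.request.urlopen(req)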

Systems using Solr with ILS data link back to the original records

DLF is working on standardizing the passing of data back and forth from the ILS to the discovery system
NC State solved this with their own web services layer
A SUNY school is working with Grokker on this

Fac-Back-OPAC - Laurentian University in Ontario? Designed to be a backup to their Sirsi Unicorn site

Based on Casey Durfee's system ("open source Endeca" in 150 lines of code)

Hiring someone to be able to use this kind of thinking / tech?
- catalogers
- database folks
- techies
- programmers
- sys-admins
- service folks - ref

Lots of effective projects in this space are team based. All of the above.

Library systems folks are not doing the programming

Solr is straightforward enough that someone marginally competent in programming can do something basic - a minimal system - but the data normalization won't be good.

MARC data is problematic - more systems tend not to convert to MARC-XML first.
You can't always pull data out of the ILS databases.

Side conversation about manipulating MARC data:
ruby-marc - has been steadily maturing
pymarc - definitely usable; lots of folks are using Python to work with Solr (see the sketch below)
MARC::Record (Perl) - being maintained by Evergreen programmer Mike Rylander
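A hedged pymarc sketch of that kind of record munging; the tags and filename are illustrative, not from the session:

    from pymarc import MARCReader

    # Read a binary MARC file and pull out a few fields - the usual
    # record-by-record munging done before handing data to an indexer.
    with open("catalog.mrc", "rb") as fh:
        for record in MARCReader(fh):
            title_field = record["245"]
            title = title_field["a"] if title_field else None
            subjects = [f.format_field() for f in record.get_fields("650")]
            print(title, subjects)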

Perl: great for programmer job security (through obscurity)
(side conversation about programming language preferences)

Is there a provision for searching by field completeness?
- at the Lucene layer?
- Lucene: standard and fuzzy matching; you must turn off stemming and tokenize strings properly (sketch below)
- controlled vocabularies in Lucene
- does he do this in a structured input layer?
- content is in full XML, so he can control the fields
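As a hedged illustration: Lucene query syntax passes straight through Solr's q parameter, so fuzzy matching is one character, and exact matching wants a non-tokenized, unstemmed string field ("subject_exact" is an assumed field name):

    import urllib.parse
    import urllib.request

    # "kiowa~" is Lucene's fuzzy-match syntax; the quoted query against a
    # hypothetical string field (no stemming, no tokenizing) is exact.
    for q in ['kiowa~', 'subject_exact:"Kiowa Indians"']:
        url = "http://localhost:8983/solr/select?" + urllib.parse.urlencode({"q": q})
        print(q, "->", urllib.request.urlopen(url).getcode())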

XPath indexing - mapping XPath expressions to Lucene fields
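A minimal sketch of that mapping over an EAD finding aid, assuming lxml and non-namespaced EAD 2002; the field names and paths are illustrative:

    from lxml import etree

    # Map XPath expressions over a finding aid to named index fields.
    FIELD_XPATHS = {
        "collection_title": "//archdesc/did/unittitle/text()",
        "date": "//archdesc/did/unitdate/text()",
        "abstract": "//archdesc/did/abstract/text()",
    }

    tree = etree.parse("finding_aid.xml")
    doc = {name: tree.xpath(xp) for name, xp in FIELD_XPATHS.items()}
    print(doc)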
How long has he been using Lucene? 4-5 years

Average search over 500k records - average record approx. 1 MB
Indexes in about 3 hours - 8 hours if serialized
Searches return in milliseconds

cluster of 6 servers.

Startup - 75 million records - 60 GB

Ruby on Rails as the front end - Lucene as a Java server on the back end

Shameless plug - Kevin Reiss (CUNY Grad Center) and Joanne (Columbia):
new SIG for the Metro NY Library Council on Library 2.0 - first meeting, on a similar topic, in September

Solr/Lucene

at the METRO center, 11th and University: metro.org/index.php?option=com_content&task=view&id=74&itemid=199

Demo - a look inside the Solr admin interface