Crawling, Indexing & Searching
The CrunchIndex server can crawl a variety of data sources, index them, and search the indexes that you create. Together, these three components form a powerful enterprise search solution that can be customized to your needs. CrunchIndex can run as a small-scale, single-server application, or you can scale it out to provide distributed crawling and indexing at enterprise scale.
Helps you search
As you type what you’re looking for, CrunchIndex makes suggestions. If you’ve made a spelling error, it suggests a correction. Previous searches are also stored locally in browser cookies and offered as suggestions.
Unite data silos
Many organizations have “data silos”: separate systems that each maintain a set of data, with no unified way of searching across them. CrunchIndex solves this problem by crawling and indexing many different types of data sources. The crawler can be configured to crawl not only web resources but also web archives (ARC and WARC), databases, MediaWiki XML dumps, ODP files, RDF files, RSS feeds, text files and much more.
Extract and Transform Data
All data is different and not everyone is looking for the same thing. This is why CrunchIndex supports multiple methods for collecting, transforming and displaying the data that it comes across. Specific data can be extracted from pages and classified according to your criteria using CrunchIndex’s Page Extraction Language. Page Extraction Language is a simple scripting language that is executed on a page when it is processed by the indexer. This gives you control over what data is extracted from a page and how it is displayed in search results. If you need a more robust solution for more complex data, you can use CrunchIndex’s indexing plugin feature. Indexing plugins help you tailor indexing to your specific needs.
News Feeds
Every day a torrent of new information flows in the form of RSS feeds. CrunchIndex can pull large numbers of RSS feeds at regular intervals and index them. This allows you to keep track of news topics by quickly searching for terms relevant to the event in question. In addition to being an effective news conduit, CrunchIndex can also output its results in RSS format. This is useful in scenarios where you’re trying to aggregate information on one or more topics, whether for research or simply to stay informed. In addition to the Images, Video and News sub-searches, you could even have a search portal that only displays news where certain terms are mentioned.
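To illustrate what search results in RSS format might look like, here is a minimal Python sketch that serializes a list of results as an RSS 2.0 feed using the standard library. It is not CrunchIndex’s own code: the function name results_to_rss and the result fields (title, link, summary) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def results_to_rss(query, portal_url, results):
    """Serialize search results as an RSS 2.0 document (hypothetical helper).

    `results` is a list of dicts with "title", "link" and "summary" keys;
    these field names are assumptions for this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link and description on the channel.
    ET.SubElement(channel, "title").text = f"Search results for {query}"
    ET.SubElement(channel, "link").text = portal_url
    ET.SubElement(channel, "description").text = f"Items matching {query}"
    for result in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = result["title"]
        ET.SubElement(item, "link").text = result["link"]
        ET.SubElement(item, "description").text = result["summary"]
    return ET.tostring(rss, encoding="unicode")
```

Any feed reader that understands RSS 2.0 could then subscribe to a saved search in the same way it subscribes to a news site.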
Flexible Server Layout
An index can be distributed among multiple machines to speed up both indexing and searching. CrunchIndex servers can be configured to handle specific roles or to act as mirrors of other CrunchIndex servers. All servers, including mirrors, can be used to handle search requests. If your organization is global, a mirror can serve as a local replica of an index in a remote location. If either server in a mirrored pair fails, the other will still service requests.
Caching and MemCached
Each server can be configured to maintain a local cache of query results. CrunchIndex can use its internal file cache for this, or it can use memcached servers where available. In addition to query results, CrunchIndex can also be configured to cache entire web pages. This can prove useful in situations where internet access is unreliable.
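The idea behind a query result cache can be sketched in a few lines of Python. This is an illustrative in-memory version only, not CrunchIndex’s file cache or its memcached integration: the QueryCache class, its time-to-live parameter and the query normalization are all assumptions made for the example.

```python
import time


class QueryCache:
    """Tiny in-memory query result cache with expiry (hypothetical sketch)."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}   # normalized query -> (stored_at, results)

    def get_or_compute(self, query, compute):
        # Normalize case and whitespace so equivalent queries share an entry.
        key = " ".join(query.lower().split())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        # Cache miss or expired entry: run the real search and store it.
        results = compute(query)
        self._entries[key] = (now, results)
        return results
```

A shared cache such as memcached follows the same get-or-compute pattern, with the difference that several servers can reuse each other’s stored results.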
Language Support
CrunchIndex’s UI can be configured for use in English, French, Spanish, German, Arabic, Italian, Hebrew, Polish, Vietnamese, Russian, Korean and more. Additional languages can be added using CrunchIndex’s localization settings. Currently, the CrunchIndex indexing engine only includes a word stemmer for English. However, additional word stemmers can be added as plugins, which lets you build language plugins for almost any language. The UI also supports different writing directions, such as right-to-left and vertical text (Arabic, Korean, Japanese, etc.).
Smarter Searching
There are many ways to look at the data being searched and how it is classified. CrunchIndex can learn which results are most relevant to your needs through its classifier feature. Once a classifier has been trained, it can be used to classify data during indexing. Classifiers can also be used in the search ranking process.
Distributed Crawling and Indexing
CrunchIndex servers can be configured to share the load of a large crawl, a large indexing operation or the fetching of search results. The servers act as a single unit to complete a task, with each server handling its own portion of the crawl and indexing process. Work is distributed to where it can best be handled: batches of URLs for the same site are sent to a single CrunchIndex server, so that sites are not swarmed while crawling.
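One common way to keep all URLs for a site on the same server is to hash the host name and use the result to pick a server. The Python sketch below illustrates that general technique; it is an assumption about how such routing could work, not a description of CrunchIndex’s actual scheduler, and the function names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def server_for_url(url, num_servers):
    """Pick a server index from the URL's host name (hypothetical helper)."""
    # urlsplit lowercases the host, so EXAMPLE.com and example.com agree.
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # A stable hash keeps the mapping identical across processes and runs.
    return int.from_bytes(digest[:8], "big") % num_servers


def partition_urls(urls, num_servers):
    """Group URLs into one batch per server, keeping each host together."""
    batches = {index: [] for index in range(num_servers)}
    for url in urls:
        batches[server_for_url(url, num_servers)].append(url)
    return batches
```

Because every URL for a given host lands in the same batch, a single server is responsible for pacing requests to that site.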
A Friendly Crawler
CrunchIndex is polite and cordial when scouring the data sources needed in your crawl. While crawling, targets of the crawl will not be inundated with requests to the point where service is affected. In addition to being a friendly crawler, CrunchIndex is also an obedient crawler. All robots directives are obeyed and regularly re-checked to make sure the crawl follows the crawl target’s guidelines.
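As an illustration of what obeying robots directives involves, the following Python sketch uses the standard library’s urllib.robotparser to check a URL against a robots.txt file and read its crawl delay. The sample robots.txt content, the ExampleBot user agent and the helper names are made up for this example; CrunchIndex’s own implementation may differ.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the directives allow this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def delay_for(parser, user_agent, default=1.0):
    """Seconds to wait between requests, falling back to a default."""
    delay = parser.crawl_delay(user_agent)
    return default if delay is None else float(delay)
```

A crawler would typically re-download and re-parse robots.txt periodically, since site owners can change their directives at any time.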
Flexible access control
All aspects of CrunchIndex’s user interface can be configured to restrict functionality to only those who need it. In addition to a security-aware UI, CrunchIndex also extends security to the data that it has indexed. Indexes and mixes can be configured to be available only to certain users. This allows CrunchIndex to be configured to comply with your organization’s specific data security needs.
Private search
Your crawling and indexing operations are managed by your organization, not a third party. With a private search engine you have full control over what data is crawled and indexed and who can see it. Data about your search trends is not sent to a third party and remains completely private. CrunchIndex even supports crawling from behind a proxy or through The Onion Router (Tor). In addition to crawling through Tor, CrunchIndex can also crawl .onion sites themselves.
Customize and Integrate
CrunchIndex can integrate smoothly into your organization’s existing IT environment. The CrunchIndex user interface can be customized, re-skinned and re-branded to match your organization’s visual design themes and style. Depending on the data being indexed, you may need to display it in a non-standard format. This can be done with the aid of indexing plugins, which are used during and after the indexing process to transform how data is displayed.
Powerful Search Operators
Every search is different, and sometimes it’s like searching for a needle in a haystack. CrunchIndex supports many different search operators to slice and dice the data into something meaningful to you. The classifier and crawl mixes further extend this feature by allowing you to mix and match indexes so that they contain information specific to your searching needs. A user can select from any number of indexes or mixes of indexes when performing a search.
Talk about the data
Searches, crawls and mixes can all have comment threads associated with them. This allows users to write annotations, comments or even discussions about a certain topic. Users can also configure feeds from CrunchIndex itself. This is a powerful way to disseminate information throughout your organization, making CrunchIndex not only a one-stop shop for enterprise search and data collection but also a powerful information monitoring tool.