By Timo Selvaraj
While crawling websites for search is widely familiar thanks to Google, Bing and other web search engines, there is still a lack of knowledge about what needs to be done when crawling a website or intranet to deliver better search results for your website visitors. Being indexed by Google, and what Google Search shows, is no longer the yardstick for measuring the quality of search results on a company website used for marketing, sales and communications.
Prioritize relevance based on what visitors will search for
Focus on indexing content so that the key product messaging pages, marketing communication landing pages and popular downloadable documents are indexed and found in the search results for the right terms. SearchBlox allows you to index individual URLs or documents when the crawler cannot find them because of a lack of inbound links. Customers often expect orphan pages or documents to be found without providing inbound links to the content. SearchBlox provides reports on which searches were typed in and what users clicked on, which will help you change course when you find users are going to the wrong page for a search term.
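As a rough illustration of how a search-and-click report can reveal a relevance problem, the sketch below counts which page is clicked most often for each search term. The log format, the sample data and the `top_clicks` function are assumptions made for this example; they are not SearchBlox's report format or API.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (search term typed, URL the user clicked).
CLICK_LOG = [
    ("pricing", "/blog/old-pricing-post"),
    ("Pricing", "/blog/old-pricing-post"),
    ("pricing", "/products/pricing"),
    ("datasheet", "/downloads/datasheet.pdf"),
]


def top_clicks(log):
    """Return the most-clicked URL and its click count for each search term."""
    counts = defaultdict(Counter)
    for term, url in log:
        # Treat "Pricing" and "pricing" as the same search term.
        counts[term.strip().lower()][url] += 1
    return {term: clicks.most_common(1)[0] for term, clicks in counts.items()}


if __name__ == "__main__":
    for term, (url, n) in top_clicks(CLICK_LOG).items():
        print(f"{term!r}: most clicked {url} ({n} clicks)")
```

In this made-up data, most searches for "pricing" end on an old blog post rather than the pricing page, which is the kind of signal that suggests a relevance adjustment is needed.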
Visitors search when they can’t find it on the home page or in the navigation
Understanding the user’s mental model and behavior is critical to search relevance on a website. Users resort to search to find a page or document when they can’t find it through the navigational hierarchy or a link on the home page, and not every page can appear in the hierarchy or on the home page. Most often, web content managers react by placing links to the popular pages on the front page. This solves the issue temporarily, but in the long run it overcrowds the home page and dilutes the importance of the links on it.
Robots.txt, sitemap.xml and more levers for finding the pages
Indexing the right pages without duplication, and including orphan pages and documents where required, is a must. The robots.txt and sitemap.xml files help the crawler identify the pages. It is good practice to provide these files on the public website, as they are useful for both web search engines and a website search engine like SearchBlox. In addition to these files, using the remove duplicates option will help ensure duplicate pages with dynamic URLs don’t get into the search index. With fancy URLs and SEO URLs often set up on websites for better web search rankings, using these levers will ensure better accuracy and relevance for your overall search.
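To make the roles of these files concrete, here is a minimal sketch of how a crawler might combine them: read the URLs listed in sitemap.xml, drop those that robots.txt disallows, and collapse duplicates that differ only by dynamic query parameters. It uses only the Python standard library. The sample files, the `searchbot` user agent, the `IGNORED_PARAMS` list and the function names are assumptions for illustration, and the duplicate handling here is a simple URL normalization, not SearchBlox's remove duplicates option.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical sample files for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/products?sessionid=abc</loc></url>
  <url><loc>https://www.example.com/products?sessionid=xyz</loc></url>
  <url><loc>https://www.example.com/private/report.pdf</loc></url>
  <url><loc>https://www.example.com/whitepaper.pdf</loc></url>
</urlset>
"""

# Assumed list of query parameters that do not change the page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign"}


def sitemap_urls(xml_text):
    """Extract every <loc> URL from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]


def normalize(url):
    """Drop ignored query parameters and the fragment so duplicates collapse."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in IGNORED_PARAMS]
    query = urlencode(sorted(kept))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


def crawl_list(robots_text, sitemap_text, user_agent="searchbot"):
    """Return allowed, de-duplicated URLs in sitemap order."""
    robots = RobotFileParser()
    robots.parse(robots_text.splitlines())
    seen, result = set(), []
    for url in sitemap_urls(sitemap_text):
        if not robots.can_fetch(user_agent, url):
            continue  # robots.txt disallows this path
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


if __name__ == "__main__":
    for url in crawl_list(ROBOTS_TXT, SITEMAP_XML):
        print(url)
```

Listing an orphan document such as the whitepaper in the sitemap is what lets the crawler find it even though no page links to it.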
Relevance with synonyms, stopwords and Featured Results
Because users refer to the same thing on your website in more than one way, it is important to use synonyms to map the different references to a single term. For example, if some users call something “universal” and others call it “global”, you can map the terms so that even though your content uses “global”, a search for “universal” shows the same results. Stopwords let you ignore the most common words found in any language: words like “a”, “at” or “the” are removed during the indexing process so they play no part in matching. Another overlooked feature is having the spider ignore sections of the page. Wrapping unwanted sections such as headers, footers and navigation in <noindex> </noindex> tags removes them from search, so search terms are matched against the actual content of each page rather than the extraneous content found on every page. Featured results are an important feature for ensuring that the most important marketing banners, text links or products are shown at the very top.
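The three text-level levers above can be sketched in a few lines of Python. This is a simplified illustration, not how SearchBlox implements them: the stopword list and synonym map are tiny made-up examples, the function names are hypothetical, and the regular expressions are a stand-in for a real HTML parser.

```python
import re

# Assumed example configuration.
STOPWORDS = {"a", "at", "the"}
SYNONYMS = {"universal": "global"}  # term users type -> term used in content

NOINDEX = re.compile(r"<noindex>.*?</noindex>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(html):
    """Terms to index: noindex sections, markup and stopwords removed."""
    without_noindex = NOINDEX.sub(" ", html)
    text = TAG.sub(" ", without_noindex)
    return [t for t in tokenize(text) if t not in STOPWORDS]


def expand_query(query):
    """Drop stopwords from the query and map synonyms to the content's term."""
    terms = [t for t in tokenize(query) if t not in STOPWORDS]
    return [SYNONYMS.get(t, t) for t in terms]


if __name__ == "__main__":
    page = (
        "<noindex><nav>Home Products Contact</nav></noindex>"
        "<p>The global shipping policy</p>"
    )
    print(index_terms(page))
    print(expand_query("the universal policy"))
```

With this setup, a search for “universal policy” is rewritten to the terms the page actually contains, and the navigation text inside the noindex section never reaches the index.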
SearchBlox was built from the ground up using these best practices so website managers can provide a strong search engine. Learn more about how SearchBlox enables you to crawl websites and ensure greater search relevance by contacting us.