
Norconex

Norconex is a North American information technology company specializing in enterprise search[1] professional services and software development, both commercial and open-source.[2][3]

Norconex Inc.
Company type: Private
Industry: Enterprise search, information technology, information access, open-source software
Founded: 2007
Headquarters: Gatineau, Quebec, Canada
Key people: Pascal Essiembre (President)
Website: https://norconex.com/

The company was founded in 2007 by Pascal Essiembre, who serves as its president.[4] Norconex is headquartered in Gatineau, Quebec, Canada.[5]

Overview

Norconex positions itself as independent of any particular search technology, working with various commercial search technologies and vendors as well as open-source search solutions. Its location in Canada's National Capital Region has helped the company build a strong presence in the Canadian federal sector.[6] Norconex customers come from various sectors of commercial activity, including manufacturing, publishing, health, technology, and legal.[7]

Open-source project

Norconex is active in the open-source community and is involved with multiple open-source projects.[8][9]

In 2013, Norconex released its first contribution to the open-source community, an HTTP web crawler named Norconex HTTP Collector.[10][11] Since then, the company has released several other open-source components, including its crawlers (the Norconex HTTP Collector and the Norconex Filesystem Collector), its Importer module,[12] and multiple Committers[13] that store documents in various well-known repositories.[14][15][16]

Services[17]

  • Enterprise Search Consulting
  • Enterprise Search Design
  • Big Data
  • Taxonomy Editing
  • Data Analysis

References

  1. ^ "Norconex - Crunchbase Company Profile & Funding". Crunchbase. Retrieved 2021-05-28.
  2. ^ "Content Analytics and Search Software Market Capacity, Production and Growth Rate Forecast (2021-2027) | Content Insight, Tibco Software, Accenture Intelligent, Salsify, Google, Content Analytics, Norconex – Jumbo News". Retrieved 2021-05-28.
  3. ^ "Norconex on LinkedIn". LinkedIn.{{cite web}}: CS1 maint: url-status (link)
  4. ^ "Pascal Essiembre on LinkedIn". LinkedIn.{{cite web}}: CS1 maint: url-status (link)
  5. ^ "Norconex · 815 Boulevard de la Carrière #201, Gatineau, QC J8Y 6T4, Canada". Norconex · 815 Boulevard de la Carrière #201, Gatineau, QC J8Y 6T4, Canada. Retrieved 2021-05-28.
  6. ^ Government of Canada, Public Services and Procurement Canada (2017-01-18). "NORCONEX INC (EN578-170432/208/EI)". buyandsell.gc.ca. Retrieved 2021-05-28.
  7. ^ "Success Stories". Norconex Inc. Retrieved 2021-05-28.
  8. ^ "Norconex Collectors: Open-Source Crawlers". opensource.norconex.com. Retrieved 2021-05-28.
  9. ^ "Norconex". GitHub. Retrieved 2021-05-28.
  10. ^ "Norconex Gives Back to Open-Source". Norconex Inc. 2013-06-05. Retrieved 2021-05-28.{{cite web}}: CS1 maint: url-status (link)
  11. ^ "A New Open-Source Web Crawler". web.archive.org. 2016-03-04. Retrieved 2021-05-28.
  12. ^ "Norconex Importer". opensource.norconex.com. Retrieved 2021-05-28.
  13. ^ "Committers". opensource.norconex.com. Retrieved 2021-05-28.
  14. ^ "jpmantuano/kafka-committer". GitHub. Retrieved 2021-05-28.
  15. ^ "Deploy a Norconex HTTP Collector Indexer Plugin | Cloud Search". Google Developers. Retrieved 2021-05-28.
  16. ^ "Importing Data from the Web with Norconex & Neo4j". Neo4j Graph Database Platform. 2020-02-10. Retrieved 2021-05-28.
  17. ^ "Services". Norconex Inc.{{cite web}}: CS1 maint: url-status (link)

Norconex HTTP Collector

Norconex HTTP Collector[1] is a web crawler (or spider) initially created for enterprise search integrators and developers. It began as a closed-source project developed by Norconex and was released as open source in 2013.[2][3][4][5][6][7] It is actively maintained by Norconex. The current stable version is 2.x,[8] while version 3.x is in development.[9]

Norconex HTTP Collector
Original author(s): Pascal Essiembre
Developer(s): Norconex Inc.
Stable release: 2.x
Preview release: 3.x
Written in: Java
Operating system: Cross-platform
Type: Web crawler
License: Apache License 2.0
Website: https://opensource.norconex.com/collectors/http/

Features[1][10]

  • Multi-threaded.
  • Supports full and incremental crawls.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of documents on a single server of average capacity.
  • Extracts text out of many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Supports pages rendered with JavaScript.
  • Language detection.
  • Many content and metadata manipulation options.
  • OCR support on images and PDFs.
  • Page screenshots.
  • Extracts a page's "featured" image.
  • Translation support.
  • Dynamic title generation.
  • Configurable crawling speed.
  • URL normalization.
  • Detects modified and deleted documents.
  • Supports different frequencies for re-crawling certain pages.
  • Supports various web site authentication schemes.
  • Supports sitemap.xml (including "lastmod" and "changefreq").
  • Supports robot rules.
  • Supports canonical URLs.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires more than 20 crawler event types for custom event listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
  • Supports scripting languages for manipulating documents.
  • Reference XML/HTML elements using simple DOM tree navigation.
  • Supports external commands to parse or manipulate documents.
  • Supports crawling with your favorite browser (using WebDriver).
  • Supports "If-Modified-Since" for more efficient crawling.
  • Follows URLs from HTML or any other document format.
  • Can detect and report broken links.
  • Can send crawled content to multiple target repositories at once.
  • Many others.

Architecture

Norconex HTTP Collector is written entirely in Java. A single Collector installation is responsible for launching one or more crawler threads, each with its own configuration.
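
For example, a single configuration file can define several crawlers that the same Collector launches together. The following sketch reuses only elements that appear in the minimum configuration later on this page; the crawler identifiers and URLs are illustrative:

<httpcollector id="Two Crawlers Example">
  <crawlers>
    <!-- Each crawler carries its own settings... -->
    <crawler id="Site A">
      <startURLs stayOnDomain="true">
        <url>https://example.com/</url>
      </startURLs>
      <maxDepth>2</maxDepth>
    </crawler>
    <!-- ...and runs alongside its siblings in the same installation. -->
    <crawler id="Site B">
      <startURLs stayOnDomain="true">
        <url>https://example.org/</url>
      </startURLs>
      <maxDepth>5</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>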

Each step of a crawler's life cycle is configurable and overridable. Developers can provide their own interface implementations for most steps undertaken by the crawler. The default implementations cover a vast array of crawling use cases and are built on stable products such as Apache Tika and Apache Derby.
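
This extensibility is visible in the XML configuration, where most components are selected through a class attribute. In the sketch below, the referenceFilters element and the RegexReferenceFilter class name are assumptions based on 2.x naming conventions, and com.example.MyReferenceFilter stands for a hypothetical developer-supplied implementation:

<referenceFilters>
  <!-- A stock filter distributed with the Collector (assumed class name). -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="exclude">.*\.(jpg|png|gif)$</filter>
  <!-- A hypothetical custom implementation written by a developer. -->
  <filter class="com.example.MyReferenceFilter" />
</referenceFilters>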

The Importer and Committer modules are separate Apache-licensed Java libraries distributed with the Collector.

The Importer[11] module parses incoming documents from their raw form (HTML, PDF, Word, etc.) into a set of extracted metadata and plain-text content. In addition, it provides interfaces to manipulate a document's metadata, transform its content, or filter documents based on their parsed form. While the Collector is heavily dependent on the Importer module, the latter can be used on its own as a general-purpose document parser.
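
Within a Collector configuration, this behaviour is expressed as pre- and post-parse handlers. In the minimal sketch below, the KeepOnlyTagger class is taken from the minimum configuration later on this page, while the ConstantTagger class name and syntax are assumptions based on the same naming pattern:

<importer>
  <postParseHandlers>
    <!-- Keep only the fields of interest. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title,description,document.reference</fields>
    </tagger>
    <!-- Add a constant field to every document (assumed class and syntax). -->
    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <constant name="collection">web</constant>
    </tagger>
  </postParseHandlers>
</importer>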

The Committer[12] module is responsible for directing the parsed data to a target repository of choice. Developers can write custom implementations, allowing the Norconex HTTP Collector to be used with any search engine or repository. Multiple committer implementations already exist[13] for well-known products.[14][15]
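
In practice, switching repositories amounts to changing the committer element of a crawler configuration (version 2.x supports a single committer per crawler, so the two entries below are alternatives). The FileSystemCommitter is taken from the minimum configuration later on this page; the Solr committer class and its solrURL element are assumptions about the separately distributed Solr committer library:

<!-- Alternative 1: write crawled documents to the local filesystem. -->
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <directory>./output/crawledFiles</directory>
</committer>

<!-- Alternative 2: send documents to a Solr index (assumed class/element). -->
<committer class="com.norconex.committer.solr.SolrCommitter">
  <solrURL>http://localhost:8983/solr/mycore</solrURL>
</committer>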

Minimum configuration

While the Norconex HTTP Collector can be configured programmatically, it also supports XML configuration files, which are parsed with Apache Velocity. Velocity directives permit configuration re-use among different Collector installations, as well as variable substitution.
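
For example, standard Velocity syntax such as #set directives and ${...} references makes it possible to factor shared values out of a configuration, so that several installations differ only in the variables they define (a sketch; the variable name and paths are illustrative):

#set($workdir = "./examples-output/minimum")

<httpcollector id="Velocity Variables Example">
  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>
  ...
</httpcollector>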

The following code is the minimum XML configuration for the current 2.x version. See the documentation for more complex configurations.[16]

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.  
     -->
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://opensource.norconex.com/collectors/http/test/minimum</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />
      
      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer> 
      
      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

References

  1. ^ a b "Norconex HTTP Collector". opensource.norconex.com. Retrieved 2021-05-28.
  2. ^ "Norconex Gives Back to Open-Source". Norconex Inc. 2013-06-05. Retrieved 2021-05-28.{{cite web}}: CS1 maint: url-status (link)
  3. ^ "A New Open-Source Web Crawler". web.archive.org. 2016-03-04. Retrieved 2021-05-28.
  4. ^ "Norconex Offers Open Source HTTP Crawler". Beyond Search. 2013-07-16. Retrieved 2021-05-28.
  5. ^ "Discover the Open Source Alternative to the Autonomy Crawler". Beyond Search. 2014-02-07. Retrieved 2021-05-28.
  6. ^ NT, Baiju (2018-09-12). "Top 50 open source web crawlers for data mining". Big Data Made Simple. Retrieved 2021-05-28.
  7. ^ "SolrEcosystem - SOLR - Apache Software Foundation". cwiki.apache.org. Retrieved 2021-05-28.
  8. ^ "Getting Started | Norconex HTTP Collector". opensource.norconex.com. Retrieved 2021-05-28.
  9. ^ "What's new in version 3 | Norconex HTTP Collector". opensource.norconex.com. Retrieved 2021-05-28.
  10. ^ "Norconex Collectors". opensource.norconex.com. Retrieved 2021-05-28.
  11. ^ "Norconex Importer". opensource.norconex.com. Retrieved 2021-05-28.
  12. ^ "Committers". opensource.norconex.com. Retrieved 2021-05-28.
  13. ^ "Committers". opensource.norconex.com. Retrieved 2021-05-28.
  14. ^ "jpmantuano/kafka-committer". GitHub. Retrieved 2021-05-28.
  15. ^ "Deploy a Norconex HTTP Collector Indexer Plugin | Cloud Search". Google Developers. Retrieved 2021-05-28.
  16. ^ "Documentation | Norconex HTTP Collector". opensource.norconex.com. Retrieved 2021-05-28.