Perldoop is a new open-source tool that automatically translates Hadoop-ready Perl scripts into their Java counterparts, which can be executed directly on Hadoop clusters with significantly better performance.
A User Manual can be downloaded here.
Perldoop v0.6.3 (November 2014) includes:
- Perldoop source code and scripts to compile all the examples using Perldoop and generate Hadoop-ready Java code.
- Simple examples: HelloWorld and WordCount.
- More complex applications: three Natural Language Processing (NLP) modules for Named Entity Recognition (NER), PoS-Tagging and Named Entity Classification (NEC). These modules process plain text in Spanish.
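For readers unfamiliar with the WordCount pattern, the sketch below shows the map and reduce steps as two small filters that read standard input and write tab-separated key/value lines. It is an illustration only: it is written in shell rather than Perl, it is not the WordCount example bundled with Perldoop, and the function names `wc_map` and `wc_reduce` are hypothetical. It also assumes that "Hadoop-ready" refers to this stdin/stdout style of script.

```shell
#!/bin/sh
# Illustrative WordCount filters (hypothetical names, not Perldoop code).

# Map step: split the input on whitespace and emit "word<TAB>1" per token.
wc_map() {
    tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# Reduce step: sum the counts for each word and print "word<TAB>total",
# sorted by word so the output order is stable.
wc_reduce() {
    awk -F '\t' '{ count[$1] += $2 } END { for (w in count) print w "\t" count[w] }' | sort
}
```

Locally, the two steps can be chained as `wc_map < input.txt | sort | wc_reduce`, where the intermediate `sort` stands in for the grouping a cluster would perform between the two phases.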
CitiusTagger and CitiusNec
A PoS-Tagger and Named Entity Classification tool for Portuguese and Spanish
CitiusTagger / CitiusNec is open-source software, written in Perl, that performs both PoS tagging and Named Entity Classification for Portuguese and Spanish. It has been developed at CITIUS by the ProLNat@GE group and uses the same tagset as FreeLing.
How to install
# tar xzvf CitiusTool.tar.gz
# cd CitiusTool
# sh install-citiustool.sh
How to use
# sh nec.sh
Syntax: nec.sh language file
file = path to the input file
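When several files need to be processed, the `nec.sh language file` syntax above can be wrapped in a small loop. The following is a minimal sketch, not part of CitiusTool: the function name `tag_dir`, the `NEC_CMD` variable and the `.txt` file extension are all hypothetical, and the value to pass as `language` is whatever `nec.sh` itself accepts.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with CitiusTool.
# NEC_CMD is the command used to run the tagger; it defaults to "sh nec.sh"
# and is assumed to be invoked from the CitiusTool directory.
NEC_CMD="${NEC_CMD:-sh nec.sh}"

# tag_dir LANGUAGE DIRECTORY
# Runs "nec.sh LANGUAGE FILE" once for every .txt file in DIRECTORY.
tag_dir() {
    lang="${1:-}"
    dir="${2:-}"
    if [ -z "$lang" ] || [ ! -d "$dir" ]; then
        echo "usage: tag_dir language directory" >&2
        return 1
    fi
    for f in "$dir"/*.txt; do
        [ -f "$f" ] || continue   # skip when the glob matches nothing
        $NEC_CMD "$lang" "$f"     # same argument order as the Syntax line
    done
}
```

Keeping the command in a variable makes it possible to do a dry run (for example with `NEC_CMD=echo`) before running the tagger on a large directory.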
The Spanish PoS-tagger has been trained on the AnCora corpus. The current version of the lexicon contains the same forms as FreeLing.
The European Portuguese FreeLing POS-tagger has been trained with the following linguistic resources:
- Bosque 8.0 from Linguateca: a 138,000-token corpus from the Floresta Sintá(c)tica, manually revised by linguists. It has been revised and adapted to the FreeLing format.
- Label-Lex (SW) from Label: a lexicon of 900,000 single-word tokens generated from 120,000 lemmas. It has been adapted to the FreeLing format and to the corpus.