Centro de Investigación en Tecnoloxías da Información da Universidade de Santiago de Compostela (CiTIUS)

Software tools


Perldoop

Perldoop is a new open-source tool that automatically translates Hadoop-ready Perl scripts into their Java counterparts, which can be executed directly on Hadoop clusters with significantly improved performance.

You can download the source code from the Git repository or here.

A User Manual can be downloaded here.

Perldoop v0.6.3 (November 2014) includes:

  • Perldoop source code and scripts to compile all the examples using Perldoop and generate Hadoop-ready Java code.
  • Simple examples: HelloWorld and WordCount.
  • More complex applications: three Natural Language Processing (NLP) modules, namely Named Entity Recognition (NER), PoS-Tagging and Named Entity Classification (NEC). These modules process plain text in Spanish.
 
If you use Perldoop, please cite this article: 
J. M. Abuin, J. C. Pichel, T. F. Pena, P. Gamallo and M. Garcia. "Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters", IEEE International Conference on Big Data, pp. 766-771, 2014. (Paper, BibTeX Reference)

CitiusTagger and CitiusNec

A PoS-Tagger and Named Entity Classification tool for Portuguese and Spanish

CitiusTagger / CitiusNec is open-source software, written in Perl, that performs both PoS tagging and Named Entity Classification for Portuguese and Spanish. It has been developed at CiTIUS by the ProLNat@GE group and uses the same tagset as FreeLing.

You can test it in our DEMO and download it here.

If you use this tool, please cite the article: 
P. Gamallo, J. C. Pichel, M. Garcia, J. M. Abuin, T. F. Pena. "Análisis Morfosintáctico y Clasificación de Entidades Nombradas en un Entorno Big Data", Procesamiento del Lenguaje Natural, vol. 53, pp. 17-24, 2014. (Paper, BibTeX Reference)

 

How to install

# tar xzvf CitiusTool.tar.gz
# cd CitiusTool 
# sh install-citiustool.sh

How to use

# sh nec.sh
Syntax: nec.sh language file

language = pt or es
file = path of the input file
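
As a minimal sketch of how the syntax above might be used in practice, the following POSIX shell snippet checks the two arguments before handing them to nec.sh. The helper names `check_nec_args` and `run_nec` and the file name `input.txt` are hypothetical and not part of CitiusTool; only the `nec.sh language file` call itself comes from the syntax line above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with CitiusTool): validates the two
# arguments that nec.sh expects, i.e. a language code and an input file.
check_nec_args() {
    lang="$1"
    file="$2"

    # Only the two documented language codes are accepted.
    case "$lang" in
        pt|es) ;;
        *)
            echo "unsupported language: $lang (expected pt or es)" >&2
            return 1
            ;;
    esac

    # The input must be an existing, readable regular file.
    if [ ! -f "$file" ] || [ ! -r "$file" ]; then
        echo "cannot read input file: $file" >&2
        return 2
    fi

    return 0
}

# Hypothetical wrapper: calls nec.sh only when the arguments look valid.
# Assumes it is run from the directory that contains nec.sh.
run_nec() {
    check_nec_args "$1" "$2" && sh nec.sh "$1" "$2"
}

# Example (assumed file name): classify entities in a Spanish text file.
# run_nec es input.txt
```

The check returns 1 for an unknown language and 2 for a missing or unreadable file, so a calling script can tell the two failures apart.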

Spanish PoS-Tagger

The Spanish PoS-tagger has been trained with the AnCora corpus. The current version of the lexicon contains the same forms as FreeLing.

Portuguese PoS-Tagger

The European Portuguese FreeLing PoS-tagger has been trained with the following linguistic resources:

  • Bosque 8.0 from Linguateca: a 138,000-token corpus from the Floresta Sintá(c)tica, manually revised by linguists. It has been revised and adapted to the FreeLing format.
  • Label-Lex (SW) from Label: a lexicon of 900,000 single-word tokens generated from 120,000 lemmas. It has been adapted to the FreeLing format and to the corpus.