0000000001276174
AUTHOR
Guntis Barzdins
Pini Language and PiniTree Ontology Editor: Annotation and Verbalisation for Atomised Journalism
We present a new ontology language, Pini, and the PiniTree ontology editor that supports it. Although the Pini language bears many similarities to RDF, UML class diagrams, Property Graphs, and their frontends such as Google Knowledge Graph and Protege, it is a more expressive language that enables FrameNet-style natural language annotation for the Atomised journalism use case.
Inductive synthesis of term rewriting systems
A fast algorithm for inductive synthesis of term rewriting systems is described and proved correct. It has been implemented and successfully applied to the inductive synthesis of different algorithms, including binary multiplication. The proposed algorithm supports an automatic learning process and can be used for the design and implementation of abstract data types (ADT).
Towards efficient inductive synthesis of expressions from input/output examples
Our goal over several years has been the development of an efficient search algorithm for inductive inference of expressions using only input/output examples. The idea is to avoid exhaustive search by taking full advantage of the semantic equality of many of the considered expressions. This might be how people avoid an excessively large search when finding proof strategies for theorems, etc. As a formal model for the development of the method we use arithmetic expressions over the domain of natural numbers. A new approach that uses weights associated with the functional symbols to restrict the search space is considered. This allows adding constraints like the frequency of particular symbols in t…
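The idea of exploiting semantic equality can be illustrated with a minimal sketch. It is not the authors' algorithm: it is a generic bottom-up enumeration over natural-number expressions that keeps only one expression per distinct behaviour on the example inputs. The function name `synthesize`, the operator set (`+`, `*`), the atoms (`x`, `1`), and the `max_rounds` limit are all assumptions made for illustration.

```python
from itertools import product

def synthesize(examples, max_rounds=3):
    """Search for an expression over x, 1, +, * matching all (input, output) pairs.

    Illustrative sketch only: two expressions are treated as semantically equal
    when they produce the same values on every example input, and only the
    first such expression found is kept.
    """
    inputs = [x for x, _ in examples]
    target = tuple(y for _, y in examples)
    # Map from "signature" (values on the example inputs) to one expression.
    seen = {tuple(inputs): "x", tuple(1 for _ in inputs): "1"}
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    for _ in range(max_rounds):
        if target in seen:
            return seen[target]
        new = {}
        for (sig_a, expr_a), (sig_b, expr_b) in product(list(seen.items()), repeat=2):
            for name, fn in ops.items():
                sig = tuple(fn(a, b) for a, b in zip(sig_a, sig_b))
                # Prune: skip anything that behaves like an expression we already have.
                if sig not in seen and sig not in new:
                    new[sig] = f"({expr_a} {name} {expr_b})"
        seen.update(new)
    return seen.get(target)

# Hypothetical usage: examples drawn from f(x) = x*x + 1.
examples = [(0, 1), (1, 2), (2, 5), (3, 10)]
expression = synthesize(examples)
```

Because expressions such as `(x * 1)` and `(1 * x)` share a signature with `x`, they are never added to the pool, which is what keeps the search from growing exhaustively.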
FrameNet CNL: A Knowledge Representation and Information Extraction Language
The paper presents a FrameNet-based information extraction and knowledge representation framework, called FrameNet-CNL. The framework is used on natural language documents and represents the extracted knowledge in a tailor-made Frame-ontology from which unambiguous FrameNet-CNL paraphrase text can be generated automatically in multiple languages. This approach brings together the fields of information extraction and CNL, because a source text can be considered as belonging to FrameNet-CNL if an information extraction parser produces the correct knowledge representation as a result. We describe a state-of-the-art information extraction parser used by a national news agency and speculate that Fram…
RIGOTRIO at SemEval-2017 Task 9: Combining Machine Learning and Grammar Engineering for AMR Parsing and Generation
By addressing both text-to-AMR parsing and AMR-to-text generation, SemEval-2017 Task 9 established AMR as a powerful semantic interlingua. We strengthen the interlingual aspect of AMR by applying the multilingual Grammatical Framework (GF) for AMR-to-text generation. Our current rule-based GF approach completely covered only 12.3% of the test AMRs; therefore we combined it with the state-of-the-art JAMR Generator to see whether the combination increases or decreases the overall performance. The combined system achieved an automatic BLEU score of 18.82 and a human Trueskill score of 107.2, to be compared to the plain JAMR Generator results. As for AMR parsing, we added NER extensions to our SemEva…
RIGA at SemEval-2016 Task 8: Impact of Smatch Extensions and Character-Level Neural Translation on AMR Parsing Accuracy
Two extensions to the AMR smatch scoring script are presented. The first extension combines the smatch scoring script with the C6.0 rule-based classifier to produce a human-readable report on the frequency of error patterns observed in the scored AMR graphs. This first extension results in a 4% gain over the state-of-the-art CAMR baseline parser by adding to it a manually crafted wrapper fixing the identified CAMR parser errors. The second extension combines a per-sentence smatch with an ensemble method for selecting the best AMR graph among the set of AMR graphs for the same sentence. This second modification automatically yields a further 0.4% gain when applied to outputs of two nondeterministic…
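One plausible reading of the ensemble step can be sketched as follows; it is not taken from the paper. Each candidate graph is scored by its mean agreement with the other candidates for the same sentence, and the candidate with the highest mean is selected. The helper `triple_overlap` is a hypothetical stand-in for per-sentence smatch (real smatch also searches for the best variable alignment, which is omitted here), and graphs are assumed to be given as lists of `(source, relation, target)` triples.

```python
def triple_overlap(graph_a, graph_b):
    """Jaccard overlap of two triple sets; a simplified stand-in for smatch."""
    a, b = set(graph_a), set(graph_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def select_best(graphs, score=triple_overlap):
    """Return the index of the graph that agrees most with the other candidates.

    Assumes `graphs` is a non-empty list of candidate graphs for one sentence.
    """
    if len(graphs) == 1:
        return 0

    def mean_agreement(i):
        others = [score(graphs[i], g) for j, g in enumerate(graphs) if j != i]
        return sum(others) / len(others)

    return max(range(len(graphs)), key=mean_agreement)

# Hypothetical candidates produced by different parser runs for one sentence.
t1 = ("want", "ARG0", "boy")
t2 = ("want", "ARG1", "go")
t3 = ("go", "ARG0", "boy")
t4 = ("want", "ARG1", "sleep")
candidates = [[t1, t2, t3], [t1, t2], [t1, t4]]
best_index = select_best(candidates)
```

Passing the scoring function as a parameter means a real per-sentence smatch implementation could be swapped in without changing the selection logic.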
Keynote speakers: Benefits and drawbacks of the BigData era
We have voluntarily surrendered our private data to BigData companies like Google and Facebook in the hope that our data will be safe there and will be used only for ethical machine learning purposes, to further advance the artificial intelligence capabilities we already use daily: smart search, machine translation, speech recognition, guessing our interests, etc. But alongside these positive BigData uses, the world was recently astounded by the unexpected success of the DataScience killer-application: microtargeting, discussed in this presentation.
Multilingual Clustering of Streaming News
Clustering news across languages enables efficient media monitoring by aggregating articles from multilingual sources into coherent stories. Doing so in an online setting allows scalable processing of massive news streams. To this end, we describe a novel method for clustering an incoming stream of multilingual documents into monolingual and crosslingual story clusters. Unlike typical clustering approaches that consider a small and known number of labels, we tackle the problem of discovering an ever growing number of cluster labels in an online fashion, using real news datasets in multiple languages. Our method is simple to implement, computationally efficient and produces state-of-the-art …
Towards efficient inductive synthesis: Rapid construction of local regularities
Given several input/output examples of some function, we can state the problem: what is the “simplest” function that complies with these examples? This problem is well studied and is known to be very hard in the general case. In this paper we address a special case of the problem, in which the target function can be expressed as a simple composition of known functions. We propose a new inductive synthesis algorithm for this case and show that it is efficient enough to synthesize complex geometry formulas.
ADT implementation and completion by induction from examples
There exists a fast algorithm [2] for inductive synthesis of terminating and ground confluent term rewriting systems from samples. The principles of this algorithm and the methodology of its use for implementation and completion of abstract data types are described.
RDF* Graph Database as Interlingua for the TextWorld Challenge
This paper briefly describes the top-scoring submission to the First TextWorld Problems: A Reinforcement and Language Learning Challenge. To alleviate the partial observability problem characteristic of the TextWorld games, we split the Agent into two independent components, Observer and Actor, communicating only via the Interlingua of the RDF* graph database. The RDF* graph database serves as the “world model” memory, incrementally updated by the Observer via FrameNet-informed Natural Language Understanding techniques, and is used by the Actor for efficient exploration and planning of the game Action sequences. We find that the deep-learning approach works best for the Observer componen…
ZERO: An Efficient Ethernet-Over-IP Tunneling Protocol
An Ethernet over IPv4 tunneling protocol is proposed, which categorizes all Ethernet frames to be tunneled into NICE and UGLY frames. The UGLY frames are tunneled by traditional methods, such as UDP or GRE encapsulation, resulting in substantial overhead due to the additional headers and the fragmentation usually required to transport long Ethernet frames over an IP network typically limited to MTU=1,500 bytes. Meanwhile, the NICE Ethernet frames are tunneled without any overhead as plain IPv4 packets due to non-traditional reuse of the “fragment offset” or “identification” field in the IP header. It is shown that for typical Internet traffic transported over Ethernet, the proposed ZERO tunneling protocol …
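The overhead of the traditional encapsulation methods mentioned above can be made concrete with a short worked calculation. This sketch does not model the ZERO protocol itself; it only estimates the cost of plain GRE or UDP encapsulation. It assumes a 20-byte IPv4 header without options, a 4-byte basic GRE header, an 8-byte UDP header, and Ethernet frame lengths counted without the frame check sequence; the function name `tunnel_cost` is hypothetical.

```python
import math

IPV4_HEADER = 20  # bytes, IPv4 header without options
GRE_HEADER = 4    # bytes, basic GRE header without optional fields
UDP_HEADER = 8    # bytes

def tunnel_cost(frame_len, encap_header, mtu=1500):
    """Return (fragments, overhead_bytes) for tunneling one Ethernet frame.

    frame_len is the Ethernet frame length in bytes (header plus payload).
    """
    inner = encap_header + frame_len  # bytes carried as IP payload
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = ((mtu - IPV4_HEADER) // 8) * 8
    fragments = math.ceil(inner / per_fragment)
    overhead = fragments * IPV4_HEADER + encap_header
    return fragments, overhead

# A full-size frame: 14-byte Ethernet header + 1500-byte payload = 1514 bytes.
full_size_gre = tunnel_cost(1514, GRE_HEADER)
# A minimum-size frame: 60 bytes without the frame check sequence.
small_udp = tunnel_cost(60, UDP_HEADER)
```

Under these assumptions a full-size frame no longer fits into a single 1,500-byte packet once the extra headers are added, so it is split into two fragments, while a small frame pays only the header cost.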
From Databases to Ontologies
This chapter introduces the UML profile for OWL as an essential instrument for bridging the gap between legacy relational databases and OWL ontologies. We address one of the long-standing relational database design problems, in which the initial conceptual model (a semantically clear domain conceptualization ontology) gets “lost” during conversion into the normalized database schema. The problem is that such “loss” makes the database inaccessible for direct querying by domain experts familiar with the conceptual model only. This problem can be avoided by exporting the database into RDF according to the original conceptual model (OWL ontology) and formulating semantically clear queries in SPARQL over t…
The SUMMA Platform: A Scalable Infrastructure for Multi-lingual Multi-media Monitoring
The open-source SUMMA Platform is a highly scalable distributed architecture for monitoring a large number of media broadcasts in parallel, with a lag behind actual broadcast time of at most a few minutes. The Platform offers a fully automated media ingestion pipeline capable of recording live broadcasts, detection and transcription of spoken content, translation of all text (original or transcribed) into English, recognition and linking of Named Entities, topic detection, clustering and crosslingual multi-document summarization of related media items, and last but not least, extraction and storage of factual claims in these news items. Browser-based graphical user interfaces provide humans…
Text Extraction from Scrolling News Tickers
While a lot of work exists on text or keyword extraction from videos, little can be found on the exact problem of extracting continuous text from scrolling tickers. In this work a novel Tesseract OCR-based pipeline is proposed for locating scrolling tickers in videos and extracting continuous text from them. The solution worked faster than real time and achieved a character accuracy of 97.3% on 45 min of manually transcribed 360p videos of popular Latvian news shows.
Riga: from FrameNet to Semantic Frames with C6.0 Rules
For the purposes of SemEval-2015 Task-18 on semantic dependency parsing we combined the best-performing closed track approach from the SemEval-2014 competition with state-of-the-art techniques for FrameNet semantic parsing. In the closed track our system ranked third for semantic graph accuracy and first for exact labeled match of complete semantic graphs. These results can be attributed to the high accuracy of the C6.0 rule-based sense labeler adapted from the FrameNet parser. To handle the large SemEval training data, the C6.0 algorithm was extended to provide multi-class classification and to use fast greedy search without significant accuracy loss compared to exhaustive search. A met…
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe
It has long been speculated that deep neural networks function by discovering a hierarchical set of domain-specific core concepts or patterns, which are further combined to recognize even more elaborate concepts for classification or other machine learning tasks. Meanwhile, disentangling the actual core concepts ingrained in word embeddings (like word2vec or BERT) or deep convolutional image recognition neural networks (like PG-GAN) is difficult, and some success there has been achieved only recently. In this paper we propose a novel neural network nonlinearity named Differentiable Disentanglement Filter (DDF) which can be transparently inserted into any existing neural network layer …