Wowool Workshop - April 2021
wowool Eyeontext
Part of Eyeonid Group
Purpose of presentation
● Natural Language Processing (NLP) and text mining
● Linguistic concepts
● Wowool
● Wowool python API
NLP
What we do is called text mining: mining text for high-quality
information and rendering it in a structured format.
Data Analysis
Data analysis has become an essential part of the decision-making
process of companies and organizations.
Nevertheless, most information is what we call unstructured: it comes
in the form of news, reports, blogs, papers, essays, etc.
Text mining
Annotation
● Begin and End offset
● Label
● Sub-annotations
(Demo playground)
Before Starting
● The question: what kind of information do you want to
extract? If the text were printed, would you be able
to highlight what you want with a marker?
Corpus analysis
● Run a simple rule such as:
wow -l en -f ~/corpus/english/movies -e "Nn Nn"
● Find mentions of coronavirus:
wow -l en -f ~/corpus/english/3_2020/ -e " 'coronavirus' " --reindex
● Run entities of interest:
wow -l en -f ~/corpus/english/movies -e "Event" --domain english-entity --reindex
Use cases
● Fluid - learning community:
○ Texts: chats, papers, publications
○ Questions: relations, fields of knowledge, questions/answers
● Healthcare:
○ Texts: drug forum website
○ Questions: health issues, drugs, dosages, addiction
● Identity theft protection:
○ Texts: leaked data from breaches
○ Questions: emails, passwords, credit card numbers, PINs
Wowool SDK
Basic Linguistic Analysis
Wowool language:
● Lexicons: like vocabulary
● Rules: like grammar
Command line -i (input)
wow -l <language> -i "my sentence"
wow -l english -i "The play is a disaster"
Command line -e "expression"
Basic Linguistic Analysis
Tokenization:
Split the text into sentences, words, and punctuation
Lemmatization or stemming:
Find the dictionary form of each word
POS tagging:
Noun (Nn), Proper Noun (Prop), Adjective (Adj), Punctuation (Punct)
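As a rough illustration of the first two steps, the sketch below splits a sentence into word and punctuation tokens, records begin and end offsets, and looks up a dictionary form. It is a plain regular-expression toy, not the Wowool tokenizer: the `tokenize` function and the tiny `LEMMAS` table are made up for this example.

```python
import re

# Hypothetical toy lemma table; a real analyzer uses a full morphological lexicon.
LEMMAS = {"saw": "see", "men": "man"}

def tokenize(text):
    """Return (begin, end, literal, lemma) tuples for words and punctuation."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        literal = match.group()
        # Fall back to the lowercased literal when the table has no entry.
        lemma = LEMMAS.get(literal.lower(), literal.lower())
        tokens.append((match.start(), match.end(), literal, lemma))
    return tokens

tokens = tokenize("The blue-eyed monster saw the men.")
for begin, end, literal, lemma in tokens:
    print(f"({begin:>2},{end:>2}) {literal!r:<10} {lemma!r}")
```

Part-of-speech tagging is left out on purpose: it needs context and cannot be reduced to a lookup like this.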
Basic linguistic analysis
Basic linguistic analysis - results
"The" Init-Cap, Init-Token, 'the' Det-Def
"blue" 'blue' Nn-Sg
"-" '-' Punct
"eyed" 'eyed' Adj-Std
"monster" 'monster' Nn-Sg
"saw" 'see' V-Past
"the" 'the' Det-Def
"men" 'man' Nn-Pl
"." '.' Punct
Stems - what's the point
● Generalization:
Basic linguistics - Abstraction
● Tokenization:
wow -l english -f ~/corpus/english/movies -e " +Init-Cap "
● Stemming:
wow -l english -f ~/corpus/english/movies -e " 'be' "
● POS:
wow -l english -f ~/corpus/english/movies -e " V "
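To see why these three levels of abstraction matter, here is a small sketch that filters hand-tagged tokens by literal, by stem, and by part of speech. The token list is copied from the analysis results of "The blue-eyed monster saw the men."; the `Token` tuple and the `select` helper are illustrative names, not part of the Wowool API, and treating `Nn` as a match for `Nn-Sg` and `Nn-Pl` is an assumption of this sketch.

```python
from collections import namedtuple

# Illustrative token record: surface form, dictionary form, part-of-speech tag.
Token = namedtuple("Token", "literal stem pos")

TOKENS = [
    Token("The", "the", "Det-Def"), Token("blue", "blue", "Nn-Sg"),
    Token("-", "-", "Punct"), Token("eyed", "eyed", "Adj-Std"),
    Token("monster", "monster", "Nn-Sg"), Token("saw", "see", "V-Past"),
    Token("the", "the", "Det-Def"), Token("men", "man", "Nn-Pl"),
    Token(".", ".", "Punct"),
]

def select(tokens, literal=None, stem=None, pos=None):
    """Keep the tokens that satisfy every criterion that was given."""
    result = []
    for token in tokens:
        if literal is not None and token.literal != literal:
            continue
        if stem is not None and token.stem != stem:
            continue
        # Assumption: compare only the part of the tag before the first dash.
        if pos is not None and token.pos.split("-")[0] != pos:
            continue
        result.append(token)
    return result

print([t.literal for t in select(TOKENS, literal="the")])  # exact surface form
print([t.literal for t in select(TOKENS, stem="the")])     # any form of the word
print([t.literal for t in select(TOKENS, pos="Nn")])       # any noun
```

The literal query finds one token, the stem query also finds the capitalized form, and the POS query generalizes over all nouns regardless of the word.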
Basic linguistics - Structure
Vs
Basic linguistics - Meaning
Ambiguity: when a word can mean different things:
Wowool language
Lexicons
Rules
Domains
--domain
A domain is a collection of wow files (rules and lexicons) that are
grouped for a single purpose (healthcare, finance, ...):
--domain english-entity.dom,myrules_folder
Existing Domains
<language>-entity.dom: City, Country, Company, Facility, Event,
Person, Position, Organization, WorldRegion, ..
<language>-syntax.dom: NP, VP
lexicon: (input="stem")
{
acquire,buy,purchase
} = Buy;
Lexicons - easy annotation
lexicon: (input="normalized_stem")
{
be beyond helpful,
believe in i,
beyond my expectation,
cover my back,
they do everything they can,
(have )?be a life saver,
(give|grant) (i|you|we) a chance,
} = Gratitude;
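Conceptually, a lexicon with input="stem" behaves like a lookup table from stems to an annotation label. The sketch below mimics the `Buy` lexicon in plain Python; it is only an analogy, the `annotate` helper is hypothetical, the stems are written by hand, and multi-word or pattern entries such as those in the `Gratitude` lexicon are not covered.

```python
# Toy stand-in for a stem lexicon: every listed stem maps to one label.
BUY_LEXICON = {"acquire": "Buy", "buy": "Buy", "purchase": "Buy"}

def annotate(stemmed_tokens, lexicon):
    """Return (literal, label) pairs for tokens whose stem is in the lexicon."""
    return [
        (literal, lexicon[stem])
        for literal, stem in stemmed_tokens
        if stem in lexicon
    ]

# (literal, stem) pairs as an analyzer might produce them; hand-written here.
sentence = [
    ("They", "they"), ("bought", "buy"), ("one", "one"), ("company", "company"),
    ("after", "after"), ("acquiring", "acquire"), ("another", "another"),
]

matches = annotate(sentence, BUY_LEXICON)
print(matches)
```

Because the lookup runs on stems, three lexicon entries cover every inflected form of the three verbs.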
Rules vs Lexicons
Similar:
Rule Syntax
With no context:
rule : {
  .. Expression ..
} = Annotation;
Example:
rule : {
  (Prop)+ "Inc"
} = Company;
With context:
rule: {
  Context
  { Expression } = Annotation
};
Example:
rule: {
  "Mr\."
  {(Prop){1,3} } = Person
};
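The context rule can be read as "find the literal Mr., then annotate the next one to three proper nouns as Person, without including the context in the annotation". A minimal plain-Python sketch of that reading follows; it is not how the Wowool rule engine is implemented, and the `find_persons` helper and the hand-written POS tags are assumptions for illustration.

```python
def find_persons(tokens, max_names=3):
    """Capture up to `max_names` proper nouns that directly follow "Mr."."""
    persons = []
    for i, (literal, pos) in enumerate(tokens):
        if literal != "Mr.":
            continue  # the context did not match at this position
        j = i + 1
        # Greedily extend the capture while the next token is a proper noun.
        while j < len(tokens) and j - (i + 1) < max_names and tokens[j][1] == "Prop":
            j += 1
        if j > i + 1:  # at least one proper noun was captured
            persons.append(" ".join(lit for lit, _ in tokens[i + 1:j]))
    return persons

# (literal, POS) pairs, tagged by hand for this example.
tokens = [
    ("Yesterday", "Adv"), ("Mr.", "Nn"), ("John", "Prop"),
    ("Smith", "Prop"), ("resigned", "V"), (".", "Punct"),
]
print(find_persons(tokens))
```

The captured span excludes "Mr.", which is the point of separating the context from the annotated expression.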
Rule Elements
Tokens "literal",'stem',pos
Other annotations:
attributes:
Range: (<>){0,4}
Shortest/Longest Match
Attributes
Rule Files
We are going to create a domain that finds crime victims in the movie directory.
mkdir rules
With a plain text editor (Notepad, Sublime Text, Visual Studio Code), create a file
called "murder.wow" in that directory.
Murder rules
lexicon : (input="stem")
{
assassinate,
kill,
mistreat,
murder,
point a gun,
punch,
rape,
shoot,
stab,
torture,
} = Harm;
rule:
{
Harm {(Prop)+} = Victim
};
Python Setup
Requirements: the Wowool SDK, Python 3.9, 64-bit (AMD64)
Run 'pip list -v' to find the location of your installation,
and add [folder]\eot\wowool\package\lib to your PATH environment variable.
Default: c:\python39\lib\site-packages\eot\wowool\package\lib
Windows, from the CMD console:
set "PATH=%PATH%;c:\python39\lib\site-packages\eot\wowool\package\lib"
Linux/macOS:
pip list -v | grep eot
export PATH="${PATH}:/usr/local/lib/python3.9/site-packages/eot/wowool/package/lib"
Samples:
Site: https://github.com/phforesteot/eot-wowool-samples.git
Download: git clone https://github.com/phforesteot/eot-wowool-samples.git
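The library folder from the setup steps can also be located from Python itself instead of reading the `pip list -v` output. The sketch below assumes the SDK was installed into the site-packages of the interpreter that runs the script; the `wowool_lib_dir` helper is a hypothetical name, and the PATH change only affects the current process and its children.

```python
import os
import sysconfig

def wowool_lib_dir():
    """Build the expected library folder below the active site-packages."""
    site_packages = sysconfig.get_paths()["purelib"]
    return os.path.join(site_packages, "eot", "wowool", "package", "lib")

lib_dir = wowool_lib_dir()
print("folder:", lib_dir)
print("exists:", os.path.isdir(lib_dir))

# Prepend the folder to PATH for this process only.
os.environ["PATH"] = lib_dir + os.pathsep + os.environ.get("PATH", "")
```

If "exists" prints False, the SDK is installed elsewhere and the location reported by `pip list -v` should be used instead.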
Python Analyzer
from eot.wowool.native import Analyzer, Domain
from eot.wowool.error import Error

try:
    dutch = Analyzer(language="dutch")
    entities = Domain("dutch-entity")
    # run the basic Dutch analysis
    doc = dutch("Jan Van Den Berg werkte als hoofdarts bij Omega Pharma.")
    # run the Dutch entities
    doc = entities(doc)
    print(doc)
except Error as ex:
    # report SDK errors, as in the concept iteration example
    print("Exception:", ex)
Python Analyzer (Results)
C:( 0, 16): Person,@(family='van den berg' gender='male' given='jan' icanonical='jan van den berg' )
C:( 0, 3): PersonGiv
C:( 0, 3): GivenName,@(gender='male' )
T:( 0, 3): Jan,{+Giv, +Init-Cap, +Init-Token},[Jan:Prop-Std]
C:( 4, 16): PersonFam
T:( 4, 7): Van,{+Init-Cap, +NF},[Van:Prop-Std]
T:( 8, 11): Den,{+Init-Cap, +NF, +NF-Lex},[Den:Prop-Std]
C:( 12, 16): City,@(country='Belgium' )
T:( 12, 16): Berg,{+Init-Cap, +NF},[Berg:Prop-Std]
T:( 17, 23): werkte,[werken:V-Past]
T:( 24, 27): als,[als:Conj-Sub]
C:( 28, 37): PersonMention
C:( 28, 37): Position,@(theme='health' )
T:( 28, 37): hoofdarts,[hoofd#arts:Nn-Sg]
T:( 38, 41): bij,[bij:Prep-Std]
C:( 42, 54): Company,@(country='Belgium' sector='pharma' )
T:( 42, 47): Omega,{+Init-Cap, +NF},[Omega:Prop-Std]
T:( 48, 54): Pharma,{+Init-Cap, +NF, +NF-Lex},[Pharma:Prop-Std]
T:( 54, 55): .,[.:Punct-Sent]
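Every line of this listing carries a begin and an end offset into the input string, and concepts (C) nest around other concepts and tokens (T). The sketch below rebuilds a few of these annotations by hand to show how offsets, labels and sub-annotations fit together. The `Annotation` class and the `walk` helper are illustrative only and are not the SDK's own annotation types; the offsets are copied from the listing.

```python
from dataclasses import dataclass, field

TEXT = "Jan Van Den Berg werkte als hoofdarts bij Omega Pharma."

@dataclass
class Annotation:
    """Illustrative record: half-open [begin, end) offsets, a label, children."""
    begin: int
    end: int
    label: str
    children: list = field(default_factory=list)

    def literal(self, text):
        # The offsets slice the original input directly.
        return text[self.begin:self.end]

person = Annotation(0, 16, "Person", [
    Annotation(0, 3, "PersonGiv"),
    Annotation(4, 16, "PersonFam"),
])
position = Annotation(28, 37, "Position")
company = Annotation(42, 54, "Company")

def walk(annotation, text, depth=0):
    """Print an annotation and its sub-annotations, indented by nesting depth."""
    print("  " * depth + f"{annotation.label}: {annotation.literal(text)!r}")
    for child in annotation.children:
        walk(child, text, depth + 1)

for top in (person, position, company):
    walk(top, TEXT)
```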
Python Iterate Concepts
from eot.wowool.native import Analyzer, Domain
from eot.wowool.annotation import Concept
from eot.wowool.error import Error

try:
    dutch = Analyzer(language="dutch")
    entities = Domain("dutch-entity")
    doc = entities(dutch("Jan Van Den Berg werkte als hoofdarts bij Omega Pharma."))
    # filter some concepts
    requested_concepts = {"Person", "Position", "Company"}
    concept_filter = lambda concept: concept.uri in requested_concepts
    for concept in Concept.iter(doc, concept_filter):
        print(f"literal: {concept.literal:<20}, stem={concept.stem}")
except Error as ex:
    print("Exception:", ex)
Custom Domains (runtime_domain.py)
...
doc = entities(dutch("Jan Van Den Berg werkte als hoofdarts bij Omega Pharma."))
mydomain = Domain( source = r"""
rule:{ Person .. 'werken' .. Company }= PersonWorkCompany@(verb="work");
rule:{ Person .. Company }= PersonCompany;
""")
doc = mydomain(doc)
PersonCompany -> Jan Van Den Berg werkte als hoofdarts bij Omega Pharma
PersonWorkCompany -> Jan Van Den Berg werkte als hoofdarts bij Omega Pharma
Person -> Jan Van Den Berg
Company -> Omega Pharma
Compounds (dutch_compound.py)
from eot.wowool.native import Analyzer, Domain

compounds = Domain(source="""
lexicon:(input="component"){verzekering } = INSURANCE_COMP;
lexicon:(input="head"){verzekering } = INSURANCE_HEAD;
rule:{ h'verzekering' { <+currency> } = INSURANCE_PRICE };
""")
analyzer = Analyzer(language="dutch")
text = "Er zijn verzekeringsmaatschappijen €40.000.000 en verzekeringen: autoverzekeringen €100, fietsverzekering €10"
doc = compounds(analyzer(text))
Example:
topic_identifier -i "I saw black cars and a green bird and green house."
green bird: 1.0, black car: 1.0, green house: -0.00086
echo "i have green house gasses in my yellow house." > in.txt
toc -f in.txt -l english -o dummy.topic_model --stats
cat stats.md
## Nr of files: 1
|term | freq
|------------------------------ | ----
|green | 1
|green house | 1
|yellow house | 1
topic_identifier -i "I saw black cars and a green bird and green house." -t dummy.topic_model
green bird: 1.0, black car: 1.0, green house: -0.00086
Topic_Identifier (topics.py)
from eot.wowool.error import Error
from eot.wowool.topic_identifier import TopicIdentifier
try:
[
('green bird', 1.0),
('black car', 1.0),
('green house', 0.6)
]
EntityGraph
- The entity graph is a tool with an API that returns a pandas DataFrame.
- The result can be converted to Cypher to build a graph database out of text data.
- It creates nodes and relations at runtime.
- It keeps slots: slots are things you want to remember.
- You can also add topics to nodes.
EntityGraph (entity_graph_panda.py)
from eot.wowool.native import Analyzer, Domain
from eot.wowool.annotation import Concept
from eot.wowool.error import Error
from eot.wowool.tool import EntityGraph

graph_config = { … }
try:
    dutch = Analyzer(language="dutch")
    entities = Domain("dutch-entity")
    # raw string, so Python does not warn about the '\:' escape in the rule
    myrule = Domain(source=r""" rule:{ 'user' '\:' {(<>)+}=USER }; """)
    doc = dutch("user:John \n\nJan Van Den Berg werkte als hoofdarts bij Omega Pharma.")
    doc = entities(doc)
    doc = myrule(doc)
    print(doc)
    graphit = EntityGraph(graph_config)
    # slots hold values to remember, here the source document
    graphit.slots['Document'] = {"data": "file1.txt"}
    # returns a pandas DataFrame
    results = graphit(doc)
    print(results)
except Error as ex:
    print("Exception:", ex)
EntityGraph (entity_graph_panda.py)
graph_config = {
EntityGraph (entity_graph_panda.py)
--- From: results.df_from ------------------------------
   label   name              gender
0  USER    John              NaN
1  Person  Jan Van Den Berg  NaN
2  Person  Jan Van Den Berg  male
3  USER    John              NaN
--- Relation: results.df_relation ----------------------
   label
0  Mentions
1  hoofdarts
2  P2C
3  Mentions
--- To: results.df_to ----------------------------------
   label     name              country
0  Document  file1.txt         NaN
1  Company   Omega Pharma      NaN
2  Company   Omega Pharma      Belgium
3  Person    Jan Van Den Berg  NaN
--- Merged: results ------------------------------------
   from_label  from_name         from_gender  rel_label  to_label  to_name           to_country
0  USER        John              NaN          Mentions   Document  file1.txt         NaN
1  Person      Jan Van Den Berg  NaN          hoofdarts  Company   Omega Pharma      NaN
2  Person      Jan Van Den Berg  male         P2C        Company   Omega Pharma      Belgium
3  USER        John              NaN          Mentions   Person    Jan Van Den Berg  NaN
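Each row of the merged table is a (from node, relation, to node) triple, which is the shape a graph database expects. The sketch below turns such rows into Cypher MERGE statements in plain Python. It is not the output format of the SDK's Cypher tooling: the `to_cypher` helper is hypothetical, node properties other than the name are dropped, and values are interpolated without escaping, so real code should use query parameters.

```python
def to_cypher(row):
    """Turn one (from, relation, to) row into a Cypher MERGE statement."""
    from_label, from_name, rel_label, to_label, to_name = row
    return (
        f"MERGE (a:{from_label} {{name: '{from_name}'}}) "
        f"MERGE (b:{to_label} {{name: '{to_name}'}}) "
        f"MERGE (a)-[:{rel_label}]->(b)"
    )

# Rows copied from the merged table: labels, names and relation only.
ROWS = [
    ("USER", "John", "Mentions", "Document", "file1.txt"),
    ("Person", "Jan Van Den Berg", "hoofdarts", "Company", "Omega Pharma"),
    ("Person", "Jan Van Den Berg", "P2C", "Company", "Omega Pharma"),
    ("USER", "John", "Mentions", "Person", "Jan Van Den Berg"),
]

statements = [to_cypher(row) for row in ROWS]
for statement in statements:
    print(statement)
```

MERGE is used rather than CREATE so that a node appearing in several rows, such as Jan Van Den Berg, is stored once.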
EntityGraph/Neo4j (entity_graph_cypher.py)
from eot.wowool.tool.entity_graph.cypher import CypherStream
Neo4jDriver
Prerequisite: pip install neo4j-connector
The neo4jdriver is a tool that stores your text in a Neo4j database.
Neo4jDriver / Docker
Setup:
To use Neo4j on your host, open the ports on the Docker session so that the Neo4j
database can be reached.
Notes:
Inside the Docker container, this command creates the lines you need to pass to the Neo4j database:
neo4jdriver -f corpus/english/movies/drama -l en -x links/movies.lnk --domain rules/movies/ -o /shared/out.cypher -n
Then run the generated lines against the database:
import neo4j
neo4jdb = neo4j.Connector("http://localhost:7474", ("test", "test"))
with open("out.cypher") as fh:
    lines = fh.readlines()
# each line of the file is one Cypher query
for cypher_query in lines:
    print(cypher_query)
    neo4jdb.run(cypher_query)
Built-in Testing (domains/helloworld.wow)
The woc compiler has built-in testing functionality, using the keywords:
- @test: [input data]
- @expected: [capture data]=[uri] (this one is optional)
// domains/helloworld.wow
lexicon:{ greetings, hello }= GREETING;
lexicon:{ world , earthling }= PLACE;
@test: he says, hello worlds
@expected: hello worlds=TEST
rule:{ GREETING PLACE }=TEST;
woc -o helloworld.dom helloworld.wow -t -l en --verbose debug
test: [.../domains/helloworld.wow]
--------------------------------------------------------
@test[failed]: eot-wowool-samples/domains/helloworld.wow:6: he says, hello worlds
@expected[failed]: { hello world } = TEST;
- - - - - - - - - - - - - - - - - - - - - - - - - - -
eot-wowool-samples/domains/helloworld.wow:6: rule:{ GREETING PLACE }= TEST
s(0,21)
{Sentence
t(0,2) "he" (Init-Token)['he':Pron-Pers, +3P, +Sg]
t(3,7) "says" ['say':V-Pres-3-Sg, +that]
t(7,8) "," [',':Punct-Comma]
{GREETING
t(9,14) "hello" ['hello':Nn-Sg, +Interj]
}GREETING
t(15,21) "worlds" ['world':Nn-Pl]
}Sentence
TestResult: 1 rules, 1 tests, 1 failed
Output file:"/Users/phforest/dev/eot-wowool-samples/domains/helloworld.dom"
Missing the PLACE concept (world): "worlds" was not annotated as PLACE, so the rule did not match.
Notes:
Remark from Rudy van Belkom:
How do you control the validity of the results? This question arises mainly
with ML, where accuracy and explainability are an issue. With a rule-based
approach, your rules control the results: a human implements the thinking
behind the rules, so the accuracy can be very high. The linguistics behind
Wowool means that there is a theory of language behind the text analysis. Things
that many systems remove, the so-called stop words (determiners, prepositions and
conjunctions), are crucial to understanding the relations between words and the
whole meaning of a statement.
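The point about stop words can be made concrete with a small sketch: once prepositions are removed, two sentences with opposite meanings become indistinguishable. The stop-word list below is a tiny hand-picked set for this example, and `content_words` is a hypothetical helper, not something taken from Wowool or any particular library.

```python
# Small hand-picked stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "from", "to", "of", "and", "in"}

def content_words(sentence):
    """Lowercase, split on whitespace, and drop the stop words."""
    return sorted(w for w in sentence.lower().split() if w not in STOP_WORDS)

a = "the flight from Paris to London"
b = "the flight to Paris from London"

print(content_words(a))
print(content_words(b))
print("same after stop-word removal:", content_words(a) == content_words(b))
```

Both sentences reduce to the same three words, so the direction of travel, which is carried entirely by "from" and "to", is lost.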