Information Extraction is the process of turning unstructured text into a structured, machine-understandable form. It spans several subproblems that are hard to solve in isolation. One popular way to store the extracted information is a Knowledge Graph, which is a collection of three-item sets called triples, where each triple combines a subject, a predicate and an object.
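For example, the fact "Bhubaneswar is the capital of Odisha" can be stored as the triple (Bhubaneswar, capital of, Odisha). In Python, a triple maps naturally onto a tuple (a minimal sketch; the names here are illustrative):

    # A fact represented as a (subject, predicate, object) triple
    triple = ("Bhubaneswar", "capital of", "Odisha")
    subject, predicate, obj = triple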
In this article, we will discuss how to build a knowledge graph using Python and spaCy.
Let’s get started.
Code Implementation
Import all the libraries required for this project.
    import spacy
    from spacy.lang.en import English
    import networkx as nx
    import matplotlib.pyplot as plt
The nodes of the graph will be the entities that appear in the text, and the edges will be the relationships connecting these entities to each other. We will extract these components in an unsupervised way, i.e., we will rely on the grammar of the sentences.
The main idea is to go through a sentence and pick out the subject and the object as they are encountered. First, we pass the text to the function; it is tokenized and each token (word) is assigned a dependency category. Once we reach the end of a sentence, we clean up any leftover whitespace and we have our triple. For example, the sentence "Bhubaneswar is categorised as a Tier-2 city" yields a triple centred on the main subject: (Bhubaneswar, categorised, Tier-2 city).
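To see these dependency categories for yourself, you can print the tag spaCy assigns to each token (a minimal sketch; the exact labels depend on the model version):

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("Bhubaneswar is categorised as a Tier-2 city.")
    for token in doc:
        print(token.text, "->", token.dep_)
    # "Bhubaneswar" typically receives a subject tag (e.g. nsubjpass) and
    # "city" an object tag (e.g. pobj); the extraction code below keys
    # off these substrings.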
Below is the code to extract triples that can be used to build knowledge graphs.
    def getSentences(text):
        # Split the raw text into sentences with spaCy's rule-based
        # sentencizer (spaCy v3 syntax)
        nlp = English()
        nlp.add_pipe('sentencizer')
        document = nlp(text)
        return [sent.text.strip() for sent in document.sents]

    def printToken(token):
        print(token.text, "->", token.dep_)

    def appendChunk(original, chunk):
        return original + ' ' + chunk

    def isRelationCandidate(token):
        # Tokens whose dependency tag suggests they belong to the relation
        deps = ["ROOT", "adj", "attr", "agent", "amod"]
        return any(subs in token.dep_ for subs in deps)

    def isConstructionCandidate(token):
        # Tokens that extend a subject or object phrase (e.g. compounds)
        deps = ["compound", "prep", "conj", "mod"]
        return any(subs in token.dep_ for subs in deps)

    def processSubjectObjectPairs(tokens):
        subject = ''
        object = ''
        relation = ''
        subjectConstruction = ''
        objectConstruction = ''
        for token in tokens:
            printToken(token)
            if "punct" in token.dep_:
                continue
            if isRelationCandidate(token):
                relation = appendChunk(relation, token.lemma_)
            if isConstructionCandidate(token):
                if subjectConstruction:
                    subjectConstruction = appendChunk(subjectConstruction, token.text)
                if objectConstruction:
                    objectConstruction = appendChunk(objectConstruction, token.text)
            if "subj" in token.dep_:
                subject = appendChunk(subject, token.text)
                subject = appendChunk(subjectConstruction, subject)
                subjectConstruction = ''
            if "obj" in token.dep_:
                object = appendChunk(object, token.text)
                object = appendChunk(objectConstruction, object)
                objectConstruction = ''
        print(subject.strip(), ",", relation.strip(), ",", object.strip())
        return (subject.strip(), relation.strip(), object.strip())

    def processSentence(sentence):
        # nlp_model is loaded once in the __main__ block below
        tokens = nlp_model(sentence)
        return processSubjectObjectPairs(tokens)

    def printGraph(triples):
        # Each relation becomes a node sitting between its subject and object
        G = nx.Graph()
        for triple in triples:
            G.add_node(triple[0])
            G.add_node(triple[1])
            G.add_node(triple[2])
            G.add_edge(triple[0], triple[1])
            G.add_edge(triple[1], triple[2])
        pos = nx.spring_layout(G)
        plt.figure(figsize=(12, 8))
        nx.draw(G, pos, edge_color='black', width=1, linewidths=1,
                node_size=500, node_color='skyblue', alpha=0.9,
                labels={node: node for node in G.nodes()})
        plt.axis('off')
        plt.show()
The printGraph function above uses networkx to assemble the graph and matplotlib's pyplot to display it. Note that each relation is added as its own node between its subject and object.
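If you would rather show the relation as a label on a directed edge instead of as an intermediate node, networkx supports that too (a sketch under that assumption, not part of the original code):

    import networkx as nx
    import matplotlib.pyplot as plt

    def printDirectedGraph(triples):
        # Store the relation as an attribute on a directed edge
        G = nx.DiGraph()
        for subj, rel, obj in triples:
            G.add_edge(subj, obj, label=rel)
        pos = nx.spring_layout(G)
        plt.figure(figsize=(12, 8))
        nx.draw(G, pos, with_labels=True, node_size=500,
                node_color='skyblue', edge_color='black')
        nx.draw_networkx_edge_labels(
            G, pos, edge_labels=nx.get_edge_attributes(G, 'label'))
        plt.axis('off')
        plt.show()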
    if __name__ == "__main__":
        text = "Bhubaneswar is the capital and largest city of the Indian state of Odisha. " \
               "The city is bounded by the Daya River to the south and the Kuakhai River to the east; " \
               "the Chandaka Wildlife Sanctuary and Nandankanan Zoo lie in the western and northern parts of Bhubaneswar. " \
               "Bhubaneswar is categorised as a Tier-2 city. " \
               "Bhubaneswar and Cuttack are often referred to as the 'twin cities of Odisha'. " \
               "The city has a population of 1163000."
        sentences = getSentences(text)
        nlp_model = spacy.load('en_core_web_sm')
        triples = []
        print(text)
        for sentence in sentences:
            triples.append(processSentence(sentence))
        printGraph(triples)
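If the en_core_web_sm model is installed (e.g. via python -m spacy download en_core_web_sm), running the script prints the input text, then each token with its dependency tag, then one comma-separated triple per sentence, and finally opens a matplotlib window showing the graph.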
During processing, the spaCy library attaches a dependency tag to every word, which tells us whether a word acts as a subject or an object. The resulting graph is shown below.
[Figure: the resulting Knowledge Graph]
Final Thoughts
In this article, we learned how to extract information from a given text as triples and build a knowledge graph from it. Going further, we can explore the field of information extraction in more depth to learn how to extract more complex relations. Hope this article is useful to you.