Extracting noun phrases with contextual relevance in .NET using OpenNLP

A few months ago I was working on a project that had a word cloud-like feature. A word cloud is an interesting way to visually represent a popular theme or topic. I had a dataset of user reviews from another project that we wanted to parse and use. This was my first exposure to Natural Language Processing (NLP) and other advanced text analytics tools.

Notes from an NLP spike

I started by extracting nouns from our dataset and calculating their frequency. This resulted in a list of the most-used terms, but unlike a tag cloud implementation for a blog, the results were not always relevant. I needed a way to capture the theme of a sentence, sum up all the reviews with the same theme, and then present the top themes to the user. Think of a Yelp user review and its core positive or negative theme, then count how many other user reviews share that theme, e.g. reviews for a pub that often mention its “great patio”. It wasn’t long before I started reading about NLP tools and, in particular, Part of Speech (PoS) analyzers. I started learning the vocabulary of the NLP world, such as N-grams, sentence chunking, and most importantly, noun phrases. The noun phrases I was interested in contain two or more words (including a noun), which provide some contextual relevance to the theme of the sentence.
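The "sum up all the reviews with the same theme" step can be sketched roughly as follows, assuming the noun phrases for each review have already been extracted. ThemeCounter and TopThemes are hypothetical names used for illustration only; they are not part of the original project, and the normalization (trim plus lower-casing) is an assumption about what counts as "the same theme".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not from the original project.
public static class ThemeCounter
{
    // Counts how many reviews mention each phrase at least once (case-insensitive)
    // and returns the most frequently mentioned phrases first.
    public static IList<KeyValuePair<string, int>> TopThemes(
        IEnumerable<IEnumerable<string>> phrasesPerReview, int take)
    {
        return phrasesPerReview
            // de-duplicate within a review so a single review counts a theme once
            .SelectMany(phrases => phrases
                .Select(p => p.Trim().ToLowerInvariant())
                .Distinct())
            .GroupBy(p => p)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(kv => kv.Value)
            // break ties alphabetically so the ordering is deterministic
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Take(take)
            .ToList();
    }
}
```

Counting reviews rather than raw occurrences is a design choice: it keeps one long review that repeats a phrase from dominating the cloud.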

Below is a more formal definition of a noun phrase with an example.

A word group with a noun or pronoun as its head. The noun head can be accompanied by modifiers, determiners (such as the, a, her), and/or complements.

A noun phrase (often abbreviated as NP) most commonly functions as a subject, object, or complement.

“The wells and water table had been polluted by chemical pesticides and fertilizers that leached into the earth and were washed by rain into the creeks, where the stunned fish were scavenged by the ospreys.”
(Peter Matthiessen, Men’s Lives, 1986)

- noun phrase, About.com

Identifying noun phrases is not a trivial task. I started reading up on big open source projects in the NLP game like OpenNLP (Java), NLTK (Python), and LingPipe (Java). I also found a great deal of smaller analytics tools and parsers, but none seemed advanced enough to really capture the essence of a noun phrase or theme of a sentence. It was then that a colleague pointed me in the direction of SQL Server Integration Services (SSIS) text analytics transformations, most notably the Term Extraction and Term Lookup transformations. A PoC quickly demonstrated that these transformations were an efficient and scalable way to extract noun phrases. They were very simple to configure and get up and running (if you don’t mind using BIDS, *shudder*). I was able to extract meaningful noun phrases with a high degree of accuracy. However, the approach had a number of limitations.

  • It’s an SSIS package, great for parsing text after-the-fact, but not in real time.
  • It requires a SQL Server enterprise license ($$).
  • It only supports English with no plans of supporting other languages.

Ideally, I wanted to replace the SSIS package with an in-process solution, but unfortunately there are limited text analytics tools available for the .NET community. There are a few options. SharpNLP is a C# port of OpenNLP. It had a brief flurry of activity in 2006, but not much since then. Here are some notes from someone who attempted to integrate with NLTK in a .NET implementation using IronPython: Open Source NLP in C# 3.5 using NLTK.

I even put the question to StackOverflow in a question titled “Extracting terms with contextual relevance (noun phrases) from text in a .NET project”. An answer on this question revealed an option I hadn’t considered. Although I didn’t want to have an IPC (Inter-Process Communication) layer I started thinking about setting up a dedicated full text analytics server with Apache Solr. As of this writing there is a push to get OpenNLP analytics and filters committed to Solr itself, but it still requires a patch (LUCENE-2899) and a fairly lengthy configuration process to get up and running.

A viable .NET implementation

Eventually I came across a wiki article entitled “A quick guide to using OpenNLP from .NET” that introduced me to a remarkable project called IKVM.NET. After generating a shiny new .NET OpenNLP assembly with the steps provided, I was able to use the OpenNLP namespaces with ease in my project.

The first step in using the parsers in OpenNLP was to instantiate a model using Java streams. I created a base class for my NounPhraseParser with a utility method to help load these models.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace OpenNLP.NET.PoC
{
    public abstract class AbstractNounPhraseAdapter
    {
        protected readonly string ModelsPath;

        /// <summary>
        /// A path to the directory where the OpenNLP models are located.
        /// </summary>
        protected AbstractNounPhraseAdapter(string modelsPath)
        {
            ModelsPath = modelsPath;
        }

        /// <summary>
        /// Return the OpenNLP analyzer given its model type (M), the type of the analyzer (T), and the filename
        /// of the model (e.g. en-pos-maxent.bin), resolved relative to the directory where the models are located (ModelsPath).
        /// </summary>
        public T ResolveOpenNlpTool<M, T>(string modelPath)
            where M : class
            where T : class
        {
            var modelStream = new java.io.FileInputStream(Path.Combine(ModelsPath, modelPath));

            M model;
            try
            {
                model = (M)Activator.CreateInstance(typeof(M), modelStream);
            }
            finally
            {
                if (modelStream != null)
                {
                    modelStream.close();
                }
            }

            return (T)Activator.CreateInstance(typeof(T), model);
        }

        /// <summary>
        /// Functions to run after PoS parsing to determine if the noun phrase should be returned.
        /// </summary>
        public IEnumerable<Func<string, bool>> PostProcessingFilters { get; set; }

        protected bool ValidNounPhrase(string nounPhrase)
        {
            // no filters configured means every noun phrase is accepted
            return PostProcessingFilters == null ||
                    PostProcessingFilters.All(filter => filter(nounPhrase));
        }
    }
}
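The post-processing filters are just predicates over the candidate phrase, so they can be built and tested without loading any OpenNLP models. Here is a minimal, self-contained sketch of that idea. NounPhraseFilters, MinWords, ExcludesAny and IsValid are hypothetical names introduced for illustration; IsValid is meant to mirror the rule in ValidNounPhrase (no filters means the phrase is accepted, otherwise every filter must pass).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers, not part of the original project.
public static class NounPhraseFilters
{
    // Accepts phrases made of at least minWords space-separated words.
    public static Func<string, bool> MinWords(int minWords)
    {
        return phrase => phrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Length >= minWords;
    }

    // Rejects phrases that contain any of the given characters.
    public static Func<string, bool> ExcludesAny(params char[] stopChars)
    {
        return phrase => phrase.IndexOfAny(stopChars) < 0;
    }

    // A null filter list accepts everything; otherwise all filters must pass.
    public static bool IsValid(string phrase, IEnumerable<Func<string, bool>> filters)
    {
        return filters == null || filters.All(filter => filter(phrase));
    }
}
```

A caller could then assign something like `new List<Func<string, bool>> { NounPhraseFilters.MinWords(2), NounPhraseFilters.ExcludesAny('.', ',', ';') }` to PostProcessingFilters.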

The guts of my PosNounPhraseParser itself contains a parsing method named GetNounPhrases which is based on code by Sujit Pal described in his blog posting entitled “An UIMA Noun Phrase POS Annotator using OpenNLP”.

using System;
using System.Collections.Generic;
using opennlp.tools.chunker;
using opennlp.tools.postag;
using opennlp.tools.sentdetect;
using opennlp.tools.tokenize;

namespace OpenNLP.NET.PoC
{
    /// <summary>
    /// Ported from Java implementation by Sujit Pal
    /// http://sujitpal.blogspot.ca/2011/08/uima-noun-phrase-pos-annotator-using.html
    /// </summary>
    public class PosNounPhraseParser : AbstractNounPhraseAdapter, INounPhraseParser
    {
        public PosNounPhraseParser(string modelsPath) : base(modelsPath) { }

        private static SentenceDetector _sentenceDetector;
        private SentenceDetector GetSentenceDetector()
        {
            return _sentenceDetector ?? (_sentenceDetector = ResolveOpenNlpTool<SentenceModel, SentenceDetectorME>("en-sent.bin"));
        }

        private static POSTagger _posTagger;
        private POSTagger GetPosTagger()
        {
            return _posTagger ?? (_posTagger = ResolveOpenNlpTool<POSModel, POSTaggerME>("en-pos-maxent.bin"));
        }

        private static Tokenizer _tokenizer;
        private Tokenizer GetTokenizer()
        {
            return _tokenizer ?? (_tokenizer = ResolveOpenNlpTool<TokenizerModel, TokenizerME>("en-token.bin"));
        }

        private static Chunker _chunker;
        private Chunker GetChunker()
        {
            return _chunker ?? (_chunker = ResolveOpenNlpTool<ChunkerModel, ChunkerME>("en-chunker.bin"));
        }

        public void WarmUpModels()
        {
            GetSentenceDetector();
            GetPosTagger();
            GetTokenizer();
            GetChunker();
        }

        public IList<string> GetNounPhrases(string sourceText)
        {
            if (string.IsNullOrWhiteSpace(sourceText)) throw new ArgumentNullException("sourceText");

            var nounPhrases = new List<string>();

            // return an array of start and end indexes that identify sentences
            var sentenceSpans = GetSentenceDetector().sentPosDetect(sourceText);
            foreach (var sentenceSpan in sentenceSpans)
            {
                // retrieve the actual sentence from the source text
                var sentence = sentenceSpan.getCoveredText(sourceText).toString();
                var start = sentenceSpan.getStart();

                // return an array of start and end indexes that identify the individual
                // tokens (words and punctuation) in the sentence, then tag each token
                // with its part of speech
                var tokenSpans = GetTokenizer().tokenizePos(sentence);
                var tokens = new string[tokenSpans.Length];
                for (var i = 0; i < tokens.Length; i++)
                {
                    tokens[i] = tokenSpans[i].getCoveredText(sentence).toString();
                }
                var tags = GetPosTagger().tag(tokens);

                // return an array of chunks that contain tag types and start/end indexes
                // for the chunk in the source text
                var chunks = GetChunker().chunkAsSpans(tokens, tags);

                foreach (var chunk in chunks)
                {
                    // filter out everything but noun phrases
                    if (chunk.getType() != "NP") continue;

                    var chunkStart = start + tokenSpans[chunk.getStart()].getStart();
                    var chunkEnd = start + tokenSpans[chunk.getEnd() - 1].getEnd();

                    // extract the noun phrase
                    var nounPhrase = sourceText.Substring(chunkStart, chunkEnd - chunkStart);

                    // run post processing functions to determine if this noun phrase
                    // is suitable for our purposes (defined by caller)
                    if (!ValidNounPhrase(nounPhrase)) continue;

                    nounPhrases.Add(nounPhrase);
                }
            }
            return nounPhrases;
        }
    }
}
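The offset arithmetic in GetNounPhrases is the easiest part to get wrong: the chunk span is expressed in token indexes, each token span in character offsets relative to its sentence, and the sentence span in character offsets relative to the source text. The sketch below isolates that mapping with hand-written spans so it runs without OpenNLP. SpanMath, Range and CoveredText are hypothetical names, and Range is a stand-in for the OpenNLP span type, assuming (as the code above does) that the end index is exclusive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; not OpenNLP types.
public static class SpanMath
{
    // A half-open [Start, End) range.
    public struct Range
    {
        public readonly int Start;
        public readonly int End;

        public Range(int start, int end)
        {
            Start = start;
            End = end;
        }
    }

    // Maps a chunk (token indexes) back to the text it covers in the source.
    public static string CoveredText(
        string sourceText, int sentenceStart, IList<Range> tokenSpans, Range chunk)
    {
        // first token's start and last token's end, shifted by the sentence offset
        var chunkStart = sentenceStart + tokenSpans[chunk.Start].Start;
        var chunkEnd = sentenceStart + tokenSpans[chunk.End - 1].End;
        return sourceText.Substring(chunkStart, chunkEnd - chunkStart);
    }
}
```

For example, in the text "Nice pub. It has a great patio." the second sentence starts at offset 10, and its tokens "a", "great" and "patio" sit at token indexes 2 to 4, so a chunk of `new SpanMath.Range(2, 5)` maps back to "a great patio".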

And finally, a test that demonstrates the setup of my PosNounPhraseParser over the example sentence mentioned earlier in the definition of a noun phrase.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using NUnit.Framework;

namespace OpenNLP.NET.PoC
{
    [TestFixture]
    public class PosNounPhraseParserTests
    {
        [Test]
        public void PosNounPhraseParser_GetNounPhrases_Extract_Noun_Phrases_From_Sentence()
        {
            string _modelPath = @"C:\Development\NLPForDotNET\lib\opennlp-models-1.5\";

            // arrange
            var nounPhraseAdapter = new PosNounPhraseParser(_modelPath)
            {
                PostProcessingFilters = new List<Func<string, bool>>
                {
                    // at least two words
                    (nounPhrase => nounPhrase.Split(" ".ToCharArray()).Count() > 1),
                    // character stop list
                    (nounPhrase =>
                     !(nounPhrase.Contains(".") ||
                       nounPhrase.Contains("\"") ||
                       nounPhrase.Contains(",") ||
                       nounPhrase.Contains("”") ||
                       nounPhrase.Contains("“") ||
                       nounPhrase.Contains(";")))
                }
            };

            nounPhraseAdapter.WarmUpModels();

            var stopwatch = new Stopwatch();
            stopwatch.Start();

            // act
            var actualNounPhrases = nounPhraseAdapter
                .GetNounPhrases("The wells and water table had been polluted by chemical pesticides and fertilizers that leached into the earth and were washed by rain into the creeks, where the stunned fish were scavenged by the ospreys.")
                .ToArray();

            stopwatch.Stop();

            Debug.WriteLine("Total time: {0}", stopwatch.Elapsed);

            // assert
            Assert.Contains("The wells and water table", actualNounPhrases);
            Assert.Contains("chemical pesticides and fertilizers", actualNounPhrases);
            Assert.Contains("the earth", actualNounPhrases);
            Assert.Contains("the creeks", actualNounPhrases);
            Assert.Contains("the stunned fish", actualNounPhrases);
            Assert.Contains("the ospreys", actualNounPhrases);
        }
    }
}

Conclusion

I think this project worked out remarkably well. I don’t know if I’ll attempt to use something like this in a production environment, but if nothing else it was a very enlightening foray into the interesting world of Natural Language Processing. There are many other subjects in this area that I would like to explore, such as Sentiment Analysis and ways to identify subjects of significance in large bodies of text. As the IBM Watson project demonstrated to us not too long ago, this is a young field with staggering potential. The current trajectory of research, along with significant advances in computation capability, suggests it won’t be long before we can communicate with computers and information systems as easily as we talk to our best friends.

If you wish to use the solution I’ve demonstrated in this post please make your own determination on whether it’s acceptable for your project. I’m no expert in licensing, but I’ve cited all my sources where available so that the reader can execute their own due diligence.

  • bob

    lucene.net also offers frequency of words and much more.

    • http://randonom.com/ Sean Glover

      Yes and when it comes to counting frequency (and any kind of full text search) lucene.net is a terrific choice. However, it does not offer the more advanced text analytics capabilities of OpenNLP or other NLP toolkits.

  • Joern

    Nice article! I added a reference to it in the OpenNLP wiki. Thanks.

    • http://randonom.com/ Sean Glover

      That’s terrific. Thanks Joern! :)

  • xxx

    The code is not even complied…. :

  • Steven

    Why didn’t you use SharpNlp given that its a .net port of OpenNlp

    • http://randonom.com/ Sean Glover

      SharpNlp was ported from a much older version of OpenNlp.

  • Danny

    Do you have the compiled code for this somewhere?

    • http://randonom.com/ Sean Glover

      I may be able to dig it up, but I believe I’ve provided all the instructions necessary to convert the jars yourself. If you would like I can take a look and see if I have them available.

  • vmtcram

    i like to extract noun from a textfile, will you help me sir my id:vmtcram@gmail.com

  • Rushi Patel

    Hi.. sir.. i m working on .net web application project to parse youtube comments..
    I have already retrieve comments into my database.
    i want to parse and count no. of word which gives positive or negative effect.
    plz help me.

    • http://randonom.com/ Sean Glover

      Hi Rushi,

      Thanks for reading the article. Unfortunately, I haven’t extended this solution to take into account sentiment. I would suggest looking into OpenNLP’s capabilities in regards to sentiment to see if it’s a solution for you. If not, check out some of the other NLP vendors or 3rd party services (OpenAmplify, Alchemy, Semantria, etc).

      Good luck!

      Sean

  • Wistiy

    how can i add NNP+NNP tag means if (TOP (S (NP (NNP Chris) (NNP Waterson) (NNP Bergmann)))) then i need to extract “Chris Waterson Bergmann” in a single phrase