A Quick Walkthrough of Amazon Comprehend

What is Amazon Comprehend?

Amazon Comprehend is a Natural Language Processing (NLP) service that uses machine learning to find meaning and insights in text. In short, it develops insights by recognizing the entities, key phrases, language, sentiment, and other common elements of a document. Amazon Comprehend is one of the application-level services in the AWS ML stack.

Fig: AWS ML Stack

Features:

Fig: Features of Comprehend

  1. Sentiment Analysis – Analyses the emotions and opinions expressed in text.
  2. Entities Detection – Detects real-world objects such as people, places, quantities, dates, etc.
  3. Language Detection – Detects the language of the document ('en' | 'es' | 'fr' | 'de' | 'it' | 'pt' | etc.).
  4. Key Phrases – Detects the noun phrases in the text.
  5. Syntax Analysis – Detects the parts of speech.
  6. Topic Modeling – Determines common themes from the content of a collection of documents.

All of these features operate on a single document or in batch mode, while topic modeling runs on large document collections.

Benefits:

  1. Integrated powerful NLP into your apps
  2. Deep Learning based NLP
  3. Scalable NLP
  4. Integrate with other AWS Services
  5. Encryption of output results and volume data
  6. Low Cost

Guidelines & Limitation:

Supported Regions: Amazon Comprehend is available in the regions below, whereas Comprehend Medical is limited to a few regions.

Fig: Comprehend Supported Regions

Fig: Comprehend Medical Supported Region

Throttling: If too many documents are submitted at once, batch jobs are throttled and processing fails. These limits can be increased.

Fig: Limits

You can raise these limits by requesting a service limit increase in the AWS console.

Overall Limits: Comprehend supports UTF-8 encoding, with a maximum of 5,000 bytes per single document for synchronous operations.
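Because the per-document limit is counted in bytes (not characters), a naive slice can split a multi-byte UTF-8 character. A minimal sketch of a safe truncation helper (the function name is our own, not part of any AWS SDK):

```python
# Comprehend's synchronous APIs accept UTF-8 text up to 5,000 bytes per
# document, so over-long text must be trimmed before calling them.
MAX_BYTES = 5000

def truncate_utf8(text: str, max_bytes: int = MAX_BYTES) -> str:
    """Trim text so its UTF-8 encoding fits in max_bytes without
    splitting a multi-byte character at the cut point."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" silently drops any partial character at the boundary
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

The truncated string can then be passed to any of the synchronous detect_* calls without tripping the size limit.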

Multiple Document Requests: Comprehend accepts at most 25 documents in a single batch request. For batch jobs on large collections, this limit is not enough to perform topic modeling.
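A larger collection therefore has to be split into chunks of 25 before calling the batch APIs. A sketch, where `documents` is a placeholder list of strings and the boto3 call (shown commented out) requires AWS credentials:

```python
# Comprehend batch APIs accept at most 25 documents per request,
# so a larger collection must be split into chunks first.
BATCH_LIMIT = 25

def chunk(docs, size=BATCH_LIMIT):
    """Split a document list into batches that Comprehend will accept."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

# Sketch of the batch calls (requires AWS credentials):
# import boto3
# comprehend = boto3.client("comprehend")
# for batch in chunk(documents):
#     result = comprehend.batch_detect_sentiment(TextList=batch,
#                                                LanguageCode="en")
```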

Asynchronous Operations: Use asynchronous operations to analyse large numbers of documents. Note, however, that an asynchronous job handles at most 5 GB of files at a time; a collection larger than 5 GB must be split across multiple jobs.

Document Classification:

Language Detection: Comprehend identifies languages using RFC 5646 language identifiers, but it does not support phonetics.

Topic Modeling: Comprehend accepts only UTF-8 encoded text, and for topic modeling only standard ASCII (0–127) characters. It does not support UTF-16 input, even for valid BMP characters, and fails to process anything that is not UTF-8.
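One simple way to satisfy this constraint is to strip non-ASCII characters before submitting a topic modeling job. A minimal sketch (the helper name is our own):

```python
def ascii_only(text: str) -> str:
    """Keep only standard ASCII (0-127) characters, the safest input
    for Comprehend topic modeling."""
    return text.encode("ascii", errors="ignore").decode("ascii")
```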

How to implement Comprehend?

Now, we are going to take an example paragraph to explain how Comprehend works. Here’s the sample paragraph –

“Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.”

Create an empty Python file and start by importing the boto3 package, which connects to AWS services including Comprehend. The code to use the Amazon Comprehend service from Python is shown below:

Fig: boto3 package and amazon comprehend

After creating the Comprehend client, you can explore all of the features listed above.

  1. Key Phrases: phrases = comprehend.detect_key_phrases(Text=sample_tweet, LanguageCode='en')
  2. Entities Detection: entities = comprehend.detect_entities(Text=sample_tweet, LanguageCode='en')
  3. Sentiment Analysis: sentiments = comprehend.detect_sentiment(Text=sample_tweet, LanguageCode='en')
  4. Language Detection: lan = comprehend.detect_dominant_language(Text=sample_tweet)
  5. Syntax Analysis: syntax = comprehend.detect_syntax(Text=sample_tweet, LanguageCode='en')

The functions highlighted above are provided by boto3 to call the respective Comprehend operations. Here, sample_tweet is the variable holding the example paragraph above. Now we will print the output of each feature of the Comprehend service.
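Each call returns a plain Python dict. As an illustration of how to work with one, here is the shape of a detect_key_phrases response (the values below are made up for demonstration; real scores and offsets come from the service) and a small helper, of our own devising, that picks out the highest-scoring phrases:

```python
# Illustrative shape of a detect_key_phrases response (sample values):
sample_response = {
    "KeyPhrases": [
        {"Text": "simply dummy text", "Score": 0.93,
         "BeginOffset": 15, "EndOffset": 32},
        {"Text": "the printing and typesetting industry", "Score": 0.99,
         "BeginOffset": 36, "EndOffset": 73},
    ]
}

def top_phrases(response, n=5):
    """Return the n highest-scoring phrase strings from a
    DetectKeyPhrases response."""
    ranked = sorted(response["KeyPhrases"],
                    key=lambda p: p["Score"], reverse=True)
    return [p["Text"] for p in ranked[:n]]
```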

Key Phrases Output: Lists the detected key phrases.

Entities Detection Output: Lists each detected entity and its type.

Sentiment Analysis Output: Shows whether the sentiment is positive, negative, or neutral.

Language Detection Output: Shows the dominant language detected.

Syntax Analysis Output: Returns a syntax token for every element; only a few are listed here.

Topic Modeling:

Amazon Comprehend uses a Latent Dirichlet Allocation (LDA) based learning model to determine the topics in a set of documents. For example, the word “apple” in one article may refer to laptops, phones, and watches, while in other articles it refers to the fruit. Because the same word can appear in different contexts across the document set, the model assigns each word a weight per topic; these weights decide which topic a word belongs to and thereby define the topics.

For accurate results,

  1. Use at least 1,000 documents in each topic modeling job.
  2. Each document should be at least 3 sentences long.
  3. If a document consists mostly of numeric data, remove the numbers from the corpus so topics are easier to define.
  4. Follow the guidelines and limits specified above.
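Guideline 3 can be handled with a one-line preprocessing step; a minimal sketch (the helper name is our own):

```python
import re

def strip_numbers(doc: str) -> str:
    """Remove standalone numbers from a document before topic modeling."""
    return re.sub(r"\s*\b\d+\b", "", doc)
```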

Topic modeling is an asynchronous process: the list of documents is uploaded to S3, processed by Amazon Comprehend, and the output is written back to S3 for further use. Documents can be uploaded in two formats: one document per file, or one document per line.
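A sketch of starting such a job with boto3 is shown below. The S3 URIs and the IAM role ARN are placeholders you must replace with your own, and the actual start call (commented out) requires AWS credentials:

```python
def topics_job_request(input_uri, output_uri, role_arn, num_topics=10):
    """Build the keyword arguments for StartTopicsDetectionJob."""
    return {
        "InputDataConfig": {
            "S3Uri": input_uri,
            # ONE_DOC_PER_LINE is the alternative input format
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": output_uri},
        "DataAccessRoleArn": role_arn,
        "NumberOfTopics": num_topics,
    }

# import boto3
# comprehend = boto3.client("comprehend")
# job = comprehend.start_topics_detection_job(
#     **topics_job_request("s3://my-bucket/input/",
#                          "s3://my-bucket/output/",
#                          "arn:aws:iam::123456789012:role/comprehend-role"))
```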

Since Amazon Comprehend is an NLP service, it mostly relies on lemmatization to normalize the words it uses for topic modeling.

Document Processing Models:

Amazon provides three models for processing our documents:

  1. Single-Document Processing
  2. Multiple Document Synchronous Processing
  3. Asynchronous Batch Processing

Each model has its own operations:

  1. Single-Document Processing Operations:
    1. DetectDominantLanguage
    2. DetectEntities
    3. DetectKeyPhrases
    4. DetectSentiment
    5. DetectSyntax
  2. Multiple Document Synchronous Processing Operations:
    1. BatchDetectDominantLanguage
    2. BatchDetectEntities
    3. BatchDetectKeyPhrases
    4. BatchDetectSentiment
    5. BatchDetectSyntax
  3. Asynchronous Batch Processing Operations:
    1. Starting an analysis job
      1. StartDominantLanguageDetectionJob
      2. StartEntitiesDetectionJob
      3. StartKeyPhrasesDetectionJob
      4. StartSentimentDetectionJob
      5. StartTopicsDetectionJob
    2. Monitoring Analysis Jobs
      • To get Status of Single Job
        1. DescribeDominantLanguageDetectionJob
        2. DescribeEntitiesDetectionJob
        3. DescribeKeyPhrasesDetectionJob
        4. DescribeSentimentDetectionJob
        5. DescribeTopicsDetectionJob
      • To get status of Multiple Jobs
        1. ListDominantLanguageDetectionJobs
        2. ListEntitiesDetectionJobs
        3. ListKeyPhrasesDetectionJobs
        4. ListSentimentDetectionJobs
        5. ListTopicsDetectionJobs
      • Getting Analysis Results – the same Describe operations, which also return the output location of a completed job
        1. DescribeDominantLanguageDetectionJob
        2. DescribeEntitiesDetectionJob
        3. DescribeKeyPhrasesDetectionJob
        4. DescribeSentimentDetectionJob
        5. DescribeTopicsDetectionJob

These are the three models for processing documents and their respective operations.
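For the asynchronous model, a typical pattern is to poll a Describe* operation until the job reaches a final state. A sketch, where `job_id` comes from the corresponding Start*Job call and the boto3 calls (commented out) require AWS credentials:

```python
import time

# Statuses that mean an asynchronous Comprehend job has finished.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}

def is_finished(status: str) -> bool:
    """True once an asynchronous Comprehend job reaches a final state."""
    return status in TERMINAL_STATUSES

# Sketch of a polling loop:
# import boto3
# comprehend = boto3.client("comprehend")
# while True:
#     job = comprehend.describe_topics_detection_job(JobId=job_id)
#     if is_finished(job["TopicsDetectionJobProperties"]["JobStatus"]):
#         break
#     time.sleep(30)
```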

Common Use-Cases:

  1. Content Personalization
  2. Semantic Search
  3. Intelligent Data Warehouse
  4. Social Analytics

Conclusion:

This concludes our overview of Amazon Comprehend, consumed here from Python. In our next article, we will take up a new use case in which AWS Comprehend is applied extensively to solve real problems.