tjtjtjのメモ

自分のためのメモです

Entity and Sentiment Analysis with the Natural Language API(英語)

今回は5部構成。

Call the Natural Language API

最初に使用する自然言語 APIメソッドはanalyzeEntitiesです。このメソッドを使用すると、APIはテキストからエンティティ（人、場所、イベントなど）を抽出できます。

request.json

{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"Joanne Rowling, who writes under the pen names J. K. Rowling and Robert Galbraith, is a British novelist and screenwriter who wrote the Harry Potter fantasy series."
  },
  "encodingType":"UTF8"
}

curl "https://language.googleapis.com/v1/documents:analyzeEntities?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json
{
  "entities": [
    {
      "name": "Joanne Rowling",
      "type": "PERSON",
      "metadata": {
        "wikipedia_url": "https://en.wikipedia.org/wiki/J._K._Rowling",
        "mid": "/m/042xh"
      },
      "salience": 0.79828626,
      "mentions": [
        {
          "text": {
            "content": "Joanne Rowling",
            "beginOffset": 0
          },
          "type": "PROPER"
        },
        {
          "text": {
            "content": "Rowling",
            "beginOffset": 53
          },
          "type": "PROPER"
        },
        {
          "text": {
            "content": "novelist",
            "beginOffset": 96
          },
          "type": "COMMON"
        },
        {
          "text": {
            "content": "Robert Galbraith",
            "beginOffset": 65
          },
          "type": "PROPER"
        }
      ]
    },
    :
  ],
  "language": "en"
}

応答内の各エンティティについて、エンティティタイプ、関連するウィキペディアのURL（存在する場合）、特徴、およびこのエンティティがテキスト内のどこに出現したかのインデックスを取得します。顕著性とは、テキスト全体に対するエンティティの中心性を指す[0,1]の範囲の数値です。自然言語 APIは、同じエンティティをさまざまな方法で認識することもできます。レスポンスのメンションリストを見てください。APIは、 "Joanne Rowling"、 "Rowling"、 "novelist"、および "Robert Galbriath"のすべてが同じことを示していることがわかります。

Sentiment analysis with the Natural Language API

エンティティの抽出に加えて、Natural Language APIでは、テキストブロックに対して感情分析を実行することもできます。

{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"Harry Potter is the best book. I think everyone should read it."
  },
  "encodingType": "UTF8"
}

curl "https://language.googleapis.com/v1/documents:analyzeSentiment?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json
{
  "documentSentiment": {
    "magnitude": 0.8,
    "score": 0.4
  },
  "language": "en",
  "sentences": [
    {
      "text": {
        "content": "Harry Potter is the best book.",
        "beginOffset": 0
      },
      "sentiment": {
        "magnitude": 0.7,
        "score": 0.7
      }
    },
    {
      "text": {
        "content": "I think everyone should read it.",
        "beginOffset": 31
      },
      "sentiment": {
        "magnitude": 0.1,
        "score": 0.1
      }
    }
  ]
}

2つのタイプの感情値があることに注意してください。文書全体の感情と、文ごとに分類された感情です。感情メソッドは2つの値を返します。

score  - ステートメントが正または負であることを示す-1.0〜1.0の数値です。
magnitude - 正または負に関係なく、ステートメントで表される感情の重みを表す0から無限の範囲の数値です。

Analyzing entity sentiment

テキストドキュメント全体の感情の詳細を提供することに加えて、Natural Language APIはテキスト内のエンティティによる感情を細分化することもできます。例としてこの文章を使用してください。

「私は寿司が好きでしたが、サービスはひどかった」

{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"I liked the sushi but the service was terrible."
  },
  "encodingType": "UTF8"
}

「寿司」のスコアが0.7、「サービス」のスコアは-0.9。

curl "https://language.googleapis.com/v1/documents:analyzeEntitySentiment?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json
{
  "entities": [
    {
      "name": "sushi",
      "type": "CONSUMER_GOOD",
      "metadata": {},
      "salience": 0.51064336,
      "mentions": [
        {
          "text": {
            "content": "sushi",
            "beginOffset": 12
          },
          "type": "COMMON",
          "sentiment": {
            "magnitude": 0.7,
            "score": 0.7
          }
        }
      ],
      "sentiment": {
        "magnitude": 0.7,
        "score": 0.7
      }
    },
    {
      "name": "service",
      "type": "OTHER",
      "metadata": {},
      "salience": 0.48935664,
      "mentions": [
        {
          "text": {
            "content": "service",
            "beginOffset": 26
          },
          "type": "COMMON",
          "sentiment": {
            "magnitude": 0.9,
            "score": -0.9
          }
        }
      ],
      "sentiment": {
        "magnitude": 0.9,
        "score": -0.9
      }
    }
  ],
  "language": "en"
}

Analyzing syntax and parts of speech

Natural Language APIの3番目の方法であるテキスト注釈を見てみると、テキストの言語に関する詳細をさらに深く理解することができます。 annotateTextは、テキストのセマンティック要素および構文要素に関する詳細をすべて網羅した高度なメソッドです。テキスト内の各単語について、APIはその単語の品詞（名詞、動詞、形容詞など）と、それが文中の他の単語とどのように関連しているかを教えてくれます（ルート動詞ですか？修飾語？）。

{
  "document":{
    "type":"PLAIN_TEXT",
    "content": "Joanne Rowling is a British novelist, screenwriter and film producer."
  },
  "encodingType": "UTF8"
}

Joanne は名詞(NOUN) のように単語毎に品詞が分析されている。

curl "https://language.googleapis.com/v1/documents:analyzeSyntax?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json
{
  "sentences": [
    {
      "text": {
        "content": "Joanne Rowling is a British novelist, screenwriter and film producer.",
        "beginOffset": 0
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Joanne",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 1,
        "label": "NN"
      },
      "lemma": "Joanne"
    },
    {
      "text": {
        "content": "Rowling",
        "beginOffset": 7
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 2,
        "label": "NSUBJ"
      },
      "lemma": "Rowling"
    },
  :
  ],
  "language": "en"
}

Multilingual natural language processing

Natural Language APIは英語以外の言語もサポートしています。

{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"日本のグーグルのオフィスは、東京の六本木ヒルズにあります"
  }
}

curl "https://language.googleapis.com/v1/documents:analyzeEntities?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json
{
  "entities": [
    :
    {
      "name": "グーグル",
      "type": "ORGANIZATION",
      "metadata": {
        "mid": "/m/045c7b",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Google"
      },
      "salience": 0.21214141,
      "mentions": [
        {
          "text": {
            "content": "グーグル",
            "beginOffset": -1
          },
          "type": "PROPER"
        }
      ]
    },
    {
      "name": "六本木ヒルズ",
      "type": "PERSON",
      "metadata": {
        "mid": "/m/01r2_k",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Roppongi_Hills"
      },
      "salience": 0.19418614,
      "mentions": [
        {
          "text": {
            "content": "六本木ヒルズ",
            "beginOffset": -1
          },
          "type": "PROPER"
        }
      ]
    },
    :
  ],
  "language": "ja"
}