A language analyzer is a specific type of text analyzer that performs lexical analysis using the linguistic rules of the target language. Every searchable string field has an analyzer property. If your content consists of translated strings, such as separate fields for English and Chinese text, you could specify language analyzers on each field to access the rich linguistic capabilities of those analyzers.

You should consider a language analyzer when awareness of word or sentence structure adds value to text parsing. A common example is the association of irregular verb forms ("bring" and "brought") or plural nouns ("mice" and "mouse"). Without linguistic awareness, these strings are parsed on physical characteristics alone, which fails to catch the connection. Since large chunks of text are more likely to have this content, fields consisting of descriptions, reviews, or summaries are good candidates for a language analyzer.

You should also consider language analyzers when content consists of non-Western language strings. While the default analyzer (Standard Lucene) is language-agnostic, the concept of using spaces and special characters (hyphens and slashes) to separate strings is more applicable to Western languages than non-Western ones.

For example, in Chinese, Japanese, Korean (CJK), and other Asian languages, a space isn't necessarily a word delimiter. Take a Japanese sentence that translates as "This is the heaviest and brightest group of spherical stars in our galaxy." Because it has no spaces, a language-agnostic analyzer would likely analyze the entire string as one token, when in fact the string is a phrase.

For this example, a successful query would have to include the full token, or a partial token using a suffix wildcard, resulting in an unnatural and limiting search experience.

A better experience is to search for individual words: 明るい (Bright), 私たちの (Our), 銀河系 (Galaxy). Using one of the Japanese analyzers available in Cognitive Search is more likely to unlock this behavior because those analyzers are better equipped to split the chunk of text into meaningful words in the target language.

Comparing Lucene and Microsoft Analyzers

Azure Cognitive Search supports 35 language analyzers backed by Lucene, and 50 language analyzers backed by proprietary Microsoft natural language processing technology used in Office and Bing.

Some developers might prefer the more familiar, simple, open-source solution of Lucene. Lucene language analyzers are faster, but the Microsoft analyzers have advanced capabilities, such as lemmatization, word decompounding (in languages like German, Danish, Dutch, Swedish, Norwegian, Estonian, Finnish, Hungarian, Slovak) and entity recognition (URLs, emails, dates, numbers). If possible, you should run comparisons of both the Microsoft and Lucene analyzers to decide which one is a better fit. You can use the Analyze API to see the tokens generated from a given text using a specific analyzer.

Indexing with Microsoft analyzers is on average two to three times slower than with their Lucene equivalents, depending on the language. Search performance shouldn't be significantly affected for average size queries.

The default analyzer is Standard Lucene, which works well for English, but perhaps not as well as Lucene's English analyzer or Microsoft's English analyzer. Lucene's English analyzer extends the Standard analyzer. It removes possessives (trailing 's) from words, applies stemming as per the Porter Stemming algorithm, and removes English stop words. Microsoft's English analyzer performs lemmatization instead of stemming. This means it can handle inflected and irregular word forms much better, which results in more relevant search results.

Set the analyzer during index creation, before the index is loaded with data. In the field definition, make sure the field is attributed as "searchable" and is of type Edm.String. Set the "analyzer" property to one of the language analyzers from the supported analyzers list. The "analyzer" property is the only property that will accept a language analyzer, and it's used for both indexing and queries. Other analyzer-related properties ("searchAnalyzer" and "indexAnalyzer") won't accept a language analyzer.
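The field rules described in this article (a "searchable" field of type Edm.String, with a language analyzer set only through the "analyzer" property) can be sketched in Python as plain dictionaries that mirror a JSON index definition. This is a minimal sketch under stated assumptions, not the service's own validation: the index name, the field names, the helper functions, and the heuristic for spotting a language analyzer by its name are all hypothetical, and the analyzer names shown should be checked against the supported analyzers list. The last helper only builds the path and body for an Analyze API call; it does not send anything.

```python
import json

# Assumption: these two names are treated as non-language analyzers.
# The real list of analyzers should be taken from the supported analyzers list.
NON_LANGUAGE_ANALYZERS = {"standard.lucene", "standardasciifolding.lucene"}


def is_language_analyzer(name):
    """Heuristic (an assumption): names ending in .microsoft or .lucene,
    other than the non-language analyzers above, are language analyzers."""
    if name in NON_LANGUAGE_ANALYZERS:
        return False
    return name.endswith(".microsoft") or name.endswith(".lucene")


def language_field(name, analyzer):
    """Build a searchable Edm.String field that uses a language analyzer."""
    return {
        "name": name,
        "type": "Edm.String",
        "searchable": True,
        "analyzer": analyzer,
    }


def check_field(field):
    """Return a list of problems found in one field definition."""
    problems = []
    if "analyzer" in field:
        if field.get("type") != "Edm.String":
            problems.append(f"{field['name']}: analyzer requires type Edm.String")
        if not field.get("searchable", False):
            problems.append(f"{field['name']}: analyzer requires a searchable field")
    # Language analyzers are accepted only by "analyzer".
    for prop in ("searchAnalyzer", "indexAnalyzer"):
        value = field.get(prop)
        if value and is_language_analyzer(value):
            problems.append(f"{field['name']}: {prop} cannot be a language analyzer")
    return problems


def analyze_request(index_name, text, analyzer):
    """Build the path and JSON body for an Analyze API call (nothing is sent)."""
    return {
        "path": f"/indexes/{index_name}/analyze",
        "body": {"text": text, "analyzer": analyzer},
    }


# Hypothetical index with separate fields for English and Japanese text.
index_definition = {
    "name": "reviews-sample",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        language_field("description_en", "en.microsoft"),
        language_field("description_ja", "ja.lucene"),
    ],
}

if __name__ == "__main__":
    for f in index_definition["fields"]:
        for problem in check_field(f):
            print(problem)
    print(json.dumps(index_definition, indent=2))
    # Compare how two analyzers would be asked to tokenize the same text.
    for analyzer in ("en.lucene", "en.microsoft"):
        print(json.dumps(analyze_request("reviews-sample", "mice", analyzer)))
```

Sending the same text through the Analyze API once per analyzer, as the final loop prepares to do, is one way to run the Lucene versus Microsoft comparison suggested above before committing to an analyzer at index creation.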