Match rate: how to calculate the similarity between two words/strings.

The string similarity algorithm was developed to meet the following criteria:

  • A true reflection of lexical similarity: strings with small differences should be recognized as being similar. In particular, a significant sub-string overlap should point to a high level of similarity between the strings.
  • A robustness to changes of word order: two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
  • Language independence: the algorithm should work not only in English, but also in many different languages.

Solution

The similarity is calculated in three steps:

  • Partition each string into a list of tokens.
  • Compute the similarity between tokens by using a string edit-distance algorithm (extension feature: semantic similarity measurement using the WordNet library).
  • Compute the similarity between the two token lists.
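The three steps above can be sketched roughly as follows. This is a minimal illustration in Python (the algorithm discussed in the thread was written in C#), not the original implementation. The function names are hypothetical, the tokenizer simply splits on non-alphanumeric characters, and the token lists are compared with a simple best-match average, which is an assumption; the original may well combine token scores differently. The WordNet extension is not covered.

```python
import re


def tokenize(text):
    """Step 1: split on anything that is not a letter or digit, lower-casing tokens."""
    return [t.lower() for t in re.split(r"[\W_]+", text) if t]


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_similarity(a, b):
    """Step 2: edit distance normalized to the range 0.0 .. 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def string_similarity(s1, s2):
    """Step 3 (assumed scheme): average each token's best match, in both directions."""
    t1, t2 = tokenize(s1), tokenize(s2)
    if not t1 or not t2:
        return 1.0 if t1 == t2 else 0.0
    best1 = [max(token_similarity(a, b) for b in t2) for a in t1]
    best2 = [max(token_similarity(b, a) for a in t1) for b in t2]
    return (sum(best1) + sum(best2)) / (len(t1) + len(t2))


if __name__ == "__main__":
    for left, right in [("a_logfile.txt", "logfile_a.txt"),
                        ("myText.txt", "logfile.txt")]:
        print(f"{left} / {right}: {string_similarity(left, right):.0%}")
```

Because tokens are matched regardless of position, reordered words still score highly, while the edit distance inside each token keeps a random anagram of the characters from scoring well.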

Here is another discussion for your reference.

A better similarity ranking algorithm for variable length strings

Thank you all for your help and suggestions.

Martin Xie [MSFT]
MSDN Community Support | Feedback to us
Get or Request Code Sample from Microsoft
Please remember to mark the replies as answers if they help and unmark them if they provide no help.

  • Marked as answer by Martin_Xie Monday, Sep 26, 2011 8:48 AM

All replies

What exactly is your question? Please explain it in more detail, I am confused by your

For example, "a_logfile.txt" and "logfile_a.txt" should be very similar, as should "loga_file.txt" and "logfile.text", but not "myText.txt" and "logfile.txt".

If this resolved your problem, please click "Mark as Answer" on that post and "Mark as Helpful". Happy Programming!

Okay, I will try it once again 🙂

Well, I want to compare filenames and I also want to get a percentage number for how similar they are. I do not know if this is possible at all.

For example, the filenames "a_filename.txt" and "filename_a.txt" are very similar for us, but how can I get the same result programmatically?

Another example: the filenames "file_abc_.txt" and "fil_abc_e.txt" are also similar, but again, how do I get the result programmatically?

That is possibly more difficult than it seems at first.

Look at http://en.wikipedia.org/wiki/String_metrics and follow some of the links.
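As one quick way to get a percentage for the filenames asked about above, here is a small Python sketch (the thread itself is about C#, so treat this as an illustration only). It uses the standard library's `difflib.SequenceMatcher`, whose `ratio()` is one string metric among many; the helper name `similarity_percent` is made up for this example.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 score based on difflib's matching-block ratio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    pairs = [
        ("a_filename.txt", "filename_a.txt"),
        ("file_abc_.txt", "fil_abc_e.txt"),
        ("myText.txt", "logfile.txt"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {similarity_percent(left, right):.1f}%")
```

This metric works on characters rather than words, so names whose parts are heavily reordered may score lower than a token-based comparison would give them.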

Regards, David R
Every program eventually becomes rococo, and then rubble. – Alan Perlis
The only valid measurement of code quality: WTFs/minute.

Welcome to MSDN Forum.

This article shows the solution for: How to Compute the similarity between two words/strings. The algorithm was developed in C# and you can download the demo in.

The string similarity algorithm was developed to meet the following requirements:

  • A true reflection of lexical similarity: strings with small differences should be recognized as being similar. In particular, a significant sub-string overlap should point to a high level of similarity between the strings.
  • A robustness to changes of word order: two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
  • Language independence: the algorithm should work not only in English, but also in many different languages.

Solution

The similarity is calculated in three steps:

  • Partition each string into a list of tokens.
  • Compute the similarity between tokens by using a string edit-distance algorithm (extension feature: semantic similarity measurement using the WordNet library).
  • Compute the similarity between the two token lists.

Here is another discussion for your reference.

A better similarity ranking algorithm for variable length strings

Thank you all for your help and suggestions.

Martin Xie [MSFT]
MSDN Community Support | Feedback to us
Get or Request Code Sample from Microsoft
Please remember to mark the replies as answers if they help and unmark them if they provide no help.

  • Marked as answer by Martin_Xie Monday, Sep 26, 2011 8:48 AM

I have written some code for my project to find similar names in a database.

First I used the DIFFERENCE(string1, string2)>=4 function of SQL Server, but it did not help me. For example, when the first name was "21" and the second name was "21 jump street", the result included both names even though they are clearly not similar. The result set of such a query contained over 700 values, which was not useful in this case.

Then I found an equivalent DIFFERENCE function for C# which was almost the same as the SQL version of that function. For example, it rated the similarity of "asdcdfsdfgdsgdg" and "asdewwetqwetrwe" as good, which is certainly not true.

I then developed a class for this problem to get a more effective similarity between strings.

The name of this class is StringCompare, and here is an introduction to this class:

WHAT IS STRING COMPARE?

StringCompare is a comparison tool for strings. It is not an ordinal comparison, but a relative comparison that determines how similar or dissimilar two strings are.

By setting good tradeoff values you can get a good comparison for strings.

HOW TO USE:

First you should create an instance of StringCompare with your own tradeoff values or with the default tradeoff values.

There are 4 values that can be set:

1. MinSimilarityLong:

This is the minimum acceptable percentage of similarity between two strings that are compared with StringCompare. This value is used for strings with a length of at least 8.

2. MinSimilarityShort:

This is the minimum acceptable percentage of similarity between two strings that are compared with StringCompare. This value is used for strings with a length below 8.

3. MaxToleranceLong:

This is the maximum acceptable percentage of tolerance between two strings that are compared with StringCompare. This value is used for strings with a length of at least 8.

4. MaxToleranceShort:

This is the maximum acceptable percentage of tolerance between two strings that are compared with StringCompare. This value is used for strings with a length below 8.

* Once you have created an instance you can call InstanceName.IsEqual(string1, string2) to determine the equality of two strings.

* Note that the equality is relative to the minSimilarity and maxTolerance values you set earlier.

* Consider that higher minSimilarity values will result in more restricted results and vice versa.

* Consider that lower maxTolerance values will result in more restricted results and vice versa.

For example:
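A rough sketch of how a class with these four settings might behave, written in Python rather than the C# of the post. Almost everything here is an assumption: the post does not say how similarity or tolerance are measured, which string's length decides between the "long" and "short" settings, or what the defaults are. In this sketch similarity is `difflib`'s ratio as a percentage, tolerance is the relative length difference, the shorter string picks the settings, and the default numbers are invented.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class StringCompare:
    # Hypothetical defaults; the original post does not state them.
    min_similarity_long: float = 70.0
    min_similarity_short: float = 80.0
    max_tolerance_long: float = 30.0
    max_tolerance_short: float = 20.0
    length_threshold: int = 8  # "long" means a length of at least 8

    def similarity(self, a: str, b: str) -> float:
        """ASSUMPTION: similarity is difflib's ratio expressed as a percentage."""
        return 100.0 * SequenceMatcher(None, a, b).ratio()

    def tolerance(self, a: str, b: str) -> float:
        """ASSUMPTION: tolerance is the length difference relative to the longer string."""
        longest = max(len(a), len(b))
        return 0.0 if longest == 0 else 100.0 * abs(len(a) - len(b)) / longest

    def is_equal(self, a: str, b: str) -> bool:
        """Relative equality: similar enough and within the allowed tolerance."""
        # ASSUMPTION: the shorter string decides which pair of settings applies.
        is_long = min(len(a), len(b)) >= self.length_threshold
        min_sim = self.min_similarity_long if is_long else self.min_similarity_short
        max_tol = self.max_tolerance_long if is_long else self.max_tolerance_short
        return self.similarity(a, b) >= min_sim and self.tolerance(a, b) <= max_tol


if __name__ == "__main__":
    comparer = StringCompare()
    print(comparer.is_equal("21", "21 jump street"))
    print(comparer.is_equal("a_filename.txt", "filename_a.txt"))
```

Raising the minimum similarity or lowering the maximum tolerance makes `is_equal` stricter, which matches the notes above.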
