Crossover: 2014
Chapter 156: Academic Tool Person, Get √
Eve Carly didn't know why Lin Hui had suddenly asked such a question.
But how could she pass up such an opportunity to get Lin Hui's advice?
She first explained to Lin Hui the role vectors currently played in the West when computing semantic text similarity.
Then she formally began to answer the question Lin Hui had put to her earlier:
"The introduction of vectors will make it easier for machines to process semantic text information.
If we do not introduce vectors, we have few options when dealing with semantic text similarity.
And without introducing vectors, the schemes we choose to calculate the semantic text similarity are more or less LOW.
For example, based on the string method, this method is to compare the original text.
It mainly includes edit distance, longest common subsequence, N-Gram similarity, etc. to measure.
Take the edit distance as an example, the basis for measuring the similarity between two texts is based on the minimum number of editing operations required to convert one text into another.
The editing operations defined by this algorithm include adding, deleting, and replacing.
The longest common subseries is based on...
This set of metrics is even a bit like Microsoft Word format to measure general.
Although the string-based method is simple in principle and easy to implement.
But this method does not take into account the meaning of words and the interrelationship between words and words.
Issues involving synonyms and polysemous words cannot be dealt with.
Currently, string-based methods are rarely used alone to compute text similarity.
Instead, the calculation results of these methods are incorporated into more complex methods as features representing text.
In addition to this method, there are..."
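To make the edit distance Eve Carly describes concrete, here is a minimal sketch in Python; the function and the sample strings are illustrative only and not part of the chapter's own material.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```

A measure like this treats "car" and "automobile" as almost entirely dissimilar, which is exactly the synonym problem the dialogue points out.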
Lin Hui knew a little about these things himself.
He simply wanted to gauge how far research in this timeline had progressed from Eve Carly's own account.
Measuring semantic text similarity with strings, whether by edit operations or the longest common subsequence, really was a bit low-end.
But low-end does not mean useless, so the approach could hardly be called worthless.
Imagine a breakthrough in the field of text recognition.
If a string-based criterion for text similarity were paired with a text recognition algorithm,
then judging similarity directly from character strings would actually be the most appropriate choice.
After all, the string-based approach is the one closest to the way computer vision intuitively handles its input.
In fact, text recognition algorithms were utterly commonplace technology in later years.
Even the screenshot tool of any chat app could handle text recognition perfectly well.
Yet in this timeline, even software that made text recognition its selling point
in practice did little more than scan documents and convert them to PDF,
which is woefully inefficient compared with genuine text recognition.
Lin Hui felt he had stumbled upon a business opportunity.
He had found one, but now was not the right time to act on it.
After all, text recognition belongs to the field of computer vision.
Computer vision, simply put, is about making machines see.
It is a branch of artificial intelligence.
Research in this area aims to enable computers and systems to extract meaningful information from images, videos, and other visual inputs.
The machine then takes action or offers recommendations based on that information.
If artificial intelligence gives computers the ability to think,
then computer vision gives them the ability to discover, observe, and understand.
Computer vision cannot be called impossibly complicated,
but its barrier to entry is far higher than natural language processing.
Obviously it was not something Lin Hui should wade into just yet.
But Lin Hui was patient, and he quietly filed the matter away.
He felt he should not be too short-sighted.
Some things may look of little use now,
but that does not mean they will be useless in the long run.
Thinking of this, Lin Hui suddenly felt very lucky.
After his rebirth, the experience of his previous life certainly made things easier for him.
But what rebirth truly brought him, and what benefited him most, was a change in how he thought.
With many things, Lin Hui would now subconsciously weigh their long-term value,
even, without meaning to, considering what they might be worth ten or twenty years later.
With this long-term way of thinking,
Lin Hui felt that, given time, he could reach a height few people ever would.
But these were not thoughts to share with outsiders.
Although he differed somewhat from Eve Carly over string-based methods of evaluating text similarity,
Lin Hui did not show it; academic exchange, after all, is often about seeking common ground while reserving differences.
Eve Carly continued to state her view:
"...I do think introducing vectors into the measurement of semantic text similarity is a good idea.
But once vectors get involved, it is like opening Pandora's box.
When vectors are used to handle semantically complex text,
it is all too easy to end up in very high-dimensional spaces, causing a dimension explosion.
When that happens, the application scenario often degrades badly;
the problem of dimension explosion comes up again and again.
In fact, the dimension explosion problem has already become a bottleneck for our research.
Dear Lin, what is your view on this issue?"
Lin Hui said: "Dimensional explosion is mainly a problem that is difficult to deal with in high dimensions.
That being the case, why not consider reducing the dimensionality of high dimensions? "
Lin Hui's tone was so calm.
As if describing a natural thing.
Dimensionality reduction?What is high-dimensional dimensionality reduction? ?
I listened to the information from the interpreter.
Eve Carly felt like vomiting blood.
She wants to learn Chinese a little bit.
She could not tell whether Lin Hui had simply meant turning something high-dimensional into something low-dimensional,
or whether Lin Hui had actually specified what was to be transformed and the interpreter had dropped it in translation.
It would be terrible if some key term had been left out.
In the end, did Lin Hui mean converting high-dimensional data into low-dimensional data?
Or converting a high-dimensional model into a low-dimensional model?
Or something else entirely?
Eve Carly wanted to ask.
But remembering Lin Hui's thoughtful gesture toward Mina Kali earlier,
she could not bring herself to make the interpreter Lin Hui had brought feel uneasy over something like this.
She thought carefully about what Lin Hui's words might mean.
First of all, Eve Carly felt that what Lin Hui meant was probably not reducing high-dimensional data to low-dimensional data.
If high-dimensional data turns up during natural language processing,
dimensionality reduction is indeed possible when analyzing it.
In fact, dimensionality reduction has to be performed!
A high-dimensional data model may collect a great many data points,
but those points are usually scattered thinly across a vast high-dimensional space.
In that situation, many statistical methods become difficult to apply to the data.
This is one of the reasons the "curse of dimensionality" exists.
Faced with this curse, high-dimensional data is hard to process at all without dimensionality reduction.
(PS: ...only people with maxed-out math talent can handle high dimensions directly.)
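A small numpy sketch of the scattering effect described above, under the assumption of uniformly random points and Euclidean distance: as the dimension grows, the nearest and farthest neighbors of a query end up almost the same distance away, which is part of why many statistical methods lose traction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

for dim in (2, 10, 100, 1000):
    X = rng.random((n_points, dim))      # points scattered uniformly in the unit cube
    q = rng.random(dim)                  # a query point
    dists = np.linalg.norm(X - q, axis=1)
    # In high dimensions the nearest and farthest distances become nearly equal,
    # so "closeness" stops carrying much information.
    print(dim, round(dists.min() / dists.max(), 3))
```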
As a way of denoising and simplifying data, dimensionality reduction is helpful for most modern machine learning data.
By reducing the dimensionality of the data, a complex and intractable problem can, in theory, be simplified and relaxed.
In machine learning, dimensionality reduction means applying some mapping
that sends the data points of the original high-dimensional space into a low-dimensional space.
The goal is to strip out noise while preserving the information of interest in the low-dimensional data.
This helps researchers see the structure and patterns hidden in the original high-dimensional data.
Raw high-dimensional data often contains observations of many irrelevant or redundant variables,
so dimensionality reduction can be viewed as a form of latent feature extraction.
It is a method used routinely in data compression, data exploration, and data visualization.
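As one concrete instance of such a mapping, here is a minimal principal component analysis sketch using numpy's SVD; the random data and the choice of two output dimensions are assumptions for illustration, not anything specified in the text.

```python
import numpy as np

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Map points from the original high-dimensional space into a k-dimensional one."""
    Xc = X - X.mean(axis=0)                            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # singular value decomposition
    return Xc @ Vt[:k].T                               # project onto the top-k principal axes

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))   # 200 samples living in a 300-dimensional space
Z = pca_reduce(X, k=2)
print(Z.shape)                    # (200, 2)
```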
That said, dimensionality reduction is not as simple as flinging out a dual vector foil the way science fiction describes it.
Actually carrying out dimensionality reduction is extremely troublesome!
When choosing a dimensionality reduction method, many factors have to be weighed.
The first is the nature of the input data.
Continuous data, categorical data, count data, and distance data each call for different dimensionality reduction methods.
The nature and the resolution of the data both matter.
Reduce dimensionality without regard for the nature of the input and you may indeed turn a high-dimensional model into a low-dimensional one,
but you are very likely to "smear" originally distinct data points together.
That outcome is even worse than leaving the data discrete in high dimensions.
Before applying formal dimensionality reduction techniques,
the high-dimensional data also needs appropriate preprocessing.
After all, not all data arrives as clean samples.
And sometimes, during preprocessing,
the best preprocessing step turns out to be... dimensionality reduction itself,
which leads to a maddening nesting-doll loop.
All in all, reducing the dimensionality of high-dimensional data is a hugely troublesome affair.
In practice, researchers in natural language processing generally do their utmost to avoid dimension explosions in the first place,
rather than waiting for high-dimensional data to appear and only then processing it down to low dimensions.
To some extent, low-dimensional processing of high-dimensional data is a last-resort, extremely troublesome remedy.
Plenty of things get abandoned simply because they are too much trouble.
A cumbersome process is an error-prone process.
And beautiful things ought to present themselves concisely,
like Euler's formula.
For precisely that reason, Eve Carly felt that what a genius like Lin Hui meant to express was surely not reducing high-dimensional data to low-dimensional data.
If Lin Hui's point was not about manipulating high-dimensional data,
was it about doing something to the traditional vector space model?
Converting a high-dimensional vector space model into a lower-dimensional one?
That line of thinking was not bad.
But it was hardly an attempt no one had made before;
it had been tried long ago.
As early as the end of the last century, the latent semantic analysis model was proposed.
Latent semantic analysis builds on the vector space model (VSM).
Its basic idea is first to obtain a vector space representation of the text,
then to use singular value decomposition to map the high-dimensional, sparse space vectors into a low-dimensional latent semantic space.
Once low-dimensional text vectors and word vectors are obtained,
measures such as cosine similarity are used to compute the semantic similarity between texts.
The essential idea of latent semantic analysis is to remove the noise in the original matrix through dimensionality reduction and thereby improve the accuracy of the computation.
It is a good idea, but the approach is not universal,
because of the singular value decomposition the latent semantic analysis model relies on while being built:
it drives up computational complexity and transfers poorly to other settings.
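A minimal sketch of the latent semantic analysis pipeline just outlined, using scikit-learn's CountVectorizer, TruncatedSVD, and cosine similarity; the toy corpus and the two-dimensional latent space are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock prices fell sharply today",
]

# Vector space model: sparse, high-dimensional document-term matrix.
X = CountVectorizer().fit_transform(docs)

# Singular value decomposition maps it into a low-dimensional latent semantic space.
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

# Cosine similarity between the low-dimensional text vectors.
print(cosine_similarity(Z))
```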
After this method was proposed,
it was not as though nobody tried to improve it.
Also at the end of the last century,
researchers proposed the probabilistic latent semantic analysis model.
This model is grounded in probability rather than singular value decomposition.
Its main difference from latent semantic analysis is the addition of a topic layer.
The topics are trained with an expectation-maximization algorithm, yielding a probabilistic latent topic model
that is then used to predict the observed data in the text space vectors.
In probabilistic latent semantic analysis, a polysemous word is assigned to several different topics, while synonyms fall under the same topic.
In this way, the influence of synonyms and polysemous words on text similarity computation can be avoided.
However, the parameters of the probabilistic latent semantic analysis model grow linearly with the number of documents,
so it is prone to overfitting and generalizes poorly.
That situation is largely a consequence of dimension explosion:
overfitting sets in when too many parameters must be estimated from too few observations, which is exactly what a high-dimensional parameter space invites.
A model proposed to fend off dimension explosion ends up suffering one of its own.
A bit tragic, really.
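For readers who want to see the moving parts, here is a toy sketch of the probabilistic latent semantic analysis training loop, written from the standard EM updates rather than from anything in the chapter; the counts and the topic number are made up. Notice that the P(z|d) table has one row of parameters per document, which is precisely the linear growth that invites overfitting.

```python
import numpy as np

def plsa(counts: np.ndarray, n_topics: int = 2, n_iter: int = 50, seed: int = 0):
    """Fit a toy pLSA model on a document-word count matrix with EM."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_z_d = rng.random((D, n_topics))            # P(z|d): one parameter row per document
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, W))            # P(w|z): topic-word distributions
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) * P(w|z), shape (D, W, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: re-estimate both tables from expected counts n(d,w) * P(z|d,w)
        weighted = counts[:, :, None] * resp
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

counts = np.array([[4, 2, 0, 0],   # doc 0 mostly uses words 0-1
                   [3, 3, 1, 0],
                   [0, 0, 5, 4]])  # doc 2 mostly uses words 2-3
topics_per_doc, words_per_topic = plsa(counts)
print(topics_per_doc.round(2))
```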
In fact, these two were not the only models proposed.
Since then, many research teams had made attempts of varying depth at the model level.
But those models either pulled in the opposite direction and did nothing for dimensionality reduction,
or they created new problems even as they reduced dimensionality.
In short, every one of them was unworkable somewhere.
Eve Carly believed Lin Hui was a genius.
Even so, she felt it would be hard for him to find, after so many earlier attempts, a brand-new low-dimensional model that avoided dimension explosion.
She thought for a long time but could not work out what Lin Hui meant.
So she walked Lin Hui through her reasoning.
Lin Hui listened very carefully.
When she finished, he smiled and said: "You've thought of so many ways of going from high dimensions to low dimensions.
And you mentioned earlier that when a machine handles text, natural language is usually digitized so the machine can recognize it,
and then vectorized so that the attributes of those numbers can be told apart.
Since you understand that much, you should see that natural language processing problems are prone to dimension explosion largely because the original data is itself extremely high-dimensional.
That being so, why don't we tackle the problem at its source and do something about the original data?"
Hearing Lin Hui's words, Eve Carly felt as though something had touched the depths of her soul.
She said, trembling: "You mean to process the original high-dimensional data into a low-dimensional form directly?
And then build the model for semantic text similarity analysis on top of the processed low-dimensional data?"
Talking to smart people saves time and effort.
That was very nearly what Lin Hui wanted to express.
The encoding commonly used in this timeline was still one-hot encoding.
One-hot encoding had played a useful role over a fairly long stretch of history,
but it also brought plenty of trouble.
In recent years of his previous life, word vectors had basically been produced as distributed representations instead.
Distributed encoding is equivalent to projecting the original data into a lower-dimensional space.
In this way, the original data is compressed and embedded from a sparse high-dimensional space into a lower-dimensional vector space,
which is an enormous help to everything downstream.
Of course, "projecting" here is not a literal geometric projection;
the mapping has to be learned by training a neural network.
As for how to train it?
That is a purely technical matter.
In his previous life, the distributed representation of word vectors even had a proper name: word embedding.
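A minimal sketch of the contrast between one-hot encoding and a distributed (embedded) representation; the tiny vocabulary and the random embedding matrix are stand-in assumptions for a matrix that would normally come out of training.

```python
import numpy as np

vocab = ["king", "queen", "apple", "pear"]
V, d = len(vocab), 3                      # vocabulary size, embedding dimension

# One-hot: each word is a sparse V-dimensional vector with a single 1.
one_hot = np.eye(V)                       # shape (V, V); grows with the vocabulary

# Distributed representation: a dense matrix maps each word to d dimensions.
# (Values here are random; in practice they come from training a network.)
E = np.random.default_rng(0).normal(size=(V, d))
king_vec = one_hot[vocab.index("king")] @ E   # equivalent to the row lookup E[0]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors give every pair of distinct words similarity 0;
# trained dense embeddings can place related words close together.
print(cosine(one_hot[0], one_hot[1]))     # 0.0
print(cosine(E[0], E[1]))                 # some nonzero value
```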
What Lin Hui had just told Eve Carly touched on part of the principle behind distributed word vectors,
but he was not afraid of her stealing the idea.
After all, what he had described was only part of the principle,
and a partial truth is sometimes more misleading than a lie.
As for how much information Eve Carly could ultimately get from Lin Hui, that would depend on how well the two of them got along.
In fact, Lin Hui genuinely hoped the day would come when he could tell Eve Carly everything,
because there was only one situation in which she would get all of it:
if she was willing to be Lin Hui's tool person.
There is, naturally, no need to hide anything from one's own tools.
And Lin Hui urgently needed a purely academic tool person.
After all, the ideal arrangement was to point out the direction of research and harvest the papers;
dig too deep yourself and you may not grow stronger, but you will certainly go bald.
The Eve Carly in front of him was smart and perceptive.
She was young and passionate, the most important quality an excellent researcher-to-be can have.
Most important of all, she seemed very innocent and easy to fool.
Such an innocent young woman,
it hardly seemed right to trick her...
Then again, it turned out no trickery was even needed.
Things went better than Lin Hui had imagined.
He had a long, deep talk with Eve Carly.
(PS: ...a lot of text is omitted here; I will fill it in when I have time, and the added words will not count toward billing.)
Eve Carly seemed to make up her mind.
She gathered her courage and said to Lin Hui: "If, I mean, if it's possible, could I be your assistant?"
Hearing her words, Lin Hui appeared to hesitate.
Eve Carly knew the request was presumptuous.
She had come to the idea somewhat hastily,
but she believed in the choice her heart had made.
As the conversation with Lin Hui deepened, Eve Carly felt she had brushed up against a brand-new world.
On her way here, she had seemed to hear Lin Hui's silent call to her: "Are you eager to open that door?"
In this conversation, after what Lin Hui had just said, she felt she had watched him push that door open with her own eyes.
Naturally, Eve Carly was not going to pass up the chance to step through that door with him.
Seeing that Lin Hui seemed hesitant, she hurried to add: "I'm willing to sign an agreement, and I will keep your research results strictly confidential..."
She went on: "I don't need research funding either..."
For a scholar who had crossed an ocean to be this deferential,
it would seem terribly unkind of Lin Hui not to agree.
So Lin Hui promptly agreed to Eve Carly's request.
The agreement did have to be signed; Lin Hui had no wish to see his results stolen.
But research funding and proper pay would still be provided.
Employees may tell themselves they are running on pure passion,
but a boss cannot actually make employees run on passion alone.
That is a quick way to have someone drop dead on the spot.
(End of this chapter)