
Think clearly about your goals and means

Go with the trend, against the technicals, against human nature; buy when it falls, not when it rises; sell when it rises, not when it falls.
1. Market opportunities. The market as a whole, and most of the time, is chaotic, but in certain local areas and periods there is high probability and certainty; these are our opportunities. Most of these opportunities are conceptual and vague; precise opportunities do exist, but they are few.
2. A suitable profit model and purpose. Based on the opportunities the market provides, we establish our own profit model and suitable means.
3. Corresponding means. Since most means can only be conceptual and vague, we should pay attention to moderation, combination, and imprecision. As long as the ultimate goal is achieved, a conceptual implementation is sufficient; there is no need for precise implementation, which would be difficult and could lead to missed opportunities or even losses. Novices often have limited skill yet place high, sometimes even precise, demands on their investment behavior, which is the fundamental reason why many smart (emotional) people miss opportunities and make losing mistakes. Experts grasp conceptual, vague opportunities with a broad, moderate, imprecise approach, and this does not diminish their aggressiveness or decisiveness. Perhaps the difference between heroes and clowns in the stock market lies here.
Video playback link🔗 - YouTube
Simply put, entropy is a measure of the state of a material system, used to characterize its degree of disorder. The larger the entropy, the more disordered the system: its structure and motion are uncertain and irregular. Conversely, the smaller the entropy, the more ordered the system, with a definite, regular state of motion. The Chinese character for entropy (熵) literally means the quotient obtained by dividing heat by temperature. Negative entropy is a measure of how ordered, organized, and complex a material system is.
Entropy originated in physics. German physicist Rudolf Clausius first proposed the concept of entropy to represent the degree of uniformity of any form of energy distribution in space. The more uniformly the energy is distributed, the greater the entropy.
1. A drop of ink drips into clear water, turning into a cup of light blue solution.
2. Hot water is left in the air, transferring heat to the air, eventually equalizing the temperature.
Some more examples in daily life:
1. An example of an entropic force is headphone wires. We tidy up the wires and put them in a pocket, but when we take them out again they are already tangled. The invisible 'force' that tangles the wires is the entropic force: the wires tend toward more disordered configurations.
2. Elastic force is another concrete example of an entropic force: the force of a spring is an entropic force, and Hooke's Law is a manifestation of it.
3. Gravity has also been proposed to be an entropic force (a hotly debated topic).
So from a microscopic perspective, entropy reflects the degree of uncertainty of the system's state. Shannon, when describing an information system, borrowed the concept of entropy, where entropy represents the average amount of information in this information system (average degree of uncertainty).
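As a small added illustration of this 'average degree of uncertainty' (not part of the original text), Shannon entropy can be computed directly from a probability distribution:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit, maximal uncertainty
print(shannon_entropy([0.9, 0.1]))   # biased coin: ~0.47 bits, more "ordered"
print(shannon_entropy([1/6] * 6))    # fair six-sided die: ~2.58 bits
```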
We often say not to put all our eggs in one basket when investing, as it helps reduce risk. This principle also applies in information processing. Mathematically, this principle is known as the maximum entropy principle.
Let's look at a simple example of converting pinyin to Chinese characters. If the input pinyin is 'wang-xiao-bo', a language model, using limited context (say, the previous two words), can offer the two most common candidate names 王小波 and 王晓波, but it cannot determine which one is meant, even with a longer context. Of course, if the whole article is about literature, the writer 王小波 is more likely; in a discussion of cross-strait relations, the Taiwanese scholar 王晓波 is more likely. In this example we only need to combine two kinds of information: topic information and context information. There are many makeshift methods, such as handling thousands of topics separately, or averaging the weights of each kind of information, but none of them solves the problem accurately and satisfactorily; it is like patching small circles onto larger circles in the planetary-motion models we discussed before. In many applications we need to combine dozens or even hundreds of different kinds of information, and this small-circle-on-large-circle approach is clearly not feasible.
Mathematically, the most elegant method is the maximum entropy model, which is equivalent to an elliptical model of planetary motion. The term 'maximum entropy' sounds profound, but its principle is very simple and something we use every day. Simply put, it is about retaining all uncertainties and minimizing risks.
Returning to the pinyin-to-character example above, we have two pieces of information: first, according to the language model, 'wang-xiao-bo' can be converted into 王晓波 or 王小波; second, based on the topic, 王小波 is a writer, the author of 'The Golden Age', while 王晓波 is a Taiwanese scholar who studies cross-strait relations. We can therefore build a maximum entropy model that satisfies both pieces of information. The question is whether such a model exists. The renowned Hungarian mathematician and Shannon Award-winning information theorist Csiszár proved that for any set of non-contradictory information, such a maximum entropy model not only exists but is unique, and it always takes the same very simple form: an exponential function. The formula below is a maximum entropy model that predicts the next word based on the context (the previous two words) and the topic, where w3 is the word to be predicted (王晓波 or 王小波), w1 and w2 are the two preceding words (for example, '出版' and ''), giving a rough description of the context, and subject represents the topic.
P(w3 | w1, w2, subject) = e^(λ1(w1, w2, w3) + λ2(subject, w3)) / Z(w1, w2, subject)
In the formula above there are several parameters, the λ's and the normalization constant Z, which must be trained from observed data. The maximum entropy model is the most elegant statistical model in form, but one of the most complex to implement.
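To make that exponential form concrete, here is a toy sketch; the feature functions and weights below are invented for illustration and are not the ones in the original formula. Each candidate word gets a score exp(Σ λi·fi), and Z normalizes the scores into probabilities:

```python
import math

def max_ent_prob(candidates, context, topic, features, lambdas):
    """Toy maximum entropy model:
    P(w | context, topic) = exp(sum_i lambda_i * f_i(w, context, topic)) / Z(context, topic)."""
    scores = {}
    for w in candidates:
        scores[w] = math.exp(sum(lam * f(w, context, topic)
                                 for lam, f in zip(lambdas, features)))
    z = sum(scores.values())   # the normalization constant Z
    return {w: s / z for w, s in scores.items()}

# Two illustrative binary features: a context feature and a topic feature.
features = [
    lambda w, ctx, t: 1.0 if ctx[0] == '出版' and w == '王小波' else 0.0,
    lambda w, ctx, t: 1.0 if t == 'literature' and w == '王小波' else 0.0,
]
lambdas = [1.2, 0.8]   # in a real system these weights are learned from training data

print(max_ent_prob(['王小波', '王晓波'], ('出版',), 'literature', features, lambdas))
# -> roughly {'王小波': 0.88, '王晓波': 0.12}
```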
Last time we discussed how the maximum entropy model can integrate various information together. We left a question unanswered, which is how to construct the maximum entropy model. We know that all maximum entropy models are in the form of exponential functions, now we just need to determine the parameters of the exponential function, and this process is called model training.
The most primitive training method for the maximum entropy model is an iterative algorithm called Generalized Iterative Scaling (GIS). The principle of GIS is not complicated and can be roughly summarized in the following steps:
1. Assume that the model at the zeroth iteration is the uniform distribution (all probabilities equal).
2. Use the model from the Nth iteration to estimate the distribution of each information feature in the training data; if it exceeds the actual (empirical) distribution, decrease the corresponding model parameter; otherwise, increase it.
3. Repeat Step 2 until convergence.
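Here is a minimal sketch of those three steps, under simplifying assumptions (binary features, a small candidate set, and the textbook update λi ← λi + (1/C)·log(empirical_i / expected_i), where C bounds the number of active features per example):

```python
import math

def train_gis(samples, candidates, features, iterations=200):
    """Toy Generalized Iterative Scaling for a conditional max-ent model.
    samples: list of (context, observed_word); features: list of f(word, context) -> 0 or 1."""
    lambdas = [0.0] * len(features)                       # step 1: start from the uniform model
    C = max(sum(f(w, ctx) for f in features)
            for ctx, _ in samples for w in candidates) or 1.0

    # Empirical feature expectations measured on the training data.
    empirical = [sum(f(w, ctx) for ctx, w in samples) / len(samples) for f in features]

    for _ in range(iterations):                           # step 3: repeat until (near) convergence
        expected = [0.0] * len(features)                  # step 2: expectations under current model
        for ctx, _ in samples:
            scores = {w: math.exp(sum(l * f(w, ctx) for l, f in zip(lambdas, features)))
                      for w in candidates}
            z = sum(scores.values())
            for i, f in enumerate(features):
                expected[i] += sum(scores[w] / z * f(w, ctx) for w in candidates) / len(samples)
        # Raise weights of under-predicted features, lower those of over-predicted ones.
        lambdas = [l + math.log(emp / exp_) / C if emp > 0 and exp_ > 0 else l
                   for l, emp, exp_ in zip(lambdas, empirical, expected)]
    return lambdas
```

In practice GIS also requires a 'slack' feature so that every example activates exactly C features; that detail is omitted in this sketch.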
GIS was first proposed by Darroch and Ratcliff in the 1970s. However, they could not give a good explanation of the algorithm's physical meaning; that was later supplied by the mathematician Csiszár, so when people discuss this algorithm they always cite the two papers by Darroch and Ratcliff together with Csiszár's. Each iteration of GIS takes a long time, many iterations are needed to converge, and it is not very stable; it can even overflow on 64-bit computers. Therefore few people use GIS in practice; it serves mainly as a way to understand how maximum entropy models are trained.
In the 1980s, the talented twin brothers Della Pietra at IBM made two improvements to the GIS algorithm and proposed the improved iterative scaling algorithm IIS. This shortened the training time of the maximum entropy model by one to two orders of magnitude, making the maximum entropy model practical. Even so, at that time, only IBM had the conditions to use the maximum entropy model.
Because of the mathematical perfection of the maximum entropy model, it was very tempting to scientists, and many researchers tried to fit their problems with approximations resembling maximum entropy. Unexpectedly, these approximations made the model imperfect, leaving it no better than a patchwork method, and many enthusiasts abandoned the approach as a result. The first person to verify the advantages of the maximum entropy model in practical information-processing applications was Adwait Ratnaparkhi, a student of the linguist Marcus, formerly a researcher at IBM and later at Microsoft. Ratnaparkhi was clever in that he did not approximate the maximum entropy model; instead he found several natural language processing problems that suited it well and were relatively tractable computationally, such as part-of-speech tagging and syntactic parsing. Through the maximum entropy model he successfully combined contextual information, parts of speech (nouns, verbs, adjectives, etc.), and sentence components (subject-verb-object), producing the best part-of-speech tagger and syntactic parser in the world at that time. Ratnaparkhi's papers were groundbreaking, and his part-of-speech tagging system remains the best system built with a single method to this day. From his results, scientists saw the hope of using the maximum entropy model to solve complex text-processing problems.
However, the computational complexity of the maximum entropy model remained a barrier. I spent a long time in school thinking about how to reduce it. Finally, one day I told my advisor that I had found a mathematical transformation that, on top of IIS, could cut the training time of most maximum entropy models by a further two orders of magnitude. I derived it on the blackboard for over an hour, and he found no flaw in the derivation. He then went away to think for two days and told me my algorithm was correct. From then on we built some very large maximum entropy models, which were far better than patchwork methods. Even with this fast training algorithm, training a grammar model that combined contextual information, topic information, and grammatical information took three months on 20 of the fastest SUN workstations running in parallel. This illustrates how demanding the maximum entropy model is.
The maximum entropy model can be said to combine simplicity and complexity, where the form is simple but the implementation is complex. It is worth mentioning that in many Google products, such as machine translation, the maximum entropy model is directly or indirectly used.
Readers may wonder whether the Della Pietra brothers, who were the first to improve the maximum entropy training algorithm, have done nothing in recent years. After Frederick Jelinek left IBM in the early 1990s, they too left academia and went on to excel in the financial world. Together with many colleagues from IBM's speech recognition team, they joined a hedge fund that was little known at the time but is now one of the most successful in the world: Renaissance Technologies. We know that stock movements may be determined by dozens or even hundreds of factors, and the maximum entropy method can find precisely the model that satisfies thousands of different conditions simultaneously. Scientists such as the Della Pietra brothers have achieved great success using the maximum entropy model and other advanced mathematical tools for stock prediction. Since the fund was founded in 1988, its net return has averaged a whopping 34% per year; in other words, a dollar invested in the fund in 1988 would be worth about $200 today. This performance far exceeds that of the stock god Buffett's flagship company, Berkshire Hathaway, whose total return over the same period was 16-fold.
It is worth mentioning that many mathematical tools for information processing, including hidden Markov models, wavelet transforms, Bayesian networks, etc., have direct applications on Wall Street. This shows the role of mathematical models.
The Hidden Markov Model (HMM) is a statistical model used to describe a Markov process with hidden unknown parameters. The challenge lies in determining the hidden parameters of the process from observable parameters, and then using these parameters for further analysis, such as pattern recognition.
It is a statistical Markov model where the modeled system is considered a Markov process with unobserved (hidden) states.
Below is an example to illustrate:
Assume I have three different dice in my hand. The first die is the typical six-sided die (referred to as D6), with 6 faces, each face (1, 2, 3, 4, 5, 6) appearing with a probability of 1/6. The second die is a tetrahedron (referred to as D4), with each face (1, 2, 3, 4) appearing with a probability of 1/4. The third die has eight faces (referred to as D8), with each face (1, 2, 3, 4, 5, 6, 7, 8) appearing with a probability of 1/8.
[Figure: the three dice D6, D4, and D8]
Suppose we start rolling: first we pick one of the three dice, each with probability 1/3; then we roll it and get a number from 1 to 8. Repeating this process, we obtain a sequence of numbers, each between 1 and 8. For example, rolling ten times we might get the sequence: 1 6 3 5 2 7 3 5 2 4.
This sequence of numbers is called the observable state chain. However, in the Hidden Markov Model, we not only have this observable state chain but also a hidden state chain. In this example, the hidden state chain is the sequence of dice used. For instance, the hidden state chain could be: D6 D8 D8 D6 D4 D8 D6 D6 D4 D8
Generally speaking, in HMM, when referring to a Markov chain, it actually refers to the hidden state chain because there exist transition probabilities between hidden states (dice). In our example, the next state after D6 can be D4, D6, or D8 with a probability of 1/3. The transition probability for D4 and D8 to the next state being D4, D6, or D8 is also 1/3. This setup is for initial clarity, but in reality, we can freely set the transition probabilities. For example, we could define that D6 cannot be followed by D4, with the probability of being followed by D6 at 0.9, and by D8 at 0.1. This would be a new HMM.
Similarly, although there are no transition probabilities between the observable states themselves, there is a probability between each hidden state and each observable state, called the emission probability. In our example, the six-sided die (D6) has an emission probability of 1/6 for producing a 1, and likewise 1/6 for each of 2, 3, 4, 5, and 6. We can define other emission probabilities in the same way. For instance, a six-sided die tampered with by a casino might produce a 1 with a higher probability of 1/2, and each of 2, 3, 4, 5, 6 with probability 1/10.
[Figure: the hidden dice sequence and the observable roll results, with transition and emission probabilities]
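To make the dice HMM above concrete, here is one sketch of how it could be written down (using the uniform 1/3 transitions and the fair emission probabilities described in the text):

```python
# Hidden states: which die is currently in hand. Observations: the rolled face.
states = ["D4", "D6", "D8"]
faces = range(1, 9)                                             # possible results 1..8

start_prob = {s: 1 / 3 for s in states}                         # first die is picked uniformly
trans_prob = {s: {t: 1 / 3 for t in states} for s in states}    # next die is also uniform
emit_prob = {
    "D4": {f: 1 / 4 if f <= 4 else 0.0 for f in faces},
    "D6": {f: 1 / 6 if f <= 6 else 0.0 for f in faces},
    "D8": {f: 1 / 8 for f in faces},
}
```

The transition table is exactly where variants such as "D6 is never followed by D4, followed by D6 with probability 0.9 and by D8 with 0.1" would be encoded.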
Actually, for an HMM, if you know in advance all the transition probabilities between hidden states and all the emission probabilities from hidden states to observable states, simulation is quite easy. But when the HMM is applied, some of this information is often missing. Sometimes you know how many kinds of dice there are and what each die is, but not the sequence of dice that was used; sometimes you only see the results of many rolls and know nothing else. Estimating the missing information with algorithms becomes a very important problem. I will explain these algorithms in detail below.
*******
If you just want to see a simple and easy-to-understand example, you don't need to read further.
******    
A few words first: the author believes that to understand an algorithm you need to achieve two things: understand its meaning and know its form. This answer focuses mainly on the first point, which is precisely the most important one and the one many books do not mention. It is like pursuing a girl: when she says, 'You did nothing wrong!', if you only look at the surface and conclude you did nothing wrong, you have obviously misunderstood. You need to understand what she means: 'Hurry up and apologize to me!' So when you see that expression, quickly apologize, kneel down and beg for forgiveness. Mathematics is the same: if you do not understand the meaning and only stare at the formulas, you often end up puzzled. The difference is that mathematical expressions are at most a little obscure, while a girl's expressions sometimes completely contradict her intention, which is why the author has always believed that understanding a girl is much harder than understanding mathematics.
Returning to the topic, the algorithms related to the HMM model are mainly divided into three categories, each solving three types of problems:
1) Knowing how many types of dice there are (number of hidden states), what each type of dice is (transition probabilities), based on the sequence of dice rolls (observable state chain), I want to know which type of dice was rolled each time (hidden state chain).
In speech recognition this is called the decoding problem. There are actually two ways to ask it, giving two different answers; both are correct, but they mean different things. The first is to find the maximum likelihood state path: in plain terms, find the dice sequence that has the highest probability of producing the observed results. The second is not to find one dice sequence, but to find, for each throw, the probability that it came from each kind of die. For example, after seeing the results I might determine that the probability the first die was D4 is 0.5, D6 is 0.3, and D8 is 0.2. I will discuss the first approach below; the second I will not write about here, but if you are interested we can take it up in a separate question.
2) Knowing how many kinds of dice there are (the number of hidden states), what each die is (the transition probabilities), and the observed roll results (the visible state chain), I want to know the probability of getting exactly this result.
This question seems insignificant because many times the result you get corresponds to a relatively large probability. The purpose of asking this question is actually to check whether the observed results match the known model. If many results correspond to relatively small probabilities, then it indicates that our known model is likely incorrect, and someone may have secretly changed our dice.
3) Knowing how many kinds of dice there are (the number of hidden states), but not knowing what each die is, and having observed the results of many rolls (the visible state chain), I want to work out what each die is (the transition probabilities).
This question is very important because it is the most common scenario. Many times we only have visible results and do not know the parameters in the HMM model. We need to estimate these parameters from the visible results, which is a necessary step in modeling.
The problems have been stated; now let's talk about solutions. (Problem 0 was not listed above; it is only an aid for solving the problems that were.)
0. A simple problem.
Actually, this problem is not very practical. It is mentioned here first because it is helpful for the more difficult problems below.
Knowing how many dice there are, what each die is, and which die was used for each throw, calculate the probability of producing a given sequence of roll results.
[Figure: an example dice sequence together with its roll results]
The solution is simply multiplying probabilities.
[Figure: the probability written out as a product of the individual probabilities]
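A sketch of that multiplication (one reasonable reading is that the initial 1/3 pick and the 1/3 transitions are multiplied in alongside the emission probabilities; the dice sequence and rolls are the running example above):

```python
start_prob = {"D4": 1/3, "D6": 1/3, "D8": 1/3}
trans_prob = {s: {"D4": 1/3, "D6": 1/3, "D8": 1/3} for s in ("D4", "D6", "D8")}
emit_prob = {"D4": {f: 1/4 for f in range(1, 5)},
             "D6": {f: 1/6 for f in range(1, 7)},
             "D8": {f: 1/8 for f in range(1, 9)}}

def joint_probability(dice_seq, rolls):
    """P(rolls, dice_seq): multiply the start, transition and emission probabilities."""
    p = start_prob[dice_seq[0]] * emit_prob[dice_seq[0]].get(rolls[0], 0.0)
    for prev, cur, face in zip(dice_seq, dice_seq[1:], rolls[1:]):
        p *= trans_prob[prev][cur] * emit_prob[cur].get(face, 0.0)
    return p

# The dice sequence and roll results used as the running example above.
print(joint_probability(["D6", "D8", "D8", "D6", "D4", "D8", "D6", "D6", "D4", "D8"],
                        [1, 6, 3, 5, 2, 7, 3, 5, 2, 4]))
```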
1. Seeing the invisible, deciphering the dice sequence
This is the first solution I mentioned, solving the maximum likelihood path problem.
For example, I know I have three dice, a six-sided die, a four-sided die, and an eight-sided die. I also know the results of ten throws (1 6 3 5 2 7 3 5 2 4), but I don't know which die was used each time. I want to find out the most likely dice sequence.
The simplest and most brute-force method is to exhaustively list all possible dice sequences, then calculate the probability for each sequence according to the solution to the first problem. Then we just need to pick out the sequence with the highest probability. If the Markov chain is not long, this is feasible. If it is long, the number of combinations is too large, making it difficult to complete.
Another very famous algorithm is called the Viterbi algorithm. To understand this algorithm, let's first look at a few simple examples.
First, if we only throw the dice once:
[Figure: the probability of rolling a 1 with each die]

When we see the result is 1, the corresponding dice sequence with the highest probability is the D4, because the probability of D4 producing 1 is 1/4, higher than 1/6 and 1/8.
Extending this, we roll the dice twice:
[Figure: the two-roll case]
The results are 1, 6. At this point, the problem becomes complex, and we need to calculate three values, which are the maximum probabilities of the second die being D6, D4, or D8. Obviously, to obtain the maximum probability, the first die must be D4. At this point, the maximum probability for the second die to be D6 is
[Figure: the maximum probability that the second die is D6]
Similarly, we can calculate the maximum probabilities when the second die is D4 or D8. We find that the probability of the second die being D6 is the highest. To achieve this maximum probability, the first die must be D4. Therefore, the sequence with the maximum probability is D4 D6.
Extending further, we roll the dice three times:
[Figure: the three-roll case]
Similarly, we calculate the maximum probabilities for the third die being D6, D4, or D8. Once again, we find that to obtain the maximum probability, the second die must be D6. At this point, the maximum probability for the third die to be D4 is
[Figure: the maximum probability that the third die is D4]
Likewise, we can calculate the maximum probabilities when the third die is D6 or D8. We find that the probability of the third die being D4 is the highest. To achieve this maximum probability, the second die must be D6, and the first die must be D4. Therefore, the sequence with the maximum probability is D4 D6 D4.
Writing up to this point, everyone should see a pattern. Since calculating for one, two, or three rolls of the dice works, it can be extended for any number of rolls. We find that when determining the sequence with the maximum probability, we need to do the following. First, regardless of the sequence length, start by calculating the maximum probability of obtaining each die for a sequence length of 1. Then, gradually increase the length, each time adding one more position, and recalculating the maximum probability of obtaining each die at this new length. Because the maximum probabilities of obtaining each die for the previous length have already been calculated, recalculating is actually not difficult. When calculating for the final position, we will know which die has the highest probability. Then, we need to deduce the sequence corresponding to this maximum probability from the end.
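Here is a compact sketch of that procedure, i.e. the Viterbi algorithm, applied to the dice model (same assumptions as the earlier snippets: uniform initial pick and uniform 1/3 transitions):

```python
start_prob = {"D4": 1/3, "D6": 1/3, "D8": 1/3}
trans_prob = {s: {"D4": 1/3, "D6": 1/3, "D8": 1/3} for s in ("D4", "D6", "D8")}
emit_prob = {"D4": {f: 1/4 for f in range(1, 5)},
             "D6": {f: 1/6 for f in range(1, 7)},
             "D8": {f: 1/8 for f in range(1, 9)}}

def viterbi(rolls):
    """Most likely dice sequence for the observed rolls, plus its probability."""
    states = list(start_prob)
    # best[t][s]: highest probability of any dice sequence that ends in die s at position t
    best = [{s: start_prob[s] * emit_prob[s].get(rolls[0], 0.0) for s in states}]
    back = []                                   # back-pointers for reconstructing the path
    for face in rolls[1:]:
        step, pointers = {}, {}
        for s in states:
            prob, prev = max((best[-1][p] * trans_prob[p][s], p) for p in states)
            step[s] = prob * emit_prob[s].get(face, 0.0)
            pointers[s] = prev
        best.append(step)
        back.append(pointers)
    # Read the answer off from the end, following the back-pointers.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path)), best[-1][last]

print(viterbi([1, 6, 3, 5, 2, 7, 3, 5, 2, 4]))
```

On the first three rolls (1, 6, 3) this reproduces the D4 D6 D4 result worked out by hand above.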
2. Who Moved My Dice?
For example, you suspect that your six-sided die has been tampered with by the casino and possibly replaced with another six-sided die whose probability of rolling a 1 is 1/2, and whose probability of rolling each of 2, 3, 4, 5, 6 is 1/10. What do you do? The answer is simple: calculate the probability of the observed roll sequence under three normal dice, then calculate its probability under one abnormal six-sided die plus two normal dice. If the former is smaller than the latter, you need to be careful.
For example, the result of rolling the dice is:
[Figure: the observed roll results]
The probability of obtaining this result with three normal dice is the sum of the probabilities of all possible scenarios. As before, a simple brute-force method is to enumerate all possible dice sequences and compute the probability of each, but this time, instead of taking the maximum, we add up all the computed probabilities to get the total probability we want. This method still cannot handle very long dice sequences (Markov chains).
We will apply a solution similar to the one in the previous problem, except the previous problem was concerned with the maximum probability, while this problem is concerned with the sum of probabilities. The algorithm to solve this problem is called the forward algorithm.
First, if we only roll the dice once:
[Figure: the one-roll case]
The observed result is 1. The total probability of generating this result, roughly 0.18, can be calculated as follows:
P(1) = 1/3 × 1/6 + 1/3 × 1/4 + 1/3 × 1/8 ≈ 0.18
Expanding this scenario, we roll the dice twice:
[Figure: the two-roll case]
The observed results are 1, 6. The total probability of generating this result, about 0.05, can be calculated as follows:
[Figure: the forward-probability calculation for two rolls]
Extending further, we roll the dice three times:
[Figure: the three-roll case]
The observed results are 1, 6, 3. The total probability of this result, about 0.03, can be calculated as follows:
[Figure: the forward-probability calculation for three rolls]
In the same way, we compute step by step; however long the Markov chain is, we can always work through it. With the same method we can also calculate the probability that one abnormal six-sided die plus two normal dice produced this sequence, and comparing the two probabilities tells us whether the dice have been swapped.
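Here is a sketch of the forward algorithm carrying out exactly this comparison (same model assumptions as before; the tampered die uses the 1/2 and 1/10 probabilities mentioned earlier):

```python
start_prob = {"D4": 1/3, "D6": 1/3, "D8": 1/3}
trans_prob = {s: {"D4": 1/3, "D6": 1/3, "D8": 1/3} for s in ("D4", "D6", "D8")}
fair_d6   = {f: 1/6 for f in range(1, 7)}
loaded_d6 = {1: 1/2, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/10}

def forward_probability(rolls, d6):
    """Total probability of the observed rolls, summed over every possible dice sequence."""
    emit = {"D4": {f: 1/4 for f in range(1, 5)}, "D6": d6,
            "D8": {f: 1/8 for f in range(1, 9)}}
    alpha = {s: start_prob[s] * emit[s].get(rolls[0], 0.0) for s in start_prob}
    for face in rolls[1:]:
        alpha = {s: sum(alpha[p] * trans_prob[p][s] for p in alpha) * emit[s].get(face, 0.0)
                 for s in start_prob}
    return sum(alpha.values())

rolls = [1, 6, 3, 5, 2, 7, 3, 5, 2, 4]
p_normal = forward_probability(rolls, fair_d6)
p_swapped = forward_probability(rolls, loaded_d6)
print(p_normal, p_swapped)   # if p_normal is much smaller, suspect the D6 has been swapped
```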
The Hidden Markov Model (HMM) is a statistical model for processes with hidden, unknown parameters. Take a classic example: a friend in Tokyo decides the day's activity {park walk, shopping, cleaning the room} based on the weather {rainy, sunny}, and all I can see are the tweets she posts each day: "Ah, I went for a walk in the park the day before yesterday, went shopping yesterday, and cleaned the room today!" From her tweets I can infer the weather in Tokyo over those three days. In this example, the visible states are the activities and the hidden states are the weather.
Any HMM can be described by the following five elements:
1. The set of hidden states (here, the weather: rainy or sunny);
2. The set of observable states (here, the activities: park walk, shopping, cleaning);
3. The initial state probabilities (how likely each weather is on the first day);
4. The hidden-state transition probabilities (how likely today's weather is given yesterday's);
5. The emission probabilities (how likely each activity is given the weather).
Finding the most likely hidden state sequence is one of the three typical HMM problems and is usually solved with the Viterbi algorithm. The Viterbi algorithm finds the shortest path on the HMM trellis, where each edge is weighted by -log(prob), so the shortest path corresponds to the maximum probability.
Let's briefly sketch the idea; it is straightforward. Whether the first day is sunny or rainy can be calculated as follows:
[Figure: the first-day probability calculation for sunny vs. rainy]
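To make the first-day calculation concrete, here is a tiny sketch. The start and emission probabilities below are assumed purely for illustration, since the post does not give the actual numbers:

```python
# Assumed illustrative numbers; the post does not specify them.
start_prob = {"Rainy": 0.6, "Sunny": 0.4}
emit_prob = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
             "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

first_activity = "walk"   # "went for a park walk the day before yesterday"
day1 = {weather: start_prob[weather] * emit_prob[weather][first_activity]
        for weather in start_prob}
print(day1)   # {'Rainy': 0.06, 'Sunny': 0.24} -> Sunny is the more likely first day
```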
When using the Baum-Welch algorithm and the Hidden Markov Model, together with the trajectory equation of the profit-chip proportion function curve, a critical point is to identify the main settlement area of high-probability events and concentrate the main capital firepower there; the non-main settlement areas of high-probability events can, to a certain extent, be ignored, which greatly improves the effectiveness and efficiency of the capital deployed.

This is also the main reason why James Harris Simons, a world-class mathematician, an investor with a net worth exceeding $24 billion, and a philanthropist, led the founding of Renaissance Technologies LLC in 1988 and created the firm's most profitable portfolio, the Medallion Fund, which has dominated Wall Street, outmaneuvered the market strategically, and consistently outperformed the stock god Warren Buffett and the financial giant George Soros.

Renaissance Technologies LLC has averaged annual fund returns of over 70%. The quantitative model behind the Medallion Fund is based on Leonard Baum's Baum-Welch algorithm, improved and extended to explore possible profit correlations, a modification completed by the algebraist James Ax. Simons and Ax founded a fund on this basis and named it 'Medallion' to commemorate the mathematical honors they had won.
Go with the trend, against the technicals, against human nature. Combining the characteristics of individual stocks, flexibly applying model building from applied mathematics and function-level quantitative analysis, setting and adjusting parameters, and handling non-functional problems flexibly and ingeniously are all keys to victory.