# Sebastian Pokutta's Blog

Mathematics and related topics

## Fundamental principles(?) in mathematics

As said already in one of my previous posts, David Goldberg and I had a nice discussion about “fundamental concepts” in mathematics. Our definition of “fundamental” was that,

1. once seen you cannot imagine anymore having not known it beforehand and it completely changes your way of thinking and
2. a somewhat realistic approach, i.e., when subtracted from the “world of thinking” something is missing.

So here is our preliminary list of things that we came up with – in random order – and a very brief, (totally biased) meta-description of what we mean by these terms. Of course, this list is highly subjective! For each of these “fundamental concepts” the idea is to have (about) 3 applications and try to distill the main core. There will be (probably) a separate blog post for each point on the list.

1. Identity/Equality. Closely related to being isomorphic. The power of identity is so penetrating that I cannot even find a short explanation. Do I have to? (Aristotle’s first law of thought)
2. Contradiction. Showing that something cannot be true as it leads to a contradiction or inconsistency. Closely related to this is the principium tertii exclusi or law of the excluded third as this is how we often use proofs by contradiction: a statement holds under various assumptions because its negation leads to a contradiction. If you do not believe in the law of the excluded thirdthen you obtain a different logic/mathematics. In particular, in these logics, usually every proof also constitutes some form of an algorithm as existence by mere contradiction when assuming non-existence is not allowed. (see also Aristotle’s second/third law of thought)
3. Induction. Establishing a property by relying on the same property for smaller sub-objects.
4. Recursion. Somewhat dual to induction: a larger object is defined as a function of smaller objects that have been subject to the same construction themselves.
5. Fixpoint. The existence of a point that is invariant under a map. Equilibria in games.
6. Symmetry. The notion of symmetry. Take a cube – rotating it does not really change the cube.
7. Invariants. Think of the dimension of a vector space. Invariants are a powerful way to show that two things are not equal (or isomorphic).
8. Limits. What would we do without limits? The idea of hypothetically continuing a process infinitely long. Think of the definition of a derivative.
9. Diagonalization. One of my personal favorites. Constructing a member not being in a family by making sure it differs from all the members in the family at (at least) one position. Diagonalization often exploits self-references. An example is Cantor’s proof.
10. Double counting. You count a family of objects in two different ways. Then the resulting “amounts” have to be identical. Typical example is the handshaking in graphs.
11. Proof. The notion of proof is very fundamental. Once proven a statement remains true (provided constistency etc). Interestingly, it can be proven that some things cannot be proven. A good example for the latter is the existence of inaccesible cardinals which is consistent with ZFC.
12. Randomness. Randomness is an extremely fundamental concept. One of my favorite applications is probably the probabilistic method. Think about Johnson’s ${7/8}$approximation algorithm for 3SAT or the PCP theorem.
13. Algorithm. When considering a function ${f: M \rightarrow N}$ we are often not just interested in what ${f}$ computes but in particular howit can be computed. In this sense the algorithmic paradigm is an additional layer to the somewhat descriptive layer of classical mathematics.
14. Exponential growth. What we were particularly thinking about was the idea that a relative improvement bounded away from ${0}$ ensures exponentional progress. This is used regularly in different scaling algorithms such as barrier algorithms, potential reduction methods, and certain flow algorithms.
15. Information. The idea that often a critical amount of informationis necessary to decide a property. Then fooling set like arguments can show that the information is not sufficient. Prime examples include the classical example that sorting via comparison needs at least ${\Omega(n \log n)}$ comparisons, communication complexity, and query complexity.
16. Function/Relation. Mapping one set to another. In particular important when the function/relation is homogeneous, i.e., when it preserves the structure.
17. Density and approximation. The idea that a set (such as the reals) can be approximated arbitrarily well by an exponentially smaller set (such as the rationals). This exponential by polynomial approximation is also something that we are using in approximation algorithms, say, when we round the input. In this case the set of polytime solvable (rounded) instances is “dense” in the set of all instances. It can also be found in set theory when using prediction principles (such as Jensen’s diamond principle or Shelah’s Black Box) to predict functions on a stationary set by an exponentially smaller set.
18. Implicit definitions. The concept of defining something not in an explicit manner but as a solutionto a set of contraints.
19. Abstraction. The use of variables is so ingrained in us that we cannot even imagine to do serious mathematics without them. But abstraction is much more. It is the ability to see more clearly because we “abstract away” unnecessary details and we use “abstraction” to unify seemingly unrelated things.
20. Existence (in the sense that Brouwer hated). One of the keywords here is probably non-constructivism and the probabilistc methods and indirect arguments are two promiment methods in this category. This was something that Brouwer despised: the idea to infer, e.g., existence of something merely because the contrary statement would lead to a contradiction (Brouwer’s school of thought denies the tertium non datur). The probabilistic method might have been fine with him. Although that is not clear at all as on a deep level we are merely trading an existential quantifier for a random one… long story…
21. Duality. By duality we mean the wider idea of duality, i.e., for example the forall quantifier and the existential quantifier. Basically, when we talk about duality we often think about some structure describing the “space of positive statement” and a dual structure that describes the “space of negative statement”. In some sense duality is a form of a compact representation of the negation of a statement.
22. Counting. Counting is again something that penetrates every mathematical theory. My favorite application of counting is the Pigeonhole principle.
23. Hume’s principle (suggested by Hanno – see comments). Two quantities are the same if there exists a bijection between them. Somewhat related to “equality” however here we explicitly ask for the existence of a bijection. For example there are as many integers as their are rationals.
24. Infinity (suggested by Hanno – see comments). The idea that something is not finite. With the notion of infinity I feel that the notion of countably infinite and uncountably infinite is closely connected. In fact the Continuum Hypothesis (CH) is such a case. It is consistent with ZFC and asserts that the first uncountably infinite cardinal is the size of the power set of the natural numbers (essentially the reals), i.e., whether $\aleph_1 = 2^{\aleph_0}$). However in other models of set theory $\aleph_1 \neq 2^{\aleph_0}$ is possible, by adding, e.g., Cohen reals.

Written by Sebastian

January 2, 2012 at 8:49 pm

## indirect proofs: contrapositives vs. proofs by contradiction

Last week I read a rather interesting discussion on contrapositives vs. proofs by contradiction as part of Timothy Gowers’ Cambridge Math Tripos, mathoverflow, and Terry Tao’s blog. At first sight these two concepts, the contrapositive and the reductio ad absurdum (proof by contradiction) might appear to be very similar. Suppose we want to prove $A \Rightarrow B$ for some statements $A$ and $B$. Then this is equivalent to showing $\neg B \Rightarrow \neg A$ (at least in classical logic). The latter is the contrapositive and often it is easier to go with the contrapositive. In the case of the indirect proof we do something similar, however there is a slight difference: we assume that $A \wedge \neg B$ and deduce a contradiction. So what the big deal? The difference seems to be more of a formal character. However this is not true. In the first case we remain in the space of “true statements”, i.e., any deduction from $\neg B$ is a consequence of $A$ that we can use later “outside of the proof”. In the case of the proof by contradiction we move in a “contradictory space” (as $A \wedge \neg B$  is contradictory) and everything that we derive in this space is potentially garbage. Its sole purpose is to derive a contradiction however as we work in a contradictory system we cannot guarantee that the statement derived within the proof are true statement; in fact they are likely not be true as they should result in a final contradiction.

Interestingly a similar phenomenon is known for cutting-plane procedures or cutting-plane proof systems (both terms essentially mean the same thing; it is just a different perspective) . Let me give you an ultra-brief introduction of cutting-plane procedures. Given a polytope $P \subseteq [0,1]^n$ we are often interested in the integral hull of that polytope which is defined to be $P_I = \text{conv}(P \cap \{0,1\}^n)$. A cutting-plane procedure $M$ is now a map that assigns to $P$ a new polytope $M(P)$ such that $P_I \subseteq M(P) \subseteq P$ and $M(P)$ hopefully provides a tighter approximation of $P_I$. So what the cutting-plane procedure does, is to derive new valid inequalities for $P_I$ by examining $P$ and usually the derivation is computationally bounded (otherwise we could just guess the integral hull); the exact technical details are not too important at this point.

Now any well-defined cutting-plane procedure $M$ satisfies $M(P \cap \{cx \le \delta\}) \subseteq M(P) \cap \{cx \le \delta\}$. Or put differently, giving the cutting-plane procedure access to an additional inequality can potentially increase the strength of the procedure as compared to let it work on $P$ and then intersect with the half-space $cx \leq \delta$ afterwards. Now what does this have to do with indirect proofs and contrapositives? The connection arises from the following trivial insight: an inequality $cx \leq \delta$ (with integral coefficients and right-hand side) is valid for $P_I$ if and only if $(P \cap \{cx \geq \delta + 1 \})_I = \emptyset$. In particular a sufficient condition for the validity of $cx \leq \delta$ for $P_I$ is $M(P \cap \{cx \geq \delta + 1 \}) = \emptyset$. The key point is that $M(P \cap \{cx \geq \delta + 1 \})$ can be strictly contained in $M(P) \cap \{cx \geq \delta + 1 \}$. The first one is the indirect proof, whereas the second one is the contrapositive, as we verify the validity of $cx \leq \delta$ by testing if  $M(P) \cap \{cx \geq \delta + 1 \} = \emptyset$. However we do not use the inequality $cx \leq \delta$ in the cutting-plane procedure, i.e., the procedure has no a priori knowledge about what to prove, whereas in the case of indirect proofs, we add the negation of $cx \leq \delta$ and the procedure can use this information.

So how much do can you gain? Suppose we have a graph $G= (V,E)$ and we consider the associated fractional stable set polytope $FSTAB(G) = \{ x \in [0,1]^V \mid x_u + x_v \leq 1 \ \forall\; (u,v) \in E\}$. Typically (there are a few exceptions), for a classical cutting-plane procedure the derivation of clique inequalities is involved and we need $\Omega(\log k)$ applications of the cutting-plane procedure to derive the clique inequalities for a clique $C$ of size $k$, i.e., $\sum_{v \in C} x_v \leq 1$. However an indirect proof of the clique inequalities takes only a single application of the most basic cutting-plane operator: Consider

$FSTAB(G) \cap \{\sum_{v \in C} x_v \geq 2\} =: Q$

for a clique $C$. It is not hard to see that $Q \cap \{x_v = 1\} = \emptyset$ for all $v \in C$. A basic derivation that any sensible cutting-plane operator $M$ supports is to derive that $x_i \leq 0$, i.e., $x_i \leq 0$ is valid for $M(Q)$ whenever $x_i < 1$ is valid for $P$. Therefore we obtain that $M(Q) \subseteq \bigcap_{v \in C} \{ x_v = 0\}$. On the other hand $M(Q) \subseteq \{\sum_{v \in C} x_v \geq 2\}$ and so $M(Q) = \emptyset$ holds and thus the indirect proof derived $\sum_{v \in C} x_v \leq 1$.

So what one can see from this example is that indirect proofs (at least in the context of cutting-plane proof systems) can derive strong valid inequalities in rather few rounds and outperform their direct counterpart drastically (constant number of rounds vs. log(n) rounds). However a priori knowledge of what we want to prove is needed in order to apply the indirect proof paradigm. This makes it hard to exploit the power of indirect proofs in cutting-plane algorithms. After all, you need to know the “derivation” before you did the actual “derivation”. Nonetheless, in some cases we can use indirect proofs by guessing good candidates for strong valid inequalities and then verify their validity using an indirect proof.

Written by Sebastian

October 18, 2011 at 8:01 pm

## Steve Jobs 1955-2011

Thank you for providing us not just with different tools but with a different way to think.

You inspired all of us.

We will miss you very much!

Written by Sebastian

October 6, 2011 at 8:41 am

Posted in things that make me think

Tagged with ,

## Cambridge Mathematical Tripos

with one comment

Timothy Gowers just started a new series of blog posts for first-year mathematics students. While the blog posts will be centered around Cambridge’s courses I am pretty sure that the discussed topics and hints will be valuable to other students as well. In fact, what I find most impressive is the goal of the series: to teach people how to do mathematics! We all learned what mathematics is and results have been presented to us in a nice, cleaned-up fashion. However only very few of us were taught how to solve/approach problems – most of us learned it the hard way at some point. It is as if you go to a restaurant to get great food: this does not teach you how to cook yourself! In particular it does not teach you that the nice result is a product of quite a mess in the kitchen. When doing math, everybody will reach her or his limit sooner or later (as compared to math in school which was easy for many math students) and it is precisely this point in time, when students start to doubt their own potential. In fact some kind of a bias is bound to take place: every math problem that can be solved is “easy” and every problem that is not solved is a small personal crisis – “am I good enough”? As you did not see the mess in the kitchen, one might think that things come easy in a nice form or not at all. In the end there is no positive feedback available anymore, only negative feedback.

I am very much looking forward to this series and I am sure that Tim has some valuable insights to share!

Written by Sebastian

September 25, 2011 at 9:15 pm

## Does IBM care for CPLEX at all?

I just got an email from IBM asking me to participate in the Academic Initiative Survey. I participated with the aim to address a few shortcomings with respect to IBM’s support for, and interest in their optimization products, e.g., that it is quite a hassle to download cplex as one has to go through an uncountably infinite number of pages before one actually reaches the download page – if one reaches it at all. Also there were a few other things that I wanted to address.

But guess what. One of the first question was to what academic field one belongs to. Operations Research? Mathematics? nada. That was already a bad omen. And in fact. There were only two references to optimization at all “Linear Programming” and “Integer Programming” in the courses that I teach / want to teach (out of a gazzillion listed including a lot of voodoo stuff). Effectively, optimization and the optimization products were virtually not present at all. Cplex, OPL, OPL Studio and none of the other optimization tools were even mentioned.

This apparent lack of interest raises serious questions about IBM’s future plans for cplex and their optimization products. In particular, questions about continuity and support. Who knows… 10 years ago I would have been really scared as cplex was the strongest industrial strength solver and therefore choice number one in many applications – however times have changed and fortunately there are alternatives now.

Written by Sebastian

March 11, 2011 at 5:02 pm

## Long time no see

with one comment

It has been quite a while that I wrote my last blog post; the last one that really counts (at least to me) was back in February. As pointed out at some point it was not that I was lacking something to write about but more that I did not want to “touch” certain topics. That in turn made me wonder what a blog is good for when, in fact, one is still concerned about whether to write about certain topics. So I got the feeling that in the end, all this web 2.0 authenticity, all this being really open, direct, authentic, etc. is nothing else but a (self-) deception. On the other hand, I also did not feel like writing about yet another conference. I have to admit that I have been to some really crappy conferences lately and since I did not have anything positive to say I preferred to not say anything at all. There were a few notable exceptions, e.g., the MIP or IPCO. Another thing that bothered me (and still does) is the dilution of real information with non-sense. In fact I have the feeling that the signal-to-noise ratio considerably dropped over the last two years and I didn’t want to add to this further. This feeling of over-stimulation with web 2.0 noise seems to be a global trend (at least this is my perception). Many people gave up their blogs or have been somewhat neglecting them. Also maintaining a blog with say weekly posts (apart from passing on a few links or announcements) takes up a lot of time. Time that arguably could be better invested into doing research and writing papers.

Despite those issues or concerns I do believe that the web with all its possibilities can really enhance the way we do science. As with all new technologies one has to find a modus operandi that provides positive utility. In principle the web can provide an information democracy/diversification, however any “democratic endeavor” on the web has a huge enemy. The Matthew effect (or commonly known as “more gains more”). This term, coined by R.K. Merton, derives its name from the following biblical Gospel of Matthew (see also wikipedia):

For to all those who have, more will be given, and they will have an abundance; but from those who have nothing, even what they have will be taken away. — Matthew 25:29, New Revised Standard Version

In principle it states that the “rich get richer” while the “poor get poorer”. If we think of the different social networks (facebook, myspace, friendster) it refers to the effect that the one that has the largest user basis is going to attract more people than the one with a smaller one. In the next “round” this effect is then even more pronounced until the smaller competitor virtually ceases to exist. In the real-world this effect is often limited due to various kinds of “friction”. There might be geographic limitations, cultural barriers etc., that do wash out the advantage of the larger one so that the compounding nature of the effect is slowed down or non-existent (this hold even true in the highly globalized world we live in). That is the reason why dry cleaners, bakeries, and other forms of local business are not outperformed by globalized companies (ok, some are). In the context of the internet however there is often no inhibitor to the Matthew effect. It often translates into some type of preferential attachment although with the difference that the overall user basis is limited so that the gain of one party is the loss of another (preferential attachment processes are usually not zero-sum).

So what does this mean in the context of blogs? Blog reading is to a certain extent zero-sum. There is some limited amount of time that we are willing to spend reading blogs. Those with a large user basis will have more active discussions and move higher in the priority list for reading. In the end the smaller ones might only have a handful of readers making it hard to justify the amount of time spent writing the posts. Downscaling the frequency of posts might even pronounce the effect as it might be perceived as inactivity. One way out of this dilemma could be any form of joining the smaller units to larger ones, i.e., either “digesting” several blogs to a larger one or alternatively “shared blogging”. I haven’t made up my mind yet what (if!) I am going to do about this. But I guess, in the end some type of consolidation is inevitable.

Having bothered you with this abstruse mixture of folklore, economics, and internet, I actually intended to write about something else but somewhat related today: About deciding whether and when to dump a project. This problem is very much inspired by my previous experiences as a consultant and recent decisions about academic projects. More precisely, suppose that you have a project and you have an estimate for the overall time of the project. At some point you want to review the progress and based on what you see at this point you want to make a call whether or not you will abandon the project. The longer you wait with your review the better your information is that you gain from the review. On the other hand you potentially wasted too much time and resources to increase the confidence in your decision. In fact it might make even sense you not start a project at all. Suppose that you have an a priori estimate for the probability of success of your project, say p. Further let r(t) denote our function of erring, i.e., r(0) = 1/2 and r(1) = 0 which means that at time t= 0 we do not have any information yet and so we can only guess leading to guessing wrong with probability 50% and at time t = 1 we have perfect information. Let t denote the point in time at which we review the project (as a fraction of the overall time, here assumed to be 1). We have four cases to consider (one might opt for a different payoff function; the following one resembles my particular choice):

1. The project is going to be successful and at the point of reviewing we guessed right, i.e., we went through with it. In this case the benefit is s. This happens with probability (1-r(t)) p and expected payoff for this scenario is: (1-r(t)) p s. [alternatively one could consider the benefit s – t; or something else]
2. The project is going to be successful and at the point of reviewing we guessed wrong, i.e., we dropped the project. In this case the benefit is – (t + s), i.e., we lose our investment up to that point (here with unit value) and the overall benefit. Probability is r(t) p and expected payoff – r(t) p (t+s).
3. The project is going to fail and we guessed right: Benefit -t, i.e., the investment so far. Expected payoff – (1-r(t)) (1-p) t.
4. The project is going to fail and we guessed wrong, i.e., we went through with it: Benefit -T, were T is some cost for this scenario. Expected payoff – r(t) (1-p) T.

All in all we have the following expected overall payoff as a function of t:

$\mathbb E(t) = -[(1-r(t))p (-s) + (1-r(t))(1-p) t + r(t)p(t+s) + r(t)(1-p) T]$

Next we have to define our function which models our increase in confidence. I opted for a function that gains information in a logarithm fashion, i.e., in the beginning we gain confidence fast and then we have a tailing-off effect:

$r_k(t) := \frac{1}{2} \frac{(1 + \log(k)}{(-\log(k) + \log(1 + k)))} - \frac{\log(k + t)}{(2 (-\log(k) + \log(1 + k)))}$

The parameter k can be understood as the rate of learning. For example for k = 0.01 it looks like this:

Assuming s = 1 and T = 1, i.e., the payoffs are basically the invested time and p = 30%, the expected payoff as function of the time of review t looks like this (payoff: blue line, error rate: red line):

The maximum payoff is reached for a review after roughly 20% of the estimated overall time. However it is still negative. This suggests that we do not learn fast enough to perform a well-informed decision. For example for k = 0.001, the situation looks different:

The optimal point for a review is after 14% of the estimated project time. Having once estimated your rate of learning, one can also determine which projects one should not get involved with at all. For k = 0.001 this is the case when the probability of success p is less than roughly 27%.

Although this model is somewhat very simple it provides some nice qualitative (and partly quantitative) insights. For example that there are indeed projects that you should not even get involved with; this is somewhat clear from intuition but I was surprised that the probability of success of those is still quite high. Also, when over time your rate of learning increases (due to experience with other projects) you can get involved with more risky endeavors because your higher review confidence allows you to purge more effectively. For example when k goes down to, say, k = 0.00001 (which is not unrealistic as in this case shortly after the beginning of the project you would only err with around 20%) you could get involved with projects that only have a probability of success of 19%.

And no complaints about the abrupt ending – I consumed my allocated blogging time.

Written by Sebastian

September 6, 2010 at 5:28 am

with one comment

A paper on solving combinatorial optimization problems with biological organisms made it into Science. The paper “Rules for Biologically Inspired Adaptive Network Design” by Atsushi Tero, Seiji Takagi, Tetsu Saigusa, Kentaro Ito, Dan P. Bebber, Mark D. Fricker, Kenji Yumiki, Ryo Kobayashi, and Toshiyuki Nakagaki explains that under certain circumstances, a certain microbe can reconstruct a version of the Tokyo rail system. From the abstract:

Transport networks are ubiquitous in both social and biological systems. Robust network performance involves a complex trade-off involving cost, transport efficiency, and fault tolerance. Biological networks have been honed by many cycles of evolutionary selectionpressure and are likely to yield reasonable solutions to such combinatorial optimization problems. Furthermore, they develop without centralized control and may represent a readily scalable solution for growing networks in general. We show that the slime mold Physarum polycephalum forms networks with comparable efficiency, fault tolerance, and cost to those of real-world infrastructure networks—in this case, the Tokyo rail system. The core mechanisms needed for adaptive network formation can be captured in a biologically inspired mathematical model that may be useful to guide network construction in other domains.

Whereas the authors do not claim any real (i.e., mathematical) optimality etc. and probably the solution quality is similar to solutions that one would obtain with simulated annealing or ant colony optimization, i.e., sub-optimal solutions in many cases, a well-read website, Spiegel Online belonging to the Spiegel magazine run a story about the article a couple of days ago giving the whole thing a slightly different touch – this reminded me very much of a blog post of Mike. The article starts with (in German – I will provide a hopefully faithful translation afterwards):

Was Ingenieure mit großem Aufwand versuchen, scheint für den Schleimpilz Physarum polycephalum eine Kleinigkeit: Verkehrswege möglichst effizient zu bauen.
Translation: What Engineers try to achieve with a lot of effort, is a trivial task for the slime mold Physarum polycephalum: The construction of efficient (traffic-) networks.

Here, the witty author already gives a first indication and an executive-style summary of his opinion. And just in case you haven’t got the point that engineers and in particular discrete optimization people are superfluous, money-eating artifacts of times where excess capital had to be burnt in order to keep inflation low, the author stresses his point even further by making clear that the microbe is really dumb.

Er gehört zu den ältesten Lebensformen – und Intelligenz würde man dem schleimigen Winzling wohl kaum zusprechen.
Translation: It (the microbe) belongs to one of the oldest life forms on earth and this little slimy thing is far from being intelligent.

If one takes a closer look at the article and especially the graph of the underlying network that the magic microbes reconstructed, the graph is almost trivial (from Fresh Photos):

If you checkout the spiegel online article which also provides pictures (I didn’t want to include those for obvious reasons), the time scale shows that obtaining the final solution took the magic microbe 26 hours. Needless to say that probably trivial branch-and-bound would do the job in less than 10 secs.

In times where it is chic to suck at mathematics, a stupid microbe outperforming a whole branch of engineering and mathematics provides perfect justification for dismissing mathematics as the most decadent form of barrenness. And what is really questionable, is that in times where mathematical illiteracy is responsible to a large extent (together with greed and dismissal of the principle of universality) for the latest financial meltdown there are still authors that somewhat give the impression that the underlying mathematical problems are trivial and provide a jaded view.

We need a wikipedia page for that, distinguishing folklore optimization (yeah, every pseudo-consulting outlet does optimization) and mathematical optimization. Something that really distilles the point. Unfortunately, this page here is not really informative for the non-expert!!

Written by Sebastian

February 2, 2010 at 10:07 am