I. Performance Ratio for Approximation Algorithms

Suppose we are working on an optimization problem in which each potential solution has a positive cost, and we want to find one of near-optimal cost. When the objective is to minimize cost, a near-optimal solution costs at least as much as the optimal. When the objective is to maximize profit, a near-optimal solution earns at most as much as the optimal.

The standard measure in the Theory of Algorithms for the quality of an approximation algorithm A is the ratio

    rho(n) = max ( cost(A) / cost(OPT) ),

where the max is over all instances of size n, when we are minimizing; otherwise, it is

    rho(n) = max ( profit(OPT) / profit(A) ).

II. VERTEX COVER

Given a graph G = (V,E), find a vertex cover C of minimum size; a vertex cover is a set C of vertices such that for each edge (u,v) \in E, at least one of u and v is in C.

Heuristics? A simple greedy scheme:

    while E not empty,
        pick an arbitrary edge e = (u,v)
        add both u and v to C
        delete all edges from E incident to u or v

Theorem: The greedy Vertex Cover has size at most twice the optimal.

Analysis: Let A be the set of edges picked by the greedy. Since no two edges in A share a vertex, each of them requires a separate vertex in OPT to cover it. So, |OPT| >= |A|. On the other hand, our greedy cover has size |C| = 2|A|. Hence |C| <= 2|OPT|. QED.

Interestingly, the more natural greedy of repeatedly picking the vertex of max degree does worse: its approximation ratio is only Theta(log n), so it can be off from the optimal by a logarithmic factor.

III. MAKESPAN: Load balancing

Problem: Given a set of m machines M1, ..., Mm, and a set of n jobs, where job j needs t_j time for processing, the goal is to schedule the jobs on these machines so that all the jobs are finished as soon as possible. That is, find the minimum time T by which all jobs can be executed collectively.

Let Ai be the set of jobs assigned to machine Mi. Then the machine Mi needs a total time of

    T_i = \sum_{j \in Ai} t_j

This is called the load on machine Mi. We wish to minimize T = max_i T_i, which is called the makespan.

2.
The decision problem (is there a schedule with makespan at most T?) is NP-complete: it is in NP because, given an assignment of jobs to machines, we can check in polynomial time whether its makespan is at most T. We can also reduce Subset Sum to it, already for scheduling on two machines.

3. We show that the following simple greedy method gives a good approximation. The algorithm makes one pass over the jobs in ANY order, and assigns the next job j to the machine with the lightest current load.

    for j = 1, ..., n
        Let Mi be the machine with the minimum load T_k among all machines Mk
        Assign j to Mi
        Ai <- Ai + j
        T_i <- T_i + t_j

For instance, given 6 jobs with sizes 2, 3, 4, 6, 2, 3, and 3 machines, the algorithm gives (2, 6), (3, 2), (4, 3), for a makespan of 8. The optimal makespan is 7: (3, 4), (6), (2, 2, 3).

4. Analysis: Let T be the makespan produced by the algorithm, and T* the optimal. We do not know T*. Nevertheless, we can use a lower bound for T* to achieve an approximation guarantee.

A simple lower bound comes from the following trivial observation: since there is a total of \sum_j t_j amount of work to be done by the m machines collectively, the makespan cannot be less than the average load. Thus,

    T* >= (1/m) \sum_j t_j        (*)

Unfortunately, this lower bound by itself can be too weak. For instance, if we have one extremely long job, then the best we can do is to put it by itself on a single machine, but the makespan still has to be at least as long as this job. That schedule is optimal, but the lower bound (*) may be way off if the remaining jobs are all very small. This suggests another lower bound:

    T* >= max_j t_j               (**)

5. Theorem: The greedy makespan algorithm produces an assignment with makespan T <= 2T*.

Proof. Look at the machine that attains the max load (makespan). Let this be Mi. Let j be the last job assigned to Mi.

Key Observation: When j was assigned, Mi was the machine with the least load! Its load just before the assignment of j was T_i - t_j. Since this was the lightest load at that moment, and loads never decrease, *every* other machine has load at least T_i - t_j.
Thus, adding up all these loads, we have

    \sum_k T_k >= m (T_i - t_j)

equivalently,

    T_i - t_j <= (1/m) \sum_k T_k

But the sum on the right is just the sum of all the job lengths, since each job is assigned to exactly one machine. So the quantity on the RHS is just the lower bound from (*). Thus,

    T_i - t_j <= T*

Now we account for the last job. Here we simply use the inequality (**), which says that t_j <= T*. Thus,

    T_i = (T_i - t_j) + t_j <= 2T*.

QED.

6. It is easy to construct examples where this bound is essentially tight: with m machines, m(m-1) jobs of length 1 followed by a single job of length m make the greedy produce makespan 2m-1, while the optimal makespan is m.

There is a better approximation: if we first sort the jobs in decreasing order of length, and then assign them using the greedy strategy, one can show an approximation factor of 3/2.

IV. SET COVER.

There is a set U of n elements, and a list S1, ..., Sm of m subsets of U. A Set Cover is a collection of these sets whose union is U. Each set Si has a cost (or weight) w_i, and our goal is to find a Set Cover of minimum weight.

Imagine U is a set of n baseball cards you wish to collect. The market offers bundles S1, ..., Sm that are subsets of U, at prices w1, ..., wm. You wish to collect all the cards in U at the minimum possible total cost. This is the set cover problem.

This is a fundamental NP-hard problem (its decision version is NP-complete). We will develop a simple greedy algorithm for it, although the analysis is not trivial.

The first idea for the greedy is to repeatedly choose the largest set in the list. But this may not be good if the next set consists mostly of items already covered (procured). So, a better idea may be to choose the next set that covers the most items currently not covered! With weights, this becomes: choose the set with the smallest weight per newly covered item.

7. Greedy-Set-Cover

    Initialize R = U;    // the set of remaining (uncovered) items
    while R not empty
        Select set Si that minimizes wi / |Si \cap R|    // only sets with Si \cap R nonempty
        Delete elements of Si from R
    end
    return selected sets

8. Analysis of the algorithm

The sets chosen by the algorithm clearly form a Set Cover. The main question is how much larger the weight of this cover is than the optimal w*.
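The Greedy-Set-Cover procedure of step 7 can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not a definitive implementation: the function name greedy_set_cover, the representation of the sets as a Python list of sets with a parallel list of weights, and the small instance at the bottom are all made up for illustration. Ties are broken in favor of the set that appears first in the list.

```python
def greedy_set_cover(universe, subsets, weights):
    """Return the indices of the sets picked by the weighted greedy rule.

    Assumed (hypothetical) input format: `subsets` is a list of Python sets
    and `weights[i]` is the positive weight of `subsets[i]`.
    """
    remaining = set(universe)          # R: items not yet covered
    chosen = []
    while remaining:
        best, best_ratio = None, float("inf")
        for i, s in enumerate(subsets):
            newly = len(s & remaining)  # |Si \cap R|
            if newly == 0:
                continue                # Si covers nothing new; skip it
            ratio = weights[i] / newly  # cost per newly covered item
            if ratio < best_ratio:      # strict "<": first minimizer wins ties
                best, best_ratio = i, ratio
        if best is None:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        remaining -= subsets[best]      # delete the elements of Si from R
    return chosen


# Illustrative instance (an assumption, not taken from the notes).
universe = {1, 2, 3, 4, 5, 6}
subsets = [{1, 2, 3, 4}, {4, 5, 6}, {1, 2}, {3, 4}, {5, 6}]
weights = [3, 3, 1, 1, 1]

chosen = greedy_set_cover(universe, subsets, weights)
total_weight = sum(weights[i] for i in chosen)
print(chosen, total_weight)  # [2, 3, 4] 3
```

On this instance the three cheap two-element sets each cost 1/2 per new item, so the greedy picks them in order and never touches the two heavier sets.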
As in the load balancing problem, we will need a lower bound for the optimal to compare against, but unlike that problem, finding a good lower bound is more subtle and non-trivial.

Let us look at our greedy "heuristic" and investigate the intuitive meaning of the ratio

    wi / |Si \cap R|

We can think of this as the "cost paid" for covering each new item. Suppose we "charge" this cost c_s to each of the items s newly covered. Note that each item is charged a cost ONLY once: when it is covered for the first time.

For bookkeeping purposes only, let's add this line to the Greedy Algorithm when Si is added, to account for this cost:

    Define c_s = wi / |Si \cap R| for all s \in Si \cap R

First note that when Si is added, its weight is distributed evenly among the newly covered elements. Thus these costs simply account for the weights of the sets in the cover.

Lemma: If C is the greedy set cover, then

    \sum_{Si \in C} wi = \sum_{s \in U} c_s

9. The key to the proof is to ask: fix a particular set Sk. How much total cost c_s can the elements of Sk incur? In other words, compared to wk, how large is \sum_{s \in Sk} c_s? We get a bound for any set Sk, even those not selected by the greedy.

Lemma: For any set Sk,

    \sum_{s \in Sk} c_s <= wk * H(|Sk|),

where H() is the Harmonic Number: H(n) = 1 + 1/2 + 1/3 + ... + 1/n.

Proof. To simplify the notation, let's assume that the items of Sk are the first d = |Sk| items: s1, s2, ..., sd. Further, assume that these items are labeled in the order in which they are assigned a cost c by the greedy. (This is just a renaming of items.)

Now consider the iteration in which item sj is covered by the greedy, for some j <= d. At the start of the iteration, sj, sj+1, ..., sd \in R (because of our labeling). Thus, |Sk \cap R| is at least (d-j+1). Therefore, the cost per newly covered item that Sk would offer in this iteration is

    wk / |Sk \cap R| <= wk / (d-j+1)

This is not necessarily an equality because several elements sj', sj'+1, ..., sj, sj+1, ... may get covered in the same step, in which case more than d-j+1 items of Sk are still in R at the start of the iteration.
Suppose the set chosen by the Greedy for this iteration is Si; then the average cost of Si is at most the average cost of Sk. Key: it is the average cost of Si that gets assigned to sj. So, we have

    c_{sj} = wi / |Si \cap R| <= wk / |Sk \cap R| <= wk / (d-j+1)

We now add up these costs over all items of Sk:

    \sum_{s \in Sk} c_s = \sum_{j=1}^d c_{sj}
                        <= \sum_{j=1}^d wk / (d-j+1)
                        =  wk (1/d + 1/(d-1) + ... + 1/2 + 1)
                        =  wk * H(d).

10. Let d* = \max_i |Si| (the size of the largest set). Now comes the final part:

Theorem: The greedy set cover has weight at most H(d*) times w*.

Proof. Let C* be the optimal set cover, so w* = \sum_{Si \in C*} wi. For each of these sets, the previous result (together with H(|Si|) <= H(d*)) implies that

    wi >= (1/H(d*)) \sum_{s \in Si} c_s

But these sets form a set cover, so every item of U appears in at least one of them:

    \sum_{Si \in C*} \sum_{s \in Si} c_s >= \sum_{s \in U} c_s

So, we now have

    w* =  \sum_{Si \in C*} wi
       >= \sum_{Si \in C*} (1/H(d*)) \sum_{s \in Si} c_s
       >= (1/H(d*)) \sum_{s \in U} c_s
       =  (1/H(d*)) \sum_{Si \in C} wi
       =  (1/H(d*)) w_C,

where w_C is the weight of the greedy cover C. Hence w_C <= H(d*) w*. QED.
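The charging argument can be checked numerically on a small instance. The Python sketch below records the costs c_s while running the greedy, then tests the two lemmas and the H(d*) bound, with w* found by brute force over all sub-collections. Everything named here (greedy_with_charges, optimal_weight, harmonic, and the instance with four singletons of weight 1/i plus one set of weight 1.1) is an assumption chosen for illustration, and the code assumes the given sets do cover U.

```python
from itertools import combinations


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def greedy_with_charges(universe, subsets, weights):
    """Weighted greedy set cover that also records the charge c_s per item."""
    remaining = set(universe)
    chosen, charge = [], {}
    while remaining:
        # Only sets that cover at least one new item are candidates.
        candidates = [k for k in range(len(subsets)) if subsets[k] & remaining]
        i = min(candidates,
                key=lambda k: weights[k] / len(subsets[k] & remaining))
        newly = subsets[i] & remaining
        for s in newly:
            charge[s] = weights[i] / len(newly)  # c_s = wi / |Si \cap R|
        chosen.append(i)
        remaining -= newly
    return chosen, charge


def optimal_weight(universe, subsets, weights):
    """Brute-force w*; exponential, so only for tiny instances."""
    best = float("inf")
    for r in range(1, len(subsets) + 1):
        for combo in combinations(range(len(subsets)), r):
            if set().union(*(subsets[i] for i in combo)) >= set(universe):
                best = min(best, sum(weights[i] for i in combo))
    return best


# Illustrative instance: the greedy prefers the singletons, one at a time.
universe = {1, 2, 3, 4}
subsets = [{1, 2, 3, 4}, {1}, {2}, {3}, {4}]
weights = [1.1, 1.0, 1 / 2, 1 / 3, 1 / 4]

chosen, charge = greedy_with_charges(universe, subsets, weights)
w_greedy = sum(weights[i] for i in chosen)
w_star = optimal_weight(universe, subsets, weights)
d_star = max(len(s) for s in subsets)

eps = 1e-9  # tolerance for floating-point comparisons
# First lemma: the charges add up to the weight of the greedy cover.
charges_match = abs(sum(charge.values()) - w_greedy) < eps
# Second lemma: for every Sk, the charges inside Sk total at most wk * H(|Sk|).
lemma_holds = all(
    sum(charge[s] for s in subsets[k]) <= weights[k] * harmonic(len(subsets[k])) + eps
    for k in range(len(subsets))
)
# Theorem: w_C <= H(d*) * w*.
bound_holds = w_greedy <= harmonic(d_star) * w_star + eps
print(chosen, charges_match, lemma_holds, bound_holds)
```

Here the greedy pays H(4) in total while the single large set costs 1.1, so the ratio comes close to H(d*) without exceeding it.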