%% ToC#234  v002/a010.tex    Klivans - Shpilka


\documentclass{toc}

\tocdetails{%
volume=2,number=10,year=2006,firstpage=185,received={August 31,2005}, published={September 28, 2006}}

\runningauthor{A. R. Klivans and A. Shpilka}
\runningtitle{Learning Arithmetic Circuits}
\copyrightauthor{Adam R. Klivans and Amir Shpilka}

\newcommand{\rk}{\mathrm{rank}}
\newcommand{\spa}{\mathrm{span}}

\bibliographystyle{tocplain}

\begin{document}
\begin{frontmatter}

\title{Learning Restricted Models of\\ Arithmetic Circuits}
\tocpdftitle{Learning Restricted Models of Arithmetic Circuits}
\tocpdfauthor{Adam R. Klivans and Amir Shpilka}

\author[Adam]{Adam R. Klivans\thanks{This research was supported by an NSF Mathematical Sciences Postdoctoral Fellowship.}}
\author[Amir]{Amir Shpilka}

\tockeywords{learning, exact learning, arithmetic circuits, partial derivatives,  multiplicity automata}

\begin{abstract}
We present a polynomial time algorithm for learning a large class of
algebraic models of computation. We show that any arithmetic circuit
whose partial derivatives induce a low-dimensional vector space is
exactly learnable from membership and equivalence queries. As a
consequence, we obtain polynomial-time algorithms for learning
restricted algebraic branching programs as well as noncommutative
set-multilinear arithmetic formulae. In addition, we observe that the
algorithms of Bergadano et al.\ (1996) and Beimel et al.\ (2000)
can be used to learn depth-3 set-multilinear
arithmetic circuits. Previously only versions of depth-2 arithmetic
circuits were known to be learnable in polynomial time. Our learning
algorithms can be viewed as solving a generalization of the well known
\emph{polynomial interpolation problem} where the unknown polynomial
has a succinct representation. We can learn representations of
polynomials encoding \emph{exponentially} many monomials.  Our
techniques combine a careful algebraic analysis of the partial
derivatives of arithmetic circuits with ``multiplicity automata"
learning algorithms due to Bergadano et al.\ (1997) and
Beimel et al.\ (2000).
\end{abstract}

\tocams{68Q32}
\tocacm{I.2.6, F.2.2}

\end{frontmatter}

\section{Introduction}

Let $p$ be an unknown multivariate polynomial over a fixed field.
Given random input/output pairs chosen from some distribution $D$, can
a computationally bounded learner output a hypothesis which will
correctly approximate $p$ with respect to future random examples
chosen from $D$?  This problem, known as the multivariate polynomial
learning problem, continues to be a fundamental area of research in
computational learning theory.  If the learner is allowed to query the
unknown polynomial at points of his choosing (instead of receiving
random examples) and is required to output the exact polynomial $p$,
then this problem is precisely the well-known polynomial interpolation
problem. Both the learning and the interpolation problem have received
a great deal of attention from the theoretical computer science
community.  In a learning context, multivariate polynomials are
expressive structures for encoding information (sometimes referred to
as the ``algebraic" analogue of DNF formulae (see \eg~\cite{BBB00}))
while polynomial interpolation has been studied in numerous contexts
and has important applications in complexity theory, among other
fields~\cite{ALMSS98, STV01}.

Previous research on this problem has focused on giving algorithms
whose running time is polynomial in the number of terms or monomials
of the unknown polynomial.  This is a natural way to measure the
complexity of learning and interpolating polynomials when the unknown
polynomial is viewed in the usual ``sum of monomials" representation.
That is to say, given that the polynomial $p = \sum_{i = 1}^{t} m_{i}$
is the sum of $t$ monomials, we may wish to output a list of these
monomials (and their respective coefficients), hence using at least
$t$ time steps simply to write down the list of coefficients.  Several
researchers have developed powerful interpolation and learning
algorithms for a variety of contexts which achieve time bounds
polynomial in all the relevant parameters, including $t$ (see for
example~\cite{BBB00,BM02,GKS94,HR99,KS01,SS96}).

\subsection{Arithmetic circuits}

In this paper we are concerned with learning succinct representations of
polynomials via alternate algebraic models of computation, most
notably \emph{arithmetic circuits}.  An arithmetic circuit
syntactically represents a multivariate polynomial in the obvious way:
a multiplication (addition) gate outputs the product (sum) of the
polynomials on its inputs.  The input wires to the circuit correspond
to the input variables of the polynomial and thus the output of the
circuit computes some polynomial of the input variables. We measure
the size of an arithmetic circuit as the number of gates. For example,
the standard ``sum of monomials" representation of a polynomial $p =
\sum_{i=1}^{t} \alpha_{i} x_{i_{1}}\cdots x_{i_{n}}$ ($\alpha_{i}$ is
an arbitrary field element) corresponds precisely to a depth-2
arithmetic circuit with a single sum gate at the root and $t$ product
gates feeding into the sum gate (each product gate multiplies some
subset of the input variables). To rephrase previous results on
learning and interpolating polynomials in terms of arithmetic
circuits, we could say that depth-2 arithmetic circuits with a sum
gate at the root are learnable in time polynomial in the size of the
circuit.

Moving beyond the standard ``sum of products" representation, we
consider the complexity of learning higher depth arithmetic
circuits. It is easy to see that there exist polynomial size
depth-3 (or even depth-2 with a product gate at the root)
arithmetic circuits capable of computing polynomials with
\emph{exponentially} many monomials.  For example, let $\{L_{i,j}\}, 1
\leq i,j \leq n$ be a family of $n^{2}$ distinct linear forms over
$n$ variables. Then $\sum_{i=1}^{n} \prod_{i=1}^{n} L_{i,j}$ is a
polynomial size depth-3 arithmetic circuit but cannot be written
as a sum of polynomially many monomials. Although arithmetic
circuits have received a great deal of attention in complexity
theory and, more recently, derandomization, the best known result
for learning arithmetic circuits in a representation other than
the depth-2 sum of products representation is due to Beimel et
al.~\cite{BBB00} who show that depth-2 arithmetic circuits with a
product gate of fan in $O(\log n)$ at the root and addition
gates\footnote{Beimel et al.\ actually allow addition gates to sum
powers of the input variables, rather than just summing
variables.} of unbounded fan-in in the bottom level are learnable
in polynomial time, and that circuits that compute polynomials of
the form $\sum_{i}\prod_{j}p_{i,j}(x_j)$ ($p_{i,j}$ is a
univariate polynomial of polynomial degree) can be learned in
polynomial time.\footnote{The latter class of circuit can be viewed
as a restricted version of depth-3 circuits where the addition
gates at the bottom can only sum powers of a certain variable.}

\subsection{Our results}

We learn various models of algebraic computation capable of
encoding exponentially many monomials in their input size. Our
algorithms work with respect to any distribution and require
membership query access to the concept being learned.  More
specifically we show that any class of polynomial size arithmetic
circuits whose partial derivatives induce a vector space of
dimension polynomial in the circuit size is learnable in
polynomial time. This characterization generalizes the work of
Beimel et al.~\cite{BBB00} and yields the following results:

\begin{itemize}

\item An algorithm for learning general depth-3 arithmetic circuits
with $m$ product gates each of fan in at most $d$ in time
polynomial in $m$, $2^{d}$, and $n$, the number of variables.

\item The first polynomial time algorithm for learning polynomial
size noncommutative formulae computing polynomials over a fixed
partition of the variables (note there are no depth restrictions on
the size of the formula).

\item The first polynomial time algorithm for learning polynomial
size read once, oblivious algebraic branching programs.

\end{itemize}

As an easy consequence of our results we observe a polynomial time
algorithm for learning the class of depth-3 set-multilinear
circuits: polynomials $p = \sum_{i=1}^{m} \prod_{j=1}^{n}
L_{i,j}(X_{j})$ where each $L_{i,j}$ is a linear form and the
$X_{j}$'s are a partition of the input variables. We note that
this result also follows as an easy corollary from the work
of~\cite{BBB00}. Finally we show that, with respect to known
techniques, it is hard to learn polynomial size depth-3
homogeneous arithmetic circuits in polynomial time.\footnote{A
depth-3 circuit $p = \sum_{i=1}^{m} \prod_{j=1}^{d_i}
L_{i,j}(x_1,\ldots,x_n)$ is homogeneous if in each $L_{i,j}$ the
free term is zero.} This indicates that our algebraic techniques
give a fairly tight characterization of the learnability of
arithmetic circuits with respect to current algorithms.

\subsection{Our techniques}

We use as a starting point the work on multiplicity automata due
to Beimel et al.~\cite{BBB00}.  A multiplicity automaton is a
nondeterministic finite automaton where each transition edge has
weight from the underlying field (for a precise definition see
\expref{Section}{sec:def}). On input $x$, $f(x)$ is equal to the sum,
over all accepting paths of the automaton on input $x$, of the
product of the edge weights on that accepting path. In~\cite{BergadanoV96,BBB00}, the authors, building on work due to~\cite{OSK94}, show that multiplicity automata can be learned in
polynomial time and that these multiplicity automata can compute
polynomials in their standard sum of products representation
(actually, as mentioned earlier, they can learn any polynomial $p$
of the form $p = \sum_{i=1}^{n} \prod_{j=1}^{m} p_{ij}(x_j)$ where
$p_{ij}(x_{j})$ is a univariate polynomial of polynomial degree).
Their analysis centers on the Hankel matrix of a multiplicity
automaton (see \expref{Section}{sec:def} for a definition).

We give a new characterization of learnability in terms of partial
derivatives. In particular we show that any polynomial whose
partial derivatives induce a low dimensional vector space has a
low rank Hankel matrix.  We conclude that any arithmetic circuit
or branching program whose partial derivatives form a low
dimensional vector space can be computed by polynomial size
multiplicity automaton and are amenable to the learning algorithms
developed in~\cite{BergadanoV96,BBB00}. As such, we output a
multiplicity automaton as our learner's hypothesis.

Our next task is to show which circuit classes have partial
derivatives that induce low dimensional vector spaces.  At this
point we build on work due to Nisan~\cite{Nis91} and Nisan and
Wigderson~\cite{NW97} (see also~\cite{SW01}) who analyzed the
partial derivatives of certain arithmetic circuit classes in the
context of proving lower bounds and show that a large class of
algebraic models have ``well-behaved" partial derivatives. For
example we show that the dimension of the vector space of partial
derivatives induced by a set-multilinear depth-3 arithmetic
circuit is polynomial in the size of the circuit.

Our results suggest that partial derivatives are a powerful tool for
learning multivariate polynomials, as we are able to generalize all
previous work in this area and give new results for learning
interesting algebraic models.  Additionally, we can show there are
depth-3 polynomial-size, homogeneous, arithmetic circuits whose
partial derivatives induce a vector space of superpolynomial
dimension. We feel this motivates the problem of learning depth-3
homogeneous, polynomial-size arithmetic circuits, as such a result
would require significantly new techniques.  We are hopeful that our
characterizations involving partial derivatives will further inspire
complexity theorists to use their techniques for developing learning
algorithms.

\subsection{The relationship to lower bounds}

In the case of learning Boolean functions, the ability to prove
lower bounds against a class of Boolean circuits usually coincides
with the ability to give strong learning algorithms for those
circuits.  For example the well known lower bounds of
H{\aa}stad~\cite{H86} against constant depth Boolean circuits are used
heavily in the learning algorithm due to Linial, Mansour, and
Nisan~\cite{LMN93}.  Jackson et al.~\cite{JKS02} have shown that
constant depth circuits with a majority gate, one of the strongest
circuit classes for which we can prove lower bounds
(see~\cite{ABFR91}), also admit nontrivial learning algorithms.
Furthermore Jackson et al.~\cite{JKS02} show that we will not be
able to learn more complicated Boolean circuits unless certain
cryptographic assumptions are false.

Our work furthers this relationship in the algebraic setting.  The
models of algebraic computation we can learn correspond to a large
subset of the models of algebraic computation for which strong
lower bounds are known. For example Nisan~\cite{Nis91} gives
exponential lower bounds for noncommutative formulae. Nisan and
Wigderson~\cite{NW97} prove exponential lower bounds for depth-3
set-multilinear circuits. Moreover, in both papers the authors
prove lower bounds by considering the partial derivatives spanned
by the circuit and the function computed by it, a method similar
to ours. Over finite fields there are exponential lower bounds for
depth 3 circuits~\cite{GK98,GR98}, however no exponential lower
bounds are known for general depth-3 arithmetic circuits over
infinite fields (see~\cite{SW01}). As in the Boolean case, we
exploit many of the insights from the lower bound literature to
prove the correctness of our learning algorithms.

A preliminary version of this paper appeared in COLT 2003~\cite{KlivansS03}.

\subsection{Organization}

In \expref{Section}{sec:def} we review relevant learning results for
multiplicity automata as well as state some basic facts from algebraic
complexity.  In \expref{Section}{sec:deriv} we prove our main
theorem, characterizing the learnability of arithmetic circuits via
their partial derivatives.  In Sections~\ref{sec:multi}, \ref{sec:roab},
and~\ref{sec:non} we state our main learning results for various arithmetic
circuits and algebraic branching programs.

\section{Preliminaries} \label{sec:def}

We denote with $\mathbb{F}$ the underlying field, and with
$\mathrm{char}(\mathbb{F})$ the characteristic of $\mathbb{F}$.
When studying a polynomial $f$ we either assume that
$\mathrm{char}(\mathbb{F})=0$ or that the degree of each variable
in $f$ is smaller than $\mathrm{char}(\mathbb{F})$.

\subsection{The learning model}

We will work in the model of exact learning from membership and
equivalence queries, first introduced by Angluin~\cite{A87}. In
this model a learner begins with some candidate hypothesis $h$ for
an unknown concept $f$ and is allowed access to both a
\emph{membership query} oracle and an \emph{equivalence query} oracle.
The membership query oracle takes as input $x$ and outputs $f(x)$.
The equivalence query oracle takes as input the learner's current
hypothesis $h$ and outputs a counterexample, namely an input $y$
such that $h(y) \neq f(y)$. We assume that making a membership or
an equivalence query of length $k$ takes time $k$. If no such
counterexample exists then we say that the learner has exactly
learned $f$.  We say that a concept $f$ is exactly learnable in
time $t$ if there exists an exact learner for $f$ whose running
time is bounded by $t$.  A concept class is considered to be
exactly learnable in polynomial time if for every $f$ in the
concept class there exists an exact learner for $f$ running in
time polynomial in the size of the smallest description of $f$.
Known transformations imply that if a concept class is exactly
learnable in polynomial time then it is also learnable in
Valiant's PAC model in polynomial time with membership queries.

\subsection{Multiplicity automata}

A multiplicity automaton is a nondeterministic automaton where
each transition edge is assigned a weight, and the output of the
automaton for any input $x$ is the sum over all accepting paths of
$x$ of the products of the weights on each path.

\begin{definition}
Let $\Sigma$ be an alphabet. A multiplicity automaton $A$ of size
$r$ over $\Sigma$ consists of a vector $\bar{\gamma} \in
\mathbb{F}^r$ (\ie\ $\bar{\gamma} =
(\gamma_{1},\ldots,\gamma_{r})$) and a set of matrices
$\{\mu_{\sigma}\}_{\sigma \in \Sigma}$, where each $\mu_{\sigma}$
is an $r \times r$ matrix over $\mathbb{F}$. The output of $A$ on
input $x = (x_{1},\ldots,x_{n})\in \Sigma^n$ is defined to be the
inner product of $(\prod_{i=1}^{n} \mu_{x_{i}})_{1}$ and
$\bar{\gamma}$ where $(\prod_{i=1}^{n} \mu_{x_{i}})_{1}$ equals
the first row of the matrix\footnote{We denote
$\mu_{x_n}\cdot\mu_{x_{n-1}}\cdot \ldots \cdot \mu_{x_1}$ with
$\prod_{i=1}^{n} \mu_{x_{i}}$.} $\prod_{i=1}^{n} \mu_{x_{i}}$. In
other words the output is the first coordinate of the vector
$\left( \prod_{i=1}^{n} \mu_{x_{i}} \right) \cdot \bar{\gamma}$.
\end{definition}

Intuitively each matrix $\mu_{\sigma}$ corresponds to the transition
matrix of the automaton for symbol $\sigma \in \Sigma$.  Iterative
matrix multiplication keeps track of the weighted sum of paths from
state $i$ to state $j$ for all $i,j \leq r$. The first row of the
iterated product corresponds to transition values starting from the
initial state and $\bar{\gamma}$ determines the acceptance
criteria.

Next we define the Hankel matix of a function:

\begin{definition}
Let $\Sigma$ be an alphabet and $f : |\Sigma|^{n} \to \mathbb{F}$.
Fix an ordering of all strings in $\Sigma^{\leq n}$. We construct
a matrix $H$ whose rows and columns are indexed by strings in
$\Sigma^{\leq n}$ in the following way. For $x \in \Sigma^{d}$ and
$y \in \Sigma^{n-d}$, for some $0 \leq d \leq n$, let the $(x,y)$
entry of $H$ be equal to $f(x \circ y)$. For any other pair of
strings $(x,y)$ such that $|x| + |y| \neq n$ let $H_{a,b} = 0$.
The resulting matrix $H$ is called the Hankel matrix of $f$ for
strings of length $n$. We define $H_{k}$ to be the $k$-th
``block'' of H, \ie\ $H_{k}$ is the submatrix defined by all rows
of $H$ indexed by strings of length exactly $k$ and all columns of
$H$ indexed by strings of length exactly $n-k$.
\end{definition}

The following key fact relates the rank of the Hankel matrix of a
function for strings of length $n$ with the size of multiplicity
automaton computing $f$ on inputs of length $n$:

\begin{theorem}[\cite{CP71,F74,BBB00}]
Let $f: \Sigma^{n} \to \mathbb{F}$. Then the rank of the Hankel
matrix of $f$ (over $\mathbb{F}$) is equal to the size of the
smallest multiplicity automaton computing $f$ on inputs of length
$n$.
\end{theorem}

Previous learning results have computed the rank of the Hankel
matrices of particular polynomials yielding a bound on the size of
their multiplicity automata. In fact, Beimel et al.~\cite{BBB00},
improving on~\cite{OSK94}, learn functions computed by
multiplicity automata by iteratively learning their corresponding
Hankel matrices:

\begin{theorem}[\cite{BBB00}] \label{thm:learn}
For $f: \Sigma^{n} \to \mathbb{F}$, let $r$ be the rank of the
Hankel matrix of $f$ for strings of length $n$.  Then there exists
an exact learning algorithm for $f$ running in time polynomial in
$n$, $r$, and $|\Sigma|$. Furthermore the final hypothesis output by
the learning algorithm is a multiplicity automaton of size $r$
over alphabet $\Sigma$. Moreover, if for every variable $x_i$ the
degree of $f$ as a polynomial in $x_i$ (over $\mathbb{F}$) is at
most $d$, then the running time of the learning algorithm is
polynomial in $n,r$ and $d$.
\end{theorem}

Our main technical contribution is to show that the rank of a
function's Hankel matrix is bounded by (and in most cases equal to)
the dimension of the vector space of a function's partial derivatives.
Thus we reduce the problem of learning a polynomial to bounding the
dimension of the vector space of its partial derivatives.

\subsection{Set-multilinear polynomials}

In this paper we will work primarily with polynomials that respect a
fixed partition of the input variables:

\begin{definition}
Let $X = \dot\bigcup_{i=1}^{d} X_{i}$ be a partition of the variables
into $d$ sets. A polynomial over the variables $X$ is called {\rm
set-multilinear} if every monomial $m$ is of the form $y_{1} \cdot
y_{2} \cdots y_{d}$ where each $y_{i}$ is some variable from
$X_{i}$. Thus, any set-multilinear $f$ is also homogeneous and
multilinear of degree $d$.
\end{definition}

We will sometime use the notation $f(X_1,\ldots,X_d)$ to denote
that $f$ is set-multilinear with respect to the partition $X = \dot
\bigcup_{i=1}^{d}X_i$.

\begin{example}\label{ex:det}
Let $X = (x_{i,j})_{1 \leq i,j \leq d}$ be a $d \times d$ matrix.
Let $X_i = \{x_{i,1},\ldots,x_{i,d}\}$ be the $i$-th row of $X$.
Clearly $X = \dot\bigcup_{i=1}^d X_i$. Note that both the determinant
and the permanent of $X$ are set-multilinear polynomials with
respect to this partition.
\end{example}

Another example is the class of depth-3 set-multilinear circuits,
first defined by Nisan and Wigderson~\cite{NW97},
that computes only set-multilinear polynomials.\footnote{In the
original paper~\cite{NW97} these
circuits are called multilinear circuits, but in recent
works~\cite{Raz04a,Raz04b,RazShpilka} they are referred to as
set-multilinear circuits.} To see this note
that any polynomial computed by a depth-3 set-multilinear circuits
is of the form $p = \sum_{i=1}^{n} \prod_{j=1}^{m} L_{i,j}(X_{j})$
where each $L_{i,j}$ is a linear form and the $X_{j}$'s are a
partition of the input variables. In later sections we will show
that certain algebraic branching programs also compute
set-multilinear
polynomials and will therefore be amenable to our learning techniques.

Another notation that we use is the following:

\begin{definition}
Let $X = \dot \bigcup_{i=1}^{d} X_i$. For any $1\leq k \leq d$ define
\[
\mathbb{SM}[X_1,...,X_k] = \{ \; M \mid
M = \prod_{i=1}^{k}x_{i} \;,  x_{i} \in X_i \;\}\enspace.
\]
\end{definition}

Thus $\mathbb{SM}[X_1,...,X_k]$ is the set of all set-multilinear
monomials of degree $k$.

\subsection{Partial derivatives}

In this subsection we introduce some notation for computing partial derivatives.

\begin{definition}\sloppy
Let $\mathbb{M}[x_1,\ldots,x_n]$ be the set of monomials in the
variables $x_1,\ldots,x_n$. Let $\mathbb{M}_d[x_1,\ldots,x_n]$ be
the set of monomials of degree at most $d$ in $x_1,\ldots,x_n$.
\end{definition}

\begin{example}
$\mathbb{M}_2[x_1,x_2] = \{1,x_1,x_2,x_1^2,x_1\cdot x_2,x_2^2\}$.
\end{example}


\begin{definition}
Let $d =\sum_{i=1}^{k}d_i$. For a function $f(x_1,...,x_n)$ and a
monomial $M=\prod_{i=1}^{k}{x_i}^{d_i}$ let
\[
\frac{\partial f}{\partial M} =
\frac{1}{M!} \cdot \frac{{\partial}^d f}{\prod_{i=1}^{k}{(\partial
x_i)}^{d_i}}\enspace,
\]
where  $M! = \prod_{i=1}^{k}({d_i}!)$.
\end{definition}

Recall that in case that $\mathbb{F}$ is finite we only consider
polynomials in which the degree of each variable is smaller than
the characteristic of $\mathbb{F}$. In particular we will only
consider partial derivatives with respect to monomials in which
each variable has degree smaller than $\mathrm{char}(\mathbb{F})$.

\begin{example}
Let $f(x_1,x_2,x_3)=x_1^2 x_2 +x_3$ and $M(x_1,x_2,x_3)=x_1 x_2$.
We have that
$$\frac{\partial  f}{\partial M} = 2x_1, \; M! =1\enspace.$$
\end{example}


\begin{definition}
\label{def:regular_derivative} For a function
$f(x_1,\ldots,x_{n})$ and $k \leq n$ let
$$\partial_{k}(f) = \left\{ \; \frac{\partial f}{\partial M} \; \vrule \;
{\rm monomials} \; M \in \mathbb{M}[x_1,...,x_k] \; \right\}\enspace.$$
Also
define
$$\mathrm{rank}_{k}(f) = \dim \left( \mathrm{span} \left(
    \partial_{k}(f) \right) \right)\enspace.$$
\end{definition}

Note that in $\partial_k(f)$ we only consider partial derivatives
with respect to the first $k$ variables.

\begin{example}\label{ex:det-not-sm}
Let $X = (x_{i,j})_{1 \leq i,j \leq 3}$ be a $3 \times 3$ matrix.
Let $f(X) = Det(X)$ (the determinant of $X$). Consider the
following order of the variables $x_{i,j} < x_{i',j'}$ if $i < i'$
or $i=i'$ and $j<j'$. Then $$\partial_6(X) =
\{x_{2,2}x_{3,3}-x_{2,3}x_{3,2},\;
x_{2,1}x_{3,3}-x_{2,3}x_{3,1},\;x_{2,1}x_{3,2}-x_{2,2}x_{3,1},\;
x_{3,1},\;x_{3,2},\;x_{3,3}\}\enspace.$$
Thus, $\mathrm{rank}_6(f)=6$.
\end{example}

For set-multilinear polynomials we need a slightly different
definition (although we use the same notations).

\begin{definition}
\label{def:set_regular_derivative} Let $X = \dot
\bigcup_{i=1}^{d}X_i$. For a set-multilinear polynomial
$f(X_1,\ldots,X_{d})$ and $k \leq d$ let
\[
\partial_{k}(f) = \left\{ \; \frac{\partial f}{\partial M} \; \vrule \;
{\rm monomials} \; M \in \mathbb{SM}[X_1,...,X_i] \; \mathrm{for}
\; 1 \leq i \leq k \; \right\}\enspace.
\]
We define $\text{\em s-rank}_{k}(f) =
\dim \left( \mathrm{span} \left(
    \partial_{k}(f) \right) \right)$.
\end{definition}


Note that in particular we only consider partial derivatives with
respect to monomials of the form $\prod_{i=1}^{k'}x_i$ where $x_i
\in X_i$ and $k' \leq k$. We will never consider partial
derivative with respect to the monomial $x_1 \cdot x_3$ (again,
$x_i \in X_i$).


\begin{example}\sloppy
Let $X$ be a $3 \times 3$ matrix (as in \expref{Example}{ex:det} with
$d=3$). Let $f=Det(X) = Det(X_1,X_2,X_3)$ be the determinant of
$X$ where where $X_{1} = \{x_{1,1},x_{1,2},x_{1,3}\}, X_{2} =
\{x_{2,1},x_{2,2},x_{2,3}\}, $ and $ X_{3} =
\{x_{3,1},x_{3,2},x_{3,3}\}$. Then $\partial_2(f) =
\{x_{3,1},x_{3,2},x_{3,3}\}$. Thus, $\text{\em s-rank}_2(f)=3$.
\end{example}

    Note the difference from \expref{Example}{ex:det-not-sm} where we
ignored the fact that the determinant is a set-multilinear polynomial.

\section{Characterizing learnability via partial derivatives} \label{sec:deriv}

In this section, we present our main criterion for establishing the
learnability of both arithmetic circuits and algebraic branching
programs.  We prove that any polynomial whose partial derivatives form
a low degree vector space induce low rank Hankel matrices. To relate
the rank of the Hankel matrix of $C$ to its partial derivatives we
will need the following multivariate version of Taylor's theorem:

\begin{fact} \label{fac:taylor}
Let $X = \{x_1,\ldots,x_n\}$ and let $f(X)$ be a degree $d$
polynomial. Let $\rho = (\rho_1 \circ \rho_2)$ be an assignment to
the variables, where $\rho_1$ is an assignment to the first $k$
variables and $\rho_2$ as assignment to the last $n-k$ variables.
For a monomial $M$ define $M(\rho)$ be value of $M$ on assignment
$\rho$. Then
\[
f(\rho) = f(\rho_1 \circ \rho_2) =
\sum_{M \in \mathbb{M}_d[x_1,\ldots,x_k]} M(\rho_1) \cdot
\frac{\partial f}{\partial M}(\vec{0} \circ \rho_2)\enspace.
\]
\end{fact}

\begin{proof}
Because of the linearity of the partial derivative operator it is
enough to prove the claim for the case that $f$ is a monomial. Let
$f(x_1,\ldots,x_n)=\prod_{i=1}^n {x_i}^{d_i}$, where $\sum d_i
\leq d$. Consider a monomial $M \in \mathbb{M}_d[x_1,\ldots,x_k]$
given by $M=\prod_{i=1}^{k}{x_i}^{e_i}$, where $\sum e_i \leq d$.
Notice that if there is some $1 \leq i \leq k$ with $e_i > d_i$
then $\frac{\partial f}{\partial M}=0$. Also notice that if for
some $1 \leq i \leq k$ we have that $e_i < d_i$ then
$\frac{\partial f}{\partial M}(\vec{0} \circ \rho_2)=0$ because
the assignment to $x_i$ is zero. In particular the only
contribution to the sum will come from the partial derivative
with respect to $M_0=\prod_{i=1}^{k}{x_i}^{d_i}$ that gives $\frac{\partial
f}{\partial M_0} = \prod_{i=k+1}^n {x_i}^{d_i}$. In particular
\[
\sum_{M \in \mathbb{M}_d[x_1,\ldots,x_k]} M(\rho_1) \cdot
\frac{\partial f}{\partial M}(\vec{0} \circ \rho_2) =
M_0(\rho_1)\cdot \frac{\partial f}{\partial M_0}(\vec{0} \circ
\rho_2)=\prod_{i=1}^{k}{x_i}^{d_i}(\rho_1) \cdot \prod_{i=k+1}^n
{x_i}^{d_i}(\rho_2) = f(\rho)\enspace.
\]
\end{proof}

\noindent Now we can state the main technical theorem of the paper:

\begin{theorem} \label{thm:main}
Let $f(x_1,\ldots,x_n)$ be a degree $d$ polynomial. Then for every
$k \leq n$,
\[
\dim(H_{k}(f)) \leq \rk_{k}(f)\enspace.
\]
If $f$ is multilinear then
equality holds.
\end{theorem}


\begin{proof}
We will define two matrices $V_{d,k}$ and $E_k$ such that
$\rk(E_k)
\leq \rk_{k}(f)$ and $H_{k} = V_{d,k} \cdot E_k$.\\

\noindent
\textbf{Construction of $E_k$} (Evaluation Matrix): We index
the rows of $E_k$ by the set of monomials
$\mathbb{M}_d[x_1,...,x_k]$ (in lexicographical order) and the
columns by elements of $\mathbb{F}^{n-k}$ (in lexicographical
order). The $(M,\rho)$ entry of $E_{k}$ is equal to
\[
\left(E_k \right)_{M,\rho} = \frac{\partial{f}}{\partial
  M}(\vec{0}\circ \rho)\enspace,
\]
where $\vec{0}$ is a length $k$ vector and
$\rho$ is in $\mathbb{F}^{n-k}$.  This is equal to the value of the
partial derivative of $f$ with respect to $M$ at the point
$\vec{0}\circ \rho$. When $k=0$, the matrix has only one row
(the partial derivative of order zero is the polynomial itself), in which
the $\rho$th position is equal to $f(\rho)$. The following is a
standard fact from linear algebra:
\begin{claim}  \label{fac:linalg}
$\rk(E_k) \leq \rk_{k}(f)$, and equality holds if $f$ is
multilinear.
\end{claim}

\begin{proof}
Row $M$ of $E_k$ is the evaluation of
$\frac{\partial{f}}{\partial{M}}$ on all inputs of the form
$\vec{0} \circ \rho$, where $\vec{0}$ is a length $k$ vector and
$\rho$ is of length $n-k$. Hence each vector corresponds to part
of the ``truth table'' of a particular partial derivative of $f$ in which
the assignment to the first $k$ variables is zero.
Clearly if a set of partial derivatives is linearly dependent then
so are the corresponding rows. Thus $\rk(E_k) \leq \rk_{k}(f)$.
When $f$ is multilinear, all of the variables in $M$ disappear
from the resulting polynomial, and we actually get that the rows of $E_k$
represent the entire truth table of the corresponding partial derivative
of $f$ and hence $\rk(E_k) = \rk_{k}(f)$.
\end{proof}

\noindent\textbf{Construction of $V_{d,k}$} (Generalized Vandermonde
Matrix): The rows of $V_{d,k}$ are indexed by elements of
$\mathbb{F}^{k}$ (in lexicographical order) and the columns are
indexed by the set of monomials $\mathbb{M}_d[x_1,...,x_k]$ (again
in lexicographical order). The $(\rho,M)$ entry of $V_{d,k}$ is
equal to $M(\rho)$. When $k=0$ the matrix contains only one
column, whose entries are equal to $1$. We note that the column
rank of $V_k$ is full (similarly to the usual Vandermonde matrix).

Consider the matrix product $V_{d,k} \cdot E_k$.  Notice
that its $(\rho_{1} \circ \rho_{2})$ entry is equal to
\[
(V_{d,k} \cdot E_k)_{\rho_1 \circ \rho_2} = \sum_{M \in
  \mathbb{M}_d[x_{1},\ldots,x_{k}]} M(\rho_1)
\frac{\partial{f}}{\partial{M}}(\vec{0} \circ \rho_2)
\]
which by \expref{Fact}{fac:taylor} equals $f(\rho_{1} \circ \rho_2)$. Thus $V_{d,k}
\cdot E_k = H_k$. In particular $\mathrm{rank}(H_k) \leq \rk(E_k) \leq
\rk_k(f)$. When $f$ is multilinear we have that, as before, $\rk(E_k)
= \rk_{k}(f)$, and as the column rank of $V_k$ is full it follows that
$\mathrm{rank}(H_k) = \rk(E_k) = \rk_k(f)$.
\end{proof}

By summing over all values of $k$ we obtain

\begin{corollary}\label{cor:main}
Let $f(x_1,\ldots,x_n)$ be a  polynomial. Then
\[
\dim (H(f)) \leq \sum_{k=0}^{n}\rk_{k}(f)\enspace.
\]
If $f$ is multilinear then
\[
\dim (H(f)) = \sum_{k=0}^{n}\rk_{k}(f)\enspace.
\]
\end{corollary}

Now we consider set-multilinear polynomials.  We must be careful here
to take into account partial derivatives with respect to monomials $M$
that are not in $\mathbb{SM}[X_1,\ldots,X_i]$ for any $i$.  Below, we
show that rows in $E_{k}$ corresponding to such $M$'s are zero.

Let $f(X_1,\ldots,X_d)$ be a set-multilinear polynomial with respect to $X
= \dot \bigcup\, X_i$. We order the variables of $X$ as follows: first
we set $X_1 < X_2 < \ldots < X_d$, then we order the variables in
each $X_i$ in some linear order. Consider the $(M,\rho)$ entry in
$E_k$. Notice that if $M \not \in
\bigcup_{i=1}^{d}\mathbb{SM}[X_1,\ldots,X_i]$ then $\frac{\partial
f}{\partial M}(\vec{0}\circ \rho)=0$. Indeed, assume that the first
$k$ variables cover the sets $X_1,\ldots,X_i$, as well as some
of the variables in the set $X_{i+1}$. Since we substitute $0$ to
the first $k$ variables we see that $M$ must contain a variable
from each $X_1,\ldots,X_i$ (as otherwise, because $f$ is
set-multilinear, the entire $M$-th row of $E_k$ is zero). We also
note that $M$ can't contain two variables from the set $X_j$ (as
again this would imply that the $M$-th row is zero). In
particular, in order for the $M$-th row to be non zero we must
have that $M \in \mathbb{SM}[X_1,\ldots,X_i]$ for that $i$. As a
corollary we get

\begin{corollary}\label{cor:s-rank}
Let $f$ be a set-multilinear polynomial with an ordering of the
variables as above. For each $1\leq k\leq n$ let $i_k$ be defined
by
\[
\Bigl| \bigcup_{i=1}^{i_k} X_i \Bigr| \leq k < \Bigl|
\bigcup_{i=1}^{i_k+1} X_i \Bigr|\enspace.
\]
Then $\mathrm{rank}(E_k) \leq$
$\text{\em s-rank}_{i_k}(f)$.
\end{corollary}

This corollary implies the following version of
\expref{Corollary}{cor:main} for set-multilinear polynomials.

\begin{theorem}\label{thm:main-sm}
Let $X = \dot \bigcup_{i=1}^{d}X_i$, with $|X|=n$. Let
$f(X_1,\ldots,X_d)$ be a set-multilinear polynomial. Then
\[
\dim (H(f)) \leq n\sum_{k=0}^{d}\text{\em{s-rank}}_{k}(f)\enspace.
\]
\end{theorem}

\begin{proof}\label{thm:s-main}
According to \expref{Theorem}{thm:main} we have that $\dim (H(f)) =
\sum_{k=0}^{n}\rk(E_k)$. By \expref{Corollary}{cor:s-rank} we get that
$\rk(E_k) \leq \mathrm{s}$-$\mathrm{rank}_{i_k}(f)$, where $i_k$
is such that
\[
\Bigl| \bigcup_{i=1}^{i_k} X_i \Bigr| \leq k < \bigl|
\bigcup_{i=1}^{i_k+1} X_i \Bigr|\enspace.
\]
In particular we get that
\[
\dim(H(f)) = \sum_{k=0}^{n}\rk(E_k) \stackrel{(*)}{\leq} \sum_{i=0}^{d}|X_{i+1}|\cdot \text{s-rank}_{i}(f)
\leq n\sum_{i=0}^{d}\text{s-rank}_i(f)\enspace,
\]
where inequality $(*)$
follows from the observation that for every $k$, such that $\bigl|
\bigcup_{j=1}^{i} X_j \bigr| \leq k < \bigl| \bigcup_{j=1}^{i+1} X_j
\bigr|$, it holds that $i_k=i$, and so there are $|X_{i+1}|$ such
$k$'s.
\end{proof}


\section{Learning depth-3 arithmetic circuits} \label{sec:multi}

In this section we learn depth-3 arithmetic circuits. The results
that we obtain also follow from the works of~\cite{BBV96,BBB00},
however we reprove them in order to demonstrate the usefulness of
our techniques. We begin by defining the model:

\begin{definition}
A depth 3 arithmetic circuit is a layered graph with 3
levels and unbounded fan-in. At the top we have either a sum gate or a
product gate. A depth 3 arithmetic circuit $\mathcal{C}$
with a sum (product) gate at the top  is called a
$\Sigma \Pi \Sigma$ ($\Pi \Sigma \Pi$) circuit and has the
following structure:
\[
\mathcal{C} = \sum_{i=1}^{m} \prod_{j=1}^{d} L_{i,j}(X)
\]
where each $L_{i,j}$ is a linear function in the input variables
$X = x_{1},\ldots,x_{n}$ and $m$ is the number of multiplication
gates. The size of the circuit is the number of gates, in this
case $O(m d)$.
\end{definition}

A $\Sigma \Pi \Sigma$ circuit is a homogeneous circuit if all the
linear forms are homogeneous linear forms (\ie\ the free term is
zero)  and all the product gates have the same fan in (or degree).
In other words every gate of the circuit computes a homogeneous
polynomial. We will also be interested in set-multilinear depth 3
circuits. To define this sub-model we need to impose a partition
on the variables:

\begin{definition}
A $\Sigma \Pi \Sigma$ circuit is called set-multilinear with
respect to $X = \dot \bigcup_{i=1}^{d}X_i$ if every linear function
computed at the bottom is a homogeneous linear form in one of the
sets $X_i$, and each multiplication gate multiplies $d$
homogeneous linear forms $L_{1},\ldots,L_{d}$ where every $L_{i}$
is over a distinct set of variables $X_{i}$.  That is to say a
depth-3 set-multilinear circuit $\mathcal{C}$ has the following
structure:
\[
\mathcal{C} = \sum_{i=1}^{m} \prod_{j=1}^{d} L_{i,j}(X_j)
\]
where $L_{i,j}$ is an homogeneous linear form.
\end{definition}

We now give an algorithm for learning set-multilinear depth-3
circuits. The algorithm is based on the following lemma that
characterizes the dimension of a set-multilinear circuit's partial
derivatives:

\begin{lemma} \label{lem:d3}
If a polynomial $f$ is computed by a set-multilinear depth 3
circuit with $m$ product gates then for every $1 \leq k \leq d$,
\[
{\text{\em  s-rank}}_k(f) \leq km\enspace.
\]
\end{lemma}

\begin{proof}
First notice that for every
product gate
\[
P = \prod_{i=1}^{d} L_i(X_i)
\]
we have $\text{s-rank}_k(P) \leq k$. Indeed, let $1\leq r \leq k$.
then for any monomial $M \in \mathbb{SM}[X_1,...,X_r]$ we have
that
\[
\frac{\partial P}{\partial M} = \alpha_M \cdot \prod_{i=r+1}^{d}
L_i(X_i)
\]
for some constant $\alpha_M$ depending on $M$ and
$P$. Thus, as we vary over all $r$ between $1$ and $k$ we obtain only
$k$ distinct partial derivatives.  The proof of the lemma now follows
from the linearity of the partial derivative operator.
\end{proof}

\noindent Applying \expref{Lemma}{lem:d3}, \expref{Theorem}{thm:main-sm},
and \expref{Theorem}{thm:learn} we obtain the following learning
result:

\begin{theorem}
Let $\mathcal{C}$ be a set-multilinear depth-3 circuit with $m$
product gates over $n$ variables with coefficients from a field
$\mathbb{F}$.  Then $\mathcal{C}$ is learnable in time polynomial
in $m$ and $n$.
\end{theorem}

\noindent We note that this result also follows immediately from
the results of~\cite{BBV96,BBB00}.

\subsection{Learning general depth 3 circuits}

We now give our learning algorithm for general depth-3 arithmetic
circuits. Unlike the algorithm in the set-multilinear case, this
algorithm runs in time exponential in the degree of the circuit
(and polynomial in the other parameters). Thus we can learn in
subexponential time any depth-3 circuit of sublinear degree. The
running time of the algorithm is determined by the following
lemma:

\begin{lemma} \label{lem:depth-3}
Let $f : \mathbb{F}^{n} \to \mathbb{F}$ be a polynomial over a
variable set $X$ of size $n$ computed by a depth-3 circuit with
$m$ product gates each of degree at most $d$. Then for every $1
\leq k \leq d$,
\[
{\rm rank}_k(f) \leq m \cdot \sum_{i=1}^{k}{d \choose i}\enspace.
\]
\end{lemma}

\begin{proof}
The proof is similar to the case of set-multilinear depth-3
circuits. Notice that for every product gate
\[
P = \prod_{i=1}^{d} L_i(X)
\]
we have ${\rm rank}_k(P) \leq \sum_{i=1}^{k}{d \choose i}$.
Indeed, for any monomial $M$ of degree $r$ we have that
\[
\frac{\partial P}{\partial M} \in
\spa \left\{ \; \prod_{i\in T} L_i(X) \; \vrule \; T \subset [d], \;
|T|=d-r \; \right\}\enspace.
\]
Since there are at most $m$ product gates
we obtain the claimed bound.
\end{proof}

\noindent Applying the above lemma with \expref{Theorem}{thm:main} and
\expref{Theorem}{thm:learn} we get the following learning result (that
was also obtained in~\cite{BBV96}):

\begin{theorem}
Let $f: \mathbb{F}^{n} \to \mathbb{F}$ be computed by a depth-3
arithmetic circuit with $m$ product gates each of fan in at most
$d$. Then $f$ is learnable in time polynomial in $n$, $2^{d}$, and
$m$.
\end{theorem}



\subsection{Discussion}

The fact that the rank of $f$ was bounded by the number of product
gates is unique to set-multilinear depth-3 circuits.  For example
consider the following depth-2 $\Pi \Sigma$ circuit:
\[
f(z,x_1,...,x_n) = \prod_{i=1}^{n}(z+x_i)\enspace.
\]
For every ordering of the variables, the dimension of the span of
the partial derivatives of $f$ (and hence the rank of the Hankel
matrix of $f$) is exponential in $n$; this follows from the
observation that the coefficient of $z^d$ is the $n-d$ symmetric
polynomial whose partial derivatives have dimension
$2^{\Omega(n-d)}$ (see \cite{SW01}). Thus it is no surprise that
Beimel et al.~\cite{BBB00} only considered depth-2 $\Pi \Sigma$
circuits where the product gate at the root has fan in at most
$O(\log n)$; fan in larger than $O(\log n)$ would correspond to
Hankel matrices of superpolynomial dimension and thus would not be
learnable by multiplicity automata techniques.

To show the limits of current learning techniques we point out that the
following homogeneous depth-3 arithmetic circuit
\[
C' = \prod_{i=1}^{n}(z+x_i) + \prod_{i=1}^{n}(v+u_i)
\]
is both irreducible and has exponentially many linearly
independent partial derivatives. As its degree is $n$ we can only
learn it in time exponential in $n$. We leave open the problem of
learning homogeneous depth-3 arithmetic circuits (as well as the
more difficult problem of learning general depth-3 arithmetic
circuits) of superlogarithmic degree.

\section{Learning classes of algebraic branching programs} \label{sec:roab}

Algebraic and Boolean branching programs have been intensely
studied by complexity theorists and have been particularly
fruitful for proving lower bounds. Considerably less is known in
the learning scenario --- Bshouty et al.~\cite{BTW96} and
Bergadano et al.~\cite{BBTV97} have shown some partial progress
for learning restricted width Boolean branching programs.  In this
section we will show how to learn any polynomial size algebraic
branching program that is both read once and oblivious. As such we
will be able to show that multiplicity automata are essentially
equivalent to read once, oblivious algebraic branching programs, a
characterization that may be of independent interest. We begin
with a general definition of algebraic branching programs:

\begin{definition}
An algebraic branching program (ABP), first defined by
Nisan~\cite{Nis91}, is a directed acyclic graph with one vertex of in-degree
zero, which is called \emph{source}, and one vertex of out-degree zero,
which is called the \emph{sink}. The vertices of the graph are
partitioned into levels numbered $0,...,d$. Edges are labeled with a
homogeneous linear form in the input variables and may only connect
vertices from level $i$ to vertices from level $i+1$.  The source is
the only vertex at level $0$ and the sink is the only vertex at the
level $d$.  Finally the size of the ABP is the number of vertices in
the graph.
\end{definition}

The polynomial that is computed by an ABP is the sum over all directed
paths from the source to the sink of the product of linear functions
that labeled the edges of the path. It is clear that an ABP with $d+1$
levels computes a homogeneous polynomial of degree $d$. \\

In this section we will show how to learn a natural restriction of an
algebraic branching program as mentioned above: the read once,
oblivious algebraic branching program or ROAB.

\begin{definition}
Let $X = \dot \bigcup _{i=1}^{d}X_i$ be a partition of the input
variables into $d$ disjoint sets. An ABP is {\rm oblivious} if for
every level $i$ only one set of variables $X_j$ appears.  A function
is a ROAB, a read once, oblivious algebraic branching program, if
it is an oblivious ABP and every set of variables $X_{j}$ appears in
at most one level.
\end{definition}

We are interested in learning ROABs with respect to the partition
$X = \dot \bigcup _{i=1}^{d}X_i$ in which the variables in $X_{i}$
appear on edges from level $i$ to level $i+1$. In this section we
measure the complexity of a polynomial in terms of its smallest
ROAB:

\begin{definition}
For a polynomial $f$ we define $B(f)$ to be the size of the
smallest $ABP$ for $f$. For a set-multilinear polynomial $f$ we
denote $OB(f)$ to be the size of the smallest ROAB for $f$.
\end{definition}

The main theorem of this section shows that for set-multilinear
polynomials, the size of its smallest ROAB is equal to the
dimension of the vector space induced by its partial derivatives:

\begin{theorem} \label{thm:mainroab}
For a set-multilinear polynomial $f(X_1,...,X_d)$ we have that
\[
\sum_{k=1}^{d} {\text{\em s-rank}}_{k}(f) = OB(f)\enspace.
\]
\end{theorem}

\noindent To prove \expref{Theorem}{thm:mainroab} we will need the
following theorem which is implicit in Nisan~\cite{Nis91}:

\begin{theorem}[\cite{Nis91}] \label{thm:nis}
Let $f(X_1,\ldots,X_d)$ be a set-multilinear polynomial. For each
$0 \leq k \leq d$ define a matrix $\mathcal{M}_k(f)$ as follows:

\begin{itemize}
\item {Each row is labeled with a monomial
$M_{1} \in \mathbb{SM}[X_{1},\ldots,X_{k}]$.}
\item {Each column is labeled with a monomial
$M_{2} \in \mathbb{SM}[X_{k+1},\ldots,X_{d}]$.}
\end{itemize}

If $k=0$ then $M_{1} = 1$ and if $k=d$ then $M_{2}=1$.
The $(M_{1},M_{2})$ entry of $\mathcal{M}_{k}(f)$ is
equal to the coefficient of the monomial $M_{1} \cdot M_{2}$ in
$f$.  We have that
\[
OB(f)=\sum_{k=0}^{d}{\rm rank}(\mathcal{M}_k(f))\enspace.
\]
\end{theorem}

\begin{proof}[Proof of \expref{Theorem}{thm:mainroab}]
\begin{sloppypar}We will show
that $\rk(\mathcal{M}_{k}(f)) = \text{s-rank}_{k}(f)$ which,
combined with \expref{Theorem}{thm:nis}, completes the proof. Consider
a row of $\mathcal{M}_k(f)$ corresponding to some monomial $M \in
\mathbb{SM}[X_{1},\ldots,X_{k}]$.  Since $f$ is a set-multilinear
polynomial it follows that $\frac{\partial f}{\partial M}$ is
equal to $\sum_{t} \alpha_{t} M_{t}$ where each $\alpha_{t}$ is an
element of the field and $M_{t} \in
\mathbb{SM}[X_{k+1},\ldots,X_{d}]$, for all $t$. Notice, however,
that row $M$ of $\mathcal{M}_{k}(f)$ is precisely equal to the row
vector $(\alpha_{1},\ldots,\alpha_{t})$. Hence row $M$ of
$\mathcal{M}_{k}(f)$ is equal to the coefficients of the partial
derivative of $f$ viewed as a set-multilinear polynomial in
$X_{k+1},\ldots,X_{d}$. It is a standard fact from linear algebra
that the dimension of a vector space spanned by a set of
polynomials is equal to the rank of the matrix of their
coefficients.
\end{sloppypar}
\end{proof}

Combining \expref{Theorem}{thm:mainroab} and
\expref{Theorem}{thm:main-sm} we see that any polynomial-size ROAB obeying the
above partition is computed by a polynomial-size multiplicity
automata. Applying the learning algorithm of Beimel et
  al.~\cite{BBB00}\ we obtain
\begin{theorem} \label{thm:learnroab}
Let $X = \dot \bigcup_{i=1}^{d}X_i$. Let $f(X_1,\ldots,X_d)$ be a
set-multilinear polynomial that is computed by a ROAB of size $m$.
Then $f$ is learnable in time polynomial in $m$, and $|X|$.
\end{theorem}

Notice that ROABs can be thought of as the arithmetic
generalization of OBDDs (Ordered Binary Decision Diagrams, which
are also known as oblivious read once branching programs), a model
for which Bergadano et al.~\cite{BBTV97} gave a learning algorithm
based on multiplicity automata.

\subsection{Equivalence of ROABs and multiplicity automata}

We can now prove that ROABs are essentially equivalent to
multiplicity automata. Since our learning algorithm outputs as a
hypothesis a multiplicity automaton, \expref{Theorem}{thm:learnroab}
implies that every ROAB of size $m$ in $n$ variables is computed
by a multiplicity automaton of size polynomial in $m$ and $n$. We
cannot show that every multiplicity automaton is computed by a
ROAB, but we can show that every multiplicity automaton is
computed by a ROAB which computes higher degree polynomials at
each edge.

\begin{definition}
Define a ROAB of degree $d$ to be a ROAB where every edge is
labelled with a polynomial of degree $d$.
\end{definition}

 \begin{lemma}
Let $f$ be any polynomial over $n$ variables computed by an
algebraic multiplicity automaton of size $r$. Assume also the the
degree of each variable in $f$ is bounded by $d$. Then $f$ can be
computed by a ROAB with $n+2$ levels of size $nr+2$ and degree
$d$.
\end{lemma}

\begin{proof}
Let $S \subseteq \Sigma$ be a subset of the alphabet of size
$d+1$. Let $f$ be computed by a multiplicity automaton $A$ of size $r$
consisting of the set of matrices $\{\mu_{\sigma}\}_{\sigma \in
  \Sigma}$ and the vector $\vec{\gamma} \in \Sigma^r$. Construct a
matrix $T$ where the $i,j$ entry of $T$ is a degree $d$ univariate
polynomial, $T_{i,j}$, interpolating the $(i,j)$ entry of
$\mu_{\sigma}$ for every $\sigma \in S$. That is, $T_{i,j}(\sigma)$ is
the $(i,j)$ entry of $\mu_{\sigma}$ (for $\sigma \in S$). Consider a
ROAB with $n+2$ levels each of size $r$ where every level $1 \leq i
\leq n$ has a copy of the $r$ states of $A$ (in particular we
enumerate the vertices in each level with $\{1,\ldots,r\}$). Connect
every vertex at level $k$, for $1 \leq k \leq n-1$, to every vertex at
level $k+1$. For the $j$-th vertex in level $k$ and the $i$-th vertex
in level $k+1$ we label edge $(j,i)$ with the polynomial in the
$(i,j)$ entry of $T$, having $x_k$ as its variable (\ie\ the label is
$T_{i,j}(x_k)$). Connect every vertex in level $n$ to the sink and
label edge $(i,\mathrm{sink})$ with the polynomial in the
$T_{1,i}(x_n)$ (recall the output of a multiplicity automata is the
inner product of $\vec{\gamma}$ with the first row of the product of
the $\mu_{\sigma}$'s). Also connect the source to every one of the $r$
vertices in the first level and label the edge to vertex $i$ with
$\gamma_i$. It is clear that this ROAB computes a polynomial of degree
at most $d$ in each variable, and that for every input from $S^n$ the
output of the ROAB agrees with $f$. Therefore, by the following
version of the Schwartz Zippel lemma~\cite{Schwar80,zippel79} we get
that this ROAB computes $f$ as required.

\begin{lemma}[\cite{Schwar80,zippel79}]
Let $f,g:\mathbb{F}^n \to \mathbb{F}$ be two $n$-variate
polynomials over $\mathbb{F}$. Assume that the degree of each
variable in $f$ and $g$ is at most $d$. Let $S \subseteq
\mathbb{F}$ be a set of size $d+1$. If for every $x \in S^n$ we
have that $f(x)=g(x)$ then $f=g$.
\end{lemma}
\end{proof}

\section{Learning noncommutative formul\ae} \label{sec:non}

In this section we show how to learn another type of arithmetic
circuits: polynomial size noncommutative formulae. A
noncommutative formula is an arithmetic formula where
multiplication does not necessarily commute; i.e.\ different
orderings of inputs to a product gate result in different outputs.
Intuitively this restriction makes it difficult for a formula to
use the power of cancellation. This may seem to be a strange
restriction, but it is very natural in the context of function
computation where an ordering is enforced on groups of variables.
For example, the product of $k$ matrices $M_{1},\ldots,M_{k}$
where matrix $M_{i}$ uses variables from a set $X_{i}$ can be
viewed as a set-multilinear noncommutative polynomial over an
ordering of the variables $X = \dot\bigcup\, X_{i}$ (changing the order
of the matrices will result in a different output). In addition,
many of the known algorithms for computing polynomials are
non-commutative by nature. For example, the well known algorithm
for the above mentioned iterated matrix multiplication can be
viewed as a non-commutative set-multilinear circuit. Similarly,
Ryser's algorithm for computing the permanent (see,
\eg~\cite{vanLintWilson}) can be viewed as a non-commutative
set-multilinear formula.

Nisan proved the first lower bounds for noncommutative formulae
in~\cite{Nis91}; here we will give the first learning algorithm
for set-multilinear polynomials computed by noncommutative
formulae. Previously only algorithms for learning read-once
arithmetic formulae were known (see
\eg~\cite{HancockH91,BshoutyHH95a,BshoutyHH95b,BB98}). We begin
with a general definition for arithmetic formulae:


\begin{definition}
An arithmetic formula is a tree whose edges are directed towards
the root. The leaves of the tree are labeled with input variables.
Every inner vertex is labeled with one of the arithmetic
operations $\{+,\times \}$. Every edge is labeled with a constant
from the field in which we are working. The size of the formula is
defined to be the number of vertices.
\end{definition}

\begin{sloppypar}
An arithmetic formula computes a polynomial in the obvious manner.
We now define non-commutative formulae. Roughly, a formula is
noncommutative if for any two input variables $x_{i}$ and $x_{j}$,
$x_{i}x_{j} - x_{j}x_{i} \neq 0$. More formally, let
$\mathbb{F}\{x_1,...,x_n\}$ be the polynomial ring over the field
$\mathbb{F}$ in the non-commuting variables $x_1,\ldots,x_n$. That
is, in $\mathbb{F}\{x_1,...,x_n\}$ the formal expressions
$x_{i_1}\cdot x_{i_2} \cdot \ldots \cdot x_{i_k}$ and
$x_{j_1}\cdot x_{j_2} \cdot \ldots \cdot x_{j_l}$ are equal if and
only if $k=l$ and $\forall m$ $i_m = j_m$ (whereas in the
commutative ring of polynomials we have that any monomial remains
the same even if we permute its variables, \eg\ $x_1\cdot x_2 =
x_2 \cdot x_1$). A non-commutative arithmetic formula is an
arithmetic formula where multiplications are done in the ring
$\mathbb{F}\{x_1,...,x_n\}$. As two polynomials in this ring do
not necessarily commute, we have to distinguish in every
multiplication gate between the left son and the right son. For a
polynomial $f$ let $F(f)$ be the size of the smallest
noncommutative formula computing $f$. When considering
non-commutative formulae we are interested in \emph{syntactic}
computations, \eg\ given the polynomial $x_1 \cdot x_2$ we want
the formula to output this exact polynomial and not the polynomial
$x_2 \cdot x_1$, even though they are semantically equal when
considering assignments from a field. In particular the formula
$(x_1 - x_2)\cdot (x_1+x_2)$ does not compute the polynomial $x_1
^2 - x_2^2$.
\end{sloppypar}

Note that every polynomial can be computed by a non-commutative
formula, and that given a non-commutative formula we can evaluate
it over a commutative domain.

In~\cite{Nis91} Nisan proved exponential lower bounds on the size of
noncommutative formula computing the permanent and the
determinant. An important ingredient of Nisan's result is the
following lemma relating noncommutative formula size to algebraic
branching program size:

\begin{lemma}[\cite{Nis91}] \label{lem:nis}
Let $f(X_1,\ldots,X_d)$ be a set-multilinear polynomial. Then
\[
B(f) \leq d(F(f)+1)\enspace.
\]
\end{lemma}

\sloppy{Using this we can give the following relationship between
noncommutative formulae and ROABs:}

\begin{theorem} \label{thm:nc}
Let $f(X_1,\ldots,X_d)$ be a set-multilinear polynomial computed
by a noncommutative formula of size $m$, then $f$ is computed by a
ROAB of size $d(m+1)$.
\end{theorem}

\begin{proof}
Applying \expref{Lemma}{lem:nis} we see that $f$ is computed by an
algebraic branching program $B$ of size $d(m+1)$.  We will show
that $B$ is also computed by a ROAB of size $d(m+1)$, by
constructing a ROAB with $d+1$ levels, in which the variables in
$X_k$ label the edges that go from level $k-1$ to level $k$.

Consider the set of edges in $B$ from level $i-1$ to $i$. Assume
that two different sets of variables appear from level $i-1$ to
level $i$ say $X_{i}$ and $X_{j}$. Then the output of $B$ will
contain a monomial of the form and $Yx_{j}Z$ where $x_{j} \in
X_{j}$, $Y$ is a set of variables appearing in levels less than
$i-1$ in $B$, and $Z$ is the set of variables appearing in levels
greater than $i$ in $B$. Note however that $f$ is a
set-multilinear polynomial, in particular in each monomial of $f$
the variables from $X_j$ appear as the $j$-th multiplicand. In
particular no monomial of the form $Y x_j Z$ appear in $f$. Thus,
the coefficient of any monomial $Yx_{j}Z$ must be zero. As such,
we can substitute the constant $0$ for all of the variables
$X_{j}$ appearing on these edges  and obtain an oblivious
branching program $B'$ computing the same polynomial as $B$. $B'$
can be made read-once in a similar fashion. At the end we get a
ROAB with $d+1$ levels in which the variables from $X_i$ label the
edges from level $i-1$ to level $i$.
\end{proof}

Combining \expref{Theorem}{thm:nc} with \expref{Theorem}{thm:learnroab} we obtain

\begin{theorem} \label{thm:learnnc}
Let $f(X_1,\ldots,X_d)$ be a set-multilinear polynomial, over $X =
\dot \bigcup_{i=1}^{d}X_i$, that is computable by a noncommutative
formula of size $m$ with coefficients from a field $\mathbb{F}$.
Then $f$ is learnable in time polynomial in $|X|$ and $m$.
\end{theorem}




\section{Acknowledgments}
We thank Ran Raz for  many helpful discussions in all stages of
this work. We also thank Eli Ben-Sasson for important
conversations at an early stage of this research. We thank the
anonymous referees for their valuable comments, and for bringing
\cite{BBV96} to our attention.

\bibliography{a010}


\begin{tocauthors}
\begin{tocinfo}[Adam]
Adam R. Klivans  \\
Department of Computer Science \\
The University of Texas at Austin \\
Austin, TX 78712-1188 \\
klivans\tocat{}cs\tocdot{}utexas\tocdot{}edu\\
\url{http://www.cs.utexas.edu/~klivans/}
\end{tocinfo}

\begin{tocinfo}[Amir]
Amir Shpilka \\
Department of Computer Science \\
The Technion \\
Haifa, 32000 \\
Israel  \\
shpilka\tocat{}cs\tocdot{}technion\tocdot{}ac\tocdot{}il\\
\url{http://www.cs.technion.ac.il/~shpilka/}
\end{tocinfo}
\end{tocauthors}

\begin{tocaboutauthors}
\begin{tocabout}[Adam R. Klivans]
\href{http://www.cs.utexas.edu/~klivans/}{\textsc{Adam R. Klivans}} 
received his \bsc\ and \msc\ from 
\href{http://www.cmu.edu}{Carnegie-Mellon University} and 
his \phd\ from \href{http://www.mit.edu}{MIT}, where 
Dan Spielman was his advisor.  He then held an
NSF Mathematical Sciences Postdoctoral Fellowship at 
\href{http://www.harvard.edu}{Harvard} under the guidance 
of \href{http://people.deas.harvard.edu/~valiant/}{Leslie Valiant}.
After spending six months at the 
\href{http://www.tti-c.org/}{Toyota Technological Institute at Chicago} 
as a visiting professor, he became an assistant professor at 
the \href{http://www.utexas.edu/}{University of Texas at Austin} in the
\href{http://www.cs.utexas.edu/}{Department of Computer Science}.  
He is frequently confused with 
\href{http://people.cs.uchicago.edu/~kalai/}{Adam Kalai}.

\end{tocabout}
\begin{tocabout}[Amir Shpilka]
\href{http://www.cs.technion.ac.il/~shpilka/}{\textsc{Amir Shpilka}} 
was born in 1972 in Israel and obtained his \phd\
degree in Computer Science and Mathematics from the 
\href{http://www.huji.ac.il/huji/eng/}{Hebrew University in Jerusalem} 
in 2001, under the supervision of 
\href{http://www.math.ias.edu/~avi/}{Avi Wigderson}. 
As of 2005 he is a CS faculty member at the 
\href{http://www.technion.ac.il/}{Technion}. He is married to
Carmit and has two children. His research interests lie in Complexity
Theory, mainly in Arithmetic Circuit Complexity.  When not working or
enjoying his family he likes to read and play chess.
\end{tocabout}
\end{tocaboutauthors}

\end{document}
