Getting the src or the alt on its own is no problem, but how can I capture both at the same time, each in its own named group?

We have to bear in mind that alt can be to the left or right of src.

I am in a hurry, so I went with a quick solution: three named groups, one for the src and two for the text on either side of it, which I then search for the alt. I know it can be done a lot better.

private void GetFirstImage(string newHtml, out string imgstring, out string imgalt)
{
    imgalt = "";
    imgstring = "";

    string pattern = "(?<=<img(?<name1>\\s+[^>]*?)src=(?<q>['\"]))(?<url>.+?)(?=\\k<q>)(?<name2>.+?)\\s*\\>";

    try
    {
        // if there is an image
        if (Regex.IsMatch(newHtml, pattern))
        {
            Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

            imgstring = r.Match(newHtml).Result("${url}");
            string tempalt = "", tempalt2;
            tempalt = r.Match(newHtml).Result("${name1}");
            tempalt2 = r.Match(newHtml).Result("${name2}");

            // we now have the image path, plus whatever appears to its left and right inside <img>

            try
            {
                pattern = "alt=(?<q>['\"])(?<alt>.+?)(?=\\k<q>)";

                // if there is something non-empty to the left of the path
                if(!String.IsNullOrEmpty(tempalt.Trim()))
                {
                    r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

                    // if it matches the pattern used to find the alt
                    if (Regex.IsMatch(tempalt, pattern))
                    {

                        imgalt = r.Match(tempalt).Result("${alt}");

                    }
                }
                // if the alt was not found and there is something to the right
                if(String.IsNullOrEmpty(imgalt) && !String.IsNullOrEmpty(tempalt2.Trim()))
                {

                    r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

                    // if the alt pattern matches
                    if (Regex.IsMatch(tempalt2, pattern))
                    {

                        imgalt = r.Match(tempalt2).Result("${alt}");

                    }

                }

            }
            catch{ }

        }

    }
    catch{}

}
+5  A: 

Simple... don't use Regex. Use a DOM parser - so XmlDocument for xhtml or the HTML Agility Pack for (non-x)html.

Then just query root.SelectNodes("//img") and look at the "src" and "alt" attributes on each element (i.e. node.Attributes["src"].Value, etc)
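
A minimal sketch of that query using XmlDocument, assuming the input is a well-formed XML fragment with no DOCTYPE and no default xmlns (a namespaced XHTML document would need an XmlNamespaceManager for the XPath to match). The class and method names here are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Xml;

public static class ImageAttributes
{
    // Hypothetical helper: reads src and alt from the first <img> element.
    // Returns false when the markup contains no <img> at all.
    public static bool TryGetFirstImage(string xhtml, out string src, out string alt)
    {
        src = "";
        alt = "";

        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // Assumption: elements are not in a namespace, so a plain "//img" matches.
        var img = doc.SelectSingleNode("//img") as XmlElement;
        if (img == null)
        {
            return false;
        }

        // GetAttribute returns an empty string when the attribute is absent,
        // so the order of src and alt in the tag does not matter.
        src = img.GetAttribute("src");
        alt = img.GetAttribute("alt");
        return true;
    }
}
```

Because the attributes are looked up by name, it makes no difference whether alt sits to the left or right of src. For real-world (non-x)html that is not well-formed, the same idea applies with the HTML Agility Pack in place of XmlDocument.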

Regex is NOT a good tool for parsing html (since html is not a regular language).

Marc Gravell
+1 for XmlDocument and I wish another +1 for reiterating that Regex + HTML = **BAD**
Lazarus