Posts in the Java category

SeimiCrawler v0.1.0 released

After a long development journey, SeimiCrawler v0.1.0 has finally been released and is now synced to the Maven Central repository. The dependency:

<dependency>
    <groupId>cn.wanghaomiao</groupId>
    <artifactId>SeimiCrawler</artifactId>
    <version>0.1.0</version>
</dependency>

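Introduction

SeimiCrawler is an agile crawler development framework with support for distributed deployment. It aims to lower, as far as possible, the barrier for newcomers to build a crawler system that is highly usable and performs reasonably well, and to make crawler development more efficient. In SeimiCrawler's world, most people only need to write the business logic for scraping; Seimi takes care of the rest. Its design is heavily inspired by Scrapy, the Python crawler framework, while also drawing on the characteristics of the Java language and the features of Spring. It also hopes to make XPath, a more efficient way of parsing HTML, more convenient and more widely used in China, so SeimiCrawler's default HTML parser is JsoupXpath, and by default all HTML data extraction is done with XPath (you can, of course, choose another parser for data processing).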

Project home page

SeimiCrawler home page

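A perfect Post/Redirect/Post with HttpClient

For some special reasons, I needed HttpClient to automatically follow a 301 redirect for a POST request. HttpClient's official default implementation is org.apache.http.impl.client.DefaultRedirectStrategy, but the set of methods it will redirect is hard-coded: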

/**
 * Redirectable methods.
 */
private static final String[] REDIRECT_METHODS = new String[] {
    HttpGet.METHOD_NAME,
    HttpHead.METHOD_NAME
};

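Clearly this cannot do Post/Redirect/Post, so I kept looking for other Apache implementations of the org.apache.http.client.RedirectStrategy interface. With the IDE's help I quickly found org.apache.http.impl.client.LaxRedirectStrategy, Apache's ultimate implementation of automatic redirects. It does redirect POST, but I found that the data in the original POST request body is not passed along; it is simply lost. That is far from ideal: if the original request cannot be kept intact, the redirect is almost useless. In the end I had to implement the RedirectStrategy interface myself. It is enough to extend DefaultRedirectStrategy, override its isRedirected and getRedirect methods, and implement the key step of building the HttpUriRequest to send after the redirect, namely: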

    @Override
    public HttpUriRequest getRedirect(HttpRequest request, HttpResponse response, HttpContext context) throws ProtocolException {

    }

One look at the HttpRequest interface left me numb,
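
The excerpt stops before the body of getRedirect is shown. As a rough, library-independent sketch of the decision such an override has to make, the code below models "keep the method and the body when following a redirect". It deliberately does not use any HttpClient class: SimpleRequest and PostPreservingRedirect are hypothetical names invented for this illustration, and switching to a body-less GET on 303 is an assumption added here, not something the post states.

```java
import java.net.URI;

/** Hypothetical, library-independent stand-in for an outgoing HTTP request. */
final class SimpleRequest {
    final String method;
    final URI uri;
    final byte[] body; // null when the request has no body

    SimpleRequest(String method, URI uri, byte[] body) {
        this.method = method;
        this.uri = uri;
        this.body = body;
    }
}

/** Hypothetical helper modelling what an overridden getRedirect would need to decide. */
final class PostPreservingRedirect {

    /** Status codes this sketch treats as redirects. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Builds the follow-up request, keeping method and body except for 303. */
    static SimpleRequest follow(SimpleRequest original, int status, String location) {
        if (!isRedirect(status)) {
            throw new IllegalArgumentException("not a redirect status: " + status);
        }
        // The Location value may be relative, so resolve it against the original URI.
        URI target = original.uri.resolve(location);
        if (status == 303) {
            // Assumption beyond the post: 303 See Other is followed with a body-less GET.
            return new SimpleRequest("GET", target, null);
        }
        // Copy the body so the redirected request carries the same payload.
        byte[] bodyCopy = original.body == null ? null : original.body.clone();
        return new SimpleRequest(original.method, target, bodyCopy);
    }
}
```

For example, following a 301 with Location "/new" for a POST to http://example.com/form yields a POST to http://example.com/new with the same bytes in the body. In a real RedirectStrategy the same idea would have to be expressed with HttpClient's own request types.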


Extending paoding-rose with Bean parameter resolvers

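/**
 * A powerful 'form' -> 'Bean' parameter resolver that supports recursive binding to arbitrary depth inside a Bean.
 * Taking UserBasic as the received type, the following conventions apply:
 * 1. It only takes effect for Model objects that extend BaseObject
 * 2. Properties of basic types are received directly by property name
 * 3. Sub-properties of an object-typed property are received as '{fieldName}.{childFieldName}'
 * 4. Sub-properties of a child object inside an object-typed property are received as '{fieldName}.{childFieldName}.{childFieldName}', and so on for deeper descendants
 * 5. List-typed properties are received as '{fieldName}.{index}.{childFieldName}', where {index} starts at 0
 * 6. List is currently the only supported collection type for properties, which is basically enough
 *
 * @author 汪浩淼 [et.tw@163.com]
 * @since 14-6-3.
 */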
public class SuperBeanResolver implements ParamResolver {
    @Override
    public boolean supports(ParamMetaData metaData) {
        return metaData.getParamType().getSuperclass()!=null&&metaData.getParamType().getSuperclass().equals(BaseObject.class);
    }
    @Override
    public Object resolve(Invocation inv, ParamMetaData metaData) throws Exception {
        Class beanClass = metaData.getParamType();
        return beanDeepResolve(inv,beanClass,null);
    }

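    /**
     * Deeply traverses the object's properties and populates them from the request parameters.
     * @param inv the current invocation, used to read request parameters
     * @param beanClass the Bean class to instantiate and populate
     * @param parentPath the parameter path of the parent property, or null at the top level
     * @return the populated Bean instance
     * @throws Exception if instantiation or a setter call fails
     */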
    public Object beanDeepResolve(Invocation inv,Class beanClass,String parentPath) throws Exception {
        Object ins = beanClass.newInstance();
        Field[] fields = beanClass.getDeclaredFields();
        for (Field field:fields){
            Method setter = ReflectUtil.getSetter(beanClass,field);
            String currentFieldPath = StringUtils.isNoneBlank(parentPath)?parentPath+"."+field.getName():field.getName();
            if (field.getType().equals(List.class)) {
                List chiBeanList = new LinkedList();
                Class chiClass = null;
                ParameterizedType pt = (ParameterizedType) field.getGenericType();
                Type[] types = pt.getActualTypeArguments();
                if (types.length > 0) {
                    chiClass = (Class) types[0];
                    Field[] chiFields = chiClass.getDeclaredFields();
                    boolean goon = true;
                    int i = 0;
                    while (goon) {
                        boolean tmpflag = true;
                        Object chiIns = chiClass.newInstance();
                        for (Field chif : chiFields) {
                            // Use the field name; appending the Field itself would insert Field.toString()
                            String chiParamKey = new StringBuilder(currentFieldPath).append(".").append(i).append(".").append(chif.getName()).toString();
                            String chiParamValue = inv.getParameter(chiParamKey);
                            if (StringUtils.isNotBlank(chiParamValue)){
                                Method chiSetter = ReflectUtil.getSetter(chiClass,chif);
                                if (chiSetter!=null){
                                    chiSetter.invoke(chiIns,ReflectUtil.cast(chiParamValue,chif.getType()));
                                }
                                tmpflag = false&tmpflag;
                            }else {
                                tmpflag = true&tmpflag;
                            }
                        }
                        goon = !tmpflag;
                        if (goon){
                            chiBeanList.add(chiIns);
                        }
                        i+=1;
                    }
                    if (chiBeanList.size()>0){
                        if (setter!=null){
                            setter.invoke(ins,chiBeanList);
                        }
                    }
                }
            }else if (BaseObject.class.equals(field.getType().getSuperclass())){
                // Null-safe comparison: getSuperclass() returns null for primitives and interfaces
                if (setter!=null){
                    setter.invoke(ins,beanDeepResolve(inv,field.getType(),currentFieldPath));
                }
            }else {
                String paramValue = inv.getParameter(currentFieldPath);
                if (StringUtils.isNotBlank(paramValue)){
                    if (setter!=null){
                        setter.invoke(ins, ReflectUtil.cast(paramValue, field.getType()));
                    }
                }
            }
        }
        return ins;
    }
}
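/**
 * Receives arrays of objects from a form. Properties of the array elements are named in the form as
 * {num}.fieldName, where num is the index of the whole object and starts at 0 by convention. It only
 * takes effect for arrays of model objects that extend BaseObject.
 *
 * @author 汪浩淼 [et.tw@163.com]
 *         Date:  14-6-3.
 */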
public class SuperArrayResolver implements ParamResolver {
    @Override
    public boolean supports(ParamMetaData metaData) {
        Class paramType = metaData.getParamType();
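        // Null-safe comparison: the component type's getSuperclass() is null for primitives and interfaces
        return paramType.isArray()&&paramType.getComponentType()!=null&&BaseObject.class.equals(paramType.getComponentType().getSuperclass());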
    }

    @Override
    public Object resolve(Invocation inv, ParamMetaData metaData) throws Exception {
        Class beanClazz = metaData.getParamType().getComponentType();
        Field[] fields = beanClazz.getDeclaredFields();
        List<Object> res = new LinkedList<Object>();
        boolean goon = true;
        int i =0;
        while (goon){
            boolean tmpflag = true;
            Object ins = beanClazz.newInstance();
            for(Field f:fields){
                String paramKey = i+"."+f.getName();
                String curvalue = inv.getParameter(paramKey);
                if (StringUtils.isNotBlank(curvalue)){
                    tmpflag = false&tmpflag;
                    Method setter = ReflectUtil.getSetter(beanClazz,f);
                    if (setter!=null){
                        setter.invoke(ins,ReflectUtil.cast(curvalue,f.getType()));
                    }
                }else {
                    tmpflag = true&tmpflag;
                }
            }
            goon = !tmpflag;
            if (goon){
                res.add(ins);
            }
            i+=1;
        }
        Object realrs = Array.newInstance(beanClazz,res.size());
        System.arraycopy(res.toArray(),0,realrs,0,res.size());
        return realrs;
    }
}
public class ReflectUtil {
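    /**
     * Gets the setter method for a Bean field.
     * @param beanClazz the Bean class that declares the setter
     * @param field the field whose setter is wanted
     * @return the setter, or null if the class declares no matching method
     */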
    public static Method getSetter(Class beanClazz,Field field){
        String fieldName = field.getName();
        String methodKey = "set"+fieldName.substring(0,1).toUpperCase()+fieldName.substring(1);
        Method setter = null;
        try {
            setter = beanClazz.getDeclaredMethod(methodKey,field.getType());
        } catch (NoSuchMethodException e) {
            // No matching setter: return null and let the caller skip this field
        }
        return setter;
    }

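    /**
     * Converts a request parameter to the corresponding object type.
     * @param ori the raw parameter value
     * @param type the target type
     * @return the converted value, or null if the type is unsupported or a date cannot be parsed
     */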
    public static Object cast(String ori,Class type){
        if (type.equals(int.class)||type.equals(Integer.class)){
            return Integer.parseInt(ori);
        }else if (type.equals(long.class)||type.equals(Long.class)){
            return Long.parseLong(ori);
        }else if (type.equals(String.class)){
            return ori;
        }else if (type.equals(boolean.class)||type.equals(Boolean.class)){
            return Boolean.parseBoolean(ori);
        }else if (type.equals(Date.class)){
            try {
                return DateUtils.parseDate(ori,"yyyy-MM-dd HH:mm:ss","yyyy/MM/dd HH:mm:ss","yyyyMMddHHmmss","yyyy-MM-dd","yyyy/MM/dd","yyyy-MM");
            } catch (ParseException e) {
                return null;
            }
        }
        return null;
    }
}
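
To make the parameter naming conventions above concrete, here is a minimal, self-contained sketch that binds a flat map of form parameters onto nested beans. It is not the resolver from the post: it reads from a plain Map instead of a paoding-rose Invocation, writes fields directly instead of calling setters, and recurses into list elements. The BaseObject, Address and UserBasic classes and the FormBinder helper are hypothetical stand-ins defined only for this illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical marker base class, standing in for the post's BaseObject. */
class BaseObject {}

/** Hypothetical nested model. */
class Address extends BaseObject {
    String city;
}

/** Hypothetical top-level model with a nested object and a list of objects. */
class UserBasic extends BaseObject {
    String name;
    int age;
    Address address;
    List<Address> history;
}

final class FormBinder {

    /** Binds parameters named '{field}', '{field}.{child}' and '{field}.{index}.{child}'. */
    static <T> T bind(Map<String, String> params, Class<T> type, String prefix) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Field f : type.getDeclaredFields()) {
                f.setAccessible(true);
                String path = prefix.isEmpty() ? f.getName() : prefix + "." + f.getName();
                if (List.class.equals(f.getType())) {
                    // The element type comes from the generic declaration, e.g. List<Address>.
                    Class<?> elem = (Class<?>) ((ParameterizedType) f.getGenericType())
                            .getActualTypeArguments()[0];
                    List<Object> items = new ArrayList<>();
                    // Indexes start at 0; stop at the first index with no parameters.
                    for (int i = 0; hasKeyWithPrefix(params, path + "." + i + "."); i++) {
                        items.add(bind(params, elem, path + "." + i));
                    }
                    if (!items.isEmpty()) {
                        f.set(bean, items);
                    }
                } else if (BaseObject.class.equals(f.getType().getSuperclass())) {
                    // Nested model: recurse with the current path as the new prefix.
                    f.set(bean, bind(params, f.getType(), path));
                } else if (params.containsKey(path)) {
                    f.set(bean, convert(params.get(path), f.getType()));
                }
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot bind " + type.getName(), e);
        }
    }

    static boolean hasKeyWithPrefix(Map<String, String> params, String prefix) {
        for (String key : params.keySet()) {
            if (key.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Converts only the few simple types this sketch supports. */
    static Object convert(String raw, Class<?> t) {
        if (t == String.class) return raw;
        if (t == int.class || t == Integer.class) return Integer.parseInt(raw);
        if (t == long.class || t == Long.class) return Long.parseLong(raw);
        if (t == boolean.class || t == Boolean.class) return Boolean.parseBoolean(raw);
        throw new IllegalArgumentException("unsupported type: " + t.getName());
    }
}
```

With the parameters name, age, address.city, history.0.city and history.1.city, calling FormBinder.bind(params, UserBasic.class, "") fills the simple fields, the nested Address, and a two-element history list.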

Developing a Maven plugin

Requirements

At a minimum, the following dependencies are needed:

<dependency>
    <groupId>org.apache.maven</groupId>
    <artifactId>maven-plugin-api</artifactId>
    <version>3.0.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.maven.plugin-tools</groupId>
    <artifactId>maven-plugin-annotations</artifactId>
    <version>3.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.maven</groupId>
    <artifactId>maven-core</artifactId>
    <version>3.0.3</version>
</dependency>


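Introducing JsoupXpath, an open-source Java HTML parser with XPath support

Introduction

JsoupXpath is an HTML parser written in pure Java that parses HTML using XPath. Its XPath syntax analysis and execution are fully independent, while the HTML DOM tree is built with Jsoup, hence the name JsoupXpath. I wanted to enjoy the power and convenience of XPath in Java as well, but could not find an XPath parser that was powerful enough, so I developed JsoupXpath. Its implementation logic is clear, it is easy to extend, and it supports almost all commonly used XPath syntax, such as the following: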
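
The excerpt ends before the syntax list. As a rough illustration of what XPath-based extraction looks like in Java, the sketch below uses the JDK's built-in javax.xml.xpath engine on a small, well-formed XHTML snippet. It does not use JsoupXpath's own API, and unlike a Jsoup-backed parser it cannot handle malformed real-world HTML. The XPathDemo class and its select method are hypothetical names for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: evaluates an XPath expression against well-formed XHTML. */
final class XPathDemo {

    /** Returns the trimmed text content of every node matched by the expression. */
    static List<String> select(String xhtml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Works for element, text and attribute nodes alike.
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate: " + expression, e);
        }
    }
}
```

Given a page with two div elements of class post, each containing a link, the expression //div[@class='post']/a/@href selects the two link targets and //div[@class='post']/a/text() selects the two link texts.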
