程式碼高𠅙

2014/09/03

原來使用 Postgresql 就可以直接查詢文字或 CSV 檔案

有些時候,我們會需要處理資料內容龐大的文字檔案,我們可能想要先略微檢視資料內容,了解資料分佈,或做些過濾查詢的動作,當然把資料先匯入 Database 再進行處理是一個可行的方案,但如果資料量龐大到數十 GB,那麼光是匯入資料的時間,以及資料匯入 DB 後產生的資料澎漲就足以令人怯步。

要想維持文字或 CSV 檔案的簡便性,又想擁有資料庫查詢的方便性,可以透過 Postgresql Foreign Table for File 的功能來達成。簡單來說,Postgresql 的 Foreign Data Wrappers 能讓您包裝其他資料來源,然後透過一個集中的資料庫來操作,而這也就是 Federated Database 的觀念。



當我們在 Federated Database 下達 SQL 指令時,中央控管的 SQL 會透過 ANSI SQL 2003 的標準 SQL/MED (Management of External Data),將 SQL 指令進行拆解,交付各個外部資料來源 (或資料庫) 來執行。最後再由中央控管的資料庫,將資料匯集起來傳給使用者。詳情可參考底下兩個連結:
Postgresql 的 Foreign Data Wrappers 實際上是 Postgresql 資料庫的 Extension,其中 file_fdw 在 Postgresql 9.1 之後的版本已內建,可讓我們把外部的文字或 CSV 檔案,變成是 Postgresql 的 FOREIGN TABLE。筆者進行測試的版本為 Postgresql 9.3 版。

為了示範,請到 http://data.gov.tw/node/8792 下載 "每日各站點進出站人數.csv" 檔案。完成下載後將檔名更改為 stationflow.csv。

接下來以 pgAdmin 執行以下 SQL:
CREATE EXTENSION file_fdw;
CREATE SERVER file_fdw_server FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE stationflow (
yyyymmdd  char(8),
stationcode  integer,
stationname  varchar(5),
input   integer,
output   integer
) SERVER file_fdw_server
OPTIONS (format 'csv', header 'true', filename 'D:/testdata/stationflow.csv', delimiter ',', null '');
這裡 filename 參數用來指向我們剛剛下載、更名的 stationflow.csv,可能需要修正目錄的正確性。

注意,目前 pgAdmin 並不會把 FOREIGN TABLE 顯示在左方的樹狀節點中,因此若要查詢系統中有哪些 FOREIGN TABLE, 可以執行以下指令:

select * from information_schema.tables where table_type = 'FOREIGN TABLE'
如果在此時執行 SQL, 如:
select count(*) from stationflow;
那應該會收到 Postgresql 提示的 UTF8 編碼錯誤的訊息。這是因為檔案其實是以 Big5 編碼,我們可以使用以下指令修正 FOREIGN TABLE 的檔案編碼:
ALTER FOREIGN TABLE stationflow OPTIONS ( encoding 'Big5' ); 
完成之後,您便可以正常執行 SQL 操作:
select count(*) from stationflow;
select * from stationflow where stationname like '台東%'; 
如果你有內部資料表,甚至可以來個內外 Join, Union 或 Sub-Query 等等:
select * from stationflow where stationcode in (select id from stationlist ); 
測試完成後,可用以下指令刪除外部資料表:
drop FOREIGN table stationflow;
這篇文章僅展示了 Postgresql 查詢文字或 CSV 檔案的功能,事實上 Postgresql 可以外接的資料來源還很多,包括 SQL Database、NoSQL Database、Http、Twitter、Hadoop 等,有興趣的同學可以自行至 PostgreSQL Extension Network 下載編譯測試。

2014/06/18

Java: JSON Library

Below is a list of major JSON format processing libraries for java. Check it out.

  • JSON Processing
    • Summary
      • JSR 353: Java API for JSON Processing - Reference Implementation
    • Comment
      • the java version official implementation
  • JSON in Java (json.org)
    • Summary
      • json.org official implementation
    • Comment
      • The API requires you to load the entire document into a String. This is somewhat cumbersome to use and inefficient for large documents. But it's a DOM model so you can get a lot done with very little code. 
  • Jackson JSON Processor
    • Summary
      • High-performance JSON processor.
    • Comment
      • Jackson ones to be the fastest json processing library in the old age.
      • Has many features with modules approach.
      • Jackson has grown into a huge collection of data type processing library
  • Boon JSON
  • google-gson 
    • Summary
      • A Java library to convert JSON to Java objects and vice-versa
    • Feature
      • Provide simple toJson() and fromJson() methods to convert Java objects to JSON and vice-versa
      • Allow pre-existing unmodifiable objects to be converted to and from JSON
      • Extensive support of Java Generics
      • Allow custom representations for objects
      • Support arbitrarily complex objects (with deep inheritance hierarchies and extensive use of generic types)
  • json-simple
    • Summary
      • JSON.simple is a simple Java toolkit for JSON. You can use JSON.simple to encode or decode JSON text. 
    • Feature
      • Full compliance with JSON specification (RFC4627) and reliable (see compliance testing)
      • Provides multiple functionalities such as encode, decode/parse and escape JSON text while keeping the library lightweight
      • Flexible, simple and easy to use by reusing Map and List interfaces
      • Supports streaming output of JSON text
      • Stoppable SAX-like interface for streaming input of JSON text (learn more)
      • Heap based parser
      • High performance (see performance testing)
      • No dependency on external libraries
      • Both of the source code and the binary are JDK1.2 compatible
    • Comment
  • genson - A fast & extensible Java <> Json library
    • Summary
      • Genson is an open-source library doing conversion from Java to Json and Json to Java. Genson targets people who want an extensible but also configurable, fast, scalable and easy to use library. 
    • Feature
      • Easy to use, fast, highly configurable, lightweight and all that into a single small jar!
      • Full databinding and streaming support for efficient read/write
      • Support for polymorphic types (able to deserialize to an unknown type)
      • Does not require a default no arg constructor and really passes the values not just null, encouraging immutability. It can even be used with factory methods instead of constructors!
      • Full support for generic types
      • Easy to filter/include properties without requiring the use of annotations or mixins
      • You can apply filtering or transformation logic at runtime on any bean using the BeaView mechanism
      • Genson provides a complete implementation of JSR 353
      • Starting with Genson 0.95 JAXB annotations and types are supported!
      • Automatic support for JSON in JAX-RS implementations
      • Serialization and Deserialization of maps with complex keys
    •  Comment
      • Looks good!
 Want to use XPath with JSON? see below.
  • Commons JXPath
    • Summary
      • The org.apache.commons.jxpath package defines a simple interpreter of an expression language called XPath. JXPath applies XPath expressions to graphs of objects of all kinds: JavaBeans, Maps, Servlet contexts, DOM etc, including mixtures thereof.  
  • json-path 
    • Summary
      • Java JsonPath implementation

2014/03/27

Java Practice: Shutdown Hook

今日 Daily Java Practice 如下,請問其中 runTime.addShutdownHook 的作用。

public class Main implements Runnable {
    public void run() {
        System.out.println("Shutting down");
    }

    public static void main(String[] arg) {
        Runtime runTime = Runtime.getRuntime();
        Main hook = new Main();
        runTime.addShutdownHook(new Thread(hook));
    }
}

2014/03/24

Java Practice: Enum

今天來玩玩 enum, 以下透過一程式,說明 enum 如何把物件導向程式的三個特質封裝、繼承跟多型完美的結合在一起。
善用 enum,可以大幅減少程式碼。
以下請說明:
  • 程式輸出訊息。
  • 為何輸出此訊息。
  • S, M, L, XL ... 與 Size 的關係,是類別跟實體的關係,還是父類別跟子類別的關係?

public class SizeIterator {
    public static void main(String[] args) {
        Size[] sizes = Size.values();
        for (Size s : sizes) {
            s.showName();
            s.showType();
        }
    }
}

enum Size {
    S {
        @Override
        public void showName(){
            System.out.println("I am too small to show to others!");
        }
    }, M, L, XL, XXL, XXXL;
    public void showName() {
        System.out.println(this.name());
    }
    public void showType() {
        System.out.println(this.getClass().getName());
    }
}

2014/03/19

Java Practice: Object Serialization

今天的 Java Practice 如下, 請服用:
  1. Coper 類別的功用為何?
  2. Coper:: <T> copyInstance(T obj) 方法使用了 Java Generic 的技巧,請解釋這裡採用型別參數的好處?
  3. main 方法中印出的兩個 User 物件,內容是否相同? 若有差異,原因為何?
abstract class Copier {
    @SuppressWarnings("unchecked")
    public static <T> copyInstance(T obj) throws IOException,
            ClassNotFoundException {
        ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
        ObjectOutputStream outStream = new ObjectOutputStream(outBytes);
        outStream.writeObject(obj);
        outStream.close();
        ByteArrayInputStream inBytes = new ByteArrayInputStream(outBytes
                .toByteArray());
        ObjectInputStream inStream = new ObjectInputStream(inBytes);
        T copied = (T) inStream.readObject();
        inStream.close();
        return copied;
    }
}

class User implements Serializable {
    private static final long serialVersionUID = 1L;
    private Date date = new Date();
    private String username;

    private transient String password;

    public User(String name, String pwd) {
        username = name;
        password = pwd;
    }

    public String toString() {
        String pwd = (password == null) ? "(n/a)" : password;
        return "logon info: \n   username: " + username 
            + "\n   date: " + date
            + "\n   password: " + pwd;
    }
}

public class CopierTest {
    /**
     * @param args
     * @throws Exception
     * @throws IOException
     */
    public static void main(String[] args) throws Exception {
        User jack = new User("Jack", "magic");
        System.out.println(jack);
        User copyOfJack = Copier.copyInstance(jack);
        System.out.println(copyOfJack);
    }
}  

2014/03/18

Java Practice: Random


練習一下使用 Java Random 類別。
以下題目,請試著寫出最後 min, max 印出來的最可能數值。
public class MathRandomTest {
 public static void main(String[] args) {
    Random rand = new Random();
   
   int n = 10;
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
   
    for(int i = 0; i < 100; i ++){
      int r = rand.nextInt(n + 1);
      min = Math.min(min, r);
      max = Math.max(max, r);
    }
    System.out.printf("Min = %d, Max = %d", min, max);
 }
}  

2014/03/17

Java Practice: Generics

從今天開始,預計把發給敝公司新人的 Java Practice 題目,release 在此給大家參考。Java 的初學者可以試著練習看看,老手也可溫故知新喔!

首先是 Java Generics 之應用,請在 ________ 填入正確宣告,使程式可以正常運作:
class TwoTypeParas <T,V> {
    T ob1;
    V ob2;

    TwoTypeParas(T o1, V o2) {
        ob1 = o1;
        ob2 = o2;
    }

    void showTypes() {
        System.out.println("Type of T is " + ob1.getClass().getName());
        System.out.println("Type of V is " + ob2.getClass().getName());
    }

    T getob1() {
        return ob1;
    }

    V getob2() {
        return ob2;
    }
}

public class TwoTypeParaTest {
    public static void main(String args[]) {
        TwoTypeParas______ tgObj = new TwoTypeParas______(88, "Generics");
        tgObj.showTypes();

        int v = tgObj.getob1();
        System.out.println("value: " + v);

        String str = tgObj.getob2();
        System.out.println("value: " + str);
    }
}